# ⏱️ Profiling

## Measuring Quadrants kernel execution time, and checking launch latency

Add the PyTorch profiler to the code, e.g.:

```python
import torch
from torch.profiler import ProfilerActivity

schedule = torch.profiler.schedule(wait=80, warmup=3, active=1, repeat=1)
with torch.profiler.profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule,
    record_shapes=False,
    profile_memory=False,
    with_stack=True,
    with_flops=False,
) as profiler:
    # `steps` is the number of iterations of your own loop
    for _ in range(steps):
        # ... run one iteration of the code you want to profile ...
        profiler.step()

# note that this must be OUTSIDE of the context manager
profiler.export_chrome_trace("trace.json")
```

- within the code you wish to profile, call `profiler.step()` at regular intervals
- after running, open the trace in http://ui.perfetto.dev/

**Notes:**

- the PyTorch profiler can be used for both CPU and GPU, even if torch is not used anywhere else in the program
- you’ll need to call `profiler.step()` at least enough times to cover what you have put in `wait`/`warmup`/`active`
- generally you want:
    - `wait`: long enough to get past any initial steps you don’t want to look at
    - `warmup`: not sure if it needs to be non-zero, but I put 3, just in case
    - `active`: 1 is generally enough, and will reduce memory used.
      You can experiment with larger values if you wish, of course
    - `repeat`: should be 1 in general: run the sequence of steps once, then stop profiling
- see the official [PyTorch profiler schedule documentation](https://docs.pytorch.org/docs/stable/profiler.html#torch.profiler.schedule)
- for CPU code, both py-spy and the PyTorch profiler will give a hierarchical, flame-graph-style view
    - however, the `step()` ‘wait’ functionality means you’ll skip all the initialization at the start that you’re not interested in, and the ‘active’ functionality means you’ll get consistent times
    - also, the PyTorch profiler shows the actual sequence of calls, rather than a statistically sampled distribution (I think)
- for GPU code, you don’t directly get any sort of hierarchy
    - you do however get a very precise launch time and duration for each kernel
    - and you can clearly see any non-hidden kernel launch overhead, which is visible as white gaps between kernels
    - if you do want to see the GPU kernels aligned with the Python-side hierarchical view, which can help with understanding what each GPU kernel relates to, you can modify the code to call `sync()` just before each step
        - this will add some latency (e.g. 2x slower)
        - but means you can trust the alignment between the Python hierarchical view and the GPU kernel view

For example, something like:

```python
# Step the profiler after the physics step
if self.profiler is not None:
    qd.sync()  # Ensure all Quadrants GPU operations complete before profiling
    self.profiler.step()
```

## Within Quadrants kernels

The Torch profiler records the time spent in CUDA kernels, not Quadrants kernels. This is already one level deeper than what you could do with a CPU-only profiler (e.g. py-spy) + sync. But if you want to go deeper and profile code blocks inside individual GPU kernels per GPU thread (actually, per block), you can use `clock_counter` for this.

First, create an enum with the things you will want to measure, e.g.:

```python
from enum import IntEnum


class Time(IntEnum):
    LineSearch = 1
    Step2 = 2
    UpdateConstraint = 3
    HessianIncremental = 4
    UpdateGradient = 5
    StepLast = 6
```

Pass in a tensor of `qd.i64`, e.g. `times`. Then, inside the kernel, do things like:

```python
@qd.kernel
def k1(
    # ... previous args ...
    times: qd.types.NDArray[qd.i64, 2],  # indexed below as [timer, iteration]
):
    # `i_b` and `it` are assumed to be defined by the surrounding kernel code
    start = qd.clock_counter()
    linesearch()
    end = qd.clock_counter()
    if i_b == 0:
        times[Time.LineSearch, it] = end - start
    start = end

    step2()
    end = qd.clock_counter()
    if i_b == 0:
        times[Time.Step2, it] = end - start
    start = end

    update_constraint()
    end = qd.clock_counter()
    if i_b == 0:
        times[Time.UpdateConstraint, it] = end - start
    start = end
```

For an example of processing the results, see [genesis/examples/speed_benchmark/timers.py](genesis/examples/speed_benchmark/timers.py).
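
As a rough idea of what such post-processing can look like, here is a minimal sketch. It is not the code from `timers.py`: the helper `summarize_timers` and the sample numbers are hypothetical, and it assumes the timer tensor has already been copied to the host as a nested list indexed as `[timer][iteration]`. Since the recorded values are raw clock ticks, the sketch only reports each section's share of the total rather than converting to seconds.

```python
from enum import IntEnum


class Time(IntEnum):
    LineSearch = 1
    Step2 = 2
    UpdateConstraint = 3


def summarize_timers(times):
    """Return {section name: (total ticks, share of all measured ticks)}.

    `times` is a host-side nested list indexed as times[timer][iteration].
    Row 0 is unused because the enum values start at 1.
    """
    totals = {t.name: sum(times[t]) for t in Time}
    grand_total = sum(totals.values())
    return {
        name: (ticks, ticks / grand_total if grand_total else 0.0)
        for name, ticks in totals.items()
    }


# Made-up tick counts for 3 iterations, purely for illustration.
sample = [
    [0, 0, 0],        # row 0: unused
    [600, 500, 400],  # Time.LineSearch
    [100, 100, 100],  # Time.Step2
    [50, 100, 50],    # Time.UpdateConstraint
]

for name, (ticks, share) in summarize_timers(sample).items():
    print(f"{name:>16}: {ticks:6d} ticks ({share:.1%})")
```

Looking at relative shares first is usually enough to decide which block inside the kernel deserves attention.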