⏱️ Profiling#
Measuring Quadrants kernel execution time, and checking launch latency#
Add the PyTorch profiler to the code, e.g.:
import torch
from torch.profiler import ProfilerActivity

schedule = torch.profiler.schedule(
    wait=80,
    warmup=3,
    active=1,
    repeat=1,
)
with torch.profiler.profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule,
    record_shapes=False,
    profile_memory=False,
    with_stack=True,
    with_flops=False,
) as profiler:
    for _ in range(steps):  # steps: number of iterations of your own loop
        # ... run one iteration of the code you wish to profile ...
        profiler.step()
# note that this must be OUTSIDE of the context manager
profiler.export_chrome_trace("trace.json")
Within the code you wish to profile, call profiler.step() at regular times.
After running, open the trace in http://ui.perfetto.dev/
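To sanity-check how many profiler.step() calls a schedule needs, the arithmetic can be sketched in plain Python. This is only an illustration of the wait/warmup/active/repeat cycle as described in the PyTorch documentation; the helper names schedule_phase and min_steps are made up for this example and are not part of torch.

```python
def schedule_phase(step, wait, warmup, active, repeat=1):
    """Return the phase for a 0-based step index: wait, warmup, active or done."""
    cycle = wait + warmup + active
    if repeat > 0 and step // cycle >= repeat:
        return "done"  # all requested cycles have finished
    pos = step % cycle
    if pos < wait:
        return "wait"
    if pos < wait + warmup:
        return "warmup"
    return "active"


def min_steps(wait, warmup, active, repeat=1):
    """Number of profiler.step() calls needed to run through every cycle."""
    return (wait + warmup + active) * repeat


print(min_steps(wait=80, warmup=3, active=1))  # 84
print(schedule_phase(83, 80, 3, 1))            # active
```

With the schedule shown above, the single recorded step is the 84th one, so a loop that runs fewer iterations than that will most likely produce an empty or incomplete trace.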
Notes:
the pytorch profiler can be used for both CPU and GPU code, even if the program does not otherwise use torch at all
you'll need to call profiler.step() at least as many times as the total of wait + warmup + active in your schedule
generally you want:
wait: long enough to get past any initial steps you don't want to look at
warmup: not sure if it needs to be non-zero, but I put 3, just in case
active: 1 is generally enough, and will reduce memory used. You can experiment with larger values if you wish, of course
repeat: should be 1 in general: run the sequence of steps once, then stop profiling
see the official PyTorch profiler schedule documentation
for cpu code, both py-spy and the pytorch profiler will give a hierarchical, flame-graph-style view
however, the step() "wait" functionality means you'll skip all the initialization at the start that you're not interested in, and the "active" functionality means you'll get consistent times
also, the pytorch profiler shows the actual sequence of calls, rather than a statistically sampled distribution (I think)
for gpu code, you don't directly get any sort of hierarchy
you do, however, get the precise launch time and duration of each kernel
and you can clearly see any non-hidden kernel launch overhead, which is visible as white gaps between each kernel
if you do want to see the gpu kernels aligned with the python-side hierarchical view, which can help with understanding what each gpu kernel relates to, you can modify the code to call sync() just before each step
this will add some latency (e.g. make the run 2x slower)
but it means you can trust the alignment between the python hierarchical view and the gpu kernel view
For example, something like:
# Step the profiler after the physics step
if self.profiler is not None:
    qd.sync()  # Ensure all Quadrants GPU operations complete before profiling
    self.profiler.step()
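The white gaps between kernels can also be measured from the exported trace.json rather than read off by eye. The sketch below assumes the Chrome trace layout (a top-level "traceEvents" list, complete events marked "ph": "X", with "ts" and "dur" in microseconds) and assumes GPU kernels carry the category "kernel"; check a few events in your own trace first, since the category name may differ between PyTorch versions. It also treats all kernels as one timeline, so it is only meaningful when the work runs on a single stream. The function names are made up for this example.

```python
import json


def load_trace(path="trace.json"):
    """Load a Chrome-format trace exported by the profiler."""
    with open(path) as f:
        return json.load(f)


def kernel_gaps(trace, category="kernel"):
    """Return (kernel_name, gap_us) pairs: idle time before each kernel started."""
    events = [
        e for e in trace.get("traceEvents", [])
        if e.get("ph") == "X" and e.get("cat") == category
    ]
    events.sort(key=lambda e: e["ts"])
    gaps = []
    prev_end = None
    for e in events:
        if prev_end is not None:
            # Overlapping kernels count as a gap of zero
            gaps.append((e["name"], max(0, e["ts"] - prev_end)))
        end = e["ts"] + e["dur"]
        prev_end = end if prev_end is None else max(prev_end, end)
    return gaps


# Tiny synthetic trace, for illustration only
example = {"traceEvents": [
    {"ph": "X", "cat": "kernel", "name": "k1", "ts": 100, "dur": 10},
    {"ph": "X", "cat": "kernel", "name": "k2", "ts": 115, "dur": 5},
    {"ph": "X", "cat": "cpu_op", "name": "op", "ts": 0, "dur": 200},
]}
print(kernel_gaps(example))  # [('k2', 5)]
```

On a real run you would call kernel_gaps(load_trace("trace.json")) and, for instance, sum the gaps to estimate how much of the step is spent on launch overhead that is not hidden.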
Within Quadrants kernels#
Torch profiler records the time spent in CUDA kernels, not Quadrants kernels. This is already one level deeper than what you could do with a CPU-only profiler (e.g. py-spy) + sync. But if you want to go deeper and profile code blocks inside individual GPU kernels, per GPU thread (per block, actually), you can use clock_counter for this.
First, create an enum with the things you will want to measure, e.g.:
from enum import IntEnum

class Time(IntEnum):
    LineSearch = 1
    Step2 = 2
    UpdateConstraint = 3
    HessianIncremental = 4
    UpdateGradient = 5
    StepLast = 6
Pass in a tensor of qd.i64, e.g. timers. Then, inside the kernel, do things like:
@qd.kernel
def k1(... previous args, times: qd.types.NDArray[qd.i64, 2]):
    # times is indexed as [timer, iteration], hence 2 dimensions (assuming the second parameter is ndim)
    # i_b and it are defined by the elided surrounding kernel code
    start = qd.clock_counter()
    linesearch()
    end = qd.clock_counter()
    if i_b == 0:  # record from a single i_b only
        times[Time.LineSearch, it] = end - start
    start = end
    step2()
    end = qd.clock_counter()
    if i_b == 0:
        times[Time.Step2, it] = end - start
    start = end
    update_constraint()
    end = qd.clock_counter()
    if i_b == 0:
        times[Time.UpdateConstraint, it] = end - start
    start = end
For an example of processing the results, see genesis/examples/speed_benchmark/timers.py.
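As a rough idea of what such processing can look like, here is a minimal sketch. It is not the contents of timers.py; it assumes the times array has already been copied back to the host and converted to nested lists (or anything indexable as times[timer][iteration]), and it uses a shortened, hypothetical copy of the Time enum with made-up numbers. The values are raw counter ticks, so the sketch only reports each timer's share of the total rather than converting to seconds.

```python
from enum import IntEnum


class Time(IntEnum):
    # Shortened copy of the enum above, for illustration
    LineSearch = 1
    Step2 = 2
    UpdateConstraint = 3


def summarize(times):
    """Sum ticks per timer over all iterations; return {name: (ticks, share)}."""
    totals = {t.name: sum(times[t]) for t in Time}
    grand_total = sum(totals.values()) or 1  # avoid division by zero
    return {name: (ticks, ticks / grand_total) for name, ticks in totals.items()}


# Hypothetical data: rows are timers (row 0 is unused), columns are iterations
times = [
    [0, 0],
    [300, 100],  # LineSearch
    [150, 50],   # Step2
    [250, 150],  # UpdateConstraint
]
for name, (ticks, share) in summarize(times).items():
    print(f"{name}: {ticks} ticks ({share:.0%})")
```

Comparing shares rather than absolute ticks keeps the result independent of the GPU clock rate, which this sketch does not try to look up.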