โฑ๏ธ Profiling#

Measuring Quadrants kernel execution time, and checking launch latency#

Add pytorch profiler to the code, e.g.:

    schedule=torch.profiler.schedule(
        wait=80,
        warmup=3,
        active=1,
        repeat=1
    )
    with torch.profiler.profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=schedule,
        record_shapes=False,
        profile_memory=False,
        with_stack=True,
        with_flops=False,
    ) as profiler:
        for _ in range(steps):
            profiler.step()
    # note that this must be OUTSIDE of the context manager
    profiler.export_chrome_trace("trace.json")
  • within the code you wish to profile, call profiler.step() at regular times

  • after running, open the trace in http://ui.perfetto.dev/

Notes:

  • pytorch profiler can be used both for CPU and for GPU, even if torch is not used even a tiny bit in the program

  • youโ€™ll need to call profiler.step() at least enough times to match what you have put in wait/warmup/active

  • generally you want:

    • wait to be long enough to get past any initial steps you donโ€™t want to look at

    • warmup not sure if needs to be non-0, but I put 3, just in case

    • active โ‡’ 1 is generally enough, and will reduce memory used. You can experiment with larger values if you wish, of course

    • repeat should be 1 in general: run the sequence of steps once, then stop profiling

    • see official documentation PyTorch profiler schedule documentation

  • for cpu code, both pyspy and pytorch profiler will give a hierarchical flame graph style view

    • however, the step() โ€˜waitโ€™ functionality means youโ€™ll skip all the initialization stuff at the start, that youโ€™re not interested in, and the โ€˜activeโ€™ functionality means youโ€™ll get consistent times

    • also, pytorch profiler shows the actual sequence of calls, rather than the statistically sampled distribution (I think)

  • for gpu code, you donโ€™t directly get any sort of hierarchy

    • you do however have very precise duration of each kernel launch time and duration

    • and you can clearly see any non-hidden kernel launch overhead, which is visible as white gaps between each kernel

    • if you do want to see the gpu kernels aligned with the python-side hierarchical view, which can help with understanding what the gpu kernel relates to, you can modify the code to call sync(), just before each step

      • this will add some latency (e.g. 2x slower, for example)

      • but means you can trust the alignment between the python hierarchical view and the gpu kernel view

For example, something like:

# Step the profiler after the physics step
if self.profiler is not None:
    qd.sync()  # Ensure all Quadrants GPU operations complete before profiling
    self.profiler.step()

Within Quadrants kernels#

Torch profiler records the time spend in CUDA kernels, not Quadrants kernels. This is already one level deeper than what you could do with a CPU-only profiler (e.g. pyspy) + sync. But if you want to go deeper and profile code blocks inside individual GPU kernels per GPU-thread (block actually), you can use clock_counter for this.

First, create an enum with the things you will want to measure, e.g.:

from enum import IntEnum

class Time(IntEnum):
    LineSearch = 1
    Step2 = 2
    UpdateConstraint = 3
    HessianIncremental = 4
    UpdateGradient = 5
    StepLast = 6

Pass in a tensor of qd.64, e.g. timers. Then, inside the kernel, do things like:

@qd.kernel
def k1(... previous args, times: qd.types.NDArray[qd.i64, 1]:
	start = qd.clock_counter()
	linesearch()
	end = qd.clock_counter()
  if i_b == 0:
      times[Time.LineSearch, it] = end - start
  start = end

  step2()
	end = qd.clock_counter()
  if i_b == 0:
      times[Time.Step2, it] = end - start
  start = end

  update_constraint()
	end = qd.clock_counter()
  if i_b == 0:
      times[Time.UpdateConstraint, it] = end - start
  start = end

For an example of processing the results, see genesis/examples/speed_benchmark/timers.py.