Experiment: Drawing Roofline for Nemotron-3 Nano • Ball's Blog

Purpose

The purpose of this experiment is as follows:

Analyze the compute and memory overhead of the MLP, Attention, and Mamba2 layers as sequence length increases (batch size is fixed to 1 for this experiment).
Draw a roofline for the MLP, Attention, and Mamba2 layers as sequence length increases.
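As a reminder of what a roofline expresses, here is a minimal sketch of the standard model: attainable throughput is capped either by peak compute or by arithmetic intensity times peak memory bandwidth. The function names and the hardware numbers below are placeholders I chose for illustration, not measurements from this experiment.

```python
def attainable_flops(intensity, peak_flops, peak_bandwidth):
    """Roofline ceiling (FLOP/s) for a kernel.

    intensity      : arithmetic intensity in FLOP per byte moved
    peak_flops     : peak compute throughput in FLOP/s
    peak_bandwidth : peak memory bandwidth in byte/s
    """
    return min(peak_flops, intensity * peak_bandwidth)


def ridge_point(peak_flops, peak_bandwidth):
    """Arithmetic intensity at which a kernel stops being memory-bound."""
    return peak_flops / peak_bandwidth


# Placeholder hardware limits (assumed round numbers, NOT a real GPU spec).
PEAK_FLOPS = 100e12      # 100 TFLOP/s
PEAK_BANDWIDTH = 1e12    # 1 TB/s

ridge = ridge_point(PEAK_FLOPS, PEAK_BANDWIDTH)

# A kernel left of the ridge is limited by bandwidth ...
memory_bound = attainable_flops(10.0, PEAK_FLOPS, PEAK_BANDWIDTH)
# ... and a kernel right of the ridge is limited by peak compute.
compute_bound = attainable_flops(1000.0, PEAK_FLOPS, PEAK_BANDWIDTH)

print(f"ridge point: {ridge:.0f} FLOP/byte")
print(f"intensity 10   -> {memory_bound / 1e12:.0f} TFLOP/s ceiling")
print(f"intensity 1000 -> {compute_bound / 1e12:.0f} TFLOP/s ceiling")
```

Each profiled layer then becomes one point on the plot: its arithmetic intensity on the x-axis and its achieved FLOP/s on the y-axis, to be compared against this ceiling.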

Nsight Compute: NVIDIA's kernel-level profiler. It reports the detailed performance of each kernel, including metrics such as achieved occupancy, achieved FLOPS, and achieved memory bandwidth.
Nsight Systems: NVIDIA's system-level profiler. It shows the overall performance of the application, including metrics such as CPU utilization, GPU utilization, and memory usage.
vLLM: A state-of-the-art inference engine for LLMs.

Prompt Length: 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072
Batch Size: 1

Eager mode and offline inference: I ran inference on the Nemotron-3 Nano model through vLLM's offline inference API in eager mode.
Chunked prefill with a chunk size of 4096: If the prefill sequence length is at most 4096, vLLM prefills the whole sequence in one pass. If it is longer than 4096, vLLM prefills 4096 tokens at a time until the prompt is consumed, and the last chunk holds whatever remains. For example, with a prefill sequence length of 5000, vLLM first prefills 4096 tokens and then prefills the remaining 904 tokens.
Monkey patching: I monkey-patched each layer to insert an NVTX marker and to call “cudaProfilerStart” and “cudaProfilerStop” around it. This lets me separate the measurements of individual layers in Nsight Compute and Nsight Systems.
Prefill the prompt and decode 1 token: Each run prefills the prompt and then decodes 1 token, so that every layer is measured in both phases. For example, with a prompt length of 8192, vLLM processes “prefill 4096 tokens” -> “prefill 4096 tokens” -> “decode 1 token”. This lets me analyze the prefill and decode performance of each layer separately.
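The chunking and the prefill-then-decode sequence described above can be sketched in a few lines of plain Python. This only mirrors the behavior as described in this post; it is not vLLM's scheduler code, and the function names are my own.

```python
def prefill_chunks(prompt_len, chunk_size=4096):
    """Split a prompt into prefill chunks of at most chunk_size tokens."""
    if prompt_len <= 0 or chunk_size <= 0:
        raise ValueError("prompt_len and chunk_size must be positive")
    full, rest = divmod(prompt_len, chunk_size)
    # Full-size chunks first, then one shorter chunk if tokens remain.
    return [chunk_size] * full + ([rest] if rest else [])


def step_schedule(prompt_len, chunk_size=4096):
    """Forward passes for one run: every prefill chunk, then one decode step."""
    steps = [("prefill", n) for n in prefill_chunks(prompt_len, chunk_size)]
    return steps + [("decode", 1)]


# The prompt lengths used in this experiment: 2^2 through 2^17.
PROMPT_LENGTHS = [2 ** k for k in range(2, 18)]

# Number of prefill passes each prompt length needs with 4096-token chunks.
chunks_per_length = {n: len(prefill_chunks(n)) for n in PROMPT_LENGTHS}

print(prefill_chunks(5000))
print(step_schedule(8192))
print(chunks_per_length)
```

Every prompt up to 4096 tokens is a single prefill pass, so only the lengths from 8192 upward exercise more than one chunk.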
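To illustrate the monkey-patching idea, here is a rough sketch that wraps a method so a marker is opened before the call and closed after it. It is not the patch used in this experiment: `wrap_with_range` and `ToyLayer` are hypothetical names, and the push/pop hooks are plain Python callables so the example runs without a GPU. On a real run I would expect the hooks to be something like `torch.cuda.nvtx.range_push` and `torch.cuda.nvtx.range_pop`, but treat that pairing as an assumption.

```python
import functools


def wrap_with_range(cls, method_name, label, push, pop):
    """Replace cls.method_name with a version bracketed by push(label) and pop().

    Returns the original method so the patch can be undone later.
    """
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapped(self, *args, **kwargs):
        push(label)
        try:
            return original(self, *args, **kwargs)
        finally:
            # Always close the range, even if the layer raises.
            pop()

    setattr(cls, method_name, wrapped)
    return original


class ToyLayer:
    """Stand-in for a model layer (hypothetical, for illustration only)."""

    def forward(self, x):
        return x * 2


# Record marker events in a list instead of emitting real NVTX ranges.
events = []
original_forward = wrap_with_range(
    ToyLayer,
    "forward",
    "toy_layer",
    push=events.append,
    pop=lambda: events.append("pop"),
)

result = ToyLayer().forward(21)
print(result, events)

# Undo the patch once profiling is done.
ToyLayer.forward = original_forward
```

Patching at the class level means every instance of the layer is covered, and the label passed to the hook is what shows up as the range name in the profiler timeline.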

Coming soon!