Experiment: Drawing Roofline for Nemotron-3 Nano
Purpose
The purpose of this experiment is as follows:
- Analyze the compute and memory overhead of the MLP, Attention, and Mamba2 layers as sequence length increases (the batch size is fixed to 1 for this experiment).
- Draw a roofline for the MLP, Attention, and Mamba2 layers as sequence length increases.
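As a rough illustration of what a roofline expresses, the sketch below computes the arithmetic intensity of a single BF16 matrix multiplication and caps its attainable throughput at min(peak compute, intensity x peak bandwidth). The peak numbers and the layer dimensions are placeholder assumptions, not measured values and not the real Nemotron-3 Nano shapes; it also assumes each matrix is moved to or from memory exactly once.

```python
def gemm_arithmetic_intensity(m, k, n, bytes_per_elem=2):
    """FLOPs per byte for C[m, n] = A[m, k] @ B[k, n].

    Assumes every matrix crosses the memory bus exactly once
    (bytes_per_elem=2 corresponds to BF16).
    """
    flops = 2 * m * k * n
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def attainable_flops(intensity, peak_flops, peak_bandwidth):
    """Roofline bound: limited by compute or by memory traffic."""
    return min(peak_flops, intensity * peak_bandwidth)


# Placeholder hardware peaks (assumptions; replace with datasheet or measured values).
PEAK_FLOPS = 71e12        # FLOP/s
PEAK_BANDWIDTH = 936e9    # bytes/s
# Hypothetical projection shape, chosen only for illustration.
HIDDEN, FFN = 4096, 16384

ridge = PEAK_FLOPS / PEAK_BANDWIDTH  # intensity where the two limits meet
for seq_len in (1, 64, 4096):
    ai = gemm_arithmetic_intensity(seq_len, HIDDEN, FFN)
    bound = "memory" if ai < ridge else "compute"
    roof = attainable_flops(ai, PEAK_FLOPS, PEAK_BANDWIDTH)
    print(f"seq_len={seq_len:5d}  AI={ai:8.2f} FLOP/B  roof={roof:.3e} FLOP/s  ({bound}-bound)")
```

With batch size 1, the sequence length plays the role of `m`, so the intensity of a projection like this grows with the prompt length; decode (`m = 1`) sits far to the left of the ridge point.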
Tools
- Nsight Compute: NVIDIA's kernel-level profiler. It reports the detailed performance of each kernel, including metrics such as achieved occupancy, achieved FLOPS, and achieved memory bandwidth.
- Nsight Systems: NVIDIA's system-level profiler. It reports the overall performance of the application, including metrics such as CPU utilization, GPU utilization, and memory usage.
- vLLM: A state-of-the-art inference engine for LLMs.
Experiment Settings
Hardware
- NVIDIA GeForce RTX 3090 GPU
Software
- Model: “nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16”
- CUDA version: 12.8
- NVIDIA Driver version: 570.133.07
- vLLM version: 0.18.0
- Nsight Compute version: 2024.1.1.0
- Nsight Systems version: 2026.2.1.210
Profile Space
- Prompt Length: 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072
- Batch Size: 1
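The prompt lengths are the powers of two from 4 to 131072, so the sweep can be generated instead of typed out. A minimal sketch; the variable names are illustrative only:

```python
# Powers of two from 2**2 = 4 up to 2**17 = 131072 (16 points in total).
prompt_lengths = [2 ** exp for exp in range(2, 18)]
batch_size = 1

for length in prompt_lengths:
    print(f"prompt_length={length}, batch_size={batch_size}")
```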
Experiment Implementation Details
- Eager mode and offline inference: I ran inference for the Nemotron-3 Nano model through vLLM's offline inference API in eager mode.
- Chunked prefill with chunk size of 4096: I enabled chunked prefill with a chunk size of 4096 tokens. A prompt of at most 4096 tokens is prefilled in a single step at its actual length. A longer prompt is prefilled in 4096-token chunks, with the final chunk covering whatever remains. For example, for a 5000-token prompt, vLLM prefills the first 4096 tokens in one chunk and the remaining 904 tokens in a second chunk.
- Monkey patching: I used monkey patching to insert an NVTX marker around each layer, along with calls to “cudaProfilerStart” and “cudaProfilerStop”. This lets me separate the performance measurements of individual layers in Nsight Compute and Nsight Systems.
- Prefill the prompt and decode 1 token: I set up each run to prefill the prompt and then decode 1 token, so that each layer can be analyzed during prefill and during decode separately. For example, if the prompt length is 8192, vLLM proceeds as follows: “prefill 4096 tokens” -> “prefill 4096 tokens” -> “decode 1 token”.
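The step sequence that follows from the chunk size and the single decode token can be written down as a small helper. This is only a sketch of the arithmetic described in the list above, assuming one request at a time; it is not vLLM's scheduler, and `expected_steps` is a hypothetical name.

```python
def expected_steps(prompt_len, chunk_size=4096):
    """Return the (phase, num_tokens) steps expected for one request.

    Prefill proceeds in chunks of at most `chunk_size` tokens,
    followed by a single 1-token decode step.
    """
    steps = []
    remaining = prompt_len
    while remaining > 0:
        tokens = min(chunk_size, remaining)
        steps.append(("prefill", tokens))
        remaining -= tokens
    steps.append(("decode", 1))
    return steps


for length in (512, 5000, 8192):
    print(length, expected_steps(length))
```

For the two prompt lengths used as examples in the text, 5000 gives chunks of 4096 and 904 tokens, and 8192 gives two chunks of 4096 tokens, each followed by the decode step.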
Experiment Code
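The snippet below is only a minimal sketch of the monkey-patching idea, not the exact code used in this experiment. It wraps a class's `forward` method so that every call is enclosed in a named range. The helper name `patch_forward` and the `DummyMLP` class are hypothetical; the push/pop callables are passed in so the sketch runs without a GPU. With PyTorch, one would presumably pass `torch.cuda.nvtx.range_push` and `torch.cuda.nvtx.range_pop` instead of the recording functions used here.

```python
import functools


def patch_forward(layer_cls, label, range_push, range_pop):
    """Replace layer_cls.forward with a version wrapped in a named range.

    Returns the original method so the patch can be undone.
    """
    original = layer_cls.forward

    @functools.wraps(original)
    def wrapped(self, *args, **kwargs):
        range_push(label)          # open the range before the layer runs
        try:
            return original(self, *args, **kwargs)
        finally:
            range_pop()            # close the range even if forward raises

    layer_cls.forward = wrapped
    return original


class DummyMLP:
    """Stand-in for a model layer (hypothetical, for illustration only)."""

    def forward(self, x):
        return x * 2


# Record events in a list instead of emitting real NVTX ranges.
events = []
original_forward = patch_forward(
    DummyMLP, "mlp", events.append, lambda: events.append("pop")
)
result = DummyMLP().forward(3)
print(result, events)

# Undo the patch once profiling is finished.
DummyMLP.forward = original_forward
```

Patching at the class level means every instance of that layer type is annotated without touching the model's source code.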
Experiment Results
Coming soon!