DeepSeek's Third Open-Source Release: The Key to V3/R1 Training and Inference
Source: Qubits
On the third day of Open Source Week, DeepSeek unveiled the "power" behind V3/R1 training and inference.
DeepGEMM: an FP8 GEMM (general matrix multiplication) library that supports both dense and mixture-of-experts (MoE) matrix multiplications.
Let's start with a brief look at GEMM.
GEMM, which stands for General Matrix Multiplication, is a fundamental operation in linear algebra. It is a common operation in scientific computing, machine learning, deep learning, and many high-performance computing tasks.
But because its computational workload is typically large, optimizing GEMM performance is crucial.
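As a minimal illustration (plain NumPy, not DeepGEMM), a GEMM in the BLAS sense computes C ← αAB + βC:

```python
import numpy as np

# GEMM in the BLAS sense: C <- alpha * (A @ B) + beta * C
def gemm(alpha, A, B, beta, C):
    return alpha * (A @ B) + beta * C

A = np.random.rand(256, 512)
B = np.random.rand(512, 128)
C = np.zeros((256, 128))

C = gemm(1.0, A, B, 0.0, C)   # a plain matrix product when alpha=1, beta=0
```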
DeepSeek's open-sourced DeepGEMM keeps the "high performance + low cost" character of its earlier releases.
Simply put, DeepGEMM is mainly used to accelerate matrix operations in deep learning, especially large-scale model training and inference. It is particularly suited to scenarios that demand efficient use of computing resources, where it can significantly improve computing efficiency.
Many netizens were quick to embrace the release, with some likening DeepGEMM to a superhero of the math world, faster than a lightning-quick calculator and more powerful than a polynomial equation.
Others compared its release to a quantum state collapsing into a new reality, praising the cleanliness of its just-in-time compilation.
Of course... some people are starting to worry about their NVIDIA stocks...
Learn more about DeepGEMM
DeepGEMM is a library designed for clean and efficient FP8 general matrix multiplications (GEMMs) with the fine-grained scaling proposed in DeepSeek-V3.
It handles both ordinary GEMMs and grouped GEMMs for MoE models.
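To give a feel for what fine-grained scaling means here, the sketch below quantizes a matrix to FP8 with one FP32 scale per 128-element block rather than a single per-tensor scale. This is a simplified illustration; the block size and the exact scale layout DeepGEMM expects are assumptions to be checked against the repository.

```python
import torch

# Fine-grained scaling (simplified): one FP32 scale per 128-element block
# along the inner dimension, instead of a single per-tensor scale.
# 448 is the largest value representable in FP8 E4M3.
def quantize_fp8_blockwise(x: torch.Tensor, block: int = 128):
    m, k = x.shape
    xb = x.reshape(m, k // block, block)
    scales = xb.abs().amax(dim=-1, keepdim=True) / 448.0
    q = (xb / scales).to(torch.float8_e4m3fn)     # per-block FP8 quantization
    return q.reshape(m, k), scales.squeeze(-1)    # FP8 data + FP32 scales

x = torch.randn(64, 7168, device="cuda")          # float32 input
x_fp8, x_scales = quantize_fp8_blockwise(x)       # scales shape: (64, 7168 // 128)
```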
The library is written in CUDA and needs no compilation at install time: all of its kernels are compiled at runtime by a lightweight just-in-time (JIT) module.
Currently, DeepGEMM supports only NVIDIA's Hopper tensor cores.
To work around the imprecise accumulation of the FP8 tensor cores, it adopts two-level accumulation ("promotion") using the CUDA cores.
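The promotion idea can be sketched numerically: products are accumulated over short spans in low precision (standing in for FP8 tensor-core math), and each partial sum is then promoted into an FP32 accumulator, which caps the error growth. A simplified simulation, not the actual kernel logic:

```python
import torch

# Simplified simulation of two-level accumulation ("promotion"):
# accumulate short spans in low precision, then promote each partial
# sum into a high-precision FP32 accumulator on the outer level.
def dot_two_level(a: torch.Tensor, b: torch.Tensor, span: int = 128):
    acc = torch.zeros((), dtype=torch.float32)    # high-precision accumulator
    for i in range(0, a.numel(), span):
        # low-precision inner accumulation (FP16 here stands in for FP8 hardware)
        partial = (a[i:i + span].half() * b[i:i + span].half()).sum(dtype=torch.float16)
        acc += partial.float()                    # promote, then accumulate in FP32
    return acc

a, b = torch.randn(7168), torch.randn(7168)
print(dot_two_level(a, b), a.double() @ b.double())  # compare with a high-precision dot
```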
While DeepGEMM borrows some ideas from CUTLASS and CuTe, it does not rely heavily on their templates or algebraic operations.
Instead, the library is deliberately simple, with only one core kernel function and roughly 300 lines of code.
This makes it a concise and easy-to-understand resource for learning FP8 matrix multiplication and optimization techniques under the Hopper architecture.
Despite its lightweight design, the performance of DeepGEMM can match or exceed expert-tuned libraries for various matrix shapes.
So what does its performance actually look like?
The team tested all of the matrix shapes that might arise in DeepSeek-V3/R1 inference, covering both the prefill and decoding phases but without tensor parallelism, on an H800 with NVCC 12.8.
First, the performance of ordinary DeepGEMM on dense models:
According to the test results, DeepGEMM's compute performance reaches up to 1358 TFLOPS, and its memory bandwidth reaches up to 2668 GB/s.
In terms of speedup, it achieves up to 2.7x over an optimized implementation based on CUTLASS 3.6.
Next, DeepGEMM's performance on grouped GEMMs for MoE models with the contiguous layout:
And its performance with the masked layout for MoE models:
How to use?
To use DeepGEMM, you first need to satisfy several dependencies, including a Hopper-architecture GPU and suitably recent versions of CUDA, Python, PyTorch, and CUTLASS (see the repository for the exact requirements).
Per the repository README, development and installation then follow the usual Python workflow: clone the repository together with its submodules, then run `python setup.py develop` for development or `python setup.py install` to install.
After the above steps, you can import deep_gemm into your Python project.
In terms of interfaces, ordinary GEMMs go through the deep_gemm.gemm_fp8_fp8_bf16_nt function, which supports only the NT format (non-transposed LHS, transposed RHS).
For grouped GEMMs, it is m_grouped_gemm_fp8_fp8_bf16_nt_contiguous for the contiguous layout and m_grouped_gemm_fp8_fp8_bf16_nt_masked for the masked layout.
DeepGEMM also provides utility functions, for example for setting the maximum number of SMs and obtaining the TMA alignment size, and it supports environment variables such as DG_NVCC_COMPILER and DG_JIT_DEBUG.
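Putting the pieces together, here is a hedged usage sketch for the ordinary GEMM path. The (data, scales) pairing, the tensor shapes, and the set_num_sms name for the SM-capping utility are assumptions to be checked against the repository's tests; quantize_fp8_blockwise is the hypothetical helper from the earlier sketch.

```python
import torch
import deep_gemm  # the library discussed in this article

m, k, n = 128, 7168, 4096
x = torch.randn(m, k, device="cuda")   # LHS, non-transposed
y = torch.randn(n, k, device="cuda")   # RHS, stored transposed (NT format)
out = torch.empty(m, n, device="cuda", dtype=torch.bfloat16)  # BF16 output

# Assumption: FP8 operands are passed as (fp8 data, fp32 scales) pairs,
# produced here by the hypothetical quantize_fp8_blockwise helper above.
x_fp8 = quantize_fp8_blockwise(x)
y_fp8 = quantize_fp8_blockwise(y)

deep_gemm.set_num_sms(112)  # assumed name for the "maximum SMs" utility
deep_gemm.gemm_fp8_fp8_bf16_nt(x_fp8, y_fp8, out)  # out = x @ y^T in BF16
```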
Beyond this, the DeepSeek team highlights several optimizations, including:
JIT design: all kernels are compiled at runtime, so nothing needs to be compiled at installation time, and the optimal block sizes and pipeline stages can be selected dynamically.
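As a conceptual illustration of that last point (not DeepGEMM's actual selection logic), runtime compilation lets kernel parameters be chosen per problem shape and baked in as compile-time constants:

```python
# Conceptual sketch (not DeepGEMM's actual heuristic): with JIT compilation,
# tile sizes and pipeline depths become compile-time constants chosen per shape.
def pick_config(m: int, n: int, k: int) -> dict:
    block_m = 64 if m <= 64 else 128      # smaller tiles for small-batch shapes
    num_stages = 8 if k >= 4096 else 6    # deeper pipeline for longer reductions
    return {"BLOCK_M": block_m, "NUM_STAGES": num_stages}

# The chosen constants would then be substituted into the CUDA source
# and compiled at runtime by the JIT module.
print(pick_config(128, 4096, 7168))
```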
Interested readers can find the details in the GitHub repository: https://github.com/deepseek-ai/DeepGEMM~
One More Thing
Meanwhile, Nvidia's stock has kept on falling these days.
That said, Nvidia's earnings report for the fourth quarter of FY2025 is due in the early hours of the 27th, Beijing time. Let's see how it performs~