diff --git a/README.md b/README.md
index e241363..77c7ce7 100644
--- a/README.md
+++ b/README.md
@@ -24,10 +24,11 @@ Daily unlocks begin soon. No ivory towers - just pure garage-energy and communit
 Stay tuned – let's geek out in the open together.
 
 ### Day 1 - [FlashMLA](https://github.com/deepseek-ai/FlashMLA)
+
 **Efficient MLA Decoding Kernel for Hopper GPUs**
 Optimized for variable-length sequences, battle-tested in production
 
-🔗 FlashMLA GitHub Repo
+🔗 [**FlashMLA GitHub Repo**](https://github.com/deepseek-ai/FlashMLA)
 ✅ BF16 support
 ✅ Paged KV cache (block size 64)
 ⚡ Performance: 3000 GB/s memory-bound | BF16 580 TFLOPS compute-bound on H800
@@ -36,7 +37,7 @@
 Excited to introduce **DeepEP** - the first open-source EP communication library for MoE model training and inference.
 
-🔗 DeepEP GitHub Repo
+🔗 [**DeepEP GitHub Repo**](https://github.com/deepseek-ai/DeepEP)
 ✅ Efficient and optimized all-to-all communication
 ✅ Both intranode and internode support with NVLink and RDMA
 ✅ High-throughput kernels for training and inference prefilling
 
@@ -46,30 +47,30 @@
 ### Day 3 - [DeepGEMM](https://github.com/deepseek-ai/DeepGEMM)
 
-Introducing DeepGEMM - an FP8 GEMM library that supports both dense and MoE GEMMs, powering V3/R1 training and inference.
+Introducing **DeepGEMM** - an FP8 GEMM library that supports both dense and MoE GEMMs, powering V3/R1 training and inference.
 
-⚡ Up to 1350+ FP8 TFLOPS on Hopper GPUs
+🔗 [**DeepGEMM GitHub Repo**](https://github.com/deepseek-ai/DeepGEMM)
+⚡ Up to 1350+ FP8 TFLOPS on Hopper GPUs
 ✅ No heavy dependency, as clean as a tutorial
-✅ Fully Just-In-Time compiled
-✅ Core logic at ~300 lines - yet outperforms expert-tuned kernels across most matrix sizes
-✅ Supports dense layout and two MoE layouts
-
-🔗 GitHub: https://github.com/deepseek-ai/DeepGEMM
+✅ Fully Just-In-Time compiled
+✅ Core logic at ~300 lines - yet outperforms expert-tuned kernels across most matrix sizes
+✅ Supports dense layout and two MoE layouts
 
 ### Day 4 - Optimized Parallelism Strategies
 
-✅ DualPipe - a bidirectional pipeline parallelism algorithm for computation-communication overlap in V3/R1 training.
-🔗 https://github.com/deepseek-ai/DualPipe
-✅ EPLB - an expert-parallel load balancer for V3/R1.
- 🔗 https://github.com/deepseek-ai/eplb
+✅ **DualPipe** - a bidirectional pipeline parallelism algorithm for computation-communication overlap in V3/R1 training.
+🔗 [**GitHub Repo**](https://github.com/deepseek-ai/DualPipe)
 
-📊 Analyze computation-communication overlap in V3/R1.
- 🔗 https://github.com/deepseek-ai/profile-data
+✅ **EPLB** - an expert-parallel load balancer for V3/R1.
+🔗 [**GitHub Repo**](https://github.com/deepseek-ai/eplb)
+
+📊 Analyze computation-communication overlap in V3/R1.
+🔗 [**GitHub Repo**](https://github.com/deepseek-ai/profile-data)
 
 ### Ongoing Releases...
 
-## 2024 AI Infrastructure Paper (SC24) 
+## 2024 AI Infrastructure Paper (SC24)
 
 ### Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning
 
-📄 Paper Link
-📄 Arxiv Paper Link
+[**📄 Paper Link**](https://dl.acm.org/doi/10.1109/SC41406.2024.00089)
+[**📄 arXiv Paper Link**](https://arxiv.org/abs/2408.14158)
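
To make the Day 1 entry concrete, here is a minimal single-decode-step sketch of calling FlashMLA, following the usage pattern shown in the FlashMLA repo (`get_mla_metadata` and `flash_mla_with_kvcache`). The batch size, sequence lengths, and cache geometry below are illustrative assumptions, not library defaults; only the block size 64 and BF16 dtype come from the announcement above.

```python
# Sketch: one MLA decode step with a paged KV cache (assumed shapes).
import torch
from flash_mla import get_mla_metadata, flash_mla_with_kvcache

b, s_q = 4, 1                # batch size; one query token per step when decoding
h_q, h_kv = 128, 1           # query heads vs. the single latent KV head of MLA
d, dv = 576, 512             # QK head dim (incl. RoPE part), value head dim
block_size, blocks_per_seq = 64, 64   # paged KV cache, block size 64

cache_seqlens = torch.full((b,), 1024, dtype=torch.int32, device="cuda")
q = torch.randn(b, s_q, h_q, d, dtype=torch.bfloat16, device="cuda")
kvcache = torch.randn(b * blocks_per_seq, block_size, h_kv, d,
                      dtype=torch.bfloat16, device="cuda")
block_table = torch.arange(b * blocks_per_seq, dtype=torch.int32,
                           device="cuda").view(b, blocks_per_seq)

# Tile-scheduler metadata is computed once per decode step and then
# reused across all layers of the model.
tile_scheduler_metadata, num_splits = get_mla_metadata(
    cache_seqlens, s_q * h_q // h_kv, h_kv)

o, lse = flash_mla_with_kvcache(
    q, kvcache, block_table, cache_seqlens, dv,
    tile_scheduler_metadata, num_splits, causal=True)
print(o.shape)  # (b, s_q, h_q, dv)
```

The block table is what makes the cache "paged": each sequence owns a list of fixed-size blocks rather than one contiguous buffer, which is why the kernel handles variable-length sequences well.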
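
Similarly, for the Day 3 entry, a rough sketch of a dense FP8 GEMM through DeepGEMM's documented `gemm_fp8_fp8_bf16_nt` entry point. The matrix sizes and the all-ones scale factors are illustrative assumptions; real callers derive the scales from the 128-wide blockwise quantization described in the DeepGEMM repo, and the exact scale-layout requirements may differ by version.

```python
# Sketch: FP8 x FP8 -> BF16 dense GEMM with blockwise scaling (assumed sizes).
import torch
import deep_gemm

m, k, n = 128, 7168, 4096    # k and n chosen as multiples of the 128 block size

lhs = torch.randn(m, k, device="cuda").to(torch.float8_e4m3fn)
lhs_scales = torch.ones(m, k // 128, dtype=torch.float32, device="cuda")
rhs = torch.randn(n, k, device="cuda").to(torch.float8_e4m3fn)
rhs_scales = torch.ones(n // 128, k // 128, dtype=torch.float32, device="cuda")
out = torch.empty(m, n, dtype=torch.bfloat16, device="cuda")

# The repo ships a helper to put per-token LHS scales into the
# TMA-aligned layout the kernel expects.
lhs_scales = deep_gemm.get_col_major_tma_aligned_tensor(lhs_scales)

# "nt" layout: row-major LHS multiplied by the transpose of a row-major RHS.
deep_gemm.gemm_fp8_fp8_bf16_nt((lhs, lhs_scales), (rhs, rhs_scales), out)
```

Since the library is fully JIT-compiled, the first call for a given shape pays a compilation cost and subsequent calls hit the compiled kernel cache.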