New GPT-OSS Model by NVIDIA and OpenAI Sets Record With 1.5 Million Tokens Per Second
OpenAI and NVIDIA have jointly introduced two powerful open-weight large language models — gpt-oss-120b and gpt-oss-20b — aimed at democratizing access to advanced AI reasoning for developers, researchers, startups, and enterprises across the globe.
Representing a major leap in open-source AI, these models deliver top-tier performance, flexibility, and efficiency across a wide range of hardware setups. Trained on NVIDIA's H100 GPUs and tuned for deployment across its expansive CUDA ecosystem, both models perform best on Blackwell-powered GB200 NVL72 systems, reaching record-breaking inference throughput of 1.5 million tokens per second.
Figure 1. GPT-OSS Model.
Powered by Blackwell, Licensed for Innovation
Both gpt-oss models are available under the permissive Apache 2.0 license, enabling full commercial and research use. Figure 1 shows the GPT-OSS model.
“OpenAI demonstrated what’s possible with NVIDIA AI — and now they’re advancing open-source innovation,” said Jensen Huang, NVIDIA’s founder and CEO. “These models give developers worldwide access to a cutting-edge foundation to build upon, reinforcing U.S. leadership in AI, all backed by the world’s most powerful compute infrastructure.”
- The flagship gpt-oss-120b features 117 billion parameters (with 5.1B active per token) and achieves near-parity with OpenAI’s o4-mini on reasoning benchmarks — all while running on a single 80 GB GPU.
- The lighter gpt-oss-20b houses 21 billion parameters (3.6B active) and matches the performance of o3-mini, while being optimized to run on edge devices with just 16 GB of memory.
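The parameter figures above highlight the efficiency of the Mixture-of-Experts design: only a small fraction of each model's weights is active for any single token. A quick back-of-the-envelope check, using only the counts quoted above:

```python
# Active-parameter fractions implied by the figures quoted above
# (parameter counts in billions).
MODELS = {
    "gpt-oss-120b": {"total_b": 117.0, "active_b": 5.1},
    "gpt-oss-20b": {"total_b": 21.0, "active_b": 3.6},
}

def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's parameters used for any single token."""
    return active_b / total_b

for name, p in MODELS.items():
    frac = active_fraction(p["total_b"], p["active_b"])
    print(f"{name}: {frac:.1%} of weights active per token")
```

So gpt-oss-120b touches roughly 4% of its weights per token, while the smaller gpt-oss-20b uses a denser ~17%, which is part of why each fits its respective hardware budget.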
Advanced Architecture, Maximum Efficiency
Both models are built on a Mixture-of-Experts (MoE) architecture, support 128K context windows, utilize Rotary Positional Embeddings, and incorporate efficient attention mechanisms to balance computational power with memory usage.
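Rotary Positional Embeddings (RoPE), mentioned above, encode position by rotating pairs of query/key dimensions through a position-dependent angle, so that attention scores depend only on the *relative* offset between tokens. A minimal NumPy sketch of the idea (simplified for illustration; production implementations fuse this into the attention kernel):

```python
import numpy as np

def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Apply a rotary position embedding to one vector of even dimension."""
    d = x.shape[-1]
    half = d // 2
    # One rotation frequency per pair of dimensions.
    freqs = base ** (-np.arange(half) * 2.0 / d)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:half], x[half:]
    # Rotate each (x1[i], x2[i]) pair by its angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

# The dot product of rotated vectors depends only on relative position:
q, k = np.random.default_rng(0).normal(size=(2, 64))
s1 = rope(q, 10) @ rope(k, 7)      # offset 3
s2 = rope(q, 103) @ rope(k, 100)   # also offset 3
assert np.isclose(s1, s2)
```

This relative-position property is what lets RoPE-based models extrapolate attention across very long contexts such as the 128K windows cited here.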
Designed for real-time, low-latency applications, the models excel in:
- Chain-of-thought (CoT) reasoning
- Tool use
- Structured output generation
The gpt-oss models integrate seamlessly with major frameworks including FlashInfer, Hugging Face, llama.cpp, Ollama, vLLM, and NVIDIA’s TensorRT-LLM stack [1]. Benchmark results show that gpt-oss-120b outperforms proprietary models like OpenAI’s o1 and o4-mini in key domains such as:
- Healthcare (HealthBench)
- Mathematics (AIME 2024/2025)
- Programming (Codeforces)
Training leveraged a combination of supervised fine-tuning, reinforcement learning, and methods inspired by OpenAI’s advanced proprietary systems. Both models support variable reasoning effort settings (low, medium, high) to help users balance speed and performance as needed.
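The effort level is commonly conveyed to the model through the system prompt (e.g. a line like `Reasoning: high`), though the exact mechanism depends on the serving stack. The helper below is an illustrative sketch of that convention, not an official API:

```python
VALID_EFFORTS = ("low", "medium", "high")

def build_messages(user_prompt: str, effort: str = "medium") -> list[dict]:
    """Compose a chat request with a reasoning-effort hint in the system
    message. Hypothetical convention; check your server's documentation."""
    if effort not in VALID_EFFORTS:
        raise ValueError(f"effort must be one of {VALID_EFFORTS}")
    return [
        {"role": "system", "content": f"Reasoning: {effort}"},
        {"role": "user", "content": user_prompt},
    ]

# Higher effort trades latency for deeper chain-of-thought reasoning.
msgs = build_messages("Explain why 2^31 - 1 is prime.", effort="high")
```

A client would pass `msgs` to an OpenAI-compatible chat-completions endpoint; lowering `effort` is a cheap way to cut latency for simple queries.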
Safety and Deployment at Scale
The models underwent rigorous safety evaluation using OpenAI’s Preparedness Framework and adversarial fine-tuning. Independent reviews ensured that safety standards match those of OpenAI’s frontier models.
For deployment, OpenAI and NVIDIA have partnered with top platforms like Azure, AWS, Vercel, and Databricks, and hardware providers including AMD, Cerebras, and Groq. Additionally, Microsoft is supporting local inference of gpt-oss-20b on Windows devices via ONNX Runtime.
With unmatched speed, broad compatibility, and open access, the gpt-oss models represent a bold step forward in bringing cutting-edge AI to the global community.
Reference:
[1] https://interestingengineering.com/innovation/openai-nvidia-open-weight-ai-models
Cite this article:
Keerthana S (2025), New GPT-OSS Model by NVIDIA and OpenAI Sets Record With 1.5 Million Tokens Per Second, AnaTechMaz, pp. 760