New GPT-OSS Model by NVIDIA and OpenAI Sets Record With 1.5 Million Tokens Per Second
OpenAI and NVIDIA have jointly introduced two powerful open-weight large language models — gpt-oss-120b and gpt-oss-20b — aimed at democratizing access to advanced AI reasoning for developers, researchers, startups, and enterprises across the globe.
Representing a major leap in open-source AI, these models deliver top-tier performance, flexibility, and efficiency across a wide range of hardware setups. Trained on NVIDIA's H100 GPUs and tuned for deployment across its expansive CUDA ecosystem, both models perform best on Blackwell-powered GB200 NVL72 systems, reaching record-breaking inference throughput of 1.5 million tokens per second.
Figure 1. GPT-OSS Model.
Powered by Blackwell, Licensed for Innovation
Both gpt-oss models are available under the permissive Apache 2.0 license, enabling full commercial and research use. Figure 1 shows the GPT-OSS model.
“OpenAI demonstrated what’s possible with NVIDIA AI — and now they’re advancing open-source innovation,” said Jensen Huang, NVIDIA’s founder and CEO. “These models give developers worldwide access to a cutting-edge foundation to build upon, reinforcing U.S. leadership in AI, all backed by the world’s most powerful compute infrastructure.”
- The flagship gpt-oss-120b features 117 billion parameters (with 5.1B active per token) and achieves near-parity with OpenAI’s o4-mini on reasoning benchmarks — all while running on a single 80 GB GPU.
- The lighter gpt-oss-20b houses 21 billion parameters (3.6B active) and matches the performance of o3-mini, while being optimized to run on edge devices with just 16 GB of memory.
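The parameter figures above highlight the efficiency of the Mixture-of-Experts design: only a small fraction of each model's weights is active for any single token. A quick back-of-the-envelope check, using only the counts quoted above:

```python
# Active-parameter fractions implied by the figures quoted above
# (parameter counts in billions).
MODELS = {
    "gpt-oss-120b": {"total_b": 117.0, "active_b": 5.1},
    "gpt-oss-20b": {"total_b": 21.0, "active_b": 3.6},
}

def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's parameters used for any single token."""
    return active_b / total_b

for name, p in MODELS.items():
    frac = active_fraction(p["total_b"], p["active_b"])
    print(f"{name}: {frac:.1%} of weights active per token")
```

So gpt-oss-120b touches roughly 4% of its weights per token, while the smaller gpt-oss-20b uses a denser ~17%, which is part of why each fits its respective hardware budget.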
Advanced Architecture, Maximum Efficiency
Both models are built on a Mixture-of-Experts (MoE) architecture, support 128K context windows, utilize Rotary Positional Embeddings, and incorporate efficient attention mechanisms to balance computational power with memory usage.
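Rotary Positional Embeddings (RoPE), mentioned above, encode position by rotating pairs of query/key dimensions through a position-dependent angle, so that attention scores depend only on the *relative* offset between tokens. A minimal NumPy sketch of the idea (simplified for illustration; production implementations fuse this into the attention kernel):

```python
import numpy as np

def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Apply a rotary position embedding to one vector of even dimension."""
    d = x.shape[-1]
    half = d // 2
    # One rotation frequency per pair of dimensions.
    freqs = base ** (-np.arange(half) * 2.0 / d)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:half], x[half:]
    # Rotate each (x1[i], x2[i]) pair by its angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

# The dot product of rotated vectors depends only on relative position:
q, k = np.random.default_rng(0).normal(size=(2, 64))
s1 = rope(q, 10) @ rope(k, 7)      # offset 3
s2 = rope(q, 103) @ rope(k, 100)   # also offset 3
assert np.isclose(s1, s2)
```

This relative-position property is what lets RoPE-based models extrapolate attention across very long contexts such as the 128K windows cited here.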
Designed for real-time, low-latency applications, the models excel in:
- Chain-of-thought (CoT) reasoning
- Tool use
- Structured output generation
The gpt-oss models integrate seamlessly with major frameworks including FlashInfer, Hugging Face, llama.cpp, Ollama, vLLM, and NVIDIA’s TensorRT-LLM stack [1]. Benchmark results show that gpt-oss-120b outperforms proprietary models like OpenAI’s o1 and o4-mini in key domains such as:
- Healthcare (HealthBench)
- Mathematics (AIME 2024/2025)
- Programming (Codeforces)
Training leveraged a combination of supervised fine-tuning, reinforcement learning, and methods inspired by OpenAI’s advanced proprietary systems. Both models support variable reasoning effort settings (low, medium, high) to help users balance speed and performance as needed.
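The effort level is commonly conveyed to the model through the system prompt (e.g. a line like `Reasoning: high`), though the exact mechanism depends on the serving stack. The helper below is an illustrative sketch of that convention, not an official API:

```python
VALID_EFFORTS = ("low", "medium", "high")

def build_messages(user_prompt: str, effort: str = "medium") -> list[dict]:
    """Compose a chat request with a reasoning-effort hint in the system
    message. Hypothetical convention; check your server's documentation."""
    if effort not in VALID_EFFORTS:
        raise ValueError(f"effort must be one of {VALID_EFFORTS}")
    return [
        {"role": "system", "content": f"Reasoning: {effort}"},
        {"role": "user", "content": user_prompt},
    ]

# Higher effort trades latency for deeper chain-of-thought reasoning.
msgs = build_messages("Explain why 2^31 - 1 is prime.", effort="high")
```

A client would pass `msgs` to an OpenAI-compatible chat-completions endpoint; lowering `effort` is a cheap way to cut latency for simple queries.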
Safety and Deployment at Scale
The models underwent rigorous safety evaluation using OpenAI’s Preparedness Framework and adversarial fine-tuning. Independent reviews ensured that safety standards match those of OpenAI’s frontier models.
For deployment, OpenAI and NVIDIA have partnered with top platforms like Azure, AWS, Vercel, and Databricks, and hardware providers including AMD, Cerebras, and Groq. Additionally, Microsoft is supporting local inference of gpt-oss-20b on Windows devices via ONNX Runtime.
With unmatched speed, broad compatibility, and open access, the gpt-oss models represent a bold step forward in bringing cutting-edge AI to the global community.
Reference:
[1] https://interestingengineering.com/innovation/openai-nvidia-open-weight-ai-models
Cite this article:
Keerthana S (2025), New GPT-OSS Model by NVIDIA and OpenAI Sets Record With 1.5 Million Tokens Per Second, AnaTechMaz, pp. 760