State of the Union's Open Source AI: How American Open-Weights Models Compare Globally

Jul 03, 2026 07:00 PM

Introduction

The United States produces some of the world’s most widely used open-weights models, spanning hyperscaler releases like Meta’s Llama and Google’s Gemma, hardware-tuned models from NVIDIA, small-but-capable models from Microsoft’s Phi team, and the fully transparent OLMo family from the nonprofit Allen Institute for AI. Together they range from the most openly documented models on earth to some of the most commercially restricted, and from 14-billion-parameter models that run on a laptop to 550-billion-parameter reasoning systems. The result is a uniquely diverse ecosystem with no single design philosophy holding it together.

In honor of the 250th birthday of the United States, this article reviews the state of the country’s open-weights large language models. We look at the top-performing options, examine how open-weight LLMs from the US differ in architecture, and speculate on what kinds of improvements we might see in future open-source models. The goal is not to crown a single winner, but to map who builds open models in the US, how they build them, and where the ecosystem might be headed next.

Key Takeaways

American open-weights AI spans the full range from one of the most open model families in the world (Ai2’s OLMo) to the highest-scoring open model in the US (NVIDIA’s Nemotron 3 Ultra 550B), with strong self-hostable options like Gemma 4 31B in between.
American models are unique for their architectural diversity and are mostly distributed through incumbent platforms. Most of them lack techniques common across architectures abroad (Multi-head Latent Attention (MLA), auxiliary-loss-free Mixture-of-Experts (MoE), reasoning pretraining). Chinese labs converge on shared architectures, and non-US/non-China labs like Mistral compete on sovereignty and permissive licensing.
US open releases have historically trailed the internal proprietary frontier, but newer entrants are closing that gap. NVIDIA’s hybrid Mamba-MoE Nemotron 3 lineup now tops US open benchmarks on both quality and throughput.

The Best US Open-Weights Models by the Numbers

The tables below list verified benchmark results for the leading American open-weights models as of mid-2026, drawn from official model cards, technical reports, and the independent Artificial Analysis leaderboard.

Reasoning models and general-purpose models are listed separately. A reasoning model with test-time chain-of-thought will beat a non-reasoning model on math and science benchmarks, so comparing them directly is not a fair comparison. Benchmark versions and conditions also vary. LiveCodeBench versions differ (v3 vs v6), American Invitational Mathematics Examination (AIME) editions differ by year (2024/2025/2026), and most scores are vendor-reported from self-run evals. GPQA Diamond and LiveCodeBench are high-variance. Treat differences of a point or two as noise.
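As a rough illustration of the noise rule above, the sketch below compares two scores against a margin. The 2-point default follows the rule of thumb in the text; the helper name `meaningful_gap` is hypothetical, and the second pair of scores is invented purely for the example.

```python
def meaningful_gap(score_a: float, score_b: float, margin: float = 2.0) -> bool:
    """Return True when two benchmark scores differ by more than the noise margin."""
    return abs(score_a - score_b) > margin

# MMLU-Pro, vendor-reported: Llama 4 Maverick (80.5) vs Llama 4 Scout (74.3)
print(meaningful_gap(80.5, 74.3))  # True: a 6.2-point gap is outside the margin

# Invented pair of scores only 1.2 points apart
print(meaningful_gap(70.4, 69.2))  # False: treat as a tie
```

A fixed margin is a blunt tool. It does not account for benchmark version or edition mismatches, which make scores incomparable regardless of the gap.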

Reasoning-Capable Open Models

Scores are reasoning-mode (“thinking on”) where the model supports a toggle. All figures are from official model cards or technical reports.

† The Nemotron 3 Ultra model card reports GPQA, MMLU-Pro, LiveCodeBench, SWE-Bench Verified (70.7), and RULER-1M (94.7) but not a standalone AIME 2025 score; on tool-augmented olympiad math (IMOAnswerBench) it scores 92.3.
‡ OLMo 3 32B Think reports standard MMLU 85.4 and MATH 96.1 rather than MMLU-Pro. It is the strongest fully open (weights + code + data) reasoning model.

General-Purpose (Non-Reasoning) Open Models

Model (Lab)	Params (total/active)	MMLU-Pro	GPQA Diamond	LiveCodeBench	Throughput	Context	License
Llama 4 Maverick (Meta)	400B/17B	80.5	69.8	43.4	~108 tok/s	1M	Llama 4 Community
Llama 4 Scout (Meta)	109B/17B	74.3	57.2	32.8	~95 tok/s	10M	Llama 4 Community
Phi-4 (Microsoft)	14B dense	70.4	56.1	—	—	16K	MIT
Gemma 3 27B (Google)	27B dense	67.5	42.4	29.7	—	128K	Gemma
DBRX Instruct (Databricks)	132B/36B	— ‡	—	—	~150 tok/s	32K	Databricks Open

‡ DBRX (March 2024) predates MMLU-Pro/GPQA becoming standard; it reports MMLU 73.7, HumanEval 70.1, and GSM8K 66.9. It is included as a size and speed reference point, not a current-quality contender.

Summarizing the Numbers

Best overall (mid-2026): NVIDIA’s Nemotron 3 Ultra 550B. It tops MMLU-Pro (86.8), GPQA Diamond (87.0), and LiveCodeBench (89.0) among American open-weights models. Notably, it is also one of the most open flagships, released under OpenMDW-1.1 with training data and post-training recipes. Its 550B total parameters (55B active) mean it needs an 8×H100-class node to serve.
Best you can self-host: Google’s Gemma 4 31B. It leads on AIME, runs on a single high-end GPU, and ships under Apache 2.0. For most builders, it is the strongest practical American open model.
Best efficiency: NVIDIA’s Nemotron 3 Nano 30B. With only ~3B active parameters (MoE), it posts MMLU-Pro 78.1 and GPQA 72.5. It outscores dense models many times its active size.
Best small model: Microsoft’s Phi-4-reasoning-plus (14B). Its GPQA Diamond and AIME numbers are competitive with models 5–40x its size.
Best fully open: Ai2’s OLMo 3 32B Think. Class-leading MATH (96.1) and strong coding scores. Not the top scorer, but a fully open release with weights, training code, full data, and hundreds of intermediate checkpoints.
Fastest / longest context: Meta’s Llama 4. No reasoning variant and mid-tier scores, but Scout’s 10M-token context is the largest of any open model, and both variants are tuned for high-throughput serving at scale.
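The serving requirement quoted for Nemotron 3 Ultra can be sanity-checked with back-of-envelope arithmetic. The sketch below estimates weight memory only. It assumes 80 GB per H100 and decimal gigabytes, and it ignores KV cache, activations, and runtime overhead, so real requirements are higher. The function name and the precision table are illustrative and are not taken from any model card.

```python
# Bytes per parameter for common weight formats (illustrative values).
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "4bit": 0.5}

def weight_memory_gb(total_params_billion: float, fmt: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return total_params_billion * BYTES_PER_PARAM[fmt]

NODE_GB = 8 * 80  # assumption: eight H100 GPUs with 80 GB each

for fmt in BYTES_PER_PARAM:
    # All 550B weights must be resident in memory, not just the 55B active per token.
    need = weight_memory_gb(550, fmt)
    print(f"{fmt}: {need:.0f} GB of weights, fits in a {NODE_GB} GB node: {need < NODE_GB}")
```

Under these assumptions the weights fit on a single 8-GPU node only at 8-bit precision or below. The active parameter count governs compute per token, not how many weights have to be held in memory.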

What Defines American Open-Weights AI

American open-weights AI is defined by a collection of divergent bets made by large technology incumbents, a chip vendor, and a few research nonprofits, with little shared design philosophy between them. The defining trait is architectural diversity without consensus. Grouped Query Attention (GQA) is the most common building block, but beyond it American labs pursue radically different bets in parallel: Mamba-2 state space models, Meta’s interleaved Rotary Position Embedding (iRoPE) attention, NVIDIA’s LatentMoE, and various layer-wise scaling schemes. NVIDIA is the clearest example of a lab pushing its own direction. It co-evolves model and silicon together, pretraining Nemotron in the NVIDIA FP4 (NVFP4) 4-bit format and building hardware-aware hybrid Mamba-2 architectures around its own GPUs.

How these models get their capabilities also differs from the Chinese approach. American labs have historically treated reasoning as a post-training problem, grafting it on through supervised fine-tuning and reinforcement learning rather than embedding it in pretraining. And because the largest players are platform companies, distribution is a structural advantage no independent lab has matched yet. Meta’s multi-billion-user footprint gives Llama real-world reach far beyond its benchmark standing.

Openness is where the ecosystem has the most contradictions. In past years, the pattern seemed to be “open after proprietary,” with open weights trailing a lab’s internal frontier release by a generation. NVIDIA’s Nemotron 3 lineup has recently upended that, shipping open weights (and, for the Ultra model, training data) that top the US benchmark tables outright. The Allen Institute’s OLMo family goes further still, releasing weights, training code, full training data, and intermediate checkpoints together, making it one of the most completely open releases anywhere. Yet the same country produces the most restricted licenses too, from Llama’s monthly-active-user cap to DBRX’s no-compete clause. The one throughline is efficiency at small scale. Microsoft’s 14-billion-parameter Phi-4-reasoning matches models many times its size, and Apple’s OpenELM has advanced layer-wise efficiency research.

What Defines Chinese Open-Weights AI

Where American labs diverge, leading Chinese labs have converged. Multi-head Latent Attention (MLA), first introduced by DeepSeek, has since been adopted by Moonshot’s Kimi K2 and, as of GLM-5, Zhipu’s GLM line (earlier GLM versions used GQA). Several of these models also share a fine-grained Mixture-of-Experts design along with DeepSeek’s auxiliary-loss-free load balancing.
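To make the auxiliary-loss-free idea concrete, here is a toy simulation of the mechanism as it is publicly described: each expert carries a bias that is added to its routing score for top-k selection only, and after every step the bias is nudged down for overloaded experts and up for underloaded ones. This is a minimal sketch, not DeepSeek's implementation. The function names, the skewed random scores, and every constant (`gamma`, expert count, token count) are assumptions made for illustration.

```python
import random

def route_top_k(scores, bias, k):
    """Pick the k experts with the highest (score + bias); bias affects selection only."""
    ranked = sorted(range(len(scores)), key=lambda e: scores[e] + bias[e], reverse=True)
    return ranked[:k]

def update_bias(bias, load, gamma):
    """Nudge bias down for overloaded experts and up for underloaded ones."""
    mean_load = sum(load) / len(load)
    return [b - gamma if n > mean_load else b + gamma if n < mean_load else b
            for b, n in zip(bias, load)]

def simulate(num_experts=8, k=2, tokens=512, steps=200, gamma=0.01, seed=0):
    rng = random.Random(seed)
    # Skewed base affinities: without correction, low-index experts win more tokens.
    skew = [0.3 * (num_experts - e) / num_experts for e in range(num_experts)]
    bias = [0.0] * num_experts
    history = []
    for _ in range(steps):
        load = [0] * num_experts
        for _ in range(tokens):
            scores = [skew[e] + rng.random() for e in range(num_experts)]
            for e in route_top_k(scores, bias, k):
                load[e] += 1
        # Ratio of the busiest expert's load to the ideal load (1.0 = perfectly balanced).
        history.append(max(load) / (tokens * k / num_experts))
        bias = update_bias(bias, load, gamma)
    return history

history = simulate()
print(f"max load / ideal load: first step {history[0]:.2f}, last step {history[-1]:.2f}")
```

The point of the design is that balancing pressure comes from the bias term rather than from an extra loss added to the training objective, so the language-modeling gradient is left untouched.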

Their training philosophy is also distinctive. Chinese labs increasingly treat reasoning as a pretraining target, dedicating entire “stage 2” pretraining phases to elevated math, code, and STEM data instead of relying on post-training alone. Some also build self-sustaining synthetic data loops. Alibaba, for instance, used specialized Qwen2.5-Math and Qwen2.5-Coder models to generate synthetic training data for Qwen3, reducing dependence on proprietary API teachers. And Chinese models appear to do it cheaply. DeepSeek V3 claims to have been trained on 14.8 trillion tokens for around $5.6 million, which would make it the most cost-efficient frontier training run ever.
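For scale, the claimed DeepSeek V3 figures work out to well under a dollar per million training tokens. The arithmetic below uses only the two numbers quoted above. Because the cost is a vendor claim, the result inherits that uncertainty.

```python
reported_cost_usd = 5.6e6   # claimed training cost quoted in the text
training_tokens = 14.8e12   # 14.8 trillion tokens

usd_per_million_tokens = reported_cost_usd / (training_tokens / 1e6)
print(f"${usd_per_million_tokens:.2f} per million training tokens")  # $0.38
```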

Qwen leads global HuggingFace downloads and DeepSeek leads open-weight reasoning leaderboards, trailing only the strongest closed models. The limit is transparency. These labs publish strong results and detailed arXiv papers, but they release weights only. None of the major frontier Chinese labs publish training code, training data, or intermediate checkpoints.

What Defines European and Other Global Models

Outside the US and China, open-weights work seems to be driven as much by sovereignty and language coverage as by capability. France’s Mistral is the most frontier-competitive player, and its largest models now ship under Apache 2.0, a rarity at that scale. For most European efforts, though, the motivation is reducing dependence on American and Chinese platforms rather than beating them outright. That goal is supported by public money, most notably EuroHPC’s AI Factories, which give small and medium-sized enterprises (SMEs) and nonprofits free GPU access. In one striking case, a Latvian translation company trained a 30-billion-parameter model using entirely subsidized compute, something with no US equivalent.

The rest of the field tends to fill gaps the giants ignore. Multilingual coverage is a recurring theme. OpenEuroLLM spans the EU’s 24 official languages, Singapore’s SEA-LION covers Southeast Asian languages, and India’s Sarvam handles 22 Indian languages. Licensing approaches vary widely: Canada’s Cohere releases its Command A models for research under a non-commercial (CC-BY-NC) license, requiring a separate agreement for commercial use. Several one-time contenders have simply retreated. Germany’s Aleph Alpha left the frontier race for enterprise sovereignty software, and the UAE’s Falcon pulled back to far smaller models.

What Comes Next for American Open-Source AI

Memory-efficient attention, whether DeepSeek-style MLA or the hybrid Mamba-Transformer designs NVIDIA is now shipping, is on track to become standard because it cuts inference costs without sacrificing quality. Reasoning is likely to move earlier in the pipeline, treated as a pretraining objective rather than a post-training patch, and a single checkpoint will increasingly serve both a “thinking” and a “fast” mode instead of a lab shipping two separate models. More of the full stack, including training data, code, and even training-cost disclosures, could ship alongside weights.
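The inference saving from memory-efficient attention comes largely from a smaller key-value cache. The sketch below compares cache size per token for standard multi-head attention (MHA), GQA, and an MLA-style compressed cache. The model shape is hypothetical, and the MLA sizes (a 512-value latent plus a 64-value positional key per layer) are assumed for illustration rather than taken from any model discussed here. Hybrid state-space layers, which keep a fixed-size state instead of a per-token cache, are not modeled.

```python
def kv_bytes_per_token(layers: int, cached_values_per_layer: int, bytes_per_value: int = 2) -> int:
    """KV-cache bytes stored per token across all layers (2 bytes per value = 16-bit)."""
    return layers * cached_values_per_layer * bytes_per_value

# Hypothetical model shape, chosen only for illustration.
LAYERS, HEADS, HEAD_DIM = 60, 64, 128

mha = kv_bytes_per_token(LAYERS, 2 * HEADS * HEAD_DIM)  # keys + values for every head
gqa = kv_bytes_per_token(LAYERS, 2 * 8 * HEAD_DIM)      # 8 shared KV heads (assumed)
mla = kv_bytes_per_token(LAYERS, 512 + 64)              # compressed latent + positional key (assumed)

for name, size in [("MHA", mha), ("GQA", gqa), ("MLA", mla)]:
    print(f"{name}: {size / 1024:.1f} KiB per token ({mha / size:.1f}x reduction vs MHA)")
```

At long context lengths the per-token figure is multiplied by every token in every concurrent request, which is why the cache, rather than the weights, often limits serving throughput.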

The larger opportunity is organizational rather than technical. Almost every major American open model is a side output of a company whose real product is something else, which means openness is always secondary to a proprietary roadmap. The Allen Institute’s OLMo shows how much a genuinely open-source-first organization can accomplish, but it’s practically alone. There is room in the US for more organizations whose primary mission is the open release itself, especially in the underserved 30-to-70-billion-parameter range where no fully open, architecturally modern, data-released American model yet exists.

Finally, not every opportunity is a bigger model. Some of the most valuable open-source work will be small, hyper-specific, compact models tuned for a single domain, plus the routers, model-selection architectures, and specialized verifiers for speculative decoding. These systems reward openness, because they need to be inspected, fine-tuned, and freely composed, and they play directly to America’s demonstrated strength in small-model efficiency. A future US open-source ecosystem may compete less on owning the single biggest model and more on offering a rich toolkit of small, interoperable, purpose-built ones.

Common Questions

Is American open-source AI behind China?

Not on openness, and not uniformly on capability. The US produces both one of the most open model families in the world (Ai2’s OLMo) and some of the most widely used open models (Llama, Gemma); however, Qwen (China) now leads global downloads and derivatives. Where American open models lag is top-line benchmark performance. The highest-scoring open-weights models are currently Chinese.

What is the most open American model?

Ai2’s OLMo family (OLMo 2 and the newer OLMo 3). It releases weights, training code, the full training dataset, and hundreds of intermediate checkpoints together under a permissive Apache 2.0 license. This is among the most complete open releases anywhere. The majority of “open” models release weights only.

Mistral has offices in the U.S. Is Mistral an American open source project?

No. Mistral is a French company headquartered in Paris. Its US presence is a sales and operational office. Model development is controlled by the French parent. It is the strongest near-frontier open-weights lab outside the US and China.

Why don’t American labs use MLA?

A combination of timing, infrastructure lock-in, hardware priorities, and mission. Meta’s architecture predates MLA’s maturity. NVIDIA chose a competing approach. Google and Microsoft face serving-stack constraints. Underlying all of it, only Ai2 treats open weights as its primary mission, so most US labs adopt architecture on product timelines.

Has NVIDIA’s Mamba-2 bet paid off?

Increasingly, yes. The throughput advantage is well established, and with the Nemotron 3 lineup the hybrid Mamba-2 design now reaches near-frontier quality. Nemotron 3 Ultra tops the US open-weights benchmarks. What it has not shown is a quality advantage over the best pure-attention models. The top Chinese models still lead the composite leaderboards, and NVIDIA still keeps a small fraction of full attention layers rather than going pure state-space.

Conclusion

Two hundred and fifty years in, the state of American open-source AI is one of genuine accomplishment paired with self-imposed lag. Many companies and much of the talent in the U.S. are focused on leading the world in closed models, leaving open-weights work with less focus and attention. The US hosts some of the most open models in the world and some of the most widely used ones, but its strongest labs increasingly reserve their frontier work for closed products. Meta’s 2026 pivot to the proprietary Muse Spark is the clearest example.

The most interesting opportunities are structural rather than incremental. An open-source-first organization, building MLA-first at 30B+ scale, with reasoning in pretraining and the full training stack released, would close most of the gaps identified here at once. The techniques are already public. What is missing is an American lab whose primary mission is to use them.


This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.