State of AI: March 2026

Comprehensive Intelligence Briefing — Week of March 25, 2026
PRZC Research | March 25, 2026 | AI Intelligence Briefing | Report ID: T36-AI-SOTA-MAR2026

Executive Summary

Four developments dominate this week's briefing and carry immediate investment signal:

  1. Anthropic launches Computer Use on macOS (March 24) — The biggest agent-layer move of the week. Claude can now operate macOS natively: open apps, navigate browsers, fill spreadsheets, run dev tools. Paired with Dispatch (launched March 17), the architecture — phone-command, desktop-execute — is a direct assault on Microsoft's Copilot home turf and on OpenAI's Operator product. This is not a demo. It is a research preview shipping to Pro and Max subscribers today.
  2. NVIDIA GTC 2026 (March 16–19) resets the hardware ceiling — Jensen Huang revealed the full Vera Rubin platform: six new chips, the Groq 3 LPU integration, a standalone Vera CPU rack, and $1 trillion in cumulative orders forecast through 2027 (doubling the prior $500B projection). AWS, GCP, Azure, and OCI are first-wave Vera Rubin deployers. Inference token cost drops 10x vs. Blackwell. This is the single most consequential hardware event of Q1 2026.
  3. OpenAI GPT-5.4 (March 5) reshapes the frontier model tier — Three variants (Standard, Thinking, Pro), 1.05M token context, 33% factual error reduction vs. GPT-5.2, and a 92.8% Online-Mind2Web browser automation score, up from the prior best of 70.9%. GPT-5.1 deprecated March 11. The model generation clock is turning every 4–6 weeks now.
  4. The compute infrastructure megaround cycle closes — February 2026 was the largest single month of startup funding in recorded history: $189B globally, with 83% concentrated in three rounds (OpenAI $110B, Anthropic $30B, Waymo $16B). Capital deployment at this scale means the 2026 inference buildout is locked in. NVIDIA's order book is not at risk of demand shortfall.

I. Anthropic — This Week's Releases

Computer Use on macOS — Research Preview (March 24, 2026)

What it is: Anthropic enabled Claude Code and Claude Cowork to operate macOS computers autonomously. Claude perceives the screen, moves the cursor, clicks, types, opens applications, navigates browsers, and runs developer tools. No custom setup required. The capability is gated to Claude Pro ($20/mo) and Claude Max subscribers.

How it works — tool hierarchy: Claude first reaches for precise API connectors (Slack, Google Calendar, etc.). When connectors are unavailable, it falls back to direct screen interaction: mouse, keyboard, and screen perception. Explicit user permission is required before any action executes. Anthropic advises against use with sensitive credentials in the current preview window.
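A minimal sketch of this connector-first, screen-fallback hierarchy in Python. All names here (`Connector`, `ScreenDriver`, `run_task`) are illustrative assumptions, not Anthropic's actual interfaces; the point is the ordering — a precise connector when one exists, then permission-gated screen control.

```python
# Hypothetical sketch of a connector-first, screen-fallback tool hierarchy.
# Class and function names are illustrative, not Anthropic's API.

class Connector:
    """Precise API integration (e.g. a calendar or chat connector)."""
    def __init__(self, handles):
        self.handles = set(handles)
    def supports(self, task):
        return task in self.handles
    def execute(self, task):
        return f"connector:{task}"

class ScreenDriver:
    """Fallback: direct mouse/keyboard/screen interaction."""
    def execute(self, task):
        return f"screen:{task}"

def run_task(task, connectors, screen, user_approves):
    # 1. Prefer a precise connector when one covers the task.
    for c in connectors:
        if c.supports(task):
            return c.execute(task)
    # 2. Otherwise fall back to screen control, but only with
    #    explicit user permission, as the preview requires.
    if not user_approves(task):
        raise PermissionError("user declined screen-control fallback")
    return screen.execute(task)
```

The permission gate sits only on the screen-control path, mirroring the preview's design: API connectors are scoped and auditable, while raw GUI control is the higher-risk action that needs explicit consent.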

Windows status: x64 support confirmed as upcoming; no date given.

Competitive framing: This directly competes with Microsoft's Copilot desktop automation (on its Windows home turf) and OpenAI's Operator agent product.

The architectural differentiation Anthropic is pressing: persistent context across sessions. Dispatch + Computer Use together form a loop where Claude retains task state across phone and desktop interactions — a design none of the competitors have shipped end-to-end.

OSWorld-Verified scores (context): Sonnet 4.6 = 72.5%; Opus 4.6 = 72.7%. These are the agentic computer-use benchmarks. The two scores are just 0.2 points apart, which arguably makes Opus's price premium unjustifiable for computer-use workloads.

Claude Dispatch — Research Preview (March 17, 2026)

What it is: A persistent mobile-to-desktop control channel embedded in Claude Cowork. Users scan a QR code, link their phone to a desktop Cowork session, then issue natural language task instructions from the mobile app. Claude executes on the desktop while the user is away. Context thread does not reset between tasks.

Availability: Max subscribers received access March 17; Pro subscribers followed within days.

Why it matters: Dispatch is the scaffolding that makes Computer Use a product rather than a demo. Without persistent cross-device context, computer use is a novelty; with it, it becomes a workflow automation layer. The QR-code setup UX is engineered for zero friction; users report setup in under two minutes.

Pricing Change — 1M Token Context at Standard Rates (mid-March)

What changed: Anthropic dropped the long-context surcharge on Claude Opus 4.6 and Sonnet 4.6. The full 1M token window now bills at the same per-token rate regardless of prompt length.

Current API pricing:

| Model | Input ($/MTok) | Output ($/MTok) | Context | Max Output |
|---|---|---|---|---|
| Claude Haiku 4.5 | $1.00 | $5.00 | 200K | 32K |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 1M | 64K |
| Claude Opus 4.6 | $5.00 | $25.00 | 1M | 128K |
| Opus Fast Mode | $30.00 | $150.00 | 1M | 128K |

Effective cost floor: Combining prompt caching (90% savings) with batch API (50% off) yields up to 95% cost reduction — Sonnet 4.6 effective cost can approach $0.15/MTok input at scale. This is a structural price cut disguised as a feature change.
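The stacked-discount arithmetic behind that $0.15 figure, assuming (as the text implies) that the caching and batch discounts compound multiplicatively:

```python
# Effective Sonnet 4.6 input cost when prompt caching (90% savings)
# stacks with the batch API (50% off). Assumes multiplicative stacking.
LIST_INPUT = 3.00        # Sonnet 4.6 list input price, $/MTok
CACHE_DISCOUNT = 0.90    # prompt caching savings on cached tokens
BATCH_DISCOUNT = 0.50    # batch API discount

effective = LIST_INPUT * (1 - CACHE_DISCOUNT) * (1 - BATCH_DISCOUNT)
print(f"${effective:.2f}/MTok")  # prints $0.15/MTok
```

Note that only fully cached input hits this floor; uncached input under the batch discount alone bills at $1.50/MTok.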

Investment angle: Anthropic is playing the Stripe/Twilio "make the developer experience the moat" game. Every price cut and simplification increases API consumption volume. Revenue scales with token throughput, not per-seat licensing. The pricing change is an acquisition play.

Additional March Releases (earlier in the month)

Earlier in the month, Anthropic also expanded distribution: the $100M Partner Network, with Accenture and Deloitte as anchor partners, and the Claude Sonnet integration into Microsoft 365 Copilot. Both are covered in the investment analysis in Section VIII.

II. OpenAI — Recent Releases

GPT-5.4 (March 5, 2026)

What it is: OpenAI's current frontier model, released March 5 in three variants: Standard, Thinking, and Pro.

Key capabilities: 1.05M-token context window; 33% fewer factual errors than GPT-5.2; 92.8% on the Online-Mind2Web browser automation benchmark (prior best: 70.9%); 75.1% on Terminal-Bench.

Deprecation: GPT-5.1 (Instant, Thinking, Pro) deprecated March 11. The release-to-deprecation lifecycle for a model generation is now approximately 8–10 weeks.

GPT-5.4 Mini rollout: GPT-5.4 mini has been pushed to Free and Go ChatGPT tiers as a rate-limit fallback. GPT-5 Thinking mini will be retired within 30 days.

GPT-5.3 Instant: Replaced default ChatGPT model on March 3, just two days before GPT-5.4 launched — indicating OpenAI was staggering releases to maintain media attention cadence.

Context for the 92.8% Mind2Web score: This is the most underreported benchmark of the week. A 22-point improvement in browser automation in a single model generation means GPT-5.4 can reliably complete multi-step web tasks. This is what makes Operator (OpenAI's agent product) potentially deployable at enterprise scale in 2026.

OpenAI Organizational Context

OpenAI closed a $110 billion funding round on February 27, 2026 — the largest private venture round in history — at an $840B post-money valuation. Anchor investors: Amazon ($50B), NVIDIA ($30B), SoftBank ($30B). This capital locks in compute procurement and infrastructure buildout through at least 2028. OpenAI is no longer a startup in any meaningful sense; it is a hyperscale infrastructure company.


III. Google DeepMind — Recent Releases

Gemini 3 Series (March 2026)

Google has been executing a layered rollout of the Gemini 3 family across the month:

Gemini 3 Pro (Preview, March): Google's current flagship. Positioned as the best model for multimodal understanding and "most powerful agentic and vibe coding model." Available in Gemini app for Ultra subscribers and via API.

Key benchmarks vs. peers: Google positions Gemini 3 Pro as the leader on pure reasoning and graduate-level science; cross-lab numbers are compiled in the Section VII table.

Gemini 3 Flash (March): Speed-optimized variant. Gemini 3's intelligence at a fraction of cost, available across Google products.

Gemini 3.1 Flash Lite (March 3, 2026 — API release): Ultra-low-cost developer tier. Priced at $0.25/1M input tokens, $1.50/1M output tokens. This is the most aggressive price point from any major lab for a production-quality model. At this pricing, Gemini Flash Lite is positioned to capture API usage that would otherwise go to Sonnet 4.6 at 12x the cost.

Gemini 3 Deep Think (upgrade, March): Specialized reasoning mode. Available in Gemini app for Google AI Ultra subscribers and via Gemini API to select researchers and enterprises.

Strategic observation: Google's multi-tier strategy — Ultra for flagship reasoning, Pro for production agentic work, Flash for speed, Flash Lite for cost — is the most complete product ladder of any lab. The $0.25 Flash Lite price is a deliberate loss-leader to capture developer mindshare before Gemini 3.1 Pro becomes the default enterprise choice.

Gemini 3.1 Pro pricing: $2.00/$12.00 per 1M tokens — significantly cheaper than both Claude Opus ($5/$25) and GPT-5.4 Pro.
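To make the cross-lab pricing concrete, here is the cost of one hypothetical job (1M input tokens, 100K output tokens) under the rates quoted in this report; the job shape is an assumption chosen purely for illustration:

```python
# Cost of a single job (1M input + 100K output tokens) at each lab's
# quoted per-MTok rates. Job shape is illustrative only.
PRICES = {  # (input $/MTok, output $/MTok)
    "Gemini 3.1 Flash Lite": (0.25, 1.50),
    "Gemini 3.1 Pro":        (2.00, 12.00),
    "Claude Sonnet 4.6":     (3.00, 15.00),
    "Claude Opus 4.6":       (5.00, 25.00),
}

def job_cost(input_mtok, output_mtok, prices):
    inp, out = prices
    return input_mtok * inp + output_mtok * out

for model, p in PRICES.items():
    print(f"{model}: ${job_cost(1.0, 0.1, p):.2f}")
# Flash Lite $0.40, Gemini Pro $3.20, Sonnet $4.50, Opus $7.50
```

The spread (roughly 19x between Flash Lite and Opus on this mix) is the gap the "loss-leader" framing above is pointing at.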


IV. Meta / Open-Source AI

Llama 4 (Scout and Maverick)

What it is: Meta's first open-weight, natively multimodal model family, and its first built on a Mixture-of-Experts (MoE) architecture.

Llama 4 Scout: 17B active parameters routed across 16 experts (~109B total); 10M-token context window; sized to run on a single H100-class GPU.

Llama 4 Maverick: 17B active parameters routed across 128 experts (~400B total); 1M-token context; positioned by Meta as beating GPT-4o on broad benchmarks.

MoE architecture significance: MoE allows a subset of parameters to activate per inference, dramatically reducing compute cost per token while maintaining the capability of larger dense models. Combined with open weights, this means enterprises can now deploy frontier-class multimodal agents on their own infrastructure without per-token API costs.
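The compute saving is easy to quantify. The figures below use the trillion-total / ~37B-active ratio this report cites for DeepSeek-style MoE, as an order-of-magnitude illustration rather than Meta's Llama 4 numbers:

```python
# Per-token compute in a transformer scales roughly with the parameters
# that participate in the forward pass. In an MoE model only the routed
# experts are active, so the active/total ratio approximates the
# per-token saving vs. an equally sized dense model.
total_params = 1_000e9   # 1T total parameters
active_params = 37e9     # ~37B active per token (DeepSeek-style ratio)

active_fraction = active_params / total_params
dense_equiv_saving = total_params / active_params

print(f"{active_fraction:.1%} active per token")                    # 3.7%
print(f"~{dense_equiv_saving:.0f}x less per-token compute than dense")  # ~27x
```

This is why open-weight MoE undercuts per-token API pricing: the marginal inference cost tracks the active slice, not the headline parameter count.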

Agentic focus: Llama 4's central design goal is agentic execution — planning, multi-step task execution, and extended context maintenance. This is not a chat model with agentic features bolted on; it is built agent-first.

Investment angle for open-source: Meta's Llama strategy is fundamentally competitive with all API-dependent revenue models. Every enterprise that runs Llama 4 on their own infrastructure is a lost customer for OpenAI, Anthropic, and Google API revenue. The cost arbitrage is substantial. This is the persistent structural pressure on API pricing across the industry.


V. Chinese AI Labs

Alibaba — Qwen 3.5 (February 16, released broadly March 2026)

The model supports AI agent tasks and processes text, image, and video inputs — including videos up to two hours in length. This 2-hour video comprehension is technically distinctive; no Western lab has publicly matched it.

Market share context: DeepSeek and Qwen combined now hold approximately 15% of global AI market share, up from approximately 1% twelve months ago. This is the single most important competitive metric in the industry for investors tracking Western AI lab revenue growth.

DeepSeek V4 — Status: Imminent but Unconfirmed

Current status (as of March 25, 2026): DeepSeek V4 Lite appeared on DeepSeek's website on March 9, 2026, suggesting an incremental rollout. Full V4 has not launched publicly.

Architecture (from leaks and official signals): a trillion-parameter Mixture-of-Experts design with roughly 37B active parameters per token, plus a sparse-attention scheme targeting context windows beyond 1M tokens at materially lower compute. Section X details the claimed innovations.

Claimed benchmark performance (unverified, internal): ~90% HumanEval; 80%+ SWE-bench Verified; internally dominant long-context coding.

Investor warning: DeepSeek has a history of landing with full benchmark packages at release. Multiple Q1 2026 release windows have passed without launch. Do not price in competitive impact until the model ships publicly and independent benchmarks run.

Strategic significance if released: A trillion-parameter open-weight model from a Chinese lab, priced at near-zero or with a permissive license, would be a category-defining event. The Western API pricing model faces existential pressure if DeepSeek V4 performs at claimed levels under commercial-friendly licensing.

Competitive Market Share Summary (Chinese AI)

The rise of Chinese AI models represents the most underpriced risk in the AI infrastructure investment thesis. DeepSeek V3 alone demonstrated that frontier-class models could be trained at a fraction of Western lab costs (reportedly under $6M for the V3 training run vs. hundreds of millions at US labs). That efficiency gap still favors DeepSeek; Western labs are narrowing it, but it has not closed.


VI. Hardware & Infrastructure

NVIDIA GTC 2026 (March 16–19) — The Hardware Event of Q1

NVIDIA unveiled the complete Vera Rubin generation at GTC 2026, with Jensen Huang raising the cumulative order forecast to $1 trillion through 2027 (from a prior $500B through 2026 projection).

Platform components:

| Component | Role |
|---|---|
| Vera CPU | Standalone agentic-optimized CPU; ships in 256-chip liquid-cooled Vera racks |
| Rubin GPU | Next-gen training/inference GPU |
| Groq 3 LPU | Licensed via a $20B deal; inference-specialized |
| NVLink 6 Switch | High-bandwidth chip interconnect |
| ConnectX-9 SuperNIC | Network interface |
| BlueField-4 DPU/STX | Storage rack system |
| Spectrum-6 SPX | Networking rack |

Performance claims vs. Blackwell: inference token cost roughly 10x lower, per the GTC keynote.

First deployers confirmed: AWS, Google Cloud, Microsoft Azure, Oracle OCI, CoreWeave, Lambda, Nebius, Nscale.

Jensen Huang on agentic AI: GTC 2026 was explicitly framed around "agentic AI infrastructure." The Vera CPU rack (256-chip standalone compute) is purpose-built for agentic workloads where reasoning loops require tight CPU-GPU coordination at microsecond latency.

Investment implications: NVIDIA's $1T order forecast through 2027 represents a sustained demand signal that de-risks near-term revenue estimates. The 10x inference cost reduction is not deflationary for NVIDIA — lower per-token cost drives higher token throughput demand, which sustains GPU demand. This is the same pattern observed with semiconductor efficiency curves driving higher unit volumes, not lower total revenue.
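The breakeven logic in that paragraph, made explicit. The elasticity framing is an illustration, not NVIDIA guidance: a 10x price cut is revenue-neutral only if token volume grows exactly 10x, and accretive beyond that.

```python
# Token revenue = price per token x tokens served. With a 10x cheaper
# token, the volume multiple needed just to hold revenue flat is 10x;
# the bull case is that demand elasticity pushes volume well past that.
old_price, new_price = 1.0, 1.0 / 10  # normalized $/token, Blackwell vs. Vera Rubin

def revenue(price, volume):
    return price * volume

breakeven_multiple = old_price / new_price
assert revenue(old_price, 1.0) == revenue(new_price, breakeven_multiple)
print(breakeven_multiple)  # prints 10.0
```

The semiconductor-efficiency analogy in the text is exactly this: past that 10x volume multiple, cheaper units grow total revenue.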

NVIDIA stock reaction: +2% on announcement day (March 16). Analyst consensus: GTC met expectations rather than exceeded them, suggesting the market had already priced in strong guidance.

AMD — Helios Rack-Scale Platform

AMD CEO Lisa Su introduced the Helios rack-scale AI platform at CES 2026 (January), targeting production in 2026. AMD expanded its portfolio in March with the Ryzen AI Embedded P100 series for edge inference. AMD's competitive position: meaningful alternative for customers seeking NVIDIA supply diversification, particularly in hyperscaler environments where cost leverage matters. Market share remains secondary to NVIDIA in training workloads; inference is the battleground.

Amazon — Trainium 3 and Nova 2 Pro

Amazon's Nova 2 Pro model (released earlier in Q1) combined with broad availability of Trainium 3 processors positions AWS to offer AI training and inference at approximately 40% lower cost than competing cloud providers. This is Amazon's structural play: be the cost-efficient infrastructure layer underneath everyone else's AI products.


VII. Capability Frontier — SOTA Benchmarks Table

As of March 25, 2026 — compiled from published benchmarks and third-party evaluations

| Model | MMLU | HumanEval | SWE-Bench | OSWorld / Computer Use | Context Window | Key Capability Edge |
|---|---|---|---|---|---|---|
| Claude Opus 4.6 (Anthropic) | ~87.9% | ~88% | 80.8% | 72.7% | 1M tokens | Best real-world coding; highest output limit (128K); Computer Use |
| Claude Sonnet 4.6 (Anthropic) | ~86% | ~86% | 79.6% | 72.5% | 1M tokens | Near-Opus performance at 60% of cost; best price/quality |
| GPT-5.4 (OpenAI) | 88.5% | N/A pub. | N/A pub. | N/A pub. | 1.05M tokens | Browser automation (92.8% Mind2Web); Terminal-Bench 75.1%; 33% fewer factual errors |
| Gemini 3.1 Pro (Google) | N/A pub. | N/A pub. | 80.6% | N/A pub. | 1M tokens | Leads pure reasoning/graduate science; $2/$12 pricing; production-ready 1M context |
| Gemini 3.1 Flash Lite (Google) | N/A | N/A | N/A | N/A | N/A | $0.25/$1.50 per MTok — cheapest production API in market |
| Llama 4 Maverick (Meta) | Beats GPT-4o (per Meta) | – | – | – | Extended | Open-weight multimodal; MoE architecture; self-hostable |
| Qwen 3.5 (Alibaba) | Beats GPT-5.2 on math-vision | – | – | – | 256K | 397B MoE; Apache 2.0; 201 languages; 2-hour video comprehension |
| Grok 4.20 (xAI) | N/A | – | – | – | – | Still in beta; Grok 5 delayed to Q2 2026 |
| Mistral Small 4 (Mistral) | – | – | – | – | 256K | Unified reasoning + multimodal + coding; 119B params; Apache 2.0; open-source |
| DeepSeek V4 (DeepSeek) | 90%+ (claimed, unverified) | 90% (claimed, unverified) | >80% (claimed, unverified) | – | 1M (claimed) | NOT YET RELEASED — imminent; trillion-parameter MoE |

Note: Benchmark comparisons across labs use different evaluation methodologies and are often run on curated subsets. Direct numeric comparison should be treated as directional, not definitive. SWE-Bench Verified is currently the most methodologically consistent cross-lab coding benchmark.


VIII. Investment Implications

Winners

NVIDIA (NVDA)
The Vera Rubin $1T order forecast is the clearest demand signal in the market. 10x inference cost reduction drives higher throughput demand, not lower GPU revenue. All four major hyperscalers are first-wave Vera Rubin deployers. The risk is execution on the production ramp — NVIDIA's prior Blackwell delays are the cautionary data point. Vera Rubin is currently in full production per Jensen Huang's January CES confirmation. Rating: Strong overweight for 12-month horizon.

Anthropic (private)
Three simultaneous moats reinforcing each other this week: Computer Use (agent capability), Dispatch (persistent cross-device context), Pricing simplification (1M tokens at standard rate). The $100M Partner Network with Accenture and Deloitte anchor partners converts enterprise relationships into distribution. The M365 integration gives Anthropic access to 400M+ enterprise seats via Microsoft. Closed $30B Series G in February at $380B valuation — still private, but secondary market exposure via NVIDIA (which invested) and Amazon (AWS partnership + investor). For pure-play exposure: watch for IPO signals in H2 2026.

Microsoft (MSFT)
Copilot + Claude Sonnet integration + Azure as Vera Rubin first-wave deployer = Microsoft is systematically hedging its OpenAI dependency while building out the enterprise AI distribution layer. Copilot Checkout (AI shopping agent) adds a consumer revenue vector. Microsoft's 2026 AI strategy is coherent: own the distribution surface regardless of which foundation model wins. Rating: Overweight.

Amazon / AWS (AMZN)
Trainium 3 + 40% cost advantage for inference + Nova 2 Pro + $50B OpenAI investment + Vera Rubin first-wave deployer. Amazon's strategy — "be the invisible engine inside every enterprise AI app" — is structurally defensible. The cost advantage is real and growing. Rating: Overweight.

CoreWeave (CRWV)
Named as a Vera Rubin first-wave deployer alongside the hyperscalers. Significant validation. CoreWeave's IPO earlier in 2026 positions it as the pure-play GPU cloud investment for investors who want NVIDIA exposure without NVIDIA's defense/gaming/automotive revenue mix. Watch closely.

Meta (META)
Llama 4 open-source strategy is the best long-term competitive position in AI for an advertising business. Meta doesn't sell AI — it uses AI to improve ad targeting and engagement. Every Llama release that keeps the open-source ecosystem alive and developer-friendly extends Meta's moat in ways that don't show up on an AI revenue line. Rating: Market weight with AI as a long-term optionality call.

Losers / Under Pressure

Scale AI, Surge AI, and similar data labeling firms
The capability jump in agentic computer use means AI systems can increasingly self-generate training data, browse and curate datasets, and perform labeling tasks autonomously. Not an immediate threat, but a 3–5 year structural headwind.

Traditional enterprise software incumbents (SAP, Salesforce CRM tier)
Microsoft Copilot + Anthropic Computer Use + GPT-5.4 Operator together represent a direct challenge to workflow software that depends on human interaction as the interface layer. When AI can navigate any GUI as fluently as a human, software pricing based on per-seat access to GUIs is under pressure.

Pure-play AI API arbitrage plays
Companies built on reselling API access to GPT-4/Claude 3 at markup face compression from both directions: model labs cutting prices directly and open-source models (Llama 4, Qwen 3.5, Mistral Small 4) enabling self-hosted alternatives.

xAI / Grok competitive position
Grok 5 missed its Q1 2026 window and has been pushed to Q2. In a market where OpenAI, Anthropic, and Google are shipping multiple significant updates per month, a multi-quarter delay is not a neutral event. xAI's $20B Series E and the xAI-SpaceX merger valuation at $1.25T reflect Musk's capital access and Grok's X integration, not demonstrated model superiority. The benchmark gap is widening during the delay window.

DeepSeek as a Western investment thesis risk
If DeepSeek V4 ships at claimed specifications (1T-parameter MoE, HumanEval 90%, open/permissive license), it invalidates the pricing floors that support current AI API revenue multiples. This is the primary tail risk for OpenAI and Anthropic revenue projections at their current venture valuations.

Watch List

| Company/Stock | Ticker | Trigger to Watch |
|---|---|---|
| NVIDIA | NVDA | Vera Rubin production ramp timeline; Q1 earnings guidance |
| Microsoft | MSFT | Copilot enterprise seat growth; M365 AI attach rate |
| Amazon | AMZN | AWS AI revenue line; Trainium 3 customer wins |
| Alphabet/Google | GOOGL | Gemini 3.1 Pro API adoption; Flash Lite market capture |
| CoreWeave | CRWV | Vera Rubin deployment contracts; capacity utilization |
| Meta | META | Llama 4 developer adoption; ad revenue AI contribution |
| Palantir | PLTR | AIP platform adoption as agentic tools proliferate |
| Snowflake | SNOW | Cortex AI product traction vs. native cloud AI competition |

IX. The PRZC View — What This Week Means

Are we accelerating? Yes — and the acceleration is not primarily in benchmark scores. The acceleration is in the deployment surface. Three months ago, "AI agent" meant a chatbot that could call an API. This week, Claude is operating a macOS desktop from a phone command, GPT-5.4 is completing multi-step browser tasks at 92.8% accuracy, and NVIDIA is promising inference at 10x lower token cost on a platform already in full production. The abstraction layer between "AI model" and "AI worker" is collapsing faster than most enterprise adoption projections assumed.

The most underappreciated development this week is not the headline model. It is the pricing move. Anthropic eliminating the long-context premium on 1M-token windows is a structural statement: the era of penalizing long-context usage is over. Combined with Google's $0.25/MTok Flash Lite pricing, the industry is in the early stages of a commodity pricing collapse for mid-tier model inference. This is deflationary for AI service margins but inflationary for AI application volume. The companies that benefit are those with high-frequency, high-volume AI workflows — not those selling premium AI access at a markup. Investors pricing AI plays on revenue-per-seat models should revisit assumptions for 2027 projections.

The no-AGI thesis is not looking more correct this week — but it is not falsified either. What is happening is narrower and more consequential: specific capability thresholds are crossing in rapid succession. Computer use. Near-perfect browser automation. Trillion-parameter open-weight multimodal models. None of these individually constitutes general intelligence. Together, they constitute a general capability surface that looks, for most enterprise and consumer use cases, indistinguishable from the outcome the no-AGI thesis was trying to protect. The practical question for investors is not "when is AGI?" but "when does the current capability curve saturate the addressable market?" On current trajectory, that saturation point for knowledge work automation is closer to 2028 than 2035.


X. Mystery Benchmark Topper — New DeepSeek Imminent?

Priority Alert — Added March 25, 2026 | Urgency: High

Background: The Community Alarm

In mid-March 2026, a model appeared anonymously on OpenRouter and within days was processing over one trillion tokens while climbing to the top of usage charts. The AI community's immediate assumption: DeepSeek V4 was quietly in stealth testing. The reasoning was sound — the capabilities were clearly frontier-level, the anonymous submission fit DeepSeek's historical secrecy, and DeepSeek V4 had already missed several predicted release windows, making an unannounced test drop plausible.

The March 18 reveal fractured that narrative.

Reuters confirmed that the mystery model was not DeepSeek. It was Xiaomi's MiMo-V2-Pro, operating under the codename "Hunter Alpha" — built by a team led by Luo Fuli, a former core contributor to DeepSeek's breakthrough models who joined Xiaomi in late 2025. That personnel link is why the model's architecture and behavior pattern-matched so strongly to DeepSeek expectations.

But the underlying DeepSeek V4 question remains entirely open. And as of March 25, there is now a separate, active mystery: two anonymous OpenAI models, "Vortex" and "Zephyr," are currently on the LMSYS Chatbot Arena challenging the current top positions.

Part 1: Xiaomi MiMo-V2-Pro (Confirmed "Hunter Alpha")

What It Is

MiMo-V2-Pro is a closed-weight, trillion-parameter LLM from Xiaomi, heavily optimized for real-world tool use and complex agentic workflows. The team lead, Luo Fuli, was a core DeepSeek contributor before joining Xiaomi in late 2025 — which explains why early analysis of Hunter Alpha's outputs triggered DeepSeek attribution.

Specifications

| Attribute | Value |
|---|---|
| Developer | Xiaomi (team led by ex-DeepSeek contributor Luo Fuli) |
| Architecture | Closed-weight, Mixture-of-Experts |
| Total parameters | Trillion-scale |
| Active parameters per pass | ~42B (~3x predecessor MiMo-V2) |
| Context window | 1M tokens |
| Primary optimization | Real-world tool use, agentic workflows |
| Stealth testing platform | OpenRouter (as "Hunter Alpha", March 11–18) |
| Usage during stealth period | 1 trillion tokens processed |

Benchmark Performance

| Leaderboard | Score | Global Rank | Notes |
|---|---|---|---|
| Artificial Analysis Intelligence Index | 49 pts | #8 worldwide | #2 among Chinese models |
| PinchBench | 84.0 | #3 globally | Behind leading Claude variants only |
| ClawEval | 61.5 | #3 globally | Surpasses recent GPT-5.x iterations |
| OpenRouter usage leaderboard | – | #1 | During stealth period, by token volume |

The Strategic Play

Xiaomi's anonymous launch generated massive organic usage and community hype before the official March 18–19 reveal. This is the same compressed-perception-cycle tactic OpenAI used with "gpt2-chatbot" on LMSYS in 2024, and repeated with "Zenith"/"Summit" ahead of GPT-5. By the time of announcement, the model is already proven at scale and the community is primed.

Investment Angle — Xiaomi (HKEx: 1810)

Most analyst models covering Xiaomi price it as a consumer hardware and smartphone company. MiMo-V2-Pro achieving globally top-10 positioning across multiple leaderboards — optimized for agentic device control with 1M-token context — repositions Xiaomi as an AI platform company with a credible frontier model. This framing has not propagated into mainstream sell-side coverage as of this writing. The ex-DeepSeek talent acquisition (Luo Fuli) signals this is a deliberate, sustained AI capability investment, not a one-cycle experiment.

Part 2: DeepSeek V4 — Still Unreleased, Increasingly Imminent

The Missed Window Timeline

| Window | Outcome |
|---|---|
| Lunar New Year (late Jan – early Feb) | No release |
| Mid-February 2026 | No release |
| Late February | No release |
| Early March (community predicted March 3) | No release |
| March 9 | "V4 Lite" appears silently on DeepSeek web interface |
| March 25 (today) | Full V4 still unreleased; no official announcement |

The March 9 "V4 Lite" appearance — improved coding performance, May 2025 knowledge cutoff, no official announcement from DeepSeek — is consistent with a staged capability rollout ahead of a full release. This mirrors how Claude and GPT-5 families first deployed lighter variants before flagship releases. It is the strongest signal yet that full V4 is close.

Architecture (Unverified — sourced from Reuters, The Information, community reverse engineering)

Three claimed architectural innovations distinguish V4 from V3:

  1. Manifold-Constrained Hyper-Connections — Improved gradient flow in very deep MoE architectures; addresses training instability at trillion-parameter scale
  2. Engram Conditional Memory — A novel memory mechanism for more efficient use of long-context information without full attention over the entire window; reduces compute on multi-hop retrieval tasks
  3. DeepSeek Sparse Attention — Achieves context windows exceeding 1M tokens at approximately 50% lower computational cost vs standard attention mechanisms

Model scale: Trillion total parameters, ~37B active per token — consistent with DeepSeek's established MoE activation ratio (same as V3).
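For scale, here is a schematic of what a constant-factor attention saving means at a 1M-token window. This is back-of-envelope arithmetic applied to the report's claimed ~50% figure, not a DeepSeek disclosure; real attention cost also carries head-count and dimension factors omitted here.

```python
# Attention score count grows quadratically with sequence length, so even
# a constant-factor (50%) saving is enormous in absolute terms at 1M tokens.
n = 1_000_000                              # context length, tokens
full_attention = n * n                     # pairwise scores, up to constants
sparse_attention = full_attention * 0.5    # claimed ~50% reduction

print(f"full:   {full_attention:.1e} scores")    # 1.0e+12
print(f"sparse: {sparse_attention:.1e} scores")  # 5.0e+11
```

The saving is a constant factor, not a complexity-class change, but at million-token windows the absolute compute it removes is what makes the pricing claims plausible.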

Claimed Benchmark Numbers (Internal, Unverified — Treat as Aspirational)

| Benchmark | DeepSeek V4 Claimed | Current Western Frontier | Note |
|---|---|---|---|
| HumanEval | 90% | ~88% (Claude Opus 4.6) | Marginal delta; plausible |
| SWE-bench Verified | 80%+ | 80.8% (Claude Opus 4.6) | Parity with the closed frontier, not a leap past it |
| Long-context coding | Dominant (internal) | Claude/GPT-5.x | Not independently verifiable |
| Inference cost (coding) | 10–40x lower than Western competitors | ~$3–15/MTok | Extrapolated from V3 efficiency gains |

The SWE-bench claim is the one to watch, but for price rather than capability: frontier closed models already sit at roughly 80% on SWE-bench Verified (Opus 4.6: 80.8%). A validated 80%+ from V4 would mean parity with the closed frontier at a claimed 10–40x lower inference cost, potentially under a permissive license. That combination, not the raw score, is what would immediately reshape the developer AI tooling market.

DeepSeek R2 (Reasoning Model) — Also Missing

Separate from V4, DeepSeek's reasoning model R2 has also missed every predicted window. Community timelines have now shifted to Q2 2026; the specs circulating in leaks remain entirely unverified.

The Polymarket prediction market for "DeepSeek R2 released before July 2026" was sitting at approximately 65% probability as of late March — the market-weighted expectation is a release before summer.

Community Intelligence Snapshot

| Source | Signal |
|---|---|
| r/LocalLLaMA | Early March window missed; community median now May–June 2026 |
| X/Twitter | V4 Lite on March 9 = imminent full release; majority view |
| Polymarket | R2 before July: ~65% probability |
| Manifold | R2/V4-Thinking median: June 2026 |
| Hacker News | CEO quality-bar delay = authentic, not misdirection (split opinion) |

Part 3: Current Arena Mystery — OpenAI "Vortex" and "Zephyr"

Separately from the DeepSeek watch, two anonymous OpenAI variants — "Vortex" and "Zephyr" — appeared on LMSYS Chatbot Arena in early March 2026 and are actively challenging the current top positions (Claude Opus 4.6 at 1504 Elo, Gemini 3.1 Pro at 1500 preliminary Elo).

The pattern is identical to OpenAI's GPT-5 pre-launch testing, which used "Zenith" and "Summit" as codenames. The dual-variant structure (one optimized for speed/breadth, one for deep multi-step reasoning) is consistent with OpenAI's established release playbook. Arena pre-testing typically precedes official announcement by 2–6 weeks — pointing to a GPT-5.4 extended announcement or GPT-5.5 reveal in late March to late April 2026.
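A 4-point Arena gap is within noise, which the standard Elo expectation formula makes concrete (the ratings are the preliminary figures quoted above):

```python
# Elo expected score: probability that A beats B in a head-to-head vote.
def elo_expected(rating_a: float, rating_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

p = elo_expected(1504, 1500)  # Opus 4.6 vs. Gemini 3.1 Pro, preliminary Elo
print(f"{p:.3f}")  # prints 0.506, effectively a coin flip
```

Any mystery model "challenging the top positions" therefore only needs to be within a few Elo points; leaderboard rank alone does not distinguish it from the incumbents.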

PRZC Position: Investment Implications of the Mystery Model Situation

1. DeepSeek V4 Release = Highest-Probability Near-Term Market Catalyst

When V4 launches (base case: April–June 2026), expect immediate pressure on Western API pricing floors, a repricing of developer AI tooling if the SWE-bench parity claim validates, and renewed scrutiny of OpenAI and Anthropic revenue multiples, the tail risk flagged in Section V.

2. Xiaomi Re-Rating Opportunity

MiMo-V2-Pro is a materially underappreciated data point for Xiaomi's long-term AI platform thesis. A consumer hardware company that can field a globally top-10 AI model — built by ex-DeepSeek talent, optimized for the agentic device control use case that every major tech company is racing toward — deserves a different valuation framework than "premium Android phone maker." Watch for analyst re-ratings in Q2 2026 if MiMo-V3 signals emerge.

3. Watch OpenRouter and LMSYS for the Next Mystery Appearance

The Hunter Alpha playbook (anonymous OpenRouter drop → viral organic adoption → official reveal) is now a documented strategy. Track anonymous high-performance model appearances on OpenRouter, LMSYS Arena, and HuggingFace leaderboards as early-warning signals for imminent frontier releases from any lab. The next mystery entry could be DeepSeek V4, Alibaba Qwen's next flagship, or a lab not currently in primary coverage.


PRZC Research — Investment Analysis Division | T36-AI-SOTA-MAR2026 | March 25, 2026
Coverage Period: March 9 – March 25, 2026 | Classification: Internal Distribution — Tier 1 Clients
This document is for informational purposes only and does not constitute investment advice.
