On March 24–25, 2026, Google Research published TurboQuant — a compression algorithm that reduces the memory required to run large language models during inference by at least 6x, with zero accuracy loss, requiring no retraining. Memory stocks responded immediately: SK Hynix fell 6%, Samsung fell 5%, Micron fell 5%, Kioxia fell 6%. Morgan Stanley told investors to buy the dip. TrendForce called the reaction "likely an overreaction." Both may be right about the immediate quarter — and both are wrong about the structural trajectory.
TurboQuant is not an isolated event. It is the latest data point in a compounding series of efficiency breakthroughs — DeepSeek V3, Mamba/SSM hybrid architectures, Mixture-of-Experts, the Groq LPU acquisition by Nvidia, SanDisk's High Bandwidth Flash — that are collectively exerting deflationary pressure on the hardware requirements per unit of AI compute. Each breakthrough alone is manageable. The trend they collectively represent is not.
This report makes a specific, structural argument: the AI application bubble and the AI infrastructure bubble are distinct phenomena and must be evaluated separately. Our companion report (T38) argues that Anthropic's revenue trajectory, enterprise adoption data, and market displacement effects are inconsistent with a bubble at the application layer. That argument holds. But the infrastructure layer operates under entirely different economics — and those economics are increasingly fragile.
The four core findings:

1. The immediate selloff was an overreaction in the narrow sense: TurboQuant compresses the KV cache held in server DRAM, not the HBM inside GPU packages, and HBM supply is pre-sold through 2026.
2. TurboQuant is nonetheless one data point in an accelerating series of efficiency breakthroughs that are deflating hardware requirements per unit of AI capability across the entire stack.
3. The ~$690 billion 2026 infrastructure buildout is increasingly financed with 7–15 year debt against hardware and architecture assumptions that depreciate far faster, a structural mismatch with a 2027–2028 inflection point.
4. The application layer and the infrastructure layer have distinct risk profiles: the bubble narrative fails at the application layer but describes a real and growing fragility at the infrastructure layer.
TurboQuant was published at ICLR 2026 on March 24–25, 2026. The paper targets a specific and well-known bottleneck in AI inference: the KV cache (key-value cache). Every time an LLM generates tokens, it must store and retrieve attention context for all prior tokens in the sequence. As context windows have grown from 4,096 tokens to 200,000 tokens and beyond, KV caches have ballooned from megabytes to hundreds of gigabytes per concurrent session. This is why inference at scale is memory-bound, not compute-bound — the GPU sits idle waiting for memory reads.
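The arithmetic is easy to reproduce. Below is a minimal sizing sketch in Python; the model dimensions (80 layers, 64 KV heads of dimension 128, roughly a 70B-class dense transformer) are illustrative assumptions rather than the configurations tested in the paper, and grouped-query attention or lower precision would shrink the absolute numbers without changing the linear growth in context length:

```python
# Back-of-envelope KV cache sizing. Model dimensions are illustrative
# assumptions, not the setups evaluated in the TurboQuant paper.

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int) -> int:
    # Keys + values (factor of 2), stored per layer, per KV head, per token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

GB = 1024 ** 3
for seq_len in (4_096, 200_000):
    fp32 = kv_cache_bytes(80, 64, 128, seq_len, 4)  # 32-bit baseline
    compressed = fp32 / 6                           # the paper's quoted 6x ratio
    print(f"{seq_len:>7} tokens: {fp32 / GB:7.1f} GB fp32 "
          f"-> {compressed / GB:6.1f} GB compressed")
```

Under these assumptions a single 200,000-token sequence holds roughly a terabyte of float32 KV state, and concurrent sessions multiply that figure, which is why inference fleets stall on memory rather than arithmetic.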
TurboQuant compresses the KV cache to 3 bits (down from 32 bits in standard float32 representations) with zero accuracy loss on all major long-context benchmarks. At 4-bit compression it achieves up to an 8x speedup in computing attention logits on Nvidia H100 GPUs. The algorithm quantizes the cached keys and values in two stages, with no change to the model itself.
No retraining. No fine-tuning. Works on existing models including Gemma and Mistral families. Open-source PyTorch implementations appeared on GitHub within 48 hours of the paper's publication.
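For intuition, here is a minimal PyTorch sketch of the generic low-bit KV-cache quantization pattern: quantize key/value blocks on write, dequantize on read. This is not the TurboQuant algorithm itself, only the shape of the idea; it also makes clear why no retraining is involved, since the model weights are untouched:

```python
import torch

# Generic KV-cache quantization pattern (illustrative; NOT TurboQuant itself).
# Real implementations pack the low-bit codes; int8 storage keeps the sketch simple.

def quantize(x: torch.Tensor, bits: int = 3):
    """Symmetric max-abs scalar quantization, one scale per row (token)."""
    levels = 2 ** (bits - 1) - 1                    # 3-bit -> codes in [-3, 3]
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / levels
    codes = torch.round(x / scale).clamp(-levels, levels).to(torch.int8)
    return codes, scale                             # low-bit codes + fp scales

def dequantize(codes: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Reconstruct approximate keys/values for the attention computation."""
    return codes.float() * scale

k = torch.randn(8, 128)                             # (tokens, head_dim)
codes, scale = quantize(k)
error = (dequantize(codes, scale) - k).abs().mean()
print(f"mean abs reconstruction error: {error:.4f}")
```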
Morgan Stanley's "buy the dip" call and TrendForce's "overreaction" assessment are defensible on one specific technical claim: TurboQuant does not directly reduce HBM demand. This distinction is critical and worth dwelling on.
There are two separate memory systems in a modern AI data center:
| Memory Type | Location | Function | TurboQuant Impact |
|---|---|---|---|
| HBM (High Bandwidth Memory) | Inside the GPU die package (stacked on chip) | Stores model weights; feeds compute cores during forward pass | None — not targeted |
| DRAM / Server RAM | On the inference server motherboard, outside GPU | Stores KV cache during extended conversations; feeds data to GPU | Direct 6x compression |
HBM is what SK Hynix, Micron, and Samsung are supplying to Nvidia for H100, H200, and Blackwell GPUs. TurboQuant compresses the data that lives in server DRAM, not in HBM. So the immediate impact on the HBM supply chain is limited. Morgan Stanley is technically correct.
But this is a narrow argument deployed in defense of a broad position. The question is not "does TurboQuant hurt HBM this quarter?" The question is: "what does the compounding trajectory of efficiency improvements like TurboQuant imply for the $690B of infrastructure investment being made this year?" That is a different and more dangerous question.
TurboQuant does not exist in isolation. It is the latest in a sequence of algorithmic breakthroughs that are collectively compressing the hardware cost per unit of AI capability. The trend is not new — but its pace is accelerating.
| Breakthrough | Date | Efficiency Gain | Target |
|---|---|---|---|
| DeepSeek V3 (MoE + MLA) | January 2025 | ~90% reduction in reasoning cost | Training + inference compute |
| Apple "LLM in a Flash" | Ongoing (arxiv 2023, deployed 2024–25) | Run 2x DRAM-size models on flash; 20–25x CPU speedup | Inference on-device memory |
| IBM Granite 4 (Mamba-2 Hybrid) | November 2025 | Up to 8x faster token generation vs. equivalent transformer | Inference compute + memory |
| Nvidia Nemotron 3 Super (MoE) | Early 2026 | 7.5x higher throughput; only 12.7B of 120.6B params active per pass | Inference compute |
| Groq 3 LPU (Nvidia, post-acquisition) | March 16, 2026 (GTC) | Deterministic streaming; no external HBM required | Inference memory architecture |
| Google TurboQuant | March 24–25, 2026 | 6x KV cache compression; 8x attention logit speedup | Inference DRAM (KV cache) |
Each row in this table represents a published, deployable, or production-track breakthrough. The cumulative effect across the stack — training compute, inference compute, inference memory, on-device memory — is an AI capability cost curve that is falling far faster than the infrastructure investment curve. The capex being committed in 2026 is being priced against hardware assumptions that are becoming obsolete before the concrete dries.
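A crude way to see the claim: treat the per-axis gains in the table as multiplicative. Fully-stacking, independent gains are an optimistic assumption (real deployments overlap), so the sketch below, using two of the table's figures, indicates the shape of the cost curve rather than a literal forecast:

```python
# Stylized compounding of efficiency gains across independent layers of the
# stack. "Fully multiplicative" is an optimistic assumption; real gains
# overlap, so read the output as an upper bound on the trend.
gains_by_axis = {
    "inference compute (MoE sparse activation, ~7.5x)": 7.5,
    "inference memory (TurboQuant KV cache, ~6x)":      6.0,
}
hardware_per_unit = 1.0   # index: hardware needed per unit of AI capability
for axis, gain in gains_by_axis.items():
    hardware_per_unit /= gain
    print(f"after {axis}: {hardware_per_unit:.3f} of baseline hardware")
```

Even after heavily discounting the stacking assumption, the direction is unambiguous: hardware per unit of capability is falling by multiples, while the capex curve (see below) rises 67–76% per year.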
The five largest US technology companies have committed a combined $660–690 billion in capital expenditure for 2026, approximately 75% of which (roughly $495–520 billion) is directed at AI infrastructure: data centers, chips, cooling, power, networking.
| Company | 2024 Capex | 2025 Capex | 2026 Capex (committed) | YoY Growth |
|---|---|---|---|---|
| Amazon (AWS) | ~$77B | ~$125B | ~$200B | +60% |
| Alphabet / Google | $52.5B | ~$95B | $175–185B | +84–95% |
| Microsoft | ~$55B | ~$80B | $120B+ | +50% |
| Meta | ~$38B | ~$65B | $115–135B | +77–108% |
| Oracle | ~$9B | ~$15B | ~$25–30B | +67–100% |
| Total | ~$231B | ~$380B | ~$635–670B | +67–76% |
For context: the entire US electric utility industry invested approximately $160 billion in generation, transmission, and distribution in 2024. The five largest technology companies are outspending the entire US utility sector on energy-adjacent AI infrastructure by a factor of four. The comparison is not rhetorical — power grid constraints are now the single most cited bottleneck in data center deployment.
This capex is no longer being funded purely from free cash flow. Hyperscalers issued $108–182 billion in new debt in 2025 — roughly double the prior year. Cumulative projected debt issuance over the coming years is estimated at $1.5 trillion. The debt instruments have maturities of 7–15 years.
This creates a structural mismatch that is the core of the infrastructure bubble risk:

1. The liabilities are long-dated: the new debt carries 7–15 year maturities.
2. The assets are short-lived: successive architectures and efficiency gains push the economic life of AI hardware toward 12 months.
3. The revenue lags both: AI-attributed cloud revenue currently covers an estimated 15–20% of AI-directed capex.
Ares Management Co-President Kipp deVeer stated this explicitly in October 2025: "If you look historically in areas like this over the past 20 or 30 years, typically when this much capacity comes online, some of it at the end of the day has to be marginal." Analyst Gil Luria was more direct: "If the market for artificial intelligence were even to steady in its growth, pretty quickly we will have over-built capacity, and the debt will be worthless, and the financial institutions will lose money."
The near-term HBM picture is unambiguously positive for suppliers. High Bandwidth Memory — the stacked DRAM packages inside Nvidia GPUs that store model weights and feed compute cores — is supply-constrained through at least 2026. All three major producers have effectively pre-sold their entire 2026 output.
| Supplier | HBM Market Share (Q2 2025) | 2026 Outlook | Key Metrics |
|---|---|---|---|
| SK Hynix | 62% | ~70% of HBM4 for Nvidia Rubin (UBS) | 2026 revenue forecast +37.9% YoY; wafer shortage "could last until 2030" |
| Micron | ~20% (ramping) | 2026 HBM supply entirely pre-sold; HBM4 ramp in Q2 2026 | Q2 FY2026 revenue $23.86B (+57% YoY); HBM TAM forecast $100B by 2028 |
| Samsung | ~18% | Recovering position; warned of memory shortage driving prices up | Collaborating on HBF standardization; advancing PIM-enabled LPDDR6 |
SK Hynix's statement that "memory wafer shortages could last until 2030" captures the near-term consensus. In this environment, Morgan Stanley's "buy the dip" reflex on TurboQuant news is understandable — the near-term supply/demand balance is too tight for algorithmic efficiency gains to make a dent in 2026 or even early 2027.
The consensus view breaks down when the analysis extends beyond the current shortage cycle. Three forces converge in the 2027–2028 window that make the HBM market structurally vulnerable: simultaneous capacity expansion by all three producers, High Bandwidth Flash, and alternative inference architectures.
The first force is simultaneous capacity expansion. Micron raised capex to $20 billion specifically for HBM expansion. SK Hynix is running memory capex in the mid-30s as a percentage of revenue. Samsung has committed its most aggressive memory capex since the DRAM wars of the early 2010s. When all three major players expand simultaneously into a market driven by a single customer ecosystem (Nvidia's GPU platform), overcapacity risk is not hypothetical; it is the base case of every prior memory super-cycle. DRAM spot prices dropped 50% from 2023 to 2025 in standard segments before HBM shielded producers from the correction. That shielding is not permanent.
The second force is High Bandwidth Flash. SanDisk's HBF integrates NAND flash with HBM-style packaging, offering up to 16x the capacity of HBM at comparable bandwidth and similar price points. First samples are targeted for H2 2026; first AI inference devices with HBF are expected in early 2027. SK Hynix, Samsung, and SanDisk are all collaborating on HBF standardization. This is not a fringe development: three of the four largest memory companies on earth are building toward a new memory format that, at a 16x capacity advantage, could make the $35 billion HBM market look like a transitional technology.
The third force is alternative inference architectures. The GPU+HBM stack is not the only architecture in the field. Three alternative approaches are in active development or production:

- Groq's LPU (now inside Nvidia following the December 2025 acquisition): deterministic streaming execution from 230MB of on-chip SRAM at 80 TB/s, with no external HBM.
- Cerebras wafer-scale engines: the WSE-3 carries 44GB of on-wafer SRAM at 220+ TB/s, again with no external memory packages.
- Processing-near-memory (PNM/PIM): Samsung's PIM-enabled LPDDR6 moves compute toward the memory itself, reducing dependence on stacked HBM bandwidth.
None of these alternatives currently threatens Nvidia's GPU dominance for training. But inference is a different workload — latency-sensitive, memory-bandwidth-bound, and increasingly the economic center of gravity as pre-training spend matures. If inference migrates toward LPU/wafer-scale/PNM architectures, the demand for stacked HBM packages shrinks with it.
Data center deals hit a record $61 billion in 2025. But the constraint is no longer capital — it is power. Seventy-two percent of data center operators identify power and grid capacity as their most severe operational challenge. The consequence is already materializing: €5.8 billion ($6.8 billion) in Irish data center projects are stranded — land acquired, planning permission secured, construction permits granted, but no grid connection available. These are not hypothetical future assets. They are real capital deployed into unmonetizable real estate.
U.S. data center electricity consumption is projected to reach 300 TWh by 2028, roughly doubling from 2024 levels. AI data centers are forecast to consume 9% of all U.S. electricity by 2030. Current load is approximately 41 GW, growing 15–20% annually. The U.S. electric grid, much of which was built 30–50 years ago, was not designed for this growth rate. New grid capacity requires 5–10 year permitting and construction timelines. The AI buildout cycle runs an order of magnitude faster.
| Metric | 2024 | 2026 (current) | 2028 (projected) | 2030 (projected) |
|---|---|---|---|---|
| US data center power load | ~25 GW | ~41 GW | ~55 GW | ~75 GW |
| US data center TWh/year | ~150 TWh | ~200 TWh | ~300 TWh | ~400+ TWh |
| As % of US electricity generation | ~3.7% | ~5% | ~7% | ~9% |
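A quick consistency check on the table, converting average load (GW) to annual energy (TWh): the conversion factor of 8,760 hours per year is exact, and the utilization inferred is simply the ratio of the two rows:

```python
# Relate the GW rows to the TWh rows: GW x 8,760 h/yr = TWh/yr at 100% load.
HOURS_PER_YEAR = 8_760

for year, gw, twh in [(2024, 25, 150), (2026, 41, 200), (2028, 55, 300)]:
    ceiling_twh = gw * HOURS_PER_YEAR / 1_000     # flat-out annual energy
    print(f"{year}: {gw} GW -> {ceiling_twh:.0f} TWh ceiling; "
          f"{twh} TWh implies ~{twh / ceiling_twh:.0%} average utilization")
```

The implied 55–70% average utilization is plausible for a mixed fleet, suggesting the table's rows are internally consistent.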
The power constraint creates a category of infrastructure risk that is entirely independent of AI capability progress: even if AI application adoption grows exactly as projected, and even if hardware efficiency stays constant, the physical power infrastructure to host the built-out data center capacity may simply not exist where it is needed. The new industry performance metric — "tokens per watt per dollar" — is a direct acknowledgment of this reality. The previous metric (raw FLOPS or raw GPU count) no longer captures the binding constraint.
Modern AI chips generate heat densities that standard air-cooled data centers cannot handle. Retrofitting existing air-cooled facilities for liquid cooling costs 7–10% more than building new liquid-cooled facilities from scratch. The legacy data center stock built during the 2015–2022 cloud buildout — worth hundreds of billions — faces either expensive retrofit or functional obsolescence as AI compute density requirements grow. This is a hidden write-down embedded in infrastructure balance sheets that has not yet been marked to market.
The Jevons Paradox, first articulated by economist William Stanley Jevons in 1865 regarding coal consumption, holds that when the efficiency of a resource's use improves, total consumption of that resource rises rather than falls — because lower cost per unit enables new applications that did not previously exist at higher prices. Applied to AI infrastructure: every time compute becomes cheaper, demand for AI services expands faster than the efficiency gain, and total hardware demand grows.
The empirical evidence for Jevons in AI is genuinely strong. In January 2025, DeepSeek V3 caused Nvidia to fall 17% in a single session. The market assumed cheaper AI meant less GPU demand. Within three months, AI inference demand had grown so rapidly that the net effect on GPU order books was positive. Inference costs dropped roughly 90% post-DeepSeek; total AI usage and GPU demand grew. H100 cloud instance prices declined 64–75% from Q4 2024 to Q1 2026 — yet Nvidia's order book reached $1 trillion at GTC 2026. This is Jevons operating precisely as the theory predicts.
Morgan Stanley applies the same argument to TurboQuant: if inference memory costs fall 6x, more AI services become economically viable, and total memory demand grows rather than falls. KKR argues AI infrastructure demand is durable and real. Microsoft's data center executive said he is "more worried we are underbuilding than overbuilding."
The Jevons argument is compelling — but it contains an assumption that is not examined carefully enough in the current AI infrastructure debate: Jevons requires that marginal demand can grow fast enough to absorb efficiency gains before the financial structures financing the infrastructure mature.
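That condition can be made precise with a stylized model: if a breakthrough cuts the hardware needed per unit of AI service by a factor E, total hardware demand only holds flat when service demand grows by at least E over the financing horizon T, i.e. the required growth rate g satisfies (1 + g)^T ≥ E. The horizons and gain factors below are assumptions chosen to match the figures in this report:

```python
# Required demand growth to absorb an efficiency gain E over horizon T years:
# (1 + g)^T >= E  =>  g >= E**(1/T) - 1. Stylized model, not a forecast.
def required_cagr(efficiency_gain: float, years: float) -> float:
    return efficiency_gain ** (1 / years) - 1

scenarios = [(6, 7), (6, 15), (36, 7)]   # 36x ~ two stacked 6x-class gains
for E, T in scenarios:
    print(f"E = {E:>2}x over {T:>2} yrs: demand must grow "
          f">= {required_cagr(E, T):.1%}/yr to keep hardware demand flat")
```

A single 6x gain absorbed over a 7-year debt maturity requires roughly 29% annual demand growth just to hold hardware demand constant; stacked gains push the hurdle toward 67%. Demand growing slower than the hurdle means shrinking hardware demand against fixed debt service.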
This is where the infrastructure bubble risk is specific and different from a simple overbuilding story. Consider the chain of logic:

1. Efficiency breakthroughs (TurboQuant, MoE, Mamba hybrids, LPUs) cut the hardware required per unit of AI service by large multiples.
2. Under Jevons, total hardware demand still grows, but only if demand for AI services expands faster than the efficiency curve deflates hardware requirements.
3. The infrastructure is financed with 7–15 year debt, while the hardware it buys approaches a 12-month economic life.
4. If demand absorption lags the efficiency curve for even two or three years inside that window, utilization falls, marginal capacity is stranded, and the debt behind it is impaired before it can be refinanced.
Sam Altman — not a habitually bearish voice on AI — said in August 2025: "Are we in a phase where investors as a whole are overexcited about AI? My opinion is yes." This statement was made by the CEO of the company that is the most AI-optimistic large institution on earth. Its significance has been underweighted by the market.
The entire infrastructure investment thesis rests on a scaling law assumption: that larger models trained on more data with more compute produce systematically better capabilities. This assumption was empirically validated from approximately 2018 through 2024 and drove Nvidia's extraordinary revenue growth. The question for 2025–2030 is whether the assumption continues to hold — and the evidence is genuinely mixed.
The current consensus is that three separate scaling dimensions are operating simultaneously, with different cost/benefit profiles:
| Scaling Axis | Status | Infrastructure Implication |
|---|---|---|
| Pre-training scale (larger models + more data) | Maturing; diminishing returns at frontier | Massive but potentially peaking training cluster demand |
| Post-training / RLHF / fine-tuning | Active; primary capability driver at frontier | Continuous but smaller-scale GPU demand; different hardware profile |
| Test-time compute (inference-time reasoning) | Emerging; not yet saturating | High inference compute demand — but with memory efficiency (TurboQuant) as the key constraint |
The significance: if pre-training scaling is maturing, the marginal return on the massive training cluster buildout embedded in hyperscaler capex is declining. The demand growth is migrating toward inference — which is exactly the workload being targeted by TurboQuant, Mamba, Groq LPU, and Cerebras. The infrastructure built for pre-training (massive HBM-dense GPU clusters optimized for matrix multiplication throughput) is the wrong profile for the workload that is growing fastest.
This report, read alongside its companion (The Anti-Bubble Thesis), maps a specific thesis: the application and infrastructure layers of the AI economy have different bubble risk profiles, different time horizons, and different burst mechanisms.
| Dimension | AI Application Layer | AI Infrastructure Layer (This Report) |
|---|---|---|
| Is it a bubble? | No — displacement cycle with real revenue | Partially — structural fragility is real and growing |
| Revenue validation | Anthropic $20B ARR, enterprise-contracted | Hyperscaler AI revenue growing but far below $690B capex run-rate |
| Burst mechanism | Would require sustained enterprise ROI failure | Debt maturity mismatch + architecture migration + power grid constraints |
| Burst timeline | Not imminent; 5+ year displacement cycle | 2027–2028 window; keyed to debt refinancing cycles and HBM capacity |
| Key signal to watch | Enterprise contract renewal rates; Claude/OpenAI enterprise churn | H100/H200 utilization rates; HBM spot prices post-2026 shortage; hyperscaler AI revenue vs. capex ratio |
| Who wins if the layer deflates | Anthropic, Microsoft, Salesforce (via integration) | Efficient architecture providers (Groq/Nvidia LPU, Cerebras); power generators; cooling |
| Who loses | Enterprise SaaS companies being displaced | HBM suppliers in overcapacity; data center REITs with stranded assets; debt investors in AI infra vehicles |
PRZC Research does not provide individual security recommendations in this format. The following represents sector-level framing for institutional consideration.
Morgan Stanley and TrendForce are likely correct that the initial TurboQuant selloff overshot. HBM supply is sold out through 2026. SK Hynix and Micron have the most favorable near-term supply/demand positioning in memory they have had in a decade. The 5–6% single-session drop in memory stocks on the TurboQuant announcement represents an overreaction to a paper that targets DRAM, not HBM, in a market where HBM is presold. Near-term: the dip is defensible to buy on fundamentals.
The consensus bullish case for memory through 2028 assumes the GPU+HBM architecture remains the dominant inference stack. This assumption faces three specific challenges that arrive in the 2026–2027 window: HBF samples reaching market (H2 2026), Groq 3 LPU scaling in Nvidia's own product line, and PNM/PIM-enabled LPDDR6 reaching mass production. If any two of these three materialize at commercial scale, the HBM market enters overcapacity before its TAM expansion thesis completes. The $35B → $100B HBM TAM forecast through 2028 is the right number if architecture stays constant. It is a generous number if architecture migrates.
The $690B 2026 capex commitment is real and locked in. But the sustainability of the 2027–2028 trajectory depends on AI cloud revenue growing into the capex commitment faster than debt maturities come due. At current AI cloud revenue growth rates, the gap is closing — but slowly. The ratio to watch is hyperscaler AI-attributed cloud revenue as a percentage of AI-directed capex. When this ratio crosses 30%, the model is self-sustaining. The current estimate for 2026 puts it at roughly 15–20%. The debt structures begin to stress in the 2027–2028 window if that ratio does not improve substantially.
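A simple trajectory sketch for that ratio, starting from the report's 15–20% estimate (midpoint 17.5%); the growth-rate pairs are illustrative scenarios, not forecasts:

```python
# AI-attributed cloud revenue as a share of AI-directed capex. The ratio
# compounds by (revenue growth) / (capex growth) each year. Scenario inputs
# are assumptions for illustration only.
def years_to_threshold(ratio: float, rev_g: float, capex_g: float,
                       threshold: float = 0.30, max_years: int = 20):
    years = 0
    while ratio < threshold and years < max_years:
        ratio *= (1 + rev_g) / (1 + capex_g)
        years += 1
    return years, ratio

for rev_g, capex_g in [(0.60, 0.30), (0.40, 0.30), (0.40, 0.10)]:
    yrs, final = years_to_threshold(0.175, rev_g, capex_g)
    print(f"revenue +{rev_g:.0%}/yr vs capex +{capex_g:.0%}/yr: "
          f"~{final:.0%} after {yrs} year(s)")
```

Under the favorable scenarios the ratio crosses 30% around 2029; under the middle scenario it takes most of a decade, well past the 2027–2028 debt stress window.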
The structural beneficiaries of a shift away from HBM-dense GPU stacks are the providers of efficient inference architectures: Nvidia (which already owns both sides of the bet via Groq acquisition), Cerebras, and the companies building PNM/HBF-native inference infrastructure. Power generation and grid infrastructure companies benefit regardless of which compute architecture wins — the power demand is inelastic to chip brand. Nuclear, grid-scale storage, and transmission infrastructure are the cleanest infrastructure plays in the AI economy.
Data center REITs and private credit vehicles with long-duration exposure to AI infrastructure carry the highest concentration of the structural mismatch risk. The assets are real; the revenue is contractually committed for 5–10 year leases; but the risk is mark-to-model on the terminal value of the physical plant. If the AI workloads at maturity run on architectures that require 1/6th the physical footprint of today's GPU clusters (combining TurboQuant-type efficiency gains with Mamba/SSM throughput improvements), the terminal utilization rate of 2026-vintage data centers is materially lower than underwriting models assume.
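A toy sensitivity makes the underwriting exposure concrete. Assume a facility sized for projected workload growth at today's hardware density, while realized workloads grow even faster but the efficiency trend deflates the footprint they need; every input below is an illustrative assumption:

```python
# Terminal utilization of a 2026-vintage facility under an efficiency deflator.
underwritten_growth = 0.25    # demand growth assumed at underwriting, per year
actual_growth = 0.35          # realized demand growth (Jevons upside), per year
efficiency_deflator = 6.0     # footprint reduction per unit of workload
years = 8                     # assumed lease / debt horizon

capacity = (1 + underwritten_growth) ** years                  # footprint built
needed = (1 + actual_growth) ** years / efficiency_deflator    # footprint used
print(f"terminal utilization ~{needed / capacity:.0%}")        # ~31%
```

Even with demand beating the underwriting assumption by ten points a year, a 6x footprint deflator leaves the facility roughly two-thirds empty at lease maturity. The Jevons upside has to outrun the efficiency curve, not merely exist.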
The AI bubble debate has been conducted as if "AI" is a single thing with a single risk profile. It is not. The application layer — where Anthropic, OpenAI, and the enterprise software transformation are happening — has the characteristics of a genuine platform shift: real revenue, real enterprise contracts, real productivity gains in measured use cases, real market displacement. The bubble narrative fails at the application layer.
The infrastructure layer operates under different physics. $690 billion per year in capex, financed by $182 billion in new annual debt, deployed into hardware with a 12-month economic life, built on power grids that cannot sustain the trajectory, and optimized for a GPU+HBM architecture that three separate technological vectors are actively migrating away from — this is not the same story. This is a structural fragility story with a 2027–2028 inflection point.
Google TurboQuant, by itself, changes nothing. As a data point in a trend, it changes the probability distribution of the infrastructure fragility thesis. The market's initial reaction to the paper — sharp drops in memory stocks followed by "overreaction" calls from sell-side analysts — captures both the signal and the noise simultaneously. The signal is real. The noise was in the timing and the specific target.
The correct investor posture is not to choose between "AI is a bubble" and "AI is transformative." Both can be true in different parts of the stack at the same time. Anthropic replacing the NASDAQ 100 is a story about the application layer. The AI infrastructure reckoning is a story about what happens when $690 billion in annual physical capital formation meets a technology efficiency curve that is outrunning the financial structures built to monetize it.
"Every major technology platform shift in history has featured a period in which the infrastructure buildout runs ahead of monetization — followed by a reckoning that destroys infrastructure capital while preserving application capital. The railroad land grants funded stranded iron in the 1870s. The fiber optic overbuild of the late 1990s gave us Google. The AI infrastructure overbuild of 2025–2026 will give us something. It will not give everyone their money back."
— PRZC Research, March 2026
| Metric | Value | Date / Source |
|---|---|---|
| Google TurboQuant KV cache compression | 6x (3-bit, zero accuracy loss) | Google Research / ICLR 2026, March 2026 |
| TurboQuant H100 attention logit speedup | Up to 8x at 4-bit | Google Research, March 2026 |
| Memory stock reaction to TurboQuant | SK Hynix -6%, Samsung -5%, Micron -5%, Kioxia -6% | CNBC, March 26, 2026 |
| Hyperscaler combined capex 2026 | $660–690 billion (75% AI-directed) | IEEE ComSoc / Futurum, 2026 |
| Hyperscaler debt issuance 2025 | $108–182 billion | Multiple, 2025 |
| Nvidia acquisition of Groq | $20 billion | December 24, 2025 |
| Groq 3 LPU on-chip SRAM bandwidth | 80 TB/s; 230MB on-chip; no external HBM | Nvidia GTC 2026, March 16, 2026 |
| Cerebras WSE-3 on-wafer SRAM | 44GB at 220+ TB/s; no external memory | Cerebras, 2025 |
| SanDisk HBF capacity vs. HBM | 8–16x higher at comparable bandwidth | SanDisk, 2025 |
| SanDisk HBF first AI inference devices | Early 2027 | SanDisk roadmap, 2025 |
| SK Hynix HBM market share (Q2 2025) | 62%; ~70% of HBM4 (Rubin) per UBS | Counterpoint Research / UBS, 2025 |
| HBM TAM 2025 → 2028 forecast | $35B → $100B at ~40% CAGR | Micron, December 2025 |
| Micron Q2 FY2026 revenue | $23.86B (+57% YoY) | Micron earnings, March 2026 |
| Irish stranded data center assets | €5.8 billion | Enlit World, 2025 |
| US AI data centers as % of US power by 2030 | 9% | Tech-Insider, 2026 |
| US data center power consumption projection | 300 TWh by 2028; current 41 GW load | HPCwire / BigDATAwire, 2025 |
| DeepSeek V3 training cost | ~$5.6M on 2,048 GPUs | Multiple, January 2025 |
| AI reasoning cost decline post-DeepSeek | ~90% | Multiple, Q1 2025 |
| H100 cloud instance price decline | 64–75% (Q4 2024 → Q1 2026) | byteiota, 2026 |
| Nvidia GTC 2026 order book forecast | $1 trillion cumulative through 2027 | Nvidia GTC, March 2026 |
| IBM Granite 4 (Mamba-2 Hybrid) inference speedup | Up to 8x vs. equivalent transformer | InfoQ / IBM, November 2025 |
| Nvidia Nemotron 3 Super active params per pass | 12.7B of 120.6B total; 7.5x throughput gain | Nvidia, 2026 |
| Sam Altman on AI overexcitement | "My opinion is yes" | August 2025 |
Disclaimer: This report is produced by PRZC Research for informational and analytical purposes only. It does not constitute investment advice, a solicitation, or a recommendation to buy or sell any security. All figures cited are attributed to third-party sources and have not been independently verified by PRZC Research. Past market reactions are not predictive of future outcomes. Readers should conduct their own due diligence before making any investment decision.