
Subquadratic Launches SubQ With 12 Million Token Context Window and $29M in Funding
The Miami startup says its sparse-attention architecture reduces compute by nearly 1,000x at long context lengths, outperforming GPT-5.5 and Claude Opus on retrieval benchmarks.
This article was produced by the AETW editorial team.
Subquadratic launched SubQ, an LLM built on a new sparse-attention architecture that handles a 12 million token context window at a fraction of the cost of leading frontier models. The Miami startup raised $29 million to take on the quadratic scaling problem that has constrained transformer-based AI since 2017.
The problem every major lab has worked around
Every transformer-based LLM built since 2017 shares the same core constraint: attention cost scales quadratically with context length. Double the input, and the model quadruples its compute. That single law has quietly shaped every design decision in frontier AI - why RAG exists, why agentic systems break tasks into chunks, why context windows stalled at 1 million tokens for most models even as labs marketed them as long-context.
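To make that concrete, here is a back-of-the-envelope sketch of how pairwise attention work grows with context length (illustrative counts only, ignoring the constant factors for heads, layers, and head width):

```python
# Dense attention scores every token against every other token, so the
# work grows with the square of the context length n. Illustrative
# counts only -- real FLOP totals carry large constant factors.
for n in [1_000, 2_000, 4_000, 1_000_000]:
    print(f"{n:>9,} tokens -> {n * n:>19,} pairwise scores")

#     1,000 tokens ->           1,000,000 pairwise scores
#     2,000 tokens ->           4,000,000 pairwise scores  (2x tokens, 4x work)
#     4,000 tokens ->          16,000,000 pairwise scores
# 1,000,000 tokens ->   1,000,000,000,000 pairwise scores
```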
The workarounds work, but they have real costs. Retrieval-augmented generation adds latency. Agentic retrieval adds complexity. Manual prompt curation wastes time and introduces bias into what the model actually sees. The underlying problem was never solved; it was managed.
Subquadratic, a Miami-based AI research startup, launched on May 5 with a bet that the underlying architecture can be fixed directly. Its first model, SubQ, is built on what the company calls Subquadratic Selective Attention (SSA) - a sparse attention mechanism whose compute and memory costs scale linearly with context length rather than quadratically.
What the model actually does
SubQ offers a 12 million token context window - roughly 9 million words, or about 120 books loaded into a single prompt. At that length, the company says its architecture reduces attention compute by close to 1,000 times compared to standard dense-attention transformer models. At 1 million tokens, SubQ reportedly runs 52 times faster than dense-attention approaches.
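Those two figures are consistent with the linear-versus-quadratic framing: if dense attention costs on the order of n² comparisons and a linear mechanism costs c·n for some fixed per-token budget c, the speedup is roughly n/c and grows with context length. The budget in this sketch is a made-up placeholder, not a published SSA parameter:

```python
# Speedup of a linear-cost mechanism (~c*n) over dense attention (~n^2)
# is roughly n/c, so it compounds as the context grows. The budget c is
# a hypothetical illustration, not a published SSA parameter.
c = 20_000

for n in [1_000_000, 12_000_000]:
    print(f"n = {n:>10,}: speedup ~ {(n * n) // (c * n):,}x")

# n =  1,000,000: speedup ~ 50x    (company reports 52x)
# n = 12,000,000: speedup ~ 600x   (company reports ~1,000x; real constants differ)
```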
The core idea behind SSA is content-dependent selection. Rather than comparing every token to every other token - the approach that makes standard attention quadratic - the model selects which token relationships actually matter for a given input, and processes only those. The selection mechanism itself does not reintroduce quadratic scaling at the indexing stage, unlike some earlier sparse attention approaches.
CTO Alexander Whedon offered an example: in one prompt, words one and six might matter to each other; in another, words two and three. The model learns to make that determination dynamically, per input, rather than relying on fixed patterns or sliding windows.
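A minimal NumPy sketch of the general idea - each query keeps only its top-scoring keys - is below. It is a toy illustration of content-dependent sparsity, not Subquadratic's actual SSA: this version still builds the full n-by-n score matrix before discarding most of it, which is precisely the quadratic indexing cost SSA claims to avoid.

```python
import numpy as np

def topk_sparse_attention(q, k, v, keep=4):
    """Toy content-dependent sparse attention: each query attends only
    to its top-`keep` scoring keys. Illustration only -- this version
    still builds the full n x n score matrix, so it is NOT subquadratic."""
    scores = q @ k.T / np.sqrt(q.shape[-1])               # (n, n) similarities
    cutoff = np.sort(scores, axis=-1)[:, -keep][:, None]  # keep-th largest per row
    masked = np.where(scores >= cutoff, scores, -np.inf)  # drop all other pairs
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over kept pairs
    return weights @ v

rng = np.random.default_rng(0)
n, d = 8, 16
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
print(topk_sparse_attention(q, k, v).shape)  # (8, 16)
```

The hard research problem is choosing those pairs without scoring all of them first; that is the step where, per the company, SSA differs from earlier attempts.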
The benchmark picture
On RULER at 128K tokens, SubQ reports 97.1% accuracy against Claude Opus 4.6's 94.8%. On SWE-Bench Verified, the model scores 82.4% - ahead of Opus 4.6 at 81.4% and Gemini 3.1 Pro at 80.6%. On MRCR v2, the multi-round coreference retrieval benchmark at 1 million tokens, SubQ scores 83.0% to GPT-5.5's 74.0%, a nine-point margin.
At 12 million tokens - a context length no frontier model currently reaches - SubQ claims 92.1% accuracy on needle-in-a-haystack retrieval. The cost comparison is striking: SubQ ran the RULER 128K benchmark for around $8, versus roughly $2,600 for Claude Opus.
There are caveats. Each model was run once due to high inference cost, meaning variance is unquantified. The SWE-Bench margin is narrow enough that benchmark harness differences could account for it, which the company acknowledges. And SubQ is considerably smaller than the models it is being compared against.
A field with a complicated track record
Subquadratic is not the first to claim a fix for quadratic attention scaling. Longformer used sliding-window attention to achieve linear scaling but broke when the relevant context was not nearby. State-space models like Mamba achieved efficiency through lossy compression of prior state. Hybrid architectures kept a few dense attention layers to preserve quality, which meant their cost savings were capped rather than compounding.
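The contrast with content-dependent selection is easy to see in the sliding-window case: the pattern is fixed in advance, so anything outside a token's neighborhood is invisible regardless of relevance. A simplified sketch (real Longformer also adds a handful of global-attention tokens):

```python
import numpy as np

def sliding_window_mask(n, window):
    """Boolean attention mask where token i sees only tokens within
    `window` positions of itself -- a fixed pattern that ignores content."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window

mask = sliding_window_mask(n=10, window=2)
print(mask[9, 0])  # False: a fact nine positions back is simply unreachable
```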
The most recent serious attempt was DeepSeek's Native Sparse Attention, which won the ACL 2025 best paper award. Its follow-up, DeepSeek Sparse Attention (DSA), ships in DeepSeek V3.2 - but the selection step still requires scoring every query against every key, making the indexer itself quadratic.
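The asymptotics explain why that matters even when each indexer comparison is cheap: an n² selection step eventually dominates a linear attention step. The per-token budget below is an assumed figure for illustration:

```python
# Even if attention over the selected pairs is linear (~c*n), a selector
# that scores every query against every key costs n^2 and eventually
# dominates. The budget c is an assumed illustrative figure.
c = 2_048

for n in [100_000, 1_000_000, 10_000_000]:
    print(f"n = {n:>10,}: indexer is {(n * n) // (c * n):,}x the attention cost")

# n =    100,000: indexer is 48x the attention cost
# n =  1,000,000: indexer is 488x the attention cost
# n = 10,000,000: indexer is 4,882x the attention cost
```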
The category's cautionary tale is Magic.dev, which raised over $500 million after announcing a 100 million token context window model in 2024. As of early 2026, there is no public evidence of that model being used outside the company. Subquadratic's technical claims are real enough to attract serious investors, but the gap between launch benchmarks and production deployment has tripped up this space before.
Products, funding, and what comes next
Subquadratic is launching two products in early access: an API exposing the full 12 million token context window, and SubQ Code, a CLI-based coding agent that loads entire repositories into a single context for planning and review. Both run on neoclouds rather than major hyperscalers. A search product is also planned, initially free, as a land-and-expand play across research and enterprise use cases.
The model will not be open-sourced in the near term. The company plans to offer post-training tools so enterprise customers can fine-tune on their own data. A 50 million token context window is targeted for Q4 2026.
The $29 million seed round includes Javier Villamizar, former partner at SoftBank Vision Fund, and Justin Mateen, co-founder of Tinder and founder of JAM Fund, alongside early backers of Anthropic, OpenAI, Stripe, and Brex. The team has 11 PhD researchers from Meta, Google, Oxford, Cambridge, and BYU. The company was previously called Aldea and built speech models before pivoting.
Why this matters for builders
If the architecture holds at scale and in production, the practical implications are significant. Developers building on long-context models today spend real engineering time on retrieval pipelines, chunking strategies, and prompt curation - work that exists purely to compensate for context and cost constraints. A model that holds an entire codebase or full document corpus in a single prompt without RAG middleware simplifies the stack considerably.
The cost angle matters too. At one-fifth the price of other leading LLMs by the company's figures, and with per-token economics that improve at longer contexts rather than worsen, SubQ targets workloads that are currently economically unviable - persistent agent state, full-repository code review, long document analysis at scale.
Whether the benchmarks translate to real-world reliability across diverse tasks is the question early access users will answer. The architecture is genuinely novel. What the field has learned is that novel architectures at launch and proven architectures in production are two different things.
Brian Weerasinhe is the founder and editor of AI Eating The World, where he covers artificial intelligence, tech companies, layoffs, startups, and the future of work. His reporting focuses on how AI is transforming businesses, products, and the global workforce. He writes about major developments across the AI industry, from enterprise adoption and funding trends to the real-world impact of automation and emerging technologies.


