Subquadratic models

Subquadratic models have become a common shorthand for ultra-long-context AI work.

If you build apps that need very long documents, big chat logs, or whole codebases in memory, subquadratic models could change how you design them.

This article explains what subquadratic models are, why they matter, how they work, and how you can try them with tools you already use.

What subquadratic models mean in simple terms

Subquadratic models are models that process long inputs more cheaply than classic transformers.

Standard transformer attention has a cost that grows roughly with the square of the input length.

That means that when you double the context, the attention cost roughly quadruples.

Subquadratic models change that math so the cost grows slower than the square.

If a transformer costs N squared for N tokens, a subquadratic model might cost N log N or N times a small factor.
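To get a feel for the gap, here is a small sketch that compares the two growth rates. The functions count abstract "units of work", not real FLOPs or dollars, and N log N is just one example of a subquadratic rate, not a claim about any specific model.

```python
import math


def quadratic_cost(n: int) -> float:
    """Pairwise cost: every token is compared with every other token."""
    return float(n) * n


def n_log_n_cost(n: int) -> float:
    """One example of a subquadratic growth rate."""
    return n * math.log2(n)


def growth_when_doubling(cost_fn, n: int) -> float:
    """How many times more work is needed when the input doubles from n to 2n."""
    return cost_fn(2 * n) / cost_fn(n)


if __name__ == "__main__":
    for n in (1_000, 100_000, 12_000_000):
        gap = quadratic_cost(n) / n_log_n_cost(n)
        print(f"n={n:>12,}  n^2={quadratic_cost(n):.2e}  "
              f"n*log2(n)={n_log_n_cost(n):.2e}  gap={gap:,.0f}x")
```

Doubling the input quadruples the quadratic cost but only slightly more than doubles the N log N cost, and the gap between the two keeps widening as N grows.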

That matters when your app needs millions of tokens, such as the 12-million-token window that the SubQ preview claims.

Why we care about long context

People want models to read whole books, long legal files, long conversations, or big code repositories.

Here are real examples where long context helps:

Summarizing hours of meeting transcripts without losing the thread.
Doing code review across an entire project instead of single files.
Running a RAG system where you keep a long timeline of past messages to avoid repeating information.
Allowing agents to plan with a full mission history instead of just a short window.

When the model can see everything that matters at once, results get clearer and fewer retrieval tricks are needed.

How transformers handle long context and the problem

Standard transformer attention compares every token to every other token.

That full attention is powerful but expensive.

With 1 million tokens, the full attention matrix has about a trillion entries, which makes it slow to compute and expensive to store.
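A quick back-of-the-envelope calculation makes the size concrete. This sketch assumes 2 bytes per attention score (16-bit floats) and counts a single attention head in a single layer; both are illustrative assumptions, not measurements of a real model.

```python
def attention_matrix_bytes(num_tokens: int, bytes_per_score: int = 2) -> int:
    """Bytes needed to hold one full num_tokens x num_tokens score matrix."""
    return num_tokens * num_tokens * bytes_per_score


def human_readable(num_bytes: float) -> str:
    """Format a byte count using decimal units (1 KB = 1000 B)."""
    for unit in ("B", "KB", "MB", "GB", "TB", "PB"):
        if num_bytes < 1000 or unit == "PB":
            return f"{num_bytes:.1f} {unit}"
        num_bytes /= 1000


if __name__ == "__main__":
    for n in (8_000, 128_000, 1_000_000):
        print(f"{n:>9,} tokens -> {human_readable(attention_matrix_bytes(n))}")
```

Optimized attention kernels avoid storing the whole matrix at once, but they still have to compute the same number of scores, so the quadratic compute cost remains.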

Engineers found many ways to approximate or reduce that cost.

Some methods break the input into chunks, some use sparse attention, some compress tokens.

But each approach has tradeoffs in performance, cost, or complexity.
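Chunking is the easiest of these to show. Here is a minimal sketch that splits a token list into fixed-size chunks with some overlap, so that text near a boundary appears in two chunks. The function name and parameters are made up for this example and do not come from any particular library.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence, chunk_size: int, overlap: int = 0) -> List[list]:
    """Split tokens into chunks of at most chunk_size, sharing `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        # Stop once a chunk has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


if __name__ == "__main__":
    for chunk in chunk_tokens(list(range(10)), chunk_size=4, overlap=1):
        print(chunk)
```

The tradeoff is visible even in this toy version: each chunk is cheap to process, but nothing in one chunk can see a token that only appears in another.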

How subquadratic models work at a high level

Subquadratic models use smarter math or structure to lower the cost growth.

There are a few common patterns:

Sparse attention where tokens only see nearby or selected tokens.
Kernel methods that turn attention into cheaper operations.
Memory or retrieval layers that keep long-term info outside the main attention.
New architectures that replace full pairwise attention with faster approximations.
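The first pattern, sparse attention with a local window, can be sketched in plain Python. This is a toy illustration of the general idea, assuming a causal window where each token only looks at itself and the few tokens before it. It is not how SubQ or any specific model is implemented.

```python
import math
from typing import List

Vector = List[float]


def sliding_window_attention(queries: List[Vector], keys: List[Vector],
                             values: List[Vector], window: int) -> List[Vector]:
    """Causal local attention: token i attends to tokens max(0, i-window+1)..i."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for i, q in enumerate(queries):
        start = max(0, i - window + 1)
        positions = range(start, i + 1)
        scores = [scale * sum(a * b for a, b in zip(q, keys[j])) for j in positions]
        # Softmax over the window only (subtract the max for numerical stability).
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out = [0.0] * len(values[0])
        for w, j in zip(weights, positions):
            for d, v in enumerate(values[j]):
                out[d] += (w / total) * v
        outputs.append(out)
    return outputs


def scored_pairs(num_tokens: int, window: int) -> int:
    """Number of query-key scores computed for a given window size."""
    return sum(min(i + 1, window) for i in range(num_tokens))


if __name__ == "__main__":
    print("full causal attention:", scored_pairs(1000, 1000), "scores")
    print("window of 10:         ", scored_pairs(1000, 10), "scores")
```

With a fixed window the number of scores grows linearly with the input length, which is the whole point. The price is that a token cannot directly see anything outside its window.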

The SubQ preview claims a non-transformer approach that handles 12 million tokens at low cost.

If the claim holds up, it would be a practical example of subquadratic scaling.

Real world claims and sources

Subquadratic AI claims that its SubQ 1M-Preview model can handle 12M-token contexts at 20 percent of the cost of standard transformers.

Product and model updates from big labs also push in this direction: OpenAI updated ChatGPT with better pacing and UX for longer tasks, and Anthropic released Opus 4.8 with Dynamic Workflows for dividing tasks across sub-agents.

Tools around long context are also improving, like TruLens for batch evaluation and OpenCrabs adding recursive self-improvement for agents.

Links: Anthropic announced Opus 4.8 on their site, and n8n community posts show instance-level MCP for workflow agents.

If you build systems for long documents, watch these sources and developer repos.

Anthropic Opus 4.8: https://www.anthropic.com/news/introducing-claude-opus-4-8
n8n community: https://n8n.io/workflows/
OpenCrabs repo: https://github.com/adolfousier/opencrabs

What subquadratic models enable that we could not do before

Subquadratic models open several possibilities:

True single-pass analysis of huge docs or corpora.
Agents that plan using the full mission history.
Cheaper long-form summarization at scale.
RAG systems that use far larger context windows with fewer vector lookups.

This can simplify application design because you may not need complex retrieval layers or repeated chunking.

That saves engineering time and reduces failure modes like missing the right chunk.

How to evaluate a subquadratic model for your project

If you plan to test a subquadratic model, use a checklist:

Cost per token at target length.
Latency for the full context you need.
Accuracy on your tasks when working with long inputs.
Failure modes like hallucinations or token truncation.
Tooling and API limits from the provider.

Do small end-to-end tests with real inputs.

A model that looks fast on toy tests may struggle with real messy documents.
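A small harness is enough to cover the first three checklist items. The sketch below assumes a hypothetical `call_model` function that you write yourself to wrap whichever provider you test. The whitespace token count and the price argument are rough placeholders, so swap in the provider's real tokenizer and pricing.

```python
import time
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass
class EvalResult:
    accuracy: float            # share of cases whose output contains the expected text
    mean_latency_s: float      # average wall-clock seconds per call
    est_input_cost_usd: float  # rough input-token cost for the whole run


def rough_token_count(text: str) -> int:
    # Assumption: whitespace splitting as a crude stand-in for a real tokenizer.
    return len(text.split())


def evaluate(call_model: Callable[[str], str],
             cases: Sequence[Tuple[str, str]],
             usd_per_million_input_tokens: float) -> EvalResult:
    """Run each (prompt, expected_text) case and collect accuracy, latency, cost."""
    if not cases:
        raise ValueError("need at least one test case")

    correct = 0
    total_latency = 0.0
    total_tokens = 0
    for prompt, expected in cases:
        started = time.perf_counter()
        output = call_model(prompt)
        total_latency += time.perf_counter() - started
        total_tokens += rough_token_count(prompt)
        # Simple substring match; replace with a task-specific check.
        if expected.lower() in output.lower():
            correct += 1

    return EvalResult(
        accuracy=correct / len(cases),
        mean_latency_s=total_latency / len(cases),
        est_input_cost_usd=total_tokens * usd_per_million_input_tokens / 1_000_000,
    )
```

Run the same cases against each candidate at the context length you actually need, since numbers measured on short prompts tell you little about long ones.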

Comparing approaches for long context

Here are the main paths teams take today, with pros and cons:

Chunking plus RAG
- Pros: Works now with many models and vector stores.
- Cons: Retrieval mistakes, extra cost, complex design.
Sparse or compressed attention transformers
- Pros: Keep transformer strengths.
- Cons: Complexity in training and sometimes lower quality on local tasks.
Subquadratic non-transformer models
- Pros: Cheaper long context and higher token limits.
- Cons: New tech, may not match transformer quality on some tasks.
Hybrid systems with external memory or databases
- Pros: Control and auditability.
- Cons: Engineering overhead and synchronization issues.

Choose based on whether you value immediate quality, cost, or architecture novelty.

Practical design patterns for developers

Here are some practical ideas to use subquadratic models safely.

Keep a short context plus a long index.
- Put the most recent information in the model context.
- Use the long context only when necessary.
Use progressive summarization.
- Keep rolling summaries that compress older parts of the conversation.
- Summaries make long context smaller while keeping meaning.
Mix local reasoning and retrieval.
- Let the model do the heavy lifting on the parts it sees directly.
- Pull additional facts from a vector store as a backup.
Monitor for hallucination and check outputs with tools like TruLens.
- TruLens 2.8 adds parallel batch evals and schema checks to help validate outputs.
- See TruLens: https://trulens.org
Use models with open embedding APIs to avoid heavy local downloads.
- OpenCrabs now supports OpenAI-style embedding APIs so you can choose a provider without a large GGUF download.
- Check OpenCrabs: https://github.com/adolfousier/opencrabs
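The progressive summarization pattern from the list above can be sketched in a few lines. The `summarize` argument is a hypothetical function you supply, typically a model call that folds one old message into the running summary. The class name and the `max_recent` default are made up for this example.

```python
from typing import Callable, List


class RollingContext:
    """Keep the newest messages verbatim and fold older ones into a summary."""

    def __init__(self, summarize: Callable[[str, str], str], max_recent: int = 4):
        if max_recent < 1:
            raise ValueError("max_recent must be at least 1")
        self.summarize = summarize      # (current_summary, old_message) -> new_summary
        self.max_recent = max_recent
        self.summary = ""
        self.recent: List[str] = []

    def add(self, message: str) -> None:
        self.recent.append(message)
        # Compress the oldest messages once the verbatim window is full.
        while len(self.recent) > self.max_recent:
            oldest = self.recent.pop(0)
            self.summary = self.summarize(self.summary, oldest)

    def build_prompt(self) -> str:
        parts = []
        if self.summary:
            parts.append("Summary of earlier conversation:\n" + self.summary)
        parts.extend(self.recent)
        return "\n\n".join(parts)
```

A design note: every summarization step loses some detail, so it is worth keeping the raw messages in storage in case you need to go back to them.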

Cost and infrastructure considerations

Even if subquadratic models reduce compute growth, you should still plan infrastructure for:

Memory use on the serving side.
Disk bandwidth for loading large contexts.
Network latency when sending huge inputs.
Token billing or inference costs with providers.

Subquadratic models can lower the bill, but they do not remove the need to measure and monitor.

If you deploy on your own servers, test how memory and concurrency interact.

If you use a hosted API, check rate limits and how they handle giant inputs.

Use cases and examples

Here are realistic ways teams will use subquadratic models.

Legal teams reading full contracts and cross-referencing earlier clauses.
Engineering agents that inspect an entire repo and propose project-level refactors.
Research assistants that read many papers and synthesize trends.
Customer support that ingests months of chat logs to find the real issue.
Video systems that index full transcripts from long recordings for search.

Large context means fewer tricks and more straightforward prompts.

Tooling that helps with long context systems

You will want systems to manage and test long inputs:

Vector databases for hybrid systems.
Validation tools like TruLens for programmatic checks.
Agent frameworks like n8n that can orchestrate many steps when processing long content.
Research engines like NeuraRTS to find sources and citations.
Routers like Neura Router if you want to connect many models with one API.

Internal links you might find useful:

NeuraRTS research engine: https://rts.meetneura.ai/
Neura Router model aggregator: https://router.meetneura.ai
Neura Artifacto multi-tool chat: https://artifacto.meetneura.ai
Neura Open-Source AI Chatbot: https://opensource-ai-chatbot.meetneura.ai

These tools help you run experiments, fetch sources, and handle model selection.

Safety and evaluation when using long inputs

Long inputs raise the stakes of model mistakes.

A wrong fact pulled from a 100-page document can mislead a whole response.

Use these checks:

Grounding checks: force the model to cite sections or lines.
Schema validation: require outputs to match a programmatic format.
Red team the pipeline: probe with adversarial inputs to see where the model fails.
Batch evals: use parallel tests to catch regressions fast. Tools like TruLens are built for that.
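The first two checks are simple to prototype without any framework. This sketch assumes you ask the model to reply in JSON with an `answer` string and a `citations` list of verbatim quotes. That output format is an assumption made for this example, not something any tool requires.

```python
import json
from typing import List

# Assumed output format for this example: {"answer": str, "citations": [str, ...]}
REQUIRED_FIELDS = {"answer": str, "citations": list}


def validate_schema(raw_output: str) -> dict:
    """Parse the model output as JSON and check required fields and types."""
    data = json.loads(raw_output)  # raises ValueError on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("output must be a JSON object")
    for name, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), expected_type):
            raise ValueError(f"field {name!r} is missing or not a {expected_type.__name__}")
    return data


def _normalize(text: str) -> str:
    # Collapse whitespace and ignore case so line wrapping does not break matches.
    return " ".join(text.split()).lower()


def ungrounded_citations(data: dict, source_text: str) -> List[str]:
    """Return the cited quotes that do not appear verbatim in the source."""
    source = _normalize(source_text)
    return [q for q in data["citations"] if _normalize(str(q)) not in source]


if __name__ == "__main__":
    source = "The lease ends on 1 May.\nRent is due monthly."
    raw = json.dumps({"answer": "Rent is monthly.",
                      "citations": ["Rent is due monthly", "Rent is due weekly"]})
    print(ungrounded_citations(validate_schema(raw), source))
```

An exact quote match is strict on purpose. It will flag paraphrased citations too, which is usually what you want when a human reviewer needs to find the passage quickly.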

Always have a human review step where outputs lead to important decisions.

How to start experimenting today

If you want to try subquadratic models or long context workflows:

Pick a small real dataset you care about.
Test a transformer with chunking and a subquadratic model if you can get access.
Measure cost, latency, and accuracy.
Add validation checks with TruLens or similar tools.
Iterate on prompt and system design.

If you use agent frameworks like n8n, you can automate experiments and orchestration. See n8n community notes on instance-level MCP for deploying full workflows from a prompt: https://n8n.io/workflows/

What the near future looks like

Expect more models and more open code for long context.

Some providers will offer specialized endpoints for long inputs.

Tooling will follow with better validators, batching, and memory management.

Open-source projects will add FTS-only modes or embedding API options so you can choose what fits your infrastructure. OpenCrabs' updated memory modes and embedding options are good examples of this trend.