Training-to-test scaling explained: How to optimize your end-to-end AI compute budget for inference


Standard guidelines for building large language models (LLMs) optimize only training costs and ignore inference costs. This poses a challenge for real-world applications that use inference-time scaling techniques, such as drawing multiple reasoning samples from a deployed model, to increase the accuracy of responses.

To bridge this gap, researchers from the University of Wisconsin-Madison and Stanford University have presented Train-to-Test (T2) scaling laws, a framework that jointly optimizes a model’s parameter count, its training data volume, and the number of inference samples it draws at test time.

In practice, their approach shows that it is compute-optimal to train substantially smaller models on much more data than traditional rules prescribe, and then spend the saved compute on generating multiple repeated samples at inference time.

For enterprise AI application developers who train their own models, this research provides a practical playbook for maximizing return on investment. It shows that AI reasoning doesn’t necessarily require spending huge amounts on frontier models. Instead, smaller models can achieve stronger performance on complex tasks while keeping per-query inference costs manageable within real-world deployment budgets.

Conflicting scaling laws

Scaling laws are an important part of developing large language models. Pretraining scaling laws dictate how best to allocate compute during model building, while test-time scaling laws guide how to allocate compute during deployment, for example by letting the model “think longer” or generating multiple reasoning samples to solve complex problems.

The problem is that these scaling laws have developed completely independently of each other despite being fundamentally intertwined.

A model’s parameter count and training duration directly dictate both the quality and the per-query cost of its inference samples. Currently, the industry gold standard for pretraining is the Chinchilla rule, which suggests an optimal ratio of approximately 20 training tokens for each model parameter.

However, creators of modern AI model families such as Llama, Gemma, and Qwen regularly break this rule by intentionally overtraining their smaller models on massive amounts of data.

As Nicholas Roberts, co-author of the paper, told VentureBeat, the traditional approach fails when building complex agent workflows: “In my view, the inference stack breaks when each individual inference call is expensive. This is the case when the models are large and you need to do a lot of repeated sampling.” Instead of relying on massive models, developers can use compact, overtrained models to run this repeated sampling at a fraction of the cost.

But because pretraining and test-time scaling laws have been studied in isolation, there is no rigorous framework for calculating how much a model should be overtrained based on how many reasoning samples it will need to generate during deployment.

Consequently, there was previously no formula that jointly optimized model size, training data volume, and test-time inference budget.

The reason this framework is difficult to formulate is that pretraining and test-time scaling speak two different mathematical languages. During pretraining, a model’s performance is measured by “loss,” a smooth, continuous metric that tracks prediction errors as the model learns.

At test time, developers use real-world downstream metrics to evaluate a model’s reasoning capabilities, such as pass@k, which measures the probability that a model produces at least one correct answer in k independent, repeated attempts.
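For intuition, pass@k is commonly computed from a larger pool of n sampled completions (of which c are correct) using the standard unbiased estimator popularized by the code-generation literature. This is an illustrative sketch, not code from the paper:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k: the probability that at least one of k
    attempts (drawn without replacement from n samples, c correct) passes."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 2 correct answers out of 10 sampled attempts
print(pass_at_k(10, 2, 1))  # 0.2
print(pass_at_k(10, 2, 5))  # much higher: more samples, more chances
```

The subtraction form avoids the numerical instability of multiplying many small probabilities when n is large.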

Train-to-test scaling laws

To resolve the disconnect between training and deployment, the researchers introduce Train-to-Test (T2) scaling laws. At a high level, this framework predicts a model’s reasoning performance by treating three variables as a single equation: the size of the model (N), the volume of training tokens it learns from (D), and the number of reasoning samples it generates during inference (k).


“Train-to-test” combines pretraining and test-time scaling laws into a unified framework (source: arXiv)

T2 combines pretraining and inference budgets into a single optimization formula that accounts for both the cost of training the model (6ND) and the cost of querying it repeatedly at inference time (2Nk). The researchers tested two modeling approaches: modeling pretraining loss, or modeling test-time performance (pass@k), as functions of N, D, and k.
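Using the approximate FLOP accounting cited above (6ND for training, 2Nk for repeated sampling per query), the end-to-end trade-off can be sketched as follows. The function name and the specific numbers are illustrative, not from the paper:

```python
def end_to_end_flops(N: float, D: float, k: int, queries: int = 1) -> float:
    """Total compute: one-time training cost (6*N*D) plus repeated-sampling
    inference cost (2*N*k per query), per the cost model described above."""
    return 6 * N * D + 2 * N * k * queries

# Chinchilla-style model: 1B params at 20 tokens per parameter
chinchilla = end_to_end_flops(N=1e9, D=20e9, k=8, queries=1_000_000)
# Smaller, overtrained model: 4x fewer params, twice the training tokens/param
overtrained = end_to_end_flops(N=0.25e9, D=40e9, k=8, queries=1_000_000)
print(overtrained < chinchilla)  # True: same sampling budget, lower total cost
```

The point of the sketch is that the inference term scales with N on every query, so shrinking N pays a dividend for the entire lifetime of the deployment.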

The first approach takes the familiar mathematical form used for Chinchilla scaling (which models a model’s prediction error, or loss) and directly modifies it by adding a new variable that accounts for the number of repeated samples at test time (k). This lets developers see how increased inference compute reduces the overall model error rate.
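For background, the standard Chinchilla parametric loss that this approach builds on is:

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

where $E$ is the irreducible loss and $A$, $B$, $\alpha$, $\beta$ are fitted constants. The T2 extension adds a dependence on the sample count $k$ to this form; the exact functional form of that extra term is specified in the paper and not reproduced here.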

The second approach directly models the downstream pass@k accuracy. It tells developers how likely their app is to solve a problem given a specific compute budget.

But should companies use this framework for all applications? Roberts clarified that the approach is specialized. “I imagine you wouldn’t see as much benefit for knowledge-intensive applications like chat models,” he said. Instead, “T2 suits reasoning-heavy applications such as coding, where resampling will typically be used as a method of scaling test time.”

What it means for developers

To validate the T2 scaling laws, the researchers built an extensive test bed of more than 100 language models, ranging from 5 million to 901 million parameters. They trained 21 new, highly overtrained checkpoints from scratch to see whether their mathematical predictions held. They then compared the models across eight diverse tasks, including real-world datasets such as SciQ and OpenBookQA, along with synthetic tasks designed to test arithmetic, spatial reasoning, and knowledge recall.

Both mathematical models showed that the compute-optimal frontier deviates dramatically from standard Chinchilla scaling. To maximize performance on a fixed budget, the optimal choice is a significantly smaller model trained on much more data than the traditional rule of 20 tokens per parameter dictates.


Train-to-test scaling laws show that small, overtrained models outperform Chinchilla-optimal models on reasoning tasks (source: arXiv)

In their experiments, the highly overtrained small models consistently outperformed the larger, Chinchilla-optimal models on all eight evaluation tasks once test-time sampling costs were taken into account.

For developers looking to deploy these findings, the technical barrier is surprisingly low.

“It doesn’t take anything fancy to achieve test-time scaling with our current models,” Roberts said. “On deployment, developers can absolutely integrate infrastructure that makes the sampling process more efficient (e.g., KV caching if you’re using a transformer).”

KV caching helps by storing the pre-processed context so that the model doesn’t have to re-read the initial request from scratch for each new reasoning sample.
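A back-of-the-envelope sketch of why prompt caching matters for repeated sampling (the numbers and function name are illustrative, not from the paper):

```python
def sampling_cost(prompt_tokens: int, completion_tokens: int, k: int,
                  kv_cache: bool) -> int:
    """Token-processing cost of drawing k samples for the same prompt.
    With a KV cache, the shared prompt is processed once and reused."""
    if kv_cache:
        return prompt_tokens + k * completion_tokens
    return k * (prompt_tokens + completion_tokens)

# 2,000-token prompt, 500-token completions, 16 repeated samples
print(sampling_cost(2000, 500, 16, kv_cache=False))  # 40000 tokens processed
print(sampling_cost(2000, 500, 16, kv_cache=True))   # 10000 tokens processed
```

The longer the shared prompt relative to each completion, and the larger k, the bigger the savings, which is exactly the regime repeated-sampling workloads operate in.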

However, extreme overtraining comes with practical trade-offs. Although overtrained models can be notoriously stubborn and harder to fine-tune, Roberts notes that when they applied supervised fine-tuning, “while this effect was present, it was not a strong enough effect to return the optimal model to Chinchilla.” The compute-optimal strategy remains decisively biased towards compact models.

However, teams pushing this to the absolute limit must be careful about hitting physical data limits. “Another angle is that if you take our overtraining recommendations to the extreme, you might run out of training data,” Roberts said, referring to the approaching “data wall” where high-quality Internet data runs out.

These experiments confirm that if an application relies on generating multiple reasoning samples at test time, aggressively overtraining a compact model is practically and mathematically the most efficient way to spend an end-to-end compute budget.

To help developers get started, the research team plans to open-source its checkpoints and code soon, allowing companies to plug in their own data and test scaling behavior immediately. Ultimately, this framework serves as an equalizing force in the AI industry.

This is especially crucial, as the high price of frontier models can become a barrier as agentic applications that rely on reasoning models are scaled.

“T2 fundamentally changes who can build strong reasoning models,” concludes Roberts. “You may not need massive compute budgets to get state-of-the-art reasoning. Instead, you need good data and a smart allocation of your training and inference budget.”


