When the Interference Isn't There: What a Null Result Taught Me About LLM Serving

Jun 9, 2026

I spent several weeks writing a research proposal around a problem that turned out not to exist, at least not under the conditions I tested. This is a write-up of what I proposed, what I ran, and what the data actually said.


The Problem I Thought I Was Solving

The motivating scenario was simple. A user submits a 50,000-token document for summarisation. At the same moment, ten colleagues send short one-line questions. All eleven requests land on the same GPU. The serving system admits them together. The large request consumes most of the available KV cache memory, stalls token generation for everyone else, and those short requests, which would take under 200 ms alone, now wait several seconds for their first token.

The proposed fix was a lightweight admission controller sitting above vLLM [1] at the API gateway. At each request arrival, the controller reads two signals: current KV cache utilisation, taken from vLLM's /metrics endpoint, and the input token count of the incoming request, taken from the request itself. If both exceed calibrated thresholds, the incoming long-context request gets routed to a separate GPU instance rather than the one already under load.
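A minimal sketch of that decision rule, to make the two-signal logic concrete. The metric name, the threshold values, and the "primary"/"overflow" labels are assumptions for illustration; vLLM has exposed KV cache usage under different metric names across versions, so the name should be checked against the actual /metrics output.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed metric name; verify against the /metrics output of the vLLM version in use.
KV_USAGE_METRIC = "vllm:kv_cache_usage_perc"


def parse_kv_usage(metrics_text: str, metric: str = KV_USAGE_METRIC) -> Optional[float]:
    """Return the first sample of `metric` from Prometheus text output, or None."""
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name ends at the first "{" (labels) or the first space.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name == metric:
            return float(line.rsplit(" ", 1)[1])
    return None


@dataclass(frozen=True)
class Thresholds:
    kv_usage: float = 0.70    # hypothetical value; would need calibration
    input_tokens: int = 4096  # hypothetical value; would need calibration


def route(kv_usage: Optional[float], input_tokens: int,
          t: Thresholds = Thresholds()) -> str:
    """Return "overflow" only when both signals exceed their thresholds."""
    if kv_usage is None:
        return "primary"  # fail open if the metric is missing
    if kv_usage > t.kv_usage and input_tokens > t.input_tokens:
        return "overflow"
    return "primary"
```

A gateway would call parse_kv_usage on the scraped metrics text and pass the result, together with the tokenised prompt length, to route. Failing open when the metric is missing is one possible design choice, not something the proposal specified.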

The hypothesis: routing long-context requests away from loaded GPUs reduces p95 TTFT (time to first token) for short, co-running requests.

It is a clean hypothesis. It maps to a real operational concern. And the literature seemed to back it up. Systems like ShuffleInfer [3] and TetriInfer reported measurable interference and built architectural solutions around it.

Before building anything, I ran an experiment to check whether the interference was actually reproducible in my setup.


What I Ran

Hardware: RunPod A100 80GB
Model: Qwen3-8B
Serving engine: vLLM 0.22.1 [1], no custom routing
Workload: 8 concurrent long-context requests (~6,200 tokens each) alongside 10 short requests, 10 repetitions
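For readers who want to reproduce the measurement, here is a rough sketch of how TTFT and a per-repetition p95 could be computed. It is not the harness used for this experiment: the function names are hypothetical, the stream is any iterable of tokens (for example a streaming API response), and the nearest-rank percentile method is an assumption.

```python
import math
import time
from typing import Iterable, Sequence


def time_to_first_token(stream: Iterable[str]) -> float:
    """Seconds from starting to consume `stream` until its first item arrives."""
    start = time.perf_counter()
    for _ in stream:
        return time.perf_counter() - start
    raise ValueError("stream produced no tokens")


def percentile_nearest_rank(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]
```

With 10 short requests per repetition, nearest-rank p95 is simply the slowest of the ten (rank = ceil(0.95 × 10) = 10). Interpolating percentile methods give slightly different values, so the method matters when comparing numbers across write-ups.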

The chart below shows the result:

[Figure: TTFT across 10 repetitions, short-request p95 TTFT vs. long-context mean TTFT]

The two lines are nearly identical across all 10 repetitions. Both converge quickly after Rep 1 and stabilise around the mean p95 of 0.330 s. In other words, the short-request p95 TTFT is indistinguishable from the long-context mean TTFT. There is no interference gap to fix.


Why There Was No Interference

After seeing the result, the explanation was obvious in hindsight.

The model was too small for the hardware. Qwen3-8B at fp16 occupies roughly 16 GB of VRAM. An A100 80GB has five times that in total. Eight concurrent long-context requests at ~6,200 tokens each produce a KV cache footprint well within what the GPU can hold comfortably. The memory pressure that drives interference never materialises.
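A back-of-the-envelope version of that arithmetic, as a sketch. The attention configuration for Qwen3-8B (36 layers, 8 KV heads, head dimension 128) is an assumption that should be checked against the model's config.json. The calculation also ignores the short requests, activation memory, and whatever vLLM reserves for itself.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """KV cache bytes for one token: keys and values for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


# Assumed Qwen3-8B attention config; verify against the model's config.json.
QWEN3_8B = dict(num_layers=36, num_kv_heads=8, head_dim=128)

per_token = kv_bytes_per_token(**QWEN3_8B)  # 147,456 bytes (144 KiB) at fp16
workload_tokens = 8 * 6200                  # long-context prompts only
kv_gb = per_token * workload_tokens / 1e9   # about 7.3 GB

# Crude headroom: total VRAM minus the ~16 GB of weights quoted above.
headroom_gb = 80 - 16
fraction_used = kv_gb / headroom_gb         # about 0.11
```

Under these assumptions the long-context prompts need about 7 GB of KV cache, roughly a ninth of the memory left after the weights are loaded.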

ShuffleInfer [3] and TetriInfer measured the interference they were solving for because they were running much larger models at or near the memory ceiling of their hardware. At that boundary, a single large request can occupy a meaningful fraction of the KV cache, forcing the scheduler to stall other requests waiting for space to free up. That is a real phenomenon. It just requires the right model-to-hardware ratio to appear.

vLLM's scheduler and PagedAttention memory management are better than I gave them credit for. vLLM [1] allocates the KV cache in fixed-size blocks, similar in principle to paging in OS virtual memory management, rather than reserving contiguous space for each request up front. Under moderate load, it handles mixed workloads well. The FCFS (First-Come First-Served) scheduler the proposal was planning to outperform was doing a reasonable job at the scale I tested.

The token counts were too low. 6,200 tokens is a mid-length context, not a 50,000-token document. Reproducing the pressure described in the motivating scenario requires prompts genuinely at the long end of the model's context window.


What the Null Result Means

This is not a failure of the research idea. It is a finding about the conditions under which the problem exists.

The interference problem is real and has been measured and published. But it is conditional on memory pressure, which in turn depends on model size, request token lengths, and hardware capacity combining to push memory use close to the hardware ceiling. Testing on an A100 80GB with an 8B parameter model at 6,200-token inputs puts you nowhere near that ceiling.

To reproduce the interference, the experiment would need:

  • A model large enough to consume a significant share of GPU memory at serving precision, for example a 70B model at fp8 on a single A100, or a 13B+ model at fp16 where KV cache growth becomes a real constraint
  • Long-context requests genuinely at the upper end, 32K to 100K tokens, not mid-range
  • A GPU with less headroom, or deliberately reduced KV cache capacity to simulate a loaded production node

The gap in the literature the proposal identified is still a gap. No existing system that I am aware of, including SageSched [2], uses live KV cache occupancy across concurrent requests as an admission signal. But testing it on hardware with abundant headroom is the wrong setup, and the setup should have been calibrated before running the experiment.


What Changes Next

The research direction needs to shift to one of two places.

Reproduce the interference first. Run the same workload on hardware where the model-to-GPU memory ratio is tight enough to produce measurable TTFT degradation in the baseline. Once the interference is visible and reproducible, there is something worth building a routing controller against.

Reframe the contribution around observability. The interference attribution mechanism the proposal described, scoring KV cache pressure per active request and detecting short-request deceleration in real time, is useful independent of whether a routing controller acts on it. Production LLM serving has poor per-request latency attribution today. A tool that tells operators which request caused a batch to slow down, and when, has practical value even if the routing policy becomes future work.
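To give a sense of what such attribution could look like, here is a hypothetical sketch of the two pieces named above: a per-request share of KV cache blocks, and a simple deceleration check on inter-token gaps. All names, the block size of 16 tokens, and the 2x slowdown factor are assumptions for illustration, not the proposal's actual design and not a vLLM API.

```python
import math
from dataclasses import dataclass
from typing import Dict, List

BLOCK_SIZE = 16  # tokens per KV block; assumed, and configurable in practice


@dataclass
class ActiveRequest:
    request_id: str
    tokens_in_cache: int  # prompt plus generated tokens currently cached


def kv_block_share(requests: List[ActiveRequest], total_blocks: int,
                   block_size: int = BLOCK_SIZE) -> Dict[str, float]:
    """Fraction of all KV blocks held by each active request."""
    if total_blocks <= 0:
        raise ValueError("total_blocks must be positive")
    return {
        r.request_id: math.ceil(r.tokens_in_cache / block_size) / total_blocks
        for r in requests
    }


def is_decelerating(inter_token_gaps: List[float], baseline_gap: float,
                    factor: float = 2.0, window: int = 4) -> bool:
    """True if the mean of the last `window` gaps exceeds factor x baseline."""
    if len(inter_token_gaps) < window:
        return False  # not enough samples to judge
    recent = inter_token_gaps[-window:]
    return sum(recent) / window > factor * baseline_gap
```

An operator-facing tool could log the block shares at the moment is_decelerating first fires for a short request, which gives a rough answer to "which request was holding the cache when this one slowed down".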

Either way, the null result is the most important output of this experiment. A result that shows no effect under these conditions is honest and repeatable. It saves someone else from running the same test.


References

[1] W. Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention," Proc. ACM SOSP, 2023. https://arxiv.org/abs/2309.06180

[2] Z. Gan et al., "SageSched: Efficient LLM Scheduling Confronting Demand Uncertainty and Hybridity," arXiv:2603.07917, 2026. https://arxiv.org/abs/2603.07917

[3] C. Hu et al., "ShuffleInfer: Disaggregate LLM Inference for Mixed Downstream Workloads," ACM Trans. Archit. Code Optim., vol. 22, 2025. https://doi.org/10.1145/3732941