Interference-Aware Admission Control for LLM Serving

Jun 3, 2026

Does routing long-context requests away from loaded GPUs reduce latency for short requests in mixed LLM workloads?


Motivating Scenario

A user submits a 50,000-token document for summarisation. At the same moment, ten colleagues send short one-line questions. All eleven requests land on the same GPU. The serving system admits them together because it has no way to know that the large request will fill most of the available memory and stall token generation for everyone else. Each short request, which would take under 200 ms alone, now waits up to four seconds for its first token. The users see a frozen screen. The operator sees "GPU utilisation: 82%" and has no way to tell what happened or who caused it.


Research Question

In mixed LLM workloads, does routing long-context requests away from GPUs under memory pressure reduce p95 TTFT (time to first token) for co-running short requests, compared to standard vLLM scheduling?


What the Literature Has Already Built

| System | What it does | Why the gap remains |
| --- | --- | --- |
| vLLM (SOSP 2023) [1] | Default scheduler uses First-Come First-Served (FCFS) ordering with no awareness of relationships between concurrent requests | Treats every request as independent; blind to how one large request blocks others sharing the same GPU |
| SageSched (ICML 2026) [2] | Predicts each request's cost from its output-length distribution | Models each request's own cost in isolation, not the pressure it places on other co-running requests |
| TRUFFLD (ICLR 2026) [3] | Traces each request's own GPU call chain to detect anomalies | Per-request view only; cannot link one request's resource consumption to another request's slowdown |
| LatencyPrism (Alibaba, 2026) [4] | Monitors latency at the batch level across distributed inference | Batch granularity only: knows a batch was slow, not which individual request caused it |
| ShuffleInfer / TetriInfer (ACM TACO, 2025) [5] | Separates prefill (prompt processing) and decode (token generation) onto different hardware to prevent phase interference | Reduces interference through architecture changes, not through runtime admission decisions |

The pattern: Every system either avoids interference by hardware separation or monitors one request at a time. None use the live Key-Value (KV) cache occupancy of all currently running requests as a signal for deciding where to admit the next one. That is the gap.


Proposed Solution

The proposal is a lightweight admission controller that sits above vLLM [1] at the API gateway and requires no changes to the serving engine itself. At each request arrival it evaluates two signals. The first is read from vLLM's /metrics HTTP endpoint, polled by a background thread at approximately 100 ms intervals: fast enough to track KV cache growth across batch iterations, and independent of Prometheus scrape intervals, which in practice rarely go below about 1 second. The second is computed from the request itself:

  • Current KV cache utilisation across all active requests
  • Input token count of the incoming request

If both exceed calibrated thresholds, the incoming request is routed to a separate GPU instance rather than the one already under load. Thresholds are calibrated from the first 20% of each trace before evaluation begins. Prometheus is used as the audit trail: after each routing decision, the controller logs the decision, KV utilisation at that moment, and the input token count, making every routing choice observable and debuggable in Grafana.
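A minimal sketch of the polling and routing logic described above, using only the Python standard library. Everything named here is an assumption made for illustration: the gauge names in KV_METRIC_NAMES (they vary across vLLM versions and should be checked against the actual /metrics output), the placeholder URL, the instance labels "gpu-1" and "gpu-2", and the threshold values in the usage note. It is not the controller's final implementation.

```python
import threading
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional

# Assumption: candidate gauge names; verify against your vLLM version.
KV_METRIC_NAMES = ("vllm:gpu_cache_usage_perc", "vllm:kv_cache_usage_perc")


def parse_kv_utilisation(metrics_text: str) -> Optional[float]:
    """Return the first matching gauge value from Prometheus text, or None."""
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        head = line.split("{", 1)[0].split()
        if not head or head[0] not in KV_METRIC_NAMES:
            continue
        # The value follows the label set if present, else the metric name.
        rest = line.rsplit("}", 1)[1] if "}" in line else line[len(head[0]):]
        fields = rest.split()
        if fields:
            return float(fields[0])
    return None


def http_fetch(url: str = "http://localhost:8000/metrics") -> str:
    """Fetch raw metrics text (the URL is a placeholder)."""
    with urllib.request.urlopen(url, timeout=1.0) as resp:
        return resp.read().decode("utf-8")


class KVPoller:
    """Background thread that keeps the most recent KV utilisation reading."""

    def __init__(self, fetch: Callable[[], str], interval_s: float = 0.1):
        self._fetch = fetch
        self._interval_s = interval_s
        self._latest: Optional[float] = None
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def poll_once(self) -> None:
        try:
            value = parse_kv_utilisation(self._fetch())
        except (OSError, ValueError):
            return  # keep the last good reading on fetch or parse errors
        if value is not None:
            with self._lock:
                self._latest = value

    def _run(self) -> None:
        while not self._stop.is_set():
            self.poll_once()
            self._stop.wait(self._interval_s)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()

    def latest(self) -> Optional[float]:
        with self._lock:
            return self._latest


@dataclass(frozen=True)
class Decision:
    """One routing decision, carrying the fields the audit trail would log."""
    target: str
    kv_utilisation: float
    input_tokens: int
    rerouted: bool


def admit(kv_utilisation: float, input_tokens: int,
          kv_threshold: float, token_threshold: int,
          primary: str = "gpu-1", overflow: str = "gpu-2") -> Decision:
    """Reroute only when BOTH signals exceed their thresholds."""
    rerouted = kv_utilisation > kv_threshold and input_tokens > token_threshold
    return Decision(overflow if rerouted else primary,
                    kv_utilisation, input_tokens, rerouted)
```

In use, a gateway would start `KVPoller(http_fetch)` once, then call something like `admit(poller.latest() or 0.0, n_input_tokens, 0.7, 8000)` per request and log the returned Decision. The values 0.7 and 8000 are illustrative only; the real thresholds come from calibration.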

Where attribution fits in: The controller's routing decision is informed by a lightweight interference signal: for each overlap window between concurrent requests, the system scores KV cache pressure from each active request and checks whether short requests are decelerating beyond their expected baseline. This is the verification mechanism, not a diagnostic product. It answers the question: how do we know the routing rule is responding to real interference and not noise? Without it the controller is a black box. With it, each routing decision can be traced back to a measured pressure event on a specific GPU at a specific time.
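The text does not fix a formula for the interference signal, so the following is only one possible reading, sketched under stated assumptions: pressure from a co-running request is scored as the KV blocks it holds multiplied by the seconds it overlaps with the short request, and "decelerating beyond baseline" is a fixed slowdown factor on TTFT. The names ActiveRequest, pressure_scores, is_decelerated, and the factor 2.0 are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class ActiveRequest:
    request_id: str
    start_s: float   # time the request entered the batch
    end_s: float     # time it finished (or "now" if still running)
    kv_blocks: int   # KV cache blocks held (assumed constant here)


def overlap_s(a: ActiveRequest, b: ActiveRequest) -> float:
    """Length in seconds of the window during which both requests were active."""
    return max(0.0, min(a.end_s, b.end_s) - max(a.start_s, b.start_s))


def pressure_scores(victim: ActiveRequest,
                    others: Iterable[ActiveRequest]) -> Dict[str, float]:
    """Score each co-running request as block-seconds overlapping the victim."""
    scores = {}
    for other in others:
        window = overlap_s(victim, other)
        if window > 0:
            scores[other.request_id] = other.kv_blocks * window
    return scores


def is_decelerated(observed_ttft_s: float, baseline_ttft_s: float,
                   slowdown_factor: float = 2.0) -> bool:
    """Flag a short request whose TTFT exceeds its baseline by the given factor."""
    return observed_ttft_s > slowdown_factor * baseline_ttft_s
```

A routing decision would then count as verified when a flagged short request coincides with a high pressure score from the long request on the same GPU.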

Why this improves performance: Short requests that would take under 200 ms in isolation can be delayed by seconds whenever a large request shares their GPU. Routing the large request away before it enters the batch should remove that delay. The expected improvement is direct and measurable: p95 TTFT for short requests should fall because the resource that was blocking them, KV cache memory, is no longer consumed by a request that could have run elsewhere.

Why a threshold rule is the starting point: The routing policy is intentionally simple: a calibrated threshold on two variables. A learned model predicting interference would be more flexible, but would require labelled training data that does not currently exist in any public dataset and would obscure whether the improvement comes from the signal or the model. A simple rule that still outperforms the baselines is stronger evidence that the signal itself carries the information. A learned policy is the natural next step and is explicitly scoped as future work.
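How the two thresholds are derived from the first 20% of a trace is not specified, so the sketch below assumes one simple option: take a high percentile (here the 90th, an arbitrary illustrative choice) of each signal over the calibration prefix, and hold out the remainder for evaluation. The function names and the trace layout of (kv_utilisation, input_tokens) pairs are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

TraceRow = Tuple[float, int]  # (kv_utilisation at arrival, input token count)


def nearest_rank_percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("no values to summarise")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


def calibrate(trace: Sequence[TraceRow], calib_fraction: float = 0.2,
              pct: float = 90.0) -> Tuple[float, float, List[TraceRow]]:
    """Split the trace in arrival order and derive both thresholds from the prefix."""
    n_calib = max(1, int(len(trace) * calib_fraction))
    calib, evaluation = trace[:n_calib], trace[n_calib:]
    kv_threshold = nearest_rank_percentile([kv for kv, _ in calib], pct)
    token_threshold = nearest_rank_percentile([tok for _, tok in calib], pct)
    return kv_threshold, token_threshold, list(evaluation)
```

Keeping the calibration prefix out of the evaluation set avoids tuning the rule on the same requests it is later scored on.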


Experiment Design

Hardware: a single node with 2 to 4 GPUs, serving Llama-3-8B with vLLM [1]. No specialised cluster is required.

Synthetic workload (establish ground truth): One long-context request (50,000 input tokens) alongside ten short requests (at most 256 output tokens) on the same GPU.

  • Run A: Standard vLLM with First-Come First-Served scheduling, all requests on GPU 1
  • Run B: Admission controller active, long-context request routed to GPU 2
  • 10 repetitions with different random seeds. Mean p95 TTFT reported with 95% confidence intervals and confirmed with a paired t-test (p < 0.05).
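The statistical comparison of Run A and Run B can be sketched with the standard library alone. This assumes the ten repetitions are paired by seed, and it hard-codes the two-sided 95% critical value of Student's t for 9 degrees of freedom (2.262); a different number of repetitions needs a different critical value. The function name paired_summary is hypothetical.

```python
import math
import statistics
from typing import Sequence, Tuple

T_CRIT_DF9 = 2.262  # two-sided 95% critical value, Student's t, 9 degrees of freedom


def paired_summary(baseline: Sequence[float], treated: Sequence[float],
                   t_crit: float = T_CRIT_DF9
                   ) -> Tuple[float, Tuple[float, float], float, bool]:
    """Return (mean difference, 95% CI, t statistic, significant at p < 0.05).

    Each position in the two sequences must come from the same seed.
    """
    if len(baseline) != len(treated) or len(baseline) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [b - t for b, t in zip(baseline, treated)]
    mean_diff = statistics.mean(diffs)
    sem = statistics.stdev(diffs) / math.sqrt(len(diffs))  # standard error of the mean
    if sem == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t_stat = mean_diff / sem
    ci = (mean_diff - t_crit * sem, mean_diff + t_crit * sem)
    return mean_diff, ci, t_stat, abs(t_stat) > t_crit
```

A positive mean difference whose confidence interval excludes zero would indicate that p95 TTFT is lower with the controller active.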

Real workload (establish production relevance): Replay ShareGPT and LMSYS-Chat-1M [6] traces with the controller active versus inactive, and measure p95 TTFT for requests with at most 256 output tokens in both conditions. Both are datasets of real user conversations; ShareGPT was used to synthesise workloads in the original vLLM paper [1], and both are widely used as production-like benchmarks in LLM serving research.

Metric: p95 TTFT for short requests. One number, directly compared between conditions.
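As a concrete reading of the metric, the sketch below filters to short requests and takes the nearest-rank 95th percentile of TTFT. The record layout of (output_tokens, ttft_seconds) pairs and the function name are assumptions; other percentile definitions (for example, linear interpolation) would give slightly different values.

```python
import math
from typing import Iterable, Tuple


def p95_short_ttft(records: Iterable[Tuple[int, float]],
                   max_output_tokens: int = 256) -> float:
    """Nearest-rank p95 of TTFT over requests with at most max_output_tokens."""
    ttfts = sorted(ttft for out_tokens, ttft in records
                   if out_tokens <= max_output_tokens)
    if not ttfts:
        raise ValueError("no short requests in this run")
    rank = math.ceil(95 * len(ttfts) / 100)  # 1-based nearest rank
    return ttfts[rank - 1]
```

Under this definition, a run with only ten short requests has rank 10, so its p95 is simply the slowest short request; the trace replays give a less extreme estimate.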

Baselines: vLLM First-Come First-Served (the production default) [1] and SageSched (ICML 2026) [2]. The remaining systems in the literature table (TRUFFLD, LatencyPrism, ShuffleInfer) serve to establish the gap; they address different sub-problems and are not drop-in routing alternatives.


Alignment

This research sits at the orchestration layer: the layer between individual requests and the GPU cluster that decides how resources are allocated. This is what my supervisor has described as the core of the "OS for AI" vision, in which physical hardware is abstracted and scheduling decisions are made based on observable resource state rather than blind heuristics.

The contribution is concrete and measurable: a controller that sees something existing schedulers do not (live KV pressure across concurrent requests), uses that signal to make a smarter placement decision, and produces a directly measurable improvement in user-facing latency. It does not just describe the problem. It fixes it and shows the numbers.

This work also maps directly to the infrastructure engineering roles that motivate it. Building a production-grade routing layer above vLLM in Python and Go, integrating with Kubernetes and Prometheus for signal collection and deployment, and evaluating on real serving traces is exactly the kind of work done by inference infrastructure engineers at companies building the next generation of LLM serving platforms. The implementation requires no GPU kernel work or model architecture changes, only systems-level orchestration.


References

[1] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica, "Efficient Memory Management for Large Language Model Serving with PagedAttention," in Proc. ACM SOSP, Koblenz, Germany, Oct. 2023. [Online]. Available: https://arxiv.org/abs/2309.06180

[2] Z. Gan, Y. Bao, Y. Liu, C. Chen, Q. Chen, and M. Guo, "SageSched: Efficient LLM Scheduling Confronting Demand Uncertainty and Hybridity," arXiv:2603.07917 [cs.DC], Mar. 2026. [Online]. Available: https://arxiv.org/abs/2603.07917

[3] R. Xu, J. Li, Z. Xie, and P. Chen, "Bridging Non-Intrusive Tracing and Fine-Grained Cross-Layer Representations for LLM Inference Diagnosis (TRUFFLD)," submitted to ICLR 2026, Oct. 2025. [Online]. Available: https://openreview.net/forum?id=S9iB7BA3Zq

[4] Y. Du et al., "LatencyPrism: Online Non-intrusive Latency Sculpting for SLO-Guaranteed LLM Inference," arXiv:2601.09258 [cs.DC], Jan. 2026. [Online]. Available: https://arxiv.org/abs/2601.09258

[5] C. Hu, H. Huang, L. Xu, X. Chen, C. Wang, J. Xu, S. Chen, H. Feng, S. Wang, Y. Bao, N. Sun, and Y. Shan, "ShuffleInfer: Disaggregate LLM Inference for Mixed Downstream Workloads," ACM Trans. Archit. Code Optim., vol. 22, pp. 1-24, Jul. 2025. [Online]. Available: https://doi.org/10.1145/3732941

[6] L. Zheng, W.-L. Chiang, Y. Sheng, T. Li, S. Zhuang, Z. Wu, Y. Zhuang, Z. Li, Z. Lin, E. P. Xing, J. E. Gonzalez, I. Stoica, and H. Zhang, "LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset," arXiv:2309.11998 [cs.CL], Sep. 2023. [Online]. Available: https://arxiv.org/abs/2309.11998