Once a language model has been refined, its effectiveness depends on how well it can be delivered in real-world environments. This book examines the systems and techniques that enable efficient inference, with a particular focus on vLLM and the architectural decisions that support high-throughput execution.

The text begins by establishing the relationship between model size, hardware constraints, and response latency. It then explores how memory is managed during inference, including strategies that reduce overhead while maintaining output quality. Concepts such as batching, caching, and token-level scheduling are presented in a way that reveals their practical impact on performance.

A central theme of the book is parallel execution, where multiple requests are handled simultaneously without degrading responsiveness. The discussion highlights how modern inference frameworks distribute workloads, coordinate computation, and maintain consistency across concurrent processes. Token streaming is examined as a critical component of user-facing systems, showing how incremental output generation improves perceived responsiveness and interaction flow. The material connects these techniques to broader system considerations, including scaling across machines, managing resource allocation, and maintaining stability under load.

As the book progresses, it presents a unified view of inference as both a technical and operational challenge. It demonstrates how decisions made at the system level directly influence user experience, cost efficiency, and reliability. By the end, readers will have a clear understanding of how optimized inference transforms a refined model into a responsive and scalable system capable of operating under demanding conditions.
vLLM and High-Performance Inference. $20.29, new condition. Sold by Books2anywhere (rated 5.0 out of 5 stars), ships from Fairford, Gloucestershire, United Kingdom. Published 2026 by Independently Published.
Seller's Description:
Please note: we do not ship to Denmark. New book. Shipped from the UK in 4 to 14 days. Established seller since 2000. We cannot offer an expedited shipping service from the UK.
vLLM and High-Performance Inference: Memory. $23.90, new condition. Sold by Ingram Customer Returns Center (rated 5.0 out of 5 stars), ships from NV, USA. Published 2026 by Independently Published.