Cloud Infrastructure Management in the Age of AI Agents
doi: 10.1145/3759441.3759443
Cloud infrastructure is the cornerstone of the modern IT industry. However, managing this infrastructure effectively requires considerable manual effort from the DevOps engineering team. We make a case for developing AI agents powered by large language models (LLMs) to automate cloud infrastructure management tasks. In a preliminary study, we investigate the potential for AI agents to use different cloud/user interfaces such as software development kits (SDK), command line interfaces (CLI), Infrastructure-as-Code (IaC) platforms, and web portals. We report takeaways on their effectiveness on different management tasks, and identify research challenges and potential solutions.
Efficient LLM Inference via Chunked Prefills
doi: 10.1145/3759441.3759444
Large Language Model (LLM) inference serving faces a fundamental challenge due to the distinct characteristics of its two phases: compute-intensive prefill and memory-intensive decode. Existing scheduling strategies often prioritize one phase over the other, leading to a difficult tradeoff between system throughput and request latency. Prefill-prioritizing schedulers improve throughput but introduce significant latency jitter (generation stalls) by interfering with ongoing decodes. Conversely, decode-prioritizing schedulers maintain low latency but underutilize GPU resources, resulting in low throughput. This paper revisits the technique of chunked prefills, demonstrating its efficacy in mitigating this tradeoff. By splitting large prefill computations into smaller, manageable chunks and interleaving them with decode operations using stall-free batching, we can leverage the compute slack inherent in the decode phase. This approach significantly improves serving capacity under strict latency constraints, minimizes generation stalls, and reduces pipeline bubbles in distributed deployments, enabling efficient and responsive inference.
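The interleaving described above can be illustrated with a minimal sketch of one scheduling step. It assumes a fixed per-iteration token budget and a simple first-come ordering; the names `Request`, `build_batch`, and `token_budget` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Request:
    rid: str
    prompt_len: int         # prompt tokens that need prefill
    prefilled: int = 0      # prompt tokens already processed
    decoding: bool = False  # True once the whole prompt has been prefilled


def build_batch(requests, token_budget):
    """Build one iteration's batch as a list of (rid, phase, num_tokens).

    Assumed policy: ongoing decodes are admitted first (one token each),
    then the leftover budget is filled with prefill chunks.
    """
    batch = []
    remaining = token_budget

    # 1) Decodes go first so running generations are not stalled by prefills.
    for r in requests:
        if r.decoding and remaining > 0:
            batch.append((r.rid, "decode", 1))
            remaining -= 1

    # 2) The compute slack left in the budget is used for prefill chunks.
    for r in requests:
        if remaining == 0:
            break
        if r.decoding:
            continue
        chunk = min(remaining, r.prompt_len - r.prefilled)
        batch.append((r.rid, "prefill", chunk))
        r.prefilled += chunk
        remaining -= chunk
        if r.prefilled == r.prompt_len:
            r.decoding = True  # starts decoding in the next iteration

    return batch
```

Calling `build_batch` repeatedly on the same request list advances a long prompt a chunk at a time while every running decode still emits one token per iteration, which is the stall-free property the abstract refers to.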
Towards Large Language Model-Friendly APIs
doi: 10.1145/3759441.3759445
Conventional Application Programming Interfaces (APIs) are designed for human developers. However, when Large Language Models (LLMs) act as API clients, these human-centric design choices may fail to harness the potential of LLMs, thus causing excessive overhead and task failures. We present Symphony APIs, a class of semi-open APIs allowing LLMs to extend the API's internal logic at runtime, under the constraints of safety and controllability. Our case studies using the POSIX 'find' utility and the Robot 'PickAndPlace' API show that Symphony APIs can enable LLMs to extend API capabilities in a cost-effective and controllable manner.
DREAM: Distributed Regional Efficient Agent Management with LLMs for Online Multi-Agent Pathfinding
doi: 10.1145/3759441.3759446
In this paper, we introduce DREAM (Distributed Regional Efficient Agent Management), a novel method that uses Large Language Models (LLMs) to solve Multi-Agent Pathfinding (MAPF) problems in complicated environments. Our approach splits the area into local regions, each handled by an LLM agent that performs the reasoning and decision making for that region. The system introduces several novel designs: (1) adaptive region management and allocation, which supports dynamic partitioning of areas with different complexity or density; (2) a multi-level, LLM-driven agent collaboration framework that enables peer-to-peer, inter-LLM coordination, control, and monitoring across a hierarchical path-planning organization, preserving autonomy while improving shared understanding among LLM agents and leading to more accurate planning decisions from real-time analysis; (3) a failure-reflection-replanning mechanism, integrated within each LLM's management scope, that yields continual improvement; and (4) function calling, which lets LLM agents interact with typical algorithms. By combining the higher-order reasoning capabilities of LLMs with this distributed framework, our system successfully processes complex, large-scale MAPF scenarios. For instance, the distributed and hierarchical nature of the approach helps break a high-dimensional MAPF problem into several groups of smaller dimension. The approach also opens up the use of AI language models in more complex robotics and logistics scenarios, potentially changing how multi-agent coordination is done in real-world situations.
Toward Weight Sharing Paradigm for Efficient AI: Training and Inference Serving
doi: 10.1145/3759441.3759447
Deep neural networks are increasingly required to operate across diverse hardware platforms, latency constraints, and power budgets, which motivates the need for specialized models for each scenario. However, designing and training a separate model per scenario or serving a large ensemble of models is often impractical. Weight sharing has emerged as a promising paradigm to address this challenge by training a single "SuperNet" that subsumes many sub-models (SubNets), and by reusing weights across those SubNets both at training and inference time. This paper provides an abridged survey of our recent advances that leverage weight sharing for efficient AI, covering both training and inference serving. In centralized once-for-all training, Delayed ε-Shrinking (DεS) improves training efficiency by strategically scheduling the introduction of smaller SubNets during training. In the federated setting, SuperFedNas co-trains a SuperNet across distributed clients and decouples training from searching, which enables one-shot specialization to many deployment targets at minimal cost. ∇QDARTS integrates quantization into differentiable architecture search, jointly finding neural architectures, weights, and low-precision settings to yield highly efficient models in a single search. For inference serving, SuperServe introduces a weight-shared model with dynamic SubNet routing (SubNetAct) to instantaneously switch among a spectrum of accuracy-latency operating points, coupled with a scheduler (SlackFit) for unpredictable workloads. Finally, SUSHI co-designs model, system, and accelerator to exploit weight-shared SuperNets on tinyML devices, caching SubGraphs on FPGA to reduce latency and energy. Together, these works demonstrate that the weight sharing paradigm can dramatically improve the efficiency of both training and inference serving of deep models across a range of scenarios.
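The core idea of reusing SuperNet weights across SubNets can be sketched with a single linear layer whose SubNets read a leading slice of one shared weight matrix. This is a generic width-slicing illustration under assumed names (`SharedLinear`, `out_width`), not the mechanism of DεS, SuperServe, or any other system named above.

```python
class SharedLinear:
    """A linear layer whose SubNets share one SuperNet weight matrix."""

    def __init__(self, weights):
        # weights is an out_max x in_max list of lists owned by the SuperNet.
        self.weights = weights

    def forward(self, x, out_width):
        """Run a SubNet that uses the first len(x) inputs and out_width outputs.

        No weights are copied: the SubNet indexes directly into the shared
        matrix, so an update to the SuperNet is visible to every SubNet.
        """
        in_width = len(x)
        return [
            sum(self.weights[o][i] * x[i] for i in range(in_width))
            for o in range(out_width)
        ]


if __name__ == "__main__":
    layer = SharedLinear([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
    print(layer.forward([1, 1, 1], out_width=3))  # full SuperNet width
    print(layer.forward([1, 1], out_width=2))     # narrower SubNet, same weights
```

Because every SubNet is only a view into the same storage, switching between operating points needs no model reload, which is the property that makes a single deployed SuperNet attractive for serving.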
EMPIRIC: Exploring Missing Pieces in KV Cache Compression for Reducing Computation, Storage, and Latency in Long-Context LLM Inference
doi: 10.1145/3759441.3759448
Transformer-based Large Language Models (LLMs) heavily depend on the KV cache for efficient handling of long context sequences. However, the size of the KV cache grows linearly with the input sequence length, increasingly straining system memory, computational resources, bandwidth, and latency during decoding. Although recent research has proposed various techniques to compress the KV cache (targeting either storage or computational efficiency), few methods effectively achieve both simultaneously. Additionally, existing methods primarily rely on heuristic-driven approaches, lacking comprehensive insights into token selection criteria, and often significantly compromise model accuracy under strict KV cache token budget constraints (e.g., keeping 512 tokens). Building upon our recent work, RocketKV, this paper introduces EMPIRIC as an oracle-based vision study, which explicitly defines theoretical bounds for accuracy, computation, and storage in KV cache compression. By analyzing intrinsic patterns in KV cache attention heads, EMPIRIC provides novel insights into effective token pruning without accuracy degradation. This work clarifies the overlooked elements critical to KV cache compression during decoding and optimally balances computational efficiency, storage optimization, inference latency, and accuracy. We envision that EMPIRIC will guide future research efforts toward creating scalable, efficient KV cache compression techniques, significantly improving inference performance for long-context LLM inference.
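To make the notion of a token budget concrete, the sketch below keeps only the cached tokens that receive the most attention from the current query and reports how much attention mass survives. It assumes that "oracle" selection means choosing with full knowledge of the true attention weights; the function names are hypothetical, the usual scaling by the square root of the head dimension is omitted for brevity, and this is not EMPIRIC's or RocketKV's actual procedure.

```python
import math


def attention_weights(query, keys):
    """Softmax over dot-product scores between one query and all cached keys."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def oracle_keep(query, keys, budget):
    """Keep the `budget` tokens with the highest true attention weight.

    Returns the kept token indices (in cache order) and the fraction of the
    attention mass they retain, a rough proxy for how little is lost.
    """
    weights = attention_weights(query, keys)
    ranked = sorted(range(len(keys)), key=lambda i: weights[i], reverse=True)
    kept = sorted(ranked[:budget])
    retained = sum(weights[i] for i in kept)
    return kept, retained


if __name__ == "__main__":
    keys = [[0.1, 0.0], [2.0, 0.0], [0.5, 1.0], [3.0, 0.0]]
    print(oracle_keep([1.0, 0.0], keys, budget=2))
```

A practical method cannot see the exact weights before computing attention, so a selection like this serves as an upper bound against which heuristic token selection can be compared.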
GSST: Parallel string decompression at 191 GB/s on GPU
doi: 10.1145/3759441.3759450
Most of the commonly used compression standards make use of some form of the LZ algorithm. Decompressing this type of data is not a good match for the Single Instruction, Multiple Threads (SIMT) model of computation used by GPUs, resulting in low throughput and poor utilization of the GPU's parallel compute capabilities. In this paper, we introduce GSST, a GPU-optimized version of the FSST compression algorithm, which targets string compression. The optimizations proposed in this paper make the algorithm particularly suitable for GPUs, which allows it to achieve a significantly better tradeoff between decompression throughput and compression ratio than the state of the art. Our results show that the new algorithm pushes the Pareto curve closer towards the ideal region, completely dominating LZ-based compressors in the nvCOMP library (LZ4, Snappy, GDeflate). GSST provides a compression ratio of 2.74x and achieves a throughput of 191 GB/s on an A100 GPU.
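For readers unfamiliar with FSST, the sketch below shows why its decoding is simpler than LZ decoding: each compressed byte is an index into a small static symbol table, so no back-references into earlier output are needed. It is a sequential CPU illustration that assumes the FSST convention of one-byte codes with code 255 reserved as an escape for a literal byte; it does not reproduce GSST's GPU kernel or data layout, which the abstract does not describe.

```python
ESCAPE = 255  # assumed FSST-style escape code: the next byte is a literal


def decompress(codes: bytes, symbols: list) -> bytes:
    """Expand FSST-style codes using a static table of byte-string symbols."""
    out = bytearray()
    i = 0
    while i < len(codes):
        code = codes[i]
        if code == ESCAPE:
            out.append(codes[i + 1])  # literal byte follows the escape
            i += 2
        else:
            out += symbols[code]      # table lookup, no dependency on output
            i += 1
    return bytes(out)


if __name__ == "__main__":
    table = [b"http://", b"www.", b".com"]
    packed = bytes([0, 1, ESCAPE, ord("a"), 2])
    print(decompress(packed, table))
```

Since each code expands independently of previously decoded output, many strings (or many chunks of one string) can in principle be decoded by separate threads, which is what makes this family of schemes a better fit for SIMT hardware than LZ.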
Erasure Coding Aware Block Placement for Data-Intensive Applications
doi: 10.1145/3759441.3759451
Erasure Coding (EC) has recently been integrated and deployed in the Hadoop Distributed File System (HDFS) to provide the same fault tolerance guarantees as replication, but with significantly less storage overhead. When EC is used, data reads typically involve only data chunks. In this paper, we study the effect of data chunk distribution on the performance of reads and data-intensive applications, and present the design and evaluation of an erasure coding aware (EC-aware) block placement that balances the distribution of data chunks across nodes. Experimental results show that EC-aware block placement can reduce the execution time of Sort and WordCount applications by up to 25%.
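The balancing goal can be illustrated with a small greedy sketch: for each block group, data chunks go to the nodes currently holding the fewest data chunks, and parity chunks go to other nodes. The greedy policy, the function name `place_group`, and the RS(6,3) parameters in the usage example are assumptions for illustration only, not the placement algorithm evaluated in the paper.

```python
def place_group(nodes, data_load, k, m):
    """Choose k + m distinct nodes for one erasure-coded block group.

    nodes:     list of node ids (needs at least k + m entries)
    data_load: dict mapping node id -> number of data chunks already stored
    Returns (data_nodes, parity_nodes) and updates data_load in place.
    """
    if len(nodes) < k + m:
        raise ValueError("need at least k + m nodes")
    # Rank nodes by how many data chunks they already hold (ties by id).
    ranked = sorted(nodes, key=lambda n: (data_load[n], n))
    data_nodes = ranked[:k]     # least data-loaded nodes serve future reads
    parity_nodes = ranked[-m:]  # parity goes to the most data-loaded nodes
    for n in data_nodes:
        data_load[n] += 1
    return data_nodes, parity_nodes


if __name__ == "__main__":
    nodes = list(range(12))
    load = {n: 0 for n in nodes}
    for _ in range(4):
        print(place_group(nodes, load, k=6, m=3))
    print(load)
```

Because reads typically touch only data chunks, evening out the per-node data chunk count spreads read traffic across the cluster, which is the effect the abstract attributes to EC-aware placement.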