Splitwise: Efficient Generative LLM Inference Using Phase Splitting

Research on optimizing LLM inference by splitting prompt computation and token generation phases onto separate hardware for improved throughput, cost, and power efficiency.


1. Introduction

Generative large language models (LLMs) have revolutionized AI applications but face significant deployment challenges due to their computational intensity and resource requirements. The rapid adoption of LLMs across various domains has created unprecedented demand for GPU capacity, leading to a worldwide GPU shortage and power constraints in datacenters.

2. Background and Motivation

2.1 LLM Inference Characteristics

LLM inference consists of two distinct phases with contrasting resource requirements (see the sketch after this list):

  • Prompt Computation Phase: Computationally intensive parallel processing of all input tokens
  • Token Generation Phase: Memory-bandwidth bound sequential generation of output tokens
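
The contrast can be seen in a toy NumPy sketch (illustrative only, not the paper's code, with made-up model sizes): prefill is one large matrix multiplication over every prompt token, while each generation step is a small multiplication that must re-read the ever-growing KV cache.

```python
import numpy as np

d_model, prompt_len, new_tokens = 1024, 512, 64
rng = np.random.default_rng(0)
W_qkv = rng.standard_normal((d_model, 3 * d_model)).astype(np.float32)

# Prompt computation (prefill): one large, compute-bound matmul that
# processes all 512 prompt tokens in parallel and fills the KV cache.
prompt = rng.standard_normal((prompt_len, d_model)).astype(np.float32)
qkv = prompt @ W_qkv
k_cache = qkv[:, d_model:2 * d_model]
v_cache = qkv[:, 2 * d_model:]

# Token generation (decode): one token per step; each step does a tiny
# matmul but must stream the whole (growing) KV cache from memory,
# so memory bandwidth, not compute, is the limiter.
hidden = prompt[-1:]                      # stand-in for the next token's state
for _ in range(new_tokens):
    q, k, v = np.split(hidden @ W_qkv, 3, axis=1)
    k_cache = np.vstack([k_cache, k])     # cache grows by one row per token
    v_cache = np.vstack([v_cache, v])
    scores = (q @ k_cache.T) / np.sqrt(d_model)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    hidden = weights @ v_cache            # attention output reused as next state
```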

2.2 Hardware Limitations

GPU Specification Comparison (H100 vs. A100)

  • Compute: 3.43× higher
  • Memory bandwidth: 1.64× higher
  • Cost: 2.16× higher
  • Power: 1.75× higher
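
Dividing the first two ratios by the cost ratio (simple arithmetic on the figures above, not values quoted from the paper's tables) makes the mismatch concrete: the newer GPU wins on compute per dollar but loses on memory bandwidth per dollar, and memory bandwidth is what the token generation phase is bound by.

$\frac{3.43}{2.16} \approx 1.59\times \text{ compute per dollar}, \qquad \frac{1.64}{2.16} \approx 0.76\times \text{ memory bandwidth per dollar}$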

3. Splitwise Design

3.1 Phase Splitting Architecture

Splitwise proposes separating the two inference phases onto different hardware platforms (a scheduling sketch follows the list):

  • Prompt Machines: High-end GPUs (H100) for compute-intensive prompt processing
  • Token Machines: Cost-effective GPUs (A100) for memory-bound token generation
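
A minimal scheduling sketch in Python shows the intended division of labor; the machine names, round-robin placement, and byte-string stand-in for the cached state are assumptions for illustration, not the paper's actual scheduler.

```python
from dataclasses import dataclass
from itertools import cycle

@dataclass
class Request:
    request_id: int
    prompt_tokens: int
    max_new_tokens: int
    kv_cache: bytes = b""                 # stands in for the transferred state

class Pool:
    """A named group of identical machines with round-robin placement."""
    def __init__(self, name: str, machines: list[str]):
        self.name = name
        self._rr = cycle(machines)

    def pick(self) -> str:
        return next(self._rr)

def serve(req: Request, prompt_pool: Pool, token_pool: Pool) -> None:
    prompt_machine = prompt_pool.pick()           # e.g. an H100 server
    req.kv_cache = b"\x00" * req.prompt_tokens    # pretend prefill built a cache
    token_machine = token_pool.pick()             # e.g. an A100 server
    # In a real deployment the cached state would be streamed between the two
    # machines over a fast interconnect; here the hand-off is just a field.
    print(f"req {req.request_id}: prefill on {prompt_machine}, "
          f"decode of {req.max_new_tokens} tokens on {token_machine}")

prompt_pool = Pool("prompt", ["h100-0", "h100-1"])
token_pool = Pool("token", ["a100-0", "a100-1", "a100-2"])
for i in range(4):
    serve(Request(i, prompt_tokens=512, max_new_tokens=128), prompt_pool, token_pool)
```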

3.2 Resource Management

The system uses optimized network libraries and fast interconnects to transfer the state produced during prompt computation (the KV cache) to the token machines with low overhead. The end-to-end inference latency is modeled as:

$L_{total} = L_{prompt} + n \times L_{token}$

where $n$ is the number of output tokens, $L_{prompt}$ is prompt computation latency, and $L_{token}$ is per-token generation latency.
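
A direct transcription of this model into Python (the timing constants are illustrative placeholders, not measurements from the paper) makes it easy to see why long generations are dominated by the per-token term:

```python
def total_latency(l_prompt_s: float, l_token_s: float, n_tokens: int) -> float:
    """End-to-end latency: one prompt pass plus n sequential token steps."""
    return l_prompt_s + n_tokens * l_token_s

# Example with made-up numbers: a 0.5 s prefill followed by 200 tokens
# generated at 30 ms each; the decode phase accounts for 6.0 of the 6.5 s.
print(total_latency(l_prompt_s=0.5, l_token_s=0.030, n_tokens=200))  # 6.5
```

Because the second term grows with output length, a machine's suitability for the decode phase is governed by $L_{token}$, which depends on memory bandwidth rather than peak compute.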

4. Experimental Results

4.1 Performance Evaluation

Splitwise achieves significant improvements over conventional approaches:

  • 1.4× higher throughput compared to homogeneous clusters
  • 20% lower cost for equivalent performance
  • 2.35× more throughput under the same power and cost budgets

4.2 Cost and Power Analysis

The heterogeneous cluster design demonstrates superior resource utilization, particularly for the token generation phase, which does not require the compute capability of the latest GPUs.

5. Technical Analysis Framework

Core Insight

Splitwise fundamentally challenges the industry's one-size-fits-all approach to GPU deployment. The research exposes a critical flaw in current LLM serving architectures: treating inference as a monolithic process when it clearly consists of two distinct computational patterns. This insight is as significant as the original transformer architecture paper's revelation about attention mechanisms.

Logical Flow

The argument progresses with mathematical precision: (1) Characterize the bimodal nature of LLM inference, (2) Demonstrate the hardware mismatch through A100/H100 analysis, (3) Propose phase separation as a surgical solution, (4) Validate with empirical results. This logical progression mirrors that of seminal systems papers such as the one describing Google's Borg cluster management system.

Strengths & Flaws

Strengths: The 2.35× throughput improvement under fixed constraints is revolutionary—comparable to the leap achieved by NVIDIA's tensor cores. The cost reduction addresses the primary barrier to enterprise LLM adoption.

Flaws: The approach introduces network latency between phases, creating a new bottleneck. As with early microservices architectures, the complexity of distributed state management could outweigh the benefits for smaller deployments.

Actionable Insights

Cloud providers should immediately implement phase-split architectures in their LLM offerings. Enterprises building inference clusters must adopt this heterogeneous approach or face 20-40% cost penalties. The research suggests we're entering an era of specialized AI hardware, much like the CPU/GPU divergence of the 2000s.

6. Future Applications and Directions

The phase splitting concept extends beyond current LLMs to emerging architectures:

  • Multi-modal models: Separate processing for different modality encoders
  • Mixture of Experts: Dynamic routing between specialized phase-specific hardware
  • Edge deployments: Split between edge devices and cloud resources
  • Specialized hardware: Custom ASICs for token generation phases

7. References

  1. Vaswani, A., et al. "Attention is All You Need." NeurIPS 2017.
  2. Brown, T., et al. "Language Models are Few-Shot Learners." NeurIPS 2020.
  3. NVIDIA Corporation. "NVIDIA H100 Tensor Core GPU Architecture." 2022.
  4. Verma, A., et al. "Large-scale cluster management at Google with Borg." EuroSys 2015.
  5. Cloud GPU Pricing. "AWS EC2 Instance Pricing." Accessed 2024.