# preble

**Repository Path**: annanxue/preble

## Basic Information

- **Project Name**: preble
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: multi_modal_main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-04-03
- **Last Updated**: 2026-04-08

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Preble

Preble is a load balancer for efficient prefix caching systems.
The preprint is available at https://arxiv.org/abs/2407.00023
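
To make the idea of prefix-aware load balancing concrete, here is a toy sketch. It is not Preble's scheduling algorithm (see the preprint for that). `ToyPrefixRouter` and `shared_prefix_len` are hypothetical names invented for illustration. The sketch sends a request to the runtime that has already seen the longest matching token prefix, and falls back to round robin when nothing matches.

```python
from typing import Dict, List, Optional, Sequence


def shared_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Return the number of leading tokens that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class ToyPrefixRouter:
    """Hypothetical router, for illustration only (not Preble's implementation)."""

    def __init__(self, runtime_urls: List[str]) -> None:
        # Token sequences previously routed to each runtime.
        self.history: Dict[str, List[List[int]]] = {url: [] for url in runtime_urls}
        self._next = 0  # round-robin cursor used when no prefix matches

    def route(self, tokens: List[int]) -> str:
        best_url: Optional[str] = None
        best_len = 0
        for url, seen in self.history.items():
            match = max((shared_prefix_len(tokens, s) for s in seen), default=0)
            if match > best_len:
                best_url, best_len = url, match
        if best_url is None:
            # No runtime has a matching prefix: spread load round robin.
            urls = list(self.history)
            best_url = urls[self._next % len(urls)]
            self._next += 1
        self.history[best_url].append(list(tokens))
        return best_url
```

A real scheduler would also have to weigh load and cache eviction; this sketch ignores both.
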
## Code Structure
The `multi_node` directory contains the code that runs Preble as a separate abstraction layer on top of SGLang/vLLM in a distributed setting. It coordinates and manages execution across the distributed system.

## Installation

You can install the package using pip.

Editable installation:
```
pip3 install -e .
pip install -e "python[all]"
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3/
```

Regular pip installation:
```
pip3 install preble
pip install "git+https://github.com/wuklab/preble.git#egg=preble[all]"
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3/
```


We release a custom version of SGLang that supports chunked prefill.

## Programmatically starting the server
The server can be started with a list of runtime URLs:
```
from preble.main import start_server

start_server(
    runtime_selection_policy="custom",
    runtime_urls="http://127.0.0.1:30000/generate,http://127.0.0.1:30001/generate",
    host='127.0.0.1',
    port=8000,
    model="mistralai/Mistral-7B-v0.1"
)
```

The server can also load the model dynamically onto separate CUDA devices:
```
from preble.main import start_server_and_load_models

start_server_and_load_models(
    model_name="mistralai/Mistral-7B-v0.1",
    devices=[0, 1],
    host="127.0.0.1",
    port=8000
)
```
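
Once a server is listening on `127.0.0.1:8000`, a client needs to send it a request. The sketch below makes two assumptions that this README does not confirm: that the Preble server exposes a `/generate` route like the runtimes it fronts, and that it accepts an SGLang-style JSON body (`text` plus `sampling_params`). `build_generate_request` and `post_generate` are hypothetical helper names.

```python
import json
import urllib.request


def build_generate_request(prompt: str, host: str = "127.0.0.1", port: int = 8000,
                           max_new_tokens: int = 32) -> urllib.request.Request:
    """Build (but do not send) a POST request for a text completion."""
    # Assumption: the front end mirrors the runtimes' /generate route.
    url = f"http://{host}:{port}/generate"
    # Assumption: SGLang-style payload shape.
    payload = {
        "text": prompt,
        "sampling_params": {"max_new_tokens": max_new_tokens, "temperature": 0},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_generate(prompt: str, timeout: float = 60.0) -> dict:
    """Send the request to a running server and decode the JSON reply."""
    req = build_generate_request(prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())
```

With a server running, `post_generate("The capital of France is")` would return the decoded response. If the route or payload differs in your checkout, adjust `build_generate_request` accordingly.
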

The server can be run via:
```
python3 multi_node/server/server.py <server/deploy_and_run>
```
- `server` runs the server given a list of URLs
- `deploy_and_run` generates two endpoints

CLI Configuration:
```
    runtime_selection_policy: The policy to select the runtime (e.g., custom, round_robin).
    runtime_urls: Comma-separated list of runtime URLs.
    host: The host address for the server.
    port: The port number for the server.
    model: The model to be used (e.g., mistralai/Mistral-7B-v0.1).
```
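
As a rough illustration of how these options fit together, the options listed above can be modelled with `argparse`. This is a sketch, not the project's actual CLI. The flag spellings, the defaults (taken from the examples in this README), and the helper names `parse_runtime_urls` and `build_parser` are all assumptions.

```python
import argparse
from typing import List


def parse_runtime_urls(value: str) -> List[str]:
    """Split a comma-separated URL string, dropping blanks and stray whitespace."""
    return [u.strip() for u in value.split(",") if u.strip()]


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser mirroring the options listed above.
    parser = argparse.ArgumentParser(description="Sketch of Preble server options")
    parser.add_argument("--runtime_selection_policy", default="custom",
                        help="Policy used to select a runtime (e.g., custom, round_robin)")
    parser.add_argument("--runtime_urls", type=parse_runtime_urls, default=[],
                        help="Comma-separated list of runtime URLs")
    parser.add_argument("--host", default="127.0.0.1",
                        help="Host address for the server")
    parser.add_argument("--port", type=int, default=8000,
                        help="Port number for the server")
    parser.add_argument("--model", default="mistralai/Mistral-7B-v0.1",
                        help="Model to be used")
    return parser
```

For example, `build_parser().parse_args(["--runtime_urls", "http://127.0.0.1:30000/generate,http://127.0.0.1:30001/generate"])` yields a namespace whose `runtime_urls` attribute is a two-element list.
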

## Citation and Acknowledgment
The code is forked from SGLang. To cite Preble:
```
@inproceedings{
srivatsa2025preble,
title={Preble: Efficient Distributed Prompt Scheduling for {LLM} Serving},
author={Vikranth Srivatsa and Zijian He and Reyna Abhyankar and Dongming Li and Yiying Zhang},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
}
```

# PyPI build and install instructions
The package is currently uploaded to TestPyPI. Build, upload, and install it with:
```
python setup.py bdist_wheel
twine upload --repository testpypi dist/* --verbose
python3 -m pip install --index-url https://test.pypi.org/simple/ preble
```

 
# Test running the server in isolation
Launch a single SGLang runtime in isolation to help debug issues:
```
python -m sglang.launch_server --model-path mistralai/Mistral-7B-v0.1 --cuda-devices 0 --tp-size 2 
```

# License

This project is licensed under the Apache 2.0 License. See the LICENSE file for details.