# preble

**Repository Path**: annanxue/preble

## Basic Information

- **Project Name**: preble
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: multi_modal_main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-04-03
- **Last Updated**: 2026-04-08

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Preble

Preble is a load balancer for efficient prefix caching systems.

Preprint available at https://arxiv.org/abs/2407.00023

## Installation

You can install the package using pip.

Editable Installation

```
pip3 install -e .
pip install -e "python[all]"
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3/
```

Regular Pip Installation:

```
pip3 install preble
pip install git+https://github.com/wuklab/preble.git#egg=preble[all]
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3/
```

We release a custom version of sglang that supports chunked prefill.

# Code Structure

The `multi_node` directory contains the code that runs as a separate abstraction layer on top of SGLang/vLLM in a distributed setting. It coordinates and manages execution across the distributed system.

## Programmatically starting the server

The server accepts a list of runtime URLs:

```
from preble.main import start_server

# Balance requests across two runtimes that are already serving /generate
start_server(
    runtime_selection_policy="custom",
    runtime_urls="http://127.0.0.1:30000/generate,http://127.0.0.1:30001/generate",
    host="127.0.0.1",
    port=8000,
    model="mistralai/Mistral-7B-v0.1",
)
```

It can also load the model dynamically onto separate CUDA devices:

```
from preble.main import start_server_and_load_models

# Load the model on CUDA devices 0 and 1, then start the server
start_server_and_load_models(
    model_name="mistralai/Mistral-7B-v0.1",
    devices=[0, 1],
    host="127.0.0.1",
    port=8000,
)
```

The server can be run via:

```
python3 multi_node/server/server.py
```

- `server` runs the server given a list of URLs
- `deploy_and_run` generates two endpoints

CLI Configuration

```
runtime_selection_policy: The policy to select the runtime (e.g., custom, round_robin).
runtime_urls: Comma-separated list of runtime URLs.
host: The host address for the server.
port: The port number for the server.
model: The model to be used (e.g., mistralai/Mistral-7B-v0.1).
```

## Citation And Acknowledgment

The code is forked from sglang.

```
@inproceedings{
srivatsa2025preble,
title={Preble: Efficient Distributed Prompt Scheduling for {LLM} Serving},
author={Vikranth Srivatsa and Zijian He and Reyna Abhyankar and Dongming Li and Yiying Zhang},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
}
```

# PyPI build and install instructions

Currently uploaded at:

```
python setup.py bdist_wheel
twine upload --repository testpypi dist/* --verbose
python3 -m pip install --index-url https://test.pypi.org/simple/ preble
```

# Test running the server in isolation

Launch the server in isolation to help debug issues:

```
python -m sglang.launch_server --model-path mistralai/Mistral-7B-v0.1 --cuda-devices 0 --tp-size 2
```

# License

This project is licensed under the Apache 2.0 License. See the LICENSE file for details.
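
# Illustrative sketch: runtime selection

To make the `runtime_urls` and `runtime_selection_policy` options from the CLI configuration more concrete, here is a minimal sketch of how a comma-separated URL list could be parsed and how a `round_robin` policy differs from a prefix-aware one. This is not Preble's implementation: `parse_runtime_urls`, `RoundRobinSelector`, `PrefixAffinitySelector`, and the fixed `prefix_len` are all hypothetical names and simplifications introduced only for illustration.

```python
from itertools import cycle


def parse_runtime_urls(runtime_urls: str) -> list[str]:
    """Split a comma-separated runtime_urls string into a clean list."""
    return [url.strip() for url in runtime_urls.split(",") if url.strip()]


class RoundRobinSelector:
    """Hypothetical helper: hands out runtime URLs in strict rotation."""

    def __init__(self, urls: list[str]) -> None:
        if not urls:
            raise ValueError("at least one runtime URL is required")
        self._urls = cycle(urls)

    def select(self) -> str:
        return next(self._urls)


class PrefixAffinitySelector:
    """Hypothetical helper: keeps prompts with a shared prefix on one runtime.

    Assumption: a fixed-length character prefix stands in for real
    prefix matching, which is far more involved in practice.
    """

    def __init__(self, urls: list[str], prefix_len: int = 16) -> None:
        if not urls:
            raise ValueError("at least one runtime URL is required")
        self._urls = list(urls)
        self._prefix_len = prefix_len
        self._assigned: dict[str, str] = {}          # prefix -> runtime URL
        self._load = {url: 0 for url in self._urls}  # requests per runtime

    def select(self, prompt: str) -> str:
        prefix = prompt[: self._prefix_len]
        url = self._assigned.get(prefix)
        if url is None:
            # Unseen prefix: place it on the least-loaded runtime.
            url = min(self._urls, key=lambda u: self._load[u])
            self._assigned[prefix] = url
        self._load[url] += 1
        return url


urls = parse_runtime_urls(
    "http://127.0.0.1:30000/generate, http://127.0.0.1:30001/generate"
)
rr = RoundRobinSelector(urls)
affinity = PrefixAffinitySelector(urls)
```

With round robin, two prompts that share a long system prompt may land on different runtimes, so neither can reuse the other's cached prefix. The affinity sketch keeps them together and only spreads new prefixes by load, which is the general intuition behind a prefix-aware load balancer.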