TL;DR
- Open-source LLM deployment framework from the MLC (Machine Learning Compilation) project, built on TVM Unity.
- Compiles models ahead of time into optimised native libraries — CUDA, Metal, Vulkan, WebGPU, OpenCL — with no Python runtime required.
- Targets desktop, mobile, browser and embedded devices, with notable demos including in-browser Llama 3 8B via WebGPU.
- Trades compile-time complexity for runtime portability — once compiled, the engine is a tiny native binary or WASM module.
Overview#
MLC-LLM grew out of the Machine Learning Compilation course taught by Tianqi Chen and collaborators, and is built on the TVM Unity compiler stack. The premise: instead of writing a runtime that interprets model graphs at execution time, ahead-of-time compile each model into a self-contained native library tuned for the target device.
The compiled library bundles a kernel for every operator the model needs, configured for the exact dtype, quantisation scheme, sequence length and parallelism layout. The result runs without TVM, PyTorch or Python — just the compiled binary, the weights and a thin C++ or JavaScript runtime.
Compilation Pipeline#
- Convert a model from HuggingFace into the MLC checkpoint format.
- Apply quantisation (q4f16, q3f16, fp16, fp8) and weight packing.
- Use TVM Unity to lower the model graph to device-specific kernels.
- Emit a shared library (`.so`, `.dylib`, `.dll`), Metal library, WASM/WebGPU module or Android `.so`.
- Link the library with the MLC runtime in the host application.
Target Devices#
MLC-LLM's reach is unusually broad: NVIDIA CUDA, AMD ROCm, Apple Metal on Mac and iOS, Vulkan on most desktop and mobile GPUs, OpenCL on older Android GPUs, and WebGPU in the browser. The Web-LLM project is the in-browser flavour — Llama 3 8B running entirely client-side over WebGPU, with model weights cached in the browser.
Trade-offs versus llama.cpp#
Both projects share the goal of running quantised LLMs on consumer hardware. The differences are philosophical. llama.cpp interprets a single binary runtime against any GGUF model, optimising via hand-tuned kernels. MLC-LLM compiles each model into its own binary, optimising via the TVM auto-scheduler and operator fusion.
On Apple Silicon llama.cpp's Metal backend and MLC's Metal backend perform within ten to twenty percent of each other on most workloads; on the web MLC has no equivalent. The trade-off is workflow: llama.cpp is point-and-shoot, MLC needs a build step per (model × device × quantisation) combination.
Server Mode#
MLC-LLM also ships a Python serving package (`mlc_llm.serve`) that exposes an OpenAI-compatible REST API and supports continuous batching, paged KV cache and speculative decoding on NVIDIA, AMD and Apple Silicon. For server-side deployments it is a reasonable alternative to vLLM on AMD and Apple targets.
# Compile and serve a Llama 3.1 8B for Metal
mlc_llm convert_weight ./Meta-Llama-3.1-8B-Instruct \
--quantization q4f16_1 -o ./dist/llama3-8b
mlc_llm gen_config ./Meta-Llama-3.1-8B-Instruct \
--quantization q4f16_1 -o ./dist/llama3-8b
mlc_llm compile ./dist/llama3-8b/mlc-chat-config.json \
--device metal -o ./dist/llama3-8b/lib.dylib
mlc_llm serve ./dist/llama3-8b --port 8080When to Use#
Use MLC-LLM when the target is a non-NVIDIA accelerator (Apple Silicon, AMD desktop, Intel Arc, mobile GPU), when in-browser deployment is required, or when the application benefits from an ahead-of-time compiled binary with no Python runtime. For Kubernetes GPU fleets stick with vLLM or TensorRT-LLM; MLC's strengths are elsewhere on the spectrum.
References
- MLC-LLM on GitHub · GitHub
- MLC-LLM Documentation · MLC AI
- Web-LLM · GitHub