MLC-LLM

TL;DR

Open-source LLM deployment framework from the MLC (Machine Learning Compilation) project, built on TVM Unity.
Compiles models ahead of time into optimised native libraries — CUDA, Metal, Vulkan, WebGPU, OpenCL — with no Python runtime required.
Targets desktop, mobile, browser and embedded devices, with notable demos including in-browser Llama 3 8B via WebGPU.
Trades compile-time complexity for runtime portability — once compiled, the engine is a tiny native binary or WASM module.

Overview#

MLC-LLM grew out of the Machine Learning Compilation course taught by Tianqi Chen and collaborators, and is built on the TVM Unity compiler stack. The premise: instead of writing a runtime that interprets model graphs at execution time, ahead-of-time compile each model into a self-contained native library tuned for the target device.

The compiled library bundles a kernel for every operator the model needs, configured for the exact dtype, quantisation scheme, sequence length and parallelism layout. The result runs without TVM, PyTorch or Python — just the compiled binary, the weights and a thin C++ or JavaScript runtime.

Compilation Pipeline#

Convert a model from HuggingFace into the MLC checkpoint format.
Apply quantisation (q4f16, q3f16, fp16, fp8) and weight packing.
Use TVM Unity to lower the model graph to device-specific kernels.
Emit a shared library (`.so`, `.dylib`, `.dll`), Metal library, WASM/WebGPU module or Android `.so`.
Link the library with the MLC runtime in the host application.

Target Devices#

MLC-LLM's reach is unusually broad: NVIDIA CUDA, AMD ROCm, Apple Metal on Mac and iOS, Vulkan on most desktop and mobile GPUs, OpenCL on older Android GPUs, and WebGPU in the browser. The Web-LLM project is the in-browser flavour — Llama 3 8B running entirely client-side over WebGPU, with model weights cached in the browser.

Trade-offs versus llama.cpp#

Both projects share the goal of running quantised LLMs on consumer hardware. The differences are philosophical. llama.cpp interprets a single binary runtime against any GGUF model, optimising via hand-tuned kernels. MLC-LLM compiles each model into its own binary, optimising via the TVM auto-scheduler and operator fusion.

On Apple Silicon llama.cpp's Metal backend and MLC's Metal backend perform within ten to twenty percent of each other on most workloads; on the web MLC has no equivalent. The trade-off is workflow: llama.cpp is point-and-shoot, MLC needs a build step per (model × device × quantisation) combination.

Server Mode#

MLC-LLM also ships a Python serving package (`mlc_llm.serve`) that exposes an OpenAI-compatible REST API and supports continuous batching, paged KV cache and speculative decoding on NVIDIA, AMD and Apple Silicon. For server-side deployments it is a reasonable alternative to vLLM on AMD and Apple targets.

bash

# Compile and serve a Llama 3.1 8B for Metal
mlc_llm convert_weight ./Meta-Llama-3.1-8B-Instruct \
    --quantization q4f16_1 -o ./dist/llama3-8b
mlc_llm gen_config ./Meta-Llama-3.1-8B-Instruct \
    --quantization q4f16_1 -o ./dist/llama3-8b
mlc_llm compile ./dist/llama3-8b/mlc-chat-config.json \
    --device metal -o ./dist/llama3-8b/lib.dylib
mlc_llm serve ./dist/llama3-8b --port 8080

When to Use#

Use MLC-LLM when the target is a non-NVIDIA accelerator (Apple Silicon, AMD desktop, Intel Arc, mobile GPU), when in-browser deployment is required, or when the application benefits from an ahead-of-time compiled binary with no Python runtime. For Kubernetes GPU fleets stick with vLLM or TensorRT-LLM; MLC's strengths are elsewhere on the spectrum.

References

MLC-LLM on GitHub · GitHub
MLC-LLM Documentation · MLC AI
Web-LLM · GitHub

Overview#

Compilation Pipeline#

Convert a model from HuggingFace into the MLC checkpoint format.

Apply quantisation (q4f16, q3f16, fp16, fp8) and weight packing.

Use TVM Unity to lower the model graph to device-specific kernels.

Emit a shared library (`.so`, `.dylib`, `.dll`), Metal library, WASM/WebGPU module or Android `.so`.

Link the library with the MLC runtime in the host application.

Target Devices#

Trade-offs versus llama.cpp#

Server Mode#

bash

# Compile and serve a Llama 3.1 8B for Metal
mlc_llm convert_weight ./Meta-Llama-3.1-8B-Instruct \
    --quantization q4f16_1 -o ./dist/llama3-8b
mlc_llm gen_config ./Meta-Llama-3.1-8B-Instruct \
    --quantization q4f16_1 -o ./dist/llama3-8b
mlc_llm compile ./dist/llama3-8b/mlc-chat-config.json \
    --device metal -o ./dist/llama3-8b/lib.dylib
mlc_llm serve ./dist/llama3-8b --port 8080

When to Use#

MLC-LLM

Overview#

Compilation Pipeline#

Target Devices#

Trade-offs versus llama.cpp#

Server Mode#

When to Use#

References

Browse all entries

Deploy on Yobitel

MLC-LLM

Overview#

Compilation Pipeline#

Target Devices#

Trade-offs versus llama.cpp#

Server Mode#

When to Use#

References

Browse all entries

Deploy on Yobitel