TL;DR
- MIT-licensed open-source data framework for LLM applications, started by Jerry Liu as GPT-Index in November 2022 and renamed LlamaIndex in early 2023. Now governed at github.com/run-llama with contributions from Anthropic, Cohere, Pinecone, Weaviate, Qdrant, MongoDB, Snowflake and Yobitel.
- Where LangChain frames the problem as orchestration, LlamaIndex frames it as data ingestion and indexing. The core primitives are DataConnectors (loaders), NodeParsers (chunkers), Indexes (Vector / Summary / Tree / Keyword / KnowledgeGraph / PropertyGraph), Retrievers, QueryEngines, and Workflows for agentic orchestration.
- Available in Python (mature) and TypeScript (LlamaIndex.TS). Targets every major vector store (Pinecone, Weaviate, Qdrant, Milvus, Chroma, pgvector, OpenSearch, Vespa), every major embedding model, and any OpenAI-compatible LLM endpoint — including Yobibyte workspaces and Yobitel NeoCloud vLLM tenants.
- Commercial add-ons LlamaParse (layout-aware document parsing) and LlamaCloud (hosted ingestion + retrieval) are optional; the open-source framework is fully usable on its own.
- Yobitel uses LlamaIndex DataConnectors for MediQuery knowledge-base ingestion across hospital document corpora. Yobibyte exposes pgvector / Qdrant / Weaviate-backed vector stores that LlamaIndex's `VectorStoreIndex` targets out of the box — no Yobitel-specific connector required.
Overview
LlamaIndex began as GPT-Index in November 2022, a small library by Jerry Liu for indexing documents in a way GPT-3 could query without exhausting the context window. The reframing was important: instead of passing the model a flat dump of text, build an index over the corpus and let the model query it. The project was renamed LlamaIndex in early 2023 as it grew into a full framework, and Jerry Liu founded the company of the same name to maintain it.
By mid-2026 LlamaIndex is the closest competitor to LangChain by adoption and the natural choice when the application's centre of gravity is documents and retrieval rather than orchestration and tool variety. It is the framework Yobitel uses internally for MediQuery's clinical knowledge-base ingestion and the framework Yobibyte customers use most often when they bring their own corpus to a managed inference endpoint.
The defining design choice is composable indexes. Most RAG stacks pick one vector store and call retrieval from it. LlamaIndex models retrieval as a tree of typed indexes: a top-level SummaryIndex can route to per-document VectorStoreIndexes, which themselves route to PropertyGraphIndexes for relationship questions. This is the primitive set that pays rent on heterogeneous corpora — financial reports plus support tickets plus Slack history plus engineering specs — where one-vector-store-for-all loses signal.
This entry documents the LlamaIndex surface a production team actually uses: the four-layer Documents / Nodes / Indexes / QueryEngines model, the DataConnector catalogue (Hub now hosts 300+ loaders), the NodeParser strategies that decide chunk quality, the Workflows agent runtime, LlamaParse for hard PDFs, sizing and quota considerations, observability hooks, the security and compliance posture, and how LlamaIndex slots in next to LangChain, raw vector-store SDKs, and Yobitel's MediQuery managed application. This entry helps you ingest a heterogeneous document corpus into a production RAG pipeline against Yobibyte or Yobitel NeoCloud endpoints with the right index choice, chunking strategy and retrieval shape — and recognise where LlamaIndex earns its keep over hand-rolled retrieval.
Quick start
The example below installs the modern LlamaIndex split packages, ingests a local document corpus, builds a VectorStoreIndex backed by Qdrant, and serves a QueryEngine that uses a Yobibyte (or any OpenAI-compatible) LLM endpoint for response synthesis. The third step switches to LlamaParse for hard PDFs. The fourth step builds a Workflow that runs retrieval and synthesis as typed, event-driven steps.
# 1. Install the modern LlamaIndex split packages
pip install "llama-index-core>=0.12" \
"llama-index-llms-openai-like>=0.3" \
"llama-index-embeddings-huggingface>=0.4" \
"llama-index-vector-stores-qdrant>=0.4" \
"llama-index-readers-file>=0.4"
# 2. Ingest a corpus into Qdrant and query via Yobibyte
cat > rag.py <<'PY'
import os

import qdrant_client
from llama_index.core import Settings, SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.openai_like import OpenAILike
from llama_index.vector_stores.qdrant import QdrantVectorStore

# Yobibyte exposes OpenAI-compatible /v1: drop in OpenAILike with the base URL.
Settings.llm = OpenAILike(
    model="llama-3.1-70b-instruct",
    api_base=os.environ["LLM_BASE_URL"],  # e.g. https://api.yobibyte.example/v1
    api_key=os.environ["LLM_API_KEY"],
    is_chat_model=True, is_function_calling_model=True,
    context_window=32768, temperature=0,
)
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-large-en-v1.5")
Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=64)

vs = QdrantVectorStore(
    client=qdrant_client.QdrantClient(url=os.environ["QDRANT_URL"]),
    collection_name="kb",
)
# The vector store must be attached through a StorageContext; otherwise the
# index silently falls back to the default in-memory store.
storage_context = StorageContext.from_defaults(vector_store=vs)
docs = SimpleDirectoryReader("./corpus", recursive=True).load_data()
index = VectorStoreIndex.from_documents(docs, storage_context=storage_context, show_progress=True)
qe = index.as_query_engine(similarity_top_k=6, response_mode="compact")
print(qe.query("Summarise our NCSC alignment for OFFICIAL workloads."))
PY
LLM_BASE_URL=https://api.yobibyte.example/v1 \
LLM_API_KEY=sk-yb-... QDRANT_URL=http://qdrant:6333 python rag.py
# 3. Use LlamaParse for hard PDFs (tables, multi-column, scanned)
pip install "llama-cloud-services>=0.4"
cat > parse.py <<'PY'
import os
from llama_cloud_services import LlamaParse
parser = LlamaParse(api_key=os.environ["LLAMA_CLOUD_API_KEY"],
                    result_type="markdown", num_workers=4)
docs = parser.load_data("./reports/annual-report-2025.pdf")
# Returns Documents whose text is markdown with tables, headings and figure captions preserved.
PY
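# 4. Workflow-based RAG: retrieval and synthesis as typed, event-driven steps
# (Workflows ship with llama-index-core; no extra install is needed.)
cat > workflow.py <<'PY'
import asyncio, os, qdrant_client
from llama_index.core import Settings, VectorStoreIndex
from llama_index.core.response_synthesizers import CompactAndRefine
from llama_index.core.workflow import Workflow, step, Event, StartEvent, StopEvent
from llama_index.vector_stores.qdrant import QdrantVectorStore

# Configure Settings.llm and Settings.embed_model here exactly as in rag.py,
# then reattach to the "kb" collection that rag.py populated.
# The async client is required because the workflow calls aretrieve().
index = VectorStoreIndex.from_vector_store(QdrantVectorStore(
    client=qdrant_client.QdrantClient(url=os.environ["QDRANT_URL"]),
    aclient=qdrant_client.AsyncQdrantClient(url=os.environ["QDRANT_URL"]),
    collection_name="kb"))

class RagEvent(Event):
    query: str
    nodes: list

class RagWorkflow(Workflow):
    @step
    async def retrieve(self, ev: StartEvent) -> RagEvent:
        nodes = await index.as_retriever(similarity_top_k=8).aretrieve(ev.query)
        return RagEvent(query=ev.query, nodes=nodes)

    @step
    async def synthesise(self, ev: RagEvent) -> StopEvent:
        synth = CompactAndRefine(llm=Settings.llm)
        # Use the async API so the LLM call does not block the event loop.
        response = await synth.asynthesize(ev.query, ev.nodes)
        return StopEvent(result=str(response))

async def main():
    print(await RagWorkflow(timeout=60).run(query="What is paged attention?"))

asyncio.run(main())
PY

The `OpenAILike` LLM wrapper plus `HuggingFaceEmbedding` is the canonical combination for Yobitel customers: LlamaIndex points at Yobibyte for generation and at a local embedding model for ingestion, with no Yobitel-specific code path required. Only `api_base` and `api_key` change between providers.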
How it works
LlamaIndex is structured as a layered data pipeline. Ingestion starts with a DataConnector (Reader) that pulls raw content from a source — a filesystem, S3 bucket, Confluence space, Notion workspace, Slack export, GitHub repo, Postgres table, web page, or a custom reader. The connector emits Document objects: text plus metadata. A NodeParser chunks each Document into Node objects with parent / child relationships; the parser choice (SentenceSplitter, TokenTextSplitter, MarkdownNodeParser, HTMLNodeParser, SemanticSplitterNodeParser) determines retrieval quality more than any other knob.
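To make the chunk size / overlap trade-off concrete, here is a minimal, library-free sketch of sentence-aware chunking. It is not LlamaIndex's `SentenceSplitter`, which budgets in tokens and handles many more edge cases. The function name `chunk_sentences` and the word-based budget are assumptions made for illustration.

```python
import re


def chunk_sentences(text: str, chunk_size: int = 40, chunk_overlap: int = 10) -> list[str]:
    """Greedily pack whole sentences into chunks; budgets are in words (an assumption)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: list[str] = []
    current: list[str] = []  # sentences in the chunk being built
    for sentence in sentences:
        words_in_current = sum(len(s.split()) for s in current)
        if current and words_in_current + len(sentence.split()) > chunk_size:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward until the overlap budget is used up.
            carried: list[str] = []
            for prev in reversed(current):
                if sum(len(s.split()) for s in carried) + len(prev.split()) > chunk_overlap:
                    break
                carried.insert(0, prev)
            current = carried
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


chunks = chunk_sentences("A b c. D e f. G h i. J k l.", chunk_size=6, chunk_overlap=3)
```

Each chunk repeats the last sentence of its predecessor, which is what lets a retrieved chunk carry enough surrounding context to be answerable on its own.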
Nodes are then embedded and inserted into one or more Indexes. The Index is the central abstraction: it owns the storage backend (a vector store, a docstore, an index store), the embedding model, and the retrieval strategy. The same corpus can power multiple indexes simultaneously — a VectorStoreIndex for semantic search, a SummaryIndex for whole-document synthesis, a PropertyGraphIndex for entity-relationship queries — and a top-level RouterQueryEngine dispatches each incoming query to the right one.
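The dispatch idea behind a router can be sketched in a few lines. The real `RouterQueryEngine` asks an LLM-backed selector to choose among engine descriptions; the sketch below swaps in naive word overlap so that it runs without a model. `EngineTool` and `route` are hypothetical names, not LlamaIndex classes.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EngineTool:
    name: str
    description: str                # what this engine is good at
    query_fn: Callable[[str], str]  # stand-in for a query engine's .query()


def route(query: str, tools: list[EngineTool]) -> EngineTool:
    """Pick the tool whose description shares the most words with the query."""
    query_words = set(query.lower().split())
    return max(tools, key=lambda t: len(query_words & set(t.description.lower().split())))


tools = [
    EngineTool("vector", "specific facts lookup semantic search", lambda q: f"[vector] {q}"),
    EngineTool("summary", "summarise whole document overview", lambda q: f"[summary] {q}"),
    EngineTool("graph", "relationship between entities", lambda q: f"[graph] {q}"),
]
chosen = route("summarise the whole annual report", tools)
answer = chosen.query_fn("summarise the whole annual report")
```

The takeaway is that routing quality depends on how well each engine is described, which holds whether the selector is word overlap or an LLM.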
At query time, a Retriever fetches the most relevant Nodes given a query. Retrievers compose: VectorIndexRetriever, BM25Retriever, KnowledgeGraphRetriever, and AutoMergingRetriever can be combined into a QueryFusionRetriever that reciprocal-rank-fuses their results, or wrapped in a ContextChatEngine for multi-turn. Retrieved nodes are passed to a Response Synthesiser (Refine, Compact, TreeSummarize, Accumulate) that calls the LLM to produce a final answer with citations back to the source Nodes.
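Reciprocal rank fusion itself is small enough to show in full. The sketch below is a generic implementation of the technique, not LlamaIndex's internal code; the smoothing constant `k=60` is the commonly used default and should be treated as an assumption, and the node ids are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked id lists: each appearance contributes 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, node_id in enumerate(ranking, start=1):
            scores[node_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


vector_hits = ["n3", "n1", "n7"]  # hypothetical dense-retriever ranking
bm25_hits = ["n1", "n9", "n3"]    # hypothetical keyword-retriever ranking
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
fused_ids = [node_id for node_id, _ in fused]
```

Because only ranks are used, the fusion needs no score normalisation between retrievers, which is why it is a robust default when BM25 and vector scores live on different scales.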
Workflows, introduced in 0.11 and now the default agent runtime, replace the legacy `Agent` class with a typed event-driven framework. Steps are async functions decorated with `@step`; events are typed objects that flow between steps. The runtime dispatches events, runs steps in parallel where the dependency graph allows, and supports streaming, checkpointing, and human-in-the-loop. Workflows is conceptually similar to LangChain's LangGraph but takes an event-driven rather than graph-of-nodes approach — both are reasonable choices, and many production stacks compose the two.
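The event-driven model can be illustrated without the library. The sketch below is a toy dispatcher built on `asyncio`, under the assumption that each event type has exactly one consuming step; it runs steps sequentially, whereas the real runtime schedules them concurrently and adds streaming and checkpointing. All class and function names here are hypothetical stand-ins.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class StartEvent:
    query: str


@dataclass
class RetrievedEvent:
    query: str
    nodes: list[str]


@dataclass
class StopEvent:
    result: str


async def retrieve(ev: StartEvent) -> RetrievedEvent:
    # Stand-in for a retriever call.
    return RetrievedEvent(query=ev.query, nodes=[f"chunk about {ev.query}"])


async def synthesise(ev: RetrievedEvent) -> StopEvent:
    # Stand-in for an LLM synthesis call.
    return StopEvent(result=f"answer from {len(ev.nodes)} node(s)")


# Dispatch table: the type of an event decides which step consumes it.
STEPS = {StartEvent: retrieve, RetrievedEvent: synthesise}


async def run(query: str) -> str:
    event = StartEvent(query=query)
    while not isinstance(event, StopEvent):
        event = await STEPS[type(event)](event)
    return event.result


result = asyncio.run(run("paged attention"))
```

A step that returns an event type with no registered consumer would raise a `KeyError` here; in a real workflow the analogous mistake shows up as a run that never reaches its stop event.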
- Document — raw input from a DataConnector with text + metadata.
- Node — chunked unit of a Document with parent / child / next / previous relationships preserved.
- DataConnector (Reader) — pulls content from a source. 300+ readers in LlamaHub: filesystem, S3, GitHub, Confluence, Notion, Slack, Jira, Salesforce, Postgres, MongoDB, web, etc.
- NodeParser — chunking strategy: Sentence / Token / Markdown / HTML / SemanticSplitter / HierarchicalNodeParser.
- Embedding — vector for a Node; HuggingFace, OpenAI, Cohere, Voyage, BGE, E5 supported uniformly.
- VectorStore — backing store: Qdrant, Pinecone, Weaviate, Milvus, Chroma, pgvector, OpenSearch, Vespa, Mongo Atlas, Azure AI Search, plus a default in-memory store.
- Index — Vector / Summary / Tree / KeywordTable / KnowledgeGraph / PropertyGraph / Composable.
- Retriever — Vector / BM25 / KnowledgeGraph / AutoMerging / Recursive / Fusion / SubQuestion.
- QueryEngine — Retriever + ResponseSynthesiser. RouterQueryEngine dispatches across multiple engines.
- Workflow — event-driven agent runtime with @step decorators.
- AgentRunner / AgentWorker — pre-built ReAct, OpenAI tool-calling, and Anthropic tool-use agents wired to a QueryEngine.
LlamaIndex 0.10 (Feb 2024) split the monorepo into llama-index-core plus per-integration packages. Anything written against pre-0.10 imports (`from llama_index import …`) needs migration — the modern path is `from llama_index.core import …` and explicit per-integration package installs.
Reference and specifications
The table below is the canonical reference for the primitive classes most production teams touch. Anything ending in `Index` constructs from documents or nodes via `from_documents` / `from_nodes` factory methods; anything ending in `QueryEngine` is built from an Index via `.as_query_engine()`.
| Symbol | Package | Surface | Typical use |
|---|---|---|---|
| SimpleDirectoryReader | llama-index-core | SimpleDirectoryReader(input_dir, recursive, file_extractor).load_data() | Filesystem ingestion of PDFs, DOCX, MD, HTML, images. |
| LlamaParse | llama-cloud-services | LlamaParse(api_key, result_type, num_workers).load_data(file_path) | Layout-aware parsing of hard PDFs / scanned forms / slide decks. |
| SentenceSplitter | llama-index-core | chunk_size, chunk_overlap, separator | Default NodeParser; good general-purpose. |
| SemanticSplitterNodeParser | llama-index-core | embed_model, breakpoint_percentile_threshold | Chunks at semantic boundaries via embedding distance. |
| VectorStoreIndex | llama-index-core | from_documents / from_nodes / as_retriever / as_query_engine | Dense embedding retrieval. The most-used Index. |
| SummaryIndex | llama-index-core | from_documents / as_query_engine | Returns all nodes, lets the synthesiser summarise; for whole-document QA. |
| PropertyGraphIndex | llama-index-core | from_documents(kg_extractors=) | Entity + relationship graph; ideal for relationship-heavy corpora. |
| KnowledgeGraphIndex | llama-index-core | from_documents(kg_triplet_extract_fn=) | Older triplet-based KG; PropertyGraphIndex is the modern replacement. |
| QueryFusionRetriever | llama-index-core | retrievers, num_queries, mode='reciprocal_rerank' | Combines BM25 + vector + custom retrievers via RRF. |
| RouterQueryEngine | llama-index-core | selector, query_engine_tools | Dispatches across multiple QueryEngines based on LLM-side routing. |
| RetrieverQueryEngine | llama-index-core | retriever, response_synthesizer, node_postprocessors | Retrieve + synthesise + rerank in one pipeline. |
| OpenAILike (LLM) | llama-index-llms-openai-like | model, api_base, api_key, is_chat_model, is_function_calling_model | Connects to any OpenAI-compatible endpoint — Yobibyte / NeoCloud vLLM / TGI / TRT-LLM. |
| HuggingFaceEmbedding | llama-index-embeddings-huggingface | model_name, embed_batch_size, device | Local SentenceTransformer-style embedding. |
| Workflow / @step / StartEvent / StopEvent | llama-index-core | Async event-driven agent runtime | Default for new agent work. |
| IngestionPipeline | llama-index-core | transformations, vector_store, docstore | Reproducible ingest with deduplication and caching. |
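Pin `llama-index-core` and every `llama-index-*` integration package together as one tested set in a lock file. Integration packages are versioned independently of core (the quick start pairs core 0.12 with 0.3 and 0.4 integrations), so the goal is a mutually compatible set rather than matching version numbers. Drift between core and an integration built against an older core is the single most common cause of `ImportError` and silent retrieval misbehaviour.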
Workload patterns
Three application shapes cover the bulk of production LlamaIndex usage: (A) Knowledge assistant over a heterogeneous corpus (PDFs + Confluence + Slack + tickets), (B) Document-grounded chatbot with multi-turn memory and citations, (C) Structured extraction agent that reads documents and emits typed records into a database. Each maps to a recognisable LlamaIndex composition of the kind Yobitel's MediQuery application uses as its ingestion fabric.
- Pattern A (knowledge assistant): IngestionPipeline reads from multiple connectors, NodeParser chunks per source-specific strategy (SentenceSplitter for prose, MarkdownNodeParser for technical docs, HTMLNodeParser for crawled web), VectorStoreIndex backs into Qdrant or pgvector, RouterQueryEngine routes queries to per-domain sub-engines, and a Yobibyte-hosted LLM synthesises responses with citations.
- Pattern B (document chatbot): the same index, wrapped in a ContextChatEngine that maintains rolling conversation memory and reuses retrieved context.
- Pattern C (structured extraction): PropertyGraphIndex extracts entities and relationships at ingest time; a Workflow agent queries the graph and emits Pydantic-typed records that flow into Postgres.
# Pattern A: multi-source knowledge assistant on Yobibyte
import os

import qdrant_client
from llama_index.core import Settings
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.readers.confluence import ConfluenceReader
from llama_index.readers.web import SimpleWebPageReader
from llama_index.vector_stores.qdrant import QdrantVectorStore

qclient = qdrant_client.QdrantClient(url=os.environ["QDRANT_URL"])
pipeline = IngestionPipeline(
    transformations=[SentenceSplitter(chunk_size=512, chunk_overlap=64),
                     Settings.embed_model],  # configured as in the quick start
    vector_store=QdrantVectorStore(client=qclient, collection_name="kb"),
)
# The environment variable name below is an assumption; read the token from your secret manager.
confluence = ConfluenceReader(base_url="https://wiki.example",
                              api_token=os.environ["CONFLUENCE_API_TOKEN"])
docs = (confluence.load_data(space_key="ENG")
        + SimpleWebPageReader().load_data(urls=["https://docs.example/..."]))
pipeline.run(documents=docs)
# Pattern B: document chatbot with rolling memory + citations
from llama_index.core.chat_engine import ContextChatEngine
from llama_index.core.memory import ChatMemoryBuffer

chat = ContextChatEngine.from_defaults(
    retriever=index.as_retriever(similarity_top_k=6),  # index built as in the quick start
    llm=Settings.llm,
    context_template="Use only the context below. Cite [n] per claim.\n{context_str}",
    # The rolling-memory budget is set on the memory object, not on the engine.
    memory=ChatMemoryBuffer.from_defaults(token_limit=4000),
)
print(chat.chat("Summarise yesterday's incident from the postmortem.").response)
# Pattern C: structured extraction
from pydantic import BaseModel
from llama_index.core.program import LLMTextCompletionProgram

class Patient(BaseModel):
    nhs_number: str
    admit_date: str
    diagnosis_codes: list[str]

prog = LLMTextCompletionProgram.from_defaults(
    output_cls=Patient,
    prompt_template_str="Extract NHS, admit date and ICD codes from:\n{text}",
    llm=Settings.llm,
)
record = prog(text=document_text)  # document_text: raw text of one source document

For Yobitel customers, the simplest production shape is: LlamaIndex IngestionPipeline writes to a Yobibyte-hosted Qdrant or pgvector tenant, OpenAILike points at the same workspace's LLM endpoint, and the application persists nothing locally. Yobibyte handles the vector-store SLA and the LLM SLA; the customer owns the ingestion code and the corpus.
Sizing and capacity planning
LlamaIndex itself is light. The cost lives in three places: the embedding throughput at ingest, the LLM round-trip at query time, and the vector-store latency. The table below is realistic per-call overhead measured against a Yobibyte H100 vLLM tenant from a Python 3.11 client on the same continent; treat it as a planning anchor and re-measure for your own topology.
The single biggest ingest-throughput lever is batched embedding. The default embed_batch_size of 10 leaves most embedding endpoints idle; raising it to 64-128 (for HuggingFace local) or to the provider's max (32-256 for OpenAI / Cohere) often increases ingest throughput 5-10x without quality cost. For very large corpora (>10M chunks), scale horizontally by raising `num_workers` on `IngestionPipeline.run()` in step with the Qdrant / Weaviate shard count, so that parallel writers are spread across shards rather than contending for one.
| Dimension | Typical value | Notes |
|---|---|---|
| LlamaIndex framework overhead per query | 5-20 ms | Excludes retrieval and LLM round-trip. |
| Embedding throughput (BGE-large local, A100) | 200-600 chunks/s | Per replica; batch_size=64-128. |
| Embedding throughput (BGE-large local, H100) | 400-1200 chunks/s | FlashAttention-2; batch_size=128. |
| Qdrant retrieval (k=6, 10M vectors) | 10-40 ms | HNSW index; CPU-bound at high QPS. |
| pgvector retrieval (k=6, 1M vectors) | 20-80 ms | IVFFlat or HNSW; index-build cost on every update. |
| LLM TTFT on Yobibyte (Llama-70B) | 300-600 ms | Dominates the query budget. |
| LlamaParse throughput (premium tier) | 3-8 pages/s | Async API; size for total page count, not chunk count. |
| Workflow step overhead | 5-15 ms | Per @step; checkpointer adds 3-10 ms. |
| Concurrent queries per Python process | 50-500 | Async-bound; embedding model on GPU is the limit. |
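The batching lever described above is easy to quantify. The sketch below only counts round-trips for a hypothetical 10,000-chunk corpus; it does not measure throughput, and the `batched` and `embedding_calls` helpers are illustrative names rather than LlamaIndex APIs.

```python
import math
from typing import Iterator


def batched(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embedding_calls(num_chunks: int, batch_size: int) -> int:
    """Number of embedding requests needed to cover num_chunks."""
    return math.ceil(num_chunks / batch_size)


chunks = [f"chunk-{i}" for i in range(10_000)]
calls_default = sum(1 for _ in batched(chunks, 10))   # default batch size
calls_large = sum(1 for _ in batched(chunks, 128))    # tuned batch size
```

Going from 1,000 requests to 79 removes most of the per-request overhead; the actual speed-up depends on the endpoint and GPU memory, so re-measure rather than assume it is proportional.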
Limits and quotas
LlamaIndex itself imposes no hard limits. Every limit you hit will be one set by the upstream LLM provider, the vector store, LlamaParse, or the host process. The list below covers what bites in production when running LlamaIndex against Yobibyte, Yobitel NeoCloud, or third-party endpoints.
- Context window — bound by the upstream LLM. A 32K-context Yobibyte tenant rejects synthesis prompts that exceed it; the Refine and Compact response modes chunk the context for you, but extremely high top_k can still overflow.
- Embedding batch size — provider-specific. OpenAI embeddings cap at 8191 tokens per item and 2048 items per batch; HuggingFace local has no protocol cap but GPU memory ceilings apply.
- Vector store dimensions: Qdrant supports up to 65536 dims; pgvector can index `vector` columns up to 2,000 dims with either IVFFlat or HNSW, and `halfvec` columns up to 4,000 dims.
- LlamaParse — quota per LlamaCloud plan; the premium tier ships 7000 pages/day on the entry-level paid plan as of mid-2026.
- Token quota at the LLM — Yobibyte workspaces cap QPS and tokens per minute per tier; configure OpenAILike `max_retries` and `timeout` to absorb 429s.
- Index size — VectorStoreIndex itself does not cap; the backing store does. Qdrant single-node holds ~100M vectors comfortably, ~1B with proper sharding.
- Workflow recursion — defaults to a step limit per execution; long-horizon workflows should split into sub-workflows rather than fight the cap.
- DocStore size: the default SimpleDocumentStore (in-memory, persisted to JSON) is fine up to a few hundred thousand nodes; promote to Postgres or MongoDB beyond that.
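For the context-window limit in the list above, a back-of-envelope budget check catches overflow before the endpoint rejects the prompt. This is a rough sketch: the template, query and output token figures are assumed placeholders, and real token counts depend on the tokenizer.

```python
def synthesis_prompt_tokens(top_k: int, chunk_tokens: int,
                            template_tokens: int = 300, query_tokens: int = 50) -> int:
    """Approximate prompt size if all top_k chunks are stuffed into one call."""
    return top_k * chunk_tokens + template_tokens + query_tokens


def fits_context(top_k: int, chunk_tokens: int, context_window: int,
                 max_output_tokens: int = 1024) -> bool:
    """True if prompt plus reserved output fits inside the model's context window."""
    return synthesis_prompt_tokens(top_k, chunk_tokens) + max_output_tokens <= context_window


ok_small = fits_context(top_k=6, chunk_tokens=512, context_window=32768)
ok_large = fits_context(top_k=64, chunk_tokens=512, context_window=32768)
```

With 512-token chunks, `top_k=6` uses only a small fraction of a 32K window, while `top_k=64` already exceeds it once the template and output reservation are counted.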
Do not run a production ingest with the default in-memory docstore + in-memory vector store. Lose the process, lose the index. Configure persistent backends from day one — Qdrant or pgvector for vectors, Postgres or Redis for the docstore.
Observability
LlamaIndex integrates with three observability tiers. First-party: a global `set_global_handler` API that ships callbacks to Arize Phoenix, Langfuse, Weights & Biases, MLflow, and a handful of others with a one-line change. Standards-based: OpenTelemetry GenAI spans via OpenLLMetry (Traceloop) auto-instrument the underlying LLM and embedding SDK calls into your existing OTLP collector. Open-source UI: Arize Phoenix is the most polished open-source option and the one the LlamaIndex team explicitly invests in.
For Yobitel customers who run LlamaIndex on Yobibyte, the LLM-side spans are already emitted at the Yobibyte gateway; the ingest-side spans (embedding, chunking, vector-store writes) live in the customer's application and need their own collector. The combination gives end-to-end traces from the customer's `query_engine.query(...)` call through retrieval, through synthesis on Yobibyte's managed inference, back to the response.
- Per-query attributes worth recording: query text, retriever name, retrieved node ids and scores, response synthesiser mode, LLM model, prompt token count, completion token count, latency split (retrieval / synthesis), citation hit rate.
- Per-ingest attributes: document source, document hash, chunk count, embedding model, embedding latency, vector-store write latency, error class.
- Phoenix instrumentation captures the entire QueryEngine tree as nested spans; trace the same run from Yobibyte's gateway view by correlating on `trace_id`.
- Sample production queries that meet explicit criteria (low retrieval scores, citation gaps, errors) into an evaluation dataset; replay through Ragas faithfulness / context precision metrics in CI.
# Side-by-side: Phoenix for LlamaIndex spans, OTel for upstream LLM spans
import os

import phoenix as px
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor
from phoenix.otel import register
from traceloop.sdk import Traceloop

# (1) Phoenix: local UI for nested QueryEngine traces
px.launch_app()
# Keep Phoenix's tracer provider non-global so it does not collide with Traceloop's.
phoenix_provider = register(set_global_tracer_provider=False)
LlamaIndexInstrumentor().instrument(tracer_provider=phoenix_provider)

# (2) OpenLLMetry: vendor-neutral spans to your existing OTLP backend
Traceloop.init(app_name="kb-rag", api_endpoint=os.environ["OTLP_ENDPOINT"])

# Both run side-by-side; Phoenix sees the QueryEngine -> Retriever ->
# Synthesiser tree, Datadog / Honeycomb / Tempo sees per-call LLM spans.

Cost and FinOps
LlamaIndex itself is free (MIT). The cost lives in four places: embedding tokens at ingest, LLM tokens at query, vector-store storage and queries, and LlamaParse pages (if used). The single largest variable cost is re-embedding on schema or model changes — switching embedding models means re-embedding the entire corpus. Plan for the cost of one full re-embed when you choose a model; budget a second full re-embed before committing to production.
Yobibyte's bundled embedding + inference offers a single billing surface for the most common pattern (BGE-large embeddings at ingest, Llama-70B at query). Direct provider billing (OpenAI text-embedding-3-large at $0.13/1M tokens, Cohere embed-v3 at $0.10/1M, Voyage 3-large at $0.18/1M) is the alternative; the trade-off is one billing line vs four.
| Cost component | Typical USD range | Driver |
|---|---|---|
| LlamaIndex licence | $0 | MIT, no surcharge. |
| Embedding (local BGE-large on H100) | Compute only (~$2-4/hr) | No per-token charge; the only cost is GPU time. |
| Embedding (OpenAI text-embedding-3-large) | $0.13 per 1M input tokens | API cost; scales linearly with corpus. |
| LLM tokens via Yobibyte (Llama-70B FP8) | $0.40-1.20 per 1M input + $0.80-2.40 per 1M output | Workspace pricing; prompt-caching aware. |
| LlamaParse (premium tier) | $0.003-0.015 per page | Layout-aware; price varies by document complexity. |
| Qdrant Cloud (small) | $30-200/month | 1M-10M vectors; HNSW index. |
| Qdrant Cloud (large) | $500-5,000/month | 100M+ vectors; sharded clusters. |
| pgvector on managed Postgres | $50-1,000/month | Bundled with Postgres tier; index rebuild on schema change. |
Before committing to an embedding model, build an `EmbeddingQAFinetuneDataset` from a sample of your corpus (for example with `generate_question_context_pairs`) and score each candidate model with `RetrieverEvaluator` on hit rate and MRR. The cost of one evaluation pass is trivial; the cost of discovering six months in that a different embedding model would have given you 12 percent better retrieval is a full re-embed.
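The re-embedding cost is simple arithmetic, sketched below using the per-token API prices quoted in this section. The corpus size and the 512-token average chunk length are assumptions; substitute your own figures.

```python
# USD per 1M input tokens, as quoted in this section.
PRICE_PER_MILLION_TOKENS = {
    "openai/text-embedding-3-large": 0.13,
    "cohere/embed-v3": 0.10,
    "voyage/3-large": 0.18,
}


def reembed_cost_usd(num_chunks: int, tokens_per_chunk: int, model: str) -> float:
    """API cost of embedding the whole corpus once with the given model."""
    total_tokens = num_chunks * tokens_per_chunk
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS[model]


# Hypothetical corpus: 10M chunks averaging 512 tokens each.
cost = reembed_cost_usd(10_000_000, 512, "openai/text-embedding-3-large")
```

For this hypothetical corpus one full pass costs roughly 666 USD on the OpenAI model, so budgeting for two passes (one to choose, one to commit) means planning for a little over 1,300 USD in embedding spend.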
Security and compliance
LlamaIndex inherits the security posture of the connectors, the vector store, the LLM endpoint and LlamaParse. The checklist below is the working production posture that aligns with Yobibyte's NCSC Cloud Security Principle alignment, UK GDPR Article 32 evidence requirements, and the spirit of the OWASP LLM Top 10.
- Connector credentials — Confluence, Notion, Slack, Salesforce readers all consume API tokens that inherit the human's full scope. Provision per-service-account tokens with minimum read-only scope, rotate quarterly, store in a secret manager rather than in code.
- PII at ingest: DataConnectors do not redact. Personal data in source documents ends up in the vector store; under UK GDPR the vector store becomes a personal-data store with Article 30 / 32 obligations. Pre-redact with Microsoft Presidio (for example via the `llama-index-postprocessor-presidio` integration) or route through a per-document classifier at ingest.
- Vector store access control — Qdrant, Pinecone, Weaviate and pgvector each have their own RBAC story; LlamaIndex does not abstract over it. Enforce per-tenant collection isolation at the store layer, not at the application layer.
- Tool injection in Workflow agents — same risks as any agent framework. Validate tool arguments against tight JSON schemas, route destructive actions through HITL, treat retrieved Node text as untrusted input to downstream tools.
- LlamaParse — uploaded documents go to a hosted service unless you self-host. For regulated workloads (PII, PHI, classified) verify the contractual stance and use the self-hosted enterprise tier or stay on open-source loaders.
- Embedding model leak — fine-tuning embeddings on private data and publishing the resulting weights leaks corpus content. Either keep the fine-tuned embeddings private or use a privacy-preserving fine-tune protocol.
- Audit logging — instrument query and ingest paths with structured logs (user id, query, retrieved node ids, response token count). Required for GDPR Article 30 records of processing and for incident forensics.
- For UK Sovereign deployments — terminate the LLM `api_base` inside the regulated boundary (Yobitel NeoCloud UK), self-host or contractually scope LlamaParse, keep the vector store and docstore inside the regulatory region, and align KMS key control with NCSC Principle 11.
The vector store is the new database in your data architecture. Apply the same data-classification, retention and right-to-erasure discipline you would apply to a customer-data RDBMS — including DELETE-by-source workflows triggered by GDPR erasure requests.
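A DELETE-by-source workflow only works if every chunk records which source document it came from. The toy in-memory store below illustrates that bookkeeping; `SourceIndexedStore` is a hypothetical class, not a LlamaIndex or vector-store API. In LlamaIndex the analogous operation is deleting by reference document id (`delete_ref_doc` on an index), and the backing store must also support deletion by that key.

```python
from collections import defaultdict


class SourceIndexedStore:
    """Toy vector store that tracks which node ids belong to each source document."""

    def __init__(self) -> None:
        self._vectors: dict[str, list[float]] = {}
        self._by_source: dict[str, set[str]] = defaultdict(set)

    def upsert(self, node_id: str, source_id: str, vector: list[float]) -> None:
        self._vectors[node_id] = vector
        self._by_source[source_id].add(node_id)

    def delete_by_source(self, source_id: str) -> int:
        """Remove every node derived from source_id; return how many were erased."""
        node_ids = self._by_source.pop(source_id, set())
        for node_id in node_ids:
            self._vectors.pop(node_id, None)
        return len(node_ids)

    def __len__(self) -> int:
        return len(self._vectors)


store = SourceIndexedStore()
store.upsert("n1", "patient-letter-001.pdf", [0.1, 0.2])
store.upsert("n2", "patient-letter-001.pdf", [0.3, 0.4])
store.upsert("n3", "policy-handbook.pdf", [0.5, 0.6])
erased = store.delete_by_source("patient-letter-001.pdf")
```

Returning the erased count gives the audit log something concrete to record against the erasure request.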
Migration and alternatives
LlamaIndex is the natural choice when the application is RAG-first; LangChain is the natural choice when orchestration and tool variety dominate. The two are not mutually exclusive — many production stacks use LlamaIndex for ingestion and indexing, then expose the retriever to a LangGraph agent for orchestration. The table below compares the four most common alternatives Yobitel customers weigh against LlamaIndex.
- From raw vector-store + provider SDK to LlamaIndex — wrap the existing index in a `VectorStoreIndex(vector_store=…)` constructor; the query side becomes a one-line `.as_query_engine()` call. Existing embeddings are reusable.
- From LlamaIndex pre-0.10 to 0.10+: run the `llamaindex-cli upgrade` tool (shipped in the `llama-index-cli` package); the substantive piece is moving from monolithic imports to per-integration packages and switching from the legacy Agent to Workflows.
- From LangChain RAG to LlamaIndex RAG: wrap the LangChain retriever in a small custom `BaseRetriever` subclass or rebuild on `VectorStoreIndex`; LLM and embedding wrappers translate one-for-one.
- From self-hosted LlamaIndex RAG to Yobitel MediQuery — only when the corpus is clinical and HIPAA / NCSC managed-service economics outweigh the flexibility of self-hosting. Otherwise stay on LlamaIndex.
- From LlamaIndex self-hosted to Yobibyte managed vector store — keep the LlamaIndex code; change the `QdrantVectorStore` URL to a Yobibyte-issued endpoint; Yobibyte handles SLA, replication and quota.
| Approach | Surface | Strengths | Weaknesses |
|---|---|---|---|
| LlamaIndex (core + integrations) | DataConnectors / Indexes / QueryEngines / Workflows | Deepest RAG ingestion; composable indexes; LlamaParse for hard PDFs. | Lighter on multi-agent orchestration outside RAG; per-integration package management. |
| LangChain + LangGraph | Runnable / ChatModel / Tool / StateGraph | Largest integration catalogue; mature multi-agent; LangSmith. | Less depth on document parsing and indexing strategy. |
| Raw vector-store SDK + provider SDK | Embed / upsert / query directly | Total control; minimum dependencies. | Reinvent chunking, hybrid retrieval, response synthesis, citation handling per project. |
| Haystack 2.x (deepset) | Pipeline / Component / DocumentStore | Strong typed-pipeline ergonomics; healthcare and gov adoption. | Smaller integration list; less momentum than LlamaIndex / LangChain. |
| Yobitel MediQuery (managed alt) | Vertical clinical knowledge-base assistant | Turn-key for healthcare RAG; HIPAA / NCSC posture managed. | Vertical scope only; not a general framework. |
Troubleshooting
The failure modes below are the ones LlamaIndex operators hit repeatedly when running production RAG against managed LLM endpoints — including Yobibyte, Yobitel NeoCloud vLLM, and third-party providers.
| Symptom | Likely cause | Fix |
|---|---|---|
| ImportError on `from llama_index import …` | Pre-0.10 monolithic import; modern path is `from llama_index.core import …`. | Run `llamaindex-cli upgrade` or update imports manually. |
| Retrieval returns irrelevant nodes | Chunk size or overlap too large; default SentenceSplitter too coarse. | Tune chunk_size 256-512 with 32-64 overlap; try SemanticSplitterNodeParser. |
| QueryEngine cites the wrong source | Multiple identical chunks across documents; metadata not preserved. | Add stable doc_id and source metadata; use AutoMergingRetriever to surface parent context. |
| Ingest extremely slow | Default embed_batch_size of 10. | Raise to 64-128; parallelise IngestionPipeline with num_workers. |
| Out-of-memory at ingest | Loading the full corpus into memory before chunking. | Use SimpleDirectoryReader's lazy `iter_data` or stream via IngestionPipeline. |
| OpenAILike returns empty completion | Endpoint not actually OpenAI-compatible; `/v1` missing from or duplicated in `api_base`. | Verify that a POST to `$api_base/chat/completions` (with `api_base` already ending in `/v1`) returns a completion; check is_chat_model flag. |
| LlamaParse hangs on large PDFs | Default sync mode; very large docs need the async API. | Use `aload_data` with `num_workers` > 1; raise `check_interval` for very long docs. |
| Hybrid retrieval gives worse results than vector alone | BM25 and vector weights badly tuned. | Use QueryFusionRetriever with `mode='reciprocal_rerank'`, or `mode='relative_score'` with tuned `retriever_weights`; for store-native hybrid search, tune `alpha`. |
| Workflow stuck | Step waiting for an event that no other step emits. | Render the graph with `draw_all_possible_flows(workflow)` from `llama-index-utils-workflow`; verify event types match step return / receive annotations. |
| Citation hit rate low under evaluation | Synthesiser using `tree_summarize` discards node ids. | Switch to `compact` or `refine` response modes, or use `CitationQueryEngine`; read sources from `response.source_nodes`. |
| Vector store rejects insert with dimension mismatch | Changed embedding model mid-corpus. | Re-embed all nodes with the new model; collection schemas are dimension-locked. |
When retrieval quality drifts, the answer is almost always (in order): chunking strategy, embedding model, top_k, hybrid weighting. Synthesiser prompt tuning is the last lever, not the first.
Where this fits in the Yobitel stack
Yobitel uses LlamaIndex as the ingestion fabric for MediQuery's clinical knowledge base — DataConnectors pull from hospital document stores and EHR exports, NodeParsers chunk per source type, and a PropertyGraphIndex captures patient-condition-treatment relationships. Customers building their own RAG stacks against Yobibyte get the same compatibility for free: Yobibyte exposes pgvector / Qdrant / Weaviate vector tenants and OpenAI-compatible LLM endpoints, both of which LlamaIndex targets out of the box with `QdrantVectorStore` and `OpenAILike`.
For UK Sovereign customers, the recommended pattern is to keep the LlamaIndex code unchanged, terminate the LLM `api_base` inside the Yobibyte UK Sovereign boundary, host the vector store on a regional Yobibyte tenant, and avoid LlamaParse SaaS in favour of on-premises layout parsers or LlamaParse's self-hosted enterprise tier. Yobitel Professional Services has a pre-built reference stack for this shape; the open-source LlamaIndex code-base remains the customer's IP, owned and operated by them — Yobibyte manages the inference and the vector storage SLA below it.