Knowledge Graph
Structured knowledge as nodes + edges
A structured knowledge representation: entities (people, companies, concepts) as nodes; relationships as edges. Structure instead of plain text.
A knowledge graph (KG) models knowledge as nodes (entities) and
edges (relationships). Instead of the plain-text sentence "Anthropic
makes Claude, is headquartered in San Francisco, and its CEO is Dario
Amodei," a KG stores discrete triples:
(Anthropic) -[makes]-> (Claude),
(Anthropic) -[based-in]-> (San Francisco),
(Anthropic) -[CEO]-> (Dario Amodei).
The win: multi-hop questions ("where does the CEO of the company that makes Claude live?") resolve in a single graph traversal. Plain-text RAG struggles to synthesize the answer across chunks; the graph makes it a few hops.
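A minimal sketch in plain Python of how these hops resolve. Tuples stand in for a graph database, and the `objects`/`subjects` helpers are toy lookups invented for illustration; the two-hop question here ("who is the CEO of the company that makes Claude?") uses only the triples listed above:

```python
# The three triples from the text, stored as (subject, relation, object) tuples.
triples = [
    ("Anthropic", "makes", "Claude"),
    ("Anthropic", "based-in", "San Francisco"),
    ("Anthropic", "CEO", "Dario Amodei"),
]

def objects(subject, relation):
    """Follow an edge forward: all objects of (subject, relation, ?)."""
    return [o for s, r, o in triples if s == subject and r == relation]

def subjects(relation, obj):
    """Follow an edge backward: all subjects of (?, relation, obj)."""
    return [s for s, r, o in triples if r == relation and o == obj]

# Multi-hop: "who is the CEO of the company that makes Claude?"
company = subjects("makes", "Claude")[0]   # hop 1: Claude <-[makes]- Anthropic
ceo = objects(company, "CEO")[0]           # hop 2: Anthropic -[CEO]-> ?
```

A real system would use a graph database and an index per relation, but the hop-by-hop mechanics are exactly this.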
Two modern AI uses:
1. Graph RAG: parse documents and auto-extract a KG, then have the LLM walk the graph. Microsoft's open-source GraphRAG project pioneered this.
2. Hybrid retrieval: vector search and KG search together. Vectors handle semantic similarity; the KG handles structural queries.
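The extraction step of Graph RAG can be sketched as a prompt-and-parse loop. Everything here is an illustrative assumption, not Microsoft's actual API: `extract_triples` is a made-up helper, and `fake_llm` is a stub standing in for a real model client so the sketch runs as-is:

```python
import json

def extract_triples(text, llm):
    """Ask an LLM to emit (subject, relation, object) triples as JSON.
    `llm` is any callable mapping a prompt string to a response string."""
    prompt = (
        "Extract knowledge-graph triples from the text below. "
        "Reply with only a JSON list of [subject, relation, object] lists.\n\n"
        + text
    )
    # Parse the model's JSON reply into a list of tuples.
    return [tuple(t) for t in json.loads(llm(prompt))]

def fake_llm(prompt):
    # Canned response standing in for a real model call.
    return '[["Anthropic", "makes", "Claude"]]'

kg = extract_triples("Anthropic makes Claude.", fake_llm)
```

In practice this step is where most of the quality risk lives, which is why the pitfalls below focus on extraction.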
Classic KG systems: Wikidata (structured Wikipedia data), Google Knowledge Graph (powers the "knowledge panels" in search), and graph databases such as Neo4j and ArangoDB.
A library's text collection (classic RAG) vs a family tree (KG). To find "John's father's cousin" in text, you scan every page; on the tree it's three hops: follow the nodes and the answer pops out. Structured knowledge usually beats plain text, provided you can extract the structure up front.
Microsoft GraphRAG (open-sourced 2024): ingest the Sherlock Holmes stories and auto-build a KG:
- Characters (Holmes, Watson, Moriarty)
- Relationships (friend, enemy, kin)
- Events (murder, investigation, resolution)
Classic RAG: "who is Holmes's most frequent enemy?" pulls 50 chunks and struggles to count enemy mentions.
Graph RAG: walk the "enemy-of" edges from the Holmes node in the KG, count them, and return the highest-frequency neighbor. Sub-second, and exact as long as the extracted edges are correct.
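The edge-counting step can be sketched like this (the `enemy_edges` list and its counts are hypothetical, not real extraction output from the stories):

```python
from collections import Counter

# Hypothetical edges extracted from the stories (counts are illustrative).
enemy_edges = [
    ("Holmes", "enemy-of", "Moriarty"),
    ("Holmes", "enemy-of", "Moriarty"),
    ("Holmes", "enemy-of", "Milverton"),
]

# Walk only the "enemy-of" edges leaving the Holmes node and count targets.
counts = Counter(o for s, r, o in enemy_edges
                 if s == "Holmes" and r == "enemy-of")
most_frequent_enemy = counts.most_common(1)[0][0]
```

The contrast with classic RAG: here the aggregation is a trivial count over edges, whereas an LLM summarizing 50 text chunks has to infer the same tally from prose.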
Microsoft reported 40-50% accuracy gains on internal Q&A with Graph RAG.
Use a KG when:
- Queries are multi-hop (relationship traversal is needed)
- There are lots of structured entities (people, companies, products with relations)
- Authoritative data exists (Wikipedia, ontologies, taxonomies)
- You want to reduce hallucination risk: KG facts are verifiable
- The domain is medical/biological research (drug interactions, protein networks)
Skip a KG when:
- Plain text is enough: KG extraction overhead buys nothing
- Relationships are few or change often: KG maintenance is hard
- Semantic similarity suffices (plain vectors are better)
- You need a quick prototype: building a KG takes weeks
KG extraction quality drives everything
Auto-extraction with an LLM can produce wrong triples. Is 'Apple' the fruit or the company? Disambiguation is essential: a wrong KG gives confidently wrong answers, which can be worse than no KG at all.
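A toy sketch of context-based disambiguation. The `disambiguate_apple` heuristic and its cue words are invented for illustration; real pipelines use entity linkers that resolve mentions to stable IDs (e.g. Wikidata entries) rather than keyword checks:

```python
def disambiguate_apple(sentence):
    """Decide whether 'Apple' means the company or the fruit,
    using a crude bag-of-words context check (illustrative only)."""
    company_cues = {"iphone", "stock", "ceo", "shares", "inc"}
    words = set(sentence.lower().split())
    return "Apple Inc." if words & company_cues else "apple (fruit)"
```

Even this toy shows the failure mode: a sentence with no cue words silently gets the wrong entity, and every downstream triple inherits the error.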
Graph DB complexity
Neo4j and ArangoDB use their own query languages (Cypher, AQL), so teams coming from SQL ramp up slowly. Managed offerings (e.g. Memgraph, Dgraph) can reduce the operational burden.
Maintenance cost
Every new document must update the KG, and conflicting entity names ('Acme' vs 'Acme Corp.') must be merged. This requires an automated pipeline plus human oversight.
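One minimal sketch of the merge step, assuming a curated alias table (`canonical` and the `aliases` dict are hypothetical names, and the table itself would be hand-maintained or LLM-suggested with human review):

```python
def canonical(name, aliases):
    """Map a surface form to its canonical entity name via an alias table."""
    # Normalize before lookup: trim whitespace, lowercase, drop a trailing dot.
    key = name.strip().lower().rstrip(".")
    return aliases.get(key, name)

# Hypothetical alias table resolving the 'Acme' vs 'Acme Corp.' conflict:
aliases = {"acme": "Acme Corp.", "acme corp": "Acme Corp."}
```

Running every extracted entity through a step like this before insertion keeps the graph from splitting one company into several nodes, while unknown names pass through untouched for a human to review.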