Knowledge Graph
Structured knowledge as nodes + edges
A structured knowledge representation: entities (people, companies, concepts) as nodes; relationships as edges. Structure instead of plain text.
A knowledge graph (KG) models knowledge as nodes (entities) and
edges (relationships). Instead of the plain-text sentence "Anthropic
makes Claude, is headquartered in San Francisco, and its CEO is Dario
Amodei," a KG stores discrete triples:
(Anthropic) -[makes]-> (Claude),
(Anthropic) -[based-in]-> (San Francisco),
(Anthropic) -[CEO]-> (Dario Amodei).
The win: multi-hop questions ("where does the CEO of the company that makes Claude live?") resolve in a single graph traversal. Plain-text RAG struggles to synthesize the answer across chunks; the graph makes it a few hops.
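A minimal sketch in plain Python of how these hops resolve. Tuples stand in for a graph database, and the `objects`/`subjects` helpers are toy lookups invented for illustration; the two-hop question here ("who is the CEO of the company that makes Claude?") uses only the triples listed above:

```python
# The three triples from the text, stored as (subject, relation, object) tuples.
triples = [
    ("Anthropic", "makes", "Claude"),
    ("Anthropic", "based-in", "San Francisco"),
    ("Anthropic", "CEO", "Dario Amodei"),
]

def objects(subject, relation):
    """Follow an edge forward: all objects of (subject, relation, ?)."""
    return [o for s, r, o in triples if s == subject and r == relation]

def subjects(relation, obj):
    """Follow an edge backward: all subjects of (?, relation, obj)."""
    return [s for s, r, o in triples if r == relation and o == obj]

# Multi-hop: "who is the CEO of the company that makes Claude?"
company = subjects("makes", "Claude")[0]   # hop 1: Claude <-[makes]- Anthropic
ceo = objects(company, "CEO")[0]           # hop 2: Anthropic -[CEO]-> ?
```

A real system would use a graph database and an index per relation, but the hop-by-hop mechanics are exactly this.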
Two modern AI uses:
1. Graph RAG: parse documents and auto-extract a KG, then have the LLM walk the graph. Microsoft's open-source GraphRAG project pioneered this.
2. Hybrid retrieval: vector search and KG search together. Vectors handle semantic similarity; the KG handles structural queries.
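The extraction step of Graph RAG can be sketched as a prompt-and-parse loop. Everything here is an illustrative assumption, not Microsoft's actual API: `extract_triples` is a made-up helper, and `fake_llm` is a stub standing in for a real model client so the sketch runs as-is:

```python
import json

def extract_triples(text, llm):
    """Ask an LLM to emit (subject, relation, object) triples as JSON.
    `llm` is any callable mapping a prompt string to a response string."""
    prompt = (
        "Extract knowledge-graph triples from the text below. "
        "Reply with only a JSON list of [subject, relation, object] lists.\n\n"
        + text
    )
    # Parse the model's JSON reply into a list of tuples.
    return [tuple(t) for t in json.loads(llm(prompt))]

def fake_llm(prompt):
    # Canned response standing in for a real model call.
    return '[["Anthropic", "makes", "Claude"]]'

kg = extract_triples("Anthropic makes Claude.", fake_llm)
```

In practice this step is where most of the quality risk lives, which is why the pitfalls below focus on extraction.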
Classic KG systems: Wikidata (structured Wikipedia data), Google Knowledge Graph (powers the "knowledge panels" in search), and graph databases such as Neo4j and ArangoDB.
A library's text collection (classic RAG) vs a family tree (KG). To find "John's father's cousin" in text, you scan every page; on the tree it's three hops: follow the nodes and the answer pops out. Structured knowledge usually beats plain text, provided you can extract the structure up front.
Microsoft GraphRAG (open-sourced 2024): ingest the Sherlock Holmes stories and auto-build a KG:
- Characters (Holmes, Watson, Moriarty)
- Relationships (friend, enemy, kin)
- Events (murder, investigation, resolution)
Classic RAG: "who is Holmes's most frequent enemy?" pulls 50 chunks and struggles to count enemy mentions.
Graph RAG: walk the "enemy-of" edges from the Holmes node in the KG, count them, and return the highest-frequency neighbor. Sub-second, and exact as long as the extracted edges are correct.
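The edge-counting step can be sketched like this (the `enemy_edges` list and its counts are hypothetical, not real extraction output from the stories):

```python
from collections import Counter

# Hypothetical edges extracted from the stories (counts are illustrative).
enemy_edges = [
    ("Holmes", "enemy-of", "Moriarty"),
    ("Holmes", "enemy-of", "Moriarty"),
    ("Holmes", "enemy-of", "Milverton"),
]

# Walk only the "enemy-of" edges leaving the Holmes node and count targets.
counts = Counter(o for s, r, o in enemy_edges
                 if s == "Holmes" and r == "enemy-of")
most_frequent_enemy = counts.most_common(1)[0][0]
```

The contrast with classic RAG: here the aggregation is a trivial count over edges, whereas an LLM summarizing 50 text chunks has to infer the same tally from prose.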
Microsoft reported 40-50% accuracy gains on internal Q&A with Graph RAG.
Use a KG when:
- Queries are multi-hop (relationship traversal is needed)
- There are lots of structured entities (people, companies, products with relations)
- Authoritative data exists (Wikipedia, ontologies, taxonomies)
- You want to reduce hallucination risk: KG facts are verifiable
- The domain is medical/biological research (drug interactions, protein networks)
Skip a KG when:
- Plain text is enough: KG extraction overhead buys nothing
- Relationships are few or change often: KG maintenance is hard
- Semantic similarity suffices (plain vectors are better)
- You need a quick prototype: building a KG takes weeks
KG extraction quality drives everything
Auto-extraction with an LLM can produce wrong triples. Is 'Apple' the fruit or the company? Disambiguation is essential: a wrong KG gives confidently wrong answers, which can be worse than no KG at all.
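A toy sketch of context-based disambiguation. The `disambiguate_apple` heuristic and its cue words are invented for illustration; real pipelines use entity linkers that resolve mentions to stable IDs (e.g. Wikidata entries) rather than keyword checks:

```python
def disambiguate_apple(sentence):
    """Decide whether 'Apple' means the company or the fruit,
    using a crude bag-of-words context check (illustrative only)."""
    company_cues = {"iphone", "stock", "ceo", "shares", "inc"}
    words = set(sentence.lower().split())
    return "Apple Inc." if words & company_cues else "apple (fruit)"
```

Even this toy shows the failure mode: a sentence with no cue words silently gets the wrong entity, and every downstream triple inherits the error.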
Graph DB complexity
Neo4j and ArangoDB use their own query languages (Cypher, AQL), so teams coming from SQL ramp up slowly. Managed offerings (e.g. Memgraph, Dgraph) can reduce the operational burden.
Maintenance cost
Every new document must update the KG, and conflicting entity names ('Acme' vs 'Acme Corp.') must be merged. This requires an automated pipeline plus human oversight.
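One minimal sketch of the merge step, assuming a curated alias table (`canonical` and the `aliases` dict are hypothetical names, and the table itself would be hand-maintained or LLM-suggested with human review):

```python
def canonical(name, aliases):
    """Map a surface form to its canonical entity name via an alias table."""
    # Normalize before lookup: trim whitespace, lowercase, drop a trailing dot.
    key = name.strip().lower().rstrip(".")
    return aliases.get(key, name)

# Hypothetical alias table resolving the 'Acme' vs 'Acme Corp.' conflict:
aliases = {"acme": "Acme Corp.", "acme corp": "Acme Corp."}
```

Running every extracted entity through a step like this before insertion keeps the graph from splitting one company into several nodes, while unknown names pass through untouched for a human to review.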