Beyond Keywords
The fundamental mismatch between how humans express meaning and how computers match strings
Traditional search engines match strings. You type "laptop battery life," and the engine finds documents containing those exact words. This works remarkably well for many queries. But it fails in ways that reveal a deep limitation.
The Vocabulary Mismatch Problem
Consider searching a technical documentation site for "how to make my code run faster." The answer might be in a document titled "Performance Optimization Techniques" that never uses the word "faster." The document discusses "reducing latency," "improving throughput," and "minimizing computational overhead." Same concept, different words.
Three documents, one query ("how to make my code run faster"):

- Performance Optimization Techniques: "Reduce latency by minimizing computational overhead. Profile your code to identify bottlenecks. Consider caching frequently accessed data."
- Battery Optimization Guide: "Maximize power efficiency by reducing CPU wake-ups. Monitor energy consumption patterns. Implement aggressive sleep modes."
- Authentication Best Practices: "Implement secure login mechanisms. Use multi-factor authentication. Store passwords with bcrypt hashing."

Notice how keyword matching fails to find the most relevant document: the query's key word, "faster," never appears in the document that answers it.
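The mismatch can be made concrete with a toy scoring function. This is a deliberately naive sketch, not a real ranking algorithm: it counts how many query words appear verbatim in each document.

```python
# Score documents by literal word overlap with the query (a toy sketch).
query = "how to make my code run faster"

documents = {
    "Performance Optimization Techniques":
        "Reduce latency by minimizing computational overhead. "
        "Profile your code to identify bottlenecks.",
    "Battery Optimization Guide":
        "Maximize power efficiency by reducing CPU wake-ups. "
        "Monitor energy consumption patterns.",
    "Authentication Best Practices":
        "Implement secure login mechanisms. "
        "Use multi-factor authentication.",
}

def keyword_overlap(query: str, doc: str) -> int:
    """Count how many query words appear verbatim in the document."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().replace(".", "").split())
    return len(query_words & doc_words)

for title, text in documents.items():
    print(title, keyword_overlap(query, text))
```

The concept word "faster" matches nothing anywhere; the only overlap comes from incidental words like "to" and "code." A matcher built on string overlap has no way to see that the first document answers the query.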
This is called the vocabulary mismatch problem. Humans express the same idea in countless ways. Keywords match strings, not concepts. The gap between user intent and document vocabulary is where traditional search breaks down.
Synonyms Are Not Enough
You might think: just expand queries with synonyms. If someone searches "fast," also search "quick" and "rapid." But this approach collapses under scrutiny.
First, synonyms are context-dependent. "Fast" can mean quick (a fast car), secure (hold fast), or abstaining from food (to fast). A thesaurus cannot tell you which meaning applies.
Second, conceptual relationships extend far beyond synonyms. "Laptop battery life" relates to "power consumption," "mAh capacity," "charging cycles," and "energy efficiency." These are not synonyms—they are concepts connected by domain knowledge.
Third, the combinatorial explosion is unmanageable. A three-word query with five variants per word (the word plus four synonyms) produces 5³ = 125 query variations. Most are noise. Some are actively misleading.
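The explosion can be counted directly. The synonym lists below are invented for illustration; the point is only the arithmetic:

```python
from itertools import product

# Five variants per word: the word itself plus four synonyms.
expansions = {
    "fast":  ["fast", "quick", "rapid", "speedy", "swift"],
    "car":   ["car", "auto", "automobile", "vehicle", "motorcar"],
    "cheap": ["cheap", "inexpensive", "affordable", "budget", "economical"],
}

# Every combination of one variant per word.
variants = list(product(*expansions.values()))
print(len(variants))           # 125 (5 * 5 * 5)
print(" ".join(variants[0]))   # "fast car cheap" -- the original query
print(" ".join(variants[-1]))  # "swift motorcar economical" -- likely noise
```

Each extra query word multiplies the count by another factor of five, and longer queries quickly reach thousands of variants, most of which retrieve irrelevant documents.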
The Ambiguity Problem
The same words mean different things in different contexts. Search for "python" and you might want the programming language, the snake, or Monty Python. Search for "apple" and you might want the company, the fruit, or Apple Records.
Keywords carry no context. Each document containing "python" looks equally relevant to a keyword matcher. The only signals are statistical—which words appear most often, in what positions—and these correlate weakly with actual meaning.
The Semantic Gap
The core problem is that meaning exists in human minds, not in text. Words are symbols that point to concepts. Two sentences can use entirely different words yet express identical meaning. Two sentences can use identical words yet express opposite meaning.
Consider sentences that share meaning but not vocabulary: "The medication reduced inflammation" and "The drug decreased swelling" say the same thing with different words. A keyword matcher sees zero overlap. A human sees identical meaning.
Now consider the reverse: "The bank approved the loan" and "The fisherman sat on the bank" share the word "bank" but have nothing in common semantically. A keyword matcher sees perfect overlap. A human sees unrelated sentences.
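Both failure modes show up in a simple word-overlap score. The sketch below uses Jaccard similarity over content words (a tiny hand-picked stopword set, purely for illustration) on the two sentence pairs above:

```python
# Words ignored when comparing; a tiny hand-picked list for this example.
STOPWORDS = {"the", "a", "an", "on"}

def content_words(sentence: str) -> set:
    return set(sentence.lower().split()) - STOPWORDS

def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity: |intersection| / |union| of content words."""
    wa, wb = content_words(a), content_words(b)
    return len(wa & wb) / len(wa | wb)

# Same meaning, zero shared content words.
same_meaning = jaccard("The medication reduced inflammation",
                       "The drug decreased swelling")
# Unrelated meanings, but they share the word "bank".
different_meaning = jaccard("The bank approved the loan",
                            "The fisherman sat on the bank")
print(same_meaning, different_meaning)  # 0.0 vs 0.2
```

The synonymous pair scores zero; the unrelated pair scores higher. Any ranking built on surface overlap inherits exactly this inversion.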
Keyword search operates at the surface—matching symbols. Semantic search operates at depth—matching meaning. The gap between them is why we need a fundamentally different approach.
What Would Work
Imagine an oracle that, given any two pieces of text, could tell you how similar their meaning is. Not their words—their meaning. With such an oracle, search becomes straightforward: compute similarity between query and every document, return the most similar.
The challenge is building that oracle. We need a way to represent meaning computationally—a representation where similar meanings are similar in some measurable way.
This is exactly what embedding models provide. They transform text into vectors in a high-dimensional space, arranged so that semantically similar texts produce geometrically similar vectors. The rest of this course is about how this works and how to use it effectively.
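The search loop itself is simple once an embedding exists. The sketch below fakes the hard part: `CONCEPT_VECTORS` is a hand-built toy mapping a few words to three made-up "concept" dimensions, standing in for a real model that learns hundreds of dimensions from data. Only the ranking machinery is realistic:

```python
import math

# Toy stand-in for an embedding model: hand-built 3-d "concept" vectors
# (speed, energy, security). A real model learns these from data.
CONCEPT_VECTORS = {
    "faster":     [1.0, 0.0, 0.0],
    "latency":    [0.9, 0.1, 0.0],
    "throughput": [0.8, 0.2, 0.0],
    "battery":    [0.0, 1.0, 0.0],
    "power":      [0.1, 0.9, 0.0],
    "login":      [0.0, 0.0, 1.0],
    "password":   [0.0, 0.0, 1.0],
}

def embed(text: str) -> list:
    """Sum the concept vectors of known words; unknown words contribute nothing."""
    vec = [0.0, 0.0, 0.0]
    for word in text.lower().split():
        for i, v in enumerate(CONCEPT_VECTORS.get(word, [0.0, 0.0, 0.0])):
            vec[i] += v
    return vec

def cosine(a: list, b: list) -> float:
    """Cosine similarity: the angle between vectors, ignoring their length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

docs = [
    "reduce latency and improve throughput",
    "maximize battery and power efficiency",
    "secure login and password storage",
]
query = "make my code run faster"
ranked = sorted(docs, key=lambda d: cosine(embed(query), embed(d)), reverse=True)
print(ranked[0])  # the latency/throughput document, despite zero shared words
```

The top result shares no words with the query; it wins because its vector points in the same direction. Real embedding models make `embed` good enough that this loop works across arbitrary text.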
The Scale of the Problem
Consider what semantic search must handle. The web contains billions of documents. Enterprise knowledge bases grow constantly. Codebases span millions of files. All of this must be searchable in milliseconds—users expect instant results.
The queries themselves vary wildly. One user types "fix wifi." Another pastes a multi-paragraph technical description. Both expect relevant results. The system must handle legal contracts with formal language, medical records with specialized terminology, and casual conversations with slang and abbreviations. And meaning transcends language entirely—a French query about "intelligence artificielle" should find English documents about "machine learning."
No symbolic approach can handle this scope. No rule system, thesaurus, or pattern matching scales to billions of documents across arbitrary domains and languages. We need learned representations that capture meaning at scale.
Key Takeaways
- Keyword search matches strings, not meaning—the vocabulary mismatch is fundamental
- Synonym expansion fails because relationships between concepts are far richer than synonymy
- The same words mean different things in different contexts; keywords cannot disambiguate
- The semantic gap is the distance between symbols and meaning
- We need representations where similar meanings are geometrically similar
- This is exactly what embedding models provide, which we will explore next