
Discover why embeddings are crucial in modern vector search systems and how they enhance semantic search, recommendation engines, and Retrieval-Augmented Generation.
What are we embedding?
If you're exploring semantic search, retrieval-augmented generation (RAG), or recommendation systems, you've likely encountered embeddings, those dense numeric vectors that represent meaning.
But there's a common question early in the journey:
What exactly are we embedding–individual words, entire sentences, or something else?
The short answer is: we typically embed sentences or text chunks, not words or individual tokens. But the full picture is worth unpacking, especially when working with models like OpenAI’s text-embedding-3-small, Cohere’s multilingual embeddings, or Hugging Face Sentence Transformers.
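For instance, using the pgai extension that appears later in this article (and assuming an OpenAI API key is configured for it), producing a sentence-level embedding is a single call; the sentence here is just an illustrative example:

```sql
-- Embed one sentence (or chunk) as a single vector using
-- OpenAI's text-embedding-3-small via the pgai extension.
SELECT ai.openai_embed(
    'text-embedding-3-small',
    'A sailing dinghy suitable for single-handed racing.'
);
```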
From word embeddings to sentence embeddings

In earlier Natural Language Processing (NLP) pipelines, the standard approach was to embed individual words. Models like Word2Vec and GloVe generated a single vector for each word in a vocabulary, trained based on co-occurrence with nearby words.
This worked, up to a point: it captured things like analogies (e.g., “king” is to “man” as “queen” is to “woman”) and semantic clusters (e.g., “boat,” “yacht,” and “dinghy” sitting close together in vector space). But it had a critical flaw: word embeddings were context-free. The word "head" had the same vector whether you meant the head of a river, the toilet on a boat, or the head of a person.
Modern language models changed that.
Transformer-based models like BERT, OpenAI’s GPT family, and Sentence Transformers embed entire sentences or documents as a single vector, capturing contextual meaning based on how words relate to one another in that specific sentence.
This shift from static word vectors to dynamic, context-aware sentence embeddings is one of the biggest reasons modern vector search works so well.
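You can see the difference directly by embedding whole sentences and comparing them. The sketch below assumes pgai and pgvector are installed, so ai.openai_embed returns a pgvector vector and the <=> cosine-distance operator is available; the sentences are illustrative, but you would expect the two nautical uses of “head” to sit closer together than the nautical and “person in charge” uses:

```sql
-- Cosine distance between sentence embeddings (smaller = more similar).
-- The first pair uses "head" in the same nautical sense; the second pair
-- mixes the nautical sense with the "person in charge" sense.
SELECT
  ai.openai_embed('text-embedding-3-small', 'The head on this boat is on the port side.')
    <=> ai.openai_embed('text-embedding-3-small', 'She has a port side head.')
    AS same_sense_distance,
  ai.openai_embed('text-embedding-3-small', 'The head on this boat is on the port side.')
    <=> ai.openai_embed('text-embedding-3-small', 'He is the head of the sailing club.')
    AS different_sense_distance;
```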
Why sentence-level embeddings are the default
Most vector search pipelines, especially when using tools like pgvector, use sentence or chunk-level embeddings. This is because these embeddings capture:
- Disambiguation (e.g., which meaning of “head” you intended)
- Intent (e.g., is the query informational, transactional, etc.)
- Semantic relationships between phrases, not just matching keywords
Sentence-level embeddings have become the default in modern vector search systems for several practical reasons.
First and foremost, context matters. A sentence embedding captures the meaning of an entire phrase in its full context, allowing for more accurate semantic comparisons. This approach also reduces overhead by requiring just one vector per record or chunk, rather than a separate vector for each word.
Importantly, most embedding models are specifically optimized to encode similarity at the sentence or document level, which leads to more meaningful results in downstream applications. It also simplifies querying, since you can embed a full user query and directly compare it to document chunks, FAQs, or product descriptions without additional preprocessing. These advantages make sentence embeddings the go-to choice for use cases like semantic search, chat memory, recommendation engines, and retrieval-augmented generation (RAG) systems.
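In practice, that query pattern is a single SQL statement. Here is a rough sketch, assuming a hypothetical boat_chunks table (not part of this article's dataset) that already stores one pgvector embedding per chunk:

```sql
-- Embed the user's query once, then rank stored chunk embeddings
-- by cosine distance. boat_chunks(chunk, embedding) is hypothetical.
SELECT chunk,
       embedding <=> ai.openai_embed(
           'text-embedding-3-small',
           'family-friendly sailing boat with two cabins'
       ) AS distance
FROM boat_chunks
ORDER BY distance
LIMIT 5;
```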
How tokens power context (under the hood)
While you work with full sentences or documents at the application level, under the hood, models actually operate at the token level.
A token is the smallest unit a model processes; it might be a whole word, a subword, or even a single punctuation mark. For example, the sentence:
Saint Jerome’s water-line length is 32 feet.
is split into tokens like:
["Saint" "Jerome" "'" "s" "water" "-" "line" "length" "is" "32" "feet" "."]
These tokens are then converted into token embeddings and passed through a transformer model, where context is built using a mechanism called self-attention.
Each token interacts with all the others. So, the token “head” in “head of the river” will end up with a different final representation than “head” in “she has a port side head”. LLMs build context by adjusting the meaning of each token based on the entire surrounding sentence.
By the time the model finishes processing, it can output:
- A vector for each token (e.g., for translation or tagging tasks), or
- A single vector for the full sentence (used in semantic search and vector databases)
The sentence embedding you use in vector search is actually the result of combining all token-level context into a single representation.
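If you want to peek at this token level yourself, recent pgai versions also expose tokenizer helpers (ai.openai_tokenize and ai.openai_detokenize; treat their availability and exact signatures as something to verify in your version). A rough sketch:

```sql
-- Split the example sentence into token IDs, then turn each ID back
-- into its text fragment so the individual tokens are visible.
SELECT tok_id,
       ai.openai_detokenize('text-embedding-3-small', ARRAY[tok_id]) AS token_text
FROM unnest(
       ai.openai_tokenize('text-embedding-3-small',
                          'Saint Jerome''s water-line length is 32 feet.')
     ) AS tok_id;
```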
Embedding long texts: Why chunking matters
Language models typically have a token limit, often between 4,000 and 128,000 tokens depending on the model (e.g., OpenAI’s GPT-4 Turbo has a 128k limit). But even if your text fits within that limit, embedding a large chunk (like a full article or multi-page document) will produce a very general vector, which isn’t ideal for semantic search.
Below, we can see that the text-embedding-3-small model has a maximum context of 8,192 tokens: when I try to pass it a description of more than 8,200 characters, the request is rejected.
```sql
SELECT ai.openai_embed('text-embedding-3-small', description)
FROM live.boat
WHERE length(description) > 8200;
```

```
[38000] ERROR: openai.BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 8192 tokens, however you requested 9543 tokens (9543 in your prompt; 0 for the completion). Please reduce your prompt or completion length.", 'type': 'invalid_request_error', 'param': None, 'code': None}}
Where: Traceback (most recent call last):
  PL/Python function "openai_embed", line 38, in <module>
    for tup in embeddings:
```
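Assuming the same ai.openai_tokenize helper sketched in the previous section, you can also count tokens up front and find the rows that would blow the limit before calling the API:

```sql
-- Flag descriptions that exceed the 8,192-token context of
-- text-embedding-3-small before attempting to embed them.
SELECT characters, tokens
FROM (
    SELECT length(description) AS characters,
           array_length(
               ai.openai_tokenize('text-embedding-3-small', description), 1
           ) AS tokens
    FROM live.boat
) t
WHERE tokens > 8192;
```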
Instead, it's better to chunk the text into meaningful units, like paragraphs or 200–300 word segments, and embed each chunk individually. This gives you:
- Tighter matches during retrieval
- Better relevance scoring
- More efficient indexing and updates
This chunking strategy is core to techniques like RAG, where a user query is compared to smaller content chunks, and only the most relevant are used to inform the response.
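As a deliberately naive illustration (splitting on blank lines and ignoring token counts entirely), you could chunk and embed the boat descriptions directly in SQL; the id column on live.boat is an assumption here:

```sql
-- Naive chunking: split each description on blank lines (paragraphs),
-- then embed each chunk separately instead of the whole document.
SELECT b.id,
       chunk,
       ai.openai_embed('text-embedding-3-small', chunk) AS embedding
FROM live.boat b,
     LATERAL regexp_split_to_table(b.description, E'\n\n+') AS chunk
WHERE length(trim(chunk)) > 0;
```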
I’ll go into some different approaches to chunking in the next article covering good practices for embedding in PostgreSQL.
What should you embed?
| Unit | When to embed it |
| --- | --- |
| Tokens | Used internally by models for context building; not embedded directly. |
| Words | Rarely; too ambiguous and context-free. |
| Sentences | Ideal for embedding semantic meaning. |
| Paragraphs/chunks | Good, especially for long content. |
| Whole documents | Okay if short and focused; otherwise chunk first. |
Final thoughts
While vector search systems are built on powerful, often complex internals, your job as a builder is usually straightforward:
Embed meaningful units of text and compare them in vector space using distance metrics like cosine distance.
By understanding that sentence and chunk embeddings are the default, and that token-level processing is what enables contextual understanding, you’ll be able to design better indexing strategies, tune your queries more effectively, and build smarter, more relevant search features.
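As one concrete example of an indexing strategy, once chunk embeddings live in a table, a single pgvector index is usually what makes those cosine-distance comparisons fast at scale. This is only a sketch: boat_chunks is hypothetical, and vector(1536) assumes text-embedding-3-small's default dimensionality.

```sql
-- Approximate nearest-neighbour (HNSW) index over a pgvector column,
-- tuned for cosine distance. boat_chunks(embedding vector(1536)) is
-- a hypothetical table, not part of this article's dataset.
CREATE INDEX boat_chunks_embedding_idx
    ON boat_chunks
    USING hnsw (embedding vector_cosine_ops);
```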