Embeddings are vector representations that capture the semantic meaning of words or sentences. Besides having quality data, choosing a good embedding model is the most important and most underrated step in optimizing a RAG application. Multilingual use cases are especially challenging, since most embedding models are pre-trained mostly on English data. The right embeddings make a huge difference, so don’t just grab the first model you see!
The semantic space defines how words and concepts relate to one another. When it is accurate, retrieval performance improves; when embeddings are inaccurate, the retriever returns irrelevant chunks or misses information entirely. A better model therefore directly improves your RAG system’s capabilities.
In this article, we will create a question-answer dataset from PDF documents in order to find the best model for our task and language. During retrieval, if the expected answer is returned for a given question, it means the embedding model positioned the question and its answer close enough in the semantic space.
While we focus on French and Italian, the process can be adapted to any language, because the best embedding model often differs from one language to another.
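To make the evaluation idea concrete, here is a minimal sketch, assuming a tiny list of question-answer pairs and an illustrative multilingual model from sentence-transformers: for each question, we check whether its expected answer lands among the top-k closest chunks.

```python
# A sketch of the evaluation loop: for each question, check whether its
# expected answer lands among the top-k retrieved chunks.
# The qa_pairs and the model name are illustrative placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

qa_pairs = [
    ("Quelle est la capitale de la France ?", "Paris est la capitale de la France."),
    ("Qual è la capitale d'Italia?", "Roma è la capitale d'Italia."),
]
answers = [answer for _, answer in qa_pairs]
answer_embs = model.encode(answers, normalize_embeddings=True)  # (n, dim)

hits = 0
for i, (question, _) in enumerate(qa_pairs):
    q_emb = model.encode(question, normalize_embeddings=True)  # (dim,)
    scores = answer_embs @ q_emb           # cosine similarity, vectors are unit-norm
    top_k = np.argsort(scores)[::-1][:5]   # indices of the 5 closest answers
    hits += int(i in top_k)

print(f"hit rate: {hits / len(qa_pairs):.2f}")
```

The resulting hit rate is the score we can compare across embedding models: the higher it is, the closer the model places questions to their answers in its semantic space.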
Embedding Models
There are two main types of embedding models: static and dynamic. Static embeddings like word2vec generate one vector per word; the vectors are then combined, often by averaging, to create the final sentence embedding. These embeddings are rarely used in production anymore because they don’t account for how a word’s meaning changes depending on the surrounding words.
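As a quick illustration (a minimal sketch assuming gensim and a downloadable GloVe model, both illustrative choices), a static sentence embedding is just the average of the word vectors:

```python
# A minimal static-embedding sketch: average the word vectors of a sentence.
# "glove-wiki-gigaword-50" is an illustrative choice of static model.
import numpy as np
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")  # pre-trained static word vectors

def sentence_embedding(sentence: str) -> np.ndarray:
    # Keep only tokens the model knows; out-of-vocabulary words are dropped.
    tokens = [t for t in sentence.lower().split() if t in vectors]
    return np.mean([vectors[t] for t in tokens], axis=0)

print(sentence_embedding("the bank raised interest rates").shape)  # (50,)
```

Note that “bank” gets the same vector here whether it refers to a river bank or a financial institution, which is exactly the limitation described above.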
Dynamic embeddings are based on Transformer models such as BERT, whose self-attention layers make them context-aware: the same word can receive a different vector depending on the words around it.
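A small sketch with the Hugging Face transformers library makes the difference visible (bert-base-uncased is an illustrative choice): the token “bank” receives a different vector in each sentence.

```python
# A minimal sketch of context-dependent token embeddings with BERT.
# "bert-base-uncased" is an illustrative choice; any BERT-like model works.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def token_vector(sentence: str, word: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    # Locate the (first) position of the target token in the input.
    word_id = tokenizer.convert_tokens_to_ids(word)
    position = inputs.input_ids[0].tolist().index(word_id)
    return hidden[position]

v1 = token_vector("she deposited cash at the bank", "bank")
v2 = token_vector("they sat on the river bank", "bank")
# Well below 1.0: the surrounding words changed the vector for "bank".
print(torch.cosine_similarity(v1, v2, dim=0).item())
```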
Most current embedding models are fine-tuned with contrastive learning: the model learns semantic similarity by seeing both positive and negative text pairs during training.
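Here is a hedged sketch of what such training can look like, using sentence-transformers’ MultipleNegativesRankingLoss, one common contrastive objective; the training pairs and the model name are made up for illustration:

```python
# A sketch of contrastive fine-tuning with sentence-transformers.
# MultipleNegativesRankingLoss treats each (question, answer) pair as a
# positive and the other answers in the batch as negatives.
# The training pairs and model name are illustrative.
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

train_examples = [
    InputExample(texts=["Quelle est la capitale de la France ?",
                        "Paris est la capitale de la France."]),
    InputExample(texts=["Qual è la capitale d'Italia?",
                        "Roma è la capitale d'Italia."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

# One pass over the (tiny) dataset; real fine-tuning needs far more pairs.
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```

With this loss, explicit negatives are not required: every other answer in the batch implicitly serves as a negative, which is why larger batch sizes tend to help.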