Vector stores part I: a non-technical introduction

With the arrival of fit-for-purpose Large Language Models (LLMs), the way we store, access, and interpret data is undergoing a seismic shift. One of the most promising developments in this arena is the emergence of vector stores. But what exactly are vector stores, and why should business leaders, especially those not deeply entrenched in the technical weeds (here is an excellent series if you are), care about them? Let's dive in.

What is a vector store?

At its core, a vector store is a system that takes a chunk of text, regardless of its length (we normally call this a "document" even if it's not a PDF or Excel but rather just a short piece of text), and converts it into a series of numbers. Think of it as a translator that turns words, sentences, or even entire files into numerical coordinates.

What on earth does that actually mean? The way I think about it is that if you tried to plot the meaning of different words on a piece of graph paper, you'd end up with some that were really close to each other ("hot" and "boiling") and some that were really far away from each other ("archaeology" and "computer"). But you'd also have some that are in some ways similar to each other and in other ways meaningfully different ("archaeology" and "elder" for instance - you can't use one as a synonym for the other, but they share a concept of age). In order to represent that partial overlap, you might want to invent another dimension to divide them.

(Don't take my word for it; there are a number of papers on this, for instance "Linguistic Regularities in Continuous Space Word Representations".)
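To make the graph-paper picture a little more concrete, here is a minimal sketch in Python. The three axes and every coordinate are invented by hand purely for illustration; real embeddings are learned by a model and their dimensions do not have neat labels like these.

```python
import math

# Hypothetical, hand-picked axes: (temperature, age, academic subject).
TEMPERATURE, AGE, SUBJECT = 0, 1, 2

# Made-up coordinates, chosen only to mirror the examples in the text.
WORDS = {
    "hot":         (0.9, 0.0, 0.0),
    "boiling":     (1.0, 0.0, 0.0),
    "archaeology": (0.0, 0.9, 0.9),
    "elder":       (0.0, 0.8, 0.0),
    "computer":    (0.0, 0.1, 0.2),
}

def distance(a, b):
    """Straight-line (Euclidean) distance between two words."""
    return math.dist(WORDS[a], WORDS[b])

def axis_gap(a, b, axis):
    """How far apart two words are along a single axis."""
    return abs(WORDS[a][axis] - WORDS[b][axis])

# Near-synonyms sit close together; unrelated words sit far apart.
print(distance("hot", "boiling"))
print(distance("archaeology", "computer"))

# "archaeology" and "elder" are close on the age axis,
# but the extra subject axis pulls them apart.
print(axis_gap("archaeology", "elder", AGE))
print(axis_gap("archaeology", "elder", SUBJECT))
```

The point of the extra axis is the last two lines: the words overlap on one dimension and differ on another, which a single line on graph paper could not show.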

It turns out that if you use enough dimensions, this is a pretty good way of indexing (by which we mean something like "tagging") chunks of text so that we can look them up by topic later. You have to use quite a lot of dimensions for this to work really well while being mathematically tractable for today's computers. OpenAI's text-embedding-ada-002 model, for example, uses a staggering 1536 dimensions (that's 2^10 x 1.5 = 1024 x 1.5, for those who are wondering why this specific number).

A vector store's primary function (you will also see it called a "vector database") is to store these vectors and let you search them using a new vector, which you generate by analysing another document. That document might be a query like "can you find me our expenses policy?". The store then identifies and retrieves the stored vectors that are closest to the new one in what's known as "n-dimensional" space.

(You don't need to worry too much about how exactly the vectors for a given piece of content are actually generated. There are a ton of different ways of generating so-called "embeddings", which are the actual vectors themselves, from a source document.)
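Setting aside how the embeddings are produced, the "store and find the closest" part can be sketched in a few lines of Python. This is a toy, not how any particular product works: the `ToyVectorStore` class is hypothetical, the three-number vectors are written by hand rather than generated by an embedding model, and cosine similarity is just one common way of measuring closeness that I have picked for the example.

```python
import math

def cosine_similarity(a, b):
    """Closeness of two vectors by angle: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

class ToyVectorStore:
    """Hypothetical in-memory store: keeps (text, vector) pairs in a list."""

    def __init__(self):
        self._items = []

    def add(self, text, vector):
        self._items.append((text, vector))

    def search(self, query_vector, k=2):
        """Return the k stored texts whose vectors are closest to the query."""
        scored = [
            (cosine_similarity(query_vector, vector), text)
            for text, vector in self._items
        ]
        scored.sort(reverse=True)  # highest similarity first
        return [text for _, text in scored[:k]]

store = ToyVectorStore()
# Hand-written stand-ins for real embeddings (assumption for illustration).
store.add("Expenses policy", [0.9, 0.1, 0.0])
store.add("Travel booking guide", [0.6, 0.7, 0.1])
store.add("Office fire drill procedure", [0.0, 0.1, 0.9])

# Pretend this is the embedding of "can you find me our expenses policy?"
query_vector = [0.8, 0.3, 0.0]
results = store.search(query_vector, k=2)
print(results)
```

A real vector store does the same job over vectors with hundreds or thousands of dimensions, and uses indexing tricks so it doesn't have to compare the query against every stored vector one by one.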

The synergy with LLMs

Vector stores truly shine when paired with Large Language Models (LLMs), because LLMs have limitations of their own. Their prompt sizes are restricted, and their attention spans, though improving, can be unpredictable. Research, such as this study, indicates that overloading an LLM with data can lead to it overlooking crucial information in the middle of the prompt. This graph shows how context placed in the middle of a longish prompt can actually lead to an LLM performing worse than if it had been given no context at all. Context placed at the beginning or end of the prompt, by contrast, helped the model do much better than it would have done without it:

[Figure: llm-recall]

The key is to feed the LLM just the right amount of pertinent information, and this is where vector stores come into play.
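As a rough sketch of what "just the right amount" can look like in practice, the Python below takes chunks that have already been ranked by relevance and keeps adding them to the prompt until a size budget is used up. The `build_prompt` function, the prompt wording, and the character budget are all assumptions for illustration; real systems usually count tokens rather than characters.

```python
def build_prompt(question, ranked_chunks, max_chars=500):
    """Assemble a prompt from the most relevant chunks that fit the budget.

    ranked_chunks: list of text chunks, most relevant first.
    max_chars: crude stand-in for the model's prompt size limit.
    """
    selected = []
    used = 0
    for chunk in ranked_chunks:
        if used + len(chunk) > max_chars:
            break  # stop at the first chunk that would overflow the budget
        selected.append(chunk)
        used += len(chunk)

    context = "\n".join(f"- {chunk}" for chunk in selected)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Hypothetical chunks, as if returned by a vector store, best match first.
ranked_chunks = [
    "Expenses must be submitted within 30 days with a receipt attached.",
    "Flights over four hours may be booked in premium economy.",
    "The office fire drill takes place on the first Monday of each month.",
]

prompt = build_prompt("How long do I have to submit expenses?", ranked_chunks, max_chars=130)
print(prompt)
```

With the budget above, only the two most relevant chunks fit, so the fire drill never reaches the model.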

Why not just use full-text search?

Traditional full-text searches rely heavily on keyword matches. But when working with LLMs, what we're truly seeking are matches in concepts. For instance, if someone inquires about an "airplane policy," the LLM might want to review documents related to "travel" or "aviation" – not just those containing the exact phrase "airplane policy".

You can achieve these results by building a sophisticated search system which does some or all of:

  • stemming words (so that searches for "run" give you matches for "ran"). Remember you might need to do this in lots of languages!
  • expanding abbreviations (ACV -> Annual Contract Value - or maybe it's something else in your domain?)
  • searching on synonyms or near synonyms.
  • potentially lots of other things.
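Here is a deliberately tiny Python sketch of that kind of keyword pipeline. Every lookup table in it is hypothetical and hand-written, which is exactly the catch: each stem, abbreviation, and synonym has to be anticipated and maintained by someone, for every language and domain you care about.

```python
# Hypothetical lookup tables; a real system would need far larger ones.
STEMS = {"ran": "run", "running": "run", "policies": "policy"}
ABBREVIATIONS = {"acv": "annual contract value"}
SYNONYMS = {"airplane": {"travel", "aviation", "flight"}}

def normalise(text):
    """Lowercase, strip punctuation, expand abbreviations, and stem."""
    terms = set()
    for word in text.lower().split():
        word = word.strip(".,?!\"'")
        for piece in ABBREVIATIONS.get(word, word).split():
            terms.add(STEMS.get(piece, piece))
    return terms

def expand_query(query):
    """Add known synonyms to the normalised query terms."""
    terms = normalise(query)
    for term in list(terms):
        terms |= SYNONYMS.get(term, set())
    return terms

def keyword_search(query, documents):
    """Return documents sharing at least one term with the expanded query."""
    query_terms = expand_query(query)
    return [doc for doc in documents if query_terms & normalise(doc)]

documents = [
    "Company travel and aviation guidelines",
    "Annual contract value reporting",
    "Office fire drill procedure",
]

print(keyword_search("airplane policy", documents))
print(keyword_search("ACV", documents))
```

The "airplane policy" query only finds the travel document because someone thought to list "travel" and "aviation" as synonyms of "airplane". Remove that one entry and the search comes back empty.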

One of Harriet's founders previously founded a business that had at its heart a very complicated big-data search system. That system required thousands of hours of development and constant TLC to keep it running. To put it mildly, layering all of this on top of a full-text search system is not for the faint of heart.

The fundamental problem is that most full-text search systems are great at keywords, but poor at concepts.

Where vector stores are a good fit

The big advantage of vector stores is that they produce really compelling results with minimal or no configuration.

Vector stores excel in domains where the number of documents that you are searching over is relatively limited (hundreds or thousands, or maybe tens of thousands). They can efficiently narrow down vast amounts of data, allowing the LLM to perform a more focused, final review. This is invaluable because, let's face it, no one likes sifting through irrelevant search results - but LLMs can do this on a user's behalf and give a user a perfectly curated set of results (or a summary).

However, it's worth noting that while vector stores are revolutionary, they might not be as effective on an internet scale without additional techniques. Narrowing a billion documents down to 10,000 or even 1,000 still leaves far too much for an LLM to take in, leading to issues with data recall.

Internet or intranet scale?

For most enterprise search or business applications, vector stores are a game-changer. They can store context for applications that might otherwise be too vast or perplexing for an LLM to handle. This is especially crucial given the challenges of data "hallucinations" or misinterpretations.

For many organizations, relatively "naive" usage of vector stores can produce good results where they feed into an LLM. For instance, at Harriet we find that many of our SME customers have between 200 and 10,000 pages of internal process documentation (HR policies, company handbooks, insurance policies, onboarding slide decks, etc). A vector store can be exceptionally effective at winnowing that volume of data down so that an LLM can digest it and answer questions based on your internal company docs. (Try Harriet for free if you don't believe me!)

What are vector stores not good at?

Vector stores are not just another tech buzzword; they are an important building block of the AI future. Their ability to streamline information and work in tandem with LLMs makes them an invaluable asset for businesses.

However, if you're building a real-world app with vector stores, you need to know more about what they don't do in order to use them more effectively.

Stay tuned for Part II, where we'll look at some of the providers out there and delve into some of the challenges and real-world implications of using vector stores, including:

  • Chunking
  • Context preservation
  • Deduplication
  • Metadata filtering
  • Multi-language support

Follow Harriet on LinkedIn to get notified when we publish.