What is an inverted index?
An inverted index is a data structure commonly used as a database index to enable fast full-text search. It maps each unique word or term to the documents or entries in which it appears.
The name ‘inverted’ refers to the reversal of the relationship found in a forward index: a forward index lists each document and the terms it contains, while an inverted index lists each term and the documents in which it appears. Essentially, an inverted index is like a dictionary where each word points to a list of documents where it can be found.
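As a minimal sketch of the dictionary analogy, the following Python snippet builds an inverted index for three short documents. The sample documents, the `build_inverted_index` name, and the lowercase whitespace tokenizer are illustrative assumptions and do not reflect how any particular search engine is implemented.

```python
from collections import defaultdict

# Hypothetical sample data: document ID -> text.
documents = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "the quick dog jumps",
}

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumed tokenizer: lowercase, then split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

inverted_index = build_inverted_index(documents)
print(inverted_index["quick"])  # [1, 3]
print(inverted_index["dog"])    # [2, 3]
```

The `documents` mapping itself plays the role of the forward index here (document to content), while `inverted_index` reverses it (term to documents).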
This type of index can greatly enhance search efficiency as it enables the system to find relevant documents by simply looking up the terms in the index, significantly reducing the amount of data to be processed during a search.
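To illustrate why lookups are cheap, here is a rough sketch of a multi-term search over a small, hand-written index. The `index` contents and the `search` function are hypothetical. The sketch assumes a query should match only documents that contain every term, which is one common choice rather than the only one.

```python
# Hypothetical prebuilt index: term -> set of document IDs.
index = {
    "quick": {1, 3},
    "brown": {1},
    "dog": {2, 3},
    "lazy": {2},
}

def search(index, query):
    """Return IDs of documents containing every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Look up each term's document set; unknown terms match nothing.
    postings = [index.get(term, set()) for term in terms]
    # Intersect the smallest sets first to keep intermediate results small.
    postings.sort(key=len)
    result = set(postings[0])
    for ids in postings[1:]:
        result &= ids
    return result

print(search(index, "quick dog"))   # {3}
print(search(index, "lazy brown"))  # set()
```

The search touches only the entries for the query terms, so the document texts themselves are never scanned.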
Inverted index use cases
An inverted index is well suited to any information retrieval system that needs to quickly find the documents containing a given piece of text. Below are a few specific use cases.
Search engines like Google take advantage of inverted indexes to return web pages containing the user’s search query from their database of crawled web pages. Search engines also layer on custom algorithms and machine learning models to return the most relevant results rather than relying on text matching alone.
Many databases provide variations of inverted indexes to allow faster full-text search on specific columns. Postgres, for example, recommends that developers create a GIN (Generalized Inverted Index) for text columns that will be searched frequently.
Inverted indexes can be used by databases for more than just basic text search as well. InfluxDB 2.0 uses an inverted index to map metadata to specific data points, which allows fast querying and filtering of time series data like application metrics or IoT sensor data.
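The idea of mapping metadata to data can be sketched as an index from tag key/value pairs to series identifiers. This is an illustrative toy and not InfluxDB's actual implementation. The `series` data, the `tag_index` structure, and the `find_series` helper are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical time series, each identified by an ID and described by tags.
series = {
    "s1": {"host": "web-1", "region": "us-east", "metric": "cpu"},
    "s2": {"host": "web-2", "region": "us-east", "metric": "cpu"},
    "s3": {"host": "web-1", "region": "us-east", "metric": "memory"},
}

# Inverted index: (tag key, tag value) -> set of series IDs.
tag_index = defaultdict(set)
for series_id, tags in series.items():
    for key, value in tags.items():
        tag_index[(key, value)].add(series_id)

def find_series(**filters):
    """Return the series IDs that match every key=value filter."""
    matches = [tag_index.get(item, set()) for item in filters.items()]
    # With no filters, every series matches.
    return set.intersection(*matches) if matches else set(series)

print(sorted(find_series(metric="cpu")))                # ['s1', 's2']
print(sorted(find_series(host="web-1", metric="cpu")))  # ['s1']
```

Filtering by tags becomes a set intersection over a few index entries instead of a scan over every series.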
Inverted indexes are often used in bioinformatics to search for specific fragments of matching DNA sequences. Indexing the substrings of all 3 billion base pairs of human DNA requires tens of gigabytes of RAM to hold the inverted index in memory.
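A minimal sketch of substring indexing, assuming fixed-length substrings (often called k-mers) are used as the index terms. The toy sequence, the choice of `k=3`, and the function names are hypothetical. Real tools use far longer sequences and more compact data structures.

```python
from collections import defaultdict

def build_kmer_index(sequence, k):
    """Map every length-k substring (k-mer) to its start positions."""
    index = defaultdict(list)
    for start in range(len(sequence) - k + 1):
        index[sequence[start:start + k]].append(start)
    return index

def find_fragment(sequence, index, k, fragment):
    """Find start positions of a fragment of length >= k (seed, then verify)."""
    seed = fragment[:k]
    return [
        pos for pos in index.get(seed, [])
        if sequence[pos:pos + len(fragment)] == fragment
    ]

genome = "ACGTACGTGACG"  # toy sequence, not real data
kmer_index = build_kmer_index(genome, k=3)
print(kmer_index["ACG"])                              # [0, 4, 9]
print(find_fragment(genome, kmer_index, 3, "ACGTG"))  # [4]
```

Because every position in the sequence produces an index entry, the index grows with the sequence length, which is why memory becomes the limiting factor at genome scale.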
Inverted indexes can be used for a number of text analytics and NLP tasks. One example is plagiarism detection where an inverted index can be used to find identical strings of words across different documents.
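One way to sketch the plagiarism example is to index short word sequences (sometimes called shingles) instead of single words, then report any sequence that appears in more than one document. The sample essays, the three-word window, and the function names are assumptions chosen for illustration.

```python
from collections import defaultdict

def shingles(text, n):
    """Return the set of n-word sequences (shingles) in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def shared_passages(docs, n=3):
    """Map each shingle found in two or more documents to those documents."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for shingle in shingles(text, n):
            index[shingle].add(doc_id)
    return {s: sorted(ids) for s, ids in index.items() if len(ids) > 1}

# Hypothetical sample documents.
docs = {
    "essay_a": "an inverted index maps each term to documents",
    "essay_b": "we note that an inverted index maps words to files",
    "essay_c": "time series data needs a different layout",
}
overlaps = shared_passages(docs)
for shingle, ids in sorted(overlaps.items()):
    print(" ".join(shingle), ids)
```

Here the two overlapping word sequences between `essay_a` and `essay_b` are found with index lookups alone, without comparing every pair of documents directly.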
Inverted index weaknesses
Here are a few issues associated with inverted indexes:
- Memory usage - Because each unique word needs its own index entry, larger datasets will consume large amounts of RAM.
- Index update cost - Inverted indexes can be expensive to maintain and update because adding new documents or updating existing documents requires updating the index as well.
- Phrase match queries - Inverted indexes are optimized for exact matches on individual terms and aren’t a good fit for complex phrases and related queries, where the order and position of words matter.
- Multiple language support - Inverted indexes are language-specific because of how text is tokenized and normalized, meaning a separate inverted index would need to be created for each language to support multilingual searches.
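The phrase match limitation can be seen in a small sketch. A basic index that records only which documents contain a term cannot tell "dog bites man" from "man bites dog". Storing word positions is one common way to support phrase queries, at the cost of a larger index. The sample documents and function names below are hypothetical.

```python
from collections import defaultdict

docs = {
    1: "dog bites man",
    2: "man bites dog",
}

# Basic index: term -> set of document IDs (no positions).
basic = defaultdict(set)
# Positional index: term -> {document ID: [word positions]}.
positional = defaultdict(lambda: defaultdict(list))
for doc_id, text in docs.items():
    for pos, term in enumerate(text.split()):
        basic[term].add(doc_id)
        positional[term][doc_id].append(pos)

def all_terms_match(phrase):
    """Documents containing every term, in any order (basic index)."""
    terms = phrase.split()
    if not terms:
        return set()
    return set.intersection(*(basic.get(t, set()) for t in terms))

def phrase_match(phrase):
    """Documents where the terms appear consecutively (positional index)."""
    terms = phrase.split()
    hits = set()
    for doc_id in all_terms_match(phrase):
        for start in positional[terms[0]][doc_id]:
            # Each following term must sit exactly one position further on.
            if all(start + i in positional[t][doc_id] for i, t in enumerate(terms)):
                hits.add(doc_id)
    return hits

print(sorted(all_terms_match("man bites dog")))  # [1, 2]
print(sorted(phrase_match("man bites dog")))     # [2]
```

The basic index returns both documents for the phrase, while the positional index narrows the result to the one document where the words appear in that order.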