Postgres: Text Searches

So I’ve found what is easily the most interesting part of SQL I’ve learned about so far: full text search. Basically, it lets you do “intelligent” term searching against a longer document (more intelligent than brute-force checking with the LIKE operator, anyway). The two main data types involved are tsvector and tsquery, which essentially represent a normalized document and a normalized search, respectively. A tsquery contains the search terms, which are matched against the tsvector.

The document to search must first be converted to a tsvector, which involves parsing the document into tokens (numbers, words, hyphenated words, etc.). This is done by a parser. These tokens are then passed through a series of dictionaries, which try to recognize each token and normalize it. This includes stripping a word down to its root and matching on that.
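To see the parser and dictionaries at work, here is a small sketch using the built-in ts_debug function. It assumes the stock 'english' text search configuration; the sample sentence is just an illustration, and the exact rows you get back depend on how your server is configured.

```sql
-- List each token the parser produces, the token type (alias),
-- the dictionaries consulted for that type, and the resulting lexemes.
-- Assumes the built-in 'english' configuration.
SELECT alias, token, dictionaries, lexemes
FROM ts_debug('english', 'A banker visited the Moon-Base');

-- Ask one dictionary directly what it does with a single word.
-- 'english_stem' is the stemmer dictionary used by the 'english' configuration.
SELECT ts_lexize('english_stem', 'visited');
```

Rows with an empty lexemes array are tokens that a dictionary recognized as stop words and dropped.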

For example, a simple doc may contain the text: “A banker visited the Moon-Base”. The tokens produced by the parser may be something like “A”, “banker”, “visited”, “the”, “Moon-Base”. For the hyphenated word, the parser may also emit the parts “Moon” and “Base” as tokens of their own. The dictionaries will then likely resolve “visited” to the root “visit”, and they will simply remove the stop words: common words that are likely of no use in a search (“A”, “the”…). Whether “banker” gets cut down to “bank” depends on the dictionary; as far as I can tell, the default English stemmer leaves it as “banker”. The result is a set of lexemes, which are just normalized tokens. These are stored in the tsvector.
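Here is a minimal way to look at the resulting tsvector for that sentence. This again assumes the built-in 'english' configuration; a different configuration would produce different lexemes.

```sql
-- Convert the sample sentence to a tsvector.
-- The output lists each lexeme with the position(s) where it occurred;
-- stop words do not appear in the result at all.
SELECT to_tsvector('english', 'A banker visited the Moon-Base');
```

Note that the positions still count the stop words, which is what lets position-based matching work later on.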

Then a query can be run against the vector to see if it matches. A tsquery supports the standard Boolean operators, written as & (AND), | (OR), and ! (NOT). So something like ‘(moon | sun) & visiting & !gibberish’ would match (once “visiting” is normalized to the same root as “visited”), but ‘star & moon’ would not. There is also a ‘FOLLOWED BY’ operator (<->), which checks whether one word immediately follows another. So the query ‘base <-> moon’ would not match, but ‘moon <-> base’ would.
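The queries above can be tried out with the @@ match operator. This is a sketch that assumes the built-in 'english' configuration and PostgreSQL 9.6 or later (needed for the <-> operator); the column names are made up for readability.

```sql
-- Build the tsvector once, then test several tsqueries against it.
WITH doc AS (
  SELECT to_tsvector('english', 'A banker visited the Moon-Base') AS v
)
SELECT
  -- Boolean operators: | is OR, & is AND, ! is NOT
  v @@ to_tsquery('english', '(moon | sun) & visiting & !gibberish') AS bool_query,
  -- "star" is not in the document, so the AND cannot be satisfied
  v @@ to_tsquery('english', 'star & moon') AS missing_term,
  -- FOLLOWED BY: "moon" immediately followed by "base"
  v @@ to_tsquery('english', 'moon <-> base') AS in_order,
  -- Same words in the opposite order
  v @@ to_tsquery('english', 'base <-> moon') AS reversed
FROM doc;
```

Passing the query text through to_tsquery with the same configuration matters: it normalizes the search terms the same way the document was normalized, which is why “visiting” can line up with “visited”.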

This is just a very simple example and explanation of what it’s capable of. The best part is that this is very applicable to my Quizipedia application, and some of the things I’ve learned about are even things that I’ve been implementing on a small scale on my own. For example, I’ve created a list of “common words” that automatically get ignored by the relevant-word algorithm. This is essentially what the dictionaries do when they catch stop words. So I will definitely be exploring this topic a lot more, and hopefully applying it to my current and future projects.