From charlesreid1

(4 intermediate revisions by the same user not shown)
Line 1: Line 1:
==Code==


Code here:
* https://git.charlesreid1.com/charlesreid1/whoosh-search
* https://git.charlesreid1.com/charlesreid1/markdown-search
* https://git.charlesreid1.com/charlesreid1/issues-search
* https://git.charlesreid1.com/charlesreid1/cheeseburger-search


==Notes==
Line 9: Line 13:
You define a schema and create an index, then start to add documents to the index. The result is a search index built by whoosh.


Because building a search engine is extremely broad in scope and can cover a really wide range of things, Whoosh is highly configurable. Unfortunately, that means it is often hard to understand what the documentation is talking about, because it stays at too high a level.
===Centillion===


This page collects notes on a Whoosh setup to create a search index for something nice and concrete: a pile of Markdown files.
Centillion is a search engine that uses Whoosh to search across Google Drive documents and files, GitHub issues, and Markdown files in GitHub repositories.


==How Search Engines Work==
Link: https://git.charlesreid1.com/charlesreid1/centillion


Let's take a step back and talk about how Whoosh works at a high level.
More information: [[Centillion]]


A search engine takes a given document and determines all the terms that a user might enter to find that document. Each search term, like "tuba", can be turned into a number, like 001010111010110101010101010101010101, by a computation that is very fast and always gives the same result for the same term. That number labels a slot in a lookup table, and inside that slot we store references to every place where the term occurred.
===How it works===


This is a data structure known as a ''hash map''. Its key property is that finding a term's slot takes roughly the same small amount of time no matter how many terms are stored, and without it, computing as we know it would be a miserable failure.
[[Whoosh/How It Works]]


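The magic of the hash map comes from the hash function, a deterministic function with some peculiar statistical properties: it spreads its inputs roughly evenly across the available slots, and it is not designed to be run backwards. The hash function is what turns a string into a number that serves as an address.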
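As a rough illustration of the idea described above, the sketch below hashes a search term to a slot number and stores document references in that slot. The names <code>slot_for</code>, <code>add_term</code>, and <code>lookup</code>, the table size of 16, and the second file path are all assumptions made for this example. SHA-256 is used only because it gives the same number on every run; none of this is meant to describe Whoosh's internals.

```python
import hashlib

NUM_SLOTS = 16  # hypothetical table size, chosen only for illustration


def slot_for(term):
    """Turn a string into a slot number in the range [0, NUM_SLOTS)."""
    digest = hashlib.sha256(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SLOTS


# Each slot holds (term, locations) pairs, because two different terms
# can land in the same slot (a collision).
table = [[] for _ in range(NUM_SLOTS)]


def add_term(term, location):
    """Record that `term` occurs at `location`."""
    bucket = table[slot_for(term)]
    for stored_term, locations in bucket:
        if stored_term == term:
            locations.append(location)
            return
    bucket.append((term, [location]))


def lookup(term):
    """Return every location recorded for `term` (empty list if none)."""
    for stored_term, locations in table[slot_for(term)]:
        if stored_term == term:
            return locations
    return []


add_term("tuba", "docs/music/instruments/tuba.md")
add_term("tuba", "docs/music/brass.md")  # hypothetical second document
print(lookup("tuba"))
```

In everyday Python the built-in <code>dict</code> already does this bookkeeping, so a hand-written table like this one is only useful for seeing the moving parts.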
===More Links===


===Simple Example===
whoosh on github: https://github.com/whoosh-community/whoosh


To start with a super simple search engine, documents might be indexed by the search engine using only a set of tags, so that search engine users would only be able to match words that were added as a tag. If I add the tags "music" and "tuba" to the document "docs/music/instruments/tuba.md", and I only implement a keyword-based search engine, then there are only two user inputs that will lead to that document being returned as a result: "music" and "tuba".
flask whoosh alchemy: https://github.com/gyllstromk/Flask-WhooshAlchemy/blob/master/flask_whooshalchemy.py


The search engine has a way to turn that search term into a number, and to use that number as the address of a slot in a really fast lookup table, so that if we have predicted that the user might search for "music", and they do, then we have the result ready instantly in that slot. The step that turns the term into a number is the hash function.
flask whoosh alchemy plus: https://github.com/Revolution1/Flask-WhooshAlchemyPlus


Because the number of tags that will occur is bound to be finite, such a lookup table is possible to build.
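A minimal sketch of the tag-only search engine described above, using Python's built-in hash map (<code>dict</code>, here via <code>defaultdict</code>). The function names <code>add_document</code> and <code>search</code> are assumptions made for this example and are not Whoosh's API.

```python
from collections import defaultdict

# Maps each tag to the set of documents that carry that tag.
tag_index = defaultdict(set)


def add_document(path, tags):
    """Index a document under each of its tags."""
    for tag in tags:
        tag_index[tag].add(path)


def search(term):
    """Return the documents tagged with exactly `term`."""
    # .get() avoids creating an empty entry for terms we never indexed.
    return sorted(tag_index.get(term, set()))


add_document("docs/music/instruments/tuba.md", ["music", "tuba"])

print(search("tuba"))   # matches: "tuba" was added as a tag
print(search("brass"))  # no match: "brass" was never added as a tag
```

Only the two tags that were added lead to the document; every other input returns nothing, which is the limitation of a purely keyword-based engine.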
whoosh app engine: https://github.com/tallstreet/Whoosh-AppEngine/blob/master/src/whoosh/index.py


===Complications===
There are already complications about whether the tags are case-sensitive or multi-word, or whether certain kinds of punctuation are allowed or not, or whether there are length restrictions or if you can use a whole sentence as a tag.
This is where Whoosh not making any decisions for you can be useful, because it provides ways for you to make various decisions about these things (most of which you probably did not think of).
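To make those decisions concrete, here is one possible set of answers written as a small normalization function. Every choice in it (case-insensitive tags, punctuation dropped except hyphens, multi-word tags allowed, a 40-character limit) is an assumption picked for illustration, and <code>normalize_tag</code> is a hypothetical helper, not part of Whoosh.

```python
import re

MAX_TAG_LENGTH = 40  # hypothetical length restriction


def normalize_tag(raw):
    """Return the canonical form of a tag, or None if the tag is rejected."""
    tag = raw.strip().lower()               # decision: tags are case-insensitive
    tag = re.sub(r"[^\w\s-]", "", tag)      # decision: drop punctuation, keep hyphens
    tag = re.sub(r"\s+", " ", tag).strip()  # decision: multi-word tags, single spaces
    if not tag or len(tag) > MAX_TAG_LENGTH:
        return None                         # decision: reject empty or overlong tags
    return tag


print(normalize_tag("  Tuba! "))
print(normalize_tag("Brass   Band"))
print(normalize_tag("!!!"))
```

The same function has to be applied both when a document is indexed and when a user's search term comes in, otherwise the two sides will not agree on what a tag looks like.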
===Realistic Example: Words and Grams===
For a more realistic example than the tag search engine, consider a document search engine, where we are interested in matching words and phrases. Then we need to build an index of the words occurring in the document, but also of n-grams (n words in a row). For small n, there are approximately as many n-grams as there are words in the document: a document of W words contains W - n + 1 of them.
The procedure from above is repeated: the location of each occurrence of each word is added to the hash map (input is the word the user may search for, output is where we store the locations it occurred).
We then do that with bi-grams, two word combinations, likewise feeding the bi-gram to the hash function. And so on with tri-grams, and 4-grams, up to the maximum the user wishes.
Like the number of words in a document, the number of n-grams that actually appear in a corpus of documents will be far smaller than the number of possible n-grams, making this approach storage-efficient.
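The procedure above can be sketched in a few lines of Python. This is an illustration under simple assumptions (lowercased text split on whitespace, a <code>dict</code> as the hash map, made-up document names and contents) and not a description of how Whoosh stores its index; <code>ngrams</code> and <code>build_index</code> are hypothetical names.

```python
from collections import defaultdict


def ngrams(words, n):
    """Yield (position, n-gram) for every run of n consecutive words."""
    for i in range(len(words) - n + 1):
        yield i, " ".join(words[i:i + n])


def build_index(docs, max_n=2):
    """Map each word and n-gram (up to max_n) to the places it occurs."""
    index = defaultdict(list)
    for name, text in docs.items():
        words = text.lower().split()
        for n in range(1, max_n + 1):
            for position, gram in ngrams(words, n):
                index[gram].append((name, position))
    return index


# Hypothetical two-document corpus.
docs = {
    "tuba.md": "the tuba is a brass instrument",
    "horn.md": "the horn is a brass instrument too",
}
index = build_index(docs, max_n=2)

print(index.get("tuba"))              # a single word
print(index.get("brass instrument"))  # a bi-gram found in both documents
print(len(list(ngrams(docs["tuba.md"].split(), 2))))  # 6 words give 5 bi-grams
```

Only the n-grams that actually occur get an entry, so the index grows with the size of the corpus rather than with the number of possible word combinations.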




==Flags==


{{SearchFlag}}


{{PythonFlag}}

Latest revision as of 20:51, 17 February 2020

Notes

Whoosh is a library for building search engines in Python.
