Centillion: Difference between revisions

Revision as of 04:15, 4 August 2018

Link: https://git.charlesreid1.com/charlesreid1/centillion

Centillion is a search engine written using Whoosh, a Python library for building search engines. It indexes Google Drive files and documents, GitHub issues, and Markdown files in GitHub repositories.

More information is in the README in the repository.

How it works

Here's the basic rundown:

Backend

The search engine backend uses Whoosh. There is a Search object that extends the Whoosh search index object. The user also defines a schema, which allows them to create a search index.

Update search index method

The first important method is the update search index method. This performs three kinds of actions:

Add new documents to the search index
Update changed documents in the search index
Delete removed documents

The update search index method is the method that interfaces with API services. It downloads and stores the data from the API, but does not actually use it. Instead it determines a unique identifier for the given document, and uses the unique identifier to group the document into one of three categories: Add, Update, or Delete.
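The grouping step described above can be sketched with set operations on the unique identifiers. This is a minimal illustration, not Centillion's actual code: the function name `categorize` and the sample identifiers are hypothetical, and the sketch treats every identifier found in both places as an update candidate without checking whether the document really changed.

```python
def categorize(remote_ids, indexed_ids):
    """Split document identifiers into add, update, and delete groups.

    remote_ids:  identifiers currently reported by the API service
    indexed_ids: identifiers already stored in the search index
    """
    remote = set(remote_ids)
    indexed = set(indexed_ids)
    return {
        "add": remote - indexed,     # at the service, not yet indexed
        "update": remote & indexed,  # present in both places
        "delete": indexed - remote,  # indexed, but gone from the service
    }


# Hypothetical identifiers, for illustration only.
groups = categorize(
    remote_ids=["doc-a", "doc-b", "doc-c"],
    indexed_ids=["doc-b", "doc-c", "doc-d"],
)
```

Working on identifiers alone keeps this step cheap: the document contents are only needed later, when each document is passed to the add document method.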

Add document method

The update search index method passes each document to an add document method that actually extracts information from the document. The add document method parses the document record (usually JSON or a custom object type from an API library) and defines how to turn that document record into a search index record.

Google Drive files, for example, are accessed using the Google Drive API, which returns a JSON object with information about Drive files. This record contains a field called "mimeType" which indicates the type of file (document, presentation, spreadsheet, PDF, video, etc.).

Depending on the value of this field, the search index record that is created will be different. If "mimeType" contains "document", the file can be downloaded and converted to text, and the text of the document added to the search index record that is created for that Google Drive file. However, if "mimeType" contains "spreadsheet", the program will not attempt to extract text from the spreadsheet.
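The branching on "mimeType" might look roughly like the sketch below. It assumes the Drive record has already been parsed into a dict with "id", "name", and "mimeType" keys. The function `build_drive_record`, the `fetch_text` callback, and the record keys "title" and "content" are hypothetical names chosen for illustration, not Centillion's own.

```python
def build_drive_record(item, fetch_text):
    """Turn a Drive file record (a dict) into a search index record (a dict).

    fetch_text is a hypothetical callback that downloads a file by its
    identifier and returns its contents as plain text.
    """
    mime_type = item.get("mimeType", "")
    record = {
        "id": item["id"],
        "title": item.get("name", ""),
        "mimeType": mime_type,
        "content": "",
    }
    # Only documents have their text extracted; spreadsheets and other
    # file types are indexed by their metadata alone.
    if "document" in mime_type:
        record["content"] = fetch_text(item["id"])
    return record


def fake_fetch_text(file_id):
    # Stand-in for the real download and convert step.
    return "text of " + file_id


doc = build_drive_record(
    {"id": "1", "name": "Notes",
     "mimeType": "application/vnd.google-apps.document"},
    fake_fetch_text,
)
sheet = build_drive_record(
    {"id": "2", "name": "Budget",
     "mimeType": "application/vnd.google-apps.spreadsheet"},
    fake_fetch_text,
)
```

Here `doc["content"]` holds the extracted text, while `sheet["content"]` stays empty because no text extraction is attempted for spreadsheets.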

GitHub issues and comments provide another example. The add issue method takes an object as an input (an Issue object from the PyGithub library), and turns that issue into a search record. To do that, the issue's text is extracted, but then the text of each comment is also extracted and appended to the issue text, so that issues and comments in a single thread show up as a single search result.
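A minimal sketch of folding a thread into one record follows. To keep it runnable without network access, it uses small stand-in classes (`FakeIssue`, `FakeComment`) that mimic only the attributes used here: `title`, `body`, and `get_comments()`. The function name `issue_to_record` and the record keys are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FakeComment:
    """Stand-in for a comment object; only .body is used."""
    body: str


@dataclass
class FakeIssue:
    """Stand-in for an issue object; only the attributes below are used."""
    title: str
    body: str
    comments: list = field(default_factory=list)

    def get_comments(self):
        return self.comments


def issue_to_record(issue):
    """Combine an issue and all of its comments into one search record."""
    parts = [issue.body or ""]  # an issue body may be empty
    for comment in issue.get_comments():
        parts.append(comment.body or "")
    return {"title": issue.title, "content": "\n\n".join(parts)}


issue = FakeIssue(
    title="Crash on startup",
    body="It crashes.",
    comments=[FakeComment("Same here."), FakeComment("Fixed in master.")],
)
record = issue_to_record(issue)
```

Because the comment text lives in the same "content" value as the issue text, a query that matches only a comment still returns the thread as a single result.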

Whoosh schema

One thing we did not mention about Centillion is how the schema works, given that Centillion searches across such heterogeneous record types (files, documents, issues, and Markdown files).

The key is to create a Schema that has many different fields, and to set them to blank values if the field does not apply to the type of record. For example, if I want to store the "mimeType" of Google Drive files, so that I can present it to the user as context for the Google Drive search results, I can create a schema with a "mimeType" field. When I am creating a search index record for a Google Drive spreadsheet, I will set "mimeType" to the file's mimeType value returned by the Google Drive API. When I am creating a search index record for a GitHub issue, I will set the "mimeType" field to the empty string.

(Note that you cannot simply leave the field out entirely; you must explicitly set it to an empty string when you create your search index record.)
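One way to make sure every field is always set explicitly is to build each record from a template of blank values. The sketch below is plain Python and does not call Whoosh itself. The field list `FIELD_NAMES` and the helper `make_record` are hypothetical, and the real schema will have its own fields.

```python
# Hypothetical field list covering several record types.
FIELD_NAMES = ("id", "kind", "title", "content", "mimeType", "issue_url")


def make_record(**values):
    """Return a record with every schema field set, blank by default."""
    unknown = set(values) - set(FIELD_NAMES)
    if unknown:
        raise ValueError("not in schema: " + ", ".join(sorted(unknown)))
    record = {name: "" for name in FIELD_NAMES}  # start with all blanks
    record.update(values)                        # fill in what applies
    return record


drive_record = make_record(
    id="drive-1",
    kind="gdoc",
    title="Budget",
    mimeType="application/vnd.google-apps.spreadsheet",
)
issue_record = make_record(
    id="issue-7",
    kind="issue",
    title="Crash on startup",
    content="It crashes.",
    issue_url="https://example.com/issues/7",  # placeholder URL
)
```

The resulting dict can then be passed as keyword arguments when adding a document through a Whoosh index writer, so a GitHub issue record carries an empty "mimeType" rather than omitting it.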

Frontend

The front end page
