Xapian in Python
Xapian is a full-text search engine that can index various data structures and then search and query the indexed content. It is one of the full-text search engines that can be used from Python (and also from PHP). Like the others, it has a poor tutorial section, so you are left with the bare API docs. There are also the Xapwrap and PyXapian projects, but I won't touch them in this article. If you use MySQL, then use its full-text search features ;) If not, try Xapian.
Installation
You need the "xapian" and "xapian-bindings" packages. Most Linux distributions should have them in their repositories. "xapian-bindings" may be named "xapian-bindings-python" or "xapian-python" if it is split into per-language packages. In other cases check the project website.
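A quick way to check that the bindings are visible to Python is to import the module and print the library version. This is only a small sketch: the helper name xapian_version is made up for this example, and xapian.version_string() is the call used to read the version.

```python
def xapian_version():
    """Return the Xapian library version, or None if the bindings are missing.

    xapian_version is a hypothetical helper name used only in this sketch.
    """
    try:
        import xapian
    except ImportError:
        return None
    return xapian.version_string()


if __name__ == "__main__":
    version = xapian_version()
    if version is None:
        print("xapian bindings not found, install them from your distribution")
    else:
        print("xapian bindings found, library version %s" % version)
```

If the script reports that the bindings are missing, the package is probably installed for a different Python version than the one you are running.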
Introduction to Xapian
Start with a basic indexer, similar to the one in the xapian-bindings examples. Save the code to a file and create a folder named "test" for the indexer database. Then execute the script and change the indexed phrase to:
para = '''this is a test'''
Then execute it again. Now we can search using a searcher. Save it to a file and execute it, passing the search term as a parameter:
python search.py test
Both phrases are returned, but the one containing "testing" gets a lower relevance percentage. Xapian supports stemming for a number of languages. You set the language with:
stemmer = xapian.Stem("english")
Supported languages: none, danish (da), dutch (nl), english (en), finnish (fi), french (fr), german (de), italian (it), norwegian (no), portuguese (pt), russian (ru), spanish (es), swedish (sv).
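The indexer and searcher described in this section can be sketched roughly as follows. This is not the original listing, only a minimal version under some assumptions: reasonably recent bindings (commit() and match.document are used), a first phrase containing "testing" that is made up for the example, and the function names index_phrases, search and demo, which are hypothetical. The demo uses a temporary folder in place of "test".

```python
import tempfile


def index_phrases(db_path, phrases):
    """Index each phrase as one Xapian document, using English stemming."""
    import xapian  # requires the xapian Python bindings

    database = xapian.WritableDatabase(db_path, xapian.DB_CREATE_OR_OPEN)
    indexer = xapian.TermGenerator()
    indexer.set_stemmer(xapian.Stem("english"))
    for phrase in phrases:
        doc = xapian.Document()
        doc.set_data(phrase)        # stored text, returned with the results
        indexer.set_document(doc)
        indexer.index_text(phrase)  # generates the searchable terms
        database.add_document(doc)
    database.commit()
    database.close()


def search(db_path, query_string, limit=10):
    """Return a list of (percent, text) pairs for the best matches."""
    import xapian

    database = xapian.Database(db_path)
    parser = xapian.QueryParser()
    parser.set_stemmer(xapian.Stem("english"))
    parser.set_database(database)
    parser.set_stemming_strategy(xapian.QueryParser.STEM_SOME)
    enquire = xapian.Enquire(database)
    enquire.set_query(parser.parse_query(query_string))

    results = []
    for match in enquire.get_mset(0, limit):
        data = match.document.get_data()
        if isinstance(data, bytes):  # Python 3 bindings return bytes
            data = data.decode("utf-8")
        results.append((match.percent, data))
    return results


def demo():
    """Index two phrases into a temporary folder and search for 'test'."""
    db_path = tempfile.mkdtemp()
    # The first phrase is an invented example that contains "testing".
    index_phrases(db_path, ["this is a testing phrase", "this is a test"])
    for percent, text in search(db_path, "test"):
        print("%i%% %s" % (percent, text))


if __name__ == "__main__":
    try:
        demo()
    except ImportError:
        print("xapian bindings are not installed")
```

Because both the indexer and the query parser use the English stemmer, a search for "test" also finds the phrase with "testing".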
Xapian and databases
To index something from a database, we need to add the entry ID, and possibly some other data, to the indexed entry in Xapian. This requires modifying our indexer by adding three new lines of the form doc.add_value(NEWS_ID, str(323)). The first argument is an int (the "name" of a field) and the second one is the value (a string). I've added fields for the ID, title and description of a news item. The ID tells us which news item in the database it is, and the title and description are used for displaying results. Remove the files from the "test" folder and execute the indexer. Next we can use a modified searcher to display the results. We get the value of a field with code like this:
print match[xapian.MSET_DOCUMENT].get_value(NEWS_TITLE)
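As a small illustration of the value fields, here is a sketch that builds one document with the three fields and reads them back, without touching a database. NEWS_ID and NEWS_TITLE come from the text above; the name NEWS_DESCRIPTION, the slot numbers, the helper names and the sample title and description are assumptions made for this example. In recent bindings the document of a search result is available as match.document, so the same get_value call applies there.

```python
# Slot numbers act as field "names". The numbers themselves are arbitrary
# and NEWS_DESCRIPTION is a name assumed for this sketch.
NEWS_ID, NEWS_TITLE, NEWS_DESCRIPTION = 0, 1, 2


def build_news_document(news_id, title, description):
    """Create a Xapian document carrying the ID, title and description."""
    import xapian  # requires the xapian Python bindings

    doc = xapian.Document()
    doc.set_data(description)
    doc.add_value(NEWS_ID, str(news_id))  # values are stored as strings
    doc.add_value(NEWS_TITLE, title)
    doc.add_value(NEWS_DESCRIPTION, description)
    return doc


def read_value(doc, slot):
    """Read a value field back as text."""
    value = doc.get_value(slot)
    if isinstance(value, bytes):  # Python 3 bindings return bytes
        value = value.decode("utf-8")
    return value


if __name__ == "__main__":
    try:
        # Sample title and description are made up for the example.
        doc = build_news_document(323, "Example title", "Example description")
        print(read_value(doc, NEWS_ID))
        print(read_value(doc, NEWS_TITLE))
    except ImportError:
        print("xapian bindings are not installed")
```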
For a Django news application, indexing all news follows the same pattern, with one Xapian document per news entry. Save the indexing script to a file in the Django project folder, create the "test" folder and execute it (if you have such an application ;) the Diamanda news application is used here). To search you can use the same searcher as above. An example using news from my rkblog.rk.edu.pl website:
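The indexing loop for such an application could look roughly like the sketch below. It is not the Diamanda code: the News namedtuple is a stand-in for the Django model, the attribute names id, title and text are assumptions, and so are the function names. Any iterable of objects with those attributes would work, which is how a Django queryset could be passed in. The search helper returns the stored IDs so that the matching rows can be fetched from the database afterwards.

```python
import tempfile
from collections import namedtuple

NEWS_ID, NEWS_TITLE = 0, 1

# Stand-in for a Django model; the field names are assumed for this sketch.
News = namedtuple("News", "id title text")


def index_news(db_path, news_items):
    """Add one Xapian document per news entry and return how many were added."""
    import xapian  # requires the xapian Python bindings

    database = xapian.WritableDatabase(db_path, xapian.DB_CREATE_OR_OPEN)
    indexer = xapian.TermGenerator()
    indexer.set_stemmer(xapian.Stem("english"))
    count = 0
    for news in news_items:
        doc = xapian.Document()
        doc.set_data(news.text)
        doc.add_value(NEWS_ID, str(news.id))
        doc.add_value(NEWS_TITLE, news.title)
        indexer.set_document(doc)
        indexer.index_text(news.title)  # make the title searchable
        indexer.index_text(news.text)   # make the body searchable
        database.add_document(doc)
        count += 1
    database.commit()
    database.close()
    return count


def search_news_ids(db_path, query_string, limit=10):
    """Return (news_id, percent) pairs for the best matches."""
    import xapian

    database = xapian.Database(db_path)
    parser = xapian.QueryParser()
    parser.set_stemmer(xapian.Stem("english"))
    parser.set_database(database)
    parser.set_stemming_strategy(xapian.QueryParser.STEM_SOME)
    enquire = xapian.Enquire(database)
    enquire.set_query(parser.parse_query(query_string))
    results = []
    for match in enquire.get_mset(0, limit):
        raw_id = match.document.get_value(NEWS_ID)
        if isinstance(raw_id, bytes):  # Python 3 bindings return bytes
            raw_id = raw_id.decode("utf-8")
        results.append((int(raw_id), match.percent))
    return results


if __name__ == "__main__":
    try:
        path = tempfile.mkdtemp()
        # Titles are taken from the sample output; the texts are invented.
        items = [News(1, "Django 0.96 released", "A new Django release is out"),
                 News(2, "polib", "gettext translation manager")]
        index_news(path, items)
        print(search_news_ids(path, "django"))
    except ImportError:
        print("xapian bindings are not installed")
```

Storing only the ID and title keeps the index small. Everything else can be loaded from the database using the returned IDs.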
[piotr@localhost biblioteka]$ python simplesearch.py django
Performing query `Xapian::Query(django)'
8 results found
How to Beat Rails - 100%
More crazy changes for Django 1.0 ? - 98%
Big Django project ;) - 87%
Django 0.96 released - 86%
polib - gettext translation manager - 71%
Diamanda 2006.12 Stable Released - 65%
More on Django 1.0 changes - 65%
Djangoish Gettext Translator - 57%
Pumping Up Your Applications with Xapian Full-Text Search - More advanced example using XML-RPC and Twisted