written by owen on 2018-May-02.
In my last article I covered what I did to complete the first phase of tackling the search problem. When you do not know how big a problem is, it is best to start with the low-hanging fruit so you can make positive progress quickly. In this article I am going to go over features that I will need to implement if I ever hope to build a large-scale government search engine.
Indexing, words, letters, and phrases
Right now I search the entire database whenever a user enters a search term. This is really fast - for now. I can do this because the database is fairly small (4 MB, 3,500+ entries) and limited to a specific set of government websites. If I wanted to go really large scale, say Caribbean-wide, I would need to set up a persistent index that would allow me to search smaller parts of the dataset. You might ask why I did not do that in the first place, and the answer is that geeks never finish anything because they over-optimize too early. I purposely seek to avoid such progress traps. Indexing is a big problem.
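To sketch what that indexing might look like, here is a minimal inverted index in Python: map each word to the set of entries that contain it, so a search only touches the matching entries instead of scanning the whole database. The entries and field names below are made up for illustration.

```python
# A minimal inverted-index sketch: word -> set of entry IDs containing it.
from collections import defaultdict

def build_index(entries):
    """entries: dict of entry id -> text. Returns word -> set of ids."""
    index = defaultdict(set)
    for entry_id, text in entries.items():
        for word in text.lower().split():
            index[word].add(entry_id)
    return index

def search(index, term):
    # Only the entries listed under this word are ever touched.
    return index.get(term.lower(), set())

entries = {
    1: "passport renewal application form",
    2: "driver licence renewal",
    3: "birth certificate application",
}
index = build_index(entries)
print(search(index, "renewal"))   # entries 1 and 2
```

A real index would be persisted to disk and split into shards, but the lookup idea is the same: the cost of a search depends on the size of the posting list, not the size of the database.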
Storing common search results
It would save a whole lot of time if I kept a cache of results for searches that have already been done, but caching is a sign of weakness. If you start caching too early you might miss potential bugs in your search logic. As the database grows bigger and more processing-intensive, there will need to be some form of cache to reduce the amount of re-work that the search engine has to do for common searches.
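When that day comes, a simple least-recently-used cache keyed on the search term would cover the common case. This is only a sketch of the idea; the cache size and the slow_search function are illustrative assumptions, not part of the actual engine.

```python
# A tiny LRU result-cache sketch: repeated queries skip the expensive search.
from collections import OrderedDict

class SearchCache:
    def __init__(self, max_size=100):
        self.max_size = max_size
        self._cache = OrderedDict()

    def get_or_compute(self, term, do_search):
        term = term.lower().strip()
        if term in self._cache:
            self._cache.move_to_end(term)    # mark as recently used
            return self._cache[term]
        result = do_search(term)
        self._cache[term] = result
        if len(self._cache) > self.max_size:
            self._cache.popitem(last=False)  # evict least-recently-used
        return result

calls = []
def slow_search(term):            # stand-in for the real database search
    calls.append(term)
    return [term + "-result"]

cache = SearchCache(max_size=50)
cache.get_or_compute("tax", slow_search)
cache.get_or_compute("tax", slow_search)   # second call served from cache
```

The cache would also need to be invalidated whenever the index is refreshed, otherwise stale results linger.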
Deep Cross matching and word association
Currently the "do you mean?" tool tip that appears when you get no results uses a soundex() function to attempt to identify misspelt words. I will have to figure out how to combine words and cross-reference all the words that are closely associated so that I can infer meaning without hand-picking them myself. This would have to be done on a pure many-to-many logic basis, or what people nowadays call "Artificial Intelligence" or machine learning. It would mostly be an offline process, since I may have to make several trips through the database to map the relationships.
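For readers unfamiliar with soundex, here is the basic idea behind it, sketched in Python: words that sound alike collapse to the same 4-character code, so a misspelling can be matched against known words. This is a simplified variant of the algorithm and skips some edge cases of the full specification.

```python
# Simplified Soundex: same idea as PHP's soundex() used by the tool tip.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"),
                       ("dt", "3"), ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    word = word.lower()
    code = word[0].upper()          # keep the first letter
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        digit = CODES.get(ch, "")
        if digit and digit != prev: # skip runs of the same sound
            code += digit
        if ch not in "hw":          # h/w do not break a run of duplicates
            prev = digit
    return (code + "000")[:4]       # pad/truncate to 4 characters

# Words that sound alike get the same code:
print(soundex("jamaica"), soundex("jamayca"))
```

A "do you mean?" feature can then bucket the dictionary by soundex code and suggest any known word sharing the code of the failed search term.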
Mining User data/behavior
The common trick nowadays is to monitor which results the user clicks and use that to determine relevance. There are many downsides to this, but it helps make the search engine seem smarter than it actually is. It is only really useful, though, if you are getting millions of hits per hour.
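The mechanics of the trick are simple enough to sketch: count clicks per (query, result) pair and re-order future results by those counts. The storage and scoring below are illustrative assumptions; a real system would also have to guard against click spam.

```python
# Click-based relevance sketch: query -> url -> click count.
from collections import defaultdict

clicks = defaultdict(lambda: defaultdict(int))

def record_click(query, url):
    clicks[query][url] += 1

def rerank(query, results):
    """Order results by how often each was clicked for this query."""
    return sorted(results, key=lambda url: clicks[query][url], reverse=True)

record_click("passport", "pica.gov.jm")
record_click("passport", "pica.gov.jm")
record_click("passport", "jis.gov.jm")
print(rerank("passport", ["jis.gov.jm", "pica.gov.jm"]))
```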
Freshness, ranking, Dates and Updates
Currently, whenever I refresh the search index I delete the whole HTTP web page cache. I will need a way to identify when a page has been updated and to track the freshness of a web page. This will bring into play issues with websites that update more frequently, and spammy blogs/websites like jis.gov.jm which are engineered not to maintain any form of history or structure. The HTTP web cache currently totals 152 MB on disk across 6,000 files, each equating to a unique URL. I will have to devise a way to routinely cycle through the most frequently updated websites and gather more data without running out of hard drive space or picking up irrelevant crap.
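One cheap way to avoid deleting the whole cache is to store a content hash per URL and only re-index a page when its hash changes. The sketch below assumes in-memory storage for brevity; a real crawler would persist the hashes and could also lean on HTTP Last-Modified/ETag headers.

```python
# Update-detection sketch: re-index a URL only when its content hash changes.
import hashlib

page_hashes = {}   # url -> hash of the last version we indexed

def needs_reindex(url, content):
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if page_hashes.get(url) == digest:
        return False           # unchanged since the last crawl
    page_hashes[url] = digest  # remember the new version
    return True

print(needs_reindex("gov.jm/page", "old text"))   # True, first sighting
print(needs_reindex("gov.jm/page", "old text"))   # False, unchanged
print(needs_reindex("gov.jm/page", "new text"))   # True, page updated
```

Tracking how often each URL's hash actually changes also gives a crude freshness score, which can drive how frequently the crawler revisits a site.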
Multi-Portal entry points
Currently I start crawling at the gov.jm portal only. This simplifies my robot and ensures that it does not go rogue off to some big unrelated website with 1 million links. But eventually I will have to create my own portal, because the people that built gov.jm seem to have left it to stagnate, which is the case with many such government brochure websites. New websites like www.nidsfacts.com/ and jamaicaeye.gov.jm/ cannot be found on it at all. I will either have to devise a new way of detecting changes in the gov.jm domain or go deeper into the dark web using some sort of blockchain.
Ranking important pages
I will have to develop a way to rank individual websites against each other, both in terms of their importance relative to each other and in relation to particular keywords, so I can speed up keyword searches by searching the highest-ranked sites first. This is different from the indexing mentioned above. The purpose of this is to ensure that relevant sites do not get taken over by noisy and spammy websites.
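A crude version of that ranking can be sketched as keyword frequency multiplied by a per-site weight, so trusted sites outrank noisy ones even when the noisy site stuffs the keyword. The site weights, URLs, and page text below are all illustrative assumptions, not real data from the engine.

```python
# Keyword-ranking sketch: term frequency scaled by a hand-assigned site weight.
SITE_WEIGHT = {"pica.gov.jm": 2.0, "spamblog.example": 0.2}

def score(url, text, keyword):
    frequency = text.lower().split().count(keyword.lower())
    return frequency * SITE_WEIGHT.get(url, 1.0)  # unknown sites weigh 1.0

pages = {
    "pica.gov.jm": "passport passport renewal",
    "spamblog.example": "passport passport passport passport passport",
}
ranked = sorted(pages, key=lambda u: score(u, pages[u], "passport"),
                reverse=True)
print(ranked)   # the trusted site wins despite fewer keyword hits
```

In practice the site weights would themselves be computed, for example from how often other government sites link to a domain, rather than hand-picked.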
I am probably missing a few, but these are right off the top of my head. The government search engine is not feature complete, but people seem to find it useful in its current state. Give it a whirl and let me know how it can be improved: owensoft.net/project/jmgov/
NB. Certain search terms/words like "jamaica" or "andrew" are still difficult to rank because they are either too common or have double meanings.