written by owen on 2019-Oct-15.
The main goal of this re-index was to be able to find the keywords "salary scale" on the mof website - mission accomplished!
As for the code, I had to make distinctions between ignored words and ignored websites because the test to see if a link was "trash" was getting really slow. So I split the filter into 2 parts: one for malformed links I want to skip, and the other for websites that I do not want to crawl (but still want to list).
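A rough sketch of what a two-part filter like this could look like. It is written in Python purely for illustration and is not the crawler's actual code; the names `SKIP_PREFIXES`, `NO_CRAWL_HOSTS`, `is_malformed`, `is_no_crawl` and `classify`, and the example hosts, are all made up for the example.

```python
from urllib.parse import urlparse

# Hypothetical rules; the real crawler's lists are not shown in this post.
SKIP_PREFIXES = ("mailto:", "javascript:", "tel:")
NO_CRAWL_HOSTS = {"example.com"}  # listed in results, but never crawled


def is_malformed(link: str) -> bool:
    """Part 1: cheap checks for links that should be skipped entirely."""
    link = link.strip()
    if not link or link.startswith("#"):
        return True
    if link.lower().startswith(SKIP_PREFIXES):
        return True
    try:
        parts = urlparse(link)
    except ValueError:
        return True
    # Assumes relative links were already resolved to absolute URLs.
    return parts.scheme not in ("http", "https") or not parts.netloc


def is_no_crawl(link: str) -> bool:
    """Part 2: well-formed links on hosts that should not be crawled."""
    host = urlparse(link).hostname or ""
    return any(host == h or host.endswith("." + h) for h in NO_CRAWL_HOSTS)


def classify(link: str) -> str:
    """Return 'skip', 'list_only' or 'crawl' for a single link."""
    if is_malformed(link):
        return "skip"
    if is_no_crawl(link):
        return "list_only"
    return "crawl"
```

Running the cheap malformed-link test first means the host lookup only happens for links that survive it, which is one plausible reason a split like this would speed things up.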
Stats: Links count: 29016. Words: 6594 => 10555. References to [jamaica] were 15029, now 59093. Index size was 14.5 MB and is now 36.8 MB. That is more than a 150% file size increase, which hurts considering I am quickly running out of server space. I uploaded it and was promptly welcomed with this message:
"Fatal error: Allowed memory size of 268435456 bytes exhausted (tried to allocate 77 bytes)"
I had to reduce the file size so that it can fit in memory. Yes, I load the entire file into memory for each search. So I looked at the index data array to see what I could possibly throw away. I was collecting some meta tags from the HTML pages in the index, so I threw those out and got the file down to 25 MB. I thought of streaming the file into memory but that is currently not possible since it is serialized. If push comes to shove I may have to resort to a new file format in the future.
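For what a streamable file format might look like, here is a minimal sketch, assuming one JSON record per line (often called JSON Lines) with one indexed word per record. It is in Python for illustration only; the record layout and the names `write_index` and `lookup` are assumptions, not the index's real structure.

```python
import json


def write_index(path, index):
    """Write one JSON record per line: {"word": ..., "links": [...]}."""
    with open(path, "w", encoding="utf-8") as f:
        for word in sorted(index):
            record = {"word": word, "links": index[word]}
            f.write(json.dumps(record) + "\n")


def lookup(path, query_words):
    """Scan the file line by line, keeping only the requested words."""
    wanted = set(query_words)
    found = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            entry = json.loads(line)
            if entry["word"] in wanted:
                found[entry["word"]] = entry["links"]
                if len(found) == len(wanted):
                    break  # every requested word was found, stop early
    return found
```

With a layout like this only one line is decoded at a time, so memory use depends on the largest single record rather than on the whole file. The trade-off is a linear scan per search unless some kind of offset table is added.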
I finally went down into the awfully bloated jis.gov.jm website but I had to block cirt.gov.jm because it is pure link spam. Every time I look at this code I find a new bug or edge case. I am not sure how modern programmers get things done on projects which have so many complications. Yeah, I could use a search API and apply some filters, but where is the fun in that? I would be already done, but what would I have learned? Nothing.
Anyhow, version 1.7 is up and running.