I have updated the softnews project. Fixed some display bugs I had on mobile devices (probably added some new ones too). The feed now links directly to YouTube/Vimeo videos instead of trying to play them embedded. Photo blogs seem to be going the way of the dinosaur, but I am still looking out for new ones that are not on Instagram. Check out softnews; it gets updated every morning, magically, from RSS feeds. It follows almost 80 Jamaican/Trinidadian/Caribbean blogs and YouTube/Vimeo channels, plus a few international ones. Post a comment below if you want it to follow your stuff (you must have an RSS feed).
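For anyone curious how the "magic" works, pulling items out of an RSS feed in PHP is about this much code. This is just a minimal sketch with an example URL, not the actual softnews source:

    <?php
    // Minimal sketch: fetch one RSS feed and list its newest items.
    // The feed URL is only an example, not one of the feeds softnews follows.
    $xml = @simplexml_load_file('https://example.com/feed.xml');
    if ($xml === false) {
        die("Could not load feed\n");
    }
    foreach ($xml->channel->item as $item) {
        $title = (string) $item->title;
        $link  = (string) $item->link;
        $date  = strtotime((string) $item->pubDate);
        // A real aggregator would store these and skip items it has already seen.
        echo date('Y-m-d', $date) . "  $title\n  $link\n";
    }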
Always be improving. I am not even sure how long ago I originally wrote this application (Oct 2009), but I dug into it a couple days ago and added a few new things that make it better and easier to use. I removed the WYSIWYG editor because I had forgotten it was even there. See the screenshots to compare the old to the new. It is basically a public todo list / notepad for keeping text and random notes. Anyone can post into it or edit it. It has basic tagging, which lets the user label a note with one or more words in the format "label1,label2:note title", for example "movie,links:things to watch". This was a design choice that took me a little while to plan out. I am still tinkering with the tagging.
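For the curious, parsing that tag format boils down to splitting on the first colon and then on commas. This is just a sketch of the idea, not the actual code from the app:

    <?php
    // Sketch: split "label1,label2:note title" into tags and a title.
    // Notes with no colon are treated as having no tags.
    function parse_note_tags($raw) {
        $pos = strpos($raw, ':');
        if ($pos === false) {
            return array('tags' => array(), 'title' => trim($raw));
        }
        $tags  = array_map('trim', explode(',', substr($raw, 0, $pos)));
        $title = trim(substr($raw, $pos + 1));
        return array('tags' => array_filter($tags), 'title' => $title);
    }

    print_r(parse_note_tags('movie,links:things to watch'));
    // prints the parsed tags (movie, links) and the title (things to watch)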
It is written in PHP with a MySQL database, plus HTML and CSS. I went back and forth on whether to use Vue.js, React, or a Laravel backend; I might reconsider in the future. I will post the GitHub repo at a later date.
The main goal of this re-index was to be able to find the keywords "salary scale" on the mof website - mission accomplished!
As for the code, I had to make a distinction between ignored words and ignored websites. The test to see if a link was "trash" was getting really slow, so I split the filter into two parts: one for malformed links I want to skip entirely, and the other for websites that I do not want to crawl (but still want to list).
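Roughly, the split looks something like this. The function names and example lists are made up for illustration, not lifted from the actual code:

    <?php
    // Sketch of the two-part filter.
    $ignored_patterns = array('mailto:', 'javascript:', '#');   // malformed/useless links, skip completely
    $no_crawl_sites   = array('cirt.gov.jm');                   // list these, but do not crawl them

    function is_trash_link($url, $ignored_patterns) {
        foreach ($ignored_patterns as $p) {
            if (stripos($url, $p) !== false) {
                return true;   // drop it entirely
            }
        }
        return false;
    }

    function is_no_crawl($url, $no_crawl_sites) {
        $host = parse_url($url, PHP_URL_HOST);
        return !empty($host) && in_array($host, $no_crawl_sites);   // keep in the index, skip crawling
    }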
Stats: link count: 29016. Words: 6594 => 10555. References to [jamaica] went from 15029 to 59093. Index size was 14.5 MB and is now 36.8 MB. That is more than double the file size, and I am quickly running out of server space. I uploaded it and was promptly welcomed with this message:
"Fatal error: Allowed memory size of 268435456 bytes exhausted (tried to allocate 77 bytes)"
I had to reduce the file size so that it could fit in memory. Yes, I load the entire file into memory for each search. So I looked at the index data array to see what I could possibly throw away. I was collecting some meta tags from the HTML pages in the index, so I threw those out and got the file down to 25 MB. I thought of streaming the file into memory, but that is currently not possible since it is serialized. If push comes to shove I may have to resort to a new file format in the future.
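To illustrate why streaming does not work right now: unserialize() needs the whole string in memory at once, while a line-based format could be filtered as it is read. This is only a sketch of the difference, assuming the index is a serialized PHP array on disk; the file names are examples:

    <?php
    // Current situation: the whole serialized index is read and unserialized
    // in one shot, so the entire structure has to fit in memory.
    $index = unserialize(file_get_contents('index.dat'));

    // A line-based format (e.g. one JSON record per word) could be streamed
    // instead, keeping only the matching entries in memory.
    $handle = fopen('index.jsonl', 'r');
    while (($line = fgets($handle)) !== false) {
        $entry = json_decode($line, true);
        if ($entry !== null && $entry['word'] === 'jamaica') {
            print_r($entry['links']);
        }
    }
    fclose($handle);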
I finally went down into the awfully bloated jis.gov.jm website, but I had to block cirt.gov.jm because it is pure link spam. Every time I look at this code I find a new bug or edge case. I am not sure how modern programmers get things done on projects which have so many complications. Yeah, I could use a search API and apply some filters, but where is the fun in that? I would already be done, but what would I have learned? Nothing.
I had some time, so I worked on the search project. Prepare for a highly technical rant:
So I upgraded my hard drive to a 525 GB SSD (from a 320 GB, 7200 RPM drive) because I saw it being sold and figured I could use the speed. Reads and writes increased by 150%. The SSD is SATA 3 but my motherboard is SATA 2. The increase is noticeable, but not enough to do a full crawl; at least I can now quickly open a directory with 30k or more files. I also deleted the entire cache and started re-crawling all the old stuff - I figured this would be a good time to start over and see if I could get rid of unused files.
More Crawling
I am going 8 levels deep this time around. I am now using the path cache to reduce the amount of double crawling that happens when I check a link several times; most websites have a redundant menu on every page, and that menu often has a million links pointing to the same places.
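The path cache check itself is nothing fancy - look the link up before fetching it. A rough sketch (the cache directory and key scheme are made up):

    <?php
    // Sketch: skip links that are already in the path cache so menu links
    // repeated on every page only get checked once.
    function already_crawled($url, $path_cache_dir) {
        $key = md5(strtolower(trim($url)));
        return file_exists($path_cache_dir . '/' . $key);
    }

    // inside the crawl loop:
    // if (already_crawled($link, 'path_cache')) { continue; }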
Roughly 22k unique links now. I added better detection of 404 links, and someone seems to have installed mod_security on my website, which caused other problems that I had to work around.
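The 404 detection is basically just checking the HTTP status code with cURL; a sketch of the idea rather than the exact code:

    <?php
    // Sketch: treat anything that is not a 2xx response as a dead link.
    // Some servers reject HEAD requests, so a fallback GET may be needed in practice.
    function link_is_dead($url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_NOBODY, true);          // HEAD request, skip the body
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects first
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        curl_exec($ch);
        $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);
        return $code < 200 || $code >= 300;
    }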
My search cache went from 2 GB to 6 GB in a matter of a day - I figured my deep crawl must have been doing better. But then I checked out the cache and realised that I had stumbled across a new type of unnecessarily HUGE HTML5 webpage: 2 websites had a charting engine which sent all the chart data down in the client HTML - the biggest chart was 37 MB, and the total was approximately 4 GB of chart data downloaded into my cache. The next culprit was visitjamaica, who decided to embed an SVG image inside EVERY PAGE, making each and every page at least 500 KB - EVERY PAGE!
I need to put in a limit on how many links I will accept from one web page. Some pages have way too many links and repeat them constantly.
I have also discovered new ways in which some WordPress websites get spammed: hidden divs full of spam links are injected into each page where only robots will see them. Once I discover such a page I have no choice but to block the entire website until I have a way of detecting this programmatically. If I have the time I might post a list of hacked websites so that they can be fixed.
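One possible programmatic check - just an idea, not something that is in the crawler yet - is to flag links sitting inside elements hidden with inline styles:

    <?php
    // Sketch: count links whose ancestors are hidden with inline styles.
    // Crude on purpose; spam hidden via CSS classes would need more work.
    function count_hidden_links($html) {
        $doc = new DOMDocument();
        @$doc->loadHTML($html);   // suppress warnings from messy real-world HTML
        $hidden = 0;
        foreach ($doc->getElementsByTagName('a') as $a) {
            for ($node = $a; $node !== null; $node = $node->parentNode) {
                if ($node instanceof DOMElement) {
                    $style = str_replace(' ', '', strtolower($node->getAttribute('style')));
                    if (strpos($style, 'display:none') !== false ||
                        strpos($style, 'visibility:hidden') !== false) {
                        $hidden++;
                        break;
                    }
                }
            }
        }
        return $hidden;   // a suspiciously high count suggests a hacked page
    }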
Popup menus on a webpage add serious bulk, especially if you have a country listing. It seems cool to the user, but it's better to just have a country page with a single list of country links rather than have the country list embedded into EVERY PAGE on the website. One website had 240+ country links in a hidden popup menu on EVERY PAGE - the site seems light at first look, but under the hood it's a monster. I will be forced to implement a limit on how many links I retrieve from an individual page, but even if I set it to 100, the first set of links will always be menu items.
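A plain cap will not help much on its own since the menu links come first in the HTML, so one idea (a sketch, not something implemented yet) is to drop links already seen elsewhere on the same site before applying the cap:

    <?php
    // Sketch: pull links from a page, skip ones already seen on the same site
    // (the repeated menu), then cap whatever is left. Names are made up.
    function page_links($html, array $seen_on_site, $limit = 100) {
        $doc = new DOMDocument();
        @$doc->loadHTML($html);
        $links = array();
        foreach ($doc->getElementsByTagName('a') as $a) {
            $href = trim($a->getAttribute('href'));
            if ($href === '' || isset($seen_on_site[$href])) {
                continue;   // empty hrefs and site-wide menu links
            }
            $links[$href] = true;
            if (count($links) >= $limit) {
                break;
            }
        }
        return array_keys($links);
    }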
The current index has 39918 links from the last re-index. Total cache is 3.91 GB. Ignoring 167 unique phrases/websites. Crawling 8 levels deep. Sometimes I leave my computer running overnight only to realise in the morning that there was a bug in the code that I have to fix, and then I have to start the process over again. Luckily, most of this re-crawling is local to my disc.
As I finalized my changes and watched my robot crawl, I thought to myself that building a search engine is hard but not impossible, despite the large number of terrible websites out there. You really need a whole lotta storage and computers. Not one really fast computer, but several computers working in parallel. 5000 computers is about right. You would need a computer for every letter of the alphabet, plus more for words and phrases.
I eventually decided to remove visitjamaica from the index because it is a terribly bloated website.
Discovered a bunch of orphaned and ignored links and got the list down to 27837. I am also trying to put descriptions under the links, which requires figuring out which parts of the page are actually useful versus advertisements and menus.
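My current thinking for the descriptions - a sketch, not finished code - is to grab the meta description when there is one and otherwise fall back to the longest paragraph, since menus and ads rarely live inside long <p> blocks:

    <?php
    // Sketch: pick a description for a link. Prefers the meta description,
    // otherwise falls back to the longest <p> on the page.
    function guess_description($html, $max_len = 160) {
        $doc = new DOMDocument();
        @$doc->loadHTML($html);
        foreach ($doc->getElementsByTagName('meta') as $meta) {
            if (strtolower($meta->getAttribute('name')) === 'description') {
                return substr(trim($meta->getAttribute('content')), 0, $max_len);
            }
        }
        $best = '';
        foreach ($doc->getElementsByTagName('p') as $p) {
            $text = trim($p->textContent);
            if (strlen($text) > strlen($best)) {
                $best = $text;   // longest paragraph is usually body text, not a menu
            }
        }
        return substr($best, 0, $max_len);
    }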
Conclusion
Anyway, search project version 1.5 should be up and running; check it out and do some searches. Note that searches will take much longer now (about 2 seconds) because the index is now 31 MB.
When I started on the 1.4 update I quickly realised that the index could not be updated in one go anymore, since I had increased the crawl depth to 4 levels. This would be a problem because if the crawl does not finish, there would be no index for the search engine to run on. I needed another cache. Enter the path cache. As I crawl from link to link I need to keep track of how far I am down the rabbit hole and where I have already been. Before, all I had to do was make sure I was not wasting time re-downloading HTML, but now even the cached HTML is too much for my little old machine. Here are the results of the new cache:
path_cache is 199 MB, 5,355 files
build_cache is 209 MB, 30,012 files
html_cache is 1.81 GB, 50,411 files
html_cache is filled by reading the websites' raw HTML. build_cache is created from reading the html cache and gathering some meta information which will later be used to build the search index. This new path_cache allows me to crawl deeper into websites while still being able to continue if I time out or crash. Originally I could go 3 links deep with only the build_cache and the html_cache, but now that I have added a 4th level I can no longer move through all the cache files quickly enough to build a search index. So the path_cache is sorta like leaving a trail of bread crumbs to grandma's house. Hopefully when I am finished with the trail I can build a new search index.
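In other words: raw HTML in, meta information out, index last, with the path cache recording where the crawl has reached so it can pick back up after a crash or timeout. Something like this in spirit - the directory names and record layout are made up, not the real files on disk:

    <?php
    // 1. html_cache:  one file per crawled URL holding the raw page.
    // 2. build_cache: one file per URL holding extracted meta info (title, links, words).
    // 3. path_cache:  one small record per URL with the depth it was found at,
    //    so a crashed crawl can resume instead of starting over.

    function record_path($url, $depth, $dir = 'path_cache') {
        file_put_contents($dir . '/' . md5($url), $depth . "\t" . $url);
    }

    function resume_frontier($max_depth, $dir = 'path_cache') {
        $frontier = array();
        foreach (glob($dir . '/*') as $file) {
            list($depth, $url) = explode("\t", file_get_contents($file), 2);
            if ((int) $depth < $max_depth) {
                $frontier[] = array('url' => $url, 'depth' => (int) $depth);
            }
        }
        return $frontier;   // links whose children may still need crawling
    }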
So the index finally updated, and the result is about a quarter the size of the previous index (4.2 MB), which means that I am probably missing a tonne of links or my index has become super efficient. I tend to lean towards the latter since I am coming down from a 15.7 MB index. Then again, I know that it should not be possible to have duplicate links, so I must be missing a bunch. I will have to test it to see, or build a comparison script.
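The comparison script does not exist yet; if I write it, it would be little more than diffing the two link lists. A sketch, assuming each file boils down to a serialized list of links (which is a guess at the format):

    <?php
    // Which links were in the old index but are missing from the new one?
    // File names are hypothetical.
    $old = unserialize(file_get_contents('links_old.dat'));
    $new = unserialize(file_get_contents('links_new.dat'));

    $missing = array_diff($old, $new);
    echo count($missing) . " links dropped out of the new index\n";
    foreach (array_slice($missing, 0, 20) as $link) {
        echo $link . "\n";   // print a sample to eyeball
    }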
I eventually fixed it. It turned out the levels that I had implemented in the new path_cache system were not deep enough to actually leave the root website. I have to be constantly checking to make sure I am not crawling in circles. The list cache is now bigger at 18.9 KB (most likely because of the dumb link structure of visitjamaica). Searching takes like 5 seconds on my personal computer.
I also added a new list of crawled links totalling 5474 links (331 KB).
The next task will be finding a way to divide up the search DB into smaller chunks for faster searching. Of course dividing will result in many more files and duplicate information everywhere, but that is the sacrifice that will have to be made in order to speed up the search.
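The simplest split I can think of - just a sketch of the idea, nothing is built yet - is one file per first letter of the search word, so a query only has to load one small chunk:

    <?php
    // Sketch: split the index into one serialized file per first letter.
    // The $index structure (word => links) is assumed, names are hypothetical.
    function shard_index(array $index, $dir = 'index_chunks') {
        $chunks = array();
        foreach ($index as $word => $links) {
            $letter = ctype_alpha($word[0]) ? strtolower($word[0]) : '_';
            $chunks[$letter][$word] = $links;
        }
        foreach ($chunks as $letter => $chunk) {
            file_put_contents($dir . '/' . $letter . '.dat', serialize($chunk));
        }
    }

    function search_word($word, $dir = 'index_chunks') {
        $letter = ctype_alpha($word[0]) ? strtolower($word[0]) : '_';
        $file   = $dir . '/' . $letter . '.dat';
        if (!file_exists($file)) {
            return array();
        }
        $chunk = unserialize(file_get_contents($file));
        return isset($chunk[$word]) ? $chunk[$word] : array();
    }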