Discussions at a higher level (programming)
written by owen, published: 2019-Jun-21, read 4066 times.
Create a New Topic / New Post.
written by owen, published: 2019-Jun-21, read 4066 times.
written by owen, published 2019-Jun-03, comment




I had some time so I worked on the search project. Prepare for a highly technical rant;
So I upgraded my hard drive to a 525gb SSD (from a 320 gb, 7200 rpm) because I saw it being sold and I figured I could use the speed. Reads and writes increased by 150%. The SDD is SATA3 but my motherboard is SATA2. The increase is noticeable but not enough to do a full crawl but at least I can quickly open a directory with 30k or more fies. I also deleted the entire cache and started re-crawling all the old stuff - I figured this would be a good time to start over and see if I could get rid of unused files.
I am going 8 levels deep this time around. I am now using the path cache to reduce the amount of double crawling that happens when I check a link several times because most websites have a redundant menu on every page - often this menu has a million links point to the same places.
Roughly 22k unique links now. I added better detection of 404 links and someone seems to have installed Mod_security on my website which causes other problems which I had to work around.
My search cache went from 2gb to 6gb in the matter of a day - I figured my deep crawl must have been doing better. But then I checked out the cache and realised that I had stumbled across a new type of unecessarily HUGE HTML5 webpage; 2 websites had a chart engine which sent all the data to the client html - the biggest chart was 37mb - total was approx 4 gigs of chart data downloaded into my cache. The next culprit was visitjamaica who decided that they would embed a SVG image inside EVERY PAGE making each and every page at least 500kb - EVERY PAGE!
I need to put in a limit on how many links I will accept from one web page. Some pages have way too many links and repeat them constantly.
I have also discovered new ways in which some wordpress websites are spammed by using hidden divs full of spam links onto each page that only robots will see. Once I discover the page I have no choice but to block the entire website until I have a way of detecting this programmatically. If I have to time I might post a list of hacked websites so that they can be fixed.
Popup menus on a webpage add serious bulk. Especially if you have a country listing. It seems cool to the user but its better to just have a country page with a single list of country links rather than have the country list embeded into EVERY PAGE on the website. One website had 240+ country links in a hidden popup menu on EVERY PAGE - the site seems light on first look but under the hood its a monster. I will be forced to implement a limit on how many links I retrieve from an individual page but even if I set it to 100 the first set of links will always be menu items.
Links to large none-html files with no extension; example 1, example 2, example 3.
Current index 39918 links from the last re-index. Total cache is 3.91 gigs. Ignoring 167 unique phrases/websites. Crawling 8 levels deep. Sometimes I leave my computer running over night only to realise in the morning that there was a bug in the code that I have to fix and start the process out again. Most of this re-crawling is luckly local to my disc.
As I finalized my changes and watched my robot crawl I thought to myself that building a search engine is hard but not impossible. You really need a whole lotta storage and computers. Not a really fast computer but serveral computers working in parralell. 5000 computers is about right. You would need a computer for every letter of the alphabet, plus words and phrases. despite the large amount of terrible websites out there.
I eventually decided to remove visitjamaica from the index because it is a terrible terribly bloated website.
Discovered a bunch of ophaned and ignored links and got the list down to 27837. I am also trying to put descriptions under the links which requires figuring out which parts of the page are actually useful versus advertisements and menus.
Anyway search project version 1.5 should be up and running check it out and do some searches. Note that searches will take much longer now (2secs) because the index is now 31mb.
written by owen, published: 2019-May-27, read 3724 times.
written by owen, published 2019-May-23, comment
Controlling your own emotions is one of those things which let you live from day to day. It becomes more and more difficult the more factors that are involved. Hell is other people. You cannot have it all and all that you have is not a substitute for what you will become in the future. You are not in control of your future or the rain or the stars in the night sky.
I have been spending money at a rapid pace in the past couple months and yet still more to spend. Like trees in the rain forest still more to cut down. Never been upset at the things you survive - they are lessons that you should learn. Monkeys fall out of trees a lot before they learn to fly. Unless...
I need to finish up some drafts that I have waiting but it takes so long to put new words to paper that sometimes I leave them to simmer in my inbox hoping that I get a clearer picture before I drop it on deaf ears. The things your survive.
written by owen, published 2019-May-17, comment
I prefer hatchbacks over most other types of cars; easier to park, light but not too light. Cars are always moving towards a state of chaos. The newer hatchbacks seem to be morphing into liftback sedans. Pictured is a 2009 Honda Civic made in England which is a unusual 5dr car made to look like a 2 door. True hatchbacks nowadays come in the form of small SUVs or tiny city cars which have not room in the back for anything but one suitcase.


