A small guide to distributed crawling

Discussion in 'Programming' started by pasdoy, Feb 1, 2018.

  1. pasdoy

    pasdoy Senior Member

    Joined:
    Jul 17, 2008
    Messages:
    864
    Likes Received:
    275
    I know crawling interests a lot of people, so I decided to post a small example of how to scale it and what tech to use. Most threads I see, or tools in the BST section, highlight multi-threading as a feature and are proud of crawling 50,000 pages/hour, depending on your internet connection. But what if you want to crawl millions of pages per hour and gather stats about links, anchors, expired domains, etc.?

    To achieve that kind of scale, we have to work differently. Note that this setup does not crawl JavaScript-based websites built with Angular or React. This is not a step-by-step setup guide, mostly an overall architecture guide.


    1. Technology used
    - Multiple VPS, we probably all know about this
    - S3, AWS Simple Storage Service
    - EMR, AWS's on-demand Hadoop MapReduce service https://en.wikipedia.org/wiki/MapReduce
    - RabbitMQ (or AWS SQS), message brokers https://en.wikipedia.org/wiki/Message_broker
    - SQLite, simple relational database
    - Python, Golang

    Why Go? Because I've been using it for a while and it has good concurrency primitives. There is no JVM, and the compiled binary runs on any architecture.
    Why Python? Because of the library used to handle the clusters and jobs. Also, Python is a simple, straightforward language.

    The MapReduce is written in Python. Workers are written in Go.


    2. How they interact with each other

    [S3 with crawled logs] -> [dispatcher on EMR] -> [RabbitMQ with links to crawl] -> [HTTP GET / parser workers] -> [Produce logs] -> [S3]

    - S3 with crawled logs

    All logs from workers go into an S3 bucket. The key is s3://bucket/year/month/day/. This key scheme lets us select the last n days easily. There are 2 types of logged lines: SeenURL and CrawledURL. A SeenURL line records that we fetched and parsed a URL. CrawledURL lines record the links discovered on the crawled pages. These are written directly to a text file, which is why you need a very fast log library. I had to change it twice to find the fastest one, as writing logs can be CPU or I/O intensive, especially in Go. We compress the logs with bzip2, which Hadoop can read and split easily. Don't use GZip: Hadoop can't split gzip files across mappers.

    SeenURL message example
    Code:
    {"msg":"SeenURL","OriginalURL":"https://www.zappos.com/independence-day-clothing-co/Wg-wHvEEJNwDhCXkB5Ik1CTiAgEL.zso?s=recentSalesStyle/desc/","ResultURL":"https://www.zappos.com/independence-day-clothing-co/Wg-wHvEEJNwDhCXkB9QkkiTiAgEL.zso?s=recentSalesStyle/desc/","HTTPStatusCode":301}
    {"msg":"SeenURL","OriginalURL":"https://www.zappos.com/independence-day-clothing-co-women/Wgi9FIQjzQTUJMABAeABAeICAwsYHA.zso","ResultURL":"https://www.zappos.com/independence-day-clothing-co-women/Wgi9FNQkhCPNBMABAeABAeICAwsYHA.zso","HTTPStatusCode":301}
    
    CrawledURL message example
    Code:
    {"msg":"CrawledURL","OnPage":"https://www.zappos.com/independence-day-clothing-co-men-shirts-tops/CKvXARDL1wFaCO8LxQXBI9QkegLkBIIBA9vuBMABAuICBQECCxgP.zso","OriginalLink":"/men-hats/COfWARCJ1wHAAQLiAgMBAhg.zso","CleanedLink":"https://www.zappos.com/men-hats/COfWARCJ1wHAAQLiAgMBAhg.zso","LinkRel":"","Anchor":"Hats"}
    
    Now that we have logs of seen URLs and newly discovered ones, we need to dispatch the next batch of links to crawl into the queue. At the speed we crawl, we end up with terabytes of logs to parse. This is why we need MapReduce.
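    As a reference, here is a minimal sketch of how a worker can emit these two line types with Zap (the logger we ended up using, see the notes further down). The field names match the examples above; the log file path and encoder setup are illustrative, not the exact worker code.
    Code:
    package main

    import (
        "os"

        "go.uber.org/zap"
        "go.uber.org/zap/zapcore"
    )

    func newCrawlLogger(path string) (*zap.Logger, error) {
        f, err := os.OpenFile(path, os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0644)
        if err != nil {
            return nil, err
        }
        // Only MessageKey is set, so lines stay as compact as the examples above;
        // leaving the level/time keys empty drops those fields.
        enc := zapcore.NewJSONEncoder(zapcore.EncoderConfig{MessageKey: "msg"})
        return zap.New(zapcore.NewCore(enc, zapcore.AddSync(f), zapcore.InfoLevel)), nil
    }

    func main() {
        logger, err := newCrawlLogger("/var/log/crawler/crawl.log") // placeholder path
        if err != nil {
            panic(err)
        }
        defer logger.Sync()

        // One SeenURL line per fetched URL.
        logger.Info("SeenURL",
            zap.String("OriginalURL", "https://example.com/a"),
            zap.String("ResultURL", "https://example.com/b"),
            zap.Int("HTTPStatusCode", 301),
        )

        // One CrawledURL line per link discovered on a page.
        logger.Info("CrawledURL",
            zap.String("OnPage", "https://example.com/b"),
            zap.String("OriginalLink", "/hats"),
            zap.String("CleanedLink", "https://example.com/hats"),
            zap.String("LinkRel", ""),
            zap.String("Anchor", "Hats"),
        )
    }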


    - Dispatcher on EMR

    EMR is an AWS service that lets you spawn clusters to run Hadoop MapReduce jobs on demand. Combined with Spot instances, you get cheap compute power. EMR is now billed by the second instead of by the hour, so there is no need to pad your jobs to take a full hour.

    We use the mrjob library to handle everything, from creating the cluster to running the job and checking the output: https://github.com/Yelp/mrjob.

    The dispatcher makes sure we only queue links we haven't seen and are authorized to crawl. This is where SQLite comes into action. SQLite holds the robots.txt info we gather for each hostname. One VPS is dedicated to listening to the "robots" queue; it reads each hostname's robots.txt, if it exists, and saves it in the SQLite database. It uploads the SQLite database to S3 regularly so it can be used in the MapReduce. On every MapReduce run, the SQLite database is used to decide whether we can crawl specific URLs. If the hostname isn't in the database, we queue it so its robots.txt gets crawled. If it is, we queue the links we are allowed to crawl according to its robots.txt.

    Issues with the job:
    - The reducer uses the hostname as key. A hostname with millions of URLs takes much longer for its reducer process to work through, so the job can end up waiting on the last reducer while all the other processes sit idle.
    - Random connection issues with RabbitMQ would make the job fail.

    - RabbitMQ with links to crawl

    RabbitMQ is a message broker used to dispatch links among our VPS workers. If you have the budget, you can use SQS instead and skip the ops work entirely. The RabbitMQ approach requires an extra VPS with a good amount of RAM, 32-128 GB, to hold and dispatch our messages.

    We have 2 queues, robots and tocrawl.

    robots holds hostnames whose robots.txt needs to be fetched.
    tocrawl holds URLs that need to be crawled.
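    As a minimal sketch, declaring both queues with the streadway/amqp client looks like this; the broker URL and durability flags are assumptions, not the exact production setup.
    Code:
    package main

    import (
        "log"

        "github.com/streadway/amqp"
    )

    func main() {
        // Placeholder broker URL; point it at the RabbitMQ VPS.
        conn, err := amqp.Dial("amqp://guest:guest@rabbitmq-host:5672/")
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        ch, err := conn.Channel()
        if err != nil {
            log.Fatal(err)
        }
        defer ch.Close()

        // Durable queues so messages survive a broker restart.
        for _, name := range []string{"robots", "tocrawl"} {
            if _, err := ch.QueueDeclare(name, true, false, false, false, nil); err != nil {
                log.Fatal(err)
            }
        }
    }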


    - HTTP GET / parser workers

    Robots.txt
    One VPS is dedicated to listening to the "robots" queue. It fetches each hostname's robots.txt and saves it in the SQLite database. It regularly uploads the SQLite database to S3 so the MapReduce can use it.
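    A rough sketch of that robots worker loop, assuming streadway/amqp, mattn/go-sqlite3 and one hostname per message; the table and column names here are placeholders (the real schema is in the SQLite gist linked at the end) and the periodic S3 upload is left out.
    Code:
    package main

    import (
        "database/sql"
        "io"
        "log"
        "net/http"

        _ "github.com/mattn/go-sqlite3"
        "github.com/streadway/amqp"
    )

    func main() {
        db, err := sql.Open("sqlite3", "robots.db")
        if err != nil {
            log.Fatal(err)
        }
        // Placeholder schema; the real one is in the SQLite schema gist below.
        db.Exec(`CREATE TABLE IF NOT EXISTS robots (hostname TEXT PRIMARY KEY, body TEXT)`)

        conn, err := amqp.Dial("amqp://guest:guest@rabbitmq-host:5672/") // placeholder broker URL
        if err != nil {
            log.Fatal(err)
        }
        ch, err := conn.Channel()
        if err != nil {
            log.Fatal(err)
        }
        msgs, err := ch.Consume("robots", "", false, false, false, false, nil)
        if err != nil {
            log.Fatal(err)
        }

        for d := range msgs {
            host := string(d.Body) // assuming one hostname per message
            body := ""
            if resp, err := http.Get("http://" + host + "/robots.txt"); err == nil {
                b, _ := io.ReadAll(resp.Body)
                resp.Body.Close()
                if resp.StatusCode == 200 {
                    body = string(b)
                }
            }
            // Store even empty bodies so the dispatcher knows the hostname was checked.
            db.Exec(`INSERT OR REPLACE INTO robots (hostname, body) VALUES (?, ?)`, host, body)
            d.Ack(false)
        }
    }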

    Crawlers
    N VPSes all run our worker executable. They listen to the "tocrawl" queue, crawl the URLs and write the logs. We use logrotate https://linux.die.net/man/8/logrotate to rotate and upload the logs every 250 MB. Note that we compress with bzip2 at level 6, which gives a good trade-off between compression speed and archive size. The logs are periodically shipped to S3 this way, which keeps the filesystem from filling up.
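    For illustration, a logrotate stanza along these lines does the job. The log path and S3 bucket/key are placeholders; the 250 MB threshold and bzip2 level 6 match what we use. Shipping the compressed archives could just as well be done from a small cron instead of postrotate.
    Code:
    # Placeholder log path; rotate every 250 MB and compress with bzip2 -6.
    /var/log/crawler/*.log {
        size 250M
        rotate 20
        missingok
        notifempty
        compress
        compresscmd /usr/bin/bzip2
        compressext .bz2
        compressoptions -6
        postrotate
            # Ship already-compressed archives to the crawl-log bucket (placeholder path).
            aws s3 mv /var/log/crawler/ s3://bucket/$(date +%Y/%m/%d)/ --recursive --exclude "*" --include "*.bz2"
        endscript
    }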

    Notes
    - We catch 3xx redirects and don't follow them, so the dispatcher can first check whether we have already crawled the target URL. It also gives good stats on where X is redirected.
    - We use Zap instead of Logrus for logging speed. Logrus was our bottleneck at first, thanks to its use of reflection.
    - We use the Googlebot user-agent to get better data.
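    As a sketch, the redirect and user-agent handling from the notes above looks roughly like this in Go; the fetched URL is a placeholder.
    Code:
    package main

    import (
        "fmt"
        "log"
        "net/http"
    )

    // Client that surfaces 3xx responses instead of following them,
    // so the redirect target goes back through the normal dispatch path.
    var client = &http.Client{
        CheckRedirect: func(req *http.Request, via []*http.Request) error {
            return http.ErrUseLastResponse
        },
    }

    func fetch(url string) (*http.Response, error) {
        req, err := http.NewRequest("GET", url, nil)
        if err != nil {
            return nil, err
        }
        // Googlebot user-agent, as mentioned in the notes.
        req.Header.Set("User-Agent", "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
        return client.Do(req)
    }

    func main() {
        resp, err := fetch("https://example.com/") // placeholder URL
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()
        // A 301/302 shows up here directly; the Location header holds the target.
        fmt.Println(resp.StatusCode, resp.Header.Get("Location"))
    }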


    - Get the machine started
    Once you have all your VPSes up, you need to insert some seed links into the "tocrawl" queue. It's best to use different domains. The workers will crawl those links and log the new ones they discover. On your first MapReduce run, all the robots.txt entries will be missing. Fear not, the MapReduce will send the hostnames to the "robots" queue. No new links will be queued on that run, since they were all missing a robots.txt entry, but on the next run they will be. At first you will find yourself running the job manually fairly often because there aren't a lot of links yet. Soon you will see 100 million+ queued URLs.
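    Seeding can be as simple as publishing a handful of URLs from different domains to the tocrawl queue. A minimal sketch with streadway/amqp, assuming one plain URL per message (the broker URL and seed list are placeholders):
    Code:
    package main

    import (
        "log"

        "github.com/streadway/amqp"
    )

    func main() {
        conn, err := amqp.Dial("amqp://guest:guest@rabbitmq-host:5672/") // placeholder broker URL
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        ch, err := conn.Channel()
        if err != nil {
            log.Fatal(err)
        }
        defer ch.Close()

        // A few seeds from different domains work best; placeholder list.
        seeds := []string{
            "https://www.zappos.com/",
            "https://en.wikipedia.org/",
            "https://www.reddit.com/",
        }
        for _, u := range seeds {
            err := ch.Publish("", "tocrawl", false, false, amqp.Publishing{
                ContentType: "text/plain",
                Body:        []byte(u),
            })
            if err != nil {
                log.Fatal(err)
            }
        }
    }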


    3. More info.
    - I tested this in production for 2 weeks. I ran out of dollars to continue, since it was done only to post here.
    - I used AWS Lightsail instances. I created a base snapshot and would replicate it when needed. AWS throttled my CPU since I was consuming too much of it (Lightsail instances are burstable). Be vigilant.
    - RabbitMQ has a neat admin view; you should enable it to track message counts and find dead workers, if any.
    - The MapReduce was a cron job run every 12 hours. This could change depending on how long you crawl. It would run for anywhere from 5 minutes early on to 115 minutes by the end of the 2 weeks. I was using a 1 master and 10 slaves setup. You could speed this up with more slaves.
    - You could cache all pages in S3 if you have the need and money for it.
    - The SQLite database is also used to ban hostnames. When a hostname is banned, it is simply ignored during dispatching. Maybe you don't want to crawl Amazon or Google.
    - Some links are crawled twice because of the log upload delay. It's OK for now.
    - There's a lot of devops needed to run this. I used Datadog to track everything.


    4. What to do from here
    - Add a pacing system; it would avoid hitting the same website too quickly.
    - Fix a bug in anchor parsing when the anchor content is an HTML tag instead of plain text.

    5. Enough talking, let the code speak:

    - Golang Crawler Worker
    https://gist.github.com/pasdoy/66421b4147c7dd15f4d15d7ab0c6a31a

    - Golang Robots.txt crawler
    https://gist.github.com/pasdoy/a6c791b9ca10ad4254804c28d9612ecc

    - mrjob dispatch job
    https://gist.github.com/pasdoy/241720d0a2f68cc711820ba6f3ba5633

    - SQLite schema
    https://gist.github.com/pasdoy/ede350ca44430ef166e85703197fff39

    If you have related questions about distributed crawling, you can ask here. I tried to be concise; this could easily have ended up as a 5,000+ word essay.
     
    • Thanks Thanks x 9
  2. TheAlmightyDada

    TheAlmightyDada Jr. VIP Jr. VIP

    Joined:
    Jan 12, 2016
    Messages:
    362
    Likes Received:
    389
    Gender:
    Male
    Occupation:
    Business consultant
    Location:
    Sunny old England
    This looks pretty comprehensive, thanks a lot. Commenting now so I don't lose it as I'm on my phone.
     
  3. Cititechno

    Cititechno Jr. VIP Jr. VIP

    Joined:
    Oct 1, 2015
    Messages:
    183
    Likes Received:
    45
    Occupation:
    Potato
    Location:
    London, UK
    Home Page:
    awesome guide!
     
  4. pasdoy

    pasdoy Senior Member

    Joined:
    Jul 17, 2008
    Messages:
    864
    Likes Received:
    275
    Thanks, let me know if you have questions on a specific section.
     
  5. jamie3000

    jamie3000 Elite Member Premium Member

    Joined:
    Jun 30, 2014
    Messages:
    2,064
    Likes Received:
    953
    Occupation:
    Owner of BigGuestPosting.com
    Location:
    uk
    Home Page:
    Excellent share mate. This will go over the heads of 99% of the people on BHW though. Still, it's nice to see someone contributing real technical expertise and not just "methods".
     
    • Thanks Thanks x 1
  6. ScrapeboxWorker

    ScrapeboxWorker Regular Member

    Joined:
    Jul 23, 2012
    Messages:
    489
    Likes Received:
    278
    Home Page:
  7. pasdoy

    pasdoy Senior Member

    Joined:
    Jul 17, 2008
    Messages:
    864
    Likes Received:
    275
    Nice lib! Personally I am not a Docker fan when a plain exe can do the job, though for sure it can be handy for scaling easily. I didn't know about colly. For the scope of this project I preferred to handle the crawling myself to be sure it's optimized for my needs: speed and CPU. For the sake of scaling I don't want the crawler to check robots.txt by itself. I'll check if I can just give it a request body and have it extract all the links with anchors. Thanks for contributing.
     
  8. XoC--

    XoC-- Jr. VIP Jr. VIP

    Joined:
    Mar 5, 2009
    Messages:
    218
    Likes Received:
    116
    Cool tutorial, appreciate the time it would take to test this.

    How much did this cost you to run for 2 weeks? This must have cost quite a lot running on AWS.
     
  9. ScrapeboxWorker

    ScrapeboxWorker Regular Member

    Joined:
    Jul 23, 2012
    Messages:
    489
    Likes Received:
    278
    Home Page:
    Yes but they can be easier to deploy, check https://aws.amazon.com/getting-started/tutorials/deploy-docker-containers/
     
  10. pasdoy

    pasdoy Senior Member

    Joined:
    Jul 17, 2008
    Messages:
    864
    Likes Received:
    275
    Not so bad for 2 weeks: $120 of Lightsail, $50 of S3 and ~$250 of EMR using spot instances, which reduced the EMR cost by ~70%.
     
    • Thanks Thanks x 1