How to write a web crawler from scratch with Proxy support

Gazo

Newbie
Joined
Apr 12, 2018
Messages
11
Reaction score
22
Overview

Most Python web crawling/scraping tutorials use some kind of crawling library. This is great if you want to get things done quickly, but if you don't understand how scraping works under the hood, it will be difficult to know how to fix problems when they arise.

In this tutorial I will be going over how to write a web crawler completely from scratch in Python using only the Python Standard Library and the requests module (https://pypi.org/project/requests/2.7.0/). I will also be going over how you can use a proxy API (https://proxyorbit.com) to prevent your crawler from getting blacklisted.

This is mainly for educational purposes, but with a little attention and care this crawler can become as robust and useful as any scraper written using a library. Not only that, but it will most likely be lighter and more portable as well.

I am going to assume that you have a basic understanding of Python and programming in general. Understanding of how HTTP requests work and how Regular Expressions work will be needed to fully understand the code. I won't be going into deep detail on the implementation of each individual function. Instead, I will give high level overviews of how the code samples work and why certain things work the way they do.

The crawler that we'll be making in this tutorial will have the goal of "indexing the internet" similar to the way Google's crawlers work. Obviously we won't be able to index the internet, but the idea is that this crawler will follow links all over the internet and save those links somewhere as well as some information on the page.

Start Small

The first task is to set the groundwork of our scraper. We're going to use a class to house all our functions. We'll also need the re and requests modules, so we'll import them.

Code:
import requests
import re

class PyCrawler(object):
    def __init__(self, starting_url):
        self.starting_url = starting_url
        self.visited = set()

    def start(self):
        pass

if __name__ == "__main__":
    crawler = PyCrawler("https://google.com")
    crawler.start()

You can see that this is very simple to start. It's important to build these kinds of things incrementally. Code a little, test a little, etc.

We have two instance variables that will help us in our crawling endeavors later.

Code:
starting_url


Is the initial URL that our crawler will start from.

Code:
visited

This keeps track of the URLs we have already visited so we don't visit the same URL twice. Using a set() keeps visited URL lookups at O(1) time, making them very fast.
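
As a quick illustration (the URLs below are just placeholder values), duplicate URLs fall out naturally when you store them in a set:

Code:
visited = set()

for url in ["https://example.com", "https://example.com/about", "https://example.com"]:
    if url in visited:   # O(1) membership check
        continue         # skip URLs we've already seen
    visited.add(url)

print(visited)  # the duplicate URL only appears once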

Crawl Sites

Now we will get started actually writing the crawler. The code below will make a request to the starting_url and extract all links on the page. Then it will iterate over those new links and gather new links from the new pages. It will continue this recursive process until every link reachable from the starting point has been scraped. Some websites don't link outside of themselves, so those sites will be exhausted sooner than sites that do link out.

Code:
import requests
import re
from urllib.parse import urlparse

class PyCrawler(object):
    def __init__(self, starting_url):
        self.starting_url = starting_url
        self.visited = set()

    def get_html(self, url):
        # Fetch the raw HTML for a URL, returning an empty string on failure
        try:
            html = requests.get(url)
        except Exception as e:
            print(e)
            return ""
        return html.content.decode('latin-1')

    def get_links(self, url):
        # Pull every href out of the page and turn relative links into absolute ones
        html = self.get_html(url)
        parsed = urlparse(url)
        base = f"{parsed.scheme}://{parsed.netloc}"
        links = re.findall(r'''<a\s+(?:[^>]*?\s+)?href="([^"]*)"''', html)
        for i, link in enumerate(links):
            if not urlparse(link).netloc:
                link_with_base = base + link
                links[i] = link_with_base

        return set(filter(lambda x: 'mailto' not in x, links))

    def extract_info(self, url):
        # Placeholder for now; we'll pull page data out of the HTML later
        html = self.get_html(url)
        return None

    def crawl(self, url):
        # Recursively follow every link we haven't visited yet
        for link in self.get_links(url):
            if link in self.visited:
                continue
            print(link)
            self.visited.add(link)
            info = self.extract_info(link)
            self.crawl(link)

    def start(self):
        self.crawl(self.starting_url)

if __name__ == "__main__":
    crawler = PyCrawler("https://google.com")
    crawler.start()

As we can see, a fair bit of new code has been added.

To start, the get_html, get_links, crawl, and extract_info methods were added.

Code:
get_html()

Is used to get the HTML at the current link.

Code:
get_links()

Extracts links from the current page.

Code:
extract_info()

Will be used to extract specific info on the page.

the
Code:
crawl()
function has also been added, and it is probably the most important and complicated piece of this code. "crawl" works recursively. It starts at the starting_url, extracts links from that page, iterates over those links, and then feeds the links back into itself recursively.

If you think of the web like a series of doors and rooms, then essentially what this code is doing is looking for those doors and walking through them until it gets to a room with no doors. When this happens it works its way back to a room that has unexplored doors and enters that one. It does this forever until all doors accessible from the starting location have been accessed. This kind of process lends itself very nicely to recursive code.

If you run this script now as is, it will explore and print all the new URLs it finds, starting from google.com.

Extract Content

Now we will extract data from the pages. What this method (extract_info) does is largely based on what you are trying to do with your scraper. For the sake of this tutorial, all we are going to do is extract meta tag information if we can find it on the page.

Code:
import requests
import re
from urllib.parse import urlparse

class PyCrawler(object):
    def __init__(self, starting_url):
        self.starting_url = starting_url
        self.visited = set()

    def get_html(self, url):
        try:
            html = requests.get(url)
        except Exception as e:
            print(e)
            return ""
        return html.content.decode('latin-1')

    def get_links(self, url):
        html = self.get_html(url)
        parsed = urlparse(url)
        base = f"{parsed.scheme}://{parsed.netloc}"
        links = re.findall(r'''<a\s+(?:[^>]*?\s+)?href="([^"]*)"''', html)
        for i, link in enumerate(links):
            if not urlparse(link).netloc:
                link_with_base = base + link
                links[i] = link_with_base

        return set(filter(lambda x: 'mailto' not in x, links))

    def extract_info(self, url):
        html = self.get_html(url)
        meta = re.findall(r"""<meta .*?name=["'](.*?)['"].*?content=["'](.*?)['"].*?>""", html)
        return dict(meta)

    def crawl(self, url):
        for link in self.get_links(url):
            if link in self.visited:
                continue
            self.visited.add(link)
            info = self.extract_info(link)

            print(f"""Link: {link}
Description: {info.get('description')}
Keywords: {info.get('keywords')}
            """)

            self.crawl(link)

    def start(self):
        self.crawl(self.starting_url)

if __name__ == "__main__":
    crawler = PyCrawler("https://google.com")
    crawler.start()

Not much has changed here besides the new print formatting and the extract_info method.

The magic here is in the regular expression in the extract_info method. It searches the HTML for all meta tags that follow the format <meta name=X content=Y> and returns a Python dictionary of the format {X: Y}.

This information is then printed to the screen for every URL that is crawled.
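
If you want to see the regular expression in isolation, here is a quick sketch run against a made-up HTML snippet (the tags below are just example values):

Code:
import re

# Made-up HTML snippet used only to demonstrate the meta tag regular expression
html = '''
<meta name="description" content="A page about web crawlers">
<meta name="keywords" content="python, crawler, scraping">
'''

meta = re.findall(r"""<meta .*?name=["'](.*?)['"].*?content=["'](.*?)['"].*?>""", html)
print(dict(meta))
# {'description': 'A page about web crawlers', 'keywords': 'python, crawler, scraping'}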

Integrate Rotating Proxy API

One of the main problems with web crawling and web scraping is that sites will ban you if you make too many requests, don't use an acceptable user agent, and so on. One way to limit this is by using proxies and setting a different user agent for the crawler. Normally the proxy approach requires you to go out and purchase, or manually source, a list of proxies from somewhere else. A lot of the time these proxies don't even work or are incredibly slow, making web crawling much more difficult.

To avoid this problem we are going to be using what is called a "rotating proxy API". A rotating proxy API is an API that takes care of managing the proxies for us. All we have to do is make a request to their API endpoint and, boom, we get a new working proxy for our crawler. Integrating the service into our crawler will require no more than a few extra lines of Python.

The service we will be using is Proxy Orbit (https://proxyorbit.com). Full disclosure, I do own and run Proxy Orbit. The service specializes in creating proxy solutions for web crawling applications. The proxies are checked continually to make sure that only the best working proxies are in the pool.

Code:
import requests
import re
from urllib.parse import urlparse
import os

class PyCrawler(object):
    def __init__(self, starting_url):
        self.starting_url = starting_url
        self.visited = set()
        self.proxy_orbit_key = os.getenv("PROXY_ORBIT_TOKEN")
        self.user_agent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
        self.proxy_orbit_url = f"https://api.proxyorbit.com/v1/?token={self.proxy_orbit_key}&ssl=true&rtt=0.3&protocols=http&lastChecked=30"

    def get_html(self, url):
        try:
            # Grab a fresh proxy from Proxy Orbit, then fetch the page through it
            proxy_info = requests.get(self.proxy_orbit_url).json()
            proxy = proxy_info['curl']
            html = requests.get(url, headers={"User-Agent": self.user_agent}, proxies={"http": proxy, "https": proxy}, timeout=5)
        except Exception as e:
            print(e)
            return ""
        return html.content.decode('latin-1')

    def get_links(self, url):
        html = self.get_html(url)
        parsed = urlparse(url)
        base = f"{parsed.scheme}://{parsed.netloc}"
        links = re.findall(r'''<a\s+(?:[^>]*?\s+)?href="([^"]*)"''', html)
        for i, link in enumerate(links):
            if not urlparse(link).netloc:
                link_with_base = base + link
                links[i] = link_with_base

        return set(filter(lambda x: 'mailto' not in x, links))

    def extract_info(self, url):
        html = self.get_html(url)
        meta = re.findall(r"""<meta .*?name=["'](.*?)['"].*?content=["'](.*?)['"].*?>""", html)
        return dict(meta)

    def crawl(self, url):
        for link in self.get_links(url):
            if link in self.visited:
                continue
            self.visited.add(link)
            info = self.extract_info(link)

            print(f"""Link: {link}
Description: {info.get('description')}
Keywords: {info.get('keywords')}
            """)

            self.crawl(link)

    def start(self):
        self.crawl(self.starting_url)

if __name__ == "__main__":
    crawler = PyCrawler("https://google.com/")
    crawler.start()

As you can see, not much has really changed here. Three new instance variables were created: proxy_orbit_key, user_agent, and proxy_orbit_url.

proxy_orbit_key gets the Proxy Orbit API token from an environment variable named "PROXY_ORBIT_TOKEN".

user_agent sets the User-Agent of the crawler to a Chrome User-Agent string so requests look like they are coming from a real browser.

proxy_orbit_url is the Proxy Orbit API endpoint that we will be hitting. We will be filtering our results, requesting only HTTP proxies that support SSL and have been checked in the last 30 minutes.

In get_html, a new HTTP request is made to the Proxy Orbit API URL to get a random proxy, which is then passed into the requests call so the URL we are trying to crawl is fetched from behind a proxy.
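
If you want to confirm that your token works before running the full crawler, a quick standalone check (assuming the same endpoint and 'curl' response field used in get_html above) looks like this:

Code:
import os
import requests

# Standalone sanity check for the Proxy Orbit token.
# Assumes the same endpoint and response shape used in get_html above.
token = os.getenv("PROXY_ORBIT_TOKEN")
url = f"https://api.proxyorbit.com/v1/?token={token}&ssl=true&rtt=0.3&protocols=http&lastChecked=30"

proxy_info = requests.get(url).json()
print(proxy_info['curl'])  # the proxy value the crawler will pass to requests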

If all goes well then that's it! We should now have a real working web crawler that pulls data from web pages and supports rotating proxies.

If you have any questions, feel free to either comment below or send me a PM.
 

PIEGE

Newbie
Joined
Jul 4, 2019
Messages
35
Reaction score
7
Pay attention. Some nice work being done here.
 

soulSEO

Regular Member
Joined
May 17, 2019
Messages
403
Reaction score
140
First off, thanks for the nice tutorial!

I want to make a website that takes content from another site (not mine) and uploads it automatically every day.

Will this script work for my purpose?
 

nextlvlig

Regular Member
Joined
Jun 24, 2018
Messages
452
Reaction score
156
When I try to go to the proxy_orbit_url link, it says the token is invalid. Is this supposed to happen?
 

zaogord

BANNED
Joined
Jul 18, 2019
Messages
195
Reaction score
164
First off, thanks for the nice tutorial!

I want to make a website that takes content from another site (not mine) and uploads it automatically every day.

Will this script work for my purpose?

https://www.reddit.com/r/learnpython/comments/2mroka/easiest_way_to_get_python_script_to_run_on_webpage/

https://www.youtube.com/watch?v=ERMRVORGvZM
 

Gazo

Newbie
Joined
Apr 12, 2018
Messages
11
Reaction score
22
First off, thanks for the nice tutorial!

I want to make a website that takes content from another site (not mine) and uploads it automatically every day.

Will this script work for my purpose?

With a bit of tweaking this can do that, but not out of the box.
 

Gogol

Jr Vip
Jr. VIP
Joined
Sep 10, 2010
Messages
8,282
Reaction score
11,842
Website
LINKS-THAT-RANKS.shop
Good read. Bookmarked for later. Nice to see someone explaining stuff with complexity (i.e. big O). That is pretty important for scaling it up.

I'm sure I can pick up something from here.

Quick question (sorry, I didn't read past the first paragraph yet, you might already have described it): do you know of a way to use requests to create a JS-enabled crawler? Of course I can use Selenium, but that takes way more memory than requests does.
 

Gazo

Newbie
Joined
Apr 12, 2018
Messages
11
Reaction score
22
Good read. Bookmarked for later. Nice to see someone explaining stuff with complexity (i.e. big O). That is pretty important for scaling it up.

I'm sure I can pick up something from here.

Quick question (sorry, I didn't read past the first paragraph yet, you might already have described it): do you know of a way to use requests to create a JS-enabled crawler? Of course I can use Selenium, but that takes way more memory than requests does.

I did not describe it in the post, so good question. What I normally do in these cases is use the PhantomJS web driver in Selenium. PhantomJS is a lightweight headless browser that renders Javascript, so you get the best of both worlds: a lightweight scraper and the power of Selenium.
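
For reference, a minimal sketch of that approach (assuming Selenium 3.x or earlier, where the PhantomJS driver is still available, and that the phantomjs executable is on your PATH):

Code:
from selenium import webdriver

# Load a page with PhantomJS so JavaScript runs before we read the HTML
driver = webdriver.PhantomJS()
driver.get("https://google.com")
html = driver.page_source  # HTML after JavaScript has executed
driver.quit()

You could then feed that html into the same regex-based link extraction used above.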
 

Gazo

Newbie
Joined
Apr 12, 2018
Messages
11
Reaction score
22
Isn't crawling the web illegal?

It's not illegal. There is a protocol called robots.txt that literally tells your crawler which parts of a website it is allowed to crawl. Even Blackhat World has a robots.txt file (https://www.blackhatworld.com/robots.txt).

It only becomes illegal when you STEAL information from other websites. For example, if a website is hosting premium content and you make a crawler that scrapes that premium content and puts it on your website to be sold without permission, then that is illegal. But the act of scraping the web itself is not illegal. In some ways it is even encouraged, since it is how sites get ranked within search engines.
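
If you want your crawler to respect robots.txt, the standard library's urllib.robotparser makes the check a few lines. Here is a minimal sketch (the BHW URL is just the example mentioned above):

Code:
from urllib.robotparser import RobotFileParser

# Check a site's robots.txt before crawling one of its URLs
rp = RobotFileParser()
rp.set_url("https://www.blackhatworld.com/robots.txt")
rp.read()

print(rp.can_fetch("*", "https://www.blackhatworld.com/forums/"))  # True or False

can_fetch returns False for any path the site's robots.txt disallows for your user agent, so you can skip those links in the crawl loop.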
 

Gogol

Jr Vip
Jr. VIP
Joined
Sep 10, 2010
Messages
8,282
Reaction score
11,842
Website
LINKS-THAT-RANKS.shop
I did not describe it in the post, so good question. What I normally do in these cases is use the PhantomJS web driver in Selenium. PhantomJS is a lightweight headless browser that renders Javascript, so you get the best of both worlds: a lightweight scraper and the power of Selenium.
Gotta try it, really. I always use either the Chrome driver (if crawling) or the Firefox driver (if unit testing... I also use the Chrome driver in this case). Never really used PhantomJS yet. Thanks for the tip. :)
 