
[Back to basics] Scraping Millions of Unique Domains With Scrapebox

Discussion in 'Black Hat SEO' started by donaldbeck, Jul 15, 2013.

  1. donaldbeck

    donaldbeck Power Member

    Joined:
    Dec 28, 2006
    Messages:
    585
    Likes Received:
    212
    Someone sent me a private message yesterday asking how to scrape hundreds of thousands or even millions of domains using Scrapebox. This is one of the most basic skills you need for SEO and internet marketing in general, so a lot of people will already be familiar with these ideas and won't learn anything new. But the message made me realize that there are probably a lot of people just starting out who need some help understanding the fundamentals of what we do here. That inspired me to write this post and provide a little help for people who are new to Scrapebox and other similar tools.

    You only need a few basic things to scrape a ton of URLs. First, Scrapebox or some other software that scrapes URLs. Second, a big keyword list; simply search BHW for "keyword list" and I'm sure you'll find what you're looking for. You want at least 10,000 keywords, and the more the better. Third, in most cases you will be using footprints, so you're going to need footprints too; I'm going to use GSA Search Engine Ranker's footprints for this example. And finally, you're going to need proxies. I suggest saving your sanity and buying one of the cheap public proxy or port-scanned proxy services available on BHW. These will provide you with thousands of proxies ready to use for a very affordable price.

    The first thing you need to do is take the footprints you want out of GSA Search Engine Ranker and get them ready to use with Scrapebox. Open GSA Search Engine Ranker and click the Options button at the top right.

    [IMG]

    Next go to Advanced, then click Tools, and then click Search Online for URLs.

    [IMG]

    Here you can select any of the footprints that GSA has. Simply click Add Predefined Footprints and select the type of target you want to scrape URLs for. I'm going to use the footprints from the article section, so I went to Article and then Add All From Article to get the footprints for every article platform GSA supports.

    [IMG]

    Copy all the footprints GSA gives you and open textmechanic.com. We are going to get these ready to be merged with our keywords in Scrapebox (as an aside, you could also use Scrapebox itself instead of textmechanic, but I think this way is easier to understand). Navigate to the Add Prefix/Suffix into Each Line page on textmechanic (textmechanic.com/Add-Prefix-Suffix-to-Text.html). Paste the footprints you just collected into the top box, and in the small box in the middle of the page that says "add this prefix to each line of text", enter %KW% followed by a space.

    [IMG]

    Put the results into a text file and save it. Scrapebox will replace the %KW% placeholder with each keyword in your list, combining every keyword with every footprint. In other words, you'll scrape every footprint with every keyword, which produces a lot of unique domains.
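    If you'd rather script the textmechanic step, the prefix-and-merge idea looks like this in Python. This is just a rough sketch to show what Scrapebox does with the placeholder; the footprints and keywords below are made-up examples, not actual GSA exports:

    ```python
    from itertools import product

    # Example footprints as they might come out of GSA (hypothetical entries)
    footprints = [
        '"powered by wordpress"',
        'inurl:"/blog/" "leave a comment"',
    ]

    # Prefix every footprint with the %KW% placeholder, the same
    # transformation the textmechanic step performs
    prefixed = ["%KW% " + fp for fp in footprints]

    # Scrapebox substitutes each keyword for %KW%, so the final query
    # list is every keyword crossed with every footprint
    keywords = ["dog training", "guitar lessons"]
    queries = [line.replace("%KW%", kw) for kw, line in product(keywords, prefixed)]

    for q in queries:
        print(q)
    ```

    With 10,000 keywords and a few hundred footprints you can see how the query count, and therefore the result count, gets into the millions fast.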

    Now you have your keywords, your footprints in the correct format, and your proxies, so you are ready to start scraping. The first thing you need to do is select Custom Footprint in the Scrapebox harvester section.

    [IMG]

    Next, paste your keyword list into the keywords box. Once your keywords are in, click the M above Custom Footprint and choose the text file you just saved with your footprints and the %KW% placeholder.

    [IMG]

    Now all of your keywords are combined with all of the footprints that you chose from GSA.

    Your keywords used to look like this:

    [IMG]

    Now they look like this:

    [IMG]

    All that's left to do is click Start Harvesting and wait. It doesn't matter if the harvest goes beyond 1 million results; just go into the Harvester_Sessions folder inside your Scrapebox folder and combine all the text files it created with the Scrapebox DupRemove addon. Depending on what you are doing, you will want to remove duplicate URLs or duplicate domains after you combine the files.
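    The combine-and-dedupe step can also be done outside the addon if you prefer. A minimal sketch, assuming one URL per line in the harvested files (the `dedupe` helper and file names are mine, not part of Scrapebox):

    ```python
    from urllib.parse import urlparse

    def dedupe(url_lines):
        """Return (sorted unique URLs, sorted unique domains) from harvested lines."""
        urls, domains = set(), set()
        for line in url_lines:
            url = line.strip()
            if not url:
                continue
            urls.add(url)
            # netloc is the domain part of the URL, e.g. "example.com"
            domains.add(urlparse(url).netloc.lower())
        return sorted(urls), sorted(domains)

    # To run it over a real harvest, feed it every session file, e.g.:
    # import glob
    # lines = []
    # for path in glob.glob("Harvester_Sessions/*.txt"):
    #     with open(path, encoding="utf-8", errors="ignore") as f:
    #         lines.extend(f)
    # unique_urls, unique_domains = dedupe(lines)
    ```

    Dedupe by domain when you only care about unique sites (e.g. building a target list); dedupe by URL when individual pages matter.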

    That's all there is to scraping hundreds of thousands or millions of URLs. You can apply this very simple process to any program and any type of footprint. Extremely simple stuff, but I'm sure it will help a few people.
     
    • Thanks Thanks x 8
  2. mikpel

    mikpel Regular Member

    Joined:
    Dec 1, 2012
    Messages:
    220
    Likes Received:
    116
    Although I've done quite a lot of SEO recently, I only got my hands on Scrapebox a few days ago.

    My question is: is it really necessary to put %KW% in front of every line? By default Scrapebox adds the keyword _after_ the footprint - same thing, right? :)
     
  3. memme

    memme Jr. VIP Jr. VIP Premium Member

    Joined:
    Sep 19, 2009
    Messages:
    1,169
    Likes Received:
    99
    Location:
    Blackhatlinks.com
    Home Page:
    You will need a LOT of proxies to harvest over a million URLs...

    I think you need more than 500 private proxies...
     
  4. XavierKishner

    XavierKishner Newbie

    Joined:
    Nov 4, 2013
    Messages:
    45
    Likes Received:
    14
    Thank you. I'll try right now.
     
  5. winosergio

    winosergio Regular Member

    Joined:
    Mar 26, 2012
    Messages:
    270
    Likes Received:
    82
    Location:
    In my dreams
    Thanks donaldbeck for that really good info.
     
  6. Absurtuk

    Absurtuk Regular Member

    Joined:
    Mar 1, 2013
    Messages:
    239
    Likes Received:
    31
    You're such a great man!
     
  7. goldendeli

    goldendeli Regular Member

    Joined:
    Nov 21, 2012
    Messages:
    313
    Likes Received:
    38
    Occupation:
    dollar bills
    I disagree - a decent mix of private and public proxies is enough to scrape millions of URLs. The hard part is working with those millions of URLs.
     
  8. zeetee

    zeetee Regular Member

    Joined:
    Mar 28, 2011
    Messages:
    299
    Likes Received:
    137
    Occupation:
    Home Based Online Business
    Location:
    Men are from Mars
    It's like a refresher course for the community. Well done.