
Can Scrapebox find expired web2.0 on a specific website?

Discussion in 'Black Hat SEO Tools' started by rejoin14, Dec 22, 2016.

  1. rejoin14

    rejoin14 Regular Member

    Joined:
    Jul 4, 2016
    Messages:
    204
    Likes Received:
    35
    Hello,

    is it possible to scrape all DOMAIN.wordpress.com sites with Scrapebox?

    For example, if I use huffingtonpost as the domain, can I scrape all the whatever.wordpress.com sites it links to?

    Is something like this possible?

    I don't want to target any specific keyword; I just want all ....wordpress.com sites that huffingtonpost is linking to.
     
  2. bartosimpsonio

    bartosimpsonio Jr. VIP Premium Member

    Joined:
    Mar 21, 2013
    Messages:
    12,064
    Likes Received:
    10,836
    Occupation:
    WHEREZ MA
    Location:
    BITCOINS AT?
    Seems like a simple footprint job? Target links that match .wordpress.com, then filter them by unresponsive domains or 404 status codes.
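    A minimal sketch of that filter step in Python, assuming you've already exported the candidate blog URLs to a text file (the file name and the exact status codes that mark a blog as expired are assumptions to verify against a few known-expired blogs, not Scrapebox output):

    Code:
    import requests

    def looks_expired(url, timeout=10):
        # Treat connection failures (dead DNS, timeouts) as unresponsive.
        try:
            resp = requests.get(url, timeout=timeout)
        except requests.RequestException:
            return True
        # Deleted or missing blogs typically come back as 404 or 410.
        return resp.status_code in (404, 410)

    with open("wordpress_candidates.txt") as f:
        urls = [line.strip() for line in f if line.strip()]

    for url in urls:
        if looks_expired(url):
            print(url)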
     
  3. Peter Ngo

    Peter Ngo Jr. VIP

    Joined:
    Apr 23, 2013
    Messages:
    2,045
    Likes Received:
    1,637
    Occupation:
    I browse BHW for a living
    Location:
    The Internet
    Easy, but you will need quite a lot of keywords. I usually use the Furykyle 1B keyword list that I bought a few years back.

    1. You can use Scrapebox's crawling feature, but since HuffingtonPost has quite a large indexed base (>10m indexed pages), you won't be able to find all the links with the crawling feature (personal experience).

    My suggestion is to scrape huffingtonpost with a ton of keywords under these few footprints:
    "TheHuffingtonPost.com"
    "The Huffington Post"
    You can also use site:huffingtonpost.com, but as I've noticed, the scraping speed is a lot lower when using the "site" operator, though it is more precise.

    2. After you have a list of indexed huffingtonpost.com pages, dedupe them and remove URLs that contain /tags/, /facebook/, etc.; those won't bring you much juicy external OBL.

    3. Now that you have a clean list of indexed URLs from huffingtonpost.com, fire up the Scrapebox Link Extractor addon, import all the URLs, and extract the external links.

    4. Now you have a nice list of external links; go back to the Scrapebox main UI, trim all URLs to root, and remove duplicates.

    5. This is probably the simplest part: use the "remove all URLs that don't contain" feature and put in ".wordpress.com".

    That's pretty much it. It's actually more work than it appears, but this should get you started (a rough sketch of steps 2-5 in Python follows below).
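    For anyone curious about the mechanics outside Scrapebox, here is that rough sketch of steps 2-5. It assumes the scraped huffingtonpost.com pages were exported to indexed_urls.txt; the file names and the noise-path list are assumptions, and it's an illustration rather than a replacement for the Link Extractor addon.

    Code:
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen, Request

    NOISE = ("/tags/", "/facebook/")  # low-value paths named above

    class LinkCollector(HTMLParser):
        # Collects absolute hrefs from anchor tags on one page.
        def __init__(self, base):
            super().__init__()
            self.base = base
            self.links = set()

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href")
                if href:
                    self.links.add(urljoin(self.base, href))

    def external_links(url):
        # Fetch a page and return only links pointing off its host.
        req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
        html = urlopen(req, timeout=15).read().decode("utf-8", errors="replace")
        parser = LinkCollector(url)
        parser.feed(html)
        host = urlparse(url).netloc
        return {l for l in parser.links if urlparse(l).netloc not in ("", host)}

    # Step 2: dedupe and drop low-value paths.
    with open("indexed_urls.txt") as f:
        pages = {u.strip() for u in f
                 if u.strip() and not any(n in u for n in NOISE)}

    # Steps 3-5: extract external links, keep .wordpress.com, trim to root.
    roots = set()
    for page in pages:
        try:
            for link in external_links(page):
                parts = urlparse(link)
                if parts.netloc.endswith(".wordpress.com"):
                    roots.add(f"{parts.scheme}://{parts.netloc}/")  # trim to root
        except OSError:
            continue  # skip pages that fail to download

    print("\n".join(sorted(roots)))

    Pages that fail to download are simply skipped, which mirrors ignoring dead URLs in the addon.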
     
    • Thanks x 2
  4. loopline

    loopline Jr. VIP

    Joined:
    Jan 25, 2009
    Messages:
    3,726
    Likes Received:
    1,994
    Gender:
    Male
    Peter gave good direction. I think you could shorten it though: just scrape Google for

    site:huffingtonpost.com ".wordpress.com"

    Remove duplicate URLs

    Extract external links

    Remove URLs not containing .wordpress.com

    Then check which of the remaining blogs are unresponsive or return 404; those are your expired ones.
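    One note on scale: Google only serves a limited number of results for any single query, so for deeper coverage you can merge this footprint with Peter's keyword approach and run one query per keyword. A minimal sketch, assuming a plain-text keywords.txt (both file names are placeholders):

    Code:
    # One search query per keyword, written to a file you can
    # import into Scrapebox's keyword list.
    FOOTPRINT = 'site:huffingtonpost.com ".wordpress.com"'

    with open("keywords.txt") as f:
        keywords = [line.strip() for line in f if line.strip()]

    with open("queries.txt", "w") as out:
        for kw in keywords:
            out.write(f'{FOOTPRINT} "{kw}"\n')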
     
  5. Topiano

    Topiano Jr. VIP

    Joined:
    Dec 3, 2015
    Messages:
    671
    Likes Received:
    136
    Gender:
    Male
    Occupation:
    SEO
    Those two directions are great.

    That's one good thing I like about SB ... there are always 101 ways to get a problem solved.


    You could as well use


    site:domain.com intext:".wordpress.com"


    This should definitely do the magic too :)
     
  6. charliebrooker

    charliebrooker Jr. VIP

    Joined:
    Feb 16, 2014
    Messages:
    718
    Likes Received:
    279
    There are a lot of guides on how to do this using footprint searches; just Google "scrapebox expired web 2.0".