
Simple Scraper or Bot Question - Pls Help!

Discussion in 'Black Hat SEO' started by Rich77ard, Feb 2, 2015.

  1. Rich77ard

    Rich77ard Registered Member Premium Member

    Joined:
    Mar 20, 2010
    Messages:
    66
    Likes Received:
    43
    Gender:
    Male
    Occupation:
    Web Design - SEO - Entrepreneur
    Location:
    Australia
I'm trying to scrape or harvest .com.au websites that have their robots.txt file set to block everything (Disallow: /).

    I know these sites still show up in Google's organic results, with "A description for this result is not available because of this site's robots.txt" shown in place of a snippet.

    Does anyone know a simple Scrapebox search query that could harvest these site domains, or do I need to create a bot to do this?

    I could probably run hundreds of URLs through Xenu or Screaming Frog and check for a blocked robots.txt that way, but that seems a bit backwards. I'm sure there's an easier way where I can just type 'keyword' plus 'the blocked-robots query' and harvest domains that way.

    I'm hoping I don't have to dig up my Ubot Studio and start from scratch. Thanks in advance.
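
    For anyone wanting to script this check rather than feed URLs through Xenu or Screaming Frog one by one, here's a minimal sketch using only Python's standard library robots.txt parser. The function names and the example domain are placeholders, not anything from this thread:

    ```python
    # Minimal sketch of the robots.txt check, using only Python's stdlib.
    # The example domain below is a placeholder, not a real target.
    from urllib import robotparser

    def is_fully_blocked(robots_lines):
        """True if these robots.txt lines forbid a generic crawler site-wide."""
        rp = robotparser.RobotFileParser()
        rp.parse(robots_lines)
        return not rp.can_fetch("*", "/")

    def site_blocked(domain):
        """Fetch http://<domain>/robots.txt and test for a site-wide block."""
        rp = robotparser.RobotFileParser(f"http://{domain}/robots.txt")
        rp.read()  # a missing or empty robots.txt means "allow everything"
        return not rp.can_fetch("*", f"http://{domain}/")

    # e.g. site_blocked("example.com.au")
    ```

    Point a loop at a harvested domain list and keep only the ones where this returns True.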
     
  2. seeplusplus

    seeplusplus Power Member

    Joined:
    Aug 18, 2008
    Messages:
    517
    Likes Received:
    165
    Difficult. I would see if there are any search engines around that don't respect the robots.txt file and harvest that search engine instead...?
     
  3. Rich77ard

    Rich77ard Registered Member Premium Member

    Hmm, the problem with that is I won't know whether a site has a blocked robots.txt file just by looking at the search results. With Google at least I can see that it's blocked in the results; the problem is I'd have to sort through hundreds of search results for a keyword before finding one that says "A description for this result is not available because of this site's robots.txt".
     
  4. innocent_kid

    innocent_kid Power Member

    Joined:
    Feb 9, 2010
    Messages:
    505
    Likes Received:
    124
    Well, you can try this footprint:
    intext:"description for this result is not available because of this site's robots.txt"
     
  5. Rich77ard

    Rich77ard Registered Member Premium Member

    It's very rough and not targeted enough. I get all sorts of mixed results, and 98% of the sites actually have a normal robots.txt file that isn't blocking the whole site.

    I'm trying to use this strategy to get new SEO clients. When you call up a business whose whole site is unintentionally blocked by its robots.txt file, it's an easy way to get your foot in the door and provide some immediate assistance, which can easily lead to a monthly SEO contract.
     
  6. innocent_kid

    innocent_kid Power Member

    Well, for that you'd probably need an automated bot that fetches each site's robots.txt and then scans it for a site-wide Disallow.
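
    A rough sketch of what such a bot could look like, assuming you already have a domain list (e.g. harvested with Scrapebox). The parsing is deliberately simplified (it ignores Allow lines, for instance) and the function names are illustrative, not from any real tool:

    ```python
    # Sketch of the "automated bot" idea: fetch each domain's robots.txt
    # and flag sites that block everything with "Disallow: /" for all agents.
    import urllib.error
    import urllib.request

    def blocks_everything(robots_txt):
        """True if robots.txt has "Disallow: /" under "User-agent: *".

        Simplified on purpose: comments are stripped, Allow lines are ignored.
        """
        in_star_group = False
        for raw in robots_txt.splitlines():
            line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
            if not line or ":" not in line:
                continue
            field, _, value = line.partition(":")
            field, value = field.strip().lower(), value.strip()
            if field == "user-agent":
                in_star_group = (value == "*")
            elif field == "disallow" and in_star_group and value == "/":
                return True
        return False

    def scan(domains):
        """Return the domains whose robots.txt blocks the whole site."""
        blocked = []
        for domain in domains:
            try:
                with urllib.request.urlopen(
                    f"http://{domain}/robots.txt", timeout=10
                ) as resp:
                    text = resp.read().decode("utf-8", errors="replace")
            except (urllib.error.URLError, OSError):
                continue  # unreachable, or no robots.txt at all: skip it
            if blocks_everything(text):
                blocked.append(domain)
        return blocked
    ```

    Feed `scan()` your harvested list and you get back only the prospects worth calling.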