
How to find all the sites on the web with a certain footprint in their home page HTML

Discussion in 'General Programming Chat' started by londate, Mar 24, 2013.

  1. londate

    londate Newbie

    Joined:
    Mar 18, 2013
    Messages:
    40
    Likes Received:
    3
    This might be a noob question; if so, I apologize. I wondered if anyone could point me in the right direction here. I want to find all the sites anywhere on the web that have a certain footprint in their HTML (i.e. not necessarily visible on the front end of the site). The best thing I have found so far for this is the Page Scanner Addon for Scrapebox, but it needs me to feed it a list of URLs to work from first.

    Is there a tool that I can just give some keywords, and it will traverse Google, running searches for those keywords and saving a list of the sites in the SERPs that have the HTML footprint I specify?

    Thanks for any pointers
     
  2. jazzc

    jazzc Moderator Staff Member Moderator Jr. VIP

    Joined:
    Jan 27, 2009
    Messages:
    2,468
    Likes Received:
    10,154
    Short answer, no.

    Longer answer: there is a solution, but unless you know what Hadoop is and have a lot of money to cover the costs, see the short answer.
     
  3. madoctopus

    madoctopus Supreme Member

    Joined:
    Apr 4, 2010
    Messages:
    1,249
    Likes Received:
    3,498
    Occupation:
    Full time IM
    Nope. Unless the footprint is on the homepage, or is something you can figure out from the homepage, there's no reliable way. If it's on the homepage, you can just take the zone files for com/net/org, crawl all the domains (a few billion), and look for the footprint. If it's on a deep page, then you can only scrape SERPs, which is way too unreliable and annoying for my taste.
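
The homepage crawl described above could look roughly like the following. This is only a minimal sketch: it assumes the domain names have already been extracted from the zone files into a plain list, and the function names (`fetch_homepage`, `find_footprint`) are hypothetical, not part of any tool mentioned in this thread. A real crawl over that many domains would need concurrency, retries, and rate limiting, all of which are left out here.

```python
import urllib.request
from typing import Callable, Iterable, Iterator


def fetch_homepage(domain: str, timeout: float = 10.0) -> str:
    """Download the homepage HTML of a domain over plain HTTP (best effort)."""
    with urllib.request.urlopen(f"http://{domain}/", timeout=timeout) as resp:
        # Read at most ~1 MB; a homepage footprint should appear well before that.
        return resp.read(1_000_000).decode("utf-8", errors="replace")


def find_footprint(
    domains: Iterable[str],
    footprint: str,
    fetch: Callable[[str], str] = fetch_homepage,
) -> Iterator[str]:
    """Yield every domain whose homepage HTML contains the footprint.

    The match is a case-insensitive substring test on the raw HTML,
    so it also finds text that is not visible on the rendered page.
    """
    needle = footprint.lower()
    for domain in domains:
        try:
            html = fetch(domain)
        except Exception:
            # Dead, parked, or unreachable domains are simply skipped.
            continue
        if needle in html.lower():
            yield domain
```

Passing the fetch function in as a parameter keeps the matching logic separate from the network code, so the downloader can be swapped for something faster without touching the footprint check.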
     
  4. londate

    londate Newbie

    Joined:
    Mar 18, 2013
    Messages:
    40
    Likes Received:
    3
    Alright then, that settles it. Thanks, both, for saving me a lot of time investigating dead ends.
     
  5. theMagicNumber

    theMagicNumber Regular Member

    Joined:
    May 13, 2010
    Messages:
    345
    Likes Received:
    195
    According to Verisign there are 107M .com domains registered, which is not really that many.
    The only option is to use SERPs to scan the domains for the footprint, for example on Google:
    site:domain.com footprint
    It is not very reliable or accurate, but otherwise it will take years, unless you want to spend a lot of cash on hardware and bandwidth.
    I have a setup for scraping Google that does around 4M queries per day; perhaps we can JV?
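
To make the numbers in this post concrete, here is a small sketch that builds the `site:` query for each domain and estimates how long one pass would take at a given query rate. The helper names (`build_query`, `days_needed`) are made up for illustration, and the sketch assumes exactly one query per domain; it does not send anything to Google.

```python
import math


def build_query(domain: str, footprint: str) -> str:
    """Build a per-domain search query of the form 'site:domain.com footprint'."""
    return f"site:{domain} {footprint}"


def days_needed(total_domains: int, queries_per_day: int) -> int:
    """Whole days needed to run one query per domain at the given daily rate."""
    return math.ceil(total_domains / queries_per_day)
```

With the figures quoted above, `days_needed(107_000_000, 4_000_000)` comes to 27, so a single pass over the .com zone would take a little under four weeks, before accounting for failed or blocked queries.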