The Best URL Harvester

Discussion in 'Black Hat SEO' started by Proteus, Sep 21, 2010.

  1. Proteus

    Proteus Junior Member

    Joined:
    Sep 6, 2010
    Messages:
    109
    Likes Received:
    20
    Occupation:
    Web Design and Development
    Location:
    Earth
    Scrapebox does well, and there is always Hrefer. But when you're trying to harvest every blog and forum on the planet, is a scraper the best option? Or is there a spider that runs off and just keeps going and going?
     
  2. Proteus

    Proteus Junior Member

    How about web crawlers, for instance?
     
  3. dummydecoy

    dummydecoy Junior Member

    Joined:
    Jul 4, 2010
    Messages:
    154
    Likes Received:
    39
    I can build one for you :D
    I have an IMDb scraper.
     
  4. Proteus

    Proteus Junior Member

    I have read a lot about crawlers, or internet bots. There are a lot of projects already out there, some of them open source.

    They are used to index pages for search.

    With a big enough database you can slowly crawl the web. The reason for using Google and other search engines instead is that they let the scraping tool filter its results by a search term.

    The problem is that even if you use huge keyword lists to vary the search results in Google, you still end up with something like 90% dupes. Trust me, I know this all too well.
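
    For what it's worth, a lot of those dupes are the same page under slightly different URLs (www vs. no www, trailing slash, #fragment), or just many pages from the same site. Here is a minimal sketch of filtering them out, assuming the harvested URLs are absolute http/https URLs. The function names normalize_url and dedupe are made up for illustration, and the normalization rules are my own guess at what counts as "the same" URL, not anything Scrapebox or Hrefer does.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Return a canonical form so trivially different URLs compare equal.

    Assumed rules (illustrative only): lowercase scheme and host,
    drop a leading "www.", drop the trailing slash and the fragment.
    """
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    if netloc.startswith("www."):
        netloc = netloc[4:]
    # An empty path and "/" are treated as the same page.
    path = parts.path.rstrip("/") or "/"
    # Keep the query string, drop the fragment.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def dedupe(urls, by_host=False):
    """Keep the first occurrence of each URL (or of each host if by_host)."""
    seen = set()
    unique = []
    for url in urls:
        norm = normalize_url(url)
        key = urlsplit(norm).netloc if by_host else norm
        if key not in seen:
            seen.add(key)
            unique.append(norm)
    return unique


harvested = [
    "http://www.Example.com/blog/",
    "http://example.com/blog#comments",
    "http://example.com/forum",
]
print(dedupe(harvested))
print(dedupe(harvested, by_host=True))
```

    Passing by_host=True keeps one URL per site, which is probably closer to what matters when the goal is a list of distinct blogs and forums rather than distinct pages.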

    I have tried everything I can think of. Any suggestions?
     
  5. dannyhw

    dannyhw Senior Member

    Joined:
    Jul 16, 2008
    Messages:
    979
    Likes Received:
    468
    Occupation:
    Software Engineer
    Location:
    New York City Burbs
    I once loaded the Alexa top 1,000,000 URLs into a table, used curl to fetch and index their content, then ran my fingerprint check / form identification algorithm to copy each URL into the proper table if there was a match. I don't have it anymore, but I could do it again, this time with extra URLs like forum.domain.com, domain.com/forum, etc.
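
    Roughly, the two pure-logic pieces of that look like the sketch below: expanding each domain into candidate URLs, and checking already-fetched HTML against a fingerprint table. This is not the original code (that is gone). The names candidate_urls and classify, the subdomain/path guesses, and the marker strings are all assumptions for illustration. Fetching is left out, so html is assumed to be a string you already downloaded.

```python
# Hypothetical guesses at where a forum might live on a site.
SUBDOMAINS = ("forum", "forums")
PATHS = ("/forum", "/forums")

# Hypothetical fingerprint table: platform name -> marker substrings.
# Real fingerprints would need to be far more thorough than this.
FINGERPRINTS = {
    "wordpress": ("wp-content/", 'content="WordPress'),
    "phpbb": ("Powered by phpBB",),
    "vbulletin": ("Powered by vBulletin",),
}


def candidate_urls(domain):
    """Expand a bare domain into the home page plus likely forum URLs."""
    domain = domain.strip().lower()
    urls = [f"http://{domain}/"]
    urls += [f"http://{sub}.{domain}/" for sub in SUBDOMAINS]
    urls += [f"http://{domain}{path}" for path in PATHS]
    return urls


def classify(html):
    """Return the platforms whose markers appear in the page source."""
    page = html.lower()
    return [
        name
        for name, markers in FINGERPRINTS.items()
        if any(marker.lower() in page for marker in markers)
    ]


print(candidate_urls("Example.com"))
print(classify('<meta name="generator" content="WordPress 3.0">'))
```

    Each match from classify would decide which table the URL gets copied into. Anything that returns an empty list is simply skipped.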