
The Best URL Harvester

Discussion in 'Black Hat SEO' started by Proteus, Sep 21, 2010.

  1. Proteus

    Proteus Junior Member

    Joined:
    Sep 6, 2010
    Messages:
    109
    Likes Received:
    20
    Occupation:
    Web Design and Development
    Location:
    Earth
    Scrapebox does well, and there is always Hrefer. But when you're trying to harvest every blog and forum on the planet, is a scraper the best option? Or is there a spider that just runs off and keeps going and going?
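    Roughly what I mean by "keeps going and going" is a plain breadth-first spider, where every page harvested feeds new URLs back into the queue. A minimal Python sketch (stdlib only, no robots.txt handling or threading, purely for illustration):

    Code:
    import re
    import time
    import urllib.request
    from collections import deque
    from urllib.parse import urljoin, urlparse

    LINK_RE = re.compile(r'href=["\'](.*?)["\']', re.I)

    def crawl(seeds, max_pages=1000):
        # BFS over the link graph: each fetched page feeds new URLs back in
        seen, queue, harvested = set(seeds), deque(seeds), []
        while queue and len(harvested) < max_pages:
            url = queue.popleft()
            try:
                html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
            except Exception:
                continue  # dead link, timeout, non-HTML, etc.
            harvested.append(url)
            for href in LINK_RE.findall(html):
                link = urljoin(url, href)
                if urlparse(link).scheme in ("http", "https") and link not in seen:
                    seen.add(link)
                    queue.append(link)
            time.sleep(1)  # crude politeness delay
        return harvested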
     
  2. Proteus

    Proteus Junior Member

    Joined:
    Sep 6, 2010
    Messages:
    109
    Likes Received:
    20
    Occupation:
    Web Design and Development
    Location:
    Earth
    How about web crawlers, for instance?
     
  3. dummydecoy

    dummydecoy Junior Member

    Joined:
    Jul 4, 2010
    Messages:
    154
    Likes Received:
    39
    I can build one for you :D
    I have an IMDb scraper.
     
  4. Proteus

    Proteus Junior Member

    Joined:
    Sep 6, 2010
    Messages:
    109
    Likes Received:
    20
    Occupation:
    Web Design and Development
    Location:
    Earth
    I have read a lot about crawlers (internet bots). There are plenty of projects out there already, some of them open source.

    They are mostly used to build indexes for search engines.

    With a big enough DB you can slowly crawl the web yourself. The reason for going through Google and the other search engines instead is that they let the scraping tool filter results by a search term.

    The problem is that even if you use huge keyword lists to vary the Google results, you still end up with something like 90% dupes. Trust me, I know this all too well.

    I have tried everything I can think of. Any suggestions?
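    One angle on the dupe problem, as an illustrative sketch only: canonicalize every URL before comparing, so http/https, www/no-www, trailing slashes and tracking parameters all collapse to one key. The junk-parameter list below is made up; extend it for whatever the engines tack on:

    Code:
    from urllib.parse import urlparse, urlencode, parse_qsl

    JUNK_PARAMS = {"utm_source", "utm_medium", "sid", "sessionid", "ref"}  # hypothetical list

    def canonical(url):
        # Reduce a URL to a comparison key: lowercase host, no www.,
        # no trailing slash, query params sorted with junk stripped.
        p = urlparse(url.strip())
        host = p.netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        query = urlencode(sorted((k, v) for k, v in parse_qsl(p.query)
                                 if k not in JUNK_PARAMS))
        return host + p.path.rstrip("/") + ("?" + query if query else "")

    def dedupe(urls):
        seen, out = set(), []
        for u in urls:
            key = canonical(u)
            if key not in seen:
                seen.add(key)
                out.append(u)  # keep the first URL seen for each key
        return out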
     
  5. dannyhw

    dannyhw Senior Member

    Joined:
    Jul 16, 2008
    Messages:
    980
    Likes Received:
    462
    Occupation:
    Software Engineer
    Location:
    New York City Burbs
    I did a thing where I loaded the Alexa top 1,000,000 URLs into a table, used curl to fetch the content of each one, then ran my fingerprint check / form identification algorithm to copy matches into the proper tables. I don't have it anymore, but I could do it again, this time with extra URL variants like forum.domain.com, domain.com/forum, etc.
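    For reference, the fingerprint/identification step could look something like this. This is a sketch, not dannyhw's actual code; the markers are just common, publicly visible platform signatures:

    Code:
    import requests  # pip install requests

    FINGERPRINTS = {
        "wordpress": ("wp-content", "wp-comments-post.php"),
        "vbulletin": ("vbulletin_global.js",),
        "phpbb":     ("phpBB", "viewtopic.php"),
    }

    def identify(url):
        # Fetch the page and return the first platform whose marker appears,
        # so the caller can route the URL into that platform's table.
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            return None
        for platform, markers in FINGERPRINTS.items():
            if any(m in html for m in markers):
                return platform
        return None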