
How To Build Your Own URL Harvester

Discussion in 'Black Hat SEO' started by smokemeoutdawg, Jan 17, 2014.

  1. smokemeoutdawg

    smokemeoutdawg Newbie

    Joined:
    Dec 14, 2013
    Messages:
    46
    Likes Received:
    12
    I'm trying to build my own URL harvester because Scrapebox and every other silly program harvests duplicates upon duplicates, and a 1,000,000-URL harvest turns into 78k domains if you're lucky.

    What I'm trying to do is mainly for Xrumer. I want to be able to grab the millions of "powered by vbulletin", "powered by phpbb", and "powered by smf" forums.

    If you had to do this, what would the easiest way of doing it be? Harvesting with proxies is slow, so I thought about just having some sort of script created that would google "powered by vbulletin", grab the 10 URLs on page 1, put them into one .txt, and move on to page 2 --> repeat for page 3 --> repeat for page 4.

    I am just not sure how to create this. If anyone would be willing to help, I can do a big Xrumer blast for both of us!! Add me on skype (username ryderstormhf) or PM if you can help.

    I ranked my site on page 1 of Google for "penis enlargement pills" using Xrumer alone on a 1 Gbps server; my problem is I can't get enough URLs to blast more! lol
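
    For what it's worth, a minimal sketch of the page-by-page loop described above could look like the following Python, using the requests library. The search URL, the start= paging parameter, and the /url?q= link extraction are assumptions about how Google result pages have commonly been structured; the markup changes, and unproxied scraping gets blocked quickly, so treat this as an illustration rather than a working harvester.

    Code:
    # Hypothetical sketch of the "google a footprint, save the URLs, move to the next page" loop.
    # Assumptions: Google accepts a start= paging parameter and result links show up as
    # /url?q=... hrefs in the HTML. Both change over time, and heavy use gets an IP banned.
    import re
    import time
    from urllib.parse import urlparse

    import requests

    FOOTPRINT = '"powered by vbulletin"'
    HEADERS = {"User-Agent": "Mozilla/5.0"}  # a non-browser UA is usually rejected outright

    seen_domains = set()
    with open("harvested.txt", "w") as out:
        for page in range(5):                                   # pages 1..5
            params = {"q": FOOTPRINT, "start": page * 10}
            resp = requests.get("https://www.google.com/search",
                                params=params, headers=HEADERS, timeout=30)
            # Pull target URLs out of the /url?q=... redirect links.
            for url in re.findall(r'/url\?q=(https?://[^&"]+)', resp.text):
                domain = urlparse(url).netloc
                if domain not in seen_domains:                  # dedupe by domain as we go
                    seen_domains.add(domain)
                    out.write(url + "\n")
            time.sleep(3)                                       # pause a few seconds per page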
     
  2. tb303

    tb303 Power Member

    Joined:
    Dec 18, 2011
    Messages:
    601
    Likes Received:
    280
    Sounds like a job for zennoposter.
     
  3. roberteb

    roberteb Regular Member

    Joined:
    Oct 30, 2010
    Messages:
    402
    Likes Received:
    120
    Location:
    UK
    I don't understand why you'd want to reinvent the wheel. Scrapebox does exactly what you describe, so why write a script to do it? Duplicate URLs are unavoidable, and how many you get will depend on the similarity of the words/phrases you're scraping with. Without proxies you'll be IP banned in 5 minutes, so again I don't get what it is you're trying to achieve.
     
  4. smokemeoutdawg

    smokemeoutdawg Newbie

    Joined:
    Dec 14, 2013
    Messages:
    46
    Likes Received:
    12
    If you went to Google and typed in "powered by vbulletin"

    Would you get duplicates if you harvested pages 1, 2, 3, 4, 5?

    No, you wouldn't.

    That's what I'm trying to achieve. No duplicates.

    Remember the Facebook movie where that Mark Zuckerberg dude busts out the Perl script to automate the download process?

    Something like that is what I'm trying to achieve man lol.

    Just download URLs into a .txt, pause for a couple of seconds, hit up the next page; you shouldn't get IP banned that way.
     
  5. SionAndes

    SionAndes Junior Member

    Joined:
    Oct 1, 2012
    Messages:
    110
    Likes Received:
    31
    Yes, I totally agree with roberteb. Scrapebox is just awesome; you can remove duplicate URLs as well as duplicate domains. I have ZennoPoster Pro and Scrapebox too. I've never thought of building a URL harvester script, even though I could do it easily and quickly in Zenno, and I won't unless I sell my Scrapebox license to someone.
     
  6. sirgold

    sirgold Supreme Member

    Joined:
    Jun 25, 2010
    Messages:
    1,260
    Likes Received:
    645
    Occupation:
    Busy proving the Pareto principle right
    Location:
    A hot one
    If you can code and know a little Python, https://github.com/scrapy/scrapy is an awesome handler for most conceivable use-cases...

    On top of that, from SO:


    • Scrapy crawling is faster than mechanize, since it uses asynchronous operations (on top of Twisted).
    • Scrapy has better and faster support for parsing (x)html on top of libxml2.
    • Scrapy is a mature framework with full unicode, redirection handling, gzipped responses, odd encodings, integrated http cache, etc.
    • Once you are into Scrapy, you can write a spider in less than 5 minutes that downloads images, creates thumbnails, and exports the extracted data directly to CSV or JSON (see the sketch below).
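
    For illustration, a bare-bones spider in that spirit might look like this. The start URL and the CSS selectors are hypothetical placeholders, not anything Scrapy ships with; you would swap them for the pages and link patterns you actually crawl.

    Code:
    # Minimal Scrapy spider sketch. The start URL and selectors are placeholder
    # assumptions; point them at whatever pages you actually want to crawl.
    import scrapy


    class ForumLinkSpider(scrapy.Spider):
        name = "forum_links"
        start_urls = ["https://example.com/some-seed-page"]

        def parse(self, response):
            # Emit every absolute link on the page as an item (exportable to CSV/JSON).
            for href in response.css("a::attr(href)").getall():
                yield {"url": response.urljoin(href)}

            # Follow a "next page" link if there is one (the selector is an assumption).
            next_page = response.css("a.next::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)

    Running it with "scrapy runspider forum_links.py -o urls.csv" writes every collected URL to a CSV you can dedupe afterwards.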
     
    • Thanks x 1
  7. satyr85

    satyr85 Power Member

    Joined:
    Aug 7, 2011
    Messages:
    580
    Likes Received:
    444
    Location:
    Poland
    OP, you are so wrong...

    It's not a problem with Scrapebox or any other harvester; it's a problem with Google results. Harvesters only harvest Google, and there is no way to get 100% unique domains from Google every time you harvest for "powered by xxx" + some keywords.

    1. Scraping without proxies is not possible; Google will ban your IP.
    2. You say harvesting with proxies is slow... I harvest with GScraper, and with good proxies (I harvest them myself) I get 300k links per minute, including duplicates.
    3. Footprints. Not every phpBB site has "powered by phpbb" and not every vBulletin site has "powered by vbulletin" - these footprints are not perfect: you will harvest tons of sites, but many of them will not be forums. Create better, self-made footprints.

    P.S.
    When it comes to scraping, Scrapebox is slow, not awesome. On the same proxies, Scrapebox does at most 300 URLs per second (18k URLs per minute); GScraper does 200-300k URLs per minute.
     
  8. yayoyayo

    yayoyayo Junior Member

    Joined:
    Jul 15, 2008
    Messages:
    188
    Likes Received:
    224
    You are naive, and you don't have any experience in Google listing harvesting and so on.
    Google's listing for the key phrase "powered by vbulletin" is limited to a certain number of pages. So how are you going to harvest different "powered by vbulletin" forums? Yes, you will add additional words, and that's why you will always get duplicates.
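
    That is the crux: harvesters multiply a footprint against a keyword list, each combined query returns a different but overlapping slice of the index, and the overlap is exactly where the duplicates come from. A rough sketch of that query generation, with purely illustrative footprints and keywords:

    Code:
    # Sketch of footprint x keyword query generation. The lists are illustrative;
    # real runs use thousands of keywords, and the overlap between what each query
    # returns is where the duplicate domains come from.
    from itertools import product

    footprints = ['"powered by vbulletin"', '"powered by phpbb"', '"powered by smf"']
    keywords = ["fishing", "guitar", "fitness"]

    queries = [f"{fp} {kw}" for fp, kw in product(footprints, keywords)]
    for q in queries:
        print(q)  # e.g. "powered by vbulletin" fishing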

    Hrefer can already make pauses between queries.

    And most importantly, you think that you are smarter than others, but you're not, and that's your biggest mistake.
     
  9. divok

    divok Senior Member

    Joined:
    Jul 21, 2010
    Messages:
    1,015
    Likes Received:
    634
    Location:
    http://twitter.com/divok
    It is better to have those duplicates removed after you have scraped.
    Test it yourself: create a random dump of 1 million lines. Now add one more item to this data and make sure it is not a duplicate. Measure the time it takes to check and add the item to the list, then try it using hashes too.
    GScraper has a function to eliminate duplicate domains while scraping, but as the list grows the number of threads decreases and GScraper starts consuming more CPU and memory, which f#cks up my VPS.
    If you still want your own scraper, try Scrapy on Linux or use the requests library in Python. Best of luck with multithreading.
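
    A rough sketch of that timing test, assuming plain Python: membership checks against a million-entry list scan linearly, while a set (hash-based) answers in roughly constant time, which is why deduping with hashes stays cheap even on big lists.

    Code:
    # Timing sketch for the dedup test described above: check whether a new,
    # non-duplicate URL is already present among 1,000,000 existing entries,
    # first with a list (linear scan), then with a set (hash lookup).
    import time

    urls = [f"http://example-{i}.com/forum" for i in range(1_000_000)]
    url_set = set(urls)
    candidate = "http://brand-new-domain.com/forum"  # guaranteed not a duplicate

    t0 = time.perf_counter()
    _ = candidate in urls          # O(n): walks the whole list
    list_time = time.perf_counter() - t0

    t0 = time.perf_counter()
    _ = candidate in url_set       # O(1) average: a single hash lookup
    set_time = time.perf_counter() - t0

    print(f"list: {list_time:.4f}s   set: {set_time:.7f}s")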
     
  10. stugz

    stugz Junior Member

    Joined:
    Apr 14, 2013
    Messages:
    154
    Likes Received:
    33
    OP, Hrefer comes with Xrumer. Use it. Each keyword/phrase is limited to 1000 results, so to get more results you need to vary your queries. You also need better footprints than "Powered by...", which IMO gets your proxies blocked just a bit slower than inurl: queries do. I have a massive list of such footprints for many different platforms, and I'm sure many others have similar lists as well. You will have to study the platforms you want and extract footprints yourself.