Multiharvester - Free, general purpose content scraper

Discussion in 'Black Hat SEO Tools' started by lisper, Sep 15, 2012.

  1. lisper

    lisper Newbie

    Joined:
    Aug 23, 2012
    Messages:
    44
    Likes Received:
    24
    Occupation:
    Lead developer of some German research project
    Location:
    Currently Brussels, Belgium
    Hey there all,

    this is my second BHW release. Multiharvester is a general purpose scraping framework, similar to ScrapeBox but centered on pure content scraping rather than posting (more on this later). I'm releasing it for free here and would love to keep it free, and keep actively working on it, for as long as possible.

    A fair warning before we go on: I've only been at it for about 2 weeks now, so the project is very, very (very) young. Crashes, freezes and unoptimized performance are to be expected... So this is by no means a ScrapeBox replacement (at least not yet). However, if you'd like to try it out, be my guest.

    Firstly here are the requirements:
    -Multiharvester should run on any OS since it's written in Clojure, which compiles to JVM bytecode.
    -Since I'm using the JVM as the host platform a Java install is required (however, you should be prompted for one by the web installer that I'll provide shortly... It's best to just download the Java runtime though, googling "Get Java" should get you started)
    -The application is resource hungry (non optimized for now, didn't really have any time to run it through some filters sorry... I'll fix that in the next release though) so a somewhat okayish machine is required (My 6 year old macbook seemed to do fine, your mileage may vary...)

    Current problems:
    -The main problem right now is me not having a proper internet connection at the moment lol. I'm currently hunting for a flat here in Brussels (if some BHWers from Brussels wanna meet up and go for a pint, I'm game btw), so working out of coffee places and hotel lobbies is really not optimal. However, this should be fixed by next week at the latest, as hopefully I'll manage to find a flat here lol.

    -Potential freezes... Swing thread (thread in charge of the GUI) is prone to some freezes right now. This is easily fixable though, I'll just need to go through my code a little and fix those little lockups. Should be fixed by next release.

    -Potential memory leaks. A bit like issue nr. 2 really... Resource management is a little messed up right now. Will be fixed.

    -Poor proxy support. Public and private proxies "should" be working.. However, I didn't really have a chance to test it out much (once again, poor internet connection here...) and don't have any access to any private proxies (if someone could help me out with this I'd really appreciate it).

    Installing (Program has been tested on Mac only. However it should run anywhere. Let me know how it goes for you Win and nix users)(I'm assuming you have the java runtime installed already. Google it if you haven't):
    -Download the launcher from my server 50.112.250.78 / clojure / core [dot] jnlp (sorry about this, stupid spam filter...)
    -Double click to launch
    -It will download the app and ask you to accept the certificate. Do so.
    -An optional shortcut can be created (you'll be prompted for that)
    -Done. Happy scraping :)

    Here's how it should look once started up:
    screenshot1.png

    Features:
    -Google scraping
    -Time based scraping (past day, past week, past month, etc.)
    -Language based scraping (all supported Google lang options are included)
    -Domain specific scraping (all Google domains are included)
    -PR scraping
    -********/Nofollow filtering
    -Other filtering options ("Is alive?" checker, duplicate removal, domains only, string presence, string presence on site, etc.)
    -Exports to .txt/.csv
    -20 scraping threads are supported (I've only been using two at most though as I don't have any proxies available. However that was working fine)
    -150 misc operations threads (for checking pr, ******** and so on). My old mac seemed to have no troubles pushing 150 threads though.
    -Other little features
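
    For anyone curious what the filtering and export features above boil down to, here is a minimal sketch in Python (not the Clojure the tool is written in, and not Multiharvester's actual code). All function names are made up for illustration, and "string presence" is assumed to mean a case-insensitive substring match on the URL.

```python
import csv
import io
from urllib.parse import urlsplit


def remove_duplicates(items):
    """Drop repeated entries while keeping the first-seen order."""
    seen = set()
    unique = []
    for item in items:
        if item not in seen:
            seen.add(item)
            unique.append(item)
    return unique


def domains_only(urls):
    """Reduce each URL to its host name, then drop repeated hosts."""
    hosts = []
    for url in urls:
        host = urlsplit(url).hostname  # None when the URL has no scheme/host
        if host:
            hosts.append(host)
    return remove_duplicates(hosts)


def contains_string(urls, needle):
    """Keep only URLs containing the substring (case-insensitive)."""
    needle = needle.lower()
    return [url for url in urls if needle in url.lower()]


def export_csv(rows, out):
    """Write a one-column CSV (header 'url') to an open text file object."""
    writer = csv.writer(out)
    writer.writerow(["url"])
    for row in rows:
        writer.writerow([row])


if __name__ == "__main__":
    scraped = [
        "http://example.com/a",
        "http://example.com/a",
        "http://example.com/blog/post",
        "https://example.org/",
    ]
    print(remove_duplicates(scraped))
    print(domains_only(scraped))
    print(contains_string(scraped, "BLOG"))
    buffer = io.StringIO()
    export_csv(domains_only(scraped), buffer)
    print(buffer.getvalue())
```

    To write a real file instead of the in-memory buffer, open it with `open("results.csv", "w", newline="")` and pass that to `export_csv`.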

    Current goals:
    -Short term wise I'd like to fix all current issues first (which shouldn't take all that long) and then start working on a powerful, inbuilt proxy scraper/checker as I feel that this is an essential feature for any type of scraping.
    -Search engine wise, I'll add support for Bing, Yahoo, Yandex, Rambler, Ask, DuckDuckGo as well as Baidu shortly. I've got parts of the code ready, just need to run further tests.

    -Improving performance! Firstly, shrinking and optimizing the bytecode should help (current file size is 20 MB, which is a bit too big for my liking. Also keep in mind that the JVM is fairly slow when it comes to start up. Once it's running though, it should be fairly fast, even unoptimized). I'd like to push to around 200-250 scraping threads and around 400 misc threads if possible. That should be doable with the JVM by rewriting HTTP requests as asynchronous operations (so they become non-blocking). However, this will take me some releases and I'd like to stabilize the app with its current threading capabilities first.

    -Inclusion of a scripting engine. This is one of the big features. My plan is to design a simple scripting language that would allow you to add your own scraping resource. For example, if you'd like to scrape wikipedia pages, you should be able to do so without my involvement. This should be fairly trivial as I'm using Clojure (which is a lisp-1 dialect) which is known for its extensive "domain specific language" support. Work on this part of the application will begin as soon as everything else is stable enough.
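
    On the asynchronous-requests idea in the performance goal above: the general pattern is to keep many requests in flight on one thread and cap how many run at once. Below is a minimal sketch of that pattern in Python's asyncio, not the JVM code the post has in mind. `fake_fetch` is a hypothetical stand-in that only sleeps, so no real HTTP is performed, and `max_in_flight` is an assumed tuning knob.

```python
import asyncio


async def fake_fetch(url, delay=0.01):
    """Stand-in for a non-blocking HTTP request: sleeps, then returns a status."""
    await asyncio.sleep(delay)
    return url, 200


async def fetch_all(urls, max_in_flight=50, fetch=fake_fetch):
    """Run fetch over every URL with at most max_in_flight running at once."""
    limit = asyncio.Semaphore(max_in_flight)

    async def bounded(url):
        # Waiting on the semaphore does not block the thread; other
        # requests keep making progress while this one waits its turn.
        async with limit:
            return await fetch(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(url) for url in urls))


if __name__ == "__main__":
    pages = [f"http://example.com/page{i}" for i in range(200)]
    results = asyncio.run(fetch_all(pages))
    print(len(results), "responses")
```

    The point of the semaphore is that raising the number of concurrent requests no longer means raising the number of OS threads, which is the saving the post is after.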

    These are the goals for now. I'll probably set up a website next and include a little tutorial as well as a FAQ, to keep it all a little organized. I'll be adding a LOT of features in the future, so it would be good to have an overview.

    I'll also need to find some type of monetization method that does not include the actual user (that means you). I'd just like to keep this free and accessible.

    Anyhow, I hope this proves useful to some and happy scraping of course!
    I'm always open to suggestions and questions so don't hesitate to post here, I'll try to reply as soon as possible (as long as my connection is working here).

    P.S. VS of the main jar. Can't post the link so here's the # b5e6093b5e2ed96e4f4bbdcf39899ab79d61429403fea5299d9c0d5e6c47b477. Just use the search feature on VS.
     
    • Thanks x 8
    Last edited: Sep 15, 2012
  2. lisper

    Any comments or suggestions guys? Seems like a couple of people have downloaded the soft but haven't really commented on it...
     
  3. emaloy97

    emaloy97 Junior Member

    Joined:
    Jul 23, 2012
    Messages:
    199
    Likes Received:
    72
    I liked it. It'll be awesome once it's optimized!
     
    • Thanks x 1
  4. lisper

    Thanks. I should have a flat in about two days btw, so the updates will start flowing from then on haha.
     
  5. alaltaierii

    alaltaierii Supreme Member

    Joined:
    Jun 11, 2010
    Messages:
    1,408
    Likes Received:
    349
    Looking nice. I will download it and play a little bit with this tool. ;)
     
    • Thanks x 1
  6. Bam Bam

    Bam Bam Newbie

    Joined:
    Jan 22, 2009
    Messages:
    23
    Likes Received:
    2
    I'll check this out... how often are you rolling out updates? Where should we send bug reports?
     
    • Thanks x 1
  7. lisper

    Not too often for now (well, at least not until I get a flat over here and an internet connection). From then on probably one or two every week. I haven't really set up any bug reporting features yet (it's a very early release and I just wanted to push it out there). Either way, feel free to post your findings over here and I'll get to fixing them next week or so.

    Thanks for trying it out.

    Cheers
     
  8. Crapper

    Crapper Junior Member

    Joined:
    Sep 28, 2011
    Messages:
    126
    Likes Received:
    14
    Hi,

    Thanks for the free download of your software. Just tested it and it goes pretty fast. But the private proxies didn't work. I did not get any results back with private proxies. Just one question: how does the Google PR check work? It didn't give me any results.

    By the way (we are neighbours.... I'm from The Netherlands. Just to let you know :) )
     
  9. davie9x

    davie9x Regular Member

    Joined:
    Jun 12, 2012
    Messages:
    460
    Likes Received:
    118
    Location:
    BHW <3
    Damn what a great tool dude! Very fast and simple to use. Rep + Thanks given.
     
  10. FleshJoe

    FleshJoe Power Member

    Joined:
    Mar 11, 2008
    Messages:
    551
    Likes Received:
    1,702
    The image looks very good! I'd love to have my hands on it ;)

    Lisper I sent you a PM.
     
  11. t0mmy1008

    t0mmy1008 Newbie

    Joined:
    Apr 7, 2010
    Messages:
    26
    Likes Received:
    6
    Can you pm me the download link of this application?
     
  12. lisper

    Bedankt (thanks) :)
    I'm not actually sure if private proxies are really working... However, maybe you messed up the format? It should be HOST:PORT:USERNAME:PASSWORD (all separated by ":"). Have a go at it and let me know if that works out.
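
    To make the expected format concrete, here is a minimal Python sketch of parsing such a line. It is not Multiharvester's own parser, and the names `Proxy` and `parse_proxy` are made up. It also assumes that public proxies are written as plain HOST:PORT and that a password may itself contain ":", neither of which is confirmed in this thread.

```python
from typing import NamedTuple, Optional


class Proxy(NamedTuple):
    host: str
    port: int
    username: Optional[str] = None
    password: Optional[str] = None


def parse_proxy(line):
    """Parse HOST:PORT or HOST:PORT:USERNAME:PASSWORD into a Proxy."""
    # Split at most 3 times so any extra ":" stays inside the password.
    parts = line.strip().split(":", 3)
    if len(parts) not in (2, 4):
        raise ValueError(f"expected 2 or 4 colon-separated fields: {line!r}")
    host, port = parts[0], parts[1]
    if not host or not port.isdecimal() or not 0 < int(port) < 65536:
        raise ValueError(f"bad host or port: {line!r}")
    if len(parts) == 2:
        return Proxy(host, int(port))
    return Proxy(host, int(port), parts[2], parts[3])


if __name__ == "__main__":
    print(parse_proxy("10.0.0.1:8080:user:secret"))
    print(parse_proxy("10.0.0.1:3128"))
```

    A line with the wrong number of fields (for example a username without a password) raises `ValueError` instead of being silently accepted, which makes a mistyped proxy list easier to spot.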
    The PR scraper is fairly simple. You just scrape your links, set a PR limit and click the check button. HOWEVER, here's the thing: proxies are currently NOT supported with the PR checker. So if you run too many threads, Google will ban your IP for some time.

    I'll have a more thorough look into it tomorrow.

    My pleasure!

    First install the java runtime on your machine. Google "Get java", the first result should be it. Once installed download the launcher from my server: 50.112.250.78 / clojure / core [dot] jnlp
    Launch the launcher (you might need to right-click and open with "Java web start"). Should be pretty straight forward from there on.

    Everyone, I'll be going to bed for now. My flathunt continues tmr. I should be on tmr evening, I might even release a somewhat less buggy version then (if the internet connection is kind enough). Thanks to anyone who's tried it out so far.

    Cheers
     
  13. lisper

    Version 0.0.3 is up.
    Just pushed the update. Please let me know if you can update at all... I'm using Java Web Start for the update process which can be buggy as all shit sometimes, so post here if you have problems and I'll try to sort you out.

    With this build I've somewhat optimized the code and reduced the file size a little. There's still a lot of optimization left to do, I'll get to that as soon as possible. Also, I've added Bing scraping. However it dies around page 4-5, will fix that with the next release.

    In other news, I've got a flat and we'll be moving tmr, so I should be up and running from Friday on. Get yourself ready for some serious updates lol.

    Cheers and let me know if there are any problems.
     
  14. beaurock

    beaurock Newbie

    Joined:
    Apr 19, 2012
    Messages:
    12
    Likes Received:
    1
    This seems cool thanks
     
  15. wuquater

    wuquater Junior Member

    Joined:
    Jan 1, 2013
    Messages:
    167
    Likes Received:
    63
    Location:
    no man's land
    hey Lisper, this sounds very good, any new updates on it?
     
  16. RushingWind

    RushingWind Elite Member

    Joined:
    Apr 6, 2013
    Messages:
    2,416
    Likes Received:
    3,333
    awesome :)
    tnx
     
  17. vladv

    vladv Junior Member

    Joined:
    Mar 23, 2013
    Messages:
    108
    Likes Received:
    10
    I wonder if anybody can re-upload this awesome utility? Thanks!
     
  18. garytaylor9

    garytaylor9 Newbie

    Joined:
    Jul 16, 2013
    Messages:
    1
    Likes Received:
    0
    I wonder if anybody can re-upload this scraper, as it seems to be something I could use to great effect.

    Thank you in advance
     
  19. itsdjango

    itsdjango Regular Member

    Joined:
    Jun 16, 2013
    Messages:
    327
    Likes Received:
    59
    Occupation:
    Engineer.
    Location:
    outside your window
    Now we are talking !
     
  20. hmeister

    hmeister BANNED

    Joined:
    Jul 29, 2009
    Messages:
    342
    Likes Received:
    195
    Hi
    Is this still available please?
    Thanks