
Best Tool For Scraping links from Google. Hrefer? Scrapebox? Or something else?

Discussion in 'Black Hat SEO' started by peterhofmann, Jul 18, 2013.

  1. peterhofmann

    peterhofmann Newbie

    Joined:
    Mar 2, 2013
    Messages:
    22
    Likes Received:
    0
    Hello,

    I have tried scrapebox and hrefer to scrape URLs from Google.

    ScrapeBox is easy to use, but I can't get enough links with it.

    With Hrefer, I think there is a problem with the engines.ini: the query string asks for 100 results per page, but it only gets 10 results per page, so it can't scrape enough URLs for me either.

    Do you know how to fix the engines.ini problem in Hrefer?

    Or do you know any software that can do a better job than these two?

    I hope someone can help me. Thanks.
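
    For reference, here is roughly what I mean: the results-per-page comes from the num parameter in the Google query URL that the engines.ini template builds. A Python sketch of that kind of URL (not the exact engines.ini contents):

    Code:
    # Sketch: what a "100 results per page" Google query URL looks like.
    # The real template in engines.ini may use different parameters.
    from urllib.parse import urlencode

    def build_query_url(keyword, page=0, per_page=100):
        params = {
            "q": keyword,               # footprint + keyword
            "num": per_page,            # results per page (Google caps this at 100)
            "start": page * per_page,   # pagination offset
        }
        return "https://www.google.com/search?" + urlencode(params)

    print(build_query_url('"powered by wordpress" seo tools'))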
     
  2. donaldbeck

    donaldbeck Power Member

    Joined:
    Dec 28, 2006
    Messages:
    585
    Likes Received:
    212
    Both will work.

    If SB isn't getting enough links you need to take a look at how many keywords/footprints you are using.

    It's not going to have any limitations as far as amount of links you can scrape.
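
    To make that concrete, the number of queries (and so the number of links you can harvest) is roughly footprints x keywords. A tiny Python sketch of the multiplication (the lists are just examples):

    Code:
    # Sketch: every footprint is paired with every keyword, so the query
    # count, and therefore the harvest size, grows multiplicatively.
    footprints = ['"powered by vbulletin"', 'inurl:guestbook']
    keywords = ["seo", "link building", "proxies"]

    queries = [f"{fp} {kw}" for fp in footprints for kw in keywords]
    print(len(queries))  # 2 footprints x 3 keywords = 6 queries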
     
    • Thanks x 2
  3. Hinkys

    Hinkys Jr. VIP Jr. VIP

    Joined:
    Mar 3, 2012
    Messages:
    700
    Likes Received:
    553
    Location:
    Croatia
    It's usually not what tool you use but how you use it. Scrapebox is enough for 99% of your scraping needs and Hrefer is said to be the best scraper you can find. I seriously doubt you will find something that works for you if these 2 don't.

    Did you use enough footprints / keywords / proxies?
    "No"? - Use more!
    "Yes"? - Use more anyway!
     
    • Thanks x 2
  4. cacus1

    cacus1 Newbie

    Joined:
    Jul 14, 2013
    Messages:
    16
    Likes Received:
    5
    With ScrapeBox I scraped about 276,000 URLs in about 3 hours. I pressed the abort button because that was enough for me, and ScrapeBox crashed with an error saying something about not having enough memory.

    Also, I got the same error when merging my footprints with my common words list. If I reduce the keywords and abort scraping before 100k URLs, it works just fine.
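
    If the merge itself is what runs out of memory, streaming the combinations straight to disk instead of building the whole merged list in RAM avoids it. A minimal Python sketch of the idea (file names are made up, and this is not how ScrapeBox does it internally):

    Code:
    # Sketch: merge footprints with a large keyword list without holding
    # the whole merged list in memory, by writing combinations to disk.
    with open("footprints.txt") as f:
        footprints = [line.strip() for line in f if line.strip()]

    with open("common_words.txt") as kws, open("merged.txt", "w") as out:
        for kw in (line.strip() for line in kws):
            if not kw:
                continue
            for fp in footprints:
                out.write(f"{fp} {kw}\n")  # one query per line, never all in RAM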
     
    Last edited: Jul 18, 2013
  5. donaldbeck

    donaldbeck Power Member

    Joined:
    Dec 28, 2006
    Messages:
    585
    Likes Received:
    212
    This type of stuff is going to be server dependent.
     
  6. evilgary

    evilgary Junior Member

    Joined:
    Apr 3, 2008
    Messages:
    187
    Likes Received:
    26
    Definitely Hrefer would be much better, since it also has a sieve filter in place. If your engines.ini does not work, I suggest you learn how to modify it; it's pretty simple.
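
    For anyone unfamiliar, a sieve filter just keeps the harvested URLs that match patterns you care about and drops the rest. A rough Python sketch of the idea (the patterns are illustrative, not Hrefer's actual filter syntax):

    Code:
    # Sketch: a "sieve"-style filter that keeps only URLs matching wanted patterns.
    import re

    wanted = [r"/guestbook", r"wp-comments-post\.php", r"\.edu/"]
    sieve = re.compile("|".join(wanted), re.IGNORECASE)

    urls = [
        "http://example.com/guestbook.php",
        "http://example.org/about.html",
        "http://college.edu/blog/",
    ]
    kept = [u for u in urls if sieve.search(u)]
    print(kept)  # drops example.org/about.html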
     
  7. *Hawke*

    *Hawke* Junior Member Premium Member

    Joined:
    Apr 9, 2013
    Messages:
    126
    Likes Received:
    81
    Home Page:
    Why not try GScraper? 276,000 URLs in about 3 hours is a piece of cake for it.
     
  8. JustUs

    JustUs Power Member

    Joined:
    May 6, 2012
    Messages:
    626
    Likes Received:
    587
    Hawke, I have GScraper, and I ended up purchasing ScrapeBox. GScraper is not a bad product, but it now has many bugs. The latest version (following 1.2.1.1) inconsistently scrapes when using a footprint such as site:.com %KW% or site:%KW%.com. By inconsistent I mean sometimes it will scrape and create a report and a file, and sometimes it won't. I have tried many of the footprints I use, and this problem is consistent across them, including the one that ships with the program.

    Also, as you have tried to improve the program, the memory footprint at boot has grown from 239K of RAM to over 500K.

    What GScraper does better than ScrapeBox is skipping dead proxies, whereas in ScrapeBox a dead proxy just throws an error and the search is skipped.

    To be honest, I feel at this point that I wasted my money on GScraper. I know you are trying hard, but...
     
  9. briptech

    briptech Power Member

    Joined:
    Apr 4, 2011
    Messages:
    557
    Likes Received:
    218
    Gender:
    Male
    Occupation:
    SEO Specialist
    Location:
    BHW
    Home Page:
    I have owned ScrapeBox since version 0.1 and I have never tried another tool. Anyway, I think it ALL depends on proxies (and on the quality of those proxies). I have tried several proxy providers and now have subscriptions with three vendors, giving me ~50-60 proxies (fresh every month).

    I think you should look here on BHW for proxy vendors; there are always good offers active, and this should give you the chance to test and test and test. Also, add tons of keywords when you scrape (at least 50).

    Hope this helps :)
     
  10. Moofy

    Moofy Newbie

    Joined:
    Jun 28, 2013
    Messages:
    10
    Likes Received:
    0
    Hinkys and donaldbeck tell the truth: the more you need, the more you pay ;)
    If you have no idea, just try xboter Scrape Sonic; maybe it will meet your needs. You can get proxies from their server, and there are many footprints built into the software. The key is that it is free for now.

    Good luck!
     
  11. Sweetfunny

    Sweetfunny Jr. VIP Jr. VIP

    Joined:
    Jul 13, 2008
    Messages:
    1,797
    Likes Received:
    5,074
    Location:
    ScrapeBox v2.0
    Home Page:
    No, in ScrapeBox when a query fails due to a bad proxy, it will retry the query with a different proxy, 3 times by default. You can raise this to 20 retries under Settings > Adjust multi-threaded harvester proxy retries.

    Also, the Custom Harvester will go up to 99 retries for every query; see the "Proxy Retries" setting at the top right.


    If that ever happens, all your URLs are saved in the /Harvester_Sessions/ folder, so nothing is ever lost.
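
    The retry behaviour described above boils down to: when a query fails on a proxy, try it again on a different proxy, up to a fixed retry count, and only then skip it. A rough Python sketch of the pattern (fetch_results is a hypothetical stand-in, not ScrapeBox's actual code):

    Code:
    # Sketch of the retry pattern: a failed query is retried with a
    # different proxy up to max_retries times before it is skipped.
    import random

    def fetch_results(query, proxy):
        """Hypothetical helper standing in for the real HTTP request."""
        raise ConnectionError  # placeholder so the sketch is self-contained

    def harvest(query, proxies, max_retries=3):
        remaining = list(proxies)
        for _ in range(max_retries + 1):
            if not remaining:
                break
            proxy = remaining.pop(random.randrange(len(remaining)))
            try:
                return fetch_results(query, proxy)
            except ConnectionError:
                continue  # bad proxy, retry the query with another one
        return []         # query is skipped once the retries are used up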
     
    • Thanks x 3
    Last edited: Jul 19, 2013
  12. JustUs

    JustUs Power Member

    Joined:
    May 6, 2012
    Messages:
    626
    Likes Received:
    587
    Allow me to clarify.

    GScraper places the queries into a circular queue. Using the default settings for ScrapeBox, if P1 is dead, and P2 is dead, and P3 is dead, then an error is thrown and the query is skipped. In GScraper, if P1...P3 are dead, then the query is moved on to P4...Pn. The only time a query will be skipped is if all the proxies are dead.

    Each method has its pros and cons. Because of the transitory nature of proxies, I believe that GScraper's method is much better than the stack method that ScrapeBox appears to use, especially for long scrapes that may run over several days.
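
    To make the difference concrete, here is a rough Python sketch of the two behaviours as described above (my reading of them, not either program's actual code): the fixed-retry approach gives up on the query after a few dead proxies, while the circular-queue approach keeps moving it along until no proxies are left.

    Code:
    # Sketch: circular-queue proxy handling vs. fixed-retry handling.
    from collections import deque

    def rotate_until_success(query, proxies, is_alive):
        """Circular-queue style: move the query on to the next proxy;
        only give up when every proxy in the pool is dead."""
        pool = deque(proxies)
        for _ in range(len(pool)):
            proxy = pool[0]
            pool.rotate(-1)  # advance the circular queue to the next proxy
            if is_alive(proxy):
                return f"ran {query!r} via {proxy}"
        return None          # skipped only if ALL proxies are dead

    def skip_after_retries(query, proxies, is_alive, retries=3):
        """Fixed-retry style: try a few proxies, then skip the query."""
        for proxy in proxies[:retries]:
            if is_alive(proxy):
                return f"ran {query!r} via {proxy}"
        return None          # query skipped after the retry budget is spent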

    Both SB and GS suffer from being unable to pause the program to insert new proxies during a scrape. ScrapeBox attempts to resolve this with the custom harvester that scrapes for new proxies every X minutes, whereas Hawke's suggested workaround is to set the flag to delete the used keywords when the scraping is stopped and then insert the new proxies. Both methods are cumbersome. With SB this is because the proxy scrape and test is based on the settings and may pull >34K proxies and then test them before continuing. With GS it is cumbersome because you may lose the majority of a search for several keywords.

    When it comes to posting, at one time GS dusted SB at ~70% vs ~40% success. Now, due to bugs in GS, its posting success is lower than SB's.

    GS has no learning mode for new platforms, whereas SB does. SB wins here.

    GS only searches Google; SB searches many different engines and can be taught new ones (if you have working settings for Baidu and Yandex, can you share them? I can't seem to get them right). ScrapeBox wins hands down.

    GScraper allows UTF-8 encoding, so if I want to search for botnet in Russian or Chinese I can enter ботнет or 僵尸网络 directly into the keywords. ScrapeBox has no facility that allows me to do this. It would be especially useful when searching foreign (non-US) countries' search engines in the native language.
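
    The practical difference is just whether non-ASCII keywords get UTF-8 percent-encoded into the query URL. A minimal Python sketch (the URL format is illustrative):

    Code:
    # Sketch: non-ASCII keywords only need UTF-8 percent-encoding in the URL.
    from urllib.parse import quote

    for kw in ["ботнет", "僵尸网络"]:
        print(f"https://www.google.com/search?q={quote(kw)}&num=100")
    # ботнет   -> %D0%B1%D0%BE%D1%82%D0%BD%D0%B5%D1%82
    # 僵尸网络 -> %E5%83%B5%E5%B0%B8%E7%BD%91%E7%BB%9C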

    One of my peeves with GS is its use of the registry for its database, along with attempting to disable registry tracing. ScrapeBox does not do this. However, even though it is frowned on by MS because of the high probability of corrupting the registry, many commercial programs do this as well. I get around the problem by running GScraper in user space rather than administrative space. While I do not run SB in administrative space, I would feel comfortable doing so; I would not run GScraper in administrative space.

    Then my biggest peeve with .NET programmers (GScraper) is that because Visual Basic (.NET) has garbage collection and rudimentary memory management, they think they do not have to manage memory at all: "Gee, I don't have to delete an unused object and recover the space because the garbage collector will do it." This leads to excessive memory consumption and disk thrashing. I have had to force GScraper to shut down more than once because of this. However, GScraper is not as offensive as Proxy Goblin in this regard.

    Then there is another issue with GScraper: every search is sent to China. This appears to be validation of the program's authority to run, but I do not really know. It excessively consumes outgoing bandwidth and leads to the familiar "no authority for the operation" message. While I have not had SB long enough to know whether this is also the case with SB, on the surface SB appears to validate only at program startup.

    Not every program is going to have everything a person might want in it, but for this purpose, at this stage of development, and within the tests I have done and the observations I have made, ScrapeBox wins hands down. GScraper has gone from good to bad, and is now working on terrible.

    The reason I am bringing this up is that maybe it will motivate Hawke to address the problems that GScraper has.
     
    • Thanks x 1
  13. CSharp

    CSharp Newbie

    Joined:
    Jun 7, 2013
    Messages:
    39
    Likes Received:
    7
    Occupation:
    Developer, Entrepeneur
    Location:
    UK
    I will have to disagree with you there; .NET doesn't have a "rudimentary memory management" system. The main problem is objects not being disposed of once they have been used (such as a connection), which leads to the garbage collector seeing the object as still in use. That's shitty programming, nothing to do with the .NET framework.
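
    The same point holds outside .NET: the leak comes from the caller keeping the resource alive, not from the runtime. A rough analogy in Python (standing in for the Dispose()/using pattern, not actual .NET code):

    Code:
    # Analogy: an undisposed/unclosed resource stays alive for as long as
    # something still references it, so the garbage collector cannot free it.
    connections = []

    def leaky():
        conn = open("data.txt", "w")   # stands in for a DB/HTTP connection
        connections.append(conn)       # still referenced -> never collected

    def tidy():
        with open("data.txt", "w") as conn:  # released deterministically,
            conn.write("done")               # like Dispose()/using in .NET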
     
  14. dannyhw

    dannyhw Senior Member

    Joined:
    Jul 16, 2008
    Messages:
    979
    Likes Received:
    467
    Occupation:
    Software Engineer
    Location:
    New York City Burbs
    I'll fully agree with you here, but I'd throw in that there is plenty of shitty programming within the .NET framework itself, and the nature of the framework just encourages shitty programming. It's a shame, because C# is actually a nice language and they made a very nice IDE, but it runs on top of layers of shit.
     
  15. JustUs

    JustUs Power Member

    Joined:
    May 6, 2012
    Messages:
    626
    Likes Received:
    587
    Microsoft disagrees with you:
    Source: http://msdn.microsoft.com/en-us/magazine/bb985010.aspx
     
  16. CSharp

    CSharp Newbie

    Joined:
    Jun 7, 2013
    Messages:
    39
    Likes Received:
    7
    Occupation:
    Developer, Entrepeneur
    Location:
    UK
    If you had read the whole article you would have realized that, in fact, Microsoft agrees with me...

    I do this for a living, mate, and not only that, but the article dates from the year 2000; more than a decade has passed.
     
    Last edited: Jul 20, 2013
  17. JustUs

    JustUs Power Member

    Joined:
    May 6, 2012
    Messages:
    626
    Likes Received:
    587
  18. CSharp

    CSharp Newbie

    Joined:
    Jun 7, 2013
    Messages:
    39
    Likes Received:
    7
    Occupation:
    Developer, Entrepeneur
    Location:
    UK
    And those links prove what?

    This goes right back to me explaining to you that resources must be disposed of to allow the GC to do its job; once you dispose of a resource, it doesn't sit around forever waiting for the GC to collect it.

    lol, my work speaks for itself ;)

     
  19. JustUs

    JustUs Power Member

    Joined:
    May 6, 2012
    Messages:
    626
    Likes Received:
    587
    First paragraph, first link.
    So at the minimum it shows that .NET has "rudimentary memory management," as I stated. You complained about an older link, so I gave you the most current links directly from the Microsoft Developer Network.