
Scrapebox Harvesting: Always low results. Suggestions?

Discussion in 'Black Hat SEO Tools' started by muchacho, May 24, 2011.

  1. muchacho

    muchacho Supreme Member

    Joined:
    May 14, 2009
    Messages:
    1,293
    Likes Received:
    187
    Location:
    Lancashire, England.
    OK, so I scrape a list of several hundred keywords (done different sets of keywords when trying to harvest) and add them to the keyword section.

    I try different custom footprints such as "add your comment below", "inurl:edu" etc.. basically ones that result in millions of results in Google.

    Yet when I run the harvester, it brings back a few hundred results and then the connections just keep dropping before the harvester has finished. If I choose to leave keywords that weren't checked, it keeps most of them there, which basically means most of them aren't checked with Scrapebox.

    Proxies are public, but I use a decent source and they pass the Google check immediately prior to harvesting.

    I have results set to 1000, so I always assumed that if I had 1000 keywords with 1000 results each, it would fetch back a maximum of 1 million.. not just a few hundred.
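As a back-of-envelope check of that arithmetic (a sketch only — real yields fall well short of the cap, since engines rarely return the full 1000 results per query and many URLs repeat across keywords):

```python
# Theoretical ceiling on harvested URLs, before duplicate removal.
# results_per_keyword mirrors the "results" setting mentioned above;
# the actual harvest is always lower.
def max_harvest(keywords: int, results_per_keyword: int = 1000) -> int:
    return keywords * results_per_keyword

print(max_harvest(1000))  # 1000000 -- the ceiling, not a realistic yield
```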

    I've also tried with Yahoo and it doesn't work out much better.

    I've seen threads around the net of people producing thousands and up to a million results, but I can't work out how they'd possibly do that.

    Has Scrapebox, or Google/Yahoo lowered the amount of results each IP address can get? I tried a delay of 10 seconds, which SB seemed to ignore, as there didn't appear to be any waiting.

    Maybe I need a lot more proxies? Recently, I've had about 40 and had connections ranging from 50 to the max 500.
     
    • Thanks x 1
    Last edited: May 24, 2011
  2. dragon77

    dragon77 Power Member

    Joined:
    Jan 18, 2009
    Messages:
    576
    Likes Received:
    685
    Occupation:
    Increasing 5 Figures to 6 Figures
    Location:
    BHW-Gold Mine
    I just harvested 2.7 million fresh WordPress URLs today. Here are my settings:
    I harvest with 125 elite proxies and 10,000 unique keywords, with Google, Yahoo and AOL checked. 50 connections each, 30-second timeout, 100 Mbps bandwidth.
     
    • Thanks x 1
  3. StiflersMom

    StiflersMom Registered Member

    Joined:
    May 19, 2011
    Messages:
    57
    Likes Received:
    27
    Occupation:
    All your SERPs are belong to us!
    Location:
    /dev/null
    Well, I can only speak for myself ... but when I use 50 proxies I set max connections to 10 (and the timeout to 45 with private proxies, or 90 with public). I'm usually very specific with footprints because of the better success rate on moderated blogs ... but even then I usually get 2k-5k URLs.

    I just tried a rather general footprint like
    Code:
    "powered by" + "post a comment" + intitle:
    kw1
    kw2
    kw3
    .....up to kw50
    
    and got 35k+ URLs in like 5 minutes.


    Are you always seeing low results, or just lately? If the proxies are fine ... maybe your footprints are messed up or contain a typo?

    edit: during your harvests, what does it say in the connections row? If it shows 1, you messed something up. Example: if you scrape 100 footprints and set connections to 100, it should say 100 connections in the beginning and then start to drop every other second until it reaches 0. If you scrape 10k footprints with 100 connections, it should stay at 100 active connections for several minutes.

    ..
     
    • Thanks x 1
    Last edited: May 24, 2011
  4. muchacho

    muchacho Supreme Member

    Joined:
    May 14, 2009
    Messages:
    1,293
    Likes Received:
    187
    Location:
    Lancashire, England.
    Keywords: 2402 (duplicates removed)
    Good Proxies: 38 (time-out was 10 seconds when tested)
    Connections (Google, Yahoo, AOL): 50
    Timeout: 30 seconds

    Total harvested after dup removed = 5062

    Keywords not checked = 904


    That seems pretty low.

    What's the reason that it doesn't check all the keywords in the list at the first attempt?
     
  5. muchacho

    muchacho Supreme Member

    Joined:
    May 14, 2009
    Messages:
    1,293
    Likes Received:
    187
    Location:
    Lancashire, England.
    During the harvest the connections do start at 50, which is what I set it to. Then it goes down gradually to 0.
     
  6. clin407

    clin407 Regular Member

    Joined:
    Apr 6, 2011
    Messages:
    418
    Likes Received:
    129
    I would like to know this as well. I've been playing with Scrapebox and learning how to harvest sites. I'd end up using 1000 of the most commonly used English words, with "powered by" WP, Drupal and Movable Type footprints, and approx 30-40 public proxies with settings similar to everyone else's. I'd harvest around 400k URLs but end up with 40k (I assume this is because of the similarity between my keywords).

    For those getting huge lists, are your keywords all on different themes? Should I set the max total connections equal to the number of proxies I'm using to keep it at max efficiency?
     
  7. StiflersMom

    StiflersMom Registered Member

    Joined:
    May 19, 2011
    Messages:
    57
    Likes Received:
    27
    Occupation:
    All your SERPs are belong to us!
    Location:
    /dev/null
    ok, something is wrong on my end now as well :D

    Code:
    "powered by" + "leave a comment" + intitle:
    gift
    gift idea
    gift ideas
    gift card
    the gift
    gift basket
    gift cards
    anniversary gift
    gift shop
    gift baskets
    
    This just returned only 798 results with 25 private proxies, but it returned 9,774 results without proxies and got my IP banned :D By the way, I only scraped Google results.
     
    Last edited: May 25, 2011
  8. StiflersMom

    StiflersMom Registered Member

    Joined:
    May 19, 2011
    Messages:
    57
    Likes Received:
    27
    Occupation:
    All your SERPs are belong to us!
    Location:
    /dev/null
    *OMG*

    At least for myself I found the solution:

    In Settings, disable "Use multi-threaded harvester" and see if your IPs are banned! The proxies I used are all showing up as OK in the Proxy Manager, but 23 out of 25 show up as "302 IP banned" when harvesting single-threaded.

    LMK, if you experience the same :eek:
     
  9. muchacho

    muchacho Supreme Member

    Joined:
    May 14, 2009
    Messages:
    1,293
    Likes Received:
    187
    Location:
    Lancashire, England.
    I'm thinking it might be an idea to have at least 2 instances of SB running where:

    Instance 1 harvests, so all proxies are tested with 'ignore Google' NOT checked when proxy testing.

    Instance 2 does the posting/commenting so 'ignore Google' IS checked when proxy testing.


    Start the day with an hour or so's worth of proxy harvesting; of course the list will shrink as the day goes on, as proxies become blocked, banned or stop working.

    But I've tested harvesting with Google seconds after the proxy test came back with X amount passing Google, and it still didn't bring back good results.

    I tried disabling the setting above, but Google is still coming back with very very low numbers, considering the amount of keywords I'm putting in, so there's something wrong somewhere.
     
  10. Orbit143

    Orbit143 Senior Member

    Joined:
    Aug 8, 2010
    Messages:
    893
    Likes Received:
    588
    Location:
    /home
    For me the real problem was the proxies - I was only ever able to harvest a few thousand results at most with 100+ public proxies.

    Today I picked the 4 (yes, only four) fastest proxies and started harvesting with a really big keyword list and 200 connections. It's still running and I've harvested over 700k results from Yahoo so far.
     
  11. muchacho

    muchacho Supreme Member

    Joined:
    May 14, 2009
    Messages:
    1,293
    Likes Received:
    187
    Location:
    Lancashire, England.
    Any delay?

    Yahoo seems to be a lot easier to harvest from, it's Google that's my main problem.
     
  12. Orbit143

    Orbit143 Senior Member

    Joined:
    Aug 8, 2010
    Messages:
    893
    Likes Received:
    588
    Location:
    /home
    What do you mean by delay? I didn't set any, so I don't know.

    Yahoo is easier, that's why I like it, and I don't see a difference in results.
     
  13. muchacho

    muchacho Supreme Member

    Joined:
    May 14, 2009
    Messages:
    1,293
    Likes Received:
    187
    Location:
    Lancashire, England.
    I mean deselecting "Use multi-threaded harvester" and selecting a delay between 1 and 10 in the delay option.

    I'll just have to have a play about with it once these proxies have been checked, but it's currently looking like harvesting from Google is a no-go.
     
  14. Orbit143

    Orbit143 Senior Member

    Joined:
    Aug 8, 2010
    Messages:
    893
    Likes Received:
    588
    Location:
    /home
    I'm using multiple (200) threads with these 4 proxies, have no idea how it works but apparently it does.
     
  15. muchacho

    muchacho Supreme Member

    Joined:
    May 14, 2009
    Messages:
    1,293
    Likes Received:
    187
    Location:
    Lancashire, England.
    Harvesting with Yahoo seems to be OK now.

    I tried it with 3 keywords and got, as I expected, close to 3000 (max 1000 per keyword with Yahoo) = 2993 results, 1044 after dups were removed. The keywords were closely related, though, so that's not too surprising.

    In theory if I add 2000 keywords, I should get approaching 2 million results before dups are removed.

    Google seems to be a pain with proxies, as a high % are blocked or no longer working. I imagine private proxies + delay + the right number of connections could work, but with public proxies it's more hassle than it's worth.
     
  16. muchacho

    muchacho Supreme Member

    Joined:
    May 14, 2009
    Messages:
    1,293
    Likes Received:
    187
    Location:
    Lancashire, England.
    Update: Yahoo harvesting still stops very very early with most of the keywords still in the list.

    I can't understand why, if I add 2000 keywords and select a footprint which I know returns millions of results when searched manually, it stops at a certain stage during the harvest and claims it's finished.

    If I check the proxies they are still working, so it can't be that.

    I'm just not getting what's actually made Yahoo decide to stop, if there's enough keywords in the list, and the proxies are working. Not sure what else it could be down to.
     
  17. jiipodd

    jiipodd Registered Member

    Joined:
    Dec 3, 2010
    Messages:
    84
    Likes Received:
    87
    Location:
    Moving around Europe
    I am experiencing similar problems. I have owned Scrapebox for 6 months, but haven't used it a lot (big mistake).

    Now that I want to actively use it to scrape blogs where to comment to, I can't scrape basically anything.

    Also, I've noticed that if I put in keywords like intitle:Barack Obama, most of the results won't have that in the title, which is very odd to me.
     
  18. Ed4252

    Ed4252 Power Member

    Joined:
    Feb 23, 2009
    Messages:
    603
    Likes Received:
    90
    I have a list of around 13k blog domains. How do I scrape so I can get the actual URLs?
     
  19. muchacho

    muchacho Supreme Member

    Joined:
    May 14, 2009
    Messages:
    1,293
    Likes Received:
    187
    Location:
    Lancashire, England.
    With regards to harvesting URLs, I've now figured it out.

    Only use Yahoo and do not use proxies.

    The Yahoo API allows each IP to make 5000 queries per day; however, Scrapebox rotates through about a dozen APIs, one per request.

    Basically - you should be able to harvest around 5 million URLs a day using this method, which for the majority I think will be more than enough.

    I did a scrape earlier today from about 2000 keywords and got about 500,000 URLs... unfortunately, after duplicates were removed, I was left with about 140,000 but that just means I need to use more keywords or use a more diverse range of keywords. Either way, it's just a case of scaling it up to get more results - whereas before it was a proxy issue.

    You could use private proxies if you wanted more than 5 million. This means not using Google, but even Scrapebox themselves have said just using Yahoo is the best approach.
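    The daily ceiling described above works out as follows (a back-of-envelope sketch using the poster's figures — the quota and the per-query cap are this thread's numbers, not documented Yahoo limits):

```python
# Back-of-envelope for the "around 5 million URLs a day" claim above.
QUERIES_PER_DAY = 5000     # Yahoo API quota per IP, per the post
RESULTS_PER_QUERY = 1000   # Scrapebox's per-query results cap
print(QUERIES_PER_DAY * RESULTS_PER_QUERY)  # 5000000
```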
     
    • Thanks x 4
  20. muchacho

    muchacho Supreme Member

    Joined:
    May 14, 2009
    Messages:
    1,293
    Likes Received:
    187
    Location:
    Lancashire, England.
    Use site: as the footprint and then list the root domains in the keyword section.

    This will scrape all the subpages of all the sites. For this, though, I think you'll need proxies that pass Google.
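    The combination described above can be sketched like this (the domains are placeholders; in Scrapebox itself the footprint box and keyword list perform this merge for you):

```python
# Hypothetical helper: combine the "site:" footprint with each root
# domain, producing the queries the harvester would run.
footprint = "site:"
domains = ["example.com", "example.org"]  # placeholder root domains
queries = [footprint + d for d in domains]
print(queries)  # ['site:example.com', 'site:example.org']
```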