
ScrapeBox harvesting - proxies keep getting blocked

Discussion in 'Black Hat SEO Tools' started by links, Jul 18, 2016.

  1. links

    links Regular Member

    Joined:
    Mar 4, 2009
    Messages:
    210
    Likes Received:
    195
    Whenever I try to harvest Google, it goes really well initially, but when I get to around 30-60k links all the proxies start failing. I've tried reducing connections, using 10 private proxies with 1 connection, and running 500 proxies from ScrapeBox with only 5 connections.
     
  2. arc323

    arc323 Junior Member

    Joined:
    Sep 23, 2015
    Messages:
    197
    Likes Received:
    69
    Location:
    Denver, CO
    Home Page:
    I have success setting the harvest connections to 1 when harvesting Google. I use 60 private proxies for scraping Google. Try using Google API or even Yahoo.

    Can you share what you're scraping for? Maybe I can give you more insight if I have more info.
     
  3. BassTrackerBoats

    BassTrackerBoats Super Moderator Staff Member Moderator Jr. VIP

    Joined:
    Mar 10, 2010
    Messages:
    15,951
    Likes Received:
    29,273
    Occupation:
    Selling CPA Sites
    Location:
    Not England
    Home Page:
  4. Floopa75

    Floopa75 Jr. VIP Jr. VIP

    Joined:
    Feb 6, 2014
    Messages:
    815
    Likes Received:
    713
    Gender:
    Male
    Skip Google and scrape Bing, DuckDuckGo and Yahoo. Your results will skyrocket.
     
  5. coitza

    coitza Jr. VIP Jr. VIP Premium Member

    Joined:
    Oct 26, 2007
    Messages:
    2,686
    Likes Received:
    696
    Occupation:
    freelancer
    Home Page:
    Yeah, I started skipping Google a long time ago. You have to do it really slowly or have a lot of private proxies for it to work decently.
     
  6. 710fla

    710fla Jr. VIP Jr. VIP

    Joined:
    Aug 25, 2015
    Messages:
    654
    Likes Received:
    173
    Scrape Bing and check to see if domains are indexed on Google after getting your verified targets. Usually less than 10% aren't indexed
     
  7. arc323

    arc323 Junior Member

    Joined:
    Sep 23, 2015
    Messages:
    197
    Likes Received:
    69
    Location:
    Denver, CO
    Home Page:
    Can you use search operators on DuckDuckGo? I know Yahoo and Bing have their own, but I'm not sure about DuckDuckGo. Do you get good results from DDG, or do you scrape DDG, Bing and Yahoo at the same time?
     
  8. loopline

    loopline Jr. VIP Jr. VIP

    Joined:
    Jan 25, 2009
    Messages:
    3,726
    Likes Received:
    1,994
    Gender:
    Male
    Home Page:
    Some good tips in this thread.

    As far as Google goes, you're still going too fast. 10 private proxies with the detailed harvester and a sizeable delay is more like it. I don't know what operators you're using, but you might want a delay of 20 seconds or more.

    I have a video that is specifically on your question.




    As for the general question, I'll go with the flow and say Google isn't the only fish in the sea.

    I still use Google, but I reformulated some of my footprints to be specifically effective with Bing, and I use Bing a lot because it bans less often.

    Also, Google API, DeeperWeb and Startpage are all Google powered, but each has its own IP bans.

    At any rate, there are over 20 engines in ScrapeBox, so don't get locked into just Google. At the very least you could try the others while you sort out your Google setup, so that you're at least working in the meantime.
     
  9. links

    links Regular Member

    Joined:
    Mar 4, 2009
    Messages:
    210
    Likes Received:
    195
    Thanks, I suspected my question had been asked many times before, but I couldn't find it with the search function. I've found Bing to be just as good, but it's always good to know ways around Google and the others if Bing starts blocking me.

    It has given me an idea, though. Could I not make my own search engine that is just a mirror of Google/Bing and remove the IP blocks so people can use it with ScrapeBox? Or would no host provide the kind of bandwidth I would need for that?
     
  10. links

    links Regular Member

    Joined:
    Mar 4, 2009
    Messages:
    210
    Likes Received:
    195
    It won't let me edit my previous reply to add something.

    I'm using the footprint "powered by wordpress" "leave a trackback". I'm getting approximately a 10% success rate without a captcha solver (I haven't been able to set up CaptchaSniper properly yet). Can I improve this trackback?

    Also, I have a file of 6 million URLs now. I've split it into files of 250k URLs each. When I do a "remove duplicate domains" on the first split file, it works fine. When I try to remove duplicate domains from a second split file into another empty .txt file, I get "integer overflow"?
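
    For a file that size, one workaround is to do the "remove duplicate domains" step outside ScrapeBox with a small script. The sketch below is only an illustration under assumptions: it is not how ScrapeBox does it and it does not explain the "integer overflow" message. The function names (`domain_of`, `dedupe_by_domain`, `dedupe_file`) are made up for this example, the input is assumed to be one URL per line, and subdomains are treated as separate domains apart from a leading "www.".

```python
from urllib.parse import urlsplit


def domain_of(url):
    """Return the lowercased host of a URL, or '' if none can be parsed."""
    url = url.strip()
    if not url:
        return ""
    # urlsplit only fills in the hostname when a scheme is present,
    # so assume http:// for bare entries like "example.com/page".
    if "://" not in url:
        url = "http://" + url
    try:
        host = urlsplit(url).hostname or ""
    except ValueError:
        # Malformed entries (e.g. broken IPv6 brackets) are skipped.
        return ""
    # Assumption: "www.example.com" and "example.com" count as one domain.
    return host[4:] if host.startswith("www.") else host


def dedupe_by_domain(lines):
    """Yield the first URL seen for each domain, in input order."""
    seen = set()
    for line in lines:
        host = domain_of(line)
        if host and host not in seen:
            seen.add(host)
            yield line.strip()


def dedupe_file(src_path, dst_path):
    """Stream src_path line by line and write one URL per domain to dst_path."""
    kept = 0
    with open(src_path, encoding="utf-8", errors="replace") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for url in dedupe_by_domain(src):
            dst.write(url + "\n")
            kept += 1
    return kept
```

    Because the input is read line by line and only the set of domains is held in memory, this can be pointed at the full 6 million URL file rather than the 250k splits. Something like `dedupe_file("harvested.txt", "unique_domains.txt")` would do it, where both file names are placeholders.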
     
  11. toughboy

    toughboy Junior Member

    Joined:
    Aug 18, 2014
    Messages:
    177
    Likes Received:
    43
    Location:
    underground
    Scrape Bing.