I cannot harvest

Discussion in 'Black Hat SEO' started by theseodude, Aug 7, 2012.

  1. theseodude

    theseodude Regular Member

    Joined:
    Jun 25, 2012
    Messages:
    303
    Likes Received:
    88
    Hi,
    I am trying to harvest all the pages of a site. I'm an experienced user and I've done this in the past, but now every time I try to scrape, it scrapes and scrapes and finds around 6,000 links. But then it says "similar links have been removed" and after that I am left with 1 or 2 links.

    I have tried

    site:http://www.domain.com


    and I have tried
    site:http://www.domain.com John
    site:http://www.domain.com card
    site:http://www.domain.com real
    site:http://www.domain.com brain
    site:http://www.domain.com Susan
    site:http://www.domain.com (random word)
    site:http://www.domain.com (random word 2)
    etc.
    etc.

    By the time the duplicate URLs are removed, I am left with only 1 or 2 links. I don't know what the hell is going on. I have done this in the past.

    I am using private proxies by the way.
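    In other words, all I am doing is merging the site: footprint with a keyword list, one query per keyword, roughly like this (the domain and keywords below are just placeholders, not my real ones):

    Code:
def build_queries(domain, keywords):
    # One bare site: query plus one "site:domain keyword" query per keyword.
    return ["site:" + domain] + ["site:%s %s" % (domain, kw) for kw in keywords]

# Placeholder values for illustration only.
queries = build_queries("http://www.domain.com", ["John", "card", "real", "brain", "Susan"])
for q in queries:
    print(q)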
     
  2. theseodude

    theseodude Regular Member

    Joined:
    Jun 25, 2012
    Messages:
    303
    Likes Received:
    88
    Hi guys, so nobody knows why ScrapeBox is behaving this way?
     
  3. EXtraHand

    EXtraHand Junior Member

    Joined:
    Jan 26, 2012
    Messages:
    111
    Likes Received:
    62
    Do you mean duplicate domains are removed? If yes, go to the Options drop-down (beside Settings) and untick "Automatically Remove Duplicate Domains"; it's above "Auto Recovery".
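    To see why that setting matters: removing duplicate URLs keeps every distinct page, while removing duplicate domains keeps only one URL per domain, so thousands of links from a single site collapse to 1. Rough sketch of the difference (made-up URL list, just the logic, not ScrapeBox's actual code):

    Code:
from urllib.parse import urlparse

harvested = [
    "http://www.domain.com/index.html",
    "http://www.domain.com/post1.html",
    "http://www.domain.com/post1.html",   # a true duplicate URL
    "http://www.domain.com/about.html",
]

# Remove duplicate URLs: every distinct page survives (3 links here).
unique_urls = list(dict.fromkeys(harvested))

# Remove duplicate domains: only the first URL per domain survives (1 link here).
seen_domains = set()
unique_domains = []
for url in harvested:
    host = urlparse(url).netloc
    if host not in seen_domains:
        seen_domains.add(host)
        unique_domains.append(url)

print(len(unique_urls), len(unique_domains))   # 3 vs 1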
     
  4. futurestic06

    futurestic06 Supreme Member

    Joined:
    Apr 16, 2011
    Messages:
    1,204
    Likes Received:
    146
    Dude, sorry, but I don't get your point. I think you should tell us some more about the problem so that I can help you. Thanks.
     
  5. theseodude

    theseodude Regular Member

    Joined:
    Jun 25, 2012
    Messages:
    303
    Likes Received:
    88
    Alright, let me make it simple. How do I harvest all the pages that a domain has?
     
  6. t0.sh

    t0.sh Registered Member

    Joined:
    Jun 6, 2012
    Messages:
    55
    Likes Received:
    24
    Location:
    UVB-76
    It seems like you're conflating what you're looking for. Do you mean you want to download all the clickable links available on the site, so you're left with a list of links, or do you want to download all the individual pages of a site?
     
  7. theseodude

    theseodude Regular Member

    Joined:
    Jun 25, 2012
    Messages:
    303
    Likes Received:
    88
    I want to harvest all the pages that a domain has.
    for example
    domain.com/index.html
    domain.com/post1.html
    domain.com/post2.html
    domain.com/about.html

    I know for a fact that this domain has hundreds, if not thousands, of pages, but when I do it in ScrapeBox, I get like 1 or 2.
     
  8. jb2008

    jb2008 Senior Member

    Joined:
    Jul 15, 2010
    Messages:
    1,158
    Likes Received:
    972
    Occupation:
    Scraping, Harvesting in the Corn Fields
    Location:
    On my VPS servers
    Just use site:http://domain.com

    Don't put a keyword after it.

    And use Hrefer, not ScrapeBox, with a large list of good-quality, tested public proxies. Search some of my posts about Hrefer.
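    Also, if the engine simply hasn't indexed many of the pages, no site: query will find them. Another way to enumerate a domain's pages is to follow its own internal links. A rough Python sketch of that idea (this is not Hrefer or ScrapeBox code; it assumes the requests and beautifulsoup4 packages, and the start URL is a placeholder):

    Code:
import requests
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

start = "http://www.domain.com/"          # placeholder start URL
host = urlparse(start).netloc

seen, queue = set(), [start]
while queue and len(seen) < 500:          # small cap for the example
    url = queue.pop(0)
    if url in seen:
        continue
    seen.add(url)
    try:
        html = requests.get(url, timeout=10).text
    except requests.RequestException:
        continue
    # Queue every internal link found on the page.
    for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        link = urljoin(url, a["href"]).split("#")[0]
        if urlparse(link).netloc == host and link not in seen:
            queue.append(link)

print(len(seen), "pages found")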
     
  9. williamk

    williamk BANNED

    Joined:
    Oct 29, 2009
    Messages:
    1,030
    Likes Received:
    184
    I would advise the following:


    a) Make sure the proxies are working well and not blacklisted (a rough check is sketched after this list).
    b) Use the site: operator (you are already using it).
    c) Try with a low number of threads.
    d) Contact support if all of these are fine and you still cannot scrape.
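    For point (a), a rough way to sanity-check proxies is to push a test query through each one and flag anything that fails or comes back with a CAPTCHA page. A minimal Python sketch (the proxy list and test URL are placeholders, and the "captcha" string check is a simplification, not how any particular tool does it):

    Code:
import requests

proxies_to_test = ["123.45.67.89:8080", "98.76.54.32:3128"]   # placeholder proxies
test_url = "http://www.google.com/search?q=site:domain.com"   # placeholder test query

for p in proxies_to_test:
    proxy_cfg = {"http": "http://" + p, "https": "http://" + p}
    try:
        r = requests.get(test_url, proxies=proxy_cfg, timeout=10)
        if r.status_code == 200 and "captcha" not in r.text.lower():
            print(p, "looks OK")
        else:
            print(p, "blocked or serving a CAPTCHA (status %s)" % r.status_code)
    except requests.RequestException as e:
        print(p, "failed:", e)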