scrapebox help

Discussion in 'Black Hat SEO Tools' started by 9to5destroyer, Dec 13, 2012.

  1. 9to5destroyer

    9to5destroyer Jr. VIP Jr. VIP Premium Member

    Joined:
    Nov 14, 2011
    Messages:
    355
    Likes Received:
    205
    I'm in the middle of a massive scraping job: I'm scraping all URLs of certain sites, then extracting data using custom tools.

    I'm having problems with certain sites, mainly sites with subdomains, e.g. health.site.com.
    This only happens on certain sites, which I don't really understand.

    1. If I scrape site:site.com keyword it returns next to no results, but if I google site:site.com it returns over a million indexed subdomains.

    2. On other sites with subdomains, following the same process with around 100k keywords (very varied), even if I scrape 3-4 million URLs, I only have about 30,000 URLs left once I remove duplicates.

    Like I said, the process works perfectly on most sites and I'm only having trouble on a few. Any ideas?
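
For anyone wanting to check the duplicate ratio outside Scrapebox, here is a minimal Python sketch of the "remove duplicates" step. It is not how Scrapebox does it internally; the `normalize` and `dedupe` helpers and the sample URLs are made up for illustration, and the normalization rules (lowercase host, drop the fragment, treat an empty path as `/`) are an assumption about which URLs should count as the same.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Lowercase scheme and host and drop the fragment so trivial variants collapse."""
    parts = urlsplit(url.strip())
    path = parts.path or "/"  # treat "http://site.com" and "http://site.com/" alike
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def dedupe(urls):
    """Return unique normalized URLs, preserving first-seen order."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# Hypothetical harvest output: the same pages returned for different keywords.
harvested = [
    "http://health.site.com/page-1",
    "http://HEALTH.site.com/page-1#comments",
    "http://health.site.com/page-2",
    "http://site.com",
    "http://site.com/",
]
unique = dedupe(harvested)
print(len(harvested), "harvested ->", len(unique), "unique")
```

A large gap between the two counts usually means the search engine keeps returning the same pages for different keywords, not that the harvest itself is broken.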
     
  2. loopline

    loopline Jr. VIP Jr. VIP

    Joined:
    Jan 25, 2009
    Messages:
    3,368
    Likes Received:
    1,797
    Gender:
    Male
    Home Page:
    Can you give a more specific example, like an actual domain? What you're saying should work, provided you're not getting blocked by Google on those queries through IP bans etc.
     
  3. 9to5destroyer

    9to5destroyer Jr. VIP Jr. VIP Premium Member

    Joined:
    Nov 14, 2011
    Messages:
    355
    Likes Received:
    205
    Hi loopline, thanks for your reply.

    1. If you use the footprint site:posterous.com there's a considerable number of results. Now try site:posterous.com keyword and you get next to none compared to other sites, which seems very strange.
    2. Scraping site:webgarden.com keyword returned over three million URLs, but after removing duplicates you only get about 30,000 results.
    The keywords are varied, around 100k with no two similar. I've tried dropping results to 10 per keyword and still get similar results.

    Proxies are good and aren't getting banned.
     
    Last edited: Dec 13, 2012
  4. jr_sci

    jr_sci Senior Member

    Joined:
    Jan 30, 2010
    Messages:
    857
    Likes Received:
    686
    Occupation:
    CTO at Tiny Piglet Publishing, Bestselling Author
    Home Page:
    Try filling the keywords box. I faced a similar problem earlier, and filling the keywords box gave me better results. If that fails, loopline can definitely help. He is like the GOD of Scrapebox!



     
  5. pokerjk

    pokerjk Senior Member

    Joined:
    Dec 26, 2010
    Messages:
    1,167
    Likes Received:
    384
    Occupation:
    Online Marketer
    Location:
    England
    Are all keywords showing up as green in the results?

    Try not using the multi-threaded harvester; that way, if any errors are happening, you can see them and their response codes.
     
  6. 9to5destroyer

    9to5destroyer Jr. VIP Jr. VIP Premium Member

    Joined:
    Nov 14, 2011
    Messages:
    355
    Likes Received:
    205
    Yeah, that's with the keywords box filled, about 100k.
     
  7. 9to5destroyer

    9to5destroyer Jr. VIP Jr. VIP Premium Member

    Joined:
    Nov 14, 2011
    Messages:
    355
    Likes Received:
    205
    It's actually the footprint, as I have tried it in Google.
     
  8. loopline

    loopline Jr. VIP Jr. VIP

    Joined:
    Jan 25, 2009
    Messages:
    3,368
    Likes Received:
    1,797
    Gender:
    Male
    Home Page:
    If you go to a browser and search:

    site:osterous.com
    There are only 142 results, so you're not going to get more than that. Either Google doesn't have them indexed or no more than that exist.

    For site:webgarden.com you should get decent results. It shows over a million, but if you go to about result 700 it just cuts off. Even if you repeat the search with the omitted results included, there are no more results to be shown. Google allows you to see up to 1,000 per query, but it shows fewer than 700 for that query.

    Google does that sometimes. It will say "I have X million, but I'm not going to show them to you." Reminds me of a little kid who has something everyone wants to see and brags about it, but won't actually show it to anyone. lol

    If you look at Google results much, you will see this is common. It seems to be especially true for the site: operator; I've run into it lots of times.

    You may of course get more results as you use keywords, but ultimately Google is saying that it doesn't want to give you everything it has for that domain. You can use the Link Extractor addon in Scrapebox to extract more internal links. I have a video on it here:



    But the short of it is, Scrapebox will only give you what Google gives it, and if Google won't play nicely for your query, there isn't much you can do about it except spider the site yourself.
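
As a rough idea of what "spider the site yourself" involves, here is a minimal Python sketch of the internal-link step only. It is not the Scrapebox Link Extractor, just an illustration using the standard library; `LinkCollector`, `internal_links`, and the example.com sample page are hypothetical, and fetching pages, throttling, and respecting robots.txt are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_links(html, page_url, root_domain):
    """Return sorted absolute links on root_domain or any of its subdomains."""
    collector = LinkCollector()
    collector.feed(html)
    found = set()
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parts = urlsplit(absolute)
        host = (parts.hostname or "").lower()
        on_site = host == root_domain or host.endswith("." + root_domain)
        if parts.scheme in ("http", "https") and on_site:
            found.add(absolute.split("#")[0])  # ignore in-page anchors
    return sorted(found)


# Hypothetical page content standing in for a downloaded page.
sample_html = """
<a href="/about">About</a>
<a href="http://blog.example.com/post-1#top">Post</a>
<a href="http://other.org/">Elsewhere</a>
<a href="mailto:me@example.com">Mail</a>
"""
links = internal_links(sample_html, "http://example.com/index.html", "example.com")
print(links)
```

A full spider would feed each newly found link back through the same function until no new URLs turn up.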
     
    Last edited by a moderator: May 18, 2016
  9. 9to5destroyer

    9to5destroyer Jr. VIP Jr. VIP Premium Member

    Joined:
    Nov 14, 2011
    Messages:
    355
    Likes Received:
    205
    Thanks for your reply, loopline. The first site was actually

    site: posterous.com

    It seems that when you search for this site plus a keyword, e.g. site: posterous.com blue widget, Google only returns
    results from the main domain and not from any subdomains (or very few), hence the limited results.
    Strange how it does this compared to other sites.
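
One way to confirm the main-domain-only behaviour on a harvested list is to count hosts. A minimal Python sketch, assuming the list is already loaded as plain URLs; `host_breakdown` and the sample URLs (including the subdomain name) are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_breakdown(urls, root_domain):
    """Count how many URLs sit on the main domain, a subdomain, or elsewhere."""
    counts = Counter()
    for url in urls:
        host = (urlsplit(url).hostname or "").lower()
        if host in (root_domain, "www." + root_domain):
            counts["main domain"] += 1
        elif host.endswith("." + root_domain):
            counts["subdomain"] += 1
        else:
            counts["other"] += 1
    return counts


# Hypothetical harvested results for a site: query plus keyword.
results = [
    "http://posterous.com/explore",
    "http://www.posterous.com/faq",
    "http://someblog.posterous.com/blue-widget",
    "http://example.org/unrelated",
]
breakdown = host_breakdown(results, "posterous.com")
print(dict(breakdown))
```

If nearly everything lands in "main domain", the keyword query really is skipping the subdomains, which matches what shows up in the browser.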
     
  10. loopline

    loopline Jr. VIP Jr. VIP

    Joined:
    Jan 25, 2009
    Messages:
    3,368
    Likes Received:
    1,797
    Gender:
    Male
    Home Page:
    When you search that site in a browser it says 13 million results but will not show more than 600. When I search in a browser for site: posterous.com and a keyword, you are correct, it only shows the main domain. This is when I'm searching in a web browser.

    What I don't know is why Google displays only the main domain; I am not an expert on Google's internal algos. What I can say is that you can mess with footprints, but if Google only wants to show the main domain with a keyword, then you can't make Google show more. Scrapebox can only get what Google gives, and in this case Google isn't giving more.

    Perhaps they have manually set it, perhaps this niche/site falls into a different algo category, perhaps a million other reasons. The bottom line is pretty straightforward: if you can't get Google to give more in a browser, then of course Scrapebox can't force Google to give more either. So it's a case of "it is what it is".

    Wish I had a better answer. I can make Scrapebox do a lot of things, but ripping results from Google that it won't give isn't one of them.
     
  11. 9to5destroyer

    9to5destroyer Jr. VIP Jr. VIP Premium Member

    Joined:
    Nov 14, 2011
    Messages:
    355
    Likes Received:
    205
    Thanks for the reply, loopline. Yeah, sorry, I should have said at the start that it was a Google footprint issue. I was just wondering if anyone had come across this before and had a different way of scraping or a different footprint.
    I would agree that Google may have different rules for how it indexes these larger sites. Bit of a pain, but I'll just alter how I'm doing things to get round it.
     
  12. loopline

    loopline Jr. VIP Jr. VIP

    Joined:
    Jan 25, 2009
    Messages:
    3,368
    Likes Received:
    1,797
    Gender:
    Male
    Home Page:
    I can't say that I have come across it before, but then I wasn't really looking either. I don't do a lot of site: searches. I can't think of any way in Google to go about it differently. I tried lots of combos and other creative uses of operators and I get even fewer results. Google seems to not want to give out info about that site.
     
  13. LarryPage

    LarryPage Registered Member

    Joined:
    Jan 9, 2013
    Messages:
    99
    Likes Received:
    2
    When I use the link extractor on a website I keep getting a 403 error. Any idea why?