Need Help with a Project !!! URL Big Scale Scraping

Discussion in 'Black Hat SEO' started by richfrog, Dec 11, 2010.

  1. richfrog

    richfrog Registered Member

    Joined:
    Dec 10, 2010
    Messages:
    52
    Likes Received:
    0
    Need some advice on a project I need to complete: I have to scrape/harvest millions of URLs from multiple search keywords.
    Is this possible? I know of Scrapebox, but would it do the job? I'd like to know before I go ahead and purchase it.
    For an example of my project:
    Keywords = home insurance, car insurance, etc.
    The software would then scrape all URLs of sites containing that keyword. It would need to be the exact URL as shown in the Google SERP, e.g. admiral.co.uk/homeinsurance

    Obviously there would be millions of URLs for each keyword, which I would need to save.
    So in a nutshell, what would be the best way of doing this?

    Thanks in advance Richfrog :cool:
     
    Last edited: Dec 11, 2010
  2. dannyhw

    dannyhw Senior Member

    Joined:
    Jul 16, 2008
    Messages:
    980
    Likes Received:
    462
    Occupation:
    Software Engineer
    Location:
    New York City Burbs
    It's not hard to code this sort of thing. I've done this for many projects, saving keyword, URL and rank.

    The problem is that you can only see and save the first 1,000 results for any one search. If you absolutely need more than that from a single keyword, you're screwed.
     
  3. richfrog

    richfrog Registered Member

    OK, I see what you're saying: Google only shows the top 1,000 results for a search.
    But how does Scrapebox scrape millions of URLs?
    Going back to my example, if I typed car insurance into Scrapebox, wouldn't it find thousands or even millions of URLs for this keyword?
    There surely must be ways of getting millions of URLs. How do other people do it?
    There are some people on this forum who have.
     
  4. dannyhw

    dannyhw Senior Member

    Yeah, but it does that by searching for thousands of different keywords one after the other. First you use the wonder wheel tool or whatever to generate the related keywords, and then, when you have a big list, you scrape.
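A rough sketch of that "expand the keyword list first" step in Python, in case it helps. `related_keywords` and the `RELATED` table are made-up stand-ins for whatever suggestion tool you use (wonder wheel or otherwise); they are hard-coded here so the example runs offline, and `expand_keywords` is just an illustrative name.

```python
from collections import deque

# Hypothetical stand-in for a related-keyword tool. A real one would query
# some suggestion source instead of this hard-coded table.
RELATED = {
    "home insurance": ["cheap home insurance", "home insurance quotes"],
    "cheap home insurance": ["cheap home insurance uk"],
    "home insurance quotes": ["home insurance", "compare home insurance quotes"],
}

def related_keywords(keyword):
    return RELATED.get(keyword, [])

def expand_keywords(seed, max_depth):
    """Breadth-first expansion of a seed keyword, max_depth levels deep."""
    seen = {seed}
    order = [seed]
    queue = deque([(seed, 0)])
    while queue:
        keyword, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand keywords found at the deepest level
        for rel in related_keywords(keyword):
            if rel not in seen:  # skip keywords already collected
                seen.add(rel)
                order.append(rel)
                queue.append((rel, depth + 1))
    return order

keywords = expand_keywords("home insurance", max_depth=2)
print(len(keywords), "keywords to search")
```

The `seen` set matters more than it looks: related-keyword suggestions tend to point back at each other, so without it the list would fill with repeats.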
     
  5. richfrog

    richfrog Registered Member

    Thanks Danny,
    So if I typed in Home Insurance, it would then generate a load of related keywords for it?
    It would then run a search for each of them, and this would get me my list of URLs. Does it only get 1,000 per keyword, then?
    What's the biggest list of targeted URLs you have made?

    On another point, how does Google find millions? Isn't it just a case of copying them, but obviously on a smaller scale and with a bigger time limit?

    Cheers
     
  6. dannyhw

    dannyhw Senior Member

    Yeah, it only gets 1,000 per keyword. I haven't done targeted lists with Scrapebox, but I've done several million when harvesting blog posts. I did save SERP positions 1-50 for 750,000 different keywords in my own database last year. Scraping from Google isn't really a problem; it's just that damn 1,000 limit.

    Not sure what you mean by how Google finds millions. I can't know this for sure, but they index however many pages it tells you there are and use that data to determine the SERPs. I would think they then cache the first 1,000 results in data centers whenever resources are available. I'm not sure about that, but it's not likely that every search actually calculates the results from scratch.
     
  7. richfrog

    richfrog Registered Member

    So you saved 750,000 x 1,000 = 750,000,000 URLs?
    Is that correct or am I being a noob lol

    So if I used the word insurance and put that with a huge keyword list like yours,
    e.g.
    insurance red
    insurance blue
    insurance dog etc. etc.

    I would be able to extract 1,000 search results for each of 750,000 keywords, and after removing the duplicates I would have tens of millions of URLs, do you think?
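For what it's worth, the multiplication itself is easy to check. A minimal sketch, assuming the 1,000-per-search cap mentioned earlier in the thread; `max_rows` is just an illustrative name, and the result is an upper bound before duplicates are removed, not a prediction of how many unique URLs you would end up with.

```python
RESULTS_PER_KEYWORD_CAP = 1_000  # the per-search limit mentioned in the thread

def max_rows(num_keywords, results_per_keyword=RESULTS_PER_KEYWORD_CAP):
    """Upper bound on scraped rows, before removing duplicates."""
    return num_keywords * min(results_per_keyword, RESULTS_PER_KEYWORD_CAP)

print(max_rows(750_000))      # 750000000 if every keyword returned the full 1,000
print(max_rows(750_000, 50))  # 37500000 if only positions 1-50 are saved
```

How far the unique count falls below that depends entirely on how much the keywords overlap, which this sketch cannot tell you.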
     
  8. dannyhw

    dannyhw Senior Member

    Oh no, the keyword list in my own app was all commercial keywords with bid price, search volume and so on. I just wanted to analyze the top 50 for each one for other factors.

    But if you use the wonder wheel scraper and go a few levels deep in Scrapebox, you should get a good amount of search results. You could just combine your term with an English dictionary file, but I don't really know what purpose that would serve.
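The "combine your term with a dictionary file" idea can be sketched in a few lines of Python. This is only an illustration: `combine_with_wordlist` is a made-up helper name, and the word list is a small inline sample rather than a real dictionary file.

```python
def combine_with_wordlist(term, words):
    """Return 'term word' phrases, skipping blanks and repeated words."""
    seen = set()
    phrases = []
    for raw in words:
        word = raw.strip().lower()  # tolerate trailing newlines and mixed case
        if not word or word in seen:
            continue
        seen.add(word)
        phrases.append(f"{term} {word}")
    return phrases

# In practice `words` could be the lines of a dictionary file, e.g.
# open("words.txt") (hypothetical filename); a small list keeps this runnable.
sample_words = ["red", "blue", "dog", "", "Dog\n"]
print(combine_with_wordlist("insurance", sample_words))
# ['insurance red', 'insurance blue', 'insurance dog']
```

Because the function only iterates over `words`, passing an open file object works the same way as passing a list.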
     
  9. zimsabre

    zimsabre Regular Member

    Joined:
    Nov 11, 2010
    Messages:
    255
    Likes Received:
    174
    Depends on what you want the URLs for. Do you want URLs with the keyword in the URL?

    Take care when using the wonder wheel or scraping some keywords.
    It's not as simple as it looks. The number of times I have scraped URLs for certain keywords and ended up with a URL for something completely different is crazy.
     
  10. richfrog

    richfrog Registered Member

    I don't need the keyword in the URL. I just need the keyword to appear in the website/page, then simply take that URL and store it.
    I tried a Google scrape program, but it only gave me 90 or so URLs per search, and each time I had to enter the keyword manually, so this would have worked eventually but would take years lol.
    This is why I'm after some software to automate the process, so that I just put in my specific keyword, insurance, followed by my keyword list after it, e.g.
    insurance dog
    insurance cat
    insurance the
    insurance sat etc. etc.

    until it builds a huge list of URLs for pages containing the word insurance and the keyword.
    Obviously it would have duplicates that I can remove; then I would be left with millions of URLs, all unique.

    Hope this makes sense :cool:
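The "remove the duplicates" step is the easy part. A minimal Python sketch, assuming two URLs count as the same when they differ only in the case of the scheme or host or in a trailing #fragment; that rule, the helper names and the example.com URLs are all illustrative choices, not anything a particular tool does.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Lowercase the scheme and host and drop any #fragment."""
    parts = urlsplit(url.strip())
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, "")
    )

def unique_urls(urls):
    """Keep the first occurrence of each normalized URL, preserving order."""
    seen = set()
    result = []
    for url in urls:
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result

harvested = [
    "http://Example.com/home-insurance",
    "http://example.com/home-insurance#quotes",
    "http://example.com/car-insurance?ref=1",
    "http://example.com/home-insurance",
]
print(unique_urls(harvested))
# ['http://example.com/home-insurance', 'http://example.com/car-insurance?ref=1']
```

A set of strings like this holds everything in memory, so it suits lists that fit in RAM; for much bigger lists, a database with a unique index on the URL column would be the usual alternative.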
     
  11. richfrog

    richfrog Registered Member

    Just to add, can Scrapebox find URLs with the keyword in the URL?
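Whatever the tool itself offers, a list that has already been harvested can be filtered for "keyword in the URL" afterwards. A small Python sketch under that assumption; `urls_containing` is a made-up helper, and ignoring case, spaces, hyphens and underscores is just one possible matching rule.

```python
def urls_containing(urls, keyword):
    """Return URLs containing keyword, ignoring case, spaces, hyphens and underscores."""
    def squash(text):
        return text.lower().replace("-", "").replace("_", "").replace(" ", "")

    needle = squash(keyword)
    return [url for url in urls if needle in squash(url)]

sample = [
    "http://example.com/home-insurance",
    "http://example.com/HomeInsurance/quotes",
    "http://example.com/car-loans",
]
print(urls_containing(sample, "home insurance"))
# ['http://example.com/home-insurance', 'http://example.com/HomeInsurance/quotes']
```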