
Needing to scrape 160k results for one footprint (Google) (and Scrapebox Question)

Discussion in 'Black Hat SEO' started by mrmarchuk, Feb 24, 2014.

  1. mrmarchuk

    mrmarchuk Registered Member

    Joined:
    Jun 4, 2012
    Messages:
    65
    Likes Received:
    30
    Occupation:
    Working From Home
    Location:
    Portland, OR
    I know this sounds stupid and crazy, but I've got a footprint that I need to run to get archives of a website's previous listings - 160k URLs (which I will custom-scrape for info at a later time).

    Is this something that Scrapebox can do? The most I've been able to have it scrape at a time is ~2k URLs. Is there a way to get it to resume where it left off?

    OR

    Would I be better off having someone develop a customized scraper for this one project?

    FYI, the footprint is something along the lines of:

    Code:
    site:website.tld/subfolder/*(wildcard)/Printview
    edit: I should mention that TIME is not an issue. I can have it scrape several thousand a day, in intervals, etc. I'd just like to have that data within the next few months.
     
  2. mrmarchuk

    mrmarchuk Registered Member

    Not one answer or tidbit of insight?
     
  3. stugz

    stugz Junior Member

    Joined:
    Apr 14, 2013
    Messages:
    154
    Likes Received:
    33
    Google returns at most 1000 results per search query, so you need to add keywords to the query to split the results into smaller sets. Look for words that are common on the site, or use a massive list of generic keywords. I guess you want to look at the Google cache; otherwise it would be better to just spider the site.

    Code:
    site:website.tld/subfolder/*(wildcard)/Printview keyword1
    site:website.tld/subfolder/*(wildcard)/Printview keyword2
    site:website.tld/subfolder/*(wildcard)/Printview keyword1000
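
    If you'd rather generate that query list than type it out, here is a minimal Python sketch. It assumes the placeholder footprint from this thread (with the "(wildcard)" note dropped) and a keyword list you supply yourself; `build_queries` is a made-up helper name, not part of Scrapebox or any other tool. It only builds the query strings, one per unique keyword, which you could then paste into whatever harvester you use.

```python
from typing import Iterable, List

# Placeholder footprint from the thread, not a real site.
FOOTPRINT = "site:website.tld/subfolder/*/Printview"


def build_queries(footprint: str, keywords: Iterable[str]) -> List[str]:
    """Return one query per unique, non-empty keyword, keeping input order."""
    seen = set()
    queries = []
    for raw in keywords:
        kw = raw.strip()
        key = kw.lower()  # treat "Auction" and "auction" as the same keyword
        if not kw or key in seen:
            continue
        seen.add(key)
        queries.append(f"{footprint} {kw}")
    return queries


if __name__ == "__main__":
    # Hypothetical keywords; blanks and case-insensitive repeats are skipped.
    for query in build_queries(FOOTPRINT, ["auction", "lot", "Auction", " "]):
        print(query)
```

    Since each query still tops out at 1000 results, a longer and more varied keyword list should cover more of the 160k URLs, though there is no guarantee it reaches all of them.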
     
  4. mrmarchuk

    mrmarchuk Registered Member

    Ah, that makes total sense. Should be relatively easy to get a good bit of the indexed URLs if mixed with keywords.

    I would normally just spider the site, but these URLs are old auction listings that are no longer "part of the website".
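
    Because different keyword queries will return many of the same URLs, the daily batches need merging and de-duplicating. A rough Python sketch of that step follows, assuming each batch is simply a list of harvested URL strings. The host, the path pattern, and the `merge_batches` name are placeholders based on the footprint in this thread, not anything a real tool provides.

```python
import re
from typing import Iterable, List
from urllib.parse import urlsplit

# Hypothetical path pattern mirroring the thread's placeholder footprint.
PRINTVIEW = re.compile(r"^/subfolder/[^/]+/printview/?$", re.IGNORECASE)


def merge_batches(batches: Iterable[Iterable[str]],
                  host: str = "website.tld") -> List[str]:
    """Merge harvested batches, keeping the first copy of each matching URL."""
    seen = set()
    merged = []
    for batch in batches:
        for raw in batch:
            url = raw.strip()
            parts = urlsplit(url)
            netloc = parts.netloc.lower()
            if netloc.startswith("www."):
                netloc = netloc[4:]
            # Skip other hosts and pages that are not Printview listings.
            # URLs without a scheme have an empty netloc and are skipped too.
            if netloc != host or not PRINTVIEW.match(parts.path):
                continue
            # Ignore scheme, "www." and a trailing slash when comparing.
            key = netloc + parts.path.rstrip("/")
            if key not in seen:
                seen.add(key)
                merged.append(url)
    return merged


if __name__ == "__main__":
    day1 = ["http://website.tld/subfolder/123/Printview",
            "http://website.tld/about"]
    day2 = ["https://www.website.tld/subfolder/123/Printview/",
            "http://website.tld/subfolder/456/Printview"]
    for url in merge_batches([day1, day2]):
        print(url)
```

    Running the merge after every session also shows how many new URLs each keyword batch actually added, which is a reasonable signal for when the keyword list has stopped paying off.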