How to scrape all indexed pages of a site?

Discussion in 'Black Hat SEO' started by Drago05, Dec 3, 2011.

  1. Drago05

    Drago05 Junior Member

    Joined:
    Oct 31, 2010
    Messages:
    151
    Likes Received:
    10
    Location:
    Europe
    I want to scrape the URLs of all pages of a site that has 20,000 pages indexed in Google. How can I do this? I know that the "site:" operator doesn't return every indexed page, only a limited number. Is there any way around this that will let me scrape all of the indexed pages?
     
  2. seoguru13

    seoguru13 Senior Member

    Joined:
    Jan 9, 2011
    Messages:
    1,074
    Likes Received:
    692
    Occupation:
    Digital Marketing Consultant
    Location:
    India
    There's a tool called WinHTTrack that can rip HTML sites completely. It should be helpful.
     
  3. Knoxgates

    Knoxgates Supreme Member

    Joined:
    Aug 9, 2008
    Messages:
    1,266
    Likes Received:
    919
    Scrape their sitemap by going to URL/sitemap.xml. This only works if they have installed a sitemap plugin on their blog.
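
    Once the sitemap file is saved locally, pulling the URLs out of it is a small parsing job. Below is a minimal sketch, assuming the file follows the standard sitemaps.org layout with `<loc>` elements. The function name `extract_sitemap_urls` and the sample XML are made up for illustration and do not come from any real site.

```python
import xml.etree.ElementTree as ET


def extract_sitemap_urls(xml_text):
    """Return every <loc> value found in a sitemap (or sitemap index) document."""
    root = ET.fromstring(xml_text)
    urls = []
    for element in root.iter():
        # Namespaced tags look like "{namespace}loc"; keep only the local name.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "loc" and element.text:
            urls.append(element.text.strip())
    return urls


# Hypothetical sample standing in for a downloaded sitemap.xml.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.site.com/</loc></url>
  <url><loc>http://www.site.com/page-1.html</loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in extract_sitemap_urls(SAMPLE_SITEMAP):
        print(url)
```

    If the file turns out to be a sitemap index, the same function returns the child sitemap URLs, which would then each need to be fetched and parsed in turn.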
     
  4. roamer

    roamer Power Member

    Joined:
    Dec 2, 2008
    Messages:
    500
    Likes Received:
    480
    Occupation:
    Gfx designer, vfx and mgfx
    Location:
    plɹoʍ ǝɥʇ punoɹɐ ƃuıɯɐoɹ
    Try the "alphabet" trick:

    a site:yoursite.com
    b site:yoursite.com
    c site:yoursite.com
    ...
    z site:yoursite.com

    It should give you more than the standard output (1,000, I think). Then just use a tool to filter out duplicate URLs. You can also use "keyword site:" and even "(number) site:" queries to get more results.

    Hope this helps :).
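
    Typing those queries by hand gets tedious, so here is a rough sketch that generates them. It assumes you want one query per letter, one per digit, and one per extra keyword; `build_site_queries` and its parameters are hypothetical names, and the script only builds the query strings without sending anything to a search engine.

```python
import string


def build_site_queries(domain, extra_terms=()):
    """Build 'term site:domain' queries for a-z, 0-9, and any extra keywords."""
    terms = list(string.ascii_lowercase) + list(string.digits) + list(extra_terms)
    return [f"{term} site:{domain}" for term in terms]


if __name__ == "__main__":
    # "yoursite.com" is the placeholder domain used in the post above.
    for query in build_site_queries("yoursite.com", extra_terms=["keyword"]):
        print(query)
```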
     
  5. extremephp

    extremephp BANNED BANNED

    Joined:
    Oct 19, 2010
    Messages:
    1,293
    Likes Received:
    1,274
    Get ScrapeBox and work with footprints :D

    "site:" should give you everything, if I am not wrong. Or try "site:site.com a", then "b", and so on up to "z", and remove duplicates.

    I can do it for you if you can pay me $1 per scraped URL :D
     
  6. themidiman

    themidiman Power Member

    Joined:
    Feb 25, 2011
    Messages:
    700
    Likes Received:
    1,542
    If you have ScrapeBox and they have a sitemap, use the Sitemap Scraper addon.
     
  7. Drago05

    Drago05 Junior Member

    Joined:
    Oct 31, 2010
    Messages:
    151
    Likes Received:
    10
    Location:
    Europe
    Can you elaborate a little more about this method?
     
  8. Swiss

    Swiss Power Member

    Joined:
    Jun 3, 2011
    Messages:
    551
    Likes Received:
    324
    Location:
    Take a guess
    That way, other results from that site can come up. For example, the keyword "a" might return pages that don't come up for the keyword "b".

    Google only shows 1,000 results per query, so this gives you a better chance of getting all indexed pages.

    Then remove the duplicates.
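
    The merge step can be sketched like this, assuming each query's results have already been collected into a plain list of URL strings. `merge_results` is a hypothetical helper, and the sample URLs are invented to show two queries overlapping on one page.

```python
def merge_results(result_lists):
    """Merge per-query URL lists into one list, keeping first-seen order."""
    seen = set()
    merged = []
    for urls in result_lists:
        for url in urls:
            cleaned = url.strip()  # ignore stray whitespace from copy/paste
            if cleaned and cleaned not in seen:
                seen.add(cleaned)
                merged.append(cleaned)
    return merged


if __name__ == "__main__":
    # Invented results for the "a" and "b" queries; page-2 appears in both.
    results_a = ["http://www.site.com/page-1", "http://www.site.com/page-2"]
    results_b = ["http://www.site.com/page-2", "http://www.site.com/page-3"]
    for url in merge_results([results_a, results_b]):
        print(url)
```

    Only exact string matches are treated as duplicates here, so variants such as a trailing slash or "www" versus no "www" would survive as separate entries.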
     
    Last edited: Dec 3, 2011
  9. extremephp

    extremephp BANNED BANNED

    Joined:
    Oct 19, 2010
    Messages:
    1,293
    Likes Received:
    1,274
    Open a text file

    Start typing this:

    a site:
    b site:
    c site:
    ...
    z site:

    You can also add different words after that, for example:

    a site:
    b site:
    z site:
    keyword site:
    blahblah site:

    and save the file.

    Now open ScrapeBox. In the left pane, click Custom Footprint. In the keyword section, enter site.com and also http://www.site.com (use your target site). Click the M button and select that footprint text file.

    Go scrape like hell :D
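
    The footprint text file can also be generated instead of typed. This is only a sketch: the file name `footprints.txt`, the function `write_footprint_file`, and the extra words are assumptions for illustration, and the output is simply one "term site:" line per entry as listed in the post.

```python
import string
from pathlib import Path


def write_footprint_file(path, extra_terms=()):
    """Write one 'term site:' footprint per line and return the lines written."""
    terms = list(string.ascii_lowercase) + list(extra_terms)
    lines = [f"{term} site:" for term in terms]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


if __name__ == "__main__":
    # "keyword" and "blahblah" are the placeholder words from the post above.
    written = write_footprint_file("footprints.txt", ["keyword", "blahblah"])
    print(f"Wrote {len(written)} footprints")
```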
     