How to scrape all indexed pages of a site?

Discussion in 'Black Hat SEO' started by Drago05, Dec 3, 2011.

  1. Drago05

    Drago05 Junior Member

    Joined:
    Oct 31, 2010
    Messages:
    151
    Likes Received:
    10
    Location:
    Europe
    I want to scrape the URLs of all pages of a site that has 20,000 pages indexed in Google. How can I do this? I know that the "site:" operator doesn't return every indexed page, only a limited number. Is there any way around this that will let me scrape all indexed pages?
     
  2. seoguru13

    seoguru13 Senior Member

    Joined:
    Jan 9, 2011
    Messages:
    1,063
    Likes Received:
    691
    Occupation:
    Businessman - SEO Consultant & Writer
    Location:
    India
    Home Page:
    There's a tool called WinHTTrack that can rip HTML sites completely. Should be helpful.
     
    • Thanks Thanks x 1
  3. Knoxgates

    Knoxgates Supreme Member

    Joined:
    Aug 9, 2008
    Messages:
    1,266
    Likes Received:
    918
    Scrape their sitemap by going to URL/sitemap.xml. This only works if they have installed a sitemap plugin on their blog.
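
A rough sketch of the sitemap approach in Python, assuming the site serves a standard sitemaps.org XML file. The function names and the example URLs are made up for illustration, and nothing here is specific to any plugin.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap and sitemap index files.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_locs(xml_text):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter(SITEMAP_NS + "loc") if el.text]


def fetch_sitemap(url):
    """Download a sitemap and return its raw bytes (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


# Small inline sample so the parsing step can be tried without a network call.
SAMPLE = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>http://yoursite.com/page-1</loc></url>"
    "<url><loc>http://yoursite.com/page-2</loc></url>"
    "</urlset>"
)

for page_url in extract_locs(SAMPLE):
    print(page_url)
```

For a real site you would pass the result of fetch_sitemap("http://yoursite.com/sitemap.xml") to extract_locs. If the file is a sitemap index, the returned values are themselves sitemap URLs that need to be fetched and parsed the same way.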
     
    • Thanks Thanks x 1
  4. roamer

    roamer Power Member

    Joined:
    Dec 2, 2008
    Messages:
    500
    Likes Received:
    479
    Occupation:
    Gfx designer, vfx and mgfx
    Location:
    plɹoʍ ǝɥʇ punoɹɐ ƃuıɯɐoɹ
    Try the "alphabet" trick:

    a site:yoursite.com
    b site:yoursite.com
    c site:yoursite.com
    ...
    z site:yoursite.com

    It should give you more than the standard output (1,000 results, I think). Then just use a tool to filter out duplicate URLs. You can also use "keyword site:" and even "(number) site:" queries to get more results.

    Hope this helps :).
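
The query list for the "alphabet" trick can be generated rather than typed by hand. This is only a sketch: the function name is made up, and using the digits 0-9 for the "(number)" variant is an assumption about what was meant.

```python
import string


def build_site_queries(domain, extra_terms=()):
    """Build "<term> site:<domain>" queries for a-z, 0-9 and any extra keywords."""
    terms = list(string.ascii_lowercase)
    terms += [str(digit) for digit in range(10)]  # assumed meaning of "(number) site:"
    terms += list(extra_terms)
    return [f"{term} site:{domain}" for term in terms]


queries = build_site_queries("yoursite.com", extra_terms=["keyword"])
print(queries[0])    # a site:yoursite.com
print(queries[-1])   # keyword site:yoursite.com
print(len(queries))  # 37
```

Each query is then run separately and the result URLs are merged into one list before removing duplicates.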
     
    • Thanks Thanks x 3
  5. extremephp

    extremephp BANNED BANNED

    Joined:
    Oct 19, 2010
    Messages:
    1,293
    Likes Received:
    1,272
    Get ScrapeBox and work with footprints :D

    "site:" should give you everything, if I am not wrong. Or try "site:site.com a", "site:site.com b", and so on up to z, and remove duplicates.

    I can do it for you if you can pay me $1 per scraped URL :D
     
    • Thanks Thanks x 1
  6. themidiman

    themidiman Power Member

    Joined:
    Feb 25, 2011
    Messages:
    701
    Likes Received:
    1,535
    Location:
    root@pts/0
    If you have ScrapeBox and they have a sitemap, use the sitemap scraper addon.
     
    • Thanks Thanks x 1
  7. Drago05

    Drago05 Junior Member

    Joined:
    Oct 31, 2010
    Messages:
    151
    Likes Received:
    10
    Location:
    Europe
    Can you elaborate a little more about this method?
     
  8. Swiss

    Swiss Power Member

    Joined:
    Jun 3, 2011
    Messages:
    551
    Likes Received:
    323
    Location:
    Take a guess
    That way there's a chance that different results from that site come up. For example, the keyword "a" might return pages that don't come up for the keyword "b".

    Google only shows 1,000 results per query, so this way you have a better chance of getting all indexed pages.

    Remove duplicates etc.
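
A minimal sketch of the duplicate-removal step, assuming the scraped URLs have been collected into a plain list. The normalization rules here (lowercasing the host, dropping the #fragment, treating an empty path as "/") are one possible choice, not something any particular tool is known to do.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Lowercase scheme and host, drop the fragment, and default the path to "/"."""
    parts = urlsplit(url.strip())
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def dedupe(urls):
    """Return normalized URLs in first-seen order with duplicates removed."""
    seen = set()
    unique = []
    for url in urls:
        if not url.strip():
            continue  # skip blank lines from merged result files
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


scraped = [
    "http://Site.com/a",
    "http://site.com/a#top",
    "http://site.com/b",
    "",
    "http://site.com",
]
print(dedupe(scraped))
```

Keeping first-seen order makes it easier to compare the merged list against the output of individual queries.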
     
    • Thanks Thanks x 1
    Last edited: Dec 3, 2011
  9. extremephp

    extremephp BANNED BANNED

    Joined:
    Oct 19, 2010
    Messages:
    1,293
    Likes Received:
    1,272
    Open a text file

    Start typing this :

    a site:
    b site:
    c site:
    ... up to z site:

    You can also add different words after that, for example:

    a site:
    b site:
    z site:
    keyword site:
    blahblah site:

    and save the file.

    Now open ScrapeBox. In the left pane, click Custom Footprint. In the keyword section, enter site.com and also http://www.site.com (use your target site). Click the M button and select that footprint text file.

    Go scrape like hell :D
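
The footprint text file described above can also be written by a short script. This is a sketch only: the file name footprints.txt and both function names are made up, and it assumes one footprint per line is the layout wanted.

```python
import string
from pathlib import Path


def footprint_lines(extra_words=()):
    """Return "a site:" through "z site:" plus one line per extra word."""
    lines = [f"{letter} site:" for letter in string.ascii_lowercase]
    lines += [f"{word} site:" for word in extra_words]
    return lines


def write_footprint_file(path, extra_words=()):
    """Write the footprint lines to a text file, one per line (hypothetical helper)."""
    Path(path).write_text("\n".join(footprint_lines(extra_words)) + "\n", encoding="utf-8")


lines = footprint_lines(["keyword", "blahblah"])
print(lines[0])    # a site:
print(lines[-1])   # blahblah site:
```

Calling write_footprint_file("footprints.txt", ["keyword", "blahblah"]) saves the list so it can be selected as the footprint file.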
     
    • Thanks Thanks x 1