Big data for serious SEOs

Discussion in 'Black Hat SEO' started by thejake, Jul 2, 2013.

  1. thejake

    thejake Jr. VIP Premium Member

    Joined:
    Nov 13, 2009
    Messages:
    685
    Likes Received:
    828
    If you're serious about your SEO analysis, there's a project called Common Crawl that offers a 100TB web archive of around 6 billion web pages, downloadable from Amazon S3. It used to be a bit of a pain to use, but they've added a tool at hxxp://urlsearch.commoncrawl.org/ that lets you list all the indexed URLs for a site, view the page source, and download a JSON file of the URLs and their pointers into the archive. From there it's fairly easy to run certain kinds of analysis that are hard to do with other tools unless you do your own spidering.
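    As a rough sketch of the kind of analysis this makes possible, the snippet below counts a site's crawled URLs by top-level section. The layout of the downloaded file is an assumption here: it is treated as a JSON array of objects that each carry a "url" key, and the names load_records and count_by_section are made up for illustration. Check the real export and adjust the key names before relying on it.

```python
import json
from collections import Counter
from urllib.parse import urlsplit


def load_records(path):
    """Load the downloaded export (ASSUMED to be a JSON array of objects)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def count_by_section(records):
    """Count URLs by their first path segment; the site root counts as '/'."""
    counts = Counter()
    for rec in records:
        # "url" is an assumed key name, not a documented field.
        path = urlsplit(rec["url"]).path
        segments = [s for s in path.split("/") if s]
        counts[segments[0] if segments else "/"] += 1
    return counts


# Hypothetical records standing in for a real export.
sample_records = [
    {"url": "http://example.com/"},
    {"url": "http://example.com/blog/post-1"},
    {"url": "http://example.com/blog/post-2"},
    {"url": "http://example.com/shop/item?id=3"},
]

for section, n in count_by_section(sample_records).most_common():
    print(section, n)
```

    Comparing these counts against what you know is on the site gives a quick picture of which sections the crawl actually covered.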

    Common Crawl has other interesting data too. Their meta files give the HTTP response codes for URLs, which makes it fairly easy to research what's redirecting where, provided you have the tools (think Hadoop) and the skill to process all that data. There are also text-only sets, which approximate what a search engine would extract to determine the content of a page.
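    For the redirect research, a minimal single-machine sketch might look like the following. It assumes you have already pulled the response data out of the meta files into plain dicts with "url", "status", and "location" keys. Those key names and the helper functions are hypothetical, not Common Crawl's actual schema, and at full archive scale the same logic would run as a Hadoop job rather than in memory.

```python
from urllib.parse import urljoin

# HTTP status codes that indicate a redirect.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def build_redirect_map(records):
    """Map each redirecting URL to its absolute target URL."""
    redirects = {}
    for rec in records:
        # "status" and "location" are assumed key names.
        if rec.get("status") in REDIRECT_CODES and rec.get("location"):
            # Location headers may be relative, so resolve against the source URL.
            redirects[rec["url"]] = urljoin(rec["url"], rec["location"])
    return redirects


def resolve(url, redirects, max_hops=10):
    """Follow redirects from url and return the full chain of URLs visited."""
    chain = [url]
    seen = {url}
    while chain[-1] in redirects and len(chain) <= max_hops:
        nxt = redirects[chain[-1]]
        chain.append(nxt)
        if nxt in seen:  # redirect loop: stop after recording the repeat
            break
        seen.add(nxt)
    return chain


# Hypothetical records standing in for parsed meta-file entries.
sample_meta = [
    {"url": "http://example.com/a", "status": 301, "location": "http://example.com/b"},
    {"url": "http://example.com/b", "status": 302, "location": "/c"},
    {"url": "http://example.com/c", "status": 200},
]

redirect_map = build_redirect_map(sample_meta)
print(" -> ".join(resolve("http://example.com/a", redirect_map)))
```

    The max_hops cap and the seen set are there so that long chains and redirect loops in the data can't hang the script.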

    If you're already using CC data or providing a service based on it, please consider supporting their efforts. Karma and all that.
     
  2. innozemec

    innozemec Jr. VIP

    Joined:
    Aug 19, 2011
    Messages:
    5,288
    Likes Received:
    1,799
    Location:
    www.Indexification.com
    Sounds great, but I tried about 10 of my sites, some of them 5+ years old, and got 0 results for all of them :(

    Did you have any success yourself?
     
  3. thejake

    thejake Jr. VIP Premium Member

    Joined:
    Nov 13, 2009
    Messages:
    685
    Likes Received:
    828
    Yes. The main difference is that they crawl less: one of my sites with 11k pages indexed in Google has only 654 in CC, and they don't seem to have crawled my blogs that don't use pretty permalinks.
     
  4. soriful

    soriful Newbie

    Joined:
    Jun 29, 2013
    Messages:
    5
    Likes Received:
    0
    Nice, I tried about 7 of my sites, some of them 1+ years old, and got good results for all of them.
     
  5. prab1996

    prab1996 Elite Member

    Joined:
    Jan 8, 2013
    Messages:
    3,496
    Likes Received:
    2,028
    Occupation:
    your gf's <3 ♥♥♥♥
    Location:
    Prab1996.com
    I'm not good at SEO, and this looks like something for pros.

    Just bumping the thread for a good cause.