Bulk Detecting Duplicate Content within a Site

    Dec 4, 2008
    I have a site that has grown over the years and now has over 50 million pages. Yes, that is right: 50 million pages, many of which are basically duplicates in a different context. I want to scan through the site to detect any pages that are basically duplicates of another page. Are there any tools that can index the pages and list the duplicates, so that I can noindex the duplicate content?

    I have root access to the server, so I can run something locally on the (Linux) server, or I can use a web service. Note that at this point I am only interested in internal duplicates, not other copies elsewhere on the web.
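
    To make the goal concrete, here is a minimal sketch of the kind of scan I have in mind. It is an illustration only, not an existing tool: the function names `fingerprint` and `find_duplicates` and the sample URLs are made up, and it assumes the page HTML is already available locally. It strips markup, normalizes case and whitespace, and groups pages whose remaining text hashes to the same value. This only catches pages whose visible text is identical after normalization; pages that differ by a few words would need a fuzzier comparison.

```python
import hashlib
import re
from collections import defaultdict

# Crude tag stripper and whitespace collapser; a real scan would
# probably want a proper HTML parser (assumption for illustration).
TAG_RE = re.compile(r"<[^>]+>")
WS_RE = re.compile(r"\s+")


def fingerprint(html):
    """Hash the text of a page, ignoring markup, case and whitespace."""
    text = TAG_RE.sub(" ", html)
    text = WS_RE.sub(" ", text).strip().lower()
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def find_duplicates(pages):
    """Group URLs by fingerprint.

    pages: iterable of (url, html) pairs.
    Returns a list of URL lists, one per group of two or more pages
    that share the same fingerprint.
    """
    groups = defaultdict(list)
    for url, html in pages:
        groups[fingerprint(html)].append(url)
    return [urls for urls in groups.values() if len(urls) > 1]


if __name__ == "__main__":
    # Hypothetical sample pages: /a and /b differ only in whitespace.
    sample = [
        ("/a", "<html><body><h1>Widget</h1><p>Blue widget.</p></body></html>"),
        ("/b", "<html><body><h1>Widget</h1>\n<p>Blue   widget.</p></body></html>"),
        ("/c", "<html><body><h1>Gadget</h1></body></html>"),
    ]
    for group in find_duplicates(sample):
        print(group)
```

    Only one 40-character hash per page has to be kept, so in principle the pages can be streamed from disk one at a time rather than all loaded at once.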