
Google Scoring Gibberish Content to Demote Pages in Rankings?

Discussion in 'White Hat SEO' started by ERozon, Oct 28, 2013.

  1. ERozon

    ERozon Newbie

    Joined:
    Jul 24, 2012
    Messages:
    4
    Likes Received:
    0
    Google Scoring Gibberish Content to Demote Pages in Rankings?

    This week, Google was awarded a patent that describes how it might score content based on how much gibberish it contains, a score that could then be used to demote pages in search results. Gibberish content here refers to content that is likely representative of spam.
    The patent defines gibberish content as web pages that contain a number of high-value keywords but might have been generated through:

    • Using low-cost untrained labor (from places like Mechanical Turk)
    • Scraping content and modifying and splicing it randomly
    • Translating from a different language
    Gibberish content also tends to include text sequences that are unlikely to represent natural language: strings that do not follow conversational syntax, or that would not typically occur in resources such as web documents.
    The patent tells us that spammers might generate revenue from the traffic to gibberish web pages by including:

    • Advertisements
    • Pay-per-click links
    • Affiliate programs
    It also tells us that since those pages were created "using high value keywords without context, the web page typically does not provide any useful information to a user."
    This process involves:

    • Creating language models for pages on the Web, and applying those models to the text of pages.
    • Generating a language model score for the resource including applying a language model to the text content of the resource
    • Generating a query stuffing score for the resource, the query stuffing score being a function of term frequency in the resource content and a query index
    • Calculating a gibberish score for the resource using the language model score and the query stuffing score
    • Using the calculated gibberish score to determine whether to modify a ranking score of the resource
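    The steps above can be sketched in a few lines of Python. This is purely illustrative: the patent does not disclose its actual scoring functions, so the language model (`ngram_logprob`), the query index, the score weights, the threshold, and the demotion penalty here are all hypothetical stand-ins.

```python
def language_model_score(text, ngram_logprob, n=5):
    # Average log-probability of the page's word n-grams under a
    # background language model; fluent text scores higher than
    # scraped-and-spliced gibberish. `ngram_logprob` is a hypothetical
    # model supplied by the caller.
    words = text.lower().split()
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not grams:
        return 0.0
    return sum(ngram_logprob(g) for g in grams) / len(grams)

def query_stuffing_score(text, query_index):
    # Fraction of the page's terms that appear in a set of high-value
    # query terms -- a crude stand-in for "a function of term frequency
    # in the resource content and a query index".
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(1 for w in words if w in query_index) / len(words)

def gibberish_score(lm_score, stuffing, w_lm=1.0, w_qs=1.0):
    # Heavy query stuffing plus a low language-model score yields a
    # high gibberish score. The combination and weights are guesses.
    return w_qs * stuffing - w_lm * lm_score

def adjusted_ranking(base_score, gibberish, threshold=0.5, penalty=0.5):
    # Demote the page's ranking score only if the gibberish score
    # crosses a (made-up) threshold.
    return base_score * penalty if gibberish > threshold else base_score
```

    Under this sketch, a keyword-stuffed page whose text reads poorly under the language model gets its base ranking score cut, while ordinary pages pass through unchanged.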
    These gibberish scores might be created for each page based upon multiple queries that are contained on those pages.
    The pages may be ranked initially by information retrieval relevance scores and importance scores such as PageRank.
    Pages may then be re-ranked or demoted based upon a statistical review in which the content of those pages is broken down into n-grams, such as 5-word n-grams: consecutive groupings of the words found on a page. Statistics about those groupings can then be compared against n-gram statistics from other pages on the Web. An example n-gram analysis of a well-known phrase, using 5-word n-grams:
    The quick brown fox jumps
    quick brown fox jumps over
    brown fox jumps over the
    fox jumps over the lazy
    jumps over the lazy dog
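    That sliding-window breakdown can be reproduced in a few lines of Python (the windowing logic is standard; nothing here comes from the patent itself):

```python
def ngrams(text, n=5):
    # Slide a window of n consecutive words across the text.
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

for gram in ngrams("The quick brown fox jumps over the lazy dog"):
    print(gram)
# Prints the five 5-grams listed above, one per line.
```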

    The statistical patterns found in a language model can be used to identify languages, to apply machine translation, and to do optical character recognition.
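    As a toy illustration of those statistical patterns, even a tiny add-one-smoothed bigram model (an assumed model for illustration; the patent does not specify one) scores fluent word order higher than the same words scrambled:

```python
import math
from collections import Counter

def train_bigram_model(corpus):
    # Count word bigrams in a small corpus and return a scorer giving
    # the average add-one-smoothed log-probability of a sentence's bigrams.
    words = corpus.lower().split()
    bigrams = Counter(zip(words, words[1:]))
    unigrams = Counter(words)
    vocab_size = len(unigrams)

    def score(sentence):
        toks = sentence.lower().split()
        pairs = list(zip(toks, toks[1:]))
        if not pairs:
            return float("-inf")
        total = sum(
            math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
            for a, b in pairs
        )
        return total / len(pairs)

    return score

score = train_bigram_model(
    "the quick brown fox jumps over the lazy dog "
    "the dog sleeps and the fox jumps again"
)
fluent = score("the fox jumps over the dog")
scrambled = score("dog the over jumps fox the")
# Fluent word order gets the higher (less negative) average log-probability.
```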
    The patent is:
    Identifying gibberish content in resources
    Invented by Shashidhar A. Thakur, Sushrut Karanjkar, Pavel Levin, and Thorsten Brants
    Assigned to Google
    US Patent 8,554,769
    Granted October 8, 2013
    Filed: June 17, 2009

    Abstract
    This specification describes technologies relating to providing search results.
    One aspect of the subject matter described in this specification can be embodied in methods that include the actions of receiving a network resource, the network resource including text content; generating a language model score for the resource including applying a language model to the text content of the resource; generating a query stuffing score for the resource, the query stuffing score being a function of term frequency in the resource content and a query index; calculating a gibberish score for the resource using the language model score and the query stuffing score; and using the calculated gibberish score to determine whether to modify a ranking score of the resource.
    It's not a surprise that Google might use natural language statistical models like the one described here to identify content that they might consider low quality. Having a technical name (gibberish content) for that kind of content is helpful, as is a patent to point others to when describing the dangers of creating low quality content through one approach or another.

    Source: SeoByTheSea (dot) com
     
  2. ERozon

    ERozon Newbie

    Joined:
    Jul 24, 2012
    Messages:
    4
    Likes Received:
    0
    Trying to change the font to WHITE so that everyone can read it, but the system is asking me for a donation so that I can do that. :confused:
     
  3. BlueOrchard

    BlueOrchard Regular Member

    Joined:
    Feb 21, 2011
    Messages:
    349
    Likes Received:
    207
    Occupation:
    Day Trader
    Location:
    Florida, United States

    For anyone that wants to read it without highlighting it :)
     
    • Thanks x 1