Markov Text Generation

PhilosopherKing · Aug 15, 2012

I thought that I would share a bit about my favorite mathematical method: Markov Generation. I am not going to share any Markov generation scripts. Anyone who is willing to put a bit of time into learning PHP is more than capable of writing their own Markov text generation script. What you do with your newfound ability is up to you.

"Give a man a fish, feed him for a day. Teach a man to fish, and feed him for the rest of his life."

Markov Generation involves a statistical analysis of a given corpus of text. In layman's terms: if I look at the complete works of William Shakespeare, which word (or segment of characters) most often follows, say, "thou"? With that knowledge, we can tell this handy bit of plastic and silicon to choose the next word at random, weighted by how often each candidate follows. Let us say, for instance, that "shalt" is chosen. We now have "thou shalt". This process repeats and out pops our generated text.

The Nuts and Bolts

The first step is to create an index. We need some source of text; generally, the larger the corpus, the more sensible the result. It is probably a good idea at this point to send the text through a series of preg_replace calls to clean out the line breaks, etc. We then split the cleaned text into an array of strings. The index is just a table listing how often one segment of text follows the previous segment, or several previous segments. We simply use a loop to crawl through our array and increment a counter each time a given segment follows another. In our previous example, "thou" could be followed by "shalt", "will", "shan't", etc. Let us assume we end up with something like thou -> will(6), shalt(9), shan't(3). This tells us that "shalt" follows "thou" 9 times out of 18, or 1/2 the time. Now you have an index of the text.
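As a rough illustration of the indexing step described above, here is a minimal sketch in Python rather than PHP. It is not the poster's script: the function name build_index and the tiny sample corpus are made up for the example, and it only tracks single-word (first-order) transitions.

```python
import re
from collections import Counter, defaultdict

def build_index(text):
    """Map each word to a Counter of the words that follow it (hypothetical helper)."""
    # Collapse line breaks and runs of whitespace, then split into words.
    words = re.sub(r"\s+", " ", text).strip().split(" ")
    index = defaultdict(Counter)
    # Pair every word with its successor and count the pairing.
    for current, following in zip(words, words[1:]):
        index[current][following] += 1
    return index

# Invented sample corpus, only to show the shape of the index.
corpus = "thou shalt go\nthou shalt stay\nthou will see"
index = build_index(corpus)
print(dict(index["thou"]))  # {'shalt': 2, 'will': 1}
```

To index on several previous segments instead of one, the key could be a tuple of the last two or three words rather than a single word.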

A Random Walk

Now that we have our index, how do we generate text from it? This really depends on how your script will function. If you're generating individual sentences and then rolling them out as a paragraph, then you probably made an index of first words in the previous section by splitting the corpus by sentence. If so, you simply start with a random first word (we will call it the token), look that entry up in the index, and choose the following word randomly, weighted by frequency. The chosen word becomes the new token; we look it up in the index and repeat, letting the script pick each next word by frequency, until we have our generated text.
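The walk itself can be sketched roughly as follows, again in Python and again only as an assumption about how one might do it. The function name generate and the max_words cap are invented; the "thou" row uses the counts from the example above, and the other rows are made-up filler so the walk has somewhere to go.

```python
import random

def generate(index, token, max_words=20, rng=random):
    """Walk the index from `token`, picking each next word in proportion to its count."""
    output = [token]
    for _ in range(max_words - 1):
        followers = index.get(token)
        if not followers:  # dead end: nothing ever followed this word
            break
        words = list(followers)
        counts = [followers[w] for w in words]
        # Weighted random choice: a word counted 9 times is 3x as likely as one counted 3 times.
        token = rng.choices(words, weights=counts, k=1)[0]
        output.append(token)
    return " ".join(output)

index = {
    "thou": {"will": 6, "shalt": 9, "shan't": 3},  # counts from the example above
    "shalt": {"not": 4, "go": 1},                  # invented for illustration
    "will": {"go": 2},
    "shan't": {"go": 1},
    "not": {"go": 3},
}
print(generate(index, "thou", rng=random.Random(0)))
```

Passing a seeded random.Random makes a run repeatable, which helps when comparing output before and after a change to the index.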

If you are generating the whole chunk of text without care for individual sentences, or using chunks of characters, then you will probably have an index that includes words, or characters, with capitalization and periods still attached. In this case, we can either clean out the partial sentence at the beginning with some regex after we generate it, or search the index for a capitalized word to use as our first token. The same process applies: we choose a token, randomly choose the next word based on frequency, then set the new word as our token and restart the loop. Congratulations - you now have an extremely ugly, hardly readable chunk of generated text that sounds somewhat similar to the text you put into the script.
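Both options for getting a clean start can be sketched in a few lines of Python. The helper names pick_start and trim_leading_fragment are hypothetical, and the regex assumes sentences end in a period, exclamation mark, or question mark followed by whitespace.

```python
import random
import re

def pick_start(index, rng=random):
    """Choose a starting token that looks like the first word of a sentence."""
    starters = [w for w in index if w[:1].isupper()]
    # Fall back to any token if the index has no capitalized words.
    return rng.choice(starters or list(index))

def trim_leading_fragment(text):
    """Drop a partial sentence at the start of generated text, if there is one."""
    if text[:1].isupper():
        return text  # already starts at what looks like a sentence boundary
    # Remove everything up to and including the first sentence terminator.
    return re.sub(r"^[^.!?]*[.!?]\s+", "", text, count=1)

# Invented toy index keyed by tokens with punctuation and case left on.
index = {"thou": {}, "Thou": {}, "Hark.": {}, "shalt": {}}
print(pick_start(index, rng=random.Random(0)))
print(trim_leading_fragment("shalt not pass. Thou art here."))  # Thou art here.
```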

Water into Wine

This is the most important and most often under-appreciated step. The chunk of text we have isn't terrible, but it's ugly. It probably isn't very presentable. We need to throw a suit on it, teach it some public speaking, and get it out into the world. We need to run our script a few dozen times and take some notes. Write down anything that you don't like or don't want to appear in the finished product. We then go back to our script and add some regex or other methods of fixing the problems we noted. We might want to delete repetitions of the same or similar text. We may need to add or remove some spaces in our corpus before we split it into our index. If you are really a glutton for punishment, you can run it through a list of commonly misspelled words and correct them. How about adding a system so that tokens we choose as keywords have a higher frequency? Maybe we create a massive database index of millions of chunks of text and change the script to start in the middle of the sentence with our keyword and then loop backward to the first word and forward to the first period. We might generate millions of sentences and then have the script organize them by keyword phrase. What we will definitely do is crack a bottle of champagne and toast our newfound ability to create snake oil.
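Two of the simpler cleanup passes mentioned above (fixing stray spaces and deleting repeated sentences) might look something like this in Python. This is only a sketch: the function name tidy is invented, it catches exact repeats only (ignoring case), and sentence splitting is a naive split on terminal punctuation.

```python
import re

def tidy(text):
    """Fix spacing around punctuation and drop exactly repeated sentences."""
    # Remove spaces before punctuation, then collapse repeated whitespace.
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)
    text = re.sub(r"\s+", " ", text).strip()
    # Split after sentence terminators and keep the first copy of each sentence.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    seen, kept = set(), []
    for sentence in sentences:
        key = sentence.lower()
        if key not in seen:
            seen.add(key)
            kept.append(sentence)
    return " ".join(kept)

print(tidy("Thou shalt go .  Thou shalt go. Thou will stay ."))
# Thou shalt go. Thou will stay.
```

Catching "similar" rather than identical sentences would need some kind of fuzzy comparison, which this sketch does not attempt.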

PhilosopherKing · Aug 15, 2012

My mistake. This probably belongs in the content generation section of the forum - newb mistake. If an admin could transfer it over I would appreciate it, thanks.

camstryker · Aug 16, 2012

Excellent post and explanation. Definitely on my list of things to try out. I've been looking into a lot of different ways to defeat the content beast.

closedCaption · Aug 16, 2012

That's an old story, OP. It was used to create rather large sites with useless content.

Here is one link from 2009 with Python source code:
http://agiliq.com/blog/2009/06/generating-pseudo-random-text-with-markov-chains-u/

Don't try to use this on your site; G has been on to this for a few years now.

PhilosopherKing · Aug 17, 2012

closedCaption said:
Don't try to use this on your site; G has been on to this for a few years now.

I suppose, like anything, that would depend on what you are using it for... I have a site that has been around for about a year with around 10,000 pages indexed, and I have never had an issue. Of course, I add the Markov script to fill out my pages and have legitimate content that is generated through other means on my site as well. I think Markov methods are best used to avoid the duplicate content filter, as opposed to pumping out a veritable pool of filth, especially if you are utilizing affiliate feeds or some other method that naturally results in a page of duplicate content. I use a second script that takes my input and then pumps out grammatically correct fill-in-the-blank type statements. I then Markov that content to get a much larger but still 100% original text. I then join the duplicate content with my original content and generate a massive amount of semi-original content. That is then put through a bunch of filters and word lists to weed out the junk and duplicate sentences. I set up the script to specifically weed out sentences significantly similar to the source text to avoid the duplicate content filter.

Another important note: you might want to spoof your headers and do a periodic generation on your site. When I generate a new page I save the content and reference that same content for about two weeks with the same header. Then, when my script notices the header is over two weeks old, I regenerate the page, adding the previous page to the corpus. I do this because I want to increase the chance that the page regenerates similar to its previous version, so that when it is re-indexed it appears updated and new but doesn't stray too far from the previous page. I also use .htaccess to make the pages appear to be static html files, as opposed to dynamic pages.

I am curious, by what means do you believe G is nixing these pages? Also, I am aware that this is somewhat "old hat". Care to share a few "new hat" techniques with us?

closedCaption · Aug 17, 2012

Well, you can always go the route you have chosen for yourself, and to some extent it will work, until your site is manually reviewed, unless you are extra careful. One thing I see repeated here, and it's false to the core, is that G is almighty. Well, it is SMART and hard to deceive, but deceiving it is possible; however, you need to work extra hours and be extra paranoid and on the move.

I was in auto content generation years ago, but since the recession hit I find it easier to just buy some original articles written at $0.20-$0.40 per 100 words and use those on my site. It has a much better effect, and keeps me out of hot water with G.
