
Markov chain generators source code, and their effectiveness

Discussion in 'PHP & Perl' started by Daemon, Jan 7, 2010.

  1. Daemon (Registered Member)

    Joined: Dec 31, 2009
    Messages: 59
    Likes Received: 40
    I thought I would try and start a compilation of PHP Markov chain generators, and a discussion of their effectiveness. Below are a couple different generators - I believe the first is from YACG, and the second is from hxxp://www.haykranen.nl/2008/09/21/markov.
    PHP:
    function markov($content, $gran = 5, $num = 200, $letters_line = 65) {
      $combo = $content;
      $output = "";
      $combo = preg_replace('/\s\s+/', ' ', $combo);
      $combo = preg_replace('/\n|\r/', '', $combo);
      $combo = strip_tags($combo);
      $combo = htmlspecialchars($combo);
      $combo = explode(".", $combo);
      shuffle($combo);
      $combo = implode(".", $combo);
      $textwords = explode(" ", $combo);
      $loopmax = count($textwords) - ($gran - 2) - 1;
      $frequency_table = array();
      for ($j = 0; $j < $loopmax; $j++) {
        $key_string = " ";
        $end = $j + $gran;
        for ($k = $j; $k < $end; $k++) {
          $key_string .= $textwords[$k].' ';
        }
        $frequency_table[$key_string] = ' ';
        $frequency_table[$key_string] .= $textwords[$j + $gran]." ";
        if (($j + $gran) > $loopmax) {
          break;
        }
      }
      $buffer = "";
      $lastwords = array();
      for ($i = 0; $i < $gran; $i++) {
        $lastwords[] = $textwords[$i];
        $buffer .= " ".$textwords[$i];
      }
      for ($i = 0; $i < $num; $i++) {
        $key_string = " ";
        for ($j = 0; $j < $gran; $j++) {
          $key_string .= $lastwords[$j]." ";
        }
        if (isset($frequency_table[$key_string])) {
          $possible = explode(" ", trim($frequency_table[$key_string]));
          mt_srand();
          $c = count($possible);
          $r = mt_rand(1, $c) - 1;
          $nextword = $possible[$r];
          $buffer .= " $nextword";
          if (strlen($buffer) >= $letters_line) {
            $output .= $buffer;
            $buffer = " ";
          }
          for ($l = 0; $l < $gran - 1; $l++) {
            $lastwords[$l] = $lastwords[$l + 1];
          }
          $lastwords[$gran - 1] = $nextword;
        }
        else {
          $lastwords = array_splice($lastwords, 0, count($lastwords));
          for ($l = 0; $l < $gran; $l++) {
            $lastwords[] = $textwords[$l];
            $buffer .= ' '.$textwords[$l];
          }
        }
      }
      $output = trim($output);
      return $output;
    }
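    A note on the first function: the line `$frequency_table[$key_string] = ' ';` resets the entry on every pass through the loop, so each key ends up holding only the most recent word that followed it. Below is a minimal sketch of a table that accumulates every follower instead. The name build_word_table is a hypothetical helper, not part of the posted code, and the rest of the generator would need adjusting to read arrays rather than space-separated strings.

```php
<?php
// Hypothetical helper (not part of the posted code): build a word-level
// transition table that keeps every follower seen for each $gran-word key.
function build_word_table($words, $gran) {
    $table = array();
    $last = count($words) - $gran; // positions below $last still have a follower
    for ($j = 0; $j < $last; $j++) {
        $key = implode(' ', array_slice($words, $j, $gran));
        // Append rather than overwrite, so repeated keys accumulate followers.
        $table[$key][] = $words[$j + $gran];
    }
    return $table;
}

$words = explode(' ', 'a b c a b d a b c');
$table = build_word_table($words, 2);
print_r($table['a b']); // followers of "a b": c, d, c
```

    Keeping duplicates in the follower list means a plain random pick from it is already weighted by how often each word followed the key.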
    And the second one:
    PHP:
    function generate_markov_table($text, $look_forward) {
        $table = array();

        // now walk through the text and make the index table
        for ($i = 0; $i < strlen($text); $i++) {
            $char = substr($text, $i, $look_forward);
            if (!isset($table[$char])) $table[$char] = array();
        }

        // walk the array again and count the numbers
        for ($i = 0; $i < (strlen($text) - $look_forward); $i++) {
            $char_index = substr($text, $i, $look_forward);
            $char_count = substr($text, $i + $look_forward, $look_forward);

            if (isset($table[$char_index][$char_count])) {
                $table[$char_index][$char_count]++;
            } else {
                $table[$char_index][$char_count] = 1;
            }
        }

        return $table;
    }

    function generate_markov_text($length, $table, $look_forward) {
        // get first character
        $char = array_rand($table);
        $o = $char;

        for ($i = 0; $i < ($length / $look_forward); $i++) {
            $newchar = return_weighted_char($table[$char]);

            if ($newchar) {
                $char = $newchar;
                $o .= $newchar;
            } else {
                $char = array_rand($table);
            }
        }

        return $o;
    }

    function return_weighted_char($array) {
        if (!$array) return false;

        $total = array_sum($array);
        $rand  = mt_rand(1, $total);
        foreach ($array as $item => $weight) {
            if ($rand <= $weight) return $item;
            $rand -= $weight;
        }
    }
    I ran both on the content of the Wikipedia article for "Digital Cameras", with a granularity of 5. The sample code used to generate the results:
    PHP:
    echo markov($text, 5, 200);
    PHP:
    $order = 5;
    echo generate_markov_text(1000, generate_markov_table($text, $order), $order);
    Both have some problems with UTF8 characters. The first function seems to have results that are slightly less gibberish-like, but I haven't done extensive testing.
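    For the second generator, one likely cause is that strlen() and substr() count bytes, so a multibyte character can be cut in half when the text is sliced. Here is a rough sketch of a multibyte-aware table builder, assuming the mbstring extension is installed and the input is valid UTF-8. The name generate_markov_table_mb is hypothetical, and I have not tested it against the Wikipedia sample.

```php
<?php
// Sketch only: a multibyte-aware variant of the table builder.
// Assumes the mbstring extension is available and $text is valid UTF-8.
// generate_markov_table_mb is a hypothetical name, not from the original code.
function generate_markov_table_mb($text, $look_forward) {
    $table = array();
    $len = mb_strlen($text, 'UTF-8'); // length in characters, not bytes
    for ($i = 0; $i < $len - $look_forward; $i++) {
        $index = mb_substr($text, $i, $look_forward, 'UTF-8');
        $next  = mb_substr($text, $i + $look_forward, $look_forward, 'UTF-8');
        if (!isset($table[$index][$next])) {
            $table[$index][$next] = 0;
        }
        $table[$index][$next]++;
    }
    return $table;
}

$table = generate_markov_table_mb('ééàéé', 2);
print_r(array_keys($table)); // every key is made of whole characters
```

    The generating function would need a matching change only if it measures output length; the weighted pick itself works on whatever keys the table holds.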

    I've never used Markov generated content on any websites so I have no knowledge of its effectiveness. Any input in this area would be appreciated :)
     
  2. nixnash (Power Member)

    Joined: Oct 26, 2009
    Messages: 581
    Likes Received: 204
    Occupation: Student
    Location: BHW
    I think we use Markov chains to rewrite content. I just harvested some text for building a synonym database, but the output you get is mostly garbage.
    I think what we need is an English major who can explain to us how we can use Markov chains to rewrite sentences.
    I'd like to throw in a few suggestions, if you like.
     
  3. radi2k (Junior Member)

    Joined: Nov 29, 2009
    Messages: 117
    Likes Received: 34
    Location: Germany
    Man, you are my king! I was actually searching for code like this a few days ago. I think there is a lot of power in this method. It just needs some optimization and a lot of reference data to produce good content. Even if it sounds strange, big G won't see it as content not written by a human. I'm sure it works. And if your content is worse, more people will click the ads :D