Markov chain generators source code, and their effectiveness

Discussion in 'PHP & Perl' started by Daemon, Jan 7, 2010.

  1. Daemon

    Daemon Registered Member

    Joined:
    Dec 31, 2009
    Messages:
    59
    Likes Received:
    40
    I thought I would try to start a compilation of PHP Markov chain generators, and a discussion of their effectiveness. Below are a couple of different generators. I believe the first is from YACG, and the second is from hxxp://www.haykranen.nl/2008/09/21/markov.
    PHP:
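    function markov($content, $gran = 5, $num = 200, $letters_line = 65) {
      $combo = $content;
      $output = "";
      // collapse whitespace, then strip markup and escape HTML characters
      $combo = preg_replace('/\s\s+/', ' ', $combo);
      $combo = preg_replace('/\n|\r/', '', $combo);
      $combo = strip_tags($combo);
      $combo = htmlspecialchars($combo);
      // shuffle the sentences, then split the text into words
      $combo = explode(".", $combo);
      shuffle($combo);
      $combo = implode(".", $combo);
      $textwords = explode(" ", $combo);
      $loopmax = count($textwords) - ($gran - 2) - 1;
      // map each run of $gran words to the word that follows it
      $frequency_table = array();
      for ($j = 0; $j < $loopmax; $j++) {
        $key_string = " ";
        $end = $j + $gran;
        for ($k = $j; $k < $end; $k++) {
          $key_string .= $textwords[$k].' ';
        }
        // note: this assignment resets the entry on every pass,
        // so each key keeps only the last follower seen
        $frequency_table[$key_string] = ' ';
        $frequency_table[$key_string] .= $textwords[$j + $gran]." ";
        if (($j + $gran) > $loopmax) {
          break;
        }
      }
      // seed the output with the first $gran words
      $buffer = "";
      $lastwords = array();
      for ($i = 0; $i < $gran; $i++) {
        $lastwords[] = $textwords[$i];
        $buffer .= " ".$textwords[$i];
      }
      for ($i = 0; $i < $num; $i++) {
        // the key is built from the first $gran entries of $lastwords
        $key_string = " ";
        for ($j = 0; $j < $gran; $j++) {
          $key_string .= $lastwords[$j]." ";
        }
        if (isset($frequency_table[$key_string])) {
          // pick one of the recorded followers at random
          $possible = explode(" ", trim($frequency_table[$key_string]));
          mt_srand();
          $c = count($possible);
          $r = mt_rand(1, $c) - 1;
          $nextword = $possible[$r];
          $buffer .= " $nextword";
          // flush the buffer once it reaches $letters_line characters
          if (strlen($buffer) >= $letters_line) {
            $output .= $buffer;
            $buffer = " ";
          }
          // slide the window of last words forward by one
          for ($l = 0; $l < $gran - 1; $l++) {
            $lastwords[$l] = $lastwords[$l + 1];
          }
          $lastwords[$gran - 1] = $nextword;
        }
        else {
          // unknown key: append the opening words again
          // (array_splice() returns the elements it removes, so $lastwords is unchanged here)
          $lastwords = array_splice($lastwords, 0, count($lastwords));
          for ($l = 0; $l < $gran; $l++) {
            $lastwords[] = $textwords[$l];
            $buffer .= ' '.$textwords[$l];
          }
        }
      }
      $output = trim($output);
      return $output;
    }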
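In the function above, `$frequency_table[$key_string] = ' ';` runs on every pass of the table-building loop, so a key that occurs several times in the text ends up with only its last follower. Below is a minimal sketch of one way to keep every follower instead, so that common continuations are picked more often. The names `build_word_table` and `generate_words` are hypothetical and not part of YACG; the sketch assumes PHP 7 or later and skips the cleanup and sentence shuffling done above.

```php
<?php
// Hypothetical helpers for illustration; not taken from YACG.

// Map each run of $order words to the list of every word seen after it.
function build_word_table(array $words, int $order): array {
    $table = [];
    $last = count($words) - $order;
    for ($i = 0; $i < $last; $i++) {
        $key = implode(' ', array_slice($words, $i, $order));
        // Append instead of overwriting, so repeated followers keep their weight.
        $table[$key][] = $words[$i + $order];
    }
    return $table;
}

// Generate $num words; if the text is too short to build a table, return it unchanged.
function generate_words(array $words, int $order, int $num): string {
    if (count($words) <= $order) {
        return implode(' ', $words);
    }
    $table = build_word_table($words, $order);
    $state = array_slice($words, 0, $order);
    $out = $state;
    while (count($out) < $num) {
        $key = implode(' ', $state);
        if (!isset($table[$key])) {
            // Dead end: restart from the opening words, which always have a follower.
            $state = array_slice($words, 0, $order);
            continue;
        }
        $next = $table[$key][array_rand($table[$key])];
        $out[] = $next;
        // Slide the window: drop the oldest word, add the new one.
        $state = array_slice($state, 1);
        $state[] = $next;
    }
    return implode(' ', array_slice($out, 0, $num));
}

$words = preg_split('/\s+/', trim("the cat sat on the mat and the cat ran off"));
echo generate_words($words, 2, 20), "\n";
```

Storing followers as a list (duplicates included) makes a uniform `array_rand()` pick behave like a frequency-weighted pick, at the cost of some memory on large inputs.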
    And the second generator, the one from haykranen.nl:
    PHP:
    function generate_markov_table($text, $look_forward) {
        $table = array();

        // now walk through the text and make the index table
        for ($i = 0; $i < strlen($text); $i++) {
            $char = substr($text, $i, $look_forward);
            if (!isset($table[$char])) $table[$char] = array();
        }

        // walk the array again and count the numbers
        for ($i = 0; $i < (strlen($text) - $look_forward); $i++) {
            $char_index = substr($text, $i, $look_forward);
            $char_count = substr($text, $i + $look_forward, $look_forward);

            if (isset($table[$char_index][$char_count])) {
                $table[$char_index][$char_count]++;
            } else {
                $table[$char_index][$char_count] = 1;
            }
        }

        return $table;
    }

    function generate_markov_text($length, $table, $look_forward) {
        // get first character
        $char = array_rand($table);
        $o = $char;

        for ($i = 0; $i < ($length / $look_forward); $i++) {
            $newchar = return_weighted_char($table[$char]);

            if ($newchar) {
                $char = $newchar;
                $o .= $newchar;
            } else {
                $char = array_rand($table);
            }
        }

        return $o;
    }

    function return_weighted_char($array) {
        if (!$array) return false;

        $total = array_sum($array);
        $rand  = mt_rand(1, $total);
        foreach ($array as $item => $weight) {
            if ($rand <= $weight) return $item;
            $rand -= $weight;
        }
    }
    Below is the sample code used to generate results, based on the content from the Wikipedia article for "Digital Cameras" and a granularity of 5:
    PHP:
    echo markov($text, 5, 200);
    PHP:
    $order = 5;
    echo generate_markov_text(1000, generate_markov_table($text, $order), $order);
    Both have some problems with UTF-8 characters. The first function's results seem slightly less gibberish-like, but I haven't done extensive testing.
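One likely source of the UTF-8 trouble in the second generator is that `strlen()` and `substr()` count bytes, so a multibyte character can be cut in half when the text is sliced. The sketch below is a possible character-aware variant of the table builder using the mbstring functions. The name `generate_markov_table_mb` is hypothetical, and the sketch assumes the mbstring extension is loaded and the input is valid UTF-8.

```php
<?php
// Hypothetical multibyte-aware variant of generate_markov_table().
// Assumes the mbstring extension is available and $text is valid UTF-8.
function generate_markov_table_mb(string $text, int $look_forward): array {
    $table = [];
    $len = mb_strlen($text, 'UTF-8');
    for ($i = 0; $i < $len; $i++) {
        // Slice by characters rather than bytes, so multibyte sequences stay intact.
        $current = mb_substr($text, $i, $look_forward, 'UTF-8');
        if (!isset($table[$current])) {
            $table[$current] = [];
        }
        // Only positions with a full chunk after them contribute a count.
        if ($i < $len - $look_forward) {
            $next = mb_substr($text, $i + $look_forward, $look_forward, 'UTF-8');
            if (!isset($table[$current][$next])) {
                $table[$current][$next] = 0;
            }
            $table[$current][$next]++;
        }
    }
    return $table;
}

$table = generate_markov_table_mb("äöüäöü", 1);
print_r($table);
```

The resulting table has the same shape as the one built by `generate_markov_table()`, so it should be usable with `generate_markov_text()` as posted. Calling `mb_substr()` in a loop can get slow on long UTF-8 texts; splitting the text into characters once up front may be worth trying if that becomes a problem.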

    I've never used Markov-generated content on any websites, so I have no knowledge of its effectiveness. Any input in this area would be appreciated :)
     
  2. nixnash

    nixnash Power Member

    Joined:
    Oct 26, 2009
    Messages:
    581
    Likes Received:
    205
    Occupation:
    Student
    Location:
    BHW
    I think we use Markov chains to rewrite content. I've just been harvesting to build a synonym database, but the output you get is mostly garbage.
    I think what we need is an English major who can explain to us how we can use Markov chains to rewrite sentences.
    I'd like to throw in a few suggestions, if you like.
     
  3. radi2k

    radi2k Junior Member

    Joined:
    Nov 29, 2009
    Messages:
    117
    Likes Received:
    34
    Location:
    Germany
    Man, you are my king! I was actually searching for code like this a few days ago. I think there is a lot of power in this method. It just needs some optimization and a lot of reference data to produce good content. Even if it sounds strange, big G won't see it as content not written by a human. I'm sure it works. And if your content is worse, more people will click the ads :D