
Mix up articles into new ones (Python)

Discussion in 'Black Hat SEO Tools' started by Jespersen, Jun 10, 2014.

  1. Jespersen

    Jespersen Newbie

    Joined:
    Nov 2, 2013
    Messages:
    17
    Likes Received:
    26
    A quick and dirty Python script that takes a list of paragraphs (= articles) from a .txt file as input and replaces every sentence in each one with the most similar sentence taken from the other paragraphs in the collection. It doesn't spin anything (that'd be a good next step, though), but the results aren't 'traceable' to a single source. Use lots of articles (>100) for more human-readable content.

    The statistical measure for sentence similarity is adapted from Metzler et al. (2005).

    Code:
    #!/usr/bin/env python
    
    import re, sys, os, math, itertools, time
    from collections import defaultdict
    
    def split_sentences(content):
        abbrevs = ['dr', 'mr', 'mrs', 'ms', 'prof', 'inc', 'vs', 'ex', 'e.g', 'i.e', 'ps', 'p.s', 'no'] + list('abcdefghijklmnopqrstuvwxyz')
        normalized_content = re.sub(r'(?<=[ \(])(' + r'|'.join(abbrevs) + r')\.', lambda c: c.group()[:-1] + '_', content, flags=re.I)
        return [content[sent.start():sent.end()] for sent in re.finditer(r'[^ ].+?[\.\?\!]+', normalized_content)]
    
    def split_words(S):
        return re.sub(r'(?<![\.\!\?])[\.\,\?\!\:\;\'\"](?![\.\!\?])', lambda c: ' ' + c.group() + ' ', S).split()
    
    def lemma(word):
        """ how word forms get standarized when comparing sentences - currently just lowercase, first 5 characters """
        return word.lower()[:5]
    
    def calculate_idf(documents):
        lemmatized_docs = [[lemma(w) for w in split_words(doc)] for doc in documents]
    
        df = defaultdict(int)
    
        for doc in lemmatized_docs:
            for type in set(doc):
                df[type] += 1
    
        return {w: math.log(float(len(documents)) / df[w]) for w in df.keys()}
    
    def edge_similarity(s1, s2):
        e1 = s1[:2] + s1[-1:]
        e2 = s2[:2] + s2[-1:]
        aligned = zip(e1, e2)
    
        return float(len([e for e in aligned if e[0] == e[1]])) / len(aligned)
    
    def content_similarity(s1, s2, type_IDF):
        s1_tokens = [lemma(w) for w in split_words(s1)]
        s2_tokens = [lemma(w) for w in split_words(s2)]
    
        def w_penalty(type, tokens1, tokens2): return 1 + abs(tokens1.count(type) - tokens2.count(type))
        def s_penalty(tokens1, tokens2): return 1 + float(max(len(tokens1), len(tokens2))) / min(len(tokens1), len(tokens2))
    
        measure = sum(type_IDF[type] / w_penalty(type, s1_tokens, s2_tokens) for type in set(s1_tokens) & set(s2_tokens))
    
        # guard against empty token lists to avoid division by zero
        if s1_tokens and s2_tokens:
            return measure / s_penalty(s1_tokens, s2_tokens)
        else:
            return 0
    
    def sentence_similarity(s1, s2, type_IDF):
        return content_similarity(s1, s2, type_IDF) + edge_similarity(s1, s2)
    
    
    class ArticleDatabase(object):
        def __init__(self, file_path):
            try:
                with open(file_path, 'r') as file:
                    self.articles = file.readlines()
    
                print "\n%i articles loaded." % len(self.articles)
    
            except IOError:
                sys.exit("\nNo articles could be loaded - couldn't find path: %s" % file_path)
    
            self.idf = calculate_idf(self.articles)
    
            self.initial_sents = {}
            self.mainbody_sents = {}
            self.final_sents = {}
    
            for article_id, article in enumerate(self.articles):
                sents = split_sentences(article)
    
                self.initial_sents[article_id] = sents[:1]
                self.mainbody_sents[article_id] = sents[1:-1]
                self.final_sents[article_id] = sents[-1:]
    
        def recreate_article(self, article_id):
            print "Recreating article no. %i" % (article_id + 1)
    
            target_sents = split_sentences(self.articles[article_id])
            new_art = []
    
            source_mainbody_sents = set(itertools.chain(*[v for k, v in self.mainbody_sents.items() if k != article_id]))
            source_initial_sents = set(itertools.chain(*[v for k, v in self.initial_sents.items() if k != article_id]))
            source_final_sents = set(itertools.chain(*[v for k, v in self.final_sents.items() if k != article_id]))
    
            for i, sent in enumerate(target_sents):
                if i == 0:
                    source_set = source_initial_sents
                elif i == len(target_sents) - 1:
                    source_set = source_final_sents
                else:
                    source_set = source_mainbody_sents
    
                if source_set:
                    most_similar = sorted(source_set, key=lambda x: sentence_similarity(sent, x, self.idf), reverse=True)
                    replacement = most_similar[0]
                    source_set.remove(replacement)
                else:
                    replacement = sent
    
                new_art.append(replacement)
    
            return new_art
    
        def output(self, out_path, number_of_articles=10):
            results = [" ".join(self.recreate_article(article_id)) + "\n" for article_id in range(min(len(self.articles), number_of_articles))]
    
            with open(out_path, 'w') as out_file:
                for result in results:
                    out_file.write(result)
    
    
    if __name__ == "__main__":
        if len(sys.argv) == 4:
            db = ArticleDatabase(sys.argv[1])
            db.output(sys.argv[2], int(sys.argv[3]))
    
        else:
            print "\nWrong number of arguments!\nCorrect usage:\n"
            print "python recompose.py python recompose.py <file path> <output file path> <number of articles to rewrite>"
            sys.exit()
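    To get a feel for the similarity measure, you can also import the functions and score a couple of sentences by hand. A minimal sketch (assuming the script is saved as recompose.py; the example sentences are made up):
    Code:
    from recompose import calculate_idf, sentence_similarity
    
    docs = ["The quick brown fox jumps over the lazy dog.",
            "A quick brown fox leaped over a sleepy dog.",
            "Stock markets closed slightly lower on Friday."]
    
    # build IDF weights from the whole collection, then compare sentence pairs
    idf = calculate_idf(docs)
    
    # shared, rarer words raise the score, so the first pair should score higher
    print sentence_similarity(docs[0], docs[1], idf)
    print sentence_similarity(docs[0], docs[2], idf)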
    Result example using a bunch of fantasy book reviews:

    How to use:

    1. Save a bunch of articles to a .txt file, one article per line (no line breaks inside an article).
    2. Install Python 2.7 if you don't have it.
    3. Save the script as recompose.py (Windows users might need to save it in their Python or Python/Scripts folder)
    4. From the command line, run: python recompose.py <file path> <output file path> <number of articles to rewrite>, e.g. the call shown below.
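    An example invocation (the file names are just placeholders):
    Code:
    python recompose.py articles.txt rewritten.txt 10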

    Hopefully you'll find this useful for something.
     
  2. lord1027

    lord1027 Elite Member

    Joined:
    Sep 20, 2013
    Messages:
    3,174
    Likes Received:
    2,222
    This looks interesting, I'll give it a try. Anyone else tried this yet?
     
  3. MrBlue

    MrBlue Senior Member

    Joined:
    Dec 18, 2009
    Messages:
    950
    Likes Received:
    662
    Occupation:
    Web/Bot Developer
    Nice share. I built something very similar using Node.js. You should take a look at the following NLP (Natural Language Processing) library for Python.
    Code:
    http://www.nltk.org/
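    For instance, NLTK's sentence tokenizer could replace the hand-rolled split_sentences() above (a rough sketch; it needs the punkt model downloaded once):
    Code:
    import nltk
    nltk.download('punkt')  # one-time download of the sentence tokenizer model
    
    from nltk.tokenize import sent_tokenize
    
    text = "Dr. Smith wrote the article. It has two sentences."
    print sent_tokenize(text)  # list of sentences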
     
  4. Jespersen

    Jespersen Newbie

    Joined:
    Nov 2, 2013
    Messages:
    17
    Likes Received:
    26
    Yep, I normally use NLTK; it just wasn't necessary in the end because I wanted to see how simple I could keep a thing like this. Using WordNet-based similarity measures improves accuracy, but it makes the script roughly 20 times slower, so...
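    For reference, a WordNet-based word similarity with NLTK looks roughly like this (a sketch; it needs the wordnet corpus downloaded once):
    Code:
    import nltk
    nltk.download('wordnet')  # one-time download of the WordNet corpus
    
    from nltk.corpus import wordnet as wn
    
    # path_similarity scores two senses by their distance in the WordNet hierarchy
    print wn.synsets('dog')[0].path_similarity(wn.synsets('cat')[0])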