
Mix up articles into new ones (Python)

Discussion in 'Black Hat SEO Tools' started by Jespersen, Jun 10, 2014.

  1. Jespersen

    Jespersen Newbie

    Joined:
    Nov 2, 2013
    Messages:
    17
    Likes Received:
    26
    A quick and dirty Python script that takes a list of paragraphs (= articles) from a .txt file as input and replaces every sentence in each one with the most similar sentence taken from the other paragraphs in the collection. It doesn't spin anything (that'd be a good next step, though), but the results aren't 'traceable' to a single source. Use lots of articles (>100) for more human-readable content.

    The statistical measure for sentence similarity is adapted from Metzler et al. (2005).

    Code:
    #!/usr/bin/env python
    
    import re, sys, os, math, itertools, time
    from collections import defaultdict
    
    def split_sentences(content):
        abbrevs = ['dr', 'mr', 'mrs', 'ms', 'prof', 'inc', 'vs', 'ex', 'e.g', 'i.e', 'ps', 'p.s', 'no'] + list('abcdefghijklmnopqrstuvwxyz')
        normalized_content = re.sub(r'(?<=[ \(])(' + r'|'.join(abbrevs) + r')\.', lambda c: c.group()[:-1] + '_', content, flags=re.I)
        return [content[sent.start():sent.end()] for sent in re.finditer(r'[^ ].+?[\.\?\!]+', normalized_content)]
    
    def split_words(S):
        return re.sub(r'(?<![\.\!\?])[\.\,\?\!\:\;\'\"](?![\.\!\?])', lambda c: ' ' + c.group() + ' ', S).split()
    
    def lemma(word):
        """ how word forms get standarized when comparing sentences - currently just lowercase, first 5 characters """
        return word.lower()[:5]
    
    def calculate_idf(documents):
        lemmatized_docs = [[lemma(w) for w in split_words(doc)] for doc in documents]
    
        df = defaultdict(int)
    
        for doc in lemmatized_docs:
            for type in set(doc):
                df[type] += 1
    
        return {w: math.log(float(len(documents)) / df[w]) for w in df.keys()}
    
    def edge_similarity(s1, s2):
        e1 = s1[:2] + s1[-1:]
        e2 = s2[:2] + s2[-1:]
        aligned = zip(e1, e2)
    
        return float(len([e for e in aligned if e[0] == e[1]])) / len(aligned)
    
    def content_similarity(s1, s2, type_IDF):
        s1_tokens = [lemma(w) for w in split_words(s1)]
        s2_tokens = [lemma(w) for w in split_words(s2)]
    
        def w_penalty(type, tokens1, tokens2): return 1 + abs(tokens1.count(type) - tokens2.count(type))
        def s_penalty(tokens1, tokens2): return 1 + float(max(len(tokens1), len(tokens2))) / min(len(tokens1), len(tokens2))
    
        measure = sum(type_IDF[type] / w_penalty(type, s1_tokens, s2_tokens) for type in set(s1_tokens) & set(s2_tokens))
    
        # guard against empty token lists to avoid division by zero
        if s1_tokens and s2_tokens:
            return measure / s_penalty(s1_tokens, s2_tokens)
        else:
            return 0
    
    def sentence_similarity(s1, s2, type_IDF):
        return content_similarity(s1, s2, type_IDF) + edge_similarity(s1, s2)
    
    
    class ArticleDatabase(object):
        def __init__(self, file_path):
            try:
                with open(file_path, 'r') as file:
                    self.articles = file.readlines()
    
                print "\n%i articles loaded." % len(self.articles)
    
            except IOError:
                sys.exit("\nNo articles could be loaded - couldn't find path: %s" % file_path)
    
            self.idf = calculate_idf(self.articles)
    
            self.initial_sents = {}
            self.mainbody_sents = {}
            self.final_sents = {}
    
            for article_id, article in enumerate(self.articles):
                sents = split_sentences(article)
    
                self.initial_sents[article_id] = sents[:1]
                self.mainbody_sents[article_id] = sents[1:-1]
                self.final_sents[article_id] = sents[-1:]
    
        def recreate_article(self, article_id):
            print "Recreating article no. %i" % (article_id + 1)
    
            target_sents = split_sentences(self.articles[article_id])
            new_art = []
    
            source_mainbody_sents = set(itertools.chain(*[v for k, v in self.mainbody_sents.items() if k != article_id]))
            source_initial_sents = set(itertools.chain(*[v for k, v in self.initial_sents.items() if k != article_id]))
            source_final_sents = set(itertools.chain(*[v for k, v in self.final_sents.items() if k != article_id]))
    
            for i, sent in enumerate(target_sents):
                if i == 0:
                    source_set = source_initial_sents
                elif i == len(target_sents) - 1:
                    source_set = source_final_sents
                else:
                    source_set = source_mainbody_sents
    
                if source_set:
                    most_similar = sorted(source_set, key=lambda x: sentence_similarity(sent, x, self.idf), reverse=True)
                    replacement = most_similar[0]
                    source_set.remove(replacement)
                else:
                    replacement = sent
    
                new_art.append(replacement)
    
            return new_art
    
        def output(self, out_path, number_of_articles=10):
            results = [" ".join(self.recreate_article(article_id)) + "\n" for article_id in range(min(len(self.articles), number_of_articles))]
    
            with open(out_path, 'w') as out_file:
                for result in results:
                    out_file.write(result)
    
    
    if __name__ == "__main__":
        if len(sys.argv) == 4:
            db = ArticleDatabase(sys.argv[1])
            db.output(sys.argv[2], int(sys.argv[3]))
    
        else:
            print "\nWrong number of arguments!\nCorrect usage:\n"
            print "python recompose.py python recompose.py <file path> <output file path> <number of articles to rewrite>"
            sys.exit()
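    To get a feel for the similarity measure, you can also import the functions and score a couple of sentences by hand. A minimal sketch (assuming the script is saved as recompose.py; the example sentences are made up):
    Code:
    from recompose import calculate_idf, sentence_similarity
    
    docs = ["The quick brown fox jumps over the lazy dog.",
            "A quick brown fox leaped over a sleepy dog.",
            "Stock markets closed slightly lower on Friday."]
    
    # build IDF weights from the whole collection, then compare sentence pairs
    idf = calculate_idf(docs)
    
    # shared, rarer words raise the score, so the first pair should score higher
    print sentence_similarity(docs[0], docs[1], idf)
    print sentence_similarity(docs[0], docs[2], idf)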
    Result example using a bunch of fantasy book reviews:

    How to use:

    1. Save a bunch of articles to a .txt file, one article per line (no line breaks inside an article).
    2. Install Python 2.7 if you don't have it.
    3. Save the script as recompose.py (Windows users might need to save it in their Python or Python/Scripts folder)
    4. From the command line, run: python recompose.py <file path> <output file path> <number of articles to rewrite>, e.g. the call shown below.
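    An example invocation (the file names are just placeholders):
    Code:
    python recompose.py articles.txt rewritten.txt 10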

    Hopefully you'll find this useful for something.
     
  2. lord1027

    lord1027 Elite Member

    Joined:
    Sep 20, 2013
    Messages:
    3,174
    Likes Received:
    2,222
    This looks interesting, I'll give it a try. Anyone else tried this yet?
     
  3. MrBlue

    MrBlue Senior Member

    Joined:
    Dec 18, 2009
    Messages:
    950
    Likes Received:
    662
    Occupation:
    Web/Bot Developer
    Nice share. I built something very similar using Node.js. You should take a look at the following NLP (Natural Language Processing) library for Python.
    Code:
    http://www.nltk.org/
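    For instance, NLTK's sentence tokenizer could replace the hand-rolled split_sentences() above (a rough sketch; it needs the punkt model downloaded once):
    Code:
    import nltk
    nltk.download('punkt')  # one-time download of the sentence tokenizer model
    
    from nltk.tokenize import sent_tokenize
    
    text = "Dr. Smith wrote the article. It has two sentences."
    print sent_tokenize(text)  # list of sentences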
     
  4. Jespersen

    Jespersen Newbie

    Joined:
    Nov 2, 2013
    Messages:
    17
    Likes Received:
    26
    Yep, I normally use NLTK; it just wasn't necessary in the end because I wanted to see how simple I could keep a thing like this. Using WordNet-based similarity measures improves accuracy, but it makes the script roughly 20 times slower, so...
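    For reference, a WordNet-based word similarity with NLTK looks roughly like this (a sketch; it needs the wordnet corpus downloaded once):
    Code:
    import nltk
    nltk.download('wordnet')  # one-time download of the WordNet corpus
    
    from nltk.corpus import wordnet as wn
    
    # path_similarity scores two senses by their distance in the WordNet hierarchy
    print wn.synsets('dog')[0].path_similarity(wn.synsets('cat')[0])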