Perl script to remove lines of File_A from File_B

Discussion in 'Black Hat SEO' started by bigleftie, Oct 18, 2010.

  1. bigleftie

    This Perl script removes the lines of File_A from File_B. For example:

    File_A
    delete1
    delete2
    delete3

    File_B
    keep1
    delete1
    delete2
    keep2


    If you ran the script using the following command:
    perl remove_lines.pl File_A File_B
    the output (to STDOUT) would be
    keep1
    keep2
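
    The core idea can be sketched on its own, assuming both files fit in memory: put the File_A lines into a hash and keep only the File_B lines that are not in it. The helper name remove_lines and the hard-coded sample arrays below are illustrative assumptions, not part of the script posted further down.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): returns the
# lines of @$candidates that do not appear in @$unwanted.
sub remove_lines {
    my ($unwanted, $candidates) = @_;
    my %seen = map { $_ => 1 } @$unwanted;      # set of lines to drop
    return grep { !$seen{$_} } @$candidates;    # keeps order and duplicates
}

# Sample data copied from the File_A / File_B example above.
my @file_a = qw(delete1 delete2 delete3);
my @file_b = qw(keep1 delete1 delete2 keep2);

# Prints keep1 and keep2, one per line.
print "$_\n" for remove_lines(\@file_a, \@file_b);
```

    The hash lookup means each File_B line is checked once, rather than being compared against every File_A line in turn.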


    I keep track of "UNKNOWN BLOG TYPES", sites flagged as malware, places I've already backlinked from, my URLs, etc.

    I run this Perl script after harvesting URLs with ScrapeBox to delete undesired URLs from File_B before I begin working with the URLs I harvested.

    I always run my harvested URLs through two ScrapeBox add-ons before using them: the "Malware and Phishing Filter" and the "Blog Analyzer".

    By eliminating URLs I know I am not interested in, I don't waste time and bandwidth that can be used for other tasks.

    It's a handy little script that I use in a few places as part of my automated tasks. I thought I'd share it with everyone - enjoy!

    Code:
    #!/usr/local/ActivePerl-5.6/bin/perl -w
    
    # Usage: remove_lines.pl lines remove_from
    # lines: text file of strings to be removed
    # remove_from: text file to remove those strings from
    #
    # Prints every line of remove_from that does not appear in lines
    
    use strict;
    
    die "Usage: $0 lines remove_from\n" unless @ARGV == 2;
    my ($known_links_file_name, $candidates_file_name) = @ARGV;
    
    # Load the lines to remove into a hash for fast lookup.
    my %known_links;
    open(my $known_links_file, '<', $known_links_file_name)
    	|| die "Cannot open $known_links_file_name: $!";
    while (<$known_links_file>)
    {
    	# chomp strips only a trailing newline; chop would also eat the
    	# last character of a final line that has no newline.
    	chomp;
    	$known_links{$_} = 1;
    }
    close($known_links_file);
    
    # Print each candidate line that is not in the removal list.
    open(my $candidates_file, '<', $candidates_file_name)
    	|| die "Cannot open $candidates_file_name: $!";
    while (<$candidates_file>)
    {
    	chomp;
    	print "$_\n" unless $known_links{$_};
    }
    close($candidates_file);
    exit(0);
    
    For those of you who don't know, ScrapeBox has this functionality built in if you want to do it manually.

    http://www.youtube.com/watch?v=U609Qbk36Ew