
Need a program to

Discussion in 'BlackHat Lounge' started by Bwaht, May 16, 2009.

  1. Bwaht

    Bwaht Regular Member

    Joined:
    Jan 27, 2009
    Messages:
    292
    Likes Received:
    243
    Hey everyone,
    I need to find a program that will delete duplicate web addresses from a .txt or .doc file. Basically I have lists of web addresses with many dups, and I need a quick and easy way to get rid of them. If you know of a program that can do this, let me know. Thanks! :D
     
  2. Rick4691

    Rick4691 Registered Member Premium Member

    Joined:
    Feb 19, 2008
    Messages:
    70
    Likes Received:
    30
    Occupation:
    Programmer
    Location:
    Oceania
    Code:
    # Strip the "www." prefix from each address
    sed 's/\/www\./\//g' file.txt > file_www_cleansed.txt
    
    # Sort the addresses and remove duplicate lines
    sort -u file_www_cleansed.txt > file_www_cleansed_deduped.txt
    Just enter these at the command line, or incorporate them into a shell script.

    Heck, here's the quickie script, too. Sorry, it's late so I didn't comment it completely or put in the usual safety-valves.

    Code:
    #!/bin/sh
    #
    # Usage: dedupe_sites.sh filename.txt
    filename="$1"
    
    # Strip "www." from the addresses (quoting handles filenames with spaces)
    sed 's/\/www\./\//g' "$filename" > "${filename}_cleansed"
    
    # Sort and remove duplicate addresses
    sort -u "${filename}_cleansed" > "${filename}_cleansed_deduped"
    
    exit 0
    So, if the name of the script is dedupe_sites.sh, you could call it like this: "dedupe_sites.sh filename.txt"

    Your domains will be in a file called "filename.txt_cleansed_deduped".
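
    For illustration, here's roughly what a run would look like, assuming a hypothetical input file called sites.txt (the URLs are just made-up examples):

    Code:
    $ cat sites.txt
    http://www.example.com/page
    http://example.com/page
    http://www.example.com/page
    
    $ sh dedupe_sites.sh sites.txt
    $ cat sites.txt_cleansed_deduped
    http://example.com/page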
     
  3. springer98

    springer98 Regular Member

    Joined:
    Dec 6, 2008
    Messages:
    211
    Likes Received:
    250
    Occupation:
    We doeneeeeed no stinkin' yob!
    Location:
    ZRF