
Please help.. removing duplicates?

Discussion in 'BlackHat Lounge' started by Patel, May 20, 2012.

Thread Status:
Not open for further replies.
  1. Patel

    Patel Senior Member

    Joined:
    Mar 1, 2011
    Messages:
    1,116
    Likes Received:
    1,503
    Location:
    On the coast
    Hey guys,

    I am trying to remove duplicates and I need some help.

    Say I have a master list, with all of my smaller lists combined.

    And then I have one smaller list. How can I remove from the smaller list all the entries that are already in the master list?

    Let me know if you can help me! Any suggestions would be appreciated.
     
  2. sapo

    sapo Power Member

    Joined:
    Feb 25, 2008
    Messages:
    510
    Likes Received:
    281
    Not sure I totally get you, but import them into Excel and remove duplicate lines from there.
     
  3. Knoxgates

    Knoxgates Supreme Member

    Joined:
    Aug 9, 2008
    Messages:
    1,266
    Likes Received:
    918
    That is not what he asked for.
    @OP: Does the master file contain URLs?
     
    • Thanks x 1
  4. sapo

    sapo Power Member

    Joined:
    Feb 25, 2008
    Messages:
    510
    Likes Received:
    281
    That is why I added the "not sure I totally get you" part. But in case I did, I put what I thought would help him.
     
    • Thanks x 1
  5. Patel

    Patel Senior Member

    Joined:
    Mar 1, 2011
    Messages:
    1,116
    Likes Received:
    1,503
    Location:
    On the coast
    Yeah the master file contains all of the URLs.

    And then the smaller lists may or may not have duplicate URLs. I want to remove from the smaller list all of the URLs that are already in the master list. This way each of the smaller lists is 100% unique.

    Don't trip, you guys are both helping. I have been using Excel as sapo said, but it is very inefficient for me, because when I remove duplicates, I don't know if it's removing the duplicate from the master list or the smaller list.
     
  6. bertbaby

    bertbaby Elite Member

    Joined:
    Apr 15, 2009
    Messages:
    2,019
    Likes Received:
    1,496
    Occupation:
    Product marketing
    Location:
    USA
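    Do it in Excel: combine the two lists, the master first and then your smaller one, and create a column that indicates which list each row came from. Remove the dups based on the URL column only (if the list column is included in the comparison, nothing will count as a duplicate). Then filter on the list column to pull out the smaller list, now missing the dups, and remove the list column you created earlier.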
     
    • Thanks x 1
  7. tacopalypse

    tacopalypse Executive VIP Jr. VIP Premium Member

    Joined:
    Nov 30, 2009
    Messages:
    980
    Likes Received:
    2,485
    If the master list is already unique, you just need to put the smaller list below the master list in Excel, keeping them separated with a space or something.

    The Remove Duplicates operation starts at the top and keeps the first copy (which is in the master list), and deletes all subsequent copies (which are in the smaller list).
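
For anyone who would rather do this outside Excel, the same "keep the first copy, drop the later ones" idea can be sketched with awk. This is only a rough sketch under some assumptions: both lists are plain text files with one URL per line, and `filter_new`, `master.txt` and `small.txt` are made-up names, not anything from this thread. It reads the master list first, then prints only the lines of the smaller list that have not been seen yet.

```shell
# Hypothetical helper: print the lines of the smaller list that are not in the master list.
# Assumes one URL per line and an exact, case-sensitive match.
filter_new() {
    awk '
        # First file (the master list): remember every line, print nothing.
        FILENAME == ARGV[1] { seen[$0] = 1; next }
        # Second file (the smaller list): print a line only the first time it shows up.
        !seen[$0]++
    ' "$1" "$2"
}
```

Usage would look something like `filter_new master.txt small.txt > small.unique.txt`. The master list is left untouched, the smaller list keeps its original order, and duplicates inside the smaller list itself are dropped as well.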
     
    • Thanks x 1
  8. trajek

    trajek Newbie

    Joined:
    Nov 29, 2010
    Messages:
    44
    Likes Received:
    10
    Location:
    Austin, Tx U.S.A
    Upload the file to a Unix shell account and use the text file tools to sort and uniq the list with 2 commands.

    sort filename > filename.sorted
    uniq filename.sorted > filename.uniq

    After executing these 2 commands, you will have the deduplicated list in the filename.uniq file. It should take anywhere from 1 to 5 seconds to execute, depending on how many millions of lines the file has, even on an old, slow Unix machine.
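
Note that sort and uniq on their own dedupe a single file. To strip from the smaller list everything that is already in the master list, one possible extension is to sort both files and compare them with `comm`. The following is a sketch under some assumptions: one URL per line, `mktemp -d` is available on the machine, and `only_in_small`, `master.txt` and `small.txt` are hypothetical names used just for illustration.

```shell
# Hypothetical helper: list the lines that are in the smaller file but not in the master file.
# Assumes one URL per line; the output comes back sorted, not in the original order.
only_in_small() {
    tmpdir=$(mktemp -d) || return 1
    # comm expects both inputs sorted the same way, so pin the locale for sort and comm.
    LC_ALL=C sort -u "$1" > "$tmpdir/master.sorted"
    LC_ALL=C sort -u "$2" > "$tmpdir/small.sorted"
    # -1 hides lines found only in the master, -3 hides lines found in both files.
    LC_ALL=C comm -13 "$tmpdir/master.sorted" "$tmpdir/small.sorted"
    rm -rf "$tmpdir"
}
```

Something like `only_in_small master.txt small.txt > small.uniq` would then write the cleaned smaller list. `sort -u` does the job of sort followed by uniq in one step.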
     
  9. Patel

    Patel Senior Member

    Joined:
    Mar 1, 2011
    Messages:
    1,116
    Likes Received:
    1,503
    Location:
    On the coast
    Thanks for your help, guys. Got it, and done.
     