
How to completely remove duplicates in .txt file? So that BOTH entries go, not just one!

Discussion in 'Black Hat SEO' started by krzysiekz, Dec 2, 2010.

  1. krzysiekz

    krzysiekz Senior Member

    Joined:
    Jul 29, 2010
    Messages:
    953
    Likes Received:
    578
    Hi guys,

    I have a really large list of URLs. I am using the 'remove duplicates' feature in SB, but that just removes the extra copies when there is more than one of the same URL.

    So, if I have a text file with the below in it:

    www.test.com
    www.test.com

    And I remove duplicates, I get:

    www.test.com


    But I want to, somehow, end up with a result where IF a URL is duplicated in the file, ALL of its entries are removed, so the end result doesn't contain that URL at all.

    Does anyone know how this can be achieved?

    Thanks,
     
  2. hellomotow07

    hellomotow07 Power Member Premium Member

    Joined:
    Aug 24, 2010
    Messages:
    643
    Likes Received:
    350
  3. jacobpov

    jacobpov Junior Member Premium Member

    Joined:
    Aug 23, 2010
    Messages:
    153
    Likes Received:
    391
    You can code something simple that loads the text file and loops through the lines to do what you require :D
    Of course you can accomplish this with Visual Basic :D
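    For example, something like this in Python (just a rough sketch rather than Visual Basic, and the file names are only placeholders):

    Code:
    from collections import Counter

    # Count how many times each URL appears in the file
    with open("urls.txt") as f:
        lines = [line.strip() for line in f if line.strip()]

    counts = Counter(lines)

    # Keep only the URLs that appear exactly once, in their original order
    unique_only = [url for url in lines if counts[url] == 1]

    with open("results.txt", "w") as f:
        f.write("\n".join(unique_only) + "\n")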
     
  4. kalrudra

    kalrudra BANNED

    Joined:
    Oct 29, 2010
    Messages:
    271
    Likes Received:
    300
    I wrote a program like this for one of the guys here on BlackHatWorld. He had 1 billion URLs stored in a text file. Normal programming techniques will not work for files like that, as the program will easily crash once it runs out of memory.

    The real solution is to use a database for storage.
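    For example (just a minimal sketch, assuming SQLite and placeholder file names), you can let the database do the counting on disk instead of holding everything in RAM:

    Code:
    import sqlite3

    # Store the URLs on disk so a huge list never has to fit in memory
    conn = sqlite3.connect("urls.db")
    conn.execute("CREATE TABLE IF NOT EXISTS urls (url TEXT)")

    with open("urls.txt") as f:
        conn.executemany("INSERT INTO urls (url) VALUES (?)",
                         ((line.strip(),) for line in f if line.strip()))
    conn.commit()

    # Let SQLite group and count; keep only the URLs that occur exactly once
    with open("results.txt", "w") as out:
        for (url,) in conn.execute(
                "SELECT url FROM urls GROUP BY url HAVING COUNT(*) = 1"):
            out.write(url + "\n")
    conn.close()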

    If you are interested, contact me.

    Thanks,
    Kalrudra
     
  5. krzysiekz

    krzysiekz Senior Member

    Joined:
    Jul 29, 2010
    Messages:
    953
    Likes Received:
    578
    PM'ed :)
     
  6. soultrain

    soultrain Newbie

    Joined:
    Nov 25, 2010
    Messages:
    30
    Likes Received:
    16
    Hell, I'll tell you how to do this for free.

    Copy all of the data into Excel and have the list all in one giant column in a table. In each cell of column 2 of the table, enter the value: [ =COUNTIF([Column1],Table1[[#This Row],[Column1]]) ] Now just sort column 2 from largest to smallest. Every duplicated URL will have a value of '2' or more and the unique ones will have a value of '1', so delete the rows with a count above 1 and you're left with only the URLs that appeared once. Enjoy! :)
     
    • Thanks Thanks x 1
  7. krzysiekz

    krzysiekz Senior Member

    Joined:
    Jul 29, 2010
    Messages:
    953
    Likes Received:
    578
    Thanks! Will try the above tomorrow and let you know how it goes. Needless to say, +rep, + thanks! :)
     
  8. Sweetfunny

    Sweetfunny Jr. VIP Premium Member

    Joined:
    Jul 13, 2008
    Messages:
    1,747
    Likes Received:
    5,038
    Location:
    ScrapeBox v2.0
    Home Page:
    When your list is in the harvester grid, try Remove Duplicates >> Split Duplicate Domains.

    Save it as dupes.txt or whatever.

    Then Import URL List >> Select the lists to compare, choose your dupes.txt file, and the harvester grid should now have what you want.
     
    • Thanks Thanks x 1
  9. MaDeuce

    MaDeuce Newbie

    Joined:
    Oct 24, 2008
    Messages:
    45
    Likes Received:
    16
    Location:
    Austin, TX
    As is the case with many things like this, there is a unix/linux one-liner that will do exactly what you request. Assume that your initial file is named 'yourfile' and that you want the results stored in 'results', then

    Code:
    sort yourfile | uniq -u > results
    That's it: sort groups identical lines together, and the -u flag makes uniq print only the lines that are not repeated. This is why unix is such a productive environment to work in.

    --Ma
     
    • Thanks Thanks x 1
  10. krzysiekz

    krzysiekz Senior Member

    Joined:
    Jul 29, 2010
    Messages:
    953
    Likes Received:
    578
    Thanks Sweetfunny, will try that out for sure as it seems the easiest way!

    Regarding unix, sorry, I am on Win7 so it won't work, but I have thanked anyway!!