
Software to remove duplicate lines in a text file?

Discussion in 'Black Hat SEO Tools' started by youssef93, Feb 25, 2011.

  1. youssef93 (BANNED) · Joined: Sep 14, 2008 · Messages: 828 · Likes Received: 1,149
    May seem a stupid question but I'm really tired. After using the ScrapeBox dup remove addon to merge my text files I ended up with a 1.21 GB text file (full of duplicates of course) and I'm not able to do anything about it. The addon won't complete the dup remove process. It says "reading file", then "writing unique URLs", and stays like that forever. I noticed a 10 MB file it created on the desktop, but that's it; the file stays at 10 MB and never grows. I left SB running for over 18 hours and it's the same situation: "writing unique urls...." and a 10 MB file. Is it supposed to finish like that or what?

    Can you recommend software that can deal with massive files and delete duplicate lines? txtcollector combines multiple files but doesn't remove dups. I don't mind if the app takes a long time, as long as it gets the job done!

    Thanks! :D
     
  2. adbox (Power Member) · Joined: May 1, 2009 · Messages: 658 · Likes Received: 107
    Edit:

    After re-reading, this definitely would not work for a text file over 1 GB, or even close to that...


    Original Post:

    Create a directory, put the text file in it, and create a PHP file in the same directory with the code below:

    remove_dup.php
    PHP:
    <?php

    // Load every line of the file into an array (the whole file in memory).
    $lines = file('textfile.txt');

    // Drop duplicate lines.
    $lines = array_unique($lines);

    // Join the unique lines back together for display.
    $lines = implode('<br>', $lines);

    echo $lines;

    ?>
    Execute remove_dup.php in the browser.
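    Since file() loads the entire file into memory at once, a 1 GB+ file will blow past PHP's memory limit, which is why the above won't scale. A minimal streaming sketch of the same idea (the filenames are placeholders, not from the original post): read one line at a time and write each line the first time it appears, so only the set of unique lines is held in memory.

    PHP:
    <?php

    // Stream the input instead of loading it all at once.
    $in   = fopen('textfile.txt', 'r');
    $out  = fopen('unique.txt', 'w');
    $seen = array(); // one hash-table entry per unique line

    while (($line = fgets($in)) !== false) {
        $key = rtrim($line, "\r\n"); // normalize line endings
        if ($key === '') {
            continue; // skip blank lines
        }
        if (!isset($seen[$key])) {
            $seen[$key] = true;        // remember we saw this line
            fwrite($out, $key . "\n"); // emit it exactly once
        }
    }

    fclose($in);
    fclose($out);

    ?>

    This still keeps one copy of every unique line in memory, so it works when the file is mostly duplicates; if the unique set itself is huge, split the file first as described further down the thread.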
     
    • Thanks x 1
  3. pyronaut (Executive VIP) · Joined: Dec 9, 2008 · Messages: 1,229 · Likes Received: 1,423
    • Thanks x 1
  4. flibbertigibbet (Regular Member) · Joined: Apr 11, 2010 · Messages: 388 · Likes Received: 188
    Here's a great web-based tool. I use it a lot. :) (it's not mine)
    Code:
    http://textmechanic.com/Remove-Duplicate-Lines.html
     
    • Thanks x 1
  5. Jared255 (Jr. Executive VIP, Jr. VIP, Premium Member) · Joined: May 10, 2009 · Messages: 1,945 · Likes Received: 1,746 · Location: Boston, MA
    • Thanks x 2
  6. crazyflx (Elite Member) · Joined: Nov 9, 2009 · Messages: 1,674 · Likes Received: 4,825 · Location: http://CRAZYFLX.COM
    Sounds to me like your file has right around 19 million lines in it (rough estimate), and it sounds like your PC isn't strong enough for the SB dup remover to handle the file all at once.

    Either split that file into chunks of 5 or 6 million lines each (free tool to do that for you located here), run each chunk through the SB dup remover individually, recombine them, and then run the new file through the SB dup remover again (trust me, it'll be much smaller by then). A sketch of the split step follows below.

    Or combine fewer files at a time using the merge function in the SB dup remover, and run those through the dup remover.

    Or, do as Jared said above, and check out my site where I have a full guide.
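    A minimal sketch of the split step described above (the chunk size and filenames are illustrative assumptions, not the tool crazyflx links): carve the big file into fixed-size chunks that the dup remover can digest one at a time.

    PHP:
    <?php

    // Split a big text file into chunks of $linesPerChunk lines each.
    $linesPerChunk = 5000000; // ~5 million lines, per the suggestion above
    $in    = fopen('bigfile.txt', 'r');
    $out   = null;
    $chunk = 0;
    $count = 0;

    while (($line = fgets($in)) !== false) {
        // Start a new chunk file every $linesPerChunk lines.
        if ($count % $linesPerChunk === 0) {
            if ($out !== null) {
                fclose($out);
            }
            $chunk++;
            $out = fopen("chunk{$chunk}.txt", 'w');
        }
        fwrite($out, $line);
        $count++;
    }

    if ($out !== null) {
        fclose($out);
    }
    fclose($in);

    ?>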
     
    • Thanks x 2
  7. chris456 (Regular Member) · Joined: May 17, 2010 · Messages: 281 · Likes Received: 567
    TextPipe will remove duplicate lines from 200 million lines in a few seconds without splitting the file. It can handle files of more than 10 GB. You can find TextPipe here on BHW.
    I have removed duplicate lines from 125 million domains (zone files) in one 2.66 GB .txt file.
     
    • Thanks x 1
    Last edited: Feb 26, 2011
  8. SpellZ (Regular Member) · Joined: Feb 8, 2009 · Messages: 357 · Likes Received: 312 · Location: Toronto, ON
    LOL! I was looking for this as well.
    Got 8000 emails, and each email repeats like 2-3 times... so annoying to do it manually, wow!
     
  9. youssef93 (BANNED) · Joined: Sep 14, 2008 · Messages: 828 · Likes Received: 1,149
    Wow, such an overwhelming response. The thread is glowing because of all the colorful rank stars and names from Execs and Jr. VIPs :p

    Thanks a bunch everyone, I'll review the solutions and report back.

    @Crazyflx
    You've gotten to the point where you can determine the number of lines from the file size. They're almost exactly 19 mil. LOL. I'd love to know some of your scraping stats. Boy, you must be using dozens of terabytes of bandwidth scraping 24/7, perhaps having your own 'dedicated scraping data center' :D. I've seen your blog lists on sale and they look legendary :D

    Anyway, thanks a bunch everyone, I'll report back shortly :)
     
    • Thanks Thanks x 1
  10. chris456 (Regular Member) · Joined: May 17, 2010 · Messages: 281 · Likes Received: 567
    Try TextPipe, and with big respect to all Execs, Donors, Jr. VIPs and their solutions, you will see that TextPipe is absolutely the best :-)
    Not only for removing duplicate lines but for all text manipulation.

    @crazyflx - you should try it too. I highly recommend it; you'll never use another text manipulation tool again.
     
  11. crazyflx (Elite Member) · Joined: Nov 9, 2009 · Messages: 1,674 · Likes Received: 4,825 · Location: http://CRAZYFLX.COM
    Haha, yes, it's pretty sick, I know... I've got "scraping sickness". I've probably deleted more URLs than most people have ever even scraped.

    I'm up to two dedis now... one for scraping (I pay for 500 exclusive private proxies every month) and one for commenting.

    As for the bandwidth I use every month, I'm well into the terabytes...well into them ;)
     
  12. crazyflx (Elite Member) · Joined: Nov 9, 2009 · Messages: 1,674 · Likes Received: 4,825 · Location: http://CRAZYFLX.COM
    I'll give it a go, but the reason I mention the other solutions is that most of the text manipulation tools mentioned in this thread DON'T have a built-in function to detect duplicate DOMAINS in a file and then remove the duplicate domain along with the entire line it appears on.
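    A minimal sketch of that domain-level dedupe (filenames are placeholders, and this is an illustration rather than crazyflx's actual tool): keep the first line seen for each unique host and drop every later line whose URL points at a host that has already been kept. It assumes each line is a full URL with a scheme, since parse_url() can't extract a host otherwise.

    PHP:
    <?php

    $in   = fopen('urls.txt', 'r');
    $out  = fopen('unique_domains.txt', 'w');
    $seenHosts = array(); // one entry per domain already kept

    while (($line = fgets($in)) !== false) {
        $url = trim($line);
        if ($url === '') {
            continue; // skip blank lines
        }
        // Extract the host part, e.g. "blog.example.com".
        $host = parse_url($url, PHP_URL_HOST);
        if ($host === null || $host === false) {
            continue; // skip lines that aren't well-formed URLs
        }
        $host = strtolower($host); // domains are case-insensitive
        if (!isset($seenHosts[$host])) {
            $seenHosts[$host] = true;
            fwrite($out, $url . "\n"); // keep the whole original line
        }
    }

    fclose($in);
    fclose($out);

    ?>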
     
  13. youssef93 (BANNED) · Joined: Sep 14, 2008 · Messages: 828 · Likes Received: 1,149
    Thanks a lot, Chris!

    Don't worry, I'll examine the suggestions and report back for everyone to benefit. I did see your post and it seemed easiest, so I'll start with it :D

    @Crazyflx

    Damn, 500 private proxies? That's like over $600 just for proxies. Add to that server and running expenses, and I think we could say you do hit $1k/mo in scraping expenses :S. Hell, no wonder your lists are expensive.
     
  14. chris456 (Regular Member) · Joined: May 17, 2010 · Messages: 281 · Likes Received: 567
    TextPipe has no competitors among text manipulation tools, so definitely try it out. It has many options, like searching for duplicate lines within a column range (for example, from column 23 to column 4000). I'm discovering new functions every day. It has unlimited filters (groups of tasks) you can set up, and you can also use subfilters to export every result to a different folder (if you run 100 tasks like converting, renaming, replacing, uppercasing, extracting, etc., you can get the files for every task in 100 different folders). I can't name all the possibilities here; check it out. For large files it's the best.
     
    • Thanks x 1