
Can't Delete Duplicates From 25 Million URL List - Every Program Crashes (Xrumer / gScraper / Notepad++)

Discussion in 'Black Hat SEO' started by smokemeoutdawg, Jan 22, 2014.

  1. smokemeoutdawg

    smokemeoutdawg Newbie

    Joined:
    Dec 14, 2013
    Messages:
    46
    Likes Received:
    12
    I've got a 25 million URL list I harvested with gScraper, and I'm trying to delete duplicate domains. Easier said than done with a 4GB file.

    I attempted to delete the duplicates with gScraper and it crashes after eating 14GB of RAM lol. Tried Xrumer and it gives an error: too big.

    It's too big for Notepad++ to open fully.

    Tried splitting it with gSplit, but it's too big for that too.

    I need an easy way to split big files; if anyone could recommend anything, that would be great.
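    For what it's worth, a job this size doesn't need the whole file in memory at once: a single streaming pass that remembers each domain it has already seen will do, and 25 million domains fit comfortably in a few GB of RAM. A minimal sketch in Python, assuming one URL per line (the file names are placeholders):

        # dedupe_domains.py - stream the URL list and keep only the first
        # URL seen for each domain; the file is never loaded whole.
        from urllib.parse import urlparse

        seen = set()
        with open("urls.txt", encoding="utf-8", errors="ignore") as src, \
             open("urls_deduped.txt", "w", encoding="utf-8") as dst:
            for line in src:
                url = line.strip()
                if not url:
                    continue
                # urlparse only fills netloc when the URL has a scheme;
                # fall back to the text before the first slash otherwise.
                domain = urlparse(url).netloc.lower() or url.split("/")[0].lower()
                if domain not in seen:
                    seen.add(domain)
                    dst.write(url + "\n")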
     
  2. babasss

    babasss Regular Member

    Joined:
    Jul 12, 2010
    Messages:
    335
    Likes Received:
    197
    Try the ScrapeBox dupe remover addon.
     
  3. tahworld

    tahworld Regular Member

    Joined:
    Aug 16, 2013
    Messages:
    457
    Likes Received:
    393
    Location:
    ✔✔✔✔✔✔✔
    Hey, try this; it was made specifically for big text files and is supposed to handle billions of lines:

    http://www.emeditor.com/

     
    • Thanks x 1
  4. Winternacht

    Winternacht Junior Member

    Joined:
    Jan 7, 2011
    Messages:
    113
    Likes Received:
    46
  5. PrinceVisi

    PrinceVisi Elite Member

    Joined:
    Jan 11, 2012
    Messages:
    1,916
    Likes Received:
    1,008
    Occupation:
    BusinessMan
    Location:
    Tropoja

    Can't you split them into 25 files with 1 million each and retest?
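    Splitting by line count is also easy to do in a streaming fashion, so the file size doesn't matter. A sketch along those lines (chunk size and file names are assumptions):

        # split_lines.py - split a big text file into 1,000,000-line
        # pieces without reading the whole thing into memory.
        CHUNK = 1_000_000  # lines per output file; tune as needed

        dst, n = None, 0
        with open("urls.txt", encoding="utf-8", errors="ignore") as src:
            for line in src:
                if n % CHUNK == 0:
                    if dst:
                        dst.close()
                    dst = open(f"urls_part{n // CHUNK + 1:02d}.txt",
                               "w", encoding="utf-8")
                dst.write(line)
                n += 1
        if dst:
            dst.close()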
     
  6. smokemeoutdawg

    smokemeoutdawg Newbie

    Joined:
    Dec 14, 2013
    Messages:
    46
    Likes Received:
    12
    Tried that too. It lagged for hours with no success.


    Heisenberg for the win! This actually opened my file in seconds. Amazing, man.

    Are you familiar with how to remove dupes using this program? Or split it in two?
     
  7. davids355

    davids355 Jr. VIP Premium Member

    Joined:
    Apr 25, 2011
    Messages:
    8,783
    Likes Received:
    6,319
    I have a server with 128GB RAM :)
    I could give it a go if you want, but it would probably take me a few days.
     
  8. skadster

    skadster Junior Member

    Joined:
    Aug 6, 2011
    Messages:
    171
    Likes Received:
    49
    Location:
    Scotland
    You could also try EditPad Lite; I'm sure it has a command to delete duplicate lines.
     
  9. d2ugsd

    d2ugsd Registered Member Premium Member

    Joined:
    Mar 16, 2008
    Messages:
    81
    Likes Received:
    36

    UltraEdit will do the job.
     
  10. smokemeoutdawg

    smokemeoutdawg Newbie

    Joined:
    Dec 14, 2013
    Messages:
    46
    Likes Received:
    12
    It's not the RAM that's the problem (I've got a machine with 32GB of RAM); it's the software.

    So I ran it through Heisenberg's program and it splits them, but it adds a " before and after each URL, which causes errors.

    EditPad doesn't support files larger than 2GB.

    Any other recommendations are highly appreciated.
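    If the only damage is a stray quote wrapped around each line, that's cheap to undo with the same streaming approach (file names are placeholders again):

        # strip_quotes.py - remove the double quotes the splitter
        # wrapped around each URL, one line at a time.
        with open("urls_part01.txt", encoding="utf-8", errors="ignore") as src, \
             open("urls_part01_clean.txt", "w", encoding="utf-8") as dst:
            for line in src:
                dst.write(line.strip().strip('"') + "\n")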
     
  11. d2ugsd

    d2ugsd Registered Member Premium Member

    Joined:
    Mar 16, 2008
    Messages:
    81
    Likes Received:
    36
    I've worked with a 50GB file in UltraEdit, so 4GB is a piece of cake.
     
    • Thanks x 1
  12. Groen

    Groen Regular Member

    Joined:
    Nov 7, 2009
    Messages:
    397
    Likes Received:
    221
    Have you tried the ScrapeBox addon DupRemove, as other users suggested? It should handle up to 180 million lines at a time.
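    For exact duplicate lines (as opposed to duplicate domains) in files too big for RAM, the classic technique is an external merge sort: sort manageable chunks, spill them to disk, merge the sorted runs, and drop adjacent duplicates. On Linux, sort -u urls.txt > deduped.txt does exactly this. A rough Python sketch of the same idea (chunk size is an assumption):

        # external_dedupe.py - exact-line dedup for files bigger than RAM:
        # sort fixed-size chunks, spill them to temp files, then merge
        # the sorted runs and skip adjacent duplicates.
        import heapq
        import itertools
        import tempfile

        CHUNK = 2_000_000  # lines per in-memory chunk; tune to your RAM

        def dedupe(src_path, dst_path):
            runs = []
            with open(src_path, encoding="utf-8", errors="ignore") as src:
                while True:
                    lines = list(itertools.islice(src, CHUNK))
                    if not lines:
                        break
                    lines.sort()
                    run = tempfile.TemporaryFile("w+", encoding="utf-8")
                    run.writelines(lines)
                    run.seek(0)
                    runs.append(run)
            with open(dst_path, "w", encoding="utf-8") as dst:
                prev = None
                for line in heapq.merge(*runs):
                    if line != prev:
                        dst.write(line)
                        prev = line

        dedupe("urls.txt", "urls_deduped.txt")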