
Removing Domain Duplicates From a 1GB File

Discussion in 'Black Hat SEO' started by 12121231, Feb 26, 2011.

  1. 12121231

    12121231 Registered Member

    Joined:
    Nov 26, 2010
    Messages:
    83
    Likes Received:
    18
    Hi guys. I've got 1GB worth of auto-approve blog URLs and I need to filter out the duplicate domains. It'll take a long time in Scrapebox, so I was wondering if there was an easier way to do it?

    Any advice/help appreciated, and if someone can help me I'll be sure to email you the completed list of hundreds of thousands of auto-approve blogs.
    Thanks :)
     
  2. antsaoo

    antsaoo Supreme Member

    Joined:
    Oct 1, 2008
    Messages:
    1,291
    Likes Received:
    637
    Use the Scrapebox dupe remove add-on. You can't really do it faster, I think.
     
  3. edsmithers

    edsmithers Junior Member

    Joined:
    Feb 7, 2008
    Messages:
    104
    Likes Received:
    18
    I can do this for you VERY quickly. Please PM me and let me know how you want them filtered specifically.
     
  4. 12121231

    12121231 Registered Member

    Joined:
    Nov 26, 2010
    Messages:
    83
    Likes Received:
    18
    Thanks, PMing now.
     
  5. Monrox

    Monrox Power Member

    Joined:
    Apr 9, 2010
    Messages:
    615
    Likes Received:
    579
    I am creating my own duplicate remover and can run your file through it if you want.

    It was needed because there is no software I could find that does it the way I want. They all compare the string between the beginning and the first "/", but this means that abf.yahoo.com and fgr.yahoo.com come out as different domains when they are not.

    A 1 GB uncompressed file should be about 10 million URLs. But unless your list is really comprehensive you will not get very many distinct domains, something like 200k or so. I routinely monitor URLs submitted to ping services, and last time I came up with 1,456,183 unique domains out of 99,236,306 URLs.
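    If you just want a rough version of that domain-level dedupe without waiting for custom software, a minimal shell sketch of the same idea might look like this. It assumes URLs in the usual http://host/path form and naively keeps the last two dot-separated labels as the domain, so abf.yahoo.com and fgr.yahoo.com both collapse to yahoo.com, but two-part TLDs like .co.uk will get over-merged (urls.txt is a placeholder filename):

    Code:
    # grab the host (3rd field when splitting on "/"), lowercase it,
    # keep the last two labels as the "domain", then sort out duplicates.
    # caveat: site.co.uk collapses to co.uk -- rough first pass only.
    awk -F/ '{print tolower($3)}' urls.txt \
      | awk -F. 'NF >= 2 {print $(NF-1) "." $NF}' \
      | sort -u > unique_domains.txt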
     
  6. edsmithers

    edsmithers Junior Member

    Joined:
    Feb 7, 2008
    Messages:
    104
    Likes Received:
    18
    Dude, I just got your PM. This is the file of URLs that was already posted here on the forum. It's not yours, and it's not original auto-approve blogs, right?
     
  7. 12121231

    12121231 Registered Member

    Joined:
    Nov 26, 2010
    Messages:
    83
    Likes Received:
    18
    Hey, yeah, they are from the forum, and I've been checking them with Scrapebox and the majority are auto-approve. So yes, they are good quality.
     
  8. Kickflip

    Kickflip BANNED

    Joined:
    Jan 29, 2010
    Messages:
    2,038
    Likes Received:
    2,465
    They aren't auto approve. Something isn't right on your end, because those are not even 5% auto-approve domains if they are from the 926MB file posted the other day. That guy was just posting a random list of URLs. How are you checking them with Scrapebox? It would take you weeks to post to all of those URLs, so I don't see how you could have "checked the majority of them."

    In fact, in the time since you downloaded the file, you wouldn't even have been able to post to 10% of the list. And if you had already posted to even 10% of it, you would easily have the file broken up into manageable chunks to load into Scrapebox, in which case you would just use the dup remover button in Scrapebox.

    Response?
     
  9. 12121231

    12121231 Registered Member

    Joined:
    Nov 26, 2010
    Messages:
    83
    Likes Received:
    18
    Valid points, my friend. I wasn't saying I scanned them all and the majority were auto-approve. I scanned about 10,000 of them and they were mainly auto-approve, so I assumed that was consistent. On top of this, the guy who posted it said they were auto-approve, and you're right, I wouldn't even have been able to scan 10% of them. Thanks for the good idea about the dup remover as well :)
     
  10. Kickflip

    Kickflip BANNED

    Joined:
    Jan 29, 2010
    Messages:
    2,038
    Likes Received:
    2,465
    [GET] Over 14 Million Unique WordPress Blog Links

    If that is the post you are talking about, the guy did not say they are auto approve. And what do you mean, you "scanned 10,000 of them and they were mainly auto approve"? I am telling you, man, they are not auto-approve URLs. I think you are mistaken about what you found. I ran over 200,000 of the URLs and only got 900 unique auto approves.
     
  11. 12121231

    12121231 Registered Member

    Joined:
    Nov 26, 2010
    Messages:
    83
    Likes Received:
    18
    I don't know, mate. I remember seeing heaps of auto-approve. It doesn't matter anyway; I'll use the ones I do get and get more.
     
  12. mazgalici

    mazgalici Supreme Member

    Joined:
    Jan 2, 2009
    Messages:
    1,489
    Likes Received:
    881
    "unique" command from Linux
     
  13. jerrandgab

    jerrandgab Newbie

    Joined:
    Feb 20, 2011
    Messages:
    5
    Likes Received:
    0
    You can split the file with the TextFileSplitter software, then delete dupes in each split file, and then merge them again with the LinksList Merger software.
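    One caveat with that approach: deduping each chunk on its own will not catch duplicates that landed in different chunks, so the merged file still needs a final pass. A rough sketch of the same split/dedupe/merge idea using standard GNU tools instead of those programs (urls.txt and deduped.txt are placeholder names):

    Code:
    split -l 1000000 urls.txt chunk_                   # cut into ~1M-line pieces
    for f in chunk_*; do sort -u -o "$f" "$f"; done    # sort and dedupe each piece
    sort -m -u chunk_* > deduped.txt                   # merge sorted pieces, dropping cross-chunk dupes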
     
  14. wilywonka

    wilywonka Newbie

    Joined:
    Jan 16, 2011
    Messages:
    14
    Likes Received:
    0
    uniq requires you to presort, so I find it much easier to just use sort on its own. This works far better than Scrapebox and can actually dedupe lists that your computer can't fully hold in memory.

    Code:
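    # GNU sort spills to temp files on disk, so it copes with files bigger than RAM;
    # -u drops duplicate lines while sorting, -o lets it write back over the input file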
    sort -u -o file.txt file.txt
     
    Last edited: Feb 27, 2011
  15. Jared255

    Jared255 Jr. Executive VIP Jr. VIP Premium Member

    Joined:
    May 10, 2009
    Messages:
    1,907
    Likes Received:
    1,663
    Location:
    Boston, MA
  16. Gene78

    Gene78 Newbie

    Joined:
    Feb 27, 2011
    Messages:
    16
    Likes Received:
    0
    SB is good at deleting those duplicate domains.
     
  17. waimeng00

    waimeng00 Junior Member

    Joined:
    Feb 27, 2008
    Messages:
    184
    Likes Received:
    292
    Occupation:
    Internet Marketer
    Location:
    Malaysia
    SB is not so good with a 1GB text file.

    These tools are recommended:
    Textpipe
    GVim