
Have Over 1 Million URLs & Need to Remove Duplicate URLs & Domains? Here You Go!

Discussion in 'Black Hat SEO Tools' started by crazyflx, Nov 9, 2010.

  1. crazyflx

    crazyflx Elite Member

    Joined:
    Nov 9, 2009
    Messages:
    1,674
    Likes Received:
    4,825
    Location:
    http://CRAZYFLX.COM
    Home Page:
    This thread looks menacing & long. It's only this long because I tend to be way overly detailed. What you see below is INCREDIBLY easy and has very few steps; I just wanted to make sure everybody understood.

    I scrape 80 to 90 million URLs a week using ScrapeBox & other tools. While that may sound great, until I had a working solution to the problem in this thread's title, it was a nightmare.

    Having to import 1 million URLs at a time into ScrapeBox, remove duplicates, save the list & do that over 80 times, then combine those new files, split them into chunks of 1 million & do it 60 more times...repeat, repeat, repeat...until I finally got a list with no duplicates...well, it seriously takes HOURS & HOURS, and anybody working with large lists knows that.

    Well, here is how you can take ALL that headache away: take 10 million URLs (or more...or less) and, using ONE program, remove all duplicate URLs & duplicate domains in less than 10 minutes. Sounds good, right?

    First, you'll need a couple things (they are free).

    Text Collector: Download Page

    If you're using ScrapeBox & you've harvested more than 1 million URLs, you've likely got TONS of different .txt files, all with X amount of URLs in them.

    You're going to want to get ALL of those URLs into one .txt file. That software I've just linked to will do that for you.

    gVim: Download Page

    gVim is GREAT for working with ENORMOUS .txt files. It can handle opening files with over 10 million lines, no problem...it's also what is going to make removing duplicate URLs & domains a breeze.

    Once you've got those two things, you're ready to get started.

    Go ahead & combine all your .txt files that contain all your URLs into one file.

    Right click on that new (huge) .txt file & click on "Edit with Vim"

    Once gVim has opened, you'll know it is "ready to go" (and you can get started) if you see a flashing cursor in the uppermost left-hand corner. If you don't, wait until you do (it can take some time depending on your computer & its processing power).

    Here is a picture of the interface you'll see. The green circle at the bottom marks the number of lines in the file.
    [Image: gVim interface with the line count circled]


    REMOVING DUPLICATE URLS



    Now, to remove duplicate URLs, simply type the following (you don't have to click anywhere, just start typing it)

    :sort u

    As soon as you type the : key, you'll see the cursor jump to the bottom left-hand corner of the screen. Keep typing the rest, exactly as shown above.

    Once you've typed it out, press Enter. It will proceed to remove all duplicate URLs & sort them alphanumerically. This can take a while depending on the number of URLs.

    If you'd like to see how it functions quickly, just open up a small file with a couple hundred URLs.

    It's really that simple. Once it's complete, you'll see an updated number of lines at the bottom of gVim's interface & it will tell you how many lines it removed.
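
    (A quick aside, not required for the gVim method: if you'd rather script this step, a rough Python equivalent of ":sort u" could look like the sketch below. The file names are just placeholders, and it holds the whole list in memory, so it assumes you have enough RAM for your list.)

    Code:
    # Rough equivalent of gVim's ":sort u": drop duplicate lines, then sort what's left.
    # "all_urls.txt" and "deduped_urls.txt" are placeholder file names.
    with open("all_urls.txt") as src:
        unique_lines = {line.strip() for line in src if line.strip()}

    with open("deduped_urls.txt", "w") as dst:
        dst.write("\n".join(sorted(unique_lines)) + "\n")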

    REMOVING DUPLICATE DOMAINS

    Once again, type the : key.


    As before, as soon as you type the : the cursor will jump to the bottom left hand corner of the screen again. Once it does, then paste this after the :

    let g:gotDomains={}

    So you should end up seeing this at the bottom left hand corner of your screen:

    :let g:gotDomains={}

    If you see that, go ahead and press enter (you won't see anything happen, don't worry. Just continue with these steps).

    Type the : again (and you'll see the cursor jump to the bottom left-hand corner of the screen, waiting for your command).

    Paste the following after the :

    Code:
    %g/^/let curDomain=matchstr(getline('.'), '\v^http://%(www\.)?\zs[^/]+') | if !has_key(g:gotDomains, curDomain) | let g:gotDomains[curDomain]=1 | else | delete _ | endif
    Then, press Enter.

    Now, just wait. You may end up seeing an error message; if you do, just press Enter.

    Either way, you'll end up seeing the "h" of all the "http://"s highlighted in yellow.

    That's it! You've just removed the duplicate domains of over 1 million URLs in one easy program using very few small steps.
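
    (One more optional aside: the big command above keeps the first URL it sees for each domain and deletes the rest. A rough Python sketch of the same idea, for anyone who prefers a script, is below. It assumes one http:// URL per line, just like the Vim pattern, and the file names are placeholders.)

    Code:
    # Rough Python sketch of "keep the first URL for each domain, drop the rest".
    # Mirrors the Vim pattern: domain = everything after http:// (optionally www.)
    # up to the first "/". File names are placeholders.
    import re

    domain_re = re.compile(r"^http://(?:www\.)?([^/]+)")
    seen = set()

    with open("deduped_urls.txt") as src, open("unique_domains.txt", "w") as dst:
        for line in src:
            match = domain_re.match(line.strip())
            domain = match.group(1) if match else line.strip()
            if domain not in seen:   # first time we've seen this domain: keep it
                seen.add(domain)
                dst.write(line)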
     
    • Thanks Thanks x 28
  2. kalrudra

    kalrudra BANNED

    Joined:
    Oct 29, 2010
    Messages:
    271
    Likes Received:
    300
    The best approach is to use a DBMS like SQLite to store all the data!! It's the simplest and easiest..
    I would code that in 10 min.
    :)
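
    (For anyone curious what that could look like: below is a minimal sketch using Python's built-in sqlite3 module. The file names are placeholders, and the UNIQUE constraint plus INSERT OR IGNORE is what silently drops the duplicate URLs.)

    Code:
    # Minimal sketch of the "store it in a DB" approach with SQLite.
    # File names ("all_urls.txt", "urls.db", "deduped_urls.txt") are placeholders.
    import sqlite3

    conn = sqlite3.connect("urls.db")
    conn.execute("CREATE TABLE IF NOT EXISTS urls (url TEXT UNIQUE)")

    with open("all_urls.txt") as src:
        conn.executemany(
            "INSERT OR IGNORE INTO urls (url) VALUES (?)",
            ((line.strip(),) for line in src if line.strip()),
        )
    conn.commit()

    # write the deduplicated, sorted list back out
    with open("deduped_urls.txt", "w") as dst:
        for (url,) in conn.execute("SELECT url FROM urls ORDER BY url"):
            dst.write(url + "\n")
    conn.close()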
     
  3. crazyflx

    crazyflx Elite Member

    Joined:
    Nov 9, 2009
    Messages:
    1,674
    Likes Received:
    4,825
    Location:
    http://CRAZYFLX.COM
    Home Page:
    DBMS? Database something?

    Sorry, I'm not familiar with the abbreviation.
     
  4. cooooookies

    cooooookies Senior Member

    Joined:
    Oct 6, 2008
    Messages:
    1,008
    Likes Received:
    216
    An RDBMS is a relational database management system, e.g. MySQL.

    Full ack, I do that for my URL lists, muuuuch easier.
     
  5. crazyflx

    crazyflx Elite Member

    Joined:
    Nov 9, 2009
    Messages:
    1,674
    Likes Received:
    4,825
    Location:
    http://CRAZYFLX.COM
    Home Page:
    You're probably right...actually, I'm sure you're right...but for somebody without the knowledge & resources, it probably isn't much easier.
     
    • Thanks Thanks x 1
  6. Bestbuyfoam

    Bestbuyfoam Jr. VIP Premium Member

    Joined:
    Nov 14, 2009
    Messages:
    1,637
    Likes Received:
    536
    Came at the right time...

    I have a huge list and didn't have a clue what I was going to do with it...

    Will try this right away...

    Thanks + Rep given...

    Thanks a million,

    And have a blessed one...
     
    • Thanks Thanks x 1
  7. crazyflx

    crazyflx Elite Member

    Joined:
    Nov 9, 2009
    Messages:
    1,674
    Likes Received:
    4,825
    Location:
    http://CRAZYFLX.COM
    Home Page:
    Happy this helped somebody :)

    I had such a headache after dealing with my lists without the method mentioned above, it was such a pain.

    I knew there had to be others out there with the same issue.
     
    • Thanks Thanks x 1
  8. m3ownz

    m3ownz Regular Member

    Joined:
    Dec 12, 2009
    Messages:
    311
    Likes Received:
    135
    Excellent timing, just started scraping huge lists, and was getting prepared for an afternoon of split > dedupe > join > dedupe, etc.
    Will save me a lot of time.
     
  9. crazyflx

    crazyflx Elite Member

    Joined:
    Nov 9, 2009
    Messages:
    1,674
    Likes Received:
    4,825
    Location:
    http://CRAZYFLX.COM
    Home Page:
    Huge pain in the @ss doing it that way isn't it?

    If you've got any questions, feel free to ask.

    It isn't nearly as complicated as I made it look above, haha.
     
    • Thanks Thanks x 1
  10. lewi

    lewi Jr. VIP Premium Member

    Joined:
    Aug 5, 2008
    Messages:
    2,309
    Likes Received:
    818
    Was just wondering if the remove duplicate domains step would take away the duplicate domains and just leave one, or if it is just going to remove the duplicate URLs!

    For example...

    http://domain.com/site.html
    http://domain.com/index.html

    Would it see those as different or the same and remove one?

    Lewi
     
  11. stoaf88

    stoaf88 Regular Member

    Joined:
    Nov 1, 2008
    Messages:
    243
    Likes Received:
    35
    Why not just use ScrapeBox for this? That's what I do.
     
  12. Apocryphax

    Apocryphax Newbie

    Joined:
    Mar 24, 2010
    Messages:
    39
    Likes Received:
    16
    Location:
    Google.com/ncr
    Why not read everything properly?
    That's what I do..
     
    • Thanks Thanks x 3
  13. crazyflx

    crazyflx Elite Member

    Joined:
    Nov 9, 2009
    Messages:
    1,674
    Likes Received:
    4,825
    Location:
    http://CRAZYFLX.COM
    Home Page:
    Well, let's say you had a list like this:

    http://domain.com/site.html
    http://domain.com/index.html
    http://domain.com/site.html

    And as per my instructions above, you followed the steps to "remove duplicate URLs", you'd then be left with:

    http://domain.com/site.html
    http://domain.com/index.html

    Then, if you followed my instructions above to remove duplicate domains, you'd be left with:

    http://domain.com/site.html


    Now, if you just wanted to go straight to "remove duplicate domains" without removing duplicate URLs first, and you did that with the following list:

    http://domain.com/site.html
    http://domain.com/index.html
    http://domain.com/site.html

    You'd still only be left with the following:

    http://domain.com/site.html

    Hopefully that all makes sense. In other words, it does EXACTLY the same thing as ScrapeBox does, but for lists that are over 1,000,000 lines.
     
  14. crazyflx

    crazyflx Elite Member

    Joined:
    Nov 9, 2009
    Messages:
    1,674
    Likes Received:
    4,825
    Location:
    http://CRAZYFLX.COM
    Home Page:
    Haha, what a great response to the person you were replying to with this message...gave me a good laugh.
     
    • Thanks Thanks x 1
  15. paincake

    paincake Power Member

    Joined:
    Aug 18, 2010
    Messages:
    716
    Likes Received:
    3,099
    Home Page:
    Flx, you failed to mention that simply typing ":" doesn't launch command mode. You need to press Esc first.
     
  16. crazyflx

    crazyflx Elite Member

    Joined:
    Nov 9, 2009
    Messages:
    1,674
    Likes Received:
    4,825
    Location:
    http://CRAZYFLX.COM
    Home Page:
    Strange....I don't have to press esc first. The second I type the ":" it immediately enters command mode.
     
  17. paincake

    paincake Power Member

    Joined:
    Aug 18, 2010
    Messages:
    716
    Likes Received:
    3,099
    Home Page:
    Oh, sorry, I restarted the app and now it's working like it's supposed to.
     
  18. crazyflx

    crazyflx Elite Member

    Joined:
    Nov 9, 2009
    Messages:
    1,674
    Likes Received:
    4,825
    Location:
    http://CRAZYFLX.COM
    Home Page:
    Happy you got it working :)

    If you've got any questions, let me know.
     
    • Thanks Thanks x 1
  19. crazyflx

    crazyflx Elite Member

    Joined:
    Nov 9, 2009
    Messages:
    1,674
    Likes Received:
    4,825
    Location:
    http://CRAZYFLX.COM
    Home Page:
    Try using the ":sort u" command first. I've seen that error, and for some reason, using the ":sort u" command beforehand solved it for me.
     
  20. daltarak

    daltarak Newbie

    Joined:
    Oct 4, 2009
    Messages:
    22
    Likes Received:
    3
    If you have shell access to a Linux box, just upload the file and type the following:

    # cat huge_file_with_duplicate_urls.txt | sort | uniq > clean_file.txt

    It'll also work with UnxUtils from SourceForge, but I didn't test it on Windows.


    Have fun!