
How to remove duplicate domains from text files using python?

Discussion in 'Other Languages' started by loopline, Jan 12, 2015.

  1. loopline

    If I have a .txt file that has multiple urls from the same domain and I want to remove duplicate domains, how can I do that in python?

    So if I have 1.txt and it contains

    http://www.domain1.com/page1
    http://www.domain1.com/page2
    http://www.domain2.com/page1
    http://www.domain3.com/page1
    http://www.domain3.com/page2

    and I want to only be left with

    http://www.domain1.com/page2
    http://www.domain2.com/page1
    http://www.domain3.com/page2

    I don't care which url is kept from a given domain, so long as there is only 1 url from that domain.

    I was thinking I might be able to do this with regex, but I've just never really used regex much. Perhaps using a dictionary would work.

    Or perhaps there is some module in Python that can be imported that already recognizes urls.

    I can remove duplicate urls just fine, but I'm not the world's foremost Python expert, so I'm a little stumped on this one.

    Any help is appreciated.
     
  2. loopline

    Update:

    Assuming I have a text file full of urls, I used urlparse to get it in the ballpark:

    Code:
    from urlparse import urlparse  # Python 2 stdlib; in Python 3 this lives in urllib.parse
    import glob


    urls3 = glob.glob(r"C:\something\*.txt")  # raw string so the backslashes aren't treated as escapes

    domains = {}
    for path in urls3:
        # Keep the last url seen for each domain; parse result index 1 is the netloc
        with open(path, "r") as infile:
            for url in infile:
                parsed = urlparse(url)
                domains[parsed[1]] = url
        # Overwrite the same file with one url per domain
        with open(path, "w") as outfile:
            for netloc in domains:
                outfile.write(domains[netloc])
        domains.clear()  # start fresh for the next file
     
    
    

    This works fine up to a point: it removes all the duplicate domains, but it still sees www.domain.com and domain.com as 2 separate entries.

    Any thoughts on how to tweak it so that it outputs only 1 final url from a given domain, www or not, would be appreciated!
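    One idea I had (untested sketch, assuming a leading www. is the only prefix that needs folding in): normalize the netloc before using it as the dictionary key, something like

    Code:
    from urlparse import urlparse

    def domain_key(url):
        # Hypothetical helper: the netloc minus any leading "www.", lowercased
        netloc = urlparse(url)[1].lower()
        if netloc.startswith("www."):
            netloc = netloc[4:]
        return netloc

    and then use domains[domain_key(url)] = url in the loop above. Not sure if that's the cleanest way though.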
     
  3. zeroto100k

    Apparently your input data doesn't follow a common convention. If we could guarantee that
    - All urls will begin with 'http://' and be immediately followed by the domain, and
    - All urls will be sorted in alphabetical order (so lines from the same domain sit next to each other)

    ...then you could run a script like this:

    Code:
    lastDomain = ''

    with open('urlfile.txt', 'r') as inputfile:
        with open('output.txt', 'w') as outputfile:
            for line in inputfile:
                # 'http://domain.com/page' splits into ['http:', '', 'domain.com', 'page']
                currentDomain = line.split('/')[2].strip()

                if lastDomain == currentDomain:
                    continue  # Same domain as the previous line; don't save it
                else:
                    lastDomain = currentDomain
                    outputfile.write(line)  # First line for this domain; save it
    But if you can't guarantee alphabetical order, then you can add each domain to a list and check on every iteration whether it is already in the list, as in the sketch below. When you're done reading the input file, just write the kept lines to the output file.
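    A minimal sketch of that unsorted case (using a set rather than a list, since set membership checks are faster, and writing each line the first time its domain appears):

    Code:
    seenDomains = set()

    with open('urlfile.txt', 'r') as inputfile:
        with open('output.txt', 'w') as outputfile:
            for line in inputfile:
                currentDomain = line.split('/')[2].strip()
                if currentDomain in seenDomains:
                    continue  # Already kept a url from this domain
                seenDomains.add(currentDomain)
                outputfile.write(line)  # First url seen for this domain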
     
  4. TheVegan

    This took me a minute but I figured it out :cool:

    The regex pattern should be something like
    Code:
    pat = r'((https?):\/\/)?(\w+\.)*(?P<domain>\w+)\.(\w+)(\/.*)?'
    
    This will give you a group with the name 'domain';
    then you can, for example:
    Code:
    import re

    pat = r'((https?):\/\/)?(\w+\.)*(?P<domain>\w+)\.(\w+)(\/.*)?'

    domain_names = []
    links = []

    with open('1.txt') as f:   # the url list from the first post
        lines = f.readlines()

    for line in lines:
        m = re.match(pat, line)
        if m is None:
            continue  # skip blank or non-url lines
        domain_name = m.group('domain')
        if domain_name not in domain_names:
            links.append(line)
            domain_names.append(domain_name)
    
    
     
    • Thanks x 1
  5. loopline

    I can't guarantee that they will be sorted, though that's easy enough to do. However, your approach seems as if it would still not treat www.domain.com and domain.com as the same, because when those get sorted alphabetically they will still be many urls apart.

    Thanks, I'll give this a try. I honestly just never had a need for regex, so I really never learned it. I'm currently working on learning it, but I still only know enough regex to create more issues than I solve, lol.
     
  6. TheVegan

    Yeah, that code will work. Google 'regex cheat sheet';
    I think most people, myself included, don't memorize the regex syntax. It's always handy to look at one of those and to use a regex testing tool, because in a lot of situations regex makes life so much easier! :)
     
  7. sohom

    Use .read(),
    then split on newlines via .split("\n"), so it will create a list [].

    Now you only need to run a for loop over every line and check whether each line is already in that list or not.

    Simple.
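    Roughly like this (a sketch; note this removes duplicate whole lines, so for duplicate domains you would still need to key on the domain part as in the posts above):

    Code:
    with open('1.txt') as f:          # the url list from the first post
        lines = f.read().split('\n')

    unique = []
    for line in lines:
        if line and line not in unique:  # skip blanks and lines already seen
            unique.append(line)

    with open('output.txt', 'w') as f:
        f.write('\n'.join(unique))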