
How to remove duplicate domains from text files using python?

Discussion in 'Other Languages' started by loopline, Jan 12, 2015.

  1. loopline

    If I have a .txt file that has multiple urls from the same domain and I want to remove duplicate domains, how can I do that in python?

    So if I have 1.txt and it contains

    http://www.domain1.com/page1
    http://www.domain1.com/page2
    http://www.domain2.com/page1
    http://www.domain3.com/page1
    http://www.domain3.com/page2

    and I want to only be left with

    http://www.domain1.com/page2
    http://www.domain2.com/page1
    http://www.domain3.com/page2

    I don't care which url is kept from a given domain, so long as there is only 1 url from that domain.

    I was thinking I might be able to do this with regex, but I've just never really used regex much. Perhaps using a dictionary would work.

    Or perhaps there is some module in Python that can be imported that already recognizes urls.

    I can remove duplicate urls just fine, but I'm not the world's foremost Python expert, so I'm a little stumped on this one.

    Any help is appreciated.
     
  2. loopline

    Update:

    Assuming I have a text file full of urls, I used urlparse to get it in the ballpark:

    Code:
    from urlparse import urlparse  # Python 2 stdlib; in Python 3 this lives in urllib.parse
    import glob


    urls3 = glob.glob(r"C:\something\*.txt")  # raw string so the backslashes aren't treated as escapes

    domains = {}
    for path in urls3:
        # Keep the last url seen for each domain; parse result index 1 is the netloc
        with open(path, "r") as infile:
            for url in infile:
                parsed = urlparse(url)
                domains[parsed[1]] = url
        # Overwrite the same file with one url per domain
        with open(path, "w") as outfile:
            for netloc in domains:
                outfile.write(domains[netloc])
        domains.clear()  # start fresh for the next file
     
    
    

    This works fine up to a point: it removes all the duplicate domains, but it still sees www.domain.com and domain.com as 2 separate entries.

    Any thoughts on how to tweak it so that it outputs only 1 final url from a given domain, www or not, would be appreciated!
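    One idea I had (untested sketch, assuming a leading www. is the only prefix that needs folding in): normalize the netloc before using it as the dictionary key, something like

    Code:
    from urlparse import urlparse

    def domain_key(url):
        # Hypothetical helper: the netloc minus any leading "www.", lowercased
        netloc = urlparse(url)[1].lower()
        if netloc.startswith("www."):
            netloc = netloc[4:]
        return netloc

    and then use domains[domain_key(url)] = url in the loop above. Not sure if that's the cleanest way though.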
     
  3. zeroto100k

    Apparently your input data doesn't follow a common convention. If we could guarantee that
    - All urls will begin with 'http://' and be immediately followed by the domain, and
    - All urls will be sorted in alphabetical order (so lines from the same domain sit next to each other)

    ...then you could run a script like this:

    Code:
    lastDomain = ''

    with open('urlfile.txt', 'r') as inputfile:
        with open('output.txt', 'w') as outputfile:
            for line in inputfile:
                # 'http://domain.com/page' splits into ['http:', '', 'domain.com', 'page']
                currentDomain = line.split('/')[2].strip()

                if lastDomain == currentDomain:
                    continue  # Same domain as the previous line; don't save it
                else:
                    lastDomain = currentDomain
                    outputfile.write(line)  # First line for this domain; save it
    But if you can't guarantee alphabetical order, then you can add each domain to a list and check on every iteration whether it is already in the list, as in the sketch below. When you're done reading the input file, just write the kept lines to the output file.
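    A minimal sketch of that unsorted case (using a set rather than a list, since set membership checks are faster, and writing each line the first time its domain appears):

    Code:
    seenDomains = set()

    with open('urlfile.txt', 'r') as inputfile:
        with open('output.txt', 'w') as outputfile:
            for line in inputfile:
                currentDomain = line.split('/')[2].strip()
                if currentDomain in seenDomains:
                    continue  # Already kept a url from this domain
                seenDomains.add(currentDomain)
                outputfile.write(line)  # First url seen for this domain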
     
  4. TheVegan

    This took me a minute but I figured it out :cool:

    The regex pattern should be something like
    Code:
    pat = r'((https?):\/\/)?(\w+\.)*(?P<domain>\w+)\.(\w+)(\/.*)?'
    
    This will give you a group with the name 'domain';
    then you can, for example:
    Code:
    import re

    pat = r'((https?):\/\/)?(\w+\.)*(?P<domain>\w+)\.(\w+)(\/.*)?'

    domain_names = []
    links = []

    with open('1.txt') as f:   # the url list from the first post
        lines = f.readlines()

    for line in lines:
        m = re.match(pat, line)
        if m is None:
            continue  # skip blank or non-url lines
        domain_name = m.group('domain')
        if domain_name not in domain_names:
            links.append(line)
            domain_names.append(domain_name)
    
    
     
    • Thanks x 1
  5. loopline

    I can't guarantee that they will be sorted, though that's easy enough to do. However, your approach seems as if it would still not treat www.domain.com and domain.com as the same, because when those get sorted alphabetically they will still be many urls apart.

    Thanks, I'll give this a try. I honestly just never had a need for regex, so I really never learned it. I'm currently working on learning it, but I still only know enough regex to create more issues than I solve, lol.
     
  6. TheVegan

    Yeah, that code will work. Google 'regex cheat sheet';
    I think most people, myself included, don't memorize the regex syntax. It's always handy to look at one of those and to use a regex testing tool, because in a lot of situations regex makes life so much easier! :)
     
  7. sohom

    Use .read(),
    then split on newlines via .split("\n"), so it will create a list [].

    Now you only need to run a for loop over every line and check whether each line is already in that list or not.

    Simple.
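    Roughly like this (a sketch; note this removes duplicate whole lines, so for duplicate domains you would still need to key on the domain part as in the posts above):

    Code:
    with open('1.txt') as f:          # the url list from the first post
        lines = f.read().split('\n')

    unique = []
    for line in lines:
        if line and line not in unique:  # skip blanks and lines already seen
            unique.append(line)

    with open('output.txt', 'w') as f:
        f.write('\n'.join(unique))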