
I got sick of duplicates in the lists I was importing into GSA Search Engine Ranker

Discussion in 'General Scripting Chat' started by Anon752, Aug 26, 2012.

  1. Anon752

    Anon752 Regular Member

    Joined:
    Jul 3, 2010
    Messages:
    244
    Likes Received:
    179
    One of the big problems with SEO tools and huge lists of sites to post to is that the lists contain tons of near-duplicate urls that aren't easily removed.

    Before I show you the code, here are the results from running it just now: my original list had 497,718 lines (urls) and after running it had 433,981. That's 63,737 urls my SEO software doesn't have to deal with now.

    Examples:

    Code:
    http://some.com/read.php?tid=10053
    http://some.com/read.php?tid=10053&ordertype=desc
    http://some.com/read.php?tid=10053&page=e&
    
    All three of those urls are the same page for our purposes. We don't need all the extra junk on the end to feed them into an SEO tool, yet most de-duplication tools will leave all three of those in your list.

    More examples:

    Code:
    http://example.com/forum.php?mod=viewthread&tid=974
    http://example.com/forum.php?mod=viewthread&tid=974&extra=page%3D1
    
    http://another.com/index.php?title=User:Sdgsdgsdg
    http://another.com/index.php?title=User:Sdgsdgsdg&oldid=31992
    
    http://test.com/phpbb3/viewtopic.php?f=2&t=30301
    http://test.com/phpbb3/viewtopic.php?f=2&t=30301&view=print
    
    That's three more examples of the same problem.
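
    The fix is the same in every case: chop off everything from the script's query string onwards so every variation collapses to one line, then let sort -u squash the now-identical lines into one. A rough sketch using just one rule (the same rule the full script below starts with):

    Code:
    printf '%s\n' \
        'http://some.com/read.php?tid=10053' \
        'http://some.com/read.php?tid=10053&ordertype=desc' \
        'http://some.com/read.php?tid=10053&page=e&' \
        | sed 's/\/read\.php?.*/\//g' | sort -u
    
    All three urls from the first example come out as the single line http://some.com/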

    I've spent the last few hours putting together a sed script that cleans up about 60 variations of these.

    File "replacements.sed":

    Code:
    # replacements.sed - collapse near-duplicate forum/CMS urls to one canonical form.
    # Note: these are sed basic regular expressions, so "?" is a plain literal
    # character and a literal dot has to be written as "\."
    s/\/read\.php?.*/\//g
    s/\/viewthread\.php.*/\//g
    s/\/forum\.php.*/\//g
    s/\/index\.php?title=User:.*/\//g
    s/\/index\.php?title=User%3A.*/\//g
    s/\/viewtopic\.php?f=.*/\//g
    s/\/showthread\.php?.*/\//g
    s/\/review\.asp?.*/\//g
    s/\/index\.php\/User:.*/\//g
    s/\/index\.php\/User%3A.*/\//g
    s/\/viewtopic\.php?p=.*/\//g
    s/\/boke\.asp?.*/\//g
    s/\/forum\/index\.php?topic=.*/\/forum\//g
    s/\/index\.php?action=profile.*/\//g
    s/\/index\.php?topic=.*\.new/\//g
    s/\/memberlist\.php?.*/\//g
    s/\/home\.php?mod=.*/\//g
    s/\/?feed=rss.*/\//g
    s/\/index\.php?option=com_akobook.*/\/index\.php?option=com_akobook/g
    s/\/search\.php.*/\//g
    s/\/forum\/topic\.php?.*/\/forum\//g
    s/\/asae-comments\.cgi.*/\//g
    s/\/contact\.php?.*/\//g
    s/\/member\.php?action=profile.*/\//g
    s/\/YaBB\.pl?.*/\//g
    s/\/viewtopic\.php?pid=.*/\//g
    s/\/forum?func=view.*/\//g
    s/\/viewlinks\.php?.*/\/viewlinks\.php/g
    s/\/posting\.php?.*/\//g
    s/\/space\.php?uid=.*/\//g
    s/\/showtopic\.aspx?.*/\//g
    s/?bfm_index=.*//g
    s/\/so\.php?id=.*/\//g
    s/\/modules\.php?name=Forums.*/\/modules\.php?name=Forums/g
    s/\/printthread\.php?.*/\//g
    s/\/index\.php?option=com_ckforms.*/\//g
    s/\/?contact_form=.*/\//g
    s/\/?widgetType=.*/\//g
    s/\/index\.php?showuser=.*/\//g
    s/\/forum\/member\.php?.*/\/forum\//g
    s/\/index\.php?do=forum.*/\/index\.php?do=forum/g
    s/\/guest-book\/?cpage=.*/\/guest-book\//g
    s/\/rss\.php?.*/\//g
    s/\/index\.php?site=guestbook.*/\/index\.php?site=guestbook/g
    s/\/profile\.php?mode=viewprofile.*/\//g
    s/\/search?updated-.*/\//g
    s/\/index\.php?topic=.*\.0$/\//g
    s/\/index\.php?topic=.*\.msg.*/\//g
    s/\/forum_viewtopic\.php?.*/\//g
    s/\/index\.php?site=forum_topic.*/\/index\.php?site=forum/g
    s/\/submit_article\.php?id=.*/\//g
    s/\/submit\.php?id=.*/\//g
    s/\/index\.php?action=post;.*/\//g
    s/\/Archiver\.asp?ThreadID=.*/\//g
    s/\/member\.php?u=.*/\//g
    s/\/guest_book\.php?.*/\/guest_book\.php/g
    s/\/index\.php?topic=.*\.0;prev_.*/\//g
    s/\/upcoming\.php?page=.*/\//g
    s/\/redirect\.php?.*/\//g
    s/\/guestbook\.php?page=.*/\/guestbook\.php/g
    s/?replytocom=.\+//g
    
    If you're a Linux/Unix user you will know what to do with this file. Save that text as replacements.sed, then do this:

    Code:
    sed -f replacements.sed < urls.txt | sort -u >fixed-urls.txt
    
    So let's say your list of urls is in urls.txt: the cleaned list will end up in fixed-urls.txt with far fewer duplicates.
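
    A quick way to confirm the savings is to count the lines before and after; the 497,718 vs 433,981 numbers above are just this kind of before/after count:

    Code:
    wc -l urls.txt fixed-urls.txt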

    If anyone has any others that are a real problem and that I didn't cover in this list, please tell me, with examples, and I will get them added.
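
    If you want to have a go at writing one yourself, every rule follows the same shape: escape literal dots as \., leave the ? alone (it's a plain character in sed's basic regex), match the junk with .*, and put the canonical ending in the replacement. A made-up example (this guestbook path is hypothetical, not something from my list):

    Code:
    # hypothetical pattern: collapse guestbook signing pages back to the guestbook root
    s/\/guestbook\/sign\.php?.*/\/guestbook\//g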

    PS: I won't be using this thread to teach anyone Linux or the command line. If you don't understand this post, maybe someone else has the time to help.
     
  2. seoreports

    seoreports Junior Member

    Joined:
    Aug 26, 2012
    Messages:
    151
    Likes Received:
    34
    nice script. sed is awesome.