URL list - filtering out URLs which have no subpage

Discussion in 'Black Hat SEO' started by counselor_X, Jan 10, 2015.

  1. counselor_X

    counselor_X Regular Member

    Joined:
    Mar 21, 2011
    Messages:
    244
    Likes Received:
    50
    Hi guys, I'm usually pretty good with Excel and Notepad when it comes to manipulating huge amounts of data to my liking, but there is something I would like to do and I don't have a solution for it.

    Say I have a list of 1 million URLs, some are just a domain (website.com) and some are a domain with a subpage (website.com/subpage). Can anyone think of a way that I can remove all URLs from the list which do not have a subpage? Unfortunately Scrapebox doesn't have this feature. Thanks
     
  2. Tobbe co

    Tobbe co Junior Member

    Joined:
    Sep 29, 2014
    Messages:
    171
    Likes Received:
    139
    http://.*?\.\w{2,3}/.*

    Load your list in Notepad++ and try this as a regex search. It may not be bulletproof, but give it a try.
    Open your new data in Excel and sort it from A to Z to get rid of all the empty lines.
    If it doesn't work, just Google for another regex or write your own to fit your needs.
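    For anyone who wants to check the pattern outside Notepad++, here is a rough Python sketch. The pattern is copied from this post; the sample URLs and the function name `matches_pattern` are made up for illustration, and Notepad++ may treat the pattern slightly differently than Python's `re` module does.

```python
import re

# Pattern copied from the post above; everything else here is illustrative.
PATTERN = re.compile(r"http://.*?\.\w{2,3}/.*")


def matches_pattern(url: str) -> bool:
    """Return True when the whole line matches the pattern."""
    return PATTERN.fullmatch(url.strip()) is not None


# Hypothetical sample lines, not taken from any real list.
samples = [
    "http://website.com",
    "http://website.com/",
    "http://website.com/subpage",
]

for url in samples:
    print(url, matches_pattern(url))
```

    In this sketch the pattern matches any line that has a slash after the TLD, so a domain with a trailing slash matches just like a real subpage does. It also skips TLDs longer than three letters.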
    Last edited: Jan 10, 2015
  3. counselor_X

    counselor_X Regular Member

    Thank you very much. I wasn't aware this tool existed. However, the string you provided does the opposite of what I am looking for. It is stripping out the URLs that have a subpage and only leaving the URLs that are just a domain. I would like to only get rid of the URLs which do not have a subpage.

    I suppose I could use this to produce the list of URLs that I need to remove, and then import it into Scrapebox to compare to my URL list and remove anything it finds as a match. That's probably the easiest way.
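    The compare-and-remove step does not have to happen in Scrapebox. As a minimal sketch, assuming both lists fit in memory as plain lists of lines, it could look like this in Python (the function name `remove_matches` is hypothetical):

```python
def remove_matches(all_urls, urls_to_remove):
    """Keep the URLs from all_urls that are not in urls_to_remove.

    Order of the original list is preserved and blank lines are dropped.
    """
    # A set gives fast membership checks, which matters at 1 million lines.
    blocked = {u.strip() for u in urls_to_remove}
    kept = []
    for line in all_urls:
        url = line.strip()
        if url and url not in blocked:
            kept.append(url)
    return kept


# Hypothetical example data.
full_list = ["http://website.com/", "http://website.com/subpage"]
bare_domains = ["http://website.com/"]
print(remove_matches(full_list, bare_domains))
```

    This only removes exact matches, so "http://website.com" and "http://website.com/" count as different entries.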
     
  4. counselor_X

    counselor_X Regular Member

    I also just noticed that all my URLs that are just a domain have a slash at the end (http://website.com/), which makes things more difficult.
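    One way around the trailing slash is to look at the path part of each URL instead of matching on the slash itself. A minimal Python sketch, assuming that a URL counts as having a subpage only when its path contains something other than slashes, and that entries without a scheme are plain http (the function names are hypothetical):

```python
from urllib.parse import urlsplit


def has_subpage(url: str) -> bool:
    """Return True when the URL has a non-empty path beyond a bare '/'."""
    url = url.strip()
    if "://" not in url:
        # Assumption: scheme-less entries such as "website.com/subpage" are http.
        url = "http://" + url
    path = urlsplit(url).path
    return path.strip("/") != ""


def keep_subpages(lines):
    """Filter a list of lines down to the URLs that have a subpage."""
    return [line.strip() for line in lines if line.strip() and has_subpage(line)]


# Hypothetical example data.
urls = ["http://website.com/", "website.com", "http://website.com/subpage"]
print(keep_subpages(urls))
```

    Under these assumptions a URL like http://website.com/?id=1 is treated as having no subpage, because only the path is checked and not the query string.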
     
  5. Tobbe co

    Tobbe co Junior Member

    Just open it in Excel (I guess it works like OpenOffice), use a slash as the separator, and sort by the column that holds the subpages.
    Everything will then be listed with the naked domains first, followed by all the URLs with subpages.
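    The same split-and-sort idea can be sketched in Python for lists too big to handle comfortably in a spreadsheet. This assumes every URL starts with a scheme such as http://, so the first path segment ends up in the fourth column after splitting on slashes (the function names are hypothetical):

```python
def subpage_column(url: str) -> str:
    """Return the first path segment, or '' for a naked domain."""
    parts = url.strip().split("/")
    # "http://host/page" splits into ["http:", "", "host", "page"],
    # so index 3 plays the role of the subpage column.
    return parts[3] if len(parts) > 3 else ""


def sort_by_subpage(urls):
    """Sort so that naked domains (empty subpage column) come first."""
    return sorted(urls, key=subpage_column)


# Hypothetical example data.
urls = ["http://b.com/page", "http://a.com/", "http://c.com"]
print(sort_by_subpage(urls))
```

    Because the naked domains all sort to the top with an empty key, they can then be cut off in one block.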