
A way to scan many URLs for their platform?

Discussion in 'Black Hat SEO' started by SEO Sage, Jun 17, 2013.

  1. SEO Sage

    SEO Sage Junior Member

    Joined:
    Jan 7, 2011
    Messages:
    112
    Likes Received:
    36
    Occupation:
    Search Engine Marketing
    Location:
    The Big Apple
    I am working on a design document for an SEO tool, just for my own fun and experimentation. I am always brainstorming such things to see what I could use to make more money. Anyway, I have a little SEO system where I check my competition's backlinks and then scan them for platforms. I use the free 'sick platform reader', but I want to build something custom that can dig deeper to identify the CMS platform used, or detect Flash websites, pure HTML sites, etc., and that behaves the way I want it to.

    I suck at regular expressions, and I know this has to involve them in some way, but I want to hear about anything related to this topic. I would also like to find partners and people interested in forming a private discussion group about it. Please feel free to PM me or post here if you have something to add. I look forward to hearing people's ideas and comments on this subject, thanks!

    :)
     
  2. Repulsor

    Repulsor Power Member

    Joined:
    Jun 11, 2013
    Messages:
    710
    Likes Received:
    267
    Location:
    PHP Scripting ;)
    Regular expressions? You mean regex? But why?

    Well, once you have the URLs, you can simply fetch the HTML of each site and check it for footprints. Almost all platforms have their name written on every page of the site. If I scrape a WordPress site, I can simply do a string check for "WordPress" in the HTML. If it returns true, the site is WordPress. There can be flaws in this setup, though, e.g. a Magento site discussing WordPress.

    To make it a bit more foolproof, you can look into the footprints of each platform and match your results against those. For example, check whether there is a wp-login.php URL, or whether the WordPress wp-content folder exists, and so on. I wouldn't be using regex at all. Well, maybe I am not thinking the way you are thinking?
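
    The footprint idea above can be sketched roughly as follows. This is only an illustration: the FOOTPRINTS table, the function names, and the specific substrings are assumptions made for the example, not a tested or complete signature list. Matching on paths such as /wp-content/ rather than the bare word "WordPress" is what keeps a Magento page that merely talks about WordPress from being misclassified.

```python
# Hypothetical footprint table: platform name -> substrings that tend to
# appear in that platform's markup. Illustrative only, not exhaustive.
FOOTPRINTS = {
    "wordpress": ["/wp-content/", "/wp-includes/", "wp-login.php"],
    "magento": ["/skin/frontend/", "Mage.Cookies"],
    "joomla": ["/media/system/js/", "option=com_content"],
}


def detect_platforms(html, footprints=FOOTPRINTS):
    """Return {platform: number of distinct footprints found in html}."""
    lowered = html.lower()
    hits = {}
    for platform, needles in footprints.items():
        # Case-insensitive substring test, one point per footprint present.
        count = sum(1 for needle in needles if needle.lower() in lowered)
        if count:
            hits[platform] = count
    return hits


def best_guess(html, footprints=FOOTPRINTS):
    """Return the platform with the most footprint hits, or None."""
    hits = detect_platforms(html, footprints)
    if not hits:
        return None
    return max(hits, key=hits.get)
```

    Counting how many distinct footprints match, instead of stopping at the first one, gives a crude confidence score to rank guesses by.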
     
  3. bartosimpsonio

    bartosimpsonio Jr. VIP Premium Member

    Joined:
    Mar 21, 2013
    Messages:
    8,883
    Likes Received:
    7,481
    Occupation:
    ZLinky2Buy SEO Services
    Location:
    ⇩⇩⇩⇩⇩⇩⇩⇩⇩⇩⇩⇩
    Most CMSs have "signatures", and the way to check those signatures is indeed by regular expression or some other form of text matching (substring, etc.).

    The URL alone may not be enough to reveal the CMS in question, especially when URLs are rewritten for SEO friendliness.
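
    As one example of a signature that suits a regular expression, many CMSs emit a generator meta tag in the page head. A minimal sketch, assuming the tag is present and that the name attribute comes before content (the pattern and the function name are made up for this example, and sites can strip or fake the tag):

```python
import re

# Matches <meta name="generator" content="..."> and captures the content
# value. Assumes name= appears before content= inside the tag.
GENERATOR_RE = re.compile(
    r'<meta[^>]+name=["\']generator["\'][^>]+content=["\']([^"\']+)["\']',
    re.IGNORECASE,
)


def generator_of(html):
    """Return the generator meta tag's content value, or None if absent."""
    match = GENERATOR_RE.search(html)
    return match.group(1) if match else None
```

    For instance, generator_of('<meta name="generator" content="WordPress 3.5.1" />') gives back the string "WordPress 3.5.1".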
     
  4. Repulsor

    Repulsor Power Member

    Joined:
    Jun 11, 2013
    Messages:
    710
    Likes Received:
    267
    Location:
    PHP Scripting ;)
    True, but there is still a workaround, since what gets rewritten is usually only the front end.
    To check whether a site is WordPress, I just have my script request the site's wp-content folder. If it returns a 404, it is not WordPress. If it returns a 403 or access denied, that's WordPress. No one is going to rename the wp-content folder, or they would have to rename it everywhere in the WP core.

    Well, I am just talking about WP here, but there are similar approaches for all other platforms.
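
    The status-code check described here could look something like the sketch below. The function names are invented for the example. The 404 and 403 rules come from the post; treating 200 as "possibly WordPress" is an extra assumption (a stock install ships an empty index.php in that folder), and any other status is left as unknown.

```python
import urllib.error
import urllib.request


def classify_wp_content_status(status):
    """Interpret the HTTP status returned for <site>/wp-content/."""
    if status == 404:
        return "not wordpress"
    if status == 403:
        return "likely wordpress"
    if status == 200:
        # Assumption: a default install answers 200 with an empty page.
        return "possibly wordpress"
    return "unknown"


def fetch_status(url, timeout=10):
    """Return the HTTP status code for a HEAD request to url.

    Connection-level failures (DNS errors, timeouts) are not handled
    here and propagate to the caller.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx and 5xx responses arrive as exceptions carrying the code.
        return error.code
```

    A caller would combine the two, e.g. classify_wp_content_status(fetch_status("http://example.com/wp-content/")).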
     
  5. SmartMan

    SmartMan BANNED

    Joined:
    Jul 25, 2012
    Messages:
    673
    Likes Received:
    1,244
    Just so you know, you can add your own set of footprints for specific platforms in Sick reader's filter.txt. The key here is to find the best footprints.
     
  6. bartosimpsonio

    bartosimpsonio Jr. VIP Premium Member

    Joined:
    Mar 21, 2013
    Messages:
    8,883
    Likes Received:
    7,481
    Occupation:
    ZLinky2Buy SEO Services
    Location:
    ⇩⇩⇩⇩⇩⇩⇩⇩⇩⇩⇩⇩
    True, but you still need to know the root directory of the CMS. Say you come across example.com. Just testing example.com/wp-content is not enough, as the install may live under example.com/blog/wp-content/ and so on.

    Also, wp-content may not be renamed, but it may be rewritten. One of the sites I work on aliases wp-content to /resources, for example.

    Still, you may get a large percentage of success by brute forcing: try example.com/wp-content and other signature directories before moving on to other strategies.
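
    One way to brute force the root directory is to walk up the path of a known page URL and try the signature directory at every level. A rough sketch, with a made-up function name, that only builds the candidate URLs and leaves the actual requests to whatever fetch code you already have:

```python
from urllib.parse import urlsplit, urlunsplit


def candidate_probe_urls(page_url, signature_dir="wp-content"):
    """List URLs where signature_dir might live, from site root downward."""
    parts = urlsplit(page_url)
    segments = [segment for segment in parts.path.split("/") if segment]
    # Heuristic: a last segment containing a dot is a file name
    # (index.php, post.html), not a directory, so drop it.
    if segments and "." in segments[-1]:
        segments = segments[:-1]
    candidates = []
    for depth in range(len(segments) + 1):
        prefix = "/".join(segments[:depth])
        path = f"/{prefix}/{signature_dir}/" if prefix else f"/{signature_dir}/"
        candidates.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return candidates
```

    For http://example.com/blog/2013/post.html this yields the /wp-content/, /blog/wp-content/ and /blog/2013/wp-content/ variants, in that order. It does nothing for an aliased directory such as /resources, which still needs HTML footprints.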
     
  7. Repulsor

    Repulsor Power Member

    Joined:
    Jun 11, 2013
    Messages:
    710
    Likes Received:
    267
    Location:
    PHP Scripting ;)
    So do you have all the WordPress core files rewritten to accept /resources as wp-content? Aha. I have heard of people changing the wp-admin URL, but not wp-content, yet.

    And yes, pretty much everything can be filtered using the existing signatures and footprints, without bothering with anything more advanced.
     
  8. moromete

    moromete Junior Member

    Joined:
    Jul 19, 2008
    Messages:
    183
    Likes Received:
    150
    There is similar software already: urlradar.com. It does not scan just the main page (the index.php file); it can be configured to scan multiple pages for a platform. This is useful if you wish to check that the submit/login pages for a specific URL exist and are working.
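
    Checking several pages per site is easy to sketch by hand as well. The snippet below is not how urlradar works internally (that is not described here); it is a generic illustration with invented names. The fetch_status argument is assumed to be any callable that takes a URL and returns an HTTP status code.

```python
def check_pages(base_url, paths, fetch_status):
    """Return {path: True/False}, True when the page answers HTTP 200."""
    base = base_url.rstrip("/")
    results = {}
    for path in paths:
        # Normalise slashes so "wp-login.php" and "/wp-login.php" both work.
        url = base + "/" + path.lstrip("/")
        results[path] = fetch_status(url) == 200
    return results
```

    Passing the fetcher in as a parameter keeps the function testable offline and lets you swap in a version with proxies, retries, or delays later.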