Link extractor - Help needed

Discussion in 'General Scripting Chat' started by TrafficWizard, Jan 13, 2015.

  1. TrafficWizard

    TrafficWizard Junior Member

    Joined:
    Aug 22, 2014
    Messages:
    161
    Likes Received:
    28
    Hello, I have written a few scripts in the past without trouble, but I need help with a link extractor: I need to collect all links from a website into a file. I know how to do it in a basic way. I can grab links directly from the source with document.links, or convert the whole HTML to a string and run a regex search for link patterns.

    The problem: how do you extract links that are not embedded in the site's HTML source, such as third-party ads? Is it possible?

    I need to get 100% of the links on a page. It works fine for links in the source code, but I can't get it to work for ads or iframed content. Any help, suggestions, or guides?
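
    For reference, the basic document.links approach described above might look like the sketch below. The name collectLinks is hypothetical, and the frame handling assumes standard browser behaviour: contentDocument is null for a cross-origin iframe, so such frames are simply skipped rather than read.

```javascript
// Hypothetical helper: gather link URLs from a document and from any
// frames the page is allowed to read (same-origin only).
function collectLinks(doc, seen) {
    seen = seen || new Set();

    // document.links covers <a href> and <area href> elements.
    Array.prototype.forEach.call(doc.links, function (link) {
        if (link.href) {
            seen.add(link.href);
        }
    });

    // Walk nested frames. Cross-origin frames expose no document,
    // so they contribute nothing here.
    var frames = doc.querySelectorAll('iframe, frame');
    Array.prototype.forEach.call(frames, function (frame) {
        var inner = null;
        try {
            inner = frame.contentDocument;
        } catch (e) {
            inner = null;
        }
        if (inner) {
            collectLinks(inner, seen);
        }
    });

    return Array.from(seen);
}
```

    In a page this would be called as collectLinks(document). It only sees what the page itself can see, which is why content in cross-origin frames is missing from the result.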
     
  2. jazzc

    jazzc Moderator Staff Member Moderator Jr. VIP

    Joined:
    Jan 27, 2009
    Messages:
    2,566
    Likes Received:
    11,026
    Occupation:
    Pusillanimous Knitter
    Location:
    Buenos Aires
    Write your scraper in a full headless browser, like PhantomJS.
     
  3. TrafficWizard

    TrafficWizard Junior Member

    Thank you for the reply, jazzc!

    I did take a look at PhantomJS, but I wasn't sure it would work for me. I'm working on a kind of blackhat browser plugin, and it could be overkill for this.

    I managed to pull together a lot of info on the subject. As of now there is no solution for my specific case that works 100% reliably on everything, but I did find a workaround :)
     
  4. MrBlue

    MrBlue Senior Member

    Joined:
    Dec 18, 2009
    Messages:
    970
    Likes Received:
    678
    Occupation:
    Web/Bot Developer
    Using PhantomJS:

    Code:
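    var page = require('webpage').create();
    var url = 'http://example.com';
    
    page.open(url, function(status) {
        // Stop early if the page could not be loaded.
        if (status !== 'success') {
            console.log('Failed to load ' + url);
            phantom.exit(1);
            return;
        }
        // Runs inside the page: collect every anchor that has an href.
        var links = page.evaluate(function() {
            return [].map.call(document.querySelectorAll('a[href]'), function(link) {
                // link.href is the resolved absolute URL, not the raw attribute.
                return link.href;
            });
        });
        console.log(links.join('\n'));
        phantom.exit();
    });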
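
    The list a scraper like this prints usually still needs cleaning before it goes to a file: duplicates, fragment-only variants, and non-HTTP schemes such as mailto: or javascript: tend to show up. A minimal sketch of that step follows. The name normalizeLinks is hypothetical, and it assumes a runtime with the standard URL class (a current browser or Node.js), which may not be available inside PhantomJS itself.

```javascript
// Hypothetical helper: resolve, filter and de-duplicate raw href values.
function normalizeLinks(hrefs, baseUrl) {
    var seen = new Set();

    hrefs.forEach(function (href) {
        if (!href) {
            return; // skip null or empty values
        }
        var resolved;
        try {
            // Resolves relative paths against the page URL.
            resolved = new URL(href, baseUrl);
        } catch (e) {
            return; // skip values that cannot be parsed as a URL
        }
        // Keep only web links; drop mailto:, javascript:, and similar.
        if (resolved.protocol !== 'http:' && resolved.protocol !== 'https:') {
            return;
        }
        // Treat #fragment variants of a page as the same link.
        resolved.hash = '';
        seen.add(resolved.href);
    });

    return Array.from(seen);
}
```

    The result keeps the order in which each link was first seen, so it can be joined with '\n' and written out the same way as the raw list.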
    
     
  5. jazzc

    jazzc Moderator Staff Member Moderator Jr. VIP

    Props for map() and prototypical usage.
     
  6. xNotch

    xNotch Registered Member

    Joined:
    Sep 16, 2014
    Messages:
    81
    Likes Received:
    19
    I'd like to point out that it's considered bad practice to parse HTML with regex. You should use a fault-tolerant HTML/XML parser in your language of choice instead.
     
  7. TrafficWizard

    TrafficWizard Junior Member

    Thank you for the input :) Over the past few days I've made a lot of changes. The regex is long gone and I've built a better solution, but I will keep that in mind, thank you.

    Now my problem (I know what you're thinking; don't tell me it's a bad idea, I'm going to do it one way or another, so don't bother): is it possible to programmatically bypass the same-origin policy? Yes, I want to click inside an iframe from a client-side language, and the iframe is on a different origin (site, port, etc.) :) Is it possible? Any help or input on this? I can click on any link on the page, but a click on the iframe works only as a "simulation"; no actual click is made. I just want to know whether it is possible or not.

    I have a working clickjacking script that I've been modifying, and so far it works really well. Now I need to make it more silent.

    But I'm still not giving up on the chance that a direct click on an iframed page is possible. Any help?
     
  8. xNotch

    xNotch Registered Member

    I'm not really an expert on browser origin policies or clickjacking, but I'm fairly certain it's impossible to manipulate an iframe with a different origin.

    I did read about one trick of using CSS to move an invisible iframe around to get the target to click... but I don't have any first-hand experience.
     
  9. jazzc

    jazzc Moderator Staff Member Moderator Jr. VIP

    A normal page can't, a browser extension can.
     
  10. TrafficWizard

    TrafficWizard Junior Member


    xNotch
    - thank you for the reply ;) After quite a lot of hours reading everything I could get my hands on, I agree it's impossible to make that click across different origins. I did manage to find a few examples, but they don't accomplish what I want to do. There are people claiming such scripts exist, but as far as I know nobody has ever provided one, or the scripts are years out of date. Browsers nowadays will not allow these things, and bypassing that would require 0-day exploits for the top browsers :) or some out-of-the-box method I don't know of.

    As for clickjacking, if you are interested I can PM you a basic example script. I've modified it and it's far from perfect, but I'm looking for a neat way to secure/hide it on a site with some clever logic behind it :D Though I have a bad feeling I will lose a few AdSense accounts by doing this.
     
  11. TrafficWizard

    TrafficWizard Junior Member

    Too bad, I so hoped a client-side script could too!