
Making a program that scans websites?

Discussion in 'General Programming Chat' started by dalinkwent6, Sep 1, 2014.

  1. dalinkwent6

    dalinkwent6 Junior Member

    Joined:
    Jun 30, 2013
    Messages:
    114
    Likes Received:
    16
    Location:
    5th Dimension
    Background: Intermediate Java, basic C++

    Problem:

    I am trying to scrape proxies from a specific website that displays the proxies in table format: |address|port|location|time updated| etc...
    I usually just:
    1. copy and paste each page (only 20 proxies per page) into Word
    2. delete all the cells except address and port
    3. copy and paste into Notepad to remove the cell formatting and prepare for transfer
    4. delete the space between address and port and replace it with ':'
    5. save the file and import it into ScrapeBox.


    Solution ideas:

    I know how to scan text files using the Scanner object, so perhaps I could make a program that scans the file from step 3 and automates replacing the space between address and port with ':' (rough sketch below), but that only saves a fraction of the time. Any ideas on an algorithm to scan websites directly from an IDE (specifically Eclipse)? And perhaps to change pages within that site so it scans the next one?
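
    For reference, a minimal sketch of that step-3 idea, assuming each line of the pasted file is just "address port" (the file names here are made up):

    Code:
    import java.io.File;
    import java.io.PrintWriter;
    import java.util.Scanner;
    
    public class ProxyFileFormatter {
        public static void main(String[] args) throws Exception {
            // Hypothetical file names: the raw Notepad dump in, the ScrapeBox-ready list out
            Scanner scanner = new Scanner(new File("proxies_raw.txt"));
            PrintWriter out = new PrintWriter("proxies.txt");
    
            while (scanner.hasNextLine()) {
                String line = scanner.nextLine().trim();
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                // Assumes "address port" per line; swap the separating whitespace for ':'
                String[] parts = line.split("\\s+");
                if (parts.length >= 2) {
                    out.println(parts[0] + ":" + parts[1]);
                }
            }
            scanner.close();
            out.close();
        }
    }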
     
  2. Arbvestor

    Arbvestor Newbie

    Joined:
    Feb 11, 2014
    Messages:
    25
    Likes Received:
    15
    Occupation:
    Support Tech
    Location:
    Spain
    Home Page:
    It depends a bit on the website you want to scrape.

    1) If it does not use JavaScript and/or AJAX to render the table in the browser, you can use a simple approach with a library like JSoup. It has a connect method to load content from the web directly. You then parse the content with JSoup and can easily select the parts of the site you need (JSoup uses CSS selectors). You can print the results in whatever format you need, or write them directly to a database or wherever else. (See the JSoup sketch at the end of this post.)

    2) If the site uses AJAX calls to fetch the actual content, or if it renders the page with JavaScript, you probably need an approach that uses a real browser. For that, I find the best tool to be Selenium WebDriver. Together with PhantomJS, which is a headless browser, it is an easy-to-manage setup for scraping otherwise hard-to-scrape websites.


    General note: I assume you are using Java for this, but any other language will probably do, as long as there is a library like JSoup and/or Selenium bindings for it.
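
    For case 1), a minimal JSoup sketch might look like this. The URL and the "listable" table id are assumptions; adjust the CSS selectors to whatever the real page uses:

    Code:
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    
    public class JsoupProxyScraper {
        public static void main(String[] args) throws Exception {
            // Hypothetical URL; this only works if the table is present in the static HTML (no AJAX/JS rendering)
            Document doc = Jsoup.connect("http://example.com/proxy-list/1")
                    .userAgent("Mozilla/5.0")
                    .get();
    
            // "table#listable tr" is a CSS selector; the id is an assumption, use whatever the real page has
            for (Element row : doc.select("table#listable tr")) {
                Elements cells = row.select("td");
                if (cells.size() < 2) {
                    continue; // skip the header row and anything malformed
                }
                // Print in the address:port format ScrapeBox expects
                System.out.println(cells.get(0).text() + ":" + cells.get(1).text());
            }
        }
    }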
     
  3. DearSanta

    DearSanta BANNED

    Joined:
    Jul 25, 2014
    Messages:
    178
    Likes Received:
    66
    I would suggest using Zennoposter. It's easy to start with and you don't need programming skills for easy jobs. Try the demo.
     
  4. Arbvestor

    Arbvestor Newbie

    Joined:
    Feb 11, 2014
    Messages:
    25
    Likes Received:
    15
    Occupation:
    Support Tech
    Location:
    Spain
    Home Page:
    That is an option too, of course, if he would rather not program. But Zennoposter is not free and you quickly run into its limits. If the guy can program a bit, then I would recommend writing a simple scraper with Selenium. Something like this:

    Code:
    package testing;
    
    import java.util.List;
    import java.util.logging.Logger;
    
    import org.openqa.selenium.By;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.WebElement;
    import org.openqa.selenium.phantomjs.PhantomJSDriver;
    
    public class HideMyAssProxyListGetter {
        private final static Logger LOGGER = Logger.getLogger(HideMyAssProxyListGetter.class.getName());
        
        
        public static void main(String[] args) {
            WebDriver driver = new PhantomJSDriver();
            
            String url = null;
            try{
                // Walk pages 5 down to 1 of the proxy list
                for (int i = 5; i > 0; i--){
                    url = "http://proxylist.hidemyass.com/"+i;
                    driver.get(url);
                    
                    WebElement table = driver.findElement(By.id("listable"));
                    
                    for (WebElement tr : table.findElements(By.tagName("tr"))){
                        List<WebElement> tds = tr.findElements(By.tagName("td"));
                    
                        if (tds.size()<8){
                            continue;
                        }
                        String ipStr = tds.get(1).getText();
                        if (ipStr.equals("IP address")){
                            continue; //ignore the header line
                        }
                        
                        String portStr = tds.get(2).getText();
                        String location = tds.get(3).getText();
                        
                        
                        // substring(7) strips the leading "width: " from the inline style, leaving the percentage value
                        String speed = tds.get(4).findElement(By.cssSelector("div.progress-indicator>div")).getAttribute("style").substring(7);
                        String connectionTime = tds.get(5).findElement(By.cssSelector("div.progress-indicator>div")).getAttribute("style").substring(7);
                        String protocol = tds.get(6).getText();
                        String anonymity = tds.get(7).getText();
        
                        LOGGER.info(ipStr + ":" +portStr
                                + " ("+location+")" 
                                + " speed="+speed 
                                + " connectionTime="+connectionTime
                                + " protocol="+ protocol
                                + " anonymity="+ anonymity);
    
                    }
                }
            }
            catch (org.openqa.selenium.NoSuchElementException nse){
                LOGGER.warning("can't access "+url +" try next service..., error was: "+nse.getMessage());
            }
            driver.quit();
    
        }
    }
     
    • Thanks x 2
  5. bluehatface

    bluehatface Regular Member

    Joined:
    Oct 19, 2013
    Messages:
    232
    Likes Received:
    98
    Location:
    Here
    Can you not just grab the page with cURL and parse the DOM with an XML parser, or regex?
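
    That idea, roughly sketched in Java instead of a curl call (to stay with the language used above); the URL is a placeholder and the regex is deliberately naive, so this only works on a page that serves the table as static HTML:

    Code:
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    
    public class RegexProxyScraper {
        public static void main(String[] args) throws Exception {
            // Hypothetical URL; HttpURLConnection plays the role of curl here
            URL url = new URL("http://example.com/proxy-list");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestProperty("User-Agent", "Mozilla/5.0");
    
            StringBuilder html = new StringBuilder();
            try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    html.append(line).append('\n');
                }
            }
    
            // Naive pattern: an IPv4 address followed shortly after by a 2-5 digit port number
            Pattern p = Pattern.compile("(\\d{1,3}(?:\\.\\d{1,3}){3})\\D{1,30}?(\\d{2,5})");
            Matcher m = p.matcher(html);
            while (m.find()) {
                System.out.println(m.group(1) + ":" + m.group(2));
            }
        }
    }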
     
  6. journeycoder

    journeycoder Newbie

    Joined:
    Aug 31, 2014
    Messages:
    27
    Likes Received:
    3
    Occupation:
    Coder: asp.net, js, php,wordpress, genesis,tools,
    Location:
    bestmicroovens.com
    Home Page:
    I don't know the two languages above, but if you want to process it with .NET, I'd be glad to help; I can crawl data from pretty much any website.
     
  7. ou8myi

    ou8myi Newbie

    Joined:
    Sep 4, 2014
    Messages:
    18
    Likes Received:
    5
    Occupation:
    programmer
    Location:
    Middle of Nowhere
    I think bluehatface has the simplest solution, if simple is OK. I would probably use Perl, though, to grab and parse in one fell swoop. (Not familiar with cURL.)
     
  8. Arbvestor

    Arbvestor Newbie

    Joined:
    Feb 11, 2014
    Messages:
    25
    Likes Received:
    15
    Occupation:
    Support Tech
    Location:
    Spain
    Home Page:
    That would only work if the site to be scraped does not use AJAX. Also, to automate this, you would need a (shell) script to call curl or wget or whatever command-line "browser" you want to use, and then fire up the regex/XML parser. I think that approach is actually more complicated than sticking with one programming language and using it throughout.

    But of course... it all depends on the experience of the OP.
     
  9. xNotch

    xNotch Registered Member

    Joined:
    Sep 16, 2014
    Messages:
    69
    Likes Received:
    15
    Whenever I encounter complex JavaScript, Selenium/PhantomJS is what I go to as well. Selenium has bindings for several languages, so it's easy to integrate. Plus, since PhantomJS is headless, you can run it on a cheap VPS. This is what I'd go with.
     
    • Thanks x 1
  10. dalinkwent6

    dalinkwent6 Junior Member

    Joined:
    Jun 30, 2013
    Messages:
    114
    Likes Received:
    16
    Location:
    5th Dimension
    This is great; this is actually the first time I've worked with Selenium. I can definitely see the possibilities now, though. I'll post my code when I get the chance to finish the project. Thanks!
     
  11. bubbubber

    bubbubber Newbie

    Joined:
    Sep 30, 2014
    Messages:
    3
    Likes Received:
    1
    Occupation:
    IT Professional
    Location:
    Taiwan
    Home Page:
    Yes, if the site uses AJAX and/or just generally generates a lot of table and form elements on the fly or in nested layers, then it is more difficult to scrape. Rather, it is more difficult to interact with the site (that is, for sites where interaction is necessary).

    If you take a look at my profile you can see a link to a Youtube video that shows an alternative method to web scraping and automated web crawling.

    Ask me any questions you have about the technology used.

    I believe most folks who do web scraping and web crawling use PHP and the HTTP protocol. In my method I use a web browser control embedded in a Windows Forms application. I went this route because I found that certain sites need to think an actual mouse cursor is moving around, and that their buttons are being pressed by actual mouse clicks, or else they will not respond. I found this to be true for some online betting sites, at the very least. Doing it my way, I am able to use Win32 API calls to move the mouse cursor and actuate mouse clicks and keystrokes. I used VB6 in the example in the video, but you can write the same program in C# (and actually, it might be easier to write it now in C# and .NET).

    I know a lot of people will call BS on this and say that this is not necessary, but, based on my experiences with certain sites, it is a viable solution to the problem.

    This method is not meant to replace the more commonly used PHP scraping/crawling methods. My method is comparatively slow because it actually uses a web browser to bring up each web page, whereas I assume PHP over plain HTTP can get through many more pages in the same amount of time. But, in certain situations, my method will let you log into sites that the PHP method just can't handle.

    Also, please take a look at my profile and read about my search for work. If you have any job leads, please let me know. PM or email me for my resume. I am not necessarily looking for a job related to SEO. I am just looking for an IT job that will allow me to telecommute. See my profile for more details. Thanks.
     
  12. Chris22

    Chris22 Regular Member

    Joined:
    Sep 29, 2010
    Messages:
    400
    Likes Received:
    1,059
    Not all the time; there are plenty of sites where the developers have built RESTful APIs for all their AJAX calls, and querying those endpoints is often simpler than scraping the site. This practice seems to be becoming more commonplace too.
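
    As a rough sketch of what that can look like (the endpoint path, query parameter and header here are hypothetical; you would find the real ones in the browser's network tab and then feed the JSON to a parser such as Gson or Jackson):

    Code:
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    
    public class AjaxEndpointClient {
        public static void main(String[] args) throws Exception {
            // Hypothetical endpoint discovered in the browser's dev tools; the real path differs per site
            URL url = new URL("http://example.com/api/proxies?page=1");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestProperty("Accept", "application/json");
            // Some sites only answer if the request carries the same header their own JavaScript sends
            conn.setRequestProperty("X-Requested-With", "XMLHttpRequest");
    
            try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line); // raw JSON; hand it to a JSON parser from here
                }
            }
        }
    }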
     
  13. Asif WILSON Khan

    Asif WILSON Khan Executive VIP Premium Member

    Joined:
    Nov 10, 2012
    Messages:
    10,138
    Likes Received:
    28,602
    Gender:
    Male
    Occupation:
    Fun Lovin' Criminal
    Location:
    London
    Home Page:
    • Making a program that scans websites?

    I need glasses. I read the title and was outraged.