
Making a program that scans websites?

Discussion in 'General Programming Chat' started by dalinkwent6, Sep 1, 2014.

  1. dalinkwent6

    dalinkwent6 Junior Member

    Joined:
    Jun 30, 2013
    Messages:
    114
    Likes Received:
    16
    Location:
    5th Dimension
    Background: Intermediate Java, basic C++

    Problem:

    I am trying to scrape proxies from a specific website that displays the proxies in table format: |address|port|location|time updated| etc...
    I usually just:
    1. copy and paste each page (only 20 proxies per page) into Word
    2. delete all the cells except address and port
    3. copy and paste into Notepad to remove the cell formatting and prepare for transfer
    4. delete the space between address and port and replace it with ':'
    5. save the file and import it into ScrapeBox.


    Solution ideas:

    I know how to scan text files using the Scanner object, so perhaps I could make a program that scans the file from step 3 and automates replacing the space between address and port with ':' (rough sketch below), but that only saves a fraction of the time. Any ideas on an algorithm to scan websites directly from an IDE (specifically Eclipse)? And perhaps to change pages within that site so it scans the next one?
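
    For reference, a minimal sketch of that step-3 idea, assuming each line of the pasted file is just "address port" (the file names here are made up):

    Code:
    import java.io.File;
    import java.io.PrintWriter;
    import java.util.Scanner;
    
    public class ProxyFileFormatter {
        public static void main(String[] args) throws Exception {
            // Hypothetical file names: the raw Notepad dump in, the ScrapeBox-ready list out
            Scanner scanner = new Scanner(new File("proxies_raw.txt"));
            PrintWriter out = new PrintWriter("proxies.txt");
    
            while (scanner.hasNextLine()) {
                String line = scanner.nextLine().trim();
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                // Assumes "address port" per line; swap the separating whitespace for ':'
                String[] parts = line.split("\\s+");
                if (parts.length >= 2) {
                    out.println(parts[0] + ":" + parts[1]);
                }
            }
            scanner.close();
            out.close();
        }
    }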
     
  2. Arbvestor

    Arbvestor Newbie

    Joined:
    Feb 11, 2014
    Messages:
    25
    Likes Received:
    15
    Occupation:
    Support Tech
    Location:
    Spain
    Home Page:
    It depends a bit on the website you want to scrape.

    1) If it does not use JavaScript and/or AJAX to render the table in the browser, you can use a simple approach with a library like JSoup. It has a connect method to load content from the web directly. You then parse the content with JSoup and can easily select the parts of the site you need (JSoup uses CSS selectors). You can print the results in whatever format you need, or write them directly to a database or wherever else. (See the JSoup sketch at the end of this post.)

    2) If the site uses AJAX calls to fetch the actual content, or if it renders the page with JavaScript, you probably need an approach that uses a real browser. For that, I find the best tool to be Selenium WebDriver. Together with PhantomJS, which is a headless browser, it is an easy-to-manage setup for scraping otherwise hard-to-scrape websites.


    General note: I assume you are using Java for this, but any other language will probably do, as long as there is a library like JSoup and/or Selenium bindings for it.
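
    For case 1), a minimal JSoup sketch might look like this. The URL and the "listable" table id are assumptions; adjust the CSS selectors to whatever the real page uses:

    Code:
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    
    public class JsoupProxyScraper {
        public static void main(String[] args) throws Exception {
            // Hypothetical URL; this only works if the table is present in the static HTML (no AJAX/JS rendering)
            Document doc = Jsoup.connect("http://example.com/proxy-list/1")
                    .userAgent("Mozilla/5.0")
                    .get();
    
            // "table#listable tr" is a CSS selector; the id is an assumption, use whatever the real page has
            for (Element row : doc.select("table#listable tr")) {
                Elements cells = row.select("td");
                if (cells.size() < 2) {
                    continue; // skip the header row and anything malformed
                }
                // Print in the address:port format ScrapeBox expects
                System.out.println(cells.get(0).text() + ":" + cells.get(1).text());
            }
        }
    }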
     
  3. DearSanta

    DearSanta BANNED

    Joined:
    Jul 25, 2014
    Messages:
    178
    Likes Received:
    66
    I would suggest using Zennoposter. It's easy to start with and you don't need programming skills for easy jobs. Try the demo.
     
  4. Arbvestor

    Arbvestor Newbie

    Joined:
    Feb 11, 2014
    Messages:
    25
    Likes Received:
    15
    Occupation:
    Support Tech
    Location:
    Spain
    Home Page:
    That is an option too, of course, if he would rather not program. But Zennoposter is not free and you quickly run into its limits. If the guy can program a bit, then I would recommend writing a simple scraper with Selenium. Something like this:

    Code:
    package testing;
    
    import java.util.List;
    import java.util.logging.Logger;
    
    import org.openqa.selenium.By;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.WebElement;
    import org.openqa.selenium.phantomjs.PhantomJSDriver;
    
    public class HideMyAssProxyListGetter {
        private final static Logger LOGGER = Logger.getLogger(HideMyAssProxyListGetter.class.getName());
        
        
        public static void main(String[] args) {
            WebDriver driver = new PhantomJSDriver();
            
            String url = null;
            try{
                // Walk pages 5 down to 1 of the proxy list
                for (int i = 5; i > 0; i--){
                    url = "http://proxylist.hidemyass.com/"+i;
                    driver.get(url);
                    
                    WebElement table = driver.findElement(By.id("listable"));
                    
                    for (WebElement tr : table.findElements(By.tagName("tr"))){
                        List<WebElement> tds = tr.findElements(By.tagName("td"));
                    
                        if (tds.size()<8){
                            continue;
                        }
                        String ipStr = tds.get(1).getText();
                        if (ipStr.equals("IP address")){
                            continue; //ignore the header line
                        }
                        
                        String portStr = tds.get(2).getText();
                        String location = tds.get(3).getText();
                        
                        
                        // substring(7) strips the leading "width: " from the inline style, leaving the percentage value
                        String speed = tds.get(4).findElement(By.cssSelector("div.progress-indicator>div")).getAttribute("style").substring(7);
                        String connectionTime = tds.get(5).findElement(By.cssSelector("div.progress-indicator>div")).getAttribute("style").substring(7);
                        String protocol = tds.get(6).getText();
                        String anonymity = tds.get(7).getText();
        
                        LOGGER.info(ipStr + ":" +portStr
                                + " ("+location+")" 
                                + " speed="+speed 
                                + " connectionTime="+connectionTime
                                + " protocol="+ protocol
                                + " anonymity="+ anonymity);
    
                    }
                }
            }
            catch (org.openqa.selenium.NoSuchElementException nse){
                LOGGER.warning("can't access "+url +" try next service..., error was: "+nse.getMessage());
            }
            driver.quit();
    
        }
    }
     
    • Thanks x 2
  5. bluehatface

    bluehatface Regular Member

    Joined:
    Oct 19, 2013
    Messages:
    232
    Likes Received:
    98
    Location:
    Here
    Can you not just grab the page with cURL and parse the DOM with an XML parser, or regex?
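
    That idea, roughly sketched in Java instead of a curl call (to stay with the language used above); the URL is a placeholder and the regex is deliberately naive, so this only works on a page that serves the table as static HTML:

    Code:
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    
    public class RegexProxyScraper {
        public static void main(String[] args) throws Exception {
            // Hypothetical URL; HttpURLConnection plays the role of curl here
            URL url = new URL("http://example.com/proxy-list");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestProperty("User-Agent", "Mozilla/5.0");
    
            StringBuilder html = new StringBuilder();
            try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    html.append(line).append('\n');
                }
            }
    
            // Naive pattern: an IPv4 address followed shortly after by a 2-5 digit port number
            Pattern p = Pattern.compile("(\\d{1,3}(?:\\.\\d{1,3}){3})\\D{1,30}?(\\d{2,5})");
            Matcher m = p.matcher(html);
            while (m.find()) {
                System.out.println(m.group(1) + ":" + m.group(2));
            }
        }
    }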
     
  6. journeycoder

    journeycoder Newbie

    Joined:
    Aug 31, 2014
    Messages:
    27
    Likes Received:
    3
    Occupation:
    Coder: asp.net, js, php,wordpress, genesis,tools,
    Location:
    bestmicroovens.com
    Home Page:
    I don't know the two languages above, but if you want to process it with .NET, I'd be glad to help; I can crawl data from pretty much any website.
     
  7. ou8myi

    ou8myi Newbie

    Joined:
    Sep 4, 2014
    Messages:
    18
    Likes Received:
    5
    Occupation:
    programmer
    Location:
    Middle of Nowhere
    I think bluehatface has the simplest solution, if simple is OK. I would probably use Perl, though, to grab and parse in one fell swoop. (Not familiar with cURL.)
     
  8. Arbvestor

    Arbvestor Newbie

    Joined:
    Feb 11, 2014
    Messages:
    25
    Likes Received:
    15
    Occupation:
    Support Tech
    Location:
    Spain
    Home Page:
    That would only work if the site to be scraped does not use AJAX. Also, to automate this, you would need a (shell) script to call curl or wget or whatever command-line "browser" you want to use, and then fire up the regex/XML parser. I think that approach is actually more complicated than sticking with one programming language and using it throughout.

    But of course... it all depends on the experience of the OP.
     
  9. xNotch

    xNotch Registered Member

    Joined:
    Sep 16, 2014
    Messages:
    69
    Likes Received:
    15
    Whenever I encounter complex JavaScript, Selenium/PhantomJS is what I go to as well. Selenium has bindings for several languages, so it's easy to integrate. Plus, since PhantomJS is headless, you can run it on a cheap VPS. This is what I'd go with.
     
    • Thanks x 1
  10. dalinkwent6

    dalinkwent6 Junior Member

    Joined:
    Jun 30, 2013
    Messages:
    114
    Likes Received:
    16
    Location:
    5th Dimension
    This is great; this is actually the first time I've worked with Selenium. I can definitely see the possibilities now, though. I'll post my code when I get the chance to finish the project. Thanks!
     
  11. bubbubber

    bubbubber Newbie

    Joined:
    Sep 30, 2014
    Messages:
    3
    Likes Received:
    1
    Occupation:
    IT Professional
    Location:
    Taiwan
    Home Page:
    Yes, if the site uses AJAX and/or just generally generates a lot of table and form elements on the fly or in nested layers, then it is more difficult to scrape. Rather, it is more difficult to interact with the site (that is, for sites where interaction is necessary).

    If you take a look at my profile you can see a link to a Youtube video that shows an alternative method to web scraping and automated web crawling.

    Ask me any questions you have about the technology used.

    I believe most folks who do web scraping and web crawling use PHP and the HTTP protocol. In my method I use a web browser control embedded in a Windows Forms application. I went this route because I found that certain sites need to think an actual mouse cursor is moving around, and that their buttons are being pressed by actual mouse clicks, or else they will not respond. I found this to be true for some online betting sites, at the very least. Doing it my way, I am able to use Win32 API calls to move the mouse cursor and actuate mouse clicks and keystrokes. I used VB6 in the example in the video, but you can write the same program in C# (and actually, it might be easier to write it now in C# and .NET).

    I know a lot of people will call BS on this and say that this is not necessary, but, based on my experiences with certain sites, it is a viable solution to the problem.

    This method is not meant to replace the more commonly used PHP scraping/crawling methods. My method is comparatively slow because it actually uses a web browser to bring up each web page, whereas I assume PHP over plain HTTP can get through many more pages in the same amount of time. But, in certain situations, my method will let you log into sites that the PHP method just can't handle.

    Also, please take a look at my profile and read about my search for work. If you have any job leads, please let me know. PM or email me for my resume. I am not necessarily looking for a job related to SEO. I am just looking for an IT job that will allow me to telecommute. See my profile for more details. Thanks.
     
  12. Chris22

    Chris22 Regular Member

    Joined:
    Sep 29, 2010
    Messages:
    400
    Likes Received:
    1,059
    Not all the time; there are plenty of sites where the developers have built RESTful APIs for all their AJAX calls, and querying those endpoints is often simpler than scraping the site. This practice seems to be becoming more commonplace too.
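
    As a rough sketch of what that can look like (the endpoint path, query parameter and header here are hypothetical; you would find the real ones in the browser's network tab and then feed the JSON to a parser such as Gson or Jackson):

    Code:
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    
    public class AjaxEndpointClient {
        public static void main(String[] args) throws Exception {
            // Hypothetical endpoint discovered in the browser's dev tools; the real path differs per site
            URL url = new URL("http://example.com/api/proxies?page=1");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestProperty("Accept", "application/json");
            // Some sites only answer if the request carries the same header their own JavaScript sends
            conn.setRequestProperty("X-Requested-With", "XMLHttpRequest");
    
            try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line); // raw JSON; hand it to a JSON parser from here
                }
            }
        }
    }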
     
  13. Asif WILSON Khan

    Asif WILSON Khan Executive VIP Premium Member

    Joined:
    Nov 10, 2012
    Messages:
    10,138
    Likes Received:
    28,602
    Gender:
    Male
    Occupation:
    Fun Lovin' Criminal
    Location:
    London
    Home Page:
    • Making a program that scans websites?

    I need glasses. I read the title and was outraged.