Scraping with regex

sden007 · Feb 23, 2012

Hi all, I'm trying to scrape proxies using ZennoPoster; the problem is I'm not too good at regex. The info I'm trying to parse looks like this:

<TD>194.176.105.197</TD>
<TD>80</TD></TR>
<TR>
<TD>217.15.117.58</TD>
<TD>3128</TD></TR>
<TR>
<TD>80.194.50.123</TD>
<TD>8080</TD></TR>

Does anyone know how this can be achieved?
Thanks

cooooookies · Feb 23, 2012

Have a look at regexlib.com, many useful regexes there.

I have my own regexes for scanning proxies in IP:port form. I only parse text with the HTML tags stripped; maybe you should consider that too.
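
A rough sketch of that "strip the tags first" idea, in Python (the language is just a choice for illustration, not something ZennoPoster requires). The sample markup is the snippet from the first post; the names TAG_RE, IP_PORT_RE and extract_proxies are made up for this example, and the tag stripper is deliberately crude.

```python
import re

# Sample markup copied from the first post in this thread.
SAMPLE_HTML = """<TD>194.176.105.197</TD>
<TD>80</TD></TR>
<TR>
<TD>217.15.117.58</TD>
<TD>3128</TD></TR>
<TR>
<TD>80.194.50.123</TD>
<TD>8080</TD></TR>"""

# Crude tag stripper: replaces anything that looks like <...> with a space.
# Good enough for simple table markup, not a general HTML parser.
TAG_RE = re.compile(r"<[^>]+>")

# An IPv4-looking token, then whitespace or a colon, then a 1-5 digit port.
# This does not range-check the octets or the port.
IP_PORT_RE = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3})[\s:]+(\d{1,5})\b")


def extract_proxies(html):
    """Return 'ip:port' strings found in html after stripping tags."""
    text = TAG_RE.sub(" ", html)
    return [f"{ip}:{port}" for ip, port in IP_PORT_RE.findall(text)]


proxies = extract_proxies(SAMPLE_HTML)
print("\n".join(proxies))
```

Once the tags are gone, the IP and the port are separated only by whitespace, which is why the pattern accepts whitespace as well as a colon between them.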

But honestly, it is ABSOLUTELY not worth the effort to use public proxies. I programmed a fairly advanced proxy checker for my needs and had high hopes for it, but the result was nil.

If you really need that many IPs: once IPv6 is routed everywhere, you will have literally unlimited proxies at almost no cost.

kokoloko75 · Feb 23, 2012

RegEx Extractor :

Code:

http://codecanyon.net/item/regex-extractor-extract-everything-simply-/1327433

Beny

jimbo2087 · Feb 23, 2012

Don't use Regex to parse HTML!

http://stackoverflow.com/questions/...ept-xhtml-self-contained-tags/1732454#1732454

Jokes aside - use a proper DOM library. If you're doing it in PHP this is great - http://simplehtmldom.sourceforge.net/
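
For anyone not on PHP, here is a minimal sketch of the same parser-based approach in Python, using only the standard library's html.parser module. The class and function names (CellCollector, parse_proxies) are invented for this example, and it assumes the cells strictly alternate IP, port, IP, port as in the sample from the first post.

```python
from html.parser import HTMLParser

# Sample markup copied from the first post in this thread.
SAMPLE_HTML = """<TD>194.176.105.197</TD>
<TD>80</TD></TR>
<TR>
<TD>217.15.117.58</TD>
<TD>3128</TD></TR>
<TR>
<TD>80.194.50.123</TD>
<TD>8080</TD></TR>"""


class CellCollector(HTMLParser):
    """Collects the text content of every <td> cell, in document order."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TD> arrives as "td".
        if tag == "td":
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self.in_cell:
            self.cells[-1] += data.strip()


def parse_proxies(html):
    """Return (ip, port) tuples, assuming cells alternate ip, port, ..."""
    collector = CellCollector()
    collector.feed(html)
    collector.close()
    cells = collector.cells
    return list(zip(cells[0::2], cells[1::2]))


pairs = parse_proxies(SAMPLE_HTML)
for ip, port in pairs:
    print(f"{ip}:{port}")
```

The stray </TR> at the start of the fragment does not bother the parser; it simply reports an end tag that this collector ignores.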

kveldulv · Apr 10, 2013

In ZennoPoster's proxy checker, which is what he's referring to, you can only use regexes, so there's not really any scope for preprocessing.

Though I totally agree, parsing significant amounts of HTML with regexes is for mugs.

Check these first

Code:

stackoverflow.com/questions/106179/regular-expression-to-match-hostname-or-ip-address
mrhinkydink.wordpress.com/proxy-regex/

Code:

IP addresses

([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})

/^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$/

\b([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\b
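
To show the practical difference between the loose "four groups of digits" style and the range-checked style, here is a small Python sketch. The names LOOSE_IP, OCTET and STRICT_IP and the test string are made up for illustration; Python's re syntax is close to, but not guaranteed identical to, whatever regex engine ZennoPoster uses.

```python
import re

# Loose: any four groups of 1-3 digits, so 999.1.1.1 passes.
LOOSE_IP = re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b")

# Strict: each octet is limited to 0-255.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
STRICT_IP = re.compile(r"\b" + OCTET + r"(?:\." + OCTET + r"){3}\b")

text = "good 194.176.105.197 and 80.194.50.123, bad 999.1.1.1"

loose_hits = LOOSE_IP.findall(text)
strict_hits = STRICT_IP.findall(text)

print("loose: ", loose_hits)
print("strict:", strict_hits)
```

The loose pattern is shorter and usually fine for scraping a known proxy list; the strict one is worth the extra length when the input may contain junk such as version numbers.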

kveldulv · Apr 11, 2013

Or, if you run the regex on a file, do a replace first:

Code:

replace </TD>\n<TD>
with :

A simple ip:port regex should match now.
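
The same two steps sketched in Python, for anyone doing this outside ZennoPoster. The sample is the markup from the first post; the variable names are made up, and the sketch uses \s* rather than a literal \n between the tags so it also copes with \r\n line endings or indentation, which is a small generalisation of the replace above.

```python
import re

# Sample markup copied from the first post in this thread.
SAMPLE_HTML = """<TD>194.176.105.197</TD>
<TD>80</TD></TR>
<TR>
<TD>217.15.117.58</TD>
<TD>3128</TD></TR>
<TR>
<TD>80.194.50.123</TD>
<TD>8080</TD></TR>"""

# Step 1: collapse the cell boundary between IP and port into a colon.
joined = re.sub(r"</TD>\s*<TD>", ":", SAMPLE_HTML, flags=re.IGNORECASE)

# Step 2: a simple ip:port pattern now matches each row.
# It does not range-check the octets or the port.
IP_PORT_RE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}:\d{1,5}\b")
proxies = IP_PORT_RE.findall(joined)

print("\n".join(proxies))
```

Row boundaries are left alone because a </TR> sits between the closing </TD> of one row and the opening <TD> of the next, so only the IP and port cells get joined.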

affiliatepros · Apr 12, 2013

I would use an XML (HTML) parser and get the data out of it. Much easier, and if there's a mistake in the HTML, it will most likely correct it (at least my Nokogiri parser for Ruby does that).
