Scraping with regex

Discussion in 'Black Hat SEO Tools' started by sden007, Feb 23, 2012.

  1. sden007

    sden007 Newbie

    Joined:
    Dec 29, 2011
    Messages:
    22
    Likes Received:
    0
    Hi all, I'm trying to scrape proxies using ZennoPoster. The problem is I'm not too good at regex. The info I'm trying to parse looks like this:

    <TD>194.176.105.197</TD>
    <TD>80</TD></TR>
    <TR>
    <TD>217.15.117.58</TD>
    <TD>3128</TD></TR>
    <TR>
    <TD>80.194.50.123</TD>
    <TD>8080</TD></TR>

    Does anyone know how this can be achieved?
    Thanks
     
  2. cooooookies

    cooooookies Senior Member

    Joined:
    Oct 6, 2008
    Messages:
    1,016
    Likes Received:
    217
    Have a look at regexlib.com, many useful regexes there.

    I have my own regexes for scanning text in this form:
    Code:
    IP:port
    I only parse text with the HTML tags stripped; maybe you should consider that too.

    But honestly, it is ABSOLUTELY not worth the effort to use public proxies. I wrote a fairly advanced proxy checker for my needs and had high hopes, and got nothing out of it.

    If you really need that many IPs: once IPv6 is routed everywhere, you will have practically unlimited proxies at almost no cost.
     
  3. kokoloko75

    kokoloko75 Elite Member

    Joined:
    Jan 1, 2011
    Messages:
    1,628
    Likes Received:
    1,943
    Occupation:
    Design director
    Location:
    Paris (France)
    RegEx Extractor :
    Code:
    http://codecanyon.net/item/regex-extractor-extract-everything-simply-/1327433
    Beny
     
  4. jimbo2087

    jimbo2087 Jr. VIP

    Joined:
    Jan 24, 2010
    Messages:
    205
    Likes Received:
    149
    Location:
    UK
  5. kveldulv

    kveldulv Junior Member

    Joined:
    Aug 19, 2009
    Messages:
    107
    Likes Received:
    45
    He's referring to ZennoPoster's proxy checker, where you can only use regexes, so there isn't really scope for preprocessing.

    Though I totally agree, parsing significant amounts of HTML with regexes is for mugs.

    Check these first
    Code:
    stackoverflow.com/questions/106179/regular-expression-to-match-hostname-or-ip-address
    mrhinkydink.wordpress.com/proxy-regex/
    


    Code:
    IP addresses
    ([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})
    /^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$/
    
    
    \b([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\b
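For anyone testing the patterns outside ZennoPoster, here is a minimal Python sketch of the same idea: one regex that spans the IP cell and the port cell that follows it. The names `OCTET`, `PROXY_RE`, and `extract_proxies` are made up for illustration, and the whitespace handling between the two cells is an assumption based on the sample rows in the first post.

```python
import re

# Sample rows copied from the first post.
HTML = """<TD>194.176.105.197</TD>
<TD>80</TD></TR>
<TR>
<TD>217.15.117.58</TD>
<TD>3128</TD></TR>
<TR>
<TD>80.194.50.123</TD>
<TD>8080</TD></TR>"""

# One octet in the range 0-255 (non-capturing).
OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"

# Group 1: the IP inside the first cell.
# Group 2: the port inside the cell that follows (1 to 5 digits).
# \s* between the cells absorbs the line break seen in the sample.
PROXY_RE = re.compile(
    r"<TD>\s*(" + OCTET + r"(?:\." + OCTET + r"){3})\s*</TD>"
    r"\s*<TD>\s*(\d{1,5})\s*</TD>",
    re.IGNORECASE,
)


def extract_proxies(html):
    """Return 'ip:port' strings for every IP cell followed by a port cell."""
    return [f"{ip}:{port}" for ip, port in PROXY_RE.findall(html)]


print(extract_proxies(HTML))
```

On the three sample rows this should print the three proxies in `ip:port` form. The port group only checks the digit count, not the 0-65535 range.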
    
    
    
     
  6. kveldulv

    kveldulv Junior Member

    Joined:
    Aug 19, 2009
    Messages:
    107
    Likes Received:
    45
    Or, if you can run a regex replace on the file first:
    Code:
    replace </TD>\n<TD>
    with :
    
    A simple ip:port regex should match after that.
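The replace-then-match idea can be sketched in Python like this. It is only an illustration, not something ZennoPoster runs: the variable names are hypothetical, and `\s*` is used in place of the literal `\n` so that Windows line endings are also covered.

```python
import re

# Sample rows copied from the first post.
ROWS = """<TD>194.176.105.197</TD>
<TD>80</TD></TR>
<TR>
<TD>217.15.117.58</TD>
<TD>3128</TD></TR>
<TR>
<TD>80.194.50.123</TD>
<TD>8080</TD></TR>"""

# Step 1: glue each IP cell to the port cell right after it.
# Only a </TD> followed directly by <TD> is replaced, so the
# </TD></TR> ... <TR><TD> boundary between rows is left alone.
joined = re.sub(r"</TD>\s*<TD>", ":", ROWS, flags=re.IGNORECASE)

# Step 2: a plain ip:port pattern now matches on the joined text.
IP_PORT_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}:\d{1,5}\b")
proxies = IP_PORT_RE.findall(joined)

print(proxies)
```

The simple pattern does not check that each octet is at most 255; swap in one of the stricter patterns from the previous post if that matters.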
     
  7. affiliatepros

    affiliatepros Junior Member

    Joined:
    Jan 2, 2010
    Messages:
    112
    Likes Received:
    15
    I would use an XML (HTML) parser and get the data out of it. That's much easier, and if there's a mistake in the HTML, the parser will most likely correct it (at least Nokogiri, the parser I use for Ruby, does that).
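A rough sketch of the parser approach, using Python's standard-library `html.parser` rather than Nokogiri. Note the difference: `HTMLParser` tolerates sloppy markup but does not repair it the way Nokogiri does. The class `CellCollector` and the function `parse_proxies` are hypothetical names, and the code assumes the cells strictly alternate IP, port, IP, port as in the sample.

```python
from html.parser import HTMLParser

# Sample rows copied from the first post.
HTML = """<TD>194.176.105.197</TD>
<TD>80</TD></TR>
<TR>
<TD>217.15.117.58</TD>
<TD>3128</TD></TR>
<TR>
<TD>80.194.50.123</TD>
<TD>8080</TD></TR>"""


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell in document order."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TD> arrives as "td".
        if tag == "td":
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data


def parse_proxies(html):
    """Pair up alternating IP and port cells as 'ip:port' strings."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    cells = [c.strip() for c in parser.cells]
    return [f"{ip}:{port}" for ip, port in zip(cells[0::2], cells[1::2])]


print(parse_proxies(HTML))
```

If the real page has other cells in the same table, the pairing step would need a check that the first cell of each pair actually looks like an IP.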