Scraping with regex

Discussion in 'Black Hat SEO Tools' started by sden007, Feb 23, 2012.

  1. sden007

    sden007 Newbie

    Joined:
    Dec 29, 2011
    Messages:
    22
    Likes Received:
    0
    Hi all, I'm trying to scrape proxies using ZennoPoster. The problem is that I'm not too good at regex. The info I'm trying to parse looks like this:

    <TD>194.176.105.197</TD>
    <TD>80</TD></TR>
    <TR>
    <TD>217.15.117.58</TD>
    <TD>3128</TD></TR>
    <TR>
    <TD>80.194.50.123</TD>
    <TD>8080</TD></TR>

    Does anyone know how this can be achieved?
    Thanks
     
  2. cooooookies

    cooooookies Senior Member

    Joined:
    Oct 6, 2008
    Messages:
    1,008
    Likes Received:
    216
    Have a look at regexlib.com, many useful regexes there.

    I have my own regexes that scan for proxies in the form
    Code:
    IP:port
    I only parse text with the HTML tags stripped; maybe you should consider that too.
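
    As a rough illustration of the tag-stripping idea, here is a minimal Python sketch. It is not the poster's own regex (that was never shown); the function names strip_tags and find_proxies are made up for this example, and the sample rows are the ones from the first post.

```python
import re

# Sample rows copied from the first post in this thread.
html = """<TD>194.176.105.197</TD>
<TD>80</TD></TR>
<TR>
<TD>217.15.117.58</TD>
<TD>3128</TD></TR>
<TR>
<TD>80.194.50.123</TD>
<TD>8080</TD></TR>"""


def strip_tags(markup):
    # Replace every tag with a space so neighbouring cell values stay separated.
    return re.sub(r"<[^>]+>", " ", markup)


def find_proxies(text):
    # An IPv4-looking token, then whitespace or a colon, then a 1-5 digit port.
    # Assumption: the IP cell is always followed directly by its port cell.
    pattern = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3})\b[\s:]+(\d{1,5})\b")
    return [f"{ip}:{port}" for ip, port in pattern.findall(text)]


proxies = find_proxies(strip_tags(html))
print(proxies)
# ['194.176.105.197:80', '217.15.117.58:3128', '80.194.50.123:8080']
```

    The pattern does not range-check the octets, so it would also accept something like 999.1.1.1; it only shows that the matching gets simpler once the tags are gone.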

    But honestly, it is ABSOLUTELY not worth the effort to use public proxies. I wrote a fairly advanced proxy checker for my own needs and had high hopes for it, but got nothing out of it.

    If you really need that many IPs: once IPv6 is routed everywhere, you will have practically unlimited proxies at almost no cost.
     
  3. kokoloko75

    kokoloko75 Elite Member

    Joined:
    Jan 1, 2011
    Messages:
    1,628
    Likes Received:
    1,935
    Occupation:
    Design director
    Location:
    Paris (France)
    RegEx Extractor :
    Code:
    http://codecanyon.net/item/regex-extractor-extract-everything-simply-/1327433
    Beny
     
  4. jimbo2087

    jimbo2087 Regular Member

    Joined:
    Jan 24, 2010
    Messages:
    205
    Likes Received:
    149
    Location:
    UK
  5. kveldulv

    kveldulv Registered Member

    Joined:
    Aug 19, 2009
    Messages:
    76
    Likes Received:
    33
    In the place he's referring to, Zenno's proxy checker, you can only use regexes, so there's not really any scope for preprocessing.

    Though I totally agree, parsing significant amounts of HTML with regexes is for mugs.

    Check these first
    Code:
    stackoverflow.com/questions/106179/regular-expression-to-match-hostname-or-ip-address
    mrhinkydink.wordpress.com/proxy-regex/
    


    Code:
    IP addresses
    ([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})
    /^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$/


    \b([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\b
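
    To show what the range-checked octet alternation buys you over a plain [0-9]{1,3}, here is a small Python sketch. The names OCTET, STRICT, LOOSE and is_ipv4 are invented for this example; the octet pattern itself is the one from the second regex above.

```python
import re

# One octet, restricted to 0-255 (same alternation as the anchored regex above).
OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"

# Three "octet + dot" groups followed by a final octet.
STRICT = re.compile(rf"(?:{OCTET}\.){{3}}{OCTET}")

# The looser form: any one to three digits per octet.
LOOSE = re.compile(r"[0-9]{1,3}(?:\.[0-9]{1,3}){3}")


def is_ipv4(candidate):
    # fullmatch anchors at both ends, so no ^ or $ is needed in the pattern.
    return STRICT.fullmatch(candidate) is not None


print(is_ipv4("194.176.105.197"))                  # True
print(is_ipv4("256.1.1.1"))                        # False
print(LOOSE.fullmatch("256.1.1.1") is not None)    # True
```

    For scraping a proxy list the loose form is often good enough, since the checker will throw out dead addresses anyway; the strict form matters more when you are validating input.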
    
    
    
     
  6. kveldulv

    kveldulv Registered Member

    Joined:
    Aug 19, 2009
    Messages:
    76
    Likes Received:
    33
    Or, if you're running the regex on a file, do a search and replace first:
    Code:
     replace </TD>\n<TD>
        with :
    
    A simple ip:port regex should match after that.
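
    The replace-then-match idea can be sketched in Python like this. It is only an illustration using the rows from the first post; I use \s* rather than a literal \n between the tags as an assumption, so that Windows line endings or extra indentation don't break the replacement.

```python
import re

# Sample rows copied from the first post in this thread.
page = """<TD>194.176.105.197</TD>
<TD>80</TD></TR>
<TR>
<TD>217.15.117.58</TD>
<TD>3128</TD></TR>
<TR>
<TD>80.194.50.123</TD>
<TD>8080</TD></TR>"""

# Step 1: collapse the cell boundary between IP and port into a colon.
joined = re.sub(r"</TD>\s*<TD>", ":", page, flags=re.IGNORECASE)

# Step 2: a simple ip:port pattern now matches each row.
proxies = re.findall(r"\b\d{1,3}(?:\.\d{1,3}){3}:\d{1,5}\b", joined)

print(proxies)
# ['194.176.105.197:80', '217.15.117.58:3128', '80.194.50.123:8080']
```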
     
  7. affiliatepros

    affiliatepros Junior Member

    Joined:
    Jan 2, 2010
    Messages:
    113
    Likes Received:
    15
    I would use an XML (HTML) parser and get the data out of that. It's much easier, and if there's a mistake in the HTML, the parser will most likely correct it (at least Nokogiri, the parser I use for Ruby, does that).
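
    For comparison, here is a minimal sketch of the parser approach in Python, using the standard library's html.parser in place of Nokogiri. The class CellCollector and the function extract_proxies are names made up for this example, and it assumes the table cells strictly alternate IP, port as in the first post. Note that html.parser is a tolerant tokenizer and does not build or repair a document tree.

```python
from html.parser import HTMLParser

# Sample rows copied from the first post in this thread.
sample = """<TD>194.176.105.197</TD>
<TD>80</TD></TR>
<TR>
<TD>217.15.117.58</TD>
<TD>3128</TD></TR>
<TR>
<TD>80.194.50.123</TD>
<TD>8080</TD></TR>"""


class CellCollector(HTMLParser):
    """Collects the text content of every <td> cell, in document order."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TD> arrives here as "td".
        if tag == "td":
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data


def extract_proxies(markup):
    parser = CellCollector()
    parser.feed(markup)
    parser.close()
    cells = [cell.strip() for cell in parser.cells]
    # Assumption: cells alternate IP, port, IP, port, ...
    return [f"{ip}:{port}" for ip, port in zip(cells[0::2], cells[1::2])]


print(extract_proxies(sample))
# ['194.176.105.197:80', '217.15.117.58:3128', '80.194.50.123:8080']
```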