
Regex issue. Does not behave the same in Expresso and C#

Discussion in 'C, C++, C#' started by Ampix0, Feb 18, 2013.

  1. Ampix0

    Ampix0 Power Member

    Joined:
    Jan 10, 2012
    Messages:
    525
    Likes Received:
    60
    Home Page:
    OK, you may have seen my other thread. Fortunately this one is unrelated but has the EXACT same issue, so I can release the code.


    RegexString: (http:|www\.)(.*)(com|info|biz|me|org|ru|de|uk|in|bz|jp)(?=\")




    What this particular string SHOULD do is find the links (in this case, web proxies). When I put it into Expresso and paste in
    the source code of "http://www.zfreez.com/" as the sample text, my result is a list of proxy sites. FANTASTIC.
    So I go to implement this in C#:


    Code:
    String Uri = wclient.DownloadString("http://www.zfreez.com/");
    String regex = "(http:|www\\.)(.*)(com|info|biz|me|org|ru|de|uk|in|bz|jp)(?=\\\")";
    MatchCollection coll = Regex.Matches(Uri, regex);
    String result = coll[0].Groups[1].Value;
    textBox1.AppendText(result);
    

    The result in textBox1.AppendText is "http:"
    That's all I get.


    How come there is a difference?
     
  2. jazzc

    jazzc Moderator Staff Member Moderator Jr. VIP

    Joined:
    Jan 27, 2009
    Messages:
    2,468
    Likes Received:
    10,155
    That Groups[1] means you only get the text captured by the first capturing group (the first pair of parentheses in the regexp), not the whole match.

    Why are you using capturing groups?
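
    To make the point about Groups[1] concrete, here is a minimal sketch. The sample HTML string and the class name GroupDemo are made up for illustration (the real page source is not reproduced here); the pattern is the one from the first post.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupDemo
{
    // Pattern from the first post, with all three capturing groups intact.
    public const string Pattern =
        "(http:|www\\.)(.*)(com|info|biz|me|org|ru|de|uk|in|bz|jp)(?=\\\")";

    public static void Main()
    {
        // Made-up sample input standing in for the downloaded page.
        string html = "<a href=\"http://www.example.com\">x</a>";

        foreach (Match m in Regex.Matches(html, Pattern))
        {
            // Groups[1] is only the first parenthesized part: "http:"
            Console.WriteLine(m.Groups[1].Value);
            // m.Value (same as m.Groups[0].Value) is the whole match:
            // "http://www.example.com"
            Console.WriteLine(m.Value);
        }
    }
}
```

    Note that (.*) is greedy, so on a line containing several links a single match may stretch from the first link to the last one. That is one more reason the replies further down move away from this pattern.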
     
    Last edited: Feb 18, 2013
  3. Ampix0

    Ampix0 Power Member

    Joined:
    Jan 10, 2012
    Messages:
    525
    Likes Received:
    60
    Home Page:
    Ah XD. See, I am not a C# person. I was looking up how to use regex and didn't find much, so I am not very sure how to implement it.
     
  4. cgimaster

    cgimaster Power Member

    Joined:
    Jun 30, 2012
    Messages:
    525
    Likes Received:
    311
    Gender:
    Male
    This is a regex test site: http://rubular.com/. It's pretty useful for testing a regex and seeing what you will get out of it.

    Just wanted to share that in case you have other issues; it might be handy for helping you see what is wrong.

    You can enter the expression and the data you will use, then see the result. It also shows groups and other things. Very useful.
     
  5. jazzc

    jazzc Moderator Staff Member Moderator Jr. VIP

    Joined:
    Jan 27, 2009
    Messages:
    2,468
    Likes Received:
    10,155
    Code:
    String htmlCode = wclient.DownloadString("http://www.zfreez.com/");
    Regex r = new Regex( "<a href=\"(.*?)\">", RegexOptions.IgnoreCase );
    MatchCollection mc = r.Matches(htmlCode);
    
    foreach ( Match m1 in mc ) {                
       // Here you can do something with -> m1.Groups[1].ToString();
    }
    
     
  6. cgimaster

    cgimaster Power Member

    Joined:
    Jun 30, 2012
    Messages:
    525
    Likes Received:
    311
    Gender:
    Male
    Or you could use HtmlAgilityPack to parse the HTML with something like:

    Code:
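    HtmlWeb site = new HtmlWeb();
    HtmlDocument html = site.Load("http://www.zfreez.com/");
    // DocumentNode is the root of the parsed page in HtmlAgilityPack.
    foreach (HtmlNode link in html.DocumentNode.SelectNodes("//a[@href]"))
    {
        // Here you can do something with link, e.g. link.GetAttributeValue("href", "")
    }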
     
  7. Ampix0

    Ampix0 Power Member

    Joined:
    Jan 10, 2012
    Messages:
    525
    Likes Received:
    60
    Home Page:
    This outputs "#" id="btn_close_window" class="button btn_black" id="lbl_handler_desc#" id="btn_try_another" class="button btn_gray"

    But I noticed something. Look at the contents of htmlCode: the source is not the same as what you get from a browser. Everything is base64 encoded for some reason.
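
    If parts of the downloaded source really are base64, a decoding step before any matching could look like the sketch below. This is only an assumption about the page: the sample value is made up, Base64Demo and Decode are hypothetical names, and UTF-8 is assumed for the decoded bytes.

```csharp
using System;
using System.Text;

public static class Base64Demo
{
    // Decodes a base64 string to text, assuming the bytes are UTF-8.
    public static string Decode(string encoded)
    {
        byte[] bytes = Convert.FromBase64String(encoded);
        return Encoding.UTF8.GetString(bytes);
    }

    public static void Main()
    {
        // Made-up sample value; not taken from the actual page.
        Console.WriteLine(Decode("aHR0cDovL2V4YW1wbGUuY29t")); // prints http://example.com
    }
}
```

    Convert.FromBase64String throws a FormatException on input that is not valid base64, so it is worth wrapping in a try/catch when the input comes from a scraped page.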
     
  8. Ampix0

    Ampix0 Power Member

    Joined:
    Jan 10, 2012
    Messages:
    525
    Likes Received:
    60
    Home Page:
    I had looked into the HTMLAgility pack, but I am going to be scraping IP addresses as well.
     
  9. cgimaster

    cgimaster Power Member

    Joined:
    Jun 30, 2012
    Messages:
    525
    Likes Received:
    311
    Gender:
    Male
    It can scrape anything out of X/HTML and is a lot better tested than a regex you build yourself.

    Using XPath, you find the item you want, and from there you can get its text, attribute values, or anything else without having to write some crazy rule to get it.
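
    As a rough sketch of that workflow, including the IP address case mentioned earlier: the markup, the "ip" class name, and the XPathDemo class are hypothetical stand-ins for a real proxy list page, and the code assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

public static class XPathDemo
{
    // Loose "address:port" shape check; it does not validate octet ranges.
    public static readonly Regex IpPort =
        new Regex(@"^\d{1,3}(\.\d{1,3}){3}:\d{1,5}$");

    public static void Main()
    {
        // Hypothetical markup standing in for a proxy list page.
        string html =
            "<table><tr><td class=\"ip\">10.0.0.1:8080</td></tr>" +
            "<tr><td class=\"ip\">not an address</td></tr></table>" +
            "<a href=\"http://www.example.com\">proxy</a>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty list) when nothing matches.
        var links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null)
            foreach (HtmlNode link in links)
                Console.WriteLine(link.GetAttributeValue("href", ""));

        // The regex runs on already-isolated cell text, not on raw HTML.
        var cells = doc.DocumentNode.SelectNodes("//td[@class='ip']");
        if (cells != null)
            foreach (HtmlNode cell in cells)
                if (IpPort.IsMatch(cell.InnerText.Trim()))
                    Console.WriteLine(cell.InnerText.Trim());
    }
}
```

    The design choice here is to let the parser find the element and keep the regex small, applied only to the text of one node at a time.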

    Regarding what you said about the content: it's possible that it is being generated dynamically. In that case it will not show up in the page source, but it will show up if, for example, you view the page with Firebug.

    If so, you can either request the underlying URL directly, if you can find it, or you would have to emulate a browser. Then again, it depends on how the content is being processed.
     
    Last edited: Feb 18, 2013
  10. rootjazz

    rootjazz Jr. VIP Jr. VIP

    Joined:
    Dec 21, 2012
    Messages:
    617
    Likes Received:
    313
    Occupation:
    Developer
    Location:
    UK
    Home Page:
    Use HtmlAgilityPack rather than trying to regex HTML. +1

    I think he is referring to the fact that the proxy site is using JavaScript to display the proxies.

    For this, use Awesomium to process the JavaScript, then grab the rendered output and pipe it through HtmlAgilityPack.

    fuck regexes for HTML manipulation.