
Regex driving me crazy: too greedy

Discussion in 'General Programming Chat' started by xpro, Jul 3, 2010.

  1. xpro

    xpro Regular Member

    Joined: Jan 21, 2009 | Messages: 416 | Likes Received: 16
    Hello

    I'm trying to extract some data from a webpage. The regex should produce 2 separate matches, but it keeps going and returns everything as one match. What can I do to make it stop at the first "</from>" it finds?

    Best Regards!

    The content
    Code:
    <messageInfo mid="1_449_ALlVimIAABd1TC5hsgc16zIYJFY" toEmail="Idelle" subject="Re: free ipod" mimeType="multipart/alternative" xapparentlyto="idellecain9195@yahoo.com" receivedDate="1278108078" size="7408"><flags isReplied="0" isFlagged="0" isRead="1" isDraft="0" isForwarded="0" isHam="0" isSpam="0" hasAttachment="0" inAddressBook="0" isRecent="1"/><from><name>LaVonda Bland</name><email>groovy_lv13@yahoo.com</email></from><inboxservices><name>Retro</name><value>Y</value></inboxservices><inboxservices><name>SgrnP</name><value>N</value></inboxservices><inboxservices><name>d_t</name><value>1278108081</value></inboxservices><inboxservices><name>s_ip</name><value>74.6.228.93</value></inboxservices><inboxservices><name>showStationery=</name><value/></inboxservices><inboxservices><name>url</name><value>craigslist.org,mailto:res-rjb2j-1822676356@craigslist.org,yahoo.com,mailto:idellecain9195@yahoo.com</value></inboxservices></messageInfo><messageInfo mid="1_22_ALVVimIAABo+TC1qtAhqDS72Xas" toEmail="idellecain9195@yahoo.com" subject="Welcome to Yahoo!" mimeType="text/html" receivedDate="1278044852" size="720"><flags isReplied="0" isFlagged="0" isRead="0" isDraft="0" isForwarded="0" isHam="0" isSpam="0" hasAttachment="0" inAddressBook="0" isRecent="1"/><from><name>Yahoo!</name><email>mailbot@yahoo.com</email></from>
    
    My regex
    Code:
    <messageInfo mid=\"(.*?)\" toEmail=\".*?\" subject=\"(.*?)\" mimeType=\".*?\" xapparentlyto=\".*?\" receivedDate=\".*?\" size=\".*?\"><flags isReplied=\"0\" isFlagged=\"0\" isRead=\"0\" isDraft=\"0\" isForwarded=\"0\" isHam=\"0\" isSpam=\"0\" hasAttachment=\"0\" inAddressBook=\"0\" isRecent=\"1\"/><from><name>(.*?)</name><email>(.*?)</email></from>
    
     
  2. madblacker

    madblacker Regular Member

    Joined: Nov 2, 2009 | Messages: 268 | Likes Received: 19
    I've had the same kind of problem before. Basically, regex isn't ideal for parsing a web page. You need to parse it with some sort of HTML parser first and then use regex only on what's left after that initial parse. Which parser to use depends on the language you're working in.
     
  3. voyevoda

    voyevoda Regular Member Premium Member

    Joined: Mar 21, 2010 | Messages: 217 | Likes Received: 97 | Location: Eastern Front
    I found the problem. You're trying to use a regular expression to parse XML! This is bad.

    Your regular expression isn't even valid in a tool that delimits the pattern with slashes (like the one linked below): you aren't escaping the forward slashes.

    http://rubular.com/r/TTeyFAqD5y

    The matches look correct to me.
     
  4. bhmailer

    bhmailer Newbie

    Joined: Nov 12, 2009 | Messages: 11 | Likes Received: 1
    Try "The Regex Coach". It's a tool that lets you test your regular expressions easily.
     
  5. anty

    anty Newbie

    Joined: Dec 8, 2008 | Messages: 22 | Likes Received: 0
    You specified
    Code:
    isRead="0"
    but the first message in your sample has isRead="1", so the pattern can't match it on its own. That's why you don't get the output you expect.

    The proper way would be to fix the content of the website so that it is valid XML, then use XPath to access the data you want.
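For anyone who wants to try the parser route, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree, which supports a small XPath subset. It assumes a shortened, well-formed stand-in for the sample (the real page is cut off before the last `</messageInfo>`), and it wraps the fragment in a made-up `<root>` element because XML allows only one top-level element. `FRAGMENT` and `extract_messages` are names invented for this illustration.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical stand-in for the sample in the first post.
# Unlike the real page, every <messageInfo> here is properly closed.
FRAGMENT = (
    '<messageInfo mid="A1" subject="Re: free ipod"><flags isRead="1"/>'
    '<from><name>LaVonda Bland</name><email>groovy_lv13@yahoo.com</email></from>'
    '</messageInfo>'
    '<messageInfo mid="B2" subject="Welcome to Yahoo!"><flags isRead="0"/>'
    '<from><name>Yahoo!</name><email>mailbot@yahoo.com</email></from>'
    '</messageInfo>'
)


def extract_messages(fragment):
    """Return (mid, subject, name, email) for every messageInfo element."""
    # Wrap the fragment so the parser sees a single root element.
    root = ET.fromstring('<root>' + fragment + '</root>')
    rows = []
    for info in root.findall('./messageInfo'):
        rows.append((
            info.get('mid'),
            info.get('subject'),
            info.findtext('from/name'),
            info.findtext('from/email'),
        ))
    return rows


for row in extract_messages(FRAGMENT):
    print(row)
```

Each message becomes its own element, so there is nothing for a pattern to run past. Filtering on isRead is then an ordinary check on `info.find('flags').get('isRead')`.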


    I usually capture characters until one occurs that is not allowed, which is really easy: in HTML code, attribute values can't contain a '"' character and the text between tags can't contain a '<' character.

    Here's the code for your example with my method (isRead is still set to "0"):
    Code:
    <messageInfo mid=\"([^\"]*)\" toEmail=\"[^\"]*\" subject=\"([^\"]*)\" mimeType=\"[^\"]*\" xapparentlyto=\"[^\"]*\" receivedDate=\"[^\"]*\" size=\"[^\"]*\"><flags isReplied=\"0\" isFlagged=\"0\" isRead=\"0\" isDraft=\"0\" isForwarded=\"0\" isHam=\"0\" isSpam=\"0\" hasAttachment=\"0\" inAddressBook=\"0\" isRecent=\"1\"/><from><name>([^<]*)</name><email>([^<]*)</email></from>
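To see the difference between the two approaches, here is a rough Python sketch on a cut-down version of the sample. Only the mid and subject attributes and a single isRead flag are kept, so these are not the full patterns from this thread. `SAMPLE` and `build` are names made up for this illustration.

```python
import re

# Shortened, hypothetical stand-in for the sample in the first post:
# the first message is read (isRead="1"), the second is unread.
SAMPLE = (
    '<messageInfo mid="A1" subject="Re: free ipod"><flags isRead="1"/>'
    '<from><name>LaVonda Bland</name><email>groovy_lv13@yahoo.com</email></from>'
    '</messageInfo>'
    '<messageInfo mid="B2" subject="Welcome to Yahoo!"><flags isRead="0"/>'
    '<from><name>Yahoo!</name><email>mailbot@yahoo.com</email></from>'
)


def build(attr, text, is_read):
    """Assemble the cut-down pattern from three sub-patterns."""
    return re.compile(
        f'<messageInfo mid="({attr})" subject="({attr})">'
        f'<flags isRead="{is_read}"/>'
        f'<from><name>({text})</name><email>({text})</email></from>'
    )


lazy = build('.*?', '.*?', '0')               # lazy dots, as in the first post
bounded = build('[^"]*', '[^<]*', '0')        # negated classes, isRead still "0"
bounded_any = build('[^"]*', '[^<]*', '[01]') # negated classes, any isRead

# Lazy: one merged match. It starts at the first message, fails on
# isRead="1", and the lazy subject group stretches into the second message.
print(lazy.findall(SAMPLE))

# Bounded: groups cannot cross a quote or '<', so only the unread message matches.
print(bounded.findall(SAMPLE))

# Bounded with isRead relaxed: one clean match per message.
print(bounded_any.findall(SAMPLE))
```

A lazy `.*?` is only lazy about where it stops first. When the rest of the pattern fails, it keeps growing, so it can still cross into the next message. The negated classes make that impossible. If you want both messages from the sample, the hard-coded flag values need loosening too.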
     
    Last edited: Aug 21, 2010