Regex to capture content or text after H1?

Discussion in 'Programming' started by The Curator, Jan 4, 2017.

  1. The Curator

    The Curator Supreme Member

    Joined:
    Dec 27, 2013
    Messages:
    1,214
    Likes Received:
    504
    I am using a scraping tool, WebHarvy, that will scrape the title, meta, and URL, but I am having a tough time scraping the content of the page. I was thinking of doing it with a regex that picks up any text/content found on the page after the H1 tag. I just don't understand regex well enough to formulate this myself. Appreciate any help with it!
     
  2. Sam Green

    Sam Green Junior Member

    Joined:
    Dec 15, 2016
    Messages:
    134
    Likes Received:
    31
  3. The Curator

    The Curator Supreme Member

    Joined:
    Dec 27, 2013
    Messages:
    1,214
    Likes Received:
    504
  4. Sam Green

    Sam Green Junior Member

    Joined:
    Dec 15, 2016
    Messages:
    134
    Likes Received:
    31
    Depending on how challenging the site you're trying to scrape is, I might write something up. Post a link or PM me.
     
  5. mynameisfrankenstein

    mynameisfrankenstein Regular Member

    Joined:
    Apr 2, 2015
    Messages:
    429
    Likes Received:
    351
    Gender:
    Male
    Location:
    BC, Canada
    Is there a tag that contains what you want to scrape?

    You could use <div>(.*?)</div>
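
    As a rough illustration of the pattern above, here is a minimal Python sketch. The sample HTML and the name DIV_RE are made up for the example, and the pattern is loosened slightly (attributes allowed, line breaks allowed) because a bare <div> rarely appears on real pages. Nested divs will still confuse it, so treat it as a starting point rather than a finished scraper.

```python
import re

# Hypothetical sample page; real pages usually put attributes on the tag.
html = '<h1>Title</h1><div class="post">First\nparagraph</div><div>Second</div>'

# <div\b[^>]*> tolerates attributes on the opening tag,
# DOTALL lets . cross line breaks, IGNORECASE also accepts <DIV>.
DIV_RE = re.compile(r"<div\b[^>]*>(.*?)</div>", re.DOTALL | re.IGNORECASE)

# findall returns the contents of the capture group for every match.
blocks = DIV_RE.findall(html)
print(blocks)
```

    The non-greedy (.*?) stops at the first closing </div>, which is why a div nested inside another div gets cut short.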
     
  6. Jomasdf

    Jomasdf Jr. VIP

    Joined:
    Jul 7, 2012
    Messages:
    546
    Likes Received:
    214
    Occupation:
    C# dev
    Location:
    Sweden
    What he said. Are you making it yourself? I have no idea what WebHarvy is, but going with an HTML parser + XPaths is a great way to do it.
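
    For anyone curious what the parser route looks like, below is a small sketch using only Python's standard-library html.parser. It is not XPath (that would need a third-party library such as lxml); it just walks the tags and keeps the text that shows up after the first closing </h1>. The class name AfterH1Text, the helper text_after_h1, and the sample page are all invented for the example.

```python
from html.parser import HTMLParser


class AfterH1Text(HTMLParser):
    """Collect visible text that appears after the first </h1>."""

    SKIP = {"script", "style"}  # tags whose contents are not page text

    def __init__(self):
        super().__init__()
        self.after_h1 = False
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "h1":
            self.after_h1 = True  # everything from here on is "after the H1"
        elif tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.after_h1 and not self.skip_depth and text:
            self.chunks.append(text)


def text_after_h1(html):
    parser = AfterH1Text()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return " ".join(parser.chunks)


sample = (
    "<html><body><h1>Title</h1><p>First para.</p>"
    "<script>var x=1;</script><p>Second.</p></body></html>"
)
print(text_after_h1(sample))
```

    Unlike a regex, this does not care about attributes, nesting, or line breaks inside the markup, but it will also pick up footer and sidebar text if those come after the H1.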
     
  7. The Curator

    The Curator Supreme Member

    Joined:
    Dec 27, 2013
    Messages:
    1,214
    Likes Received:
    504
    Unfortunately that's not in my bag of skill sets. Here is webharvy https://webharvy.com/
     
  8. Jomasdf

    Jomasdf Jr. VIP

    Joined:
    Jul 7, 2012
    Messages:
    546
    Likes Received:
    214
    Occupation:
    C# dev
    Location:
    Sweden
  9. fastlinks

    fastlinks Power Member

    Joined:
    Feb 4, 2015
    Messages:
    616
    Likes Received:
    76
    An XPath can break when the site's layout changes, but a regex will keep matching as long as the tag itself stays the same.

    Try this (it matches the text inside the H1 tag):

    (?<=<h1>).*?(?=</h1>)
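
    A quick way to see what this lookaround pattern returns is to run it in Python, as in the sketch below. The sample HTML and the names H1_INNER and H1_GROUP are made up for the example. Note that it returns the text inside the H1, not the text after it, and that the lookbehind only matches a bare <h1> with no attributes; the second pattern is one possible workaround using a capture group.

```python
import re

# Hypothetical sample pages for the demonstration.
plain = "<h1>Page Title</h1><p>Body text</p>"
with_attrs = '<h1 class="entry-title">Page Title</h1><p>Body text</p>'

# Lookbehind/lookahead version: the match itself is the H1 text.
H1_INNER = re.compile(r"(?<=<h1>).*?(?=</h1>)", re.DOTALL)

# Capture-group version: tolerates attributes on the opening tag.
H1_GROUP = re.compile(r"<h1\b[^>]*>(.*?)</h1>", re.DOTALL | re.IGNORECASE)

m = H1_INNER.search(plain)
inner = m.group(0) if m else None
print(inner)

# The lookbehind needs the literal text "<h1>", so attributes defeat it.
print(H1_INNER.search(with_attrs))

g = H1_GROUP.search(with_attrs)
grouped = g.group(1) if g else None
print(grouped)
```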
     
  10. mynameisfrankenstein

    mynameisfrankenstein Regular Member

    Joined:
    Apr 2, 2015
    Messages:
    429
    Likes Received:
    351
    Gender:
    Male
    Location:
    BC, Canada
    How is your content scraping and posting bot coming along?

    I'm quite interested in how it all works out for you.
     
  11. frenchboy

    frenchboy Power Member

    Joined:
    Aug 19, 2008
    Messages:
    762
    Likes Received:
    1,349
    (.+) matches one or more of any character (line breaks excluded unless single-line mode is on), so <h1>(.+)</h1> will capture everything in the H1 tag. If you want what comes after it, you can use </h1>(.+)
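
    Here is a rough Python sketch of the "everything after </h1>" idea. The function name content_after_h1 and the sample page are invented for the example, and the tag stripping is a crude assumption about what "content" should mean. DOTALL is switched on because without it the dot stops at the first line break.

```python
import re


def content_after_h1(html):
    """Return the text that follows the first </h1>, with tags stripped."""
    # DOTALL makes . match line breaks, so (.+) runs to the end of the page.
    m = re.search(r"</h1\s*>(.+)", html, re.DOTALL | re.IGNORECASE)
    if not m:
        return ""
    rest = m.group(1)
    # Drop script/style blocks entirely; their contents are not page text.
    rest = re.sub(r"<(script|style)\b.*?</\1\s*>", " ", rest,
                  flags=re.DOTALL | re.IGNORECASE)
    # Replace every remaining tag with a space, then collapse whitespace.
    rest = re.sub(r"<[^>]+>", " ", rest)
    return " ".join(rest.split())


sample = "<h1>Title</h1>\n<p>First line</p>\n<p>Second <b>line</b></p>"
print(content_after_h1(sample))
```

    This grabs everything to the end of the document, including footers and sidebars, so on a real site the result will probably need trimming.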
     
  12. fastlinks

    fastlinks Power Member

    Joined:
    Feb 4, 2015
    Messages:
    616
    Likes Received:
    76
    When you can't write your own regex, try searching regexlib.com for an existing one.
     
  13. seogibbon

    seogibbon Newbie

    Joined:
    Dec 11, 2012
    Messages:
    13
    Likes Received:
    3
    You can easily find the id of the tag in the HTML and parse the text by id.
    I can share this tool with BHW members, just send me a request by PM.
     
  14. Crawlie

    Crawlie Registered Member

    Joined:
    Jan 2, 2017
    Messages:
    52
    Likes Received:
    11
    Gender:
    Male