
Regex to capture content or text after H1?

Discussion in 'Programming' started by The Curator, Jan 4, 2017.

  1. The Curator

    The Curator Senior Member

    Joined:
    Dec 27, 2013
    Messages:
    1,053
    Likes Received:
    443
    I am using a scraping tool, WebHarvy, that scrapes the title, meta tags, and URL, but I am having a tough time scraping the content of the page. I was thinking of doing it with a regex that captures any text/content found on the page after the H1 tag, but I don't understand regex well enough to write this myself. I'd appreciate any help with it!
     
  2. Sam Green

    Sam Green Junior Member

    Joined:
    Dec 15, 2016
    Messages:
    134
    Likes Received:
    31
    • Thanks x 1
  3. The Curator

    The Curator Senior Member

    Joined:
    Dec 27, 2013
    Messages:
    1,053
    Likes Received:
    443
  4. Sam Green

    Sam Green Junior Member

    Joined:
    Dec 15, 2016
    Messages:
    134
    Likes Received:
    31
    Depending on how challenging the site you're trying to scrape is, I might write something up. Post a link or PM me.
     
  5. mynameisfrankenstein

    mynameisfrankenstein Regular Member

    Joined:
    Apr 2, 2015
    Messages:
    431
    Likes Received:
    346
    Gender:
    Male
    Location:
    BC, Canada
    Is there a tag that contains what you want to scrape?

    You could use <div>(.*?)</div>
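
A minimal sketch of that non-greedy pattern, written in Python because the thread does not say which regex engine WebHarvy uses. The sample HTML and the function name `div_contents` are made up for illustration.

```python
import re

# Hypothetical sample page, not taken from any real site.
SAMPLE = "<div>first block</div><p>skip</p><div>second\nblock</div>"

def div_contents(html):
    """Return the inner text of every attribute-free <div>...</div> pair."""
    # .*? is non-greedy, so each match stops at the nearest </div>.
    # re.DOTALL lets "." also match newlines inside a block.
    return re.findall(r"<div>(.*?)</div>", html, flags=re.DOTALL)

print(div_contents(SAMPLE))
```

Note that the literal `<div>` will not match a tag that carries attributes, such as `<div class="post">`, and nested divs will cut the match short at the first inner `</div>`.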
     
  6. Jomasdf

    Jomasdf Jr. VIP

    Joined:
    Jul 7, 2012
    Messages:
    458
    Likes Received:
    158
    Occupation:
    C# dev
    Location:
    Sweden
    What he said. Are you building it yourself? I have no idea what WebHarvy is, but going with an HTML parser plus XPath is a great way to do it.
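
For anyone who does want to try the parser route, here is a rough sketch using only Python's standard-library `html.parser` (no XPath, since the standard library has no HTML-tolerant XPath engine). The class name `AfterH1Text`, the helper `text_after_h1`, and the sample page are all hypothetical; it simply collects the text that appears after the first closing H1 tag and ignores script and style blocks.

```python
from html.parser import HTMLParser

class AfterH1Text(HTMLParser):
    """Collect visible text that appears after the first closing </h1>."""

    SKIP = {"script", "style"}  # tags whose text is not page content

    def __init__(self):
        super().__init__()
        self.seen_h1 = False   # becomes True once an </h1> has been seen
        self.skip_depth = 0    # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "h1":
            self.seen_h1 = True
        elif tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.seen_h1 and not self.skip_depth and text:
            self.chunks.append(text)

def text_after_h1(html):
    parser = AfterH1Text()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

# Made-up sample page for illustration only.
SAMPLE = """<html><body><h1 class="title">My Title</h1>
<p>First paragraph.</p><script>var x = 1;</script>
<div><p>Second <b>paragraph</b> here.</p></div></body></html>"""

print(text_after_h1(SAMPLE))
```

Unlike a literal `<h1>` regex, this keeps working when the H1 tag carries attributes, because the parser reports the tag name separately from its attributes.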
     
  7. The Curator

    The Curator Senior Member

    Joined:
    Dec 27, 2013
    Messages:
    1,053
    Likes Received:
    443
    Unfortunately that's not in my skill set. Here is WebHarvy: https://webharvy.com/
     
  8. Jomasdf

    Jomasdf Jr. VIP

    Joined:
    Jul 7, 2012
    Messages:
    458
    Likes Received:
    158
    Occupation:
    C# dev
    Location:
    Sweden
    • Thanks x 1
  9. fastlinks

    fastlinks BANNED

    Joined:
    Feb 4, 2015
    Messages:
    616
    Likes Received:
    75
    XPath will change over time, but a regex will always get exactly what you need.

    Try this (it matches the text inside the H1 tag):

    (?<=\<h1\>).*?(?=\<\/h1\>)
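
A quick check of what that lookaround pattern actually returns, sketched in Python. The backslashes are dropped here because `<`, `>` and `/` need no escaping in Python's `re` module; the sample string and variable names are made up.

```python
import re

# Same lookbehind/lookahead idea as the pattern above.
H1_INNER = re.compile(r"(?<=<h1>).*?(?=</h1>)")

# Hypothetical sample markup.
SAMPLE = "<h1>Page Title</h1><p>Body text.</p>"

match = H1_INNER.search(SAMPLE)
h1_text = match.group(0) if match else None
print(h1_text)

# The lookbehind needs the literal text "<h1>", so a tag with
# attributes is not matched at all.
print(H1_INNER.search('<h1 class="big">Page Title</h1>'))
```

So the pattern yields the heading text itself rather than the content that follows it, and it only fires when the opening tag is exactly `<h1>`.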
     
  10. mynameisfrankenstein

    mynameisfrankenstein Regular Member

    Joined:
    Apr 2, 2015
    Messages:
    431
    Likes Received:
    346
    Gender:
    Male
    Location:
    BC, Canada
    How is your content scraping and posting bot coming along?

    I'm quite interested in how it all works out for you.
     
    • Thanks x 1
  11. frenchboy

    frenchboy Power Member

    Joined:
    Aug 19, 2008
    Messages:
    761
    Likes Received:
    1,338
    (.+) matches everything up to the end of the line, so <h1>(.+)</h1> will capture everything in the H1 tag. If you want what comes after it, you can use </h1>(.+)
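
One caveat with `(.+)`, sketched in Python: by default the dot does not match a newline, so on a multi-line page `</h1>(.+)` can capture very little or nothing. Most engines offer a dot-all (single-line) option for this; in Python it is `re.DOTALL`. The helper names `after_h1` and `strip_tags` and the sample page are hypothetical.

```python
import re

# Made-up sample page with line breaks after the heading.
SAMPLE = "<h1>Page Title</h1>\n<p>First line.</p>\n<p>Second line.</p>"

def after_h1(html, dotall=True):
    """Return everything after the first </h1>, or None if nothing matches."""
    flags = re.DOTALL if dotall else 0
    match = re.search(r"</h1>(.+)", html, flags)
    return match.group(1) if match else None

def strip_tags(fragment):
    """Crude tag removal: fine for a demo, not for arbitrary HTML."""
    return " ".join(re.sub(r"<[^>]+>", " ", fragment).split())

# Without DOTALL the dot stops at the newline right after </h1>.
print(after_h1(SAMPLE, dotall=False))

# With DOTALL the capture runs to the end of the page.
print(strip_tags(after_h1(SAMPLE)))
```

With the flag on, the capture also includes sidebars, footers, and scripts that follow the heading, so some cleanup is usually still needed.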
     
    • Thanks x 1
  12. fastlinks

    fastlinks BANNED

    Joined:
    Feb 4, 2015
    Messages:
    616
    Likes Received:
    75
    When you can't write your own regex, try searching regexlib.com for an existing one.
     
    • Thanks x 1
  13. seogibbon

    seogibbon Newbie

    Joined:
    Dec 11, 2012
    Messages:
    13
    Likes Received:
    3
    You can easily find the id of the tag in the HTML and parse the text by id.
    I can share this tool with BHW members; just send me a request by PM.
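
The tool mentioned above is not shown in the thread, so the following is only a rough sketch of the general "parse by id" idea using Python's standard-library `html.parser`. The class `TextById`, the helper `text_by_id`, and the sample markup with its `content` id are all hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

class TextById(HTMLParser):
    """Collect the text inside the first element whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # nesting depth inside the target; 0 means outside
        self.done = False   # True once the target element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif not self.done and dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())

def text_by_id(html, target_id):
    parser = TextById(target_id)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

# Made-up sample markup for illustration only.
SAMPLE = ('<h1>Title</h1><div id="content"><p>Main <br>text.</p></div>'
          '<div id="footer">Footer</div>')

print(text_by_id(SAMPLE, "content"))
```

The depth counter assumes every non-void tag inside the target is properly closed; pages that omit closing tags such as `</p>` or `</li>` would need a more forgiving parser.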
     
  14. Crawlie

    Crawlie Registered Member

    Joined:
    Jan 2, 2017
    Messages:
    52
    Likes Received:
    11
    Gender:
    Male