How to grab text from a website with cURL?

Discussion in 'PHP & Perl' started by rnc505, Feb 3, 2009.

  1. rnc505

    rnc505 Regular Member

    Joined:
    Oct 28, 2008
    Messages:
    229
    Likes Received:
    109
    How do I use cURL to harvest text from a website, say the blog entry titles of a Blogspot blog?

    I know the text (in the source) before a title is:

    Code:
    <h3 class="post-title">
    and after the title it is:
    Code:
    </h3>

    How would I go about doing this? Thanks a lot!
     
  2. 195471

    195471 Regular Member

    Joined:
    Oct 11, 2008
    Messages:
    417
    Likes Received:
    260
    These tutorials may help you:

    Code:
    http://stackoverflow.com/questions/26947/how-to-implement-a-web-scraper-in-php
    http://www.oooff.com/php-scripts/basic-curl-scraping-php/basic-scraping-with-curl.php
     
  3. genie1

    genie1 Junior Member

    Joined:
    May 30, 2008
    Messages:
    151
    Likes Received:
    50
    You can always use wget then grep, or if wget is blocked, lynx --source works too. I know curl can do more, and I molest it daily, but sometimes I prefer simpler too :)
     
  4. tattoo

    tattoo Regular Member

    Joined:
    May 10, 2008
    Messages:
    405
    Likes Received:
    36
    Occupation:
    vagrant
    Location:
    mars
    You can either turn it into well-formed XML and then traverse the tree, or use a regex. Either way, it's beyond the scope of this forum; you will have better luck on a PHP or other programming forum.
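
    A rough sketch of the tree-traversal route in PHP, assuming the page has already been fetched into $page (the sample markup below is made up for illustration) and that the class attribute is exactly "post-title". DOMDocument::loadHTML tolerates non-well-formed HTML, so no separate cleanup step is shown:

```php
<?php
// Sample markup standing in for the page fetched with cURL (hypothetical content).
$page = '<html><body>'
      . '<h3 class="post-title"><a href="/one">First post</a></h3>'
      . '<h3 class="post-title">Second post</h3>'
      . '</body></html>';

// Real-world HTML is rarely valid, so collect parser warnings quietly.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($page);
libxml_clear_errors();

// Select every <h3> whose class attribute is exactly "post-title".
$xpath = new DOMXPath($doc);
$titles = array();
foreach ($xpath->query('//h3[@class="post-title"]') as $node) {
    // textContent drops nested tags such as the <a> around a title.
    $titles[] = trim($node->textContent);
}

print_r($titles);
```

    If the real page puts several class names on the element, the XPath test would need contains(@class, "post-title") instead of an exact match.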
     
  5. fatboy

    fatboy Elite Member

    Joined:
    Aug 13, 2008
    Messages:
    1,618
    Likes Received:
    3,227
    Occupation:
    Retired
    Location:
    Old Peoples Home
    Using Perl and Mechanize would do the trick as well :)
    Been using Mechanize a lot lately and it rocks!
     
  6. 80degreez

    80degreez Newbie

    Joined:
    May 23, 2008
    Messages:
    14
    Likes Received:
    1
    I'm a heavy Mechanize user as well and highly recommend it if the page doesn't have JavaScript
     
  7. mpruben

    mpruben Registered Member

    Joined:
    Jan 10, 2008
    Messages:
    51
    Likes Received:
    19
    Unless I'm missing something here all you need is 'substr' between the position of the start and end of the title, no?
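
    Something along these lines, as a sketch only: extract_between is a made-up helper name, and the sample $page stands in for whatever cURL returned. It loops with strpos so it picks up every title, not just the first one:

```php
<?php
// Hypothetical helper: collect every substring found between $start and $end.
function extract_between($html, $start, $end)
{
    $results = array();
    $offset = 0;
    while (($from = strpos($html, $start, $offset)) !== false) {
        $from += strlen($start);          // skip past the opening marker
        $to = strpos($html, $end, $from); // find the next closing marker
        if ($to === false) {
            break;                        // opening marker without a closing one
        }
        $results[] = trim(substr($html, $from, $to - $from));
        $offset = $to + strlen($end);     // continue searching after this match
    }
    return $results;
}

// Sample markup standing in for the fetched page (hypothetical content).
$page = '<h3 class="post-title">First post</h3><p>text</p>'
      . '<h3 class="post-title">Second post</h3>';

$titles = extract_between($page, '<h3 class="post-title">', '</h3>');
print_r($titles);
```

    Note that anything nested inside the heading, such as a link tag, comes back as part of the string, so strip_tags may be needed afterwards.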
     
  8. sikx

    sikx Registered Member

    Joined:
    Jan 4, 2009
    Messages:
    65
    Likes Received:
    166
    Location:
    Germany
    Here's the proper and easy way of doing it

    Code:
    http://nytemarez.com/scraping-with-php-and-dom/
     
  9. demoniox

    demoniox Registered Member

    Joined:
    Mar 5, 2007
    Messages:
    98
    Likes Received:
    83
    Use the PHP preg_match function.
     
  10. confined

    confined Regular Member

    Joined:
    Jan 4, 2009
    Messages:
    216
    Likes Received:
    91
    PHP:
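    <?php
    // Get the page with cURL; see php.net/curl_setopt for the options.
    $ch = curl_init('http://example.blogspot.com/'); // placeholder URL, use your own
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
    $page = curl_exec($ch);
    curl_close($ch);

    // Non-greedy (.*?) so each match stops at the first closing </h3>.
    preg_match_all('@<h3 class="post-title">(.*?)</h3>@is', $page, $matches);

    // $matches[1] holds the captured text of every title.
    $titles = $matches[1];

    ?>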
    untested. may need more tweaking