How to scrape Title & Description from Yahoo Answers

Discussion in 'PHP & Perl' started by issem10, Jul 14, 2013.

  1. issem10

    issem10 Junior Member

    Hello. How can I scrape the title and description from Yahoo Answers and export them to a .txt or similar format? Does anyone know of software or a script that does this?
     
  2. shubhamm

    shubhamm Junior Member

    You'll have to write it yourself; I don't think there is a ready-made script for it.

    Here is the process:

    Take the URL as input, from a textbox or as a string (this being the PHP/Perl category).

    Request that URL using cURL.

    Read the response.

    Use a regex to pull out the title and description.

    It's simple. Also set a user agent in the cURL request.
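
The steps above could be sketched roughly as follows. This is a minimal sketch, not a tested scraper: `fetch_page()` and `extract_title_and_description()` are hypothetical helper names, the User-Agent string is a placeholder, and the patterns assume the page has a `<title>` tag and a `<div class="content">` block as described elsewhere in this thread, which may no longer match the live markup.

```php
<?php
// Sketch only: the patterns assume a <title> tag and a
// <div class="content"> block; the real markup may differ.

// Fetch a page with cURL, sending a browser-like User-Agent.
// fetch_page() is a hypothetical helper name, not part of any library.
function fetch_page(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow redirects
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; example-scraper)');
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

// Pull the title and description out of the HTML with regular expressions.
// Either value is null when its pattern does not match.
function extract_title_and_description(string $html): array
{
    $title = null;
    $description = null;

    if (preg_match('~<title[^>]*>(.*?)</title>~is', $html, $m)) {
        $title = trim(html_entity_decode($m[1], ENT_QUOTES, 'UTF-8'));
    }
    // Non-greedy match up to the first closing </div>; nested divs would cut it short.
    if (preg_match('~<div class="content"[^>]*>(.*?)</div>~is', $html, $m)) {
        $description = trim(html_entity_decode(strip_tags($m[1]), ENT_QUOTES, 'UTF-8'));
    }

    return ['title' => $title, 'description' => $description];
}
```

To use it, call `fetch_page($url)`, check that the result is not null, and pass it to `extract_title_and_description()`. Regex is brittle against markup changes, so expect to adjust the patterns after looking at the actual page source.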
     
  3. Lyscer

    Lyscer Junior Member

    This is exactly how I would do it too. If you want to do a large number of these, you will more than likely find a pattern or some unique marker once you have done a couple, which will let you move easily from one Q/A to the next.
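
For the "export it into txt" part of the original question, something like the following might do once the titles and descriptions have been collected. It is only a sketch: `export_records_to_txt()` is a hypothetical helper, and the one-line-per-question, tab-separated layout is an arbitrary choice rather than anything specified in this thread.

```php
<?php
// Sketch: append scraped results to a text file, one question per line.
// export_records_to_txt() and the tab-separated layout are illustrative choices.
function export_records_to_txt(array $records, string $path): int
{
    $lines = [];
    foreach ($records as $record) {
        // Collapse runs of whitespace so each record stays on a single line.
        $title = preg_replace('/\s+/', ' ', trim($record['title'] ?? ''));
        $description = preg_replace('/\s+/', ' ', trim($record['description'] ?? ''));
        $lines[] = $title . "\t" . $description;
    }

    $text = $lines ? implode("\n", $lines) . "\n" : '';
    // FILE_APPEND lets repeated runs add to the same file.
    file_put_contents($path, $text, FILE_APPEND | LOCK_EX);

    return count($lines);
}
```

Each record is expected to be an array with `title` and `description` keys; the function returns how many lines it wrote.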
     
  4. wkirk

    wkirk Junior Member

    I think there is a programming/scripting section here somewhere, but here you go. I made it as detailed and simple as I could.

    Code:
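    <?php
    // Target Yahoo Answers URL, passed as the first command-line argument
    if (!isset($argv[1])) {
        exit("Usage: php " . $argv[0] . " <question-url>\n");
    }
    $url = $argv[1];

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $output = curl_exec($ch);
    curl_close($ch);

    if ($output === false) {
        exit("Could not fetch $url\n");
    }

    // The page is now in the $output variable; extract the title
    $title_starts_at_position = stripos($output, "<title>");
    $title_ends_at_position = stripos($output, "</title>");
    if ($title_starts_at_position !== false && $title_ends_at_position !== false) {
        $title_starts_at_position += 7; // skip past "<title>"
        $length_of_the_title = $title_ends_at_position - $title_starts_at_position;
        $title_text = substr($output, $title_starts_at_position, $length_of_the_title);
        $title_without_yahoo_answers_suffix = str_replace(" - Yahoo! Answers", "", $title_text);
        echo "\nTitle: $title_without_yahoo_answers_suffix\n";
    } else {
        echo "\nTitle: (not found)\n";
    }

    // Same approach for the description
    $description_starts_at_position = stripos($output, '<div class="content"');
    if ($description_starts_at_position !== false) {
        $description_starts_at_position += 20; // length of '<div class="content"'
        // Only look at the next 1000 characters, so longer descriptions are cut off
        $remaining_content_from_start_of_description = substr($output, $description_starts_at_position, 1000);
        $description_ends_at_position = stripos($remaining_content_from_start_of_description, "</div>");
        if ($description_ends_at_position !== false) {
            // Start at offset 1 to skip the ">" that closes the opening tag
            $description = substr($remaining_content_from_start_of_description, 1, $description_ends_at_position - 1);
            echo "Description: " . $description . "\n";
        } else {
            echo "Description: (closing </div> not found within 1000 characters)\n";
        }
    } else {
        echo "Description: (not found)\n";
    }

    ?>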
    
    How to use:

    [Screenshot: yanswers.png]
     
  5. CodingAndStuff

    CodingAndStuff Regular Member

    I'd use the "Simple HTML Dom Parser" library found here: http://simplehtmldom.sourceforge.net/

    Then you'd do something like this:

    Code:
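    <?php
    require_once("simple_html_dom.php");
    $question_id = "Put the question ID here";
    $html = file_get_html("http://answers.yahoo.com/question/" . $question_id);
    if (!$html) {
        exit("Could not load the question page\n");
    }

    // find() with an index returns a single element (or null), not an array
    $title = $html->find('h1.subject', 0);
    $description = $html->find('div.content', 0);

    echo "Title is: " . ($title ? $title->plaintext : "") . " and Description is: " . ($description ? $description->plaintext : "");
    ?>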
    
    I didn't test it, but it should work fine.
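
If you would rather not pull in a third-party library, the same idea can be sketched with PHP's bundled DOM extension. This swaps in `DOMDocument` and `DOMXPath` in place of Simple HTML DOM; `extract_with_dom()` is a hypothetical helper name, and the `h1.subject` and `div.content` selectors are carried over from the snippet above as assumptions about the page markup.

```php
<?php
// Sketch using PHP's bundled DOM extension instead of Simple HTML DOM.
// The selectors (h1.subject, div.content) are assumptions about the markup.
function extract_with_dom(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet on sloppy HTML
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Return the trimmed text of the first node matching the query, or null.
    $first_text = function (string $query) use ($xpath): ?string {
        $nodes = $xpath->query($query);
        if ($nodes === false || $nodes->length === 0) {
            return null;
        }
        return trim($nodes->item(0)->textContent);
    };

    // XPath has no class selector, so match the class name as a whole word.
    return [
        'title' => $first_text("//h1[contains(concat(' ', normalize-space(@class), ' '), ' subject ')]"),
        'description' => $first_text("//div[contains(concat(' ', normalize-space(@class), ' '), ' content ')]"),
    ];
}
```

Pass it the HTML you fetched; either value comes back as null when nothing matches, which is a hint that the selectors need updating.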