
scraping ask.com

Discussion in 'PHP & Perl' started by ukescuba, Apr 28, 2009.

  1. ukescuba

    ukescuba Senior Member

    Joined:
    Feb 24, 2008
    Messages:
    994
    Likes Received:
    635
    Occupation:
    Mobile Marketer & QR Code Junkie
    Location:
    San Antonio, TX
    Home Page:
    hey guys

    quick question: is anyone successfully scraping data from ask.com?

    if so what method are you using?

    thanks in advance

    jay
     
  2. denight

    denight Registered Member

    Joined:
    Apr 7, 2009
    Messages:
    58
    Likes Received:
    26
    No, but if you are interested.... what do you want scraped?

    RSS feeds? standard search results from web? Image search results?

    Give me a blueprint of what you want done and let's see what we can do. I always like a challenging PHP/Perl parsing... challenge, lol.

    dN
     
  3. ukescuba

    thanks for getting back to me

    all i'm looking to do is pull the number of results a particular search returns... i have a feeling that ask.com is somehow blocking certain methods of scraping...

    basically, after doing a search on ask.com, i want to pull the total at the end of this line (19,700,000 here):

    Showing 1-10 of 19,700,000
     
  4. BlackMelvyn

    BlackMelvyn Regular Member

    Joined:
    Jul 8, 2008
    Messages:
    202
    Likes Received:
    272
    Home Page:
    Generally, scraping fails because of cookies.
    I have not tested this, but it should be OK (freshly coded).
    Just change $keyword to your search expression ;)
    PHP:
    <?php
    //    Scrape the number of results from Ask.com
    $regex = '#<span id=\'indexLast\' class=\'b\'>\d+</span> of (.*) for#Usi';

    $keyword = 'SEO';
    //    Cookie jar file shared by both requests
    $cookie = tempnam(sys_get_temp_dir(), 'ask');
    //    Get a cookie
    $ch = curl_init('http://www.ask.com/');
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $response = curl_exec($ch);
    //    Done with the first handle; cURL saves the jar when the handle is freed
    curl_close($ch);

    //    Get the results
    $ch = curl_init('http://www.ask.com/web?qsrc=2417&o=312&l=dir&q='.urlencode($keyword));
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $result = curl_exec($ch);
    curl_close($ch);

    //    Search for result
    if (preg_match($regex, $result, $match)) {
        //    $match[1] is the captured count, e.g. "19,700,000"
        $number = str_replace(',', '', $match[1]);
        $number = number_format($number, 0, '.', ' ');
        echo $number . ' results for keyword <strong>'.$keyword.'</strong>';
    }
    else {
        echo 'Sorry, could not retrieve results for <strong>'.$keyword.'</strong>';
    }
    ?>
    Let me know !
     
  5. ukescuba

    really appreciate your time on this, i had tried something similar to this before but my output looks like this:

    ‹&#65533;&#65533;&#65533;&#65533;&#65533;&#65533;ÌZ{sÛ6ÿ[[email protected],MhŠÔ˶d&cËvë÷úˆÓÎÍ=: ‰¨)’!!;©ªï~» ø¦œ´sssòD"°?,ö…ÅÌË/+篮¾_Üÿý‡kòíýwwä‡÷—w· ¢¿ŒƒÁÕýUJ¦Eîc$\ð0 þ`pý7í 0ðÄÆ

    your code ran fine apart from a missing bracket at the end of line 27 :)

    i commented out the last few lines and replaced them with echo $result
    just to see what it was picking up, and got the same freaky output???

    any ideas what it is? is it an encoding/character set issue??

    PS because of my problems with this i've started using cURL a lot more. it's awesome, especially how you can use it to log in to sites too!

    thanks

    jay
     
  6. 00CivicEX

    00CivicEX Jr. VIP Premium Member

    Joined:
    Mar 3, 2009
    Messages:
    293
    Likes Received:
    214
    To help automate it a little, replace the top with this:
    Code:
    <?php
    //    Scraping number of results for Ask.Com
    $regex = '#<span id=\'indexLast\' class=\'b\'>\d+</span> of (.*) for#Usi';
    //    The keyword comes from the form field named "askkeyword"
    $keyword = isset($_POST['askkeyword']) ? $_POST['askkeyword'] : '';
    $cookie = tempnam(sys_get_temp_dir(), 'ask');
    Then create another page with this

    Code:
    <center>Link Scraper by 00CivicEX</center>
    <center><form method="post" action="askresult.php">
    URL: <input type="text" size="12" maxlength="40" name="askkeyword"><br />
    <input type="submit" value="Scrape Links" name="submit">
    </form></center>
    I've also got mine set up so it saves to a database, and my admin panel pulls the results up for me, etc. I'm in the process of adding more features to it. You can remove the "Link Scraper by 00CivicEX" line; I just pulled the code out of mine real quick. Make sure you change askresult.php to whatever you called your file.

    and here is how I echoed and stored mine

    Code:
    storeLink($number,$keyword);
     echo "<br />Keyword Number Stored: $number,$keyword";
    then just has a function like this adding it to the db

    Code:
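    function storeLink($number,$keyword) {
     // Escape both values before putting them into the query
     $query = sprintf("INSERT INTO links (number, keyword) VALUES ('%s', '%s')",
      mysql_real_escape_string($number), mysql_real_escape_string($keyword));
     mysql_query($query) or die('Error, insert query failed');
    }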
     
    Last edited: May 1, 2009
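The `mysql_*` functions used in this thread were removed in PHP 7.0, so on a current PHP the storing step needs a different driver. Below is a minimal sketch of the same `storeLink()` idea using PDO with a prepared statement. The extra `$db` parameter and the in-memory SQLite database are assumptions made purely so the example runs on its own; the `links` table and its `number` and `keyword` columns follow the post above.

```php
<?php
// Hypothetical PDO version of storeLink(); the connection is passed in explicitly.
function storeLink(PDO $db, string $number, string $keyword): void
{
    // Placeholders keep user input out of the SQL text.
    $stmt = $db->prepare('INSERT INTO links (number, keyword) VALUES (:number, :keyword)');
    $stmt->execute([':number' => $number, ':keyword' => $keyword]);
}

// In-memory SQLite stands in for the MySQL database, for demonstration only.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE links (number TEXT, keyword TEXT)');

// A keyword containing a quote is stored safely.
storeLink($db, '19700000', "O'Reilly books");
$row = $db->query('SELECT number, keyword FROM links')->fetch(PDO::FETCH_ASSOC);
echo $row['number'], ' ', $row['keyword'], "\n";
```

With a real MySQL server only the DSN passed to `new PDO(...)` should need to change; the prepared statement stays the same.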
  7. ukescuba

    hey civix

    i'm using POST to automate my system already but that's good advice, i'm thinking of making it more seamless and using ajax too

    question for you though: is your scrape actually working for ask.com? whenever i try the scrape i get data that looks like this:

    &#65533;&#65533;&#65533;&#65533;&#65533;&#65533;ÌZ{sÛ6ÿ[&#8482;[email protected],Mh&#352;Ô˶d&cËvë÷ú&#710;ÓÎÍ=: &#8240;¨)&#8217;!!;©ªï~» ø¦&#339;´sssòD"°?,ö&#8230;ÅÌË/+篮¾_Üÿý&#8225;kòíýwwä&#8225;÷&#8212;w· ¢¿&#338;&#402;ÁÕýUJ¦Eîc$\ð0 þ`pý7í 0ðÄÆ

    i don't doubt the scraper will work for other sites, i have 80+ scrapes working, it's just ask.com that spits out this crap?? not sure how to get around it?

    thanks
     
  8. 00CivicEX

    I can scrape ask.com for content but haven't tried to count the results. Shouldn't be too hard... I'll look it over, see if the script he posted works, and fix it if not.
     
  9. ukescuba

    just sent you a pm showing what my output looks like when i dump the grab

    am only looking to pull the number of results from:

    Showing 1-10 of 51,800,000 for keyword

    thanks

    jay
     
  10. 00CivicEX

    Still messing with it, but here is my complete code to scrape links from ask.com. It might help you with the count. I know what needs to be done, it's just a matter of getting it done: it needs a getByClassName-style lookup to retrieve the elements that hold the count results, which are 'indexFirst' and 'indexLast'. We wouldn't need indexFirst because it's always 1, just indexLast.

    Code:
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>
    Scraper
    </title>
    <link href="style.css" rel="stylesheet" type="text/css" />
    </head>
     
    <body>
     
    <?php
    $username="xxxxxxxxx";
    $password="xxxxxxxxx";
    $database="xxxxxxxxx";
    mysql_connect('localhost', $username, $password);
     
    @mysql_select_db($database) or die( "Unable to select database");
     
     
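    function storeLink($url,$gathered_from) {
     // Escape both values before putting them into the query
     $url = mysql_real_escape_string($url);
     $gathered_from = mysql_real_escape_string($gathered_from);
     $query = "INSERT INTO links (url, gathered_from) VALUES ('$url', '$gathered_from')";
     mysql_query($query) or die('Error, insert query failed');
    }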
     
    // URL-encode the keyword so spaces and symbols survive in the query string
    $scrapekey = urlencode($_POST["keyword"]);
     
    $target_url = "http://www.ask.com/web?qsrc=2417&o=312&l=dir&q=$scrapekey";
    $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
     
    // make the cURL request to $target_url
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_URL,$target_url);
    curl_setopt($ch, CURLOPT_FAILONERROR, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $html= curl_exec($ch);
    if (!$html) {
     echo "<br />cURL error number:" .curl_errno($ch);
     echo "<br />cURL error:" . curl_error($ch);
     exit;
    }
     
     
     
    // parse the html into a DOMDocument
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
     
    // grab all the links on the page
    $xpath = new DOMXPath($dom);
    $hrefs = $xpath->evaluate("/html/body//a");
     
    for ($i = 0; $i < $hrefs->length; $i++) {
     $href = $hrefs->item($i);
     $url = $href->getAttribute('href');
     storeLink($url,$target_url);
     echo "<br />Link stored: $url";
    }
     
    mysql_close();
    ?>
    <center><FORM ACTION="home.html">
    <INPUT TYPE=SUBMIT VALUE="Results">
    </FORM></center>
     
     
    </body>
    </html>
  11. 00CivicEX

    Ok, well, this thing is pissing me off lol... not sure why it's not working, but I have the code that should make it work. I'm thinking the reason it isn't working is that the results are in an iframe, but here is the code:

    Code:
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>
    Scraper
    </title>
    <link href="style.css" rel="stylesheet" type="text/css" />
    </head>
     
    <body>
     
    <?php
    $username="xxxxxx";
    $password="xxxxxx";
    $database="xxxxxx";
    mysql_connect('localhost', $username, $password);
     
    @mysql_select_db($database) or die( "Unable to select database");
     
    function storeLink($url,$gathered_from) {
     $query = "INSERT INTO links (url, gathered_from) VALUES ('$url', '$gathered_from')";
     mysql_query($query) or die('Error, insert query failed');
    }
     
     
    // URL-encode the keyword so spaces and symbols survive in the query string
    $scrapekey = urlencode($_POST["keyword"]);
     
    $target_url = "http://www.ask.com/web?qsrc=2417&o=312&l=dir&q=$scrapekey";
    $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
     
    // make the cURL request to $target_url
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_URL,$target_url);
    curl_setopt($ch, CURLOPT_FAILONERROR, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $html= curl_exec($ch);
    if (!$html) {
     echo "<br />cURL error number:" .curl_errno($ch);
     echo "<br />cURL error:" . curl_error($ch);
     exit;
    }
     
     
     
    // parse the html into a DOMDocument
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $xp = new DOMXPath($dom);
    // look for <span> tags on the ask.com page and narrow them down by class
    $titles = $xp->query("/html/body//span[@class = 'b']");
    foreach ($titles as $node) {
        // print only the text content of each match
        print $node->textContent . " ";
    }
     
    
    mysql_close();
    ?>
    <center><FORM ACTION="home.html">
    <INPUT TYPE=SUBMIT VALUE="Results">
    </FORM></center>
     
     
    </body>
    </html>
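One possible reason the XPath above never surfaces the total: if the page reads something like "Showing <span>1</span>-<span>10</span> of N for keyword", the spans with class `b` only hold "1" and "10", and the total sits in the plain text after the `indexLast` span. The sketch below reads that following text node. The sample markup is an assumption pieced together from the ids and the result line quoted in this thread, and `ask_result_count()` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: returns the total result count as an int, or null if absent.
function ask_result_count(string $html): ?int
{
    $dom = new DOMDocument();
    // Suppress warnings that real-world, non-valid markup would trigger.
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    // First text node that follows the span carrying id="indexLast".
    $nodes = $xpath->query("//span[@id='indexLast']/following-sibling::text()[1]");
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    // The text looks like " of 25,100,000 for ..."; capture digits and commas.
    if (!preg_match('/of\s+([\d,]+)/', $nodes->item(0)->nodeValue, $m)) {
        return null;
    }
    return (int) str_replace(',', '', $m[1]);
}

// Assumed sample markup, not a captured ask.com response.
$sample = "<html><body><div>Showing <span id='indexFirst' class='b'>1</span>-"
        . "<span id='indexLast' class='b'>10</span> of 25,100,000 for nintendo</div></body></html>";
$count = ask_result_count($sample);
echo $count === null ? "no count found\n" : $count . "\n";
```

Selecting by `id` rather than by class avoids picking up every bold span on the page.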
  12. 00CivicEX

    Ok, so I tried modifying the first guy's PHP, since it had a lot of errors in it, but I think his regular expression is incorrect and I suck at converting them... so here is the code, and if you can make it match what it's looking for it should work. Right now it says no matches found.

    Code:
    <?php
    // this is the pattern it's supposed to match
    $regex = "#<span id=\'indexLast\' class=\'b\'>[[:alnum:]]</span> of (.*) for#Usi";
    $keyword = 'nintendo';
    $cookie = '';
     
    $target_url = 'http://www.ask.com/web?qsrc=2417&o=312&l=dir&q=' . urlencode($keyword);
    $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
    // Get a cookie
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $target_url);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $response = curl_exec($ch);
    // Get the results
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_URL, $target_url);
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $result = curl_exec($ch);
    curl_close($ch);
    // Search for result
    if (preg_match($regex, $result, $match)) {
        // $match[1] holds the captured count
        $number = str_replace(',', '', $match[1]);
        $number = number_format($number, 0, '.', ' ');
        echo $number . ' results for keyword <strong>'.$keyword.'</strong>';
    } else {
        echo 'Sorry, could not retrieve results for <strong>'.$keyword.'</strong>';
    }
    ?>
    
    Here is what the pattern needs to match:

    Code:
    <span id='indexLast' class='b'>10</span> of 25,100,000 for
    where 25,100,000 changes from search to search, so we don't know in advance what it is going to be.
     
    Last edited: May 2, 2009
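On why the pattern reports no match against that sample line: `[[:alnum:]]` matches exactly one character, while the span holds the two characters "10", so the literal `</span>` that follows can never line up. A minimal sketch of one way to adjust it, tested only against the sample string quoted above and not against a live ask.com page:

```php
<?php
// Sample line quoted in the thread.
$sample = "<span id='indexLast' class='b'>10</span> of 25,100,000 for";

// \d+ accepts one or more digits inside the span; ([\d,]+) captures the count.
$regex = "#<span id='indexLast' class='b'>\\d+</span> of ([\\d,]+) for#i";

$count = null;
if (preg_match($regex, $sample, $match)) {
    // $match[0] is the whole match; $match[1] is the captured group.
    $count = (int) str_replace(',', '', $match[1]);
    echo number_format($count, 0, '.', ' '), "\n";
} else {
    echo "no match\n";
}
```

Capturing `[\d,]+` instead of `.*` also makes the `U` (ungreedy) modifier unnecessary, since the group can no longer run past the number.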
  13. BlackMelvyn

    Hi Jay,
    I'm really surprised by this output too :confused:
    Could it be compressed output, like the kind servers use to speed up transfers?
    Could be an idea...
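The pasted output is consistent with that guess: it begins with "‹", which is how byte 0x8B renders in Windows-1252, and gzip streams start with the bytes 0x1F 0x8B. Below is a minimal sketch of two ways to handle it, assuming the response really is gzip-compressed. `maybe_gunzip()` and `fetch_decoded()` are hypothetical helper names; `CURLOPT_ENCODING` and `gzdecode()` are standard PHP.

```php
<?php
// Hypothetical helper: gunzip $body only if it starts with the gzip magic bytes.
function maybe_gunzip(string $body): string
{
    if (strncmp($body, "\x1f\x8b", 2) !== 0) {
        return $body; // not gzip, return untouched
    }
    $decoded = @gzdecode($body);
    return $decoded === false ? $body : $decoded;
}

// Hypothetical helper: fetch a URL and let cURL handle decompression itself.
function fetch_decoded(string $url): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    // An empty string asks cURL to accept every encoding it supports
    // and to decode the response body before returning it.
    curl_setopt($ch, CURLOPT_ENCODING, '');
    $body = curl_exec($ch);
    return $body === false ? '' : $body;
}

// Offline demonstration with locally compressed data (no request is made).
$compressed = gzencode('Showing 1-10 of 19,700,000 for SEO');
echo maybe_gunzip($compressed), "\n";
```

Setting `CURLOPT_ENCODING` is usually the simpler route with cURL; `maybe_gunzip()` is for bodies that were already fetched some other way.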
     
  14. I searched for 'paper' and it worked fine with this script.

    PHP:
    <?php
    $data = file_get_contents('http://www.ask.com/web?q=paper&search=search&qsrc=0&o=0&l=dir');
    $regex = '/<\/span> of (.+?) for/';
    // Only print the count if the pattern was actually found
    if (preg_match($regex, $data, $match)) {
        echo $match[1];
    }
    ?>
     
    Last edited: Jun 2, 2009
  15. Damn.. Sorry, I went to edit but I just missed the half hour mark.

    Anyways, I tried searching for a longer string using the URL above, but ask.com occasionally uses a different format. However, the $regex still works regardless of which URL / search you use. And like the URLs used above, you can use the format: 'http://www.ask.com/web?qsrc=2417&o=0&l=dir&q=' followed by the URL-encoded search string

    Here is some final working code:
    Code:
    <?php
    
    $search_for = urlencode("search for some shizzzzzzzz");
    
    $data = file_get_contents('http://www.ask.com/web?qsrc=2417&o=0&l=dir&q=' . $search_for);
    $regex = '/<\/span> of (.+?) for/';
    preg_match($regex,$data,$match);
    echo $match[1];
    ?>
    
    Lemme know if this works for you.
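For the URL format described above, here is a small sketch that builds the query string with `http_build_query()` instead of concatenating by hand (in PHP the concatenation operator is `.`, not `+`). `ask_search_url()` is a hypothetical helper name; the parameter values are the ones used in this post.

```php
<?php
// Hypothetical helper: build an Ask.com search URL from the parameters used in the thread.
function ask_search_url(string $query): string
{
    $params = ['qsrc' => '2417', 'o' => '0', 'l' => 'dir', 'q' => $query];
    // http_build_query() URL-encodes every value; '&' is passed explicitly
    // so the result does not depend on the arg_separator.output ini setting.
    return 'http://www.ask.com/web?' . http_build_query($params, '', '&');
}

echo ask_search_url('paper & pencil'), "\n";
// prints http://www.ask.com/web?qsrc=2417&o=0&l=dir&q=paper+%26+pencil
```

Encoding matters as soon as a keyword contains a space or an ampersand; unencoded, the `&` would start a new parameter and truncate the search.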
     
    Last edited: Jun 2, 2009
  16. Did this work for anyone? Or did I miss the mark completely?
     
  17. ukescuba

    hi dor@, i appreciate your efforts. i've been tied up doing other stuff recently, but that's kinda the same code that i had used previously and was getting the weird output for...

    i'll pick up my project again next week and give it a whirl

    thanks

    jay