scraping ask.com

ukescuba · Apr 28, 2009

hey guys

quick question is anyone successfully scraping data from ask.com?

if so what method are you using?

thanks in advance

jay

denight · Apr 29, 2009

No, but if you are interested.... what do you want scraped?

RSS feeds? standard search results from web? Image search results?

Give me a blueprint of what you want done and let's see what we can do, I always like a challenging php/perl parsing....challenge.. lol.

dN

ukescuba · Apr 30, 2009

thanks for getting back to me

all i'm looking to do is pull the number of results a particular search returns... i have a feeling that ask.com is somehow blocking certain methods of scraping...

basically after doing a search on ask.com i am looking to pull the total at the end of the line below (19,700,000 in this example)...

Showing 1-10 of 19,700,000

BlackMelvyn · Apr 30, 2009

Generally, scraping fails because of cookies.
I have not tested this (freshly coded), but it should be OK.
Just change $keyword to your search expression.

PHP:

<?php
//	Scraping number of results for Ask.Com
//	Captures the total from: <span id='indexLast' class='b'>10</span> of 19,700,000 for
$regex = '#<span id=\'indexLast\' class=\'b\'>\d+</span> of ([\d,]+) for#i';

$keyword = 'SEO';
//	cURL needs a real, writable file for the cookie jar; an empty path gives it nowhere to save cookies
$cookie = tempnam(sys_get_temp_dir(), 'ask');
//	Get a cookie
$ch = curl_init('http://www.ask.com/');
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$response = curl_exec($ch);

//	Get the results, reusing the same handle so the cookie is sent along
curl_setopt($ch, CURLOPT_URL, 'http://www.ask.com/web?qsrc=2417&o=312&l=dir&q='.urlencode($keyword));
$result = curl_exec($ch);
curl_close($ch);

//	Search for result
if(is_string($result) && preg_match($regex, $result, $match)){
	//	$match[1] is the captured number; $match[0] would be the whole matched string
	$number = str_replace(',', '', $match[1]);
	$number = number_format((float) $number, 0, '.', ' ');
	echo $number . ' results for keyword <strong>'.htmlspecialchars($keyword).'</strong>';
}
else{
	echo 'Sorry, could not retrieve results for <strong>'.htmlspecialchars($keyword).'</strong>';
}
?>

Let me know !

ukescuba · Apr 30, 2009

00CivicEX · May 1, 2009

To help automate it a little, replace the top with this

Code:

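<?php
//    Scraping number of results for Ask.Com
$regex = '#<span id=\'indexLast\' class=\'b\'>\d+</span> of ([\d,]+) for#i';
//    take the keyword from the submitted form instead of hard-coding it
$keyword = isset($_POST["askkeyword"]) ? trim($_POST["askkeyword"]) : '';
$cookie = tempnam(sys_get_temp_dir(), 'ask');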

Then create another page with this

Code:

<center>Link Scraper by 00CivicEX</center>
<center><form method="post" action="askresult.php">
URL: <input type="text" size="12" maxlength="40" name="askkeyword"><br />
<input type="submit" value="Scrape Links" name="submit">
</form></center>

I also have mine set up so it saves to a database, then my admin panel pulls the results up for me, etc. I'm in the process of adding more features to it. You can remove the "Link Scraper by 00CivicEX" line, I just pulled the code out of mine real quick. Make sure you change askresult.php to whatever you called your file.

and here is how I echoed and stored mine

Code:

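storeLink($db,$number,$keyword);
 echo "<br />Keyword Number Stored: $number," . htmlspecialchars($keyword);

then just has a function like this adding it to the db

Code:

function storeLink($db,$number,$keyword) {
 // $db is your open mysqli connection; the prepared statement keeps quotes in a keyword from breaking the query
 $stmt = mysqli_prepare($db, "INSERT INTO links (number, keyword) VALUES (?, ?)");
 mysqli_stmt_bind_param($stmt, 'ss', $number, $keyword);
 mysqli_stmt_execute($stmt) or die('Error, insert query failed');
}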

ukescuba · May 1, 2009

00CivicEX · May 1, 2009

I can scrape ask.com for content but haven't tried to count the results. Shouldn't be too hard... will look it over and see if the script he posted works, and if not, fix it.

ukescuba · May 1, 2009

just sent you a pm showing what my output looks like when i dump the grab

am only looking to pull the number of results from:

Showing 1-10 of 51,800,000 for keyword

thanks

jay

00CivicEX · May 1, 2009

Still messing with it, but here is my complete code to scrape links from ask.com. Might help you with the count. I know what needs to be done... just getting it done... it needs something like a getByClassName to retrieve the count results, which sit in the spans with the ids 'indexFirst' and 'indexLast'. Wouldn't need indexFirst because it's always 1, just need indexLast.

Code:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>
Scraper
</title>
<link href="style.css" rel="stylesheet" type="text/css" />
</head>
 
<body>
 
<?php
$username="xxxxxxxxx";
$password="xxxxxxxxx";
$database="xxxxxxxxx";
// mysqli replaces the old mysql_* functions, which were removed in PHP 7
$db = mysqli_connect('localhost',$username,$password);
if (!$db) {
 die("Unable to connect to database");
}
 
mysqli_select_db($db,$database) or die( "Unable to select database");
 
 
function storeLink($db,$url,$gathered_from) {
 // prepared statement, so quotes in a scraped URL cannot break the query
 $stmt = mysqli_prepare($db, "INSERT INTO links (url, gathered_from) VALUES (?, ?)");
 mysqli_stmt_bind_param($stmt, 'ss', $url, $gathered_from);
 mysqli_stmt_execute($stmt) or die('Error, insert query failed');
 mysqli_stmt_close($stmt);
}
 
$scrapekey = isset($_POST["keyword"]) ? trim($_POST["keyword"]) : '';
 
// urlencode so spaces and symbols in the keyword do not break the URL
$target_url = "http://www.ask.com/web?qsrc=2417&o=312&l=dir&q=" . urlencode($scrapekey);
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
 
// make the cURL request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html= curl_exec($ch);
if (!$html) {
 echo "<br />cURL error number:" .curl_errno($ch);
 echo "<br />cURL error:" . curl_error($ch);
 exit;
}
 
 
 
// parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html);
 
// grab all the links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
 
for ($i = 0; $i < $hrefs->length; $i++) {
 $href = $hrefs->item($i);
 $url = $href->getAttribute('href');
 storeLink($db,$url,$target_url);
 echo "<br />Link stored: " . htmlspecialchars($url);
}
 
mysqli_close($db);
?>
<center><FORM ACTION="home.html">
<INPUT TYPE=SUBMIT VALUE="Results">
</FORM></center>
 
 
</body>
</html>

00CivicEX · May 2, 2009

Ok well this thing is pissing me off lol... not sure why it's not working... but I have the code that should make it work... I'm thinking the reason it isn't working is that the results are in an iframe, but here is the code:

Code:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>
Scraper
</title>
<link href="style.css" rel="stylesheet" type="text/css" />
</head>
 
<body>
 
<?php
$username="xxxxxx";
$password="xxxxxx";
$database="xxxxxx";
// mysqli replaces the old mysql_* functions, which were removed in PHP 7
$db = mysqli_connect('localhost',$username,$password);
if (!$db) {
 die("Unable to connect to database");
}
 
mysqli_select_db($db,$database) or die( "Unable to select database");
 
function storeLink($db,$url,$gathered_from) {
 // prepared statement, so quotes in a scraped URL cannot break the query
 $stmt = mysqli_prepare($db, "INSERT INTO links (url, gathered_from) VALUES (?, ?)");
 mysqli_stmt_bind_param($stmt, 'ss', $url, $gathered_from);
 mysqli_stmt_execute($stmt) or die('Error, insert query failed');
 mysqli_stmt_close($stmt);
}
 
 
$scrapekey = isset($_POST["keyword"]) ? trim($_POST["keyword"]) : '';
 
// urlencode so spaces and symbols in the keyword do not break the URL
$target_url = "http://www.ask.com/web?qsrc=2417&o=312&l=dir&q=" . urlencode($scrapekey);
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
 
// make the cURL request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html= curl_exec($ch);
if (!$html) {
 echo "<br />cURL error number:" .curl_errno($ch);
 echo "<br />cURL error:" . curl_error($ch);
 exit;
}
 
 
 
// parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xp = new DOMXPath($dom);
//searches the ask.com page for <span> tags and narrows it down to the ones with class 'b'
$titles = $xp->query("/html/body//span[@class = 'b']"); 
foreach ($titles as $node) {
//sifts through what it finds and only shows the text content
print htmlspecialchars($node->textContent) . " "; 
    
}
 

mysqli_close($db);
?>
<center><FORM ACTION="home.html">
<INPUT TYPE=SUBMIT VALUE="Results">
</FORM></center>
 
 
</body>
</html>
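
For what it's worth, here is a rough sketch of the "look up indexLast" idea using the span's id rather than its class. The total is not inside the span itself; going by the markup quoted in this thread it sits in the text right after it. The function name countFromDom and the sample page wrapper are made up for illustration, and the live page may well differ, so treat it as a starting point only.

```php
<?php
// Hypothetical helper: finds the span with id "indexLast" and reads the number that follows it.
function countFromDom(string $html): ?int
{
    $dom = new DOMDocument();
    // Suppress warnings about sloppy real-world markup.
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    // The total lives in the first text node after the span, not inside it.
    $nodes = $xpath->query("//span[@id='indexLast']/following-sibling::text()[1]");
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    // Expect text like " of 25,100,000 for ..."
    if (preg_match('/of\s+([\d,]+)/', $nodes->item(0)->nodeValue, $m) !== 1) {
        return null;
    }
    return (int) str_replace(',', '', $m[1]);
}

// Assumed sample markup, pieced together from the snippets in this thread.
$sample = "<html><body><div>Showing <span id='indexFirst' class='b'>1</span>-"
        . "<span id='indexLast' class='b'>10</span> of 25,100,000 for nintendo</div></body></html>";
var_dump(countFromDom($sample)); // int(25100000)
```

If the results really are loaded in an iframe, the span will not be in the fetched HTML at all and this returns null, which at least tells you where the problem is.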

00CivicEX · May 2, 2009

Ok, so I tried modifying the first guy's PHP, since it had a lot of errors in it, but I think his regular expression is incorrect and I suck at converting them... so here is the code, and if you can make it match what it's looking for it should work... right now it says no matches found

Code:

<?php
// this is the string it's supposed to match; \d+ allows more than one digit before </span>
$regex = '#<span id=\'indexLast\' class=\'b\'>\d+</span> of ([\d,]+) for#i';
$keyword = 'nintendo';
// cURL needs a real file path for the cookie jar
$cookie = tempnam(sys_get_temp_dir(), 'ask');
 
$target_url = 'http://www.ask.com/web?qsrc=2417&o=312&l=dir&q=' . urlencode($keyword);
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
// Get a cookie
$ch = curl_init();
curl_setopt($ch,CURLOPT_URL,$target_url);
curl_setopt($ch,CURLOPT_COOKIEJAR, $cookie);
curl_setopt($ch,CURLOPT_COOKIEFILE, $cookie);
curl_setopt($ch,CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch,CURLOPT_FOLLOWLOCATION, true);
$response = curl_exec($ch);
// Get the results on the same handle, so the cookie from the first request is sent back
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
$result = curl_exec($ch);
curl_close($ch);
// Search for result
if (is_string($result) && preg_match($regex,$result,$match)){ 
// $match[1] is the captured number; $match[0] is the whole matched string
$number = str_replace(',', '', $match[1]); 
$number = number_format((float) $number, 0, '.', ' '); 
echo $number . ' results for keyword <strong>'.htmlspecialchars($keyword).'</strong>';
}else{ 
echo 'Sorry, could not retrieve results for <strong>'.htmlspecialchars($keyword).'</strong>';
}
?>

here is what the string needs to match

Code:

<span id='indexLast' class='b'>10</span> of 25,100,000 for

where 25,100,000 changes from search to search, so we don't know in advance what it's going to be.
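
A quick way to check a pattern against that exact string without hitting ask.com at all. This is only a sketch: the function name extractResultCount is made up, and the pattern assumes the markup stays exactly as quoted above. The point is that [[:alnum:]] on its own matches a single character, so it can never match the two-digit "10"; \d+ can.

```php
<?php
// Hypothetical helper: returns the total result count as an integer, or null if not found.
function extractResultCount(string $html): ?int
{
    // \d+ allows a multi-digit page index ("10"); [\d,]+ captures the comma-grouped total.
    $regex = '#<span id=\'indexLast\' class=\'b\'>\d+</span> of ([\d,]+) for#i';
    if (preg_match($regex, $html, $match) !== 1) {
        return null;
    }
    // $match[1] is the captured total; strip the thousands separators.
    return (int) str_replace(',', '', $match[1]);
}

$sample = "<span id='indexLast' class='b'>10</span> of 25,100,000 for";
var_dump(extractResultCount($sample)); // int(25100000)
```

Testing against a saved string first makes it easier to tell a bad pattern apart from a bad download.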

BlackMelvyn · May 3, 2009

Hi Jay,
Really surprised by this output too.

Couldn't it be compressed output? Like the kind used to increase transfer speed?
Could be an idea...

ukescuba said:
any ideas what it is? is it an encoding/character set issue??
jay
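
If compression is the culprit, something along these lines would show it. It is only a sketch, assuming the zlib extension is available; the function name maybeGunzip is made up, and the "response" is faked locally with gzencode rather than fetched from ask.com.

```php
<?php
// Hypothetical helper: returns readable text whether or not the body is gzip-compressed.
function maybeGunzip(string $body): string
{
    // Gzip streams start with the two magic bytes 0x1f 0x8b.
    if (strncmp($body, "\x1f\x8b", 2) === 0) {
        $decoded = gzdecode($body);
        if ($decoded !== false) {
            return $decoded;
        }
    }
    return $body;
}

// Simulate a compressed response locally instead of calling ask.com.
$compressed = gzencode('Showing 1-10 of 51,800,000 for keyword');
echo maybeGunzip($compressed), "\n";
```

With cURL you can also set CURLOPT_ENCODING to an empty string, which asks for any encoding cURL supports and decodes the response for you.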

dor@tehexploa · Jun 2, 2009

I searched for 'paper' and it worked fine with this script.

PHP:

<?php
$data = file_get_contents('http://www.ask.com/web?q=paper&search=search&qsrc=0&o=0&l=dir');
$regex = '/<\/span> of (.+?) for/';
// only echo when the page was fetched and the pattern matched
if ($data !== false && preg_match($regex,$data,$match)) {
    echo $match[1];
}
?>

dor@tehexploa · Jun 2, 2009

Damn.. Sorry, I went to edit but I just missed the half hour mark.

Anyway, I tried searching for a longer string using the URL above, but ask.com occasionally uses a different format. However, the $regex still works regardless of what URL / search you use. And like the URLs used above, you can use the format 'http://www.ask.com/web?qsrc=2417&o=0&l=dir&q=' followed by the URL-encoded search string.

Here is some final working code:

Code:

<?php

$search_for = urlencode("search for some shizzzzzzzz");

$data = file_get_contents('http://www.ask.com/web?qsrc=2417&o=0&l=dir&q=' . $search_for);
$regex = '/<\/span> of (.+?) for/';
// only echo when the page was fetched and the pattern matched
if ($data !== false && preg_match($regex,$data,$match)) {
    echo $match[1];
}
?>

Lemme know if this works for you.
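
In case hand-gluing the query string gets fiddly, here is a small sketch that builds the same URL with http_build_query. The function name askSearchUrl is made up; the parameter values are just the ones used in this thread, and I can't say what qsrc, o and l actually mean to ask.com.

```php
<?php
// Hypothetical helper: builds the search URL from the parameters used in this thread.
function askSearchUrl(string $query): string
{
    $params = ['qsrc' => 2417, 'o' => 0, 'l' => 'dir', 'q' => $query];
    // http_build_query URL-encodes every value and joins the pairs with "&".
    return 'http://www.ask.com/web?' . http_build_query($params, '', '&');
}

echo askSearchUrl('nintendo wii'), "\n";
// http://www.ask.com/web?qsrc=2417&o=0&l=dir&q=nintendo+wii
```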

dor@tehexploa · Jun 11, 2009

Did this work for anyone? Or did I miss the mark completely?

ukescuba · Jun 12, 2009

hi dor@ i appreciate your efforts, i've been tied up doing other stuff just recently, but that's kinda the same code that i had used previously and was getting the weird output for...

i'll pick up my project again next week and give it a whirl

thanks

jay
