
Fake referrer in scraping script?

Discussion in 'PHP & Perl' started by OsiriXbe, Sep 11, 2012.

  1. OsiriXbe

    OsiriXbe Newbie

    Joined:
    Jul 29, 2010
    Messages:
    28
    Likes Received:
    7
    Hello

    I have programmed a script to scrape some data from a huge website.
    It's something like this:
    PHP:
    <?php

    $url_list = "urls.txt";

    foreach (file($url_list) as $url) {
        $html = file_get_html($url);

        // Here all the scraping stuff.

    }
    ?>
    It worked great, but the website owner found out I was scraping it and he has built in some security.
    When the script visits a page of their website, the referrer is blank because PHP is requesting it directly. When they get too many blank referrers from the same IP, they block the IP.

    So now I need to find a way to fake the referrer with my PHP script. I've already tried it with some Chrome addons. That works when I visit the site by hand, but not when I let the PHP script do its job.

    I think I have to fake the referrer just before this line:
    $html = file_get_html($url);
    because that's the line where the PHP script actually goes to the website to gather the content.
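    Roughly what I'm after, sketched with core PHP stream contexts (the URLs below are placeholders, and file_get_contents() stands in for the fetch that file_get_html() does):
    PHP:
    <?php
    // Sketch only: attach a Referer header to a plain PHP HTTP request
    // via a stream context. Both URLs are placeholder values.
    $url           = "http://example.com/some-page.html";
    $fake_referrer = "http://www.google.com/";

    $context = stream_context_create(array(
        'http' => array(
            'header' => "Referer: $fake_referrer\r\n",
        ),
    ));

    // file_get_contents() stands in here for the fetch file_get_html() performs.
    $raw_html = file_get_contents($url, false, $context);
    ?>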


    I hope someone can help with this.

    Edit: Forgot to mention, the file_get_html function comes from "simplehtmldom" (Google it because I cannot post URLs).
     
    Last edited: Sep 11, 2012
  2. gimme4free

    gimme4free Executive VIP Jr. VIP Premium Member

    Joined:
    Oct 22, 2008
    Messages:
    1,879
    Likes Received:
    1,931
    You can use cURL:

    PHP:
    <?php
    $curl_defaults = array(
        CURLOPT_HEADER => 0,
        CURLOPT_FOLLOWLOCATION => 1,
        CURLOPT_RETURNTRANSFER => 1,
    );

    function Return_Content_From_URL($url, $accountid, $proxy, $port, $loginpassw, $proxytype, $referrer) {
        global $curl_defaults;
        $ch = curl_init();
        curl_setopt_array($ch, $curl_defaults);
        curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)");
        curl_setopt($ch, CURLOPT_URL, $url);
        if ($referrer != 0) { curl_setopt($ch, CURLOPT_REFERER, $referrer); }
        curl_setopt($ch, CURLOPT_PROXYPORT, $port);
        if ($proxytype == "CURLPROXY_SOCKS5") {
            curl_setopt($ch, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5);
        } else {
            curl_setopt($ch, CURLOPT_PROXYTYPE, "HTTP");
        }
        curl_setopt($ch, CURLOPT_PROXY, $proxy);
        if ($loginpassw != "0:0") {
            curl_setopt($ch, CURLOPT_PROXYUSERPWD, $loginpassw);
        }
        $html = curl_exec($ch);
        curl_close($ch);
        return $html;
    }
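    A hypothetical call, for illustration (all values are placeholders; $accountid isn't used inside the function body):
    PHP:
    <?php
    // Hypothetical call with placeholder values.
    $html = Return_Content_From_URL(
        "http://example.com/page.html", // $url
        0,                              // $accountid (not used in the function body)
        "127.0.0.1",                    // $proxy (placeholder)
        8080,                           // $port
        "0:0",                          // $loginpassw ("0:0" = no proxy auth)
        "HTTP",                         // $proxytype (anything except "CURLPROXY_SOCKS5" means HTTP)
        "http://www.google.com/"        // $referrer (0 = don't set a referrer)
    );
    ?>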
     
    • Thanks x 1
  3. sockpuppet

    sockpuppet Junior Member

    Joined:
    Nov 7, 2011
    Messages:
    155
    Likes Received:
    145
    Use the cURL code from gimme4free, and str_get_html instead of file_get_html.
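    Something like this (a sketch; assumes the cURL function above and simplehtmldom are loaded, all values are placeholders):
    PHP:
    <?php
    // Fetch through cURL (which sends the fake referrer), then parse the
    // returned HTML string with simplehtmldom's str_get_html().
    $raw  = Return_Content_From_URL("http://example.com/page.html", 0,
                                    "127.0.0.1", 8080, "0:0", "HTTP",
                                    "http://www.google.com/");
    $html = str_get_html($raw);
    ?>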
     
    • Thanks x 1
  4. OsiriXbe

    OsiriXbe Newbie

    Joined:
    Jul 29, 2010
    Messages:
    28
    Likes Received:
    7
    Thanks for the help, I really appreciate it!

    I integrated both of your suggestions. I simplified the script from gimme4free into this:
    PHP:

    $curl_defaults = array(
        CURLOPT_HEADER => 0,
        CURLOPT_FOLLOWLOCATION => 1,
        CURLOPT_RETURNTRANSFER => 1,
    );

    function Return_Content_From_URL($url, $referrer) {
        global $curl_defaults;
        $ch = curl_init();
        curl_setopt_array($ch, $curl_defaults);
        curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)");
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_REFERER, $referrer);
        $html = curl_exec($ch);
        curl_close($ch);
        return $html;
    }
    I don't think I broke anything by removing a few unneeded parts.

    So I implemented this function and used str_get_html instead of file_get_html.
    I got the script working again, but the website I'm scraping is still blocking me after a few pageviews.
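    Roughly how it is wired in (the referrer here is a placeholder):
    PHP:
    $url_list = "urls.txt";
    $referrer = "http://www.google.com/"; // placeholder referrer

    foreach (file($url_list) as $url) {
        // trim() strips the newline that file() keeps on each line.
        $raw  = Return_Content_From_URL(trim($url), $referrer);
        $html = str_get_html($raw);

        // Here all the scraping stuff.
    }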

    So I tried to display the headers to check whether it is actually sending the right referrer.
    I did this by adding:
    PHP:
        $headers = apache_request_headers();
        foreach ($headers as $header => $value) {
            echo "$header: $value <br />\n";
        }
    to the bottom of gimme4free's script, but I think that just shows the headers of my own incoming request, not what cURL sends.
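    One way to inspect what cURL itself sends is its CURLINFO_HEADER_OUT option (a sketch, added inside the function around curl_exec()):
    PHP:
    // Ask cURL to record the request headers it sends, then read them back
    // after the request. This must be set before curl_exec() runs.
    curl_setopt($ch, CURLINFO_HEADER_OUT, true);
    $html = curl_exec($ch);
    $sent = curl_getinfo($ch, CURLINFO_HEADER_OUT); // raw request line + headers, incl. Referer
    echo "<pre>" . htmlspecialchars($sent) . "</pre>";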

    So does anyone know how to see if the website I'm scraping is actually getting the right referrer?

    Thanks
     
  5. gimme4free

    gimme4free Executive VIP Jr. VIP Premium Member

    Joined:
    Oct 22, 2008
    Messages:
    1,879
    Likes Received:
    1,931
    Set up a PHP file on your site to connect to:
    <?php
    $referer = $_SERVER['HTTP_REFERER'];
    echo $referer;
    ?>
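    The idea being: host that file somewhere you control, fetch it through the same cURL function, and whatever comes back is the referrer the server actually received (the URL below is a placeholder, using the simplified two-argument version of the function from the previous post):
    PHP:
    <?php
    // Placeholder URL pointing at the test file above.
    $seen = Return_Content_From_URL("http://yoursite.com/referer_test.php",
                                    "http://www.google.com/");
    echo $seen; // prints the referrer the test server received
    ?>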
     
  6. OsiriXbe

    OsiriXbe Newbie

    Joined:
    Jul 29, 2010
    Messages:
    28
    Likes Received:
    7
    I'm not really sure what you mean by that last post, but it looks like the script is working now and the scraped website is receiving the right referrer. I have no idea why it didn't work a few hours ago :)

    Anyway, thanks a lot!