
Fake referrer in scraping script?

Discussion in 'PHP & Perl' started by OsiriXbe, Sep 11, 2012.

  1. OsiriXbe

    OsiriXbe Newbie

    Joined:
    Jul 29, 2010
    Messages:
    28
    Likes Received:
    7
    Hello

    I have programmed a script to scrape some data from a huge website.
    It's something like this:
    PHP:
    <?php

    $url_list = "urls.txt";

    foreach (file($url_list) as $url) {
        $html = file_get_html($url);

        // Here all the scraping stuff.

    }
    ?>
    It worked great, but the website owner found out I was scraping it and he has built in some security.
    When the script visits a page of their website, the referrer is blank because PHP is requesting it directly. When they get too many blank referrers from the same IP, they block the IP.

    So now I need to find a way to fake the referrer with my PHP script. I've already tried it with some Chrome addons. That works when I visit the site by hand, but not when I let the PHP script do its job.

    I think I have to fake the referrer just before this line:
    $html = file_get_html($url);
    because that's the line where the PHP script actually goes to the website to gather the content.
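    Roughly what I'm after, sketched with core PHP stream contexts (the URLs below are placeholders, and file_get_contents() stands in for the fetch that file_get_html() does):
    PHP:
    <?php
    // Sketch only: attach a Referer header to a plain PHP HTTP request
    // via a stream context. Both URLs are placeholder values.
    $url           = "http://example.com/some-page.html";
    $fake_referrer = "http://www.google.com/";

    $context = stream_context_create(array(
        'http' => array(
            'header' => "Referer: $fake_referrer\r\n",
        ),
    ));

    // file_get_contents() stands in here for the fetch file_get_html() performs.
    $raw_html = file_get_contents($url, false, $context);
    ?>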


    I hope someone can help with this.

    Edit: Forgot to mention, the file_get_html function comes from "simplehtmldom" (Google it because I cannot post URLs).
     
    Last edited: Sep 11, 2012
  2. gimme4free

    gimme4free Executive VIP Jr. VIP Premium Member

    Joined:
    Oct 22, 2008
    Messages:
    1,879
    Likes Received:
    1,931
    You can use cURL:

    PHP:
    <?php
    $curl_defaults = array(
        CURLOPT_HEADER => 0,
        CURLOPT_FOLLOWLOCATION => 1,
        CURLOPT_RETURNTRANSFER => 1,
    );

    function Return_Content_From_URL($url, $accountid, $proxy, $port, $loginpassw, $proxytype, $referrer) {
        global $curl_defaults;
        $ch = curl_init();
        curl_setopt_array($ch, $curl_defaults);
        curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)");
        curl_setopt($ch, CURLOPT_URL, $url);
        if ($referrer != 0) { curl_setopt($ch, CURLOPT_REFERER, $referrer); }
        curl_setopt($ch, CURLOPT_PROXYPORT, $port);
        if ($proxytype == "CURLPROXY_SOCKS5") {
            curl_setopt($ch, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5);
        } else {
            curl_setopt($ch, CURLOPT_PROXYTYPE, "HTTP");
        }
        curl_setopt($ch, CURLOPT_PROXY, $proxy);
        if ($loginpassw != "0:0") {
            curl_setopt($ch, CURLOPT_PROXYUSERPWD, $loginpassw);
        }
        $html = curl_exec($ch);
        curl_close($ch);
        return $html;
    }
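    A hypothetical call, for illustration (all values are placeholders; $accountid isn't used inside the function body):
    PHP:
    <?php
    // Hypothetical call with placeholder values.
    $html = Return_Content_From_URL(
        "http://example.com/page.html", // $url
        0,                              // $accountid (not used in the function body)
        "127.0.0.1",                    // $proxy (placeholder)
        8080,                           // $port
        "0:0",                          // $loginpassw ("0:0" = no proxy auth)
        "HTTP",                         // $proxytype (anything except "CURLPROXY_SOCKS5" means HTTP)
        "http://www.google.com/"        // $referrer (0 = don't set a referrer)
    );
    ?>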
     
    • Thanks x 1
  3. sockpuppet

    sockpuppet Junior Member

    Joined:
    Nov 7, 2011
    Messages:
    155
    Likes Received:
    145
    Use the cURL code from gimme4free, and str_get_html instead of file_get_html.
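    Something like this (a sketch; assumes the cURL function above and simplehtmldom are loaded, all values are placeholders):
    PHP:
    <?php
    // Fetch through cURL (which sends the fake referrer), then parse the
    // returned HTML string with simplehtmldom's str_get_html().
    $raw  = Return_Content_From_URL("http://example.com/page.html", 0,
                                    "127.0.0.1", 8080, "0:0", "HTTP",
                                    "http://www.google.com/");
    $html = str_get_html($raw);
    ?>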
     
    • Thanks x 1
  4. OsiriXbe

    OsiriXbe Newbie

    Joined:
    Jul 29, 2010
    Messages:
    28
    Likes Received:
    7
    Thanks for the help, I really appreciate it!

    I integrated both of your suggestions. I simplified the script from gimme4free into this:
    PHP:

    $curl_defaults = array(
        CURLOPT_HEADER => 0,
        CURLOPT_FOLLOWLOCATION => 1,
        CURLOPT_RETURNTRANSFER => 1,
    );

    function Return_Content_From_URL($url, $referrer) {
        global $curl_defaults;
        $ch = curl_init();
        curl_setopt_array($ch, $curl_defaults);
        curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)");
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_REFERER, $referrer);
        $html = curl_exec($ch);
        curl_close($ch);
        return $html;
    }
    I don't think I broke anything by removing a few unneeded parts.

    So I implemented this function and used str_get_html instead of file_get_html.
    I got the script working again, but the website I'm scraping is still blocking me after a few pageviews.
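    Roughly how it is wired in (the referrer here is a placeholder):
    PHP:
    $url_list = "urls.txt";
    $referrer = "http://www.google.com/"; // placeholder referrer

    foreach (file($url_list) as $url) {
        // trim() strips the newline that file() keeps on each line.
        $raw  = Return_Content_From_URL(trim($url), $referrer);
        $html = str_get_html($raw);

        // Here all the scraping stuff.
    }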

    So I tried to display the headers to check whether it is actually sending the right referrer.
    I did this by adding:
    PHP:
        $headers = apache_request_headers();
        foreach ($headers as $header => $value) {
            echo "$header: $value <br />\n";
        }
    to the bottom of gimme4free's script, but I think that just shows the headers of my own incoming request, not what cURL sends.
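    One way to inspect what cURL itself sends is its CURLINFO_HEADER_OUT option (a sketch, added inside the function around curl_exec()):
    PHP:
    // Ask cURL to record the request headers it sends, then read them back
    // after the request. This must be set before curl_exec() runs.
    curl_setopt($ch, CURLINFO_HEADER_OUT, true);
    $html = curl_exec($ch);
    $sent = curl_getinfo($ch, CURLINFO_HEADER_OUT); // raw request line + headers, incl. Referer
    echo "<pre>" . htmlspecialchars($sent) . "</pre>";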

    So does anyone know how to see if the website I'm scraping is actually getting the right referrer?

    Thanks
     
  5. gimme4free

    gimme4free Executive VIP Jr. VIP Premium Member

    Joined:
    Oct 22, 2008
    Messages:
    1,879
    Likes Received:
    1,931
    Set up a PHP file on your site to connect to:
    <?php
    $referer = $_SERVER['HTTP_REFERER'];
    echo $referer;
    ?>
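    The idea being: host that file somewhere you control, fetch it through the same cURL function, and whatever comes back is the referrer the server actually received (the URL below is a placeholder, using the simplified two-argument version of the function from the previous post):
    PHP:
    <?php
    // Placeholder URL pointing at the test file above.
    $seen = Return_Content_From_URL("http://yoursite.com/referer_test.php",
                                    "http://www.google.com/");
    echo $seen; // prints the referrer the test server received
    ?>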
     
  6. OsiriXbe

    OsiriXbe Newbie

    Joined:
    Jul 29, 2010
    Messages:
    28
    Likes Received:
    7
    I'm not really sure what you mean by that last post, but it looks like the script is working now and the scraped website is receiving the right referrer. I have no idea why it didn't work a few hours ago :)

    Anyway, thanks a lot!