Receiving "502 Bad Gateway" from CloudFront when scraping a website?

Discussion in 'PHP & Perl' started by sgtBerbatov, Apr 5, 2017.

  1. sgtBerbatov

    I've built a scraping script in PHP to gather information from a particular website. I have tested the script thoroughly against HTML files downloaded from the target website, so the XPath queries are correct. Yesterday I ran the script locally against the live website for the first time, and it worked. So I took the script, placed it on my server farm, and turned it on.

    This morning I awoke to 90 emails from one particular server telling me there had been an error. Another server sent about 10 emails, while the other two seem to be working away just fine. I've checked the logs I keep in the database, and every error encountered has been "502 Bad Gateway". The URL loads fine in a normal web browser. I've also tried fetching the URL with wget on the same server, and wget returns this error:

    ERROR: The certificate of `http://www.targetwebsite.com' is not trusted.
    ERROR: The certificate of `http://www.targetwebsite.com' hasn't got a known issuer.

    Using the "--no-check-certificate" flag still produces the error, but downloads the HTML file anyway.

    So anyway, in my script I have the following code:

    // Assign Curl Options
    $curlOptions = array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_HEADER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_ENCODING => "",
    CURLOPT_AUTOREFERER => true,
    CURLOPT_CONNECTTIMEOUT => 120,
    CURLOPT_TIMEOUT => 120,
    CURLOPT_MAXREDIRS => 10,
    CURLINFO_HEADER_OUT => true,
    CURLOPT_SSL_VERIFYPEER => false,
    CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1, // constant, not a quoted string
    CURLOPT_COOKIE => $cookiesJar,
    CURLOPT_USERAGENT => $userAgent,
    );

    // URL To scrape
    $url = "https://www.targetsite.com/specific/page/";


    // Build Curl Headers
    $ch = curl_init($url);
    curl_setopt_array($ch, $curlOptions);

    $content = curl_exec($ch);
    $err = curl_errno($ch);
    $errmsg = curl_error($ch);
    $header = curl_getinfo($ch);
    $responseCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    So this tends to work, but it obviously isn't 100% reliable at the moment because CloudFront is returning 502 Bad Gateway errors. At the same time, I've never written a web scraping script before, and while I thought those were all the options I needed to make the website think I'm a legitimate user, I'm obviously missing something!

    So has anyone had this issue with a website that uses CloudFront before? How can I get around it and make this more robust?
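
To make "more robust" concrete, here is a minimal sketch of one common approach: retrying with exponential backoff when the response is a 5xx. It assumes the 502s are transient, which the post does not confirm. The function name `fetchWithRetry`, its parameters, and the list of retryable codes are all hypothetical and not part of the script above. The fetch step is passed in as a callable so the retry logic can be exercised without a network connection.

```php
<?php
// Hypothetical helper (not from the original script): call $fetch until it
// returns a non-5xx status or $maxAttempts is reached.
// $fetch must return a two-element array: [int $httpCode, string $body].
function fetchWithRetry(callable $fetch, int $maxAttempts = 4, int $baseDelayMs = 500): array
{
    // Assumption: these gateway/server errors are worth retrying.
    $retryable = [500, 502, 503, 504];
    $code = 0;
    $body = '';
    $attemptsUsed = 0;

    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $attemptsUsed = $attempt;
        [$code, $body] = $fetch();

        if (!in_array($code, $retryable, true)) {
            break; // success, or an error that retrying is unlikely to fix
        }

        if ($attempt < $maxAttempts) {
            // Exponential backoff: base delay, then 2x, 4x, ... in microseconds.
            usleep($baseDelayMs * 1000 * (2 ** ($attempt - 1)));
        }
    }

    return ['code' => $code, 'body' => $body, 'attempts' => $attemptsUsed];
}
```

To use it with the script above, the `curl_init` / `curl_setopt_array` / `curl_exec` / `curl_getinfo` sequence could be wrapped in a closure that returns `[$responseCode, $content]` and passed as `$fetch`. If the 502s persist after several spaced-out attempts, the cause is probably not transient and retrying alone will not help.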