
I need to scrape 500,000 pages

Discussion in 'Hire a Freelancer' started by valser29, Jul 30, 2016.

  1. valser29

    valser29 Registered Member

    Joined:
    Oct 15, 2015
    Messages:
    56
    Likes Received:
    11
    Location:
    Toronto
    Hello there,
    I need to scrape the Alexa top 500,000 sites and select only the ones that have checkout pages (roughly the kind of filter sketched below).
    If anyone is interested, please reply here or send me a PM with prices.
    Thanks!
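    To be clear about the filtering step, here is a rough Python sketch of what I mean (the "checkout"/"cart" keyword check and the domains.txt input are only illustrative assumptions, not a spec):

    Code:
    import requests

    # Illustrative input: one domain per line (e.g. pulled from an Alexa-style list).
    with open("domains.txt") as f:
        domains = [line.strip() for line in f if line.strip()]

    # Crude heuristic for "has a checkout page" -- adjust as needed.
    KEYWORDS = ("checkout", "cart")

    def has_checkout(domain):
        try:
            resp = requests.get("http://" + domain, timeout=10)
        except requests.RequestException:
            return False
        html = resp.text.lower()
        return any(kw in html for kw in KEYWORDS)

    selected = [d for d in domains if has_checkout(d)]
    print(len(selected), "of", len(domains), "sites look like they have a checkout page")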
     
  2. blackhat777

    blackhat777 Elite Member

    Joined:
    Jun 25, 2011
    Messages:
    1,784
    Likes Received:
    653
    I believe I can do it. Will you be providing the site lists?
    Add me on Skype.
     
    • Thanks x 1
  3. tounsi7orr

    tounsi7orr BANNED

    Joined:
    Apr 21, 2014
    Messages:
    180
    Likes Received:
    11
    Yes! I can do this very well.
    I'm a Python programmer and I use the Selenium library to make scraping bots that work perfectly. The script I will code can scrape a single page in 5 seconds.
    This is my skype: akermiy.yassine
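    To give an idea, the core of such a bot is roughly this (a minimal Python sketch, assuming the selenium package and a local chromedriver; the 5-second figure above is a claim, not something this snippet guarantees):

    Code:
    from selenium import webdriver

    # Assumes chromedriver is installed and on PATH.
    driver = webdriver.Chrome()
    try:
        driver.get("http://www.example.net/somepage.html")
        html = driver.page_source  # full rendered HTML of the page
        print(len(html), "bytes of HTML")
    finally:
        driver.quit()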
     
    • Thanks x 1
  4. MrBlue

    MrBlue Senior Member

    Joined:
    Dec 18, 2009
    Messages:
    969
    Likes Received:
    678
    Occupation:
    Web/Bot Developer
    That's not very fast at all.
     
    • Thanks x 1
  5. tounsi7orr

    tounsi7orr BANNED

    Joined:
    Apr 21, 2014
    Messages:
    180
    Likes Received:
    11
    If you run 5 copies of the script in parallel on a free AWS VPS, it works out to about 1 second per page.
    And believe me, you'll never find a script that scrapes a page in less than two seconds.
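    The arithmetic here is just parallelism: five workers at roughly 5 seconds per page gives about one page per second of overall throughput. A hedged sketch of that idea with a thread pool (the scrape_page body and the URL list are placeholders):

    Code:
    from concurrent.futures import ThreadPoolExecutor

    import requests

    def scrape_page(url):
        # Placeholder for whatever per-page work the bot actually does.
        try:
            return url, len(requests.get(url, timeout=10).text)
        except requests.RequestException:
            return url, 0

    urls = ["http://www.example.net/page%d.html" % i for i in range(100)]  # stand-in list

    # Five workers fetching concurrently: per-page latency stays the same,
    # but overall throughput is roughly five times higher.
    with ThreadPoolExecutor(max_workers=5) as pool:
        for url, size in pool.map(scrape_page, urls):
            print(url, size)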
     
    • Thanks x 1
  6. BeanDH92

    BeanDH92 Regular Member

    Joined:
    Nov 24, 2013
    Messages:
    297
    Likes Received:
    35
    Occupation:
    Web developer
    Selenium is not the best choice when it comes to scraping content.
     
    • Thanks x 2
  7. Ren

    Ren BANNED

    Joined:
    Jul 24, 2016
    Messages:
    8
    Likes Received:
    1
    Gender:
    Male
    Hey, give me your Skype please.
     
    • Thanks x 1
  8. chrisyoungsd

    chrisyoungsd Junior Member Premium Member

    Joined:
    Mar 6, 2014
    Messages:
    171
    Likes Received:
    18
    Occupation:
    Dominate Search Engines
    Location:
    San Diego
    Check out ScrapeBox. It's a great tool that you can use for all kinds of things.
     
    • Thanks x 1
  9. Omoruyiik

    Omoruyiik Regular Member

    Joined:
    Sep 15, 2015
    Messages:
    246
    Likes Received:
    61
    I am sure you'd find Paigham Bot (in the BST section) useful; they even have videos here.
     
    • Thanks x 1
  10. redarrow

    redarrow Elite Member

    Joined:
    Apr 1, 2013
    Messages:
    4,267
    Likes Received:
    974
    Info: do it yourself, it's very easy.

    http://www.the-art-of-web.com/php/parse-links/


    <?php

    // Original PHP code by Chirp Internet: www.chirp.com.au
    // Please acknowledge use of this code by including this header.

    $url = "http://www.example.net/somepage.html";
    $input = @file_get_contents($url) or die("Could not access file: $url");
    $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
    if (preg_match_all("/$regexp/siU", $input, $matches)) {
        // $matches[2] = array of link addresses
        // $matches[3] = array of link text - including HTML code
    }

    ?>


    Easy peasy

    Or

    /*
    Function to get all links on a certain URL using DOMDocument
    */
    function get_links($link)
    {
        // return array
        $ret = array();
        /*** a new dom object ***/
        $dom = new DOMDocument;
        /*** remove silly white space (must be set before loading) ***/
        $dom->preserveWhiteSpace = false;
        /*** get the HTML (suppress errors) ***/
        @$dom->loadHTML(file_get_contents($link));
        /*** get the links from the HTML ***/
        $links = $dom->getElementsByTagName('a');
        /*** loop over the links ***/
        foreach ($links as $tag) {
            $ret[$tag->getAttribute('href')] = $tag->childNodes->item(0)->nodeValue;
        }
        return $ret;
    }


    // Link to open and search for links
    $link = "http://www.php.net";

    /*** get the links ***/
    $urls = get_links($link);

    /*** check for results ***/
    if (sizeof($urls) > 0) {
        foreach ($urls as $key => $value) {
            echo $key . ' - ' . $value . '<br />';
        }
    } else {
        echo "No links found at $link";
    }
     
    • Thanks x 1
  11. tounsi7orr

    tounsi7orr BANNED

    Joined:
    Apr 21, 2014
    Messages:
    180
    Likes Received:
    11
    Here it is: akermiy.yassine
     
  12. tounsi7orr

    tounsi7orr BANNED

    Joined:
    Apr 21, 2014
    Messages:
    180
    Likes Received:
    11
    I know. Its problem is that it waits until the page fully loads before it starts scraping, and that takes some time. But I always disable images and JavaScript, and then the page loading speed becomes very fast.
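    For reference, disabling images and JavaScript in Selenium with Chrome looks roughly like this (a Python sketch; the preference keys are Chrome-specific and may change between versions, so treat them as an assumption):

    Code:
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    # In Chrome's preference scheme, a content-setting value of 2 means "block".
    options.add_experimental_option("prefs", {
        "profile.managed_default_content_settings.images": 2,
        "profile.managed_default_content_settings.javascript": 2,
    })

    driver = webdriver.Chrome(options=options)
    try:
        driver.get("http://www.example.net/")
        html = driver.page_source  # HTML only, without loading images or running scripts
    finally:
        driver.quit()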
     
  13. MrBlue

    MrBlue Senior Member

    Joined:
    Dec 18, 2009
    Messages:
    969
    Likes Received:
    678
    Occupation:
    Web/Bot Developer
    LOL. What's the point of using a headless browser if you are disabling JS?
     
  14. JustChillin

    JustChillin Jr. VIP

    Joined:
    Apr 14, 2015
    Messages:
    1,668
    Likes Received:
    1,085
    Top 500,000 or 500? Because I think Alexa doesn't show the top 500,000 sites on their website. It's limited to 500 for me.
     
  15. Pakal

    Pakal Junior Member

    Joined:
    Dec 6, 2015
    Messages:
    116
    Likes Received:
    55
    Gender:
    Male
    Location:
    http://bit.cards
    Alexa gives you free access to the top 1 million sites. The link below should help you get it directly from Alexa:

    Code:
    https://support.alexa.com/hc/en-us/articles/200461990-Can-I-get-a-list-of-top-sites-from-an-API-
    Also, a few days ago I posted a thread with a huge list of unique domains ready for download, which could help you out. The list was harvested with a reverse-IP technique and contains roughly 42 million unique domains :)

    Code:
    http://www.blackhatworld.com/seo/get-42-million-unique-domains-good-for-seo.862822/
    Hope it helps :)
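    The file behind that support article is a plain rank,domain CSV inside a zip. A rough Python sketch for pulling it down (the S3 URL below is the one Alexa has historically published and may not stay live, so treat it as an assumption):

    Code:
    import csv
    import io
    import zipfile

    import requests

    # Historical location of Alexa's free top-1M list -- verify before relying on it.
    URL = "http://s3.amazonaws.com/alexa-static/top-1m.csv.zip"

    resp = requests.get(URL, timeout=60)
    resp.raise_for_status()

    with zipfile.ZipFile(io.BytesIO(resp.content)) as zf:
        with zf.open("top-1m.csv") as f:
            reader = csv.reader(io.TextIOWrapper(f, encoding="utf-8"))
            # Each row is: rank, domain
            domains = [domain for rank, domain in reader][:500000]

    print(len(domains), "domains loaded")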
     
    • Thanks x 1
  16. turkeypockets

    turkeypockets Junior Member

    Joined:
    Apr 1, 2016
    Messages:
    111
    Likes Received:
    26
    Gender:
    Male
    Occupation:
    Digital Marketing Solutions
    Location:
    United States
    Home Page:
    I can do this. I already have a list of the top 200,000 Alexa-ranked sites. I could target checkout pages, scrape the web assets, localize all the links, and get past robots restrictions. You realize how big this data is going to be though, right? We're talking an easy 50 GB for 500,000 HTML pages, which works out to around 100 KB per page.
     
    • Thanks x 1
  17. SEMWORLD

    SEMWORLD BANNED

    Joined:
    Nov 21, 2015
    Messages:
    1,235
    Likes Received:
    218
    Hello buddy. If you are still looking for someone to do the scraping, I am available to help you out. Kindly reach me through the inbox so that we can discuss the details of the project and see how best we can help each other. Looking forward to hearing from you.
     
    • Thanks x 1
  18. tounsi7orr

    tounsi7orr BANNED

    Joined:
    Apr 21, 2014
    Messages:
    180
    Likes Received:
    11
    Because I only need the HTML.
     
  19. turkeypockets

    turkeypockets Junior Member

    Joined:
    Apr 1, 2016
    Messages:
    111
    Likes Received:
    26
    Gender:
    Male
    Occupation:
    Digital Marketing Solutions
    Location:
    United States
    Home Page:
    I can scrape HTML pages only and apply filters on page size, localize assets or not, add the necessary head info (charset/meta) for the original source, append the file extension, and so on. I can't have it done instantly and it might take a few dry runs, but all in all my scraper rips HTML assets at probably 10 pages/second. It's stupid fast.
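    To make that concrete, a rough Python sketch of the filter-and-save step described above (the size cap, the charset injection, and the file naming are illustrative assumptions, not the actual tool):

    Code:
    import os

    import requests

    MAX_BYTES = 500 * 1024  # illustrative page-size filter: skip anything over ~500 KB
    OUT_DIR = "pages"
    os.makedirs(OUT_DIR, exist_ok=True)

    def save_page(domain):
        try:
            resp = requests.get("http://" + domain, timeout=10)
        except requests.RequestException:
            return False
        html = resp.text
        if len(html.encode("utf-8")) > MAX_BYTES:
            return False  # page too large, skip it
        # Add a charset meta tag so the saved copy renders like the original source.
        if "<head>" in html and "charset" not in html.lower():
            html = html.replace("<head>", '<head><meta charset="utf-8">', 1)
        # Append the .html extension so the local dump is browsable.
        with open(os.path.join(OUT_DIR, domain + ".html"), "w", encoding="utf-8") as out:
            out.write(html)
        return True

    save_page("example.com")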
     
    • Thanks x 1
  20. valser29

    valser29 Registered Member

    Joined:
    Oct 15, 2015
    Messages:
    56
    Likes Received:
    11
    Location:
    Toronto
    Thanks everyone for the responses! I did not expect to receive so many :D
    Sorry, I am kind of busy and obviously cannot respond to everyone. I am currently negotiating with one provider.