[GET] guide to web scraping PHP

Discussion in 'PHP & Perl' started by Mutikasa, Jan 28, 2012.

  1. Mutikasa

    Contents
    Foreword
    Chapter 1 - Introduction
        Intended Audience
        How to Read This Book
        Web Scraping Defined
        Applications of Web Scraping
        Appropriate Use of Web Scraping
        Legality of Web Scraping
        Topics Covered
    Chapter 2 - HTTP
        Requests
        GET Requests
        Anatomy of a URL
        Query Strings
        POST Requests
        HEAD Requests
        Responses
        Headers
        Cookies
        Redirection
        Referring URLs
        Persistent Connections
        Content Caching
        User Agents
        Ranges
        Basic HTTP Authentication
        Digest HTTP Authentication
        Wrap-Up
    Chapter 3 - HTTP Streams Wrapper
        Simple Request and Response Handling
        Stream Contexts and POST Requests
        Error Handling
        HTTP Authentication
        A Few More Options
        Wrap-Up
    Chapter 4 - cURL Extension
        Simple Request and Response Handling
        Contrasting GET and POST
        Setting Multiple Options
        Handling Headers
        Debugging
        Cookies
        HTTP Authentication
        Redirection
        Referers
        Content Caching
        User Agents
        Byte Ranges
        DNS Caching
        Timeouts
        Request Pooling
        Wrap-Up
    Chapter 5 - pecl_http PECL Extension
        GET Requests
        POST Requests
        Handling Headers
        Debugging
        Timeouts
        Content Encoding
        Cookies
        HTTP Authentication
        Redirection and Referers
        Content Caching
        User Agents
        Byte Ranges
        Request Pooling
        Wrap-Up
    Chapter 6 - PEAR::HTTP_Client
        Requests and Responses
        Juggling Data
        Wrangling Headers
        Using the Client
        Observing Requests
        Wrap-Up
    Chapter 7 - Zend_Http_Client
        Basic Requests
        Responses
        URL Handling
        Custom Headers
        Configuration
        Connectivity
        Debugging
        Cookies
        Redirection
        User Agents
        HTTP Authentication
        Wrap-Up
    Chapter 8 - Rolling Your Own
        Sending Requests
        Parsing Responses
        Transfer Encoding
        Content Encoding
        Timing
    Chapter 9 - Tidy Extension
        Validation
        Tidy
        Input
        Configuration
        Options
        Debugging
        Output
        Wrap-Up
    Chapter 10 - DOM Extension
        Types of Parsers
        Loading Documents
        Tree Terminology
        Elements and Attributes
        Locating Nodes
        XPath and DOMXPath
        Absolute Addressing
        Relative Addressing
        Addressing Attributes
        Unions
        Conditions
        Resources
    Chapter 11 - SimpleXML Extension
        Loading a Document
        Accessing Elements
        Accessing Attributes
        Comparing Nodes
        DOM Interoperability
        XPath
        Wrap-Up
    Chapter 12 - XMLReader Extension
        Loading a Document
        Iteration
        Nodes
        Elements and Attributes
        DOM Interoperation
        Closing Documents
        Wrap-Up
    Chapter 13 - CSS Selector Libraries
        Reason to Use Them
        Basics
        Hierarchical Selectors
        Basic Filters
        Content Filters
        Attribute Filters
        Child Filters
        Form Filters
        Libraries
        PHP Simple HTML DOM Parser
        Zend_Dom_Query
        phpQuery
        DOMQuery
        Wrap-Up
    Chapter 15 - Tips and Tricks
        Batch Jobs
        Availability
        Parallel Processing
        Crawlers
        Forms
        Web Services
        Testing
        That's All Folks
    Appendix A - Legality of Web Scraping
    Appendix B - Multiprocessing


    The table of contents is a little messed up, but you get the point.

    Code:
    http://www.sendspace.com/file/v2ig42
    Code:
    https://www.virustotal.com/file/c9404ac984d45c18f3dbed72d38c1a0beb18e7080fa494b3a297a296cfc29461/analysis/1353701589/
    Detection ratio: 0 / 43
     
    Last edited by a moderator: Nov 23, 2012
  2. randomnumbers

    I might find some tips in this one. I think my methods are pretty solid, but there is always room for improvement.
     
  3. RidiculeMe

    Where did you find this? Though I think I'd prefer downloading an already-compiled program, LOL.
     
  4. andee

    You do know what PHP is, right?
     
  5. paincake

    [image]
     
  6. marconi

    Excellent share! After a quick scan through it, I can say it offers some great, detailed info on XPath, using cURL with proxies, and setting the referer. This info is hard to find on the PHP manual site.
    Thanks Given :D
     
  7. blackhat655

    What would be the use for this script?
     
  8. abhi007

    Hehe, but what has that got to do with this thread?
     
  9. n30nGUY

    You can use this to program scripts to grab data from other websites. Perhaps you want to pinch the information from a popular online directory or scrape Google for link prospects.

    The trick is to appear as legit as possible. Otherwise you'll get banned. This is why a lot of scripts like this will require proxies.
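
    To make "appear as legit as possible" a bit more concrete, here is a rough sketch of sending browser-like headers through a proxy. It is written in Python rather than PHP, purely as an illustration. The proxy address, the header values and the target URL are all made-up placeholders, not anything taken from the book. Nothing is sent over the network unless you uncomment the last lines.

```python
import urllib.request

# Assumption: a local proxy listening on port 8080 (placeholder address).
PROXY = "http://127.0.0.1:8080"

# Assumption: example browser-like headers; copy real ones from your own browser.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; rv:17.0) Gecko/20100101 Firefox/17.0",
    "Referer": "http://example.com/",
    "Accept-Language": "en-US,en;q=0.8",
}


def build_request(url, headers=HEADERS):
    """Create a GET request that carries browser-like headers."""
    return urllib.request.Request(url, headers=headers)


def build_opener(proxy=PROXY):
    """Create an opener that routes HTTP and HTTPS traffic through the proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)


# Placeholder URL standing in for a page of an online directory.
request = build_request("http://example.com/directory?page=1")
opener = build_opener()

# Uncomment to actually send the request through the proxy:
# with opener.open(request, timeout=10) as response:
#     html = response.read()
```

    With PHP's cURL extension, the same three knobs would be set through curl_setopt() using CURLOPT_USERAGENT, CURLOPT_REFERER and CURLOPT_PROXY.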
     
  10. sockpuppet

    Maybe he wants a HipHop'd version.
     
  11. puneetas3

    Thanks for this. I have been using PHP to scrape URLs from Google for different footprints (I don't have ScrapeBox), scrape Twitter, and download/upload YouTube videos on a server. I now want to port the YouTube view bot to PHP. I think this book would be a great help for quickly looking up cURL examples.
     
  12. wakekitty

    Can you please re-upload? The file has already been deleted.
     
  13. meannn

    Welcome to 2007. Scraping others' content won't work unless you are exceptionally creative.
     
  14. akulali

    The file has already been deleted; please re-upload.
     
  15. jazzc

    jazzc (Moderator, Staff Member)
    The link is dead, so the thread is closed temporarily. If anyone has a mirror, PM me and I'll update the post and re-open the thread.
     
  16. jazzc

    jazzc (Moderator, Staff Member)
    The OP has provided a new link; the thread has been reopened.
     
  17. Mutikasa

    but forgot the link
    Code:
    http://www.sendspace.com/file/v2ig42
     
  18. ampedsoftware

    Any good web scraping book could use a section on Wireshark. The only way to truly see how legit you look is to analyze the network traffic yourself and compare it to legitimate traffic. A section on JavaScript emulation is also nearly essential these days for some sites, especially if you're sending data back to servers.
     
  19. alias_unknown

    You will usually find that each web scraping job is different, so an all-in-one web scraping program is not possible, though you can reuse code from previous scraping projects.
     
  20. hawkweb

    I'd say the best thing this guide could use is video tutorials and samples.