Scrape the Web: Strategies for programming websites that don't expect it

Discussion in 'Black Hat SEO' started by MrBlue, Apr 12, 2011.

  1. MrBlue

    MrBlue Senior Member

    Joined:
    Dec 18, 2009
    Messages:
    950
    Likes Received:
    662
    Occupation:
    Web/Bot Developer
    An interesting lecture on scraping/browser automation methods and strategies using Python.

    * Choosing a parser: BeautifulSoup, lxml, HTMLParser, and html5lib.
    * Extracting information, even in the face of bad HTML: Regular expressions, BeautifulSoup, SAX, and XPath.
    * Automatic template reverse-engineering tools.
    * Submitting to forms.
    * Playing with XML-RPC.
    * Countermeasures, and circumventing them:
        o IP address limits
        o Hidden form fields
        o User-agent detection
        o JavaScript
        o CAPTCHAs
    * Plenty of full source code to working examples:
        o Submitting to forms for text-to-speech.
        o Downloading music from web stores.
        o Automating Firefox with Selenium RC to navigate a pure-JavaScript service.
    * Q&A and workshopping
    * Use your power for good, not evil.
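
For a feel of the "submitting to forms", "hidden form fields" and "user-agent detection" items, here is a rough sketch using only the Python 3 standard library. It is not code from the lecture. The sample HTML, the example.com URL, the user-agent string, and the names FormFieldParser and build_form_request are all made up for illustration. The idea: parse the page (even if the markup is sloppy), copy every named input including hidden ones, fill in your own values, and send the result with a browser-like User-Agent header.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import Request


class FormFieldParser(HTMLParser):
    """Collect name/value pairs from every named <input>, hidden ones included."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attrs = dict(attrs)
        name = attrs.get("name")
        if name:  # inputs without a name are never submitted by a browser
            self.fields[name] = attrs.get("value") or ""


def build_form_request(page_html, action_url, overrides, user_agent):
    """Build (but do not send) a POST that mimics a browser form submission."""
    parser = FormFieldParser()
    parser.feed(page_html)
    fields = dict(parser.fields)   # keeps hidden tokens the server expects back
    fields.update(overrides)       # our own values for the visible fields
    body = urlencode(fields).encode("utf-8")
    return Request(action_url, data=body, headers={"User-Agent": user_agent})


# Hypothetical page: unclosed <p>, unquoted attributes, one hidden token.
SAMPLE_HTML = """
<form action="/speak" method="post"><p>
<input type="hidden" name="token" value="abc123">
<input type=text name=text>
<input type="submit" value="Go">
</form>
"""

request = build_form_request(
    SAMPLE_HTML,
    "http://example.com/speak",          # assumed form action URL
    {"text": "hello world"},
    "Mozilla/5.0 (X11; Linux x86_64)",   # assumed browser-like user agent
)
print(request.get_method(), request.data)
```

To actually submit, you would pass the request to urllib.request.urlopen. For IP limits, JavaScript and CAPTCHAs you need other tools (the lecture uses Selenium RC for the JavaScript case).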

    Video:
    Code:
    http://python.mirocommunity.org/video/1616/pycon-2010-scrape-the-web-stra
     
  2. sw1344

    sw1344 Newbie

    Joined:
    Nov 14, 2010
    Messages:
    22
    Likes Received:
    11
    Mmm.. Nice..

    I'd better brush up on my Python.

    Many thanks
     
  3. wu1239

    wu1239 Newbie

    Joined:
    Jun 4, 2011
    Messages:
    16
    Likes Received:
    0
    The video is fine, and I found the guy's site:
    google: asheesh laroia pycon
    (I can't post links yet)
     
  4. Frogserv

    Frogserv Regular Member

    Joined:
    Jun 21, 2011
    Messages:
    376
    Likes Received:
    180
    Occupation:
    Entrepreneur
    Location:
    Paris, FR
    Evil is good enough :D
     
  5. wu1239

    wu1239 Newbie

    Joined:
    Jun 4, 2011
    Messages:
    16
    Likes Received:
    0
    I also found a link to the guy's slides.
    I still can't post links, so here are Google keywords again:
    google: "stats pop quitz" ext:pdf