Create Simple Python Scraper Bot with Scrapy

Discussion in 'Programming' started by GoDesain, Jun 10, 2017.

  1. GoDesain

    GoDesain Regular Member

    Hi all, I just want to share my experience with other BHW members who want to learn Python programming.
    Recently I got an interesting question: "What should I use for botting with Python?"
    So here is my simple answer, with a sample.

    Which module should I use with Python?
    Several modules may seem to do the same job, but in a real situation they behave differently.
    As a sample, I use the Udemy API to grab free courses.
    Code:
    https://www.udemy.com/api-2.0/channels/1640/courses?is_angular_app=true&price=price-free&sort=newest
    As you can see, the result is JSON. So my question is: are you sure you would use Selenium to scrape that JSON data? How about using Scrapy?

    Installing Scrapy!
    If you already have Python on your system, just use the command:
    Code:
    pip install scrapy
    More details about Scrapy: https://scrapy.org

    Try to code!
    Just take a look at this code:
    Code:
    https://pastebin.com/4Hmt6Dmh
    You can see that start_urls holds a list of URLs. It's ugly code, one URL for each category on Udemy, and it means the bot will start from those URLs.
    The next part is custom_settings, where I put a fake user agent to avoid detection, but this is only optional.
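    One way to avoid hand-writing a long start_urls list is to build it from a list of channel IDs. This is only a sketch: the URL pattern and the ID 1640 come from the sample URL above, while the second ID and the names API, PARAMS and CHANNEL_IDS are made-up placeholders.

```python
from urllib.parse import urlencode

# URL pattern taken from the sample API URL in the post.
API = "https://www.udemy.com/api-2.0/channels/{channel_id}/courses"
PARAMS = {"is_angular_app": "true", "price": "price-free", "sort": "newest"}

# 1640 comes from the post; 9999 is a hypothetical placeholder ID.
CHANNEL_IDS = [1640, 9999]

# One listing URL per channel, each with the same query string.
start_urls = [
    API.format(channel_id=cid) + "?" + urlencode(PARAMS) for cid in CHANNEL_IDS
]

print(start_urls[0])
```

    The first entry should match the sample URL shown earlier, so the list can be assigned to the spider's start_urls directly.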
    Code:
    data = json.loads(response.body)
    If you ask "hey data, how are you?", it answers: "I'm the response body from the Udemy JSON URL, parsed into a Python object."
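    To see what json.loads gives back without running a spider, here is a small sketch. The body value is an invented stand-in for response.body (which Scrapy exposes as bytes); its shape is assumed from the 'results' and 'next' fields described in this post.

```python
import json

# Invented stand-in for response.body; Scrapy provides the body as bytes.
body = b'{"results": [{"url": "/course-a/"}], "next": null}'

# json.loads accepts bytes directly on Python 3.6 and later.
data = json.loads(body)

print(type(data).__name__)        # dict
print(data["next"])               # None (JSON null becomes Python None)
print(data["results"][0]["url"])  # /course-a/
```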
    Code:
    for item in data.get('results', []):
    And what is this? Take a look at the Udemy JSON and find the 'results' part. This means you only scrape the 'results' part.
    Code:
    url = 'https://www.udemy.com' + item.get('url')
    And why 'url'? As you can see, 'url' in the JSON 'results' part only holds the path (slug) without the main domain, so I must manually combine the main domain with the value from the JSON response.
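    The loop and the URL building can be tried on their own, as in the sketch below. It swaps the plain string concatenation for urllib.parse.urljoin and skips entries that have no 'url' field; both are choices made for this example, not something the original script does. The helper name course_urls and the sample data are hypothetical.

```python
from urllib.parse import urljoin

BASE = "https://www.udemy.com"


def course_urls(data):
    """Return full course URLs from a parsed API response (hypothetical helper)."""
    urls = []
    for item in data.get("results", []):  # empty list if 'results' is missing
        slug = item.get("url")
        if slug:  # skip entries without a 'url' field
            urls.append(urljoin(BASE, slug))
    return urls


# Invented sample: one entry with a slug, one without.
sample = {"results": [{"url": "/course-a/"}, {"title": "no url here"}]}
print(course_urls(sample))  # ['https://www.udemy.com/course-a/']
```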
    Code:
    with open('udemylist.txt', 'a') as f:
        f.write('{0}\n'.format(url))
    And...? This creates udemylist.txt (or appends to it if it already exists) and puts every result inside.
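    One thing to keep in mind with mode 'a': the file is never cleared, so running the bot twice stores every URL twice. A minimal sketch of that behaviour follows; it writes to a temporary directory, and save_url is a hypothetical helper name.

```python
import os
import tempfile

# Use a temporary directory so the demo does not touch a real udemylist.txt.
path = os.path.join(tempfile.mkdtemp(), "udemylist.txt")


def save_url(url, path):
    # 'a' creates the file if it is missing, otherwise appends to the end.
    with open(path, "a") as f:
        f.write("{0}\n".format(url))


save_url("https://www.udemy.com/course-a/", path)  # first run
save_url("https://www.udemy.com/course-a/", path)  # second run, same URL

with open(path) as f:
    lines = f.read().splitlines()

print(len(lines))  # 2
```

    Deleting the file before a fresh run, or filtering duplicates afterwards, avoids the repeated lines.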
    Code:
    if data['next'] is not None:
        next_page = data['next']
    The Udemy API paginates with 'next' only: it holds the full URL of the next page, or none on the last page. So if 'next' in the JSON is not none, Scrapy opens the next page with this code:
    Code:
    yield scrapy.Request(next_page)
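    The follow-the-'next'-link idea can be illustrated in plain Python, without Scrapy or network access. In this sketch FAKE_PAGES is an invented two-page API and crawl is a hypothetical helper. Scrapy itself does not loop like this; it schedules each yielded request and calls the parse method again, but the stop condition is the same.

```python
# Invented two-page API keyed by URL. The shape follows the post's description:
# a 'results' list plus a 'next' field that is a full URL or None.
FAKE_PAGES = {
    "https://example.com/api?page=1": {
        "results": [{"url": "/course-a/"}],
        "next": "https://example.com/api?page=2",
    },
    "https://example.com/api?page=2": {
        "results": [{"url": "/course-b/"}],
        "next": None,
    },
}


def crawl(start_url, fetch):
    """Collect slugs page by page until 'next' is None."""
    slugs = []
    page_url = start_url
    while page_url is not None:
        data = fetch(page_url)
        slugs.extend(item["url"] for item in data.get("results", []))
        page_url = data.get("next")
    return slugs


print(crawl("https://example.com/api?page=1", FAKE_PAGES.__getitem__))
# ['/course-a/', '/course-b/']
```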
    Test drive...
    The final part is to test how fast this script scrapes the data. From the CLI:
    Code:
    scrapy runspider UdemyFree.py
    and compare that with using Selenium or mechanize.

    Note:
    • For admins: if I made this thread in the wrong section, you can delete or move it. Just a suggestion: a new sub-forum about Python would be nice.
    • For other members: let's discuss together. I'm not an expert in Python, I just want to share.. :oops:
    • I'm not good at English or at making good threads.
     
    Last edited: Jun 10, 2017
  2. thetrustedzone

    thetrustedzone Jr. VIP

    Python is great for scraping. I advise anyone looking to learn programming to start with Python, but it's better to learn HTML first. HTML isn't actually programming, but it's very helpful for understanding the basics of writing syntax and for getting past the psychological barriers when you start learning a programming language like Python...