
chimse242 Proxy Scraper in Python 3

Discussion in 'Programming' started by kuhis, Feb 16, 2017.

  1. kuhis

    kuhis Newbie

    Joined:
    Jan 9, 2017
    Messages:
    9
    Likes Received:
    2
    Hi there, I am a new programmer and a new member here.
    I'd like to share a scraper that scrapes a thread where free SOCKS4/5 proxies are posted. The thread is updated daily.
    I hope you like it. Here is the code.
    Code:
    #!/usr/bin/python3
    
    # thread link: https://www.blackhatworld.com/seo/free-socks4-5-update-daily.887767/
    
    class chimse242():
        def __init__(self, debug=False):
            self.debug = debug
            self.s = requests.Session()
            self.baseurl = "https://www.blackhatworld.com/"
    
    
        def proxy_scrape(self, html_data, start_id=""):
            soup = BeautifulSoup(html_data, 'html.parser')
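            # soup.find() returns None when the list is missing, so check it before calling find_all()
            msg_list = soup.find('ol', class_="messageList")
            if msg_list is None:
                raise ValueError("Failed to scrape the data: no message list found")
            msgs = msg_list.find_all("li", class_="message")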
           
            len_msgs = len(msgs)
            if self.debug:
                print("[+] Found "+str(len_msgs)+" messages")
    
            for msg_num in range(len_msgs-1,-1,-1):
                msg = msgs[msg_num]
                if msg['data-author'].strip() != "chimse242":
                    continue
                id = msg['id']
                msg_time = msg.div.find("div", class_="privateControls").a.find('abbr').string
               
                prmlink = msg.div.find("div", class_="publicControls").a.string
                content = msg.find("div", class_="messageInfo primaryContent").find('div', class_="messageContent")
                content = content.find('article').find('blockquote').find('div', class_="bbCodeBlock")
                content = content.find('pre').string
                return (id, msg_time, content)
    
               
        def page_nav_scrape(self, html_data):
            soup = BeautifulSoup(html_data, 'html.parser')
            pages = soup.find("div", class_="PageNav")
            span = pages.find('span', class_="pageNavHeader")
            data = span.string.strip()
            if self.debug:
                print("[+] Found "+ data.split(" ")[-1] +" pages")
            next_url = pages.find('nav').find_all('a')[-2]['href'].strip()
            return next_url
    
        def http_reg(self, url):
            if self.debug:
                print("[+] HTTP Req on: "+ url)
            req = self.s.get(url)
    
            if req.status_code != 200:
                if self.debug:
                    print("Fail on http req")
                # status_code is an int, so convert it before concatenating
                print("[ERR] Response code: " + str(req.status_code))
                return
    
            return req.text
    
    def main():
        scraper = chimse242()
        url_1 = "https://www.blackhatworld.com/seo/free-socks4-5-update-daily.887767/"
       
        data_1 = scraper.http_reg(url_1)
       
        url_2 = scraper.baseurl
        url_2 += scraper.page_nav_scrape(data_1)
       
        data_2 = scraper.http_reg(url_2)
        result = scraper.proxy_scrape(data_2)
    
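        # proxy_scrape returns None when no post by chimse242 is on the page
        if not result:
            print("[ERR] No post by chimse242 found")
            return

        # result is a (message id, post time, proxy list) tuple
        msg_id, msg_time, content = result
        print("ID  : "+ msg_id.strip())
        print("Time: "+ msg_time.strip())
        print(content.strip())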
    
    
    if __name__ == '__main__':
        main()
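
A side note on how main() builds url_2: it concatenates baseurl and the href taken from the page navigation. That works as long as the href is relative and has no leading slash. A minimal sketch of a more forgiving way to join them, using urljoin from the standard library. The helper name build_page_url and the example href are hypothetical and only here for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.blackhatworld.com/"


def build_page_url(href, base=BASE_URL):
    """Join a link taken from the page navigation onto the site base URL.

    urljoin copes with relative hrefs, hrefs that start with a slash,
    and hrefs that are already absolute.
    """
    return urljoin(base, href.strip())


if __name__ == "__main__":
    # Hypothetical href, shaped like the thread URL used in the script
    print(build_page_url("seo/free-socks4-5-update-daily.887767/page-2"))
```

In the script this would replace the two lines that build url_2, if you want to go that way.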
    Note: you need the requests and BeautifulSoup4 modules.
    Disclaimer: this is free and made for educational purposes only. The writer doesn't intend to cause any harm. Enjoy.
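
The script prints the scraped block as raw text. If you want the entries as data, something like the sketch below might help. It assumes each proxy sits on its own line as ip:port, which is a guess about the post format and not something the script checks. The name parse_proxies and the sample addresses are made up for the example.

```python
import re

# Assumed line format: "ip:port", for example "203.0.113.7:1080"
PROXY_RE = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3}):(\d{1,5})$")


def parse_proxies(content):
    """Return a list of (ip, port) tuples found in a block of text."""
    proxies = []
    for line in content.splitlines():
        match = PROXY_RE.match(line.strip())
        if match is None:
            continue  # skip blank lines and anything that is not ip:port
        ip, port = match.group(1), int(match.group(2))
        # Drop entries with an out-of-range octet or port
        octets_ok = all(0 <= int(octet) <= 255 for octet in ip.split("."))
        if octets_ok and 0 < port < 65536:
            proxies.append((ip, port))
    return proxies


if __name__ == "__main__":
    sample = "203.0.113.7:1080\n\nnot a proxy\n198.51.100.20:4145\n"
    for ip, port in parse_proxies(sample):
        print(ip, port)
```

You could pass the third item returned by proxy_scrape to this function instead of printing it directly.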
     
    • Thanks x 2
  2. patrick007

    patrick007 Registered Member

    Joined:
    Jan 7, 2017
    Messages:
    67
    Likes Received:
    14
    Gender:
    Male
    Location:
    Brazil
    Nice python scraper, but the source code is missing these lines at the beginning of the file
    Code:
    from bs4 import BeautifulSoup
    import requests
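
For anyone who gets an ImportError on those two lines: the modules are installed with pip under the package names requests and beautifulsoup4. Below is a small sketch that checks whether they are available before running the scraper. The function name missing_modules is just an illustration, not part of the original script.

```python
import importlib.util


def missing_modules(names):
    """Return the import names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # "bs4" is the import name of the beautifulsoup4 package
    missing = missing_modules(["requests", "bs4"])
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All required modules are installed")
```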
     
    • Thanks x 3
    Last edited: Feb 16, 2017
  3. gman777

    gman777 Jr. VIP

    Joined:
    Apr 7, 2016
    Messages:
    645
    Likes Received:
    496
    @patrick007 Lol, your profile pic goes so well with your post.
     
    • Thanks x 1
  4. kuhis

    kuhis Newbie

    Joined:
    Jan 9, 2017
    Messages:
    9
    Likes Received:
    2
    @patrick007: ah, that's right.
    Thanks for the correction.
     
  5. Turkhero

    Turkhero Newbie

    Joined:
    Nov 30, 2016
    Messages:
    21
    Likes Received:
    3
    Gender:
    Male
    Nice
     
  6. kuhis

    kuhis Newbie

    Joined:
    Jan 9, 2017
    Messages:
    9
    Likes Received:
    2
    Thank you
    Make sure you add the 2 lines pointed out by @patrick007.
     
  7. Tier.Net

    Tier.Net Newbie

    Joined:
    Feb 22, 2017
    Messages:
    16
    Likes Received:
    1
    Occupation:
    Web Hosting Company
    That's a nice script, well done.
     
  8. kuhis

    kuhis Newbie

    Joined:
    Jan 9, 2017
    Messages:
    9
    Likes Received:
    2
    Thank you @Tier.Net