[Python] Need Multiprocessing God To Help Me (Advice, Template, Code, anything)

apex1

Regular Member
Joined
May 29, 2015
Messages
217
Reaction score
182
Multiprocessing is kicking my ass :mad::mad::mad:

I've read over 30 tutorials and watched countless videos and still can't write my own custom multi-threaded script in the way I need. I'm on my 3rd day of doing nothing but trying to figure this out lol.

Here's the issue: I have a very basic script. It does the following:
  1. Declares a list (of URLs)
  2. Creates a loop to go through all URLs
  3. Scrapes URL source with requests
  4. Saves results
Now let's say I set 5 threads; this is what it does:

get URL1 and scrape data
get URL1 and scrape data
get URL1 and scrape data
get URL1 and scrape data
get URL1 and scrape data

get URL2 and scrape data
get URL2 and scrape data
get URL2 and scrape data
get URL2 and scrape data
get URL2 and scrape data

It's not taking 1 unique URL per thread.

Apparently I need to use queues, locks, pools, managers, and a concurrent futures pool executor, whatever the f that is. Another problem: say I have an odd number of URLs and I'm running 4 threads. Won't that give me an error? Do I have to split my URL list by the number of threads and assign a chunk to each?

Any programming Gods out there? Bless me with some of your wisdom please :D
 

Grimasaur

Junior Member
Joined
Apr 8, 2016
Messages
191
Reaction score
100
Age
28
I will give you a more detailed answer when I get home, but this is some pseudocode to try to help you:
urls = [url1, url2, ...]
pool = Pool(5)
pool.map(scrape_website, urls)

That's one way to do it, I think.
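Filling that in, a minimal runnable sketch (scrape_website and the URL list are placeholders; multiprocessing.dummy gives the same Pool API backed by threads, which suits I/O-bound scraping):

```python
from multiprocessing.dummy import Pool  # same API as multiprocessing.Pool, thread-backed

# Placeholder URLs -- swap in your real list.
urls = ["http://example.com/%d" % i for i in range(1, 6)]

def scrape_website(url):
    # Stand-in for requests.get(url).text so the sketch runs offline.
    return "scraped " + url

# map() hands each worker one unique URL at a time and keeps result order.
pool = Pool(5)
results = pool.map(scrape_website, urls)
pool.close()
pool.join()

print(results[0])  # prints: scraped http://example.com/1
```

Note there is no `args=` keyword on `map()`; it takes the function and the iterable, and each worker receives one item at a time.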
 
D

Deleted member 969102

Guest
Why not just split the URLs into however many threads you want and run each python script at the same time?
 

lakerr

Power Member
Joined
Jun 20, 2012
Messages
505
Reaction score
180
Website
Lakerr.shop
Code:
import threading

url_index = 0
index_lock = threading.Lock()

class myThread(threading.Thread):
    def __init__(self, thread_n, name):
        threading.Thread.__init__(self)
        self.name = name
        self.thread_n = thread_n

    def run(self):
        self.do_something(self.thread_n, self.name)

    def do_something(self, thread_n, threadname):
        global url_index
        with index_lock:  # lock so two threads can't grab the same index
            # scrape what you want here, e.g. urls[url_index]
            url_index += 1

thread1 = myThread(1, "Thread-1")
thread2 = myThread(2, "Thread-2")
thread3 = myThread(3, "Thread-3")

thread1.start()
thread2.start()
thread3.start()
thread1.join()
thread2.join()
thread3.join()
 

oscarboy

Registered Member
Joined
May 21, 2011
Messages
87
Reaction score
32
Have you considered using Browser Automation Studio? It does all you want and more. The developer is a member of BHW; I've been using it for a while and it's a powerful piece of software.
 

dgi

Newbie
Joined
Mar 5, 2018
Messages
6
Reaction score
0
Hi,
If you still can't get it to work, this is what you do.
The concept consists of a main thread and worker threads. Python has a Queue class that is thread-safe, which means multiple threads can act on it at once.
The main thread fills the queue with URLs. Each worker thread loops, taking one URL at a time, until the queue is empty.
So when you have, for example, 4 threads/workers running with 5 URLs in the queue, it's not a problem: once the final URL is taken out of the queue,
the other workers will see that the queue is empty and stop their execution.

Hope this helps.
 

pasdoy

Senior Member
Joined
Jul 17, 2008
Messages
911
Reaction score
326

I did a small search and tried https://github.com/kennethreitz/grequests. It looks like an out-of-the-box solution to your problem.

Python:
import grequests


def exception_handler(request, exception):
    print("Request failed")


def set_meta(meta):
    def hook(r, **kwargs):
        r.meta = meta
        return r
    return hook


reqs = [
    grequests.get('http://httpbin.org/delay/1', timeout=0.001),
    grequests.get('http://fakedomain/'),
    grequests.get('http://httpbin.org/status/500', callback=set_meta({"title": 'it will 500'})),
    grequests.get('https://www.blackhatworld.com/seo/python-need-multiprocessing-god-to-help-me-advice-template-code-anything.1012254/', callback=set_meta({"title": 'it is bhw'}))]


for res in grequests.imap(reqs, exception_handler=exception_handler):
    print(res, res.meta)
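And since the original post mentioned concurrent.futures: a ThreadPoolExecutor does the same job from the standard library. A sketch with stub names (scrape and the URL list are placeholders); an odd-sized list with 4 workers is fine, leftover URLs just wait for a free worker:

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder URLs: 7 URLs for 4 workers.
urls = ["http://example.com/page%d" % i for i in range(1, 8)]

def scrape(url):
    # Stand-in for requests.get(url).text so the sketch runs offline.
    return "scraped " + url

# map() feeds each idle worker the next unscraped URL; results come
# back in the original list order.
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(scrape, urls))

print(len(results))  # prints 7
```

No manual splitting of the list is needed; the executor hands out work one URL at a time.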
 