[Python] Multiprocessing - 'Manager' Not Working

apex1

Regular Member
Joined
May 29, 2015
Messages
217
Reaction score
182
I'm trying to have a basic counter in the 'test' function that increments on each iteration (each list1 item).

I can't get it to work; I've been trying for hours. Any ideas what I'm doing wrong?

Code:
from multiprocessing import Pool, Manager


def test(current_item, counter): 

    counter = counter + 1
    print(counter)

    print(current_item)


if __name__ == '__main__':

    list1 = ["item1",
             "item2",
             "item3",
             "item4",
             "item5",
             "item6",
             "item7",
             "item8",
             "item9",
             "item10",
             "item11",
             "item12"]

    counter = Manager().Value(0)

    p = Pool(4)  # worker count
    p.map(test, list1)  # (function, iterable)
    p.terminate()
    p.join()
 

Gazo

Newbie
Joined
Apr 12, 2018
Messages
11
Reaction score
22
I think I understand what you are trying to do here, and there is probably an easier way to do it, but going about it the way you did, I came up with this:

Code:
from multiprocessing import Pool, Manager

def test(obj):
    counter, item = obj
    counter_val = counter.get()
    counter.set(counter_val + 1)
    print(item, counter_val)   
    return

if __name__ == '__main__':
    list1 = ["item1",
            "item2",
            "item3",
            "item4",
            "item5",
            "item6",
            "item7",
            "item8",
            "item9",
            "item10",
            "item11",
            "item12"]

    counter = Manager().Value(int, 0)
    p = Pool(4)  # worker count
    array = [(counter, item) for item in list1]
    p.map(test, array)
    p.terminate()
    print(counter.value)  # print the final count, not the proxy object

You were missing the type argument for the counter object — Manager().Value() takes a typecode (or type) plus an initial value, not just the value. The most confusing part about my code is probably how I am building the 'array' variable: I create a list of tuples that contain the counter object and the list1 items, and pass that to the test function because p.map only takes a single iterable.
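
As an aside, the "easier way" could look something like this — just a minimal sketch (assuming Python 3.3+ for Pool.starmap and using Pool as a context manager). It also wraps the increment in a manager Lock, because the get()/set() pair above isn't atomic: two workers can read the same value and one increment gets lost.

Code:
from multiprocessing import Pool, Manager


def test(counter, lock, item):
    # guard the read-modify-write so parallel workers can't lose increments
    with lock:
        counter.value += 1
        print(item, counter.value)


if __name__ == '__main__':
    list1 = ["item%d" % i for i in range(1, 13)]

    m = Manager()
    counter = m.Value(int, 0)
    lock = m.Lock()  # manager locks are picklable, so they can ride along in the args

    with Pool(4) as p:
        # starmap unpacks each tuple into separate arguments, so there is no
        # manual packing/unpacking of (counter, item) inside the worker
        p.starmap(test, [(counter, lock, item) for item in list1])

    print(counter.value)  # always 12, regardless of scheduling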
 

apex1

Regular Member
Joined
May 29, 2015
Messages
217
Reaction score
182

Holy crap.. you're the best. It works perfectly!! :D

You have no idea how long it would have taken me to figure that out.

I searched around forever and couldn't find any tutorials showing your method.

Cheers!!
 

Gazo

Newbie
Joined
Apr 12, 2018
Messages
11
Reaction score
22

Not a problem, any time you need some Python help just hit me up.
 

apex1

Regular Member
Joined
May 29, 2015
Messages
217
Reaction score
182
Need your help bro. I think I'm almost there but can't quite get it.

I'm taking scraped URLs I want processed, adding them to a dataframe with Pandas, and trying to pass that through map (inside the array) so I can use it within my 'scraper' function.

I need the dataframe within the 'scraper' function because the counter lets me fill the scraped data into the right table cell.

Part of my problem is I don't know where to create the dataframe or how to manage it properly inside the function.

Here's the code:

Code:
from multiprocessing import Lock, Pool, Manager
from time import sleep
from bs4 import BeautifulSoup
import pandas as pd
import re
import requests


exceptions = []
lock = Lock()
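# NOTE: a module-level Lock like this is shared with workers only on
# fork-based platforms; under the 'spawn' start method (e.g. Windows) each
# worker re-imports this module and builds its own, separate lock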


def scraper(obj):  # obj is the array passed from map (counter, url items)

    counter, url = obj  # not sure what this does

    df.insert(1, 'Alexa Rank:', "")  # insert new column
    df.insert(2, 'Status:', "")  # insert new column
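    # NOTE: df is created in the parent under __main__, so a worker sees an
    # inherited copy at best (fork) and no df at all under spawn; changes
    # made here never propagate back to the parent's DataFrame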

    lock.acquire()

    counter_val = counter.get()

    try:

        scrape = requests.get(url,
                              headers={"user-agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36"},
                              timeout=10)

        if scrape.status_code == 200:

            """ --------------------------------------------- """
            # ---------------------------------------------------
            '''      --> SCRAPE ALEXA RANK: <--    '''
            # ---------------------------------------------------
            """ --------------------------------------------- """

            sleep(0.1)
            scrape = requests.get("http://data.alexa.com/data?cli=10&dat=s&url=" + url,
                                  headers={"user-agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36"})
            html = scrape.content
            soup = BeautifulSoup(html, 'lxml')

            rank = re.findall(r'<popularity[^>]*text="(\d+)"', str(soup))

            df.iloc[counter_val, 0] = url  # fill cell with URL data
            df.iloc[counter_val, 1] = rank  # fill cell with alexa rank

            counter.set(counter_val + 1)  # increment counter

            print("Server Status:", scrape.status_code, '-', u"\u2713", '-', counter_val, '-', df.iloc[counter_val, 0], '-', "Rank:", rank[0])

        else:
            print("Server Status:", scrape.status_code)
            df.iloc[counter_val, 2] = scrape.status_code  # fill df cell with server status code
            counter.set(counter_val + 1)
            pass

    except BaseException as e:
        exceptions.append(e)
        print("Exception:", e)
        df.iloc[counter_val, 2] = e  # fill df cell with script exception message
        counter.set(counter_val + 1)
        pass

    finally:
        lock.release()
        df.to_csv("output.csv", index=False)
        return


if __name__ == '__main__':

    """ --------------------------------------------- """
    # ---------------------------------------------------
    '''               GET LINK LIST:                  '''
    # ---------------------------------------------------
    """ --------------------------------------------- """

    # get the link list (list1) from the pastebin:
    # https://pastebin.com/h42wqJPp

    df = pd.DataFrame(list1, columns=["Links:"])  # create pandas dataframe from links list

    """ --------------------------------------------- """
    # ---------------------------------------------------
    '''               MULTIPROCESSING:         '''
    # ---------------------------------------------------
    """ --------------------------------------------- """

    counter = Manager().Value(int, 0)  # set counter as manager with value of 0
    array = [(counter, url) for url in df]  # ***** ERROR - not adding links to array correctly (iterating a DataFrame yields its column labels, not the rows) *****
    print("Problem here, it's not adding all the links to array", array)

    p = Pool(20)  # worker count
    p.map(scraper, array)  # function, iterable
    p.terminate()
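
For reference, one common way to restructure this — a minimal sketch, not the poster's final code: let each worker scrape and return its row, then assemble the DataFrame in the parent afterwards. Because map() preserves input order, each result lands in the right row without any shared counter, Manager, or Lock (the URLs and parsing details below are placeholders):

Code:
from multiprocessing import Pool
import pandas as pd
import requests


def scraper(url):
    # build one row per URL and return it; nothing is shared between processes
    row = {"Links:": url, "Alexa Rank:": "", "Status:": ""}
    try:
        resp = requests.get(url, timeout=10)
        row["Status:"] = resp.status_code
        # ... parse resp.content and fill row["Alexa Rank:"] here ...
    except Exception as e:
        row["Status:"] = repr(e)
    return row


if __name__ == '__main__':
    list1 = ["http://example.com", "http://example.org"]  # placeholder link list

    with Pool(20) as p:
        results = p.map(scraper, list1)  # results come back in list1 order

    df = pd.DataFrame(results, columns=["Links:", "Alexa Rank:", "Status:"])
    df.to_csv("output.csv", index=False)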
 