How can I iterate through list data faster using multiprocessing?

I'm trying to determine the amount of time worked by a list of employees during their shift on site; this data is given to me as a CSV file. From there, I put the data into a matrix and iterate through it using a while loop, applying the necessary conditionals (for example, deducting 30 minutes for lunch). The results are then put into a new list, which is used to make an Excel worksheet.

My script does what it is meant to do, but it takes a very long time when it has to loop through a lot of data (approximately 26,000 rows). My idea is to use multiprocessing to run the following three loops in parallel:

  1. Convert the time from hh:mm:ss to minutes.
  2. Loop through and apply conditionals.
  3. Round values and convert back to hours, so that this is not done within the big while loop.

Is this a good idea? If so, how would I have the loops run in parallel when I need data from one loop to be used in the next? My first thought is to use the time function to add a delay, but then I'm concerned that my loops may "catch up" with one another and fail because the list index being accessed does not exist yet.

Any more experienced opinions would be amazing, thanks!

My script:

import pandas as pd

Function: Round the time down to the nearest lower ten minutes --> 77 = 70; 32 = 30:

def floor_time(n, decimals=0):
    multiplier = 10 ** decimals
    return int(n * multiplier) / multiplier

Function: Get data from the CSV file:

def get_data():
    df = pd.read_csv('/Users/Chadd/Desktop/dd.csv', sep = ',')
    list_of_rows = [list(row) for row in df.values]
    data = []
    i = 0
    while i < len(list_of_rows):
        # Each row holds one ';'-separated string; split it into fields
        data.append(list_of_rows[i][0].split(';'))
        # Drop the last field of the split row
        data[i].pop()
        i += 1
    return data

Function: Convert the time field (hh:mm:ss) to minutes since midnight:

def get_time(time_data):
    return int(time_data.split(':')[0])*60 + int(time_data.split(':')[1])

Function: Loop through data in CSV applying conditionals:

def get_time_worked():
    i = 0 # Looping through entry data
    j = 1 # Looping through departure data
    list_of_times = []

    while j < len(get_data()):

        start_time = get_time(get_data()[i][3])
        end_time = get_time(get_data()[j][3])

        # Morning shift - start time < end time
        if start_time < end_time:
            time_worked = end_time - start_time # end time - start time (minutes)
            # Need to deduct 15 minutes if late:
            if start_time > 6*60: # Late
                time_worked = time_worked - 15
            # Need to set the start time to 06:00:00:
            if start_time < 6*60: # Early
                time_worked = end_time - 6*60

        # Afternoon shift - start time > end time
        elif start_time > end_time:
            time_worked = 24*60 - start_time + end_time # 24*60 - start time + end time (minutes)
            # Need to deduct 15 minutes if late:
            if start_time > 18*60: # Late
                time_worked = time_worked - 15
            # Need to set the start time to 18:00:00:
            if start_time < 18*60: # Early
                time_worked = 24*60 - 18*60 + end_time

        # If time worked is 5 hours or more, deduct 30 minutes for lunch:
        if time_worked >= 5*60:
            time_worked = time_worked - 30

        # Set max time worked to 11.5 hours:
        if time_worked > 11.5*60:
            time_worked = 11.5*60

        list_of_times.append([get_data()[i][1], get_data()[i][2], round(floor_time(time_worked, decimals = -1)/60, 2)])

        i += 2
        j += 2

    return list_of_times

Save the data into Excel worksheet:

def save_data():
    file_heading = '{} to {}'.format(get_data()[0][2], get_data()[-1][2])
    file_heading_2 = file_heading.replace('/', '_')

    df = pd.DataFrame(get_time_worked())
    writer = pd.ExcelWriter('/Users/Chadd/Desktop/{}.xlsx'.format(file_heading_2), engine='xlsxwriter')
    df.to_excel(writer, sheet_name='Hours Worked', index=False)
    writer.close() # writer.save() was removed in pandas 2.0

save_data()

You can look at multiprocessing.Pool, which allows executing a function multiple times with different input variables. From the docs:

from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    with Pool(5) as p:
        print(p.map(f, [1, 2, 3]))

Then, it's a matter of splitting up your data into chunks (instead of the [1, 2, 3] in the example).
But my personal preference is to take the time to learn something that is distributed by default, such as Spark and pyspark. It will help you immensely in the long run.
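
A minimal sketch of the chunking idea applied to the question's data, under some assumptions. The rows are assumed to alternate entry, departure, entry, departure, as the i/j counters in the question imply, and the three steps (convert, apply conditionals, round) are combined into one function per entry/departure pair, so no stage has to wait for another. The names to_minutes, hours_worked and all_hours_worked, the chunksize value and the sample rows are made up for illustration. The rules follow the question's comments, treating a start before 18:00 on the overnight shift as early.

```python
from multiprocessing import Pool


def to_minutes(hms):
    """Convert 'hh:mm:ss' to minutes since midnight (seconds are ignored)."""
    hours, minutes = hms.split(':')[:2]
    return int(hours) * 60 + int(minutes)


def hours_worked(pair):
    """Apply the question's rules to one (entry_row, departure_row) pair."""
    entry, departure = pair
    start, end = to_minutes(entry[3]), to_minutes(departure[3])

    # A departure earlier on the clock than the entry means the shift
    # crossed midnight; that shift officially starts at 18:00, else 06:00.
    overnight = start > end
    shift_start = 18 * 60 if overnight else 6 * 60
    if overnight:
        end += 24 * 60

    if start < shift_start:        # early: count from the official start
        worked = end - shift_start
    elif start > shift_start:      # late: deduct 15 minutes
        worked = end - start - 15
    else:
        worked = end - start

    if worked >= 5 * 60:           # lunch deduction
        worked -= 30
    worked = min(worked, 11.5 * 60)    # cap at 11.5 hours
    worked = int(worked / 10) * 10     # round down to ten minutes

    return [entry[1], entry[2], round(worked / 60, 2)]


def all_hours_worked(rows, processes=4):
    """Pair up the rows once, then let the pool split the pairs into chunks."""
    pairs = list(zip(rows[0::2], rows[1::2]))
    with Pool(processes) as pool:
        return pool.map(hours_worked, pairs, chunksize=1000)


if __name__ == '__main__':
    # Made-up rows: fields 1 and 2 are copied to the output,
    # field 3 is the clock time (assumed layout).
    rows = [
        ['1', 'Alice', '01/02/2021', '05:50:00'],
        ['1', 'Alice', '01/02/2021', '14:07:00'],
        ['2', 'Bob', '01/02/2021', '18:20:00'],
        ['2', 'Bob', '02/02/2021', '06:00:00'],
    ]
    print(all_hours_worked(rows, processes=2))
```

In real use, rows would be the result of a single get_data() call. The script in the question calls get_data() several times per loop iteration, and each call reads the CSV again, so loading the rows once is likely to help even without a pool. For roughly 26,000 rows the cost of starting worker processes may outweigh the gain, so it is worth timing both versions.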
