BrokenPipeError: [Errno 32] Python Multiprocessing
I was working on a web scraping project, but it was taking a lot of time to process the data, so I came up with an alternate route: scrape the source code of each product first, then process the data separately.
What I did is store the source code of each product, enclosed in a tuple, in an array, and save that array to a text file for further processing at a later stage. I save the data in chunks of 10,000 products; each text file is about 10 GB.
When I started to process the data using multiprocessing, I kept running into BrokenPipeError: [Errno 32]. Initially I was processing the data on a Windows machine; I explored a bit and found that Linux is better at managing memory, and that this error is caused by complete memory utilization during processing.
Initially, I was storing the processed data in an array (not saving the data at run time for each product). I read on Stack Overflow that I need to save the processed data as I go, because the processed data was eating up all the memory. I changed the code accordingly and changed map to imap; it ran longer, but still raised the same error.
Here is my code. I am not posting the complete processing steps, as they would only increase the length of the code.
A point to note: there is a huge amount of array data for each product when processed, with each individual array holding up to 18,000 elements.
I am using an octa-core processor with 16 GB of RAM and a 500 GB SSD.
Any help would be appreciated. Thanks!
import xml.etree.cElementTree as ET
from lxml import html
import openpyxl
from openpyxl import Workbook
from lxml import etree
from lxml.etree import tostring
import pathos.multiprocessing as mp
import multiprocessing
import ast
global sourceDataList
sourceDataList=[]
global trackIndex
trackIndex=1
global failList
failList=[]
def processData(data):
    vehicalData=[]
    oemData=[]
    appendIndex=0
    # getting product link from incoming data list (tuple)
    p=data[0][1]
    # getting html source code from incoming data list (tuple)
    # converting it to an html element
    source_code=html.fromstring(data[0][0])
    # processing data
    try:
        firstOem=source_code.xpath("//div[@id='tab-review']//tr[2]/td[2]")
        firstOem=firstOem[0].text_content().strip()
    except:
        firstOem=''
    try:
        name=source_code.xpath("//div[@id='right_title']/h1")
        name=name[0].text_content().strip()
    except:
        name=''
    # saving data in respective arrays
    vehicalData.append([firstOem,p,name,productType,brand,mfgNumber,imgOne,imgTwo,imgThree,imgFour,imgFive])
    for q in dayQtyPrice:
        vehicalData[appendIndex].append(q)
    vehicalData[appendIndex].append(specString)
    vehicalData[appendIndex].append(subAssembltString)
    vehicalData[appendIndex].append(parentAssemblyString)
    vehicalData[appendIndex].append(otherProductString)
    vehicalData[appendIndex].append(description)
    vehicalData[appendIndex].append(placement)
    for dma in makeModelArray:
        vehicalData[appendIndex].append(dma)
    oemData.append([firstOem,name,productType,brand,mfgNumber,p])
    for o in oemArray:
        oemData[appendIndex].append(o)
    print('Done !',p,len(vehicalData[0]),len(oemData[0]))
    # returning both arrays
    return (vehicalData,oemData)
def main():
    productLinks=[]
    vehicalData=[]
    oemData=[]
    # opening text file for processing list data
    with open('test.txt', encoding='utf-8') as f:
        string=f.read()
    sourceDataList=ast.literal_eval(string)
    print('Number of products:',len(sourceDataList))
    # creating pool and initiating multiprocessing
    p = mp.Pool(4) # Pool tells how many at a time
    # opening and saving data at run time
    vehicalOutBook=openpyxl.load_workbook('vehical_data_file.xlsx')
    vehicalOutSheet=vehicalOutBook.active
    oemOutBook=openpyxl.load_workbook('oem_data_file.xlsx')
    oemOutSheet=oemOutBook.active
    for d in p.imap(processData, sourceDataList):
        v=d[0][0][:18000]
        o=d[1][0][:18000]
        vehicalOutSheet.append(v)
        oemOutSheet.append(o)
    p.terminate()
    p.join()
    # saving data
    vehicalOutBook.save('vehical_data_file.xlsx')
    oemOutBook.save('oem_data_file.xlsx')

if __name__=='__main__':
    main()
I am not familiar with the pathos.multiprocessing.Pool class, but let's assume it works more or less the same as the multiprocess.pool.Pool class. The problem is that the data in test.txt is in such a format that you must read the whole file in to parse it with ast.literal_eval, and therefore there can be no storage savings with imap.
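As an aside, the difference between map and imap is only about how results stream back; neither saves input-side memory if the iterable is already a fully built list. A toy illustration of the two calls (hypothetical square worker, standard multiprocessing; not part of the original code):

```python
import multiprocessing

def square(x):
    # stand-in for a real worker such as processData
    return x * x

if __name__ == '__main__':
    with multiprocessing.Pool(2) as pool:
        # map: blocks until every result is ready, holds them all at once
        all_at_once = pool.map(square, range(10))
        # imap: results stream back one at a time; pairing it with a lazy
        # input (here a generator) keeps both sides of the pipe small
        streamed = list(pool.imap(square, (i for i in range(10))))
    assert all_at_once == streamed
```

The point is that imap only pays off when the input itself is produced lazily, which is what the reorganized file format below enables.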
To use imap (or imap_unordered) efficiently, instead of storing in file test.txt a representation (JSON?) of a list, store multiple product representations separated by newlines, each individually parseable, so that the file can be read and parsed line by line to yield individual products. You should have an approximate count of how many lines, and thus how many tasks, will need to be submitted to imap. The reason is that when you have a large number of tasks, it is more efficient to use something other than the default chunksize argument value of 1. I have included below a function to compute a chunksize value along the lines the map function would use. Also, it seems that your worker function processData uses one more level of nested lists than necessary. I have also reverted to using the standard multiprocessing.pool.Pool class, since I know more or less how that works.
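A one-time conversion to that line-per-product format could look like this (a minimal sketch, assuming file names test.txt and test_lines.txt; note that JSON turns tuples into lists on the round trip):

```python
import ast
import json

def convert_to_jsonl(src='test.txt', dst='test_lines.txt'):
    # One-time conversion: parse the single giant list literal
    # (the last time the whole file must fit in memory) and
    # rewrite it as one JSON-encoded product per line.
    with open(src, encoding='utf-8') as f:
        products = ast.literal_eval(f.read())
    with open(dst, 'w', encoding='utf-8') as out:
        for product in products:
            out.write(json.dumps(product) + '\n')

def iter_products(path='test_lines.txt'):
    # Lazily yield one product at a time; only one line is
    # ever held in memory, which is what makes imap useful.
    with open(path, encoding='utf-8') as f:
        for line in f:
            yield json.loads(line)
```

After the conversion, a generator like iter_products can be passed straight to pool.imap in place of a materialized list.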
Note: I don't see where in processData the variables makeModelArray and oemArray are defined.
import xml.etree.cElementTree as ET
from lxml import html
import openpyxl
from openpyxl import Workbook
from lxml import etree
from lxml.etree import tostring
#import pathos.multiprocessing as mp
import multiprocessing
import ast
global sourceDataList
sourceDataList=[]
global trackIndex
trackIndex=1
global failList
failList=[]
def processData(data):
    # getting product link from incoming data list (tuple)
    p=data[0][1]
    # getting html source code from incoming data list (tuple)
    # converting it to an html element
    source_code=html.fromstring(data[0][0])
    # processing data
    try:
        firstOem=source_code.xpath("//div[@id='tab-review']//tr[2]/td[2]")
        firstOem=firstOem[0].text_content().strip()
    except:
        firstOem=''
    try:
        name=source_code.xpath("//div[@id='right_title']/h1")
        name=name[0].text_content().strip()
    except:
        name=''
    # saving data in respective arrays
    vehicalData = [firstOem,p,name,productType,brand,mfgNumber,imgOne,imgTwo,imgThree,imgFour,imgFive]
    for q in dayQtyPrice:
        vehicalData.append(q)
    vehicalData.append(specString)
    vehicalData.append(subAssembltString)
    vehicalData.append(parentAssemblyString)
    vehicalData.append(otherProductString)
    vehicalData.append(description)
    vehicalData.append(placement)
    for dma in makeModelArray:
        vehicalData.append(dma)
    oemData = [firstOem,name,productType,brand,mfgNumber,p]
    for o in oemArray:
        oemData.append(o)
    #print('Done !',p,len(vehicalData),len(oemData))
    # returning both arrays
    return (vehicalData,oemData)
def generate_source_data_list():
    # opening text file for processing list data
    with open('test.txt', encoding='utf-8') as f:
        for line in f:
            # data for just one product:
            yield ast.literal_eval(line)

def compute_chunksize(iterable_size, pool_size):
    chunksize, remainder = divmod(iterable_size, 4 * pool_size)
    if remainder:
        chunksize += 1
    return chunksize
def main():
    # creating pool and initiating multiprocessing
    # use pool size equal to number of cores you have:
    pool_size = multiprocessing.cpu_count()
    # Approximate number of elements generate_source_data_list() will yield:
    NUM_TASKS = 100_000 # replace with actual number
    p = multiprocessing.Pool(pool_size)
    chunksize = compute_chunksize(NUM_TASKS, pool_size)
    # opening and saving data at run time
    vehicalOutBook=openpyxl.load_workbook('vehical_data_file.xlsx')
    vehicalOutSheet=vehicalOutBook.active
    oemOutBook=openpyxl.load_workbook('oem_data_file.xlsx')
    oemOutSheet=oemOutBook.active
    for d in p.imap(processData, generate_source_data_list(), chunksize=chunksize):
        v = d[0][:18000]
        o = d[1][:18000]
        vehicalOutSheet.append(v)
        oemOutSheet.append(o)
    p.terminate()
    p.join()
    # saving data
    vehicalOutBook.save('vehical_data_file.xlsx')
    oemOutBook.save('oem_data_file.xlsx')

if __name__=='__main__':
    main()
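To give a feel for what the chunksize heuristic produces, here is the same helper restated so the snippet runs standalone, with a few computed values (it targets roughly 4 chunks per pool worker):

```python
def compute_chunksize(iterable_size, pool_size):
    # Same heuristic as above: divide the work into ~4 chunks
    # per worker, rounding up when there is a remainder.
    chunksize, remainder = divmod(iterable_size, 4 * pool_size)
    if remainder:
        chunksize += 1
    return chunksize

print(compute_chunksize(100_000, 8))  # 3125: 100_000 / (4 * 8) exactly
print(compute_chunksize(100_001, 8))  # 3126: rounded up on a remainder
print(compute_chunksize(50, 8))       # 2: even small inputs batch a little
```

With 100,000 products, each task sent through the pipe carries thousands of products instead of one, which cuts the inter-process communication overhead dramatically compared with the default chunksize of 1.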
You will still require a lot of storage for your final spreadsheet! Now, if you were outputting two csv files, that would be a different story -- you could write those as you go along.
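A minimal sketch of that csv variant (sample rows stand in for the real imap results; csv.writer sends each row to disk as it is written, so nothing accumulates in memory the way an openpyxl workbook does):

```python
import csv

# stand-in for the (vehicalData, oemData) tuples processData returns
results = [(['OEM1', 'link1', 'name1'], ['OEM1', 'name1']),
           (['OEM2', 'link2', 'name2'], ['OEM2', 'name2'])]

with open('vehical_data_file.csv', 'w', newline='', encoding='utf-8') as vf, \
     open('oem_data_file.csv', 'w', newline='', encoding='utf-8') as of:
    vehical_writer = csv.writer(vf)
    oem_writer = csv.writer(of)
    # in the real script this loop would be:
    #   for vehical_row, oem_row in p.imap(processData, ..., chunksize=chunksize):
    for vehical_row, oem_row in results:
        vehical_writer.writerow(vehical_row)  # flushed row by row
        oem_writer.writerow(oem_row)
```

Unlike the workbook approach, there is no final save step holding the whole dataset; if the run dies mid-way, the rows written so far are already on disk.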