Python multithreading performance - use C++ instead?
So I have a Python script that basically writes an 80GB+ file. Currently it just runs serially, and the only time I actually ran it, it took about 13 hours on the server.
I'm going to parallelize it so that it writes many files instead of just one.
It would be slightly easier to take what I already have, keep it in Python, and incorporate multiple threads (there is a shared map of data they need access to, which nobody will write to, so it doesn't need protection).
But is keeping it in Python foolish? I also know C++, so do you think I should just rewrite it in C++? I figure the program is disk-bound more than anything else (there isn't a ton of logic used to write the file), so maybe it wouldn't make much difference. I'm not sure how long C++ would take to write the same 80GB file (serially).
UPDATE 6/6/14, 16:40 PST: I'm posting my code below to determine whether there is a bottleneck in the code itself, as opposed to it being purely disk-bound.
I call writeEntriesToSql() once per table, and there are about 30 tables. "size" is the number of inserts to write to the table. The cumulative size across all tables is about 200,000,000.
I did notice that I'm compiling my regular expression over and over again, which could be wasting a lot of time, though I'm not sure how much.
def writeEntriesToSql(db, table, size, outputFile):
    # get a description of the table
    rows = queryDatabaseMultipleRows(db, 'DESC ' + table)

    fieldNameCol = 0  # no enums in python 2.7 :(
    typeCol = 1
    nullCol = 2
    keyCol = 3
    defaultCol = 4
    extraCol = 5

    fieldNamesToTypes = {}

    for row in rows:
        if (row[extraCol].find("auto_increment") == -1):
            # insert this one
            fieldNamesToTypes[row[fieldNameCol]] = row[typeCol]

    for i in range(size):
        fieldNames = ""
        fieldVals = ""
        count = 0

        # go through the fields
        for fieldName, type in fieldNamesToTypes.iteritems():
            # build a string of field names to be used in the INSERT statement
            fieldNames += table + "." + fieldName
            if fieldName in foreignKeys[table]:
                otherTable = foreignKeys[table][fieldName][0]
                otherTableKey = foreignKeys[table][fieldName][1]

                if len(foreignKeys[table][fieldName]) == 3:
                    # we already got the value so we don't have to get it again
                    val = foreignKeys[table][fieldName][2]
                else:
                    # get the value from the other table and store it
                    #### I plan for this to be an infrequent query - unless something is broken here!
                    query = "SELECT " + otherTableKey + " FROM " + otherTable + " LIMIT 1"
                    val = queryDatabaseSingleRowCol(db, query)
                    foreignKeys[table][fieldName].append(val)
                fieldVals += val
            else:
                fieldVals += getDefaultFieldVal(type)
            count = count + 1
            if count != len(fieldNamesToTypes):
                fieldNames += ","
                fieldVals += ","
# return the default field value for a given field type which will be used to prepopulate our tables
def getDefaultFieldVal(type):
    if not ('insertTime' in globals()):
        global insertTime
        insertTime = datetime.utcnow()
        # store this time in a file so that it can be retrieved by SkyReporterTest.perfoutput.py
        try:
            timeFileName = perfTestDir + "/dbTime.txt"
            timeFile = open(timeFileName, 'w')
            timeFile.write(str(insertTime))
            timeFile.close()
        except:
            print "!!! cannot open file " + timeFileName + " for writing. Please make sure this is run where you have write permissions\n"
            sys.exit(1)

    # many of the types are formatted with a typename, followed by a size in parentheses
    ##### Looking at this more closely, I suppose I could be compiling this once instead of over and over - a big bottleneck here?
    p = re.compile("(.*)\(([0-9]+).*")

    size = 0
    if (p.match(type)):
        size = int(p.sub(r"\2", type))
        type = p.sub(r"\1", type)
    else:
        size = 0

    if (type == "tinyint"):
        return str(random.randint(1, math.pow(2,7)))
    elif (type == "smallint"):
        return str(random.randint(1, math.pow(2,15)))
    elif (type == "mediumint"):
        return str(random.randint(1, math.pow(2,23)))
    elif (type == "int" or type == "integer"):
        return str(random.randint(1, math.pow(2,31)))
    elif (type == "bigint"):
        return str(random.randint(1, math.pow(2,63)))
    elif (type == "float" or type == "double" or type == "doubleprecision" or type == "decimal" or type == "realdecimal" or type == "numeric"):
        return str(random.random() * 100000000)  # random endpoints for this random
    elif (type == "date"):
        insertTime = insertTime - timedelta(seconds=1)
        return "'" + insertTime.strftime("%Y-%m-%d") + "'"
    elif (type == "datetime"):
        insertTime = insertTime - timedelta(seconds=1)
        return "'" + insertTime.strftime("%Y-%m-%d %H:%M:%S") + "'"
    elif (type == "timestamp"):
        insertTime = insertTime - timedelta(seconds=1)
        return "'" + insertTime.strftime("%Y%m%d%H%M%S") + "'"
    elif (type == "time"):
        insertTime = insertTime - timedelta(seconds=1)
        return "'" + insertTime.strftime("%H:%M:%S") + "'"
    elif (type == "year"):
        insertTime = insertTime - timedelta(seconds=1)
        return "'" + insertTime.strftime("%Y") + "'"
    elif (type == "char" or type == "varchar" or type == "tinyblob" or type == "tinytext" or type == "blob" or type == "text" or type == "mediumblob"
          or type == "mediumtext" or type == "longblob" or type == "longtext"):
        if (size == 0):  # not specified
            return "'a'"
        else:
            lst = [random.choice(string.ascii_letters + string.digits) for n in xrange(size)]
            return "'" + "".join(lst) + "'"
    elif (type == "enum"):
        return "NULL"  # TBD if needed
    elif (type == "set"):
        return "NULL"  # TBD if needed
    else:
        print "!!! Unrecognized mysql type: " + type + "\n"
        sys.exit(1)
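The regex recompilation flagged in the comment above can be avoided by compiling the pattern once at module scope. A minimal sketch of that fix (parse_type is a hypothetical helper name, not from the code above):

```python
import re

# Compiled once at import time instead of on every getDefaultFieldVal() call.
# Matches e.g. "varchar(255)" -> base type "varchar", size 255.
TYPE_SIZE_RE = re.compile(r"(.*)\(([0-9]+).*")

def parse_type(type_str):
    """Split a MySQL column type into (base_type, size); size 0 if none given."""
    m = TYPE_SIZE_RE.match(type_str)
    if m:
        return m.group(1), int(m.group(2))
    return type_str, 0
```

Note that `re` does cache recently compiled patterns internally, so the win here may be modest, but hoisting the compile also avoids the repeated cache lookup and the two `p.sub()` passes.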
Python's I/O is not much slower than other languages'. The interpreter can be slow to start up, but writing such a large file will amortize that effect.
I suggest looking into the multiprocessing module, which gives you true parallelism by running multiple Python instances, working around the GIL. The processes carry some overhead, but again, for an 80GB file it won't matter much. Keep in mind that each one is a full process, meaning it will take up more compute resources.
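With roughly one output file per table (about 30 here), a `multiprocessing.Pool` maps naturally onto the work. A minimal sketch, where `write_table` and the job list are hypothetical stand-ins for the real per-table generation:

```python
from multiprocessing import Pool

def write_table(job):
    # Each worker process should open its own DB connection and its own
    # output file; nothing writable is shared between processes.
    table, size = job
    out_path = "inserts_%s.sql" % table
    with open(out_path, "w") as f:
        f.write("-- %d inserts for %s\n" % (size, table))
        # ... generate and write the INSERT statements for this table here ...
    return out_path

if __name__ == "__main__":
    jobs = [("users", 1000), ("orders", 2000)]  # hypothetical table list
    pool = Pool(processes=4)  # tune to core count and disk layout
    outputs = pool.map(write_table, jobs)
    pool.close()
    pool.join()
```

Since your shared map is read-only, each worker can simply inherit it on fork (or rebuild it at startup on platforms that spawn); no locking is needed.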
Also keep in mind that since your code is already I/O-bound, depending on your configuration you may get little or no speedup: if you write from multiple processes to a single disk, the seek contention can do more harm than good.