Making a Python script that reads, modifies, and writes a TSV file more efficient
I have an 800 MB file that I would like to iterate through to find which taxonID belongs to which kingdom. Each row has a parentNameUsageID that points to its next parent. I recurse until I find a parent whose taxonRank is kingdom. The file has over 2M records, which makes processing resource-intensive.

I have a working script that does what I need, but I build a dictionary so I can easily access the attributes, and only once the whole dictionary has been updated with the new kingdom field do I write it to a file. Is there a better way to do it, like read -> update -> write each row to a new file without waiting for the whole thing to finish?
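For reference, the read -> update -> write streaming pattern described above could be sketched like this (a minimal sketch, not the script below; update_row is a hypothetical stand-in for whatever per-row work is needed):

```python
import csv
import io

def stream_process(infile, outfile, update_row):
    # Read one row at a time, update it, and write it out immediately,
    # so memory use stays constant regardless of file size.
    reader = csv.DictReader(infile, delimiter="\t")
    fieldnames = reader.fieldnames + ["dwc:kingdom"]
    writer = csv.DictWriter(outfile, fieldnames=fieldnames, delimiter="\t")
    writer.writeheader()
    for row in reader:
        writer.writerow(update_row(row))
```

Note this only helps with memory, not speed: each row is written as soon as it is processed, but the per-row lookup cost is unchanged.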
Also, can we do it in a way that uses fewer resources and runs faster than 2 rows per minute (the current speed)? Each recursion opens the file again; if I don't reopen it, it only iterates through part of the file rather than the whole file.
The sample data file can be downloaded from here - 840 MB
Code:
import csv
import time

file_name = "Taxon.tsv"

def getKingdomName(parentID):
    # Re-opens and rescans the whole file for every lookup -- this is the bottleneck.
    with open(file_name, "r", encoding="utf-8") as file:
        tfr = csv.DictReader(file, delimiter="\t")
        for t in tfr:
            if t["dwc:taxonID"] == parentID:
                if t["dwc:taxonRank"] == 'kingdom':
                    return t["dwc:scientificName"]
                else:
                    return getKingdomName(t["dwc:parentNameUsageID"])

updated_rows = []
with open(file_name, "r", encoding="utf-8") as file:
    taxon_file = csv.DictReader(file, delimiter="\t")
    fieldnames = taxon_file.fieldnames + ["dwc:kingdom"]
    print("start:", time.strftime("%H:%M:%S", time.localtime()))
    for line in taxon_file:
        line["dwc:kingdom"] = getKingdomName(line["dwc:parentNameUsageID"])
        updated_rows.append(line)

with open('Taxon-out.tsv', "w", encoding="utf-8", newline='') as output:
    writer = csv.DictWriter(output, fieldnames=fieldnames, delimiter="\t")
    writer.writeheader()
    for row in updated_rows:
        writer.writerow(row)
print("end:", time.strftime("%H:%M:%S", time.localtime()))
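As an aside, one way to avoid re-opening and rescanning the file for every parent lookup (a sketch of an alternative, not the pandas approach used below) is to read the file into a dict keyed by dwc:taxonID once, then walk the parent chain entirely in memory:

```python
import csv

def build_index(path):
    # Read the TSV once and index every row by its taxonID,
    # so each parent lookup becomes an O(1) dict access.
    with open(path, "r", encoding="utf-8") as f:
        return {row["dwc:taxonID"]: row
                for row in csv.DictReader(f, delimiter="\t")}

def kingdom_name(index, taxon_id):
    # Walk the parent chain iteratively (no recursion needed),
    # stopping at a kingdom-ranked row or a missing parent.
    while taxon_id in index:
        row = index[taxon_id]
        if row["dwc:taxonRank"] == "kingdom":
            return row["dwc:scientificName"]
        taxon_id = row["dwc:parentNameUsageID"]
    return ""
```

This trades memory (the whole file is held in RAM) for lookup speed, which is essentially the same trade the pandas solution below makes via its index.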
I was able to tackle the slow processing using the code below. Performance suffers with pandas if you iterate through each row to find the parent; the code below does not do that. Instead, it looks up each row by the dwc:taxonID index column with df.loc[taxonID]. This improves performance from 2 rows per minute to 1000 rows per second.

@furas: thanks for your help in the comments.

I hope this helps someone trying to accomplish the same thing.
import pandas as pd

dtypes = {
    'dwc:parentNameUsageID': str,
    'dwc:acceptedNameUsageID': str,
    'dwc:originalNameUsageID': str,
    'dwc:datasetID': str,
    'dwc:taxonomicStatus': str,
    'dwc:taxonRank': str,
    'dwc:scientificName': str,
    'gbif:genericName': str,
    'dwc:specificEpithet': str,
    'dwc:infraspecificEpithet': str,
    'dwc:nameAccordingTo': str,
    'dwc:namePublishedIn': str,
    'dwc:nomenclaturalStatus': str,
    'dwc:nomenclaturalCode': str,
    'dwc:taxonRemarks': str,
    'dcterms:references': str
}

taxon_fp = 'Taxon.tsv'      # 'taxon-small.tsv' 'Taxon.tsv'
out_file = 'Taxon-out.tsv'  # 'taxon-small-out.tsv' 'Taxon-out.tsv'

df = pd.read_csv(taxon_fp, sep="\t", encoding="utf-8",
                 index_col="dwc:taxonID", na_filter=False)
for col, dtype in dtypes.items():
    df[col] = df[col].astype(dtype)

kingdom_list = []

def getKingdomName(taxonID):
    # Indexed .loc lookup instead of rescanning the file on every call.
    row = df.loc[taxonID]
    if row["dwc:taxonRank"] == "kingdom":
        return row["dwc:scientificName"]
    elif row["dwc:parentNameUsageID"] == '':
        if row["dwc:acceptedNameUsageID"] == '':
            return ''
        else:
            return getKingdomName(row["dwc:acceptedNameUsageID"])
    else:
        return getKingdomName(row["dwc:parentNameUsageID"])

for i, row in df.iterrows():
    if row["dwc:taxonRank"] == 'kingdom':
        kingdomName = row["dwc:scientificName"]
    else:
        kingdomName = getKingdomName(i)
    kingdom_list.append(kingdomName)
df['dwc:kingdom'] = kingdom_list

# Strip the namespace prefixes from the column names before writing out.
new_columns = []
for item in df.columns:
    new_columns.append(item.replace('dwc:', '').replace('gbif:', '').replace('dcterms:', ''))
df.columns = new_columns
df.index.name = 'taxonID'
df.to_csv(out_file, sep="\t")