I have an 800 MB file that I would like to iterate through to find which kingdom each taxonID belongs to. Each row has a parentNameUsageID that points to its parent. I recurse until I find a parent whose taxonRank is kingdom. The file has over 2M records, which makes this resource intensive. I have a working script that does what I need, but I build a dictionary so I can easily access the attributes, and only once the whole dictionary is updated with the new kingdom field do I write it to a file. Is there a better way to do it, like read -> update -> write each row to a new file without waiting for the whole thing to finish?
Also, can we do it in a way that does not use as many resources and goes faster than 2 rows per minute (that's the current speed)? Each recursion opens the file again. If I don't, it only iterates through part of the file and not the whole file.
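The "only part of the file" behaviour described above is what a one-pass iterator would produce. Here is a minimal sketch of that, using a tiny in-memory stand-in for Taxon.tsv (the rows and the ID "Z" are made up for illustration):

```python
import csv
import io

# Made-up sample rows standing in for Taxon.tsv.
data = "dwc:taxonID\tdwc:parentNameUsageID\nA\t\nB\tA\nC\tB\n"

# A csv.DictReader can only be walked once.
reader = csv.DictReader(io.StringIO(data), delimiter="\t")
first_pass = [row["dwc:taxonID"] for row in reader]
second_pass = [row["dwc:taxonID"] for row in reader]  # already exhausted
print(first_pass)   # ['A', 'B', 'C']
print(second_pass)  # []

# Searching on the same reader inside the outer loop uses up the
# remaining rows, so the outer loop stops early.
shared = csv.DictReader(io.StringIO(data), delimiter="\t")
visited = []
for row in shared:
    visited.append(row["dwc:taxonID"])
    for other in shared:
        if other["dwc:taxonID"] == "Z":  # hypothetical ID that is never found
            break
print(visited)  # ['A']
```

This is why sharing one reader between the outer loop and the lookup skips rows, and why a second loop over an already consumed reader yields nothing.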
Sample data file can be downloaded from here (840 MB).
Code:
import csv
import time
import io

file_name = "Taxon.tsv"


def getKingdomName(parentID):
    with open(file_name, "r", encoding="utf-8") as file:
        tfr = csv.DictReader(file, delimiter="\t")
        for t in tfr:
            if t["dwc:taxonID"] == parentID:
                # print("Found", t["dwc:taxonID"])
                if t["dwc:taxonRank"] == 'kingdom':
                    # print("kingdom name: ", t["dwc:scientificName"])
                    return t["dwc:scientificName"]
                else:
                    # print("No kingdom match. Calling getKingdomName with", t["dwc:parentNameUsageID"])
                    return getKingdomName(t["dwc:parentNameUsageID"])
            else:
                pass


with open(file_name, "r", encoding="utf-8") as file:
    taxon_file = csv.DictReader(file, delimiter="\t")
    new_taxon_file = None
    print("start:", time.strftime("%H:%M:%S", time.localtime()))
    for line in taxon_file:
        # print(line)
        kingdomName = getKingdomName(line["dwc:parentNameUsageID"])
        line["dwc:kingdom"] = kingdomName
        print(line)

    memory_file = io.StringIO()
    with open('Taxon-out.tsv', "w", encoding="utf-8", newline='') as output:
        writer = csv.DictWriter(output, fieldnames=taxon_file.fieldnames, delimiter="\t")
        for row in taxon_file:
            writer.writerow(row)
    print("end:", time.strftime("%H:%M:%S", time.localtime()))
I was able to tackle the slow processing using the code below. Performance suffers with pandas if I search through the rows for every lookup. The code below does not scan the rows to find a parent. It sets dwc:taxonID as the index column and looks up each row directly by its ID with df.loc[taxonID]. This improves performance from 2 rows per minute to 1000 rows per second.
@furas: thanks for your help in comments.
I hope this helps someone trying to accomplish the same thing.
import pandas as pd
import numpy as np

dtypes = {
    'dwc:parentNameUsageID': str,
    'dwc:acceptedNameUsageID': str,
    'dwc:originalNameUsageID': str,
    'dwc:datasetID': str,
    'dwc:taxonomicStatus': str,
    'dwc:taxonRank': str,
    'dwc:scientificName': str,
    'gbif:genericName': str,
    'dwc:specificEpithet': str,
    'dwc:infraspecificEpithet': str,
    'dwc:nameAccordingTo': str,
    'dwc:namePublishedIn': str,
    'dwc:nomenclaturalStatus': str,
    'dwc:nomenclaturalCode': str,
    'dwc:taxonRemarks': str,
    'dcterms:references': str
}

taxon_fp = 'Taxon.tsv'  # 'taxon-small.tsv' 'Taxon.tsv'
out_file = 'Taxon-out.tsv'  # 'taxon-small-out.tsv' 'Taxon-out.tsv'

df = pd.read_csv(taxon_fp, sep="\t", encoding="utf-8", index_col="dwc:taxonID", na_filter=False)
for col, dtype in dtypes.items():
    df[col] = df[col].astype(dtype)

kingdom_list = []


def getKingdomName(taxonID):
    row = df.loc[taxonID]
    if row["dwc:taxonRank"] == "kingdom":
        return row["dwc:scientificName"]
    elif row["dwc:parentNameUsageID"] == '':
        if row["dwc:acceptedNameUsageID"] == '':
            return ''
        else:
            return getKingdomName(row["dwc:acceptedNameUsageID"])
    else:
        return getKingdomName(row["dwc:parentNameUsageID"])


for i, row in df.iterrows():
    if row["dwc:taxonRank"] == 'kingdom':
        kingdomName = row["dwc:scientificName"]
    else:
        kingdomName = getKingdomName(i)
    kingdom_list.append(kingdomName)

df['dwc:kingdom'] = kingdom_list

new_columns = []
for item in df.columns:
    new_columns.append(item.replace('dwc:', '').replace('gbif:', '').replace('dcterms:', ''))
df.columns = new_columns
df.index.name = 'taxonID'
df.to_csv(out_file, sep="\t")
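
For anyone who wants the read -> update -> write flow from the question without pandas, here is a rough sketch of one possible alternative; it is not what I actually ran. It assumes the same column names as above. The helper names (build_lookup, find_kingdom, add_kingdom) and the sample rows are made up for illustration. The first pass keeps only a small tuple per taxon in a dict, and the second pass streams each row to the output with the kingdom added, caching results so each parent chain is walked only once.

```python
import csv
import io


def build_lookup(src):
    """Pass 1: map taxonID -> (next ID to follow, rank, name)."""
    lookup = {}
    for row in csv.DictReader(src, delimiter="\t"):
        # Follow the parent, or the accepted name when there is no parent.
        next_id = row["dwc:parentNameUsageID"] or row["dwc:acceptedNameUsageID"]
        lookup[row["dwc:taxonID"]] = (
            next_id, row["dwc:taxonRank"], row["dwc:scientificName"])
    return lookup


def find_kingdom(taxon_id, lookup, cache):
    """Walk up the chain without recursion and cache every ID visited."""
    path = []
    kingdom = ""
    while taxon_id and taxon_id in lookup:
        if taxon_id in cache:
            kingdom = cache[taxon_id]
            break
        if taxon_id in path:  # guard against a cycle in the data
            break
        path.append(taxon_id)
        next_id, rank, name = lookup[taxon_id]
        if rank == "kingdom":
            kingdom = name
            break
        taxon_id = next_id
    for visited_id in path:
        cache[visited_id] = kingdom
    return kingdom


def add_kingdom(src, dst):
    """Pass 2: stream rows from src to dst with a dwc:kingdom column added."""
    lookup = build_lookup(src)
    src.seek(0)  # rewind the input for the second pass
    reader = csv.DictReader(src, delimiter="\t")
    writer = csv.DictWriter(
        dst, fieldnames=reader.fieldnames + ["dwc:kingdom"],
        delimiter="\t", lineterminator="\n")
    writer.writeheader()
    cache = {}
    for row in reader:
        row["dwc:kingdom"] = find_kingdom(row["dwc:taxonID"], lookup, cache)
        writer.writerow(row)  # each row is written as soon as it is resolved


# Made-up sample rows; the real input would be Taxon.tsv.
sample = (
    "dwc:taxonID\tdwc:parentNameUsageID\tdwc:acceptedNameUsageID\t"
    "dwc:taxonRank\tdwc:scientificName\n"
    "K\t\t\tkingdom\tAnimalia\n"
    "P\tK\t\tphylum\tChordata\n"
    "S\tP\t\tspecies\tHomo sapiens\n"
    "Y\t\tS\tspecies\tHomo sapiens syn.\n"
)
out = io.StringIO()
add_kingdom(io.StringIO(sample), out)
print(out.getvalue())
```

With the real file, src and dst would be handles opened on Taxon.tsv and Taxon-out.tsv with encoding="utf-8" and newline=''. I have not timed this against the pandas version.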