
Making a Python script that reads, modifies, and writes a TSV file more efficient

I have an 800 MB file that I would like to iterate through to find which taxonID belongs to which kingdom. Each row has a parentNameUsageID that points to its parent. I recurse until I find a parent whose taxonRank is kingdom. The file has over 2M records, which makes this resource intensive. I have a working script that does what I need, but I build a dictionary so I can easily access the attributes, and only once the whole dictionary is updated with the new kingdom field do I write it to a file. Is there a better way to do it, such as read -> update -> write each row to a new file, without waiting for the whole thing to finish?

Also, can we do it in a way that uses fewer resources and runs faster than 2 rows per minute (the current speed)? Each recursion opens the file again. If I don't reopen it, it only iterates through part of the file rather than the whole file.

A sample data file can be downloaded from here (840 MB).

Code:

import csv
import time
import io

file_name = "Taxon.tsv"
def getKingdomName(parentID):
    with open(file_name, "r", encoding="utf-8") as file:
        tfr = csv.DictReader(file, delimiter="\t")
        for t in tfr:
            if t["dwc:taxonID"] == parentID:
                # print("Found", t["dwc:taxonID"])
                if t["dwc:taxonRank"] == 'kingdom':
                    # print("kingdom name: ", t["dwc:scientificName"])
                    return t["dwc:scientificName"]
                else:
                    # print("No kingdom match. Calling getKingdomName with", t["dwc:parentNameUsageID"])
                    return getKingdomName(t["dwc:parentNameUsageID"])
            else:
                pass


with open(file_name, "r", encoding="utf-8") as file:
    taxon_file = csv.DictReader(file, delimiter="\t")
    new_taxon_file = None
    print("start:", time.strftime("%H:%M:%S", time.localtime()))
    for line in taxon_file:
        # print(line)
        kingdomName = getKingdomName(line["dwc:parentNameUsageID"])
        line["dwc:kingdom"] = kingdomName
        print(line)

    memory_file = io.StringIO()
    with open('Taxon-out.tsv', "w", encoding="utf-8", newline='') as output:
        writer = csv.DictWriter(output, fieldnames=taxon_file.fieldnames, delimiter="\t")
        for row in taxon_file:
            writer.writerow(row)

    print("end:", time.strftime("%H:%M:%S", time.localtime()))
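A side note on why only part of the file is seen when it is not reopened: a csv.DictReader is a one-pass iterator over the underlying file object, so once rows have been consumed they are not served again. That is also why the second `for row in taxon_file` loop in the script above would find no rows left to write. The sketch below shows this on an in-memory io.StringIO with invented rows (the column names `id` and `parent` are made up for illustration and are not from Taxon.tsv).

```python
import csv
import io

# Invented three-row TSV standing in for the real file.
data = "id\tparent\nA\t\nB\tA\nC\tB\n"
handle = io.StringIO(data)

reader = csv.DictReader(handle, delimiter="\t")
first_pass = [row["id"] for row in reader]
second_pass = [row["id"] for row in reader]  # the reader is already exhausted

# Rewinding the handle and building a new reader starts again from the top.
handle.seek(0)
third_pass = [row["id"] for row in csv.DictReader(handle, delimiter="\t")]

print(first_pass, second_pass, third_pass)
```

Rewinding avoids reopening the file, but every lookup is still a full scan, so it does not address the speed problem by itself.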

I was able to fix the slow processing with the code below. Performance suffers with pandas if I iterate through every row to find a parent. The code below does not scan the rows for each lookup; it sets dwc:taxonID as the index column and fetches the row directly by ID with df.loc[taxonID]. This improves performance from 2 rows per minute to 1000 rows per second.

@furas: thanks for your help in comments.

I hope this helps someone trying to accomplish the same thing.

import pandas as pd

dtypes = {
    'dwc:parentNameUsageID': str,
    'dwc:acceptedNameUsageID': str,
    'dwc:originalNameUsageID': str,
    'dwc:datasetID': str,
    'dwc:taxonomicStatus': str,
    'dwc:taxonRank': str,
    'dwc:scientificName': str,
    'gbif:genericName': str,
    'dwc:specificEpithet': str,
    'dwc:infraspecificEpithet': str,
    'dwc:nameAccordingTo': str,
    'dwc:namePublishedIn': str,
    'dwc:nomenclaturalStatus': str,
    'dwc:nomenclaturalCode': str,
    'dwc:taxonRemarks': str,
    'dcterms:references': str
}

taxon_fp = 'Taxon.tsv' # 'taxon-small.tsv' 'Taxon.tsv'
out_file = 'Taxon-out.tsv'  # 'taxon-small-out.tsv' 'Taxon-out.tsv'
df = pd.read_csv(taxon_fp, sep="\t", encoding="utf-8", index_col="dwc:taxonID", na_filter=False)

for col, dtype in dtypes.items():
    df[col] = df[col].astype(dtype)

kingdom_list = []

def getKingdomName(taxonID):
    """Follow parent (or accepted-name) links until a kingdom row is found."""
    # Index lookup on dwc:taxonID instead of scanning every row.
    row = df.loc[taxonID]
    if row["dwc:taxonRank"] == "kingdom":
        return row["dwc:scientificName"]
    elif row["dwc:parentNameUsageID"] == '':
        # No parent: fall back to the accepted name, if there is one.
        if row["dwc:acceptedNameUsageID"] == '':
            return ''
        else:
            return getKingdomName(row["dwc:acceptedNameUsageID"])
    else:
        return getKingdomName(row["dwc:parentNameUsageID"])

# Resolve the kingdom for every row, then attach the results as a new column.
for i, row in df.iterrows():
    if row["dwc:taxonRank"] == 'kingdom':
        kingdomName = row["dwc:scientificName"]
    else:
        kingdomName = getKingdomName(i)
    kingdom_list.append(kingdomName)
df['dwc:kingdom'] = kingdom_list
new_columns = []
for item in df.columns:
    new_columns.append(item.replace('dwc:', '').replace('gbif:', '').replace('dcterms:', ''))
df.columns = new_columns
df.index.name = 'taxonID'
df.to_csv(out_file, sep="\t")
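For the read -> update -> write part of the question, the same lookup idea can be sketched with only the standard library. This is a rough sketch, not a tested replacement for the pandas code: it assumes the column names used above, assumes every row has a unique dwc:taxonID, and keeps three short strings per record in memory. The function names (`load_links`, `find_kingdom`, `add_kingdom_column`) are hypothetical. A first pass builds a dictionary of parent links, and a second pass streams each row to the output file as soon as its kingdom is known, caching results so each chain is walked only once.

```python
import csv

def load_links(in_path):
    """First pass: keep only the fields needed to walk up the tree."""
    links = {}
    with open(in_path, "r", encoding="utf-8", newline="") as src:
        for row in csv.DictReader(src, delimiter="\t"):
            # Like the pandas version, fall back to the accepted name
            # when no parent is recorded.
            next_id = row["dwc:parentNameUsageID"] or row.get("dwc:acceptedNameUsageID", "")
            links[row["dwc:taxonID"]] = (
                next_id,
                row["dwc:taxonRank"],
                row["dwc:scientificName"],
            )
    return links

def find_kingdom(taxon_id, links, cache):
    """Walk the links iteratively and cache the answer for every ID visited."""
    path = []
    kingdom = ""
    current = taxon_id
    # Stop on a missing ID, an empty link, or a cycle in the data.
    while current and current in links and current not in path:
        if current in cache:
            kingdom = cache[current]
            break
        next_id, rank, name = links[current]
        path.append(current)
        if rank == "kingdom":
            kingdom = name
            break
        current = next_id
    for visited in path:
        cache[visited] = kingdom
    return kingdom

def add_kingdom_column(in_path, out_path):
    """Second pass: stream rows to out_path, adding a dwc:kingdom column."""
    links = load_links(in_path)
    cache = {}
    with open(in_path, "r", encoding="utf-8", newline="") as src, \
         open(out_path, "w", encoding="utf-8", newline="") as dst:
        reader = csv.DictReader(src, delimiter="\t")
        fieldnames = list(reader.fieldnames) + ["dwc:kingdom"]
        writer = csv.DictWriter(dst, fieldnames=fieldnames, delimiter="\t",
                                extrasaction="ignore")
        writer.writeheader()
        for row in reader:
            row["dwc:kingdom"] = find_kingdom(row["dwc:taxonID"], links, cache)
            writer.writerow(row)  # written immediately, not held until the end

# Example call, using the file names from the question:
# add_kingdom_column("Taxon.tsv", "Taxon-out.tsv")
```

The file is read twice, but each read is a single sequential pass, and the loop replaces recursion so deep chains cannot hit Python's recursion limit. Unlike the pandas code, this sketch keeps the original prefixed column names in the output.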
