
Make Python script to read, modify and write TSV file efficiently

I have an 800MB file that I would like to iterate through to find which taxonID belongs to which kingdom. Each row has a parentNameUsageID which points to its parent; I recurse until I reach a parent whose taxonRank is kingdom. The file has over 2M records, which makes this resource intensive. I have a working script that does what I need: I build a dictionary so I can easily access the attributes, and once the whole dictionary is updated with the new kingdom field, I write it to a file. Is there a better way to do it, like read -> update -> write each row to a new file without waiting for the whole thing to finish?

Also, can we do this in a way that uses fewer resources and runs faster than 2 rows per minute (the current speed)? Each recursive call opens the file again; if I don't reopen it, the call only iterates through the remainder of the file rather than the whole file.

A sample data file (840 MB) can be downloaded from here.

Code:

import csv
import time

file_name = "Taxon.tsv"

def getKingdomName(parentID):
    # Walks up the parent chain. Note: this re-opens and re-scans the
    # whole 800MB file on every call, which is why it is so slow.
    with open(file_name, "r", encoding="utf-8") as file:
        tfr = csv.DictReader(file, delimiter="\t")
        for t in tfr:
            if t["dwc:taxonID"] == parentID:
                if t["dwc:taxonRank"] == 'kingdom':
                    return t["dwc:scientificName"]
                return getKingdomName(t["dwc:parentNameUsageID"])


with open(file_name, "r", encoding="utf-8") as file:
    taxon_file = csv.DictReader(file, delimiter="\t")
    print("start:", time.strftime("%H:%M:%S", time.localtime()))

    # Collect the updated rows; the reader is exhausted after this loop,
    # so it cannot be iterated a second time when writing.
    updated_rows = []
    for line in taxon_file:
        line["dwc:kingdom"] = getKingdomName(line["dwc:parentNameUsageID"])
        updated_rows.append(line)

    with open('Taxon-out.tsv', "w", encoding="utf-8", newline='') as output:
        fieldnames = taxon_file.fieldnames + ["dwc:kingdom"]
        writer = csv.DictWriter(output, fieldnames=fieldnames, delimiter="\t")
        writer.writeheader()
        for row in updated_rows:
            writer.writerow(row)

    print("end:", time.strftime("%H:%M:%S", time.localtime()))
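One way to get the streaming read -> update -> write the question asks about is to load just the columns needed for the lookup into a plain dict in a single pass, then resolve each row's kingdom iteratively and write it out immediately. A minimal sketch, assuming the same dwc:-prefixed headers as in Taxon.tsv (the cycle guard is a defensive addition, not something the original code had):

```python
import csv

def build_index(path):
    # One pass over the file: taxonID -> (parentID, rank, name).
    index = {}
    with open(path, encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            index[row["dwc:taxonID"]] = (
                row["dwc:parentNameUsageID"],
                row["dwc:taxonRank"],
                row["dwc:scientificName"],
            )
    return index

def kingdom_of(taxon_id, index):
    # Iterative walk up the parent chain; no recursion, no re-reading the file.
    seen = set()
    while taxon_id in index and taxon_id not in seen:
        seen.add(taxon_id)  # guard against accidental cycles in the data
        parent, rank, name = index[taxon_id]
        if rank == "kingdom":
            return name
        taxon_id = parent
    return ""

def annotate(in_path, out_path):
    index = build_index(in_path)
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8", newline="") as fout:
        reader = csv.DictReader(fin, delimiter="\t")
        writer = csv.DictWriter(
            fout, fieldnames=reader.fieldnames + ["dwc:kingdom"], delimiter="\t")
        writer.writeheader()
        for row in reader:  # each row is written as soon as it is resolved
            row["dwc:kingdom"] = kingdom_of(row["dwc:taxonID"], index)
            writer.writerow(row)
```

The dict holds only three small strings per record, so 2M records fit comfortably in memory, and each lookup is O(1) instead of a full file scan.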

I was able to fix the slow processing with the code below. Performance suffers with pandas if you iterate through each row to search, so the code below avoids that for the lookups: it fetches a row directly through the index column with df.loc[taxonID]. This improves performance from 2 rows per minute to about 1000 rows per second.

@furas: thanks for your help in the comments.

I hope this helps someone trying to accomplish the same thing.

import pandas as pd

dtypes = {
    'dwc:parentNameUsageID': str,
    'dwc:acceptedNameUsageID': str,
    'dwc:originalNameUsageID': str,
    'dwc:datasetID': str,
    'dwc:taxonomicStatus': str,
    'dwc:taxonRank': str,
    'dwc:scientificName': str,
    'gbif:genericName': str,
    'dwc:specificEpithet': str,
    'dwc:infraspecificEpithet': str,
    'dwc:nameAccordingTo': str,
    'dwc:namePublishedIn': str,
    'dwc:nomenclaturalStatus': str,
    'dwc:nomenclaturalCode': str,
    'dwc:taxonRemarks': str,
    'dcterms:references': str
}

taxon_fp = 'Taxon.tsv' # 'taxon-small.tsv' 'Taxon.tsv'
out_file = 'Taxon-out.tsv'  # 'taxon-small-out.tsv' 'Taxon-out.tsv'
df = pd.read_csv(taxon_fp, sep="\t", encoding="utf-8", index_col="dwc:taxonID", na_filter=False)

for col, dtype in dtypes.items():
    df[col] = df[col].astype(dtype)

kingdom_list = []

def getKingdomName(taxonID):
    # df.loc on the "dwc:taxonID" index is a direct lookup,
    # not a row-by-row scan of the whole table.
    row = df.loc[taxonID]
    if row["dwc:taxonRank"] == "kingdom":
        return row["dwc:scientificName"]
    elif row["dwc:parentNameUsageID"] == '':
        # No parent: fall back to the accepted-name chain, if any.
        if row["dwc:acceptedNameUsageID"] == '':
            return ''
        else:
            return getKingdomName(row["dwc:acceptedNameUsageID"])
    else:
        return getKingdomName(row["dwc:parentNameUsageID"])

for i, row in df.iterrows():
    if row["dwc:taxonRank"] == 'kingdom':
        kingdomName = row["dwc:scientificName"]
    else:
        kingdomName = getKingdomName(i)
    kingdom_list.append(kingdomName)
df['dwc:kingdom'] = kingdom_list
new_columns = []
for item in df.columns:
    new_columns.append(item.replace('dwc:', '').replace('gbif:', '').replace('dcterms:', ''))
df.columns = new_columns
df.index.name = 'taxonID'
df.to_csv(out_file, sep="\t")
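The speed-up comes entirely from the index: once "dwc:taxonID" is the DataFrame index, df.loc[key] resolves through a hash-backed index rather than scanning rows. A tiny self-contained illustration of the same pattern, using made-up IDs rather than the real GBIF data:

```python
import pandas as pd

# Toy stand-in for Taxon.tsv, indexed the same way as in the answer.
df = pd.DataFrame(
    {
        "dwc:taxonID": ["k1", "p1", "c1"],
        "dwc:parentNameUsageID": ["", "k1", "p1"],
        "dwc:taxonRank": ["kingdom", "phylum", "class"],
        "dwc:scientificName": ["Plantae", "Tracheophyta", "Liliopsida"],
    }
).set_index("dwc:taxonID")

def kingdom_name(taxon_id):
    # Direct indexed lookup, then walk the parent chain.
    row = df.loc[taxon_id]
    if row["dwc:taxonRank"] == "kingdom":
        return row["dwc:scientificName"]
    return kingdom_name(row["dwc:parentNameUsageID"])

print(kingdom_name("c1"))  # Plantae
```

One caveat for the full dataset: Python's default recursion limit (1000) is far above any plausible taxonomic depth, but a malformed parent cycle in the data would recurse forever, so an iterative loop with a visited set is a safer variant.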
