
simple python script running very slow (csv file)

I'm running a script to restore some header columns to a CSV file. It takes the original file that has the header columns, reads them into a dictionary, and stitches them back into the file that has lost its header columns.

The issue is that it is incredibly slow. Both files are moderately large (~50 MB), with 200,000 rows by 96 columns. The output file looks correct when I preview it, but it is growing by only about 200 KB every 10 minutes.

I'm an absolute noob at coding, so any help figuring out why the script is so slow would be appreciated.

hapinfile = open('file_with_header_columns', 'r')
hapoutfile = open('file_missing_header_columns.csv', 'r')
o = open('filescombined.txt', 'w')

dictoutfile={}

for line in hapoutfile:
    a=line.rstrip('\n').rstrip('\r').split('\t')
    dictoutfile[a[0]]=a[1:]

hapoutfile.close()

for line in hapinfile:
    q=line.rstrip('\n').rstrip('\r').split('\t')
    g=q[0:11]
    for key, value in dictoutfile.items():
        if g[0] == key:
            g.extend(value)
            o.write(str('\t'.join(g)+'\n'))


hapinfile.close()
o.close()

For starters, you don't need the inner loop in the second part. You're looping over a dictionary there; you should just access the value directly, using g[0] as the key. That saves you a scan over a huge dictionary for every line in the header-less file. If needed, you can check whether g[0] is in the dictionary to avoid KeyErrors.

It's taking forever because the nested for loop uselessly trudges through the dict again and again. Try this:

for line in hapinfile:
    q=line.rstrip('\n').rstrip('\r').split('\t')
    g=q[0:11]
    if g[0] in dictoutfile:
    g.extend(dictoutfile[g[0]])
        o.write('\t'.join(g) + '\n')

from __future__ import with_statement   # if you need it

import csv

with open('file_with_header_columns', 'r') as hapinfile, \
     open('file_missing_header_columns', 'r') as hapoutfile, \
     open('filescombined.txt', 'w') as outfile:
    good_data = csv.reader(hapoutfile, delimiter='\t')
    bad_data = csv.reader(hapinfile, delimiter='\t')
    out_data = csv.writer(outfile, delimiter='\t')
    for data_row in good_data:
        for header_row in bad_data:
            if header_row[0] == data_row[0]:
                out_data.writerow(data_row)
                break   # stop looking through headers

You have a really unfortunate problem here in that you have to do nested loops to find your data. If you could do something like sort the CSV files by the header field, you could get more efficiency. As it is, take advantage of the csv module and condense everything. You can make use of break, which, while a bit odd in a for loop, will at least "short-circuit" you out of the search through the second file once you've found your header.
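
If sorting isn't an option, the two suggestions combine naturally: build the dictionary with csv.reader in one pass, then do hash lookups instead of a nested scan. Here is a minimal sketch under that assumption, reusing the filenames, the tab delimiter, and the 11-column slice from the question (adjust them to your actual files):

import csv

# On Python 3, add newline='' to the output open() call to avoid extra blank lines on Windows.
with open('file_missing_header_columns.csv', 'r') as hapoutfile, \
     open('file_with_header_columns', 'r') as hapinfile, \
     open('filescombined.txt', 'w') as outfile:
    # One pass to build an O(1) lookup table keyed on the first column
    lookup = {}
    for row in csv.reader(hapoutfile, delimiter='\t'):
        lookup[row[0]] = row[1:]
    out = csv.writer(outfile, delimiter='\t')
    # One pass over the other file; each membership test is a hash lookup, not a scan
    for row in csv.reader(hapinfile, delimiter='\t'):
        if row[0] in lookup:
            out.writerow(row[:11] + lookup[row[0]])

This reads each file exactly once, so the run time is linear in the total number of rows rather than proportional to rows-squared.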
