Convert 2-column counter-like csv file to Python collections.Counter?
I have a comma-separated (,), tab-delimited (\t) file.
68,"phrase"\t
485,"another phrase"\t
43, "phrase 3"\t
Is there a simple approach to throw it into a Python Counter?
You could use a dictionary comprehension, which is considered more Pythonic and can be marginally faster:
    import csv
    from collections import Counter

    def convert_counter_like_csv_to_counter(file_to_convert):
        # file_to_convert is expected to be a pathlib.Path (it must provide .open()).
        with file_to_convert.open(encoding="utf-8") as f:
            csv_reader = csv.DictReader(f, delimiter="\t", fieldnames=["count", "title"])
            the_counter = Counter({row["title"]: int(float(row["count"])) for row in csv_reader})
        return the_counter
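To try the dictionary comprehension without a file on disk, here is a minimal sketch that feeds it in-memory lines. The helper name counter_from_lines and the sample data are made up for illustration, and the sketch assumes each line is a count, a tab, then the phrase, which is what the delimiter and fieldnames in the function above imply.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data: count, a tab, then the phrase (assumed layout).
sample = "68\tphrase\n485\tanother phrase\n4.3e1\tphrase 3\n"

def counter_from_lines(lines):
    # Same dictionary comprehension as above, reading from any iterable of lines.
    reader = csv.DictReader(lines, delimiter="\t", fieldnames=["count", "title"])
    return Counter({row["title"]: int(float(row["count"])) for row in reader})

counts = counter_from_lines(io.StringIO(sample))
print(counts.most_common(1))  # [('another phrase', 485)]
```

Because the result is a real Counter, methods such as most_common() and Counter arithmetic are available on it right away.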
I couldn't let this go and stumbled on what I think is the winner.
In testing it was clear that looping through the rows of the csv.DictReader was the slowest part, taking about 30 of the 40 seconds.
I switched to a plain csv.reader to see what I would get. This produced rows of lists. I wrapped the reader in a dict to see whether it converted directly. It did!
Then I could loop through a native dictionary instead of a csv.DictReader.
The result: 4 million rows done in 3 seconds! 🎉
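    import csv
    from collections import Counter

    def convert_counter_like_csv_to_counter(file_to_convert):
        with file_to_convert.open(encoding="utf-8") as f:
            csv_reader = csv.reader(f, delimiter="\t")
            # Caveat: dict() uses the first column (the count) as the key,
            # so rows that share the same count overwrite one another.
            d = dict(csv_reader)
            the_counter = Counter({phrase: int(float(count)) for count, phrase in d.items()})
        return the_counter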
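One thing to watch with dict(csv_reader): each row is [count, phrase], so the count becomes the dictionary key and any two phrases with the same count collapse into one entry. A possible variant, sketched below, unpacks the rows straight from csv.reader instead. The helper name counter_from_rows and the sample data are made up for illustration, I have not timed this against the version above, and it assumes every row has exactly two fields (a blank line would raise a ValueError on unpacking).

```python
import csv
import io
from collections import Counter

def counter_from_rows(lines):
    # Unpack each [count, phrase] row straight from csv.reader,
    # so the phrase (not the count) is the dictionary key.
    reader = csv.reader(lines, delimiter="\t")
    return Counter({phrase: int(float(count)) for count, phrase in reader})

# Hypothetical data in which two phrases share the count 68.
sample = "68\tphrase\n68\tother phrase\n485\tanother phrase\n"

via_dict = dict(csv.reader(io.StringIO(sample), delimiter="\t"))
print(len(via_dict))                                # 2: one of the 68 rows was overwritten
print(len(counter_from_rows(io.StringIO(sample))))  # 3: every phrase is kept
```

If every count in the file happens to be unique, both approaches give the same Counter.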
Here's my best attempt. It works but isn't the fastest.
It originally took about 1.5 minutes to run on a 4-million-line input file. After the suggestion by Daniel Mesejo, it now takes about 40 seconds on the same file.
Note: the count value in the csv can be in scientific notation and needs conversion, hence the int(float(...)) cast.
    import csv
    from collections import Counter

    def convert_counter_like_csv_to_counter(file_to_convert):
        the_counter = Counter()
        with file_to_convert.open(encoding="utf-8") as f:
            csv_reader = csv.DictReader(f, delimiter="\t", fieldnames=["count", "title"])
            for row in csv_reader:
                # Go through float() first so counts written in scientific notation parse.
                the_counter[row["title"]] = int(float(row["count"]))
        return the_counter
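The int(float(...)) cast from the note above can be checked in isolation. This is a small sketch with made-up values: int() on its own rejects a string in scientific notation, while float() accepts it.

```python
# int() alone rejects scientific notation, so go through float() first.
raw = "1.5e3"
try:
    int(raw)
except ValueError:
    print("int() cannot parse", raw)

print(int(float(raw)))   # 1500
print(int(float("68")))  # 68
```

Two things to keep in mind: int() truncates toward zero, so a fractional count such as "2.9" becomes 2, and a float only represents integers exactly up to 2**53, so extremely large counts could lose precision.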