Convert a 2-column counter-like CSV file to a Python collection

Question

I have a comma-separated (,), tab-delimited (\t) file:

68,"phrase"\t
485,"another phrase"\t
43, "phrase 3"\t

Is there a simple approach to throw it into a Python Counter?

Answer 1

You could use a dictionary comprehension, which is considered more Pythonic and can be marginally faster:

import csv
from collections import Counter


def convert_counter_like_csv_to_counter(file_to_convert):
    with file_to_convert.open(encoding="utf-8") as f:
        csv_reader = csv.DictReader(f, delimiter="\t", fieldnames=["count", "title"])
        the_counter = Counter({row["title"]: int(float(row["count"])) for row in csv_reader})
    return the_counter
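
The function calls file_to_convert.open(...), so it appears to expect a pathlib.Path rather than a plain filename string. A minimal usage sketch, assuming rows of the form count, tab, phrase (the file name counts.tsv and the sample contents are made up for illustration):

```python
import csv
import tempfile
from collections import Counter
from pathlib import Path


def convert_counter_like_csv_to_counter(file_to_convert):
    # Same approach as the answer above: DictReader plus a dict comprehension.
    with file_to_convert.open(encoding="utf-8") as f:
        csv_reader = csv.DictReader(f, delimiter="\t", fieldnames=["count", "title"])
        return Counter({row["title"]: int(float(row["count"])) for row in csv_reader})


def demo():
    # Hypothetical sample data: count, a tab, then the quoted phrase.
    sample = '68\t"phrase"\n485\t"another phrase"\n43\t"phrase 3"\n'
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "counts.tsv"
        path.write_text(sample, encoding="utf-8")
        return convert_counter_like_csv_to_counter(path)


if __name__ == "__main__":
    print(demo().most_common(1))
```

The csv module strips the surrounding double quotes, so the Counter keys are the bare phrases.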

Answer 2

I couldn't let this go and stumbled on what I think is the winner.

In testing it was clear that looping through the rows of the csv.DictReader was the slowest part, taking about 30 of the 40 seconds.

I switched it to a simple csv.reader to see what I would get. This resulted in rows of lists. I wrapped this in a dict to see if it converted directly. It did!

Then I could loop through a native dictionary instead of a csv.DictReader.

The result... done with 4 million rows in 3 seconds! 🎉

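import csv
from collections import Counter


def convert_counter_like_csv_to_counter(file_to_convert):
    with file_to_convert.open(encoding="utf-8") as f:
        csv_reader = csv.reader(f, delimiter="\t")
        # dict() turns each [count, phrase] row into a count -> phrase entry.
        # Caveat: rows that share the same count string overwrite each other.
        d = dict(csv_reader)
        the_counter = Counter({phrase: int(float(count)) for count, phrase in d.items()})

    return the_counter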
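
One thing to watch with dict(csv_reader): the dict is keyed on the first column, which here is the count, so two phrases that share the same count string collapse into a single entry. A minimal sketch of a variant that unpacks each row directly and so keeps every phrase, assuming exactly two columns per row (the name rows_to_counter is hypothetical, and its speed relative to the version above has not been measured here):

```python
import csv
import io
from collections import Counter


def rows_to_counter(lines):
    # Unpack each two-column row straight from csv.reader; no intermediate dict,
    # so rows that share a count do not overwrite one another.
    reader = csv.reader(lines, delimiter="\t")
    return Counter({phrase: int(float(count)) for count, phrase in reader})


# Made-up sample in which two phrases share the count "68".
sample = '68\t"phrase"\n68\t"other phrase"\n485\t"another phrase"\n'

via_dict = dict(csv.reader(io.StringIO(sample), delimiter="\t"))
direct = rows_to_counter(io.StringIO(sample))

print(len(via_dict))  # the count-keyed dict loses one row
print(len(direct))    # all three phrases are kept
```

If every count in the file is known to be unique, the two versions produce the same Counter.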

Answer 3

Here's my best attempt. It works but isn't the fastest.
~~Takes about 1.5 minutes to run on a 4 million line input file.~~
Now takes about 40 seconds on a 4 million line input file after the suggestion by Daniel Mesejo.

Note: the count value in the CSV can be in scientific notation and needs conversion, hence the int(float(...)) casting.

import csv
from collections import Counter

def convert_counter_like_csv_to_counter(file_to_convert):

    the_counter = Counter()
    with file_to_convert.open(encoding="utf-8") as f:
        csv_reader = csv.DictReader(f, delimiter="\t", fieldnames=["count", "title"])
        for row in csv_reader:
            the_counter[row["title"]] = int(float(row["count"]))

    return the_counter
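
The note about scientific notation can be seen in isolation: int() rejects a string such as "1.2e3", while float() parses it, so the two casts are chained. A small sketch (the helper name parse_count is made up for illustration):

```python
def parse_count(text):
    # float() understands scientific notation; int() then truncates to a whole number.
    return int(float(text))


for raw in ["68", "1.2e3", "4E2"]:
    print(raw, "->", parse_count(raw))

try:
    int("1.2e3")
except ValueError as exc:
    # int() on its own cannot parse scientific notation.
    print("int() alone fails:", exc)
```

Because the value passes through a float, counts larger than 2**53 may lose precision on the way.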

3 answers

Answer 1
1 2018-12-06 02:13:23

Answer 2
1 accepted 2018-12-06 04:46:59

Answer 3
0 2018-12-06 01:42:36
