Filter unique values in a csv and add the count as a new column

Question

I have a very large csv file (roughly 50 million records) with several columns, for example:

id, state, city, origin, destination, url, type

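In this file I want to find every duplicated row, meaning all rows with exactly the same values in every column, remove the duplicates, and then add a new column containing the number of occurrences.

For example, if I have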

id, state, city, origin, destination, url, type
1, NY, NY, manhattan, times square, http:ny.com, taxi
1, NY, NY, manhattan, times square, http:ny.com, taxi
1, NY, NY, manhattan, times square, http:ny.com, taxi
1, NY, NY, manhattan, times square, http:ny.com, taxi

I want this output:

id, state, city, origin, destination, url, type, count
1, NY, NY, manhattan, times square, http:ny.com, taxi, 4

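where count is the number of times the row is repeated. I know some JavaScript but not Python, but I am willing to use any tool, as long as I can create a new file with the new values and column.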

Answer 1

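If there are no spacing issues (that is, duplicate rows are identical character for character), you can process the file as plain text: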

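with open('input.csv') as fdin, open('output.csv', 'w', newline='\r\n') as fdout:
    header = next(fdin).strip()
    lines = {}  # maps each unique line to the number of times it was seen
    for line in fdin:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        lines[line] = lines.get(line, 0) + 1
    # write the header with the extra column, then each unique line with its count
    print(header + ', count', file=fdout)
    for line, n in lines.items():
        print(line, n, sep=', ', file=fdout)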

The advantage here is that, if there are many duplicates, only the unique lines are stored in memory.
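If the spacing does vary between otherwise identical rows, one possible variation is to parse each row with the standard csv module and count normalized field tuples instead of raw text. This is only a sketch: the function name `count_rows` and the small three-column sample are hypothetical, and it assumes that leading and trailing spaces around fields are not significant.

```python
import csv
import io
from collections import Counter

def count_rows(fdin, fdout):
    """Count identical rows after trimming whitespace around each field."""
    # skipinitialspace drops the spaces that follow each delimiter
    reader = csv.reader(fdin, skipinitialspace=True)
    writer = csv.writer(fdout, lineterminator="\n")
    header = next(reader)
    # strip() also removes trailing spaces; the tuple of fields is the key
    counts = Counter(tuple(f.strip() for f in row) for row in reader if row)
    writer.writerow(header + ["count"])
    for row, n in counts.items():
        writer.writerow(list(row) + [n])

# Hypothetical in-memory sample: the first two rows differ only in spacing
sample = io.StringIO(
    "id,state,city\n"
    "1, NY, NY\n"
    "1,NY,NY \n"
    "2, CA, LA\n"
)
normalized = io.StringIO()
count_rows(sample, normalized)
print(normalized.getvalue())
```

With real files, the two `io.StringIO` objects would be replaced by files opened with `newline=''`, as the csv module documentation recommends. Memory use is still proportional to the number of unique rows.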

If the duplicates are consecutive, it is even simpler: only the most recent line has to be kept in memory.
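The consecutive case could be sketched with `itertools.groupby`, which groups adjacent equal items while streaming through the input. The function name `count_consecutive` is hypothetical, and the sketch assumes that identical rows always sit next to each other (for example because the file is sorted); otherwise the same row would appear several times in the output.

```python
import io
from itertools import groupby

def count_consecutive(fdin, fdout):
    """Collapse runs of identical adjacent lines, appending the run length."""
    header = next(fdin).strip()
    print(header + ", count", file=fdout)
    # Generator: lines are stripped lazily, one at a time
    stripped = (line.strip() for line in fdin)
    for line, run in groupby(stripped):
        if not line:
            continue  # skip blank lines
        print(f"{line}, {sum(1 for _ in run)}", file=fdout)

# In-memory demonstration with the sample rows from the question
sample = io.StringIO(
    "id, state, city, origin, destination, url, type\n"
    + "1, NY, NY, manhattan, times square, http:ny.com, taxi\n" * 4
)
result = io.StringIO()
count_consecutive(sample, result)
print(result.getvalue())
```

Because nothing but the current run is held in memory, this approach should scale to files of any size, provided the adjacency assumption holds.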

Answer 2

If you read the csv into a pandas DataFrame called df, you can apply the following, which returns the number of occurrences of each unique row:

df.groupby(df.columns.to_list()).size()

Answer 3

If you are willing to use pandas, use:

import pandas as pd

df = pd.read_csv("data.csv") # read the csv file as dataframe

data = (
    df.groupby(df.columns.tolist())
    .size()
    .rename("count")
    .to_frame().reset_index()
)

data.to_csv("output.csv", index=False) # exports the dataframe as csv file.

This generates a csv file named output.csv that looks like this:

id, state, city, origin, destination, url, type,count
1, NY, NY, manhattan, times square, http:ny.com, taxi,4
....
....
