Filter unique values in csv and add count as a new column

Question

I have a very large CSV file (about 50 million records) with columns like:

id, state, city, origin, destination, url, type

In this file, I want to find every repeated row, meaning all rows that have exactly the same value in every column, remove the duplicates, and then add a new column with the number of times the row was repeated.

For example, if I have

id, state, city, origin, destination, url, type
1, NY, NY, manhattan, times square, http:ny.com, taxi
1, NY, NY, manhattan, times square, http:ny.com, taxi
1, NY, NY, manhattan, times square, http:ny.com, taxi
1, NY, NY, manhattan, times square, http:ny.com, taxi

I want to output this

id, state, city, origin, destination, url, type, count
1, NY, NY, manhattan, times square, http:ny.com, taxi, 4

Where count is the number of times the row is repeated. I know some JavaScript but not Python; however, I am willing to use any tool as long as I can create a new file with the new values and columns.

Answer 1

If there are no spacing problems (that is, identical rows are also identical as text), you could just process the file as text:

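with open('input.csv') as fdin, open('output.csv', 'w', newline='\r\n') as fdout:
    header = next(fdin).strip()
    lines = {}  # unique line -> number of occurrences
    for line in fdin:
        line = line.strip()
        if line:  # ignore blank lines, e.g. at the end of the file
            lines[line] = lines.get(line, 0) + 1
    # the header gets the extra column name
    print(header, 'count', sep=', ', file=fdout)
    for line, n in lines.items():
        print(line, n, sep=', ', file=fdout)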

The nice point here is that only the unique lines are stored in memory, so the more duplicates there are, the less memory is needed.
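If spacing does differ between otherwise identical rows, the same counting idea can be combined with the standard csv module. The following is only a sketch under assumptions: the function name count_unique_rows and the choice to strip whitespace around every field are mine, not part of the question, and stripping may not be what you want if leading or trailing blanks are significant in your data.

```python
import csv
from collections import Counter

def count_unique_rows(in_path, out_path):
    """Write each distinct row of in_path once, followed by its count."""
    with open(in_path, newline='') as fdin, \
         open(out_path, 'w', newline='') as fdout:
        # skipinitialspace drops the blanks that follow each comma
        reader = csv.reader(fdin, skipinitialspace=True)
        writer = csv.writer(fdout)
        header = [name.strip() for name in next(reader)]
        # rows become tuples so they can serve as dictionary keys;
        # empty rows (blank lines) are skipped
        counts = Counter(
            tuple(field.strip() for field in row)
            for row in reader if row
        )
        writer.writerow(header + ['count'])
        for row, n in counts.items():
            writer.writerow(list(row) + [n])
    return counts
```

As with the plain-text version, only the distinct rows are held in memory. Calling count_unique_rows('input.csv', 'output.csv') would write the deduplicated file, with fields separated by a bare comma.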

If the duplicates were consecutive, it would be even simpler, and only the last line would need to be stored in memory.
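A minimal sketch of that consecutive case, assuming identical rows always sit next to each other (for example because the file was sorted beforehand). The function name count_consecutive is hypothetical, and itertools.groupby is one possible way to do it rather than the only one:

```python
from itertools import groupby

def count_consecutive(in_path, out_path):
    """Collapse runs of identical adjacent lines into one line plus a count."""
    with open(in_path) as fdin, open(out_path, 'w') as fdout:
        header = next(fdin).strip()
        print(header, 'count', sep=', ', file=fdout)
        stripped = (line.strip() for line in fdin)
        non_empty = (line for line in stripped if line)
        # groupby only merges neighbours, so a single run is held at a time
        for line, run in groupby(non_empty):
            print(line, sum(1 for _ in run), sep=', ', file=fdout)
```

Note that if the same row shows up again later in the file, after other rows, it is written a second time with its own count, so this version is only correct when the duplicates really are adjacent.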

Answer 2

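If you read the csv into a pandas DataFrame called df, you can apply the following. The result is a Series indexed by the unique combinations of column values, with the number of occurrences as its values.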

df.groupby(df.columns.to_list()).size()

Answer 3

If you are willing to use pandas, use:

import pandas as pd

df = pd.read_csv("data.csv") # read the csv file as dataframe

data = (
    df.groupby(df.columns.tolist())
    .size()
    .rename("count")
    .to_frame().reset_index()
)

data.to_csv("output.csv", index=False) # exports the dataframe as csv file.

This will produce a csv file named output.csv which looks like:

id, state, city, origin, destination, url, type,count
1, NY, NY, manhattan, times square, http:ny.com, taxi,4
....
....

3 answers

solution1
1 ACCEPTED 2020-05-12 09:05:17

solution2
0 2020-05-12 08:45:04

solution3
0 2020-05-12 08:52:24
