
Filter unique values in csv and add count as a new column

I have a very large CSV file (about 50 million records) with columns like:

id, state, city, origin, destination, url, type

In this file, I want to find every repeated row, meaning all rows that have exactly the same value in every column, remove the duplicates, and then add a new column holding the number of times the row occurred.

For example if I have

id, state, city, origin, destination, url, type
1, NY, NY, manhattan, times square, http:ny.com, taxi
1, NY, NY, manhattan, times square, http:ny.com, taxi
1, NY, NY, manhattan, times square, http:ny.com, taxi
1, NY, NY, manhattan, times square, http:ny.com, taxi

I want to output this

id, state, city, origin, destination, url, type, count
1, NY, NY, manhattan, times square, http:ny.com, taxi, 4

Where count is the number of times the row is repeated. I know some JavaScript but not Python; however, I am willing to use any tool as long as I can create a new file with the new values and columns.

If there are no spacing problems (that is, duplicate rows are written identically, character for character), you could just process the file as text:

with open('input.csv') as fdin, open('output.csv', 'w', newline='\r\n') as fdout:
    header = next(fdin).strip()
    lines = {}  # maps each distinct line to the number of times it occurs
    for line in fdin:
        line = line.strip()
        lines[line] = lines.get(line, 0) + 1
    # write the header with the new column, then one line per distinct row
    print(header + ', count', file=fdout)
    for line, n in lines.items():
        print(f'{line}, {n}', file=fdout)

The nice thing here is that if there are a lot of duplicates, only the unique lines are stored in memory.
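If otherwise identical rows do differ in spacing, one option is to parse each row with the standard csv module and strip every field before counting. The following is only a sketch: the function name count_rows and the file paths are made up for illustration, and it assumes that whitespace around a field is never significant in your data.

```python
import csv
from collections import Counter


def count_rows(in_path, out_path):
    """Write each distinct row of in_path to out_path with a trailing count column."""
    with open(in_path, newline="") as fdin, open(out_path, "w", newline="") as fdout:
        reader = csv.reader(fdin, skipinitialspace=True)
        writer = csv.writer(fdout)
        header = [name.strip() for name in next(reader)]
        # Strip every field so rows that differ only in spacing compare equal;
        # empty rows (blank lines) are skipped.
        counts = Counter(
            tuple(field.strip() for field in row) for row in reader if row
        )
        writer.writerow(header + ["count"])
        for row, n in counts.items():
            writer.writerow(list(row) + [n])
    return counts
```

Calling count_rows("input.csv", "output.csv") would write one line per distinct row. As with the plain-text version, only the distinct rows are held in memory, though parsing every row will probably be slower than comparing raw lines.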

If the duplicates were consecutive, it would be even simpler, and only the last line would need to be kept in memory.
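A minimal sketch of that consecutive case, using itertools.groupby from the standard library. The helper names count_consecutive and write_counts are hypothetical, and the sketch assumes identical rows are always adjacent (for example because the file was sorted beforehand); if they are not, the same row will appear more than once in the output.

```python
from itertools import groupby


def count_consecutive(lines):
    """Yield (line, count) for each run of identical adjacent lines."""
    stripped = (line.strip() for line in lines)
    for line, run in groupby(stripped):
        # Consume the run lazily, so only the current line is held in memory.
        yield line, sum(1 for _ in run)


def write_counts(in_path, out_path):
    """Copy in_path to out_path, collapsing adjacent duplicates and adding a count."""
    with open(in_path) as fdin, open(out_path, "w") as fdout:
        header = next(fdin).strip()
        print(header + ", count", file=fdout)
        for line, n in count_consecutive(fdin):
            print(f"{line}, {n}", file=fdout)
```

For example, write_counts("sorted_input.csv", "output.csv") would stream through the file once without building a dictionary of all distinct lines.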

If you read the csv into a pandas DataFrame called df you can apply the following.

df.groupby(df.columns.to_list()).size()

If you are willing to use pandas, use:

import pandas as pd

df = pd.read_csv("data.csv") # read the csv file as dataframe

data = (
    df.groupby(df.columns.tolist())
    .size()
    .rename("count")
    .to_frame().reset_index()
)

data.to_csv("output.csv", index=False) # exports the dataframe as csv file.

This will produce a csv file named output.csv which looks like:

id, state, city, origin, destination, url, type,count
1, NY, NY, manhattan, times square, http:ny.com, taxi,4
....
....
