
Removing duplicates from a large csv file

I have a large csv, around 24 million rows, and I want to cut it down in size.

Here is a little preview of the csv:

[screenshot: preview of the csv]

I want to remove the rows that have the same CIK and IP. I have a bunch of these files and they take up a lot of space, so I'm looking for an efficient way to remove the duplicates.

I ran a test to see how many duplicates of each CIK there are, and for some there are more than 100k; that is why I want to cut those duplicates out.

I've tried a few approaches, but in most cases they failed because of the size of the csv.

Another quick way is to do it with awk, running it from the command line:

awk -F, '!x[$1,$5]++' file.csv > file_uniq.csv

where file.csv is the name of your file and file_uniq.csv is where you want your deduplicated records ($1 and $5 are the column numbers: 1 for ip and 5 for cik).

PS: You should already have awk if you're on Linux/Mac, but you may need to download it separately on Windows.
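
For reference, a rough Python equivalent of that awk one-liner is a single streaming pass that remembers the (ip, cik) pairs it has already written (a sketch; the file names and the 1st/5th column positions are taken from the awk command above):

import csv

seen = set()  # (ip, cik) pairs already written to the output

with open("file.csv", newline="") as src, \
        open("file_uniq.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        key = (row[0], row[4])   # column 1 (ip) and column 5 (cik)
        if key not in seen:      # keep only the first occurrence
            seen.add(key)
            writer.writerow(row)

Like the awk associative array, only the set of distinct (ip, cik) keys is kept in memory, not the rows themselves.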

Here is an example using pandas and reduce:

from functools import reduce

import pandas as pd

# Read the file in 100,000-row chunks and fold them together,
# dropping duplicate (cik, ip) rows after each merge.
df = reduce(
    lambda df_i, df_j: pd.concat([df_i, df_j])
                         .drop_duplicates(subset=["cik", "ip"]),
    pd.read_csv("path/to/csv", chunksize=100000)
)
df.to_csv("path/to/deduplicated/csv")

This avoids opening the entire file at once (reading it in 100,000-row chunks instead) and drops duplicates as it goes.
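
If the reduce call is hard to read, an equivalent explicit loop looks like this (a sketch, assuming the same placeholder paths and lowercase cik/ip column names as above):

import pandas as pd

df = None
for chunk in pd.read_csv("path/to/csv", chunksize=100000):
    # stack the running result with the new chunk, then keep only the
    # first row for each (cik, ip) pair, same as the reduce call above
    df = chunk if df is None else pd.concat([df, chunk])
    df = df.drop_duplicates(subset=["cik", "ip"])

df.to_csv("path/to/deduplicated/csv")

At any point only the deduplicated rows seen so far plus one 100,000-row chunk are held in memory.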

You can do the following:

import pandas as pd

# read the csv (the header row is picked up automatically)
df = pd.read_csv('filepath/filename.csv', sep='your separator')
# keep the first row for each (cik, ip) pair and drop the rest
df.drop_duplicates(subset=['cik', 'ip'], keep='first', inplace=True)
df.to_csv('filepath/new_filename.csv', sep='your separator', header=True, index=False)

and enjoy your csv without the duplicates.
