I have a large CSV of around 24 million rows, and I want to cut it down in size.
Here is a small preview of the CSV:
I want to remove the rows that have the same CIK and IP. I have a bunch of these files and they take up a lot of space, so I need an efficient way to remove the duplicates.
I've written a script to test how many duplicates of each CIK there are, and for some there are more than 100k; that is why I want to cut those duplicates out.
I've tried a few approaches, but in most cases they failed because of the size of the CSV.
Another quick way is to do it with awk, running from the command line:

awk -F, '!x[$1,$5]++' file.csv > file_uniq.csv

where file.csv is the name of your file and file_uniq.csv is where you want your deduplicated records ($1 and $5 are the column numbers: 1 for ip and 5 for cik). The expression !x[$1,$5]++ is true only the first time a given (ip, cik) pair is seen, so only the first occurrence of each pair is printed.

PS: You should already have awk if you're on Linux/Mac, but may need to download it separately on Windows.
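The same first-occurrence-wins logic can be sketched in plain Python with the csv module, streaming one row at a time so memory only grows with the number of unique (ip, cik) pairs (the column indices 0 and 4 mirror awk's $1 and $5; adjust them to your layout):

```python
import csv

def dedup_csv(in_path, out_path, key_cols=(0, 4)):
    """Write only the first row for each combination of key columns."""
    seen = set()
    with open(in_path, newline="") as fin, open(out_path, "w", newline="") as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        for row in reader:
            key = tuple(row[i] for i in key_cols)
            if key not in seen:  # keep only the first occurrence
                seen.add(key)
                writer.writerow(row)
```

This never holds more than one row of the file in memory at a time, only the set of keys already seen.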
Here is an example using pandas and reduce:
from functools import reduce
import pandas as pd

# Read the file in 100,000-row chunks and fold them together,
# dropping duplicate (cik, ip) pairs after each concatenation.
df = reduce(
    lambda df_i, df_j: pd.concat([df_i, df_j])
                         .drop_duplicates(subset=["cik", "ip"]),
    pd.read_csv("path/to/csv", chunksize=100000)
)
# index=False avoids writing an extra index column to the output.
df.to_csv("path/to/deduplicated/csv", index=False)
This avoids loading the entire file at once (it reads it in 100,000-line chunks instead) and drops duplicates as it goes.
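One caveat: the reduce approach still accumulates the full deduplicated frame in memory before writing. A variant that streams the result to disk as it goes, keeping only the set of seen key pairs in memory, could look like this (a sketch; the column names "ip" and "cik" and the paths are assumptions you would adapt):

```python
import pandas as pd

def dedup_chunks(in_path, out_path, chunksize=100_000):
    """Append each chunk's unseen (ip, cik) rows straight to the output file."""
    seen = set()
    first = True
    for chunk in pd.read_csv(in_path, chunksize=chunksize):
        mask = []
        for key in zip(chunk["ip"], chunk["cik"]):
            if key in seen:
                mask.append(False)  # duplicate of an earlier row
            else:
                seen.add(key)
                mask.append(True)   # first occurrence: keep it
        # Write header only once, then append subsequent chunks.
        chunk[mask].to_csv(out_path, mode="w" if first else "a",
                           header=first, index=False)
        first = False
```

Because keys are added to the set as each chunk is scanned, duplicates within a chunk are caught as well as duplicates across chunks.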
You can do the following:
import pandas as pd

df = pd.read_csv('filepath/filename.csv', sep='your separator')
# keep='first' keeps one row per (cik, ip) pair; keep=False would instead
# drop every row that has a duplicate, including the first occurrence.
df.drop_duplicates(subset=['cik', 'ip'], keep='first', inplace=True)
df.to_csv('filepath/new_filename.csv', sep='your separator', header=True, index=False)
and enjoy your csv without the duplicates.
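Note that the keep parameter of drop_duplicates changes what "removing duplicates" means: keep='first' retains one row per (cik, ip) pair, while keep=False discards every row that appears more than once. A toy example on a hypothetical three-row frame:

```python
import pandas as pd

# Two rows share the same (cik, ip) pair; one row is unique.
df = pd.DataFrame({
    "cik": [100, 100, 200],
    "ip":  ["1.1.1.1", "1.1.1.1", "2.2.2.2"],
    "val": ["a", "b", "c"],
})

kept_one = df.drop_duplicates(subset=["cik", "ip"], keep="first")
dropped_all = df.drop_duplicates(subset=["cik", "ip"], keep=False)

print(len(kept_one))     # 2: one row per unique (cik, ip) pair
print(len(dropped_all))  # 1: only the never-duplicated row survives
```

For shrinking the file while keeping one copy of each record, keep='first' is usually what you want.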