
Pandas: read_csv reading large csv file with no NaNs

I have a large dataset in .csv format, around 60 GB, in which more than 60% of the data is missing in some columns and rows. Since it is not possible to read such a huge file directly into a Jupyter notebook, I want to read only specific columns and only non-null rows using pandas.read_csv. How can this be done?

Thanks in advance!!

Check the following suggestion from a previous post.

The pandas documentation suggests you can read a csv file selecting only the columns you want to read:

import pandas as pd

df = pd.read_csv('some_data.csv', usecols=['col1', 'col2'], low_memory=True)
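
If the column names are not known up front, they can be checked without loading any data rows. A minimal sketch, assuming the same file name as above:

import pandas as pd

# read only the header row to see which columns are available
column_names = pd.read_csv('some_data.csv', nrows=0).columns.tolist()
print(column_names)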

You can read the CSV file chunk by chunk and retain only the rows you want to keep:

iter_csv = pd.read_csv('sample.csv', usecols=['col1', 'col2'], iterator=True, chunksize=10000, error_bad_lines=False)
data = pd.concat([chunk.dropna(how='all') for chunk in iter_csv])
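
If the goal is to keep only rows where the selected columns themselves are non-null (rather than only dropping rows that are entirely empty), dropna with a subset can be applied inside the same chunked loop. A minimal sketch, assuming the file name, the column names col1 and col2, and a chunk size chosen to fit the available memory:

import pandas as pd

chunks = []
for chunk in pd.read_csv('some_data.csv', usecols=['col1', 'col2'], chunksize=100_000):
    # keep only rows where both selected columns have a value
    chunks.append(chunk.dropna(subset=['col1', 'col2']))

data = pd.concat(chunks, ignore_index=True)

Processing one chunk at a time keeps memory usage bounded by the chunk size, so only the filtered rows of the 60 GB file ever accumulate in memory.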

