Pandas: read_csv reading large csv file with no NaNs
I have a large dataset in .csv file format, around 60 GB, and more than 60% of the values in some columns and rows are missing. Since it is not possible to read such a huge file directly into a Jupyter notebook, I want to read only specific columns and only non-null rows using pandas.read_csv. How can this be done?

Thanks in advance!!
Check the following suggestion from a previous post. The pandas documentation shows that you can read a CSV file while selecting only the columns you want:
import pandas as pd
df = pd.read_csv('some_data.csv', usecols=['col1', 'col2'], low_memory=True)
You can also read the CSV file chunk by chunk and retain only the rows you want to keep:
iter_csv = pd.read_csv('sample.csv', usecols=['col1', 'col2'], iterator=True,
                       chunksize=10000, on_bad_lines='skip')  # error_bad_lines=False in pandas < 1.3
data = pd.concat(chunk.dropna(how='all') for chunk in iter_csv)
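A minimal, self-contained sketch of the chunked pattern above, using an in-memory CSV so it can be run without the original 60 GB file. The file name, column names, and chunk size here are placeholders; in practice you would pass your real path and a chunksize like 10000.

```python
import io
import pandas as pd

# Stand-in for the real CSV file (placeholder data, not from the question).
csv_text = (
    "col1,col2,col3\n"
    "1,a,x\n"
    ",,\n"      # entirely empty row -> dropped by dropna(how='all')
    "3,,y\n"    # partially empty row -> kept
)

# Passing chunksize makes read_csv return an iterator of DataFrames,
# so the whole file never has to fit in memory at once.
chunks = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["col1", "col2"],
    chunksize=2,  # tiny chunks for the demo
)

# Drop rows where every selected column is null, then stitch chunks together.
df = pd.concat(chunk.dropna(how="all") for chunk in chunks)
print(df)
```

Note that `how='all'` only drops rows that are null in every selected column; use `how='any'` (or `dropna(subset=[...])`) if you want rows that are fully non-null in specific columns.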