Pandas: read_csv reading large csv file with no NaNs
I have a large dataset in .csv file format, around 60 GB, and more than 60% of the values in some columns and rows are missing. Since it is not possible to read such a huge file directly into a Jupyter notebook, I want to read only specific columns and only non-null rows using pandas.read_csv. How can this be done?

Thanks in advance!!
Check the following suggestion from a previous post. The pandas documentation shows that you can read a CSV file while selecting only the columns you want:
import pandas as pd
df = pd.read_csv('some_data.csv', usecols=['col1', 'col2'], low_memory=True)
You can also read the CSV file chunk by chunk and retain only the rows you want to keep:
iter_csv = pd.read_csv('sample.csv', usecols=['col1', 'col2'], iterator=True,
                       chunksize=10000, on_bad_lines='skip')  # error_bad_lines=False in pandas < 1.3
data = pd.concat(chunk.dropna(how='all') for chunk in iter_csv)
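A minimal, self-contained sketch of the chunked pattern above, using an in-memory CSV so it can be run without the original 60 GB file. The file name, column names, and chunk size here are placeholders; in practice you would pass your real path and a chunksize like 10000.

```python
import io
import pandas as pd

# Stand-in for the real CSV file (placeholder data, not from the question).
csv_text = (
    "col1,col2,col3\n"
    "1,a,x\n"
    ",,\n"      # entirely empty row -> dropped by dropna(how='all')
    "3,,y\n"    # partially empty row -> kept
)

# Passing chunksize makes read_csv return an iterator of DataFrames,
# so the whole file never has to fit in memory at once.
chunks = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["col1", "col2"],
    chunksize=2,  # tiny chunks for the demo
)

# Drop rows where every selected column is null, then stitch chunks together.
df = pd.concat(chunk.dropna(how="all") for chunk in chunks)
print(df)
```

Note that `how='all'` only drops rows that are null in every selected column; use `how='any'` (or `dropna(subset=[...])`) if you want rows that are fully non-null in specific columns.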