
How to map results of `dask.DataFrame` to csvs

I create a dataframe with df = dask.dataframe.read_csv('s3://bucket/*.csv') . When I execute a df[df.a.isnull()].compute() operation, I get a set of rows returned that match the filter criteria. I would like to know which files these returned rows belong to, so that I can investigate why such records have null values. The DataFrame has billions of rows, and the records with missing values number in the single digits. Is there an efficient way to do so?

If your CSV files are small, then I recommend creating one partition per file:

df = dd.read_csv('s3://bucket/*.csv', blocksize=None)

And then computing the number of null elements per partition:

counts = df.a.isnull().map_partitions(sum).compute()

You could then find the filenames:

from s3fs import S3FileSystem
s3 = S3FileSystem()
filenames = s3.glob('s3://bucket/*.csv')

And compare the two:

dict(zip(filenames, counts))
