Dask bag to Dataframe Issue

I have data in key-value format. I created a dask bag and then a dataframe from that bag, but when I try to do a group by on that dataframe it throws an error. For the same data, creating a pandas dataframe or a dask dataframe directly works fine.

I think I am missing something. Please help!

I have recreated the issue in the code below.

import pandas as pd
import dask.dataframe as dd
import dask.bag as db

df = pd.DataFrame({'A': [1, 1, 2, None],  'B': [1, 2, 3, 4]})

df.groupby(df.A).count()  # pandas, working 

ddf = dd.from_pandas(df, 2)
ddf.groupby(ddf.A).count().compute() # dask dataframe, working 

bg = db.from_sequence([{'A': 1,'B':1}, {'A': 1,'B': 2}, {'A': 2,'B':3 }, {'A': None, 'B': 4}])
ddf_2 = bg.to_dataframe()
ddf_2 = ddf_2.fillna(0)
ddf_2.groupby(ddf_2.A).count().compute()  # throws error 

..........
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

Note: in the actual scenario the data is in avro files, so I cannot skip the dask bag to dataframe step.

The issue is that the dtypes dask thinks you have aren't the dtypes you actually have. When you call Bag.to_dataframe without specifying the output dtypes, dask assumes that the first partition is representative (loading the whole dataset to check would be expensive) and infers the dataframe dtypes from it, thus inferring 'A' as an integer column.

In [1]: import dask.bag as db

In [2]: bg = db.from_sequence([{'A': 1,'B':1}, {'A': 1,'B': 2}, {'A': 2,'B':3 }, {'A': None, 'B': 4}])

In [3]: ddf = bg.to_dataframe()

In [4]: ddf.dtypes
Out[4]:
A    int64
B    int64
dtype: object
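
As a quick aside (a minimal sketch, not from the original answer): you can peek at what dask sampled by taking elements from the first partition; the sampled records contain no missing 'A', which is why an integer dtype gets inferred.

bg.take(1)   # pulls elements from the first partition only
# -> ({'A': 1, 'B': 1},)  no missing values here, so 'A' is inferred as int64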

In actuality though, 'A' has a missing value later on, and so can't be an integer column (a pandas integer series currently has no missing-value representation; you must use floats). To be robust here you should specify the dtypes of the expected dataframe with the meta keyword:

In [5]: ddf = bg.to_dataframe(meta={'A': float, 'B': int})  # specify 'A' has missing values and must be float

In [6]: ddf2 = ddf.fillna(0).astype({'A': int})  # fill missing with 0, and convert A back to int

In [7]: ddf2.groupby(ddf2.A).count().compute()
Out[7]:
   B
A
1  2
2  1
0  1
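
To see why 'A' can't stay an integer column once a missing value shows up, here is a minimal pandas sketch (not part of the original answer):

import pandas as pd

pd.Series([1, 2, 3]).dtype      # dtype('int64')
pd.Series([1, 2, None]).dtype   # dtype('float64') -- the NaN forces the column to float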

See the docstring of Bag.to_dataframe for more information.
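
For the avro case mentioned in the question the same fix applies; a rough sketch, assuming fastavro is installed and treating the file path and column dtypes below as placeholders for the actual schema:

import dask.bag as db

bag = db.read_avro('data/*.avro')                    # bag of dicts, one per record
ddf = bag.to_dataframe(meta={'A': float, 'B': int})  # declare dtypes instead of letting dask infer them
ddf = ddf.fillna(0).astype({'A': int})               # fill missing values, then convert back to int
result = ddf.groupby(ddf.A).count().compute()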
