
Pandas DataFrame to Apache Beam PCollection conversion problem

I'm trying to convert a pandas DataFrame to an Apache Beam PCollection. Unfortunately, when I call to_pcollection(), I get the following error:

AttributeError: 'DataFrame' object has no attribute '_expr'

Does anyone know how to solve it? I'm using pandas 1.1.4, Beam 2.25.0, and Python 3.6.9.

to_pcollection was only ever intended to apply to Beam's deferred DataFrames, but looking at this, it makes sense that it should also work on a native pandas DataFrame, and it isn't obvious how to do the conversion manually. https://github.com/apache/beam/pull/14170 should fix this.
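On a Beam version without that fix, one manual route is to turn the in-memory DataFrame into a list of row dictionaries and hand them to beam.Create. The sketch below assumes the DataFrame is small enough to fit in the memory of the process that builds the pipeline. The helper names dataframe_to_records, build_pcollection, and run_example are made up for illustration and are not part of Beam.

```python
import pandas as pd


def dataframe_to_records(df):
    """Turn a native pandas DataFrame into a list of plain row dicts."""
    # to_dict("records") yields one {column: value} dict per row.
    return df.to_dict("records")


def build_pcollection(pipeline, df):
    """Create a PCollection with one dict element per DataFrame row."""
    # Imported lazily so dataframe_to_records works without Beam installed.
    import apache_beam as beam

    return pipeline | "CreateFromDataFrame" >> beam.Create(dataframe_to_records(df))


def run_example():
    """Hypothetical driver: print every row of a tiny DataFrame."""
    import apache_beam as beam

    df = pd.DataFrame({"key": ["a", "b"], "value": [1, 2]})
    with beam.Pipeline() as p:
        _ = build_pcollection(p, df) | "Print" >> beam.Map(print)
```

Calling run_example() builds the pipeline and prints each row dict. The resulting PCollection holds plain dictionaries, not schema-aware rows, so it is a starting point for ordinary Map/ParDo steps, not a Beam deferred DataFrame.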

I get this problem when I use a "native" pandas DataFrame instead of a DataFrame created by to_dataframe within Beam. I suspect that the DataFrame created by Beam wraps or subclasses a pandas DataFrame with new attributes (like _expr) that the native pandas DataFrame doesn't have.

The real answer involves knowing how to use apache_beam.dataframe.convert.to_dataframe, but I can't figure out how to set the proxy object correctly (I get Singleton errors when I later try to use to_pcollection). Since I can't get the "right" way to work in 2.25.0 (I'm new to Beam and pandas, and don't know how proxy objects work, so take all this with a grain of salt), I use this workaround:

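import apache_beam as beam
import pandas as pd


class SomeDoFn(beam.DoFn):
    def process(self, pair):  # pair is a (key, values) tuple
        # build a DataFrame from just the values; list() because the
        # grouped values may arrive as a generic iterable
        df = pd.DataFrame(list(pair[1]))

        # do something with the dataframe
        ...

        records = df.to_dict('records')

        # emit one (key, record) tuple per row, so the output elements
        # are key/value tuples like the input
        return [(rec["key"], rec) for rec in records]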

which I invoke with something like this:

rows = (
    pcoll
    | beam.ParDo(SomeDoFn())
)
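For context, the workaround expects pcoll to hold (key, values) pairs, for example the output of beam.GroupByKey(). As a rough sketch, the per-group logic can also live in a plain function, which makes it easy to check without a runner. The names process_group and run_example, and the doubling of a "value" column, are hypothetical stand-ins for the elided "do something" step. Unlike the DoFn above, this version reuses the group key, so it does not assume each row carries a "key" column.

```python
import pandas as pd


def process_group(pair):
    """Rebuild a DataFrame from one (key, rows) pair and emit (key, row) tuples."""
    key, rows = pair
    # list(...) because grouped values may be an iterable, not a list.
    df = pd.DataFrame(list(rows))

    # Hypothetical transformation standing in for the real per-group work.
    df["value"] = df["value"] * 2

    return [(key, rec) for rec in df.to_dict("records")]


def run_example():
    """Hypothetical driver wiring process_group into a small pipeline."""
    # Imported lazily so process_group can be tested without Beam installed.
    import apache_beam as beam

    with beam.Pipeline() as p:
        _ = (
            p
            | beam.Create([("a", {"value": 1}), ("a", {"value": 2}), ("b", {"value": 3})])
            | beam.GroupByKey()
            # FlatMap flattens the returned list into individual elements.
            | beam.FlatMap(process_group)
            | beam.Map(print)
        )
```

beam.FlatMap with a plain function plays the same role here as beam.ParDo with a DoFn whose process method returns a list.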

I hope others will give you a better answer than this workaround.


 