Pandas DataFrame to Apache Beam PCollection conversion problem

Question

I'm trying to convert a pandas DataFrame to an Apache Beam PCollection. Unfortunately, when I use the to_pcollection() function, I get the following error:

AttributeError: 'DataFrame' object has no attribute '_expr'

Does anyone know how to solve it? I'm using pandas=1.1.4, beam=2.25.0 and Python 3.6.9.

Answer 1

to_pcollection was only ever intended to apply to Beam's deferred DataFrames, but looking at this, it makes sense that it should work on a pandas DataFrame too, and it isn't obvious how to do the conversion manually. https://github.com/apache/beam/pull/14170 should fix this.

Answer 2

I get this problem when I use a "native" Pandas dataframe instead of a dataframe created by to_dataframe within Beam. I suspect that the dataframe created by Beam wraps or subclasses a Pandas dataframe with new attributes (like _expr) that the native Pandas dataframe doesn't have.

The real answer involves knowing how to use apache_beam.dataframe.convert.to_dataframe, but I can't figure out how to set the proxy object correctly (I get Singleton errors when I later try to use to_pcollection). So, since I can't get the "right" way to work in 2.25.0 (I'm new to Beam and Pandas, and don't know how proxy objects work, so take all this with a grain of salt), I use this workaround:

import apache_beam as beam
import pandas as pd

class SomeDoFn(beam.DoFn):
    def process(self, pair):  # pair is a (key, values) tuple
        # materialize the values so pandas receives a plain list of rows
        df = pd.DataFrame(list(pair[1]))

        ## do something with the dataframe
        ...

        records = df.to_dict('records')

        # return tuples with the same shape as the one we received;
        # this assumes every row has a "key" column
        return [(rec["key"], rec) for rec in records]

which I invoke with something like this:

rows = (
    pcoll
    | beam.ParDo(SomeDoFn())
)

I hope others will give you a better answer than this workaround.

Answer 1
Score 1, accepted, 2021-03-09 01:47:41

Answer 2
Score 0, 2020-12-02 21:27:38
