Pandas Dataframe to Apache Beam PCollection conversion problem
I'm trying to convert a pandas DataFrame to an Apache Beam PCollection. Unfortunately, when I use the to_pcollection() function, I get the following error:
AttributeError: 'DataFrame' object has no attribute '_expr'
Does anyone know how to solve it? I'm using pandas=1.1.4, beam=2.25.0 and Python 3.6.9.
to_pcollection was only ever intended to apply to Beam's deferred DataFrames, but looking at this, it makes sense that it should also work on a plain pandas DataFrame, and it isn't obvious how to do the conversion manually. https://github.com/apache/beam/pull/14170 should fix this.
I get this problem when I use a "native" Pandas dataframe instead of a dataframe created by to_dataframe within Beam. I suspect that the dataframe created by Beam wraps or subclasses a Pandas dataframe with new attributes (like _expr) that the native Pandas dataframe doesn't have.
The real answer involves knowing how to use apache_beam.dataframe.convert.to_dataframe, but I can't figure out how to set the proxy object correctly (I get Singleton errors when I later try to use to_pcollection).
So since I can't get the "right" way to work in 2.25.0 (I'm new to Beam and Pandas, and don't know how proxy objects work, so take all this with a grain of salt), I use this workaround:
import apache_beam as beam
import pandas as pd

class SomeDoFn(beam.DoFn):
    def process(self, pair):  # pair is a key/value tuple
        df = pd.DataFrame(pair[1])  # just the array of values
        ## do something with the dataframe
        ...
        records = df.to_dict('records')
        # return (key, record) tuples with the same shape as the one we received
        return [(rec["key"], rec) for rec in records]
which I invoke with something like this:
rows = (
    pcoll
    | beam.ParDo(SomeDoFn())
)
I hope others will give you a better answer than this workaround.