Problem converting a pandas DataFrame to an Apache Beam PCollection

Question

I'm trying to convert a pandas DataFrame to an Apache Beam PCollection. Unfortunately, when I use the to_pcollection() function, I get the following error:

AttributeError: 'DataFrame' object has no attribute '_expr'

Does anyone know how to solve it? I'm using pandas=1.1.4, beam=2.25.0 and Python 3.6.9.

Answer 1

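to_pcollection was only ever intended to be applied to Beam's deferred dataframes, but looking at this, it makes sense that it should work, and it isn't obvious how to do the conversion manually. https://github.com/apache/beam/pull/14170 should fix this.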
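For Beam versions that predate that fix, one possible manual route is to turn the DataFrame into plain row dictionaries and hand them to beam.Create. This is a minimal sketch, assuming the DataFrame fits in memory on the machine that builds the pipeline; the helper names dataframe_to_records and run and the sample columns are hypothetical.

```python
import pandas as pd


def dataframe_to_records(df):
    """Return the rows of an in-memory DataFrame as a list of plain dicts."""
    return df.to_dict("records")


def run():
    # Imported lazily so the helper above can be used without Beam installed.
    import apache_beam as beam

    # Hypothetical sample data, only for illustration.
    df = pd.DataFrame({"key": ["a", "b"], "value": [1, 2]})

    with beam.Pipeline() as p:
        (
            p
            # beam.Create builds a PCollection from an in-memory iterable.
            | "CreateFromDataFrame" >> beam.Create(dataframe_to_records(df))
            | "Print" >> beam.Map(print)
        )
```

Calling run() executes the pipeline on the default local runner. Each element of the resulting PCollection is a dict rather than a schema-aware row, so later steps have to look fields up by key.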

Answer 2

I ran into this problem when I used a "native" pandas DataFrame instead of one created by to_dataframe in Beam. I suspect that the dataframe created by Beam wraps or subclasses a pandas DataFrame and adds new attributes (like _expr) that the native pandas DataFrame doesn't have.
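The error message in the question shows that to_pcollection reads an _expr attribute, which a native pandas DataFrame lacks. A small guard can surface that mismatch with a clearer message before the conversion is attempted. This is only a sketch: _expr is an internal detail that may change between Beam versions, and the function names looks_deferred and require_deferred are hypothetical.

```python
import pandas as pd


def looks_deferred(obj):
    """Heuristic check: does the object carry the `_expr` attribute?

    Assumption: Beam's deferred dataframes expose `_expr`, as the
    AttributeError in the question implies. A native DataFrame with a
    column literally named `_expr` would fool this check.
    """
    return hasattr(obj, "_expr")


def require_deferred(obj):
    """Raise a descriptive TypeError for native pandas objects."""
    if not looks_deferred(obj):
        raise TypeError(
            "Expected a Beam deferred dataframe (for example one returned by "
            "apache_beam.dataframe.convert.to_dataframe), got "
            f"{type(obj).__name__}"
        )
    return obj
```

For a native object such as pd.DataFrame({"a": [1]}), looks_deferred returns False and require_deferred raises TypeError.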

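The real answer involves knowing how to use apache_beam.dataframe.convert.to_dataframe, but I couldn't figure out how to set up the proxy object correctly (I get errors later when calling to_pcollection). Since I couldn't get the "right" way to work in 2.25.0 (I'm new to Beam and pandas, and I don't know how proxy objects work, so take all of this with a grain of salt), I use this workaround: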

import apache_beam as beam
import pandas as pd


class SomeDoFn(beam.DoFn):
    def process(self, pair):  # pair is a (key, values) tuple
        df = pd.DataFrame(pair[1])  # build a DataFrame from just the values

        # do something with the dataframe
        ...

        records = df.to_dict('records')

        # emit (key, record) tuples with the same shape as the input elements
        return [(rec["key"], rec) for rec in records]

I invoke it with something like this:

rows = (
    pcoll
    | beam.ParDo(SomeDoFn())
)
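The workaround above can be tried end to end with a small local pipeline. This is a sketch under stated assumptions: the input elements are dicts with hypothetical key and value fields, they are grouped with beam.GroupByKey so each element is a (key, values) pair, and the doubling step stands in for whatever "do something with the dataframe" means in practice. The names transform_group and run are hypothetical as well.

```python
import pandas as pd


def transform_group(pair):
    """Build a DataFrame from one (key, values) pair and emit (key, record) tuples."""
    _, values = pair
    df = pd.DataFrame(list(values))  # values is an iterable of row dicts

    # Hypothetical processing step, standing in for the real pandas work.
    df["value_doubled"] = df["value"] * 2

    return [(rec["key"], rec) for rec in df.to_dict("records")]


def run():
    # Imported lazily so transform_group can be tested without Beam installed.
    import apache_beam as beam

    # Hypothetical sample input, only for illustration.
    rows = [
        {"key": "a", "value": 1},
        {"key": "a", "value": 3},
        {"key": "b", "value": 5},
    ]

    with beam.Pipeline() as p:
        (
            p
            | "Create" >> beam.Create(rows)
            | "KeyByField" >> beam.Map(lambda rec: (rec["key"], rec))
            | "Group" >> beam.GroupByKey()
            | "PandasPerGroup" >> beam.FlatMap(transform_group)
            | "Print" >> beam.Map(print)
        )
```

Keeping the pandas logic in a plain function makes it easy to unit test without a pipeline; beam.FlatMap then plays the same role as the ParDo with SomeDoFn, since both emit every item of the returned list as a separate element.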

I hope someone else will give you a better answer than this workaround.

Answer 1
1 (accepted), 2021-03-09 01:47:41

Answer 2
0, 2020-12-02 21:27:38
