Error in pyspark.pandas when trying to reindex columns

Question

I have a DataFrame that is missing some records, and I want to generate those records with their values set to 0.

My df looks like this:

import pyspark.pandas as ps
import databricks.koalas as ks
import pandas as pd

data = {'Region': ['Africa','Africa','Africa','Africa','Africa','Africa','Africa','Asia','Asia','Asia'],
         'Country': ['South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','Japan','Japan','Japan'],
         'Product': ['ABC','ABC','ABC','XYZ','XYZ','XYZ','XYZ','DEF','DEF','DEF'],
         'Year': [2016, 2018, 2019,2016, 2017, 2018, 2019,2016, 2017, 2019],
         'Price': [500, 0,450,750,0,0,890,19,120,3],
         'Quantity': [1200,0,330,500,190,70,120,300,50,80],
         'Value': [600000,0,148500,350000,0,29100,106800,74300,5500,20750]}

df = ps.DataFrame(data)

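Some entries are missing from this df, for example 2017 for South Africa (product ABC) and 2018 for Japan.

I want to generate these entries and put 0 in the Quantity, Price, and Value columns.

I managed to do this with pandas on a smaller dataset, but when I try to do the same with pyspark.pandas, I get an error.

Here is my code so far: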

(df.set_index(['Region', 'Country','Product','Year'])
   .reindex(ps.MultiIndex.from_product([df['Region'].unique(), 
                                        df['Country'].unique(),
                                        df['Product'].unique(),
                                        df['Year'].unique()], 
                                       names=['Region', 'Country','Product','Year']), 
            fill_value=0)
   .reset_index())

Whenever I run it, I get the following error:

PandasNotImplementedError: The method `pd.Series.__iter__()` is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead.

Any idea why this happens and how to fix it?

Answer 1

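OK, after looking more closely at the syntax of ps.MultiIndex.from_product, the values have to be passed as lists, so I had to add .tolist() after the unique values of each column. In pyspark.pandas, unique() returns a pyspark.pandas Series rather than a NumPy array, and from_product iterates over its inputs, which appears to be what triggers the `pd.Series.__iter__()` error. Calling .tolist() collects the unique values to the driver as a plain Python list. The working code: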

(df.set_index(['Region', 'Country','Product','Year'])
   .reindex(ps.MultiIndex.from_product([df['Region'].unique().tolist(), 
                                        df['Country'].unique().tolist(),
                                        df['Product'].unique().tolist(),
                                        df['Year'].unique().tolist()], 
                                       names=['Region', 'Country','Product','Year']), 
            fill_value=0)
   .reset_index())
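
One thing to be aware of: from_product builds every combination of the unique values, so the result also contains rows such as Region 'Africa' with Country 'Japan'. The sketch below is not part of the original answer. It uses plain pandas (not pyspark.pandas) on the sample data to show the size of the full product, and then one possible alternative that only fills missing years for the (Region, Country, Product) groups that already exist. The names `keys`, `groups`, `years`, `grid`, and `filled` are made up for illustration, and `how='cross'` assumes pandas 1.2 or later. Whether the same calls work unchanged in pyspark.pandas has not been checked here.

```python
import pandas as pd

data = {'Region': ['Africa'] * 7 + ['Asia'] * 3,
        'Country': ['South Africa'] * 7 + ['Japan'] * 3,
        'Product': ['ABC'] * 3 + ['XYZ'] * 4 + ['DEF'] * 3,
        'Year': [2016, 2018, 2019, 2016, 2017, 2018, 2019, 2016, 2017, 2019],
        'Price': [500, 0, 450, 750, 0, 0, 890, 19, 120, 3],
        'Quantity': [1200, 0, 330, 500, 190, 70, 120, 300, 50, 80],
        'Value': [600000, 0, 148500, 350000, 0, 29100, 106800, 74300, 5500, 20750]}
df = pd.DataFrame(data)
keys = ['Region', 'Country', 'Product', 'Year']

# Same approach as the answer: full Cartesian product of all unique values.
full_index = pd.MultiIndex.from_product(
    [df[k].unique().tolist() for k in keys], names=keys)
full = df.set_index(keys).reindex(full_index, fill_value=0).reset_index()
print(len(full))  # 2 regions * 2 countries * 3 products * 4 years

# Alternative (assumption, not from the answer): keep only the existing
# (Region, Country, Product) groups and cross them with every year.
groups = df[keys[:3]].drop_duplicates()
years = pd.DataFrame({'Year': sorted(df['Year'].unique().tolist())})
grid = groups.merge(years, how='cross')

# Left-join the original rows; combinations with no match get NaN, then 0.
filled = (grid.merge(df, on=keys, how='left')
              .fillna(0)
              .astype({'Price': int, 'Quantity': int, 'Value': int}))
print(len(filled))
```

With the sample data, the second variant adds only the two missing rows (ABC in 2017 and DEF in 2018) instead of padding the frame with combinations that never occur.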

Answer 1 | 0 | 2022-08-10 13:51:23