
Pandas iterations to Pyspark Function

Given a dataset, I am trying to create a logic that enforces continuity between two columns per id: the last destination (the to column) must be the exact starting point (the from column) of the next row. For instance, this table

+----+-------+-------+
| id | from  | to    |
+----+-------+-------+
|  1 | A     | B     |
|  1 | C     | A     |
|  2 | D     | D     |
|  2 | F     | G     |
|  2 | F     | F     |
+----+-------+-------+

should ideally look like this:

+----+-------+-------+
| id | from  | to    |
+----+-------+-------+
|  1 | A     | B     |
|  1 | B     | C     |
|  1 | C     | A     |
|  2 | D     | D     |
|  2 | D     | F     |
|  2 | F     | G     |
|  2 | G     | F     |
|  2 | F     | F     |
+----+-------+-------+

Using pandas, I did this by looping over the rows and checking whether previous_row['to'] == current_row['from'], plus a check on id that could probably be avoided with a groupby, as shown below

# Collect the missing transition rows here
appendings = pd.DataFrame(columns=["id", "from", "to"])

for i in range(len(df) - 1):
    # If the next row (same id) does not start where the current row ends,
    # build the missing transition row (current "to" -> next "from").
    if (df.loc[i, "to"] != df.loc[i + 1, "from"]) and (df.loc[i, "id"] == df.loc[i + 1, "id"]):
        new_index = i + 0.5  # fractional index so the row sorts between i and i + 1
        line = pd.DataFrame({"id": df.loc[i, "id"],
                             "from": df.loc[i, "to"],
                             "to": df.loc[i + 1, "from"]}, index=[new_index])
        appendings = pd.concat([appendings, line])

# Merge the new rows back in and restore a clean ordering
new = pd.concat([df, appendings]).sort_index().reset_index(drop=True)

Is it possible to "translate" this as-is to PySpark RDDs?

I am aware that replicating this loop-and-if-else logic is far from optimal in PySpark.

I considered grouping by id, zipping the from and to columns, and working on a single column. The main problem with this is that I could produce a flag on the lines that are "faulty", but there is no way to insert new lines without index-wise operations.
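For reference, a minimal sketch of that flagging step in PySpark's DataFrame API might look like the following (assuming monotonically_increasing_id can stand in for a real ordering column; this is an illustration, not code from the question):

from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

sdf = spark.createDataFrame(
    [(1, "A", "B"), (1, "C", "A"), (2, "D", "D"), (2, "F", "G"), (2, "F", "F")],
    ["id", "from", "to"],
)

# Assumption: the input order is meaningful, so capture it in an explicit column.
sdf = sdf.withColumn("row_order", F.monotonically_increasing_id())
w = Window.partitionBy("id").orderBy("row_order")

# Flag rows whose "from" does not continue the previous row's "to" within the same id.
flagged = sdf.withColumn("prev_to", F.lag("to").over(w)) \
             .withColumn("faulty",
                         F.col("prev_to").isNotNull() & (F.col("prev_to") != F.col("from")))
flagged.show()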

This isn't a PySpark answer, but a partial answer to show you how to achieve the task without a loop in pandas.

You can try:

def f(sub_df):
    return sub_df.assign(to_=np.roll(sub_df.To, 1)) \
                .apply(lambda x: [[x.From, x.To]] if x.to_ == x.From else [[x.to_, x.From], [x.From, x.To]], axis=1) \
                .explode() \
                .apply(pd.Series)


out = df.groupby('id').apply(f) \
        .reset_index(level=1, drop=True) \
        .rename(columns={0: "from", 1: "to"})

Workflow:

  • Group the dataframe by id using groupby
  • For each group:
    • Create a new column (here named to_) holding the previous row's To. np.roll performs a circular shift, so the last value wraps around to the first row (see the short illustration after this list).
    • Depending on whether the current From equals the previous to_: return the current line as-is, or add a new line to make the transition.
    • Use explode to expand the list of lists into one list per row.
    • Convert that column into two columns using apply(pd.Series)
  • Then, for the output dataframe, remove the level 1 index using reset_index
  • And rename the columns using rename
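A quick illustration of the circular shift (not part of the original code):

import numpy as np

to = np.array(["B", "A", "D"])
print(np.roll(to, 1))  # ['D' 'B' 'A'] -- the last value wraps around to the front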

Full code

# Import module
import pandas as pd
import numpy as np

# create dataset
df = pd.DataFrame({"id": [1,1,2,2,2], "From": ["A", "C", "D", "F", "F"], "To": ["B", "D", "D", "G", "F"]})
# print(df)


def f(sub_df):
    return sub_df.assign(to_=np.roll(sub_df.To, 1)) \
                .apply(lambda x: [[x.From, x.To]] if x.to_ == x.From else [[x.to_, x.From], [x.From, x.To]], axis=1) \
                .explode() \
                .apply(pd.Series)


out = df.groupby('id').apply(f) \
        .reset_index(level=1, drop=True) \
        .rename(columns={0: "from", 1: "to"})
print(out)
#    from to
# id
# 1     D  A
# 1     A  B
# 1     B  C
# 1     C  D
# 2     F  D
# 2     D  D
# 2     D  F
# 2     F  G
# 2     G  F
# 2     F  F

The next step is to translate it into PySpark. Try it, and feel free to open a new question with your attempts.
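For what it's worth, one possible PySpark sketch of the same idea uses a window with lag instead of np.roll, reusing the toy data from the question (the row_order column and the 0.5 offset trick are assumptions for illustration, not part of the answer above):

from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

sdf = spark.createDataFrame(
    [(1, "A", "B"), (1, "C", "A"), (2, "D", "D"), (2, "F", "G"), (2, "F", "F")],
    ["id", "from", "to"],
)

# Assumption: the input order is meaningful, so capture it in an explicit column.
sdf = sdf.withColumn("row_order", F.monotonically_increasing_id())
w = Window.partitionBy("id").orderBy("row_order")

# Previous row's "to" within the same id.
with_prev = sdf.withColumn("prev_to", F.lag("to").over(w))

# Rows that break continuity generate the missing transition (prev_to -> from).
# The 0.5 offset makes the new row sort just before the row that triggered it.
missing = with_prev.filter(
    F.col("prev_to").isNotNull() & (F.col("prev_to") != F.col("from"))
).select(
    "id",
    F.col("prev_to").alias("from"),
    F.col("from").alias("to"),
    (F.col("row_order") - F.lit(0.5)).alias("row_order"),
)

result = (
    with_prev.select("id", "from", "to", F.col("row_order").cast("double").alias("row_order"))
    .unionByName(missing)
    .orderBy("id", "row_order")
    .drop("row_order")
)
result.show()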
