
Pandas iterations to Pyspark Function

Given a dataset, I am trying to create a logic that enforces continuity between two columns per id: the last destination (the to column) must be the exact starting point (the from column) of the next row. For instance, this table

+----+-------+-------+
| id | from  | to    |
+----+-------+-------+
|  1 | A     | B     |
|  1 | C     | A     |
|  2 | D     | D     |
|  2 | F     | G     |
|  2 | F     | F     |
+----+-------+-------+

should ideally look like this:

+----+-------+-------+
| id | from  | to    |
+----+-------+-------+
|  1 | A     | B     |
|  1 | B     | C     |
|  1 | C     | A     |
|  2 | D     | D     |
|  2 | D     | F     |
|  2 | F     | G     |
|  2 | G     | F     |
|  2 | F     | F     |
+----+-------+-------+

Using pandas, I did this by looping over the rows and checking whether previous_row['to'] == current_row['from'], plus a check on id that could probably be avoided with a groupby, as shown below

# Collect the missing transition rows here
appendings = pd.DataFrame(columns=["id", "from", "to"])

for i in range(len(df) - 1):
    # If the next row (same id) does not start where the current row ends,
    # build the missing transition row (current "to" -> next "from").
    if (df.loc[i, "to"] != df.loc[i + 1, "from"]) and (df.loc[i, "id"] == df.loc[i + 1, "id"]):
        new_index = i + 0.5  # fractional index so the row sorts between i and i + 1
        line = pd.DataFrame({"id": df.loc[i, "id"],
                             "from": df.loc[i, "to"],
                             "to": df.loc[i + 1, "from"]}, index=[new_index])
        appendings = pd.concat([appendings, line])

# Merge the new rows back in and restore a clean ordering
new = pd.concat([df, appendings]).sort_index().reset_index(drop=True)

Is it possible to "translate" this as-is to PySpark RDDs?

I am aware that replicating this loop-and-if-else logic is far from optimal in PySpark.

I considered grouping by id, zipping the from and to columns, and working on a single column. The main problem with this is that I could produce a flag on the lines that are "faulty", but there is no way to insert new lines without index-wise operations.
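For reference, a minimal sketch of that flagging step in PySpark's DataFrame API might look like the following (assuming monotonically_increasing_id can stand in for a real ordering column; this is an illustration, not code from the question):

from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

sdf = spark.createDataFrame(
    [(1, "A", "B"), (1, "C", "A"), (2, "D", "D"), (2, "F", "G"), (2, "F", "F")],
    ["id", "from", "to"],
)

# Assumption: the input order is meaningful, so capture it in an explicit column.
sdf = sdf.withColumn("row_order", F.monotonically_increasing_id())
w = Window.partitionBy("id").orderBy("row_order")

# Flag rows whose "from" does not continue the previous row's "to" within the same id.
flagged = sdf.withColumn("prev_to", F.lag("to").over(w)) \
             .withColumn("faulty",
                         F.col("prev_to").isNotNull() & (F.col("prev_to") != F.col("from")))
flagged.show()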

This isn't a PySpark answer, but a partial answer to show you how to achieve the task without a loop in pandas.

You can try:

def f(sub_df):
    return sub_df.assign(to_=np.roll(sub_df.To, 1)) \
                .apply(lambda x: [[x.From, x.To]] if x.to_ == x.From else [[x.to_, x.From], [x.From, x.To]], axis=1) \
                .explode() \
                .apply(pd.Series)


out = df.groupby('id').apply(f) \
        .reset_index(level=1, drop=True) \
        .rename(columns={0: "from", 1: "to"})

Workflow:

  • Group the dataframe by id using groupby
  • For each group:
    • Create a new column (here named to_) holding the previous row's To. np.roll performs a circular shift, so the last value wraps around to the first row (see the short illustration after this list).
    • Depending on whether the current From equals the previous to_: return the current line as-is, or add a new line to make the transition.
    • Use explode to expand the list of lists into one list per row.
    • Convert that column into two columns using apply(pd.Series)
  • Then, for the output dataframe, remove the level 1 index using reset_index
  • And rename the columns using rename
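A quick illustration of the circular shift (not part of the original code):

import numpy as np

to = np.array(["B", "A", "D"])
print(np.roll(to, 1))  # ['D' 'B' 'A'] -- the last value wraps around to the front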

Full code

# Import module
import pandas as pd
import numpy as np

# create dataset
df = pd.DataFrame({"id": [1,1,2,2,2], "From": ["A", "C", "D", "F", "F"], "To": ["B", "D", "D", "G", "F"]})
# print(df)


def f(sub_df):
    return sub_df.assign(to_=np.roll(sub_df.To, 1)) \
                .apply(lambda x: [[x.From, x.To]] if x.to_ == x.From else [[x.to_, x.From], [x.From, x.To]], axis=1) \
                .explode() \
                .apply(pd.Series)


out = df.groupby('id').apply(f) \
        .reset_index(level=1, drop=True) \
        .rename(columns={0: "from", 1: "to"})
print(out)
#    from to
# id
# 1     D  A
# 1     A  B
# 1     B  C
# 1     C  D
# 2     F  D
# 2     D  D
# 2     D  F
# 2     F  G
# 2     G  F
# 2     F  F

The next step is to translate it into PySpark. Try it, and feel free to open a new question with your attempts.
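For what it's worth, one possible PySpark sketch of the same idea uses a window with lag instead of np.roll, reusing the toy data from the question (the row_order column and the 0.5 offset trick are assumptions for illustration, not part of the answer above):

from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

sdf = spark.createDataFrame(
    [(1, "A", "B"), (1, "C", "A"), (2, "D", "D"), (2, "F", "G"), (2, "F", "F")],
    ["id", "from", "to"],
)

# Assumption: the input order is meaningful, so capture it in an explicit column.
sdf = sdf.withColumn("row_order", F.monotonically_increasing_id())
w = Window.partitionBy("id").orderBy("row_order")

# Previous row's "to" within the same id.
with_prev = sdf.withColumn("prev_to", F.lag("to").over(w))

# Rows that break continuity generate the missing transition (prev_to -> from).
# The 0.5 offset makes the new row sort just before the row that triggered it.
missing = with_prev.filter(
    F.col("prev_to").isNotNull() & (F.col("prev_to") != F.col("from"))
).select(
    "id",
    F.col("prev_to").alias("from"),
    F.col("from").alias("to"),
    (F.col("row_order") - F.lit(0.5)).alias("row_order"),
)

result = (
    with_prev.select("id", "from", "to", F.col("row_order").cast("double").alias("row_order"))
    .unionByName(missing)
    .orderBy("id", "row_order")
    .drop("row_order")
)
result.show()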
