Pandas iterations to Pyspark Function
Given a dataset, I tried to create a logic where, within two columns, I need to enforce continuity per id: the last destination (in the to column) must be the exact next starting point (in the from column). For instance, this table
+----+-------+-------+
| id | from | to |
+----+-------+-------+
| 1 | A | B |
| 1 | C | A |
| 2 | D | D |
| 2 | F | G |
| 2 | F | F |
+----+-------+-------+
should ideally look like this:
+----+-------+-------+
| id | from | to |
+----+-------+-------+
| 1 | A | B |
| 1 | B | C |
| 1 | C | A |
| 2 | D | D |
| 2 | D | F |
| 2 | F | G |
| 2 | G | F |
| 2 | F | F |
+----+-------+-------+
Using Pandas I did this by looping row-wise and checking whether previous_row['to'] == current_row['from'], together with a check on id that could probably be avoided with a groupby, as you can see below:
# `new` is the working copy of the DataFrame and `appendings` collects the rows to insert
# (both assumed from context; .ix is deprecated, so .loc is used here).
new = df.reset_index(drop=True)
appendings = pd.DataFrame()

for i in range(len(new) - 1):
    # insert a connecting row when the previous destination does not match the next origin (same id only)
    if (new.loc[i, "to"] != new.loc[i + 1, "from"]) & (new.loc[i, "id"] == new.loc[i + 1, "id"]):
        new_index = i + 0.5  # fractional index so the new row sorts in between the originals
        line = pd.DataFrame({"id": new.loc[i, "id"],
                             "from": new.loc[i, "to"],
                             "to": new.loc[i + 1, "from"]}, index=[new_index])
        appendings = pd.concat([appendings, line])
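The fractional index is there so the generated rows can be slotted back in between the originals afterwards; a minimal sketch of that merge step (assuming new and appendings as above, not necessarily the exact code I used) is:

# Merge the generated rows back using the fractional index, then rebuild a clean integer index.
result = pd.concat([new, appendings]).sort_index().reset_index(drop=True)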
Is it possible to "translate" this as-is to pyspark rdds?
I am aware that looping is far from optimal in Pyspark for replicating loop and if-else logic.
I considered grouping by id, zipping the from and to columns, and working on a single column. The main problem with this is that I could produce a flag on the rows that are "faulty", but there is no way to insert new rows without using index-wise operations.
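For illustration, that flag step might look roughly like this in PySpark; this is only a sketch, assuming an explicit ordering column (called seq here, a placeholder) exists per id, since Spark rows have no inherent order, and sdf is the Spark DataFrame:

from pyspark.sql import functions as F, Window

# Hypothetical sketch: flag rows whose `from` does not continue the previous `to`.
w = Window.partitionBy("id").orderBy("seq")
flagged = (sdf
           .withColumn("prev_to", F.lag("to").over(w))
           .withColumn("faulty",
                       F.col("prev_to").isNotNull() & (F.col("prev_to") != F.col("from"))))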
This isn't a pyspark answer but a partial answer to show you how to achieve the task without a loop in pandas.
You can try:
def f(sub_df):
    return sub_df.assign(to_=np.roll(sub_df.To, 1)) \
        .apply(lambda x: [[x.From, x.To]] if x.to_ == x.From else [[x.to_, x.From], [x.From, x.To]], axis=1) \
        .explode() \
        .apply(pd.Series)

out = df.groupby('id').apply(f) \
        .reset_index(level=1, drop=True) \
        .rename(columns={0: "from", 1: "to"})
Workflow:
- Group the dataframe by id using groupby.
- Create a new column (named to_ here) holding the previous row's To. np.roll performs a circular shift in order to keep the last value, so it wraps around to the first row (a short illustration follows this list).
- For each row, build one pair [From, To] when to_ == From, or two pairs [[to_, From], [From, To]] when the continuity is broken.
- Use explode to explode the list of lists into one list per row.
- Use apply(pd.Series) to convert that column into two columns.
- Then, for the output dataframe, remove the level-1 index using reset_index and name the columns using rename.
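For clarity, the circular shift done by np.roll behaves like this:

import numpy as np

# The last element wraps around to the front, so each position sees the previous value.
print(np.roll(np.array(["B", "A", "D"]), 1))  # ['D' 'B' 'A']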
Full code
# Import module
import pandas as pd
import numpy as np
# create dataset
df = pd.DataFrame({"id": [1,1,2,2,2], "From": ["A", "C", "D", "F", "F"], "To": ["B", "D", "D", "G", "F"]})
# print(df)
def f(sub_df):
    # Shift To down by one (circularly) so each row sees the previous destination,
    # then emit one pair when continuity holds, or two pairs when a row must be inserted.
    return sub_df.assign(to_=np.roll(sub_df.To, 1)) \
        .apply(lambda x: [[x.From, x.To]] if x.to_ == x.From else [[x.to_, x.From], [x.From, x.To]], axis=1) \
        .explode() \
        .apply(pd.Series)

out = df.groupby('id').apply(f) \
        .reset_index(level=1, drop=True) \
        .rename(columns={0: "from", 1: "to"})
print(out)
# from to
# id
# 1 D A
# 1 A B
# 1 B C
# 1 C D
# 2 F D
# 2 D D
# 2 D F
# 2 F G
# 2 G F
# 2 F F
The next step is to translate it into PySpark. Try it and feel free to open a new question with your attempts.
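For reference, one possible (untested) way to express the same idea with PySpark DataFrames could look like the sketch below; it is only an outline, again assuming an ordering column seq per id and a Spark DataFrame sdf, both placeholders:

from pyspark.sql import functions as F, Window

# Sketch only: get the previous destination with lag(), emit one or two
# (from, to) pairs per row, then explode them back into rows.
w = Window.partitionBy("id").orderBy("seq")
pairs = F.when(
    F.col("prev_to").isNull() | (F.col("prev_to") == F.col("from")),
    F.array(F.struct(F.col("from").alias("from"), F.col("to").alias("to")))
).otherwise(
    F.array(F.struct(F.col("prev_to").alias("from"), F.col("from").alias("to")),
            F.struct(F.col("from").alias("from"), F.col("to").alias("to")))
)
out = (sdf
       .withColumn("prev_to", F.lag("to").over(w))
       .withColumn("pair", F.explode(pairs))
       .select("id", "pair.*"))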