
Pyspark: How to iterate through data frame columns?

I'm new to pyspark. I usually work with pandas. I want to iterate row by row using a column in pyspark. My dataset looks like:

+-------------------+--------------------+--------+-----+
|           DateTime|           user_name|keyboard|mouse|
+-------------------+--------------------+--------+-----+
|2019-10-21 08:35:01|prathameshsalap@g...|   333.0|658.0|
|2019-10-21 08:35:01|vaishusawant143@g...|   447.5|  0.0|
|2019-10-21 08:35:01|     you@example.com|     0.5|  1.0|
|2019-10-21 08:40:01|     you@example.com|     0.0|  0.0|
|2019-10-21 08:40:01|prathameshsalap@g...|   227.0|366.0|
|2019-10-21 08:40:02|vaishusawant143@g...|   472.0|  0.0|
|2019-10-21 08:45:01|     you@example.com|     0.0|  0.0|
|2019-10-21 08:45:01|prathameshsalap@g...|    35.0|458.0|
|2019-10-21 08:45:01|vaishusawant143@g...|  1659.5|  0.0|
|2019-10-21 08:50:01|     you@example.com|     0.0|  0.0|
+-------------------+--------------------+--------+-----+

A pandas data frame also has an index, but a Spark data frame does not. In pandas:

## pandas
import datetime

import pandas as pd

usr_log = pd.read_csv("data.csv")
unique_users = usr_log.user_name.unique()
usr_log.sort_values(by='DateTime', inplace=True)
users_new_data = dict()

for user in unique_users:
    count_idle = 0
    # initialise the per-user entry before scanning the rows
    users_new_data[user] = {'start_time': None,
                            'idle_time': datetime.timedelta(0)}
    ## first part of the question
    for index in usr_log.index:
        if user == usr_log['user_name'][index]:
            if users_new_data[user]['start_time'] is None:
                users_new_data[user]['start_time'] = usr_log['DateTime'][index]

            ## Second part of the question

            if usr_log['keyboard'][index] == 0 and usr_log['mouse'][index] == 0:
                count_idle += 1
            else:
                count_idle = 0
            if count_idle >= 5:
                if count_idle == 5:
                    users_new_data[user]['idle_time'] += datetime.timedelta(0, 1500)
                else:
                    users_new_data[user]['idle_time'] += datetime.timedelta(0, 300)

How can I do the same thing in Spark?

Data is generated for each user every 5 minutes (for example, if the user starts at 8:30:01, the next log is generated at 8:35:01). In the second part of the question I want to find the idle time for each user. The idle-time calculation is: if the user does not move the mouse or use the keyboard for the next 30 minutes (1500 seconds), that time is added to the user's idle hours.

After converting the dictionary values into a data frame, my expected output looks like:

+--------------------+-------------------+-------------------+
|           user_name|         start_time|          idle_time|
+--------------------+-------------------+-------------------+
|prathameshsalap@g...|2019-10-21 08:35:01|2019-10-21 05:05:00|
|vaishusawant143@g...|2019-10-21 08:35:01|2019-10-21 02:15:00|
|     you@example.com|2019-10-21 08:35:01|2019-10-21 01:30:00|
+--------------------+-------------------+-------------------+

If you want to find, for each user, the first timestamp that they have, you can simplify it first in pandas; do this:

usr_log[['user_name','DateTime']].groupby(['user_name']).min()
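For reference, here is a minimal pandas sketch of the same idea (assuming the data.csv from the question; start_times is just an illustrative name) that also renames the result to start_time:

import pandas as pd

usr_log = pd.read_csv("data.csv")
# earliest DateTime per user, kept as a regular column and renamed
start_times = (usr_log[['user_name', 'DateTime']]
               .groupby('user_name', as_index=False)
               .min()
               .rename(columns={'DateTime': 'start_time'}))
print(start_times)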

And for Spark it will be very similar:

from pyspark.sql.functions import min

urs_log = sparkSession.read.csv(...)
urs_log.groupBy("user_name").agg(min("DateTime"))

You only have to rename the DateTime column to the one you want, and try not to use for loops in pandas.
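For example, the rename can be done in the same aggregation with alias. A small sketch, assuming a SparkSession named sparkSession and the data.csv from the question:

from pyspark.sql import SparkSession
from pyspark.sql.functions import min

sparkSession = SparkSession.builder.getOrCreate()
urs_log = sparkSession.read.csv("data.csv", header=True)
# rename the aggregated column in the same step via alias
urs_log.groupBy("user_name").agg(min("DateTime").alias("start_time")).show()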

In Spark you have a distributed collection, and it's not possible to do a for loop over it; you have to apply transformations to columns and never apply logic to a single row of data.
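As a tiny illustration of what a column-wise transformation looks like (a sketch only; it assumes urs_log is the DataFrame read above, and is_idle is just an illustrative column name; the solution below develops this idea in full):

from pyspark.sql import functions as F

# flag idle rows in one pass over the column values, no row loop needed
urs_log = urs_log.withColumn(
    "is_idle",
    F.when((F.col("keyboard") == 0.0) & (F.col("mouse") == 0.0), 1).otherwise(0)
)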

Here is a solution for the same:

from pyspark.sql import functions as F
from pyspark.sql.functions import col, when, min, sum

df = (spark.read.format("csv").option("sep", ",").option("header", "true").load("data.csv"))

df.show()
+-------------------+--------------------+--------+-----+
|           DateTime|           user_name|keyboard|mouse|
+-------------------+--------------------+--------+-----+
|2019-10-21 08:35:01|prathameshsalap@g...|   333.0|658.0|
|2019-10-21 08:35:01|vaishusawant143@g...|   447.5|  0.0|
|2019-10-21 08:35:01|     you@example.com|     0.5|  1.0|
|2019-10-21 08:40:01|     you@example.com|     0.0|  0.0|
|2019-10-21 08:40:01|prathameshsalap@g...|   227.0|366.0|
|2019-10-21 08:40:02|vaishusawant143@g...|   472.0|  0.0|
|2019-10-21 08:45:01|     you@example.com|     0.0|  0.0|
|2019-10-21 08:45:01|prathameshsalap@g...|    35.0|458.0|
|2019-10-21 08:45:01|vaishusawant143@g...|  1659.5|  0.0|
|2019-10-21 08:50:01|     you@example.com|     0.0|  0.0|
+-------------------+--------------------+--------+-----+
df1 = df.groupBy("user_name").agg(min("DateTime"))
df1.show()
+--------------------+-------------------+
|           user_name|      min(DateTime)|
+--------------------+-------------------+
|prathameshsalap@g...|2019-10-21 08:35:01|
|vaishusawant143@g...|2019-10-21 08:35:01|
|     you@example.com|2019-10-21 08:35:01|
+--------------------+-------------------+

Other Part -

df1 = df.withColumn("count", when((col("keyboard") == 0.0) & (col("mouse") == 0.0), 1).otherwise(0))

df2 = df1.withColumn("Idle_Sec", when(col("count") == 0, 300).otherwise(1500))

df2.show()
+-------------------+--------------------+--------+-----+-----+--------+
|           DateTime|           user_name|keyboard|mouse|count|Idle_Sec|
+-------------------+--------------------+--------+-----+-----+--------+
|2019-10-21 08:35:01|prathameshsalap@g...|   333.0|658.0|    0|     300|
|2019-10-21 08:35:01|vaishusawant143@g...|   447.5|  0.0|    0|     300|
|2019-10-21 08:35:01|     you@example.com|     0.5|  1.0|    0|     300|
|2019-10-21 08:40:01|     you@example.com|     0.0|  0.0|    1|    1500|
|2019-10-21 08:40:01|prathameshsalap@g...|   227.0|366.0|    0|     300|
|2019-10-21 08:40:02|vaishusawant143@g...|   472.0|  0.0|    0|     300|
|2019-10-21 08:45:01|     you@example.com|     0.0|  0.0|    1|    1500|
|2019-10-21 08:45:01|prathameshsalap@g...|    35.0|458.0|    0|     300|
|2019-10-21 08:45:01|vaishusawant143@g...|  1659.5|  0.0|    0|     300|
|2019-10-21 08:50:01|     you@example.com|     0.0|  0.0|    1|    1500|
+-------------------+--------------------+--------+-----+-----+--------+

df3 = df2.groupBy("user_name").agg(min("DateTime").alias("start_time"), sum("Idle_Sec").alias("Sum_Idle_Sec"))
df3.show()

+--------------------+-------------------+------------+
|           user_name|         start_time|Sum_Idle_Sec|
+--------------------+-------------------+------------+
|prathameshsalap@g...|2019-10-21 08:35:01|         900|
|vaishusawant143@g...|2019-10-21 08:35:01|         900|
|     you@example.com|2019-10-21 08:35:01|        4800|
+--------------------+-------------------+------------+

df3.withColumn("Idle_time",(F.unix_timestamp("start_time") + col("Sum_Idle_Sec")).cast('timestamp')).show()
+--------------------+-------------------+------------+-------------------+
|           user_name|         start_time|Sum_Idle_Sec|          Idle_time|
+--------------------+-------------------+------------+-------------------+
|prathameshsalap@g...|2019-10-21 08:35:01|         900|2019-10-21 08:50:01|
|vaishusawant143@g...|2019-10-21 08:35:01|         900|2019-10-21 08:50:01|
|     you@example.com|2019-10-21 08:35:01|        4800|2019-10-21 09:55:01|
+--------------------+-------------------+------------+-------------------+

You should do it as in the following example:

  • df.withColumn("user_name", do_something)

" do_something " can be any function that you define. do_something ”可以是您定义的任何 function。
