how to avoid using for loop in spark (python)

I'm new to pySpark, hope someone could help me.

I have a dataframe with a bunch of flight search results:

+------+-----------+----------+----------+-----+
|origin|destination|      from|        to|price|
+------+-----------+----------+----------+-----+
|   TLV|        NYC|2022-01-01|2022-01-05| 1000|
|   TLV|        ROM|2022-03-01|2022-04-05|  480|
|   TLV|        NYC|2022-01-02|2022-01-04|  990|
|   TLV|        NYC|2022-02-01|2022-03-15| 1200|
|   TLV|        NYC|2022-01-02|2022-01-05| 1100|
|   TLV|        BLR|2022-01-01|2022-01-05| 1480|
|   TLV|        NYC|2022-01-02|2022-01-05| 1010|
+------+-----------+----------+----------+-----+

I want to get all the flight prices from the dataframe, based on origin-destination and dates.

I have a list with some date combinations like so:

date_combinations = [
    ("2022-01-01", "2022-01-02"), ("2022-01-01", "2022-01-03"),
    ("2022-01-01", "2022-01-04"), ("2022-01-01", "2022-01-05"),
    ("2022-01-02", "2022-01-03"), ("2022-01-02", "2022-01-04"),
    ("2022-01-02", "2022-01-05"), ("2022-01-03", "2022-01-04"),
    ("2022-01-03", "2022-01-05"), ("2022-01-04", "2022-01-05"),
]
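
A list like this can be generated from a sorted list of dates with itertools.combinations, for example (a minimal sketch, assuming every ordered pair of dates is wanted):

from itertools import combinations

dates = ["2022-01-01", "2022-01-02", "2022-01-03", "2022-01-04", "2022-01-05"]
date_combinations = list(combinations(dates, 2))  # every ordered (from, to) pair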

What I'm currently doing is filtering the dataframe inside a for loop for every date combination:

results = []
for date in date_combinations:
    # each iteration triggers separate Spark jobs (filter + count + collect)
    df_date = df.filter((df['from'] == date[0]) & (df['to'] == date[1]))
    if df_date.count() == 0:
        results.append([date, 0])
    else:
        results.append([date, df_date.collect()[0]['price']])

Output:

[('2022-01-01', '2022-01-02'), 0]
[('2022-01-01', '2022-01-03'), 0]
[('2022-01-01', '2022-01-04'), 0]
[('2022-01-01', '2022-01-05'), 1000]
[('2022-01-02', '2022-01-03'), 0]
[('2022-01-02', '2022-01-04'), 990]
[('2022-01-02', '2022-01-05'), 1100]
[('2022-01-03', '2022-01-04'), 0]
[('2022-01-03', '2022-01-05'), 0]
[('2022-01-04', '2022-01-05'), 0]

The output is OK, but I'm sure there is a much more efficient way of doing this than a for loop, which on large datasets will take forever.

Thanks!

I would first convert date_combinations to a DataFrame (using parallelize, or by selecting and then dropping duplicates if it comes from a dataset). The snippets in this answer are Scala; a PySpark sketch of the same steps follows the final output below.

The idea is to do a left join between your dates and the data table (as we will call it).

First, we clean the data table and drop duplicates, because otherwise the left join would also create duplicates on matching records:

val mainTableFiltered = data.select("from", "to", "price").dropDuplicates("from", "to")

We then left-join the dates with this cleaned table on from and to, so we do not lose any date combinations:

dateCombinations.join(mainTableFiltered, Seq("from", "to"), "left")

Unmatched records will then have a null price, so we replace the nulls with 0:

.withColumn("price", when(col("price").isNull, 0).otherwise(col("price")))

Finally, we order by from and to to get the same ordering as in your example:

.orderBy("from", "to")

Full code:

import org.apache.spark.sql.functions.{col, when}

val mainTableFiltered = data.select("from", "to", "price").dropDuplicates("from", "to")
dateCombinations.join(mainTableFiltered, Seq("from", "to"), "left")
  .withColumn("price", when(col("price").isNull, 0).otherwise(col("price")))
  .orderBy("from", "to")

Final output:

+----------+----------+-----+
|      from|        to|price|
+----------+----------+-----+
|2022-01-01|2022-01-02|    0|
|2022-01-01|2022-01-03|    0|
|2022-01-01|2022-01-04|    0|
|2022-01-01|2022-01-05| 1000|
|2022-01-02|2022-01-03|    0|
|2022-01-02|2022-01-04|  990|
|2022-01-02|2022-01-05| 1100|
|2022-01-03|2022-01-04|    0|
|2022-01-03|2022-01-05|    0|
|2022-01-04|2022-01-05|    0|
+----------+----------+-----+
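
For reference, a PySpark sketch of the same steps, assuming df is the question's DataFrame and df_date_combinations holds the date pairs (both names are assumptions, matching the next answer):

from pyspark.sql import functions as F

# drop duplicate (from, to) pairs so the left join cannot fan out
main_table_filtered = df.select("from", "to", "price").dropDuplicates(["from", "to"])

result = (
    df_date_combinations
    .join(main_table_filtered, ["from", "to"], "left")
    .withColumn("price", F.when(F.col("price").isNull(), 0).otherwise(F.col("price")))
    .orderBy("from", "to")
)
result.show()

Note that dropDuplicates keeps an arbitrary row per (from, to) pair, so when two searches share the same dates (1100 and 1010 here), which price survives is not guaranteed.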

You can create a df from your list of dates and join both dfs:

df = spark.createDataFrame(
    [
     ('TLV','NYC','2022-01-01','2022-01-05','1000')
    ,('TLV','ROM','2022-03-01','2022-04-05','480')
    ,('TLV','NYC','2022-01-02','2022-01-04','990')
    ,('TLV','NYC','2022-02-01','2022-03-15','1200')
    ,('TLV','NYC','2022-01-02','2022-01-05','1100')
    ,('TLV','BLR','2022-01-01','2022-01-05','1480')
    ,('TLV','NYC','2022-01-02','2022-01-05','1010')
    ],
    ['origin','destination','from','to','price']
)

df.show()

+------+-----------+----------+----------+-----+
|origin|destination|      from|        to|price|
+------+-----------+----------+----------+-----+
|   TLV|        NYC|2022-01-01|2022-01-05| 1000|
|   TLV|        ROM|2022-03-01|2022-04-05|  480|
|   TLV|        NYC|2022-01-02|2022-01-04|  990|
|   TLV|        NYC|2022-02-01|2022-03-15| 1200|
|   TLV|        NYC|2022-01-02|2022-01-05| 1100|
|   TLV|        BLR|2022-01-01|2022-01-05| 1480|
|   TLV|        NYC|2022-01-02|2022-01-05| 1010|
+------+-----------+----------+----------+-----+

date_combinations = [
    ("2022-01-01", "2022-01-02"), ("2022-01-01", "2022-01-03"),
    ("2022-01-01", "2022-01-04"), ("2022-01-01", "2022-01-05"),
    ("2022-01-02", "2022-01-03"), ("2022-01-02", "2022-01-04"),
    ("2022-01-02", "2022-01-05"), ("2022-01-03", "2022-01-04"),
    ("2022-01-03", "2022-01-05"), ("2022-01-04", "2022-01-05"),
]

df_date_combinations = spark.createDataFrame(date_combinations, ['from','to'])

df\
    .join(df_date_combinations, ['from','to'])\
    .show()

+----------+----------+------+-----------+-----+
|      from|        to|origin|destination|price|
+----------+----------+------+-----------+-----+
|2022-01-01|2022-01-05|   TLV|        NYC| 1000|
|2022-01-01|2022-01-05|   TLV|        BLR| 1480|
|2022-01-02|2022-01-04|   TLV|        NYC|  990|
|2022-01-02|2022-01-05|   TLV|        NYC| 1100|
|2022-01-02|2022-01-05|   TLV|        NYC| 1010|
+----------+----------+------+-----------+-----+
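
This inner join keeps only the matched date pairs. If the zero-price rows from the question's expected output are also needed, a left join from the date combinations plus a null fill (as in the previous answer) reproduces them. A sketch, reusing df and df_date_combinations from above; the cast is needed here because price was created as a string:

from pyspark.sql import functions as F

# keep one price per (from, to) pair so the join does not fan out
prices = df.select('from', 'to', 'price').dropDuplicates(['from', 'to'])

(df_date_combinations
    .join(prices, ['from', 'to'], 'left')
    .withColumn('price', F.coalesce(F.col('price').cast('int'), F.lit(0)))
    .orderBy('from', 'to')
    .show())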
