Addition of a column takes time in PySpark Dataframes

I am currently trying to integrate PySpark and Cassandra and am having trouble optimising the code in order for it to execute faster.

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, SparkSession
from pyspark.sql.functions import sum as _sum

def connect_cassandra():
    spark = SparkSession.builder \
      .appName('SparkCassandraApp') \
      .config('spark.cassandra.connection.host', 'localhost') \
      .config('spark.cassandra.connection.port', '9042') \
      .config('spark.cassandra.output.consistency.level','ONE') \
      .master('local[*]') \
      .getOrCreate()

    sqlContext = SQLContext(spark)
    return sqlContext

#--------THIS FUNCTION IS MY CONCERN ACTUALLY------------
def check_ip(ip, df):
    rows= df.filter("src_ip = '"+ip+"' or dst_ip = '"+ip+"'") \
            .agg(_sum('total').alias('data')) \
            .collect()

    print(rows[0][0])
#-----------------------------------------------------------

def load_df(sqlContext):

    df = sqlContext \
      .read \
      .format('org.apache.spark.sql.cassandra') \
      .options(table='acrs_app_data_usage', keyspace='acrs') \
      .load()

    return df

if __name__ == '__main__':
    lists = ['10.8.25.6', '10.8.24.10', '10.8.24.11', '10.8.20.1', '10.8.25.15', '10.8.25.10']
    sqlContext = connect_cassandra()
    df = load_df(sqlContext)
    for ip in lists:
        check_ip(ip, df)

The function check_ip() here takes an IP and a pre-loaded dataframe as arguments. The dataframe has 3 columns (src_ip, dst_ip and total) and around 250K rows. The function filters the rows matching the given IP, sums their total column, and returns the summed data for that IP.

But when I execute the script, it takes at least a second per IP to return the summed amount. And I have over 32K IPs for which the same has to happen, so the total time is very long.

Any help would be appreciated. Thanks in advance.

Short answer: Don't use loops.

Possible solution:

  • Convert lists to a dataframe (e.g. lists_df).
  • Inner join lists_df twice with your dataframe, first on ip == src_ip and then on ip == dst_ip.
  • Concatenate both results with unionAll.
  • Finally use groupBy("ip").agg(_sum("total")).

This uses joins, so there is perhaps an even better solution out there. A rough sketch of the approach is shown below.
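A minimal sketch of the steps above, reusing the df, lists and sqlContext built in the question; sum_per_ip is just an illustrative helper name, not part of the original code:

from pyspark.sql.functions import sum as _sum

def sum_per_ip(df, ip_list, sqlContext):
    # Convert the plain Python list of IPs into a single-column dataframe
    lists_df = sqlContext.createDataFrame([(ip,) for ip in ip_list], ['ip'])

    # Inner join the IP dataframe against src_ip and dst_ip separately
    src_join = lists_df.join(df, lists_df.ip == df.src_ip).select('ip', 'total')
    dst_join = lists_df.join(df, lists_df.ip == df.dst_ip).select('ip', 'total')

    # Concatenate both results and aggregate once per IP
    return src_join.unionAll(dst_join) \
                   .groupBy('ip') \
                   .agg(_sum('total').alias('data'))

result = sum_per_ip(df, lists, sqlContext)
result.show()

This way the whole list is summed in a single Spark job instead of one collect() per IP inside a Python loop.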
