Addition of a column takes time in PySpark Dataframes

I am currently trying to integrate PySpark and Cassandra and am having trouble optimising the code in order for it to execute faster.

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, SparkSession
from pyspark.sql.functions import sum as _sum

def connect_cassandra():
    spark = SparkSession.builder \
      .appName('SparkCassandraApp') \
      .config('spark.cassandra.connection.host', 'localhost') \
      .config('spark.cassandra.connection.port', '9042') \
      .config('spark.cassandra.output.consistency.level','ONE') \
      .master('local[*]') \
      .getOrCreate()

    sqlContext = SQLContext(spark)
    return sqlContext

#--------THIS FUNCTION IS MY CONCERN ACTUALLY------------
def check_ip(ip, df):
    rows= df.filter("src_ip = '"+ip+"' or dst_ip = '"+ip+"'") \
            .agg(_sum('total').alias('data')) \
            .collect()

    print(rows[0][0])
#-----------------------------------------------------------

def load_df(sqlContext):

    df = sqlContext \
      .read \
      .format('org.apache.spark.sql.cassandra') \
      .options(table='acrs_app_data_usage', keyspace='acrs') \
      .load()

    return df

if __name__ == '__main__':
    lists = ['10.8.25.6', '10.8.24.10', '10.8.24.11', '10.8.20.1', '10.8.25.15', '10.8.25.10']
    sqlContext = connect_cassandra()
    df = load_df(sqlContext)
    for ip in lists:
        check_ip(ip, df)

The function check_ip() here takes an IP and a pre-loaded dataframe as arguments. The dataframe has 3 columns (src_ip, dst_ip and total) and around 250K rows. The function filters the rows matching the given IP, sums their total column, and returns the summed data for that IP.

But when I execute the script, it takes at least a second per IP to return the summed amount. And I have over 32K IPs for which the same has to happen, so the total time is very long.

Any help would be appreciated. Thanks in advance.

Short answer: Don't use loops.

Possible solution:

  • Convert lists to a dataframe (e.g. lists_df).
  • Inner join lists_df twice with your dataframe, first on ip == src_ip and then on ip == dst_ip.
  • Concatenate both results with unionAll.
  • Finally use groupBy("ip").agg(_sum("total")).

This uses joins, so there is perhaps an even better solution out there. A rough sketch of the approach is shown below.
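A minimal sketch of the steps above, reusing the df, lists and sqlContext built in the question; sum_per_ip is just an illustrative helper name, not part of the original code:

from pyspark.sql.functions import sum as _sum

def sum_per_ip(df, ip_list, sqlContext):
    # Convert the plain Python list of IPs into a single-column dataframe
    lists_df = sqlContext.createDataFrame([(ip,) for ip in ip_list], ['ip'])

    # Inner join the IP dataframe against src_ip and dst_ip separately
    src_join = lists_df.join(df, lists_df.ip == df.src_ip).select('ip', 'total')
    dst_join = lists_df.join(df, lists_df.ip == df.dst_ip).select('ip', 'total')

    # Concatenate both results and aggregate once per IP
    return src_join.unionAll(dst_join) \
                   .groupBy('ip') \
                   .agg(_sum('total').alias('data'))

result = sum_per_ip(df, lists, sqlContext)
result.show()

This way the whole list is summed in a single Spark job instead of one collect() per IP inside a Python loop.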
