
Summing up a column takes time in PySpark DataFrames

I am currently trying to integrate PySpark and Cassandra and am having trouble optimising the code in order for it to execute faster.

from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as _sum

def connect_cassandra():
    # Build a local SparkSession wired to the Cassandra connector.
    spark = SparkSession.builder \
      .appName('SparkCassandraApp') \
      .config('spark.cassandra.connection.host', 'localhost') \
      .config('spark.cassandra.connection.port', '9042') \
      .config('spark.cassandra.output.consistency.level', 'ONE') \
      .master('local[*]') \
      .getOrCreate()

    return spark

#--------THIS FUNCTION IS MY CONCERN ACTUALLY------------
def check_ip(ip, df):
    # Sum the 'total' column over all rows where the given IP
    # appears as either the source or the destination address.
    rows = df.filter("src_ip = '" + ip + "' or dst_ip = '" + ip + "'") \
             .agg(_sum('total').alias('data')) \
             .collect()

    print(rows[0][0])
#-----------------------------------------------------------

def load_df(spark):
    # Read the Cassandra table into a DataFrame via the Spark Cassandra connector.
    df = spark \
      .read \
      .format('org.apache.spark.sql.cassandra') \
      .options(table='acrs_app_data_usage', keyspace='acrs') \
      .load()

    return df

if __name__ == '__main__':
    lists = ['10.8.25.6', '10.8.24.10', '10.8.24.11', '10.8.20.1', '10.8.25.15', '10.8.25.10']
    spark = connect_cassandra()
    df = load_df(spark)
    for ip in lists:
        check_ip(ip, df)

The function check_ip() takes an IP and a pre-loaded DataFrame as arguments. The DataFrame has three columns (src_ip, dst_ip and total) and around 250K rows. The function filters the rows in which the given IP appears as either src_ip or dst_ip, sums their total column, and prints the aggregated amount for that IP.
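For reference, the same filter can also be written with column expressions instead of a concatenated SQL string; this sketch should behave identically and avoids quoting issues:

from pyspark.sql.functions import col, sum as _sum

def check_ip(ip, df):
    # Equivalent filter built from column expressions rather than string SQL.
    rows = df.filter((col('src_ip') == ip) | (col('dst_ip') == ip)) \
             .agg(_sum('total').alias('data')) \
             .collect()

    print(rows[0][0])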

But when I execute the script, it takes at least a second per IP to return the summed amount, and I have over 32K IPs for which the same has to happen. At roughly one second each, that is close to nine hours of runtime.

Any help would be appreciated. Thanks in advance.

Short answer: Don't use loops.

Possible solution:

  • Convert lists to a DataFrame.
  • Inner-join lists_df with your DataFrame twice: first on ip == src_ip, then on ip == dst_ip.
  • Concatenate both results with unionAll.
  • Finally, use groupBy("ip").agg(_sum("total")).

This approach relies on joins, so there is perhaps an even better solution out there.
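A minimal sketch of that approach, assuming the column names from the question (the helper name sum_per_ip is only for illustration):

from pyspark.sql.functions import col, sum as _sum

def sum_per_ip(spark, df, ips):
    # One-column DataFrame holding every IP of interest.
    ips_df = spark.createDataFrame([(ip,) for ip in ips], ['ip'])

    # Join the IP list against both address columns separately,
    # keeping only the columns needed for the aggregation.
    as_src = ips_df.join(df, col('ip') == col('src_ip')).select('ip', 'total')
    as_dst = ips_df.join(df, col('ip') == col('dst_ip')).select('ip', 'total')

    # Stack both result sets and aggregate once for all IPs,
    # instead of running one Spark job per IP.
    return as_src.unionAll(as_dst).groupBy('ip').agg(_sum('total').alias('data'))

sum_per_ip(spark, df, lists).show()

This replaces one Spark job per IP with a single aggregation that Spark can plan and execute in one pass over the data.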
