
Summing up a column takes time in PySpark DataFrames

I am currently trying to integrate PySpark and Cassandra and am having trouble optimising the code in order for it to execute faster.

from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as _sum

def connect_cassandra():
    # Build a local SparkSession wired to the Cassandra connector.
    spark = SparkSession.builder \
      .appName('SparkCassandraApp') \
      .config('spark.cassandra.connection.host', 'localhost') \
      .config('spark.cassandra.connection.port', '9042') \
      .config('spark.cassandra.output.consistency.level', 'ONE') \
      .master('local[*]') \
      .getOrCreate()

    return spark

#--------THIS FUNCTION IS MY CONCERN ACTUALLY------------
def check_ip(ip, df):
    # Sum the 'total' column over all rows where the given IP
    # appears as either the source or the destination address.
    rows = df.filter("src_ip = '" + ip + "' or dst_ip = '" + ip + "'") \
             .agg(_sum('total').alias('data')) \
             .collect()

    print(rows[0][0])
#-----------------------------------------------------------

def load_df(spark):
    # Read the Cassandra table into a DataFrame via the Spark Cassandra connector.
    df = spark \
      .read \
      .format('org.apache.spark.sql.cassandra') \
      .options(table='acrs_app_data_usage', keyspace='acrs') \
      .load()

    return df

if __name__ == '__main__':
    lists = ['10.8.25.6', '10.8.24.10', '10.8.24.11', '10.8.20.1', '10.8.25.15', '10.8.25.10']
    spark = connect_cassandra()
    df = load_df(spark)
    for ip in lists:
        check_ip(ip, df)

The function check_ip() takes an IP and a pre-loaded DataFrame as arguments. The DataFrame has three columns (src_ip, dst_ip and total) and around 250K rows. The function filters the rows in which the given IP appears as either src_ip or dst_ip, sums their total column, and prints the aggregated amount for that IP.
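For reference, the same filter can also be written with column expressions instead of a concatenated SQL string; this sketch should behave identically and avoids quoting issues:

from pyspark.sql.functions import col, sum as _sum

def check_ip(ip, df):
    # Equivalent filter built from column expressions rather than string SQL.
    rows = df.filter((col('src_ip') == ip) | (col('dst_ip') == ip)) \
             .agg(_sum('total').alias('data')) \
             .collect()

    print(rows[0][0])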

But when I execute the script, it takes at least a second per IP to return the summed amount, and I have over 32K IPs for which the same has to happen. At roughly one second each, that is close to nine hours of runtime.

Any help would be appreciated. Thanks in advance.

Short answer: Don't use loops.

Possible solution:

  • Convert lists to a DataFrame.
  • Inner-join lists_df with your DataFrame twice: first on ip == src_ip, then on ip == dst_ip.
  • Concatenate both results with unionAll.
  • Finally, use groupBy("ip").agg(_sum("total")).

This approach relies on joins, so there is perhaps an even better solution out there.
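A minimal sketch of that approach, assuming the column names from the question (the helper name sum_per_ip is only for illustration):

from pyspark.sql.functions import col, sum as _sum

def sum_per_ip(spark, df, ips):
    # One-column DataFrame holding every IP of interest.
    ips_df = spark.createDataFrame([(ip,) for ip in ips], ['ip'])

    # Join the IP list against both address columns separately,
    # keeping only the columns needed for the aggregation.
    as_src = ips_df.join(df, col('ip') == col('src_ip')).select('ip', 'total')
    as_dst = ips_df.join(df, col('ip') == col('dst_ip')).select('ip', 'total')

    # Stack both result sets and aggregate once for all IPs,
    # instead of running one Spark job per IP.
    return as_src.unionAll(as_dst).groupBy('ip').agg(_sum('total').alias('data'))

sum_per_ip(spark, df, lists).show()

This replaces one Spark job per IP with a single aggregation that Spark can plan and execute in one pass over the data.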
