Addition of a column takes time in PySpark Dataframes
I am currently trying to integrate PySpark and Cassandra and am having trouble optimising the code so that it executes faster.
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, SparkSession
from pyspark.sql.functions import sum as _sum

def connect_cassandra():
    spark = SparkSession.builder \
        .appName('SparkCassandraApp') \
        .config('spark.cassandra.connection.host', 'localhost') \
        .config('spark.cassandra.connection.port', '9042') \
        .config('spark.cassandra.output.consistency.level', 'ONE') \
        .master('local[*]') \
        .getOrCreate()
    sqlContext = SQLContext(spark)
    return sqlContext

#--------THIS FUNCTION IS MY CONCERN ACTUALLY------------
def check_ip(ip, df):
    rows = df.filter("src_ip = '" + ip + "' or dst_ip = '" + ip + "'") \
             .agg(_sum('total').alias('data')) \
             .collect()
    print(rows[0][0])
#-----------------------------------------------------------

def load_df(sqlContext):
    df = sqlContext \
        .read \
        .format('org.apache.spark.sql.cassandra') \
        .options(table='acrs_app_data_usage', keyspace='acrs') \
        .load()
    return df

if __name__ == '__main__':
    lists = ['10.8.25.6', '10.8.24.10', '10.8.24.11', '10.8.20.1', '10.8.25.15', '10.8.25.10']
    sqlContext = connect_cassandra()
    df = load_df(sqlContext)
    for ip in lists:
        check_ip(ip, df)
The function check_ip() takes an IP and the pre-loaded dataframe as arguments. The dataframe has 3 columns (src_ip, dst_ip and total) and around 250K rows. The function then sums the total column over the rows matching the given IP and returns the summed data.
But when I execute the script, it takes at least a second per IP to return the summed amount, and I have over 32K IPs for which the same has to happen, so the total time adds up to a lot.
Any help would be appreciated. Thanks in advance.
Short answer: Don't use loops.

Possible solution:

1. Convert lists to a dataframe, lists_df.
2. Inner-join lists_df with your dataframe twice, first on ip == src_ip and the second on ip == dst_ip.
3. Concatenate both results with unionAll.
4. Then groupBy("ip").agg(_sum("total")), as sketched below.

This uses joins, so there is perhaps an even better solution out there.
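A minimal sketch of these steps, reusing the df, lists, and sqlContext from the question (the lists_df, src_matches, dst_matches, all_matches, and result names are just for illustration):

from pyspark.sql.functions import sum as _sum

# 1. Convert the Python list of IPs into a single-column dataframe.
lists_df = sqlContext.createDataFrame([(ip,) for ip in lists], ['ip'])

# 2. Inner-join once on src_ip and once on dst_ip, keeping only ip and total.
src_matches = lists_df.join(df, lists_df.ip == df.src_ip).select('ip', 'total')
dst_matches = lists_df.join(df, lists_df.ip == df.dst_ip).select('ip', 'total')

# 3. Concatenate both match sets.
all_matches = src_matches.unionAll(dst_matches)

# 4. Aggregate once for all IPs instead of once per IP.
result = all_matches.groupBy('ip').agg(_sum('total').alias('data'))
result.show()

This way Spark scans the 250K rows once and computes all of the sums in a single job, instead of launching one job (and one full filter pass) per IP, which is where the per-IP second was going.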