
How to increase speed when writing Spark DataFrame to Redis?

I am developing a book recommendation API based on Flask, and I found that to handle multiple requests I need to pre-calculate a similarity matrix and store it somewhere for future queries. This matrix is created with PySpark from ~1.5 million database entries, each with a book id, name and metadata, and the result can be described by this schema (i and j are book indexes, dot is the similarity of their metadata):

StructType(List(StructField(i,IntegerType,true),StructField(j,IntegerType,true),StructField(dot,DoubleType,true)))
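For reference, a DataFrame matching this schema could be built like so (a minimal sketch; the sample rows are hypothetical and sql_sc refers to the SQLContext from the session setup shown further down):

from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType

# Schema of the similarity matrix: book index pair (i, j) and their metadata similarity
schema = StructType([
    StructField("i", IntegerType(), True),
    StructField("j", IntegerType(), True),
    StructField("dot", DoubleType(), True),
])

# Hypothetical sample values, for illustration only
similarities = sql_sc.createDataFrame(
    [(1, 2, 0.83), (1, 3, 0.12), (2, 3, 0.47)],
    schema=schema,
)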

Initially, my intention was to store it in Redis using the spark-redis connector. However, the following command runs very slowly (even when the initial book database query is limited to a very modest 40k-row batch):

similarities.write.format("org.apache.spark.sql.redis").option("table", "similarities").option("key.column", "i").save()

It took around 6 hours to get through 3 of the 9 stages Spark split the task into. Strangely, storage memory usage by the Spark executors was very low, around 20 KB. A typical active stage is described like this by the Spark Application UI:

org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
py4j.Gateway.invoke(Gateway.java:282)
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
py4j.commands.CallCommand.execute(CallCommand.java:79)
py4j.GatewayConnection.run(GatewayConnection.java:238)
java.lang.Thread.run(Thread.java:748)

Is it possible to somehow speed up this process? My Spark session is set up this way:

import os
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

SUBMIT_ARGS = "  --driver-memory 2G --executor-memory 2G --executor-cores 4 --packages mysql:mysql-connector-java:5.1.39 pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS
# Attach the spark-redis jar and set executor memory
conf = SparkConf().set("spark.jars", "spark-redis/target/spark-redis_2.11-2.4.3-SNAPSHOT-jar-with-dependencies.jar").set("spark.executor.memory", "4g")
sc = SparkContext('local', 'example', conf=conf)
sql_sc = SQLContext(sc)

You may try using the Append save mode to avoid checking whether the data already exists in the table:

similarities.write.format("org.apache.spark.sql.redis").option("table", "similarities").mode('append').option("key.column", "i").save()

Also, you may want to change

sc = SparkContext('local','example', conf=conf) 

to

sc = SparkContext('local[*]','example', conf=conf) 

to utilize all cores on your machine.

BTW, is it correct to use i as the key in Redis? Shouldn't it be a composite of both i and j?
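If you do switch to a composite key, one way to build it is to concatenate i and j into a single column and point key.column at that column (a minimal sketch; the id column name and the ":" separator are my own choices, not from the original post):

from pyspark.sql.functions import concat_ws, col

# Derive a composite key "i:j" so each (i, j) similarity pair gets its own Redis key
similarities_keyed = similarities.withColumn("id", concat_ws(":", col("i"), col("j")))

similarities_keyed.write \
    .format("org.apache.spark.sql.redis") \
    .option("table", "similarities") \
    .option("key.column", "id") \
    .mode("append") \
    .save()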
