I am developing a book recommendation API based on Flask, and I found that to handle multiple requests I need to pre-calculate a similarity matrix and store it somewhere for future queries. This matrix is created with PySpark from ~1.5 million database entries containing book id, name, and metadata, and the result can be described by this schema (i and j are book indexes, dot is the similarity of their metadata):
StructType(List(StructField(i,IntegerType,true),StructField(j,IntegerType,true),StructField(dot,DoubleType,true)))
Initially I intended to store it in Redis using the spark-redis connector. However, the following command runs very slowly (even when the initial book database query is limited to a modest batch of 40k rows):
similarities.write.format("org.apache.spark.sql.redis").option("table", "similarities").option("key.column", "i").save()
It took around 6 hours to advance through 3 of the 9 stages Spark split the initial task into. Strangely, storage memory usage by the Spark executors was very low, around 20 KB. A typical active stage is described like this in the Spark Application UI:
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
py4j.Gateway.invoke(Gateway.java:282)
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
py4j.commands.CallCommand.execute(CallCommand.java:79)
py4j.GatewayConnection.run(GatewayConnection.java:238)
java.lang.Thread.run(Thread.java:748)
Is it possible to somehow speed up this process? My Spark session is set up this way:
SUBMIT_ARGS = " --driver-memory 2G --executor-memory 2G --executor-cores 4 --packages mysql:mysql-connector-java:5.1.39 pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS
conf = SparkConf().set("spark.jars", "spark-redis/target/spark-redis_2.11-2.4.3-SNAPSHOT-jar-with-dependencies.jar").set("spark.executor.memory", "4g")
sc = SparkContext('local','example', conf=conf)
sql_sc = SQLContext(sc)
You may try to use the Append save mode to avoid checking whether the data already exists in the table:
similarities.write.format("org.apache.spark.sql.redis").option("table", "similarities").mode('append').option("key.column", "i").save()
Also, you may want to change
sc = SparkContext('local','example', conf=conf)
to
sc = SparkContext('local[*]','example', conf=conf)
to utilize all cores on your machine.
BTW, is it correct to use i as a key in Redis? Shouldn't it be a composition of both i and j?