filter working in pyspark shell not spark-submit

Question

df_filter = df.filter(~(col('word').isin(stop_words_list)))

df_filter.count()

27781

df.count()

31240

The code above works as expected in the pyspark shell. When I submit the same code to a Spark cluster with spark-submit, the filter does not work properly: rows whose col('word') value is in stop_words_list are not filtered out. Why does this happen?

Answer 1

The filtering works now that col('word') is trimmed:

from pyspark.sql.functions import col, trim
df_filter = df.filter(~(trim(col("word")).isin(stop_words_list)))

I still don't know why the untrimmed filter works in the pyspark shell but not with spark-submit. The only difference between the two runs is how the file is read: in the pyspark shell I used spark.read.csv(), while in the spark-submit script I used the following:

from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
session = SparkSession.builder.appName('test').getOrCreate()
sqlContext = SQLContext(session)
df = sqlContext.read.format("com.databricks.spark.csv").option('header','true').load()

I'm not sure whether the two different read methods cause the discrepancy. Perhaps someone familiar with this can clarify.
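One reading of the trim() fix, which the post itself does not confirm, is that the words loaded under spark-submit carried leading or trailing spaces, so an exact-match test against stop_words_list never matched them. The sketch below is a pure-Python analogy rather than Spark code: the sample words are made up, and str.strip() stands in for trim().

```python
# Pure-Python analogy of isin() versus trim(...).isin(); no Spark required.
# The sample data below is hypothetical and only illustrates the mismatch.
stop_words_list = ["the", "a", "of"]
words = ["the", "the ", " a", "spark", "of  "]


def filter_exact(values, stop_words):
    """Keep values that are not exactly equal to any stop word."""
    stop = set(stop_words)
    return [v for v in values if v not in stop]


def filter_trimmed(values, stop_words):
    """Strip surrounding whitespace before the membership test."""
    stop = set(stop_words)
    return [v for v in values if v.strip() not in stop]


exact = filter_exact(words, stop_words_list)
trimmed = filter_trimmed(words, stop_words_list)

# repr() makes otherwise invisible padding visible in the output.
print([repr(v) for v in exact])
print([repr(v) for v in trimmed])
```

With this sample, the exact-match filter removes only the unpadded "the", while the trimmed filter also removes the padded stop words. Printing values with repr() is a quick way to check whether such padding is present in real data.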

Answer 2

Try using double quotes instead of single quotes.

from pyspark.sql.functions import col
df_filter = df.filter(~(col("word").isin(stop_words_list)))
df_filter.count()

solution1
0 2018-08-03 23:02:24

solution2
-1 2018-08-02 19:50:55