Filter works in the pyspark shell but not with spark-submit

df_filter = df.filter(~(col('word').isin(stop_words_list)))

df_filter.count()

27781

df.count()

31240

When I submit the same code to a Spark cluster with spark-submit, the filter does not work properly: rows whose col('word') value is in stop_words_list are not filtered out. Why does this happen?

The filtering works now after col('word') is trimmed:

df_filter = df.filter(~(trim(col("word")).isin(stop_words_list)))

I still don't know why it works in the pyspark shell but not with spark-submit. The only difference between them is that in the pyspark shell I used spark.read.csv() to read the file, while with spark-submit I used the following method:

from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
session = SparkSession.builder.appName('test').getOrCreate()
sqlContext = SQLContext(session)
df = sqlContext.read.format("com.databricks.spark.csv").option('header','true').load()

I'm not sure whether the two different read-in methods cause the discrepancy. Could someone familiar with this clarify?
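The fact that trim() fixes the filter suggests that some values in the word column carry leading or trailing whitespace, so the exact comparison done by isin() never matches them. The following is only a plain-Python sketch of that effect, not Spark code: the sample data and the variable names raw, kept_exact and kept_trimmed are made up for illustration, and it does not explain why the two read methods would produce different values.

```python
import csv
import io

# Hypothetical sample file: "the " has a trailing space, " and" a leading one.
raw = "word\nspark\nthe \n and\ncluster\n"
stop_words_list = ["the", "and"]

# csv.DictReader keeps surrounding whitespace in field values by default.
words = [row["word"] for row in csv.DictReader(io.StringIO(raw))]

# Exact membership test, analogous to ~col("word").isin(stop_words_list):
# the padded stop words do not match, so nothing is removed.
kept_exact = [w for w in words if w not in stop_words_list]

# Strip before comparing, analogous to ~trim(col("word")).isin(stop_words_list).
kept_trimmed = [w for w in words if w.strip() not in stop_words_list]

print(kept_exact)
print(kept_trimmed)
```

If the same thing is happening in the spark-submit job, comparing the length of each word value before and after trimming should reveal the padded rows.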

Try using double quotes instead of single quotes.

from pyspark.sql.functions import col
# count() returns an integer, so the result is a row count rather than a DataFrame
filtered_count = df.filter(~(col("word").isin(stop_words_list))).count()
