Pass one dataframe column values to another dataframe filter condition expression + Spark 1.5

I have two input datasets. The first input dataset looks like this:

year,make,model,comment,blank
"2012","Tesla","S","No comment",
1997,Ford,E350,"Go get one now they are going fast",
2015,Chevy,Volt

Second input dataset:

TagId,condition
1997_cars,year = 1997 and model = 'E350'
2012_cars,year=2012 and model ='S'
2015_cars ,year=2015 and model = 'Volt'

Now my requirement is to read the first dataset and, based on the filter conditions in the second dataset, tag the rows of the first dataset by introducing a new TagId column, so the expected output should look like this:

year,make,model,comment,blank,TagId
"2012","Tesla","S","No comment",2012_cars
1997,Ford,E350,"Go get one now they are going fast",1997_cars
2015,Chevy,Volt, ,2015_cars

I tried:

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types._

val sqlContext = new SQLContext(sc)

// Schema for the cars dataset
val carsSchema = StructType(Seq(
    StructField("year", IntegerType, true),
    StructField("make", StringType, true),
    StructField("model", StringType, true),
    StructField("comment", StringType, true),
    StructField("blank", StringType, true)))

// Schema for the tag rules dataset
val carTagsSchema = StructType(Seq(
    StructField("TagId", StringType, true),
    StructField("condition", StringType, true)))

val dfcars = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true").schema(carsSchema)
  .load("/TestDivya/Spark/cars.csv")
val dftags = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true").schema(carTagsSchema)
  .load("/TestDivya/Spark/CarTags.csv")

val Amendeddf = dfcars.withColumn("TagId", dfcars("blank"))
val cdtnval = dftags.select("condition")
val df2=dfcars.filter(cdtnval)
<console>:35: error: overloaded method value filter with alternatives:
  (conditionExpr: String)org.apache.spark.sql.DataFrame <and>
  (condition: org.apache.spark.sql.Column)org.apache.spark.sql.DataFrame
 cannot be applied to (org.apache.spark.sql.DataFrame)
       val df2=dfcars.filter(cdtnval)

Another way:

val col = dftags.col("TagId")
val finaldf = dfcars.withColumn("TagId", col)
org.apache.spark.sql.AnalysisException: resolved attribute(s) TagId#5 missing from comment#3,blank#4,model#2,make#1,year#0 in operator !Project [year#0,make#1,model#2,comment#3,blank#4,TagId#5 AS TagId#8];

finaldf.write.format("com.databricks.spark.csv").option("header", "true").save("/TestDivya/Spark/carswithtags.csv")

I would really appreciate it if somebody could give me pointers on how to pass the filter condition to the dataframe's filter function, or suggest another solution. My apologies for such a naive question, as I am new to Scala and Spark.

Thanks

There is no simple solution to this. I think there are two general directions you can go with it:

  • Collect the conditions (dftags) to a local list. Then go through it one by one, executing each on the cars (dfcars) as a filter, and use the results to get the desired output (see the first sketch after this list).

  • Collect the conditions (dftags) to a local list, and implement the parsing and evaluation code for them yourself. Go through the cars (dfcars) once, evaluating the ruleset on each row inside a map (see the second sketch below).
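
For the first direction, here is a minimal sketch. It assumes the condition strings are valid Spark SQL expressions (true of the sample data) and that each car matches at most one tag; cars matching no rule are dropped. dfcars and dftags are the DataFrames from the question.

import org.apache.spark.sql.functions.lit

// Collect the (TagId, condition) pairs to the driver.
val tagRules: Array[(String, String)] =
  dftags.collect().map(r => (r.getString(0), r.getString(1)))

// Apply each condition string as a filter, attach the matching TagId,
// and union the per-tag results back into one DataFrame.
val tagged = tagRules
  .map { case (tagId, cond) => dfcars.filter(cond).withColumn("TagId", lit(tagId)) }
  .reduce(_ unionAll _)  // Spark 1.5 uses unionAll rather than union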

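For the second direction, here is a rough sketch. It assumes every condition has the fixed form year = <int> and model = '<name>' (again true of the sample data); a real implementation would need a proper expression parser instead of the regex below.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Parse each rule into (TagId, year, model) on the driver; the regex only
// covers the simple shape assumed above and will fail on anything else.
val YearModel = """year\s*=\s*(\d+)\s*and\s*model\s*=\s*'([^']+)'""".r
val rules: Array[(String, Int, String)] = dftags.collect().map { r =>
  val YearModel(y, m) = r.getString(1).trim
  (r.getString(0), y.toInt, m)
}

// One pass over the cars, evaluating the whole ruleset on each row.
// (row.getInt(0) would throw on a null year; fine for the sample data.)
val taggedRows = dfcars.rdd.map { row =>
  val tag = rules
    .find { case (_, y, m) => row.getInt(0) == y && row.getString(2) == m }
    .map(_._1).getOrElse(null)
  Row.fromSeq(row.toSeq :+ tag)
}

// Rebuild a DataFrame with the extra TagId column.
val taggedSchema = StructType(carsSchema.fields :+ StructField("TagId", StringType, true))
val taggedDf = sqlContext.createDataFrame(taggedRows, taggedSchema)
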
If you have a high number of conditions (so you cannot collect them) and a high number of cars, then the situation is very bad. You need to check every car against every condition, so this will be very inefficient. In this case you need to optimize the ruleset first, so it can be evaluated more efficiently. (A decision tree may be a nice solution; a much simpler special case is sketched below.)
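
As a hedged illustration of "optimize the ruleset first": if every rule keys on year and model, as in the sample data, the parsed rules from the previous sketch can be indexed in a Map so that each car is matched in O(1) instead of scanning every rule. A decision tree generalizes this idea to arbitrary predicates.

// Index the parsed (TagId, year, model) rules by their (year, model) key.
val ruleIndex: Map[(Int, String), String] =
  rules.map { case (tag, y, m) => (y, m) -> tag }.toMap

// Constant-time lookup per car instead of a linear scan over all rules.
val fastTaggedRows = dfcars.rdd.map { row =>
  Row.fromSeq(row.toSeq :+ ruleIndex.getOrElse((row.getInt(0), row.getString(2)), null))
}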
