
Partial match of a string in a column of a Spark dataframe (Scala)

I have a dataframe as follows:

id  value
1   I am a boy
1   I am a men
1   I am afather
2   I am a girl
2   I am awomen
2   I am a mother

I have two lists as follows:

val male = List("boy", "men", "father")
val female = List("girl", "women", "mother")

I want to search the value column for a partial match against any of the strings in the lists, and create a resulting dataframe as follows:

id  value          gender
1   I am a boy     male
1   I am a men     male
1   I am a father  male
2   I am a girl    female
2   I am a women   female
2   I am a mother  female

I am using Scala. I just want to check for a substring in the column. I cannot split the values in the column because they are not consistently separated by spaces, but the strings from the lists are present in them.

Using the RDD approach:

scala> val df = Seq((1,"I am a boy"),
     | (1,"I am a men"),
     | (1,"I am a father"),
     | (2,"I am a girl"),
     | (2,"I am a women"),
     | (2,"I am a mother")).toDF("id", "value")
df: org.apache.spark.sql.DataFrame = [id: int, value: string]

scala> val male = List("boy", "men", "father")
male: List[String] = List(boy, men, father)

scala> val female = List("girl", "women", "mother")
female: List[String] = List(girl, women, mother)

scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala> import org.apache.spark.sql.types.{StringType, StructField}
import org.apache.spark.sql.types.{StringType, StructField}

scala> val rdd2 = df.rdd.map( x => { val p = if(male.intersect(x(1).toString.split(" ")).length > 0) "male" else if (female.intersect(x(1).toString.split(" ")).length > 0) "female" else "none" ; Row(x(0),x(1),p) } )
rdd2: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[26] at map at <console>:41

scala> val schema = df.schema.add(StructField("gender",StringType))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(id,IntegerType,false), StructField(value,StringType,true), StructField(gender,StringType,true))

scala> spark.createDataFrame(rdd2,schema).show
+---+-------------+------+
| id|        value|gender|
+---+-------------+------+
|  1|   I am a boy|  male|
|  1|   I am a men|  male|
|  1|I am a father|  male|
|  2|  I am a girl|female|
|  2| I am a women|female|
|  2|I am a mother|female|
+---+-------------+------+
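
The `split(" ")` plus `intersect` check above only matches whole space-separated words, so values from the question such as `I am afather` or `I am awomen` would come out as `none`. Here is a minimal sketch of a substring check that could be used instead. The helper name `genderOf` is hypothetical and not part of the original answer. The female list is tested first because `women` contains `men`.

```scala
val male = List("boy", "men", "father")
val female = List("girl", "women", "mother")

// Hypothetical helper: substring match instead of whole-word match.
// female is checked before male because "women" contains "men".
def genderOf(value: String): String =
  if (female.exists(w => value.contains(w))) "female"
  else if (male.exists(w => value.contains(w))) "male"
  else "none"

println(genderOf("I am afather")) // male
println(genderOf("I am awomen"))  // female
println(genderOf("hello"))        // none
```

Inside the `map` above, this would likely reduce the labelling step to `val p = genderOf(x(1).toString)`. Note that plain substring matching also fires on unrelated words that happen to contain a keyword (for example `comment` contains `men`).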


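
The same labelling can probably be done without dropping to the RDD API, by combining `Column.contains` with `when`/`otherwise`. This is a sketch under assumptions, not the original answer's method: the object name `PartialMatchSketch` and the helpers `containsAny` and `withGender` are hypothetical, and the sample rows use the unspaced values from the question. As before, the female list is checked first because `women` contains `men`.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, when}

object PartialMatchSketch {
  val male = List("boy", "men", "father")
  val female = List("girl", "women", "mother")

  // Hypothetical helper: true when the column contains at least one
  // of the given words as a substring.
  def containsAny(c: Column, words: List[String]): Column =
    words.map(w => c.contains(w)).reduce(_ || _)

  // Hypothetical helper: adds a "gender" column based on the "value" column.
  // female is checked before male because "women" contains "men".
  def withGender(df: DataFrame): DataFrame =
    df.withColumn(
      "gender",
      when(containsAny(col("value"), female), lit("female"))
        .when(containsAny(col("value"), male), lit("male"))
        .otherwise(lit("none"))
    )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("partial-match-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      (1, "I am a boy"), (1, "I am a men"), (1, "I am afather"),
      (2, "I am a girl"), (2, "I am awomen"), (2, "I am a mother")
    ).toDF("id", "value")

    withGender(df).show(false)
    spark.stop()
  }
}
```

In `spark-shell`, where `spark` and its implicits are already available, only the two helpers and the `withGender(df)` call would be needed.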
