Spark: select columns based on row values
I have an all-string Spark DataFrame, and I need to return the columns in which all rows meet a certain criterion.
scala> val df = spark.read.format("csv").option("delimiter",",").option("header", "true").option("inferSchema", "true").load("file:///home/animals.csv")
df.show()
+--------+---------+--------+
|Column 1| Column 2|Column 3|
+--------+---------+--------+
|(ani)mal| donkey| wolf|
| mammal|(mam)-mal| animal|
| chi-mps| chimps| goat|
+--------+---------+--------+
Here the criterion is: return the columns where every row value has length == 6, ignoring special characters. The result should be the DataFrame below, since all rows in Column 1 and Column 2 have length == 6:
+--------+---------+
|Column 1| Column 2|
+--------+---------+
|(ani)mal| donkey|
| mammal|(mam)-mal|
| chi-mps| chimps|
+--------+---------+
You can use regexp_replace to strip the special characters (if you know which ones they are), then compute the length and filter down to the columns you want:
import org.apache.spark.sql.functions.{col, length, regexp_replace}

val cols = df.columns
// For each column, add a companion "<name>_len" column holding the
// string length after stripping the special characters (, ) and -
val df2 = cols.foldLeft(df) {
  (acc, c) => acc.withColumn(c + "_len", length(regexp_replace(col(c), "[()-]", "")))
}
df2.show()
+--------+---------+-------+-----------+-----------+-----------+
| Column1| Column2|Column3|Column1_len|Column2_len|Column3_len|
+--------+---------+-------+-----------+-----------+-----------+
|(ani)mal| donkey| wolf| 6| 6| 4|
| mammal|(mam)-mal| animal| 6| 6| 6|
| chi-mps| chimps| goat| 6| 6| 4|
+--------+---------+-------+-----------+-----------+-----------+
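From there you still need to keep only the columns whose every row has stripped length 6. A minimal sketch of that last step (assuming the `df2` with the `_len` columns from above; note this launches one Spark job per column, so for many columns a single aggregate over `min`/`max` of each `_len` column would be more efficient):

```scala
import org.apache.spark.sql.functions.col

// Keep a column only when no row has a stripped length different from 6
val keep = df.columns.filter { c =>
  df2.filter(col(c + "_len") =!= 6).count() == 0
}

// Select the surviving columns from the original DataFrame
val result = df.select(keep.map(col): _*)
result.show()
```

With the sample data this keeps `Column 1` and `Column 2`, matching the expected output above.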