
How to take out only part of a string from a DF string column in Spark (Scala)

In a Dataframe I have a column containing the following data:

('Rated 3.0', "RATED\n \nWent there for a quick bite with friends.\nThe ambience had more of corporate feel. I would say it was unique.\nTried nachos, pasta churros and lasagne.\n\nNachos were pathetic.( Seriously don't order)\nPasta was okayish.\nLasagne was good.\nNutella churros were the best.\nOverall an okayish experience!\nPeace ??"), ('Rated 4.0', "RATED\n  First of all, a big thanks to the staff of this Cafe. Very polite and courteous.\n\nI was there 15mins before their closing time. Without any discomfort or hesitation, the staff welcomed me with a warm smile and said they're still open, though they were preparing to close the cafe for the day.\n\nQuickly ordered the Thai green curry, which is served with rice. They got it for me within 10mins, hot and freshly made.\n\nIt was tasty with the taste of coconut milk. Not very spicy, it was mild spicy.\n\nI saw they had yummy looking dessert menu, should go there to try them out!\n\nA good spacious place to hang out for coffee, pastas, pizza or Thai food.")

I need to take out the `Rated 3.0` part from each record. It is a StringType column. How can I strip the extra data and extract just that part?

If every row has the format `Rated x.x`, you can simply use the substring function.

scala> df.select(substring('value,3,9)).show
+----------------------+
|substring(value, 3, 9)|
+----------------------+
|             Rated 3.0|
+----------------------+
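If the `Rated x.x` part does not always sit at the same offset, a regex is more robust than a fixed `substring`. In Spark this would be `regexp_extract(col("value"), "Rated \\d+\\.\\d+", 0)`; the sketch below demonstrates the same pattern in plain Scala so it stays self-contained (the helper name `extractRating` is made up for illustration, not part of the original answer).

```scala
import scala.util.matching.Regex

// Matches "Rated" followed by a decimal number, e.g. "Rated 3.0".
val ratingPattern: Regex = """Rated \d+\.\d+""".r

// Returns the first "Rated x.x" occurrence in a row, or "" if none is found.
def extractRating(row: String): String =
  ratingPattern.findFirstIn(row).getOrElse("")
```

With this approach the rating is found regardless of where it appears in the string, so leading characters such as `('` do not matter.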

If you have several "Rated" entries in a single row, you can try regexp_replace and substitute the following values:

(' to "
', to ":
") to "

Additionally, you should add `{` at the beginning of the string and `}` at the end, so the result looks like this:

{
    "a": "b",
    "c": "d"
}

This way you build a JSON string, and in the next step you can use the from_json function to create an array/struct and extract the values from it.
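A minimal sketch of the replacement steps above, using plain Scala string methods to show the intermediate result. In a real pipeline each step would be a `regexp_replace` on the column (with the parentheses escaped in the patterns); the helper name `toJsonString` is only illustrative.

```scala
// Applies the three substitutions described above and wraps the result in
// braces so that it parses as a JSON object.
def toJsonString(raw: String): String = {
  val replaced = raw
    .replace("('", "\"")    // ('  ->  "
    .replace("',", "\":")   // ',  ->  ":
    .replace("\")", "\"")   // ")  ->  "
  "{" + replaced + "}"
}
```

For example, `toJsonString("""('Rated 3.0', "Good food")""")` yields `{"Rated 3.0": "Good food"}`, which `from_json` could then parse with a `MapType(StringType, StringType)` schema, exposing each rating as a key.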

Here is my solution, assuming the question has two records.

// Create the list

val mytestList=List(("""Rated 3.0, RATED Went there for a quick bite with friends.The ambience had more of corporate feel. I would say it was unique.Tried nachos, pasta churros and lasagne.Nachos were pathetic.( Seriously don't order)Pasta was okayish.Lasagne was good.Nutella churros were the best.Overall an okayish experience!Peace ??"""), 
("""Rated 4.0, RATED  First of all, a big thanks to the staff of this Cafe. Very polite and courteous.I was there 15mins before their closing time. Without any discomfort or hesitation, the staff welcomed me with a warm smile and said they're still open, though they were preparing to close the cafe for the day.Quickly ordered the Thai green curry, which is served with rice. They got it for me within 10mins, hot and freshly made.It was tasty with the taste of coconut milk. Not very spicy, it was mild spicy.I saw they had yummy looking dessert menu, should go there to try them out!A good spacious place to hang out for coffee, pastas, pizza or Thai food."""))

// Load the list into an RDD

val rdd = spark.sparkContext.parallelize(mytestList)

// Impose the column name (toDF requires import spark.implicits._)

val DF1 = rdd.toDF("Rating")

// Solution 1

DF1.withColumn("tmp", split($"Rating", ",")).select($"tmp".getItem(0).as("col1")).show()
+---------+
|     col1|
+---------+
|Rated 3.0|
|Rated 4.0|
+---------+

// Solution 2: keep only the rating and drop the original column

DF1.withColumn("tmp", split(col("Rating"), ",").getItem(0)).drop("Rating").show()


+---------+
|      tmp|
+---------+
|Rated 3.0|
|Rated 4.0|
+---------+
