Convert a dataframe column with values to a list using Spark and Scala

Question

+-----------------------------------------------------------------------------------------------------------------------------------------------+
|Texts                                                                                                                                          |
+----------------------------------------------------------------------------------------------------------------------------------------------+
|RT @xxxxxx: post aqwe qwqq ssdd qaAQ WQWQW CSDWDW!!! 

must RT !                                                                                                                                      |
|RT @xxxxx: aaa in ssss ssss ss sqqq this qqq in "sss" should xxxx xx at xx xaaaa aqw   |
|RT @xxxx: QWW sadad jkhj to hjyhy a eryr rrryryry? ersfsfdsgdgdgg t rtrt ytyyryr.
 
sadwf wwewe ewewe jyiopo;l dwewre etet of the ddgdg-we dfdfdf, @b…                                                                              |
+-----------------------------------------------------------------------------------------------------------------------------------------------+

I want to collect the row values of the Texts column into a list using Scala and Spark.

1. val newList = myDataframe.select("Texts").rdd.map(_(0)).collect.toList
2. val newList = myDataframe.select("Texts").collect().map(_(0)).toList
   newList.foreach(println)

Neither approach produces any output, and the program never terminates. No errors are thrown.

Expected output

["RT @xxxxxx: post aqwe qwqq ssdd qaAQ WQWQW CSDWDW!!! must RT !", "RT @xxxxx: aaa in ssss ssss ss sqqq this qqq in "sss" should xxxx xx at xx xaaaa aqw", "RT @xxxx: QWW sadad jkhj to hjyhy a eryr rrryryry? ersfsfdsgdgdgg t rtrt ytyyryr.

sadwf wwewe ewewe jyiopo;l dwewre etet of the ddgdg-we dfdfdf, @b…"]

Please note that the sentence in each row of the dataframe may contain a newline,

e.g. I am going to the shop.\n Its very expensive

This sentence will be displayed as

 I am going to the shop
 its very expensive

But both lines belong to the same row.

Answer 1

The methods below are correct for converting a column of a dataframe into a list:

1. val newList = myDataframe.select("Texts").rdd.map(_(0)).collect.toList
2. val newList = myDataframe.select("Texts").collect().map(_(0)).toList
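
Both calls yield a List[Any], because Row.apply(i) returns Any. If a List[String] is wanted, one option is to read the column through a typed Dataset. The following is a minimal sketch, not part of the original answer: the object name ColumnToList, the helper textsAsList, and the sample rows are made up for illustration, and it assumes a local SparkSession and a column small enough to collect to the driver.

```scala
import org.apache.spark.sql.{DataFrame, Encoders, SparkSession}

object ColumnToList {
  // Hypothetical helper: collects one string column to the driver as List[String].
  // Assumption: the column holds strings and fits in driver memory.
  def textsAsList(df: DataFrame, column: String = "Texts"): List[String] =
    df.select(column).as[String](Encoders.STRING).collect().toList

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, only for trying the helper out
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("column-to-list")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data; the first value contains an embedded newline
    val myDataframe = Seq("first row\nstill the first row", "second row").toDF("Texts")

    val newList: List[String] = textsAsList(myDataframe)
    println(newList.length) // 2: the embedded newline does not split the row

    spark.stop()
  }
}
```

The explicit Encoders.STRING argument avoids needing the session implicits inside the helper; with import spark.implicits._ in scope, .as[String] alone would do the same.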

However, the question says that each row of the dataframe may contain newlines, so the methods above won't work directly. The newlines should be removed first.

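import org.apache.spark.sql.functions.{col, regexp_replace}

// Replace every carriage return or line feed with "." so each row is a single line
val singleLineDataframe = myDataframe.withColumn("Texts", regexp_replace(col("Texts"), "[\\r\\n]", "."))
// Collect the cleaned column to the driver as a list
val sentenceList = singleLineDataframe.select("Texts").rdd.map(r => r(0)).collect.toList
for (element <- sentenceList)
  println(element)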
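
To check what a replacement pattern does before running it on real data, it can be tried on a plain Scala string. The sketch below is not from the original answer. It assumes that Spark's regexp_replace follows Java regular-expression syntax, so that String.replaceAll behaves the same way for these simple patterns. The object and method names are hypothetical. The first method mirrors the character class used in the answer; the second is an alternative that produces a space instead of ".", which is closer to the expected output shown in the question.

```scala
object NewlineCleanup {
  // Equivalent to the answer's character class: each CR or LF becomes "."
  def dotPerNewline(s: String): String =
    s.replaceAll("[\\r\\n]", ".")

  // Hypothetical alternative: a run of line breaks plus any surrounding
  // whitespace collapses into a single space
  def collapseToSpace(s: String): String =
    s.replaceAll("\\s*[\\r\\n]+\\s*", " ")

  def main(args: Array[String]): Unit = {
    // Sample sentence taken from the question
    val sample = "I am going to the shop.\n Its very expensive"
    println(dotPerNewline(sample))   // I am going to the shop.. Its very expensive
    println(collapseToSpace(sample)) // I am going to the shop. Its very expensive
  }
}
```

If the space-separated form is preferred, the same pattern and replacement strings could be passed to regexp_replace in place of the ones used above.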

Solution 1
0 Accepted 2021-03-12 16:28:12
