
Converting a dataframe column with values to a list using Spark and Scala

+-----------------------------------------------------------------------------------------------------------------------------------------------+
|Texts                                                                                                                                          |
+----------------------------------------------------------------------------------------------------------------------------------------------+
|RT @xxxxxx: post aqwe qwqq ssdd qaAQ WQWQW CSDWDW!!! 

must RT !                                                                                                                                      |
|RT @xxxxx: aaa in ssss ssss ss sqqq this qqq in "sss" should xxxx xx at xx xaaaa aqw   |
|RT @xxxx: QWW sadad jkhj to hjyhy a eryr rrryryry? ersfsfdsgdgdgg t rtrt ytyyryr.
 
sadwf wwewe ewewe jyiopo;l dwewre etet of the ddgdg-we dfdfdf, @b…                                                                              |
+-----------------------------------------------------------------------------------------------------------------------------------------------+

I want to collect the row values of the Texts column into a list using Scala and Spark.

1. val newList = myDataframe.select("Texts").rdd.map(_(0)).collect.toList
2. val newList = myDataframe.select("Texts").collect().map(_(0)).toList
   newList.foreach(println)

Neither approach produces any output, and the program does not terminate. No errors are thrown.

Expected output

["RT @xxxxxx: post aqwe qwqq ssdd qaAQ WQWQW CSDWDW!!! must RT !", "RT @xxxxx: aaa in ssss ssss ss sqqq this qqq in "sss" should xxxx xx at xx xaaaa aqw", "RT @xxxx: QWW sadad jkhj to hjyhy a eryr rrryryry? ersfsfdsgdgdgg t rtrt ytyyryr.

sadwf wwewe ewewe jyiopo;l dwewre etet of the ddgdg-we dfdfdf, @b…"]

Please note that the sentence in each row of the dataframe may contain a newline,

e.g. I am going to the shop.\n Its very expensive

This sentence will be displayed as

 I am going to the shop
 its very expensive

But both lines belong to the same row.

The methods below are correct ways to convert a column of a dataframe into a list:

1. val newList = myDataframe.select("Texts").rdd.map(_(0)).collect.toList
2. val newList = myDataframe.select("Texts").collect().map(_(0)).toList
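
Both variants produce a List[Any], because Row.apply returns Any. The following is a minimal sketch of a typed alternative, assuming the Texts column has string type. The object name ColumnToList, the helper columnAsList, the local SparkSession and the sample rows are all hypothetical and only stand in for the real myDataframe.

```scala
import org.apache.spark.sql.{DataFrame, Encoders, SparkSession}

object ColumnToList {
  // Hypothetical helper (not from the original post): collects one string
  // column to the driver as a List[String].
  def columnAsList(df: DataFrame, column: String): List[String] =
    df.select(column)
      .as[String](Encoders.STRING) // Dataset[String] instead of Dataset[Row]
      .collect()                   // Array[String], materialised on the driver
      .toList

  def main(args: Array[String]): Unit = {
    // Assumption: a local session and made-up rows replace the real data.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("column-to-list")
      .getOrCreate()
    import spark.implicits._

    val myDataframe = Seq("first text", "second\ntext").toDF("Texts")
    columnAsList(myDataframe, "Texts").foreach(println)

    spark.stop()
  }
}
```

Like the two variants above, collect() brings every row to the driver, so this is only suitable when the column fits in driver memory.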

But according to the question, each row of the dataframe may contain newlines, so the above methods won't work directly. The newlines should be removed first.

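import org.apache.spark.sql.functions.{col, regexp_replace}

// Replace every carriage return or newline with "." so each row is a single line
val singleLineDataframe = myDataframe.withColumn("Texts", regexp_replace(col("Texts"), "[\\r\\n]", "."))
// Collect the column to the driver; r(0) is untyped, so this is a List[Any]
val sentenceList = singleLineDataframe.select("Texts").rdd.map(r => r(0)).collect.toList
for (element <- sentenceList)
  println(element)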
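
The replacement above substitutes "." for each line-break character. The first entry of the expected output in the question appears to join the lines with a single space instead. A minimal sketch of that variant follows, assuming a string-typed Texts column. The object name SingleLineTexts, the helper singleLineList, the local SparkSession and the sample rows are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, Encoders, SparkSession}
import org.apache.spark.sql.functions.{col, regexp_replace}

object SingleLineTexts {
  // Hypothetical helper (not from the original post): collapses each run of
  // line breaks, together with the whitespace around it, into one space and
  // then collects the column as a List[String].
  def singleLineList(df: DataFrame, column: String): List[String] =
    df.withColumn(column, regexp_replace(col(column), "\\s*[\\r\\n]+\\s*", " "))
      .select(column)
      .as[String](Encoders.STRING)
      .collect()
      .toList

  def main(args: Array[String]): Unit = {
    // Assumption: a local session and made-up rows replace the real data.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("single-line-texts")
      .getOrCreate()
    import spark.implicits._

    val myDataframe = Seq("CSDWDW!!! \n\nmust RT !", "no line break here").toDF("Texts")
    singleLineList(myDataframe, "Texts").foreach(println)

    spark.stop()
  }
}
```

Using a space as the separator keeps the sentence boundaries of the original text intact, whereas "." can introduce doubled punctuation such as "!!!." in the collected strings.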


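Notice: the technical posts on this site follow the CC BY-SA 4.0 license. If you repost this content, please credit this site's URL or the original address. For any questions, contact yoyou2525@163.com.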
