Converting a dataframe column with values to a list using Spark and Scala
+-----------------------------------------------------------------------------------------------------------------------------------------------+
|Texts |
+----------------------------------------------------------------------------------------------------------------------------------------------+
|RT @xxxxxx: post aqwe qwqq ssdd qaAQ WQWQW CSDWDW!!!
must RT ! |
|RT @xxxxx: aaa in ssss ssss ss sqqq this qqq in "sss" should xxxx xx at xx xaaaa aqw |
|RT @xxxx: QWW sadad jkhj to hjyhy a eryr rrryryry? ersfsfdsgdgdgg t rtrt ytyyryr.
sadwf wwewe ewewe jyiopo;l dwewre etet of the ddgdg-we dfdfdf, @b… |
+-----------------------------------------------------------------------------------------------------------------------------------------------+
I want to collect the values in the rows of the Texts column into a list, using Scala and Spark.
1. val newList = myDataframe.select("Texts").rdd.map(_(0)).collect.toList
2. val newList = myDataframe.select("Texts").collect().map(_(0)).toList
newList.foreach(println)
Neither approach produces any output, and the program does not terminate. No errors are thrown.
Expected output:
["RT @xxxxxx: post aqwe qwqq ssdd qaAQ WQWQW CSDWDW!!! must RT !", "RT @xxxxx: aaa in ssss ssss ss sqqq this qqq in "sss" should xxxx xx at xx xaaaa aqw", "RT @xxxx: QWW sadad jkhj to hjyhy a eryr rrryryry? ersfsfdsgdgdgg t rtrt ytyyryr.
sadwf wwewe ewewe jyiopo;l dwewre etet of the ddgdg-we dfdfdf, @b…"]
Please note that the sentence in each row of the dataframe may contain a newline, e.g.
I am going to the shop.\n Its very expensive
This sentence will be displayed as
I am going to the shop
its very expensive
But both lines belong to the same row.
Both methods below are correct ways to convert a dataframe column into a list:
1. val newList = myDataframe.select("Texts").rdd.map(_(0)).collect.toList
2. val newList = myDataframe.select("Texts").collect().map(_(0)).toList
But the dataframe in the question may contain newlines in each row, so the above methods won't work directly. The newlines should be removed first.
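import org.apache.spark.sql.functions.{col, regexp_replace}

// Replace every carriage return or line feed with "." so each row is a single line
val singleLineDataframe = myDataframe.withColumn("Texts", regexp_replace(col("Texts"), "[\\r\\n]", "."))
// Collect the column to the driver as a list
val sentenceList = singleLineDataframe.select("Texts").rdd.map(r => r(0)).collect.toList
for (element <- sentenceList)
  println(element)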
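For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same idea. The object name `ColumnToList`, the helper `toSingleLineList`, the local `SparkSession` settings and the sample rows are all made up for illustration and are not from the question. It reads the column through `Encoders.STRING`, so the result is a `List[String]` rather than the `List[Any]` that `r(0)` gives, and it defaults to a space as the replacement because the expected output in the question joins the lines with a space.

```scala
import org.apache.spark.sql.{DataFrame, Encoders, SparkSession}
import org.apache.spark.sql.functions.{col, regexp_replace}

object ColumnToList {

  // Hypothetical helper: replaces every CR or LF character in a string column
  // with `replacement`, then collects the column to the driver as List[String].
  def toSingleLineList(df: DataFrame, column: String, replacement: String = " "): List[String] =
    df.withColumn(column, regexp_replace(col(column), "[\\r\\n]", replacement))
      .select(column)
      .as(Encoders.STRING) // typed view of the single string column
      .collect()           // pulls all rows to the driver; only for data that fits in memory
      .toList

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("column-to-list")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows; the first one contains an embedded newline.
    val myDataframe = Seq(
      "I am going to the shop.\nIts very expensive",
      "a row without any newline"
    ).toDF("Texts")

    val sentenceList = toSingleLineList(myDataframe, "Texts")
    sentenceList.foreach(println)

    spark.stop()
  }
}
```

Passing `"."` as the third argument gives the same replacement as the answer above. Note that `collect()` brings every row to the driver, so this only suits columns small enough to fit in driver memory.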