How to do a string transformation of an RDD?

I have some documents from which I need to extract each word and then, per document, count how many times that word occurs, using PySpark. I have managed to get the data into the format below:

["of#['d2:3', 'd4:10', 'd1:6', 'd3:13', 'd5:6', 'd6:9', 'd7:5']",
 "is#['d2:3', 'd4:8', 'd1:5', 'd3:1', 'd5:4', 'd6:6', 'd7:1']",
 "country#['d2:3', 'd1:1', 'd5:2', 'd6:2']",
 "in#['d2:5', 'd4:13', 'd1:2', 'd3:2', 'd5:2', 'd6:3', 'd7:3']",
 "seventh#['d2:1']"]

How can I transform the above RDD into something like the following?

of#d2:3, d4:10, d1:6, d3:13, d5:6, d6:9, d7:5, 
is#d2:3, d4:8, d1:5, d3:1, d5:4, d6:6, d7:1, 
country#d2:3, d1:1, d5:2, d6:2,
in#d2:5, d4:13, d1:2, d3:2, d5:2, d6:3, d7:3,
seventh#d2:1

I tried the following line of code, but I am getting an error. I would appreciate some input on where I am going wrong.

print(x.map(lambda x:str(x[0])+"#"+str(x[1])).take(5))

It seems you only want to remove the square brackets and single quotes from those string values.

You can do something like this:

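import re

# Remove every '[', ']' and single-quote character from each string element.
rdd1 = rdd.map(lambda x: re.sub(r"[\['\]]", "", x))

# collect() pulls all elements to the driver, which is fine for a small sample.
for i in rdd1.collect():
    print(i)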
    
# of#d2:3, d4:10, d1:6, d3:13, d5:6, d6:9, d7:5
# is#d2:3, d4:8, d1:5, d3:1, d5:4, d6:6, d7:1
# country#d2:3, d1:1, d5:2, d6:2
# in#d2:5, d4:13, d1:2, d3:2, d5:2, d6:3, d7:3
# seventh#d2:1
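
For comparison, here is a minimal sketch of the same cleanup written as plain Python functions, so it can be tried without a Spark session. It assumes every element is a string of the form word#[...] as shown in the question. The names strip_list_markup and reformat_entry are made up for illustration and do not come from the original post.

```python
import ast
import re


def strip_list_markup(line: str) -> str:
    """Regex approach: drop every '[', ']' and single quote in the string."""
    return re.sub(r"[\[\]']", "", line)


def reformat_entry(line: str) -> str:
    """Alternative: split on the first '#', parse the list literal, re-join it."""
    word, _, postings = line.partition("#")
    # ast.literal_eval safely turns "['d2:3', 'd1:1']" into a Python list.
    return word + "#" + ", ".join(ast.literal_eval(postings))


# Two of the sample strings from the question.
sample = [
    "country#['d2:3', 'd1:1', 'd5:2', 'd6:2']",
    "seventh#['d2:1']",
]

for line in sample:
    print(reformat_entry(line))

# country#d2:3, d1:1, d5:2, d6:2
# seventh#d2:1
```

Either function could be passed to map on an RDD of such strings, for example rdd.map(reformat_entry). The regex version also strips any apostrophe that appears inside the word itself, while the parsing version leaves the word untouched but fails if the part after the first # is not a valid list literal.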
