
How to flatten an RDD in Python?

I have a dataset of spam messages, and it has this data type:

pyspark.rdd.PipelinedRDD

When I run spams.take(3), I get:

[["Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"], ['WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.'], ['Had your mobile 11 months or more? UR entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030']]

As you can see, each element is wrapped in its own pair of brackets inside the outer list. How can I get rid of those inner brackets? I tried many ways of flattening it, but none seems to work.

You can use the RDD's flatMap method. It lets you generate multiple output rows from one input row, so each inner list is unpacked into its elements.

spams.flatMap(lambda x:x).take(3)
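To see what flatMap does without a Spark cluster, here is a minimal plain-Python sketch of its semantics. The helper name flat_map and the shortened sample rows are assumptions made for illustration; they are not part of PySpark or of the original data.

```python
def flat_map(func, rows):
    """Hypothetical local stand-in for RDD.flatMap: apply func to each row
    and concatenate the resulting iterables into one flat list."""
    return [item for row in rows for item in func(row)]

# Shortened placeholder rows with the same shape as spams.take(3).
rows = [["Free entry ..."], ["WINNER!! ..."], ["Had your mobile ..."]]

flat = flat_map(lambda x: x, rows)
print(flat)  # ['Free entry ...', 'WINNER!! ...', 'Had your mobile ...']
```

With the identity function, each one-element list contributes its single string, so the inner brackets disappear.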

Your question does not make clear whether you want to remove the brackets before or after collecting the data into a list, and other users have already answered for the "after" case, so I will answer for the case where the data is still an RDD. It is pretty straightforward:

spams = spams.map(lambda x:x[0])
print(spams.take(3))

This removes the inner "brackets" by replacing each one-element list with its only element.
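One caveat worth checking: x[0] keeps only the first element of each inner list and fails on an empty one, whereas flattening keeps every element. The sketch below illustrates the difference on plain Python lists; the sample rows are made up and the variable names are hypothetical.

```python
# Made-up rows: one normal, one with two elements, one empty.
rows = [["a"], ["b", "c"], []]

# Equivalent of map(lambda x: x[0]) restricted to non-empty rows,
# because indexing an empty list with [0] raises IndexError.
firsts = [row[0] for row in rows if row]

# Equivalent of flattening: every element of every row is kept.
flattened = [item for row in rows for item in row]

print(firsts)     # ['a', 'b']
print(flattened)  # ['a', 'b', 'c']
```

If every inner list is known to hold exactly one string, as in the sample output, both approaches give the same result.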

These lines of code will help if you have already collected the data into a list.

>>> msg = [["Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"],
...  ['WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.'],
...  ['Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030']]
>>> msg = [x[0] for x in msg]
>>> msg
["Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's", 'WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.', 'Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030']

Try a for loop, where "data" is the list you get back from spams.take(3).

mylist = []
for entry in data:
  print(entry)
  for e in entry:
    mylist.append(e)
print(mylist)
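The same nested loop can also be written with the standard library. This is a small sketch using itertools.chain.from_iterable; the contents of data here are placeholder strings standing in for the list returned by spams.take(3).

```python
from itertools import chain

# Placeholder for the list of one-element lists returned by spams.take(3).
data = [["first message"], ["second message"], ["third message"]]

# chain.from_iterable walks the inner lists one after another;
# list() collects the resulting elements into a single flat list.
mylist = list(chain.from_iterable(data))
print(mylist)  # ['first message', 'second message', 'third message']
```

Like the loop, this only works on data that has already been brought back to the driver as a Python list.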
