How can I iterate through a list of lists in PySpark for a specific result?
I am new to PySpark and I am trying to understand how I can do this. Any help appreciated.
I have this RDD, for example:
[[u'merit', u'release', u'appearance'], [u'www.bonsai.wbff.org'], [u'whitepages.com'], [u'the', u'childs', u'wonderland', u'company'], [u'lottery']]
I am trying to get:
[[(u'merit',1), (u'release',1), (u'appearance',1)], [(u'www.bonsai.wbff.org',1)], [(u'whitepages.com',1)], [(u'the',1), (u'childs',1), (u'wonderland',1), (u'company',1)], [(u'lottery',1)]]
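In plain Python terms, the goal is to turn each word in each inner list into a `(word, 1)` pair while keeping the nested structure. A nested list comprehension expresses this directly; here is a sketch on ordinary lists, outside Spark (the `u''` prefixes from the question are just plain `str` in Python 3):

```python
# The same data as plain Python lists
data = [['merit', 'release', 'appearance'],
        ['www.bonsai.wbff.org'],
        ['whitepages.com'],
        ['the', 'childs', 'wonderland', 'company'],
        ['lottery']]

# Pair every word with 1, keeping one list of pairs per record
result = [[(word, 1) for word in sublist] for sublist in data]

print(result[0])  # [('merit', 1), ('release', 1), ('appearance', 1)]
```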
But everything I've tried gets me either this result:
[[u'merit', u'release', u'appearance',1], [u'www.bonsai.wbff.org',1], [u'whitepages.com',1], [u'the', u'childs', u'wonderland', u'company',1], [u'lottery',1]]
or these errors:

TypeError: 'PipelinedRDD' object is not iterable
AttributeError: 'list' object has no attribute 'foreach' - or split, take, etc.

I tried this:
rdd1 = rdd.map(lambda r: (r, 1))
which gives the first result above:
[[u'merit', u'release', u'appearance',1], [u'www.bonsai.wbff.org',1], [u'whitepages.com',1], [u'the', u'childs', u'wonderland', u'company',1], [u'lottery',1]]
rdd1 = rdd.map(lambda r: (r[:][0], 1))
This gets just the first word of each list, which is not what I want.
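A short plain-Python sketch of why these two attempts behave the way they do: `r[:]` is just a shallow copy of the whole list, so `r[:][0]` is merely its first element, while pairing `r` itself with `1` attaches a single `1` to the entire sublist:

```python
r = ['the', 'childs', 'wonderland', 'company']

# r[:] copies the whole list, so indexing it with [0] yields only the first word
print(r[:][0])       # 'the'
print((r[:][0], 1))  # ('the', 1) -- one pair per sublist, words dropped

# Pairing the whole sublist, as lambda r: (r, 1) does, tags the list itself
print((r, 1))        # (['the', 'childs', 'wonderland', 'company'], 1)
```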
for row in rdd.collect(): row.foreach(lambda x: (x, 1))
# AttributeError: 'list' object has no attribute 'foreach'

rdd3.take(100).foreach(lambda a: (a.foreach(lambda e: print(e, 1))))
# AttributeError: 'list' object has no attribute 'foreach'
To print, collect and iterate locally:
for xs in rdd3.take(100):
    for x in xs:
        print(x)
To iterate in general:
rdd.flatMap(lambda xs: [(x, 1) for x in xs])
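Note that `flatMap` flattens one level, producing a single flat list of pairs; to reproduce the nested structure shown in the question, `map` with the same list comprehension keeps one list of pairs per record. A minimal sketch emulating both operations on plain Python lists (no SparkContext needed; `itertools.chain.from_iterable` plays the role of `flatMap`'s flattening step):

```python
from itertools import chain

data = [['merit', 'release', 'appearance'],
        ['www.bonsai.wbff.org'],
        ['lottery']]

def pair(xs):
    """Turn a list of words into a list of (word, 1) pairs."""
    return [(x, 1) for x in xs]

# rdd.map(pair) keeps the nesting: one list of pairs per record
mapped = [pair(xs) for xs in data]

# rdd.flatMap(pair) flattens one level: a single stream of pairs
flat = list(chain.from_iterable(pair(xs) for xs in data))

print(mapped[0])  # [('merit', 1), ('release', 1), ('appearance', 1)]
print(flat[:2])   # [('merit', 1), ('release', 1)]
```

So `rdd.map(lambda xs: [(x, 1) for x in xs])` matches the nested output asked for, while the `flatMap` form above is the usual first step toward a word count, since the flat pairs can then feed `reduceByKey`.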