
How can I iterate through a list of lists in PySpark for a specific result?

I am new to PySpark, and I am trying to understand how I can do this. Any help appreciated.

For example, I have this RDD:

[[u'merit', u'release', u'appearance'], [u'www.bonsai.wbff.org'], [u'whitepages.com'], [u'the', u'childs', u'wonderland', u'company'], [u'lottery']]

I am trying to get:

[[(u'merit',1), (u'release',1), (u'appearance',1)], [(u'www.bonsai.wbff.org',1)], [(u'whitepages.com',1)], [(u'the',1), (u'childs',1), (u'wonderland',1), (u'company',1)], [(u'lottery',1)]] 

But everything I've tried gives me either this result:

[[u'merit', u'release', u'appearance',1], [u'www.bonsai.wbff.org',1], [u'whitepages.com',1], [u'the', u'childs', u'wonderland', u'company',1], [u'lottery',1]]

or these errors:

  • TypeError: 'PipelinedRDD' object is not iterable
  • AttributeError: 'list' object has no attribute 'foreach' (or split, take, etc.)

I tried this:

rdd1=rdd.map(lambda r : (r,1))  

which gives the first result:

[[u'merit', u'release', u'appearance',1], [u'www.bonsai.wbff.org',1], [u'whitepages.com',1], [u'the', u'childs', u'wonderland', u'company',1], [u'lottery',1]]

rdd1=rdd.map(lambda r : (r[:][0],1))  

This gets only the first word of each line, which is not what I want.
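A quick plain-Python check (no Spark needed) shows why: r[:] just copies the list, so r[:][0] is the same as r[0], the first element alone.

```python
# r[:] copies the list, so r[:][0] == r[0]: only the first word survives.
row = [u'the', u'childs', u'wonderland', u'company']
pair = (row[:][0], 1)
print(pair)  # ('the', 1) -- the other words are dropped
```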

for row in rdd.collect() : row.foreach(lambda x : (x,1)) 
# AttributeError: 'list' object has no attribute 'foreach'
rdd3.take(100).foreach( lambda a : (a.foreach( lambda e : print(e,1)))) 
# AttributeError: 'list' object has no attribute 'foreach'

To print, collect and iterate locally:

for xs in rdd3.take(100):
    for x in xs:
        print(x)

To iterate in general:

rdd.flatMap(lambda xs: [(x, 1) for x in xs])
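As a minimal plain-Python sketch (no Spark required) of what these transformations compute on the data from the question: flatMap yields one flat list of (word, 1) pairs, while map with the same comprehension keeps the nested shape shown in the desired output.

```python
data = [[u'merit', u'release', u'appearance'], [u'lottery']]

# What flatMap does: one flat sequence of (word, 1) pairs.
flat = [(x, 1) for xs in data for x in xs]
# [('merit', 1), ('release', 1), ('appearance', 1), ('lottery', 1)]

# What map with the same comprehension does
# (rdd.map(lambda xs: [(x, 1) for x in xs])): nesting is preserved.
nested = [[(x, 1) for x in xs] for xs in data]
# [[('merit', 1), ('release', 1), ('appearance', 1)], [('lottery', 1)]]
```

If the nested structure from the question's expected output matters, map is the variant to use; flatMap is the usual choice when the pairs feed a word count (e.g. a later reduceByKey).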
