How can I iterate through a list of lists in PySpark for a specific result?
I am new to PySpark and I am trying to understand how I can do this. Any help appreciated.
I have this RDD, for example:
[[u'merit', u'release', u'appearance'], [u'www.bonsai.wbff.org'], [u'whitepages.com'], [u'the', u'childs', u'wonderland', u'company'], [u'lottery']]
I am trying to get:
[[(u'merit',1), (u'release',1), (u'appearance',1)], [(u'www.bonsai.wbff.org',1)], [(u'whitepages.com',1)], [(u'the',1), (u'childs',1), (u'wonderland',1), (u'company',1)], [(u'lottery',1)]]
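In plain Python terms, the goal is to turn each word in each inner list into a `(word, 1)` pair while keeping the nested structure. A nested list comprehension expresses this directly; here is a sketch on ordinary lists, outside Spark (the `u''` prefixes from the question are just plain `str` in Python 3):

```python
# The same data as plain Python lists
data = [['merit', 'release', 'appearance'],
        ['www.bonsai.wbff.org'],
        ['whitepages.com'],
        ['the', 'childs', 'wonderland', 'company'],
        ['lottery']]

# Pair every word with 1, keeping one list of pairs per record
result = [[(word, 1) for word in sublist] for sublist in data]

print(result[0])  # [('merit', 1), ('release', 1), ('appearance', 1)]
```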
But everything I've tried gets me either this result:
[[u'merit', u'release', u'appearance',1], [u'www.bonsai.wbff.org',1], [u'whitepages.com',1], [u'the', u'childs', u'wonderland', u'company',1], [u'lottery',1]]
or these errors:

TypeError: 'PipelinedRDD' object is not iterable
AttributeError: 'list' object has no attribute 'foreach' - or split, take, etc.

I tried this:
rdd1 = rdd.map(lambda r: (r, 1))
which gives the first result above:
[[u'merit', u'release', u'appearance',1], [u'www.bonsai.wbff.org',1], [u'whitepages.com',1], [u'the', u'childs', u'wonderland', u'company',1], [u'lottery',1]]
rdd1 = rdd.map(lambda r: (r[:][0], 1))
This gets just the first word of each list, which is not what I want.
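A short plain-Python sketch of why these two attempts behave the way they do: `r[:]` is just a shallow copy of the whole list, so `r[:][0]` is merely its first element, while pairing `r` itself with `1` attaches a single `1` to the entire sublist:

```python
r = ['the', 'childs', 'wonderland', 'company']

# r[:] copies the whole list, so indexing it with [0] yields only the first word
print(r[:][0])       # 'the'
print((r[:][0], 1))  # ('the', 1) -- one pair per sublist, words dropped

# Pairing the whole sublist, as lambda r: (r, 1) does, tags the list itself
print((r, 1))        # (['the', 'childs', 'wonderland', 'company'], 1)
```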
for row in rdd.collect(): row.foreach(lambda x: (x, 1))
# AttributeError: 'list' object has no attribute 'foreach'

rdd3.take(100).foreach(lambda a: (a.foreach(lambda e: print(e, 1))))
# AttributeError: 'list' object has no attribute 'foreach'
To print, collect and iterate locally:
for xs in rdd3.take(100):
    for x in xs:
        print(x)
To iterate in general:
rdd.flatMap(lambda xs: [(x, 1) for x in xs])
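Note that `flatMap` flattens one level, producing a single flat list of pairs; to reproduce the nested structure shown in the question, `map` with the same list comprehension keeps one list of pairs per record. A minimal sketch emulating both operations on plain Python lists (no SparkContext needed; `itertools.chain.from_iterable` plays the role of `flatMap`'s flattening step):

```python
from itertools import chain

data = [['merit', 'release', 'appearance'],
        ['www.bonsai.wbff.org'],
        ['lottery']]

def pair(xs):
    """Turn a list of words into a list of (word, 1) pairs."""
    return [(x, 1) for x in xs]

# rdd.map(pair) keeps the nesting: one list of pairs per record
mapped = [pair(xs) for xs in data]

# rdd.flatMap(pair) flattens one level: a single stream of pairs
flat = list(chain.from_iterable(pair(xs) for xs in data))

print(mapped[0])  # [('merit', 1), ('release', 1), ('appearance', 1)]
print(flat[:2])   # [('merit', 1), ('release', 1)]
```

So `rdd.map(lambda xs: [(x, 1) for x in xs])` matches the nested output asked for, while the `flatMap` form above is the usual first step toward a word count, since the flat pairs can then feed `reduceByKey`.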