
PySpark broadcast value to dictionary

I have a PySpark broadcast variable whose value looks like this:

[('b000jz4hqo', {'rom': 2.4051362683438153, 'clickart': 56.65432098765432, '950': 254.94444444444443, 'image': 3.6948470209339774, 'premier': 9.27070707070707, '000': 6.218157181571815, 'dvd': 1.287598204264871, 'broderbund': 22.169082125603865, 'pack': 2.98180636777128}), ('b0006zf55o', {'laptops': 11.588383838383837, 'desktops': 12.74722222222222, 'backup': 2.8015873015873014, 'win': 0.501859142607174, 'ca': 9.10515873015873, 'v11': 50.98888888888888, '30u': 84.98148148148148, '30pk': 254.94444444444443, 'desktop': 2.23635477582846, '1': 0.3231235037318687, 'arcserve': 24.28042328042328, 'computer': 0.6965695203400122, 'lap': 127.47222222222221, 'oem': 46.35353535353535, 'international': 9.44238683127572, 'associates': 7.284126984126985})]

So the broadcast value is a list of (key, dict) tuples.

Attempting to convert broadcast.value into a dictionary results in

TypeError: unhashable type: 'dict'

This is the code I am using:

from itertools import izip
amazonWeightsBroadcast = sc.broadcast(amazonWeightsRDD.collect())
i = iter(amazonWeightsBroadcast.value)
amazonWeightsDict = dict(izip(i, i))

I also tried the following, which gives the same "unhashable" error:

amazonWeightsDict = dict(amazonWeightsBroadcast.value[i:i+2] for i in range(0, len(amazonWeightsBroadcast.value), 2))

So if it's not possible to convert a broadcast variable into a dictionary, what would be a better way to look up a value by its key?

Python 2.7.6, Spark 1.3.1

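It took me a while, but the problem was in how the broadcast variable was created. .collect() returns a list of (key, value) tuples, whereas .collectAsMap() returns a dictionary, so I had to use .collectAsMap() rather than .collect(). Now it is working as expected.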
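
To illustrate why the original attempts failed, here is a minimal plain-Python sketch that needs no Spark. The sample data is a shortened, rounded copy of the value in the question, and the names pairs, weights and lookup are made up for the example. It is written for Python 3, so zip stands in for Python 2's izip. The point is that pairing up consecutive elements turns a whole (key, dict) tuple into the dictionary key, while passing the list of pairs straight to dict() works.

```python
# Sample data shaped like the broadcast value in the question (values shortened).
pairs = [
    ('b000jz4hqo', {'rom': 2.4, 'dvd': 1.29}),
    ('b0006zf55o', {'laptops': 11.59, 'win': 0.5}),
]

# The approach from the question pairs up consecutive elements, so each "key"
# is a whole (id, dict) tuple. Hashing that tuple hashes the dict inside it,
# which raises TypeError.
it = iter(pairs)
try:
    dict(zip(it, it))  # zip plays the role of Python 2's izip here
    error = None
except TypeError as exc:
    error = str(exc)

# Each element is already a (key, value) pair, so dict() accepts the list as is.
weights = dict(pairs)
lookup = weights['b000jz4hqo']['dvd']
```

In Spark terms, amazonWeightsRDD.collectAsMap() hands back such a dictionary directly, so after sc.broadcast(amazonWeightsRDD.collectAsMap()) a lookup is just amazonWeightsBroadcast.value[key]. Calling dict(amazonWeightsBroadcast.value) on the list produced by .collect() should give the same mapping.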
