
Computing aggregates per group using combineByKey on a Spark RDD in Python (PySpark)

I am new to Spark RDDs. I want to use a Spark shuffle operation to calculate aggregates by grouping records by key. My first approach was rdd.groupBy(), but it takes a long time to run and is quite memory inefficient; I know this operation is costly in terms of shuffling. I came across another operation, rdd.combineByKey(), but I am having trouble understanding and using it.

This is my data, stored in an RDD called "customerrdd":

[(u'1', u'Customer#000000001', u'IVhzIApeRb ot,c,E', u'15', u'25-989-741-2988', u'711.56', u'BUILDING', u'to the even, regular platelets. regular, ironic epitaphs nag e', u''), (u'2', u'Customer#000000002', u'XSTf4,NCwDVaWNe6tEgvwfmRchLXak', u'13', u'23-768-687-3665', u'121.65', u'AUTOMOBILE', u'l accounts. blithely ironic theodolites integrate boldly: caref', u''), (u'3', u'Customer#000000003', u'MG9kdTD2WBHm', u'1', u'11-719-748-3364', u'7498.12', u'AUTOMOBILE', u' deposits eat slyly ironic, even instructions. express foxes detect slyly. blithely even accounts abov', u''), (u'4', u'Customer#000000004', u'XxVSJsLAGtn', u'4', u'14-128-190-5944', u'2866.83', u'MACHINERY', u' requests. final, regular ideas sleep final accou', u''), (u'5', u'Customer#000000005', u'KvpyuHCplrB84WgAiGV6sYpZq7Tj', u'3', u'13-750-942-6364', u'794.47', u'HOUSEHOLD', u'n accounts will have to unwind. foxes cajole accor', u''), (u'6', u'Customer#000000006', u'sKZz0CsnMD7mp4Xd0YrBvx,LREYKUWAh yVn', u'20', u'30-114-968-4951', u'7638.57', u'AUTOMOBILE', u'tions. even deposits boost according to the slyly bold packages. final accounts cajole requests. furious', u''), (u'7', u'Customer#000000007', u'TcGe5gaZNgVePxU5kRrvXBfkasDTea', u'18', u'28-190-982-9759', u'9561.95', u'AUTOMOBILE', u'ainst the ironic, express theodolites. express, even pinto beans among the exp', u''), (u'8', u'Customer#000000008', u'I0B10bB0AymmC, 0PrRYBCP1yGJ8xcBPmWhl5', u'17', u'27-147-574-9335', u'6819.74', u'BUILDING', u'among the slyly regular theodolites kindle blithely courts. carefully even theodolites haggle slyly along the ide', u''), (u'9', u'Customer#000000009', u'xKiAFTjUsCuxfeleNqefumTrjS', u'8', u'18-338-906-3675', u'8324.07', u'FURNITURE', u'r theodolites according to the requests wake thinly excuses: pending requests haggle furiousl', u''), (u'10', u'Customer#000000010', u'6LrEaV6KR6PLVcgl2ArL Q3rqzLzcT1 v2', u'5', u'15-741-346-9870', u'2753.54', u'HOUSEHOLD', u'es regular deposits haggle. 
fur', u''), (u'11', u'Customer#000000011', u'PkWS 3HlXqwTuzrKg633BEi', u'23', u'33-464-151-3439', u'-272.60', u'BUILDING', u'ckages. requests sleep slyly. quickly even pinto beans promise above the slyly regular pinto beans. ', u''), (u'12', u'Customer#000000012', u'9PWKuhzT4Zr1Q', u'13', u'23-791-276-1263', u'3396.49', u'HOUSEHOLD', u' to the carefully final braids. blithely regular requests nag. ironic theodolites boost quickly along', u''), (u'13', u'Customer#000000013', u'nsXQu0oVjD7PM659uC3SRSp', u'3', u'13-761-547-5974', u'3857.34', u'BUILDING', u'ounts sleep carefully after the close frays. carefully bold notornis use ironic requests. blithely', u''), (u'14', u'Customer#000000014', u'KXkletMlL2JQEA ', u'1', u'11-845-129-3851', u'5266.30', u'FURNITURE', u', ironic packages across the unus', u''), (u'15', u'Customer#000000015', u'YtWggXoOLdwdo7b0y,BZaGUQMLJMX1Y,EC,6Dn', u'23', u'33-687-542-7601', u'2788.52', u'HOUSEHOLD', u' platelets. regular deposits detect asymptotes. blithely unusual packages nag slyly at the fluf', u''), (u'16', u'Customer#000000016', u'cYiaeMLZSMAOQ2 d0W,', u'10', u'20-781-609-3107', u'4681.03', u'FURNITURE', u'kly silent courts. thinly regular theodolites sleep fluffily after ', u''), (u'17', u'Customer#000000017', u'izrh 6jdqtp2eqdtbkswDD8SG4SzXruMfIXyR7', u'2', u'12-970-682-3487', u'6.34', u'AUTOMOBILE', u'packages wake! blithely even pint', u''), (u'18', u'Customer#000000018', u'3txGO AiuFux3zT0Z9NYaFRnZt', u'6', u'16-155-215-1315', u'5494.43', u'BUILDING', u's sleep. carefully even instructions nag furiously alongside of t', u''), (u'19', u'Customer#000000019', u'uc,3bHIx84H,wdrmLOjVsiqXCq2tr', u'18', u'28-396-526-5053', u'8914.71', u'HOUSEHOLD', u' nag. furiously careful packages are slyly at the accounts. 
furiously regular in', u''), (u'20', u'Customer#000000020', u'JrPk8Pqplj4Ne', u'22', u'32-957-234-8742', u'7603.40', u'FURNITURE', u'g alongside of the special excuses-- fluffily enticing packages wake ', u''), (u'21', u'Customer#000000021', u'XYmVpr9yAHDEn', u'8', u'18-902-614-8344', u'1428.25', u'MACHINERY', u' quickly final accounts integrate blithely furiously u', u''), (u'22', u'Customer#000000022', u'QI6p41,FNs5k7RZoCCVPUTkUdYpB', u'3', u'13-806-545-9701', u'591.98', u'MACHINERY', u's nod furiously above the furiously ironic ideas. ', u''), (u'23', u'Customer#000000023', u'OdY W13N7Be3OC5MpgfmcYss0Wn6TKT', u'3', u'13-312-472-8245', u'3332.02', u'HOUSEHOLD', u'deposits. special deposits cajole slyly. fluffily special deposits about the furiously ', u''), (u'24', u'Customer#000000024', u'HXAFgIAyjxtdqwimt13Y3OZO 4xeLe7U8PqG', u'13', u'23-127-851-8031', u'9255.67', u'MACHINERY', u'into beans. fluffily final ideas haggle fluffily', u''), (u'25', u'Customer#000000025', u'Hp8GyFQgGHFYSilH5tBfe', u'12', u'22-603-468-3533', u'7133.70', u'FURNITURE', u'y. accounts sleep ruthlessly according to the regular theodolites. unusual instructions sleep. ironic, final', u''), (u'26', u'Customer#000000026', u'8ljrc5ZeMl7UciP', u'22', u'32-363-455-4837', u'5182.05', u'AUTOMOBILE', u'c requests use furiously ironic requests. slyly ironic dependencies us', u''), (u'27', u'Customer#000000027', u'IS8GIyxpBrLpMT0u7', u'3', u'13-137-193-2709', u'5679.84', u'BUILDING', u' about the carefully ironic pinto beans. accoun', u''), (u'28', u'Customer#000000028', u'iVyg0daQ,Tha8x2WPWA9m2529m', u'8', u'18-774-241-1462', u'1007.18', u'FURNITURE', u' along the regular deposits. 
furiously final pac', u''), (u'29', u'Customer#000000029', u'sJ5adtfyAkCK63df2,vF25zyQMVYE34uh', u'0', u'10-773-203-7342', u'7618.27', u'FURNITURE', u'its after the carefully final platelets x-ray against ', u''), (u'30', u'Customer#000000030', u'nJDsELGAavU63Jl0c5NKsKfL8rIJQQkQnYL2QJY', u'1', u'11-764-165-5076', u'9321.01', u'BUILDING', u'lithely final requests. furiously unusual account', u''), (u'31', u'Customer#000000031', u'LUACbO0viaAv6eXOAebryDB xjVst', u'23', u'33-197-837-7094', u'5236.89', u'HOUSEHOLD', u's use among the blithely pending depo', u''), (u'32', u'Customer#000000032', u'jD2xZzi UmId,DCtNBLXKj9q0Tlp2iQ6ZcO3J', u'15', u'25-430-914-2194', u'3471.53', u'BUILDING', u'cial ideas. final, furious requests across the e', u''), (u'33', u'Customer#000000033', u'qFSlMuLucBmx9xnn5ib2csWUweg D', u'17', u'27-375-391-1280', u'-78.56', u'AUTOMOBILE', u's. slyly regular accounts are furiously. carefully pending requests', u''), (u'34', u'Customer#000000034', u'Q6G9wZ6dnczmtOx509xgE,M2KV', u'15', u'25-344-968-5422', u'8589.70', u'HOUSEHOLD', u'nder against the even, pending accounts. even', u''), (u'35', u'Customer#000000035', u'TEjWGE4nBzJL2', u'17', u'27-566-888-7431', u'1228.24', u'HOUSEHOLD', u'requests. special, express requests nag slyly furiousl', u''), (u'36', u'Customer#000000036', u'3TvCzjuPzpJ0,DdJ8kW5U', u'21', u'31-704-669-5769', u'4987.27', u'BUILDING', u'haggle. enticing, quiet platelets grow quickly bold sheaves. carefully regular acc', u''), (u'37', u'Customer#000000037', u'7EV4Pwh,3SboctTWt', u'8', u'18-385-235-7162', u'-917.75', u'FURNITURE', u'ilent packages are carefully among the deposits. furiousl', u''), (u'38', u'Customer#000000038', u'a5Ee5e9568R8RLP 2ap7', u'12', u'22-306-880-7212', u'6345.11', u'HOUSEHOLD', u'lar excuses. closely even asymptotes cajole blithely excuses. 
carefully silent pinto beans sleep carefully fin', u''), (u'39', u'Customer#000000039', u'nnbRg,Pvy33dfkorYE FdeZ60', u'2', u'12-387-467-6509', u'6264.31', u'AUTOMOBILE', u'tions. slyly silent excuses slee', u''), (u'40', u'Customer#000000040', u'gOnGWAyhSV1ofv', u'3', u'13-652-915-8939', u'1335.30', u'BUILDING', u'rges impress after the slyly ironic courts. foxes are. blithely ', u''), (u'41', u'Customer#000000041', u'IM9mzmyoxeBmvNw8lA7G3Ydska2nkZF', u'10', u'20-917-711-4011', u'270.95', u'HOUSEHOLD', u'ly regular accounts hang bold, silent packages. unusual foxes haggle slyly above the special, final depo', u''), (u'42', u'Customer#000000042', u'ziSrvyyBke', u'5', u'15-416-330-4175', u'8727.01', u'BUILDING', u'ssly according to the pinto beans: carefully special requests across the even, pending accounts wake special', u''), (u'43', u'Customer#000000043', u'ouSbjHk8lh5fKX3zGso3ZSIj9Aa3PoaFd', u'19', u'29-316-665-2897', u'9904.28', u'MACHINERY', u'ial requests: carefully pending foxes detect quickly. carefully final courts cajole quickly. carefully', u''), (u'44', u'Customer#000000044', u'Oi,dOSPwDu4jo4x,,P85E0dmhZGvNtBwi', u'16', u'26-190-260-5375', u'7315.94', u'AUTOMOBILE', u'r requests around the unusual, bold a', u''), (u'45', u'Customer#000000045', u'4v3OcpFgoOmMG,CbnF,4mdC', u'9', u'19-715-298-9917', u'9983.38', u'AUTOMOBILE', u'nto beans haggle slyly alongside of t', u''), (u'46', u'Customer#000000046', u'eaTXWWm10L9', u'6', u'16-357-681-2007', u'5744.59', u'AUTOMOBILE', u'ctions. accounts sleep furiously even requests. regular, regular accounts cajole blithely around the final pa', u''), (u'47', u'Customer#000000047', u'b0UgocSqEW5 gdVbhNT', u'2', u'12-427-271-9466', u'274.58', u'BUILDING', u'ions. express, ironic instructions sleep furiously ironic ideas. furi', u''), (u'48', u'Customer#000000048', u'0UU iPhBupFvemNB', u'0', u'10-508-348-5882', u'3792.50', u'BUILDING', u're fluffily pending foxes. pending, bold platelets sleep slyly. 
even platelets cajo', u''), (u'49', u'Customer#000000049', u'cNgAeX7Fqrdf7HQN9EwjUa4nxT,68L FKAxzl', u'10', u'20-908-631-4424', u'4573.94', u'FURNITURE', u'nusual foxes! fluffily pending packages maintain to the regular ', u''), (u'50', u'Customer#000000050', u'9SzDYlkzxByyJ1QeTI o', u'6', u'16-658-112-3221', u'4266.13', u'MACHINERY', u'ts. furiously ironic accounts cajole furiously slyly ironic dinos.', u''), (u'51', u'Customer#000000051', u'uR,wEaiTvo4', u'12', u'22-344-885-4251', u'855.87', u'FURNITURE', u'eposits. furiously regular requests integrate carefully packages. furious', u''), (u'52', u'Customer#000000052', u'7 QOqGqqSy9jfV51BC71jcHJSD0', u'11', u'21-186-284-5998', u'5630.28', u'HOUSEHOLD', u'ic platelets use evenly even accounts. stealthy theodolites cajole furiou', u''), (u'53', u'Customer#000000053', u'HnaxHzTfFTZs8MuCpJyTbZ47Cm4wFOOgib', u'15', u'25-168-852-5363', u'4113.64', u'HOUSEHOLD', u'ar accounts are. even foxes are blithely. fluffily pending deposits boost', u''), (u'54', u'Customer#000000054', u',k4vf 5vECGWFy,hosTE,', u'4', u'14-776-370-4745', u'868.90', u'AUTOMOBILE', u'sual, silent accounts. furiously express accounts cajole special deposits. final, final accounts use furi', u''), (u'55', u'Customer#000000055', u'zIRBR4KNEl HzaiV3a i9n6elrxzDEh8r8pDom', u'10', u'20-180-440-8525', u'4572.11', u'MACHINERY', u'ully unusual packages wake bravely bold packages. unusual requests boost deposits! blithely ironic packages ab', u''), (u'56', u'Customer#000000056', u'BJYZYJQk4yD5B', u'10', u'20-895-685-6920', u'6530.86', u'FURNITURE', u'. notornis wake carefully. carefully fluffy requests are furiously even accounts. slyly expre', u''), (u'57', u'Customer#000000057', u'97XYbsuOPRXPWU', u'21', u'31-835-306-1650', u'4151.93', u'AUTOMOBILE', u'ove the carefully special packages. 
even, unusual deposits sleep slyly pend', u''), (u'58', u'Customer#000000058', u'g9ap7Dk1Sv9fcXEWjpMYpBZIRUohi T', u'13', u'23-244-493-2508', u'6478.46', u'HOUSEHOLD', u'ideas. ironic ideas affix furiously express, final instructions. regular excuses use quickly e', u''), (u'59', u'Customer#000000059', u'zLOCP0wh92OtBihgspOGl4', u'1', u'11-355-584-3112', u'3458.60', u'MACHINERY', u'ously final packages haggle blithely after the express deposits. furiou', u''), (u'60', u'Customer#000000060', u'FyodhjwMChsZmUz7Jz0H', u'12', u'22-480-575-5866', u'2741.87', u'MACHINERY', u'latelets. blithely unusual courts boost furiously about the packages. blithely final instruct', u''), (u'61', u'Customer#000000061', u'9kndve4EAJxhg3veF BfXr7AqOsT39o gtqjaYE', u'17', u'27-626-559-8599', u'1536.24', u'FURNITURE', u'egular packages shall have to impress along the ', u''), (u'62', u'Customer#000000062', u'upJK2Dnw13,', u'7', u'17-361-978-7059', u'595.61', u'MACHINERY', u'kly special dolphins. pinto beans are slyly. quickly regular accounts are furiously a', u''), (u'63', u'Customer#000000063', u'IXRSpVWWZraKII', u'21', u'31-952-552-9584', u'9331.13', u'AUTOMOBILE', u'ithely even accounts detect slyly above the fluffily ir', u''), (u'64', u'Customer#000000064', u'MbCeGY20kaKK3oalJD,OT', u'3', u'13-558-731-7204', u'-646.64', u'BUILDING', u'structions after the quietly ironic theodolites cajole be', u''), (u'65', u'Customer#000000065', u'RGT yzQ0y4l0H90P783LG4U95bXQFDRXbWa1sl,X', u'23', u'33-733-623-5267', u'8795.16', u'AUTOMOBILE', u'y final foxes serve carefully. theodolites are carefully. pending i', u''), (u'66', u'Customer#000000066', u'XbsEqXH1ETbJYYtA1A', u'22', u'32-213-373-5094', u'242.77', u'HOUSEHOLD', u'le slyly accounts. 
carefully silent packages benea', u''), (u'67', u'Customer#000000067', u'rfG0cOgtr5W8 xILkwp9fpCS8', u'9', u'19-403-114-4356', u'8166.59', u'MACHINERY', u'indle furiously final, even theodo', u''), (u'68', u'Customer#000000068', u'o8AibcCRkXvQFh8hF,7o', u'12', u'22-918-832-2411', u'6853.37', u'HOUSEHOLD', u' pending pinto beans impress realms. final dependencies ', u''), (u'69', u'Customer#000000069', u'Ltx17nO9Wwhtdbe9QZVxNgP98V7xW97uvSH1prEw', u'9', u'19-225-978-5670', u'1709.28', u'HOUSEHOLD', u'thely final ideas around the quickly final dependencies affix carefully quickly final theodolites. final accounts c', u''), (u'70', u'Customer#000000070', u'mFowIuhnHjp2GjCiYYavkW kUwOjIaTCQ', u'22', u'32-828-107-2832', u'4867.52', u'FURNITURE', u'fter the special asymptotes. ideas after the unusual frets cajole quickly regular pinto be', u''), (u'71', u'Customer#000000071', u'TlGalgdXWBmMV,6agLyWYDyIz9MKzcY8gl,w6t1B', u'7', u'17-710-812-5403', u'-611.19', u'HOUSEHOLD', u'g courts across the regular, final pinto beans are blithely pending ac', u''), (u'72', u'Customer#000000072', u'putjlmskxE,zs,HqeIA9Wqu7dhgH5BVCwDwHHcf', u'2', u'12-759-144-9689', u'-362.86', u'FURNITURE', u'ithely final foxes sleep always quickly bold accounts. final wat', u''), (u'73', u'Customer#000000073', u'8IhIxreu4Ug6tt5mog4', u'0', u'10-473-439-3214', u'4288.50', u'BUILDING', u'usual, unusual packages sleep busily along the furiou', u''), (u'74', u'Customer#000000074', u'IkJHCA3ZThF7qL7VKcrU nRLl,kylf ', u'4', u'14-199-862-7209', u'2764.43', u'MACHINERY', u'onic accounts. blithely slow packages would haggle carefully. qui', u''), (u'75', u'Customer#000000075', u'Dh 6jZ,cwxWLKQfRKkiGrzv6pm', u'18', u'28-247-803-9025', u'6684.10', u'AUTOMOBILE', u' instructions cajole even, even deposits. finally bold deposits use above the even pains. slyl', u''), (u'76', u'Customer#000000076', u'm3sbCvjMOHyaOofH,e UkGPtqc4', u'0', u'10-349-718-3044', u'5745.33', u'FURNITURE', u'pecial deposits. 
ironic ideas boost blithely according to the closely ironic theodolites! furiously final deposits n', u''), (u'77', u'Customer#000000077', u'4tAE5KdMFGD4byHtXF92vx', u'17', u'27-269-357-4674', u'1738.87', u'BUILDING', u'uffily silent requests. carefully ironic asymptotes among the ironic hockey players are carefully bli', u''), (u'78', u'Customer#000000078', u'HBOta,ZNqpg3U2cSL0kbrftkPwzX', u'9', u'19-960-700-9191', u'7136.97', u'FURNITURE', u'ests. blithely bold pinto beans h', u''), (u'79', u'Customer#000000079', u'n5hH2ftkVRwW8idtD,BmM2', u'15', u'25-147-850-4166', u'5121.28', u'MACHINERY', u'es. packages haggle furiously. regular, special requests poach after the quickly express ideas. blithely pending re', u''), (u'80', u'Customer#000000080', u'K,vtXp8qYB ', u'0', u'10-267-172-7101', u'7383.53', u'FURNITURE', u'tect among the dependencies. bold accounts engage closely even pinto beans. ca', u''), (u'81', u'Customer#000000081', u'SH6lPA7JiiNC6dNTrR', u'20', u'30-165-277-3269', u'2023.71', u'BUILDING', u'r packages. fluffily ironic requests cajole fluffily. ironically regular theodolit', u''), (u'82', u'Customer#000000082', u'zhG3EZbap4c992Gj3bK,3Ne,Xn', u'18', u'28-159-442-5305', u'9468.34', u'AUTOMOBILE', u's wake. bravely regular accounts are furiously. regula', u''), (u'83', u'Customer#000000083', u'HnhTNB5xpnSF20JBH4Ycs6psVnkC3RDf', u'22', u'32-817-154-4122', u'6463.51', u'BUILDING', u'ccording to the quickly bold warhorses. final, regular foxes integrate carefully. bold packages nag blithely ev', u''), (u'84', u'Customer#000000084', u'lpXz6Fwr9945rnbtMc8PlueilS1WmASr CB', u'11', u'21-546-818-3802', u'5174.71', u'FURNITURE', u'ly blithe foxes. 
special asymptotes haggle blithely against the furiously regular depo', u''), (u'85', u'Customer#000000085', u'siRerlDwiolhYR 8FgksoezycLj', u'5', u'15-745-585-8219', u'3386.64', u'FURNITURE', u'ronic ideas use above the slowly pendin', u''), (u'86', u'Customer#000000086', u'US6EGGHXbTTXPL9SBsxQJsuvy', u'0', u'10-677-951-2353', u'3306.32', u'HOUSEHOLD', u'quests. pending dugouts are carefully aroun', u''), (u'87', u'Customer#000000087', u'hgGhHVSWQl 6jZ6Ev', u'23', u'33-869-884-7053', u'6327.54', u'FURNITURE', u'hely ironic requests integrate according to the ironic accounts. slyly regular pla', u''), (u'88', u'Customer#000000088', u'wtkjBN9eyrFuENSMmMFlJ3e7jE5KXcg', u'16', u'26-516-273-2566', u'8031.44', u'AUTOMOBILE', u's are quickly above the quickly ironic instructions; even requests about the carefully final deposi', u''), (u'89', u'Customer#000000089', u'dtR, y9JQWUO6FoJExyp8whOU', u'14', u'24-394-451-5404', u'1530.76', u'FURNITURE', u'counts are slyly beyond the slyly final accounts. quickly final ideas wake. r', u''), (u'90', u'Customer#000000090', u'QxCzH7VxxYUWwfL7', u'16', u'26-603-491-1238', u'7354.23', u'BUILDING', u'sly across the furiously even ', u''), (u'91', u'Customer#000000091', u'S8OMYFrpHwoNHaGBeuS6E 6zhHGZiprw1b7 q', u'8', u'18-239-400-3677', u'4643.14', u'AUTOMOBILE', u'onic accounts. fluffily silent pinto beans boost blithely according to the fluffily exp', u''), (u'92', u'Customer#000000092', u'obP PULk2LH LqNF,K9hcbNqnLAkJVsl5xqSrY,', u'2', u'12-446-416-8471', u'1182.91', u'MACHINERY', u'. pinto beans hang slyly final deposits. ac', u''), (u'93', u'Customer#000000093', u'EHXBr2QGdh', u'7', u'17-359-388-5266', u'2182.52', u'MACHINERY', u'press deposits. carefully regular platelets r', u''), (u'94', u'Customer#000000094', u'IfVNIN9KtkScJ9dUjK3Pg5gY1aFeaXewwf', u'9', u'19-953-499-8833', u'5500.11', u'HOUSEHOLD', u'latelets across the bold, final requests sleep according to the fluffily bold accounts. 
unusual deposits amon', u''), (u'95', u'Customer#000000095', u'EU0xvmWvOmUUn5J,2z85DQyG7QCJ9Xq7', u'15', u'25-923-255-2929', u'5327.38', u'MACHINERY', u'ithely. ruthlessly final requests wake slyly alongside of the furiously silent pinto beans. even the', u''), (u'96', u'Customer#000000096', u'vWLOrmXhRR', u'8', u'18-422-845-1202', u'6323.92', u'AUTOMOBILE', u'press requests believe furiously. carefully final instructions snooze carefully. ', u''), (u'97', u'Customer#000000097', u'OApyejbhJG,0Iw3j rd1M', u'17', u'27-588-919-5638', u'2164.48', u'AUTOMOBILE', u'haggle slyly. bold, special ideas are blithely above the thinly bold theo', u''), (u'98', u'Customer#000000098', u'7yiheXNSpuEAwbswDW', u'12', u'22-885-845-6889', u'-551.37', u'BUILDING', u'ages. furiously pending accounts are quickly carefully final foxes: busily pe', u''), (u'99', u'Customer#000000099', u'szsrOiPtCHVS97Lt', u'15', u'25-515-237-9232', u'4088.65', u'HOUSEHOLD', u'cajole slyly about the regular theodolites! furiously bold requests nag along the pending, regular packages. somas', u''), (u'100', u'Customer#000000100', u'fptUABXcmkC5Wx', u'20', u'30-749-445-4907', u'9889.89', u'FURNITURE', u'was furiously fluffily quiet deposits. silent, pending requests boost against ', u'')]

I applied groupBy() to customerrdd on the attribute at index 6. Then, for the aggregation (say, a sum over the attribute at index 3), I applied reduceByKey after a series of flatMapValues and mapValues calls. Here is the code:

def func(x):
    return x


def stringconverfunc(z):
    return str(z)


def floatconverfunc(l):
    return float(l)

def aggonvalfunc(y):
    return y[3]


grouprdd=customerrdd.groupBy(lambda w:(w[6]))


result=grouprdd.flatMapValues(lambda q: func(q)).mapValues(lambda p: aggonvalfunc(p)) \
        .mapValues(lambda line: stringconverfunc(line)).mapValues(lambda line: line.strip()) \
        .mapValues(lambda line: floatconverfunc(line)).reduceByKey(lambda x, y: x + y).collect()
print result

OUTPUT:

[(u'BUILDING', 20), (u'AUTOMOBILE', 21), (u'HOUSEHOLD', 21), (u'MACHINERY', 16), (u'FURNITURE', 22)]

However, the above approach is quite costly in terms of shuffling and does not work with larger datasets. I therefore want to implement the same idea with rdd.combineByKey() so that it runs faster and can be used on larger datasets. I have tried to implement it by referring to the combineByKey documentation, but I am confused about how to provide the keys and the values on which the aggregation should be performed. Can anyone help? I would appreciate any suggestions.

Okay, for a beginner it's hard to know all this so I'll try to give you some tips.

You can assign keys without grouping. This is done with keyBy, which doesn't involve shuffling. In the end, a key-value RDD is merely an RDD consisting of tuples of size 2, where the first entry is the key and the second one is the value.
Any performance gain you could get from reduceByKey or combineByKey is wasted if you do the grouping beforehand, and that grouping can be avoided.
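To make the point about combineByKey more concrete, here is a minimal pure-Python sketch of the idea behind it. It is not Spark's implementation: `simulate_combine_by_key` is a hypothetical helper, the partitions are plain lists, and the sample pairs are just a few (index 6, index 3) values taken from the data above. It only illustrates that values are first combined inside each partition and that the much smaller partial results are merged afterwards.

```python
def create_combiner(value):
    # First value seen for a key inside a partition: start the running sum.
    return value


def merge_value(acc, value):
    # Another value for a key already seen in the same partition.
    return acc + value


def merge_combiners(acc1, acc2):
    # Partial sums for the same key coming from two different partitions.
    return acc1 + acc2


def simulate_combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    # Hypothetical helper, for illustration only (not part of PySpark).
    # Step 1: combine locally, one dict of partial results per partition.
    partials = []
    for part in partitions:
        local = {}
        for key, value in part:
            if key in local:
                local[key] = merge_value(local[key], value)
            else:
                local[key] = create_combiner(value)
        partials.append(local)

    # Step 2: merge the per-partition partial results key by key.
    merged = {}
    for local in partials:
        for key, acc in local.items():
            if key in merged:
                merged[key] = merge_combiners(merged[key], acc)
            else:
                merged[key] = acc
    return merged


# A few (segment, value at index 3) pairs, split over two pretend partitions.
partitions = [
    [("BUILDING", 15.0), ("AUTOMOBILE", 13.0), ("AUTOMOBILE", 1.0)],
    [("BUILDING", 17.0), ("AUTOMOBILE", 20.0)],
]

totals = simulate_combine_by_key(
    partitions, create_combiner, merge_value, merge_combiners
)
print(totals)
```

On a real key-value RDD the same three functions would be passed as `rdd.combineByKey(create_combiner, merge_value, merge_combiners)`. For a plain sum, where all three steps are the same addition, `reduceByKey(add)` expresses the same thing more briefly.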

Furthermore, you can call float on a string with leading and trailing whitespace; it strips the string automatically. You also don't need to create lambdas of the form lambda x: f(x). Just pass f itself, without parentheses, and it will have the same effect. For the same reason you don't need to wrap str or float in another function.
The operator module provides functions for adding and for retrieving items, so you don't need to define those either. Please look at the Python docs for further information.
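These tips can be checked in plain Python, without Spark. The snippet below is only a small sketch; `row` is a shortened copy of the first record from the question, and the variable names are made up for illustration.

```python
from operator import itemgetter, add

# Shortened copy of the first record (first seven fields only).
row = ('1', 'Customer#000000001', 'IVhzIApeRb ot,c,E', '15',
       '25-989-741-2988', '711.56', 'BUILDING')

# float() ignores leading and trailing whitespace, so no strip() is needed.
padded = float("  15 \n")

# itemgetter(n) builds a function equivalent to `lambda x: x[n]`.
segment = itemgetter(6)(row)
value = float(itemgetter(3)(row))

# Passing `float` directly does the same as `lambda s: float(s)`.
values = list(map(float, [" 15", "13 ", "1"]))

# add(a, b) is the function form of `a + b`.
total = add(values[0], values[1])

print(padded, segment, value, values, total)
```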

My solution would be:

from operator import itemgetter, add

# `itemgetter(6)` is equivalent to `lambda x: x[6]`. Therefore we'll use element at
# index 6 to key the rdd's entries.
# This operation is equivalent to `customerrdd.map(lambda x: (x[6], x))`
rdd = customerrdd.keyBy(itemgetter(6))

# Now extract element at index 3 from the values so we no longer have a tuple
rdd = rdd.mapValues(itemgetter(3))

# Convert those elements to floats
rdd = rdd.mapValues(float)

# We could've done the previous steps in one by doing
# rdd = customerrdd.map(lambda x: (x[6], float(x[3])))

# Sum them up and collect the result
result = rdd.reduceByKey(add).collect()

Without comments

from operator import itemgetter, add

result = customerrdd.keyBy(itemgetter(6))\
    .mapValues(itemgetter(3))\
    .mapValues(float)\
    .reduceByKey(add).collect()

Which returns

[(u'BUILDING', 204.0),
 (u'AUTOMOBILE', 280.0),
 (u'MACHINERY', 135.0),
 (u'HOUSEHOLD', 255.0),
 (u'FURNITURE', 224.0)]

Admittedly this is a different result from the one you posted, but when I ran your code on this data I got the same result as above. So I guess your output came from a different RDD.
