
sortByKey() doesn't seem to work on strings in pyspark

I saved two verses of the poem "Mary had a little lamb" in a file "TESTSortbykey.md" and ran the following command in pyspark:

testsortbykey=sc.textFile("file:///opt/hadoop/spark-1.6.0/TESTSortbykey.md").flatMap(lambda x: x.split(" ")).map(lambda x: (x,1))

On running testsortbykey.collect() I got the output:

[(u'Mary', 1), (u'had', 1), (u'a', 1), (u'little', 1), (u'lamb', 1), (u'whose', 1), (u'fleece', 1), (u'was', 1), (u'white', 1), (u'as', 1), (u'snow', 1), (u'and', 1), (u'every', 1), (u'where', 1), (u'that', 1), (u'Mary', 1), (u'went', 1), (u'the', 1), (u'Lamb', 1), (u'was', 1), (u'sure', 1), (u'to', 1), (u'go', 1), (u'', 1)]

Once I have the pair RDD testsortbykey, I want to apply reduceByKey() and sortByKey(), but neither seems to work. The commands I used are:

 testsortbykey.sortByKey()
 testsortbykey.collect()
 testsortbykey.reduceByKey(lambda x,y: x+y )
 testsortbykey.collect()

The output I get in both cases is:

[(u'Mary', 1), (u'had', 1), (u'a', 1), (u'little', 1), (u'lamb', 1), (u'whose', 1), (u'fleece', 1), (u'was', 1), (u'white', 1), (u'as', 1), (u'snow', 1), (u'and', 1), (u'every', 1), (u'where', 1), (u'that', 1), (u'Mary', 1), (u'went', 1), (u'the', 1), (u'Lamb', 1), (u'was', 1), (u'sure', 1), (u'to', 1), (u'go', 1), (u'', 1)]

Clearly, even though there are multiple identical keys (e.g. 'Mary', 'was', etc.), the values have not been merged.

Can anyone explain why? And what should I do to overcome this?

Edit: this is what my console looks like, hope this helps:

    >>> testsortbykey=sc.textFile("file:///opt/hadoop/spark-1.6.0/TESTSortbykey.md").flatMap(lambda x: x.split(" ")).map(lambda x: (x,1))
17/06/01 11:44:48 INFO storage.MemoryStore: Block broadcast_103 stored as values in memory (estimated size 228.9 KB, free 4.4 MB)
17/06/01 11:44:48 INFO storage.MemoryStore: Block broadcast_103_piece0 stored as bytes in memory (estimated size 19.5 KB, free 4.4 MB)
17/06/01 11:44:48 INFO storage.BlockManagerInfo: Added broadcast_103_piece0 in memory on localhost:57701 (size: 19.5 KB, free: 511.1 MB)
17/06/01 11:44:48 INFO spark.SparkContext: Created broadcast 103 from textFile at null:-1
>>> testsortbykey.sortByKey()
17/06/01 11:45:48 INFO mapred.FileInputFormat: Total input paths to process : 1
17/06/01 11:45:48 INFO spark.SparkContext: Starting job: sortByKey at <stdin>:1
17/06/01 11:45:48 INFO scheduler.DAGScheduler: Got job 74 (sortByKey at <stdin>:1) with 2 output partitions
17/06/01 11:45:48 INFO scheduler.DAGScheduler: Final stage: ResultStage 89 (sortByKey at <stdin>:1)
17/06/01 11:45:48 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/06/01 11:45:48 INFO scheduler.DAGScheduler: Missing parents: List()
17/06/01 11:45:48 INFO scheduler.DAGScheduler: Submitting ResultStage 89 (PythonRDD[200] at sortByKey at <stdin>:1), which has no missing parents
17/06/01 11:45:48 INFO storage.MemoryStore: Block broadcast_104 stored as values in memory (estimated size 6.2 KB, free 4.4 MB)
17/06/01 11:45:48 INFO storage.MemoryStore: Block broadcast_104_piece0 stored as bytes in memory (estimated size 3.9 KB, free 4.4 MB)
17/06/01 11:45:48 INFO storage.BlockManagerInfo: Added broadcast_104_piece0 in memory on localhost:57701 (size: 3.9 KB, free: 511.1 MB)
17/06/01 11:45:48 INFO spark.SparkContext: Created broadcast 104 from broadcast at DAGScheduler.scala:1006
17/06/01 11:45:48 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 89 (PythonRDD[200] at sortByKey at <stdin>:1)
17/06/01 11:45:48 INFO scheduler.TaskSchedulerImpl: Adding task set 89.0 with 2 tasks
17/06/01 11:45:48 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 89.0 (TID 183, localhost, partition 0,PROCESS_LOCAL, 2147 bytes)
17/06/01 11:45:48 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 89.0 (TID 184, localhost, partition 1,PROCESS_LOCAL, 2147 bytes)
17/06/01 11:45:48 INFO executor.Executor: Running task 0.0 in stage 89.0 (TID 183)
17/06/01 11:45:48 INFO executor.Executor: Running task 1.0 in stage 89.0 (TID 184)
17/06/01 11:45:48 INFO rdd.HadoopRDD: Input split: file:/opt/hadoop/spark-1.6.0/TESTSortbykey.md:55+55
17/06/01 11:45:48 INFO rdd.HadoopRDD: Input split: file:/opt/hadoop/spark-1.6.0/TESTSortbykey.md:0+55
17/06/01 11:45:48 INFO python.PythonRunner: Times: total = 3, boot = 1, init = 1, finish = 1
17/06/01 11:45:48 INFO executor.Executor: Finished task 0.0 in stage 89.0 (TID 183). 2124 bytes result sent to driver
17/06/01 11:45:48 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 89.0 (TID 183) in 9 ms on localhost (1/2)
17/06/01 11:45:48 INFO python.PythonRunner: Times: total = 7, boot = 3, init = 4, finish = 0
17/06/01 11:45:48 INFO executor.Executor: Finished task 1.0 in stage 89.0 (TID 184). 2124 bytes result sent to driver
17/06/01 11:45:48 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 89.0 (TID 184) in 13 ms on localhost (2/2)
17/06/01 11:45:48 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 89.0, whose tasks have all completed, from pool 
17/06/01 11:45:48 INFO scheduler.DAGScheduler: ResultStage 89 (sortByKey at <stdin>:1) finished in 0.013 s
17/06/01 11:45:48 INFO scheduler.DAGScheduler: Job 74 finished: sortByKey at <stdin>:1, took 0.017325 s
17/06/01 11:45:48 INFO spark.SparkContext: Starting job: sortByKey at <stdin>:1
17/06/01 11:45:48 INFO scheduler.DAGScheduler: Got job 75 (sortByKey at <stdin>:1) with 2 output partitions
17/06/01 11:45:48 INFO scheduler.DAGScheduler: Final stage: ResultStage 90 (sortByKey at <stdin>:1)
17/06/01 11:45:48 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/06/01 11:45:48 INFO scheduler.DAGScheduler: Missing parents: List()
17/06/01 11:45:48 INFO scheduler.DAGScheduler: Submitting ResultStage 90 (PythonRDD[201] at sortByKey at <stdin>:1), which has no missing parents
17/06/01 11:45:48 INFO storage.MemoryStore: Block broadcast_105 stored as values in memory (estimated size 6.0 KB, free 4.4 MB)
17/06/01 11:45:48 INFO storage.MemoryStore: Block broadcast_105_piece0 stored as bytes in memory (estimated size 3.9 KB, free 4.4 MB)
17/06/01 11:45:48 INFO storage.BlockManagerInfo: Added broadcast_105_piece0 in memory on localhost:57701 (size: 3.9 KB, free: 511.1 MB)
17/06/01 11:45:48 INFO spark.SparkContext: Created broadcast 105 from broadcast at DAGScheduler.scala:1006
17/06/01 11:45:48 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 90 (PythonRDD[201] at sortByKey at <stdin>:1)
17/06/01 11:45:48 INFO scheduler.TaskSchedulerImpl: Adding task set 90.0 with 2 tasks
17/06/01 11:45:48 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 90.0 (TID 185, localhost, partition 0,PROCESS_LOCAL, 2147 bytes)
17/06/01 11:45:48 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 90.0 (TID 186, localhost, partition 1,PROCESS_LOCAL, 2147 bytes)
17/06/01 11:45:48 INFO executor.Executor: Running task 1.0 in stage 90.0 (TID 186)
17/06/01 11:45:48 INFO executor.Executor: Running task 0.0 in stage 90.0 (TID 185)
17/06/01 11:45:48 INFO rdd.HadoopRDD: Input split: file:/opt/hadoop/spark-1.6.0/TESTSortbykey.md:0+55
17/06/01 11:45:48 INFO rdd.HadoopRDD: Input split: file:/opt/hadoop/spark-1.6.0/TESTSortbykey.md:55+55
17/06/01 11:45:48 INFO python.PythonRunner: Times: total = 42, boot = -8, init = 49, finish = 1
17/06/01 11:45:48 INFO python.PythonRunner: Times: total = 41, boot = -6, init = 47, finish = 0
17/06/01 11:45:48 INFO executor.Executor: Finished task 0.0 in stage 90.0 (TID 185). 2382 bytes result sent to driver
17/06/01 11:45:48 INFO executor.Executor: Finished task 1.0 in stage 90.0 (TID 186). 2223 bytes result sent to driver
17/06/01 11:45:48 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 90.0 (TID 185) in 49 ms on localhost (1/2)
17/06/01 11:45:48 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 90.0 (TID 186) in 51 ms on localhost (2/2)
17/06/01 11:45:48 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 90.0, whose tasks have all completed, from pool 
17/06/01 11:45:48 INFO scheduler.DAGScheduler: ResultStage 90 (sortByKey at <stdin>:1) finished in 0.051 s
17/06/01 11:45:48 INFO scheduler.DAGScheduler: Job 75 finished: sortByKey at <stdin>:1, took 0.055618 s
PythonRDD[206] at RDD at PythonRDD.scala:43
>>> testsortbykey.collect()
17/06/01 11:46:04 INFO spark.SparkContext: Starting job: collect at <stdin>:1
17/06/01 11:46:04 INFO scheduler.DAGScheduler: Got job 76 (collect at <stdin>:1) with 2 output partitions
17/06/01 11:46:04 INFO scheduler.DAGScheduler: Final stage: ResultStage 91 (collect at <stdin>:1)
17/06/01 11:46:04 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/06/01 11:46:04 INFO scheduler.DAGScheduler: Missing parents: List()
17/06/01 11:46:04 INFO scheduler.DAGScheduler: Submitting ResultStage 91 (PythonRDD[207] at collect at <stdin>:1), which has no missing parents
17/06/01 11:46:04 INFO storage.MemoryStore: Block broadcast_106 stored as values in memory (estimated size 5.3 KB, free 4.4 MB)
17/06/01 11:46:04 INFO storage.MemoryStore: Block broadcast_106_piece0 stored as bytes in memory (estimated size 3.3 KB, free 4.4 MB)
17/06/01 11:46:04 INFO storage.BlockManagerInfo: Added broadcast_106_piece0 in memory on localhost:57701 (size: 3.3 KB, free: 511.1 MB)
17/06/01 11:46:04 INFO spark.SparkContext: Created broadcast 106 from broadcast at DAGScheduler.scala:1006
17/06/01 11:46:04 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 91 (PythonRDD[207] at collect at <stdin>:1)
17/06/01 11:46:04 INFO scheduler.TaskSchedulerImpl: Adding task set 91.0 with 2 tasks
17/06/01 11:46:04 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 91.0 (TID 187, localhost, partition 0,PROCESS_LOCAL, 2147 bytes)
17/06/01 11:46:04 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 91.0 (TID 188, localhost, partition 1,PROCESS_LOCAL, 2147 bytes)
17/06/01 11:46:04 INFO executor.Executor: Running task 0.0 in stage 91.0 (TID 187)
17/06/01 11:46:04 INFO executor.Executor: Running task 1.0 in stage 91.0 (TID 188)
17/06/01 11:46:04 INFO rdd.HadoopRDD: Input split: file:/opt/hadoop/spark-1.6.0/TESTSortbykey.md:0+55
17/06/01 11:46:04 INFO rdd.HadoopRDD: Input split: file:/opt/hadoop/spark-1.6.0/TESTSortbykey.md:55+55
17/06/01 11:46:04 INFO python.PythonRunner: Times: total = 41, boot = -16016, init = 16056, finish = 1
17/06/01 11:46:04 INFO python.PythonRunner: Times: total = 41, boot = -16017, init = 16057, finish = 1
17/06/01 11:46:04 INFO executor.Executor: Finished task 0.0 in stage 91.0 (TID 187). 2451 bytes result sent to driver
17/06/01 11:46:04 INFO executor.Executor: Finished task 1.0 in stage 91.0 (TID 188). 2252 bytes result sent to driver
17/06/01 11:46:04 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 91.0 (TID 187) in 48 ms on localhost (1/2)
17/06/01 11:46:04 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 91.0 (TID 188) in 49 ms on localhost (2/2)
17/06/01 11:46:04 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 91.0, whose tasks have all completed, from pool 
17/06/01 11:46:04 INFO scheduler.DAGScheduler: ResultStage 91 (collect at <stdin>:1) finished in 0.051 s
17/06/01 11:46:04 INFO scheduler.DAGScheduler: Job 76 finished: collect at <stdin>:1, took 0.055614 s
[(u'Mary', 1), (u'had', 1), (u'a', 1), (u'little', 1), (u'lamb', 1), (u'whose', 1), (u'fleece', 1), (u'was', 1), (u'white', 1), (u'as', 1), (u'snow', 1), (u'and', 1), (u'every', 1), (u'where', 1), (u'that', 1), (u'Mary', 1), (u'went', 1), (u'the', 1), (u'Lamb', 1), (u'was', 1), (u'sure', 1), (u'to', 1), (u'go', 1), (u'', 1)]
>>> testsortbykey.reduceByKey(lambda x,y: x+y)
PythonRDD[212] at RDD at PythonRDD.scala:43
>>> testsortbykey.collect()
17/06/01 11:47:06 INFO spark.SparkContext: Starting job: collect at <stdin>:1
17/06/01 11:47:06 INFO scheduler.DAGScheduler: Got job 77 (collect at <stdin>:1) with 2 output partitions
17/06/01 11:47:06 INFO scheduler.DAGScheduler: Final stage: ResultStage 92 (collect at <stdin>:1)
17/06/01 11:47:06 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/06/01 11:47:06 INFO scheduler.DAGScheduler: Missing parents: List()
17/06/01 11:47:06 INFO scheduler.DAGScheduler: Submitting ResultStage 92 (PythonRDD[207] at collect at <stdin>:1), which has no missing parents
17/06/01 11:47:06 INFO storage.MemoryStore: Block broadcast_107 stored as values in memory (estimated size 5.3 KB, free 4.5 MB)
17/06/01 11:47:06 INFO storage.MemoryStore: Block broadcast_107_piece0 stored as bytes in memory (estimated size 3.3 KB, free 4.5 MB)
17/06/01 11:47:06 INFO storage.BlockManagerInfo: Added broadcast_107_piece0 in memory on localhost:57701 (size: 3.3 KB, free: 511.1 MB)
17/06/01 11:47:06 INFO spark.SparkContext: Created broadcast 107 from broadcast at DAGScheduler.scala:1006
17/06/01 11:47:06 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 92 (PythonRDD[207] at collect at <stdin>:1)
17/06/01 11:47:06 INFO scheduler.TaskSchedulerImpl: Adding task set 92.0 with 2 tasks
17/06/01 11:47:06 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 92.0 (TID 189, localhost, partition 0,PROCESS_LOCAL, 2147 bytes)
17/06/01 11:47:06 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 92.0 (TID 190, localhost, partition 1,PROCESS_LOCAL, 2147 bytes)
17/06/01 11:47:06 INFO executor.Executor: Running task 0.0 in stage 92.0 (TID 189)
17/06/01 11:47:06 INFO executor.Executor: Running task 1.0 in stage 92.0 (TID 190)
17/06/01 11:47:06 INFO rdd.HadoopRDD: Input split: file:/opt/hadoop/spark-1.6.0/TESTSortbykey.md:55+55
17/06/01 11:47:06 INFO rdd.HadoopRDD: Input split: file:/opt/hadoop/spark-1.6.0/TESTSortbykey.md:0+55
17/06/01 11:47:06 INFO python.PythonRunner: Times: total = 3, boot = 2, init = 1, finish = 0
17/06/01 11:47:06 INFO executor.Executor: Finished task 1.0 in stage 92.0 (TID 190). 2252 bytes result sent to driver
17/06/01 11:47:06 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 92.0 (TID 190) in 13 ms on localhost (1/2)
17/06/01 11:47:06 INFO python.PythonRunner: Times: total = 11, boot = 3, init = 7, finish = 1
17/06/01 11:47:06 INFO executor.Executor: Finished task 0.0 in stage 92.0 (TID 189). 2451 bytes result sent to driver
17/06/01 11:47:06 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 92.0 (TID 189) in 16 ms on localhost (2/2)
17/06/01 11:47:06 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 92.0, whose tasks have all completed, from pool 
17/06/01 11:47:06 INFO scheduler.DAGScheduler: ResultStage 92 (collect at <stdin>:1) finished in 0.017 s
17/06/01 11:47:06 INFO scheduler.DAGScheduler: Job 77 finished: collect at <stdin>:1, took 0.020758 s
[(u'Mary', 1), (u'had', 1), (u'a', 1), (u'little', 1), (u'lamb', 1), (u'whose', 1), (u'fleece', 1), (u'was', 1), (u'white', 1), (u'as', 1), (u'snow', 1), (u'and', 1), (u'every', 1), (u'where', 1), (u'that', 1), (u'Mary', 1), (u'went', 1), (u'the', 1), (u'Lamb', 1), (u'was', 1), (u'sure', 1), (u'to', 1), (u'go', 1), (u'', 1)]
>>> 

The first step is correct:

>>> rdd = sc.textFile("./yourFile.md").flatMap(lambda x: x.split(" ")).map(lambda x: (x,1))

>>> rdd.collect()
[(u'Mary', 1), (u'had', 1), (u'a', 1), (u'little', 1), (u'lamb', 1), 
(u"It's", 1), (u'fleece', 1), (u'was', 1), (u'white', 1), (u'as', 1), 
(u'snow,', 1), (u'yeah', 1), (u'Everywhere', 1), (u'the', 1), (u'child', 1),
(u'went', 1), (u'The', 1), (u'lamb,', 1), (u'the', 1), (u'lamb', 1), 
(u'was', 1), (u'sure', 1), (u'to', 1), (u'go,', 1), (u'yeah', 1)]

So what's the problem?

If you do this:

>>> rdd.reduceByKey(lambda x,y: x+y)

and then this:

>>> rdd.collect()
[(u'Mary', 1), (u'had', 1), (u'a', 1), (u'little', 1), (u'lamb', 1), 
(u"It's", 1), (u'fleece', 1), (u'was', 1), (u'white', 1), (u'as', 1), 
(u'snow,', 1), (u'yeah', 1), (u'Everywhere', 1), (u'the', 1), (u'child', 1),
(u'went', 1), (u'The', 1), (u'lamb,', 1), (u'the', 1), (u'lamb', 1), 
(u'was', 1), (u'sure', 1), (u'to', 1), (u'go,', 1), (u'yeah', 1)]

You have only applied a transformation; you haven't changed the starting rdd. RDDs are immutable, so reduceByKey() returns a new RDD rather than modifying the one it was called on.
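A minimal sketch of that point (assuming the same rdd as above; the name `reduced` is just illustrative):

>>> reduced = rdd.reduceByKey(lambda x,y: x+y)  # returns a NEW rdd
>>> reduced is rdd
False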

But..

First option (if you just want to see the result of the transformation):

>>> rdd.reduceByKey(lambda x,y: x+y).collect()  
[(u'a', 1), (u'lamb', 2), (u'little', 1), (u'white', 1), (u'had', 1), 
(u'fleece', 1), (u'The', 1), (u'snow,', 1), (u'Everywhere', 1), (u'went', 1), (u'was', 2),
(u'the', 2), (u'as', 1), (u'go,', 1), (u'sure', 1), (u'lamb,', 1), 
(u"It's", 1), (u'yeah', 2), (u'to', 1), (u'child', 1), (u'Mary', 1)]

Second option (if you want to save the transformation in a new rdd):

If you do this:

>>> rddReduced = rdd.reduceByKey(lambda x,y: x+y)

and then this:

>>> rddReduced.collect()
[(u'a', 1), (u'lamb', 2), (u'little', 1), (u'white', 1), (u'had', 1), 
(u'fleece', 1), (u'The', 1), (u'snow,', 1), (u'Everywhere', 1), (u'went', 1), 
(u'was', 2), (u'the', 2), (u'as', 1), (u'go,', 1), (u'sure', 1), (u'lamb,', 1), 
(u"It's", 1), (u'yeah', 2), (u'to', 1), (u'child', 1), (u'Mary', 1)]

You have applied and saved the transformation, and the result is what you were looking for.

The same concept applies if you want to use sortByKey(); see the sketch below.
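For instance, a minimal sketch (assuming the same rdd as above; `rddSorted` is just an illustrative name, and the output shown assumes the default ascending sort, where uppercase words order before lowercase ones):

>>> rddSorted = rdd.reduceByKey(lambda x,y: x+y).sortByKey()  # save the sorted result in a new rdd
>>> rddSorted.collect()
[(u'Everywhere', 1), (u"It's", 1), (u'Mary', 1), (u'The', 1), (u'a', 1),
(u'as', 1), (u'child', 1), (u'fleece', 1), (u'go,', 1), (u'had', 1),
(u'lamb', 2), (u'lamb,', 1), (u'little', 1), (u'snow,', 1), (u'sure', 1),
(u'the', 2), (u'to', 1), (u'was', 2), (u'went', 1), (u'white', 1), (u'yeah', 2)]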
