
sortByKey() doesn't seem to work on strings in pyspark

I saved two verses of the poem "Mary had a little lamb" in a file "TESTSortbykey.md" and ran the following command in pyspark:

testsortbykey=sc.textFile("file:///opt/hadoop/spark-1.6.0/TESTSortbykey.md").flatMap(lambda x: x.split(" ")).map(lambda x: (x,1))

On running testsortbykey.collect() I got the output:

[(u'Mary', 1), (u'had', 1), (u'a', 1), (u'little', 1), (u'lamb', 1), (u'whose', 1), (u'fleece', 1), (u'was', 1), (u'white', 1), (u'as', 1), (u'snow', 1), (u'and', 1), (u'every', 1), (u'where', 1), (u'that', 1), (u'Mary', 1), (u'went', 1), (u'the', 1), (u'Lamb', 1), (u'was', 1), (u'sure', 1), (u'to', 1), (u'go', 1), (u'', 1)]

Once I have the pair RDD testsortbykey, I want to apply reduceByKey() and sortByKey(), but neither seems to work. The commands I used are:

 testsortbykey.sortByKey()
 testsortbykey.collect()
 testsortbykey.reduceByKey(lambda x,y: x+y )
 testsortbykey.collect()

The output I get in both cases is:

[(u'Mary', 1), (u'had', 1), (u'a', 1), (u'little', 1), (u'lamb', 1), (u'whose', 1), (u'fleece', 1), (u'was', 1), (u'white', 1), (u'as', 1), (u'snow', 1), (u'and', 1), (u'every', 1), (u'where', 1), (u'that', 1), (u'Mary', 1), (u'went', 1), (u'the', 1), (u'Lamb', 1), (u'was', 1), (u'sure', 1), (u'to', 1), (u'go', 1), (u'', 1)]

Clearly, even though there are multiple identical keys (e.g. 'Mary', 'was', etc.), the values have not been merged.

Can anyone explain why? And what should I do to overcome this?

Edit: this is what my console looks like, hope this helps:

    >>> testsortbykey=sc.textFile("file:///opt/hadoop/spark-1.6.0/TESTSortbykey.md").flatMap(lambda x: x.split(" ")).map(lambda x: (x,1))
17/06/01 11:44:48 INFO storage.MemoryStore: Block broadcast_103 stored as values in memory (estimated size 228.9 KB, free 4.4 MB)
17/06/01 11:44:48 INFO storage.MemoryStore: Block broadcast_103_piece0 stored as bytes in memory (estimated size 19.5 KB, free 4.4 MB)
17/06/01 11:44:48 INFO storage.BlockManagerInfo: Added broadcast_103_piece0 in memory on localhost:57701 (size: 19.5 KB, free: 511.1 MB)
17/06/01 11:44:48 INFO spark.SparkContext: Created broadcast 103 from textFile at null:-1
>>> testsortbykey.sortByKey()
17/06/01 11:45:48 INFO mapred.FileInputFormat: Total input paths to process : 1
17/06/01 11:45:48 INFO spark.SparkContext: Starting job: sortByKey at <stdin>:1
17/06/01 11:45:48 INFO scheduler.DAGScheduler: Got job 74 (sortByKey at <stdin>:1) with 2 output partitions
17/06/01 11:45:48 INFO scheduler.DAGScheduler: Final stage: ResultStage 89 (sortByKey at <stdin>:1)
17/06/01 11:45:48 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/06/01 11:45:48 INFO scheduler.DAGScheduler: Missing parents: List()
17/06/01 11:45:48 INFO scheduler.DAGScheduler: Submitting ResultStage 89 (PythonRDD[200] at sortByKey at <stdin>:1), which has no missing parents
17/06/01 11:45:48 INFO storage.MemoryStore: Block broadcast_104 stored as values in memory (estimated size 6.2 KB, free 4.4 MB)
17/06/01 11:45:48 INFO storage.MemoryStore: Block broadcast_104_piece0 stored as bytes in memory (estimated size 3.9 KB, free 4.4 MB)
17/06/01 11:45:48 INFO storage.BlockManagerInfo: Added broadcast_104_piece0 in memory on localhost:57701 (size: 3.9 KB, free: 511.1 MB)
17/06/01 11:45:48 INFO spark.SparkContext: Created broadcast 104 from broadcast at DAGScheduler.scala:1006
17/06/01 11:45:48 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 89 (PythonRDD[200] at sortByKey at <stdin>:1)
17/06/01 11:45:48 INFO scheduler.TaskSchedulerImpl: Adding task set 89.0 with 2 tasks
17/06/01 11:45:48 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 89.0 (TID 183, localhost, partition 0,PROCESS_LOCAL, 2147 bytes)
17/06/01 11:45:48 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 89.0 (TID 184, localhost, partition 1,PROCESS_LOCAL, 2147 bytes)
17/06/01 11:45:48 INFO executor.Executor: Running task 0.0 in stage 89.0 (TID 183)
17/06/01 11:45:48 INFO executor.Executor: Running task 1.0 in stage 89.0 (TID 184)
17/06/01 11:45:48 INFO rdd.HadoopRDD: Input split: file:/opt/hadoop/spark-1.6.0/TESTSortbykey.md:55+55
17/06/01 11:45:48 INFO rdd.HadoopRDD: Input split: file:/opt/hadoop/spark-1.6.0/TESTSortbykey.md:0+55
17/06/01 11:45:48 INFO python.PythonRunner: Times: total = 3, boot = 1, init = 1, finish = 1
17/06/01 11:45:48 INFO executor.Executor: Finished task 0.0 in stage 89.0 (TID 183). 2124 bytes result sent to driver
17/06/01 11:45:48 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 89.0 (TID 183) in 9 ms on localhost (1/2)
17/06/01 11:45:48 INFO python.PythonRunner: Times: total = 7, boot = 3, init = 4, finish = 0
17/06/01 11:45:48 INFO executor.Executor: Finished task 1.0 in stage 89.0 (TID 184). 2124 bytes result sent to driver
17/06/01 11:45:48 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 89.0 (TID 184) in 13 ms on localhost (2/2)
17/06/01 11:45:48 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 89.0, whose tasks have all completed, from pool 
17/06/01 11:45:48 INFO scheduler.DAGScheduler: ResultStage 89 (sortByKey at <stdin>:1) finished in 0.013 s
17/06/01 11:45:48 INFO scheduler.DAGScheduler: Job 74 finished: sortByKey at <stdin>:1, took 0.017325 s
17/06/01 11:45:48 INFO spark.SparkContext: Starting job: sortByKey at <stdin>:1
17/06/01 11:45:48 INFO scheduler.DAGScheduler: Got job 75 (sortByKey at <stdin>:1) with 2 output partitions
17/06/01 11:45:48 INFO scheduler.DAGScheduler: Final stage: ResultStage 90 (sortByKey at <stdin>:1)
17/06/01 11:45:48 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/06/01 11:45:48 INFO scheduler.DAGScheduler: Missing parents: List()
17/06/01 11:45:48 INFO scheduler.DAGScheduler: Submitting ResultStage 90 (PythonRDD[201] at sortByKey at <stdin>:1), which has no missing parents
17/06/01 11:45:48 INFO storage.MemoryStore: Block broadcast_105 stored as values in memory (estimated size 6.0 KB, free 4.4 MB)
17/06/01 11:45:48 INFO storage.MemoryStore: Block broadcast_105_piece0 stored as bytes in memory (estimated size 3.9 KB, free 4.4 MB)
17/06/01 11:45:48 INFO storage.BlockManagerInfo: Added broadcast_105_piece0 in memory on localhost:57701 (size: 3.9 KB, free: 511.1 MB)
17/06/01 11:45:48 INFO spark.SparkContext: Created broadcast 105 from broadcast at DAGScheduler.scala:1006
17/06/01 11:45:48 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 90 (PythonRDD[201] at sortByKey at <stdin>:1)
17/06/01 11:45:48 INFO scheduler.TaskSchedulerImpl: Adding task set 90.0 with 2 tasks
17/06/01 11:45:48 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 90.0 (TID 185, localhost, partition 0,PROCESS_LOCAL, 2147 bytes)
17/06/01 11:45:48 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 90.0 (TID 186, localhost, partition 1,PROCESS_LOCAL, 2147 bytes)
17/06/01 11:45:48 INFO executor.Executor: Running task 1.0 in stage 90.0 (TID 186)
17/06/01 11:45:48 INFO executor.Executor: Running task 0.0 in stage 90.0 (TID 185)
17/06/01 11:45:48 INFO rdd.HadoopRDD: Input split: file:/opt/hadoop/spark-1.6.0/TESTSortbykey.md:0+55
17/06/01 11:45:48 INFO rdd.HadoopRDD: Input split: file:/opt/hadoop/spark-1.6.0/TESTSortbykey.md:55+55
17/06/01 11:45:48 INFO python.PythonRunner: Times: total = 42, boot = -8, init = 49, finish = 1
17/06/01 11:45:48 INFO python.PythonRunner: Times: total = 41, boot = -6, init = 47, finish = 0
17/06/01 11:45:48 INFO executor.Executor: Finished task 0.0 in stage 90.0 (TID 185). 2382 bytes result sent to driver
17/06/01 11:45:48 INFO executor.Executor: Finished task 1.0 in stage 90.0 (TID 186). 2223 bytes result sent to driver
17/06/01 11:45:48 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 90.0 (TID 185) in 49 ms on localhost (1/2)
17/06/01 11:45:48 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 90.0 (TID 186) in 51 ms on localhost (2/2)
17/06/01 11:45:48 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 90.0, whose tasks have all completed, from pool 
17/06/01 11:45:48 INFO scheduler.DAGScheduler: ResultStage 90 (sortByKey at <stdin>:1) finished in 0.051 s
17/06/01 11:45:48 INFO scheduler.DAGScheduler: Job 75 finished: sortByKey at <stdin>:1, took 0.055618 s
PythonRDD[206] at RDD at PythonRDD.scala:43
>>> testsortbykey.collect()
17/06/01 11:46:04 INFO spark.SparkContext: Starting job: collect at <stdin>:1
17/06/01 11:46:04 INFO scheduler.DAGScheduler: Got job 76 (collect at <stdin>:1) with 2 output partitions
17/06/01 11:46:04 INFO scheduler.DAGScheduler: Final stage: ResultStage 91 (collect at <stdin>:1)
17/06/01 11:46:04 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/06/01 11:46:04 INFO scheduler.DAGScheduler: Missing parents: List()
17/06/01 11:46:04 INFO scheduler.DAGScheduler: Submitting ResultStage 91 (PythonRDD[207] at collect at <stdin>:1), which has no missing parents
17/06/01 11:46:04 INFO storage.MemoryStore: Block broadcast_106 stored as values in memory (estimated size 5.3 KB, free 4.4 MB)
17/06/01 11:46:04 INFO storage.MemoryStore: Block broadcast_106_piece0 stored as bytes in memory (estimated size 3.3 KB, free 4.4 MB)
17/06/01 11:46:04 INFO storage.BlockManagerInfo: Added broadcast_106_piece0 in memory on localhost:57701 (size: 3.3 KB, free: 511.1 MB)
17/06/01 11:46:04 INFO spark.SparkContext: Created broadcast 106 from broadcast at DAGScheduler.scala:1006
17/06/01 11:46:04 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 91 (PythonRDD[207] at collect at <stdin>:1)
17/06/01 11:46:04 INFO scheduler.TaskSchedulerImpl: Adding task set 91.0 with 2 tasks
17/06/01 11:46:04 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 91.0 (TID 187, localhost, partition 0,PROCESS_LOCAL, 2147 bytes)
17/06/01 11:46:04 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 91.0 (TID 188, localhost, partition 1,PROCESS_LOCAL, 2147 bytes)
17/06/01 11:46:04 INFO executor.Executor: Running task 0.0 in stage 91.0 (TID 187)
17/06/01 11:46:04 INFO executor.Executor: Running task 1.0 in stage 91.0 (TID 188)
17/06/01 11:46:04 INFO rdd.HadoopRDD: Input split: file:/opt/hadoop/spark-1.6.0/TESTSortbykey.md:0+55
17/06/01 11:46:04 INFO rdd.HadoopRDD: Input split: file:/opt/hadoop/spark-1.6.0/TESTSortbykey.md:55+55
17/06/01 11:46:04 INFO python.PythonRunner: Times: total = 41, boot = -16016, init = 16056, finish = 1
17/06/01 11:46:04 INFO python.PythonRunner: Times: total = 41, boot = -16017, init = 16057, finish = 1
17/06/01 11:46:04 INFO executor.Executor: Finished task 0.0 in stage 91.0 (TID 187). 2451 bytes result sent to driver
17/06/01 11:46:04 INFO executor.Executor: Finished task 1.0 in stage 91.0 (TID 188). 2252 bytes result sent to driver
17/06/01 11:46:04 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 91.0 (TID 187) in 48 ms on localhost (1/2)
17/06/01 11:46:04 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 91.0 (TID 188) in 49 ms on localhost (2/2)
17/06/01 11:46:04 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 91.0, whose tasks have all completed, from pool 
17/06/01 11:46:04 INFO scheduler.DAGScheduler: ResultStage 91 (collect at <stdin>:1) finished in 0.051 s
17/06/01 11:46:04 INFO scheduler.DAGScheduler: Job 76 finished: collect at <stdin>:1, took 0.055614 s
[(u'Mary', 1), (u'had', 1), (u'a', 1), (u'little', 1), (u'lamb', 1), (u'whose', 1), (u'fleece', 1), (u'was', 1), (u'white', 1), (u'as', 1), (u'snow', 1), (u'and', 1), (u'every', 1), (u'where', 1), (u'that', 1), (u'Mary', 1), (u'went', 1), (u'the', 1), (u'Lamb', 1), (u'was', 1), (u'sure', 1), (u'to', 1), (u'go', 1), (u'', 1)]
>>> testsortbykey.reduceByKey(lambda x,y: x+y)
PythonRDD[212] at RDD at PythonRDD.scala:43
>>> testsortbykey.collect()
17/06/01 11:47:06 INFO spark.SparkContext: Starting job: collect at <stdin>:1
17/06/01 11:47:06 INFO scheduler.DAGScheduler: Got job 77 (collect at <stdin>:1) with 2 output partitions
17/06/01 11:47:06 INFO scheduler.DAGScheduler: Final stage: ResultStage 92 (collect at <stdin>:1)
17/06/01 11:47:06 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/06/01 11:47:06 INFO scheduler.DAGScheduler: Missing parents: List()
17/06/01 11:47:06 INFO scheduler.DAGScheduler: Submitting ResultStage 92 (PythonRDD[207] at collect at <stdin>:1), which has no missing parents
17/06/01 11:47:06 INFO storage.MemoryStore: Block broadcast_107 stored as values in memory (estimated size 5.3 KB, free 4.5 MB)
17/06/01 11:47:06 INFO storage.MemoryStore: Block broadcast_107_piece0 stored as bytes in memory (estimated size 3.3 KB, free 4.5 MB)
17/06/01 11:47:06 INFO storage.BlockManagerInfo: Added broadcast_107_piece0 in memory on localhost:57701 (size: 3.3 KB, free: 511.1 MB)
17/06/01 11:47:06 INFO spark.SparkContext: Created broadcast 107 from broadcast at DAGScheduler.scala:1006
17/06/01 11:47:06 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 92 (PythonRDD[207] at collect at <stdin>:1)
17/06/01 11:47:06 INFO scheduler.TaskSchedulerImpl: Adding task set 92.0 with 2 tasks
17/06/01 11:47:06 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 92.0 (TID 189, localhost, partition 0,PROCESS_LOCAL, 2147 bytes)
17/06/01 11:47:06 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 92.0 (TID 190, localhost, partition 1,PROCESS_LOCAL, 2147 bytes)
17/06/01 11:47:06 INFO executor.Executor: Running task 0.0 in stage 92.0 (TID 189)
17/06/01 11:47:06 INFO executor.Executor: Running task 1.0 in stage 92.0 (TID 190)
17/06/01 11:47:06 INFO rdd.HadoopRDD: Input split: file:/opt/hadoop/spark-1.6.0/TESTSortbykey.md:55+55
17/06/01 11:47:06 INFO rdd.HadoopRDD: Input split: file:/opt/hadoop/spark-1.6.0/TESTSortbykey.md:0+55
17/06/01 11:47:06 INFO python.PythonRunner: Times: total = 3, boot = 2, init = 1, finish = 0
17/06/01 11:47:06 INFO executor.Executor: Finished task 1.0 in stage 92.0 (TID 190). 2252 bytes result sent to driver
17/06/01 11:47:06 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 92.0 (TID 190) in 13 ms on localhost (1/2)
17/06/01 11:47:06 INFO python.PythonRunner: Times: total = 11, boot = 3, init = 7, finish = 1
17/06/01 11:47:06 INFO executor.Executor: Finished task 0.0 in stage 92.0 (TID 189). 2451 bytes result sent to driver
17/06/01 11:47:06 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 92.0 (TID 189) in 16 ms on localhost (2/2)
17/06/01 11:47:06 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 92.0, whose tasks have all completed, from pool 
17/06/01 11:47:06 INFO scheduler.DAGScheduler: ResultStage 92 (collect at <stdin>:1) finished in 0.017 s
17/06/01 11:47:06 INFO scheduler.DAGScheduler: Job 77 finished: collect at <stdin>:1, took 0.020758 s
[(u'Mary', 1), (u'had', 1), (u'a', 1), (u'little', 1), (u'lamb', 1), (u'whose', 1), (u'fleece', 1), (u'was', 1), (u'white', 1), (u'as', 1), (u'snow', 1), (u'and', 1), (u'every', 1), (u'where', 1), (u'that', 1), (u'Mary', 1), (u'went', 1), (u'the', 1), (u'Lamb', 1), (u'was', 1), (u'sure', 1), (u'to', 1), (u'go', 1), (u'', 1)]
>>> 

The first step is correct:

>>> rdd = sc.textFile("./yourFile.md").flatMap(lambda x: x.split(" ")).map(lambda x: (x,1))

>>> rdd.collect()
[(u'Mary', 1), (u'had', 1), (u'a', 1), (u'little', 1), (u'lamb', 1), 
(u"It's", 1), (u'fleece', 1), (u'was', 1), (u'white', 1), (u'as', 1), 
(u'snow,', 1), (u'yeah', 1), (u'Everywhere', 1), (u'the', 1), (u'child', 1),
(u'went', 1), (u'The', 1), (u'lamb,', 1), (u'the', 1), (u'lamb', 1), 
(u'was', 1), (u'sure', 1), (u'to', 1), (u'go,', 1), (u'yeah', 1)]

So what's the problem?

If you do this:

>>> rdd.reduceByKey(lambda x,y: x+y)

and then this:

>>> rdd.collect()
[(u'Mary', 1), (u'had', 1), (u'a', 1), (u'little', 1), (u'lamb', 1), 
(u"It's", 1), (u'fleece', 1), (u'was', 1), (u'white', 1), (u'as', 1), 
(u'snow,', 1), (u'yeah', 1), (u'Everywhere', 1), (u'the', 1), (u'child', 1),
(u'went', 1), (u'The', 1), (u'lamb,', 1), (u'the', 1), (u'lamb', 1), 
(u'was', 1), (u'sure', 1), (u'to', 1), (u'go,', 1), (u'yeah', 1)]

You have only applied a transformation; you haven't changed the starting rdd. RDDs are immutable, so reduceByKey() returns a new RDD rather than modifying the one it was called on.
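A minimal sketch of that point (assuming the same rdd as above; the name `reduced` is just illustrative):

>>> reduced = rdd.reduceByKey(lambda x,y: x+y)  # returns a NEW rdd
>>> reduced is rdd
False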

But..

First option (if you just want to see the result of the transformation):

>>> rdd.reduceByKey(lambda x,y: x+y).collect()  
[(u'a', 1), (u'lamb', 2), (u'little', 1), (u'white', 1), (u'had', 1), 
(u'fleece', 1), (u'The', 1), (u'snow,', 1), (u'Everywhere', 1), (u'went', 1), (u'was', 2),
(u'the', 2), (u'as', 1), (u'go,', 1), (u'sure', 1), (u'lamb,', 1), 
(u"It's", 1), (u'yeah', 2), (u'to', 1), (u'child', 1), (u'Mary', 1)]

Second option (if you want to save the transformation in a new rdd):

If you do this:

>>> rddReduced = rdd.reduceByKey(lambda x,y: x+y)

and then this:

>>> rddReduced.collect()
[(u'a', 1), (u'lamb', 2), (u'little', 1), (u'white', 1), (u'had', 1), 
(u'fleece', 1), (u'The', 1), (u'snow,', 1), (u'Everywhere', 1), (u'went', 1), 
(u'was', 2), (u'the', 2), (u'as', 1), (u'go,', 1), (u'sure', 1), (u'lamb,', 1), 
(u"It's", 1), (u'yeah', 2), (u'to', 1), (u'child', 1), (u'Mary', 1)]

You have applied and saved the transformation, and the result is what you were looking for.

The same concept applies if you want to use sortByKey(); see the sketch below.
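For instance, a minimal sketch (assuming the same rdd as above; `rddSorted` is just an illustrative name, and the output shown assumes the default ascending sort, where uppercase words order before lowercase ones):

>>> rddSorted = rdd.reduceByKey(lambda x,y: x+y).sortByKey()  # save the sorted result in a new rdd
>>> rddSorted.collect()
[(u'Everywhere', 1), (u"It's", 1), (u'Mary', 1), (u'The', 1), (u'a', 1),
(u'as', 1), (u'child', 1), (u'fleece', 1), (u'go,', 1), (u'had', 1),
(u'lamb', 2), (u'lamb,', 1), (u'little', 1), (u'snow,', 1), (u'sure', 1),
(u'the', 2), (u'to', 1), (u'was', 2), (u'went', 1), (u'white', 1), (u'yeah', 2)]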
