
Map Reduce to read a text file using Python

I'm trying to write a MapReduce program that can read an input file, but I don't really know how to do this in a MapReduce program in Python.

Can someone give me a code snippet? I have tried the following code to read a file in Python. I had already pushed the file to HDFS before reading it.

f = open('/usr/total.txt',"r")
g_total = int(f.readline())
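For context, with Hadoop Streaming the mapper script does not open the HDFS file itself; Hadoop pipes each input split to the script's stdin. A minimal sketch of such a mapper (the CSV layout, with the year in the first column, is an assumption for illustration):

```python
# Sketch of a Hadoop Streaming mapper: Hadoop feeds input records to the
# script on stdin, so the mapper never opens the HDFS file directly.
# The column layout (year as the first CSV field) is assumed.
def map_lines(lines):
    """Emit a <year, 1> pair for each input record."""
    for line in lines:
        fields = line.rstrip("\n").split(",")
        if fields and fields[0]:
            yield fields[0], 1

# In a real job this would iterate over `sys.stdin` and print
# tab-separated "year\t1" lines; a small made-up sample here:
sample = ["2007,AA,123\n", "2008,UA,456\n", "2007,DL,789\n"]
pairs = list(map_lines(sample))
```

Each record contributes one `<year, 1>` pair, which the reducer then aggregates per year.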

I have a flight data set for which I have to report the total flights per year as a percentage. What I have done so far: the mapper produces the key-value pair <year,1>, and the reducer aggregates the mapper output to produce <year,total_count>.

Now I am confused about how to calculate this as a percentage, since for that I need the total number of rows.

How can I perform all of this in the mapper and reducer?

calculate [...] percentage so for that I need total number of rows

Calculating totals or averages isn't really a good use case for MapReduce. You need to force all data to one reducer with a common key; then you can sum, count, and divide to get percentages. For example, the mapper would output <null,year>, then the reducer would first sum the per-year counts (using a Counter object, for example), then sum all the years together to get the total number of records, since all of the data is now available to that single reducer. Once you have that, divide each yearly count by the total to output <year,yearly_count/total_count>, giving a percentage for each year.
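The single-reducer step above can be sketched locally like this (the sample year values are made up for illustration; in a real job they would be the value stream arriving for the shared null key):

```python
from collections import Counter

# Simulate the single reducer: because every <null, year> pair is routed to
# the same reducer, it can count per year AND see the grand total.
def reduce_years(years):
    counts = Counter(years)          # per-year totals
    total = sum(counts.values())     # total number of records
    # Emit <year, yearly_count/total_count> as a percentage per year
    return {year: count / total * 100 for year, count in counts.items()}

# Hypothetical mapper output: the value stream for the single null key
mapper_values = ["2007", "2007", "2008", "2007", "2008"]
percentages = reduce_years(mapper_values)
```

Here three of the five records are from 2007, so that year comes out at 60% and 2008 at 40%.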

As far as HDFS is concerned, you can use the MrJob Python module to do this, which will run a Hadoop Streaming job.

Worth pointing out that PySpark or Hive can easily do the same, so you don't "need" to write MapReduce, but this is how those tools work behind the scenes.

