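MapReduce count and find average
I want to write a MapReduce program that reads the cust_key and balance values from a .tbl file. I concatenate the two values into a single string and send it to the reducer, where I want to count the cust_keys and compute the average balance for each segment. That is why I use the segment as the key.
I want to split the string to separate the two values, so that I can count the customer keys and sum the balances to compute the average. But after splitting, array[0] holds the whole string instead of its first value, and array[1] throws an ArrayIndexOutOfBoundsException. I hope that is clear.
The code is below: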
public class MapReduceTest {
public static class TokenizerMapper extends Mapper<Object, Text, Text, Text>{
private Text segment = new Text();
private Text word = new Text();
private float balance = 0;
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
String[] line = value.toString().split("\\|");
balance = Float.parseFloat(line[5]);
String cust_key = line[1];
int nation = Integer.parseInt(line[3]);
if((balance > 8000) && ( nation < 15) && (nation > 1)){
segment.set(line[6]);
//word.set(cust_key+","+balance);
word.set(cust_key+","+balance);
context.write(segment,word);
}
}
}
public static class AvgReducer extends Reducer<Text,Text,Text,Text> {
Text val = new Text();
public void reduce(Text key, Iterable<Text> values,Context context) throws IOException, InterruptedException {
String cust_key = "";
float avg,sum = 0;
int count = 0;
for(Text v : values){
String[] a = v.toString().trim().split(",");
cust_key +=a[0];
}
val.set(cust_key); // cust_count is not declared in this version of the reducer
context.write(key, val);
}
}
Input data
8794|Customer#000008794|6dnUgJZGX73Kx1idr6|18|28-434-484-9934|7779.30|HOUSEHOLD|deposits detect furiously even requests. furiously ironic packages are slyly into th
8795|Customer#000008795|oA1cLUtWOAIFz5Douypbq1jHv glSE|9|19-829-732-8102|9794.80|BUILDING|totes. blithely unusual theodolites integrate carefully ironic foxes. unusual excuses cajole carefully carefully fi
8796|Customer#000008796|CzCzpV7SDojXUzi4165j,xYJuRv wZzn grYsyZ|24|34-307-411-6825|4323.03|AUTOMOBILE|s. pending, bold accounts above the sometimes express accounts
8797|Customer#000008797|TOWDryHNNqp8bvgMW6 FAhRoLyG1ldu2bHcJCM6|2|12-517-522-5820|219.78|FURNITURE|ly bold pinto beans can nod blithely quickly regular requests. fluffily even deposits ru
8798|Customer#000008798|bIegyozQ5kzprN|15|25-472-647-6270|6832.96|AUTOMOBILE|es-- silent instructions nag blithely
Stack trace
java.lang.Exception: java.lang.ArrayIndexOutOfBoundsException: 1
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
at MapReduceTest$AvgReducer.reduce(MapReduceTest.java:69)
at MapReduceTest$AvgReducer.reduce(MapReduceTest.java:1)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
17/04/12 18:40:33 INFO mapreduce.Job: Job job_local806960399_0001 running in uber mode : false
17/04/12 18:40:33 INFO mapreduce.Job: map 100% reduce 0%
17/04/12 18:40:33 INFO mapreduce.Job: Job job_local806960399_0001 failed with state FAILED due to: NA
17/04/12 18:40:33 INFO mapreduce.Job: Counters: 35
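The symptom described above (a[0] holds the whole string and a[1] is out of bounds) is what String.split returns when the delimiter does not occur in the input, which suggests that at least one value reaching the reducer contains no comma. A minimal sketch in plain Java, without Hadoop; the class and method names are hypothetical:

```java
import java.util.Arrays;

public class SplitDemo {
    // Splits a reducer value of the assumed form "custKey,balance" on commas.
    static String[] fields(String value) {
        return value.trim().split(",");
    }

    public static void main(String[] args) {
        // A value in the shape the mapper emits: two fields.
        System.out.println(Arrays.toString(fields("Customer#000008795,9794.8")));

        // A value without a comma: split returns a one-element array that
        // holds the whole string, so index 1 does not exist.
        String[] a = fields("Customer#000008795");
        System.out.println(a.length + " " + a[0]);
    }
}
```

Logging v.toString() inside the reduce loop would show which values actually arrive without a comma.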
Update
Reducer
public static class AvgReducer extends Reducer<Text,Text,Text,Text> {
Logger log = Logger.getLogger(AvgReducer.class.getName());
public void reduce(Text key, Iterable<Text> values,Context context) throws IOException, InterruptedException {
float sumBalance=0,avgBalance = 0;
int cust_count = 1;
for(Text v : values){
String[] a = v.toString().trim().split(",");
//c2 += " i "+i+" "+a[0]+"\n";
sumBalance +=Float.parseFloat(a[a.length-1]);
cust_count++;
}
avgBalance = sumBalance / cust_count;
context.write(key,new Text(avgBalance+" "+cust_count));
}
}
Stack trace
java.lang.Exception: java.lang.NumberFormatException: For input string: "8991.715 289"
Thanks in advance.
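The failing input "8991.715 289" has the shape of the reducer's own output (avgBalance + " " + cust_count) rather than the mapper's "cust_key,balance" values. One possible explanation, which cannot be confirmed because the job driver is not shown, is that AvgReducer is also registered as a combiner through job.setCombinerClass, so its output is fed back into reduce. The sketch below only reproduces the parsing step in plain Java; the class and method names are hypothetical:

```java
public class CombinerSymptom {
    // Mirrors the updated reducer's parsing step: the last comma-separated
    // field of the value is parsed as a float.
    static float lastFieldAsFloat(String value) {
        String[] a = value.trim().split(",");
        return Float.parseFloat(a[a.length - 1]);
    }

    // Returns true if parsing the value throws NumberFormatException.
    static boolean failsToParse(String value) {
        try {
            lastFieldAsFloat(value);
            return false;
        } catch (NumberFormatException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // A value in the shape the mapper emits.
        System.out.println(failsToParse("Customer#000008795,9794.8"));
        // The string from the stack trace, which matches the reducer's output shape.
        System.out.println(failsToParse("8991.715 289"));
    }
}
```

Separately, cust_count starts at 1 in the updated reducer, so the divisor is one larger than the number of values seen.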
Pig runs on MapReduce (if configured that way). It is far less messy than hand-written MapReduce, and it is installed in the major Hadoop distributions.
A = LOAD 'test.txt' USING PigStorage('|') AS (f1:int,customer_key:chararray,f3:chararray,nation:int,f5:chararray,balance:float,segment:chararray,f7:chararray);
filtered = FILTER A BY balance > 8000 AND (nation > 1 AND nation < 15);
X = FOREACH filtered generate segment,customer_key,balance;
And the output:
\d X
(BUILDING,Customer#000008795,9794.8)
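I'm not sure you really want an average, since there is only one element here, but you would need a GROUP BY on segment and customer_key, after which you can easily use the AVG function.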
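What grouping followed by AVG computes can be sketched in plain Java, here grouped by segment only, as the question describes. This is an illustration rather than Pig or Hadoop code: the class name is hypothetical, and the rows are sample records from the question with the trailing comment field dropped.

```java
import java.util.Arrays;
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class SegmentAverage {
    // Applies the question's filter, then groups balances (field 5) by segment (field 6).
    static Map<String, DoubleSummaryStatistics> bySegment(List<String> rows) {
        return rows.stream()
            .map(r -> r.split("\\|"))
            .filter(f -> f.length >= 7)
            .filter(f -> {
                double balance = Double.parseDouble(f[5]);
                int nation = Integer.parseInt(f[3]);
                return balance > 8000 && nation > 1 && nation < 15;
            })
            .collect(Collectors.groupingBy(
                f -> f[6],
                TreeMap::new,
                Collectors.summarizingDouble(f -> Double.parseDouble(f[5]))));
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
            "8794|Customer#000008794|6dnUgJZGX73Kx1idr6|18|28-434-484-9934|7779.30|HOUSEHOLD",
            "8795|Customer#000008795|oA1cLUtWOAIFz5Douypbq1jHv glSE|9|19-829-732-8102|9794.80|BUILDING",
            "8796|Customer#000008796|CzCzpV7SDojXUzi4165j,xYJuRv wZzn grYsyZ|24|34-307-411-6825|4323.03|AUTOMOBILE");
        // Each entry carries both the customer count and the average balance.
        bySegment(rows).forEach((segment, stats) ->
            System.out.println(segment + "\t" + stats.getAverage() + "\t" + stats.getCount()));
    }
}
```

Only the BUILDING row passes the filter, which matches the single record in the Pig output above.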
If you are more comfortable with SQL, Hive may also be an easier approach.
(It also runs on MapReduce unless configured otherwise.)
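CREATE EXTERNAL TABLE IF NOT EXISTS records (
f1 INT,
customer_key STRING,
f3 STRING,
nation INT,
f5 STRING,
balance FLOAT,
segment STRING, -- seventh field of the input; the query below selects it
f8 STRING
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
-- LOCATION takes the directory that contains test.txt, not the file itself
LOCATION 'hdfs://path/';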
The query would then look something like this:
SELECT segment, customer_key, AVG(balance)
FROM records
WHERE balance > 8000 AND nation > 1 AND nation < 15
GROUP BY segment, customer_key;
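I'll skip an Apache Spark example, but Spark SQL is essentially that Hive query.
If you really want to do this in Java MapReduce, try to normalize the input and catch errors early.
Returning early drops the problematic records: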
// Reusable output objects, declared as fields of the Mapper class
private final Text segment = new Text();
private final Text word = new Text();

@Override
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
    String[] line = value.toString().split("\\|");
    if (line.length < 7) {
        System.err.println("map: Not enough fields");
        return; // drop records that are too short
    }
    String custKey = line[1];
    int nation;
    float balance;
    try {
        nation = Integer.parseInt(line[3]);
        balance = Float.parseFloat(line[5]);
    } catch (NumberFormatException e) {
        e.printStackTrace();
        return; // drop records whose numeric fields do not parse
    }
    if (balance > 8000 && nation > 1 && nation < 15) {
        segment.set(line[6]);
        word.set(custKey + "\t" + balance);
        context.write(segment, word);
    }
}
Then, if you are trying to write output records of a similar shape, the reducer should ideally produce the same format:
public void reduce(Text key, Iterable<Text> values,Context context) throws IOException, InterruptedException {
float sumBalance = 0;
int count = 0;
for(Text v : values){
String[] a = v.toString().trim().split("\t");
if (a.length < 2) {
System.err.println("reduce: Not enough records");
continue;
}
sumBalance += Float.parseFloat(a[1]);
count++;
}
float avgBalance = count <= 1 ? sumBalance : sumBalance / count;
context.write(key,new Text(avgBalance + "\t" + count));
}
(Code untested.)
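One way to read "the reducer should produce the same format it consumes" is to carry (sum, count) pairs instead of averages: an average of averages is wrong in general, but partial sums and counts can be merged in any order and divided once at the end. The sketch below shows the idea in plain Java without Hadoop types; the class name, the method names, and the "sum\tcount" value layout are assumptions, not something from the post.

```java
import java.util.Arrays;

public class PartialAverage {
    // Merges values of the assumed form "sum\tcount" into one value of the same form.
    static String merge(Iterable<String> partials) {
        double sum = 0;
        long count = 0;
        for (String p : partials) {
            String[] a = p.trim().split("\t");
            if (a.length < 2) {
                continue; // skip malformed values instead of failing the task
            }
            sum += Double.parseDouble(a[0]);
            count += Long.parseLong(a[1]);
        }
        return sum + "\t" + count;
    }

    // Turns a merged "sum\tcount" value into the final average.
    static double average(String partial) {
        String[] a = partial.split("\t");
        long count = Long.parseLong(a[1]);
        return count == 0 ? 0.0 : Double.parseDouble(a[0]) / count;
    }

    public static void main(String[] args) {
        // Each mapper-side value would be "balance\t1" under this layout.
        String first = merge(Arrays.asList("9000.0\t1", "10000.0\t1"));
        String all = merge(Arrays.asList(first, "8500.0\t1"));
        System.out.println(all + " -> " + average(all));
    }
}
```

Because merge returns the same layout it reads, feeding its output back into it gives the same totals as merging the raw values in one pass.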