Splitting the tokens with Java StringTokenizer
I have a data set that looks like this:
drawdate lotterynumbers meganumber multiplier
2005-01-04 03 06 07 12 32 30 NULL
2005-01-07 02 08 14 15 51 38 NULL
etc.
and the following code:
public class LotteryCount {

    /**
     * Mapper which extracts the lottery number and passes it to the Reducer with a single occurrence
     */
    public static class LotteryMapper extends Mapper<Object, Text, IntWritable, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private IntWritable lotteryKey;

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString(), ",");
            while (itr.hasMoreTokens()) {
                lotteryKey.set(Integer.valueOf(itr.nextToken()));
                context.write(lotteryKey, one);
            }
        }
    }

    /**
     * Reducer to sum up the occurrence
     */
    public static class LotteryReducer
            extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
        IntWritable result = new IntWritable();

        public void reduce(IntWritable key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}
It is actually the word count example from the official Apache Hadoop documentation, just slightly customized for my data set.
I get the following error:
Caused by: java.lang.NumberFormatException: For input string: "2005-01-04"
I am just interested in counting the occurrences of each individual drawn lottery number. How can I do this using the StringTokenizer from my code? I know that I have to split the whole row, because the tokenizer is "fed" the whole row. How can I take the lotterynumbers, split them and then count?
Thank you in advance.
First problem - you'll need to remove the header of your file before passing it to MapReduce.
Second - you have no commas in your shown dataset, so "," should not be given to StringTokenizer. Try "\t" instead.
Next - not all your tokens are integers, so blindly calling Integer.valueOf(itr.nextToken()) will not work. The first column is a date. You can call itr.nextToken() before the loop to discard the date, but then you need to handle the NULL at the end.
Ultimately, the mapper doesn't need to parse anything. You can also count strings in the reducer.
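To make the last two points concrete, here is a minimal sketch in plain Java with no Hadoop dependency. It discards the date, skips the NULL marker, and counts the remaining tokens as strings without ever calling Integer.valueOf. The class StringTokenCount and the method countTokens are hypothetical names made up for this illustration, and the sketch assumes the header row has already been removed. Note that it also counts the meganumber column and any multiplier value other than NULL, because it does not distinguish between columns.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical stand-in for the map and reduce steps, for illustration only.
final class StringTokenCount {

    // Counts every token after the date as a string; assumes no header row.
    static Map<String, Integer> countTokens(Iterable<String> rows) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String row : rows) {
            // Without a delimiter argument, StringTokenizer splits on
            // space, tab, newline, carriage return and form feed.
            StringTokenizer itr = new StringTokenizer(row);
            if (!itr.hasMoreTokens()) {
                continue; // blank line
            }
            itr.nextToken(); // discard the first column (the draw date)
            while (itr.hasMoreTokens()) {
                String token = itr.nextToken();
                if (!"NULL".equals(token)) {
                    // No parsing needed: the string itself is the key
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countTokens(Arrays.asList(
                "2005-01-04\t03 06 07 12 32\t30\tNULL",
                "2005-01-07\t02 08 14 15 51\t38\tNULL")));
    }
}
```

In the actual job, the equivalent change would presumably be to use Text instead of IntWritable as the key type of the mapper output and the reducer, so that the token can be written as-is.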
The data sample you posted is tab-delimited:
drawdate lotterynumbers meganumber multiplier
2005-01-04 03 06 07 12 32 30 NULL
2005-01-07 02 08 14 15 51 38 NULL
Here's a simple example, and a few notes:
- It uses the first row of your sample data as line, including the tab characters separating the data fields, just like you posted.
- It uses a StringTokenizer with the token separator defined as a single tab character ( \t ).
- The program calls hasMoreTokens() until all tokens are seen, printing each one along the way.
- Each token is printed between [] characters so that whitespace is visible. For example, "30" has a trailing space character that would go unnoticed without the [] characters, same with the leading whitespace in front of "NULL".

// Tabs are written as \t so the field boundaries stay visible
String line = "2005-01-04 \t03 06 07 12 32 \t30 \t NULL";
StringTokenizer tokenizer = new StringTokenizer(line, "\t");
while (tokenizer.hasMoreTokens()) {
    String token = tokenizer.nextToken();
    System.out.println("token: [" + token + "]");
}
Here's the output:
token: [2005-01-04 ]
token: [03 06 07 12 32 ]
token: [30 ]
token: [ NULL]
You could take this approach: process all lines, tokenize on the tab character, and use the 2nd token as your "lotterynumbers" data to do what you like.
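As a rough sketch of that last step, the 2nd tab-separated token can be fed to a second StringTokenizer that splits on whitespace, and each number tallied in a map. The names SecondFieldCount and countRow are invented for this illustration. The sketch assumes the header row has been removed and that no field is empty (StringTokenizer never returns empty tokens, so an empty field would shift the columns).

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical helper class, for illustration only.
final class SecondFieldCount {

    // Adds the lottery numbers of one data row to the running counts.
    static void countRow(String row, Map<Integer, Integer> counts) {
        StringTokenizer fields = new StringTokenizer(row, "\t");
        if (fields.countTokens() < 2) {
            return; // not a complete data row
        }
        fields.nextToken(); // 1st token: drawdate, not needed
        // 2nd token: "lotterynumbers"; the default delimiters also absorb
        // the stray spaces around the numbers
        StringTokenizer numbers = new StringTokenizer(fields.nextToken());
        while (numbers.hasMoreTokens()) {
            counts.merge(Integer.valueOf(numbers.nextToken()), 1, Integer::sum);
        }
    }

    public static void main(String[] args) {
        Map<Integer, Integer> counts = new TreeMap<>();
        countRow("2005-01-04 \t03 06 07 12 32 \t30 \t NULL", counts);
        countRow("2005-01-07 \t02 08 14 15 51 \t38 \t NULL", counts);
        System.out.println(counts);
    }
}
```

Because only the 2nd field is read, the date, the meganumber and the NULL multiplier never reach Integer.valueOf, which avoids the NumberFormatException from the question.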