Splitting tokens using Java StringTokenizer

Question

I have a data set that looks like this:

drawdate    lotterynumbers  meganumber  multiplier
2005-01-04  03 06 07 12 32  30            NULL
2005-01-07  02 08 14 15 51  38            NULL
etc.

and the following code:

public class LotteryCount {

    /**
     * Mapper which extracts the lottery number and passes it to the Reducer with a single occurrence
     */
    public static class LotteryMapper extends Mapper<Object, Text, IntWritable, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private IntWritable lotteryKey;

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {

            StringTokenizer itr = new StringTokenizer(value.toString(), ",");
            while (itr.hasMoreTokens()) {
                lotteryKey.set(Integer.valueOf(itr.nextToken()));
                context.write(lotteryKey, one);
            }
        }
    }

    /**
     * Reducer to sum up the occurrence
     */
    public static class LotteryReducer
            extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
        IntWritable result = new IntWritable();

        public void reduce(IntWritable key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;

            for (IntWritable val : values) {
                sum += val.get();
            }

            result.set(sum);
            context.write(key, result);
        }
    }
}

It is actually the word count example from the official Apache Hadoop documentation, just slightly customized for my data set.

I get the following error:

Caused by: java.lang.NumberFormatException: For input string: "2005-01-04"

I am just interested in counting the occurrences of each individual drawn lottery number. How can I do this using the StringTokenizer from my code? I know that I have to split the whole row, because the tokenizer is "fed" the whole row. How can I take the lotterynumbers field, split it, and then count?

Thank you in advance.

Answer 1

First problem - you'll need to remove the header line from your file before passing it to MapReduce.

Second - there are no commas in the dataset you showed, so "," should not be given to StringTokenizer. Try "\t" instead.

Next - not all of your tokens are integers, so blindly calling Integer.valueOf(itr.nextToken()) will not work. The first column is a date. You can call itr.nextToken() before the loop to discard the date, but then you still need to handle the NULL at the end.

Ultimately, the mapper doesn't need to parse anything. You can also count strings in the reducer.
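The points above can be sketched in plain Java, outside Hadoop. This is only an illustration under assumptions: the class LotteryKeys and the helper keysFor are hypothetical names, the fields are assumed to be tab-separated, and the header is assumed to start with "drawdate". The numbers are kept as strings, so nothing is passed to Integer.valueOf; in a Hadoop job that would roughly correspond to using Text rather than IntWritable as the mapper's output key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class LotteryKeys {

    // Hypothetical helper: returns the drawn numbers of one row as strings.
    // Assumes tab-separated fields with the date in the first column.
    static List<String> keysFor(String row) {
        List<String> keys = new ArrayList<>();
        StringTokenizer fields = new StringTokenizer(row, "\t");
        if (!fields.hasMoreTokens()) {
            return keys; // blank line
        }
        String first = fields.nextToken(); // date column, discarded
        if (first.trim().equals("drawdate") || !fields.hasMoreTokens()) {
            return keys; // header line or row without a numbers field
        }
        // Default delimiters are whitespace, which splits "03 06 07 12 32".
        StringTokenizer numbers = new StringTokenizer(fields.nextToken());
        while (numbers.hasMoreTokens()) {
            keys.add(numbers.nextToken()); // kept as a string, no parsing
        }
        // meganumber and the trailing NULL are never read.
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(keysFor("2005-01-04\t03 06 07 12 32\t30\tNULL"));
    }
}
```

Because the meganumber and multiplier columns are never read, the trailing NULL needs no special handling in this variant.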

Answer 2

I am just interested in counting the occurrences for each individual drawn lottery number. How can I do this by using the StringTokenizer from my code? I know that I have to split the whole row because the tokenizer is "fed" with the whole. How can I take the lotterynumbers, split them and then count?

The data sample you posted is tab-delimited:

drawdate    lotterynumbers  meganumber  multiplier
2005-01-04  03 06 07 12 32  30            NULL
2005-01-07  02 08 14 15 51  38            NULL

Here's a simple example, and a few notes:

This uses the first line of your sample data as line, including the tab characters separating the data fields, just like you posted.
It uses a StringTokenizer with the token separator defined as a single tab character ( \t ).
The program calls hasMoreTokens() until all tokens are seen, printing each one along the way.
The output includes left and right brackets to show the boundary of each token. For example, the "30" has a trailing space character that wouldn't be noticeable without the [] characters; the same goes for the leading whitespace in front of "NULL".

// The tab characters are written as \t so they survive copy and paste.
String line = "2005-01-04 \t03 06 07 12 32 \t30 \t          NULL";
StringTokenizer tokenizer = new StringTokenizer(line, "\t");

while (tokenizer.hasMoreTokens()) {
    String token = tokenizer.nextToken();
    System.out.println("token: [" + token + "]");
}

Here's the output:

token: [2005-01-04 ]
token: [03 06 07 12 32 ]
token: [30 ]
token: [          NULL]

You could take this approach: process all lines, tokenize each on the tab character, and use the 2nd token as your "lotterynumbers" data to do what you like.
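As a rough sketch of that approach, the following plain-Java program tallies the drawn numbers across several lines. It makes a few assumptions: the class LotteryTally and the method tally are hypothetical names, the header line has already been removed, and a TreeMap stands in for the counting that the reducer would do in a real Hadoop job.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class LotteryTally {

    // Hypothetical helper: counts how often each drawn number occurs.
    // Assumes tab-separated rows with the header line already removed.
    static Map<Integer, Integer> tally(String[] lines) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer fields = new StringTokenizer(line, "\t");
            if (fields.countTokens() < 2) {
                continue; // skip blank or malformed lines
            }
            fields.nextToken();                         // 1st token: date
            String lotteryNumbers = fields.nextToken(); // 2nd token: drawn numbers
            // Default delimiters are whitespace, so stray leading or
            // trailing spaces around the field are ignored.
            StringTokenizer numbers = new StringTokenizer(lotteryNumbers);
            while (numbers.hasMoreTokens()) {
                int number = Integer.parseInt(numbers.nextToken());
                counts.merge(number, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] lines = {
            "2005-01-04 \t03 06 07 12 32 \t30 \tNULL",
            "2005-01-07 \t02 08 14 15 51 \t38 \tNULL"
        };
        tally(lines).forEach((number, count) ->
            System.out.println(number + " -> " + count));
    }
}
```

In the mapper from the question, the inner loop would call context.write(lotteryKey, one) for each number instead of updating a map, and lotteryKey would need to be initialized with new IntWritable() before set is called on it.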
