
Splitting the tokens with Java StringTokenizer

I have a data set that looks like this:

drawdate    lotterynumbers  meganumber  multiplier
2005-01-04  03 06 07 12 32  30            NULL
2005-01-07  02 08 14 15 51  38            NULL
etc.

and the following code:

public class LotteryCount {

    /**
     * Mapper which extracts the lottery number and passes it to the Reducer with a single occurrence
     */
    public static class LotteryMapper extends Mapper<Object, Text, IntWritable, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private IntWritable lotteryKey;

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {

            StringTokenizer itr = new StringTokenizer(value.toString(), ",");
            while (itr.hasMoreTokens()) {
                lotteryKey.set(Integer.valueOf(itr.nextToken()));
                context.write(lotteryKey, one);
            }
        }
    }

    /**
     * Reducer to sum up the occurrence
     */
    public static class LotteryReducer
            extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
        IntWritable result = new IntWritable();

        public void reduce(IntWritable key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;

            for (IntWritable val : values) {
                sum += val.get();
            }

            result.set(sum);
            context.write(key, result);
        }
    }
}

It is essentially the word count example from the official Apache Hadoop documentation, slightly customized for my data set.

I get the following error:

Caused by: java.lang.NumberFormatException: For input string: "2005-01-04"

I am just interested in counting the occurrences of each individual drawn lottery number. How can I do this using the StringTokenizer from my code? I know that I have to split the whole row, because the tokenizer is "fed" the whole row. How can I take the lotterynumbers, split them, and then count?

Thank you in advance

First problem - you'll need to remove the header line from your file before passing it to MapReduce.

Second - there are no commas in the dataset you showed, so "," should not be given to StringTokenizer. Try "\t" instead.

Next - not all of your tokens are integers, so blindly calling Integer.valueOf(itr.nextToken()) will not work. The first column is a date. You can call itr.nextToken() before the loop to discard the date, but then you still need to handle the NULL at the end. Note also that with "\t" as the delimiter, the whole lotterynumbers field ("03 06 07 12 32") arrives as a single token and must itself be split on spaces.
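As a minimal sketch of that idea in plain Java (no Hadoop classes, so it can be run on its own): the hypothetical helper drawnNumbers below discards the date with nextToken(), reads only the second tab-separated field, and splits that field on spaces. The class and method names are illustrative and not part of the original code, and the header row is assumed to have been removed already.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class LotteryRowParser {

    /**
     * Returns the drawn numbers from one tab-delimited data row.
     * Assumption: the header row has already been removed.
     */
    static List<Integer> drawnNumbers(String row) {
        List<Integer> numbers = new ArrayList<>();
        StringTokenizer fields = new StringTokenizer(row, "\t");
        if (fields.countTokens() < 2) {
            return numbers; // malformed row: nothing to extract
        }
        fields.nextToken(); // discard the drawdate column

        // The second column holds the space-separated lottery numbers.
        StringTokenizer draws = new StringTokenizer(fields.nextToken(), " ");
        while (draws.hasMoreTokens()) {
            numbers.add(Integer.valueOf(draws.nextToken()));
        }
        // meganumber and multiplier (possibly NULL) are never parsed.
        return numbers;
    }

    public static void main(String[] args) {
        String row = "2005-01-04\t03 06 07 12 32\t30\tNULL";
        System.out.println(drawnNumbers(row)); // prints [3, 6, 7, 12, 32]
    }
}
```

Because the loop stops after the second field, the NULL in the last column never reaches Integer.valueOf, so no special handling is needed for it in this variant.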

Ultimately, the mapper doesn't need to parse anything into integers: it can emit each token as a string, and the reducer can count strings just as well.
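To illustrate counting strings rather than integers, here is a small stand-alone sketch. It only simulates the map and reduce steps with an in-memory map; it is not Hadoop code, and the class and method names are made up for this example. In the actual job the equivalent change would presumably be to emit Text keys instead of IntWritable keys, as the original word count example does. The input repeats the first sample row once so that one count rises above 1.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

class StringKeyCount {

    /** Counts each whitespace-separated token, keeping the tokens as strings. */
    static Map<String, Integer> countTokens(Iterable<String> numberFields) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String field : numberFields) {
            // No delimiter argument: StringTokenizer splits on whitespace by default.
            StringTokenizer itr = new StringTokenizer(field);
            while (itr.hasMoreTokens()) {
                // "Map" step emits (token, 1); "reduce" step sums per key.
                counts.merge(itr.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countTokens(Arrays.asList(
                "03 06 07 12 32",
                "02 08 14 15 51",
                "03 06 07 12 32")); // first sample row repeated for illustration
        System.out.println(counts.get("03")); // prints 2
        System.out.println(counts.get("08")); // prints 1
    }
}
```

One consequence of string keys is that "03" and "3" would be counted separately, which is harmless here because the sample data always uses two digits.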

I am just interested in counting the occurrences of each individual drawn lottery number. How can I do this using the StringTokenizer from my code? I know that I have to split the whole row, because the tokenizer is "fed" the whole row. How can I take the lotterynumbers, split them, and then count?

The data sample you posted is tab-delimited:

drawdate    lotterynumbers  meganumber  multiplier
2005-01-04  03 06 07 12 32  30            NULL
2005-01-07  02 08 14 15 51  38            NULL

Here's a simple example, and a few notes:

  • This uses the first line of your sample data as line, including the tab characters separating the data fields, just like you posted.
  • It uses a StringTokenizer with the token separator defined as a single tab character (\t).
  • The program calls hasMoreTokens() until all tokens are seen, printing each one along the way.
  • The output includes left and right brackets to show the boundary of each token. For example, the "30" has a trailing space character that wouldn't be noticeable without the [] characters; the same goes for the leading whitespace in front of "NULL".

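// Tabs are written as \t escapes so the field separators survive copy and paste.
String line = "2005-01-04 \t03 06 07 12 32 \t30 \t          NULL";
StringTokenizer tokenizer = new StringTokenizer(line, "\t"); // java.util.StringTokenizer

while (tokenizer.hasMoreTokens()) {
    String token = tokenizer.nextToken();
    System.out.println("token: [" + token + "]");
}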

Here's the output:

token: [2005-01-04 ]
token: [03 06 07 12 32 ]
token: [30 ]
token: [          NULL]

You could take this approach: process all lines, tokenize each on the tab character, and use the 2nd token as your "lotterynumbers" data to do what you like.
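A rough sketch of that loop over all lines might look like the following. The class and method names are hypothetical, the header is recognized simply by its leading "drawdate" text (an assumption based on the posted sample), and each 2nd token is trimmed to drop the stray spaces seen in the output above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

class LotteryFieldExtractor {

    /** Returns the trimmed second tab-separated field of every data line. */
    static List<String> lotteryNumberFields(List<String> lines) {
        List<String> fields = new ArrayList<>();
        for (String line : lines) {
            if (line.startsWith("drawdate")) {
                continue; // skip the header row
            }
            StringTokenizer tokenizer = new StringTokenizer(line, "\t");
            if (tokenizer.countTokens() < 2) {
                continue; // skip blank or malformed lines
            }
            tokenizer.nextToken();                    // 1st token: draw date
            fields.add(tokenizer.nextToken().trim()); // 2nd token: lottery numbers
        }
        return fields;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "drawdate\tlotterynumbers\tmeganumber\tmultiplier",
                "2005-01-04 \t03 06 07 12 32 \t30 \t          NULL",
                "2005-01-07 \t02 08 14 15 51 \t38 \t          NULL");
        System.out.println(lotteryNumberFields(lines));
        // prints [03 06 07 12 32, 02 08 14 15 51]
    }
}
```

One caveat: StringTokenizer treats consecutive delimiters as one, so a row with an empty column would shift the token positions. If empty fields can occur, String.split("\t") keeps interior empty fields and may be the safer choice.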
