Count the occurrence of any number of characters in a file?

I have found several ways to count the occurrences of a single character in a file in Java. My question is simply this: is there any way to count the occurrences of any of the characters in a list in a file simultaneously, or am I going to have to loop through each character?

To clarify, I want something equivalent to: for each character in the file, if the character is in the list "abcdefg", increment a counter by 1.

Background: I'm counting predicates in a file, and the best method I could think of was to search for occurrences of <, >, ==, etc.

Use a Map<Character, Integer> and go through the file. For every character, test whether it is already in the map. If it is not, add it with value 1; otherwise get the current value, increment it and put it back. Test both TreeMap and HashMap to see which works best for you. Now you have a complete histogram and you can easily add up the counts you are interested in.
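A minimal sketch of that histogram idea, assuming the file is read through a java.io.Reader. The class name CharHistogram and the helper sumOf are made up for illustration, and Map.merge needs Java 8 or later:

```java
import java.io.IOException;
import java.io.Reader;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class, not part of any library.
class CharHistogram {
    // Builds a histogram of every character produced by the reader.
    static Map<Character, Integer> histogram(Reader in) throws IOException {
        Map<Character, Integer> counts = new TreeMap<>();
        int c;
        while ((c = in.read()) != -1) {
            // merge() stores 1 for a new key, otherwise adds 1 to the stored value.
            counts.merge((char) c, 1, Integer::sum);
        }
        return counts;
    }

    // Adds up the counts of the characters of interest, e.g. "abcdefg".
    static int sumOf(Map<Character, Integer> counts, String interesting) {
        int total = 0;
        for (char ch : interesting.toCharArray()) {
            total += counts.getOrDefault(ch, 0);
        }
        return total;
    }
}
```

For a real file you could pass something like Files.newBufferedReader(path) as the reader. Swapping TreeMap for HashMap only changes the iteration order and the lookup cost, not the counts.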

Update: I saw that you are interested in finding sequences. If you want to do that with good performance I would use a tool like lex, but for Java. A quick Google search led me to this one: http://www.cs.princeton.edu/~appel/modern/java/JLex/ It should be straightforward to define the tokens you are interested in, and then it should be very simple to count them.

Update 2: I couldn't resist playing with it. Here is a sample that seems to work using the above-mentioned tool (disclaimer: I haven't used the tool before, so this could be completely wrong...):

import java.lang.System;
import java.util.Map;
import java.util.TreeMap;

class Sample {
  public static void main(String argv[]) throws java.io.IOException {
    Map<String,Integer> map = new TreeMap<>();

    Yylex yy = new Yylex(System.in);
    Yytoken t;
    while ((t = yy.yylex()) != null) {
      String text = t.mText;

      if (!text.isEmpty()) {
        // Store 1 for a new token, otherwise add 1 to its current count.
        map.merge(text, 1, Integer::sum);
      }
    } 

    System.out.println(map);
  }
}

class Yytoken {
  public String mText;

  Yytoken(String text) {
   mText = text;
  }

  public String toString() {
    return "Token: " + mText;
  }
}

%%

OTHER=(.|[\r\n])

%% 

<YYINITIAL> "," { return (new Yytoken(yytext())); }
<YYINITIAL> ":" { return (new Yytoken(yytext())); }
<YYINITIAL> ";" { return (new Yytoken(yytext())); }
<YYINITIAL> "(" { return (new Yytoken(yytext())); }
<YYINITIAL> ")" { return (new Yytoken(yytext())); }
<YYINITIAL> "[" { return (new Yytoken(yytext())); }
<YYINITIAL> "]" { return (new Yytoken(yytext())); }
<YYINITIAL> "{" { return (new Yytoken(yytext())); }
<YYINITIAL> "}" { return (new Yytoken(yytext())); }
<YYINITIAL> "." { return (new Yytoken(yytext())); }
<YYINITIAL> "+" { return (new Yytoken(yytext())); }
<YYINITIAL> "-" { return (new Yytoken(yytext())); }
<YYINITIAL> "*" { return (new Yytoken(yytext())); }
<YYINITIAL> "/" { return (new Yytoken(yytext())); }
<YYINITIAL> "=" { return (new Yytoken(yytext())); }
<YYINITIAL> "<>" { return (new Yytoken(yytext())); }
<YYINITIAL> "<"  { return (new Yytoken(yytext())); }
<YYINITIAL> "<=" { return (new Yytoken(yytext())); }
<YYINITIAL> ">"  { return (new Yytoken(yytext())); }
<YYINITIAL> ">=" { return (new Yytoken(yytext())); }
<YYINITIAL> "&"  { return (new Yytoken(yytext())); }
<YYINITIAL> "|"  { return (new Yytoken(yytext())); }
<YYINITIAL> ":=" { return (new Yytoken(yytext())); }
<YYINITIAL> "#" { return (new Yytoken(yytext())); }
<YYINITIAL> {OTHER} { return (new Yytoken("")); }

  • Reading

Since you want to count predicates that are more than one character long (==, !=, <=, >=), you would need a PushbackReader so that you can peek at the next character to determine the actual predicate.
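A rough sketch of the peek-ahead idea with java.io.PushbackReader. The class name PredicateScanner and the exact operator set (<, >, <=, >=, ==, !=) are assumptions made for the example; a lone = or ! is deliberately not counted:

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class, not part of any library.
class PredicateScanner {
    // Counts <, >, <=, >=, == and != in one pass, peeking one character ahead.
    static Map<String, Integer> count(Reader source) throws IOException {
        Map<String, Integer> counts = new TreeMap<>();
        try (PushbackReader in = new PushbackReader(source)) {
            int c;
            while ((c = in.read()) != -1) {
                if (c != '<' && c != '>' && c != '=' && c != '!') {
                    continue; // cannot start any predicate we track
                }
                int next = in.read(); // peek at the following character
                if (next == '=') {
                    // Two-character predicate: <=, >=, == or !=
                    counts.merge("" + (char) c + '=', 1, Integer::sum);
                } else {
                    if (next != -1) {
                        in.unread(next); // not part of this operator, push it back
                    }
                    if (c == '<' || c == '>') {
                        counts.merge(String.valueOf((char) c), 1, Integer::sum);
                    }
                    // A lone '=' or '!' is assignment or negation and is skipped.
                }
            }
        }
        return counts;
    }
}
```

The default PushbackReader holds one pushed-back character, which is enough here because no operator in the assumed set is longer than two characters.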

  • Frequency of occurrence

If you can afford an additional dependency, my suggestion is to use a Multiset, which is meant for counting frequencies. If you can't, you can use a Map or an array-based counter (I prefer this if your predicate set is finite, as it simplifies the code).

  • Parallelize?

The above approach is simpler because you get the frequencies in a single pass. If your file is huge, or you have to count frequencies across many files, you can parallelize the work using Java Executors.
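One way the many-files case might look with an ExecutorService, as a sketch only. The class ParallelCounter and its method names are invented for the example, it counts single characters rather than multi-character predicates, and it assumes the files are UTF-8 text:

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper class, not part of any library.
class ParallelCounter {
    // Counts the characters of one file that appear in 'interesting'.
    static long countInFile(Path file, String interesting) throws IOException {
        long count = 0;
        try (Reader in = Files.newBufferedReader(file)) {
            int c;
            while ((c = in.read()) != -1) {
                if (interesting.indexOf(c) >= 0) {
                    count++;
                }
            }
        }
        return count;
    }

    // Submits one task per file and adds up the per-file results.
    static long countInFiles(List<Path> files, String interesting)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (Path file : files) {
                Callable<Long> task = () -> countInFile(file, interesting);
                results.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> result : results) {
                total += result.get(); // blocks until that file has been counted
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```

Each file is still read in a single pass; only the work across files is spread over threads, so this is unlikely to help for one small file.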

Storing

If I understand correctly, you would like to find the number of occurrences not only of single characters, but also of short sequences of characters (i.e. Strings), such as ==. In that case, a Map<Character, Integer> is insufficient; you need a Map<String, Integer> to store a count for each string.

Alternatively, you can use Guava's Multiset, which is basically a nice interface for a collection that knows how many times it contains duplicate (equal) elements.

I assume that the set of predicates/operators/whatever short strings you want to count is fixed, so you can define an array or a list that stores all of the predicates you are interested in, such as:

List<String> operators = Arrays.asList("==", "<=", ">=", "<", ">");

Then you would "pour" all those operators into the map as keys and initialize their values to zero:

Map<String, Integer> counts = new HashMap<>();
for (String operator : operators)
    counts.put(operator, 0);

Parsing

As for the parsing, you can easily read the file line by line using a Scanner. For each line, you can use a method like this to count the number of times it contains a given substring:

static int occurrences(String source, String subString) {
    int count = 0;
    int index = source.indexOf(subString);

    while (index != -1) {
        count++;
        index = source.indexOf(subString, index + 1);
    }
    return count;
}

And then use this method in a fashion similar to this:

Scanner scanner = new Scanner(new File("input.txt"));
while (scanner.hasNextLine()) {
    String line = scanner.nextLine();
    for (String operator : operators) {
        int oldOccurrences = counts.get(operator);
        counts.put(operator, oldOccurrences + occurrences(line, operator));
    }
}
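One thing to watch with plain substring counting: searching for "<" also finds the "<" inside "<=", so the single-character operators get over-counted. A possible variation (a swapped-in technique, not the indexOf approach above) is a regular expression that tries the longer operators first. The class name OperatorCounter is made up for this sketch:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper class, not part of any library.
class OperatorCounter {
    static Map<String, Integer> count(String line, List<String> operators) {
        // Put longer operators first so "<=" is matched before "<".
        List<String> byLength = new ArrayList<>(operators);
        byLength.sort((a, b) -> Integer.compare(b.length(), a.length()));
        Pattern pattern = Pattern.compile(
                byLength.stream().map(Pattern::quote).collect(Collectors.joining("|")));

        // Start every operator at zero, as in the map initialization above.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String operator : operators) {
            counts.put(operator, 0);
        }

        // Each match consumes its characters, so nothing is counted twice.
        Matcher matcher = pattern.matcher(line);
        while (matcher.find()) {
            counts.merge(matcher.group(), 1, Integer::sum);
        }
        return counts;
    }
}
```

Calling count(line, operators) per line and adding the results into the overall map would slot into the Scanner loop in place of the inner for loop.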

I believe that the Java List interface has a contains() method, so you can do something like

if (someList.contains('<'))
{
    x++;
}

It doesn't actually check for them all at once, but that is hidden from you anyway.

http://docs.oracle.com/javase/1.4.2/docs/api/java/util/List.html

To "count the occurrence of any of the characters in a list in a file simultaneously":

  • You can use a HashTable where the keys are the characters, and the values are the number of times you've seen that character.
  • Each time you read a character, check to see if it's in the HashTable:
    • If so, increment its value by 1
    • If not, add the key/value pair to the HashTable with the value initialized to 1

If the set of characters you care about is small (such as the "abcdefg" or "<, >, ==" in your example), a switch statement will suffice instead of using a HashTable to solve the problem.
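As a small illustration of the switch variant, here is a sketch for the "abcdefg" case. The class name SwitchCounter is invented, and the input is taken as a CharSequence to keep the example short (the same switch would sit inside a loop that reads the file):

```java
// Hypothetical helper class, not part of any library.
class SwitchCounter {
    // Counts how many characters of 'text' are one of a..g.
    static int count(CharSequence text) {
        int total = 0;
        for (int i = 0; i < text.length(); i++) {
            switch (text.charAt(i)) {
                case 'a': case 'b': case 'c': case 'd':
                case 'e': case 'f': case 'g':
                    total++;
                    break;
                default:
                    break; // not a character we care about
            }
        }
        return total;
    }
}
```

A switch on char only covers single characters, so a two-character predicate such as == would still need a look at the following character.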

A trivial way to do it is with an array:

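final int[] occurs = new int[65536]; // one counter per possible char value
// 'file' stands for the file's contents, assumed here to be a char[]
for (char c : file) occurs[c]++;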

If you know you won't encounter overly exotic chars, you can reduce the size of the array.
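For instance, if only ASCII matters, a 128-element array is enough. The sketch below reads from a java.io.Reader and simply ignores anything outside that range; the class name AsciiCounter and the choice of 128 are assumptions for the example:

```java
import java.io.IOException;
import java.io.Reader;

// Hypothetical helper class, not part of any library.
class AsciiCounter {
    // Returns counts indexed by character code for ASCII characters (0..127).
    static int[] count(Reader in) throws IOException {
        int[] occurs = new int[128];
        int c;
        while ((c = in.read()) != -1) {
            if (c < occurs.length) {
                occurs[c]++; // characters outside ASCII are skipped
            }
        }
        return occurs;
    }
}
```

The count for a given character is then just occurs['<'], and the total for a list of characters is the sum of the corresponding entries.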
