Java: Match tokens between two strings and return the number of matched tokens

I need some help finding the number of matched tokens between two strings. I have a list of strings stored in an ArrayList (example given below):

Line 0 : WRB VBD NN VB IN CC RB VBP NNP  
Line 1 : WDT NNS VBD DT NN NNP NNP  
Line 2 : WRB MD PRP VB DT NN IN NNS POS JJ NNS  
Line 3 : WDT NN VBZ DT NN IN DT JJ NN IN DT NNP  
Line 4 : WP VBZ DT JJ NN IN  NN  

Here, you can see that each string consists of a bunch of tokens separated by spaces. So, there are three things I need to do:

  1. Compare the first token (WRB) in Line 0 to the tokens in Line 1 to see if they match. Move on to the next tokens in Line 0 until a match is found. If there's a match, mark the matched token in Line 1 so that it will not be matched again.
  2. Return the number of matched tokens between Line 0 and Line 1.
  3. Return the distance of the matched tokens. Example: token NN is found at position 3 on Line 0 and position 5 on Line 1. Distance = |3-5| = 2

I've tried splitting the string and storing the result in a String[], but a String[] has a fixed size and doesn't allow removing or adding elements. I tried Pattern and Matcher, but with disastrous results. I tried a few other methods, but there are some problems with my nested for loops (I will post part of my code if it helps).

Any advice or pointers on how to solve this problem would be very much appreciated. Thank you very much.

Think about the task in a different way. You want to scan for tokens (thus: Scanner), and you want to match the tokens (thus: a List, because you need order). Then you'd iterate through the collections for each token, noting the matches and the distances.
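As a rough illustration of the "Scanner plus List" idea, here is a minimal sketch that reads one line's tokens into a resizable list. The class name Tokenizer and the method tokenize are made up for this example and are not from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class Tokenizer {
    // Hypothetical helper: reads whitespace-separated tokens into a resizable list,
    // keeping them in their original order.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        try (Scanner scanner = new Scanner(line)) {
            // Scanner's default delimiter is whitespace, so repeated spaces are skipped.
            while (scanner.hasNext()) {
                tokens.add(scanner.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("WDT NNS VBD DT NN NNP NNP");
        System.out.println(tokens.size() + " tokens: " + tokens);
    }
}
```

Unlike a String[], the returned list can grow or shrink, and a token's index in the list is its position in the line.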

Have you tried using Scanner?

If not, definitely do. It would look like this:

String line1 = ... // your line 1
String line2 = ... // your line 2
Scanner s1 = new Scanner(line1); 

int i1 = 0;
while (s1.hasNext()) {
    String token1 = s1.next();
    Scanner s2 = new Scanner(line2);

    int i2 = 0;
    while (s2.hasNext()) {
        String token2 = s2.next();

        // now you have token1, token2 and their positions (i1, i2)
        // do whatever you want with them

        i2++;
    } // end reading line2
    i1++;
} // end reading line1
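To show what "do whatever you want with them" might look like for the three requirements in the question, here is one possible sketch. It swaps the Scanner for String.split and uses a boolean[] to mark tokens in the second line that were already matched, so nothing has to be removed from an array. The names TokenMatcher, tokens, and matchDistances are assumptions made for this example, and positions are counted from 0 (the distance is the same as with 1-based positions).

```java
import java.util.ArrayList;
import java.util.List;

class TokenMatcher {
    // Hypothetical helper: splits on whitespace; a blank line yields no tokens.
    static String[] tokens(String line) {
        String trimmed = line.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Returns one distance |i - j| per matched pair, in the order the tokens
    // of line1 are matched. The size of the list is the number of matches.
    static List<Integer> matchDistances(String line1, String line2) {
        String[] a = tokens(line1);
        String[] b = tokens(line2);
        boolean[] used = new boolean[b.length]; // marks tokens of line2 already matched
        List<Integer> distances = new ArrayList<>();
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < b.length; j++) {
                if (!used[j] && a[i].equals(b[j])) {
                    used[j] = true;                 // never match this token again
                    distances.add(Math.abs(i - j)); // distance between the two positions
                    break;                          // move on to the next token of line1
                }
            }
        }
        return distances;
    }

    public static void main(String[] args) {
        List<Integer> d = matchDistances(
                "WRB VBD NN VB IN CC RB VBP NNP",
                "WDT NNS VBD DT NN NNP NNP");
        System.out.println("matches: " + d.size() + ", distances: " + d);
    }
}
```

For Line 0 and Line 1 of the question this matches VBD, NN, and NNP, giving distances 1, 2, and 3. Each token of the first line is paired with the first unmarked equal token in the second line; if you need a different pairing rule (for example, the closest match), the inner loop is the part to change.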

EDIT: Regarding your loops to select different lines in the ArrayList, what you need is to compare every element to every other element (which is probably the best phrase to search for if this explanation is lacking).

In Java that looks like this:

for (int i = 0; i < thearraylist.size()-1; i++) {
    for (int j = i+1; j < thearraylist.size(); j++) {

        // the elements at indices i and j are the pair to compare
        // to plug this into my code above:

        String line1 = thearraylist.get(i);
        String line2 = thearraylist.get(j);

        // ... and then compare them

    }
}

The reason the second loop starts from i+1 is to eliminate these unnecessary comparisons:

  1. Every element would be compared to itself whenever j = i, which is useless. In the loop above, j starts at i+1 and only increases, so it never equals i.
  2. Each pair of elements would be compared twice. For example, when i=0, j=1 you are comparing the first two elements. When i=1, j=0 you are also comparing the first two elements, which makes the second comparison redundant. To get rid of the second 'backwards' comparison, we insist that j always be higher than i.

If you find this confusing, I would highly recommend working it out on paper by listing the values of i and j as you move through the loop.
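If paper is not handy, the same exercise can be done by letting the program list the pairs. This is only a sketch; the class name PairLister and the method indexPairs are invented for the illustration.

```java
import java.util.ArrayList;
import java.util.List;

class PairLister {
    // Collects every pair of indices (i, j) with i < j for a list of the given size,
    // using the same loop bounds as the nested loops above.
    static List<int[]> indexPairs(int size) {
        List<int[]> pairs = new ArrayList<>();
        for (int i = 0; i < size - 1; i++) {
            for (int j = i + 1; j < size; j++) {
                pairs.add(new int[] {i, j});
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Five lines, as in the question's example.
        for (int[] p : indexPairs(5)) {
            System.out.println("compare line " + p[0] + " with line " + p[1]);
        }
    }
}
```

With five lines this lists 10 pairs, from (0, 1) to (3, 4); no index is ever paired with itself and no pair appears in both orders.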
