How can I extract specific terms from string lines in Java?

I have a serious problem extracting terms from each line of a file. To be more specific, I have a file that is nominally CSV but is not actually in CSV format: each line is read as a single string, so everything ends up in line[0].

Here are a few example lines out of thousands:

(split() doesn't work!)

test.csv

"31451  CID005319044      15939353      C8H14O3S2      beta-lipoic acid     C1C[S@](=O)S[C@@H]1CCCCC(=O)O "
"12232 COD05374044 23439353  C924O3S2    saponin   CCCC(=O)O "
"9048   CTD042032 23241  C3HO4O3S2 Berberine  [C@@H]1CCCCC(=O)O "

I want to extract only "beta-lipoic acid", "saponin" and "Berberine", each of which is in the 5th position. There are large gaps of whitespace between the terms, which is why I say 5th position.

In this case, how can I extract the term in the 5th position of each line?

One more thing: the amount of whitespace between the six terms is not constant. It could be one, two, three, four, or five characters, or something like that. Because the whitespace length varies, I cannot use the .split() function. For example, in the first line I would get "beta-lipoic" instead of "beta-lipoic acid".

Here is a solution to your problem using String.split and indexOf:

import java.util.ArrayList;
import java.util.List;

public class StringSplit {

    public static void main(String[] args) {
        List<String> strList = new ArrayList<>();
        strList.add("31451  CID005319044      15939353      C8H14O3S2    beta-lipoic acid   C1C[S@](=O)S[C@@H]1CCCCC(=O)O ");
        strList.add("12232 COD05374044 23439353   C924O3S2   saponin       CCCC(=O)O ");
        strList.add("9048   CTD042032 23241 C3HO4O3S2  Berberine    [C@@H]1CCCCC(=O)O ");

        for (String item : strList) {
            // Split on runs of whitespace to identify the 4th token and the last token.
            String[] separatedStr = item.trim().split("\\s+");
            // Position just past the 4th token (assumes its text does not occur earlier in the line).
            int fourthStrEnd = item.indexOf(separatedStr[3]) + separatedStr[3].length();
            // Start of the last token; the term is everything between the two positions.
            int lastStrStart = item.lastIndexOf(separatedStr[separatedStr.length - 1]);
            System.out.println(item.substring(fourthStrEnd, lastStrStart).trim());
        }
    }
}

Output:

beta-lipoic acid

saponin

Berberine
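The same idea can also be sketched as a single regular expression. This is only an illustration under the same assumption as the answer above: four whitespace-free fields come before the term and exactly one whitespace-free field comes after it. The class name TermRegexDemo and the method extractTerm are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TermRegexDemo {
    // Four leading tokens, then the term (captured lazily), then one trailing token.
    // Assumption: only the term itself may contain whitespace.
    private static final Pattern LINE =
            Pattern.compile("\\s*(?:\\S+\\s+){4}(.+?)\\s+\\S+\\s*");

    /** Returns the fifth field, or null if the line does not have the assumed shape. */
    static String extractTerm(String line) {
        Matcher m = LINE.matcher(line);
        return m.matches() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String[] lines = {
            "31451  CID005319044      15939353      C8H14O3S2    beta-lipoic acid   C1C[S@](=O)S[C@@H]1CCCCC(=O)O ",
            "12232 COD05374044 23439353   C924O3S2   saponin       CCCC(=O)O ",
            "9048   CTD042032 23241 C3HO4O3S2  Berberine    [C@@H]1CCCCC(=O)O "
        };
        for (String line : lines) {
            System.out.println(extractTerm(line));
        }
    }
}
```

Because the whole line has to match, the lazy group grows until exactly one token is left after it, so multi-word terms are kept intact.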

Option 1: Use String.split and split only on two or more consecutive whitespace characters, as in the code below:

String[] s = str.split("\\s\\s+");
for (String string : s) {
    System.out.println(string);
}
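To see how Option 1 behaves on the sample data, here is a small sketch that applies it to the first two lines from the question. The class and method names are made up for illustration. In the first line every gap is at least two spaces wide, so the split yields six pieces and index 4 is the term. In the second line the first three fields are separated by single spaces, so they stay glued together and only four pieces come back.

```java
import java.util.Arrays;

class DoubleSpaceSplitDemo {
    /** Splits a line on runs of two or more whitespace characters. */
    static String[] splitOnWideGaps(String line) {
        return line.trim().split("\\s{2,}");
    }

    public static void main(String[] args) {
        String wide = "31451  CID005319044      15939353      C8H14O3S2      beta-lipoic acid     C1C[S@](=O)S[C@@H]1CCCCC(=O)O ";
        String narrow = "12232 COD05374044 23439353  C924O3S2    saponin   CCCC(=O)O ";
        // Six pieces; index 4 is "beta-lipoic acid".
        System.out.println(Arrays.toString(splitOnWideGaps(wide)));
        // Four pieces; the first one still contains three fields.
        System.out.println(Arrays.toString(splitOnWideGaps(narrow)));
    }
}
```

So this option is only reliable when single spaces never separate two different fields.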

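Option 2: Implement your own split logic by walking through all the characters. Sample code below (it is only meant to give the idea):

import java.util.ArrayList;
import java.util.List;

public static List<String> getData(String str) {
    List<String> list = new ArrayList<>();
    StringBuilder token = new StringBuilder();
    int spaces = 0; // length of the current run of spaces
    for (char c : str.toCharArray()) {
        if (c == ' ') {
            spaces++;
            continue;
        }
        if (spaces > 1 && token.length() > 0) {
            // Two or more spaces end the current token.
            list.add(token.toString());
            token.setLength(0);
        } else if (spaces == 1 && token.length() > 0) {
            // A single space stays inside the token.
            token.append(' ');
        }
        spaces = 0;
        token.append(c);
    }
    if (token.length() > 0) {
        list.add(token.toString()); // flush the last token
    }
    return list;
}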

This would be a relatively easy fix if it weren't for beta-lipoic acid...

Assuming that only spaces, tabs, or other whitespace separate the terms, you could split on whitespace.

Pattern whitespace = Pattern.compile("\\s+"); // requires java.util.regex.Pattern
String[] terms = whitespace.split(line.trim());
// Your desired term should be index 4 of the terms array

While this would work for the majority of your terms, it would also make you lose the "acid" in "beta-lipoic acid"...

Another hacky solution would be to check the 6th element of the array produced by the above code and see whether it consists only of English letters. If so, you can be reasonably confident that it is part of the same term as the 5th element, so you can concatenate the two. This falls apart pretty quickly, though, if you have terms with 3 or more words. So something like:

Pattern possibleEnglishWord = Pattern.compile("[a-zA-Z]+"); // Can add dashes and such as needed
if (terms.length > 5 && possibleEnglishWord.matcher(terms[5]).matches()) {
    // terms[5] looks like a plain word, so treat it as part of the same term
    String term = terms[4] + " " + terms[5];
}
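One way around the limit on the number of words, sketched here under an assumption the original data seems to satisfy but does not guarantee: exactly four fields precede the term and exactly one field follows it, and none of those five contains whitespace. Under that assumption, everything from index 4 up to (but not including) the last token belongs to the term. The names MiddleTokensDemo and extractTerm are hypothetical.

```java
import java.util.Arrays;

class MiddleTokensDemo {
    /**
     * Joins all tokens between the first four and the last one.
     * Returns null when the line has fewer than six tokens.
     */
    static String extractTerm(String line) {
        String[] terms = line.trim().split("\\s+");
        if (terms.length < 6) {
            return null; // not enough fields for the assumed layout
        }
        // Tokens 4 .. length-2 make up the term, whatever their number.
        return String.join(" ", Arrays.copyOfRange(terms, 4, terms.length - 1));
    }

    public static void main(String[] args) {
        System.out.println(extractTerm(
            "31451  CID005319044      15939353      C8H14O3S2      beta-lipoic acid     C1C[S@](=O)S[C@@H]1CCCCC(=O)O "));
    }
}
```

Note that this collapses any run of whitespace inside the term to a single space.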

Another thing you can try is to replace every run of whitespace with a single space, and then remove everything that isn't made up of just English letters/dashes:

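line = whitespace.matcher(line).replaceAll(" "); // collapse each run of whitespace to one space
// Matches any whitespace-delimited token containing a character other than a letter or dash
Pattern notEnglishWord = Pattern.compile("\\S*[^a-zA-Z\\s-]\\S*");
line = notEnglishWord.matcher(line).replaceAll("").trim();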

Then hopefully the only thing left would be the term you're looking for.

Hopefully this helps, though I admit it's rather convoluted. One issue is that non-term fields may be separated by only a single space, which would fool Option 1 as presented by Hirak... If that weren't the case, that option should work.

Oh, by the way, if you do end up doing this, put the Pattern declarations outside of any loops. They only need to be created once.
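A minimal sketch of that last point, with made-up names (PrecompiledPatternDemo, fifthTokens): the Pattern is compiled once as a constant and reused for every line. It uses the plain whitespace split from above, so a two-word term such as "beta-lipoic acid" would still come back as "beta-lipoic".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class PrecompiledPatternDemo {
    // Compiled once and reused for every line.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Returns the token at index 4 of each line that has at least five tokens. */
    static List<String> fifthTokens(List<String> lines) {
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            String[] terms = WHITESPACE.split(line.trim());
            if (terms.length > 4) {
                result.add(terms[4]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "12232 COD05374044 23439353   C924O3S2   saponin       CCCC(=O)O ",
            "9048   CTD042032 23241 C3HO4O3S2  Berberine    [C@@H]1CCCCC(=O)O ");
        System.out.println(fifthTokens(lines));
    }
}
```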
