
Java Searching String Contents for partial match

I'm working on a project where I need to search a paragraph of text for a particular string. However, I don't need an exact match, more of a % (partial) match.

For example, here is the paragraph of text I'm searching:

Fluticasone Propionate Nasal Spray, USP 50 mcg per spray is a 
corticosteroid indicated for the management of the nasal symptoms of 
perennial nonallergic rhinitis in adult and pediatric patients aged 4 years 
and older.

And then I'm searching to see if any words in the following lines match the paragraph:

1)Unspecified acute lower respiratory infection
2)Vasomotor rhinitis
3)Allergic rhinitis due to pollen
4)Other seasonal allergic rhinitis
5)Allergic rhinitis due to food
6)Allergic rhinitis due to animal (cat) (dog) hair and dander
7)Other allergic rhinitis
8)"Allergic rhinitis, unspecified"
9)Chronic rhinitis
10)Chronic nasopharyngitis

My initial approach was to use a boolean and contains():

boolean found = med[x].toLowerCase().contains(condition[y].toLowerCase());

However, the result is false on every pass through the loop.

The results I expect would be:

1) False
2) True
3) True
4) True
5) True
6) True
7) True
8) True
9) True
10) False

I'm very new to Java and its methods. Basically, if any word in A matches any word in B, flag it as true. How do I do that?

Thanks!

You first have to tokenize one of the strings. What you are doing now is trying to match the whole line against the paragraph.

Something like this should work:

String text = med[x].toLowerCase();
boolean found =
  Arrays.stream(condition[y].split(" "))
      .map(String::toLowerCase)
      .map(s -> s.replaceAll("\\W", ""))   // strip punctuation
      .filter(s -> !s.isEmpty())           // drop tokens that are now empty
      .anyMatch(text::contains);

I've added the removal of punctuation characters and of any blank strings, so that we don't get false matches on those. (The \\W actually removes every character that is not in [A-Za-z_0-9], but you can change it to whatever you like.)
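As a minimal, self-contained sketch of the tokenizing approach described above, the snippet below runs it against the paragraph from the question. The class name TokenContainsDemo and the method name anyWordMatches are made up for illustration, and it splits on "\\s+" rather than a single space, which is an assumption so that runs of whitespace do not produce empty tokens.

```java
import java.util.Arrays;

class TokenContainsDemo {

    // True if any cleaned word of `condition` occurs as a substring of `text`.
    static boolean anyWordMatches(String text, String condition) {
        String haystack = text.toLowerCase();
        return Arrays.stream(condition.split("\\s+"))
                .map(String::toLowerCase)
                .map(w -> w.replaceAll("\\W", ""))   // strip punctuation
                .filter(w -> !w.isEmpty())           // skip tokens that became empty
                .anyMatch(haystack::contains);
    }

    public static void main(String[] args) {
        String med = "Fluticasone Propionate Nasal Spray, USP 50 mcg per spray is a "
                + "corticosteroid indicated for the management of the nasal symptoms of "
                + "perennial nonallergic rhinitis in adult and pediatric patients aged 4 years "
                + "and older.";
        String[] conditions = {
            "Unspecified acute lower respiratory infection",
            "Vasomotor rhinitis",
            "Chronic nasopharyngitis"
        };
        for (String condition : conditions) {
            System.out.println(anyWordMatches(med, condition) + " - " + condition);
        }
    }
}
```

Because this checks substrings, short tokens can match inside longer words: "allergic" is found inside "nonallergic", and "to" inside "symptoms". Whether that is acceptable depends on how loose the match is meant to be.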

If you need this to be efficient because you have a lot of text, you might want to turn it around and use a Set, which has a faster lookup. Note that a Set lookup matches whole words only, not substrings.

private Stream<String> tokenize(String text) {
   // The parameter is named text so the lambda parameter s does not clash with it.
   return Arrays.stream(text.split(" "))
                .map(String::toLowerCase)
                .map(s -> s.replaceAll("\\W", ""))
                .filter(s -> !s.isEmpty());
}

Set<String> words =  tokenize(med[x]).collect(Collectors.toSet());

boolean found = tokenize(condition[y]).anyMatch(words::contains);

You might also want to filter out stop words, like "to", "and", etc. You could take an existing stop-word list and add an extra filter after the one that checks for blank strings, to check that the string is not a stop word.
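A rough sketch of that stop-word filter, combined with the Set-based lookup, might look like the following. The STOP_WORDS contents are a deliberately tiny, made-up list (not the list the answer refers to), and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class StopWordMatchDemo {

    // Hypothetical, very short stop-word list; a real list would be much longer.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "to", "of", "in", "is", "for"));

    static Stream<String> tokenize(String text) {
        return Arrays.stream(text.split("\\s+"))
                .map(String::toLowerCase)
                .map(w -> w.replaceAll("\\W", ""))
                .filter(w -> !w.isEmpty())
                .filter(w -> !STOP_WORDS.contains(w));   // the extra stop-word filter
    }

    // True if the two strings share at least one whole, non-stop word.
    static boolean anyWordMatches(String text, String condition) {
        Set<String> words = tokenize(text).collect(Collectors.toSet());
        return tokenize(condition).anyMatch(words::contains);
    }

    public static void main(String[] args) {
        String med = "perennial nonallergic rhinitis in adult and pediatric patients "
                + "aged 4 years and older.";
        // The only shared word is the stop word "and", so this is not a match.
        System.out.println(anyWordMatches(med, "hair and dander"));
        // "rhinitis" is shared, so this is a match.
        System.out.println(anyWordMatches(med, "Chronic rhinitis"));
    }
}
```

With whole-word lookup, "allergic" no longer matches "nonallergic", which differs from the substring version.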

If you construct a list of the searchable words, this would be a lot easier. Supposing your paragraph is stored as a String named paragraph:

// Requires: import java.util.ArrayList;
ArrayList<String> dictionary = new ArrayList<>();
dictionary.add("acute lower respiratory infection");
dictionary.add("rhinitis");
for (int i = 0; i < dictionary.size(); i++) {
    // contains() is case-sensitive, so the entries are kept in lower case
    if (paragraph.contains(dictionary.get(i))) {
        System.out.println(i + " True");
    } else {
        System.out.println(i + " False");
    }
}

This will give you a 'crude' match percentage.

Here's how it works:

  1. Split the text to search and the search term into sets of words. This is done by splitting using a regular expression. Each word is converted to upper case and added to a set.

  2. Count how many words in the search term appear in the text.

  3. Calculate the percentage of words in the search term that appear in the text.

You might want to enhance this by stripping out common words like 'a', 'the' etc.

    import java.util.Arrays;
    import java.util.Set;
    import java.util.stream.Collectors;

    public class CrudeTextMatchThingy {

        public static void main(String[] args) {
            String searchText = "Fluticasone Propionate Nasal Spray, USP 50 mcg per spray is a \n" +
                    "corticosteroid indicated for the management of the nasal symptoms of \n" +
                    "perennial nonallergic rhinitis in adult and pediatric patients aged 4 years \n" +
                    "and older.";

            String[] searchTerms = {
                "Unspecified acute lower respiratory infection",
                "Vasomotor rhinitis",
                "Allergic rhinitis due to pollen",
                "Other seasonal allergic rhinitis",
                "Allergic rhinitis due to food",
                "Allergic rhinitis due to animal (cat) (dog) hair and dander",
                "Other allergic rhinitis",
                "Allergic rhinitis, unspecified",
                "Chronic rhinitis",
                "Chronic nasopharyngitis"
            };

            Arrays.stream(searchTerms).forEach(searchTerm -> {
                double matchPercent = findMatch(searchText, searchTerm);
                System.out.println(matchPercent + "% - " + searchTerm);
            });
        }

        private static double findMatch(String searchText, String searchTerm) {
            Set<String> wordsInSearchText = getWords(searchText);
            Set<String> wordsInSearchTerm = getWords(searchTerm);

            double wordsInSearchTermThatAreFound = wordsInSearchTerm.stream()
                    .filter(s -> wordsInSearchText.contains(s))
                    .count();

            return (wordsInSearchTermThatAreFound / wordsInSearchTerm.size()) * 100.0;
        }

        private static Set<String> getWords(String term) {
            return Arrays.stream(term.split("\\b"))
                    .map(String::trim)
                    .map(String::toUpperCase)
                    .filter(s -> s.matches("[A-Z0-9]+"))
                    .collect(Collectors.toSet());
        }
    }

Output:

    0.0% - Unspecified acute lower respiratory infection
    50.0% - Vasomotor rhinitis
    20.0% - Allergic rhinitis due to pollen
    25.0% - Other seasonal allergic rhinitis
    20.0% - Allergic rhinitis due to food
    20.0% - Allergic rhinitis due to animal (cat) (dog) hair and dander
    33.33333333333333% - Other allergic rhinitis
    33.33333333333333% - Allergic rhinitis, unspecified
    50.0% - Chronic rhinitis
    0.0% - Chronic nasopharyngitis

If you do not want a percentage, but true or false, you can just do:

    boolean matches = findMatch(searchText, searchTerm) > 0.0;
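The enhancement mentioned earlier, stripping out common words, could be sketched roughly as below. This is only an illustration: the STOP_WORDS set is a short made-up list, the class name is hypothetical, and the guard for a search term consisting only of stop words is an added assumption (without it the division would yield NaN).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

class StopWordPercentDemo {

    // Hypothetical stop-word list, upper case because getWords upper-cases every word.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("A", "AN", "AND", "THE", "TO", "OF", "IN", "IS", "FOR"));

    static Set<String> getWords(String term) {
        return Arrays.stream(term.split("\\b"))
                .map(String::trim)
                .map(String::toUpperCase)
                .filter(s -> s.matches("[A-Z0-9]+"))
                .filter(s -> !STOP_WORDS.contains(s))   // drop common words
                .collect(Collectors.toSet());
    }

    static double findMatch(String searchText, String searchTerm) {
        Set<String> textWords = getWords(searchText);
        Set<String> termWords = getWords(searchTerm);
        if (termWords.isEmpty()) {
            return 0.0;   // nothing left to match after removing stop words
        }
        long found = termWords.stream().filter(textWords::contains).count();
        return 100.0 * found / termWords.size();
    }

    public static void main(String[] args) {
        String text = "perennial nonallergic rhinitis in adult and pediatric patients";
        System.out.println(findMatch(text, "hair and dander") > 0.0);
        System.out.println(findMatch(text, "Chronic rhinitis") > 0.0);
    }
}
```

With the stop words removed, a term such as "hair and dander" no longer counts as a match merely because both strings contain "and".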

Hope this helps.
