
Java String Fix Capitalization in Abbreviations

I need a way to fix capitalization in abbreviations found within a String. Assume all abbreviations are correctly spaced.

For example,

"Robert a.k.a. Bob A.k.A. dr. Bobby"

becomes:

"Robert A.K.A. Bob A.K.A. Dr. Bobby"

Correctly capitalized abbreviations will be known ahead of time, stored in a Collection of some sort.

I was thinking of an algorithm like this:

private String fix(String s) {
    StringBuilder builder = new StringBuilder();
    for (String word : s.split(" ")) {
        if (collection.contains(word.toUpperCase())) {
            // word = correct abbreviation here
        }
        builder.append(word);
        builder.append(" ");
    }
    return builder.toString().trim();
}

But as far as I know, there are a couple of problems with this approach:

  • If the abbreviation has a lower case letter (Dr.), the upper-cased word never matches it
  • If the word starts or ends with punctuation ("aka")

I have a feeling that this can be solved with a regex, iteratively matching and replacing the correct abbreviation. But if not, how should I approach this problem?

Instead of using a regex or rolling your own implementation, I would suggest you use a utility library. WordUtils in Apache Commons Lang is perfect for the job:

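// WordUtils from Apache Commons Lang 3 (deprecated there in favor of Apache Commons Text)
import org.apache.commons.lang3.text.WordUtils;

String input = "Robert a.k.a. Bob A.k.A. dr. Bobby";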
String capitalized = WordUtils.capitalize(input, '.', ' ');
System.out.println(capitalized);

This prints out:

Robert A.K.A. Bob A.K.A. Dr. Bobby
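To make clear what that call does, here is a rough dependency-free sketch of the same behavior: upper-case the first character and every character that directly follows a delimiter, and leave everything else alone. The class and method names are made up for illustration and this is not the library's implementation. Note that the known abbreviations are never consulted, so ordinary words get capitalized too.

```java
public class CapitalizeSketch {

    // Upper-cases the first character of the input and every character that
    // directly follows one of the delimiters; other characters are unchanged.
    static String capitalize(String input, char... delimiters) {
        StringBuilder out = new StringBuilder(input.length());
        boolean capitalizeNext = true;
        for (char c : input.toCharArray()) {
            if (isDelimiter(c, delimiters)) {
                out.append(c);
                capitalizeNext = true;
            } else if (capitalizeNext) {
                out.append(Character.toTitleCase(c));
                capitalizeNext = false;
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    private static boolean isDelimiter(char c, char[] delimiters) {
        for (char d : delimiters) {
            if (c == d) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // prints "Robert A.K.A. Bob A.K.A. Dr. Bobby"
        System.out.println(capitalize("Robert a.k.a. Bob A.k.A. dr. Bobby", '.', ' '));
        // prints "The Dr. Is In": every word is capitalized, not only abbreviations
        System.out.println(capitalize("the dr. is in", '.', ' '));
    }
}
```

Because letters that do not follow a delimiter keep their case, an input such as "DR." would stay "DR." with this approach.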

You do not have to use a regex; your solution looks reasonable (although it may be slow if you have a lot of data to process).

For abbreviations that contain lower case letters, e.g. Dr., you could use a case-insensitive string comparison rather than toUpperCase. Actually, that's only useful if you are directly comparing the strings yourself. What you really need is a case-insensitive map that takes the word in any capitalization and returns the correct form. Perhaps:

Map<String, String> collection = new TreeMap<String, String>(String.CASE_INSENSITIVE_ORDER);

If the abbreviation starts or ends with punctuation, then make sure the corresponding key in your collection does too.
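A minimal sketch of how the question's loop could use such a map, assuming Java 8 or later. The class name AbbreviationFixer and its constructor are hypothetical; each correctly capitalized form is stored as both key and value, so a lookup with any capitalization returns the correct one.

```java
import java.util.Map;
import java.util.TreeMap;

public class AbbreviationFixer {

    // Keys compare case-insensitively; values hold the correct capitalization.
    private final Map<String, String> abbreviations =
            new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

    public AbbreviationFixer(String... correctForms) {
        for (String form : correctForms) {
            abbreviations.put(form, form);
        }
    }

    public String fix(String s) {
        StringBuilder builder = new StringBuilder();
        for (String word : s.split(" ")) {
            // Fall back to the word itself when it is not a known abbreviation.
            builder.append(abbreviations.getOrDefault(word, word)).append(' ');
        }
        return builder.toString().trim();
    }

    public static void main(String[] args) {
        AbbreviationFixer fixer = new AbbreviationFixer("A.K.A.", "Dr.");
        // prints "Robert A.K.A. Bob A.K.A. Dr. Bobby"
        System.out.println(fixer.fix("Robert a.k.a. Bob A.k.A. dr. Bobby"));
    }
}
```

A TreeMap lookup is O(log n) in the number of abbreviations, which should matter little for a small, fixed list.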

This is how I went about it...

UPDATED after reading comments by OP.

It prints:

Robert A.K.A. Bob A.K.A. Dr. Bobby The o.o.

import java.util.ArrayList;
import java.util.List;

public class Fixer {

    List<String> collection = new ArrayList<>();

    public Fixer() {
        collection.add("Dr.");
        collection.add("A.K.A.");
        collection.add("o.o.");
    }

    /* app entry point */
    public static void main(String[] args) {
        String testCase = "robert a.k.a. bob A.k.A. dr. bobby the o.o.";

        Fixer l = new Fixer();
        String result = l.fix(testCase);

        System.out.println(result);
    }

    private String fix(String s) {
        StringBuilder builder = new StringBuilder();
        for (String word : s.split(" ")) {
            String abbr = getAbbr(word);
            if (abbr == null) {
                builder.append(word.substring(0, 1).toUpperCase());
                builder.append(word.substring(1));
            } else {
                builder.append(abbr);
            }
            builder.append(" ");
        }
        return builder.toString().trim();
    }

    private String getAbbr(String word) {
        for (String abbr : collection) {
            if (abbr.equalsIgnoreCase(word)) {
                return abbr;
            }
        }
        return null;
    }
}
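For comparison, the regex idea from the question could be sketched roughly as follows. This is only one possible approach, and the class name RegexFixer is made up for illustration. It builds a single case-insensitive pattern from the known abbreviations and replaces each match with its correct form, so abbreviations wrapped in punctuation such as parentheses are still found. Unlike the Fixer class above, it leaves all other words untouched, and it assumes at least one abbreviation is supplied.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class RegexFixer {

    // Maps any capitalization of an abbreviation to its correct form.
    private final Map<String, String> correctForms =
            new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
    private final Pattern pattern;

    public RegexFixer(String... abbreviations) {
        for (String abbr : abbreviations) {
            correctForms.put(abbr, abbr);
        }
        // Quote each abbreviation so that '.' is matched literally.
        String alternatives = correctForms.keySet().stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        // The lookarounds keep a match from starting or ending inside a longer word.
        pattern = Pattern.compile(
                "(?<![\\p{L}\\p{N}])(?:" + alternatives + ")(?![\\p{L}\\p{N}])",
                Pattern.CASE_INSENSITIVE);
    }

    public String fix(String s) {
        Matcher m = pattern.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String correct = correctForms.get(m.group());
            // quoteReplacement guards against '$' or '\' in an abbreviation.
            m.appendReplacement(out, Matcher.quoteReplacement(correct));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        RegexFixer fixer = new RegexFixer("A.K.A.", "Dr.", "o.o.");
        // prints "robert (A.K.A. bob) and Dr. bobby, the o.o."
        System.out.println(fixer.fix("robert (a.k.a. bob) and DR. bobby, the O.O."));
    }
}
```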
