Removing numbers and characters from a string using a pattern

Question

So I have a list of words, about 50,000 of them, and I want to remove certain numbers and letters from them. Specifically, I want to remove anything that has a number from 0-99 followed by either an E or a Z, for example: 4E, 11Z, 11E, 20Z, etc.

The words that I want to remove them from look like this:

6S,9,12S-trimethyl-2E,4E,8E,10E-tetradecatetraenoic acid
7Z,14Z-eicosadienoic acid
13,17,21,25-tetramethyl-5Z-hexacosenoic acid
CDP-DG(18:1(11Z)/22:6(4Z,7Z,10Z,13Z,16Z,19Z))
PC(20:4(5Z,8Z,11Z,14Z)/17:2(9Z,12Z))

As you can see, the thing I want to remove appears in different ways in the words (inside brackets, after a hyphen, etc.). So far, I've done:

public class EZConfig {

    public static void main(String[] args) throws IOException{

     BufferedReader br = new BufferedReader(new FileReader("C:/Users/colles-a-l-kxc127/Dropbox/PhD/Java/MetabolitesCompiled/src/commonNames"));

        try {

            StringBuilder sb = new StringBuilder();
            String line = br.readLine();

            while (line != null) {

                if(line.contains("[0-99][E|Z]")){

                    System.out.println(line + " TRUE");
                }
                else{
                    System.out.println(line);
                }

                line = br.readLine();
            }

        } finally {
            br.close();
        }
    }
}

This was just to see if I can pick up the number/E or Z annotations, but I can't seem to. I basically need to script something that will remove all those annotations from my list of words. Does anyone know what I can do to achieve this?

Answer 1

You cannot pass a regular expression to String.contains; or rather, you can, but it will be treated as a literal string.

I would use this draft solution:

// declare as constant somewhere
static final Pattern MY_PATTERN = Pattern.compile("\\d+[EZ]");

Then, instead of your if(line.contains("[0-99][E|Z]")){ statement, you can use:

if (MY_PATTERN.matcher(line).find()) {
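As a minimal sketch of the difference (the class name ContainsVsRegex and its two helper methods are made up for illustration), the following compares the literal check from the question with the regex search above:

```java
import java.util.regex.Pattern;

public class ContainsVsRegex {
    // Same pattern as in the answer: one or more digits followed by E or Z.
    static final Pattern MY_PATTERN = Pattern.compile("\\d+[EZ]");

    // Literal substring test, as in the question's code.
    // Only true if the line contains the exact text "[0-99][E|Z]".
    static boolean literalContains(String line) {
        return line.contains("[0-99][E|Z]");
    }

    // Regex search: true if the pattern occurs anywhere in the line.
    static boolean regexFind(String line) {
        return MY_PATTERN.matcher(line).find();
    }

    public static void main(String[] args) {
        String line = "7Z,14Z-eicosadienoic acid";
        System.out.println(literalContains(line)); // false
        System.out.println(regexFind(line));       // true
    }
}
```

Note that find() looks for the pattern anywhere in the input, whereas matches() would require the whole line to match.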

In the long run, if you're removing that from your words, you probably want to use:

line = line.replaceAll("\\d+[EZ]", "");
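For a list of around 50,000 lines, one option is to reuse a precompiled Pattern rather than calling String.replaceAll on every line, since String.replaceAll compiles the regex each time it is called. The sketch below assumes that approach; the class name StripAnnotations and the method strip are hypothetical names, not part of the original answer:

```java
import java.util.regex.Pattern;

public class StripAnnotations {
    // Compiled once and reused for every line.
    static final Pattern MY_PATTERN = Pattern.compile("\\d+[EZ]");

    // Removes every "digits followed by E or Z" occurrence from the line.
    static String strip(String line) {
        return MY_PATTERN.matcher(line).replaceAll("");
    }

    public static void main(String[] args) {
        // Sample inputs taken from the question.
        System.out.println(strip("7Z,14Z-eicosadienoic acid"));
        // prints ",-eicosadienoic acid"
        System.out.println(strip("13,17,21,25-tetramethyl-5Z-hexacosenoic acid"));
        // prints "13,17,21,25-tetramethyl--hexacosenoic acid"
    }
}
```

As the output shows, this pattern removes only the annotations themselves, so the surrounding commas, hyphens and parentheses are left behind.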

Edit

As newbiedoodle mentions (I hadn't noticed), the character class [0-99] will not match a range between 0 and 99. It matches a single character in the range 0-9; the second 9 is redundant.

If you need to limit your digits to < 100, you can use \\d{1,2} instead of the more generic \\d+.
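A small sketch may help here. It first shows that [0-99] matches only a single digit, then compares \\d{1,2} with a stricter variant. The stricter variant uses a negative lookbehind, (?<!\\d), which is my own addition and not part of the original answer: without it, \\d{1,2}[EZ] still matches the last two digits of a longer number. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

public class DigitLimit {
    // One or two digits followed by E or Z, as suggested above.
    static final Pattern LOOSE = Pattern.compile("\\d{1,2}[EZ]");

    // Assumption: also require that no digit comes directly before the match,
    // so that numbers with three or more digits are left alone.
    static final Pattern STRICT = Pattern.compile("(?<!\\d)\\d{1,2}[EZ]");

    static String stripLoose(String s) {
        return LOOSE.matcher(s).replaceAll("");
    }

    static String stripStrict(String s) {
        return STRICT.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // [0-99] is a character class: it matches exactly one digit.
        System.out.println("5".matches("[0-99]"));  // true
        System.out.println("42".matches("[0-99]")); // false

        System.out.println(stripLoose("100E"));     // prints "1"
        System.out.println(stripStrict("100E"));    // prints "100E"
        System.out.println(stripStrict("PC(11Z)")); // prints "PC()"
    }
}
```

Whether three-digit numbers can occur in the data at all is not stated in the question, so the lookbehind may be unnecessary in practice.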

Notes

To also remove the optional parentheses surrounding the pattern, an optional hyphen before it and an optional comma after it, you can use the following idiom: "-?\\(?\\d+[EZ]\\)?,?".

Note that the parentheses need to be escaped with a double backslash in this context, because the pattern is written as a Java string literal.
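To see what the extended idiom does on the sample data from the question, here is a short sketch (the class name StripWithPunctuation and the method strip are hypothetical):

```java
import java.util.regex.Pattern;

public class StripWithPunctuation {
    // Optional leading hyphen, optional "(", the annotation itself,
    // optional ")", optional trailing comma.
    static final Pattern EXTENDED = Pattern.compile("-?\\(?\\d+[EZ]\\)?,?");

    static String strip(String line) {
        return EXTENDED.matcher(line).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("13,17,21,25-tetramethyl-5Z-hexacosenoic acid"));
        // prints "13,17,21,25-tetramethyl-hexacosenoic acid"
        System.out.println(strip("PC(20:4(5Z,8Z,11Z,14Z)/17:2(9Z,12Z))"));
        // prints "PC(20:4/17:2)"
        System.out.println(strip("7Z,14Z-eicosadienoic acid"));
        // prints "-eicosadienoic acid"
    }
}
```

The last case shows a limitation: the hyphen is only consumed when it comes before the annotation, so a name that starts with an annotation keeps its leading hyphen and would need extra handling.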

Answer 1: 3, accepted, 2014-12-18 13:27:36