Removing numbers and characters from a string using pattern

So I have a list of words, about 50,000 of them, and I want to remove certain numbers and letters from them. Specifically, I want to remove anything that consists of a number from 0 to 99 followed by either an E or a Z, for example: 4E, 11Z, 11E, 20Z, etc.

The words that I want to remove them from look like this:

  • 6S,9,12S-trimethyl-2E,4E,8E,10E-tetradecatetraenoic acid
  • 7Z,14Z-eicosadienoic acid
  • 13,17,21,25-tetramethyl-5Z-hexacosenoic acid
  • CDP-DG(18:1(11Z)/22:6(4Z,7Z,10Z,13Z,16Z,19Z))
  • PC(20:4(5Z,8Z,11Z,14Z)/17:2(9Z,12Z))

As you can see, the thing I want to remove appears in different ways in the words (within brackets, after a hyphen, etc.). So far, I've done:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class EZConfig {

    public static void main(String[] args) throws IOException{

     BufferedReader br = new BufferedReader(new FileReader("C:/Users/colles-a-l-kxc127/Dropbox/PhD/Java/MetabolitesCompiled/src/commonNames"));

        try {

            StringBuilder sb = new StringBuilder();
            String line = br.readLine();

            while (line != null) {

                if(line.contains("[0-99][E|Z]")){

                    System.out.println(line + " TRUE");
                }
                else{
                    System.out.println(line);
                }

                line = br.readLine();
            }

        } finally {
            br.close();
        }
    }
}

That was just to see if I can pick up the number/E or Z annotations, but I can't seem to. I basically need to script something that will remove all those annotations from my list of words. Does anyone know what I can do to achieve this?

You cannot pass a regular expression to String.contains; or rather, you can, but it will be treated as a literal string.

I would use this draft solution:

// declare as constant somewhere
static final Pattern MY_PATTERN = Pattern.compile("\\d+[EZ]");

Then, instead of your if(line.contains("[0-99][E|Z]")){ statement, you can use:

if (MY_PATTERN.matcher(line).find()) {
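For reference, here is a minimal self-contained sketch of that check, run against two in-memory sample lines instead of the file. The class name AnnotationCheck, the helper hasAnnotation and the "palmitic acid" sample are made up for illustration; only the pattern comes from the answer.

```java
import java.util.List;
import java.util.regex.Pattern;

final class AnnotationCheck {
    // One or more digits followed by E or Z, compiled once and reused.
    static final Pattern MY_PATTERN = Pattern.compile("\\d+[EZ]");

    // Hypothetical helper: true if the line contains at least one annotation.
    static boolean hasAnnotation(String line) {
        return MY_PATTERN.matcher(line).find();
    }

    public static void main(String[] args) {
        // Assumed sample data: one name with annotations, one without.
        List<String> lines = List.of("7Z,14Z-eicosadienoic acid", "palmitic acid");
        for (String line : lines) {
            // contains() searches for the literal text "[0-99][E|Z]",
            // so it is false for both lines; find() applies the regex.
            System.out.println(line
                    + " contains=" + line.contains("[0-99][E|Z]")
                    + " find=" + hasAnnotation(line));
        }
    }
}
```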

In the long run, if you're removing that from your words, you probably want to use:

line = line.replaceAll("\\d+[EZ]", "");
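As a rough sketch of what that replacement produces on two of the sample names from the question (AnnotationRemover and strip are hypothetical names, not part of the original answer). It reuses a precompiled Pattern, since String.replaceAll compiles the expression on every call, which may matter for a list of around 50,000 lines.

```java
import java.util.regex.Pattern;

final class AnnotationRemover {
    // Same expression as in the answer, compiled once.
    static final Pattern MY_PATTERN = Pattern.compile("\\d+[EZ]");

    // Hypothetical helper: removes every digits+E/Z annotation from the line.
    static String strip(String line) {
        return MY_PATTERN.matcher(line).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("7Z,14Z-eicosadienoic acid"));
        // prints ",-eicosadienoic acid"
        System.out.println(strip("PC(20:4(5Z,8Z,11Z,14Z)/17:2(9Z,12Z))"));
        // prints "PC(20:4(,,,)/17:2(,))"
    }
}
```

Note that the separators around each annotation (commas, hyphens, parentheses) are left behind, because the pattern only covers the digits and the letter.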

Edit

As newbiedoodle mentions (I hadn't noticed), the character class [0-99] will not match a range between 0 and 99. It matches a single character: a digit from 0 to 9 (the trailing 9 is redundant).

If you need to limit your digits to < 100, you can use \\d{1,2} instead of the more generic \\d+.
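A small sketch of the difference, assuming Java's standard regex behaviour. DigitRangeDemo and stripUpTo99 are illustrative names. The negative lookbehind (?<!\\d) is an addition that is not in the original answer: without it, a search with \\d{1,2}[EZ] can still match the last two digits of a longer number.

```java
final class DigitRangeDemo {
    // Hypothetical helper: removes an annotation only when its number has
    // at most two digits, i.e. is not preceded by another digit.
    static String stripUpTo99(String s) {
        return s.replaceAll("(?<!\\d)\\d{1,2}[EZ]", "");
    }

    public static void main(String[] args) {
        // [0-99] is a character class, so it matches exactly one digit.
        System.out.println("5".matches("[0-99]"));      // prints "true"
        System.out.println("42".matches("[0-99]"));     // prints "false"
        // \d{1,2} matches one or two digits.
        System.out.println("42".matches("\\d{1,2}"));   // prints "true"
        // Without the lookbehind, "00Z" is removed from "100Z".
        System.out.println("100Z".replaceAll("\\d{1,2}[EZ]", "")); // prints "1"
        // With the lookbehind, "100Z" is left untouched.
        System.out.println(stripUpTo99("100Z"));        // prints "100Z"
    }
}
```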

Notes

To also remove the [optional] parentheses surrounding the pattern, an optional hyphen starting it and an optional comma ending it, you can use the following idiom: "-?\\(?\\d+[EZ]\\)?,?".

Note that the parentheses need to be double-escaped in this context.
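To see what the extended idiom does in practice, here is a short sketch applied to three of the sample names from the question. ExtendedAnnotationRemover and strip are hypothetical names; the pattern is the one quoted above, written as a Java string literal.

```java
import java.util.regex.Pattern;

final class ExtendedAnnotationRemover {
    // Optional hyphen, optional "(", digits, E or Z, optional ")", optional comma.
    static final Pattern ANNOTATION = Pattern.compile("-?\\(?\\d+[EZ]\\)?,?");

    // Hypothetical helper: removes each annotation together with the
    // punctuation covered by the pattern.
    static String strip(String name) {
        return ANNOTATION.matcher(name).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("13,17,21,25-tetramethyl-5Z-hexacosenoic acid"));
        // prints "13,17,21,25-tetramethyl-hexacosenoic acid"
        System.out.println(strip("PC(20:4(5Z,8Z,11Z,14Z)/17:2(9Z,12Z))"));
        // prints "PC(20:4/17:2)"
        System.out.println(strip("7Z,14Z-eicosadienoic acid"));
        // prints "-eicosadienoic acid"
    }
}
```

The last case shows a limitation: when the annotation starts the name, the hyphen comes after it rather than before, so it is not consumed. Depending on the data, a further clean-up step for leading hyphens may be needed.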
