Removing numbers and characters from a string using a pattern
I have a list of about 50,000 words, and I want to remove certain numbers and letters from them. Specifically, I want to remove anything that consists of a number from 0 to 99 followed by either an E or a Z, for example:
4E, 11Z, 11E, 20Z, etc.
The words that I want to remove them from look like this:
As you can see, the thing I want to remove appears in different ways in the words (for example within brackets or after a hyphen). So far, I've done:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class EZConfig {
    public static void main(String[] args) throws IOException {
        BufferedReader br = new BufferedReader(new FileReader("C:/Users/colles-a-l-kxc127/Dropbox/PhD/Java/MetabolitesCompiled/src/commonNames"));
        try {
            StringBuilder sb = new StringBuilder();
            String line = br.readLine();
            while (line != null) {
                if (line.contains("[0-99][E|Z]")) {
                    System.out.println(line + " TRUE");
                } else {
                    System.out.println(line);
                }
                line = br.readLine();
            }
        } finally {
            br.close();
        }
    }
}
This is just to see whether I can pick up the number/E or Z annotations, but I can't seem to. I basically need to script something that will remove all those annotations from my list of words. Does anyone know what I can do to achieve this?
You cannot pass a regular expression to String.contains, or rather, you can, but it will be treated as a literal string.
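To illustrate the point about literal matching, here is a minimal sketch. The class name `ContainsIsLiteral`, the helper `containsLiterally` and the sample strings are made up for this example and are not from the question.

```java
// Hypothetical demo class; names and sample inputs are illustrative only.
class ContainsIsLiteral {
    // The text the question passed to contains(); it is NOT interpreted as a regex.
    static final String REGEX_TEXT = "[0-99][E|Z]";

    // True only if the line contains the exact characters "[0-99][E|Z]".
    static boolean containsLiterally(String line) {
        return line.contains(REGEX_TEXT);
    }

    public static void main(String[] args) {
        // No literal "[0-99][E|Z]" in this word, so contains() reports false.
        System.out.println(containsLiterally("hexa-11Z-enal"));
        // The exact characters are present here, so contains() reports true.
        System.out.println(containsLiterally("note: [0-99][E|Z]"));
    }
}
```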
I would use this draft solution:
// declare as constant somewhere
static final Pattern MY_PATTERN = Pattern.compile("\\d+[EZ]");
Then, instead of your if(line.contains("[0-99][E|Z]")){ statement, you can use:
if (MY_PATTERN.matcher(line).find()) {
In the long run, if you're removing that from your words, you probably want to use:
line = line.replaceAll("\\d+[EZ]", "");
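Putting the pieces above together, a minimal sketch of the whole clean-up might look like the following. The class name `AnnotationStripper` and the method `strip` are hypothetical, the file path is taken from the command line rather than hard-coded, and UTF-8 is an assumed encoding for the word list; the pattern is the same `\\d+[EZ]` as above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is not from the original question.
class AnnotationStripper {
    // One or more digits followed by E or Z, compiled once and reused.
    static final Pattern MY_PATTERN = Pattern.compile("\\d+[EZ]");

    // Returns the line with every digits-plus-E/Z annotation removed.
    static String strip(String line) {
        return MY_PATTERN.matcher(line).replaceAll("");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java AnnotationStripper <word-list-file>");
            return;
        }
        // try-with-resources closes the reader even if reading fails.
        // UTF-8 is an assumption about the file's encoding.
        try (BufferedReader br = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            String line;
            while ((line = br.readLine()) != null) {
                System.out.println(strip(line));
            }
        }
    }
}
```

For a made-up input such as `(4E)-decenal`, `strip` would return `()-decenal`: the annotation is gone, but the surrounding punctuation is left behind, which is what the idiom in the Notes section addresses.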
Edit
As newbiedoodle mentions (I hadn't noticed), the character class [0-99] will not match a range between 0 and 99. It matches a single character that is either in the range 0-9 or the literal 9, in other words exactly one digit.
If you need to limit your digits to < 100, you can use \\d{1,2} instead of the more generic \\d+.
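The difference between the character class and the quantifier can be checked with a small sketch like the one below. The class `DigitRangeDemo` and its helper methods are hypothetical. The `STRICT` pattern, which adds a negative lookbehind so that the last two digits of a longer number are not picked up, is an addition for illustration and not part of the answer above.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class DigitRangeDemo {
    // The question's attempt: ONE digit, then ONE of 'E', '|' or 'Z'.
    static final Pattern CHAR_CLASS = Pattern.compile("[0-99][E|Z]");
    // One or two digits followed by E or Z.
    static final Pattern ONE_OR_TWO = Pattern.compile("\\d{1,2}[EZ]");
    // Assumption: also require that no digit precedes the one or two digits.
    static final Pattern STRICT = Pattern.compile("(?<!\\d)\\d{1,2}[EZ]");

    // True if the pattern matches the entire string.
    static boolean wholeMatch(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    // True if the pattern matches anywhere inside the string.
    static boolean found(Pattern p, String s) {
        return p.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(wholeMatch(CHAR_CLASS, "11Z")); // false: only one digit allowed
        System.out.println(wholeMatch(CHAR_CLASS, "4|"));  // true: '|' is literal inside [...]
        System.out.println(wholeMatch(ONE_OR_TWO, "11Z")); // true
        System.out.println(found(ONE_OR_TWO, "120Z"));     // true: finds "20Z" inside "120Z"
        System.out.println(found(STRICT, "120Z"));         // false: "20Z" is preceded by a digit
    }
}
```

Whether the stricter variant is needed depends on whether three-digit numbers followed by E or Z can occur in the word list at all.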
Notes
To also remove the optional parentheses surrounding the pattern, an optional hyphen preceding it and an optional comma following it, you can use the following idiom: "-?\\(?\\d+[EZ]\\)?,?".
Note that the parentheses need to be double-escaped in this context, because the pattern is written as a Java string literal.
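As a rough sketch of how that idiom behaves, the snippet below applies it with `replaceAll`. The class name `PunctuationAwareStripper`, the method `strip` and the sample words are invented for illustration, since the question's own word list is not shown here.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names and sample words are illustrative only.
class PunctuationAwareStripper {
    // Optional hyphen, optional "(", digits, E or Z, optional ")", optional comma.
    static final Pattern ANNOTATION = Pattern.compile("-?\\(?\\d+[EZ]\\)?,?");

    // Returns the word with every annotation (and its adjacent punctuation) removed.
    static String strip(String word) {
        return ANNOTATION.matcher(word).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("octa-9Z-enoic"));       // octa-enoic
        System.out.println(strip("deca-(2E,4E)-dienal")); // deca-dienal
        System.out.println(strip("(4E)-decenal"));        // -decenal
    }
}
```

The last case shows a limit of the draft idiom: it only consumes a hyphen that comes before the annotation, so a word that starts with the annotation keeps the hyphen that followed it.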