
Remove all non-numeric characters but keep a specific word

I'm working on a script, written in Java, that downloads manga from www.mangafox.me.

Unfortunately, this website doesn't have an API, so I use some archaic ways to get my data. However, it's possible to get an XML file listing every chapter of a manga, for example: http://mangafox.me/rss/nisekoi.xml.

I parse this XML and use the title tag to get each chapter's number and its associated volume.

For example, I have a string like this: Nisekoi Vol TBD Ch 215, and I want to keep only TBD and 215.

At the moment, I replace all non-numeric characters with spaces and keep every occurrence of TBD by using:

String title = "Nisekoi Vol TBD Ch 215";
title = title.replaceAll("[^0-9.\bTBD\b]+", " ").trim();

title is then equal to "TBD 215", and I use title.split(" ") to get the volume and the chapter.

This works just fine until I do the same with a manga whose name starts with a T. Apparently, the capital T isn't replaced by a space.

I'm not very good at regular expressions, so how do I replace every character that is not a digit, a dot (for decimals), or part of the word "TBD" with a space in Java?

Thanks!

KISS (keep it simple, stupid): grab the number at the end of the title with \\d+$, then build your result as "TBD " + your_number.

I guess that "Vol" and "Ch" are the fixed parts here, so you could use this regex:

Vol (.*) Ch (.*)

and retrieve its first group for the volume and its second group for the chapter.

You can see the Java code in action here.
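Here is a minimal sketch of how that regex could be applied in Java. The class name VolChParser, the parse method, and the String[] return shape are hypothetical choices made for illustration; only the pattern itself comes from the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
class VolChParser {
    // "Vol" and "Ch" are treated as fixed markers, as the answer assumes.
    private static final Pattern VOL_CH = Pattern.compile("Vol (.*) Ch (.*)");

    // Returns {volume, chapter}, or null when the title lacks either marker.
    static String[] parse(String title) {
        Matcher m = VOL_CH.matcher(title);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = parse("Nisekoi Vol TBD Ch 215");
        System.out.println(parts[0] + " / " + parts[1]); // prints "TBD / 215"
    }
}
```

Because the match is anchored on "Vol" and "Ch" rather than on individual characters, a manga name that starts with T has no effect on the result. A name that itself contains "Vol " or " Ch " could still confuse the greedy groups, so this is a sketch under the stated assumption rather than a robust parser.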

FYI, you're getting an error because you're using a character class ( [...] ), which means "any single character from the following set" rather than "this sequence of characters".
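To make the character-class point concrete, here is a small sketch that runs the question's pattern unchanged and compares it with one possible alternative: matching the wanted tokens instead of replacing the unwanted characters. The class name CharClassDemo, the method names, the token pattern, and the sample title "Toriko Vol TBD Ch 10" are all assumptions made for illustration, not something from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names and the sample titles are illustrative only.
class CharClassDemo {
    // The pattern from the question, unchanged. Inside [...] the letters
    // T, B and D are three independent characters, so any T, B or D is kept.
    // (In a Java string literal, "\b" is also a backspace character, not a
    // regex word boundary.)
    static String original(String title) {
        return title.replaceAll("[^0-9.\bTBD\b]+", " ").trim();
    }

    // Alternative sketch: match either the whole word TBD or a number
    // with an optional decimal part, and collect the matches in order.
    private static final Pattern TOKEN =
            Pattern.compile("\\bTBD\\b|\\d+(?:\\.\\d+)?");

    static List<String> tokens(String title) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(title);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(original("Nisekoi Vol TBD Ch 215")); // TBD 215
        System.out.println(original("Toriko Vol TBD Ch 10"));   // T TBD 10
        System.out.println(tokens("Toriko Vol TBD Ch 10"));     // [TBD, 10]
    }
}
```

The token approach would still pick up digits that appear in the manga name itself, so anchoring on the fixed "Vol" and "Ch" markers is probably the safer choice when the title format is known.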

Without a regex, I'd try something like this:

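// Keep digits, dots, and the literal word "TBD"; drop everything else.
// Note: no separator is inserted, so the example title yields "TBD215".
StringBuilder sb = new StringBuilder(title.length());
for (int i = 0; i < title.length(); ++i) {
  char ch = title.charAt(i);
  if (ch == '.' || Character.isDigit(ch)) {
    sb.append(ch);
  } else if (ch == 'T' && title.startsWith("TBD", i)) {
    sb.append("TBD");
    i += 2; // skip the "BD" that was just appended
  }
}
title = sb.toString();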

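This should do the trick, where input is the title string. Note that the pattern assumes a volume of exactly three uppercase letters and a chapter of exactly three digits: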

Pattern pattern = Pattern.compile("Vol ([A-Z]{3}) Ch (\\d{3})");
Matcher matcher = pattern.matcher(input);
if(matcher.find()){
  String volume = matcher.group(1);
  String chapter = matcher.group(2);
}

There are many answers here, so here is mine, which extends the answer from Jan.

String title = "Nisekoi Vol TBD Ch 215.5";
Pattern pattern = Pattern.compile("[\\.\\d]+$");
Matcher matcher = pattern.matcher(title);
if (matcher.find()) {
  System.out.println("TBD " + matcher.group(0));
}

The output is: TBD 215.5.

This always matches the number at the end of the string, so it does not matter what comes before it. It also matches dots.
