I'm trying to retrieve compounds, contractions, or decimals from a line of text.
I've written regex for each:
contractions => ([a-zA-Z]+\'{1}[a-zA-Z]+)
YES: don't mm'nn m'n
NO: 't don' don'''t
decimal numbers => ([0-9]+\.{1}[0-9]+)
YES: 0.1 11.11
NO: .1 1. 1..0 mn
compound => ([a-zA-Z]+\-{1}[a-zA-Z]+)
YES: twenty-six mn
NO: twenty- -six twenty--six
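For reference, the three patterns can be checked against the YES/NO samples with a small harness along these lines. This is only a sketch: the class name `TokenPatterns` and the helper method names are made up for illustration, and `matches()` is used so that the whole token has to match rather than just a part of it.

```java
import java.util.regex.Pattern;

class TokenPatterns {
    // The three patterns exactly as written in the question (Java string escaping added).
    static final Pattern CONTRACTION = Pattern.compile("([a-zA-Z]+\\'{1}[a-zA-Z]+)");
    static final Pattern DECIMAL = Pattern.compile("([0-9]+\\.{1}[0-9]+)");
    static final Pattern COMPOUND = Pattern.compile("([a-zA-Z]+\\-{1}[a-zA-Z]+)");

    // matches() succeeds only if the entire token fits the pattern.
    static boolean isContraction(String token) {
        return CONTRACTION.matcher(token).matches();
    }

    static boolean isDecimal(String token) {
        return DECIMAL.matcher(token).matches();
    }

    static boolean isCompound(String token) {
        return COMPOUND.matcher(token).matches();
    }

    public static void main(String[] args) {
        System.out.println(isContraction("don't"));    // true
        System.out.println(isContraction("don'''t"));  // false
        System.out.println(isDecimal("11.11"));        // true
        System.out.println(isDecimal("1..0"));         // false
        System.out.println(isCompound("twenty-six"));  // true
        System.out.println(isCompound("twenty--six")); // false
    }
}
```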
What I'm doing is taking a paragraph held in one String and splitting it on whitespace so I get each word. Some words are bolded like so, and some obviously have commas and periods at the end.
What I cannot figure out is how to clean each word before I store it (I'm storing the words in an inverted index to search later):
How do I remove all special characters from a String except where they are part of a match for one of the regexes above? For example, if I encounter "don't," I want to store "don't"; if I encounter " twenty-six " I want to store "twenty-six"; and if I encounter "family," I want to store "family".
Try this regex: (?:\\s|^)(?!\\w+-\\w+|\\w+'\\w+|\\d+\\.\\d+).*?\\s
and replace with space:
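import java.util.regex.Matcher;
import java.util.regex.Pattern;

String content = "put your string here";
Pattern pattern = Pattern.compile("(?:\\s|^)(?!\\w+-\\w+|\\w+'\\w+|\\d+\\.\\d+).*?\\s");
Matcher matcher = pattern.matcher(content);
// Replace every token that is not a compound, contraction, or decimal with a space
String result = matcher.replaceAll(" ");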
Note that this will also delete words like family, because they don't match any of the categories you mentioned. Is that what you wanted?
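I replaced [a-zA-Z] with \\w and [0-9] with \\d, which I think makes the regex more readable. Be aware that \\w is slightly broader than [a-zA-Z]: it also matches digits and the underscore, so keep [a-zA-Z] if you need letters only. Also, you don't need the {1}: a token without a quantifier always matches exactly once.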
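One caveat about that substitution is easy to check: by default in Java, \\w stands for [a-zA-Z_0-9], so it accepts tokens that the letters-only class rejects. A quick sketch (the class and method names here are hypothetical) that compares the two compound patterns and confirms that {1} changes nothing:

```java
class WordClassDemo {
    // Letters-only compound, as in the question (without the redundant {1}).
    static boolean isCompoundStrict(String token) {
        return token.matches("[a-zA-Z]+-[a-zA-Z]+");
    }

    // Compound using \w, which also accepts digits and underscores.
    static boolean isCompoundLoose(String token) {
        return token.matches("\\w+-\\w+");
    }

    // Same contraction pattern with and without {1}; both must agree.
    static boolean quantifierIsRedundant(String token) {
        return token.matches("\\w+'{1}\\w+") == token.matches("\\w+'\\w+");
    }

    public static void main(String[] args) {
        System.out.println(isCompoundStrict("twenty-six")); // true
        System.out.println(isCompoundLoose("twenty-six"));  // true
        System.out.println(isCompoundStrict("3-4"));        // false
        System.out.println(isCompoundLoose("3-4"));         // true
        System.out.println(quantifierIsRedundant("don't")); // true
    }
}
```

If tokens such as 3-4 or a_b-c should not count as compounds, the [a-zA-Z] version is the safer choice.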
EDIT: If you want to remove special characters that are not part of any category from the sentence: [^\\w ]|(\\w+-\\w+|\\w+'\\w+|\\d+\\.\\d+)
and replace with $1 (Java replacement strings refer to groups with $1, not \\1):
String content = "put your string here";
Pattern pattern = Pattern.compile("[^\\w ]|(\\w+-\\w+|\\w+'\\w+|\\d+\\.\\d+)");
Matcher matcher = pattern.matcher(content);
// $1 is the captured compound, contraction, or decimal; it is empty when only a special character matched
String result = matcher.replaceAll("$1");
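Put together with the whitespace split from the question, the clean-up could look roughly like the sketch below. The class `TokenCleaner` and its methods are hypothetical names chosen for illustration, and the replacement is written as "$1" because Java's `replaceAll` refers to captured groups with a dollar sign rather than a backslash.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class TokenCleaner {
    // First alternative: any character that is neither a word character nor a space.
    // Group 1: a compound, contraction, or decimal that must be kept intact.
    private static final Pattern CLEANUP =
            Pattern.compile("[^\\w ]|(\\w+-\\w+|\\w+'\\w+|\\d+\\.\\d+)");

    // Keeps group 1 when it matched; otherwise the special character is dropped.
    static String clean(String token) {
        return CLEANUP.matcher(token).replaceAll("$1");
    }

    // Splits on whitespace, cleans each token, and skips tokens that end up empty.
    static List<String> tokens(String paragraph) {
        List<String> result = new ArrayList<>();
        for (String raw : paragraph.trim().split("\\s+")) {
            String cleaned = clean(raw);
            if (!cleaned.isEmpty()) {
                result.add(cleaned);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("I don't know, family, twenty-six or 0.1."));
        // prints [I, don't, know, family, twenty-six, or, 0.1]
    }
}
```

Each cleaned token can then be added to the inverted index. Tokens made up only of special characters, such as a lone dash, come out empty and are skipped.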