
Retrieving regex match from string and trimming other/special characters

I'm trying to retrieve compounds, contractions, or decimals from a line of text.

I've written regex for each:

contractions => ([a-zA-Z]+\'{1}[a-zA-Z]+)

YES: don't mm'nn m'n

NO: 't don' don'''t

decimal numbers => ([0-9]+\.{1}[0-9]+)

YES: 0.1 11.11

NO: .1 1. 1..0 mn

compound => ([a-zA-Z]+\-{1}[a-zA-Z]+)

YES: twenty-six mn

NO: twenty- -six twenty--six
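As a sanity check, the three patterns can be run against the unambiguous examples above using a full-string match. This is only a sketch: the class name `PatternCheck` and the helper `isWhole` are made up for illustration, and `matches()` is assumed to be the right test because each word is checked on its own after splitting.

```java
import java.util.regex.Pattern;

class PatternCheck {
    // The three patterns exactly as written above, escaped for a Java string literal.
    static final Pattern CONTRACTION = Pattern.compile("([a-zA-Z]+\\'{1}[a-zA-Z]+)");
    static final Pattern DECIMAL = Pattern.compile("([0-9]+\\.{1}[0-9]+)");
    static final Pattern COMPOUND = Pattern.compile("([a-zA-Z]+\\-{1}[a-zA-Z]+)");

    // True only if the whole word matches the pattern, not just a part of it.
    static boolean isWhole(Pattern p, String word) {
        return p.matcher(word).matches();
    }

    public static void main(String[] args) {
        System.out.println(isWhole(CONTRACTION, "don't"));     // a YES example
        System.out.println(isWhole(CONTRACTION, "don'''t"));   // a NO example
        System.out.println(isWhole(DECIMAL, "11.11"));         // a YES example
        System.out.println(isWhole(DECIMAL, "1..0"));          // a NO example
        System.out.println(isWhole(COMPOUND, "twenty-six"));   // a YES example
        System.out.println(isWhole(COMPOUND, "twenty--six"));  // a NO example
    }
}
```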

What I'm doing is getting a paragraph contained in one String, splitting the string by white space so I get each word. Some words are bolded like so and some obviously have commas, and periods. at the end.

What I cannot figure out is before I store each word (which I'm storing in an inverted index to search later):

How do I remove all special characters from a String unless it matches one of the regexes above? For example, if I encounter "don't," I want to store "don't"; if I encounter "twenty-six" I want to store "twenty-six"; and if I encounter "family," I want to store "family".

Try this regex: (?:\\s|^)(?!\\w+-\\w+|\\w+'\\w+|\\d+\\.\\d+).*?\\s and replace with space:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String content = "put your string here";
Pattern pattern = Pattern.compile("(?:\\s|^)(?!\\w+-\\w+|\\w+'\\w+|\\d+\\.\\d+).*?\\s");
Matcher matcher = pattern.matcher(content);
String result = matcher.replaceAll(" ");

Note that this will also delete words like "family", because they don't match any of the categories you mentioned. Is that what you wanted?

I replaced [a-zA-Z] with \\w and [0-9] with \\d, which I think makes the regex more readable. Be aware that \\w is slightly broader than [a-zA-Z]: it also matches digits and the underscore. Also, you don't need the {1}; an element without a quantifier is matched exactly once.
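One caveat is easy to verify: in Java, \w accepts digits and underscores in addition to letters, so the shorthand version is more permissive than the explicit ranges. The sketch below compares the two forms and also shows that {1} changes nothing. The class name `ShorthandCheck` and its method names are hypothetical.

```java
import java.util.regex.Pattern;

class ShorthandCheck {
    // Contraction pattern with explicit letter ranges, as in the question (minus {1}).
    static boolean strictContraction(String s) {
        return Pattern.matches("[a-zA-Z]+'[a-zA-Z]+", s);
    }

    // Contraction pattern with the \w shorthand, as in the answer.
    static boolean looseContraction(String s) {
        return Pattern.matches("\\w+'\\w+", s);
    }

    // Same as strictContraction but with the redundant {1} quantifier kept.
    static boolean strictWithQuantifier(String s) {
        return Pattern.matches("[a-zA-Z]+'{1}[a-zA-Z]+", s);
    }

    public static void main(String[] args) {
        // Both forms accept an ordinary contraction.
        System.out.println(strictContraction("don't") + " " + looseContraction("don't"));
        // Only the \w form accepts digits and underscores around the apostrophe.
        System.out.println(strictContraction("a1'b_2") + " " + looseContraction("a1'b_2"));
        // The {1} quantifier makes no difference.
        System.out.println(strictWithQuantifier("don't"));
    }
}
```

If digits or underscores should not count as part of a word, keeping [a-zA-Z] is the safer choice.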


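EDIT: If you want to remove the special characters that are not part of any category from the sentence, match [^\\w ]|(\\w+-\\w+|\\w+'\\w+|\\d+\\.\\d+) and replace with $1. In a Java replacement string a group is referenced as $1; \\1 would insert a literal 1 instead.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String content = "put your string here";
Pattern pattern = Pattern.compile("[^\\w ]|(\\w+-\\w+|\\w+'\\w+|\\d+\\.\\d+)");
Matcher matcher = pattern.matcher(content);
// Group 1 is empty when only a stray special character matched, so that character is dropped.
String result = matcher.replaceAll("$1");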
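To tie this back to the question's workflow (split on whitespace, clean each word, then store it), here is a minimal sketch. The class `TokenCleaner` and its methods `clean` and `tokens` are hypothetical names, the sample sentence is invented, and the step that stores the words in the inverted index is left out. The replacement uses Java's `$1` group reference.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class TokenCleaner {
    // Either a stray special character, or (group 1) a compound, contraction, or decimal.
    private static final Pattern CLEANUP =
        Pattern.compile("[^\\w ]|(\\w+-\\w+|\\w+'\\w+|\\d+\\.\\d+)");

    // Keeps protected tokens intact and drops every other special character.
    static String clean(String word) {
        return CLEANUP.matcher(word).replaceAll("$1");
    }

    // Splits a paragraph on whitespace and returns the cleaned, non-empty words.
    static List<String> tokens(String paragraph) {
        List<String> out = new ArrayList<>();
        for (String word : paragraph.trim().split("\\s+")) {
            String cleaned = clean(word);
            if (!cleaned.isEmpty()) {
                out.add(cleaned);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [I, don't, know, the, family, has, twenty-six, cats, and, 0.1, dogs]
        System.out.println(tokens("I don't know, the family has twenty-six cats and 0.1 dogs."));
    }
}
```

Malformed input is stripped, not rejected: `clean("don'''t")` yields `dont`, because none of the three alternatives matches and each apostrophe is removed individually.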


 