Split words in Java with java.util.regex

Question

I have text like this:

The C language is%y% widely used today in application, operating system, and embedded system development, and its influence is seen in most modern programming languages. UNIX has also been influential, establishing %y% concepts and principles that are now precepts of computing.%p%

The text contains some unwanted markers: %y% and %p%.

I extract the words using this regex:

Pattern p = Pattern.compile("[a-zA-Z]+");

This finds all the words, but it also returns the letters "y" and "p" from the markers. How can I ignore these markers?

Answer 1

You could use a pre-processing step to remove all of the unnecessary markers before you do your main processing. Something like this should work (note that replaceAll returns a new string, so the result has to be assigned):

string = string.replaceAll("%y%|%p%", "");
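To show how the pre-processing step fits together with the word pattern from the question, here is a minimal sketch. The class name MarkerStripper and the method name words are made up for illustration, and the sample sentence is a shortened version of the text in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
class MarkerStripper {
    // The two markers named in the question.
    private static final Pattern MARKERS = Pattern.compile("%y%|%p%");
    // The word pattern from the question.
    private static final Pattern WORD = Pattern.compile("[a-zA-Z]+");

    static List<String> words(String text) {
        // Strings are immutable, so keep the cleaned copy that replaceAll returns.
        String cleaned = MARKERS.matcher(text).replaceAll("");
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(cleaned);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [The, C, language, is, widely, used]
        System.out.println(words("The C language is%y% widely used.%p%"));
    }
}
```

Compiling the two patterns once as constants avoids recompiling them on every call, which matters if the method runs over many lines of text.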

Answer 2

Or you can treat the markers as separate tokens, and filter them out later:

Pattern p = Pattern.compile("[a-zA-Z]+|%[a-z]%");

By the way, you should not use [a-zA-Z] for natural-language text: even English text can contain words like café and names like Björn. For this, java.util.regex.Pattern supports the Unicode letter category \\p{L}, along with \\p{Ll} (lowercase letters only) and \\p{Lu} (uppercase letters only), which match such words just fine. (The doubled backslash is how these are written inside a Java string literal.)
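The two ideas in this answer can be combined in one rough sketch: match either a run of Unicode letters or a marker, and keep only the letter runs. The class name UnicodeWordTokenizer, the method name words, and the capturing group used to tell words from markers are choices made for this example, not part of the answer itself. The accented letters are written as \u escapes only so the source compiles regardless of file encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
class UnicodeWordTokenizer {
    // Group 1 captures a real word; the second alternative consumes a
    // marker such as %y% so that its letter is never seen as a word.
    private static final Pattern TOKEN = Pattern.compile("(\\p{L}+)|%\\p{Ll}%");

    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            // group(1) is null when the marker alternative matched.
            if (m.group(1) != null) {
                result.add(m.group(1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // "caf\u00e9" is café and "Bj\u00f6rn" is Björn.
        System.out.println(words("caf\u00e9 Bj\u00f6rn is%y% here%p%"));
    }
}
```

With this approach the input text is scanned only once and no intermediate cleaned string is created; the trade-off is that the pattern is slightly harder to read than a plain replaceAll followed by a word match.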

Answer 3

If the only markers are "%y%" and "%p%", you could keep it simple and just remove them before applying the regex.

For example:

myString = myString.replaceAll("%y%|%p%", "");
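One caveat with removing the markers outright: if a marker ever sits directly between two words with no whitespace, replacing it with the empty string fuses those words. The sketch below contrasts that with replacing the marker by a single space. The class and method names and the input "foo%y%bar" are hypothetical; the text in the question does not contain such a case.

```java
// Hypothetical helper class; the names are illustrative only.
class MarkerSpacing {
    // Removes the markers entirely, as in the answer above.
    static String removeMarkers(String text) {
        return text.replaceAll("%y%|%p%", "");
    }

    // Replaces each marker with a space so adjacent words stay separate.
    static String spaceOutMarkers(String text) {
        return text.replaceAll("%y%|%p%", " ");
    }

    public static void main(String[] args) {
        System.out.println(removeMarkers("foo%y%bar"));   // prints foobar
        System.out.println(spaceOutMarkers("foo%y%bar")); // prints foo bar
    }
}
```

Since the word regex skips whitespace anyway, the extra spaces left by the second variant do not affect the extracted words.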

3 answers

solution1
2 ACCEPTED 2011-11-14 23:54:46

solution2
1 2011-11-15 01:07:02

solution3
0 2011-11-14 23:57:05
