My input looks like this:
line 1
ER1.RIAA.SOMPSFIO(LIAOEE) UTGD788 FDSJOFUZZÄ
line 2
JNDJZSDS ER1.RIAA.SIMEDFUA(AUDD) YIRIHFIH1465EZZÄ
line 3
UJZRJOERERÃLDE,UIE='UJ1.DHZKZ5.OZDEZN98.AAERRE',I=DZEDE POPZEOE
I would like to extract only the dot-separated names, i.e.:
ER1.RIAA.SOMPSFIO
ER1.RIAA.SIMEDFUA
UJ1.DHZKZ5.OZDEZN98.AAERRE
My current attempt is:
try {
    StringBuilder sb = new StringBuilder();
    String line = br.readLine();
    while (line != null) {
        nrligne++;
        int counter = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == '.') {
                counter++;
            }
        }
        if (counter == 2) {
            if (line.matches("^.*[A-Z0-9]+\\..[A-Z1-9]+.*$")) {
                line = removeTroublesomeCharacters(line);
                System.out.println("ligne vaut " + line);
                Pattern dsnPattern = Pattern.compile("^.*([A-Z0-9]+)\\..([A-Z1-9]+)\\..([A-Z1-9]+).*$");
                Matcher m = dsnPattern.matcher(line);
                if (m.matches()) {
                    String part1 = m.group(1);
                    String part2 = m.group(2);
                    String part3 = m.group(3);
                    System.out.println("part1 vaut " + part1);
                    System.out.println("part2 vaut " + part2);
                    System.out.println("part2 vaut " + part3);
                }
            }
For the moment, the result is:
ligne vaut ER1.RIAA.SOMPSFIO(LIAOEE) UTGD788
part1 vaut 1
part2 vaut IAA
part2 vaut OMPSFIO
ligne vaut PZFDSJOFUZZÃâ ER1.RIAA.SIMEDFUA(AUDD) UOOO88
part1 vaut 1
part2 vaut IAA
part2 vaut IMEDFUA
ligne vaut UJZRJOERERÃLDE,UIE='UJ1.DHZKZ5.OZDEZN98',I=DZEDE POPZEOE
part1 vaut 1
part2 vaut HZKZ5
part2 vaut ZDEZN98
Input file: http://uploadhero.co/dl/PWBLhi7d
I don't understand why the regex eats the beginning of each part. Can someone help me fix this?
Because you are consuming an extra character after each dot, and that character is not part of the character class that follows:
\\.. // this matches a literal dot, and then any single character after it
Also, change the .* at the beginning of your regex to .*?. Quantifiers are greedy by default, so .* consumes as many characters as it can and leaves only a single character before the . to be matched by ([A-Z0-9]+).
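To see both effects concretely, here is a minimal sketch that runs the pattern from the question next to a variant with a reluctant .*? and without the stray dot. The class name GreedyDemo and the helper parts are made up for illustration, and the sample line is the first one from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
    // Pattern from the question: greedy leading ".*" plus a stray "." after each "\\."
    static final Pattern ORIGINAL =
        Pattern.compile("^.*([A-Z0-9]+)\\..([A-Z1-9]+)\\..([A-Z1-9]+).*$");

    // Variant with a reluctant ".*?" and no stray "."
    static final Pattern FIXED =
        Pattern.compile("^.*?([A-Z0-9]+)\\.([A-Z1-9]+)\\.([A-Z1-9]+).*$");

    // Hypothetical helper: returns the three captured groups, or null if the line does not match.
    static String[] parts(Pattern p, String line) {
        Matcher m = p.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String line = "ER1.RIAA.SOMPSFIO(LIAOEE) UTGD788";
        System.out.println(String.join(" | ", parts(ORIGINAL, line))); // 1 | IAA | OMPSFIO
        System.out.println(String.join(" | ", parts(FIXED, line)));    // ER1 | RIAA | SOMPSFIO
    }
}
```

The first line of output reproduces what the question reports: the greedy .* leaves only "1" for the first group, and the stray dot swallows the first letter of the second and third groups.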
Change your regex to:
"^.*?([A-Z0-9]+)\\.([A-Z1-9]+)\\.([A-Z1-9]+).*$"
Also, since you are using Pattern and Matcher anyway, I would consider using the Matcher#find() method and building a pattern for just the part that I need:
Pattern dsnPattern = Pattern.compile("([A-Z0-9]+)\\.([A-Z1-9]+)\\.([A-Z1-9]+)");
Matcher m = dsnPattern.matcher(line);
if (m.find()) {
    // m.group(1), m.group(2) and m.group(3) hold the three parts
}
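As a rough sketch of where the find() approach can lead, the class below collects every dot-separated name on a line, however many segments it has. Note the assumptions: the pattern [A-Z0-9]+(?:\.[A-Z0-9]+)+ is my generalization, not the pattern given above, and it allows the digit 0 in every segment. The names DottedNameExtractor and extract are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DottedNameExtractor {
    // Assumption: a name is two or more runs of [A-Z0-9] joined by single dots.
    private static final Pattern DOTTED_NAME =
        Pattern.compile("[A-Z0-9]+(?:\\.[A-Z0-9]+)+");

    // Returns every dotted name found in the line, in order of appearance.
    static List<String> extract(String line) {
        List<String> names = new ArrayList<>();
        Matcher m = DOTTED_NAME.matcher(line);
        while (m.find()) {
            names.add(m.group()); // whole match, e.g. "ER1.RIAA.SOMPSFIO"
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(extract("ER1.RIAA.SOMPSFIO(LIAOEE) UTGD788"));
        System.out.println(extract("LDE,UIE='UJ1.DHZKZ5.OZDEZN98.AAERRE',I=DZEDE"));
    }
}
```

Because find() scans for the pattern anywhere in the input, there is no need for the leading and trailing .* and no need to count dots beforehand.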
Since the capturing parts of your expression are preceded and followed by the "eat anything" .* expression, part of what you wish to capture ends up being consumed by these "fillers".
You can explicitly require that the characters before and after the [A-Z0-9]+ groups be non-alphanumeric, like this:
"^.*(?<![A-Z0-9])([A-Z0-9]+)\\.([A-Z1-9]+)\\.([A-Z1-9]+)(?![A-Z1-9]).*$"
(?<![A-Z0-9]) means "not preceded by [A-Z0-9]".
(?![A-Z1-9]) means "not followed by [A-Z1-9]".
EDIT: The lookahead is not necessary, because + is greedy:
"^.*(?<![A-Z0-9])([A-Z0-9]+)\\.([A-Z1-9]+)\\.([A-Z1-9]+).*$"
(credit for this goes to Rohit Jain )
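The effect of the lookbehind can be checked with a small sketch like the one below. The class name LookbehindDemo and the helper firstGroup are made up for illustration. Both patterns keep the greedy leading .* so that the lookbehind is the only difference between them.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindDemo {
    // Greedy ".*" with nothing to stop it from eating into the first group.
    static final Pattern WITHOUT_LOOKBEHIND =
        Pattern.compile("^.*([A-Z0-9]+)\\.([A-Z1-9]+)\\.([A-Z1-9]+).*$");

    // Same pattern, but the first group may not start in the middle of an alphanumeric run.
    static final Pattern WITH_LOOKBEHIND =
        Pattern.compile("^.*(?<![A-Z0-9])([A-Z0-9]+)\\.([A-Z1-9]+)\\.([A-Z1-9]+).*$");

    // Hypothetical helper: returns the first captured group, or null if the line does not match.
    static String firstGroup(Pattern p, String line) {
        Matcher m = p.matcher(line);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "ER1.RIAA.SOMPSFIO(LIAOEE) UTGD788";
        System.out.println(firstGroup(WITHOUT_LOOKBEHIND, line)); // 1
        System.out.println(firstGroup(WITH_LOOKBEHIND, line));    // ER1
    }
}
```

One caveat: with a name of four segments such as UJ1.DHZKZ5.OZDEZN98.AAERRE, the greedy .* still pushes the match as far right as it can, so this pattern captures the last three segments (DHZKZ5, OZDEZN98, AAERRE) rather than the first three.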