I am reading text from a PDF using PDFBox and apparently, at least on Windows, it inserts a Unicode line-break character wherever a line ends.
My question is: how can I prevent this line-break character from being concatenated to the string in the code below?
tokenizer = new StringTokenizer(Text, "\\.");
while (tokenizer.hasMoreTokens())
{
    String x = tokenizer.nextToken();
    flag = 0;
    for (final String s : x.split(" ")) {
        if (flag == 1)
            break;
        if (Keyword.toLowerCase().equals(s.toLowerCase()) && !"".equals(s)) {
            sum += x + "."; // here need first to check for the line-break character
                            // before concatenating the String "x" to String "sum"
            flag = 1;
        }
    }
}
You should discard the line separators when you split, e.g.:
for (final String s : x.split("\\s+")) {
That makes the word separator one or more whitespace characters.
(Using trim() won't work in all cases. Suppose that x contains "word\r\nword". Splitting on " " won't split between the two words, so s will be "word\r\nword" at some point. Then s.trim() won't remove the line-break characters, because they are not at the ends of the string.)
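To make the difference concrete, here is a minimal sketch comparing the two splits on a string with an embedded line break. The class and method names (SplitDemo, splitOnSpace, splitOnWhitespace) are made up for illustration and are not part of the original code.

```java
import java.util.Arrays;

class SplitDemo {
    // Splits on single space characters only, as in the question's loop.
    static String[] splitOnSpace(String x) {
        return x.split(" ");
    }

    // Splits on runs of any whitespace, which includes \r and \n.
    static String[] splitOnWhitespace(String x) {
        return x.split("\\s+");
    }

    public static void main(String[] args) {
        String x = "first word\r\nword last";
        // The middle element here still contains the embedded "\r\n".
        System.out.println(Arrays.toString(splitOnSpace(x)));
        // Here the line break acts as a separator like any other whitespace.
        System.out.println(Arrays.toString(splitOnWhitespace(x)));
    }
}
```

Note that if x starts with whitespace, split("\\s+") still yields an empty first element, so the !"".equals(s) check in the question remains useful.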
UPDATE
I just spotted that you are actually appending x, not s. So you also need to do something like this:
sum += x.replaceAll("\\s+", " ") + ".";
That does a bit more than you asked for. It replaces each whitespace sequence with a single space.
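As a quick illustration of what that replacement does, here is a small sketch. The class and method names (NormalizeDemo, collapseWhitespace) are hypothetical.

```java
class NormalizeDemo {
    // Replaces every run of whitespace (spaces, tabs, \r, \n) with one space.
    static String collapseWhitespace(String x) {
        return x.replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        // A "\r\n" pair counts as a single whitespace run, so it becomes one space.
        System.out.println(collapseWhitespace("one\r\ntwo   three\tfour"));
    }
}
```

Leading or trailing whitespace is reduced to a single space rather than removed, so add a trim() if you don't want it.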
By the way, your code would be simpler and more efficient if you used a break to get out of the loop rather than messing around with a flag. (And Java has a boolean type ... for heaven's sake!)
if (Keyword.toLowerCase().equals(s.toLowerCase()) && !"".equals(s)) {
    sum += ....
    break;
}
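One way to get rid of the flag entirely is to move the word check into a small helper that returns as soon as it finds a match. This is only a sketch: the names KeywordLoopDemo and containsWord are invented here, and it uses equalsIgnoreCase in place of the two toLowerCase() calls, which is my substitution rather than something from the question.

```java
class KeywordLoopDemo {
    // Returns true if any whitespace-separated word of the sentence
    // equals the keyword, ignoring case. Returning early plays the
    // role of the break, so no flag variable is needed.
    static boolean containsWord(String sentence, String keyword) {
        for (String s : sentence.split("\\s+")) {
            if (!s.isEmpty() && s.equalsIgnoreCase(keyword)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsWord("The quick\r\nBrown fox", "brown"));
    }
}
```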
Are you sure you want to be adding x here?
if (Keyword.toLowerCase().equals(s.toLowerCase()) && !"".equals(s)) {
    sum += x + "."; // here need first to check for the line-break character
                    // before concatenating the String "x" to String "sum"
    flag = 1;
}
Don't you want s?
sum += s + ".";
UPDATE
Oh, I see. So what you really want is something more like:
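// Needs: import java.util.StringTokenizer; import java.util.regex.Pattern;
// StringTokenizer treats its delimiter as a set of characters, not a regex,
// so "." is enough ("\\." would also split on backslashes).
tokenizer = new StringTokenizer(Text, ".");
Pattern KEYWORD = Pattern.compile("\\b" + Keyword + "\\b", Pattern.CASE_INSENSITIVE);
StringBuilder sb = new StringBuilder(sum);
while (tokenizer.hasMoreTokens()) {
    String x = tokenizer.nextToken();
    // Keep only the sentences that contain the keyword as a whole word.
    if (KEYWORD.matcher(x).find()) {
        // Collapse each whitespace run, line breaks included, to one space.
        sb.append(x.replaceAll("\\s+", " ")).append('.');
    }
}
sum = sb.toString();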
(This assumes Keyword starts and ends with letters, and doesn't itself contain any regex metacharacters.)
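If the keyword might contain regex metacharacters, one option is to wrap it in Pattern.quote. Below is a self-contained sketch of the same approach as a method. The class and method names (SentenceFilter, sentencesContaining) are hypothetical, the use of Pattern.quote is my addition, and the tokenizer is given "." as its delimiter because StringTokenizer treats that argument as a set of characters rather than a regex. The \b anchors still only behave as expected when the keyword starts and ends with a letter or digit.

```java
import java.util.StringTokenizer;
import java.util.regex.Pattern;

class SentenceFilter {
    // Returns every '.'-terminated sentence of text that contains the
    // keyword as a whole word (case-insensitive), with each whitespace
    // run, line breaks included, collapsed to a single space.
    static String sentencesContaining(String text, String keyword) {
        // Pattern.quote makes the keyword match literally.
        Pattern p = Pattern.compile(
                "\\b" + Pattern.quote(keyword) + "\\b",
                Pattern.CASE_INSENSITIVE);
        StringBuilder sb = new StringBuilder();
        StringTokenizer tokenizer = new StringTokenizer(text, ".");
        while (tokenizer.hasMoreTokens()) {
            String x = tokenizer.nextToken();
            if (p.matcher(x).find()) {
                sb.append(x.replaceAll("\\s+", " ")).append('.');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "The cat\r\nsat down. A dog barked. Another CAT\nappeared.";
        System.out.println(sentencesContaining(text, "cat"));
    }
}
```

Like the snippet above, this keeps the space that follows each full stop, so the selected sentences stay separated in the result.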