
java - omitting special characters from text extraction

I have a program that extracts text (words) from a PDF file and inserts those words into a table in a database.

During insertion, I apply a regular expression to strip special characters from each word. The rule is: any special characters at the beginning or at the end of a word are removed.

Example:

Text: `,test.`
Token: `test`

Text: `?good`
Token: `good`

Text: `?,.`
Token: (empty)

Text: `www.stack.com`
Token: `www.stack.com`

As long as there is no space between the characters, special characters inside a word stay. At least, this is how I defined it.

This is the general rule I use to decide which words get stored. However, it breaks down for tokens that contain underscores:

Text: `_`
Token: `_` (unchanged)

Text: `_—,m‘—_`
Token: `_—,m‘—_` (unchanged)

The underscore does not seem to be treated as a special character.

My code:

    String[] lines = text.split("\\r?\\n");
    for (String line : lines) {
        String[] words = line.split(" ");

        System.out.println("Line: " + line);

        preparedStatement = con1.prepareStatement(sql);
        for (String word : words) {
            // Remove one or more special characters from the end of the word,
            // or from the beginning of the word, then insert the result into the table.
            word = word.replaceAll("([\\W]+$)|(^[\\W]+)", "");
            preparedStatement.setString(1, path1);
            preparedStatement.setString(2, word);
            System.out.println("Token: " + word);
            preparedStatement.executeUpdate();
        }
    }

Is there a way to properly ignore every possible combination of special characters or symbols?

The definition of `\W` (written `\\W` in a Java string literal) is `[^a-zA-Z_0-9]` (see the Java Pattern API). The underscore is therefore a word character, and `\W` never matches it.
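A quick way to confirm this is to test single characters against `\W`. The class and method names below are made up for illustration; only `Pattern.matches` comes from the standard library.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: checks whether a one-character string matches \W.
final class NonWordDemo {
    static boolean isNonWord(String s) {
        // \W is the complement of \w, and \w is [a-zA-Z_0-9].
        return Pattern.matches("\\W", s);
    }

    public static void main(String[] args) {
        System.out.println(isNonWord("_")); // false: underscore is a word character
        System.out.println(isNonWord("?")); // true
        System.out.println(isNonWord("a")); // false
    }
}
```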

So to get the same behaviour but with underscores also treated as special characters, replace `\\W` with `[^a-zA-Z0-9]`.

Your line of code would then be:

    word = word.replaceAll("([^a-zA-Z0-9]+$)|(^[^a-zA-Z0-9]+)", "");
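As a minimal sketch, the trimming rule can be pulled into a small helper and checked against the examples from the question. `TokenCleaner`, `stripEdges` and `tokens` are hypothetical names, and skipping empty tokens is an assumption about what should be stored, not something the question specifies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original program.
final class TokenCleaner {
    // Runs of non-alphanumeric ASCII characters at the start or at the end of a word.
    private static final Pattern EDGE_SPECIALS =
            Pattern.compile("^[^a-zA-Z0-9]+|[^a-zA-Z0-9]+$");

    // Strips leading and trailing special characters; interior ones are kept.
    static String stripEdges(String word) {
        return EDGE_SPECIALS.matcher(word).replaceAll("");
    }

    // Splits a line on single spaces and drops words that are empty after stripping
    // (assumption: empty tokens should not be inserted into the table).
    static List<String> tokens(String line) {
        List<String> result = new ArrayList<>();
        for (String word : line.split(" ")) {
            String token = stripEdges(word);
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(stripEdges(",test."));        // test
        System.out.println(stripEdges("www.stack.com")); // www.stack.com
        System.out.println(tokens("?good ?,. _ www.stack.com"));
    }
}
```

Note that this character class is ASCII-only. If the PDFs contain accented or non-Latin letters, a Unicode-aware class such as `[^\\p{L}\\p{N}]` may be a better fit.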

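You can use the following to remove all special characters (except spaces). Note that this also removes special characters inside a word, not only at its ends.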

    word = word.replaceAll("[^ a-zA-Z0-9]", "");
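To show what removing every special character does to the question's examples, here is a small sketch using a negated character class that keeps only spaces, ASCII letters and digits. `SpecialCharRemover` and `removeAll` are hypothetical names.

```java
// Hypothetical helper: removes every character that is not a space, ASCII letter or digit.
final class SpecialCharRemover {
    static String removeAll(String text) {
        return text.replaceAll("[^ a-zA-Z0-9]", "");
    }

    public static void main(String[] args) {
        System.out.println(removeAll("?good, test.")); // good test
        System.out.println(removeAll("www.stack.com")); // wwwstackcom
    }
}
```

Because the interior dots are removed too, this does not satisfy the question's rule that `www.stack.com` should be stored unchanged. It suits the case where no special characters should survive at all.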
