
How to use regular expressions in Java to remove certain characters

The general question: how do I parse a string, eliminate the punctuation, and replace some of the punctuation marks?

I'm trying to modify some input text. I have a normal text file with punctuation, and I want all of it removed. If the symbol is a . ! ? or ... I want to replace it with a "<s>" string.

I have never used regex, so I tried string comparison, but obviously that isn't sufficient for all cases. I have trouble when there are two punctuation marks in a row, as in the text "the second Day (the 4th).", where ) and . appear together.

For example, from the given input I expect the following:

Input: [...] at it!" This speech caused
Expected output: at it <s> this speech caused

Every word in my code is added to an ArrayList because I need to work with that later.

Thanks a lot!

FileInputStream fileInputStream = new FileInputStream("TEXT.txt");
InputStreamReader inputStreamReader = new InputStreamReader(
        fileInputStream, "UTF-8");
BufferedReader bf = new BufferedReader(inputStreamReader);

words.add("<s>");
String s;
while ((s = bf.readLine()) != null) {
    String[] var = s.split(" ");

    for (int i = 0; i < var.length; i++) {
        if (var[i].endsWith(",") || var[i].endsWith(")")
                || var[i].endsWith("(") || var[i].endsWith(":") 
                ||  var[i].endsWith(";") ||var[i].endsWith("'")) {
            var[i] = var[i].substring(0, var[i].length() - 1);
            words.add(var[i].toLowerCase());
        } else if ( var[i].startsWith("'")) {
            var[i] = var[i].substring(1, var[i].length() );
            words.add(var[i].toLowerCase());
        } else if (var[i].endsWith(".") || var[i].endsWith("...")
                || var[i].endsWith("!") || var[i].endsWith("?")) {
            var[i] = var[i].substring(0, var[i].length() - 1);
            words.add(var[i].toLowerCase());
            words.add("<s>");
        } else {
            words.add(var[i].toLowerCase());
            // System.out.println("\n newly read word: " + var[i]);
        }
    }
}

Your code shows a lot of conditions; however, let's assume that you just want to remove ALL instances of the '.', '?', or '!' characters.

The regex that locates those characters is [.!?]. The brackets denote a "character class", which matches any single one of the characters inside the brackets. This lets us list several characters to match. Inside a character class the dot is a literal dot, so it needs no escaping.
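To make the "any single one" behaviour concrete, here is a small sketch using java.util.regex. The class name CharClassDemo, the helper countMatches, and the sample sentence are made up for illustration. It counts the matches of [.!?] and compares them with [.!?]+, which treats a whole run of such characters as one match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original question.
final class CharClassDemo {

    // Counts the non-overlapping matches of the given regex in the text.
    static int countMatches(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "Wait... what?!";
        // Each match of [.!?] covers exactly one character.
        System.out.println(countMatches("[.!?]", text));
        // With the + quantifier, a run of punctuation is a single match.
        System.out.println(countMatches("[.!?]+", text));
    }
}
```

The second pattern is the one to reach for if a run such as "..." or "?!" should be treated as a single sentence end.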

Let's assume that you've loaded your entire file into a string named "myText".

myText = myText.replaceAll("[.!?]", ""); // strings are immutable, so keep the returned value

That's it! If the characters may only be removed from certain places, that complicates things. If you need help with such conditions, please edit your post to include all of the special cases where this should NOT occur.

NOTE: Since you obviously aren't loading the entire file into a single string, you can just call this method on each line as you read it from the buffer.
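A minimal sketch of that per-line approach might look like the following. The class name LineCleaner and the method stripSentencePunctuation are hypothetical, and the method takes any Reader so that it is not tied to a particular file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the names are illustrative only.
final class LineCleaner {

    // Reads the source line by line and removes every '.', '!' and '?'.
    static List<String> stripSentencePunctuation(Reader source) throws IOException {
        List<String> cleaned = new ArrayList<>();
        // try-with-resources closes the reader even if readLine() throws.
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // replaceAll returns a new string; the original line is unchanged.
                cleaned.add(line.replaceAll("[.!?]", ""));
            }
        }
        return cleaned;
    }
}
```

For the file in the question, the argument could be new InputStreamReader(new FileInputStream("TEXT.txt"), "UTF-8").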

First use a regex to filter out the punctuation, and only then split the line on spaces and add the result to your list:

FileInputStream fileInputStream = new FileInputStream("TEXT.txt");
InputStreamReader inputStreamReader = new InputStreamReader(
        fileInputStream, "UTF-8");
BufferedReader bf = new BufferedReader(inputStreamReader);
words.add("<s>");
String s;
while ((s = bf.readLine()) != null) {
    s = s.replaceAll("[^a-zA-Z ]", ""); // remove everything except ASCII letters and spaces
    String[] var = s.split(" ");
    words.addAll(Arrays.asList(var)); // addAll needs a Collection, not an array (import java.util.Arrays)
}
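The snippet above removes sentence-ending punctuation along with everything else, so it does not produce the <s> markers the question asks for. One possible way to get them, sketched under a few assumptions: strip all punctuation except . ! ? first, then turn each run of those marks into a single <s> token. The class name SentenceTokenizer and the method tokenize are hypothetical, and \p{L} and \p{N} are used instead of a-zA-Z so that non-ASCII letters and digits survive, which the question does not explicitly require.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper; one possible approach, not the only one.
final class SentenceTokenizer {

    // Splits one line into lower-case words, with "<s>" marking each sentence end.
    static List<String> tokenize(String line) {
        String normalized = line
                // Drop everything except letters, digits, whitespace and . ! ?
                .replaceAll("[^\\p{L}\\p{N}\\s.!?]", "")
                // A run of sentence-ending marks becomes one <s> token.
                .replaceAll("[.!?]+", " <s> ")
                .toLowerCase(Locale.ROOT)
                .trim();

        List<String> tokens = new ArrayList<>();
        if (!normalized.isEmpty()) {
            // Split on any amount of whitespace to avoid empty tokens.
            for (String token : normalized.split("\\s+")) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

Inside the read loop this would be used as words.addAll(SentenceTokenizer.tokenize(s)). Note that apostrophes are removed as well, so "don't" becomes "dont"; whether that is acceptable depends on the later processing.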

You have to use

String.replaceAll(<your RegEx>, "");

To build your RegEx (and learn how they work) you can use https://regexr.com

Note: you need to replace each \ in the regex you build there with \\ in order to comply with Java's string escape rules.
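As a small illustration of the escaping rule, the regex \d+ (one or more digits) and the regex \. (a literal dot) have to be written with doubled backslashes inside a Java string literal. The class and method names below are made up for this sketch.

```java
// Hypothetical demo class showing backslash escaping in Java string literals.
final class EscapeDemo {

    // The regex \d+ is written as "\\d+" in Java source code.
    static String maskDigits(String text) {
        return text.replaceAll("\\d+", "#");
    }

    // The regex \. (a literal dot outside a character class) is written as "\\.".
    static String removeLiteralDots(String text) {
        return text.replaceAll("\\.", "");
    }
}
```

An unescaped "." would instead match any character, so replaceAll(".", "") would empty the whole string.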
