
Filter words from string

I want to filter a string.

Basically when someone types a message, I want certain words to be filtered out, like this:

User types: hey guys lol omg -omg mkdj*Omg*ndid

I want the filter to run and:

Output: hey guys lol - mkdjndid

The filtered words need to be loaded from an ArrayList that contains several words to filter out. At the moment I am doing if(message.contains("omg")), but that doesn't work if someone types zomg or -omg or similar.

Try:

input.replaceAll("(\\*?)[oO][mM][gG](\\*?)", "").split(" ")

Use replaceAll with a regex built from the bad word:

message = message.replaceAll("(?i)\\b[^\\w -]*" + badWord + "[^\\w -]*\\b", "");

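This handles your test case (removing the standalone word leaves both of its neighbouring spaces, so the result has a double space after lol):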

import java.util.Arrays;
import java.util.List;

public static void main( String[] args ) {
    List<String> badWords = Arrays.asList( "omg", "black", "white" );
    String message = "hey guys lol omg -omg mkdj*Omg*ndid";
    // Remove each bad word, case-insensitively, along with any punctuation wrapped around it
    for ( String badWord : badWords ) {
        message = message.replaceAll("(?i)\\b[^\\w -]*" + badWord + "[^\\w -]*\\b", "");
    }
    System.out.println( message );
}
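One caveat with concatenating badWord straight into the regex is that any regex metacharacters in the word are interpreted rather than matched literally. A minimal sketch of one way around that, assuming the same word-boundary pattern as above: quote each word with Pattern.quote and compile a single alternation once instead of calling replaceAll per word. The class name QuotedWordFilter and the sample word "a.b" are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class QuotedWordFilter {

    // Builds one case-insensitive pattern that matches any listed word, using the
    // same optional punctuation wrapper as the regex above.
    // Assumes badWords is non-empty.
    static Pattern compile(List<String> badWords) {
        String alternation = badWords.stream()
                .map(Pattern::quote)               // treat each word as literal text
                .collect(Collectors.joining("|"));
        return Pattern.compile("(?i)\\b[^\\w -]*(?:" + alternation + ")[^\\w -]*\\b");
    }

    static String filter(Pattern pattern, String message) {
        return pattern.matcher(message).replaceAll("");
    }

    public static void main(String[] args) {
        // "a.b" is a hypothetical entry; unquoted, its dot would also match "axb"
        Pattern pattern = compile(Arrays.asList("omg", "a.b"));
        System.out.println(filter(pattern, "hey guys lol omg -omg mkdj*Omg*ndid"));
        System.out.println(filter(pattern, "axb a.b"));
    }
}
```

Compiling once also avoids recompiling the regex for every word on every message, which is what String.replaceAll does internally.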

Dave gave you the answer already, but I will emphasize the point here. You will run into a problem if you implement the filter as a simple for-loop that replaces every occurrence of a filtered word. For example, if you filter the word ass and replace it with 'butt', the word 'classic' becomes 'clbuttic', which doesn't make any sense. I would therefore suggest using a word list, like the ones stored on Linux under the /usr/share/dict/ directory, to check whether a word is valid or needs filtering. That said, I don't quite get what you are trying to do.
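To make the 'clbuttic' problem concrete, here is a rough sketch that compares a naive replace with one that first checks each token against a dictionary. The class name and the tiny in-memory DICTIONARY set are assumptions standing in for a real word list file; matching here is case-sensitive and splits on single spaces only.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class DictionaryAwareFilter {

    // Hypothetical stand-in for a real word list such as a file under /usr/share/dict/
    static final Set<String> DICTIONARY = new HashSet<>(Arrays.asList("classic", "bass", "hey"));

    // Naive approach: replaces the bad word wherever it occurs, even inside valid words
    static String naive(String message, String badWord, String replacement) {
        return message.replace(badWord, replacement);
    }

    // Leaves any token that is a known dictionary word untouched
    static String dictionaryAware(String message, String badWord, String replacement) {
        String[] tokens = message.split(" ");
        for (int i = 0; i < tokens.length; i++) {
            boolean valid = DICTIONARY.contains(tokens[i].toLowerCase());
            if (!valid) {
                tokens[i] = tokens[i].replace(badWord, replacement);
            }
        }
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        System.out.println(naive("a classic ass", "ass", "butt"));           // a clbuttic butt
        System.out.println(dictionaryAware("a classic ass", "ass", "butt")); // a classic butt
    }
}
```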

I ran into this same problem and solved it in the following way:

1) Have a Google spreadsheet with all the words that I want to filter out

2) Download the Google spreadsheet directly into my code with the loadConfigs method (see below)

3) Replace all l33tsp33k characters with their respective letters

4) Remove everything except letters from the sentence

5) Run an algorithm that efficiently checks every substring of the input against the list. This part is key: you don't want to loop over your ENTIRE list every time to see if your word is in it. In my case, I generated every substring of the input and checked each one against a HashMap (O(1) lookup). This way the runtime grows with the length of the input string, not with the size of the word list.

6) Check that the match is not part of a good word (e.g. bass contains *ss). The good words are also loaded from the spreadsheet

7) In our case we also post the filtered words to Slack, but you can obviously remove that line.

We are using this in our own games and it's working like a charm. Hope you guys enjoy.

https://pimdewitte.me/2016/05/28/filtering-combinations-of-bad-words-out-of-string-inputs/

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.HashMap;

// Maps each bad word to the good words that make a match acceptable
public static HashMap<String, String[]> words = new HashMap<String, String[]>();

public static void loadConfigs() {
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(new URL("https://docs.google.com/spreadsheets/d/1hIEi2YG3ydav1E06Bzf2mQbGZ12kh2fe4ISgLg_UBuM/export?format=csv").openConnection().getInputStream()))) {
        String line;
        int counter = 0;
        while((line = reader.readLine()) != null) {
            try {
                // Column 1: the bad word; column 2 (optional): good words separated by "_"
                String[] content = line.split(",");
                if(content.length == 0) {
                    continue;
                }
                String word = content[0];
                String[] ignore_in_combination_with_words = new String[]{};
                if(content.length > 1) {
                    ignore_in_combination_with_words = content[1].split("_");
                }

                words.put(word.replaceAll(" ", ""), ignore_in_combination_with_words);
                counter++; // count only the rows that were actually stored
            } catch(Exception e) {
                e.printStackTrace();
            }
        }
        System.out.println("Loaded " + counter + " words to filter out");
    } catch (IOException e) {
        e.printStackTrace();
    }
}


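/**
 * Iterates over a String input and checks whether any substring is a cuss word from the list,
 * then checks whether the match should be ignored (e.g. bass contains the word *ss).
 * @param input the text to check; may be null
 * @return the bad words found in the input, or an empty list if there are none
 */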
public static ArrayList<String> badWordsFound(String input) {
    if(input == null) {
        return new ArrayList<>();
    }

    // remove leetspeak
    input = input.replaceAll("1","i");
    input = input.replaceAll("!","i");
    input = input.replaceAll("3","e");
    input = input.replaceAll("4","a");
    input = input.replaceAll("@","a");
    input = input.replaceAll("5","s");
    input = input.replaceAll("7","t");
    input = input.replaceAll("0","o");

    ArrayList<String> badWords = new ArrayList<>();
    input = input.toLowerCase().replaceAll("[^a-zA-Z]", "");

    for(int i = 0; i < input.length(); i++) {
        for(int fromIOffset = 1; fromIOffset < (input.length()+1 - i); fromIOffset++)  {
            String wordToCheck = input.substring(i, i + fromIOffset);
            if(words.containsKey(wordToCheck)) {
                // for example, if you want to say the word bass, that should be possible.
                String[] ignoreCheck = words.get(wordToCheck);
                boolean ignore = false;
                for(int s = 0; s < ignoreCheck.length; s++ ) {
                    if(input.contains(ignoreCheck[s])) {
                        ignore = true;
                        break;
                    }
                }
                if(!ignore) {
                    badWords.add(wordToCheck);
                }
            }
        }
    }


    for(String s: badWords) {
        Server.getSlackManager().queue(s + " qualified as a bad word in a username");
    }
    return badWords;

}
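To try the lookup without the spreadsheet or Slack, here is a self-contained sketch of the same substring scan against an in-memory map. The class name and the sample entries in WORDS are made up for illustration. It also caps the substring length at the longest listed word, which is my own addition and not part of the code above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SubstringFilterSketch {

    // Bad word -> good words that make a match acceptable (hypothetical sample data)
    static final Map<String, String[]> WORDS = new HashMap<>();
    static {
        WORDS.put("ass", new String[] {"bass", "class"});
        WORDS.put("omg", new String[] {});
    }

    // Undo common leetspeak, lowercase, and drop everything except letters
    static String normalize(String input) {
        return input.replace('1', 'i').replace('!', 'i').replace('3', 'e')
                .replace('4', 'a').replace('@', 'a').replace('5', 's')
                .replace('7', 't').replace('0', 'o')
                .toLowerCase().replaceAll("[^a-z]", "");
    }

    static List<String> badWordsFound(String input) {
        List<String> found = new ArrayList<>();
        if (input == null) {
            return found;
        }
        String text = normalize(input);
        int maxLen = 0;
        for (String w : WORDS.keySet()) {
            maxLen = Math.max(maxLen, w.length());
        }
        for (int start = 0; start < text.length(); start++) {
            // No substring longer than the longest listed word can be a key
            int limit = Math.min(text.length(), start + maxLen);
            for (int end = start + 1; end <= limit; end++) {
                String candidate = text.substring(start, end);
                String[] exceptions = WORDS.get(candidate);
                if (exceptions == null) {
                    continue; // not a listed word
                }
                boolean ignored = false;
                for (String ok : exceptions) {
                    if (text.contains(ok)) {
                        ignored = true;
                        break;
                    }
                }
                if (!ignored) {
                    found.add(candidate);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(badWordsFound("0mg that b4ss line")); // [omg]
        System.out.println(badWordsFound("a55"));                // [ass]
    }
}
```

As in the code above, the exception check looks at the whole input, so a single "bass" anywhere in the text also masks a separate bad word elsewhere in the same input.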
