Pattern to match and replace words with upper and also lower case in them

I have this problem with eliminating meaningless words from a string, for example:

$string = "Hi, my name is Tom. jc2pMK NB,xVD NOZmF__u cYNdtR46eEb8y,74 Today i registered to stack overflow. krEBNB1cB8 cq7,zCL x5KOwwRZfU13.bI g_IXxlcztXYN , DPnmcgj2FyydHAx@ I like IT. 0T1LAkuoPXscYC5uK6mlG R1nix_5kwF ,EKxXvT1 SjZYC4A6YQ 4E";

Now I want to be able to search and destroy those meaningless words from there, in PHP. I was trying preg_replace($pattern, "", $string) but couldn't figure out a pattern for letting "Hi" stay there but deleting "jc2pMK". I bet this is an elementary procedure with strings, that every basic programmer should easily figure out, but I have no experience with regular expressions.

I am open to any other ideas for getting rid of the meaningless words.

If you want to solve this on the semantic level, you'd need a dictionary of some sort. A poor man's approach would be to do something like

// Load the word list, one word per array element.
$dict = file('wordsEn.txt', FILE_IGNORE_NEW_LINES);
$string = "Hi, my name is Tom. jc2pMK NB,xVD NOZmF__u cYNdtR46eEb8y,74 Today i registered to stack overflow. krEBNB1cB8 cq7,zCL x5KOwwRZfU13.bI g_IXxlcztXYN , DPnmcgj2FyydHAx@ I like IT. 0T1LAkuoPXscYC5uK6mlG R1nix_5kwF ,EKxXvT1 SjZYC4A6YQ 4E";
// Split on single spaces; str_word_count($string, 1) is an alternative.
$words = explode(' ', $string);
// Keep only the tokens that appear verbatim in the dictionary.
echo implode(' ', array_intersect($words, $dict));

This loads a dictionary into an array, splits your string into an array on spaces, and then intersects the two to give you the words from your string that also exist in the dictionary. For the example, I used http://www-01.sil.org/linguistics/wordlists/english/wordlist/wordsEn.txt as the dictionary, which results in:

my name is registered to stack like

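The result will obviously only be as good as your dictionary. Also, the solution does not take casing or attached punctuation into account, so tokens such as "Hi," or "overflow." are compared with their punctuation and dropped. But it should give you an idea of how to approach the problem.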
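To address the casing and punctuation limits, the comparison can be done on a normalized copy of each token. The following is only a minimal sketch: the inline `$dict` is a hypothetical stand-in for the real word list, `keepDictionaryWords` is a name made up for this example, and the set of punctuation characters that gets trimmed is an assumption.

```php
<?php
// Hypothetical stand-in for the word list; in practice it could be loaded with
// file('wordsEn.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES).
$dict = ['hi', 'my', 'name', 'is', 'today', 'i', 'registered', 'to', 'stack', 'overflow', 'like', 'it'];

/**
 * Keep only the tokens whose lowercased form, stripped of surrounding
 * punctuation, appears in the dictionary.
 */
function keepDictionaryWords(string $text, array $dict): string
{
    // Flip the list so that lookups are hash-based instead of linear scans.
    $lookup = array_flip(array_map('strtolower', $dict));

    $kept = [];
    foreach (preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY) as $token) {
        // Compare "Hi," as "hi" but keep the original token in the output.
        $bare = strtolower(trim($token, '.,;:!?'));
        if ($bare !== '' && isset($lookup[$bare])) {
            $kept[] = $token;
        }
    }
    return implode(' ', $kept);
}

$string = "Hi, my name is Tom. jc2pMK NB,xVD Today i registered to stack overflow. krEBNB1cB8 I like IT. 4E";
echo keepDictionaryWords($string, $dict), PHP_EOL;
// prints Hi, my name is Today i registered to stack overflow. I like IT.
```

Note that "Tom." is dropped here because the name is not in the sample dictionary; proper nouns would need their own list.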

You'll find more sophisticated solutions in PHP's Human Language and Character Encoding Support, for instance with the Enchant and PSpell extensions, which allow you to spell check words against dictionary files.
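As a rough sketch of how a spell checker could be plugged in, the filter below takes any callable that decides whether a token is a word. `filterBySpellChecker` is a hypothetical helper name. The PSpell part assumes PHP 7.4 or later, that the pspell extension is loaded, and that an English dictionary is installed; it is skipped otherwise. A spell checker may still accept some junk tokens, so treat the result as a starting point.

```php
<?php
/**
 * Keep the tokens that $isWord accepts. $isWord receives each token
 * with surrounding punctuation removed.
 */
function filterBySpellChecker(string $text, callable $isWord): string
{
    $kept = [];
    foreach (preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY) as $token) {
        $bare = trim($token, '.,;:!?');
        if ($bare !== '' && $isWord($bare)) {
            $kept[] = $token;
        }
    }
    return implode(' ', $kept);
}

// Assumption: the pspell extension and an "en" dictionary are available.
if (function_exists('pspell_new') && ($speller = pspell_new('en')) !== false) {
    $isWord = fn(string $w): bool => pspell_check($speller, $w);
    echo filterBySpellChecker('Hi, my name is Tom. jc2pMK NB,xVD', $isWord), PHP_EOL;
}
```

Passing the check as a callable keeps the filtering logic independent of the extension, so Enchant or a plain array lookup can be swapped in.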

As everyone else commented, you aren't defining what a "meaningless word" is, so it's impossible to answer your question in general. But a regular expression that works ONLY for your example $string, with no guarantee for other strings, is the following:

Match (there's a space in front):

 (?:\w+[0-9_,@](?:\.\w)?\w*|[0-9.,]\w*)

Replace:

[leave empty]

You can test it online at regex101.

Here's the equivalent PHP code snippet:

// Input string from the question.
$string = "Hi, my name is Tom. jc2pMK NB,xVD NOZmF__u cYNdtR46eEb8y,74 Today i registered to stack overflow. krEBNB1cB8 cq7,zCL x5KOwwRZfU13.bI g_IXxlcztXYN , DPnmcgj2FyydHAx@ I like IT. 0T1LAkuoPXscYC5uK6mlG R1nix_5kwF ,EKxXvT1 SjZYC4A6YQ 4E";
// Remove every space-prefixed token that the pattern matches.
$result = preg_replace('/ (?:\w+[0-9_,@](?:\.\w)?\w*|[0-9.,]\w*)/', '', $string);
echo $result; // prints Hi, my name is Tom. Today i registered to stack overflow. I like IT.
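For readers new to regular expressions, the same pattern can be written with the `x` (extended) modifier, which ignores whitespace inside the pattern and lets `#` start a comment. This is only an annotated restatement of the pattern above; the variable names are chosen for this example.

```php
<?php
$pattern = '/
    [ ]                  # the leading space, kept literal inside a class
    (?:
        \w+              # one or more word characters
        [0-9_,@]         # then a digit, underscore, comma or at sign
        (?:\.\w)?        # optionally a dot followed by a word character
        \w*              # then any remaining word characters
      |
        [0-9.,]          # or: a token starting with a digit, dot or comma
        \w*              # followed by any word characters
    )
/x';

$input = "Hi, my name is Tom. jc2pMK NB,xVD NOZmF__u cYNdtR46eEb8y,74 Today i registered to stack overflow. krEBNB1cB8 cq7,zCL x5KOwwRZfU13.bI g_IXxlcztXYN , DPnmcgj2FyydHAx@ I like IT. 0T1LAkuoPXscYC5uK6mlG R1nix_5kwF ,EKxXvT1 SjZYC4A6YQ 4E";
$result = preg_replace($pattern, '', $input);
echo $result, PHP_EOL;
// prints Hi, my name is Tom. Today i registered to stack overflow. I like IT.
```

Because the pattern requires a leading space, a junk token at the very start of the string would not be removed.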

Again, this is only a quick and dirty solution for your specific string.
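A somewhat less string-specific alternative is to invert the logic and keep only the tokens that look like ordinary words. The sketch below assumes that an ordinary word is all lowercase, capitalised, or all uppercase, optionally followed by one punctuation mark; that definition and the name `keepPlainWords` are assumptions made for this example, not something stated in the question. It requires PHP 7.4 or later for the arrow function.

```php
<?php
/**
 * Keep tokens that look like ordinary words: all lowercase, capitalised,
 * or all uppercase, with at most one trailing punctuation mark.
 */
function keepPlainWords(string $text): string
{
    $tokens = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    $kept = array_filter(
        $tokens,
        fn(string $t): bool => preg_match('/^(?:[A-Z]?[a-z]+|[A-Z]+)[.,;:!?]?$/', $t) === 1
    );
    return implode(' ', $kept);
}

$input = "Hi, my name is Tom. jc2pMK NB,xVD NOZmF__u cYNdtR46eEb8y,74 Today i registered to stack overflow. krEBNB1cB8 cq7,zCL x5KOwwRZfU13.bI g_IXxlcztXYN , DPnmcgj2FyydHAx@ I like IT. 0T1LAkuoPXscYC5uK6mlG R1nix_5kwF ,EKxXvT1 SjZYC4A6YQ 4E";
echo keepPlainWords($input), PHP_EOL;
// prints Hi, my name is Tom. Today i registered to stack overflow. I like IT.
```

This still lets through random tokens that happen to be all lowercase or all uppercase (for example "NB" or "xvd"), and it drops legitimate mixed-case words such as "iPhone", so it works best combined with a dictionary check.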
