I am developing a "word filter" class in PHP that, among other things, needs to catch purposely misspelled words. These words are entered by the user as part of a sentence. Here is a simple example of a sentence entered by a user:
I want a coke, sex, drugs and rock'n'roll
The example above is an ordinary phrase, written correctly. My class will find the suspect words sex and drugs, and everything will be fine.
But I expect that the user will try to hinder detection by writing things a little differently. In fact, there are many ways to write the same word so that it is still readable to certain kinds of readers. For example, the word sex may be written as s3x, or 5ex, or 53x, or sex, or s 3 x, or s33x, or 5533xxx, or ss 33 xxx, and so on.
I know the basics of regular expressions and tried the pattern below:
/(\b[\w][\w .'-]+[\w]\b)/g
My reasoning, piece by piece:
\b word boundary
[\w] The word can start with one letter or one digit...
[\w .'-] ... followed by any letter, digit, space, dot, quote or dash...
+ ... one or more times...
[\w] ... ending with one letter or one digit.
\b word boundary
This works partially.
If the sample phrase is written as
I want a coke, 5 3 x, druuu95 and r0ck'n'r011
I get 3 matches:
I want a coke
5 3 x
druuu95 and r0ck'n'r011
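To make the result above reproducible, here is a minimal PHP sketch that runs the same pattern over the sample sentence. The function name findCandidateWords is made up for illustration. Note that PHP's PCRE functions have no g modifier, so preg_match_all() is used instead.

```php
<?php
// Hypothetical helper (not from the original post): returns every match
// of the question's pattern in $sentence.
function findCandidateWords(string $sentence): array
{
    // Single-quoted string, so \b and \w reach PCRE unchanged;
    // only the quote inside the character class needs escaping.
    $pattern = '/(\b[\w][\w .\'-]+[\w]\b)/';

    // PHP has no "g" modifier: preg_match_all() already finds all matches.
    preg_match_all($pattern, $sentence, $matches);

    return $matches[0]; // full-pattern matches, in order of appearance
}

$sentence = "I want a coke, 5 3 x, druuu95 and r0ck'n'r011";
print_r(findCandidateWords($sentence));
// Three matches: "I want a coke", "5 3 x", "druuu95 and r0ck'n'r011"
```

The long matches come from the space inside [\w .'-]: the greedy + keeps consuming across word breaks until it reaches a character outside the class, such as the comma.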
What I need is 8 matches:
I
want
a
coke
5 3 x
druuu95
and
r0ck'n'r011
In short, I need a regular expression that gives me each word of a sentence, even if the word begins with a digit, contains a variable number of digits, spaces, dots, dashes and quotes, and ends with a letter or digit.
Any help will be appreciated.
Typically, good words are 2 or more letters long (with the exception of I and a) and do not contain numbers. The expression below isn't flawless, but it does help illustrate why this type of language matching is absurdly difficult: it's an arms race between creative people trying to express themselves without getting caught and the development team trying to catch them.
(?:\s+|\A)[#'"[({]?(?!(?:[a-z]{2}\s+){3})(?:[a-zA-Z'-]{2,}|[ia]|i[nst]|o[fnr])[?!.,;:'")}\]]?(?=(?:\s|\Z))|((?:[a-z]{2}\s+){3}|.*?\b)
This regular expression will do the following:
Live Demo
https://regex101.com/r/cL2bN1/1
NODE                EXPLANATION
----------------------------------------------------------------------
(?:                 group, but do not capture:
----------------------------------------------------------------------
\s+                 whitespace (\n, \r, \t, \f, and " ") (1 or more
                    times (matching the most amount possible))
----------------------------------------------------------------------
|                   OR
----------------------------------------------------------------------
\A                  the beginning of the string
----------------------------------------------------------------------
)                   end of grouping
----------------------------------------------------------------------
[#'"[({]?           any character of: '#', ''', '"', '[', '(', '{'
                    (optional (matching the most amount possible))
----------------------------------------------------------------------
(?!                 look ahead to see if there is not:
----------------------------------------------------------------------
(?:                 group, but do not capture (3 times):
----------------------------------------------------------------------
[a-z]{2}            any character of: 'a' to 'z' (2 times)
----------------------------------------------------------------------
\s+                 whitespace (\n, \r, \t, \f, and " ") (1 or more
                    times (matching the most amount possible))
----------------------------------------------------------------------
){3}                end of grouping
----------------------------------------------------------------------
)                   end of look-ahead
----------------------------------------------------------------------
(?:                 group, but do not capture:
----------------------------------------------------------------------
[a-zA-Z'-]{2,}      any character of: 'a' to 'z', 'A' to 'Z', ''',
                    '-' (at least 2 times (matching the most amount
                    possible))
----------------------------------------------------------------------
|                   OR
----------------------------------------------------------------------
[ia]                any character of: 'i', 'a'
----------------------------------------------------------------------
|                   OR
----------------------------------------------------------------------
i                   'i'
----------------------------------------------------------------------
[nst]               any character of: 'n', 's', 't'
----------------------------------------------------------------------
|                   OR
----------------------------------------------------------------------
o                   'o'
----------------------------------------------------------------------
[fnr]               any character of: 'f', 'n', 'r'
----------------------------------------------------------------------
)                   end of grouping
----------------------------------------------------------------------
[?!.,;:'")}\]]?     any character of: '?', '!', '.', ',', ';', ':',
                    ''', '"', ')', '}', ']' (optional (matching the
                    most amount possible))
----------------------------------------------------------------------
(?=                 look ahead to see if there is:
----------------------------------------------------------------------
(?:                 group, but do not capture:
----------------------------------------------------------------------
\s                  whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
|                   OR
----------------------------------------------------------------------
\Z                  before an optional \n, and the end of the string
----------------------------------------------------------------------
)                   end of grouping
----------------------------------------------------------------------
)                   end of look-ahead
----------------------------------------------------------------------
|                   OR
----------------------------------------------------------------------
(                   group and capture to \1:
----------------------------------------------------------------------
(?:                 group, but do not capture (3 times):
----------------------------------------------------------------------
[a-z]{2}            any character of: 'a' to 'z' (2 times)
----------------------------------------------------------------------
\s+                 whitespace (\n, \r, \t, \f, and " ") (1 or more
                    times (matching the most amount possible))
----------------------------------------------------------------------
){3}                end of grouping
----------------------------------------------------------------------
|                   OR
----------------------------------------------------------------------
.*?                 any character except \n (0 or more times
                    (matching the least amount possible))
----------------------------------------------------------------------
\b                  the boundary between a word char (\w) and
                    something that is not a word char
----------------------------------------------------------------------
)                   end of \1
----------------------------------------------------------------------
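As a rough sketch of how the expression might be used from PHP, the function below collects whatever lands in capture group 1, that is, the text the good-word branch refused. The name findSuspectTokens, the ~ delimiter and the i modifier are assumptions that do not appear in the answer; the i modifier is added only so that a capital I passes the [ia] branch.

```php
<?php
// Hypothetical helper (not from the original answer): returns the pieces of
// $text that ended up in capture group 1 of the expression.
function findSuspectTokens(string $text): array
{
    // Nowdoc keeps the pattern byte-for-byte, so no PHP escaping is needed.
    $body = <<<'REGEX'
(?:\s+|\A)[#'"[({]?(?!(?:[a-z]{2}\s+){3})(?:[a-zA-Z'-]{2,}|[ia]|i[nst]|o[fnr])[?!.,;:'")}\]]?(?=(?:\s|\Z))|((?:[a-z]{2}\s+){3}|.*?\b)
REGEX;

    // "~" is used as delimiter because the pattern itself contains "#".
    // The "i" modifier is an assumption (see the note above the code).
    preg_match_all('~' . $body . '~i', $text, $matches);

    // Group 1 is an empty string wherever the good-word branch matched;
    // drop those entries and any whitespace-only captures.
    $tokens = array_map('trim', $matches[1]);

    return array_values(array_filter($tokens, 'strlen'));
}

print_r(findSuspectTokens("I want a coke, 5 3 x, druuu95 and r0ck'n'r011"));
```

On the sample sentence from the question this should return pieces such as 5, 3, x, druuu95 and r0ck, while want and coke are skipped. Note that 5 3 x comes back as three separate tokens, so rejoining neighbouring suspect tokens would still be the caller's job.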