Regular Expression to match suspect words on a string

I am developing a "word filter" class in PHP that, among other things, needs to catch purposely misspelled words. The user enters these words as part of a sentence. Here is a simple example of such a sentence:

I want a coke, sex, drugs and rock'n'roll

The example above is a common phrase, written correctly. My class will find the suspect words sex and drugs, and everything will be fine.

But I expect that users will try to evade detection by writing things a little differently. In fact, there are many ways to write the same word so that it is still readable to some people. For example, the word sex may be written as s3x, 5ex, 53x, s 3 x, s33x, 5533xxx, ss 33 xxx, and so on.
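To make the variants above concrete, here is a minimal sketch of one way to turn a plain word into a pattern that tolerates look-alike digits, repeated characters, and separators. The function name `buildObfuscatedPattern` and the substitution table are hypothetical and only cover this one word; it is an illustration, not a complete filter.

```php
<?php
// Hypothetical substitution table: each letter maps to the characters
// a user might type in its place. Extend it for your own word list.
$substitutions = [
    's' => 's5',
    'e' => 'e3',
    'x' => 'x',
];

function buildObfuscatedPattern(string $word, array $substitutions): string
{
    $parts = [];
    foreach (str_split(strtolower($word)) as $letter) {
        $chars = $substitutions[$letter] ?? $letter;
        // One or more of any stand-in for this letter, e.g. [s5]+
        $parts[] = '[' . preg_quote($chars, '/') . ']+';
    }
    // Allow optional separators (space, dot, dash, quote) between letters.
    return '/\b' . implode('[ .\'-]*', $parts) . '\b/i';
}

$pattern = buildObfuscatedPattern('sex', $substitutions);

$variants = ['s3x', '5ex', '53x', 's 3 x', 's33x', '5533xxx', 'ss 33 xxx'];
foreach ($variants as $variant) {
    echo $variant, ': ', preg_match($pattern, $variant) ? 'suspect' : 'clean', PHP_EOL;
}
```

Building one pattern per listed word keeps the rules readable, but every new trick a user invents needs a new entry in the table.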

I know the basics of regular expressions and tried the pattern below:

/(\b[\w][\w .'-]+[\w]\b)/g

My reasoning was:

  • \b word boundary
  • [\w] The word can start with one letter or one digit...
  • [\w .'-] ... followed by any letter, digit, space, dot, quote or dash...
  • + ... one or more times...
  • [\w] ... ending with one letter or one digit.
  • \b word boundary

This works partially.

If the sample phrase is written as I want a coke, 5 3 x, druuu95 and r0ck'n'r011, I get 3 matches:

  • I want a coke
  • 5 3 x
  • druuu95 and r0ck'n'r011
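The three matches can be reproduced with a short script. This is a sketch that assumes the pattern is run through `preg_match_all()`. PHP's preg functions have no `g` modifier, so it is dropped here; `preg_match_all()` already scans the whole subject.

```php
<?php
// The pattern from the question, minus the "g" flag.
// Single-quoted string, so the apostrophe in the class is escaped as \'.
$pattern = '/\b[\w][\w .\'-]+[\w]\b/';
$subject = "I want a coke, 5 3 x, druuu95 and r0ck'n'r011";

preg_match_all($pattern, $subject, $matches);

// $matches[0] holds the full text of each match.
foreach ($matches[0] as $match) {
    echo $match, PHP_EOL;
}
```

Because the space is part of the character class, the greedy `+` runs across word gaps and only stops at a character outside the class, such as the comma.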

What I need is 8 matches:

  • I
  • want
  • a
  • coke
  • 5 3 x
  • druuu95
  • and
  • r0ck'n'r011

In short, I need a regular expression that gives me each word of a sentence, even if the word begins with a digit, contains a variable number of digits, spaces, dots, dashes and quotes, and ends with a letter or digit.

Any help will be appreciated.

Description

Typically, good words are 2 or more letters long (with the exception of I and a) and do not contain numbers. This expression isn't flawless, but it helps illustrate why this kind of language matching is absurdly difficult: it's an arms race between creative people trying to express themselves without getting caught and the development team trying to catch them.

(?:\s+|\A)[#'"[({]?(?!(?:[a-z]{2}\s+){3})(?:[a-zA-Z'-]{2,}|[ia]|i[nst]|o[fnr])[?!.,;:'")}\]]?(?=(?:\s|\Z))|((?:[a-z]{2}\s+){3}|.*?\b)

This regular expression will do the following:

  • find all acceptable words
  • find all the rest and store them in Capture Group 1
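A minimal sketch of how the expression might be used from PHP, assuming no pattern flags and the sample sentence from the question. The `~` delimiter and the nowdoc wrapper are choices made here so that neither quote character needs escaping. The `.*?\b` branch can match an empty string or a lone separator, so blank captures are skipped.

```php
<?php
// Nowdoc keeps the pattern verbatim: no escaping of quotes or backslashes.
$pattern = <<<'REGEX'
~(?:\s+|\A)[#'"[({]?(?!(?:[a-z]{2}\s+){3})(?:[a-zA-Z'-]{2,}|[ia]|i[nst]|o[fnr])[?!.,;:'")}\]]?(?=(?:\s|\Z))|((?:[a-z]{2}\s+){3}|.*?\b)~
REGEX;

$subject = "I want a coke, 5 3 x, druuu95 and r0ck'n'r011";

preg_match_all($pattern, $subject, $matches);

// $matches[1] is Capture Group 1: everything that was not accepted.
$suspects = [];
foreach ($matches[1] as $fragment) {
    $fragment = trim($fragment);
    if ($fragment !== '') {
        $suspects[] = $fragment;
    }
}

print_r($suspects);
```

Group 1 tends to hold fragments rather than whole words, so the caller still has to decide how to stitch neighbouring fragments back together.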

Example

Live Demo

https://regex101.com/r/cL2bN1/1

Explanation

NODE                     EXPLANATION
----------------------------------------------------------------------
  (?:                      group, but do not capture:
----------------------------------------------------------------------
    \s+                      whitespace (\n, \r, \t, \f, and " ") (1
                             or more times (matching the most amount
                             possible))
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    \A                       the beginning of the string
----------------------------------------------------------------------
  )                        end of grouping
----------------------------------------------------------------------
  [#'"[({]?                any character of: '#', ''', '"', '[', '(',
                           '{' (optional (matching the most amount
                           possible))
----------------------------------------------------------------------
  (?!                      look ahead to see if there is not:
----------------------------------------------------------------------
    (?:                      group, but do not capture (3 times):
----------------------------------------------------------------------
      [a-z]{2}                 any character of: 'a' to 'z' (2 times)
----------------------------------------------------------------------
      \s+                      whitespace (\n, \r, \t, \f, and " ")
                               (1 or more times (matching the most
                               amount possible))
----------------------------------------------------------------------
    ){3}                     end of grouping
----------------------------------------------------------------------
  )                        end of look-ahead
----------------------------------------------------------------------
  (?:                      group, but do not capture:
----------------------------------------------------------------------
    [a-zA-Z'-]{2,}           any character of: 'a' to 'z', 'A' to
                             'Z', ''', '-' (at least 2 times
                             (matching the most amount possible))
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    [ia]                     any character of: 'i', 'a'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    i                        'i'
----------------------------------------------------------------------
    [nst]                    any character of: 'n', 's', 't'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    o                        'o'
----------------------------------------------------------------------
    [fnr]                    any character of: 'f', 'n', 'r'
----------------------------------------------------------------------
  )                        end of grouping
----------------------------------------------------------------------
  [?!.,;:'")}\]]?          any character of: '?', '!', '.', ',', ';',
                           ':', ''', '"', ')', '}', '\]' (optional
                           (matching the most amount possible))
----------------------------------------------------------------------
  (?=                      look ahead to see if there is:
----------------------------------------------------------------------
    (?:                      group, but do not capture:
----------------------------------------------------------------------
      \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      \Z                       before an optional \n, and the end of
                               the string
----------------------------------------------------------------------
    )                        end of grouping
----------------------------------------------------------------------
  )                        end of look-ahead
----------------------------------------------------------------------
 |                        OR
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    (?:                      group, but do not capture (3 times):
----------------------------------------------------------------------
      [a-z]{2}                 any character of: 'a' to 'z' (2 times)
----------------------------------------------------------------------
      \s+                      whitespace (\n, \r, \t, \f, and " ")
                               (1 or more times (matching the most
                               amount possible))
----------------------------------------------------------------------
    ){3}                     end of grouping
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    .*?                      any character except \n (0 or more times
                             (matching the least amount possible))
----------------------------------------------------------------------
    \b                       the boundary between a word char (\w)
                             and something that is not a word char
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
