
How to extract words containing only letters from a text in Python?

For example in the following text:

"We’d love t0 help 123you, but the real1ty is th@t n0t every question gets answered. To improve your chances, here are some tips:"

How can I easily extract only the words made up entirely of letters:

love, help, but,... To,... tips

I tried

import re

words = re.findall(r'^[a-zA-Z]+', str)
for word in words:
    print(word)

where str is the text. This partly works, but it needs some tweaking.

Any ideas how to do it with regular expressions?

You can use a list comprehension.

s = "We’d love t0 help 123you, but the real1ty is th@t n0t every question gets answered. To improve your chances, here are some tips:"
print([i for i in s.split() if i.isalpha()])
  • s.split() splits the input on whitespace.
  • Iterate over the returned items and keep only the ones that consist entirely of letters (i.isalpha()).
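Note that split() leaves punctuation attached to each token, so "chances," and "tips:" fail the isalpha() check. If such words should be kept, as the expected output in the question suggests, one option is to strip surrounding punctuation before testing. The following is a minimal sketch under that assumption; the function name letter_words and its strip_punct parameter are hypothetical, and only ASCII punctuation from string.punctuation is stripped.

```python
import string

def letter_words(text, strip_punct=False):
    """Return whitespace-separated tokens that consist only of letters.

    With strip_punct=True, leading and trailing ASCII punctuation is
    removed from each token before the check (an assumption added here,
    not part of the original answer).
    """
    result = []
    for token in text.split():
        if strip_punct:
            token = token.strip(string.punctuation)
        if token.isalpha():
            result.append(token)
    return result

s = ("We’d love t0 help 123you, but the real1ty is th@t n0t every question "
     "gets answered. To improve your chances, here are some tips:")

print(letter_words(s))                    # punctuation-attached words are dropped
print(letter_words(s, strip_punct=True))  # "answered", "chances", "tips" are kept
```

In both modes "We’d", "t0", "123you", "real1ty", "th@t" and "n0t" are rejected, because a non-letter remains inside the token. In Python 3, str.isalpha() also accepts non-ASCII letters.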

Use

re.findall(r'(?<!\S)[A-Za-z]+(?!\S)', x)
re.findall(r'\b[A-Za-z]+\b', x)

Or with Unicode support:

re.findall(r'(?<!\S)[^\W\d_]+(?!\S)', x)
re.findall(r'\b[^\W\d_]+\b', x)

See regex proof.

Use (?<!\S) and (?!\S) to match only words delimited by whitespace (or by the start or end of the string). Use \b if words may also be delimited by punctuation.
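To make the difference concrete, here is a small comparison of the two ASCII patterns on the sample text from the question. It is a sketch assuming Python 3; the variable names strict and loose are made up for illustration.

```python
import re

s = ("We’d love t0 help 123you, but the real1ty is th@t n0t every question "
     "gets answered. To improve your chances, here are some tips:")

# Whitespace-delimited: the letters must be surrounded by whitespace
# or by the start/end of the string.
strict = re.findall(r'(?<!\S)[A-Za-z]+(?!\S)', s)

# Word-boundary delimited: punctuation next to the letters is allowed,
# but an adjacent digit or underscore is not.
loose = re.findall(r'\b[A-Za-z]+\b', s)

print(strict)
print(loose)
```

The whitespace version skips "answered.", "chances," and "tips:" because punctuation is attached. The \b version keeps "answered", "chances" and "tips", but it also returns fragments such as "We" and "d" (from "We’d") and "th" and "t" (from "th@t"), since any non-word character counts as a boundary. Which behavior is preferable depends on how such tokens should be treated.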

EXPLANATION

--------------------------------------------------------------------------------
  (?<!                     look behind to see if there is not:
--------------------------------------------------------------------------------
    \S                       non-whitespace (all but \n, \r, \t, \f,
                             and " ")
--------------------------------------------------------------------------------
  )                        end of look-behind
--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
--------------------------------------------------------------------------------
  [A-Za-z]+                any character of: 'A' to 'Z', 'a' to 'z'
                           (1 or more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  [^\W\d_]+                any character except: non-word characters
                           (all but a-z, A-Z, 0-9, _), digits (0-9),
                           '_' (1 or more times (matching the most
                           amount possible))
--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
--------------------------------------------------------------------------------
  (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
    \S                       non-whitespace (all but \n, \r, \t, \f,
                             and " ")
--------------------------------------------------------------------------------
  )                        end of look-ahead
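The effect of swapping [A-Za-z] for [^\W\d_] can be seen on a short string with accented letters. This is a sketch assuming Python 3, where str patterns are Unicode-aware by default; the sample text and variable names are invented for illustration, and the NFC normalization step is an added precaution so that accented letters are single code points.

```python
import re
import unicodedata

# Hypothetical sample text; normalize so "é" and "ï" are precomposed letters.
text = unicodedata.normalize("NFC", "plain café naïve x2 snake_case")

# ASCII letters only: words with accented letters are rejected.
ascii_only = re.findall(r'(?<!\S)[A-Za-z]+(?!\S)', text)

# Any Unicode letter: \w minus digits and underscore.
unicode_letters = re.findall(r'(?<!\S)[^\W\d_]+(?!\S)', text)

print(ascii_only)
print(unicode_letters)
```

Here the ASCII pattern keeps only "plain", while the Unicode pattern also keeps "café" and "naïve". Both reject "x2" and "snake_case", because digits and underscores are excluded from the character class.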

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address.

 