
Python: Split a string into a list, taking out all special characters except '

I need to split a string into a list of words, separating on whitespace and deleting all special characters except for the apostrophe (').

For example:

page = "They're going up to the Stark's castle [More:...]"

needs to be turned into a list

["They're", 'going', 'up', 'to', 'the', "Stark's", 'castle', 'More']

Right now I can only remove all special characters, using

re.sub("[^\w]", " ", page).split()

or just split, keeping all special characters using

page.split() 

Is there a way to specify which characters to remove, and which to keep?

Use str.split as normal, then filter the unwanted characters out of each word:

>>> page = "They're going up to the Stark's castle [More:...]"
>>> result = [''.join(c for c in word if c.isalpha() or c=="'") for word in page.split()]
>>> result
["They're", 'going', 'up', 'to', 'the', "Stark's", 'castle', 'More']
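
One edge case with this approach: a token made up only of special characters (for example a standalone "...") becomes an empty string and stays in the list. Below is a minimal sketch that drops such empty strings. The helper name split_words and the second sample sentence are made up for illustration and are not part of the original answer.

```python
def split_words(text):
    """Split on whitespace, keeping only letters and apostrophes in each word."""
    cleaned = (
        "".join(c for c in word if c.isalpha() or c == "'")
        for word in text.split()
    )
    # Drop tokens that became empty after filtering (e.g. a bare "...").
    return [word for word in cleaned if word]


print(split_words("They're going up to the Stark's castle [More:...]"))
print(split_words("wait ... what?"))
```

Note that str.isalpha() rejects digits, so a token such as "2nd" would come out as "nd"; c.isalnum() could be used instead if digits should be kept.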

import re

page = "They're going up to the Stark's castle [More:...]"
# Raw string so that \w reaches the regex engine unchanged.
s = re.sub(r"[^\w' ]", "", page).split()

Output:

["They're", 'going', 'up', 'to', 'the', "Stark's", 'castle', 'More']

First use [\w' ] to match the characters you want to keep, then add ^ at the start of the class to match everything else, and replace those matches with '' (nothing).
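
Keep in mind that deleting characters, rather than replacing them with a space, joins words that are separated only by punctuation. The sketch below uses a made-up sample string (not from the question) to compare the two choices of replacement.

```python
import re

sample = "castle,keep [More:...]"

# Deleting unwanted characters glues "castle" and "keep" together.
deleted = re.sub(r"[^\w' ]", "", sample).split()

# Replacing them with a space keeps the words apart.
spaced = re.sub(r"[^\w' ]", " ", sample).split()

print(deleted)  # ['castlekeep', 'More']
print(spaced)   # ['castle', 'keep', 'More']
```

Also note that \w matches digits and the underscore as well as letters, so those characters survive this substitution.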

Here is a solution.

  1. Replace every run of characters other than alphanumerics and the single quote with a single space, then remove any trailing space.
  2. Split the string using a space as the delimiter.


import re

page = "They're going up to the Stark's castle   [More:...]"
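# Replace each run of characters other than letters, digits and ' with one space,
# then drop the trailing space left behind by the closing ":...]".
page = re.sub("[^0-9a-zA-Z']+", ' ', page).rstrip()
print(page)
p = page.split(' ')
print(p)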


Here is the output of the two print calls.

They're going up to the Stark's castle More
["They're", 'going', 'up', 'to', 'the', "Stark's", 'castle', 'More']
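
One caveat with split(' '): it gives a clean list here only because rstrip() has removed the trailing space. If the text starts with a special character, the substitution leaves a leading space and split(' ') returns an empty first element. The sketch below, using a made-up input string, shows that calling split() with no argument avoids this.

```python
import re

text = "[Note] They're here..."
# Every run of unwanted characters becomes one space: " Note They're here "
spaced = re.sub("[^0-9a-zA-Z']+", " ", text)

print(spaced.rstrip().split(" "))  # ['', 'Note', "They're", 'here']
print(spaced.split())              # ['Note', "They're", 'here']
```

With no argument, split() treats any run of whitespace as one separator and ignores leading and trailing whitespace, so the rstrip() call is no longer needed.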

Using ''.join() and a nested list comprehension would be a simpler option in my opinion:

>>> page = "They're going up to the Stark's castle [More:...]"
>>> [''.join([c for c in w if c.isalpha() or c == "'"]) for w in page.split()]
["They're", 'going', 'up', 'to', 'the', "Stark's", 'castle', 'More']
