I have several word lists, and many of the entries are noisy. By noisy I mean that a word begins with non-alphabetic characters such as '"' or '-', e.g. "thisword, -thisword, -"this word, .thisword, and there can be several others.
For example, we can strip leading ASCII letters by using
from string import ascii_letters
string.lstrip(ascii_letters)
Is there a similar method in Python that can handle the non-alphabetic characters without using regular expressions?
Thanks!
Why don't you use string.punctuation?
>>> from string import punctuation
>>> "-asdf".lstrip(punctuation)
'asdf'
>>> "'asdf".lstrip(punctuation)
'asdf'
>>> '"asdf'.lstrip(punctuation)
'asdf'
>>> ',asdf'.lstrip(punctuation)
'asdf'
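To apply the same idea to a whole word list, something along these lines might work. `clean_words` is a hypothetical helper name, and adding `whitespace` to the stripped characters is an assumption, not part of the answer above. Note that `string.punctuation` only contains ASCII punctuation, so characters such as curly quotes are left alone.

```python
from string import punctuation, whitespace

def clean_words(words):
    # Strip leading ASCII punctuation and whitespace from each word.
    # Characters after the first letter or digit are left untouched.
    return [w.lstrip(punctuation + whitespace) for w in words]

print(clean_words(['"thisword', '-thisword', '-"this word', '.thisword']))
# ['thisword', 'thisword', 'this word', 'thisword']
```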
Keep only the letters in the word:
"".join([x for x in word if x.isalpha()])
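One thing to keep in mind with this approach: it drops every non-letter character, not just the leading ones, so spaces, digits and apostrophes inside the word disappear too. A small sketch to illustrate, where `letters_only` is a hypothetical name for the expression above:

```python
def letters_only(word):
    # Keep every character for which str.isalpha() is true,
    # wherever it appears in the word (not only at the start).
    return "".join(ch for ch in word if ch.isalpha())

print(letters_only('-"thisword'))   # thisword
print(letters_only('-"this word'))  # thisword  (the inner space is removed too)
print(letters_only("don't"))        # dont
```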
Using itertools.dropwhile:
>>> import itertools
>>> def removes(s):
...     return "".join(itertools.dropwhile(lambda x: not x.isalnum(), s))
...
>>> removes("---thisword")
'thisword'
>>> removes("-^--thisword")
'thisword'
>>> removes("thisword")
'thisword'
>>> removes("...thisword")
'thisword'
Negate character set:
>>> from string import ascii_letters
>>> non_letter = ''.join(set(map(chr, range(128))) - set(ascii_letters))
>>> s = '-hello'
>>> s.lstrip(non_letter)
'hello'
I would suggest a while loop that trims each string until it hits an ASCII letter. Load the non-letter characters into a list, then scan from the start of the string until you hit a character that is not in that list. Implement it as a function so that you can abstract the task away.
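A minimal sketch of that loop might look like the following. The function name `strip_leading_non_letters` is hypothetical, and instead of keeping an explicit list of unwanted characters this version tests each character with `str.isalpha()`, which is a substitution on my part.

```python
def strip_leading_non_letters(word):
    # Advance past characters until the first letter is found.
    i = 0
    while i < len(word) and not word[i].isalpha():
        i += 1
    # Return everything from the first letter onward
    # (an empty string if the word contains no letters).
    return word[i:]

for w in ['"thisword', '-thisword', '-"this word', '.thisword']:
    print(strip_leading_non_letters(w))
```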
Hope that helps.