
Python Removing non-alphabetical characters with exceptions

I am having a hard time doing data analysis on a large text that contains many non-alphabetical characters. I tried using

string = filter(str.isalnum, string)

but I also have "@" in my text, which I want to keep. How do I make an exception for a character like "@"?

It is easier to use a regular expression:

string = re.sub("[^A-Za-z0-9@]", "", string)
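If more than one character needs an exception, the character class can be built from a string of extra characters. The following is only a sketch: the helper name `keep_alnum_and` and its `extra` parameter are hypothetical and not part of the answer above. `re.escape` is used so that characters such as `-` or `]` stay literal inside the class.

```python
import re

def keep_alnum_and(text, extra="@"):
    """Remove every character that is not an ASCII letter, a digit, or in `extra`.

    Hypothetical helper for illustration only.
    """
    # re.escape keeps characters like '-', ']' or '^' literal inside the class.
    pattern = "[^A-Za-z0-9" + re.escape(extra) + "]"
    return re.sub(pattern, "", text)

print(keep_alnum_and("Mail me @ home, ok?"))   # Mailme@homeok
print(keep_alnum_and("a-b_c@d", extra="@-"))   # a-bc@d
```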

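You can use re.sub with a negated character class that also keeps whitespace (note that \w already covers digits, so \d is redundant, and \w also matches the underscore):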

re.sub(r'[^\w\s\d@]', '', string)

Example:

>>> re.sub(r'[^\w\s\d@]', '', 'This is @ string 123 *$^%')
'This is @ string 123 '
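One thing worth checking before relying on `\w`: for `str` patterns it matches the underscore and non-ASCII letters as well as ASCII letters and digits. The short comparison below is a sketch with a made-up sample string (the variable names are illustrative), showing how the result differs from an explicit ASCII class.

```python
import re

# Made-up sample; "\u00e9" is the letter 'é'.
sample = "snake_case caf\u00e9 @ 42!"

# \w keeps the underscore and the non-ASCII letter; only '!' is removed.
word_class = re.sub(r"[^\w\s@]", "", sample)

# An explicit ASCII class removes '_', 'é' and '!'.
ascii_class = re.sub(r"[^A-Za-z0-9\s@]", "", sample)

print(word_class)   # snake_case café @ 42
print(ascii_class)  # snakecase caf @ 42
```

Which of the two is preferable depends on whether underscores and accented letters count as noise in the data.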

You could use a lambda function to specify your allowed characters. Note that in Python 3 filter returns a filter object, which is an iterator over the retained values, so you will have to stitch it back into a string:

string = "?filter_@->me3!"

extra_chars = "@!"

filtered_object = filter(lambda c: c.isalnum() or c in extra_chars, string)

string = "".join(filtered_object)

print(string)

Gives:

filter@me3!

One way to do this would be to create a function that returns True if an input character is valid and False otherwise.

import string

valid_characters = string.ascii_letters + string.digits + '@'

def is_valid_character(character):
    return character in valid_characters

# Instead of using `filter`, we `join` all characters in the input string
# if `is_valid_character` is `True`.
def get_valid_characters(string):
    return "".join(char for char in string if is_valid_character(char))

Some example output:

>>> print(valid_characters)
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789@

>>> get_valid_characters("!Hello_#world?")
'Helloworld'

>>> get_valid_characters("user@example")
'user@example'
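Since the question mentions a large text, a possible variant is to store the allowed characters in a `frozenset`, so that each membership test is a hash lookup rather than a scan of the string. This is a sketch, not a measured optimization: the names `VALID_CHARACTER_SET` and `get_valid_characters_from_set` are illustrative, and any speed difference is an assumption that should be timed on real data.

```python
import string

# Same allowed characters as above, stored as a set for constant-time lookup.
VALID_CHARACTER_SET = frozenset(string.ascii_letters + string.digits + "@")

def get_valid_characters_from_set(text):
    """Return `text` with every character outside the allowed set removed."""
    return "".join(char for char in text if char in VALID_CHARACTER_SET)

print(get_valid_characters_from_set("!Hello_#world?"))  # Helloworld
print(get_valid_characters_from_set("user@example"))    # user@example
```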

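A simpler way to write it is with a regular expression. Note that \w would also keep underscores and non-ASCII letters, so an explicit character class is used here to accomplish the same thing:

import re

def get_valid_characters(string):
    return re.sub(r"[^A-Za-z0-9@]", "", string)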

