简体   繁体   中英

Remove non-ASCII characters from pandas column

I have been trying to work on this issue for a while.I am trying to remove non ASCII characters form DB_user column and trying to replace them with spaces. But I keep getting some errors. This is how my data frame looks:

+-----------------------------------------------------------
|      DB_user                            source   count  |                                             
+-----------------------------------------------------------
| ???/"Ò|Z?)?]??C %??J                      A        10   |                                       
| ?D$ZGU   ;@D??_???T(?)                    B         3   |                                       
| ?Q`H??M'?Y??KTK$?ً???ЩJL4??*?_??        C         2   |                                        
+-----------------------------------------------------------

I was using this function, which I had come across while researching the problem on SO.

def filter_func(string):
   for i in range(0,len(string)):


      if (ord(string[i])< 32 or ord(string[i])>126
           break

      return ''

And then using the apply function:

df['DB_user'] = df.apply(filter_func,axis=1)

I keep getting the error:

'ord() expected a character, but string of length 66 found', u'occurred at index 2'

However, I thought by using the loop in the filter_func function, I was dealing with this by inputing a char into 'ord'. Therefore the moment it hits a non-ASCII character, it should be replaced by a space.

Could somebody help me out?

Thanks!

你可以试试这个:

df.DB_user.replace({r'[^\x00-\x7F]+':''}, regex=True, inplace=True)

A common trick is to perform ASCII encoding with the errors="ignore" flag, then subsequently decoding it into ASCII:

df['DB_user'].str.encode('ascii', 'ignore').str.decode('ascii')

From python3.x and above, this is my recommended solution.


Minimal Code Sample

s = pd.Series(['Déjà vu', 'Ò|zz', ';test 123'])
s

0      Déjà vu
1         Ò|zz
2    ;test 123
dtype: object


s.str.encode('ascii', 'ignore').str.decode('ascii')

0        Dj vu
1          |zz
2    ;test 123
dtype: object

PS: This can also be extended to cases where you need to filter out characters that do not belong to any character encoding scheme (not just ASCII).

You code fails as you are not applying it on each character, you are applying it per word and ord errors as it takes a single character, you would need:

  df['DB_user'] = df["DB_user"].apply(lambda x: ''.join([" " if ord(i) < 32 or ord(i) > 126 else i for i in x]))

You can also simplify the join using a chained comparison:

   ''.join([i if 32 < ord(i) < 126 else " " for i in x])

You could also use string.printable to filter the chars:

from string import printable
st = set(printable)
df["DB_user"] = df["DB_user"].apply(lambda x: ''.join([" " if  i not in  st else i for i in x]))

The fastest is to use translate:

from string import maketrans

del_chars =  " ".join(chr(i) for i in range(32) + range(127, 256))
trans = maketrans(t, " "*len(del_chars))

df['DB_user'] = df["DB_user"].apply(lambda s: s.translate(trans))

Interestingly that is faster than:

  df['DB_user'] = df["DB_user"].str.translate(trans)

A couple of the answers given here aren't correct. Simple validation:

s = pd.Series([chr(x) for x in range(256)])
s.loc[0]
>> '\x00'
s.replace({r'[^\x00-\x7F]+':''}, regex=True).loc[0]
>> '\x00'  # FAIL
s.str.encode('ascii', 'ignore').str.decode('ascii').loc[0]
>> '\x00'  # FAIL
s.apply(lambda x: ''.join([i if 32 < ord(i) < 126 else " " for i in x])).loc[0]
>> ' '  # Success!
import string
s.apply(lambda x: ''.join([" " if  i not in string.printable else i for i in x])).loc[0]
>> ' '  # Looks good, but...
s.apply(lambda x: ''.join([" " if  i not in string.printable else i for i in x])).loc[11]
>> '\x0b'  # FAIL
del_chars =  " ".join([chr(i) for i in list(range(32)) + list(range(127, 256))])
trans = str.maketrans(del_chars, " " * len(del_chars))
s.apply(lambda x: x.translate(trans)).loc[11]
>> ' '  # Success!

Conclusion: only the options in the accepted answer (from Padraic Cunningham) work reliably. There are some bizarre Python errors/typos in his second answer, amended here, but otherwise it should be the fastest.

This worked for me. Given the series has some NaN values, it performs only on strings:

from string import printable

import pandas as pd

df["text_data"] = df["text_data"].str.split().str.join(' ')

df["text_data"] = df["text_data"].apply(lambda string_var: ''.join(filter(lambda y: y in printable, string_var)) if isinstance(string_var, str) else string_var)
from string import printable

def printable_mapper(x): 
    return ''.join([_ if _ in printable else " " for _ in x])

df.DB_user = df.DB_user.map(printable_mapper)

Here is a one liner that I use:

df = df.replace(to_replace="/[^ -~]+/g", value="", regex=True)

Using regex, it globally removes characters not in the range of ' '(space) and ~

This worked for me:

import re
def replace_foreign_characters(s):
    return re.sub(r'[^\x00-\x7f]',r'', s)

df['column_name'] = df['column_name'].apply(lambda x: replace_foreign_characters(x))

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM