I am using the code below to remove all non-English characters:
df.text.replace({r'[^\x00-\x7F]+':''}, regex=True, inplace=True)
where df has a column called text containing values like these:
text
hi what are you saying?
okay let me know
sounds great, mikey
ok.
right
ご承知のとおり、残念ながら悪質な詐欺が増加しているようですのでお気を付けください。\n
¡Hola miguel! Lamento mucho la confusión cau
expected output:
text
hi what are you saying?
okay let me know
sounds great, mikey
ok.
right
For rows where my code removes any characters, I want to delete the row from df completely.
That is, if the replacement strips any non-English characters from a row, the whole row should be dropped, so that I am not left with rows that are empty or that contain only a few meaningless leftover characters.
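To illustrate the problem described above, here is a minimal sketch of what the replacement leaves behind. The three sample values are taken from the question, the variable name cleaned is only illustrative, and the call is written without inplace=True so the result can be inspected.

```python
import pandas as pd

# Sample rows from the question: one ASCII-only, one fully non-ASCII, one mixed.
df = pd.DataFrame({'text': [
    'ok.',
    'ご承知のとおり',
    '¡Hola miguel! Lamento mucho la confusión cau',
]})

# Same pattern as in the question, returning a new Series instead of modifying df.
cleaned = df['text'].replace({r'[^\x00-\x7F]+': ''}, regex=True)

print(cleaned.tolist())
# ['ok.', '', 'Hola miguel! Lamento mucho la confusin cau']
```

The fully non-ASCII row becomes an empty string and the mixed row is left as a damaged fragment, which is why dropping such rows is preferable to cleaning them.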
You can use
df[~df['text'].str.contains(r'[^\x00-\x7F]')]
Pandas test:
import pandas as pd
df = pd.DataFrame({'text': ['hi what are you saying?', 'ご承知のとおり、残念ながら悪質な詐欺が増加しているようですのでお気を付けください。'], 'another_col':['demo 1', 'demo 2']})
df[~df['text'].str.contains(r'[^\x00-\x7F]')]
# text another_col
# 0 hi what are you saying? demo 1
Notes:
df['text'].str.contains(r'[^\x00-\x7F]') finds all values in the text column that contain at least one non-ASCII character (this is our "mask").
df[~...] keeps only those rows that did not match the regex.
str.contains() returns a Series of booleans that we can use to index our frame:
patternDel = r"[^\x00-\x7F]"
mask = df['Event Name'].str.contains(patternDel)
I tend to keep the things we want as opposed to deleting rows. Since mask marks the rows we want to delete, we use ~ to select all the rows that don't match and keep them (here the column is called Event Name, and the name mask avoids shadowing the built-in filter):
df = df[~mask]
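One case the snippets above do not cover is a column with missing values. By default str.contains() returns a missing value for a missing input, which can make the ~ negation fail or behave unexpectedly. The following is a minimal sketch, assuming you want to treat missing text as "no non-ASCII characters found"; the sample frame and the names has_non_ascii and ascii_only are illustrative, not from the original answers.

```python
import pandas as pd

df = pd.DataFrame({
    'text': ['hi what are you saying?', 'ご承知のとおり', None, '¡Hola miguel!'],
    'another_col': ['demo 1', 'demo 2', 'demo 3', 'demo 4'],
})

# na=False makes missing values count as "no match", so the mask is purely boolean.
has_non_ascii = df['text'].str.contains(r'[^\x00-\x7F]', na=False)

# Keep only rows without any non-ASCII character; the original df is not modified.
ascii_only = df[~has_non_ascii]

print(ascii_only.index.tolist())
# [0, 2]
```

With this choice the row whose text is missing is kept. If such rows should go as well, one option is to call dropna(subset=['text']) on the result.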