I have a pandas DataFrame that consists of 4 rows. The rows contain English news titles, but some rows contain non-English characters, like this one:
**She’s the Hollywood Power Behind Those ...**
I want to remove all rows like this one, that is, all rows in the pandas DataFrame that contain at least one non-English character.
If using Python >= 3.7:
df[df['col'].map(lambda x: x.isascii())]
where col is your target column.
Data:
import pandas as pd

df = pd.DataFrame({
    'colA': ['**She’s the Hollywood Power Behind Those ...**',
             'Hello, world!', 'Cainã', 'another value', 'test123*', 'âbc']
})
print(df.to_markdown())  # to_markdown() requires the optional tabulate package
| | colA |
|---:|:------------------------------------------------------|
| 0 | **She’s the Hollywood Power Behind Those ...** |
| 1 | Hello, world! |
| 2 | Cainã |
| 3 | another value |
| 4 | test123* |
| 5 | âbc |
Identifying and filtering out strings that contain non-ASCII characters (see the ASCII printable characters):
df[df.colA.map(lambda x: x.isascii())]
Output:
colA
1 Hello, world!
3 another value
4 test123*
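If the column may contain missing values or non-string entries, calling x.isascii() directly raises an AttributeError. The following is a minimal sketch of one way to guard against that, assuming such rows should be dropped along with the non-ASCII ones; the sample data and the names mask and english_only are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data with one missing value (None).
df = pd.DataFrame({
    'colA': ['She’s the Hollywood Power', 'Hello, world!', 'Cainã', None, 'test123*']
})

# Keep a row only if the value is a string and every character is ASCII.
# Missing or non-string values evaluate to False and are dropped.
mask = df['colA'].map(lambda x: isinstance(x, str) and x.isascii())
english_only = df[mask]

print(english_only)
```

Whether missing values should be dropped or kept depends on the data; to keep them, the condition can be changed to return True for non-strings.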
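The original approach was to use a user-defined function like this (useful on Python versions before 3.7, where str.isascii() is not available):
def is_ascii(s):
    try:
        # Decoding as ASCII fails if any byte is outside the ASCII range
        s.encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        return False
    else:
        return True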
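For completeness, here is a sketch of how this helper can be plugged into the same filter as above. The DataFrame is a shortened copy of the sample data, and filtered is just an illustrative name.

```python
import pandas as pd

def is_ascii(s):
    """Return True if the string s contains only ASCII characters."""
    try:
        s.encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        return False
    else:
        return True

df = pd.DataFrame({'colA': ['Hello, world!', 'Cainã', 'another value', 'âbc']})

# Pass the function itself to map(); no lambda is needed.
filtered = df[df['colA'].map(is_ascii)]

print(filtered)
```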
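You can also use a regular expression to do that.
The re module is part of the Python standard library, so nothing needs to be installed (pip install regex installs a separate third-party package, which is not required here).
import re
and use [^a-zA-Z] to find characters that are not English letters.
To break it down:
^ : not (negates the character class)
a-z : lowercase letters
A-Z : capital letters
Note that this pattern also matches digits, spaces, and punctuation, so it needs to be extended before it can be used on full titles.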
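A sketch of a regex-based filter for the sample data follows. It swaps in the character class [^\x00-\x7F] (anything outside the ASCII range) in place of [^a-zA-Z], which is an assumption about the intent rather than the pattern given in the answer; NON_ASCII and filtered are illustrative names.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'colA': ['Hello, world!', 'Cainã', 'another value', 'test123*', 'âbc']
})

# Matches any single character outside the ASCII range (code points 0-127).
NON_ASCII = re.compile(r'[^\x00-\x7F]')

# Keep rows where no non-ASCII character is found.
filtered = df[df['colA'].map(lambda s: NON_ASCII.search(s) is None)]

print(filtered)
```

Unlike [^a-zA-Z], this class leaves rows such as 'Hello, world!' and 'test123*' in place, because commas, spaces, digits, and asterisks are all ASCII.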