Removing rows that contain non-English words in a Pandas DataFrame

I have a pandas DataFrame that consists of 4 rows. The English rows contain news titles, but some rows contain non-English characters, like this one:

**She’s the Hollywood Power Behind Those ...**

I want to remove all rows like this one, that is, every row in the DataFrame that contains at least one non-English character.

If you are using Python >= 3.7, you can use str.isascii():

df[df['col'].map(lambda x: x.isascii())]

where col is your target column.


Data:

import pandas as pd

df = pd.DataFrame({
    'colA': ['**She’s the Hollywood Power Behind Those ...**', 
             'Hello, world!', 'Cainã', 'another value', 'test123*', 'âbc']
})

print(df.to_markdown())
|    | colA                                                  |
|---:|:------------------------------------------------------|
|  0 | **She’s the Hollywood Power Behind Those ...** |
|  1 | Hello, world!                                         |
|  2 | Cainã                                                 |
|  3 | another value                                         |
|  4 | test123*                                              |
|  5 | âbc                                                   |

Keeping only the rows whose strings are pure ASCII (see the ASCII printable characters), which filters out the strings with non-English characters:

df[df.colA.map(lambda x: x.isascii())]

Output:

            colA
1  Hello, world!
3  another value
4       test123*
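
One caveat with the filter above: str.isascii() only exists on strings, so a missing value in the column would raise an AttributeError. The following is a minimal sketch of a guard for that case. The column name colA matches the sample data; the helper name is_ascii_text and the extra None row are assumptions added for illustration.

```python
import pandas as pd

# Sample data with one missing value (the None row is an assumption for this sketch).
df = pd.DataFrame({
    'colA': ['Hello, world!', 'Cainã', None, 'test123*', 'âbc']
})

def is_ascii_text(value):
    # Hypothetical helper: True only for str values made of ASCII characters.
    # Non-string values such as None/NaN are treated as "not English" here.
    return isinstance(value, str) and value.isascii()

filtered = df[df['colA'].map(is_ascii_text)]
print(filtered)
```

Whether missing values should be dropped or kept depends on the data; returning True for non-strings in the helper would keep them instead.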

The original approach, which also works on Python versions before 3.7, was to use a user-defined function like this:

def is_ascii(s):
    try:
        s.encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        return False
    else:
        return True
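
To apply this function to the DataFrame, it can be passed to map in the same way as the lambda. A short sketch, assuming a colA column like the one in the sample data; the function is repeated so the snippet runs on its own.

```python
import pandas as pd

def is_ascii(s):
    # Encoding to UTF-8 normally succeeds; decoding those bytes as ASCII
    # fails as soon as any byte falls outside the 0-127 range.
    try:
        s.encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        return False
    else:
        return True

df = pd.DataFrame({'colA': ['Hello, world!', 'Cainã', 'another value', 'âbc']})

# Keep only the rows for which is_ascii returns True.
ascii_only = df[df['colA'].map(is_ascii)]
print(ascii_only)
```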

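You can also use a regular expression to do that.

The re module is part of the Python standard library, so nothing needs to be installed (the separate regex package from pip is not required for this):

import re

and use a character class such as [^a-zA-Z] to filter it.

To break it down:
^ : not (negates the character class)
a-z : lowercase letters
A-Z : uppercase letters

Note that [^a-zA-Z] also matches spaces, digits, and punctuation, so the class usually needs to be widened before it is used on whole titles.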
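
As a rough sketch of the regex approach: because [^a-zA-Z] matches any character that is not a letter, it flags ordinary English titles that contain a space or a comma. The example below contrasts it with [^\x00-\x7F], which matches only characters outside the ASCII range. The second pattern and the helper name has_non_ascii are assumptions for illustration, not part of the original answer.

```python
import re

# Matches anything that is not an unaccented Latin letter (spaces included).
NON_LETTER = re.compile(r'[^a-zA-Z]')
# Matches only characters outside the ASCII range (code points above 127).
NON_ASCII = re.compile(r'[^\x00-\x7F]')

def has_non_ascii(text):
    # Hypothetical helper: True if the text contains any non-ASCII character.
    return NON_ASCII.search(text) is not None

titles = ['Hello, world!', 'Cainã', 'another value', 'âbc']

for title in titles:
    # Show, per title, whether each pattern finds a match.
    print(title, bool(NON_LETTER.search(title)), has_non_ascii(title))

english_titles = [t for t in titles if not has_non_ascii(t)]
print(english_titles)
```

With a DataFrame, the same helper could be applied as df[~df['colA'].map(has_non_ascii)].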
