Removing rows that contain non-English words in a Pandas DataFrame

Question

I have a pandas DataFrame that consists of 4 rows. The English rows contain news titles, but some rows contain non-English characters, like this one:

**SheÃ¢â‚¬â„¢s the Hollywood Power Behind Those ...**

I want to remove all rows like this one, that is, every row in the DataFrame that contains at least one non-English character.

Answer 1

If using Python >= 3.7:

df[df['col'].map(lambda x: x.isascii())]

where col is your target column.

Data:

df = pd.DataFrame({
    'colA': ['**SheÃ¢â‚¬â„¢s the Hollywood Power Behind Those ...**', 
             'Hello, world!', 'Cainã', 'another value', 'test123*', 'âbc']
})

print(df.to_markdown())

|    | colA                                                  |
|---:|:------------------------------------------------------|
|  0 | **SheÃ¢â‚¬â„¢s the Hollywood Power Behind Those ...** |
|  1 | Hello, world!                                         |
|  2 | Cainã                                                 |
|  3 | another value                                         |
|  4 | test123*                                              |
|  5 | âbc                                                   |

Identifying strings with non-English characters and filtering them out (see the list of ASCII printable characters):

df[df.colA.map(lambda x: x.isascii())]

Output:

            colA
1  Hello, world!
3  another value
4       test123*
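The lambda above calls x.isascii() on every value, so it raises an AttributeError if the column holds a non-string value such as a missing entry. A minimal sketch of a more defensive mask follows, assuming you want such rows dropped as well; the sample data with a None entry is hypothetical and not part of the original question.

```python
import pandas as pd

# Hypothetical sample data: the None entry is an assumption added to
# illustrate missing values; it is not in the original example.
df = pd.DataFrame({
    'colA': ['Hello, world!', 'Cainã', None, 'test123*', 'âbc']
})

# Keep a row only if the value is a string AND every character is ASCII.
# Non-string values (None/NaN) evaluate to False and are dropped.
mask = df['colA'].map(lambda x: isinstance(x, str) and x.isascii())
ascii_only = df[mask]

print(ascii_only)
```

If missing values should be kept instead, the isinstance check can be changed to return True for them.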

The original approach, which also works on Python versions before 3.7, was to use a user-defined function like this:

def is_ascii(s):
    try:
        s.encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        return False
    else:
        return True
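The function can be passed straight to map to build the same boolean mask. The sketch below repeats the function so that it runs on its own and assumes the column is named colA, as in the sample data above.

```python
import pandas as pd

def is_ascii(s):
    # Encoding to UTF-8 and decoding as ASCII fails as soon as the
    # string contains any character outside the ASCII range.
    try:
        s.encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        return False
    else:
        return True

df = pd.DataFrame({
    'colA': ['Hello, world!', 'Cainã', 'another value', 'test123*', 'âbc']
})

# map calls is_ascii once per value and returns a boolean Series,
# which is then used to select the rows to keep.
filtered = df[df['colA'].map(is_ascii)]

print(filtered)
```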

Answer 2

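You can use a regular expression to do that.

The re module is part of the Python standard library, so it needs no installation (pip install regex installs a separate third-party package, which is not required here).

import re

and use [^a-zA-Z] to filter it.

To break it down: ^ means "not", a-z matches lowercase letters, and A-Z matches uppercase letters. Note that this class also matches spaces, digits, and punctuation, so add those inside the brackets if they should be allowed.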
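A rough sketch of how a negated character class could be applied to a column, using pandas' Series.str.contains. Because titles normally contain spaces and punctuation, the sketch swaps in the wider class [^\x00-\x7F] (any non-ASCII character) for the filter; that substitution and the column name colA are assumptions, not part of this answer.

```python
import pandas as pd

df = pd.DataFrame({
    'colA': ['Hello, world!', 'Cainã', 'another value', 'test123*', 'âbc']
})

# The pattern from the answer flags every character that is not a
# plain letter, including spaces, digits, and punctuation.
letters_only_hits = df['colA'].str.contains(r'[^a-zA-Z]', regex=True)

# Assumed alternative: flag only characters outside the ASCII range.
has_non_ascii = df['colA'].str.contains(r'[^\x00-\x7F]', regex=True)

# Keep the rows that contain no flagged character.
filtered = df[~has_non_ascii]

print(letters_only_hits.sum())  # number of rows flagged by [^a-zA-Z]
print(filtered)
```

With this sample data, [^a-zA-Z] flags every row, while the wider class flags only the rows with accented characters.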
