
Eliminate outliers in a dataframe with different dtypes - Pandas

I want to eliminate the outliers in a dataframe that has columns with different dtypes (int64 and object). I need to remove all rows that have outliers in at least one column. So, I tried to use the following code:

import numpy as np
from scipy import stats
df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]

For each column, this code computes the Z-score for each value by using the column's mean and standard deviation. 'all(axis=1)' guarantees that for each row, all columns satisfy the constraint (absolute value of each z-score is below 3).
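To make that description concrete, here is a minimal sketch on a numeric-only frame. The values are made up for illustration, and it assumes scipy's default of `ddof=0` (population standard deviation), so the manual formula should match what `stats.zscore` returns.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical numeric-only values, used purely for illustration.
numeric = pd.DataFrame({
    "area(m2)": [90, 120, 60, 75, 100],
    "rent($)": [2000, 3000, 800, 1500, 2500],
})

# Manual z-score: subtract the column mean, then divide by the population
# standard deviation (ddof=0, which is scipy's default).
manual = (numeric - numeric.mean()) / numeric.std(ddof=0)

# scipy's result, converted to a plain array for comparison.
from_scipy = np.asarray(stats.zscore(numeric))

same = np.allclose(manual.to_numpy(), from_scipy)

# One boolean per row: True only if every column's |z| is below 3.
row_mask = (np.abs(from_scipy) < 3).all(axis=1)
print(same)
print(row_mask)
```

With only a handful of rows no value can reach |z| >= 3, so every row is kept here; the filter only starts to bite on larger samples.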

However, as some columns' dtype is 'object', I am receiving the following error: TypeError: unsupported operand type(s) for /: 'str' and 'int'

I think this happens because a z-score cannot be calculated for columns that contain only strings ('object' dtype). So I need code that considers only the numerical columns when detecting and eliminating the outliers.

Do you know how to eliminate outliers in a dataframe that has columns with different dtypes (int64 and object)?

This dataframe is about property rentals in Brazil. You can create a sample by using the following code:

import pandas as pd

data = {
    'city': ['São Paulo', 'Rio', 'Recife'],
    'area(m2)': [90, 120, 60],
    'Rooms': [3, 2, 4],
    'Bathrooms': [2, 3, 3],
    'animal': ['accept', 'do not accept', 'accept'],
    'rent($)': [2000, 3000, 800]
}

df = pd.DataFrame(
    data,
    columns=['city', 'area(m2)', 'Rooms', 'Bathrooms', 'animal', 'rent($)']
)

print(df)

This is how the sample looks:

       city  area(m2)  Rooms  Bathrooms         animal  rent($)
0  São Paulo        90      3          2         accept     2000
1        Rio       120      2          3  do not accept     3000
2     Recife        60      4          3         accept      800

The original dataset can be found at: https://www.kaggle.com/rubenssjr/brasilian-houses-to-rent

Try using select_dtypes to pick only the columns of df that have a particular dtype.

To select all numeric types, pass np.number or 'number' as the include argument:

new_df = df[
    (np.abs(stats.zscore(df.select_dtypes(include=np.number))) < 3).all(axis=1)
]
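As a rough end-to-end check of this approach, the sketch below uses hypothetical data (eleven ordinary rentals plus one extreme rent), since the three-row sample from the question is too small for any value to reach a z-score of 3. The column names follow the question; the values are invented.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data: 11 ordinary rentals plus one extreme rent in the last row.
sample = pd.DataFrame({
    "city": ["São Paulo", "Rio", "Recife"] * 4,
    "area(m2)": [90, 120, 60, 75, 100, 85, 95, 110, 70, 80, 105, 90],
    "animal": ["accept", "do not accept", "accept"] * 4,
    "rent($)": [2000, 3000, 800, 1500, 2500, 1800,
                2200, 2700, 1200, 1900, 2100, 100000],
})

# z-scores are computed on the numeric columns only...
z = np.abs(np.asarray(stats.zscore(sample.select_dtypes(include="number"))))
keep = (z < 3).all(axis=1)

# ...but the boolean mask filters the full frame, so 'city' and 'animal' survive.
filtered = sample[keep]
print(filtered.shape)
print(list(filtered.columns))
```

The key point is that select_dtypes is applied only when building the mask, so the object columns are still present in the filtered result.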

You can also iterate through the columns, check each column's dtype, and compute outliers only for the numeric ones, keeping a running list of the index labels to drop. Something like this:

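import numpy as np
import pandas as pd
from scipy import stats

drop_idx = []
for col in df:
    # skip non-numeric columns such as 'city' and 'animal'
    if not pd.api.types.is_numeric_dtype(df[col]):
        continue
    # grab the index labels of all outliers; note that the test is '>= 3' now
    drop_idx.extend(df[np.abs(stats.zscore(df[col])) >= 3].index)
df = df.drop(set(drop_idx))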
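If you need this in more than one place, the loop could be wrapped in a small function. The sketch below is only one way to do it: `drop_zscore_outliers` and its `threshold` parameter are hypothetical names, not part of pandas or scipy, the data is invented, and it assumes the index labels are unique (dropping by label would otherwise remove every row sharing that label).

```python
import numpy as np
import pandas as pd
from scipy import stats


def drop_zscore_outliers(frame, threshold=3.0):
    """Return `frame` without rows whose |z-score| reaches `threshold`
    in any numeric column. Hypothetical helper for illustration."""
    drop_idx = set()
    # Iterating over a DataFrame yields its column labels.
    for col in frame.select_dtypes(include="number"):
        z = np.abs(stats.zscore(frame[col].to_numpy()))
        drop_idx.update(frame.index[z >= threshold])
    return frame.drop(index=list(drop_idx))


# Hypothetical data: eleven ordinary rents and one extreme value.
rents = pd.DataFrame({
    "city": ["São Paulo", "Rio", "Recife"] * 4,
    "rent($)": [2000, 3000, 800, 1500, 2500, 1800,
                2200, 2700, 1200, 1900, 2100, 100000],
})
cleaned = drop_zscore_outliers(rents)
print(len(rents), len(cleaned))
```

The function returns a new frame rather than modifying its argument, so the original data stays available for comparison.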


 