Data Cleaning with Pandas in Python

I am trying to clean a CSV file for data analysis. How do I convert TRUE/FALSE values into 1 and 0?

When I searched Google, the suggestion was df.somecolumn = df.somecolumn.astype(int). However, this CSV file has 100 columns, and not every column is TRUE/FALSE (some are categorical, some are numerical). How can I write one sweeping piece of code that converts every TRUE/FALSE column to 1 and 0, without typing 50 lines of df.somecolumn = df.somecolumn.astype(int)?

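You can select the boolean columns by dtype and cast only those. (Assigning directly to df.select_dtypes(...) is a syntax error, so take the column names first.)

bool_cols = df.select_dtypes(include='bool').columns
df[bool_cols] = df[bool_cols].astype(int)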
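To illustrate selecting boolean columns by dtype, here is a minimal sketch. The sample CSV text and the column names a, b, c are made up for the example. It relies on pd.read_csv parsing a column that contains only TRUE/FALSE as bool by default.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the real 100-column CSV.
csv_text = "a,b,c\n1,TRUE,x\n2,FALSE,y\n3,TRUE,z\n"
df = pd.read_csv(io.StringIO(csv_text))

# Column b is read as bool because it contains only TRUE/FALSE.
bool_cols = df.select_dtypes(include="bool").columns

# Cast only the boolean columns; every other column is left untouched.
df[bool_cols] = df[bool_cols].astype(int)
```

A column that mixes TRUE/FALSE with missing or other values is not read as bool, so this sketch would skip it.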

A slightly different approach. First, the dtypes of a DataFrame can be returned using df.dtypes, which gives a pandas Series that looks like this:

a     int64
b      bool
c    object
dtype: object

Second, we can swap the bool dtype for an integer dtype using replace:

df.dtypes.replace('bool', 'int8'), which gives

a     int64
b      int8
c    object
dtype: object

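Finally, a pandas Series behaves like a dictionary (here mapping column name to dtype), so it can be passed to pd.DataFrame.astype.

We can write it as a one-liner. Note that astype returns a new DataFrame, so assign the result back:

df = df.astype(df.dtypes.replace('bool', 'int8'))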
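As a variant of the same idea (not the answer's exact code), the column-to-dtype mapping can be built explicitly as a plain dict and passed to astype. This is a minimal sketch with a made-up three-column DataFrame. The names dtype_map and converted are illustrative.

```python
import pandas as pd

# Hypothetical frame with one integer, one boolean and one string column.
df = pd.DataFrame({"a": [1, 2], "b": [True, False], "c": ["x", "y"]})

# Map every boolean column to int8; columns not in the dict keep their dtype.
dtype_map = {col: "int8" for col in df.select_dtypes(include="bool").columns}

# astype returns a new DataFrame and leaves df itself unchanged.
converted = df.astype(dtype_map)
```

Using int8 rather than int keeps memory use low, which may matter for wide files.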

I would do it like this:

df.somecolumn = df.somecolumn.apply(lambda x: 1 if x=="TRUE" else 0)

If you want to iterate through all your columns and check whether they contain TRUE/FALSE values, you can do this:

for c in df:
    # `in` on a Series tests the index labels, so check the values with isin()
    if df[c].isin(['TRUE', 'FALSE']).any():
        df[c] = df[c].apply(lambda x: 1 if x == 'TRUE' else 0)

Note that this approach is case-sensitive, and it won't work well if the TRUE/FALSE values in a column are mixed with others: every value other than 'TRUE', including missing values, becomes 0.
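One way to soften both caveats is to normalise case and convert a column only when every value is TRUE or FALSE. The following is a minimal sketch under those assumptions. The helper name convert_true_false_strings and the sample data are hypothetical.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def convert_true_false_strings(df):
    """Return a copy where all-TRUE/FALSE string columns become 1/0."""
    mapping = {"TRUE": 1, "FALSE": 0}
    out = df.copy()
    for col in out.columns:
        # Numeric (and bool) columns are not string flags; skip them.
        if is_numeric_dtype(out[col]):
            continue
        # Normalise whitespace and case so 'true', ' True ' and 'TRUE' match.
        normalized = out[col].astype(str).str.strip().str.upper()
        # Convert only when every value is TRUE or FALSE; mixed columns
        # (including ones with missing values) are left as they are.
        if normalized.isin(list(mapping)).all():
            out[col] = normalized.map(mapping).astype("int8")
    return out


# Hypothetical example data.
sample = pd.DataFrame({
    "flag": ["TRUE", "false", " True "],
    "city": ["A", "TRUE", "B"],
    "n": [1, 2, 3],
})
result = convert_true_false_strings(sample)
```

Here flag is converted, while city is left alone because it mixes TRUE with other values.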
