
How can I drop all-NaN columns at once when the columns have string/Int64 dtypes?

I have data like:

In [1]: import numpy as np
        import pandas as pd
        d = {'ID': [14, 14, 14, 14, 14, 14, 14, 15, 15],
             'NAME': ['KWI', 'NED', 'RICK', 'NICH', 'DIONIC', 'RICHARD', 'ROCKY', 'CARLOS', 'SIDARTH'],
             'ID_COUNTRY': [1, 2, 3, 4, 5, 6, 7, 8, 9],
             'COUNTRY': ['MEXICO', 'ITALY', 'CANADA', 'ENGLAND', 'GERMANY', 'UNITED STATES', 'JAPAN', 'SPAIN', 'BRAZIL'],
             'ID_CITY': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
             'CITY': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
             'STATUS': ['OK', 'OK', 'OK', 'OK', 'OK', 'NOT', 'OK', 'NOT', 'OK']}
        df = pd.DataFrame(data=d)

Out[2]:
      ID       NAME      ID_COUNTRY     COUNTRY        ID_CITY     CITY     STATUS
0     14       KWI           1           MEXICO          NaN        NaN        OK
1     14       NED           2           ITALY           NaN        NaN        OK
2     14       RICK          3           CANADA          NaN        NaN        OK
3     14       NICH          4           ENGLAND         NaN        NaN        OK
4     14       DIONIC        5           GERMANY         NaN        NaN        OK
5     14       RICHARD       6           UNITED STATES   NaN        NaN        NOT
6     14       ROCKY         7           JAPAN           NaN        NaN        OK
7     15       CARLOS        8           SPAIN           NaN        NaN        NOT
8     15       SIDARTH       9           BRAZIL          NaN        NaN        OK

Then I need to set the dtype of each column for later use:

df.iloc[:, [0, 2, 4]] = df.iloc[:, [0, 2, 4]].astype("Int64")
df.iloc[:, [1, 3, 5, 6]] = df.iloc[:, [1, 3, 5, 6]].astype("string")

After doing this, I want to drop the columns that contain only NaN values and get the names of the dropped columns, so that I can remove the same columns from another dataframe with the same column names, like this one:

 In [3]: d1 = {'ID': [14, 14, 14], 
         'NAME': ['KWI', 'NED', 'RICK'], 
         'ID_COUNTRY':[1, 2, 3], 
         'COUNTRY':['MEXICO', 'ITALY', 'CANADA'], 
         'ID_CITY':[20, 22, 24], 
         'CITY':['MX', 'AT', 'CA'], 
         'STATUS': ['OK', 'OK', 'OK']}
    df1 = pd.DataFrame(data=d1)
 Out [4]: 
      ID       NAME      ID_COUNTRY     COUNTRY        ID_CITY     CITY     STATUS
0     14       KWI           1           MEXICO          20        MX        OK
1     14       NED           2           ITALY           22        AT        OK
2     14       RICK          3           CANADA          24        CA        OK

The issue is that df['CITY'].isna() gives me False for every value in the column. I do not know why, because df['ID_CITY'].isna() gives me True. I guess it is because one column is Int64 and the other is object. Examples:

In [5]: df4['ID_CITY'].isna()                       
Out[6]:                         
0    True                   
1    True
2    True                          
3    True
4    True
5    True
6    True
7    True
8    True
Name: ID_CITY, dtype: bool

In [7]: df4['CITY'].isna()
Out[8]:
0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8    False
Name: CITY, dtype: bool

Once the issue above is fixed, the desired output for df and df1 is:

Out[9]:
      ID       NAME      ID_COUNTRY     COUNTRY          STATUS
0     14       KWI           1           MEXICO            OK
1     14       NED           2           ITALY             OK
2     14       RICK          3           CANADA            OK
3     14       NICH          4           ENGLAND           OK
4     14       DIONIC        5           GERMANY           OK 
5     14       RICHARD       6           UNITED STATES     NOT
6     14       ROCKY         7           JAPAN             OK
7     15       CARLOS        8           SPAIN             NOT
8     15       SIDARTH       9           BRAZIL            OK

 Out [10]: 
      ID       NAME      ID_COUNTRY     COUNTRY     STATUS
0     14       KWI           1           MEXICO       OK
1     14       NED           2           ITALY        OK
2     14       RICK          3           CANADA       OK

Thanks for reading.

Assuming that your input is the following (I have used column names instead of column indexes, for clarity):

d = {'ID': [14, 14, 14, 14, 14, 14, 14, 15, 15], 
         'NAME': ['KWI', 'NED', 'RICK', 'NICH', 'DIONIC', 'RICHARD', 'ROCKY', 'CARLOS', 'SIDARTH'], 
         'ID_COUNTRY':[1, 2, 3,4,5,6,7,8,9], 
         'COUNTRY':['MEXICO', 'ITALY', 'CANADA', 'ENGLAND', 'GERMANY', 'UNITED STATES', 'JAPAN', 'SPAIN', 'BRAZIL'], 
         'ID_CITY':[np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan], 
         'CITY':[np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan], 
         'STATUS': ['OK', 'OK', 'OK', 'OK', 'OK', 'NOT', 'OK', 'NOT', 'OK']}
df = pd.DataFrame(data=d)

You can cast a pandas object to a specified dtype with astype. Here I use Int64 and str (instead of string, as in your code).

df[['ID', 'ID_COUNTRY', 'ID_CITY']] = df[['ID', 'ID_COUNTRY', 'ID_CITY']].astype("Int64")
df[['NAME', 'COUNTRY', 'CITY', 'STATUS']] = df[['NAME', 'COUNTRY', 'CITY', 'STATUS']].astype("str")

With a temporary cast to float, you can detect the NaN values. This works because float accepts the string nan, with an optional + or - prefix, as Not a Number (NaN).

df['CITY'].astype("float").isna()

The output:

0    True
1    True
2    True
3    True
4    True
5    True
6    True
7    True
8    True
Name: CITY, dtype: bool
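
Note that the float cast above only works because the column holds nothing but the text nan; a column that also contains real text such as 'MX' cannot be cast to float. As a rough sketch for that mixed case (the Series s and the name is_missing are made up for illustration, and the literal 'nan' strings are written out by hand to mimic a column that was cast to str), you could compare against the text directly instead:

```python
import pandas as pd

# Hypothetical column that mixes real text with the literal string "nan",
# as can happen after a cast to str.
s = pd.Series(["nan", "MX", "nan"], name="CITY")

# Treat both genuine missing values and the literal text "nan" as missing.
is_missing = s.isna() | s.eq("nan")

print(is_missing.tolist())
```

The obvious trade-off is that a genuine value spelled exactly nan would also be flagged as missing.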

Either

df['ID_CITY'].isna()

or

df['ID_CITY'].astype("float").isna()

will result in:

0    True
1    True
2    True
3    True
4    True
5    True
6    True
7    True
8    True
Name: ID_CITY, dtype: bool
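
To get from these checks to the output the question asks for, one possible approach is sketched below on a shortened version of the data. It is only a sketch under a few assumptions: it uses the nullable "string" dtype rather than str, so that missing values stay missing and no float cast is needed, and the names all_nan_cols, df_clean and df1_clean are mine, not from the question.

```python
import numpy as np
import pandas as pd

# Shortened versions of df and df1 from the question.
df = pd.DataFrame({
    "ID": [14, 14, 15],
    "NAME": ["KWI", "NED", "CARLOS"],
    "ID_CITY": [np.nan, np.nan, np.nan],
    "CITY": [np.nan, np.nan, np.nan],
    "STATUS": ["OK", "OK", "NOT"],
})
df1 = pd.DataFrame({
    "ID": [14, 14],
    "NAME": ["KWI", "NED"],
    "ID_CITY": [20, 22],
    "CITY": ["MX", "AT"],
    "STATUS": ["OK", "OK"],
})

# Assumption: nullable dtypes ("Int64", "string") are used, so NaN becomes <NA>
# and isna() still reports it as missing.
df = df.astype({
    "ID": "Int64",
    "ID_CITY": "Int64",
    "NAME": "string",
    "CITY": "string",
    "STATUS": "string",
})

# Names of the columns in which every value is missing.
all_nan_cols = df.columns[df.isna().all()].tolist()

# Drop those columns from both dataframes.
df_clean = df.drop(columns=all_nan_cols)
df1_clean = df1.drop(columns=all_nan_cols)

print(all_nan_cols)
print(list(df1_clean.columns))
```

If you do not need the list of names, df.dropna(axis=1, how="all") drops the all-missing columns of df in a single call, but the explicit list is what lets you remove the same columns from df1.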
