
How to count missing data in each column in python?

I have a large data frame with 85 columns. The missing data is coded as NaN. My goal is to get the number of missing values in each column, so I wrote a for loop to build a list of the counts. But it does not work.

Here is my code:

headers = x.columns.values.tolist() 
nans=[]
for head in headers:
    nans_col = x[x.head == 'NaN'].shape[0]
    nan.append(nans_col)

When I use the code inside the loop to count the missing values for one specific column, replacing head with that column's name, it works and gives me the count for that column.

So I do not know how to fix the for loop. Could somebody help me with this? I would really appreciate it.

For the columns of a DataFrame in pandas (the Python data analysis library) you can use:

In [3]: import numpy as np
In [4]: import pandas as pd
In [5]: df = pd.DataFrame({'a':[1,2,np.nan], 'b':[np.nan,1,np.nan]})
In [6]: df.isnull().sum()
Out[6]:
a    1
b    2
dtype: int64
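The same idiom extends to the whole frame: chaining a second sum() collapses the per-column counts into a single total. A small sketch, reusing the df defined above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, np.nan], 'b': [np.nan, 1, np.nan]})

# per-column counts of missing values
per_column = df.isnull().sum()    # a: 1, b: 2

# total number of missing values in the whole frame
total = df.isnull().sum().sum()   # 3
print(per_column)
print(total)
```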

For a single column, or for a Series, you can count the missing values as shown below:

In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: s = pd.Series([1,2,3, np.nan, np.nan])

In [4]: s.isnull().sum()
Out[4]: 2


If you have multiple DataFrames, below is a function that calculates the number of missing values in each column, together with the percentage:

Missing Data Analysis

def miss_data(df):
    cols = ['column_name', 'missing_data', 'missing_in_percentage']
    missing_data = pd.DataFrame(columns=cols)
    for col in df.columns:
        n_missing = df[col].isnull().sum()
        pct_missing = (n_missing / df[col].shape[0]) * 100
        missing_data.loc[len(missing_data)] = [col, n_missing, pct_missing]
    print(missing_data)
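As a quick check, here is the same helper called on a small hypothetical frame (the column names and values are invented for illustration; the logic is repeated so the snippet runs on its own, and the summary frame is returned as well as printed to make it reusable):

```python
import numpy as np
import pandas as pd

# Same logic as the helper above, returning the summary in addition to printing it.
def miss_data(df):
    cols = ['column_name', 'missing_data', 'missing_in_percentage']
    missing_data = pd.DataFrame(columns=cols)
    for col in df.columns:
        n_missing = df[col].isnull().sum()
        pct_missing = (n_missing / df[col].shape[0]) * 100
        missing_data.loc[len(missing_data)] = [col, n_missing, pct_missing]
    print(missing_data)
    return missing_data

# Hypothetical demo frame
demo = pd.DataFrame({'age': [25, np.nan, 40, np.nan],
                     'city': ['x', None, 'y', 'z']})
summary = miss_data(demo)
```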

This gives you, for each column, a count of missing and non-missing values (value_counts prints the labels False and True followed by their counts, where True marks a missing value):

missing_data = df.isnull()
for column in missing_data.columns.values.tolist():
    print(column)
    print(missing_data[column].value_counts())
    print("")
# function to show the total number of null values per column
column_names = np.array(data.columns.values)

def iter_columns_name(column_names):
    for k in column_names:
        print("total nulls {}=".format(k), pd.isnull(data[k]).values.ravel().sum())

# call the function
iter_columns_name(column_names)

# output
total nulls start_date= 0
total nulls end_date= 0
total nulls created_on= 0
total nulls lat= 9925
.
.
.

Just use DataFrame.info; the non-null count is probably what you want, and more.

>>> pd.DataFrame({'a':[1,2], 'b':[None, None], 'c':[3, None]}) \
.info(verbose=True, null_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   a       2 non-null      int64    
 1   b       0 non-null      object
 2   c       1 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 176.0+ bytes
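Note that in pandas 1.2 the null_counts argument of DataFrame.info was deprecated in favour of show_counts, and null_counts was removed entirely in pandas 2.0; on a recent pandas version the equivalent call looks like this:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [None, None], 'c': [3, None]})

# pandas >= 1.2: use show_counts instead of the deprecated null_counts
df.info(verbose=True, show_counts=True)
```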
