
SQL - Get percentage of empty rows in all columns with integer and varchar columns?

I have a table with 300+ columns and I want to get the percentage of empty rows in each column. I originally wanted to pull the data with Python and then write a function, but with 400,000,000+ rows that is not feasible. So now I am using sum and case when expressions to get the percentage for each column manually, like so:

select (100.0 * sum(case when A = ' ' or A is null then 1 else 0 end)) / count(*) as A
       ,(100.0 * sum(case when B is null then 1 else 0 end)) / count(*) as B
       ,(100.0 * sum(case when C = ' ' or C is null then 1 else 0 end)) / count(*) as C
       ,(100.0 * sum(case when D = ' ' or D is null then 1 else 0 end)) / count(*) as D
       ,(100.0 * sum(case when E = ' ' or E is null then 1 else 0 end)) / count(*) as E
       ,(100.0 * sum(case when F is null then 1 else 0 end)) / count(*) as F
       ....
from table

I have done this for about 30 columns, but I am trying to see if there is a way to do this across all the columns in the table. Some columns contain integers, so I can't use

 case when C = ' ' or C is null then 1 else 0 end

I have to use

 case when C is null then 1 else 0 end
 

Is there a way to continue using this logic, where the case expression changes depending on the type of the column, and get the percentages for all columns? Thanks.

My solution for this problem involved using Python and Redshift. Not sure if this is the best way to do it, but this is what worked for me.

I first used the pg_get_cols function from @nachospiu's comment to get all the column names:

import pandas as pd  # rsm.dbengine is my database connection, defined elsewhere

query = '''select col_name
            from pg_get_cols("schema.table") 
            cols(view_schema name, view_name name, col_name name, col_type varchar, col_num int)
            '''
cols = pd.read_sql_query(query, rsm.dbengine)

Then, once I had all the column names, I used a for loop to get the percentage for each column, with a try/except block so that the query would not fail when the column is not a varchar type.

nulls = {}
for col in cols.col_name:
    try:
        # Varchar columns: count blank strings as well as nulls
        que = '''select (100.0 * sum(case when {} = ' ' or {} is null then 1 else 0 end)) / count(*) as perc
                 from schema.table
                '''.format(col, col)
        perc = pd.read_sql(que, rsm.dbengine).perc.values[0]
        nulls[col] = perc

    except Exception:
        # Non-varchar columns: comparing to ' ' raises an error,
        # so only nulls are counted
        que = '''select (100.0 * sum(case when {} is null then 1 else 0 end)) / count(*) as perc
                 from schema.table
                '''.format(col)
        perc = pd.read_sql(que, rsm.dbengine).perc.values[0]
        nulls[col] = perc
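Since pg_get_cols also returns col_type, an alternative (just a sketch, not what I actually ran) is to choose the predicate from the column type up front instead of relying on try/except, which can also mask unrelated errors:

```python
def empty_pct_expr(col_name, col_type):
    """Build the SELECT expression for one column, picking the
    predicate based on the column type reported by pg_get_cols."""
    if col_type.startswith(('character', 'varchar', 'text', 'char')):
        # String columns: treat blanks as empty too
        cond = "{c} = ' ' or {c} is null".format(c=col_name)
    else:
        # Numeric and other types: only NULL counts as empty
        cond = "{c} is null".format(c=col_name)
    return ("(100.0 * sum(case when {cond} then 1 else 0 end)) / count(*) as {c}"
            .format(cond=cond, c=col_name))
```

Each expression can then be dropped into the same per-column query as above without ever hitting the type error.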
      

The loop produced a dictionary mapping each column name to its percentage of empty/null values.
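One query per column means one full table scan per column, which is expensive at 400M rows. A variation (again a sketch, assuming you have the (name, type) pairs from pg_get_cols) is to assemble all the expressions into a single SELECT so the table is scanned only once:

```python
def build_single_query(columns, table):
    """Assemble one SELECT computing every column's empty percentage
    in a single pass over the table.

    columns: list of (col_name, col_type) pairs, e.g. from pg_get_cols.
    """
    exprs = []
    for name, ctype in columns:
        if ctype.startswith(('character', 'varchar', 'text', 'char')):
            cond = "{c} = ' ' or {c} is null".format(c=name)
        else:
            cond = "{c} is null".format(c=name)
        exprs.append("(100.0 * sum(case when {cond} then 1 else 0 end)) / count(*) as {c}"
                     .format(cond=cond, c=name))
    return "select " + "\n     , ".join(exprs) + "\nfrom " + table
```

Running the result with pd.read_sql gives one row whose columns are the percentages, which is essentially the big manual query from the question, generated automatically.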
