Get percentage of NULL for all columns in Hive

I would like to get the percentage of NULL values in a table in Hive. Is there an easy way to do this without having to enumerate all column names in the query? In this case there are about 50k rows and 20 columns. Thanks in advance!

Something like:

SELECT count(each_column) / count(*) FROM TABLE_1 WHERE each_column = NULL;

If you do this in plain SQL, you need to list the columns. Here is one way:

select avg(case when col1 is null then 1.0 else 0.0 end) as col1_null_p,
       avg(case when col2 is null then 1.0 else 0.0 end) as col2_null_p,
       . . .
from t;

If you take the list of columns in the table, you can readily construct the query in a spreadsheet.
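
As an alternative to a spreadsheet, the same query text can be produced from a list of column names with a few lines of Python. This is only a sketch: the function name `null_fraction_query` is made up for illustration, and it assumes the column names contain no backticks.

```python
def null_fraction_query(table, columns):
    """Build one SELECT that returns the fraction of NULLs for each column."""
    if not columns:
        raise ValueError("at least one column name is required")
    # Backticks quote identifiers in HiveQL; names are assumed to contain no backticks.
    exprs = [
        f"avg(case when `{col}` is null then 1.0 else 0.0 end) as `{col}_null_p`"
        for col in columns
    ]
    return "select " + ",\n       ".join(exprs) + f"\nfrom {table}"


print(null_fraction_query("t", ["col1", "col2"]))
```

The printed text has the same shape as the hand-written query above, with one avg(...) expression per column.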

Which approach fits depends on your situation:

  • For 20 fixed columns: just type the query by hand
  • For 200 fixed columns: copy the column names into your favorite tool (such as Excel) and build the query there
  • For n columns that may not be fixed: write a script to generate the query

I once wrote a Python script for this. I don't have it at hand anymore, but it is quite easy to recreate with the following logic:

  1. Query the first row (or zero rows, if that still returns the headers) of the table to get all the column names.
  2. Build the queries that compute the column-based statistics (such as the percentage of NULL values) and union the results.
  3. Execute the resulting query.
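
The three steps above could be sketched roughly as follows. This is not the original script. It assumes a Python DB-API 2.0 cursor obtained from whichever Hive client library you use, a Hive version that accepts a top-level UNION ALL, and column names without quotes or backticks. The function names are hypothetical.

```python
def column_names(cursor, table):
    """Step 1: fetch zero rows and read the headers from the cursor metadata."""
    cursor.execute(f"select * from {table} limit 0")
    # DB-API cursors describe each result column with a sequence whose first item is the name.
    # Hive may return qualified names such as "t.col1", so keep only the last part.
    return [d[0].split(".")[-1] for d in cursor.description]


def null_stats_query(table, columns):
    """Step 2: one SELECT per column, glued together with UNION ALL."""
    parts = [
        f"select '{col}' as column_name, "
        f"avg(case when `{col}` is null then 1.0 else 0.0 end) as null_fraction "
        f"from {table}"
        for col in columns
    ]
    return "\nunion all\n".join(parts)


def null_stats(cursor, table):
    """Step 3: run the generated query and return {column_name: null_fraction}."""
    cursor.execute(null_stats_query(table, column_names(cursor, table)))
    return dict(cursor.fetchall())
```

With an open connection, `null_stats(conn.cursor(), "t")` would return one entry per column. Each SELECT in the union scans the table once, which is one reason this layout may not scale to wide tables.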

Of course, it can be extended to run on different tables and to compute other statistics, but be aware that this may not scale well.

In my case, I think I had to split the query building into batches of 20 columns each and concatenate the pieces afterwards, because running it on 400 columns at once generated a query that was too complex.
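
A minimal sketch of that batching idea, assuming the per-column statistic is the NULL fraction and that each batch is run as its own query whose results are combined afterwards. The helper names `batches` and `batched_null_queries` are made up for illustration, and the batch size of 20 is just the value mentioned above.

```python
def batches(items, size=20):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def batched_null_queries(table, columns, size=20):
    """Return one query per batch of columns instead of one huge query."""
    queries = []
    for batch in batches(columns, size):
        exprs = [
            f"avg(case when `{c}` is null then 1.0 else 0.0 end) as `{c}_null_p`"
            for c in batch
        ]
        queries.append("select " + ", ".join(exprs) + f" from {table}")
    return queries


queries = batched_null_queries("t", [f"col{i}" for i in range(1, 401)])
print(len(queries))  # 400 columns in batches of 20 -> 20 queries
```

Each query returns a single row, so the rows from all batches can be merged into one result on the client side.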
