
How to split a column into multiple columns and then count the null values in the new column in SQL or Pandas?

I have a relatively large table with thousands of rows and a few tens of columns. Some columns are metadata and others are numerical values. The problem I have is that some metadata columns are incomplete or partial, that is, they are missing the string after a ":". I want to get a count of how many entries are missing the part after the colon.

If you look at the miniature example below, what I should get is a small table telling me that in group A, MetaData is complete for 2 entries and incomplete (missing the part after ":") for the other 2 entries. Ideally I also want to get some statistics on SomeValue (count, max, min, etc.).

How do I do this with an SQL query or in Python pandas? It might turn out to be simple with some built-in function, but I am not getting it right.

Data:

Group MetaData SomeValue
A     AB:xxx    20
A     AB:        5
A     PQ:yyy    30
A     PQ:        2

Expected Output result:

Group MetaDataComplete Count
A     Yes               2
A     No                2

There is no reason to use split functions (unless the value can contain a colon character). I'm just going to assume that the "null" values (not technically the right word) end with :.

select
    "Group",
    case when MetaData like '%:' then 'No' else 'Yes' end as MetaDataComplete,
    count(*) as "Count"
from T
group by "Group", case when MetaData like '%:' then 'No' else 'Yes' end
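
The same grouping translated to pandas (a sketch; it builds the sample data from the question, and labels rows ending in ":" as 'No' so the result matches the expected output table):

```python
import pandas as pd

df = pd.DataFrame({'Group': ['A', 'A', 'A', 'A'],
                   'MetaData': ['AB:xxx', 'AB:', 'PQ:yyy', 'PQ:'],
                   'SomeValue': [20, 5, 30, 2]})

# Same test as the SQL LIKE '%:' -- does the string end with a colon?
df['MetaDataComplete'] = (df['MetaData'].str.endswith(':')
                          .map({True: 'No', False: 'Yes'}))
out = df.groupby(['Group', 'MetaDataComplete']).size().reset_index(name='Count')
print(out)
```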

You could also use right(MetaData, 1) = ':'.

Or, supposing that values can contain their own colons, try charindex(':', MetaData) = len(MetaData) if you just want to test whether the first colon is in the last position.
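
The last-position check is easy to verify in plain Python (a sketch; `is_incomplete` is a hypothetical helper mirroring the charindex/len comparison):

```python
def is_incomplete(s: str) -> bool:
    # True only when the first ':' is the last character of the string,
    # i.e. a colon exists and nothing follows it
    return s.find(':') == len(s) - 1

print(is_incomplete('AB:'))     # True: nothing after the colon
print(is_incomplete('AB:xxx'))  # False: value present
print(is_incomplete('AB:x:'))   # False: first colon is not in last position
```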

Here is an example:

## 1- Create Dataframe
In [1]:
import pandas as pd
import numpy as np
cols = ['Group', 'MetaData', 'SomeValue']
data = [['A', 'AB:xxx', 20],
        ['A', 'AB:', 5],
        ['A', 'PQ:yyy', 30],
        ['A', 'PQ:', 2]
       ]
df = pd.DataFrame(columns=cols, data=data)

# 2- New data frame with split value columns 
new = df["MetaData"].str.split(":", n = 1, expand = True) 

df["MetaData_1"]= new[0] 
df["MetaData_2"]= new[1]

# 3- Dropping old MetaData columns 
df.drop(columns =["MetaData"], inplace = True)

# 4- Replacing empty strings with NaN and counting them
df.replace('', np.nan, inplace=True)
df.isnull().sum()

Out [1]:

Group         0
SomeValue     0
MetaData_1    0
MetaData_2    2
dtype: int64
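
The question also asked for statistics on SomeValue. Building on the same sample data, a groupby aggregation can return the count together with min and max (a sketch; the `MetaDataComplete` flag name is my own choice, not part of the original answer):

```python
import pandas as pd
import numpy as np

cols = ['Group', 'MetaData', 'SomeValue']
data = [['A', 'AB:xxx', 20], ['A', 'AB:', 5], ['A', 'PQ:yyy', 30], ['A', 'PQ:', 2]]
df = pd.DataFrame(columns=cols, data=data)

# Flag completeness instead of splitting: 'No' when nothing follows the colon
df['MetaDataComplete'] = np.where(df['MetaData'].str.endswith(':'), 'No', 'Yes')

# Count plus min/max of SomeValue per group and completeness flag
stats = df.groupby(['Group', 'MetaDataComplete'])['SomeValue'].agg(['count', 'min', 'max'])
print(stats)
```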

If you're just after a count, you could also take an algorithmic approach: loop over the data and use a regular expression with a negative lookahead.

import pandas as pd
import re

# Sample data from the question
df = pd.DataFrame({'Group': ['A', 'A', 'A', 'A'],
                   'MetaData': ['AB:xxx', 'AB:', 'PQ:yyy', 'PQ:'],
                   'SomeValue': [20, 5, 30, 2]})

pattern = '.*:(?!.)' # matches strings with nothing after the final ':'
missing = 0
not_missing = 0
for i in df['MetaData'].tolist():
    match = re.findall(pattern, i)
    if match:
        missing += 1
    else:
        not_missing += 1
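
The same count can also be obtained without an explicit loop; a sketch applying the pattern vectorized with `Series.str.match` on the sample values:

```python
import pandas as pd

s = pd.Series(['AB:xxx', 'AB:', 'PQ:yyy', 'PQ:'])

# str.match applies the anchored pattern to every element at once
pattern = r'.*:(?!.)'  # matches only when ':' is the final character
mask = s.str.match(pattern)
missing = int(mask.sum())
not_missing = int((~mask).sum())
print(missing, not_missing)  # 2 2
```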

From a SQL perspective, performing a split is painful, not to mention that using the split results forces you to run the inner query first and then query its results:

SELECT
  Results.[Group],
  Results.MetaData,
  Results.MetaValue,
  COUNT(Results.MetaValue)
FROM (SELECT
  [Group],
  MetaData,
  SUBSTRING(MetaData, CHARINDEX(':', MetaData) + 1, LEN(MetaData)) AS MetaValue
FROM VeryLargeTable) AS Results
GROUP BY Results.[Group],
         Results.MetaData,
         Results.MetaValue
