GraphLab: find all the columns that have at least one None value

How can I find all the columns in an SFrame that have at least one None value? One way would be to iterate through every column and check whether any value in that column is None. Is there a better way to do the job?

To find None values in an SFrame, use the SArray method num_missing (see the SArray documentation).

Solution

>>> col_w_none = [col for col in sf.column_names() if sf[col].num_missing()>0]

Example

>>> sf = gl.SFrame({'foo':[1,2,3,4], 'bar':[1,None,3,4]})
>>> print sf
+------+-----+
| bar  | foo |
+------+-----+
|  1   |  1  |
| None |  2  |
|  3   |  3  |
|  4   |  4  |
+------+-----+
[4 rows x 2 columns]
>>> print [col for col in sf.column_names() if sf[col].num_missing()>0]
['bar']

Caveats

  • It isn't optimal, since it counts every missing value in a column instead of stopping at the first None.
  • It won't detect NaN or empty strings, as the following example shows:

>>> sf = gl.SFrame({'foo':[1,2,3,4], 'bar':[1,None,3,4], 'baz':[1,2,float('nan'),4], 'qux':['spam', '', 'ham', 'eggs']} )
>>> print sf
+------+-----+-----+------+
| bar  | baz | foo | qux  |
+------+-----+-----+------+
|  1   | 1.0 |  1  | spam |
| None | 2.0 |  2  |      |
|  3   | nan |  3  | ham  |
|  4   | 4.0 |  4  | eggs |
+------+-----+-----+------+
[4 rows x 4 columns]
>>> print [col for col in sf.column_names() if sf[col].num_missing()>0]
['bar']
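
If NaN and empty strings should also count as missing, and stopping at the first hit matters, the check can be sketched in plain Python. This is only an illustration of the logic, not GraphLab API: it assumes the columns are available as a dict of Python lists, and the helpers is_missing and columns_with_missing are hypothetical names made up for this example.

```python
import math


def is_missing(value):
    """Treat None, NaN, and the empty string as missing (an assumed definition)."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return value == ''


def columns_with_missing(columns):
    """Return the sorted names of columns holding at least one missing value.

    `columns` is a plain dict mapping column name -> list of values,
    standing in for an SFrame. any() short-circuits, so each column is
    scanned only up to its first missing value.
    """
    return sorted(name for name, values in columns.items()
                  if any(is_missing(v) for v in values))


data = {
    'foo': [1, 2, 3, 4],
    'bar': [1, None, 3, 4],
    'baz': [1, 2, float('nan'), 4],
    'qux': ['spam', '', 'ham', 'eggs'],
}
print(columns_with_missing(data))  # ['bar', 'baz', 'qux']
```

The trade-off is that this runs row by row in Python, so on a large SFrame it may well be slower than num_missing despite the early exit.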

Here is a Pandas solution:

In [50]: df
Out[50]:
   keys  values
0     1     1.0
1     2     2.0
2     2     3.0
3     3     4.0
4     3     5.0
5     3     NaN
6     3     7.0

In [51]: df.columns.to_series()[df.isnull().any()]
Out[51]:
values    values
dtype: object

In [52]: df.columns.to_series()[df.isnull().any()].tolist()
Out[52]: ['values']

Explanation:

In [53]: df.isnull().any()
Out[53]:
keys      False
values     True
dtype: bool
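
A slightly shorter variant, shown here as a sketch: the boolean Series from isnull().any() can index df.columns directly, without the to_series() step. The DataFrame below is rebuilt from the output shown above so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Same data as the df displayed above.
df = pd.DataFrame({'keys': [1, 2, 2, 3, 3, 3, 3],
                   'values': [1.0, 2.0, 3.0, 4.0, 5.0, np.nan, 7.0]})

# Boolean Series indexed by column name: True where the column has any null.
has_null = df.isnull().any()

# Use the mask to select the matching column labels.
cols_with_null = df.columns[has_null].tolist()
print(cols_with_null)  # ['values']
```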

You can use isnull:

pd.isnull(df).sum() > 0

Example:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1':['A', 'A', 'B','B'], 'col2': ['B','B','C','C'], 'col3': ['C','C','A','A'], 'col4': [11,12,13,np.nan], 'col5': [30,10,14,91]})
df
  col1 col2 col3  col4  col5
0    A    B    C  11.0    30
1    A    B    C  12.0    10
2    B    C    A  13.0    14
3    B    C    A   NaN    91

pd.isnull(df).sum() > 0

col1    False
col2    False
col3    False
col4     True
col5    False
dtype: bool
