
Why does changing one `np.nan` value change all of the nan values in pandas dataframe?

When I change one value in the entire DataFrame, it changes other values. Compare scenario 1 and scenario 2:

Scenario 1: Notice that the only NaN values here are float(np.nan):

import random
import numpy as np
import pandas as pd

info_num = np.array([[random.randint(0,9) for x in range(4)]+['ui'],
[random.randint(0,8) for x in range(3)]+[float(np.nan)]+['g'],
[random.randint(0,7) for x in range(2)]+[float(np.nan)]+[90]+[float(np.nan)],
[random.randint(0,9) for x in range(4)]+['q'],
[random.randint(0,9) for x in range(4)]+['w']])

result_df = pd.DataFrame(data=info_num, columns=['G','Bd', 'O', 'P', 'keys'])

result_df = result_df.fillna(0.0)  # does NOT fill in NaNs

The result of Scenario 1 is just a dataframe without the NaNs filled in.

Scenario 2: Notice that there is a None value in only ONE spot:

info_num = np.array([[random.randint(0,9) for x in range(4)]+['ui'],
[random.randint(0,8) for x in range(3)]+[None]+['g'],
[random.randint(0,7) for x in range(2)]+[float(np.nan)]+[90]+[float(np.nan)],
[random.randint(0,9) for x in range(4)]+['q'],
[random.randint(0,9) for x in range(4)]+['w']])

result_df = pd.DataFrame(data=info_num, columns=['G','Bd', 'O', 'P', 'keys'])

result_df = result_df.fillna(0.0)  # this works!?!

Even though I replace only one of the NaN values with None, the other float(np.nan)s get filled in with 0.0, as if they are NaNs too.

Why is there some relationship between the NaNs?

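The 1st info_num is dtype='S3' (strings; on Python 3 it is a Unicode string dtype instead). With no None present, numpy can convert every element to a string, so each nan is stored as the text 'nan'. In the 2nd it is dtype=object, a mix of integers, nan (a float), and strings (and a None).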

In the dataframes I see something that prints as 'nan' in the one, and a mix of None and NaN in the other. It looks like fillna treats None and NaN the same, but ignores a string 'nan'.
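A small sketch of that difference, using a shortened one-dimensional stand-in for the question's arrays (the names as_strings and with_none are made up for this illustration):

```python
import numpy as np
import pandas as pd

nan = float("nan")

# Numbers, nan and strings, but no None: numpy settles on a string dtype,
# so the float nan is stored as the text 'nan'.
as_strings = np.array([1, nan, "g"])
print(as_strings.dtype.kind)          # string kind, not object
print(as_strings[1] == "nan")

# A single None cannot be turned into a string, so numpy falls back to
# dtype=object and keeps every element as the original Python object.
with_none = np.array([1, None, nan, "g"])
print(with_none.dtype)

# isnull flags None and the float nan, but not the string 'nan'.
print(pd.isnull(as_strings))
print(pd.isnull(with_none))
```

So the None does not change the other NaNs directly; it changes the dtype of the whole array, and that decides whether they survive as real NaNs.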

The doc for fillna says:

Fill NA/NaN values using the specified method

Pandas NaN is the same as np.nan.

fillna uses pd.isnull to determine where to put the 0.0 value.

def isnull(obj):
    """Detect missing values (NaN in numeric arrays, None/NaN in object arrays)"""

For the 2nd case:

In [116]: pd.isnull(result_df)
       G     Bd      O      P   keys
0  False  False  False  False  False
1  False  False  False   True  False
2  False  False   True  False   True
3  False  False  False  False  False
4  False  False  False  False  False

(It's all False for the first, string, case.)
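To see that the isnull mask and the cells fillna writes to line up, here is a minimal sketch on a tiny object-dtype frame. The two-column data is invented for the example, and recent pandas versions may print a downcasting warning for fillna on object columns.

```python
import numpy as np
import pandas as pd

# Object array holding ints, one None and one float nan (no strings).
data = np.array([[1, None],
                 [float("nan"), 90]], dtype=object)
df = pd.DataFrame(data, columns=["O", "P"])

mask = pd.isnull(df)          # True where the cell is None or NaN
filled = df.fillna(0.0)       # writes 0.0 into the masked cells

# Doing the same thing by hand with the mask gives the same values.
manual = df.mask(mask, 0.0)

print(mask)
print(filled)
```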

In [121]: info_num0
array([['4', '8', '5', '6', 'ui'],
       ['1', '5', '6', 'nan', 'g'],
       ['6', '1', 'nan', '90', 'nan'],
       ['5', '2', '8', '4', 'q'],
       ['1', '6', '4', '3', 'w']], 
In [122]: info_num
array([[1, 8, 3, 0, 'ui'],
       [1, 5, 1, None, 'g'],
       [0, 2, nan, 90, nan],
       [7, 7, 1, 4, 'q'],
       [3, 7, 0, 3, 'w']], dtype=object)
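If you are already stuck with the all-string array, one possible repair (a sketch, not the only way) is to push the numeric columns through pd.to_numeric, which gives real NaNs that fillna can see. The three-row array below is a shortened copy of the string array above, and num_cols is a name made up for this example.

```python
import numpy as np
import pandas as pd

# Shortened stand-in for the string array: every cell is text.
info_num0 = np.array([["4", "8", "5", "6", "ui"],
                      ["1", "5", "6", "nan", "g"],
                      ["6", "1", "nan", "90", "nan"]])
df0 = pd.DataFrame(info_num0, columns=["G", "Bd", "O", "P", "keys"])

num_cols = ["G", "Bd", "O", "P"]

# Digit strings become numbers and the text 'nan' becomes a real NaN;
# errors="coerce" also turns anything else unparseable into NaN.
numeric = df0[num_cols].apply(pd.to_numeric, errors="coerce")

# Now fillna has real missing values to work with.
numeric = numeric.fillna(0.0)
print(numeric)
```

The 'keys' column is left alone here; whether its 'nan' should count as missing is a separate decision.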

np.nan is float already:

In [125]: type(np.nan)
Out[125]: float

If you'd added dtype=object to the initial array definition, you'd get the same effect as using that None:

In [140]: np.array([[random.randint(0,9) for x in range(4)]+['ui'],
[random.randint(0,8) for x in range(3)]+[np.nan]+['g'],
[random.randint(0,7) for x in range(2)]+[np.nan]+[90]+[np.nan],
[random.randint(0,9) for x in range(4)]+['q'],
[random.randint(0,9) for x in range(4)]+['w']],dtype=object)
array([[6, 7, 8, 1, 'ui'],
       [5, 2, 5, nan, 'g'],
       [3, 0, nan, 90, nan],
       [5, 2, 1, 3, 'q'],
       [1, 7, 7, 2, 'w']], dtype=object)

Better yet, create the initial data as a list of lists, rather than an array. numpy arrays must have uniform elements; with a mix of ints, nan, and strings you only get that with dtype=object. But such an array is little more than an array wrapper around a list. Python lists already allow this kind of diversity.

In [141]: alist = [[random.randint(0,9) for x in range(4)]+['ui'],
[random.randint(0,8) for x in range(3)]+[np.nan]+['g'],
[random.randint(0,7) for x in range(2)]+[np.nan]+[90]+[np.nan],
[random.randint(0,9) for x in range(4)]+['q'],
[random.randint(0,9) for x in range(4)]+['w']]
In [142]: alist
[[4, 0, 2, 6, 'ui'],
 [3, 3, 3, nan, 'g'],
 [3, 5, nan, 90, nan],
 [4, 0, 6, 7, 'q'],
 [0, 8, 3, 8, 'w']]
In [143]: result_df1 = pd.DataFrame(data=alist, columns=['G','Bd', 'O', 'P', 'keys'])
In [144]: result_df1
   G  Bd   O   P keys
0  4   0   2   6   ui
1  3   3   3 NaN    g
2  3   5 NaN  90  NaN
3  4   0   6   7    q
4  0   8   3   8    w

I'm not sure how pandas stores this internally, but result_df1.values does return an object array.

In [146]: result_df1.values
array([[4, 0, 2.0, 6.0, 'ui'],
       [3, 3, 3.0, nan, 'g'],
       [3, 5, nan, 90.0, nan],
       [4, 0, 6.0, 7.0, 'q'],
       [0, 8, 3.0, 8.0, 'w']], dtype=object)

So if a column has a nan, all the numbers in it become floats (nan is a kind of float). The first 2 columns remain integer. The last is a mix of strings and that nan.
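The int-to-float switch is easy to check on its own with two short Series (the names no_gap and with_gap are just for this sketch):

```python
import numpy as np
import pandas as pd

no_gap = pd.Series([4, 3, 3])          # only ints: stays an integer dtype
with_gap = pd.Series([2, 3, np.nan])   # one nan: the whole column is float64

print(no_gap.dtype)
print(with_gap.dtype)
print(type(with_gap[0]))               # the 2 is now stored as a float
```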

But dtypes shows that pandas keeps a separate dtype for each column, much like a numpy structured array in which each column is a field with the relevant dtype.

In [147]: result_df1.dtypes
G         int64
Bd        int64
O       float64
P       float64
keys     object
dtype: object

The equivalent numpy dtype would be:

dt = np.dtype([('G',np.int64),('Bd',np.int64),('O',np.float64),('P',np.float64), ('keys',object)])

We can make a structured array with this dtype. I have to turn the list of lists into a list of tuples (the structured records):

X = np.array([tuple(x) for x in alist],dt)


array([(4, 0, 2.0, 6.0, 'ui'), 
       (3, 3, 3.0, nan, 'g'),
       (3, 5, nan, 90.0, nan), 
       (4, 0, 6.0, 7.0, 'q'), 
       (0, 8, 3.0, 8.0, 'w')], 
      dtype=[('G', '<i8'), ('Bd', '<i8'), ('O', '<f8'), ('P', '<f8'), ('keys', 'O')])

That can go directly into Pandas as:

In [162]: pd.DataFrame(data=X)
   G  Bd   O   P keys
0  4   0   2   6   ui
1  3   3   3 NaN    g
2  3   5 NaN  90  NaN
3  4   0   6   7    q
4  0   8   3   8    w
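For anyone who wants to reproduce the structured-array route without the random numbers, here is a self-contained sketch with three fixed rows taken from the list above:

```python
import numpy as np
import pandas as pd

alist = [[4, 0, 2, 6, "ui"],
         [3, 3, 3, np.nan, "g"],
         [3, 5, np.nan, 90, np.nan]]

# One field per column, with the dtype pandas reported for it.
dt = np.dtype([("G", np.int64), ("Bd", np.int64),
               ("O", np.float64), ("P", np.float64), ("keys", object)])

# Structured arrays are built from a list of tuples, one tuple per record.
X = np.array([tuple(row) for row in alist], dtype=dt)

# The field names become the column names.
df = pd.DataFrame(data=X)
print(df)
print(df.dtypes)
```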
