
Pandas df.to_records() returns a 1d numpy array

I apologize in advance if this question seems slightly naive. I am still learning about the interplay between pandas and numpy.

I have a pandas DataFrame that I am trying to convert into an array for analysis using scikit-learn. I have tried df.values and df.to_records() to convert it, but for some reason, it changes the shape during the conversion.

These are the first few rows of the DataFrame (df) in pandas.

Index           Code1    Code2       Code3
0               99285    5921         5921
1               99284     NaN         5921
2               99284     NaN         4660
3               99285   42789        42789
4               99284   92321        92321
5               99283     NaN        92321
...
[94 rows x 3 columns]

However, if I call df.values, I get the following result, which, as far as I understand, is not an array, since arrays are lists of tuples.

[['99285' '5921' '5921']
['99284' nan '5921']
['99284' nan '4660']
['99285' '42789' '42789']
['99284' '92321' '92321']
['99283' nan '92321']
...

If I call df.to_records(), I get the following result, which is an array, but not of the right shape, as shown below.

[(0, '99285', '5921', '5921') (1, '99284', nan, '5921')
(2, '99284', nan, '4660') (3, '99285', '42789', '42789')
(4, '99284', '92321', '92321') (5, '99283', nan, '92321')
...
>>> df.to_records().shape
(94,)

Can someone help me understand what I need to do to get an array with a shape of (94, 3)?

Important note: the columns are all strings (and need to stay as strings), not ints, if that helps.

In fact, df.values does return a numpy.ndarray. Because of the way it prints, it only looks like a list of lists. You can check this with type(df.values), or by looking at its shape: df.values.shape == (94, 3), which is the shape you are asking for.
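As a quick check, here is a minimal sketch of the point above. The small DataFrame is a hypothetical stand-in built from the six rows shown in the question (the real one has 94 rows), and dtype=object is my assumption to keep the codes as Python strings:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the first six rows of the question's DataFrame.
# dtype=object is an assumption so the codes stay plain Python strings.
df = pd.DataFrame(
    {
        "Code1": ["99285", "99284", "99284", "99285", "99284", "99283"],
        "Code2": ["5921", np.nan, np.nan, "42789", "92321", np.nan],
        "Code3": ["5921", "5921", "4660", "42789", "92321", "92321"],
    },
    dtype=object,
)

arr = df.values          # df.to_numpy() is the newer spelling of the same thing
print(type(arr))         # a numpy.ndarray, despite the list-of-lists look
print(arr.shape)         # (rows, columns): (6, 3) for this six-row sample
```

With the full 94-row DataFrame the same call should give a shape of (94, 3). Note that the NaN entries remain in the array, so they would still need handling before the data goes into scikit-learn.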

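However, df.to_records() does not return a plain numpy.ndarray, but a numpy.recarray (shown as numpy.core.records.recarray in older NumPy versions), which is an ndarray subclass with named fields. You can see that it is a recarray by doing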

type(df.to_records())

or by noticing that the dtype is odd-looking:

df.to_records().dtype

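The shape of df.to_records() just indicates how many records there are, in your case 94. Each record is a single element holding the index plus the three columns as named fields, which is why the array is one-dimensional. Record arrays behave differently than normal numpy arrays. For example, try

df.to_records()['Code1']
df.to_records().Code1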
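To make the record-array behaviour concrete, here is a small sketch. The two-row DataFrame is again a hypothetical stand-in for the one in the question, and the index=False call is an extra illustration not mentioned above:

```python
import numpy as np
import pandas as pd

# Hypothetical two-row stand-in for the question's DataFrame.
df = pd.DataFrame(
    {"Code1": ["99285", "99284"], "Code2": ["5921", np.nan], "Code3": ["5921", "5921"]},
    dtype=object,
)

rec = df.to_records()
print(type(rec).__name__)   # recarray
print(rec.shape)            # one entry per record: (2,)
print(rec.dtype.names)      # field names; the index is included as a field

# Fields can be read by key or by attribute; names are case-sensitive.
print(rec["Code1"])
print(rec.Code1)

# index=False leaves the index out of each record.
rec_no_index = df.to_records(index=False)
print(rec_no_index.dtype.names)
```

Either way the result stays one-dimensional, so for a plain 2D array of shape (rows, columns), df.values is the thing to use.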
