Concatenate numpy arrays to two arrays using pandas, numpy or other

I have a series of numpy arrays, generated for example like this:

import random
import pandas as pd
N = 5
data = [[random.random() for i in range(N)] for j in range(N)]
names = ['a','b','c','d','e']
df = pd.DataFrame(data)
df = df.transpose()
df.columns = names

i.e.:

a    b    c    d    e
0.01 0.03 0.01 0.2  0.04
0.2  0.01 0.02 0.01 0.1
...

and I would like to format it so that it looks like this:

name    value
a       0.01
b       0.03
c       0.01
d       0.2
e       0.04
a       0.2
b       0.01
....

(order of data is not important)

I have tried pandas dataframe transpose:

df = pd.DataFrame(data)
df = df.transpose()
df.columns = names

but the result looks like this:

a    0.1   0.2  0.01 0.2
b    0.3   0.1  0.2  0.01
....

Any idea on how to reformat the numpy arrays/pandas dataframe to have two columns of data?

You can use numpy.tile to repeat the column names and numpy.ravel to flatten the values of the DataFrame:

import numpy as np
import pandas as pd

# random DataFrame
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(5,5)), columns=list('ABCDE'))
print(df)
   A  B  C  D  E
0  8  8  3  7  7
1  0  4  2  5  2
2  2  2  1  0  8
3  4  0  9  6  2
4  4  1  5  3  4
# tile repeats the column labels once per row; ravel flattens the values row by row
df2 = pd.DataFrame({
        "name": np.tile(df.columns, len(df.index)),
        "value": df.values.ravel()})
print(df2)
   name  value
0     A      8
1     B      8
2     C      3
3     D      7
4     E      7
5     A      0
6     B      4
7     C      2
8     D      5
9     E      2
10    A      2
11    B      2
12    C      1
13    D      0
14    E      8
15    A      4
16    B      0
17    C      9
18    D      6
19    E      2
20    A      4
21    B      1
22    C      5
23    D      3
24    E      4
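
The pairing of numpy.tile with ravel works because ravel flattens row by row by default, so the labels have to cycle once per row. As a minimal sketch (the tiny 2x3 frame and the names row_major and col_major are made up for illustration), the column-by-column ordering would instead pair numpy.repeat with ravel(order="F"):

```python
import numpy as np
import pandas as pd

# Hypothetical small frame, only for illustrating the ordering.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=list("abc"))
labels = df.columns.to_numpy()
n_rows = len(df.index)

# Row-major flattening reads each row left to right (1 2 3 4 5 6),
# so the labels must cycle a b c a b c: that is np.tile.
row_major = pd.DataFrame({
    "name": np.tile(labels, n_rows),
    "value": df.to_numpy().ravel(),
})

# Column-major flattening reads each column top to bottom (1 4 2 5 3 6),
# so each label must come in a block a a b b c c: that is np.repeat.
col_major = pd.DataFrame({
    "name": np.repeat(labels, n_rows),
    "value": df.to_numpy().ravel(order="F"),
})

print(row_major)
print(col_major)
```

Mixing the two (for example tile with order="F") would silently attach the wrong label to each value, so it is worth checking a small case like this one before running on real data.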

Timings (len(df) = 1M rows):

#random dataframe
np.random.seed(100)
N = 1000000
df = pd.DataFrame(np.random.randint(10, size=(N,5)), columns=list('abcde'))
print (df)

In [86]: %timeit (pd.DataFrame({"name": np.tile(df.columns, len(df.index)),"value": df.values.ravel()}))
10 loops, best of 3: 84.8 ms per loop

In [87]: %timeit (pd.DataFrame(np.column_stack((np.tile(df.columns, df.shape[0]), df.values.reshape(-1,1))), columns=['name', 'value']))
10 loops, best of 3: 196 ms per loop

In [88]: %timeit (df.stack().reset_index(level=0, drop=True).reset_index(name='value').rename(columns={'index':'name'}))
1 loop, best of 3: 344 ms per loop

If you need the output as a numpy array, use numpy.column_stack:

print (np.column_stack((np.tile(df.columns, len(df.index)), df.values.ravel())))
[['a' 8]
 ['b' 8]
 ['c' 3]
 ['d' 7]
 ['e' 7]
 ['a' 0]
 ['b' 4]
 ['c' 2]
 ['d' 5]
 ['e' 2]
 ['a' 2]
 ['b' 2]
 ['c' 1]
 ['d' 0]
 ['e' 8]
 ['a' 4]
 ['b' 0]
 ['c' 9]
 ['d' 6]
 ['e' 2]
 ['a' 4]
 ['b' 1]
 ['c' 5]
 ['d' 3]
 ['e' 4]]

Is that what you want?

In [11]: df
Out[11]:
          a         b         c         d         e
0  0.791796  0.428642  0.887860  0.803709  0.860545
1  0.230401  0.105232  0.617007  0.557678  0.590459
2  0.448462  0.314422  0.207188  0.785642  0.022271
3  0.075631  0.707029  0.111538  0.769387  0.174297
4  0.707566  0.299966  0.197642  0.145841  0.231135

In [12]: df.stack().reset_index(level=0, drop=True).reset_index()
Out[12]:
   index         0
0      a  0.791796
1      b  0.428642
2      c  0.887860
3      d  0.803709
4      e  0.860545
5      a  0.230401
6      b  0.105232
7      c  0.617007
8      d  0.557678
9      e  0.590459
10     a  0.448462
11     b  0.314422
12     c  0.207188
13     d  0.785642
14     e  0.022271
15     a  0.075631
16     b  0.707029
17     c  0.111538
18     d  0.769387
19     e  0.174297
20     a  0.707566
21     b  0.299966
22     c  0.197642
23     d  0.145841
24     e  0.231135
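
The stacked result above still carries the default headers index and 0. As an alternative that is not used in the answers here, DataFrame.melt can produce the two named columns in one call. This is a sketch on a made-up 2x2 frame; note that melt lists the values column by column rather than row by row, which should be acceptable since the question says the order does not matter:

```python
import pandas as pd

# Hypothetical small frame, only for illustration.
df = pd.DataFrame({"a": [0.1, 0.2], "b": [0.3, 0.4]})

# With no id_vars, every column is unpivoted into (name, value) pairs.
long_df = df.melt(var_name="name", value_name="value")
print(long_df)
```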

You just need to concatenate all the columns of df. Because the columns have different names, you first need to give them all the same name; otherwise pandas adds a separate column for each name to the concat result.

import random
import pandas as pd

N = 5
data = [[random.random() for i in range(N)] for j in range(N)]
names = ['a','b','c','d','e']

df = pd.DataFrame(data)
df.columns = names
df = df.transpose()
print(df)

#           0         1         2         3         4
# a  0.643042  0.061476  0.415979  0.209272  0.394414
# b  0.175363  0.580336  0.056173  0.468121  0.388956
# c  0.096257  0.570860  0.516667  0.892087  0.956790
# d  0.082906  0.340805  0.466074  0.010123  0.293006
# e  0.430240  0.759413  0.083779  0.442159  0.434603

df_col = [df[[c]] for c in df.columns]      # one single-column DataFrame per column
for col in df_col:
    col.columns = ['value']                 # give every piece the same column name

res = pd.concat(df_col)                     # concat them all together
res.index.names=['name']

print(res)

#          value
# name          
# a     0.643042
# b     0.175363
# c     0.096257
# d     0.082906
# e     0.430240
# a     0.061476
# b     0.580336
# c     0.570860
# d     0.340805
# e     0.759413
# a     0.415979
# b     0.056173
# c     0.516667
# d     0.466074
# e     0.083779
# a     0.209272
# b     0.468121
# c     0.892087
# d     0.010123
# e     0.442159
# a     0.394414
# b     0.388956
# c     0.956790
# d     0.293006
# e     0.434603
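
The same concat idea can be written more compactly by renaming each column as a Series instead of relabeling single-column DataFrames. This is only a sketch under the assumption that the frame has the names in its index, like the transposed df above; the 3x2 sample data and the variable parts are invented for illustration:

```python
import pandas as pd

# Same layout as the transposed frame above: names in the index,
# one column per sample. The numbers are made up.
df = pd.DataFrame(
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    index=["a", "b", "c"],
)

# Each column becomes a Series called "value", so concat appends them
# end to end instead of creating one output column per input column.
parts = [df[col].rename("value") for col in df.columns]

# Name the index "name", then turn it into an ordinary column.
res = pd.concat(parts).rename_axis("name").reset_index()
print(res)
```

The final reset_index gives the two plain columns name and value that the question asks for, rather than leaving name as the index.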


 