
Adding column to numpy array based on if/then of data in array

I have a multidimensional numpy array like so:

np.array([("a",1,"x"),("b",2,"y"),("c",1,"z")])

I need to add a fourth "column" to the array based on an if/then test of the 2nd column, for example:

If [:,2] == 1 then newcolumn = 'Wow' else 'Dud'

So that it returns something like:

[("a",1,"x","Wow"),("b",2,"y","Dud"),("c",1,"z","Wow")]

As I'm going to be processing around 100 million rows of data, speed is of the essence here.

Thanks in advance for any help.

Try pandas

>> import pandas as pd
>> df = pd.DataFrame([("a",1,"x"),("b",2,"y"),("c",1,"z")], columns=['col1', 'col2', 'col3'])
>> df
  col1  col2 col3
0    a     1    x
1    b     2    y
2    c     1    z

Create a function that operates on a single row (it doesn't have to be a lambda), and pass it to apply with axis=1 so it is called once per row. This gives you the new column.

>> b = lambda row: "Wow" if row['col2'] == 1 else "Dud"
>> new_col = df.apply(b, axis=1)
>> new_col
0    Wow
1    Dud
2    Wow
dtype: object

Add your new column to the dataframe:

>> df['new_col'] = new_col
>> df
  col1  col2 col3 new_col
0    a     1    x     Wow
1    b     2    y     Dud
2    c     1    z     Wow

And convert back to a list of tuples:

>> tuples = [tuple(x) for x in df[['col1','col2','col3','new_col']].to_numpy()]
>> tuples
[('a', 1, 'x', 'Wow'), ('b', 2, 'y', 'Dud'), ('c', 1, 'z', 'Wow')]

Suggestion: don't use lists of tuples. Use dataframes, especially for large data.
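
For convenience, the steps above can be collected into one script. This is only a sketch of the same apply-based approach; the helper name label_row is made up for illustration, and the column names are the ones used in this answer.

```python
import pandas as pd

rows = [("a", 1, "x"), ("b", 2, "y"), ("c", 1, "z")]
df = pd.DataFrame(rows, columns=["col1", "col2", "col3"])


def label_row(row):
    """Return "Wow" when col2 equals 1, otherwise "Dud" (hypothetical helper)."""
    return "Wow" if row["col2"] == 1 else "Dud"


# axis=1 hands each row to label_row as a Series
df["new_col"] = df.apply(label_row, axis=1)

# Convert back to a list of plain tuples
tuples = [tuple(x) for x in df.to_numpy()]
print(tuples)
```

Keep in mind that apply with axis=1 calls the Python function once per row, so it is convenient rather than fast.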

Notice that the dtype has to accommodate the longest string the array will ever hold, in this case of length 3:

> a = np.array([("a",1,"x"),("b",2,"y"),("c",1,"z")], dtype='<U3')
> a
array([['a', '1', 'x'],
       ['b', '2', 'y'],
       ['c', '1', 'z']], dtype='<U3')
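
To see why the string width matters, here is a small sketch of what happens when a fixed-width string array is too narrow. The variable names narrow and wide are made up for this illustration.

```python
import numpy as np

narrow = np.array(["a", "b"], dtype="<U1")  # room for one character per element
narrow[0] = "Wow"                           # silently truncated, no error raised
print(narrow[0])                            # prints W

wide = narrow.astype("<U3")                 # widen to three characters first
wide[1] = "Dud"
print(wide[1])                              # prints Dud
```

The truncation happens without a warning, so it is worth checking the dtype before writing longer strings into an existing array.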

Create a placeholder array up front, for speed. Note that its dtype is a string type here; you could leave the dtype unspecified, but I am not sure how that would affect speed. It would be better to use a single type throughout the array and not have numpy hold non-numeric types.

> b = np.empty((a.shape[0], a.shape[1] + 1), dtype=a.dtype)

Assign a to the first columns:

> b[:, :a.shape[1]] = a

Test the relevant column against the condition:

> cond_indices = a[:, 1] == '1'

Assign by mask:

> b[cond_indices, a.shape[1]] = "Wow"
> b[~cond_indices, a.shape[1]] = "Dud"

Enjoy:

> b
array([['a', '1', 'x', 'Wow'],
       ['b', '2', 'y', 'Dud'],
       ['c', '1', 'z', 'Wow']], dtype='<U3')
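
The preallocate-and-mask steps above could be wrapped in a function along these lines. This is a sketch, not a tuned implementation: the name append_flag_column and its parameters are hypothetical, and it uses np.promote_types to pick a string width wide enough for both the input and the labels, which goes slightly beyond the steps shown.

```python
import numpy as np


def append_flag_column(a, col, match, yes, no):
    """Return a copy of the 2-D string array `a` with one extra label column."""
    labels = np.array([yes, no])
    # Pick a string width that fits both the existing data and the labels
    out_dtype = np.promote_types(a.dtype, labels.dtype)
    out = np.empty((a.shape[0], a.shape[1] + 1), dtype=out_dtype)
    out[:, :-1] = a                 # copy the original columns
    mask = a[:, col] == match       # boolean mask over rows
    out[mask, -1] = yes
    out[~mask, -1] = no
    return out


a = np.array([("a", "1", "x"), ("b", "2", "y"), ("c", "1", "z")])
b = append_flag_column(a, col=1, match="1", yes="Wow", no="Dud")
print(b)
```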

Your array constructor produces a string dtype:

In [73]: arr = np.array([("a",1,"x"),("b",2,"y"),("c",1,"z")])                                   
In [74]: arr                                                                                     
Out[74]: 
array([['a', '1', 'x'],
       ['b', '2', 'y'],
       ['c', '1', 'z']], dtype='<U1')

2nd column? With zero-based indexing, [:,2] is the third column; the second is [:,1]:

In [75]: arr[:,2]                                                                                
Out[75]: array(['x', 'y', 'z'], dtype='<U1')
In [76]: arr[:,1]                                                                                
Out[76]: array(['1', '2', '1'], dtype='<U1')

You have to test against a string:

In [77]: arr[:,1]=="1"                                                                           
Out[77]: array([ True, False,  True])

Make a new array with the desired strings:

In [78]: np.where(arr[:,1]=="1", "Wow","Dud")                                                    
Out[78]: array(['Wow', 'Dud', 'Wow'], dtype='<U3')

Join it with the original to make a new array (this is not in-place):

In [79]: np.column_stack((arr, Out[78]))                                                         
Out[79]: 
array([['a', '1', 'x', 'Wow'],
       ['b', '2', 'y', 'Dud'],
       ['c', '1', 'z', 'Wow']], dtype='<U3')

But with pandas:

In [80]: df = pd.DataFrame([("a",1,"x"),("b",2,"y"),("c",1,"z")], columns=['col1', 'col2', 'col3'
    ...: ])                                                                                      
In [81]: df                                                                                      
Out[81]: 
  col1  col2 col3
0    a     1    x
1    b     2    y
2    c     1    z
In [82]: df["newcol"] = np.where(df["col2"]==1, "Wow","Dud")                                     
In [83]: df                                                                                      
Out[83]: 
  col1  col2 col3 newcol
0    a     1    x    Wow
1    b     2    y    Dud
2    c     1    z    Wow

pandas stores its data in arrays, either one per dataframe or one per series (column). Switching to numpy does not automatically make things faster. Row iteration on an array is just as slow as a row apply on a dataframe. But as I show here, a whole-array operation can often be applied to a whole dataframe column. And adding a column to a dataframe is easier than adding a column to an array.
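
If you want to check the row-apply versus whole-array claim on your own machine, a rough timing sketch could look like the following. The sample size n and the helper names by_apply and by_where are assumptions made for this example, and the timings will vary with hardware and library versions.

```python
import timeit

import numpy as np
import pandas as pd

n = 20_000  # hypothetical sample size, far below the 100 million rows in the question
rng = np.random.default_rng(0)
df = pd.DataFrame({"col1": "a", "col2": rng.integers(1, 3, size=n), "col3": "x"})


def by_apply():
    # One Python-level function call per row
    return df.apply(lambda row: "Wow" if row["col2"] == 1 else "Dud", axis=1)


def by_where():
    # One whole-column operation
    return np.where(df["col2"] == 1, "Wow", "Dud")


# Both approaches should produce the same labels
same = by_apply().tolist() == by_where().tolist()
print("same result:", same)
print("apply seconds:", timeit.timeit(by_apply, number=1))
print("where seconds:", timeit.timeit(by_where, number=1))
```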
