
Equivalent of R's sub and paste in Python (concatenating strings and numbers)

Previously, in R, I used sub and paste to concatenate strings and numbers. I find it a bit harder in Python. Here is some sample code in Python:

import pandas as pd
from numpy.random import rand, seed
seed(1)  # seed NumPy's generator so the output below is reproducible
testtt = round(pd.DataFrame(rand(5,4)),3)
testtt.iloc[1,1]
print(testtt)

#        0      1      2      3
# 0  0.417  0.720  0.000  0.302
# 1  0.147  0.092  0.186  0.346
# 2  0.397  0.539  0.419  0.685
# 3  0.204  0.878  0.027  0.670
# 4  0.417  0.559  0.140  0.198

for i in range(testtt.shape[1]):
    for j in range(testtt.shape[0]):
        testtt.iloc[j, i] = str(i) + '_' + str(testtt.iloc[j, i])


print(testtt)
#          0        1        2        3
# 0  0_0.417   1_0.72    2_0.0  3_0.302
# 1  0_0.147  1_0.092  2_0.186  3_0.346
# 2  0_0.397  1_0.539  2_0.419  3_0.685
# 3  0_0.204  1_0.878  2_0.027   3_0.67
# 4  0_0.417  1_0.559   2_0.14  3_0.198

What I want is to prepend each column's index to the numbers under it. As you can see, "0_" is added to every element of the first column, "1_" to every element of the second, and so forth.

I don't think nested for loops are the best way to do this: my real data is a 90000 x 20 matrix, and looping over every element takes too long.

Here is my previous code in R, which is far faster because it only runs a short loop over the 20 columns:

for (i in 1:(ncol(testtt))){
  testtt[,i] <- sub("^", paste(i,"_",sep = ""), testtt[,i] )
}

I am very new to Python, so any help would be appreciated.

In Python, string concatenation is done with the + operator. Using broadcasting, you can do something like this:

df.astype(str).radd(df.add_suffix('_').columns)

Out: 
         0        1        2        3
0  0_0.972  1_0.661  2_0.872  3_0.876
1  0_0.751  1_0.097  2_0.673  3_0.978
2  0_0.662  1_0.645  2_0.498  3_0.769
3  0_0.587  1_0.538  2_0.032  3_0.279
4  0_0.739  1_0.663  2_0.769  3_0.475

Here is how it works:

The add_suffix method appends _ to each column name.

df.add_suffix('_').columns
Out: Index(['0_', '1_', '2_', '3_'], dtype='object') 

Now it is only a matter of addition to get your desired output. However, if you add the DataFrame directly to this Index, you'll get this:

df.add_suffix('_').columns + df.astype('str')
Out: 
Index([('0_0.972', '1_0.661', '2_0.872', '3_0.876'),
       ('0_0.751', '1_0.097', '2_0.673', '3_0.978'),
       ('0_0.662', '1_0.645', '2_0.498', '3_0.769'),
       ('0_0.587', '1_0.538', '2_0.032', '3_0.279'),
       ('0_0.739', '1_0.663', '2_0.769', '3_0.475')],
      dtype='object')

Since df.add_suffix('_').columns is an Index object, the result is also an Index. We want the result to be a DataFrame, so we call the operation on the DataFrame instead. The radd method performs the addition with the operands reversed, so the DataFrame's values end up to the right of the column prefixes.
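The steps above can be put together in a small self-contained sketch. It assumes a random 5 x 4 frame in place of the real data, and the names rng, prefixes and result are only illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 5 rows, 4 integer-labelled columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((5, 4))).round(3)

# Column labels with '_' appended: '0_', '1_', '2_', '3_'.
prefixes = df.add_suffix('_').columns

# Reversed addition: each prefix is placed in front of the values
# of the matching column, and the result stays a DataFrame.
result = df.astype(str).radd(prefixes)

print(type(result).__name__)
print(result)
```

Calling radd on the DataFrame rather than adding to the Index is what keeps the result in DataFrame form.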

You can achieve the same with a for loop:

df = df.astype('str')
for col in df:
    df[col] = '{}_'.format(col) + df[col]
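To see how this column loop behaves at the size mentioned in the question, here is a rough sketch. The helper name prefix_with_column_name and the random 90000 x 20 frame are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

def prefix_with_column_name(frame):
    """Return a copy of `frame` with '<column>_' prepended to every value."""
    out = frame.astype(str)              # astype returns a new DataFrame
    for col in out:                      # iterates over column labels only
        out[col] = f'{col}_' + out[col]  # one vectorized operation per column
    return out

# Assumed shape, taken from the question: 90000 rows by 20 columns.
rng = np.random.default_rng(1)
big = pd.DataFrame(rng.random((90_000, 20))).round(3)

prefixed = prefix_with_column_name(big)
print(prefixed.shape)
print(prefixed.iloc[:3, :3])
```

The loop body runs only 20 times here, once per column, which mirrors the structure of the R loop in the question.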

Your R snippet translates into pandas as something like this:

for i in range(len(testtt.columns)):
    col = testtt.columns[i]
    # assign the whole column by label so its dtype can change from float to str
    testtt[col] = str(i) + '_' + testtt[col].round(3).astype(str)

A more concise solution, however, is to use the name attribute of each Series in your DataFrame (which, given your numeric column names, supplies the prefix we need) and perform the concatenation by applying a lambda (i.e. anonymous) function:

testtt = testtt.apply(lambda x: str(x.name) + '_' + x.round(3).astype(str))

The pd.DataFrame.apply method works on one column of a DataFrame at a time (the default is axis=0; with axis=1 it works row-wise instead), which removes the need for an explicit "for" loop in this case.
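The difference between the two axis settings is easiest to see on a tiny frame. This is only a sketch; the 2 x 3 frame demo and the names by_column and by_row are made up for illustration:

```python
import pandas as pd

# Hypothetical 2x3 frame used only to show what `apply` passes to the function.
demo = pd.DataFrame([[0.1, 0.2, 0.3],
                     [0.4, 0.5, 0.6]])

# axis=0 (default): the function receives each column; x.name is the column label.
by_column = demo.apply(lambda x: str(x.name) + '_' + x.astype(str))

# axis=1: the function receives each row; x.name is the row label instead.
by_row = demo.apply(lambda x: str(x.name) + '_' + x.astype(str), axis=1)

print(by_column)
print(by_row)
```

With the default axis, the bottom-right cell becomes '2_0.6' (column label 2), whereas with axis=1 it becomes '1_0.6' (row label 1), so the default is the one that matches the question.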
