
Replace column names in a pandas data frame that partially match a string

Background

I would like to identify column names in a dataframe that partially match a string, and replace them with the original names plus some new elements. The new elements are integers defined by a list. Here is a similar question, but I'm afraid the suggested solution will not be flexible enough in my particular case. And here is another post with a few excellent answers that come close to the problem I'm facing.

Some research

I know I can combine two lists of strings, map them pairwise into a dictionary, and rename the columns by passing that dictionary to df.rename. But this seems a bit too complicated, and not very flexible, considering that both the number of existing columns and the number of columns to be renamed will vary.
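As a rough sketch of the approach described above (the frame and names below are made up for illustration): pair the old and new names with zip, turn the pairs into a dictionary, and pass it to df.rename. Columns that are not keys in the dictionary are left alone.

```python
import pandas as pd

# Hypothetical frame, for illustration only
df_demo = pd.DataFrame({'Price': [100, 101], 'obs_1': [1, 2], 'obs_2': [3, 4]})

old_names = ['obs_1', 'obs_2']
new_names = ['obs_1 = 5', 'obs_2 = 10']

# Map the names pairwise; 'Price' is not a key, so rename leaves it untouched
mapping = dict(zip(old_names, new_names))
renamed = df_demo.rename(columns=mapping)

print(list(renamed.columns))
```

The weak point is that both lists have to be written out by hand and kept the same length.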

The following snippet will produce an input example:

# Libraries
import numpy as np
import pandas as pd
import itertools

# A dataframe
Observations = 5
Columns = 5
np.random.seed(123)
df = pd.DataFrame(np.random.randint(90,110,size=(Observations, Columns)),
              columns = ['Price','obs_1','obs_2','obs_3','obs_4'])

datelist = pd.date_range(pd.Timestamp.today().strftime('%Y-%m-%d'),
                     periods=Observations).tolist()
df['Dates'] = datelist
df = df.set_index(['Dates'])
print(df)

Input

[image omitted]

I want to identify the column names that start with obs_, and add elements (integers) from the list newElements = [5, 10, 15, 20] following an = sign. The column named Price remains the same. Other columns appearing after the obs_ columns should also stay the same.

The following snippet will demonstrate the desired output:

# Desired output
Observations = 5
Columns = 5
np.random.seed(123)
df2 = pd.DataFrame(np.random.randint(90,110,size=(Observations, Columns)),
              columns = ['Price','Obs_1 = 5','Obs_2 = 10','Obs_3 = 15','Obs_4 = 20'])

df2['Dates'] = datelist
df2 = df2.set_index(['Dates'])
print(df2)

Output

[image omitted]

My attempt

# Define the partial string I'm looking for
stringMatch = 'Obs_'

# Put existing column names in a list
oldnames = list(df)

# Elements that should be added to the column names
# that start with 'obs_'
newElements = [5, 10, 15, 20]
oldElements = [1, 2, 3, 4]

# Change types of the elements in the list
str_newElements = [str(x) for x in newElements]
str_oldElements = [str(y) for y in oldElements]
str_newNames = str_newElements.copy()

# Since I know the first column should not be renamed,
# I start with 'Price' in a list
newnames = ['Price']

# Then I add the renamed parts to the same list
i = 0
for oldElement in str_oldElements:   
    #print(repr(oldElement) + repr(str_newElements[i]))
    newnames.append(stringMatch + oldElement + ' = ' + str_newElements[i])
    i = i + 1

# Rename columns using the dict as input in df.rename
df.rename(columns = dict(zip(oldnames, newnames)), inplace = True)

print('My attempt: ', df)

[image omitted]

Having already made a complete list of the new column names, I could just as well have used df.columns = newnames, but hopefully one of you has a suggestion for using df.rename in a more pythonic way.

Thank you for any suggestions!

Here's the whole code for an easy copy-paste:

# Libraries
import numpy as np
import pandas as pd
import itertools

# A dataframe
Observations = 5
Columns = 5
np.random.seed(123)
df = pd.DataFrame(np.random.randint(90,110,size=(Observations, Columns)),
                  columns = ['Price','obs_1','obs_2','obs_3','obs_4'])

datelist = pd.date_range(pd.Timestamp.today().strftime('%Y-%m-%d'),
                         periods=Observations).tolist()
df['Dates'] = datelist
df = df.set_index(['Dates'])
print('Input: ', df)

# Desired output
Observations = 5
Columns = 5
np.random.seed(123)
df2 = pd.DataFrame(np.random.randint(90,110,size=(Observations, Columns)),
                  columns = ['Price','Obs_1 = 5','Obs_2 = 10','Obs_3 = 15','Obs_4 = 20'])

df2['Dates'] = datelist
df2 = df2.set_index(['Dates'])
print('Desired output: ', df2)

# My attempts
# Define the partial string I'm looking for
stringMatch = 'Obs_'

# Put existing column names in a list
oldnames = list(df)

# Elements that should be added to the column names
# that start with 'obs_'
newElements = [5, 10, 15, 20]
oldElements = [1, 2, 3, 4]

# Change types of the elements in the list
str_newElements = [str(x) for x in newElements]
str_oldElements = [str(y) for y in oldElements]
str_newNames = str_newElements.copy()


# Since I know the first column should not be renamed,
# I start with 'Price' in a list
newnames = ['Price']

# Then I add the renamed parts to the same list
i = 0
for oldElement in str_oldElements:

    #print(repr(oldElement) + repr(str_newElements[i]))
    newnames.append(stringMatch + oldElement + ' = ' + str_newElements[i])
    i = i + 1

# Rename columns using the dict as input in df.rename
df.rename(columns = dict(zip(oldnames, newnames)), inplace = True)

print('My attempt: ', df)

EDIT: Aftermath

So many good answers after only one day is just amazing! This made it really hard to decide which answer to accept. I don't know if the following will add much value to the post as a whole, but I went ahead and wrapped all suggestions into functions and tested them with %timeit.

Here are the results: [image omitted]

The suggestion from HH1 was the first to be posted, and is also one of the fastest in terms of execution time. I'll include the code later if anyone is interested.
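The benchmark code itself is not shown here. As a minimal sketch of how such a comparison could be set up outside IPython, using the standard timeit module: the two wrapper functions below are hypothetical stand-ins, not the functions that were actually timed.

```python
import timeit

import numpy as np
import pandas as pd

newElements = [5, 10, 15, 20]

def make_df():
    # Small frame shaped like the example input (values are arbitrary)
    return pd.DataFrame(np.arange(25).reshape(5, 5),
                        columns=['Price', 'obs_1', 'obs_2', 'obs_3', 'obs_4'])

def rename_with_dict(df):
    # Hypothetical wrapper: build a mapping for the 'obs_' columns, then rename
    names = [c for c in df.columns if c.startswith('obs_')]
    mapping = {old: f'{old} = {new}' for old, new in zip(names, newElements)}
    return df.rename(columns=mapping)

def rename_with_listcomp(df):
    # Hypothetical wrapper: rebuild the whole column list in one pass
    values = iter(newElements)
    out = df.copy()
    out.columns = [f'{c} = {next(values)}' if c.startswith('obs_') else c
                   for c in out.columns]
    return out

base = make_df()
runs = 200
for func in (rename_with_dict, rename_with_listcomp):
    seconds = timeit.timeit(lambda: func(base), number=runs)
    print(f'{func.__name__}: {seconds / runs * 1e6:.1f} microseconds per call')
```

Both wrappers return a new frame and leave base untouched, so every repetition starts from the same input.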

EDIT 2

The suggestion from suvy rendered these results when I tried it: [image omitted]

The snippet worked fine until the last line. After running the line df = df.rename(columns=dict(zip(names,renames))) the data frame looked like this:

[image omitted]

You can use a list comprehension:

df.columns = [ i if "_" not in i else i + "=" + str(newElements[int(i[-1])-1]) for i in df.columns]

output

    Price   obs_1=5 obs_2=10    obs_3=15    obs_4=20
0   103     92       92         96          107
1   109     100      91         90          107
2   105     99       90         104         90
3   105     109      104        94          90
4   106     94       107        93          92
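The comprehension above reads only the last character of each name (i[-1]), so it relies on single-digit suffixes. A minimal sketch that parses the whole number after the last underscore instead, assuming every name that should be renamed ends in an integer; the frame and the 12-element list are made up for illustration:

```python
import pandas as pd

newElements = [5 * k for k in range(1, 13)]   # 12 hypothetical values: 5, 10, ..., 60

# Hypothetical frame with a two-digit suffix
df_demo = pd.DataFrame([[100, 1, 2]], columns=['Price', 'obs_2', 'obs_12'])

def with_element(name):
    # Split once from the right: 'obs_12' -> ('obs', '_', '12')
    prefix, sep, suffix = name.rpartition('_')
    if not sep or not suffix.isdigit():
        return name                      # no numeric suffix: keep the name
    return name + '=' + str(newElements[int(suffix) - 1])

df_demo.columns = [with_element(c) for c in df_demo.columns]
print(list(df_demo.columns))
```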

Starting with your input dataframe, called df here:

            Price  obs_1  obs_2  obs_3  obs_4
Dates                                        
2017-06-15    103     92     92     96    107
2017-06-16    109    100     91     90    107
2017-06-17    105     99     90    104     90
2017-06-18    105    109    104     94     90
2017-06-19    106     94    107     93     92


newElements = [5, 10, 15, 20]
names = list(filter(lambda x: x.startswith('obs'), df.columns.values))
renames = list(map(lambda x,y: ' = '.join([x,str(y)]), names, newElements))
df = df.rename(columns=dict(zip(names,renames)))

returns

            Price   obs_1 = 5   obs_2 = 10  obs_3 = 15  obs_4 = 20
Dates                   
2017-06-19  103     92          92          96          107
2017-06-20  109     100         91          90          107
2017-06-21  105     99          90          104         90
2017-06-22  105     109         104         94          90
2017-06-23  106     94          107         93          92

Does this help?

df.columns = [col + ' = ' + str(newElements.pop(0)) if col.startswith(stringMatch) else col for col in df.columns]
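Two things to watch with the line above: pop(0) consumes newElements, so the list is empty afterwards, and startswith is case-sensitive, so stringMatch = 'Obs_' from the question does not match the lowercase obs_ columns. A sketch that sidesteps both, assuming a case-insensitive prefix match is what is wanted (df_demo is a stand-in for the question's df):

```python
import pandas as pd

newElements = [5, 10, 15, 20]
stringMatch = 'Obs_'

df_demo = pd.DataFrame([[100, 1, 2, 3, 4]],
                       columns=['Price', 'obs_1', 'obs_2', 'obs_3', 'obs_4'])

values = iter(newElements)          # iterating does not modify the list
prefix = stringMatch.lower()        # compare in lowercase on both sides

df_demo.columns = [col + ' = ' + str(next(values)) if col.lower().startswith(prefix) else col
                   for col in df_demo.columns]

print(list(df_demo.columns))
print(newElements)                  # the list is left intact
```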

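Select the required columns, make the desired changes, and join them back with the rest of the columns. The labels are assigned by position, so this assumes every obs column sits to the right of all the other columns: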

obs_cols = df.columns[df.columns.str.startswith('obs')]

obs_cols = [col + ' = ' + str(val) for col, val in zip(obs_cols, newElements)]

df.columns = list(df.columns[~df.columns.str.startswith('obs')]) + obs_cols


    Price   obs_1 = 5   obs_2 = 10  obs_3 = 15  obs_4 = 20
0   103     92          92          96          107
1   109     100         91          90          107
2   105     99          90          104         90
3   105     109         104         94          90
4   106     94          107         93          92
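The question mentions that other columns may appear after the obs_ columns. A sketch for that case, reusing the same startswith mask but renaming by label rather than by position; 'Volume' is a made-up trailing column:

```python
import pandas as pd

newElements = [5, 10]

df_demo = pd.DataFrame([[100, 1, 2, 7]],
                       columns=['Price', 'obs_1', 'obs_2', 'Volume'])

is_obs = df_demo.columns.str.startswith('obs')      # boolean mask, one entry per column

# Pair only the matching labels with the new elements
mapping = {col: col + ' = ' + str(val)
           for col, val in zip(df_demo.columns[is_obs], newElements)}

# rename looks labels up by name, so column positions are left as they are
df_demo = df_demo.rename(columns=mapping)
print(list(df_demo.columns))
```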

For completeness, since you mention df.rename, you could create input for that with a dictionary comprehension, in a similar manner to the list comprehensions in the other answers.

# Where Observations = len(df.index) as in the example
>>> newcols = {col: col + ' = ' + str(int(col[col.rfind('_')+1:]) * Observations)
...            for col in df.columns if col.find('obs_') != -1}
>>> df.rename(columns=newcols)
            Price  obs_1 = 5  obs_2 = 10  obs_3 = 15  obs_4 = 20
Dates                                                           
2017-06-15    103         92          92          96         107
2017-06-16    109        100          91          90         107
2017-06-17    105         99          90         104          90
2017-06-18    105        109         104          94          90
2017-06-19    106         94         107          93          92

Here I also made some assumptions about why you're adding the specific new elements you are. If these assumptions are wrong, df.rename and a dictionary comprehension can still be used with a method from one of the other answers.
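If the new elements are not a simple multiple of the column number, one alternative is to look them up rather than compute them. The sketch below assumes that obs_N should receive the N-th entry of newElements, and passes a function to df.rename, which accepts a callable as well as a dictionary. The names lookup and add_element are made up for this example.

```python
import pandas as pd

newElements = [5, 10, 15, 20]

df_demo = pd.DataFrame([[100, 1, 2, 3, 4]],
                       columns=['Price', 'obs_1', 'obs_2', 'obs_3', 'obs_4'])

# Look the element up by the number in the column name rather than computing it
lookup = {'obs_' + str(n): value for n, value in enumerate(newElements, start=1)}

def add_element(col):
    # Called once per column label; labels without an entry are returned unchanged
    return col + ' = ' + str(lookup[col]) if col in lookup else col

renamed = df_demo.rename(columns=add_element)
print(list(renamed.columns))
```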

