I'm trying to rename a column based on validating the values it contains. Here is the set-up:
In [9]: import pandas as pd
In [10]: df = pd.DataFrame(
    ...:     {"unknown_field": ['bob@gmail.com', 'shirley@gmail.com', 'groza@pubg.com']}
    ...: )
In [11]: df
Out[11]:
unknown_field
0 bob@gmail.com
1 shirley@gmail.com
2 groza@pubg.com
The validate_column(ser) function takes a pandas.Series object as its parameter, validates the values in that column, and renames the column to one of a pre-defined set of column names. To keep it simple, in this example the column is validated as an email column.
In [12]: def validate_column(ser):
    ...:     # Value validation method returns that this column is email column
    ...:     ser.rename('email', inplace=True)
    ...:
The current name of unknown_field is unknown_field, and, as expected, the name changes to email after calling validate_column:
In [13]: df.unknown_field
Out[13]:
0 bob@gmail.com
1 shirley@gmail.com
2 groza@pubg.com
Name: unknown_field, dtype: object
In [14]: validate_column(df.unknown_field)
In [15]: df.unknown_field
Out[15]:
0 bob@gmail.com
1 shirley@gmail.com
2 groza@pubg.com
Name: email, dtype: object
However, the column name within df is not modified as I expected. It is still named unknown_field within the df variable:
In [16]: df
Out[16]:
unknown_field
0 bob@gmail.com
1 shirley@gmail.com
2 groza@pubg.com
Currently, I use the following code to manually modify the column name within my df variable:
In [17]: for col in df.select_dtypes(object):
    ...:     df.rename(columns={col: df[col].name}, inplace=True)
    ...:
In [18]: df
Out[18]:
email
0 bob@gmail.com
1 shirley@gmail.com
2 groza@pubg.com
Is there a more efficient/straightforward way to rename the Series and have the change directly reflected in the DataFrame?
Re-write your function to accept two parameters:
def validate_column(df, col_name):
    # Value validation method returns that this column is email column
    return df.rename({col_name: 'email'}, axis=1)
Now, call your function through DataFrame.pipe:
df.pipe(validate_column, col_name='unknown_field')
email
0 bob@gmail.com
1 shirley@gmail.com
2 groza@pubg.com
Very clean. This is useful if you want to chain validations:
(df.pipe(validate_column, col_name='unknown_field')
   .pipe(validate_column, col_name='some_other_field')
   .pipe(validate_column, col_name='third_field')
)
... or modify validate_column to validate multiple columns at a time.
Note that the renaming is no longer done in-place, and whatever result is returned from pipe needs to be assigned back.
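As a rough sketch of the "multiple columns at a time" variant, something along these lines might work. The names validate_columns, looks_like_email and EMAIL_RE are hypothetical and not part of the original function, and the email pattern is deliberately loose. If more than one column passed the check they would all be renamed to email and produce duplicate labels, so a real version would need distinct target names.

```python
import re

import pandas as pd

# Hypothetical, deliberately loose email pattern; for illustration only.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def looks_like_email(ser):
    """Return True if every value in the Series is a string matching EMAIL_RE."""
    matches = ser.map(lambda v: isinstance(v, str) and EMAIL_RE.match(v) is not None)
    return bool(matches.all())


def validate_columns(df, col_names):
    """Return a new DataFrame with each email-like column in col_names renamed."""
    mapping = {}
    for col in col_names:
        if looks_like_email(df[col]):
            mapping[col] = "email"
    # A single rename call applies the whole mapping; df itself is not modified.
    return df.rename(columns=mapping)


df = pd.DataFrame({
    "unknown_field": ["bob@gmail.com", "shirley@gmail.com", "groza@pubg.com"],
    "other_field": [1, 2, 3],
})

# The result of pipe has to be assigned back, since nothing happens in-place.
df = df.pipe(validate_columns, col_names=["unknown_field", "other_field"])
print(list(df.columns))
```

Collecting the renames into one mapping keeps it to a single rename call instead of one per column.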
Use the DataFrame's rename method and set the columns argument.
import pandas as pd
df = pd.DataFrame({"unknown_field": ['bob@gmail.com', 'shirley@gmail.com', 'groza@pubg.com']})
df = df.rename(columns={'unknown_field': 'email'})
Output:
email
0 bob@gmail.com
1 shirley@gmail.com
2 groza@pubg.com
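To illustrate why the rename has to go through the DataFrame, here is a small sketch (the variable names are only for illustration). Renaming a Series pulled out of the frame yields a separate Series object, while the frame's labels live in df.columns and stay as they were.

```python
import pandas as pd

df = pd.DataFrame(
    {"unknown_field": ["bob@gmail.com", "shirley@gmail.com", "groza@pubg.com"]}
)

# Renaming a Series extracted from the frame returns a new Series;
# the frame's own column labels are untouched.
renamed_ser = df["unknown_field"].rename("email")
labels_after_series_rename = list(df.columns)

# Renaming through the DataFrame returns a new frame carrying the new label.
renamed_df = df.rename(columns={"unknown_field": "email"})
labels_after_frame_rename = list(renamed_df.columns)

print(renamed_ser.name)
print(labels_after_series_rename)
print(labels_after_frame_rename)
```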