I am using the below method to replace all the spaces and new line characters in the pandas dataframe column headers.
My question is:
Is there a more efficient way to do this than looping over a list comprehension, as in the code below?
def headerfiller(df):
    for i in [" ", "\n"]:
        df.columns = [c.replace(i, "_") for c in df.columns]
You can use the string methods available on Index objects, in this case df.columns.str.replace(), which lets you do this without looping over the values yourself:
In [23]: df = pd.DataFrame(np.random.randn(3,3), columns=['a\nb', 'c d', 'e\n f'])
In [24]: df.columns
Out[24]: Index([u'a\nb', u'c d', u'e\n f'], dtype='object')
In [25]: df.columns.str.replace(' |\n', '_')
Out[25]: Index([u'a_b', u'c_d', u'e__f'], dtype='object')
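Because the pattern is a regular expression, spaces and newlines are replaced in a single pass. Note that str.replace() returns a new Index, so assign the result back to df.columns to rename the headers. On pandas 2.0 and later, pass regex=True, because the pattern is otherwise treated as a literal string. See the docs: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.replace.html (for Series, but the method is the same for Index)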
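As a minimal sketch of that approach in a current setup (assuming Python 3 and a pandas version that accepts the regex keyword; the sample frame here is made up for illustration), the result is assigned back to the columns:

```python
import pandas as pd

# Hypothetical sample frame with spaces and newlines in the headers.
df = pd.DataFrame([[1, 2, 3]], columns=["a\nb", "c d", "e\n f"])

# regex=True is passed explicitly: pandas 2.0+ treats the pattern
# as a literal string unless told otherwise.
df.columns = df.columns.str.replace(r"[ \n]", "_", regex=True)

print(list(df.columns))  # ['a_b', 'c_d', 'e__f']
```

The character class [ \n] is equivalent to the alternation ' |\n' used in the session above; either form should work.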
Using str.translate:
>>> tbl = str.maketrans(' \n', '__')
>>> 'a b c\n'.translate(tbl)
'a_b_c_'
try:
    tbl = str.maketrans(' \n', '__')  # Python 3.x
except AttributeError:
    import string
    tbl = string.maketrans(' \n', '__')  # Python 2.x
def headerfiller(df):
    df.columns = [c.translate(tbl) for c in df.columns]
Using regular expression substitution:
>>> import re
>>> re.sub(r'[ \n]', '_', 'a b c\n')
'a_b_c_'
import re
def headerfiller(df):
    # [ \n] matches a space or a newline; ' \n' would only match a space followed by a newline
    df.columns = [re.sub(r'[ \n]', '_', c) for c in df.columns]
You could split() and '_'.join():
def headerfiller(df):
    df.columns = ['_'.join(c.split()) for c in df.columns]
It'll drop leading and trailing whitespace and newlines though (if that matters), and it compresses any run of whitespace (multiple spaces, tabs, newlines) to a single "_":
In [26]: "_".join("a b c\n\n\n".split())
Out[26]: 'a_b_c'
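If you want the compression but not the stripping, one possible variation (a sketch, not something from the answers above; the helper names are made up) is a regular expression that replaces each run of whitespace with one underscore:

```python
import re

def collapse_whitespace(name):
    """Replace each run of whitespace with a single underscore."""
    return re.sub(r"\s+", "_", name)

def split_join(name):
    """Same idea via split()/join(), which also strips both ends."""
    return "_".join(name.split())

sample = "a  b c\n\n\n"
print(collapse_whitespace(sample))  # a_b_c_
print(split_join(sample))           # a_b_c
```

Note that \s also matches tabs and other whitespace, so this is slightly broader than replacing only spaces and newlines.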