I have a data frame with "revenue" and "quantity" columns. Both are meant to be numeric, but they contain some garbage characters, such as ",", that must be stripped before conversion (the columns are originally of dtype "object"). The following two lines do the trick:
data['revenue'] = pd.to_numeric(data['revenue'].apply(lambda x: re.sub("[^0-9]", "", x)))
data['quantity'] = pd.to_numeric(data['quantity'].apply(lambda x: re.sub("[^0-9]", "", x)))
data.dtypes
revenue int64
quantity int64
Now, I wonder if there's a one-line way to do this. I tried the following:
data = data.apply(lambda x: pd.to_numeric(re.sub("[^0-9]", "", x)) if x.name in [['revenue','quantity']] else x)
That didn't change the dtype from object to int. Then I tried:
data[['revenue','quantity']] = pd.to_numeric(data[['revenue','quantity']].apply(lambda x: re.sub("[^0-9]", "", x)))
but got the error:
TypeError: ('expected string or bytes-like object', 'occurred at index revenue')
Any ideas for more efficient code than two lines?
Try this:
data = data.apply(lambda x: pd.to_numeric(x.apply(lambda v: re.sub("[^0-9]", "", v))) if x.name in ['revenue','quantity'] else x)
I usually just do
for col in ['revenue', 'quantity']:
    data[col] = data[col].apply(function)
It's not a one liner, but what you lose in lines you win in readability, in my opinion.
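To make the loop above concrete, here is a minimal sketch. The sample data and the helper name strip_non_digits are made up for illustration; the helper just wraps the re.sub call from the question.

```python
import re

import pandas as pd

# Hypothetical sample data; the values are invented for illustration.
data = pd.DataFrame({
    "revenue": ["1,200", "$350", "4 000"],
    "quantity": ["3", "1,000", "12"],
    "product": ["a", "b", "c"],
})


def strip_non_digits(value):
    """Remove every character that is not a digit 0-9 (hypothetical helper)."""
    return re.sub("[^0-9]", "", value)


# Clean and convert each target column; other columns are left untouched.
for col in ["revenue", "quantity"]:
    data[col] = pd.to_numeric(data[col].apply(strip_non_digits))
```

Naming the helper also gives you one place to change the cleaning rule later, for example if you decide to keep a decimal point.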
data['revenue'] is a Series, and apply is called with the individual data items of the series. But data[['revenue', 'quantity']] is a DataFrame, and apply is called with Series objects: twice, first with the series data['revenue'] and then with data['quantity']. The x in re.sub("[^0-9]", "", x) is therefore a Series object, and that's why it fails.
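A quick way to see the difference for yourself is to have the lambda report the type of its argument. This is only a sketch with made-up sample data; the variable names element_types and column_types are mine.

```python
import pandas as pd

# Hypothetical sample data; the values are invented for illustration.
data = pd.DataFrame({
    "revenue": ["1,200", "$350"],
    "quantity": ["3", "1,000"],
})

# Series.apply: the function is called once per element (here, a str).
element_types = data["revenue"].apply(lambda x: type(x).__name__)

# DataFrame.apply: the function is called once per column, with a Series.
column_types = data[["revenue", "quantity"]].apply(lambda x: type(x).__name__)

print(element_types.tolist())
print(column_types.to_dict())
```

The second call hands a whole Series to the lambda, which is exactly what re.sub cannot accept.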
You could change your lambda to
lambda s: s.apply(lambda x: re.sub("[^0-9]", "", x))
but the DataFrame.replace method accepts regular expressions, so there is no need to use apply at all:
data[['revenue', 'quantity']].replace("[^0-9]", "", regex=True)
to_numeric doesn't work on dataframes, but astype does. So the full conversion would be (assuming you want int64):
data[['revenue', 'quantity']] = data[['revenue', 'quantity']].replace(
"[^0-9]", "", regex=True).astype('int64')
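For reference, here is the replace/astype approach as a self-contained sketch you can run. The sample data and the cols variable are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; the values are invented for illustration.
data = pd.DataFrame({
    "revenue": ["1,200", "$350", "4 000"],
    "quantity": ["3", "1,000", "12"],
    "product": ["a", "b", "c"],
})

cols = ["revenue", "quantity"]

# Strip every non-digit character with a regex, then cast both columns at once.
data[cols] = data[cols].replace("[^0-9]", "", regex=True).astype("int64")

print(data.dtypes)
```

Two things to keep in mind with this pattern: a value that contains no digits at all becomes an empty string and makes astype raise, and "[^0-9]" also strips decimal points and minus signs, so "12.50" would end up as 1250.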
My suggestion is:
data[['revenue', 'quantity']] = data[['revenue', 'quantity']].\
applymap(lambda v: pd.to_numeric(re.sub("[^0-9]", "", v)))
This is actually a one-liner, split into two lines only for readability because of limited screen width. Note that applymap is deprecated since pandas 2.1 in favor of DataFrame.map.