Split one column into two by specific characters in Python
I use Python 3 and need to split a price column that mixes price_value and price_unit together in a dataframe. The example data looks like 20dollar/m2/month or 1.8dollar/m2/day. I want to split it at the word dollar into this format:
price_value price_unit
20 dollar/m2/month
1.8 dollar/m2/day
I have tried the following code:
Option 1:
df['price_value'] = df['price'].apply(lambda row: row.split('dollar')[0])
df['price_unit'] = df['price'].apply(lambda row: row.split('dollar')[-1])
Option 2:
df['price_value'], df['price_unit'] = df1["price"].str.split('dollar', 1).str
But I get:
price_value price_unit
20 /m2/month
1.8 /m2/day
How can I split them correctly? Thanks.
You may use str.extract with the regex r'(?P<price_value>.*?)(?P<price_unit>dollar.*)':
>>> import pandas as pd
>>> df = pd.DataFrame(data=['20dollar/m2/month', '1.8dollar/m2/day'], columns=['price'])
>>> df['price'].str.extract(r'(?P<price_value>.*?)(?P<price_unit>dollar.*)')
price_value price_unit
0 20 dollar/m2/month
1 1.8 dollar/m2/day
See the regex demo.
Details
(?P<price_value>.*?) - Group "price_value": any zero or more chars other than line break chars, as few as possible
(?P<price_unit>dollar.*) - Group "price_unit": dollar followed by any zero or more chars other than line break chars, as many as possible

I assume that you do not have any line breaks in the input, but if you happen to have any, prepend the pattern with the inline DOTALL modifier, (?s): r'(?s)(?P<price_value>.*?)(?P<price_unit>dollar.*)'
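To make the group behaviour and the DOTALL note concrete, here is a minimal sketch using only the standard re module. The helper name split_price and the sample strings with a line break or without dollar are made up for illustration; re.search is used because it also returns the first match found anywhere in the string.

```python
import re

PATTERN = r'(?P<price_value>.*?)(?P<price_unit>dollar.*)'


def split_price(text, flags=0):
    """Hypothetical helper: return (price_value, price_unit), or None if 'dollar' is absent."""
    match = re.search(PATTERN, text, flags)
    if match is None:
        return None
    return match.group('price_value'), match.group('price_unit')


# Ordinary single-line input from the question.
single_line = split_price('20dollar/m2/month')

# Made-up input with a line break inside the unit.
# Without DOTALL, '.' stops at the line break, so the unit is cut short.
multi_line_default = split_price('1.8dollar/m2\n/day')

# With DOTALL (same effect as the inline (?s) modifier), '.' also matches '\n'.
multi_line_dotall = split_price('1.8dollar/m2\n/day', re.DOTALL)

# Made-up input without the word 'dollar': there is no match at all.
no_match = split_price('20euro/m2/month')

print(single_line)
print(multi_line_default)
print(multi_line_dotall)
print(no_match)
```

In pandas, rows that do not match the pattern come back as missing values from str.extract instead of None.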
To add the newly extracted columns to the existing data frame, you may also use:
df[['price_value', 'price_unit']] = df['price'].str.extract(r'(.*?)(dollar.*)')
Here, named capturing groups are not necessary since you define the column names beforehand.
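Put together, the assignment could look like the sketch below. The final pd.to_numeric step is my own assumption that price_value should end up numeric (str.extract returns strings); it is not part of the original answer, so drop it if you want to keep the text as is.

```python
import pandas as pd

df = pd.DataFrame({'price': ['20dollar/m2/month', '1.8dollar/m2/day']})

# Two unnamed capturing groups produce two columns,
# which are assigned to the new column names in order.
df[['price_value', 'price_unit']] = df['price'].str.extract(r'(.*?)(dollar.*)')

# Assumption: a numeric price is wanted, so convert the extracted strings.
df['price_value'] = pd.to_numeric(df['price_value'])

print(df)
```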
You could do:
df = pd.DataFrame(data=['20dollar/m2/month', '1.8dollar/m2/day'], columns=['price_unit'])
# split by capture group
result = df['price_unit'].str.split('(dollar.*$)', expand=True).drop(2, axis=1)
# rename columns
result.columns = ['price_value', 'price_unit']
print(result)
Output
price_value price_unit
0 20 dollar/m2/month
1 1.8 dollar/m2/day
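For anyone wondering why the original attempts lose the word dollar and why the third column has to be dropped here, this small sketch with the standard re module shows the same splitting behaviour on one value. The lookahead variant at the end is an alternative I am suggesting, not something from the answers above, and it relies on Python 3.7+ where re.split can split on empty matches.

```python
import re

value = '20dollar/m2/month'

# A plain split removes the delimiter, which is why 'dollar' disappears.
plain = value.split('dollar')

# A capturing group keeps the matched text in the result, but re.split also
# returns the empty string left after the match at the end of the input.
# That trailing empty piece is the extra column that gets dropped.
captured = re.split(r'(dollar.*$)', value)

# Alternative (assumption, not from the answers): split on a zero-width
# lookahead so nothing is consumed and no empty piece is produced.
lookahead = re.split(r'(?=dollar)', value, maxsplit=1)

print(plain)
print(captured)
print(lookahead)
```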