
Split one column into two by specific characters in Python

I use Python 3 and need to split a price column in a DataFrame that mixes price_value and price_unit together. The example data looks like 20dollar/m2/month or 1.8dollar/m2/day. I want to split the values at the word dollar into this format:

price_value      price_unit
20             dollar/m2/month
1.8            dollar/m2/day

I have tried the following code:

Option 1:

df['price_value'] = df['price'].apply(lambda row: row.split('dollar')[0])
df['price_unit'] = df['price'].apply(lambda row: row.split('dollar')[-1])

Option 2:

df['price_value'], df['price_unit'] = df1["price"].str.split('dollar', 1).str

But I get:

price_value      price_unit
20                /m2/month
1.8               /m2/day

How can I split them correctly? Thanks.

You may use str.extract with the regex r'(?P<price_value>.*?)(?P<price_unit>dollar.*)':

>>> import pandas as pd
>>> df = pd.DataFrame(data=['20dollar/m2/month', '1.8dollar/m2/day'], columns=['price'])
>>> df['price'].str.extract(r'(?P<price_value>.*?)(?P<price_unit>dollar.*)')
  price_value       price_unit
0          20  dollar/m2/month
1         1.8    dollar/m2/day

See the regex demo.

Details

  • (?P<price_value>.*?) - Group "price_value": any 0+ chars other than line break chars, as few as possible
  • (?P<price_unit>dollar.*) - Group "price_unit": dollar and any 0+ chars other than line break chars, as many as possible.
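To see why the first group is lazy, here is a minimal sketch using the standard re module rather than pandas. The sample string with two occurrences of dollar is hypothetical and not from the question; the greedy variant is shown only for contrast.

```python
import re

# Pattern from the answer: lazy first group, greedy second group.
lazy = re.compile(r'(?P<price_value>.*?)(?P<price_unit>dollar.*)')
# Contrast only: the same pattern with a greedy first group.
greedy = re.compile(r'(?P<price_value>.*)(?P<price_unit>dollar.*)')

sample = '5dollar/dollar'  # hypothetical input containing "dollar" twice

lazy_groups = lazy.search(sample).groupdict()
greedy_groups = greedy.search(sample).groupdict()

print(lazy_groups)    # {'price_value': '5', 'price_unit': 'dollar/dollar'}
print(greedy_groups)  # {'price_value': '5dollar/', 'price_unit': 'dollar'}
```

The lazy group stops at the first dollar, so the unit keeps everything from there onward; the greedy group would run to the last dollar instead.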

I assume that you do not have any line breaks in the input, but if you happen to have any, prepend the pattern with the inline DOTALL modifier, (?s): r'(?s)(?P<price_value>.*?)(?P<price_unit>dollar.*)'

To add the newly extracted columns to the existing data frame, you may also use
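The effect of (?s) can be checked with a small sketch using the standard re module. The input with a line break inside the unit is a hypothetical value, made up for illustration.

```python
import re

text = '20dollar/m2\n/month'  # hypothetical value with an embedded line break

plain = re.search(r'(?P<price_value>.*?)(?P<price_unit>dollar.*)', text)
dotall = re.search(r'(?s)(?P<price_value>.*?)(?P<price_unit>dollar.*)', text)

# Without (?s), "." stops at the line break and the unit is cut short.
print(repr(plain.group('price_unit')))   # 'dollar/m2'
# With (?s), "." also matches the line break, so the whole unit is kept.
print(repr(dotall.group('price_unit')))  # 'dollar/m2\n/month'
```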

df[['price_value', 'price_unit']] = df['price'].str.extract(r'(.*?)(dollar.*)')

Here, named capturing groups are not necessary since you define the column names beforehand.

You could do:
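Put together, a minimal end-to-end sketch might look like this. The pd.to_numeric step is an assumption on my part (the question does not say the value must be numeric); str.extract itself returns strings.

```python
import pandas as pd

df = pd.DataFrame({'price': ['20dollar/m2/month', '1.8dollar/m2/day']})

# Unnamed groups: the column names come from the left-hand side.
df[['price_value', 'price_unit']] = df['price'].str.extract(r'(.*?)(dollar.*)')

# Assumption: convert the extracted strings to numbers for later arithmetic.
df['price_value'] = pd.to_numeric(df['price_value'])

print(df)
```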

import pandas as pd

df = pd.DataFrame(data=['20dollar/m2/month', '1.8dollar/m2/day'], columns=['price_unit'])

# split by capture group
result = df['price_unit'].str.split('(dollar.*$)', expand=True).drop(2, axis=1)

# rename columns
result.columns = ['price_value', 'price_unit']

print(result)

Output

  price_value       price_unit
0          20  dollar/m2/month
1         1.8    dollar/m2/day
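The reason this works, and the reason a third column has to be dropped, can be sketched with the standard re module. I am assuming pandas' regex split follows re.split here, which the three-column result above suggests.

```python
import re

value = '20dollar/m2/month'

# Splitting on the plain word consumes the delimiter, as in the question.
without_group = re.split(r'dollar', value)

# A capturing group keeps the matched text in the result. Because the match
# runs to the end of the string, an empty trailing piece is left over.
with_group = re.split(r'(dollar.*$)', value)

print(without_group)  # ['20', '/m2/month']
print(with_group)     # ['20', 'dollar/m2/month', '']
```

The empty last element corresponds to column 2, which the answer removes with drop(2, axis=1).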
