How to extract part of the text using regular expressions in a pandas data frame

Question

I have a data frame, and one of its columns looks like this:

df = index  dosage_duration
     0        5  years20mg 1X D
     1         2  days10mg 1X D
     2         2  days10mg 1X D
     3                 7  weeks
     4                2  months
     5                  3  days
     6             1  years5 MG
     7                 2  years

What I am trying to do is extract the leading time portion and convert it to days. So the result would look like this:

df = index  dosage_duration       new_dosage
     0        5  years20mg 1X D    5*365
     1         2  days10mg 1X D    2
     2         2  days10mg 1X D    2
     3                 7  weeks    7*7
     4                2  months    2*30
     5                  3  days    3
     6             1  years5 MG    1*365
     7                 2  years    2*365

As you can see, 5 years is converted to 5*365 so that it is expressed in days.

I am able to get the first part, say 5 in the first row, 2 in the second row, and so on, but I'm not sure how to get the unit (years, days, or months) so that I can convert all the values to days.

Apparently, I need to be able to find the first number after the space, but I don't know how to do this part.

Answer 1

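Let's try:

import pandas as pd

df = pd.DataFrame({'dosage_duration':['5 years20mg 1x D'
                                     ,'2 days10mg 1x D'
                                     ,'4 months20mg 1x D'
                                     ,'7 weeks'
                                     ,'2 months'
                                     ,'3 days'
                                     ,'1 days'
                                     ,'1 years5 MG'
                                     ,'2 years'
                                     ,'6 months'
                                     ,'1 years10 1x D'
                                     ,'10 months15']})

# Days per time unit
nmap={'years':365, 'months':30, 'weeks':7, 'days': 1}
strnmap = '|'.join(nmap.keys())

# 'unit' captures the count and 'span' the time unit word.
# The alternation sits in a group, not a [...] character class, so only whole unit words match;
# \s* tolerates zero or more spaces, and the raw f-string keeps \d and \s intact.
df_m = df.dosage_duration.str.extract(rf'(?P<unit>\d+)\s*(?P<span>{strnmap})')
df['new_duration']= df_m['unit'].astype(int).mul(df_m['span'].map(nmap))

print(df)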

Output:

      dosage_duration  new_duration
0    5 years20mg 1x D          1825
1     2 days10mg 1x D             2
2   4 months20mg 1x D           120
3             7 weeks            49
4            2 months            60
5              3 days             3
6              1 days             1
7         1 years5 MG           365
8             2 years           730
9            6 months           180
10     1 years10 1x D           365
11        10 months15           300
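
One caveat with this approach: `astype(int)` raises if any row fails to match, because `str.extract` returns NaN for that row. Below is a minimal sketch of one way to tolerate such rows, assuming you would rather get NaN than an error. The sample rows (`'unknown'` and `None`) and the group names `number` and `unit` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample: two parsable rows, one unparsable string, one missing value
df = pd.DataFrame({'dosage_duration': ['5  years20mg 1X D', '7  weeks', 'unknown', None]})

# Days per time unit
nmap = {'years': 365, 'months': 30, 'weeks': 7, 'days': 1}
pattern = rf"(?P<number>\d+)\s*(?P<unit>{'|'.join(nmap)})"

parts = df['dosage_duration'].str.extract(pattern)

# pd.to_numeric keeps NaN for rows where nothing matched, unlike astype(int)
df['new_duration'] = pd.to_numeric(parts['number']).mul(parts['unit'].map(nmap))

print(df)
```

The resulting column is float rather than int, since NaN cannot be stored in a plain integer column.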

Answer 2

Split by whitespace.
The first element is your number.
The second element indicates the kind of time unit: day, week, month, or year. Just the first letter is enough to identify what to multiply by.

import pandas as pd

df  = pd.DataFrame({'dosage_duration':['5 years27abc','10 days92pqr', '5.5 weeks782364hgsdf', '3 months21647hadjh']})

mul = {
    'd':1,
    'w':7,
    'm':30,
    'y':365
}

df['new_dosage'] = df['dosage_duration'].apply(lambda x:x.split()).apply(lambda x:float(x[0])*mul[x[1][0]])
df

Output:

    dosage_duration         new_dosage
0   5 years27abc            1825.0
1   10 days92pqr            10.0
2   5.5 weeks782364hgsdf    38.5
3   3 months21647hadjh      90.0
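
The same idea can also be written without `apply`, using `str.extract` to pull out the number and the first letter of the unit in one pass. This is only a sketch under the assumption that every row starts with a number (optionally decimal) followed by a unit beginning with d, w, m, or y; the group names `number` and `unit` are my own.

```python
import pandas as pd

df = pd.DataFrame({'dosage_duration': ['5 years27abc', '10 days92pqr',
                                       '5.5 weeks782364hgsdf', '3 months21647hadjh']})

# Days per unit, keyed by the first letter of the unit
mul = {'d': 1, 'w': 7, 'm': 30, 'y': 365}

# A number (optionally with a decimal part), optional whitespace, then the unit's first letter
parts = df['dosage_duration'].str.extract(r'(?P<number>\d+(?:\.\d+)?)\s*(?P<unit>[dwmy])')

df['new_dosage'] = parts['number'].astype(float) * parts['unit'].map(mul)

print(df)
```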

Update:

If you want them as expression strings instead:

import pandas as pd

df  = pd.DataFrame({'t':['5 years27abc','10 days92pqr', '5 weeks782364hgsdf', '3 months21647hadjh']})

mul = {
    'd':'1',
    'w':'7',
    'm':'30',
    'y':'365'
}

df['total_time'] = df['t'].apply(lambda x:x.split()).apply(lambda x:x[0] + '*' + mul[x[1][0]])
df

Output:

    t                     total_time
0   5 years27abc          5*365
1   10 days92pqr          10*1
2   5 weeks782364hgsdf    5*7
3   3 months21647hadjh    3*30
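
If you later need numbers back from those expression strings, one option that avoids `eval` is to split on `*` and multiply the two factors. A minimal sketch, assuming every string has the exact form `<integer>*<integer>`; the names `factors` and `days` are just for illustration.

```python
import pandas as pd

total_time = pd.Series(['5*365', '10*1', '5*7', '3*30'])

# Split each expression into its two factors and convert them to integers
factors = total_time.str.split('*', expand=True).astype(int)

# Multiply the factors to get the number of days
days = factors[0] * factors[1]

print(days)
```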
