How to extract part of text using regex on a data frame in pandas
I have a data frame, and one of its columns looks like this:
df = index dosage_duration
0 5 years20mg 1X D
1 2 days10mg 1X D
2 2 days10mg 1X D
3 7 weeks
4 2 months
5 3 days
6 1 years5 MG
7 2 years
What I am trying to do is extract the leading duration and convert it to days, so the result would look like this:
df = index dosage_duration new_dosage
0 5 years20mg 1X D 5*365
1 2 days10mg 1X D 2
2 2 days10mg 1X D 2
3 7 weeks 7*7
4 2 months 2*30
5 3 days 3
6 1 years5 MG 1*365
7 2 years 2*365
As you can see, 5 years is converted to 5*365 to express it in days.
I am able to get the first part, say 5 in the first row, 2 in the second row, and so on, but I'm not sure how to get the unit (years, days or months) so that I can convert all the values to a scale of days.
Apparently, I need to be able to find the first number after the space, but I don't know how to do this part.
Let's try:
import pandas as pd

df = pd.DataFrame({'dosage_duration': ['5 years20mg 1x D',
                                       '2 days10mg 1x D',
                                       '4 months20mg 1x D',
                                       '7 weeks',
                                       '2 months',
                                       '3 days',
                                       '1 days',
                                       '1 years5 MG',
                                       '2 years',
                                       '6 months',
                                       '1 years10 1x D',
                                       '10 months15']})

# Number of days in each unit (months approximated as 30 days)
nmap = {'years': 365, 'months': 30, 'weeks': 7, 'days': 1}
strnmap = '|'.join(nmap.keys())

# Capture the leading number and the unit name. The unit is an alternation
# group, not a character class, so only whole unit names match.
df_m = df.dosage_duration.str.extract(rf'(?P<value>\d+)\s?(?P<unit>{strnmap})')
df['new_duration'] = df_m['value'].astype(int).mul(df_m['unit'].map(nmap))
print(df)
Output:
dosage_duration new_duration
0 5 years20mg 1x D 1825
1 2 days10mg 1x D 2
2 4 months20mg 1x D 120
3 7 weeks 49
4 2 months 60
5 3 days 3
6 1 days 1
7 1 years5 MG 365
8 2 years 730
9 6 months 180
10 1 years10 1x D 365
11 10 months15 300
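If some rows might not contain a recognizable duration at all, `str.extract` returns NaN for them and `astype(int)` would raise. A minimal sketch of one way to tolerate that, assuming a hypothetical sample in which the last row has no duration (the names `days_per_unit`, `parts` and `value` are illustrative only):

```python
import pandas as pd

# Hypothetical sample; the last row has no recognizable duration.
df = pd.DataFrame({'dosage_duration': ['5 years20mg 1X D', '7 weeks',
                                       '2 months', 'unknown']})

days_per_unit = {'years': 365, 'months': 30, 'weeks': 7, 'days': 1}
units = '|'.join(days_per_unit)

# Rows that do not match the pattern yield NaN in both captured columns.
parts = df['dosage_duration'].str.extract(rf'(?P<value>\d+)\s?(?P<unit>{units})')

# pd.to_numeric with errors='coerce' keeps missing values as NaN instead of raising.
value = pd.to_numeric(parts['value'], errors='coerce')
df['new_duration'] = value * parts['unit'].map(days_per_unit)
print(df)
```

Unmatched rows end up with a missing `new_duration`, which can then be inspected or filled separately.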
The units are day, week, month and year, so just the first letter is enough to identify what to multiply by.
import pandas as pd
df = pd.DataFrame({'dosage_duration':['5 years27abc','10 days92pqr', '5.5 weeks782364hgsdf', '3 months21647hadjh']})
mul = {
'd':1,
'w':7,
'm':30,
'y':365
}
df['new_dosage'] = df['dosage_duration'].apply(lambda x:x.split()).apply(lambda x:float(x[0])*mul[x[1][0]])
df
Output:
dosage_duration new_dosage
0 5 years27abc 1825.0
1 10 days92pqr 10.0
2 5.5 weeks782364hgsdf 38.5
3 3 months21647hadjh 90.0
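The same first-letter idea can also be written without `apply`, using `str.extract` to pull out the number and the first letter of the unit in one step. This is only a sketch under the assumption that every row starts with a number (optionally with a decimal part) followed by whitespace and a unit beginning with d, w, m or y; the column names `value` and `letter` are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'dosage_duration': ['5 years27abc', '10 days92pqr',
                                       '5.5 weeks782364hgsdf', '3 months21647hadjh']})
mul = {'d': 1, 'w': 7, 'm': 30, 'y': 365}

# Leading number (integer or decimal), then whitespace, then the unit's first letter.
parts = df['dosage_duration'].str.extract(r'^(?P<value>\d+(?:\.\d+)?)\s+(?P<letter>[dwmy])')

# Multiply the numeric value by the number of days for that unit.
df['new_dosage'] = parts['value'].astype(float) * parts['letter'].map(mul)
print(df)
```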
Update: to get the result as a string expression such as 5*365 instead of the computed number:
import pandas as pd
df = pd.DataFrame({'t':['5 years27abc','10 days92pqr', '5 weeks782364hgsdf', '3 months21647hadjh']})
mul = {
'd':'1',
'w':'7',
'm':'30',
'y':'365'
}
df['total_time'] = df['t'].apply(lambda x:x.split()).apply(lambda x:x[0] + '*' + mul[x[1][0]])
df
Output:
t total_time
0 5 years27abc 5*365
1 10 days92pqr 10*1
2 5 weeks782364hgsdf 5*7
3 3 months21647hadjh 3*30
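If the numeric value is needed later, strings of this form can be turned back into numbers without `eval`. A small sketch, assuming every entry is exactly two integers joined by `*` (the series below simply repeats the values shown above):

```python
import pandas as pd

total_time = pd.Series(['5*365', '10*1', '5*7', '3*30'])

# Split each expression on the literal '*' into two integer columns.
factors = total_time.str.split('*', expand=True).astype(int)

# Multiply the two factors to get the number of days.
days = factors[0] * factors[1]
print(days)
```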