Extract multiple date/time values from text field into new variable columns
I have a dataframe (see below). This is just a snippet of the full dataframe; each row/ID contains more text and more date/times. As you can see, the text before and after each date/time is random.
ID RESULT
1 Patients Discharged Home : 12/07/2022 11:19 Bob Melciv Appt 12/07/2022 12:19 Medicaid...
2 Stawword Geraldio - 12/17/2022 11:00 Bob Melciv Appt 12/10/2022 12:09 Risk Factors...
I would like to pull all date/times in the format MM/DD/YYYY HH:MM from the RESULT column and put each of them into its own column.
ID DATE_TIME_1 DATE_TIME_2 DATE_TIME_3 .....
1 12/07/2022 11:19 12/07/2022 12:19
2 12/17/2022 11:00 12/10/2022 12:09
How about:
Of course, this doesn't cover nonsensical dates such as 55/55/1023, but it should get you 99% of the way there.
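One way to catch the nonsensical matches mentioned above is to parse each extracted string with pd.to_datetime and errors='coerce', which turns values that are not real dates into NaT. This is only a minimal sketch: the pattern is assumed to be the MM/DD/YYYY HH:MM regex used elsewhere in this thread, and the sample text is made up for illustration.

```python
import re
import pandas as pd

# Assumed pattern: MM/DD/YYYY HH:MM, as used elsewhere in this thread
pattern = r'\d{2}/\d{2}/\d{4} \d{2}:\d{2}'

# Hypothetical sample text with one valid and one nonsensical date/time
text = 'Appt 12/07/2022 11:19 then 55/55/1023 10:00 end'

# The regex alone accepts both strings
candidates = re.findall(pattern, text)

# errors='coerce' converts strings that are not real dates into NaT
parsed = pd.to_datetime(pd.Series(candidates, dtype='object'),
                        format='%m/%d/%Y %H:%M', errors='coerce')

# Keep only the values that parsed successfully
valid = parsed.dropna().tolist()
print(valid)
```

The regex stays simple and the date validation is left to the parser, which already knows which months and days exist.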
From @David542's regex, you can use str.extractall:
import pandas as pd

pattern = r'(\d{2}/\d{2}/\d{4} \d{2}:\d{2})'
out = pd.concat([df['ID'],
                 df['RESULT'].str.extractall(pattern).squeeze()
                             .unstack().rename(columns=lambda x: f'DATE_TIME_{x+1}')
                             .rename_axis(columns=None)], axis=1)
print(out)
# Output
ID DATE_TIME_1 DATE_TIME_2
0 1 12/07/2022 11:19 12/07/2022 12:19
1 2 12/17/2022 11:00 12/10/2022 12:09
A slightly modified version that converts the extracted date/times to datetime values (pd.Timestamp, stored as datetime64[ns] columns):
pattern = r'(\d{2}/\d{2}/\d{4} \d{2}:\d{2})'
out = pd.concat([df['ID'],
                 df['RESULT'].str.extractall(pattern).squeeze().apply(pd.to_datetime)
                             .unstack().rename(columns=lambda x: f'DATE_TIME_{x+1}')
                             .rename_axis(columns=None)], axis=1)
print(out)
# Output
ID DATE_TIME_1 DATE_TIME_2
0 1 2022-12-07 11:19:00 2022-12-07 12:19:00
1 2 2022-12-17 11:00:00 2022-12-10 12:09:00
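As a possible variation on the conversion above, pd.to_datetime can be called once on the whole Series with an explicit format, which removes any day/month ambiguity and is usually faster than an element-wise apply. The sketch below rebuilds the sample dataframe from the question (truncated text filled with the visible words only) and selects column 0 instead of calling squeeze(), since that stays a Series even when there is only a single match.

```python
import pandas as pd

# Sample data reconstructed from the question's snippet
df = pd.DataFrame({
    'ID': [1, 2],
    'RESULT': [
        'Patients Discharged Home : 12/07/2022 11:19 Bob Melciv Appt 12/07/2022 12:19 Medicaid',
        'Stawword Geraldio - 12/17/2022 11:00 Bob Melciv Appt 12/10/2022 12:09 Risk Factors',
    ],
})

pattern = r'(\d{2}/\d{2}/\d{4} \d{2}:\d{2})'

# Column 0 holds the single capture group; the index is (row, match)
matches = df['RESULT'].str.extractall(pattern)[0]

# Parse all matches at once with an explicit MM/DD/YYYY HH:MM format
dates = pd.to_datetime(matches, format='%m/%d/%Y %H:%M', errors='coerce')

# One column per match number, named DATE_TIME_1, DATE_TIME_2, ...
wide = (dates.unstack()
             .rename(columns=lambda x: f'DATE_TIME_{x + 1}')
             .rename_axis(columns=None))

out = df[['ID']].join(wide)
print(out)
```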
Step by step:
# 1. Date extraction (and squeeze DataFrame with 1 column to Series)
>>> out = df['RESULT'].str.extractall(pattern).squeeze()
match
0 0 12/07/2022 11:19
1 12/07/2022 12:19
1 0 12/17/2022 11:00
1 12/10/2022 12:09
Name: 0, dtype: object
# 2. Move second index level as column (and add the prefix DATE_TIME_N)
>>> out = out.unstack().rename(columns=lambda x: f'DATE_TIME_{x+1}')
match DATE_TIME_1 DATE_TIME_2
0 12/07/2022 11:19 12/07/2022 12:19
1 12/17/2022 11:00 12/10/2022 12:09
# 3. Remove the 'match' title on column axis
>>> out = out.rename_axis(columns=None)
DATE_TIME_1 DATE_TIME_2
0 12/07/2022 11:19 12/07/2022 12:19
1 12/17/2022 11:00 12/10/2022 12:09
Finally, concatenate the original ID column with this new dataframe along the column axis.
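The final concatenation aligns on the row index, which matters when rows contain different numbers of date/times or none at all. The sketch below uses hypothetical data (shortened text and an extra third row with no date) to show that such rows are kept and padded with NaN.

```python
import pandas as pd

# Hypothetical data: row 1 has one date/time, row 2 has none
df = pd.DataFrame({
    'ID': [1, 2, 3],
    'RESULT': [
        'a 12/07/2022 11:19 b 12/07/2022 12:19',
        'c 12/17/2022 11:00',
        'no dates here',
    ],
})

pattern = r'(\d{2}/\d{2}/\d{4} \d{2}:\d{2})'

# Rows without any match are absent from the extractall result
wide = (df['RESULT'].str.extractall(pattern)[0]
          .unstack()
          .rename(columns=lambda x: f'DATE_TIME_{x + 1}')
          .rename_axis(columns=None))

# concat with axis=1 does an outer join on the index, so row 2 survives with NaN
out = pd.concat([df['ID'], wide], axis=1)
print(out)
```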