Extracting a datetime from a pandas column of mixed letters and numbers

Question

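I have a column in a pandas DataFrame that holds two kinds of information: (1) a date and time, and (2) a company name. I need to split it into two columns (date_time, full_company_name). First I tried splitting by character count (the first 19 characters into one column, the rest into the other), but then I realized the date is sometimes missing, so a fixed-width split will not always work. Then I tried a regular expression, but I cannot get it to extract the date correctly.

The column:

Desired output: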

Answer 1

If the dates are all well-formed, you may not need a regular expression at all:

import pandas as pd

df = pd.DataFrame({"A": ["2021-01-01 05:00:00Acme Industries",
                         "2021-01-01 06:00:00Acme LLC"]})
df["date"] = pd.to_datetime(df.A.str[:19])
df["company"] = df.A.str[19:]
df
#                                     A                 date          company
# 0  2021-01-01 05:00:00Acme Industries  2021-01-01 05:00:00  Acme Industries
# 1         2021-01-01 06:00:00Acme LLC  2021-01-01 06:00:00         Acme LLC
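Since the question mentions that the date is sometimes missing, the fixed-width slice can be guarded rather than abandoned. The following is only a sketch under assumptions: the second sample row (a company name with no date) is hypothetical, and it assumes that a present date always occupies exactly the first 19 characters in `YYYY-MM-DD HH:MM:SS` form.

```python
import pandas as pd

df = pd.DataFrame({"A": ["2021-01-01 05:00:00Acme Industries",
                         "Acme LLC"]})  # hypothetical row with no date

# Parse the first 19 characters; anything that is not a timestamp becomes NaT.
date = pd.to_datetime(df["A"].str[:19], format="%Y-%m-%d %H:%M:%S",
                      errors="coerce")

# Where no date was found, keep the whole string as the company name.
company = df["A"].str[19:].where(date.notna(), df["A"])

result = df.assign(date=date, company=company)
```

With `errors="coerce"` the rows without a date end up as `NaT` instead of raising, so they are easy to find afterwards with `result["date"].isna()`.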

Alternatively, with a regular expression:

df.A.str.extract(r"(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})(.*)")
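As written, that pattern requires a date, so a row without one would come back as NaN in both columns. A possible variation, sketched here under assumptions, is to name the groups so that `str.extract` labels the columns directly and to make the date group optional. The group names `date_time` and `full_company_name` are taken from the question; the dateless sample row is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"A": ["2021-01-01 05:00:00Acme Industries",
                         "Acme LLC"]})  # hypothetical row with no date

# Named groups become column names; the trailing "?" makes the date optional.
pattern = (r"^(?P<date_time>\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})?"
           r"(?P<full_company_name>.*)$")

parts = df["A"].str.extract(pattern)
# Convert the extracted text to real timestamps; missing dates become NaT.
parts["date_time"] = pd.to_datetime(parts["date_time"])
```

The resulting `parts` frame can be joined back with `df.join(parts)` if the original column should be kept.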

Answer 2

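Note: if you have the option of not concatenating these strings in the first place, take it. Storing two fields in one string is not a healthy habit.

Solution (not that pretty, but it gets the job done):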

import pandas as pd
from datetime import datetime
import re

df = pd.DataFrame()
# creating a list of companies
companies = ['Google', 'Apple', 'Microsoft', 'Facebook', 'Amazon', 'IBM', 
             'Oracle', 'Intel', 'Yahoo', 'Alphabet']
# creating a list of datetime objects (1 January of 2000 through 2009)
dates = [datetime(year=2000 + i, month=1, day=1) for i in range(10)]
# creating the column named 'date_time/full_company_name'
df['date_time/full_company_name'] = [f'{d}{c}' for d, c in zip(dates, companies)]

# Before:
# date_time/full_company_name
# 2000-01-01 00:00:00Google
# 2001-01-01 00:00:00Apple
# 2002-01-01 00:00:00Microsoft
# 2003-01-01 00:00:00Facebook
# 2004-01-01 00:00:00Amazon
# 2005-01-01 00:00:00IBM
# 2006-01-01 00:00:00Oracle
# 2007-01-01 00:00:00Intel
# 2008-01-01 00:00:00Yahoo
# 2009-01-01 00:00:00Alphabet

new_rows = []
for row in df['date_time/full_company_name']:
    # extract the date_time from the start of the row using regex
    # (re.match anchors at position 0, so the slice below stays correct)
    date_time = re.match(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}', row)
    # handle case of empty date_time
    date_time = date_time.group() if date_time else ''
    # extract the company name from the row from where the date_time ends
    company_name = row[len(date_time):]
    # create a new row with the extracted date_time and company_name
    new_rows.append([date_time, company_name])

# drop the column 'date_time/full_company_name'
df = df.drop(columns=['date_time/full_company_name'])
# add the new columns to the dataframe: 'date_time' and 'company_name'
df['date_time'] = [row[0] for row in new_rows]
df['company_name'] = [row[1] for row in new_rows]

# After:
# date_time            company_name
# 2000-01-01 00:00:00       Google
# 2001-01-01 00:00:00       Apple
# 2002-01-01 00:00:00       Microsoft
# 2003-01-01 00:00:00       Facebook
# 2004-01-01 00:00:00       Amazon
# 2005-01-01 00:00:00       IBM
# 2006-01-01 00:00:00       Oracle
# 2007-01-01 00:00:00       Intel
# 2008-01-01 00:00:00       Yahoo
# 2009-01-01 00:00:00       Alphabet
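The loop above can also be packaged as a small helper, which makes the missing-date case easy to check on its own. This is a sketch rather than the answer's own code: `split_row` and `DATE_RE` are hypothetical names, and the dateless row `"Apple"` is made-up sample data.

```python
import re
from typing import Tuple

import pandas as pd

# Timestamp expected at the very start of the string.
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def split_row(row: str) -> Tuple[str, str]:
    """Return (date_time, company_name); date_time is '' when absent."""
    match = DATE_RE.match(row)
    date_time = match.group() if match else ""
    return date_time, row[len(date_time):]


rows = ["2000-01-01 00:00:00Google", "Apple"]  # second row has no date
out = pd.DataFrame([split_row(r) for r in rows],
                   columns=["date_time", "company_name"])
```

Building the frame from a list of tuples avoids dropping and re-adding columns one at a time.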

Answer 3

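Make the date group optional with a trailing ? and match the rest with an uncaptured .* instead of (.*):

import pandas as pd

df = pd.DataFrame({"A": ["2021-01-01 05:00:00Acme Industries",
                         "2021-01-01 06:00:00Acme LLC"]})

df.A.str.extract(r"(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})?.*")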
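Because this pattern has a single capturing group, it yields only the date. One way to recover the company name as well, sketched here under assumptions, is to strip the leading timestamp with `str.replace`. The `DATE` constant and the dateless sample row are hypothetical, and the column names follow the question.

```python
import pandas as pd

df = pd.DataFrame({"A": ["2021-01-01 05:00:00Acme Industries",
                         "Acme LLC"]})  # hypothetical row with no date

DATE = r"\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}"

# With one group and expand=False, extract returns a Series;
# rows without a leading date give NaN, which to_datetime turns into NaT.
df["date_time"] = pd.to_datetime(df["A"].str.extract(f"^({DATE})", expand=False))

# Remove the leading timestamp, if present, to leave the company name.
df["full_company_name"] = df["A"].str.replace(f"^{DATE}", "", regex=True)
```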

Answer 1: score 2, accepted, 2021-11-23 16:45:49
Answer 2: score 0, 2021-11-23 17:23:04
Answer 3: score 0, 2021-11-23 17:29:05
