In pandas, how do I fill NaN with a pattern extracted from another column?

Question

I'm working with the data below, and I want to fill the NaN values in Begin and End using the dates found in the Subscription Period column. All columns are strings.

I have several formats:

For 05/03/2020 to 04/03/2021, I use:

    # Extract Begin and End when both dates appear in Subscription Period
    # (creates three new columns)
    df_period = df['Subscription Period'] \
        .str.extractall(r'(?P<Period>(?P<Begin>(0[1-9]|[12][0-9]|3[01])[/](0[1-9]|1[012])[/](19|20)?\d\d).+(?P<End>(0[1-9]|[12][0-9]|3[01])[/](0[1-9]|1[012])[/](19|20)?\d\d))')
    df['Period'] = df_period['Period'].unstack()
    df['Begin'] = df_period['Begin'].unstack()
    df['End'] = df_period['End'].unstack()

For the other formats of Subscription Period:

Subscription Hospital Sept-Dec 2018: I want to extract Sept as 01/09/2018 for Begin and Dec as 31/12/2018 for End.
Yearly Subscription Hospital (effective 17/04/2019)
Yearly Subscription Hospital (effective 01 octobre 2018)
For these two, I want the effective date in Begin and the date one year later in End.

I tried the following:

With a mask:

mask = df['Subscription Period'].str.contains(r'(\d{2}/\d{2}/\d{2,4})[)]?$')
df.loc[mask, 'Begin'] = df['Subscription Period'].str.contains(r'(\d{2}/\d{2}/\d{2,4})[)]?$')

With loc(): this works for assigning 'B', but not with a regex extraction.

df.loc[(df['Begin'].isnull()) , 'Period']= 'B'

Here is the data:

import numpy as np
import pandas as pd

# np.nan is used instead of np.NaN, which was removed in NumPy 2.0
data = {'Date': {0: '2020-05-05',
  1: '2018-09-12',
  2: '2020-04-22',
  3: '2020-01-01',
  4: '2019-04-17',
  5: '2018-09-07',
  6: '2018-11-20',
  7: '2018-11-28'},
 'Subscription Period': {0: 'Subscription Hospital : from 01/05/2020 to 30/04/2021',
  1: 'Subscription Hospital Sept-Dec 2018',
  2: 'Yearly Subscription Hospital from 05/03/2020 to 04/03/2021',
  3: 'Subscription Hospital from 01/01/2020 to 31/12/2020',
  4: 'Yearly Subscription Hospital (effective 17/04/2019)',
  5: 'Yearly Subscription Hospital (effective 01 octobre 2018)',
  6: 'Subscription : Hospital',
  7: 'Yearly Subscription Hospital'},
 'Period': {0: '01/05/2020 to 30/04/2021',
  1: np.nan,
  2: '05/03/2020 to 04/03/2021',
  3: '01/01/2020 to 31/12/2020',
  4: np.nan,
  5: np.nan,
  6: np.nan,
  7: np.nan},
 'Begin': {0: '01/05/2020',
  1: np.nan,
  2: '05/03/2020',
  3: '01/01/2020',
  4: np.nan,
  5: np.nan,
  6: np.nan,
  7: np.nan},
 'End': {0: '30/04/2021',
  1: np.nan,
  2: '04/03/2021',
  3: '31/12/2020',
  4: np.nan,
  5: np.nan,
  6: np.nan,
  7: np.nan}}

df = pd.DataFrame.from_dict(data)

Thanks for your help and any tips.

Answer 1

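Regarding your mask example: if you use str.extract or str.extractall, you don't need to index with a mask, because the resulting DataFrame already carries the original index. Instead, you can use concat to join on the index and combine_first to apply the extracted value only where Begin is null: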

begin2 = df['Subscription Period'].str.extract(r'(\d{2}/\d{2}/\d{2,4})[)]?$').rename({0:'Begin2'}, axis=1)
df = pd.concat([df, begin2], axis=1)
df.Begin = df.Begin.combine_first(df.Begin2)
df = df.drop('Begin2', axis=1)
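
As a possible variation on the same idea, the temporary Begin2 column can be skipped by asking str.extract for a Series (expand=False) and passing it straight to combine_first. This is only a sketch on a three-row sample modelled on the question's data; the variable name trailing_date is made up for illustration.

```python
import numpy as np
import pandas as pd

# Small sample modelled on the question's data
df = pd.DataFrame({
    'Subscription Period': [
        'Subscription Hospital from 01/01/2020 to 31/12/2020',
        'Yearly Subscription Hospital (effective 17/04/2019)',
        'Yearly Subscription Hospital',
    ],
    'Begin': ['01/01/2020', np.nan, np.nan],
})

# With one capture group and expand=False, str.extract returns a Series
# that shares df's index (NaN where the pattern does not match).
trailing_date = df['Subscription Period'].str.extract(
    r'(\d{2}/\d{2}/\d{2,4})[)]?$', expand=False
)

# combine_first keeps existing Begin values and fills only the nulls.
df['Begin'] = df['Begin'].combine_first(trailing_date)
print(df)
```

Note that the first row also ends with a date (31/12/2020), but its Begin is left alone because combine_first only fills missing values.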

Hopefully you can take it from here? Otherwise, you may need to clarify exactly where you're running into trouble.

By the way, those regexes are pretty hairy. I'd suggest defining a custom function instead and using df.apply.
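
A rough sketch of what such a custom function might look like for the three remaining formats in the question. Everything named here (MONTHS, one_year_from, parse_period) is hypothetical, and two assumptions are baked in: "one year later" is taken to mean the same day in the following year, and month names are matched on their first three letters only.

```python
import calendar
import re

import pandas as pd

# Assumption: months are matched on their first three letters. This covers
# English names and some French ones such as "octobre"; several French months
# (e.g. "février", "avril", "août") would need extra entries.
MONTHS = {'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6,
          'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12}


def one_year_from(day, month, year):
    """Return (begin, end) strings, with end on the same day one year later."""
    begin = pd.Timestamp(year=year, month=month, day=day)
    end = begin + pd.DateOffset(years=1)
    return begin.strftime('%d/%m/%Y'), end.strftime('%d/%m/%Y')


def parse_period(text):
    """Return (begin, end) as dd/mm/yyyy strings, or (None, None) if no format matches."""
    # Format: "... Sept-Dec 2018" -> first day of Sept, last day of Dec
    m = re.search(r'([A-Za-z]{3,})-([A-Za-z]{3,})\s+(\d{4})', text)
    if m:
        first = MONTHS.get(m.group(1)[:3].lower())
        last = MONTHS.get(m.group(2)[:3].lower())
        if first and last:
            year = int(m.group(3))
            last_day = calendar.monthrange(year, last)[1]
            return f'01/{first:02d}/{year}', f'{last_day:02d}/{last:02d}/{year}'

    # Format: "(effective 17/04/2019)"
    m = re.search(r'effective\s+(\d{2})/(\d{2})/(\d{4})', text)
    if m:
        day, month, year = map(int, m.groups())
        return one_year_from(day, month, year)

    # Format: "(effective 01 octobre 2018)"
    m = re.search(r'effective\s+(\d{1,2})\s+([^\W\d_]+)\s+(\d{4})', text)
    if m:
        month = MONTHS.get(m.group(2)[:3].lower())
        if month:
            return one_year_from(int(m.group(1)), month, int(m.group(3)))

    return None, None


# Small sample modelled on the question's data
df = pd.DataFrame({
    'Subscription Period': [
        'Subscription Hospital from 01/01/2020 to 31/12/2020',
        'Subscription Hospital Sept-Dec 2018',
        'Yearly Subscription Hospital (effective 17/04/2019)',
        'Yearly Subscription Hospital (effective 01 octobre 2018)',
        'Yearly Subscription Hospital',
    ],
    'Begin': ['01/01/2020', None, None, None, None],
    'End': ['31/12/2020', None, None, None, None],
})

# Apply the parser row by row, then fill only the missing Begin/End values.
parsed = pd.DataFrame(
    df['Subscription Period'].apply(parse_period).tolist(),
    index=df.index, columns=['Begin', 'End'],
)
for col in ['Begin', 'End']:
    df[col] = df[col].combine_first(parsed[col])
print(df)
```

Rows that already have Begin and End keep their values, and rows with no recognizable date stay null. If the subscription should instead end the day before the anniversary (as in the "05/03/2020 to 04/03/2021" rows), one_year_from would need a one-day adjustment.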
