Extract the last specific word/value from a column and move it to the next row

Question

I have a DataFrame like the following:

|Animals        | Type         | Year |
|Penguin AVES   | Omnivore     | 2015 |
|Caiman REP     | Carnivore    | 2018 |
|Komodo.Rep     | Carnivore    | 2019 |
|Blue Jay.aves  | Omnivore     | 2015 |
|Iguana+rep     | Carnivore    | 2020 |

I want to extract the last specific word (e.g. AVES or REP) from each value in the "Animals" column and move it to a new row directly below, while keeping the other values of the row. There are several specific words besides AVES and REP. The data is not very clean: the specific word may be preceded by whitespace, a dot, or a "+" sign. The expected new DataFrame would look like the following:

| Animals        | Type         | Year |
| Penguin AVES   | Omnivore     | 2015 |
| AVES           | Omnivore     | 2015 |
| Caiman REP     | Carnivore    | 2018 |
| REP            | Carnivore    | 2018 |
| Komodo.Rep     | Carnivore    | 2019 |
| Rep            | Carnivore    | 2019 |
| Blue Jay.aves  | Omnivore     | 2015 |
| aves           | Omnivore     | 2015 |
| Iguana+rep     | Carnivore    | 2020 |
| rep            | Carnivore    | 2020 |

I was thinking of using negative indexing to split the string, but I got confused with the lambda function for this particular issue. Any idea how I should approach this problem? Thanks in advance.

Answer 1

You can use `str.extract` to get the last word (regex `(\w+)$`; if needed, you can match a specific list instead with `(?i)(aves|rep)$`), `assign` the result to replace the column, then `concat` the updated DataFrame to the original one and `sort_index` with a stable method to interleave the rows:

out = (pd.concat([df, df.assign(Animals=df['Animals'].str.extract(r'(\w+)$'))])
         .sort_index(kind='stable', ignore_index=True)
      )

Output:

         Animals       Type  Year
0   Penguin AVES   Omnivore  2015
1           AVES   Omnivore  2015
2     Caiman REP  Carnivore  2018
3            REP  Carnivore  2018
4     Komodo.Rep  Carnivore  2019
5            Rep  Carnivore  2019
6  Blue Jay.aves   Omnivore  2015
7           aves   Omnivore  2015
8     Iguana+rep  Carnivore  2020
9            rep  Carnivore  2020
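
The specific-list variant mentioned above can be sketched as follows. This is only an illustration: the `words` list is a hypothetical set of suffixes (the question says there are more than AVES and REP), and the DataFrame is rebuilt from the sample in the question so the snippet runs on its own.

```python
import re
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Animals": ["Penguin AVES", "Caiman REP", "Komodo.Rep",
                "Blue Jay.aves", "Iguana+rep"],
    "Type": ["Omnivore", "Carnivore", "Carnivore", "Omnivore", "Carnivore"],
    "Year": [2015, 2018, 2019, 2015, 2020],
})

# Hypothetical list of known suffixes; extend it with the other specific words
words = ["aves", "rep"]

# Case-insensitive pattern anchored at the end of the string, e.g. (?i)(aves|rep)$
pattern = r"(?i)(" + "|".join(map(re.escape, words)) + r")$"

# expand=False returns a Series instead of a one-column DataFrame
extracted = df["Animals"].str.extract(pattern, expand=False)

# Stack the modified copy under the original, then interleave via a stable sort
out = (
    pd.concat([df, df.assign(Animals=extracted)])
      .sort_index(kind="stable", ignore_index=True)
)
print(out)
```

With an explicit list, values that end in none of the listed words come back as NaN in the extracted rows, so you may want to drop or inspect those rows afterwards.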

Alternative using `stack`:

cols = df.columns.difference(['Animals']).tolist()

out = (df.assign(Word=df['Animals'].str.extract(r'(\w+)$'))
         .set_index(cols).stack().reset_index(cols, name='Animals')
         .reset_index(drop=True)[df.columns]
      )

Alternative with indexing:

Duplicate all rows, then modify the odd rows with the extracted word:

out = df.loc[df.index.repeat(2)].reset_index(drop=True)

out.loc[1::2, 'Animals'] = out.loc[1::2, 'Animals'].str.extract(r'(\w+)$', expand=False)
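
For completeness, the negative-indexing idea from the question can also be made to work. The following is a rough sketch rather than a recommended replacement for `str.extract`: it splits each value on runs of non-word characters with `re.split` inside a lambda and keeps the last piece with `[-1]`. The sample DataFrame is rebuilt from the question, and the name `last_word` is introduced here only for illustration.

```python
import re
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Animals": ["Penguin AVES", "Caiman REP", "Komodo.Rep",
                "Blue Jay.aves", "Iguana+rep"],
    "Type": ["Omnivore", "Carnivore", "Carnivore", "Omnivore", "Carnivore"],
    "Year": [2015, 2018, 2019, 2015, 2020],
})

# Split on any run of non-word characters (space, dot, "+", ...)
# and take the last piece with negative indexing
last_word = df["Animals"].map(lambda s: re.split(r"\W+", s)[-1])

# Duplicate every row, then overwrite the odd rows with the last word
out = df.loc[df.index.repeat(2)].reset_index(drop=True)
out.loc[1::2, "Animals"] = last_word.to_numpy()
print(out)
```

One caveat with this approach: if a value ends with a separator (for example a trailing dot), the last piece is an empty string, whereas the `(\w+)$` regex would give NaN for that row.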
