Remove everything after the second caret with a regex and apply it to a pandas dataframe column

Question

I have a dataframe with a column that looks like this:

0         EIAB^EIAB^6
1           8W^W844^A
2           8W^W844^A
3           8W^W858^A
4           8W^W844^A
             ...     
826136    EIAB^EIAB^6
826137    SICU^6124^A
826138    SICU^6124^A
826139    SICU^6128^A
826140    SICU^6128^A

I just want to keep everything before the second caret, e.g. 8W^W844. What regex would I use in Python? Similarly, PACU^SPAC^06 would become PACU^SPAC. I also want to apply it to the whole column.

I tried r'[\\^].+$' since I thought it would match the last caret and everything after it, but it didn't work.

Answer 1

You can negate the character class to match everything except ^ and put it in a capture group. You don't need to escape the ^ inside the character class, but you do need to escape the one outside.

import re
re.match(r"([^^]+\^[^^]+)", "8W^W844^A").group(1)
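As a quick check, here is a minimal sketch that runs the pattern above over the sample values from the question and, for contrast, shows what the pattern tried in the question actually removes. The names PATTERN, samples, trimmed and attempt are illustrative only.

```python
import re

# Pattern from the answer: a run of non-caret characters, one literal caret,
# then another run of non-caret characters.
PATTERN = re.compile(r"([^^]+\^[^^]+)")

samples = ["EIAB^EIAB^6", "8W^W844^A", "PACU^SPAC^06"]
trimmed = [PATTERN.match(s).group(1) for s in samples]
print(trimmed)  # ['EIAB^EIAB', '8W^W844', 'PACU^SPAC']

# The pattern from the question starts matching at the FIRST caret
# (the class [\\^] matches a backslash or a caret), so removing the
# match drops too much.
attempt = re.sub(r"[\\^].+$", "", "8W^W844^A")
print(attempt)  # 8W
```

The second result suggests why the original attempt "didn't work": the search finds the leftmost caret, and the greedy .+ then consumes the rest of the string.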

This is quite useful in a pandas dataframe. Assuming you want to do this on a single column, you can extract the string you want with:

df['col'].str.extract(r'^([^^]+\^[^^]+)', expand=False)
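To see the column-wide version on a small frame, here is a sketch assuming a column named col as in the line above. The trimmed column name and the NOCARET row are made up for illustration; the latter shows what happens when a value has no caret at all.

```python
import pandas as pd

# Small stand-in for the real column; "NOCARET" is a hypothetical
# value with no caret, added to show the non-matching case.
df = pd.DataFrame({"col": ["EIAB^EIAB^6", "8W^W844^A", "SICU^6124^A", "NOCARET"]})

# With a single capture group and expand=False, str.extract returns a Series.
df["trimmed"] = df["col"].str.extract(r"^([^^]+\^[^^]+)", expand=False)

print(df)
```

Rows that do not match the pattern come back as NaN rather than the original string, so a fillna with the original column may be needed if such values can occur.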

NOTE

Originally, I used replace, but the extract solution suggested in the comments ran in about 1/4 of the time.

import pandas as pd
import numpy as np
from timeit import timeit

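# One million rows, all holding the same sample value
df = pd.DataFrame({"foo":np.arange(1_000_000)})
df["bar"] = "8W^W844^A"

def t1():
    # replace: match the whole string and keep only the captured prefix
    df.bar.str.replace(r"([^^]+\^[^^]+).*", r"\1", regex=True)

def t2():
    # extract: pull out the captured prefix directly
    df.bar.str.extract(r'^([^^]+\^[^^]+)', expand=False)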

print("replace", timeit("t1()", globals=globals(), number=20))
print("extract", timeit("t2()", globals=globals(), number=20))

Output

replace 39.73989862400049
extract 9.910304663004354

Answer 2

I don't think regex is really necessary here; just slice the string up to the position of the second caret:

>>> s = 'PACU^SPAC^06'
>>> s[:s.find("^", s.find("^") + 1)]
'PACU^SPAC'

Explanation: str.find accepts a second argument specifying where to start the search; set it to just after the position of the first caret.
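One caveat with the one-liner: str.find returns -1 when nothing is found, so a string with fewer than two carets would be sliced as s[:-1] and lose its last character. A small sketch of a guarded helper follows; the name before_second_caret is hypothetical, not part of the answer.

```python
def before_second_caret(s: str) -> str:
    """Return the part of s before its second caret.

    Strings with fewer than two carets are returned unchanged.
    """
    first = s.find("^")
    if first == -1:
        return s  # no caret at all
    second = s.find("^", first + 1)  # search starts just after the first caret
    if second == -1:
        return s  # only one caret
    return s[:second]


print(before_second_caret("PACU^SPAC^06"))  # PACU^SPAC
print(before_second_caret("A^B"))           # A^B
```

To apply this to a whole column, the function could be passed to Series.map, for example df['col'].map(before_second_caret), assuming the column holds only strings.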

Answer 1: score 2, accepted, 2023-01-26 04:52:53
Answer 2: score 1, 2023-01-26 04:45:46
