Substring function to extract part of the string

import pandas as pd

# The raw string keeps the backslash in the last file name literal.
data = {'desc': ['ADRIAN PETER - ANN 80020355787C - 11 Baillon Pass.pdf', 'AILEEN MARCUS - ANC 800E15432922 - 5 Mandarin Way.pdf',
               'AJITH SINGH - ANN 80020837750 - 11 Berkeley Loop.pdf', 'ALEX MARTIN-CURTIS - ANC 80021710355 - 26 Dovedale St.pdf',
               r'Alice.Smith\Jodee - Karen - ANE 80020428377 - 58 Harrisdale Dr.pdf']}
df = pd.DataFrame(data, columns = ['desc'])
df

From the data frame, I want to create a new column called ID that contains only the value that follows ANN, ANC or ANE. I am expecting a result like the one below.

ID
80020355787C 
800E15432922 
80020837750 
80021710355 
80020428377 

I tried running the code below, but it did not give the desired result. I would appreciate your help with this.

df['id'] = df['desc'].str.extract(r'\-([^|]+)\-')

You can use the pattern - AN[NCE] (800[0-9A-Z]+) - , where:

  • AN[NCE] matches the literal AN followed by N, C or E;
  • 800[0-9A-Z]+ matches the literal 800 followed by one or more characters in the ranges 0 to 9 or A to Z;
  • the parentheses form the capture group, which is the part that str.extract returns.
>>> df['desc'].str.extract(r'- AN[NCE] (800[0-9A-Z]+) -')
              0
0  80020355787C
1  800E15432922
2   80020837750
3   80021710355
4   80020428377
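
For comparison, here is a rough sketch of what the pattern from the question captures. It uses Python's `re.search` as a stand-in for `str.extract`, which is a reasonable approximation because `str.extract` takes its groups from the first match in each string. The names `attempt`, `samples` and `captured` are only illustrative.

```python
import re

# Pattern from the question: a hyphen, then a greedy run of
# anything except '|', then another hyphen.
attempt = r'\-([^|]+)\-'

samples = [
    'ADRIAN PETER - ANN 80020355787C - 11 Baillon Pass.pdf',
    'ALEX MARTIN-CURTIS - ANC 80021710355 - 26 Dovedale St.pdf',
]

# The group starts after the first hyphen in the string and, being greedy,
# runs up to the last hyphen, so the ANN/ANC prefix and spaces are included.
captured = [re.search(attempt, s).group(1) for s in samples]
print(captured)
# [' ANN 80020355787C ', 'CURTIS - ANC 80021710355 ']
```

The second sample shows the other problem: a hyphen inside a name (MARTIN-CURTIS) becomes the start of the match, which is why anchoring the pattern on AN[NCE] is more robust.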

If not all of your IDs start with "800", you can simply remove it from the pattern.
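
As a minimal sketch of that variant, the snippet below drops the 800 prefix from the pattern and writes the result straight into the ID column. It assumes that every ID consists only of digits and uppercase letters; the `pattern` variable name and the use of `expand=False` are choices made for this example, not part of the original answer.

```python
import pandas as pd

data = {'desc': ['ADRIAN PETER - ANN 80020355787C - 11 Baillon Pass.pdf',
                 'AILEEN MARCUS - ANC 800E15432922 - 5 Mandarin Way.pdf',
                 'AJITH SINGH - ANN 80020837750 - 11 Berkeley Loop.pdf',
                 'ALEX MARTIN-CURTIS - ANC 80021710355 - 26 Dovedale St.pdf',
                 r'Alice.Smith\Jodee - Karen - ANE 80020428377 - 58 Harrisdale Dr.pdf']}
df = pd.DataFrame(data, columns=['desc'])

# Assumption: IDs contain only digits and uppercase letters,
# so the literal 800 prefix is no longer required.
pattern = r'- AN[NCE] ([0-9A-Z]+) -'

# With a single capture group, expand=False returns a Series,
# which can be assigned directly to one column.
df['ID'] = df['desc'].str.extract(pattern, expand=False)
print(df['ID'].tolist())
```

Rows where the pattern does not match would get NaN in the ID column, so it may be worth checking `df['ID'].isna()` on real data.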
