
Extract substring between 2 strings in python

I have a pandas DataFrame with a string column that I want to split into several more columns.

Some rows of the DataFrame look like this:

COLUMN

ORDP//NAME/iwantthispart/REMI/MORE TEXT
/REMI/SOMEMORETEXT
/ORDP//NAME/iwantthispart/ADDR/SOMEADRESS
/BENM//NAME/iwantthispart/REMI/SOMEMORETEXT

So basically I want everything after '/NAME/' and up to the next '/'. However, not every row has the '/NAME/iwantthispart/' field, as can be seen in the second row.

I've tried using split functions, but ended up with the wrong results.

mt['COLUMN'].apply(lambda x: x.split('/NAME/')[-1])

This just gave me everything after the /NAME/ part, and in the cases where there was no /NAME/ it returned the full string.
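The behaviour described above can be reproduced without pandas. The sketch below uses two of the sample rows and a hypothetical helper named `after_name` (not part of the original code) that wraps the same `split` call.

```python
# Two sample rows copied from the question.
rows = [
    "ORDP//NAME/iwantthispart/REMI/MORE TEXT",
    "/REMI/SOMEMORETEXT",
]


def after_name(row):
    """Hypothetical helper wrapping the split call from the question."""
    # When the separator is absent, split() returns a one-element list
    # holding the whole string, so [-1] falls back to the full row.
    return row.split("/NAME/")[-1]


for row in rows:
    print(repr(after_name(row)))
```

The first row keeps everything after '/NAME/', including the later fields, and the second row comes back unchanged because there is nothing to split on.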

Does anyone have some tips or solutions? Help is much appreciated! (The bullets are to make it more readable and are not actually in the data.)

You could use str.extract with a regex to extract the pattern of your choice:

# Generally, to match all word characters:
df.COLUMN.str.extract(r'NAME/(\w+)')

Or:

# More specifically, to match everything up to the next slash:
df.COLUMN.str.extract('NAME/([^/]*)')

Both return:

0    iwantthispart
1              NaN
2    iwantthispart
3    iwantthispart
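For reference, here is a minimal self-contained sketch of the same idea. The frame name `df`, the new column name `NAME`, and the use of `expand=False` are assumptions made for illustration. The sample rows are copied from the question.

```python
import pandas as pd

# Sample rows copied from the question; the frame and column names are assumed.
df = pd.DataFrame({
    "COLUMN": [
        "ORDP//NAME/iwantthispart/REMI/MORE TEXT",
        "/REMI/SOMEMORETEXT",
        "/ORDP//NAME/iwantthispart/ADDR/SOMEADRESS",
        "/BENM//NAME/iwantthispart/REMI/SOMEMORETEXT",
    ]
})

# With a single capture group, expand=False returns a Series rather than
# a one-column DataFrame, so it can be assigned straight to a new column.
# Rows without '/NAME/' get NaN.
df["NAME"] = df["COLUMN"].str.extract(r"/NAME/([^/]*)", expand=False)

print(df["NAME"])
```

Anchoring the pattern on both slashes ('/NAME/') is a slightly stricter choice than 'NAME/'; it avoids matching a field whose name merely ends in NAME.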

These two lines give you the second of two consecutive slash-separated words, regardless of whether the first word is NAME:

mt["column"]=mt["column"].str.extract(r"(\w+/\w+/)")
mt["column"].str.extract(r"(\/\w+)")

This gives the following result as a column in a pandas DataFrame (the second sample row has no slash after its second word, so the first pattern finds no match there):

/iwantthispart
NaN
/iwantthispart
/iwantthispart

And in case you are only interested in the lines that contain NAME, this will work just fine:

mt["column"]=mt["column"].str.extract(r"(NAME/\w+/)")
mt["column"].str.extract(r"(\/\w+)")

This will give the following result:

/iwantthispart
NaN
/iwantthispart
/iwantthispart
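The two-step idea can also be checked on plain strings with the standard `re` module. This is only a sketch: the helper name `name_field` is hypothetical, and it assumes the first pattern is meant to match the literal text NAME followed by one word and a slash.

```python
import re

# Sample rows copied from the question.
rows = [
    "ORDP//NAME/iwantthispart/REMI/MORE TEXT",
    "/REMI/SOMEMORETEXT",
    "/ORDP//NAME/iwantthispart/ADDR/SOMEADRESS",
    "/BENM//NAME/iwantthispart/REMI/SOMEMORETEXT",
]


def name_field(row):
    """Hypothetical helper: find 'NAME/<word>/', then keep '/<word>'."""
    first = re.search(r"NAME/\w+/", row)
    if first is None:
        return None  # pandas reports NaN for rows like this
    second = re.search(r"/\w+", first.group(0))
    return second.group(0) if second else None


for row in rows:
    print(name_field(row))
```

Rows without a NAME field yield None here, which corresponds to NaN in the pandas version.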
