
Extract strings using regex in Python

I have text in a file that I am reading into a string.

txt = "PRIMARY INDEX its_mnth_content_aggr ( AC_ID ,ROW_ADDED_DT ,NOTE_SEQ_NR ,BIZ_UNIT_CD ,\
DISPATCH_ID ,CASE_CREATE_DT ) \
ABDCGFWERRUU \
asdffggb \
PRIMARY INDEX its_mnth_content_aggr ( AC_CASE ,ROW_ADDED_DT ,NOTE_SEQ_NR ,BIZ_UNIT_CD ,\
DISPATCH_ID ,CASE_CREATE_DT )"

I want to extract each complete primary index clause from it, as in PRIMARY INDEX (....).

So far I have the following:

import re

x3 = re.findall(r"\bPRIMARY\sINDEX\s\w+\W.*", txt)

That gives me:

['PRIMARY INDEX its_mnth_content_aggr ( AC_CASE_ID ,ROW_ADDED_DT ,NOTE_SEQ_NR ,BIZ_UNIT_CD ,DISPATCH_ID ,CASE_CREATE_DT )  ABDCGFWERRUU  qwerrtyyuiu PRIMARY INDEX its_mnth_content_aggr ( AC_CASE_ID ,ROW_ADDED_DT ,NOTE_SEQ_NR ,BIZ_UNIT_CD ,DISPATCH_ID ,CASE_CREATE_DT )']

I want something like this:

['PRIMARY INDEX its_mnth_content_aggr ( AC_CASE_ID ,ROW_ADDED_DT ,NOTE_SEQ_NR ,BIZ_UNIT_CD ,DISPATCH_ID ,CASE_CREATE_DT ) PRIMARY INDEX its_mnth_content_aggr ( AC_CASE_ID ,ROW_ADDED_DT ,NOTE_SEQ_NR ,BIZ_UNIT_CD ,DISPATCH_ID ,CASE_CREATE_DT )'] 

Can someone please help?

Your regex says that you want a string that starts with PRIMARY INDEX followed by any characters. The trailing .* is greedy and consumes everything up to the end of the line, so the match covers your whole string.
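To illustrate the over-matching, here is a minimal sketch on a shortened, made-up input (the table names t1 and t2 and the columns are placeholders, not the data from the question):

```python
import re

# Hypothetical shortened input, all on one line like the string in the question.
txt = "PRIMARY INDEX t1 ( A ,B ) junk PRIMARY INDEX t2 ( C ,D )"

# The pattern from the question: the final ".*" is greedy and runs to the
# end of the line, so the first match swallows the second clause as well.
greedy = re.findall(r"\bPRIMARY\sINDEX\s\w+\W.*", txt)

print(len(greedy))       # number of matches found
print(greedy[0] == txt)  # whether the single match is the whole string
```

Since the first match already covers the entire line, there is nothing left for a second match to find.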


You have to be more specific:

PRIMARY INDEX[A-Za-z(_,\n\\ ]*\)
  • the string should start with PRIMARY INDEX
  • then there can be any of the letters or special characters listed in [A-Za-z(_,\n\\ ], followed by * because we don't know how many of these characters there are
  • and it ends with a )

You can try it here.
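As a rough sketch, the character-class pattern above can be run against the question's text like this. The string is rebuilt here on a single line with implicit concatenation, and the variable names are only illustrative:

```python
import re

# The question's text, joined into one line via implicit string concatenation.
txt = ("PRIMARY INDEX its_mnth_content_aggr ( AC_ID ,ROW_ADDED_DT ,NOTE_SEQ_NR ,BIZ_UNIT_CD ,"
       "DISPATCH_ID ,CASE_CREATE_DT ) ABDCGFWERRUU asdffggb "
       "PRIMARY INDEX its_mnth_content_aggr ( AC_CASE ,ROW_ADDED_DT ,NOTE_SEQ_NR ,BIZ_UNIT_CD ,"
       "DISPATCH_ID ,CASE_CREATE_DT )")

# ")" is not in the character class, so each match stops at the first ")".
pattern = r"PRIMARY INDEX[A-Za-z(_,\n\\ ]*\)"
matches = re.findall(pattern, txt)

for m in matches:
    print(m)

# Caveat: digits are not in the class, so a name containing a digit
# (hypothetical example below) is not matched by this pattern.
with_digit = re.findall(pattern, "PRIMARY INDEX t1 ( A )")
print(with_digit)
```

The class only lists the characters that occur in this particular input, so it would need extending (for example with 0-9) if table or column names can contain other characters.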

You can use:

re.findall(r'\bPRIMARY\s+INDEX\s+\w+\s*\([^()]*\)', txt)

See the regex demo.

Details:

  • \b - a word boundary
  • PRIMARY\s+INDEX - PRIMARY, 1+ whitespace characters, INDEX
  • \s+ - 1+ whitespace characters
  • \w+ - 1+ word characters
  • \s* - 0+ whitespace characters
  • \( - a ( character
  • [^()]* - 0+ characters other than ( and )
  • \) - a ) character


 