Regex to match a string with mixed capital letters and \n

I want to write a regex that matches a string only if it starts with a \n, continues with at least one capital letter, and ends with a \n. The string may contain repetitions of this pattern, for example:

\n[A-Z]\n[A-Z]\n.

I've tried this regular expression: \n(([A-Z]+\n)+), on this input:

>200LA 012F5421F2E8A172 164 XRAY  1.950  0.176 NA no Endolysin <ENLYS_BPT4(1-164)> [Enterobacteria phage T4] ||1C63A 1C64A 1C65A
MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILR
NAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMAQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDA
YKNL

I expected to get this result: ('MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMAQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL')

but instead, I got this one: ('MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILR\nNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMAQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDA\nYKNL\n', 'YKNL\n')

Does anybody know what went wrong?

Thanks!

Your regex matches the first section of the text that satisfies your condition, in its longest possible form.

The first line of your text doesn't start with a \n, so the search moves on to the line right after the first \n. Since that line matches the condition, the groups specified by your regex are returned as the result: the outer group holds the whole run of lines with its newlines, and the inner repeated group holds only its last repetition, 'YKNL\n'.
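To see how the two capturing groups in the original pattern produce that tuple, here is a minimal sketch on a shortened input. The string `text` is a made-up stand-in for the real data, not taken from the question.

```python
import re

# Made-up, shortened stand-in for the input in the question:
# one header line followed by three all-uppercase lines.
text = ">header 123\nABC\nDEF\nGH\n"

# The pattern from the question, with its two capturing groups.
pattern = r"\n(([A-Z]+\n)+)"

m = re.search(pattern, text)
whole = m.group(0)  # entire match, including the leading newline
outer = m.group(1)  # the full run of uppercase lines, newlines kept
inner = m.group(2)  # a repeated group keeps only its last repetition

# When a pattern contains groups, re.findall returns the groups
# (as a tuple per match) rather than the whole match.
found = re.findall(pattern, text)

print(repr(whole))  # '\nABC\nDEF\nGH\n'
print(repr(outer))  # 'ABC\nDEF\nGH\n'
print(repr(inner))  # 'GH\n'
print(found)        # [('ABC\nDEF\nGH\n', 'GH\n')]
```

The newlines stay inside the outer group because they are part of what the pattern matched; a regex match cannot skip characters in the middle of its span.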

For your result, I would suggest matching with \n(?:[A-Z]+\n)+ and then replacing the newlines with empty strings. Here (?:...) is a non-capturing group: a repeated capturing group would keep only its last repetition, and with no capturing groups in the pattern, re.findall returns the whole match instead:

>>> import re
>>> a = """>200LA 012F5421F2E8A172 164 XRAY  1.950  0.176 NA no Endolysin <ENLYS_BPT4(1-164)> [Enterobacteria phage T4] ||1C63A 1C64A 1C65A
... MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILR
... NAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMAQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDA
... YKNL
... """
>>> m = re.findall('\n(?:[A-Z]+\n)+', a)
>>> m[0].replace('\n', '')
'MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMAQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL'
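
If the text holds several such records, the same idea can be wrapped in a small helper. This is only a sketch: the function name `extract_sequences` and the two-record `sample` string are hypothetical, chosen for illustration.

```python
import re

def extract_sequences(text):
    """Return each run of all-uppercase lines joined into one string.

    Hypothetical helper: it applies the non-capturing pattern and then
    strips the newlines from every match.
    """
    blocks = re.findall(r"\n(?:[A-Z]+\n)+", text)
    return [block.replace("\n", "") for block in blocks]

# Made-up sample with two records, each a header line followed by
# one or more all-uppercase lines.
sample = ">rec1 some header\nABCD\nEF\n>rec2 another header\nGHI\n"

sequences = extract_sequences(sample)
print(sequences)  # ['ABCDEF', 'GHI']
```

Note that each match consumes its trailing newline, so this relies on every run of uppercase lines being preceded by a line (such as a header) that ends with its own newline.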
