
Extract a variable-length number using regex in Python

I have a file in very bad shape, but I am able to parse it and extract most of the required values except one. I need your help with a regex to extract a variable-length number.

To parse and extract the other features I have used list indexes along with different splitters: '|', ' ' and ':'. In this case I am able to reach the block below, and for each row I have to extract the digits on either side of the '-' separately as x and y.

One way could be to split first by ':', then by ' ' and finally by '-', and take index positions [0] and [1], but that seems like the most inefficient way to do it.

    chr5:17399789-17401949 REVERSE
    chr5:6414488-6415907 FORWARD
    chr5:2981156-2982709 FORWARD
    chr5:6311725-6313323 REVERSE
    chr5:12791432-12794551 REVERSE
    chr5:927915-930781 FORWARD
    chr5:19585936-19587841 FORWARD
    chr5:26894856-26896488 FORWARD
    chr5:18138775-18142147 REVERSE
    chr5:20537525-20538943 REVERSE
    chr5:22496196-22500543 REVERSE
    chr5:4747860-4753592 REVERSE

The block above comes from a 'bigger block' like this:

    AT1G09410.1 | Symbols: | pentatricopeptide (PPR) repeat-containing protein | chr1:3035443-3037560 FORWARD

Can I extract the numbers directly from the 'bigger block' as well?

My programming level is best described as beginner, and I need your help.

Thanks

AK

One approach would be to define your regular expression as the following Python raw string, using named groups for the two numbers:

    numericalBlockRegEx = r'chr\d+:(?P<firstNumBlock>\d+)-(?P<secondNumBlock>\d+)'

Then, once you run the regular expression over each line of the file (you'll likely need re.search rather than re.match, because match only matches at the start of the string and in the 'bigger block' the coordinates come at the end of the line), you can extract the two number blocks from the resulting match object with:

    x = match.group('firstNumBlock') #Gets first number block matched
    y = match.group('secondNumBlock') #Gets second number block matched
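
Putting the pieces together, a minimal sketch might look like the following. It reuses the pattern above; the helper name extract_coordinates, the sample list lines, and the conversion of the matched text to int are assumptions made for illustration, and in practice the lines would come from your file.

    import re

    # Same pattern as above, compiled once so it can be reused for every line.
    numerical_block_re = re.compile(
        r'chr\d+:(?P<firstNumBlock>\d+)-(?P<secondNumBlock>\d+)'
    )

    # Sample lines copied from the question (hypothetical stand-in for the file).
    lines = [
        "chr5:17399789-17401949 REVERSE",
        "chr5:927915-930781 FORWARD",
        "AT1G09410.1 | Symbols: | pentatricopeptide (PPR) repeat-containing protein"
        " | chr1:3035443-3037560 FORWARD",
    ]

    def extract_coordinates(line):
        """Return (x, y) as integers, or None if the line has no chrN:x-y block."""
        # search() scans the whole line, so it also works on the 'bigger block'.
        match = numerical_block_re.search(line)
        if match is None:
            return None
        x = int(match.group('firstNumBlock'))
        y = int(match.group('secondNumBlock'))
        return x, y

    results = [extract_coordinates(line) for line in lines]
    for coords in results:
        print(coords)

Because \d+ matches one or more digits, the varying length of the numbers needs no special handling, and checking for None guards against lines that contain no coordinates at all.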

Cheers!
