Extract a variable-length number using regex in Python
I have a file in very bad shape, but I am able to parse it and extract most of the required values, except one. I need your help with a regex to extract a variable-length number.
To parse and extract the other features I have used list indexes along with different splitters: '|', ' ' and ':'. In this case, though, I can get as far as the block below, and for each row I have to extract the digits on either side of the '-' separately as x and y.
One way could be to split first by ':', then by ' ' and finally by '-', and take index positions [0] and [1], but that would be the most inefficient way to do it.
chr5:17399789-17401949 REVERSE
chr5:6414488-6415907 FORWARD
chr5:2981156-2982709 FORWARD
chr5:6311725-6313323 REVERSE
chr5:12791432-12794551 REVERSE
chr5:927915-930781 FORWARD
chr5:19585936-19587841 FORWARD
chr5:26894856-26896488 FORWARD
chr5:18138775-18142147 REVERSE
chr5:20537525-20538943 REVERSE
chr5:22496196-22500543 REVERSE
chr5:4747860-4753592 REVERSE
The above block comes from a 'bigger block' like this:
AT1G09410.1 | Symbols: | pentatricopeptide (PPR) repeat-containing protein | chr1:3035443-3037560 FORWARD
Can I extract from the 'bigger block' directly as well?
My programming level is best described as beginner, and I need your help.
Thanks
AK
One approach would be to define your regular expression as the following Python "raw" string:
numericalBlockRegEx = r'chr\d+:(?P<firstNumBlock>\d+)-(?P<secondNumBlock>\d+)'
Finally, once you actually run your regular expression over each line of the file (you'll likely need to call re.search rather than re.match, since re.match only matches at the very start of the string) and store the result in match, you can extract the numerical blocks you're interested in with two simple calls:
x = match.group('firstNumBlock') #Gets first number block matched
y = match.group('secondNumBlock') #Gets second number block matched
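Putting the pieces together, here is a minimal sketch of how the pattern might be applied line by line. The helper name extract_coordinates and the variable names are only illustrative, and converting the captured text to int is an assumption about how x and y will be used. Because re.search scans the whole string, the same call should also work on a line from the 'bigger block', so no splitting is needed first.

```python
import re

# Same pattern as above, compiled once so it can be reused for every line.
numerical_block_regex = re.compile(
    r'chr\d+:(?P<firstNumBlock>\d+)-(?P<secondNumBlock>\d+)'
)


def extract_coordinates(line):
    """Return (x, y) as integers, or None if the line has no chrN:x-y block.

    Hypothetical helper for illustration; the int conversion is an assumption.
    """
    match = numerical_block_regex.search(line)
    if match is None:
        return None
    x = int(match.group('firstNumBlock'))
    y = int(match.group('secondNumBlock'))
    return x, y


# Two short rows from the question plus one 'bigger block' style line.
sample_lines = [
    'chr5:17399789-17401949 REVERSE',
    'chr5:927915-930781 FORWARD',
    'AT1G09410.1 | Symbols: | pentatricopeptide (PPR) repeat-containing '
    'protein | chr1:3035443-3037560 FORWARD',
]

results = [extract_coordinates(line) for line in sample_lines]
for coords in results:
    print(coords)
```

For a real file you would loop over the open file object instead of sample_lines and skip any line for which the helper returns None.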
Cheers!