
Extracting a string from a text file in Python 2.7.5

Hello, I am new to Python, and I hope you can help me. I have a text file (call it data.txt) with data on gene numbers, the corresponding rs numbers, and some distance measure. The data looks something like this:

   rs1982171     55349     40802

   rs6088650     55902     38550

   rs1655902     3105      12220

   rs1013677     55902      0

where the first column is the rs number, the second column is the gene number, and the third column is some distance measure. The real data is much bigger, but hopefully the above gives you an idea of the dataset. What I want to do is find all the rs numbers that correspond to a certain gene. For example, for the data set above, gene 55902 = {rs6088650, rs1013677}. Ideally, I want my code to find all rs numbers corresponding to a given gene. Since I am unable to do that now, I instead wrote a short piece of code that prints the lines in data.txt that contain the string "55902":

  import re
  data=open("data.txt","r")
  for line in data:
      line=line.rstrip()
      if re.search("55902",line):
          print line

The problem with this code is that the output is something like this:

    rs6088650    55902     38550

    rs1655902    3105      12220

    rs1013677    55902     0

I want my code to ignore the string "55902" when it appears inside the rs number. In other words, I don't want my code to output the second line in the above output, because its gene number is not 55902. I would like my output to be:

       rs6088650     55902   38550

       rs1013677     55902   0

How can I modify the above code to achieve what I want? Any help would be appreciated. Thanks in advance.

You can use a word boundary (\b) to match the number only as a whole word:

>>> import re
>>> re.search(r"\b55902\b", "rs1655902     3105      12220")
>>> re.search(r"\b55902\b", "rs6088650     55902     38550")
<_sre.SRE_Match object at 0x7f82594566b0>

if re.search(r"\b55902\b", line):
    ....
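Putting the word-boundary pattern into the loop from the question might look like the sketch below. The helper name `lines_for_gene` and the in-memory `sample` list are assumptions made for illustration; with a real file you would pass the open file object instead of the list. The code is written to run on both Python 2.7 and Python 3.

```python
import re

def lines_for_gene(lines, gene):
    """Return the lines in which `gene` appears as a whole word.

    `lines_for_gene` is a hypothetical helper, not part of any library.
    """
    # re.escape guards against regex metacharacters in the gene string.
    pattern = re.compile(r"\b" + re.escape(gene) + r"\b")
    return [line.rstrip() for line in lines if pattern.search(line)]

# Sample rows copied from the question, standing in for data.txt.
sample = [
    "rs1982171     55349     40802\n",
    "rs6088650     55902     38550\n",
    "rs1655902     3105      12220\n",
    "rs1013677     55902      0\n",
]

matches = lines_for_gene(sample, "55902")
for line in matches:
    print(line)
```

Note that the pattern is applied to the whole line, so a row whose distance column happened to be exactly 55902 would also match. If that matters, the match has to be restricted to the second column.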

You can do this easily with a more specific regular expression. One possible quick solution is to use a regex of the form:

r'\b55902\b'

The \b markers are word boundaries.

There's no need for regular expressions here, as all you're looking for is a simple static sequence. This line:

if re.search("55902",line):

could be expressed as:

if "55902" in line:

And if you only want to check the second column, split the line first:

if '55902' in line.split()[1]:

Since you're now already checking the correct column, check for equality rather than membership:

if line.split()[1] == '55902':
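The same column split can also serve the broader goal stated in the question, collecting every rs number for each gene. The following is a minimal sketch, assuming whitespace-separated columns as in the sample data; the name `rs_by_gene` and the `sample` list are hypothetical, and the code is written to run on both Python 2.7 and Python 3.

```python
from collections import defaultdict

def rs_by_gene(lines):
    """Map each gene number (column 2) to the rs numbers (column 1) listed for it.

    `rs_by_gene` is a hypothetical helper name used for illustration.
    """
    genes = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            # Blank or malformed line: fields[1] would raise IndexError.
            continue
        rs_number, gene = fields[0], fields[1]
        genes[gene].append(rs_number)
    return genes

# Sample rows copied from the question, with a blank line thrown in.
sample = [
    "rs1982171     55349     40802\n",
    "\n",
    "rs6088650     55902     38550\n",
    "rs1655902     3105      12220\n",
    "rs1013677     55902      0\n",
]

genes = rs_by_gene(sample)
print(genes["55902"])  # ['rs6088650', 'rs1013677']
```

With the real file, something like `with open("data.txt") as data: genes = rs_by_gene(data)` should work, since a file object yields its lines one at a time. Comparing the gene as a string avoids any conversion step; the distance column is simply ignored here.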

If you want to use a regex, then you can use match or search along with the word boundary \b, as follows:

import re

x = "   rs1982171     55349     40802".strip()

if (re.match(r"\b55349\b", x.split()[1])):
    print x

