Regex searching using list elements to find matches in large document

This is my first script, and I am trying to compare two genome files, one of which has more data points than the other.

The content of the files looks like this:

rs3094315       1       742429  AA
rs12562034      1       758311  GG
rs3934834       1       995669  CC

There are tabs between the fields. There are about 500,000 lines in each file.

In order to compare them easily, I wanted to keep only the data points that both the files contained, and discard any data points unique to either of them. To do this, I have created a list of all the DNA positions that are unique and now I am trying to search through each line of the original datafile and print all lines NOT containing these unique DNA positions to a new file.

Everything in my code has worked up until I try to search through the genome file using regex to print all non-unique DNA positions. I can get the script to print all items in the LaurelSNP_left list inside the for loop, but when I try to use re.match for each item, I get this error message:

Traceback (most recent call last):
  File "/Users/laurelhochstetler/scripts/identify_SNPs.py", line 57, in <module>
    if re.match(item,"(.*)", Line):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 137, in match
    return _compile(pattern, flags).match(string)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 242, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sre_compile.py", line 500, in compile
    p = sre_parse.parse(p, flags)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sre_parse.py", line 673, in parse
    p = _parse_sub(source, pattern, 0)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sre_parse.py", line 308, in _parse_sub
    itemsappend(_parse(source, state))
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sre_parse.py", line 401, in _parse
    if state.flags & SRE_FLAG_VERBOSE:
TypeError: unsupported operand type(s) for &: 'str' and 'int'

My question is two-fold:

  1. How can I use my list in a regular expression?
  2. Is there a better way to accomplish what I am trying to do here?

Here's my code:

#!/usr/bin/env python
import re #this imports regular expression module
import collections

MomGenome=open('/Users/laurelhochstetler/Documents/genetics fun/genome_Mary_Maloney_Full_20110514145353.txt', 'r')
LaurelGenome=open('/Users/laurelhochstetler/Documents/genetics fun/genome_Laurel_Hochstetler_Full_20100411230740.txt', 'r')
LineNumber = 0 
momSNP = []
LaurelSNP = []
f = open("mom_edit.txt","w")
for Line in MomGenome:
    if LineNumber > 0:
        Line=Line.strip('\n')
        ElementList=Line.split('\t')

        momSNP.append(ElementList[0])

        LineNumber = LineNumber + 1
MomGenome.close()
for Line in LaurelGenome:
    if LineNumber > 0:
        Line=Line.strip('\n')
        ElementList=Line.split('\t')

        LaurelSNP.append(ElementList[0])

        LineNumber = LineNumber + 1
momSNP_multiset = collections.Counter(momSNP)            
LaurelSNP_multiset = collections.Counter(LaurelSNP)
overlap = list((momSNP_multiset and LaurelSNP_multiset).elements())
momSNP_left = list((momSNP_multiset - LaurelSNP_multiset).elements())
LaurelSNP_left = list((LaurelSNP_multiset - momSNP_multiset).elements())
LaurelGenome=open('/Users/laurelhochstetler/Documents/genetics fun/genome_Laurel_Hochstetler_Full_20100411230740.txt', 'r')
i = 0
for Line in LaurelGenome:
    for item in LaurelSNP_left:
            if i < 1961:
                if re.match(item, Line):
                    pass

                else:
                    print Line

            i = i + 1
    LineNumber = LineNumber + 1

Short answer: I don't think you need regexp for what you're trying to do.

Long answer: Let's analyse your code.

At the beginning there's:

LineNumber = 0 
MomGenome = open('20110514145353.txt', 'r')
for Line in MomGenome:
    if LineNumber > 0:
        Line = Line.strip('\n')
        ElementList = Line.split('\t')

        momSNP.append(ElementList[0])

        LineNumber = LineNumber + 1

MomGenome.close()

This could be improved with the with statement. I guess your line counting is only there to skip some kind of header, so I'll use next() for that:

mom_SNP = []
with open('20110514145353.txt') as mom_genome:
    next(mom_genome)    # skipping the first line
    for line in mom_genome:
        elements = line.strip().split('\t')
        mom_SNP.append(elements[0])    # keep only the SNP ID (first column)

As you may have noticed, I also avoided CamelCase names for variables, which follows the common Python style guides. I also changed .strip('\n') to .strip(); check the official str.strip() documentation to see if that still does what you want.
The above can be done on your other file too.
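To make the difference between the two strip calls concrete, here is a small sketch. The sample line is made up (it has a trailing tab and space before the newline); it is only meant to show what each call removes.

```python
# Hypothetical sample line: note the trailing tab and space before the newline.
line = "rs3094315\t1\t742429\tAA\t \n"

only_newline = line.strip('\n')   # removes just leading/trailing newline characters
all_whitespace = line.strip()     # removes any leading/trailing whitespace

print(repr(only_newline))
print(repr(all_whitespace))
```

If a file could have an empty last column, .strip() would swallow the trailing tab and change the number of fields that split('\t') returns, so .strip('\n') is the safer choice in that case.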

After you read the files there's this line:

overlap = list((momSNP_multiset and LaurelSNP_multiset).elements())

Are you sure this does what you want?
Shouldn't that and be an & , like:

overlap = list((momSNP_multiset & LaurelSNP_multiset).elements())

Let's look at this example:

>>> from collections import Counter
>>> a = Counter(a=4, b=2, c=0, d=-2)
>>> b = Counter(a=2, b=0, c=0)
>>> a
Counter({'a': 4, 'b': 2, 'c': 0, 'd': -2})
>>> b
Counter({'a': 2, 'c': 0, 'b': 0})
>>> a and b    # This will return b
Counter({'a': 2, 'c': 0, 'b': 0})
>>> a & b    # this will return the common elements
Counter({'a': 2})

a and b returns b because bool(a) evaluates to True; take a look at the official documentation.

After that comes the match, which is really not clear. You do:

LaurelGenome = open('20100411230740.txt', 'r')
i = 0
for Line in LaurelGenome:
    for item in LaurelSNP_left:
        if i < 1961:
            if re.match(item, Line):
                pass
            else:
               print Line

        i = i + 1
    LineNumber = LineNumber + 1

So as I was saying at the beginning I think you don't need regexp at all.
And I think you're trying to do something like:

with open('20100411230740.txt') as laurel_genome:
    for line in laurel_genome:
        i = 0
        for item in LaurelSNP_left:    # the list built in your script
            if i > 1960:
                break

            # compare the SNP ID in the first column with the list item
            if line.strip().split('\t')[0] == item:
                print line

            i += 1

I've guessed a lot during this answer, so feel free to give more information and tell me where I guessed wrong :)

The third argument of re.match is for flags (see the manual). You're calling it with something bogus (the line itself, a string), which is why the traceback ends with a TypeError on state.flags.
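If you did want to keep a regex, a hedged sketch of the call could look like this. The signature is re.match(pattern, string, flags=0), so the line belongs in second position. The helper name starts_with_snp and the sample line are made up for illustration; re.escape is there in case an ID ever contains a regex metacharacter.

```python
import re

def starts_with_snp(snp_id, line):
    """Return True if `line` begins with `snp_id` followed by a tab."""
    # Pattern first, string second; the optional third argument is flags.
    pattern = re.escape(snp_id) + r"\t"
    return re.match(pattern, line) is not None

line = "rs3094315\t1\t742429\tAA\n"   # made-up sample line
print(starts_with_snp("rs3094315", line))
print(starts_with_snp("rs30943", line))   # a bare prefix is not accepted
```

Requiring the tab after the ID keeps rs30943 from matching a line that starts with rs3094315.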

Am I missing something critical about the problem? I feel this solution is alarmingly less involved than the existing ones. For two different genome files, I have:

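# Every non-blank line of the first file, stripped of surrounding whitespace
with open('zzz.txt') as f1:
    first = frozenset([i.strip() for i in f1 if i.strip()])

# Lines of the second file that also appear in the first, split into fields
with open('yyy.txt') as f2:
    common = [i.strip().split('\t') for i in f2 if i.strip() in first]

# Map each ID (first field) to the remaining fields
genomes = {}
for i in common:
    genomes[i[0]] = i[1:]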

This should find all duplicates (entries that are common to both files) while requiring no more space than the size of the first file read. Thus, you can check beforehand which file is smaller (perhaps by file size) and read that one first to minimize the memory impact.

Regex does not seem to be necessary here--and if not this solution, frozensets have intersection as well, if you prefer not using list comprehensions.
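A minimal sketch of the intersection variant, assuming the ID is the first tab-separated column. The in-memory lists stand in for the two files, and read_ids is a hypothetical helper name.

```python
def read_ids(lines):
    """Collect the first tab-separated field of every non-blank line."""
    return frozenset(line.split('\t')[0] for line in lines if line.strip())

# Made-up sample data standing in for the two genome files.
file_a = ["rs3094315\t1\t742429\tAA\n", "rs12562034\t1\t758311\tGG\n"]
file_b = ["rs12562034\t1\t758311\tAG\n", "rs3934834\t1\t995669\tCC\n"]

shared_ids = read_ids(file_a) & read_ids(file_b)   # same as .intersection()
print(sorted(shared_ids))
```

Unlike comparing whole lines, this compares IDs only, so a SNP counts as shared even when the genotype column differs between the two files.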

EDIT : Updated to have each iteration in a Python dict.

You want to print every line from file 2 whose ID also occurs in file 1. Make a set of the IDs in file 1, and use it as you loop through file 2:

momSNP = set()
for line in MomGenome:
    snp, rest = line.split(None, 1) # Split into two pieces only
    momSNP.add(snp)

for line in MyGenome:
    snp, rest = line.split(None, 1)
    if snp in momSNP:
        print line

This only needs to store the 500k SNPs, so it shouldn't be too much of a problem memory-wise.
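Put together, the same idea might look like the sketch below. It assumes each file starts with one header line (a guess, as in the first answer) and that the ID is the first whitespace-separated field; snp_ids and keep_common are hypothetical helper names, and the sample lists stand in for open file objects.

```python
def snp_ids(lines, skip_header=True):
    """Return the set of IDs found in the first column of `lines`."""
    lines = iter(lines)
    if skip_header:
        next(lines, None)      # assumption: the first line is a header
    return set(line.split(None, 1)[0] for line in lines if line.strip())


def keep_common(lines, wanted_ids, skip_header=True):
    """Yield only the lines whose ID is in `wanted_ids`."""
    lines = iter(lines)
    if skip_header:
        next(lines, None)
    for line in lines:
        fields = line.split(None, 1)
        if fields and fields[0] in wanted_ids:
            yield line


# Made-up sample data; with real files, pass the open file objects instead.
mom = ["# header\n", "rs1\t1\t100\tAA\n", "rs2\t1\t200\tGG\n"]
mine = ["# header\n", "rs2\t1\t200\tAG\n", "rs3\t1\t300\tCC\n"]
common = list(keep_common(mine, snp_ids(mom)))
print(common)
```

With real files, the generator can be passed straight to the writelines() method of an output file, so only the set of IDs is held in memory.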
