Indexing in python starting just after a specific string

I have a tab separated file with values as follows:

12  6814296 2   192 C:0.911458  T:0.0885417
12  6814328 2   192 C:1 T:0
12  6814345 2   192 C:1 T:0
12  6814360 2   192 C:1 T:0
12  6814381 2   192 G:1 A:0
12  6814396 2   192 C:1 A:0
12  6814397 2   192 G:0.989583  A:0.0104167
12  6814464 2   192 T:1 C:0
12  6814468 2   192 C:0.927083  TCCC:0.0729167
12  6814486 2   192 C:1 T:0
12  6814551 2   192 G:1 C:0
12  6814567 2   192 A:1 G:0
12  6814589 2   192 C:0.989583  T:0.0104167
12  6814619 2   192 G:1 A:0
12  6814663 2   192 A:1 G:0
12  6814732 2   192 C:1 T:0
12  6814752 4   192 CTTT:0.979167   CTTTTT:0    CT:0.015625 C:0.00520833
12  6814786 2   192 C:1 <CN0>:0
12  6814798 2   192 C:0.984375  T:0.015625
12  6814828 2   192 C:0.989583  G:0.0104167
12  6814951 2   192 G:1 C:0

From this file, I have to create a csv file with 3 comma-separated values in each row.

Below is my code:

file1 = open('/home/aahm/Documents/gene1.frq', 'r')
input_data = file1.readlines()
for line in input_data:
    rm_newline = line.strip('\n')
    comma_separated = rm_newline.split('\t')
    a = comma_separated[0]
    b = comma_separated[1]
    c = comma_separated[-1]
    d = c[2:]
    if comma_separated [2] == '2':
        e = a + ','+ b +',' + d
        print (e)
    elif comma_separated [2] == '3':
        f = comma_separated[-1]
        g = f[2:]
        h = comma_separated[-2]
        i = h[2:]
        if g > i:
            j = a + ','+ b +',' + g
            print (j)
        else:
            k = a + ','+ b +',' + i
            print (k)
    elif comma_separated [2] == '4':
        l = comma_separated[-1]
        m = l[2:]
        n = comma_separated[-2]
        o = n[2:]
        p = comma_separated[-3]
        q = p[2:]
        if m > o and m > p:
            r = a + ','+ b +',' + m
            print (r)
            
        elif o > m and o > p:
            s = a + ','+ b +',' + o
            print (s)
            
        elif p > m and p > o:
            t =  a + ','+ b +',' + p
            print (t)

The code works well except that for indexing I have used these:

d = c[2:]
g = f[2:]
i = h[2:] 

etc.

For columns 6, 7 and 8 in the input file, I need only the numbers as output. However, for some rows my indexing leaves letters in front of the number, because the string preceding ':' is longer than one character. An example is given below.

In one row, the value in the last column is TCCC:0.0729167. When 'd = c[2:]' is used for indexing, I get CC:0.0729167 as output, whereas I need only 0.0729167.

I am stuck with this and do not have any hint at all about how to proceed. I would be very grateful for any help. Thanks!

You are slicing the string from its third character (inclusive) to the end, which gives you 'CC:0.0729167' in your example. As others said in the comments, you could instead use yourstring.split(":")[1]: this splits the string at the colon, and the index [1] retrieves the part after it.
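To make the difference concrete, here is a minimal sketch comparing the fixed-width slice with the colon split on the value from the question. The variable names are only illustrative.

```python
field = "TCCC:0.0729167"

# Fixed-width slice: always drops exactly the first two characters,
# which is only right when the allele is a single letter.
sliced = field[2:]

# Split at the colon and keep the part after it, whatever the allele length.
after_colon = field.split(":")[1]

print(sliced)       # CC:0.0729167
print(after_colon)  # 0.0729167
```

For a single-letter allele such as "T:0" both expressions give "0", which is why the slice appeared to work on most rows.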

As per the comments others have made, wherever a ":" remains in the column data you need to split it out. However, the code you have here is already rather opaque: all the alphabet-letter variables make it quite difficult to see what a simple piece of code is actually trying to do. To avoid making it worse, in the example below I've defined a simple function getnum, which you feed a field and it does the split for you if needed. Of course, this won't work if the field has more than one ":" character, but it would be easy enough to modify getnum. I've then altered your code to run every field through this getnum function.

To make life easier for yourself, I would encourage you to use more meaningful variable names than a, b, c and so on. Also, a little explanatory comment here and there would go a long way - I think with these in place you would probably have been able to crack the problem yourself!

file1 = open('/home/aahm/Documents/gene1.frq', 'r')
input_data = file1.readlines()
file1.close()

# process a field to only use the number after a ":"
def getnum(src):
    if ":" in src:
        return src.split(":")[1]
    else:
        return src

for line in input_data:
    rm_newline = line.strip('\n')
    comma_separated = rm_newline.split('\t')
    a = getnum(comma_separated[0])
    b = getnum(comma_separated[1])
    # getnum already drops everything up to the ":", so no [2:] slice is needed
    d = getnum(comma_separated[-1])
    if comma_separated[2] == '2':
        print(a + ',' + b + ',' + d)
    elif comma_separated[2] == '3':
        g = getnum(comma_separated[-1])
        i = getnum(comma_separated[-2])
        # compare as numbers; comparing the strings would be alphabetical
        if float(g) > float(i):
            print(a + ',' + b + ',' + g)
        else:
            print(a + ',' + b + ',' + i)
    elif comma_separated[2] == '4':
        m = getnum(comma_separated[-1])
        o = getnum(comma_separated[-2])
        q = getnum(comma_separated[-3])
        # print the largest of the three values, compared as numbers
        print(a + ',' + b + ',' + max(m, o, q, key=float))
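
As an illustration of the advice about meaningful names, the same logic might be sketched as below. This is not the asker's code: the function names frequency and summarise are hypothetical, the rule of "take the largest value among the last (allele count - 1) columns" is only my reading of the original if/elif branches, and rpartition is swapped in for split so that a field with more than one ":" still yields the text after the last colon. The sample lines are copied from the question and fed from a list rather than from the real file.

```python
import csv
import io


def frequency(field):
    """Return the text after the last ':' in a field such as 'TCCC:0.0729167'."""
    # rpartition splits at the last colon; a field with no colon comes back whole.
    _, _, value = field.rpartition(":")
    return value


def summarise(lines):
    """Yield [chromosome, position, frequency] for each tab-separated line."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        chromosome, position, n_alleles = fields[0], fields[1], int(fields[2])
        # Assumption: mirror the original branches by looking at the last
        # (n_alleles - 1) columns and keeping the numerically largest value.
        candidates = [frequency(f) for f in fields[-max(n_alleles - 1, 1):]]
        yield [chromosome, position, max(candidates, key=float)]


sample = [
    "12\t6814468\t2\t192\tC:0.927083\tTCCC:0.0729167\n",
    "12\t6814752\t4\t192\tCTTT:0.979167\tCTTTTT:0\tCT:0.015625\tC:0.00520833\n",
]

out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(summarise(sample))
print(out.getvalue(), end="")
# 12,6814468,0.0729167
# 12,6814752,0.015625
```

To write a real csv file, the StringIO object could be replaced by a file opened with open(path, "w", newline=""), and sample by the opened input file.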
