Indexing in Python starting just after a specific string
I have a tab-separated file with values as follows:
12 6814296 2 192 C:0.911458 T:0.0885417
12 6814328 2 192 C:1 T:0
12 6814345 2 192 C:1 T:0
12 6814360 2 192 C:1 T:0
12 6814381 2 192 G:1 A:0
12 6814396 2 192 C:1 A:0
12 6814397 2 192 G:0.989583 A:0.0104167
12 6814464 2 192 T:1 C:0
12 6814468 2 192 C:0.927083 TCCC:0.0729167
12 6814486 2 192 C:1 T:0
12 6814551 2 192 G:1 C:0
12 6814567 2 192 A:1 G:0
12 6814589 2 192 C:0.989583 T:0.0104167
12 6814619 2 192 G:1 A:0
12 6814663 2 192 A:1 G:0
12 6814732 2 192 C:1 T:0
12 6814752 4 192 CTTT:0.979167 CTTTTT:0 CT:0.015625 C:0.00520833
12 6814786 2 192 C:1 <CN0>:0
12 6814798 2 192 C:0.984375 T:0.015625
12 6814828 2 192 C:0.989583 G:0.0104167
12 6814951 2 192 G:1 C:0
From this file, I have to create a csv file with 3 comma-separated values in each row.
Below is my code:
file1 = open('/home/aahm/Documents/gene1.frq', 'r')
input_data = file1.readlines()
for line in input_data:
    rm_newline = line.strip('\n')
    comma_separated = rm_newline.split('\t')
    a = comma_separated[0]
    b = comma_separated[1]
    c = comma_separated[-1]
    d = c[2:]
    if comma_separated [2] == '2':
        e = a + ','+ b +',' + d
        print (e)
    elif comma_separated [2] == '3':
        f = comma_separated[-1]
        g = f[2:]
        h = comma_separated[-2]
        i = h[2:]
        if g > i:
            j = a + ','+ b +',' + g
            print (j)
        else:
            k = a + ','+ b +',' + i
            print (k)
    elif comma_separated [2] == '4':
        l = comma_separated[-1]
        m = l[2:]
        n = comma_separated[-2]
        o = n[2:]
        p = comma_separated[-3]
        q = p[2:]
        if m > o and m > p:
            r = a + ','+ b +',' + m
            print (r)
        elif o > m and o > p:
            s = a + ','+ b +',' + o
            print (s)
        elif p > m and p > o:
            t = a + ','+ b +',' + p
            print (t)
The code works well, except that for indexing I have used these:
d = c[2:]
g = f[2:]
i = h[2:]
etc.
For columns 6, 7 and 8 of the input file, I need only the numbers as output. However, my indexing gives me character strings as well as numbers for some rows, because the string preceding ':' is longer than 1 character. An example is given below.
The value in the last column is TCCC:0.0729167 for one row. When 'd = c[2:]' is used for indexing, I get CC:0.0729167 as output, whereas I need only 0.0729167.
I am stuck on this and have no idea how to proceed. I would be very grateful for any help. Thanks!
You are slicing the string from the third character (inclusive) to the end, which gives you 'CC:0.0729167' in your example. As other people said in the comments, you could just use
yourstring.split(":")[1]
to split the string at the colon and then retrieve the second part by its index, [1].
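To illustrate the difference, here is a small sketch using the value from the question. The variable names `field`, `allele` and `freq` are only illustrative, and `str.partition` is shown as one possible alternative to `split`, not something the answer above prescribes.

```python
field = "TCCC:0.0729167"

# Fixed-offset slicing assumes the allele is exactly one character long.
print(field[2:])             # CC:0.0729167

# Splitting on the colon works regardless of the allele's length.
print(field.split(":")[1])   # 0.0729167

# str.partition returns (text before, separator, text after).
allele, _, freq = field.partition(":")
print(allele, freq)          # TCCC 0.0729167
```

Note that the result is still a string; wrap it in `float()` if the values need to be compared numerically.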
As the comments point out, wherever a ":" remains in the column data you need to split on it. However, the code you have here is already rather opaque: all the single-letter variables make it quite difficult to see what a simple piece of code is actually trying to do. To avoid making it worse, in the example below I've defined a simple function, getnum, which you feed a field and which does the split for you if needed. Of course, this won't work if the field has more than one ":" character, but it would be easy enough to modify getnum. I've then altered your code to run every field through this getnum function.
To make life easier for yourself, I would encourage you to use more meaningful variable names than a, b, c and so on. Also, a little explanatory comment here and there would go a long way. I think with these in place you would probably have been able to crack the problem yourself!
file1 = open('/home/aahm/Documents/gene1.frq', 'r')
input_data = file1.readlines()
file1.close()

# process a field to only use the number after a ":"
def getnum(src):
    if ":" in src:
        return src.split(":")[1]
    else:
        return src

for line in input_data:
    rm_newline = line.strip('\n')
    comma_separated = rm_newline.split('\t')
    a = getnum(comma_separated[0])
    b = getnum(comma_separated[1])
    # getnum already drops the "ALLELE:" prefix, so no [2:] slice is needed
    d = getnum(comma_separated[-1])
    if comma_separated[2] == '2':
        print(a + ',' + b + ',' + d)
    elif comma_separated[2] == '3':
        g = getnum(comma_separated[-1])
        i = getnum(comma_separated[-2])
        # compare as numbers, not as strings
        if float(g) > float(i):
            print(a + ',' + b + ',' + g)
        else:
            print(a + ',' + b + ',' + i)
    elif comma_separated[2] == '4':
        m = getnum(comma_separated[-1])
        o = getnum(comma_separated[-2])
        q = getnum(comma_separated[-3])
        # print the largest of the last three frequencies
        print(a + ',' + b + ',' + max(m, o, q, key=float))
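As a further sketch, the same task could be written with descriptive names, numeric comparison and the standard csv module. The function names `frequency` and `convert` are hypothetical, and the rule "report the largest frequency among the alleles after the first one" is an assumption inferred from the branches in the question's code, not something the asker confirmed. Using `rsplit(":", 1)` is one way to cope with a field that contains more than one ":".

```python
import csv
import io


def frequency(field):
    """Return the text after the last ':' in an ALLELE:FREQ field."""
    return field.rsplit(":", 1)[-1]


def convert(lines, out):
    """Write chromosome, position and the largest non-first allele frequency."""
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 6:
            continue  # skip blank or malformed rows
        chrom, pos = columns[0], columns[1]
        # columns[4] is the first allele; columns[5:] are the remaining ones
        freqs = [frequency(field) for field in columns[5:]]
        # key=float compares numerically but keeps the original text
        writer.writerow([chrom, pos, max(freqs, key=float)])


# Two rows taken from the question's sample data
sample = [
    "12\t6814468\t2\t192\tC:0.927083\tTCCC:0.0729167\n",
    "12\t6814752\t4\t192\tCTTT:0.979167\tCTTTTT:0\tCT:0.015625\tC:0.00520833\n",
]
buffer = io.StringIO()
convert(sample, buffer)
print(buffer.getvalue(), end="")
# 12,6814468,0.0729167
# 12,6814752,0.015625
```

To process the real file, `convert` could be given an open file handle for the input and another (opened with `newline=""`) for the output instead of the in-memory sample used here.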