Keep rows with max value of a specific column
I am new to Python and I want to do the following. I have a CSV file (input.csv) which contains a header row and 4 columns. A part of this CSV file is shown below:
gene-name p-value stepup(p-value) fold-change
IFIT1 6.79175E-005 0.0874312 96.0464
IFITM1 0.00304362 0.290752 86.3192
IFIT1 0.000439152 0.145488 81.499
IFIT3 5.87135E-005 0.0838258 77.1737
RSAD2 6.7615E-006 0.0685623 141.898
RSAD2 3.98875E-005 0.0760279 136.772
IFITM1 0.00176673 0.230063 72.0445
For each gene name, I want to keep only the row with the highest fold-change and remove all other rows with the same gene name and a lower fold-change. For example, in this case I need a CSV output file of the following format:
gene-name p-value stepup(p-value) fold-change
IFIT1 6.79175E-005 0.0874312 96.0464
IFITM1 0.00304362 0.290752 86.3192
RSAD2 6.7615E-006 0.0685623 141.898
IFIT3 5.87135E-005 0.0838258 77.1737
I would be grateful if you could provide a solution to this problem.
Thank you very much.
The dumb solution: walk each line in the file and do a manual compare. The code assumes space-separated columns, with gene-name in the first column and fold-change in the fourth.
fi = open('inputfile.csv','r') # read
header = fi.readline()
# capture the header line ("gene-name p-value stepup(p-value) fold-change")
out_a = [] # we will store the results in here
for line in fi: # we can read a line this way too
    temp_a = line.strip('\r\n').split(' ')
    # strip the newlines, split the line into an array
    try:
        pos = [gene[0] for gene in out_a].index(temp_a[0])
        # try to see if the gene has already been seen before
        # [0] is the first column (gene-name)
        # return the position in out_a where the existing gene is
    except ValueError: # python throws this if a value is not found
        out_a.append(temp_a)
        # add it to the list initially
    else: # we found an existing gene
        if float(temp_a[3]) > float(out_a[pos][3]):
            # new line has higher fold-change (column 4)
            out_a[pos] = temp_a
            # so we replace
fi.close() # we're done with our input file
fo = open('outfile.csv','w') # prepare to write to output
fo.write(header) # don't forget about our header
for result in out_a:
    # iterate through out_a and write each line to fo
    fo.write(' '.join(result) + '\n')
    # result is a list [XXXX,...,1234]
    # we ' '.join(result) to turn it back into a line
    # don't forget the '\n' which makes each result on a line
fo.close()
One advantage of this is that it preserves the first-encountered order of the genes from the input file.
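For larger files, the same idea can be sketched with a dictionary instead of rebuilding a list of gene names on every line. This is only a sketch under the same assumptions as above (space-separated columns, gene-name first, fold-change fourth). The helper name keep_max_rows and the inline SAMPLE string are made up for illustration, with the rows copied from the question.

```python
import csv
import io

# Sample rows copied from the question; the single-space delimiter is an assumption.
SAMPLE = """\
gene-name p-value stepup(p-value) fold-change
IFIT1 6.79175E-005 0.0874312 96.0464
IFITM1 0.00304362 0.290752 86.3192
IFIT1 0.000439152 0.145488 81.499
IFIT3 5.87135E-005 0.0838258 77.1737
RSAD2 6.7615E-006 0.0685623 141.898
RSAD2 3.98875E-005 0.0760279 136.772
IFITM1 0.00176673 0.230063 72.0445
"""


def keep_max_rows(lines, key_col=0, value_col=3):
    """Return (header, rows), keeping for each key the row with the largest value."""
    reader = csv.reader(lines, delimiter=' ')
    header = next(reader)
    best = {}  # gene name -> best row so far; dicts keep insertion order (Python 3.7+)
    for row in reader:
        if not row:
            continue  # skip blank lines
        key = row[key_col]
        if key not in best or float(row[value_col]) > float(best[key][value_col]):
            # Reassigning an existing key keeps its original position in the dict
            best[key] = row
    return header, list(best.values())


header, rows = keep_max_rows(io.StringIO(SAMPLE))
print(' '.join(header))
for row in rows:
    print(' '.join(row))
```

To run it on a real file, pass an open file object (for example open('inputfile.csv', newline='')) instead of the StringIO. Genes still come out in first-encountered order, and each lookup is a dictionary access rather than a scan of everything seen so far.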
Try using pandas:
import pandas as pd
df = pd.read_csv('YOUR_PATH_HERE')
print(df.loc[(df['gene-name'] != df.loc[(df['fold-change'] == df['fold-change'].max())]['gene-name'].tolist()[0])])
The code is long because I chose to do it in one line, but here is what it does. I grab the gene-name of the highest fold-change, then use the != operator to say, "grab me everything where the gene-name is not the same as the gene-name of the calculation we just did."
Broken down:
# gets the max value in fold-change
max_value = df['fold-change'].max()
# gets the gene name of that max value
gene_name_max = df.loc[df['fold-change'] == max_value]['gene-name']
# reassigning so you see the progression of grabbing the name
gene_name_max = gene_name_max.values[0]
# the final output
df.loc[(df['gene-name'] != gene_name_max)]
Output:
gene-name p-value stepup(p-value) fold-change
0 IFIT1 0.000068 0.087431 96.0464
1 IFITM1 0.003044 0.290752 86.3192
2 IFIT1 0.000439 0.145488 81.4990
3 IFIT3 0.000059 0.083826 77.1737
6 IFITM1 0.001767 0.230063 72.0445
EDIT:
To get the expected output, use groupby:
import pandas as pd
df = pd.read_csv('YOUR_PATH_HERE')
df.groupby(['gene-name'], sort=False)['fold-change'].max()
# output below
gene-name
IFIT1 96.0464
IFITM1 86.3192
IFIT3 77.1737
RSAD2 141.8980
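Note that the groupby above returns only the maximum fold-change per gene, not the other columns. If the full rows are needed, as in the question's expected output, one possible sketch is to combine groupby with idxmax and then select those rows with loc. The inline SAMPLE string (rows copied from the question) and the whitespace separator are assumptions for illustration; adjust sep to match the real file.

```python
import io

import pandas as pd

# Sample rows copied from the question; whitespace-delimited here as an assumption.
SAMPLE = """\
gene-name p-value stepup(p-value) fold-change
IFIT1 6.79175E-005 0.0874312 96.0464
IFITM1 0.00304362 0.290752 86.3192
IFIT1 0.000439152 0.145488 81.499
IFIT3 5.87135E-005 0.0838258 77.1737
RSAD2 6.7615E-006 0.0685623 141.898
RSAD2 3.98875E-005 0.0760279 136.772
IFITM1 0.00176673 0.230063 72.0445
"""

df = pd.read_csv(io.StringIO(SAMPLE), sep=r"\s+")

# For each gene, idxmax returns the index label of the row with the largest fold-change
best_idx = df.groupby('gene-name', sort=False)['fold-change'].idxmax()

# Select those full rows; sort=False keeps genes in first-encountered order
result = df.loc[best_idx].reset_index(drop=True)
print(result)

# To write the result out: result.to_csv('outfile.csv', index=False)
```

This keeps one row per gene with all four columns. The genes appear in the order they are first seen in the input, which differs slightly from the row order shown in the question.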