How do I read and modify a (.gct) file using Python?

Question

Which libraries can help me read a .gct file in Python and edit it, for example by removing the rows with NaN values? And how would the following code change if I applied it to a .gct file?

data = pd.read_csv('PAAD1.csv')
new_data = data.dropna(axis = 0, how ='any')
print("Old data frame length:", len(data), "\nNew data frame length:",  
       len(new_data), "\nNumber of rows with at least 1 NA value: ", 
       (len(data)-len(new_data)))
new_data.to_csv('EditedPAAD.csv')

Answer 1

You should use the cmapPy package for this. Compared to read_csv, it gives you more freedom and domain-specific utilities. For example, if your *.gct looks like this:

#1.2            
22215   2       
Name    Description Tumor_One   Normal_One
1007_s_at   na  -0.214548   -0.18069
1053_at "RFC2 : replication factor C (activator 1) 2, 40kDa |@RFC2|"    0.868853    -1.330921
117_at  na  1.124814    0.933021
121_at  PAX8 : paired box gene 8 |@PAX8|    -0.825381   0.102078
1255_g_at   GUCA1A : guanylate cyclase activator 1A (retina) |@GUCA1A|  -0.734896   -0.184104
1294_at UBE1L : ubiquitin-activating enzyme E1-like |@UBE1L|    -0.366741   -1.209838
1316_at "THRA : thyroid hormone receptor, alpha (erythroblastic leukemia viral (v-erb-a) oncogene homolog, avian) |@THRA|"  -0.126108   1.486972
1320_at "PTPN21 : protein tyrosine phosphatase, non-receptor type 21 |@PTPN21|" 3.083681    -0.086705
...

You can extract only the rows with a desired probeset ID (row ID), e.g. ['1007_s_at', '1053_at', '117_at', '121_at', '1255_g_at', '1294_at UBE1L'].

So to read a file, remove the rows whose description is NaN, and save it again, do:

from cmapPy.pandasGEXpress.parse_gct import parse
from cmapPy.pandasGEXpress.write_gct import write

data = parse('example.gct', rid=['1007_s_at', '1053_at',
                                 '117_at', '121_at',
                                 '1255_g_at', '1294_at  UBE1L'])
# remove nan values from row_metadata (description column)
data.row_metadata_df.dropna(inplace=True)
# remove the entries of .data_df where nan values are in row_metadata
data.data_df = data.data_df.loc[data.row_metadata_df.index]

# Can only write GCT version 1.3
write(data, 'new_example.gct')

The new_example.gct then looks like this:

#1.3
3   2   1   0
id  Description Tumor_One   Normal_One

1053_at RFC2 : replication factor C (activator 1) 2, 40kDa |@RFC2|  0.8689  -1.3309

121_at  PAX8 : paired box gene 8 |@PAX8|    -0.8254 0.1021

1255_g_at   GUCA1A : guanylate cyclase activator 1A (retina) |@GUCA1A|  -0.7349 -0.1841

Answer 2

A quick Google search turns up the following: https://pypi.org/project/cmapPy/

Regarding the code: if you don't care about the metadata in the first 2 rows, it seems to work for your purpose, but you should first specify that the delimiter is TAB and skip the first 2 rows: pandas.read_csv(PATH_TO_GCT_FILE, sep='\t', skiprows=2)
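To make the pandas-only route concrete, here is a minimal sketch of how the code from the question might be adapted. It assumes a GCT 1.2 layout like the example in Answer 1 (a version line, a dimensions line, then a tab-separated table whose first two columns are Name and Description). The function name drop_nan_rows_gct is made up for illustration, and checking only the sample columns for NaN is an assumption about what "rows with NaN values" should mean here.

```python
import pandas as pd


def drop_nan_rows_gct(src_path, dst_path):
    """Read a GCT 1.2 file, drop rows with missing sample values, write it back.

    Hypothetical helper: assumes the first two columns are Name and Description.
    """
    # Skip the version line and the dimensions line; line 3 is the column header.
    data = pd.read_csv(src_path, sep="\t", skiprows=2)

    # Everything after Name and Description is treated as a sample column.
    sample_cols = list(data.columns[2:])

    # Drop a row only if one of its sample values is missing.
    new_data = data.dropna(axis=0, how="any", subset=sample_cols)

    with open(dst_path, "w", newline="") as handle:
        # Rewrite the two header lines so the row count matches the filtered table.
        handle.write("#1.2\n")
        handle.write(f"{len(new_data)}\t{len(sample_cols)}\n")
        new_data.to_csv(handle, sep="\t", index=False)

    return len(data), len(new_data)
```

Calling drop_nan_rows_gct('PAAD1.gct', 'EditedPAAD.gct') (file names are placeholders) returns the old and new row counts, so the print statement from the question can be kept largely as it is. Unlike plain to_csv, this writes the two header lines back, which tools that expect a valid GCT file will likely need.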

Answer 1
2 votes, accepted, 2020-02-13 10:15:44

Answer 2
0 votes, 2020-02-13 08:09:50