Pandas read_csv incorrectly names columns

Question

I am trying to import a leukemia gene expression data set found at https://www.kaggle.com/brunogrisci/leukemia-gene-expression-cumida. This data set has a lot of columns (22285), and the columns imported towards the end have incorrect names. For example, the last column, named AFFX-r2-P1-cre-3_at, is actually called 217005_at in the CSV file. The image below shows my Jupyter notebook cells. I am not sure why it is being formatted this way. Any help would be greatly appreciated.

Python code

Answer 1

Evidently the CSV file itself has column names that start with 'AFFX-r2-P1', so this is not a pandas issue. Reading the header row with the built-in csv package shows:

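import csv
from pathlib import Path

data_file = Path('../../../Downloads/Leukemia_GSE9476.csv')

# Read only the header row using the standard-library csv module.
with open(data_file, 'rt', newline='') as lines:
    csv_file = csv.reader(lines)
    fields = next(csv_file)

# List (position, name) for every column whose name starts with 'AFFX-r2-P1'.
[
    (field_number, field)
    for field_number, field in enumerate(fields)
    if field.startswith('AFFX-r2-P1')
]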

The output is:

[(22277, 'AFFX-r2-P1-cre-3_at'), (22278, 'AFFX-r2-P1-cre-5_at')]
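The same check can be made from the pandas side. The following is only a minimal sketch: it uses a tiny in-memory CSV as a hypothetical stand-in for the real file (the sample header, its column order, and the variable names are all assumptions for illustration), and it parses just the header row to see where a given column sits and what the final column is called.

```python
import io

import pandas as pd

# Hypothetical miniature stand-in for the real file; the column order here
# is invented for illustration and does not reflect the actual data set.
sample_csv = io.StringIO(
    "samples,type,217005_at,AFFX-r2-P1-cre-3_at,AFFX-r2-P1-cre-5_at\n"
    "1,AML,3.1,7.2,7.5\n"
)

# nrows=0 parses only the header row, which keeps the check cheap
# even for files with tens of thousands of columns.
header = pd.read_csv(sample_csv, nrows=0)
columns = list(header.columns)

# Zero-based position of one probe in the header, and the name of the last column.
position_of_217005 = columns.index("217005_at")
last_column = columns[-1]
print(position_of_217005, last_column)
```

With the real file, replacing sample_csv with the path to Leukemia_GSE9476.csv should show whether 217005_at is really the last column or simply appears earlier in the header.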

Solution 1
0 2021-11-28 15:58:42
