
Pandas read_csv incorrectly naming columns

I am trying to import a leukemia gene expression data set found at https://www.kaggle.com/brunogrisci/leukemia-gene-expression-cumida. This data set has a lot of columns (22285), and the columns imported towards the end have incorrect names. For example, the last column, named AFFX-r2-P1-cre-3_at, is actually called 217005_at in the CSV file. The image below shows my Jupyter notebook cells. I am not sure why it is being formatted this way. Any help would be greatly appreciated.

[Image: Python code]

Evidently the CSV file itself has column names that start with 'AFFX-r2-P1', so this is not a pandas issue. Reading the header with the built-in csv module shows:

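import csv
from pathlib import Path

data_file = Path('../../../Downloads/Leukemia_GSE9476.csv')

# Read only the header row; newline='' is what the csv module recommends
with open(data_file, 'rt', newline='') as lines:
    csv_file = csv.reader(lines)
    fields = next(csv_file)

# Collect (position, name) for every header field starting with 'AFFX-r2-P1'
matches = [
    (field_number, field)
    for field_number, field in enumerate(fields)
    if field.startswith('AFFX-r2-P1')
]
print(matches)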

The output is:

[(22277, 'AFFX-r2-P1-cre-3_at'), (22278, 'AFFX-r2-P1-cre-5_at')]
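
To confirm that a set of column names matches the file position by position, the raw header can be compared against them directly. The following is only a minimal sketch: the helper names `read_header` and `header_mismatches` and the tiny demo file are made up for illustration and are not part of the original answer or the real data set.

```python
import csv
import tempfile
from pathlib import Path


def read_header(path):
    """Return the first row of a CSV file as a list of field names."""
    with open(path, 'rt', newline='') as lines:
        return next(csv.reader(lines))


def header_mismatches(path, column_names):
    """Return (position, name_in_file, name_given) wherever the two differ.

    Only positions present in both sequences are compared (zip stops at
    the shorter one), so a length difference has to be checked separately.
    """
    header = read_header(path)
    return [
        (position, in_file, given)
        for position, (in_file, given) in enumerate(zip(header, column_names))
        if in_file != given
    ]


# Tiny made-up file standing in for the real data set (hypothetical contents).
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / 'demo.csv'
    demo_file.write_text(
        'samples,type,1007_s_at,AFFX-r2-P1-cre-3_at\n'
        '1,AML,7.7,13.1\n'
    )
    demo_header = read_header(demo_file)
    # Names someone *expects* to see; the last one differs from the file.
    expected = ['samples', 'type', '1007_s_at', '217005_at']
    demo_mismatches = header_mismatches(demo_file, expected)

print(demo_mismatches)
```

With a DataFrame loaded by pandas, the same check could be run as `header_mismatches(data_file, list(df.columns))`. An empty list would mean pandas used exactly the names stored in the file, at least for the positions both sequences share.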


 