
Replace categorical values in CSV file with binary values

I have a clinical data set in which I have to replace

  • the 1st column value 'DECEASED' with 1 if the value of 'Date' > 365, otherwise with 0 (zero),
  • the value 'LIVING' with 1 if 'Day_to_follow_up' > 365.

In addition, I need to assign the ages

  • 0-25 to bin 0,
  • 25-50 to bin 1,
  • 50-75 to bin 2,
  • above 75 to bin 4.
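The binning rule listed above could be sketched as a small helper. This is only an illustration: the function name `age_to_bin` is hypothetical, and because the listed ranges overlap at 25, 50 and 75, the sketch assumes each upper bound is inclusive. The top bin is labelled 4 exactly as written in the question.

```python
def age_to_bin(age):
    """Map an age in years to a bin label using the ranges from the question.

    Assumption: upper bounds are inclusive (25 -> bin 0, 50 -> bin 1, 75 -> bin 2).
    """
    age = float(age)  # csv.reader yields strings, so convert first
    if age < 0:
        raise ValueError("age must be non-negative")
    if age <= 25:
        return 0
    if age <= 50:
        return 1
    if age <= 75:
        return 2
    return 4  # the question labels the top bin 4, not 3


# Ages taken from the sample input
print([age_to_bin(a) for a in ("42", "56", "43")])
```

For the three sample ages this yields bins 1, 2 and 1, which agrees with the age column of the sample output.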

Here is my code.

import csv
import pandas as pd
with open('combined_file', 'rb') as f,open('newFile', 'wb') as out:
    reader = csv.reader(f)


    writer = csv.writer(out)
    for row in reader:
        #print "AABB"
        if 'DECEASED' in row[1]:
            if row[10]>365:
                row[1]=1
                writer.writerow(row)
            elif row[10]<365:
                row[1]=0
                writer.writerow(row)
        if 'LIVING' in row[1]:
            if row[11]>365:
                row[1]=1
                writer.writerow(row)

Sample input

sample id , status , age ,gender ,date ,days_to_last_followup
0     ,    Deceased , 42 , M  ,   326 ,    149
1     ,    Deceased , 56 , F  ,   500 ,    30
2     ,    living   , 43 ,M   ,   25  ,    150

Sample output

sample id , status , age ,gender,date ,days_to_last_followup
0     ,       0    , 1 ,  M    ,326 ,    149
1     ,       1    , 2 , F     ,500 ,    30
2     ,       0    , 1 ,M   ,   25  ,    150

I'm not sure what your question is, based on this post. Either way, the logic would have an issue if both 'Deceased' and 'Living' were in row[1]. I'd suggest you create some test cases to look for bad data, since ETL processes routinely have to deal with unexpected data formats and fields.
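One way to look for bad data before transforming it is a small validation pass. The sketch below is only an illustration: the helper name `find_bad_rows`, the six-column layout (taken from the sample input rather than the indices used in the posted code), and the set of valid statuses are all assumptions.

```python
import csv
import io

VALID_STATUSES = {"deceased", "living"}  # assumed to be the only legal values


def find_bad_rows(lines):
    """Return (line_number, reason) pairs for rows that look malformed."""
    problems = []
    reader = csv.reader(lines)
    next(reader, None)  # skip the header row
    for line_no, row in enumerate(reader, start=2):
        fields = [field.strip() for field in row]
        if len(fields) != 6:
            problems.append((line_no, "expected 6 fields"))
            continue
        if fields[1].lower() not in VALID_STATUSES:
            problems.append((line_no, "unknown status"))
        if not all(fields[i].isdigit() for i in (2, 4, 5)):
            problems.append((line_no, "non-numeric age or day count"))
    return problems


# Made-up test data: one clean row, one ambiguous status, one bad age
sample = io.StringIO(
    "sample id,status,age,gender,date,days_to_last_followup\n"
    "0,Deceased,42,M,326,149\n"
    "1,Deceased LIVING,56,F,500,30\n"
    "2,living,abc,M,25,150\n"
)
print(find_bad_rows(sample))
```

Rows that are reported here can be fixed or dropped before the status and age columns are recoded.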

I'm also not sure why you are importing the pandas library. You don't seem to be calling it anywhere in the code you posted.

Your code is a good starting point. A few things that the code does not cover:

  • What happens when 'DECEASED' and 'LIVING' are both in row[1]? Your code will write two rows. To fix this, change the if 'LIVING' to elif 'LIVING'.
  • You need an else case to catch what happens when neither DECEASED nor LIVING is in row[1].
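The two points above could be combined into an if/elif/else structure along the following lines. This is a sketch under stated assumptions, not the asker's final code: the names `recode_status` and `recode_rows` are hypothetical, the column positions (status at index 1, date at 4, follow-up days at 5) follow the sample input rather than the indices in the posted code, and a living patient with 365 or fewer follow-up days is assumed to map to 0, as the sample output suggests. The day counts are converted with `int` because `csv.reader` yields strings. Using `elif` also avoids evaluating `'LIVING' in row[1]` after row[1] has already been overwritten with an integer, which would raise a TypeError.

```python
import csv
import io


def recode_status(status, date, followup, threshold=365):
    """Return 1 or 0 for a status string, following the rules in the question."""
    status = status.strip().upper()  # the sample data mixes upper and lower case
    if status == "DECEASED":
        return 1 if int(date) > threshold else 0
    elif status == "LIVING":
        # Assumption: LIVING with follow-up <= threshold maps to 0
        return 1 if int(followup) > threshold else 0
    else:
        raise ValueError("unexpected status: %r" % status)


def recode_rows(rows):
    """Yield each data row with its status field replaced by 0 or 1."""
    for row in rows:
        fields = [field.strip() for field in row]
        fields[1] = recode_status(fields[1], fields[4], fields[5])
        yield fields


# Data rows from the sample input (header omitted for brevity)
sample = io.StringIO(
    "0,Deceased,42,M,326,149\n"
    "1,Deceased,56,F,500,30\n"
    "2,living,43,M,25,150\n"
)
out = io.StringIO()
csv.writer(out).writerows(recode_rows(csv.reader(sample)))
print(out.getvalue())
```

Each input row produces exactly one output row, and an unrecognised status fails loudly instead of being dropped silently.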
