
Replacing and deleting columns from a csv using python

Here is the code that I am writing:

import csv
import openpyxl

def read_file(fn):
    rows = []

    with open(fn) as f:
        reader = csv.reader(f, quotechar='"',delimiter=",")
        for row in reader:
            if row:                     
                rows.append(row)
    return rows 


replace = {x[0]:x[1:] for x in read_file("replace.csv")}


delete = set( (row[0] for row in read_file("delete.csv")) )  


result = []

input_file="input.csv"
with open(input_file) as f:
    reader = csv.reader(f, quotechar='"')
    for row in reader:
        if row:
            if row[7] in delete:
                continue                                   
            elif row[7] in replace:

                result.append(replace[row[7]])   
            else:
                result.append(row)                       



with open ("done.csv", "w+", newline="") as f:
    w = csv.writer(f,quotechar='"', delimiter= ",")
    w.writerows(result)

Here are my files:

input.csv:

c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13
"-","-","-","-","-","-","-","aaaaa","-","-","bbbbb","-",","
"-","-","-","-","-","-","-","ccccc","-","-","ddddd","-",","
"-","-","-","-","-","-","-","eeeee","-","-","fffff","-",","

This is a 13-column CSV. I am interested only in the 8th and the 11th fields.

This is my replace.csv:

"aaaaa","11111","22222"

delete.csv:

ccccc

What I am doing is comparing the first column of replace.csv (line by line) with the 8th column of input.csv. If they match, the 8th column of input.csv should be replaced with the second column of replace.csv, and the 11th column of input.csv with the third column of replace.csv. For delete.csv, each line is compared with the 8th column of input.csv, and if a match is found the entire row is deleted. If a line matches nothing in replace.csv or delete.csv, it is written out as it is. So my desired output is:

c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13
"-","-","-","-","-","-","-",11111,"-","-",22222,"-",","
"-","-","-","-","-","-","-","eeeee","-","-","fffff","-",","

But when I run this code, it gives me output like this:

c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13
11111,22222

Where am I going wrong? I am trying to adapt a program that I posted an earlier question about, because the input file has changed. https://stackoverflow.com/a/54388144/9279313

@anuj I think SafeDev's solution is optimal, but if you don't want to go with pandas, just make a few small changes to your code.

for row in reader:
    if row:
        if row[7] in delete:
            continue                                   
        elif row[7] in replace:
            key = row[7]
            row[7] = replace[key][0]
            row[10]= replace[key][1]
            result.append(row)
        else:
            result.append(row)  

Hope this solves your issue.
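To spell out why the original script printed only `11111,22222`: `result.append(replace[row[7]])` appends just the two replacement values instead of the full 13-field row. Below is a minimal, self-contained sketch of the corrected logic. The function name `apply_rules` and the in-memory strings standing in for the three files are assumptions made for illustration, not part of the original code.

```python
import csv
import io


def apply_rules(rows, replace, delete, key_col=7, second_col=10):
    """Apply the delete and replace rules to an iterable of CSV rows.

    key_col and second_col are zero-based, so 7 and 10 are the
    8th and 11th fields described in the question.
    """
    result = []
    for row in rows:
        if not row:
            continue  # skip blank lines
        key = row[key_col]
        if key in delete:
            continue  # drop the entire row
        if key in replace:
            row = list(row)  # copy so the caller's row is left untouched
            new_key, new_second = replace[key][:2]
            row[key_col] = new_key
            row[second_col] = new_second
        result.append(row)
    return result


# In-memory stand-ins for input.csv, replace.csv and delete.csv.
input_text = '''c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13
"-","-","-","-","-","-","-","aaaaa","-","-","bbbbb","-",","
"-","-","-","-","-","-","-","ccccc","-","-","ddddd","-",","
"-","-","-","-","-","-","-","eeeee","-","-","fffff","-",","
'''
replace_text = '"aaaaa","11111","22222"\n'
delete_text = 'ccccc\n'

# Same lookup structures as in the question: a dict and a set.
replace = {r[0]: r[1:] for r in csv.reader(io.StringIO(replace_text)) if r}
delete = {r[0] for r in csv.reader(io.StringIO(delete_text)) if r}

result = apply_rules(csv.reader(io.StringIO(input_text)), replace, delete)

# Write to an in-memory buffer; with real files, pass a handle opened
# with newline="" instead.
out = io.StringIO()
csv.writer(out).writerows(result)
print(out.getvalue())
```

Note that `csv.writer` uses minimal quoting by default, so a field is quoted only when it needs to be (for example the last field, which is itself a comma). Passing `quoting=csv.QUOTE_ALL` quotes every field if that is preferred.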

It's actually quite simple. Instead of writing it from scratch, just use the pandas library. From there it's easier to handle any dataset. This is how you would do it:

EDIT:

import pandas as pd

input_csv = pd.read_csv('input.csv')
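# replace.csv and delete.csv have no header row, so read both with header=None;
# dtype=str keeps every value as text
replace_csv = pd.read_csv('replace.csv', header=None, dtype=str)
delete_csv = pd.read_csv('delete.csv', header=None, dtype=str)

r_lst = list(replace_csv.iloc[:, 0])
d_lst = list(delete_csv.iloc[:, 0])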

input2_csv = pd.DataFrame.copy(input_csv)
for i, row in input_csv.iterrows():
    if row['c8'] in r_lst:
        input2_csv.loc[i, 'c8'] = replace_csv.iloc[r_lst.index(row['c8']), 1]
        input2_csv.loc[i, 'c11'] = replace_csv.iloc[r_lst.index(row['c8']), 2]
    if row['c8'] in d_lst:
        input2_csv = input2_csv[input2_csv.c8 != row['c8']]

input2_csv.to_csv('output.csv', index=False)

This process can be made even more dynamic by turning it into a function that takes the column names as parameters and uses them in place of 'c8' and 'c11'.
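As a rough sketch of that idea, the function below takes the two column names as parameters. It is one possible way to do it, not the answer's own code: the name `replace_and_delete`, its parameters, and the small in-memory frames are assumptions for illustration, and it uses `isin` and `map` instead of the row-by-row loop.

```python
import pandas as pd


def replace_and_delete(df, replace_df, delete_keys, key_col, second_col):
    """Return a copy of df with the delete and replace rules applied.

    replace_df is expected to have three columns by position:
    the key to match, the new key_col value, the new second_col value.
    """
    # Drop every row whose key appears in delete_keys.
    out = df[~df[key_col].isin(list(delete_keys))].copy()

    # Lookup tables built from the replacement frame.
    new_key = dict(zip(replace_df.iloc[:, 0], replace_df.iloc[:, 1]))
    new_second = dict(zip(replace_df.iloc[:, 0], replace_df.iloc[:, 2]))

    # Remember the original keys before overwriting anything.
    mask = out[key_col].isin(list(new_key))
    keys = out.loc[mask, key_col]
    out.loc[mask, second_col] = keys.map(new_second)
    out.loc[mask, key_col] = keys.map(new_key)
    return out


# Small stand-in data containing only the two columns of interest.
input_df = pd.DataFrame({
    "c8": ["aaaaa", "ccccc", "eeeee"],
    "c11": ["bbbbb", "ddddd", "fffff"],
})
replace_df = pd.DataFrame([["aaaaa", "11111", "22222"]])

result = replace_and_delete(input_df, replace_df, ["ccccc"], "c8", "c11")
print(result)
```

With the real files, `input_df` would come from `pd.read_csv('input.csv')` and the other two inputs from files read with `header=None`.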
