Python - csv column restructuring

I'm totally new to python.

I have a csv file. Its structure is:

Ip, Tag, Sentence_id, scheme
#1, yes,      1,        1
#1, yes,      2,        2 
#2, no,       1,        1
#3, maybe,    3,        3

There are 100 sentence_ids, a number of IPs, and 6 schemes (1-6).

Is there a way with Python to restructure the csv so that its output is a matrix of the following structure:

Ip, scheme 1, scheme 2, scheme 3, sentence_id
#1, yes,      NA,       NA,       1
#1, NA,       yes,      NA,       2
#2, no,       NA,       NA,       1
#3, NA,       NA,       maybe,    3

I'm not sure if python is the "right" language to use for something like this. I've been guided to either python or awk, but have no idea of either. Thanks.

This reads through the file line by line and writes each row back out with the tag moved into the column for its scheme.

Just replace in_filename, out_filename, NUM_SCHEMES, and the header with whatever you want.

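import csv

NUM_SCHEMES = 3  # must match the number of scheme columns in the header below

# newline="" lets the csv module handle line endings itself
with open("in_filename", newline="") as in_file, open("out_filename", "w", newline="") as out_file:
    in_csv = csv.reader(in_file, skipinitialspace=True)
    out_csv = csv.writer(out_file)

    next(in_csv) # Skip header
    out_csv.writerow(["Ip", "scheme 1", "scheme 2", "scheme 3", "sentence_id"])

    for row in in_csv:
        # Strip stray whitespace, e.g. a trailing space after the scheme number
        ip, tag, sentence_id, scheme = [field.strip() for field in row]

        out_row = [ip]

        # Schemes are numbered from 1, and the value read from the file is a string
        for i in range(1, NUM_SCHEMES + 1):
            out_row.append(tag if i == int(scheme) else "NA")

        out_row.append(sentence_id)

        out_csv.writerow(out_row)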
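
The per-row step in the loop above can be pulled out into a small function, which makes it easy to check on a single row before running it over a whole file. This is only a minimal sketch: pivot_row and its num_schemes parameter are names made up here for illustration, and it assumes scheme numbers start at 1 and arrive as strings (possibly padded with spaces), as in the sample input.

```python
def pivot_row(row, num_schemes):
    """Return [ip, <one cell per scheme>, sentence_id] for one input row."""
    # Fields read from a CSV are strings and may carry stray spaces.
    ip, tag, sentence_id, scheme = [field.strip() for field in row]
    scheme = int(scheme)  # assumption: schemes are numbered from 1, not 0
    # Put the tag in the cell for its scheme and "NA" everywhere else.
    cells = [tag if i == scheme else "NA" for i in range(1, num_schemes + 1)]
    return [ip] + cells + [sentence_id]


print(pivot_row(["#1", " yes", " 2", " 2 "], 3))
# ['#1', 'NA', 'yes', 'NA', '2']
```

Converting the scheme to an int matters: comparing the loop counter to the raw string would never match, and every cell would come out as NA.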

You can use csv.DictReader to load the file into a list of dictionaries, then find the maximum scheme value to build the output field names. For each row in the list, set the "scheme N" field to the value of its Tag column. csv.DictWriter then does the rest: restval='NA' fills in any field a row has no key for, and extrasaction='ignore' drops the keys that are not output columns (Tag and scheme), e.g.:

import csv

# Python 3: open in text mode with newline='' (on Python 2, use 'rb' and 'wb' instead)
with open('input.csv', newline='') as fin, open('output.csv', 'w', newline='') as fout:
    # skipinitialspace drops the padding after each comma in the input
    rows = list(csv.DictReader(fin, skipinitialspace=True))
    schemes = range(1, max(int(row['scheme']) for row in rows) + 1)
    fieldnames = ['Ip'] + ['scheme {}'.format(i) for i in schemes] + ['Sentence_id']
    csvout = csv.DictWriter(fout, fieldnames=fieldnames, extrasaction='ignore', restval='NA')
    csvout.writeheader()
    for row in rows:
        # Copy the tag into the column named after this row's scheme
        row['scheme {}'.format(row['scheme'].strip())] = row['Tag']
        csvout.writerow(row)

This gives the following output given your example input:

Ip,scheme 1,scheme 2,scheme 3,Sentence_id
#1,yes,NA,NA,1
#1,NA,yes,NA,2
#2,no,NA,NA,1
#3,NA,NA,maybe,3
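
Since the question mentions six schemes, you may want all six columns in the output even when some schemes never occur in a given file. A rough sketch of that variant follows, run against an in-memory copy of the sample so it can be tried without any files. The restructure function, the SAMPLE text and the NUM_SCHEMES constant are illustrative names, not part of the answer above, and the fixed count of 6 is an assumption taken from the question.

```python
import csv
import io

# In-memory copy of the sample input from the question
SAMPLE = """Ip, Tag, Sentence_id, scheme
#1, yes,      1,        1
#1, yes,      2,        2
#2, no,       1,        1
#3, maybe,    3,        3
"""

NUM_SCHEMES = 6  # assumption: the question mentions schemes 1-6


def restructure(text, num_schemes=NUM_SCHEMES):
    """Return the restructured CSV as a string, with a fixed set of scheme columns."""
    rows = list(csv.DictReader(io.StringIO(text), skipinitialspace=True))
    fieldnames = (['Ip']
                  + ['scheme {}'.format(i) for i in range(1, num_schemes + 1)]
                  + ['Sentence_id'])
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames,
                            extrasaction='ignore', restval='NA',
                            lineterminator='\n')
    writer.writeheader()
    for row in rows:
        # int() tolerates surrounding whitespace and normalises the number
        row['scheme {}'.format(int(row['scheme']))] = row['Tag']
        writer.writerow(row)
    return out.getvalue()


print(restructure(SAMPLE))
```

With six scheme columns, the second data row comes out as #1,NA,yes,NA,NA,NA,NA,2. One thing to watch: because of extrasaction='ignore', a row whose scheme is larger than num_schemes is written with NA in every scheme column rather than raising an error.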
