I'm totally new to python.
I have a csv file. Its structure is:
Ip, Tag, Sentence_id, scheme
#1, yes, 1, 1
#1, yes, 2, 2
#2, no, 1, 1
#3 maybe, 3, 3
There are 100 sentence_id, a number of ips and 6 schemes (1-6).
Is there a way with python to restructure the csv so that it's output is a matrix of the following structure:
Ip, scheme 1, scheme 2, scheme 3, sentence_id
#1 yes, NA, NA, 1
#1 NA, yes, NA, 2
#2 no, NA, NA, 1
#3 NA, NA, maybe, 3
I'm not sure if python is the "right" language to use for something like this. I've been guided to either python or awk, but have no idea of either. Thanks.
This should read through the file line by line, reorder columns, etc.
Just replace in_filename, out_filename, NUM_SCHEMES, and the header with whatever you want.
from csv import reader, writer
with open("in_filename") as in_file, open("out_filename") as out_file:
in_csv = csv.reader(in_file)
out_csv = csv.writer(out_file)
next(in_csv) # Skip header
out_csv.writerow(["Ip", "scheme 1", "scheme 2", "scheme 3", "sentence_id"])
for row in in_csv:
ip, tag, sentence_id, scheme = row
out_row = [ip]
for i in range(NUM_SCHEMES):
out_row.append(tag if i == scheme else "NA")
out_row.append(sentence_id)
out_csv.writerow(out_row)
You can use csv.DictReader
to load the file into a list of dictionaries, then from that, find the maximum scheme
value to build your output field names. For each row in the list, set the scheme N
field to be equal to the value of the Tag
column. Then we use csv.DictWriter
to fill any missing values with NA
where a key is not present, eg:
import csv
with open('input.csv', 'rb') as fin, open('output.csv', 'wb') as fout:
rows = list(csv.DictReader(fin, skipinitialspace=True))
schemes = range(1, max(int(row['scheme']) for row in rows) + 1)
fieldnames = ['Ip'] + ['scheme {}'.format(i) for i in schemes] + ['Sentence_id']
csvout = csv.DictWriter(fout, fieldnames=fieldnames, extrasaction='ignore', restval='NA')
csvout.writeheader()
for row in rows:
row['scheme {}'.format(row['scheme'].strip())] = row['Tag']
csvout.writerow(row)
This gives the following output given your example input:
Ip,scheme 1,scheme 2,scheme 3,Sentence_id
#1,yes,NA,NA,1
#1,NA,yes,NA,2
#2,no,NA,NA,1
#3,NA,NA,maybe,3
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.