Suppose I have 100,000 users and want to create a csv file with 100,000 rows and K columns, where K is expected to be a few hundred. Each row contains one user's data, and each column is one variable. I build the data in a for loop; each iteration constructs a dictionary whose keys are the variable names. If I knew the K variable names in advance, I could use csv.DictWriter to append each new row.
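For reference, the known-columns case might look like the minimal sketch below. The names `FIELDNAMES`, `user_rows`, and `write_rows` are hypothetical stand-ins for the real loop, and the output goes to an in-memory buffer rather than a file so the result can be inspected directly.

```python
import csv
import io

# Hypothetical column names, assumed to be known before the loop starts.
FIELDNAMES = ["n1", "n2", "n3"]


def user_rows():
    """Stand-in for the real loop: yields one dictionary per user."""
    yield {"n1": "v11", "n2": "v12", "n3": "v13"}
    yield {"n2": "v22", "n3": "v23"}  # this user has no value for n1


def write_rows(out, rows, fieldnames):
    # restval is written for any field name missing from a row's dictionary.
    writer = csv.DictWriter(
        out, fieldnames=fieldnames, restval="NA", lineterminator="\n"
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


buffer = io.StringIO()
write_rows(buffer, user_rows(), FIELDNAMES)
print(buffer.getvalue())
```

Each row is written as soon as it is built, so memory use stays constant; the catch is that `fieldnames` must be fixed before the first row is written.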
The problem is that I don't know the variable names or the number K in advance. One option is the pandas.DataFrame.append function, but I don't like it because the pandas documentation says append is slow for iterative appending. I cannot use loc, as suggested elsewhere, because the number of columns varies.
My current strategy is to create three lists: list1 stores the variable names, list2 stores the values, and list3 stores the row indexes. Appending to a list inside a loop is easy in Python. From list3, I create a list of unique row indexes, which are used as the field names of the csv. For each variable, I create a dictionary whose keys are row indexes and whose values are the variable's values in those rows. Then I use csv.DictWriter to write the csv file. The last step is to transpose the created csv file.
I would be glad to hear suggestions for improvement.
# Example: three rows (r1, r2, r3) and four variables (n1, n2, n3, n4)
list1 = ['n1', 'n2', 'n3', 'n2', 'n3', 'n3', 'n4'] # n* is variable name
list2 = ['v11', 'v12', 'v13', 'v22', 'v23', 'v33', 'v34'] # v* is value
list3 = ['r1', 'r1', 'r1', 'r2', 'r2', 'r3', 'r3'] # r* is row id
# Convert to data of the following format
# n1 n2 n3 n4
# v11 v12 v13 NA
# NA v22 v23 NA
# NA NA v33 v34
# MY CURRENT WORKFLOW:
# 1. create a list of unique row id
from collections import OrderedDict
rowIds = list(OrderedDict.fromkeys(list3)) # this preserves row id order
# 2. create a list of unique variable names
names = list(OrderedDict.fromkeys(list1))
# 3. For each variable n*, create a dictionary whose keys are row ids and
#    whose values are the values of that variable in the corresponding
#    rows.
import csv
with open('example.csv', 'w', newline='') as csvfile:
    # use rowIds as the field names for DictWriter
    fieldnames = rowIds
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for name in names:
        index = [i for i, x in enumerate(list1) if x == name]
        dict1 = {list3[i]: list2[i] for i in index}
        writer.writerow(dict1)
# Transpose row-to-column and use 'names' as the new header. There are
# plenty of ways to do this.
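One possible way to do the transpose step with only the standard library is sketched below. The function name `transpose_csv` is hypothetical, and the sketch assumes the intermediate file fits in memory. It uses in-memory buffers in place of example.csv; the intermediate content shown is what the workflow above writes for the three-row example, with empty strings for missing values.

```python
import csv
import io


def transpose_csv(src, dst, names):
    """Transpose the csv in src, replacing its header with `names`."""
    rows = list(csv.reader(src))
    body = rows[1:]  # drop the row-id header written by DictWriter
    writer = csv.writer(dst, lineterminator="\n")
    writer.writerow(names)
    # zip(*body) turns the i-th column of the input into the i-th output row.
    writer.writerows(zip(*body))


# Intermediate data: one line per variable, one column per row id.
intermediate = io.StringIO(
    "r1,r2,r3\n"
    "v11,,\n"
    "v12,v22,\n"
    "v13,v23,v33\n"
    ",,v34\n"
)
result = io.StringIO()
transpose_csv(intermediate, result, ["n1", "n2", "n3", "n4"])
print(result.getvalue())
```

Because DictWriter pads every line to the same number of fields, `zip` does not drop any values here. With real files, the same function could be called with two file objects opened with `newline=''`.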
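Provided the entries for each row ID are contiguous in the lists (as in your example), you should be able to build each row and write it out whenever a change in row ID is seen, as follows: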
import csv
list1 = ['n1', 'n2', 'n3', 'n2', 'n3', 'n3', 'n4']
list2 = ['v11', 'v12', 'v13', 'v22', 'v23', 'v33', 'v34']
list3 = ['r1', 'r1', 'r1', 'r2', 'r2', 'r3', 'r3']
header = sorted(set(list1)) # Build a list of column names
with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.DictWriter(f_output, fieldnames=header, restval='NA')
    csv_output.writeheader()
    row = {}
    cur_row = list3[0]
    for v1, v2, v3 in zip(list1, list2, list3):
        if cur_row == v3:
            row[v1] = v2
        else:
            csv_output.writerow(row)
            row = {v1 : v2}
            cur_row = v3
    csv_output.writerow(row)
This gives you an output.csv file containing:
n1,n2,n3,n4
v11,v12,v13,NA
NA,v22,v23,NA
NA,NA,v33,v34
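If the entries for one row ID might not be adjacent in the lists, or if the columns should appear in first-seen order rather than sorted order, a variant along the following lines may help. This is only a sketch: the function name `pivot_to_csv` is hypothetical, it keeps all rows in memory until the end, and it relies on dictionaries preserving insertion order (Python 3.7+).

```python
import csv
import io

list1 = ['n1', 'n2', 'n3', 'n2', 'n3', 'n3', 'n4']
list2 = ['v11', 'v12', 'v13', 'v22', 'v23', 'v33', 'v34']
list3 = ['r1', 'r1', 'r1', 'r2', 'r2', 'r3', 'r3']


def pivot_to_csv(names, values, row_ids, out, missing='NA'):
    rows = {}    # row id -> {variable name: value}, in first-seen order
    header = {}  # used as an ordered set of variable names
    for name, value, row_id in zip(names, values, row_ids):
        rows.setdefault(row_id, {})[name] = value
        header.setdefault(name, None)
    writer = csv.DictWriter(
        out, fieldnames=list(header), restval=missing, lineterminator='\n'
    )
    writer.writeheader()
    writer.writerows(rows.values())


output = io.StringIO()
pivot_to_csv(list1, list2, list3, output)
print(output.getvalue())
```

The trade-off is memory: every row dictionary is held until the full set of column names is known, whereas the loop above writes each row as soon as the row ID changes.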