
Efficient way to create a csv iteratively in Python

Suppose I have 100,000 users. I want to create a csv file with 100,000 rows and K columns, where K is expected to be a few hundred. Each row contains one user's data, and each column is one variable. I create the data in a for loop, where each iteration constructs a dictionary whose keys are the variable names. If I knew the K variable names in advance, I could use csv.DictWriter to append each new row.

The problem is that I don't know the variable names or the number K in advance. One option is the pandas.DataFrame.append function, but I don't like it because pandas' documentation says append is slow for iterative appending. I cannot use loc, as suggested elsewhere, because the number of columns varies.
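One way to handle unknown column names with only the standard library is to buffer the per-user dictionaries, collect the union of their keys in first-seen order, and write everything once at the end. The following is a minimal sketch under that assumption, not a definitive implementation: the function name write_rows, the sample users list, and the 'NA' placeholder are hypothetical, and all rows are held in memory until the write.

```python
import csv
import io

def write_rows(rows, out, missing="NA"):
    """Write an iterable of dicts to CSV when the columns are not known up front."""
    buffered = []
    fieldnames = {}  # a dict used as an ordered set (keeps first-seen order)
    for row in rows:
        buffered.append(row)
        for key in row:
            fieldnames.setdefault(key, None)
    # Only now is the full header known, so the file is written in one go.
    writer = csv.DictWriter(out, fieldnames=list(fieldnames), restval=missing)
    writer.writeheader()
    writer.writerows(buffered)
    return list(fieldnames)

# Hypothetical per-user dictionaries, one per loop iteration.
users = [
    {"n1": "v11", "n2": "v12", "n3": "v13"},
    {"n2": "v22", "n3": "v23"},
    {"n3": "v33", "n4": "v34"},
]
buffer = io.StringIO()  # a file from open(path, 'w', newline='') works the same way
columns = write_rows(users, buffer)
```

Because the header is only written after the loop, no transpose step is needed, at the cost of keeping every row in memory until then.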

My current strategy is to create three lists: list1 stores variable names, list2 stores values, and list3 stores row indexes. Appending to a list within a loop is easy in Python. From list3, I create a list of unique row indexes, which is used as the field names of the csv. For each variable, I create a dictionary whose keys are row indexes and whose values are the variable's value in each row. Then I use csv.DictWriter to create the csv file. The last step is to transpose the created csv file.

I would be glad to hear suggestions for improvement.

# Example: three rows (r1, r2, r3) and four variables (n1, n2, n3, n4)
list1 = ['n1', 'n2', 'n3', 'n2', 'n3', 'n3', 'n4']    # n* is variable name
list2 = ['v11', 'v12', 'v13', 'v22', 'v23', 'v33', 'v34']    # v* is value
list3 = ['r1', 'r1', 'r1', 'r2', 'r2', 'r3', 'r3']    # r* is row id
# Convert to data of the following format
# n1  n2  n3  n4
# v11 v12 v13 NA
# NA  v22 v23 NA
# NA  NA  v33 v34

# MY CURRENT WORKFLOW:
# 1. create a list of unique row id
from collections import OrderedDict
rowIds = list(OrderedDict.fromkeys(list3))  # this preserves row id order
# 2. create a list of unique variable names
names = list(OrderedDict.fromkeys(list1))
# 3. For each variable n*, create a dictionary that maps each row id to the
# variable's value in that row.
import csv
with open('example.csv', 'w', newline='') as csvfile:
    # use rowIds as fieldname for DictWriter
    fieldnames = rowIds
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for name in names:
        index = [i for i, x in enumerate(list1) if x == name]
        dict1 = {list3[i]: list2[i] for i in index}
        writer.writerow(dict1)
# Transpose row-to-column and use 'names' as the new header. There are
# plenty of ways to do this.
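
The transpose step left open above could be sketched as follows with csv.reader and zip. This is only one possible approach: the function name transpose and the 'NA' fill value are assumptions, and the whole intermediate file is read into memory before writing.

```python
import csv
import io

def transpose(src, dst, names, missing="NA"):
    """Turn a file with one line per variable into one line per row id."""
    reader = csv.reader(src)
    next(reader)                # skip the header line of row ids
    columns = list(reader)      # one list of cells per variable
    writer = csv.writer(dst)
    writer.writerow(names)      # variable names become the new header
    for record in zip(*columns):
        # DictWriter leaves missing cells empty by default; fill them here.
        writer.writerow([cell if cell != "" else missing for cell in record])

# Contents that the workflow above writes to example.csv for the sample lists.
intermediate = io.StringIO(
    "r1,r2,r3\r\n"
    "v11,,\r\n"
    "v12,v22,\r\n"
    "v13,v23,v33\r\n"
    ",,v34\r\n"
)
transposed = io.StringIO()
transpose(intermediate, transposed, ["n1", "n2", "n3", "n4"])
```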

You should be able to build each row as you go and write it out whenever the row ID changes (this assumes all entries for a given row ID are adjacent), as follows:

import csv

list1 = ['n1', 'n2', 'n3', 'n2', 'n3', 'n3', 'n4']
list2 = ['v11', 'v12', 'v13', 'v22', 'v23', 'v33', 'v34']
list3 = ['r1', 'r1', 'r1', 'r2', 'r2', 'r3', 'r3']

header = sorted(set(list1))     # Build a list of column names

with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.DictWriter(f_output, fieldnames=header, restval='NA')
    csv_output.writeheader()
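    row = {}                # values collected so far for the current row ID
    cur_row = list3[0]

    for v1, v2, v3 in zip(list1, list2, list3):
        if cur_row == v3:
            # Same row ID: add this variable to the current row
            row[v1] = v2
        else:
            # Row ID changed: write the finished row and start a new one
            csv_output.writerow(row)
            row = {v1 : v2}
            cur_row = v3

    # Write the final row, which the loop never flushes
    csv_output.writerow(row)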

This gives you an output.csv file containing:

n1,n2,n3,n4
v11,v12,v13,NA
NA,v22,v23,NA
NA,NA,v33,v34
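
If entries for the same row ID are not guaranteed to be adjacent, a variant that first groups values by row ID may be safer. The sketch below is an assumption-laden alternative rather than part of the approach above: pivot_to_csv is a hypothetical name, columns are kept in first-seen order instead of sorted, and all rows are held in memory before writing.

```python
import csv
import io

def pivot_to_csv(names, values, row_ids, out, missing="NA"):
    """Group (name, value, row id) triples by row id, then write one CSV line per id."""
    rows = {}    # row id -> {variable name: value}, in first-seen order
    header = {}  # a dict used as an ordered set of variable names
    for name, value, row_id in zip(names, values, row_ids):
        rows.setdefault(row_id, {})[name] = value
        header.setdefault(name, None)
    writer = csv.DictWriter(out, fieldnames=list(header), restval=missing)
    writer.writeheader()
    writer.writerows(rows.values())

# Same data as above, but with the row ids deliberately interleaved.
names = ['n1', 'n2', 'n2', 'n3', 'n3', 'n3', 'n4']
values = ['v11', 'v12', 'v22', 'v13', 'v23', 'v33', 'v34']
row_ids = ['r1', 'r1', 'r2', 'r1', 'r2', 'r3', 'r3']

pivoted = io.StringIO()  # a file from open(path, 'w', newline='') works the same way
pivot_to_csv(names, values, row_ids, pivoted)
```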

