
Looking for a more efficient way to vectorize a CSV with information on different rows

I'm working on a machine learning competition where the goal is to predict the type, or motivation, of a trip a customer is making to the supermarket given information about the trip. I have a CSV file of the following format:

TripType,VisitNumber,Weekday,Upc,ScanCount,DepartmentDescription,FinelineNumber
999,5,Friday,68113152929,-1,FINANCIAL SERVICES,1000
30,7,Friday,60538815980,1,SHOES,8931
30,7,Friday,7410811099,1,PERSONAL CARE,4504
26,8,Friday,2238403510,2,PAINT AND ACCESSORIES,3565
26,8,Friday,2006613744,2,PAINT AND ACCESSORIES,1017

The first step I take is converting this data into feature vectors. To do that, I turn each categorical variable into dummy variables, so that each vector corresponds to one unique sample. The problem in creating the vectors is that samples are not confined to single rows; data about one sample can span several rows. Above, for example, there are 5 rows but only 3 samples (VisitNumbers 5, 7 and 8). Here are the feature vectors for the samples above:

'Friday', 68113152929, 60538815980, 7410811099, 2238403510, 2006613744, 'FINANCIAL SERVICES', 'SHOES', 'PERSONAL CARE', 'PAINT AND ACCESSORIES', 1000, 8931, 4504, 3565, 1017, 'Returned'
 [ 1.  0.  0.  0.  0.  0.  1.  0.  0.  0.  1.  0.  0.  0.  0.  1.]
 [ 1.  0.  1.  1.  0.  0.  0.  1.  1.  0.  0.  1.  1.  0.  0.  0.]
 [ 1.  0.  0.  0.  2.  2.  0.  0.  0.  1.  0.  0.  0.  1.  1.  0.]

Notice that I added the 'Returned' feature at the end (it is 1 when the scan count for any of the Upcs, i.e. items bought, is negative). There would also be a "target" vector with the corresponding labels:

[999, 30, 26] 

My problem is generating those vectors efficiently. I went over this part of the program relatively quickly, testing my code only on a small portion (100 - 1000 rows) of the total data I have (~700k rows). When I finished the rest of the program (the learning and prediction parts) and went back to the full data, the "vectorizing" part seemed to be taking far too long. Do you have any suggestions for approaching this (getting feature vectors from the CSV file) with better performance?

Here's the code I'm using for now. Please take a look if you want to know what I'm doing right now. Jump to "Iterate by row" to get directly to the vectorizing part:

import pandas as pd
from itertools import izip_longest  # Python 2; on Python 3 use itertools.zip_longest

# Pad the shorter vector with leading zeros, then add the vectors element-wise;
# where both entries are non-zero, store 1 instead of the sum.
def vector_add(P, Q):

    a = []
    for x,y in izip_longest(reversed(P), reversed(Q), fillvalue=0):
        if x == 0 or y == 0:
            a.append(x+y)
        else:
            a.append(1)

    return a[::-1]

df = pd.read_csv('exp-train')

# Get features
visitnums = df.drop_duplicates(subset='VisitNumber')['VisitNumber']
days = df.drop_duplicates(subset='Weekday')['Weekday']
upcs = df.drop_duplicates(subset='Upc')['Upc']
departments = df.drop_duplicates(subset='DepartmentDescription')['DepartmentDescription']
finenums = df.drop_duplicates(subset='FinelineNumber')['FinelineNumber']
# List to contain all feature vectors
lines = []

# Build the header row and add it to the list of feature vectors
top_line = []

top_line.append('VisitType')

for day in days:
    top_line.append(day)


for upc in upcs:
    top_line.append(upc)

for department in departments:
    top_line.append(department)

for finenum in finenums:
    top_line.append(finenum)

top_line.append('Returned')

lines.append(top_line)

# Iterate by row
# `back` remembers the previous row's VisitNumber so that rows belonging
# to the same visit (sample) can be merged into a single feature vector
back = 'no'
line = []
returned = 0

for i, row in enumerate(df.itertuples()):
    # line2 collects the current row's features when the row belongs to an
    # already-started sample
    line2 = []

    if not back == row[2]:
        # New sample: flush the previous vector before starting a new one
        if not back == 'no':
            line.append(returned)
            returned = 0
            lines.append(line)
            line = []
        line.append(row[1])

        for day in days:
            if day == row[3]:
                line.append(1)
            else:
                line.append(0)

        for upc in upcs:
            if upc == row[4]:
                if int(row[5]) < 0:
                    returned = 1
                    line.append(0)
                else:
                    line.append(int(row[5]))
            else:
                line.append(0)

        for department in departments:
            if department == row[6]:
                line.append(1)
            else:
                line.append(0)

        for finenum in finenums:
            if finenum == row[7]:
                line.append(1)
            else:
                line.append(0)

    else:
        # Same visit as the previous row: build a partial vector ...
        for upc in upcs:
            if upc == row[4]:
                if int(row[5]) < 0:
                    returned = 1
                    line2.append(0)
                else:
                    line2.append(int(row[5]))
            else:
                line2.append(0)

        for department in departments:
            if department == row[6]:
                line2.append(1)
            else:
                line2.append(0)

        for finenum in finenums:
            if finenum == row[7]:
                line2.append(1)
            else:
                line2.append(0)

        # ... and merge it into the sample's vector
        line = vector_add(line, line2)

    back = row[2]

    # Flush the last sample after the final row
    if i == (len(df.index) - 1):
        line.append(returned)
        returned = 0
        lines.append(line)

Please let me know if there's a good/better way to go about this.

If I understand you correctly, you can simply create a formula like this:

import pandas as pd
import numpy as np

df = pd.read_csv('exp-train')

from patsy import dmatrices
#the ~ separates the target (left side) from the predictors (right side)
#C() tells patsy that a variable is categorical
formula_ml = 'TripType ~ VisitNumber + C(Weekday) + Upc + ScanCount + C(DepartmentDescription)+ FinelineNumber'

#assign the variables
Y_train, X_train = dmatrices(formula_ml, data=df, return_type='dataframe')

Y_train = np.asarray(Y_train).ravel()

You can choose which features you want to use for your machine learning algorithm by changing the formula.
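As a usage note beyond the original answer: dmatrices returns ordinary dataframes, so the result can be fed straight into a scikit-learn estimator. A minimal sketch, assuming scikit-learn is installed; the classifier choice is only an illustration:

from sklearn.ensemble import RandomForestClassifier

# X_train and Y_train come from the dmatrices() call above
clf = RandomForestClassifier(n_estimators=100)  # illustrative model choice
clf.fit(X_train, Y_train)
print(clf.predict(X_train[:5]))  # sanity-check the pipeline on a few rows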

You can find the patsy package here.

Pure Python code can be really slow; that is why numpy and similar libraries are written in C, Fortran, and Cython.

For example, an integer in pure Python is stored using 12 bytes (on a 32-bit build) instead of 8. Building a list() of integers via append is therefore slow and expensive.

To speed up, try to

  1. allocate a numpy vector of integer zeros of the desired size
  2. instead of appending 0s and 1s, skip the zeros and only set the 1s (see the sketch below)
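
To make point 2 concrete, here is a minimal sketch of that idea (my illustration, not code from the answer), using pandas category codes as row/column indices into a preallocated zero matrix; the Weekday column stands in for any of the categorical features:

import numpy as np
import pandas as pd

df = pd.read_csv('exp-train')

# one row per distinct visit, one column per distinct weekday value
visits = df['VisitNumber'].astype('category')
days = df['Weekday'].astype('category')
X = np.zeros((len(visits.cat.categories), len(days.cat.categories)))

# set only the 1s: each CSV row flips exactly one cell, no per-feature loop
X[visits.cat.codes, days.cat.codes] = 1

The same pattern covers the Upc columns by assigning int(ScanCount) instead of 1.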

Also use the Python profiler to identify where your hotspots are.
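
For instance, the standard-library cProfile module reports which functions dominate the runtime; build_vectors below is a hypothetical name for the vectorizing code wrapped in a function:

import cProfile

# sort the report by cumulative time to surface the hotspots
cProfile.run('build_vectors()', sort='cumtime')  # build_vectors is hypothetical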
