
Read multiple CSV files and insert data into a table using python multiprocessing without using pandas

I have 3 region-specific CSV files in a folder, i.e. data_cityA.csv, data_cityB.csv and data_cityC.csv. I have to read each file, identify which region it belongs to, and insert its rows into a table, adding one extra column that contains the region name.

import csv
import glob

# conn / cursor come from an existing Sybase connection
list_of_files = glob.glob('./*csv')
for file_name in list_of_files:
    count = 0
    total = 0

    with open(file_name, 'r') as csvfile:
        read = csv.reader(csvfile)
        next(read)  # skip the header row
        if "cityA" in file_name:
            reg = "cityA"
        elif "cityB" in file_name:
            reg = "cityB"
        elif "cityC" in file_name:
            reg = "cityC"

        with open(file_name, 'r') as csv_file:
            reader = csv.reader(csv_file)
            data = list(reader)  # second pass just to count the rows
            total = len(data)
            temp_data = []

        for row in read:
            row.append(reg)  # concatenating region name
            temp_data.append(tuple(row))
            count += 1
            total -= 1

            if count > 999 or total == 1:
                # placeholder style (?) depends on the Sybase driver in use
                insert_query = "INSERT INTO table_name (A, B, C, D, E) VALUES (?, ?, ?, ?, ?)"
                cursor.executemany(insert_query, temp_data)
                conn.commit()
                count = 0
                insert_query = " "
                temp_data = []

cursor.callproc('any_proc')
conn.close()

It takes around 4-5 hours to process (the data size is <= 500MB). I tried to implement it with Python multiprocessing but wasn't able to do so successfully. I can't use pandas. The database is Sybase. Any ideas? Is there a better way to do it other than multiprocessing?

One big problem you have here is this: data = list(reader). This reads the whole file into memory at once. If the file is 500MB, then 500MB is loaded into memory at once. Your other option is to use reader as an iterator. That comes with the disadvantage that you don't know the total number of records beforehand, so after you exit the loop, you have to insert the remaining rows.
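
As a minimal sketch of that streaming pattern (assuming your Sybase driver accepts ? placeholders and that cursor/conn already exist; insert_batch and load_file are just illustrative names, not part of the original code):

import csv

def insert_batch(cursor, conn, rows):
    # placeholder style (?) depends on the Sybase driver in use
    cursor.executemany(
        "INSERT INTO table_name (A, B, C, D, E) VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()

def load_file(file_name, reg, cursor, conn, batch_size=1000):
    batch = []
    with open(file_name, 'r') as csvfile:
        reader = csv.reader(csvfile)
        next(reader)  # skip the header row
        for row in reader:  # stream rows; never hold the whole file in memory
            batch.append(tuple(row) + (reg,))
            if len(batch) >= batch_size:
                insert_batch(cursor, conn, batch)
                batch = []
    if batch:  # residual rows left over after the loop
        insert_batch(cursor, conn, batch)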

The second thing that may greatly impact your performance is the insertion. You can parallelise it using multiprocessing (below is a suggestion using Pool), but since each worker is a new process, it has to open its own connection to the database (and close it afterwards).

import csv
import glob
from multiprocessing.pool import Pool


def process_file(file_name):
    # Each worker runs in its own process, so make a new connection here
    # conn = ...
    cursor = conn.cursor()
    temp_data = []

    def do_insert():
        # placeholder style (?) depends on the Sybase driver in use
        insert_query = "INSERT INTO table_name (A, B, C, D, E) VALUES (?, ?, ?, ?, ?)"
        cursor.executemany(insert_query, temp_data)
        conn.commit()

    with open(file_name, 'r') as csvfile:
        read = csv.reader(csvfile)
        next(read)  # skip the header row
        if "cityA" in file_name:
            reg = "cityA"
        elif "cityB" in file_name:
            reg = "cityB"
        elif "cityC" in file_name:
            reg = "cityC"

        for row in read:
            row.append(reg)  # concatenating region name
            temp_data.append(tuple(row))
            if len(temp_data) > 999:
                do_insert()
                temp_data = []
    if temp_data:  # insert whatever is left after the loop
        do_insert()
    conn.close()


if __name__ == '__main__':
    list_of_files = glob.glob('./*csv')
    pool = Pool()
    pool.map(process_file, list_of_files)
    pool.close()
    pool.join()
    # the main process keeps its own connection for the final procedure call
    cursor.callproc('any_proc')
    conn.close()

Database roundtrips are slowing you down.

You're essentially making one round trip for every row. 500MB sounds like a lot of rows... so that's a lot of round trips. Check whether Sybase has a way to take a CSV and load it into a table directly: fewer calls (maybe even one) with lots of rows each, rather than one row per call.
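
Sybase's bcp bulk-copy utility is one such path, and you can still drive it from Python if you want to keep everything in one script. A minimal sketch, assuming bcp is on the PATH and that the file already has the region column appended; the table, server, login and file names below are placeholders:

import subprocess

# database.owner.table, login, server and file_2.txt are placeholders
result = subprocess.run(
    ["bcp", "database.owner.table", "in", "file_2.txt",
     "-U", "login", "-S", "server", "-c", "-t,", "-b", "1000"],
    capture_output=True, text=True)
print(result.stdout)
result.check_returncode()  # raise if bcp exited with an error
# add -P for the password, or let bcp prompt for it interactively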

Maybe you can consider doing this outside of Python.

Consider the following table...

create table t1 ( 
    k int not null, 
    v varchar(255) null, 
    city varchar(255) null)
go

...and the file, "file.txt"

1,Line 1
2,Line 2
3,Line 3
4,Line 4
5,Line 5

Be careful not to have a blank line at the end of the file.

Use "Stream EDitor" to add the extra column, in this case "CityA"

cat file.txt | sed s/$/\,CityA/g > file_2.txt
cat file_2.txt
1,Line 1,CityA
2,Line 2,CityA
3,Line 3,CityA
4,Line 4,CityA
5,Line 5,CityA

Ensure the database is configured for bulk copy; your DBA can assist with this.

use master
go
sp_dboption 'db_name', 'select', true
go

Then use Sybase's bcp utility to load the file:

bcp database.owner.table in file_2.txt -U login -S server -c -t, -Y -b 1000

The parameters are as follows:

  • database = Database name
  • owner = Object owner
  • table = Table name
  • in = Direction (loading into the table)
  • file_2.txt = File name
  • -U = User name
  • -S = Server name (the Sybase instance name rather than the physical host name)
  • -c = Use character data
  • -t, = Field terminator is ,
  • -Y = Client-side character set conversion - may not be required
  • -b 1000 = Commit 1000 rows at a time. If you're loading 500MB you probably want this so as not to hit LOG_SUSPEND.
