I have 3 region-specific CSV files in a folder, i.e. data_cityA.csv, data_cityB.csv and data_cityC.csv. I have to read each file, identify which region it belongs to, and insert its rows into a table, adding one extra column that contains the region name.
import csv
import glob

list_of_files = glob.glob('./*csv')
for file_name in list_of_files:
    count = 0
    total = 0
    with open(file_name, 'r') as csvfile:
        read = csv.reader(csvfile)
        next(read)  # skip the header row
        if "cityA" in file_name:
            reg = "cityA"
        elif "cityB" in file_name:
            reg = "cityB"
        elif "cityC" in file_name:
            reg = "cityC"
        # second pass just to count the total number of rows
        with open(file_name, 'r') as csv_file:
            reader = csv.reader(csv_file)
            data = list(reader)
            total = len(data)
        temp_data = []
        for row in read:
            row.append(reg)  # append the region name
            temp_data.append(tuple(row))
            count += 1
            total -= 1
            if count > 999 or total == 1:
                # '?' placeholders; the exact paramstyle depends on your driver
                insert_query = "INSERT INTO table_name (A, B, C, D, E) VALUES (?, ?, ?, ?, ?)"
                cursor.executemany(insert_query, temp_data)
                conn.commit()
                count = 0
                temp_data = []
cursor.callproc('any_proc')
conn.close()
It is taking around 4-5 hours to process (the data size is <= 500 MB). I have tried to implement it with Python multiprocessing but wasn't able to do so successfully. I can't use pandas. The database is Sybase. Any ideas? Is there a better way to do it than multiprocessing?
One big problem you have here is this: data = list(reader). This will read the whole file into memory at once. If the file is 500 MB, then 500 MB will be loaded into memory at once. Your other option is to use the reader as an iterator. That comes with the disadvantage that you don't know the total number of records beforehand, so after you exit the loop you must insert the residual rows.
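If it helps, the batch-and-flush pattern over an iterator can also be written with itertools.islice so the residual rows fall out naturally — a minimal sketch (iter_chunks is a helper name introduced here, and the commented usage assumes your existing cursor and insert_query):

```python
import itertools

def iter_chunks(rows, size=1000):
    """Yield lists of up to `size` rows from any iterator,
    without ever holding the whole file in memory."""
    rows = iter(rows)
    while True:
        chunk = list(itertools.islice(rows, size))
        if not chunk:
            break
        yield chunk  # the final chunk carries the residual rows

# Usage sketch (cursor and insert_query assumed from your code):
# with open(file_name, 'r') as csvfile:
#     reader = csv.reader(csvfile)
#     next(reader)  # skip the header
#     for batch in iter_chunks(reader):
#         cursor.executemany(insert_query, batch)
#         conn.commit()
```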
The second thing that may greatly impact your performance is the insertion. You can parallelise it using multiprocessing (below is a suggestion using Pool), but since each worker is a new process, you will have to connect to the database again inside it (and close the connection afterwards).
import csv
import glob
from multiprocessing.pool import Pool

def process_file(file_name):
    # Each worker is a separate process, so make a new connection here
    # conn = ...
    cursor = conn.cursor()
    temp_data = []

    def do_insert():
        # '?' placeholders; the exact paramstyle depends on your driver
        insert_query = "INSERT INTO table_name (A, B, C, D, E) VALUES (?, ?, ?, ?, ?)"
        cursor.executemany(insert_query, temp_data)
        conn.commit()

    with open(file_name, 'r') as csvfile:
        read = csv.reader(csvfile)
        next(read)  # skip the header row
        if "cityA" in file_name:
            reg = "cityA"
        elif "cityB" in file_name:
            reg = "cityB"
        elif "cityC" in file_name:
            reg = "cityC"
        for row in read:
            row.append(reg)  # append the region name
            temp_data.append(tuple(row))
            if len(temp_data) > 999:
                do_insert()
                temp_data = []
        if temp_data:  # insert the residual rows
            do_insert()
    conn.close()

if __name__ == '__main__':
    list_of_files = glob.glob('./*csv')
    pool = Pool()
    pool.map(process_file, list_of_files)
    pool.close()
    pool.join()
    cursor.callproc('any_proc')
    conn.close()
Database round trips are slowing you down.
You're essentially making one round trip for every row, and 500 MB is a lot of rows... so that's a lot of round trips. Check whether Sybase offers a way to supply a CSV file and have it loaded into a table directly: fewer calls (maybe even one) with lots of rows each, rather than one row per call.
Maybe you can consider doing this outside of Python.
Consider the following table...
create table t1 (
k int not null,
v varchar(255) null,
city varchar(255) null)
go
...and the file, "file.txt"
1,Line 1
2,Line 2
3,Line 3
4,Line 4
5,Line 5
Be careful not to have a blank line at the end of the file.
Use sed, the "Stream EDitor", to add the extra column, in this case "CityA":
sed 's/$/,CityA/' file.txt > file_2.txt
cat file_2.txt
1,Line 1,CityA
2,Line 2,CityA
3,Line 3,CityA
4,Line 4,CityA
5,Line 5,CityA
Ensure the database is configured for bulk copy; your DBA can assist with this.
use master
go
sp_dboption 'db_name', 'select into/bulkcopy/pllsort', true
go
Then use Sybase's bcp utility to load the file:
bcp database.owner.table in file_2.txt -U login -S server -c -t, -Y -b 1000
The parameters are as follows:
-U login : the database login
-S server : the server to connect to
-c : character (plain-text) mode
-t, : use a comma as the field terminator
-Y : perform character-set conversion on the client side
-b 1000 : commit in batches of 1000 rows
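If you'd rather drive the same pipeline from Python instead of the shell, the sed step can be replaced with a small staging function and bcp invoked via subprocess — a sketch assuming bcp is on the PATH (stage_file and the login/server values are placeholders, not part of the original answer):

```python
import csv
import subprocess

def stage_file(src, dst, region):
    """Rewrite src into dst with the region appended as a last
    column -- the Python equivalent of the sed step above."""
    with open(src, newline='') as fin, open(dst, 'w', newline='') as fout:
        writer = csv.writer(fout)
        for row in csv.reader(fin):
            writer.writerow(row + [region])

# Then hand the staged file to bcp (same flags as above):
# subprocess.run(
#     ['bcp', 'database.owner.table', 'in', 'file_2.txt',
#      '-U', 'login', '-S', 'server', '-c', '-t,', '-Y', '-b', '1000'],
#     check=True)
```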