How do I make a faster query from pandas to PostgreSQL?

I have a CSV file and I have to check which of its rows are in the database. From my CSV I have to use name, surname, and birthdate to find the university name in the DB. For example:

[Image: enter image description here]

[Image: enter image description here]

From this image example, I should find that XXX YYY studies at university 1 and AAA BBB at university 2, and get no result for TTT YYY.

My solution, shown below, is very slow. The CSV file has 50k lines and the DB table has 40M rows.

I use Python pandas to read the CSV file, then I create a new column that combines the name, surname, and birthdate. Example value from the new combined column: "XXX+YYYY+29-05-1953"

Then I build a list of all values in the new combined column. Let's say the list is: combine_list = data[new_column].tolist()

And now my amazing query :)

query = f"""Select concat(name, '+', surname, '+', birthdate) as new_column, university
        from db_table where name is not NULL and surname is not NULL and birthdate is not NULL
        and concat(name, '+', surname, '+', birthdate) in {tuple(combine_list)}"""

Could you please give me some advice on how to find them faster?

You could query the columns as a tuple:

Select concat(name ,'+',surname,'+',birthdate) as new_column, university
from db_table
where (name, surname, birthdate) IN (('XXX', 'YYY', '29-05-1953'),
                                     ('AAA', 'BBB', '01-01-1997'), ...)

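This should be faster than comparing concatenated values. Wrapping the columns in concat() makes PostgreSQL compute the expression for each row it examines, and an ordinary index on name, surname, and birthdate cannot be used for that comparison. The tuple comparison works on the columns directly, so it can use an index over the columns in the WHERE clause if one exists.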
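As a rough sketch of how the tuple query could be generated from Python, the helper below builds the IN list with placeholders and passes the values as parameters instead of pasting them into the SQL string. It assumes a DB-API driver that uses %s placeholders (psycopg2 is one such driver), that the birthdate values are in a form the column accepts, and that splitting the 50k keys into chunks of 1000 is acceptable. The function names, the index name, and the chunk size are hypothetical and not part of the original question or answer.

```python
from itertools import islice

# Hypothetical index name; table and column names are taken from the question.
CREATE_INDEX_SQL = (
    "CREATE INDEX IF NOT EXISTS idx_db_table_person "
    "ON db_table (name, surname, birthdate)"
)


def chunked(rows, size):
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def build_lookup_query(keys):
    """Return (sql, params) for a row-value IN lookup.

    `keys` is a list of (name, surname, birthdate) tuples. The values are
    returned as a flat parameter list matching the %s placeholders.
    """
    if not keys:
        raise ValueError("keys must not be empty")
    placeholders = ", ".join(["(%s, %s, %s)"] * len(keys))
    sql = (
        "SELECT name, surname, birthdate, university "
        "FROM db_table "
        f"WHERE (name, surname, birthdate) IN ({placeholders})"
    )
    params = [value for key in keys for value in key]
    return sql, params


def find_universities(cursor, keys, chunk_size=1000):
    """Run the lookup chunk by chunk on a DB-API cursor; collect all rows."""
    results = []
    for chunk in chunked(keys, chunk_size):
        sql, params = build_lookup_query(chunk)
        cursor.execute(sql, params)
        results.extend(cursor.fetchall())
    return results
```

With pandas, the keys could be taken from the DataFrame with something like list(data[["name", "surname", "birthdate"]].itertuples(index=False, name=None)), assuming the CSV columns have those names. CSV rows that have no match, such as TTT YYY in the example, simply do not appear in the returned rows.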
