
Importing 11 million rows from PostgreSQL to Pandas/Python

I am trying to load 11 million records from a PostgreSQL database hosted on an AWS server. I have tried using pandas read_sql, and it takes about 4 hours to get the result. My laptop has 32 GB of RAM and a 7th-gen Core i7. I have also set the chunk size to 10000, but that does not improve the crazy runtime. I have looked at many articles online and tried all of them, but none of them speeds up my process.

Ideally I want to load this data in under 20 minutes, or in the shortest time possible. I need this data in a dataframe so that I can do some merges with other files that I have, and if I can fetch the data in Python, I can automate my process. My code is shown below:

from io import StringIO
import psycopg2
import psycopg2.sql as sql
import pandas as pd
import numpy as np
import time


connection = psycopg2.connect(user="abc",
                                      password="efg",
                                      host="123.amazonaws.com",
                                      port="5432",
                                      database="db")

date='2020-03-01'
columns= '"LastName","FirstName","DateOfBirth","PatientGender","Key"'

# Note the leading space before 'limit' so the concatenated SQL stays valid
postgreSQL_select_Query = 'select ' + columns + ' from "Table" where "CreatedDate"::date>=' + "'" + date + "'" + ' limit 11000000'


x=pd.read_sql_query(postgreSQL_select_Query, connection, index_col=None, coerce_float=True, params=None, parse_dates=None, chunksize=10000)
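
Note that when chunksize is supplied, pandas.read_sql_query returns an iterator of DataFrames rather than a single DataFrame, so the chunks still have to be materialized and concatenated. A minimal sketch of how that iterator would typically be consumed:

chunks = pd.read_sql_query(postgreSQL_select_Query, connection, chunksize=10000)
x = pd.concat(chunks, ignore_index=True)  # materialize all chunks into one DataFrame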

Please suggest what I can do to improve this code and reduce the runtime.

I am also attaching another code segment that I am using to do this, but it gives the same result: it fetches the rows in HOURS. Any guidance would be greatly appreciated.

Second approach:

# -*- coding: utf-8 -*-
"""
@author: ssullah
"""
from io import StringIO
import psycopg2
import psycopg2.sql as sql
import pandas as pd
import numpy as np
import time

start = time.time()
print("Started")

# Retrieving records from the DB
def getdata():  
    try:
        start = time.time()
        print("Started")
        connection = psycopg2.connect(user="a",
                                      password="as",
                                      host="aws",
                                      port="5432",
                                      database="as")


        date='2020-03-01'
        columns= '"LastName","FirstName","DateOfBirth","PatientGender","Key"'

        # Note the leading space before 'limit' so the concatenated SQL stays valid
        postgreSQL_select_Query = 'select ' + columns + ' from "ALLADTS" where "CreatedDate"::date>=' + "'" + date + "'" + ' limit 11000000'

        cur = connection.cursor('cursor-name') # server-side (named) cursor
        cur.itersize = 10000 # how many records to buffer on the client per round trip
        cur.execute(postgreSQL_select_Query)

        # fetchall() pulls every row into memory at once, which defeats the purpose
        # of the server-side cursor; iterating over cur would stream the rows instead
        mobile_records = cur.fetchall()


        # Column names as per the schema defined above
        col_names = ["LastName", "FirstName", "DateOfBirth", "PatientGender", "Key"]

        # Create the dataframe; col_names must be passed as the columns keyword
        # (the second positional argument of pd.DataFrame is the index, not the columns)
        records = pd.DataFrame(mobile_records, columns=col_names)

        return records


    except (Exception, psycopg2.Error) as error:
        print("Error while fetching data from PostgreSQL", error)

    finally:
        # closing database connection
        if connection:
            cur.close()
            connection.close()
            print("PostgreSQL connection is closed")


records=getdata()
end = time.time()
print("The total time:", (end - start)/60, 'minutes')

Update:

Instead of loading the data into Python, I decided to use Python to create a temporary table in PostgreSQL and load the new file from pandas into PostgreSQL. Once the table was populated by a query issued from Python, I was able to query it and get the desired output as the final result back in a pandas dataframe.

All of this took 1.4 minutes, and the same query takes 30 minutes to run in pgAdmin, so by leveraging Python and doing the calculation with a SQL query written in Python, I was able to dramatically speed up the process and, at the same time, avoid holding 11 million records in memory. Thank you for your advice.
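
A minimal sketch of that workflow, assuming a SQLAlchemy engine and illustrative file, table, and query names (the actual staging table, input file, and aggregate query are not shown in the post):

from sqlalchemy import create_engine
import pandas as pd

engine = create_engine('postgresql+psycopg2://abc:efg@123.amazonaws.com:5432/db')

# Push the local file into a staging table on the server (hypothetical file name)
local_df = pd.read_csv('new_file.csv')
local_df.to_sql('staging_table', engine, if_exists='replace', index=False)

# Do the heavy join/aggregation on the database side and pull back only the result
final_query = ('SELECT s."Key", COUNT(*) AS n '
               'FROM "ALLADTS" a JOIN staging_table s ON a."Key" = s."Key" '
               'GROUP BY s."Key"')  # illustrative aggregate, not the original query
result = pd.read_sql_query(final_query, engine)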

