
Importing 11 million rows from PostgreSQL to Pandas/Python

I am trying to load 11 million records from a PostgreSQL database hosted on an AWS server. I have tried using pandas read_sql, and it takes about 4 hours to get the result. My laptop has 32 GB of RAM and a 7th-gen Core i7. I have also set the chunk size to 10000, but that does not improve the crazy runtime. I have looked at many articles online and tried all of them, but none of them speeds up my process.

Ideally I want to load this data in under 20 minutes, or in the shortest time possible. I need this data in a dataframe so that I can do some merges with other files that I have, and if I can fetch the data in Python, I can automate my process. My code is shown below:

from io import StringIO
import psycopg2
import psycopg2.sql as sql
import pandas as pd
import numpy as np
import time


connection = psycopg2.connect(user="abc",
                                      password="efg",
                                      host="123.amazonaws.com",
                                      port="5432",
                                      database="db")

date='2020-03-01'
columns= '"LastName","FirstName","DateOfBirth","PatientGender","Key"'

# Note the leading space before 'limit' so the concatenated SQL stays valid
postgreSQL_select_Query = 'select ' + columns + ' from "Table" where "CreatedDate"::date>=' + "'" + date + "'" + ' limit 11000000'


x=pd.read_sql_query(postgreSQL_select_Query, connection, index_col=None, coerce_float=True, params=None, parse_dates=None, chunksize=10000)
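
Note that when chunksize is supplied, pandas.read_sql_query returns an iterator of DataFrames rather than a single DataFrame, so the chunks still have to be materialized and concatenated. A minimal sketch of how that iterator would typically be consumed:

chunks = pd.read_sql_query(postgreSQL_select_Query, connection, chunksize=10000)
x = pd.concat(chunks, ignore_index=True)  # materialize all chunks into one DataFrame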

Please suggest what I can do to improve this code and reduce the runtime.

I am also attaching another code segment that I am using to do this, but it gives the same result: it fetches the rows in HOURS. Any guidance would be greatly appreciated.

Second approach:

# -*- coding: utf-8 -*-
"""
@author: ssullah
"""
from io import StringIO
import psycopg2
import psycopg2.sql as sql
import pandas as pd
import numpy as np
import time

start = time.time()
print("Started")

# Retrieving records from the DB
def getdata():  
    try:
        start = time.time()
        print("Started")
        connection = psycopg2.connect(user="a",
                                      password="as",
                                      host="aws",
                                      port="5432",
                                      database="as")


        date='2020-03-01'
        columns= '"LastName","FirstName","DateOfBirth","PatientGender","Key"'

        # Note the leading space before 'limit' so the concatenated SQL stays valid
        postgreSQL_select_Query = 'select ' + columns + ' from "ALLADTS" where "CreatedDate"::date>=' + "'" + date + "'" + ' limit 11000000'

        cur = connection.cursor('cursor-name') # server-side (named) cursor
        cur.itersize = 10000 # how many records to buffer on the client per round trip
        cur.execute(postgreSQL_select_Query)

        # fetchall() pulls every row into memory at once, which defeats the purpose
        # of the server-side cursor; iterating over cur would stream the rows instead
        mobile_records = cur.fetchall()


        # Column names as per the schema defined above
        col_names = ["LastName", "FirstName", "DateOfBirth", "PatientGender", "Key"]

        # Create the dataframe; col_names must be passed as the columns keyword
        # (the second positional argument of pd.DataFrame is the index, not the columns)
        records = pd.DataFrame(mobile_records, columns=col_names)

        return records


    except (Exception, psycopg2.Error) as error:
        print("Error while fetching data from PostgreSQL", error)

    finally:
        # closing database connection
        if connection:
            cur.close()
            connection.close()
            print("PostgreSQL connection is closed")


records=getdata()
end = time.time()
print("The total time:", (end - start)/60, 'minutes')

Update:

Instead of loading the data into Python, I decided to use Python to create a temporary table in PostgreSQL and load the new file from pandas into PostgreSQL. Once the table was populated by a query issued from Python, I was able to query it and get the desired output as the final result back in a pandas dataframe.

All of this took 1.4 minutes, and the same query takes 30 minutes to run in pgAdmin, so by leveraging Python and doing the calculation with a SQL query written in Python, I was able to dramatically speed up the process and, at the same time, avoid holding 11 million records in memory. Thank you for your advice.
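
A minimal sketch of that workflow, assuming a SQLAlchemy engine and illustrative file, table, and query names (the actual staging table, input file, and aggregate query are not shown in the post):

from sqlalchemy import create_engine
import pandas as pd

engine = create_engine('postgresql+psycopg2://abc:efg@123.amazonaws.com:5432/db')

# Push the local file into a staging table on the server (hypothetical file name)
local_df = pd.read_csv('new_file.csv')
local_df.to_sql('staging_table', engine, if_exists='replace', index=False)

# Do the heavy join/aggregation on the database side and pull back only the result
final_query = ('SELECT s."Key", COUNT(*) AS n '
               'FROM "ALLADTS" a JOIN staging_table s ON a."Key" = s."Key" '
               'GROUP BY s."Key"')  # illustrative aggregate, not the original query
result = pd.read_sql_query(final_query, engine)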

