
Extract a few million records from Teradata to Python (pandas)

I have 6 months of email data (email properties like send date and subject line, plus recipient details like age, gender etc., altogether around 20 columns) in my Teradata table. It comes to around 20 million rows in total, which I want to bring into Python for further predictive modelling.

I tried to run the selection query using the 'pyodbc' connector, but it just runs for hours and hours. I then stopped it and modified the query to fetch just 1 month of data (maybe 3-4 million rows), but it still takes a very long time.

Is there any better (faster) option than 'pyodbc', or a different approach altogether?

Any input is appreciated. Thanks.

When communicating between Python and Teradata, I recommend using the teradata package (pip install teradata; https://developer.teradata.com/tools/reference/teradata-python-module ). It leverages ODBC (or REST) to connect.
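For illustration, a minimal connection sketch using the module's UdaExec entry point; the system name, credentials, and table name are placeholder assumptions, not values from the question:

    import teradata

    # Minimal sketch, assuming placeholder host/credentials/table.
    udaExec = teradata.UdaExec(appName="EmailExtract", version="1.0",
                               logConsole=False)
    with udaExec.connect(method="odbc", system="tdhost",
                         username="myuser", password="mypassword") as session:
        # The session follows the Python DB API, so iteration works as usual.
        for row in session.execute("SELECT TOP 10 * FROM email_table"):
            print(row)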

Besides this, you could use JDBC via JayDeBeApi. JDBC can sometimes be somewhat faster than ODBC.
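A minimal JayDeBeApi sketch; the JDBC URL, credentials, database, and jar paths are assumptions you would replace with your own (terajdbc4.jar and tdgssconfig.jar ship with the Teradata JDBC driver):

    import jaydebeapi

    # Minimal sketch; URL, credentials and jar paths are placeholders.
    conn = jaydebeapi.connect(
        "com.teradata.jdbc.TeraDriver",
        "jdbc:teradata://tdhost/DATABASE=mydb",
        ["myuser", "mypassword"],
        ["/path/to/terajdbc4.jar", "/path/to/tdgssconfig.jar"],
    )
    cursor = conn.cursor()
    cursor.execute("SELECT COUNT(*) FROM email_table")
    print(cursor.fetchone())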

Both options support the Python Database API Specification, so the rest of your code doesn't have to be touched. E.g. pandas.read_sql works fine with connections from either of the above.
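A sketch of how that could look with chunked reading; table and column names are assumptions. chunksize turns read_sql into an iterator of DataFrames, so pandas never materializes all 20 million rows in a single fetch:

    import pandas as pd

    # conn is any DB API connection from the options above.
    chunks = pd.read_sql(
        "SELECT send_date, subject_line, age, gender FROM email_table",
        conn,
        chunksize=100000,  # fetch 100k rows per DataFrame
    )
    df = pd.concat(chunks, ignore_index=True)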

Your performance issues look like they stem from something else:

  1. network connectivity

  2. Python (pandas) memory handling

Ad 1) throughput can only be replaced with more throughput.

Ad 2) you could try to do as much as possible within the database (feature engineering), and your local machine should have enough RAM ("pandas rule of thumb: have 5 to 10 times as much RAM as the size of your dataset"). Maybe Apache Arrow can relieve some of your local RAM issues; see the sketch of the in-database approach below.
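As one example of what "do as much as possible within the database" might look like: aggregate in Teradata so that only the reduced feature table crosses the network, instead of 20 million raw rows. The table and column names below are assumptions based on the columns mentioned in the question:

    import pandas as pd

    # Push the feature engineering into Teradata; only the small
    # aggregated result set travels over the (slow) network link.
    query = """
        SELECT age, gender,
               COUNT(*)       AS emails_received,
               MIN(send_date) AS first_contact,
               MAX(send_date) AS last_contact
        FROM email_table
        GROUP BY age, gender
    """
    features = pd.read_sql(query, conn)  # a few rows instead of 20 million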

