Read ORC file from S3 to Pandas

I'm trying to read an ORC file from S3 into a Pandas DataFrame. In my version of pandas there is no pd.read_orc(...).

I tried to do this:

import boto3

session = boto3.Session()
s3_client = session.client('s3')

s3_key = "my_object_key"

data = s3_client.get_object(
    Bucket='my_bucket',
    Key=s3_key
)

orc_bytes = data['Body'].read()

This reads the object as bytes.

Now I try to do this:

orc_data = pyorc.Reader(orc_bytes)

But it fails with:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-11-deaabe8232ce> in <module>
----> 1 data = pyorc.Reader(orc_data)

/anaconda3/envs/linear_opt_3.7/lib/python3.7/site-packages/pyorc/reader.py in __init__(self, fileo, batch_size, column_indices, column_names, struct_repr, converters)
     65             conv = converters
     66         super().__init__(
---> 67             fileo, batch_size, column_indices, column_names, struct_repr, conv
     68         )
     69 

TypeError: Parameter must be a file-like object, but `<class 'bytes'>` was provided

Eventually I would like to land it as .csv or something else I can read into pandas. Is there a better way to do this?

Try wrapping the S3 data in an io.BytesIO:

import io

orc_bytes = io.BytesIO(data['Body'].read())
orc_data = pyorc.Reader(orc_bytes)
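
To see why the wrapper helps: a `bytes` object has no `read()` or `seek()` methods, whereas `io.BytesIO` exposes the same in-memory payload through the file interface that the error message asks for. A minimal sketch using only the standard library; the payload is a placeholder rather than real ORC data, and `as_file_like` is a hypothetical helper name:

```python
import io


def as_file_like(payload: bytes) -> io.BytesIO:
    """Wrap raw bytes in a seekable, in-memory binary stream."""
    return io.BytesIO(payload)


# Placeholder bytes standing in for data['Body'].read(); not a valid ORC file.
payload = b"placeholder bytes"
buffer = as_file_like(payload)

print(hasattr(payload, "read"))  # does a plain bytes object have read()?
print(hasattr(buffer, "read"))   # does the BytesIO wrapper have read()?
print(buffer.seekable())         # can a reader jump around in the buffer?
```

Note that this approach holds the entire object in memory, which is fine as long as the file fits in RAM.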

Here's a function that solves the problem end to end:

import io

import boto3
import pandas as pd
import pyorc

session = boto3.Session()
s3_client = session.client('s3')

def load_s3_orc_to_local_df(key, bucket):
    # Download the whole object and wrap it in a file-like buffer
    data = s3_client.get_object(Bucket=bucket, Key=key)
    orc_bytes = io.BytesIO(data['Body'].read())
    reader = pyorc.Reader(orc_bytes)
    # Field names of the ORC schema become the column names
    columns = list(reader.schema.fields)
    # Read every row into memory
    rows = list(reader)
    return pd.DataFrame(data=rows, columns=columns)
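
Since the question also asks about landing the data as CSV, here is a sketch that streams rows to CSV with the standard `csv` module instead of building a DataFrame first. `write_rows_to_csv` is a hypothetical helper, not part of pyorc or pandas; it assumes each row is a tuple in column order, and the sample rows are made up for illustration:

```python
import csv
import io


def write_rows_to_csv(columns, rows, out):
    """Write a header line plus one CSV line per row to a text stream.

    Returns the number of data rows written.
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(columns)
    count = 0
    for row in rows:  # any iterable works, so rows are written one at a time
        writer.writerow(row)
        count += 1
    return count


# Made-up sample data standing in for an ORC reader's output.
sample_columns = ["id", "name"]
sample_rows = [(1, "alpha"), (2, "beta, gamma")]

out = io.StringIO()
written = write_rows_to_csv(sample_columns, sample_rows, out)
csv_text = out.getvalue()
print(csv_text)
```

With a pyorc reader, this could be called as `write_rows_to_csv(list(reader.schema.fields), reader, f)`, where `f` comes from `open(path, "w", newline="")`. If a DataFrame already exists, `df.to_csv(path, index=False)` gives the same kind of output.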
