
How to read multiple files in a directory, all of which are csv.gzip with Airflow S3 Hook or boto3?

I have a directory in S3, let's say s3://test-bucket/test-folder/2020-08-28/ which has files as such:

2020-08-28 03:29:13   29397684 data_0_0_0.csv.gz
2020-08-28 03:29:13   29000150 data_0_1_0.csv.gz
2020-08-28 03:29:13   38999956 data_0_2_0.csv.gz
2020-08-28 03:29:13   32079942 data_0_3_0.csv.gz
2020-08-28 03:29:13   34154791 data_0_4_0.csv.gz
2020-08-28 03:29:13   45348128 data_0_5_0.csv.gz
2020-08-28 03:29:13   60904419 data_0_6_0.csv.gz

I'm trying to create an Airflow operator using an S3 hook ( https://airflow.readthedocs.io/en/stable/_modules/airflow/hooks/S3_hook.html ) which will dump the content of these files somewhere. I tried:

contents = s3.read_key(key='s3://test-bucket/test-folder/2020-08-28/')
contents = s3.read_key(key='s3://test-bucket/test-folder/2020-08-28/data_0_0_0.csv')
contents = s3.read_key(key='s3://test-bucket/test-folder/2020-08-28/data_0_0_0.csv.gz')

None of these seem to work. I noticed there's s3.select_key but that doesn't seem to have the right parameters, only input and output serialization. Any way to import this data using S3 hook without doing anything to the files themselves?

My next problem is that there are a bunch of files within the folder s3://test-bucket/test-folder/2020-08-28/ . I tried using list_keys but it's not liking the bucket name:

keys = s3.list_keys('s3://test-bucket/test-folder/2020-08-28/')

gives

Invalid bucket name "s3://test-bucket/test-folder/2020-08-28/": Bucket name must match the regex "^[a-zA-Z0-9.\-_]{1,255}$"

I have also tried the same thing without the "s3://" prefix. It's not giving me an authentication error at any point. When I put the .csv.gz key in the read_key call above, it tells me

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

which I'm assuming has to do with the fact that it's gzipped?

So how can I 1. read keys from S3 which are compressed csv files, and 2. how can I read all the csv files at once within a given directory?

Assuming you're reading the files from a directory like s3://your_bucket/your_directory/YEAR-MONTH-DAY/, you could do two things:

  • Read Paths to Data. Read the paths to the .csv.gz files in each subdirectory.

  • Load the Data. In this example, we're going to load them as a pandas.DataFrame, but alternatively you can leave them as gzip objects.

1.A Read the paths with Airflow S3 Hook

# Initialize the s3 hook
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
s3_hook = S3Hook()

# Read the keys from s3 bucket
paths = s3_hook.list_keys(bucket_name='your_bucket_name', prefix='your_directory')
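
As a concrete sketch against the bucket and folder from the question (the connection id below is an assumption; use whichever Airflow connection you have configured), filtering down to the gzipped csv files:

from airflow.providers.amazon.aws.hooks.s3 import S3Hook

# 'aws_default' is an assumed connection id
s3_hook = S3Hook(aws_conn_id='aws_default')

# Pass the bucket name without the 's3://' scheme, and the folder as a prefix
keys = s3_hook.list_keys(bucket_name='test-bucket',
                         prefix='test-folder/2020-08-28/')

# Keep only the gzipped csv files
csv_gz_keys = [k for k in keys if k.endswith('.csv.gz')]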

Under the hood, list_keys uses a paginator to list the keys. That brings us to the second way of reading in the list of paths.

1.B Read the paths with a Paginator

In the case of the paginator, for example, if you want to list the objects s3://your_bucket/your_directory/item.csv.gz, ..., etc., then the paginator works like this (example taken from the boto3 docs):

import boto3

client = boto3.client('s3', region_name='us-west-2')
paginator = client.get_paginator('list_objects')
operation_parameters = {'Bucket': 'your_bucket',
                        'Prefix': 'your_directory'}
page_iterator = paginator.paginate(**operation_parameters)
for page in page_iterator:
    print(page['Contents'])

This will output a list of dictionaries, from which you can take the Key of each dictionary to obtain the list of paths to read. That is, the paginator will return something like:

[{'Key': 'your_directory/file_1.csv.gz',
  ...},
 ...,
 {'Key': 'your_directory/file_n.csv.gz',
  ...}]
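
To turn that output into a flat list of keys, keeping only the .csv.gz files, a minimal sketch reusing the paginator from above could be:

# Re-create the iterator, since the print loop above already consumed it
page_iterator = paginator.paginate(**operation_parameters)

paths = []
for page in page_iterator:
    for obj in page.get('Contents', []):    # 'Contents' is absent on empty pages
        if obj['Key'].endswith('.csv.gz'):  # keep only the gzipped csv files
            paths.append(obj['Key'])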

Now we get to a third way to do this, which is similar to the previous one.

1.C Read the paths with the boto3 client

To read the paths, consider the following function

import boto3 

s3_client = boto3.client('s3')

def get_all_s3_objects(s3_client, **base_kwargs):
    continuation_token = None
    while True:
        list_kwargs = dict(MaxKeys=1000, **base_kwargs)
        if continuation_token:
            list_kwargs['ContinuationToken'] = continuation_token
        response = s3_client.list_objects_v2(**list_kwargs)
        yield from response.get('Contents', [])
        if not response.get('IsTruncated'):  # At the end of the list?
            break
        continuation_token = response.get('NextContinuationToken')

When you call this function with your bucket name and the prefix, for example:

files = get_all_s3_objects(s3_client, Bucket='your_bucket_name', Prefix='your_directory/YEAR-MONTH-DAY')
paths = [f['Key'] for f in files]

Calling paths will give you a list of keys for the .csv.gz files. In your case, this will be something like:

['test-folder/2020-08-28/data_0_0_0.csv.gz',
 'test-folder/2020-08-28/data_0_1_0.csv.gz',
 'test-folder/2020-08-28/data_0_2_0.csv.gz']

You can then take these keys as input to the following function to read your data as a pandas DataFrame, for example.

2. Load data

Consider the function

import gzip
from io import BytesIO

import pandas as pd

def load_csv_gzip(s3_client, bucket, key):
    with BytesIO() as f:
        # Download the gzipped object into an in-memory buffer
        s3_client.download_fileobj(Bucket=bucket,
                                   Key=key,
                                   Fileobj=f)
        f.seek(0)
        # Wrap the buffer so pandas reads the decompressed csv
        gzip_fd = gzip.GzipFile(fileobj=f)
        return pd.read_csv(gzip_fd)

Finally, given the list of .csv.gz keys, you can iteratively load each path and concatenate the results into a single pandas DataFrame, or just load a single .csv.gz file. For example:

data = pd.concat([load_csv_gzip(s3_client, 'your_bucket', p) for p in paths])

where each element of paths would be something like your_subdirectory/2020-08-28/your_file.csv.gz.
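
If you prefer to stay within the Airflow hook for the loading step as well, S3Hook.get_key returns a boto3 S3.Object, so a rough equivalent would be (a sketch only; the bucket and key below are placeholders based on the question):

from io import BytesIO

import pandas as pd
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

s3_hook = S3Hook()

def load_csv_gzip_with_hook(hook, bucket, key):
    # get_key returns a boto3 S3.Object; .get()['Body'].read() yields the raw gzip bytes
    body = hook.get_key(key=key, bucket_name=bucket).get()['Body'].read()
    # pandas can decompress gzip in memory when the compression is given explicitly
    return pd.read_csv(BytesIO(body), compression='gzip')

df = load_csv_gzip_with_hook(s3_hook, 'test-bucket',
                             'test-folder/2020-08-28/data_0_0_0.csv.gz')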
