Read a .gz file from Google Cloud storage via Python (Jupyter)

I'm trying to read a .gz file from Google Cloud Storage via Python in a Jupyter notebook.

My first attempt raises this error:

TypeError: can't concat str to bytes

from google.cloud import storage
import pandas as pd
from io import StringIO

client = storage.Client()
bucket = client.get_bucket("nttcomware")
blob = bucket.get_blob(f"test.csv.gz")
df = pd.read_csv(s, compression='gzip', float_precision="high")
df.head()

My second attempt raises a different error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

from google.cloud import storage
import pandas as pd
from io import StringIO

client = storage.Client()
bucket = client.get_bucket("nttcomware")
blob = bucket.get_blob(f"test.csv.gz")
bt = blob.download_as_string()
s = str(bt, "utf-8")
s = StringIO(s)
df = pd.read_csv(s, compression='gzip', float_precision="high")
df.head()

Any suggestions?

Fortunately, I solved it myself. I hope this helps others.
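
For context, the UnicodeDecodeError in the second attempt seems to come from decoding the bytes while they are still gzip-compressed: a gzip stream starts with the magic bytes 0x1f 0x8b, and 0x8b is not a valid UTF-8 start byte. Here is a minimal sketch that reproduces this without touching Cloud Storage. The sample CSV and the variable names are made up for illustration, and the in-memory bytes are only a stand-in for what blob.download_as_string() returns.

```python
import gzip

# Stand-in for blob.download_as_string(): gzip-compressed CSV bytes built in memory.
csv_text = "id,value\n1,0.5\n2,1.25\n"
compressed = gzip.compress(csv_text.encode("utf-8"))

# Every gzip stream starts with the two magic bytes 0x1f 0x8b.
magic = compressed[:2]

# Decoding the still-compressed bytes fails on the 0x8b byte at position 1.
try:
    compressed.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

# Decompressing first and decoding afterwards gives back the original text.
decoded = gzip.decompress(compressed).decode("utf-8")
```

So the order matters: decompress the bytes first, then decode them as UTF-8.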

import gzip
import io
from io import StringIO

import pandas as pd
from google.cloud import storage

client = storage.Client()

# get the bucket
bucket = client.get_bucket("nttcomware")

# get the blob object
blob_name = "test.csv.gz"
blob = bucket.get_blob(blob_name)

# download the blob as bytes and wrap it in a BytesIO object; still gzip-compressed
data = io.BytesIO(blob.download_as_string())

# open the gzip stream and read the decompressed CSV
with gzip.open(data) as gz:
    # decompressed contents, still bytes
    file = gz.read()

# decode the bytes into a string as UTF-8 and wrap it in a StringIO object
s = StringIO(file.decode('utf-8'))

df = pd.read_csv(s, float_precision="high")
df.head()
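
A possibly shorter variant, assuming a reasonably recent pandas version: read_csv can decompress a binary buffer itself when compression='gzip' is passed, so the manual gzip and decode steps may not be needed. The sketch below uses in-memory bytes as a stand-in for the downloaded blob (the sample data is made up). With a real blob, the raw value would come from blob.download_as_string(), or from download_as_bytes(), which I believe is the preferred name in newer google-cloud-storage releases.

```python
import gzip
import io

import pandas as pd

# Stand-in for the bytes returned by blob.download_as_string() (hypothetical sample data).
raw = gzip.compress(b"id,value\n1,0.5\n2,1.25\n")

# Pass the compressed bytes as a binary buffer and let pandas handle the gzip layer.
df = pd.read_csv(io.BytesIO(raw), compression="gzip", float_precision="high")
```

The key difference from the failing attempts is that the buffer is a BytesIO holding the compressed bytes, not a StringIO holding decoded text.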
