How to decompress a TSV file from a GCS bucket and load it into BigQuery

Question

The code below fetches a tsv.gz file from GCS, decompresses it, and converts it to a comma-separated CSV file so that the CSV data can be loaded into BigQuery.

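import gzip

import gcsfs
import pandas as pd
from google.cloud import storage

# project_id and bucket_name are assumed to be defined elsewhere.
storage_client = storage.Client(project=project_id)
gcs_file_system = gcsfs.GCSFileSystem(project=project_id)
for blob in storage_client.list_blobs(bucket_name):
    if blob.name.endswith(".tsv.gz"):
        uri = "gs://{}/{}".format(bucket_name, blob.name)
        with gcs_file_system.open(uri) as f:
            # Decompress the gzip stream and parse it as tab-separated data.
            gzf = gzip.GzipFile(mode="rb", fileobj=f)
            csv_table = pd.read_table(gzf)
            # Every file is written to the same local path, so each one overwrites the last.
            csv_table.to_csv("GfG.csv", index=False)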

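This code does not seem to load the data into BigQuery properly: a lot of problems come up. I suspect the file conversion is being done wrong. Any thoughts on where it goes wrong?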
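To check whether the tab-to-comma conversion itself is at fault, it can help to run it on a small in-memory sample first. The following is only a sketch using the standard library; the helper name `tsv_gz_to_csv_text` and the sample data are made up for illustration, and UTF-8 input is assumed.

```python
import csv
import gzip
import io


def tsv_gz_to_csv_text(compressed: bytes) -> str:
    """Decompress gzip-compressed TSV bytes and return comma-separated text."""
    out = io.StringIO()
    # QUOTE_MINIMAL (the default) quotes any field that contains a comma.
    writer = csv.writer(out, lineterminator="\n")
    with gzip.open(io.BytesIO(compressed), mode="rt", encoding="utf-8", newline="") as tsv:
        for row in csv.reader(tsv, delimiter="\t"):
            writer.writerow(row)
    return out.getvalue()


# Hypothetical sample: one header row and one data row with a comma in a field.
sample = gzip.compress(b"id\tnote\n1\tred, green\n")
print(tsv_gz_to_csv_text(sample))
```

Fields that contain commas come out quoted, which is one place where a tab-to-comma conversion can change how the rows are later parsed.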

Answer 1

If your file is gzip-compressed (not zip, I really do mean gzip) and it sits in Cloud Storage, don't download it, decompress it, and stream it in yourself.

You can load it into BigQuery directly, as is. It's magic!! Here is a sample:

from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name"

job_config = bigquery.LoadJobConfig(
    autodetect=True,  # Detect the schema automatically
    field_delimiter="\t",  # Tab, because the source is a TSV file; use "," for comma-separated data
    skip_leading_rows=1,  # Skip the header row (autodetect can still use it for column names)
    # The source format defaults to CSV, so the line below is optional.
    source_format=bigquery.SourceFormat.CSV,
)
# bucket_name and source_file are the same values used in the question.
uri = "gs://{}/{}".format(bucket_name, source_file)

load_job = client.load_table_from_uri(
    uri, table_id, job_config=job_config
)  # Make an API request.

load_job.result()  # Waits for the job to complete.
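To cover every .tsv.gz object in the bucket, as the loop in the question does, the sample above could be extended roughly as follows. This is a sketch, not tested against a real project: the function names `gzip_tsv_uris` and `load_gzip_tsv_files` are hypothetical, and it assumes all files share one schema and go into a single table.

```python
def gzip_tsv_uris(blob_names, bucket_name):
    """Return gs:// URIs for every object name ending in .tsv.gz."""
    return [
        "gs://{}/{}".format(bucket_name, name)
        for name in blob_names
        if name.endswith(".tsv.gz")
    ]


def load_gzip_tsv_files(project_id, bucket_name, table_id):
    """Load every .tsv.gz object in the bucket into one BigQuery table."""
    # Imported here so the URI helper above works without the client libraries.
    from google.cloud import bigquery, storage

    storage_client = storage.Client(project=project_id)
    names = [blob.name for blob in storage_client.list_blobs(bucket_name)]
    uris = gzip_tsv_uris(names, bucket_name)
    if not uris:
        return None  # Nothing to load.

    job_config = bigquery.LoadJobConfig(
        autodetect=True,
        field_delimiter="\t",  # Assumption: the files are tab-separated
        skip_leading_rows=1,
        source_format=bigquery.SourceFormat.CSV,
    )
    client = bigquery.Client(project=project_id)
    # load_table_from_uri accepts a single URI or a sequence of URIs.
    load_job = client.load_table_from_uri(uris, table_id, job_config=job_config)
    load_job.result()  # Waits for the job to complete.
    return client.get_table(table_id).num_rows
```

Passing all the URIs to one load job keeps the decompression on the BigQuery side, so nothing is downloaded or written locally.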

Solution 1
1 2022-12-14 20:36:57
