
How to unzip and load a TSV file into BigQuery from a GCS bucket

Below is the code to get the tsv.gz file from GCS, unzip it, and convert it into a comma-separated CSV file in order to load the CSV data into BigQuery.

import gzip

import gcsfs
import pandas as pd
from google.cloud import storage

storage_client = storage.Client(project=project_id)
blobs_list = list(storage_client.list_blobs(bucket_name))

gcs_file_system = gcsfs.GCSFileSystem(project=project_id)
for blob in blobs_list:
    if blob.name.endswith(".tsv.gz"):
        source_file = blob.name
        uri = "gs://{}/{}".format(bucket_name, source_file)
        with gcs_file_system.open(uri) as f:
            # Decompress the gzip stream and parse it as tab-separated values
            gzf = gzip.GzipFile(mode="rb", fileobj=f)
            csv_table = pd.read_table(gzf)
            # Write a local comma-separated copy
            csv_table.to_csv("GfG.csv", index=False)
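For completeness, a minimal sketch of the missing load step, assuming a table_id of the form "your-project.your_dataset.your_table_name" (not defined in the snippet above), could push each parsed DataFrame into BigQuery without writing a local CSV at all:

from google.cloud import bigquery

bq_client = bigquery.Client(project=project_id)
# table_id = "your-project.your_dataset.your_table_name"  # hypothetical destination table

# csv_table is the DataFrame produced by pd.read_table(gzf) above (requires pyarrow)
load_job = bq_client.load_table_from_dataframe(csv_table, table_id)
load_job.result()  # Wait for the load job to finish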

The code does not seem effective for loading the data into BQ, as I am running into many issues. I think I am doing something wrong with the file conversion. Could you share your thoughts on where it went wrong?

If your file is gzip (not zip, I mean gzip) and it is in Cloud Storage, don't download it, unzip it, and stream-load it.

You can load it directly, as is, into BigQuery, it's magic!! Here is a sample

from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name"

job_config = bigquery.LoadJobConfig(
    autodetect=True,  # Automatic schema detection
    field_delimiter=",",  # Use "\t" if the separator in your TSV file is a tab
    skip_leading_rows=1,  # Skip the header row (but keep it for column naming)
    # The source format defaults to CSV, so the line below is optional.
    source_format=bigquery.SourceFormat.CSV,
)
uri = "gs://{}/{}".format(bucket_name, source_file)

load_job = client.load_table_from_uri(
    uri, table_id, job_config=job_config
)  # Make an API request.

load_job.result()  # Waits for the job to complete.
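For the .tsv.gz files from the question, a minimal sketch (assuming the same project_id, bucket_name, and table_id variables as above) could reuse the listing loop and simply switch the delimiter to tab; BigQuery can load gzip-compressed CSV/TSV sources from GCS directly:

from google.cloud import bigquery, storage

client = bigquery.Client(project=project_id)
storage_client = storage.Client(project=project_id)

job_config = bigquery.LoadJobConfig(
    autodetect=True,
    field_delimiter="\t",  # Tab-separated source file
    skip_leading_rows=1,   # Skip the header row
    source_format=bigquery.SourceFormat.CSV,
)

# Load every gzipped TSV in the bucket straight into the table
for blob in storage_client.list_blobs(bucket_name):
    if blob.name.endswith(".tsv.gz"):
        uri = "gs://{}/{}".format(bucket_name, blob.name)
        load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
        load_job.result()  # Wait for each load job to finish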
