
Export BigQuery Data to CSV without using Google Cloud Storage

I am currently writing software to export large amounts of BigQuery data and store the queried results locally as CSV files. I am using Python 3 and the client provided by Google. I did the configuration and authentication, but the problem is that I can't store the data locally. Every time I execute the script, I get the following error message:

googleapiclient.errors.HttpError: https://www.googleapis.com/bigquery/v2/projects/round-office-769/jobs?alt=json returned "Invalid extract destination URI 'response/file-name-*.csv'. Must be a valid Google Storage path."

This is my job configuration:

import uuid

def export_table(service, cloud_storage_path,
                 projectId, datasetId, tableId, sqlQuery,
                 export_format="CSV",
                 num_retries=5):

    # Generate a unique job_id so retries
    # don't accidentally duplicate export
    job_data = {
        'jobReference': {
            'projectId': projectId,
            'jobId': str(uuid.uuid4())
        },
        'configuration': {
            'extract': {
                'sourceTable': {
                    'projectId': projectId,
                    'datasetId': datasetId,
                    'tableId': tableId,
                },
                # A local path -- this is what triggers the error above
                'destinationUris': ['response/file-name-*.csv'],
                'destinationFormat': export_format
            },
            'query': {
                'query': sqlQuery,
            }
        }
    }
    return service.jobs().insert(
        projectId=projectId,
        body=job_data).execute(num_retries=num_retries)

I hoped I could just use a local path instead of Cloud Storage to store the data, but I was wrong.

So my question is:

Can I download the queried data locally (or to a local database), or do I have to use Google Cloud Storage?

You need to use Google Cloud Storage for your export job. Exporting data from BigQuery is explained here; also check the variants for different path syntaxes.

Then you can download the files from GCS to your local storage.

The gsutil tool can help you download the files from GCS to your local machine.

You cannot download the results to your local machine in one move; you first need to export them to GCS and then transfer them to your local machine.
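For illustration, here is a minimal sketch of that two-step flow using the google-cloud-bigquery and google-cloud-storage clients; the project, dataset, table, bucket, and file names are placeholders:

from google.cloud import bigquery, storage

bq_client = bigquery.Client()

# Step 1: extract the table to GCS. The destination must be a gs:// URI,
# which is exactly what the error in the question complains about.
extract_job = bq_client.extract_table(
    "my-project.my_dataset.my_table",   # placeholder table reference
    "gs://my-bucket/file-name-*.csv",   # wildcard required for large exports
)
extract_job.result()  # block until the extract job finishes

# Step 2: download the exported shard(s) to the local machine.
storage_client = storage.Client()
for blob in storage_client.list_blobs("my-bucket", prefix="file-name-"):
    blob.download_to_filename(blob.name)

Alternatively, gsutil cp "gs://my-bucket/file-name-*.csv" . performs the download step from the command line.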

You can download all the data directly (without routing it through Google Cloud Storage) using a paging mechanism. Basically you need to request a page token for each page, download the data in that page, and iterate until all data has been downloaded, i.e. until no more tokens are available. Here is example code in Java, which hopefully clarifies the idea:

import com.google.api.client.googleapis.auth.oauth2.GoogleCredential;
import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport;
import com.google.api.client.http.HttpTransport;
import com.google.api.client.json.JsonFactory;
import com.google.api.client.json.jackson2.JacksonFactory;
import com.google.api.client.util.Data;
import com.google.api.services.bigquery.Bigquery;
import com.google.api.services.bigquery.BigqueryScopes;
import com.google.api.services.bigquery.model.*;

import java.io.IOException;
import java.util.Iterator;
import java.util.List;

/* your class starts here */

private String projectId = ""; /* fill in the project id here */
private String query = ""; /* enter your query here */
private Bigquery bigQuery;
private Job insert;
private TableDataList tableDataList;
private Iterator<TableRow> rowsIterator;
private List<TableRow> rows;
private long maxResults = 100000L; /* max number of rows in a page */

/* run query */
public void open() throws Exception {
    HttpTransport transport = GoogleNetHttpTransport.newTrustedTransport();
    JsonFactory jsonFactory = new JacksonFactory();
    GoogleCredential credential = GoogleCredential.getApplicationDefault(transport, jsonFactory);
    if (credential.createScopedRequired())
        credential = credential.createScoped(BigqueryScopes.all());
    bigQuery = new Bigquery.Builder(transport, jsonFactory, credential).setApplicationName("my app").build();

    JobConfigurationQuery queryConfig = new JobConfigurationQuery().setQuery(query);
    JobConfiguration jobConfig = new JobConfiguration().setQuery(queryConfig);
    Job job = new Job().setConfiguration(jobConfig);
    insert = bigQuery.jobs().insert(projectId, job).execute();
    JobReference jobReference = insert.getJobReference();

    while (true) {
        Job poll = bigQuery.jobs().get(projectId, jobReference.getJobId()).execute();
        String state = poll.getStatus().getState();
        if ("DONE".equals(state)) {
            ErrorProto errorResult = poll.getStatus().getErrorResult();
            if (errorResult != null)
                throw new Exception("Error running job: " + poll.getStatus().getErrors().get(0));
            break;
        }
        Thread.sleep(10000);
    }

    tableDataList = getPage();
    rows = tableDataList.getRows();
    rowsIterator = rows != null ? rows.iterator() : null;
}

/* read data row by row */
public /* your data object here */ read() throws Exception {
    if (rowsIterator == null) return null;

    if (!rowsIterator.hasNext()) {
        String pageToken = tableDataList.getPageToken();
        if (pageToken == null) return null;
        tableDataList = getPage(pageToken);
        rows = tableDataList.getRows();
        if (rows == null) return null;
        rowsIterator = rows.iterator();
    }

    TableRow row = rowsIterator.next();
    for (TableCell cell : row.getF()) {
        Object value = cell.getV();
        /* extract the data here */
    }

    /* return the data */
}

private TableDataList getPage() throws IOException {
    return getPage(null);
}

private TableDataList getPage(String pageToken) throws IOException {
    TableReference sourceTable = insert
            .getConfiguration()
            .getQuery()
            .getDestinationTable();
    if (sourceTable == null)
        throw new IllegalArgumentException("Source table not available. Please check the query syntax.");
    return bigQuery.tabledata()
            .list(projectId, sourceTable.getDatasetId(), sourceTable.getTableId())
            .setPageToken(pageToken)
            .setMaxResults(maxResults)
            .execute();
}
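For comparison, the same paging idea in Python (the language used in the question) is largely handled for you by the google-cloud-bigquery client, whose row iterator follows page tokens internally. A minimal sketch, with the query and output file name as placeholders:

import csv
from google.cloud import bigquery

client = bigquery.Client()
# Placeholder query; replace with your own.
rows = client.query(
    "SELECT * FROM `my-project.my_dataset.my_table`"
).result(page_size=100000)  # max number of rows per page

with open("file-name.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([field.name for field in rows.schema])  # header row
    for row in rows:  # the iterator fetches page after page behind the scenes
        writer.writerow(row.values())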

You can run a tabledata.list() operation on that table and set "alt=csv", which will return the beginning of the table as CSV.

If you install the Google BigQuery API client along with pandas and pandas.io, you can run Python inside a Jupyter notebook, query the BQ table, and get the data into a local dataframe. From there, you can write it out to CSV.
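A minimal sketch of that approach, assuming the pandas-gbq package (which provides pandas.read_gbq) is installed; the query and project id are placeholders:

import pandas as pd

# read_gbq comes from the pandas-gbq package; the project id is a placeholder.
df = pd.read_gbq("SELECT * FROM `my_dataset.my_table`", project_id="my-project")
df.to_csv("file-name.csv", index=False)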

Another way to do this is from the UI: once the query results have returned, you can select the "Download as CSV" button.

As Mikhail Berlyant said,

BigQuery does not provide the ability to directly export/download query results to GCS or a local file.

You can still export it using the Web UI in just three steps:

  1. Configure the query to save the results in a BigQuery table and run it.
  2. Export the table to a bucket in GCS.
  3. Download from the bucket.

To make sure costs stay low, just make sure you delete the table once you have exported the content to GCS, and delete the content from the bucket (and the bucket itself) once you have downloaded the file(s) to your machine.

Step 1

In the BigQuery screen, before running the query, go to More > Query Settings.

[Screenshot: configure query]

This opens the following:

[Screenshot: query settings]

Here you want to have:

  • Destination: Set a destination table for query results.
  • Project name: select the project.
  • Dataset name: select a dataset. If you don't have one, create it and come back.
  • Table name: give it whatever name you want (it must contain only letters, numbers, or underscores).
  • Result size: Allow large results (no size limit).

Then save it, and the query is configured to be saved in a specific table. Now you can run the query.

Step 2

To export it to GCS, you have to go to the table and click EXPORT > Export to GCS.

[Screenshot: BigQuery export table]

This opens the following screen:

[Screenshot: Export to GCS]

In Select GCS location you define the bucket, the folder, and the file.

For instance, say you have a bucket named daria_bucket (use only lowercase letters, numbers, hyphens (-), and underscores (_); dots (.) may be used to form a valid domain name) and you want to save the file(s) in the root of the bucket with the name test. Then you write (in Select GCS location):

daria_bucket/test.csv

If the file is too big (more than 1 GB), you'll get an error. To fix it, you'll have to split it into more files using a wildcard. So you'll need to add *, just like that:

daria_bucket/test*.csv

[Screenshot: export to GCS with wildcard]

This is going to store, inside the bucket daria_bucket, all the data extracted from the table, split across multiple files named test000000000000, test000000000001, test000000000002, ... testX.

Step 3

Then go to Storage and you'll see the bucket.

[Screenshot: GCS bucket]

Go inside it and you'll find the one (or more) file(s). You can then download them from there.
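If you would rather script the same three steps than click through the UI, a minimal sketch with the google-cloud-bigquery and google-cloud-storage clients could look like the following; every project, dataset, table, bucket, and file name here is a placeholder:

from google.cloud import bigquery, storage

client = bigquery.Client()

# Step 1: run the query with a destination table (the "Query Settings" step).
job_config = bigquery.QueryJobConfig(
    destination="my-project.my_dataset.query_results")
client.query("SELECT * FROM `my-project.my_dataset.source_table`",
             job_config=job_config).result()

# Step 2: export the destination table to GCS; the wildcard is
# required when the export exceeds 1 GB.
client.extract_table("my-project.my_dataset.query_results",
                     "gs://daria_bucket/test*.csv").result()

# Step 3: download the shard(s) from the bucket.
storage_client = storage.Client()
for blob in storage_client.list_blobs("daria_bucket", prefix="test"):
    blob.download_to_filename(blob.name)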

Data export from a BigQuery table to a CSV file using Python pandas:

import pandas as pd
from google.cloud import bigquery

# Backticks are required around identifiers that contain hyphens.
selectQuery = """SELECT * FROM `dataset-name.table-name`"""
bigqueryClient = bigquery.Client()
df = bigqueryClient.query(selectQuery).to_dataframe()
df.to_csv("file-name.csv", index=False)

Maybe you can use the Simba ODBC driver provided by Google together with any tool that provides an ODBC connection to create the CSV. It can even be Microsoft SSIS, and you don't even need to code.
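As a rough sketch of that idea in code, assuming the Simba BigQuery ODBC driver is installed and a DSN has been configured for it (the DSN name "GoogleBigQuery" below is hypothetical, as is the query):

import pandas as pd
import pyodbc

# "GoogleBigQuery" is a hypothetical DSN configured for the Simba driver.
conn = pyodbc.connect("DSN=GoogleBigQuery", autocommit=True)
df = pd.read_sql("SELECT * FROM `my_dataset.my_table`", conn)
df.to_csv("file-name.csv", index=False)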
