
How to import CSV file into Cloud Bigtable via Cloud Dataflow with Python?

The easiest way to describe what I'm doing is essentially to follow this tutorial: Import a CSV file into a Cloud Bigtable table, but in the section where they start the Dataflow job, they use Java:

mvn package exec:exec \
    -DCsvImport \
    -Dbigtable.projectID=YOUR_PROJECT_ID \
    -Dbigtable.instanceID=YOUR_INSTANCE_ID \
    -Dbigtable.table="YOUR_TABLE_ID" \
    -DinputFile="YOUR_FILE" \
    -Dheaders="YOUR_HEADERS"

Is there a way to do this particular step in Python? The closest I could find was the apache_beam.examples.wordcount example here, but ultimately I'd like to see some code where I can add some customization into the Dataflow job using Python.

There is a connector for writing to Cloud Bigtable that you can use as a starting point for importing a CSV file.
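As a hedged illustration (not the connector's official sample): a minimal Apache Beam Python pipeline that reads a CSV from Cloud Storage and writes each line to Bigtable via the experimental apache_beam.io.gcp.bigtableio.WriteToBigTable transform. The bucket, project/instance/table IDs, the 'csv' column family, and the header list are placeholders, and the first CSV column is assumed to be the row key.

import datetime

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io.gcp.bigtableio import WriteToBigTable
from google.cloud.bigtable import row as bt_row

HEADERS = ['rowkey', 'col1', 'col2']  # hypothetical CSV headers


def csv_line_to_direct_row(line):
    # Naive split; use the csv module if your fields can be quoted.
    values = line.split(',')
    # The first column is assumed to be the Bigtable row key.
    direct_row = bt_row.DirectRow(row_key=values[0].encode())
    for header, value in zip(HEADERS[1:], values[1:]):
        # 'csv' is a placeholder column family that must already exist
        # on the target table.
        direct_row.set_cell('csv', header.encode(), value.encode(),
                            datetime.datetime.utcnow())
    return direct_row


def run():
    options = PipelineOptions(
        runner='DataflowRunner',  # or 'DirectRunner' to test locally
        project='YOUR_PROJECT_ID',
        region='us-central1',
        temp_location='gs://YOUR_BUCKET/tmp')
    with beam.Pipeline(options=options) as p:
        (p
         | 'ReadCsv' >> beam.io.ReadFromText('gs://YOUR_BUCKET/YOUR_FILE.csv')
         | 'ToDirectRow' >> beam.Map(csv_line_to_direct_row)
         | 'WriteToBigtable' >> WriteToBigTable(
               project_id='YOUR_PROJECT_ID',
               instance_id='YOUR_INSTANCE_ID',
               table_id='YOUR_TABLE_ID'))


if __name__ == '__main__':
    run()

Run it like any other Beam Python pipeline (python pipeline.py); with DataflowRunner it launches the same kind of Dataflow job that the Java command above does.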

Google Dataflow does not have a Python connector for BigTable.

Here is a link to the Apache Beam connectors for both Java and Python:

Built-in I/O Transforms

I'd suggest doing something like this.

DataFrame.to_gbq(destination_table, project_id, chunksize=10000, verbose=True, reauth=False, if_exists='fail', private_key=None)

You will find all parameters, and explanations of each, in the link below.

https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.DataFrame.to_gbq.html#pandas.DataFrame.to_gbq
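For example, a minimal sketch of that suggestion, assuming the CSV fits in memory (the file, dataset, and table names are placeholders; per the linked docs, to_gbq writes the frame to a BigQuery table):

import pandas as pd

# Placeholder names; to_gbq loads the DataFrame into a BigQuery table.
df = pd.read_csv('YOUR_FILE.csv')
df.to_gbq(destination_table='your_dataset.your_table',
          project_id='YOUR_PROJECT_ID',
          chunksize=10000,
          if_exists='fail')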
