
Accessing Big Query from Cloud DataLab using Pandas

I have a Jupyter Notebook that accesses Big Query using Pandas as the vehicle:

df = pd.io.gbq.read_gbq( query, project_id = 'xxxxxxx-xxxx' )

This works fine from my local machine (great, in fact!), but when I load the same notebook into Cloud DataLab I get:

DistributionNotFound: google-api-python-client

Which seems rather disappointing! I believe the module should be installed with Pandas... but somehow Google is not including it? For a bunch of reasons it would be most preferable not to have to change the code between what we develop on our local machines and what is needed in Cloud DataLab; in our case we heavily parameterize the data access, as in the sketch below.
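
For illustration, a minimal sketch of what "heavily parameterize the data access" means here; the helper, table, and date column are hypothetical, not from the actual code:

import pandas as pd

# Hypothetical helper: build the query from parameters so identical notebook
# code can run both locally and in Cloud DataLab.
def load_table(table, start_date, end_date, project_id='xxxxxxx-xxxx'):
    query = ("SELECT * FROM [{table}] "  # legacy-SQL table reference
             "WHERE date >= '{start}' AND date < '{end}'").format(
                 table=table, start=start_date, end=end_date)
    return pd.io.gbq.read_gbq(query, project_id=project_id)

df = load_table('mydataset.mytable', '2016-01-01', '2016-02-01')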

OK, I ran:

!pip install --upgrade google-api-python-client

Now when I run the notebook I get an auth prompt that I cannot resolve, since DataLab is on a remote machine:

Your browser has been opened to visit:
 >>> Browser string>>>>
If your browser is on a different machine then exit and re-run this
application with the command-line parameter 

 --noauth_local_webserver

I don't see an obvious answer to this?
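
(For context, the --noauth_local_webserver flag mentioned in the prompt is consumed by oauth2client's tools.run_flow helper; a hedged sketch of passing it programmatically, assuming a client_secrets.json downloaded from the Google API Console:)

from oauth2client import tools
from oauth2client.client import flow_from_clientsecrets
from oauth2client.file import Storage

# Parse the flag oauth2client would normally read from the command line.
flags = tools.argparser.parse_args(['--noauth_local_webserver'])
# client_secrets.json is an assumption: an installed-app credentials file.
flow = flow_from_clientsecrets('client_secrets.json',
                               scope='https://www.googleapis.com/auth/bigquery')
# With the flag set, run_flow prints the authorize URL instead of opening a browser.
credentials = tools.run_flow(flow, Storage('bigquery_credentials.dat'), flags)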

After updating google-api-python-client in the notebook, I used the code suggested below by @Anthonios Partheniou from within the same notebook (executing it in a cell block) and got the following traceback:

TypeError                                 Traceback (most recent call last)
<ipython-input-3-038366843e56> in <module>()
  5                            scope='https://www.googleapis.com/auth/bigquery',
  6                            redirect_uri='urn:ietf:wg:oauth:2.0:oob')
----> 7 storage = Storage('bigquery_credentials.dat')
  8 authorize_url = flow.step1_get_authorize_url()
  9 print 'Go to the following link in your browser: ' + authorize_url

/usr/local/lib/python2.7/dist-packages/oauth2client/file.pyc in __init__(self, filename)
 37 
 38     def __init__(self, filename):
---> 39         super(Storage, self).__init__(lock=threading.Lock())
 40         self._filename = filename
 41 

 TypeError: object.__init__() takes no parameters

He mentions the need to execute the notebook from the same folder, yet the only way I know of to execute a DataLab notebook is via the repo? (Judging by the pip list below, oauth2client is pinned at 1.4.12, while the super(Storage, self).__init__(lock=...) call in the traceback only appears in newer oauth2client releases, so the upgrade may have left a mismatched install.)

While the new Jupyter Datalab module is a possible alternative, the ability to use the full Pandas BQ interface unchanged on local and DataLab instances would be hugely helpful! So I'm crossing my fingers for a solution!

pip installed:
GCPDataLab 0.1.0
GCPData 0.1.0
wheel 0.29.0
tensorflow 0.6.0
protobuf 3.0.0a3
oauth2client 1.4.12
futures 3.0.3
pexpect 4.0.1
terminado 0.6
pyasn1 0.1.9
jsonschema 2.5.1
mistune 0.7.2
statsmodels 0.6.1
path.py 8.1.2
ipython 4.1.2
nose 1.3.7
MarkupSafe 0.23
py-dateutil 2.2
pyparsing 2.1.1
pickleshare 0.6
pandas 0.18.0
singledispatch 3.4.0.3
PyYAML 3.11
nbformat 4.0.1
certifi 2016.2.28
notebook 4.0.2
cycler 0.10.0
scipy 0.17.0
ipython-genutils 0.1.0
pyasn1-modules 0.0.8
functools32 3.2.3-2
ipykernel 4.3.1
pandocfilters 1.2.4
decorator 4.0.9
jupyter-core 4.1.0
rsa 3.4.2
mock 1.3.0
httplib2 0.9.2
pytz 2016.3
sympy 0.7.6
numpy 1.11.0
seaborn 0.6.0
pbr 1.8.1
backports.ssl-match-hostname 3.5.0.1
ggplot 0.6.5
simplegeneric 0.8.1
ptyprocess 0.5.1
funcsigs 0.4
scikit-learn 0.16.1
traitlets 4.2.1
jupyter-client 4.2.2
nbconvert 4.1.0
matplotlib 1.5.1
patsy 0.4.1
tornado 4.3
python-dateutil 2.5.2
Jinja2 2.8
backports-abc 0.4
brewer2mpl 1.4.1
Pygments 2.1.3

end

Google BigQuery authentication in pandas is normally straightforward, except when pandas code is executed on a remote server, for example when running pandas on Datalab in the cloud. In that case, use the following code to create the credentials file that pandas needs to access Google BigQuery from Google Datalab.

from oauth2client.client import OAuth2WebServerFlow
from oauth2client.file import Storage

# Start an installed-application OAuth flow for the BigQuery scope; the "oob"
# redirect URI makes Google display a verification code in the browser
# instead of redirecting to a local web server.
flow = OAuth2WebServerFlow(client_id='<Client ID from Google API Console>',
                           client_secret='<Client secret from Google API Console>',
                           scope='https://www.googleapis.com/auth/bigquery',
                           redirect_uri='urn:ietf:wg:oauth:2.0:oob')
storage = Storage('bigquery_credentials.dat')
authorize_url = flow.step1_get_authorize_url()
print 'Go to the following link in your browser: ' + authorize_url
code = raw_input('Enter verification code: ')  # paste the code Google displays
credentials = flow.step2_exchange(code)
storage.put(credentials)  # writes bigquery_credentials.dat next to the notebook

Once you complete the process I don't expect you will see the error (as long as the notebook is in the same folder as the newly created 'bigquery_credentials.dat' file).
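
As a quick check, a minimal sketch of re-running a query once the credentials file exists (the project ID below is the asker's placeholder, and the query is illustrative):

import pandas as pd

# With bigquery_credentials.dat in the notebook's working directory, pandas
# should pick up the cached credentials and skip the browser prompt.
df = pd.io.gbq.read_gbq('SELECT 1 AS ok', project_id='xxxxxxx-xxxx')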

You also need to install the google-api-python-client python package, as it is required by pandas for Google BigQuery support. You can run either of the following in a notebook to install it.

Either

!pip install google-api-python-client --no-deps
!pip install uritemplate --no-deps
!pip install simplejson --no-deps

or

%%bash
pip install google-api-python-client --no-deps
pip install uritemplate --no-deps
pip install simplejson --no-deps

The --no-deps option is needed so that you don't accidentally update a python package which is installed in datalab by default (to ensure other parts of datalab don't break).
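
As an optional sanity check, you can confirm afterwards that the pinned packages were left untouched (the package list in the grep pattern is illustrative):

%%bash
# Versions shown should still match the ones datalab ships with.
pip freeze | grep -iE 'oauth2client|httplib2|google-api-python-client'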

Note: With pandas 0.19.0 (not released yet), it will be much easier to use pandas in Google Cloud Datalab. See Pull Request #13608.

Note: You also have the option to use the (new) google datalab module inside of jupyter (and that way the code will also work in Google Datalab on the cloud). See the following related Stack Overflow answer: How do I use gcp package from outside of google datalabs?
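
For reference, a hedged sketch of that alternative using the datalab module of this era (assuming datalab.bigquery is importable in your environment; the query is illustrative):

import datalab.bigquery as bq

# Run the query through the Datalab-native client and materialize the result
# as a pandas DataFrame, bypassing pandas.io.gbq entirely.
df = bq.Query('SELECT 1 AS ok').to_dataframe()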
