module google.cloud has no attribute storage

I'm trying to run an Apache Beam script in Python on GCP, following this tutorial:

https://levelup.gitconnected.com/scaling-scikit-learn-with-apache-beam-251eb6fcf75b

but I keep getting the following error:

AttributeError: module 'google.cloud' has no attribute 'storage'

I have google-cloud-storage in my requirements.txt, so I'm really not sure what I'm missing here.

My full script:

import apache_beam as beam
import json

query = """
    SELECT 
    year, 
    plurality, 
    apgar_5min, 
    mother_age, 
    father_age,
    gestation_weeks,
    ever_born,
    case when mother_married = true then 1 else 0 end as mother_married,
    weight_pounds as weight,
    current_timestamp as time,
    GENERATE_UUID() as guid
    FROM `bigquery-public-data.samples.natality` 
    order by rand()
    limit 100    
""" 

class ApplyDoFn(beam.DoFn):
    def __init__(self):
        self._model = None
        from google.cloud import storage
        import pandas as pd
        import pickle as pkl
        self._storage = storage
        self._pkl = pkl
        self._pd = pd
    
    def process(self, element):
        if self._model is None:
            bucket = self._storage.Client().get_bucket('bqr_dump')
            blob = bucket.get_blob('natality/sklearn-linear')
            self._model = self._pkl.loads(blob.download_as_string())
            
        new_x = self._pd.DataFrame.from_dict(element,
                                            orient='index').transpose().fillna(0)
        pred_weight = self._model.predict(new_x.iloc[:, 1:8])[0]
        return [ {'guid': element['guid'],
                 'predicted_weight': pred_weight,
                 'time': str(element['time'])}]



# set up pipeline options
options = {'project': 'my-project-name',
           'runner': 'DataflowRunner',
           'temp_location': 'gs://bqr_dump/tmp',
           'staging_location': 'gs://bqr_dump/tmp'
           }

pipeline_options = beam.pipeline.PipelineOptions(flags=[], **options)

with beam.Pipeline(options=pipeline_options) as pipeline:
    (
        pipeline
        | 'ReadTable' >> beam.io.Read(beam.io.BigQuerySource(
            query=query,
            use_standard_sql=True))
        | 'Apply Model' >> beam.ParDo(ApplyDoFn())
        | 'Save to BigQuery' >> beam.io.WriteToBigQuery(
            'pzn-pi-sto:beam_test.weight_preds', 
            schema='guid:STRING,weight:FLOAT64,time:STRING', 
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))

and my requirements.txt:

google-cloud==0.34.0
google-cloud-storage==1.30.0
apache-beam[GCP]==2.20.0

This issue usually comes down to one of two causes: either the module was not installed correctly (something broke during installation), or the module is not being imported correctly.

If the cause is a broken installation, reinstalling the packages, ideally in a clean virtual environment, should be the fix. As indicated here, in a similar case to yours, that approach resolved the problem.
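To check which of the two it is, here is a minimal sanity check (just a sketch; run it with the same interpreter and environment that executes your pipeline). It shows whether the package resolves at all:

import importlib

try:
    storage = importlib.import_module('google.cloud.storage')
    print('google.cloud.storage found at:', storage.__file__)
except ImportError as exc:
    # If this branch fires, the installation (not your Beam code) is the problem.
    print('google.cloud.storage is not importable:', exc)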

For the second cause, change your code to import all the modules at the beginning of the file, as demonstrated in this official example here. Your code should look something like this:

import apache_beam as beam
import json
import pandas as pd
import pickle as pkl

from google.cloud import storage
...
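With the imports at the top of the file, the DoFn no longer needs to stash the modules on self. A sketch of the same class under that change (keeping the bucket and blob names from your script) would be:

class ApplyDoFn(beam.DoFn):
    def __init__(self):
        self._model = None

    def process(self, element):
        # Lazily load the pickled model from GCS on first use.
        if self._model is None:
            bucket = storage.Client().get_bucket('bqr_dump')
            blob = bucket.get_blob('natality/sklearn-linear')
            self._model = pkl.loads(blob.download_as_string())

        new_x = pd.DataFrame.from_dict(
            element, orient='index').transpose().fillna(0)
        pred_weight = self._model.predict(new_x.iloc[:, 1:8])[0]
        return [{'guid': element['guid'],
                 'predicted_weight': pred_weight,
                 'time': str(element['time'])}]

One caveat: on the DataflowRunner, names imported at module level and used inside a DoFn generally need the save_main_session pipeline option (or a setup.py) so they are available on the workers.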

Let me know if this information helped you!
