
ImportError: phonenumbers with Google Cloud Dataflow (Python)

I am trying to do a relatively simple import of the phonenumbers module in Python.

I have tested the module in a separate Python file without any other imports, and it works completely fine.
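
For example, a minimal standalone check along these lines succeeds locally (the sample number here is illustrative, not my actual test data):

import phonenumbers

# Parse a string into a PhoneNumber object; passing None for the region
# works because the string carries an explicit +1 country code.
number = phonenumbers.parse("+1 650-253-0000", None)
print(phonenumbers.is_valid_number(number))  # prints True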

These are the packages I import:

from __future__ import absolute_import
from __future__ import print_function

import argparse
import csv
import logging
import os
import phonenumbers

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions

And this is my error message:

Traceback (most recent call last):
  File "clean.py", line 114, in <module>
    run()
  File "clean.py", line 109, in run
    | 'WriteOutputText' >> beam.io.WriteToText(known_args.output))
  File "C:\Python27\lib\site-packages\apache_beam\pipeline.py", line 389, in __exit__
    self.run().wait_until_finish()
  File "C:\Python27\lib\site-packages\apache_beam\runners\dataflow\dataflow_runner.py", line 996, in wait_until_finish
    (self.state, getattr(self._runner, 'last_error_msg', None)), self)
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 733, in run
    self._load_main_session(self.local_staging_directory)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 472, in _load_main_session
    pickler.load_session(session_file)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/internal/pickler.py", line 247, in load_session
    return dill.load_session(file_path)
  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 363, in load_session
    module = unpickler.load()
  File "/usr/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1139, in load_reduce
    value = func(*args)
  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 766, in _import_module
    return __import__(import_name)
ImportError: No module named phonenumbers

Any help would be greatly appreciated, thanks!

EDIT: I have already installed phonenumbers with pip:

$ pip install phonenumbers
Requirement already satisfied: phonenumbers in c:\python27\lib\site-packages (8.9.7)
gax-google-logging-v2 0.8.3 has requirement google-gax<0.13.0,>=0.12.5, but you'll have google-gax 0.15.16 which is incompatible.
gcloud 0.18.3 has requirement google-gax<0.13dev,>=0.12.3, but you'll have google-gax 0.15.16 which is incompatible.
google-cloud-vision 0.29.0 has requirement requests<3.0dev,>=2.18.4, but you'll have requests 2.18.2 which is incompatible.
gax-google-pubsub-v1 0.8.3 has requirement google-gax<0.13.0,>=0.12.5, but you'll have google-gax 0.15.16 which is incompatible.
google-cloud-spanner 0.29.0 has requirement requests<3.0dev,>=2.18.4, but you'll have requests 2.18.2 which is incompatible.

This is the pip freeze output:

$ pip freeze
adal==1.0.1
apache-beam==2.4.0
asn1crypto==0.22.0
avro==1.8.2
azure==3.0.0
azure-batch==4.1.3
azure-common==1.1.12
azure-cosmosdb-nspkg==2.0.2
azure-cosmosdb-table==1.0.3
azure-datalake-store==0.0.22
azure-eventgrid==0.1.0
azure-graphrbac==0.40.0
azure-keyvault==0.3.7
azure-mgmt==2.0.0
azure-mgmt-advisor==1.0.1
azure-mgmt-applicationinsights==0.1.1
azure-mgmt-authorization==0.30.0
azure-mgmt-batch==5.0.1
azure-mgmt-batchai==0.2.0
azure-mgmt-billing==0.1.0
azure-mgmt-cdn==2.0.0
azure-mgmt-cognitiveservices==2.0.0
azure-mgmt-commerce==1.0.1
azure-mgmt-compute==3.0.1
azure-mgmt-consumption==2.0.0
azure-mgmt-containerinstance==0.3.1
azure-mgmt-containerregistry==1.0.1
azure-mgmt-containerservice==3.0.1
azure-mgmt-cosmosdb==0.3.1
azure-mgmt-datafactory==0.4.0
azure-mgmt-datalake-analytics==0.3.0
azure-mgmt-datalake-nspkg==2.0.0
azure-mgmt-datalake-store==0.3.0
azure-mgmt-devtestlabs==2.2.0
azure-mgmt-dns==1.2.0
azure-mgmt-eventgrid==0.4.0
azure-mgmt-eventhub==1.2.0
azure-mgmt-hanaonazure==0.1.1
azure-mgmt-iothub==0.4.0
azure-mgmt-iothubprovisioningservices==0.1.0
azure-mgmt-keyvault==0.40.0
azure-mgmt-loganalytics==0.1.0
azure-mgmt-logic==2.1.0
azure-mgmt-machinelearningcompute==0.4.1
azure-mgmt-managementpartner==0.1.0
azure-mgmt-marketplaceordering==0.1.0
azure-mgmt-media==0.2.0
azure-mgmt-monitor==0.4.0
azure-mgmt-msi==0.1.0
azure-mgmt-network==1.7.1
azure-mgmt-notificationhubs==1.0.0
azure-mgmt-nspkg==2.0.0
azure-mgmt-powerbiembedded==1.0.0
azure-mgmt-rdbms==0.1.0
azure-mgmt-recoveryservices==0.2.0
azure-mgmt-recoveryservicesbackup==0.1.1
azure-mgmt-redis==5.0.0
azure-mgmt-relay==0.1.0
azure-mgmt-reservations==0.1.0
azure-mgmt-resource==1.2.2
azure-mgmt-scheduler==1.1.3
azure-mgmt-search==1.0.0
azure-mgmt-servermanager==1.2.0
azure-mgmt-servicebus==0.4.0
azure-mgmt-servicefabric==0.1.0
azure-mgmt-sql==0.8.6
azure-mgmt-storage==1.5.0
azure-mgmt-subscription==0.1.0
azure-mgmt-trafficmanager==0.40.0
azure-mgmt-web==0.34.1
azure-nspkg==2.0.0
azure-servicebus==0.21.1
azure-servicefabric==6.1.2.9
azure-servicemanagement-legacy==0.20.6
azure-storage-blob==1.1.0
azure-storage-common==1.1.0
azure-storage-file==1.1.0
azure-storage-nspkg==3.0.0
azure-storage-queue==1.1.0
CacheControl==0.12.5
cachetools==2.1.0
certifi==2017.7.27.1
cffi==1.10.0
chardet==3.0.4
click==6.7
configparser==3.5.0
crcmod==1.7
cryptography==2.0.3
deprecation==2.0.3
dill==0.2.6
docopt==0.6.2
entrypoints==0.2.3
enum34==1.1.6
fasteners==0.14.1
firebase-admin==2.11.0
Flask==0.12.2
funcsigs==1.0.2
future==0.16.0
futures==3.2.0
gapic-google-cloud-datastore-v1==0.15.3
gapic-google-cloud-error-reporting-v1beta1==0.15.3
gapic-google-cloud-logging-v2==0.91.3
gapic-google-cloud-pubsub-v1==0.15.4
gax-google-logging-v2==0.8.3
gax-google-pubsub-v1==0.8.3
gcloud==0.18.3
google-api-core==0.1.4
google-apitools==0.5.20
google-auth==1.5.0
google-auth-httplib2==0.0.3
google-auth-oauthlib==0.2.0
google-cloud==0.33.1
google-cloud-bigquery==0.28.0
google-cloud-bigquery-datatransfer==0.1.1
google-cloud-bigtable==0.28.1
google-cloud-container==0.1.1
google-cloud-core==0.28.1
google-cloud-dataflow==2.4.0
google-cloud-datastore==1.4.0
google-cloud-dns==0.28.0
google-cloud-error-reporting==0.28.0
google-cloud-firestore==0.28.0
google-cloud-language==1.0.2
google-cloud-logging==1.4.0
google-cloud-monitoring==0.28.1
google-cloud-pubsub==0.30.1
google-cloud-resource-manager==0.28.1
google-cloud-runtimeconfig==0.28.1
google-cloud-spanner==0.29.0
google-cloud-speech==0.30.0
google-cloud-storage==1.6.0
google-cloud-trace==0.17.0
google-cloud-translate==1.3.1
google-cloud-videointelligence==1.0.1
google-cloud-vision==0.29.0
google-gax==0.15.16
google-resumable-media==0.3.1
googleapis-common-protos==1.5.3
googledatastore==7.0.1
grpc-google-iam-v1==0.11.4
grpc-google-logging-v2==0.8.1
grpc-google-pubsub-v1==0.8.1
grpcio==1.12.0
gunicorn==19.7.1
hdfs==2.1.0
httplib2==0.9.2
idna==2.5
ipaddress==1.0.18
iso8601==0.1.12
isodate==0.6.0
itsdangerous==0.24
Jinja2==2.9.6
jmespath==0.9.3
keyring==12.2.1
keystoneauth1==3.8.0
linecache2==1.0.0
MarkupSafe==1.0
mock==2.0.0
monotonic==1.5
msgpack==0.5.6
msrest==0.5.0
msrestazure==0.4.32
ndg-httpsclient==0.4.2
nelson==0.4.0
oauth2client==3.0.0
oauthlib==2.1.0
os-service-types==1.2.0
packaging==17.1
pathlib2==2.3.2
pbr==4.0.3
phonenumbers==8.9.7
ply==3.8
proto-google-cloud-datastore-v1==0.90.4
proto-google-cloud-error-reporting-v1beta1==0.15.3
proto-google-cloud-logging-v2==0.91.3
proto-google-cloud-pubsub-v1==0.15.4
protobuf==3.5.2.post1
psutil==5.4.6
psycopg2==2.7.3.2
pyasn1==0.4.3
pyasn1-modules==0.2.1
pycparser==2.18
pyjwt==1.5.0
pyOpenSSL==17.2.0
pyparsing==2.2.0
pyreadline==2.1
python-dateutil==2.7.3
pytz==2018.3
PyVCF==0.6.8
pywin32-ctypes==0.1.2
PyYAML==3.12
rackspaceauth==0.2.0
requests==2.18.2
requests-oauthlib==0.8.0
requests-toolbelt==0.8.0
rsa==3.4.2
scandir==1.7
six==1.10.0
SQLAlchemy==1.1.14
stevedore==1.28.0
traceback2==1.4.0
twilio==6.5.0
typing==3.6.4
unittest2==1.1.0
urllib3==1.22
virtualenv==16.0.0
Werkzeug==0.12.2

EDIT: This is my code:

from __future__ import absolute_import
from __future__ import print_function

import argparse
import csv
import logging
import os
from collections import OrderedDict
import phonenumbers

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions


class ParseCSVFn(beam.DoFn):
    """Parses the raw CSV data into a Python dictionary."""

    def process(self, elem):
        try:
            row = list(csv.reader([elem]))[0]
            month, day, year = row[2].split('/')
            birth_dict = {
                'day': day,
                'month': month,
                'year': year,
            }
            order_dict = OrderedDict(birth_dict)
            data_dict = {
                'phoneNumber': row[4],
                'firstName': row[0],
                'lastName': row[1],
                'birthDate': order_dict,
                'voterId': row[3],
            }

            order_data_dict = OrderedDict(data_dict)

            yield order_data_dict

        except Exception:
            # Skip rows that fail to parse (e.g. malformed dates or short rows).
            pass


def run(argv=None):
    """Pipeline entry point, runs the all the necessary processes"""
    parser = argparse.ArgumentParser()
    parser.add_argument('--input',
                        type=str,
                        dest='input',
                        default='gs://wordcount_project/demo-contacts-small*.csv',
                        help='Input file to process.')
    parser.add_argument('--output',
                        dest='output',
                        # CHANGE 1/5: The Google Cloud Storage path is required
                        # for outputting the results.
                        default='gs://wordcount_project/cleaned.csv',
                        help='Output file to write results to.')

    known_args, pipeline_args = parser.parse_known_args(argv)
    pipeline_args.extend([
        # CHANGE 2/5: (OPTIONAL) Change this to DataflowRunner to
        # run your pipeline on the Google Cloud Dataflow Service.
        '--runner=DataflowRunner',
        # CHANGE 3/5: Your project ID is required in order to run your pipeline on
        # the Google Cloud Dataflow Service.
        '--project=--------',
        # CHANGE 4/5: Your Google Cloud Storage path is required for staging local
        # files.
        # '--dataset=game_dataset',
        '--staging_location=gs://wordcount_project/staging',
        # CHANGE 5/5: Your Google Cloud Storage path is required for temporary
        # files.
        '--temp_location=gs://wordcount_project/temp',
        '--job_name=cleaning-jobs',
    ])

    pipeline_options = PipelineOptions(pipeline_args)

    pipeline_options.view_as(SetupOptions).save_main_session = True

    with beam.Pipeline(options=pipeline_options) as p:
        (p
         | 'ReadInputText' >> beam.io.ReadFromText(known_args.input)
         | 'ParseDataFn' >> beam.ParDo(ParseCSVFn())
         # | 'JsonBirthDay' >> beam.ParDo(JsonBirthDay())
         # | 'MatchNumber' >> beam.ParDo(MatchNumber('phoneNumber'))
         # | 'MapData' >> beam.Map(lambda elem: (elem['phoneNumber'], elem['firstName'], elem['lastName'],
         #                                       elem['birthDate'], elem['voterId']))
         | 'WriteOutputText' >> beam.io.WriteToText(known_args.output))


if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()

I have also tried installing specific versions of google-gax and requests, but it didn't seem to help.

EDIT: New error:

  File "new_clean.py", line 226, in <module>
    run()
  File "new_clean.py", line 219, in run
    | 'WriteToText' >> beam.io.WriteToText(known_args.output)
  File "C:\Python27\lib\site-packages\apache_beam\pipeline.py", line 389, in __exit__
    self.run().wait_until_finish()
  File "C:\Python27\lib\site-packages\apache_beam\pipeline.py", line 369, in run
    self.to_runner_api(), self.runner, self._options).run(False)
  File "C:\Python27\lib\site-packages\apache_beam\pipeline.py", line 382, in run
    return self.runner.run_pipeline(self)
  File "C:\Python27\lib\site-packages\apache_beam\runners\dataflow\dataflow_runner.py", line 324, in run_pipeline
    self.dataflow_client.create_job(self.job), self)
  File "C:\Python27\lib\site-packages\apache_beam\utils\retry.py", line 180, in wrapper
    return fun(*args, **kwargs)
  File "C:\Python27\lib\site-packages\apache_beam\runners\dataflow\internal\apiclient.py", line 461, in create_job
    self.create_job_description(job)
  File "C:\Python27\lib\site-packages\apache_beam\runners\dataflow\internal\apiclient.py", line 491, in create_job_description
    job.options, file_copy=self._gcs_file_copy)
  File "C:\Python27\lib\site-packages\apache_beam\runners\dataflow\internal\dependency.py", line 328, in stage_job_resources
    setup_options.requirements_file, requirements_cache_path)
  File "C:\Python27\lib\site-packages\apache_beam\runners\dataflow\internal\dependency.py", line 262, in _populate_requirements_cache
    processes.check_call(cmd_args)
  File "C:\Python27\lib\site-packages\apache_beam\utils\processes.py", line 44, in check_call
    return subprocess.check_call(*args, **kwargs)
  File "C:\Python27\lib\subprocess.py", line 186, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['C:\\Python27\\python.exe', '-m', 'pip', 'download', '--dest', 'c:\\users\\james\\appdata\\local\\temp\\dataflow-requirements-cache', '-r', 'requirements.txt', '--no-binary', ':all:']' returned non-zero exit status 1

It may be that Dataflow is not receiving the file that lists the extra dependencies of your pipeline, so phonenumbers is never installed on the workers, and unpickling the main session fails there (your pipeline sets save_main_session=True, which pickles the main module's imports for the workers). To install the dependencies on the workers, you'd first do this:

pip freeze > requirements.txt

Then you'll want to edit the requirements.txt file and leave only the packages that were installed from PyPI and are used in your pipeline.
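
For this pipeline, the trimmed file might contain little more than the package that is missing on the workers (the version taken from your pip freeze output above):

phonenumbers==8.9.7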

When you run your pipeline, pass the following command-line option:

--requirements_file requirements.txt
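
Alternatively, since your run() function hard-codes its pipeline options, the same setting can be appended to pipeline_args directly. A minimal sketch, assuming requirements.txt sits in the directory you invoke the script from:

pipeline_args.extend([
    # Tell Dataflow to stage these packages and install them on each worker.
    '--requirements_file=requirements.txt',
])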

This is documented in Apache Beam's Python Pipeline Dependencies docs.

Hope that helps.
