I've installed the Apache Beam Python SDK and Apache Airflow in a Docker container.
Python version: 3.5
Apache Airflow: 1.10.5
I'm trying to execute an Apache Beam pipeline using **DataflowPythonOperator**. When I run the DAG from the Airflow UI, I get:
Import Error: import apache_beam as beam. Module not found
With the same setup I tried **DataflowTemplateOperator**, and it works perfectly fine.
When I tried the same Docker setup two months ago with Python 2 and Apache Airflow 1.10.3, the operator didn't return any errors and worked as expected.
After SSHing into the container, I checked the installed libraries with pip freeze, and I can see the installed versions of both packages: apache-airflow==1.10.5 apache-beam==2.15.0
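pip freeze only shows what is installed for the interpreter that pip points at, which is not necessarily the interpreter the Airflow worker runs. A small sketch for checking this from inside the container (the probed module name is a stand-in; in the container you would probe "apache_beam"):

```python
import importlib
import sys

def describe_module(name):
    """Return (version, file path) for an importable module, or None if missing."""
    try:
        mod = importlib.import_module(name)
    except ImportError:
        return None
    return getattr(mod, "__version__", "unknown"), getattr(mod, "__file__", "built-in")

# Which interpreter is actually running, and where does the module resolve from?
print(sys.executable)
print(describe_module("json"))  # stdlib stand-in; use "apache_beam" in the container
```

If this prints None for apache_beam when run with the interpreter Airflow uses, the package is installed for a different Python than the one executing the DAG.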
Dockerfile:
RUN pip install --upgrade pip
RUN pip install --upgrade setuptools
RUN pip install apache-beam
RUN pip install apache-beam[gcp]
RUN pip install google-api-python-client
ADD . /home/beam
RUN pip install apache-airflow[gcp_api]
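One way to catch this class of error at build time rather than at DAG run time is to pin the versions pip freeze reported and import the SDK during the build. This is a hedged variant of the Dockerfile above, not a confirmed fix:

```
RUN pip install --upgrade pip setuptools
RUN pip install apache-beam[gcp]==2.15.0
RUN pip install google-api-python-client
RUN pip install apache-airflow[gcp_api]==1.10.5
# Fail the image build immediately if the SDK is not importable
# by the same `python` the image will use at runtime
RUN python -c "import apache_beam; print(apache_beam.__version__)"
ADD . /home/beam
```

If the `python -c` check fails while pip freeze shows the package, the image has more than one Python and pip is installing into the wrong one.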
airflow operator:
new_task = DataFlowPythonOperator(
    task_id='process_details',
    py_file="path/to/file/filename.py",
    gcp_conn_id='google_cloud_default',
    dataflow_default_options={
        'project': 'xxxxx',
        'runner': 'DataflowRunner',
        'job_name': "process_details",
        'temp_location': 'GCS/path/to/temp',
        'staging_location': 'GCS/path/to/staging',
        'input_bucket': 'bucket_name',
        'input_path': 'GCS/path/to/bucket',
        'input-files': 'GCS/path/to/file.csv'
    },
    dag=test_dag)
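As I understand it, DataFlowPythonOperator launches the py_file with a Python interpreter on the Airflow worker, so apache_beam must be importable by that interpreter, not just by whichever pip was used during setup. A sketch for reproducing the check outside Airflow (module names are illustrative):

```python
import subprocess
import sys

def can_import(module, interpreter=sys.executable):
    """Return True if `module` is importable by the given interpreter."""
    result = subprocess.run(
        [interpreter, "-c", "import {}".format(module)],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE,
    )
    return result.returncode == 0

# Run this in the container with the interpreter Airflow actually uses;
# a False for "apache_beam" means the operator would hit the same ImportError.
print(can_import("json"))
print(can_import("apache_beam"))
```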
This looks like a known issue: https://github.com/GoogleCloudPlatform/DataflowPythonSDK/issues/46
From that thread:
> Please run pip install six==1.10. This is a known issue in Beam ( https://issues.apache.org/jira/browse/BEAM-2964 ) which we are trying to get fixed upstream.
So try installing six==1.10 using pip.
The steps below should help you solve your problem. Follow them in order.
1. Read the following article on virtualenv; it will help with the later steps: https://www.dabapps.com/blog/introduction-to-pip-and-virtualenv-python/?utm_source=feedly
2. Create a virtual environment (note: I created it in the cloudml-samples folder and named it env):
titanium-vim-169612:~/cloudml-samples$ virtualenv env
3. Activate the virtual environment:
@titanium-vim-169612:~/cloudml-samples$ source env/bin/activate
4. Install Cloud Dataflow using the following quickstart (this brings in apache_beam): https://cloud.google.com/dataflow/docs/quickstarts/quickstart-python
5. Now you can check that apache_beam is present in env/lib/python2.7/site-packages/:
@titanium-vim-169612:~/cloudml-samples/flowers$ ls ../env/lib/python2.7/site-packages/
6. Run the sample. At this point, I got an error about missing TensorFlow, so I installed TensorFlow in my virtualenv using the link below (use the installation steps for virtualenv): https://www.tensorflow.org/install/install_linux#InstallingVirtualenv
The sample seems to work now.
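The ls check in step 5 can also be done from Python without importing anything: importlib reports where a module would be loaded from, which confirms whether it resolves inside the virtualenv's site-packages. A sketch, with a stdlib module as a stand-in:

```python
import importlib.util

def module_location(name):
    """Return the file a module would be loaded from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec else None

# Inside the activated virtualenv, a correctly installed package resolves
# under .../env/lib/pythonX.Y/site-packages/ (None means not installed there).
print(module_location("email"))        # stdlib example
print(module_location("apache_beam"))  # check this inside the venv
```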
This may not be an option for you, but I was getting the same error with Python 2, and executing the same script with Python 3 resolved it.
I was running through the Dataflow tutorial: https://codelabs.developers.google.com/codelabs/cpb101-simple-dataflow-py/
When I followed the instructions as specified:
python grep.py
I got the error from the title of your post. When I ran it with:
python3 grep.py
it worked as expected. I hope this helps; happy hunting if it doesn't. See the link for details on exactly what I was running.
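When both interpreters exist on a machine, a guard at the top of the pipeline script turns the Python 2 vs Python 3 mismatch into an immediate, readable failure instead of a confusing import error further down. This is a sketch, not part of the tutorial's grep.py:

```python
import sys

def require_python3(version=sys.version_info):
    """Return True under Python 3+; otherwise raise a clear, actionable error."""
    if version < (3, 0):
        raise RuntimeError(
            "This pipeline requires Python 3; run it with python3, not python"
        )
    return True

# Call this before any apache_beam imports so the wrong interpreter
# fails with a version message rather than a ModuleNotFoundError.
require_python3()
```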