
How to connect to HDFS from the docker container?

My goal is to read files from HDFS in Airflow and do further processing on them.

After some research, I found that the URL I need to use looks like this:

df = pd.read_parquet('http://localhost:9870/webhdfs/v1/hadoop_files/sample_2022_01.parquet?op=OPEN')

where localhost / 172.20.80.1 / computer-name.mshome.net can be used interchangeably,

9870 is the namenode port,

hadoop_files/sample_2022_01.parquet is the folder and file I created in the root directory.

I can access and read the file locally in PyCharm, but I cannot get the same result in Airflow running in Docker. I tried both a local HDFS and an HDFS hosted in Docker, and I also changed the host to host.docker.internal, but I hit the same error.

Stack trace:

[2022-06-12, 17:52:45 UTC] {taskinstance.py:1889} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/urllib/request.py", line 1350, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/usr/local/lib/python3.7/http/client.py", line 1281, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/lib/python3.7/http/client.py", line 1327, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.7/http/client.py", line 1276, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.7/http/client.py", line 1036, in _send_output
    self.send(msg)
  File "/usr/local/lib/python3.7/http/client.py", line 976, in send
    self.connect()
  File "/usr/local/lib/python3.7/http/client.py", line 948, in connect
    (self.host,self.port), self.timeout, self.source_address)
  File "/usr/local/lib/python3.7/socket.py", line 728, in create_connection
    raise err
  File "/usr/local/lib/python3.7/socket.py", line 716, in create_connection
    sock.connect(sa)
OSError: [Errno 113] No route to host

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/operators/python.py", line 207, in execute
    branch = super().execute(context)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/operators/python.py", line 171, in execute
    return_value = self.execute_callable()
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/operators/python.py", line 189, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/opt/airflow/dags/includes/parquet_dag/main.py", line 15, in main
    df_parquet = read('hdfs://localhost:9000/hadoop_files/sample_2022_01.parquet')
  File "/opt/airflow/dags/includes/parquet_dag/utils.py", line 29, in read
    df = pd.read_parquet('http://172.20.80.1:9870/webhdfs/v1/hadoop_files/sample_2022_01.parquet?op=OPEN')
  File "/home/airflow/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 500, in read_parquet
    **kwargs,
  File "/home/airflow/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 236, in read
    mode="rb",
  File "/home/airflow/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 102, in _get_path_or_handle
    path_or_handle, mode, is_text=False, storage_options=storage_options
  File "/home/airflow/.local/lib/python3.7/site-packages/pandas/io/common.py", line 614, in get_handle
    storage_options=storage_options,
  File "/home/airflow/.local/lib/python3.7/site-packages/pandas/io/common.py", line 312, in _get_filepath_or_buffer
    with urlopen(req_info) as req:
  File "/home/airflow/.local/lib/python3.7/site-packages/pandas/io/common.py", line 212, in urlopen
    return urllib.request.urlopen(*args, **kwargs)
  File "/usr/local/lib/python3.7/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/local/lib/python3.7/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/usr/local/lib/python3.7/urllib/request.py", line 543, in _open
    '_open', req)
  File "/usr/local/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/usr/local/lib/python3.7/urllib/request.py", line 1378, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/local/lib/python3.7/urllib/request.py", line 1352, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 113] No route to host>

With host.docker.internal:

urllib.error.URLError: <urlopen error [Errno 99] Cannot assign requested address>

You need to use an address that is routable from inside the Airflow Docker container.

If Hadoop is also running in a Docker container, check its IP address with docker inspect CONTAINER (doc). If Hadoop is running on localhost, you can set network_mode: "host" (doc). A sketch of both options follows.
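For illustration only; the container name hadoop-namenode and the service name airflow-worker below are hypothetical, not taken from the question:

# Find the bridge-network IP of a running Hadoop container
# (replace hadoop-namenode with your actual container name):
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' hadoop-namenode

# docker-compose.yml (Linux only): let the Airflow container share the
# host's network stack, so localhost:9870 inside it is the host's namenode.
services:
  airflow-worker:
    network_mode: "host"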

One more important note if you are on macOS and use the Docker Desktop app, which is essentially a virtual machine: in that case you need some additional setup, so check this.

"where localhost / 172.20.80.1 / computer-name.mshome.net can be used interchangeably"

They should not be interchangeable inside a Docker network.

In Airflow you can use the Docker service name instead of an IP address, provided the containers are on the same bridge network (not host mode, which is Linux-only). host.docker.internal is also not right here, because you are trying to reach another container, not your host. See the sketch after the link below.

https://docs.docker.com/network/bridge/
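As a sketch, a compose file that puts Airflow and the Hadoop namenode on the same user-defined bridge network; the service names (namenode, airflow), images, and tags are assumptions for illustration:

# docker-compose.yml: both services join the same bridge network,
# so the Airflow container can reach the namenode by its service name.
services:
  namenode:
    image: apache/hadoop:3          # hypothetical Hadoop image/tag
    networks: [hadoop-net]
  airflow:
    image: apache/airflow:2.3.2     # hypothetical Airflow image/tag
    networks: [hadoop-net]

networks:
  hadoop-net:
    driver: bridge

With that layout, the URL in the task becomes http://namenode:9870/webhdfs/v1/hadoop_files/sample_2022_01.parquet?op=OPEN instead of an IP address.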

I would also recommend using the Airflow Spark operators to actually read the Parquet from HDFS with Spark rather than Pandas or WebHDFS. You can convert the Spark dataframe to Pandas afterwards if needed.
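A minimal sketch of that approach, assuming the Spark job can reach the namenode by the service name namenode on the default RPC port 9000 (both names are assumptions based on the paths in the question):

from pyspark.sql import SparkSession

# In Airflow this would typically run as a job submitted via
# SparkSubmitOperator, not inside the worker process itself.
spark = SparkSession.builder.appName("read_hdfs_parquet").getOrCreate()

# Read the Parquet file straight from HDFS via the namenode RPC port.
df_spark = spark.read.parquet("hdfs://namenode:9000/hadoop_files/sample_2022_01.parquet")

# Convert to Pandas only if the result fits in driver memory.
df = df_spark.toPandas()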
