How to detect why an Ansible playbook hangs during execution

Some of the tasks I wrote start and never end. Ansible does not provide any errors or logs that would explain this, even with the -vvvv option. The playbook just hangs, and waiting for hours doesn't change anything.

When I try to run my tasks manually (by entering commands via SSH) everything is fine.

Example task that hangs:

- name: apt upgrade
  shell: apt-get upgrade

Is there any way to see stdout and stderr? I tried:

- name: apt upgrade
  shell: apt-get upgrade
  register: hello
- debug: msg="{{ hello.stdout }}"
- debug: msg="{{ hello.stderr }}"

but nothing changed.

I do have the required permissions and I pass the correct sudo password: other tasks that require sudo execute correctly.

The most probable cause of your problem is the SSH connection. When a task takes a long time to execute, the SSH connection times out. I faced this problem once. To work around the SSH timeout, create an ansible.cfg in the directory from which you run Ansible and add the following:

[ssh_connection]

ssh_args = -o ServerAliveInterval=n

Here n is the ServerAliveInterval (in seconds) used when connecting to the server through SSH. Set it between 1 and 255. This causes the ssh client to send null packets to the server every n seconds to avoid a connection timeout.

I was having the same problem with a playbook.

It ran perfectly until some point and then stopped, so I added the async and poll parameters to avoid this behavior:
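The same keepalive setting can also be supplied as an inventory variable instead of through ansible.cfg. The following is only a sketch: the file location and the values 30 and 10 are assumptions chosen for illustration, and ServerAliveCountMax is an additional ssh option that the answer above does not mention.

```yaml
# group_vars/all.yml (hypothetical location; any inventory variable file works)
# Extra arguments that Ansible appends to the ssh command line it builds.
ansible_ssh_common_args: >-
  -o ServerAliveInterval=30
  -o ServerAliveCountMax=10
```

With these illustrative values the client sends a keepalive every 30 seconds and gives up after 10 unanswered ones.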

- name: update packages full into each server
  apt: upgrade=full
  ignore_errors: True
  async: 60
  poll: 60

and it worked like a charm! I really don't know what happened, but it seems that Ansible now keeps track of what's going on and doesn't freeze anymore!

Hope it helps

I had the same issue, and after a bit of fiddling around I found the problem to be in the fact-gathering step. Here are a few tips to help resolve any similar issue.

Disable fact-gathering in your playbook:
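For long upgrades, a variant of the same idea is to start the task in the background and poll it explicitly, so no single SSH session has to stay open for the whole run. This is a sketch under assumptions: the one-hour limit, the 30-second polling delay, and the variable names upgrade_job and upgrade_result are illustrative, not taken from the answer above.

```yaml
- name: Start the full upgrade in the background
  ansible.builtin.apt:
    upgrade: full
  async: 3600   # assumed upper bound on runtime, in seconds
  poll: 0       # do not wait here; return immediately with a job id
  register: upgrade_job

- name: Poll until the upgrade finishes
  ansible.builtin.async_status:
    jid: "{{ upgrade_job.ansible_job_id }}"
  register: upgrade_result
  until: upgrade_result.finished
  retries: 120  # 120 checks x 30 s = up to one hour
  delay: 30
```

Each poll is a short, separate connection, which avoids depending on one long-lived session.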

---
- hosts: myservers
  gather_facts: no
..

Rerun the playbook. If it works, the culprit is not SSH itself but the script that gathers the facts. We can debug that issue quite easily.

  1. SSH to the remote box.
  2. Find the setup file somewhere in the .ansible folder.
  3. Run it with ./setup or python -B setup.

If it hangs, then we know for sure that the problem is here. To find exactly what makes it hang, open the file in an editor and add print statements, mainly in the populate() method of Facts. Rerun the script and see how far it gets.

For me the issue seemed to be the attempt to resolve the hostname at the line self.facts['fqdn'] = socket.getfqdn(), and with a bit of googling it turned out to be a problem with resolving the remote hostname.
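One way to check the hostname-resolution suspicion without editing the setup file is to run the same call as a bounded task with fact gathering turned off. This is a rough sketch: it assumes python3 is on the remote PATH, and the 30-second limit and the name fqdn_check are arbitrary choices for illustration.

```yaml
- hosts: myservers
  gather_facts: no
  tasks:
    - name: Resolve the remote FQDN, giving up after 30 seconds
      ansible.builtin.command: python3 -c "import socket; print(socket.getfqdn())"
      async: 30   # fail the task if it runs longer than this
      poll: 5
      register: fqdn_check
      changed_when: false

    - name: Show what the remote host resolved
      ansible.builtin.debug:
        var: fqdn_check.stdout
```

If the first task times out while the rest of the play runs normally, hostname resolution on the remote box is a likely cause of the hang.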

A totally different workaround for me. I hit this when running from a Debian Jessie machine ( Linux PwC-Deb64 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt25-2+deb8u3 (2016-07-02) x86_64 GNU/Linux ) against another Debian image I was trying to build in AWS.

After many of the suggestions here didn't work for me, I became suspicious of the SSH "shared" connection. I went to my ansible.cfg, found the ssh_args line, and set ControlMaster=no. This may now perform slowly because I've lost the SSH performance boost that connection sharing is supposed to give, but there seems to be some interaction between it and apt-get that causes the issue.

Your ansible.cfg could be in the directory that you run ansible from, or in /etc/ansible. If the latter, you may like to take a copy of it into a local directory before you start changing it!

In my case, Ansible was "hanging forever" because apt-get was trying to ask me a question! How did I figure this out? I went to the target server, ran ps -aef | grep apt, and then did a kill on the appropriate "stuck" apt-get command.

Immediately after I did that, my Ansible playbook sprang back to life and reported (with the ansible-playbook -vvv option given):
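If you would rather not touch a shared ansible.cfg, connection sharing can probably be switched off for a single host through an inventory variable. This is a sketch based on assumptions: the host name debian-build is hypothetical, and setting ansible_ssh_args replaces the default ssh_args for that host rather than adding to them.

```yaml
# host_vars/debian-build.yml (hypothetical host name)
# Replaces the default ssh_args for this host only, turning off connection sharing.
ansible_ssh_args: '-o ControlMaster=no'
```

Other hosts keep the faster shared connections; only the affected one pays the performance cost.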

    " ==> Deleted (by you or by a script) since installation.",
    " ==> Package distributor has shipped an updated version.",
    "   What would you like to do about it ?  Your options are:",
    "    Y or I  : install the package maintainer's version",
    "    N or O  : keep your currently-installed version",
    "      D     : show the differences between the versions",
    "      Z     : start a shell to examine the situation",
    " The default action is to keep your current version.",
    "*** buildinfo.txt (Y/I/N/O/D/Z) [default=N] ? "

After reading that helpful diagnostic output, I immediately realized I needed some appropriate dpkg options (see, for example, this devops post). In my case, I chose:

apt:
  name: '{{ item }}'
  state: latest
  update_cache: yes
  # Force apt to always update to the newer config files in the package:
  dpkg_options: 'force-overwrite,force-confnew'
loop: '{{ my_packages }}'

Also, don't forget to clean up after your killed Ansible session with something like this, or your install will likely still fail:

sudo dpkg --configure -a
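For the original shell: apt-get upgrade task, which can also stall on a prompt, a non-interactive equivalent using the apt module might look like the sketch below. It rests on assumptions: upgrade: safe and the force-confdef,force-confold options (which keep existing configuration files, unlike the force-confnew choice above) are illustrative, so pick whichever matches your intent.

```yaml
- name: Upgrade packages without interactive prompts
  ansible.builtin.apt:
    upgrade: safe
    update_cache: yes
    # Keep locally modified config files instead of asking what to do:
    dpkg_options: 'force-confdef,force-confold'
  become: yes
```

The module runs apt non-interactively, so a changed configuration file is resolved by the dpkg options rather than by a question nobody can answer.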

Removing the passphrase from my SSH key fixed it for me, e.g.:

ssh-keygen -p

I was using Ansible to install a cluster of OpenDayLight SDN controllers on Ubuntu 20.04 VMs. Gathering facts reported a Python version warning and then hung. Installing Python 3.8 on my 3 VM worker nodes resolved the issue.

Notice: the technical posts on this site are licensed under CC BY-SA 4.0. If you repost, please credit this site's URL or the original address.
