
GCP deploy instance fails from Ansible script

I've been deploying clusters in GCP via Ansible scripts for more than a year now, but all of a sudden one of my scripts keeps giving me this error:

libcloud.common.google.GoogleBaseError: u"The zone 'projects/[project]/zones/europe-west1-d' does not have enough resources available to fulfill the request. Try a different zone, or try again later."

The obvious reason would be that I don't have enough resources, but not a whole lot has changed and the quotas look good: [screenshot: quotas]

The Ansible script itself doesn't ask for a lot. I'm creating 3 instances of n1-standard-4 with a 100 GB SSD each. See the snippet of the script below:

tasks:
    # One SSD-backed boot disk per node, created in the same zone as the instance.
    - name: create boot disks
      gce_pd:
        disk_type: pd-ssd
        image: "debian-9-stretch-v20171025"
        name: "{{ item.node }}-disk"
        size_gb: 100
        state: present
        zone: "europe-west1-d"
        service_account_email: "{{ service_account_email }}"
        credentials_file: "{{ credentials_file }}"
        project_id: "{{ project_id }}"
      with_items: "{{ nodes }}"
      async: 3600
      poll: 2

    - name: create instances
      gce:
        instance_names: "{{ item.node }}"
        zone: "europe-west1-d"
        machine_type: "n1-standard-4"
        # Only node 0 is a regular instance; all other nodes are preemptible.
        preemptible: "{{ false if item.num == '0' else true }}"
        disk_auto_delete: true
        disks:
          - name: "{{ item.node }}-disk"
            mode: READ_WRITE
        state: present
        service_account_email: "{{ service_account_email }}"
        service_account_permissions: "compute-rw"
        credentials_file: "{{ credentials_file }}"
        project_id: "{{ project_id }}"
        tags: "elasticsearch"
      register: gce_raw_results
      with_items: "{{ nodes }}"
      async: 3600
      poll: 2

Update 1:

  • The service account is an editor of the entire project, so a permissions issue seems unlikely.
  • It started happening on March 24, 2018, and has happened every night since then. So if it's an 'out of stock' issue, that would be very coincidental, right? Besides, I have been running this script all day so far and it fails most of the time (see below for a success).
  • I've tested a few times and it might have something to do with the 'preemptible' flag on the instance. (I start 3 nodes, but at least the first one has to stay up for things to work at all) => preemptible: "{{ false if item.num == '0' else true }}". If I turn off preemptible (false), it runs without a hitch. The 'workaround' seems to be to simply not use preemptible instances, but this worked for a year without failing once. Did something change? Did GCP's API change? Did the Ansible gce module not implement these changes?

The full error is:

TASK [Gathering Facts] ********************************************************
ok: [localhost]

TASK [create boot disks] ******************************************************
changed: [localhost] => (item={u'node': u'elasticsearch-link-0', u'ip_field': u'private_ip', u'zone': u'europe-west1-d', u'cluster_name': u'elasticsearch-link', u'num': u'0', u'machine_type': u'n1-standard-4', u'project_id': u'[projectid]'})
changed: [localhost] => (item={u'node': u'elasticsearch-link-1', u'ip_field': u'private_ip', u'zone': u'europe-west1-d', u'cluster_name': u'elasticsearch-link', u'num': u'1', u'machine_type': u'n1-standard-4', u'project_id': u'[projectid]'})
ok: [localhost] => (item={u'node': u'elasticsearch-link-2', u'ip_field': u'private_ip', u'zone': u'europe-west1-d', u'cluster_name': u'elasticsearch-link', u'num': u'2', u'machine_type': u'n1-standard-4', u'project_id': u'[projectid]'})

TASK [create instances] *******************************************************
changed: [localhost] => (item={u'node': u'elasticsearch-link-0', u'ip_field': u'private_ip', u'zone': u'europe-west1-d', u'cluster_name': u'elasticsearch-link', u'num': u'0', u'machine_type': u'n1-standard-4', u'project_id': u'[projectid]'})
changed: [localhost] => (item={u'node': u'elasticsearch-link-1', u'ip_field': u'private_ip', u'zone': u'europe-west1-d', u'cluster_name': u'elasticsearch-link', u'num': u'1', u'machine_type': u'n1-standard-4', u'project_id': u'[projectid]'})
failed: [localhost] (item={u'node': u'elasticsearch-link-2', u'ip_field': u'private_ip', u'zone': u'europe-west1-d', u'cluster_name': u'elasticsearch-link', u'num': u'2', u'machine_type': u'n1-standard-4', u'project_id': u'[projectid]'}) => {"ansible_job_id": "371957735383.2688", "changed": false, "cmd": "/tmp/.ansible-airflow/ansible-tmp-1522742180.0-71790706749341/gce.py", "data": "", "failed": 1, "finished": 1, "item": {"cluster_name": "elasticsearch-link", "ip_field": "private_ip", "machine_type": "n1-standard-4", "node": "elasticsearch-link-2", "num": "2", "project_id": "[projectid]", "zone": "europe-west1-d"}, "msg": "Traceback (most recent call last):\n File \"/tmp/.ansible-airflow/ansible-tmp-1522742180.0-71790706749341/async_wrapper.py\", line 158, in _run_module\n (filtered_outdata, json_warnings) = _filter_non_json_lines(outdata)\n File \"/tmp/.ansible-airflow/ansible-tmp-1522742180.0-71790706749341/async_wrapper.py\", line 99, in _filter_non_json_lines\n raise ValueError('No start of json char found')\nValueError: No start of json char found\n", "stderr": "Traceback (most recent call last):\n File \"/tmp/ansible_OnIK1e/ansible_module_gce.py\", line 750, in <module>\n main()\n File \"/tmp/ansible_OnIK1e/ansible_module_gce.py\", line 712, in main\n module, gce, inames, number)\n File \"/tmp/ansible_OnIK1e/ansible_module_gce.py\", line 524, in create_instances\n instance, lc_machine_type, lc_image(), **gce_args\n File \"/usr/local/lib/python2.7/dist-packages/libcloud/compute/drivers/gce.py\", line 3874, in create_node\n self.connection.async_request(request, method='POST', data=node_data)\n File \"/usr/local/lib/python2.7/dist-packages/libcloud/common/base.py\", line 784, in async_request\n response = request(**kwargs)\n File \"/usr/local/lib/python2.7/dist-packages/libcloud/compute/drivers/gce.py\", line 121, in request\n response = super(GCEConnection, self).request(*args, **kwargs)\n File \"/usr/local/lib/python2.7/dist-packages/libcloud/common/google.py\", line 806, in request\n *args, **kwargs)\n File \"/usr/local/lib/python2.7/dist-packages/libcloud/common/base.py\", line 641, in request\n response = responseCls(**kwargs)\n File \"/usr/local/lib/python2.7/dist-packages/libcloud/common/base.py\", line 163, in __init__\n self.object = self.parse_body()\n File \"/usr/local/lib/python2.7/dist-packages/libcloud/common/google.py\", line 268, in parse_body\n raise GoogleBaseError(message, self.status, code)\nlibcloud.common.google.GoogleBaseError: u\"The zone 'projects/[projectid]/zones/europe-west1-d' does not have enough resources available to fulfill the request. Try a different zone, or try again later.\"\n", "stderr_lines": ["Traceback (most recent call last):", " File \"/tmp/ansible_OnIK1e/ansible_module_gce.py\", line 750, in <module>", " main()", " File \"/tmp/ansible_OnIK1e/ansible_module_gce.py\", line 712, in main", " module, gce, inames, number)", " File \"/tmp/ansible_OnIK1e/ansible_module_gce.py\", line 524, in create_instances", " instance, lc_machine_type, lc_image(), **gce_args", " File \"/usr/local/lib/python2.7/dist-packages/libcloud/compute/drivers/gce.py\", line 3874, in create_node", " self.connection.async_request(request, method='POST', data=node_data)", " File \"/usr/local/lib/python2.7/dist-packages/libcloud/common/base.py\", line 784, in async_request", " response = request(**kwargs)", " File \"/usr/local/lib/python2.7/dist-packages/libcloud/compute/drivers/gce.py\", line 121, in request", " response = super(GCEConnection, self).request(*args, **kwargs)", " File \"/usr/local/lib/python2.7/dist-packages/libcloud/common/google.py\", line 806, in request", " *args, **kwargs)", " File \"/usr/local/lib/python2.7/dist-packages/libcloud/common/base.py\", line 641, in request", " response = responseCls(**kwargs)", " File \"/usr/local/lib/python2.7/dist-packages/libcloud/common/base.py\", line 163, in __init__", " self.object = self.parse_body()", " File \"/usr/local/lib/python2.7/dist-packages/libcloud/common/google.py\", line 268, in parse_body", " raise GoogleBaseError(message, self.status, code)", "libcloud.common.google.GoogleBaseError: u\"The zone 'projects/[projectid]/zones/europe-west1-d' does not have enough resources available to fulfill the request. Try a different zone, or try again later.\""]}
to retry, use: --limit @/usr/local/airflow/ansible/playbooks/elasticsearch-link-cluster-create.retry

The error message does not point to a quota problem but to a shortage of resources in the zone. I would advise you to try a different zone.

Quoting from the documentation:

Even if you have a regional quota, it is possible that a resource might not be available in a specific zone. For example, you might have quota in region us-central1 to create VM instances, but might not be able to create VM instances in the zone us-central1-a if the zone is depleted. In such cases, try creating the same resource in another zone, such as us-central1-f.

Therefore, your script should take this possibility into account, even though it is not very common.

This issue is even more pronounced for preemptible instances, since:

Preemptible instances are finite Compute Engine resources, so they might not always be available. [...] these instances if it requires access to those resources for other tasks. Preemptible instances are excess Compute Engine capacity so their availability varies with usage.
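One way to take a depleted zone into account, as the quoted documentation suggests, is to try a list of zones in order. The following is a minimal Python sketch of that idea, not the Ansible or libcloud API: `create_in_first_available_zone`, `is_stockout`, and `fake_create` are hypothetical names, the zone list is only an example, and the stockout check simply matches the wording of the error message quoted in the question.

```python
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")

# Substring taken from the error message quoted in the question.
STOCKOUT_MARKER = "does not have enough resources available"


def is_stockout(error: Exception) -> bool:
    """Return True if the error text looks like a zone stockout."""
    return STOCKOUT_MARKER in str(error)


def create_in_first_available_zone(
    create: Callable[[str], T],
    zones: Iterable[str],
) -> T:
    """Call create(zone) for each zone in order until one succeeds.

    Only stockout errors move on to the next zone; any other error is
    re-raised immediately so that unrelated failures are not hidden.
    """
    last_error: Optional[Exception] = None
    for zone in zones:
        try:
            return create(zone)
        except Exception as error:
            if not is_stockout(error):
                raise
            last_error = error
    if last_error is None:
        raise ValueError("no zones were given")
    raise last_error


# Hypothetical stand-in for the real provisioning call, for illustration only.
def fake_create(zone: str) -> str:
    if zone == "europe-west1-d":
        raise RuntimeError(
            "The zone 'projects/[project]/zones/europe-west1-d' does not have "
            "enough resources available to fulfill the request."
        )
    return "created in " + zone


if __name__ == "__main__":
    zones = ["europe-west1-d", "europe-west1-b"]
    print(create_in_first_available_zone(fake_create, zones))
```

Note that persistent disks are zonal, so in a playbook like the one in the question the boot disk and the instance would have to be created in the same zone for each attempt.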

UPDATE

To double-check what I am saying, you can keep the preemptible flag and change the zone. That would confirm that the script works properly and that you are hitting a stockout during the evening (and since it works during the day, this should be the case).

  • If the issue really is availability, you might consider spinning up a preemptible instance and, if none is available, catching the error and then falling back either to a regular instance or to a different zone.
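The fallback described in the bullet above leaves open which to give up first: the preemptible flag or the zone. Below is a minimal Python sketch of both orderings, under stated assumptions: `plan_attempts`, `create_with_fallback`, and `fake_create` are hypothetical names, `create` stands in for whatever actually provisions the instance, and stockouts are recognised only by the wording of the error in the question.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")
Attempt = Tuple[str, bool]  # (zone, preemptible)

# Substring taken from the error message quoted in the question.
STOCKOUT_MARKER = "does not have enough resources available"


def plan_attempts(zones: Sequence[str], zones_first: bool = True) -> List[Attempt]:
    """Order the (zone, preemptible) combinations to try.

    zones_first=True:  try preemptible in every zone before a regular instance.
    zones_first=False: stay in each zone and drop to a regular instance there.
    """
    if zones_first:
        return [(z, True) for z in zones] + [(z, False) for z in zones]
    return [(z, p) for z in zones for p in (True, False)]


def create_with_fallback(
    create: Callable[[str, bool], T],
    attempts: Sequence[Attempt],
) -> Tuple[Attempt, T]:
    """Run the attempts in order and return the first that succeeds.

    Only stockout errors lead to the next attempt; other errors propagate.
    """
    errors: List[str] = []
    for zone, preemptible in attempts:
        try:
            return (zone, preemptible), create(zone, preemptible)
        except Exception as error:
            if STOCKOUT_MARKER not in str(error):
                raise
            errors.append("%s preemptible=%s: %s" % (zone, preemptible, error))
    raise RuntimeError("every attempt hit a stockout:\n" + "\n".join(errors))


# Hypothetical stand-in: preemptible capacity is exhausted everywhere.
def fake_create(zone: str, preemptible: bool) -> str:
    if preemptible:
        raise RuntimeError(
            "The zone '%s' does not have enough resources available "
            "to fulfill the request." % zone
        )
    return "regular instance in " + zone


if __name__ == "__main__":
    plan = plan_attempts(["europe-west1-d", "europe-west1-b"])
    print(create_with_fallback(fake_create, plan))
```

Trying every zone first keeps the cost benefit of preemptible instances for as long as possible, while falling back within the same zone keeps the instance next to a boot disk that has already been created there.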

UPDATE2

As promised, I created the feature request on your behalf; you can follow the updates on the public tracker. I advise you to star it in order to receive updates by email.


 