Portal:Cloud VPS/Admin/Runbooks/TfInfraTestApplyFailed

The procedures in this runbook require admin permissions to complete.

Error / Incident

This alert means that our tofu tests failed to apply, that is, they failed to create all the resources.

We run the tofu tests from the machine tf-bastion.tofuinfratest.eqiad1.wikimedia.cloud, which regularly tries to apply (create) and then destroy a collection of resources on Cloud VPS.

The tests are triggered daily by a cronjob for the root user:

root@tf-bastion:~# crontab -l

[...]

0 12 * * * systemd-cat -t tf-infra-test /root/tf-infra-test/tofu-test.sh eqiad1
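Because the cron entry pipes the script through systemd-cat -t tf-infra-test, the output of the scheduled run should end up in the systemd journal under that identifier. A minimal sketch for reading it back, assuming the journal on tf-bastion still holds the run you are interested in (the function name and the default time window are illustrative and not part of the tf-infra-test repo):

```shell
# Print what the cron-triggered run logged to the journal.
# systemd-cat -t sets the syslog identifier, which journalctl -t filters on.
show_tf_infra_test_logs() {
    # Hypothetical helper: $1 is any value accepted by journalctl --since.
    since="${1:-today}"
    journalctl -t tf-infra-test --since "$since" --no-pager
}
```

For example, show_tf_infra_test_logs yesterday would list the output of the previous day's run, if it is still retained.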

Debugging

SSH to tf-bastion.tofuinfratest.eqiad1.wikimedia.cloud and try running the creation manually; that will give you more logs.

NOTE: if you are not a member of the project, you can try sshing as root, or using the console cookbook (wmcs.openstack.cloudvirt.vm_console).

dcaro@urcuchillay$ ssh tf-bastion.tofuinfratest.eqiad1.wikimedia.cloud
dcaro@tf-bastion$ sudo -i
root@tf-bastion:~# cd tf-infra-test/
root@tf-bastion:~/tf-infra-test# tofu apply -var datacenter=eqiad1
data.openstack_images_image_v2.debian: Reading...
...
Do you want to perform these actions?
  OpenTofu will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes
...

If you know which resource failed, you can also try applying only that specific resource, like this:

root@tf-bastion:~/tf-infra-test# tofu apply -var datacenter=eqiad1 --target=openstack_db_instance_v1.postgresql
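If more than one resource is suspect, the target flag can be repeated. The following is a rough sketch of a wrapper that does this, assuming it is run from ~/tf-infra-test as root; apply_targets is a hypothetical helper name, not something shipped with the repo:

```shell
# Apply only the listed resources, passing one -target flag per resource.
apply_targets() {
    # Hypothetical helper: $1 is the datacenter, the rest are resource addresses.
    datacenter="$1"
    shift
    # Replace each remaining argument with its own -target=<resource> flag.
    for resource in "$@"; do
        set -- "$@" "-target=$resource"
        shift
    done
    tofu apply -var "datacenter=$datacenter" "$@"
}
```

Calling apply_targets eqiad1 openstack_db_instance_v1.postgresql is then equivalent to the single-resource command above, and tofu still asks for confirmation before creating anything.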

Common issues

Add here any new common issues you find.

Quota errors when creating the Trove databases

There are several quotas attached to Trove databases:

Project database quotas
Trove's regular quotas

It might be that you are hitting your own project's database quotas, or that Trove is hitting its own.

Quota usage and requests out of sync

It happened once that the quota Trove registers as 'requested' and the one it registers as 'used' got out of sync. It seems that requests get lost somehow and Trove does not decrease the 'requested' counter.

You'll see that as a regular quota error. To debug and fix it, see Portal:Cloud_VPS/Admin/Trove#Reserved_quota_does_not_go_down.

RabbitMQ issues

Sometimes restarting RabbitMQ fixes an issue where tofuinfratest fails to create Trove instances:

cloudcumin1001:~$ sudo cumin 'cloudrabbit*' 'systemctl restart rabbitmq-server'

Related information

tf-infra-test repo

Communication and support

Support and administration of the WMCS resources is provided by the Wikimedia Foundation Cloud Services team and Wikimedia movement volunteers. Please reach out with questions and join the conversation:

Discuss and receive general support

Chat in real time in the IRC channel #wikimedia-cloud or the bridged Telegram group
Discuss via email after you have subscribed to the cloud@ mailing list

Stay aware of critical changes and plans

Subscribe to the cloud-announce@ mailing list (all messages are also mirrored to the cloud@ list)
Read the News wiki page

Track work tasks and report bugs

Use a subproject of the #Cloud-Services Phabricator project to track confirmed bug reports and feature requests about the Cloud Services infrastructure itself

Read stories and WMCS blog posts

Read the Cloud Services Blog (for the broader Wikimedia movement, see the Wikimedia Technical Blog)

Old incidents

Add here any new tasks for incidents you might encounter.

phab:T341764