Portal:Cloud VPS/Admin/Runbooks/TfInfraTestApplyFailed
Error / Incident
This alert means that our tofu tests failed to apply, that is, to create all the resources.
We run the tofu tests from the machine tf-bastion.tofuinfratest.eqiad1.wikimedia.cloud. They regularly try to apply (create) and then destroy a collection of resources on Cloud VPS.
The tests are triggered daily by a cronjob for the root user:
root@tf-bastion:~# crontab -l
[...]
0 12 * * * systemd-cat -t tf-infra-test /root/tf-infra-test/tofu-test.sh eqiad1
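Because the cron entry pipes the script output through systemd-cat -t tf-infra-test, the output of the last scheduled run should be available in the journal under that identifier. A minimal sketch for reading it follows; the function name tf_infra_test_logs is hypothetical and not part of the existing tooling:

```shell
# Hypothetical helper: show journal entries tagged "tf-infra-test".
# The optional first argument is passed to --since (defaults to "today").
tf_infra_test_logs() {
    journalctl -t tf-infra-test --since "${1:-today}" --no-pager
}
```

Run as root on the bastion, for example tf_infra_test_logs yesterday to include the previous day's run.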
Debugging
SSH to tf-bastion.tofuinfratest.eqiad1.wikimedia.cloud and try running the creation manually; that will give you more logs.
NOTE: if you are not a member of the project, you can try sshing as root, or using the console cookbook (wmcs.openstack.cloudvirt.vm_console).
dcaro@urcuchillay$ ssh tf-bastion.tofuinfratest.eqiad1.wikimedia.cloud
dcaro@tf-bastion$ sudo -i
root@tf-bastion:~# cd tf-infra-test/
root@tf-bastion:~/tf-infra-test# tofu apply -var datacenter=eqiad1
data.openstack_images_image_v2.debian: Reading...
...
Do you want to perform these actions?
OpenTofu will perform the actions described above.
Only 'yes' will be accepted to approve.
Enter a value: yes
...
If you know which resource failed, you can also try applying only that specific resource, like this:
root@tf-bastion:~/tf-infra-test# tofu apply -var datacenter=eqiad1 --target=openstack_db_instance_v1.postgresql
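If you do not yet know which resource failed, one option is to save the apply output to a file and search it for error lines. This is only a sketch: the helper name tofu_errors and the log path /tmp/apply.log are hypothetical, and it assumes the failures are reported on lines containing "Error:".

```shell
# Hypothetical helper: print every line containing "Error:" from a saved
# tofu log, with its line number and the 5 lines that follow it.
tofu_errors() {
    grep -n -A 5 'Error:' "$1"
}

# Example usage (assumed log path), run from ~/tf-infra-test:
#   tofu apply -var datacenter=eqiad1 -no-color 2>&1 | tee /tmp/apply.log
#   tofu_errors /tmp/apply.log
```

The resource address shown next to the error can then be passed to --target as in the command above.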
Common issues
Add here any new common issues you find.
Quota errors when creating the trove databases
There are several quotas attached to Trove databases:
- Project database quotas
- Trove's regular quotas
It might be that you are hitting your own project's database quotas, or that Trove is hitting its own.
Quota usage and requests out of sync
It happened once that the quota that Trove registers as 'requested' and the one it registers as 'used' got out of sync. It seems that the requests get lost somehow and Trove does not decrease the 'requested' counter.
This shows up as a regular quota error. To debug and fix it, see Portal:Cloud_VPS/Admin/Trove#Reserved_quota_does_not_go_down.
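To spot an exhausted or stuck quota quickly, the quota table can be filtered for rows where used plus reserved has reached the limit. The sketch below makes several assumptions: the helper name quota_exhausted is hypothetical, the input is an OpenStack-style table with the columns Resource, In Use, Reserved and Limit in that order, and the command in the usage comment (including the project name) is assumed rather than taken from this runbook.

```shell
# Hypothetical helper: read an OpenStack-style table on stdin and print
# the rows where In Use + Reserved has reached a non-negative Limit.
# Assumed column order: Resource | In Use | Reserved | Limit
quota_exhausted() {
    awk -F'|' '
        NF >= 5 && $3 ~ /^[ ]*[0-9]+[ ]*$/ {
            name = $2; gsub(/ /, "", name)
            in_use = $3 + 0; reserved = $4 + 0; limit = $5 + 0
            if (limit >= 0 && in_use + reserved >= limit)
                printf "%s: in_use=%d reserved=%d limit=%d\n", name, in_use, reserved, limit
        }'
}

# Example usage (assumed command and project name):
#   openstack database quota show tofuinfratest | quota_exhausted
```

A row with a non-zero reserved value and no instance being created at that moment would be consistent with the out-of-sync case described above.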
RabbitMQ issues
Sometimes restarting RabbitMQ fixes an issue where tofuinfratest fails to create Trove instances:
cloudcumin1001:~$ sudo cumin 'cloudrabbit*' 'systemctl restart rabbitmq-server'
Related information
Communication and support
Support and administration of the WMCS resources is provided by the Wikimedia Foundation Cloud Services team and Wikimedia movement volunteers. Please reach out with questions and join the conversation:
- Chat in real time in the IRC channel #wikimedia-cloud or the bridged Telegram group
- Discuss via email after you have subscribed to the cloud@ mailing list
- Subscribe to the cloud-announce@ mailing list (all messages are also mirrored to the cloud@ list)
- Read the News wiki page
- Use a subproject of the #Cloud-Services Phabricator project to track confirmed bug reports and feature requests about the Cloud Services infrastructure itself
- Read the Cloud Services Blog (for the broader Wikimedia movement, see the Wikimedia Technical Blog)
Old incidents
Add here any new tasks for incidents you might encounter.
- {{phab:T341764}}