User:Razzi/Debugging eventlogging to druid network flows internal hourly.service
In IRC I saw this alert today:
PROBLEM - Check unit status of eventlogging_to_druid_network_flows_internal_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_network_flows_internal_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
SSHing into an-launcher1002 showed this error in the unit's logs:
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 ERROR DataFrameToDruid: Druid ingestion task index_hadoop_network_flows_internal_lggaghgk_2022-02-18T22:00:35.639Z for network_flows_internal failed
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO HiveToDruid: Done.
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO SparkContext: Invoking stop() from shutdown hook
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO SparkUI: Stopped Spark web UI at http://an-launcher1002.eqiad.wmnet:4041
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO YarnClientSchedulerBackend: Interrupting monitor thread
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO YarnClientSchedulerBackend: Shutting down all executors
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: (serviceOption=None,
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: services=List(),
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: started=false)
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO YarnClientSchedulerBackend: Stopped
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO MemoryStore: MemoryStore cleared
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO BlockManager: BlockManager stopped
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO BlockManagerMaster: BlockManagerMaster stopped
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO SparkContext: Successfully stopped SparkContext
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO ShutdownHookManager: Shutdown hook called
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO ShutdownHookManager: Deleting directory /tmp/spark-fcccd681-efb8-4816-9a57-f8a66dc0b7db
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO ShutdownHookManager: Deleting directory /tmp/spark-f21f647b-6de0-4473-94de-46767d4f8fc8
Feb 18 22:20:39 an-launcher1002 systemd[1]: eventlogging_to_druid_network_flows_internal_hourly.service: Main process exited, code=exited, status=1/FAILURE
Feb 18 22:20:39 an-launcher1002 systemd[1]: eventlogging_to_druid_network_flows_internal_hourly.service: Failed with result 'exit-code'.
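Most of the output above is routine Spark shutdown noise; the only actionable line is the ERROR from DataFrameToDruid, which names the failed Druid ingestion task. As a rough sketch (not part of the existing tooling), a small Python filter could pull the failed task ID and datasource out of a log dump like this one. The function names `find_errors` and `failed_tasks` are hypothetical, and the regular expressions assume the line layout seen in this log only.

```python
import re

# Assumed layout, taken from the log above:
# "<journal prefix>]: YY/MM/DD HH:MM:SS LEVEL Logger: message"
SPARK_LINE = re.compile(
    r"\]: \d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} "
    r"(?P<level>[A-Z]+) (?P<logger>[^:\s]+): (?P<message>.*)$"
)

# Assumed wording of the DataFrameToDruid failure message seen above.
TASK_FAILED = re.compile(
    r"Druid ingestion task (?P<task_id>\S+) for (?P<datasource>\S+) failed"
)


def find_errors(journal_lines):
    """Return (logger, message) pairs for every ERROR-level line."""
    errors = []
    for line in journal_lines:
        match = SPARK_LINE.search(line)
        if match and match.group("level") == "ERROR":
            errors.append((match.group("logger"), match.group("message")))
    return errors


def failed_tasks(journal_lines):
    """Return (task_id, datasource) pairs for failed Druid ingestion tasks."""
    tasks = []
    for _logger, message in find_errors(journal_lines):
        match = TASK_FAILED.search(message)
        if match:
            tasks.append((match.group("task_id"), match.group("datasource")))
    return tasks


# Two lines copied from the log above, used as sample input.
SAMPLE = [
    "Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: "
    "22/02/18 22:20:39 ERROR DataFrameToDruid: Druid ingestion task "
    "index_hadoop_network_flows_internal_lggaghgk_2022-02-18T22:00:35.639Z "
    "for network_flows_internal failed",
    "Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: "
    "22/02/18 22:20:39 INFO HiveToDruid: Done.",
]

for task_id, datasource in failed_tasks(SAMPLE):
    print(f"{datasource}: {task_id}")
```

The extracted task ID is the natural starting point for looking up why the ingestion itself failed on the Druid side, since the launcher log only reports that it did.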
Unfortunately, the unit has been flapping: it recovers and then fails again, over and over. Root cause: TBD.