Data Platform/Data Lake/Data Issues/2023-11 eventgate-analytics-external Data Loss
Date issue detected: | 2023-10-17 |
Date resolved: | 2023-11-06 |
Start of issue: | 2020-12-02 |
Status: | Resolved |
Phabricator Ticket(s): | T326002 - [Event Platform] eventgate-wikimedia occasionally fails to produce events due to stream config fetch errors |
Summary
The intake service for analytics instrumentation event data had a subtle bug that caused it to drop about 0.25% of events. Only analytics instrumentation event streams handled by this one service were affected.
Description
Dynamic stream config was a feature added to eventgate-wikimedia to allow faster deployments of streams. If enabled, eventgate-wikimedia will look up stream config for stream if it does not yet have it. That is, you can produce to a stream, and if eventgate-wikimedia does not have cached stream config for the stream, it will attempt to fetch the stream config at request time. This feature was only used by the eventgate-analytics-external deployment of eventgate-wikimedia.
In hindsight, this feature caused some subtle bugs. In 2020-10, we noticed some produce failures via failures of the ProduceCanaryEvents job. We attempted to work around this problem by refactoring eventgate-wikimedia's stream config code to be more efficient.
This change seems to have fixed the original problem (needs citation :) ), but introduced a subtle race condition. Stream configs are recached periodically via a TTL. When TTL is enabled alongside of the dynamic stream config feature, it was possible that the caches were cleared at the time a produce request comes in. The produce request would end up causing the cleared configs to be cached before the TTL caused recache finished. This would cause the stream to exist in the eventgate-wikimedia stream config cache, but with empty configuration settings. This resulted in the stream 'is configured but does not have any settings' error.
Until the next TTL caused recache, any stream with cached empty config settings would fail produce requests.
This bug was not noticed for years, as eventgate-analytics-external accepts events from the internet at large, and there is no alerting on produce failures of this kind, as the internet can attempt to produce anything it wants to this service, resulting in common rejections of events.
Because this is a race condition, there is no regular pattern of streams and times that had data loss. Any stream might have data loss for periods of up to the TTL of 5 minutes at a time. Also, because there are multiple replicas of the eventgate-analytics-external service, and each replica has its own in memory cache, this could affect any variety of streams at at time.
We don't have good information for a complete summary of lost data. The best we have is logs in logstash, for which we only have a 90 day retention. As an extreme example, on 2023-11-05, ~5.5 million events were lost because of this bug. It is difficult to say in which streams this data loss occurred. On 11-05, there were about ~364 million events produced to eventgate-analytics-external. This adds up to about 1.5% of events lost on that day. Note that this is an extreme example, and probably happened because a high volume stream (session_tick perhaps?) happened to be affected by the race condition, possibly several times in one day. Also note that higher volume streams in general are more likely to encounter this race condition, as there are more chances for the recache TTL to coincide with a produce request.
Additionally, this bug was the main cause of the not-infrequent ProduceCanaryEvents failure alerts Data Engineers had to respond to in Ops Week.
Estimate of loss
Here's a back of the napkin estimation of total percentage of events lost:
- According to logstash, there were 49547395 of these error messages between 2023-09-11 and 2023-11-06.
- Produce rate to eventgate-analytics-external is around ~4000 events per second, give or take 1000ish depending on time of day.
- That comes to about 345600000 events per day
- There 56 days between 2023-09-11 and 2023-11-06, so approximately 345600000*56 = 19353600000 events were produced in this time.
- 49547395 / 19353600000 = 0.00256011 = 0.25% loss
According to this very rough estimate, eventgate-analytics-external dropped about 0.25% of events it should have produced.
Fix
The bug has been fixed by removing the dynamic stream config feature. Instead, all stream configs are recached every 60 seconds, and no stream configs are looked up at request time.
Recommendations
It would be nice to have alerting on these kinds of errors. To do this, we'd need a logstash error log frequency / anomaly based alert.
Root Cause
Bug in dynamic stream config code.
Affected Datasets
All event stream data collected via eventgate-analytics-external. These are analytics instrumentation event streams.
webrequest based traffic data (e.g. Pageviews, Unique Devices) was not affected.
The SessionLength metric is affected, as it is computed from the mediawiki.client.session_tick event stream.
List of possibly affected streams as of 2023-12
This is a list of all streams that use the eventgate-analytics-external service as of 2023-12-11. Streams that previously existed during the affected time period are not included here.
curl 'https://meta.wikimedia.org/w/api.php?action=streamconfigs&constraints=destination_event_service=eventgate-analytics-external' | jq -r '.streams | keys'
- android.app_appearance_settings_interaction
- android.app_session
- android.article_link_preview_interaction
- android.article_page_scroll_interaction
- android.article_toc_interaction
- android.article_toolbar_interaction
- android.breadcrumbs_event
- android.create_account_interaction
- android.customize_toolbar_interaction
- android.daily_stats
- android.edit_history_interaction
- android.find_in_page_interaction
- android.image_recommendation_event
- android.image_recommendation_interaction
- android.install_referrer_event
- android.metrics_platform.article_link_preview_interaction
- android.metrics_platform.article_toc_interaction
- android.metrics_platform.article_toolbar_interaction
- android.metrics_platform.find_in_page_interaction
- android.notification_interaction
- android.reading_list_interaction
- android.user_contribution_screen
- app_donor_experience
- app_patroller_experience
- app_session
- development.network.probe
- eventgate-analytics-external.error.validation
- eventgate-analytics-external.test.event
- eventlogging_CentralNoticeBannerHistory
- eventlogging_CentralNoticeImpression
- eventlogging_CodeMirrorUsage
- eventlogging_ContentTranslationAbuseFilter
- eventlogging_CpuBenchmark
- eventlogging_DesktopWebUIActionsTracking
- eventlogging_EditAttemptStep
- eventlogging_HelpPanel
- eventlogging_HomepageModule
- eventlogging_HomepageVisit
- eventlogging_InukaPageView
- eventlogging_KaiOSAppFeedback
- eventlogging_KaiOSAppFirstRun
- eventlogging_LandingPageImpression
- eventlogging_MobileWebUIActionsTracking
- eventlogging_NavigationTiming
- eventlogging_NewcomerTask
- eventlogging_PaintTiming
- eventlogging_PrefUpdate
- eventlogging_QuickSurveyInitiation
- eventlogging_QuickSurveysResponses
- eventlogging_ReferencePreviewsBaseline
- eventlogging_ReferencePreviewsCite
- eventlogging_ReferencePreviewsPopups
- eventlogging_SaveTiming
- eventlogging_SearchSatisfaction
- eventlogging_ServerSideAccountCreation
- eventlogging_SpecialInvestigate
- eventlogging_SuggestedTagsAction
- eventlogging_TemplateDataApi
- eventlogging_TemplateDataEditor
- eventlogging_TemplateWizard
- eventlogging_Test
- eventlogging_TranslationRecommendationAPIRequests
- eventlogging_TranslationRecommendationUIRequests
- eventlogging_TranslationRecommendationUserAction
- eventlogging_TwoColConflictConflict
- eventlogging_TwoColConflictExit
- eventlogging_UniversalLanguageSelector
- eventlogging_VirtualPageView
- eventlogging_VisualEditorFeatureUse
- eventlogging_VisualEditorTemplateDialogUse
- eventlogging_WMDEBannerEvents
- eventlogging_WMDEBannerInteractions
- eventlogging_WMDEBannerSizeIssue
- eventlogging_WikibaseTermboxInteraction
- eventlogging_WikidataCompletionSearchClicks
- eventlogging_WikipediaPortal
- inuka.wiki_highlights_experiment
- ios.edit_history_compare
- ios.login_action
- ios.navigation_events
- ios.notification_interaction
- ios.reading_lists
- ios.search
- ios.sessions
- ios.setting_action
- ios.suggested_edits_alt_text_prototype
- ios.talk_page_interaction
- ios.user_history
- ios.watchlists
- mediawiki.accountcreation_block
- mediawiki.client.session_tick
- mediawiki.content_translation_event
- mediawiki.edit_attempt
- mediawiki.editattempt_block
- mediawiki.editgrowthconfig
- mediawiki.ipinfo_interaction
- mediawiki.maps_interaction
- mediawiki.mediasearch_interaction
- mediawiki.mentor_dashboard.interaction
- mediawiki.mentor_dashboard.personalized_praise
- mediawiki.mentor_dashboard.visit
- mediawiki.page_content_change.v1
- mediawiki.pref_diff
- mediawiki.reading_depth
- mediawiki.searchpreview
- mediawiki.skin_diff
- mediawiki.special_diff_interactions
- mediawiki.structured_task.article.image_suggestion_interaction
- mediawiki.structured_task.article.link_suggestion_interaction
- mediawiki.talk_page_edit
- mediawiki.web_ab_test_enrollment
- mediawiki.web_ui_actions
- mediawiki.web_ui_scroll
- mediawiki.web_ui_scroll_migrated
- mediawiki.welcomesurvey.interaction
- mediawiki.wikistories_consumption_event
- mediawiki.wikistories_contribution_event
- mwcli.command_execute
- test.instrumentation
- test.instrumentation.sampled
- wd_propertysuggester.client_side_property_request
- wd_propertysuggester.server_side_property_request
- wikifunctions.ui