Recently we've been seeing intermittent session timeout errors in tests that have run without issues, in some cases for years. The errors occur at various points in the test lifecycle, from before(:each) to after(:suite).
They take the form:
[2024-10-04T05:14:52Z] Selenium::WebDriver::Error::UnknownError:
[2024-10-04T05:14:52Z] Unable to execute request for an existing session: java.util.concurrent.TimeoutException
[2024-10-04T05:14:52Z] Build info: version: '4.20.0', revision: '866c76ca80'
[2024-10-04T05:14:52Z] System info: os.name: 'Linux', os.arch: 'aarch64', os.version: '6.1.106-116.188.amzn2023.aarch64', java.version: '17.0.11'
[2024-10-04T05:14:52Z] Driver info: driver.version: unknown
[2024-10-04T05:14:52Z] # ./spec/support/helpers/test_helpers/capybara_helpers.rb:7:in `visit'
[2024-10-04T05:14:52Z] # ./spec/features/fullscreen/playlist_spec.rb:33:in `block (2 levels) in <top (required)>'
[2024-10-04T05:14:52Z] # ./spec/rails_helper.rb:135:in `block (3 levels) in <top (required)>'
[2024-10-04T05:14:52Z] # ./spec/rails_helper.rb:134:in `block (2 levels) in <top (required)>'
[2024-10-04T05:14:52Z] # ./spec/support/good_job.rb:7:in `block (2 levels) in <top (required)>'
[2024-10-04T05:14:52Z] # ./spec/support/database_cleaner.rb:19:in `block (2 levels) in <top (required)>'
[2024-10-04T05:14:52Z] # ./spec/spec_helper.rb:177:in `block (4 levels) in <top (required)>'
[2024-10-04T05:14:52Z] # ./app/lib/tiny_otel.rb:86:in `in_span'
[2024-10-04T05:14:52Z] # ./spec/spec_helper.rb:176:in `block (3 levels) in <top (required)>'
[2024-10-04T05:14:52Z] # ./spec/spec_helper.rb:163:in `block (2 levels) in <top (required)>'
[2024-10-04T05:14:52Z] # ------------------
[2024-10-04T05:14:52Z] #
Caused by:
[2024-10-04T05:14:52Z] # Selenium::WebDriver::Error::WebDriverError:
[2024-10-04T05:14:52Z] # java.lang.RuntimeException: Unable to execute request for an existing session: java.util.concurrent.TimeoutException
The issue doesn't reproduce when running the tests locally (on our developer machines). We have tried a number of steps to fix the problem (or at least narrow it down):
- Specifying a custom HTTP client and increasing the timeout.
- Backing off and retrying in the case of these errors.
- Upgrading the selenium-webdriver gem to the very latest version (4.25.0).
- Upgrading to Chrome 124 from 117.
- Connecting to a running CI agent and inspecting it. Nothing seemed amiss: CPU and memory usage were both normal.
- Bumping shm allocation to 8GiB from 2GiB on the containers.
- Downloading system logs from CI agents while running to check for OOM killing.
- Detaching a CI instance from Buildkite and running the tests manually there. The intermittent session errors persisted.
- Doubling the size of the CI agents from c7g.8xlarge to c7g.16xlarge, in case of resource constraints.
- Switching from Chrome to Firefox (I was particularly surprised that this didn't resolve the issue).
- Explicitly closing the session at the end of every suite.
- Removing all session resets from tests and helpers.
- Trying x86 instead of ARM build agents.
- Running just one test group manually. This also produced the same session errors, which rules out both the sheer number of tests and the number of parallel Docker containers as the cause.
- Downgrading selenium-webdriver to a version from 2023.
- Installing Chromium and Chromedriver locally on the test container, instead of using a remote browser configuration across containers. This produced the same session errors, but with a local failure message.
- Setting the page load strategy to eager.
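For anyone wanting to try the "backing off and retrying" step from the list above, here is a minimal sketch of the idea, not our exact code. The helper name `with_session_retry`, the constant `SESSION_TIMEOUT_PATTERN`, the default attempt count, and the delays are all hypothetical. It assumes the failure can be recognised by its message text, as in the log above, rather than by a specific exception class.

```ruby
# Hypothetical helper: retry a block when it fails with the session-timeout
# message seen in the logs, sleeping with exponential backoff between attempts.
# Matching on the message text is an assumption; adjust to your own failures.
SESSION_TIMEOUT_PATTERN = /Unable to execute request for an existing session/

def with_session_retry(attempts: 3, base_delay: 0.5, sleeper: Kernel.method(:sleep))
  tries = 0
  begin
    tries += 1
    yield
  rescue StandardError => e
    # Re-raise unrelated errors immediately, and the session error once
    # the allowed number of attempts has been used up.
    raise unless e.message.match?(SESSION_TIMEOUT_PATTERN) && tries < attempts

    # Delay doubles each time: base_delay, 2 * base_delay, 4 * base_delay, ...
    sleeper.call(base_delay * (2**(tries - 1)))
    retry
  end
end

# Example use inside a spec helper (the path is illustrative):
#   with_session_retry { visit "/some/page" }
```

The `sleeper` argument exists only so the delay can be stubbed out in tests. As noted in the list, retrying like this did not remove the underlying failures for us; it only hides some of them.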
We also tried switching the tests from Selenium to Cuprite, at which point the timeouts disappeared.
When we run the tests with Chromium in the same container as the test runner, we get the same intermittent failures, but with a different error:
1.1) Failure/Error: page.driver.resize_window_to(page.driver.current_window_handle, 1920, 1080)
Selenium::WebDriver::Error::WebDriverError:
unable to connect to /root/.cache/selenium/chromedriver/linux64/129.0.6668.89/chromedriver 127.0.0.1:9515
# ./spec/support/selenium.rb:31:in `block (2 levels) in <top (required)>'
# ./spec/rails_helper.rb:135:in `block (3 levels) in <top (required)>'
# ./spec/rails_helper.rb:134:in `block (2 levels) in <top (required)>'
# ./spec/support/good_job.rb:7:in `block (2 levels) in <top (required)>'
# ./spec/support/database_cleaner.rb:19:in `block (2 levels) in <top (required)>'
# ./spec/spec_helper.rb:177:in `block (4 levels) in <top (required)>'
# ./app/lib/tiny_otel.rb:86:in `in_span'
# ./spec/spec_helper.rb:176:in `block (3 levels) in <top (required)>'
# ./spec/spec_helper.rb:163:in `block (2 levels) in <top (required)>'
We've filed a bug with Selenium, as this problem appears to be unrelated to our code, and raised a question on Stack Overflow as well.
Has anyone else experienced this problem and can shed some light on the cause? Right now we're working around it by retrying test groups that fail with this error.