I recently took a course, Software Systems: Behind The Abstractions, where I spent several weeks popping the hood on systems ranging from simple interpreters to Postgres. As a former mechanical engineer, I love exploring “how stuff works”. However, I had a more practical motivation for SSBA.
All engineers work with abstractions. However, given enough work with (non-trivial) abstractions, implementation details eventually leak through. When this happens, knowledge of the systems behind the abstractions can be the key to impactful optimizations — or even unblocking a project entirely.
I recently encountered a "leak" during an incident involving the surprising behavior of "negative" browser timeouts.
`x` is a negative number. What does the console display when the following statement runs in the browser*?

    setTimeout(() => console.log("Timeout!"), x)

*Disregard additional delays due to event loop queueing.
4 is correct — but why? The MDN web docs don’t answer this question (at least not without subtle extrapolation from a passing comment). For the answer, we need to look behind the browser API.
It’s Monday morning. I’m on call and getting alerts that our frontend can’t connect to a backend service. Users are contacting support. I spin up my dev environment and reproduce the issue.
The timing is strange. We had a long weekend and haven’t deployed for 3 days. Further, I don’t see relevant changes in the last week. I rebuild dev with a week-old commit and again reproduce the issue.
Without correlating changes, I start stepping through the frontend connection code, and my colleagues dive in as well. The connection is established in a callback for a `setTimeout` statement. The callback should fire within seconds, but it never does.
Here’s a simplified version of the code:

    // lastConnTime is undefined or a past timestamp
    // Wait 5 seconds if not first connection
    let nextConnTime = (lastConnTime ?? 0) + 5000;
    let delay = nextConnTime - Date.now();
    // Pass the function itself; calling it here would run it immediately
    setTimeout(connectService, delay);

The service wasn’t connecting, so `lastConnTime` was treated as zero and the value assigned to `delay` was a large negative number (as I write, the expression evaluates to less than -2^40).
I didn’t know how `setTimeout` handled negative delays, so I tested a few values:
| Case | Delay value (ms) | Behavior |
|---|---|---|
| Small | -1 | Fires immediately |
| Medium | -1,000,000 | Fires immediately |
| Day before incident | -1,655,508,200,000 | Fires immediately |
| During incident | -1,655,737,200,000 | Never fires |
It was obvious that there was something nuanced happening with large negative values, and that we should just clip the minimum delay value at 0 (which our frontend engineer did, resolving the immediate incident).
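A minimal sketch of that kind of clipping is shown here. The helper names `clampDelay` and `nextConnectionDelay` are hypothetical, not taken from our codebase, and the upper bound is an extra assumption on top of the fix described above: it keeps the delay within what a 32-bit signed integer can hold.

```javascript
// Largest delay a 32-bit signed integer can hold, in milliseconds (about 24.8 days).
const MAX_TIMEOUT_MS = 2 ** 31 - 1;

// Hypothetical helper: keep a computed delay inside the range setTimeout can represent.
function clampDelay(delay) {
  return Math.min(Math.max(0, delay), MAX_TIMEOUT_MS);
}

// Hypothetical helper mirroring the simplified connection code.
// `now` is a parameter only so the function is easy to exercise with fixed timestamps.
function nextConnectionDelay(lastConnTime, now = Date.now()) {
  const nextConnTime = (lastConnTime ?? 0) + 5000;
  return clampDelay(nextConnTime - now);
}

console.log(nextConnectionDelay(undefined, 1655737200000));     // 0
console.log(nextConnectionDelay(1655737200000, 1655737201000)); // 4000
```

With the clamp in place, a first connection (no previous timestamp) is scheduled right away instead of depending on how a huge negative number happens to be truncated.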
Despite the fix, this resolution was unsettling — our code had worked for weeks in dev and production before suddenly breaking. Seeing the magnitude of the numbers brought our attention back to an MDN comment about overflow:
Browsers including Internet Explorer, Chrome, Safari, and Firefox store the delay as a 32-bit signed integer internally. This causes an integer overflow when using delays larger than 2,147,483,647 ms (about 24.8 days), resulting in the timeout being executed immediately.
Our values were, in fact, overflowing, but had always been doing so (the minimum 32-bit signed integer is -2,147,483,648, about three orders of magnitude smaller in magnitude than our delay values). Digging deeper, I found the Firefox SetTimeout source:

    nsresult TimeoutManager::SetTimeout(
        TimeoutHandler* aHandler,
        int32_t interval,
        bool aIsInterval,
        Timeout::Reason aReason,
        int32_t* aReturn
    ) {
      //...code omitted for brevity...
      interval = std::max(0, interval);
      //...code omitted for brevity...

This confirmed that (at least for Firefox) the browser implementation stores the delay as a 32-bit signed integer, and that the browser clips the minimum delay at zero.
I began thinking back to discussions of binary encodings from SSBA. JavaScript represents numbers as 64-bit floating-point values, while the browser's C++ code stores the delay in a 32-bit signed integer. Once a number overflows that integer, the sign of the original number is irrelevant. All that matters is the sign of the lowest 32 bits when interpreted as a (two's-complement) signed integer.
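To make the wrap-around concrete, here is a spelled-out sketch of squeezing a number into a 32-bit signed integer. It assumes the conversion keeps the low 32 bits and reads them as two's complement, which is also what JavaScript's own `| 0` does; `wrapToInt32` is a hypothetical name used only for illustration.

```javascript
// Spelled-out version of wrapping a number into a 32-bit signed integer.
function wrapToInt32(x) {
  const TWO_32 = 2 ** 32;
  const TWO_31 = 2 ** 31;
  // Keep only the low 32 bits (the non-negative remainder modulo 2^32).
  const low = ((Math.trunc(x) % TWO_32) + TWO_32) % TWO_32;
  // Read those bits as two's complement: the top bit carries weight -2^31.
  return low >= TWO_31 ? low - TWO_32 : low;
}

console.log(wrapToInt32(2 ** 31));        // -2147483648 (positive input, negative result)
console.log(wrapToInt32(-(2 ** 31) - 1)); // 2147483647  (negative input, positive result)
console.log(wrapToInt32(2 ** 32 + 5));    // 5
console.log(wrapToInt32(-(2 ** 32) - 5)); // -5
```

The first two lines show the point of the paragraph above: after overflow, the sign of the result has nothing to do with the sign of the input.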
Here are the values passed to `setTimeout` before and during the incident in 64-bit two’s-complement representations, with the 32nd lowest-order bit marked (negative two's-complement numbers have a leading one):
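The bit patterns can be reproduced with a short sketch like the following. The helper names `toBits64` and `int32SignBit` are hypothetical, and the two inputs are the "day before" and "during incident" values from the table above.

```javascript
// Render a number as a 64-bit two's-complement bit string, split so the
// low 32 bits (what an int32 keeps) stand apart from the high 32 bits.
function toBits64(x) {
  const bits = BigInt.asUintN(64, BigInt(x)).toString(2).padStart(64, "0");
  return `${bits.slice(0, 32)} | ${bits.slice(32)}`;
}

// Value of the 32nd lowest-order bit (bit index 31), which is the int32 sign bit.
function int32SignBit(x) {
  return Number((BigInt.asUintN(64, BigInt(x)) >> 31n) & 1n);
}

const dayBefore = -1655508200000;
const duringIncident = -1655737200000;

console.log(toBits64(dayBefore));
console.log(toBits64(duringIncident));
console.log(int32SignBit(dayBefore));      // 1: low 32 bits read as negative
console.log(int32SignBit(duringIncident)); // 0: low 32 bits read as positive
```

The bit to watch is the first one after the `|` separator. Both values share the same high 32 bits, so the only thing that changed between the two days is that one bit in the low half.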
Now the situation was clear — we were hitting the inverse of the situation mentioned in the MDN docs. Our overflowing delay value was previously getting truncated to a negative value, and at the time of the incident, the sign of the truncated value flipped.
The combined effect of the int32 truncation and following browser zero-min-value clipping is illustrated below:
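As a rough stand-in for that illustration, the sketch below models the two steps in JavaScript. It assumes the browser behaves like the Firefox excerpt (int32 truncation, then clipping at zero); `effectiveDelay` is a hypothetical name, and the inputs are the four values from the table earlier.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Model of the delay the browser ends up using, under the stated assumptions.
function effectiveDelay(requested) {
  const truncated = requested | 0; // keep the low 32 bits, read as signed
  return Math.max(0, truncated);   // negative intervals are clipped to 0
}

const cases = [-1, -1000000, -1655508200000, -1655737200000];
for (const requested of cases) {
  const ms = effectiveDelay(requested);
  const days = (ms / MS_PER_DAY).toFixed(1);
  console.log(`${requested} -> ${ms} ms (${days} days)`);
}
// The first three cases yield 0 ms; the last yields 2120176256 ms (24.5 days).
```

Under this model, "never fires" really means "fires in about three and a half weeks", which is indistinguishable from never for a connection that should open within seconds.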
The bug actually had nothing to do with negative input values — the real issue was passing delay values that could not be represented inside the browser API. This issue was then masked for a few weeks by the API clipping negative values at zero.
This was a particularly thorny “leak”, because we had a 25-day window where the overflowing delay value provided the desired behavior, followed by an immediate jump to practically infinite connection delays. This obscured the connection between code changes and the incident and allowed the incident to manifest at an entirely unexpected time, across all environments.
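The length of that window falls out of the integer width: the truncated value spends 2^31 ms on each side of zero before the sign flips again. A small sketch of the arithmetic follows, using the same assumed truncate-then-clip model and the incident value from the table; `delayAsSeenByBrowser` is a hypothetical name.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Assumed browser model: int32 truncation, then clipping at zero.
const delayAsSeenByBrowser = (requested) => Math.max(0, requested | 0);

// Each half of the wrap-around cycle lasts 2^31 ms.
const windowDays = 2 ** 31 / DAY_MS;
console.log(windowDays.toFixed(2)); // 24.86

// The same expression evaluated at the incident and 25 days later.
const duringIncident = -1655737200000;
const after25Days = duringIncident - 25 * DAY_MS;
console.log(delayAsSeenByBrowser(duringIncident)); // 2120176256
console.log(delayAsSeenByBrowser(after25Days));    // 0
```

So under this model the unclipped code would have alternated between roughly 25 days of working and roughly 25 days of not working, indefinitely.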
While this deeper understanding did not drive any further fixes, it did provide peace of mind that we understood the root cause and could be confident we had addressed this bug for good. As we closed out the incident, someone joked that “we could have just waited 25 days, and it would have worked again”.