Degraded

Automatos Primary Worker Offline

Dec 5, 2024 at 10:40am UTC

Affected services

Automatos - Arbitrum One

Automatos - Base

Automatos - Blast

Automatos - Boba Ethereum

Automatos - BSC

Automatos - Ethereum

Automatos - Filecoin

Automatos - Gravity

Automatos - Linea

Automatos - Lisk

Automatos - Manta Pacific

Automatos - Mantle

Automatos - Mode

Automatos - Moonbeam

Automatos - Optimism

Automatos - Polygon

Automatos - Polygon ZkEVM

Automatos - Rootstock

Automatos - Scroll

Automatos - Sei

Automatos - Taiko

Automatos - ZkSync Era

Resolved
Dec 8, 2024 at 5:43am UTC

Post Mortem: Automatos Primary Worker Offline

The primary worker node became unresponsive on Dec 5 at about 10:40am UTC.
Our team was notified at 11:08am UTC and responded immediately.
After a brief diagnosis, the primary worker was manually rebooted and was back online and fully functional at 11:17am UTC.

BetterStack, the third-party logging service we use for verbose debugging, went down briefly, which caused network errors on our side. We log network errors, but the volume of log data being sent exceeded what BetterStack permits, so the rejected submissions produced further errors to log, creating an endless cycle of errors. The server's network buffers kept growing until it ran out of memory and crashed.
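
The failure mode described above can be illustrated with a minimal sketch of a log buffer that cannot grow without bound. This is not our production code: the class name `BoundedLogBuffer`, the `send` callable, and the record limit are hypothetical and chosen only for illustration. The idea is that the buffer has a fixed capacity, and a failed shipment is counted rather than logged back into the same pipeline, so an outage at the log provider cannot feed itself.

```python
from collections import deque
from typing import Callable, Deque, List


class BoundedLogBuffer:
    """Hypothetical fixed-capacity buffer for log records awaiting shipment."""

    def __init__(self, max_records: int) -> None:
        # A deque with maxlen evicts the oldest record when full.
        self._records: Deque[str] = deque(maxlen=max_records)
        self.dropped = 0        # records evicted because the buffer was full
        self.ship_failures = 0  # failed shipments (counted, never re-logged)

    def push(self, record: str) -> None:
        if len(self._records) == self._records.maxlen:
            self.dropped += 1
        self._records.append(record)

    def flush(self, send: Callable[[List[str]], None]) -> bool:
        """Try to ship everything buffered; keep the records on failure."""
        batch = list(self._records)
        if not batch:
            return True
        try:
            send(batch)
        except OSError:
            # Do not log this error through the same pipeline:
            # that is what would create the feedback loop.
            self.ship_failures += 1
            return False
        self._records.clear()
        return True

    def __len__(self) -> int:
        return len(self._records)
```

With a design like this, memory use is capped by `max_records` regardless of how long the log provider stays unavailable, at the cost of losing the oldest records.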

The backup workers took over at the cost of short delays in submitting transactions.

We dramatically reduced the verbosity of network error logging to prevent this kind of issue.
We have suspended BetterStack log reporting until our investigation into avoiding and handling HTTP 429 errors is complete. Note that our Datadog logging system still records all vital information.
We created an alert for changes in network throughput so that we can monitor irregularities closely and be alerted before downtime occurs.
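
As a rough sketch of two of the follow-up items above, the snippet below shows one common way to space out retries after an HTTP 429 response and one simple way to flag an unusual jump in network throughput. Both function names, the default values, and the threshold ratio are assumptions made for illustration; they do not describe our actual alert configuration or BetterStack's API.

```python
from typing import Optional, Sequence


def backoff_delay(attempt: int, retry_after: Optional[float] = None,
                  base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retrying a request that returned HTTP 429.

    If the server supplied a Retry-After value (in seconds), honor it,
    clamped to [0, cap]. Otherwise fall back to capped exponential backoff.
    """
    if retry_after is not None:
        return min(max(retry_after, 0.0), cap)
    return min(cap, base * (2 ** attempt))


def throughput_alert(baseline: Sequence[float], current: float,
                     ratio: float = 3.0) -> bool:
    """Return True when current throughput exceeds ratio times the baseline mean."""
    if not baseline:
        return False  # no history yet, so nothing to compare against
    mean = sum(baseline) / len(baseline)
    return current > ratio * mean
```

For example, `backoff_delay(3)` waits 8 seconds, and `throughput_alert([10, 10, 10], 31)` fires because 31 is more than three times the baseline mean of 10.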

Updated
Dec 5, 2024 at 11:17am UTC

The primary worker was manually rebooted and came back online.

Created
Dec 5, 2024 at 10:40am UTC

The primary Automatos worker node went offline.