Environment
- Runtime: Node.js (managed by PM2)
- Hosting: Client-hosted Windows server
- Constraints: 24/7 casino operations, Zero tolerance for extended downtime, Client-controlled infrastructure
Problem
A major client experienced a daily outage between midnight and 1:00 AM during which the application appeared online but stopped processing or displaying data.
Client / User Impact:
- Application appeared functional but was effectively frozen
- No new data could be viewed or stored
- Operational impact during active casino hours
Timing:
- Frequency: Daily
- Time Window: 12:00 AM – 1:00 AM
- Duration: 5–10 minutes
Prior Investigation
Multiple teams investigated the issue with no resolution:
- Support
- Systems Engineering
- Database Administration
- Install / Config teams
Key Insight
Incorrect assumption: The database connection was being cleanly reset during each application restart.
Actual behavior: PM2 restarted the process so quickly that the application repeatedly reattached to a stale database connection, looping until the database forcefully dropped it.
Resolution
- Updated the PM2 setting exp_backoff_restart_delay to 100ms (configuration sketch after this list). The delayed restart allowed stale connections to terminate properly and ensured clean reconnections.
- Code changes required: No
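A minimal sketch of where this setting lives, assuming a standard PM2 ecosystem file; the app name and script path are placeholders, and only the exp_backoff_restart_delay line reflects the actual change:

```js
// ecosystem.config.js -- illustrative PM2 process file (name and script are placeholders).
module.exports = {
  apps: [
    {
      name: 'casino-app',    // hypothetical process name
      script: './server.js', // hypothetical entry point
      // Delay the first restart by 100ms and back off exponentially on
      // repeated crashes, giving the database time to drop the stale
      // connection before the application reconnects.
      exp_backoff_restart_delay: 100,
    },
  ],
};
```

The same option can also be passed on the command line as `pm2 start server.js --exp-backoff-restart-delay=100`; per PM2's restart-strategy documentation, the delay grows on consecutive crashes and resets once the process stays up.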
Outcome
The outage did not recur during the following midnight window.
The configuration was added to future releases to prevent similar failures across other client deployments.
Environment
- Runtime: Node.js
- Hosting: Heroku (Linux, cloud-hosted)
- Constraints: Financial impact, Production system, Third-party vendor dependency (Plaid)
Problem
Transactions appeared permanently pending due to missing vendor transaction IDs, preventing status updates and reconciliation.
Client / User Impact:
- Customer balances showed unpaid or incorrect amounts
- Users lost access to available credit
- Account statements were inaccurate
- Potential negative impact to user credit
Timing:
- Frequency: Ongoing
- Duration: Undetected for multiple years (2018–2021)
Prior Investigation
Senior engineering leadership investigated the issue with no resolution:
- CTO
- Director of Engineering
Key Insight
Incorrect assumption: Each incident was an isolated, one-off vendor issue caused by a missing transaction ID.
Actual behavior: By comparing transaction tables against transaction update logs, I identified a systemic pattern: thousands of transactions had invalid or missing vendor IDs due to repeated API rate-limit failures (HTTP 429), leaving them permanently unreconciled.
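A minimal sketch of the kind of comparison that surfaced this pattern, assuming a Postgres store (chosen purely for illustration) with hypothetical transactions and transaction_update_logs tables; the column names and status values mirror the description above, not the actual schema:

```js
// detect-stuck-transactions.js -- illustrative audit query, not the production schema.
const { Pool } = require('pg'); // node-postgres assumed purely for illustration

const pool = new Pool(); // connection settings come from standard PG* env vars

async function findStuckTransactions() {
  // Transactions that never received a vendor transaction ID and whose
  // update logs show repeated HTTP 429 (rate-limit) failures.
  const { rows } = await pool.query(`
    SELECT t.id, t.amount, t.created_at, COUNT(l.id) AS rate_limit_failures
    FROM transactions t
    JOIN transaction_update_logs l ON l.transaction_id = t.id
    WHERE (t.vendor_transaction_id IS NULL OR t.vendor_transaction_id = '')
      AND t.status = 'pending'
      AND l.http_status = 429
    GROUP BY t.id, t.amount, t.created_at
    ORDER BY t.created_at
  `);
  return rows;
}

findStuckTransactions()
  .then((rows) => console.log(`Found ${rows.length} permanently pending transactions`))
  .catch(console.error)
  .finally(() => pool.end());
```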
Resolution
- I proposed validating transactions directly against the vendor’s API as the source of truth and updating local records only when verification succeeded, allowing accurate recovery without canceling legitimate transactions (see the sketch after this list).
- Code changes required: Yes
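A hedged sketch of the verify-then-update flow, with hypothetical fetchVendorTransaction and markReconciled helpers standing in for the vendor SDK call and the local database update; the clientReference field, statusCode check, and backoff values are illustrative, not the production implementation:

```js
// reconcile-transactions.js -- illustrative verify-then-update loop.
// The vendor lookup and local update are injected because the real SDK call
// and data layer are not part of this write-up.

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function reconcile(transaction, { fetchVendorTransaction, markReconciled, maxAttempts = 5 }) {
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    try {
      // Treat the vendor as the source of truth: look the transaction up by
      // our own reference rather than by the missing vendor ID.
      const vendorRecord = await fetchVendorTransaction(transaction.clientReference);
      if (!vendorRecord) {
        return { status: 'unverified', id: transaction.id };
      }

      // Update the local record only after verification succeeds, so
      // legitimate transactions are never cancelled by mistake.
      await markReconciled(transaction.id, {
        vendorTransactionId: vendorRecord.id,
        status: vendorRecord.status,
      });
      return { status: 'reconciled', id: transaction.id };
    } catch (err) {
      // Back off on rate limits instead of giving up, since repeated
      // HTTP 429 responses are what left the IDs missing in the first place.
      if (err.statusCode === 429) {
        if (attempt < maxAttempts) {
          await sleep(1000 * 2 ** attempt);
          continue;
        }
        return { status: 'rate_limited', id: transaction.id };
      }
      throw err;
    }
  }
}

module.exports = { reconcile };
```

A caller might map this over the backlog surfaced by the audit query above, throttling batch size to stay under the vendor's rate limits.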
Outcome
Over 1,300 user accounts and 15,000+ transactions (>$250K) were corrected without mass cancellation or customer-facing fallout.
The reconciliation logic was incorporated into existing support workflows, reducing engineering involvement and preventing recurrence through regular validation checks.
Environment
- Runtime: Node.js (application server)
- Hosting: Client-hosted Microsoft Windows Server
- Constraints: 24/7 casino operations, No acceptable downtime, Client-managed infrastructure
Problem
A full disk caused a critical user credential file to be corrupted during a write operation, resulting in all users losing access to the application.
Client / User Impact:
- All users were unable to log in, as if no accounts existed
- Casino operations were disrupted due to loss of system access
Timing:
- Duration: 30–45 minutes of active user impact
Prior Investigation
Senior engineering leadership investigated the issue with no resolution:
- VP of Engineering
- Director of Database Administration and Systems Engineering
Key Insight
Incorrect assumption: The only way to restore user access was to wait for the client to provide their original user configuration and credential files.
Actual behavior: I recognized that the database already contained a reliable source of truth for user identities through historical activity records and proposed reconstructing accounts directly from the database to immediately restore access.
Resolution
- I first stabilized the system by freeing disk space, then proposed and guided the recovery approach of extracting user identities from the database to rebuild access without waiting on external client data (see the sketch after this list). Leadership adopted this approach, enabling rapid restoration under active outage conditions.
- Code changes required: No
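A minimal sketch of the kind of ad hoc recovery aid this approach implies, assuming a Postgres store (chosen purely for illustration) where historical activity rows carry the user identity; the table, column, and helper names are hypothetical, and the credential-rebuild step is injected because that mechanism is not part of this write-up:

```js
// rebuild-user-access.js -- illustrative ad hoc recovery aid, not production code.
const { Pool } = require('pg'); // node-postgres assumed purely for illustration

const pool = new Pool(); // connection settings come from standard PG* env vars

// Historical activity records survived the credential-file corruption because
// they live in the database, so they can serve as the source of truth for users.
async function loadUsersFromActivityHistory() {
  const { rows } = await pool.query(`
    SELECT username, MAX(occurred_at) AS last_seen
    FROM user_activity_log   -- hypothetical table and column names
    GROUP BY username
    ORDER BY username
  `);
  return rows;
}

// recreateUserAccount is a hypothetical helper standing in for whatever
// mechanism regenerates the application's credential entries.
async function rebuildAccess(recreateUserAccount) {
  const users = await loadUsersFromActivityHistory();
  for (const { username } of users) {
    await recreateUserAccount(username);
  }
  console.log(`Restored access for ${users.length} users`);
  await pool.end();
}

module.exports = { rebuildAccess };
```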
Outcome
User access was restored within the outage window, database activity resumed, and the client confirmed full operational recovery.
This incident led me to proactively audit disk capacity and log growth on all client systems I accessed, preventing similar credential corruption incidents in future engagements.