Wrtn · 2024.01 — 2024.07

AI Model Serving Service Development & Operations

  • Developed the JVM-based hamlet2 as an extension of Spring AI and replaced the existing Node-based hamlet1 with zero downtime
  • Handled about 750 requests per minute (RPM) at peak and improved API response time by 20%

Background

The existing Node-based hamlet1, operated as a monolith, was highly complex and could be maintained by only one person.
A zero-downtime replacement was a hard requirement, and a design for long-term evolution was also needed.

Outcomes

  • Removed hamlet1’s high complexity and single-maintainer dependency by moving the service to the JVM-based hamlet2, which made it easier to operate
  • Designed a Spring AI based provider / model extension structure, laying the foundation for long-term scaling such as serving 27 PTUs worldwide
  • Served related links and dynamic chips on Wrtn’s main screen at about 750 RPM during peak hours, with a 20% improvement in API response time
  • Reduced errors and enabled proactive detection through improved retry logic and observability
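The retry and observability point above can be illustrated with a minimal sketch. The source does not describe the actual retry policy, so everything here is an assumption: the class name RetryingCaller, the exponential backoff schedule, and the failure counter that a dashboard could chart are all hypothetical, not the hamlet2 implementation.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical helper: retries a model call with exponential backoff and counts failed attempts. */
final class RetryingCaller {
    private final int maxAttempts;
    private final long baseDelayMillis;
    // Failed attempts so far; a metrics exporter could publish this value (assumption).
    private final AtomicInteger failedAttempts = new AtomicInteger();

    RetryingCaller(int maxAttempts, long baseDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        if (baseDelayMillis < 0) {
            throw new IllegalArgumentException("baseDelayMillis must be >= 0");
        }
        this.maxAttempts = maxAttempts;
        this.baseDelayMillis = baseDelayMillis;
    }

    /** Runs the call, retrying on RuntimeException until maxAttempts is reached. */
    <T> T call(Supplier<T> modelCall) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return modelCall.get();
            } catch (RuntimeException e) {
                last = e;
                failedAttempts.incrementAndGet();
            }
            if (attempt < maxAttempts) {
                // Delay doubles after each failure: base, 2 * base, 4 * base, ...
                sleep(baseDelayMillis << (attempt - 1));
            }
        }
        throw last; // every attempt failed; surface the most recent error
    }

    int failedAttempts() {
        return failedAttempts.get();
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while backing off", e);
        }
    }
}
```

Counting every failed attempt, not only the final failure, is a deliberate choice in this sketch: it lets a failure-rate panel show a degrading provider even while retries are still succeeding.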

Service in production

Screenshot of the related links and dynamic chips served live on Wrtn’s main screen.

Details

Service Rollout Plan
Aligned target features, architecture, and migration schedule with stakeholder teams, proceeding phase by phase from development through production.

Service rollout plan architecture

Phased rollout plan aligning features, architecture, and migration schedule across teams.

Hamlet1 / Hamlet2 Comparison
During peak hours hamlet2 handled 15,000 requests per 20 minutes (15,000 / 20 = 750 requests per minute), with a 20% improvement in request duration over hamlet1.

Hamlet1 / Hamlet2 comparison

Peak-hour metrics comparing hamlet1 and hamlet2: about 750 RPM with 20% shorter request duration.

Pre-launch Load Testing
Using Gatling-based load testing, I verified the target throughput and infrastructure specs in advance.

Load test result 1

Gatling load test results verifying target throughput before launch.

Load test result 2

Additional Gatling load test results confirming required infrastructure specs.

Observability Improvements
I configured a new dashboard to surface AI model call failures and infrastructure load.

AI model call failure monitoring

New dashboard surfacing AI model call failures for proactive detection.

Infrastructure load level 1

Dashboard panel monitoring infrastructure load metrics.

Infrastructure load level 2

Additional infrastructure load monitoring panel.

Infrastructure load level 3

Further infrastructure load monitoring panel on the dashboard.

Zero-downtime Migration
By splitting the cutover between the existing and new services, I replaced the live service features with zero downtime.
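The source does not say how the cutover was split, so the following is only one possible shape for it, not the method actually used. It assumes a percentage-based split keyed on a stable request identifier; the class CutoverRouter, its Target enum, and the bucket scheme are hypothetical names introduced for illustration.

```java
/** Hypothetical router: sends a configurable share of traffic to the new service. */
final class CutoverRouter {
    enum Target { HAMLET1, HAMLET2 }

    // Share of traffic (0 to 100) routed to hamlet2; volatile so changes are visible across threads.
    private volatile int newServicePercent;

    CutoverRouter(int newServicePercent) {
        setNewServicePercent(newServicePercent);
    }

    /** Raises or lowers the hamlet2 share, for example 0 -> 10 -> 50 -> 100 across phases. */
    void setNewServicePercent(int percent) {
        if (percent < 0 || percent > 100) {
            throw new IllegalArgumentException("percent must be between 0 and 100");
        }
        this.newServicePercent = percent;
    }

    /** Maps a stable key (such as a user id) to a bucket in [0, 100) and picks a target. */
    Target route(String requestKey) {
        int bucket = Math.floorMod(requestKey.hashCode(), 100);
        return bucket < newServicePercent ? Target.HAMLET2 : Target.HAMLET1;
    }
}
```

Because the bucket depends only on the key, a given caller keeps hitting the same service at a fixed percentage, and setting the percentage back to 0 sends all traffic to hamlet1 again if a phase needs to be rolled back.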

Zero-downtime migration complete

Cutover-complete notice with metrics confirming the zero-downtime switch from hamlet1 to hamlet2.