Job Description

AI Platform Engineer
Posting Start Date: 19/05/2026
Country/Region: Singapore
Work Location: Singapore Great World City
Business/Function: IT

Job Summary

We are seeking a passionate AI Platform Engineer to build and own the infrastructure layer that every AI use case at Kuok Group runs on: the LLM gateway, the deployment platform, CI/CD pipelines, model serving, observability, cost controls, and the eval pipeline infrastructure, end to end.

This is a T-shaped role: broad cloud and DevOps foundations, with deep specialism in LLM infrastructure. The ideal candidate is as comfortable provisioning environments and managing release pipelines as they are configuring a model gateway, wiring up LangSmith traces, and building an eval harness.

Working closely with the Head, AI Platform on architecture direction and with the LLM Ops / MLOps Engineer on the observability and eval layer, this person will be the backbone of the platform that Applied AI Engineers depend on to ship confidently and at pace.

Key Responsibilities

Deployment Platform & CI/CD

  • Design, build, and maintain CI/CD pipelines for all AI use cases — from code commit through staging to production, with automated release gates and rollback capability
  • Own environment provisioning and infra-as-code (Terraform or equivalent) — staging, UAT, and production environments should be reproducible, version-controlled, and auditable
  • Manage the deployment platform end to end: release scheduling, environment promotion, incident response, and post-deployment validation
  • Champion good deployment hygiene: automated pipelines, version-controlled configuration, and documented environment differences as standard practice

LLM Gateway & Model Serving

  • Build and operate the LLM gateway layer (LiteLLM or equivalent) — API access controls, rate limiting, model routing, and failover across Azure-backed endpoints
  • Manage model serving configuration: endpoint management, load balancing, latency SLOs, and model switching without disrupting live use cases
  • Own secrets and access management for all model API credentials and service accounts across environments
  • Maintain a prompt and model version registry so that every production use case can be traced to a specific model version and prompt configuration

Observability, Cost & Controls

  • Instrument all deployed use cases with LLM observability tooling (LangSmith or equivalent) — traces, latency, token counts, and error rates as standard
  • Build and maintain cost telemetry dashboards: per-use-case token consumption, compute spend, and alerting on cost anomalies
  • Implement and maintain token budget controls and rate limits across BUs — keeping cost visible and predictable is a shared responsibility that starts at the platform layer
  • Own general platform monitoring and reliability: uptime, alerting, on-call runbooks, and incident response for platform-layer issues

Eval Pipeline Infrastructure

  • Build the infrastructure layer for LLM evaluation pipelines — test harnesses, regression runners, and LLM-as-judge scaffolding used by Applied AI Engineers per use case
  • Work with the LLM Ops / MLOps Engineer on eval pipeline design
  • Ensure eval pipeline runs are logged, versioned, and traceable — eval results should be reproducible
  • Support evals as a consistent deployment gate — working with the team to ensure every use case has a passing eval run on the current model version before moving to production

Standards & Collaboration

  • Maintain platform documentation — architecture diagrams, runbooks, environment specs, and onboarding guides — so institutional knowledge is shared and accessible across the team
  • Work within the engineering standards set by the Head, AI Platform: all platform changes go through code review before deployment
  • Support the QA / Dev Engineers (Applied AI cluster) on integration and regression testing where it touches the platform layer
  • Proactively surface platform-layer risks and capacity constraints to the Head, AI Platform

Key Requirements

  • Solid cloud and DevOps engineering foundations — you have built and operated CI/CD pipelines, managed environments with IaC, and handled production deployments and rollbacks on at least one major cloud platform (Azure, AWS, or GCP); comfortable working across Linux and Windows Server, and familiar with core networking concepts — VPC/VNET, DNS, firewalls, and load balancers
  • Hands-on experience with LLM infrastructure: you have configured and operated a model gateway or API proxy layer, managed multi-model routing, and dealt with rate limits and failover in a live environment
  • LLM observability experience — you have instrumented production AI systems with tracing and monitoring tooling and used the data to diagnose issues
  • Cost telemetry and token controls — you understand how LLM API costs are structured and have built or operated dashboards and controls to keep spend visible and bounded
  • Strong Python skills and comfort with the full LLM deployment tooling ecosystem — equally at home in application code and infrastructure configuration
  • Strong appreciation for documentation and configuration management — environments as code, clear runbooks, and written context that helps the team move faster together

Strong Advantage

  • Experience with eval pipeline infrastructure: test harness design, regression frameworks, LLM-as-judge scaffolding, or automated output quality checks
  • Security and access management experience in an AI context: IAM, RBAC, secrets management, API credential rotation, encryption at rest and in transit, and least-privilege access design for model-serving environments
  • Familiarity with MLOps practices: model versioning, A/B traffic splitting, canary deployments for model updates
  • Experience supporting engineering teams as a platform provider — you understand that your internal customers are the engineers shipping use cases, and you design for their velocity as well as for reliability
  • Exposure to enterprise multi-tenant environments: managing shared infrastructure across multiple teams or business units with different access and cost boundaries; familiarity with virtualisation platforms (VMware, Hyper-V, or Nutanix) is a plus

Education

Bachelor's degree in Computer Science or Computer Engineering

Certifications