Operational Excellence & Incident Management
• Production Stability: Ensure high availability and reliability of our AI-driven skin health products.
• Global Incident Response (24/7): Orchestrate a "Follow-the-Sun" support model by leveraging the hybrid team structure.
External/Partners: Manage Level 1/Level 2 monitoring, alert triage, and off-hours coverage.
In-house: Handle critical, complex Level 3 incidents and root cause analysis.
• Monitoring: Define and track SLIs, SLOs, and SLAs. Implement comprehensive observability dashboards (metrics, logs, traces).
Infrastructure & Platform Engineering (LLMOps/MLOps)
• Environment Strategy: Architect and maintain robust environments (Dev, Staging, Prod) tailored to distinct needs:
Product Teams: Seamless CI/CD pipelines for web/mobile apps.
AI Teams: Specialized AIOps/MLOps pipelines for model training, fine-tuning, and inference.
• DataOps: Build and maintain scalable data pipelines ensuring high throughput for image processing and health data analysis.
Cloud Resource & GPU Management
• Cost & Resource Optimization: Lead capacity planning and optimization, particularly GPU allocation, for cost-effective model training and inference.
• Cloud Architecture: Manage cloud infrastructure (AWS/GCP/Azure) using Infrastructure as Code (Terraform/Pulumi).
Implement FinOps practices to provide visibility into cloud spend and resource utilization.
Security & Corporate IT (Global Scope)
• Data Security: Act as the primary owner of Data System Security. Ensure compliance with health data standards (e.g., HIPAA, GDPR).
• Office Network & IT: Oversee the design and security of office networks and IT infrastructure across our three global locations: USA, France, and Vietnam.
Technical Coordination & Vendor Standards
• Vendor Technical Oversight: Act as the technical expert to monitor MSPs and external contractors. Define technical SLAs, evaluate their delivery quality, and ensure they meet our system's reliability requirements.
• Operational Integration: Ensure external partners follow our security and infrastructure-as-code (IaC) practices, maintaining a seamless "One Team" workflow.
• Knowledge Sharing & Standards: Set high technical standards for the internal DevOps/SRE team; mentor junior engineers and drive a culture of "Automate Everything" across regions (Vietnam & France).
The Tech Stack
• Cloud: AWS / GCP.
• IaC & CI/CD: Terraform, Ansible, GitHub Actions / GitLab CI, ArgoCD.
• AI/Data: LLMOps tools (e.g., Kubeflow, Ray, ClearML), GPU orchestration (NVIDIA tools), Vector Databases.
• Observability: Prometheus, Grafana, ELK Stack / Datadog.
• Security: IAM, VPNs, Firewalls, Secret Management (Vault).
• Core: Kubernetes (K8s), Docker, Linux.