Senior SRE (Site Reliability Engineering)

CÔNG TY TNHH BELLE ASIA

Salary

To be updated

Work location

Hà Nội

Required experience

72 months

Basic information

Job description

Operational Excellence & Incident Management
• Production Stability: Ensure high availability and reliability of our AI-driven skin health products.
• Global Incident Response (24/7): Orchestrate a "Follow-the-Sun" support model by leveraging the hybrid team structure.

External/Partners: Manage Level 1/Level 2 monitoring, alert triage, and off-hours coverage.
In-house: Handle critical, complex Level 3 incidents and root cause analysis.

• Monitoring: Define and track SLIs, SLOs, and SLAs. Implement comprehensive observability dashboards (metrics, logs, traces).
Infrastructure & Platform Engineering (LLMOps/MLOps)
• Environment Strategy: Architect and maintain robust environments (Dev, Staging, Prod) tailored for distinct needs:

Product Teams: Seamless CI/CD pipelines for web/mobile apps.
AI Teams: Specialized AIOps/MLOps pipelines for model training, fine-tuning, and inference.

• DataOps: Build and maintain scalable data pipelines ensuring high throughput for image processing and health data analysis.
Cloud Resource & GPU Management
• Cost & Resource Optimization: Lead capacity planning, in particular GPU allocation and optimization for cost-effective model training and inference.
• Cloud Architecture: Manage cloud infrastructure (AWS/GCP/Azure) using Infrastructure as Code (Terraform/Pulumi).
Implement FinOps practices to provide visibility into cloud spend and resource utilization.
Security & Corporate IT (Global Scope)
• Data Security: Act as the primary owner of Data System Security. Ensure compliance with health data standards (e.g., HIPAA, GDPR).
• Office Network & IT: Oversee the design and security of office networks and IT infrastructure across our three global locations: USA, France, and Vietnam.
Technical Coordination & Vendor Standards
• Vendor Technical Oversight: Act as the technical expert to monitor MSPs and external contractors. Define technical SLAs, evaluate their delivery quality, and ensure they meet our system's reliability requirements.
• Operational Integration: Ensure external partners follow our security and infrastructure-as-code (IaC) practices, maintaining a seamless "One Team" workflow.
• Knowledge Sharing & Standards: Set high technical standards for the internal DevOps/SRE team; mentor junior engineers and drive a culture of "Automate Everything" across regions (Vietnam & France).
The Tech Stack
• Cloud: AWS / GCP.
• IaC & CI/CD: Terraform, Ansible, GitHub Actions / GitLab CI, ArgoCD.
• AI/Data: LLMOps tools (e.g., Kubeflow, Ray, ClearML, ...), GPU orchestration (NVIDIA tools), Vector Databases.
• Observability: Prometheus, Grafana, ELK Stack / Datadog.
• Security: IAM, VPNs, Firewalls, Secret Management (Vault).
• Core: Kubernetes (K8s), Docker, Linux.

Job requirements

• Experience:
5+ years in DevOps/SRE.
• Technical Mastery:

Strong background in Python or Go scripting.
Experience with LLMOps/MLOps workflows (managing GPU clusters is a huge plus).
Deep expertise in Kubernetes administration and troubleshooting.

• Security Mindset:

Previous experience securing healthcare data or PII is highly desirable. Ability to design secure access workflows for external collaborators.
Ability to work with maturity and discretion with sensitive data.
Proactive attitude toward managing risks.

• Entrepreneurial mindset:

Ability to work autonomously in a startup environment.
High ownership and self-starter mindset.
Ability to connect business needs and tech requirements.

• Communication:

Excellent communication skills to explain infrastructure constraints to Product/AI teams.
Language: Fluent English is mandatory for daily communication with US/France teams and international vendors.

• Hybrid Team Experience: Proven experience managing mixed teams (in-house and outsourced/offshore) is a strong plus.

Benefits

• Competitive Package: Attractive salary, stock options, and benefits.
• Cutting-edge Tech: Get your hands dirty with the latest in Large Language Models and Computer Vision infrastructure.
• International Exposure: Daily collaboration with top-tier talent in the US and France.
• Leadership Opportunity: Define the engineering culture and build your own team from the ground up in Vietnam.
• Global Impact: Work on products that genuinely improve people's health and confidence.

Last updated: 2026-01-31 18:35:02


Note to job seekers:

You are viewing the posting Senior SRE (Site Reliability Engineering), posting ID: 5517518. All information in this job posting is published by the poster, who is responsible for it. We always strive for the best information quality, but we do not guarantee and accept no responsibility for any content related to this job posting. If you notice any errors or problems, please report them to us.