Site Reliability Engineer Python, Bash, or Go

PAVE
Salary
To be updated
Work location
District 1, Ho Chi Minh City
Required experience
To be updated
Basic information

Job description

We're seeking a skilled Site Reliability Engineer to join our DevOps team and ensure the stability and reliability of our enterprise vehicle inspection platform. Reporting to the Lead DevOps Engineer, you'll play a critical role in our GCP-to-AWS migration while maintaining and improving system reliability. As an SRE at PAVE.ai, you'll implement best practices for monitoring, incident response, and automation to achieve 99.9%+ uptime. You'll work hands-on with AWS infrastructure to build resilient systems that process millions of vehicle inspections for dealerships, fleet operators, insurers, and vehicle marketplaces globally.

Key Responsibilities

System Reliability & Stability

Participate in 24/7 on-call rotation and incident response
Perform root cause analysis for incidents and implement permanent fixes
Conduct regular reliability reviews and implement improvements
Implement proactive monitoring and alerting to detect issues before they impact customers
Monitor and maintain production systems to ensure 99.9%+ uptime
Create and maintain runbooks for common operational procedures

AWS Infrastructure Management

Deploy and manage AWS services including EC2, ECS/EKS, RDS, S3, CloudFront
Configure auto-scaling policies and load balancing for high availability
Support migration efforts from GCP to AWS under Lead DevOps guidance
Optimize AWS infrastructure for performance, cost, and reliability
Manage AWS networking components (VPC, Security Groups, ALB/NLB)
Implement AWS best practices for security, backup, and disaster recovery

Monitoring & Observability

Implement log aggregation and analysis using ELK stack or similar
Define and track SLIs (Service Level Indicators) for critical services
Set up distributed tracing and application performance monitoring
Design and implement comprehensive monitoring solutions using CloudWatch, Prometheus, Grafana
Establish baseline metrics and identify performance anomalies
Create meaningful dashboards and alerts for service health

Automation & Infrastructure as Code

Develop automation scripts to reduce manual operations and toil
Develop tools to improve developer productivity and deployment velocity
Automate routine tasks such as backups, scaling, and maintenance
Implement Infrastructure as Code using Terraform and CloudFormation
Build self-healing mechanisms for common failure scenarios
Create CI/CD pipelines for reliable and repeatable deployments

Performance Optimization

Fine-tune resource allocation and utilization
Analyze system performance and identify bottlenecks
Optimize cloud costs without compromising reliability
Implement caching strategies to reduce latency
Optimize application and database performance
Conduct load testing and capacity planning

Incident Management

Respond to production incidents with urgency and professionalism
Improve MTTR (Mean Time To Recovery) through better tooling and processes
Maintain incident communication with stakeholders
Document incidents and contribute to post-mortem analysis
Follow incident management procedures and escalation protocols
Implement preventive measures based on incident learnings

Collaboration & Documentation

Document infrastructure, procedures, and troubleshooting guides
Provide guidance on reliability best practices during design phase
Work closely with development teams to improve application reliability
Share knowledge through team presentations and training sessions
Collaborate on capacity planning and scaling strategies
Support developers with production debugging and optimization

Success Metrics

Maintain 99.9%+ uptime for assigned services
Automate 50% of manual operational tasks
Zero critical security incidents
Complete AWS migration tasks on schedule
Achieve all SLO targets for assigned services
Reduce incident MTTR by 30% within first year

Job requirements

Technical Skills

AWS Expertise:

Understanding of AWS security best practices
Familiarity with AWS Well-Architected Framework
Strong proficiency with core AWS services (EC2, S3, RDS, VPC, IAM)
Experience with container services (ECS, EKS, ECR)
Knowledge of AWS monitoring and logging (CloudWatch, CloudTrail)
Experience with AWS CLI and SDKs

SRE & DevOps Tools:

Infrastructure as Code: Terraform, CloudFormation, or AWS CDK
Containerization: Docker, Kubernetes, Helm
Configuration management: Ansible, Chef, or Puppet
Version control: Git, GitHub/GitLab
Scripting languages: Python, Bash, or Go
CI/CD tools: Jenkins, GitLab CI, GitHub Actions

Monitoring & Observability:

APM tools: New Relic, Datadog, or AppDynamics
Distributed tracing: Jaeger, Zipkin, or AWS X-Ray
Alert management: PagerDuty, Opsgenie, or similar
Log management: ELK Stack, Splunk, or CloudWatch Logs
Prometheus, Grafana, or similar metrics platforms

Technical Fundamentals:

Understanding of distributed systems and microservices
Experience with performance tuning and optimization
Strong Linux/Unix system administration skills
Networking concepts: TCP/IP, DNS, Load Balancing, CDN
Knowledge of security principles and best practices
Database administration: PostgreSQL, MySQL, Redis

Soft Skills:

Excellent written and verbal communication skills in both English and Vietnamese
Detail-oriented with strong documentation skills
Team player with collaborative mindset
Continuous learning mindset for new technologies
Strong problem-solving and troubleshooting abilities
Ability to work effectively under pressure during incidents
Proactive approach to identifying and solving problems

Experience

Proven track record of improving system reliability and uptime
Experience with 24/7 on-call responsibilities and incident management
2+ years of hands-on AWS experience in production environments
Experience maintaining high-traffic, high-availability systems
2-5 years of experience in DevOps, SRE, or Infrastructure Engineering

Preferred Qualifications

Experience with chaos engineering and failure injection
AWS certifications (SysOps Administrator, DevOps Engineer, or Solutions Architect)
Contributions to open-source DevOps/SRE projects
Experience with AI/ML infrastructure and GPU workloads
Knowledge of SRE practices from Google's SRE book
Experience with GCP and cloud migration projects
Experience with FinOps and cloud cost optimization
Knowledge of compliance frameworks (SOC2, ISO 27001)
Familiarity with automotive industry or vehicle inspection systems
Experience with serverless architectures (Lambda, API Gateway)

Benefits

Why you'll love working here

Competitive Compensation & Perks

15 days of annual leave.
13th-month bonus.
Thoughtful appreciation gifts throughout the year.
Attractive salary package.
Premium healthcare coverage for you and your family.

Growth & Learning Opportunities

Continuous learning programs to sharpen your skills and grow your career.
Learn from everything, everywhere, but be a smart copy-paster, not a copycat!
Clear career paths for both technical experts and aspiring leaders.
Work on cutting-edge, large-scale products in the car inspection field.
Be ready to embrace and implement new ideas in a fast-paced environment.

An Inspiring Workplace

Be motivated, creative, and passionate—we can’t ask for more!
Flexible hybrid work model and a strong focus on work-life balance.
A modern, fully equipped office with a well-stocked pantry.
Respect and care for your teammates, your environment, and even yourself.
Treat yourself well, and while you’re at it, save the Earth too.

A Mindset for Growth

Have the courage to move fast, stay flexible, and take full responsibility for every single line of code.
It’s okay to be late sometimes, but make sure you’re fully accountable and aware of your actions.
Always look back at your work and strive to make it better—nothing is perfect, and that’s where you come in.

A Dynamic and Open Culture

We don’t stick rigidly to the game plan, so feel free to add or remove your own “blah blah” from this list. 😉

Last updated: 2025-10-08 04:25:03

Job details

Application deadline
11/11/2025
Employment type
To be updated
Level
Staff
Number of openings
To be updated
Industry
Construction
Region
District 1, Ho Chi Minh City
Note to job seekers:
You are viewing the posting Site Reliability Engineer Python, Bash, or Go - Posting ID: 5317680. All information in this job posting is published by the poster, who is responsible for it. We always strive for the best information quality, but we do not guarantee and accept no responsibility for any content related to this posting. If you find an error or a problem, please report it to us.

PAVE

Company size: To be updated
Headquarters: To be updated
