About the Role
We are building world-class cloud systems that power high-performance digital platforms across AWS and Azure. This role blends architecture, reliability engineering, and DevOps innovation for someone who thrives on scale, speed, and solving complex infrastructure challenges.
What You Will Do
• Lead incident response and root-cause analysis, implementing permanent fixes to reduce recurrence and recovery time.
• Establish and enforce security baselines, IAM policies, logging, encryption, and compliance automation aligned with SOC 2, HIPAA, PCI-DSS, ISO 27001, and GDPR.
• Continuously optimize infrastructure for performance, scalability, and cost efficiency, with measurable savings targets.
• Lead SOC 2 and HIPAA compliance efforts, including control implementation, evidence automation, audit readiness, and ongoing compliance maintenance.
• Design, implement, and maintain platform-specific failovers, including automatic offer failover if Everflow is unavailable and seamless checkout failover to Shopify if WowPay is down, ensuring uninterrupted traffic monetization.
• Implement monitoring, alerting, and self-healing systems to support zero-downtime operations.
• Architect, deploy, and operate secure, highly available AWS and Azure environments at scale, with clear ownership of uptime, resilience, and cost reduction.
• Lead Infrastructure as Code using Terraform, CloudFormation, or Bicep to ensure repeatable, auditable, and recoverable environments.
• Be on call and available around the clock during critical incidents, owning incident response, coordination, and resolution.
• Partner with engineering, product, and marketing teams to embed reliability, failover readiness, and cost discipline into every release.
• Design and maintain automated CI/CD pipelines that enable fast, reliable, low-risk deployments.
• Own disaster recovery, multi-region redundancy, and all front-end and back-end failover systems to keep offers, traffic, and checkouts live at all times.
• Bring demonstrated experience designing, operating, and maintaining PCI-DSS compliant environments.
• Own continuous improvement across deployment speed, observability, resilience, security, and cost control.
• Keep databases reliable by ensuring that backups work, replication runs smoothly, failover triggers automatically, and recovery is tested.
• Mentor engineering teams in automation, reliability, security, and cloud cost optimization.
• Execute special projects, including platform migrations, infrastructure modernization, and high- priority reliability initiatives.