Senior Site Reliability Engineer
2 weeks ago
Iterable is the leading AI-powered customer engagement platform that helps leading brands like Redfin, SeatGeek, Priceline, Calm, and Box create dynamic, individualized experiences at scale. Our platform empowers organizations to activate customer data, design seamless cross-channel interactions, and optimize engagement—all with enterprise-grade security and compliance. Today, nearly 1,200 brands across 50+ countries rely on Iterable to drive growth, deepen customer relationships, and deliver joyful customer experiences.
With a global presence, including offices in San Francisco, New York, Denver, London, and Lisbon, plus remote employees worldwide, we are committed to building a diverse and inclusive workplace. We welcome candidates from all backgrounds and encourage you to apply. Learn more about our story and mission on our Culture and About Us pages. Let's shape the future of customer engagement together.
How you will make an impact:
As a Senior Engineer on the Cloud Platform team, your impact will be measured by the continuous improvement of our platform's reliability, scalability, and security posture.
- SLO Ownership & Error Budget Management: Take direct ownership of the established Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for core platform services (e.g., latency, availability, error rate). You will manage the Error Budget and use it as the primary driver for prioritizing reliability work.
- Scale and Harden the Core Platform: Apply deep technical expertise in Kubernetes, AWS, traffic management, and Infrastructure-as-Code to scale and harden the foundational platform that powers Iterable's product workloads.
- Drive Systemic Improvements: This role centers on hands-on engineering skill, technical leadership, and systemic reliability improvements within our complex, distributed multi-region platform.
What you'll do
- Kubernetes Platform Engineering
Use your Kubernetes and AWS expertise to evolve the EKS lifecycle, multi-tenant isolation, and regional consistency, ensuring clusters remain secure, performant, and predictable as we scale.
- Traffic & Ingress Reliability
Apply advanced knowledge of cloud-native traffic management and API gateways to strengthen routing, authentication, rate limiting, and secure communication protocols (like mTLS). This focus will dramatically improve both the reliability and security posture of the platform's public and internal service access points.
- Infrastructure-as-Code at Scale
Demonstrate mastery of IaC to manage a complex, multi-region architecture. Use tools like Terraform Cloud to build reusable modules, validate changes through policy-as-code, and establish safe multi-account patterns our teams can rely on.
- Security & Access Control
Drive a zero-trust posture by establishing service guardrails and access controls across the platform. This includes implementing policy-as-code solutions, brokering least-privilege access for the platform using cloud Identity and Access Management (IAM) best practices, and integrating and managing identity providers to define Role-Based Access Control (RBAC) across environments.
- Reliability Engineering & Incident Leadership
Demonstrate strong diagnostic and incident-response leadership to rapidly isolate issues across clusters, networks, and workloads. Your ultimate responsibility will be to lead root-cause investigations and drive systemic, long-term fixes, ensuring all necessary actions are taken to eliminate repeat failures.
- Collaboration, Influence & Mentorship
Guide and influence engineering teams across the organization through design reviews, operational best practices, and reliability-focused decision-making.
Required Core Competencies & Proficiencies
- Core Platform & Infrastructure Expertise
- Demonstrate deep skill in managing complex, distributed environments at scale, specifically focusing on:
- Cloud-Native Orchestration: Expertise in Kubernetes.
- Infrastructure Automation: Mastery of Infrastructure-as-Code (IaC), including Terraform.
- Advanced Networking & Connectivity: Understanding of core networking fundamentals, including routing, DNS, network segmentation (VPCs, subnets), and connectivity services (e.g., transit gateways and network endpoints).
- Platform Systems: Deep competence in traffic/ingress systems and strong programming fundamentals in Go or Python.
- Security & Reliability Skills
- Fluency with IAM/IRSA, Vault, mTLS, and least-privilege design, combined with a proven ability to deliver measurable reliability improvements through automation, guardrails, and smart engineering.
- Leadership & Communication
- Demonstrate a strong operational mindset, excellent technical communication (both written and verbal), and the ability to influence designs, mentor others, and elevate platform engineering practices across teams.
- Experience and Proficiency
- Demonstrate advanced proficiency and technical leadership in managing large-scale, resilient production systems. This experience is typically gained through roles such as:
- Site Reliability Engineer (SRE)
- Cloud Platform Engineer
- DevOps Engineer
- Other closely related infrastructure roles
Perks & Benefits:
- Competitive salaries & meaningful equity
- Private Medical Insurance
- Life/Risk Assurance
- Meal Allowance: 8.55€ per day
- Community Days (additional paid holidays)
- Paid Annual Leave (22 days)
- Paid Sabbatical (after 4 years of tenure)
- Initial laptop workstation setup
- Teleworking Allowance
Iterable is an Equal Employment Opportunity employer that proudly pursues and hires a diverse workforce. Iterable does not make hiring or employment decisions on the basis of race, color, religion or religious belief, ethnic or national origin, nationality, sex, gender, gender-identity, sexual orientation, disability, age, military or veteran status, or any other basis protected by applicable local, state, or federal laws or prohibited by Company policy. Iterable also strives for a healthy and safe workplace and strictly prohibits harassment of any kind. Pursuant to the San Francisco Fair Chance Ordinance and other similar state laws and local ordinances, and its internal policy, Iterable will also consider for employment qualified applicants with arrest and conviction records.
-
Site Reliability Engineer
3 weeks ago
Lisbon, Portugal Tata Consultancy Services Full-time
Are you a Site Reliability Engineer seeking an interesting new challenge? If your answer is yes, it's your lucky day, so keep reading: this could be just what you're looking for!
-
Senior Site Reliability Engineer
4 days ago
Lisbon, Portugal INSCALE Full-time
Why Join Us? JYSK is a global retail chain that brings Scandinavian design and quality to the world through an extensive range of quality products for sleeping and living. JYSK is known for its commitment to simplicity, functionality, and affordability. With over 3,200 stores in 48 countries, JYSK is a trusted brand for customers seeking to create comfortable...
-
Site Reliability Engineer
3 weeks ago
Lisbon, Portugal Sperton Global AS Full-time
Job Title: Site Reliability Engineer (SRE). Location: Lisbon, Portugal (Hybrid). Job Type: Contract (6 months). Role Overview: We are looking for an experienced Site Reliability Engineer (SRE) to support business-critical systems in the banking and financial services domain. The role has a strong focus on production support, monitoring, automation, CI/CD...
-
Senior Site Reliability Engineer
6 days ago
Lisbon, Portugal Hiire Full-time
We are based in Estonia and Cyprus, and we operate globally. We love to work remotely and travel the world. Freedom, ownership, and passion for recruitment are key for us. We hire amazing tech talent for great companies. We coach highly motivated professionals to get better careers. We empower recruiters to hire more and better talent. Do you want to...
-
Senior Site Reliability Engineer
5 hours ago
Lisbon metropolitan area, Portugal INSCALE Full-time
Why Join Us? JYSK is a global retail chain that brings Scandinavian design and quality to the world through an extensive range of quality products for sleeping and living. JYSK is known for its commitment to simplicity, functionality, and affordability. With over 3,200 stores in 48 countries, JYSK is a trusted brand for customers seeking to create comfortable...
-
Senior Site Reliability Engineer
5 days ago
Lisbon, Portugal GrabJobs Full-time
A Site Reliability Engineer at Kevel will define reliability targets, solve security issues, automate tasks, and operate production infrastructure.
-
Azure Site Reliability Engineer
5 days ago
Lisbon, Portugal act digital Full-time
We are looking for an Azure Site Reliability Engineer to join a Cloud Operations team focused on digital transformation and cloud optimization. The team works closely with development and infrastructure teams to deliver secure, scalable and highly available cloud platforms. Role Overview: As an Azure SRE, you will be responsible for ensuring the operational...
-
Senior Site Reliability Engineer
2 weeks ago
Lisbon, Portugal GrabJobs Full-time
Site Reliability Engineer needed to maintain and improve reliability of services, solve security issues, and automate tasks for a remote engineering team.
-
Senior Site Reliability Engineer
3 weeks ago
Lisbon, Portugal Iterable Full-time
Iterable is the leading AI-powered customer engagement platform that helps leading brands like Redfin, SeatGeek, Priceline, Calm, and Box create dynamic, individualized experiences at scale. Our platform empowers organizations to activate customer data, design seamless cross-channel interactions, and optimize engagement, all with enterprise-grade security...