Senior Site Reliability Engineer

14 hours ago


Lisbon, Portugal Iterable Full-time

Iterable is the leading AI-powered customer engagement platform that helps leading brands like Redfin, SeatGeek, Priceline, Calm, and Box create dynamic, individualized experiences at scale. Our platform empowers organizations to activate customer data, design seamless cross-channel interactions, and optimize engagement, all with enterprise-grade security and compliance. Today, nearly 1,200 brands across 50+ countries rely on Iterable to drive growth, deepen customer relationships, and deliver joyful customer experiences.
With a global presence (including offices in San Francisco, New York, Denver, London, and Lisbon, plus remote employees worldwide), we are committed to building a diverse and inclusive workplace. We welcome candidates from all backgrounds and encourage you to apply. Learn more about our story and mission on our Culture and About Us pages. Let's shape the future of customer engagement together.
How you will make an impact:
As a Senior Engineer on the Observability Team, your impact is measured by the clarity and reliability with which our engineers can see into their systems. You don't just provide a suite of tools; you serve as a strategic observability partner for the entire engineering organization.
Strategic Observability Partnership: You will collaborate deeply with product teams to ensure the frameworks we provide actually solve their problems. Your success is measured by how well teams can diagnose their own services, not just by the uptime of our clusters. You will act as a consultant to help teams define meaningful observability signals that reflect the true customer experience.
Set the observability vision: Own the long-term roadmap for Datadog, Grafana, Prometheus, Elasticsearch, Quickwit, and emerging OpenTelemetry tooling. Define SLIs/SLOs that align platform health with customer experience.
Lead large-scale implementations: Design and automate scalable pipelines (metrics, traces, logs, events) so every engineer has sub-second, queryable visibility into production.
Harden our platform: Drive upgrades, capacity modeling, and policy enforcement for our dedicated observability-focused clusters; introduce best-in-class patterns for multi-tenant isolation and cost optimization.
Ship platform enhancements: Contribute production-quality Go or Python services, operators, and Terraform modules that elevate reliability, performance, and developer velocity.
Partner with service owners to embed observability into their SDLC, guide best practices, perform instrumentation reviews, and elevate on-call readiness across the org.
Reduce MTTR, noise, and waste by designing cost-efficient telemetry architectures, high-signal alerting, and automated recovery patterns.
Lead and model operational excellence through on-call participation, post-incident reviews, and continuous improvement initiatives.
What we're looking for
We prioritize demonstrated proficiency and the ability to solve complex problems over years of experience.
Kubernetes & Cloud Mastery
Cluster Operations: Proven ability to architect and manage production-grade Kubernetes (EKS) clusters, specifically for stateful workloads.
Infrastructure as Code: Proficiency in infrastructure as code (IaC), including Terraform.
Observability & Engineering Depth
Telemetry Expertise: Deep production experience with Elasticsearch, Prometheus, or OpenTelemetry. You know how to tune these systems for multi-terabyte daily workloads.
Software Engineering: Proficiency in Go or Python to build custom operators, internal tools, and automation.
Data Pipeline Design: Ability to optimize ingestion and storage for logs, metrics, and traces while balancing query performance with cost-efficiency.
Leadership & Collaboration
Consultative Approach: Ability to influence engineering culture by mentoring peers and partnering with service owners to improve their observability posture.
Growth Mindset: A humble, collaborative approach to problem-solving and a bias toward systemic, automated solutions.
Bonus points
Hands-on success migrating to OpenTelemetry or similar vendor-neutral standards.
Experience tuning Datadog APM/Logs, Grafana/Thanos/Mimir, or ClickHouse-based log stores for multi-TB/day workloads.
Perks & Benefits:
Competitive salaries & meaningful equity
Private Medical Insurance
Life/Risk Assurance
Meal Allowance: 8.55€ per day
Community Days (additional paid holidays)
Paid Annual Leave (22 days)
Paid Sabbatical (after 4 years tenure)
Initial laptop workstation setup
Teleworking Allowance


  • Site Reliability Engineer

    2 weeks ago


    Lisbon, Portugal Tata Consultancy Services Full-time

    Are you a Site Reliability Engineer seeking an interesting new challenge? If so, it's your lucky day. Keep reading: this may be just what you're looking for!


  • Lisbon, Portugal INSCALE Full-time

    Why Join Us? JYSK is a global retail chain that brings Scandinavian design and quality to the world through an extensive range of quality products for sleeping and living. JYSK is known for its commitment to simplicity, functionality, and affordability. With over 3,200 stores in 48 countries, JYSK is a trusted brand for customers seeking to create comfortable...


  • Site Reliability Engineer

    3 weeks ago


    Lisbon, Portugal Sperton Global AS Full-time

    Job Title: Site Reliability Engineer (SRE)
    Location: Lisbon, Portugal (Hybrid)
    Job Type: Contract (6 months)
    Role Overview: We are looking for an experienced Site Reliability Engineer (SRE) to support business-critical systems in the banking and financial services domain. The role has a strong focus on production support, monitoring, automation, CI/CD...


  • Lisbon, Portugal GrabJobs Full-time

    A Site Reliability Engineer at Kevel will define reliability targets, solve security issues, automate tasks, and operate production infrastructure.