Senior Site Reliability Engineer
2 weeks ago
Iterable is the leading AI-powered customer engagement platform that helps leading brands like Redfin, SeatGeek, Priceline, Calm, and Box create dynamic, individualized experiences at scale. Our platform empowers organizations to activate customer data, design seamless cross-channel interactions, and optimize engagement—all with enterprise-grade security and compliance. Today, nearly 1,200 brands across 50+ countries rely on Iterable to drive growth, deepen customer relationships, and deliver joyful customer experiences.
Our success is powered by extraordinary people who bring our core values—Trust, Growth Mindset, Balance, and Humility—to life. We foster a culture of innovation, collaboration, and inclusion, where ideas are valued and individuals are empowered to do their best work. That's why we've been recognized as one of Inc's Best Workplaces and Fastest Growing Companies, and were recognized on Forbes' list of America's Best Startup Employers in 2022. Notably, Iterable has also been listed on Wealthfront's Career Launching Companies List and has held a top 10 ranking on the Top 25 Companies Where Women Want to Work.
With a global presence, including offices in San Francisco, New York, Denver, London, and Lisbon, plus remote employees worldwide, we are committed to building a diverse and inclusive workplace. We welcome candidates from all backgrounds and encourage you to apply. Learn more about our story and mission on our Culture and About Us pages. Let's shape the future of customer engagement together.
How you will make an impact:
As a Senior Engineer on the Observability Team, your impact is measured by the clarity and reliability with which our engineers can see into their systems. You don't just provide a suite of tools; you serve as a strategic observability partner for the entire engineering organization.
- Strategic Observability Partnership: You will collaborate deeply with product teams to ensure the frameworks we provide actually solve their problems. Your success is measured by how well teams can diagnose their own services, not just by the uptime of our clusters. You will act as a consultant to help teams define meaningful observability signals that reflect the true customer experience.
- Set the observability vision: Own the long-term roadmap for Datadog, Grafana, Prometheus, Elasticsearch, Quickwit, and emerging OpenTelemetry tooling. Define SLIs/SLOs that align platform health with customer experience.
- Lead large-scale implementations: Design and automate scalable pipelines (metrics, traces, logs, events) so every engineer has sub-second, queryable visibility into production.
- Harden our platform: Drive upgrades, capacity modeling, and policy enforcement for our dedicated observability-focused clusters; introduce best-in-class patterns for multi-tenant isolation and cost optimization.
- Ship platform enhancements: Contribute production-quality Go or Python services, operators, and Terraform modules that elevate reliability, performance, and developer velocity.
- Partner with service owners to embed observability into their SDLC, guide best practices, perform instrumentation reviews, and elevate on-call readiness across the org.
- Reduce MTTR, noise, and waste by designing cost-efficient telemetry architectures, high-signal alerting, and automated recovery patterns.
- Lead and model operational excellence through on-call participation, post-incident reviews, and continuous improvement initiatives.
What We're Looking For
We prioritize demonstrated proficiency and the ability to solve complex problems over years of experience.
Kubernetes & Cloud Mastery
- Cluster Operations: Proven ability to architect and manage production-grade Kubernetes (EKS) clusters, specifically for stateful workloads.
- Infrastructure as Code: Proficiency in infrastructure as code (IaC), including Terraform.
Observability & Engineering Depth
- Telemetry Expertise: Deep production experience with Elasticsearch, Prometheus, or OpenTelemetry. You know how to tune these systems for multi-terabyte daily workloads.
- Software Engineering: Proficiency in Go or Python to build custom operators, internal tools, and automation.
- Data Pipeline Design: Ability to optimize ingestion and storage for logs, metrics, and traces while balancing query performance with cost-efficiency.
Leadership & Collaboration
- Consultative Approach: Ability to influence engineering culture by mentoring peers and partnering with service owners to improve their observability posture.
- Growth Mindset: A humble, collaborative approach to problem-solving and a bias toward systemic, automated solutions.
Bonus points
- Hands-on success migrating to OpenTelemetry or similar vendor-neutral standards.
- Experience tuning Datadog APM/Logs, Grafana/Thanos/Mimir, or ClickHouse-based log stores for multi-TB/day workloads.
Perks & Benefits:
- Competitive salaries & meaningful equity
- Private Medical Insurance
- Life/Risk Assurance
- Meal Allowance: 8.55€ per day
- Community Days (additional paid holidays)
- Paid Annual Leave (22 days)
- Paid Sabbatical (after 4 years tenure)
- Initial laptop workstation setup
- Teleworking Allowance
Recruitment Disclaimer:
Please be aware that Iterable, Inc. ("Iterable") and our official professional recruiting agencies and platforms do not:
- Send job offers from free email services like Gmail, Yahoo mail, Hotmail, etc.
- Request money, fees, or payment of any kind from prospective candidates to apply to Iterable, for employment, or for the recruitment process (e.g. for home office supplies, or training, etc.).
- Request or require personal documents like bank account details, tax forms, or credit card information as part of the recruitment process prior to the candidate signing an engagement letter or an employment contract with Iterable.
You may see all job vacancies on our official Iterable channels:
- Official Iterable website, Careers page:
- Official LinkedIn Jobs page:
Iterable is not affiliated in any way with these impostors, and we hereby confirm that such individuals/entities are not authorized, encouraged, or sponsored to act on behalf of Iterable. Such job opportunities are entirely fake and not valid. Therefore, please disregard any written or oral request for a job offer or an interview that you believe is or might be fraudulent or suspicious and immediately reach out to us via email at
talent-
upon receiving a suspicious job offer.
Criminal and/or civil liabilities may arise from such actions, and Iterable expressly reserves the right to take legal action, including criminal action, against such individuals/entities whenever such phenomena occur. In any case, please note that under no circumstances shall Iterable and any of its affiliates be held liable or responsible for any claims, losses, damages, expenses or other inconvenience resulting from or in any way connected to the actions of these impostors.
Iterable is an Equal Employment Opportunity employer that proudly pursues and hires a diverse workforce. Iterable does not make hiring or employment decisions on the basis of race, color, religion or religious belief, ethnic or national origin, nationality, sex, gender, gender-identity, sexual orientation, disability, age, military or veteran status, or any other basis protected by applicable local, state, or federal laws or prohibited by Company policy. Iterable also strives for a healthy and safe workplace and strictly prohibits harassment of any kind. Pursuant to the San Francisco Fair Chance Ordinance and other similar state laws and local ordinances, and its internal policy, Iterable will also consider for employment qualified applicants with arrest and conviction records.