Manager, DevOps
AppZen
Software Engineering
San Jose, CA, USA
USD 240k-280k / year + Equity
As Manager, DevOps, you will lead a DevOps team responsible for the AWS-based infrastructure, Kubernetes platform, CI/CD systems, production datastores (PostgreSQL, Elasticsearch, Redis, and more), and observability stack that power AppZen. You'll set technical direction, coach engineers, partner closely with Product Engineering and Security, and stay close enough to the work to tune a slow Postgres query, debug an Elasticsearch cluster under load, write Terraform, or review a Helm chart yourself.
This is a builder-manager role. We expect roughly 60% leadership and delivery management, and 40% hands-on technical contribution.
Responsibilities:
Manage, coach, and grow a team of 3-6 DevOps and platform engineers; own hiring, performance, growth plans, and 1:1s.
Set quarterly priorities aligned to engineering and business goals; communicate progress and risk clearly to leadership.
Build a healthy on-call culture: balanced rotations, blameless postmortems, and continuous reduction of toil.
Own the architecture, cost, and reliability of AppZen's AWS footprint across multiple regions and accounts.
Drive infrastructure-as-code standards using Terraform; champion modular, reviewable, version-controlled infrastructure.
Partner with Security and Compliance on SOC 2, ISO 27001, GDPR, and customer audit requirements; harden IAM, network, and secrets management.
Manage cloud spend: visibility, forecasting, and ongoing optimization (Savings Plans, rightsizing, multi-tenant efficiency).
Hands-on ownership of PostgreSQL in production: schema reviews, index and query tuning, vacuum/bloat management, replication, failover, point-in-time recovery, and major-version upgrades (RDS / Aurora).
Run and scale Elasticsearch / OpenSearch clusters: shard and index design, JVM and heap tuning, snapshot strategy, hot-warm tiers, and incident response under heavy ingest or query load.
Operate supporting datastores such as Redis (caching, queues), Kafka or SQS/SNS (streaming and async), and S3-backed data lakes; define patterns for high availability, durability, and disaster recovery.
Partner with engineering on capacity planning, performance benchmarking, data tier cost optimization, backup/restore drills, and customer data isolation for multi-tenant workloads.
Operate and improve our EKS-based Kubernetes platform: cluster lifecycle, autoscaling, multi-tenancy, and workload isolation.
Define golden paths for service teams using Helm, Kustomize, and GitOps tooling such as ArgoCD or Flux.
Set patterns for service mesh, ingress, and zero-downtime deployments.
Lead the design of internal developer platform capabilities so product teams can ship safely and quickly without infra friction.
Maintain and improve build, test, and deploy pipelines (e.g., GitHub Actions, Jenkins, ArgoCD); enforce supply-chain security and artifact provenance.
Drive measurable improvements in DORA metrics: lead time, deploy frequency, change failure rate, and MTTR.
Own the observability stack (e.g., Datadog, Prometheus, Grafana, OpenTelemetry); ensure consistent metrics, logs, and traces across services.
Define and operationalize SLOs and error budgets in partnership with service owners.
Lead incident command for high-severity events and convert learnings into durable systemic fixes.
What You Bring:
8+ years of experience in DevOps, SRE, infrastructure, or platform engineering, with at least 2 years leading or managing engineers (formal or tech-lead capacity).
Deep, hands-on AWS experience across compute, networking, IAM, data, and observability services; comfortable designing for multi-account, multi-region SaaS.
Strong production experience with Kubernetes (preferably EKS), including upgrades, autoscaling, and securing multi-tenant clusters.
Demonstrated hands-on operations experience with PostgreSQL at scale (query and index tuning, replication, HA/failover, backups, and version upgrades) and with Elasticsearch / OpenSearch (cluster sizing, shard strategy, ingest tuning, and incident response).
Working knowledge of additional datastores commonly used in SaaS: Redis, Kafka or other message brokers, and object storage; comfortable evaluating tradeoffs between managed services (RDS, Aurora, ElastiCache, MSK, OpenSearch Service) and self-managed options.
Proficient with Terraform and modern IaC patterns; clear opinions on module design, state management, and PR-driven workflows.
Solid scripting and automation skills in at least one of Python, Go, or Bash.
Track record of designing and operating CI/CD pipelines at scale (GitHub Actions, Jenkins, ArgoCD, or similar).
Experience running production workloads under SOC 2 or comparable compliance frameworks; comfortable partnering with Security on audits and remediation.
Excellent communication and stakeholder skills; able to translate infrastructure tradeoffs into language product, finance, and customer teams understand.
Nice-to-Have:
Experience supporting AI/ML or data-heavy SaaS workloads (GPU fleets, vector stores, large async pipelines).
Familiarity with service mesh (Istio, Linkerd) and progressive delivery (Argo Rollouts, feature flags).
Background scaling FinOps practices and managing cloud spend at $5M+ annual run-rate.
Experience operating multi-tenant SaaS with strict data isolation requirements for enterprise finance customers.
Exposure to multi-cloud or hybrid-cloud environments (Azure, GCP).
USD 240,000 - 280,000 a year
AppZen is committed to fair and equitable compensation practices.
The base pay range for this role is posted above. Actual compensation packages are based on several factors that are unique to each candidate, including but not limited to skill set, depth of experience, certifications, and specific work location. The range may differ in other locations due to differences in the cost of labor.
The total compensation package for this position may also include an annual performance bonus, stock, benefits, and/or other applicable incentive compensation plans.