Site Reliability Engineer - Bangalore, India

Cockroach Labs

Software Engineering
Bengaluru, Karnataka, India
Posted on Thursday, May 11, 2023

Databases are the beating heart of every business in the world.

Cockroach Labs is the team behind CockroachDB, an open source, distributed SQL database. In addition to the open source version of the DB, we offer CockroachCloud, a self-service, fully managed cloud offering of CockroachDB. We aim to build infrastructure that keeps pace with the world, so developers can focus on what matters most: building the best products. Join us on our mission to Make Data Easy.

About the Role

Cockroach Labs is expanding its international engineering presence into Bangalore, India, and you must be based in Bangalore to be considered for this position.

CockroachDB is the storage backbone for global services. As a Site Reliability Engineer, you’ll help manage and scale our CockroachDB Cloud services, which span multiple cloud providers. You will oversee our production systems, ensuring that we can provide stable and scalable infrastructure as we deliver CockroachDB to our customers. Roughly half of your time will be spent on greenfield development work, with an emphasis on developing tooling and driving automation. In this role, you will work across multiple teams within CockroachDB Cloud as well as with the development and product teams working on CockroachDB.

You Will

  • Manage the infrastructure for cloud services, including running internal production systems and hosting CockroachDB for our external customers.
  • Design, write and deliver software and systems to increase product reliability and operational efficiency.
  • Develop custom tools as necessary.
  • Keep a complex system running and solve problems relating to mission-critical services.
  • Design, implement, operate, and troubleshoot the automation and monitoring of production clusters to maximize performance and availability.
  • Drive the company through disaster recovery tests, where we manually turn down pieces of CockroachDB to test its overall resilience to failures.
  • Participate in an on-call rotation for our production systems and hosted services.

The Expectations

In your first 30 days, you will onboard and be exposed to our current internal and customer-facing production systems. Working with our existing SRE and engineering teams, you will pair on production operations and build out runbooks for the operation of different systems. We believe that it's essential for you to take this first month to become familiar with our technology and our company.

After 3 months, you'll be fully integrated into the team. You will develop and own tooling for reliability, automation, and other issues related to CockroachDB Cloud’s stability and scalability. You will identify new opportunities for automating processes, streamlining delivery, deploying new core functionality, and building great tools. You will help make CockroachDB Cloud the best platform to host CockroachDB on by bringing your expertise to our database.

You Have

  • Expertise in analyzing, monitoring, and troubleshooting large-scale distributed systems.
  • Experience in software development using one or more of the following: Go, C, C++, Python, Java.
  • Proficiency working with algorithms, data structures, and production troubleshooting.
  • Expertise in working with major cloud providers (AWS, Azure, GCP, etc.) and Cloud APIs.
  • Experience debugging and optimizing code and automating routine tasks.
  • Working knowledge of web and network protocols and standards (HTTP, TLS, DNS, etc.).
  • Previous on-call experience, with a sense of urgency.
  • Experience building collaborative relationships with your colleagues. You enjoy being part of the code review process and partnering with your teammates on challenging problems.
  • 5+ years of experience.

The Team

BabuSrithar - Site Lead, India

BabuSrithar is the Site Leader for India. He is responsible for our growth strategy and is our cultural champion in the region. He is passionate about building high-quality software products and lean teams by leveraging everyone's potential. He enjoys working with people and learning along the way. Before joining Cockroach Labs, BabuSrithar held senior leadership positions at companies like Nutanix and Clumio, and most recently he was VP of Engineering at Apty, where he led engineering globally. When not at work, he enjoys spending time with his 3-year-old and family.

Yandu Oppacher - Director of Engineering

Yandu works across multiple parts of CockroachCloud to ensure that our infrastructure and teams are robust and scalable. Yandu joined Cockroach Labs after nearly 8 years at Shopify, where he started on the data platform team and helped it grow from 4 DB nodes to several hundred Hadoop nodes running over petabytes of data in Google Cloud. In his last 2 years at Shopify, he led the Production Engineering teams responsible for all of the compute runtime resources that power Shopify’s mission-critical services. Joining CockroachCloud and Cockroach Labs allows him to get back to his first love, databases, while applying his Production Engineering skills to help build our DBaaS platform. Outside of Cockroach Labs, Yandu can be found reading or, more likely, chasing after his 3 young kids and exploring the outdoors with them.

Tom Schmidt - Site Reliability Engineering Manager

Tom recently joined Cockroach Labs as manager of Site Reliability Engineering and has taken responsibility for Cockroach Cloud’s production operations. Tom joined Cockroach Labs after 15 years at IBM, where he initially contributed in a wide variety of technical leadership roles, generally focusing on quality and automation across compiler development, test frameworks, CI/CD, and more. Over the past 5 years, Tom has become an enthusiastic advocate of the Site Reliability Engineering discipline, presenting on the topic at conferences, developing certification curriculum, and securing multiple patents. Tom was also a primary contributor to the establishment of IBM’s formal SRE profession and was recognized as one of the first three SRE Thought Leaders within the company. Most recently, Tom transitioned into a management role where he introduced Site Reliability Engineering to the IBM Business Analytics organization, building an SRE team from the ground up and eventually managing over 20 individuals across 3 project areas while establishing practices that now guide over 80 engineers internationally. Cockroach Labs presented a new and unique opportunity to gain experience in a fast-paced startup environment, laying the foundation for scalable reliability as we prepare for the rapid growth of our Cockroach Cloud offering. Beyond the business, Tom is blessed to call himself a proud father of a 2-year-old boy, and otherwise enjoys finding a balance between spending time in nature (hiking, camping, exploring) and testing his mettle in competitive gaming.

Our Benefits

  • Paid parental leave (with baby bucks)
  • Flex Fridays
  • Flexible time off & flexible hours

Cockroach Labs is proud to be an Equal Opportunity Employer building a diverse and inclusive workforce. If you need additional accommodations to feel comfortable during your interview process, please email us at