Cloud Reliability Engineer II
Mode Analytics
What's the Role:
We are seeking a dedicated Cloud Reliability Engineer to champion the reliability, availability, and security of our production SaaS platform. In this role, you will act as the first line of defense for our cloud infrastructure, balancing your time between core day-to-day production operations (such as incident management, change management, monitoring, and triage) and automation to reduce operational toil. You will play a pivotal role in maintaining customer trust by strictly adhering to SLAs and compliance processes while driving continuous improvement through code.
What You'll Do:
Operational Excellence & Incident Management
- Monitoring & Triage: Proactively monitor cloud infrastructure health to ensure high availability and performance. Act as the primary owner for production alert monitoring, triage, and swift resolution.
- Incident Response: Manage critical incidents and escalations from identification to resolution. Lead root cause analysis (RCA) and post-incident reviews to minimize Mean Time To Recovery (MTTR) and prevent recurrence.
- Change & Release Management: Execute and track production upgrades, multi-tenant deployments, and change requests within defined SLAs, ensuring zero-downtime maintenance where possible.
- Escalation Support: Handle escalated Support cases and provide infrastructure support for field teams and other environments.
- 24/7 Availability: Participate in a shift-based schedule and on-call rotation to provide round-the-clock support for critical production systems.
Automation & Continuous Improvement
- Task Automation: Utilize Python and Jenkins to script and automate repetitive operational tasks, reducing manual intervention and increasing efficiency.
- Tooling Optimization: Assist in maintaining and optimizing monitoring, alerting, and CI/CD tools to streamline workflows.
- Process Evolution: Identify opportunities to shift left on operations, transforming manual runbooks into automated self-healing mechanisms over time.
What You Bring:
- 2–5 years of professional experience in Cloud Operations, Site Reliability Engineering (SRE), or Kubernetes administration.
- Hands-on experience with public cloud platforms (AWS, GCP, or Azure) in a production environment.
- Operational knowledge of Kubernetes (EKS, GKE, or AKS) including troubleshooting and cluster management.
- Moderate proficiency in scripting and automation, specifically using Python and Jenkins.
- Strong understanding of ITIL processes (Incident, Change, Problem Management).
- Demonstrated ability to prioritize tasks under pressure while maintaining strict SLAs.
- Excellent collaboration skills to work effectively with Engineering, Product, and Support teams.
- Bachelor’s degree in Computer Science, Information Technology, or equivalent work experience.
Preferred Skills:
- Experience with Infrastructure as Code (IaC) tools such as Terraform, Ansible, or CloudFormation.
- Familiarity with cloud-native observability tools (e.g., CloudWatch, Stackdriver, Prometheus, Grafana).
- Strong Linux system administration and networking troubleshooting skills.
- Background in supporting enterprise-grade SaaS platforms with strict compliance and security requirements.
Working Conditions:
- Shift-Based Role: This position requires working in defined shifts to ensure global coverage.
- On-Call: Regular participation in an on-call rotation is required.
- Environment: Fast-paced, collaborative, and process-oriented, with a strong focus on production stability.
What makes ThoughtSpot a great place to work?
ThoughtSpot is the experience layer of the modern data stack, leading the industry with our AI-powered analytics and natural language search. We hire people with unique identities, backgrounds, and perspectives—this balance-for-the-better philosophy is key to our success. When paired with our culture of Selfless Excellence and our drive for continuous improvement (2% done), ThoughtSpot cultivates a respectful culture that pushes norms to create world-class products. If you’re excited by the opportunity to work with some of the brightest minds in the business and make your mark on a truly innovative company, we invite you to read more about our mission, and apply to the role that’s right for you.
ThoughtSpot for All
Building a diverse and inclusive team isn't just the right thing to do for our people, it's the right thing to do for our business. We know we can’t solve complex data problems with a single perspective. It takes many voices, experiences, and areas of expertise to deliver the innovative solutions our customers need. At ThoughtSpot, we continually celebrate the diverse communities that individuals cultivate to empower every Spotter to bring their whole authentic self to work. We’re committed to being real and continuously learning when it comes to equality, equity, and creating space for underrepresented groups to thrive. Research shows that in order to apply for a job, women feel they need to meet 100% of the criteria while men usually apply after meeting 60%. Regardless of how you identify, if you believe you can do the job and are a good match, we encourage you to apply.