As a Site Reliability Engineer (SRE) with 4 to 7 years of experience, you will play a critical role in ensuring the availability, reliability, and performance of our systems and services. You will work closely with various teams to respond to incidents, proactively monitor our infrastructure, and use a variety of tools to maintain system stability. Below are the key responsibilities and qualifications for this role:
- Incident Response: Respond promptly to incidents and coordinate with relevant teams to resolve issues as quickly as possible, minimizing downtime and ensuring a seamless user experience. Help decide when new features can launch by using service-level agreements (SLAs) to define the required reliability of the system, expressed through service-level indicators (SLIs) and service-level objectives (SLOs).
- Monitoring and Ticketing: Monitor the alerts dashboard using tools like Squadcast, Evergent, and Runscope. Create internal tickets where necessary and track their progress.
- Deep Dive Analysis: Utilize available tools and services to conduct in-depth investigations into system issues and provide detailed analysis to assist in problem resolution.
- Impact Analysis: Perform impact analysis to understand the potential consequences of incidents and advise on appropriate actions.
- War Room Support: Collaborate with Incident Managers to set up and participate in War Rooms when necessary to expedite issue resolution and identify root causes.
- Ticket Management: Create and manage tickets, working closely with other vendors and stakeholders to drive issues to resolution and completion.
- Critical Bridge Participation: Actively participate in critical bridges to provide technical expertise and support during major incidents.
- Service Understanding: Develop a deep understanding of the various services, especially RESTful web services, hosted on the system, including their order of execution, successful outcomes, and error scenarios.
- To achieve this, the SRE team uses the following set of tools:
- Confluence – knowledge management tool.
- Squadcast – to monitor alerts.
- Evergent – to check user registration status and retrieve the CPID.
- CMS – repository for SPN Content and Metadata.
- Firebase – to analyze crash-free users and check crash logs.
- Runscope – API monitoring tool that monitors URLs and critical APIs.
- Grafana, Jira, and Google Analytics.
- Bachelor’s degree in Computer Science, Information Technology, or related field.
- 4 to 6 years of experience as a Site Reliability Engineer, or prior experience with AWS.
- Proficiency in tools such as Confluence, Squadcast, Evergent, CMS, Firebase, Grafana, Jira, Runscope, and Google Analytics.
- Experience with API monitoring tools like Runscope.
- Familiarity with incident management and resolution processes.
- Knowledge of cloud computing platforms (e.g., AWS, Azure, GCP); AWS is a plus.
- Certification in relevant areas is a plus, such as:
- AWS Certified DevOps Engineer
- Certified Kubernetes Administrator
- Docker Certified Associate (DCA)
- ITIL Foundation or similar is preferred.