At IBM, work is more than a job – it’s a calling: To build. To design. To code. To consult. To think along with clients and sell. To make markets. To invent. To collaborate. Not just to do something better, but to attempt things you’ve never thought possible. Are you ready to lead in this new era of technology and solve some of the world’s most challenging problems? If so, let’s talk.
Your Role and Responsibilities
The developer and Site Reliability Engineer (SRE) teams both care about reliability, availability, performance, scalability, efficiency, and feature and launch velocity. However, SREs operate under different incentives, mainly favoring a service’s long-term viability over new feature launches. SREs are responsible for ensuring that services are resilient and responsive and have uptime appropriate to customers’ needs, while controlling capacity and performance. They are also responsible for improving these services in a highly dynamic environment.
In summary, SRE is an engineering discipline that combines software, infrastructure, and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. Day-to-day, SREs use automation to limit time spent on operational work, and they proactively identify potential risk factors and convert them into actionable improvements.
Required Technical and Professional Expertise
Experience automating tasks or problem resolution to reduce toil (PowerShell, shell, Python, etc.)
Knowledge of building and using observability: defining metrics, measures, and dashboards, and using observability tools (Sysdig, Kibana, Prometheus, Grafana, Zabbix)
Experience with a logging and analytics framework (Splunk, LogDNA, or ELK stack)
System design knowledge (cloud-native architectures, best practices for availability and resiliency, practices and methods for problem isolation)
Experience with pipeline tools for deploying and managing applications (Travis, Jenkins)
Confident with infrastructure-as-code tools (Ansible, Terraform, Blueprints)
Confident with source control (GitHub, Perforce)
Experience with cloud services and platforms (IBM Cloud, AWS, GCP, MS Azure)
General Linux knowledge
Network and security knowledge
Comfortable working with Agile practices and Jira