Site Reliability Engineer

Johannesburg, SOUTH AFRICA Engineering Full-Time

Linux VMware Cloud NGINX CI/CD Scripting Go Python Virtualisation Reliability

At ING, Site Reliability Engineers (SRE) seek to constantly improve the availability, reliability and performance of our production systems and services.

By design, our SREs are focused on engineering and automation. An SRE is expected to be extremely proficient in coding. To achieve highly available, highly reliable and performance-efficient production systems, an SRE is given a very broad mandate and authority to operate.

If you are good at:

Monitoring & Troubleshooting
Monitoring the performance of our production systems using a host of monitoring tools
Proactively identifying and troubleshooting issues such as software bugs, misconfigurations, performance bottlenecks and coordinating the fix of those issues

Availability & Reliability
Increasing availability and reliability of our production systems
Coordinating Chaos Testing
Capacity Planning: coordinating capacity assessment and capacity planning with IT Engineering and IT Architecture

Technical Risk and Health Assessment
a. Constantly running technical state health assessments on production infrastructure and systems to identify CIs deviating from baseline

Service Level Management
a. Actively monitoring SLAs and ensuring that services perform within promised SLAs
b. Holding IT Engineering, Security and Architecture accountable for the remediation of any SLA degradation

IT Key Controls
a. Ensuring that IT is ‘in CONTROL’ by holding IT groups accountable for adherence
b. Collating and providing necessary evidence to Auditors for these controls

Runners, Automation & Tooling
a. Architecting, creating and automatically managing an army of ‘runners or bots’ that fully automate tasks across infrastructure and applications – e.g. extracting production data, generating production reports, triggering event responses, etc.
b. Identifying and automating manual operational tasks
c. Building and integrating tools that will assist in improving system availability, reliability and performance

Incident & Problem Management
a. Coordinating incident management and service restoration
b. Serving on the on-call team of engineers that supports production systems
c. Working with BizDevOps squads on post-mortems and assisting in identifying and fixing reliability issues

Disaster Recovery (DR) & Business Continuity Planning (BCP)
a. Planning and managing the Disaster Recovery (DR) runbook and DR testing

Production Reporting
a. Gathering relevant data and providing accurate production reporting for availability, reliability, performance and capacity

Service Request Management
A small part of the job requires coordinating the response to the occasional service request from our business partners, for example when a business unit requests the restore of a particular backup.

If you can:

Exhibit sound knowledge of and experience in multiple areas:

Systems administration
RHEL Linux
Virtualisation
VMware
Cloud Concepts, Platforms, Technologies & Tools
OpenShift
Containers & Orchestration – Docker & Kubernetes
Web/App Servers
NGINX, JBoss etc.
CI/CD
Git, Jenkins, Ansible, etc.
Scripting
Bash, Go, Python etc.
Programming Languages
Java
Databases
SQL server administration
Ability to write complex queries
Logging & Analytics
ELK
Networking
Load Balancing – F5, HAProxy
Firewalls
Monitoring
Prometheus, Grafana, DataDog, StatusPage etc.

If you are:

A professional with the following expertise:

Enthusiastic, Curious and Self-driven about improving system reliability, availability and performance
Analyse trends to proactively prevent incidents, and understand and capture key data from logs
Understand traffic flows and key dependencies between services.
Strong sense of ownership of problems
Solve complex problems while remaining cool under pressure
Effective collaboration and communication skills

If you’d like to work with:

Customers
Developers & Technology Team
Experts and Stakeholders

If you want to:

Work at Bonifacio Global City, Taguig – premier central business district
Enjoy competitive compensation and allowances
Be part of a diverse, creative, and energized team

ING

ING is a pioneer in digital banking and at the forefront as one of the most innovative banks in the world. As ING, we have a clear purpose that represents our conviction of people’s potential. We don’t judge, coach, or tell people how to live their lives. However big or small, modest or grand, we empower people and businesses to realize their vision for a better future. We made the promise to make banking frictionless, remove barriers to progress, and make people confident in their financial decisions. As a global bank, we have a huge opportunity, and responsibility, to make an impact for the better. We can play a role by financing change, sharing knowledge, and innovating. Being sustainable is in all the choices we make: as a lender, as a partner and through the services we offer our customers.

© LaunchAfrica 2024
