Site Reliability Engineer

Zappi is a SaaS company that aims to completely transform the market research industry. Our platform integrates world-class research methodologies and engineering to allow brands to run consumer testing through all stages of advertising and innovation development. We are constantly innovating and tackling diverse and complex problems using a multitude of technologies in order to scale expertise and create the world’s most powerful enterprise research platform, thereby making the world of insights even better.

We have created an environment that fosters constant learning and innovating and we believe in having ambitious goals. We are data scientists, developers, researchers, analysts, designers, engineers, and marketers all driven by the notion of trying to make the impossible possible. To realise our vision we are constantly in search of people who will bring a different perspective, who will challenge our thinking, create value for our customers and apply themselves passionately to our vision and culture.

You:

  • Prioritise your own learning.
  • Are passionate about solving complex problems with an array of engineering techniques.
  • Are versatile and think outside the box.
  • Display leadership qualities and are enthusiastic about taking on new problems.
  • Take time to listen deeply before acting.
  • Aren't afraid to address challenging issues directly, with compassion.
  • Lead through inspiration not coercion and create space for others to lead.
  • Are comfortable with radical transparency.
  • Are comfortable being uncomfortable; change is constant at Zappi.
  • Are humble and honest.

We:

  • Listen carefully to each other and to our customers.
  • Believe anyone can achieve great things; we don’t put people in boxes.
  • Promote experimentation and freedom of expression whether it is with new engineering practices, new technologies or new cultural and working practices.
  • Fundamentally trust each other.
  • Leave our egos at the door.
  • Aren't afraid to fail.
  • Want to have a positive impact on the planet and our communities.

We are an equal opportunities employer; our diversity is a major strength. We maintain a constant dialogue with our teams and wider communities about how we can become a more inclusive place to work.

Role

We are looking for one Site Reliability Engineer to help us better manage the infrastructure that runs the Zappi platform and support the workflows of 60+ developers. We pride ourselves on making major infrastructure changes a non-event, providing tools to increase developer productivity and giving developers the confidence to ship features to our remote environments often, quickly and easily, even on a Friday! You would be joining a team of 3 that consists of King, Zac and Hadrian.

On a day-to-day basis you would be involved with everything listed in the ‘What You'll Work On’ section of this job description. You will also need to write code from time to time, so programming proficiency is important. Web development experience helps.

Our expectation is that you are at an intermediate level. However, if you consider yourself to be at a more senior level, we would still love to hear from you!

Our Stack

It includes but is not limited to:

  • CI/CD – Jenkins (task runner), CircleCI (application tests) and Port Control (internal application that powers a Heroku-like deployment experience for developers)
  • Cloud – AWS (for compute) & GCP (for select application APIs)
  • Containerisation – Docker
  • Databases – MySQL (Aurora), Postgres (RDS), Redis (ElastiCache), RedisGraph, RedisRoaring and Elasticsearch
  • Infrastructure as Code – Terraform, Jenkins-Job-Builder
  • Logging Stack – Elasticsearch, Logstash, Kibana and Filebeat
  • Metrics Stack – Prometheus (including Alertmanager), Grafana and InfluxDB
  • Operating System – Linux
  • Orchestration – Kubernetes
  • Programming Languages – Ruby (Ruby on Rails), Python, JavaScript (Node.js), Go, Elixir (Phoenix) and PHP (WordPress)
  • Tracing – Honeycomb
  • Version Control – Git and GitHub

We run all our applications on self-managed Kubernetes clusters which we bootstrap and manage using Kops. We’ve complemented our Kubernetes setup quite a bit using add-ons such as: AWS Load Balancer Controller, Calico, Cluster Autoscaler, Custom Metrics Adapter, External DNS, Falco, Kube State Metrics, Metrics Server, Nginx Ingress and Node Problem Detector.

And we run all of the above on AWS. Some of the primary services we use and maintain are: CloudFront, CloudWatch, Cost Explorer, ECR, EC2, ELB, ElastiCache, Redshift, RDS, Route 53, S3, SES, SNS and SQS. We also use and help maintain the following services alongside the security team: CloudTrail, Cognito, GuardDuty, KMS, Macie and WAF.

If you find our stack interesting then you’ll probably love working with us! We also have some talks up where we share a little about our journey and experience.

What You'll Work On

To give you an inkling of what a typical day looks like, here are some tasks that you would find yourself working on day to day:

  1. Designing, building, and maintaining the core infrastructure used by all of our development teams.
  2. Building and maintaining internal tooling to manage continuous integration and deployment.
  3. Automating arduous developer processes with the goal of making developers' lives easier.
  4. Debugging issues across services and different levels of the stack.
  5. Monitoring and managing the cost of our infrastructure.
  6. Planning for the growth of our infrastructure.
  7. Working closely with the security team to configure for secure infrastructure.
  8. Improving the experience of internal and external clients.
  9. Writing high-quality application code in Go and Ruby.
  10. Writing scripts to automate small tasks, e.g. Bash scripts.
  11. Supporting developers in rolling out high-risk application changes, e.g. large migrations.
  12. Performing upgrades to keep everything up to date.
  13. Maintaining documentation of our infrastructure and tooling.
  14. Educating developers on our infrastructure and tooling.

We also have a keen interest in blogging more about what we do and open-sourcing anything that would benefit the wider community.

How We Work

We start from a position of trust. We believe that given the right information, people will make good decisions. Therefore we lean toward principles and guidelines rather than hard and fast rules. Here are a few things that we would like to highlight about what it is like to work with us:

  • Advice & Feedback – You should both count on, and be prepared for, completely honest advice & feedback from your team-mates. We may offer encouragement or criticism, indifference or unease; in any case, you can count on it being honest and candid, and from the heart. And in return, we expect and encourage you to also be courageously honest.
  • Decision Making – Once you have sought advice, you are empowered to make a decision. Not everyone has to agree with your chosen course of action; we value disruptive innovation and it might not always please everyone. Constantly seeking consensus can be tiresome and so we place emphasis on obtaining consent, not consensus.
  • Meetings – We have only three meetings per week, each running about 30 minutes to an hour depending on what needs to be discussed: one on Monday morning to plan the week, another on Wednesday where we break work down into chunks and make sure there’s a ticket to track each chunk, and a third on Friday to recap the week and provide peer feedback. The rest of the time you’re free to manage your time as you wish as long as you’re getting work done. In some cases we scrap a meeting if we feel we’re already aligned.
  • Communication – Most of our communication is on Slack and should be asynchronous. However, if you’re blocked you’re free to nudge anyone on the team to unblock you.
  • Conventions – Internally we have a lot of conventions that we’ve come to follow over the years. Understandably you won’t know about all these but we value your perspective and want to make the most of your unique point of view. We’ll be ready to listen and discuss anything that you would like to challenge.
  • Onboarding & Support – You can expect to have the support you need for a delightful onboarding experience, with enough room to learn at a reasonable pace.
  • Working Hours – While 8-5 are the official hours, you have the freedom to slide this earlier or later depending on what works best for you (in agreement with the team). It's certainly true that every now and then crunch time hits hard and we might have to work some extra hours, but this is the exception rather than the norm.
  • On-Call – We have one person from the team rotate weekly. Their responsibility is to handle issues that arise after hours so as to give the rest of the team room to not think about work after hours. Our on-call is considered paid overtime.

Requirements

For the intermediate level, experience with logging, monitoring, containerisation, container orchestration, continuous integration/deployment, database management and cloud infrastructure is required.

We have a flat structure (no top-down decisions), so the person in this role would have a high degree of autonomy and be expected to make good choices, as it’s a team with high responsibility and very low margin for error. Beyond that, every bit of relevant experience helps.

For the senior level, you would be expected to make an impact sooner. Since a more senior applicant will be expected to have several years of experience in the space (potentially in areas that we ourselves may not be as knowledgeable in), they will bring more to the team from a design, team direction and mentorship perspective. On the more technical side of things we would expect:

  • Opinions on alerting, application configuration, autoscaling, automation, centralised logging, cloud infrastructure, containerisation, deployment strategies, distributed systems, failure management/modes, high-availability, immutable infrastructure, monitoring, latency reduction, load-testing, performance measuring and security, since you will have a high degree of influence on the design and implementation details of our infrastructure.
  • Knowledge of security tooling (e.g. SIEMs, IDSs & IPSs) would be a plus but it's absolutely not required as we have an internal dedicated security team that we work closely with to tackle any security work. However, you should have a general sense of what’s required to configure our applications and infrastructure securely.

Application Process

Once selected, our typical interview process will run you through the following steps:

  • Technical Interview – A role-based technical assessment that evaluates your grasp of operations & infrastructure related tasks as well as your application programming skills. The former is a take-home exercise that you have about a week to work through in your own time, and the latter is a chat on Zoom where we walk through your thinking (no hard-core algorithms, just a basic programming exercise that you would encounter on a typical day).
  • Team Chat – Casual 30 minute to one hour chat with your team-mates, possibly on Zoom. We don't aim for this to be long, but we're open to giving you as much opportunity as you need to get to know us, and vice versa.
  • Company Chat – One hour coffee chat with people from different teams across the company so that you get acquainted with others in the company. If you prefer, it can be on Zoom.
  • CTO Chat – Thirty minute chat with our CTO on Zoom which includes (but is not limited to) discussing salary expectations.

Benefits

  • Competitive pay scales benchmarked annually.
  • Unlimited holidays – and this is not a trap! We expect and encourage people to take plenty of leave.
  • Flexibility to work in a way that suits your lifestyle with flexible working and travel arrangements to and from work.
  • Nice working setup, e.g. MacBook Pro, high-res screen or 2 monitors, keyboard, mouse, stand etc. Basically, you’ll get what you ask for to make you productive.
  • Support setting up your home office, if appropriate, e.g. chair, desk etc.
  • Open plan office with stocked snacks, fruit, beers & cool-drinks.
  • Paid 24 hour secure parking.
  • Free yoga.