Measuring software engineering team performance with the RDD metric
In today’s competitive world, whether a company thrives or flops is determined by the execution of its people and teams. Having worked in software development for a decent amount of time, I’ve been part of various teams with notably different performance. And since I was put into a Team Lead role, the question of how to establish a high-performance team, and how to measure its performance, has been one of my daily focuses. This article is a summary of our team’s findings and current approach, which I hope might be helpful to others too.
Why do we want to measure the performance of the software engineering team?
Monitoring performance just for monitoring’s sake is a waste of effort and time. Using vanity metrics to determine the team’s and individual contributors’ bonuses is a bad idea too, as it might lead to low-quality value for the customer.
The way we look at monitoring is that it’s a tool that indicates whether we as a team are moving in the right direction. And the right direction for us means delivering high-quality value to our customers in the shortest possible time. This is why we regularly review the team’s metrics to detect whether there are actions we have to take to improve the way we work.
Which metrics to monitor?
What to monitor is as important as why. As I said, chasing vanity metrics might make the team look good from the outside even though the actual performance is low.
Focusing only on team-level metrics might decrease the fulfillment of individual engineers, while paying attention only to individual contributors’ performance can create unnecessary competition within the team.
The software engineering team’s goal is to deliver value to the customer, and that value exists only after it has been released and is being used by a customer. Taking into account that most engineers get a feeling of accomplishment from doing work that is used by customers or other developers, we concluded that the best way to determine whether the team is performant is to look at how many production releases it does in a certain period: in our case, a day.
Introducing the right metric
Releases per Day is a straightforward team metric, but as I mentioned, we shouldn’t lose the individual aspect when monitoring the team’s performance. Even if we reached an outstanding number of Releases per Day, this number would be misleading without taking into account how many developers are in the team. Therefore we added a third dimension, which gave us our final metric: Releases per Day per Developer.
Defining the target RDD for the team
To define how improving the RDD would help the team, we had to set goals for the team as well as for individual contributors. The team’s goal is to release value to our customers in the shortest possible time to market.
Individually, as software engineers, our goal was to see that our work matters: we work on features that our customers truly need, and we see the code released and used in production as soon as possible.
Combining these two intentions, we came up with an exact number, our north star. In the foreseeable future, our goal is to reach 1 Release per Day per Developer. For comparison, when we started tracking this metric, our RDD was about 0.05 to 0.12. This means that with the effort of 5 developers, we were able to release only once or twice a week. One must take into account that the releases were significant in size, with many changes, which noticeably slowed down the release cycle.
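To make the arithmetic concrete, here is a minimal sketch of the metric as a TypeScript helper (the function name and the 5-day working week are my own assumptions for illustration, not part of our tooling):

```typescript
// Releases per Day per Developer: a hypothetical helper for illustration only.
function rdd(releases: number, workingDays: number, developers: number): number {
  return releases / workingDays / developers;
}

// 5 developers releasing twice in a 5-day working week:
console.log(rdd(2, 5, 5)); // ≈ 0.08, roughly where we started
// Reaching the north star of 1 RDD would mean 25 releases a week for the same team:
console.log(rdd(25, 5, 5)); // 1
```

The third dimension is what keeps the metric honest: doubling the team without changing anything else would leave Releases per Day flat only if RDD halved.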
Our experience of reaching a higher RDD
To move the needle towards the team’s goal, we had to take a close look at how we do our work of delivering value. The following are our main areas of focus.
Trunk Based Development
First of all, the team decided to use the Trunk Based Development approach for its Git workflow. We saw that long-living feature branches, as well as staging (dev, test, pre-production, etc.) branches, would prevent us from getting code into production promptly. A production-ready master branch that is deployable at any time allows us to move code changes through the stages more rapidly. As soon as changes are approved and merged into the master branch, we can start our QA and delivery process.
Automated tests and Continuous integration
Second, manual testing must be kept to a minimum; ideally, only acceptance tests and exploratory tests should be performed manually, and they have to be done at the final stage before production. This required the team to apply automated tests early in the development phase, where developers are responsible for covering their code with:
- unit tests for the business logic in TypeScript code;
- DOM snapshot tests for the HTML markup rendered by Vue.js components;
- integration tests for REST API endpoints;
- system tests with Cypress for the happy path of the user flow;
- and screenshot snapshot tests, which assert that the screenshot of a particular view has not changed.
This amount of testing sounds like a lot of work for a developer and a lot of time to execute. But this is where Continuous Integration tools come into play.
We’ve set up the CI infrastructure so that different test groups run in parallel, and the whole suite takes less than 10 minutes in total. Also, the heaviest tests, such as the system tests and screenshot snapshot tests run by Cypress, are executed inside a Docker image. This allows a developer to replay them locally with the same environment setup as on the CI.
Continuous delivery of production-ready image
An important aspect of a repeatable and environment-agnostic release is delivering it from pre-built artifacts. Our CI/CD process builds a production-ready Docker image as soon as the code is merged into the master branch, and the same image is promoted through all stages. This means that the test and staging environments run exactly the same code; all that differs are the environment values, which are injected from HashiCorp Vault at deploy time.
Releasing with feature toggles
Because we practice short-lived feature branches and maintain an always-deployable master branch, we had to adopt feature toggles, as it is not always possible to develop and merge a fully functional feature within a day. Therefore we split user stories into smaller subtasks and merge them into master one by one. To prevent unfinished features from being used by customers, we disable the code responsible for them behind a feature toggle. The toggle allows us to turn the feature on and off on demand, for example enabling it in the staging environment to perform manual testing.
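A minimal sketch of the idea (the toggle name and the in-memory lookup are invented for illustration; real setups typically read toggles from configuration or a remote service):

```typescript
// A very simple in-memory feature-toggle store; real implementations usually
// fetch toggle state from configuration or a toggle service at runtime.
const toggles: Record<string, boolean> = {
  "new-checkout-flow": false, // merged to master, but hidden from customers
};

function isEnabled(feature: string): boolean {
  return toggles[feature] ?? false; // unknown toggles default to off
}

function renderCheckout(): string {
  // The unfinished code path stays dark until the toggle is flipped,
  // e.g. in the staging environment for manual testing.
  return isEnabled("new-checkout-flow") ? "new checkout" : "old checkout";
}

console.log(renderCheckout()); // "old checkout" while the toggle is off
toggles["new-checkout-flow"] = true; // flipped on demand
console.log(renderCheckout()); // "new checkout"
```

The same mechanism doubles as a kill switch: if the new path misbehaves in production, turning the toggle off is much faster than shipping a revert.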
Planning for incremental delivery
The last, but not least, important thing about increasing the RDD is planning. Feature toggles allow us to release code that is not finished yet, but from the business and customer point of view there is no value in such code, and the longer a feature stays in progress or turned off, the less value it might have in the future.
This is why, at the planning stage of a new feature, it is important to explore whether its development can be split into multiple incremental deliveries; for example, applying some UI and UX changes to existing features or system parts before launching the full feature. This lets us gradually gather feedback on the proposed changes and alter the intended feature if the feedback shows a need for that.
The results so far
We have been monitoring the RDD for about two months now, and honestly, we haven’t seen the numbers increase manyfold. On average, we have doubled our release count to 4 or 5 releases per week, which is about 0.2 RDD (up by around 70%).
There are many factors that affect this number, for example the nature of the tasks. Currently we are focusing more on migrating code from one system to another than on developing new features. But at this stage of our team, that is acceptable, as we observe huge progress in many areas of software development.
- First of all, the Time to Market has decreased dramatically: we are now able to prepare a release and deploy it to production in less than 30 minutes.
- Also, confidence in production deployments has increased, thanks to the minimal need for human interaction. All a developer has to do is press the “Accept” button to promote an image to the next stage.
- In case of a failure in the production environment, feature toggles allow us to turn the broken feature off until we deploy a patch. And in the worst-case scenario, the pre-built image of the previous version can be redeployed.
- But of course, all of this is supported by various kinds of automated tests with a proper level of code coverage.
Even though we’re far away from reaching our north star of 1 Release per Day per Developer, we have already observed the effect that defining concrete, measurable metrics can have on the team’s work and development process.
If you have a similar or contrasting experience of measuring a team’s performance, I would more than welcome you to share it with us.