Pratik Bhattacharya

Control Rollout; or how I stopped worrying and started testing in Production - Part II

Updated: Aug 4, 2022

This is the second part of the Controlled Feature Rollout blog. In this part, I will discuss building and maintaining a centralized feature flighting management system, and I will introduce an open-source GitHub library that is heavily used to maintain feature flights. See the first part here.


Where are my feature toggles? 🙋‍♂️

The theory of controlled feature rollout can be implemented by individual teams in any way they see fit. However, investing in a centralized system has advantages, especially at an organizational level.

The most apparent advantage is the reduction of development effort. You can't expect every application team to build its own feature flighting system and still maintain a fast time-to-market. A centralized service with standard contracts lets every team consume the same functionality, and you can have a dedicated engineering team responsible for maintaining and improving the feature flighting system.

Some key aspects of such a centralized system would be:

  • Configuration Experience

You need to provide a user experience where admins from other applications and services can set up feature flags and create ring conditions.

  • REST APIs

You can use any standard protocol (like REST) to expose APIs. The most used API would be the evaluation API, which accepts the feature names and the user context and returns whether each feature is enabled.
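As a rough sketch, an evaluation call from a client might look something like this (the endpoint, payload shape and field names are illustrative, not the actual contract of any specific service):

```python
import requests

# Hypothetical evaluation endpoint; the real contract will differ per implementation.
EVALUATION_URL = "https://flighting.example.com/api/v1/evaluate"

def evaluate_features(feature_names, user_context, app_name):
    """Ask the central flighting service which features are enabled for this user."""
    payload = {
        "application": app_name,
        "features": feature_names,   # e.g. ["new-search-algorithm"]
        "context": user_context,     # e.g. {"userId": "u123", "country": "IN"}
    }
    response = requests.post(EVALUATION_URL, json=payload, timeout=2)
    response.raise_for_status()
    # Expected response shape: {"new-search-algorithm": true, ...}
    return response.json()
```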

  • Performance

Feature flighting introduces an extra condition to determine whether a user is part of a ring, which adds overhead and raises performance concerns. The evaluation should therefore have as low a latency as possible. Typical performance boosters like caching and pre-evaluation of the conditions are commonly used.
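A short-lived in-memory cache on the client side is one illustrative way to shave off that latency; the TTL and key format below are arbitrary choices for the sketch:

```python
import time

class EvaluationCache:
    """Tiny TTL cache so repeated evaluations for the same user skip the network call."""

    def __init__(self, ttl_seconds=60):
        self.ttl = ttl_seconds
        self._entries = {}  # (user_id, feature) -> (result, expiry_timestamp)

    def get(self, user_id, feature):
        entry = self._entries.get((user_id, feature))
        if entry and entry[1] > time.monotonic():
            return entry[0]
        return None  # miss or expired -> caller falls back to the evaluation API

    def put(self, user_id, feature, result):
        self._entries[(user_id, feature)] = (result, time.monotonic() + self.ttl)
```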

  • Reliability

We already discussed that once a user is assigned a feature, the user must be exposed to the same feature in subsequent cycles. If the centralized feature flighting system is disrupted, clients won't know which feature to assign; in most cases, the client defaults all users to the old feature.

One popular way to tackle this problem is by downloading a snapshot of the ring conditions and features to the client system and performing the evaluation from the downloaded data. This ensures that even if the Flighting Service is down, the client system will assign features based on the last copy of the snapshot. Snapshotting also boosts performance because it saves an API call.
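A minimal sketch of that fallback idea, assuming an invented snapshot format (a real system would mirror whatever rule shape the flighting service stores):

```python
import json

class SnapshotEvaluator:
    """Evaluates feature flags from a locally downloaded snapshot so the client
    keeps working even when the central flighting service is unreachable."""

    def __init__(self, snapshot_path):
        with open(snapshot_path) as f:
            # Example snapshot entry:
            # {"new-search-algorithm": {"enabled": true, "allowedCountries": ["IN", "US"]}}
            self.flags = json.load(f)

    def is_enabled(self, feature, user_context):
        rule = self.flags.get(feature)
        if rule is None:
            return False                      # unknown feature -> fall back to old behaviour
        if not rule.get("enabled", False):
            return False
        allowed = rule.get("allowedCountries")
        if allowed and user_context.get("country") not in allowed:
            return False
        return True
```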

  • Data latency

Once a feature is promoted or a ring condition is updated, the information must be reflected immediately in the client system. This becomes more critical during feature rollback (especially if a critical bug has been detected). Caching and downloading snapshots hinder data latency, so there must be a mechanism to update the snapshot or cache at regular and frequent intervals. You can build a polling mechanism that polls the central servers and keeps the snapshot continuously updated, or take advantage of an event-based mechanism to update the client cache.
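A bare-bones polling refresher might look like this sketch; the interval and the download/apply callables are placeholders, and an event-based push would replace the sleep loop:

```python
import threading
import time

def start_snapshot_refresher(download_snapshot, apply_snapshot, interval_seconds=30):
    """Polls the central service at a fixed interval and swaps in the fresh snapshot.

    download_snapshot: callable returning the latest flag data from the service.
    apply_snapshot:    callable that atomically replaces the client's local copy.
    """
    def loop():
        while True:
            try:
                apply_snapshot(download_snapshot())
            except Exception:
                # Keep serving from the last good snapshot if the refresh fails.
                pass
            time.sleep(interval_seconds)

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread
```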

  • Client SDK

Providing a client SDK is essential when you are building caches and snapshots. Asking clients to maintain their own cache is risky because they won't be aware of refresh intervals, which will cause data latency. Be familiar with your organisation's tech stack to choose which programming languages must be supported.


Let's Build It! 🏗

The diagram below shows a high-level design of a centralized Feature Flighting System leveraging Azure App Configuration's Feature Management APIs. The numbered components are:


1. Azure App Configuration

The heart of the system; responsible for storing all the feature flags and filter parameters.


2. Backend API (App Service)

Hosts the code for evaluating the feature flags. The logic for implementing Rings, custom filters and complex expressions is executed here.


3. Blob Storage

Stores additional rules for maintaining complex feature flags.


4. Load Balancer

Balances traffic between multiple regions of the App Service.

5. Common Infrastructure

  1. Key Vault - Stores secrets (like App Configuration and Blob connection strings)

  2. Application Insights - Telemetry and monitoring of system health

  3. Active Directory - Authenticating client requests

6. Admin UI

Web application used by partners to create/edit/delete feature flags



Jack of Many Trades 🤹‍♂️

We can use feature toggles for multiple purposes:


A/B Tests

A popular experimentation technique where teams release one or more variants of an existing feature to a subset of users to gather insights about the variants. Feature toggles are a great way to assign variants to different users. You can then monitor the KPIs for each variant and make a scientific decision about which variant performs best. Using this mechanism, teams can make decisions based on product usage and avoid HiPPOs (the Highest Paid Person's Opinion).
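For illustration, a common way to assign variants deterministically is to hash the user id together with the experiment name, so the same user always lands in the same variant (a sketch, not the exact algorithm of any particular library):

```python
import hashlib

def assign_variant(user_id, experiment_name, variants=("control", "treatment")):
    """Deterministically maps a user to a variant: same inputs, same bucket every time."""
    digest = hashlib.sha256(f"{experiment_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

# assign_variant("user-42", "new-search-algorithm") -> always the same variant for user-42
```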


Canary Channels (Preview version)

Allow a small percentage of enthusiastic users to preview new features before they are released globally. Feature toggles can be used to identify the Canary users and expose the new features to these users.


Trunk-based development

A mode of development where engineers continuously check code into the main branch even if the full feature is not complete. Traditionally, incomplete features are kept in long-lived feature branches, which makes the final merge messy. Feature toggles guard the incomplete code so that it does not impact the rest of the product.
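In practice the guard is just a conditional around the unfinished path; here `is_enabled` stands in for whatever call your flighting SDK exposes, and the search functions are illustrative stubs:

```python
def legacy_search(query):
    return [f"legacy result for {query}"]    # behaviour every user sees today

def new_search(query):
    return [f"new result for {query}"]       # unfinished path, merged to main but flighted to no one yet

def get_search_results(query, user_context, is_enabled):
    # Incomplete code checked into main stays dark until the flag is turned on.
    if is_enabled("new-search-algorithm", user_context):
        return new_search(query)
    return legacy_search(query)
```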


Circuit Breakers

A general pattern in cloud-based microservices. During unanticipated faults in a service, dependent services can keep invoking the failed service, which may have cascading effects (like socket exhaustion and memory leaks). With this pattern, you can stop an application from repeatedly calling a service that is likely to fail. Feature toggles can serve as a centralized mechanism to open a circuit without a deployment; additional logic might be needed to open and close the circuit automatically.
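A simplified sketch of a toggle-driven kill switch (a real circuit breaker would also track failure rates and half-open probes; here the flag alone decides whether the circuit is open):

```python
class ToggleCircuitBreaker:
    """Skips calls to a downstream service when a centrally managed kill-switch flag is on."""

    def __init__(self, is_enabled, kill_switch_flag):
        self.is_enabled = is_enabled            # SDK callable: (flag, context) -> bool
        self.kill_switch_flag = kill_switch_flag

    def call(self, downstream_fn, fallback_fn, context):
        if self.is_enabled(self.kill_switch_flag, context):
            # Circuit forced open via the feature flag -- no deployment needed.
            return fallback_fn()
        return downstream_fn()

# Usage sketch:
# breaker = ToggleCircuitBreaker(is_enabled=my_sdk.is_enabled, kill_switch_flag="payments-kill-switch")
# breaker.call(call_payment_service, return_cached_response, user_context)
```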

But before that ... Instrumentation 👨‍🏫

Before investing in creating a Feature Flighting System, you must ensure that proper instrumentation is present in your applications. The keyword here is "measurable": whatever metrics you consider (success metrics, guardrail metrics or other KPIs), they must all be measurable. Even if your success criteria are based on user feedback, there must be a way to quantify the feedback. For example, you can employ AI-based techniques to calculate a sentiment score from your user feedback and then verify whether your new features improve that score.

Before developing any new feature, consider what metrics you are trying to improve. First, have a conversation with your team, involve other stakeholders, and discuss the feature's objective, especially what you intend to improve in the user experience. Next, determine the exact metrics you intend to improve and verify that they are measurable. If they are not, invest in adding instrumentation to your code to measure them. Once those metrics are in place and verified on the existing feature, build the new feature, experiment with a small set of users, and verify the improvement.



Ignored but Critical 👩‍🏫

In this section, I will discuss a few understated topics that are critical for successfully rolling out features.


Bias - Random is good 👩‍🌾

We had earlier talked a little about bias during Ring-based deployments. If the subset of users chosen for a ring is not adequately diversified, your results may have biases, leading to incorrect conclusions about your success metrics.

Continuing with the search-algorithm example, let's assume your Ring 1 mainly consisted of users from countries with high network bandwidth. These users might not notice lags and latencies in your website because of their fast connections. When the same app is used in low-network conditions, the latencies become prominent and might drive users away from your app. Since your Ring 1 users did not experience this lag, the observed KPIs will be mostly positive; when the app is released to the global audience, you will see a sharp decline in the same KPIs.

This is a prime example of biased data because you did not diversify your sample users. Your sample users must have a representation of all types of users. All user attributes like Location, Device Type, OS, and demographic data must be considered to diversify your Rings. The best way to eliminate bias is by randomly assigning users to your rings. Randomization ensures no preconceived notions are in place while deciding the users for experimentation.
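One illustrative way to randomize ring membership while keeping assignments stable is a salted hash of the user id; the salt and ring percentages below are arbitrary choices for the sketch:

```python
import hashlib

def assign_ring(user_id, salt="rollout-2022", ring_percentages=(1, 9, 40, 50)):
    """Maps a user to a ring purely by hash, independent of location, device or OS.

    ring_percentages: share of users in Ring 0, 1, 2, 3 (must sum to 100).
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    cumulative = 0
    for ring, share in enumerate(ring_percentages):
        cumulative += share
        if bucket < cumulative:
            return ring
    return len(ring_percentages) - 1
```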


Clean it up 🧹

Highly understated and heavily ignored. Once a feature has landed, it's imperative that the feature flag is deleted from the central system and the condition is removed from the application code. You can keep the legacy code in your system for a few months as a backup in case the new code has any long-term side effects, but beyond that the legacy code just sits ignored, accumulating technical debt. An unnecessary feature toggle also remains in the system, and an extra API call is made to evaluate it. Neither of these is healthy for your system. An intelligent centralized feature flighting system should send warning notifications to feature owners about long-lived feature toggles.

As an application developer, cleaning up code for the legacy feature should be part of your team's regular cadence.


Can't be that good (or bad) ⁉

If the data looks too good, it's probably wrong. This is known as Twyman's law. When you are experimenting with new features, the lift in your target KPIs won't be dramatic in most cases. There are exceptions where small features cause huge improvements, but usually the gain is small. So when the success metrics of your rings improve too much, it's better to take a pessimistic outlook and verify all aspects of the feature toggle. In many scenarios, the instrumentation methodology changes with the new feature; for example, you might have changed the dimensions or frequency of telemetry, which can skew the comparison with the existing data.

Other times, the success metrics themselves are ambiguous. For example, take a metric like average revenue per user: if your user retention rate is dropping, the average might actually increase (since fewer users remain). This is where guardrail metrics play an important role in identifying potential false positives.


You didn't forget to test, right? 🧪

Until the existing feature is deprecated, your automated tests (unit and integration) must cover both feature variants. The feature toggle in your code should be mockable, ensuring that you can unit test the variant assignment.

In a nutshell, you should be able to test all the variants in the system, and by mocking the feature toggle, you should be able to direct the flow of control from your test cases.
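For example, with the flag evaluation injected as a dependency (as in the earlier sketches), a unit test can pin it to either value; the names here are hypothetical, and the test uses Python's built-in unittest:

```python
import unittest

def get_search_results(query, user_context, is_enabled):
    # Simplified version of the flighted code path from the earlier sketch.
    if is_enabled("new-search-algorithm", user_context):
        return "new"
    return "legacy"

class SearchVariantTests(unittest.TestCase):
    def test_new_variant_when_flag_on(self):
        result = get_search_results("q", {}, lambda flag, ctx: True)
        self.assertEqual(result, "new")

    def test_legacy_variant_when_flag_off(self):
        result = get_search_results("q", {}, lambda flag, ctx: False)
        self.assertEqual(result, "legacy")

if __name__ == "__main__":
    unittest.main()
```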


Domain logic ... Hmmm 😔

This is a debatable topic with pros and cons on both sides. However, in my opinion, you cannot have permanent feature flags; they can be long-term but not permanent. This scenario happens when we try to derive business logic from feature toggles. Take the example of showing additional discounts to your exclusive members (users who pay extra for a prime membership in your app). You may be tempted to create a feature toggle targeting only exclusive members and drive the discount logic from it. Ideally, this logic should be driven by your user attributes: maintain a list of benefits available only to exclusive users and light up the benefits experience based on the logged-in user's subscription type. In other words, domain logic should be encapsulated in the domain classes (Domain-Driven Design), not derived from an external configurable source.
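To make the contrast concrete, here is a sketch of the discount example with the decision driven by the user's subscription type instead of a flag (the class and attribute names are invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class User:
    user_id: str
    subscription_type: str   # e.g. "exclusive" or "standard"

    def is_exclusive(self) -> bool:
        # The domain rule lives with the domain object, not in an external flag store.
        return self.subscription_type == "exclusive"

def applicable_discount(user: User, base_discount: float) -> float:
    # Exclusive members get an extra 5% -- driven by a user attribute, not a feature toggle.
    return base_discount + 0.05 if user.is_exclusive() else base_discount
```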


It's all in the mind 🧠

Invest all you want in building a feature flighting system, but it won't matter unless your organization is willing to utilize feature toggle techniques. Your organization needs to imbibe a culture of flighting features to Production. You can start by flighting only critical features, but it needs to spread to the point where you don't release any code (no matter how insignificant it might seem) unless the feature is flighted. Multiple applications in the industry have adopted this culture to a great extent. One of the best examples is Bing at Microsoft, which runs around 250 experiments per day.

Once you have a centralized system, start a conversation with leadership and other teams to educate them about the benefits of feature flighting. Initially, there might be some resistance and doubts, but flighting would become a standard norm once people see the benefits.


Behind the scenes ‍🎭

Multiple articles, books, and conversations inspired me to write this article. Some of the most influential ones:

IMHO this is the best book if you want to start experimentation

A simple yet elegant blog that explains some key concepts like dynamic toggles and the different types of feature toggles

A well-known service for running experiments in your application. The glossary section is filled with knowledge

Another good book if you want to understand the mathematical model behind A/B tests.


If you liked this article, please check out the open-source library we have built for a centralized feature flighting management system. The service takes care of the various techniques we discussed here. Watch and star our repository for more updates.

Share some of your exciting feature-flighting stories in the comments.
