Pratik Bhattacharya

Control Feature Rollout; or how I stopped worrying and started testing in Production - Part I

Updated: Aug 4, 2022



A few years back, I bought a shirt with a tag - "I generally don't test my code, but when I do, it's in Production". I was a naïve developer and took this as a joke. As I started understanding the world of software development, I realized testing in Production is not a joke; instead, it's pretty good advice.

In this set of blogs, I will talk about my experience building a centralized Feature Flighting system. Then, I will discuss how we can use Feature Toggles to roll out a feature, some of the good practices we picked up along the way and standard industry norms.


Develop. Deploy. Repeat. 🔁


The modern software industry demands faster releases of products and new features. As engineers, we are encouraged to adopt an agile development model: shorter sprint cycles, frequent deployments, early incorporation of user feedback and, when we fail, failing fast. When releasing a new feature, one major problem is predicting how your user base will react to it. For example, there might be gaps between the requirement and the implementation, you might have done insufficient sampling while gathering requirements, or it might simply be that most users do not like the end product.

Another problem is that features can have unintended side effects in complicated ecosystems, causing disruptions and bugs. Such bugs may not be functional at all; they may instead be performance or reliability issues. For example, you might release code with a potential deadlock in it. The code might perform well in non-production environments but start exhausting your infrastructure when released to a vast user base.

We may have techniques like load testing, automated functional tests, and UAT in place, but bugs and bad experiences can always slip through.

Thus it's hard to predict how new code will function in Production.

Now, the best place to test your code is in Production. QA, analysts, product managers, internal testers - no stakeholder can give you better feedback than the application's actual users. The real challenge is to test successfully in Production while minimizing the risk to those real users.


Controlled Rollout - A better way to Release stuff 🤔

Controlled feature rollout is a technique where a feature is made available to a subset of users and then incrementally released to the entire user base. For example, let's say you have developed a new search algorithm for an e-commerce website that claims to provide more relevant results to users. When developing it, you can either modify the existing code or write the new code alongside it without touching the current implementation. You know that the existing code works, and it has probably been fine-tuned over the years to suit your users, whereas the new code is experimental and its impact is unpredictable.

You can maintain this separation in any way you want; one simple approach is to keep both implementations behind a common interface.

/// Search using traditional methods
public class SearchManager : ISearchManager
{
    public async Task<Result> SearchDatabase(string keyword) { ... }
}

/// Search using new algorithm
public class ModernSearchManager : ISearchManager
{
    public async Task<Result> SearchDatabase(string keyword) { ... }
}

/// User facing API
public class ProductsController : ControllerBase
{
    public async Task<IActionResult> Search(string keyword)
    {
        if (__should_expose_new_feature_to_user) // More on this later
        {
            ISearchManager manager = new ModernSearchManager();
            return Ok(await manager.SearchDatabase(keyword));
        }
        else
        {
            ISearchManager manager = new SearchManager();
            return Ok(await manager.SearchDatabase(keyword));
        }
    }
}

The controller's 'if' condition decides whether the new or the old code is executed. Initially, when the feature is ready, a small subset of users is exposed to the new code in the `ModernSearchManager` class. You then closely observe the system and the behaviour of this subset of users. Once you are satisfied that all metrics look as expected, you allow a bigger subset of users to access the new code. You keep increasing the subset until all users are part of the new experience. At that point the new code becomes the default, and the existing code is labelled legacy code.

Conversely, if you find bugs while the feature is exposed to a smaller subset of users, you remove those users from the condition and ensure they fall back to the old, working code. This is known as rollback. A rollback can occur at any stage of the controlled rollout process. For example, you may not detect bugs when the feature is exposed to a small set of users, but issues may surface as you expose more users to the feature. Rollbacks don't necessarily happen due to bugs; they can also be triggered by user feedback, or because the new code doesn't meet its expected KPIs. For example, you might roll back because you don't see the expected improvement in revenue from the users exposed to the new code.


But how? Have you heard of Feature Toggles 💡

Feature toggles are a technique to turn a feature "on" or "off" without a complete system deployment, and they are used to implement controlled feature rollout. In the code sample above, the `__should_expose_new_feature_to_user` condition is evaluated by the feature toggle system. The feature toggle takes the feature's name, details of the current user, and some additional information, and decides whether the user should be exposed to the new feature. For example, you can create a method that evaluates this condition.

/// User facing API
public class ProductsController : ControllerBase
{
    public async Task<IActionResult> Search(string keyword)
    {
        bool isModernAlgoExposed = await IsFeatureEnabled("ModernSearchAlgorithm", Logged_In_User_Id);

        if (isModernAlgoExposed)
        {
            // For Production code I would recommend using DI and the Factory pattern to get the ISearchManager
            ISearchManager manager = new ModernSearchManager();
            return Ok(await manager.SearchDatabase(keyword));
        }
        else
        {
            ISearchManager manager = new SearchManager();
            return Ok(await manager.SearchDatabase(keyword));
        }
    }

    public async Task<bool> IsFeatureEnabled(string featureName, string userId)
    {
        // Gets a list of users exposed to the given feature
        List<string> powerUsers = await GetPowerUsers(featureName);

        return powerUsers != null && powerUsers.Contains(userId);
    }
}

The `GetPowerUsers` method returns a list of users who can access the new feature. You can keep this information in an external database that maps each preview feature to the list of user IDs enrolled in it.
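
As an illustration, here is a minimal sketch of what such a store could look like. The `IFeatureStore` interface, the in-memory implementation and the sample user IDs are hypothetical; in Production this would typically be a database table or a configuration service.

using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical abstraction over the external store that maps each preview
// feature to the user IDs enrolled in it.
public interface IFeatureStore
{
    Task<IReadOnlyCollection<string>> GetPowerUsersAsync(string featureName);
}

// Illustrative in-memory implementation; a real system would query a database.
public class InMemoryFeatureStore : IFeatureStore
{
    // Feature name -> user IDs currently enrolled in the preview.
    private readonly Dictionary<string, HashSet<string>> _enrollments = new()
    {
        ["ModernSearchAlgorithm"] = new HashSet<string> { "user-001", "user-042" }
    };

    public Task<IReadOnlyCollection<string>> GetPowerUsersAsync(string featureName)
    {
        IReadOnlyCollection<string> users = _enrollments.TryGetValue(featureName, out var set)
            ? set
            : new HashSet<string>();
        return Task.FromResult(users);
    }
}

With something like this in place, `IsFeatureEnabled` simply checks whether the current user ID appears in the returned list, and enrolling a user ID in the store is all it takes to expose the feature to that user.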

The beauty of this technique is that once your code is deployed to Production, you can expose or roll back the new feature without deploying your code again. When you add more users to the database, the `IsFeatureEnabled` method would return true for those users, and they would be exposed to the new feature without any code deployment.


Let's hear more about it - Rings 💍

The new feature is gradually exposed to users. Initially, the feature is exposed to a small set of users (let's call them Power Users). Once satisfied with the results, you add more users to your Power Users and observe their behaviour. Hence you will have layers of user subsets, with each set being a superset of the previous one.

Rings are predefined sets of conditions used to identify the different layers of user subsets for a feature toggle.

There are two popular mechanisms for defining rings - percentage-based and customer-attribute-based.

In percentage-based rings, you define what percentage of users should be part of each ring. A common pattern is 1% -> 10% -> 50% -> Global release: initially the feature is exposed to 1% of the total user base, then 10%, then 50%, and finally to everyone. Users should be chosen at random to be part of the rings; this mechanism is popular in A/B tests as it removes preconceived biases from the system. Percentage-based rings must also support "sticky" evaluation, which means that if the `IsFeatureEnabled` method returns true for a user on one call, it must return true for that user on all subsequent calls. So assignment cannot be completely random every time; once you have decided to expose the new feature to a user, that user must always experience the new feature.
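
To make the stickiness concrete, here is a minimal sketch of one common way to do percentage-based assignment (not necessarily how the flighting system described in this series implements it): hash the feature name and user ID into a stable bucket between 0 and 99, and expose the feature when the bucket falls below the ring's percentage.

using System;
using System.Security.Cryptography;
using System.Text;

public static class PercentageRing
{
    // Sticky, pseudo-random assignment: the same (feature, user) pair always
    // hashes to the same bucket, so a user who sees the feature once keeps it.
    public static bool IsUserInRing(string featureName, string userId, int ringPercentage)
    {
        using var sha = SHA256.Create();
        byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes($"{featureName}:{userId}"));

        // Map the first four bytes of the hash to a bucket in [0, 99].
        int bucket = (int)(BitConverter.ToUInt32(hash, 0) % 100);

        // 1% ring -> bucket 0; 10% ring -> buckets 0-9; 50% -> 0-49; 100% -> everyone.
        return bucket < ringPercentage;
    }
}

Because the bucket depends only on the feature name and the user ID, widening the ring from 1% to 10% keeps every user who was already exposed, which is exactly the sticky behaviour described above.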

In attribute-based rings, users are chosen based on attributes like location, subscription, etc. In an enterprise application, we can choose other attributes such as division, job title, or specialization. Taking location as an example, a sample ring distribution would be India -> Asia -> Asia + USA -> Asia + USA + Europe -> Global rollout: you expose the feature to all users in India, then to users in Asia, and so on.

This mechanism can introduce biases into the system, but it is helpful in enterprise scenarios, especially where user training is a factor. For example, if you need to train your users on the new feature, your rings will depend on how quickly you can prepare a group of users before exposing them to it.
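
As a sketch of how attribute-based rings could be expressed in code (the `UserProfile` type and the ring definitions below are illustrative assumptions following the location example above):

using System;
using System.Collections.Generic;

// Illustrative user attributes; an enterprise system might also carry
// division, job title, specialization, etc.
public record UserProfile(string UserId, string Country, string Continent);

public static class LocationRings
{
    // Each ring is a superset of the previous one:
    // India -> Asia -> Asia + USA -> Asia + USA + Europe -> Global rollout.
    private static readonly List<Func<UserProfile, bool>> Rings = new()
    {
        u => u.Country == "India",
        u => u.Continent == "Asia",
        u => u.Continent == "Asia" || u.Country == "USA",
        u => u.Continent == "Asia" || u.Country == "USA" || u.Continent == "Europe",
        u => true // Global rollout: every user is included.
    };

    public static bool IsUserInRing(UserProfile user, int currentRing) =>
        Rings[currentRing](user);
}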



Ring 0 - In the ring-based controlled rollout mechanism, there is a convention of Ring 0, which consists mainly of the engineering and PM teams. This is the first ring, and its sole purpose is to test the new feature's functionality before it is released to any real user. Ideally, only infrastructure health and P1/P2 scenarios are verified in Ring 0; no KPIs are calculated at this stage.


Was it a success? Success Criteria ✔

Promotion refers to exposing the new feature to the users in the next ring. Rollback is the reverse process, where we revert the users in the current ring to the old code. The decision to promote or roll back a feature is based on whether the success criteria in the current ring are met.

Success criteria are a predefined set of metrics that indicate whether the system has improved. The point of developing new features or improving existing ones is to enhance the user experience or increase adoption; success criteria are the measurable metrics that quantify that improvement.

Going back to the e-commerce example, let's say you are developing a new search algorithm that considers users' purchase history, so you are trying to show more relevant products to the users. The success criterion here would be 'Revenue-per-user': better search results should prompt users to buy more from your site. Your product owners can also decide how much improvement the new algorithm must achieve - for example, that it must increase revenue-per-user by at least 10%.



Once your new search algorithm is exposed to the first ring, you calculate the revenue-per-user for the users in that ring. If the revenue from these users has increased by the required amount, you promote your new algorithm to the next ring and repeat the process. Conversely, you roll back the feature if you do not see the expected improvement, or if the metric actually decreases. After the rollback, you need to look closely at how users behaved while the feature was exposed to find the reasons for the failure; often, we depend on direct user feedback to determine why the success criteria weren't met.
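
As a simple sketch of this promote-or-rollback decision (the method name and numbers are illustrative, not part of any particular flighting system):

public static class SuccessCriteria
{
    // Returns true when the ring should be promoted; otherwise the feature
    // is rolled back and investigated.
    public static bool ShouldPromote(
        decimal baselineRevenuePerUser,
        decimal ringRevenuePerUser,
        decimal targetImprovement = 0.10m) // the 10% uplift from the example
    {
        decimal improvement =
            (ringRevenuePerUser - baselineRevenuePerUser) / baselineRevenuePerUser;

        return improvement >= targetImprovement;
    }
}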

Statistical tools like null hypothesis significance testing, statistical significance, and statistical power are widely used to determine whether the improvement we see in the metrics is real enough to justify promotion.
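
For instance, a two-proportion z-test - one of the standard null-hypothesis significance tests - can tell you whether the difference between the control and treatment groups is statistically significant. The sketch below is purely illustrative:

using System;

public static class AbTestStats
{
    // Two-proportion z-test: is the treatment conversion rate significantly
    // different from the control conversion rate at the 95% confidence level?
    public static bool IsLiftSignificant(
        int controlUsers, int controlConversions,
        int treatmentUsers, int treatmentConversions)
    {
        double p1 = (double)controlConversions / controlUsers;
        double p2 = (double)treatmentConversions / treatmentUsers;

        // Pooled rate under the null hypothesis that both groups behave the same.
        double pooled = (double)(controlConversions + treatmentConversions)
                        / (controlUsers + treatmentUsers);

        double standardError = Math.Sqrt(
            pooled * (1 - pooled) * (1.0 / controlUsers + 1.0 / treatmentUsers));

        double z = (p2 - p1) / standardError;

        // |z| > 1.96 corresponds to rejecting the null hypothesis at the 5% level.
        return Math.Abs(z) > 1.96;
    }
}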


Everything has side effects - Guardrail metrics 📈

Continuing with the search algorithm example, let's say your new algorithm consumes more CPU cycles and takes longer to respond. The small percentage of users seeing the new feature for the first time might tolerate the longer wait, but as more users experience the lag, they may stop coming to the site, causing a drop in user retention. Also, since the new algorithm requires more CPU power, your organization might spend more money procuring additional infrastructure. This is a common scenario: the core success metric improves initially, but other metrics deteriorate and hurt the organization in the long term. These other metrics are known as guardrail metrics.

You can think of guardrail metrics as a non-negotiable set of metrics that must not degrade, regardless of the feature being developed. For example, we can keep API latency, user retention rate, and infrastructure cost as the guardrail metrics for our application.



We generally keep only the core success metric in mind when developing a feature, yet the new feature can have multiple side effects that are extremely difficult to predict until the actual user base is exposed. This is what makes guardrail metrics so important: a violation of a guardrail metric should result in an immediate rollback to limit the spread of the negative user experience.
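
A minimal sketch of such a check, assuming illustrative metric names and limits, could look like this:

using System.Collections.Generic;
using System.Linq;

// Illustrative guardrail definition: a metric, the limit it must not cross,
// its current value, and whether a higher value is the bad direction.
public record Guardrail(string Name, double Limit, double CurrentValue, bool HigherIsWorse);

public static class GuardrailCheck
{
    // Any single violation is enough to trigger an immediate rollback.
    public static bool AnyViolated(IEnumerable<Guardrail> guardrails) =>
        guardrails.Any(g => g.HigherIsWorse
            ? g.CurrentValue > g.Limit   // e.g. API latency, infrastructure cost
            : g.CurrentValue < g.Limit); // e.g. user retention rate
}

A ring is then promoted only when the success criterion is met and no guardrail is violated; a guardrail breach overrides even a healthy core metric.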




Ring Periods: How much time is the right time? ⌚

A common question is how long we should keep each ring active. For example, if you are experimenting on 10% of your users, when should you decide to promote or roll back the ring? After one week? Two weeks? A month?

There are many answers to this question, but the general rule of thumb is to keep a ring active until most of the users in it have had the opportunity to experience the new feature. So instead of a strict period like one or two weeks, the ring period should be based on a metric (like adoption percentage). In our e-commerce example, we can define the criteria as a 10% improvement in revenue-per-user and at least 80% adoption of the new algorithm.
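
A small sketch of such a metrics-based gate, using the 80% adoption threshold from the example (names are illustrative):

public static class RingPeriod
{
    // The ring stays active until enough of its users have actually tried the
    // new feature; only then is the success criterion evaluated.
    public static bool ReadyToEvaluate(
        int usersInRing,
        int usersWhoUsedNewFeature,
        double minimumAdoption = 0.80)
    {
        double adoption = (double)usersWhoUsedNewFeature / usersInRing;
        return adoption >= minimumAdoption;
    }
}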

If you are conducting A/B tests, statistical power is used to determine the required sample size.

The ring period shouldn't be too long either, especially when performing A/B tests. If a feature is exposed to users for too long, KPIs tend to normalize, especially in enterprise applications, because users have no option but to acclimate to the new experience even if they don't like it.


In a nutshell 🥜

