
How I Contributed as a Tester to a Machine Learning System: Opportunities, Challenges and Learnings


Feb 16, 2023 10 min read


Shivani Gaba

reviewed by

Ben Linders

Testing is considered a vital aspect of the software development life cycle (SDLC), so testers are involved to assure an application’s quality. This holds true for conventional software systems such as web applications, mobile applications, web services, etc.

But have you ever wondered about systems based on machine learning? In those cases, testing often takes a backseat. And even when some testing is done, it’s done mostly by the developers themselves. A tester’s role is not clearly defined in such areas.

Testers usually struggle to understand ML-based systems and to explore what contributions they could make on such projects. So in this article, I’ll share my journey of assuring the quality of ML-based systems as a tester. I’ll highlight my challenges, my learnings and my success story.

Like most testers, I’ve been part of testing conventional systems involving web apps, native apps, backends, microservices, etc. In such systems, input is taken and logic is written by humans (mostly developers) to produce a deterministic output. As testers, our focus is to verify the expected output against the specified/implied requirements.

Interestingly, some years ago I got an opportunity to test an ML-based recommendations system. It was quite different from conventional systems, so I was excited and anxious at the same time.

In ML systems, a large amount of data containing patterns is given as input. It is fed to a model, which learns logic to discover these patterns and predict future events.

To ensure quality, it’s important to test the learned logic. So I asked myself: how do I test this learning process and the logic learnt by the model? Is it even possible to test this logic? Is a model completely a black box?

Having a lot of such questions in my mind, I was curious to explore and learn. I was all set for a roller-coaster ride :)

In my excitement and curiosity to contribute fast, I did what most of us would do - Google for testing ideas! Most resources I found pointed to model evaluation metrics like precision, recall, confusion matrix, etc. They felt like jargon to me. Honestly, I didn’t understand them. But I took my half-baked learnings back to my developers, and they told me that they were already taking these metrics into account.

I realized that the developers were well-versed with their domain. To create an impact as a tester, I needed to bring in the perspectives that they missed.

We, as testers, are blessed with the great skill set of asking the right questions, understanding the big picture, thinking out-of-the-box, applying deeper product knowledge, challenging the status quo, etc. If these skills are applied to test ML systems, a lot of the issues could be prevented.

I also realized I was trying to test something without even understanding how the system works and what the building blocks are. But, to test any system better, this should be the first step.

I discussed this approach with my developers in order to collaborate to understand the system deeply and to apply my testing skills at every phase.

The developers, who were initially skeptical of having a tester onboard, got excited after hearing this plan and were looking forward to the value I could provide.

Together with my team, I understood that in machine learning, a huge set of data with certain patterns is collected, filtered and fed to a model. The model identifies these patterns and predicts the probability of future events.

For example, our project was to provide article recommendations for our users. So, a huge amount of data like article interactions, user characteristics and user behavior on the platform was collected and fed to the model. The model then learned patterns in the data and formed rules to predict the future likelihood of users interacting with articles.

The process is split into two phases:

Phase A - Learning process of machine learning systems

The learning phase is where the model learns to identify the patterns from data and comes up with logic to make future predictions.

There are two common types of learning processes: supervised learning and unsupervised learning. Our project used supervised learning, where the desired output samples are available in the training data itself.

The detailed learning process was as follows:

Step 1 - Data collection: The training of ML systems depends heavily on input data. As a rule of thumb: “the more the data, the better the training”. But generally there’s a tendency not to pay attention to the quality of that data.

I would indeed say, “The higher the quality of data, the better the training”.

So, as testers, we can help developers check and fix data quality dimensions such as completeness, validity, accuracy and consistency.

Let’s take a simple example. In our data, a column named ‘article_age’ had negative values in certain cases. This column reflects how many hours ago an article was created, and surely this shouldn’t be negative. We found that these negative values were due to a bug in timezone conversion, so we fixed it to correct the data.

All such checks should be added and integrated in the pipeline to validate data quality.
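As a minimal sketch of such a pipeline check, the `article_age` rule from the example above could be automated like this (the record layout and check list are illustrative assumptions, not the project’s actual pipeline code):

```python
# Minimal data-quality check: flag rows where 'article_age' is missing
# or negative. In our project, negative ages turned out to be caused by
# a timezone-conversion bug.

def validate_records(records):
    """Return a list of human-readable data-quality violations."""
    violations = []
    for i, row in enumerate(records):
        age = row.get("article_age")
        if age is None:
            violations.append(f"row {i}: article_age is missing")
        elif age < 0:
            violations.append(f"row {i}: article_age is negative ({age})")
    return violations

records = [
    {"article_age": 5},
    {"article_age": -3},  # created "in the future" after a bad tz conversion
    {"article_age": 0},
]
print(validate_records(records))  # → ['row 1: article_age is negative (-3)']
```

Running a set of such validators on every data refresh turns a one-off finding into a permanent regression check.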

Step 2 - Building of model features: After the data is collected and refined, the next step is to design model features. A table with all the data is formed to feed to the model. Every column except one is called a model feature; the remaining column is the target value. All columns contain numeric values.

For example, for our article recommender system, there were user-specific features like age, gender, and duration of user subscription; article-specific features like time since publication, total views/impressions of the article, and total clicks on the article; and features about a user's interaction with articles, like how many times a user viewed or clicked the article, and many more.

The target value reflected if the user interacted with the article or not.

To train a model properly, it’s important to select correct features.

As a tester, I helped review all the features, standardize the rules applied to them, and apply product knowledge to find bugs and enhance features.

An important role that I played was also documenting features with their description, rules, etc.
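The feature/target layout described above can be sketched as follows. The column names are assumptions chosen to mirror the examples in the text; the numeric-only rule is exactly the kind of property a tester can automate a check for:

```python
# Illustrative feature table for the article recommender: each row is one
# (user, article) pair. All feature values must be numeric, and 'clicked'
# is the target column.
feature_columns = ["user_age", "subscription_days", "article_age_hours",
                   "article_total_clicks", "user_views_of_article"]
target_column = "clicked"

rows = [
    {"user_age": 34, "subscription_days": 120, "article_age_hours": 6,
     "article_total_clicks": 480, "user_views_of_article": 2, "clicked": 1},
    {"user_age": 51, "subscription_days": 900, "article_age_hours": 72,
     "article_total_clicks": 15, "user_views_of_article": 0, "clicked": 0},
]

# A simple review check a tester can automate: every column must be numeric.
for row in rows:
    for col in feature_columns + [target_column]:
        assert isinstance(row[col], (int, float)), f"{col} is not numeric"
```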

Step 3 - Training and validation of model: After the data is collected and features are decided, it’s time to feed this data into the model and train it.

This is where the model comes up with rules to predict future target values when new data would be provided to it.

Not all the data should be used to train the model. Instead, the data should be split into training data and testing data. The model is trained on the training data, and the logic the model has learned is then verified on the testing data using different model evaluation metrics like precision, recall, rpAUC, etc.
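To make the split and the metrics concrete, here is a hand-rolled sketch of both (real projects would typically use a library such as scikit-learn; this version is self-contained purely for illustration):

```python
# Sketch: split data into train/test sets, then compute precision and
# recall by hand to show what these metrics actually measure.
import random

def train_test_split(data, test_ratio=0.2, seed=42):
    """Shuffle deterministically, then cut off the last test_ratio share."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

def precision_recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0  # how many predicted clicks were real
    recall = tp / (tp + fn) if tp + fn else 0.0     # how many real clicks were found
    return precision, recall

train, test = train_test_split(list(range(10)))
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
print(len(train), len(test))          # → 8 2
print(precision_recall(y_true, y_pred))
```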

Some evaluation could also be done by analyzing the importance of each model feature (e.g., using SHAP values). Testers could help evaluate the feature importance graph and ask the right questions.

For example, in our model which recommends articles, I realized that the feature for the age of the article was considered least important by the model. I was surprised and raised this issue, because if the model learned that age is the least important thing in the data, then it could recommend old and new articles with the same probability. This would be problematic for time-relevant articles, where newly published articles are much more important than old ones. The developers found that an error had occurred during training, which resulted in this bug, so the model was retrained.

PHASE B - Deployment of (new version of) model

Now that the model is trained with the past data, the next step is to provide new unseen data to predict the probability of future events, which basically means deploying the new model to production.

To test the deployment of a new model, it’s important to understand its setup in production.

For example, in our case with article recommenders, when a user requested recommendations, an API request was made to get predictions from the model. On top of the model’s result, some filters and re-rankers were applied. Filters were business rules that remove some of the items predicted by the model. Re-rankers were business rules that reorder some of the items predicted by the model.

In the above setup, whenever we planned to deploy the newly trained model to replace the existing one, certain tests were performed before replacing to confirm that it’s worth replacing.

We performed a comparison analysis between the final results users would see from the old and the new versions. We collected the top X items from the output of both versions and compared them on different metrics: how many items changed or stayed the same, the effect on the average article age, the effect on item diversity, etc.

Carefully analyzing these results helped us decide whether to deploy the new model or not. For example, once we noticed that with the new model, the average age of the top 10 articles shown to users increased by a huge amount, and this was a blocker for deployment because we didn’t want to show older content to users when newer articles were available.
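The comparison described above can be sketched like this; the item structure and the specific metrics are illustrative assumptions, not our production code:

```python
# Sketch: compare the top items from the old and new model versions on
# two of the metrics mentioned above: item overlap and average article age.

def compare_top_items(old_top, new_top):
    old_ids = {item["id"] for item in old_top}
    new_ids = {item["id"] for item in new_top}

    def avg_age(items):
        return sum(i["age_hours"] for i in items) / len(items)

    return {
        "items_unchanged": len(old_ids & new_ids),
        "items_changed": len(new_ids - old_ids),
        # positive delta = the new model surfaces older content on average
        "avg_age_delta_hours": avg_age(new_top) - avg_age(old_top),
    }

old_top = [{"id": 1, "age_hours": 4}, {"id": 2, "age_hours": 10}]
new_top = [{"id": 1, "age_hours": 4}, {"id": 3, "age_hours": 40}]
report = compare_top_items(old_top, new_top)
print(report)  # avg_age_delta_hours of +15.0: the new top list is much older
```

A large positive age delta is exactly the kind of signal that blocked one of our deployments.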

Performing such tests always gave us confidence in the new model’s quality.

In addition, whenever there was a change in the rules of a filter or re-ranker, I tested the relevant change and sometimes even found crucial bugs. For instance, we wanted to add a filter which should keep only two articles by the same author and remove the rest of that author’s articles. But I noticed that the filter was allowing three articles from the same author instead of two. The developers realized this bug had been introduced due to unclear requirements and fixed it. This small fix changed the results for our users to a great extent.
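A correct version of that “at most two articles per author” filter can be sketched as follows (the item structure is an illustrative assumption; an off-by-one in the comparison below is precisely how a third article could slip through):

```python
# Sketch: keep at most `max_per_author` articles per author, preserving
# the order predicted by the model.

def limit_per_author(items, max_per_author=2):
    counts = {}
    kept = []
    for item in items:
        author = item["author"]
        counts[author] = counts.get(author, 0) + 1
        if counts[author] <= max_per_author:
            kept.append(item)
    return kept

articles = [
    {"id": 1, "author": "A"}, {"id": 2, "author": "A"},
    {"id": 3, "author": "A"}, {"id": 4, "author": "B"},
]
print([a["id"] for a in limit_per_author(articles)])  # → [1, 2, 4]
```

A one-line unit test on a list like this would have caught the three-articles bug before it reached users.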

So, it’s definitely crucial to test before deploying whenever any changes are made in the setup.

Having said that, ultimately our users are our final testers. So, instead of just releasing the new model, we always ran an A/B test where the user base was split into two groups. Users in group A received the old version and users in group B received the new version. After letting the test run for a certain duration (a few weeks), we collected data from both groups and analyzed whether the new model/re-ranker/filter actually performed better, based on metrics like click-through rate, scroll distance, engagement with items, etc. Based on the results, we decided if the new version was worth rolling out to everyone. This data was then collected and used to retrain the model as well.
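The core of that A/B analysis can be sketched as a simple click-through-rate comparison (the numbers are illustrative; a real decision would also require a statistical significance test, which is omitted here):

```python
# Sketch: compare click-through rate (CTR) between the control group
# (old model) and the test group (new model) of an A/B test.

def click_through_rate(impressions, clicks):
    return clicks / impressions if impressions else 0.0

group_a = {"impressions": 10_000, "clicks": 420}  # old model
group_b = {"impressions": 10_000, "clicks": 510}  # new model

ctr_a = click_through_rate(**group_a)
ctr_b = click_through_rate(**group_b)
lift = (ctr_b - ctr_a) / ctr_a
print(f"CTR A={ctr_a:.2%}, B={ctr_b:.2%}, lift={lift:.1%}")
```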

I was very skeptical about my role in an ML system project in the beginning.

But reflecting on my journey, I’ve learnt how big an impact testers can make. I’ve learnt what blunders can happen if testing is not done. I’ve learnt how testers can closely collaborate with developers and find problems even in ML systems.

In short, I’ve learnt that it’s high time to change the perception that “testers cannot contribute towards ML systems”.

Thanks to Rahul Verma, Prabhu & Ben Linders for motivating me to write this article and thoroughly reviewing it to help bring it to its current shape.
