When a Virtual Assistant goes public, companies face multiple questions that ultimately come down to quality: how do I measure the quality of my conversational solution?
One way to measure the quality of our Virtual Assistant's training is to apply an assertiveness measurement test.
Although "assertiveness" properly names a social skill, the term is currently used within the community to describe a virtual assistant's ability to give a correct or appropriate response to a question from a user who phrased it in a way the chatbot or virtual assistant was not directly trained on.
There are several valid ways to measure assertiveness, but they can be grouped into three main approaches of increasing complexity and cost.
1. Indirect rate of assertiveness:
When we talk about a fallback, we mean a response for which the assistant was not trained, so it replied with a message like "I didn't understand".
From this, you can build the simplest indicator of assertiveness: take the total number of fallbacks and divide it by the number of interactions the bot received during a period.
This gives a fallback rate, and its complement is the assertiveness, which is why we call it an indirect assertiveness rate. It serves to know roughly how much of the incoming question volume the bot has not been trained for and is answering that it doesn't understand.
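The calculation above can be sketched in a few lines. This is a minimal illustration, assuming interaction logs expose a resolved intent per message and that a hypothetical intent name "fallback" marks the "I didn't understand" reply; real platforms will label this differently.

```python
# Minimal sketch of the indirect assertiveness rate.
# Assumption: each log entry is a dict with an "intent" field, and the
# hypothetical intent name "fallback" marks an "I didn't understand" reply.

def indirect_assertiveness_rate(interactions):
    """Return (fallback_rate, assertiveness) for a period's interactions."""
    total = len(interactions)
    if total == 0:
        return 0.0, 1.0
    fallbacks = sum(1 for i in interactions if i["intent"] == "fallback")
    fallback_rate = fallbacks / total
    return fallback_rate, 1.0 - fallback_rate

# Hypothetical log slice for one period:
logs = [
    {"text": "what's my balance?", "intent": "check_balance"},
    {"text": "asdf qwerty", "intent": "fallback"},
    {"text": "block my card", "intent": "block_card"},
    {"text": "open the pod bay doors", "intent": "fallback"},
]
rate, assertiveness = indirect_assertiveness_rate(logs)
print(f"fallback rate: {rate:.0%}, indirect assertiveness: {assertiveness:.0%}")
# prints "fallback rate: 50%, indirect assertiveness: 50%"
```

Note that this indicator says nothing about whether non-fallback answers were actually correct; that is exactly the gap the stricter methods below address.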
2. Strict assertiveness rate:
At the other extreme, the most complex way to measure assertiveness requires two or more parties to agree on a representative sample of real user inputs against which the system will be measured. Each input is then manually annotated together with its output, i.e., the response the system actually gave, identifying whether the sentence belongs to the bot's knowledge domain and whether the classification or response delivered was adequate.
Once the group of annotators has evaluated the same set of data, the degree of agreement among them is measured, because it is possible that some annotators judged everything relevant and adequate essentially at random.
A simple statistical test resolves this, and the result is an annotated collection of great value for later training improvements. The work is cumbersome and time-consuming, and even requires some training for the annotators. Measuring the Strict Assertiveness Rate this way is recommended only when the indicator is tied to an obligation that requires formal demonstration.
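One common choice for that "simple statistical test" is Cohen's kappa, which corrects raw agreement between two annotators for the agreement expected by chance (an annotator who marks everything "adequate" inflates raw agreement but not kappa). A minimal sketch, with hypothetical annotation data:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability both annotators pick the same label
    # independently, given each one's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b.get(k, 0) for k in freq_a) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical annotations over the same 10 responses:
# 1 = response was adequate, 0 = inadequate.
annotator_1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
annotator_2 = [1, 1, 0, 1, 1, 1, 1, 0, 1, 0]
print(round(cohens_kappa(annotator_1, annotator_2), 2))
# prints "0.52" — moderate agreement despite 80% raw agreement
```

Values near 1 indicate solid agreement; values near 0 suggest the annotations are not much better than chance and the guidelines (or annotator training) need work before the collection is trusted.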
3. Semi-Automated Assertiveness Rate:
An intermediate approach is the Semi-Automated Assertiveness Rate calculation procedure, which saves time and is often the ideal formula in agile contexts where the quality of our Virtual Assistant must be measured and updated while demonstrating its value.
Depending on the type of conversational solution, the calculation starts by identifying all the training data and linking it with the answers to be measured. With this input, a table is generated pairing each real sentence with the response that "should" have been received.
This task is usually abbreviated by recording only the intent that should have classified each sentence. Because manual effort is usually required at this step, the indicator earns the "semi" in its name. In some cases the entire flow can be automated end to end, but there are often conditions that make this difficult.
Then a second, external bot will "send" the sentences to the virtual assistant. The assistant will respond, and each answer will be saved, producing a data collection that contains each real user input, the classification that should have been delivered, and the classification that was actually delivered.
Finally, a matrix is built with the frequency of correct and incorrect classifications, yielding the assertiveness rate indicator par excellence. It identifies, with a good level of detail and relatively quickly, which knowledge domains the bot does not handle and where the training fails most, in a familiar indicator expressed as a percentage.
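The whole semi-automated loop can be sketched end to end. Everything here is illustrative: the assistant is stubbed with a keyword classifier, and the intent names and test sentences are invented; in practice the "external bot" would call your conversational platform's API instead.

```python
from collections import Counter, defaultdict

def assistant_classify(sentence):
    """Stand-in for the real virtual assistant (hypothetical logic)."""
    text = sentence.lower()
    if "balance" in text or "money" in text:
        return "check_balance"
    if "card" in text:
        return "block_card"
    return "fallback"

# Test table: real user sentences paired with the intent that *should*
# classify them — the manually prepared expected output.
test_table = [
    ("what's my balance?", "check_balance"),
    ("how much do I owe on the loan", "check_loan"),
    ("I lost my card, block it", "block_card"),
    ("cancel my credit card", "block_card"),
]

# The external bot "sends" each sentence and records what came back.
records = [(s, expected, assistant_classify(s)) for s, expected in test_table]

# Frequency matrix of expected vs. delivered classifications,
# plus the overall assertiveness rate as a percentage.
matrix = defaultdict(Counter)
for _, expected, delivered in records:
    matrix[expected][delivered] += 1

hits = sum(1 for _, e, d in records if e == d)
rate = hits / len(records)
print(f"assertiveness rate: {rate:.0%}")
for expected, delivered_counts in sorted(matrix.items()):
    print(f"  {expected}: {dict(delivered_counts)}")
```

Reading the per-intent rows of the matrix is what surfaces the weak knowledge domains: an expected intent whose row is dominated by fallbacks or by a sibling intent is a direct pointer to where re-training effort should go.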
The first insight we have seen emerge from these measurement exercises is the need to merge some answers together, to avoid confusing the dialog engine that runs the assistant.
There is an endless number of ways to combine these measurements, and the three levels above are mainly didactic, describing increasing complexity. In practice, more steps are added to the measurement as each virtual assistant's own requirements emerge.
Having a proper measurement of our bot's assertiveness ensures its quality, backed by an indicator that impacts the user experience and the final evaluation of the virtual assistant. The measurement is followed by a re-training process that must be carried out carefully, to avoid diminishing the model's ability to generalize to new cases it was not trained on.