Overview:
The review score in Label Studio Enterprise Edition is calculated based on the agreement between annotations. This agreement can be assessed in several ways, depending on the type of labeling performed and the specific agreement metric chosen.
Description:
Here are some key points about how the review score, often referred to as the agreement score, is calculated:
- Task Agreement: This is the consensus between multiple annotators labeling the same task, shown as a per-task agreement score that indicates how closely the annotations on a particular task match across annotators.
- Agreement Method: Label Studio takes the mean of all pairwise inter-annotation agreement scores as the final task agreement score (see the sketch after this list).
- Agreement Metrics: Label Studio Enterprise provides several agreement metrics, each suited to a different type of labeling task, including exact matching, intersection over union (IoU), precision, recall, and F1 score.
- Exact Matching: Evaluates whether two annotation results match exactly, ignoring any label weights.
- Intersection over Union (IoU): Evaluates the overlap between two regions, such as bounding boxes or polygons, relative to their union.
- Precision, Recall, and F1 Score: Evaluate annotation accuracy in terms of true positives, false positives, and false negatives; the F1 score is the harmonic mean of precision and recall.
- Custom Agreement Metrics: Users can define custom agreement metrics when the built-in ones do not fit their needs.
- Thresholds: Some metrics apply a threshold so that only annotation pairs meeting a minimum level of similarity count toward the score.
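To make the agreement method concrete, here is a minimal sketch of how a final task agreement score can be computed as the mean of all pairwise scores, with an optional similarity threshold. The function names, the list-of-dicts annotation format, and the threshold handling are illustrative assumptions for this example, not the actual Label Studio Enterprise internals.

```python
from itertools import combinations
from statistics import mean

def exact_match(ann_a, ann_b) -> float:
    """Illustrative pairwise metric: 1.0 if two annotation results
    are identical, 0.0 otherwise (label weights ignored)."""
    return 1.0 if ann_a == ann_b else 0.0

def task_agreement(annotations, metric=exact_match, threshold=None) -> float:
    """Mean of all pairwise inter-annotation agreement scores for one task.

    If a threshold is given, pairs scoring below it are treated as 0.0
    (an assumption standing in for threshold-based matching).
    """
    pairs = list(combinations(annotations, 2))
    if not pairs:
        return 0.0  # fewer than two annotations: nothing to compare
    scores = []
    for a, b in pairs:
        score = metric(a, b)
        if threshold is not None and score < threshold:
            score = 0.0
        scores.append(score)
    return mean(scores)

# Example: three annotators labeled the same task
annotations = [{"sentiment": "positive"},
               {"sentiment": "positive"},
               {"sentiment": "negative"}]
print(task_agreement(annotations))  # 1 of 3 pairs agrees -> 0.333...
```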
Example:
The specific formula depends on the chosen metric. For exact matching, the score is 1 if the annotations are identical and 0 otherwise. For IoU, the score is the area of intersection divided by the area of union of the two regions. Precision is the number of true positives divided by the number of true positives plus false positives, recall is the number of true positives divided by the number of true positives plus false negatives, and the F1 score is the harmonic mean of the two. The sketch below illustrates these formulas.
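The following sketch implements exact matching, IoU for axis-aligned bounding boxes, and precision/recall/F1 from raw counts. The (x, y, width, height) box format and function names are assumptions made for the example, not Label Studio's internal representation.

```python
def exact_match_score(result_a, result_b) -> float:
    # Exact matching: 1 if the annotation results are identical, else 0.
    return 1.0 if result_a == result_b else 0.0

def iou(box_a, box_b) -> float:
    # IoU for axis-aligned boxes given as (x, y, width, height):
    # area of intersection divided by area of union.
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = ix * iy
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0

def precision_recall_f1(tp: int, fp: int, fn: int):
    # Precision: TP / (TP + FP); Recall: TP / (TP + FN);
    # F1: harmonic mean of precision and recall.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

print(iou((0, 0, 10, 10), (5, 5, 10, 10)))    # 25 / 175 ≈ 0.143
print(precision_recall_f1(tp=8, fp=2, fn=4))  # (0.8, 0.667, 0.727)
```

In practice, a pairwise metric like these would feed into the mean-of-pairs computation shown earlier to produce the per-task agreement score.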
Summary:
In summary, the review score is a measure of how similar annotations are to each other, and it is calculated using a variety of metrics that can be chosen based on the specific requirements of the labeling task.