Multi-institutional Validation of Improved Vesicoureteral Reflux Assessment With Simple and Machine Learning Approaches.
Objective: Vesicoureteral reflux grading from voiding cystourethrograms is highly subjective with low reliability. We aimed to demonstrate improved reliability for vesicoureteral reflux grading with simple and machine learning approaches using ureteral tortuosity and dilatation on voiding cystourethrograms.
Methods: Voiding cystourethrograms were collected from our institution for training and from 5 external data sets for validation. Each voiding cystourethrogram was graded by 5-7 raters to determine a consensus vesicoureteral reflux grade label, and inter- and intra-rater reliability were assessed. Each voiding cystourethrogram was assessed for 4 features: ureteral tortuosity and proximal, distal, and maximum ureteral dilatation. The consensus labels were then assigned to the combination of the 4 features. A machine learning-based model, qVUR, was trained to predict vesicoureteral reflux grade from these features, and model performance was assessed by the area under the receiver operating characteristic curve (AUROC).
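The consensus labeling and AUROC assessment described in the Methods can be illustrated with a minimal Python sketch. It is not the study's implementation: the abstract does not state how the consensus grade was derived or how AUROC was aggregated across grades, so the majority-vote rule, the lower-median tie-break, and the binary (two-class) AUROC below are assumptions, and the function names `consensus_grade` and `auroc` are hypothetical.

```python
from collections import Counter
from statistics import median_low


def consensus_grade(grades):
    """Return the most frequent grade among raters.

    Assumption: majority vote; ties are broken by taking the lower
    median of the tied grades (a hypothetical rule, not from the study).
    """
    counts = Counter(grades)
    top = max(counts.values())
    tied = sorted(g for g, c in counts.items() if c == top)
    if len(tied) == 1:
        return tied[0]
    return median_low(tied)


def auroc(scores, labels):
    """Binary AUROC: probability that a positive case outscores a
    negative case, counting ties as one half.

    Assumption: labels are 0/1; a multi-grade task would need some
    aggregation (for example one-vs-rest), which is not shown here.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For example, five raters giving grades 3, 3, 4, 2, 3 would yield a consensus grade of 3 under this rule. The pairwise AUROC is quadratic in the number of cases, which is acceptable for a sketch but would normally be replaced by a rank-based computation on larger data sets.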
Results: A total of 1,492 kidneys and ureters were collected from voiding cystourethrograms, yielding 8,230 independent gradings. Internal inter-rater reliability for vesicoureteral reflux grading was 0.44, with a median percent agreement of 0.71, and intra-rater reliability was low. Higher values for each feature were associated with higher vesicoureteral reflux grade. qVUR achieved an accuracy of 0.62 (AUROC = 0.84), with stable performance across all external data sets. The model improved vesicoureteral reflux grade reliability 3.6-fold compared with traditional grading (P < .001).
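One way to read a "median percent agreement" figure is sketched below in Python, as an illustration only. The abstract does not define how agreement was computed, so this sketch assumes pairwise exact agreement among the raters of each kidney-ureter unit, followed by the median across units. The function names are hypothetical and the example grades are made up for illustration, not taken from the study.

```python
from itertools import combinations
from statistics import median


def pairwise_agreement(grades):
    """Fraction of rater pairs that assigned the same grade to one unit."""
    pairs = list(combinations(grades, 2))
    if not pairs:
        raise ValueError("at least two ratings are required")
    return sum(a == b for a, b in pairs) / len(pairs)


def median_agreement(ratings_per_unit):
    """Median of the per-unit pairwise agreement across all units."""
    return median(pairwise_agreement(g) for g in ratings_per_unit)


# Illustrative grades only (three units, five raters each).
example = [
    [3, 3, 3, 4, 4],
    [2, 2, 2, 2, 2],
    [1, 1, 2, 2, 3],
]
print(median_agreement(example))
```

Percent agreement does not correct for agreement expected by chance, which is one reason it can sit well above a chance-corrected reliability coefficient computed on the same gradings.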
Conclusions: In a large pediatric population from multiple institutions, we show that machine learning-based assessment of vesicoureteral reflux improves reliability compared with current grading methods. qVUR is generalizable and robust, with accuracy similar to that of clinicians, but the added prognostic value of quantitative measures warrants further study.