Introduction: Conducting clinical research on treatments for emerging infectious diseases is often complicated by methodological challenges, such as identifying appropriate outcome measures of treatment response and the lack of validated instruments for measuring patient outcomes. In bubonic plague, some studies have assessed bubo size as an indicator of treatment success, a measure widely assumed to be indicative of recovery. Evaluating this outcome is challenging, however, because there is no validated method for measuring bubo size. The aim of this study is to assess the accuracy and the inter- and intra-rater agreement of artificial bubo measurements made with a digital calliper, in order to determine whether a calliper is an appropriate instrument for assessing this outcome.
Methods: Study technicians measured 14 artificial buboes, made from silicone overlaid with artificial silicone skin sheets, over the course of two training sessions. Each study technician measured each artificial bubo once per training session, following a Standard Operating Procedure. The objectives of this study are to (i) evaluate the accuracy of individual digital calliper measurements against the true size of the artificial bubo, (ii) understand whether the characteristics of the artificial bubo influence measurement accuracy and (iii) evaluate inter- and intra-rater measurement agreement.
Results: In total, 14 artificial buboes ranging from 52.7 to 121.6 mm in size were measured by 57 raters, generating 698 measurements across two training sessions. Raters generally over-estimated the size of the artificial bubo. The median percentage difference between the measured and actual bubo size was 13%. Measurement accuracy and intra-rater agreement decreased as the size of the bubo decreased. Three quarters of all measurements differed by no more than 25% from another measurement of the same artificial bubo. Inter-rater agreement did not vary with the density, size or presence of oedema of the artificial bubo.
Conclusions: The results of this study demonstrate how challenging it is for both individual and multiple raters to repeatedly generate consistent and accurate measurements of the same artificial buboes with a digital calliper.
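The accuracy summary reported in the Results (a median percentage difference between measured and actual bubo size) can be illustrated with a minimal sketch. It assumes a signed difference expressed as a percentage of the actual size; the abstract does not state the exact formula. The function name `percentage_difference` and all values below are hypothetical illustrations, not study data.

```python
from statistics import median


def percentage_difference(measured_mm: float, actual_mm: float) -> float:
    """Signed difference between a measurement and the true size,
    expressed as a percentage of the true size (assumed definition).
    Positive values indicate over-estimation."""
    if actual_mm <= 0:
        raise ValueError("actual_mm must be positive")
    return 100.0 * (measured_mm - actual_mm) / actual_mm


# Hypothetical example values, NOT taken from the study.
actual_mm = 100.0
measurements_mm = [110.0, 120.0, 90.0]

# One percentage difference per measurement of the same artificial bubo.
differences = [percentage_difference(m, actual_mm) for m in measurements_mm]

# Median across measurements, the summary statistic named in the Results.
median_difference = median(differences)
```

Under this assumed definition, a positive median corresponds to the over-estimation described in the Results; an unsigned (absolute) difference would be an equally plausible reading and would give a different summary.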