Study design
Two ROs (JV and SD), well trained in delineation of OARs for HNC RT and familiar with the delineation guidelines [28,29], each delineated the 16 OARs on all 15 CT scans twice using Eclipse (Varian Medical Systems, Palo Alto, CA), in 2 separate, uninterrupted sessions for each patient: once manually ("manual delineations") and once by modifying and correcting the automated delineations generated by the CNN ("corrected delineations") (Fig. 1). The 2 delineation sessions by the same RO for the same patient were separated by an average interval of 15.5 days. Manual delineation was performed in the first session for about half of the cases and in the second session for the other half, and each RO was blinded to all other delineation results to avoid observer bias.
Patient  Site                           Age  Gender  T   N    p16
1        oropharynx                     67   female  4a  3b   NS
2        parotid left (postoperative)   50   male    4a  2b
3        supraglottic                   68   male    2   3b
7        oropharynx                     55   male    2   1    -
8        oropharynx                     66   female  3   2    +
9        oral cavity (postoperative)    77   male    2   N2c
10       oropharynx (postoperative)     56   male    2   1    +
11       hypopharynx                    78   male    2   1
Abbreviations: TNM: tumour staging according to the TNM-8 staging system (2017); T: clinical tumour stage; N: clinical nodal stage; p16: p16 protein expression, correlated
with human papilloma virus status; NS: not specified.
Benefits of deep learning for delineation of organs at risk in head and neck cancer
Fig. 1. Overview of study design. Automated delineations (A) of 16 OARs in conventional planning CT images of 15 HNC patients were corrected by 2 different ROs (RO1: C1,
RO2: C2) and were also manually delineated by the same ROs (RO1: M1, RO2: M2) in different delineation sessions. Accuracy of automated delineation was assessed by
comparing automated and corrected contours for each RO (Acc1: C1 vs A, Acc2: C2 vs A). Intra-observer variability was assessed by comparing corrected and manual
delineations by the same RO (IOV1: C1 vs M1, IOV2: C2 vs M2). Inter-observer variability was assessed by comparing corrected and manual delineations by different ROs
All delineations were verified and approved without modification by a third expert in HNC RT (SN) to ensure their clinical validity.
The benefits of using a CNN-based automated delineation tool in clinical practice were assessed in terms of its accuracy, its impact on IOV and its time efficiency.
The accuracy of the automated delineation tool was assessed for each 3D OAR separately by comparing it to the corrected delineations using the Dice Similarity Coefficient (DSC) and the Average Symmetric Surface Distance (ASSD). DSC is a measure of the overlap between two delineations A and B, yielding a value of 1 in case of perfect overlap and a value of 0 if there is no overlap:

$$\mathrm{DSC} = \frac{2\,|A \cap B|}{|A| + |B|}$$

with $|A|$ and $|B|$ the volumes of each delineation and $|A \cap B|$ the volume of their intersection. ASSD represents the mean distance between two delineations A and B in mm:

$$\mathrm{ASSD}(A, B) = \frac{h(A, B) + h(B, A)}{2}$$

$$h(A, B) = \underset{a \in A}{\mathrm{mean}} \Big\{ \min_{b \in B} \{ d(a, b) \} \Big\}$$

with $d(a, b)$ the 3D distance between point a on delineation A and point b on delineation B. Both DSC and ASSD provide an indication of the amount of correction necessary for clinical approval.
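As an illustration of the two metrics defined above, the following is a minimal sketch for binary voxel masks, not the implementation used in the study. It assumes that each delineation has been rasterised to a boolean 3D array on a common grid, that the "surface" of a delineation is the set of mask voxels touching background, and that the voxel spacing in mm is known. The function names (`dice`, `surface`, `directed_distance`, `assd`) are hypothetical.

```python
import numpy as np
from scipy import ndimage


def dice(a, b):
    """DSC = 2|A ∩ B| / (|A| + |B|) for two boolean masks of equal shape."""
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    total = a.sum() + b.sum()
    if total == 0:
        raise ValueError("both masks are empty; DSC is undefined")
    return 2.0 * np.logical_and(a, b).sum() / total


def surface(mask):
    """Mask voxels that touch background (assumed surface definition)."""
    mask = np.asarray(mask, bool)
    return mask & ~ndimage.binary_erosion(mask)


def directed_distance(a, b, spacing):
    """h(A, B): mean over surface voxels of A of the distance to the surface of B."""
    # The Euclidean distance transform returns, for every voxel, the distance
    # to the nearest zero voxel, so the surface of B is set to zero here.
    dist_to_b = ndimage.distance_transform_edt(~surface(b), sampling=spacing)
    return dist_to_b[surface(a)].mean()


def assd(a, b, spacing=(1.0, 1.0, 1.0)):
    """ASSD(A, B) = (h(A, B) + h(B, A)) / 2, in the units of `spacing`."""
    return 0.5 * (directed_distance(a, b, spacing) + directed_distance(b, a, spacing))
```

For example, two 4 × 4 × 4 voxel cubes shifted by one voxel along one axis share 48 of their 64 voxels each, so `dice` returns 2 · 48 / 128 = 0.75, while `assd` of a mask with itself is 0.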
The impact of the use of the automated delineation tool on IOV between different observers was assessed for each 3D OAR separately by computing DSC and ASSD between the manual delineations and between the corrected delineations made by both ROs (with larger DSC and smaller ASSD indicating less IOV). In addition, IOV for the same observer was assessed by comparing manual delineations and corrected delineations made by the same RO.
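The comparisons described above can be organised as a small table of contour pairs. The sketch below is illustrative only: it uses the contour labels from Fig. 1 (A, C1, C2, M1, M2) and the comparison names Acc1, Acc2, IOV1 and IOV2, while the keys `inter_manual` and `inter_corrected` and the function `compare_all` are hypothetical names introduced here. Any metric that takes two masks can be passed in; a compact DSC is included so the example is self-contained.

```python
import numpy as np


def dice(a, b):
    """DSC for two boolean masks of equal shape."""
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())


# Pairs of contours compared for one OAR of one patient.
COMPARISONS = {
    "Acc1": ("C1", "A"),              # accuracy: corrected vs automated, RO1
    "Acc2": ("C2", "A"),              # accuracy: corrected vs automated, RO2
    "IOV1": ("C1", "M1"),             # same observer: corrected vs manual, RO1
    "IOV2": ("C2", "M2"),             # same observer: corrected vs manual, RO2
    "inter_manual": ("M1", "M2"),     # different observers, manual (hypothetical key)
    "inter_corrected": ("C1", "C2"),  # different observers, corrected (hypothetical key)
}


def compare_all(masks, metric=dice):
    """Apply `metric` to every contour pair; `masks` maps label -> boolean array."""
    return {name: metric(masks[x], masks[y]) for name, (x, y) in COMPARISONS.items()}
```

Under this layout, a reduction in variability between observers would show up as a larger DSC (or smaller ASSD) for `inter_corrected` than for `inter_manual`.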
The efficiency of the automated delineation tool was quantified by comparing the time needed for manual delineation to the time needed for correction of the automated delineations. Both ROs recorded the total delineation time per patient for each of the 2 delineation sessions. This included the time for adjusting window settings, navigating between slices and creating or correcting all delineations for all 16 OARs.
Statistically significant differences in DSC and ASSD were assessed with a two-sided, paired Wilcoxon signed rank test, using a significance level α = 0.05 and a power of 90%. To assess the reduction in delineation time, a one-sided Wilcoxon signed rank test was used, with a significance level α = 0.05 and a power of 90%.
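Both tests can be sketched with `scipy.stats.wilcoxon`, which takes paired samples and an `alternative` argument. This is a minimal sketch rather than the analysis code of the study: the function names are hypothetical, the delineation times below are made-up numbers used only to show the call, and the sketch does not address the power calculation.

```python
import numpy as np
from scipy.stats import wilcoxon

ALPHA = 0.05  # significance level stated in the text


def paired_difference_p(x, y):
    """Two-sided paired Wilcoxon signed rank test, e.g. for DSC or ASSD."""
    return wilcoxon(x, y, alternative="two-sided").pvalue


def reduction_p(after, before):
    """One-sided test: are the paired values in `after` smaller than in `before`?"""
    return wilcoxon(after, before, alternative="less").pvalue


# Made-up delineation times in minutes for 15 patients (NOT study data).
manual_min = np.linspace(30.0, 58.0, 15)
corrected_min = manual_min - np.linspace(5.0, 19.0, 15)

p_time = reduction_p(corrected_min, manual_min)
time_reduced = p_time < ALPHA
```

With `alternative="less"`, the test asks whether the paired differences `after - before` are shifted below zero, which matches the one-sided question of whether correcting automated delineations takes less time than manual delineation.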
To investigate accuracy, mean DSC and ASSD for automated versus corrected delineations for each OAR were calculated and are summarised in Table 2. Based on DSC, the network performed best for brainstem, left cochlea, mandible, parotid glands, submandibular glands and spinal cord (DSC >90%). The average corrections necessary for clinical acceptance were below 1 mm for cochleae, glottic larynx, mandible, right parotid gland, middle pharyngeal constrictor muscle (PCM) and right submandibular gland (ASSD <1 mm). Average corrections for all other structures were below 2 mm. Examples of manual and automated delineations are shown in Fig. 2. This figure also shows that scatter due to dental fillings did not impact the accuracy.