    We assessed the included studies using the QUADAS-2 Revised Tool for the Quality Assessment of Diagnostic Accuracy Studies [16]. The assessment was done in duplicate by two reviewers (BDN and TRF); the final assessment was either their consensus judgment or the judgment agreed after additional discussion. Because, for the purposes of this review, we do not regard the QUADAS-2 domains or individual items as equally important, we present the assessment separately for each included study, using Review Manager 5.3, rather than creating a summary score. We had intended to carry out a meta-analysis of the diagnostic accuracy results to estimate pooled sensitivity and specificity, and to present the results in receiver operating characteristic space. After conducting the search it became clear that this would be inappropriate because of the high level of clinical heterogeneity among the included studies. We therefore present the results from each study, including the estimated sensitivity and specificity with 95% confidence intervals [17], stratified by primary tumour type, together with a narrative summary, but do not pool the diagnostic accuracy measures.
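    As an illustration of the per-study calculation described above, the following minimal Python sketch derives sensitivity and specificity, each with a 95% confidence interval, from a single 2 × 2 table of biomarker elevation versus recurrence. The interval method cited as [17] is not named in the text, so the Wilson score interval used here is an assumption, and the function names `wilson_interval` and `diagnostic_accuracy` and the counts in the example are hypothetical rather than taken from any included study.

```python
from math import sqrt


def wilson_interval(successes, total, z=1.959964):
    """Wilson score interval for a binomial proportion.

    Assumption: the source does not state which interval method it used;
    Wilson is chosen here only as a common option for small samples.
    z = 1.959964 corresponds to a two-sided 95% interval.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


def diagnostic_accuracy(tp, fp, fn, tn):
    """Sensitivity and specificity with 95% CIs from a 2 x 2 table.

    tp: marker elevated, recurrence confirmed
    fp: marker elevated, no recurrence
    fn: marker not elevated, recurrence confirmed
    tn: marker not elevated, no recurrence
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": (sensitivity, *wilson_interval(tp, tp + fn)),
        "specificity": (specificity, *wilson_interval(tn, tn + fp)),
    }


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    for name, (est, lo, hi) in diagnostic_accuracy(8, 5, 2, 85).items():
        print(f"{name}: {est:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

    Both estimates require all four cells of the table, which is why a study reporting marker levels only for patients who had a recurrence cannot contribute a specificity estimate.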
    Results
    After removing duplicates and reports that could be excluded by automated title screening, the search identified 2406 studies for assessment (Fig. 1). Of these, 113 required scrutiny of the full text, and nine met the inclusion criteria [18–26]. Of the 104 studies excluded at the final stage of screening, by far the most common reason for exclusion was that the study did not report sufficient information in its results for the full 2 × 2 table of biomarker elevation versus cancer recurrence to be extracted or reconstructed. Often, the reason for this was that biomarker levels were reported only for individuals who had disease recurrence, and not for those who were under surveillance but who had not had a recurrence at the end of the study. Nine studies were excluded at the final stage because they reported baseline or immediately post-surgery marker levels, rather than levels during surveillance, and eight were excluded because they did not consider cancer recurrence. Other reasons for exclusion are shown in Fig. 1.
    Among the included studies, the clinical characteristics of patients varied widely. Studies considered patients with a primary diagnosis of seminoma or NSGCT, or used a mixed group (Table 1). Some studies reported a single stage of primary disease; others reported a mixed group. Some studies indicated using an independent reference standard for recurrence, but in several the reference standard was unclear. Most studies reported recurrence on a ‘per patient’ basis, but some reported ‘per sample’ – allowing for the possibility of multiple recurrences per patient and treating each negative biomarker result as a ‘true negative’. Three studies reported results for LDH; the remainder reported AFP or HCG. Cut-points for defining test positivity varied and in some studies were not reported. Methodological quality is summarised in accordance with QUADAS-2 guidelines in Fig. 2.
There were few applicability concerns, with studies generally using an appropriate patient group, index test and reference standard to match the research question. Risk of bias was high or unclear in at least one domain for each of the included studies. Two major concerns relate to the consistency and timing with which the reference standard (typically, an appropriate imaging modality) was implemented to confirm or refute an elevation in biomarker levels as an indicator of tumour recurrence. Although an inclusion criterion for this review is that the reference standard be performed on both tumour marker positive and tumour marker negative subjects, some studies suggest either possible incorporation bias, if tumour markers were used directly as part of the diagnostic criteria, or that the timing of the reference standard may have differed between marker-positive and marker-negative patients. No studies reported whether the reference standard was interpreted without knowledge of biomarker results. Two studies reported patient drop-out, either for unwillingness to comply with the follow-up protocol [25] or for reasons that were not stated [18]. Two studies used elevation of at least one of the biomarkers as an exclusion criterion without independent verification [25,26].