Each task is scored on two equally weighted components. The final leaderboard score is a 50/50 combination of predictive performance and automated reasoning quality.
Component 1 -- Predictive performance (50%)¶
- Tasks 1 and 2: Area under the ROC curve (AUROC)
- Task 3: Concordance index (C-index) for time-to-event BCR prediction
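The two ranking metrics above can be sketched in plain Python. This is a minimal illustration, not the challenge's scoring code: AUROC is computed via its pairwise interpretation (probability that a random positive outranks a random negative), and the C-index is Harrell's concordance over comparable pairs in right-censored data.

```python
def auroc(y_true, y_score):
    """AUROC via its pairwise form: the probability that a randomly chosen
    positive case receives a higher score than a randomly chosen negative
    case (ties count as 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def c_index(times, events, risk_scores):
    """Harrell's concordance index for right-censored time-to-event data
    (e.g. time to BCR). A pair (i, j) is comparable when the earlier time
    belongs to an observed event (events[i] == 1); it is concordant when
    the case with the shorter time has the higher predicted risk."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i] == 1:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable
```

In practice one would use `sklearn.metrics.roc_auc_score` and `lifelines.utils.concordance_index`, which handle edge cases (all-one-class labels, no comparable pairs) that this sketch omits.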
Performance estimates are reported with non-parametric percentile bootstrap confidence intervals. Ranking stability is assessed using bootstrap-based re-ranking across test set resamples.
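A non-parametric percentile bootstrap CI of the kind described above can be sketched as follows. The statistic, replicate count, and confidence level here are illustrative placeholders, not the challenge's actual configuration:

```python
import random


def bootstrap_ci(cases, stat, n_boot=2000, alpha=0.05, seed=0):
    """Non-parametric percentile bootstrap confidence interval.

    `cases` is a list of per-case records (e.g. (label, score) pairs) and
    `stat` maps a resampled list of cases to a scalar metric. Cases are
    resampled with replacement n_boot times; the (alpha/2, 1 - alpha/2)
    percentiles of the replicate metrics form the interval.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    n = len(cases)
    reps = sorted(
        stat([cases[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lo = reps[int((alpha / 2) * n_boot)]
    hi = reps[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Bootstrap-based re-ranking works the same way: each resample yields a full leaderboard, and the variability of each entry's rank across resamples indicates ranking stability.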
Component 2 -- Reasoning quality (50%)¶
Generated explanations are evaluated using a claim-level faithfulness metric computed by an LLM on the platform:
- The explanation is decomposed into independently verifiable atomic claims.
- Each claim is checked against the structured input fields for that specific case.
- A faithfulness score is computed as:

  faithfulness = supported claims / total detected claims

- Unsupported, contradictory, or hallucinated findings incur an explicit penalty.
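The score-and-penalty step can be sketched as below. The claim verdict labels and the penalty weight are assumptions for illustration; the LLM-based claim extraction and verification itself is not shown.

```python
def faithfulness_score(verdicts, penalty=0.25):
    """Claim-level faithfulness with an explicit penalty term.

    `verdicts` is a list of per-claim labels: "supported", "unsupported",
    "contradicted", or "hallucinated". The base score is the supported
    fraction; every non-supported claim additionally subtracts
    `penalty` / total (the penalty weight is a placeholder assumption).
    """
    total = len(verdicts)
    supported = sum(v == "supported" for v in verdicts)
    flagged = total - supported  # unsupported, contradicted, hallucinated
    score = supported / total - penalty * flagged / total
    return max(0.0, score)  # clamp so heavy penalties cannot go negative
```

For example, an explanation decomposed into four claims of which three are supported scores 3/4 minus one penalized claim's deduction.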
Fallback: If automated faithfulness evaluation proves unstable during pilot validation, a composite similarity metric is used instead: cosine similarity between PubMedBERT embeddings, SciSpacy entity Jaccard similarity, BLEU-4, and ROUGE-L, combined as

S = w1 * EmbeddingCosine + w2 * EntityJaccard + w3 * (BLEU-4 + ROUGE-L)
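The combination step can be sketched as below, assuming the four component scores have already been computed (PubMedBERT cosine, SciSpacy entity extraction, BLEU-4, ROUGE-L). The weight values are placeholders; the text does not specify them.

```python
def entity_jaccard(ents_a, ents_b):
    """Jaccard overlap between two entity sets (e.g. SciSpacy extractions
    from the generated and reference explanations)."""
    a, b = set(ents_a), set(ents_b)
    return len(a & b) / len(a | b) if a | b else 1.0


def composite_similarity(cosine, jaccard, bleu4, rougel,
                         w1=0.5, w2=0.3, w3=0.1):
    """S = w1*EmbeddingCosine + w2*EntityJaccard + w3*(BLEU-4 + ROUGE-L).
    The weights w1..w3 are illustrative assumptions, not published values."""
    return w1 * cosine + w2 * jaccard + w3 * (bleu4 + rougel)
```

The embedding cosine would come from a model such as PubMedBERT via `sentence-transformers`, and BLEU-4/ROUGE-L from standard NLG-evaluation packages; only the entity overlap and the weighted sum are simple enough to show inline here.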
Expert review of top submissions¶
For the top-performing submissions on each leaderboard, domain expert clinicians conduct a qualitative reasoning review scored on:
- Clinical plausibility
- Coherence and appropriate use of evidence
- Absence of spurious or unsupported claims
Expert scores are reported separately and do not affect leaderboard rankings, but do factor into final winner selection alongside the quantitative scores.
| Task | Expert reviewers | Submissions reviewed |
|---|---|---|
| Task 1 | Radiologists | Top 3 |
| Task 2 | Radiologists and pathologists | Top 3 |
| Task 3 | Radiologists, pathologists, and oncologists | Top 5 |