Each task is scored on two equally weighted components: predictive performance and automated reasoning quality each contribute 50% of the final leaderboard score.

Fold 1 -- Predictive performance (50%)

  • Tasks 1 and 2: Area under the ROC curve (AUROC)
  • Task 3: Concordance index (C-index) for time-to-event biochemical recurrence (BCR) prediction
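
To make the two metrics concrete, here is a minimal pure-Python sketch; the platform's actual implementation is not specified, and library versions (e.g. scikit-learn's `roc_auc_score`, lifelines' `concordance_index`) would normally be used instead. All variable names are illustrative.

```python
def auroc(y_true, y_score):
    """AUROC as the probability that a randomly chosen positive case
    outscores a randomly chosen negative case, with ties counting half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def c_index(times, events, risk):
    """Harrell's C-index: fraction of comparable pairs (the subject with
    the shorter observed time had an event) correctly ordered by risk."""
    concordant = comparable = 0.0
    for i in range(len(times)):
        for j in range(len(times)):
            if events[i] and times[i] < times[j]:
                comparable += 1
                concordant += (risk[i] > risk[j]) + 0.5 * (risk[i] == risk[j])
    return concordant / comparable
```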

Performance estimates are reported with non-parametric percentile bootstrap confidence intervals. Ranking stability is assessed using bootstrap-based re-ranking across test set resamples.
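
A non-parametric percentile bootstrap can be sketched as follows; the resample count, alpha, and the handling of degenerate resamples are illustrative choices, not the platform's documented settings.

```python
import random

def bootstrap_ci(y_true, y_score, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for an arbitrary metric(y_true, y_score)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        # resample cases with replacement
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        ys = [y_score[i] for i in idx]
        if len(set(yt)) < 2:  # skip single-class resamples (AUROC undefined)
            continue
        stats.append(metric(yt, ys))
    stats.sort()
    lo = stats[int(alpha / 2 * len(stats))]
    hi = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lo, hi
```

The same resampling loop supports the re-ranking check: recomputing each submission's rank on every resample shows how often the leaderboard order would change.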

Fold 2 -- Reasoning quality (50%)

Generated explanations are evaluated using a claim-level faithfulness metric computed by an LLM on the platform:

  1. The explanation is decomposed into independently verifiable atomic claims.
  2. Each claim is checked against the structured input fields for that specific case.
  3. A faithfulness score is computed as: supported claims / total detected claims
  4. Unsupported, contradictory, or hallucinated findings incur an explicit penalty.
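
The scoring arithmetic in steps 3 and 4 can be sketched as below; the verdict labels and the penalty weight are assumptions for illustration, and the claim decomposition and checking (steps 1 and 2) are performed upstream by the LLM.

```python
def faithfulness_score(verdicts, penalty=0.5):
    """verdicts: one label per atomic claim from the checking step, e.g.
    'supported', 'unsupported', or 'contradicted' (labels and penalty
    weight are illustrative). Base score is supported/total; each
    non-supported claim subtracts an extra penalty/total, floored at zero."""
    total = len(verdicts)
    if total == 0:
        return 0.0
    supported = verdicts.count("supported")
    return max(0.0, (supported - penalty * (total - supported)) / total)
```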

Fallback: If automated faithfulness evaluation proves unstable during pilot validation, a composite similarity metric is used instead, combining cosine similarity between PubMedBERT embeddings, SciSpacy entity Jaccard similarity, BLEU-4, and ROUGE-L as S = w1*EmbeddingCosine + w2*EntityJaccard + w3*(BLEU-4 + ROUGE-L).
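
A sketch of the fallback combination, assuming the component scores have already been computed upstream (embeddings from PubMedBERT, entity sets from SciSpacy, BLEU-4/ROUGE-L from a standard NLG metrics package); the weights shown are placeholders.

```python
def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
    return dot / norm

def entity_jaccard(ents_a, ents_b):
    """Jaccard overlap between two sets of extracted entities."""
    a, b = set(ents_a), set(ents_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def composite_similarity(emb_gen, emb_ref, ents_gen, ents_ref,
                         bleu4, rouge_l, w=(0.5, 0.3, 0.2)):
    """S = w1*EmbeddingCosine + w2*EntityJaccard + w3*(BLEU-4 + ROUGE-L).
    Weight values are illustrative, not the challenge's actual settings."""
    w1, w2, w3 = w
    return (w1 * cosine(emb_gen, emb_ref)
            + w2 * entity_jaccard(ents_gen, ents_ref)
            + w3 * (bleu4 + rouge_l))
```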

Expert review of top submissions

For the top-performing submissions on each leaderboard, domain expert clinicians conduct a qualitative reasoning review scored on:

  • Clinical plausibility
  • Coherence and appropriate use of evidence
  • Absence of spurious or unsupported claims

Expert scores are reported separately and do not affect leaderboard rankings, but do factor into final winner selection alongside the quantitative scores.

Task     Expert reviewers                               Submissions reviewed
Task 1   Radiologists                                   Top 3
Task 2   Radiologists and pathologists                  Top 3
Task 3   Radiologists, pathologists, and oncologists    Top 5