Each task is scored on two equally weighted components: predictive performance and automated reasoning quality each contribute 50% of the final leaderboard score.

Fold 1 -- Predictive performance (50%)

  • Tasks 1 and 2: Area under the ROC curve (AUROC)
  • Task 3: Concordance index (C-index) for time-to-event biochemical recurrence (BCR) prediction
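
To make the two metrics concrete, here is a minimal pure-Python sketch; the platform's actual implementation is not specified, and library versions (e.g. scikit-learn's `roc_auc_score`, lifelines' `concordance_index`) would normally be used instead. All variable names are illustrative.

```python
def auroc(y_true, y_score):
    """AUROC as the probability that a randomly chosen positive case
    outscores a randomly chosen negative case, with ties counting half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def c_index(times, events, risk):
    """Harrell's C-index: fraction of comparable pairs (the subject with
    the shorter observed time had an event) correctly ordered by risk."""
    concordant = comparable = 0.0
    for i in range(len(times)):
        for j in range(len(times)):
            if events[i] and times[i] < times[j]:
                comparable += 1
                concordant += (risk[i] > risk[j]) + 0.5 * (risk[i] == risk[j])
    return concordant / comparable
```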

Performance estimates are reported with non-parametric percentile bootstrap confidence intervals. Ranking stability is assessed using bootstrap-based re-ranking across test set resamples.
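
A non-parametric percentile bootstrap can be sketched as follows; the resample count, alpha, and the handling of degenerate resamples are illustrative choices, not the platform's documented settings.

```python
import random

def bootstrap_ci(y_true, y_score, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for an arbitrary metric(y_true, y_score)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        # resample cases with replacement
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        ys = [y_score[i] for i in idx]
        if len(set(yt)) < 2:  # skip single-class resamples (AUROC undefined)
            continue
        stats.append(metric(yt, ys))
    stats.sort()
    lo = stats[int(alpha / 2 * len(stats))]
    hi = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lo, hi
```

The same resampling loop supports the re-ranking check: recomputing each submission's rank on every resample shows how often the leaderboard order would change.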

Fold 2 -- Reasoning quality (50%)

Generated explanations are evaluated using a claim-level faithfulness metric computed by an LLM on the platform:

  1. The explanation is decomposed into independently verifiable atomic claims.
  2. Each claim is checked against the structured input fields for that specific case.
  3. A faithfulness score is computed as: supported claims / total detected claims
  4. Unsupported, contradictory, or hallucinated findings incur an explicit penalty.
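
The scoring arithmetic in steps 3 and 4 can be sketched as below; the verdict labels and the penalty weight are assumptions for illustration, and the claim decomposition and checking (steps 1 and 2) are performed upstream by the LLM.

```python
def faithfulness_score(verdicts, penalty=0.5):
    """verdicts: one label per atomic claim from the checking step, e.g.
    'supported', 'unsupported', or 'contradicted' (labels and penalty
    weight are illustrative). Base score is supported/total; each
    non-supported claim subtracts an extra penalty/total, floored at zero."""
    total = len(verdicts)
    if total == 0:
        return 0.0
    supported = verdicts.count("supported")
    return max(0.0, (supported - penalty * (total - supported)) / total)
```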

Fallback: If automated faithfulness evaluation proves unstable during pilot validation, a composite similarity metric is used instead, combining cosine similarity between PubMedBERT embeddings, SciSpacy entity Jaccard similarity, BLEU-4, and ROUGE-L as S = w1*EmbeddingCosine + w2*EntityJaccard + w3*(BLEU-4 + ROUGE-L).
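
A sketch of the fallback combination, assuming the component scores have already been computed upstream (embeddings from PubMedBERT, entity sets from SciSpacy, BLEU-4/ROUGE-L from a standard NLG metrics package); the weights shown are placeholders.

```python
def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
    return dot / norm

def entity_jaccard(ents_a, ents_b):
    """Jaccard overlap between two sets of extracted entities."""
    a, b = set(ents_a), set(ents_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def composite_similarity(emb_gen, emb_ref, ents_gen, ents_ref,
                         bleu4, rouge_l, w=(0.5, 0.3, 0.2)):
    """S = w1*EmbeddingCosine + w2*EntityJaccard + w3*(BLEU-4 + ROUGE-L).
    Weight values are illustrative, not the challenge's actual settings."""
    w1, w2, w3 = w
    return (w1 * cosine(emb_gen, emb_ref)
            + w2 * entity_jaccard(ents_gen, ents_ref)
            + w3 * (bleu4 + rouge_l))
```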

Expert review of top submissions

For the top-performing submissions on each leaderboard, domain expert clinicians conduct a qualitative reasoning review scored on:

  • Clinical plausibility
  • Coherence and appropriate use of evidence
  • Absence of spurious or unsupported claims

Expert scores are reported separately and do not affect leaderboard rankings, but do factor into final winner selection alongside the quantitative scores.

Task     Expert reviewers                               Submissions reviewed
Task 1   Radiologists                                   Top 3
Task 2   Radiologists and pathologists                  Top 3
Task 3   Radiologists, pathologists, and oncologists    Top 5