
Participation Guide

A participant-oriented guide to the MedReason task, released data, pre-evaluation and final evaluation workflow, official metrics, and submission requirements.

Go to FAQ →

Overview

MedReason evaluates whether medical MLLM systems can perform clinically grounded visual question answering under domain shift. The challenge uses one medical VQA task, but it distinguishes between closed-ended accuracy and open-ended grounded reasoning.

Participation Guidelines
  • One challenge task: medical VQA with both multiple-choice and open-ended questions.
  • Registration: via the official Synapse participation page; an optional organizer-side Google Form registration is also available.
  • Data and tools: released participant-facing data are described on the Data Access page. Example code: coming soon. Example Docker: coming soon.
  • Submission: submit a Docker package and a model description document. See Submission Resources.
MedReason challenge overview figure
Figure 1. MedReason builds on the fine-grained Med-CMR benchmark and evaluates a single medical VQA task across two clinical reasoning scenarios, with additional out-of-distribution testing on low-dose X-ray and 7T brain MRI.
What you need to solve

Task

MedReason defines one medical visual question answering task. Each case contains a medical image, optional high-level context, and one or more questions. Depending on the question type, your system must return either one option or a reasoning trace plus a final answer.
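To make the per-case structure concrete, here is a minimal sketch of one case. Every field name below is an illustrative assumption, not the official schema; the released data packages define the actual format.

example_case = {
    "case_id": "case_0001",               # hypothetical identifier
    "image": "images/case_0001.png",      # the medical image
    "context": "Axial chest CT, adult.",  # optional high-level context
    "questions": [
        {"type": "closed-ended", "question": "...", "options": ["A", "B", "C", "D"]},
        {"type": "open-ended", "question": "..."},
    ],
}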

Question format A

Closed-ended questions

Your system returns a single multiple-choice option. This track evaluates whether the model can select the correct answer under the official option set.

Question format B

Open-ended questions

Your system must return both reasoning_trace and answer. This is required because a plausible final answer may still be unsupported by image evidence, while the reasoning trace reveals whether the model remained visually grounded.
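For illustration, an open-ended prediction record could look like the sketch below. The field names reasoning_trace and answer come from this guide; the wrapper structure and case_id are assumptions.

open_ended_prediction = {
    "case_id": "case_0001",  # hypothetical identifier
    "reasoning_trace": "The left costophrenic angle is blunted, which suggests ...",
    "answer": "Small left pleural effusion.",
}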

Clinical scenario 1

Image-grounded anatomical reasoning

Questions may involve anatomical identification, laterality, localization, spatial relationships, or structural interpretation directly grounded in the image.

Clinical scenario 2

Pathology-oriented clinical reasoning

Questions may involve abnormality interpretation, subtle finding differentiation, disease-oriented judgment, and reasoning under ambiguity.

What you can access

Data

The challenge builds on a public benchmark backbone, with public reference data available on Synapse. MedReason releases participant-facing resources for training and local development, while keeping the leaderboard test sets hidden for organizer-side evaluation.

Open the MedReason public benchmark resources on Synapse →
MedReason challenge task data figure
| Split | Released to participants | Answers included | Main purpose |
| --- | --- | --- | --- |
| Train | Yes | Yes | Model development |
| Validation | Yes | No | Local development and pipeline checking |
| Public leaderboard test | No | No | Organizer-side pre-evaluation and public leaderboard ranking |
| Private leaderboard test | No | No | Organizer-side final official ranking |
Released packages

How participants use the released data

Use the train split for model development. Use the participant-facing validation split for local validation, output-format checking, and debugging your inference pipeline before Docker packaging.
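For example, a minimal local sanity check, assuming predictions are written as JSON lines (the actual output contract is defined by the official Docker template), could verify that every open-ended record carries both required fields before packaging:

import json

def check_open_ended_outputs(path):
    # Assumed JSON-lines layout; reasoning_trace and answer are the two
    # fields this guide requires for open-ended cases.
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            record = json.loads(line)
            missing = {"reasoning_trace", "answer"} - record.keys()
            if missing:
                raise ValueError(f"line {lineno}: missing fields {missing}")
    print("All open-ended records contain both required fields.")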

Hidden splits

What remains organizer-side

The public leaderboard split is used during pre-evaluation. The private leaderboard split is used only for the final official ranking after submissions are frozen.

Important. Participants cannot directly evaluate on leaderboard test sets locally. Official leaderboard results are obtained only by organizer-side execution of submitted Docker packages on hidden evaluation data.
How the challenge runs

Workflow

The challenge process has three connected phases: local development on released data, pre-evaluation on the hidden public leaderboard split, and final evaluation on the hidden private leaderboard split.

Combined challenge workflow diagram showing released data usage, public pre-evaluation, and final private evaluation
Figure 2. Released train and validation data are used locally. Pre-evaluation runs on the hidden public leaderboard split. Final official ranking runs on the hidden private leaderboard split.
1

Local development on released data

Participants access the released train and participant-facing validation packages, train the method locally, and verify that outputs are complete and correctly formatted.

2

Pre-evaluation (many times) on the hidden public split

Teams submit a Docker package for organizer-side execution. The hidden public leaderboard split supports technical verification and public challenge-period ranking.

3

Final evaluation (once) on the hidden private split

After final submissions are frozen, organizers run the final official package on the hidden private leaderboard split to determine the official ranking.

How your submission is judged

Evaluation & Assessment

Read this section from top to bottom: first the MCQ track, then the open-ended track, and finally how the open-ended case-level rubric scores become the official leaderboard metrics.

Key points before reading the details
  • MCQ track: submit one option; official metric = MCQ Accuracy.
  • Open-ended track: submit reasoning_trace and answer; official metrics = Open-ended GT and Open-ended VA.
  • Open-ended scoring is organizer-side: a rubric-guided LLM-as-a-judge workflow assigns case-level scores and then averages them across the dataset.
Track A

MCQ evaluation

What you submit

For each closed-ended case, submit one multiple-choice option matching the official option format.

Official metric

MCQ Accuracy

How the metric is computed

  • The submitted option is compared with the reference option.
  • An exact match is counted as correct.
  • Otherwise, the case is counted as incorrect.

Leaderboard value: mean accuracy over all closed-ended evaluation cases.
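The computation is simple enough to restate as code. A minimal sketch, assuming predictions and references arrive as aligned lists of option labels:

def mcq_accuracy(predictions, references):
    # An exact match counts as correct; anything else is incorrect.
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# Example: mcq_accuracy(["A", "C", "B"], ["A", "B", "B"]) returns 2/3.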

Track B

Open-ended evaluation

What you submit

  • reasoning_trace
  • answer

Both fields are required. A final answer alone is not sufficient for official open-ended evaluation.

Official metrics

Open-ended GT (Ground-Truth Correctness)

Open-ended VA (Visual Accuracy)

How the metrics are computed

Each open-ended case is scored organizer-side with a rubric-guided LLM-as-a-judge workflow.

  • Ground-Truth Correctness (GT): evaluated using Llama-3.1-70B-Instruct in a text-only setting.
  • Visual Accuracy (VA): evaluated using Qwen-2.5-VL, which jointly processes the image, question, and generated answer.

Both evaluators are used under controlled scoring protocols with fixed model versions, deterministic decoding parameters, and standardized prompts.

Leaderboard values: mean case-level GT_final and mean case-level VA_final over all open-ended evaluation cases.

Open-ended scoring logic

How one open-ended case is scored

1

Case-level rubric scoring

For each open-ended case, the organizer-side judge assigns three 0–4 rubric scores.

2

Three internal rubric components

The three case-level components are Ground-Truth Correctness (internal field: GT_final), Answer-level Visual Accuracy (internal field: VA_answer), and Reasoning Visual Faithfulness (internal field: RVF_trace).

3

Threshold-gated visual aggregation

Final Visual Accuracy (internal field: VA_final) is computed from VA_answer and RVF_trace. This means the reasoning trace can directly affect the final visual score.

4

Dataset-level leaderboard values

Open-ended GT is the mean of GT_final across open-ended cases. Open-ended VA is the mean of VA_final across open-ended cases.
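Step 4 is a plain average. A minimal sketch, assuming one score record per open-ended case (GT_final and VA_final are the internal field names from this guide; the list-of-dicts structure is an assumption):

def leaderboard_values(case_scores):
    n = len(case_scores)
    open_gt = sum(c["GT_final"] for c in case_scores) / n
    open_va = sum(c["VA_final"] for c in case_scores) / n
    return open_gt, open_va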

What the 0–4 scale means

Case-level rubric components

These 0–4 scores are assigned by organizer-side judge models for each open-ended case, as integer ratings from 0 to 4. They are not participant-submitted values.

Ground-Truth Correctness

internal field: GT_final

Judge input: question, reference answer, and submitted final answer.

What it measures: whether the final answer is medically and semantically aligned with the reference answer.

0 = clearly contradictory or medically wrong
4 = well aligned with the reference answer

Answer-level Visual Accuracy

internal field: VA_answer

Judge input: image, question, and submitted final answer.

What it measures: whether the key visual claims in the final answer are actually supported by the image.

0 = key claims are unsupported or contradicted by the image
4 = key claims are clearly supported without critical hallucination

Reasoning Visual Faithfulness

internal field: RVF_trace

Judge input: image, question, and submitted reasoning trace.

What it measures: whether the reasoning trace remains faithful to image evidence through the key reasoning chain.

0 = core reasoning depends on unsupported or hallucinatory visual assertions
4 = key visual assertions in the reasoning trace remain grounded throughout
Threshold-gated VA aggregation

Final Visual Accuracy VA_final is derived from VA_answer and RVF_trace. This prevents an unfaithful reasoning trace from receiving a high final visual score.

if RVF_trace <= 1:
    # Unfaithful reasoning: cap the final visual score at 1.
    VA_final = min(VA_answer, 1)
elif RVF_trace == 2:
    # Partially faithful reasoning: cap the final visual score at 3.
    VA_final = min(VA_answer, 3)
else:
    # Faithful reasoning (RVF_trace >= 3): the answer-level score stands.
    VA_final = VA_answer
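For example, an answer scored VA_answer = 4 but produced by a reasoning trace scored RVF_trace = 1 receives VA_final = 1, while the same answer with RVF_trace = 3 keeps VA_final = 4.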
Auxiliary context only

Role of image caption

An image caption or visual description may be used as organizer-side auxiliary context or for difficult tie-break cases, but the primary basis for open-ended VA and RVF assessment is the raw image together with the question, reasoning trace, and final answer.

Special case

How “I don’t know” is handled

Suppose the image is low quality and the model answers: “I cannot confidently confirm a pleural effusion from this image alone.”

  • This will usually not receive a high GT_final score if the reference answer contains a definite finding, because the final conclusion was not aligned.
  • However, if the answer avoids hallucinated findings and the uncertainty is justified by the image, the case may still receive partial credit for VA_answer or RVF_trace.

Practical takeaway: strong open-ended performance requires both medical correctness and visual grounding. Overconfident answers can hurt VA, while overly frequent abstention can hurt GT.

Official leaderboard

What the official leaderboard shows

  • Main displayed metrics: MCQ Accuracy, Open-ended GT, and Open-ended VA.
  • Overall ranking: computed from the average of the per-metric ranks for those three official metrics (see the sketch after this list).
  • Pre-evaluation: based on the hidden public leaderboard split.
  • Final official ranking: determined after final submission freeze on the hidden private leaderboard split.
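As a rough illustration of the rank-averaging rule (tie handling and the official implementation are organizer-side and not specified here):

def overall_order(scores):
    # scores maps team -> (MCQ Accuracy, Open-ended GT, Open-ended VA);
    # higher is better on all three metrics. Ties are ignored for simplicity.
    teams = list(scores)
    avg_rank = {t: 0.0 for t in teams}
    for m in range(3):  # rank teams separately on each official metric
        ordered = sorted(teams, key=lambda t: scores[t][m], reverse=True)
        for rank, team in enumerate(ordered, start=1):
            avg_rank[team] += rank / 3
    return sorted(teams, key=lambda t: avg_rank[t])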
What you must submit

Submission Resources

To submit your model to the MedReason challenge, you need to provide a Docker image package and a short model description. The hidden leaderboard test sets will not be released, so your method must run through the official container-based evaluation workflow.

1

Prepare the Docker package

Use the official Docker template from the GitHub repository. Follow the README carefully and verify that the container runs end-to-end locally under the official I/O contract.

2

Upload the archive to cloud storage

Export the final Docker container as a zipped archive and upload it to a cloud platform such as Google Drive, Mega, or WeTransfer.

3

Email the download link

Send the cloud link to medreason26@googlegroups.com with the subject line “MedReason 2026 Submission [TEAMNAME] Docker Container Submission”.

4

Submit the model description

In addition to the Docker submission, participants must submit a model description document by July 15th, 2026.

Use the following form: MedReason Model Description Submission Form

Submission email checklist

What organizers need together with the Docker link

  • Team name
  • Synapse username
  • Method name
  • Downloadable Docker package link
  • Contact email
  • Model description document (PDF format)
Important. The model description is part of the official submission package. A Docker link alone is not sufficient.
Open the GitHub repository →