#KeyLIMEPodcast 220: Crowdsourcing Diagnosis?

Diagnosis is fundamental to the practice and identity of clinicians. However, in Jason’s opinion, none of the current publications on the topic have yet advanced the practice of #meded. This is what led him to this week’s paper, whose authors set out to compare the “diagnostic accuracy” of individuals vs groups of individuals using what Jason calls “unusual methods”. Curious? Listen here.


KeyLIME Session 220

Listen to the podcast.



Barnett et al., Comparative Accuracy of Diagnosis by Collective Intelligence of Multiple Physicians vs Individual Physicians. JAMA Network Open. 2019 Mar 1;2(3)


Jason R. Frank (@drjfrank)


Diagnosis…few things can be more fundamental to our practice and identity as clinicians. It is such a critical element, you would think that it has been extensively researched, clarified, and translated into practice…Except it has not. Oh, there are many, many papers on diagnosis, diagnostic errors, diagnostic biases. In my opinion, none yet advance our practice of meded. Some are concerning for their lack of evidence. (The exception to this diatribe is all the great research by some guy named Sherbino and another named Norman…That work teaches us all about what experts really do to make diagnoses…Don’t tell Sherbino I said he did good—he gets enough fan mail as it is.) What we need is a new line of research on how we make diagnoses and how we can make them better. So, I picked up a paper from JAMA Open by Barnett et al…


The authors, from Harvard, set out to compare the “diagnostic accuracy” of individuals vs groups of individuals in an online database of digital cases.

Key Points on the Methods

The methods are unusual: the authors accessed The Human Dx online digital case platform between May 2014 and October 2016. The Human Dx platform has 14,000 users from 80 countries and has received more than 230,000 inputs by users (cases or Dx). Excluding cases that were flagged as poor quality, they analyzed all cases with 10 or more Dx ratings by users (n=1572 cases). Users were categorized based on self-reported profiles and specialties. They randomly selected 10 solve attempts from all users who tried a given case. An attempt was then scored as “correct” if any of the user’s top 3 Dx possibilities corresponded to the intended Dx of the author of the case. Two authors rated each case, with an interrater agreement (Cohen’s kappa) of 0.70 (good). They tried several ways to weight a correct Dx based on the number of Dxs submitted, and ended up weighting each correct Dx as 1/n, where n is the number of Dxs submitted for the case.

Their primary outcome was the individual diagnostic accuracy of a randomly chosen user vs a collective score from 10 other random attempts by other users (the proportion of correct Dx among those users). Medical students and attending physicians were excluded from this index. They also compared the group Dx accuracy to individuals whose profile suggested that they should be the most expert in a given case. Whew!
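To make the scoring scheme concrete, here is a minimal sketch of the two rules described above: an attempt counts as “correct” if the case author’s intended Dx appears in the solver’s top-3 list, and a group’s collective score is the proportion of correct attempts. This is an illustration only, not the authors’ actual code; all function names and the sample diagnoses are hypothetical.

```python
# Illustrative sketch of the scoring described in the Methods (not the study's code).
# An attempt is "correct" if any of its top-3 diagnoses matches the intended Dx;
# a group's collective accuracy is the fraction of correct attempts.

def attempt_correct(top_dxs, intended_dx, k=3):
    """True if the intended diagnosis appears in the attempt's top-k list."""
    return intended_dx in top_dxs[:k]

def collective_accuracy(attempts, intended_dx):
    """Proportion of attempts whose top-3 list contains the intended Dx."""
    if not attempts:
        return 0.0
    correct = sum(attempt_correct(a, intended_dx) for a in attempts)
    return correct / len(attempts)

# Ten simulated solve attempts for one hypothetical case
# (intended Dx: "pulmonary embolism").
attempts = [
    ["pneumonia", "pulmonary embolism", "CHF"],
    ["pulmonary embolism", "pneumothorax", "asthma"],
    ["pneumonia", "COPD", "bronchitis"],
] + [["pulmonary embolism", "pneumonia", "pleurisy"]] * 7

print(collective_accuracy(attempts, "pulmonary embolism"))  # 0.9
```

In this toy example, 9 of the 10 simulated attempts list the intended Dx in their top 3, so the group score is 0.9, even though one individual misses it entirely; that asymmetry is exactly what drives the group-vs-individual comparison in the paper.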

Key Outcomes

By self-report, 60% of participants were postgraduate trainees (residents or fellows), and 20% each were med students or practicing physicians, collectively from ~46 countries. Most (70%) participants were involved in Internal Medicine. Diagnostic accuracy was 62.5% for individuals overall (55.8% for med students, 65.5% for residents, and 63.9% for attendings). Subspecialists in a relevant area scored only 66.3% in accuracy.

Their primary outcome, comparing individual accuracy to random groups, demonstrated a greater chance of getting the “right Dx” in groups: ~82.5% (with some variation). Subgroup analyses and different weighting schemes made no significant difference.

Key Conclusions

The authors conclude that the diagnostic accuracy of groups of clinicians was ~17.3% to 29.8% better than that of individuals from around the world using these digital cases. They suggest collective intelligence should be deployed in medicine.

Spare Keys – other take home points for clinician educators

  1. There are many, many threats to validity in this paper: the case quality is uncontrolled and unreported, the participants are self-described and presumably extremely variable, 80% of the participants are still trainees or students, the measures of “diagnostic accuracy” and “group accuracy” are potentially flawed, etc.
  2. Beware meded headlines that sound too good to be true: they probably are.
  3. This is still a clever way to do meded research from existing databases.
  4. Hear us, all clinician-educators: there is still a major need for more research on diagnosis.

Access KeyLIME podcast archives here