Sample Size for Bandit MaxDiff Studies

Last Updated: 11 Sep 2018
The following information was extracted from Appendix C of the Sawtooth Software technical paper Bandit MaxDiff: When to Use It and Why It Can Be a Better Choice than Standard MaxDiff, written by Bryan Orme.

This section provides guidance on the sample size needed to achieve 90% correct classification of the top few items out of many under Bandit MaxDiff. The first results below come from a simulation study by Fairchild, Orme, & Schwartz (2015 Sawtooth Software Conference) that generated simulated respondents from actual HB utilities in a real MaxDiff study conducted by P&G.

In that 2015 paper, Fairchild et al. ran simulations with bootstrap sampling to compute the sample size needed for 90% correct classification of the top-3 and top-10 items, using pooled multinomial logit estimation, for studies with 120 and 300 items. Each robotic respondent completed 18 MaxDiff sets showing 5 items per set:

Items in Study    Sample Size to Achieve     Sample Size to Achieve     Row Average
                  Top-3 Hit Rate of 90%      Top-10 Hit Rate of 90%
120 items         250                        160                        205
300 items         1,000                      1,050                      1,025

The 300-item study was generated based on the patterns of preferences and variances found in the original 120-item study, so the results should be fairly comparable.
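To make the hit-rate measure in the table concrete, here is a minimal Python sketch. It assumes that a top-k hit rate means the share of the true top-k items that also appear in the estimated top-k; the original paper may define it slightly differently. The function names and the toy scores are hypothetical and are not taken from the study.

```python
def top_k_items(scores, k):
    """Return the set of indices of the k highest-scoring items."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def top_k_hit_rate(true_scores, est_scores, k):
    """Share of the true top-k items that also land in the estimated top-k.

    Assumption: this is one plausible reading of "hit rate"; it is not
    confirmed as the exact definition used in the simulations.
    """
    true_top = top_k_items(true_scores, k)
    est_top = top_k_items(est_scores, k)
    return len(true_top & est_top) / k


# Hypothetical toy data: 6 items, where estimation swaps the 3rd and 4th best.
true_scores = [2.0, 1.5, 1.0, 0.5, 0.0, -0.5]
est_scores = [1.9, 1.6, 0.4, 0.6, 0.1, -0.4]

hit_rate = top_k_hit_rate(true_scores, est_scores, k=3)
print(f"Top-3 hit rate: {hit_rate:.2f}")
```

In a simulation like the ones described here, a function of this kind would be evaluated at increasing sample sizes until the hit rate reaches the 90% target.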

Later, with the help of Zachary Anderson, we conducted simulations (13 separate replications, using different random seeds) for a 1,000-item study, based on the same variances and patterns of preference correlation seen in the original 120-item study collected by P&G. On average, it took 8,500 respondents to achieve a 90% hit rate (correct classification of the top-30 items) and 2,500 respondents to achieve an 80% hit rate.

The true top item had a 3% higher likelihood of choice than the true second-place item, and we were very pleased at how well Bandit MaxDiff with robotic respondents was able to classify that top-ranked item correctly out of 1,000 items (almost like finding a needle in a haystack). With just 1,000 robotic respondents, and again using pooled logit estimation, Bandit MaxDiff identified the true top item out of 1,000 items in 7 out of 13 of our simulations. With 2,000 robotic respondents, it identified the true top item in 10 out of 13 simulations (and the three simulations that didn’t correctly identify the top item ranked it in second place). With 5,000 robotic respondents, it found the top item in 12 out of 13 simulations (and the top item came in second place in the one miss).
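The replication counts above can be restated as success rates with a few lines of Python. The sketch below only does arithmetic on the figures reported in this section. The final calculation is an assumption: it reads "3% higher likelihood of choice" as a ratio of 1.03 between the two items' exponentiated logit utilities, which may not be exactly how the gap was defined in the simulations.

```python
import math

REPLICATIONS = 13

# Simulations (out of 13) in which the true top item was ranked first,
# keyed by the number of robotic respondents, as reported in the text.
top_item_found = {1000: 7, 2000: 10, 5000: 12}

success_rate = {n: found / REPLICATIONS for n, found in top_item_found.items()}

for n, rate in success_rate.items():
    print(f"{n:>5} respondents: top item found in {rate:.1%} of replications")

# Assumption: under a logit model, a 3% higher likelihood of choice corresponds
# to exp(u_first - u_second) = 1.03, so the utility gap is ln(1.03).
utility_gap = math.log(1.03)
print(f"Implied logit utility gap: {utility_gap:.4f}")
```

Under that assumption the top two items are separated by only about 0.03 on the logit scale, which helps explain why so many respondents are needed to tell them apart.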

Is a 90% correct classification rate for the top few items really needed for studies involving hundreds of items? Perhaps for a 500-item study, an 80% classification rate for the true top few items would be sufficient. Even if an item is incorrectly classified among the top 10 out of 500, the Bandit MaxDiff methodology is robust enough that any item misclassified into the top 10 should still be of very high quality and rank very near the true top 10.
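One way to check how costly a miss is would be to look up the true rank of every item that lands in the estimated top-k. The Python sketch below illustrates that idea with hypothetical toy scores; the function name and data are assumptions for illustration and do not come from the simulations.

```python
def true_ranks_of_estimated_top(true_scores, est_scores, k):
    """Return the true rank (1 = best) of each item in the estimated top-k."""
    n = len(true_scores)
    true_order = sorted(range(n), key=lambda i: true_scores[i], reverse=True)
    true_rank = {item: pos + 1 for pos, item in enumerate(true_order)}
    est_order = sorted(range(n), key=lambda i: est_scores[i], reverse=True)
    return [true_rank[item] for item in est_order[:k]]


# Hypothetical toy data: 6 items, where estimation swaps the 3rd and 4th best.
true_scores = [2.0, 1.5, 1.0, 0.5, 0.0, -0.5]
est_scores = [1.9, 1.6, 0.4, 0.6, 0.1, -0.4]

k = 3
ranks = true_ranks_of_estimated_top(true_scores, est_scores, k)
misclassified = [r for r in ranks if r > k]

print("True ranks of estimated top-3:", ranks)
print("True ranks of misclassified items:", misclassified)
```

In this toy case the one misclassified item has a true rank of 4, just outside the top 3, which is the kind of near miss the paragraph above describes.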