06 February 2018

The latest Cornell Food and Brand Lab correction: Some inconsistencies and strange data patterns

The Cornell Food and Brand Lab has a new correction. Tim van der Zee already tweeted a bit about it.

"Extremely odd that it isn't a retraction"? Let's take a closer look.

Here is the article that was corrected:
Wansink, B., Just, D. R., Payne, C. R., & Klinger, M. Z. (2012). Attractive names sustain increased vegetable intake in schools. Preventive Medicine, 55, 330–332. http://dx.doi.org/10.1016/j.ypmed.2012.07.012

This is the second article from this lab in which data were reported as having been collected from elementary school children aged 8–11, but it turned out that they were in fact collected from children aged 3–5 in daycares.  You can read the lab's explanation for this error at the link to the correction above (there's no paywall at present), and decide how convincing you find it.

Just as a reminder, the first article, published in JAMA Pediatrics, was initially corrected (via JAMA's "Retract and replace" mechanism) in September 2017. Then, after it emerged that the children were in fact in daycare, and that there were a number of other problems in the dataset that I blogged about, the article was definitively retracted in October 2017.

I'm going to concentrate on Study 1 of the recently-corrected article here, because the corrected errors in this study are more egregious than those in Study 2, and also because there are still some very substantial problems remaining.  If you have access to SPSS, I also encourage you to download the dataset for Study 1, along with the replication syntax and annotated output file, from here.

By the way, in what follows, you will see a lot of discussion about the amount of "carrots" eaten.  There has been some discussion about this, because the original article just discussed "carrots" with no qualification. The corrected article tells us that the carrots were "matchstick carrots", which are about 1/4 the size of a baby carrot. Presumably there is a U.S. Standard Baby Carrot kept in a science museum somewhere for calibration purposes.

So, what are the differences between the original article and the correction? Well, there are quite a few. For one thing, the numbers in Table 1 now finally make sense, in that the number of carrots considered to have been "eaten" is now equal to the number of carrots "taken" (i.e., served to the children) minus the number of carrots "uneaten" (i.e., counted when their plates came back after lunch).  In the original article, these numbers did not add up; that is, "taken" minus "uneaten" did not equal "eaten".  This is important because, when asked by Alison McCook of Retraction Watch why this was the case, Dr. Brian Wansink (the head of the Cornell Food and Brand Lab) implied that it must have been due to some carrots being lost (e.g., dropped on the floor, or thrown in food fights). But this makes no sense for two reasons. First, in the original article, the number of carrots "eaten" was larger than the difference between "taken" and "uneaten", which would imply that, rather than being dropped on the floor or thrown, some extra carrots had appeared from somewhere.  Second, and more fundamentally, the definition of the number of carrots eaten is (the number taken) minus (the number left uneaten).  Whether the kids ate, threw, dropped, or made sculptures out of the carrots doesn't matter; any that didn't come back were classed as "eaten". There was no monitoring of each child's oesophagus to count the carrots slipping down.

When we look in the dataset, we can see that there are separate variables for "taken" (e.g., "@1CarTaken" for Monday, "@2CarTaken" for Tuesday, etc), "uneaten" (e.g., "@1CarEnd", where "End" presumably corresponds to "left at the end"), and "eaten" (e.g., "@1CarEaten").  In almost all cases, the formula ("eaten" equals "taken" minus "uneaten") holds, except for a few missing values and two participants (#42 and #152) whose numbers for Monday seem to have been entered in the wrong order; for both of these participants, "eaten" equals "taken" plus "uneaten". That's slightly concerning because it suggests that, instead of just entering "taken" and "uneaten" (the quantities that were capable of being measured) and letting their computer calculate "eaten", the researchers calculated "eaten" by hand and typed in all three numbers, doing so in the wrong order for these two participants in the process.
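
For anyone who wants to reproduce this check, here is a minimal sketch in R (the filename is hypothetical, and I am assuming that the haven package keeps the SPSS variable names, including the leading "@", intact):

library(haven)

study1 <- read_sav("study1.sav")   # hypothetical filename for the downloaded SPSS file
# Recompute "eaten" as "taken" minus "uneaten" for Monday, and flag any rows that disagree
eaten.calc <- study1$`@1CarTaken` - study1$`@1CarEnd`
which(study1$`@1CarEaten` != eaten.calc)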

Another major change is that whereas in the original article the study was run on three days, in the correction there are reports of data from four days.  In the original, Monday was a control day, the between-subject manipulation of the carrot labels was done on Tuesday, and Thursday was a second control day, to see if the effect persisted. In the correction, Thursday is now a second experimental day, part of a different experiment that carried over to Friday: the carrots served on Thursday were labelled either "X-ray Vision Carrots" or "Food of the Day" (there was no "no label" condition), but instead of measuring how many carrots were eaten on Thursday, the dependent variable was the number of carrots eaten on the next day (Friday).

OK, so those are the differences between the two articles. But arguably the most interesting discoveries are in the dataset, so let's look at that next.

Randomisation #fail


As Tim van der Zee noted in the Twitter thread that I linked to at the top of this post, the number of participants in Study 1 in the corrected article has mysteriously increased since the original publication. Specifically, the number of children in the "Food of the Day" condition has gone from 38 to 48, an increase of 10, and the number of children in the "no label" condition has gone from 45 to 64, an increase of 19.  You might already be thinking that a randomisation process that leads to only 22.2% (32 of 144) of participants being in the experimental condition might not be an especially felicitous one, but as we will see shortly, that is by no means the largest problem here.  (The original article does not actually discuss randomisation, and the corrected version only mentions it in the context of the choice of two labels in the part of the experiment that was conducted on the Thursday, but I think it's reasonable to assume that children were meant to be randomised to one of the carrot labelling conditions on the Tuesday.)

The participants were split across seven daycare centres and/or school facilities (I'll just go with the authors' term "schools" from now on).  Here is the split of children per condition and per school:


Oh dear. It looks like the randomisation didn't so much fail here, as not take place at all, in almost all of the schools.

Only two schools (#1 and #4) had a non-zero number of children in each of the three conditions. Three schools had zero children in the experimental condition. Schools #3, #5, #6, and #7 only had children in one of the three conditions. The justification for the authors' model in the corrected version of the article ("a Generalized Estimated Equation model using a negative binominal distribution and log link method with the location variable as a repeated factor"), versus the simple ANOVA that they performed in the original, was to be able to take into account the possible effect of the school. But I'm not sure that any amount of correction for the effect of the school is going to help you when the data are as unbalanced as this.  It seems quite likely that the teachers or researchers in most of the schools were not following the protocol very carefully.
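
For what it's worth, this kind of cross-tabulation takes one line in R, using the study1 data frame from the sketch above. This is only a sketch; the variable names for the school and for the Tuesday labelling condition are my guesses, not necessarily what appears in the file:

# Hypothetical variable names: "School" for the facility, "@2Label" for the Tuesday labelling condition
with(study1, table(School, `@2Label`))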

At school #1, thou shalt eat carrots


Something very strange must have been happening in school #1.  Here is the table of the numbers of children taking each number of carrots in schools #2-#7 combined:

I think that's pretty much what one might expect.  About a quarter of the kids took no carrots at all, most of the rest took a few, and there were a couple of major carrot fans.  Now let's look at the distribution from school #1:


Whoa, that's very different. No child in school #1 had a lunch plate with zero carrots. In fact, all of the children took a minimum of 10 carrots, which is more than 44 (41.1%) of the 107 children in the other schools took.  Even more curiously, almost all of the children in school #1 apparently took an exact multiple of 10 carrots - either 10 or 20. And if we break these numbers down by condition, it gets even stranger:

So 17 out of 21 children in the control condition ("no label", which in the case of daycare children who are not expected to be able to read labels anyway presumably means "no teacher describing the carrots") in school #1 chose exactly 10 carrots. Meanwhile, every single child (12 out of 12) in the "Food of the Day" condition selected exactly 20 carrots.

I don't think it's necessary to run any statistical tests here to see that there is no way that this happened by chance. Maybe the teachers were trying extra hard to help the researchers get the numbers they wanted by encouraging the children to take more carrots than they otherwise would (remember, from schools #2-#7, we could expect a quarter of the kids to take zero carrots). But then, did they count out these matchstick carrots individually, 1, 2, 3, up to 10 or 20? Or did they serve one or two spoonfuls and think, screw it, I can't be bothered to count them, let's call it 10 per spoon?  Participants #59 (10 carrots), #64 (10), #70 (22), and #71 (10) have the comment "pre-served" recorded in their data for this day; does this mean that for these children (and perhaps others with no comment recorded), the teachers chose how many carrots to give them, thus making a mockery of the idea that the experiment was trying to determine how the labelling would affect the kids' choices?  (I presume it's just a coincidence that the number of kids with 20 carrots in the "Food of the Day" condition, and the number with 10 carrots in the "no label" condition, are very similar to the number of extra kids in these respective conditions between the original and corrected versions of the article.)

The tomatoes... and the USDA project report


Another interesting thing to emerge from an examination of the dataset is that not one but two foods, with and without "cool names", were tested during the study.  As well as "X-ray Vision Carrots", children were also offered tomatoes. On at least one day, these were described as "Tomato Blasts". The dataset contains variables for each day recording what appears to be the order in which each child was served with the tomatoes or carrots.  Yet, there are no variables recording how many tomatoes each child took, ate, or left uneaten on each day. This is interesting, because we know that these quantities were measured. How? Because it's described in this project report by the Cornell Food and Brand Lab on the USDA website:

"... once exposed to the x-ray vision carrots kids ate more of the carrots even when labeled food of the day. No such strong relationship was observed for tomatoes, which could mean that the label used (tomato blasts) might not be particularly meaningful for children in this age group."

This appears to mean that the authors tested two dependent variables, but only reported the one that gave a statistically significant result. Does that sound like readers of the Preventive Medicine article (either the original or the corrected version) are being provided with an accurate representation of the research record? What other variables might have been removed from the dataset?

It's also worth noting that the USDA project report that I linked to above states explicitly that both the carrots-and-tomatoes study and the "Elmo"/stickers-on-apples study (later retracted by JAMA Pediatrics) were conducted in daycare facilities, with children aged 3–5.  It appears that the Food and Brand Lab probably sent that report to the USDA in 2009. So how was it that by March 2012 (the date on this draft version of the original "carrots" article) everybody involved in writing "Attractive Names Sustain Increased Vegetable Intake in Schools" had apparently forgotten about it, and was happy to report that the participants were elementary school students?  And yet, when Dr. Wansink cited the JAMA Pediatrics article in 2013 and 2015, he referred to the participants as "daycare kids" and "daycare children", respectively; so his incorrect citation of his own work actually turns out to have been a correct statement of what had happened.  And in the original version of that same "Elmo" article, published in 2012, the authors referred to the children (who were meant to be aged 8–11) as "preliterate". So even if everyone had forgotten about the ages of the participants at a conscious level, this knowledge seems to have been floating around subliminally. This sounds like a very interesting case study for psychologists.

Another interesting thing about the March 2012 draft that I mentioned in the previous paragraph is that it describes data being collected on four days (i.e., the same number of days as the corrected article), rather than the three days that were mentioned in the original published version of the article, which was published just four months after the date of the draft:


Extract from the March 2012 draft manuscript, showing the description of the data collection period, with the PDF header information (from File/Properties) superposed.

So apparently at some point between drafting the original article and submitting it, one of the days was dropped, with the second control day being moved up from Friday to Thursday. Again, some people might feel that at least one version of this article might not be an accurate representation of the research record.

Miscellaneous stuff


Some other minor peculiarities in the dataset, for completeness:

- On Tuesday (the day of the experiment, after a "control" day), participants 194, 198, and 206 were recorded as commenting about "cool carrots"; it is unclear whether this was a reference to the name that was given to the carrots on Monday or Tuesday.  But on Monday, a "control" day, the carrots should presumably have had no name, and on Tuesday they should have been described as "X-ray Vision Carrots".

- On Monday and Friday, all of the carrots should have been served with no label. But the dataset records that five participants (#199, #200, #203, #205, and #208) were in the "X-ray Vision Carrots" condition on Monday, and one participant (#12) was in the "Food of the Day" condition on Friday. Similarly, on Thursday, according to the correction, all of the carrots were labelled as "Food of the Day" or "X-ray Vision Carrots". But two of the cases (participants #6 and #70) have the value that corresponds to "no label" here.

These are, again, minor issues, but they shouldn't be happening. In fact there shouldn't even be a variable in the dataset for the labelling condition on Monday and Friday, because those were control-only days.

Conclusion


What can we take away from this story?  Well, the correction at least makes one thing clear: absolutely nothing about the report of Study 1 in the original published article makes any sense. If the correction is indeed correct, the original article got almost everything wrong: the ages and school status of the participants, the number of days on which the study was run, the number of participants, and the number of outcome measures. We have an explanation of sorts for the first of these problems, but not the others.  I find it very hard to imagine how the authors managed to get so much about Study 1 wrong the first time they wrote it up. The data for the four days and the different conditions are all clearly present in the dataset.  Getting the number of days wrong, and incorrectly describing the nature of the experiment that was run on Thursday, is not something that can be explained by a simple typo when copying the numbers from SPSS into a Word document (especially since, as I noted above, the draft version of the original article mentions four days of data collection).

In summary: I don't know what happened here, and I guess we may never know. What I am certain of is that the data in Study 1 of this article, corrected or not, cannot be the basis of any sort of scientific conclusion about whether changing the labels on vegetables makes children want to eat more of them.

I haven't addressed the corrections to Study 2 in the same article, although these would be fairly substantial on their own if they weren't overshadowed by the ongoing dumpster fire of Study 1.  It does seem, however, that the spin that is now being put on the story is that Study 1 was a nice but perhaps "slightly flawed" proof-of-concept, but that there is really nothing to see there and we should all look at Study 2 instead.  I'm afraid that I find this very unconvincing.  If the authors have real confidence in their results, I think they should retract the article and resubmit Study 2 for review on its own. It would be sad for Matthew Z. Klinger, the then high-school student who apparently did a lot of the grunt work for Study 2, to lose a publication like this, but if he is interested in pursuing an academic career, I think it would be a lot better for him not to have his name on the corrected article in its present form.

08 January 2018

Some quick and dirty R code for checking between-subjects F (and t) statistics

A while back, someone asked me how we (Tim van der Zee, Jordan Anaya, and I) checked the validity of the F statistics when we analyzed the "pizza papers" from the Cornell Food and Brand Lab.  I had an idea to write this up here, which has now floated to the top of the pile because I need to cite it in a forthcoming presentation. :-)

A quick note before we start: this technique applies to one- or two-way between-subjects ANOVAs. A one-way, two-condition ANOVA is equivalent to an independent samples t test; the F statistic is the square of the t statistic. I will sometimes mention only F statistics, but everything here applies to independent samples t tests too.  On the other hand, you can't use this technique to check mixed (between/within) ANOVAs, or paired-sample t tests, as those require knowledge of every value in the dataset.
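
If you want to convince yourself of the F = t² relation, here is a quick sketch with simulated data (not taken from any of the articles discussed here):

# Two simulated groups; the one-way ANOVA F equals the square of the pooled-variance t.
set.seed(1)
y <- c(rnorm(48, mean=4), rnorm(52, mean=5))
group <- factor(rep(c("A", "B"), times=c(48, 52)))
t.stat <- t.test(y ~ group, var.equal=TRUE)$statistic
f.stat <- anova(lm(y ~ group))["group", "F value"]
c(t.squared=unname(t.stat)^2, F=f.stat)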

It turns out that, subject to certain limitations (discussed later), you can derive the F statistic for a between-subjects ANOVA from (only) the per-cell means, SDs, and sample sizes.  You don't need the full dataset.  There are some online calculators that perform these tests; however, they typically assume that the input means and SDs are exact, which is unrealistic.  I can illustrate this point with the t test (!) calculator from GraphPad.  Open that up and put these numbers in: for group A, a mean of 4.05, an SD of 1.12, and N=48; for group B, a mean of 5.06, an SD of 1.18, and N=52.
(Note that we are not using Welch's t test here, although Daniël Lakens will tell you, and he is very probably right, that we should usually do so; indeed, we should use Welch's ANOVA too. Our main reason for not doing that here is that the statistics you are checking will usually not have been computed with the Welch versions of these tests; indeed, the code that I present below depends on an R package that assumes that the non-Welch tests are used.  You can usually detect if Welch's tests have been used, as the [denominator] degrees of freedom will not be integers.)

Having entered those numbers, click on "Calculate now" (I'm not sure why the "Clear the form" button is so large or prominent!) and you should get these results: t = 4.3816, df = 98.  Now, suppose the article you are reading states that "People in condition B (M=5.06, SD=1.18, N=52) outperformed those in condition A (M=4.05, SD=1.12, N=48), t(98)=4.33". Should you reach for your green ballpoint pen and start writing a letter to the editor about the difference in the t value?  Probably not.  The thing is, the reported means (4.05, 5.06) and SDs (1.12, 1.18) will have been rounded after being used to calculate the test statistic.  The mean of 4.05 could have been anywhere from 4.045 to 4.055 (rounding rules are complex, but whether this is 4.0499999 or 4.0500000 doesn't matter too much), the SD of 1.12 could have been in the range 1.115 to 1.125, etc.  This can make quite a difference.  How much?  Well, we can generate the maximum t statistic by making the difference between the means as large as possible, and both of the SDs as small as possible:

That gives a t statistic of 4.4443.  To get the smallest possible t value, we make the difference between the means as small as possible, and both of the SDs as large as possible, in an analogous way.  I'll leave the filling in of the form as an exercise for you, but the result is 4.3195.

So we now know that when we see a statement such as "People in condition B (M=5.06, SD=1.18, N=52) outperformed those in condition A (M=4.05, SD=1.12, N=48), t(98)=<value>", any t value from 4.32 through 4.44 is plausible; values outside that range are, in principle, not possible. If you see multiple such values in an article, or even just one or two with a big discrepancy, it can be worth investigating further.
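
To make the logic concrete, here is a small sketch of the same calculation done by hand for the two-group case. This is just an illustration of the idea; the f_range() function below does the general version:

# Pooled-variance t statistic from summary statistics only
t.from.summary <- function(m1, s1, n1, m2, s2, n2) {
  sp2 <- ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2)   # pooled variance
  (m2 - m1) / sqrt(sp2 * (1/n1 + 1/n2))
}

d <- 0.005    # maximum rounding error for numbers reported to 2 decimal places
t.from.summary(4.05, 1.12, 48, 5.06, 1.18, 52)                    # nominal: 4.3816
t.from.summary(4.05 - d, 1.12 - d, 48, 5.06 + d, 1.18 - d, 52)    # maximum: about 4.44
t.from.summary(4.05 + d, 1.12 + d, 48, 5.06 - d, 1.18 + d, 52)    # minimum: about 4.32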

The online calculators I have seen that claim to do these tests have a few other limitations as well as the problem of rounded input values.  First, the interface is a bit clunky (typically involving typing numbers into a web form, which you have to do again tomorrow if you want to re-run the analyses). Second, some of them use Java, and that may not work with your browser. What we needed, at least for the "Statistical Heartburn" analyses, was some code.  I wrote mine in R and Jordan independently wrote a version in Python; we compared our results at each step of the way, so we were fairly confident that we had the right answers (or, of course, we could have both made the same mistakes).

My solution uses an existing R library called rpsychi, which does the basic calculations of the test statistics. I wrote a wrapper function called f_range(), which does the work of calculating the upper and lower bounds of the means and SDs, and outputs the minimum and maximum F (or t, if you set the parameter show.t to TRUE) statistics.

Usage is intended to be relatively straightforward.  The main parameters of f_range() are vectors or matrices of the per-cell means (m), SDs (s), and sample sizes (n).  You can add show.t=TRUE to get t (rather than F) statistics, if appropriate; setting dp.p forces the number of decimal places used to that value (although the default almost always works); and title and labels are cosmetic. Here are a couple of examples from the "pizza papers":

1. Check Table 1 from "Lower Buffet Prices Lead to Less Taste Satisfaction"
n.lbp.t1 <- c(62, 60) 

m.lbp.t1.l1 <- c(44.16, 46.08)
sd.lbp.t1.l1 <- c(18.99, 14.46)
f_range(m=m.lbp.t1.l1, s=sd.lbp.t1.l1, n=n.lbp.t1, title="Age") 

m.lbp.t1.l3 <- c(68.52, 67.91)
sd.lbp.t1.l3 <- c(3.95, 3.93)
f_range(m=m.lbp.t1.l3, s=sd.lbp.t1.l3, n=n.lbp.t1, title="Height") 

m.lbp.t1.l4 <- c(180.84, 182.31)
sd.lbp.t1.l4 <- c(48.37, 48.41)
f_range(m=m.lbp.t1.l4, s=sd.lbp.t1.l4, n=n.lbp.t1, title="Weight") 

m.lbp.t1.l5 <- c(3.00, 3.28)
sd.lbp.t1.l5 <- c(1.55, 1.29)
f_range(m=m.lbp.t1.l5, s=sd.lbp.t1.l5, n=n.lbp.t1, title="Group size") 

m.lbp.t1.l6 <- c(6.62, 6.64)
sd.lbp.t1.l6 <- c(1.85, 2.06)
# Next line gives an F too small for rpsychi to calculate
# f_range(m=m.lbp.t1.l6, s=sd.lbp.t1.l6, n=n.lbp.t1, title="Hungry then") 

m.lbp.t1.l7 <- c(1.88, 1.85)
sd.lbp.t1.l7 <- c(1.34, 1.75)
f_range(m=m.lbp.t1.l7, s=sd.lbp.t1.l7, n=n.lbp.t1, title="Hungry now") 

2. Check Table 2 from "Eating Heavily: Men Eat More in the Company of Women"
lab.eh.t2 <- c("gender", "group", "gender x group")
n.eh.t2 <- matrix(c(40, 20, 35, 10), ncol=2)

m.eh.t2.l1 <- matrix(c(5.00, 2.69, 4.83, 5.54), ncol=2)
sd.eh.t2.l1 <- matrix(c(2.99, 2.57, 2.71, 1.84), ncol=2)
f_range(m=m.eh.t2.l1, s=sd.eh.t2.l1, n=n.eh.t2, title="Line 1", labels=lab.eh.t2)

m.eh.t2.l2 <- matrix(c(2.99, 1.55, 1.33, 1.05), ncol=2)
sd.eh.t2.l2 <- matrix(c(1.75, 1.07, 0.83, 1.38), ncol=2)
f_range(m=m.eh.t2.l2, s=sd.eh.t2.l2, n=n.eh.t2, title="Line 2", labels=lab.eh.t2)

m.eh.t2.l3 <- matrix(c(2.67, 2.76, 2.73, 1.00), ncol=2)
sd.eh.t2.l3 <- matrix(c(2.04, 2.18, 2.16, 0.00), ncol=2)
f_range(m=m.eh.t2.l3, s=sd.eh.t2.l3, n=n.eh.t2, title="Line 3", labels=lab.eh.t2)

m.eh.t2.l4 <- matrix(c(1.46, 1.90, 2.29, 1.18), ncol=2)
sd.eh.t2.l4 <- matrix(c(1.07, 1.48, 2.28, 0.40), ncol=2)
f_range(m=m.eh.t2.l4, s=sd.eh.t2.l4, n=n.eh.t2, title="Line 4", labels=lab.eh.t2)

m.eh.t2.l5 <- matrix(c(478.75, 397.5, 463.61, 111.71), ncol=2)
sd.eh.t2.l5 <- matrix(c(290.67, 191.37, 264.25, 109.57), ncol=2)
f_range(m=m.eh.t2.l5, s=sd.eh.t2.l5, n=n.eh.t2, title="Line 5", labels=lab.eh.t2)

m.eh.t2.l6 <- matrix(c(2.11, 2.27, 2.20, 1.91), ncol=2)
sd.eh.t2.l6 <- matrix(c(1.54, 1.75, 1.71, 2.12), ncol=2)
f_range(m=m.eh.t2.l6, s=sd.eh.t2.l6, n=n.eh.t2, title="Line 6", labels=lab.eh.t2)

I mentioned earlier that there were some limitations on what this software can do. Basically, once you get beyond a 2x2 design (e.g., 3x2), there can be some (usually minor) discrepancies between the F statistics calculated by rpsychi and the numbers that might have been returned by the ANOVA software used by the authors of the article that you are reading, if the sample sizes are unbalanced across three or more conditions; the magnitude of such discrepancies will depend on the degree of imbalance.  This issue is discussed in a section starting at the bottom of page 3 of our follow-up preprint.

A further limitation is that rpsychi has trouble with very small F statistics (such as 0.02). If you have a script that makes multiple calls to f_range(), it may stop when this occurs. The only workaround I know of for this is to comment out that call (as shown in the first example above).
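
An alternative workaround (just a sketch, not something I used in the original analyses) is to wrap the offending call in tryCatch(), so that the rest of the script still runs:

# If rpsychi chokes on a very small F, report that and carry on with the next call.
tryCatch(
  f_range(m=m.lbp.t1.l6, s=sd.lbp.t1.l6, n=n.lbp.t1, title="Hungry then"),
  error = function(e) cat("Hungry then: F too small for rpsychi to handle\n")
)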

Here is the R code for f_range(). It is released under a CC BY license, so you can do pretty much what you like with it.  I decided not to turn it into an R package because I want it to remain "quick and dirty", and packaging it would require an amount of polishing that I don't want to put in at this point.  This software comes with no technical support (but I will answer polite questions if you ask them via the comments on this post) and I accept no responsibility for anything you might do with it. Proceed with caution, and make sure you understand what you are doing (for example, by having a colleague check your reasoning) before you do... well, anything in life, really.

library(rpsychi)

# Function to display the possible ranges of the F or t statistic from a one- or two-way ANOVA.
f_range <- function (m, s, n, title=FALSE, show.t=FALSE, dp.p=-1, labels=c()) {
  m.ok <- m
  if (is.matrix(m.ok)) {
    func <- ind.twoway.second
    useF <- c(3, 2, 4)
    default_labels <- c("col F", "row F", "inter F")
  }
  else {
    m.ok <- matrix(m)
    func <- ind.oneway.second
    useF <- 1
    default_labels <- c("F")
    if (show.t) {
      default_labels <- c("t")
    }
  }

# Determine how many DPs to use from input numbers, if not specified
  dp <- dp.p
  if (dp.p == -1) {
    dp <- 0
    numbers <- c(m, s)
    for (i in numbers) {
      if (i != round(i, 0)) {
        dp <- max(dp, 1)
        j <- i * 10
        if (j != round(j, 0)) {
          dp <- max(dp, 2)
        }
      }
    }
  }

  if (length(labels) == 0) {
    labels <- default_labels
  }

# Calculate the nominal test statistic(s) (i.e., assuming no rounding error)
  f.nom <- func(m=m.ok, sd=s, n=n)$anova.table$F

# We correct for rounding in reported numbers by allowing for the maximum possible rounding error.
# For the maximum F estimate, we subtract .005 from all SDs; for minimum F estimate, we add .005.
# We then add or subtract .005 to every mean, in all possible permutations.
# (".005" is an example, based on 2 decimal places of precision.)
  delta <- (0.1 ^ dp) / 2    #typically 0.005
  s.hi <- s - delta
  s.lo <- s + delta

# Start the running maximum and minimum F statistics at the nominal values.
  f.hi <- f.nom
  f.lo <- f.nom

# Generate every possible combination of +/- maximum rounding error to add to each mean.
  l <- length(m.ok)
  rawcomb <- combn(rep(c(-delta, delta), l), l)
  comb <- rawcomb[,!duplicated(t(rawcomb))]

# Generate every possible set of test statistics within the bounds of rounding error,
#  and retain the largest and smallest.
  for (i in 1:ncol(comb)) {
    m.adj <- m.ok + comb[,i]
    f.hi <- pmax(f.hi, func(m=m.adj, sd=s.hi, n=n)$anova.table$F)
    f.lo <- pmin(f.lo, func(m=m.adj, sd=s.lo, n=n)$anova.table$F)
  }

  if (show.t) {
    f.nom <- sqrt(f.nom)
    f.hi <-  sqrt(f.hi)
    f.lo <-  sqrt(f.lo)
  }

  if (title != FALSE) {
    cat(title)
  }

  sp <- " "
  fdp <- 2     # best to report Fs to 2 DP always, I think
  dpf <- paste("%.", fdp, "f", sep="")
  for (i in 1:length(useF)) {
    j <- useF[i]
    cat(sp, labels[i], ": ", sprintf(dpf, f.nom[j]),
                   " (min=", sprintf(dpf, f.lo[j]),
                   ", max=", sprintf(dpf, f.hi[j]), ")",
        sep="")
    sp <- "  "
  }

  if ((dp.p == -1)  && (dp < 2)) {
    cat(" <<< dp set to", dp, "automatically")
  }
  
  cat("\n", sep="")
}






18 December 2017

A review of the research of Dr. Nicolas Guéguen

Over the last couple of weeks, James Heathers and I have been blogging about some rather strange articles by Dr. Nicolas Guéguen, of the Université Bretagne-Sud in France.  In this joint post, we want to summarise the apparent issues that we have identified in this researcher’s output.  Some of the points we make here have already been touched on (or more) in an excellent article by Cathleen O'Grady at Ars Technica.

There is a lot to read

The first thing one notices in looking at his Google Scholar record is that Dr. Guéguen is a remarkably prolific researcher.  He regularly publishes 10 or more sole-authored empirical articles per year (this total reached 20 in 2015), many of which include extensive fieldwork and the collection of data from large numbers (sometimes even thousands) of participants. Yet, none of the many research assistants and other junior collaborators who must have been involved in these projects ever seem to be included as co-authors, or even have their names mentioned; indeed, we have yet to see an Acknowledgments section in any sole-authored article by Dr. Guéguen.  This seems unusual, especially given that in some cases the data collection process must have required the investment of heroic amounts of the confederates’ time.

Much of Dr. Guéguen's research focuses on human relationships, in what one might euphemistically term a rather "old-fashioned" way.  You can get a flavour of his research interests by browsing through his Google Scholar profile.  As well as the articles we have blogged about, he has published research on such vital topics as whether women with larger breasts get more invitations to dance in nightclubs (they do), whether women are more likely to give their phone number to a man if asked while walking near a flower shop (they are), and whether a male bus driver is more likely to let a woman (but not a man) ride the bus for free if she touches him (we’ll let you guess the answer to this one).  One might call it “Benny Hill research”, although Dr. Guéguen has also published plenty of articles on other lightweight pop-social-psychology topics such as consumer behaviour in restaurants (does that sound familiar?) that do not immediately conjure up images of sexual stereotypes.

Neither Dr. Guéguen’s theories, nor his experimental designs, generally present any great intellectual challenges.  However, despite their simplicity, and the almost trivial nature of the manipulations, his studies often produce effect sizes of the kind more normally associated with the use of domestic cleaning products against germs. Many of the studies also seem to be run on a production line system, with almost every combination of independent and dependent variables being tested (something that was noted by Hans van Maanen in the Dutch newspaper De Volkskrant as far back as 2012).  For example, waitresses get more tips if they have blonde hair or use make-up or wear a red t-shirt, but women wearing red also find it easier to hitch a ride, as do women with blonde hair or larger breasts.  Those same women with blonde hair or larger breasts also get asked to dance more in nightclubs. As well as earning her more tips if she is a waitress, using make-up also makes a woman more likely to be approached by a man in a bar, although her choice to wear make-up might reflect the fact that she is near ovulation, at which point she is also more likely to accept that invitation to dance; and so it goes, round and round.

It seems that some of this research is actually taken quite seriously by some psychologists.  For example, it is cited in recent work by Andrew Elliot and colleagues at the University of Rochester that claims to show that women wear red clothes as a sexual signal (thus also providing a piece of Dr. Guéguen’s IV/DV combination bingo card that would otherwise have been missing). The skeptical psychologist Dr. Richard Wiseman also seems to be something of a fan of Guéguen's work; for example, in his 2009 book 59 Seconds: Think a Little, Change a Lot, Wiseman noted that "Nicolas Guéguen has spent his career investigating some of the more unusual aspects of everyday life, and perhaps none is more unusual than his groundbreaking work on breasts", and he also cited Guéguen several times in his 2013 book The As If Principle.

And of course, as is common for research with themes of sexual attraction and other aspects of everyday human behaviour, these results readily find their way into the popular media, such as the Daily Mail (women are more likely to give their phone number to a man who is carrying a guitar), the Guardian (people drink more in bars with louder music), The Atlantic (men think that women wearing red are more likely to be interested in sex) and the New York Times (customers spend more money in a restaurant if it smells of lavender).

Beyond a joke

But our concerns go well beyond the apparent borderline teenage sexism that seems to characterise much of this research.  A far bigger scientific problem is the combination of extraordinary effect sizes, remarkably high (in some cases, 100%) response rates among participants recruited in the street (cf. this study, where every single one of the 500 young female participants who were intercepted in the street agreed to reveal their age to the researchers, and every single one of them turned out to be aged between 18 and 25), other obvious logistical obstacles, and the large number of statistical errors or mathematically impossible results reported in many of the analyses.

We also have some concerns about the ethicality of some of Dr. Guéguen’s field experiments.  For example, in these two studies, his male confederates asked other men how likely it was that a female confederate would have sex with them on a first date, which might be a suitable topic for bar-room banter among friends but appears to us to be somewhat intrusive.  In another study, women participants were secretly filmed from behind with the resulting footage being shown to male observers who rated the “sexiness” of the women’s gait (in order to test the theory that women might walk “more sexily” in front of men when they are ovulating; again, readers may not be totally surprised to learn that this is what was found). In this study, the debriefing procedure for the young female participants involved handing them a card with the principal investigator’s personal phone number; this procedure was “refined” in another study, where participants who had agreed to give their phone number to an attractive male confederate were called back, although it is not entirely clear by whom. (John Sakaluk has pointed out that there may also be issues around how these women’s telephone numbers were recorded and stored.)

It is not clear from the published reports whether any of these protocols received individual ethical approval, as study-specific details from an IRB are not offered. Steps to mitigate potential harms/dangers are not mentioned, even though in several cases data collection could have been problematic, with confederates dressing deliberately provocatively in bars and so on. Ethical approval is mentioned only occasionally, usually accompanied by the reference number “CRPCC-LESTIC EA 1285”.  This might look like an IRB approval code of some kind, but in fact it is just the French national science administration’s identification code for Dr. Guéguen’s own laboratory.

It is also noteworthy that none of the articles we have read mention any form of funding. Sometimes, however, the expenses must have been substantial.  In this study (hat tip to Harry Manley for spotting it), 99 confederates stood outside bars and administered breathalyser tests to 1,965 customers as they left.  Even though the breathalyser device that was used is a basic model that sells for €29.95, it seems that at least 21 of them were required; plus, as the “Accessories” tab of that page shows, the standard retail price of the sterile mouthpieces (one of which was used per participant) before they were discontinued was €4.45 per 10, meaning that the total cash outlay for this study would have been in the region of €1500.  One would have thought that a laboratory that could afford to pay for that out of petty cash for a single study could also pick up the tab in a nightclub from time to time.

This has been quite the saga

It is almost exactly two years to the day since we started to put together an extensive analysis (over 15,000 words) focused on 10 sole-authored articles by Dr. Guéguen, which we then sent to the French Psychological Society (SFP). The SFP’s research department agreed that we had identified a number of issues that required an answer and asked Dr. Guéguen for his comments. Neither they nor we have received any coherent response in the interim, even though it would take just a few minutes to produce any of the following: (a) the names and contact details of any of the confederates, (b) the field notes that were made during data collection, (c) the e-mails that were presumably sent to coordinate the field work, (d) administrative details such as insurance for the confederates and reimbursement of expenses, (e) minutes of ethics committee meetings, etc.

At one point Dr. Guéguen claimed that he was too busy looking after a sick relative to provide a response, circumstances which did not prevent him from publishing a steady stream of further articles in the meantime.  In the autumn of 2016, he sent the SFP a physical file (about 500 sheets of A4 paper) containing 25 reports of field experiments that had been conducted by his undergraduates, none of which had any relevance to the questions that we had asked.  In the summer of 2017, Dr. Guéguen finally provided the SFP with a series of half-hearted responses to our questions, but these systematically failed to address any of the specific issues that we had raised.  For example, in answer to our questions about funding, Dr. Guéguen seemed to suggest that his student confederates either pay all of their out-of-pocket expenses themselves, or otherwise regularly improvise solutions to avoid incurring those expenses, such as by having a friend who works at each of the nightclubs that they visit and who can get them in for free.

We want to offer our thanks here to the officials at the SFP who spent 18 months attempting to get Dr. Guéguen to accept his responsibilities as a scientist and respond to our requests for information. They have indicated to us that there is nothing more that they can do in their role as intermediary, so we have decided to bring these issues to the attention of the broader scientific community.

Hence, this post should be regarded as a reiteration of our request for Dr. Guéguen to provide concrete answers to the questions that we have raised. It should be very easy to provide at least some evidence to back up his remarkable claims, and to explain how he was able to conduct such a huge volume of research with no apparent funding, using confederates who worked for hours or days on end with no reward, and obtain remarkable effect sizes from generally minor social priming or related interventions, while committing so many statistical errors and reporting so many improbable results.

Further reading

We have made a copy of the current state of our analysis of 10 articles by Dr. Guéguen available here, along with his replies (which are written in French).  For completeness, that folder also includes the original version of our analysis that we sent to the SFP in late 2015, since that is the version to which Dr. Guéguen eventually replied.  The differences between the versions are minor, but they include the removal of one or two points where we no longer believe that our original analysis made a particularly strong case.

Despite its length (around 50 pages), we hope that interested readers will find our analysis to be a reasonably digestible introduction to the problems with this research.  There are one or two points that we made back in December 2015 which we might not make today (either because they are rather minor, or because we now have a better understanding of how to report these issues, thanks to our experience with tools such as GRIM and SPRITE).  Most of the original journal articles are behind paywalls, but none are so obscure that they cannot be obtained from standard university subscriptions.

Nick Brown
James Heathers