08 December 2017

If you're trying to contact me: My e-mail address got hacked

My e-mail address, nick.brown@free.fr, got hacked earlier this week.  The hosting company "suspended" the account, which doesn't just mean I can't access it or send mails from it; it also means that if you send me a mail you will get a message that the user doesn't exist.

Their procedure for dealing with this is fairly... amazing.  If you read French, it's described here.  I had to send them an e-mail explaining what might have been the reason why I got hacked (a virus or trojan on my PC, a password that wasn't long enough, reusing the e-mail/password combination as the login details for a site that itself got hacked, etc.).  Despite their disclaimer ("Il ne s'agit pas ici de distribuer les bons points et les réprimandes", roughly: "This is not about handing out gold stars and reprimands"), it seems pretty clear that the point of the exercise is to cause people sufficient annoyance that they take more care in future, a bit like a mildly sadistic schoolteacher forcing a student to write 500 words on "why I will not forget to bring my gym clothes in future".  I posted about this in a French online forum and discovered that several other people have been victims of this too. My hosting company is screwing me over 1000 times worse than the hackers.

The account was suspended on Wednesday evening (6 December 2017 around 17:00 UTC, which is now more than 48 hours ago) and I sent the required e-mail straight away, but I haven't heard anything since. The technical support line is always busy, and in any case I don't know if they provide support for e-mail. The address to which I sent my "explanation" was abuse@theirdomain, so it is presumably in the hands of the e-mail server managers.

The problem, of course, is that for many people, losing access to their e-mail has the potential to be economically disastrous.  Yes, we can all do things more securely, but the hackers only sent out a few pieces of spam; the real damage is being done by the company trying to teach me a lesson.  And I have a certain amount of computer knowledge.  How J. Random Customer, who just uses e-mail without thinking about how it works, is meant to respond to that list of points is beyond me.

I don't know how long this will take to sort out. For all I know it could be forever, since the suspension is presumably triggered by an algorithm and I don't know if anyone is there to read mails sent to abuse@theirdomain.  This will be rather boring since I have about 30,000 e-mails in there - pretty much everything I've done for the last five or more years.

As a result of this, I'm starting to move everything over to a new Gmail address, "nicholasjlbrown".  This will take a while; I estimate that I have over 150 accounts with various sites out there that use my e-mail address either as the username or the contact address or both.  So if I ever lose the password to those I will be stuck; plus, if those sites send me a mail and it bounces, they might have a policy of deactivating the account.  So I'm going to have a very boring weekend updating logins (and discovering which sites didn't [yet] bother to implement a mechanism to change your e-mail address; to my surprise, PubPeer is in this category).

If you have been expecting to hear back from me, I might no longer have your address.  This applies in particular to people who have written to me in the last few weeks about things that have come up on this blog, so this post is to apologise in advance and invite you to recontact me at my new Gmail address.

04 December 2017

More problematic sexual attraction research, this time with high heels

Another post about some strange issues in the work of Dr. Nicolas Guéguen. Today's article is:

Guéguen, N. (2015).  High heels increase women's attractiveness. Archives of Sexual Behavior, 44, 2227–2235. http://dx.doi.org/10.1007/s10508-014-0422-z

There are four studies reported in this article; I want to concentrate on Study 4, although as you will see if you read the whole thing, there are plenty of questions one could ask about the other studies as well.

Brief summary of the study
Participants were male customers in bars. The author's hypothesis was that men would be quicker to approach a woman drinking on her own in a bar if she was wearing shoes with high (versus medium or flat) heels.  A female confederate was instructed to sit on her own "at a free table near the bar where single men usually stand" (p. 2231).  She was identically dressed in all conditions apart from the size of her heels, and she was told to "cross her legs on one side so that people around could clearly view her shoes" (p. 2231). Meanwhile, two male observers seated nearby timed how long it took before a man approached the female confederate.  When this happened, she told the man that her friend was expected to arrive shortly, and one of the observers then "arrived" to meet her, thus ending the interaction with the participant.  If no contact was made within 30 minutes, the confederates were instructed to leave the bar.

The results showed that the mean time before a male customer of the bar approached the woman was lower when her heels were higher.  This difference was statistically significant for high heels (versus medium or flat heels), but not for medium heels (versus flat heels).  Although it was not reported whether contact was made in every case, the degrees of freedom of the reported ANOVA imply that it was, even when the woman was wearing flat shoes.

There are a few readily apparent problems with this study.

1. The research design is inefficient and implausible
This study seems to be a very inefficient way of gathering data. You need three young volunteers (it's not exactly clear why two male confederates were necessary rather than just one) to give up their Wednesday and Saturday evenings for six straight weeks. They have to visit three bars each time, and no mention is made of funding to pay for the drinks that they would presumably need to buy in order to maintain their credibility as ordinary customers.  As soon as contact is made between a participant and the female confederate, data collection ends.  The three confederates leave the bar and walk to the next, taking care to spend half an hour on the walk so they don't arrive too early for the next session.  (Or maybe they drive and spend 26 minutes chatting in the car.  Sounds like fun.)  And after all this, you get a maximum of three data points in an entire evening.

Even the choice of "time taken before someone approaches the female confederate" as the dependent variable seems strange.  Let's imagine for a moment that you are the kind of man who goes to bars in the hope of meeting attractive single women. Today is your lucky day; one such individual has just come into the bar and sat down on her own, close to where you hang out with your fellow bachelor drinkers. She is wearing "a skirt and an off the shoulder tight fitting top" (p. 2231). You have to decide whether or not to approach her (presumably before anyone else does, if I may be allowed to show off my limited knowledge of what one might call "folk evolutionary psychology" for a moment).  The apparent claim of the study is that the degree of sexual availability conveyed only by the height of the woman's heels will affect, not whether you ultimately decide to approach her or not, but how long you will hesitate before doing so.  I don't find this very convincing.  What else are all of the single males in the bar (the number of whom, incidentally, is not reported anywhere in the article) thinking about during that time?  Whether they can get a "better deal" if her identical twin appears at the next table wearing slightly higher heels?  See also point 4, below.

2. Repeated use of the same bars
The study took place in each of three different bars on twelve different nights (Wednesdays and Saturdays).  The same female confederate thus made twelve visits to each bar, in each case sitting on her own at a table "near the bar where single men usually stand" (p. 2231).  You might imagine that the staff or the regular customers of the bar might notice what was going on, as a different man each evening attempted to make contact with the same female confederate who was always identically dressed (apart from her heels) and sitting in an area of the bar where one might not expect a woman who was waiting for her boyfriend to feel comfortable, only to be told that her friend would be arriving shortly (which, indeed, transpired every time).  But the article describes no precautions that might have been taken to deal with this issue, which has obvious implications for the validity of the study.  After a few visits, the regular customers might have started taking bets among themselves as to who was going to try his luck this evening (perhaps trying not to giggle as he introduced himself with "Hello, I’ve never seen you here before"), only for the woman's boyfriend to show up immediately afterwards.  Even if the staff were aware of the experiment, it would seem to be hard to take into account the possible range of behaviours of young single men in a bar, especially just before midnight on a Saturday evening.

3. The effect size is huge
Remember that the only difference between the conditions was the height of the woman's heels, which, even with her legs crossed as described in the article, were probably not going to be something that many people --- even single men on the lookout for some action --- would necessarily even notice.  Yet, Cohen's (1988, pp. 274–277) formula gives an effect size (f) of 0.67 for the numbers in Table 4 of Guéguen's article, which corresponds (for k=3 groups) to 1.64 in the more familiar terms of Cohen's d.  Such effect sizes are very rare in psychological studies, and indeed in real life (James will be covering this in his next post).  It seems highly implausible that a manipulation of this kind could have such an effect.
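(For anyone who wants to check that conversion: for two groups, f = d/2, and under Cohen's "Pattern 1" configuration of means --- two at the extremes, the rest at the midpoint --- the general relationship is d = f × √(2k), which appears to be the conversion used here. A quick sketch in Python:)

```python
from math import sqrt

# Cohen's f to d under the "Pattern 1" mean configuration,
# i.e. f = d / sqrt(2k) (Cohen, 1988, pp. 274-277).
f, k = 0.67, 3
d = f * sqrt(2 * k)
print(round(d, 2))  # 1.64
```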

4. The pattern of behaviour by the men is very strange
Despite my advanced age, my personal lifetime experience of hanging around in bars waiting to hit on single women is exactly zero.  However, it seems to me that for individuals who list that particular activity as one of their hobbies, time is probably of the essence.  If you're going to start talking to a girl who has just sat down and crossed her legs so you can see how high her heels are, you probably want to do it fairly quickly, if only to stake your claim before any of your buddies does.

So what would we expect the distribution of the waiting-time-until-contact to look like?  I don't think we can apply something like queuing theory here since the behaviour of the men probably can't be assumed to be random, but I'm guessing it's likely to look like some kind of Poisson or negative binomial distribution, with a lot of guys trying their luck in the first few minutes, resulting in a big right skew.

So I decided to simulate some data.  For each condition, I generated 12-item samples from a uniform distribution, with a minimum of 0 minutes and a maximum that I determined with some preliminary testing to be the largest possible time that could give the mean and SD reported in the article, plus or minus 0.05 in each case.  I ran this simulation until I had 400 samples for each condition, which required about 250 million iterations per condition.  Then I plotted the simulated amounts of time to make contact, to the nearest minute, from those samples:
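In case anyone wants to reproduce this kind of search, here is a minimal sketch of the rejection-sampling procedure in Python. The SD of 2.18 is the high-heels value reported in the article; the target mean (7.5) and the upper bound (14.4) are placeholders I have invented for illustration, and the tolerance is loosened to ±0.15 so that the sketch runs in seconds rather than requiring hundreds of millions of iterations:

```python
import random

def find_sample(target_mean, target_sd, upper, tol=0.15,
                n=12, max_iter=300_000, seed=1):
    """Rejection sampling: draw n-item samples from U(0, upper) until
    one matches the target mean and SD to within tol."""
    rng = random.Random(seed)
    for _ in range(max_iter):
        xs = [rng.uniform(0, upper) for _ in range(n)]
        m = sum(xs) / n
        if abs(m - target_mean) > tol:
            continue  # cheap mean check first, SD only if it passes
        sd = (sum((x - m) ** 2 for x in xs) / (n - 1)) ** 0.5
        if abs(sd - target_sd) <= tol:
            return xs
    return None

# SD 2.18 is from the article's high-heels condition; the mean and
# upper bound are illustrative placeholders, not the article's values.
sample = find_sample(target_mean=7.5, target_sd=2.18, upper=14.4)
```

Even with the loosened tolerance, only about one sample in several thousand is accepted, which gives a feel for why the full-precision search needed so many iterations.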

Given that the high heels were meant to be especially irresistible, you might expect a certain number of contacts to have been made within the first minute in that condition.  But you can see from the plot that in the high heels condition (blue bars), no values below 2 minutes were returned by my simulation.  In fact, when I forced one of the 12 values in the sample to be 30 seconds, I didn't find a single valid sample in 100 million iterations in the high heels condition.  When I set the minimum to 1 minute, I found three valid samples, but they all looked weird: the value of 1 meant that the other values were all very close to 8 minutes (i.e., when the woman was wearing high heels, if one man approached her after a minute, the other 11 would all have had to approach her after 8 minutes, plus or minus a few seconds).

You can also see in the above plot that the aggregate of the simulated values in each condition is nicely normally distributed.  The most highly skewed 12-item samples were not in fact very badly skewed at all; for example, here is the most right-skewed sample out of 400 in the high heels condition:

So even here, we can see that these single men are taking a certain amount of time before talking to the woman, even though their tongues are apparently all hanging out at the height of her heels.  The limiting factor here is that the standard deviations (4.87, 3.67, and 2.18 minutes, for the flat, medium, and high heels conditions, respectively) are too small, relative to the range of values allowed (0 to 30), to allow any of the 12 responses to be very far from the others (or, if one value is a little bit further away, this requires all of the others to bunch up).  As we saw in James's post about dead plants and global warming, the subjects in this study all appear to be intensely moderate in their behaviour; the manipulation (increasing the size of the woman's heels) simply reduces the diversity of that moderation somewhat.

5. The reported statistics are incorrect
Readers who are familiar with some recent corrections of work from the Cornell Food and Brand Lab may have been anticipating this problem: the reported F statistic (7.18 with 2 and 33 degrees of freedom) is incorrect.  With the given means and SDs, the correct F statistic should be between 8.06 and 8.16, depending on rounding.  This does not change the statistical significance of the reported result, but it makes one wonder what numbers were run in order to produce the incorrect F statistic, and where those numbers came from. (Just as an aside, the standard deviations appear to be substantially different between the groups, but no indication is given in the article about whether the standard ANOVA checks for homogeneity of variance were made; however, given the context, perhaps asking for this is like criticising Donald Trump for not having his tie straight.)
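If you want to run this check yourself, the F statistic can be computed from the summary statistics alone. In the sketch below, the SDs are the ones reported in the article, but the means are placeholders that I have made up purely for illustration (the real ones are in the article's Table 4), so the F it produces will not be exactly the 8.06–8.16 mentioned above:

```python
def one_way_f(means, sds, n):
    """One-way ANOVA F from per-group means and SDs (equal group size n)."""
    k = len(means)
    grand = sum(means) / k
    ms_between = n * sum((m - grand) ** 2 for m in means) / (k - 1)
    ms_within = sum(sd ** 2 for sd in sds) / k  # pooled variance, equal n
    return ms_between / ms_within

# SDs from the article; the means here are illustrative placeholders.
F = one_way_f(means=[13.5, 11.5, 7.3], sds=[4.87, 3.67, 2.18], n=12)
print(round(F, 2))  # about 8.6 with these made-up means
```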

The report of this study sounds like it is describing a thought experiment for an undergraduate methods class (in a world where nobody is too concerned about crass sexist stereotypes), rather than the results of a field experiment carried out under real-world conditions. The premise is based on a pastiche of evolutionary psychology (skeptics of this subfield can fill in their own joke here), the scenario is a minefield of strange decisions, the effect size is absurdly large, the implied behaviour patterns of the participants are weird, and the statistics haven't been reported correctly. Yet, this article was the subject of uncritical pieces in Huffington Post (under the headline "High Heels Increase A Woman's Attractiveness, And For Once It's Not A Bogus Survey"), the Boston Globe, and Psychology Today (twice). It seems that there is quite a market for sexist junk science out there.

28 November 2017

Some problems in a field study of sexual attraction and hitchhiking

This is the first in a series of blog posts by me and James Heathers on the research of Dr. Nicolas Guéguen, of the Université Bretagne-Sud in France. We will be examining one of Dr. Guéguen's studies in each post. Cathleen O'Grady at Ars Technica has written an excellent overview of the situation.

We start with this article:
Guéguen, N. (2012). Color and women hitchhikers’ attractiveness: Gentlemen drivers prefer red. Color Research & Application, 37, 76–78. http://dx.doi.org/10.1002/col.20651

Brief summary of the article
The author's hypothesis was that male (but not female) drivers would be more likely to offer a lift to a female hitchhiker if she was wearing a red (versus any other colour) t-shirt. The independent variables were the colour of the woman's t-shirt and the sex of the driver, and the dependent variable was the driver's decision to stop or not.

Participants were drivers on a road in Brittany (western France). A female confederate wearing one of six colours of t-shirt (black, white, red, blue, green, or yellow) stood by the side of the road posing as a hitchhiker (with a male confederate stationed out of sight nearby for security purposes). She noted whether each driver who stopped to offer her a lift was male or female. In order to establish the number of drivers of each sex who drove along the road (whether or not they stopped), two other confederates were stationed 500 metres away in a car by the side of the road, facing the approaching traffic. As each car passed, they noted whether the driver was a man or a woman. Using the count of male vs female drivers who stopped, and the count of male vs female drivers who passed, it was found that male drivers were considerably more likely to stop when the hitchhiker was wearing a red t-shirt compared to any other colour.

There are some puzzling aspects to this article.

1. Number of volunteers
The article states (p. 77) that the five female confederates were chosen by a group of men who rated the facial attractiveness of “18 female volunteers with the same height, with the same breast size (95 cm of bust measurement and bra with a ‘B’ size cup), and same hair color”.  It is interesting to think about how many women there must be in the volunteer pool at the Université Bretagne-Sud in order for 18 with the same height, bra size, and hair colour to put themselves forward to stand for hours on end (see point 4, below) to stop passing drivers.  Once the attractiveness of the participants had been established, the five who were rated closest to the middle of the scale were chosen, and "precautions were taken to verify that the rates of attractiveness were not statistically different between the confederates", whatever that means.  Oh, and "All of the women stated that they were heterosexuals"—presumably to ensure that they gave off the right vibes through the windscreens of approaching cars.

2. Two different sample sizes
There is a curious inconsistency between Table 1 and the main text.  In Table 1, the numbers of male and female drivers are listed as 3,474 and 1,776, respectively.  However, these two numbers sum to 5,250, rather than 4,800 (which was the sample size reported elsewhere in the article, with 3,024 male and 1,776 female drivers).  It is not clear how such an error might creep in by accident, since it requires two very different digits (4 instead of 0 and 7 instead of 2) to be mistyped.
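The inconsistency is easy to verify:

```python
# Driver totals implied by Table 1 versus those implied by the main text.
table1_male, text_male, female = 3474, 3024, 1776
table1_total = table1_male + female   # 5250
text_total = text_male + female       # 4800
print(table1_total, text_total, table1_total - text_total)  # 5250 4800 450
```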

3. The colours of the t-shirts
The article reports the colours of the T-shirts worn by the hitchhikers in very precise terms, even going so far as to give their HSL (Hue, Saturation, Lightness) values.  However, in several cases, those values do not correspond to the reported colours.  As the table here shows, the colour described by the HSL values corresponding to “red” is probably best described as a salmon-pink colour, while “yellow” is a very pale pink, and “blue” is pure white.

[Table not reproduced here: stated colour versus the hex colour derived from the reported HSL values.]

It would be interesting to learn how these HSL numbers were obtained, since several of them are so badly wrong. Indeed, it is not clear why it was considered necessary to report the colours with such precision; it would surely have been enough to state that bright, unambiguous examples of each colour had been selected. For that matter, given how long it must have taken to test so many drivers (see the next point) and that the author had a clear hypothesis about the effects of the colour red, it is not clear why so many different colours of t-shirt were tested.
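Anyone can perform the same check with a few lines of code. The second HSL triple below is purely illustrative (I am not reproducing the article's actual values here); it shows how a nominally "red" hue with high lightness renders as salmon pink rather than red:

```python
import colorsys

def hsl_to_hex(h_deg, s_pct, l_pct):
    # Note: colorsys takes HLS argument order, all components in 0-1.
    r, g, b = colorsys.hls_to_rgb(h_deg / 360, l_pct / 100, s_pct / 100)
    return "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255),
                                        round(b * 255))

print(hsl_to_hex(0, 100, 50))  # #ff0000 -- a true red
print(hsl_to_hex(6, 93, 71))   # #fa7e70 -- a salmon pink
```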

4. How long did all this take?
The article states (p. 77) that “Each hitchhiker was instructed to test 960 drivers. After the passage of 240 drivers, the confederate stopped and was replaced by another confederate.” No indication is given of how long it took for 240 drivers to pass. However, the article also tells us (p. 77) that the research was conducted "at the entry of a famous peninsula of Brittany in France". So perhaps we can get a clue from another Guéguen article:

Guéguen, N. (2007b). Bust size and hitchhiking: A field study. Perceptual and Motor Skills, 105, 1294–1298. http://dx.doi.org/10.2466/pms.105.4.1294-1298

Yes, you read that right. Dr. Guéguen did indeed conduct a study to see whether women with larger breasts get more offers of lifts from men. (I bet you can't guess what the result was.) Anyway, that study (which had a very similar procedure to the one we're discussing here, except that the, er, manipulated (!) independent variable was the apparent size of the female hitchhiker's breasts, rather than the colour of her t-shirt) was conducted "at the entry of a famous peninsula ('Presqu'Île de Rhuys') of Brittany in France". Assuming that the “famous peninsula” mentioned in both studies is the same place (which would make sense, if that location is indeed particularly propitious for lone female hitchhikers), and assuming similar traffic flows to those reported in the "bust size" study, in which the passage of 100 cars took “about 40 to 50 minutes” (p. 1296), we can assume that it took between 1.5 and 2 hours for 240 cars to pass. In order to test 4,800 drivers, then, a total of 30 to 40 hours of testing would be required. The "t-shirt colour" article also states (p. 77) that the experiment “took place during summer weekends on clear sunny afternoons between 2 and 5 PM.” With three hours being available on Saturday and three more on Sunday, the experiment would thus have taken between five and seven complete weekends, assuming that every hour of testing time was sunny (a contingency that is far from guaranteed in Brittany). Yet none of the confederates who gave up multiple weekends to accomplish this Herculean task on behalf of psychological science are listed as co-authors, or even acknowledged in any way, in the resulting article.
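The arithmetic behind these estimates is simple enough to lay out explicitly (the exact figures come out at 32 to 40 hours; "30 to 40" follows from rounding 96 minutes down to 1.5 hours):

```python
# 100 cars per 40-50 minutes, per the "bust size" study (p. 1296).
cars_per_block, total_drivers = 240, 4800
min_per_car = (40 / 100, 50 / 100)

block_minutes = tuple(m * cars_per_block for m in min_per_car)  # (96.0, 120.0)
blocks = total_drivers // cars_per_block                        # 20
total_hours = tuple(b * blocks / 60 for b in block_minutes)     # (32.0, 40.0)
print(block_minutes, blocks, total_hours)
```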

Additionally, the design of the experiment appears to require that only drivers who were alone in their cars should be counted, since the purported effect of the red t-shirt was to increase the sexual attractiveness of the wearer. You might expect that if a male driver's wife is in the car, it could affect any sexually-motivated enthusiasm he might have for offering a lift to the hitchhiker, whatever the colour of her t-shirt; alternatively, a female driver might have been willing to help the hitchhiker had all of the seats in her car not been full of children. We have seen a statement by Dr. Guéguen in which he confirmed that "L’expérience n’incluait effectivement que des personnes seules. Les automobiles avec plusieurs personnes ne sont pas prises en compte dans l’étude" ("The experiment only included people [driving] on their own. Cars containing multiple people were not counted in the study"). So the figure calculated above for the number of hours and days taken to test the required number of drivers needs to be multiplied by some factor to take into account the percentage of cars with multiple occupants. Given that the study was carried out on sunny weekend afternoons in summer in an area with a substantial number of tourists, it seems reasonable that perhaps half of the cars driving along the road on a summer afternoon might have had more than one occupant, which would either double the number of weekends required for collection of the data to between 10 and 14, or in any case more than compensate in our calculations for any growth in local traffic since the "variable bust size" study was conducted.

Another way to think about the time involved is to consider the interactions of the hitchhiker with the drivers who stopped. Even if it took an average of only two minutes to catch up to the car where it stopped (probably some distance along the road from her), introduce herself, explain that there was an experiment taking place, "warmly thank" the driver, and return to her starting point, that would require nearly 20 hours (i.e., more than six afternoons) just for the 579 drivers (450 male, 129 female) who were reported as having stopped, even assuming that in every case a new driver then stopped immediately afterwards. If drivers were only stopping at the rate of one every five minutes overall (12 per hour), it would take 48 solid hours to test 579 drivers.

5. Problems with recording the sex of the drivers
As mentioned in the introduction, there were two observers whose job was to observe every passing car and record its driver's sex. (Per the previous point, it is worth thinking about the challenge of determining whether or not a driver is on their own in the vehicle, which requires, for example, determining whether a car driving past at around 20 metres per second does or does not have a small child in the back seat.) The article states (p. 77) that “[t]he convergence between the two observers’ evaluation was high (r = 0.97)”.

There is a major problem here. In order for a correlation coefficient to be calculated, we need more information than the simple total numbers of male and female drivers. Specifically, the two observers would need to independently record both the sex of each driver and the sequence in which those drivers were observed; for example, with ten drivers and disagreement about the sex of the third, the correlation between MFFFMMMFMM and MFMFMMMFMM would be .80. However, the article reports (p. 77) that each of the observers “used two hand-held counters, one to count the female motorists and the other to count the male motorists”. The term “hand-held counters” suggests simple mechanical devices, such as those used to count attendees at sporting events. But without synchronized timestamps across all four of these counters, or some other form of sequential tracking, it is not possible to establish the order in which the drivers passed each of the observers. More sophisticated methods of collecting and correlating these data can be imagined (for example, using laptop computers), but of course both observers had their hands full with the counters. With just a count of male and female drivers from each observer, stating a correlation coefficient makes no sense. It is therefore entirely unclear how the author could have established the correlation coefficient that he reported.
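To make the ten-driver example concrete, here is the correlation computed from the two sequences (M/F coded as 0/1); note that it is only computable at all because we know the order in which the drivers passed:

```python
def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = (sum((x - mx) ** 2 for x in xs) / n) ** 0.5
    sy = (sum((y - my) ** 2 for y in ys) / n) ** 0.5
    return cov / (sx * sy)

# The two observers' records, differing only on the third driver.
obs1 = [0 if c == "M" else 1 for c in "MFFFMMMFMM"]
obs2 = [0 if c == "M" else 1 for c in "MFMFMMMFMM"]
r = pearson(obs1, obs2)
print(round(r, 2))  # 0.8
```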

In view of the above points, it is not clear how the study can actually have taken place as described in the article. As noted above, we have seen a statement by Dr. Guéguen (with whom we have been in indirect contact for almost two years now, via the good offices of the French Psychological Society, about a number of problems in several of his published articles; more on this to come in a subsequent post) concerning the question of whether only drivers who were on their own were tested. That statement did not, however, provide any specific or relevant answers to any of the other issues about this article that I have discussed here.

[[Update 2017-11-29 22:08 UTC: Added link to Cathleen O'Grady's article. ]]

11 November 2017

Don't stiff people who live from tips

I don't normally blog just because a tweet annoyed me (otherwise I'd be writing several dozen blog posts per day), but this tweet and its ensuing thread touched a nerve.
From his profile and recent tweets, the author seems like the kind of person with whom I probably share a very sizeable percentage of my political and social attitudes.  He follows me on Twitter; I would follow him back, except that I ration my follows simply to try and slow down the firehose of information.  In short, I'm sure he's a nice guy with progressive values.  But it seems that he and some of his followers have a rather different attitude to tipping from mine.

Consider the situation.  You're(*) standing outside a restaurant at a US airport, looking at the menu.  The airline has messed up your connection, so maybe a nice meal will help you feel a little better.  You fancy half a dozen oysters (maybe $15) and then perhaps the steak (maybe $28) and a beer ($7), so that'll be $50 in total.  Plus you'll need to add $5 for tax and $10 for a 20% tip.  So that will cost a total of $65.  Can you afford that?  Yes?  OK, let's go.  Those of us who are lucky enough to be able to afford to eat in restaurants might make similar decisions many times per year (give or take the tax and tip calculations, depending on where we spend most of our time).

Then the meal happens.  I encourage you to read the Twitter thread (it's quite short) to see what happened.  The situation is not entirely black and white, and the details of the author's experience are not especially relevant to my point here, but to sum up, he was not very impressed with the overall level of the service he received. That's OK; disappointment is a normal part of the human condition and experience, especially in consumer and travel situations.

After a while, the check arrives.  It's for $50 plus $5 tax, and it may or may not have "Thank you" written on it by hand, perhaps even with a little smiley face, because scientific research that was totally not underpowered or p-hacked in any way has shown that when female (but not male, as hypothesised in advance by the remarkably prescient authors of that study) servers do this, they get bigger tips.  Remember, you have already budgeted $10 for the tip, but because you were unhappy with the service, you are thinking twice about whether to give it.  For support in that decision, you go on Twitter to ask people how much of that amount they think you should give or withhold.  And, because Twitter solidarity with your friends while they're travelling is a genuinely rather nice thing about the 21st century, within just a few minutes you have several replies:
So, the consensus was that the author should tip 10%, instead of the now-conventional (for the US) 20%.  A couple of people even suggested that he tip 0%, but he settled on 10%.  Under the reasonable (I think) assumptions about his meal that I made earlier, that means he left the waitress about $5 instead of $10.

Had I been watching the proceedings during the 17 minutes between the first tweet and the verdict, my answer would have been: you should withhold nothing.  Tip the 20%.  Give the waitress the full $10 that you presumably budgeted for from the start.  After all, once you decided to walk into the restaurant it was basically a sunk cost anyway.  You can't know all of the reasons why the service was slow, and even if you could somehow establish that it was entirely her personal fault (rather than that of other staff, or the restaurant, none of whom will be affected by your tipping decision), it doesn't matter anyway. Within five minutes of leaving the restaurant the bad service you got will be forgotten forever (not least because you will shortly be waiting in line at the boarding gate, or back on the phone to American Airlines about your connection, dealing with people who do not have the incentive of possibly losing a tip to encourage them to give you better service, ha ha).

There seems to be a pervasive idea in certain parts of the world (mostly North America and the UK) that serving in a restaurant is like being one of those street entertainers who juggle things in front of the cars at red lights.  Indeed, something to reassure you that it's OK to think that way is usually written on the menu in some form: "We do not impose a service charge, as we believe that our customers have the right to reward good service personally".  Well, I've got news for all you people who like to imagine that you are normally skeptical of capitalism: that is pure marketing bullshit.  What it means is, "If our menu prices were 10/15/20% higher, people would be less likely to come inside.  So we make the posted prices lower in order to entice you in, at no risk to us, and we let the staff play a kind of roulette with their income based, essentially, on your mood".  (However, for parties of 6 or more, the restaurant has to add a service charge because waitstaff know that groups are terrible tippers and so would otherwise try to avoid having them seated in their area of the restaurant.)

Think about this: if you were eating at a restaurant in a country where the pricing structure of restaurants is such that tipping is not expected (in some cases, it might even be regarded as slightly offensive), you would probably not go to the manager and request, say, an 8% reduction in the bill because the service was a bit sloppy.  And a big part of the reason why you wouldn't do that is because it would require you to actively do something to justify your claim, versus the far simpler act of not placing the second $5 bill on the plate.  As a result of this (entirely natural) behaviour by customers, waitstaff in countries with a tip-based wage model are essentially incentivised to be both happy-looking and efficient, every minute of their working day.  That is, frankly, an inhuman requirement (try it in your office job for, say, fifteen minutes).

Actually, I can think of a certain kind of person who I would expect to stiff people in these circumstances.  The current archetype of this kind of person has strange blond hair and a fake tan and plays a lot of golf and mouths off a lot about how bad almost everyone else in the world is.  If we were to learn that he tweeted his buddies and told them how much he was going to stiff a waitress, we wouldn't bother spending the energy on rolling our eyes.  There are endless stories about how this individual didn't pay bills sent to himself or his companies, because he decided he didn't like the service he received.

Don't be that person.  Don't, in effect, put working people on piecework rates ("$0.30 per smile") by deciding how much you will tip them based on how perfectly they do their not-particularly-desirable job, simply because the formal rules say that you legally can because the tip is optional.  Be a mensch, as I believe the expression goes.  Eat your oysters, add the going rate for the tip, pay the bill, get on your plane, and don't punish the waitress for working in a messed-up system that pits her against both you and her employer.  If you are the kind of person who can afford to dine on oysters at an airport restaurant prior to getting on a plane, then pretty much by definition $5 means more to the waitress than it does to you.

Here's my personal benchmark (your mileage may vary): I wouldn't withhold a tip unless the situation was sufficiently serious that I would be prepared to complain to the restaurant manager about it.  (For what it's worth, I have never been in a restaurant situation that was so bad that I felt the need to complain to the manager.)  If the server were to, say, cough violently into my food and then carry on in the hope that I didn't notice, then that's not a tipping matter.  But I don't like the idea of micro-managing the ups and downs of other people's workdays through small (to me) sums of money.  It just doesn't feel like something we ought to be doing on the way to building a nicer society.

If you still have a few minutes, please watch this video, where someone a lot more erudite than me does a far better job of explaining the point I wanted to make here. If you're in a hurry, skip to 09:30.

(*) All references to "you" are intended to be to a generic restaurant customer, although obviously the example from the quoted tweets will be salient. I hope Malcolm von Schantz will forgive me for choosing the occasion of his Twitter thread as a reminder that this issue has bothered me for some time and was in a far corner of my "blog ideas back burner".

23 September 2017

Problems in Cornell Food and Brand Lab's replacement "Can Branding Improve School Lunches?" article

Back in February 2017, I wrote this post about an article in JAMA Pediatrics from the Cornell Food and Brand Lab entitled "Can Branding Improve School Lunches?". That article has now been "retracted and replaced", which is a thing that JAMA does when it considers that an article is too badly flawed to simply issue a correction, but that these flaws were not the result of malpractice.  This operation was covered by Retraction Watch here.

Here is the retraction notice (including the replacement article), and here is the supplementary information, which consists of a data file in Excel, some SPSS syntax to analyze it, and a PDF file with the SPSS output.  It might help to get the last of these if you want to follow along; you will need to unzip it.

Minor stuff to warm up with

There are already some inconsistencies in the replacement article, notably in the table.

First, note (d) says that the condition code for Elmo-branded cookies was 2, and note (e) says that the condition code for Elmo-branded cookies was 4.  Additionally, the caption on the rightmost column of the table says that "Condition 4" was a "branded cookie". (I'll assume that means an "Elmo-branded cookie"; in my previous blog post I noted that "branded" could also mean "branded with the 'unknown character' sticker" that was mentioned in the first article, but since that sticker is now simply called the "unknown sticker", "unfamiliar sticker", or "control sticker", I'll assume that the only branding is Elmo.)  As far as I can tell (e.g., from the dataset and syntax), note (d) is correct, and note (e) and the column caption are incorrect.

Second, although the final sample size (complete cases) is reported as 615, the number of exclusions in the table is 420, and the overall sample was reported as 1,040.  So it looks like five cases are missing (1040 - 420 = 620, versus the 615 reported).  On closer inspection, though, we can see that 10 cases were eliminated because they could not be tied to a school, so the calculation becomes (1030 - 420 = 610) and we seem to have five cases too many.  After trawling through the dataset for an hour or so, I identified five cases that had been excluded for more than one reason, each of which was counted twice in the exclusion total.  (Yes, this was the authors' job, not mine.  I occasionally fantasise about one day billing Cornell for all the time I've spent trying to clean up their mess, and I suspect my colleagues feel the same way.)

I'll skip over the fact that, in reporting their only statistically significant results, the authors reported percentages of children choosing an apple that were not the actual percentages observed, but the estimated marginal means from their generalized estimating equations model (which had a more impressive gap between the Elmo-branded apple and the control condition: 13.1 percentage points versus 9.0).  Were those model-based numbers to be cited in a claim about how many more children actually took an apple in the experiment, there would be a problem; but for the moment they are just sitting there.

Now, so far I've maybe just been nitpicking about the kind of elementary mistakes that are easy to make (and fail to spot during proofreading) entirely inadvertently when submitting an article to a journal with an impact factor of over 10, and which can be fixed with a simple correction.  Isn't there something better to write about?

Well, it's funny you should ask, because yes, there is.

The method doesn't match the data

Recall that the design of the study is such that each school (numbered 1 to 7) runs each condition of the experiment (numbered 1 to 6, with 4 not used) on one day of the week (numbered 1 to 5).  All schools are supposed to run condition 5 on day 1 (Monday) and condition 6 on day 5 (Friday), but apart from that, not every school runs, say, condition 2 on Tuesday.  However, for each school X, condition Y (and only condition Y) is run on day Z (and only on day Z).  If we have 34 choice records from school X for condition Y, these records should all have the same number for the day.

To check this, we can code up the school/condition/day combination as a single number, by multiplying the school number by 100, the condition by 10, and the day by 1, and adding them together.  For school 1, this might give us numbers such as 112, 123, 134, 151, and 165.  (The last two numbers definitely ought to be 151 and 165, because condition 5, the pre-test, was run on day 1 and condition 6, the post-test, was run on day 5.)  For school 2, we might find 214, 222, 233, 251, and 265.  There are five possible numbers per school, one per day (or, if you prefer, one per condition).  There were seven schools with data included in the analysis, so we would expect to see a maximum of 35 different three-digit school/condition/day numbers in a summary of the data, although there may well be fewer because in some cases a school didn't contribute a valid case on one or more days.  (I have made an Excel sheet with these data available here, along with SPSS and CSV versions of the same data, but I encourage you to check my working starting from the data supplied by the authors; detailed instructions can be found in the Appendix below.)
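For anyone who would rather check this in code than in a spreadsheet, here is a minimal sketch of the same encoding in Python (with made-up example records standing in for the real dataset):

```python
from collections import Counter

# Hypothetical (school, condition, day) records standing in for the real data.
records = [
    (1, 1, 2), (1, 1, 2), (1, 2, 3), (1, 5, 1), (1, 6, 5),
    (2, 1, 4), (2, 1, 4), (2, 1, 4), (2, 5, 1), (2, 6, 5),
]

def combined_code(school, condition, day):
    """Pack school, condition, and day into one three-digit number."""
    return school * 100 + condition * 10 + day

counts = Counter(combined_code(s, c, d) for s, c, d in records)

# In a clean dataset, each school contributes at most five distinct codes,
# one per day of the week.
for code, n in sorted(counts.items()):
    print(code, n)
```

With real data, you would read the (school, condition, day) triples out of the authors' Excel file first; the frequency table this prints is exactly the pivot table discussed below.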

Let's have a look at the resulting pivot table showing the frequencies of each school/condition/day number.  The column on the left is our three-digit number that combines school, condition, and day, while the column on the right is the number of times it appears.  Remember, in the left-hand column, we are expecting up to five numbers starting with 1, five starting with 2, etc.


You can see from the Grand Total line that we have accounted for all 615 cases in here.  And for schools 5 and 7, we see the pattern we expected.  For example, school 7 apparently ran condition 1 on day 3, condition 2 on day 4, condition 3 on day 2, and (as expected) condition 5 on day 1 and condition 6 on day 5.  School 6 only had a few cases.  (I'm trying to work out if it's meaningful to add them to the model; since we only have cases from one condition, we can have no idea what the effect of the Elmo sticker was in school 6.  But that's a minor point for the statisticians.)

School 3 isn't too bad: the numbers opposite 324 and 325 suggest that condition 2 was run on day 4 with 6 cases, and on day 5 for 1 case.  Maybe that 325 is just a typo.

However, schools 1, 2, and 4 --- which, between them, account for 73.3% of the cases in the dataset --- are a disaster.  Let's start at the top with school 1: 19 cases for condition 1 on day 2, 12 cases for condition 1 on day 3.  Conditions 2, 3, and 5 appear to have been run on three different days in that school.  Cases from condition 1 appear on four different days in school 4.  Here is a simple illustration from the authors' own Excel data file:

Neither the retracted article, nor the replacement, makes any mention of multiple conditions being run on any given day.  Had this been an intended feature of the design, presumably the day might have been included as a predictor in the model (although it would probably have been quite strongly correlated with condition).  You can check the description of the design in the replacement article, or in the retracted article in both its published and draft forms, which I have made available here.  So apparently the authors have either (again) inadvertently failed to describe their method correctly, or they have inadvertently used data that is highly inconsistent with their method as it was reported in both the retracted article and the replacement.

Perhaps we can salvage something here.  By examining the counts of each combination of school/condition/day, I was able to eliminate what appeared to be the spurious multiple-days-per-condition cases while keeping the maximum possible number of cases for each school.  In most cases, I was able to make out the "dominant" condition for any given day; for example, for school 1 on day 2, there are cases for 19 participants in condition 1, five in condition 2, nine in condition 3, and two in condition 5, so we can work (speculatively, of course) on the basis that condition 1 is the correct one, and delete the others.  After doing this for all the affected schools, 478 cases were left, which represents a loss of 22.3% of the data.  Then I re-ran the authors' analyses on these cases (using SPSS, since the authors provided SPSS syntax and I didn't want to risk translating a GEE specification to R):

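The "keep the dominant condition" pruning rule described above can be sketched in a few lines of Python (with invented case counts mimicking school 1; the real exercise, of course, has to be run on the authors' data):

```python
from collections import Counter, defaultdict

# Hypothetical (school, day, condition) case records; school 1's day 2 mixes
# four conditions, as in the real dataset.
cases = (
    [(1, 2, 1)] * 19 + [(1, 2, 2)] * 5 + [(1, 2, 3)] * 9 + [(1, 2, 5)] * 2
    + [(1, 3, 2)] * 14
)

# For each school/day pair, count the cases in each condition.
by_school_day = defaultdict(Counter)
for school, day, cond in cases:
    by_school_day[(school, day)][cond] += 1

# The dominant condition is the most frequent one for that school/day.
dominant = {key: cnt.most_common(1)[0][0] for key, cnt in by_school_day.items()}

# Keep only cases whose condition matches the dominant one.
kept = [c for c in cases if dominant[(c[0], c[1])] == c[2]]
print(len(cases), len(kept))  # 49 33
```

Applied to the whole dataset, this is the procedure that left me with the 478 cases mentioned above.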
Following the instructions in the supplement, we can calculate the percentages of children who chose an apple in the control condition (5) and the "Elmo-branded apples" condition (1) by subtracting the estimated marginal means from 100%, giving us an increase from (1.00 - .757) = 24.3% to (1.00 - .694) = 30.6%, or 6.3 percentage points instead of the 13.1 reported in the replacement article.  The test statistic shows that this is not statistically significant, Wald χ2 = 2.086, p = .149.  Of course, standard disclaimers about the relevance or otherwise of p values apply here; however, since the authors themselves seemed to believe that a p = .10 result is evidence of no association when they reported that "apple selection was not significantly associated ... in the posttest condition (Wald χ2 = 2.661, P = .10)", I think we can assume that they would have not considered the result here to be evidence that "Elmo-branded apples were associated with an increase in a child’s selection of an apple over a cookie".
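That arithmetic is simple enough to verify directly (the estimated marginal means below are the ones from my re-run, quoted above; they are for the cookie choice, so apple = 1 minus the EMM):

```python
# Estimated marginal means for "chose the cookie", from the re-run above.
emm_control = 0.757   # condition 5, control
emm_elmo = 0.694      # condition 1, Elmo-branded apples

apple_control = 1.0 - emm_control   # 0.243, i.e., 24.3% chose an apple
apple_elmo = 1.0 - emm_elmo         # 0.306, i.e., 30.6% chose an apple

diff_points = round((apple_elmo - apple_control) * 100, 1)
print(diff_points)  # 6.3 percentage points, versus the 13.1 reported
```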

The dataset contains duplicates


There is one more strange thing in this dataset, which appears to provide more evidence that the authors ought to go back to the drawing board because something about their design, data collection methods, or coding practices is seriously messed up.  I coded up participants and conditions into a single number by multiplying condition by 1000 and adding the participant ID, giving, for example, 4987 for participant 987 in condition 4.  Since each participant should have been exposed to each condition exactly once, we would expect to see each of these numbers only once.  Here's the pivot table.  The column on the left is our four-digit number that combines condition and participant number, while the column on the right is the number of cases.  Remember, we are expecting 1s, and only 1s, for the number of cases in every row:

Oh. Again. (The table is quite long, but below what's shown here, it's all ones in the Total column.)

Eleven participants were apparently exposed to the same condition twice. One participant (number 26) was assigned to condition 1 on three days (3, 4, and 5).  Here are those duplicate cases in the authors' original Excel data file:

(Incidentally, it appears that this child took the cookie on the first two days in the "Elmo-branded apple" condition, and then the apple on the last day.  Perhaps that's a demonstration of the longitudinal persuasive power of Elmo stickers to get kids to eat fruit.)
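The duplicate check described above can also be sketched in a few lines of Python (invented participant IDs; participant 26 appears three times in condition 1, mimicking the real dataset's worst offender):

```python
from collections import Counter

# Hypothetical (condition, participant) pairs standing in for the real data.
records = [(1, 26), (1, 26), (1, 26), (1, 987), (4, 987), (2, 26)]

# Pack condition and participant ID into one number, as described above:
# e.g., 4987 for participant 987 in condition 4.
codes = Counter(cond * 1000 + pid for cond, pid in records)

# Any count above 1 means a participant saw the same condition twice.
duplicates = {code: n for code, n in codes.items() if n > 1}
print(duplicates)  # {1026: 3}
```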

Conclusion: Action required

Not to put too fine a point on it, this dataset appears to be a complete mess (other adjectives are available, but some may not be suitable for a family-friendly blog such as this one aspires to be.)

What should happen now?  This seems like too big a problem to be fixable by a correction.  Maybe the authors should retract their "retract and replace" replacement, and replace it with a re-replacement.  (I wonder if an article has ever been retracted twice?)  This could, of course, go on for ever, with "retract and replace" becoming "rinse and repeat".  But perhaps at some point the journal will step in and put this piece of research out of its misery.

Appendix: Step-by-step instructions for reproducing the dataset

As mentioned above, the dataset file provided by the authors has two tabs ("worksheets", in Excel parlance).  One contains all of the records that were coded; the other contains only the complete and valid records that were used in the analyses.  Of these, only the first contains the day on which the data were collected.  So the problem is to generate a dataset with records from the first tab (including "Day") that correspond to the records in the second tab.

To do this:
1. Open the file ApamJamaBrandingElmo_Dataset_09-21-17.xlsx (from the supplementary information here).  Select the "Masterfile" tab (worksheet).
2. Sort by "Condition" in ascending order.
3. Go to row 411 and observe that Condition is equal to 4. Delete rows 411 through 415 (the records with Condition equal to 4).
4. Go to row 671 and observe that Condition is equal to "2,3".  Delete all rows from here to the end of the file (including those where Condition is blank).
5. Sort by "Choice" in ascending order.
6. Go to row 2. Delete rows 2 and 3 (records with "Choice" coded as 0 and 0.5).
7. Go to row 617 and observe that the "Choice" values from here onwards are 3, 5, 5, "didn't eat a snack", and then a series of blanks.  Delete all rows from 617 through the end of the file.
8. You now have the same 615 records as in the second tab of the file, "SPSS-615 cases", with the addition of the "Day" field.  You can verify this by sorting both on the same fields and pasting the columns of one alongside the other.
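The manual Excel steps above amount to two filters: drop rows whose Condition is invalid (4, "2,3", or blank) and rows whose Choice is invalid.  Here is a pure-Python sketch of that logic; the field names, and the assumption that the valid Choice codes are 1 and 2, are mine, not the authors':

```python
# Hypothetical rows mimicking the "Masterfile" tab; field names are my own.
rows = [
    {"Condition": 1, "Choice": 1, "Day": 2},
    {"Condition": 4, "Choice": 2, "Day": 3},      # condition 4 not used: drop
    {"Condition": "2,3", "Choice": 1, "Day": 4},  # ambiguous condition: drop
    {"Condition": 2, "Choice": 0.5, "Day": 5},    # invalid choice code: drop
    {"Condition": 5, "Choice": 2, "Day": 1},
    {"Condition": 6, "Choice": None, "Day": 5},   # blank choice: drop
]

VALID_CONDITIONS = {1, 2, 3, 5, 6}
VALID_CHOICES = {1, 2}  # assumption: the apple and cookie codes

clean = [
    r for r in rows
    if r["Condition"] in VALID_CONDITIONS and r["Choice"] in VALID_CHOICES
]
print(len(clean))  # 2
```

Run against the real Masterfile tab, this should leave the same 615 records as the manual procedure, with the "Day" field retained.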

13 September 2017

Now even second-hand books are fake

First off, my excuses for the shameless plug in this post.  I co-edited a book and it's due to appear this week:

Actually, I should say it "was" due to appear this week, because as you can see, Amazon thought it would be available yesterday (Tuesday 12 September).  My colleagues and I have submitted the final proof corrections and been told that it's in production, but I guess these things can always slip a bit.  So Amazon's robots have detected that the book is officially available, but they don't have any copies in the warehouse.  Hence the (presumably automatically-generated) message saying "Temporarily out of stock".

As of today the book costs £140 at Amazon UK, $187.42 at Amazon.com, €202.16 at Amazon.de, €158.04 at Amazon.es, and Amazon.fr don't have a price for it.  This is a lot of money and I can't really recommend that the average student buy a personal copy, although I hope that anyone who is even peripherally interested in positive psychology will pester their librarian to acquire several!  (That said, I've recently reviewed some academic books that are about one-quarter the size and which cost £80-100.  So perhaps our book isn't too bad value in relative terms.  Plus, at 590 pages, you can probably use it to fend off attackers, although obviously this is not professional security advice and you should definitely call the police first.)

But all of that is a bit academic (ha ha) today, because the book is out of stock.  Or is it?  Look at that picture again: "1 Used from £255.22".

Now, I guess it makes sense that a used book would sometimes be more expensive than the new one, if the latter is out of stock.  Maybe the seller has a copy and hopes that someone really, really needs the book quickly and is prepared to pay a premium for that (assuming that a certain web site *cough* doesn't have the full PDF yet).  Wonderful though I obviously think our book is, though, I can't imagine that anyone has been assigned it as a core text for their course just yet (hint hint).  But I guess this stuff is all driven by algorithms, which presumably have to cope with books like this and the latest from J. K. Rowling, so maybe that's OK.

However, alert readers may already have spotted the bigger problem here.  There is no way that the seller of the used book can actually have a copy of it in stock, because the book does not exist yet.  I clicked on the link and got this:

So not only is the book allegedly used, but it's in the third-best possible condition ("Good", below "Like New" and "Very Good").

The seller, Tundra Books, operates out of what appears to be a residential address in Sevilla, Spain.  Of course, plenty of people work from home, but it doesn't look like you could fit a great deal of stock in one of those maisonettes.  I wonder what magical powers they possess that enable them to beam slightly dog-eared academic handbooks back from the future?  Or is it just possible that if I order this book for over £100 more than the list price, I will receive it in, say, about four weeks: the time it takes to order a new one from Amazon.es, ruffle the pages a bit, and send it on to me?

- My attention was first drawn to the "out of stock" issue by my brother-in-law, Tony Douglas, who also painted the wonderful picture on the cover.  Seriously, even if you don't open the book it's worth the price for the cover alone.  (And remember, "it's not nepotism if the work is really good".)
- Matti Heino spotted the "1 Used from £255.22" in the screen shot.