[DISCLAIMER: The opinions expressed in my posts are personal opinions, and they do not reflect the editorial policy of Social Psychological and Personality Science or its sponsoring associations, which are responsible for setting editorial policy for the journal.]
The following is a guest post by Alexa Tullett.
Review Submission Background
I was invited to review “When Both The Original Study and Its Failed Replication Are Correct: Feeling Observed Eliminates the Facial-Feedback Effect.” A day before the review was due, I went to complete it and realized that the link to the pre-registration wasn’t working. I emailed the journal and asked if I could get access to the pre-registration before I completed my review. Three days later, the handling editor, Dr. Kitayama, forwarded me the updated pre-registration link. Three days after that (I had not yet submitted my review), I got an email saying that the handling editor was moving forward to make a decision without my review. I emailed the journal and asked whether it would be okay if I submitted my review by the end of the day, and the peer review coordinator, Charlie Retzlaff, said that would be fine. I did so, emailing my review to Charlie because the paper was no longer in my reviewer center. Three days after that, I emailed Dr. Kitayama and asked if he had received my review. He replied that he had already made his decision but would forward my review to the authors. When forwarding my review to the authors, Dr. Kitayama asked that they attend to all issues raised.
Review of “When Both The Original Study and Its Failed Replication Are Correct: Feeling Observed Eliminates the Facial-Feedback Effect.”
In this manuscript the authors provide a theoretical explanation for a failed replication of the facial-feedback effect. They suggest that the original study found the effect while the replications did not because participants in the original study were not being observed, whereas in all of the replication studies participants were told that they were being monitored by video camera. The authors conduct a test of this hypothesis and conclude that the presence of a camera moderated the facial-feedback effect, such that the effect emerged when the camera was absent but not when it was present. I think that conducting an empirical test of a hidden-moderator explanation for a widely publicized Registered Replication Report (RRR) is a terrific idea. As it stands, however, I found the evidence provided in the present study fairly weak and largely inconclusive. The combination of low statistical power to detect the key interaction, a p value that falls just above the conventional threshold (p = .051), and some apparent flexibility in data-analytic decisions weakens the evidentiary value of the present study. I elaborate on these points below.
Major Points
- The authors calculate that they would need 485 participants to achieve 80% power to detect an interaction, which is the key prediction (i.e., a moderation effect). However, they decide to use a sample of 200, which becomes 166 after exclusions (note that the authors’ pre-registered plan was to replace excluded participants, but they did not do so because they finished collecting data on the last day of the academic year). In justifying their sample size, the authors write “we opted to align the number of participants in our study with the replication study, namely, based on the power to detect the simple effects… [remainder of quote redacted*].” This seems to reflect a misunderstanding about statistical power: having higher statistical power in the present study does nothing to undermine comparisons to the replication studies; it simply allows a more precise estimate of effect size (whether simple or interactive) in the present study. This can only be an advantage. Moreover, it is the comparisons within the current sample that are critical to testing the authors’ research question, not comparisons between the current sample and the previous replications. The decision to go with lower statistical power comes at the cost of underpowering the authors’ main analysis (i.e., the interaction between condition and expression); a rough sketch of this power shortfall appears after these points. Indeed, the interaction is not significant (p = .051), and thus the results do not technically support the authors’ hypothesis within an NHST framework.
- I very much appreciated that the authors pre-registered their methodology and data analysis plan and made this information publicly available on OSF. I still have some concerns, however, about remaining flexibility in the data analysis. Twenty participants were excluded because they reported during the debriefing that they did not hold the pen as instructed. How was this decision made? Given that this represents 10% of the sample, it would be helpful to know whether there are multiple reasonable decisions about who should be excluded and how much the results vary depending on these decisions. Also, the authors made some exclusions that were not pre-registered: 2 participants who did not agree to use the pen as instructed, and another 4 who suspected the cover story about video recording. Although these exclusions seem reasonable, it would also seem reasonable to me to include these people. This potential flexibility in exclusion criteria has implications for interpreting the p value (.051) for the main analysis; a sketch of the kind of sensitivity check I have in mind appears below. Although this number may seem so close to the p = .05 cutoff that it should be counted as significant, if this “marginal” p value depends on making one out of several reasonable decisions about exclusions, then the results may be weaker than they appear.
*Because the remainder of this quote differed from the published paper, I redacted it from my review to preserve the confidentiality of the review process.
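
To make the power concern concrete, here is a minimal sketch of the shortfall, assuming a between-subjects 2 × 2 design, treating the interaction as a 1-df contrast (via k_groups=2), and working in Cohen’s f. The effect size is not taken from the paper; it is simply backed out of the authors’ own “485 participants for 80% power” figure, so everything below is illustrative rather than a reconstruction of their calculation.

```python
# Rough sketch (not the authors' calculation): back out the interaction effect
# size (Cohen's f) implied by "N = 485 for 80% power", then ask what power the
# post-exclusion sample of 166 gives for that same effect. The 1-df contrast
# approximation of the 2x2 interaction test is an assumption of this sketch.
from statsmodels.stats.power import FTestAnovaPower

analysis = FTestAnovaPower()

# Effect size implied by the authors' own power target (solve for effect_size).
implied_f = analysis.solve_power(nobs=485, alpha=0.05, power=0.80, k_groups=2)

# Power available for that same effect at the sample size actually analyzed.
power_at_166 = analysis.solve_power(effect_size=implied_f, nobs=166,
                                    alpha=0.05, k_groups=2)

print(f"implied interaction effect size f ~ {implied_f:.3f}")
print(f"power for that effect at N = 166  ~ {power_at_166:.2f}")  # well below .80
```

Run as written, the second number lands well below the conventional 80% target, which is the gap the first point is about.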
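
And here is the kind of exclusion sensitivity check the second point has in mind: refit the key condition × expression model under each defensible exclusion rule and see how the interaction p value moves. The data are simulated stand-ins and every variable name is hypothetical; nothing below comes from the authors’ materials.

```python
# Hypothetical sensitivity check over exclusion decisions. All variable names,
# exclusion rates, and data are simulated stand-ins, not the authors' materials.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "expression": rng.choice(["pout", "smile"], size=n),
    "camera": rng.choice(["absent", "present"], size=n),
    "rating": rng.normal(5, 2, size=n),
    "held_pen_correctly": rng.random(n) > 0.10,  # ~10% pen-use exclusions
    "agreed_to_pen": rng.random(n) > 0.01,       # ~the 2 non-agreement exclusions
    "suspected_cover": rng.random(n) < 0.02,     # ~the 4 suspicion exclusions
})

# Each entry is one defensible way of applying exclusions.
exclusion_sets = {
    "pre-registered (pen use) only": df[df.held_pen_correctly],
    "+ non-agreement":               df[df.held_pen_correctly & df.agreed_to_pen],
    "+ suspicion of cover story":    df[df.held_pen_correctly & df.agreed_to_pen
                                        & ~df.suspected_cover],
    "no exclusions":                 df,
}

# Refit the key model under each exclusion set and report the interaction p value.
for label, d in exclusion_sets.items():
    fit = smf.ols("rating ~ expression * camera", data=d).fit()
    p = fit.pvalues["expression[T.smile]:camera[T.present]"]
    print(f"{label:30s} N = {len(d):3d}   interaction p = {p:.3f}")
```

If the interaction p value stays in the same neighborhood across all four rows, the .051 result looks sturdier; if it jumps around the .05 boundary, that is exactly the fragility the review worries about.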