I sent the following reply yesterday. It's not perfect; I should have taken some time before sending it off.
Thank you, xxx. I appreciate your follow-up.
I found the analysis from the editorial board member useful and clear in its acknowledgement that there is an issue with the selection of the data for the PCA. However, the board member's view seems to be that, because the authors have (1) documented their use of highly selected data for the PCA, data chosen so as to maximise the differences between the cohorts and the similarities within them, and (2) made the full dataset available for others to analyse, the use of the results to illustrate relationships between the three groups is acceptable. The implication seems to be that readers of the paper will identify the problem and draw their own conclusions about the validity of the findings, and that the journal therefore does not need to take any action to correct it.
I am puzzled why the editorial board member finds the complaint only partially justified. If we agree that the methodology and the resulting finding were wrong, then the fact that an observant reader could work out from the information provided that the methodology was wrong does not make everything acceptable.
A point I believe the board member has not given sufficient weight to is that the paper refers multiple times to the results of the PCA as providing evidence of differences between the three cohorts and of clustering within them, without noting that a random set of data would produce the same outcome. For example:
- "A PCA plot of the fifteen participants’ data showed that the patients of three cohorts formed tightly clustered groups in each case, which were well separated. This implied all members of each cohort, including all the HCs, had similar methylation levels of the fragments to the other members of their cohorts. The separation of the ME/CFS and LC cohorts implied either that there was a difference in the degree of change in the methylation between the ME/CFS patients and LC patients at the same sites, or that there may be methylation changes at specific positions in one of the cohorts but not the other."
- "Principal Component Analysis (PCA) of differentially methylated fragments identified in all patients of the three cohorts showed distinct clustering of the HC, LC, and ME/CFS cohorts, demonstrating that despite LC and ME/CFS having many changes in methylation at genomic sites in common, the global DNA methylation patterns can separate the two disease groups from each other as well, and both are well separated from the HC group."
- In the abstract: "A principal component analysis (PCA) analysing significant methylation changes (p < 0.05) separated the ME/CFS, LC, and HC cohorts into three distinct clusters."
The figure caption does not make the nature of the data selection clear; the comparisons to which the p-value criterion applies are not defined.
I don't think it is safe to assume that all readers will see the problem and understand that randomly distributed data would produce a very similar PCA plot. In my experience, incorrect findings are often cited in other papers and can take on a life of their own as peer-reviewed facts, with scarcely any readers referring back even to the abstract, much less to the detail of the original paper or the dataset.
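If it would help, the circularity is easy to demonstrate with a short simulation (a rough sketch, not the authors' pipeline; I have assumed three cohorts of five, 20,000 fragments of pure noise, and a one-way ANOVA at p < 0.05 as the selection step). Filtering noise for fragments that happen to differ between the cohorts, and then running PCA on only those fragments, yields three well-separated clusters despite there being no real signal at all:

```python
# Rough sketch (not the authors' pipeline): pure-noise "methylation" data,
# preselected for fragments that differ between cohorts at p < 0.05, still
# separates into three distinct clusters under PCA.
import numpy as np
from scipy.stats import f_oneway
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_per_group, n_fragments = 5, 20_000                    # three cohorts of five
labels = np.repeat(["HC", "LC", "ME/CFS"], n_per_group)
data = rng.normal(size=(3 * n_per_group, n_fragments))  # no real group differences

# Selection step: keep only fragments whose one-way ANOVA across the three
# cohorts gives p < 0.05 (roughly 5% of fragments survive by chance alone).
groups = [data[labels == g] for g in ("HC", "LC", "ME/CFS")]
pvals = f_oneway(*groups, axis=0).pvalue
selected = data[:, pvals < 0.05]

# PCA on the preselected fragments: each cohort forms a tight cluster.
scores = PCA(n_components=2).fit_transform(selected)
for g in ("HC", "LC", "ME/CFS"):
    pts = scores[labels == g]
    print(g, "centroid:", np.round(pts.mean(axis=0), 1),
          "within-group spread:", np.round(pts.std(), 1))
```

On a typical run, around 5% of the noise fragments pass the filter and the cohort centroids end up far apart relative to the within-group spread, which is exactly the pattern the paper presents as evidence of biological difference.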
I am particularly concerned that even the authors do not seem to understand that the use of the highly selected data (less than 5% of the total data) for the PCA has led to conclusions about the differences between the groups that are not substantiated by the analysis. Neither in my discussions with them at the preprint stage nor in their subsequent replies to the journal have the authors given any indication that they would not preselect data for a PCA in the same way again. In communications with me at the preprint stage, an author noted that one of the authors was very experienced with data analysis and had previously used this approach in cancer research. I think it is very possible that these authors will go on to apply the same methodology and draw incorrect conclusions from other datasets. Researchers reading this paper may also not see the problem and may apply the same approach themselves.
With respect to this paper, I will be concerned if the conclusions based on the PCA are allowed to stand. If most readers do in fact identify the problem, they will draw conclusions not only about the capability of the authors but also about your journal. There are findings of value in this paper and good work has been done, but the error with the PCA is likely to result in those findings, and even the team, being dismissed.
I find myself in a difficult situation, having started the conversation with an author at the preprint stage to help ensure that their work was well founded. I do not want to diminish this team's ability to do further work on ME/CFS. However, I also do not want errors to stand and potentially be replicated. I think the journal's review process has been part of the problem here; perhaps it is missing other errors in the papers it publishes. I have not seen anything from the journal acknowledging that it made an error in not requiring revision of the paper before publication.
I am not sure of the best way forward; I don't know what would normally happen in this situation. I think the paper contains objective errors and does require revision. I do not believe it is reasonable to conclude that the existing transparency about how the analysis was done will lead most people exposed to the paper's findings to understand that some of its conclusions are unfounded.
Warm regards,