This is the draft of the comments I've written up. It got a bit long, so I put some of the less important or more duplicated points in an attached file (there is generally some overlap between these comments and those made by others). Some or all of these points could be added to the EMEC comment or to another comment, or anyone can take any of these points and add them to their own submitted comment (I could also help with that). I can rewrite these comments as required, and I can also find the matching German quotes in the cited documents to make translation easier, as well as provide the relevant PDFs. The comments:
In 2012, a systematic review by Haywood et al. on the quality and acceptability of PROMs (Patient-Reported Outcome Measures) in ME/CFS found that “The poor quality of reviewed PROMs combined with the failure to measure genuinely important patient outcomes suggests that high quality and relevant information about treatment effect is lacking”. In particular, it did not identify any evidence of content validity for two of the main outcome measures used in ME/CFS research, the Chalder Fatigue Questionnaire (CFQ) and the SF-36. [1] Content validity refers to the extent to which an instrument measures the concept of interest; it is assessed through qualitative studies (generally with patients) to determine whether the instrument is an appropriate and comprehensive measure for the relevant concept, population and use. [2]
On page 59, the IQWiG General Methods handbook states that “instruments are required that are suitable for use in clinical trials”, and it refers to the United States Food and Drug Administration (FDA) guidance on Patient-Reported Outcomes. [3] In this guidance the FDA states that the adequacy of a PROM depends on its content validity, and that “Evidence of other types of validity (e.g., construct validity) or reliability (e.g., consistent scores) will not overcome problems with content validity because we evaluate instrument adequacy to measure the concept represented by the labeling claim.” [2] This guidance, endorsed by the General Methods handbook, indicates that a PROM without evidence of content validity is not adequate.
In addition to finding no evidence for the content validity of the SF-36 and the CFQ, Haywood et al. did not find any evidence of validity for other outcome measures used in the IQWiG report, such as the PEM survey (feeling ill after exertion), the Work and Social Adjustment Scale (WSAS, used for social participation), and the Clinical Global Impression Scale (CGI, used for general complaints). [1] For the WSAS, I can identify one study that assessed construct validity in ME/CFS, [4] but none assessing content validity. For the CGI and the PEM survey, I cannot find any study assessing validity in ME/CFS. For the SF-36 and the CFQ, I cannot find any studies assessing content validity in ME/CFS. One study stated that it assessed content validity for the SF-36; however, it did not assess content validity as defined by the FDA. [5]
If sufficient evidence exists for the content validity of these scales in ME/CFS, it should be cited in the IQWiG report, and the adequacy of these scales should be justified in accordance with the FDA guidance. The validity of the Sickness Impact Profile 8 was considered (page 134); the same should be done for the other PROMs. Conclusions should not be drawn on the basis of PROMs without any evidence of content validity.
In addition to the lack of evidence of validity, potential ceiling or floor effects have been detected for the CFQ and the SF-36. [1,5,6] This means that patients are scoring at or close to the maximum symptom severity value, so a patient could experience a decline that the PROM cannot detect. The FDA guidance also notes that floor and ceiling effects should be avoided. [2]
Page 152 of the report states: “If ME/CFS-specific adverse events, in particular PEM, had occurred more frequently due to the intervention, it can be assumed that this would also have had a negative effect on the observed results and that no advantages, for example in the endpoint fatigue, could have been identified.” This cannot be assumed, given the poor quality of the PROMs, the possible ceiling/floor effects, the lack of blinding, and the relatively short follow-up. Furthermore, the results were heterogeneous: at 15 months, for instance, GETSET found a statistically significant effect on fatigue in favor of the control group over GET (page 97).
Page 58 of the General Methods handbook also states: “The size of the effect observed is an important decision criterion for the question as to whether an indication of a benefit of an intervention with regard to PROs can be inferred from open studies.” Page 164 further notes that consideration of risk-of-bias issues, such as a lack of blinding, goes beyond the purely formal assessment. The effect sizes observed in the report have been modest. On page 174 of the General Methods, for instance, a difference of 15% of a scale's range is identified as a plausible threshold for a small but noticeable change, and both the mean differences and the lower bounds of their confidence intervals fall below this threshold. The modesty of these effect sizes should have been considered when deciding whether a conclusion about benefit could be drawn, especially in light of the questionable quality of the PROMs.
Page 171 of the General Methods handbook states: “The problem of multiplicity cannot be solved completely in systematic reviews, but should at least be considered in the interpretation of results”. There is no indication that this was considered in the review. The number of outcomes considered in the report substantially raises the likelihood that some will appear to show clinically relevant effects. In the case of GET vs SMC (page 150), for instance, where over 10 different outcomes were considered across three time periods, finding two outcomes with an apparently clinically relevant effect may simply be a matter of chance.
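A rough back-of-the-envelope calculation illustrates the scale of the multiplicity problem. The figures used here (10 outcomes at 3 time points, i.e. 30 comparisons, each with an assumed independent 5% chance of a spurious positive finding) are illustrative assumptions for the sketch, not values taken from the report:

```python
from math import comb

# Illustrative assumptions (not taken from the IQWiG report):
# 10 outcomes assessed at 3 time points = 30 comparisons, each with
# an independent 5% chance of a spurious positive finding.
n, alpha = 30, 0.05

# Expected number of spurious findings across all comparisons
expected = n * alpha  # 1.5

# Probability of at least two spurious findings arising by chance alone,
# via the binomial distribution: 1 - P(0 spurious) - P(1 spurious)
p_at_least_two = 1 - sum(comb(n, k) * alpha**k * (1 - alpha)**(n - k)
                         for k in range(2))

print(f"Expected spurious findings: {expected:.1f}")
print(f"P(>=2 by chance alone):     {p_at_least_two:.2f}")  # roughly 0.45
```

Under these assumptions, about 1.5 spurious findings would be expected, and the chance of seeing at least two is close to one in two, which is why two apparently relevant effects among many comparisons carry limited evidential weight.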
1. Haywood KL, Staniszewska S, Chapman S. Quality and acceptability of patient-reported outcome measures used in chronic fatigue syndrome/myalgic encephalomyelitis (CFS/ME): a systematic review. Qual Life Res. 2012;21(1):35-52. doi:10.1007/s11136-011-9921-8
2. Patient-Reported Outcome Measures: Use in Medical Product Development to Support Labeling Claims. U.S. Food and Drug Administration; 2009. Accessed November 8, 2022. https://www.fda.gov/media/77832/download
3. General Methods Version 6.1. IQWiG; 2022. Accessed November 8, 2022. https://www.iqwig.de/methoden/general-methods_version-6-1.pdf
4. Cella M, Sharpe M, Chalder T. Measuring disability in patients with chronic fatigue syndrome: reliability and validity of the Work and Social Adjustment Scale. J Psychosom Res. 2011;71(3):124-128. doi:10.1016/j.jpsychores.2011.02.009
5. Davenport TE, Stevens SR, Baroni K, Mark Van Ness J, Snell CR. Reliability and validity of Short Form 36 Version 2 to measure health perceptions in a sub-group of individuals with fatigue. Disabil Rehabil. 2011;33(25-26):2596-2604. doi:10.3109/09638288.2011.582925
6. Morriss R, Wearden A, Mullis R. Exploring the validity of the Chalder fatigue scale in chronic fatigue syndrome. J Psychosom Res. 1998;45(5):411-417. doi:10.1016/S0022-3999(98)00022-1