
American College of Rheumatology. Preliminary definition of improvement in rheumatoid arthritis, 1995, Felson et al.

Discussion in 'Other health news and research' started by MSEsperanza, Apr 8, 2022.

  1. MSEsperanza

    MSEsperanza Senior Member (Voting Rights)

    Messages:
    2,857
    Location:
    betwixt and between
    American College of Rheumatology. Preliminary definition of improvement in rheumatoid arthritis.
    Felson DT, Anderson JJ, Boers M, Bombardier C, Furst D, Goldsmith C, Katz LM, Lightfoot R Jr, Paulus H, Strand V, et al. Arthritis Rheum. 1995 Jun;38(6):727-35. doi: 10.1002/art.1780380602.

    Free to read PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1002/art.1780380602

    Abstract

    Objective:
    Trials of rheumatoid arthritis (RA) treatments report the average response in multiple outcome measures for treated patients. It is more clinically relevant to test whether individual patients improve with treatment, and this identifies a single primary efficacy measure. Multiple definitions of improvement are currently in use in different trials. The goal of this study was to promulgate a single definition for use in RA trials.

    Methods:
    Using the American College of Rheumatology (ACR) core set of outcome measures for RA trials, we tested 40 different definitions of improvement, using a 3-step process. First, we performed a survey of rheumatologists, using actual patient cases from trials, to evaluate which definitions corresponded best to rheumatologists' impressions of improvement, eliminating most candidate definitions of improvement. Second, we tested 20 remaining definitions to determine which maximally discriminated effective treatment from placebo treatment and also minimized placebo response rates. With 8 candidate definitions of improvement remaining, we tested to see which were easiest to use and were best in accord with rheumatologists' impressions of improvement.

    Results:
    The following definition of improvement was selected: 20% improvement in tender and swollen joint counts and 20% improvement in 3 of the 5 remaining ACR core set measures: patient and physician global assessments, pain, disability, and an acute-phase reactant. Additional validation of this definition was carried out in a comparative trial, and the results suggest that the definition is statistically powerful and does not identify a large percentage of placebo-treated patients as being improved.

    Conclusion:
    We present a definition of improvement which we hope will be used widely in RA trials.
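
    For anyone who finds the rule easier to follow as pseudocode, here is a minimal sketch of the responder definition given in the Results above (assuming every core set measure is scored so that a lower value means better; the variable names are illustrative, not taken from the paper):

    def pct_improvement(baseline, followup):
        # Percentage improvement relative to baseline (positive = improved).
        if baseline == 0:
            return 0.0
        return 100.0 * (baseline - followup) / baseline

    def acr20_responder(baseline, followup):
        # baseline and followup are dicts holding the 7 ACR core set measures.
        joint_counts = ["tender_joint_count", "swollen_joint_count"]
        remaining = ["patient_global", "physician_global", "pain",
                     "disability", "acute_phase_reactant"]

        # Both joint counts must improve by at least 20% ...
        joints_ok = all(pct_improvement(baseline[m], followup[m]) >= 20
                        for m in joint_counts)
        # ... and at least 3 of the 5 remaining core set measures must also improve by 20%.
        others_ok = sum(pct_improvement(baseline[m], followup[m]) >= 20
                        for m in remaining) >= 3
        return joints_ok and others_ok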
     
    Hutan and Trish like this.
  2. MSEsperanza

    MSEsperanza Senior Member (Voting Rights)

    Messages:
    2,857
    Location:
    betwixt and between
    The issue with open-label trials of treatments for ME that use only subjective outcomes still seems difficult for some proponents of evidence-based medicine to understand.

    In his testimony to NICE, @Jonathan Edwards said:

    "If results are unreliable, they cannot be considered reliable just because it is difficult to get more reliable ones. That this spurious issue is repeatedly raised just emphasises the existence of the difficulty.

    "Moreover, methods for mitigating the difficulty exist. The American College of Rheumatology criteria of improvement for rheumatoid arthritis, rather than summing scores from disparate variables, uses a multiple threshold system so that a single improvement index indicates that key subjective outcomes are corroborated by relevant objective ones.

    "I am not aware that tools such as this have been applied to ME/CFS trials."

    Source: Edwards, J, The difficulties of conducting intervention trials for the treatment of myalgic encephalomyelitis/chronic fatigue syndrome: Expert testimony presented to the NICE guideline committee, 06.09.2019, https://www.nice.org.uk/guidance/GID-NG10091/documents/supporting-documentation-3

    Forum thread here.

    That's an argument that I haven't seen any of the defenders of the criticised trial design engage with.

    How the concept of a multiple threshold system could be adapted for ME has been discussed in different threads on the forum.[*]

    Is this an argument we could also use more in discussions with defenders of the PACE, SMILE, etc. trials?

    Is the original ACR20 still used in the field of Rheumatology?

    How controversial is it?


    [*] e.g. https://www.s4me.info/threads/a-cor...ts-for-and-managing-symptoms-of-me-cfs.17767/


    Edit: Removed a forest of links.

    Edit 2: Links to related forum discussion and some first thoughts now posted here.

    Apologies for the confusion.
     
    Last edited: Apr 9, 2022
    Michelle likes this.
  3. Jonathan Edwards

    Jonathan Edwards Senior Member (Voting Rights)

    Messages:
    13,509
    Location:
    London, UK
    Firstly, let me post this comment from elsewhere to give a bit more background:

    A similar system for ME would be easy enough. We have discussed it in detail in years gone by.

    You need one or more subjective endpoints that are the most direct practical indicators of ill health - fatigue is the obvious one, as long as it is appropriately defined.

    You then want one or more significantly more objective endpoints that can be expected to show changes if the change in the subjective endpoint is meaningful. Actimetry or work status would be obvious examples.

    You then devise a score that defines a level of improvement by requiring a given degree of improvement in both types of measure. The ACR scale proposed that a 20% improvement in each of several measures be called 'ACR20'. An important point is that ACR20 is not a 20% improvement. It is more stringent because you need at least 20% on several measures. In reality for RA it equates to being very usefully better. ACR50 means being so much better you almost think you might be well. ACR70 pretty much means in remission. But this is a nicety. You have a score that includes important symptoms but is robust.
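
    To make that structure concrete, here is a purely hypothetical sketch of what a similar multiple threshold index could look like for an ME trial, assuming one subjective measure (a fatigue questionnaire, lower = better) and one more objective one (actimetry step counts, higher = better). The measure names and the 20% threshold are illustrative assumptions, not an agreed instrument:

    def improvement_pct(baseline, followup, higher_is_better):
        # Percentage change in the helpful direction, relative to baseline.
        if baseline == 0:
            return 0.0
        change = (followup - baseline) if higher_is_better else (baseline - followup)
        return 100.0 * change / baseline

    def me_responder(baseline, followup, threshold=20.0):
        # Subjective endpoint: fatigue questionnaire score (lower = better).
        fatigue_ok = improvement_pct(baseline["fatigue_score"],
                                     followup["fatigue_score"],
                                     higher_is_better=False) >= threshold
        # Objective corroboration: average daily step count from actimetry.
        steps_ok = improvement_pct(baseline["daily_steps"],
                                   followup["daily_steps"],
                                   higher_is_better=True) >= threshold
        # A participant only counts as improved if both thresholds are crossed,
        # so a change in the questionnaire alone is never enough.
        return fatigue_ok and steps_ok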

    Even a multiple threshold score like this is of course influenced by subjective measures. So it is crucially important not to just look at statistical significance, especially in large trials where that is easy enough to get. The difference has to be both statistically significant and clinically relevant. A key feature of the PACE result is that it is unlikely to be clinically relevant even if (magically) it did not reflect bias.
     
    Mithriel, Lilas, FMMM1 and 3 others like this.
  4. Jonathan Edwards

    Jonathan Edwards Senior Member (Voting Rights)

    Messages:
    13,509
    Location:
    London, UK
    Secondly, the merits of the ACR scoring system are complicated, as you suggest, and not necessarily perfectly suited to the ME situation.

    The multiple threshold structure means that scores of different measures are combined in a way that reflects our normal (and rational) algorithmic decision making rather than the illogical pseudo-arithmetic approach of adding things up. If you are buying a hoover you do not add up the features. You say - well, that one has enough power to do what I need, but hang on, it is far too heavy, so I need something else.

    In the case of trying to assess health status when subjective symptoms are paramount, but potentially unreliably reported, you want a system that maximises your confidence that scores reflect a true improvement rather than the charisma of the physio.

    ACR grading is mostly used in the context of blindable treatments. As such, the inclusion of objective elements like CRP is not as crucial as in ME. Moreover, there are physical signs like joint swelling that are pretty objective if measured by a blinded observer. So the collection of measures in ACR scoring reflects a whole lot of different issues about heterogeneity between RA patients and consensus about what measures improve confidence in a change in RA. A change in tenderness might be a good sign of feeling better, but a change in swelling would be a more reliable guide to the improvement being due to a change in the tissue thickening of the RA process, and so on.

    The ACR system will dramatically cut down bias due to systematic distortion of subjective reports. But it does not remove bias totally because the subjective elements are still in there.

    For ME the systematic bias problem is much more acute because there are no signs or tests that are totally objective. So an ACR-type multiple threshold scoring system for ME should reflect the problem of confidence in reliability in a rather different way.

    In some ways it is easier because there are fewer things to measure in ME. RA can be extremely heterogeneous and there are lots of things to measure.
     
    Mithriel, Lilas, janice and 4 others like this.
  5. Jonathan Edwards

    Jonathan Edwards Senior Member (Voting Rights)

    Messages:
    13,509
    Location:
    London, UK
    Just to add that yes, the ACR grading is still the gold standard 25 years later. This probably reflects the fact that it was put together by a small group of people who, knowingly or unknowingly, had a feel for how real decisions are made.

    There is another system that exists alongside it though - the EULAR scoring system, which is much less good and adds things up. That was produced by a huge committee of committee-types who had little idea of real decision-making.

    These days trials tend to collect all data needed for both scores so that either can be used if people want. EULAR may be used as a primary endpoint by some but ACR tends to be the gold standard.
     
    Lilas, FMMM1 and MSEsperanza like this.
  6. Trish

    Trish Moderator Staff Member

    Messages:
    52,324
    Location:
    UK
    This may not be the right place to start a discussion of which subjective measures would be most meaningful to patients, but I would make a plea for it not being fatigue. It means too many different things to different people. And in my experience of slipping down through the severity levels over the years, I would have filled in a fatigue questionnaire much the same when my ME was mild as now when it's severe.


    I think something related to capacity to function would have more meaning to a lot of patients. Maybe SF-36 physical functioning combined with hours per day 'feet on floor' as Dr Bateman suggests, and something about cognitive function equivalent to the physical function questionnaire.
     
    Michelle, Mithriel, Lilas and 3 others like this.
  7. MSEsperanza

    MSEsperanza Senior Member (Voting Rights)

    Messages:
    2,857
    Location:
    betwixt and between
    Here's what I removed from the second post but was then too slow to post in a follow-up comment before people had already replied -- apologies for any confusion:

    If I understood correctly, there are two main reasons why cleverly combined multiple outcome scores are a better instrument than just adding or checking the scores of different outcome measures.

    1) To make sure a clinically significant difference is being measured.

    2) To address bias due to a lack of blinding: in open-label trials you need objective outcomes to corroborate subjective outcomes.

    Ad 1) The prerequisite for this, I think, is that before you decide how to combine the scores, each measure actually measures something relevant. So each measure has to be validated first (does it actually measure the symptom or sign it is supposed to measure?).

    This is discussed with respect to ME/CFS here:

    Critique of the Chalder Fatigue Scale:

    https://www.s4me.info/threads/s4me-...-with-the-chalder-fatigue-questionnaire.2065/


    Critique of the DePaul Questionnaire (PEM measure):

    https://www.s4me.info/threads/s4me-...osed-measure-of-post-exertional-malaise.2220/


    General discussion on patient reported outcomes widely used in research into ME:

    https://www.s4me.info/threads/quest...e-in-me-cfs-research-discussion-thread.24353/


    Our late Graham's video on the SF-36 questionnaire for physical function:

    https://www.s4me.info/threads/video-the-pace-trial-a-short-explanation-graham-mcphee.4669/


    Suggestions on developing better measures are discussed here (for example):

    https://www.s4me.info/threads/a-cor...ts-for-and-managing-symptoms-of-me-cfs.17767/

    https://www.s4me.info/threads/measuring-fatigue-discussion-of-alternatives-to-questionnaires.7325/

    https://www.s4me.info/threads/clini...hich-ones-are-useful-discussion-thread.20003/


    Explicit references to the design of a multiple threshold score for measuring improvement in ME similar to the ACR20 are made here:

    https://www.s4me.info/threads/a-cor...ts-for-and-managing-symptoms-of-me-cfs.17767/

    https://www.s4me.info/threads/pace-trial-tsc-and-tmg-minutes-released.3150/page-7#post-57232

    https://www.s4me.info/threads/measu...ernatives-to-questionnaires.7325/#post-130804

    https://www.s4me.info/threads/measu...ernatives-to-questionnaires.7325/#post-130827


    Ad 2) A general discussion on bias due to a lack of blinding is here:

    https://www.s4me.info/threads/bias-due-to-a-lack-of-blinding-a-discussion.11429/
     
    Mithriel, FMMM1 and Trish like this.
  8. MSEsperanza

    MSEsperanza Senior Member (Voting Rights)

    Messages:
    2,857
    Location:
    betwixt and between
    If I remember properly, Graham's critique of how the SF-36 was used in the PACE trial addressed not only the issues with the altered recovery criteria but was also a general critique of the SF-36 as an instrument for assessing improvement/deterioration?

    I thought it would be interesting to know whether there are treatment trials that used both the ACR20 and the SF-36, and if so, how the scores correlate.

    Interesting. Are you aware of studies looking at how both scores correlate?
     
    Trish and FMMM1 like this.
  9. Jonathan Edwards

    Jonathan Edwards Senior Member (Voting Rights)

    Messages:
    13,509
    Location:
    London, UK
    I don't know the literature but this has been looked at at length. Results with the two systems tend to be similar but not always. The problem is that even if the two systems correlate at 85%, the 15% discrepancy might be exactly what makes a difference statistically significant in a study. Most studies look at mean values of populations, and it is the discrepancy at the individual level that is crucial.
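
    As a toy illustration of that point, the sketch below assumes two responder definitions that agree for 85% of individual patients. At the group level, the 15% of discordant patients dilute the apparent difference between arms, which can be enough to tip a result either side of statistical significance. All numbers are made up:

    import random

    def simulate_trial(n_per_arm=100, seed=0):
        rng = random.Random(seed)
        # Assumed 'true' responder rates under definition A (made-up numbers).
        arms = {"treatment": 0.35, "placebo": 0.20}
        counts = {}
        for arm, p_respond in arms.items():
            responders_a = responders_b = 0
            for _ in range(n_per_arm):
                responds_a = rng.random() < p_respond
                # Definition B agrees with A for 85% of patients, disagrees for 15%.
                responds_b = responds_a if rng.random() < 0.85 else not responds_a
                responders_a += responds_a
                responders_b += responds_b
            counts[arm] = {"definition_A": responders_a, "definition_B": responders_b}
        return counts

    print(simulate_trial())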
     
  10. MSEsperanza

    MSEsperanza Senior Member (Voting Rights)

    Messages:
    2,857
    Location:
    betwixt and between
    But why was there the need to develop a second system in the first place? What is the critique of the gold standard?

    Again, I think it would be worthwhile to discuss the literature that deals with that problem. Perhaps we could get more people interested in the problem of badly done studies on treatments for ME and 'MUS' if we begin by showing how different standards are applied by different people in research into the same illness, e.g. rheumatoid arthritis?

    Presumably there are also non-drug trials for RA, e.g. on surgical interventions, physical therapy etc., and if I remember properly, BPS research on fatigue in RA is already happening too.

    I get the impression that sadly it is very widely accepted among health care professionals and trial experts that it's completely OK to apply lower standards for non-pharmacological treatments, and also that there's nothing wrong with measures like the Chalder Fatigue Scale (and the SF-36, for that matter) for assessing improvement in illnesses that can't yet be objectively diagnosed, as well as in subjective symptoms like pain and fatigue in all other kinds of illness (including in the field of rheumatology).

    Perhaps it would help our critique to be taken more seriously (by people who should know better) if we could get more experts interested in a genuine discussion of the additional problems discussed here: not only the need to include objective outcomes in open-label trials, but also the validation of appropriate subjective outcome measures?
     
    Last edited: Apr 9, 2022
    Michelle likes this.
  11. Jonathan Edwards

    Jonathan Edwards Senior Member (Voting Rights)

    Messages:
    13,509
    Location:
    London, UK
    Politics.
     
    FMMM1 and Trish like this.
  12. Jonathan Edwards

    Jonathan Edwards Senior Member (Voting Rights)

    Messages:
    13,509
    Location:
    London, UK
    I am not sure that there is much literature on this. It is just common sense, and the committees that write about scoring systems tend not to get involved in common sense.

    I don't think it is widely accepted that it is OK to use lower standards - the double-think ensures that those involved think the standards are fine. The majority of rheumatologists do not think most trials of non-pharmacological treatments are worth anything for RA. But we have good treatments for RA. Rheumatologists who specialise in problems where we have no good treatments start double-thinking and rationalising. It is all quite complicated.

    The problem with involving people in discussion is that the people who do not want to understand that their methods are no good will continue to ensure they do not understand. I think the message about poor quality non-pharmacological trials has been spread quite well to the extent it can be but when committees are made up of idiots, as they tend to be, rational argument fails to impact.
     
    FMMM1 and Trish like this.
  13. rvallee

    rvallee Senior Member (Voting Rights)

    Messages:
    12,464
    Location:
    Canada
    Seems like the main common thread is who is involved, the rest comes along as long as the right people work in the right conditions. What those conditions are and what decides who to involve is the hard part, but it really seems like this is the only relevant factor in the end.

    There is an interesting real-world experiment on this. I think it was (or is, I don't know if it's still active) run by the CIA. They basically took some people, mostly at random, presented them with modified classified intelligence, and asked them to assess it, make predictions and give advice. Most people aren't better at it than a coin toss, but some individuals were found to do as well as, if not better than, professional analysts. They mostly seem to just have better than average judgment, as long as the information they get is good enough to assess. Of course if they're misled, it probably doesn't work as well, although they may also do better at recognising it.

    I don't think the features of why some people are better at it were identified, but it seems to fit in nicely with how it usually works in real life. Most people are barely better than a stopped clock at dealing with new information. They can learn and memorise, but will perform only as well as the training they had, how accurate it is, and how well it prepares them to be mentally flexible about integrating new information. The difference between understanding an established scientific theory and coming up with it is huge; there's not much reason this would be any different with simpler concepts.

    So I'm pretty certain that this is the way to go forward, that finding the right people to do this work will produce far better than average results, even when experts are included, especially because experts are sometimes too deeply anchored on past knowledge, and do not necessarily have the mental flexibility to do better than random people.

    I can hardly think of people with less mental flexibility and ability to absorb new information, especially conflicting with what they expect, than our BPS overlords, providing a good example of the other side of this coin: that involving the wrong people almost guarantees worse than average outcomes and involving the worst, most biased people can produce negative outcomes, worse than doing nothing.

    Obviously bias plays a huge role in this. Lots of lip service to it but it's absolutely not taken into account to the level that is needed. The idea that simply declaring bias is good enough says a lot about how misguided things currently are.
     
  14. MSEsperanza

    MSEsperanza Senior Member (Voting Rights)

    Messages:
    2,857
    Location:
    betwixt and between
    Interesting.

    So do you suggest most rheumatologists don't think exercise and physical therapy are effective additional treatments for RA? Or do they just think they don't need trials to prove the beneficial effects?

    I could imagine two main arguments here that may also apply to other chronic illnesses for which good drug treatments have been developed in the last couple of decades. Perhaps rheumatologists think:

    1) The drugs are effective, and if they don't completely cure (all) patients, we just have to improve the drugs or investigate what additional physiological processes could be addressed to develop additional or better drugs.

    2) If patients receive effective drugs and all measurable signs of illness improve but patients still complain about symptoms like pain and fatigue, this has nothing to do with their RA. Not our job to address these issues. Most likely they have some psychological problem or don't move enough or both. Their GPs' job to deal with that.

    Do you think these could be accurate descriptions?

    Could such reasoning explain why clinicians and researchers in all the medical fields entered by psychologists don't care what happens in the psychological and other non-drug research that's done in their medical field, and also don't care which additional treatments 'their' patients receive?

    If yes, what about research into surgery?
     
    Last edited by a moderator: Feb 18, 2023
    Michelle likes this.
  15. MSEsperanza

    MSEsperanza Senior Member (Voting Rights)

    Messages:
    2,857
    Location:
    betwixt and between
    From IQWiG:

    (From skimming, I didn't see any mention of the ACR or EULAR -- still seems relevant to this thread.)


    "#CDAI is the first choice for measuring treatment effects of disease-modifying #antirheumatic drugs. Using a broad evidence base, a team of IQWiG authors compared measurement tools."

    Press release: https://www.iqwig.de/en/presse/press-releases/press-releases-detailpage_84103.html

    Article in BMC Rheumatology:

    Janke, K., Kiefer, C., McGauran, N. et al. A systematic comparison of different composite measures (DAS 28, CDAI, SDAI, and Boolean approach) for determining treatment effects on low disease activity and remission in rheumatoid arthritis. BMC Rheumatol 6, 82 (2022). https://doi.org/10.1186/s41927-022-00314-7



    Abstract

    Background

    Some composite measures for determining the treatment effects of disease-modifying antirheumatic drugs on remission and low disease activity (LDA) in rheumatoid arthritis (RA) may produce misleading results if they include an acute phase reactant (APR). To inform the choice of appropriate measure, we performed a systematic comparison of treatment effects using different composite measures.

    Methods

    We used data generated for a systematic review of biologics in RA conducted by the Institute for Quality and Efficiency in Health Care and data from systematic reviews of newer biologics and Janus kinase (JAK) inhibitors provided by sponsors. The studies included had been conducted up to 2020 and investigated comparisons of biologics with placebo and head-to-head comparisons of biologics. Treatment effects on LDA and remission in studies investigating biologics or JAK inhibitors in RA were compared among 4 composite measures: the disease activity score 28 (DAS 28), the simplified disease activity index (SDAI), the Boolean approach (remission only), and the clinical disease activity index (CDAI)—only the latter does not include an APR.

    Results

    49 placebo-controlled studies included 9 different biologics; 48 studies (16,233 patients) investigated LDA and 49 (16,338 patients) investigated remission. 11 active-controlled studies (5996 patients) investigated both LDA and remission and included 5 different head-to-head comparisons of biologics and 5 different comparisons (6 studies) of biologics with JAK inhibitors.

    Statistically significantly larger treatment effects were found for biologics or JAK inhibitors versus placebo or active control in 16% of pairwise comparisons of composite measures (27 of 168). Most of these larger effects were observed for composite measures with an APR, i.e. the DAS 28 (19 comparisons) followed by the SDAI (n = 7). Larger effects were most frequently detected in favour of interleukin (IL)-6 inhibitors and to a lesser extent for JAK inhibitors versus treatments with different modes of action.

    Conclusions

    The use of the DAS 28 and SDAI in clinical studies may generate results favouring certain treatments based on their mode of action (e.g. IL-6 inhibitors versus other biologics). To enable unbiased comparative effectiveness research, a composite measure without an APR (i.e. the CDAI) should thus be the measure of choice.
     
  16. MSEsperanza

    MSEsperanza Senior Member (Voting Rights)

    Messages:
    2,857
    Location:
    betwixt and between
    From skimming, I think the point of the review posted above (#15) is about what to do if improvements in a certain objective measure don't have clinical significance. The review seems to consider only drug trials that are blindable, but I think it could still be relevant for the discussion on outcome measures for ME/CFS.

    I'm not able to read much at the moment, so I have no idea whether they argue that the observed discrepancy between changes in (one or more?) biomarkers and clinical improvement occurs because not all patients have exactly the same pathomechanism. So if a treatment significantly changes a biomarker, that can mirror a clinical improvement in one group of patients, but not in others?

    Or do they argue that the extent to which the biomarkers change just doesn't reliably relate to the extent of a clinical improvement?

    @Jonathan Edwards

    I had only a quick look at the proposed composite outcome score, but it seems it doesn't include any objectively measurable biomarker at all. The most objective measure seems to be the number of swollen joints, counted by an assessor? So it depends heavily on the person examining the patient.

    https://www.mdcalc.com/calc/2177/clinical-disease-activity-index-cdai-rheumatoid-arthritis


    Link to a PDF that I think also says how to calculate the scores: https://www.rheumatology.org/Portals/0/Files/CDAI Form.pdf
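
    For what it's worth, here is a minimal sketch of the CDAI calculation as I understand it from the linked calculator: a plain sum of four clinical components, none of them a lab value, which is exactly the point at issue above. The category cut-offs are the commonly quoted ones and should be checked against the linked PDF:

    def cdai(tender_joints_28, swollen_joints_28, patient_global_0_10, evaluator_global_0_10):
        # Clinical Disease Activity Index: no acute-phase reactant (CRP/ESR) appears anywhere.
        return tender_joints_28 + swollen_joints_28 + patient_global_0_10 + evaluator_global_0_10

    def cdai_category(score):
        # Commonly quoted cut-offs; check against the linked PDF.
        if score <= 2.8:
            return "remission"
        if score <= 10:
            return "low disease activity"
        if score <= 22:
            return "moderate disease activity"
        return "high disease activity"

    # Example: 4 tender joints, 2 swollen joints, patient global 3.5, evaluator global 2.0
    # gives a CDAI of 11.5, i.e. "moderate disease activity".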
     
    Last edited: Feb 19, 2023
    Michelle likes this.
  17. Jonathan Edwards

    Jonathan Edwards Senior Member (Voting Rights)

    Messages:
    13,509
    Location:
    London, UK
    I have never heard of BMC Rheumatology as a journal. Skimming, this looks like an article pushing a politically correct point about patient symptoms in a common sense vacuum. There is a specific problem with CRP for treatments that lower CRP irrespective of any effect on disease. I doubt that should be a reason for changing indices in other situations.

    I had not actually heard of CDAI. I suspect this is Europeans digging themselves deeper and deeper into Emerson's 'Hobgoblin of small minds' - a certain foolish consistency. DAS28 is the old EULAR score.
     
    MSEsperanza, shak8 and FMMM1 like this.
