A Primer on Alternative Study Designs for Evidence-based Practice: Harnessing Natural Variation for Effectiveness Research
This Primer is based on presentations during a conference titled: Alternative Study Designs for Evidence-based Practice: Harnessing Natural Variation for Effectiveness Research
Principal Investigator: Peter I. Buerhaus, PhD, RN, FAAN
Team Members:
Susan Horn, PhD
Brenda Cornett
Jennifer Smith
Roberta James, MStat
Organization: Vanderbilt University School of Nursing
Inclusive Dates of Project: October 20 – 21, 2005
Federal Project Officer: Milford Henderson
Agency Sponsors:
Dept of Health & Human Services
Agency for Healthcare Research & Quality (AHRQ)
National Center for Medical Rehabilitation Research (NCMRR)
National Institutes of Health
Interagency Committee on Disability Research (ICDR)
National Institute of Child Health & Human Development (NICHD)
Pharmaceutical Research and Manufacturers of America (PhRMA)
Vanderbilt University School of Nursing
Institute for Clinical Outcomes Research, Salt Lake City, Utah
Award #: 1 R13 HS015954-01
Abstract of Conference
Purpose: To discuss and refine alternative study designs to randomized controlled trials (RCTs) for effectiveness research and clinical decision-making; and to expand infrastructure for conducting clinical research within the healthcare delivery system by increasing knowledge about rigorous alternative designs among researchers and policymakers.
Scope: Alternative study designs to determine comparative effectiveness of treatments for clinical decision-making using analyses of existing administrative databases, MDS data from long-term care settings, registries, etc.; quasi-experimental designs, before-after or interrupted time series designs, longitudinal designs and cross-sectional designs; and Clinical Practice Improvement (CPI) study designs.
Methods: Conference format over 1.5 days held in Washington, D.C., October 20-21, 2005. Combination of plenary presentations by invited experts, workgroup discussions, and workgroup reports of key issues and recommendations. Participants (n = 96) included health services researchers, behavioral researchers, study design experts, clinicians, institutions, foundations, voluntary associations, health plans, journal editors, and policymakers in Federal, State, and local governments.
The following is a list of conference speakers and workgroup leaders with
their organizational affiliations:
| NAME | SPEAKER / WORKGROUP LEADER | TOPIC | ORGANIZATIONAL AFFILIATION |
| Peter Buerhaus, PhD, RN | Speaker | Welcome and Introduction | Vanderbilt University School of Nursing |
| Susan D. Horn, PhD | Speaker | Clinical Practice Improvement (CPI) study design | Institute for Clinical Outcomes Research |
| Carolyn Clancy, MD | Speaker | Greeting and funder perspective | Director, Agency for Healthcare Research and Quality |
| Steven Tingus, MS, C.Phil | Speaker | Greeting and funder perspective | Director, National Institute on Disability & Rehabilitation Research |
| Michael Weinrich, MD | Speaker | Greeting and funder perspective | Director, National Center for Medical Rehabilitation Research |
| Gerben DeJong, PhD | Speaker, Moderator & Workgroup Leader | Introduction; Moderator of CPI workgroup | National Rehabilitation Hospital |
| Kelly Cronin, MPH | Speaker | Greeting and payer perspective | Centers for Medicare & Medicaid Services |
| Scott Gottlieb, MD | Speaker | Greeting and FDA perspective | Food & Drug Administration |
| Andrew Kramer, MD | Speaker | Administrative Databases | University of Colorado |
| Sharon-Lise Normand, PhD | Speaker | Quasi-experimental designs | Harvard Medical School |
| David Helms, PhD | Speaker | Health services research perspective | AcademyHealth |
| Robert Rhodes, MD | Speaker & Workgroup Leader | Surgery Board perspective | American Board of Surgery |
| Marcel Dijkers, PhD | Workgroup Leader | Moderator of Administrative Databases workgroup | Mount Sinai School of Medicine, New York |
| Ruth Brannon, MSPH, MA | Workgroup Leader | Moderator of Quasi-Experimental Designs workgroup | National Institute on Disability & Rehabilitation Research |
| Arthur Hartz, MD, PhD | Workgroup Leader | Moderator of Quasi-Experimental Designs workgroup | University of Iowa |
| Nancy Bergstrom, PhD, RN | Workgroup Leader | Moderator of CPI workgroup | University of Texas at Houston |
A Primer on Alternative Study Designs for Evidence-based Practice: Harnessing Natural Variation for Effectiveness Research
In a recent John Eisenberg Lecture, Don Berwick called for a broader health services research agenda and the development and application of new research methods to support this agenda. “The challenge is to discover what we need to know that we do not now know in order to create much more effective systems of care.”1 He argues, “Health services research has not yet been sufficiently helpful in meeting the challenge of improving care in part because it has over-constrained both its methods and its favorite topics. The cost of insisting on formal, classical, summative, evaluative experimental designs [randomized controlled trials (RCTs)] in an uncertain, poorly understood, nonlinear, system is, unfortunately, to maintain the status quo….Health services research should become more effectively part of the solution. To do that will require that we enrich our portfolio of methods and broaden our agenda of inquiry. The scientific methods that we need to enhance and dignify in academic settings will combine formal classical methods with some pragmatic, immediate, and in many ways more informative forms of learning and investigation.” 1
We need alternative study designs that produce pragmatic, practice-based evidence that is useful and acceptable for practice and policy purposes. A confluence of factors is driving the need for better evidence to improve clinical practice, that is, for better knowledge regarding effective practice, concern for costs and cost-effectiveness (value), quality, patient safety, and equity. Better evidence will come from research that is clinically and practically applicable and generalizable. The traditional emphasis on internally valid research ignores the requisites of sound generalization, external validity, effectiveness, and utility in practice. Evidence-based medicine is “the integration of best research evidence with clinical expertise and patient values.” 1 In order to improve patient outcomes and population health we must move beyond generalizations based on belief and use new methodologies that are appropriate to clinical practice improvement.
Performing the “best research” requires identifying the best research methodology to answer a given question. Evidence does not always flow from the laboratory to clinical practice; it can also be discovered in the study of clinical practice. This Primer addresses the issue of “best research evidence” and how to integrate it with clinical practice and values.
As researchers, we have failed to embrace adequately the clinical experience that exists among front-line clinicians. There is a need to overcome gaps in dissemination that prevent translation of knowledge gained in research to practicing clinicians. Many research methodologies involve clinicians only at the periphery. Research designs that incorporate clinicians’ practical knowledge throughout every step of the process may increase our ability to transform research findings into practice.
Perfect evidence is an illusion―a useful motivating belief, but not the criterion for practice decisions. What we need is good, relevant, reliable information on what most probably is effective, safe, and worthwhile. An insistence on perfect evidence has led to an absence of good evidence that can be used to guide actual practice and policy.
There have been many methodological advances in experimentation, research design, and statistical modeling in the last 30 years. It is time for the health services community to make better use of these improved designs. Sophisticated alternative research designs have been developed that are as rigorous as RCTs and as able to demonstrate causality. Other study designs are weaker, but in certain situations can provide the best possible evidence given practical and ethical considerations. These powerful research designs often go unused because of a lack of understanding and agreement on what constitutes strong alternative research designs and on the circumstances and problems for which they are best suited.
The purpose of this Primer is to provide an overview of new developments in alternative, pragmatic, practice-based evidence research designs, including non-experimental research methods, and to elucidate those methods that are relevant to various tasks as well as limitations of some conventional approaches, including many RCTs. Specifically, we will 1) describe continuing developments in strong, convincing, quasi-experimental research designs; 2) describe improvements in correlational research designs that allow stronger causal inference; and 3) distinguish these designs from poorly controlled observational comparisons (which can be the best research option in some circumstances). The information presented here is intended for researchers, hospital and physician practice groups, grant reviewers, policy makers, funding organizations, and journal editors.
This Primer discusses four types of research designs, which are not mutually exclusive. The first type is the randomized controlled trial, the gold standard of medical research. The second focuses on study designs that use administrative databases. At the conference, Dr. Andrew Kramer from the University of Colorado drew upon his years of research experience to address these issues. Administrative database study designs take advantage of large, existing administrative databases such as the Medicare Provider Analysis and Review (MEDPAR), Healthcare Cost and Utilization Project (HCUP), Minimum Data Set (MDS) for nursing homes, and many others. These databases have been used to examine specific treatment methodologies, health services, and provider characteristics.
The third type of research that we address is quasi-experimental designs, a broad category of research designs that includes before and after designs, longitudinal designs, interrupted time series, and systematic treatment designs. At the conference, Dr. Sharon-Lise Normand from Harvard University led this discussion.
Our fourth research design is practice-based evidence for clinical practice improvement (PBE-CPI, or PBE for short in the rest of this Primer), which has been championed by Dr. Susan D. Horn. PBE examines three sets of factors and the interactions among them. The first is patient factors. Patient factors, such as case mix classifications and severity of illness measures, can be used to control for differences in populations. The second is process factors: treatments, interventions, and medications, as well as the entire process of care, including the management and payment strategies in place. PBE examines combinations of patient and process factors in order to identify their association with outcomes, which are the third set of factors. Outcomes include clinical outcomes, such as health status of the patient, as well as measures like cost, length of stay, and number of encounters. Utilization indicators can function as independent or dependent variables. PBE brings a new level of rigor to this body of research designs. At the conference, Dr. Horn described PBE study designs.
Some areas of health care research have adopted alternative research designs more quickly than others, even though RCTs cannot always accurately reflect real world situations. Service systems are complex and adaptive; they do not provide a solitary intervention, such as a pill or a new device. Examination of an intervention in the real world requires assessing the entire system within which the intervention is delivered.
The health services research field has made limited use of a comprehensive or trans-disciplinary approach, which brings the best of all disciplines together in an active, participatory way. Another aspect that is often overlooked in designing research studies is the clinical experience of front-line clinicians. Much can be gained from harvesting their valuable experience and making it an integral part of research, not only in the research design, but also in the research process itself. Involving front-line clinicians throughout the entire process facilitates clinical buy-in and knowledge transfer.
One limitation of many research designs is that the clinicians who are participants in the study to one degree or another are not involved at every stage. Thus, by the time the study is completed, the clinicians may not believe the final result, which slows the implementation of research findings. We need to explore how to foster clinical buy-in that transforms the treatment community into effective advocates for research findings. There is a great gap between science and actual practice. This is due partially to our failure to engage our clinician colleagues in the entire research process, to encourage them to become advocates for the findings, and then to implement those findings.
Education is necessary but not sufficient for practice change to be implemented and sustained. Education is not merely a matter of knowledge transfer or educating clinicians through continuing education programs, but a matter of engaging clinicians throughout the entire research process.
The field of health services has been slow to adopt standardized documentation. We continue to use different kinds of instruments to evaluate patients. For example, in post-acute care there are many patient assessment instruments; lack of crosswalks between them inhibits comparative work.
The healthcare system and the policymakers and decision makers who drive it are finally coming to terms with chronic illness versus acute illness. Methods such as RCTs work for assessing new drug products or for assessing new interventions in care. For example, they can determine whether new intensive care units reduce mortality in people who have had an acute MI. Yet RCTs are unable to identify which interventions help people with disabilities lead more productive, independent lives.
To be useful, research has to be timely, convincing to providers, valid in practice (not only in controlled settings), and practical to implement. Until now, interventions often have been under-defined. We need to define specifically all the steps of the intervention and document them adequately. Many interventions have been recommended on the basis of reviews of sets of studies, but these reviews have not adequately specified the intervention, making it difficult to recommend implementation for practice.
In response to these challenges, Don Berwick has said, "We now have embedded in healthcare an extraordinarily powerful belief system, and a set of behaviors around clinical evaluation of science. This has taken us a long way from clinical practice guided by anecdote. Among other consequences, this revolution in applied methods placed the randomized controlled trial at the top rung of design as the best way to learn. But this commitment to sound evaluative science has also created a problem, namely that the journey we need to take now in seeking better systems of care will not yield to those methods alone. To crack the problem of health systems improvement, we are going to have to be interested as colleagues in science in other methods for learning, as we were previously engaged in the new classical methods. The formal methods of summative evaluation simply are not relevant when the hypotheses are many and vague, when alternative needs have evolved over time, when local knowledge is relevant and contains perhaps more transferable wisdom than bias, and when the confounders are not defects that spoil our learning but are themselves interesting and comprise the seeds of further progress. And when the effects sought are large enough, we ought not to have a hard time detecting the signal within the noise."1
RCTs are considered to be the gold standard for establishing the efficacy of drugs and other well-defined treatments and interventions.3 RCTs offer the most broadly applicable simple research design. However, the simplicity that makes this research design so appealing can lead to oversimplification of interventions and their effectiveness in real world applications. Finding ways to circumvent the limitations and shortcomings of the RCT, while maintaining a high level of internal validity, has led to some recent developments in alternative study designs. For this reason, it is important to understand both the strengths and weaknesses of the RCT.
RCTs began in the field of agriculture, where a few easily measured and controlled interventions and resulting outcomes could be investigated in hothouses. This type of research design is the most effective way to determine the efficacy of medications and well-defined interventions, as it controls for natural variation and singles out a small number of interventions for careful examination. RCTs prove causality by eliminating other confounders and by allowing for close examination of dose-response relationships.
It is customary in designing RCTs to develop a data collection tool that must be completed for every study patient. Variables are defined precisely and providers and patients are paid to collect the data. Also, careful monitoring of data reliability is performed. RCTs are often very expensive.
RCTs are concerned with efficacy, i.e., with the question of whether a treatment works under ideal conditions. Efficacy is simplest to determine when using a homogeneous research population, but the requirement for homogeneity in the population that allows the researcher to determine the impact of the intervention also limits the ability to generalize the outcomes to the general population. Thus, these studies have strong internal validity, but weak external validity.
In contrast, effectiveness research, such as PBE, is concerned with the question of whether a treatment works under usual conditions of care. Effectiveness studies seek to identify the natural variation in the population and determine how interventions affect different subgroups of patients. Heterogeneity of the population is seen as a strength in these studies, and a means of gaining a clearer understanding of the intervention. These studies attempt to examine interventions within the wider healthcare system, where care is actually delivered. Thus, internal validity is weakened, while external validity is strengthened. Because PBE studies are not randomized, outcomes may be influenced by treatment selection. However, statistical methods such as matching, propensity scoring, and covariate adjustment can be used to address selection bias.
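As a rough illustration of the matching idea mentioned above, the sketch below pairs each treated patient with the untreated patient closest on a single severity score and averages the outcome differences within pairs. The data, the tuple layout, the caliper value, and the function name `matched_effect` are all hypothetical; actual PBE analyses would match on many covariates or on an estimated propensity score.

```python
# Hypothetical patients: (severity_score, treated, outcome). All values are invented.
patients = [
    (2.0, True, 12.0), (5.0, True, 20.0), (8.0, True, 31.0),
    (2.1, False, 10.0), (4.8, False, 17.0), (7.9, False, 27.0),
    (1.0, False, 6.0),
]

def mean(values):
    return sum(values) / len(values)

def matched_effect(rows, caliper=0.5):
    """1:1 nearest-neighbor matching on severity, without replacement."""
    treated = [r for r in rows if r[1]]
    controls = [r for r in rows if not r[1]]
    diffs = []
    for severity, _, outcome in treated:
        if not controls:
            break
        # Closest remaining control on the severity score.
        nearest = min(controls, key=lambda c: abs(c[0] - severity))
        # Only accept the pair if it is within the caliper.
        if abs(nearest[0] - severity) <= caliper:
            controls.remove(nearest)
            diffs.append(outcome - nearest[2])
    return mean(diffs) if diffs else None

# Unadjusted comparison of group means versus the matched comparison.
naive = (mean([r[2] for r in patients if r[1]])
         - mean([r[2] for r in patients if not r[1]]))
matched = matched_effect(patients)
print(naive, matched)
```

In this invented data set the unadjusted difference is larger than the matched difference because the untreated group includes a low-severity patient with no comparable treated counterpart; matching drops that patient from the comparison.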
There are methods for adapting RCTs to maximize their clinical relevance. RCTs should be designed to study the treatment or intervention as it would be delivered in the clinical practice setting, using outcome measures that reflect the values of persons involved and society (such as cost-effectiveness). In addition, studies should be conducted on a representative sample of patients in order to improve the ability to generalize results to the wider public. While some RCTs can be adapted according to these principles, in other cases this is impossible and contextual issues (how the intervention would work within the multilayered healthcare system) remain unresolved. Also, when it takes a long time to conduct a clinical trial, one can be left in the end with a greater understanding of the efficacy of methods that are no longer in use.
There are many threats to the validity of research inference, and selection bias―the primary validity threat that the RCT controls for―is only one. Optimal research designs must consider all major threats to validity of inference, and clinically applicable research should be designed to address all issues of generalizability and utility in practice―not just selection bias. The emphasis on RCTs as the gold standard of research has led to an oversimplification of the definition of high-quality research. It is as if selection bias were the only important threat to validity of research. Past evidence-based literature syntheses have been oriented around the RCT, sometimes to the extent of ignoring weaknesses of the RCT, and without sensitivity to the fact that other designs can, in certain circumstances, provide better, or indeed, the only evidence.
Most reports concur that RCTs are needed to establish the efficacy of treatments. While that recommendation is clearly justifiable, the reality is that RCTs pose substantial ethical and design challenges for many clinical practice questions and may produce results with limited generalizability. In clinical practice, the care environment is not very conducive to establishing the controlled experimental conditions necessary to conduct randomized trials, owing to the wide variability in patient types and severity, the complex, dynamic, and interactive nature of treatments, and the increasing difficulty of controlling for confounding treatment factors as more treatments are introduced. In recent years, the need for new research methodologies to supply necessary missing pieces of information to clinicians and health policy decision makers has become increasingly apparent, as RCTs alone have failed to fill existing knowledge gaps.1,3 In summary, RCTs are a very important study design methodology, but we need to consider alternative designs depending on the research questions asked.
The value of administrative databases has been well established for purposes of health planning, public health surveillance, examination of geographic variation, and the investigation of health disparities across socioeconomic, racial, and ethnic groups. However, the value of administrative databases is far less well established for looking at practice-based evidence. First, we describe administrative data. Second, we discuss how they can be used to generate practice-based evidence. Third, we present their strengths and limitations. Finally, we discuss what can be done to improve these data to provide practice-based evidence.
Dr. Kramer defined administrative data as preexisting data collected for federal/state requirements, or any surveys or databases for a sample of patients or providers collected for general purposes. Registries fall into the category of administrative data, as do Medicare cost and utilization information, nursing home MDS used for payment and quality purposes, and OASIS for home health care payment and quality. Non-institutionalized population administrative data include the National Ambulatory Medical Care Survey (NAMCS) and the National Health Interview Survey (NHIS).
How can administrative data be used to generate practice-based evidence? Typically administrative data are used in observational studies. Sometimes they can be used in quasi-experimental studies to supplement primary data collection in order to reduce respondent burden. This is the case when the information needed can be found in secondary data sources for the same sample of patients for whom primary data are being collected, so that one does not have to collect those data from respondents directly. Some examples of outcome measures or effectiveness measures that can be found in administrative data are mortality, hospitalization, and discharge disposition. Mortality is a reliable endpoint in most administrative data sources, but it is not 100 percent reliable. Data sources don't always agree, but usually one can triangulate and cross-check mortality with Social Security and other files, and verify mortality endpoints if multiple sources are used.
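The cross-checking of mortality across sources described above might look something like the following sketch. The source names (`claims`, `mds`, `ssa`), the status labels, and the two-source agreement rule are assumptions made for illustration, not a documented procedure from any of the databases named in this Primer.

```python
from collections import Counter

def reconcile_death(dates_by_source):
    """Classify a patient's death record from {source_name: date_string or None}."""
    reported = [d for d in dates_by_source.values() if d is not None]
    if not reported:
        return ("no_record", None)
    # Most frequently reported date and how many sources reported it.
    date, votes = Counter(reported).most_common(1)[0]
    if votes >= 2:
        return ("confirmed", date)       # at least two sources agree
    if len(reported) == 1:
        return ("single_source", date)   # nothing to cross-check against
    return ("conflict", None)            # sources report different dates

# Hypothetical patient with matching dates in two of three files.
record = {"claims": "2004-03-02", "mds": "2004-03-02", "ssa": None}
print(reconcile_death(record))
```

Records flagged as `conflict` or `single_source` would then be candidates for manual review or exclusion, depending on the study protocol.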
Hospitalization, and particularly diagnosis-specific hospitalization, is another useful endpoint. For example, Ambulatory Care Sensitive Conditions are used to look at quality of ambulatory care for conditions such as diabetes, COPD, and CHF. People think hospitalization for these conditions might be completely avoidable and they look at rates of hospitalization as indicators of quality of ambulatory care for these conditions.
A problem in most administrative databases is that data are collected at pre-specified times, so analyses are limited to those times. Surgical and medical complications based on ICD-9 codes during hospitalization are an example of data whose usefulness can be affected by pre-specified collection times: these complications are recorded after discharge, but when they occurred during the hospitalization is not specified.
For practice-based evidence purposes our administrative data analysis should be hypothesis driven. This means that we must define an effect variable separately from other covariates that are being adjusted for. One should not put all the variables into a model and say “This is what we found.” One must be very clear about hypotheses up front. Examples of effect variables that we can examine using administrative data are surgical procedures, new surgical technologies, specific services, treatment settings, and specific types of treatments that are coded with various codes. We can also study frequency of different kinds of services, and examine availability of services in an area, which might be a good proxy for the extent to which services are used. We can look at payer issues, e.g., managed care versus fee for service, and study which setting is more effective. We can look at facility characteristics, such as the volume of services or teaching hospitals versus non-teaching hospitals. And we can look at individual provider characteristics such as training levels, staffing levels in facilities, and physician specialty.
For example, consider open appendectomy versus laparoscopic appendectomy.4 There is controversy about the indications for each surgery type. A study used data on about 20 percent of all U.S. hospital discharges, which contained 43,000 appendectomy patients. Of this group, 17 percent had laparoscopic and 82 percent had open procedures. Length of stay, complications, and mortality were examined, and there was an array of covariates, including perforation and abscess. We found decreased length of stay, some decreased complications, and an increased rate of direct discharge for laparoscopic appendectomy patients. With stratification, some of the complication differences went away. Nevertheless, this is an example of what can be done with administrative data from acute care, and the approach has benefits over a single-site randomized trial.
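To illustrate how a difference can shrink or disappear under stratification, the sketch below compares a crude odds ratio with a Mantel-Haenszel pooled odds ratio across two strata. The counts are invented for illustration and are not the study's data; the stratum meanings (without and with perforation) and the function names are assumptions.

```python
# Each stratum is (a, b, c, d):
#   a = laparoscopic with complication, b = laparoscopic without,
#   c = open with complication,         d = open without.
# Invented counts: stratum 1 = no perforation, stratum 2 = perforation.
strata = [(10, 90, 10, 90), (6, 14, 60, 140)]

def crude_odds_ratio(tables):
    """Odds ratio after collapsing all strata into one 2x2 table."""
    a = sum(t[0] for t in tables)
    b = sum(t[1] for t in tables)
    c = sum(t[2] for t in tables)
    d = sum(t[3] for t in tables)
    return (a * d) / (b * c)

def mantel_haenszel_or(tables):
    """Mantel-Haenszel pooled odds ratio across strata."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return numerator / denominator

crude = crude_odds_ratio(strata)
pooled = mantel_haenszel_or(strata)
print(round(crude, 3), round(pooled, 3))
```

In these invented counts the crude odds ratio suggests fewer complications with the laparoscopic procedure, but within each stratum the odds are identical, so the pooled estimate is 1. The apparent benefit arises only because perforated cases, which have more complications, are concentrated in the open group.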
A second example deals with indwelling catheter use in hip fracture patients, and looks at the issue of expanded use of indwelling catheters.5 This study used Medicare claims data and nursing home MDS data to look at the presence of catheters at the time of hospital discharge and some hospital characteristics. There were 111,000 hip fracture patients sent to nursing homes in this study, and 32 percent were discharged with catheters. We studied rehospitalization for UTI and/or sepsis, return to community, mortality, and a whole array of covariates. In particular there were variables like function and cognition from the MDS, justifying indications for catheter use, which is what one would be concerned about. One certainly has to eliminate obstruction and retention as indications for use. But following multivariable analyses, there was increased rehospitalization for UTI and sepsis, higher mortality, and decreased community residence for 30 days for patients with extended indwelling catheter use.
What are the strengths and limitations of using administrative databases for practice-based evidence studies? The major strengths of administrative data include large sample size of subjects and providers, lower cost than primary data studies, and less time required for these studies since the data already exist. One can address policy questions and questions about past practices. There is no respondent burden and no need for consent. However, one of the greatest limitations is the silo limitation. We typically collect administrative data in silos such as hospital inpatient data separate from nursing home data. The richer elements in these databases are often within silos. And although one may link them, there are still some incompatibilities in timing, scale, and frequency. There is also the issue of unmeasured confounders; administrative data often do not fit the specific needs of a research question. They don't have all the controls that we want. In summary, administrative databases can be useful to answer some questions, but they must be used wisely; we must realize their strengths and limitations for practice-based evidence.
Quasi-experiments, as defined in Campbell and Stanley, are experiments in which subjects are not randomly assigned to conditions or treatments.6 In order for an intervention to be considered the cause of an effect, the timing has to be right: the cause must precede the effect, and the cause must co-vary with the effect. In addition, one has to rule out alternative explanations for the causal relationship. This is especially important in observational studies, where we do not have randomization. Although one may have an observational database and want to use it, one must adjust for covariates because treated responses may differ from control responses in ways that are caused not by the treatment but by omitted confounders.
Common quasi-experimental designs include before/after designs, longitudinal designs, regression discontinuity/quantified assignment designs, and multiple interrupted time series with a stable baseline and follow-up series. Quasi-experimental designs are empirical investigations in which the objective is to understand causal effects. The questions they address are: What are treatment effects? What are intervention effects? What are policy effects?
The simplest longitudinal designs are before/after designs: one has n subjects observed at two times, before and after an intervention, and there is no control group. All the subjects receive the control at the early time and then receive the treatment at the later time. This may be a study in which one implements a new treatment. Nobody has it at the beginning and everybody has it after the intervention. The question is “What is the effect of the treatment?” We compare outcomes before and after the new treatment, and there is no control group. The strength of a pre/post design is that one has some information about the counterfactual, because one sees what patients look like prior to the introduction of the new treatment or policy initiative. (Counterfactual theories of causation are ones in which the meaning of a singular causal claim of the form "Event c caused event e" can be explained in terms of counterfactual conditionals of the form "If c had not occurred, e would not have occurred.") However, there are many weaknesses, including the fact that the treatment is completely confounded with the post time period and there could be selection-maturation effects. (A selection-maturation threat results from differential rates of normal growth between pre-test and post-test for the groups.)
What about a repeated interrupted time series design? This is an expanded before/after design. Here we have multiple “before” observations and multiple “after” observations, but still no control group. We have n subjects and observe their outcomes. We still have the issue that all n subjects receive the control at the earlier times. But rather than one point, there may be a panel of observations, for example 10 monthly observations. Likewise, all patients receive the treatment at all the later times. What is the strength of this design? Why not just stick to pre/post? The strength is in multiple patient measurements, which provide information about current trends, and this helps to reduce the required sample size. But one still has selection-maturation effects, and again, there is no blinding and no control group.
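One way to see what the multiple "before" observations contribute is the following minimal sketch: fit a linear trend to the pre-intervention series, project it forward as the expected path without the intervention, and compare the observed post-intervention values with that projection. The monthly values are invented and the function names are hypothetical. Full interrupted time series analyses typically use segmented regression and account for autocorrelation, both of which this sketch omits.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def its_effect(pre, post):
    """Mean gap between observed post values and the projected pre-period trend."""
    intercept, slope = fit_line(list(range(len(pre))), pre)
    projected = [intercept + slope * (len(pre) + i) for i in range(len(post))]
    gaps = [observed - expected for observed, expected in zip(post, projected)]
    return sum(gaps) / len(gaps)

# Invented monthly outcome values: 10 before and 5 after the intervention.
pre = [50, 51, 52, 53, 54, 55, 56, 57, 58, 59]
post = [55, 56, 57, 58, 59]

# A simple pre/post comparison of means ignores the underlying trend.
naive_change = sum(post) / len(post) - sum(pre) / len(pre)
trend_adjusted = its_effect(pre, post)
print(naive_change, trend_adjusted)
```

In this invented series the simple pre/post comparison shows an increase, while the comparison against the projected trend shows a decrease, because the outcome was already rising before the intervention.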
Clearly, we need a control group. The control group should be in both the pre and post test time periods. That is, some of the subjects receive the control and some receive the treatment at the same time and we have the untreated response in both groups at the baseline measurement. What is the strength of this design? Now treatment is not confounded completely with time.
If possible, one wants to have multiple observations pre and multiple observations post. In some sense that is the ideal world. Regardless, one still has to estimate the causal effect. This is not the same as association. We want to say x causes y.
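For the design with a control group observed in both the pre and post periods, one common estimator, which the text does not name, is difference-in-differences. The sketch below is a minimal illustration with invented outcome values; it assumes that, absent treatment, the two groups would have changed by the same amount over time.

```python
from statistics import mean

def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group minus change in the control group."""
    change_treated = mean(treated_post) - mean(treated_pre)
    change_control = mean(control_post) - mean(control_pre)
    return change_treated - change_control

# Hypothetical outcome scores (illustrative numbers only).
treated_pre, treated_post = [40, 42, 44], [52, 54, 56]   # mean rises from 42 to 54
control_pre, control_post = [41, 43, 45], [46, 48, 50]   # mean rises from 43 to 48

did = diff_in_diff(treated_pre, treated_post, control_pre, control_post)
print(did)
```

The control group's change serves as the estimate of what would have happened to the treated group over the same period without treatment, so time is no longer completely confounded with treatment.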
Regression is the most common method of analysis, but regressing an outcome on a number of covariates yields associations; it does not establish causation. One still can run regressions, but one must be careful about the timing of the treatment or policy. Even if one has information about all the confounders, one should not just run regression analyses. Regression is familiar to most people and simple to interpret, and it is easy for someone who has collected a long list of covariates to say that the log of survival equals the treatment plus the confounders. But regression requires (a) a parametric model, (b) extrapolating across different regions of the covariates, and (c) imposing certain functional relationships, such as saying that the relationship is linear or log linear.
The major problem, even
when one has all the covariates, is not knowing how comparable the treatment
and control groups are. In the
observational world, after adjusting for as many confounders as possible, it is
very difficult to see whether the treated and control groups are similar. One can look at each covariate in the
treated arm and in the control arm and look at differences. However, since there are many covariates, it
is really hard to see whether or not the groups are the same.
Moreover, if the
variances of the confounders differ between the treatment and control groups,
then the bias is increased. The group
of people who receive treatment in an observational study may be more homogeneous
than the group of people who do not receive the treatment, because people in
the latter group may not get treatment for many reasons.
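One way to make the covariate-by-covariate comparison described above more systematic is to compute, for each covariate, a standardized mean difference and a variance ratio between the treated and control groups. The sketch below is illustrative only: the covariate names and values are invented, and the choice of these two diagnostics is an assumption rather than something the text prescribes.

```python
from math import sqrt
from statistics import mean, variance

def balance(treated, control):
    """Return (standardized mean difference, variance ratio) for one covariate."""
    var_t, var_c = variance(treated), variance(control)
    pooled_sd = sqrt((var_t + var_c) / 2)
    smd = (mean(treated) - mean(control)) / pooled_sd
    return smd, var_t / var_c

# Hypothetical covariates: name -> (treated values, control values).
covariates = {
    "age": ([70, 72, 74, 76, 78], [60, 66, 72, 78, 84]),
    "admission_fim": ([50, 55, 60, 65, 70], [50, 55, 60, 65, 70]),
}

report = {name: balance(t, c) for name, (t, c) in covariates.items()}
for name, (smd, ratio) in report.items():
    print(f"{name}: SMD={smd:.3f}, variance ratio={ratio:.3f}")
```

In this toy example the treated group is more homogeneous in age than the control group (variance ratio well below 1), which is the situation the preceding paragraph warns about.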
There are several
different strategies to estimate treatment effects. One can do some exact matching or stratification, which is simple
to interpret and there are standards of how to do it. However, there can be too few observations to adjust for many
possible confounders, and so the original database needs to be large. Even then the values of the observed confounders
for the treatment group may not overlap with those for the control group, i.e.,
the patients might not be comparable exactly.
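The stratification strategy and its overlap problem can be sketched as follows. This is a minimal illustration with invented records and a hypothetical function name: the treatment effect is estimated within each stratum, the stratum estimates are averaged with weights proportional to stratum size (one of several possible weightings), and strata with no treated or no control patients are set aside because no comparison is possible there.

```python
from collections import defaultdict
from statistics import mean

def stratified_effect(records):
    """records: iterable of (stratum, treated_flag, outcome).

    Returns (weighted effect estimate, list of strata dropped for lack of overlap).
    """
    strata = defaultdict(lambda: {True: [], False: []})
    for stratum, treated, outcome in records:
        strata[stratum][bool(treated)].append(outcome)

    weighted_sum, total_n, dropped = 0.0, 0, []
    for stratum, groups in strata.items():
        if not groups[True] or not groups[False]:
            dropped.append(stratum)      # no comparable patients in one arm
            continue
        n = len(groups[True]) + len(groups[False])
        weighted_sum += n * (mean(groups[True]) - mean(groups[False]))
        total_n += n
    if total_n == 0:
        raise ValueError("no stratum contains both treated and control patients")
    return weighted_sum / total_n, dropped

# Hypothetical records: (severity stratum, treated?, outcome score).
records = [
    ("mild", True, 80), ("mild", True, 84), ("mild", False, 78), ("mild", False, 80),
    ("moderate", True, 70), ("moderate", False, 60), ("moderate", False, 64),
    ("severe", True, 40), ("severe", True, 44),   # no untreated severe patients
]
effect, dropped = stratified_effect(records)
print(effect, dropped)
```

With many confounders the number of strata grows quickly, so more strata end up empty in one arm, which is why the original database needs to be large.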
Regression is good, but it is not the only analysis method to use. Matching or stratification is good, but it can be problematic when using large observational databases because one can match on only some of the confounders. Another way is to produce a propensity score or some other metric that summarizes the difference between who receives the treatment and who does not.7 A propensity score is simple to interpret and standard software can be used. One gains from looking at the comparability of the treatment and control groups based on a single number that summarizes the information in all observed confounders. An example of such a summary metric is the Comprehensive Severity Index, which is discussed in Chapter 5.
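To make the idea of a propensity score concrete, the sketch below estimates each patient's probability of receiving treatment from two observed covariates using a small hand-written logistic regression. Everything here is an assumption for illustration: the patients, covariates, learning rate, and number of iterations are invented, and in practice one would use standard statistical software, as noted above, rather than this loop.

```python
from math import exp
from statistics import mean, pstdev

def standardize(column):
    """Rescale a covariate to mean 0 and standard deviation 1."""
    mu, sd = mean(column), pstdev(column)
    return [(v - mu) / sd for v in column]

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def fit_propensity(rows, treated, steps=2000, rate=0.1):
    """Fit P(treated | covariates) by logistic regression using gradient ascent."""
    columns = [standardize(list(col)) for col in zip(*rows)]
    x = [[1.0] + list(vals) for vals in zip(*columns)]    # leading 1.0 is the intercept
    beta = [0.0] * len(x[0])
    for _ in range(steps):
        preds = [sigmoid(sum(b * v for b, v in zip(beta, xi))) for xi in x]
        for j in range(len(beta)):
            grad = mean((t - p) * xi[j] for t, p, xi in zip(treated, preds, x))
            beta[j] += rate * grad
    return [sigmoid(sum(b * v for b, v in zip(beta, xi))) for xi in x]

# Hypothetical patients: (age, admission severity score); 1 = received treatment.
rows = [(62, 10), (65, 14), (70, 25), (71, 26), (75, 30), (78, 22), (80, 20), (84, 28)]
treated = [0, 0, 1, 0, 1, 0, 1, 1]

scores = fit_propensity(rows, treated)
treated_scores = [s for s, t in zip(scores, treated) if t == 1]
control_scores = [s for s, t in zip(scores, treated) if t == 0]
print("treated range:", min(treated_scores), max(treated_scores))
print("control range:", min(control_scores), max(control_scores))
```

Comparing the ranges of the scores in the two groups gives a quick, single-number view of how much the treated and control patients overlap on the observed confounders.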
An RCT is simple. One looks at the difference in the outcomes
between two groups: treatment and control.
Where people get nervous about observational data is that by definition
one cannot use simple-minded methods.
One has to do a lot of work to show that the treatment and control
groups are comparable based on the observables. What about the unobservables?
People worry about them also. If
there is no lack of overlap between those who get the treatment and those who do not, and if the treated
and control groups look comparable, then one can be more confident in the robustness
of the effects.
A well-designed
quasi-experiment can provide valid inferences.
However, the investigator must work harder in the analyses to
demonstrate that the treatment effects are causal and in particular must
describe other possible causal explanations and why they either support or do
not support the original findings.
CHAPTER 5:
PRACTICE-BASED EVIDENCE FOR CLINICAL PRACTICE IMPROVEMENT
Practice-based evidence (PBE) methodology is a novel approach,
complementary to evidence-based practice, for studying the
effectiveness of clinical interventions.3,8,9 The PBE method involves statistical analyses
of large databases that incorporate extensive details on patient
characteristics including severity of illness and co-morbidities, standardized
documentation of treatment details, and periodic outcome assessments. This methodology has been used successfully
to uncover important clinical associations between care and outcomes in
multiple conditions, including a recent study that revealed several very
specific and clinically-relevant insights from inside the ‘black box’ of stroke
rehabilitation.10
PBE is useful to study a wide range
of treatment options and practices in diverse populations and to determine how
these factors interact to affect outcomes.
PBE is a rigorous observational method that is embedded within
‘real-world’ multidisciplinary clinical care and offers many advantages over a
tightly constrained clinical trial. It
does not alter or suspend the treatment regimen to evaluate the efficacy of a
particular intervention. Instead, it
collects detailed information on actual care practices and thereby captures the
breadth and depth of patients, treatment regimens, and their interactions
within the multidisciplinary setting.
The hypotheses and study design are developed specifically to answer
questions faced on a daily basis by clinicians such as: “Does this treatment
work as well as it is purported to work?
For whom does the treatment work best?”
The PBE methodology has the
advantage of compiling data on a large number of patients―numbers that
would not be available (or affordable) in an RCT with rigid inclusion and exclusion
criteria. The PBE approach controls
statistically for patient differences by taking into account important patient
covariates such as severity of illness and functional status, thus giving it an
advantage over traditional smaller scale observational studies. Accepting a priori that potential confounding
variables should be identified and measured, rather than eliminated, allows for
a richer study. This inclusiveness also
allows for greater external validity (generalizability) of findings.
Perhaps the best way to
understand PBE is through some illustrative examples of the types of
information that have been uncovered in applications using this approach. One such example is the recently published
report on stroke rehabilitation.10-12 A prospective observational cohort study was conducted on
1,291 post-stroke patients treated at seven inpatient rehabilitation facilities
(six in the US and one in New Zealand).
In stroke rehabilitation, treatments are customized to meet individual
patient needs with little guidance or adherence to established practice
parameters or standardized treatment protocols. Consequently, considerable variation in treatment approaches is
seen from one patient to another and from one rehabilitation center to another.
Three types of data were collected for the stroke study: 1) patient characteristics, which are used to formulate a Comprehensive Severity Index (CSI®), a validated and unique component of the PBE approach (CSI is an age- and disease-specific measure of physiologic and psychosocial complexity composed of multiple clinical signs, symptoms, and physical findings); 2) process variables that detail what is being done in treatment; and 3) outcome variables such as severity of illness and functional status (e.g., Functional Independence Measure (FIM) scores).
One of the unique features of PBE
is its attention to the details of the process
of care, looking inside the ‘black box’ of treatment. Relevant details for some interventions
(e.g., surgical procedures and medications) can be found in the medical
record. However, for interventions such
as physical, occupational, and speech therapy, details of the clinical
activities performed in any given session typically are not documented
sufficiently in a patient chart. One of
the most impressive parts of the stroke study was the development of
‘point-of-care’ documentation using a form designed by the participating
therapists that successfully captured what was being done in each therapy
session. Results indicated that, controlling
for patient differences, certain activities and interventions were associated
with better outcomes: more time spent in higher-level rehabilitation activities
such as gait training, upper extremity control, and problem solving; use of new
psychiatric medications; and enteral feeding.
Initiating gait training very early in the rehabilitation process was
associated with better outcomes, even for low functioning patients. Equally important information was the fact
that many treatments or activities that were used commonly and frequently
failed to be associated with positive outcomes. These findings could have important and immediate implications
for stroke rehabilitation practices; however, while the inherent scientific
value of the data obtained is widely acknowledged, some urge caution in direct
application of these findings to clinical settings.13-15
The stroke study provides details
on the development of the rigorous methodology implemented there and shows that
it is both possible and feasible to obtain this type of detailed information in
a complex setting such as stroke rehabilitation.12 The stroke study engenders confidence that
use of PBE methodology will be similarly successful in capturing the nuances of
multidisciplinary clinical care for patients with other conditions. Pertinent findings from other PBE studies
are as follows:
The major impetus for pursuing PBE methodology is twofold: the challenge many have faced in trying to design RCTs to evaluate the efficacy of certain treatments, and uncertainty about how to proceed when a series of controlled studies indicates that several treatments are available for a specific indication, each of which has some degree of efficacy or effectiveness. Unless studies comparing each viable treatment in each patient sub-group are done, clinicians are still at a loss to decide what works best for whom. RCTs are considered the gold standard for efficacy, and this reputation is warranted because these studies are designed specifically to demonstrate that the measured effect can be attributed directly to the treatment. However, RCTs are not without limitations. The sterile and somewhat artificial treatment environment of an RCT and rigid inclusion-exclusion criteria greatly limit the generalizability of results.
Although research funders have long favored the RCT, they may be beginning to recognize that RCTs are not the most appropriate designs for many questions, particularly in the complex world of health care research. PBE methodology uses various types of regression analyses and large numbers of patients, which allow the examination of multiple factors. What is unique about the PBE method is the strong focus on patient severity, specifically the formulation of the Comprehensive Severity Index, which is based on many years of research. In summary, PBE offers an alternative to traditional multi-center randomized clinical trials and it may be appropriate particularly to evaluate multidisciplinary care. It imposes a structure and rigor on the establishment of a multi-center database that yields high quality and clinically pertinent data. Recent published literature finds that significant effects in RCTs and observational studies are very similar.22-24
This chapter provides an overview of the methods used in the practice-based evidence clinical practice improvement approach. A PBE study is an observational cohort study that collects both prospective and retrospective data without interrupting the natural treatment environment. PBE studies examine what actually happens in the care process and overcome shortcomings commonly attributed to observational studies by the ways they account for patient covariates and severity of illness.
Although PBE studies resemble other observational studies that take into account patient demographic and setting characteristics that may affect outcomes and determine generalizability, PBE moves beyond traditional observational approaches to create comprehensive, complex databases that include detailed patient-specific descriptions, severity-of-illness measures, and characterizations of treatments for large samples of patients.
Methods
Steps in a PBE Study
The purpose of a PBE study
is to determine the relative contribution of specific interventions and
therapies to patient outcomes taking into account patient differences and other
contributing factors. PBE methodology
captures in-depth, comprehensive information about patient characteristics
(including clinical signs, symptoms, and physical findings), processes of care,
and outcomes needed to ascertain the contribution of individual processes to
outcomes. There are seven phases or
steps in a full PBE study.
1. Create a multi-site, multidisciplinary Project Clinical Team whose tasks are to (a) identify outcomes of
interest, (b) identify individual components of the care process, (c) create a
common intervention vocabulary and dictionary, (d) identify key patient
characteristics and risk factors, (e) propose hypotheses for testing, and (f)
participate in analyses. The
multidisciplinary Project Clinical Team (referred to as the Team henceforth)
builds on theoretical understanding, research evidence to date, existing
guidelines, and clinical experience about factors that may influence
outcomes. PBE studies entail extensive
front-line clinical staff participation in all phases of study design, data
collection, and analyses.
2. Use the Comprehensive Severity Index to control for differences in
patient severity of illness, including comorbidities that might otherwise affect outcomes. CSI is an age- and disease-specific measure
of physiologic and psychosocial complexity comprised of over 2,200 signs,
symptoms, and physical findings.25-28
3. Implement an intensive data collection protocol that captures data on patient
characteristics, care processes, and outcomes drawn from medical records and study-specific,
point-of-care data collection instruments.
Data collectors are tested for inter-rater reliability.
4. Create a study database
suitable for statistical analyses.
5. Successively test hypotheses based on questions that motivated the study
originally, previous studies, existing guidelines, and, above all, hypotheses
proposed by the Team using bivariate and multivariable analyses including
multiple regression, analysis of variance and covariance, logistic regression,
hierarchical models, Cox proportional hazards regression, and other methods
consistent with measurement properties of key variables.
6. Validate study findings through an
implementation phase that tests the predictive validity of the findings. In this phase, findings from the first 5
steps are implemented and evaluated to determine whether the new or modified
interventions are associated with better outcomes as predicted.
7. Incorporate
validated study findings into standard practice of care and practice
guidelines. After the validation of
specific PBE findings, the findings are ready to be incorporated into care
protocols.
The PBE approach uses
detailed data on interventions that allow researchers to penetrate to the most
meaningful level of resolution regarding the effects of the types of care
rendered. Thus, the PBE approach can
answer study questions and hypotheses initially at a basic level but also
allows researchers to drill down into the data with the help of additional
insights offered by Team participants.
Project Clinical Team
The Team provides expert
advice to ensure clinical meaningfulness and to create clear and compelling
hypotheses, useful study variables, and appropriate analyses. It usually contains a core group including
the medical director or director of nursing (DON) from each participating
site. This core clinical Team develops
and implements patient selection criteria, provides expert advice for data
collection instrument development, obtains IRB approvals at their respective
affiliated organizations, oversees the data collection process, and
participates in analyses. Over time and
depending on project activities/needs, the Team expands to include
representatives of each discipline in the clinical area treating each patient. People from these disciplines from each
study site provide expert advice specific to their fields of expertise. Team members participate in weekly or
biweekly conference calls over much of the PBE project. Frequent team meetings via conference calls
contribute to overall collaboration and investment in the study’s processes and
findings.
Study Facilities
Study sites are selected based on their willingness
to participate and geographic location.
Usually there are no specific criteria for selection; thus, study sites
are not a probability sample of sites in the US. Facilities can be for-profit or not-for-profit, free-standing, or
part of an organization of facilities.
Facility level differences are controlled for using statistical
analyses.
Patient Selection Criteria
Each site contributes detailed data for a specified
number of consecutive patients or for a specified time period using general
criteria. Facility size and rate of
condition specific patient admissions determine the duration of the enrollment
period. Some sites enroll patients faster
than others. No eligible patients are
excluded. Patients from the study sites
constitute a convenience sample.
Each participating site
obtains IRB approval for the study and enrolls
consecutively admitted patients that meet a set of inclusion criteria specified
by the Team. Inclusion criteria usually
include:
3. Reason for admission. Reason for admission criteria may be established. For example, the Post-Stroke
Rehabilitation Outcomes Project (PSROP) used the first rehabilitation admission
following current stroke, with the principal reason for admission being
stroke. The patient may have had
previous strokes and previous rehabilitation admissions for previous stroke(s),
but this is the first admission for the current stroke. Current stroke must have occurred within one
year of the rehabilitation admission.
4. Transfer-out limitations. Some studies create study inclusion criteria for patients with
interrupted stays. For example, the
PSROP Project Clinical Team decided that if a patient were transferred to
another setting of care, e.g., acute hospital, and returned to the inpatient
rehabilitation facility within 30 days, the patient remained a study
patient. If a patient transferred to
another setting of care and returned to the facility after 30 days,
participation in the study ended on the day of transfer.
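The PSROP transfer-out rule can be expressed as a small decision function. The sketch below is a hypothetical illustration, not study software: the function name is invented, a return on exactly day 30 is assumed to count as "within 30 days," and a patient who never returns is assumed to end participation on the day of transfer (a case the text does not address).

```python
from datetime import date
from typing import Optional

RETURN_WINDOW_DAYS = 30  # window stated in the PSROP rule

def study_end_after_transfer(transfer: date, returned: Optional[date]) -> Optional[date]:
    """Return the date study participation ends, or None if the patient stays enrolled.

    Assumptions: a return on day 30 counts as within the window, and a patient
    who never returns (returned is None) ends participation at transfer.
    """
    if returned is not None and (returned - transfer).days <= RETURN_WINDOW_DAYS:
        return None          # returned in time: remains a study patient
    return transfer          # participation ends on the day of transfer

# Hypothetical dates for illustration.
print(study_end_after_transfer(date(2005, 3, 1), date(2005, 3, 20)))   # stays enrolled
print(study_end_after_transfer(date(2005, 3, 1), date(2005, 4, 15)))   # ends at transfer
```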
There are no
exclusion criteria that might otherwise limit the generalizability of
findings. Because PBE studies usually
do not entail a new or experimental intervention for which patient consent is
needed, there are no refusals or study dropouts and therefore, no need to compare
study participants with study dropouts or need to account for patient selection
effects that might otherwise occur. Some PBE studies, however, do require patient
informed consent, particularly if they wish to conduct patient or family
interviews. In these cases, comparisons
between patients giving consent and refusing to give consent can be performed.
Sample size and power
calculations
Sample size can be determined using recommendations
such as those of Cohen for modeling the magnitude of effect size.29 In some
research (e.g., studies conducted in applied settings or new areas of inquiry),
effect sizes may be small because the phenomena under study are not under good
experimental or measurement control.
The smaller the effect size, the larger the sample required (other
parameters being equal) to detect significant differences. Cohen recommends that power calculations be
performed assuming small, moderate, and large effect sizes based on the
proportion of variance accounted for in the dependent variable.
Using these concepts and tables provided in Cohen, a
sample of 1,800 subjects will have at least 80% power (with Type 1 error of
p<.05 (2-tailed test)) to detect small effects (effect size of 0.15) of the
predictor variables on outcomes. The
sample allows detection of differences in mean values of continuous outcomes
that are 0.15 standard deviation units, and differences in discrete outcomes of
4% to 8%. For regression analyses,
independent variables that predict about 2% of the variance in outcomes can be
detected. When analyzing subgroups of
patients, if, for example, 300 subjects are expected, then detection of medium
sized effects (effect size of 0.35) with at least 80% power (with Type 1 error
of p<.05 (2-tailed test)) is possible.
Models for these sub-analyses are sensitive to differences in mean
outcomes that are 0.35 standard deviations, or differences of 10% to 17%
in rates of an outcome.
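As a rough cross-check of the power figures above, the sketch below uses a normal approximation for a two-sided comparison of two group means. It is not a reproduction of Cohen's tables: it assumes the effect size is a standardized mean difference, that the sample is split into two equal groups (900 per group for 1,800 subjects and 150 per group for 300 subjects, a split the text does not specify), and a Type 1 error rate of .05.

```python
from math import sqrt
from statistics import NormalDist

def power_two_group(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample comparison of means.

    effect_size is the difference in means in standard deviation units;
    equal group sizes and a normal approximation are assumed.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect_size * sqrt(n_per_group / 2)
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

print(round(power_two_group(0.15, 900), 3))   # full sample of 1,800
print(round(power_two_group(0.35, 150), 3))   # subgroup of 300
print(round(power_two_group(0.15, 150), 3))   # small effect in a subgroup of 300
```

Under these assumptions both scenarios described in the text exceed 80% power, while a small effect in a subgroup of 300 falls well short, which is why subgroup analyses are limited to medium-sized effects.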
Data Collection
Usually three types of
study data are collected in PBE studies: (1) patient characteristics (e.g.,
admission severity of illness and functional status measures), (2) process
variables (e.g., treatments and interventions), and (3) outcome variables
(e.g., discharge functional status, discharge severity of illness, and
discharge destination). These data are obtained from multiple sources, either at the
point of care or from post-discharge chart review at the site.
Point-of-care data
An important component of PBE is its
attention to the details of the process of care that the patient
actually receives; it addresses interventions and patient management
strategies. PBE relies on information
contained in patient medical records, which trained data collectors abstract
following patient discharge. The
Team identifies those study variables that can be obtained from existing
documentation at their respective sites.
However, they often believe that existing patient records do not
adequately document specific activities and interventions provided by certain
clinician specialists, e.g., physical, occupational, and speech language
therapists, etc., in stroke rehabilitation, because much patient documentation
is oriented to the needs of payment or reimbursement systems. The Team recommends how to get all members
of the treatment team to describe accurately what they do. Thus, the concept of point-of-care
intervention documentation can be incorporated into the study design.
Point-of-care intervention documentation
development
Discipline-specific
specialty teams with representation from each participating study site
conceptualize and then create discipline-specific point-of-care intervention
documentation forms to record activities/interventions used with study
patients. This iterative process, which
can include face-to-face meetings and telephone conference calls, can take
several months depending on the level of detail desired and the extent that
practice differs by site. Clinicians
sometimes find that definitions of common terms differ from site to site and
practitioner to practitioner. Thus,
part of the effort requires agreement on definitions of terms by participating
therapists.
Clinicians from study
sites create an intervention documentation form that includes a taxonomy of
activities used in each clinical area.
This work incorporates practices and definitions in existing frameworks,
and the level of intervention intensity clinicians think is needed to capture a
complete and accurate picture of the contribution made by that discipline to
care (beyond what is already contained in traditional medical record
documentation). In addition to
developing the content of its documentation form, each discipline decides upon
the frequency with which its form should be completed. The taxonomy provides a format into which
clinicians document actual interventions performed with patients; the
documentation forms do not suggest treatment strategies or changes to routine
practice.
Intervention documentation
forms are standardized for all sites.
Because development efforts include representatives from each
participating site, the forms contain interventions that may be specific to one
or more sites but are not used by all.
These ‘unique’ interventions are included on each site’s form even
though most places do not use them.
Therapists are trained to record only what was done in the actual care
process at each site for each patient.
As an example, see Appendix A for point-of-care documentation form used
by physical therapists in the PSROP.
Point-of-care intervention documentation
training/reliability
During a pilot test period
following development of each documentation form, practicing clinicians who
worked on form development use their draft forms during patient treatment
sessions and solicit input from clinician colleagues. Discipline-specific weekly teleconferences provide the forum for
clinicians to discuss pilot findings and agree to add, edit, or delete items
from the form. Each discipline’s
documentation form is finalized following this pilot test period.
Site clinicians are trained to use intervention documentation
forms via discipline-specific train-the-trainer sessions attended by a lead
clinician in each specialty from each site.
The Team facilitates this training for each clinical specialty using a
training manual that includes paper and electronic copies of the intervention
documentation forms, instructions for completing the forms, and definitions for
all terms used on the forms. Written
case studies are included; several case studies are used to demonstrate how to
complete each form based on a patient scenario. Additional case studies are used to evaluate trainees’
understanding of instructions by providing examples of how to use the form for
different patient scenarios.
Following the training
session, each clinical leader conducts on-site training sessions for their
co-workers. It is possible to have the
same training team visit each study site to conduct training for point-of-care
documentation for all clinicians. With
sufficient funding, such standard training is preferable. Teleconferences for each
group are held throughout the few months following training to provide
clinicians the opportunity to discuss implementation issues and ask questions
of their peers in other participating institutions.
Each site incorporates
auditing of intervention documentation form use into routine site
practices. Typically, a second
therapist (usually the lead therapist) observes a patient session and completes
a separate intervention documentation form based on what is observed. The therapist providing the session
completes a form as per protocol and the two are compared. The lead therapist reviews and discusses
differences in documentation with the practicing therapist.
Point-of-care intervention documentation validity
Face validity is built into the intervention
documentation forms, since they are developed and used by site clinicians as
described above. Clinicians agree with the content of their
respective forms by discussing findings from the pilot test and then agreeing to add,
edit, or delete items from the form (content validity).
Showing significant effects of interventions on outcomes assesses predictive validity. For example, the amount of variation explained in discharge FIM scores controlling for patient characteristics (including admission FIM, severity of illness, and demographic factors) was 40% for moderate strokes and 45% for severe strokes. When total time per day spent on physical therapy (PT), occupational therapy (OT), and speech language pathology (SLP) was added, there was no increase in variation explained for discharge FIM, consistent with previous findings by Bode, Heinemann, et al.30 However, when time per day spent in specific PT, OT, and SLP activities was added, the amount of variation explained increased to 52% for moderate strokes and 73% for severe strokes, adding 12% to 23% explanation of variation, respectively, in discharge FIM.10
Post-Discharge Chart Review
To create a study
database, a method is needed to enter data from post-discharge medical chart
review. One mechanism used in previous
PBE studies is the Comprehensive Severity of Illness (CSI®) Software
System that allows for both the input of severity of illness data and the
creation of auxiliary data modules (ADMs), which are sets of study-specific
data elements that are collected in addition to patient severity
information. The Team identifies and
defines all patient, process, and outcome variables to include in the study
ADM. Using laptop computers, data
collectors at each participating site enter chart review data into the CSI
Software System.
The signature component of the CSI Software System
is the disease-specific severity system, hereafter referred to as CSI®. CSI is an
objective method to define severity of illness based on individual signs and
symptoms of a patient’s diseases.
Between 1980 and 1992, Dr. Susan Horn, in conjunction with expert
clinician panels originally at The Johns Hopkins Hospital, developed explicit
severity criteria for each ICD-9-CM diagnosis code or group of similar
diagnosis codes. In order to keep
severity criteria up-to-date with medical practice, the criteria are reviewed
and updated via clinician panel discussions with each application of CSI. CSI defines severity of illness as the
physiologic and psychosocial complexity presented to medical personnel due to
the extent and interactions of a patient’s disease(s).9,25-28,31
Inputs to the CSI include over 2,200 disease-specific and age-specific severity criteria
including physical findings, historical factors, physiologic parameters, and
laboratory and radiology results at specified levels of abnormality found in a
patient's chart. Treatments
provided do not contribute to severity of illness. For example, intubation is not a severity
criterion; severity criteria include patient signs, symptoms, and physical
findings that led to a clinical decision to intubate (e.g., respiratory
acidosis, absent breath sounds, cyanosis, etc.).
As an example, the pneumonia criteria set involves
the neurological, cardiovascular, and respiratory systems, vital signs, and
laboratory and radiology values. The presence of a
pneumonia ICD-9 code (486, for example) prompts questions from the
pneumonia criteria set, as listed in Appendix B. Each criterion is followed by response choices for the data
collector to select; possible responses are presented in decreasing order of
severity. Responses for the pneumonia
dyspnea question, for example, include dyspnea at rest, dyspnea on exertion,
and other breathing difficulties. The
data collector selects the appropriate response based on information found in
the patient chart; data collectors are trained to select the most severe
response (by order of presentation). A disease-specific criteria set exists for each
group of similar ICD-9-CM codes; CSI contains over 5,500 criteria sets for
specific diagnoses in five health care settings (acute care, rehabilitation,
ambulatory, long term care, and hospice) with details similar to the pneumonia
criteria set in Appendix B.
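The rule that data collectors select the most severe documented response, by order of presentation, can be sketched as a simple lookup. This is a hypothetical illustration and not the CSI software: the response list contains only the three dyspnea responses named in the text, and the function and variable names are invented.

```python
# Responses in decreasing order of severity, as listed in the text for the
# pneumonia dyspnea question (the full Appendix B wording may differ).
DYSPNEA_RESPONSES = [
    "dyspnea at rest",
    "dyspnea on exertion",
    "other breathing difficulties",
]

def most_severe(documented, ordered_responses):
    """Return the first (most severe) response that appears in the chart, or None."""
    for response in ordered_responses:
        if response in documented:
            return response
    return None

# Hypothetical chart findings for one patient.
chart_findings = {"other breathing difficulties", "dyspnea on exertion"}
print(most_severe(chart_findings, DYSPNEA_RESPONSES))
```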
Chart review training/reliability
Reliable data collection
is essential in PBE studies. To
accomplish this, each site's medical records abstractor completes a 3- or 4-day
training session during which efficient and accurate collection of chart-review
data is explained and practiced.
Following the training session, each data collector undergoes a rigorous
manual reliability testing process to ensure complete and accurate data
collection that goes beyond internal data editing features of CSI (e.g.,
features that prohibit entry of non-sensible values). Reliability monitoring is conducted at several points throughout
a PBE study to ensure that data abstraction accuracy is maintained
throughout. An agreement rate of 95% at
the criteria level between each data collector and the Project training-team
reliability person is required for each reliability test.
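The 95% requirement is a simple proportion, which the following sketch illustrates. It is not part of the CSI software: the function name agreement_rate, the data layout (criterion name mapped to the selected response), and the sample responses are all assumptions made for illustration.

```python
def agreement_rate(collector, reference):
    """Share of criteria on which the data collector's response matches the
    reference abstractor's response.

    Both arguments are hypothetical dictionaries: criterion name -> response.
    """
    if not reference:
        raise ValueError("reference abstraction has no criteria")
    matches = sum(1 for criterion, answer in reference.items()
                  if collector.get(criterion) == answer)
    return matches / len(reference)


REQUIRED_AGREEMENT = 0.95  # threshold stated in the text

# Made-up reliability test: 40 criteria, the collector disagrees on one.
reference = {f"criterion_{i}": "response_a" for i in range(40)}
collector = dict(reference)
collector["criterion_7"] = "response_b"

rate = agreement_rate(collector, reference)
passed = rate >= REQUIRED_AGREEMENT
print(f"agreement = {rate:.1%}, passed = {passed}")
```

In this made-up test the collector agrees on 39 of 40 criteria and passes; two disagreements in 20 criteria (90%) would fail.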
The study investigators and the trans-disciplinary
Team members direct PBE analyses. These
researchers and clinicians have the fundamental knowledge and experience
treating patients in the study area to know when associations are clear or
whether additional explanatory variables are needed. Clinical strengths of the Team combined with analytic experience
result in clinically meaningful, statistically sound data analyses.
Management of Missing Data and Outliers
When data are
missing, adjustments are made depending on the variable and its intended use in
analyses. Sometimes values are
categorized simply as “unknown” (and included in analysis as a dummy variable
representing the missing category); sometimes patients with missing data are
deleted from analyses; and sometimes continuous variables are
collapsed into categories, and cases with missing values are assigned
to a category using corroborating data.
For example, if a patient’s Body Mass Index is missing, but other
weight- and height-related information exists (e.g., an order for a bariatric
wheelchair), the patient may be categorized broadly as overweight or
obese. When missing data are material,
one also can examine whether the patients with missing data in question are
substantially different from the rest of the study group, and adjust
accordingly. Ranges for some
variables are set to exclude unrealistic values and obvious outliers from the
analysis. Values beyond set ranges are
considered improbable and not used in analysis.
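A minimal sketch of these rules, using the Body Mass Index example, is given here. The plausibility range, the function names, and the sample patients are hypothetical; the cutoff of 25 is the conventional threshold for overweight. In a real study the Team would set ranges and categories.

```python
# Hypothetical plausibility range; values outside it are treated as improbable.
BMI_RANGE = (10.0, 80.0)


def bmi_category(bmi, bariatric_equipment_ordered=False):
    """Collapse BMI into broad categories.

    Missing or improbable values become 'unknown' unless corroborating chart
    data (here, an order for bariatric equipment) support a broad category.
    """
    if bmi is not None and not (BMI_RANGE[0] <= bmi <= BMI_RANGE[1]):
        bmi = None  # outside the set range: not used in analysis
    if bmi is None:
        return "overweight_or_obese" if bariatric_equipment_ordered else "unknown"
    return "overweight_or_obese" if bmi >= 25 else "not_overweight"


def dummy_variables(category):
    """Indicator variables for regression; 'not_overweight' is the reference."""
    return {
        "bmi_overweight_or_obese": int(category == "overweight_or_obese"),
        "bmi_unknown": int(category == "unknown"),
    }


# Made-up patients: (recorded BMI, bariatric wheelchair ordered?)
patients = [(31.2, False), (None, True), (None, False), (250.0, False), (22.0, False)]
categories = [bmi_category(bmi, ordered) for bmi, ordered in patients]
print(categories)
```

The "unknown" indicator keeps patients with missing values in the model instead of deleting them, which is the first of the options described above.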
Preliminary Data Analyses
Typically, the first phase of analysis uses
descriptive statistics to examine frequencies of categorical patient,
treatment, and outcome measures, and average, median, quartiles, and amount of
variation (standard deviation and range) for continuous measures. Bivariate analyses are conducted to test the
relationship between each candidate predictor and other predictors and
outcomes. For discrete variables,
contingency tables are created and chi-square tests, Fisher’s Exact tests, or
Wilcoxon tests or Kendall’s tau (for ordered categories) are used to determine
significance of bivariate associations.
Also categorical analysis of variance can be used to determine the
proportion of variation in outcome explained by each predictor. For continuous variables with normal distributions,
Pearson correlation, 2-sample t-tests, or analysis of variance can be
used. For continuous variables with
non-normal distributions, non-parametric tests are used including Spearman
correlation, Wilcoxon rank sum tests, or Kruskal-Wallis tests. Usually a two-sided p value <0.05 is
considered statistically significant.
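As an illustration of one of these bivariate tests, the sketch below computes the Pearson chi-square statistic for a contingency table by hand. The counts are made up. In practice a statistical package would supply the p value; here the statistic is compared with 3.841, the critical value for 1 degree of freedom at the 0.05 level, so the comparison applies to 2 x 2 tables only.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic (no continuity correction) for an
    r x c contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Critical value of the chi-square distribution with 1 degree of freedom
# at alpha = 0.05 (valid for 2 x 2 tables only).
CRITICAL_VALUE_DF1 = 3.841

# Made-up counts: rows = treatment received (yes/no),
# columns = outcome achieved (yes/no).
table = [[30, 10],
         [20, 20]]

statistic = chi_square_statistic(table)
significant = statistic > CRITICAL_VALUE_DF1
print(f"chi-square = {statistic:.2f}, significant at 0.05: {significant}")
```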
Analyses of Primary Outcomes
Analyses in PBE studies include application of correlational research methods. These are most valuable to improve clinical practice and to elucidate the circumstances or context affecting quality of care or the implementation of a known treatment process (e.g., treatment of pressure ulcers). Correlational research designs also provide invaluable hypotheses and probable findings that would never arise from RCTs or laboratory research. In general, in circumstances we define, they can provide evidence of highly probable effectiveness (level 2), which is the usual threshold for clinical decisions.
The most common multivariable analysis methods used
in PBE studies are hierarchical and least squares regression for continuous
outcomes, logistic regression for dichotomous outcomes, and Cox proportional
hazards regression for time-to-event outcomes. These types of regression
analyses are used to identify patient and treatment variables that are
associated with better outcomes. Hence,
these regressions include patient characteristics, such as severity of illness,
age, gender, race, education level, and location and severity of injury, and
individual treatments and combinations of treatments. In all multivariable analyses, a p-value of <0.05/m
(Bonferroni correction: m is the number of independent variables in the model)
is considered significant.
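The Bonferroni threshold is simple arithmetic: with m = 5 independent variables, a predictor must reach p < 0.05/5 = 0.01. The sketch below applies that rule; the function names, predictor names, and p values are invented for illustration.

```python
def bonferroni_threshold(alpha, m):
    """Per-variable significance threshold alpha/m for m independent variables."""
    if m < 1:
        raise ValueError("m must be at least 1")
    return alpha / m


def significant_predictors(p_values, alpha=0.05):
    """Names of predictors whose p value is below alpha/m."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [name for name, p in p_values.items() if p < threshold]


# Made-up p values for a model with m = 5 independent variables.
p_values = {
    "severity_of_illness": 0.0004,
    "age": 0.02,
    "early_feeding": 0.009,
    "therapy_minutes": 0.03,
    "race": 0.40,
}

threshold = bonferroni_threshold(0.05, len(p_values))
kept = significant_predictors(p_values)
print(f"threshold = {threshold:.3f}, significant: {kept}")
```

Note that "age" and "therapy_minutes" would pass an uncorrected 0.05 cutoff but do not pass the corrected one.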
Using suggestions from the trans-disciplinary Team,
potential predictors are allowed to enter the models. Those that are not statistically significant are deleted
sequentially from the full model.
Excluded variables can be reintroduced at various stages of model
development as decided by the Team but final models usually include only
statistically significant variables.
Two-way and higher order interactions can be included and tested along
with non-linear transformations of variables suggested by the Team. Regression analyses allow examination of the
extent to which various process/treatment steps and facility variables are
associated with outcomes, controlling for severity of illness and other patient
factors.
Analyses within subgroups of patients can clarify
associations found in regression analyses using larger samples of
patients. For example, one might
perform analyses within case mix groups (CMGs) to control for differences in
initial injury severity or within diagnosis related groups (DRGs) to control
for type of surgery and comorbidities.
A sample of 300 or more patients in a subgroup would allow up to 30 predictors
in models without being over-specified.
Using a 10:1 cases:variables ratio helps to avoid spurious
correlations. Because there can be many
variables to use as possible predictors, variables can be grouped (e.g.,
patient variables as a group) and significant variables from each group can be
included in final models.
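The 10:1 rule of thumb reduces to integer division, as the short sketch below shows. The function names are hypothetical, and the default ratio of 10 is taken from the text.

```python
def max_predictors(n_cases, cases_per_variable=10):
    """Largest number of predictors allowed under a cases:variables ratio."""
    if cases_per_variable < 1:
        raise ValueError("cases_per_variable must be at least 1")
    return n_cases // cases_per_variable


def is_over_specified(n_cases, n_predictors, cases_per_variable=10):
    """True when the model has more predictors than the ratio allows."""
    return n_predictors > max_predictors(n_cases, cases_per_variable)


print(max_predictors(300))         # subgroup of 300 patients
print(is_over_specified(300, 31))  # one predictor too many
```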
When performing patient-level regression analyses,
patient characteristics are allowed to enter in order to determine the amount
of variation in outcomes due to differences in patients. Next, treatment variables are added to
determine the amount of variation in outcomes due to differences in treatments
delivered while controlling for patient differences. In the next step, interaction and non-linear variables are
added. Only later are facility
variables included, because if facility variables are significant, they do not
tell us what to do to improve care. We
cannot send all patients to one facility.
PBE analyses first examine variation due to patient and treatment
factors and their interactions, which give information about which treatments
are better and for whom. Once
patient and treatment variables are in the model, adding facility variables
shows whether they explain any additional variation
that has not been captured already by the significant patient and treatment
variables included in the models. Often
regression analyses are repeated using hierarchical models to determine if
there are significant “among-site” components of variance and if any significant
patient or treatment variables are lost in hierarchical models.
Hierarchical analyses address the fact that patients
are treated within facilities, which may affect the independence of
observations. Alternatively, facility
descriptive variables or facility dummy variables may be included in
regressions. Site effects, which could
be influential as determined by hierarchical models, may already be accounted
for in the detailed patient and treatment predictors. Researchers rightfully worry that patient observations may be
correlated within a setting or that treatments may be correlated within a
setting; independence of observations is the basic issue. Hierarchical analyses are conducted to be
sure that significant variables remain significant and in the same direction
for both ordinary and hierarchical regression.34
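One simple way to gauge whether an among-site component of variance is present is the intraclass correlation from a one-way random-effects analysis of variance. The sketch below is a simplified stand-in, not a hierarchical regression model: it ignores patient and treatment covariates and assumes that every site contributes the same number of patients. The function name and the outcome values are invented.

```python
def among_site_variance_share(outcomes_by_site):
    """Intraclass correlation from one-way random-effects ANOVA.

    Assumes a balanced design: every site has the same number of patients.
    Returns the estimated share of outcome variance that lies among sites.
    """
    sites = list(outcomes_by_site.values())
    k = len(sites)
    n = len(sites[0]) if sites else 0
    if k < 2 or n < 2 or any(len(s) != n for s in sites):
        raise ValueError("need 2+ equally sized sites with 2+ patients each")
    grand_mean = sum(sum(s) for s in sites) / (k * n)
    site_means = [sum(s) / n for s in sites]
    ms_between = n * sum((m - grand_mean) ** 2 for m in site_means) / (k - 1)
    ms_within = sum((y - m) ** 2
                    for s, m in zip(sites, site_means) for y in s) / (k * (n - 1))
    # Among-site variance component; negative estimates are truncated at zero.
    var_between = max((ms_between - ms_within) / n, 0.0)
    total = var_between + ms_within
    return var_between / total if total > 0 else 0.0


# Made-up outcome scores for three sites with four patients each.
outcomes_by_site = {
    "site_a": [10, 12, 11, 13],
    "site_b": [15, 17, 16, 18],
    "site_c": [20, 22, 21, 23],
}

icc = among_site_variance_share(outcomes_by_site)
print(f"share of variance among sites = {icc:.3f}")
```

A value near zero suggests that observations within a site are close to independent; a large value, as in this contrived example, suggests that site-level clustering should be modeled.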
A PBE study collects comprehensive detailed data on
all factors that may influence outcomes for a specific group of patients. The goal is to capture variables at the
patient level that may differ across sites.
As a result, very detailed patient-level data about severity of illness,
levels of impairment, and many other patient factors are collected, as are
details about all interventions, including date and time, defined by the
Team. Hence, any differences in
patients and treatments among the participating sites are likely to be captured
in the detailed patient and intervention data used in PBE study analyses. Using this level of detail helps to make
observations within facilities less correlated in regression analyses.
Past analyses of PBE databases routinely have
included both hierarchical and non-hierarchical multivariable analyses to
predict outcomes, but we have not found differences in the significant factors
identified by the two approaches. The
absence of a difference may be due to the detailed manner in which PBE data
account for patient differences, including physiologic severity of illness
information, and treatment differences at the level of detail of each treatment
performed, with both time and date recorded.
Regression coefficients and odds ratios on the
independent variables are used to quantify the magnitudes and directions of
effects of each predictor variable on outcomes. Before analysis is started, pairwise correlations are used to explore
associations between independent variables (collinearity), and one of each pair
of highly correlated independent variables (r > 0.75) is deleted.
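The collinearity screen can be sketched as follows. Two choices here are assumptions rather than part of the PBE method as described: the cutoff is applied to the absolute value of r, so that strong negative correlations are also caught, and the later-listed variable of a correlated pair is the one dropped. In practice the Team would decide which variable of a pair to keep. The variable names and values are made up.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def drop_collinear(columns, cutoff=0.75):
    """Return variable names after removing one of each highly correlated pair.

    Assumption: the variable listed later in `columns` is the one removed.
    """
    dropped = set()
    for a, b in combinations(columns, 2):
        if a in dropped or b in dropped:
            continue
        if abs(pearson_r(columns[a], columns[b])) > cutoff:
            dropped.add(b)
    return [name for name in columns if name not in dropped]


# Made-up independent variables for six patients.
columns = {
    "weight_kg": [60, 72, 85, 90, 101, 110],
    "bmi": [22, 25, 29, 31, 35, 38],
    "age": [70, 45, 66, 52, 81, 59],
}

kept = drop_collinear(columns)
print(kept)
```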
For logistic regressions, discrimination can be
assessed using the area under the receiver operating characteristic curve (c) to
evaluate how well the model distinguishes patients who did not achieve a
specified outcome from patients who did achieve the specified outcome. Values of c that are closer to 1 indicate
better discrimination.35 In
addition, the Hosmer-Lemeshow goodness-of-fit test can be used to evaluate the
degree of correspondence between patients’ estimated probabilities of
developing the specified outcome and the actual development of the specified
outcome over groups spanning the entire range of probabilities
(calibration). Hosmer-Lemeshow p values
that are closer to 1 indicate better fit, in the sense that there is less evidence of miscalibration.
R² can be used to evaluate the proportion of variation in
continuous outcomes that is explained by the model. R² values closer to 1 indicate better models.
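Two of these measures can be computed directly from model predictions, as in the sketch below. The c statistic is calculated as the proportion of (event, non-event) patient pairs in which the patient with the event has the higher predicted probability, with ties counted as one half; this equals the area under the receiver operating characteristic curve. The function names and all data values are invented for illustration.

```python
def c_statistic(probabilities, outcomes):
    """Area under the ROC curve via pairwise comparison of predicted
    probabilities for patients with (1) and without (0) the outcome."""
    events = [p for p, y in zip(probabilities, outcomes) if y == 1]
    non_events = [p for p, y in zip(probabilities, outcomes) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for p_event in events:
        for p_non_event in non_events:
            if p_event > p_non_event:
                concordant += 1.0
            elif p_event == p_non_event:
                concordant += 0.5  # ties count as half
    return concordant / (len(events) * len(non_events))


def r_squared(observed, predicted):
    """Proportion of variation in a continuous outcome explained by the model."""
    mean = sum(observed) / len(observed)
    ss_residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_total = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_residual / ss_total


# Made-up logistic model output: predicted probabilities and actual outcomes.
c = c_statistic([0.9, 0.8, 0.35, 0.6, 0.2, 0.1], [1, 1, 1, 0, 0, 0])

# Made-up continuous outcome: observed values and model predictions.
r2 = r_squared([10, 12, 14, 16], [11, 11, 15, 15])

print(f"c = {c:.3f}, R-squared = {r2:.2f}")
```

In the made-up logistic example, 8 of the 9 event/non-event pairs are ranked correctly, so c = 8/9.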
Artifactual relationships are always possible in
regression analyses. However, in PBE
methodology, analyses are not performed by including all possible variables and
seeing what is significant. Instead,
the trans-disciplinary Team leads clinical analyses using theory, research
evidence to date, existing guidelines, and real-world clinical experience. Although many findings are not surprising,
some significant findings may be surprising and these findings lead to more
detailed analyses. Various types of
sensitivity analyses can be performed by including additional possible
confounders, examining subsets of variables and patients with specific
characteristics, and looking at multiple different slices of the data in order
to determine if the surprising findings persist. After exhausting all suggestions from clinicians as to what might
explain the surprising associations, and if the findings persist, then
providers and researchers feel more confident that the relationship is not an
artifact. Of course, significant
patient characteristics or interventions are found only if some patients have
them. And clinicians who use the
surprising significant interventions can speak to their effectiveness from
personal experience.
In the PBE methodology not all possible associations
can be articulated at the onset. The
PBE process depends on the ability to define identified outcome measures and
control for possible covariates in order to identify best treatments. While the investigation is governed by the
proposed study’s broad hypotheses, PBE is also a discovery process based on
post hoc analyses suggested by clinical professionals with fundamental
knowledge of patient and treatment issues.
Data collection questions and analyses are processed regularly with the
Team. All analyses are discussed until
the Team is satisfied that study questions have been addressed fully and
findings are based on the most valid interpretation of the data.
Certainty of causal conclusions from PBE
analyses may be lower than that of good RCTs, but it is much better than what is
often available for guiding clinical decisions. Conclusions can even be very strong if one takes into account
that the inference rests jointly on pre-existing
knowledge and the correlational results.
PBE is an innovative approach to understand the
impact of specific interventions on outcomes in routine clinical care. PBE uses both existing research findings and
practicing clinicians’ expertise to define the elements and analyze the data to
capture the complexity of the care process.
Preliminary findings from previous PBE studies show quite clearly that
PBE methodology can succeed in opening routine practice to scientific inquiry.
Due to the central
role played by the Project Team in all aspects of PBE, this approach can be
characterized as a form of “participatory action research”―a bottom-up
approach that values the participation of those actually engaged in the
care-providing process and garners their participation in implementing study
findings. PBE encourages new findings,
even those that challenge conventional wisdom and long-standing practice.
Using a severity system, such as the
Comprehensive Severity Index, enables going beyond controlling only for study
disease severity: it allows control for many complex comorbidities common to
patients (particularly elderly patients), reflecting more accurately the
realities of clinical practice. The
strength of CSI’s mechanism for compensating or adjusting for differences among
patients allows for a more powerful assessment of the effectiveness of
therapeutic interventions. CSI uses
specific, disease-oriented questions to produce a highly sensitive measure of
severity that cannot be produced by using diagnosis and/or procedure codes
alone or a limited, fixed set of physiologic criteria no matter what the
underlying diagnoses may be. Diagnosis
codes indicate existence of disease; they do not indicate extent or severity of
disease.
Limitations
PBE methodology relies on
the expertise of participating facility clinicians to guide the development of
high-level study hypotheses and identify critical data elements to study. As such, these clinicians are aware of study
data elements as they provide care and complete point-of-care intervention
documentation forms or perform routine documentation practices. This could be construed as introducing
treatment or observational bias.
However, the number of clinicians who participate in the development of
study instruments is a very small subset of all clinicians who care for
patients in study facilities.
Intervention documentation forms and project hypotheses are designed to
capture descriptions of actual practice, not alter practice patterns. In addition, the novelty of attention to
specific study questions would wane over the course of an extended patient
enrollment period.
Although supplemental point-of-care intervention documentation forms provide an unprecedented level of detail about interventions, they also have limitations. Add-on documentation increases the documentation burden of front-line staff beyond traditional site practices, and the allotted documentation time may not be sufficient to ensure complete documentation of both. Intervention documentation form training usually is conducted via a train-the-trainer approach using a lead clinician in each discipline at each study site. Thus, the training of the majority of clinicians depends on the expertise and time availability of the site trainers. It is possible to have the same training team visit each study site to train all clinicians in point-of-care documentation; with sufficient funding, such standardized training is preferable. Monitoring of documentation accuracy usually is an obligation of each study site. If it is not done well, the resulting inaccuracies add noise to the data and bias analyses against finding significant treatment effects.
A physiologic severity indexing system, such as CSI, is limited by data availability. Credentialed coding personnel at each facility assign ICD-9-CM codes as part of standard operating procedures; it is these codes that usually determine reimbursement. Because the system is built upon ICD-9-CM coding, a smaller number of assigned codes may result in lower severity of illness scores. If laboratory tests are not ordered, findings are not clearly reported, or complications are not documented, the recorded severity or incidence rate for the related conditions will be lower. The frequency and type of test ordering and the availability of information may not be uniform across sites and could account for a portion of the site variability in CSI scores.
One
great concern about observational studies is that the relation between an
intervention and an outcome may be confounded by other variables. Usually confounders are controlled for
through study design or statistical analyses.
Regression is a powerful tool to control for confounders, but many
independent variables may be required.
With many independent variables another concern is over-specification
(i.e., when the regression model has too many independent variables relative to
the size of the study group). Having a
Team that raises many possible confounding variables, and being careful with
statistical methods, help to overcome these limitations. Despite
these limitations, micro-level data provide the ability to focus on the
individual patient level to explore reasons for findings and discover many
important associations between treatments and outcomes.
In summary, PBE
methodology creates a comprehensive database to assess the importance of such
patient variables as gender, race, severity of illness, baseline level of
functioning, and various therapy interventions on patient-centered
outcomes. The data describe the
duration, intensity, and components of treatment regimens. PBE studies allow discovery of treatment
practices that are associated with better outcomes for patients with various
levels of illness or impairment. These
include findings about surgical approaches, medications, physical therapy,
occupational therapy, speech and language therapy, timing of treatments,
nutritional support, etc., that are implementable in routine practice and have
been found to be associated with better outcomes as predicted by PBE models.
CONCLUSIONS
We need more evidence―evidence that is reliable, strong, and generalizable to real world scenarios. Simplistic beliefs in perfect rigor provided by large RCTs stand in the way of improving evidence for most clinical problems. In addition to RCTs, strong alternative designs should be employed more often, and there are circumstances where correlational designs provide the best evidence given practical constraints or the nature of the questions at issue. Alternative research designs can provide reliable evidence (level 1 or near level 1 where there are no plausible alternatives) for major and new treatments and complex system-level interventions. There is also a need for research methodologies that provide reliable, good (level 2) evidence using correlational modeling with very good covariance matching analyses. In certain circumstances, these studies provide the best information. Weaker (level 3) studies also may provide needed useful information in other circumstances. It is time to incorporate sophisticated research design considerations into evidence grading methods by distinguishing circumstances in which alternative study designs are strong or optimal from circumstances where RCTs are designed to provide the best evidence.
It is incumbent upon investigators, if using something other than an RCT, to demonstrate why it is a better approach and that it yields good, statistically robust answers. The goal of the Conference and this Primer was to expand the study design toolbox and help investigators decide the most appropriate design to use. Clinicians and patients have to make decisions every day whether they have information or not. We need to figure out the most expeditious ways of providing the best available information, even if it is not definitive, when it is needed at the point of care and in a way that is understandable. We have to do this knowing that evidence is dynamic, and we should expect it to change and revisit it on a regular basis.
Increased use of sophisticated research designs and statistical methods can greatly increase the speed with which reliable information is obtained to improve knowledge of the effectiveness of interventions in clinical practice. Hundreds of millions of dollars are spent nationwide to incorporate better physical and biological tools into research, e.g., MRI, genomic, and proteomic technology, but research designs and statistical tools are also critical to making a difference in practice. The improved research designs and evidence evaluation sketched in this Primer can speed the progress of translational research and knowledge of what works best in actual practice, as well as discover new factors that are associated with improved outcomes in practice. The quality, effectiveness, and value of health care in practice can achieve stunning gains. The tools exist; they simply need to be used, because experimental designs such as RCTs are rarely feasible for evaluating complex interventions in the real world.
Acknowledgment. It is a pleasure to thank the people who made this Conference a success. Dr. Peter Buerhaus and his executive assistant, Brenda Cornett-Compton, created the proposal and skillfully handled many of the meeting details. Dr. Gerben DeJong was a skillful moderator as well as a Workgroup Leader and on the Conference Planning Committee. Mark Johnston, PhD, helped create the outline for the Primer. Linsey BenAmi, MPH, and Randy Smout, MS helped with editing the Primer. The planning committee designed the conference agenda to make it as understandable as possible, and the presenters and workgroup leaders made an exceptional effort to implement the agenda: Carolyn Clancy, MD [Speaker], Steven Tingus, MS, C.Phil [Speaker], Kelly Cronin, MPH [Speaker], Scott Gottlieb, MD [Speaker], Andrew Kramer MD [Speaker], Sharon-Lise Normand, PhD [Speaker], Arlene Ash, PhD [Conf Plan Com], Alan B. Cohen, ScD [Conf Plan Com], John Corrigan, PhD [Conf Plan Com], Marcel Dijkers, PhD [Workgroup Leader, Conf Plan Com], Alan M. Jette, PhD, PT [Conf Plan Com], Arthur Hartz, MD, PhD [Workgroup Leader, Conf Plan Com], David Helms [Speaker, Conf Plan Com], John Melvin, MD [Conf Plan Com], Robert Rhodes, MD, FACS [Speaker, Conf Plan Com], Mary Stuart, ScD [Conf Plan Com], Ruth Brannon [Workgroup Leader, Conf Plan Com], Michael Weinrich [Speaker, Conf Plan Com], Nancy Bergstrom, PhD, RN [Workgroup Leader].
1. Berwick DM, The John Eisenberg Lecture: Health Services Research as a Citizen in Improvement. Health Services Research 40:2 (April 2005):317-336.
2. Crossing the Quality Chasm: a new health system for the 21st century. March 2001.
3. Horn SD, Gassaway J. Practice-Based Evidence Study Design for Comparative Effectiveness Research. Medical Care 2007;45:10 (October Supplement 2).
4. Guller U, Hervey S, Purves H, Muhlbaier LH, Peterson ED, Eubanks S et al. Laparoscopic versus open appendectomy: outcomes comparison based on a large administrative database. Ann Surg 2004;239(1):43-52.
5. Wald H, Epstein A, Kramer A. Extended use of indwelling urinary catheters in postoperative hip fracture patients. Med Care 2005;43(10):1009-1017.
6. Campbell DT, Stanley JC. Experimental and Quasi-experimental Designs for Research. Chicago: Rand McNally, 1966.
7. D’Agostino RB Jr. Propensity score methods for bias reduction in the comparison of a treatment to a non-randomized control group. Stat Med 1998 Oct 15;17(19):2265-81.
8. Horn SD, DeJong G, Ryser D, Veazie P, Teraoka, J. Another Look at Observational Studies in Rehabilitation Research: Going Beyond the Holy Grail of the Randomized Controlled Trial. Arch Phys Med Rehabil 2005;86(12 Supplement 2):S8-S15.
9. Horn SD, Editor. Clinical Practice Improvement Methodology: Implementation and Evaluation. Faulkner & Gray, New York, New York, 1997.
10. Horn SD, DeJong G, Smout R, Gassaway J, James R, Conroy B. Stroke Rehabilitation Patients, Practice, and Outcomes: Is Earlier and More Aggressive Therapy Better? Arch Phys Med Rehabil 2005;86(12 Supplement 2):S101-S114.
11. DeJong G, Horn SD, Conroy B, Nichols D, Healton E. Opening the Black Box of Poststroke Rehabilitation: Stroke Rehabilitation Patients, Processes, and Outcomes. Arch Phys Med Rehabil 2005;86(12 Supplement 2):S1-S7.
12. Gassaway J, Horn SD, DeJong G, Smout R, Clark C, James R. Applying the Clinical Practice Improvement Approach to Stroke Rehabilitation: Methods Used and Baseline Results. Arch Phys Med Rehabil 2005;86(12 Supplement 2):S16-S33.
13. Jette AM. The Post-Stroke Rehabilitation Outcomes Project. Arch Phys Med Rehabil 2005;86(12 Suppl 2):S124-5.
14. Ottenbacher KJ. The Post-Stroke Rehabilitation Outcomes Project. Arch Phys Med Rehabil 2005;86(12 Suppl 2):S121-3.
15. DeJong G, Horn SD, Smout RJ, Gassaway J, James R. The Post-stroke Rehabilitation Outcomes Project revisited. Arch Phys Med Rehabil 2006;87(April,4):595-597.
16. Neumayer LA, Smout RJ, Horn HGS, Horn SD. Early and Sufficient Feeding Reduces Length of Stay and Charges in Surgical Patients. Journal of Surgical Research 2001;95(1):73-77.
17. Horn SD, Wright HL, Couperus JJ, Rhodes RS, Smout RJ, Roberts KA, Linares AP. Association Between Patient Controlled Analgesia Pump Use and Post-Operative Surgical Site Infection in Intestinal Surgery Patients. Surgical Infections 2002;3(2):109-118. Abstracted in Year Book of Surgery, 2003.
18. Horn SD, Smout RJ. Effect of prematurity on respiratory syncytial virus hospital resource use and outcomes. J Pediatrics 2003;143 (5 Suppl): S133-141.
19. Blonde L, Ginsberg BH, Horn SD, et al. Frequency of Blood Glucose Monitoring in Relation to Glycemic Control in Patients with Type 2 Diabetes, Diabetes Care 25:1 (January 2002) 245-246.
20. Horn SD, Bender SA, Ferguson ML, Smout RJ, Bergstrom N, Taler G, Cook AS, Sharkey SS, Voss AC. The National Pressure Ulcer Long-term Care Study (NPULS): Pressure ulcer development in long-term care residents. J. American Geriatrics Society 2004 March;52(3):359-367.
21. Horn SD, Sharkey PD, Tracy DM, Horn CE, James B, Goodwin F. Intended and Unintended Consequences of HMO Cost-Containment Strategies: Results from the Managed Care Outcomes Project. The American Journal of Managed Care 1996;2(3):253-264.
22. Benson K, Hartz AJ. A comparison of observational studies and randomized, controlled trials. N Engl J Med 2000;342:1878-86.
23. Concato J, Shah N, Horwitz RI. Randomized, controlled trials, observational studies, and the hierarchy of research designs. N Engl J Med 2000;342:1887-92.
24. Ioannidis JP, Haidich AB, Pappa M, et al. Comparison of evidence of treatment effects in randomized and nonrandomized studies. JAMA 2001;286:821-30.
25. Averill RF, McGuire TE, Manning BE, Fowler DA, Horn SD, Dickson PS, et al. A study of the relationship between severity of illness and hospital cost in New Jersey hospitals. Health Services Research 27(5): 587-617, 1992.
26. Horn SD, Torres A Jr, Willson D, Dean JM, Gassaway J, Smout R. Development of a Pediatric Age- and Disease-Specific Severity Measure. J Pediatr 141:4 (2002): 496-503.
27. Horn SD, Sharkey PD, Buckle JM, Backofen JE, Averill RF, Horn RA. The relationship between severity of illness and hospital length of stay and mortality. Medical Care 29:305-317, 1991.
28. Willson DF, Horn SD, Smout RJ, Gassaway J, Torres A. Severity Assessment in Children Hospitalized with Bronchiolitis Using the Pediatric Component of the Comprehensive Severity Index (CSI®), Pediatric Critical Care Medicine 1(2): 127-132, 2000.
29. Cohen, J. 1988. Statistical Power Analysis for the Behavioral Sciences, Second edition, Lawrence Erlbaum Associates, Inc, Publishers; Hillsdale, NJ
30. Bode R, Heinemann A, Semik P, Mallinson T. Patterns of Therapy Activities Across Length of Stay and Impairment Levels: Peering Inside the “Black Box” of Inpatient Stroke Rehabilitation. Arch Phys Med Rehabil 2004;85:1901-1908.
31. Horn SD, Sharkey PD, Gassaway J. Managed Care Outcomes Project: Study Design, Baseline Patient Characteristics, and Outcome Measures. The American Journal of Managed Care 1996;2(3):237-247.
32. Horn SD, Bender SA, Bergstrom N, Cook AS, Ferguson ML, Rimmasch HL, Sharkey SS, Smout RJ, Taler G, Voss, AC. Description of the National Pressure Ulcer Long-Term Care Study (NPULS). J. American Geriatrics Society 2002;50:1816-1825.
33. Connor SR, Horn SD, Smout RJ, Gassaway JV. The National Hospice Outcomes Project (NHOP): Development and Implementation of a Multi-Site Hospice Outcomes Study. J Pain Symptom Manage 2005 March;29(3):286-296.
34. Raudenbush SW, Bryk AS. 2002. Hierarchical Linear Models: Applications and Data Analysis Methods Second edition. Sage Publications. Thousand Oaks, CA.
35. Hosmer DW, Lemeshow S. 1989. Applied Logistic Regression. John Wiley and Sons, New York, NY.