ICH E9

ICH E9 Statistical Principles for Clinical Trials

I. INTRODUCTION

1.1 Background and Purpose

The efficacy and safety of medicinal products should be demonstrated by clinical trials which follow the guidance in 'Good Clinical Practice: Consolidated Guideline' (ICH E6) adopted by the ICH, 1 May 1996. The role of statistics in clinical trial design and analysis is acknowledged as essential in that ICH guideline. The proliferation of statistical research in the area of clinical trials, coupled with the critical role of clinical research in the drug approval process and health care in general, necessitates a succinct document on statistical issues related to clinical trials. This guidance is written primarily to attempt to harmonise the principles of statistical methodology applied to clinical trials for marketing applications submitted in Europe, Japan and the United States.

As a starting point, this guideline utilised the CPMP (Committee for Proprietary Medicinal Products) Note for Guidance entitled 'Biostatistical Methodology in Clinical Trials in Applications for Marketing Authorisations for Medicinal Products' (December, 1994). It was also influenced by 'Guidelines on the Statistical Analysis of Clinical Studies' (March, 1992) from the Japanese Ministry of Health and Welfare and the U.S. Food and Drug Administration document entitled 'Guideline for the Format and Content of the Clinical and Statistical Sections of a New Drug Application' (July, 1988). Some topics related to statistical principles and methodology are also embedded within other ICH guidelines, particularly those listed below. The specific guidance that contains related text will be identified in various sections of this document.

E1A:	The Extent of Population Exposure to Assess Clinical Safety
E2A:	Clinical Safety Data Management: Definitions and Standards for Expedited Reporting
E2B:	Clinical Safety Data Management: Data Elements for Transmission of Individual Case Safety Reports
E2C:	Clinical Safety Data Management: Periodic Safety Update Reports for Marketed Drugs
E3:	Structure and Content of Clinical Study Reports
E4:	Dose-Response Information to Support Drug Registration
E5:	Ethnic Factors in the Acceptability of Foreign Clinical Data
E6:	Good Clinical Practice: Consolidated Guideline
E7:	Studies in Support of Special Populations: Geriatrics
E8:	General Considerations for Clinical Trials
E10:	Choice of Control Group in Clinical Trials
M1:	Standardisation of Medical Terminology for Regulatory Purposes
M3:	Non-Clinical Safety Studies for the Conduct of Human Clinical Trials for Pharmaceuticals.

This guidance is intended to give direction to sponsors in the design, conduct, analysis, and evaluation of clinical trials of an investigational product in the context of its overall clinical development. The document will also assist scientific experts charged with preparing application summaries or assessing evidence of efficacy and safety, principally from clinical trials in later phases of development.

1.2 Scope and Direction

The focus of this guidance is on statistical principles. It does not address the use of specific statistical procedures or methods. Specific procedural steps to ensure that principles are implemented properly are the responsibility of the sponsor. Integration of data across clinical trials is discussed, but is not a primary focus of this guidance. Selected principles and procedures related to data management or clinical trial monitoring activities are covered in other ICH guidelines and are not addressed here.

This guidance should be of interest to individuals from a broad range of scientific disciplines. However, it is assumed that the actual responsibility for all statistical work associated with clinical trials will lie with an appropriately qualified and experienced statistician, as indicated in ICH E6. The role and responsibility of the trial statistician (see Glossary), in collaboration with other clinical trial professionals, is to ensure that statistical principles are applied appropriately in clinical trials supporting drug development. Thus, the trial statistician should have a combination of education/training and experience sufficient to implement the principles articulated in this guidance.

For each clinical trial contributing to a marketing application, all important details of its design and conduct and the principal features of its proposed statistical analysis should be clearly specified in a protocol written before the trial begins. The extent to which the procedures in the protocol are followed and the primary analysis is planned a priori will contribute to the degree of confidence in the final results and conclusions of the trial. The protocol and subsequent amendments should be approved by the responsible personnel, including the trial statistician. The trial statistician should ensure that the protocol and any amendments cover all relevant statistical issues clearly and accurately, using technical terminology as appropriate.

The principles outlined in this guidance are primarily relevant to clinical trials conducted in the later phases of development, many of which are confirmatory trials of efficacy. In addition to efficacy, confirmatory trials may have as their primary variable a safety variable (e.g. an adverse event, a clinical laboratory variable or an electrocardiographic measure), a pharmacodynamic or a pharmacokinetic variable (as in a confirmatory bioequivalence trial). Furthermore, some confirmatory findings may be derived from data integrated across trials, and selected principles in this guidance are applicable in this situation. Finally, although the early phases of drug development consist mainly of clinical trials that are exploratory in nature, statistical principles are also relevant to these clinical trials. Hence, the substance of this document should be applied as far as possible to all phases of clinical development.

Many of the principles delineated in this guidance deal with minimising bias (see Glossary) and maximising precision. As used in this guidance, the term 'bias' describes the systematic tendency of any factors associated with the design, conduct, analysis and interpretation of the results of clinical trials to make the estimate of a treatment effect (see Glossary) deviate from its true value. It is important to identify potential sources of bias as completely as possible so that attempts to limit such bias may be made. The presence of bias may seriously compromise the ability to draw valid conclusions from clinical trials.
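
In conventional statistical notation (the symbols are introduced here for illustration and do not appear in this guidance), write \theta for the true treatment effect and \hat{\theta} for its estimate from the trial. The twin aims of minimising bias and maximising precision can then be sketched as keeping both terms of the mean squared error small:

```latex
\[
\operatorname{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta,
\qquad
\operatorname{MSE}(\hat{\theta}) = \operatorname{Var}(\hat{\theta}) + \operatorname{Bias}(\hat{\theta})^{2}.
\]
```

A larger sample size generally reduces the variance term but leaves the bias term untouched, which is one way of seeing why bias has to be addressed through design, conduct and analysis rather than through trial size alone.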

Some sources of bias arise from the design of the trial, for example an assignment of treatments such that subjects at lower risk are systematically assigned to one treatment. Other sources of bias arise during the conduct and analysis of a clinical trial. For example, protocol violations and exclusion of subjects from analysis based upon knowledge of subject outcomes are possible sources of bias that may affect the accurate assessment of the treatment effect. Because bias can occur in subtle or unknown ways and its effect is not measurable directly, it is important to evaluate the robustness of the results and primary conclusions of the trial. Robustness is a concept that refers to the sensitivity of the overall conclusions to various limitations of the data, assumptions, and analytic approaches to data analysis. Robustness implies that the treatment effect and primary conclusions of the trial are not substantially affected when analyses are carried out based on alternative assumptions or analytic approaches. The interpretation of statistical measures of uncertainty of the treatment effect and treatment comparisons should involve consideration of the potential contribution of bias to the p-value, confidence interval, or inference.

Because the predominant approaches to the design and analysis of clinical trials have been based on frequentist statistical methods, the guidance largely refers to the use of frequentist methods (see Glossary) when discussing hypothesis testing and/or confidence intervals. This should not be taken to imply that other approaches are not appropriate: the use of Bayesian (see Glossary) and other approaches may be considered when the reasons for their use are clear and when the resulting conclusions are sufficiently robust.

II. CONSIDERATIONS FOR OVERALL CLINICAL DEVELOPMENT

2.1 Trial Context

2.1.1 Development Plan

The broad aim of the process of clinical development of a new drug is to find out whether there is a dose range and schedule at which the drug can be shown to be simultaneously safe and effective, to the extent that the risk-benefit relationship is acceptable. The particular subjects who may benefit from the drug, and the specific indications for its use, also need to be defined.

Satisfying these broad aims usually requires an ordered programme of clinical trials, each with its own specific objectives (see ICH E8). This should be specified in a clinical plan, or a series of plans, with appropriate decision points and flexibility to allow modification as knowledge accumulates. A marketing application should clearly describe the main content of such plans, and the contribution made by each trial. Interpretation and assessment of the evidence from the total programme of trials involves synthesis of the evidence from the individual trials (see Section 7.2). This is facilitated by ensuring that common standards are adopted for a number of features of the trials such as dictionaries of medical terms, definition and timing of the main measurements, handling of protocol deviations and so on. A statistical summary, overview or meta-analysis (see Glossary) may be informative when medical questions are addressed in more than one trial. Where possible this should be envisaged in the plan so that the relevant trials are clearly identified and any necessary common features of their designs are specified in advance. Other major statistical issues (if any) that are expected to affect a number of trials in a common plan should be addressed in that plan.

2.1.2 Confirmatory Trial

A confirmatory trial is an adequately controlled trial in which the hypotheses are stated in advance and evaluated. As a rule, confirmatory trials are necessary to provide firm evidence of efficacy or safety. In such trials the key hypothesis of interest follows directly from the trial’s primary objective, is always pre-defined, and is the hypothesis that is subsequently tested when the trial is complete. In a confirmatory trial it is equally important to estimate with due precision the size of the effects attributable to the treatment of interest and to relate these effects to their clinical significance.

Confirmatory trials are intended to provide firm evidence in support of claims and hence adherence to protocols and standard operating procedures is particularly important; unavoidable changes should be explained and documented, and their effect examined. A justification of the design of each such trial, and of other important statistical aspects such as the principal features of the planned analysis, should be set out in the protocol. Each trial should address only a limited number of questions.

Firm evidence in support of claims requires that the results of the confirmatory trials demonstrate that the investigational product under test has clinical benefits. The confirmatory trials should therefore be sufficient to answer each key clinical question relevant to the efficacy or safety claim clearly and definitively. In addition, it is important that the basis for generalisation (see Glossary) to the intended patient population is understood and explained; this may also influence the number and type (e.g. specialist or general practitioner) of centres and/or trials needed. The results of the confirmatory trial(s) should be robust. In some circumstances the weight of evidence from a single confirmatory trial may be sufficient.

2.1.3 Exploratory Trial

The rationale and design of confirmatory trials nearly always rest on earlier clinical work carried out in a series of exploratory studies. Like all clinical trials, these exploratory studies should have clear and precise objectives. However, in contrast to confirmatory trials, their objectives may not always lead to simple tests of pre-defined hypotheses. In addition, exploratory trials may sometimes require a more flexible approach to design so that changes can be made in response to accumulating results. Their analysis may entail data exploration; tests of hypothesis may be carried out, but the choice of hypothesis may be data dependent. Such trials cannot be the basis of the formal proof of efficacy, although they may contribute to the total body of relevant evidence.

Any individual trial may have both confirmatory and exploratory aspects. For example, in most confirmatory trials the data are also subjected to exploratory analyses which serve as a basis for explaining or supporting their findings and for suggesting further hypotheses for later research. The protocol should make a clear distinction between the aspects of a trial which will be used for confirmatory proof and the aspects which will provide data for exploratory analysis.

2.2 Scope of Trials

2.2.1 Population

In the earlier phases of drug development the choice of subjects for a clinical trial may be heavily influenced by the wish to maximise the chance of observing specific clinical effects of interest, and hence they may come from a very narrow subgroup of the total patient population for which the drug may eventually be indicated. However, by the time the confirmatory trials are undertaken, the subjects in the trials should more closely mirror the target population. Hence, in these trials it is generally helpful to relax the inclusion and exclusion criteria as much as possible within the target population, while maintaining sufficient homogeneity to permit precise estimation of treatment effects. No individual clinical trial can be expected to be totally representative of future users, because of the possible influences of geographical location, the time when it is conducted, the medical practices of the particular investigator(s) and clinics, and so on. However, the influence of such factors should be reduced wherever possible, and subsequently discussed during the interpretation of the trial results.

2.2.2 Primary and Secondary Variables

The primary variable (‘target’ variable, primary endpoint) should be the variable capable of providing the most clinically relevant and convincing evidence directly related to the primary objective of the trial. There should generally be only one primary variable. This will usually be an efficacy variable, because the primary objective of most confirmatory trials is to provide strong scientific evidence regarding efficacy. Safety/tolerability may sometimes be the primary variable, and will always be an important consideration. Measurements relating to quality of life and health economics are further potential primary variables. The selection of the primary variable should reflect the accepted norms and standards in the relevant field of research. The use of a reliable and validated variable with which experience has been gained either in earlier studies or in published literature is recommended. There should be sufficient evidence that the primary variable can provide a valid and reliable measure of some clinically relevant and important treatment benefit in the patient population described by the inclusion and exclusion criteria. The primary variable should generally be the one used when estimating the sample size (see section 3.5).

In many cases, the approach to assessing subject outcome may not be straightforward and should be carefully defined. For example, it is inadequate to specify mortality as a primary variable without further clarification; mortality may be assessed by comparing proportions alive at fixed points in time, or by comparing overall distributions of survival times over a specified interval. Another common example is a recurring event; the measure of treatment effect may again be a simple dichotomous variable (any occurrence during a specified interval), time to first occurrence, rate of occurrence (events per time units of observation), etc. The assessment of functional status over time in studying treatment for chronic disease presents other challenges in selection of the primary variable. There are many possible approaches, such as comparisons of the assessments done at the beginning and end of the interval of observation, comparisons of slopes calculated from all assessments throughout the interval, comparisons of the proportions of subjects exceeding or declining beyond a specified threshold, or comparisons based on methods for repeated measures data. To avoid multiplicity concerns arising from post hoc definitions, it is critical to specify in the protocol the precise definition of the primary variable as it will be used in the statistical analysis. In addition, the clinical relevance of the specific primary variable selected and the validity of the associated measurement procedures will generally need to be addressed and justified in the protocol.
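
To illustrate how differently a recurring event can be turned into a primary variable, the following Python sketch derives three of the measures mentioned above from the same subject record. The record layout (`SubjectRecord`, `event_days`, `followup_days`) and the function names are hypothetical and chosen only for this example; the point is that each definition yields a different quantity, so the protocol has to state which one is meant.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SubjectRecord:
    """Hypothetical per-subject data for a recurring event."""
    event_days: List[float]   # days since randomisation on which an event occurred
    followup_days: float      # length of the specified observation interval


def events_in_window(rec: SubjectRecord) -> List[float]:
    """Events that fall inside the observation interval, in time order."""
    return sorted(d for d in rec.event_days if 0 <= d <= rec.followup_days)


def any_occurrence(rec: SubjectRecord) -> bool:
    """Dichotomous variable: any event during the specified interval."""
    return len(events_in_window(rec)) > 0


def time_to_first(rec: SubjectRecord) -> Tuple[float, bool]:
    """Time to first occurrence; the flag is False if censored at end of follow-up."""
    events = events_in_window(rec)
    if events:
        return events[0], True
    return rec.followup_days, False


def rate_per_year(rec: SubjectRecord) -> float:
    """Rate of occurrence: events per year of observation."""
    return len(events_in_window(rec)) / rec.followup_days * 365.25
```

A subject with events on days 30 and 120 of a 365-day interval counts as one "yes" under the dichotomous definition, contributes a time of 30 days under the second, and a rate of about two events per year under the third.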

The primary variable should be specified in the protocol, along with the rationale for its selection. Redefinition of the primary variable after unblinding will almost always be unacceptable, since the biases this introduces are difficult to assess. When the clinical effect defined by the primary objective is to be measured in more than one way, the protocol should identify one of the measurements as the primary variable on the basis of clinical relevance, importance, objectivity, and/or other relevant characteristics, whenever such selection is feasible.

Secondary variables are either supportive measurements related to the primary objective or measurements of effects related to the secondary objectives. Their pre-definition in the protocol is also important, as well as an explanation of their relative importance and roles in interpretation of trial results. The number of secondary variables should be limited and should be related to the limited number of questions to be answered in the trial.

2.2.3 Composite Variables

If a single primary variable cannot be selected from multiple measurements associated with the primary objective, another useful strategy is to integrate or combine the multiple measurements into a single or 'composite' variable, using a pre-defined algorithm. Indeed, the primary variable sometimes arises as a combination of multiple clinical measurements (e.g. the rating scales used in arthritis, psychiatric disorders and elsewhere). This approach addresses the multiplicity problem without requiring adjustment to the type I error. The method of combining the multiple measurements should be specified in the protocol, and an interpretation of the resulting scale should be provided in terms of the size of a clinically relevant benefit. When a composite variable is used as a primary variable, the components of this variable may sometimes be analysed separately, where clinically meaningful and validated. When a rating scale is used as a primary variable, it is especially important to address such factors as content validity (see Glossary), inter- and intra-rater reliability (see Glossary) and responsiveness for detecting changes in the severity of disease.

2.2.4 Global Assessment Variables

In some cases, 'global assessment' variables (see Glossary) are developed to measure the overall safety, overall efficacy, and/or overall usefulness of a treatment. This type of variable integrates objective variables and the investigator’s overall impression about the state or change in the state of the subject, and is usually a scale of ordered categorical ratings. Global assessments of overall efficacy are well established in some therapeutic areas, such as neurology and psychiatry.

Global assessment variables generally have a subjective component. When a global assessment variable is used as a primary or secondary variable, fuller details of the scale should be included in the protocol with respect to:

1) the relevance of the scale to the primary objective of the trial;

2) the basis for the validity and reliability of the scale;

3) how to utilise the data collected on an individual subject to assign him/her to a unique category of the scale;

4) how to assign subjects with missing data to a unique category of the scale, or otherwise evaluate them.

If objective variables are considered by the investigator when making a global assessment, then those objective variables should be considered as additional primary, or at least important secondary, variables.

Global assessment of usefulness integrates components of both benefit and risk and reflects the decision making process of the treating physician, who must weigh benefit and risk in making product use decisions. A problem with global usefulness variables is that their use could in some cases lead to the result of two products being declared equivalent despite having very different profiles of beneficial and adverse effects. For example, judging the global usefulness of a treatment as equivalent or superior to an alternative may mask the fact that it has little or no efficacy but fewer adverse effects. Therefore it is not advisable to use a global usefulness variable as a primary variable. If global usefulness is specified as primary, it is important to consider specific efficacy and safety outcomes separately as additional primary variables.

2.2.5 Multiple Primary Variables

It may sometimes be desirable to use more than one primary variable, each of which (or a subset of which) could be sufficient to cover the range of effects of the therapies. The planned manner of interpretation of this type of evidence should be carefully spelled out. It should be clear whether an impact on any of the variables, some minimum number of them, or all of them, would be considered necessary to achieve the trial objectives. The primary hypothesis or hypotheses and parameters of interest (e.g. mean, percentage, distribution) should be clearly stated with respect to the primary variables identified, and the approach to statistical inference described. The effect on the type I error should be explained because of the potential for multiplicity problems (see Section 5.6); the method of controlling type I error should be given in the protocol. The extent of intercorrelation among the proposed primary variables may be considered in evaluating the impact on type I error. If the purpose of the trial is to demonstrate effects on all of the designated primary variables, then there is no need for adjustment of the type I error, but the impact on type II error and sample size should be carefully considered.
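
The arithmetic behind these statements can be sketched in a few lines of Python. The sketch assumes, purely for simplicity, that the tests on the primary variables are statistically independent; in practice the variables are usually correlated and, as noted above, that correlation may be taken into account. The Bonferroni split is shown only as one familiar way of controlling type I error and is not prescribed by this guidance. All function names are illustrative.

```python
from typing import Sequence


def familywise_error_any(alpha: float, k: int) -> float:
    """Chance of at least one false positive among k independent tests,
    each carried out at level alpha, when success on ANY variable suffices."""
    return 1 - (1 - alpha) ** k


def bonferroni_level(alpha: float, k: int) -> float:
    """Per-test level that keeps the overall type I error at or below alpha."""
    return alpha / k


def joint_power_all(powers: Sequence[float]) -> float:
    """Power to show an effect on ALL variables, assuming independence."""
    result = 1.0
    for p in powers:
        result *= p
    return result
```

With two primary variables each tested at 0.05, the chance of a false positive on at least one of them is 0.0975 under independence, which is why an adjustment is needed when an effect on any variable would be claimed as success. When effects on both are required, no adjustment of the type I error is needed, but two tests with 90% power each give only 81% power to succeed on both under the same assumption, which is the type II error and sample size consequence referred to above.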

2.2.6 Surrogate Variables

When direct assessment of the clinical benefit to the subject through observing actual clinical efficacy is not practical, indirect criteria (surrogate variables - see Glossary) may be considered. Commonly accepted surrogate variables are used in a number of indications where they are believed to be reliable predictors of clinical benefit. There are two principal concerns with the introduction of any proposed surrogate variable. First, it may not be a true predictor of the clinical outcome of interest. For example it may measure treatment activity associated with one specific pharmacological mechanism, but may not provide full information on the range of actions and ultimate effects of the treatment, whether positive or negative. There have been many instances where treatments showing a highly positive effect on a proposed surrogate have ultimately been shown to be detrimental to the subjects' clinical outcome; conversely, there are cases of treatments conferring clinical benefit without measurable impact on proposed surrogates. Secondly, proposed surrogate variables may not yield a quantitative measure of clinical benefit that can be weighed directly against adverse effects. Statistical criteria for validating surrogate variables have been proposed but the experience with their use is relatively limited. In practice, the strength of the evidence for surrogacy depends upon (i) the biological plausibility of the relationship, (ii) the demonstration in epidemiological studies of the prognostic value of the surrogate for the clinical outcome and (iii) evidence from clinical trials that treatment effects on the surrogate correspond to effects on the clinical outcome. Relationships between clinical and surrogate variables for one product do not necessarily apply to a product with a different mode of action for treating the same disease.

2.2.7 Categorised Variables

Dichotomisation or other categorisation of continuous or ordinal variables may sometimes be desirable. Criteria of 'success' and 'response' are common examples of dichotomies which require precise specification in terms of, for example, a minimum percentage improvement (relative to baseline) in a continuous variable, or a ranking categorised as at or above some threshold level (e.g., 'good') on an ordinal rating scale.

For example, the reduction of diastolic blood pressure to below 90 mmHg is a common dichotomisation. Categorisations are most useful when they have clear clinical relevance. The criteria for categorisation should be pre-defined and specified in the protocol, as knowledge of trial results could easily bias the choice of such criteria. Because categorisation normally implies a loss of information, a consequence will be a loss of power in the analysis; this should be accounted for in the sample size calculation.
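
The size of the power loss can be gauged with a rough calculation. The Python sketch below assumes a normally distributed outcome with unit standard deviation in both groups, equal group sizes, and large-sample normal approximations for a two-sided comparison of means and of response proportions; these assumptions, the function names and the chosen numbers are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution


def power_continuous(delta: float, n_per_arm: int, alpha: float = 0.05) -> float:
    """Approximate power for comparing two means (difference delta, SD 1)."""
    z_crit = _STD.inv_cdf(1 - alpha / 2)
    return _STD.cdf(delta * sqrt(n_per_arm / 2) - z_crit)


def power_dichotomised(delta: float, n_per_arm: int, cut: float,
                       alpha: float = 0.05) -> float:
    """Approximate power after reducing the same outcome to 'above cut: yes/no'."""
    z_crit = _STD.inv_cdf(1 - alpha / 2)
    p_control = 1 - _STD.cdf(cut)            # response rate, control mean 0
    p_treated = 1 - _STD.cdf(cut - delta)    # response rate, treated mean delta
    se = sqrt((p_control * (1 - p_control) + p_treated * (1 - p_treated)) / n_per_arm)
    return _STD.cdf(abs(p_treated - p_control) / se - z_crit)
```

For a standardised difference of 0.5 with 64 subjects per group, `power_continuous(0.5, 64)` is roughly 0.8, whereas `power_dichotomised(0.5, 64, cut=0.25)` is only a little above 0.6 under these approximations. A trial sized for the continuous variable would therefore be noticeably underpowered for the dichotomised one.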

2.3 Design Techniques to Avoid Bias

The most important design techniques for avoiding bias in clinical trials are blinding and randomisation, and these should be normal features of most controlled clinical trials intended to be included in a marketing application. Most such trials follow a double-blind approach in which treatments are pre-packed in accordance with a suitable randomisation schedule, and supplied to the trial centre(s) labelled only with the subject number and the treatment period so that no one involved in the conduct of the trial is aware of the specific treatment allocated to any particular subject, not even as a code letter. This approach will be assumed in Section 2.3.1 and most of Section 2.3.2, exceptions being considered at the end.

Bias can also be reduced at the design stage by specifying procedures in the protocol aimed at minimising any anticipated irregularities in trial conduct that might impair a satisfactory analysis, including various types of protocol violations, withdrawals and missing values. The protocol should consider ways both to reduce the frequency of such problems, and also to handle the problems that do occur in the analysis of data.

2.3.1 Blinding

Blinding or masking is intended to limit the occurrence of conscious and unconscious bias in the conduct and interpretation of a clinical trial arising from the influence which the knowledge of treatment may have on the recruitment and allocation of subjects, their subsequent care, the attitudes of subjects to the treatments, the assessment of end-points, the handling of withdrawals, the exclusion of data from analysis, and so on. The essential aim is to prevent identification of the treatments until all such opportunities for bias have passed.

Difficulties in achieving the double-blind ideal can arise: the treatments may be of a completely different nature, for example, surgery and drug therapy; two drugs may have different formulations and, although they could be made indistinguishable by the use of capsules, changing the formulation might also change the pharmacokinetic and/or pharmacodynamic properties and hence require that bioequivalence of the formulations be established; the daily pattern of administration of two treatments may differ. One way of achieving double-blind conditions under these circumstances is to use a 'double-dummy' (see Glossary) technique. This technique may sometimes force an administration scheme that is sufficiently unusual to influence adversely the motivation and compliance of the subjects. Ethical difficulties may also interfere with its use when, for example, it entails dummy operative procedures. Nevertheless, extensive efforts should be made to overcome these difficulties.

In this document, the blind review (see Glossary) of data refers to the checking of data during the period of time between trial completion (the last observation on the last subject) and the breaking of the blind.

2.3.2 Randomisation

Randomisation introduces a deliberate element of chance into the assignment of treatments to subjects in a clinical trial. During subsequent analysis of the trial data, it provides a sound statistical basis for the quantitative evaluation of the evidence relating to treatment effects. It also tends to produce treatment groups in which the distributions of prognostic factors, known and unknown, are similar. In combination with blinding, randomisation helps to avoid possible bias in the selection and allocation of subjects arising from the predictability of treatment assignments.

The randomisation schedule of a clinical trial documents the random allocation of treatments to subjects. In the simplest situation it is a sequential list of treatments (or treatment sequences in a crossover trial) or corresponding codes by subject number. The logistics of some trials, such as those with a screening phase, may make matters more complicated, but the unique pre-planned assignment of treatment, or treatment sequence, to subject should be clear. Different trial designs will require different procedures for generating randomisation schedules. The randomisation schedule should be reproducible (if the need arises).

Although unrestricted randomisation is an acceptable approach, some advantages can generally be gained by randomising subjects in blocks. This helps to increase the comparability of the treatment groups, particularly when subject characteristics may change over time, as a result, for example, of changes in recruitment policy. It also provides a better guarantee that the treatment groups will be of nearly equal size. In crossover trials it provides the means of obtaining balanced designs with their greater efficiency and easier interpretation. Care should be taken to choose block lengths that are sufficiently short to limit possible imbalance, but that are long enough to avoid predictability towards the end of the sequence in a block. Investigators and other relevant staff should generally be blind to the block length; the use of two or more block lengths, randomly selected for each block, can achieve the same purpose. (Theoretically, in a double-blind trial predictability does not matter, but the pharmacological effects of drugs may provide the opportunity for intelligent guesswork.)
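
As a minimal sketch of blocked randomisation with randomly varying block lengths, the Python fragment below builds a schedule for two treatments. The function name `blocked_schedule`, the treatment labels, the block lengths of 4 and 6 and the seed value are all hypothetical choices made for illustration and are not taken from this guidance.

```python
import random
from typing import List, Sequence


def blocked_schedule(n_subjects: int,
                     treatments: Sequence[str] = ("A", "B"),
                     block_multiples: Sequence[int] = (2, 3),
                     seed: int = 20240101) -> List[str]:
    """Return a treatment list made of complete, randomly permuted blocks.

    Each block contains every treatment m times, where m is drawn at random
    from block_multiples, so with two treatments the block length is 4 or 6.
    The list may be slightly longer than n_subjects because only whole
    blocks are generated; unused entries simply remain unallocated.
    """
    rng = random.Random(seed)  # a recorded seed allows the list to be regenerated
    schedule: List[str] = []
    while len(schedule) < n_subjects:
        m = rng.choice(list(block_multiples))  # block length chosen at random
        block = list(treatments) * m
        rng.shuffle(block)                     # random order within the block
        schedule.extend(block)
    return schedule
```

Because every block is balanced, the difference in group sizes at any point in recruitment can never exceed the largest multiple (3 here), while the varying block length makes the end of a block harder to anticipate. In practice the generated list itself would be archived together with the seed.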

In multicentre trials (see Glossary) the randomisation procedures should be organised centrally. It is advisable to have a separate random scheme for each centre, i.e. to stratify by centre or to allocate several whole blocks to each centre. More generally, stratification by important prognostic factors measured at baseline (e.g. severity of disease, age, sex, etc.) may sometimes be valuable in order to promote balanced allocation within strata; this has greater potential benefit in small trials. The use of more than two or three stratification factors is rarely necessary, is less successful at achieving balance and is logistically troublesome. The use of a dynamic allocation procedure (see below) may help to achieve balance across a number of stratification factors simultaneously provided the rest of the trial procedures can be adjusted to accommodate an approach of this type. Factors on which randomisation has been stratified should be accounted for later in the analysis.

The next subject to be randomised into a trial should always receive the treatment corresponding to the next free number in the appropriate randomisation schedule (in the respective stratum, if randomisation is stratified). The appropriate number and associated treatment for the next subject should only be allocated when entry of that subject to the randomised part of the trial has been confirmed. Details of the randomisation that facilitate predictability (e.g. block length) should not be contained in the trial protocol. The randomisation schedule itself should be filed securely by the sponsor or an independent party in a manner that ensures that blindness is properly maintained throughout the trial. Access to the randomisation schedule during the trial should take into account the possibility that, in an emergency, the blind may have to be broken for any subject. The procedure to be followed, the necessary documentation, and the subsequent treatment and assessment of the subject should all be described in the protocol.

Dynamic allocation is an alternative procedure in which the allocation of treatment to a subject is influenced by the current balance of allocated treatments and, in a stratified trial, by the stratum to which the subject belongs and the balance within that stratum. Deterministic dynamic allocation procedures should be avoided and an appropriate element of randomisation should be incorporated for each treatment allocation. Every effort should be made to retain the double-blind status of the trial. For example, knowledge of the treatment code may be restricted to a central trial office from where the dynamic allocation is controlled, generally through telephone contact. This in turn permits additional checks of eligibility criteria and establishes entry into the trial, features that can be valuable in certain types of multicentre trial. The usual system of pre-packing and labelling drug supplies for double-blind trials can then be followed, but the order of their use is no longer sequential. It is desirable to use appropriate computer algorithms to keep personnel at the central trial office blind to the treatment code. The complexity of the logistics and potential impact on the analysis should be carefully evaluated when considering dynamic allocation.
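One way to incorporate a random element into dynamic allocation is a minimisation rule with a biased coin. The sketch below is an illustration under stated assumptions only: the function name, the factor names and the probability of 0.8 are hypothetical, and the imbalance score (a simple count of matching factor levels) is one of several possible choices.

```python
import random

def minimisation_assign(new_subject, allocated, treatments=("A", "B"),
                        p_favoured=0.8, rng=random):
    """Choose a treatment for new_subject.

    new_subject: dict of factor -> level, e.g. {"sex": "F", "severity": "mild"}
    allocated:   list of (factor_dict, treatment) for subjects already entered
    p_favoured:  probability of choosing an arm that minimises imbalance;
                 a value below 1 keeps the procedure non-deterministic
    """
    # For each arm, count earlier subjects who share a factor level with the new subject.
    score = {t: 0 for t in treatments}
    for factors, trt in allocated:
        score[trt] += sum(factors.get(f) == level for f, level in new_subject.items())
    best = min(score.values())
    favoured = [t for t in treatments if score[t] == best]
    others = [t for t in treatments if score[t] != best]
    if not others or rng.random() < p_favoured:
        return rng.choice(favoured)   # all arms tie, or the coin favours balance
    return rng.choice(others)         # random element: take a non-minimising arm

# Illustrative run with 40 hypothetical subjects.
rng = random.Random(7)
entered = []
for _ in range(40):
    subject = {"sex": rng.choice("FM"), "severity": rng.choice(["mild", "severe"])}
    entered.append((subject, minimisation_assign(subject, entered, rng=rng)))
```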

III. TRIAL DESIGN CONSIDERATIONS

3.1 Design Configuration

3.1.1 Parallel Group Design

The most common clinical trial design for confirmatory trials is the parallel group design in which subjects are randomised to one of two or more arms, each arm being allocated a different treatment. These treatments will include the investigational product at one or more doses, and one or more control treatments, such as placebo and/or an active comparator. The assumptions underlying this design are less complex than for most other designs. However, as with other designs, there may be additional features of the trial that complicate the analysis and interpretation (e.g. covariates, repeated measurements over time, interactions between design factors, protocol violations, dropouts (see Glossary) and withdrawals).

3.1.2 Crossover Design

In the crossover design, each subject is randomised to a sequence of two or more treatments, and hence acts as his own control for treatment comparisons. This simple manoeuvre is attractive primarily because it reduces the number of subjects and usually the number of assessments needed to achieve a specific power, sometimes to a marked extent. In the simplest 2×2 crossover design each subject receives each of two treatments in randomised order in two successive treatment periods, often separated by a washout period. The most common extension of this entails comparing n(>2) treatments in n periods, each subject receiving all n treatments. Numerous variations exist, such as designs in which each subject receives a subset of n(>2) treatments, or ones in which treatments are repeated within a subject.

Crossover designs have a number of problems that can invalidate their results. The chief difficulty concerns carryover, that is, the residual influence of treatments in subsequent treatment periods. In an additive model the effect of unequal carryover will be to bias direct treatment comparisons. In the 2×2 design the carryover effect cannot be statistically distinguished from the interaction between treatment and period and the test for either of these effects lacks power because the corresponding contrast is 'between subject'. This problem is less acute in higher order designs, but cannot be entirely dismissed.
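The following sketch illustrates why the carryover contrast in the 2×2 design is 'between subject'. It assumes a simple additive model with period, treatment and carryover terms; the function name and the data are hypothetical. The treatment contrast is built from within-subject period differences, whereas the carryover contrast can only be built from subject totals, which carry between-subject variability.

```python
from statistics import mean

def crossover_2x2(seq_ab, seq_ba):
    """seq_ab, seq_ba: lists of (period 1, period 2) responses, one pair per subject.

    Subjects in seq_ab receive A then B; subjects in seq_ba receive B then A.
    """
    d_ab = [p1 - p2 for p1, p2 in seq_ab]   # within-subject: A minus B
    d_ba = [p1 - p2 for p1, p2 in seq_ba]   # within-subject: B minus A
    t_ab = [p1 + p2 for p1, p2 in seq_ab]   # subject totals
    t_ba = [p1 + p2 for p1, p2 in seq_ba]
    return {
        # Estimates (A - B); biased by half the carryover difference if carryover is unequal.
        "treatment": (mean(d_ab) - mean(d_ba)) / 2,
        # Difference of subject totals between sequences: a between-subject contrast.
        "carryover": mean(t_ab) - mean(t_ba),
    }

result = crossover_2x2(seq_ab=[(10, 8), (12, 9)], seq_ba=[(7, 11), (8, 10)])
print(result)
```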

When the crossover design is used it is therefore important to avoid carryover. This is best done by selective and careful use of the design on the basis of adequate knowledge of both the disease area and the new medication. The disease under study should be chronic and stable. The relevant effects of the medication should develop fully within the treatment period. The washout periods should be sufficiently long for complete reversibility of drug effect. The fact that these conditions are likely to be met should be established in advance of the trial by means of prior information and data.

There are additional problems that need careful attention in crossover trials. The most notable of these are the complications of analysis and interpretation arising from the loss of subjects. Also, the potential for carryover leads to difficulties in assigning adverse events which occur in later treatment periods to the appropriate treatment. These, and other issues, are described in ICH E4. The crossover design should generally be restricted to situations where losses of subjects from the trial are expected to be small.

A common, and generally satisfactory, use of the 2×2 crossover design is to demonstrate the bioequivalence of two formulations of the same medication. In this particular application in healthy volunteers, carryover effects on the relevant pharmacokinetic variable are most unlikely to occur if the washout time between the two periods is sufficiently long. However, it is still important to check this assumption during analysis on the basis of the data obtained, for example by demonstrating that no drug is detectable at the start of each period.

3.1.3 Factorial Designs

In a factorial design two or more treatments are evaluated simultaneously through the use of varying combinations of the treatments. The simplest example is the 2×2 factorial design in which subjects are randomly allocated to one of the four possible combinations of two treatments, A and B say. These are: A alone; B alone; both A and B; neither A nor B. In many cases this design is used for the specific purpose of examining the interaction of A and B. The statistical test of interaction may lack power to detect an interaction if the sample size was calculated based on the test for main effects. This consideration is important when this design is used for examining the joint effects of A and B, in particular, if the treatments are likely to be used together.
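The loss of power for the interaction test can be illustrated with the contrasts of the four cell means. The sketch below assumes equal numbers of subjects per cell and a common standard deviation; the function names and numerical values are hypothetical. Under these assumptions the interaction contrast (a difference of differences) has twice the standard error of a main effect.

```python
from math import sqrt

def factorial_2x2_contrasts(mean_neither, mean_a, mean_b, mean_ab):
    """Main effects and interaction from the four cell means of a 2x2 factorial."""
    effect_a_without_b = mean_a - mean_neither
    effect_a_with_b = mean_ab - mean_b
    return {
        "main_A": (effect_a_without_b + effect_a_with_b) / 2,
        "main_B": ((mean_b - mean_neither) + (mean_ab - mean_a)) / 2,
        # Zero under an additive model.
        "interaction": effect_a_with_b - effect_a_without_b,
    }

def contrast_standard_errors(sd, n_per_cell):
    """Standard errors assuming equal cell sizes and a common standard deviation."""
    se_main = sd / sqrt(n_per_cell)             # coefficients of +/- 1/2 on four means
    se_interaction = 2 * sd / sqrt(n_per_cell)  # coefficients of +/- 1 on four means
    return se_main, se_interaction

contrasts = factorial_2x2_contrasts(mean_neither=10, mean_a=14, mean_b=13, mean_ab=17)
se_main, se_interaction = contrast_standard_errors(sd=8.0, n_per_cell=64)
```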

Another important use of the factorial design is to establish the dose-response characteristics of the simultaneous use of treatments C and D, especially when the efficacy of each monotherapy has been established at some dose in prior trials. A number, m, of doses of C is selected, usually including a zero dose (placebo), and a similar number, n, of doses of D. The full design then consists of m×n treatment groups, each receiving a different combination of doses of C and D. The resulting estimate of the response surface may then be used to help to identify an appropriate combination of doses of C and D for clinical use (see ICH E4).

In some cases, the 2×2 design may be used to make efficient use of clinical trial subjects by evaluating the efficacy of the two treatments with the same number of subjects as would be required to evaluate the efficacy of either one alone. This strategy has proved to be particularly valuable for very large mortality trials. The efficiency and validity of this approach depends upon the absence of interaction between treatments A and B so that the effects of A and B on the primary efficacy variables follow an additive model, and hence the effect of A is virtually identical whether or not it is additional to the effect of B. As for the crossover trial, evidence that this condition is likely to be met should be established in advance of the trial by means of prior information and data.

3.2 Multicentre Trials

Multicentre trials are carried out for two main reasons. Firstly, a multicentre trial is an accepted way of evaluating a new medication more efficiently; under some circumstances, it may present the only practical means of accruing sufficient subjects to satisfy the trial objective within a reasonable time-frame. Multicentre trials of this nature may, in principle, be carried out at any stage of clinical development. They may have several centres with a large number of subjects per centre or, in the case of a rare disease, they may have a large number of centres with very few subjects per centre.

Secondly, a trial may be designed as a multicentre (and multi-investigator) trial primarily to provide a better basis for the subsequent generalisation of its findings. This arises from the possibility of recruiting the subjects from a wider population and of administering the medication in a broader range of clinical settings, thus presenting an experimental situation that is more typical of future use. In this case the involvement of a number of investigators also gives the potential for a wider range of clinical judgement concerning the value of the medication. Such a trial would be a confirmatory trial in the later phases of drug development and would be likely to involve a large number of investigators and centres. It might sometimes be conducted in a number of different countries in order to facilitate generalisability (see Glossary) even further.

If a multicentre trial is to be meaningfully interpreted and extrapolated, then the manner in which the protocol is implemented should be clear and similar at all centres. Furthermore the usual sample size and power calculations depend upon the assumption that the differences between the compared treatments in the centres are unbiased estimates of the same quantity. It is important to design the common protocol and to conduct the trial with this background in mind. Procedures should be standardised as completely as possible. Variation of evaluation criteria and schemes can be reduced by investigator meetings, by the training of personnel in advance of the trial and by careful monitoring during the trial. Good design should generally aim to achieve the same distribution of subjects to treatments within each centre and good management should maintain this design objective. Trials that avoid excessive variation in the numbers of subjects per centre and trials that avoid a few very small centres have advantages if it is later found necessary to take into account the heterogeneity of the treatment effect from centre to centre, because they reduce the differences between different weighted estimates of the treatment effect. (This point does not apply to trials in which all centres are very small and in which centre does not feature in the analysis.) Failure to take these precautions, combined with doubts about the homogeneity of the results may, in severe cases, reduce the value of a multicentre trial to such a degree that it cannot be regarded as giving convincing evidence for the sponsor’s claims.

In the simplest multicentre trial, each investigator will be responsible for the subjects recruited at one hospital, so that ‘centre’ is identified uniquely by either investigator or hospital. In many trials, however, the situation is more complex. One investigator may recruit subjects from several hospitals; one investigator may represent a team of clinicians (subinvestigators) who all recruit subjects from their own clinics at one hospital or at several associated hospitals. Whenever there is room for doubt about the definition of centre in a statistical model, the statistical section of the protocol (see Section 5.1) should clearly define the term (e.g. by investigator, location or region) in the context of the particular trial. In most instances centres can be satisfactorily defined through the investigators and ICH E6 provides relevant guidance in this respect. In cases of doubt the aim should be to define centres so as to achieve homogeneity in the important factors affecting the measurements of the primary variables and the influence of the treatments. Any rules for combining centres in the analysis should be justified and specified prospectively in the protocol where possible, but in any case decisions concerning this approach should always be taken blind to treatment, for example at the time of the blind review.

The statistical model to be adopted for the estimation and testing of treatment effects should be described in the protocol. The main treatment effect may be investigated first using a model which allows for centre differences, but does not include a term for treatment-by-centre interaction. If the treatment effect is homogeneous across centres, the routine inclusion of interaction terms in the model reduces the efficiency of the test for the main effects. In the presence of true heterogeneity of treatment effects, the interpretation of the main treatment effect is controversial.

In some trials, for example some large mortality trials with very few subjects per centre, there may be no reason to expect the centres to have any influence on the primary or secondary variables because they are unlikely to represent influences of clinical importance. In other trials it may be recognised from the start that the limited numbers of subjects per centre will make it impracticable to include the centre effects in the statistical model. In these cases it is not appropriate to include a term for centre in the model, and it is not necessary to stratify the randomisation by centre in this situation.

If positive treatment effects are found in a trial with appreciable numbers of subjects per centre, there should generally be an exploration of the heterogeneity of treatment effects across centres, as this may affect the generalisability of the conclusions. Marked heterogeneity may be identified by graphical display of the results of individual centres or by analytical methods, such as a significance test of the treatment-by-centre interaction. When using such a statistical significance test, it is important to recognise that this generally has low power in a trial designed to detect the main effect of treatment.

If heterogeneity of treatment effects is found, this should be interpreted with care and vigorous attempts should be made to find an explanation in terms of other features of trial management or subject characteristics. Such an explanation will usually suggest appropriate further analysis and interpretation. In the absence of an explanation, heterogeneity of treatment effect as evidenced, for example, by marked quantitative interactions (see Glossary) implies that alternative estimates of the treatment effect may be required, giving different weights to the centres, in order to substantiate the robustness of the estimates of treatment effect. It is even more important to understand the basis of any heterogeneity characterised by marked qualitative interactions (see Glossary), and failure to find an explanation may necessitate further clinical trials before the treatment effect can be reliably predicted.
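As an illustration of alternative estimates that give different weights to the centres, the sketch below compares an equally weighted estimate with one weighted by centre size. The centre-level effects and sizes are hypothetical, and the two weighting schemes are examples only. A large gap between such estimates is one sign that the overall result is sensitive to how heterogeneity is handled.

```python
def weighted_effect(centre_effects, weights):
    """Weighted average of centre-specific treatment effect estimates."""
    total = sum(weights)
    return sum(w * d for w, d in zip(weights, centre_effects)) / total

# Hypothetical treatment differences and numbers of subjects in four centres.
effects = [2.0, 1.5, 3.0, 0.5]
subjects = [40, 10, 25, 5]

equal_weights = weighted_effect(effects, [1] * len(effects))
size_weights = weighted_effect(effects, subjects)
print(equal_weights, size_weights)
```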

Up to this point the discussion of multicentre trials has been based on the use of fixed effect models. Mixed models may also be used to explore the heterogeneity of the treatment effect. These models consider centre and treatment-by-centre effects to be random, and are especially relevant when the number of sites is large.

3.3 Type of Comparison

3.3.1 Trials to Show Superiority

Scientifically, efficacy is most convincingly established by demonstrating superiority to placebo in a placebo-controlled trial, by showing superiority to an active control treatment or by demonstrating a dose-response relationship. This type of trial is referred to as a ‘superiority’ trial (see Glossary). Generally in this guidance superiority trials are assumed, unless it is explicitly stated otherwise.

For serious illnesses, when a therapeutic treatment which has been shown to be efficacious by superiority trial(s) exists, a placebo-controlled trial may be considered unethical. In that case the scientifically sound use of an active treatment as a control should be considered. The appropriateness of placebo control vs. active control should be considered on a trial by trial basis.

3.3.2 Trials to Show Equivalence or Non-inferiority

In some cases, an investigational product is compared to a reference treatment without the objective of showing superiority. This type of trial is divided into two major categories according to its objective; one is an 'equivalence' trial (see Glossary) and the other is a 'non-inferiority' trial (see Glossary).

Many active control trials are designed to show that the efficacy of an investigational product is no worse than that of the active comparator, and hence fall into the latter category. Another possibility is a trial in which multiple doses of the investigational drug are compared with the recommended dose or multiple doses of the standard drug. The purpose of this design is simultaneously to show a dose-response relationship for the investigational product and to compare the investigational product with the active control.

Active control equivalence or non-inferiority trials may also incorporate a placebo, thus pursuing multiple goals in one trial; for example, they may establish superiority to placebo and hence validate the trial design and simultaneously evaluate the degree of similarity of efficacy and safety to the active comparator. There are well known difficulties associated with the use of the active control equivalence (or non-inferiority) trials that do not incorporate a placebo or do not use multiple doses of the new drug. These relate to the implicit lack of any measure of internal validity (in contrast to superiority trials), thus making external validation necessary. The equivalence (or non-inferiority) trial is not conservative in nature, so that many flaws in the design or conduct of the trial will tend to bias the results towards a conclusion of equivalence. For these reasons, the design features of such trials should receive special attention and their conduct needs special care. For example, it is especially important to minimise the incidence of violations of the entry criteria, non-compliance, withdrawals, losses to follow-up, missing data and other deviations from the protocol, and also to minimise their impact on the subsequent analyses.

Active comparators should be chosen with care. An example of a suitable active comparator would be a widely used therapy whose efficacy in the relevant indication has been clearly established and quantified in well designed and well documented superiority trial(s) and which can be reliably expected to exhibit similar efficacy in the contemplated active control trial. To this end, the new trial should have the same important design features (primary variables, the dose of the active comparator, eligibility criteria, etc.) as the previously conducted superiority trials in which the active comparator clearly demonstrated clinically relevant efficacy, taking into account advances in medical or statistical practice relevant to the new trial.

It is vital that the protocol of a trial designed to demonstrate equivalence or non-inferiority contain a clear statement that this is its explicit intention. An equivalence margin should be specified in the protocol; this margin is the largest difference that can be judged as being clinically acceptable and should be smaller than differences observed in superiority trials of the active comparator. For the active control equivalence trial, both the upper and the lower equivalence margins are needed, while only the lower margin is needed for the active control non-inferiority trial. The choice of equivalence margins should be justified clinically.

Statistical analysis is generally based on the use of confidence intervals (see Section 5.5). For equivalence trials, two-sided confidence intervals should be used. Equivalence is inferred when the entire confidence interval falls within the equivalence margins. Operationally, this is equivalent to the method of using two simultaneous one-sided tests to test the (composite) null hypothesis that the treatment difference is outside the equivalence margins versus the (composite) alternative hypothesis that the treatment difference is within the margins. Because the two null hypotheses are disjoint, the type I error is appropriately controlled. For non-inferiority trials a one-sided interval should be used. The confidence interval approach has a one-sided hypothesis test counterpart for testing the null hypothesis that the treatment difference (investigational product minus control) is equal to the lower equivalence margin versus the alternative that the treatment difference is greater than the lower equivalence margin. The choice of type I error should be a consideration separate from the use of a one-sided or two-sided procedure. Sample size calculations should be based on these methods (see Section 3.5).
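The correspondence between the confidence interval approach and the two one-sided tests can be sketched as follows. This is an illustration under a normal approximation with a known standard error; the function names and the one-sided level of 2.5% are assumptions made for the example. A two-sided interval with coverage 1 - 2α corresponds to two one-sided tests, each at level α.

```python
from statistics import NormalDist

def equivalence_by_ci(diff, se, lower_margin, upper_margin, one_sided_alpha=0.025):
    """Equivalence is inferred if the whole two-sided interval lies within the margins."""
    z = NormalDist().inv_cdf(1 - one_sided_alpha)
    return lower_margin < diff - z * se and diff + z * se < upper_margin

def equivalence_by_tost(diff, se, lower_margin, upper_margin, one_sided_alpha=0.025):
    """Two one-sided tests; both null hypotheses must be rejected."""
    crit = NormalDist().inv_cdf(1 - one_sided_alpha)
    z_lower = (diff - lower_margin) / se   # H0: difference <= lower margin
    z_upper = (upper_margin - diff) / se   # H0: difference >= upper margin
    return z_lower > crit and z_upper > crit

def non_inferior(diff, se, lower_margin, one_sided_alpha=0.025):
    """Non-inferiority: the one-sided lower confidence limit exceeds the lower margin."""
    z = NormalDist().inv_cdf(1 - one_sided_alpha)
    return diff - z * se > lower_margin

# diff is investigational product minus control; the values are hypothetical.
print(equivalence_by_ci(0.0, 1.0, -3.0, 3.0), equivalence_by_tost(0.0, 1.0, -3.0, 3.0))
```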

Concluding equivalence or non-inferiority based on observing a non-significant test result of the null hypothesis that there is no difference between the investigational product and the active comparator is inappropriate.

There are also special issues in the choice of analysis sets. Subjects who withdraw or drop out of the treatment group or the comparator group will tend to have a lack of response, and hence the results of using the full analysis set (see Glossary) may be biased toward demonstrating equivalence (see Section 5.2.3).

3.3.3 Trials to Show Dose-response Relationship

How response is related to the dose of a new investigational product is a question to which answers may be obtained in all phases of development, and by a variety of approaches (see ICH E4). Dose-response trials may serve a number of objectives, amongst which the following are of particular importance: the confirmation of efficacy; the investigation of the shape and location of the dose-response curve; the estimation of an appropriate starting dose; the identification of optimal strategies for individual dose adjustments; the determination of a maximal dose beyond which additional benefit would be unlikely to occur. These objectives should be addressed using the data collected at a number of doses under investigation, including a placebo (zero dose) wherever appropriate. For this purpose the application of procedures to estimate the relationship between dose and response, including the construction of confidence intervals and the use of graphical methods, is as important as the use of statistical tests. The hypothesis tests that are used may need to be tailored to the natural ordering of doses or to particular questions regarding the shape of the dose-response curve (e.g. monotonicity). The details of the planned statistical procedures should be given in the protocol.

3.4 Group Sequential Designs

Group sequential designs are used to facilitate the conduct of interim analysis (see Section 4.5 and Glossary). While group sequential designs are not the only acceptable types of designs permitting interim analysis, they are the most commonly applied because it is more practicable to assess grouped subject outcomes at periodic intervals during the trial than on a continuous basis as data from each subject become available. The statistical methods should be fully specified in advance of the availability of information on treatment outcomes and subject treatment assignments (i.e. blind breaking, see Section 4.5). An Independent Data Monitoring Committee (see Glossary) may be used to review or to conduct the interim analysis of data arising from a group sequential design (see Section 4.6). While the design has been most widely and successfully used in large, long-term trials of mortality or major non-fatal endpoints, its use is growing in other circumstances. In particular, it is recognised that safety must be monitored in all trials and therefore the need for formal procedures to cover early stopping for safety reasons should always be considered.

3.5 Sample Size

The number of subjects in a clinical trial should always be large enough to provide a reliable answer to the questions addressed. This number is usually determined by the primary objective of the trial. If the sample size is determined on some other basis, then this should be made clear and justified. For example, a trial sized on the basis of safety questions or requirements or important secondary objectives may need larger numbers of subjects than a trial sized on the basis of the primary efficacy question (see, for example, ICH E1A).

Using the usual method for determining the appropriate sample size, the following items should be specified: a primary variable, the test statistic, the null hypothesis, the alternative ('working') hypothesis at the chosen dose(s) (embodying consideration of the treatment difference to be detected or rejected at the dose and in the subject population selected), the probability of erroneously rejecting the null hypothesis (the type I error), and the probability of erroneously failing to reject the null hypothesis (the type II error), as well as the approach to dealing with treatment withdrawals and protocol violations. In some instances, the event rate is of primary interest for evaluating power, and assumptions should be made to extrapolate from the required number of events to the eventual sample size for the trial.
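Where the number of events drives the power, the extrapolation from events to subjects can be sketched as below. The sketch assumes a Schoenfeld-type approximation for a log-rank comparison with equal allocation, which is not prescribed by this guidance. The hazard ratio, error rates and overall event probability are hypothetical inputs.

```python
from math import ceil, log
from statistics import NormalDist

def required_events(hazard_ratio, alpha=0.05, power=0.9):
    """Approximate total number of events for a two-sided test with 1:1 allocation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(4 * (z_alpha + z_beta) ** 2 / log(hazard_ratio) ** 2)

def subjects_needed(events, event_probability):
    """Total subjects, given the assumed probability that a subject has an event."""
    return ceil(events / event_probability)

events = required_events(hazard_ratio=0.75)
total_subjects = subjects_needed(events, event_probability=0.5)
print(events, total_subjects)
```

The assumed event probability depends on accrual and follow-up patterns, so it is itself one of the quantities whose basis should be documented.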

The method by which the sample size is calculated should be given in the protocol, together with the estimates of any quantities used in the calculations (such as variances, mean values, response rates, event rates, difference to be detected). The basis of these estimates should also be given. It is important to investigate the sensitivity of the sample size estimate to a variety of deviations from these assumptions and this may be facilitated by providing a range of sample sizes appropriate for a reasonable range of deviations from assumptions. In confirmatory trials, assumptions should normally be based on published data or on the results of earlier trials. The treatment difference to be detected may be based on a judgement concerning the minimal effect which has clinical relevance in the management of patients or on a judgement concerning the anticipated effect of the new treatment, where this is larger. Conventionally the probability of type I error is set at 5% or less or as dictated by any adjustments made necessary for multiplicity considerations; the precise choice may be influenced by the prior plausibility of the hypothesis under test and the desired impact of the results. The probability of type II error is conventionally set at 10% to 20%; it is in the sponsor’s interest to keep this figure as low as feasible especially in the case of trials that are difficult or impossible to repeat. Alternative values to the conventional levels of type I and type II error may be acceptable or even preferable in some cases.
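The sensitivity of the sample size to the assumptions can be tabulated with a short calculation. The sketch below assumes a two-sided comparison of two means under a normal approximation with equal allocation; the function name and all numerical inputs are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta, sd, alpha=0.05, power=0.9):
    """Subjects per group to detect a mean difference delta (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided type I error
    z_beta = NormalDist().inv_cdf(power)            # power = 1 - type II error
    return ceil(2 * (sd * (z_alpha + z_beta) / delta) ** 2)

# Range of sample sizes over plausible deviations from the planning assumptions.
sensitivity = {
    (delta, sd): n_per_group(delta, sd)
    for delta in (4.0, 5.0, 6.0)
    for sd in (8.0, 10.0, 12.0)
}
for (delta, sd), n in sensitivity.items():
    print(f"delta={delta}, sd={sd}: {n} per group")
```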

Sample size calculations should refer to the number of subjects required for the primary analysis. If this is the 'full analysis set', estimates of the effect size may need to be reduced compared to the per protocol set (see Glossary). This is to allow for the dilution of the treatment effect arising from the inclusion of data from patients who have withdrawn from treatment or whose compliance is poor. The assumptions about variability may also need to be revised.

The sample size of an equivalence trial or a non-inferiority trial (see Section 3.3.2) should normally be based on the objective of obtaining a confidence interval for the treatment difference that shows that the treatments differ at most by a clinically acceptable difference. When the power of an equivalence trial is assessed at a true difference of zero, then the sample size necessary to achieve this power is underestimated if the true difference is not zero. When the power of a non-inferiority trial is assessed at a zero difference, then the sample size needed to achieve that power will be underestimated if the effect of the investigational product is less than that of the active control. The choice of a 'clinically acceptable' difference needs justification with respect to its meaning for future patients, and may be smaller than the 'clinically relevant' difference referred to above in the context of superiority trials designed to establish that a difference exists.
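The underestimation described above for non-inferiority trials can be shown numerically. The sketch assumes a comparison of two means under a normal approximation with a one-sided test; the function name, margin, standard deviation and error rates are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def ni_n_per_group(margin, sd, true_diff=0.0, one_sided_alpha=0.025, power=0.9):
    """Subjects per group for a non-inferiority comparison of means.

    margin:    positive non-inferiority margin (the lower margin is -margin)
    true_diff: assumed true difference, investigational product minus control
    """
    distance = margin + true_diff   # distance of the true difference from the lower margin
    if distance <= 0:
        raise ValueError("the assumed true difference lies at or below the lower margin")
    z_alpha = NormalDist().inv_cdf(1 - one_sided_alpha)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * (sd * (z_alpha + z_beta) / distance) ** 2)

n_if_identical = ni_n_per_group(margin=5.0, sd=10.0, true_diff=0.0)
n_if_slightly_worse = ni_n_per_group(margin=5.0, sd=10.0, true_diff=-1.0)
print(n_if_identical, n_if_slightly_worse)
```

In this hypothetical case, a trial sized for a zero difference would have too few subjects to reach the planned power if the investigational product were in truth slightly less effective than the control.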

The exact sample size in a group sequential trial cannot be fixed in advance because it depends upon the play of chance in combination with the chosen stopping guideline and the true treatment difference. The design of the stopping guideline should take into account the consequent distribution of the sample size, usually embodied in the expected and maximum sample sizes.
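The distribution of the sample size under a stopping guideline can be explored by simulation. The sketch below assumes normally distributed outcomes with known standard deviation, five equally spaced looks and a constant boundary of 2.413 (the value commonly tabulated for a Pocock-type boundary with five looks at a two-sided 5% level). All of these choices, and the function name, are assumptions made for illustration.

```python
import random
from math import sqrt

def simulate_sample_sizes(true_diff, sd=1.0, n_per_look=20, looks=5,
                          boundary=2.413, n_sims=5000, seed=1):
    """Return (expected, maximum) total sample size over simulated trials."""
    rng = random.Random(seed)
    sizes = []
    for _ in range(n_sims):
        cumulative = 0.0   # running difference between the two arm totals
        for k in range(1, looks + 1):
            # Increment from n_per_look new subjects in each arm.
            cumulative += rng.gauss(n_per_look * true_diff, sd * sqrt(2 * n_per_look))
            z = cumulative / (sd * sqrt(2 * k * n_per_look))
            if abs(z) >= boundary:
                break      # stopping guideline met at look k
        sizes.append(2 * k * n_per_look)
    return sum(sizes) / n_sims, max(sizes)

expected_null, maximum_null = simulate_sample_sizes(true_diff=0.0)
expected_alt, maximum_alt = simulate_sample_sizes(true_diff=0.8)
print(expected_null, expected_alt, maximum_null)
```

The expected sample size falls as the true treatment difference grows, while the maximum is fixed by the number of looks and the subjects recruited per look.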

When event rates are lower than anticipated or variability is larger than expected, methods for sample size re-estimation are available without unblinding data or making treatment comparisons (see Section 4.4).

3.6 Data Capture and Processing

The collection of data and transfer of data from the investigator to the sponsor can take place through a variety of media, including paper case record forms, remote site monitoring systems, medical computer systems and electronic transfer. Whatever data capture instrument is used, the form and content of the information collected should be in full accordance with the protocol and should be established in advance of the conduct of the clinical trial. It should focus on the data necessary to implement the planned analysis, including the context information (such as timing assessments relative to dosing) necessary to confirm protocol compliance or identify important protocol deviations. ‘Missing values’ should be distinguishable from the ‘value zero’ or ‘characteristic absent’.

The process of data capture through to database finalisation should be carried out in accordance with GCP (see ICH E6, Section 5). Specifically, timely and reliable processes for recording data and rectifying errors and omissions are necessary to ensure delivery of a quality database and the achievement of the trial objectives through the implementation of the planned analysis.

IV. TRIAL CONDUCT CONSIDERATIONS

4.1 Trial Monitoring and Interim Analysis

Careful conduct of a clinical trial according to the protocol has a major impact on the credibility of the results (see ICH E6). Careful monitoring can ensure that difficulties are noticed early and their occurrence or recurrence minimised.

There are two distinct types of monitoring that generally characterise confirmatory clinical trials sponsored by the pharmaceutical industry. One type of monitoring concerns the oversight of the quality of the trial, while the other type involves breaking the blind to make treatment comparisons (i.e. interim analysis). Both types of trial monitoring, in addition to entailing different staff responsibilities, involve access to different types of trial data and information, and thus different principles apply for the control of potential statistical and operational bias.

For the purpose of overseeing the quality of the trial the checks involved in trial monitoring may include whether the protocol is being followed, the acceptability of data being accrued, the success of planned accrual targets, the appropriateness of the design assumptions, success in keeping patients in the trials, etc. (see Sections 4.2 to 4.4). This type of monitoring does not require access to information on comparative treatment effects, nor unblinding of data and therefore has no impact on type I error. The monitoring of a trial for this purpose is the responsibility of the sponsor (see ICH E6) and can be carried out by the sponsor or an independent group selected by the sponsor. The period for this type of monitoring usually starts with the selection of the trial sites and ends with the collection and cleaning of the last subject’s data.

The other type of trial monitoring (interim analysis) involves the accruing of comparative treatment results. Interim analysis requires unblinded (i.e. key breaking) access to treatment group assignment (actual treatment assignment or identification of group assignment) and comparative treatment group summary information. This necessitates that the protocol (or appropriate amendments prior to a first analysis) contains statistical plans for the interim analysis to prevent certain types of bias. This is discussed in Sections 4.5 & 4.6.

4.2 Changes in Inclusion and Exclusion Criteria

Inclusion and exclusion criteria should remain constant, as specified in the protocol, throughout the period of subject recruitment. Changes may occasionally be appropriate, for example, in long term trials, where growing medical knowledge either from outside the trial or from interim analyses may suggest a change of entry criteria. Changes may also result from the discovery by monitoring staff that regular violations of the entry criteria are occurring, or that seriously low recruitment rates are due to over-restrictive criteria. Changes should be made without breaking the blind and should always be described by a protocol amendment which should cover any statistical consequences, such as sample size adjustments arising from different event rates, or modifications to the planned analysis, such as stratifying the analysis according to modified inclusion/exclusion criteria.

4.3 Accrual Rates

In trials with a long time-scale for the accrual of subjects, the rate of accrual should be monitored and, if it falls appreciably below the projected level, the reasons should be identified and remedial actions taken in order to protect the power of the trial and alleviate concerns about selective entry and other aspects of quality. In a multicentre trial these considerations apply to the individual centres.

4.4 Sample Size Adjustment

In long term trials there will usually be an opportunity to check the assumptions which underlay the original design and sample size calculations. This may be particularly important if the trial specifications have been made on preliminary and/or uncertain information. An interim check conducted on the blinded data may reveal that overall response variances, event rates or survival experience are not as anticipated. A revised sample size may then be calculated using suitably modified assumptions, and should be justified and documented in a protocol amendment and in the clinical study report. The steps taken to preserve blindness and the consequences, if any, for the type I error and the width of confidence intervals should be explained. The potential need for re-estimation of the sample size should be envisaged in the protocol whenever possible (see Section 3.5).
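As an illustration only, a blinded sample size check of this kind can be sketched as below. The sketch assumes a two-arm comparison of means and the usual normal-approximation formula, n per arm = 2 * sigma^2 * (z_(1-alpha/2) + z_(power))^2 / delta^2. The function name, the planning values and the interim data are hypothetical and are not taken from this guidance.

```python
import math
from statistics import NormalDist, variance


def n_per_arm(sigma2, delta, alpha=0.05, power=0.80):
    """Normal-approximation sample size per arm for a two-sided test of two means."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return math.ceil(2 * sigma2 * (z_alpha + z_power) ** 2 / delta ** 2)


# Planning assumptions (hypothetical): SD of 10, relevant difference of 5.
planned = n_per_arm(sigma2=10.0 ** 2, delta=5.0)

# Blinded interim check: responses pooled over both arms, no treatment labels used.
# The pooled variance ignores any treatment difference, so it tends to be
# slightly larger than the within-arm variance.
blinded_responses = [38, 62, 45, 71, 50, 29, 66, 55, 41, 43]
revised = n_per_arm(sigma2=variance(blinded_responses), delta=5.0)

print(f"planned per arm: {planned}, revised per arm: {revised}")
```

Because the treatment codes are never used, a check of this form leaves the blind intact. Any revision would still need to be justified and documented as described above.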

4.5 Interim Analysis and Early Stopping

An interim analysis is any analysis intended to compare treatment arms with respect to efficacy or safety at any time prior to formal completion of a trial. Because the number, methods and consequences of these comparisons affect the interpretation of the trial, all interim analyses should be carefully planned in advance and described in the protocol. Special circumstances may dictate the need for an interim analysis that was not defined at the start of a trial. In these cases, a protocol amendment describing the interim analysis should be completed prior to unblinded access to treatment comparison data. When an interim analysis is planned with the intention of deciding whether or not to terminate a trial, this is usually accomplished by the use of a group sequential design which employs statistical monitoring schemes as guidelines (see Section 3.4). The goal of such an interim analysis is to stop the trial early if the superiority of the treatment under study is clearly established, if the demonstration of a relevant treatment difference has become unlikely or if unacceptable adverse effects are apparent. Generally, boundaries for monitoring efficacy require more evidence to terminate a trial early (i.e. they are more conservative) than boundaries for monitoring safety. When the trial design and monitoring objective involve multiple endpoints then this aspect of multiplicity may also need to be taken into account.

The protocol should describe the schedule of interim analyses, or at least the considerations which will govern its generation, for example if flexible alpha spending function approaches are to be employed; further details may be given in a protocol amendment before the time of the first interim analysis. The stopping guidelines and their properties should be clearly described in the protocol or amendments. The potential effects of early stopping on the analysis of other important variables should also be considered. This material should be written or approved by the Data Monitoring Committee (see Section 4.6), when the trial has one. Deviations from the planned procedure always bear the potential of invalidating the trial results. If it becomes necessary to make changes to the trial, any consequent changes to the statistical procedures should be specified in an amendment to the protocol at the earliest opportunity, especially discussing the impact on any analysis and inferences that such changes may cause. The procedures selected should always ensure that the overall probability of type I error is controlled.
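To make the idea of an alpha spending function concrete, the following minimal sketch evaluates two commonly cited Lan-DeMets type spending functions at a set of information fractions. It is an assumption-laden illustration: the function names and the chosen fractions are hypothetical, and the sketch only shows how much of the overall two-sided type I error has been spent by each look. Deriving the actual stopping boundaries requires numerical integration that is not shown here.

```python
import math
from statistics import NormalDist


def obf_spend(t, alpha=0.05):
    """O'Brien-Fleming-type spending: cumulative two-sided alpha at information fraction t."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * (1 - NormalDist().cdf(z / math.sqrt(t)))


def pocock_spend(t, alpha=0.05):
    """Pocock-type spending: cumulative two-sided alpha at information fraction t."""
    return alpha * math.log(1 + (math.e - 1) * t)


# Hypothetical schedule: three interim looks and the final analysis.
fractions = [0.25, 0.5, 0.75, 1.0]
for t in fractions:
    print(f"t={t:.2f}  OBF-type={obf_spend(t):.5f}  Pocock-type={pocock_spend(t):.5f}")
```

Both functions spend the full alpha at t = 1. The O'Brien-Fleming-type function spends very little at early looks, which corresponds to demanding strong evidence before early termination.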

The execution of an interim analysis should be a completely confidential process because unblinded data and results are potentially involved. All staff involved in the conduct of the trial should remain blind to the results of such analyses, because of the possibility that their attitudes to the trial will be modified and cause changes in the characteristics of patients to be recruited or biases in treatment comparisons. This principle may be applied to all investigator staff and to staff employed by the sponsor except for those who are directly involved in the execution of the interim analysis. Investigators should only be informed about the decision to continue or to discontinue the trial, or to implement modifications to trial procedures.

Most clinical trials intended to support the efficacy and safety of an investigational product should proceed to full completion of planned sample size accrual; trials should be stopped early only for ethical reasons or if the power is no longer acceptable. However, it is recognised that drug development plans involve the need for sponsor access to comparative treatment data for a variety of reasons, such as planning other trials. It is also recognised that only a subset of trials will involve the study of serious life-threatening outcomes or mortality which may need sequential monitoring of accruing comparative treatment effects for ethical reasons. In either of these situations, plans for interim statistical analysis should be in place in the protocol or in protocol amendments prior to the unblinded access to comparative treatment data in order to deal with the potential statistical and operational bias that may be introduced.

For many clinical trials of investigational products, especially those that have major public health significance, the responsibility for monitoring comparisons of efficacy and/or safety outcomes should be assigned to an external independent group, often called an Independent Data Monitoring Committee (IDMC), a Data and Safety Monitoring Board or a Data Monitoring Committee whose responsibilities should be clearly described.

Any interim analysis that is not planned appropriately (with or without the consequences of stopping the trial early) may flaw the results of a trial and possibly weaken confidence in the conclusions drawn. Therefore, such analyses should be avoided. If an unplanned interim analysis is conducted, the clinical study report should explain why it was necessary and the degree to which blindness had to be broken, and should provide an assessment of the potential magnitude of bias introduced and of the impact on the interpretation of the results.

4.6 Role of Independent Data Monitoring Committee (IDMC) (see Sections 1.25 and 5.5.2 of ICH E6)

An IDMC may be established by the sponsor to assess at intervals the progress of a clinical trial, safety data, and critical efficacy variables and recommend to the sponsor whether to continue, modify or terminate a trial. The IDMC should have written operating procedures and maintain records of all its meetings, including interim results; these should be available for review when the trial is complete. The independence of the IDMC is intended to control the sharing of important comparative information and to protect the integrity of the clinical trial from adverse impact resulting from access to trial information. The IDMC is a separate entity from an Institutional Review Board (IRB) or an Independent Ethics Committee (IEC), and its composition should include clinical trial scientists knowledgeable in the appropriate disciplines including statistics.

When there are sponsor representatives on the IDMC, their role should be clearly defined in the operating procedures of the committee (for example, covering whether or not they can vote on key issues). Since these sponsor staff would have access to unblinded information, the procedures should also address the control of dissemination of interim trial results within the sponsor organisation.

V. DATA ANALYSIS CONSIDERATIONS

5.1 Prespecification of the Analysis

When designing a clinical trial the principal features of the eventual statistical analysis of the data should be described in the statistical section of the protocol. This section should include all the principal features of the proposed confirmatory analysis of the primary variable(s) and the way in which anticipated analysis problems will be handled. In case of exploratory trials this section could describe more general principles and directions.

The statistical analysis plan (see Glossary) may be written as a separate document to be completed after finalising the protocol. In this document, a more technical and detailed elaboration of the principal features stated in the protocol may be included (see section 7.1). The plan may include detailed procedures for executing the statistical analysis of the primary and secondary variables and other data. The plan should be reviewed and possibly updated as a result of the blind review of the data (see 7.1 for definition) and should be finalised before breaking the blind. Formal records should be kept of when the statistical analysis plan was finalised as well as when the blind was subsequently broken.

In the statistical section of the clinical study report the statistical methodology should be clearly described including when in the clinical trial process methodology decisions were made (see ICH E3).

5.2 Analysis Sets

The set of subjects whose data are to be included in the main analyses should be defined in the statistical section of the protocol. In addition, documentation for all subjects for whom trial procedures (e.g. run-in period) were initiated may be useful. The content of this subject documentation depends on detailed features of the particular trial, but at least demographic and baseline data on disease status should be collected whenever possible.

If all subjects randomised into a clinical trial satisfied all entry criteria, followed all trial procedures perfectly with no losses to follow-up, and provided complete data records, then the set of subjects to be included in the analysis would be self-evident. The design and conduct of a trial should aim to approach this ideal as closely as possible, but, in practice, it is doubtful if it can ever be fully achieved. Hence, the statistical section of the protocol should address anticipated problems prospectively in terms of how these affect the subjects and data to be analysed. The protocol should also specify procedures aimed at minimising any anticipated irregularities in study conduct that might impair a satisfactory analysis, including various types of protocol violations, withdrawals and missing values. The protocol should consider ways both to reduce the frequency of such problems, and also to handle the problems that do occur in the analysis of data. Possible amendments to the way in which the analysis will deal with protocol violations should be identified during the blind review. It is desirable to identify any important protocol violation with respect to the time when it occurred, its cause and influence on the trial result. The frequency and type of protocol violations, missing values, and other problems should be documented in the clinical study report and their potential influence on the trial results should be described (see ICH E3).

Decisions concerning the analysis set should be guided by the following principles: 1) to minimise bias, and 2) to avoid inflation of type I error.

5.2.1 Full Analysis Set

The intention-to-treat (see Glossary) principle implies that the primary analysis should include all randomised subjects. Compliance with this principle would necessitate complete follow-up of all randomised subjects for study outcomes. In practice this ideal may be difficult to achieve, for reasons to be described. In this document the term 'full analysis set' is used to describe the analysis set which is as complete as possible and as close as possible to the intention-to-treat ideal of including all randomised subjects. Preservation of the initial randomisation in analysis is important in preventing bias and in providing a secure foundation for statistical tests. In many clinical trials the use of the full analysis set provides a conservative strategy. Under many circumstances it may also provide estimates of treatment effects which are more likely to mirror those observed in subsequent practice.

There are a limited number of circumstances that might lead to excluding randomised subjects from the full analysis set including the failure to satisfy major entry criteria (eligibility violations), the failure to take at least one dose of trial medication and the lack of any data post randomisation. Such exclusions should always be justified. Subjects who fail to satisfy an entry criterion may be excluded from the analysis without the possibility of introducing bias only under the following circumstances:

(i) the entry criterion was measured prior to randomisation;

(ii) the detection of the relevant eligibility violations can be made completely objectively;

(iii) all subjects receive equal scrutiny for eligibility violations; (This may be difficult to ensure in an open-label study, or even in a double-blind study if the data are unblinded prior to this scrutiny, emphasising the importance of the blind review.)

(iv) all detected violations of the particular entry criterion are excluded.

In some situations, it may be reasonable to eliminate from the set of all randomised subjects any subject who took no trial medication. The intention-to-treat principle would be preserved despite the exclusion of these patients provided, for example, that the decision of whether or not to begin treatment could not be influenced by knowledge of the assigned treatment. In other situations it may be necessary to eliminate from the set of all randomised subjects any subject without data post randomisation. No analysis is complete unless the potential biases arising from these specific exclusions, or any others, are addressed.

When the full analysis set of subjects is used, violations of the protocol that occur after randomisation may have an impact on the data and conclusions, particularly if their occurrence is related to treatment assignment. In most respects it is appropriate to include the data from such subjects in the analysis, consistent with the intention-to-treat principle. Special problems arise in connection with subjects withdrawn from treatment after receiving one or more doses who provide no data after this point, and subjects otherwise lost to follow-up, because failure to include these subjects in the full analysis set may seriously undermine the approach. Measurements of primary variables made at the time of the loss to follow-up of a subject for any reason, or subsequently collected in accordance with the intended schedule of assessments in the protocol, are valuable in this context; subsequent collection is especially important in studies where the primary variable is mortality or serious morbidity. The intention to collect data in this way should be described in the protocol. Imputation techniques, ranging from the carrying forward of the last observation to the use of complex mathematical models, may also be used in an attempt to compensate for missing data. Other methods employed to ensure the availability of measurements of primary variables for every subject in the full analysis set may require some assumptions about the subjects' outcomes or a simpler choice of outcome (e.g. success / failure). The use of any of these strategies should be described and justified in the statistical section of the protocol and the assumptions underlying any mathematical models employed should be clearly explained. It is also important to demonstrate the robustness of the corresponding results of analysis especially when the strategy in question could itself lead to biased estimates of treatment effects.
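The simplest imputation technique mentioned above, carrying forward the last observation, can be sketched as follows. This is a minimal illustration with a hypothetical function name and made-up visit data, where None marks a missing assessment. It is not a recommendation of the method, whose assumptions would need to be justified in the protocol.

```python
def locf(visits):
    """Carry the last observed value forward over missing (None) visits.

    Visits before the first observed value stay None, because there is
    nothing to carry forward.
    """
    filled = []
    last = None
    for value in visits:
        if value is not None:
            last = value
        filled.append(last)
    return filled


# Hypothetical subject who withdrew after the fourth scheduled visit.
observed = [12.0, 11.5, None, 10.8, None, None]
print(locf(observed))
```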

Because of the unpredictability of some problems, it may sometimes be preferable to defer detailed consideration of the manner of dealing with irregularities until the blind review of the data at the end of the trial, and, if so, this should be stated in the protocol.

5.2.2 Per Protocol Set

The 'per protocol' set of subjects, sometimes described as the 'valid cases', the 'efficacy' sample or the 'evaluable subjects' sample, defines a subset of the subjects in the full analysis set who are more compliant with the protocol and is characterised by criteria such as the following:

(i) the completion of a certain pre-specified minimal exposure to the treatment regimen;

(ii) the availability of measurements of the primary variable(s);

(iii) the absence of any major protocol violations including the violation of entry criteria.

The precise reasons for excluding subjects from the per protocol set should be fully defined and documented before breaking the blind in a manner appropriate to the circumstances of the specific trial.
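A minimal sketch of how membership of the two analysis sets might be derived from subject-level flags is given below. The record fields, the minimum exposure threshold and the exclusion rules are all hypothetical. In a real trial they would be those defined and justified in the protocol and fixed before breaking the blind.

```python
from dataclasses import dataclass


@dataclass
class Subject:
    subject_id: str
    randomised: bool
    doses_taken: int
    has_post_randomisation_data: bool
    primary_measured: bool
    major_violation: bool


MIN_DOSES = 10  # hypothetical pre-specified minimal exposure


def in_full_analysis_set(s):
    """All randomised subjects, minus the limited exclusions assumed for this example."""
    return s.randomised and s.doses_taken >= 1 and s.has_post_randomisation_data


def in_per_protocol_set(s):
    """Subset of the full analysis set meeting criteria (i) to (iii)."""
    return (
        in_full_analysis_set(s)
        and s.doses_taken >= MIN_DOSES   # (i) minimal exposure
        and s.primary_measured           # (ii) primary variable available
        and not s.major_violation        # (iii) no major protocol violation
    )


subjects = [
    Subject("001", True, 28, True, True, False),
    Subject("002", True, 0, False, False, False),  # never dosed
    Subject("003", True, 5, True, True, False),    # below minimal exposure
    Subject("004", True, 28, True, True, True),    # major violation
]
fas = [s.subject_id for s in subjects if in_full_analysis_set(s)]
pps = [s.subject_id for s in subjects if in_per_protocol_set(s)]
print("full analysis set:", fas)
print("per protocol set:", pps)
```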

The use of the per protocol set may maximise the opportunity for a new treatment to show additional efficacy in the analysis, and most closely reflects the scientific model underlying the protocol. However, the corresponding test of the hypothesis and estimate of the treatment effect may or may not be conservative depending on the trial; the bias, which may be severe, arises from the fact that adherence to the study protocol may be related to treatment and outcome.

The problems that lead to the exclusion of subjects to create the per protocol set, and other protocol violations, should be fully identified and summarised. Relevant protocol violations may include errors in treatment assignment, the use of excluded medication, poor compliance, loss to follow-up and missing data. It is good practice to assess the pattern of such problems among the treatment groups with respect to frequency and time to occurrence.

5.2.3 Roles of the Different Analysis Sets

In general, it is advantageous to demonstrate a lack of sensitivity of the principal trial results to alternative choices of the set of subjects analysed. In confirmatory trials it is usually appropriate to plan to conduct both an analysis of the full analysis set and a per protocol analysis, so that any differences between them can be the subject of explicit discussion and interpretation. In some cases, it may be desirable to plan further exploration of the sensitivity of conclusions to the choice of the set of subjects analysed. When the full analysis set and the per protocol set lead to essentially the same conclusions, confidence in the trial results is increased, bearing in mind, however, that the need to exclude a substantial proportion of subjects from the per protocol analysis throws some doubt on the overall validity of the trial.

The full analysis set and the per protocol set play different roles in superiority trials (which seek to show the investigational product to be superior), and in equivalence or non-inferiority trials (which seek to show the investigational product to be comparable, see section 3.3.2). In superiority trials the full analysis set is used in the primary analysis (apart from exceptional circumstances) because it tends to avoid over-optimistic estimates of efficacy resulting from a per protocol analysis, since the non-compliers included in the full analysis set will generally diminish the estimated treatment effect. However, in an equivalence or non-inferiority trial use of the full analysis set is generally not conservative and its role should be considered very carefully.

5.3 Missing Values and Outliers

Missing values represent a potential source of bias in a clinical trial. Hence, every effort should be undertaken to fulfil all the requirements of the protocol concerning the collection and management of data. In reality, however, there will almost always be some missing data. A trial may be regarded as valid, nonetheless, provided the methods of dealing with missing values are sensible, and particularly if those methods are pre-defined in the protocol. Definition of methods may be refined by updating this aspect in the statistical analysis plan during the blind review. Unfortunately, no universally applicable methods of handling missing values can be recommended. An investigation should be made concerning the sensitivity of the results of analysis to the method of handling missing values, especially if the number of missing values is substantial.
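One way to examine the sensitivity of results to the handling of missing values is to repeat the same comparison under several handling rules. The sketch below does this for a binary responder outcome. The function names, the three rules and the data are hypothetical, and the rules are examples rather than recommended methods.

```python
def response_rate(outcomes, missing_as=None):
    """Responder rate; None entries are dropped unless missing_as (True/False) is given."""
    if missing_as is None:
        values = [o for o in outcomes if o is not None]
    else:
        values = [missing_as if o is None else o for o in outcomes]
    return sum(values) / len(values)


def risk_differences(treated, control):
    """Treatment minus control responder rate under three missing-data rules."""
    return {
        "complete_case": response_rate(treated) - response_rate(control),
        "missing_as_failure": response_rate(treated, False) - response_rate(control, False),
        # Least favourable to the test treatment: its missing values count as
        # failures, the control arm's missing values count as successes.
        "worst_case_for_treatment": response_rate(treated, False) - response_rate(control, True),
    }


# Hypothetical data: True = responder, False = non-responder, None = missing.
treated = [True] * 6 + [False] * 2 + [None] * 2
control = [True] * 4 + [False] * 4 + [None] * 2
for rule, diff in risk_differences(treated, control).items():
    print(f"{rule}: {diff:+.2f}")
```

A large spread between the rules would indicate that the conclusions depend heavily on how missing values are handled.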

5.4 Data Transformation

The decision to transform key variables prior to analysis is best made during the design of the trial on the basis of similar data from earlier clinical trials. Transformations (e.g. square root, logarithm) should be specified in the protocol and a rationale provided, especially for the primary variable(s). The general principles guiding the use of transformations to ensure that the assumptions underlying the statistical methods are met are to be found in standard texts; conventions for particular variables have been developed in a number of specific clinical areas. The decision on whether and how to transform a variable should be influenced by the preference for a scale which facilitates clinical interpretation.
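As a small illustration of how a transformation affects interpretation, the sketch below analyses a positive-valued variable on the logarithmic scale and back-transforms the difference in means, which gives a ratio of geometric means rather than a difference of arithmetic means. The function name and data are hypothetical.

```python
import math
from statistics import mean


def geometric_mean_ratio(test, reference):
    """Back-transform the difference in mean log values to a ratio of geometric means."""
    diff_on_log_scale = mean(math.log(x) for x in test) - mean(math.log(x) for x in reference)
    return math.exp(diff_on_log_scale)


# Hypothetical positive-valued measurements in two treatment groups.
ratio = geometric_mean_ratio(test=[2.0, 8.0], reference=[1.0, 4.0])
print(f"ratio of geometric means: {ratio:.2f}")
```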

Similar considerations apply to other derived variables, such as the use of change from baseline, percentage change from baseline, the 'area under the curve' of repeated measures, or the ratio of two different variables. Subsequent clinical interpretation should be carefully considered, and the derivation should be justified in the protocol. Closely related points are made in Section 2.2.2.

5.5 Estimation, Confidence Intervals and Hypothesis Testing

The statistical section of the protocol should specify the hypotheses that are to be tested and/or the treatment effects which are to be estimated in order to satisfy the primary objectives of the trial. The statistical methods to be used to accomplish these tasks should be described for the primary (and preferably the secondary) variables, and the underlying statistical model should be made clear. Estimates of treatment effects should be accompanied by confidence intervals, whenever possible, and the way in which these will be calculated should be identified. A description should be given of any intentions to use baseline data to improve precision or to adjust estimates for potential baseline differences, for example by means of analysis of covariance.
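The use of baseline data through analysis of covariance can be sketched as follows for the simplest case: two groups, one baseline covariate and a common slope. Under those assumptions the adjusted treatment difference equals the unadjusted difference minus the pooled within-group slope times the difference in baseline means. The function name and data are hypothetical, and confidence intervals are omitted to keep the example short.

```python
from statistics import mean


def ancova_adjusted_difference(y_trt, x_trt, y_ctl, x_ctl):
    """Return (unadjusted difference, baseline-adjusted difference, pooled slope)."""

    def cross_products(a, b):
        ma, mb = mean(a), mean(b)
        return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))

    # Pooled within-group regression slope of outcome on baseline.
    slope = (cross_products(x_trt, y_trt) + cross_products(x_ctl, y_ctl)) / (
        cross_products(x_trt, x_trt) + cross_products(x_ctl, x_ctl)
    )
    unadjusted = mean(y_trt) - mean(y_ctl)
    adjusted = unadjusted - slope * (mean(x_trt) - mean(x_ctl))
    return unadjusted, adjusted, slope


# Hypothetical data generated as outcome = 1 + 2*treatment + 0.5*baseline,
# with the treated group starting from higher baseline values.
x_trt, y_trt = [10.0, 12.0, 14.0], [8.0, 9.0, 10.0]
x_ctl, y_ctl = [8.0, 10.0, 12.0], [5.0, 6.0, 7.0]
unadjusted, adjusted, slope = ancova_adjusted_difference(y_trt, x_trt, y_ctl, x_ctl)
print(f"unadjusted: {unadjusted:.2f}, adjusted: {adjusted:.2f}, slope: {slope:.2f}")
```

In this constructed example the baseline imbalance inflates the unadjusted difference, and the adjustment recovers the treatment effect used to generate the data.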

It is important to clarify whether one- or two-sided tests of statistical significance will be used, and in particular to justify prospectively the use of one-sided tests. If hypothesis tests are not considered appropriate, then the alternative process for arriving at statistical conclusions should be given. The issue of one-sided or two-sided approaches to inference is controversial and a diversity of views can be found in the statistical literature. The approach of setting type I errors for one-sided tests at half the conventional type I error used in two-sided tests is preferable in regulatory settings. This promotes consistency with the two-sided confidence intervals that are generally appropriate for estimating the possible size of the difference between two treatments.
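The consistency referred to above can be checked numerically. In the sketch below, which assumes a normally distributed estimate with known standard error and uses hypothetical names and values, a one-sided test at the 2.5% level rejects exactly when the lower limit of the two-sided 95% confidence interval lies above zero.

```python
from statistics import NormalDist


def one_sided_p(estimate, se):
    """One-sided p-value for H0: effect <= 0, using a normal approximation."""
    return 1 - NormalDist().cdf(estimate / se)


def two_sided_ci(estimate, se, level=0.95):
    """Two-sided confidence interval, using a normal approximation."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return estimate - z * se, estimate + z * se


# Hypothetical treatment difference and standard error.
estimate, se = 2.4, 1.1
p = one_sided_p(estimate, se)
lower, upper = two_sided_ci(estimate, se)

print(f"one-sided p = {p:.4f}, 95% CI = ({lower:.3f}, {upper:.3f})")
print("decisions agree:", (p < 0.025) == (lower > 0))
```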

The particular statistical model chosen should reflect the current state of medical and statistical knowledge about the variables to be analysed as well as the statistical design of the trial. All effects to be fitted in the analysis (for example in analysis of variance models) should be fully specified, and the manner, if any, in which this set of effects might be modified in response to preliminary results should be explained. The same considerations apply to the set of covariates fitted in an analysis of covariance. (See also Section 5.7.) In the choice of statistical methods due attention should be paid to the statistical distribution of both primary and secondary variables. When making this choice (for example between parametric and non-parametric methods) it is important to bear in mind the need to provide statistical estimates of the size of treatment effects together with confidence intervals (in addition to significance tests).

The primary analysis of the primary variable should be clearly distinguished from supporting analyses of the primary or secondary variables. Within the statistical section of the protocol or the statistical analysis plan there should also be an outline of the way in which data other than the primary and secondary variables will be summarised and reported. This should include a reference to any approaches adopted for the purpose of achieving consistency of analysis across a range of trials, for example for safety data.

Modelling approaches that incorporate information on known pharmacological parameters, the extent of protocol compliance for individual subjects or other biologically based data may provide valuable insights into actual or potential efficacy, especially with regard to estimation of treatment effects. The assumptions underlying such models should always be clearly identified, and the limitations of any conclusions should be carefully described.

5.6 Adjustment of Significance and Confidence Levels

When multiplicity is present, the usual frequentist approach to the analysis of clinical trial data may necessitate an adjustment to the type I error. Multiplicity may arise, for example, from multiple primary variables (see Section 2.2.2), multiple comparisons of treatments, repeated evaluation over time and/or interim analyses (see Section 4.5). Methods to avoid or reduce multiplicity are sometimes preferable when available, such as the identification of the key primary variable (multiple variables), the choice of a critical treatment contrast (multiple comparisons), or the use of a summary measure such as 'area under the curve' (repeated measures). In confirmatory analyses, any aspects of multiplicity which remain after steps of this kind have been taken should be identified in the protocol; adjustment should always be considered and the details of any adjustment procedure or an explanation of why adjustment is not thought to be necessary should be set out in the analysis plan.
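This guidance does not prescribe a particular adjustment procedure. Purely as an illustration, the sketch below computes adjusted p-values under two widely used procedures, Bonferroni and Holm's step-down method, for a set of hypothetical p-values. The function names are assumptions made for the example.

```python
def bonferroni_adjust(p_values):
    """Bonferroni-adjusted p-values: each p multiplied by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, m * p) for p in p_values]


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on;
        # the running maximum keeps the adjusted values monotone.
        running_max = max(running_max, (m - rank) * p_values[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted


# Hypothetical p-values from three comparisons.
raw = [0.010, 0.040, 0.030]
print("Bonferroni:", [round(p, 3) for p in bonferroni_adjust(raw)])
print("Holm:      ", [round(p, 3) for p in holm_adjust(raw)])
```

Holm's adjusted p-values are never larger than the Bonferroni ones, so the procedure is at least as powerful while still controlling the overall type I error.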

5.7 Subgroups, Interactions and Covariates

The primary variable(s) is often systematically related to other influences apart from treatment. For example, there may be relationships to covariates such as age and sex, or there may be differences between specific subgroups of subjects such as those treated at the different centres of a multicentre trial. In some instances an adjustment for the influence of covariates or for subgroup effects is an integral part of the planned analysis and hence should be set out in the protocol. Pre-trial deliberations should identify those covariates and factors expected to have an important influence on the primary variable(s), and should consider how to account for these in the analysis in order to improve precision and to compensate for any lack of balance between treatment groups. If one or more factors are used to stratify the design, it is appropriate to account for those factors in the analysis. When the potential value of an adjustment is in doubt, it is often advisable to nominate the unadjusted analysis as the one for primary attention, the adjusted analysis being supportive. Special attention should be paid to centre effects and to the role of baseline measurements of the primary variable. It is not advisable to adjust the main analyses for covariates measured after randomisation because they may be affected by the treatments.

The treatment effect itself may also vary with subgroup or covariate - for example, the effect may decrease with age or may be larger in a particular diagnostic category of subjects. In some cases such interactions are anticipated or are of particular prior interest (e.g. geriatrics), and hence a subgroup analysis, or a statistical model including interactions, is part of the planned confirmatory analysis. In most cases, however, subgroup or interaction analyses are exploratory and should be clearly identified as such; they should explore the uniformity of any treatment effects found overall. In general, such analyses should proceed first through the addition of interaction terms to the statistical model in question, complemented by additional exploratory analysis within relevant subgroups of subjects, or within strata defined by the covariates. When exploratory, these analyses should be interpreted cautiously; any conclusion of treatment efficacy (or lack thereof) or safety based solely on exploratory subgroup analyses is unlikely to be accepted.

5.8 Integrity of Data and Computer Software Validity

The credibility of the numerical results of the analysis depends on the quality and validity of the methods and software (both internally and externally written) used both for data management (data entry, storage, verification, correction and retrieval) and also for processing the data statistically. Data management activities should therefore be based on thorough and effective standard operating procedures. The computer software used for data management and statistical analysis should be reliable, and documentation of appropriate software testing procedures should be available.

VI. EVALUATION OF SAFETY AND TOLERABILITY

6.1 Scope of Evaluation

In all clinical trials evaluation of safety and tolerability (see Glossary) constitutes an important element. In early phases this evaluation is mostly of an exploratory nature, and is only sensitive to frank expressions of toxicity, whereas in later phases the establishment of the safety and tolerability profile of a drug can be characterised more fully in larger samples of subjects. Later phase controlled trials represent an important means of exploring in an unbiased manner any new potential adverse effects, even if such trials generally lack power in this respect.

Certain trials may be designed with the purpose of making specific claims about superiority or equivalence with regard to safety and tolerability compared to another drug or to another dose of the investigational drug. Such specific claims should be supported by relevant evidence from confirmatory trials, similar to that necessary for corresponding efficacy claims.

6.2 Choice of Variables and Data Collection

In any clinical trial the methods and measurements chosen to evaluate the safety and tolerability of a drug will depend on a number of factors, including knowledge of the adverse effects of closely related drugs, information from non-clinical and earlier clinical trials and possible consequences of the pharmacodynamic/pharmacokinetic properties of the particular drug, the mode of administration, the type of subjects to be studied, and the duration of the trial. Laboratory tests concerning clinical chemistry and haematology, vital signs, and clinical adverse events (diseases, signs and symptoms) usually form the main body of the safety and tolerability data. The occurrence of serious adverse events and treatment discontinuations due to adverse events are particularly important to register (see ICH E2A and ICH E3).

Furthermore, it is recommended that a consistent methodology be used for the data collection and evaluation throughout a clinical trial program in order to facilitate the combining of data from different trials. The use of a common adverse event dictionary is particularly important. This dictionary has a structure which allows the adverse event data to be summarised on three different levels: system-organ class, preferred term or included term (see Glossary). The preferred term is the level on which adverse events are usually summarised, and preferred terms belonging to the same system-organ class could then be brought together in the descriptive presentation of data (see ICH M1).
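As an illustration of the dictionary levels, the following minimal sketch counts subjects with at least one event at the preferred-term and system-organ-class levels. The dictionary entries, system-organ-class names and subject records are hypothetical and serve only the example; a real programme would use its chosen coding dictionary.

```python
from collections import defaultdict

# Hypothetical coding dictionary: included term -> (preferred term, system-organ class).
DICTIONARY = {
    "Joint pain": ("Arthralgia", "Musculoskeletal"),
    "Aching joints": ("Arthralgia", "Musculoskeletal"),
    "Muscle ache": ("Myalgia", "Musculoskeletal"),
    "Feeling sick": ("Nausea", "Gastrointestinal"),
}

def subjects_per_level(events):
    """events: iterable of (subject_id, included_term).

    Returns two dicts giving, per preferred term and per system-organ
    class, the number of distinct subjects with at least one event.
    """
    by_pt, by_soc = defaultdict(set), defaultdict(set)
    for subject_id, included_term in events:
        preferred_term, soc = DICTIONARY[included_term]
        by_pt[preferred_term].add(subject_id)
        by_soc[soc].add(subject_id)
    return ({k: len(v) for k, v in by_pt.items()},
            {k: len(v) for k, v in by_soc.items()})

# Subject 1 reports two included terms that map to the same preferred term.
events = [(1, "Joint pain"), (1, "Aching joints"), (2, "Muscle ache"), (3, "Feeling sick")]
pt_counts, soc_counts = subjects_per_level(events)
```

Counting subjects rather than events means that a subject with several included terms under one preferred term contributes once at that level.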

6.3 Set of Subjects to be Evaluated and Presentation of Data

For the overall safety and tolerability assessment, the set of subjects to be summarised is usually defined as those subjects who received at least one dose of the investigational drug. Safety and tolerability variables should be collected as comprehensively as possible from these subjects, including type of adverse event, severity, onset and duration (see ICH E2B). Additional safety and tolerability evaluations may be needed in specific subpopulations, such as females, the elderly (see ICH E7), the severely ill, or those who have a common concomitant treatment. These evaluations may need to address more specific issues (see ICH E3).

All safety and tolerability variables will need attention during evaluation, and the broad approach should be indicated in the protocol. All adverse events should be reported, whether or not they are considered to be related to treatment. All available data in the study population should be accounted for in the evaluation. Definitions of measurement units and reference ranges of laboratory variables should be made with care; if different units or different reference ranges appear in the same trial (e.g. if more than one laboratory is involved), then measurements should be appropriately standardised to allow a unified evaluation. Use of a toxicity grading scale should be prespecified and justified.
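One possible way to standardise laboratory values measured against different reference ranges is to rescale each value relative to the reference range of the laboratory that produced it. The sketch below is only an illustration of that idea; the function name and the example ranges are assumptions, and the guidance does not prescribe a particular standardisation method.

```python
def range_normalise(value, lower, upper):
    """Rescale a laboratory value so that 0 corresponds to the lower
    limit of normal and 1 to the upper limit of normal for the
    laboratory that produced it."""
    if upper <= lower:
        raise ValueError("upper reference limit must exceed lower limit")
    return (value - lower) / (upper - lower)

# Hypothetical example: two laboratories with different reference ranges.
lab_a = range_normalise(55.0, lower=10.0, upper=40.0)
lab_b = range_normalise(50.0, lower=5.0, upper=35.0)
```

In this example both values lie the same relative distance above their own upper limit, so they receive the same standardised value even though the raw numbers differ.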

The incidence of a certain adverse event is usually expressed in the form of a proportion relating number of subjects experiencing events to number of subjects at risk. However, it is not always self-evident how to assess incidence. For example, depending on the situation the number of exposed subjects or the extent of exposure (in person-years) could be considered for the denominator. Whether the purpose of the calculation is to estimate a risk or to make a comparison between treatment groups it is important that the definition is given in the protocol. This is especially important if long-term treatment is planned and a substantial proportion of treatment withdrawals or deaths are expected. For such situations survival analysis methods should be considered and cumulative adverse event rates calculated in order to avoid the risk of underestimation.
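The choice of denominator can be illustrated with a small sketch comparing three measures: the crude proportion, the event rate per person-year of exposure, and a cumulative incidence from a Kaplan-Meier calculation. The data are invented, the function names are assumptions, and withdrawals are treated as non-informative censoring, which would itself need justification in a real trial.

```python
def crude_proportion(records):
    """Subjects with an event divided by subjects exposed."""
    return sum(1 for _, had_event in records if had_event) / len(records)

def rate_per_person_year(records):
    """Events divided by total time at risk (time to first event or censoring)."""
    return sum(1 for _, had_event in records if had_event) / sum(t for t, _ in records)

def km_cumulative_incidence(records):
    """One minus the Kaplan-Meier estimate at the last observed time.

    records: list of (years_on_study, had_event); had_event False means
    the subject withdrew or completed without the event (censored).
    """
    survival = 1.0
    at_risk = len(records)
    for t in sorted({time for time, _ in records}):
        events = sum(1 for time, e in records if time == t and e)
        leaving = sum(1 for time, _ in records if time == t)
        if events:
            survival *= 1 - events / at_risk
        at_risk -= leaving
    return 1 - survival

# Four hypothetical subjects: two events, two early withdrawals.
records = [(1.0, True), (2.0, False), (3.0, True), (4.0, False)]
crude = crude_proportion(records)
rate = rate_per_person_year(records)
cumulative = km_cumulative_incidence(records)
```

With these data the crude proportion is lower than the Kaplan-Meier cumulative incidence, because the crude figure treats subjects who withdrew as if they had been followed for the whole period.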

In situations when there is a substantial background noise of signs and symptoms (e.g. in psychiatric trials) one should consider ways of accounting for this in the estimation of risk for different adverse events. One such method is to make use of the 'treatment emergent' (see Glossary) concept in which adverse events are recorded only if they emerge or worsen relative to pretreatment baseline.
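The 'treatment emergent' rule can be written as a simple check. The sketch below assumes a three-level severity grading and represents absence at baseline as None; both are illustrative choices rather than requirements of the guidance.

```python
# Hypothetical ordered severity grades.
SEVERITY = {"mild": 1, "moderate": 2, "severe": 3}

def is_treatment_emergent(baseline_severity, on_treatment_severity):
    """Return True if the event was absent before treatment
    (baseline_severity is None) or is more severe on treatment
    than it was at the pre-treatment baseline."""
    if baseline_severity is None:
        return True
    return SEVERITY[on_treatment_severity] > SEVERITY[baseline_severity]

new_event = is_treatment_emergent(None, "mild")
worsened = is_treatment_emergent("mild", "severe")
unchanged = is_treatment_emergent("moderate", "moderate")
```

Only the first two cases would be recorded as treatment emergent; a symptom that persists at its baseline severity would not.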

Other methods to reduce the effect of the background noise may also be appropriate such as ignoring adverse events of mild severity or requiring that an event should have been observed at repeated visits to qualify for inclusion in the numerator. Such methods should be explained and justified in the protocol.

6.4 Statistical Evaluation

The investigation of safety and tolerability is a multidimensional problem. Although some specific adverse effects can usually be anticipated and specifically monitored for any drug, the range of possible adverse effects is very large, and new and unforeseeable effects are always possible. Further, an adverse event experienced after a protocol violation, such as use of an excluded medication, may introduce a bias. This background underlies the statistical difficulties associated with the analytical evaluation of safety and tolerability of drugs, and means that conclusive information from confirmatory clinical trials is the exception rather than the rule.

In most trials the safety and tolerability implications are best addressed by applying descriptive statistical methods to the data, supplemented by calculation of confidence intervals wherever this aids interpretation. It is also valuable to make use of graphical presentations in which patterns of adverse events are displayed both within treatment groups and within subjects.

The calculation of p-values is sometimes useful either as an aid to evaluating a specific difference of interest, or as a 'flagging' device applied to a large number of safety and tolerability variables to highlight differences worth further attention. This is particularly useful for laboratory data, which otherwise can be difficult to summarise appropriately. It is recommended that laboratory data be subjected to both a quantitative analysis, e.g. evaluation of treatment means, and a qualitative analysis in which the numbers of values above or below certain thresholds are counted.

If hypothesis tests are used, statistical adjustments for multiplicity to quantify the type I error are appropriate, but the type II error is usually of more concern. Care should be taken when interpreting putative statistically significant findings when there is no multiplicity adjustment.
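As one example of a multiplicity adjustment, the sketch below computes Holm step-down adjusted p-values. The choice of Holm's procedure is an assumption made for illustration; the guidance does not name a specific method, and the p-values are invented.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values in the original order.

    The smallest p-value is multiplied by m, the next by m - 1, and so
    on; adjusted values are capped at 1 and forced to be non-decreasing
    along the sorted order.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

# Three hypothetical safety comparisons.
adjusted = holm_adjust([0.01, 0.04, 0.03])
```

Here only the first comparison remains below 0.05 after adjustment, which illustrates the loss of sensitivity (the type II error concern) that accompanies control of the type I error.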

In the majority of trials investigators are seeking to establish that there are no clinically unacceptable differences in safety and tolerability compared with either a comparator drug or a placebo. As is the case for non-inferiority or equivalence evaluation of efficacy, the use of confidence intervals is preferred to hypothesis testing in this situation. In this way, the considerable imprecision often arising from low frequencies of occurrence is clearly demonstrated.
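The imprecision associated with low event frequencies can be made visible with a confidence interval for the difference in proportions. The sketch below uses Newcombe's hybrid score interval built from Wilson intervals, chosen here as an assumption because simple Wald intervals behave poorly when events are rare; the counts are hypothetical.

```python
import math
from statistics import NormalDist

def wilson_interval(events, n, level=0.95):
    """Wilson score interval for a single proportion."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p = events / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

def risk_difference_interval(events_1, n_1, events_2, n_2, level=0.95):
    """Newcombe hybrid score interval for p1 - p2.

    Returns (difference, lower limit, upper limit).
    """
    p1, p2 = events_1 / n_1, events_2 / n_2
    l1, u1 = wilson_interval(events_1, n_1, level)
    l2, u2 = wilson_interval(events_2, n_2, level)
    diff = p1 - p2
    lower = diff - math.sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = diff + math.sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return diff, lower, upper

# Hypothetical counts: 3/100 events on the investigational drug, 1/100 on placebo.
diff, lower, upper = risk_difference_interval(3, 100, 1, 100)
```

With so few events the interval is wide relative to the observed difference and includes zero, so the data are compatible both with no difference and with a clinically relevant excess.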

6.5 Integrated Summary

The safety and tolerability properties of a drug are commonly summarised across trials continuously during an investigational product’s development and in particular at the time of a marketing application. The usefulness of this summary, however, is dependent on adequate and well-controlled individual trials with high data quality.

The overall usefulness of a drug is always a question of balance between risk and benefit. Such a perspective could also be considered in a single trial, even if the assessment of risk/benefit is usually performed in the summary of the entire clinical trial program (see Section 7.2.2).

For more details on the reporting of safety and tolerability, see Chapter 12 of ICH E3.

VII. REPORTING

7.1 Evaluation and Reporting

As stated in the Introduction, the structure and content of clinical study reports is the subject of ICH E3. That ICH guidance fully covers the reporting of statistical work, appropriately integrated with clinical and other material. The current section is therefore relatively brief.

During the planning phase of a trial the principal features of the analysis should have been specified in the protocol as described in Section 5. When the conduct of the trial is over and the data are assembled and available for preliminary inspection, it is valuable to carry out the blind review of the planned analysis also described in Section 5. This pre-analysis review, blinded to treatment, should cover decisions concerning, for example, the exclusion of subjects or data from the analysis sets; possible transformations may also be checked, and outliers defined; important covariates identified in other recent research may be added to the model; the use of parametric or non-parametric methods may be reconsidered. Decisions made at this time should be described in the report, and should be distinguished from those made after the statistician has had access to the treatment codes, as blind decisions will generally introduce less potential for bias. Statisticians or other staff involved in unblinded interim analysis should not participate in the blind review or in making modifications to the statistical analysis plan. When the blinding is compromised by the possibility that treatment induced effects may be apparent in the data, special care will be needed for the blind review.

Many of the more detailed aspects of presentation and tabulation should be finalised at or about the time of the blind review so that by the time of the actual analysis full plans exist for all its aspects including subject selection, data selection and modification, data summary and tabulation, estimation and hypothesis testing. Once data validation is complete, the analysis should proceed according to the pre-defined plans; the more these plans are adhered to, the greater the credibility of the results. Particular attention should be paid to any differences between the planned analysis and the actual analysis as described in the protocol, protocol amendments or the updated statistical analysis plan based on a blind review of data. A careful explanation should be provided for deviations from the planned analysis.

All subjects who entered the trial should be accounted for in the report, whether or not they are included in the analysis. All reasons for exclusion from analysis should be documented; for any subject included in the full analysis set but not in the per protocol set, the reasons for exclusion from the latter should also be documented. Similarly, for all subjects included in an analysis set, the measurements of all important variables should be accounted for at all relevant time-points.

The effect of all losses of subjects or data, withdrawals from treatment and major protocol violations on the main analyses of the primary variable(s) should be considered carefully. Subjects lost to follow up, withdrawn from treatment, or with a severe protocol violation should be identified, and a descriptive analysis of them provided, including the reasons for their loss and its relationship to treatment and outcome.

Descriptive statistics form an indispensable part of reports. Suitable tables and/or graphical presentations should illustrate clearly the important features of the primary and secondary variables and of key prognostic and demographic variables. The results of the main analyses relating to the objectives of the trial should be the subject of particularly careful descriptive presentation. When reporting the results of significance tests, precise p-values (e.g. 'p=0.034') should be reported rather than making exclusive reference to critical values.

Although the primary goal of the analysis of a clinical trial should be to answer the questions posed by its main objectives, new questions based on the observed data may well emerge during the unblinded analysis. Additional and perhaps complex statistical analysis may be the consequence. This additional work should be strictly distinguished in the report from work which was planned in the protocol.

The play of chance may lead to unforeseen imbalances between the treatment groups in terms of baseline measurements not pre-defined as covariates in the planned analysis but having some prognostic importance nevertheless. This is best dealt with by showing that an additional analysis which accounts for these imbalances reaches essentially the same conclusions as the planned analysis. If this is not the case, the effect of the imbalances on the conclusions should be discussed.

In general, sparing use should be made of unplanned analyses. Such analyses are often carried out when it is thought that the treatment effect may vary according to some other factor or factors. An attempt may then be made to identify subgroups of subjects for whom the effect is particularly beneficial. The potential dangers of over-interpretation of unplanned subgroup analyses are well known (see also Section 5.7), and should be carefully avoided. Although similar problems of interpretation arise if a treatment appears to have no benefit, or an adverse effect, in a subgroup of subjects, such possibilities should be properly assessed and should therefore be reported.

Finally statistical judgement should be brought to bear on the analysis, interpretation and presentation of the results of a clinical trial. To this end the trial statistician should be a member of the team responsible for the clinical study report, and should approve the clinical report.

7.2 Summarising the Clinical Database

An overall summary and synthesis of the evidence on safety and efficacy from all the reported clinical trials is required for a marketing application (Expert report in EU, integrated summary reports in USA, Gaiyo in Japan). This may be accompanied, when appropriate, by a statistical combination of results.

Within the summary a number of areas of specific statistical interest arise: describing the demography and clinical features of the population treated during the course of the clinical trial programme; addressing the key questions of efficacy by considering the results of the relevant (usually controlled) trials and highlighting the degree to which they reinforce or contradict each other; summarising the safety information available from the combined database of all the trials whose results contribute to the marketing application and identifying potential safety issues. During the design of a clinical programme careful attention should be paid to the uniform definition and collection of measurements which will facilitate subsequent interpretation of the series of trials, particularly if they are likely to be combined across trials. A common dictionary for recording the details of medication, medical history and adverse events should be selected and used. A common definition of the primary and secondary variables is nearly always worthwhile, and essential for meta-analysis. The manner of measuring key efficacy variables, the timing of assessments relative to randomisation/entry, the handling of protocol violators and deviators and perhaps the definition of prognostic factors, should all be kept compatible unless there are valid reasons not to do so.

Any statistical procedures used to combine data across trials should be described in detail. Attention should be paid to the possibility of bias associated with the selection of trials, to the homogeneity of their results, and to the proper modelling of the various sources of variation. The sensitivity of conclusions to the assumptions and selections made should be explored.

7.2.1 Efficacy Data

Individual clinical trials should always be large enough to satisfy their objectives. Additional valuable information may also be gained by summarising a series of clinical trials which address essentially identical key efficacy questions. The main results of such a set of trials should be presented in an identical form to permit comparison, usually in tables or graphs which focus on estimates plus confidence limits. The use of meta-analytic techniques to combine these estimates is often a useful addition, because it allows a more precise overall estimate of the size of the treatment effects to be generated, and provides a complete and concise summary of the results of the trials. Under exceptional circumstances a meta-analytic approach may also be the most appropriate way, or the only way, of providing sufficient overall evidence of efficacy via an overall hypothesis test. When used for this purpose the meta-analysis should have its own prospectively written protocol.
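A simple way to combine trial-level estimates is inverse-variance weighting under a fixed-effect model, with Cochran's Q as a rough check of homogeneity. The sketch below is one such illustration under that assumed model; the estimates and standard errors are invented, and a real summary might instead need a random-effects model.

```python
import math

def fixed_effect_pool(estimates, standard_errors):
    """Inverse-variance fixed-effect combination of trial estimates.

    Returns (pooled estimate, pooled standard error, Cochran's Q).
    Q is compared with a chi-square distribution on k - 1 degrees of
    freedom to assess homogeneity of the k trial results.
    """
    weights = [1 / se**2 for se in standard_errors]
    total = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total
    pooled_se = math.sqrt(1 / total)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
    return pooled, pooled_se, q

# Two hypothetical trials with equal precision.
pooled, pooled_se, q = fixed_effect_pool([1.0, 2.0], [1.0, 1.0])
```

The pooled standard error is smaller than that of either trial alone, which is the gain in precision referred to above.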

7.2.2 Safety Data

In summarising safety data it is important to examine the safety database thoroughly for any indications of potential toxicity, and to follow up any indications by looking for an associated supportive pattern of observations. The combination of the safety data from all human exposure to the drug provides an important source of information, because its larger sample size provides the best chance of detecting the rarer adverse events and, perhaps, of estimating their approximate incidence. However, incidence data from this database are difficult to evaluate because of the lack of a comparator group, and data from comparative trials are especially valuable in overcoming this difficulty. The results from trials which use a common comparator (placebo or specific active comparator) should be combined and presented separately for each comparator providing sufficient data.

All indications of potential toxicity arising from exploration of the data should be reported. The evaluation of the reality of these potential adverse effects should take account of the issue of multiplicity arising from the numerous comparisons made. The evaluation should also make appropriate use of survival analysis methods to exploit the potential relationship of the incidence of adverse events to duration of exposure and/or follow-up. The risks associated with identified adverse effects should be appropriately quantified to allow a proper assessment of the risk/benefit relationship.

GLOSSARY

Term	Definition
Bayesian Approaches	Approaches to data analysis that provide a posterior probability distribution for some parameter (e.g. treatment effect), derived from the observed data and a prior probability distribution for the parameter. The posterior distribution is then used as the basis for statistical inference.
Bias (Statistical & Operational)	The systematic tendency of any factors associated with the design, conduct, analysis and evaluation of the results of a clinical trial to make the estimate of a treatment effect deviate from its true value. Bias introduced through deviations in conduct is referred to as 'operational' bias. The other sources of bias listed above are referred to as 'statistical'.
Blind Review	The checking and assessment of data during the period of time between trial completion (the last observation on the last subject) and the breaking of the blind, for the purpose of finalising the planned analysis.
Content Validity	The extent to which a variable (e.g. a rating scale) measures what it is supposed to measure.
Double-Dummy	A technique for retaining the blind when administering supplies in a clinical trial, when the two treatments cannot be made identical. Supplies are prepared for Treatment A (active and indistinguishable placebo) and for Treatment B (active and indistinguishable placebo). Subjects then take two sets of treatment; either A (active) and B (placebo), or A (placebo) and B (active).
Dropout	A subject in a clinical trial who for any reason fails to continue in the trial until the last visit required of him/her by the study protocol.
Equivalence Trial	A trial with the primary objective of showing that the response to two or more treatments differs by an amount which is clinically unimportant. This is usually demonstrated by showing that the true treatment difference is likely to lie between a lower and an upper equivalence margin of clinically acceptable differences.
Frequentist Methods	Statistical methods, such as significance tests and confidence intervals, which can be interpreted in terms of the frequency of certain outcomes occurring in hypothetical repeated realisations of the same experimental situation.
Full Analysis Set	The set of subjects that is as close as possible to the ideal implied by the intention-to-treat principle. It is derived from the set of all randomised subjects by minimal and justified elimination of subjects.
Generalisability, Generalisation	The extent to which the findings of a clinical trial can be reliably extrapolated from the subjects who participated in the trial to a broader patient population and a broader range of clinical settings.
Global Assessment Variable	A single variable, usually a scale of ordered categorical ratings, which integrates objective variables and the investigator's overall impression about the state or change in state of a subject.
Independent Data Monitoring Committee (IDMC) (Data and Safety Monitoring Board, Monitoring Committee, Data Monitoring Committee)	An independent data-monitoring committee that may be established by the sponsor to assess at intervals the progress of a clinical trial, the safety data, and the critical efficacy endpoints, and to recommend to the sponsor whether to continue, modify, or stop a trial.
Intention-To-Treat Principle	The principle that asserts that the effect of a treatment policy can be best assessed by evaluating on the basis of the intention to treat a subject (i.e. the planned treatment regimen) rather than the actual treatment given. It has the consequence that subjects allocated to a treatment group should be followed up, assessed and analysed as members of that group irrespective of their compliance to the planned course of treatment.
Interaction (Qualitative & Quantitative)	The situation in which a treatment contrast (e.g. difference between investigational product and control) is dependent on another factor (e.g. centre). A quantitative interaction refers to the case where the magnitude of the contrast differs at the different levels of the factor, whereas for a qualitative interaction the direction of the contrast differs for at least one level of the factor.
Inter-Rater Reliability	The property of yielding equivalent results when used by different raters on different occasions.
Intra-Rater Reliability	The property of yielding equivalent results when used by the same rater on different occasions.
Interim Analysis	Any analysis intended to compare treatment arms with respect to efficacy or safety at any time prior to the formal completion of a trial.
Meta-Analysis	The formal evaluation of the quantitative evidence from two or more trials bearing on the same question. This most commonly involves the statistical combination of summary statistics from the various trials, but the term is sometimes also used to refer to the combination of the raw data.
Multicentre Trial	A clinical trial conducted according to a single protocol but at more than one site, and therefore, carried out by more than one investigator.
Non-Inferiority Trial	A trial with the primary objective of showing that the response to the investigational product is not clinically inferior to a comparative agent (active or placebo control).
Preferred and Included Terms	In a hierarchical medical dictionary, for example MedDRA, the included term is the lowest level of dictionary term to which the investigator description is coded. The preferred term is the level of grouping of included terms typically used in reporting frequency of occurrence. For example, the investigator text “Pain in the left arm” might be coded to the included term “Joint pain”, which is reported at the preferred term level as “Arthralgia”.
Per Protocol Set (Valid Cases, Efficacy Sample, Evaluable Subjects Sample)	The set of data generated by the subset of subjects who complied with the protocol sufficiently to ensure that these data would be likely to exhibit the effects of treatment, according to the underlying scientific model. Compliance covers such considerations as exposure to treatment, availability of measurements and absence of major protocol violations.
Safety & Tolerability	The safety of a medical product concerns the medical risk to the subject, usually assessed in a clinical trial by laboratory tests (including clinical chemistry and haematology), vital signs, clinical adverse events (diseases, signs and symptoms), and other special safety tests (e.g. ECGs, ophthalmology). The tolerability of the medical product represents the degree to which overt adverse effects can be tolerated by the subject.
Statistical Analysis Plan	A statistical analysis plan is a document that contains a more technical and detailed elaboration of the principal features of the analysis described in the protocol, and includes detailed procedures for executing the statistical analysis of the primary and secondary variables and other data.
Superiority Trial	A trial with the primary objective of showing that the response to the investigational product is superior to a comparative agent (active or placebo control).
Surrogate Variable	A variable that provides an indirect measurement of effect in situations where direct measurement of clinical effect is not feasible or practical.
Treatment Effect	An effect attributed to a treatment in a clinical trial. In most clinical trials the treatment effect of interest is a comparison (or contrast) of two or more treatments.
Treatment Emergent	An event that emerges during treatment having been absent pre-treatment, or worsens relative to the pre-treatment state.
Trial Statistician	A statistician who has a combination of education/training and experience sufficient to implement the principles in this guidance and who is responsible for the statistical aspects of the trial.

ICH E9 临床试验的统计学原则


ICH E9 Statistical Principles for Clinical Trials

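1. Introduction

1.1 Background and Purpose

The efficacy and safety of medicinal products should be demonstrated by clinical trials which follow the guidance in 'Good Clinical Practice: Consolidated Guideline' (ICH E6) adopted by the ICH, 1 May 1996. ICH E6 acknowledges the essential role of statistics in clinical trial design and analysis. The proliferation of statistical research in the area of clinical trials, coupled with the critical role of clinical research in the drug approval process and in health care in general, necessitates a succinct document on statistical issues related to clinical trials. This guidance aims to harmonise the principles of statistical methodology applied to clinical trials for marketing applications submitted in Europe, Japan and the United States.

As a starting point, this guideline utilised the CPMP (Committee for Proprietary Medicinal Products) Note for Guidance entitled 'Biostatistical Methodology in Clinical Trials in Applications for Marketing Authorisations for Medicinal Products' (December 1994). It was also influenced by 'Guidelines on the Statistical Analysis of Clinical Studies' (March 1992) from the Japanese Ministry of Health and Welfare and the U.S. Food and Drug Administration document entitled 'Guideline for the Format and Content of the Clinical and Statistical Sections of a New Drug Application' (July 1988). Some topics related to statistical principles and methodology are also embedded within other ICH guidelines, particularly those listed below. The specific guidance that contains related text will be identified in various sections of this document.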

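E1A:	The Extent of Population Exposure to Assess Clinical Safety
E2A:	Clinical Safety Data Management: Definitions and Standards for Expedited Reporting
E2B:	Clinical Safety Data Management: Data Elements for Transmission of Individual Case Safety Reports
E2C:	Clinical Safety Data Management: Periodic Safety Update Reports for Marketed Drugs
E3:	Structure and Content of Clinical Study Reports
E4:	Dose-Response Information to Support Drug Registration
E5:	Ethnic Factors in the Acceptability of Foreign Clinical Data
E6:	Good Clinical Practice: Consolidated Guideline
E7:	Studies in Support of Special Populations: Geriatrics
E8:	General Considerations for Clinical Trials
E10:	Choice of Control Group in Clinical Trials
M1:	Standardisation of Medical Terminology for Regulatory Purposes
M3:	Non-Clinical Safety Studies for the Conduct of Human Clinical Trials for Pharmaceuticals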

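This guidance is intended to give direction to sponsors in the design, conduct, analysis and evaluation of clinical trials of an investigational product in the context of its overall clinical development. The document will also assist scientific experts charged with preparing application summaries or assessing evidence of efficacy and safety, principally from clinical trials in later phases of development.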

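1.2 Scope and Direction

The focus of this guidance is on statistical principles. It does not address the use of specific statistical procedures or methods. Specific procedural steps to ensure that principles are implemented properly are the responsibility of the sponsor. Integration of data across clinical trials is discussed, but is not a primary focus of this guidance. Principles and procedures related to data management or clinical trial monitoring activities are covered in other ICH guidelines and are not addressed here.

This guidance should be of interest to individuals from a broad range of scientific disciplines. However, it is assumed that the actual responsibility for all statistical work associated with clinical trials will lie with an appropriately qualified and experienced statistician, as indicated in ICH E6. The role and responsibility of the trial statistician (see Glossary), in collaboration with other clinical trial professionals, is to ensure that statistical principles are applied appropriately in clinical trials supporting drug development. Thus, the trial statistician should have a combination of education/training and experience sufficient to implement the principles articulated in this guidance.

For each clinical trial contributing to a marketing application, all important details of its design and conduct and the principal features of its proposed statistical analysis should be clearly specified in the protocol. The extent to which the procedures in the protocol are followed and the primary analysis is planned a priori will contribute to the degree of confidence in the final results and conclusions of the trial. The protocol and subsequent amendments should be approved by the responsible personnel, including the trial statistician. The trial statistician should ensure that the protocol and any amendments cover all relevant statistical issues clearly and accurately, using technical terminology as appropriate.

The principles outlined in this guidance are primarily relevant to clinical trials conducted in the later phases of development, many of which are confirmatory trials of efficacy. In addition to efficacy, confirmatory trials may have as their primary variable a safety variable (e.g. an adverse event, a clinical laboratory variable or an electrocardiographic measure), or a pharmacodynamic or pharmacokinetic variable (as in a confirmatory bioequivalence trial). Furthermore, some confirmatory findings may be derived from data integrated across trials, and selected principles in this guidance are applicable in this situation. Finally, although the early phases of drug development consist mainly of clinical trials that are exploratory in nature, statistical principles are also relevant to these clinical trials. Hence, this guidance should be applied as far as possible to all phases of clinical development.

Many of the principles delineated in this guidance deal with minimising bias (see Glossary) and maximising precision. As used in this guidance, the term 'bias' describes the systematic tendency of any factors associated with the design, conduct, analysis and interpretation of the results of clinical trials to make the estimate of a treatment effect (see Glossary) deviate from its true value. It is important to identify potential sources of bias as completely as possible so that attempts to limit such bias may be made. The presence of bias may seriously compromise the ability to draw valid conclusions from clinical trials.

Some sources of bias arise from the design of the trial, for example an assignment of treatments such that subjects at lower risk are systematically assigned to one treatment. Other sources of bias arise during the conduct and analysis of a clinical trial. For example, protocol violations and exclusion of subjects from analysis based upon knowledge of subject outcomes are possible sources of bias that may affect the accurate assessment of the treatment effect. Because bias can occur in subtle or unknown ways and its effect is not measurable directly, it is important to evaluate the robustness of the results and primary conclusions of the trial. Robustness is a concept that refers to the sensitivity of the overall conclusions to various limitations of the data, assumptions, and analytic approaches to data analysis. Robustness implies that the treatment effect and primary conclusions of the trial are not substantially affected when analyses are carried out based on alternative assumptions or analytic approaches. The interpretation of statistical measures of uncertainty of the treatment effect and treatment comparisons should involve consideration of the potential contribution of bias to the p-value, confidence interval, or inference.

Because the predominant approaches to the design and analysis of clinical trials have been based on frequentist statistical methods, the guidance largely refers to the use of frequentist methods (see Glossary) when discussing hypothesis testing and/or confidence intervals. This should not be taken to imply that other approaches are not appropriate: the use of Bayesian (see Glossary) and other approaches may be considered when the reasons for their use are clear and when the resulting conclusions are sufficiently robust.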

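2. Considerations for Overall Clinical Development

2.1 Trial Context

2.1.1 Development Plan

The broad aim of the process of clinical development of a new drug is to find out whether there is a dose range and schedule at which the drug can be shown to be simultaneously safe and effective, to the extent that the risk-benefit relationship is acceptable. The particular subjects who may benefit from the drug, and the specific indications for its use, also need to be defined.

Satisfying these broad aims usually requires an ordered programme of clinical trials, each with its own specific objectives (see ICH E8). This should be specified in a clinical plan, or a series of plans, with appropriate decision points and flexibility to allow modification as knowledge accumulates. A marketing application should clearly describe the main content of such plans, and the contribution made by each trial. Interpretation and assessment of the evidence from the total programme of trials involves synthesis of the evidence from the individual trials (see Section 7.2). This is facilitated by ensuring that common standards are adopted for a number of features of the trials such as dictionaries of medical terms, definition and timing of the main measurements, handling of protocol deviations and so on. A statistical summary, overview or meta-analysis (see Glossary) may be informative when medical questions are addressed in more than one trial. Where possible this should be envisaged in the plan so that the relevant trials are clearly identified and any necessary common features of their designs are specified in advance. Other major statistical issues (if any) that are expected to affect a number of trials in a common plan should be addressed in that plan.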

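2.1.2 Confirmatory Trial

A confirmatory trial is an adequately controlled trial in which the hypotheses are stated in advance and evaluated. As a rule, confirmatory trials are necessary to provide firm evidence of efficacy or safety. In such trials the key hypothesis of interest is normally pre-defined, follows directly from the trial's primary objective, and is the hypothesis that is tested when the trial is complete. In a confirmatory trial it is equally important to estimate with due precision the size of the effects attributable to the treatment and to relate these effects to their clinical significance.

Confirmatory trials are intended to provide firm evidence in support of claims and hence adherence to protocols and standard operating procedures is particularly important; unavoidable changes should be explained and documented, and their effect examined. A justification of the design of each such trial, and of other important statistical aspects such as the principal features of the planned analysis, should be set out in the protocol. Each trial should address only a limited number of questions.

Firm evidence in support of claims requires that the results of the confirmatory trials demonstrate that the investigational product has clinical benefits. The confirmatory trials should therefore answer each key clinical question relevant to the efficacy or safety claim clearly and definitively. In addition, it is important that the basis for generalisation (see Glossary) to the intended patient population is understood and explained; this may also influence the number and type (e.g. specialist or general practitioner) of centres and/or trials needed. The results of the confirmatory trial(s) should be robust. In some circumstances the weight of evidence from a single confirmatory trial may be sufficient.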

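2.1.3 Exploratory Trial

The rationale and design of confirmatory trials nearly always rest on earlier clinical work carried out in a series of exploratory studies. Like all clinical trials, these exploratory studies should have clear and precise objectives. However, in contrast to confirmatory trials, their objectives may not always lead to simple tests of pre-defined hypotheses. In addition, exploratory trials may sometimes require a more flexible approach to design so that changes can be made in response to accumulating results. Their analysis may be limited to data exploration; tests of hypothesis may also be carried out, but the choice of hypothesis may be data dependent. Such trials cannot be the basis of the formal proof of efficacy, although they may contribute to the total body of relevant evidence.

Any individual trial may have both confirmatory and exploratory aspects. For example, in most confirmatory trials the data are also subjected to exploratory analyses which serve as a basis for explaining or supporting their findings and for suggesting further hypotheses for later research. The protocol should make a clear distinction between the aspects of a trial which will be used for confirmatory proof and the aspects which will provide data for exploratory analysis.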

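2.2 Scope of Trials

2.2.1 Population

In the earlier phases of drug development the choice of subjects for a clinical trial may be heavily influenced by the wish to maximise the chance of observing specific clinical effects of interest, and hence they may come from a very narrow subgroup of the total patient population for which the drug may eventually be indicated. However, by the time the confirmatory trials are undertaken, the subjects in the trials should more closely mirror the target population. Hence, in these trials it is generally helpful to relax the inclusion and exclusion criteria as much as possible within the target population, while maintaining sufficient homogeneity to permit precise estimation of treatment effects. No individual clinical trial can be expected to be totally representative of future users, because of the possible influences of geographical location, the time when it is conducted, the medical practices of the particular investigator(s) and clinics, and so on. Nevertheless, the influence of such factors should be reduced wherever possible, and subsequently discussed during the interpretation of the trial results.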

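2.2.2 Primary and Secondary Variables

The primary variable ('target' variable, primary endpoint) should be the variable capable of providing the most clinically relevant and convincing evidence directly related to the primary objective of the trial. There should generally be only one primary variable. This will usually be an efficacy variable, because the primary objective of most confirmatory trials is to provide strong scientific evidence regarding efficacy. Safety/tolerability may sometimes be the primary variable, and will always be an important consideration. Measurements relating to quality of life and health economics are further potential primary variables. The selection of the primary variable should reflect the accepted norms and standards in the relevant field of research. The use of a reliable and validated variable with which experience has been gained either in earlier studies or in published literature is recommended. There should be sufficient evidence that the primary variable can provide a valid and reliable measure of some clinically relevant and important treatment benefit in the patient population described by the inclusion and exclusion criteria. The primary variable should generally be the one used when estimating the sample size (see Section 3.5).

In many cases, the approach to assessing subject outcome may not be straightforward and should be carefully defined. For example, it is inadequate to specify mortality as a primary variable without further clarification; mortality may be assessed by comparing proportions alive at fixed points in time, or by comparing overall distributions of survival times over a specified interval. Another common example is a recurring event; the measure of treatment effect may be a simple dichotomous variable (any occurrence during a specified interval), time to first occurrence, rate of occurrence (events per time unit of observation), and so on. The assessment of functional status over time in studying treatment for chronic disease presents other challenges in selection of the primary variable. There are many possible approaches, such as comparisons of the assessments done at the beginning and end of the interval of observation, comparisons of slopes calculated from all assessments throughout the interval, comparisons of the proportions of subjects exceeding or declining beyond a specified threshold, or comparisons based on methods for repeated measures data. To avoid multiplicity concerns arising from post hoc definitions, it is critical to specify in the protocol the precise definition of the primary variable as it will be used in the statistical analysis. In addition, the clinical relevance of the specific primary variable selected and the validity of the associated measurement procedures will generally need to be addressed and justified in the protocol.

The primary variable should be specified in the protocol, along with the rationale for its selection. Redefinition of the primary variable after unblinding will almost always be unacceptable, since the biases this introduces are difficult to assess. When the clinical effect defined by the primary objective is to be measured in more than one way, the protocol should identify one of the measurements as the primary variable on the basis of clinical relevance, importance, objectivity, and/or other relevant characteristics, whenever such selection is feasible.

Secondary variables are either supportive measurements related to the primary objective or measurements of effects related to the secondary objectives. Their pre-definition in the protocol is also important, as well as an explanation of their relative importance and roles in interpretation of trial results. The number of secondary variables should be limited and should be related to the limited number of questions to be answered in the trial.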

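2.2.3 Composite Variables

If a single primary variable cannot be selected from multiple measurements associated with the primary objective, another useful strategy is to integrate or combine the multiple measurements into a single or 'composite' variable, using a pre-defined algorithm. Indeed, the primary variable sometimes arises as a combination of multiple clinical measurements (e.g. the rating scales used in arthritis, psychiatric disorders and elsewhere). This approach addresses the multiplicity problem without requiring adjustment to the type I error. The method of combining the multiple measurements should be specified in the protocol, and an interpretation of the resulting scale should be provided in terms of the size of a clinically relevant benefit. When a composite variable is used as a primary variable, the clinically meaningful components of this variable may be analysed separately. When a rating scale is used as a primary variable, it is especially important to address such factors as content validity (see Glossary), inter- and intra-rater reliability (see Glossary) and responsiveness for detecting changes in the severity of disease.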

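2.2.4 Global Assessment Variables

In some cases, 'global assessment' variables (see Glossary) are developed to measure the overall safety, overall efficacy, and/or overall usefulness of a treatment. This type of variable integrates objective variables and the investigator's overall impression about the state or change in the state of the subject, and is usually a scale of ordered categorical ratings. Global assessments of overall efficacy are well established in some therapeutic areas, such as neurology and psychiatry.

Global assessment variables generally have a subjective component. When a global assessment variable is used as a primary or secondary variable, fuller details of the scale should be included in the protocol with respect to:

1) the relevance of the scale to the primary objective of the trial;

2) the basis for the validity and reliability of the scale;

3) how to utilise the data collected on an individual subject to assign him/her to a unique category of the scale;

4) how to assign subjects with missing data to a unique category of the scale, or otherwise evaluate them.

If objective variables are considered by the investigator when making a global assessment, then those objective variables should be considered as additional primary, or at least important secondary, variables.

Global assessment of usefulness integrates components of both benefit and risk and reflects the decision making process of the treating physician, who must weigh benefit and risk in making product use decisions. A problem with global usefulness variables is that their use could in some cases lead to two products being declared equivalent despite having very different profiles of beneficial and adverse effects. For example, judging the global usefulness of a treatment as equivalent or superior to an alternative may mask the fact that it has little or no efficacy but fewer adverse effects. Therefore it is not advisable to use a global usefulness variable as a primary variable. If global usefulness is specified as primary, it is important to consider specific efficacy and safety outcomes separately as additional primary variables.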

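2.2.5 Multiple Primary Variables

It may sometimes be desirable to use more than one primary variable, each of which (or a subset of which) could be sufficient to cover the range of effects of the therapies. The planned manner of interpretation of this type of evidence should be carefully spelled out. It should be clear whether an impact on any of the variables, some minimum number of them, or all of them, would be considered necessary to achieve the trial objectives. The primary hypothesis or hypotheses and parameters of interest (e.g. mean, percentage, distribution) should be clearly stated with respect to the primary variables identified, and the approach to statistical inference described. The effect on the type I error should be explained because of the potential for multiplicity problems (see Section 5.6); the method of controlling type I error should be given in the protocol. The extent of intercorrelation among the proposed primary variables should also be considered in evaluating the impact on type I error. If the purpose of the trial is to demonstrate effects on all of the designated primary variables, then there is no need for adjustment of the type I error, but the impact on type II error and sample size should be carefully considered.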

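2.2.6 Surrogate Variables

When direct assessment of the clinical benefit to the subject through observing actual clinical efficacy is not practical, indirect criteria (surrogate variables, see Glossary) may be considered. Variables believed to be predictors of clinical benefit are commonly accepted as surrogate variables. There are two principal concerns with the introduction of any proposed surrogate variable. First, it may not be a true predictor of the clinical outcome of interest. For example, it may measure treatment activity associated with one specific pharmacological mechanism, but may not provide full information on the range of actions and ultimate effects of the treatment, whether positive or negative. There have been many instances where treatments showing a highly positive effect on a proposed surrogate have ultimately been shown to be detrimental to the subjects' clinical outcome; conversely, there are cases of treatments conferring clear clinical benefit without measurable impact on proposed surrogates. Secondly, proposed surrogate variables may not yield a quantitative measure of clinical benefit that can be weighed directly against adverse effects. Statistical criteria for validating surrogate variables have been proposed but the experience with their use is relatively limited. In practice, the strength of the evidence for surrogacy depends upon (1) the biological plausibility of the relationship, (2) the demonstration in epidemiological studies of the prognostic value of the surrogate for the clinical outcome and (3) evidence from clinical trials that treatment effects on the surrogate correspond to effects on the clinical outcome. Relationships between clinical and surrogate variables for one product do not necessarily apply to a product with a different mode of action for treating the same disease.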

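2.2.7 Categorised Variables

Dichotomisation or other categorisation of continuous or ordinal variables may sometimes be desirable. Criteria of 'success' and 'response' are common examples of dichotomies. The categorisation criteria should be specified precisely, for example in terms of a minimum percentage improvement (relative to baseline) in a continuous variable, or a ranking categorised as at or above some threshold level (e.g. 'good') on an ordinal rating scale.

The reduction of diastolic blood pressure below 90 mmHg is a common dichotomisation. Categorisations are most useful when they have clear clinical relevance. The criteria for categorisation should be pre-defined and specified in the protocol, as the choice of such criteria can easily bias the clinical results. Because categorisation normally implies a loss of information, a consequence will be a loss of power in the analysis; this should be accounted for in the sample size calculation.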

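2.3 Design Techniques to Avoid Bias

The most important design techniques for avoiding bias in clinical trials are blinding and randomisation, and these are normal features of most controlled clinical trials included in a marketing application. Most such trials follow a double-blind approach in which treatments are pre-packed in accordance with a suitable randomisation schedule, and supplied to the trial centre(s) labelled only with the subject number and the treatment period so that no one involved in the conduct of the trial is aware of the specific treatment allocated to any particular subject, not even as a code letter. This approach will be assumed in Section 2.3.1 and most of Section 2.3.2, exceptions being considered at the end.

Bias can also be reduced at the design stage by specifying procedures in the protocol aimed at minimising any anticipated irregularities in trial conduct that might impair a satisfactory analysis, including various types of protocol violations, withdrawals and missing values. The protocol should consider ways both to reduce the frequency of such problems, and also to handle the problems that do occur in the analysis of data.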

2.3.1 盲法

盲法或遮蔽是为了限制临床试验的实施和解释时所产生的有意或无意的偏倚，这些偏倚可能源于以下情况的影响：知晓受试者的招募和处理分组、受试者的后续治疗、受试者对治疗的态度、终点评价、退出的处理、从分析中剔除数据，等等。盲法的根本目标是防止知晓处理分组，直到所有产生偏倚的机会都消失。

在双盲试验中，所有受试者及参与受试者的治疗或临床评价的研究者和申办方人员，包括确定受试者资格、评价终点或评价方案依从性的任何人，均不知道受试者所接受的治疗。在整个试验实施过程中，这种盲态要始终保持，只有当数据被清理到可接受的质量水平时，才可对适当的人员揭盲。如果需要对不参与受试者的治疗或临床评价的申办方人员揭盲处理编码（如生物分析学家、稽查员、参与严重不良事件报告的人员），申办方应该制定严格的标准操作规程，以防止处理编码的不当传播。在单盲试验中，研究者和/或他的成员知道处理分组信息，但受试者不知道，反之亦然。在开放试验中，所有的人都可能知道处理分组信息。双盲试验是最优方法，它要求试验所采用的处理在使用前或使用期间均无法被识别出来（如外观、味道等），且在整个试验期间均适当地保持盲态。

达到理想的双盲会有很多困难：有些处理可能具有完全不同的性质，例如，手术和药物治疗；两种药物可能具有不同的剂型，虽然使用胶囊可以令它们无法被区分，但改变剂型可能会改变药代动力学和/或药效学的特性，因此需要建立制剂的生物等效性；两种处理的每日用法可能不同。这些情况下，使用“双模拟”（见词汇表）技术是实现双盲条件的一种方法，该技术有时会强制实施一种非同寻常的使用方案，使得受试者的积极性和依从性受到负面影响。伦理上的困难也可能会干扰该技术的应用，例如手术过程的模拟。无论如何，应当努力克服这些困难。

某些临床试验的双盲性质可能由于明显的处理诱导效应而遭到部分破坏。这种情况下，使研究者和有关申办方人员对某些检验结果（如所选择的临床实验室测量）保持盲态，可以使盲法得到改善。使偏倚最小化的类似方法（见下文）应当在开放试验中考虑，例如独特的处理效应无法对患者设盲的试验。

如果双盲试验不可行，则应考虑用单盲方案。有些情况下，只有开放试验在实践上或伦理上是可行的。单盲和开放试验更具灵活性，但特别重要的是，研究者知道了下一个受试者的处理不应影响入组受试者的决定，即该决定应在知道随机化处理之前做出。对于这些试验，应考虑使用中央随机化方法，如采用电话随机化管理处理的分配。此外，应该由不参与治疗受试者并对处理保持盲态的医务人员进行临床评价。在单盲或开放试验中，应尽一切努力使各种已知的偏倚来源降到最低，并且应采用尽可能客观的主要指标。应在方案中解释所采用的盲态程度的原因，以及所采取的使偏倚最小化的措施。例如，申办方应当有严格的标准操作规程，以保证在清理数据库以供分析之前，适当限制对处理编码的获取。

只有经治医师认为对某一受试者的治疗有必要知道其处理分配时，才应考虑对该受试者破盲。无论什么原因导致的任何有意或无意地破盲都应该在试验结束时给予报告和解释。处理分配的揭盲过程及时间都应该记录在案。

本文件中，数据的盲态审核（见词汇表）是指在试验完成（对最后一位受试者的最后一次观察）到揭盲之间的这段时间内对数据的检查。

2.3.2 随机化

在临床试验中，随机化将机会元素引入到受试者的处理分配中。在试验数据的后续分析期间，它为定量评价与处理效应有关的证据提供了坚实的统计基础。它倾向于使各处理组的已知和未知的预后因素分布相似。与盲法结合，在受试者的选择和分配时，随机化有助于避免因处理分配的可预测性而可能出现的偏倚。

临床试验的随机化列表记录了施与受试者处理的随机分配，其最简单的方式是处理的序列表（或交叉试验中的处理序列），或按受试者编号对应的编码。有些试验，如具有筛选阶段的试验，可能使问题复杂一些，但是预先计划的受试者的处理分配或处理序列应是唯一的。不同的试验设计需要不同的程序来生成随机化列表。随机化列表应当有重现性（如果需要）。

虽然无限制条件的随机化是一种可接受的方法，但区组随机一般具有某些优势，它有助于增加处理组间的可比性，特别是当受试者特征可能随时间变化时，例如由于招募策略改变引起的变化。它还能更好地保证各处理组的样本量几乎相等。在交叉试验中，它提供了获得具有更高效率和更易于解释的平衡设计的方法。选择区组长度时需注意，既要足够短以限制可能的不平衡，又要足够长以避免对区组序列末尾的可预测性。区组长度通常应对研究者及其他有关人员保持盲态；使用两种或多种区组长度与每个区组随机选择长度，可达到同样目的。（理论上，在双盲试验中，可预测性并不重要，但药物的药理作用可能提供猜测机会。）
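The following is a minimal sketch of permuted-block randomisation with randomly varying block length, as described in the paragraph above. The function name `permuted_block_list`, the two arm labels, the block sizes (4 and 6) and the seed are illustrative assumptions and do not come from the guideline; an actual trial would rely on a validated randomisation system.

```python
import random


def permuted_block_list(n_subjects, treatments=("A", "B"), block_sizes=(4, 6), seed=20240101):
    """Build a randomisation list from permuted blocks of randomly chosen length."""
    rng = random.Random(seed)  # a fixed seed makes the list reproducible
    k = len(treatments)
    if any(size % k for size in block_sizes):
        raise ValueError("each block size must be a multiple of the number of treatments")
    allocation = []
    while len(allocation) < n_subjects:
        size = rng.choice(block_sizes)          # block length picked at random
        block = list(treatments) * (size // k)  # equal numbers per arm within the block
        rng.shuffle(block)                      # random order inside the block
        allocation.extend(block)
    return allocation[:n_subjects]              # the last block may be truncated


schedule = permuted_block_list(24)
print(schedule)
print({arm: schedule.count(arm) for arm in ("A", "B")})
```

Because every complete block is balanced, the difference between the two arms in this sketch can never exceed half of the largest block length (3 here), while mixing two block lengths makes the end of a block harder to predict.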

对于多中心试验（见词汇表），应按中心进行随机化。提倡每个中心有一个单独的随机方案，即按中心分层或为每个中心分配若干完整的区组。更一般地，按照基线测量的重要预后因素（如疾病的严重程度、年龄、性别等）进行分层，可保障层内的平衡分配，这种方法在小型试验中潜在益处更大。分层因素一般不超过三个，否则实现平衡不仅困难，而且麻烦。应用动态分配程序（见下文）可能有助于同时在多个分层因素之间达到平衡，只要可以调整其余试验流程以适应这类方法。应当在后续的分析中对分层随机化的因素加以考虑。

进入试验的下一个随机化受试者，应该接受对应于随机化列表（如果随机化是分层的，则在相应的层中）中下一个号码的处理。只有当已经确认下一个受试者进入到试验的随机化阶段时，才能给受试者分配合适的号码和相关处理。具有增加可预测性的随机化细节，如区组长度，不应包含在试验方案中。随机化列表本身应该由申办方或独立方安全存档，以确保整个试验过程维持盲态。在试验期间获取随机化列表应该考虑在紧急情况下为任何受试者破盲的可能性。破盲应遵循的程序、必要的文件以及受试者后续的处理和评价均应在方案中写明。

动态分配也是一种选择，该方法根据当前已分配的处理的平衡情况进行处理分配，对于分层试验，处理分配视受试者所属层内的平衡情况而定。应当避免确定性的动态分配程序，应当为每个处理分配纳入适当的随机化要素。应尽一切努力保持试验的双盲状态。例如，仅限于中央试验办公室知道处理编码，并由办公室通过电话联系来控制动态分配。这种方法允许对入选标准进行额外检查，并会建立试验入组的记录，这些信息对某些类型的多中心试验具有价值。随后会启用双盲试验的预包装和贴标签的药品供应系统，但它们的使用顺序不再是依次的。最好使用适当的计算机算法使中央试验办公室的人员对处理编码保持盲态。当考虑动态分配时，应该仔细评价物流的复杂性以及对分析的潜在影响。
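As a rough illustration of dynamic allocation with a random element, the sketch below implements a simple form of minimisation: the arm that would leave the marginal totals most balanced is chosen with probability `p_best`, and another arm otherwise. The function name, the imbalance score, the factor names and the value `p_best=0.8` are all assumptions made for this example; the guideline does not prescribe a specific algorithm.

```python
import random


def minimisation_assign(new_subject, enrolled, treatments=("A", "B"), p_best=0.8, rng=random):
    """Pick an arm for new_subject (dict: factor -> level).

    enrolled is a list of (factors_dict, assigned_arm) for subjects already allocated.
    """
    # Score per arm: how many factor levels of the new subject are already
    # represented by subjects allocated to that arm (marginal totals).
    scores = {}
    for arm in treatments:
        scores[arm] = sum(
            1
            for factors, assigned in enrolled
            if assigned == arm
            for name, level in new_subject.items()
            if factors.get(name) == level
        )
    lowest = min(scores.values())
    preferred = [arm for arm in treatments if scores[arm] == lowest]
    if len(preferred) == len(treatments):
        return rng.choice(list(treatments))  # all arms tied: simple randomisation
    others = [arm for arm in treatments if arm not in preferred]
    # Random element: the balancing arm is favoured but never guaranteed.
    if rng.random() < p_best:
        return rng.choice(preferred)
    return rng.choice(others)


rng = random.Random(7)
enrolled = []
for _ in range(200):
    subject = {"centre": rng.choice(["C1", "C2", "C3"]), "sex": rng.choice(["F", "M"])}
    arm = minimisation_assign(subject, enrolled, rng=rng)
    enrolled.append((subject, arm))
counts = {arm: sum(1 for _, a in enrolled if a == arm) for arm in ("A", "B")}
print(counts)
```

Setting `p_best` to 1.0 would make the procedure deterministic whenever the arms are not tied, which is exactly what the text above advises against; values below 1.0 keep each assignment unpredictable.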

3. 试验设计的考虑

3.1 设计类型

3.1.1 平行组设计

对于确证性试验，最常见的临床试验设计是平行组设计，该设计将受试者随机分配到两组或多组中的一组，每组采用不同的处理。这些处理包括一个或多个剂量的研究产品，以及一个或多个对照处理，如安慰剂或/和阳性对照。该设计的假设比大多数其它设计简单，但与其它设计一样，可能会有使分析和解释复杂化的额外试验特征，如协变量、随时间的重复测量、设计因素之间的交互作用、方案违背、脱落（见词汇表）、退出等。

3.1.2 交叉设计

在交叉设计中，每个受试者被随机分到两个或多个处理序列，因此处理间的比较相当于自身对照。这种简单策略之所以有吸引力，主要因为它减少了满足检验效能所需的受试者，有时减少的程度相当可观。2×2 交叉设计是最简单的，该设计通常在先后两个处理周期中安排一个洗脱期，每个受试者以随机顺序在每个处理周期接受两个处理中的其中一个。最常见的扩展设计是n个周期和n（>2）个处理，每个受试者先后接受所有 n 个处理。此类设计形式多样，例如，每个受试者接受n（>2）个处理中的一个子集，或者对一个受试者重复给予处理。

交叉设计有很多问题可导致其结果无效，主要困难在于残留效应，即在后继处理周期内的前序处理的残余影响。使用相加模型时，不同的残留效应将使处理间的直接比较产生偏倚。对于2×2设计，统计上无法将残留效应从处理与周期的交互作用中区分开来，并且因为相应的对比是“受试者之间”，故检验这两个效应中任何一个都缺乏检验效能。这一问题在高阶设计中并不严重，但不能完全消除。

因此，使用交叉设计重要的是要避免残留效应，最好的办法是在充分了解疾病领域和新药的基础上有选择地和谨慎地使用该设计，诸如针对病情稳定的慢性病；治疗周期内可充分发挥药物的相关效应；洗脱期足够长以使药物效应完全消退等。应该在试验前利用已有信息及数据确定是否可满足这些条件。

交叉试验还有一些需要密切注意的问题，其中，受试者失访导致的分析和解释的复杂化最值得关注。另外，残留效应的潜在作用导致后续处理周期所发生的不良事件很难判断是哪种处理所致。这些问题以及其它问题在ICH E4中已有阐述。交叉设计一般应严格限于预期仅有少数失访的试验。

采用2×2交叉设计验证相同药物的两种制剂的生物等效性甚为常用，往往令人满意，尤其是以健康志愿者为对象的试验，如果两个周期间的洗脱时间足够长，极不可能发生相关药代动力学指标的残留效应。不过，在分析期间基于获得的数据核实这一假设仍然非常重要，例如，通过在每个周期开始时未检测到药物来证实无残留效应。

3.1.3 析因设计

在析因设计中，通过使用不同的处理组合可以同时评价两个或多个处理。最简单的例子是2×2析因设计，受试者被随机分配到两个处理 A和B的四种可能组合之一，即单独A、单独B、既有A又有B、既无A又无B。该设计多以检验A和B的交互作用为特定目的。如果基于检验主效应计算样本量，则交互作用统计检验的检验效能可能不足。当该设计被用于检验A和B的联合效应时，特别是如果两者可能被一起使用，这一考虑尤为重要。

析因设计的另一个重要用途是，建立同时使用处理C和D时的剂量-反应特征，特别是在先前试验中每种单一疗法的某个剂量的有效性已被证实的情况。设C的剂量数为m（通常包括零剂量，即安慰剂），相似的D的剂量数为n，整个设计由m×n 个处理组构成，每个处理组为一种不同的C和D的剂量组合，则应用响应面的结果估计可以帮助确定临床使用的C和D剂量的恰当组合（见ICH E4）。

某些情况下，如评价两种处理的有效性所需的受试者数量与单独评价任一种处理的有效性所需的受试者数量相同时，2×2 设计可能会更高效地利用受试者，这一策略已经被证实对非常大型的死亡率试验颇有价值。该方法的效率和可靠性取决于处理A和B之间不存在交互作用，使得A和B对主要有效性指标的主效应服从相加模型，因此，无论A是否叠加于B的效应之上，A的效应几乎相同。与交叉试验一样，应在试验前利用已有的信息和数据，确立该条件很可能得到满足的证据。

3.2 多中心试验

开展多中心试验主要有两个原因。首先，多中心试验是一种更加高效地评价新药的可接受的方法；某些情况下，为在合理的时间框架内获得足够的受试者以满足试验目的，它可能是唯一可行的方法。原则上，在临床研发的任何阶段均可开展这种性质的多中心试验。多中心试验可能有几个中心，每个中心的受试者数量较大；也可能有很多中心，每个中心只有很少的受试者，比如罕见病研究。

其次，设计成多中心（和多个研究者）试验主要是为研究结果的后续推论提供更好的基础，因为从更广泛的人群中招募受试者和呈现更宽泛的使用药物的临床环境，从而呈现出更典型的未来用药场景。这种情况下，许多研究者的参与也可提供更宽泛的药物价值临床判断。此类试验在药物研发后期将成为确证性试验，可能有大量的研究者和中心参与。为增强可推论性（见词汇表），多中心试验有时会在许多不同国家实施。

要想充分解释和外推多中心试验结论，所有中心实施研究方案的方式应该是明确的和相似的。样本量和检验效能的计算通常基于各中心的处理间差异是相同的无偏估计的假设，因此，制定共同研究方案并给予实施很重要。试验的实施流程应该尽可能标准化。通过研究者会议、试验前的人员培训和试验期间的严密监查，可以减少评价标准和方法的不一致性。良好设计的目的通常是实现每个中心内各处理组的受试者分布相同，而良好管理可以对该目的起到支持作用。应避免中心间的病例数相差太大以及个别中心病例数太少，这一考虑的好处会在后期探查中心间处理效应的异质性时显示出来，因为这样可以减少处理效应不同加权估计之间的差异。（这一点并不适用于所有中心病例数都非常少的试验，以及分析时不考虑中心效应。）如果不采取这些预防措施，加之对结果同质性的质疑，会使多中心试验的价值降低，有时甚至严重到不能为申办方的主张提供令人信服的证据的地步。

最简单的多中心试验是每位研究者负责在一家医院招募受试者，所以，“中心”是由研究者或医院唯一确定的。可是，很多试验会更复杂一些，例如，一个研究者可能从几家医院招募受试者；一个研究者可能代表一个临床医生团队（参与研究者），他们或从一家医院所辖的几个诊所，或从几家相关的医院招募受试者。只要对统计模型中关于中心的定义有疑义，方案中的统计章节（见第5.1章节）就应在特定试验背景下明确定义该术语（例如，按研究者、场所或地区）。多数情况下，根据研究者定义中心较为可行，ICH E6在这方面提供了相关指南。定义中心的目的是使影响主要指标测量的因素和处理的影响达到同质，以免因此引起质疑。任何将中心合并起来进行分析的规则应尽可能在方案中合理阐述并预先规定，但是，任何基于此方法的决策都应始终在盲态下做出，如盲态审核。

方案中应该描述处理效应的估计和检验的统计模型。主要处理效应估计可首先使用包含中心效应的模型，但不包含处理与中心的交互项。如果处理效应中心间是同质的，则在模型中常规地包含交互项会降低对主要效应的检验效率；如果确实存在处理效应的异质性，则对处理效应的解释是有争议的。

某些试验，如大型的死亡率试验，每个中心只有很少受试者，设想中心对主要或次要指标有任何影响都是缺乏依据的，因为中心因素的影响不可能代表临床重要性。还有一些试验可能从一开始就会认识到每个中心有限的受试者使得统计模型中包含中心效应变得不切实际。这种情况下，模型中不应包含中心项，而且也没有必要按中心进行分层随机化。

对于每个中心都有充足的受试者的试验，如果发现阳性处理效应，通常应探索不同中心间处理效应的异质性，因为这可能影响结论的外推性。通过各中心结果的图示方法，或通过对中心与处理间交互作用的统计检验，可能会发现明显的异质性。对交互效应做统计检验时，需认识到其检验效能不高，因为试验是基于探测处理的主效应而设计的。

如果发现处理效应的异质性，则应当谨慎地加以解释，并应积极尝试从试验管理的其他特征或受试者特征方面来寻找原因。这样的原因通常会提示适当的进一步分析和解释。在缺乏原因的情况下，一旦证实处理效应的异质性，例如，通过明显的定量交互作用（见词汇表），意味着处理效应可能需要另一种估计，比如给中心不同赋权以保障处理效应估计的稳健性。理解定性交互作用（见词汇表）的异质性甚至更为重要，当未能找到原因时，要想可靠地预测处理效应，可能需要进一步开展临床试验。

以上针对多中心试验的讨论都是基于采用固定效应模型的。混合模型也可用于探索处理效应的异质性，它把中心效应和中心与处理间的交互效应看作是随机的，尤其适合于中心数量特别多的情况。

3.3 比较的类型

3.3.1 优效性试验

科学地讲，通过安慰剂对照试验显示优于安慰剂，或通过显示优于阳性对照处理，或显示剂量-反应关系，所得到的疗效是最可信的。此类试验被称为“优效性”试验（见词汇表）。本指南一般以优效性试验为假定，除非另有明确说明。

对于严重疾病，如果存在经优效性试验验证的有效的治疗方法，采用安慰剂对照试验可能被认为是有悖伦理的。这种情况下，应当科学地采用阳性对照。安慰剂对照和阳性对照的适用性应当不同试验给予不同考虑。

3.3.2 等效性或非劣效性的试验

某些情况下，研究产品与参照处理相比的目的并非为了显示优效性。此类试验根据其目的分为两大类，一类是“等效性”试验（见词汇表），另一类是“非劣效性”试验（见词汇表）。

生物等效性试验属于前一类。某些情况下，出于其他监管原因也进行临床等效性试验，例如，当化合物不被吸收并因此不存在于血液中时，验证仿制产品与已上市产品的临床等效性。

很多阳性对照试验用于验证研究产品的有效性非劣效于阳性对照药，因此属于后一类。另一种可能是在试验中将研究药品的多个剂量与标准药品的推荐剂量或多个剂量进行比较。这种设计的目的是同时显示研究产品的剂量-反应关系，并将研究产品与阳性对照进行比较。

阳性对照等效性或非劣效性试验也可引入安慰剂对照，从而在一个试验中设定多个目标，例如，这种设计在验证优效于安慰剂的同时，还可以评价相对于阳性对照的有效性与安全性的相似程度。众所周知，采用不包含安慰剂或不设置新药多个剂量的阳性对照等效性（或非劣效性）试验会面临一些困难。与优效性试验相比，此类试验隐性缺乏内部效度，因此必须进行外部验证。等效性（或非劣效性）试验本质上并不保守，因此，在试验设计或实施中的许多缺陷倾向于使结果倾向等效的结论。由于这些原因，这些试验的设计特点应受到特别关注，它们的实施需要特别小心，例如，尽量减少违反入选标准、不依从、退出、失访、数据缺失和其它偏离方案的发生率，并使它们对后续分析的影响降至最低。

应谨慎选择阳性对照。恰当的阳性对照应该是一种被广泛使用的疗法，其针对相关适应症的疗效已在良好设计和良好记录的优效性试验中得到了量化确认，并且能够可靠地预期在将要实施的试验中显示出相似的疗效。为此，新试验应该与以前实施且明确显示出临床相关疗效的优效性试验具有相同的重要设计特征（主要指标、阳性对照的剂量、入排标准等），且考虑与新试验相关的医学或统计学实践的进展。

在试验方案中，一个关键问题是要把证明等效性或非劣效性的意图清晰明确地表述出来。方案中应规定一个等效界值，该界值被视为临床可接受的最大差异，并且应当小于在阳性对照优效性试验中所观察到的差异。对于阳性对照等效性试验，需规定等效界值的上限和下限；而对于阳性对照非劣效性试验，仅需规定界值下限。等效界值的选择应具备临床的合理性。

统计分析通常采用置信区间方法（见第5.5章节）。对于等效性试验，应当使用双侧置信区间。如果置信区间完全落在等效界值之内，可推断为等效。在实操上，该法相当于双单侧检验方法，其（复合）无效假设是处理间差异在等效界值之外，（复合）备择假设是处理间差异在等效界值之内。由于两个无效假设无重叠，故I类错误可控。对于非劣效性试验，应当使用单侧置信区间。置信区间方法有与之对应的单侧假设检验，其无效假设是处理间差异（试验品减去对照品）等于或小于等效界值的下限，而备择假设是处理间差异大于等效界值下限。I类错误的选择应当与采用单侧还是双侧方法分开考虑。样本量计算应当基于这些方法（见第3.5章节）。
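The confidence interval approach described above can be sketched as follows, assuming a normally distributed estimate of the treatment difference with a known standard error. The function names, the margins of ±4 and the numeric inputs are hypothetical; a two-sided 95% interval is used for equivalence, which corresponds to two one-sided tests at the 2.5% level each.

```python
from statistics import NormalDist


def ci_difference(diff, se, alpha_two_sided=0.05):
    """Two-sided confidence interval for a treatment difference (normal approximation)."""
    z = NormalDist().inv_cdf(1 - alpha_two_sided / 2)
    return diff - z * se, diff + z * se


def is_equivalent(diff, se, lower_margin, upper_margin, alpha_two_sided=0.05):
    """Equivalence is inferred only if the whole interval lies inside the margins."""
    low, high = ci_difference(diff, se, alpha_two_sided)
    return lower_margin < low and high < upper_margin


def is_non_inferior(diff, se, lower_margin, alpha_one_sided=0.025):
    """Non-inferiority: the one-sided lower confidence bound must exceed the lower margin."""
    z = NormalDist().inv_cdf(1 - alpha_one_sided)
    return diff - z * se > lower_margin


# diff = investigational product minus control (hypothetical numbers)
print(ci_difference(-1.0, 1.0))
print(is_equivalent(-1.0, 1.0, lower_margin=-4.0, upper_margin=4.0))
print(is_non_inferior(-1.0, 1.0, lower_margin=-4.0))
# An interval that contains 0 (no significant difference) but extends past a margin:
print(ci_difference(-3.0, 2.0), is_equivalent(-3.0, 2.0, -4.0, 4.0))
```

The last call shows that a difference which is not statistically significant can still fail the equivalence criterion, because the interval is too wide to sit inside the margins.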

在研究产品与阳性对照之间无差异的无效假设下，如果基于观察到无显著差异的检验结果，做出等效性或非劣效性的结论是不合适的。

在选择分析数据集时也存在一些特殊问题。处理组或对照组退出或脱落的受试者都倾向于缺乏应答，因此使用全分析集（见词汇表）得到的结果可能偏向于证实等效性（见第5.2.3章节）。

3.3.3 剂量-反应关系的试验

新研究产品的剂量与应答如何相关，是一个在研发的所有阶段通过各种方法都可获得答案的问题（见ICH E4）。剂量反应试验可服务于许多目的，相对重要的有：有效性的确证；剂量反应曲线的形状和位置的研究；适宜初始剂量的估计；个体剂量调整的最优策略确定；最大剂量的确定（超出该剂量不可能额外获益）。达到上述目的需要收集研究中各种剂量的数据，包括安慰剂（零剂量）。为此，需用到估计剂量反应关系的方法，包括统计检验以及同样重要的置信区间构建和图示方法。假设检验可能需要根据剂量的自然顺序或关于剂量-反应曲线的形状（如单调性）的特定问题做出调整。应当在方案中提供详细的统计分析计划。

3.4 成组序贯设计

采用成组序贯设计便于进行期中分析（见第4.5章节和词汇表）。成组序贯设计虽然不是用于期中分析的唯一可接受的设计类型，却是最常用的，因为在试验期间以周期性间隔评价不同分组的受试者的结局比在获得整个试验每一个受试者数据后进行评价更为可行。在获得处理结局和受试者的处理分配（如揭盲，见第4.5章节）的信息之前，应充分说明统计方法。独立数据监查委员会（见词汇表）可对来源于成组序贯设计的数据实施审查或进行期中分析（见第 4.6章节）。该设计不仅已被最广泛地、成功地应用于大型、长周期的以死亡率或主要非致死性结局为终点的试验，它在其它方面的应用也在增加。尤其是，人们已经认识到所有试验中都必须监查安全性，因此，为了出于安全原因提早终止试验而制定正式流程的必要性往往是需要考虑的。

3.5 样本量

临床试验的受试者例数应足够大，以对所提出的问题提供可靠答案。样本量通常由试验的主要目的确定，如果由其它要素确定，则应明确说明理由。例如，基于安全性问题或要求，或者基于重要的次要目的确定的样本量，可能比基于主要有效性问题确定的样本量更大（例如，见ICH E1A）。

一般的样本量确定方法应考虑以下要素：主要指标、检验统计量、无效假设、所选剂量下的备择（“工作”）假设（所选受试者人群中在所选剂量下检测出或拒绝的处理间差异）、错误拒绝无效假设的概率（I类错误）、错误地不拒绝无效假设的概率（II类错误），以及应对退出和违背方案的处理方法。某些情况下，以事件率为评价检验效能的主要手段，此时需要做出一些假设，以从所需的事件数推算出试验的最终样本量。
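For event-driven trials, one commonly used way of going from a target power to a required number of events, and then to a number of subjects, is Schoenfeld's approximation for the log-rank test. The sketch below uses that approximation as an assumption; the guideline does not name a specific formula, and the hazard ratio, power and overall event probability shown are hypothetical inputs.

```python
import math
from statistics import NormalDist


def required_events(hazard_ratio, alpha_two_sided=0.05, power=0.90, alloc_fraction=0.5):
    """Approximate number of events needed to detect a given hazard ratio (Schoenfeld)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha_two_sided / 2)
    z_beta = z.inv_cdf(power)
    p = alloc_fraction  # proportion of subjects allocated to one of the two arms
    return math.ceil((z_alpha + z_beta) ** 2 / (p * (1 - p) * math.log(hazard_ratio) ** 2))


def subjects_for_events(events, overall_event_probability):
    """Translate events into subjects using an assumed probability of observing an event."""
    return math.ceil(events / overall_event_probability)


events = required_events(0.75)
print(events, subjects_for_events(events, overall_event_probability=0.4))
```

The second step makes the extra assumption explicit: the final sample size depends on how likely each subject is to have an event during follow-up, which in turn depends on accrual, follow-up time and withdrawal.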

应在方案中给出计算样本量的方法，以及在计算中使用的任何估计量（如方差、均值、反应率、事件率、待检测的差异）。也应该给出这些估计的依据。研究这些假设的偏离对样本量估计的敏感性很重要，而根据偏离假设的合理范围给出对应的样本量范围则是一种方便可行的方法。在确证性研究中，假设通常应基于公开发表的数据或早期试验的结果。对于待检测的处理间差异，可依据在患者管理中对具有临床相关性的最小效应的判断，也可依据对新处理的预期效应的判断，相比之下后者的预期效应更大。通常I类错误概率设在5%或者更小，或者由多重比较所需要的任何调整来决定；检验假设的事先合理性以及结果的预期影响可能会影响I类错误的精确选择。II类错误的概率通常设在10%到20%之间，申办方通常愿意让该值尽可能低，尤其当试验难以或不可能重复时。某些情况下，采用与常规的I类和II类错误水平不同的值也可能被接受，甚至更可取。
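To make the ingredients listed above concrete, here is a minimal sketch for a two-group comparison of means using the usual normal approximation, n per group = 2σ²(z₁₋α/₂ + z₁₋β)²/δ². The function name and all numeric inputs (difference of 5, standard deviations from 8 to 12) are assumptions chosen for illustration; the loop shows the kind of sensitivity range the paragraph recommends.

```python
import math
from statistics import NormalDist


def n_per_group_means(delta, sd, alpha_two_sided=0.05, power=0.80):
    """Subjects per group to detect a mean difference delta (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha_two_sided / 2)  # type I error, two-sided
    z_beta = z.inv_cdf(power)                     # power = 1 - type II error
    return math.ceil(2 * (sd * (z_alpha + z_beta) / delta) ** 2)


# Sensitivity of the sample size to the assumed standard deviation
for sd in (8, 10, 12):
    print(sd, n_per_group_means(delta=5, sd=sd), n_per_group_means(delta=5, sd=sd, power=0.90))
```

The result refers to the analysed population, so in practice it would still need to be inflated for anticipated withdrawals and protocol violations.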

样本量应是主分析所需的受试者数量。如果这是“全分析集”，则效应大小的估计与符合方案集（见词汇表）相比，可能需要降低。这是因纳入了退出处理的或者依从性差的患者数据，而考虑稀释处理效应。相应地关于变异的假设可能也需要修改。

等效性或非劣效性试验（见第3.3.2章节）的样本量通常应基于获得处理间差异的置信区间的目的，该差异是指临床可接受的最大处理间差异。如果等效性试验的检验效能是在假设真实差异为0的条件下确定的，如果真实差异不为0，则达到这一检验效能所需的样本量会被低估。如果非劣效性试验的检验效能是在假设0差异的条件下确定的，如果试验产品的效应低于对照，则达到这一检验效能所需的样本量会被低估。“临床可接受的”差异的选择需要合理说明它对将来患者的意义，并且可能小于上文提到的优效性试验旨在证明的“临床相关的”差异。

成组序贯试验不能预先确定确切的样本量，因为它依赖于机会作用以及所选择的终止试验的准则和真实的处理间差异。终止准则的设计应该考虑后续样本量的分布，通常表达为预期样本量和最大样本量。

当事件率低于预期或变异大于预期时，在不揭盲数据或不进行处理间比较的情况下，可使用样本量重新估计的方法（见第4.4章节）。

3.6 数据采集及处理

数据的收集和研究者向申办方传输数据可通过各种媒介进行，包括纸质病例报告表、远程现场监查系统、医疗计算机系统和电子传输。无论采用何种数据收集工具，所收集信息的形式和内容都应完全符合方案，并应在临床试验实施前确定。应注重分析计划的实施所必须的数据，包括确认方案依从性或确定重要方案违背所需要的背景信息（如与服用剂量有关的时点评价）。 “缺失值”应该与“0值”或“特征缺失”区分开来。

从数据收集到数据库最终确定的过程应该按照 GCP 进行（见ICH E6，第5章节）。具体来说，需要及时可靠的程序用于记录数据和纠正错误与遗漏，以确保交付高质量的数据库，并通过实施计划的分析达到试验目的。

4. 试验实施的考虑

4.1 试验监查和期中分析

按照方案认真实施临床试验，对结果的可靠性具有重大影响（见ICH E6）。仔细监查可以确保尽早发现困难，并将它们的发生和复发减至最小。

由制药企业资助的确证性临床试验，通常有两种截然不同的监查类型。一种关注试验质量的监督，另一种涉及破盲以进行处理间的比较（即期中分析）。两种试验监查，除人员职责不同外，还涉及不同类型试验数据和信息的获取，因此需用不同的规则控制潜在的统计和操作偏倚。

出于监督试验质量的目的，试验监查中所涉及的检查可能包括：是否遵循方案，累积数据是否可接受，计划的收集目标是否达到，设计假设是否合适，以及在试验中保留患者是否成功，等等（见第4.2至4.4章节）。这种类型的监查既不需要获取比较处理效应的信息，也不需要对数据进行揭盲，因此对I类错误没有影响。出于这一目的对试验进行监查是申办方的职责（见ICH E6），可由申办方或申办方选择的独立小组来进行。这种类型的监查周期一般是从选择试验现场开始，到收集和清理最后一位受试者的数据结束。

其他类型的试验监查（期中分析）涉及到比较处理结果的累积。期中分析需要揭盲（即破盲）获取处理组分配信息（实际的处理分配或者各组分配的标识）以及比较处理组的汇总信息。这需要在方案（或者首次分析之前的适当修订）中包含期中分析的统计计划，以防止某些类型的偏倚，见第4.5 和 4.6 章节的讨论。

4.2 纳入与排除标准的更改

纳入与排除标准应按方案的规定保持恒定，贯穿受试者招募期。偶尔有些改变是允许的，例如，在长周期试验中，从试验外部或期中分析所获得的对医学知识新的认识，可能建议修改入组标准。监查人员发现违背入组标准情况经常发生，或者由于入组标准过严导致非常低的招募率，也都可能是修改入组标准的理由。修改入组标准应在不破盲的情况下进行，并通过方案修订进行描述，修订的方案应涵盖任何统计学方面的变动，如不同事件率所致的样本量调整，或者分析计划的修改，如根据修改的纳入/排除标准进行分层分析。

4.3 入组率

在受试者入组时间较长的试验中，应监查入组率，如果它明显低于预期水平，应该查明原因并采取补救措施，以确保试验的检验效能，并减轻对选择性入组和其他质量问题的担忧。这些考虑适用于多中心试验的各个中心。

4.4 样本量调整

在长周期试验中，通常有可能对原设计和样本量计算所依据的假设进行检查。如果试验设计的某些重要规定是根据初步的和/或不确定的信息做出的，这种检查尤其重要。对盲态数据进行期中检查可能会发现总应答的方差、事件率或生存状态不如预期。此时，可能需要通过适当修改假设来修正样本量，还应在方案修订和临床研究报告中说明其合理性并记录在案。应该解释为保持盲态所采取的措施及其对I类错误和置信区间宽度的影响（如果有）。只要可能，都应在方案中表述样本量再估计的潜在需要（见3.5章节）。
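A blinded sample size check of the kind described above might look like the sketch below: the standard deviation is estimated from the pooled interim data without treatment labels and fed back into the original sample size formula. The function names, the two-group normal approximation and the rule of never going below the planned size are assumptions for this example only.

```python
import math
from statistics import NormalDist, stdev


def n_from_sd(sd, delta, alpha_two_sided=0.05, power=0.80):
    """Subjects per group for a two-group comparison of means (normal approximation)."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha_two_sided / 2) + z.inv_cdf(power)
    return math.ceil(2 * (sd * z_sum / delta) ** 2)


def blinded_reestimate(pooled_values, delta, planned_n_per_group, **kwargs):
    """Re-estimate the sample size from pooled data; treatment codes are never used."""
    sd_blinded = stdev(pooled_values)
    n_new = n_from_sd(sd_blinded, delta, **kwargs)
    return max(n_new, planned_n_per_group)  # assumption: the trial is never made smaller


interim = [1, 2, 3, 4, 5]  # hypothetical pooled responses from both arms
print(blinded_reestimate(interim, delta=1.0, planned_n_per_group=30))
```

A pooled standard deviation computed this way also absorbs any true treatment difference, so it tends to be slightly larger than the within-group value; the resulting sample size errs on the large side.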

4.5 期中分析和提早终止试验

期中分析是指，在试验正式完成之前的任何时间，为比较处理组间的有效性或安全性而进行的任何分析。因为这些比较的次数、方法及结果影响试验的解释，因此所有期中分析都应当预先仔细计划并在方案中阐明。有些特殊情况，期中分析可能在试验开始后才发现有必要实施。对于这种情况，补充定义期中分析的方案修订应在分析数据揭盲之前。当期中分析用于决定是否终止试验时，通常会采用成组序贯设计，该设计以统计监查计划作为准则（见第3.4章节）。对于这种期中分析，出现以下情况可以提早终止试验：研究处理的优效性已被证实；相关处理间差异已被证实是不可能的；发生了不可接受的不良反应。一般来说，与安全性监查相比，通过有效性监查来提早终止试验要求更多的证据，即边界更保守。当试验设计和监查目的涉及多个终点时，应考虑多重性问题。

方案中应描述期中分析计划，或至少描述一些相关的考虑，如是否使用灵活的α消耗函数方法，并在第一次期中分析前，在修订的方案中提供进一步的细节。终止试验的准则和特性应在方案或修订的方案中清晰阐述。其他重要指标的分析对提早终止的潜在影响也应考虑。如果试验设有数据监查委员会（见第4.6章节），上述材料应由其撰写或批准。偏离计划总有可能使试验结果失效。如果试验需要修正，任何统计方面的相应修改应尽早在方案修订中详细说明，特别是讨论这些修改对任何分析或推断的影响。在统计方面应始终确保控制总I类错误概率。
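The α spending function approach mentioned above can be illustrated with two widely used Lan-DeMets spending functions, one of O'Brien-Fleming type and one of Pocock type. These particular functions are not named in the guideline and are shown only as examples; the information fractions are hypothetical. Each function returns the cumulative type I error that may be spent by information fraction `t`.

```python
import math
from statistics import NormalDist

_N = NormalDist()


def obf_spending(t, alpha=0.05):
    """O'Brien-Fleming-type spending: very little alpha is spent early."""
    return 2 * (1 - _N.cdf(_N.inv_cdf(1 - alpha / 2) / math.sqrt(t)))


def pocock_spending(t, alpha=0.05):
    """Pocock-type spending: alpha is spent more evenly over the trial."""
    return alpha * math.log(1 + (math.e - 1) * t)


for t in (0.25, 0.5, 0.75, 1.0):  # planned information fractions
    print(t, round(obf_spending(t), 5), round(pocock_spending(t), 5))
```

Both functions reach the full α at `t = 1`, which is how the overall type I error stays controlled. Turning the cumulative α into stopping boundaries at each look needs the joint distribution of the sequential test statistics and is not attempted in this sketch.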

期中分析的执行应该是一个完全保密的过程，因为可能涉及非盲的数据和结果。参与试验实施的所有人员应当对这些分析结果保持盲态，因为他们对试验的态度可能会改变并导致招募患者的特征改变或产生处理间比较的偏倚。除了直接参与执行期中分析的人员之外，这一原则可适用于所有研究人员和申办方所雇佣的人员。研究者应仅被告知继续或终止试验的决定，或实施修订试验程序的决定。

大部分支持研究产品有效性和安全性的临床试验应全部完成计划入组的样本量。只有出于伦理原因，或者出现检验效能不再可接受的情况，试验可提早终止。然而，人们都知道出于各种原因申办方的药物研发计划需要获取处理间比较的数据，如为其它试验制定计划；另外，仅有一部分试验会涉及到严重威胁生命的结局或死亡率的研究，出于伦理原因可能需要对入组病例的处理效应比较进行连续监查。无论是哪一种情况，为了应对可能引入的潜在统计偏倚和操作偏倚，应当在分析数据揭盲之前，在方案或修订方案中制定期中统计分析计划。

对于许多研究产品的临床试验，特别是那些具有重大公共卫生意义的临床试验，应将监查有效性和/或安全性结局比较的任务委托给外部独立团队，并清楚地描述其职责。通常将该团队称为独立数据监查委员会、数据和安全监查委员会或数据监查委员会。

当申办方充当监查有效性或安全性比较的角色并因此可以获取非盲的比较信息时，应特别注意保护试验的完整性，并适当地管理和限制信息共享。申办方应当确保并记录内部监查委员会遵守书面的标准操作规程，以及含有期中分析结果记录的决策会议纪要被维护。

任何没有恰当计划的期中分析（不论有或没有提早终止试验的影响）都可能导致试验结果的缺陷，并可能降低所得结论的可靠性，因此，应该避免这些分析。如果实施非计划的期中分析，临床研究报告应该解释其必要性，交待破盲的程度，评价所引入偏倚的潜在程度和对结果解释的影响。

4.6 独立数据监查委员会（IDMC）的作用（见ICH E6第1.25和5.5.2章节）

独立数据监查委员会可由申办方组建，每隔一段时间评价临床试验进展、安全性数据和关键有效性指标，并向申办方建议继续、修改或终止试验。该委员会应当有书面的操作规程，并保存所有会议记录，包括期中分析结果；当试验完成时，这些应可供审查。该委员会的独立性旨在控制重要的比较信息的分享，防止临床试验的完整性受到因获取试验信息而造成的不利影响。该委员会是独立于机构审查委员会或独立伦理委员会的实体，它的组成应包括通晓统计学等相关学科的临床试验科学家。

当独立数据监查委员会中有申办方代表时，在委员会的操作规程中应明确规定他们的作用（例如，他们是否能就关键问题进行投票）。由于这些申办方人员将会获得非盲信息，因此这些操作规程还应解决如何控制期中试验结果在申办方组织内散布。

5. 数据分析的考虑

5.1 分析的预先确定

当设计一个临床试验时，数据的最终统计分析的主要特征应该在方案的统计章节进行描述。该章节应包括所提出的主要指标确证性分析的所有主要特征以及解决预期分析问题的方法。对于探索性试验，该章节可描述更一般性的原则和方向。

统计分析计划（见词汇表）可作为独立文件撰写，并在最终确定方案之后完成。该文件可以更加技术性地和详细地阐述方案所述的主要特征（见第7.1章节）。该计划可包括对主要和次要指标以及其他数据进行统计分析的详细程序。统计分析计划应经审核或根据数据盲态审核（见第7.1章节定义）结果更新后，在揭盲前最终确定。最终统计分析计划的确定及随后的揭盲应保留正式记录。

如果盲态审核建议修改方案中所述的主要特征，应记录在修订方案中。否则，根据盲态审核建议考虑更新统计分析计划就足够了。只有方案（包括修订方案）中预设的分析才被认为是确证性的。

在临床研究报告的统计章节中，应该清楚地描述所采用的统计方法，包括临床试验过程中何时做出的方法学决策（见ICH E3）。

5.2 分析集

数据纳入主分析的受试者集应在方案的统计章节进行定义。另外，对试验程序（如导入期）启动的所有受试者进行文档记录可能是有用的。该受试者文档的内容取决于特定试验的详细特征，只要可能，至少应收集人口统计学和疾病状态的基线数据。

如果所有随机入组的受试者都满足全部入组标准，完全遵从所有试验程序且无失访，并能提供完整的数据记录，那么要纳入分析的受试者集是显而易见的。试验设计和实施的目标应该尽可能地接近这一理想状态，但实践中却难以达到这一状态。因此，方案的统计章节应该预先阐述可能影响受试者和分析数据的问题。方案还应该说明旨在减少研究实施中任何预期的且可能影响数据分析的不规则问题的程序，这些不规则问题包括各种类型的方案违背、退出和数据缺失。方案应考虑降低这些问题发生频率的方法以及如何解决数据分析中会发生的问题。在盲态审核期间，应确定针对方案违背分析方法可能的修订。最好是根据发生时间、原因及对试验结果的影响来确定任何重大方案违背。方案违背、数据缺失以及其它问题的发生频率和类型应记录在临床研究报告中，并描述它们对试验结果的潜在影响（见ICH E3）。

关于分析集的确定应遵循以下原则：1）使偏倚减到最小；2）避免I类错误膨胀。

5.2.1 全分析集

意向性治疗（见词汇表）原则是指主分析应包括所有随机化受试者。遵循该原则需要完成所有随机化受试者的随访以获得研究结局。实践中这一理想状态很难达到。在本文件中，术语“全分析集”被用来描述尽可能完整的分析集，即尽可能接近包括所有随机化受试者的意向性治疗的理想状态的分析集。在分析中保持初始随机化对于防止偏倚以及为统计检验提供可靠基础是很重要的。全分析集的使用为许多临床试验提供了一种保守策略。许多情况下，它也可以提供处理效应的估计，这些估计更有可能反映了后续临床实践中观察到的效应。

一些有限的情况可能导致将随机化受试者从全分析集中排除，包括未能满足主要入组标准（入选标准违背），未服用过至少一次试验药物以及缺乏随机化后的任何数据。这些排除应是合理的。只有在以下情况下，未能满足入组标准的受试者可从分析中排除而不会引入偏倚：

（1）在随机化之前评判了入组标准；

（2）入选标准违背可以被完全客观地评价；

（3）所有受试者都接受相同的入选标准违背审查；（在开放试验中或者甚至在双盲试验中，如果在审查之前数据被揭盲，相同的审查就很难保证，所以要强调盲态审核的重要性。）

（4）排除所有确定为特定入组标准违背者。

某些情况下，从所有随机化受试者集中排除任何未服用试验药物的受试者可能是合理的。例如，是否开始治疗的决定并不受已知晓所分配治疗的影响，即使排除了这些患者，但意向性治疗原则仍得以遵守。其他情况下，可能需要从所有随机化受试者集中剔除任何随机化后无数据的受试者，除非来自这些特定排除的潜在偏倚或任何其它偏倚得到解决，否则任何分析都不是完整的。

当使用受试者全分析集时，随机化后发生的方案违背可能会对数据和结论产生影响，特别是如果它们的发生与处理分配相关时。大多数情况下把这些受试者的数据纳入分析是合适的，这符合意向性治疗原则。接受一次或多次剂量后退出治疗且以后未提供数据的受试者，或失访的受试者，导致了特殊问题的产生，因为不把这些受试者纳入全分析集中可能会破坏这个原则。这种背景下，受试者无论因任何原因失访，其已经获得的、或根据方案中规定的评价时间点随后收集到的主要指标测量数据，都是有价值的。在主要指标是死亡率或严重疾病发病率的研究中，后续数据的收集尤为重要。如何收集此类数据应在方案中描述。从末次观察值结转方法到复杂数学模型的填补技术可尝试用于替代缺失值。用于确保全分析集中每个受试者主要指标测量值可利用的其它方法，可能会要求做出关于受试者结局或更简单的结局（如成功或失败）的一些假设。任何策略的使用都应在方案的统计章节中进行描述并说明合理性，并且所用的任何数学模型所依据的假设均应解释清楚。证实相应分析结果的稳健性也同样重要，特别是所考虑的策略本身可能会导致处理效应有偏估计的情况。
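Last observation carried forward, the simplest of the imputation techniques mentioned above, can be sketched in a few lines. The function name and the visit values are hypothetical, and the sketch says nothing about whether this method is appropriate for a given trial; as the text notes, any such strategy rests on assumptions and can itself bias the estimated treatment effect.

```python
def locf(visit_values):
    """Carry the last observed value forward over missing visits (None = missing)."""
    filled = []
    last_seen = None
    for value in visit_values:
        if value is not None:
            last_seen = value
        filled.append(last_seen)  # stays None until a first value has been observed
    return filled


# Scheduled visits for one subject who withdrew after the third visit
print(locf([120, None, 115, None, None]))
```

Values missing before the first observation are left as `None`, because there is nothing to carry forward; how such subjects are handled would need its own pre-specified rule.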

由于一些问题的不可预测性，有时把不规则问题应对方法的详细考虑推迟到试验结束对数据进行盲态审核时可能更可取，如果这样做则需要在方案中加以说明。

5.2.2 符合方案集

受试者的“符合方案”集，有时被称为“有效病例”、“有效性”样本或“可评价的受试者”样本，被定义为全分析集的受试者中对方案更具依从性的子集，并且以符合如下标准为特征：

（1）完成了对治疗方案的某个预先设定的最小暴露量；

（2）可以获得主要指标的测量值；

（3）无任何重大方案违背，包括入组标准违背。

在揭盲之前，应该按照适合于特定试验情况的方式完整定义并记录将受试者排除在符合方案集之外的确切原因。
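One way to make the membership rules for the analysis sets explicit and auditable before unblinding is to encode them as simple flags. The sketch below is an assumption-laden illustration: the field names, the 28-day minimum exposure and the exact exclusion rules are hypothetical and would have to follow the definitions in the actual protocol.

```python
from dataclasses import dataclass


@dataclass
class Subject:
    subject_id: str
    randomised: bool
    doses_taken: int
    has_post_randomisation_data: bool
    exposure_days: int
    has_primary_measurement: bool
    major_violation: bool


def in_full_analysis_set(s):
    """All randomised subjects, minus a minimal set of justified exclusions."""
    return s.randomised and s.doses_taken >= 1 and s.has_post_randomisation_data


def in_per_protocol_set(s, min_exposure_days=28):
    """Subset of the full analysis set that complied sufficiently with the protocol."""
    return (
        in_full_analysis_set(s)
        and s.exposure_days >= min_exposure_days
        and s.has_primary_measurement
        and not s.major_violation
    )


subjects = [
    Subject("001", True, 30, True, 84, True, False),
    Subject("002", True, 10, True, 14, True, False),  # too little exposure
    Subject("003", True, 0, False, 0, False, False),  # never dosed, no data
    Subject("004", True, 30, True, 84, True, True),   # major protocol violation
]
for s in subjects:
    print(s.subject_id, in_full_analysis_set(s), in_per_protocol_set(s))
```

Because the rules use only blinded information, they can be fixed and documented during the blind review, and the reason each subject is excluded can be reported.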

使用符合方案集可能有最大的机会使新的治疗在分析中显示出额外的有效性，而且最紧密地反映方案中的科学模型。然而，相应的假设检验和处理效应估计可能保守也可能不保守，这取决于试验本身；对研究方案的依从性可能与处理和结局有关，它可能会导致偏倚甚至是严重的偏倚。

应充分识别和总结导致剔除受试者以生成符合方案集和其它方案违背的问题。相关的方案违背可能包括处理分配的错误、使用禁忌药物、依从性差、失访和数据缺失。从发生频率和发生时间方面评估各处理组间这些问题的模式是一种良好实践。

5.2.3 不同分析集的作用

一般说来，证明主要试验结果对选择不同受试者集具有不敏感性是有利的。在确证性试验中，计划对全分析集及符合方案集都进行分析通常是恰当的，这样可以明确地讨论和解释它们之间的任何差异。某些情况下，需要深入探讨用于分析的受试者集的选择对结论的敏感性。当全分析集和符合方案集得出实质上相同的结论时，会增加试验结果的可信度，但应注意，对于排除了大比例受试者的符合方案分析会给试验的整体正确性带来一些疑虑。

在优效性试验（试图验证研究产品更优）和等效性或非劣效性试验（试图验证研究产品具有可比性，见第 3.3.2 章节）中，全分析集和符合方案集发挥的作用不同。在优效性试验中，全分析集用于主分析（除了例外情况），因为它倾向于避免符合方案集所导致的对有效性的过度乐观估计，因为包含在全分析集中的非依从者一般会降低所估计的处理效应。然而，在等效性或非劣效性试验中，使用全分析集一般不保守，应非常仔细地考虑它的作用。

5.3 缺失值及离群值

缺失数据是临床试验中的一个潜在偏倚来源。因此，应尽一切努力满足方案对数据收集和管理的所有要求。然而，现实中几乎总会有一些缺失数据。虽然如此，只要缺失数据的处理方法合理，尤其是在方案中预先定义了这些方法，则试验可以被认为是可靠的。在盲态审核期间，可以更新统计分析计划，完善这些方法的定义。遗憾的是，没有可推荐的普遍适用的缺失数据处理方法。应该对缺失数据的处理方法做敏感性研究，特别是当缺失数据的比例较大时。

应采用类似的方法探索离群值的影响，它们的统计定义在某种程度上是主观的。只有从医学上和统计上都认为是合理的，把某一特定值明确地确定为异常值才最具说服力，而且医学方面通常会定义适当的操作程序。在方案或统计分析计划中预先设定的有关离群值的程序应当不倾向任何处理组。同样，在盲态审核期间可以有效地更新这方面的分析。如果在试验方案中未预先规定应对离群值的程序，则需要在对实际值做一次分析的同时，至少进行一次排除或减少离群值效应的分析，并讨论它们的结果之间的差异。

5.4 数据转换

最好在试验设计期间基于早期临床试验的类似数据，在分析前做出对关键指标进行转换的决定。应该在方案中对数据转换（如平方根转换、对数转换）进行详细说明，并叙述基本原理，尤其是主要指标。在标准教材中可以找到进行数据转换的一般原则，可确保满足统计方法所依据的假设，而且在许多特定的临床领域已经形成了针对特定指标的惯例。在决定是否以及如何对指标进行转换时，应优先考虑便于临床解释的尺度。

类似的考虑也适用于其他衍生指标，例如，自基线变化值、自基线变化百分比、重复测量的“曲线下面积”或两个不同指标的比值。应仔细考虑后续的临床解释，并在方案中说明衍生的合理性。与此密切相关的要点参见第2.2.2章节。

5.5 估计、置信区间及假设检验

为满足试验的主要目的，应该在方案的统计章节中详细说明待检验的假设和/或待估计的处理效应。用于完成这些任务的统计方法应当针对主要指标（以及优选的次要指标）进行描述，并明确所依据的统计模型。只要有可能，处理效应的估计应伴有置信区间，并确定其计算方法。应当说明使用基线数据以提高精度或以潜在基线差异校正估计值的任何意图，例如，使用协方差分析进行校正。

重要的是，要阐述清楚将使用单侧还是双侧统计检验，如果使用单侧检验一定要事先充分说明其合理性。如果认为假设检验不适用，那么应该给出获得统计结论的替代过程。关于单侧或双侧推断方法的问题是有争议的，在统计文献中可以找到各种各样的观点。在监管背景下，更可取的方法是将单侧检验的I类错误设置为双侧检验中使用的传统I类错误的一半，这样就保持了与双侧置信区间的一致性。双侧置信区间通常适合于估计两种处理间差异的可能大小。

所选择的特定统计模型应当反映人们对待分析指标以及试验的统计设计在医学和统计方面的目前认识状态。应充分说明在分析中待拟合的所有效应（例如在方差分析模型中），并应解释根据初步结果对这些效应进行修改的方式（如果有）。同样的考虑也适用于在协方差分析中所拟合的协变量集合（见第5.7章节）。在选择统计方法时（如参数和非参数方法），应注意主要和次要指标的统计分布；除显著性检验外，分析结果还应包含处理效应量的统计估计值及置信区间。

应当清楚地区分主要指标的主分析与主要或次要指标的支持性分析。在方案的统计章节或统计分析计划中，除主要和次要指标外还应阐明数据的汇总和报告方式的大纲。为了在一系列试验中实现分析一致性的目的，例如对于安全数据，应当包括所采用方法的介绍。

对于已知的药理学参数、单个受试者的方案依从程度或其它生物学基础数据，整合这些信息的建模方法可以洞察实际或潜在有效性的价值，特别是对于处理效应的估计。应始终清晰地确定这些模型所依据的假设，并仔细描述任何结论的局限性。

5.6 显著性及置信水准的调整

当存在多重性时，用于临床试验数据分析常用的频率派方法可能需要对I类错误进行调整。多重性可能来源于多个主要指标（见第2.2.2章节）、处理的多重比较、随时间的多次评价和/或期中分析（见第4.5章节）。在可行的情况下，避免或减少多重性的方法有时更可取，例如，在多个指标中确定一个关键主要指标，在多重比较中选择一个关键的处理比较，对于重复测量使用汇总测量如“曲线下面积”等。在确证性分析中，除采取此类步骤，对多重性的其余任何解决办法也应当在方案中确定。应始终考虑多重性的调整，并应在分析计划中交待任何调整程序的细节，或者解释不必调整的理由。
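Two standard adjustment procedures, Bonferroni and Holm, are sketched below as examples of how the type I error can be controlled across several comparisons. The guideline does not prescribe either method; the function names and P values are hypothetical.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject each hypothesis whose P value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm_reject(p_values, alpha=0.05):
    """Holm's step-down procedure: test the ordered P values against alpha/m, alpha/(m-1), ..."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one hypothesis is retained, all larger P values are retained too
    return reject


p = [0.01, 0.02, 0.03]  # hypothetical P values for three primary comparisons
print(bonferroni_reject(p))
print(holm_reject(p))
```

Both procedures keep the overall type I error at or below `alpha` without assumptions about the correlation between the tests; Holm's procedure never rejects fewer hypotheses than Bonferroni.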

5.7 亚组、交互作用及协变量

除处理之外，主要指标通常系统性地与其它影响因素相关。例如，它可能与年龄和性别等协变量相关，或者比如多中心试验中不同中心接受处理的受试者这样的特定亚组之间可能存在差异。有些情况下，对协变量影响的调整或者对亚组效应的调整是分析计划中不可缺少的部分，因此应在方案中阐明。应通过试验前的缜密考虑，确定这些协变量以及预期对主要指标有重要影响的因素，并考虑在分析中如何处理，以提高精度和补偿处理组之间的任何不平衡。如果使用一个或多个因素进行分层设计，那么在分析中应考虑这些因素。当不确定调整的潜在价值时，通常建议主要关注未调整的分析，把调整分析作为支持性分析。应特别注意中心效应和主要指标基线值的作用。不建议在主分析中校正随机化后测量的协变量，因为它们可能受到处理的影响。

处理效应本身也可能随亚组或协变量而变化，例如，处理效应可能随年龄降低或者可能在特定诊断类别的受试者中更大。某些情况下，预期会产生交互作用或对交互作用有特别兴趣（如老年病学）时，亚组分析或者包含交互项的统计模型因此成为计划的确证性分析的一部分。然而，大多数情况下亚组分析和交互作用分析应当确定为探索性的，即探索所有处理效应的一致性。一般而言，应首先在所讨论的统计模型添加交互项进行分析，辅之以在相关受试者亚组内或者由协变量定义的层内进行额外的探索性分析。对于探索性分析，应谨慎解释其分析结果，仅仅基于探索性亚组分析的治疗有效性（或缺乏有效性）或安全性的任何结论都不太可能被接受。

5.8 数据的完整性与计算机软件的可靠性

分析结果的可信性取决于用于数据管理（数据录入、存储、验证、校正和检索）以及在统计上处理数据的方法和软件（内部和外部编写）的质量和可靠性。因此，数据管理活动应当基于全面和有效的标准操作规程。用于数据管理和统计分析的计算机软件应当是可靠的，并应提供适当的软件测试过程的文件。

6. 安全性与耐受性评价

6.1 评价的范围

在所有临床试验中，安全性和耐受性（见词汇表）的评价是一个重要方面。在早期阶段，这种评价主要是探索性的，并且只对毒性的直接表达敏感，而在后期阶段，可在更大样本量的受试者中更加全面地描述药物的安全性和耐受性特征。后期阶段的对照试验代表了以无偏的方式探索任何新的潜在不良反应的重要方法，即使这些试验在这方面通常缺乏检验效能。

某些试验可针对以安全性和耐受性的优效性或等效性（与其它药物或与研究药物的其它剂量相比）的特定主张为目的进行设计。这些特定主张需得到来自确证性试验的相关证据支持，就像相应的有效性主张需要证据支持一样。

6.2 指标选择与数据收集

在任何临床试验中，选择用于评价药物安全性和耐受性的方法和测量取决于许多因素，包括对与药物密切相关的不良反应的了解，来自非临床和早期临床研究的信息以及特定药物的药效/药代动力学特性的可能结果、给药方式、待研究的受试者类型，以及试验持续时间。有关临床化学和血液学、生命体征、临床不良事件（疾病、体征和症状）的实验室检查通常构成安全性和耐受性数据的主体。发生严重不良事件以及因不良事件导致治疗终止对于注册是特别重要的（见ICH E2A和ICH E3）。

此外，建议在整个临床试验规划中采用一致的方法来收集和评价数据，以便合并来自不同试验的数据。使用通用的不良事件词典尤为重要。该词典具有一种结构，提供了在三个不同层级上汇总不良事件数据的可能性，即系统-器官分类、首选术语和收录术语（见词汇表）。首选术语通常是汇总不良事件的层级，在数据的描述性展示中，可以汇集属于同一系统-器官分类的首选术语（见ICH M1）。

6.3 待评价受试者集及数据展示

对于整体安全性和耐受性评价，待汇总的受试者集通常被定义为那些接受至少一个剂量研究药物的受试者。应尽可能全面地从这些受试者中收集安全性和耐受性指标，包括不良事件类型、严重程度、发病和持续时间（见ICH E2B）。可能需要在特定的亚组人群，如女性、老年人（见ICH E7）、严重疾病或那些有常见伴随治疗的人群，进行额外的安全性及耐受性评价。这些评价可能需要解决更加特殊的问题（见ICH E3）。

在评价过程中需要注意所有安全性和耐受性指标，并且在方案中应阐明方法。所有不良事件都应报告，无论它们是否被认为与治疗有关。在评价中应当考虑研究人群中的所有可用数据。应当谨慎地定义测量值的单位和实验室指标的参考范围，如果在同一试验中出现不同的单位或不同的参考范围（例如涉及一个以上的实验室），则测量值应当被适当标准化，以便统一评价。应预先确定毒性分级量表的使用，并说明合理性。

某种不良事件的发生率通常以经历事件的受试者数量与处于风险中的受试者数量之比来表示。然而，如何评价发生率并不总是显而易见的，例如，根据情况可考虑把暴露的受试者数量或暴露程度（用人年表示）作为分母。无论计算的目的是估计风险还是在处理组之间进行比较，重要的是要在方案中给出定义。如果计划进行长周期治疗，并预期有相当比例的退出治疗或死亡，这一点尤其重要。对于这些情况，应考虑生存分析方法，并计算累积不良事件率，以避免低估的危险。
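The choice of denominator discussed above can be illustrated with three summaries of the same adverse event data: a crude proportion, an exposure-adjusted rate per 100 person-years, and a cumulative incidence based on the Kaplan-Meier estimate of time to first event. The function names and the small data set are hypothetical, and the Kaplan-Meier estimator is used here as one example of the survival analysis methods the text refers to.

```python
def crude_incidence(n_with_event, n_exposed):
    """Proportion of exposed subjects who experienced the event."""
    return n_with_event / n_exposed


def rate_per_100_person_years(n_with_event, person_years):
    """Exposure-adjusted incidence rate."""
    return 100 * n_with_event / person_years


def km_cumulative_incidence(times, observed):
    """1 - Kaplan-Meier estimate at the last time point.

    times: time to first event or to censoring; observed: True if the event occurred.
    """
    at_risk = len(times)
    survival = 1.0
    for t in sorted(set(times)):
        n_events = sum(1 for x, e in zip(times, observed) if x == t and e)
        n_leaving = sum(1 for x in times if x == t)
        if n_events:
            survival *= 1 - n_events / at_risk
        at_risk -= n_leaving  # events and censored subjects both leave the risk set
    return 1 - survival


times = [0.5, 1.0, 1.5, 2.0]           # years on treatment (hypothetical)
observed = [False, False, True, True]  # two early withdrawals, then two events
print(crude_incidence(sum(observed), len(times)))
print(rate_per_100_person_years(sum(observed), sum(times)))
print(km_cumulative_incidence(times, observed))
```

With early withdrawals the crude proportion treats censored subjects as if they had been followed to the end without an event, so it comes out lower than the cumulative incidence; this is the underestimation the paragraph warns about.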

对于存在大量体征和症状的背景噪声的情况（如精神病试验），在估计不同不良事件的风险时，应考虑对此进行解释的方法。一种方法是利用“治疗引发事件”（见词汇表）的概念，即只有当不良事件出现或相对于治疗前基线发生恶化时，才记录它们。

减少背景噪声影响的其他方法也许是合适的，如忽略轻度不良事件，或再次随访时观察到的事件才可计入分子。这些方法应在方案中解释并说明其合理性。

6.4 统计评价

安全性与耐受性的研究是一个多维问题。对于任何药物，虽然通常可以预见和监测到某些特定不良反应，但由于可能的不良反应范围非常大，新的和不可预见的反应总可能出现。此外，在违背方案之后经历的不良事件可能引入偏倚，如使用违禁药物。这个背景使药物安全性和耐受性的统计分析和评价变得困难，并且意味着来自确证性临床试验的结论性信息是一种例外而不是通例。

大多数试验中，应用数据的统计描述方法，辅以有助于解释的置信区间计算，是说明安全性和耐受性的最好方法。利用图示方法表达处理组间和受试者间不良事件的模式也有价值。

计算P值有时是有意义的，无论作为评价有关特定差异的辅助手段，还是作为“标记”符号以引起对大量安全性与耐受性指标所出现差异的进一步关注。这对于实验室数据尤其有用，否则可能难以适当地进行汇总。建议对实验室数据既要进行定量分析，如对处理组均数的评价，又要进行定性分析，如计算高于或低于某些阈值的比例。

如果使用假设检验，对多重性的统计调整以量化I类错误是合适的，但是II类错误通常更值得关注。如果未做多重性调整，应谨慎解释常规的统计显著性。

大多数试验中，与阳性对照药物或安慰剂相比，研究者会试图确定未出现临床上不可接受的安全性及耐受性方面的差异。与有效性的非劣效性或等效性评价一样，这种情况下使用置信区间比假设检验更可取，因为置信区间往往可以清楚地显示由低发生率所引起的精度变差。

6.5 综合性总结

在研究产品的开发过程中，特别是在上市申请时，通常会将不同试验的药物安全性与耐受性的特性进行汇总。然而，这样汇总是否可用取决于每一个具有高数据质量的、充分和控制良好的试验。

药物的总体可用性始终是风险与获益之间的平衡问题，单个试验中也可考虑这一观点，即使风险/获益评估通常在整个临床试验的总结阶段进行。（见第7.2.2章节）

有关安全性与耐受性报告的更多细节，见ICH E3第12章。

7. 报告

7.1 评价与报告

如引言所述，临床研究报告的结构与内容是ICH E3的主题。该ICH指南充分地涵盖了统计工作报告并适当整合临床和其它资料，本章节因此相对简短。

如第5章节所述，在试验的计划阶段，分析的主要特征应在方案中确定。当试验结束而且数据经整理可供初步检查时，如第5章节提到的按计划进行盲态审核是有价值的。在分析前盲态审核应当包括相关决定，例如，从分析集中排除受试者或数据，可能的数据转换的核查，离群值的定义，将近期其它研究中确定的重要协变量加入模型，参数或非参数方法的重新考虑，等等。这些决定应在报告中加以描述，而且应当与统计师获得处理编码之后做出的决定加以区别，因为盲态下的决定通常会减少产生偏倚的可能性。参与非盲期中分析的统计师或其他人员不应参与盲态审核或修订统计分析计划。数据中如果存在明显的处理诱导效应的可能，将会削弱盲态效果，此时，盲态审核需要特别谨慎。

许多更详细的报告内容和表格应在盲态审核时或盲态审核前完成，以便在实际分析时有一个包括各方面的完整计划，如受试者选择、数据选择与修改、数据汇总与列表、估计与假设检验等。一旦完成数据验证，应按照预先拟定的计划进行分析，越依从于这些计划，结果的可信度越高。应特别注意在方案、方案修订以及基于数据盲态审核更新的统计分析计划中所描述的计划分析与实际分析之间的任何差异。应对偏离计划的分析做出详细解释。

进入试验的所有受试者，无论是否纳入分析，都应在报告中说明。排除在分析之外的所有原因都应记录，还应记录受试者被纳入全分析集但未被纳入符合方案集的原因。类似地，对于纳入分析集的所有受试者，所有重要指标的测量值在所有相关时间点都应该进行说明。

应仔细考虑受试者或数据的所有缺失、退出治疗和重要方案违背对主要指标的主分析的影响。应确定失访、退出治疗或严重方案违背的受试者，并对他们进行描述性分析，包括他们缺失的原因及其与处理和结局的关系。

描述性统计是报告不可缺少的部分。合适的表格和/或图示应清楚地说明主要和次要指标、关键预后指标和人口统计学指标的重要特征。应特别仔细地描述与试验目的有关的主分析的结果。当报告显著性检验结果时，应报告精确的P值（如“P=0.034”）而不是参考临界值。

尽管临床试验分析的主要目标是回答其主要目的提出的问题，但在非盲分析过程中，基于观察数据的新问题很可能会出现，随之可能需要额外的或许复杂的统计分析。报告中应严格区分这种额外工作与方案中计划的工作。

对于计划分析中未被预先定义为协变量但仍然具有某些预后重要性的基线测量，机会作用可能会导致它们在处理组间出现无法预料的不均衡。最好的解决办法是，证明针对这些不均衡进行校正的补充分析得出了与计划分析基本相同的结论。否则，应讨论这种不均衡对结论的影响。

一般而言，应少用计划外分析。当认为处理效应可能随某个或某些其他因素而变化时，会用到计划外分析，比如会尝试确定特别获益的受试者亚组。众所周知，计划外亚组分析有过度解释的潜在风险（见第5.7章节），应谨慎避免。虽然当受试者亚组中未显示出获益或具有不良反应时会出现类似的解释问题，但应该恰当地评价这些可能性并予以报告。

最后，应根据临床试验结果的分析、解释及展示做出统计判断。为此，试验统计师应是负责临床研究报告的小组成员之一，还应批准临床报告。

7.2 临床数据库的总结

上市申请需要对所有报告临床试验的安全性和有效性证据进行全面总结和综合（欧盟的专家报告、美国的综合总结报告、日本的概要），在适当的时候还可能伴随结果的统计汇总。

总结中有一些特定的统计关注的领域：描述在临床试验项目过程中受试人群的人口统计学和临床特征；通过考虑相关（通常有对照组）试验的结果并强调它们相互印证或矛盾的程度来解决有效性的关键问题；对于其结果有助于上市申请的所有试验，总结从它们的合并数据库中可获得的安全信息，并确定潜在的安全问题。在设计临床项目中，应认真关注测量的统一定义和收集，这将有助于随后一系列试验的解释，特别是如果不同试验之间的测量可能被合并时。应该选择和使用可记录用药细节、病史和不良事件的通用词典。对主要和次要指标采用通用定义几乎总是有价值的，这对meta分析极为重要。关键有效性指标的测量方式、相对于随机化/入组的评价时机、方案违背和偏离的应对以及可能的预后因素定义都应该保持一致，除非有合适的理由不这么做。

应当详细描述用于不同试验之间数据合并的任何统计程序。应注意与试验选择有关的偏倚的可能性、试验结果的同质性、以及各种变异来源的恰当建模。应探索结论对假设和选择的敏感性。

7.2.1 有效性数据

单个临床试验的样本量应该总是大到足以满足其目的的程度。通过总结一系列解决基本相同的关键有效性问题的临床试验，也可以获得额外的有价值的信息。为了便于比较，应该以相同的形式，通常是关注于估计值和置信限的表格和图形，呈现一系列试验的主要结果。使用meta分析技术来合并这些估计值常常是一个有用的补充，因为它允许对处理效应量生成更精确的总体估计，并提供完整而简明的试验结果总结。在一些特殊情况下，meta分析方法也可能是通过整体假设检验提供充分的有效性整体证据的最适当方式，或者唯一方式。当用于此目的时，meta分析应该有它自己的前瞻性书面方案。
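As an illustration of combining estimates across trials, the sketch below pools treatment effect estimates with inverse-variance weights (a fixed-effect meta-analysis) and computes Cochran's Q as a simple check of homogeneity. The choice of a fixed-effect model, the function name and the three trial results are assumptions for this example; the guideline does not prescribe a particular meta-analysis method.

```python
import math
from statistics import NormalDist


def fixed_effect_pool(estimates, standard_errors, alpha_two_sided=0.05):
    """Inverse-variance pooled estimate, its confidence interval and Cochran's Q."""
    weights = [1 / se ** 2 for se in standard_errors]
    total = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total
    se_pooled = math.sqrt(1 / total)
    z = NormalDist().inv_cdf(1 - alpha_two_sided / 2)
    interval = (pooled - z * se_pooled, pooled + z * se_pooled)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
    return pooled, interval, q


# Hypothetical treatment differences and standard errors from three trials
pooled, interval, q = fixed_effect_pool([2.1, 1.4, 2.8], [0.9, 0.7, 1.2])
print(pooled, interval, q)
```

More precise trials receive more weight, so the pooled interval is narrower than that of any single trial. A large Q relative to its degrees of freedom (number of trials minus one) would suggest heterogeneity that a fixed-effect summary does not account for.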

7.2.2 安全性数据

在总结安全性数据时，重要的是要彻底检查安全性数据库，以寻找潜在毒性的任何迹象，并通过寻找相关的支持性观察模式来跟踪这些迹象。将人暴露于药物的所有安全数据进行合并，能提供重要的信息来源，因为较大的样本量能提供发现更罕见不良事件的最佳机会，并且可能提供估计罕见不良事件近似发生率的最佳机会。然而，由于缺乏对照组，难以评价来自该数据库的发生率数据，来自对照试验的数据在克服这种困难方面特别有价值。应合并具有相同对照组（安慰剂或特定阳性对照）的研究的结果，并分别展示每个提供充足数据的对照组的结果。

所有通过数据探索发现的潜在毒性的迹象都应报告。评价这些潜在不良反应的现实情况应考虑到由于多次比较而产生的多重性问题。还应适当地使用生存分析方法进行评价，以探索不良事件的发生率与暴露时间和/或随访时间的潜在关系。应适当地量化确定的不良反应的风险，以便正确评价风险/获益关系。

词汇表

Glossary	Content
贝叶斯方法	是指为某些参数（如处理效应）提供后验概率分布的数据分析方法。后验概率分布由该参数的观测数据和先验概率分布衍生而来，被用作统计推断的基础。
偏倚（统计的和操作的）	是指与临床试验的设计、实施、分析和结果评价有关的任何因素导致的处理效应估计值偏离其真实值的系统趋势。由实施偏离所引入的偏倚称为“操作”偏倚，而上述其他来源的偏倚称为“统计”偏倚。
盲态审核	是指在试验完成（最后一位受试者的最后一次观察）到揭盲这段时间内对数据的检查和评价，旨在最终确定分析计划。
内容效度	是指一个指标（如量表）测量它所预期测量的内容的程度。
双模拟	是指在临床试验中当两种处理不能做到完全相同时，使处理实施仍能保持盲态的一种技术。先准备处理A（阳性药和不能区分的安慰剂）和处理B（阳性药和不能区分的安慰剂），然后受试者接受两套处理：A（阳性药）和B（安慰剂）或者A（安慰剂）和B（阳性药）。
脱落	是指临床试验的受试者由于任何原因不能继续按研究方案进行到所要求的最后一次随访。
等效性试验	是指主要目的为证实两种或多种处理的应答差别无重要临床意义的试验。通常以真实的处理间差异落在临床上可接受的等效性界值上下限之间来证实等效性。
频率派方法	是指在假设重现相同实验情境时，用某些结局的发生频率做出解释的统计方法，例如显著性检验和置信区间。
全分析集	是指尽可能接近符合意向性治疗原则的理想的受试者集。该数据集是从所有随机化的受试者中以最少的和合理的方法排除受试者后得到的。
可推论性，推论	是指将临床试验的发现从参与试验的受试者可靠地外推到更广泛的患者人群和临床环境的程度。
全局评价指标	是指将客观指标和研究者对受试者的状态或状态变化的总体印象综合起来所设定的一个单一指标，通常是一个有序分类量表。
独立数据监查委员会（数据和安全监查委员会、监查委员会、数据监查委员会）	独立数据监查委员会由申办方设立，职责是定期评价临床试验进度、安全性数据以及关键有效性终点，并向申办方建议是否继续、修改或终止试验。
意向性治疗原则	是指基于受试者的治疗意向（即计划的治疗方案）而不是实际给予的治疗进行评价的原则，该原则可以对治疗策略的效应做出最佳评价。它的结果是，分配到每一个处理组的受试者即应作为该组的成员被随访、评价和分析，无论他们是否依从于所计划的治疗过程。
交互作用（定性和定量）	是指处理间的比较（如研究产品与对照之间的差异）依赖于另一因素（如中心）的情况。定量交互作用是指该因素的不同水平之间在量的比较上有差异，而定性交互作用是指比较结果至少在该因素某一水平上显示方向不同。
评价者间信度	是指不同评价者在不同场合使用评价工具时产生相同结果的可靠程度。
评价者内信度	是指同一评价者在不同场合使用评价工具时产生相同结果的可靠程度。
期中分析	是指正式完成临床试验前，比较处理组间的有效性或安全性所做的任何分析。
Meta 分析	是指来源于针对同一个问题的两个或多个试验的量化证据的规范评价，常见的方法是将各试验的汇总统计量进行统计合并，有时也采用原始数据的统计合并方法。
多中心试验	是指多个研究者在多个场所按同一个方案实施的临床试验。
非劣效性试验	是指主要目的为验证研究产品的应答在临床上不劣于对照（阳性药或安慰剂对照）的试验。
首选术语和收录术语	在分层级医学词典中，例如MedDRA，收录术语是词典术语的最低层级，以研究者的描述进行编码。首选术语是收录术语的分组层级，通常用于报告发生率。例如，研究者写的是“左臂疼痛”，收录术语编码为：“关节疼痛”，在首选术语层级上报告为“关节痛”。
符合方案集（有效病例，有效性样本，可评价的受试者样本）	是指由充分依从于方案的受试者子集所产生的数据集，以确保这些数据按照所依据的科学模型可能展现出处理效应。依从性包括以下一些考虑：暴露于处理、可获得测量值以及无重大方案违背等。
安全性和耐受性	医疗产品的安全性是指受试者的医学风险，通常在临床试验中由实验室检查（包括临床生化和血液学）、生命体征、临床不良事件（疾病、体征和症状），以及其他特殊的安全性检查（如心电图、眼科检查）等来评价。医疗产品的耐受性是指受试者能耐受明显不良反应的程度。
统计分析计划	是指更技术性地和更详细地阐述方案中描述的分析要点的文件，包括对主要和次要指标及其他数据进行统计分析的详细程序。
优效性试验	是指主要目的为显示研究产品的应答优于对照（阳性药或安慰剂对照）的试验。
替代指标	是指在直接测量临床效应不可行或不实际的情况下，用于间接测量临床效应的指标。
处理效应	是指在临床试验中归因于处理的效应。在大多数临床试验中，感兴趣的处理效应通过两个或多个处理间的比较体现。
治疗引发事件	是指出现在治疗期间的、但在治疗前未曾发生或比治疗前明显恶化的事件。
试验统计师	是指同时具备丰富的教育/训练和经验，可以实施本指南中的原则并负责临床试验统计方面的统计师。

ICH E9(R1) Addendum: Statistical Principles for Clinical Trials

中文版

ICH E9(R1) 临床试验中的估计目标与敏感性分析（E9指导原则增补文件）

A.1. PURPOSE AND SCOPE

To properly inform decision making by pharmaceutical companies, regulators, patients, physicians and other stakeholders, clear descriptions of the benefits and risks of a treatment (medicine) for a given medical condition should be made available. Without such clarity, there is a concern that the reported “treatment effect” will be misunderstood. This addendum presents a structured framework to strengthen the dialogue between disciplines involved in the formulation of clinical trial objectives, design, conduct, analysis and interpretation, as well as between sponsor and regulator regarding the treatment effect(s) of interest that a clinical trial should address.

Precision in describing a treatment effect of interest is facilitated by constructing the “estimand” (see Glossary; A.3.) corresponding to a clinical question of interest. Clarity requires a thoughtful envisioning of “intercurrent events” (see Glossary; A.3.1.) such as discontinuation of assigned treatment, use of an additional or alternative treatment and terminal events such as death. The description of an estimand should reflect the clinical question of interest in respect of these intercurrent events, and this addendum introduces strategies to reflect different questions of interest that might be posed. The choice of strategies can influence how more conventional attributes of a trial are reflected when describing the clinical question, for example the treatments, population or the variable (endpoint) of interest.

The statistical analysis of clinical trial data should be aligned to the estimand. This addendum clarifies the role of “sensitivity analysis” (see Glossary) to explore robustness of conclusions from the main statistical analysis.

Throughout the addendum, references to the original ICH E9 are made using x.y. References within this addendum are made using A.x.y.

This addendum clarifies and extends ICH E9 in respect of the following topics. Firstly, ICH E9 introduced the Intention-To-Treat (ITT) principle in connection with the effect of a treatment policy in a randomised controlled trial, whereby subjects are followed, assessed and analysed irrespective of their compliance to the planned course of treatment, indicating that preservation of randomisation provides a secure foundation for statistical tests. Multiple consequences arising from the ITT principle can be distinguished. Firstly, that the trial analysis should include all subjects relevant for the research question. Secondly, that subjects should be included in the analysis as randomised. Taken directly from the definition of the ITT principle (see ICH E9 Glossary), a third consequence is that subjects should be followed-up and assessed regardless of adherence to the planned course of treatment and that those assessments should be used in the analysis. It remains undisputed that randomisation is a cornerstone of controlled clinical trials and that analysis should aim at exploiting the advantages of randomisation to the greatest extent possible. However, the question remains whether estimating an effect in accordance with the ITT principle always represents the treatment effect of greatest relevance to regulatory and clinical decision making. The framework outlined in this addendum gives a basis for describing different treatment effects and some points to consider for the design and analysis of trials to give estimates of these treatment effects that are reliable for decision making.

Secondly, issues considered generally under data handling and “missing data” (see Glossary) are re-visited. Two important distinctions are made. Firstly, the addendum distinguishes discontinuation of randomised treatment from study withdrawal. The former represents an intercurrent event, to be addressed in the precise specification of the trial objective through the estimand. The latter gives rise to missing data to be addressed in the statistical analysis. Consider, for example, a subject switching treatments in an oncology trial, and a subject for whom no outcome event can be observed because the trial is completed. The former represents an intercurrent event and the clinical question of interest in respect of that should be clear. The latter is administrative censoring which needs to be addressed as a missing data problem in the statistical analysis. Having clarity in the estimand gives a basis for planning which data need to be collected and hence which data, when not collected, present a missing data problem to be addressed in the statistical analysis. In turn, methods to address the problem presented by missing data can be selected to align with the estimand. Secondly, the addendum highlights the distinct consequences of different intercurrent events. Events such as discontinuation of treatment, switching between treatments, or use of an additional medication may render the later measurements of the variable irrelevant or difficult to interpret even when they can be collected. Measurements after a subject dies do not exist.

Thirdly, issues related to the concept of analysis sets are considered in the framework. Section 5.2. strongly recommends that analysis of superiority trials be based on the full analysis set, defined to be as close as possible to including all randomised subjects. However, trials often include repeated measurements on the same subject. Elimination of some planned measurements on some subjects, perhaps because the measurement is considered irrelevant or difficult to interpret, can have similar consequences to excluding subjects altogether from the full analysis set, i.e. that the initial randomisation is not fully preserved. A consequence of this is that the theoretical benefits that randomisation confers on testing hypotheses about treatment effects and the practical benefits of balancing confounding factors at baseline can be diminished. In addition, a meaningful value of the outcome variable might not exist, as when the subject dies. Section 5.2. does not directly address these issues. Clarity is introduced by carefully defining the treatment effect of interest in a way that determines both the population of subjects to be included in the estimation of that treatment effect and the observations from each subject to be included in the analysis considering the occurrence of intercurrent events. The meaning and role of an analysis of the per protocol set is also re-visited in this addendum; in particular whether the need to explore the impact of protocol violations and deviations can be addressed in a way that is less biased and more interpretable than naïve analysis of the per protocol set.

Finally, the concept of robustness (see 1.2.) is given expanded discussion under the heading of sensitivity analysis. A distinction is made between the sensitivity of inference to the assumptions of a chosen method of analysis and the sensitivity to the choice of analytic approach more broadly. With precise specification of an agreed estimand and a method of analysis that is both aligned to the estimand and pre-specified to a level of detail that it can be replicated precisely by a third party, regulatory interest can focus on sensitivity to deviations from assumptions and limitations in the data in respect of a particular analysis.

The principles outlined in this addendum are relevant whenever a treatment effect is estimated, or a hypothesis related to a treatment effect is tested, whether related to efficacy or safety. While the main focus is on randomised clinical trials, the principles are also applicable for single arm trials and observational studies. The framework applies to any data type, including longitudinal, time-to-first event, and recurrent event data. Regulatory interest in the application of the principles outlined will be greater for confirmatory clinical trials and, where used to generate confirmatory conclusions, for data integrated across trials.

A.2. A FRAMEWORK TO ALIGN PLANNING, DESIGN, CONDUCT, ANALYSIS AND INTERPRETATION

Trial planning should proceed in sequence (Figure 1). Clear trial objectives should be translated into key clinical questions of interest by defining suitable estimands. An estimand defines the target of estimation for a particular trial objective (i.e. “what is to be estimated”, see A.3.). A suitable method of estimation (i.e. the analytic approach, referred to as the main “estimator”, see Glossary) can then be selected (see A.5.1.). The main estimator will be underpinned by certain assumptions. To explore the robustness of inferences from the main estimator to deviations from its underlying assumptions, a sensitivity analysis should be conducted, in the form of one or more analyses, targeting the same estimand (see A.5.2.).

Figure 1: Aligning target of estimation, method of estimation, and sensitivity analysis, for a given trial objective

This framework enables proper trial planning that clearly distinguishes between the target of estimation (trial objective, estimand), the method of estimation (estimator), the numerical result (“estimate”, see Glossary), and a sensitivity analysis. This will assist sponsors in planning trials, regulators in their reviews, and will enhance the interactions between these parties when discussing the suitability of clinical trial designs, and the interpretation of clinical trial results.

The specification of appropriate estimands (see A.3.) will usually be the main determinant for aspects of trial design, conduct (see A.4.) and analysis (see A.5.).

A.3. ESTIMANDS

Central questions for drug development and licensing are to establish the existence, and to estimate the magnitude, of treatment effects: how the outcome of treatment compares to what would have happened to the same subjects under alternative treatment (i.e. had they not received the treatment, or had they received a different treatment). An estimand is a precise description of the treatment effect reflecting the clinical question posed by a given clinical trial objective. It summarises at a population level what the outcomes would be in the same patients under different treatment conditions being compared. The targets of estimation are to be defined in advance of a clinical trial. Once defined, a trial can be designed to enable reliable estimation of the targeted treatment effect.

The description of an estimand involves precise specifications of certain attributes, which should be developed based not only on clinical considerations but also on how intercurrent events are reflected in the clinical question of interest. Section A.3.1. introduces intercurrent events. Section A.3.2. introduces strategies to describe the question of interest in respect of intercurrent events. Section A.3.3. describes the attributes of an estimand and Section A.3.4. gives considerations for its construction. It is critically important to understand the differences between the strategies and to precisely articulate which are used in constructing the estimand.

A.3.1. Intercurrent Events to be Reflected in the Clinical Question of Interest

Intercurrent events are events occurring after treatment initiation that affect either the interpretation or the existence of the measurements associated with the clinical question of interest. It is necessary to address intercurrent events when describing the clinical question of interest in order to precisely define the treatment effect that is to be estimated.

Intercurrent events need to be considered in the description of a treatment effect because measurements of the variable can be influenced by the intercurrent event and the occurrence of the intercurrent event may depend on treatment. For example, two patients might be exposed initially to the same treatment and provide the same measure of outcome, but if one patient has received additional medication, the information that the two measures give about the treatment differs between the two patients. Furthermore, whether a patient needs to take additional medication, and whether or not a patient can continue taking treatment, may depend on the treatment to which they are exposed. Unlike missing data, intercurrent events are not to be thought of as a drawback to be avoided in clinical trials. Discontinuation of prescribed treatment, use of additional medication, and other such events may occur in clinical practice as they do in clinical trials, and their occurrence needs to be considered explicitly when defining the clinical question of interest.

Examples of intercurrent events that can affect interpretation of the measurements include discontinuation of assigned treatment and use of an additional or alternative therapy. Use of an additional or alternative therapy can take multiple forms, including change to background or concomitant therapy and switching between treatments of interest. Examples of intercurrent events that would affect the existence of the measurements include terminal events such as death and leg amputation (when assessing symptoms of diabetic foot ulcers), when these events are not part of the variable itself. Certain clinical events can also be intercurrent events, when their occurrence, or non-occurrence, defines a principal stratum of interest (see A.3.2.). Examples include tumour shrinkage defining objective response when assessing a treatment effect on duration of response in oncology and occurrence of infection when assessing a treatment effect on severity of infections occurring after vaccination of initially uninfected subjects.

An intercurrent event might be identified solely by the event itself, such as discontinuation of treatment, or might be more granular. For example, the reason for the event might be specified, such as discontinuation of treatment due to toxicity, or due to lack of efficacy; the event might require to be of certain magnitude or degree, such as use of additional medication exceeding a specified duration or dose; or the timing of the event might be specified, perhaps in relation to its proximity to the assessment of the variable. Some events will affect interpretation of the outcome measurements indefinitely, such as discontinuation of treatment, whilst others will affect interpretation only temporarily, such as short-term use of additional treatment. Indeed, additional or alternative treatments can be diverse; either replacing or supplementing a treatment on which the subject is experiencing inadequate benefit, as an alternative where a subject is not tolerating their assigned treatment, or as a short-term acute treatment to manage a temporary flare in disease symptoms. In a clinical trial, additional or alternative treatments are often identified as e.g. background treatment, rescue medication, prohibited medication, distinguishing their different roles and allowing them to be considered separately. The additional granularity, identifying different intercurrent events, is required if different strategies are to be used. If the intercurrent event for which a strategy needs to be selected depends not only on, for example, failure to continue with treatment, but also on the reason, magnitude or timing associated with that failure, this additional information should be defined and recorded accurately in the clinical trial. The description of intercurrent events might in theory reflect very specific details of treatment and follow-up, such as a single missed dose of a chronic treatment or a dose taken at the wrong time of day. 
Where such specific criteria are not expected to affect interpretation of the variable, they would not need to be addressed as intercurrent events.

As indicated above, consideration of intercurrent events is required when constructing the estimand. Because the estimand is to be defined in advance of trial design, neither study withdrawal nor other reasons for missing data (e.g. administrative censoring in trials with survival outcomes) are in themselves intercurrent events. Subjects who withdraw from the trial may have experienced an intercurrent event before withdrawal.
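By way of illustration only, the distinction between an intercurrent event and missing data can be sketched in Python. The function name, the argument names and the visit days below are hypothetical and are not taken from this addendum. The sketch assumes that a planned visit after discontinuation of treatment can still be attended, whereas a planned visit after study withdrawal cannot; whether a value that was not collected presents a missing data problem depends on the estimand.

```python
from typing import Optional


def classify_visit(visit_day: int,
                   disc_day: Optional[int],
                   withdrawal_day: Optional[int]) -> str:
    """Label one planned visit for one subject (illustrative only)."""
    # After study withdrawal no further data can be collected.
    if withdrawal_day is not None and visit_day > withdrawal_day:
        return "not collected"
    # Discontinuation of treatment is an intercurrent event;
    # the variable can still be measured afterwards.
    if disc_day is not None and visit_day > disc_day:
        return "observed after intercurrent event"
    return "observed on treatment"


# Hypothetical subject: stops treatment on day 30, withdraws from the study on day 60.
labels = [classify_visit(d, disc_day=30, withdrawal_day=60) for d in (14, 28, 56, 84)]
```

In this hypothetical case the day 56 visit follows an intercurrent event, while the day 84 visit is simply not collected because the subject has withdrawn.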

A.3.2. Strategies for Addressing Intercurrent Events when Defining the Clinical Question of Interest

Descriptions of various strategies are listed below, each reflecting a different clinical question of interest in respect of a particular intercurrent event. Whether or not the naming convention is used, it is required that the choices of strategy are unambiguously clear once the estimand is constructed. It is not necessary to use the same strategy to address all intercurrent events. Indeed, different strategies will often be used to reflect the clinical question of interest in respect of different intercurrent events. Section A.3.4. gives some considerations on selecting strategies to construct an estimand.

Treatment policy strategy

The occurrence of the intercurrent event is considered irrelevant in defining the treatment effect of interest: the value for the variable of interest is used regardless of whether or not the intercurrent event occurs. For example, when specifying how to address use of additional medication as an intercurrent event, the values of the variable of interest are used whether or not the patient takes additional medication.

If applied in relation to whether or not a patient continues treatment, and whether or not a patient experiences changes in other treatments (e.g. background or concomitant treatments), the intercurrent event is considered to be part of the treatments being compared. In that case, this reflects the comparison described in the ICH E9 Glossary (under ITT Principle) as the effect of a treatment policy.

In general, the treatment policy strategy cannot be implemented for intercurrent events that are terminal events, since values for the variable after the intercurrent event do not exist. For example, an estimand based on this strategy cannot be constructed with respect to a variable that cannot be measured due to death.
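As a minimal sketch, assuming a variable assessed at a single final planned visit, the treatment policy strategy might be expressed in Python as follows. The function and argument names are hypothetical.

```python
from typing import Optional


def treatment_policy_value(final_value: Optional[float],
                           took_additional_medication: bool,
                           died_before_final_visit: bool) -> Optional[float]:
    """Value of the variable under a treatment policy strategy (illustrative only)."""
    if died_before_final_visit:
        # Terminal event: the value does not exist, so the strategy cannot be applied.
        raise ValueError("no value exists after a terminal event")
    # The intercurrent event is deliberately ignored: the observed value is used
    # whether or not additional medication was taken.
    # None means the value was not collected, which is a missing data problem.
    return final_value


with_medication = treatment_policy_value(
    5.0, took_additional_medication=True, died_before_final_visit=False)
without_medication = treatment_policy_value(
    5.0, took_additional_medication=False, died_before_final_visit=False)
```

The argument `took_additional_medication` is accepted but not used, which is the point of the strategy: the same value enters the analysis in both calls.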

Hypothetical strategies

A scenario is envisaged in which the intercurrent event would not occur: the value of the variable to reflect the clinical question of interest is the value which the variable would have taken in the hypothetical scenario defined.

A wide variety of hypothetical scenarios can be envisaged, but some scenarios are likely to be of more clinical or regulatory interest than others. For example, it may be of clinical or regulatory importance to consider the effect of a treatment under different conditions from those of the trial that can be carried out. Specifically, when additional medication must be made available for ethical reasons, a treatment effect of interest might concern the outcomes if the additional medication was not available. A very different hypothetical scenario might postulate that intercurrent events would not occur, or that different intercurrent events would occur. For example, for a subject that will suffer an adverse event and discontinue treatment, it might be considered whether the same subject would not have the adverse event or could continue treatment in spite of the adverse event. The clinical and regulatory interest of such hypotheticals is limited and would usually depend on a clear understanding of why and how the intercurrent event or its consequences would be expected to be different in clinical practice than in the clinical trial.

If a hypothetical strategy is proposed, it should be made clear what hypothetical scenario is envisaged. For example, wording such as “if the patient does not take additional medication” might lead to confusion as to whether the patient hypothetically does not take additional medication because it is not available or because the particular patient is supposed not to require it.
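The following Python sketch assumes repeated measurements and a hypothetical scenario in which additional medication is not available. Values observed once additional medication has started also reflect the impact of that medication, so the sketch marks them as not usable. The values that would have been observed in the hypothetical scenario then have to be estimated by a method aligned to the estimand, which is not shown here. All names are hypothetical.

```python
from typing import List, Optional


def usable_values(values: List[float],
                  first_affected_visit: Optional[int]) -> List[Optional[float]]:
    """Keep values observed before the intercurrent event; mark later ones as None."""
    if first_affected_visit is None:
        # The intercurrent event never occurred: all observed values are usable.
        return list(values)
    return [v if i < first_affected_visit else None for i, v in enumerate(values)]


# Hypothetical subject who starts additional medication before the third visit (index 2).
kept = usable_values([10.0, 8.0, 5.0], first_affected_visit=2)
```

The sketch only identifies which observations are not to be used; it makes no statement about how they should be replaced.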

Composite variable strategies

This relates to the variable of interest (see A.3.3.). An intercurrent event is considered in itself to be informative about the patient’s outcome and is therefore incorporated into the definition of the variable. For example, a patient who discontinues treatment because of toxicity may be considered not to have been successfully treated. If the outcome variable was already success or failure, discontinuation of treatment for toxicity would simply be considered another mode of failure. Composite variable strategies do not need to be limited to dichotomous outcomes, however. For example, in a trial measuring physical functioning, a variable might be constructed using outcomes on a continuous scale, with subjects who die being attributed a value reflecting the lack of ability to function. Composite variable strategies can be viewed as implementing the intention-to-treat principle in some cases where the original measurement of the variable might not exist or might not be meaningful, but where the intercurrent event itself meaningfully describes the patient’s outcome, such as when the patient dies.

Terminal events, such as death, are perhaps the most salient examples of the need for the composite strategy. If a treatment saves lives, its effect on various measures in surviving patients may be of interest, but it would be inappropriate to say that the summary measure of interest was only the average value of some numerical measure in survivors. The outcome of interest is survival along with the numerical measures. For example, progression-free survival in oncology trials measures the treatment effect on a combination of the growth of the tumour and survival.
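A composite variable of the success or failure type can be sketched in Python as below. The sketch assumes a hypothetical dichotomous variable in which success means a reduction of at least 2 points in a symptom score; the threshold, the function name and the data are illustrative assumptions only.

```python
def is_success(change_from_baseline: float,
               discontinued_for_toxicity: bool,
               threshold: float = -2.0) -> bool:
    """Composite variable: discontinuation for toxicity is another mode of failure."""
    if discontinued_for_toxicity:
        return False
    return change_from_baseline <= threshold


# Hypothetical subjects: (change from baseline, discontinued for toxicity)
subjects = [(-3.0, False), (-3.0, True), (-1.0, False), (-4.0, False)]
outcomes = [is_success(change, disc) for change, disc in subjects]
success_proportion = sum(outcomes) / len(outcomes)
```

The second subject has the same change in score as the first but is counted as a failure, because the intercurrent event is part of the definition of the variable.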

While on treatment strategies

For this strategy, response to treatment prior to the occurrence of the intercurrent event is of interest. Terminology for this strategy will depend on the intercurrent event of interest; e.g. “while alive”, when considering death as an intercurrent event.

If a variable is measured repeatedly, its values up to the time of the intercurrent event may be considered relevant for the clinical question, rather than the value at the same fixed timepoint for all subjects. The same applies to the occurrence of a binary outcome of interest up to the time of the intercurrent event. For example, subjects with a terminal illness may discontinue a purely symptomatic treatment because they die, yet the success of the treatment can be measured based on the effect on symptoms before death. Alternatively, subjects might discontinue treatment and, in some circumstances, it will be of interest to assess the risk of an adverse drug reaction while the patient is exposed to treatment.

Like the composite variable strategy, the while on treatment strategy can hence be thought of as impacting the definition of the variable, in this case by restricting the observation time of interest to the time before the intercurrent event. Particular care is required if the occurrence of the intercurrent event differs between the treatments being compared (see A.3.3.).
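As an illustration, assuming that the variable is the occurrence of adverse drug reactions while the subject is exposed to treatment, a while on treatment summary for one treatment arm might be sketched in Python as follows. The data layout and the function name are hypothetical.

```python
from typing import List, Tuple


def while_on_treatment_rate(subjects: List[Tuple[List[int], int]]) -> float:
    """Events per day of exposure, counting only events up to the last day on treatment.

    Each subject is (days on which an event occurred, number of days on treatment).
    """
    events = sum(
        sum(1 for day in event_days if day <= days_on_treatment)
        for event_days, days_on_treatment in subjects
    )
    exposure = sum(days_on_treatment for _, days_on_treatment in subjects)
    return events / exposure


# Hypothetical arm: the first subject has an event on day 50, after stopping on day 30.
arm = [([10, 50], 30), ([], 90), ([20], 60)]
rate = while_on_treatment_rate(arm)
```

The event on day 50 is excluded because it falls after the intercurrent event. Since the total exposure depends on when subjects discontinue, the rate is difficult to compare between arms if discontinuation patterns differ.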

Principal stratum strategies

This relates to the population of interest (see A.3.3.). The target population might be taken to be the “principal stratum” (see Glossary) in which an intercurrent event would occur. Alternatively, the target population might be taken to be the principal stratum in which an intercurrent event would not occur. The clinical question of interest relates to the treatment effect only within the principal stratum. For example, it might be desired to know a treatment effect on severity of infections in the principal stratum of patients becoming infected after vaccination. Alternatively, a toxicity might prevent some patients from continuing the test treatment, but it would be desired to know the treatment effect among patients who are able to tolerate the test treatment.

It is important to distinguish “principal stratification” (see Glossary), which is based on potential intercurrent events (for example, subjects who would discontinue therapy if assigned to the test product), from subsetting based on actual intercurrent events (subjects who discontinue therapy on their assigned treatment). The subset of subjects who experience an intercurrent event on the test treatment will often be a different subset from those who experience the same intercurrent event on control. Treatment effects defined by comparing outcomes in these subsets confound the effects of the different treatments with the differences in outcomes possibly due to the differing characteristics of the subjects.
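The confounding described above can be made concrete with a small Python sketch. It assumes an entirely hypothetical population of four subjects for whom both potential outcomes are known, which is never the case in practice, and in which treatment has no effect on any subject's outcome. The comparison of subsets defined by actual events is computed as its expectation under randomisation.

```python
from statistics import mean

# Each subject: (would tolerate test, would tolerate control,
#                outcome on test, outcome on control). All values are hypothetical.
population = [
    (True,  True, 10.0, 10.0),
    (True,  True, 12.0, 12.0),
    (False, True,  4.0,  4.0),  # frailer subject: cannot tolerate test
    (False, True,  6.0,  6.0),  # frailer subject: cannot tolerate test
]

# Subsetting on actual events: tolerators on test versus tolerators on control.
naive_difference = (mean(s[2] for s in population if s[0])
                    - mean(s[3] for s in population if s[1]))

# Principal stratum: the same subjects (those who would tolerate test) in both arms.
stratum = [s for s in population if s[0]]
stratum_difference = mean(s[2] for s in stratum) - mean(s[3] for s in stratum)
```

Here `naive_difference` is 3.0 although no subject benefits from treatment, because the subset on test excludes the frailer subjects while the subset on control does not. `stratum_difference` is 0.0. In a real trial membership of the principal stratum is not directly observed, so its estimation relies on assumptions.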

A.3.3. Estimand Attributes

The attributes below are used to construct the estimand, defining the treatment effect of interest.

The treatment condition of interest and, as appropriate, the alternative treatment condition to which comparison will be made (referred to as “treatment” through the remainder of this document). These might be individual interventions, combinations of interventions administered concurrently, e.g. as add-on to standard of care, or might consist of an overall regimen involving a complex sequence of interventions. (see Treatment Policy and Hypothetical strategies under A.3.2.).

The population of patients targeted by the clinical question. This will be represented by the entire trial population, a subgroup defined by a particular characteristic measured at baseline, or a principal stratum defined by the occurrence (or non-occurrence, depending on context) of a specific intercurrent event (see Principal Stratum strategies under A.3.2.).

The variable (or endpoint) to be obtained for each patient that is required to address the clinical question. The specification of the variable might include whether the patient experiences an intercurrent event (see Composite Variable and While on Treatment strategies under A.3.2.).

Precise specifications of treatment, population and variable are likely to address many of the intercurrent events considered in sponsor and regulator discussions of the clinical question of interest. The clinical question of interest in respect of any other intercurrent events will usually be reflected using the strategies introduced as treatment policy, hypothetical or while on treatment.

Finally, a population-level summary for the variable should be specified, providing a basis for comparison between treatment conditions.

When defining a treatment effect of interest, it is important to ensure that the definition identifies an effect due to treatment and not due to potential confounders such as differences in duration of observation or patient characteristics.
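The attributes described in this section could be recorded in a structured form, for example to check that a strategy has been stated for each intercurrent event. The Python sketch below is one possible layout; the class, its field names and the example content are hypothetical and do not represent a required or recommended format.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class Estimand:
    """Illustrative container for the attributes of an estimand."""
    treatment: str
    population: str
    variable: str
    population_level_summary: str
    # Strategy chosen for each intercurrent event not already addressed
    # by the treatment, population or variable attributes.
    intercurrent_event_strategies: Dict[str, str] = field(default_factory=dict)


example = Estimand(
    treatment="test treatment or placebo, each added to background therapy",
    population="adults meeting the inclusion and exclusion criteria",
    variable="change from baseline to week 12 in symptom score",
    population_level_summary="difference in means between treatment conditions",
    intercurrent_event_strategies={
        "discontinuation of treatment": "treatment policy",
        "use of rescue medication": "hypothetical (rescue medication not available)",
    },
)
```

Writing the strategies down next to the other attributes makes it apparent when an anticipated intercurrent event has not yet been addressed.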

A.3.4. Considerations for Constructing an Estimand

The clinical questions of interest and associated estimands should be specified at the initial stages of planning any clinical trial. Precise specification of objectives for most trials will need to reflect discontinuation of treatment and use of additional or alternative treatments. In some settings terminal events, such as death, should be addressed. Some trial objectives can only be described with reference to clinical events, for example the duration of response in subjects who achieve a response.

The construction of an estimand should consider what is of clinical relevance for the particular treatment in the particular therapeutic setting. Considerations include the disease under study, the clinical context (e.g. the availability of alternative treatments), the administration of treatment (e.g. one-off dosing, short-term treatment or chronic dosing) and the goal of treatment (e.g. prevention, disease modification, symptom control). Also important is whether an estimate of the treatment effect can be derived that is reliable for decision making. For example, a clinical question on the treatment effect on clinical outcome regardless of which other therapies are to be used before that outcome is experienced differs to a clinical question on the treatment effect had no additional medication been available. Depending on the setting, either might represent a clinical question of interest. However, in both cases, a clinical trial designed to estimate these treatment effects will often include the possibility to use additional medications if medically required. For the former question, values after the use of additional treatment will be relevant. For the latter question, values after the additional treatment are not directly relevant since the values also reflect the impact of that additional medication. It should be agreed that reliable estimation is possible before the choice of estimand is finalised. This includes, for the latter question, the methods to replace observations that are not to be used in the analysis.

When constructing the estimand it is necessary to have a clear understanding of the treatment to which the clinical question of interest pertains (see A.3.3.). Clear specifications for the treatments of interest might already reflect multiple relevant intercurrent events. Specifically, a treatment might already reflect the clinical question of interest in respect of changes in background treatment, concomitant medications, use of additional or later-line therapies, treatment-switching and conditioning regimens. For example, it is possible to specify treatment as intervention A added to background therapy B, dosed as required. In that case, changes to the dose of background therapy B would not need to be considered as an intercurrent event. However, the use of an additional therapy would need to be considered as an intercurrent event. If use of any additional medication is also reflected, using the treatment policy strategy for example, then treatment might be specified as intervention A added to background therapy B, dosed as required, and with additional medication, as required. Alternatively, if the treatment is specified as intervention A, then both changes in background therapy and use of additional therapy would be addressed as intercurrent events.

Discussions should also consider whether specifications for the population and variable attributes should be used to reflect the clinical question of interest in respect of any intercurrent events. Strategies can then be considered for any other intercurrent events. Usually an iterative process will be necessary to reach an estimand that is of clinical relevance for decision making, and for which a reliable estimate can be made. Some estimands, in particular those for which the measurements taken are relevant to the clinical question, can often be robustly estimated making few assumptions. Other estimands may require methods of analysis with more specific assumptions that may be more difficult to justify and that may be more sensitive to plausible changes in those assumptions (see A.5.1.). Where significant issues exist to develop an appropriate trial design or to derive an adequately reliable estimate for a particular estimand, an alternative estimand, trial design and method of analysis would need to be considered.

Avoiding or over-simplifying the process of discussing and constructing an estimand risks misalignment between trial objectives, trial design, data collection and method of analysis. Whilst an inability to derive a reliable estimate might preclude certain choices of strategy, it is important to proceed sequentially from the trial objective and an understanding of the clinical question of interest, and not for the choice of data collection and method of analysis to determine the estimand.

The experimental situation should also be considered. If the management of subjects (e.g. dose adjustment for intolerance, rescue treatment for inadequate response, burden of clinical trial assessments) under a clinical trial protocol is justified to be different to that which is anticipated in clinical practice, this might be reflected in the construction of the estimand.

Once constructed, the estimand should define a target of estimation clearly and unambiguously. Consider an intercurrent event of discontinuation of treatment; it is of utmost importance to distinguish between treatment effects of interest based on the principal stratum of patients who would be able to continue if administered the test treatment and the effect during continued treatment. Furthermore, neither of these should be taken to represent an effect if all patients can continue with treatment.

As stated above, when using the hypothetical strategy, some conditions are likely to be more acceptable for regulatory decision making than others. The hypothetical conditions described should therefore be justified for the quantification of an interpretable treatment effect that is relevant to inform the decisions to be taken by regulators, and use of the medicine in clinical practice. The question of what the values for the variable of interest would have been if rescue medication had not been available may be an important one. In contrast, the question of what the values for the variable of interest would have been under the hypothetical condition that subjects who discontinued treatment because of adverse drug reaction had in fact continued with treatment, might not be justifiable as being of clinical or regulatory interest. A clinical question of interest based on the effect if all subjects had been able to continue with treatment is not well-defined without a thorough discussion of the hypothetical conditions under which it is supposed that they would have continued. The inability to tolerate a treatment may constitute, in itself, evidence of an inability to achieve a favourable outcome.

Characterising beneficial effects using estimands based on the treatment policy strategy might also be more generally acceptable to support regulatory decision making, specifically in settings where estimands based on alternative strategies might be considered of greater clinical interest, but main and sensitivity estimators cannot be identified that are agreed to support a reliable estimate or robust inference. An estimand based on the treatment policy strategy might offer the possibility to obtain a reliable estimate of a treatment effect that is still relevant. In this situation, it is recommended to also include those estimands that are considered to be of greater clinical relevance and to present the resulting estimates along with a discussion of the limitations, in terms of trial design or statistical analysis, for that specific approach. When constructing estimands based on the treatment policy strategy, inference can be complemented by defining an additional estimand and analysis pertaining to each intercurrent event for which the strategy is used; for example, contrasting both the treatment effect on a symptom score and the proportion of subjects using additional medication under each treatment. Similarly, an estimand using a while on treatment strategy should usually be accompanied by the additional information on the time to intercurrent event distributions, and an estimand based on a principal stratum would usefully be accompanied by information on the proportion of patients in that stratum, if available.
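The complementary presentation suggested for the treatment policy strategy can be sketched in Python as follows: for each arm, the summary of the symptom score is reported together with the proportion of subjects using additional medication. The data and names are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Tuple


def summarise_arm(records: List[Tuple[float, bool]]) -> Dict[str, float]:
    """records: (symptom score, used additional medication) for each subject."""
    return {
        "mean_score": mean(score for score, _ in records),
        "prop_additional_medication": mean(1.0 if used else 0.0 for _, used in records),
    }


# Hypothetical data; scores are used whether or not additional medication was taken.
test_arm = [(4.0, False), (5.0, False), (3.0, True), (4.0, False)]
control_arm = [(5.0, True), (6.0, True), (4.0, True), (7.0, False)]

test_summary = summarise_arm(test_arm)
control_summary = summarise_arm(control_arm)
```

In this invented example the mean scores are 4.0 and 5.5, while 25% of subjects on test and 75% on control use additional medication. The second contrast helps to interpret the first.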

The considerations informing the construction of estimand to support regulatory decision making based on a non-inferiority or equivalence objective may differ to those for the choice of estimand for a superiority objective. As explained in ICH E9, the problem facing the regulator in their decision making is different when based on non-inferiority or equivalence studies compared to superiority studies. In Section 3.3.2. it is stated that such trials are not conservative in nature and the importance of minimising the number of protocol violations and deviations, non-adherence and study withdrawals is indicated. In Section 5.2.1. it is described that the result of the Full Analysis Set (FAS) is generally not conservative and that its role in such trials should be considered very seriously. Estimands that are constructed with one or more intercurrent events accounted for using the treatment policy strategy present similar issues for non-inferiority and equivalence trials as those related to analysis of the FAS under the ITT principle. Responses in both treatment groups can appear more similar following discontinuation of randomised treatment or use of another medication for reasons that are unrelated to the similarity of the initially randomised treatments. Estimands could be constructed to directly address those intercurrent events which can lead to the attenuation of differences between treatment arms (e.g. discontinuations from treatment and use of additional medications). When selecting strategies, it might be important to distinguish between trials designed to detect whether differences exist between treatments containing the same or similar active substance (e.g. comparison of a biosimilar to a reference treatment) and trials where a non-inferiority or equivalence hypothesis is used in order to establish and quantify evidence of efficacy. 
An estimand can be constructed to target a treatment effect that prioritises sensitivity to detect differences between treatments, if appropriate for regulatory decision making.
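The attenuation of differences under the treatment policy strategy can be illustrated with simple arithmetic, sketched here in Python. The sketch rests on strong simplifying assumptions that are not taken from this addendum: the same proportion of subjects discontinues in each arm, and after discontinuation the expected response is the same in both arms. All numbers are hypothetical.

```python
def arm_mean(mean_on_treatment: float,
             mean_after_discontinuation: float,
             prop_discontinuing: float) -> float:
    """Expected response in one arm, mixing continuing and discontinuing subjects."""
    return ((1.0 - prop_discontinuing) * mean_on_treatment
            + prop_discontinuing * mean_after_discontinuation)


# Hypothetical means: 10 on test, 8 on control, 7 after discontinuation in either arm.
diff_all_continue = arm_mean(10.0, 7.0, 0.0) - arm_mean(8.0, 7.0, 0.0)
diff_with_discontinuation = arm_mean(10.0, 7.0, 0.4) - arm_mean(8.0, 7.0, 0.4)
```

With no discontinuation the difference is 2.0; with 40% discontinuation in each arm it shrinks to about 1.2. The arms appear more similar for a reason unrelated to the similarity of the randomised treatments, which is the concern for non-inferiority and equivalence trials.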

A.4. IMPACT ON TRIAL DESIGN AND CONDUCT

The design of a trial needs to be aligned to the estimands that reflect the trial objectives. A trial design that is suitable for one estimand might not be suitable for other estimands of potential importance. Clear definitions for the estimands on which quantification of treatment effects will be based should inform the choices that are made in relation to trial design. This includes determining the inclusion and exclusion criteria that identify the target population, the treatments, including the medications that are allowed and those that are prohibited in the protocol, and other aspects of patient management and data collection. If interest lies, for example, in understanding the treatment effect regardless of whether a particular intercurrent event occurs, a trial in which the variable is collected for all subjects is appropriate. Alternatively, if the estimands that are required to support regulatory decision making do not require the collection of the variable after an intercurrent event, then the benefits of collecting such data for other estimands should be weighed against any complications and potential drawbacks of the collection.

Efforts should be made to collect all data that are relevant to support estimation, including data that inform the characterisation, occurrence and timing of intercurrent events. Data cannot always be collected. Certainly, subjects cannot be retained in a trial against their will, and in some trials missing data for some subjects is inevitable by design, such as administrative censoring in trials with survival outcomes. In contrast, the occurrence of intercurrent events such as discontinuation of treatment, treatment switching, or use of additional medication, does not imply that the variable cannot be measured thereafter, though the measures may not be relevant. For terminal events such as death, the variable cannot be measured after the intercurrent event, but neither should these data generally be regarded as missing.

Not collecting any data needed to assess an estimand results in a missing data problem for subsequent statistical inference. The validity of statistical analyses may rest upon untestable assumptions and, depending on the proportion of missing data, this may undermine the robustness of the results (see A.5.). A prospective plan to collect informative reasons for why data intended for collection are missing may help to distinguish the occurrence of intercurrent events from missing data. This in turn may improve the analysis and may also lead to a more appropriate choice of sensitivity analysis. For example, “loss to follow-up” may more accurately be recorded as “treatment discontinuation due to lack of efficacy”. Where that has been defined as an intercurrent event, this can be reflected through the strategy chosen to account for that intercurrent event and not as a missing data problem. To reduce missing data, measures can be implemented to retain subjects in the trial. However, measures to reduce or avoid intercurrent events that would normally occur in clinical practice risk reducing the external validity of the trial. For example, selection of the trial population or use of titration schemes or concomitant medications to mitigate the impact of toxicity might not be suitable if those same measures would not be implemented in clinical practice.

Randomisation and blinding remain cornerstones of controlled clinical trials. Design techniques for avoiding bias are addressed in Section 2.3. Certain estimands may necessitate, or may benefit from, use of trial designs such as run-in or enrichment designs, randomised withdrawal designs, or titration designs. It might be of interest to identify the principal stratum of subjects who can tolerate a treatment using a run-in period, in advance of randomising those subjects between test treatment and control. Dialogue between regulator and sponsor would need to consider whether the proposed run-in period is appropriate to identify the target population, and whether the choices made for the subsequent trial design (e.g. washout period, randomisation) support the estimation of the target treatment effect and associated inference. These considerations might limit the use of these trial designs, and use of that particular strategy.

A precise description of the treatment effects of interest should inform sample size calculations. Particular care should be taken when making reference to historical studies that might, implicitly or explicitly, have reported estimated treatment effects or variability based on a different estimand. Where all subjects contribute information to the analysis, and where the impact of the strategy to reflect intercurrent events is included in the effect size that is targeted and the expected variance, it is not usually necessary to additionally inflate the calculated sample size by the expected proportion of subject withdrawals from the trial.
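
As a rough illustration of why the targeted effect size and expected variance matter for planning, the sketch below applies the standard normal-approximation formula for a two-sided comparison of two means, n per arm = 2 (z_{1-alpha/2} + z_{1-beta})^2 sigma^2 / delta^2. The function name n_per_arm and every planning value are hypothetical and are not taken from this guideline; a real calculation would follow the pre-specified analysis method.

```python
import math
from statistics import NormalDist


def n_per_arm(delta, sigma, alpha=0.05, power=0.9):
    """Subjects per arm for a two-sided z-test comparing two means.

    delta: targeted difference in means under the chosen estimand
    sigma: assumed common standard deviation under the chosen estimand
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_beta = std_normal.inv_cdf(power)
    return math.ceil(2 * (sigma * (z_alpha + z_beta) / delta) ** 2)


# Hypothetical planning values taken from a historical study that
# reported an effect under a different estimand.
n_historical = n_per_arm(delta=5.0, sigma=10.0)

# Hypothetical values for the estimand actually targeted, with the
# impact of the intercurrent event strategy already reflected in a
# smaller effect and a larger variance. No further inflation for
# withdrawals is applied on top of this.
n_targeted = n_per_arm(delta=4.0, sigma=11.0)

print(n_historical, n_targeted)
```

In this sketch the change of estimand alone alters the required sample size considerably, which is the reason for caution when borrowing planning values from historical studies.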

Section 7.2. addresses issues related to summarising data across clinical trials. The need to have consistent definitions for the variables of interest is highlighted and this can be extended to the construction of estimands. Hence, in situations when synthesising evidence from across a clinical trial programme is envisaged at the planning stage, a suitable estimand should be constructed, included in the trial protocols, and reflected in the choices made for the design of the contributing trials. Similar considerations apply to the design of a meta-analysis, using estimated effect sizes from completed trials to determine non-inferiority margins, or the use of external control groups for the interpretation of single-arm trials. A naïve comparison between data sources, or integration of data from multiple trials without consideration and specification of the estimand that is addressed in each data presentation or statistical analysis, could be misleading.

More generally, a trial is likely to have multiple objectives translated into multiple estimands, each associated with statistical testing and estimation. The multiplicity issues arising should be addressed.
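
The addendum does not prescribe how multiplicity should be handled. Purely as an illustration of one widely used option, the sketch below implements the Holm step-down procedure for a set of hypothesis tests, one per estimand. The estimand labels and p-values are invented for the example.

```python
def holm(p_values, alpha=0.05):
    """Holm step-down procedure.

    p_values: dict mapping a label (here, one per estimand) to a p-value.
    Returns a dict mapping each label to True if its null hypothesis
    is rejected while controlling the familywise error rate at alpha.
    """
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ordered)
    rejected = {label: False for label in p_values}
    for i, (label, p) in enumerate(ordered):
        # The i-th smallest p-value (0-based) is compared with alpha / (m - i).
        if p <= alpha / (m - i):
            rejected[label] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return rejected


# Hypothetical p-values for three estimands in one trial.
decisions = holm({"primary": 0.012, "secondary A": 0.040, "secondary B": 0.030})
print(decisions)
```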

A.5. IMPACT ON TRIAL ANALYSIS

A.5.1. Main Estimation

An estimand for the effect of treatment relative to a control will be estimated by comparing the outcomes in a group of subjects on the treatment to those in a similar group of subjects on the control. For a given estimand, an aligned method of analysis, or estimator, should be implemented that is able to provide an estimate on which reliable interpretation can be based. The method of analysis will also support calculation of confidence intervals and tests for statistical significance. An important consideration for whether an interpretable estimate will be available is the extent of assumptions that need to be made in the analysis. Key assumptions should be stated explicitly together with the estimand and accompanying main and sensitivity estimators. Assumptions should be justifiable and implausible assumptions should be avoided. The robustness of the results to potential departures from the underlying assumptions should be assessed through an estimand-aligned sensitivity analysis (see A.5.2.). Estimation that relies on many or strong assumptions requires more extensive sensitivity analysis. Where the impact of deviations from assumptions cannot be comprehensively investigated through sensitivity analysis, that particular combination of estimand and method of analysis might not be acceptable for decision making.

All methods of analysis rely on assumptions, and different methods may rely on different assumptions even when aligned to the same estimand. Nevertheless, some kinds of assumption are inherent in all methods of analysis aligned to estimands that use each of the different strategies outlined; for example, the methodology for predicting the outcomes that would have been observed in the hypothetical scenario, or for identifying a suitable target population in a principal stratum strategy. Some examples are given below related to the different strategies used to reflect the occurrence of intercurrent events. The issues highlighted will be key components of discussion between sponsor and regulator in advance of an estimand, main analysis and sensitivity analysis being agreed.

Analysis aligned with a treatment policy strategy to address a given intercurrent event may entail stronger or weaker assumptions depending on the design and conduct of the trial. When most subjects are followed up even after the respective intercurrent event (e.g. discontinuation of treatment), the remaining problem of missing data may be relatively minor. In contrast, when observation is terminated after an intercurrent event, which is obviously undesirable in respect of this strategy, the assumption that (unobserved) outcomes for discontinuing subjects are similar to the (observed) outcomes for those who remain on treatment will often be implausible. An alternative approach to handle the missing data would need to be justified and sensitivity analysis will be expected.

Analysis aligned to a hypothetical strategy involves outcomes different from those actually observed; for example, outcomes if rescue medication had not been given when in fact it was. Observations before the rescue medication and observations on subjects who did not require rescue medication may be informative, but only under strong assumptions.

A composite variable strategy can avoid statistical assumptions about data after an intercurrent event by considering occurrence of the intercurrent event as a component of the outcome. The potential concern relates less to assumptions for estimation, and more to the interpretation of the estimated treatment effect. For the estimand to be interpretable, if scores are assigned for failure because the intercurrent event occurs, these should meaningfully reflect the lack of benefit to the patient (e.g. death may be reflected differently than discontinuation of treatment due to adverse event).
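
A minimal sketch of a composite variable follows, assuming a binary responder definition in which discontinuation due to toxicity counts as failure. The Subject fields, the threshold of -10 and the data are hypothetical. Whether "failure" is a meaningful score for a given intercurrent event is exactly the interpretive question raised above and is not settled by the code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Subject:
    arm: str
    score_change: Optional[float]     # change from baseline; lower is better
    discontinued_for_toxicity: bool   # the intercurrent event


def is_responder(subject, threshold=-10.0):
    """Composite variable: the intercurrent event is itself a failure."""
    if subject.discontinued_for_toxicity:
        return False  # no assumption needed about data after the event
    if subject.score_change is None:
        # Missing without the intercurrent event is a missing data
        # problem and needs separate, justified handling.
        raise ValueError("outcome missing without an intercurrent event")
    return subject.score_change <= threshold


def response_rate(subjects, arm):
    flags = [is_responder(s) for s in subjects if s.arm == arm]
    return sum(flags) / len(flags)


subjects = [
    Subject("test", -12.0, False),
    Subject("test", -15.0, True),   # improved, but counted as failure
    Subject("test", -4.0, False),
    Subject("control", -11.0, False),
    Subject("control", -2.0, False),
]
print(response_rate(subjects, "test"), response_rate(subjects, "control"))
```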

Estimands constructed based on a while on treatment strategy can be estimated provided outcomes are collected up to the time of the intercurrent event. Again, the crucial assumptions concern interpretation. Take discontinuation of treatment by way of example. Outcomes while on treatment may be improved but the treatment may also shorten, or lengthen, the treatment period by provoking, or delaying, discontinuations, and both these effects should be considered in interpretation and assessment of clinical benefit.
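
The sketch below illustrates, under simplifying assumptions, how a while on treatment summary might be derived for one subject: only observations up to the time of discontinuation are used, and the length of the treatment period is returned alongside so that both can be considered in interpretation. The function name, the choice of a simple mean and the visit data are hypothetical.

```python
def while_on_treatment(visits, discontinuation_day=None):
    """Summarise one subject's outcomes while on treatment.

    visits: list of (study_day, value) pairs
    discontinuation_day: day of the intercurrent event, or None if the
        subject stayed on treatment throughout
    Returns (mean value while on treatment, days on treatment).
    """
    on_treatment = [
        (day, value)
        for day, value in visits
        if discontinuation_day is None or day <= discontinuation_day
    ]
    if not on_treatment:
        raise ValueError("no observations before the intercurrent event")
    mean_value = sum(value for _, value in on_treatment) / len(on_treatment)
    if discontinuation_day is not None:
        days_on_treatment = discontinuation_day
    else:
        days_on_treatment = max(day for day, _ in on_treatment)
    return mean_value, days_on_treatment


visits = [(7, 3.0), (14, 2.0), (28, 1.0), (56, 4.0)]
print(while_on_treatment(visits, discontinuation_day=30))
print(while_on_treatment(visits))
```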

Analysis aligned to a principal stratum strategy usually requires strong assumptions. For example, some principal stratification methods infer stratum membership from baseline characteristics of the subjects, but the correctness of this inference may be difficult to assess. This difficulty cannot be avoided by simplified methods, however. For example, simply comparing subjects who do not have an intercurrent event on the test treatment to those who do not have an event on control, assuming intercurrent events are unrelated to treatment, is very difficult to justify.
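
The problem with the simplified comparison can be shown with a small numerical sketch. It assumes a hypothetical population made up of the four principal strata for a single intercurrent event, with invented stratum shares and mean potential outcomes. In a real trial stratum membership is not observed, which is why such a calculation cannot simply be carried out on trial data.

```python
# Hypothetical principal strata: (share of population,
# mean outcome if assigned test, mean outcome if assigned control).
STRATA = {
    "no event on either arm": (0.5, 10.0, 8.0),
    "event on test only": (0.2, 4.0, 3.0),
    "event on control only": (0.1, 12.0, 9.0),
    "event on both arms": (0.2, 5.0, 4.0),
}


def weighted_mean(weight_value_pairs):
    total_weight = sum(weight for weight, _ in weight_value_pairs)
    return sum(weight * value for weight, value in weight_value_pairs) / total_weight


# Subjects seen without the event on the test arm come from two strata.
test_without_event = [
    (STRATA[name][0], STRATA[name][1])
    for name in ("no event on either arm", "event on control only")
]
# Subjects seen without the event on control come from a different pair.
control_without_event = [
    (STRATA[name][0], STRATA[name][2])
    for name in ("no event on either arm", "event on test only")
]

naive_difference = weighted_mean(test_without_event) - weighted_mean(control_without_event)

# Effect in the stratum of subjects who would not have the event on either arm.
_, mean_test, mean_control = STRATA["no event on either arm"]
stratum_effect = mean_test - mean_control

print(round(naive_difference, 2), stratum_effect)
```

With these invented values the naive comparison differs markedly from the effect in the principal stratum, because the two observed subsets mix different kinds of subjects.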

Even after defining estimands that address intercurrent events in an appropriate manner and making efforts to collect the data required for estimation (see A.4.), some data may still be missing, including e.g. administrative censoring in trials with survival outcomes. Failure to collect relevant data should not be confused with the choice not to collect, or to collect and not to use, data made irrelevant by an intercurrent event. For example, data that were intended to be collected after discontinuation of trial medication to inform an estimand based on the treatment policy strategy are missing if uncollected; however, the same data points might be irrelevant for another strategy, and thus, for the purpose of that second estimand, are not missing if uncollected. Where those efforts to collect data are not successful it becomes necessary to make assumptions to handle the missing data in the statistical analysis. Handling of missing data should be based on clinically plausible assumptions and, where possible, guided by the strategies employed in the description of the estimand. The approach taken may be based on observed covariates and post-baseline data from individual subjects and from other similar subjects. Criteria to identify similar subjects might include whether or not the intercurrent event has occurred. For example, for subjects who discontinue treatment without further data being collected, a model may use data from other subjects who discontinued treatment but for whom data collection has continued.
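
The last example in the paragraph above can be sketched as follows. This is a deliberately simplified, single-imputation illustration that fills a missing outcome with the mean of subjects in the same arm who discontinued treatment but continued to provide data. The record layout and values are hypothetical, and in practice a model-based approach that reflects the uncertainty of imputation (for example multiple imputation) would normally be expected.

```python
from statistics import mean


def impute_from_retrieved_dropouts(records):
    """Fill missing outcomes of discontinuers from similar subjects.

    records: list of dicts with keys "arm", "discontinued" (bool) and
    "outcome" (float, or None if not collected). Returns new records
    with an added "imputed" flag; the input is left unchanged.
    """
    donors = {}
    for record in records:
        if record["discontinued"] and record["outcome"] is not None:
            donors.setdefault(record["arm"], []).append(record["outcome"])

    completed = []
    for record in records:
        filled = dict(record, imputed=False)
        if record["outcome"] is None:
            if not record["discontinued"] or record["arm"] not in donors:
                raise ValueError("no similar subjects available for this record")
            filled["outcome"] = mean(donors[record["arm"]])
            filled["imputed"] = True
        completed.append(filled)
    return completed


records = [
    {"arm": "test", "discontinued": False, "outcome": 12.0},
    {"arm": "test", "discontinued": True, "outcome": 8.0},
    {"arm": "test", "discontinued": True, "outcome": 6.0},
    {"arm": "test", "discontinued": True, "outcome": None},
]
completed = impute_from_retrieved_dropouts(records)
print(completed[3])
```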

A.5.2. Sensitivity Analysis

A.5.2.1. Role of Sensitivity Analysis

Inferences based on a particular estimand should be robust to limitations in the data and deviations from the assumptions used in the statistical model for the main estimator. This robustness is evaluated through a sensitivity analysis. Sensitivity analysis should be planned for the main estimators of all estimands that will be important for regulatory decision making and labelling in the product information. This can be a topic for discussion and agreement between sponsor and regulator.

The statistical assumptions that underpin the main estimator should be documented. One or more analyses, focused on the same estimand, should then be pre-specified to investigate these assumptions with the objective of verifying whether or not the estimate derived from the main estimator is robust to departures from its assumptions. This might be characterised as the extent of departures from assumptions that change the interpretation of the results in terms of their statistical or clinical significance (e.g. tipping point analysis).
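
A tipping point analysis can take many forms. The sketch below shows one simple variant, assuming that higher outcomes are better, that missing outcomes in the test arm have already been imputed, and that a normal-approximation test of the difference in means is adequate. Imputed test-arm values are made progressively worse by a shift delta until statistical significance is lost. All names and data are hypothetical, and the sketch ignores imputation uncertainty.

```python
import math
from statistics import NormalDist, mean, variance


def two_sided_p(test, control):
    """Normal-approximation p-value for a difference in means."""
    difference = mean(test) - mean(control)
    standard_error = math.sqrt(
        variance(test) / len(test) + variance(control) / len(control)
    )
    z = difference / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def tipping_point(test_observed, test_imputed, control, deltas, alpha=0.05):
    """Smallest delta in `deltas` at which significance is lost.

    Each imputed test-arm value is lowered by delta. Returns None if
    the result stays significant over the whole grid.
    """
    for delta in deltas:
        shifted = test_observed + [value - delta for value in test_imputed]
        if two_sided_p(shifted, control) > alpha:
            return delta
    return None


test_observed = [12, 14, 13, 15, 11, 13, 14, 12]
test_imputed = [13, 13, 13, 13]
control = [10, 11, 9, 12, 10, 11, 10, 9, 11, 10, 12, 9]

delta_star = tipping_point(test_observed, test_imputed, control, deltas=range(0, 11))
print(delta_star)
```

Whether a shift of the size returned is clinically plausible is a matter of judgement that the calculation itself cannot resolve.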

Distinct from sensitivity analysis, where investigations are conducted with the intent of exploring robustness to departures from assumptions, other analyses that are conducted in order to more fully investigate and understand the trial data can be termed “supplementary analysis” (see Glossary; A.5.3.). Where the primary estimand(s) of interest is agreed between sponsor and regulator, the main estimator is pre-specified unambiguously, and the sensitivity analysis verifies that the estimate derived is reliable for interpretation, supplementary analyses should generally be given lower priority in assessment.

A.5.2.2. Choice of Sensitivity Analysis

When planning and conducting a sensitivity analysis, altering multiple aspects of the main analysis simultaneously can make it challenging to identify which assumptions, if any, are responsible for any potential differences seen. It is therefore desirable to adopt a structured approach, specifying the changes in assumptions that underlie the alternative analyses, rather than simply comparing the results of different analyses based on different sets of assumptions. The need for analyses varying multiple assumptions simultaneously should then be considered on a case by case basis. A distinction between testable and untestable assumptions may be useful when assessing the interpretation and relevance of different analyses.

The need for sensitivity analysis in respect of missing data is established and retains its importance in this framework. Missing data should be defined and considered in respect of a particular estimand (see A.4.). The distinction between data that are missing in respect of a specific estimand and data that are not directly relevant to a specific estimand gives rise to separate sets of assumptions to be examined in sensitivity analysis.

A.5.3. Supplementary Analysis

Interpretation of trial results should focus on the main estimator for each agreed estimand, provided that the corresponding estimate is verified to be robust through the sensitivity analysis. Supplementary analyses for an estimand can be conducted in addition to the main and sensitivity analysis to provide additional insights into the understanding of the treatment effect. They generally play a lesser role for interpretation of trial results. The need for, and utility of, supplementary analyses should be considered for each trial.

Section 5.2.3. indicates that it is usually appropriate to plan for analyses based on both the FAS and the Per Protocol Set (PPS) so that differences between them can be the subject of explicit discussion and interpretation. Consistent results from analyses based on the FAS and the PPS are indicated as increasing confidence in the trial results. It is also described in Section 5.2.2. that results based on a PPS might be subject to severe bias. In respect of the framework presented in this addendum, it may not be possible to construct a relevant estimand to which analysis of the PPS is aligned. As noted above, analysis of the PPS does not achieve the goal of estimating the effect in any principal stratum, for example, in those subjects able to tolerate and continue to take the test treatment, because it may not compare similar subjects on different treatments.

Protocol violations and deviations might exclude subjects from the PPS, for example by having a visit outside a time window, without an intercurrent event necessarily having occurred. Likewise, subjects could experience an intercurrent event, such as death, without having deviated from the protocol. Notwithstanding the differences between violations and deviations from the protocol and intercurrent events, events likely to affect the interpretation or existence of measurements are considered in the description of the estimand. Estimands might be constructed, with aligned method of analysis, that better address the objective usually associated with the analysis of the PPS. If so, analysis of the PPS might not add additional insights.

A.6. DOCUMENTING ESTIMANDS AND SENSITIVITY ANALYSIS

A trial protocol should define and specify explicitly a primary estimand that corresponds to the primary trial objective. The protocol and the analysis plan should pre-specify the main estimator that is aligned with the primary estimand and leads to the primary analysis, together with a suitable sensitivity analysis to explore the robustness under deviations from its assumptions. Estimands for secondary trial objectives (e.g. related to secondary variables) that are likely to support regulatory decisions should also be defined and specified explicitly, each with a corresponding main estimator and a suitable sensitivity analysis. Additional exploratory trial objectives may be considered for exploratory purposes, leading to additional estimands.

The choice of the primary estimand will usually be the main determinant for aspects of trial design, conduct and analysis. Following usual practices, these aspects should be well documented in the trial protocol. If secondary estimands are of key interest, these considerations may be extended to support these as needed and should be documented as well. Beyond these aspects, the conventional considerations for trial design, conduct and analysis remain the same.

While it is to the benefit of the sponsor to have clarity on what is being estimated, it is not a regulatory requirement to document an estimand for each exploratory objective.

Results from the main, sensitivity and supplementary analyses should be reported systematically in the clinical trial report, specifying whether each analysis was pre-specified, introduced while the trial was still blinded, or performed post hoc. Summaries of the number and timings of each intercurrent event in each treatment group should be reported.
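
The summaries of intercurrent events mentioned above could be tabulated along the following lines. This is a minimal sketch in which the record layout (arm, event type, study day) and the chosen summary statistics are assumptions made for illustration; the clinical trial report would use whatever summaries were planned.

```python
from collections import defaultdict
from statistics import median


def summarise_intercurrent_events(events):
    """Count and timing of each intercurrent event by treatment group.

    events: iterable of (arm, event_type, study_day) tuples.
    Returns a dict keyed by (arm, event_type).
    """
    days_by_group = defaultdict(list)
    for arm, event_type, study_day in events:
        days_by_group[(arm, event_type)].append(study_day)
    return {
        key: {
            "n": len(days),
            "median_day": median(days),
            "min_day": min(days),
            "max_day": max(days),
        }
        for key, days in sorted(days_by_group.items())
    }


# Hypothetical event records.
events = [
    ("test", "discontinuation", 10),
    ("test", "discontinuation", 30),
    ("test", "rescue medication", 14),
    ("control", "rescue medication", 7),
    ("control", "rescue medication", 21),
    ("control", "rescue medication", 35),
]
summary = summarise_intercurrent_events(events)
for key, stats in summary.items():
    print(key, stats)
```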

Changes to the estimand during the trial can be problematic and can reduce the credibility of the trial. Where intercurrent events that were not foreseen at the design stage are identified during the conduct of the trial, the discussion of how they are addressed should cover not only the choices made for the analysis, but also the effect on the estimand, i.e. on the description of the treatment effect that is being estimated, and on the interpretation of the trial results. A change to the estimand should usually be reflected through amendment to the protocol.

GLOSSARY

Term	Content
Estimand:	A precise description of the treatment effect reflecting the clinical question posed by the trial objective. It summarises at a population-level what the outcomes would be in the same patients under different treatment conditions being compared.
Estimate:	A numerical value computed by an estimator.
Estimator:	A method of analysis to compute an estimate of the estimand using clinical trial data.
Intercurrent Events:	Events occurring after treatment initiation that affect either the interpretation or the existence of the measurements associated with the clinical question of interest. It is necessary to address intercurrent events when describing the clinical question of interest in order to precisely define the treatment effect that is to be estimated.
Missing Data:	Data that would be meaningful for the analysis of a given estimand but were not collected. They should be distinguished from data that do not exist or data that are not considered meaningful because of an intercurrent event.
Principal Stratification:	Classification of subjects according to the potential occurrence of an intercurrent event on all treatments. With two treatments, there are four principal strata with respect to a given intercurrent event: subjects who would not experience the event on either treatment, subjects who would experience the event on treatment A but not B, subjects who would experience the event on treatment B but not A, and subjects who would experience the event on both treatments. In this document a principal stratum refers to any of the strata (or combination of strata) defined by principal stratification.
Sensitivity Analysis:	A series of analyses conducted with the intent to explore the robustness of inferences from the main estimator to deviations from its underlying modelling assumptions and limitations in the data.
Supplementary Analysis:	A general description for analyses that are conducted in addition to the main and sensitivity analysis with the intent to provide additional insights into the understanding of the treatment effect.

ICH E9(R1) 临床试验中的估计目标与敏感性分析（E9指导原则增补文件）

ICH E9(R1) Addendum: Statistical Principles for Clinical Trials

A.1. 目的和范围

为了给制药公司、监管机构、患者、医生和其他利益相关方的决策提供正确的信息，应明确描述特定医疗条件下治疗（药物）的获益和风险。如果不能对此进行明确描述，报告的“治疗效应”可能会被误解。本增补提出了一个结构化的框架，以加强参与制定临床试验目的、设计、实施、分析和解释的多学科间的交流，并加强申办方和监管机构之间关于临床试验中治疗效应的沟通。

构建相应临床问题的“估计目标”（见词汇表；A.3.）有助于精确描述治疗效应，这就需要深思熟虑地定义“伴发事件” （见词汇表；A.3.1.），如终止分配的治疗，使用额外或其他治疗，或终末事件（如死亡）等。估计目标的描述应该反映出与这些伴发事件相关的临床问题，并且本增补介绍了反映不同临床问题的策略。在描述临床问题时，策略的选择可能会影响到如何反映试验的更加常规的属性，例如治疗、人群或相关的变量（终点）。

临床试验数据的统计分析应当与估计目标对应。本增补阐明了“敏感性分析”（见词汇表）在探索主要统计分析结论稳健性中的作用。

本增补中，对原始ICH E9的引用采用x.y格式，对本增补的引用采用A.x.y.格式。

本增补就以下若干方面澄清和扩展了ICH E9。第一，ICH E9介绍了随机对照试验中对应于疗法策略的意向治疗（ITT）原则，据此对受试者进行随访、评估和分析，而不考虑其是否依从计划的治疗过程，这表明保持随机化为统计学检验提供了一个坚实的基础。ITT 原则具有以下三个含义。首先，试验分析应包括与研究问题相关的所有受试者。其次，受试者应按随机化时的分配纳入分析。最后，根据ITT原则（见ICH E9词汇表）的定义，无论是否依从预定的治疗过程，都应对受试者进行随访和评估，并在分析中使用这些评估。毫无疑问，随机化是对照临床试验的基石，分析时应最大限度地利用随机化的这一优势。然而，根据ITT 原则估计治疗效应能否总是代表与监管和临床决策最相关的治疗效应，这个问题仍然悬而未决。本增补中概述的框架为描述不同的治疗效应提供了基础，并提出了试验设计和分析需考虑的要点，以便估计治疗效应，为决策提供可靠依据。

第二，本增补重新审视了通常归为数据处理和“缺失数据”（见词汇表）的一些问题，并提出了两个重要的区别。首先，增补对终止随机分配的治疗和退出研究加以区分。前者代表一个伴发事件，需通过在试验目的中对估计目标的精确说明加以解决；后者导致缺失数据，需在统计分析中加以解决。例如，考虑在肿瘤学试验中转组治疗的受试者，以及由于试验完成而无法观测到结局事件的受试者。前者代表伴发事件，关于该事件的临床问题应明确。后者属于管理性删失，需要在统计分析中作为缺失数据问题加以解决。估计目标的清晰性为计划需要收集哪些数据提供了依据，以及哪些数据如果未被收集到即为缺失数据问题，需要在统计分析中加以解决。然后，可以选择解决缺失数据问题的方法，以与估计目标一致。其次，增补强调了不同伴发事件的不同影响。诸如终止治疗、转组治疗或使用额外药物等事件可能导致变量的后续观测值即使可以收集到数据也与估计目标不相关或难以解释。而对于死亡的受试者，死亡后的观测值是不存在的。

第三，在框架中考虑了与分析集概念相关的问题。第 5.2.节强烈建议优效性试验的分析基于全分析集，即尽可能包括所有随机化受试者的分析集。然而，试验往往包括对同一受试者的重复观测。某些受试者按计划收集的观测值可能被认为是无关的或难以解释的，剔除这些观测值，与从全分析集中完全剔除受试者可能具有类似的后果，即没有完全保留最初的随机化。这样做的一个后果是，随机化赋予关于治疗效应的检验假设的理论优势获益以及平衡基线混杂因素的实际获益可能被削弱。另外，有意义的结局变量取值可能不存在，例如当受试者已死亡。第 5.2.节没有直接阐明这些问题。这些问题要在考虑伴发事件的前提下，通过仔细定义关注的治疗效应来进行明确，既要确定要包括在治疗效应估计中的受试者人群，又要确定每个受试者包括在分析中的观测值。本增补也重新审视了使用符合方案集来分析的意义和作用，尤其是，是否需要用比分析符合方案集更能减少偏倚、更有可解读性的方式，来研究方案违背和偏离的影响。

最后，在敏感性分析部分进一步讨论了稳健性的概念（见1.2.）。特别区分了所选分析方法的假设的敏感性，以及分析方法选择上的敏感性。通过精确说明已达成共识的估计目标，以及与估计目标一致的分析方法且其预先设定的细节描述达到能使第三方精确地重现分析结果的程度，这样，监管机构对于一个特定分析可聚焦于假设偏离和数据局限的敏感性。

无论是基于有效性或安全性的治疗效应估计，还是对治疗效应相关假设的检验，本增补中概述的原则均适用。虽然主要关注的是随机临床试验，但这些原则也同样适用于单臂试验和观察性研究。该框架适用于任何数据类型，包括纵向数据、首次事件发作时间数据和复发事件数据。对于确证性临床试验和用于产生确证性结论的跨试验整合数据，监管部门对所述原则的应用将更为关注。

A.2. 将计划、设计、实施、分析和解释协调一致的框架

试验计划应按顺序进行（图 1）。应通过定义合适的估计目标，将明确的试验目的转化为关键的临床所关注的问题。估计目标根据特定的试验目的定义估计的目标（即“要估计什么”，见A.3.），然后可以选择合适的估计方法（即分析方法，称为主“估计方法”，见词汇表）（见 A.5.1.）。主估计方法将以特定假设为基础，为了探索根据主估计方法所作推断对偏离其基本假设的稳健性，应针对同一估计目标采用一种或多种形式进行敏感性分析（见A.5.2.）。

图1：协调估计的目标、估计的方法和敏感性分析，使其与给定试验目的对应

该框架有助于制定适当的试验计划，以明确区分估计的目标（试验目的，估计目标）、估计的方法（估计方法）、数值结果（“估计值”，见词汇表）和敏感性分析。这将有助于申办方的试验计划制定和监管机构的审评工作，并在双方讨论临床试验设计的适宜性和临床试验结果的解释时增强交流。

指定适当的估计目标（见 A.3.）通常是试验设计、实施（见A.4.）和分析（见A.5.）方面的主要决定因素。

A.3. 估计目标

药物开发和批准的核心问题是明确治疗效应是否存在，并估计其大小：如何比较相同受试者接受不同治疗的结局（即，如果受试者未接受治疗或接受不同治疗）。估计目标是对治疗效应的精确描述，反映了既定临床试验目的提出的临床问题。它在群体层面上总结了同一批患者在不同治疗条件下比较的结果。估计的目标将在临床试验之前定义。一旦定义了估计的目标，即可设计试验以可靠地估计治疗效应。

估计目标的描述涉及特定属性的精确说明，这些属性不仅应基于临床考虑而制定，还应基于所关注的临床问题中如何反映伴发事件。第 A.3.1.节介绍了伴发事件。第 A.3.2.节介绍了各种策略，来描述与伴发事件有关的问题。第 A.3.3.节描述了估计目标的属性，第 A.3.4.节则提出了估计目标构建的考虑要点。理解不同策略之间的差异，并精确阐明哪些策略用于构建估计目标，这一点至关重要。

A.3.1. 临床问题中反映的伴发事件

伴发事件是指治疗开始后发生的事件，可影响与临床问题相关的观测结果的解读或存在。在描述临床问题时，有必要阐明伴发事件，以便准确定义需要估计的治疗效应。

在描述治疗效应时需要考虑伴发事件，因为变量的观测结果可能受伴发事件的影响，而伴发事件的发生可能取决于治疗。例如，两名患者可能最初暴露于相同的治疗并提供相同的结局观测值，但如果其中一名患者接受了其他药物治疗，则两名患者之间，观测值所反映的治疗的信息会有所不同。此外，患者接受的治疗会影响到他们是否需要服用其他用药，以及是否可以继续接受治疗。与缺失数据不同，伴发事件不应被认为是临床试验中需要避免的缺陷。在临床试验中发生的终止既定治疗、使用其他药物和其他此类事件在临床实践中也可能发生，因此在定义临床问题时需要明确考虑这些事件发生的可能。

可影响观测结果解释的伴发事件包括终止所分配的治疗，和使用额外或其他治疗。使用额外或其他疗法可以有多种形式，包括改变基础治疗或合并治疗、转组治疗。影响观测结果存在的伴发事件包括终末事件，例如死亡和腿截肢（当评估糖尿病性足溃疡的症状时），而且这些事件不是变量本身的一部分。当某些临床事件的发生或不发生定义了一个主层时（见 A.3.2.），这些事件也可以是伴发事件。例如，肿瘤领域中在评估缓解持续时间疗效时定义客观缓解的肿瘤缩小；对于初始未感染的接种疫苗受试者在评估感染严重程度疗效时的感染发生。

伴发事件可能仅由事件本身确定，如终止治疗，或可能有更详细的定义。详细的定义例如，可明确说明事件发生的原因，如因毒性作用终止治疗，或因缺乏疗效而终止治疗；事件可能需要达到一定量级或程度，如使用超过规定时间或剂量的其他药物；或明确说明事件发生的时机，可能与其对变量评估的接近程度有关。一些事件会无限期地影响结局观测值的解释，例如终止治疗，而另一些事件只会暂时影响，例如短期使用其他治疗。事实上，额外或其他治疗可以是多样的；可以是替代或补充受试者获益不足时的治疗，或作为对既定治疗不耐受的另一种选择，或作为控制疾病暂时急性发作的短期急性治疗。在临床试验中，额外或其他治疗通常是指诸如基础治疗、补救药物和禁用药物，要区分它们的不同作用以对其分别考虑。如果要使用不同的策略，则需要额外的详细信息，确定不同的伴发事件。例如，如果伴发事件不仅取决于未继续治疗，还取决于与未继续治疗相关的原因、程度或时机，则应在临床试验中准确定义和记录该附加信息。理论上，描述伴发事件可能体现治疗和随访非常具体的细节，例如长期治疗的单次漏服或日间服药的错误时间。如果预期这些具体标准不会影响对变量的解释，则不需要将它们作为伴发事件处理。

如上所述，在构建估计目标时需要考虑伴发事件。因为估计目标要在试验设计之前进行定义，所以无论是退出研究还是其他缺失数据的原因（例如生存结局的试验中的管理性删失）本身都不是伴发事件。退出试验的受试者在退出前可能已经发生了伴发事件。

A.3.2. 在定义临床问题时解决伴发事件的策略

下面列出了多种策略，每种策略又体现了对于特定伴发事件的不同临床问题。无论是否使用如下命名规则，构建估计目标时策略的选择都必须清晰明确。无需使用相同的策略来处理所有的伴发事件。事实上，通常会使用不同的策略来明确体现不同伴发事件的临床问题。第 A.3.4.节给出了一些在构建估计目标时策略选择上的考虑。

疗法策略

疗法策略下伴发事件的发生与定义治疗效应无关，即无论是否发生伴发事件，均会使用相关变量的值。例如，将使用其他药物治疗作为伴发事件时，规定无论患者是否服用其他药物，都使用相关变量的值。

对于患者是否继续治疗以及患者的其他治疗（基础或合并治疗）是否有变化等伴发事件，在疗法策略中被视为治疗的一部分。基于这种情况的比较就体现了ICH E9所阐述的ITT原则，比较结果亦是疗法策略下的治疗效应。

一般情况下，对于终末事件类型的伴发事件，不能采用疗法策略，原因是该类伴发事件后变量的值不再存在。例如，在死亡之后变量是无法观测的，因此不能基于此策略构建估计目标。

假想策略

假想策略设想一种没有发生伴发事件的情景：此时，体现临床问题的变量值是在所假设的情景下采用的变量值。

存在各种各样的假设情景，但其中有些情景更具临床或监管意义。例如，在与可实施试验条件不同的条件下的治疗效应可能具有临床或监管重要性。具体而言，当出于伦理原因必须提供额外药物治疗时，可能要考虑未提供额外药物情形下的治疗效应。一个非常不同的假设情景可能是假定伴发事件不会发生，或者会发生不同的伴发事件。例如，对于因发生不良事件而终止治疗的受试者，可考虑同一受试者没有发生不良事件或即使发生不良事件仍然继续治疗的情景。这种假设情景的临床和监管意义有限，并且通常需要清楚地理解伴发事件或其后果在临床实践与临床试验中为什么不同以及如何不同。

如果提出了一个假想策略，应该明确具体的假设情景是什么。举例来说，诸如“如果患者未服用额外药物”之类的措辞可能会导致混淆，因为不清楚患者是因为没有额外药物可用而未服用，还是该患者不需要服用额外药物而未服用。

复合变量策略

复合变量策略与关注的变量有关（见 A.3.3.）。伴发事件本身可提供关于患者结局的信息，因此将其纳入变量的定义之中。例如，由于毒性而终止治疗的患者可能被认为治疗失败。如果变量已被定义为成功或失败，因毒性终止治疗将被认为是另一种形式的失败。复合变量策略不仅限于二分类变量，也可以是连续型变量。例如，在观测生理功能的试验中，死亡的受试者可以用某一数值代表生理功能缺失。当变量原始观测值可能不存在或没有意义，但是伴发事件本身能够体现患者结局（如患者死亡）时，可将复合变量策略视为遵循意向治疗原则的策略。

终末事件，如死亡，可能是需要采用复合策略的最突出例子。如果某种治疗可以挽救生命，可能会关注其对存活患者的各种指标的作用，但是，如果汇总指标仅关注存活患者的一些数值指标的平均值是不够的，要同时关注数值指标和是否生存。例如，肿瘤试验中的无进展生存期衡量了肿瘤生长和生存组合在一起的治疗效应。

在治策略

在治策略关注在伴发事件发生之前的治疗效应。该策略的具体术语将取决于相关伴发事件；例如，当将死亡视为伴发事件时，可以称为“在世策略”。

如果一个变量被重复测量，则伴发事件发生前的所有观测值都可能被认为与临床问题相关，而不是所有受试者在相同固定时间点的值。这也适用于二分类结局在伴发事件之前发生的情况。例如，处于终末期的受试者可能会因为死亡而终止对症治疗，但可以根据死亡前症状的缓解情况评估治疗效果。还有一种情形，受试者可能终止治疗，此时评估其暴露于治疗期间药物不良反应的风险是值得关注的。

因此，在治策略与复合变量策略类似，会影响变量的定义。在这种情况下，在治策略通过将相应的观测时间限制在伴发事件之前来影响。如果各治疗组间的伴发事件的发生率不同，则尤其需要谨慎（见A.3.3.）。

主层策略

主层策略与人群有关（见 A.3.3.）。可认为目标人群是会发生伴发事件的“主层”（见词汇表）。或者，目标人群是不会发生伴发事件的主层。临床问题仅在该主层中与治疗效应相关。例如，在接种疫苗后仍然感染的患者主层中，可能需要了解针对感染严重程度的治疗效应。或者，毒性可能会使一些患者无法继续接受试验药物，但需要了解能够耐受试验药物的患者的治疗效应。

区分“主层”和子集很重要。“主层”（见词汇表）是基于潜在的伴发事件（例如，若分配到试验组将终止治疗的受试者），而“子集”是基于实际发生的伴发事件（终止既定治疗的受试者）。在试验组发生伴发事件的受试者子集通常与对照组发生相同伴发事件的受试者子集不同。比较这些子集的结局而定义的治疗效应，会混杂不同治疗间的真实效应和可能由于受试者不同特征导致的结局差异。

A.3.3. 估计目标的属性

下述属性用于构建估计目标，定义相关的治疗效应。

治疗（处理）：相关的治疗条件，以及适用时进行比较的其他治疗条件（在本文件其余部分中称为“治疗”）。这些可能是单独的干预措施，也可能是同时进行的干预措施的组合（例如作为加载治疗），或者是一个复杂干预序列组成的整体方案。（请见A.3.2.下的疗法策略和假想策略）。

人群：临床问题所针对的患者人群。可以是整个试验人群，也可以是按某种基线特征定义的亚组，或由特定伴发事件的发生（或不发生，视具体情况而定）定义的主层（参见 A.3.2.下的主层策略）。

变量（或终点）：为解决临床问题从每个患者获得的变量（或终点）。变量定义可能包括患者是否发生伴发事件（参见 A.3.2.下的复合变量策略和在治策略）。

其他伴发事件: 在申办方与监管机构关于相关临床问题的交流中，治疗、人群和变量的精确说明有助于解决一些伴发事件。针对任何其他伴发事件的临床相关问题，通常采用疗法策略、假想策略或在治策略来反映。

群体层面汇总：最后，应规定变量的群体层面的汇总统计量，为不同治疗之间的比较提供基础。

在定义治疗效应时，重要的是能够明确效应是由治疗引起的，而不是由潜在的混杂因素如观察期或患者特征的差异等引起的。

A.3.4. 构建估计目标的考量

临床问题及与之相关联的估计目标，应当在计划临床试验的初始阶段予以明确。大多数临床试验目的的精确说明，需要体现终止治疗、使用额外治疗或其他治疗的影响。在某些情况下，还应说明死亡一类的终末事件。有些试验目的只能参照临床事件来描述，例如获得应答的受试者的应答持续时间。

构建一个估计目标，应该考虑在特定医疗环境下特定治疗的临床相关性。需考虑的因素包括：所研究的疾病、临床情况（例如可供选择的其他治疗）、治疗方式（例如一次性给药、短期治疗或长期给药）和治疗目的（例如预防、疾病改善、症状控制）。同样重要的是，能否估计出可靠的治疗效应供决策之用。例如，在临床结局发生前，无论是否使用其他治疗情况下的治疗效应和假设没有额外药物可用情况下的治疗效应，是不同的，它们可能都是值得关注的临床问题。但是，在这两种情况下，为估计这些治疗效应，相应的临床试验设计通常会考虑到在医学上需要使用额外药物的可能性。对于前一问题，使用额外治疗后的观测值是有意义的。对于后一问题，额外治疗后的观测值则无直接相关性，因为这些数值也反映了额外治疗的影响。在估计目标最终确定之前，应确认能够得到可靠的估计，包括后一个问题中，用什么方法替代未被分析使用的观测值。

在构建估计目标时，有必要清楚地了解相关临床问题所涉及的治疗（见 A.3.3.）。对治疗的明确说明可能已经反映了多个相关的伴发事件。具体而言，治疗可能已经反映了临床关注问题所涉及的以下变化：基础治疗、合并用药、使用额外或后线治疗、转组治疗和预处理方案。例如，可以将治疗指定为干预 A加基础治疗B，并按需给药。这种情况下，无需将基础治疗B剂量的变化视为伴发事件。但是，需要将额外治疗视为伴发事件。如果治疗还涉及额外药物的使用，例如在使用疗法策略时，可将治疗定义为干预A加基础治疗B，按需给药，并按需使用额外药物。或者，如果治疗定义为干预 A，那么基础治疗的变化和额外治疗的使用都将视为伴发事件。

还应讨论是否通过明确人群和变量属性来说明相关临床问题的伴发事件。然后可以考虑有关任何其他伴发事件的策略。通常需要反复讨论来确定对于决策具有临床相关性，并且能得出可靠估计值的估计目标。一些估计目标，特别是那些观测值与临床问题相关的估计目标，通常可以通过很少的假设做出稳健的估计。而有些估计目标的分析方法可能需要更具体的假设，这些假设可能更难以论证，并且可能对假设的合理变化更敏感（见 A.5.1.）。对于某一特定的估计目标，如果在试验设计的合理性或者估计值的可靠性方面存在明显不足的话，就需要考虑另一种估计目标、试验设计和分析方法。

省略或过度简化讨论和构建估计目标的过程，会产生导致试验目的、试验设计、数据收集和分析方法之间不一致的风险。当无法得出可靠的估计值时，可能会妨碍某些策略的选择，重要的是要从试验目的和对临床相关问题的理解出发，而不是为了选择数据收集和分析方法，来确定估计目标。

还应考虑试验现状。如果根据临床试验方案对受试者的管理（例如，因不耐受而进行剂量调整，因应答不足而做的补救治疗，临床试验评估所带来的负担）被证实与临床实践中所预期的不同，这可能要在估计目标的构建中有所体现。

估计目标的构建，应该明确和清晰地定义估计的目标。以终止治疗伴发事件为例，最重要的是区分基于假设接受试验药物就能继续治疗的患者主层的治疗效应与实际持续治疗期间的效应。此外，如果所有患者都能继续治疗，则这两种情况都不能用来反映效应。

如上所述，当使用假想策略时，某些情形更可能为监管决策所接受。因此，所描述的假想情形应该能合理地用来量化可解释的治疗效应，为监管机构做决策和临床实践中药物的使用提供相关信息。假如未获得补救药物，变量值会是多少也许是一个重要问题。相反，如果假想情形中因药物不良反应而终止治疗的受试者实际上继续接受治疗，那么相关变量值在这一假想情形下会是多少这个问题可能不具有临床或监管意义。假如所有受试者都能够继续接受治疗，但没有对他们会继续这一假想情形进行充分讨论的话，则基于该效应的临床问题的定义是不充分的。药物不耐受本身可能就构成了无法达到有利结局的证据。

使用基于疗法策略的估计目标来描述获益效应以支持监管决策，可能更被普遍接受，特别是在某些情况下，尽管基于其他策略的估计目标可能被认为更具临床意义，但是无法找到其公认的能支持可靠估计值或稳健推断的主估计方法和敏感性估计方法。基于疗法策略的估计目标仍然可能得到具有临床相关性的可靠估计值。在这种情况下，建议还包括那些被认为具有更大临床相关性的估计目标，并给出所得到的估计值，以及关于该特定方法在试验设计或统计分析方面的局限性的讨论。在构建基于疗法策略的估计目标时，可以对使用该策略的每个伴发事件定义额外的估计目标和分析来补充推断；例如，对比各治疗组中治疗对症状评分的治疗效应和使用额外药物的受试者比例。类似的，使用在治策略的估计目标通常应该有伴发事件发生时间分布的附加信息，基于主层策略的估计目标通常有关于该主层中患者比例的信息（如果有）。

基于非劣效性或等效性目的构建估计目标以支持监管决策的考虑，可能与基于优效性目的的估计目标不同。正如在ICH E9中所解释的，非劣效性或等效性研究与优效性研究相比，监管机构在决策中面临的问题是不同的。在第 3.3.2.节中指出，这类试验本质上不保守，重要的是尽量减少方案违背和偏离、不依从和退出研究的数量。在第5.2.1.节中指出，全分析集（FAS）的结果一般不是保守的，它在这类试验中的作用应该被认真考虑。使用疗法策略来说明由一个或多个伴发事件构建的估计目标，对于非劣效性和等效性试验而言，存在的问题与ITT原则下使用FAS分析存在的相关问题类似。终止随机分配的治疗或因各种原因使用另一种药物，两个治疗组的应答情况可以表现得更相近，而这与最初的随机分配的治疗组间的相似度无关。可以构建估计目标来直接说明那些可能导致治疗组之间差异被弱化的伴发事件（例如终止治疗和使用额外药物）。在选择策略时，区分用于检测出含有相同或相似药物活性成分的治疗之间是否有差异的试验（例如，将生物类似物与参比治疗进行比较）与采用非劣效性或等效性假设来建立和量化有效性证据的试验是很重要的。为适用于监管决策，可以构建一个针对治疗效应的估计目标，优先考虑能更灵敏地检测出治疗间效应的差异。

A.4. 对试验设计和实施的影响

试验设计需要与反映试验目的的估计目标相一致。一种适用于某个估计目标的试验设计，不一定适用于其他具有潜在重要性的估计目标。治疗效应的量化依赖于估计目标，而估计目标的明确定义应当为试验设计的选择提供相关信息。这包括定义目标人群的入选和排除标准，治疗（包括方案中允许的药物和禁用药物），以及患者管理和数据收集等方面。例如，如果关注的是不管是否发生特定伴发事件的治疗效应，则该试验应收集所有受试者的变量。或者，如果用于支持监管决策而制定的估计目标不需要收集伴发事件后的变量值，则应权衡为其他估计目标收集此类数据的益处与数据收集的复杂性和潜在缺陷。

应尽可能收集所有与估计相关的数据，包括伴发事件的特征、发生情况和时间。然而数据不是总能被收集到，不能违背受试者意愿将其留在试验中，而且在某些试验中，受试者数据的缺失按照设计是不可避免的，如生存结局试验中的管理性删失。相反，如发生终止治疗、转组治疗或使用额外药物等伴发事件，尽管这些测量结果可能并不相关，并不意味着在事件之后无法测量变量。对于死亡等终末事件，变量不能在伴发事件后进行观测，但这些数据通常都不应被视为缺失。

不收集评估估计目标所需的数据，将导致后续统计推断中的缺失数据问题。统计分析的合理性可能取决于不可验证的假设和缺失数据的比例，这些可能会削弱结果的稳健性（见 A.5.）。制定一个前瞻性计划收集详细的数据缺失的原因，将有助于区分伴发事件的发生与缺失数据。这样可改进分析，并可使敏感性分析的选择更合理。例如，将“失访”记录为“因缺乏疗效而终止治疗”可能更准确。在将其定义为伴发事件的情况下，可以选择相应策略，而不是将其作为缺失数据问题来处理。为减少缺失数据，可采取措施将受试者保留在试验中。然而，减少或避免临床实践中通常会发生的伴发事件的措施存在降低试验外部有效性的风险。例如，如果在临床实践中不会通过选择试验人群，或使用滴定方案，或使用合并用药来减轻毒性的影响，那么在试验中采取这些措施也就不合适了。

随机化和盲法仍然是对照临床试验的基石。避免偏倚的设计方法请见第 2.3.节。某些估计目标可能需要或受益于试验设计的运用，如导入期或富集设计、随机撤药设计或滴定设计。在对受试者进行随机化分组之前，利用导入期识别对药物耐受的受试者主层，这个办法是可取的。监管机构和申办方之间的交流需要考虑拟定的导入期是否适合于确定目标人群，以及后续试验设计（例如，洗脱期、随机化）的选择是否支持目标治疗效应的估计和相关推断。这些考虑可能会限制这些试验设计的使用，以及特定策略的使用。

对治疗效应的精确描述应当为样本量计算提供信息。在参照历史研究时应特别谨慎，这些历史研究可能隐含或明确地报告了基于不同估计目标的治疗效应或变异的估计值。当所有受试者均能为分析提供信息，且在目标效应量和预期方差中已考虑了相应的策略来反映伴发事件的影响时，则在计算样本量时通常不需要按预期退出试验的受试者比例额外增加样本量。

第 7.2.节说明了关于跨临床试验数据汇总的问题。强调对所关注的变量应有一致的定义，并且这可延伸用于估计目标的构建。因此，为了从多个临床试验获得综合性证据，在计划阶段应构建一个合适的估计目标，使其包含在项目里的所有试验方案中，并反映在相关试验的设计选择中。类似的考虑适用于meta分析的设计，使用已完成的试验中估计的效应量来确定非劣效界值，或者使用外部对照组来诠释单臂试验。在没有考虑和说明每个数据呈现或统计分析所对应的估计目标的情况下，不同来源数据之间的简单比较，或者来自多个试验的数据整合可能会产生误导。

总的来说，一个试验有可能将多个目的转化为多个估计目标，每个估计目标都与统计检验和估计相关联。此时，应当考虑其中的多重性问题。

A.5. 对试验分析的影响

A.5.1. 主要估计

对于治疗组相较于对照组的治疗效应，估计目标通过比较具有相似受试者的试验组和对照组的结果来进行估计。对于给定的估计目标，应采用与其相一致的分析方法（或估计方法），使所得估计值可以支持对结果的可靠解读。该分析方法还应能计算置信区间并进行统计学显著性的检验。可解释的估计值是否存在，一个重要的考虑因素是分析中需要作出的假设的程度。关键假设应与估计目标及其主要估计方法和敏感性估计方法一起明确说明。假设应是合理的，要避免不恰当的假设。对于潜在的偏离假设的情形，结果的稳健性应通过与估计目标相对应的敏感性分析进行评估（见 A.5.2.）。如果一个估计依赖于多种假设或强假设，则需要更广泛的敏感性分析。如果偏离假设所造成的影响不能通过敏感性分析进行全面研究，那么这种特定的估计目标和分析方法的组合可能无法用于决策。

所有的分析方法都依赖于假设，而且即使对应于相同的估计目标，不同的分析方法也可能依赖于不同的假设。尽管如此，对于使用每种不同策略的估计目标，相对应的所有方法中都存在一些固有的假设。例如，预测在假想情形下本可观察到的结局的方法，或者在主层策略下确定合适目标人群的方法。下面列举了一些关于针对不同伴发事件采用不同策略的例子。在申办方和监管机构就估计目标、主要分析和敏感性分析达成一致之前，这里所强调的问题将是双方讨论的关键。

与针对某一伴发事件的疗法策略相对应的分析，根据试验的设计和实施，可能需要或强或弱的假设。如果在相应的伴发事件（如治疗终止）之后仍对大多数受试者进行随访，那么缺失数据问题可能相对较小。相反，如果伴发事件后停止观察（对于该策略显然是不提倡的），那么假设终止治疗受试者的（未观察到的）结局与继续治疗的受试者的（观察到的）结局相似通常是不合理的。此时在处理缺失数据时需要对其他方法进行论证，并进行敏感性分析。

与假想策略相对应的分析方法所涉及的结局与实际观测的结局不同；例如，尽管实际上给予了补救药物，但需要估计未给予补救药物时的结局。在给予补救药物之前的观测值与不需要补救药物的受试者的观测值可能提供有效信息，但需要更强的假设。

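A composite variable strategy can avoid statistical assumptions about data after an intercurrent event by treating the occurrence of the event as a component of the outcome. The potential concern then lies less with the assumptions made for estimation than with the interpretation of the estimated treatment effect. For the estimand to be interpretable, where the occurrence of an intercurrent event is considered a failure and assigned a score, that score should meaningfully reflect the lack of benefit to the patient (e.g. death and treatment discontinuation due to an adverse event may need to be reflected differently).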
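A minimal sketch of a composite variable of this kind, assuming a binary responder endpoint in which any subject with the intercurrent event is counted as a non-responder regardless of the measured value. The `Subject` record, the threshold of -10, and the data are hypothetical and serve only to illustrate the construction.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Subject:
    arm: str                   # "test" or "control"
    change: Optional[float]    # change from baseline; None if not measured
    had_event: bool            # intercurrent event, e.g. discontinuation due to an adverse event


def is_responder(s: Subject, threshold: float = -10.0) -> bool:
    """Composite variable: the intercurrent event itself counts as failure."""
    if s.had_event:
        return False           # measurements after the event are not needed
    # A missing value WITHOUT an event would be a genuine missing data problem;
    # it is counted as non-response here only to keep the sketch short.
    return s.change is not None and s.change <= threshold


def response_rate(subjects, arm):
    """Proportion of responders among subjects randomised to the given arm."""
    group = [s for s in subjects if s.arm == arm]
    return sum(is_responder(s) for s in group) / len(group)


# Illustrative data only.
subjects = [
    Subject("test", -12.0, False),
    Subject("test", -15.0, True),     # met the threshold but had the event: failure
    Subject("test", -11.0, False),
    Subject("test", None, True),
    Subject("control", -4.0, False),
    Subject("control", -13.0, False),
    Subject("control", None, True),
    Subject("control", -2.0, False),
]

print(response_rate(subjects, "test"), response_rate(subjects, "control"))
```

The sketch assigns the same failure score to every event. As the paragraph above notes, whether one score adequately reflects the lack of benefit across different events is a question of interpretation that the code does not settle.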

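An estimand constructed using the while-on-treatment strategy can be estimated provided that outcomes up to the time of the intercurrent event have been collected. Here again, the key considerations concern the interpretation of the result. Taking treatment discontinuation as an example, a treatment might improve outcomes while subjects are on treatment, but might also shorten or lengthen the period of treatment by provoking or delaying treatment discontinuation. These effects should be considered in the interpretation and in the assessment of clinical benefit.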

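An analysis aligned to a principal stratum strategy usually requires strong assumptions. For example, some principal stratum methods infer stratum membership from subjects' baseline characteristics, but the correctness of that inference may be difficult to assess. This difficulty cannot be avoided by simplification. For example, assuming that the intercurrent event is unrelated to treatment and simply comparing the subjects in the test and control groups who did not experience the intercurrent event would be very difficult to justify.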

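Even after estimands that address intercurrent events have been defined appropriately and efforts have been made to collect the data required for estimation (see A.4.), some data may still be missing, for example through administrative censoring in trials with survival outcomes. Failure to collect relevant data should not be confused with a choice not to collect, or to collect but not use, data made irrelevant by an intercurrent event. For example, for an estimand based on the treatment policy strategy, data after discontinuation of the trial medication should still be collected and, if not collected, are missing; for another strategy, however, the same data points might be irrelevant, and so would not be regarded as missing for the corresponding estimand if uncollected. Where data collection is incomplete, assumptions about the handling of missing data become necessary in the statistical analysis. The handling of missing data should be based on clinically plausible assumptions and, where possible, guided by the strategies used in the description of the estimand. The approach taken may be based on the observed covariates and post-baseline data of the individual subject and of other similar subjects. Criteria for identifying similar subjects might include whether or not the intercurrent event has occurred. For example, for subjects who discontinue treatment without further data being collected, a model may use data from other subjects who discontinued treatment but for whom data collection continued.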

A.5.2. Sensitivity Analysis

A.5.2.1. Role of Sensitivity Analysis

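Inferences based on a particular estimand should be robust to limitations in the data and to deviations from the assumptions used in the statistical model for the main estimator. This robustness should be evaluated through sensitivity analysis. Sensitivity analysis should be planned for the main estimators of all estimands that are used for regulatory decision making and labelling. This needs to be discussed and agreed between sponsor and regulator.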

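The statistical assumptions that underpin the main estimator should be documented explicitly. One or more analyses focused on the same estimand should be pre-specified to investigate these assumptions, with the objective of verifying whether the estimate derived from the main estimator is robust to departures from them. This might be characterised as the extent of departure from the assumptions that would change the statistical or clinical significance of the results (e.g. tipping point analysis).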
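The idea of a tipping point analysis can be sketched as follows, under loudly stated simplifications: missing test-arm outcomes are singly imputed as the observed test-arm mean minus a shift `delta`, the comparison is a normal-approximation test of two means, and `delta` is increased on a grid until the result is no longer significant. The function names, the grid, and the data are hypothetical. In practice such analyses commonly use multiple imputation rather than single imputation, and the guideline does not prescribe any particular implementation.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def two_sided_p(test, control):
    """Two-sided p-value for a difference in means (normal approximation)."""
    se = sqrt(variance(test) / len(test) + variance(control) / len(control))
    z = (mean(test) - mean(control)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def p_under_shift(test_observed, n_missing, control, delta):
    """Impute each missing test-arm value as the observed test-arm mean minus delta."""
    imputed = [mean(test_observed) - delta] * n_missing
    return two_sided_p(test_observed + imputed, control)


def tipping_point(test_observed, n_missing, control, step=0.5, max_delta=10.0, alpha=0.05):
    """Smallest delta on the grid at which the result is no longer significant."""
    for i in range(int(max_delta / step) + 1):
        delta = i * step
        if p_under_shift(test_observed, n_missing, control, delta) >= alpha:
            return delta
    return None  # the conclusion did not tip within the grid


# Illustrative data only (higher values are better).
test_observed = [12, 14, 15, 13, 16, 14, 15, 13]   # 4 further test-arm subjects have no outcome
control = [10, 11, 12, 10, 11, 12, 11, 10, 12, 11, 10, 12]

print(tipping_point(test_observed, n_missing=4, control=control))
```

The returned shift is then judged clinically: if subjects with missing data would need implausibly poor outcomes before the conclusion changes, the main estimate can be regarded as robust to that assumption.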

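Sensitivity analysis is intended to explore the robustness of results to departures from assumptions. Distinct from this, other analyses conducted to investigate and understand the trial data more fully can be termed "supplementary analysis" (see Glossary; A.5.3.). Where the primary estimand of interest is agreed between sponsor and regulator, the main estimator is pre-specified unambiguously, and the sensitivity analysis confirms that the estimate is reliable for interpretation, supplementary analyses should generally be given lower priority in the assessment of results.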

A.5.2.2. Choice of Sensitivity Analysis

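When planning and conducting a sensitivity analysis, altering multiple aspects of the main analysis simultaneously can make it difficult to identify which assumptions are responsible for any differences seen. A structured approach is therefore usually adopted, specifying the changes in assumptions that underlie the alternative analyses, rather than simply comparing the results of different analyses based on different sets of assumptions. The need for analyses that vary multiple assumptions simultaneously should be considered case by case. Distinguishing between testable and untestable assumptions may be useful when assessing the interpretation and relevance of different analyses.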

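The need for sensitivity analysis in respect of missing data is established and retains its importance in the framework set out here. Missing data should be defined and considered in respect of a particular estimand (see A.4.). The distinction between data that are missing in respect of a specific estimand and data that are not directly relevant to a specific estimand gives rise to separate sets of assumptions in the analysis, which need to be examined through sensitivity analysis.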

A.5.3. Supplementary Analysis

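Interpretation of trial results should focus on the main estimator for each estimand, with the robustness of the corresponding estimate verified through sensitivity analysis. In addition to the main and sensitivity analyses, supplementary analyses for an estimand can be conducted to provide a fuller understanding of the treatment effect. Supplementary analyses generally play a lesser role in the interpretation of trial results. The need for, and role of, supplementary analyses should be considered for each trial.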

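Section 5.2.3 indicates that it is usually appropriate to plan for analyses based on both the Full Analysis Set (FAS) and the Per Protocol Set (PPS), so that differences between them can be the subject of explicit discussion and interpretation. Consistent results from analyses based on the FAS and the PPS increase confidence in the trial results. Section 5.2.2 also indicates that results based on a PPS might be subject to severe bias. Within the framework presented in this addendum, it may not be possible to construct an estimand to which analysis of the PPS is aligned. As noted above, analysis of the PPS does not achieve the goal of estimating the effect in any principal stratum (for example, in subjects able to tolerate and continue to take the test treatment), because the subjects compared in the PPS may not be comparable between treatment groups.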

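Protocol violations and deviations (e.g. a visit outside the permitted time window) can exclude a subject from the PPS even if no intercurrent event has occurred. Similarly, a subject might experience an intercurrent event (e.g. death) without deviating from the protocol. Although protocol violations and deviations differ from intercurrent events, events that affect the interpretation or existence of the measurements should still be considered in the description of the estimand. Estimands with aligned methods of analysis might be constructed that better address the objective usually associated with analysis of the PPS. If so, analysis of the PPS might not provide additional information.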

A.6. Documenting Estimands and Sensitivity Analysis

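The trial protocol should define and specify explicitly a primary estimand that corresponds to the primary trial objective. The protocol and the analysis plan should pre-specify the main estimator that is aligned with the primary estimand and leads to the primary analysis, together with a suitable sensitivity analysis to explore the robustness of the results when assumptions are not met. Estimands for secondary trial objectives (e.g. related to secondary variables) that are likely to support regulatory decisions should also be defined and specified explicitly, each with a corresponding main estimator and a suitable sensitivity analysis. Additional trial objectives of an exploratory nature may also be considered, leading to additional estimands.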

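The choice of the primary estimand is usually the main determinant of trial design, conduct and analysis. Following usual practice, these aspects should be documented in detail in the trial protocol. If secondary estimands are also of key interest, these considerations may be extended to the corresponding estimands and likewise documented in the protocol. Beyond this, the conventional considerations for trial design, conduct and analysis remain unchanged.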

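While clarity on what is being estimated benefits the sponsor, it is not a regulatory requirement to document an estimand for each exploratory objective.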

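Results from the main, sensitivity and supplementary analyses should be reported systematically in the clinical trial report, specifying whether each analysis was pre-specified, introduced while the trial was still blinded, or performed post hoc. The number and timing of each type of intercurrent event in each treatment group should be summarised and reported.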

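Changing the estimand during the trial can be problematic and can reduce the credibility of the trial. For intercurrent events that were not foreseen at the design stage but were identified during the conduct of the trial, the discussion should address not only the choice of analysis method but also the effect on the estimand, that is, on the description of the treatment effect being estimated, and on the interpretation of the trial results. A change to the estimand should usually be reflected through an amendment to the protocol.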

Glossary

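Term	Definition
Estimand:	A precise description of the treatment effect reflecting the clinical question posed by the trial objective. It summarises at a population level what the outcomes would be in the same patients under the different treatment conditions being compared.
Estimate:	A numerical value computed by an estimator.
Estimator:	A method of analysis used to compute an estimate of the estimand from clinical trial data.
Intercurrent Events:	Events occurring after treatment initiation that affect either the interpretation or the existence of the measurements associated with the clinical question of interest. Intercurrent events need to be addressed when describing the clinical question of interest in order to define precisely the treatment effect that is to be estimated.
Missing Data:	Data that would be meaningful for the analysis of a given estimand but were not collected. They should be distinguished from data that do not exist and from data that are not considered meaningful because of an intercurrent event.
Principal Stratification:	Classification of subjects according to the potential occurrence of an intercurrent event on all treatments. With two treatments, there are four principal strata with respect to a given intercurrent event: subjects who would not experience the event on either treatment, subjects who would experience the event on treatment A but not on treatment B, subjects who would experience the event on treatment B but not on treatment A, and subjects who would experience the event on both treatments. In this document, a principal stratum refers to any of the strata (or combination of strata) defined by principal stratification.
Sensitivity Analysis:	A series of analyses conducted to explore the robustness of inferences from the main estimator to deviations from its underlying modelling assumptions and to limitations in the data.
Supplementary Analysis:	A general description of analyses conducted in addition to the main and sensitivity analyses with the intent of providing additional insight into the treatment effect.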
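
The four principal strata described in the glossary entry on principal stratification can be enumerated mechanically. The sketch below is purely illustrative; the treatment labels and the `describe` helper are hypothetical. Note that in a parallel-group trial each subject receives only one treatment, so an individual subject's stratum is generally not directly observable.

```python
from itertools import product

TREATMENTS = ("A", "B")

# Each stratum records whether the intercurrent event WOULD occur under each treatment.
strata = [
    dict(zip(TREATMENTS, occurs))
    for occurs in product((False, True), repeat=len(TREATMENTS))
]


def describe(stratum):
    """Plain-language label for one principal stratum (two-treatment case)."""
    with_event = [t for t, occurs in stratum.items() if occurs]
    if not with_event:
        return "event on neither treatment"
    if len(with_event) == len(stratum):
        return "event on both treatments"
    return f"event on treatment {with_event[0]} only"


for s in strata:
    print(describe(s))
```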