ICH E9 Statistical Principles for Clinical Trials

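1. Introduction

1.1 Background and Purpose

The efficacy and safety of medicinal products should be demonstrated by clinical trials which follow the guidance in 'Good Clinical Practice: Consolidated Guideline' (ICH E6) adopted by the ICH, 1 May 1996. The role of statistics in clinical trial design and analysis is acknowledged as essential in that ICH guideline. The proliferation of statistical research in the area of clinical trials, coupled with the critical role of clinical research in the drug approval process and health care in general, necessitates a succinct document on statistical issues related to clinical trials. This guidance is written primarily to attempt to harmonise the principles of statistical methodology applied to clinical trials for marketing applications submitted in Europe, Japan and the United States.

As a starting point, this guideline utilised the CPMP (Committee for Proprietary Medicinal Products) Note for Guidance entitled 'Biostatistical Methodology in Clinical Trials in Applications for Marketing Authorisations for Medicinal Products' (December, 1994). It was also influenced by 'Guidelines on the Statistical Analysis of Clinical Studies' (March, 1992) from the Japanese Ministry of Health and Welfare and the U.S. Food and Drug Administration document entitled 'Guideline for the Format and Content of the Clinical and Statistical Sections of a New Drug Application' (July, 1988). Some topics related to statistical principles and methodology are also embedded within other ICH guidelines, particularly those listed below. The specific guidance that contains related text will be identified in various sections of this document.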

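E1A: The Extent of Population Exposure to Assess Clinical Safety
E2A: Clinical Safety Data Management: Definitions and Standards for Expedited Reporting
E2B: Clinical Safety Data Management: Data Elements for Transmission of Individual Case Safety Reports
E2C: Clinical Safety Data Management: Periodic Safety Update Reports for Marketed Drugs
E3: Structure and Content of Clinical Study Reports
E4: Dose-Response Information to Support Drug Registration
E5: Ethnic Factors in the Acceptability of Foreign Clinical Data
E6: Good Clinical Practice: Consolidated Guideline
E7: Studies in Support of Special Populations: Geriatrics
E8: General Considerations for Clinical Trials
E10: Choice of Control Group in Clinical Trials
M1: Standardisation of Medical Terminology for Regulatory Purposes
M3: Non-Clinical Safety Studies for the Conduct of Human Clinical Trials for Pharmaceuticals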

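This guidance is intended to give direction to sponsors in the design, conduct, analysis, and evaluation of clinical trials of an investigational product in the context of its overall clinical development. The document will also assist scientific experts charged with preparing application summaries or assessing evidence of efficacy and safety, principally from clinical trials in later phases of development.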

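1.2 Scope and Direction

The focus of this guidance is on statistical principles. It does not address the use of specific statistical procedures or methods. Specific procedural steps to ensure that principles are implemented properly are the responsibility of the sponsor. Integration of data across clinical trials is discussed, but is not a primary focus of this guidance. Principles and procedures related to data management or clinical trial monitoring activities are covered in other ICH guidelines and are not addressed here.

This guidance should be of interest to individuals from a broad range of scientific disciplines. However, it is assumed that the actual responsibility for all statistical work associated with clinical trials will lie with an appropriately qualified and experienced statistician, as indicated in ICH E6. The role and responsibility of the trial statistician (see Glossary), in collaboration with other clinical trial professionals, is to ensure that statistical principles are applied appropriately in clinical trials supporting drug development. Thus, the trial statistician should have a combination of education/training and experience sufficient to implement the principles articulated in this guidance.

For each clinical trial contributing to a marketing application, all important details of its design and conduct and the principal features of its proposed statistical analysis should be clearly specified in the protocol. The extent to which the procedures in the protocol are followed and the primary analysis is planned a priori will contribute to the degree of confidence in the final results and conclusions of the trial. The protocol and subsequent amendments should be approved by the responsible personnel, including the trial statistician. The trial statistician should ensure that the protocol and any amendments cover all relevant statistical issues clearly and accurately, using technical terminology as appropriate.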

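The principles outlined in this guidance are primarily relevant to clinical trials conducted in the later phases of development, many of which are confirmatory trials of efficacy. In addition to efficacy, confirmatory trials may have as their primary variable a safety variable (e.g. an adverse event, a clinical laboratory variable or an electrocardiographic measure), or a pharmacodynamic or pharmacokinetic variable (as in a confirmatory bioequivalence trial). Furthermore, some confirmatory findings may be derived from data integrated across trials, and selected principles in this guidance are applicable in this situation. Finally, although the early phases of drug development consist mainly of clinical trials that are exploratory in nature, statistical principles are also relevant to these clinical trials. Hence, the substance of this document should be applied as far as possible to all phases of clinical development.

Many of the principles delineated in this guidance deal with minimising bias (see Glossary) and maximising precision. As used in this guidance, the term 'bias' describes the systematic tendency of any factors associated with the design, conduct, analysis and interpretation of the results of clinical trials to make the estimate of a treatment effect (see Glossary) deviate from its true value. It is important to identify potential sources of bias as completely as possible so that attempts to limit such bias may be made. The presence of bias may seriously compromise the ability to draw valid conclusions from clinical trials.

Some sources of bias arise from the design of the trial, for example an assignment of treatments such that subjects at lower risk are systematically assigned to one treatment. Other sources of bias arise during the conduct and analysis of a clinical trial. For example, protocol violations and exclusion of subjects from analysis based upon knowledge of subject outcomes are possible sources of bias that may affect the accurate assessment of the treatment effect. Because bias can occur in subtle or unknown ways and its effect is not measurable directly, it is important to evaluate the robustness of the results and primary conclusions of the trial. Robustness is a concept that refers to the sensitivity of the overall conclusions to various limitations of the data, assumptions, and analytic approaches to data analysis. Robustness implies that the treatment effect and primary conclusions of the trial are not substantially affected when analyses are carried out based on alternative assumptions or analytic approaches. The interpretation of statistical measures of uncertainty of the treatment effect and treatment comparisons should involve consideration of the potential contribution of bias to the p-value, confidence interval, or inference.

Because the predominant approaches to the design and analysis of clinical trials have been based on frequentist statistical methods, the guidance largely refers to the use of frequentist methods (see Glossary) when discussing hypothesis testing and/or confidence intervals. This should not be taken to imply that other approaches are not appropriate: the use of Bayesian (see Glossary) and other approaches may be considered when the reasons for their use are clear and when the resulting conclusions are sufficiently robust.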

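2. Considerations for Overall Clinical Development

2.1 Trial Context

2.1.1 Development Plan

The broad aim of the process of clinical development of a new drug is to find out whether there is a dose range and schedule at which the drug can be shown to be simultaneously safe and effective, to the extent that the risk-benefit relationship is acceptable. The particular subjects who may benefit from the drug, and the specific indications for its use, also need to be defined.

Satisfying these broad aims usually requires an ordered programme of clinical trials, each with its own specific objectives (see ICH E8). This should be specified in a clinical plan, or a series of plans, with appropriate decision points and flexibility to allow modification as knowledge accumulates. A marketing application should clearly describe the main content of such plans, and the contribution made by each trial. Interpretation and assessment of the evidence from the total programme of trials involves synthesis of the evidence from the individual trials (see Section 7.2). This is facilitated by ensuring that common standards are adopted for a number of features of the trials such as dictionaries of medical terms, definition and timing of the main measurements, handling of protocol deviations and so on. A statistical summary, overview or meta-analysis (see Glossary) may be informative when medical questions are addressed in more than one trial. Where possible this should be envisaged in the plan so that the relevant trials are clearly identified and any necessary common features of their designs are specified in advance. Other major statistical issues (if any) that are expected to affect a number of trials in a common plan should be addressed in that plan.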

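2.1.2 Confirmatory Trial

A confirmatory trial is an adequately controlled trial in which the hypotheses are stated in advance and evaluated. As a rule, confirmatory trials are necessary to provide firm evidence of efficacy or safety. In such trials the key hypothesis of interest follows directly from the trial's primary objective, is always pre-defined, and is the hypothesis that is subsequently tested when the trial is complete. In a confirmatory trial it is equally important to estimate with due precision the size of the effects attributable to the treatment of interest and to relate these effects to their clinical significance.

Confirmatory trials are intended to provide firm evidence in support of claims and hence adherence to protocols and standard operating procedures is particularly important; unavoidable changes should be explained and documented, and their effect examined. A justification of the design of each such trial, and of other important statistical aspects such as the principal features of the planned analysis, should be set out in the protocol. Each trial should address only a limited number of questions.

Firm evidence in support of claims requires that the results of the confirmatory trials demonstrate that the investigational product under test has clinical benefits. The confirmatory trials should therefore be sufficient to answer each key clinical question relevant to the efficacy or safety claim clearly and definitively. In addition, it is important that the basis for generalisation (see Glossary) to the intended patient population is understood and explained; this may also influence the number and type (e.g. specialist or general practitioner) of centres and/or trials needed. The results of the confirmatory trial(s) should be robust. In some circumstances the weight of evidence from a single confirmatory trial may be sufficient.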

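2.1.3 Exploratory Trial

The rationale and design of confirmatory trials nearly always rests on earlier clinical work carried out in a series of exploratory studies. Like all clinical trials, these exploratory studies should have clear and precise objectives. However, in contrast to confirmatory trials, their objectives may not always lead to simple tests of pre-defined hypotheses. In addition, exploratory trials may sometimes require a more flexible approach to design so that changes can be made in response to accumulating results. Their analysis may entail data exploration; tests of hypothesis may be carried out, but the choice of hypothesis may be data dependent. Such trials cannot be the basis of the formal proof of efficacy, although they may contribute to the total body of relevant evidence.

Any individual trial may have both confirmatory and exploratory aspects. For example, in most confirmatory trials the data are also subjected to exploratory analyses which serve as a basis for explaining or supporting their findings and for suggesting further hypotheses for later research. The protocol should make a clear distinction between the aspects of a trial which will be used for confirmatory proof and the aspects which will provide data for exploratory analysis.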

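2.2 Scope of Trials

2.2.1 Population

In the earlier phases of drug development the choice of subjects for a clinical trial may be heavily influenced by the wish to maximise the chance of observing specific clinical effects of interest, and hence they may come from a very narrow subgroup of the total patient population for which the drug may eventually be indicated. However, by the time the confirmatory trials are undertaken, the subjects in the trials should more closely mirror the target population. Hence, in these trials it is generally helpful to relax the inclusion and exclusion criteria as much as possible within the target population, while maintaining sufficient homogeneity to permit precise estimation of treatment effects. No individual clinical trial can be expected to be totally representative of future users, because of the possible influences of geographical location, the time when it is conducted, the medical practices of the particular investigator(s) and clinics, and so on. However, the influence of such factors should be reduced wherever possible, and subsequently discussed during the interpretation of the trial results.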

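2.2.2 Primary and Secondary Variables

The primary variable ('target' variable, primary endpoint) should be the variable capable of providing the most clinically relevant and convincing evidence directly related to the primary objective of the trial. There should generally be only one primary variable. This will usually be an efficacy variable, because the primary objective of most confirmatory trials is to provide strong scientific evidence regarding efficacy. Safety/tolerability may sometimes be the primary variable, and will always be an important consideration. Measurements relating to quality of life and health economics are further potential primary variables. The selection of the primary variable should reflect the accepted norms and standards in the relevant field of research. The use of a reliable and validated variable with which experience has been gained either in earlier studies or in published literature is recommended. There should be sufficient evidence that the primary variable can provide a valid and reliable measure of some clinically relevant and important treatment benefit in the patient population described by the inclusion and exclusion criteria. The primary variable should generally be the one used when estimating the sample size (see Section 3.5).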

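In many cases, the approach to assessing subject outcome may not be straightforward and should be carefully defined. For example, it is inadequate to specify mortality as a primary variable without further clarification; mortality may be assessed by comparing proportions alive at fixed points in time, or by comparing overall distributions of survival times over a specified interval. Another common example is a recurring event; the measure of treatment effect may be a simple dichotomous variable (any occurrence during a specified interval), time to first occurrence, rate of occurrence (events per time unit of observation), etc. The assessment of functional status over time in studying treatment for chronic disease presents other challenges in selection of the primary variable. There are many possible approaches, such as comparisons of the assessments done at the beginning and end of the interval of observation, comparisons of slopes calculated from all assessments throughout the interval, comparisons of the proportions of subjects exceeding or declining beyond a specified threshold, or comparisons based on methods for repeated measures data. To avoid multiplicity concerns arising from post hoc definitions, it is critical to specify in the protocol the precise definition of the primary variable as it will be used in the statistical analysis. In addition, the clinical relevance of the specific primary variable selected and the validity of the associated measurement procedures will generally need to be addressed and justified in the protocol.

The primary variable should be specified in the protocol, along with the rationale for its selection. Redefinition of the primary variable after unblinding will almost always be unacceptable, since the biases this introduces are difficult to assess. When the clinical effect defined by the primary objective is to be measured in more than one way, the protocol should identify one of the measurements as the primary variable on the basis of clinical relevance, importance, objectivity, and/or other relevant characteristics, whenever such selection is feasible.

Secondary variables are either supportive measurements related to the primary objective or measurements of effects related to the secondary objectives. Their pre-definition in the protocol is also important, as is an explanation of their relative importance and roles in interpretation of trial results. The number of secondary variables should be limited and should be related to the limited number of questions to be answered in the trial.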

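2.2.3 Composite Variables

If a single primary variable cannot be selected from multiple measurements associated with the primary objective, another useful strategy is to integrate or combine the multiple measurements into a single or 'composite' variable, using a pre-defined algorithm. Indeed, the primary variable sometimes arises as a combination of multiple clinical measurements (e.g. the rating scales used in arthritis, psychiatric disorders and elsewhere). This approach addresses the multiplicity problem without requiring adjustment to the type I error. The method of combining the multiple measurements should be specified in the protocol, and an interpretation of the resulting scale should be provided in terms of the size of a clinically relevant benefit. When a composite variable is used as a primary variable, the components of this variable that are clinically meaningful may be analysed separately. When a rating scale is used as a primary variable, it is especially important to address such factors as content validity (see Glossary), inter- and intra-rater reliability (see Glossary) and responsiveness for detecting changes in the severity of disease.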

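2.2.4 Global Assessment Variables

In some cases, 'global assessment' variables (see Glossary) are developed to measure the overall safety, overall efficacy, and/or overall usefulness of a treatment. This type of variable integrates objective variables and the investigator's overall impression about the state or change in the state of the subject, and is usually a scale of ordered categorical ratings. Global assessments of overall efficacy are well established in some therapeutic areas, such as neurology and psychiatry.

Global assessment variables generally have a subjective component. When a global assessment variable is used as a primary or secondary variable, fuller details of the scale should be included in the protocol with respect to: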

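1) the relevance of the scale to the primary objective of the trial;

2) the basis for the validity and reliability of the scale;

3) how to utilise the data collected on an individual subject to assign that subject to a unique category of the scale;

4) how to assign subjects with missing data to a unique category of the scale, or otherwise evaluate them.

If objective variables are considered by the investigator when making a global assessment, then those objective variables should be considered as additional primary, or at least important secondary, variables.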

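Global assessment of usefulness integrates components of both benefit and risk and reflects the decision making process of the treating physician, who must weigh benefit and risk in making product use decisions. A problem with global usefulness variables is that their use could in some cases lead to two products being declared equivalent despite having very different profiles of beneficial and adverse effects. For example, judging the global usefulness of a treatment as equivalent or superior to an alternative may mask the fact that it has little or no efficacy but fewer adverse effects. Therefore it is not advisable to use a global usefulness variable as a primary variable. If global usefulness is specified as primary, it is important to consider specific efficacy and safety outcomes separately as additional primary variables.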

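2.2.5 Multiple Primary Variables

It may sometimes be desirable to use more than one primary variable, each of which (or a subset of which) could be sufficient to cover the range of effects of the therapies. The planned manner of interpretation of this type of evidence should be carefully spelled out. It should be clear whether an impact on any of the variables, some minimum number of them, or all of them, would be considered necessary to achieve the trial objectives. The primary hypothesis or hypotheses and parameters of interest (e.g. mean, percentage, distribution) should be clearly stated with respect to the primary variables identified, and the approach to statistical inference described. The effect on the type I error should be explained because of the potential for multiplicity problems (see Section 5.6); the method of controlling type I error should be given in the protocol. The extent of intercorrelation among the proposed primary variables may be considered in evaluating the impact on type I error. If the purpose of the trial is to demonstrate effects on all of the designated primary variables, then there is no need for adjustment of the type I error, but the impact on type II error and sample size should be carefully considered.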
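
The two decision rules described above can be sketched in a few lines. This is only an illustration under stated assumptions: the Bonferroni split is one of several possible ways to control the type I error when success on any one variable suffices (the guideline does not prescribe it), and the function names and the 5% overall level are hypothetical choices made for the example.

```python
def bonferroni_levels(overall_alpha: float, n_variables: int) -> list[float]:
    """Split the overall type I error equally across the primary variables
    (Bonferroni; an assumed choice, not mandated by the guideline)."""
    if n_variables < 1:
        raise ValueError("at least one primary variable is required")
    return [overall_alpha / n_variables] * n_variables


def success_on_any(p_values: list[float], overall_alpha: float = 0.05) -> bool:
    """Trial succeeds if ANY primary variable is significant:
    each variable is tested at its reduced Bonferroni level."""
    levels = bonferroni_levels(overall_alpha, len(p_values))
    return any(p <= level for p, level in zip(p_values, levels))


def success_on_all(p_values: list[float], overall_alpha: float = 0.05) -> bool:
    """Trial succeeds only if ALL primary variables are significant:
    each is tested at the full level, so no type I error adjustment is
    needed, but the type II error (and hence sample size) is affected."""
    return all(p <= overall_alpha for p in p_values)
```

With two primary variables and p-values of 0.03 and 0.20, the 'any' rule fails because 0.03 exceeds the adjusted level of 0.025, which shows why the rule has to be fixed in the protocol before the data are seen.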

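2.2.6 Surrogate Variables

When direct assessment of the clinical benefit to the subject through observing actual clinical efficacy is not practical, indirect criteria (surrogate variables; see Glossary) may be considered. Variables that are believed to be reliable predictors of clinical benefit are commonly accepted as surrogate variables. There are two principal concerns with the introduction of any proposed surrogate variable. First, it may not be a true predictor of the clinical outcome of interest. For example, it may measure treatment activity associated with one specific pharmacological mechanism, but may not provide full information on the range of actions and ultimate effects of the treatment, whether positive or negative. There have been many instances where treatments showing a highly positive effect on a proposed surrogate have ultimately been shown to be detrimental to the subjects' clinical outcome; conversely, there are cases of treatments conferring clear clinical benefit without measurable impact on proposed surrogates. Secondly, proposed surrogate variables may not yield a quantitative measure of clinical benefit that can be weighed directly against adverse effects. Statistical criteria for validating surrogate variables have been proposed but the experience with their use is relatively limited. In practice, the strength of the evidence for surrogacy depends upon (i) the biological plausibility of the relationship, (ii) the demonstration in epidemiological studies of the prognostic value of the surrogate for the clinical outcome and (iii) evidence from clinical trials that treatment effects on the surrogate correspond to effects on the clinical outcome. Relationships between clinical and surrogate variables for one product do not necessarily apply to a product with a different mode of action for treating the same disease.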

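2.2.7 Categorised Variables

Dichotomisation or other categorisation of continuous or ordinal variables may sometimes be desirable. Criteria of 'success' and 'response' are common examples of dichotomies. The categorisation criteria should be specified precisely, for example in terms of a minimum percentage improvement (relative to baseline) in a continuous variable, or a ranking categorised as at or above some threshold level (e.g. 'good') on an ordinal rating scale.

A reduction of diastolic blood pressure below 90mmHg is a common dichotomisation. Categorisations are most useful when they have clear clinical relevance. The criteria for categorisation should be pre-defined and specified in the protocol, as knowledge of trial results could easily bias the choice of such criteria. Because categorisation normally implies a loss of information, a consequence will be a loss of power in the analysis; this should be accounted for in the sample size calculation.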

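2.3 Design Techniques to Avoid Bias

The most important design techniques for avoiding bias in clinical trials are blinding and randomisation, and these should be normal features of most controlled clinical trials intended to be included in a marketing application. Most such trials follow a double-blind approach in which treatments are pre-packed in accordance with a suitable randomisation schedule, and supplied to the trial centre(s) labelled only with the subject number and the treatment period so that no one involved in the conduct of the trial is aware of the specific treatment allocated to any particular subject, not even as a code letter. This approach will be assumed in Section 2.3.1 and most of Section 2.3.2, exceptions being considered at the end.

Bias can also be reduced at the design stage by specifying procedures in the protocol aimed at minimising any anticipated irregularities in trial conduct that might impair a satisfactory analysis, including various types of protocol violations, withdrawals and missing values. The protocol should consider ways both to reduce the frequency of such problems, and also to handle the problems that do occur in the analysis of data.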

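2.3.1 Blinding

Blinding or masking is intended to limit the occurrence of conscious and unconscious bias in the conduct and interpretation of a clinical trial arising from the influence which the knowledge of treatment may have on the recruitment and allocation of subjects, their subsequent care, the attitudes of subjects to the treatments, the assessment of end-points, the handling of withdrawals, the exclusion of data from analysis, and so on. The essential aim is to prevent identification of the treatments until all such opportunities for bias have passed.

A double-blind trial is one in which neither the subject nor any of the investigator or sponsor staff who are involved in the treatment or clinical evaluation of the subjects are aware of the treatment received. This includes anyone determining subject eligibility, evaluating endpoints, or assessing compliance with the protocol. This level of blinding is maintained throughout the conduct of the trial, and only when the data are cleaned to an acceptable level of quality will appropriate personnel be unblinded. If any of the sponsor staff who are not involved in the treatment or clinical evaluation of the subjects are required to be unblinded to the treatment code (e.g. bioanalytical scientists, auditors, those involved in serious adverse event reporting), the sponsor should have adequate standard operating procedures to guard against inappropriate dissemination of treatment codes. In a single-blind trial the investigator and/or their staff are aware of the treatment but the subject is not, or vice versa. In an open-label trial the identity of treatment is known to all. The double-blind trial is the optimal approach. This requires that the treatments to be applied during the trial cannot be distinguished (appearance, taste, etc.) either before or during administration, and that the blind is maintained appropriately during the whole trial.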

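Difficulties in achieving the double-blind ideal can arise: the treatments may be of a completely different nature, for example, surgery and drug therapy; two drugs may have different formulations and, although they could be made indistinguishable by the use of capsules, changing the formulation might also change the pharmacokinetic and/or pharmacodynamic properties and hence require that bioequivalence of the formulations be established; the daily pattern of administration of two treatments may differ. One way of achieving double-blind conditions under these circumstances is to use a 'double-dummy' (see Glossary) technique. This technique may sometimes force an administration scheme that is sufficiently unusual to influence adversely the motivation and compliance of the subjects. Ethical difficulties may also interfere with its use when, for example, it entails dummy operative procedures. Nevertheless, extensive efforts should be made to overcome these difficulties.

The double-blind nature of some clinical trials may be partially compromised by apparent treatment induced effects. In such cases, blinding may be improved by blinding investigators and relevant sponsor staff to certain test results (e.g. selected clinical laboratory measures). Similar approaches (see below) to minimising bias in open-label trials should be considered in trials where unique or specific treatment effects may lead to unblinding individual patients.

If a double-blind trial is not feasible, then the single-blind option should be considered. In some cases only an open-label trial is practically or ethically possible. Single-blind and open-label trials provide additional flexibility, but it is particularly important that the investigator's knowledge of the next treatment should not influence the decision to enter the subject; this decision should precede knowledge of the randomised treatment. For these trials, consideration should be given to the use of a centralised randomisation method, such as telephone randomisation, to administer the assignment of randomised treatment. In addition, clinical assessments should be made by medical staff who are not involved in treating the subjects and who remain blind to treatment. In single-blind or open-label trials every effort should be made to minimise the various known sources of bias and primary variables should be as objective as possible. The reasons for the degree of blinding adopted should be explained in the protocol, together with steps taken to minimise bias by other means. For example, the sponsor should have adequate standard operating procedures to ensure that access to the treatment code is appropriately restricted during the process of cleaning the database prior to its release for analysis.

Breaking the blind (for a single subject) should be considered only when knowledge of the treatment assignment is deemed essential by the subject's physician for the subject's care. Any intentional or unintentional breaking of the blind should be reported and explained at the end of the trial, irrespective of the reason for its occurrence. The procedure and timing for revealing the treatment assignments should be documented.

In this document, the blind review (see Glossary) of data refers to the checking of data during the period of time between trial completion (the last observation on the last subject) and the breaking of the blind.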

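2.3.2 Randomisation

Randomisation introduces a deliberate element of chance into the assignment of treatments to subjects in a clinical trial. During subsequent analysis of the trial data, it provides a sound statistical basis for the quantitative evaluation of the evidence relating to treatment effects. It also tends to produce treatment groups in which the distributions of prognostic factors, known and unknown, are similar. In combination with blinding, randomisation helps to avoid possible bias in the selection and allocation of subjects arising from the predictability of treatment assignments.

The randomisation schedule of a clinical trial documents the random allocation of treatments to subjects. In the simplest situation it is a sequential list of treatments (or treatment sequences in a crossover trial) or corresponding codes by subject number. The logistics of some trials, such as those with a screening phase, may make matters more complicated, but the unique pre-planned assignment of treatment, or treatment sequence, to subject should be clear. Different trial designs will require different procedures for generating randomisation schedules. The randomisation schedule should be reproducible (if the need arises).

Although unrestricted randomisation is an acceptable approach, some advantages can generally be gained by randomising subjects in blocks. This helps to increase the comparability of the treatment groups, particularly when subject characteristics may change over time, as a result, for example, of changes in recruitment policy. It also provides a better guarantee that the treatment groups will be of nearly equal size. In crossover trials it provides the means of obtaining balanced designs with their greater efficiency and easier interpretation. Care should be taken to choose block lengths that are sufficiently short to limit possible imbalance, but that are long enough to avoid predictability towards the end of the sequence in a block. Investigators and other relevant staff should generally be blind to the block length; the use of two or more block lengths, randomly selected for each block, can achieve the same purpose. (Theoretically, in a double-blind trial predictability does not matter, but the pharmacological effects of drugs may provide the opportunity for intelligent guesswork.)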
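
As an illustration of block randomisation with randomly selected block lengths, the following minimal sketch generates a reproducible schedule for one stratum. The function name, the treatment labels, the choice of block lengths (2 or 4 for two treatments) and the seed value are all assumptions made for the example; a real trial would use validated software and keep the block lengths out of the protocol.

```python
import random


def permuted_block_schedule(n_subjects, treatments=("A", "B"),
                            block_multiples=(1, 2), seed=2024):
    """Return a list of treatment labels, one per subject number.

    Each block contains every treatment `m` times, where `m` is drawn at
    random from `block_multiples`, so the block length varies and is hard
    to guess. A fixed seed makes the schedule reproducible if needed.
    """
    rng = random.Random(seed)           # private generator: reproducible
    schedule = []
    while len(schedule) < n_subjects:
        m = rng.choice(block_multiples)  # random block length = m * len(treatments)
        block = list(treatments) * m
        rng.shuffle(block)               # random order within the block
        schedule.extend(block)
    return schedule[:n_subjects]         # last block may be cut short


schedule = permuted_block_schedule(40)
print(schedule.count("A"), schedule.count("B"))
```

Because every complete block is balanced, the imbalance between the two groups at any point in this sketch can never exceed the largest value in `block_multiples`.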

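In multicentre trials (see Glossary) the randomisation procedures should be organised centrally. It is advisable to have a separate random scheme for each centre, i.e. to stratify by centre or to allocate several whole blocks to each centre. More generally, stratification by important prognostic factors measured at baseline (e.g. severity of disease, age, sex, etc.) may sometimes be valuable in order to promote balanced allocation within strata; this has greater potential benefit in small trials. The use of more than two or three stratification factors is rarely necessary, is less successful at achieving balance and is logistically troublesome. The use of a dynamic allocation procedure (see below) may help to achieve balance across a number of stratification factors simultaneously provided the rest of the trial procedures can be adjusted to accommodate an approach of this type. Factors on which randomisation has been stratified should be accounted for later in the analysis.

The next subject to be randomised into a trial should always receive the treatment corresponding to the next free number in the appropriate randomisation schedule (in the respective stratum, if randomisation is stratified). The appropriate number and associated treatment for the next subject should only be allocated when entry of that subject to the randomised part of the trial has been confirmed. Details of the randomisation that facilitate predictability (e.g. block length) should not be contained in the trial protocol. The randomisation schedule itself should be filed securely by the sponsor or an independent party in a manner that ensures that blindness is properly maintained throughout the trial. Access to the randomisation schedule during the trial should take into account the possibility that, in an emergency, the blind may have to be broken for any subject. The procedure to be followed, the necessary documentation, and the subsequent treatment and assessment of the subject should all be described in the protocol.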

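Dynamic allocation is an alternative procedure in which the allocation of treatment to a subject is influenced by the current balance of allocated treatments and, in a stratified trial, by the stratum to which the subject belongs and the balance within that stratum. Deterministic dynamic allocation procedures should be avoided and an appropriate element of randomisation should be incorporated for each treatment allocation. Every effort should be made to retain the double-blind status of the trial. For example, knowledge of the treatment code may be restricted to a central trial office from where the dynamic allocation is controlled, generally through telephone contact. This in turn permits additional checks of eligibility criteria and establishes entry into the trial, features that can be valuable in certain types of multicentre trial. The usual system of pre-packing and labelling drug supplies for double-blind trials can then be followed, but the order of their use is no longer sequential. It is desirable to use appropriate computer algorithms to keep personnel at the central trial office blind to the treatment code. The complexity of the logistics and potential impact on the analysis should be carefully evaluated when considering dynamic allocation.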
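
One way to picture a dynamic allocation procedure with a random element is a minimisation-style rule: for each candidate treatment, count how unbalanced the groups would become across the new subject's factor levels, then favour (but do not force) the treatment that gives the best balance. The sketch below is a simplified illustration only. The function name, the use of the range of counts as the imbalance measure, and the probability `p_best = 0.8` are assumptions made for the example, not anything specified by the guideline.

```python
import random


def minimisation_assign(new_subject, allocated, treatments=("A", "B"),
                        p_best=0.8, rng=None):
    """Pick a treatment for `new_subject` (dict: factor -> level).

    `allocated` is a list of (factors_dict, treatment) for earlier subjects.
    The treatment minimising total imbalance is chosen with probability
    `p_best`; otherwise another treatment is chosen, so that the
    procedure is never fully deterministic when p_best < 1.
    """
    rng = rng or random.Random()
    imbalance = {}
    for candidate in treatments:
        total = 0
        for factor, level in new_subject.items():
            counts = {t: 0 for t in treatments}
            for factors, given in allocated:
                if factors.get(factor) == level:
                    counts[given] += 1
            counts[candidate] += 1      # hypothetically allocate the candidate
            total += max(counts.values()) - min(counts.values())
        imbalance[candidate] = total

    best = min(imbalance.values())
    preferred = [t for t in treatments if imbalance[t] == best]
    if len(preferred) == len(treatments):   # complete tie: pure randomisation
        return rng.choice(list(treatments))
    if rng.random() < p_best:
        return rng.choice(preferred)
    return rng.choice([t for t in treatments if t not in preferred])
```

In practice such a rule would run at the central trial office, with the person taking the call seeing only a pack number rather than the treatment label.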

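3. Trial Design Considerations

3.1 Design Configuration

3.1.1 Parallel Group Design

The most common clinical trial design for confirmatory trials is the parallel group design in which subjects are randomised to one of two or more arms, each arm being allocated a different treatment. These treatments will include the investigational product at one or more doses, and one or more control treatments, such as placebo and/or an active comparator. The assumptions underlying this design are less complex than for most other designs. However, as with other designs, there may be additional features of the trial that complicate the analysis and interpretation (e.g. covariates, repeated measurements over time, interactions between design factors, protocol violations, dropouts (see Glossary) and withdrawals).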

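3.1.2 Crossover Design

In the crossover design, each subject is randomised to a sequence of two or more treatments, and hence acts as their own control for treatment comparisons. This simple manoeuvre is attractive primarily because it reduces the number of subjects needed to achieve a specific power, sometimes to a marked extent. In the simplest 2×2 crossover design each subject receives each of two treatments in randomised order in two successive treatment periods, often separated by a washout period. The most common extension of this entails comparing n(>2) treatments in n periods, each subject receiving all n treatments. Numerous variations exist, such as designs in which each subject receives a subset of n(>2) treatments, or ones in which treatments are repeated within a subject.

Crossover designs have a number of problems that can invalidate their results. The chief difficulty concerns carryover, that is, the residual influence of treatments in subsequent treatment periods. In an additive model the effect of unequal carryover will be to bias direct treatment comparisons. In the 2×2 design the carryover effect cannot be statistically distinguished from the interaction between treatment and period, and the test for either of these effects lacks power because the corresponding contrast is 'between subject'. This problem is less acute in higher order designs, but cannot be entirely dismissed.

When the crossover design is used it is therefore important to avoid carryover. This is best done by selective and careful use of the design on the basis of adequate knowledge of both the disease area and the new medication. The disease under study should be chronic and stable. The relevant effects of the medication should develop fully within the treatment period. The washout periods should be sufficiently long for complete reversal of the drug effect. The fact that these conditions are likely to be met should be established in advance of the trial by means of prior information and data.

There are additional problems that need careful attention in crossover trials. The most notable of these are the complications of analysis and interpretation arising from the loss of subjects. Also, the potential for carryover leads to difficulties in assigning adverse events which occur in later treatment periods to the appropriate treatment. These, and other issues, are described in ICH E4. The crossover design should generally be restricted to situations where losses of subjects from the trial are expected to be small.

A common, and generally satisfactory, use of the 2×2 crossover design is to demonstrate the bioequivalence of two formulations of the same medication. In this particular application in healthy volunteers, carryover effects on the relevant pharmacokinetic variable are most unlikely to occur if the washout time between the two periods is sufficiently long. However, it is still important to check this assumption during analysis on the basis of the data obtained, for example by demonstrating that no drug is detectable at the start of each period.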

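3.1.3 Factorial Designs

In a factorial design two or more treatments are evaluated simultaneously through the use of varying combinations of the treatments. The simplest example is the 2×2 factorial design in which subjects are randomly allocated to one of the four possible combinations of two treatments, A and B say. These are: A alone; B alone; both A and B; neither A nor B. In many cases this design is used for the specific purpose of examining the interaction of A and B. The statistical test of interaction may lack power to detect an interaction if the sample size was calculated based on the test for main effects. This consideration is important when this design is used for examining the joint effects of A and B, in particular, if the treatments are likely to be used together.

Another important use of the factorial design is to establish the dose-response characteristics of the simultaneous use of treatments C and D, especially when the efficacy of each monotherapy has been established at some dose in prior trials. A number, m, of doses of C is selected, usually including a zero dose (placebo), and a similar number, n, of doses of D. The full design then consists of m×n treatment groups, each receiving a different combination of doses of C and D. The resulting estimate of the response surface may then be used to help to identify an appropriate combination of doses of C and D for clinical use (see ICH E4).

In some cases, the 2×2 design may be used to make efficient use of clinical trial subjects by evaluating the efficacy of the two treatments with the same number of subjects as would be required to evaluate the efficacy of either one alone. This strategy has proved to be particularly valuable for very large mortality trials. The efficiency and validity of this approach depends upon the absence of interaction between treatments A and B so that the effects of A and B on the primary efficacy variables follow an additive model, and hence the effect of A is virtually identical whether or not it is additional to the effect of B. As for the crossover trial, evidence that this condition is likely to be met should be established in advance of the trial by means of prior information and data.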

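3.2 Multicentre Trials

Multicentre trials are carried out for two main reasons. Firstly, a multicentre trial is an accepted way of evaluating a new medication more efficiently; under some circumstances, it may present the only practical means of accruing sufficient subjects to satisfy the trial objective within a reasonable time-frame. Multicentre trials of this nature may, in principle, be carried out at any stage of clinical development. They may have several centres with a large number of subjects per centre or, in the case of a rare disease, they may have a large number of centres with very few subjects per centre.

Secondly, a trial may be designed as a multicentre (and multi-investigator) trial primarily to provide a better basis for the subsequent generalisation of its findings. This arises from the possibility of recruiting the subjects from a wider population and of administering the medication in a broader range of clinical settings, thus presenting an experimental situation that is more typical of future use. In this case the involvement of a number of investigators also gives the potential for a wider range of clinical judgement concerning the value of the medication. Such a trial would be a confirmatory trial in the later phases of drug development and would be likely to involve a large number of investigators and centres. It might sometimes be conducted in a number of different countries in order to facilitate generalisability (see Glossary) even further.

If a multicentre trial is to be meaningfully interpreted and extrapolated, then the manner in which the protocol is implemented should be clear and similar at all centres. Furthermore, the usual sample size and power calculations depend upon the assumption that the differences between the compared treatments in the centres are unbiased estimates of the same quantity. It is important to design the common protocol and to conduct the trial with this background in mind. Procedures should be standardised as completely as possible. Variation of evaluation criteria and schemes can be reduced by investigator meetings, by the training of personnel in advance of the trial and by careful monitoring during the trial. Good design should generally aim to achieve the same distribution of subjects to treatments within each centre and good management should maintain this design objective. Trials that avoid excessive variation in the numbers of subjects per centre and trials that avoid a few very small centres have advantages if it is later found necessary to take into account the heterogeneity of the treatment effect from centre to centre, because they reduce the differences between different weighted estimates of the treatment effect. (This point does not apply to trials in which all centres are very small and in which centre does not feature in the analysis.) Failure to take these precautions, combined with doubts about the homogeneity of the results, may in severe cases reduce the value of a multicentre trial to such a degree that it cannot be regarded as giving convincing evidence for the sponsor's claims.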

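In the simplest multicentre trial, each investigator will be responsible for the subjects recruited at one hospital, so that 'centre' is identified uniquely by either investigator or hospital. In many trials, however, the situation is more complex. One investigator may recruit subjects from several hospitals; one investigator may represent a team of clinicians (subinvestigators) who all recruit subjects from their own clinics at one hospital or at several associated hospitals. Whenever there is room for doubt about the definition of centre in a statistical model, the statistical section of the protocol (see Section 5.1) should clearly define the term (e.g. by investigator, location or region) in the context of the particular trial. In most instances centres can be satisfactorily defined through the investigators and ICH E6 provides relevant guidance in this respect. In cases of doubt the aim should be to define centres so as to achieve homogeneity in the important factors affecting the measurements of the primary variables and the influence of the treatments. Any rules for combining centres in the analysis should be justified and specified prospectively in the protocol where possible, but in any case decisions concerning this approach should always be taken blind to treatment, for example at the time of the blind review.

The statistical model to be adopted for the estimation and testing of treatment effects should be described in the protocol. The main treatment effect may be investigated first using a model which allows for centre differences, but does not include a term for treatment-by-centre interaction. If the treatment effect is homogeneous across centres, the routine inclusion of interaction terms in the model reduces the efficiency of the test for the main effects. In the presence of true heterogeneity of treatment effects, the interpretation of the main treatment effect is controversial.

In some trials, for example some large mortality trials with very few subjects per centre, there may be no reason to expect the centres to have any influence on the primary or secondary variables because they are unlikely to represent influences of clinical importance. In other trials it may be recognised from the start that the limited numbers of subjects per centre will make it impracticable to include the centre effects in the statistical model. In these cases it is not appropriate to include a term for centre in the model, and it is not necessary to stratify the randomisation by centre.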

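In a trial with adequate numbers of subjects per centre, if positive treatment effects are found, there should generally be an exploration of the heterogeneity of treatment effects across centres, as this may affect the generalisability of the conclusions. Marked heterogeneity may be identified by graphical display of the results of individual centres or by analytical methods, such as a significance test of the treatment-by-centre interaction. When using such a statistical significance test, it is important to recognise that this generally has low power in a trial designed to detect the main effect of treatment.

If heterogeneity of treatment effects is found, this should be interpreted with care and vigorous attempts should be made to find an explanation in terms of other features of trial management or subject characteristics. Such an explanation will usually suggest appropriate further analysis and interpretation. In the absence of an explanation, heterogeneity of treatment effect as evidenced, for example, by marked quantitative interactions (see Glossary) implies that alternative estimates of the treatment effect may be required, giving different weights to the centres, in order to substantiate the robustness of the estimates of treatment effect. It is even more important to understand the basis of any heterogeneity characterised by marked qualitative interactions (see Glossary), and failure to find an explanation may necessitate further clinical trials before the treatment effect can be reliably predicted.

Up to this point the discussion of multicentre trials has been based on the use of fixed effect models. Mixed models may also be used to explore the heterogeneity of the treatment effect. These models consider centre and treatment-by-centre effects to be random, and are especially relevant when the number of centres is large.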

3.3 比较的类型

3.3.1 Trials to Show Superiority

Scientifically, efficacy is most convincingly established by demonstrating superiority to placebo in a placebo-controlled trial, by showing superiority to an active control treatment or by demonstrating a dose-response relationship. This type of trial is referred to as a ‘superiority’ trial (see Glossary). Generally in this guidance superiority trials are assumed, unless it is explicitly stated otherwise.

For serious illnesses, when a therapeutic treatment which has been shown to be efficacious by superiority trial(s) exists, a placebo-controlled trial may be considered unethical. In that case the scientifically sound use of an active treatment as a control should be considered. The appropriateness of placebo control vs. active control should be considered on a trial by trial basis.

3.3.2 Trials to Show Equivalence or Non-inferiority

In some cases, an investigational product is compared to a reference treatment without the objective of showing superiority. This type of trial is divided into two major categories according to its objective; one is an 'equivalence' trial (see Glossary) and the other is a 'non-inferiority' trial (see Glossary).

Many active control trials are designed to show that the efficacy of an investigational product is no worse than that of the active comparator, and hence fall into the latter category. Another possibility is a trial in which multiple doses of the investigational drug are compared with the recommended dose or multiple doses of the standard drug. The purpose of this design is simultaneously to show a dose-response relationship for the investigational product and to compare the investigational product with the active control.

Active control equivalence or non-inferiority trials may also incorporate a placebo, thus pursuing multiple goals in one trial; for example, they may establish superiority to placebo and hence validate the trial design and simultaneously evaluate the degree of similarity of efficacy and safety to the active comparator. There are well known difficulties associated with the use of the active control equivalence (or non-inferiority) trials that do not incorporate a placebo or do not use multiple doses of the new drug. These relate to the implicit lack of any measure of internal validity (in contrast to superiority trials), thus making external validation necessary. The equivalence (or non-inferiority) trial is not conservative in nature, so that many flaws in the design or conduct of the trial will tend to bias the results towards a conclusion of equivalence. For these reasons, the design features of such trials should receive special attention and their conduct needs special care. For example, it is especially important to minimise the incidence of violations of the entry criteria, non-compliance, withdrawals, losses to follow-up, missing data and other deviations from the protocol, and also to minimise their impact on the subsequent analyses.

Active comparators should be chosen with care. An example of a suitable active comparator would be a widely used therapy whose efficacy in the relevant indication has been clearly established and quantified in well designed and well documented superiority trial(s) and which can be reliably expected to exhibit similar efficacy in the contemplated active control trial. To this end, the new trial should have the same important design features (primary variables, the dose of the active comparator, eligibility criteria, etc.) as the previously conducted superiority trials in which the active comparator clearly demonstrated clinically relevant efficacy, taking into account advances in medical or statistical practice relevant to the new trial.

It is vital that the protocol of a trial designed to demonstrate equivalence or non-inferiority contain a clear statement that this is its explicit intention. An equivalence margin should be specified in the protocol; this margin is the largest difference that can be judged as being clinically acceptable and should be smaller than differences observed in superiority trials of the active comparator. For the active control equivalence trial, both the upper and the lower equivalence margins are needed, while only the lower margin is needed for the active control non-inferiority trial. The choice of equivalence margins should be justified clinically.

Statistical analysis is generally based on the use of confidence intervals (see Section 5.5). For equivalence trials, two-sided confidence intervals should be used. Equivalence is inferred when the entire confidence interval falls within the equivalence margins. Operationally, this is equivalent to the method of using two simultaneous one-sided tests to test the (composite) null hypothesis that the treatment difference is outside the equivalence margins versus the (composite) alternative hypothesis that the treatment difference is within the margins. Because the two null hypotheses are disjoint, the type I error is appropriately controlled. For non-inferiority trials a one-sided interval should be used. The confidence interval approach has a one-sided hypothesis test counterpart for testing the null hypothesis that the treatment difference (investigational product minus control) is equal to the lower equivalence margin versus the alternative that the treatment difference is greater than the lower equivalence margin. The choice of type I error should be a consideration separate from the use of a one-sided or two-sided procedure. Sample size calculations should be based on these methods (see Section 3.5).
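
The confidence interval rules in this paragraph can be sketched as follows. This is a minimal illustration that assumes a normally distributed estimate of the treatment difference (investigational product minus control) with a known standard error, and symmetric equivalence margins of plus or minus `margin`; the function names and default levels are hypothetical. Note that checking a two-sided 95% interval against both margins corresponds to two one-sided tests, each at the 2.5% level.

```python
from statistics import NormalDist


def two_sided_ci(estimate, std_error, alpha=0.05):
    """Two-sided (1 - alpha) confidence interval under a normal approximation."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate - z * std_error, estimate + z * std_error


def shows_equivalence(estimate, std_error, margin, alpha=0.05):
    """Equivalence: the ENTIRE two-sided interval lies inside (-margin, +margin)."""
    lower, upper = two_sided_ci(estimate, std_error, alpha)
    return -margin < lower and upper < margin


def shows_non_inferiority(estimate, std_error, margin, alpha=0.025):
    """Non-inferiority: the one-sided lower confidence bound lies above -margin."""
    z = NormalDist().inv_cdf(1 - alpha)
    return estimate - z * std_error > -margin


# Hypothetical numbers: difference 0.5, standard error 1.0, margin 3.0
print(two_sided_ci(0.5, 1.0))
print(shows_equivalence(0.5, 1.0, margin=3.0))
print(shows_non_inferiority(-0.5, 1.0, margin=3.0))
```

A non-significant test of 'no difference' plays no part in either rule; the conclusion rests entirely on where the interval lies relative to the pre-specified margin.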

Concluding equivalence or non-inferiority based on observing a non-significant test result of the null hypothesis that there is no difference between the investigational product and the active comparator is inappropriate.

There are also special issues in the choice of analysis sets. Subjects who withdraw or drop out of the treatment group or the comparator group will tend to have a lack of response, and hence the results of using the full analysis set (see Glossary) may be biased toward demonstrating equivalence (see Section 5.2.3).

3.3.3 Trials to Show Dose-response Relationship

How response is related to the dose of a new investigational product is a question to which answers may be obtained in all phases of development, and by a variety of approaches (see ICH E4). Dose-response trials may serve a number of objectives, amongst which the following are of particular importance: the confirmation of efficacy; the investigation of the shape and location of the dose-response curve; the estimation of an appropriate starting dose; the identification of optimal strategies for individual dose adjustments; the determination of a maximal dose beyond which additional benefit would be unlikely to occur. These objectives should be addressed using the data collected at a number of doses under investigation, including a placebo (zero dose) wherever appropriate. For this purpose the application of procedures to estimate the relationship between dose and response, including the construction of confidence intervals and the use of graphical methods, is as important as the use of statistical tests. The hypothesis tests that are used may need to be tailored to the natural ordering of doses or to particular questions regarding the shape of the dose-response curve (e.g. monotonicity). The details of the planned statistical procedures should be given in the protocol.

3.4 Group Sequential Designs

Group sequential designs are used to facilitate the conduct of interim analysis (see Section 4.5 and Glossary). While group sequential designs are not the only acceptable types of designs permitting interim analysis, they are the most commonly applied because it is more practicable to assess grouped subject outcomes at periodic intervals during the trial than on a continuous basis as data from each subject become available. The statistical methods should be fully specified in advance of the availability of information on treatment outcomes and subject treatment assignments (i.e. blind breaking, see Section 4.5). An Independent Data Monitoring Committee (see Glossary) may be used to review or to conduct the interim analysis of data arising from a group sequential design (see Section 4.6). While the design has been most widely and successfully used in large, long-term trials of mortality or major non-fatal endpoints, its use is growing in other circumstances. In particular, it is recognised that safety must be monitored in all trials and therefore the need for formal procedures to cover early stopping for safety reasons should always be considered.

3.5 Sample Size

The number of subjects in a clinical trial should always be large enough to provide a reliable answer to the questions addressed. This number is usually determined by the primary objective of the trial. If the sample size is determined on some other basis, then this should be made clear and justified. For example, a trial sized on the basis of safety questions or requirements or important secondary objectives may need larger numbers of subjects than a trial sized on the basis of the primary efficacy question (see, for example, ICH E1A).

Using the usual method for determining the appropriate sample size, the following items should be specified: a primary variable, the test statistic, the null hypothesis, the alternative ('working') hypothesis at the chosen dose(s) (embodying consideration of the treatment difference to be detected or rejected at the dose and in the subject population selected), the probability of erroneously rejecting the null hypothesis (the type I error), and the probability of erroneously failing to reject the null hypothesis (the type II error), as well as the approach to dealing with treatment withdrawals and protocol violations. In some instances, the event rate is of primary interest for evaluating power, and assumptions should be made to extrapolate from the required number of events to the eventual sample size for the trial.
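
For time-to-event endpoints, the extrapolation from a required number of events to a total sample size mentioned above might look like the following sketch. It uses Schoenfeld's approximation for a two-arm trial with 1:1 allocation, d = 4(z_{1-α/2} + z_{1-β})² / (ln HR)², which is one common choice and is not named in the guideline. The hazard ratio of 0.75 and the overall event probability of 0.5 are purely illustrative assumptions.

```python
from math import ceil, log
from statistics import NormalDist


def required_events(hazard_ratio, alpha=0.05, power=0.90):
    """Number of events needed (Schoenfeld approximation, 1:1 allocation,
    two-sided type I error `alpha`, type II error 1 - `power`)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return ceil(4 * (z_alpha + z_beta) ** 2 / log(hazard_ratio) ** 2)


def total_subjects(events, event_probability):
    """Extrapolate from events to subjects, given the assumed probability
    that a randomised subject has an event during the trial."""
    return ceil(events / event_probability)


events = required_events(0.75)
print(events, total_subjects(events, event_probability=0.5))
```

The event probability is the fragile assumption here, which is why the guideline asks for it to be stated and why lower-than-anticipated event rates are a standard trigger for the sample size review discussed in Section 4.4.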

The method by which the sample size is calculated should be given in the protocol, together with the estimates of any quantities used in the calculations (such as variances, mean values, response rates, event rates, difference to be detected). The basis of these estimates should also be given. It is important to investigate the sensitivity of the sample size estimate to a variety of deviations from these assumptions and this may be facilitated by providing a range of sample sizes appropriate for a reasonable range of deviations from assumptions. In confirmatory trials, assumptions should normally be based on published data or on the results of earlier trials. The treatment difference to be detected may be based on a judgement concerning the minimal effect which has clinical relevance in the management of patients or on a judgement concerning the anticipated effect of the new treatment, where this is larger. Conventionally the probability of type I error is set at 5% or less or as dictated by any adjustments made necessary for multiplicity considerations; the precise choice may be influenced by the prior plausibility of the hypothesis under test and the desired impact of the results. The probability of type II error is conventionally set at 10% to 20%; it is in the sponsor’s interest to keep this figure as low as feasible especially in the case of trials that are difficult or impossible to repeat. Alternative values to the conventional levels of type I and type II error may be acceptable or even preferable in some cases.
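
As a worked illustration of these quantities, the sketch below sizes a two-arm parallel group superiority trial on a continuous primary variable using the familiar normal approximation n = 2σ²(z_{1-α/2} + z_{1-β})² / δ² per group. The equal allocation, the common known standard deviation and all numerical inputs are assumptions chosen for the example. The loop then shows the sensitivity of the result to the assumed variability, as the paragraph recommends.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta, sigma, alpha=0.05, power=0.90):
    """Subjects per group to detect a mean difference `delta` with common
    standard deviation `sigma` (two-sided alpha, normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # controls the type I error
    z_beta = z.inv_cdf(power)            # power = 1 - type II error
    return ceil(2 * (sigma * (z_alpha + z_beta) / delta) ** 2)


# Sensitivity of the sample size to the assumed standard deviation
for sigma in (8, 10, 12):
    print(sigma, n_per_group(delta=5, sigma=sigma))
```

With δ = 5 and σ = 10 the formula gives 85 subjects per group at 90% power and 63 at 80% power, and a 20% increase in σ raises the requirement by about 44%, which is the kind of range the protocol could usefully report.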

Sample size calculations should refer to the number of subjects required for the primary analysis. If this is the 'full analysis set', estimates of the effect size may need to be reduced compared to the per protocol set (see Glossary). This is to allow for the dilution of the treatment effect arising from the inclusion of data from patients who have withdrawn from treatment or whose compliance is poor. The assumptions about variability may also need to be revised.

The sample size of an equivalence trial or a non-inferiority trial (see Section 3.3.2) should normally be based on the objective of obtaining a confidence interval for the treatment difference that shows that the treatments differ at most by a clinically acceptable difference. When the power of an equivalence trial is assessed at a true difference of zero, then the sample size necessary to achieve this power is underestimated if the true difference is not zero. When the power of a non-inferiority trial is assessed at a zero difference, then the sample size needed to achieve that power will be underestimated if the effect of the investigational product is less than that of the active control. The choice of a 'clinically acceptable' difference needs justification with respect to its meaning for future patients, and may be smaller than the 'clinically relevant' difference referred to above in the context of superiority trials designed to establish that a difference exists.
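
The underestimation described for non-inferiority trials can be made concrete with a small sketch. It assumes a continuous outcome with common standard deviation, 1:1 allocation, a one-sided test at level α and the normal approximation n = 2σ²(z_{1-α} + z_{1-β})² / (margin + true difference)², where the true difference is investigational product minus control. The function name and the numbers are illustrative assumptions only.

```python
from math import ceil
from statistics import NormalDist


def n_per_group_non_inferiority(margin, sigma, true_diff=0.0,
                                alpha=0.025, power=0.90):
    """Subjects per group for a non-inferiority trial.

    `margin` is the (positive) non-inferiority margin; `true_diff` is the
    assumed true difference, negative if the new product is slightly worse.
    """
    distance = margin + true_diff      # distance from the truth to the margin
    if distance <= 0:
        raise ValueError("true difference must lie above the lower margin")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha)     # one-sided type I error
    z_beta = z.inv_cdf(power)
    return ceil(2 * (sigma * (z_alpha + z_beta) / distance) ** 2)


print(n_per_group_non_inferiority(margin=5, sigma=10))                # truth = 0
print(n_per_group_non_inferiority(margin=5, sigma=10, true_diff=-1))  # slightly worse
```

In this example, assuming a zero difference gives 85 per group, whereas a true difference of only -1 (one fifth of the margin) requires 132, so a trial sized at zero would be substantially underpowered.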

The exact sample size in a group sequential trial cannot be fixed in advance because it depends upon the play of chance in combination with the chosen stopping guideline and the true treatment difference. The design of the stopping guideline should take into account the consequent distribution of the sample size, usually embodied in the expected and maximum sample sizes.

When event rates are lower than anticipated or variability is larger than expected, methods for sample size re-estimation are available without unblinding data or making treatment comparisons (see Section 4.4).

3.6 Data Capture and Processing

The collection of data and transfer of data from the investigator to the sponsor can take place through a variety of media, including paper case record forms, remote site monitoring systems, medical computer systems and electronic transfer. Whatever data capture instrument is used, the form and content of the information collected should be in full accordance with the protocol and should be established in advance of the conduct of the clinical trial. It should focus on the data necessary to implement the planned analysis, including the context information (such as timing assessments relative to dosing) necessary to confirm protocol compliance or identify important protocol deviations. ‘Missing values’ should be distinguishable from the ‘value zero’ or ‘characteristic absent’.

The process of data capture through to database finalisation should be carried out in accordance with GCP (see ICH E6, Section 5). Specifically, timely and reliable processes for recording data and rectifying errors and omissions are necessary to ensure delivery of a quality database and the achievement of the trial objectives through the implementation of the planned analysis.

4. Trial Conduct Considerations

4.1 Trial Monitoring and Interim Analysis

Careful conduct of a clinical trial according to the protocol has a major impact on the credibility of the results (see ICH E6). Careful monitoring can ensure that difficulties are noticed early and their occurrence or recurrence minimised.

There are two distinct types of monitoring that generally characterise confirmatory clinical trials sponsored by the pharmaceutical industry. One type of monitoring concerns the oversight of the quality of the trial, while the other type involves breaking the blind to make treatment comparisons (i.e. interim analysis). Both types of trial monitoring, in addition to entailing different staff responsibilities, involve access to different types of trial data and information, and thus different principles apply for the control of potential statistical and operational bias.

For the purpose of overseeing the quality of the trial the checks involved in trial monitoring may include whether the protocol is being followed, the acceptability of data being accrued, the success of planned accrual targets, the appropriateness of the design assumptions, success in keeping patients in the trials, etc. (see Sections 4.2 to 4.4). This type of monitoring does not require access to information on comparative treatment effects, nor unblinding of data and therefore has no impact on type I error. The monitoring of a trial for this purpose is the responsibility of the sponsor (see ICH E6) and can be carried out by the sponsor or an independent group selected by the sponsor. The period for this type of monitoring usually starts with the selection of the trial sites and ends with the collection and cleaning of the last subject’s data.

The other type of trial monitoring (interim analysis) involves the accruing of comparative treatment results. Interim analysis requires unblinded (i.e. key breaking) access to treatment group assignment (actual treatment assignment or identification of group assignment) and comparative treatment group summary information. This necessitates that the protocol (or appropriate amendments prior to a first analysis) contains statistical plans for the interim analysis to prevent certain types of bias. This is discussed in Sections 4.5 & 4.6.

4.2 Changes in Inclusion and Exclusion Criteria

Inclusion and exclusion criteria should remain constant, as specified in the protocol, throughout the period of subject recruitment. Changes may occasionally be appropriate, for example, in long term trials, where growing medical knowledge either from outside the trial or from interim analyses may suggest a change of entry criteria. Changes may also result from the discovery by monitoring staff that regular violations of the entry criteria are occurring, or that seriously low recruitment rates are due to over-restrictive criteria. Changes should be made without breaking the blind and should always be described by a protocol amendment which should cover any statistical consequences, such as sample size adjustments arising from different event rates, or modifications to the planned analysis, such as stratifying the analysis according to modified inclusion/exclusion criteria.

4.3 Accrual Rates

In trials with a long time-scale for the accrual of subjects, the rate of accrual should be monitored and, if it falls appreciably below the projected level, the reasons should be identified and remedial actions taken in order to protect the power of the trial and alleviate concerns about selective entry and other aspects of quality. In a multicentre trial these considerations apply to the individual centres.

4.4 Sample Size Adjustment

In long term trials there will usually be an opportunity to check the assumptions which underlay the original design and sample size calculations. This may be particularly important if the trial specifications have been made on preliminary and/or uncertain information. An interim check conducted on the blinded data may reveal that overall response variances, event rates or survival experience are not as anticipated. A revised sample size may then be calculated using suitably modified assumptions, and should be justified and documented in a protocol amendment and in the clinical study report. The steps taken to preserve blindness and the consequences, if any, for the type I error and the width of confidence intervals should be explained. The potential need for re-estimation of the sample size should be envisaged in the protocol whenever possible (see Section 3.5).
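
A blinded check of the variance assumption might be sketched as below. The pooled variance is computed from all interim observations without using treatment labels, and the sample size is then recalculated with the same normal approximation as a standard two-group design. Two things here are assumptions of the example and not requirements of the guideline: the simple one-sample variance estimator (which slightly overstates the within-group variance when a treatment effect exists) and the rule that the sample size is never reduced below the originally planned value.

```python
from math import ceil
from statistics import NormalDist, variance


def reestimated_n_per_group(blinded_values, delta, alpha=0.05, power=0.90,
                            planned_n=None):
    """Recalculate subjects per group from blinded interim data.

    `blinded_values` pools both arms; treatment codes are never consulted,
    so no treatment comparison is made and the blind is preserved.
    """
    s2 = variance(blinded_values)          # blinded (one-sample) variance
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    n = ceil(2 * s2 * z_sum ** 2 / delta ** 2)
    if planned_n is not None:
        n = max(n, planned_n)              # assumed rule: only ever increase
    return n


# Hypothetical pooled interim measurements (both arms mixed together)
interim = [0, 20] * 10
print(reestimated_n_per_group(interim, delta=5, planned_n=85))
```

Whatever estimator is used, the revised calculation and the steps taken to preserve blinding would still need to be documented in a protocol amendment and in the clinical study report.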

4.5 Interim Analysis and Early Stopping

An interim analysis is any analysis intended to compare treatment arms with respect to efficacy or safety at any time prior to formal completion of a trial. Because the number, methods and consequences of these comparisons affect the interpretation of the trial, all interim analyses should be carefully planned in advance and described in the protocol. Special circumstances may dictate the need for an interim analysis that was not defined at the start of a trial. In these cases, a protocol amendment describing the interim analysis should be completed prior to unblinded access to treatment comparison data. When an interim analysis is planned with the intention of deciding whether or not to terminate a trial, this is usually accomplished by the use of a group sequential design which employs statistical monitoring schemes as guidelines (see Section 3.4). The goal of such an interim analysis is to stop the trial early if the superiority of the treatment under study is clearly established, if the demonstration of a relevant treatment difference has become unlikely or if unacceptable adverse effects are apparent. Generally, boundaries for monitoring efficacy require more evidence to terminate a trial early (i.e. they are more conservative) than boundaries for monitoring safety. When the trial design and monitoring objective involve multiple endpoints then this aspect of multiplicity may also need to be taken into account.

The protocol should describe the schedule of interim analyses, or at least the considerations which will govern its generation, for example if flexible alpha spending function approaches are to be employed; further details may be given in a protocol amendment before the time of the first interim analysis. The stopping guidelines and their properties should be clearly described in the protocol or amendments. The potential effects of early stopping on the analysis of other important variables should also be considered. This material should be written or approved by the Data Monitoring Committee (see Section 4.6), when the trial has one. Deviations from the planned procedure always bear the potential of invalidating the trial results. If it becomes necessary to make changes to the trial, any consequent changes to the statistical procedures should be specified in an amendment to the protocol at the earliest opportunity, especially discussing the impact on any analysis and inferences that such changes may cause. The procedures selected should always ensure that the overall probability of type I error is controlled.
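To make the idea of an alpha spending function concrete, the following sketch evaluates two widely used spending functions of the Lan-DeMets type: an O'Brien-Fleming-type function and a Pocock-type function. These particular functions are examples chosen for illustration and are not prescribed by this guideline; the function names and the information fractions are assumptions. The code reports only the cumulative type I error spent at each look, not the stopping boundaries themselves, which require numerical integration.

```python
import math
from statistics import NormalDist


def obf_type_spending(t, alpha=0.05):
    """Cumulative two-sided alpha spent at information fraction t (0 < t <= 1),
    O'Brien-Fleming-type spending function."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * (1 - NormalDist().cdf(z / math.sqrt(t)))


def pocock_type_spending(t, alpha=0.05):
    """Cumulative alpha spent at information fraction t, Pocock-type function."""
    return alpha * math.log(1 + (math.e - 1) * t)


# Hypothetical schedule: four equally spaced looks.
for t in (0.25, 0.50, 0.75, 1.00):
    print(f"t={t:.2f}  OBF-type={obf_type_spending(t):.5f}  "
          f"Pocock-type={pocock_type_spending(t):.5f}")
```

Both functions spend the full alpha by the final analysis, so the overall probability of type I error is controlled; the O'Brien-Fleming-type function spends very little early on, which corresponds to demanding stronger evidence for early termination.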

The execution of an interim analysis should be a completely confidential process because unblinded data and results are potentially involved. All staff involved in the conduct of the trial should remain blind to the results of such analyses, because of the possibility that their attitudes to the trial will be modified and cause changes in the characteristics of patients to be recruited or biases in treatment comparisons. This principle may be applied to all investigator staff and to staff employed by the sponsor except for those who are directly involved in the execution of the interim analysis. Investigators should only be informed about the decision to continue or to discontinue the trial, or to implement modifications to trial procedures.

Most clinical trials intended to support the efficacy and safety of an investigational product should proceed to full completion of planned sample size accrual; trials should be stopped early only for ethical reasons or if the power is no longer acceptable. However, it is recognised that drug development plans involve the need for sponsor access to comparative treatment data for a variety of reasons, such as planning other trials. It is also recognised that only a subset of trials will involve the study of serious life-threatening outcomes or mortality which may need sequential monitoring of accruing comparative treatment effects for ethical reasons. In either of these situations, plans for interim statistical analysis should be in place in the protocol or in protocol amendments prior to the unblinded access to comparative treatment data in order to deal with the potential statistical and operational bias that may be introduced.

For many clinical trials of investigational products, especially those that have major public health significance, the responsibility for monitoring comparisons of efficacy and/or safety outcomes should be assigned to an external independent group, often called an Independent Data Monitoring Committee (IDMC), a Data and Safety Monitoring Board or a Data Monitoring Committee whose responsibilities should be clearly described.

When a sponsor assumes the role of monitoring efficacy or safety comparisons and therefore has access to unblinded comparative information, particular care should be taken to protect the integrity of the trial and to manage and limit appropriately the sharing of information. The sponsor should assure and document that the internal monitoring committee has complied with written standard operating procedures and that minutes of decision making meetings including records of interim results are maintained.

Any interim analysis that is not planned appropriately (with or without the consequences of stopping the trial early) may flaw the results of a trial and possibly weaken confidence in the conclusions drawn. Therefore, such analyses should be avoided. If an unplanned interim analysis is conducted, the clinical study report should explain why it was necessary and the degree to which blindness had to be broken, and should provide an assessment of the potential magnitude of bias introduced and of the impact on the interpretation of the results.

4.6 Role of Independent Data Monitoring Committee (IDMC) (see Sections 1.25 and 5.5.2 of ICH E6)

An IDMC may be established by the sponsor to assess at intervals the progress of a clinical trial, safety data, and critical efficacy variables and recommend to the sponsor whether to continue, modify or terminate a trial. The IDMC should have written operating procedures and maintain records of all its meetings, including interim results; these should be available for review when the trial is complete. The independence of the IDMC is intended to control the sharing of important comparative information and to protect the integrity of the clinical trial from adverse impact resulting from access to trial information. The IDMC is a separate entity from an Institutional Review Board (IRB) or an Independent Ethics Committee (IEC), and its composition should include clinical trial scientists knowledgeable in the appropriate disciplines including statistics.

When there are sponsor representatives on the IDMC, their role should be clearly defined in the operating procedures of the committee (for example, covering whether or not they can vote on key issues). Since these sponsor staff would have access to unblinded information, the procedures should also address the control of dissemination of interim trial results within the sponsor organisation.

V. DATA ANALYSIS CONSIDERATIONS

5.1 Prespecification of the Analysis

When designing a clinical trial the principal features of the eventual statistical analysis of the data should be described in the statistical section of the protocol. This section should include all the principal features of the proposed confirmatory analysis of the primary variable(s) and the way in which anticipated analysis problems will be handled. In case of exploratory trials this section could describe more general principles and directions.

The statistical analysis plan (see Glossary) may be written as a separate document to be completed after finalising the protocol. In this document, a more technical and detailed elaboration of the principal features stated in the protocol may be included (see section 7.1). The plan may include detailed procedures for executing the statistical analysis of the primary and secondary variables and other data. The plan should be reviewed and possibly updated as a result of the blind review of the data (see 7.1 for definition) and should be finalised before breaking the blind. Formal records should be kept of when the statistical analysis plan was finalised as well as when the blind was subsequently broken.

If the blind review suggests changes to the principal features stated in the protocol, these should be documented in a protocol amendment. Otherwise, it will suffice to update the statistical analysis plan with the considerations suggested from the blind review. Only results from analyses envisaged in the protocol (including amendments) can be regarded as confirmatory.

In the statistical section of the clinical study report the statistical methodology should be clearly described including when in the clinical trial process methodology decisions were made (see ICH E3).

5.2 Analysis Sets

The set of subjects whose data are to be included in the main analyses should be defined in the statistical section of the protocol. In addition, documentation for all subjects for whom trial procedures (e.g. run-in period) were initiated may be useful. The content of this subject documentation depends on detailed features of the particular trial, but at least demographic and baseline data on disease status should be collected whenever possible.

If all subjects randomised into a clinical trial satisfied all entry criteria, followed all trial procedures perfectly with no losses to follow-up, and provided complete data records, then the set of subjects to be included in the analysis would be self-evident. The design and conduct of a trial should aim to approach this ideal as closely as possible, but, in practice, it is doubtful if it can ever be fully achieved. Hence, the statistical section of the protocol should address anticipated problems prospectively in terms of how these affect the subjects and data to be analysed. The protocol should also specify procedures aimed at minimising any anticipated irregularities in study conduct that might impair a satisfactory analysis, including various types of protocol violations, withdrawals and missing values. The protocol should consider ways both to reduce the frequency of such problems, and also to handle the problems that do occur in the analysis of data. Possible amendments to the way in which the analysis will deal with protocol violations should be identified during the blind review. It is desirable to identify any important protocol violation with respect to the time when it occurred, its cause and influence on the trial result. The frequency and type of protocol violations, missing values, and other problems should be documented in the clinical study report and their potential influence on the trial results should be described (see ICH E3).

Decisions concerning the analysis set should be guided by the following principles: 1) to minimise bias, and 2) to avoid inflation of type I error.

5.2.1 Full Analysis Set

The intention-to-treat (see Glossary) principle implies that the primary analysis should include all randomised subjects. Compliance with this principle would necessitate complete follow-up of all randomised subjects for study outcomes. In practice this ideal may be difficult to achieve, for reasons to be described. In this document the term 'full analysis set' is used to describe the analysis set which is as complete as possible and as close as possible to the intention-to-treat ideal of including all randomised subjects. Preservation of the initial randomisation in analysis is important in preventing bias and in providing a secure foundation for statistical tests. In many clinical trials the use of the full analysis set provides a conservative strategy. Under many circumstances it may also provide estimates of treatment effects which are more likely to mirror those observed in subsequent practice.

There are a limited number of circumstances that might lead to excluding randomised subjects from the full analysis set including the failure to satisfy major entry criteria (eligibility violations), the failure to take at least one dose of trial medication and the lack of any data post randomisation. Such exclusions should always be justified. Subjects who fail to satisfy an entry criterion may be excluded from the analysis without the possibility of introducing bias only under the following circumstances:

(i) the entry criterion was measured prior to randomisation;

(ii) the detection of the relevant eligibility violations can be made completely objectively;

(iii) all subjects receive equal scrutiny for eligibility violations; (This may be difficult to ensure in an open-label study, or even in a double-blind study if the data are unblinded prior to this scrutiny, emphasising the importance of the blind review.)

(iv) all detected violations of the particular entry criterion are excluded.

In some situations, it may be reasonable to eliminate from the set of all randomised subjects any subject who took no trial medication. The intention-to-treat principle would be preserved despite the exclusion of these patients provided, for example, that the decision of whether or not to begin treatment could not be influenced by knowledge of the assigned treatment. In other situations it may be necessary to eliminate from the set of all randomised subjects any subject without data post randomisation. No analysis is complete unless the potential biases arising from these specific exclusions, or any others, are addressed.

When the full analysis set of subjects is used, violations of the protocol that occur after randomisation may have an impact on the data and conclusions, particularly if their occurrence is related to treatment assignment. In most respects it is appropriate to include the data from such subjects in the analysis, consistent with the intention-to-treat principle. Special problems arise in connection with subjects withdrawn from treatment after receiving one or more doses who provide no data after this point, and subjects otherwise lost to follow-up, because failure to include these subjects in the full analysis set may seriously undermine the approach. Measurements of primary variables made at the time of the loss to follow-up of a subject for any reason, or subsequently collected in accordance with the intended schedule of assessments in the protocol, are valuable in this context; subsequent collection is especially important in studies where the primary variable is mortality or serious morbidity. The intention to collect data in this way should be described in the protocol. Imputation techniques, ranging from the carrying forward of the last observation to the use of complex mathematical models, may also be used in an attempt to compensate for missing data. Other methods employed to ensure the availability of measurements of primary variables for every subject in the full analysis set may require some assumptions about the subjects' outcomes or a simpler choice of outcome (e.g. success / failure). The use of any of these strategies should be described and justified in the statistical section of the protocol and the assumptions underlying any mathematical models employed should be clearly explained. It is also important to demonstrate the robustness of the corresponding results of analysis especially when the strategy in question could itself lead to biased estimates of treatment effects.
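The simplest of the imputation techniques mentioned above, carrying forward the last observation, can be sketched as follows. The function name and the visit data are hypothetical. The sketch is shown only to make the mechanics explicit; as noted above, any such strategy rests on assumptions and may itself lead to biased estimates, so it would need to be justified in the protocol.

```python
def locf(values):
    """Last observation carried forward for one subject's visit sequence.

    `values` is a list of measurements in visit order, with None marking a
    missing visit. Leading missing values stay None because there is no
    earlier observation to carry forward.
    """
    imputed = []
    last_seen = None
    for value in values:
        if value is not None:
            last_seen = value
        imputed.append(last_seen)
    return imputed


# Hypothetical subject withdrawn after the second visit.
print(locf([5.0, 4.2, None, None]))  # [5.0, 4.2, 4.2, 4.2]
```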

Because of the unpredictability of some problems, it may sometimes be preferable to defer detailed consideration of the manner of dealing with irregularities until the blind review of the data at the end of the trial, and, if so, this should be stated in the protocol.

5.2.2 Per Protocol Set

The 'per protocol' set of subjects, sometimes described as the 'valid cases', the 'efficacy' sample or the 'evaluable subjects' sample, defines a subset of the subjects in the full analysis set who are more compliant with the protocol and is characterised by criteria such as the following:

(i) the completion of a certain pre-specified minimal exposure to the treatment regimen;

(ii) the availability of measurements of the primary variable(s);

(iii) the absence of any major protocol violations including the violation of entry criteria.

The precise reasons for excluding subjects from the per protocol set should be fully defined and documented before breaking the blind in a manner appropriate to the circumstances of the specific trial.
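One way to keep the definitions of the analysis sets explicit and fixed before the blind is broken is to encode them as rules applied to each randomised subject. The sketch below is a minimal illustration under stated assumptions: the record fields, the 28-day minimal exposure and the subject data are all hypothetical, and the exclusions from the full analysis set shown here are only those that Section 5.2.1 lists as possibly justifiable.

```python
from dataclasses import dataclass


@dataclass
class Subject:
    subject_id: str
    major_entry_violation: bool        # failed a major entry criterion
    doses_taken: int                   # doses of trial medication taken
    has_post_randomisation_data: bool
    exposure_days: int
    primary_variable_available: bool
    major_protocol_violation: bool


def in_full_analysis_set(s):
    """All randomised subjects, minus the limited exclusions of Section 5.2.1."""
    return (not s.major_entry_violation
            and s.doses_taken >= 1
            and s.has_post_randomisation_data)


def in_per_protocol_set(s, min_exposure_days=28):
    """Subset of the full analysis set that is more compliant with the protocol."""
    return (in_full_analysis_set(s)
            and s.exposure_days >= min_exposure_days
            and s.primary_variable_available
            and not s.major_protocol_violation)


# Hypothetical randomised subjects.
subjects = [
    Subject("001", False, 30, True, 84, True, False),
    Subject("002", False, 10, True, 14, True, False),   # too little exposure
    Subject("003", False, 0, False, 0, False, False),   # never dosed, no data
]
full_set = [s.subject_id for s in subjects if in_full_analysis_set(s)]
per_protocol = [s.subject_id for s in subjects if in_per_protocol_set(s)]
print(full_set, per_protocol)  # ['001', '002'] ['001']
```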

The use of the per protocol set may maximise the opportunity for a new treatment to show additional efficacy in the analysis, and most closely reflects the scientific model underlying the protocol. However, the corresponding test of the hypothesis and estimate of the treatment effect may or may not be conservative depending on the trial; the bias, which may be severe, arises from the fact that adherence to the study protocol may be related to treatment and outcome.

The problems that lead to the exclusion of subjects to create the per protocol set, and other protocol violations, should be fully identified and summarised. Relevant protocol violations may include errors in treatment assignment, the use of excluded medication, poor compliance, loss to follow-up and missing data. It is good practice to assess the pattern of such problems among the treatment groups with respect to frequency and time to occurrence.

5.2.3 Roles of the Different Analysis Sets

In general, it is advantageous to demonstrate a lack of sensitivity of the principal trial results to alternative choices of the set of subjects analysed. In confirmatory trials it is usually appropriate to plan to conduct both an analysis of the full analysis set and a per protocol analysis, so that any differences between them can be the subject of explicit discussion and interpretation. In some cases, it may be desirable to plan further exploration of the sensitivity of conclusions to the choice of the set of subjects analysed. When the full analysis set and the per protocol set lead to essentially the same conclusions, confidence in the trial results is increased, bearing in mind, however, that the need to exclude a substantial proportion of subjects from the per protocol analysis throws some doubt on the overall validity of the trial.

The full analysis set and the per protocol set play different roles in superiority trials (which seek to show the investigational product to be superior), and in equivalence or non-inferiority trials (which seek to show the investigational product to be comparable, see section 3.3.2). In superiority trials the full analysis set is used in the primary analysis (apart from exceptional circumstances) because it tends to avoid over-optimistic estimates of efficacy resulting from a per protocol analysis, since the non-compliers included in the full analysis set will generally diminish the estimated treatment effect. However, in an equivalence or non-inferiority trial use of the full analysis set is generally not conservative and its role should be considered very carefully.

5.3 Missing Values and Outliers

Missing values represent a potential source of bias in a clinical trial. Hence, every effort should be undertaken to fulfil all the requirements of the protocol concerning the collection and management of data. In reality, however, there will almost always be some missing data. A trial may be regarded as valid, nonetheless, provided the methods of dealing with missing values are sensible, and particularly if those methods are pre-defined in the protocol. Definition of methods may be refined by updating this aspect in the statistical analysis plan during the blind review. Unfortunately, no universally applicable methods of handling missing values can be recommended. An investigation should be made concerning the sensitivity of the results of analysis to the method of handling missing values, especially if the number of missing values is substantial.

5.4 Data Transformation

The decision to transform key variables prior to analysis is best made during the design of the trial on the basis of similar data from earlier clinical trials. Transformations (e.g. square root, logarithm) should be specified in the protocol and a rationale provided, especially for the primary variable(s). The general principles guiding the use of transformations to ensure that the assumptions underlying the statistical methods are met are to be found in standard texts; conventions for particular variables have been developed in a number of specific clinical areas. The decision on whether and how to transform a variable should be influenced by the preference for a scale which facilitates clinical interpretation.

Similar considerations apply to other derived variables, such as the use of change from baseline, percentage change from baseline, the 'area under the curve' of repeated measures, or the ratio of two different variables. Subsequent clinical interpretation should be carefully considered, and the derivation should be justified in the protocol. Closely related points are made in Section 2.2.2.

5.5 Estimation, Confidence Intervals and Hypothesis Testing

The statistical section of the protocol should specify the hypotheses that are to be tested and/or the treatment effects which are to be estimated in order to satisfy the primary objectives of the trial. The statistical methods to be used to accomplish these tasks should be described for the primary (and preferably the secondary) variables, and the underlying statistical model should be made clear. Estimates of treatment effects should be accompanied by confidence intervals, whenever possible, and the way in which these will be calculated should be identified. A description should be given of any intentions to use baseline data to improve precision or to adjust estimates for potential baseline differences, for example by means of analysis of covariance.

It is important to clarify whether one- or two-sided tests of statistical significance will be used, and in particular to justify prospectively the use of one-sided tests. If hypothesis tests are not considered appropriate, then the alternative process for arriving at statistical conclusions should be given. The issue of one-sided or two-sided approaches to inference is controversial and a diversity of views can be found in the statistical literature. The approach of setting type I errors for one-sided tests at half the conventional type I error used in two-sided tests is preferable in regulatory settings. This promotes consistency with the two-sided confidence intervals that are generally appropriate for estimating the possible size of the difference between two treatments.
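The consistency referred to above can be checked numerically. In the sketch below, which assumes a normally distributed estimate with known standard error, a one-sided test at the 2.5% level rejects exactly when the lower limit of the two-sided 95% confidence interval lies above zero. The function names and the example estimates are hypothetical.

```python
from statistics import NormalDist


def one_sided_p(estimate, se):
    """One-sided p-value for H0: effect <= 0, normal approximation."""
    return 1 - NormalDist().cdf(estimate / se)


def two_sided_ci(estimate, se, level=0.95):
    """Two-sided confidence interval, normal approximation."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return estimate - z * se, estimate + z * se


# Hypothetical treatment differences, each with standard error 1.0.
for estimate in (2.0, 1.8):
    p = one_sided_p(estimate, 1.0)
    lower, upper = two_sided_ci(estimate, 1.0)
    print(f"estimate={estimate}: one-sided p={p:.4f}, "
          f"95% CI=({lower:.2f}, {upper:.2f}), "
          f"reject at 0.025: {p < 0.025}, CI excludes 0: {lower > 0}")
```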

The particular statistical model chosen should reflect the current state of medical and statistical knowledge about the variables to be analysed as well as the statistical design of the trial. All effects to be fitted in the analysis (for example in analysis of variance models) should be fully specified, and the manner, if any, in which this set of effects might be modified in response to preliminary results should be explained. The same considerations apply to the set of covariates fitted in an analysis of covariance (see also Section 5.7). In the choice of statistical methods due attention should be paid to the statistical distribution of both primary and secondary variables. When making this choice (for example between parametric and non-parametric methods) it is important to bear in mind the need to provide statistical estimates of the size of treatment effects together with confidence intervals (in addition to significance tests).

The primary analysis of the primary variable should be clearly distinguished from supporting analyses of the primary or secondary variables. Within the statistical section of the protocol or the statistical analysis plan there should also be an outline of the way in which data other than the primary and secondary variables will be summarised and reported. This should include a reference to any approaches adopted for the purpose of achieving consistency of analysis across a range of trials, for example for safety data.

Modelling approaches that incorporate information on known pharmacological parameters, the extent of protocol compliance for individual subjects or other biologically based data may provide valuable insights into actual or potential efficacy, especially with regard to estimation of treatment effects. The assumptions underlying such models should always be clearly identified, and the limitations of any conclusions should be carefully described.

5.6 Adjustment of Significance and Confidence Levels

When multiplicity is present, the usual frequentist approach to the analysis of clinical trial data may necessitate an adjustment to the type I error. Multiplicity may arise, for example, from multiple primary variables (see Section 2.2.2), multiple comparisons of treatments, repeated evaluation over time and/or interim analyses (see Section 4.5). Methods to avoid or reduce multiplicity are sometimes preferable when available, such as the identification of the key primary variable (multiple variables), the choice of a critical treatment contrast (multiple comparisons), or the use of a summary measure such as 'area under the curve' (repeated measures). In confirmatory analyses, any aspects of multiplicity which remain after steps of this kind have been taken should be identified in the protocol; adjustment should always be considered and the details of any adjustment procedure or an explanation of why adjustment is not thought to be necessary should be set out in the analysis plan.
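This guideline does not prescribe a particular adjustment procedure. Purely as an illustration, the sketch below compares two common procedures that control the familywise type I error: the Bonferroni correction and the Holm step-down procedure. The function names and p-values are hypothetical.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject each hypothesis whose p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure: test p-values in ascending order against
    alpha / m, alpha / (m - 1), ..., alpha, stopping at the first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # all larger p-values are also not rejected
    return reject


# Hypothetical p-values for three primary variables.
p_values = [0.01, 0.02, 0.04]
print(bonferroni_reject(p_values))  # [True, False, False]
print(holm_reject(p_values))        # [True, True, True]
```

Holm's procedure rejects at least as many hypotheses as Bonferroni at the same familywise level, which is one reason the choice of procedure should be fixed in advance rather than after seeing the results.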

5.7 Subgroups, Interactions and Covariates

The primary variable(s) is often systematically related to other influences apart from treatment. For example, there may be relationships to covariates such as age and sex, or there may be differences between specific subgroups of subjects such as those treated at the different centres of a multicentre trial. In some instances an adjustment for the influence of covariates or for subgroup effects is an integral part of the planned analysis and hence should be set out in the protocol. Pre-trial deliberations should identify those covariates and factors expected to have an important influence on the primary variable(s), and should consider how to account for these in the analysis in order to improve precision and to compensate for any lack of balance between treatment groups. If one or more factors are used to stratify the design, it is appropriate to account for those factors in the analysis. When the potential value of an adjustment is in doubt, it is often advisable to nominate the unadjusted analysis as the one for primary attention, the adjusted analysis being supportive. Special attention should be paid to centre effects and to the role of baseline measurements of the primary variable. It is not advisable to adjust the main analyses for covariates measured after randomisation because they may be affected by the treatments.
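To show how adjustment for a baseline covariate can change an estimate, the sketch below computes the treatment difference from a one-covariate analysis of covariance with a common slope: adjusted difference = (mean outcome difference) - b x (mean baseline difference), where b is the pooled within-group regression slope. The function name and the data are hypothetical, and the data are constructed so that the outcome equals twice the baseline plus a treatment effect of 3.

```python
from statistics import mean


def ancova_adjusted_difference(x_control, y_control, x_treated, y_treated):
    """Treatment difference adjusted for one baseline covariate x, assuming
    a common within-group slope (classical one-covariate ANCOVA)."""
    sxy = sxx = 0.0
    for xs, ys in ((x_control, y_control), (x_treated, y_treated)):
        x_bar, y_bar = mean(xs), mean(ys)
        sxy += sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        sxx += sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx  # pooled within-group slope
    unadjusted = mean(y_treated) - mean(y_control)
    baseline_imbalance = mean(x_treated) - mean(x_control)
    return unadjusted - slope * baseline_imbalance


# Hypothetical data with a baseline imbalance between the groups.
x_control, y_control = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
x_treated, y_treated = [2.0, 3.0, 4.0], [7.0, 9.0, 11.0]

unadjusted = mean(y_treated) - mean(y_control)
adjusted = ancova_adjusted_difference(x_control, y_control, x_treated, y_treated)
print(unadjusted, adjusted)  # 5.0 3.0
```

In this constructed example the unadjusted difference of 5 overstates the effect because the treated group started with higher baseline values; the adjusted estimate recovers the built-in effect of 3.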

The treatment effect itself may also vary with subgroup or covariate; for example, the effect may decrease with age or may be larger in a particular diagnostic category of subjects. In some cases such interactions are anticipated or are of particular prior interest (e.g. geriatrics), and hence a subgroup analysis, or a statistical model including interactions, is part of the planned confirmatory analysis. In most cases, however, subgroup or interaction analyses are exploratory and should be clearly identified as such; they should explore the uniformity of any treatment effects found overall. In general, such analyses should proceed first through the addition of interaction terms to the statistical model in question, complemented by additional exploratory analysis within relevant subgroups of subjects, or within strata defined by the covariates. When exploratory, these analyses should be interpreted cautiously; any conclusion of treatment efficacy (or lack thereof) or safety based solely on exploratory subgroup analyses is unlikely to be accepted.

5.8 Integrity of Data and Computer Software Validity

The credibility of the numerical results of the analysis depends on the quality and validity of the methods and software (both internally and externally written) used both for data management (data entry, storage, verification, correction and retrieval) and also for processing the data statistically. Data management activities should therefore be based on thorough and effective standard operating procedures. The computer software used for data management and statistical analysis should be reliable, and documentation of appropriate software testing procedures should be available.

VI. EVALUATION OF SAFETY AND TOLERABILITY

6.1 Scope of Evaluation

In all clinical trials evaluation of safety and tolerability (see Glossary) constitutes an important element. In early phases this evaluation is mostly of an exploratory nature, and is only sensitive to frank expressions of toxicity, whereas in later phases the establishment of the safety and tolerability profile of a drug can be characterised more fully in larger samples of subjects. Later phase controlled trials represent an important means of exploring in an unbiased manner any new potential adverse effects, even if such trials generally lack power in this respect.

Certain trials may be designed with the purpose of making specific claims about superiority or equivalence with regard to safety and tolerability compared to another drug or to another dose of the investigational drug. Such specific claims should be supported by relevant evidence from confirmatory trials, similar to that necessary for corresponding efficacy claims.

6.2 Choice of Variables and Data Collection

In any clinical trial the methods and measurements chosen to evaluate the safety and tolerability of a drug will depend on a number of factors, including knowledge of the adverse effects of closely related drugs, information from non-clinical and earlier clinical trials and possible consequences of the pharmacodynamic/pharmacokinetic properties of the particular drug, the mode of administration, the type of subjects to be studied, and the duration of the trial. Laboratory tests concerning clinical chemistry and haematology, vital signs, and clinical adverse events (diseases, signs and symptoms) usually form the main body of the safety and tolerability data. The occurrence of serious adverse events and treatment discontinuations due to adverse events are particularly important to register (see ICH E2A and ICH E3).

Furthermore, it is recommended that a consistent methodology be used for the data collection and evaluation throughout a clinical trial program in order to facilitate the combining of data from different trials. The use of a common adverse event dictionary is particularly important. This dictionary has a structure which allows adverse event data to be summarised at three different levels: system-organ class, preferred term or included term (see Glossary). The preferred term is the level at which adverse events are usually summarised, and preferred terms belonging to the same system-organ class could then be brought together in the descriptive presentation of data (see ICH M1).

6.3 Set of Subjects to be Evaluated and Presentation of Data

For the overall safety and tolerability assessment, the set of subjects to be summarised is usually defined as those subjects who received at least one dose of the investigational drug. Safety and tolerability variables should be collected as comprehensively as possible from these subjects, including type of adverse event, severity, onset and duration (see ICH E2B). Additional safety and tolerability evaluations may be needed in specific subpopulations, such as females, the elderly (see ICH E7), the severely ill, or those who have a common concomitant treatment. These evaluations may need to address more specific issues (see ICH E3).

All safety and tolerability variables will need attention during evaluation, and the broad approach should be indicated in the protocol. All adverse events should be reported, whether or not they are considered to be related to treatment. All available data in the study population should be accounted for in the evaluation. Definitions of measurement units and reference ranges of laboratory variables should be made with care; if different units or different reference ranges appear in the same trial (e.g. if more than one laboratory is involved), then measurements should be appropriately standardised to allow a unified evaluation. Use of a toxicity grading scale should be prespecified and justified.
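As a minimal sketch of one way such standardisation might be carried out, the following Python rescales each laboratory result relative to the reference range of the laboratory that produced it, so that 0 corresponds to the lower limit and 1 to the upper limit. The function name, the example values and the choice of this particular rescaling are assumptions made for illustration only; this guidance does not prescribe a method, and whatever approach is used should be described in the protocol or statistical analysis plan.

```python
def standardise_to_reference_range(value, lower_limit, upper_limit):
    """Rescale a laboratory value so that its reference range maps onto [0, 1].

    Values below 0 lie under the lower limit; values above 1 lie over the
    upper limit. This is one illustrative convention, not a mandated one.
    """
    if upper_limit <= lower_limit:
        raise ValueError("upper_limit must exceed lower_limit")
    return (value - lower_limit) / (upper_limit - lower_limit)


# Hypothetical results for the same analyte from two laboratories whose
# reference ranges differ.
result_lab_a = standardise_to_reference_range(50.0, lower_limit=10.0, upper_limit=40.0)
result_lab_b = standardise_to_reference_range(0.9, lower_limit=0.2, upper_limit=0.7)

print(round(result_lab_a, 3), round(result_lab_b, 3))
```

After rescaling, results from both laboratories can be tabulated on a common scale, although the clinical meaning of a given standardised value still needs to be judged analyte by analyte.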

The incidence of a certain adverse event is usually expressed in the form of a proportion relating the number of subjects experiencing events to the number of subjects at risk. However, it is not always self-evident how to assess incidence. For example, depending on the situation, the number of exposed subjects or the extent of exposure (in person-years) could be considered for the denominator. Whether the purpose of the calculation is to estimate a risk or to make a comparison between treatment groups, it is important that the definition is given in the protocol. This is especially important if long-term treatment is planned and a substantial proportion of treatment withdrawals or deaths are expected. For such situations survival analysis methods should be considered and cumulative adverse event rates calculated in order to avoid the risk of underestimation.
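The different denominators discussed above can be illustrated with a short Python sketch. It computes a crude proportion, an exposure-adjusted rate per person-year, and a cumulative incidence obtained as one minus the Kaplan-Meier estimate. The function names and the six-subject data set are hypothetical, and the Kaplan-Meier calculation assumes that withdrawals are non-informative censoring; where deaths act as a competing risk, other methods may be more suitable.

```python
def crude_incidence(n_with_event, n_exposed):
    """Proportion of exposed subjects who experienced the event."""
    return n_with_event / n_exposed


def exposure_adjusted_rate(n_with_event, person_years):
    """Number of subjects with the event per person-year of exposure."""
    return n_with_event / person_years


def km_cumulative_incidence(times, had_event):
    """One minus the Kaplan-Meier survival estimate at the end of follow-up.

    times: time to first event, or to withdrawal (censoring), per subject.
    had_event: True where the time is an event time, False where censored.
    """
    survival = 1.0
    event_times = sorted({t for t, e in zip(times, had_event) if e})
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        events = sum(1 for u, e in zip(times, had_event) if e and u == t)
        survival *= 1 - events / at_risk
    return 1 - survival


# Hypothetical data: times in years; two early withdrawals, two events.
times = [1, 1, 2, 3, 4, 4]
had_event = [False, False, True, False, True, False]

print(crude_incidence(sum(had_event), len(times)))         # 2 of 6 subjects
print(exposure_adjusted_rate(sum(had_event), sum(times)))  # 2 events / 15 person-years
print(km_cumulative_incidence(times, had_event))           # 0.625
```

In this small example the crude proportion (about 0.33) is well below the cumulative incidence (0.625), because subjects who withdrew early remain in the crude denominator even though they were no longer at risk of being observed.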

In situations when there is a substantial background noise of signs and symptoms (e.g. in psychiatric trials) one should consider ways of accounting for this in the estimation of risk for different adverse events. One such method is to make use of the 'treatment emergent' (see Glossary) concept in which adverse events are recorded only if they emerge or worsen relative to pretreatment baseline.

Other methods to reduce the effect of the background noise may also be appropriate such as ignoring adverse events of mild severity or requiring that an event should have been observed at repeated visits to qualify for inclusion in the numerator. Such methods should be explained and justified in the protocol.
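The 'treatment emergent' rule described above can be sketched in a few lines of Python. The severity categories, their ordering and the function name are assumptions introduced for illustration; the definition actually applied in a trial should be the one stated in its protocol.

```python
# Hypothetical ordered severity scale; "none" means the event was absent.
SEVERITY_RANK = {"none": 0, "mild": 1, "moderate": 2, "severe": 3}


def is_treatment_emergent(baseline_severity, on_treatment_severity):
    """Return True if the event emerged or worsened relative to baseline.

    An event counts when it is more severe on treatment than it was
    pre-treatment, which covers both new events (absent at baseline)
    and worsening of pre-existing ones.
    """
    return SEVERITY_RANK[on_treatment_severity] > SEVERITY_RANK[baseline_severity]


print(is_treatment_emergent("none", "mild"))          # True: new event
print(is_treatment_emergent("moderate", "moderate"))  # False: unchanged
print(is_treatment_emergent("mild", "severe"))        # True: worsened
```

A rule of this kind removes pre-existing signs and symptoms from the numerator, but it depends on the quality of the pre-treatment assessment.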

6.4 Statistical Evaluation

The investigation of safety and tolerability is a multidimensional problem. Although some specific adverse effects can usually be anticipated and specifically monitored for any drug, the range of possible adverse effects is very large, and new and unforeseeable effects are always possible. Further, an adverse event experienced after a protocol violation, such as use of an excluded medication, may introduce a bias. This background underlies the statistical difficulties associated with the analytical evaluation of safety and tolerability of drugs, and means that conclusive information from confirmatory clinical trials is the exception rather than the rule.

In most trials the safety and tolerability implications are best addressed by applying descriptive statistical methods to the data, supplemented by calculation of confidence intervals wherever this aids interpretation. It is also valuable to make use of graphical presentations in which patterns of adverse events are displayed both within treatment groups and within subjects.

The calculation of p-values is sometimes useful either as an aid to evaluating a specific difference of interest, or as a 'flagging' device applied to a large number of safety and tolerability variables to highlight differences worth further attention. This is particularly useful for laboratory data, which otherwise can be difficult to summarise appropriately. It is recommended that laboratory data be subjected to both a quantitative analysis, e.g. evaluation of treatment means, and a qualitative analysis in which the numbers of values above or below certain thresholds are counted.
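The two complementary views of laboratory data mentioned above might be sketched as follows: a quantitative summary (the group mean) alongside a qualitative one (counts of values outside thresholds). The function name, the thresholds and the data are hypothetical.

```python
from statistics import mean


def summarise_lab(values, lower_limit, upper_limit):
    """Summarise one laboratory variable within one treatment group.

    Returns the number of values, their mean (quantitative view) and the
    counts below and above the given thresholds (qualitative view).
    """
    return {
        "n": len(values),
        "mean": mean(values),
        "n_below": sum(1 for v in values if v < lower_limit),
        "n_above": sum(1 for v in values if v > upper_limit),
    }


# Hypothetical values for one treatment group, thresholds 10 and 40.
summary = summarise_lab([30.0, 35.0, 52.0, 8.0], lower_limit=10.0, upper_limit=40.0)
print(summary)
```

The mean alone can hide a small number of markedly abnormal values, which is why the counts are reported next to it.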

If hypothesis tests are used, statistical adjustments for multiplicity to quantify the type I error are appropriate, but the type II error is usually of more concern. Care should be taken when interpreting putative statistically significant findings when there is no multiplicity adjustment.
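As an illustration of a multiplicity adjustment, the sketch below applies Holm's step-down procedure to a set of p-values. This guidance does not name a particular procedure; Holm's method is used here purely as an example, and the function name and input values are assumptions.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the original order.

    The smallest p-value is multiplied by m, the next by m - 1, and so on;
    adjusted values are capped at 1 and forced to be non-decreasing in the
    order of the sorted raw p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, index in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[index])
        running_max = max(running_max, candidate)
        adjusted[index] = running_max
    return adjusted


# Hypothetical raw p-values for three safety variables.
print(holm_adjust([0.01, 0.04, 0.03]))
```

An adjustment of this kind controls the type I error across the family of tests at the cost of power, which is the trade-off against the type II error noted above.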

In the majority of trials investigators are seeking to establish that there are no clinically unacceptable differences in safety and tolerability compared with either a comparator drug or a placebo. As is the case for non-inferiority or equivalence evaluation of efficacy, the use of confidence intervals is preferred to hypothesis testing in this situation. In this way, the considerable imprecision often arising from low frequencies of occurrence is clearly demonstrated.
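To show how a confidence interval exposes the imprecision of a comparison based on few events, the following sketch computes a Wilson score interval for each group's adverse event proportion and combines them into an interval for the risk difference using the hybrid score approach commonly attributed to Newcombe. The choice of interval method, the function names and the counts are assumptions for illustration; other interval methods may be equally defensible.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(events, n, level=0.95):
    """Wilson score confidence interval for a single proportion."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p = events / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def risk_difference_interval(e1, n1, e2, n2, level=0.95):
    """Risk difference (group 1 minus group 2) with a hybrid score interval."""
    p1, p2 = e1 / n1, e2 / n2
    l1, u1 = wilson_interval(e1, n1, level)
    l2, u2 = wilson_interval(e2, n2, level)
    diff = p1 - p2
    lower = diff - sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = diff + sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return diff, lower, upper


# Hypothetical counts: 3 of 200 subjects on the investigational drug and
# 1 of 200 on placebo experienced the adverse event.
diff, lower, upper = risk_difference_interval(3, 200, 1, 200)
print(f"difference = {diff:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

With so few events the interval is wide relative to the observed difference and includes zero, so neither an excess risk nor its absence can be concluded from these counts alone.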

6.5 Integrated Summary

The safety and tolerability properties of a drug are commonly summarised across trials continuously during an investigational product’s development and in particular at the time of a marketing application. The usefulness of this summary, however, is dependent on adequate and well-controlled individual trials with high data quality.

The overall usefulness of a drug is always a question of balance between risk and benefit, and in a single trial such a perspective could also be considered, even if the assessment of risk/benefit is usually performed in the summary of the entire clinical trial program (see Section 7.2.2).

For more details on the reporting of safety and tolerability, see Chapter 12 of ICH E3.

7. REPORTING

7.1 Evaluation and Reporting

As stated in the Introduction, the structure and content of clinical study reports is the subject of ICH E3. That ICH guidance fully covers the reporting of statistical work, appropriately integrated with clinical and other material. The current section is therefore relatively brief.

During the planning phase of a trial the principal features of the analysis should have been specified in the protocol as described in Section 5. When the conduct of the trial is over and the data are assembled and available for preliminary inspection, it is valuable to carry out the blind review of the planned analysis also described in Section 5. This pre-analysis review, blinded to treatment, should cover decisions concerning, for example, the exclusion of subjects or data from the analysis sets; possible transformations may also be checked, and outliers defined; important covariates identified in other recent research may be added to the model; the use of parametric or non-parametric methods may be reconsidered. Decisions made at this time should be described in the report, and should be distinguished from those made after the statistician has had access to the treatment codes, as blind decisions will generally introduce less potential for bias. Statisticians or other staff involved in unblinded interim analysis should not participate in the blind review or in making modifications to the statistical analysis plan. When the blinding is compromised by the possibility that treatment induced effects may be apparent in the data, special care will be needed for the blind review.

Many of the more detailed aspects of presentation and tabulation should be finalised at or about the time of the blind review so that by the time of the actual analysis full plans exist for all its aspects including subject selection, data selection and modification, data summary and tabulation, estimation and hypothesis testing. Once data validation is complete, the analysis should proceed according to the pre-defined plans; the more these plans are adhered to, the greater the credibility of the results. Particular attention should be paid to any differences between the planned analysis and the actual analysis as described in the protocol, protocol amendments or the updated statistical analysis plan based on a blind review of data. A careful explanation should be provided for deviations from the planned analysis.

All subjects who entered the trial should be accounted for in the report, whether or not they are included in the analysis. All reasons for exclusion from analysis should be documented; for any subject included in the full analysis set but not in the per protocol set, the reasons for exclusion from the latter should also be documented. Similarly, for all subjects included in an analysis set, the measurements of all important variables should be accounted for at all relevant time-points.

The effect of all losses of subjects or data, withdrawals from treatment and major protocol violations on the main analyses of the primary variable(s) should be considered carefully. Subjects lost to follow up, withdrawn from treatment, or with a severe protocol violation should be identified, and a descriptive analysis of them provided, including the reasons for their loss and its relationship to treatment and outcome.

Descriptive statistics form an indispensable part of reports. Suitable tables and/or graphical presentations should illustrate clearly the important features of the primary and secondary variables and of key prognostic and demographic variables. The results of the main analyses relating to the objectives of the trial should be the subject of particularly careful descriptive presentation. When reporting the results of significance tests, precise p-values (e.g. 'p=0.034') should be reported rather than making exclusive reference to critical values.

Although the primary goal of the analysis of a clinical trial should be to answer the questions posed by its main objectives, new questions based on the observed data may well emerge during the unblinded analysis. Additional and perhaps complex statistical analysis may be the consequence. This additional work should be strictly distinguished in the report from work which was planned in the protocol.

The play of chance may lead to unforeseen imbalances between the treatment groups in terms of baseline measurements not pre-defined as covariates in the planned analysis but having some prognostic importance nevertheless. This is best dealt with by showing that an additional analysis which accounts for these imbalances reaches essentially the same conclusions as the planned analysis. If this is not the case, the effect of the imbalances on the conclusions should be discussed.

In general, sparing use should be made of unplanned analyses. Such analyses are often carried out when it is thought that the treatment effect may vary according to some other factor or factors. An attempt may then be made to identify subgroups of subjects for whom the effect is particularly beneficial. The potential dangers of over-interpretation of unplanned subgroup analyses are well known (see also Section 5.7), and should be carefully avoided. Although similar problems of interpretation arise if a treatment appears to have no benefit, or an adverse effect, in a subgroup of subjects, such possibilities should be properly assessed and should therefore be reported.

Finally, statistical judgement should be brought to bear on the analysis, interpretation and presentation of the results of a clinical trial. To this end the trial statistician should be a member of the team responsible for the clinical study report, and should approve the clinical report.

7.2 Summarising the Clinical Database

An overall summary and synthesis of the evidence on safety and efficacy from all the reported clinical trials is required for a marketing application (Expert report in EU, integrated summary reports in USA, Gaiyo in Japan). This may be accompanied, when appropriate, by a statistical combination of results.

Within the summary a number of areas of specific statistical interest arise: describing the demography and clinical features of the population treated during the course of the clinical trial programme; addressing the key questions of efficacy by considering the results of the relevant (usually controlled) trials and highlighting the degree to which they reinforce or contradict each other; summarising the safety information available from the combined database of all the trials whose results contribute to the marketing application and identifying potential safety issues.

During the design of a clinical programme careful attention should be paid to the uniform definition and collection of measurements which will facilitate subsequent interpretation of the series of trials, particularly if they are likely to be combined across trials. A common dictionary for recording the details of medication, medical history and adverse events should be selected and used. A common definition of the primary and secondary variables is nearly always worthwhile, and essential for meta-analysis. The manner of measuring key efficacy variables, the timing of assessments relative to randomisation/entry, the handling of protocol violators and deviators and perhaps the definition of prognostic factors, should all be kept compatible unless there are valid reasons not to do so.

Any statistical procedures used to combine data across trials should be described in detail. Attention should be paid to the possibility of bias associated with the selection of trials, to the homogeneity of their results, and to the proper modelling of the various sources of variation. The sensitivity of conclusions to the assumptions and selections made should be explored.
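One common way to examine the homogeneity of results across trials is Cochran's Q statistic together with the derived I² measure. The sketch below is a minimal illustration under the assumption that each trial contributes an effect estimate and its standard error on a common scale; the function name and inputs are hypothetical, and this guidance does not mandate any specific heterogeneity measure.

```python
def cochran_q_and_i2(estimates, standard_errors):
    """Return Cochran's Q and I-squared for a set of trial estimates.

    Q is the weighted sum of squared deviations of the trial estimates
    from their inverse-variance weighted mean. I-squared is the share of
    Q in excess of its degrees of freedom, floored at zero.
    """
    weights = [1 / se ** 2 for se in standard_errors]
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, i2


# Hypothetical treatment differences and standard errors from two trials.
q, i2 = cochran_q_and_i2([0.0, 2.0], [1.0, 1.0])
print(q, i2)  # 2.0 0.5
```

With few trials the Q statistic has low power, so an unremarkable value should not be taken as evidence that the trial results are homogeneous.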

7.2.1 Efficacy Data

Individual clinical trials should always be large enough to satisfy their objectives. Additional valuable information may also be gained by summarising a series of clinical trials which address essentially identical key efficacy questions. The main results of such a set of trials should be presented in an identical form to permit comparison, usually in tables or graphs which focus on estimates plus confidence limits. The use of meta-analytic techniques to combine these estimates is often a useful addition, because it allows a more precise overall estimate of the size of the treatment effects to be generated, and provides a complete and concise summary of the results of the trials. Under exceptional circumstances a meta-analytic approach may also be the most appropriate way, or the only way, of providing sufficient overall evidence of efficacy via an overall hypothesis test. When used for this purpose the meta-analysis should have its own prospectively written protocol.
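The gain in precision from combining estimates can be illustrated with a fixed-effect, inverse-variance weighted pooling of trial results. This is only one possible model, chosen here for brevity; a random-effects model may be more appropriate when results are heterogeneous. The function name and the example estimates are assumptions.

```python
from math import sqrt
from statistics import NormalDist


def fixed_effect_pool(estimates, standard_errors, level=0.95):
    """Inverse-variance weighted pooled estimate with a confidence interval.

    Each trial is weighted by the reciprocal of its squared standard error,
    so more precise trials contribute more to the pooled estimate.
    """
    weights = [1 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, estimates)) / total_weight
    pooled_se = sqrt(1 / total_weight)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return pooled, pooled - z * pooled_se, pooled + z * pooled_se


# Hypothetical treatment differences and standard errors from two trials.
pooled, lower, upper = fixed_effect_pool([1.0, 3.0], [1.0, 1.0])
print(f"pooled = {pooled:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

In this example the pooled standard error is about 0.71, compared with 1.0 for each trial on its own, which is the sense in which the combined estimate is more precise.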

7.2.2 Safety Data

In summarising safety data it is important to examine the safety database thoroughly for any indications of potential toxicity, and to follow up any indications by looking for an associated supportive pattern of observations. The combination of the safety data from all human exposure to the drug provides an important source of information, because its larger sample size provides the best chance of detecting the rarer adverse events and, perhaps, of estimating their approximate incidence. However, incidence data from this database are difficult to evaluate because of the lack of a comparator group, and data from comparative trials are especially valuable in overcoming this difficulty. The results from trials which use a common comparator (placebo or specific active comparator) should be combined and presented separately for each comparator providing sufficient data.

All indications of potential toxicity arising from exploration of the data should be reported. The evaluation of the reality of these potential adverse effects should take account of the issue of multiplicity arising from the numerous comparisons made. The evaluation should also make appropriate use of survival analysis methods to exploit the potential relationship of the incidence of adverse events to duration of exposure and/or follow-up. The risks associated with identified adverse effects should be appropriately quantified to allow a proper assessment of the risk/benefit relationship.

GLOSSARY

Bayesian Approaches: Approaches to data analysis that provide a posterior probability distribution for some parameter (e.g. treatment effect), derived from the observed data and a prior probability distribution for the parameter. The posterior distribution is then used as the basis for statistical inference.
Bias (Statistical & Operational): The systematic tendency of any factors associated with the design, conduct, analysis and evaluation of the results of a clinical trial to make the estimate of a treatment effect deviate from its true value. Bias introduced through deviations in conduct is referred to as 'operational' bias. The other sources of bias listed above are referred to as 'statistical'.
Blind Review: The checking and assessment of data during the period of time between trial completion (the last observation on the last subject) and the breaking of the blind, for the purpose of finalising the planned analysis.
Content Validity: The extent to which a variable (e.g. a rating scale) measures what it is supposed to measure.
Double-Dummy: A technique for retaining the blind when administering supplies in a clinical trial, when the two treatments cannot be made identical. Supplies are prepared for Treatment A (active and indistinguishable placebo) and for Treatment B (active and indistinguishable placebo). Subjects then take two sets of treatment; either A (active) and B (placebo), or A (placebo) and B (active).
Dropout: A subject in a clinical trial who for any reason fails to continue in the trial until the last visit required of him/her by the study protocol.
Equivalence Trial: A trial with the primary objective of showing that the response to two or more treatments differs by an amount which is clinically unimportant. This is usually demonstrated by showing that the true treatment difference is likely to lie between a lower and an upper equivalence margin of clinically acceptable differences.
Frequentist Methods: Statistical methods, such as significance tests and confidence intervals, which can be interpreted in terms of the frequency of certain outcomes occurring in hypothetical repeated realisations of the same experimental situation.
Full Analysis Set: The set of subjects that is as close as possible to the ideal implied by the intention-to-treat principle. It is derived from the set of all randomised subjects by minimal and justified elimination of subjects.
Generalisability, Generalisation: The extent to which the findings of a clinical trial can be reliably extrapolated from the subjects who participated in the trial to a broader patient population and a broader range of clinical settings.
Global Assessment Variable: A single variable, usually a scale of ordered categorical ratings, which integrates objective variables and the investigator's overall impression about the state or change in state of a subject.
Independent Data Monitoring Committee (IDMC) (Data and Safety Monitoring Board, Monitoring Committee, Data Monitoring Committee): An independent data-monitoring committee that may be established by the sponsor to assess at intervals the progress of a clinical trial, the safety data, and the critical efficacy endpoints, and to recommend to the sponsor whether to continue, modify, or stop a trial.
Intention-To-Treat Principle: The principle that asserts that the effect of a treatment policy can be best assessed by evaluating on the basis of the intention to treat a subject (i.e. the planned treatment regimen) rather than the actual treatment given. It has the consequence that subjects allocated to a treatment group should be followed up, assessed and analysed as members of that group irrespective of their compliance to the planned course of treatment.
Interaction (Qualitative & Quantitative): The situation in which a treatment contrast (e.g. difference between investigational product and control) is dependent on another factor (e.g. centre). A quantitative interaction refers to the case where the magnitude of the contrast differs at the different levels of the factor, whereas for a qualitative interaction the direction of the contrast differs for at least one level of the factor.
Inter-Rater Reliability: The property of yielding equivalent results when used by different raters on different occasions.
Intra-Rater Reliability: The property of yielding equivalent results when used by the same rater on different occasions.
Interim Analysis: Any analysis intended to compare treatment arms with respect to efficacy or safety at any time prior to the formal completion of a trial.
Meta-Analysis: The formal evaluation of the quantitative evidence from two or more trials bearing on the same question. This most commonly involves the statistical combination of summary statistics from the various trials, but the term is sometimes also used to refer to the combination of the raw data.
Multicentre Trial: A clinical trial conducted according to a single protocol but at more than one site, and therefore, carried out by more than one investigator.
Non-Inferiority Trial: A trial with the primary objective of showing that the response to the investigational product is not clinically inferior to a comparative agent (active or placebo control).
Preferred and Included Terms: In a hierarchical medical dictionary, for example MedDRA, the included term is the lowest level of dictionary term to which the investigator description is coded. The preferred term is the level of grouping of included terms typically used in reporting frequency of occurrence. For example, the investigator text “Pain in the left arm” might be coded to the included term “Joint pain”, which is reported at the preferred term level as “Arthralgia”.
Per Protocol Set (Valid Cases, Efficacy Sample, Evaluable Subjects Sample): The set of data generated by the subset of subjects who complied with the protocol sufficiently to ensure that these data would be likely to exhibit the effects of treatment, according to the underlying scientific model. Compliance covers such considerations as exposure to treatment, availability of measurements and absence of major protocol violations.
Safety & Tolerability: The safety of a medical product concerns the medical risk to the subject, usually assessed in a clinical trial by laboratory tests (including clinical chemistry and haematology), vital signs, clinical adverse events (diseases, signs and symptoms), and other special safety tests (e.g. ECGs, ophthalmology). The tolerability of the medical product represents the degree to which overt adverse effects can be tolerated by the subject.
Statistical Analysis Plan: A statistical analysis plan is a document that contains a more technical and detailed elaboration of the principal features of the analysis described in the protocol, and includes detailed procedures for executing the statistical analysis of the primary and secondary variables and other data.
Superiority Trial: A trial with the primary objective of showing that the response to the investigational product is superior to a comparative agent (active or placebo control).
Surrogate Variable: A variable that provides an indirect measurement of effect in situations where direct measurement of clinical effect is not feasible or practical.
Treatment Effect: An effect attributed to a treatment in a clinical trial. In most clinical trials the treatment effect of interest is a comparison (or contrast) of two or more treatments.
Treatment Emergent: An event that emerges during treatment having been absent pre-treatment, or worsens relative to the pre-treatment state.
Trial Statistician: A statistician who has a combination of education/training and experience sufficient to implement the principles in this guidance and who is responsible for the statistical aspects of the trial.