Comparison of Two or More Correlated AUCs in Paired Sample Design

Purpose of study: Methods for comparing the accuracy of diagnostic tests are increasingly necessary in biomedical science. When a test result is measured on a continuous scale, the overall performance of the test can be assessed using the Receiver Operating Characteristic (ROC) curve. This curve describes the ability of a diagnostic test to discriminate diseased subjects from non-diseased subjects. The area under the ROC curve (AUC) is the probability that a randomly chosen diseased subject will have a higher test result than a randomly chosen non-diseased subject. For comparing two or more diagnostic test results, the difference between AUCs is often used. This paper proposes a non-parametric alternative method of comparing two or more correlated areas under the curve (AUCs) of diagnostic tests for paired sample data. The method is based on the chi-square test statistic.
Methods: This paper investigated both parametric and non-parametric methods of testing the equality of two AUCs and proposed a chi-square test for the comparison of two or more diagnostic tests. Unlike other existing methods, the proposed method does not require knowledge of the true status of subjects (a gold standard) in evaluating the accuracy of tests. The proposed method is most suitable for a paired sample design. It also offers reliable statistical inference even in small samples and circumvents the difficulty of deriving the statistical moments of complex summary statistics, as required in the DeLong method. The proposed method also provides for further analysis to determine the possible reason for rejecting the null hypothesis of equality of AUCs.
Results: When applied to real data, the proposed method avoids the lengthy and more difficult procedure of estimating the variances of two AUCs in order to determine whether the two AUCs differ significantly. The method was validated using the Cochran Q test and was shown to compare favourably.
The proposed method is recommended for comparing two or more correlated AUCs when the data are paired. It is simple and, unlike other existing methods, does not require prior knowledge of the true status of subjects.


INTRODUCTION
The performance of a diagnostic test with a binary predictor can be evaluated using sensitivity and specificity (Mandrekar, 2010). In statistical methods for diagnosis, comparing measures of diagnostic accuracy such as sensitivity and specificity, given prior knowledge of the true disease status, is a topic of continuing interest (Senaratna et al, 2015). In the medical sciences, diagnostic procedures are based on clinical investigations, laboratory experiments or trials whose purpose is to classify subjects as diseased or non-diseased. These procedures support vital decision making, aided by advanced machines and tools for detecting a given condition. However, in many instances we encounter predictors that are measured on a continuous or ordinal scale. In such cases, it is desirable to assess the performance of a diagnostic test over the range of possible cut-points for the predictor variable. This is achieved by a receiver operating characteristic (ROC) curve, which includes all the possible decision thresholds from a diagnostic test result (Mandrekar, 2010). For decades, ROC analysis has been a popular technique for evaluating the ability of a test to discriminate between alternative health states (Kummar and Indrayan, 2011). The ROC curve is a graph of sensitivity against 1-specificity across the cut-off values of a diagnostic test. It assesses how effectively continuous diagnostic test results differentiate between groups of healthy and diseased individuals (Greiner et al., 2000; Zhou et al., 2002; Pepe, 2004). It is also a common tool for assessing the performance of classification tools such as diagnostic tests, and for comparing accuracy between tests or predictive models.
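The construction of an empirical ROC curve described above can be sketched in a few lines of Python. This is only an illustration: the function name `roc_points`, the coding of labels (1 = diseased, 0 = non-diseased), the rule that a result at or above the cut-off is called positive, and the example values are all assumptions made here, not part of the study.

```python
def roc_points(scores, labels):
    """Return (1 - specificity, sensitivity) pairs, one per cut-off.

    Assumed coding: labels are 1 (diseased) or 0 (non-diseased), and a
    subject is called positive when its score is >= the cut-off.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # cut-off above every score: nobody is positive
    for cut in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical test results for three diseased and three healthy subjects.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
pts = roc_points(scores, labels)
for fpr, tpr in pts:
    print(f"1-specificity={fpr:.2f}  sensitivity={tpr:.2f}")
```

Each distinct test value is used as a cut-off, so the curve starts at (0, 0) and ends at (1, 1) as the threshold is lowered.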
The ROC curve originated in the theory of signal detection in the years 1950-1960 (Green and Swets, 1966; Egan, 1975), where it was used to discriminate between signal and noise. It can provide a direct and visual comparison of two or more diagnostic tests on a single set of scales. By constructing the ROC curves, it is possible to compare different tests at all decision cut-offs. For statistical analysis, a numerical index of accuracy associated with an ROC curve is recommended to summarize the information in the ROC curve as a single global value (Swets and Picket, 1982). This index, called the area under the ROC curve (AUC), is the most popular summary measure of ROC curves (Honghu-Liu, 2005; Pepe, 2003). However, in many studies involving paired sample designs, the positive and negative predictive values have been estimated and compared as measures of diagnostic accuracy (Leisenring et al, 2000; Moskowitz and Pepe, 2006; Wang et al, 2006). Correlated AUCs from alternative diagnostic tests have also been used in comparing the accuracy of test results (Krzanowski and Hand, 2009; Pepe, 2003). Hanley and McNeil (1982) first set out the theory for comparing two independent AUCs. This work was extended by Hanley and McNeil (1983) to the comparison of two correlated AUCs, as induced by paired sample data. Hanley and McNeil (1983) used Wilcoxon's non-parametric method for estimating the AUC and its standard error, while DeLong, DeLong and Clarke-Pearson (1988), in comparing correlated AUCs, used the Mann-Whitney method for estimating the AUC and its standard error. Park, Goo and Jo (2004) as well as Hanley and McNeil (1983) pointed out that the trapezoidal rule (Mann-Whitney test), as a non-parametric method, underestimates the AUC, and instead used the Dorfman and Alf (1969) method of maximum likelihood estimation for estimating the AUC, mainly for comparing independent AUCs.
Metz, Wang, and Kronman (1984) extended this comparison to two correlated AUCs. Furthermore, ROC curves generated from data in which each patient is subjected to two (or more) different diagnostic tests of interest are considered correlated ROC curves (Mandrekar, 2010). Similarly, in paired designs, the estimation and comparison of certain measures of diagnostic accuracy, such as the positive and negative predictive values, have been the subject of several studies (Moskowitz and Pepe, 2006; Wang et al, 2006).
In this paper, we propose a nonparametric method based on the chi-square test for comparing two or more correlated AUCs when the diagnostic test results are paired and the true disease status is unknown. The motivation is that variation between subjects represents a major component of the overall variation of the AUC. To better control for this source of variation when comparing diagnostic tests, a paired study design is often advised, because it usually induces positive correlation between the test results of the same subjects.
To carry out a significance test for the differences between two or more correlated AUCs, it is necessary to consider the distribution of the outcome, which also determines the procedure to be adopted in estimating the AUCs and their variance-covariance matrix. Three possible procedures are the parametric, semi-parametric and non-parametric methods. Two nonparametric methods are well established in the literature for comparing correlated AUCs. These are Hanley and McNeil (1983) and DeLong, DeLong and Clarke-Pearson (1988). For these methods, the AUC and its variance-covariance matrix are estimated using Wilcoxon's method and the Mann-Whitney method respectively. Different ways of estimating the AUC have been used for each method. For instance, in the parametric approach, the Dorfman and Alf (1969) method of fitting smooth curves under the binormal assumption is used, in which the ROC curve is completely described by two parameters estimated by maximum likelihood estimation (MLE). A review of some of the existing methods for comparing AUCs is outlined here.

Parametric (Binormal ROC Curve) Method
The parametric analysis assuming the binormal model was developed by Dorfman and Alf Jr. (1969) and McClish (1989), and was later implemented and further developed by Metz et al (1998). To compare the AUCs of two diagnostic tests in a paired sample design, and given the viability of the binormal assumption according to McClish (1989), the hypothesis of equality of the two AUCs, denoted $AUC_1$ and $AUC_2$, can be tested using the test statistic

$$Z = \frac{\widehat{AUC}_1 - \widehat{AUC}_2}{\sqrt{SE^2(\widehat{AUC}_1) + SE^2(\widehat{AUC}_2) - 2\,r\,SE(\widehat{AUC}_1)\,SE(\widehat{AUC}_2)}} \qquad (1)$$

The variance of each AUC can be estimated by substituting estimators for the parameters a1 and a2.
In equation 1, the term $r\,SE(\widehat{AUC}_1)\,SE(\widehat{AUC}_2)$ (Metz et al, 1984) is an estimate of the covariance between the two correlated AUCs in the parametric approach to the comparative study of two diagnostic procedures, where $r$ and $SE$ denote the correlation coefficient between the two estimated AUCs and the standard error (i.e. the square root of the variance) of an estimated AUC, respectively. If the two diagnostic tests are not examined on the same subjects, the two estimated AUCs are independent and the covariance term is zero.
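As a rough illustration of the test just described, the sketch below computes the standard z statistic for two correlated AUCs, z = (AUC1 - AUC2) / sqrt(SE1^2 + SE2^2 - 2 r SE1 SE2), together with a two-sided normal p-value. The function name `correlated_auc_z_test` and all numerical inputs are hypothetical; in practice the AUCs, standard errors and correlation would come from a fitted binormal model.

```python
import math


def correlated_auc_z_test(auc1, auc2, se1, se2, r):
    """z test for the difference of two correlated AUC estimates.

    r is the correlation between the two estimates; r = 0 recovers the
    test for independent AUCs (the covariance term vanishes).
    """
    var_diff = se1 ** 2 + se2 ** 2 - 2.0 * r * se1 * se2
    z = (auc1 - auc2) / math.sqrt(var_diff)
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical inputs, chosen only to show the effect of the correlation.
z_paired, p_paired = correlated_auc_z_test(0.80, 0.70, 0.05, 0.05, r=0.5)
z_indep, p_indep = correlated_auc_z_test(0.80, 0.70, 0.05, 0.05, r=0.0)
print(z_paired, p_paired)
print(z_indep, p_indep)
```

With a positive correlation the variance of the difference shrinks, so the same difference in AUCs gives a larger z than it would for independent samples, which is the usual argument for a paired design.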

Non-parametric methods
Hanley and McNeil showed that the AUC has a meaningful interpretation as the Mann-Whitney U-statistic, so that the U-statistic is a nonparametric estimate of the AUC (Hajian-Tilaki, 2013). In addition, they proposed an exponential approximation to the SE of the nonparametric AUC (Hanley and McNeil, 1982). DeLong et al. also developed a nonparametric estimate of the SE of the AUC (DeLong et al, 1988). DeLong's method of components of the U-statistic and its SE has been well illustrated by Hanley and Hajian-Tilaki for a single diagnostic modality (Hanley and Hajian-Tilaki, 1997). DeLong et al (1988) developed a consistent empirical (nonparametric) estimator of the covariance matrix for several AUC estimators in a paired design. The conventional nonparametric test for comparing correlated AUCs proposed by DeLong et al. (1988) uses this consistent variance estimator and relies on asymptotic normality of the AUC estimator. An advantage of the DeLong method is that the covariance between two correlated AUCs can also be estimated from the components of the variance-covariance matrix (DeLong et al, 1988). Once the variances are estimated, one can calculate the AUCs for the two diagnostic tests and then make the comparison.
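The nonparametric approach above can be sketched as follows: the AUC is estimated by the Mann-Whitney statistic, and the variances and covariance are built from the structural components of that statistic, in the spirit of DeLong et al (1988). This is a simplified sketch for two tests only; it assumes the true disease status is known (as that method requires), and the function names and data are hypothetical.

```python
import math


def _psi(x, y):
    """Mann-Whitney kernel: 1 if x > y, 0.5 if tied, 0 otherwise."""
    return 1.0 if x > y else (0.5 if x == y else 0.0)


def _components(pos, neg):
    """AUC estimate plus per-subject components for one test."""
    v10 = [sum(_psi(x, y) for y in neg) / len(neg) for x in pos]
    v01 = [sum(_psi(x, y) for x in pos) / len(pos) for y in neg]
    return sum(v10) / len(pos), v10, v01


def _cov(a, b, mean_a, mean_b):
    """Sample covariance of two equally long component vectors."""
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / (len(a) - 1)


def delong_two_tests(pos1, neg1, pos2, neg2):
    """Compare two correlated AUCs; pos1[i] and pos2[i] are the same subject."""
    m, n = len(pos1), len(neg1)
    auc1, a10, a01 = _components(pos1, neg1)
    auc2, b10, b01 = _components(pos2, neg2)
    var1 = _cov(a10, a10, auc1, auc1) / m + _cov(a01, a01, auc1, auc1) / n
    var2 = _cov(b10, b10, auc2, auc2) / m + _cov(b01, b01, auc2, auc2) / n
    cov12 = _cov(a10, b10, auc1, auc2) / m + _cov(a01, b01, auc1, auc2) / n
    z = (auc1 - auc2) / math.sqrt(var1 + var2 - 2.0 * cov12)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return auc1, auc2, z, p_value


# Hypothetical paired results: four diseased and four non-diseased subjects.
pos1, neg1 = [0.9, 0.8, 0.6, 0.7], [0.5, 0.7, 0.3, 0.4]
pos2, neg2 = [0.8, 0.4, 0.7, 0.5], [0.6, 0.5, 0.2, 0.3]
auc1, auc2, z, p_value = delong_two_tests(pos1, neg1, pos2, neg2)
print(auc1, auc2, z, p_value)
```

The sketch makes visible how much bookkeeping the variance-covariance estimation requires even for two tests, and that it cannot be computed without the diseased and non-diseased groupings.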

PROPOSED CHI-SQUARE TEST STATISTIC FOR COMPARING THE EQUALITY OF TWO OR MORE CORRELATED AUCs
The aim is to develop a simple and easy to understand method of testing the equality of AUCs arising from two or more diagnostic tests. We propose a chi-square test for the comparison of two or more diagnostic tests based on continuous, ordinal or binary scale data. Given test results measured on a continuous scale, we dichotomize the results as positive or diseased (coded 1) and negative or non-diseased (coded 0) using a cut-off value c, and present the coded information in a contingency table.
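The coding step described above can be sketched as follows. The helper names, the rule that a result at or above the cut-off is coded 1, and the example values are assumptions for illustration only; they are not the study data.

```python
def dichotomize(results, cutoff):
    """Code each continuous result as 1 (positive) if >= cutoff, else 0.

    results[i][j] is the result of subject i on diagnostic test j.
    """
    return [[1 if x >= cutoff else 0 for x in row] for row in results]


def positive_proportions(coded):
    """Proportion of positive (coded 1) responses for each diagnostic test."""
    n = len(coded)
    return [sum(row[j] for row in coded) / n for j in range(len(coded[0]))]


# Hypothetical paired results for four subjects on two tests.
results = [[8.1, 7.9], [6.5, 7.8], [7.0, 6.9], [9.2, 8.4]]
coded = dichotomize(results, cutoff=7.8)
props = positive_proportions(coded)
print(coded)
print(props)
```

Each row of the coded matrix is one subject and each column one diagnostic test, which is the paired layout the proposed chi-square test works from.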
Suppose a random sample of n subjects is drawn from a population of subjects for this study, and let $x_{ij}$ be the sample test result for the ith subject on the jth diagnostic test, where i = 1, 2, …, n and j = 1, 2, …, T, with T the number of diagnostic tests.

SUBSEQUENT ANALYSIS IF NULL HYPOTHESIS IS REJECTED
When the null hypothesis of equation 19 is rejected, it means that differences exist in the AUCs, or equivalently in the proportions of positive response, across diagnostic tests. It is therefore of interest to determine which of the AUCs, or equivalently which of the proportions of positive response among the diagnostic tests, has contributed to the rejection of H0.
Here the product $y_{ij} y_{sk}$ assumes the value 1 provided $y_{ij}$ and $y_{sk}$ both assume the value 1, so that its expected value $E(y_{ij} y_{sk})$ is the probability of that joint event.
The corresponding test statistic is given as:

APPLICATION TO REAL DATA
The proposed method is applied to real data obtained from a retrospective study of pregnant women at risk of gestational diabetes mellitus (GDM) at certain hospitals in Ebonyi State, Nigeria. The records were taken of a total of 1113 pregnant women who had earlier tested positive on screening with the 1-hour 50g Glucose Challenge Test (GCT) and who were also diagnosed using the 2-hour 75g OGTT as well as the 3-hour 100g OGTT, according to the WHO (1999) and National Diabetic Data Group (NDDG, 1979) criteria. The aim was to compare the efficacy of these two diagnostic procedures. These pregnant women had positive risk factors and were aged between 15 and 45 years, at less than 24 weeks or between 24 and 28 weeks of gestation. Women who were known diabetics, or who were suffering from any chronic illness, were excluded from the study. After permission was obtained from the hospitals' Research and Ethics Committee, access was granted to the record units of the antenatal wards of these hospitals, where the medical histories of the patients were kept in a proforma. General information on demographic characteristics such as body mass index, maternal age and previous fetal weight, and vital clinical histories such as obstetric history of GDM and family history of diabetes, were taken.
The GDM response variables (test results) for the two tests, namely the 75g OGTT and the 100g OGTT, represent the paired data for the pregnant women. This type of data is suitable for comparing the accuracy of two tests in terms of their AUCs. Under this arrangement, the null hypothesis of interest, the equality of the proportions of positive response, is equivalent to the equality of the AUCs of the tests. This comparison will be evaluated using the proposed method.
The research interest is to compare two or more correlated AUCs of diagnostic tests, which is equivalent to comparing the probabilities of positive response in a paired sample design. To do this, the data were coded according to the specification of equation 7 to generate the corresponding data of 1's and 0's. In order to calculate the chi-square test statistic of equation 23 for testing the null hypothesis of no difference among the proportions of positive response (see equation 19) in a paired sample design, we summarize the data as in Table 3. The null hypothesis is rejected, which means that the proportions of positive response for the two diagnostic tests differ significantly. In other words, the two AUCs for the tests are different. From Table 3, the proportion of pregnant women who have GDM increased after the second diagnostic test.
Furthermore, our interest may be to determine which of the tests is superior. This is the same as carrying out further analysis to determine which of the tests may have contributed to the rejection of the null hypothesis of equality of AUCs in equation 19. This means that we need to test the null hypothesis of equation 25. From the result, $AUC_1 < AUC_2$. This means that the second diagnostic test (100g OGTT), that is $AUC_2$, is preferred to the first diagnostic test (75g OGTT), that is $AUC_1$, because the second test was able to discover a few more pregnant women who actually have a plasma glucose level of at least 7.8 mmol/l (GDM positive patients).

COMPARISON OF THE PROPOSED METHOD WITH DELONG ET AL (1988) METHOD
Using the same coded data meant for comparing AUCs, the estimates of the AUCs were obtained as 0.687 and 0.752 for the first and second diagnostic tests respectively. To test the null hypothesis of equation 19 for the homogeneity of AUCs, the non-parametric test of DeLong et al. (1988) and the proposed chi-square test both yielded significant results, with p-values of 0.0068 and 0.0027 respectively.

VALIDATION OF THE PROPOSED METHOD USING COCHRAN Q TEST
To validate the proposed method, we use the Cochran Q test for dichotomous data, since the same null hypothesis of equality of AUCs (proportions of diseased pregnant women) across diagnostic tests can suitably be tested with it. Using the same paired coded data, we let $B_i$ be the number of 1's in row i (the ith pregnant woman), where i = 1, 2, …, 1113, and let $Z_j$ and $Z_k$ be the numbers of 1's in columns j and k, where j is test 1 and k is test 2. Then the statistic for Cochran's Q test with two tests is given by

$$Q = \frac{(2-1)\left[\,2\left(Z_j^2 + Z_k^2\right) - \left(Z_j + Z_k\right)^2\right]}{2\sum_{i=1}^{1113} B_i - \sum_{i=1}^{1113} B_i^2}$$

which, under the null hypothesis, is approximately chi-square distributed with 1 degree of freedom. The chi-square test statistic is therefore recommended for comparing the equality of two or more correlated AUCs in paired sample design.
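For readers who want to reproduce this kind of check, the following sketch computes Cochran's Q from a matrix of 0/1 codes using the standard general formula Q = (T - 1)[T ΣZ_j² - (ΣZ_j)²] / [T ΣB_i - ΣB_i²], with row totals B_i and column totals Z_j. The function name and the six-subject data set are hypothetical; the p-value shortcut shown applies only to the two-test case (1 degree of freedom).

```python
import math


def cochran_q(coded):
    """Cochran's Q for an n x T matrix of 0/1 codes (rows = subjects)."""
    n_tests = len(coded[0])
    row_totals = [sum(row) for row in coded]                            # B_i
    col_totals = [sum(row[j] for row in coded) for j in range(n_tests)]  # Z_j
    grand_total = sum(col_totals)
    numerator = (n_tests - 1) * (
        n_tests * sum(z * z for z in col_totals) - grand_total ** 2
    )
    denominator = n_tests * grand_total - sum(b * b for b in row_totals)
    return numerator / denominator


# Hypothetical paired codes for six subjects on two tests.
coded = [[1, 1], [0, 1], [0, 1], [0, 0], [0, 1], [1, 1]]
q = cochran_q(coded)
# With two tests Q has 1 degree of freedom, so the chi-square tail
# probability equals a two-sided standard normal tail at sqrt(Q).
p_value = math.erfc(math.sqrt(q / 2.0))
print(q, p_value)
```

With two tests, Q reduces to the squared difference of the discordant counts divided by their sum (here (0 - 3)² / 3 = 3), which is the uncorrected McNemar statistic.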