Detection of Sex-Related Differential Item Functioning in Raven’s Standard Progressive Matrices Test Using the Mantel-Haenszel Method

This study aimed to determine the presence of sex-related differential item functioning (DIF) in Raven's Standard Progressive Matrices (SPM) Test. The research design was a comparative study, in which boys formed the focal group and girls formed the reference group. Analyses were conducted in SPSS v26, and the binary data of the focal and reference groups were analyzed with jMetrik software to detect DIF according to the Mantel-Haenszel method. A total of 1032 students (49.6% boys and 50.4% girls), 570 from intermediate school and 462 from secondary school, were selected from 24 schools using a stratified random-sampling procedure. The statistical analyses showed that the SPM Test was one-dimensional. The results showed that, of five moderate DIF items, four favored the focal group (boys) and one favored the reference group (girls). The results also showed two large DIF items: one favored the reference group (girls), and the other favored the focal group (boys). The findings showed that most items were unbiased, but some were clearly biased against female performance. According to these findings, we recommend reanalyzing the data using methods based on item response theory, such as the logistic regression, simultaneous item bias test (SIB), or IRT-likelihood ratio (IRT-LR) methods, to confirm the results seen here.


Introduction
Tests used in education and psychology for various purposes should meet specific standards, including validity, reliability, and practicality. These characteristics are not only the fundamental principles of measurement but also social values applied by decision-makers alongside measurement. In this regard, the items in a test should not provide advantages or disadvantages to any subgroup at the same ability level; otherwise, the test will be biased toward specific groups (Messick, 1995). A test is used as a data- and information-collection tool for making educational decisions and judgments. It is a measuring instrument containing a set of stimuli that represent the trait or ability to be measured, and researchers building and developing tests have focused their efforts on estimating item properties such as difficulty, discrimination, and guessing. Despite the importance of these features, they are not sufficient for judging the validity of test items for their designated purpose, because responses to test items may be affected by factors other than the ability of the examined individuals, such as gender, race, place of residence, language, or socioeconomic status, all of which may affect the test results negatively and, subsequently, the decisions based on them. In such cases, the items are described as functioning differentially toward one group over others (Jensen, 1980). Roever (2005) mentioned that interest in differential item functioning in intelligence tests started at the beginning of the twentieth century, when Binet discovered by chance that the average scores of the upper economic class on certain test items were higher than those of the lower economic class. He then reviewed the content of these items and discovered that some of them were affected by the socioeconomic status of the examined students (biased toward the upper class).
Subsequently, the biased items were removed from the test, and a new amended version was issued.
There are several meta-analyses demonstrating sex differences in some cognitive abilities. "The first meta-analysis showed that male students outperform female students in spatial and mathematical ability, but that female students outperform male students in verbal ability" (Francisco et al., 2004, p. 1). Hyde et al. (1990) found a male advantage in quantitative ability, but those researchers noted that many quantitative items were expressed in spatial form. Linn and Petersen (1985) found a male advantage in spatial rotation, spatial relations, and visualization. Voyer et al. (1995) found the same male advantage in spatial ability, with the largest sex difference appearing in spatial rotation. Feingold (1988) found a male advantage in reasoning ability. Thus, research findings support the idea that the main sex difference may be attributed to overall spatial performance, in which male students outperform female students (Neisser et al., 1996). The findings of Colom and Garcia (2002) supported the view that information content plays a role in estimates of sex differences in general intelligence.
They concluded that researchers must be careful in selecting the markers of central abilities such as fluid intelligence, which is supposed to be the core of intelligent behavior.
Many studies focus on differences between male and female students in tests (for instance, Willingham and Cole, 1997; Gallagher et al., 2000). These studies indicate that male students have better spatial ability than female students, suggesting that male students use this spatial ability more often than females when solving problems, which can give them advantages on certain kinds of problems in geometry. Some studies also indicate that female students are better than their male counterparts in verbal skills, which can give them advantages on items where communication is important. Female students also score relatively higher on mathematics tests that more closely match course work (Willingham and Cole, 1997). In contrast, only a few studies have treated sex differences in Raven's Matrices tests using the DIF approach. Since 2010, some studies have demonstrated slight differential functioning in some items (Shibaev et al., 2020), but these studies were conducted on Raven's Colored Progressive Matrices (CPM) and Advanced Progressive Matrices (APM) rather than the SPM, which is the subject of this study.

Literature Review
The purpose of most standardized achievement tests is to distinguish among the ability levels of examinees and thereby rank-order individuals on some skill or trait. Ranking examinees accurately requires that all the items in a test discriminate among levels of the valid skill or purported ability. Problems are encountered when a test contains item(s) that also discriminate among levels of abilities other than the valid skill. Unfortunately, because ordering is a unidimensional concept, we cannot order examinees on two or more abilities at the same time unless we base our ranking on a composite of those abilities (Laveault et al., 1994).
Differential item functioning (DIF) refers to a psychometric difference in how an item functions for two groups. DIF refers to a difference in item performance between two comparable groups of examinees, that is, groups that are matched with respect to the construct being measured by the test. The comparison of matched or comparable groups is critical because it is important to distinguish differences in item functioning from differences between groups (Dorans & Holland, 1992). It is important to determine whether items have DIF for at least two reasons: the presence of DIF signals potential bias, and bias has an impact on the validity of inferences drawn from group comparisons (Lane, Wang, & Magone, 1996). Zumbo (1999) mentioned that there are two types of differential item functioning: uniform and non-uniform DIF. Uniform DIF appears when the probability of answering the item correctly is consistently higher in one group across all levels of ability; thus, there is no interaction between ability level and group membership. Non-uniform DIF appears when there is an interaction between ability level (θ) and group membership, which means that the pattern of differences in the probability of responding to the item is not the same at all levels of ability: the differences favor one group at one ability level and another group at a different ability level.
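As an illustration of the two patterns, the sketch below compares logistic response probabilities for a hypothetical item under uniform and non-uniform DIF. All parameter values (`b0`, `b_theta`, `b_group`, `b_inter`) are invented for demonstration and are not estimates from the study data.

```python
import numpy as np

def p_correct(theta, group, b0=-0.5, b_theta=1.2, b_group=0.0, b_inter=0.0):
    """Logistic probability of a correct response.

    theta: ability level; group: 0 = reference, 1 = focal.
    A nonzero b_group with b_inter == 0 produces uniform DIF;
    a nonzero b_inter (theta x group interaction) produces non-uniform DIF.
    """
    logit = b0 + b_theta * theta + b_group * group + b_inter * theta * group
    return 1.0 / (1.0 + np.exp(-logit))

theta = np.linspace(-3.0, 3.0, 7)

# Uniform DIF: the focal group's probability is lower at every ability level.
uniform_gap = p_correct(theta, 0) - p_correct(theta, 1, b_group=-0.8)

# Non-uniform DIF: the sign of the gap flips across the ability range.
nonuniform_gap = p_correct(theta, 0) - p_correct(theta, 1, b_inter=0.7)

print(np.all(uniform_gap > 0))                         # one group favored everywhere
print(nonuniform_gap[0] > 0, nonuniform_gap[-1] < 0)   # direction reverses
```

The interaction term is exactly what distinguishes the two cases: with `b_inter = 0` the gap between the groups keeps one sign across all of θ, while a nonzero interaction makes the favored group depend on the ability level.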

Mantel-Haenszel (MH) Method
The Mantel-Haenszel (MH) method is a statistically powerful technique for detecting DIF. The MH procedure was first developed by Mantel and Haenszel in 1959 and was first used as a method to detect DIF by Holland and Thayer in 1988 (Holland & Wainer, 1993). The Mantel-Haenszel procedure estimates the common odds ratio $\alpha_{MH}$ across all matched score categories. The index takes the form

$$\hat{\alpha}_{MH} = \frac{\sum_i p_{Ri}\, q_{Fi}\, n_{Ri}\, n_{Fi} / N_i}{\sum_i q_{Ri}\, p_{Fi}\, n_{Ri}\, n_{Fi} / N_i},$$

where $p_{Ri}$ is the proportion of the reference group in score interval $i$ who answered the item correctly and $q_{Ri} = 1 - p_{Ri}$; similarly, $p_{Fi}$ is the proportion of the focal group who answered the item correctly and $q_{Fi} = 1 - p_{Fi}$; $n_{Ri}$ and $n_{Fi}$ are the numbers of reference- and focal-group examinees in interval $i$, and $N_i = n_{Ri} + n_{Fi}$. Thus $\alpha_{MH}$ is the ratio of the odds ($p/q$) that the reference-group students answered the item correctly to the odds that the focal-group students answered the item correctly. If there is no difference in the performance of the two groups on the item within a score interval, then $\alpha_{MH} = 1$; if the focal group performs better on the item than the reference group, then $\alpha_{MH} < 1$; if, on the other hand, the reference group performs better than the focal group, then $\alpha_{MH} > 1$. In other words, $\alpha_{MH}$ is the average factor by which the odds that a member of the reference group responds correctly to the item exceed the odds that a member of the focal group responds correctly. Note that the index weights each score interval by the number of cases it contains, and intervals in which the numbers of cases in the two groups are more nearly equal receive heavier weight. There is a chi-square test associated with the MH approach, a test of the null hypothesis $H_0: \alpha_{MH} = 1$:

$$\chi^2_{MH} = \frac{\left(\left|\sum_i A_i - \sum_i E(A_i)\right| - 0.5\right)^2}{\sum_i \mathrm{Var}(A_i)},$$

where $A_i$ is the number of reference-group examinees in interval $i$ who answered the item correctly, and the $-0.5$ serves as a continuity correction to improve the accuracy of the chi-square percentage points as approximations to the observed significance levels. The quantity $\chi^2_{MH}$ is distributed approximately as chi-square with one degree of freedom.
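A minimal sketch of these computations, assuming one 2×2 table of correct/incorrect counts per matched score level; the counts in the example are hypothetical, not from the study data:

```python
def mantel_haenszel(tables):
    """Mantel-Haenszel common odds ratio and chi-square for one item.

    `tables` holds one 2x2 table of counts per matched score level i:
        ((A_i, B_i),   # reference group: correct, incorrect
         (C_i, D_i))   # focal group:     correct, incorrect
    """
    num = den = a_obs = a_exp = a_var = 0.0
    for (a, b), (c, d) in tables:
        t = a + b + c + d                      # N_i, all examinees at level i
        num += a * d / t
        den += b * c / t
        n_ref, n_foc = a + b, c + d            # group sizes at this level
        m_right, m_wrong = a + c, b + d        # correct / incorrect margins
        a_obs += a
        a_exp += n_ref * m_right / t           # E(A_i) under H0: alpha = 1
        a_var += n_ref * n_foc * m_right * m_wrong / (t * t * (t - 1))
    alpha = num / den                          # common odds ratio
    chi2 = (abs(a_obs - a_exp) - 0.5) ** 2 / a_var   # continuity-corrected
    return alpha, chi2

# Two score levels where both groups perform identically: alpha should be 1
# and the chi-square should be negligible.
same = [((30, 10), (30, 10)), ((20, 20), (20, 20))]
alpha, chi2 = mantel_haenszel(same)
print(round(alpha, 4), chi2 < 1)   # 1.0 True
```

Note how each level contributes through its own total $N_i$, so levels where the two groups have more nearly equal counts contribute more weight, as described above.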
For the sake of convenience, $\alpha_{MH}$ is transformed to another scale, yielding an index referred to as MH D-DIF ($\Delta_{MH}$), by means of the conversion

$$\Delta_{MH} = -2.35 \ln \alpha_{MH}. \qquad (5)$$

This transformation centers the index about the value 0 (which corresponds to the absence of differential item functioning), puts it on a scale roughly comparable to the Educational Testing Service (ETS) delta scale of item difficulty, and reverses the index so that positive values of $\Delta_{MH}$ indicate that the item favors the focal group, while negative values indicate that the item favors the reference group and disfavors the focal group.
To use the $\Delta_{MH}$ measure to identify test items that exhibit varying degrees of DIF, a classification scheme was developed by ETS for use in test development that puts items into one of three categories: negligible DIF (A), moderate DIF (B), and large DIF (C). Items are classified as A for a particular combination of reference and focal groups if either $\Delta_{MH}$ is not statistically different from zero or the magnitude of $\Delta_{MH}$ is less than one delta unit in absolute value. Items are classified as C if $\Delta_{MH}$ both exceeds 1.5 in absolute value and is statistically significantly larger than 1 in absolute value. All other items are classified as category B. In both categories A and C, statistical significance is assessed at the 0.05 level for a single item. In this study, boys formed the focal group, while girls formed the reference group (Holland & Wainer, 1993).
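The delta conversion (formula 5) and the ETS A/B/C rule can be sketched as follows. The significance flags are passed in as precomputed booleans rather than derived from the data, and the example values are hypothetical:

```python
import math

def mh_delta(alpha):
    """MH D-DIF (formula 5): positive values favor the focal group."""
    return -2.35 * math.log(alpha)

def ets_category(delta, sig_vs_zero, sig_vs_one):
    """ETS A/B/C classification of DIF severity.

    sig_vs_zero: delta statistically different from 0 at the .05 level
    sig_vs_one:  |delta| statistically greater than 1 at the .05 level
    """
    if not sig_vs_zero or abs(delta) < 1.0:
        return "A"                        # negligible DIF
    if abs(delta) > 1.5 and sig_vs_one:
        return "C"                        # large DIF
    return "B"                            # moderate DIF

print(mh_delta(1.0) == 0.0)               # True: alpha = 1 means no DIF
print(ets_category(0.6, True, False))     # A
print(ets_category(-1.2, True, False))    # B
print(ets_category(2.1, True, True))      # C
```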

Study Problem and Questions:
Sex-differential performance on nonverbal ability tests is a cause for concern; however, in studies that examine such performance, it is challenging to determine whether significant differences between boys and girls on a nonverbal ability test are due to true differences in ability or to test-related factors such as item type. Therefore, this study addresses this gap by examining sex differences arising from item type and identifying characteristics of the content of Raven's Standard Progressive Matrices Test items that cause differential performance by sex. By detecting DIF according to the characteristics of the item content of Raven's SPM test, the issue of the apparently widening sex gap will be explained from a new perspective, that of item characteristics, using the Mantel-Haenszel method. Therefore, the research question is: "Which items show differential functioning on Raven's Standard Progressive Matrices Test according to the student's sex?"

The Significance of the Study:
The main objective of the study is to investigate which items show DIF for male and female students on the Raven's Standard Progressive Matrices (SPM) Test using the Mantel-Haenszel (MH) method, which is based on classical test theory (CTT). To the best of our knowledge, this study is one of the few that examines differential item functioning in the SPM test. It seeks evidence that the test's items are free of differential functioning, using a statistical method suited to detecting it (the Mantel-Haenszel method). The scientific significance of the study lies in the fact that its results may provide evidence of the suitability of the SPM test and the validity of its results for the functions it was designed for, including, but not limited to, achieving fairness in screening and accepting students into gifted-education programs regardless of sex.

Methodology

Study Design:
This is a comparative research study, in which girls form the reference group and boys form the focal group, because boys constitute the group of interest in this study. A total of 24 schools were selected randomly from the six educational districts in the State of Kuwait.

Study Community and Participants:
The study community consisted of all male and female students in the eighth grade (intermediate school) and the eleventh grade (secondary school) in the public schools of the Kuwaiti Ministry of Education (totaling 23,315 students according to the statistics supplied by the Ministry of Education for the academic year 2012/2013). Participants were 1032 students (49.6% boys and 50.4% girls), 570 from intermediate school and 462 from secondary school, ranging in age from 13 to 16 years. Each participant completed the SPM test. The mean SPM score for the total sample was 30.17 (SD = 6.46); the mean score for boys was 30.28 (SD = 6.05), and for girls it was 30.07 (SD = 6.85). All were Kuwaiti citizens and students in the governmental schools of the six districts of Kuwait. Twenty-four schools were selected from the six districts using a stratified random-sampling procedure.

Instrument:
The Raven's Standard Progressive Matrices (SPM) Test is one of the most widely used measures of cognitive ability, and SPM scores are considered among the best estimates of general intelligence. It is a nonverbal test designed to assess the ability to reason and solve new problems without relying extensively on declarative knowledge derived from schooling or previous experience, and it is one of the most well-known formal, broad intelligence tests. It was prepared by Raven in 1938 as a tool for measuring general intelligence, and in its original form it contained 60 matrices. It is also considered one of the best tests of the capacity for abstract (nonverbal) reasoning; it has good psychometric characteristics, upon which a large body of published scientific studies has been built, and it has been accepted and used on all five continents (Abdel-Khalek & Raven, 2006). In Kuwait, the Ministry of Education adjusted the test after studying its standardization: 12 items were deleted, and the order of the items was adjusted according to their difficulty.

Data Analyses:
Classical test theory was used in the data analysis. Reliability was assessed via internal consistency using the Kuder-Richardson formula (KR-20), a reliability measure for tests with binary variables. To evaluate construct validity, we conducted an exploratory factor analysis with principal component analysis as the extraction method and varimax with Kaiser normalization as the rotation method. A factor was considered important if its eigenvalue exceeded 1.0. The communality represents the percentage of variance of each item accounted for by all factors. A p-value of < 0.05 was considered statistically significant for all tests. All statistical analyses were conducted using the Statistical Package for the Social Sciences (SPSS) version 26, and the binary data of the focal and reference groups were analyzed using jMetrik 4 software (Meyer, 2014) to detect DIF according to the Mantel-Haenszel method; boys formed the focal group, while girls formed the reference group.

Validity and Reliability:
The validity of the SPM test was verified using exploratory factor analysis with the principal component method, followed by orthogonal rotation using the varimax method for all items, to provide a better explanation of the psychometric properties extracted before rotation. Factors were retained according to the Kaiser criterion (eigenvalues greater than 1), and 0.30 was considered the minimum significant factor loading for an item according to Guilford's scale. The results showed a Kaiser-Meyer-Olkin measure of 0.902 and Bartlett's test of sphericity of chi-square = 7226.500, p < 0.001, which indicated that the data in this study were suitable for factor analysis. Table 1 shows that the eigenvalue of the first factor was 7.109, explaining 14.81% of the total variance, and the eigenvalue of the second factor was 2.075, explaining 4.32% of the total variance. Dividing the eigenvalue of the first factor by that of the second gives 3.43 (i.e., greater than 2, which is considered an indicator of one-dimensionality (Hattie, 1985)). This means that the SPM test is loaded on one general factor; it appears that this measure of reasoning ability does not require other cognitive abilities to any significant degree. For further illustration, Fig. 2 shows the scree plot indicating the unidimensionality of the items.
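The eigenvalue-ratio check can be illustrated on simulated data; the one-factor data below are invented for demonstration and are not the study's item responses:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated scores for 6 hypothetical items driven by one common ability,
# so a single dominant factor is expected.
theta = rng.normal(size=(500, 1))                # common ability
items = theta + 0.5 * rng.normal(size=(500, 6))  # item = ability + noise

# Eigenvalues of the inter-item correlation matrix, largest first.
eigvals = np.sort(np.linalg.eigvalsh(np.corrcoef(items, rowvar=False)))[::-1]
ratio = eigvals[0] / eigvals[1]

# A first-to-second eigenvalue ratio greater than 2 is a common indicator
# of essential unidimensionality (Hattie, 1985).
print(ratio > 2)   # True
```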

Fig. 2 Scree plot of the eigenvalues of the factors resulting from the psychometric analysis of the items
The Kuder-Richardson formula (KR-20) was used to measure the internal consistency of the SPM test items. The analysis showed that the internal-consistency reliability coefficient of the SPM test was 0.86, an acceptable value for proceeding with this study.
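A minimal KR-20 sketch on a hypothetical binary score matrix (not the study data):

```python
import numpy as np

def kr20(scores):
    """Kuder-Richardson Formula 20 for a matrix of 0/1 item scores
    (rows = examinees, columns = items)."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                    # number of items
    p = scores.mean(axis=0)                # proportion correct per item
    q = 1.0 - p
    total_var = scores.sum(axis=1).var()   # population variance of total scores
    return (k / (k - 1)) * (1.0 - (p * q).sum() / total_var)

# Perfectly consistent responses (each examinee all-correct or all-wrong)
# give the maximum reliability of 1.0.
perfect = [[1, 1, 1], [1, 1, 1], [0, 0, 0], [0, 0, 0]]
print(round(kr20(perfect), 4))   # 1.0
```

The population variance (`ddof=0`) is used for the total scores so that it is on the same footing as the item-level `p*q` terms.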

Results

Results of Mantel-Haenszel and Effect Size (Odds Ratios)
To answer the research question, Mantel-Haenszel statistics were calculated, as well as the effect size (common odds ratio), for each item in the SPM test using the jMetrik statistics program, with girls as the reference group and boys as the focal group. Table 2 shows the summary results (M-H statistics, significance levels, odds ratios (effect sizes), and 95% confidence intervals) for each of the forty-eight items from the Mantel-Haenszel method for identifying differential item functioning on the SPM test. Eight items from the SPM test (items 9, 10, 17, 18, 31, 42, 44, and 45) suggested differential functioning according to the sex of the student; the data showed that the Mantel-Haenszel statistics for these items were statistically significant at the 0.05 level.

Results of Delta Mantel-Haenszel and DIF Direction
To determine the amount of differential functioning and its direction, the Mantel-Haenszel statistics for the items that showed differential functioning were converted to delta values (Δ_MH) according to formula (5). Table 3 shows the summary results from the Mantel-Haenszel method for identifying differential item functioning on the SPM test. The results showed five moderate DIF items: 9, 17, 18, 42, and 44. The direction of DIF in these items showed that four items (9, 17, 18, and 42) favored the focal group (boys), and one item (44) favored the reference group (girls). The results also showed two large DIF items (10 and 45): item 10 favors the reference group (girls), and item 45 favors the focal group (boys). Rather than present all the non-parametric item characteristic curves (ICCs) for the items that showed DIF, two ICCs are presented (Figure 3) for the purpose of demonstration. These non-parametric ICCs were selected because they demonstrate, first, for item 10, that the reference group (girls) has a higher chance of answering correctly than the focal group (boys), and second, for item 45, that the focal group (boys) has a higher chance of answering correctly than the reference group (girls).

Discussion
The statistical analyses showed that the Raven's Standard Progressive Matrices (SPM) Test is one-dimensional; it seems that this measure of reasoning ability does not require other cognitive abilities to a significant degree. The results of the first procedure used in this study indicate that 8 of the 48 items showed differential functioning according to the sex of the student. The results of the second procedure showed that, of five moderate DIF items, four favored the focal group (boys) and one favored the reference group (girls). The results also showed two large DIF items: one favored the reference group (girls), and the other favored the focal group (boys). Finally, the results indicated one item showing negligible DIF favoring the reference group (girls), which was ignored according to the guidelines of the Mantel-Haenszel method. Our results support the idea that comparisons between diverse groups show minimal bias when Raven's Standard Progressive Matrices Test is used. Nevertheless, there is a sex difference on the SPM Test (Colom and Garcia, 2002); given that this test is based on abstract figures and that boys have, on average, higher spatial ability than girls (Voyer et al., 1995), we predicted that some items might be easier for boys. Thus, boys might solve some items correctly because of their visuo-spatial nature, which could be considered a potential source of bias (Francisco et al., 2004, p. 10).
The results of the current study provide evidence of sex differences in performance on a few test items in the SPM test. Differences in cognitive abilities between the two sexes have always been a subject of investigation for researchers (Wechsler et al., 2014). The differences between the sexes on Raven's Matrices tests have been among the most interesting and most controversial subjects, and yet studies have not reached a clear, conclusive result (Yang et al., 2014). The results of studies on sex differences are inconsistent and do not follow the same path. Some studies have attributed these differences to factors such as the nature of the sample, whether the sample was representative of its community, and the use of statistical methods that fail to identify DIF. A study by Mackintosh & Bennett (2005) showed that boys surpassed girls on some items that share a certain pattern; however, their sample was not large, and the study was conducted using Raven's APM rather than the SPM. Therefore, this study emphasizes the necessity for researchers to conduct studies employing qualitative research methods, such as focus groups, to evaluate the reasons behind differential item functioning and to verify the sources of variance that affect test scores. This allows determination of whether the subgroups are affected by the same sources of variance, and whether any of those sources unfairly favor a subgroup, before judging an item's bias. In summary, the authors have investigated the visuo-spatial basis of the SPM test; the male advantage on this test could derive from the visuo-spatial nature of its items.

Conclusion and Recommendations
The differential item functioning results obtained in this study show that boys have, on average, higher spatial ability than girls. Based on the results and discussion above, the following conclusions were reached. This is a comparative study using DIF data to reveal different performance characteristics of male and female examinees; therefore, identifying DIF items is as important as determining the underlying source of difficulty across the focal and reference groups. A complete evaluation of test quality must include an evaluation of each question: questions should assess only knowledge or skills identified as part of the domain being tested, should avoid assessing irrelevant factors, and should be fair to examinees from all possible subgroups of the examinee population. DIF is an issue that must be properly addressed in examinations and tests designed for heterogeneous groups. Based on these findings, the following recommendations can be considered for future studies. Conduct further research including additional variables other than sex, especially age and region. It is also necessary for item writers to develop test items and subject them to pilot studies in order to select items that are free from DIF. Another step in this research would be to include and compare other methods of identifying DIF items because, as mentioned above, DIF detection methods can vary in their results. Therefore, it may be advantageous to reanalyze the data using model-based methods (i.e., item response theory), such as the logistic regression, simultaneous item bias test (SIB), and IRT-likelihood ratio (IRT-LR) methods, to confirm the results seen here. Future studies could also examine sex-differential performance on intelligence tests specifically among gifted students.
Another important recommendation for future studies is to use a further method of detecting DIF, such as the multidimensional model for detecting differential dimensions (Shealy & Stout, 1993), to examine whether both DIF methods flag the same items. Limitations: The results of this study are limited by the use of the Mantel-Haenszel method to detect differential item functioning, which is based on classical test theory (CTT). Another limitation of the study relates to the statistical program used in the analysis, namely the jMetrik software.