Introduction

A substantial part of medical education research focuses on learning in teams (e.g., departments, problem-based learning groups) or centres (e.g., clinics, institutions) that are followed over time. Individual students or employees sharing the same team or centre tend to be more similar in learning than students or employees from different teams or centres [1]. In other words, when students or employees are nested within teams or centres, there is an intra-team or intra-centre correlation that should be taken into account in the analysis of data obtained from individuals in these teams or centres. Further, when individuals are measured several times on the same performance (or other) variable, these repeated measurements tend to be correlated, that is: we are dealing with an intra-individual correlation that should be taken into account when analyzing data obtained from these individuals [23]. This paper presents the benefits that result from adopting a proper multilevel perspective on the conceptualization and estimation in such a study context, a context that is quite common in medical education research. Many researchers still resort to methods that cannot account for intra-team and/or intra-individual correlation and this may result in incorrect conclusions with regard to effects and relations of interest.

Context

Suppose, a researcher is interested in the effect of two types of group learning on test performance right after a course (i.e., immediate test performance) and one month later (i.e., delayed test performance), and decides to conduct a randomized experiment. The advantage of randomized experiments is that they allow for cause-effect inference much more than quasi-experimental and other types of studies.

The researcher decides to randomly allocate 450 students to 30 learning groups in such a way that every learning group comprises 15 students. Next, the learning groups are allocated randomly to either treatment A (control condition) or B (experimental treatment condition). The 15 A-groups study a medical problem by means of traditional cooperative learning, while the 15 B-groups study the same medical problem—and for the same interval of time as in the control group—but by means of a newly developed and more structured type of cooperative learning. The immediate test is administered directly after treatment. In the month after the test, students do not receive any additional treatment; they resume their usual study activities. At the end of the month, a delayed test is administered. Figure 1 visualizes the study design described.

Fig. 1
figure 1

Study design used as example in this paper

This type of study design is also known as split-plot design. This term stems from agricultural experiments in which split plots of land received different treatments and were monitored or measured across time [4]. Likewise, in quite a number of educational and psychological experiments, students or other subjects are randomly allocated to different treatment conditions and are measured two or more times on the same performance or other variable [5]. In this example, we are dealing with such a design and we are confronted with one additional feature: treatment is not administered at the level of the individual student (as is the case in many psychological experiments) but at the level of learning groups in which the students take part (as is the case in more and more educational experiments). The following two hypotheses are to be tested:

  • Hypothesis 1 (H1), immediate treatment effect: students who undergo treatment B perform better than their peers who undergo treatment A, on both immediate and delayed test; and

  • Hypothesis 2 (H2), treatment-by-time interaction effect: students who undergo treatment A will lose more knowledge in the month following the immediate test than students who undergo treatment B.

In this context, we can distinguish between fixed effects and random effects. The purpose of a study like this is to generalize the findings to a larger population of (possible) students, and we assume that the students in our study form a random sample from a population that has a particular and preferably Normal (i.e., bell-shaped) distribution. In other words, we treat ‘student’ as a random effect. Treatment, however, is a fixed effect; we are interested in the specific comparison of treatments A and B, and we do not consider these two treatments to be a random sample of possible treatments to which we generalize. Likewise, in this context time is treated as a fixed effect; we are interested in differences in performance between two fixed time points, and we do not consider these two time points a random sample of possible time points to which we generalize. Finally, learning group (i.e., 15 students each) is treated as a random effect for the same reason as we treat student as a random effect; we consider the learning groups in our study a random sample from a population of (possible) learning groups that has a particular and preferably Normal (i.e., bell-shaped) distribution.

Approach

In fact, we can distinguish three hierarchical levels in this problem: learning group (level 3: k), students nested within learning groups (level 2: j), and repeated measurements from the same students (level 1: i). Three common approaches to this type of research problems are:

  • Single-level fixed-effects or ordinary least squares (OLS) regression in which the three hierarchical levels are treated as one single level and in which random effects of student and learning group are ignored;

  • Split-plot analysis of variance (ANOVA) or two-level mixed-effects regression in which the learning group level (i.e., k) and thus random effect of learning group is ignored; and

  • Three-level mixed-effects regression, which takes into account the full three-level hierarchical structure of the data.

We speak of mixed-effects analysis if our analysis involves at least one fixed and at least one random effect. Since all random effects are ignored in OLS regression, we call OLS regression fixed-effects analysis. The two-level approach includes the random effect of student, and the three-level approach includes the random effects of both student and learning group.

This paper compares the aforementioned three approaches, using this study context as example. For simplicity, this example uses equally large learning groups, an equal number of learning groups per treatment condition, and all students perform both the immediate and delayed test. We then speak of a ‘balanced design’. As a consequence of this design, all three approaches yield the same point estimates with regard to average treatment condition differences on immediate and delayed test. For unbalanced data, both OLS and split-plot ANOVA are likely to yield biased point estimates [2]. Further, it is demonstrated in this paper that, even for balanced data, OLS and split-plot ANOVA are biased in different ways with regard to variances and standard errors.

In the design at hand, two repeated measurements (level-1: i) are taken by 450 students (level-2: j) who are nested within 30 learning groups (level-3: k) that comprise 15 students each. The learning group level residual on the immediate test is a random, allowed-to-vary departure from the overall mean of the fixed-effect treatment condition on the immediate test, the student level residual on the immediate test is an allowed-to-vary random departure from the learning group level departure on the immediate test, and the change from immediate to delayed test is student-dependent.

Method

For educational purposes, to allow for a good comparison of the three methods, data from this design were simulated in SPSS v19; a detailed overview of the simulation procedure is available from the author.

The advantage of a simulation study is that the outcomes of the study are known and as such it allows for a comparison of strengths and weaknesses of various methods of analysis, here: OLS regression, split-plot ANOVA, and three-level mixed-effects regression. All analyses were performed in MLwiN v2.10, a programme designed for multilevel analysis and suitable for a context like this, using for estimation fully informed maximum likelihood (FIML) for the fixed effects and restricted maximum likelihood (REML) for the random effects (in the two-level and three-level model) [3]. There is quite a variety of programmes that enable multilevel analysis (e.g., SAS, SPSS, STATA, HLM, SYSTAT, HLM, MLwiN, Mplus, R), and which programme is recommended depends on study design, types of variables under consideration, sample size, and a few other factors [69]. Since SPSS is a much more commonly used programme among medical education researchers than MLwiN, some instructions on how to do multilevel analysis in SPSS are provided in the Appendix.

Between-subject effects

In OLS regression, all test outcomes from all students are assumed to be independent and identically distributed (i. i. d.). The research problem introduced previously is represented in the following equation:

$${{y}_{i}}={{b}_{0}}+{{b}_{1}}*\text{treatment}\_{{1}_{i}}+{{b}_{2}}*\text{time}\_{{1}_{i}}+{{b}_{3}}*\text{treatment}\_1*\text{time}\_{{1}_{i}}+{{e}_{i}},$$

where

y i  = test score y observed at data point i;

b 0 = the average immediate test score in the control condition (A);

b 1 = the difference between treatment conditions in average immediate test score;

b 2 = the average change from immediate to delayed test in the control condition (A);

b 3 = the difference between treatment conditions in average change from immediate to delayed test

(treatment-by-time interaction); and

e i  = the residual or random deviation from the fixed prediction.

The student-by-time interaction is ignored; all repeated measurements are assumed to be independent observations, as if they were 900 randomly sampled students attending one and the same lecture once in time. Due to the balanced design, the fixed point estimates b 0 (44.076), b 1 (11.089), b 2 (− 4.036), and b 3 (2.867) in OLS regression are the same as the fixed point estimates in the two-level and three-level model presented later on. However, the standard errors and residuals are different in each model.

Within-subject effects

Repeated measurements data enable the researcher to separate within-subject variance from between-subject variance, and both types of variance are important in medical education research. Two types of correlation resonate in repeated measurements data: data sampling is hierarchical in that repeated measurements are taken from the same subjects (here: students), and educational measurement largely results from observation or self-reporting which creates serial correlation. Both types of correlation should be modelled appropriately. If students are measured more than twice, serial correlation observes special attention, for there are different types of serial correlation [23].

In the case of a balanced design with two repeated measurements, quite some researchers opt for split-plot ANOVA. In fact, this is a two-level mixed-effects model, in which student is the upper level (j) and the repeated measurements are correctly treated as observations from the same students. The within-subject (i.e., student) variance is separated from the between-subject variance, which is something that does not happen in OLS regression. The students participating in the study are assumed to form a random sample from a population that follows a particular (ideally: Normal) distribution. The OLS regression equation can be adjusted to derive the regression equation for the current model:

$${{y}_{ij}}={{b}_{0j}}+{{b}_{1}}*\text{treatment}\_{{1}_{j}}+{{b}_{2}}*\text{time}\_{{1}_{ij}}+{{b}_{3}}*\text{treatment}\_1*\text{time}\_{{1}_{ij}}+{{e}_{ij}},$$

where

y ij  = test score y observed for student j at repeated measurement i;

b 0j  = the average immediate test score of person j in the control condition (A); and

e ij  = the residual or random deviation of student j at repeated measurement i from the fixed prediction.

In the equation above

$${{b}_{0j}}={{b}_{0}}+{{u}_{0j}},$$

and u 0j is the student-specific deviation in immediate test score with regard to b 0. Note that u 0j and b 0j are random (not fixed) effects, hence the denomination mixed-effects model. In the standard error of a between-subject effect such as the experimental treatment effect, both within-subject and between-subject variance are present and cannot be separated. In the standard error of a within-subject effect such as change in test score over time the within-subject variance can and should be separated from the between-subject variance. It is for that reason that, as becomes clear in the results section later on, OLS regression yields a larger standard error for b 2 and b 3 than split-plot ANOVA.

Learning groups

The learning group level is not taken into account in either OLS regression or in split-plot ANOVA. While the within-student between-measurement correlation is accounted for in split-plot ANOVA, the within-group between-subject correlation is not, and in OLS regression both types of correlation are ignored. Some researchers attempt to solve this by aggregating the data to the level of learning group. An average test score is then computed per repeated measurement for every learning group, and split-plot ANOVA is performed on the aggregated data. In this approach, the student level is wiped out of the analysis. Given that we (and researchers in medical education in a broader context) are also interested in the development of individual students, and effects on the individual student level can be different from effects on the learning group level, reducing the individual students’ data to their learning group average is not a feasible approach in medical education research.

A second group of researchers attempts to take the learning group level into account by including it as fixed effects in either an OLS regression model, or in a split-plot ANOVA. Either way, the fixed learning group approach is problematic for a number of reasons. First of all, in OLS regression and split-plot ANOVA, as discussed previously, the fixed effect of the model consumes four degrees of freedom (one for each of b 0, b 1, b 2, and b 3). When treating group as fixed factor, one needs dummy variables for the various learning groups and dummy variables for the learning group-by-time interaction. As a result, the fixed part requires 60 instead of four degrees of freedom (imagine the consequences if a study includes 300 learning groups). This affects standard errors greatly and results in highly inaccurate estimation. Moreover, no single parameter in the model addresses the treatment effect. Since each treatment condition comprises 15 groups, we cannot include both treatment and group as fixed effects in the model. Although in a balanced design we can compute the average learning group score for each treatment condition and this will indirectly lead to the treatment condition differences in average test score as obtained via the OLS regression or split-plot ANOVA model discussed previously, the learning group-specific standard errors in the model cannot be easily translated into one standard error for the treatment effect. The same problem is present in the analysis of the time effect and the treatment-by-time interaction effect. Finally, including learning group as fixed effect in the model disables generalization to other learning groups. One is generally interested in the effects of treatments A and B in any learning group that could apply one of these treatments. The 30 learning groups should therefore be considered as random learning groups; the first 15 learning groups form a random sample from a population of learning groups in which treatment A is applied, and the other 15 learning groups form a random sample from a population of learning groups in which treatment B is applied.

Three levels

Multilevel analysis can take the hierarchical structure of the data into account in a way that none of the previously discussed approaches does, and enables correct analysis at each of the three levels: learning group (k), student (j), and repeated measurement (i). The appropriate multilevel model is a three-level mixed-effects model. In this model, the average immediate test score in the control condition, the difference in average immediate test score between the treatment conditions, the average change from immediate to delayed test in the control condition, and the difference in average change from immediate to delayed test between treatment conditions together form the fixed part. The random part can be fully explained by a combination of random intercept variance at the learning group level (k), and random intercept variance and random slope variance (and their covariance) at the student level (i), meaning that the residual on the lowest level—the level of the repeated measurements (i)—is equal to zero (and does not consume any degree of freedom). Consequently, the model consumes eight degrees of freedom, of which four for the fixed part and four for the random part (including one for the covariance between random intercept and random slope on the level of student). The full model is as follows:

$${{y}_{ijk}}={{b}_{0jk}}+{{b}_{1}}*\text{treatment}\_{{1}_{k}}+{{b}_{2j}}*\text{time}\_{{1}_{ijk}}+{{b}_{3}}*\text{treatment}\_1*\text{time}\_{{1}_{ijk}}\,+\,{{e}_{ijk}},$$

and given that e ijk , the residual on the level of repeated measurement i from student j in learning group k is equal to zero, the model can be reduced to

$${{y}_{ijk}}={{b}_{0jk}}+{{b}_{1}}*\text{treatment}\_{{1}_{k}}+{{b}_{2j}}*\text{time}\_{{1}_{ijk}}+{{b}_{3}}*\text{treatment}\_1*\text{time}\_{{1}_{ijk}},$$

where

y ijk  = the test score from student j from learning group k on repeated measurement i;

b 0jk  = the immediate test score of student j from learning group k in the control condition (A);

b 1 = the difference in average immediate test score between the treatment conditions;

b 2j  = the change from immediate to delayed test for student j in the control condition (A); and

b 3 = the difference in average change from immediate to delayed test between treatment conditions.

For b 0jk holds

$${{b}_{0jk}}={{b}_{0}}+{{v}_{0k}}+{{u}_{0jk}},$$

where

b 0 = the average immediate test score in the control condition (A);

v 0k  = the learning group-specific deviation in average immediate test score with regard to b 0; and

u 0jk  = the student-specific deviation in immediate test score with regard to v 0k.

For b 2j holds

$${{b}_{2j}}={{b}_{2}}+{{u}_{2jk}},$$

where

b 2 = the average change from immediate to delayed test in the control condition (A); and

u 2jk  = the deviation in change from immediate to delayed test for student j in learning group k

with regard to b 2.

Thus, the model can also be written as

$${{y}_{ijk}}=\left( {{b}_{0}}+{{v}_{0k}}+{{u}_{0jk}} \right)+{{b}_{1}}*\text{treatment}\_{{1}_{k}}+\left( {{b}_{2}}+{{u}_{2jk}} \right)*\text{time}\_{{1}_{ijk}}+{{b}_{3}}*\text{treatment}\_1*\text{time}\_{{1}_{ijk}}.$$

For student j in treatment condition A the immediate test score is (b 0 + v 0k  + u 0jk ), while for student j in treatment condition B the immediate test score is (b 0 + v 0k  + u 0jk ) + b 1. For student j in treatment condition A the change from immediate to delayed test is (b 2 + u 2jk ), whereas for student j in treatment condition B the change is (b 2 + u 2jk ) + b 3. The level-3 (k) residuals for the various learning groups v 0k are assumed to form a random sample from a normally distributed population with mean zero and variance \(\sigma _{v0}^{2}\):

$${{\text{v}}_{0k}}\tilde{\ }\text{N}\left( 0,\sigma _{v0}^{2} \right).$$

The level-2 (j) residuals for the students nested within learning groups (k), u 0jk and u 2jk , are also assumed to form random samples from normally distributed populations with mean zero and variance \( \Omega u \):

$$\left[ \begin{aligned}& {{\text{u}}_{\text{0}jk}} \\& {{\text{u}}_{\text{2}jk}} \\\end{aligned} \right]\sim \text{N}\left( 0,{{\Omega }_{u}} \right).$$

Results

Table 1 presents standard errors (SE) for each of b 0 (44.076), b 1 (11.089), b 2 (− 4.036), and b 3 (2.867), as well as random intercept variance at the learning group level (k), random intercept variance and random slope variance and their covariance at the student level (j), and the lowest-level residual (e) and associated SEs.

Table 1 Standard errors (SE) for each of b 0 (44.076), b 1 (11.089), b 2 (− 4.036), and b 3 (2.867), as well as random intercept variance at the learning group level (k), random intercept variance and random slope variance and their covariance at the student level (j), and the lowest-level residual (e) and associated SEs (between parentheses)

OLS heavily overestimates the standard errors for b 2 and b 3, effects in which within-subject variance plays a role. The within-subject variance is separated from the between-subject variance in split-plot ANOVA and the three-level model, and as a consequence, the standard errors for b 2 and b 3 are much smaller than according to OLS. Ignoring the learning group level does not affect the standard errors for b 2 and b 3 in the split-plot ANOVA. This is because within-subject effects have different variances and degrees of freedom than between-subject effects, and ignoring the learning group level only affects the degrees of freedom of between-subject effects. This also explains why the standard error for b 1 is underestimated in both OLS and split-plot ANOVA.

Conclusion

The advent of the personal computer with more and more computational power resulted in an increased use of multilevel models [10]. Nonetheless, many still use OLS and related ANOVA approaches for multilevel data because they are used to it. For instance, in experimental psychology there is a longstanding tradition of using ANOVA models, and OLS is typically (over)used in much of health research. Many researchers continue using ANOVA or OLS because they ‘have always done it like that’ and think that ‘a more complex analysis does not make much difference anyway.’ Indeed, there are situations when a more complex analysis does not make much difference. That is, when little to no interaction between students within groups results in very little within-group dependency, taking into account the group level may not result in substantially different outcomes with regard to the effects or relations of interest. However, this is not the norm, and even smaller within-group dependency should make researchers examine what adding the group level changes in outcomes [1].

The beast of aggregation

In any case, aggregating student-level data to some group average does not resolve the phenomenon of within-group dependency and is rarely if ever a good approach to deal with such dependency. This also holds for situations where for instance groups of students trained by the same clinical teacher have to rate teaching skills or other characteristics of that teacher. While a common argument to ‘justify’ aggregation is that such ratings aim at ‘evaluating the performance of an individual clinical teacher at the workplace’ [11], students rarely provide exactly the same ratings, some clinical teachers may receive more ratings than others, and some clinical teachers may receive more varied ratings than others. All this information is lost when aggregating students’ data to one single average score per teacher, and this can have major influences on effects and relations of interest, including negative correlations being artificially changed into positive ones and vice versa [1]. Therefore, do not wipe out the student level through aggregation.

A similar reasoning holds for repeated measurements. Recently, a series of well-designed randomized experiments provided evidence for the statement that in studies where learners have to perform a series of tasks, it is better to measure a characteristic of interest—for instance mental effort—after each task (i.e., repeatedly) than once retrospectively [12]. This is an excellent statement, for repeated measurements data enable the researcher to separate within-subject variance from between-subject variance, and both types of variance are important in (medical) education research. However, if we aggregate these repeated measurement data to one average score to then correlate that average score to some other (perhaps also aggregated) variable, we fall in the same trap of aggregation and can face potential serious distortions of effects and relations of interest [1].

N students being measured k times does not equate N times k independent observations

Ignoring intra-individual correlation as is done in OLS is unfortunately still quite common in (medical) education research, including in high-quality research published in respectable journals. For instance, in a recent study, six medical residents who individually interpreted eight electrocardiograms (ECGs) were treated as a ‘sample size of 48’ (i.e., six times eight) on which something comparable to OLS regression was performed [13]. This is like seeing 48 residents who independently rated one single ECG. In the latter case, assuming 48 independent observations could be realistic. In the current context, however, there is a within-resident between ECG/interpretation correlation that reduces the number of independent observations to somewhere between the number of residents (i.e., six) and the total number of data points (i.e., six times eight).

A slightly different yet similar approach is chosen when researchers perform separate ANOVAs to test for group differences at each time point instead of accounting for the fact that at least considerable proportions of students have taken multiple tests and that students taking the test at some point may have been nested within learning groups [14]. Even if testing for group differences at a specific time point is legitimate from an interest in group differences at that very point in time, ignoring group nesting and the intra-group correlation that goes with it tends to result in an overestimation of the number of independent observations at that time point and this may exaggerate to some extent the statistical significance of a group difference at that time point.

One thing should be added, before turning to the next section. The papers used as examples of studies in which a multilevel approach should or could have been used [1114] were not chosen because of a lack of quality. On the contrary, each of the papers discussed presents high-quality research published in a respectable journal. However, it is well possible that adopting a multilevel analysis approach would have resulted in somewhat different conclusions with regard to some effect(s) or relation(s) of interest. These papers illustrate that even in the case of a well-designed study, different approaches to analysis do exist and it is worth thinking carefully about which analysis approach accounts for your study design and data to an optimal extent. There is a metaphorical bridge between research questions, study design, and data analysis; the study design is supposed to logically follow from your research questions and should be reflected in the data analysis stage. Even if both optimal and suboptimal approaches of analysis result in a statistically significant p-value for a particular effect or relation of interest, statistics is not in the first place about p-values; it is rather about the mathematical modelling of empirical phenomena. For instance, it is possible that in a particular context OLS regression yields a statistically significant positive correlation between two variables where appropriate multilevel regression yields a statistically significant negative correlation between the same variables.

In the following section, some benefits of multilevel analysis to the aforementioned approaches are discussed.

Some benefits of multilevel analysis

Compared with the approaches discussed previously, appropriate multilevel analysis has a number of benefits on the conceptualization and estimation in a study context as discussed in this paper.

Firstly, multilevel analysis stimulates the researcher to conceptualize and specify the various levels in the research design, so that the variance at each level can be modelled and estimated correctly. Multilevel analysis is the only approach that enables the researcher to conceptualize the hierarchical structure of the data and specify the hierarchical levels in the data correctly and completely. In split-plot ANOVA, the given three-level mixed-effects structure is reduced to a two-level mixed-effects structure, and in OLS regression it is reduced to a one-level fixed-effects structure. The fixed learning group approach results in non-interpretable models that consume many degrees of freedom and the estimates cannot be used to generalize to other learning groups in which treatment A or B is or could be applied.

Secondly, whether one is interested in estimating between-subject or within-subject effects or a combination of these two types, multilevel modelling enables the researcher to model and estimate these effects appropriately. OLS regression results in underestimated standard errors of between-subject (here: treatment) effects and overestimated standard errors of within-subject (here: time) effects and split-plot interaction (here: treatment-by-time) effects, whereas the split-plot ANOVA approach results in underestimated standard errors of between-subject effects. While standard errors of within-subject and split-plot interaction effects are inflated when intra-student correlation is ignored, ignoring intra-group correlation induces a downward bias in standard errors of between-subject effects. Standard errors affect outcomes of statistical significance tests and the width of confidence intervals around the point estimates (i.e., the latter are used for interval estimation). Underestimated standard errors result in a larger Type I error probability in hypothesis testing (i.e., incorrect rejecting of a true null hypothesis) and too narrow confidence intervals for an effect or relation of interest. Overestimated standard errors result in larger Type II error probability in hypothesis testing (i.e., failing to reject an untrue null hypothesis) and too wide confidence intervals for an effect or relation of interest. Either of the two can result in incorrect conclusions with regard to (treatment) effects and relations of interest, and this is not a good thing if we decide to use the outcomes of our analyses for curriculum and policy making in (medical) education.

Thirdly, in all approaches discussed in this paper except for the three-level model, different types of homogeneity assumptions are made. OLS regression assumes homogeneity of variance across combinations of treatment condition and repeated measurements (which is what ‘identically’ in i. i. d. refers to), meaning for the design discussed in this paper that the variance in immediate test score is equal for both treatment conditions and equal across repeated measurements. In split-plot ANOVA, homogeneity of the covariance matrix for both treatment conditions is assumed. Both types of homogeneity assumptions are frequently violated, and not taking these violations into account can lead to serious bias in standard errors and interval estimation. Further, in designs in which students undergo more than two repeated measurements, different serial correlation structures can arise. Multilevel analysis enables the researcher to model heterogeneity of variances and potential serial correlation easily. In the study discussed in this paper, difference in variance between immediate and delayed test score and between treatment conditions is modelled by the inclusion of the random intercepts and random slope. The random part is modelled completely, whereas in all other approaches discussed in this paper a considerable proportion of the random part remains unexplained.

Fourthly, especially in non-experimental studies, unbalanced designs (e.g., learning groups of varying size, missing data) are to be expected, and then split-plot ANOVA and OLS regression yield biased point estimates. However, even in experimental studies, dropout can occur, and then multilevel analysis generally provides a less biased and more efficient approach than split-plot ANOVA, OLS regression or similar approaches [15].

A note on the number of levels

The need for a three-level mixed-effects model in this paper arises from the presence of both intra-individual correlation and intra-group correlation. The intra-individual correlation results from the same individuals being measured twice, while the intra-group correlation is due to the fact that individuals were learning in groups of 15. Had only one test been administered (instead of two), there would have been no repeated measurements and thus no intra-individual correlation to be taken into account in the analysis. In that case, a two-level model with group (level-2: j) and individual student (level-1: i) would have been appropriate. This may hold even if group size is as small as two students [16].

Likewise, had treatment not been administered at the level of groups of collaborating individuals but at the level of the individual, there would have been no intra-group correlation to be taken into account in the analysis. If then students were still measured twice (i.e., immediate and delayed test), a two-level model with individual student (level-2: j) and measurement occasion (level-1: i) would have been appropriate [15]. OLS could have been appropriate had treatment been administered at the level of the individual and only one test was administered (i.e., no repeated measurements).

Many medical education research questions focus on learning in teams or clinics and/or learning over time. In this context, medical education research could profit from the benefits of multilevel analysis more than it has done until now. This paper demonstrates what can happen when resorting to a frequently used but suboptimal method of analysis and provides an approach that can be used by other researchers dealing with this kind of data.

Essentials

  • When individuals are nested within teams, there is an intra-team correlation that should be taken into account in the analysis of data obtained from individuals in these teams or centres.

  • The intra-individual correlation resulting from individuals being measured two or more times should be taken into account when analyzing data obtained from these individuals.

  • Multilevel analysis enables the researcher to conceptualize the hierarchical structure of research data and appropriately account for intra-individual and/or intra-team correlation.

  • Traditional regression and analysis of variance methods fall short in dealing with intra-individual and/or intra-team correlation and are therefore generally not recommended in such a context.

  • Much of (medical) education research is about individuals nested within teams and/or individuals measured repeatedly; therefore, (medical) education research provides a natural context for multilevel analysis.

Source(s) of support in the form of grants

None.