Validity and reliability assessment of a peer evaluation method in team-based learning classes
Abstract
Purpose
Team-based learning (TBL) is increasingly employed in medical education because of its potential to promote active group learning. In TBL, learners are usually asked to assess the contributions of peers within their group to ensure accountability. The purpose of this study is to assess the validity and reliability of a peer evaluation instrument that was used in TBL classes in a single medical school.
Methods
A total of 141 students were divided into 18 groups in 11 TBL classes. The students were asked to evaluate their peers in the group based on evaluation criteria that were provided to them. We analyzed the comments that were written for the highest and lowest achievers to assess the validity of the peer evaluation instrument. The reliability of the instrument was assessed by examining the agreement among peer ratings within each group of students via intraclass correlation coefficient (ICC) analysis.
Results
Most of the students provided reasonable and understandable comments for the high and low achievers within their group, and most of those comments were compatible with the evaluation criteria. The average ICC of each group ranged from 0.390 to 0.863, and the overall average was 0.659. There was no significant difference in inter-rater reliability according to the number of members in the group or the timing of the evaluation within the course.
Conclusion
The peer evaluation instrument that was used in the TBL classes was valid and reliable. Providing evaluation criteria and rules seemed to improve the validity and reliability of the instrument.
Introduction
Team-based learning (TBL) is a well-defined instructional strategy that is increasingly employed in medical education because of its potential to promote active learning without requiring many faculty members [1]. TBL provides frequent opportunities for peers to enhance learning as teammates talk and listen to one another to arrive at consensus decisions. It also fosters individual and group accountability as small groups of students work together to solve problems and to answer questions [2]. To ensure accountability for group work, learners are asked to assess the contributions of peers within their group [3]. Peer evaluation methods in TBL have been constructed in many ways; however, limited data exist regarding the validity and reliability of these instruments, especially in medical education [4].
Previous studies have examined the validity of peer evaluation instruments mostly by comparing peer evaluation scores with tutor evaluation scores or test grades [5-9]. However, tutors are not always available to supervise every small group activity, and tests usually measure knowledge rather than student contributions. Thus, the validity assessment of a peer evaluation method should examine whether the instrument actually measures a student’s contributions in the right way. The other psychometric issue with such instruments is their reliability. If peer evaluation methods are to be considered reliable, one would expect that students who contribute the most to their groups should consistently receive above-average peer evaluation scores from their peers, and low contributors should consistently receive below-average peer evaluation scores (inter-rater reliability) [10]. The minimum number of members in each group needed to achieve an appropriate level of inter-rater reliability is another issue concerning the reliability of peer evaluation instruments.
The purpose of this study is to assess the validity and reliability of a peer evaluation instrument that was used in TBL classes in a single medical school. This study examined whether the instrument actually measured students’ contributions and whether the students who contributed the most to their groups consistently received higher peer evaluation scores from their peers and low contributors consistently received lower peer evaluation scores. This study also assessed whether the reliability of the instrument was affected by the number of students in a group or by the timing of the TBL class during the course.
Methods
At Seoul National University College of Medicine in South Korea, a system-based integrated course starts in the fourth quarter of the first year. To promote active learning, TBL is frequently conducted throughout the course. A total of 146 students in the class of 2020 were divided serially into 18 groups by their student number for the TBL classes. However, five students dropped out before the beginning of the fourth quarter, leaving 141 students. Eventually, five groups were composed of seven students, two groups of nine students, and the other 11 groups of eight students. The members of the groups were fixed until the end of the course. During the fourth quarter of the 2016 academic year, there were a total of 11 TBL classes. After every TBL class, the students were asked to evaluate their peers in the group. Five criteria for the peer evaluation were provided to the students: (1) Did the peer prepare enough for the class? (2) Did the peer actively participate in the group discussion? (3) Did the peer contribute to the group activity? (4) Did the peer respect others in the group? (5) Did the peer show sincerity during the class? The students were asked to rate their peers with an average of 10 points per person; however, they had to rate at least one person above 11 points and at least one person below 9 points. They were also asked to write a short comment for the person to whom they gave the highest and the lowest points in the group, describing the reason for the high or low score. The students filled out the paper-based peer evaluation form by themselves after they left the classroom and placed it in a collection box the next morning. The evaluation of the TBL classes was composed of iRAT (individual readiness assurance test) scores, tRAT (team readiness assurance test) scores, and peer evaluation scores. Peer evaluation scores accounted for approximately 5% of the total course evaluation score.
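The rating rule above can be expressed compactly. The following minimal Python sketch is purely illustrative (the study used a paper-based form, not software); the function name and the strict-inequality reading of “above 11” and “below 9” are assumptions.

```python
def satisfies_rating_rule(scores, target_mean=10.0, tol=1e-9):
    """Check one rater's set of peer scores against the course rules:
    the scores must average 10 points, and the rater must give at least
    one score above 11 and at least one score below 9.
    (Hypothetical helper; the strict inequalities are an interpretation.)
    """
    mean_ok = abs(sum(scores) / len(scores) - target_mean) <= tol
    has_high = any(s > 11 for s in scores)
    has_low = any(s < 9 for s in scores)
    return mean_ok and has_high and has_low


# A rater in an eight-member group scores the other seven members.
print(satisfies_rating_rule([12, 11, 10, 10, 10, 9, 8]))    # True
print(satisfies_rating_rule([10, 10, 10, 10, 10, 10, 10]))  # False: no spread
```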
To assess the validity of the peer evaluation instrument, we analyzed the written comments for the highest and lowest achievers by peer evaluation score and examined whether the peer evaluation was conducted appropriately based on the criteria that the students were given. There were a total of 1,548 peer evaluation results, meaning that there were 1,548 possible comments for the highest achievers and 1,548 possible comments for the lowest achievers. Two researchers in the study team independently reviewed all the comments and categorized them according to the key concept of each comment. The comments that were categorized differently by the two researchers were collected and reviewed again by both of them, and the researchers discussed each comment to reach a consensus.
The reliability of the instrument was assessed by examining the agreement among the performance ratings within each group of students via intraclass correlation coefficient (ICC) analysis. First, we examined whether the ICC differed among the groups. Second, we examined whether the ICC varied by the number of students in each group. Third, we examined whether the ICC changed throughout the course. Analysis of variance (ANOVA) was conducted to analyze differences among the ICCs. IBM SPSS ver. 23.0 (IBM Corp., Armonk, USA) was used for the ICC and ANOVA analyses. The study was approved by the Institutional Review Board of Seoul National University College of Medicine and Seoul National University Hospital (IRB No. 1704-154-849).
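As an illustration of how this analysis could be reproduced outside SPSS, the sketch below uses the pingouin and SciPy packages. The file name, column names, and the choice of the ICC2 (two-way random effects, absolute agreement, single rater) form are assumptions, since the paper does not report the exact data layout or ICC model; it also assumes, for simplicity, a complete rater-by-target table, although in practice students did not rate themselves.

```python
import pandas as pd
import pingouin as pg
from scipy import stats

# Hypothetical long-format data: one row per (TBL class, group, rater, target) rating.
df = pd.read_csv("peer_ratings.csv")  # columns: tbl_class, group, rater, target, score


def group_icc(sub: pd.DataFrame) -> float:
    """ICC for one group in one TBL class: agreement among raters on the same targets."""
    icc = pg.intraclass_corr(data=sub, targets="target", raters="rater", ratings="score")
    # ICC2 chosen here as one common single-rater, absolute-agreement form.
    return icc.loc[icc["Type"] == "ICC2", "ICC"].iloc[0]


icc_table = (
    df.groupby(["tbl_class", "group"])
      .apply(group_icc)
      .rename("icc")
      .reset_index()
)

# Average ICC per group across the 11 classes, then a one-way ANOVA across groups.
print(icc_table.groupby("group")["icc"].mean())
samples = [g["icc"].to_numpy() for _, g in icc_table.groupby("group")]
f_stat, p_value = stats.f_oneway(*samples)
print(f"ANOVA across groups: F = {f_stat:.2f}, p = {p_value:.3f}")
```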
Results
Of the 1,548 possible comments in each set, there were 1,317 comments for the highest achievers and 1,313 comments for the lowest achievers. Among the 1,317 comments for the highest achievers, 1,232 comments were positive and reasonable, while the other 85 comments were not specific or understandable. Among the positive comments, 901 were compatible with the criteria provided to the students, while the other 331 were not (Table 1). Over the 11 TBL classes, the proportion of comments that were compatible with the criteria increased moderately, while the proportion of comments that were not compatible decreased consistently (Table 2). Among the 1,313 comments for the lowest achievers, 1,207 comments were negative and reasonable, while the other 106 comments were not specific or understandable. All the negative comments were compatible with the evaluation criteria (Table 3).
The average ICC of each group ranged from 0.390 to 0.863, and the overall average was 0.659. There were significant differences in the average ICC across groups (p&lt;0.001). The average ICC of groups composed of seven students was 0.694, that of groups of eight students was 0.629, and that of groups of nine students was 0.783 (Table 4). There was no significant difference in the average ICC according to the number of students in the group (p=0.075).
The average ICC of each class ranged from 0.574 to 0.817, and there was no significant difference in the average ICC of each class during the course (p=0.193). When we divided the classes into three serial periods corresponding to their timing within the course (beginning, middle, and end), the average ICC of each period was 0.710, 0.632, and 0.629, respectively (Table 5). There was likewise no significant difference in the average ICC across the three periods (p=0.090).
Discussion
Most of the students provided reasonable and understandable comments for the highest and lowest achievers within their group, and most of those comments were compatible with the evaluation criteria given to the students. However, approximately one-fifth of the comments for the highest achievers were not compatible with the criteria. Some students mentioned that the high achievers exhibited good leadership, expressed creative and essential ideas, and gave well-organized presentations. There could be several reasons for this mismatch between the criteria and the comments. Students might not yet have been familiar with the evaluation criteria at the beginning of the TBL classes. Indeed, the proportion of comments that were compatible with the criteria increased moderately, while the proportion of comments that were not compatible decreased consistently during the course. However, all the comments for the lowest achievers were compatible with the criteria throughout the course. Thus, it seems more reasonable to infer that the criteria provided to the students did not fully cover the behaviors that distinguished the highest achievers in the TBL classes. In that case, it would be better to modify the evaluation criteria to improve the validity of the instrument.
The overall average ICC was 0.659, which is an acceptable level compared to previous studies on the inter-rater reliability of peer evaluation methods [11-13]. Several factors may have contributed to this result. First, the students were provided with evaluation criteria to guide them in the process of peer evaluation. Previous studies have shown that providing evaluation criteria to students improves the reliability of a peer evaluation method [14]. Second, the students were required to rate at least one person in the group above 11 points and at least one person below 9 points. This forced discrimination among the peer evaluation scores, which in turn increased the inter-rater reliability of the instrument. It is known that students, especially in medical school, feel uncomfortable giving their peers different scores and tend to give similar scores to their peers if there are no rules or regulations to the contrary [15].
The average ICC of each group ranged from 0.390 to 0.863, and significant variance was present in the average ICC across groups. However, no significant difference in the average ICC was found according to the number of students in the group. A previous study showed that the ICC increased when the number of students in the group was increased from four or five to six in TBL classes [4]. From our study, we might infer that if there are more than six students in a group, increasing the number of students in the group does not significantly improve the reliability of the peer evaluation instrument. Thus, it seems that six to seven students might be the most appropriate number for a group in TBL classes to facilitate interaction among the students while not compromising the reliability of the instrument.
No significant difference was found in the average ICC of each class during the course. When we divided the classes into three serial periods corresponding to their timing in the course, the ICCs of the beginning classes were slightly higher than those of the classes in the middle and end periods, although this difference was not statistically significant. In this study we were not able to determine why the ICC decreased slightly from the beginning of the course to the middle. One possibility is that, in the middle and end periods, students may have given high scores to peers other than those to whom they had already given the highest scores in the beginning period. This phenomenon is known as “gaming the system,” which refers to a tendency for students to try to even out peer evaluation scores across TBL classes [16]. Fortunately, in this study, the ICC did not decrease further after the middle period of classes.
This study also has some limitations. First, it assessed the validity and reliability of a single specific instrument used in a single institution. Because every evaluation instrument has its own psychometric characteristics, the results of this study may not be directly applicable to other peer evaluation methods in other circumstances. Second, the students were assigned to groups only by their serial student number, regardless of their age, gender, previous academic achievements, or other characteristics. There may have been several factors influencing group dynamics that were not fully considered in this study. Finally, we were not able to further investigate the reasons for the differences in inter-rater reliability among the groups.
In conclusion, the peer evaluation instrument used in the TBL classes in a single medical school was valid and reliable. Most of the students assessed their peers’ group activity and contributions based on the evaluation criteria. The students who contributed the most consistently received higher peer evaluation scores from the other students in their group. No significant differences in inter-rater reliability were found according to the number of members in the group or the timing of the evaluation during the course. Providing evaluation criteria and rules to the students seems to have improved the validity and reliability of the instrument. Further study is needed to explore the underlying group dynamics and to improve the validity and reliability of peer evaluation instruments.
Acknowledgements
None.
Notes
Funding
None.
Conflicts of interest
No potential conflict of interest relevant to this article was reported.
Authors’ contribution
HY carried out the study, conducted the analysis, and drafted the paper. WP led the design and implementation of the study and reviewed the final version before submission. SM, SM, KP, and JP participated in the study and drafted the paper together. All authors read and approved the final manuscript.