A validation framework for an online English language Exit Test: A case study using Moodle as an assessment management system

Research output: Thesis (Doctoral Thesis)


Abstract

Technology-enhanced language tests are increasingly hosted on course management systems (CMSs) such as Moodle. Despite this growing use, and despite rising concerns over the reliability and construct validity of computerised tests due to a potential testing mode effect (Chapelle & Douglas, 2006; Fulcher, 2003), validation research on these tests is lacking. This study therefore seeks to fill that gap with empirical validation research, using a case study of administering and validating a CMS-hosted test. The test was a technology-enhanced English Language Proficiency Exit Test hosted on Moodle (hereafter, the Moodle-hosted test) and administered to a group of EFL students (N = 207) at Sultan Qaboos University in Oman. The overall aim of the study was to provide a validity argument about using a Moodle-hosted test for its intended purpose by empirically establishing reliability and construct validity evidence. To achieve this aim, a study framework was applied following the principles of the Assessment Use Argument (AUA) framework of Bachman (2005) and Bachman and Palmer (2010). Applying the framework as a pragmatic tool for validation research led to the structuring of an evidence-based argument about test reliability and construct validity, drawing on multiple sources of evidence (Kane, 1992) collected via a mixed-methods design.

The results of Rasch analysis revealed that a quarter of the test items, which were gap-filling items requiring typed responses, were overly difficult and had unacceptably high measurement error values. Although the study outcomes demonstrated warrants of statistically acceptable reliability estimates, two threats to reliability and construct validity were identified: construct-irrelevance and construct under-representation. The overly difficult items introduced construct-irrelevant difficulty, as some test takers found these items difficult for reasons unrelated to the target construct, so the resulting scores might have been invalidly low. Thirty percent of the test items also had unacceptable fit statistics, suggesting that they did not contribute independently to test reliability and that they assessed student performance inconsistently. Items with unacceptable fit statistics indicated a departure from unidimensionality: the test might have measured construct-irrelevant sub-dimensions rather than the single dimension of language proficiency. Construct under-representation was identified through gaps between item difficulty and person ability measures, suggesting that the test did not capture examinees’ ability levels well. Because item difficulty did not match test takers’ ability levels, the construct might have been under-represented by the item set, and better-quality items might be needed to address the full range of ability levels. Given this evidence of reliability and construct validity issues, the test scores might not be reliable and valid indicators of the target construct. Further investigation examined a number of factors that could be potential sources of these reliability and construct validity issues, interfering with test performance in the Moodle-hosted technology-enhanced testing mode.
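The item-person mismatch described above can be illustrated with a minimal sketch of the dichotomous Rasch model, under which the probability of a correct response depends only on the gap between person ability and item difficulty. All logit values below are hypothetical, chosen purely for illustration; they are not taken from the thesis data.

```python
import math

def rasch_probability(theta, b):
    """Dichotomous Rasch model:
    P(X = 1 | theta, b) = exp(theta - b) / (1 + exp(theta - b)),
    where theta is person ability and b is item difficulty (both in logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Hypothetical logit measures (illustrative only, not from the thesis):
person_abilities = [-1.2, -0.5, 0.0, 0.4, 0.9]   # examinee ability estimates
item_difficulties = [1.5, 2.0, 2.4, 0.1, -0.3]   # item difficulty estimates

# Targeting gap: when mean item difficulty sits well above mean person
# ability, the item set under-represents the ability range it should
# measure (construct under-representation).
mean_theta = sum(person_abilities) / len(person_abilities)
mean_b = sum(item_difficulties) / len(item_difficulties)
targeting_gap = mean_b - mean_theta   # positive gap => items too hard overall

# A mid-ability examinee (theta = 0) facing an overly difficult item
# (b = 2) has a very low success probability (roughly 0.12).
p_hard_item = rasch_probability(0.0, 2.0)
```

In this sketch, a large positive `targeting_gap` mirrors the thesis finding that item difficulties exceeded examinee ability levels, so scores on such items say more about item mistargeting than about the intended proficiency construct.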

Based on a comparison of test scores with examinees’ post-test questionnaire responses, the study revealed that test performance was significantly affected by the testing mode through construct-irrelevant technology-related factors. These constituted strong rebuttals to the reliability and construct validity claims in the validity argument. Variables that significantly affected test performance included: 1) test takers’ levels of technology experience, familiarity with Moodle tests, and computer literacy; 2) the functionality of headphones during the exam; 3) test takers’ attitudes towards the testing format; 4) the need to type responses for constructed-response test items; and 5) test time sufficiency and the use of a countdown timer. Other construct-irrelevant technology-related issues that did not significantly interfere with test performance were nonetheless considered issues of concern: 1) screen layout and scrolling; 2) note-taking and text-highlighting features; and 3) eye fatigue. Because negative evidence indicated that the testing mode effect threatened reliability and construct validity and created unfairness or bias issues, the validity argument concluded that decisions based on Moodle-hosted test scores could not be justified as reliable or valid. The research questions were answered in the validity argument based on combined evidence from the study outputs, including test and post-test questionnaire responses. A significant finding of this study, therefore, was that statistical analysis of test responses alone is insufficient for developing computerised tests that are holistically fit for purpose.

This study contributes knowledge to the field as its findings lay out significant implications and recommendations about the testing mode effect. Practitioners and researchers may wish to adopt these implications and recommendations as guidelines for creating, developing, implementing, and researching reliable and valid large-scale high-stakes tests delivered on Moodle, other course management systems, or any other computerised test delivery tools. To ensure policy-makers are informed about whether using test outcomes can be justifiably fair to students, future validation research studies should be conducted so that potential issues with this testing mode can be further identified and addressed.
Original language: English
Qualification: Doctor of Philosophy
Awarding Institution
  • School of Education, The University of Queensland, Australia
Supervisors/Advisors
  • Hillier, Mathew, Supervisor, External person
  • Iwashita, Noriko, Supervisor, External person
  • Campbell, Chris, Supervisor, External person
Award date: Dec 20 2017
Publication status: Published - Dec 20 2017

Keywords

  • Testing mode effect
  • Reliability
  • Construct validity
  • Construct-irrelevance
  • Construct under-representation
  • Course management system
  • Moodle
  • Validity framework
  • Validation
