Reliability and Validation
 
Reliability and validity are concepts applied to instruments such as rating scales and screening tools. Instruments developed to measure simulation learning outcomes need evidence of their reliability and validity for rigorous research. If the competence of young surgeons is to be assessed using a laparoscopy simulator, the metrics obtained during training must be reliable and valid (Thijssen et al., 2010).
About reliability
Reliability refers to the repeatability of findings. Researchers use the concept of reliability to check how consistent the findings are for a given study when performed under the same controlled conditions repeatedly using the same methodology. If the results remain unchanged under these conditions, then the study’s findings can be deemed reliable. To assess the reliability of a research finding, one can test the matter under question at different points in time and with different observers to establish whether the results produced are consistent and repeatable.
There are different types of reliability, each with several measurement techniques. Each technique yields a numerical value known as a reliability coefficient (Bolarinwa, 2015):

Stability means test-retest reliability: the same people take the same test on separate occasions. The results of the second administration are compared and correlated with those of the first to give a measure of stability, e.g. the Spearman-Brown coefficient. Coefficients equal to or greater than 0.7 may be considered sufficient by some researchers, although others prefer a higher value.
Internal consistency, or homogeneity, is a measure used to evaluate the degree to which different test items that measure the same construct produce similar results. The process starts by splitting in half all items of a test that are intended to probe the same area of knowledge; a correlation between the two halves is then calculated.
Equivalence is inter-rater reliability, or interobserver agreement. It determines whether raters using the same instrument arrive at equivalent assessments. This is important because observers may be subjective: when different observers assign grades to simulation activities, they may score the same skills or behaviors differently. Equivalence is measured with, e.g., Cohen's Kappa coefficient. The Kappa statistic varies from 0 to 1, where "0 = agreement equivalent to chance" and "1 = perfect agreement" (Bolarinwa, 2015; Wong et al., 2012).
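The Kappa computation itself is straightforward: observed agreement minus chance agreement, divided by one minus chance agreement. The sketch below applies the standard formula to invented pass/fail grades from two observers:

```python
from collections import Counter

# Hypothetical grades: two observers independently rate the same ten
# simulated procedures as "pass" or "fail".
rater_a = ["pass", "pass", "fail", "pass", "fail",
           "pass", "pass", "fail", "pass", "pass"]
rater_b = ["pass", "fail", "fail", "pass", "fail",
           "pass", "pass", "fail", "pass", "fail"]

def cohens_kappa(a, b):
    """Cohen's Kappa: agreement between two raters, corrected for chance."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n            # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum((ca[k] / n) * (cb[k] / n)                    # chance agreement
              for k in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)

kappa = cohens_kappa(rater_a, rater_b)
print(f"Cohen's Kappa = {kappa:.2f}")  # 0 = chance-level, 1 = perfect
```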
About validity
In scientific research, the term "validity" refers to the accuracy and soundness of the conclusions drawn from the data. It guarantees that a study measures what it is intended to measure and that the results reflect the phenomenon under investigation. Within this concept there are several types of validity, including face, content, construct, concurrent, and predictive (criterion) validity, which relate to different aspects of the research process.

Face validity refers to whether a scale "appears" to measure what it is supposed to measure. Face validity determines the overall property of a task of the simulator. It is usually assessed through questionnaires answered by experts in the field and shows whether trainees accept the simulation as a valid educational tool (Hassan et al., 2006; Munro, 2012). According to Bolarinwa (2015), face validity evaluates whether each of the measuring items matches any given conceptual domain of the concept. Some authors (Bölenius et al., 2012) are of the opinion that face validity is a component of content validity, while others believe it is not (Cook et al., 2006; Kember et al., 2008; Sangoseni et al., 2013).
Content validity refers to whether a test or scale measures all of the components of a given construct. It reflects the extent to which the simulator task under study includes all relevant steps of the technique or procedure. Over the years, different rating schemes have been proposed and developed, using Likert scales or absolute number ratings (Bolarinwa, 2015). Content validity is often assessed by interviewing expert surgeons. Face and content validity are subjective assessments of a simulator's validity (Hassan et al., 2006; Thijssen et al., 2010; Munro, 2012; Thomas et al., 2014). Despite this, face and content validity can be effectively used by experts to quickly identify poor-quality research.
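One quantitative form such expert ratings can take is an item-level content validity index (the proportion of experts rating an item as relevant). The sketch below uses invented 4-point relevance ratings from five hypothetical experts; the 0.8 flagging threshold is likewise an illustrative assumption:

```python
# Hypothetical ratings: five expert surgeons rate each step of a
# simulated procedure for relevance on a 4-point Likert scale
# (1 = not relevant ... 4 = highly relevant).
ratings = {
    "port placement":   [4, 4, 3, 4, 4],
    "camera handling":  [3, 4, 4, 3, 4],
    "knot tying":       [4, 4, 4, 4, 3],
    "cosmetic closure": [2, 3, 2, 2, 3],
}

def item_cvi(scores):
    """Item-level content validity index: share of experts rating 3 or 4."""
    return sum(s >= 3 for s in scores) / len(scores)

cvis = {step: item_cvi(scores) for step, scores in ratings.items()}
for step, cvi in cvis.items():
    flag = "" if cvi >= 0.8 else "  <- candidate for revision"
    print(f"{step}: CVI = {cvi:.2f}{flag}")
```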
The term ‘construct’ can be quite confusing when used in the context of testing and construct validity. It may give the impression that it is associated with the designing or building of a test. However, in research, the term ‘construct’ (such as in construct validity) refers to any psychological attribute, theoretical idea, abstract subject, or underlying theme that a researcher wishes to measure in a study. Construct validity evaluates whether a measurement tool really represents the thing we are interested in measuring, for example whether there is a statistically significant difference in measured performance between groups with different experience and skills. It is central to establishing the overall validity of a method. Demonstrating a significant difference in scores between novices, senior residents, and expert surgeons shows that the simulator correctly identifies quantifiable aspects of surgical skill. A simulator has construct validity as a training system if it improves the task performance of inexperienced surgeons toward the level of expert surgeons in minimally invasive surgery (Hassan et al., 2006; Munro, 2012; Thomas et al., 2014).
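A simple way to quantify such a group difference is an effect size. The sketch below uses invented completion times and Cohen's d (chosen here for illustration, not a method prescribed in the cited literature) to contrast novice and expert performance:

```python
import statistics

# Hypothetical simulator metric: task completion time in seconds.
# Construct validity predicts that experts outperform novices.
novices = [185, 210, 198, 225, 190, 205]
experts = [120, 135, 110, 128, 140, 118]

def cohens_d(g1, g2):
    """Cohen's d effect size between two independent groups."""
    m1, m2 = statistics.mean(g1), statistics.mean(g2)
    v1, v2 = statistics.variance(g1), statistics.variance(g2)
    pooled_sd = (((len(g1) - 1) * v1 + (len(g2) - 1) * v2)
                 / (len(g1) + len(g2) - 2)) ** 0.5
    return (m1 - m2) / pooled_sd

d = cohens_d(novices, experts)
print(f"novice mean = {statistics.mean(novices):.1f} s, "
      f"expert mean = {statistics.mean(experts):.1f} s, d = {d:.2f}")
```

A large d with novices slower than experts is the pattern a simulator with construct validity should reproduce.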
Concurrent validity is a method of assessing validity that involves comparing a device with an already existing device, or an already established criterion. This type of validity measures the degree to which the simulator correlates with existing performance measures of the same surgical task or procedure, e.g., by another simulator of the same type that has previously undergone validation (Hassan et al., 2006; McGaghie et al., 2011; Munro, 2012; Thomas et al., 2014). Concurrent validity is achieved when there is a strong correlation between performance on a VR simulator and on an established form of laparoscopic assessment, such as a box trainer (Wanzel et al., 2002).
Predictive (criterion) validity measures the degree to which the test predicts, at a later time, other measures of the same kind, such as standardized outcomes of surgical procedures in the operating room environment (Hassan et al., 2006; Munro, 2012; Thomas et al., 2014). Simulator metrics display predictive validity when they show a strong correlation with objective assessment of in vivo surgical skill (Thijssen et al., 2010). Construct, concurrent, and predictive validity provide quantitative measures of validity for the metrics employed by the simulator (Wanzel et al., 2002) (Figure 1).
Based on another classification, there are two types of validity: internal and external. Internal validity refers to whether or not the results of an experiment are due to the manipulation of the independent variables. It determines the extent to which a cause-and-effect relationship established between the variables of a study holds. That is, the causal relationship determined between the variables should be valid and not influenced by any other factor (i.e., not due to methodological errors).
External validity refers to whether the results of a study generalize to the real world or other situations. It shows how applicable a sample study is in representing the features of its parent population and the effect of the variable being tested. Therefore, it can be referred to as the degree to which the study’s findings would be useful in understanding the population and the extent to which the findings can be used in the field of study for other research studies (Andrade, 2018).
A research study is qualitatively acceptable in the scientific world when it adheres to both reliability and validity (Figure 2). Figure 2 explains why both of these factors are necessary in any study:
- Neither valid nor reliable – the research methods do not hit the heart of the research aim (not valid) and repeated attempts are unfocused;
- Reliable, not valid – the research methods do not hit the heart of the research aim, but repeated attempts give almost the same (but wrong) results; that is, you are consistently and systematically measuring the wrong value;
- Valid, not reliable – hits are randomly spread across the target; you seldom hit the center, but on average you get the right answer for the group (though not very well for individuals), which shows that reliability is directly related to the variability of your measure;
- Both reliable and valid – the research method hits the heart of the research aim and repeated attempts all hit the heart (similar results) (based on Bolarinwa, 2015).
References
Andrade C. Internal, External, and Ecological Validity in Research Design, Conduct, and Evaluation, Indian Journal of Psychological Medicine, 40 (5): 498-499, 2018.
Bolarinwa O.A., Principles and Methods of Validity and Reliability Testing of Questionnaires Used in Social and Health Science Researches, Nigerian Postgraduate Medical Journal, 22(4): 195-201, 2015.
Bölenius K, Brulin C, Grankvist K, Lindkvist M, Söderberg J., A content validated questionnaire for assessment of self reported venous blood sampling practices. BMC Res Notes;5:39, 2012.
Cook D.A., Beckman T.J., Current concepts in validity and reliability for psychometric instruments: Theory and application, Am J Med.,119:166.e7‑16, 2006.
Hassan I., Maschuw K., Rothmund M., et al. Novices in surgery are the target group of a virtual reality training laboratory. Eur Surg Res, 38:109 –13, 2006.
Kember D, Leung DY. Establishing the validity and reliability of course evaluation questionnaires. Assess Eval High Educ, 33:341‑53, 2008.
McGaghie W.C., Issenberg S.B., Cohen E.R., Barsuk J.H., Wayne D.B., Does simulation‑based medical education with deliberate practice yield better results than traditional clinical education? A meta‑analytic comparative review of the evidence. Acad Med 86: 706‑711, 2011.
Munro M.G., Surgical simulation: Where have we come from? Where are we now? Where are we going? J Minim Invasive Gynecol 19: 272‑283, 2012.
Sangoseni O, Hellman M, Hill C. Development and validation of a questionnaire to assess the effect of online learning on behaviors, attitude and clinical practices of physical therapists in United States regarding of evidence‑based practice. Internet J Allied Health Sci Pract, 11:1‑12, 2013
Thijssen A.S., Maries P., Schijven M.D., Contemporary virtual reality laparoscopy simulators: quicksand or solid grounds for assessing surgical trainees?, The American Journal of Surgery, 199: 529-541, 2010.
Thomas G.W., Johns B.D., Marsh J.L., Anderson D.D., A review of the role of simulation in developing and assessing orthopaedic surgical skills. Iowa Orthop J 34: 181‑189, 2014.
Wanzel K.R., Ward M., Reznick R.K., Teaching the surgical craft: from selection to certification. Curr Probl Surg, 39:573-659, 2002.
Wong K.L., Ong S.F., Kuek T.Y., Constructing a survey questionnaire to collect data on service quality of business academics. Eur J Soc Sci, 29:209-21, 2012.
 
        