Fleiss kappa
Description
Fleiss kappa is a statistical test used to measure the inter-rater agreement between two or more raters (also referred to as judges, observers or coders) when subjects (for example patients, images or biopsies) are assigned a categorical rating (for example different diagnoses). The test determines the degree to which raters agree beyond what would be expected by random chance alone.
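For readers who want to reproduce the overall statistic outside the app, the sketch below computes Fleiss kappa in Python with the fleiss_kappa function from statsmodels. The input layout (a subjects x categories count table) and the toy numbers are assumptions for illustration only; they are not taken from the sample data, and this is not necessarily the implementation the tool itself uses.

```python
import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa

# Rows = subjects, columns = categories.
# Each cell counts how many raters assigned that subject to that category,
# so every row sums to the (fixed) number of raters, here 3.
counts = np.array([
    [3, 0, 0],   # all three raters chose category 1 for subject 1
    [1, 2, 0],
    [0, 1, 2],
    [0, 0, 3],
])

kappa = fleiss_kappa(counts, method="fleiss")
print(round(kappa, 3))
```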
Interpreting kappa values:
| Kappa value | Interpretation |
| --- | --- |
| Kappa = 0 (null hypothesis) | Agreement is due to chance |
| 0.01-0.20 | Slight agreement |
| 0.21-0.40 | Fair agreement |
| 0.41-0.60 | Moderate agreement |
| 0.61-0.80 | Substantial agreement |
| 0.81-1.00 | Almost perfect agreement |
| Negative (Kappa < 0) | Agreement less than that expected by chance |
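If it is useful to attach these labels programmatically, a minimal helper is sketched below; the function name and cut-offs simply mirror the table above and are not part of the tool itself.

```python
def interpret_kappa(kappa):
    """Map a Fleiss kappa estimate to the descriptive labels in the table above."""
    if kappa < 0:
        return "Agreement less than that expected by chance"
    if kappa == 0:
        return "Agreement is due to chance"
    if kappa <= 0.20:
        return "Slight agreement"
    if kappa <= 0.40:
        return "Fair agreement"
    if kappa <= 0.60:
        return "Moderate agreement"
    if kappa <= 0.80:
        return "Substantial agreement"
    return "Almost perfect agreement"

print(interpret_kappa(0.45))  # Moderate agreement
```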
The following are the requirements and assumptions made by Fleiss kappa:
- Subjects are being assigned a categorical rating. The categories can be either nominal (for example ethnicity, gender etc.) or ordinal (socioeconomic status, level of education etc.). Note, however, that Fleiss kappa does not take the ordering of ordinal categories into account.
- The categories must be mutually exclusive; they cannot overlap. For example, a patient can only be assigned to the category of mild, moderate or severe illness, not to both mild and moderate illness.
- Each rater must use the same rating scale when categorising subjects and therefore has the same set of categories to choose from when assigning subjects.
- The raters are assumed to be non-unique. This means the raters assigning ratings to one subject are not assumed to be the same raters assigning ratings to other subjects. For example, a university wants to measure the inter-rater agreement between 5 raters in the diagnosis of a rare disease in children, and to do so it randomly samples from the 20 paediatricians it employs. For patient 1, 5 paediatricians at the university hospital are asked to provide a diagnosis of the rare disease; for patient 2, 5 different paediatricians are asked to provide their diagnosis, and so on. The unique-rater case would be if the same 5 paediatricians provided a diagnosis for every patient. Note that a rater may still rate more than one subject if randomly sampled to do so.
- The raters are independent of one another, meaning the rating assigned to a subject by one rater has no effect on the rating assigned by another.
- The subjects are randomly selected from the population of interest rather than specifically chosen.
Preparing your data
- Columns should represent the different raters
- Rows represent the subjects being rated
- Cells contain the rating assigned to each subject by each rater
- This is the format used by our sample data (a sketch of this layout follows below)
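As an illustration of this layout, the sketch below builds a small table of hypothetical ratings (raters as columns, subjects as rows, assigned categories as cell values) and converts it into the subjects x categories count table that Fleiss kappa works on. The aggregate_raters helper from statsmodels is used purely for demonstration; the column names and ratings are made up and are not the contents of the sample file.

```python
import pandas as pd
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical data in the layout described above:
# one row per subject, one column per rater, cells hold the assigned category.
ratings = pd.DataFrame({
    "rater_1": ["mild", "moderate", "severe", "mild"],
    "rater_2": ["mild", "moderate", "moderate", "mild"],
    "rater_3": ["moderate", "moderate", "severe", "mild"],
})

# Convert the subjects x raters matrix into a subjects x categories count table.
counts, categories = aggregate_raters(ratings.to_numpy())
print(categories)                             # category labels in column order
print(fleiss_kappa(counts, method="fleiss"))  # overall agreement
```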
Worked Example
Download the provided example file below
Instructions
- Click Analyze
- Open the sample data. It should have the subjects being assigned ratings as rows, the raters (judges/observers/coders) as columns and their respective ratings as the cell values.
- Under the tab Table you will see all your data as tabulated in the original file.
- Under the tab Selected columns, you will see the categorical variable(s) you have previously selected.
- Under the tab Select Variables, select the raters you would like to include in the test.
- Under the tab Fleiss kappa multiple raters, the following information is provided.
- The number of subjects
- The number of raters included
- Kappa
- Z value
- P value
- Kappa, Z value and P value are also calculated for each individual rating category, in addition to the overall agreement (a sketch of how the overall statistics can be derived follows below).
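For reference, the sketch below shows one way the overall kappa, Z value and two-sided P value can be computed by hand, using the large-sample standard error under the null hypothesis of chance agreement given by Fleiss (1971). This is an assumption for illustration; the app's exact formulas (including those for the individual categories) may differ.

```python
import numpy as np
from scipy.stats import norm

def fleiss_kappa_with_z(counts):
    """counts: subjects x categories array of rater counts;
    every row sums to the (fixed) number of raters m.
    Returns (kappa, z, p), with the standard error taken under the
    null hypothesis of chance agreement (Fleiss, 1971)."""
    counts = np.asarray(counts, dtype=float)
    n_subjects = counts.shape[0]
    m = counts.sum(axis=1)[0]                        # raters per subject

    p_j = counts.sum(axis=0) / (n_subjects * m)      # category proportions
    P_i = (np.sum(counts ** 2, axis=1) - m) / (m * (m - 1))
    P_bar, Pe_bar = P_i.mean(), np.sum(p_j ** 2)
    kappa = (P_bar - Pe_bar) / (1 - Pe_bar)

    q_j = 1 - p_j
    s = np.sum(p_j * q_j)
    var = 2 * (s ** 2 - np.sum(p_j * q_j * (q_j - p_j))) / (
        n_subjects * m * (m - 1) * s ** 2
    )
    z = kappa / np.sqrt(var)
    p_value = 2 * (1 - norm.cdf(abs(z)))
    return kappa, z, p_value
```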
Written by Arif Jalal