The Hong Kong Practitioner

October 2003, Volume 25, No. 10

Original Article

Reliability and construct validity of the Chinese (Hong Kong) SF-36 for patients in primary care

C L K Lam 林露娟

HK Pract 2003;25:468-475

Summary

Objective: To assess the internal and test-retest reliability, and construct validity of the Chinese (Hong Kong) SF-36 for patients in primary care.

Design: Cross-sectional survey by face-to-face questionnaire interview, with retest by telephone interview.

Subjects: 500 Chinese patients aged 18 or above attending a government general outpatient primary care clinic in Hong Kong.

Main outcome measures: Internal reliability was measured by Cronbach's alpha. Test-retest reliability was measured by the difference between test-retest scores and intraclass correlation. Construct validity was assessed by the correlations between the Chinese (Hong Kong) SF-36 scores and the Chinese COOP/WONCA Chart scores, and the correlation between the Chinese (Hong Kong) SF-36 scores and the total number of chronic diseases.

Results: Internal reliability coefficients of all the Chinese (Hong Kong) SF-36 scales exceeded 0.7; there was no clinically important difference between test-retest scores of the Chinese (Hong Kong) SF-36. The expected correlations were observed between the Chinese (Hong Kong) SF-36 scores and the COOP/WONCA Chart scores. There was a negative correlation between the total number of chronic diseases and the scores of five scales of the Chinese (Hong Kong) SF-36.

Conclusion: The Chinese (Hong Kong) SF-36 was reliable for group comparison and had good convergent and divergent construct validity for patients in primary care.

Keywords: Quality of life, the SF-36, COOP/WONCA Charts, Chinese, Validity, Reliability.

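Summary (translated from the Chinese)

Objective: To assess the internal and test-retest reliability and the validity of the Chinese (Hong Kong) translation of the SF-36.

Design: Cross-sectional questionnaire survey by face-to-face interview, with retest by telephone interview.

Subjects: 500 Chinese patients aged 18 or above who had attended a government general outpatient clinic.

Main outcome measures: Internal reliability was measured by Cronbach's alpha. Test-retest reliability was measured by the difference between test-retest scores and by intraclass correlation. Validity was assessed by the correlations between the Chinese (Hong Kong) SF-36 scores and the Chinese COOP/WONCA Chart scores, and by the correlation between the Chinese (Hong Kong) SF-36 scores and the total number of chronic diseases.

Results: The internal reliability coefficients of all scales of the Chinese (Hong Kong) SF-36 exceeded 0.7, and there was no clinically important difference between its test-retest scores. The expected correlations were found between the Chinese (Hong Kong) SF-36 scores and the COOP/WONCA Chart scores. The total number of chronic diseases was negatively correlated with the scores of five domains of the Chinese (Hong Kong) SF-36.

Conclusion: In primary care, the Chinese (Hong Kong) SF-36 is reliable for group comparison and has good validity.

Keywords: Quality of life, SF-36, COOP/WONCA Charts, Chinese, validity, reliability.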

Introduction

The Chinese (Hong Kong) SF-36 is a Chinese translation of the MOS 36-item Short-form Health Survey (SF-36) adapted to the Chinese population in Hong Kong. The SF-36 is a generic measure of health-related quality of life (HRQOL). It has eight scales: physical functioning (PF), role-physical (RP, limitation of daily roles due to physical problems), bodily pain (BP), general health (GH), vitality (VT), social functioning (SF), role-emotional (RE, limitation of daily roles due to emotional problems), and mental health (MH). Each scale is scored from 0 to 100, with higher scores indicating better HRQOL.¹ An earlier study showed it to be acceptable and relevant to the Chinese in Hong Kong,² and a norm reference has been established for the general adult population in Hong Kong.³

The aim of this study was to assess the reliability and validity of the Chinese (Hong Kong) SF-36 for patients in primary care. Reliability is defined as the degree to which an instrument is free from error.^4,5 The level of reliability determines the highest degree of validity possible, but it does not automatically imply validity. Validity means that the instrument really measures what it purports to measure.^4,6 These are the two most important properties of a measuring tool, and they must be confirmed before the instrument can be applied to the relevant population.

There are two types of reliability. The first is scale internal reliability, which is based on the theory that the result is likely to be an accurate representation of the actual state if the results measured by different items are consistent.^4,6 The second is test-retest reliability, that is, whether the measure gives reproducible results on repeated measurements of the same condition.^4,6 Earlier studies showed that the internal reliability coefficients of the social functioning and general health scales of the Chinese (Hong Kong) SF-36 were just short of the generally expected standard of 0.7.^2,3 This study further assessed the internal reliability and determined the test-retest reliability of the Chinese (Hong Kong) SF-36 for Chinese patients in primary care.

Ideally, the validity of a measure should be tested against a gold standard (criterion), but none is available for HRQOL measurement. In the absence of a gold standard, the best that one can test is construct validity.^4,7,8 A construct is an abstract variable that is constructed to reflect a hypothesis on how measurable variables will correlate with one another.^6,9 There are three steps in the testing of construct validity: the first is the construction of the domain of variables; the second is the establishment of the internal structure of observed variables; and the third is the verification of the hypothesised correlation between the theoretical construct and other external criteria.^4,7,10,11 The construct of the SF-36 is the eight domains of HRQOL measured by eight scales. The construct validity of the internal structure of the observed variables of the scales has been confirmed in an earlier study.² This study aimed to verify the correlations between the SF-36 scale scores and external criteria.

Methods and subjects

Chinese patients aged 18 or above attending a government general outpatient clinic in Hong Kong were randomly selected by matching the patient's appointment number for the particular clinic session against a pre-determined random number table. Each eligible patient was invited to be interviewed face-to-face by a trained interviewer with the Chinese (Hong Kong) SF-36, the Chinese COOP/WONCA Charts, and a structured questionnaire on sociodemographic characteristics and the presence of chronic diseases. A copy of the Chinese (Hong Kong) SF-36, the COOP/WONCA questions without the illustrations and the questionnaire can be obtained from the author upon written request.

The COOP/WONCA Charts are an HRQOL measure that assesses six domains (physical fitness, feelings, daily activities, social activities, change in health and overall health) with six single-item charts.^12,13 Each chart is rated on a five-point scale with higher scores indicating worse HRQOL. The Charts have been translated into Chinese and shown to be valid, reliable and sensitive in Chinese patients in primary care.^14,16

Chronic morbidity was measured by the total number and diagnosis of self-reported chronic diseases. Each subject was asked if he/she had ever been diagnosed by a registered medical practitioner, for more than one month, as having hypertension, diabetes mellitus, heart disease of any kind, stroke, chronic pulmonary disease (asthma or other chronic respiratory problems), chronic joint problem, psychological illness or any other chronic disease. The total number of chronic diseases was the sum of the positive responses to these questions.

Five hundred and three eligible patients were sampled, of whom three refused to be interviewed. Five hundred (99.4%) subjects completed the initial interview. To assess test-retest reliability, subjects were interviewed again by telephone with the Chinese (Hong Kong) SF-36 within one week of the first interview. Three hundred and sixty-two (72.4%) of those who had the first interview completed the second one. The characteristics of all the subjects and of those who completed both the first and second interviews are shown in Table 1. There was no significant difference between the two samples.

Data analysis and hypotheses

The responses to the Chinese (Hong Kong) SF-36 were re-coded and the scale scores were calculated by the standard algorithm described in the SF-36 Manual.¹⁷ The distribution by the proportion of subjects of the scores of the Chinese COOP/WONCA Charts was determined.
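The last step of the standard scoring can be illustrated with a minimal sketch. It assumes the usual linear transformation of a summed raw scale score to the 0 to 100 range and uses the PF scale (ten items, each coded 1 to 3) as the example; the item recoding that precedes it is not shown, and the function name and the responses are hypothetical.

```python
def transform_scale(raw_score, lowest_raw, highest_raw):
    """Rescale a summed raw scale score linearly to the 0-100 range."""
    raw_range = highest_raw - lowest_raw
    return (raw_score - lowest_raw) * 100.0 / raw_range


# Hypothetical responses to the ten PF items, each already coded 1-3,
# so the raw PF score can range from 10 to 30.
pf_items = [3, 3, 2, 2, 3, 3, 3, 2, 3, 3]
pf_raw = sum(pf_items)
pf_score = transform_scale(pf_raw, lowest_raw=10, highest_raw=30)
print(pf_raw, pf_score)  # prints: 27 85.0
```

A respondent at the lowest possible raw score maps to 0 and one at the highest to 100, which is why higher scale scores indicate better HRQOL.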

The internal reliability of the Chinese (Hong Kong) SF-36 was measured by Cronbach's alpha, and 0.7 or above was used as the standard for group comparison.^5,18,19 Test-retest reliability was assessed by the difference between the test and retest scores, the statistical significance of which was analysed by paired-samples t tests. It was hypothesised that, if the measure was reproducible, the difference should not be statistically significant and 95% of the test-retest differences should be within 2 standard deviations (SD) of the mean difference.^4,20 Test-retest reliability was further assessed by intraclass correlation (ICC), which measures the average similarity of subjects' actual scores on the two ratings; 0.7 or above was taken as the desirable standard for group evaluation.^4,5
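The reliability statistics described above can be sketched in a few lines of Python. This is an illustration only, not the SPSS procedure used in the study: the data are hypothetical, the function names are invented for the example, and the one-way random-effects form of the ICC is an assumption because the paper does not state which form was computed.

```python
from statistics import mean, stdev, variance


def cronbach_alpha(items):
    """Cronbach's alpha for one scale.

    items: one list of responses per item, all in the same respondent order.
    """
    k = len(items)
    totals = [sum(responses) for responses in zip(*items)]
    item_variance_sum = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


def proportion_within_2sd(test, retest):
    """Share of test-retest differences within 2 SD of the mean difference."""
    diffs = [r - t for t, r in zip(test, retest)]
    centre, spread = mean(diffs), stdev(diffs)
    inside = sum(1 for d in diffs if abs(d - centre) <= 2 * spread)
    return inside / len(diffs)


def icc_oneway(test, retest):
    """One-way random-effects ICC for two ratings per subject (assumed form)."""
    n, k = len(test), 2
    grand_mean = mean(list(test) + list(retest))
    subject_means = [(t + r) / 2 for t, r in zip(test, retest)]
    ms_between = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    ms_within = sum(
        (t - m) ** 2 + (r - m) ** 2
        for t, r, m in zip(test, retest, subject_means)
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical data: three items of one scale answered by five respondents,
# and scale scores for the same five respondents at test and retest.
scale_items = [[1, 2, 3, 4, 5], [2, 2, 3, 5, 5], [1, 3, 3, 4, 4]]
first_scores = [50.0, 60.0, 70.0, 80.0, 90.0]
second_scores = [52.0, 59.0, 71.0, 80.0, 88.0]

alpha = cronbach_alpha(scale_items)
share_inside = proportion_within_2sd(first_scores, second_scores)
icc = icc_oneway(first_scores, second_scores)
print(f"alpha={alpha:.2f}, within 2 SD={share_inside:.0%}, ICC={icc:.2f}")
```

Each coefficient would then be compared with the 0.7 standard for group comparison, and the share of differences within 2 SD with the expected 95%.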

The Chinese COOP/WONCA Chart scores and the total number of chronic diseases were used as external criteria for testing the construct validity of the Chinese (Hong Kong) SF-36.

It was hypothesised that there should be significant correlations (convergent validity) between related domains of the Chinese (Hong Kong) SF-36 and the Chinese COOP/WONCA Charts: the SF-36 physical functioning (PF) score should correlate with the COOP/WONCA physical fitness score; the SF-36 role-physical and role-emotional (RP and RE) scores should correlate with the COOP/WONCA daily activities score; the SF-36 social functioning (SF) score should correlate with the COOP/WONCA social activities score; the SF-36 general health (GH) score should correlate with the COOP/WONCA overall health score; and the SF-36 mental health (MH) score should correlate with the COOP/WONCA feelings score. A review by McDowell et al showed that 0.4 was generally accepted as the minimal standard for convergent validity.^4,21 On the other hand, there should not be any significant correlation (divergent validity) between scores of unrelated domains: the SF-36 PF score should not be related to the COOP/WONCA feelings score, and the SF-36 MH score should not be related to the COOP/WONCA physical fitness score. The correlations between the Chinese (Hong Kong) SF-36 and Chinese COOP/WONCA Chart scores were measured by Spearman's rho.
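As an illustration of the convergent validity test, the sketch below computes Spearman's rho as the Pearson correlation of average ranks, which handles the tied values that five-point chart ratings produce. The data and function names are hypothetical, and the study itself used SPSS rather than this code.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for position in range(i, j + 1):
            ranks[order[position]] = tied_rank
        i = j + 1
    return ranks


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical scores: SF-36 PF (higher = better) and the COOP/WONCA
# physical fitness chart (higher = worse) for six patients.
sf36_pf = [95, 85, 85, 60, 40, 20]
coop_fitness = [1, 2, 2, 3, 4, 5]

rho = spearman_rho(sf36_pf, coop_fitness)
meets_standard = abs(rho) >= 0.4  # 0.4 standard applied to the magnitude
print(round(rho, 2), meets_standard)
```

Because the two instruments are scored in opposite directions, a related pair of domains is expected to give a negative rho, so the 0.4 standard is applied to its absolute value.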

It was hypothesised that patients with chronic diseases should have worse HRQOL than those without any chronic disease. Two-sample t tests were used to test the statistical significance of the differences in SF-36 scores between the two groups. Furthermore, there should be negative correlations between the Chinese (Hong Kong) SF-36 scores and the total number of chronic diseases, which were measured by Pearson correlations.
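A minimal sketch of these two analyses follows. It assumes the pooled-variance (Student) form of the two-sample t test, which the paper does not specify, and it returns only the t statistic and degrees of freedom, leaving the p value to a t table or statistics package. All data and names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(group_a, group_b):
    """Student's two-sample t statistic with pooled variance; returns (t, df)."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / df
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / standard_error, df


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mean_x, mean_y = mean(x), mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical GH scores for patients without and with chronic disease.
gh_no_disease = [70.0, 80.0, 90.0]
gh_with_disease = [50.0, 60.0, 70.0]
t_stat, df = pooled_t(gh_no_disease, gh_with_disease)

# Hypothetical disease counts and GH scores for the same six patients.
disease_count = [0, 0, 1, 2, 2, 3]
gh_scores = [85.0, 80.0, 70.0, 65.0, 55.0, 50.0]
r = pearson_r(disease_count, gh_scores)
print(round(t_stat, 2), df, round(r, 2))
```

A positive t statistic here means the group without chronic disease scored higher, and a negative r means scores fall as the number of chronic diseases rises, which is the hypothesised pattern.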

All data analyses were carried out with SPSS for Windows, version 11.0 (SPSS Inc, 2002).

Results

One of the 500 subjects did not answer question 11d (item GH5) of the Chinese (Hong Kong) SF-36. There were no missing or out-of-range data from the Chinese COOP/WONCA Charts. The distribution of the Chinese COOP/WONCA Chart scores and the mean Chinese (Hong Kong) SF-36 scores of the sample are shown in Table 2. Scores of the COOP/WONCA Charts are presented as proportions because they are categorical.

Reliability of the Chinese (Hong Kong) SF-36

The internal reliability (Cronbach's alpha) and test-retest reliability of the eight SF-36 scales are shown in Table 3. The internal reliability was above the standard of 0.7 for all scales, including the social functioning scale. The differences between the test and retest scores were all less than five points; statistically significant differences were found only in the bodily pain (BP) and social functioning (SF) scales. The proportions of differences that were within 2 SD of the mean difference were near 95% for all but the RP scale. Intraclass correlations were above 0.7 for six scales; the coefficient was just short of this standard for the role-emotional (RE) scale and below 0.5 for the social functioning (SF) scale.

Construct validity of the Chinese (Hong Kong) SF-36

Table 4 shows the Spearman's correlations between the Chinese (Hong Kong) SF-36 scores and the Chinese COOP/WONCA Chart scores that were statistically significant (p<0.05); correlations with an absolute value above 0.4 are shown in bold. The expected direction of correlations between scores of related domains was negative because higher SF-36 scores indicate better HRQOL but higher COOP/WONCA Chart scores represent poorer HRQOL. There were strong correlations (absolute value >0.4) between the physical functioning (PF) and physical fitness scores, the general health (GH) and overall health scores, the vitality (VT) and overall health scores, the role-physical (RP) and daily activities scores, the social functioning (SF) and social activities scores, and the mental health (MH) and feelings scores. The role-emotional (RE) score correlated weakly (r = -0.13 to -0.27) with all COOP/WONCA Chart scores except the physical fitness score. There was no significant correlation between the PF and feelings scores or between the MH and physical fitness scores, supporting divergent validity.

Table 5 compares the SF-36 scores of patients with and without any chronic disease. The Hong Kong Chinese adult population means and standard deviations (SD) are also shown for comparison.³ The scores of the subjects in the physical health related domains were generally lower than the general population norm, but their scores in the mental health related domains were higher. The scores of patients with chronic diseases were lower than those of patients without any chronic disease in five scales, and the differences were statistically significant for physical functioning (PF), bodily pain (BP) and general health (GH). The social functioning score of patients with chronic diseases was significantly higher than that of patients without any chronic disease. There was a negative correlation between the total number of chronic diseases and the physical functioning (PF), bodily pain (BP), general health (GH), vitality (VT) and mental health (MH) scores, but a significant positive correlation between the total number of chronic diseases and the social functioning score (Table 6).

Discussion

The Cronbach's alpha coefficients for the internal reliability of the SF-36 scales were all above 0.7, and those of the role-physical, bodily pain and social functioning scales exceeded the standard of 0.9 for individual assessment. The results were better than those found in an earlier study on patients in primary care,² probably because the sample of this study was larger and there was more variation between subjects.⁴

The Chinese (Hong Kong) SF-36 scores were generally reproducible on repeated measurements. The direction of the change in the scores was not consistent among the different scales, suggesting that there was no systematic bias and that the variations were mostly random. Although the differences in the BP and SF scores were statistically significant, the effect sizes (mean score difference divided by the SD of the first interview) were less than 0.3, which is generally not considered to be clinically important.^22-24 The intraclass correlation (ICC) of the role-emotional (RE) scale was just short of 0.7, but many experts agree that 0.5 may be adequate for group assessments.^4,5 However, the low ICC of the social functioning scale deserves further evaluation.
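The effect size used above can be written out as a short sketch. The scores are hypothetical, and taking the difference as retest minus test is an assumption about the sign convention; only the magnitude is compared with the 0.3 threshold.

```python
from statistics import mean, stdev


def effect_size(test, retest):
    """Mean score difference divided by the SD of the first interview."""
    return (mean(retest) - mean(test)) / stdev(test)


# Hypothetical BP scores for four patients at the first and second interview.
bp_first = [60.0, 70.0, 80.0, 90.0]
bp_second = [62.0, 71.0, 83.0, 92.0]

es = effect_size(bp_first, bp_second)
clinically_important = abs(es) >= 0.3
print(round(es, 2), clinically_important)
```

With a large sample, a small mean difference such as this can be statistically significant while its effect size stays well under 0.3.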

The repeat interview was carried out by telephone and gave similar results to those obtained by face-to-face interview, suggesting that the data collected by these two methods can be pooled. Lam et al also showed that telephone interviews gave results on health service utilisation similar to those found in a face-to-face household survey.²⁵ These findings are important because the telephone interview is becoming a popular survey method in Hong Kong and is often used in combination with face-to-face interviews in the same study.

There may be a concern that subjects could remember their answers from the first interview when the interview was repeated within one week, leading to falsely high test-retest reliability. This was unlikely given the large number of questions that each subject had to answer. Conversely, subjects' conditions may change if the test-retest interval is too long, resulting in falsely low reliability for a responsive measure. Most experts recommend an interval of one to two weeks between interviews for assessing test-retest reliability.^4,26

Construct validity of the Chinese (Hong Kong) SF-36

The hypothesised correlations between the Chinese (Hong Kong) SF-36 scores and the Chinese COOP/WONCA Chart scores were generally observed, confirming convergent and divergent construct validity. As other studies have found, the COOP/WONCA daily activities score correlated strongly with the role-physical (RP) score but only weakly with the role-emotional (RE) score.^27,28 The RE score correlated significantly with the COOP/WONCA feelings score but not with the physical fitness score, supporting the construct validity of this scale in measuring role limitations related to emotional rather than physical problems. The results support the construct of RE and RP as two separate scales. The combination of these two SF-36 scales into one single role functioning scale, as proposed by Fukuhara et al,^29,30 may miss the limitations caused by emotional problems.

The hypothesised negative correlations between the Chinese (Hong Kong) SF-36 scores and the total number of chronic diseases were found in only five of the eight scales. It was unexpected that the role-physical, role-emotional and social functioning scores of patients with chronic diseases were higher than those of patients without any chronic disease. One possible explanation is that subjects without any chronic disease who consulted the clinic were likely to have acute illnesses that interfered with their daily or social activities.

Conclusion

The Chinese (Hong Kong) SF-36 has been shown to have good internal and test-retest reliability among Chinese patients in primary care. There was little difference between the results obtained by face-to-face and telephone interviews, suggesting that data obtained by these two methods can be pooled for analysis. The construct validity of the Chinese (Hong Kong) SF-36 was confirmed by significant correlations (convergent validity) with related domain scores and non-significant correlations (divergent validity) with unrelated domain scores of the Chinese COOP/WONCA Charts. There was a negative correlation between the total number of chronic diseases and several scales of the Chinese (Hong Kong) SF-36, further supporting its construct validity.

The Chinese (Hong Kong) SF-36 can be used to assess HRQOL of patients in primary care in Hong Kong reliably and validly. The inclusion of HRQOL as an outcome measure of the impact of illnesses and the effects of treatments can make health care more patient-centred.

Acknowledgement

Parts of this paper have been submitted to the University of Hong Kong for the award of the Doctor of Medicine degree.

Key messages

The Chinese (Hong Kong) SF-36 is a health-related quality of life (HRQOL) measure.
It has been shown to have good internal reliability and adequate test-retest reliability for group assessment among Chinese patients in primary care.
There was no clinically important difference between results obtained by face-to-face and telephone interviews.
It has been shown to have construct validity for Chinese patients in primary care.
The Chinese (Hong Kong) SF-36 can be used to assess the patient-perceived effect of an illness or treatment.

C L K Lam, MBBS, MICGP, FRCGP, FHKAM(Family Medicine)
Associate Professor,
Family Medicine Unit, The University of Hong Kong.

Correspondence to: Dr C L K Lam, 3/F, Ap Lei Chau Clinic, 161 Main Street, Ap Lei Chau, Hong Kong.

References

1. Ware JE, Snow KK, Kosinski M, et al. SF-36 Health Survey Manual & Interpretation Guide. Boston: The Health Institute, New England Medical Center; 1993.
2. Lam CLK, Gandek B, Ren XS, et al. Tests of scaling assumptions and construct validity of the Chinese (HK) version of the SF-36 Health Survey. J Clin Epidemiol 1998;51:1139-1147.
3. Lam CLK, Lauder IJ, Lam TP, et al. Population based norming of the Chinese (HK) version of the SF-36 Health Survey. HK Pract 1999;21:460-470.
4. McDowell I, Newell C. The theoretical and technical foundations of health measurement. In: McDowell I, Newell C (Ed). Measuring Health - A Guide to Rating Scales and Questionnaires. New York: Oxford University Press; 1996;10-46.
5. Nunnally JC, Bernstein IH. The Assessment of reliability. In: Nunnally JC, Bernstein IH (Ed). Psychometric Theory. New York: McGraw-Hill, Inc; 1994;248-292.
6. Nunnally JC, Bernstein IH. Validity. In: Nunnally JC, Bernstein IH (Ed). Psychometric Theory. New York: McGraw-Hill, Inc; 1994;83-113.
7. Guyatt GH, Jaeschke R, Feeny DH, et al. Measurements in Clinical Trials: Choosing the Right Approach. In: Spilker B (Ed). Quality of Life and Pharmacoeconomics in Clinical Trials. Philadelphia: Lippincott-Raven Publishers; 1996;41-48.
8. Muldoon MF, Barger SD, Flory JD, et al. What are quality of life measurements measuring? BMJ 1998;316:542-545.
9. Ware JE, Keller SD. Interpreting General Health Measures. In: Spilker B (Ed). Quality of Life and Pharmacoeconomics in Clinical Trials. Philadelphia: Lippincott-Raven Publishers; 1996;445-460.
10. McHorney CA, Ware JE, Raczek AE. The MOS 36-Item Short Form Health Survey (SF-36), II: Psychometric and clinical tests of validity in measuring physical and mental health constructs. Med Care 1993;31:247-263.
11. Gandek B, Ware JE. Methods for validating and norming translations of health status questionnaires: the IQOLA Project approach. J Clin Epidemiol 1998;51:953-959.
12. Scholten JHG, van Weel C. Functional status assessment in family practice: the Dartmouth COOP functional health Assessment Charts/WONCA. Lelystad: Meditekst; 1992.
13. van Weel C, Kong-Zahn C, Touw-Otten FWMM, et al. Measuring Functional Status with the COOP/WONCA Charts: A Manual. Groningen, The Netherlands: Northern Centre for Health Care Research (NCH); 1995.
14. Lam CLK, van Weel C, Lauder IJ. Can the Dartmouth COOP/WONCA Charts be used to assess the functional status of Chinese patients? Family Practice 1994;11:85-94.
15. Lam CLK, Lauder IJ. The impact of chronic diseases on the health-related quality of life (HRQOL) of Chinese patients in primary care. Family Practice 2000;17:159-166.
16. Lam CLK, Lauder IJ, Lam DTP. How does a change in the administration method affect the reliability of the COOP/WONCA Charts? Family Practice 1999;16:184-189.
17. Ware JE, Snow KK, Kosinski M, et al. Scoring the SF-36. In: Ware JE, et al. (Ed). SF-36 Health Survey Manual & Interpretation Guide. Boston: The Health Institute, New England Medical Center; 1993;6:1-6:22.
18. Cronbach LJ. Coefficient alpha and the internal structure of tests. Psychometrika 1951;16:297-334.
19. Bland JM, Altman DG. Statistics notes - Cronbach's alpha. BMJ 1997;314:572.
20. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986;I:307-310.
21. Bullinger M, Anderson R, Cella D, et al. Developing and evaluating cross-cultural instruments from minimum requirements to optimal models. Quality of Life Research 1993;2:451-459.
22. Norman GR, Sridhar FG, Walter SD, et al. The relation of distribution- and anchor-based approaches in interpretation of changes in health related quality of life. Med Care 2001;39:1039-1047.
23. Kazis LE, Anderson JJ, Meenan RF. Effect sizes for interpreting changes in health status. Med Care 1989;27:S178-S189.
24. Cohen J. The t test for means. In: Cohen J (Ed). Statistical Power Analysis for the Behavioural Sciences. Hillsdale, New Jersey: Lawrence Erlbaum Associates; 1988;19-74.
25. Lam TH, Kleevans WL, Wong CM. Doctor-consultation in Hong Kong: a comparison between findings of a telephone interview with general household survey. Community Medicine 1988;10:175-179.
26. Deyo RA, Diehr PD, Patrick DL. Reproducibility and responsiveness of health status measures - Statistics and strategies for evaluation. Controlled Clinical Trials 1991;12:142s-158s.
27. van Weel C, Kong-Zahn C, Touw-Otten FWMM, et al. Validity. In: van Weel C, et al. (Ed). Measuring Functional Health Status with the COOP/WONCA Charts: A Manual. Groningen, The Netherlands: Northern Centre of Health Care Research; 1995;12-15.
28. Siu AL, Ouslander JG, Osterweil D, et al. Change in self-reported functioning in older persons entering a residential care facility. J Clin Epidemiol 1993;46:1093-1101.
29. Fukuhara S, Ware JE, Kosinski M, et al. Psychometric and clinical tests of validity of the Japanese SF-36 Health Survey. J Clin Epidemiol 1998;51:1045-1053.
30. Fukuhara S, Bito S, Green J, et al. Translation, adaptation, and validation of the SF-36 Health Survey for use in Japan. J Clin Epidemiol 1998;51:1037-1044.