Cross-cultural validation of the German and Turkish versions of the PHQ-9: an IRT approach

Background: The Patient Health Questionnaire’s depression module (PHQ-9) is a widely used screening tool to assess depressive disorders. However, cross-linguistic and cross-cultural validation of the PHQ-9 is mostly lacking. This study investigates whether scores on the German and Turkish versions of the PHQ-9 are comparable.
Methods: Data from Germans without a migration background (German version, n = 1670) and Turkish immigrants in Germany (either German or Turkish version, n = 307) were used. Differential Item Functioning (DIF) was assessed using Item Response Theory (IRT) models.
Results: Several items of the PHQ-9 were found to exhibit DIF related to language or ethnicity, e.g. ‘sleep problems’, ‘appetite changes’ and ‘anhedonia’. However, PHQ-9 sum scores were found to be unbiased, i.e., DIF had no notable impact on scale levels.
Conclusions: PHQ-9 sum scores can be compared between Turkish immigrants and Germans without a migration background without any adjustments, regardless of whether they complete the German or the Turkish version.
Keywords: Depression, Patient Health Questionnaire-9 (PHQ-9), Item response theory (IRT), Differential item functioning (DIF), Cross-cultural / ethnic comparison

* Correspondence: ricarda.nater-mewes@univie.ac.at
Department of Psychology, University of Marburg, Marburg, Germany
Outpatient Unit for Research, Teaching and Practice, Faculty of Psychology, University of Vienna, Renngasse 6-8, 1010 Vienna, Austria
Full list of author information is available at the end of the article

© The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reich et al. BMC Psychology (2018) 6:26

Background
Depression is a highly prevalent disorder leading to suffering and disability [1, 2]. It is predicted to be the major cause of burden of disease by 2020 [3]. Differences exist across countries and ethnic groups in epidemiology [4–7] and symptom presentation [8–10] of depressive disorders.

Many cross-cultural studies applied self-report questionnaires to assess and describe the phenomenology of depressive disorders. However, cross-linguistic and cross-cultural validation of self-report questionnaires is mostly lacking. Such validation analyses are urgently needed for a valid comparison of prevalence rates and symptom profiles of depressive disorders across linguistic and ethnic groups [11]. Among self-report questionnaires for assessing depression, the Patient Health Questionnaire-9 (PHQ-9) [12, 13] is one of the most frequently used and best validated questionnaires worldwide [14–16]. It is recommended as a general measure of depression severity by the DSM-5 (Diagnostic and Statistical Manual of Mental Disorders, 5th Edition) [17] and has been translated into over 70 languages and dialects [18].

In the present study, we investigate whether PHQ-9 scores are comparable between the German majority population without a migration background and the largest minority group in Germany, Turkish immigrants [19]. To our knowledge, only three studies have investigated the comparability of different language versions of the PHQ-9: Huang and colleagues [20] found differences in item functioning between the English and Chinese version of the items assessing sleep, appetite, and psychomotor changes in a large sample of primary care patients. Comparing the English and Spanish version, they also found differences in sleep and appetite items, plus anhedonia and self-esteem items. Arthurs and colleagues [21] found differences between the English and French version for anhedonia, sleep, and self-esteem items in patients with systemic sclerosis. Comparing the German and Russian version in primary care patients [22], a difference in item functioning was found in the sleep problems item.

Regarding the comparability across ethnic and racial groups, two studies have confirmed the comparability of the English version between African-American and non-Hispanic White primary care patients [20, 23]. Moreover, one study in a general population sample confirmed the comparability of the German version between Germans without a migration background and a heterogeneous sample of immigrants living in Germany [24]. However, Crane and colleagues found differences in items about sleep, low energy, and psychomotor changes between HIV-infected African-Americans and Whites in the English version [25], and Baas and colleagues confirmed a cultural bias in the Dutch version of the PHQ-9 in the item psychomotor changes between Surinam Dutch and Native Dutch male primary care patients [11]. Although the reasons for differences in item functioning are mostly unclear, most studies confirmed that such differences had minimal impact on the scale level and that sum scores were mainly comparable across the investigated samples.

To establish cross-linguistic and cross-cultural measurement equivalence, equality in item functioning needs to be inspected. The probability of endorsing a specific item should be the same for all individuals with a certain underlying level of depression, and should not be influenced by ethnic or linguistic group. If these prerequisites are not fulfilled, the item is considered to have Differential Item Functioning (DIF) [26, 27]. The absence of DIF justifies cross-cultural comparisons based on the sum score as an indicator for the latent trait, and allows observed differences to be related to actual differences between groups. DIF can be appropriately assessed using Item Response Theory (IRT) analysis [28, 29]. IRT provides parametric and nonparametric models, which constitute powerful tools for separating measurement bias from true group differences [30, 31].

The objective of this study is to investigate whether PHQ-9 scores are comparable between Turkish immigrants in Germany and Germans without a migration background. This is especially important since Turkish immigrants represent the largest minority group in Germany [19], and are among the three largest immigrant populations in other European countries such as the Netherlands, Denmark, and Austria [32]. Moreover, as prevalence rates of affective disorders in labor migrants in Europe are elevated [5, 33, 34], properly working assessment instruments for depression are particularly important in this group.

First, we examine whether the German and Turkish language versions of the PHQ-9 are comparable. Then, we examine whether the German PHQ-9 is comparable across ethnic groups. This two-step approach is necessary because Turkish language utilization and German language proficiency vary considerably among Turkish immigrants [35]. Based on previous studies on DIF in PHQ-9 items, one might expect DIF in the sleep, psychomotor changes, anhedonia, appetite changes, and low self-esteem items. However, this is the first study to investigate cross-linguistic and cross-cultural validity of the Turkish version of the PHQ-9, and one of the few to study this topic at all. Consequently, all items of the PHQ-9 were tested for DIF without a priori statistical assumptions. Based on the results, recommendations for applying the PHQ-9 in Turkish immigrants are provided.

Methods
Data sources
This article provides secondary analyses of original data obtained in four independent, cross-sectional studies.

Study 1
A representative sample of the German general population (n = 2510) was screened for disability, somatic complaints, mental health, and healthcare utilization. The assessment was conducted by a demographic consulting company (USUMA, Berlin) in 2007. The study material was available in German only. Details of the procedure are described elsewhere, e.g. [36]. For the present analyses, only data of Germans without a migration background and of Turkish immigrants responding to the German language version of the PHQ-9 are used.

Study 2
A convenience sample of Turkish immigrants (n = 214) completed questionnaires about perceived discrimination and depressive and somatoform symptoms. Data were collected in 2011 and 2012 [37]. The study material was provided in German or Turkish according to the participants’ choice. The study was carried out using an online survey and paper-and-pencil versions with a snowball system.

Study 3
Two matched inpatient samples (Turkish immigrants vs. Germans without a migration background, n = 50 each) were recruited in five psychiatric clinics in 2011 and 2012 [38]. Participants were asked about subjective concepts of mental illness, motivation for psychotherapy, and mental health symptoms. The study material was provided as paper-and-pencil versions in German or Turkish according to the participants’ choice.
A bilingual research assistant helped illiterate participants.

Study 4
In a pilot study, Turkish immigrant inpatients (n = 29) were recruited to participate in a randomized controlled trial (RCT) on the effects of a motivation-enhancing program at the beginning of their inpatient treatment. They provided baseline information about motivation for psychotherapy, mental health symptoms, and illness perception at the beginning of inpatient treatment in two different psychiatric clinics in 2013 and 2014. Study material was available on a computer in German or Turkish according to the participants’ choice. A bilingual research assistant helped participants who were illiterate or needed assistance with the computer. This sample was included so that Turkish immigrants with a low level of literacy were represented in the analysis. Persons with low German language proficiency and low educational levels are usually excluded from research in Germany, but are characteristic of the population of Turkish immigrants [39].

Measures
Participants in all studies provided information on sociodemographic and migration-related variables, and symptoms of depression measured by the PHQ-9. The PHQ-9 is a nine-item self-rating instrument, with each item representing one of the DSM-IV (Diagnostic and Statistical Manual of Mental Disorders, 4th Edition) criteria for a depressive episode (anhedonia, depressed mood, sleep problems, feeling tired, change in appetite, negative self-evaluation, concentration problems, psychomotor changes, suicidality). Each item can be scored as 0 (not at all), 1 (several days), 2 (more than half the days), or 3 (nearly every day), according to the frequency of experiencing difficulties in the respective area in the previous 2 weeks. Sum scores range from 0 to 27. Interpreting the PHQ-9 with respect to depression severity, a score of 5 to 9 represents mild depressive symptoms, 10 to 14 moderate depressive symptoms, and 15 to 27 severe depressive symptoms [40].

German and Turkish versions of the PHQ-9 were retrieved from the Pfizer Patient Health Questionnaire Screeners website [18]. The German version of the PHQ-9 [41] was developed through several steps of translation and blind back-translation following state-of-the-art procedures for test translation [42]. Various studies have demonstrated its validity [14, 15, 43–45]. Furthermore, results from the American and German PHQ validation studies are similar regarding criterion validity, construct validity, internal consistency, sensitivity to change and recommended cut-off scores [12–16]. Consequently, the German PHQ-9 can be considered a trustworthy and reliable PHQ version. However, to date, the Turkish version of the PHQ-9 [46] has been validated in only one study [47], which showed acceptable results regarding reliability and validity for the Turkish population in Turkey.

Statistical procedure
Data preparation and definition of the subgroups
Overall, data of n = 2853 participants were eligible from the four studies described above. Participants with more than two missing items in the PHQ-9 (n = 10) were excluded from the present analysis. We selected three subgroups, differing in ethnicity (no migration background at all vs. Turkish migration background), and language version of the PHQ-9 (German vs. Turkish): Germans with no migration background completing the German version of the PHQ-9 (G-G), Turkish immigrants completing the German version of the PHQ-9 (T-G), and Turkish immigrants completing the Turkish version of the PHQ-9 (T-T). Ethnic groups were defined by the parents’ country of birth according to Schenk et al. [48]. Persons were included only if both parents were born either in Germany or in Turkey; n = 334 participants were excluded based on this criterion. Non-migrants had to be born in Germany, i.e. have no immigration experience. Their mother tongue had to be German, and they had to hold a German passport. Based on these criteria, a further n = 5 participants were excluded. The age range was restricted to 18–65 years, since there were no elderly participants in the T-T sample and only very few in the T-G sample. Accordingly, n = 90 participants under 18 and n = 437 participants over 65 were excluded. Final sample sizes were n(G-G) = 1670, n(T-G) = 191, and n(T-T) = 116.

Evaluation of prerequisites
IRT analyses require unidimensionality, i.e. the items should measure the symptoms of one underlying disorder. The PHQ-9 has been shown to be a one-dimensional measure of depression in previous studies [23, 25, 49–51]. Consequently, we hypothesize that unidimensionality is present as well in the German and Turkish versions of the PHQ-9. However, as a special relevance of somatoform complaints in migrant populations in general [10, 52, 53] and Turkish immigrants in particular [54, 55] has been discussed, a two-factor solution was also plausible. We addressed dimensionality using confirmatory factor analysis (CFA), testing a single-factor model and a two-factor model including the items ‘sleep problems’, ‘low energy’, ‘appetite changes’, and ‘psychomotor changes’ on a somatic factor and the items ‘anhedonia’, ‘depressed mood’, ‘low self-esteem’, ‘concentration difficulties’, and ‘suicidal ideation’ on a cognitive-affective factor. Dimensionality of the PHQ-9 was inspected for all three subgroups separately and for the total sample. Missing values were handled with full-information maximum likelihood estimation (n(one missing, G-G) = 10; n(two missings, G-G) = 0; n(one missing, T-G) = 4; n(two missings, T-G) = 1; n(one missing, T-T) = 2; n(two missings, T-T) = 0).
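The scoring rule described under Measures can be sketched in code. This is a minimal illustration that assumes complete responses on all nine items; the function names are hypothetical, and the label "none" for sum scores of 0 to 4 is an assumption borrowed from the severity categories in Table 1, since the text only names the mild, moderate, and severe bands.

```python
from typing import Sequence

# Severity bands as given in the text: 5-9 mild, 10-14 moderate, 15-27 severe.
# The label "none" for 0-4 is an assumption based on the categories in Table 1.
SEVERITY_BANDS = (
    (0, 4, "none"),
    (5, 9, "mild"),
    (10, 14, "moderate"),
    (15, 27, "severe"),
)


def phq9_sum_score(responses: Sequence[int]) -> int:
    """Return the PHQ-9 sum score (0-27) for nine item responses coded 0-3."""
    if len(responses) != 9:
        raise ValueError("the PHQ-9 has exactly nine items")
    if any(r not in (0, 1, 2, 3) for r in responses):
        raise ValueError("each item must be coded 0, 1, 2, or 3")
    return sum(responses)


def phq9_severity(sum_score: int) -> str:
    """Map a PHQ-9 sum score to its severity band."""
    for low, high, label in SEVERITY_BANDS:
        if low <= sum_score <= high:
            return label
    raise ValueError("sum score must lie between 0 and 27")


if __name__ == "__main__":
    example = [1, 2, 1, 2, 0, 1, 1, 0, 0]  # hypothetical respondent
    total = phq9_sum_score(example)
    print(total, phq9_severity(total))  # prints: 8 mild
```

Missing item responses are deliberately rejected here, because the text does not state how sum scores were formed when individual items were missing.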
We adopted missing (G-G) two missings (G-G) one missing (T-G) 4; n =1; n =2; n the “leave-one-out” approach for the selection of anchor two missings (T-G) one missing (T-T) two missings = 0). For model fit comparison, we followed a pro- items, i.e. every single item was tested for DIF, assuming (T-T) cedure which involves comparing the change in that the remaining items were DIF-free and thus serving goodness-of-fit indices, which are unaffected by sample as anchor items. If any of the X tests for an item was size [56]. Following Cheung’s recommendations, we significant at the p < .05 level, the item was considered compared the CFI between the single-factor and the to be a candidate DIF item. This process was repeated two-factor models, with a difference of Δ < 0.01 indi- with the remaining items to purify the sample of anchor CFI cating substantively similar models [56]. Mplus version 5 items until there were no more new candidate DIF items was used for CFA [57]. in the next analysis. In the second stage of analysis, the candidate DIF items were tested for DIF relative to the Item response theory (IRT) analyses set of anchor items that had been identified in step one. For IRT analyses, the parametric graded-response model Finally, Test Characteristic Curves (TCC) and Test In- (GRM) [58, 59], the polytomous extension of the formation Curves (TIC) were inspected. The TCC plots two-parameter logistic model, was applied. The GRM the most likely standard PHQ-9 score associated with estimates two types of item parameters and one person each level of depression [25]. The TIC plots the informa- parameter, based on the pattern of responses observed tion at each depression level, e.g. the measurement pre- in the data. The item parameters are: item slope a, and cision at each depression level and the standard error item location b. The item slope parameter a indicates associated which each depression level. 
Where the TCC how steeply the probability of endorsing an item in- is steep and test information is high, the PHQ-9 has creases with an increasing underlying level of depression. good measurement precision and a small standard error The person parameter theta (θ) estimates the underlying of measurement. All IRT analyses were computed with level of depression. The item location parameters b indi- IRTPRO 2.1 for Windows [61]. cate the positions of the thresholds from one response category to another. The b parameters represent the trait Results level necessary to respond above the threshold with .50 Sample characteristics probability [60]. In the case of the PHQ-9, there are A final sample of n = 1977 participants was analyzed. three thresholds: from ‘not at all’ to ‘several days’ (b ), The mean age of the total sample was 42.6 years, with from ‘several days’ to ‘more than half the days’ (b ), and T-G being significantly younger (32.6 vs. 43.7 years, see from ‘more than half the days’ to ‘nearly every day’ (b ). Table 1). In the total sample, 97% of participants had Item parameters can be interpreted as a z-scale (mean = completed nine or more years of education, and 61% 0, standard deviation = 1). All parameters estimated by were employed. However, only 82% of T-T had com- the GRM are reported on a logit scale. Item Characteris- pleted 9 years of education or beyond, and the employ- tic Curves (ICCs) were used for the graphical investiga- ment rate was only 47%. The proportion of inpatients tion of the operation characteristics. The form of an ICC was markedly higher in T-T (57%) than in the other sub- describes how changes in trait level relate to changes in groups (3 and 5%). Moreover, the proportion of partici- the probability of a specified response. For polytomous pants with moderate or severe depression as estimated items, the ICC regresses the probability of responses in by the PHQ-9 sum score was higher among T-T. each category on trait level [60]. 
Second-generation immigrants were more likely to be in For Differential Item Functioning (DIF), our analyses the T-G subgroup (62% vs. 10%). T-G were also more disentangle differences in item functioning related to likely to indicate German as their mother tongue (17% language (German vs. Turkish) and to ethnicity and mi- vs. 6%) and to have a better German language profi- gration background (Germans without a migration back- ciency, if their mother tongue was Turkish. ground vs. Turkish migration background). The first analysis investigated DIF related to language, comparing Evaluation of prerequisites T-G and T-T. The second investigated DIF related to The single-factor model showed good fit in each sub- ethnicity and migration background, comparing T-G to group and for the entire sample (G-G: X (27) = 521.6, G-G. DIF analyses were conducted in two steps: first p < .001; CFI = .938; RMSEA [90% C.I.] = .105 [.097; selecting anchor items, and then evaluating candidate .113]. T-G: X (27) = 67.4, p < .001; CFI = .955; RMSEA items for DIF. Anchor items allow responses from two [90% C.I.] = .089 [.062; .115]. T-T: X (27) = 22.0, groups to be linked so that parameters are estimated in p > .05; CFI = 1.0; RMSEA [90% C.I.] = .000 [.000; a common metric [60]. Since we had no a priori .057]. Total: X (27) = 454.6, p < .001; CFI = .964; Reich et al. 
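The two-stage anchor selection described in the statistical procedure ("leave-one-out" testing followed by a test of the candidate items against the purified anchors) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the IRTPRO routine used in the study: `dif_p_value` is a hypothetical callable that stands in for a model-based DIF test and is assumed to return the smallest p-value of the χ² tests for one item, given a set of anchor items. The toy p-values are invented so that items 1, 3, 4, and 5 are flagged; they are not study results.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List, Tuple

# Hypothetical signature: (item, anchor items) -> smallest p-value of the item's DIF tests.
DifTest = Callable[[int, FrozenSet[int]], float]


def purify_anchors(items: Iterable[int], dif_p_value: DifTest,
                   alpha: float = 0.05) -> Tuple[List[int], List[int]]:
    """Stage 1: leave-one-out testing, repeated until no new candidate appears."""
    anchors = set(items)
    candidates = set()
    while len(anchors) > 1:
        # Test every remaining item against all other remaining items.
        flagged = {item for item in anchors
                   if dif_p_value(item, frozenset(anchors - {item})) < alpha}
        if not flagged:
            break  # the anchor set is stable
        candidates |= flagged
        anchors -= flagged
    return sorted(anchors), sorted(candidates)


def evaluate_candidates(candidates: Iterable[int], anchors: Iterable[int],
                        dif_p_value: DifTest, alpha: float = 0.05) -> Dict[int, bool]:
    """Stage 2: test each candidate item against the purified anchor set."""
    anchor_set = frozenset(anchors)
    return {item: dif_p_value(item, anchor_set) < alpha for item in candidates}


def toy_dif_p_value(item: int, anchors: FrozenSet[int]) -> float:
    """Invented stand-in for a real DIF test; it ignores the anchors entirely."""
    return 0.01 if item in {1, 3, 4, 5} else 0.50


if __name__ == "__main__":
    anchors, candidates = purify_anchors(range(1, 10), toy_dif_p_value)
    print(anchors, candidates)
    print(evaluate_candidates(candidates, anchors, toy_dif_p_value))
```

In a real analysis the p-values would change from one iteration to the next, because removing a flagged item alters the anchor set against which the remaining items are tested; the toy function does not model this.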
Table 1 Sample description stratified by language and ethnicity

Characteristic | G-G (n = 1670) | T-G (n = 191) | T-T (n = 116) | Total (n = 1977) | Test statistic
Sociodemographic characteristics
Age in years, mean (SD) | 43.7 (12.7) | 32.6 (9.9) | 43.7 (11.1) | 42.6 (12.8) | F(2) = 70.2***
Female sex, n (%) | 930 (55.7) | 109 (57.4) | 71 (61.2) | 1110 (56.2) | χ²(2) = 1.5*
Education ≥9 years^a, n (%) | 1638 (98.2) | 181 (96.3) | 94 (82.4) | 1913 (97.1) | χ²(2) = 157.8***
Being employed^b, n (%) | 1037 (62.1) | 118 (62.4) | 54 (46.6) | 1209 (61.2) | χ²(2) = 11.1**
Clinical characteristics
Being in inpatient treatment, n (%) | 49 (2.9) | 9 (4.7) | 66 (56.9) | 124 (6.3) | χ²(2) = 538.1***
PHQ-9 total score, mean (SD) | 2.6 (3.9) | 7.2 (6.3) | 13.6 (7.3) | 3.7 (5.3) | F(2) = 397.5***
Depression severity as defined by the PHQ-9
None (0–4), n (%) | 1360 (81.4) | 73 (38.2) | 12 (10.3) | 1530 (77.4) | χ²(2) = 409.4***
Mild (5–9), n (%) | 210 (12.6) | 64 (33.5) | 33 (28.4) | 222 (11.2) | χ²(2) = 72.9***
Moderate (10–14), n (%) | 62 (3.7) | 31 (16.2) | 17 (14.7) | 162 (8.2) | χ²(2) = 168.4***
Severe (≥15), n (%) | 38 (2.3) | 23 (12.0) | 54 (46.6) | 63 (3.2) | χ²(2) = 256.0***
Migration-related characteristics
Years since immigration^c, mean (SD) | – | 28.0 (11.1) | 26.1 (10.9) | 26.9 (11.0) | F(1) = 1.7*
Second generation^d, n (%) | – | 117 (61.6) | 12 (10.3) | 129 (42.2) | χ²(1) = 76.8***
Mother tongue = German, n (%) | – | 32 (16.8) | 7 (6.0) | 39 (12.7) | χ²(1) = 7.5**
German language proficiency^e, mean (SD) | – | 1.4 (0.7) | 2.8 (1.0) | 2.0 (1.1) | F(1) = 165.8***

G-G Germans with no migration background completing the German version of the PHQ-9, T-G Turkish immigrants completing the German version of the PHQ-9, T-T Turkish immigrants completing the Turkish version of the PHQ-9
^a Includes all school graduation certificates normally received after 9 or more years of school, i.e. the German “Hauptschulabschluss”, “Realschulabschluss” or “Abitur”, and the Turkish “Ortaokul diploması” or “Lise bitirme sınavı”.
^b Working part-time or full-time.
^c Applies only for participants who were born in Turkey.
^d Participants born in Germany, both parents born in Turkey.
^e Self-reported German language proficiency, if mother tongue is Turkish (1 = very good, 4 = poor/bad)
*p < .05, **p < .01, ***p < .001

Evaluation of prerequisites
The single-factor model showed good fit in each subgroup and for the entire sample (G-G: χ²(27) = 521.6, p < .001; CFI = .938; RMSEA [90% C.I.] = .105 [.097; .113]. T-G: χ²(27) = 67.4, p < .001; CFI = .955; RMSEA [90% C.I.] = .089 [.062; .115]. T-T: χ²(27) = 22.0, p > .05; CFI = 1.0; RMSEA [90% C.I.] = .000 [.000; .057]. Total: χ²(27) = 454.6, p < .001; CFI = .964; RMSEA [90% C.I.] = .090 [.082; .097]). The fit of the two-factor model was similarly good in all subgroups and in the entire sample (G-G: χ²(26) = 488.5, p < .001; CFI = .942; RMSEA [90% C.I.] = .103 [.095; .111]. T-G: χ²(26) = 58.0, p < .001; CFI = .964; RMSEA [90% C.I.] = .080 [.052; .108]. T-T: χ²(26) = 21.5, p > .05; CFI = 1.0; RMSEA [90% C.I.] = .000 [.000; .057]. Total: χ²(26) = 422.4, p < .001; CFI = .967; RMSEA [90% C.I.] = .088 [.081; .095]). The differences in CFI between the one-factor and the two-factor model were < 0.01 for all subgroups as well as for the total sample (ΔCFI(G-G) = 0.004, ΔCFI(T-G) = 0.009, ΔCFI(T-T) = 0, ΔCFI(total) = 0.003), which indicates substantively similar models. As the single-factor model is more parsimonious, we assume that our hypothesis is confirmed and presuppose unidimensionality of the German and Turkish PHQ-9 versions for the following IRT analyses.

IRT parameter estimates and inspection of ICCs
The item slope parameters a ranged from 1.45 to 4.16, indicating that the response categories differentiated among trait levels fairly well (Table 2). The ascending order of the item location parameters b1, b2, and b3 confirmed the correct order of response options. Additionally, the range of the item location parameters indicated that the PHQ-9 items covered levels of depression from about 1 standard deviation below to 2 standard deviations above the sample population mean.

The graphical inspection of the ICCs (Fig. 1) showed that all PHQ-9 items work well in our samples. Peaks of RCCs (Response Characteristic Curves) for response options 2 and 3 (and for ‘psychomotor changes’ and ‘suicidal ideation’ also response option 1) corresponded to underlying depression levels well above the population mean. Most RCCs had their own peak where the respective response option was the most likely to be endorsed. However, in various items and especially in the T-T sample (Fig. 1, right column), response option 2 ‘more than half the days’ did not offer much additional information, since the area under its RCC which is covered in addition to the adjacent RCCs is small or non-existent.

Table 2 Item slope a and item locations b1, b2, and b3, stratified by language and ethnicity

Item | Sample | a (SE) | b1 (SE) | b2 (SE) | b3 (SE)
1. Anhedonia | G-G | 2.93 (0.17) | −0.49 (0.04) | 0.92 (0.07) | 1.54 (0.10)
1. Anhedonia | T-G | 2.59 (0.35) | −0.45 (0.12) | 1.15 (0.14) | 1.85 (0.20)
1. Anhedonia | T-T | 1.45 (0.32) | −0.52 (0.29) | 1.46 (0.26) | 2.06 (0.34)
2. Depressed mood | G-G | 3.97 (0.26) | −0.26 (0.04) | 0.83 (0.06) | 1.47 (0.10)
2. Depressed mood | T-G | 3.46 (0.51) | −0.13 (0.10) | 0.80 (0.11) | 1.51 (0.15)
2. Depressed mood | T-T | 4.16 (0.84) | −0.13 (0.13) | 0.88 (0.18) | 1.26 (0.22)
3. Sleep problems | G-G | 2.54 (0.14) | −0.60 (0.04) | 0.63 (0.06) | 1.31 (0.09)
3. Sleep problems | T-G | 2.37 (0.32) | −0.47 (0.13) | 0.55 (0.11) | 1.34 (0.16)
3. Sleep problems | T-T | 2.33 (0.48) | −0.48 (0.20) | 0.67 (0.18) | 1.02 (0.21)
4. Low energy | G-G | 3.02 (0.17) | −0.83 (0.04) | 0.56 (0.06) | 1.32 (0.09)
4. Low energy | T-G | 2.94 (0.40) | −0.84 (0.13) | 0.43 (0.10) | 1.22 (0.14)
4. Low energy | T-T | 2.95 (0.61) | −0.78 (0.23) | 0.77 (0.17) | 1.12 (0.21)
5. Appetite changes | G-G | 2.53 (0.16) | 0.04 (0.05) | 1.08 (0.08) | 2.07 (0.15)
5. Appetite changes | T-G | 2.40 (0.34) | 0.00 (0.11) | 0.81 (0.12) | 1.55 (0.18)
5. Appetite changes | T-T | 1.57 (0.36) | 0.07 (0.20) | 1.59 (0.30) | 1.88 (0.34)
6. Low self-esteem | G-G | 3.04 (0.20) | 0.05 (0.04) | 0.93 (0.07) | 1.54 (0.11)
6. Low self-esteem | T-G | 2.95 (0.44) | 0.14 (0.10) | 1.01 (0.12) | 1.62 (0.17)
6. Low self-esteem | T-T | 2.97 (0.64) | 0.03 (0.14) | 1.13 (0.21) | 1.51 (0.26)
7. Concentration difficulties | G-G | 2.92 (0.19) | 0.08 (0.05) | 1.07 (0.08) | 1.89 (0.13)
7. Concentration difficulties | T-G | 2.08 (0.30) | 0.09 (0.11) | 0.98 (0.14) | 1.75 (0.21)
7. Concentration difficulties | T-T | 2.33 (0.51) | 0.33 (0.15) | 1.27 (0.23) | 1.93 (0.32)
8. Psychomotor changes | G-G | 2.32 (0.17) | 0.63 (0.07) | 1.64 (0.13) | 2.39 (0.20)
8. Psychomotor changes | T-G | 2.67 (0.43) | 0.56 (0.11) | 1.51 (0.17) | 2.04 (0.23)
8. Psychomotor changes | T-T | 2.76 (0.64) | 0.25 (0.14) | 1.25 (0.22) | 1.58 (0.27)
9. Suicidal ideation | G-G | 2.74 (0.23) | 0.79 (0.07) | 1.64 (0.12) | 2.29 (0.19)
9. Suicidal ideation | T-G | 2.40 (0.42) | 1.02 (0.13) | 1.71 (0.20) | 2.28 (0.29)
9. Suicidal ideation | T-T | 2.06 (0.52) | 0.90 (0.18) | 1.86 (0.32) | 2.08 (0.36)

Bolded data where DIF (see Table 3) is present
G-G Germans with no migration background completing the German version of the PHQ-9 (n = 1670), T-G Turkish immigrants completing the German version of the PHQ-9 (n = 191), T-T Turkish immigrants completing the Turkish version of the PHQ-9 (n = 116)

DIF related to language
In the first step, we identified five DIF-free items (items 2, 6–9, see Table 3). These items served as anchor items for evaluating DIF in the remaining items. Statistically significant DIF regarding item slope was identified in the item ‘anhedonia’. The probability of endorsing this item with increasing level of depression increased more rapidly in T-G than in T-T. Significant DIF was found for the location parameters of the items ‘sleep problems’, ‘low energy’, and ‘appetite changes’. While the locations of the first threshold (b1: ‘not at all’ to ‘several days’) were similar in both subgroups, the locations of the thresholds b2 and b3 differed: b2 was lower in T-G for all items, while b3 was higher in T-G in items 3 and 4, and higher in T-T in item 5 (see Table 2). Estimating group parameters with DIF-free items only, the group estimate of the latent depression factor was 1.03 standard deviations higher in T-T than in T-G. Using all items, it was 1.04 standard deviations higher in T-T than in T-G. In summary, language-related DIF is present in four items, but the impact on the scale level and the total score seems to be minimal.

DIF related to ethnicity and migration background
In the first step, we identified seven DIF-free items (items 1–4, 6, 8–9, see Table 3), which served as anchor items. The items ‘appetite changes’ and ‘concentration difficulties’ were evaluated for DIF in the second stage of analysis. While the threshold b1 was similar for both groups, the thresholds b2 and b3 were shifted upwards for G-G as compared to T-G. For G-G, the probability of endorsing item 7 increased more rapidly with rising underlying level of depression than for T-G. Estimating group parameters with DIF-free items only, the mean depression level was 1 standard deviation higher in T-G than in G-G. Based on IRT estimates of depression using all items, the group estimate was identical: With respect to the total score, i.e. on scale level, there was no directly observable impact of DIF related to ethnicity and migration background.

Fig. 1 Item characteristic curves (ICC) for each PHQ-9 depression item in all three subgroups. Left column: ICCs for each item for G-G; middle column: ICCs for T-G; right column: ICCs for T-T. Response options are 0 (not at all), 1 (several days), 2 (more than half the days), or 3 (nearly every day). The X-axis indicates the estimated level of depression (theta). The Y-axis indicates the probability of endorsing a response option at a given level of estimated depression
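To make the link between the parameter estimates in Table 2 and the curves in Fig. 1 concrete, the following sketch evaluates the graded response model for the G-G estimates. It assumes the common logistic form P(X ≥ k | θ) = 1 / (1 + exp(−a(θ − b_k))) without an additional scaling constant, which may differ from the exact parameterization used by IRTPRO, and it takes the test characteristic curve to be the expected sum score. The function names are illustrative, and the sketch does not attempt to reproduce the published figures.

```python
import math
from typing import List, Sequence, Tuple

# G-G estimates copied from Table 2: (slope a, (b1, b2, b3)) for items 1-9.
GG_ITEMS: List[Tuple[float, Tuple[float, float, float]]] = [
    (2.93, (-0.49, 0.92, 1.54)),  # 1 anhedonia
    (3.97, (-0.26, 0.83, 1.47)),  # 2 depressed mood
    (2.54, (-0.60, 0.63, 1.31)),  # 3 sleep problems
    (3.02, (-0.83, 0.56, 1.32)),  # 4 low energy
    (2.53, (0.04, 1.08, 2.07)),   # 5 appetite changes
    (3.04, (0.05, 0.93, 1.54)),   # 6 low self-esteem
    (2.92, (0.08, 1.07, 1.89)),   # 7 concentration difficulties
    (2.32, (0.63, 1.64, 2.39)),   # 8 psychomotor changes
    (2.74, (0.79, 1.64, 2.29)),   # 9 suicidal ideation
]


def boundary_probability(theta: float, a: float, b: float) -> float:
    """P(X >= k | theta) for one threshold b under the assumed logistic form."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def category_probabilities(theta: float, a: float, bs: Sequence[float]) -> List[float]:
    """Probabilities of response options 0..len(bs) at trait level theta."""
    cumulative = [1.0] + [boundary_probability(theta, a, b) for b in bs] + [0.0]
    return [cumulative[k] - cumulative[k + 1] for k in range(len(bs) + 1)]


def expected_sum_score(theta: float, items=GG_ITEMS) -> float:
    """Test characteristic curve, taken here as the expected PHQ-9 sum score."""
    return sum(k * p
               for a, bs in items
               for k, p in enumerate(category_probabilities(theta, a, bs)))


if __name__ == "__main__":
    a, bs = GG_ITEMS[0]
    for theta in (-1.5, 0.0, 1.5):
        probs = category_probabilities(theta, a, bs)
        print(theta, [round(p, 3) for p in probs], round(expected_sum_score(theta), 1))
```

Because each item contributes the sum of its three boundary probabilities to the expected score, the resulting curve is bounded by 0 and 27 and rises monotonically with theta.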
Table 3 Analyses of differential item functioning (DIF)

Item | Language^a: Total^c | Language^a: Slope parameter^d | Language^a: Location parameters^e | Ethnicity and migration background^b: Total^c | Ethnicity and migration background^b: Slope parameter^d | Ethnicity and migration background^b: Location parameters^e
1. Anhedonia | 10.9* | 6.4* | 4.5 | 2.2 | 0.4 | 1.8
2. Depressed mood | 4.3 | 0.5 | 3.8 | 4.0 | 0.3 | 3.7
3. Sleep problems | 8.3 | 0.1 | 8.3* | 5.3 | 0.4 | 4.9
4. Low energy | 11.2* | 0.0 | 11.2* | 2.8 | 0.1 | 2.7
5. Appetite changes | 19.7*** | 3.3 | 16.4*** | 14.8** | 0.3 | 14.5**
6. Low self-esteem | 3.6 | 0.3 | 3.3 | 0.2 | 0.0 | 0.2
7. Concentration difficulties | 5.1 | 0.2 | 4.8 | 18.7*** | 6.8** | 12.0**
8. Psychomotor changes | 4.2 | 0.0 | 4.2 | 1.9 | 0.1 | 1.9
9. Suicidal ideation | 5.6 | 0.4 | 4.2 | 3.0 | 0.3 | 2.7

We report χ² statistics. Significant χ² tests indicate that there is a difference in item functioning. Results for anchor items are printed in italics. χ² values for anchor items are reported from the last iteration of step one, where anchor items have been selected and purified. Candidate DIF items are in bold, and χ² values are those estimated from the second stage of analysis, i.e. where candidate DIF items were tested against the previously identified set of DIF-free anchor items
^a Analysis 1 comparing T-G (Turkish immigrants completing the German version of the PHQ-9, n = 191) with T-T (Turkish immigrants completing the Turkish version of the PHQ-9, n = 116).
^b Analysis 2 comparing G-G (Germans with no migration background completing the German version of the PHQ-9, n = 1670) with T-G (Turkish immigrants completing the German version of the PHQ-9, n = 191).
^c df = 4. ^d df = 1. ^e df = 3
*p < .05, **p < .01, ***p < .001

Test characteristics and test information
TCCs (Fig. 2, left column) showed that the expected PHQ-9 score is about 6 to 9 points at the mean level of depression in our samples (theta = 0). The PHQ-9 had curvilinear scaling properties in all three subgroups. Consequently, differences between standard scores have different implications depending on the starting score. For example, a reduction in the underlying level of depression of 1.5 standard deviations in G-G was represented by 13.5 points in the PHQ-9 starting from theta = 1.5, and by 7.5 points starting from theta = 0.

Inspecting TICs (Fig. 2, right column), we learned that the PHQ-9 offers good measurement precision (i.e. small standard errors) from about 1 standard deviation below the population mean to about 2.5 standard deviations above. Accordingly, Cronbach’s alpha was .90 for T-T and G-G, and .91 for T-G.

Fig. 2 Test characteristic curves (TCC) and test information curves (TIC) for the PHQ-9 for all three subgroups. TCCs can be found in the left column. The X-axis indicates the estimated level of depression (theta) and the Y-axis indicates the most likely expected PHQ-9 sum score associated with each level of depression. The dotted lines may serve as a guide when estimating differences between TCCs with respect to the most likely expected PHQ-9 sum score corresponding to levels of depression at the group mean (theta 0), 1.5 standard deviations below the group mean, and 1.5 standard deviations above the group mean. TICs can be found in the right column. The X-axis continues to be the estimated level of depression (theta). Here, the solid line plots the amount of measurement precision, i.e. measurement information (left Y-axis), at each depression level. The dotted line plots the standard error of measurement (right Y-axis) associated with each depression level

Discussion
The scope of the present study was to examine whether the Turkish and German versions of the PHQ-9 provide cross-linguistic and cross-cultural validity. The German version is comparable to the English and is equally well validated. We applied IRT analyses to three samples which differed regarding language version and ethnicity.

Comparability of language versions
The PHQ-9 sum score was comparable between German and Turkish language versions. Although there was item-level bias, this was not reflected in total scores. This could be due to opposite item-level DIF cancelling out, or to the limited effect of item-level DIF in the low to average range of the scale, where most subjects were located. Consequently, differences between mean scores can be attributed to real differences between subgroups. In our analyses, the T-T sample included a higher proportion of inpatients and severely depressed participants, which is reflected in a meaningful difference between T-G and T-T in the latent depression factor. These differences reflect true differences in depression severity instead of measurement bias.

In line with other studies comparing different language versions of the PHQ-9, we found DIF for the item ‘sleep problems’ [20–22]. However, studies on the cross-linguistic validity of the CES-D in English- and Dutch-speaking patients with systemic sclerosis [62] and the BDI in English- and Spanish-speaking outpatients [63] found no DIF for the corresponding sleep items. Thus, the bias in the sleep item seems to lie in the PHQ-9 item formulation itself rather than in the symptom of sleep problems across cultures. Language-related DIF for the items ‘appetite changes’ and ‘anhedonia’ was also found in other studies [20, 21], and was possibly related to the PHQ-9 response options in our study: ‘More than half the days’ was barely used by Turkish immigrants, especially when completing the Turkish version. One recent study on the Spanish version of the PHQ-9 also reported problems with PHQ-9 response categories [64]; collapsing the response categories ‘more than half the days’ and ‘nearly every day’ and working with a three-point Likert scale improved cross-cultural psychometric characteristics of the PHQ-9 in that study.

Comparability across ethnic groups
Our finding that PHQ-9 sum scores are comparable be-
However, none of these studies investigated Turkish immigrants.
We did not adjust for sample differences in tween Germans without a migration background and age, education, and employment, since these variables Turkish immigrants in Germany without any restrictions are not independent of the groups examined here: The concurs with previous studies addressing the utilization T-G sample was substantially younger than the other of the PHQ-9 in culturally diverse populations [11, 20, groups, as more second- than first-generation Turkish 23–25]. Higher PHQ-9 sum scores in the T-G than in immigrants chose to respond to questionnaires in Ger- the G-G sample might be explained by self-selection man. DIF related to age has been reported for items 1, 2, processes resulting in more T-G with clinical signs of de- and 4 in a UK sample [65], which might have influenced pression participating in study 2 compared to the mainly the results of our analyses. Among Turkish immigrants, representative G-G sample from study 1. In contrast to the proportion of persons with only basic education or previous studies [11, 25], we found DIF for the items ‘ap- who are unemployed is greater than in the German gen- petite changes’ and ‘concentration difficulties’. The dif- eral population [19]. According to Cameron et al. [65], ferences manifested in a lower threshold for T-G to the PHQ-9 is free of DIF related to education. The pro- endorse the clinically meaningful response categories portion of seriously ill persons in the samples might ‘more than half the days’ and ‘nearly every day’. have affected analyses through sampling bias, as the pro- portion was higher in the Turkish immigrant samples. 
General characteristics Last but not least, the sample without a migration back- The PHQ-9 items covered a wide range of depression ground might encompass any data of repatriated Russian severities, and the PHQ-9 had a very good measurement Germans, since they are not classified as migrants in of- precision around and above the population mean of de- ficial statistics. pression. Our findings regarding these general character- Furthermore, as no gold standard measure of depres- istics of the PHQ-9 concur with previous research sion was included in the original studies, we were unable demonstrating the high quality of this depression ques- to compare sensitivity and specificity for each of our tionnaire [40, 43]. However, differences between means samples. The addition of a gold standard would have re- (as used in longitudinal studies or for documenting the sulted in a more sophisticated understanding of the im- course of therapy) should be interpreted with caution plications of our findings for the accuracy of diagnostic due to curvilinear scaling properties. A rapid initial im- recommendations of the PHQ-9. We did not test provement in PHQ-9 sum scores, especially in severely whether DIF had a consistent impact across levels of de- depressed patients, may not correspond to an equally pression severity (uniform DIF) or whether the impact strong improvement in underlying depression. of DIF varied by symptom level (nonuniform DIF). Fi- nally, the original studies rely on different settings and Strengths and limitations study designs, implying that data from different sources The strengths of our study are that we applied a might not be fully comparable. state-of-the-art statistical approach, i.e., Item Response Theory, and used relatively large samples including a Conclusions broad spectrum of depression severities. 
We evaluated Based on the main findings of the present study, the the psychometric characteristics of two PHQ-9 language PHQ-9 total sum score can be recommended as a versions in-depth for application in culturally diverse cross-cultural and cross-linguistic valid screening tool populations. Nonetheless, there are some limitations to for depression in Germans without a migration back- our study. Our analyses only included people with a ground and Turkish immigrants, regardless of whether Turkish migration background or no migration back- they complete the Turkish or the German version. These ground at all. Further differentiations between the influ- results might be transferable to the comparability with ences of migration background and ethnicity (i.e. the English version. When interpreting individual scores Turkish immigrants living in Germany vs. Turkish of Turkish immigrants in clinical practice or in com- people living in Turkey) are lacking. When interpreting parative studies, the response categories ‘more than half the results, it is important to consider that there is a lot the days’ and ‘nearly every day’ should both be consid- of heterogeneity in terms of participant characteristics ered as clinically meaningful responses, as suggested by and participant capabilities in the data, which might the categorical algorithm for the diagnosis of depressive affect the analyzes. The presented results might be disorder according to DSM-IV [13]. According to our biased due to sociodemographic differences between the results, both response options should be regarded as samples. Regarding gender, some studies report no or equally important. Further analysis may evaluate only a minor influence of gender on PHQ-9 scores [65, whether both response options are necessary or whether 66], while others report a significant influence [51]. they can be collapsed into one. Furthermore, Turkish Reich et al. 
BMC Psychology (2018) 6:26 Page 11 of 13 immigrants seemed to be more willing to endorse some Competing interests The authors declare that they have no competing interests. of the PHQ-9 items. Consequently, there might be inter- cultural differences in the perception or expression of depression [8]. External or relational bias [67] with re- Publisher’sNote spect to second variables (e.g. symptom expression) may Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. exist. Any ensuing differences in the predictive validity of the PHQ-9 [60] might be subject of further research. Author details In summary, the PHQ-9 can be highly recommended as Department of Psychology, University of Marburg, Marburg, Germany. Outpatient Unit for Research, Teaching and Practice, Faculty of Psychology, a cross-cultural and cross-linguistic valid depression University of Vienna, Renngasse 6-8, 1010 Vienna, Austria. Institute of screener for the investigated samples. Medical Psychology, Medical School, University of Leipzig, Leipzig, Germany. Clinic and Policlinic for Psychosomatic Medicine and Psychotherapy, Abbreviations 5 University Medical Center Mainz, Mainz, Germany. Institute of Medical Δ : Delta (=difference) in CFI; a: item slope parameter; b: item location CFI Psychology, Justus-Liebig-University, Gießen, Germany. parameter; b : threshold from ‘not at all’ to ‘several days’ in the PHQ-9; b : threshold from ‘several days’ to ‘more than half the days’ in the PHQ-9; Received: 16 May 2018 Accepted: 22 May 2018 b : threshold from ‘more than half the days’ to ‘nearly every day’ in the PHQ- 9; BDI: Beck Depression Inventory; C.I.: Confidence Interval; CES-D: Center for epidemiological studies-depression measure; CFA: Confirmatory factor analysis; CFI: Confirmatory fit index; DFG: German Research Foundation; References DIF: Differential item functioning; DSM-5: Diagnostic and statistical manual of 1. 
Paykel ES, Brugha T, Fryers T. Size and burden of depressive disorders in mental disorders, 5th edition; DSM-IV: Diagnostic and statistical manual of Europe. Eur Neuropsychopharmacol. 2005;15:411–23. https://doi.org/10. mental disorders, 4th edition; e.g.: Latin abbreviation “exempli gratia”, which 1016/j.euroneuro.2005.04.008. means “for example”; G-G: Germans with no migration background 2. Wittchen H-U, Jacobi F. Size and burden of mental disorders in Europe - a completing the German version of the PHQ-9; GRM: Parametric graded- critical review and appraisal of 27 studies. Eur Neuropsychopharmacol. 2005; response model; HIV: Human immunodeficiency virus; i.e.: Latin abbreviation 15:357–76. https://doi.org/10.1016/j.euroneuro.2005.04.012. “id est”, which means “in other words”; ICC: Item characteristic curve; 3. World Health Organization [WHO]. The world health report: Mental Health: IRT: Item response theory; n: Sample size; p: Probability; PHQ-9: Patient health New understanding and hope. 2001. questionnaire-9 (9 designates the depression module); RCC: Response 4. Ayuso-Mateos JL, Vázquez-Barquero JL, Dowrick C, Lehtinen V, Dalgard OS, characteristic curve; RCT: Randomized controlled trial; RMSEA: Root Mean Casey P, et al. Depressive disorders in Europe: prevalence figures from the square error of approximation; SD: Standard deviation; SE: Standard error; ODIN study. Br J Psychiatry. 2001;179:308–16. TCC: Test characteristic curve; T-G: Turkish immigrants completing the 5. de Wit MAS, Tuinebreijer WC, Dekker J, Beekman AJTF, Gorissen WHM, German version of the PHQ-9; TIC: Test information curve; T-T: Turkish Schrier AC, et al. Depressive and anxiety disorders in different ethnic groups: immigrants completing the Turkish version of the PHQ-9; USUMA: Name of a a population based study among native Dutch, and Turkish, Moroccan and demographic consulting company in Berlin, Germany; vs.: versus; X : Chi Surinamese migrants in Amsterdam. 
Soc Psychiatry Psychiatr Epidemiol. Square; z-scale: Standard scale in statistics where the standard deviation is 2008;43:905–12. https://doi.org/10.1007/s00127-008-0382-5. one and the mean is zero; θ: Theta, person parameter 6. González HM, Tarraf W, Whitfield KE, Vega WWA, González HMH. The epidemiology of major depression and ethnicity in the United States. J Acknowledgements Psychiatr Res. 2010;44:1043–51. https://doi.org/10.1016/j.jpsychires.2010.03.017. We would like to thank PD Dr. Heide Gläsmer, Dipl.-Psych. Luisa Bockel, Dipl.- 7. Hasin D, Goodwin R, Stinson F, Grant B. Epidemiology of major depressive Psych. Johanna Laskawi, and Dipl.-Psych. Daniela Zürn for their collaboration disorder. Arch Gen Psychiatry. 2005;62:1097–106. in data collection. We also express our thanks to the cooperating clinic sites: 8. Deisenhammer E a, Coban-Başaran M, Mantar A, Prunnlechner R, Kemmler Vitos Clinic for Psychiatry and Psychotherapy Marburg (Medical Director: Prof. G, Alkın T, et al. Ethnic and migrational impact on the clinical manifestation Dr. Dr. Matthias J. Müller); “Parkland-Klinik” Bad Wildungen; Clinic for of depression. Soc Psychiatry Psychiatr Epidemiol 2011. doi:https://doi.org/ Psychiatry and Psychotherapy at the Hospital of Offenbach; Clinic for 10.1007/s00127-011-0417-1. Psychiatry, Psychotherapy, and Psychosomatics at the Hospital of Frankfurt 9. Zayas LH, Gulbas LE. Are suicide attempts by young Latinas a cultural idiom Höchst; and MEDIAN “Klinik am Südpark” Bad Nauheim. of distress? Transcult Psychiatry. 2012; https://doi.org/10.1177/ Funding 10. Kirmayer LJ, Young A. Culture and somatization: clinical, epidemiological, Study 1 was supported by a grant from the German Research Foundation and ethnographic perspectives. Psychosom Med. 1998;60:420–30. (DFG) to Prof. Rief (Grant RI 574/14–1). 11. Baas KD, Cramer AOJ, Koeter MWJ, van de Lisdonk EH, van Weert HC, Schene AH. 
Measurement invariance with respect to ethnicity of the patient Availability of data and materials health Questionnaire-9 (PHQ-9). J Affect Disord. 2011;129:229–35. https://doi. The datasets used and/or analysed during the current study are available org/10.1016/j.jad.2010.08.026. from the corresponding author on request. 12. Kroenke K, Spitzer RL, Williams JBW. The PHQ-9: validity of a brief depression severity measure. J Gen Intern Med. 2001;16:606–13. Authors’ contributions 13. Spitzer R, Kroenke K, Validation WJ. Utility of a self-report version of PRIME- HR analyzed and interpreted the data, and was a major contributor in MD: the PHQ primary care study. JAMA. 1999;282:1737–44. writing the manuscript. WR and EB were the principal investigators of study 14. Löwe B, Kroenke K, Herzog W, Gräfe K. Measuring depression outcome with 1 and critically revised earlier versions of the manuscript. RM was the a brief self-report instrument: sensitivity to change of the patient health principal investigator of studies 2–4, and was a major contributor in writing questionnaire (PHQ-9). J Affect Disord. 2004;81:61–6. https://doi.org/10.1016/ the manuscript. All authors read and approved the final manuscript. S0165-0327(03)00198-8. 15. Löwe B, Spitzer R, Gräfe K, Kroenke K, Quenter A, Zipfel S, et al. Comparative Ethics approval and consent to participate validity of three screening questionnaires for DSM-IV depressive disorders The institutional review board of the German Psychological Association and physicians’ diagnoses. J Affect Disord. 2004;78:131–40. https://doi.org/ (study 1) and the institutional review board of the Department of 10.1016/S0165-0327(02)00237-9. Psychology, Marburg University, Germany (studies 2 to 4) reviewed and 16. Löwe B, Unützer J, Callahan C, Perkins A, Kroenke K. Monitoring depression approved the study protocols. Participants of all studies provided written treatment outcomes with the patient health questionnaire-9. Med Care. informed consent. 
2004;42:1194–201. Reich et al. BMC Psychology (2018) 6:26 Page 12 of 13 17. American Psychiatric Association [APA]. Diagnostic and statistical manual of 38. Reich H, Bockel L, Mewes R. Motivation for psychotherapy and illness beliefs mental disorders, DSM-5. 2013. in Turkish immigrant inpatients in Germany: results of a cultural comparison 18. Pfizer Inc. Patient health questionnaires (PHQ) screeners, official website. study. J Racial Ethn Heal Disparities. 2015;2:112–23. https://doi.org/10.1007/ 2013. http://www.phqscreeners.com/overview.aspx?Screener=02_PHQ-9. s40615-014-0054-y. 39. Woellert F, Kröhnert S, Sippel L, Klingholz R. Ungenutzte Potenziale: Zur 19. Statistisches Bundesamt. Bevölkerung und Erwerbstätigkeit: Bevölkerung mit Lage der Integration in Deutschland. Berlin-Institut für Bevölkerung und Migrationshintergrund, Ergebnisse des Mikrozensus 2013. Wiesbaden: Entwicklung; 2009. Statistisches Bundesamt; 2014. 40. Kroenke K, Spitzer R, Williams J, Löwe B. The patient health questionnaire 20. Huang FY, Chung H, Kroenke K, Delucchi KL, Spitzer RL. Using the patient somatic, anxiety, and depressive symptom scales: a systematic review. health Questionnaire-9 to measure depression among racially and ethnically Psychiatry Prim Care. 2010;32:345–59. diverse primary care patients. J Gen Intern Med. 2006;21:547–52. https://doi. 41. Löwe B, Spitzer RL, Zipfel S, Herzog W. PRIME MD Patient Health org/10.1111/j.1525-1497.2006.00409.x. Questionnaire (PHQ) — German version Manual and materials. 2nd ed. 21. Arthurs E, Steele RJ, Hudson M, Baron M, Thombs BD. Are scores on English Karlsruhe: Pfizer; 2002. and French versions of the PHQ-9 comparable? An assessment of 42. Bracken BA, Barona A. State of the art procedures for translating, validating differential item functioning. PLoS One. 2012;7:e52028. https://doi.org/10. and using psychoeducational tests in cross-cultural assessment. Sch Psychol 1371/journal.pone.0052028. Int. 1991;12:119–32. 
https://doi.org/10.1177/0143034391121010. 22. Hirsch O, Donner-Banzhoff N, Bachmann V. Measurement equivalence of 43. Martin A, Rief W, Klaiberg A, Braehler E. Validity of the brief patient health four psychological questionnaires in native-born Germans, Russian-speaking questionnaire mood scale (PHQ-9) in the general population. Gen Hosp immigrants, and native-born Russians. J Transcult Nurs. 2013; https://doi.org/ Psychiatry. 2006;28:71–7. https://doi.org/10.1016/j.genhosppsych.2005.07.003. 10.1177/1043659613482003. 44. Löwe B, Gräfe K, Zipfel S, Witte S, Loerch B, Herzog W. Diagnosing ICD-10 23. Hepner KA, Morales LS, Hays RD, Edelen MO, Miranda J. Evaluating depressive episodes: superior criterion validity of the patient health differential item functioning of the PRIME-MD mood module among questionnaire. Psychother Psychosom. 2004;73:386–90. https://doi.org/10. impoverished black and white women in primary care. Women’s Heal 1159/000080393. Issues. 2008;18:53–61. https://doi.org/10.1016/j.whi.2007.10.001. 45. Henkel V, Mergl R, Kohnen R, Allgaier A-K, Möller H-J, Hegerl U. Use of brief 24. Mewes R, Christ O, Rief W, Brähler E, Martin A, Glaesmer H. Are depression depression screening tools in primary care: consideration of heterogeneity and somatisation equivalent for migrants and native Germans? An in performance in different patient groups. Gen Hosp Psychiatry. 2004;26: investigation of measurement invariance for the PHQ-9 and PHQ-15. 190–8. https://doi.org/10.1016/j.genhosppsych.2004.02.003. Diagnostica. 2010;56:230–9. https://doi.org/10.1026/0012-1924/a000026. 46. Çorapçıoğlu A, Özer GU. [Patient Health Questionnaire-9]. Patient Health 25. Crane PK, Gibbons LE, Willig JH, Mugavero MJ, Lawrence ST, Schumacher JE, Questionnaire (PHQ) Screeners. http://www.phqscreeners.com/sites/g/files/ et al. Measuring depression levels in HIV-infected patients as part of routine g10016261/f/201412/PHQ9_Turkish%20for%20Turkey.pdf. 
Accessed 12 Dec clinical care using the nine-item patient health questionnaire (PHQ-9). AIDS Care. 2010;22:874–85. https://doi.org/10.1080/09540120903483034. 47. Yazici Güleç M, Güleç H, Simşek G, Turhan M, Aydin Sünbül E. Psychometric 26. Holland P, Wainer H. Differential item functioning. In: Hillsdale (NJ): properties of the Turkish version of the patient health questionnaire- Lawrence Erlbaum associates; 1993. somatic, anxiety, and depressive symptoms. Compr Psychiatry. 2012;53:623– 27. Camilli G, Shepard L. Methods for identifying biased test items. Thousand 9. https://doi.org/10.1016/j.comppsych.2011.08.002. Oaks: Sage Publications; 1994. 48. Schenk L, Bau AM, Borde T, Butler J, Lampert T, Neuhauser H, et al. 28. Adler M, Hetta J, Isacsson G, Brodin U. An item response theory evaluation Mindestindikatorensatz zur Erfassung des Migrationsstatus [minimum set of of three depression assessment instruments in a clinical sample. BMC Med indicators for measuring the migration status]. Bundesgesundheitsblatt - Res Methodol. 2012;12:84. https://doi.org/10.1186/1471-2288-12-84. Gesundheitsforsch - Gesundheitsschutz. 2006;49:853–60. 29. Reise SP, Waller NG. Item response theory and clinical measurement. Annu 49. Cameron IM, Crawford JR, Lawton K, Reid IC. Psychometric comparison of Rev Clin Psychol. 2009;5:27–48. https://doi.org/10.1146/annurev.clinpsy. PHQ-9 and HADS for measuring depression severity in primary care. Br J 032408.153553. Gen Pract. 2008;58:32–6. https://doi.org/10.3399/bjgp08X263794. 30. Waller NG, Thompson JS, Wenk E. Using IRT to separate measurement bias 50. Dum M, Pickren J, Sobell LC, Sobell MB. Comparing the BDI-II and the PHQ- from true group differences on homogeneous and heterogeneous scales: 9 with outpatient substance abusers. Addict Behav. 2008;33:381–7. https:// an illustration with the MMPI. Psychol Methods. 2000;5:125–46. https://doi. doi.org/10.1016/j.addbeh.2007.09.017. org/10.1037//1082-989X.5.1.125. 51. 
Kocalevent R-D, Hinz A, Brähler E. Standardization of the depression screener 31. Meijer RR, Baneke JJ. Analyzing psychopathology items: a case for patient health questionnaire (PHQ-9) in the general population. Gen Hosp nonparametric item response theory modeling. Psychol Methods. 2004;9: Psychiatry. 2013;35:551–5. https://doi.org/10.1016/j.genhosppsych.2013.04.006. 354–68. https://doi.org/10.1037/1082-989X.9.3.354. 52. Castillo R, Waitzkin H, Ramirez Y, Escobar JI. Somatization in primary care, 32. Eurostat. Migrants in Europe. A statistical portrait of the first and second with a focus on immigrants and refugees. Arch Fam Med. 1995;4:637–46. generation. Luxembourg: European Union; 2011. 53. Kirmayer LJ, Sartorius N. Cultural models and somatic syndromes. 33. Aichberger MC, Schouler-Ocak M, Mundt A, Busch MA, Nickels E, Psychosom Med. 2007;69:832–40. https://doi.org/10.1097/PSY. Heimann HM, et al. Depression in middle-aged and older first 0b013e31815b002c. generation migrants in Europe: results from the survey of health, 54. Mewes R, Rief W. Are somatoform complaints and causal attributions in ageing and retirement in Europe (SHARE). Eur Psychiatry. 2010;25:468– Turkish migrants associated with their cultural background or the migration 75. https://doi.org/10.1016/j.eurpsy.2009.11.009. itself? Zeitschrift für Medizinische Psychol. 2009;18:135–9. https://content. 34. Lindert J, Von EOS, Priebe S, Mielck A, Brähler E. Depression and anxiety in iospress.com/articles/zeitschrift-fur-medizinische-psychologie/zmp18-3-4-07. labor migrants and refugees - a systematic review and meta-analysis. Soc Accessed 23 Feb 2012 Sci Med. 2009;69:246–57. https://doi.org/10.1016/j.socscimed.2009.04.032. 55. Behrens K, Machleidt W, Haltenhof H, Ziegenbein M, Calliess IT. 35. Weidacher A. Schlußfolgerungen und partizipationspolitischer Ausblick Somatization and vulnerability to offence in immigrants with mental [Conclusions and political forecast]. In: Weidacher A, editor. 
In Deutschland disorders - evidence or eminence? Nervenheilkunde. 2008;27:639–43. zu Hause: politische Orientierungen griechischer, italienischer, türkischer 56. Cheung GW, Rensvold RB. Evaluating goodness-of- fit indexes for testing und deutscher junger Erwachsener im Vergleich (DJI-Ausländersurvey) At measurement invariance. Struct Equ Model. 2002;(2):233–55. home in Germany: a comparison of political orientations in Greek, Italian, 57. Muthén L. Muthén B. Mplus User’s guide. Turkish and Germa. Opladen: Leske + Budrich; 2000. p. 265–72. 58. Samejima F. Estimation of latent ability using a response pattern of graded 36. Mewes R, Rief W, Stenzel N, Glaesmer H, Martin A, Brähler E. What is scores. Psychometric Monograph Np. 1969:17. “normal” disability? An investigation of disability in the general population. 59. Samejima F. The graded response model. In: van der Linden WJ, Pain. 2009;142:36–41. https://doi.org/10.1016/j.pain.2008.11.007. Hambleton RK, editors. Handbook of modern item response theory. 37. Mewes R, Asbrock F, Laskawi J. Perceived discrimination and impaired New York: Springer; 1996. mental health in Turkish immigrants and their descendents in Germany. 60. Embretson S, Reise S. Item response theory for psychologists. Hove: Compr Psychiatry. 2015;62:42–50. Psychology Press; 2013. Reich et al. BMC Psychology (2018) 6:26 Page 13 of 13 61. Cai L, Thissen D, du Toit SHC. IRTPRO 2.1 for windows (item response theory for patient-reported outcomes). 2014. 62. Kwakkenbos L, Arthurs E, van den Hoogen FHJ, Hudson M, van Lankveld WGJM, Baron M, et al. Cross-language measurement equivalence of the Center for Epidemiologic Studies Depression (CES-D) scale in systemic sclerosis: a comparison of Canadian and Dutch patients. PLoS One. 2013;8: e53923. https://doi.org/10.1371/journal.pone.0053923. 63. Azocar F, Areán P, Miranda J, Muñoz RF. Differential item functioning in a Spanish translation of the Beck depression inventory. J Clin Psychol. 2001;57: 355–65. 
https://doi.org/10.1002/jclp.1017 64. Zhong Q, Gelaye B, Fann JR, Sanchez SE, M a W. Cross-cultural validity of the Spanish version of PHQ-9 among pregnant Peruvian women: a Rasch item response theory analysis. J Affect Disord. 2014;158:148–53. https://doi. org/10.1016/j.jad.2014.02.012. 65. Cameron IM, Crawford JR, Lawton K, Reid IC. Differential item functioning of the HADS and PHQ-9: an investigation of age, gender and educational background in a clinical UK primary care sample. J Affect Disord. 2013;147: 262–8. https://doi.org/10.1016/j.jad.2012.11.015. 66. Thibodeau MA, Asmundson GJG. The PHQ-9 assesses depression similarly in men and women from the general population. Pers Individ Dif. 2014;56:149–53. 67. Drasgow F. Study of the measurement bias of two standardized psychological tests. J Appl Psychol. 1987;72:19–29. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png BMC Psychology Springer Journals
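Supplementary illustration for Table 3: the significance flags in the DIF table follow from comparing each reported X² value with a chi-square distribution whose degrees of freedom are given in the table footnotes (4 for the total test, 1 for the slope parameter, 3 for the location parameters). The sketch below recomputes the p-values for a handful of table entries using closed-form upper-tail probabilities. It is not the authors' procedure (the paper cites IRTPRO for the IRT analyses); the function names `chi2_sf` and `stars` and the list `REPORTED` are hypothetical helpers introduced here, and only the X² values, df, and flags are taken from Table 3.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X >= x) of a chi-square distribution.

    Closed forms are used, so only df = 1, 3 and 4 (the values named in
    the Table 3 footnotes) are supported.
    """
    if x < 0:
        raise ValueError("x must be non-negative")
    half = x / 2.0
    if df == 1:
        return math.erfc(math.sqrt(half))
    if df == 3:
        return math.erfc(math.sqrt(half)) + math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    if df == 4:
        return math.exp(-half) * (1.0 + half)
    raise ValueError("only df in {1, 3, 4} is implemented")


def stars(p: float) -> str:
    """Map a p-value to the flag convention used in Table 3."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""


# (label, X2 as printed in Table 3, df from the footnotes, flag as printed)
REPORTED = [
    ("language, item 1, total", 10.9, 4, "*"),
    ("language, item 1, slope", 6.4, 1, "*"),
    ("language, item 3, total", 8.3, 4, ""),
    ("language, item 3, location", 8.3, 3, "*"),
    ("language, item 5, total", 19.7, 4, "***"),
    ("language, item 5, location", 16.4, 3, "***"),
    ("ethnicity, item 5, total", 14.8, 4, "**"),
    ("ethnicity, item 7, slope", 6.8, 1, "**"),
    ("ethnicity, item 7, location", 12.0, 3, "**"),
]

for label, x2, df, flag in REPORTED:
    p = chi2_sf(x2, df)
    verdict = "matches" if stars(p) == flag else "differs from"
    print(f"{label}: X2 = {x2}, df = {df}, p = {p:.4f} ({verdict} the printed flag)")
```

The same X² value can be significant for one test and not for another: 8.3 on 3 degrees of freedom (location parameters of item 3) falls below .05, whereas 8.3 on 4 degrees of freedom (total test) does not, which is why only the location column of the sleep item carries a flag.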

Publisher: BioMed Central
Copyright: Copyright © 2018 by The Author(s).
Subject: Psychology; Psychology Research; Clinical Psychology; Cognitive Psychology
eISSN: 2050-7283
DOI: 10.1186/s40359-018-0238-z

Abstract

Background: The Patient Health Questionnaire’s depression module (PHQ-9) is a widely used screening tool to assess depressive disorders. However, cross-linguistic and cross-cultural validation of the PHQ-9 is mostly lacking. This study investigates whether scores on the German and Turkish versions of the PHQ-9 are comparable.

Methods: Data from Germans without a migration background (German version, n = 1670) and Turkish immigrants in Germany (either German or Turkish version, n = 307) were used. Differential Item Functioning (DIF) was assessed using Item Response Theory (IRT) models.

Results: Several items of the PHQ-9 were found to exhibit DIF related to language or ethnicity, e.g. ‘sleep problems’, ‘appetite changes’ and ‘anhedonia’. However, PHQ-9 sum scores were found to be unbiased, i.e., DIF had no notable impact on scale levels.

Conclusions: PHQ-9 sum scores can be compared between Turkish immigrants and Germans without a migration background without any adjustments, regardless of whether they complete the German or the Turkish version.

Keywords: Depression, Patient health Questionnaire-9 (PHQ-9), Item response theory (IRT), Differential item functioning (DIF), Cross-cultural / ethnic comparison

* Correspondence: ricarda.nater-mewes@univie.ac.at. Department of Psychology, University of Marburg, Marburg, Germany. Outpatient Unit for Research, Teaching and Practice, Faculty of Psychology, University of Vienna, Renngasse 6-8, 1010 Vienna, Austria. Full list of author information is available at the end of the article.

© The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Background

Depression is a highly prevalent disorder leading to suffering and disability [1, 2]. It is predicted to be the major cause of burden of disease by 2020 [3]. Differences exist across countries and ethnic groups in epidemiology [4–7] and symptom presentation [8–10] of depressive disorders. Many cross-cultural studies applied self-report questionnaires to assess and describe the phenomenology of depressive disorders. However, cross-linguistic and cross-cultural validation of self-report questionnaires is mostly lacking. Such validation analyses are urgently needed for a valid comparison of prevalence rates and symptom profiles of depressive disorders across linguistic and ethnic groups [11]. Among self-report questionnaires for assessing depression, the Patient Health Questionnaire-9 (PHQ-9) [12, 13] is one of the most frequently used and best validated questionnaires worldwide [14–16]. It is recommended as a general measure of depression severity by the DSM-5 (Diagnostic and Statistical Manual of Mental Disorders, 5th Edition) [17] and has been translated into over 70 languages and dialects [18]. In the present study, we investigate whether PHQ-9 scores are comparable between the German majority population without a migration background and the largest minority group in Germany, Turkish immigrants [19].

To our knowledge, only three studies have investigated the comparability of different language versions of the PHQ-9: Huang and colleagues [20] found differences in item functioning between the English and Chinese version of the items assessing sleep, appetite, and psychomotor changes in a large sample of primary care patients. Comparing the English and Spanish version, they also found differences in sleep and appetite items, plus anhedonia and self-esteem items. Arthurs and colleagues [21] found differences between the English and French version for anhedonia, sleep, and self-esteem items in patients with systemic sclerosis. Comparing the German and Russian version in primary care patients [22], a difference in item functioning was found in the sleep problems item.

Regarding the comparability across ethnic and racial groups, two studies have confirmed the comparability of the English version between African-American and non-Hispanic White primary care patients [20, 23]. Moreover, one study in a general population sample confirmed the comparability of the German version between Germans without a migration background and a heterogeneous sample of immigrants living in Germany [24]. However, Crane and colleagues found differences in items about sleep, low energy, and psychomotor changes between HIV-infected African-Americans and Whites in the English version [25], and Baas and colleagues confirmed a cultural bias in the Dutch version of the PHQ-9 in the item psychomotor changes between Surinam Dutch and Native Dutch male primary care patients [11]. Although the reasons for differences in item functioning are mostly unclear, most studies confirmed that such differences had minimal impact on the scale level and that sum scores were mainly comparable across the investigated samples.

To establish cross-linguistic and cross-cultural measurement equivalence, equality in item functioning needs to be inspected. The probability of endorsing a specific item should be the same for all individuals with a certain underlying level of depression, and should not be influenced by ethnic or linguistic group. If these prerequisites are not fulfilled, the item is considered to have Differential Item Functioning (DIF) [26, 27]. The absence of DIF justifies cross-cultural comparisons based on the sum score as an indicator for the latent trait, and allows observed differences to be related to actual differences between groups. DIF can be appropriately assessed using Item Response Theory (IRT) analysis [28, 29]. IRT provides parametric and nonparametric models, which constitute powerful tools for separating measurement bias from true group differences [30, 31].

The objective of this study is to investigate whether PHQ-9 scores are comparable between Turkish immigrants in Germany and Germans without a migration background. This is especially important since Turkish immigrants represent the largest minority group in Germany [19], and are among the three largest immigrant populations in other European countries such as the Netherlands, Denmark, and Austria [32]. Moreover, as prevalence rates of affective disorders in labor migrants in Europe are elevated [5, 33, 34], properly working assessment instruments for depression are particularly important in this group.

First, we examine whether the German and Turkish language versions of the PHQ-9 are comparable. Then, we examine whether the German PHQ-9 is comparable across ethnic groups. This two-step approach is necessary because Turkish language utilization and German language proficiency vary considerably among Turkish immigrants [35]. Based on previous studies on DIF in PHQ-9 items, one might expect DIF in the sleep, psychomotor changes, anhedonia, appetite changes, and low self-esteem items. However, this is the first study to investigate cross-linguistic and cross-cultural validity of the Turkish version of the PHQ-9, and one of the few to study this topic at all. Consequently, all items of the PHQ-9 were tested on DIF without statistical pre-assumptions. Based on the results, recommendations for applying the PHQ-9 in Turkish immigrants are provided.

Methods

Data sources

This article provides secondary analyses of original data obtained in four independent, cross-sectional studies.

Study 1

A representative sample of the German general population (n = 2510) was screened for disability, somatic complaints, mental health, and healthcare utilization. The assessment was conducted by a demographic consulting company (USUMA, Berlin) in 2007. The study material was available in German only. Details of the procedure are described elsewhere, e.g. [36]. For the present analyses, only data of Germans without a migration background and of Turkish immigrants responding to the German language version of the PHQ-9 are used.

Study 2

A convenience sample of Turkish immigrants (n = 214) completed questionnaires about perceived discrimination and depressive and somatoform symptoms. Data were collected in 2011 and 2012 [37]. The study material was provided in German or Turkish according to the participants’ choice. The study was carried out using an online survey and paper-and-pencil versions with a snowball system.

Study 3

Two matched inpatient samples (Turkish immigrants vs. Germans without a migration background, n = 50 each) were recruited in five psychiatric clinics in 2011 and 2012 [38]. Participants were asked about subjective concepts of mental illness, motivation for psychotherapy, and mental health symptoms. The study material was provided as paper-and-pencil versions in German or Turkish according to the participants’ choice.
A bilingual research assistant helped illiterate participants.

Study 4

In a pilot study, Turkish immigrant inpatients (n = 29) were recruited to participate in a randomized controlled trial (RCT) on the effects of a motivation-enhancing program at the beginning of their inpatient treatment. They provided baseline information about motivation for psychotherapy, mental health symptoms, and illness perception at the beginning of inpatient treatment in two different psychiatric clinics in 2013 and 2014. Study material was available on a computer in German or Turkish according to the participants’ choice. A bilingual research assistant helped participants who were illiterate or needed assistance with the computer. This sample was included so that Turkish immigrants with a low level of literacy were represented in the analysis. Persons with low German language proficiency and low educational levels are usually excluded from research in Germany, but are characteristic of the population of Turkish immigrants [39].

Measures

Participants in all studies provided information on sociodemographic and migration-related variables, and symptoms of depression measured by the PHQ-9. The PHQ-9 is a nine-item self-rating instrument, with each item representing one of the DSM-IV (Diagnostic and Statistical Manual of Mental Disorders, 4th Edition) criteria for a depressive episode (anhedonia, depressed mood, sleep problems, feeling tired, change in appetite, negative self-evaluation, concentration problems, psychomotor changes, suicidality). Each item can be scored as 0 (not at all), 1 (several days), 2 (more than half the days), or 3 (nearly every day), according to the frequency of experiencing difficulties in the respective area in the previous 2 weeks. Sum scores range from 0 to 27. Interpreting the PHQ-9 with respect to depression severity, a score of 5 to 9 represents mild depressive symptoms, 10 to 14 moderate depressive symptoms, and 15 to 27 severe depressive symptoms [40].

German and Turkish versions of the PHQ-9 were retrieved from the Pfizer Patient Health Questionnaire Screeners website [18]. The German version of the PHQ-9 [41] was developed in several steps of translation and blind back-translation following state-of-the-art procedures for test translation [42]. Various studies have demonstrated its validity [14, 15, 43–45]. Furthermore, results from the American and German PHQ validation studies are similar regarding criterion validity, construct validity, internal consistency, sensitivity to change and recommended cut-off scores [12–16]. Consequently, the German PHQ-9 can be considered a trustworthy and reliable PHQ version. However, to date, the Turkish version of the PHQ-9 [46] has been validated in only one study [47], which showed acceptable results regarding reliability and validity for the Turkish population in Turkey.

Statistical procedure

Data preparation and definition of the subgroups

Overall, data of n = 2853 participants were eligible from the four studies described above. n = 10 participants had more than two missing items in the PHQ-9 and were excluded from the present analysis. We selected three subgroups, differing in ethnicity (no migration background at all vs. Turkish migration background) and language version of the PHQ-9 (German vs. Turkish): Germans with no migration background completing the German version of the PHQ-9 (G-G), Turkish immigrants completing the German version of the PHQ-9 (T-G), and Turkish immigrants completing the Turkish version of the PHQ-9 (T-T). Ethnic groups were defined by the parents’ country of birth according to Schenk et al. [48]. Persons were included only if both parents were born either in Germany or in Turkey. n = 334 participants were excluded based on this criterion. Non-migrants had to be born in Germany, i.e. have no immigration experience. Their mother tongue had to be German, and they had to hold a German passport. Based on these criteria, a further n = 5 participants were excluded. The age range was restricted to 18–65 years, since there were no elderly participants in the T-T sample and only very few in the T-G sample. Accordingly, n = 90 participants under 18 and n = 437 participants over 65 were excluded. Final sample sizes were n(G-G) = 1670, n(T-G) = 191, and n(T-T) = 116.

Evaluation of prerequisites

IRT analyses require unidimensionality, i.e. the items should measure the symptoms of one underlying disorder. The PHQ-9 has been shown to be a one-dimensional measure of depression in previous studies [23, 25, 49–51]. Consequently, we hypothesize that unidimensionality is present as well in the German and Turkish versions of the PHQ-9. However, as a special relevance of somatoform complaints in migrant populations in general [10, 52, 53] and Turkish immigrants in particular [54, 55] has been discussed, a two-factor solution was also plausible. We addressed dimensionality using confirmatory factor analysis (CFA), testing a single-factor model and a two-factor model including the items ‘sleep problems’, ‘low energy’, ‘appetite changes’, and ‘psychomotor changes’ on a somatic factor and the items ‘anhedonia’, ‘depressed mood’, ‘low self-esteem’, ‘concentration difficulties’, and ‘suicidal ideation’ on a cognitive-affective factor. Dimensionality of the PHQ-9 was inspected for all three subgroups separately and for the total sample. Missing values were handled with full-information maximum likelihood estimation (G-G: one missing item, n = 10; two missing items, n = 0. T-G: one missing item, n = 4; two missing items, n = 1. T-T: one missing item, n = 2; two missing items, n = 0). For model fit comparison, we followed a procedure which involves comparing the change in goodness-of-fit indices, which are unaffected by sample size [56]. Following Cheung’s recommendations, we compared the CFI between the single-factor and the two-factor models, with a difference of ΔCFI < 0.01 indicating substantively similar models [56]. Mplus version 5 was used for CFA [57].

Item response theory (IRT) analyses

For IRT analyses, the parametric graded-response model (GRM) [58, 59], the polytomous extension of the two-parameter logistic model, was applied. The GRM estimates two types of item parameters and one person parameter, based on the pattern of responses observed in the data. The item parameters are the item slope a and the item location b. The item slope parameter a indicates how steeply the probability of endorsing an item increases with an increasing underlying level of depression. The person parameter theta (θ) estimates the underlying level of depression. The item location parameters b indicate the positions of the thresholds from one response category to another. The b parameters represent the trait level necessary to respond above the threshold with .50 probability [60]. In the case of the PHQ-9, there are three thresholds: from ‘not at all’ to ‘several days’ (b1), from ‘several days’ to ‘more than half the days’ (b2), and from ‘more than half the days’ to ‘nearly every day’ (b3). Item parameters can be interpreted on a z-scale (mean = 0, standard deviation = 1). All parameters estimated by the GRM are reported on a logit scale. Item Characteristic Curves (ICCs) were used for the graphical investigation of the operating characteristics. The form of an ICC describes how changes in trait level relate to changes in the probability of a specified response. For polytomous items, the ICC regresses the probability of responses in each category on trait level [60].

For Differential Item Functioning (DIF), our analyses disentangle differences in item functioning related to language (German vs. Turkish) and to ethnicity and migration background (Germans without a migration background vs. Turkish migration background). The first analysis investigated DIF related to language, comparing T-G and T-T. The second investigated DIF related to ethnicity and migration background, comparing T-G to G-G. DIF analyses were conducted in two steps: first selecting anchor items, and then evaluating candidate items for DIF. Anchor items allow responses from two groups to be linked so that parameters are estimated in a common metric [60]. Since we had no a priori information about DIF-free items in our samples, we used an iterative process to identify anchor items to be used for evaluating DIF in candidate items. We adopted the “leave-one-out” approach for the selection of anchor items, i.e. every single item was tested for DIF, assuming that the remaining items were DIF-free and thus served as anchor items. If any of the χ² tests for an item was significant at the p < .05 level, the item was considered to be a candidate DIF item. This process was repeated with the remaining items to purify the set of anchor items until there were no more new candidate DIF items in the next analysis. In the second stage of analysis, the candidate DIF items were tested for DIF relative to the set of anchor items that had been identified in step one.

Finally, Test Characteristic Curves (TCC) and Test Information Curves (TIC) were inspected. The TCC plots the most likely standard PHQ-9 score associated with each level of depression [25]. The TIC plots the information at each depression level, i.e. the measurement precision at each depression level and the standard error associated with each depression level. Where the TCC is steep and test information is high, the PHQ-9 has good measurement precision and a small standard error of measurement. All IRT analyses were computed with IRTPRO 2.1 for Windows [61].

Results

Sample characteristics

A final sample of n = 1977 participants was analyzed. The mean age of the total sample was 42.6 years, with T-G being significantly younger than the other subgroups (32.6 vs. 43.7 years, see Table 1). In the total sample, 97% of participants had completed nine or more years of education, and 61% were employed. However, only 82% of T-T had completed 9 years of education or beyond, and the employment rate was only 47%. The proportion of inpatients was markedly higher in T-T (57%) than in the other subgroups (3 and 5%). Moreover, the proportion of participants with moderate or severe depression as estimated by the PHQ-9 sum score was higher among T-T. Second-generation immigrants were more likely to be in the T-G subgroup (62% vs. 10%). T-G were also more likely to indicate German as their mother tongue (17% vs. 6%) and to have a better German language proficiency, if their mother tongue was Turkish.

Evaluation of prerequisites

The single-factor model showed good fit in each subgroup and for the entire sample (G-G: χ²(27) = 521.6, p < .001; CFI = .938; RMSEA [90% C.I.] = .105 [.097; .113]. T-G: χ²(27) = 67.4, p < .001; CFI = .955; RMSEA [90% C.I.] = .089 [.062; .115]. T-T: χ²(27) = 22.0, p > .05; CFI = 1.0; RMSEA [90% C.I.] = .000 [.000; .057]. Total: χ²(27) = 454.6, p < .001; CFI = .964; RMSEA [90% C.I.] = .090 [.082; .097]).
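The RMSEA point estimates reported for the single-factor model can be approximately reproduced from the χ² statistics, the degrees of freedom, and the subgroup sizes given in this article. The following is a minimal sketch that assumes the conventional point-estimate formula, RMSEA = sqrt(max(χ² − df, 0) / (df · (n − 1))). The function name `rmsea` is hypothetical, some software divides by n instead of n − 1, and the authors' Mplus estimation with full-information maximum likelihood may differ in detail.

```python
import math


def rmsea(chi2: float, df: int, n: int) -> float:
    """Conventional RMSEA point estimate from chi-square, df, and sample size.

    Assumption: uses (n - 1) in the denominator; some programs use n instead.
    """
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


# (chi-square, df, n) for the single-factor model, taken from the text
single_factor = {
    "G-G": (521.6, 27, 1670),
    "T-G": (67.4, 27, 191),
    "T-T": (22.0, 27, 116),
    "Total": (454.6, 27, 1977),
}

for group, (chi2, df, n) in single_factor.items():
    print(f"{group}: RMSEA = {rmsea(chi2, df, n):.3f}")
```

Rounded to three decimals, these estimates agree with the RMSEA values reported for the single-factor model (.105, .089, .000, and .090). The T-T value is zero because its χ² is smaller than its degrees of freedom.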
The fit of the two-factor model was similarly good in all subgroups and in the entire sample (G-G: χ²(26) = 488.5, p < .001; CFI = .942; RMSEA [90% C.I.] = .103 [.095; .111]. T-G: χ²(26) = 58.0, p < .001; CFI = .964; RMSEA [90% C.I.] = .080 [.052; .108]. T-T: χ²(26) = 21.5, p > .05; CFI = 1.0; RMSEA [90% C.I.] = .000 [.000; .057]. Total: χ²(26) = 422.4, p < .001; CFI = .967; RMSEA [90% C.I.] = .088 [.081; .095], compared with RMSEA [90% C.I.] = .090 [.082; .097] for the single-factor model in the total sample). The differences in CFI between the one-factor and the two-factor model were < 0.01 for all subgroups as well as for the total sample (ΔCFI G-G = 0.004, ΔCFI T-G = 0.009, ΔCFI T-T = 0, ΔCFI total = 0.003), which indicates substantively similar models. As the single-factor model is more parsimonious, we assume that our hypothesis is confirmed and presuppose unidimensionality of the German and Turkish PHQ-9 versions for the following IRT analyses.

Table 1 Sample description stratified by language and ethnicity

| Characteristic | G-G (n = 1670) | T-G (n = 191) | T-T (n = 116) | Total (n = 1977) | Test statistic |
|---|---|---|---|---|---|
| Sociodemographic characteristics | | | | | |
| Age in years, mean (SD) | 43.7 (12.7) | 32.6 (9.9) | 43.7 (11.1) | 42.6 (12.8) | F(2) = 70.2*** |
| Female sex, n (%) | 930 (55.7) | 109 (57.4) | 71 (61.2) | 1110 (56.2) | χ²(2) = 1.5* |
| Education ≥9 years (a), n (%) | 1638 (98.2) | 181 (96.3) | 94 (82.4) | 1913 (97.1) | χ²(2) = 157.8*** |
| Being employed (b), n (%) | 1037 (62.1) | 118 (62.4) | 54 (46.6) | 1209 (61.2) | χ²(2) = 11.1** |
| Clinical characteristics | | | | | |
| Being in inpatient treatment, n (%) | 49 (2.9) | 9 (4.7) | 66 (56.9) | 124 (6.3) | χ²(2) = 538.1*** |
| PHQ-9 total score, mean (SD) | 2.6 (3.9) | 7.2 (6.3) | 13.6 (7.3) | 3.7 (5.3) | F(2) = 397.5*** |
| Depression severity as defined by the PHQ-9 | | | | | |
| None (0–4), n (%) | 1360 (81.4) | 73 (38.2) | 12 (10.3) | 1530 (77.4) | χ²(2) = 409.4*** |
| Mild (5–9), n (%) | 210 (12.6) | 64 (33.5) | 33 (28.4) | 222 (11.2) | χ²(2) = 72.9*** |
| Moderate (10–14), n (%) | 62 (3.7) | 31 (16.2) | 17 (14.7) | 162 (8.2) | χ²(2) = 168.4*** |
| Severe (≥15), n (%) | 38 (2.3) | 23 (12.0) | 54 (46.6) | 63 (3.2) | χ²(2) = 256.0*** |
| Migration-related characteristics | | | | | |
| Years since immigration (c), mean (SD) | – | 28.0 (11.1) | 26.1 (10.9) | 26.9 (11.0) | F(1) = 1.7* |
| Second generation (d), n (%) | – | 117 (61.6) | 12 (10.3) | 129 (42.2) | χ²(1) = 76.8*** |
| Mother tongue = German, n (%) | – | 32 (16.8) | 7 (6.0) | 39 (12.7) | χ²(1) = 7.5** |
| German language proficiency (e), mean (SD) | – | 1.4 (0.7) | 2.8 (1.0) | 2.0 (1.1) | F(1) = 165.8*** |

G-G Germans with no migration background completing the German version of the PHQ-9, T-G Turkish immigrants completing the German version of the PHQ-9, T-T Turkish immigrants completing the Turkish version of the PHQ-9
(a) Includes all school graduation certificates normally received after 9 or more years of school, i.e. the German “Hauptschulabschluss”, “Realschulabschluss” or “Abitur”, and the Turkish “Ortaokul diploması” or “Lise bitirme sınavı”. (b) Working part-time or full-time. (c) Applies only for participants who were born in Turkey. (d) Participants born in Germany, both parents born in Turkey. (e) Self-reported German language proficiency, if mother tongue is Turkish (1 = very good, 4 = poor/bad)
*p < .05, **p < .01, ***p < .001

IRT parameter estimates and inspection of ICCs

The item slope parameters a ranged from 1.45 to 4.16, indicating that the response categories differentiated among trait levels fairly well (Table 2). The ascending order of the item location parameters b1, b2, and b3 confirmed the correct order of response options. Additionally, the range of the item location parameters indicated that the PHQ-9 items covered levels of depression from about 1 standard deviation below to 2 standard deviations above the sample population mean.

Table 2 Item slope a and item locations b1, b2, and b3, stratified by language and ethnicity

| Item | Sample | a (SE) | b1 (SE) | b2 (SE) | b3 (SE) |
|---|---|---|---|---|---|
| 1. Anhedonia | G-G | 2.93 (0.17) | −0.49 (0.04) | 0.92 (0.07) | 1.54 (0.10) |
| | T-G | 2.59 (0.35) | −0.45 (0.12) | 1.15 (0.14) | 1.85 (0.20) |
| | T-T | 1.45 (0.32) | −0.52 (0.29) | 1.46 (0.26) | 2.06 (0.34) |
| 2. Depressed mood | G-G | 3.97 (0.26) | −0.26 (0.04) | 0.83 (0.06) | 1.47 (0.10) |
| | T-G | 3.46 (0.51) | −0.13 (0.10) | 0.80 (0.11) | 1.51 (0.15) |
| | T-T | 4.16 (0.84) | −0.13 (0.13) | 0.88 (0.18) | 1.26 (0.22) |
| 3. Sleep problems | G-G | 2.54 (0.14) | −0.60 (0.04) | 0.63 (0.06) | 1.31 (0.09) |
| | T-G | 2.37 (0.32) | −0.47 (0.13) | 0.55 (0.11) | 1.34 (0.16) |
| | T-T | 2.33 (0.48) | −0.48 (0.20) | 0.67 (0.18) | 1.02 (0.21) |
| 4. Low energy | G-G | 3.02 (0.17) | −0.83 (0.04) | 0.56 (0.06) | 1.32 (0.09) |
| | T-G | 2.94 (0.40) | −0.84 (0.13) | 0.43 (0.10) | 1.22 (0.14) |
| | T-T | 2.95 (0.61) | −0.78 (0.23) | 0.77 (0.17) | 1.12 (0.21) |
| 5. Appetite changes | G-G | 2.53 (0.16) | 0.04 (0.05) | 1.08 (0.08) | 2.07 (0.15) |
| | T-G | 2.40 (0.34) | 0.00 (0.11) | 0.81 (0.12) | 1.55 (0.18) |
| | T-T | 1.57 (0.36) | 0.07 (0.20) | 1.59 (0.30) | 1.88 (0.34) |
| 6. Low self-esteem | G-G | 3.04 (0.20) | 0.05 (0.04) | 0.93 (0.07) | 1.54 (0.11) |
| | T-G | 2.95 (0.44) | 0.14 (0.10) | 1.01 (0.12) | 1.62 (0.17) |
| | T-T | 2.97 (0.64) | 0.03 (0.14) | 1.13 (0.21) | 1.51 (0.26) |
| 7. Concentration difficulties | G-G | 2.92 (0.19) | 0.08 (0.05) | 1.07 (0.08) | 1.89 (0.13) |
| | T-G | 2.08 (0.30) | 0.09 (0.11) | 0.98 (0.14) | 1.75 (0.21) |
| | T-T | 2.33 (0.51) | 0.33 (0.15) | 1.27 (0.23) | 1.93 (0.32) |
| 8. Psychomotor changes | G-G | 2.32 (0.17) | 0.63 (0.07) | 1.64 (0.13) | 2.39 (0.20) |
| | T-G | 2.67 (0.43) | 0.56 (0.11) | 1.51 (0.17) | 2.04 (0.23) |
| | T-T | 2.76 (0.64) | 0.25 (0.14) | 1.25 (0.22) | 1.58 (0.27) |
| 9. Suicidal ideation | G-G | 2.74 (0.23) | 0.79 (0.07) | 1.64 (0.12) | 2.29 (0.19) |
| | T-G | 2.40 (0.42) | 1.02 (0.13) | 1.71 (0.20) | 2.28 (0.29) |
| | T-T | 2.06 (0.52) | 0.90 (0.18) | 1.86 (0.32) | 2.08 (0.36) |

In the original table, parameters for which DIF (see Table 3) is present are printed in bold.
G-G Germans with no migration background completing the German version of the PHQ-9 (n = 1670), T-G Turkish immigrants completing the German version of the PHQ-9 (n = 191), T-T Turkish immigrants completing the Turkish version of the PHQ-9 (n = 116)

The graphical inspection of the ICCs (Fig. 1) showed that all PHQ-9 items work well in our samples. Peaks of RCCs (Response Characteristic Curves) for response options 2 and 3 (and for ‘psychomotor changes’ and ‘suicidal ideation’ also response option 1) corresponded to underlying depression levels well above the population mean. Most RCCs had their own peak where the respective response option was the most likely to be endorsed. However, in various items and especially in the T-T sample (Fig. 1, right column), response option 2 ‘more than half the days’ did not offer much additional information, since the area under its RCC which is covered in addition to the adjacent RCCs is small or non-existent.

Fig. 1 Item characteristic curves (ICC) for each PHQ-9 depression item in all three subgroups. Left column: ICCs for each item for G-G; middle column: ICCs for T-G; right column: ICCs for T-T. Response options are 0 (not at all), 1 (several days), 2 (more than half the days), or 3 (nearly every day). The X-axis indicates the estimated level of depression (theta). The Y-axis indicates the probability of endorsing a response option at a given level of estimated depression

DIF related to language

In the first step, we identified five DIF-free items (items 2, 6–9, see Table 3). These items served as anchor items for evaluating DIF in the remaining items. Statistically significant DIF regarding item slope was identified in the item ‘anhedonia’. The probability of endorsing this item with increasing level of depression increased more rapidly in T-G than in T-T. Significant DIF was found for the location parameters of the items ‘sleep problems’, ‘low energy’, and ‘appetite changes’. While the locations of the first threshold (b1: ‘not at all’ to ‘several days’) were similar in both subgroups, the locations of the thresholds b2 and b3 differed: b2 was lower in T-G for all items, while b3 was higher in T-G in items 3 and 4, and higher in T-T in item 5 (see Table 2). Estimating group parameters with DIF-free items only, the group estimate of the latent depression factor was 1.03 standard deviations higher in T-T than in T-G. Using all items, it was 1.04 standard deviations higher in T-T than in T-G. In summary, language-related DIF is present in four items, but the impact on the scale level and the total score seems to be minimal.

DIF related to ethnicity and migration background

In the first step, we identified seven DIF-free items (items 1–4, 6, 8–9, see Table 3), which served as anchor items. The items ‘appetite changes’ and ‘concentration difficulties’ were evaluated for DIF in the second stage of analysis. While the threshold b1 was similar for both groups, the thresholds b2 and b3 were shifted upwards for G-G as compared to T-G. For G-G, the probability of endorsing item 7 increased more rapidly with rising underlying level of depression than for T-G. Estimating group parameters with DIF-free items only, the mean depression level was 1 standard deviation higher in T-G than in G-G. Based on IRT estimates of depression using all items, the group estimate was identical: with respect to the total score, i.e. on scale level, there was no directly observable impact of DIF related to ethnicity and migration background.

Table 3 Analyses of differential item functioning (DIF)

| Item | Language (a): Total (c) | Language (a): Slope parameter (d) | Language (a): Location parameters (e) | Ethnicity and migration background (b): Total (c) | Ethnicity and migration background (b): Slope parameter (d) | Ethnicity and migration background (b): Location parameters (e) |
|---|---|---|---|---|---|---|
| 1. Anhedonia | 10.9* | 6.4* | 4.5 | 2.2 | 0.4 | 1.8 |
| 2. Depressed mood | 4.3 | 0.5 | 3.8 | 4.0 | 0.3 | 3.7 |
| 3. Sleep problems | 8.3 | 0.1 | 8.3* | 5.3 | 0.4 | 4.9 |
| 4. Low energy | 11.2* | 0.0 | 11.2* | 2.8 | 0.1 | 2.7 |
| 5. Appetite changes | 19.7*** | 3.3 | 16.4*** | 14.8** | 0.3 | 14.5** |
| 6. Low self-esteem | 3.6 | 0.3 | 3.3 | 0.2 | 0.0 | 0.2 |
| 7. Concentration difficulties | 5.1 | 0.2 | 4.8 | 18.7*** | 6.8** | 12.0** |
| 8. Psychomotor changes | 4.2 | 0.0 | 4.2 | 1.9 | 0.1 | 1.9 |
| 9. Suicidal ideation | 5.6 | 0.4 | 4.2 | 3.0 | 0.3 | 2.7 |

We report χ² statistics. Significant χ² tests indicate that there is a difference in item functioning. In the original table, results for anchor items are printed in italics and candidate DIF items in bold. χ² values for anchor items are reported from the last iteration of step one, where anchor items have been selected and purified. χ² values for candidate DIF items are those estimated from the second stage of analysis, i.e. where candidate DIF items were tested against the previously identified set of DIF-free anchor items
(a) Analysis 1 comparing T-G (Turkish immigrants completing the German version of the PHQ-9, n = 191) with T-T (Turkish immigrants completing the Turkish version of the PHQ-9, n = 116). (b) Analysis 2 comparing G-G (Germans with no migration background completing the German version of the PHQ-9, n = 1670) with T-G (Turkish immigrants completing the German version of the PHQ-9, n = 191). (c) df = 4. (d) df = 1. (e) df = 3
*p < .05, **p < .01, ***p < .001

Test characteristics and test information

TCCs (Fig. 2, left column) showed that the expected PHQ-9 score is about 6 to 9 points at the mean level of depression in our samples (theta = 0). The PHQ-9 had curvilinear scaling properties in all three subgroups. Consequently, differences between standard scores have different implications depending on the starting score. For example, a reduction in the underlying level of depression of 1.5 standard deviations in G-G was represented by 13.5 points in the PHQ-9 starting from theta = 1.5, and by 7.5 points starting from theta = 0.

Inspecting TICs (Fig. 2, right column), we learned that the PHQ-9 offers good measurement precision (i.e. small standard errors) from about 1 standard deviation below the population mean to about 2.5 standard deviations above. Accordingly, Cronbach’s alpha was .90 for T-T and G-G, and .91 for T-G.

Fig. 2 Test characteristic curves (TCC) and test information curves (TIC) for the PHQ-9 for all three subgroups. TCCs can be found in the left column. The X-axis indicates the estimated level of depression (theta) and the Y-axis indicates the most likely expected PHQ-9 sum score associated with each level of depression. The dotted lines may serve as a guide when estimating differences between TCCs with respect to the most likely expected PHQ-9 sum score corresponding to levels of depression at the group mean (theta 0), 1.5 standard deviations below the group mean, and 1.5 standard deviations above the group mean. TICs can be found in the right column. The X-axis continues to be the estimated level of depression (theta). Here, the solid line plots the amount of measurement precision, i.e. measurement information (left Y-axis), at each depression level. The dotted line plots the standard error of measurement (right Y-axis) associated with each depression level

Discussion

The scope of the present study was to examine whether the Turkish and German versions of the PHQ-9 provide cross-linguistic and cross-cultural validity. The German version is comparable to the English and is equally well validated. We applied IRT analyses to three samples which differed regarding language version and ethnicity.

Comparability of language versions

The PHQ-9 sum score was comparable between German and Turkish language versions. Although there was item level bias, this was not reflected in total scores. This could be due to opposite item level DIF cancelling out, or to the limited effect of item level DIF in the low to average range of the scale where most subjects were located. Consequently, differences between mean scores can be attributed to real differences between subgroups. In our analyses, the T-T sample included a higher proportion of inpatients and severely depressed participants, which is reflected in a meaningful difference between T-G and T-T in the latent depression factor. These differences reflect true differences in depression severity instead of measurement bias.

In line with other studies comparing different language versions of the PHQ-9, we found DIF for the item ‘sleep problems’ [20–22]. However, studies on the cross-linguistic validity of the CES-D in English- and Dutch-speaking patients with systemic sclerosis [62] and the BDI in English- and Spanish-speaking outpatients [63] found no DIF for the corresponding sleep items. This suggests that the bias in the sleep item is rooted in the PHQ-9 item formulation itself rather than in the symptom of sleep problems across cultures. Language-related DIF for the items ‘appetite changes’ and ‘anhedonia’ was also found in other studies [20, 21], and was possibly related to the PHQ-9 response options in our study: ‘more than half the days’ was barely used by Turkish immigrants, especially when completing the Turkish version. One recent study on the Spanish version of the PHQ-9 also reported problems with PHQ-9 response categories [64]; collapsing the response categories ‘more than half the days’ and ‘nearly every day’ and working with a three-point Likert scale improved cross-cultural psychometric characteristics of the PHQ-9 in this study.

Comparability across ethnic groups

Our finding that PHQ-9 sum scores are comparable between Germans without a migration background and Turkish immigrants in Germany without any restrictions concurs with previous studies addressing the utilization of the PHQ-9 in culturally diverse populations [11, 20, 23–25]. Higher PHQ-9 sum scores in the T-G than in the G-G sample might be explained by self-selection processes resulting in more T-G with clinical signs of depression participating in study 2 compared to the mainly representative G-G sample from study 1. In contrast to previous studies [11, 25], we found DIF for the items ‘appetite changes’ and ‘concentration difficulties’. The differences manifested in a lower threshold for T-G to endorse the clinically meaningful response categories ‘more than half the days’ and ‘nearly every day’.

General characteristics

The PHQ-9 items covered a wide range of depression severities, and the PHQ-9 had very good measurement precision around and above the population mean of depression. Our findings regarding these general characteristics of the PHQ-9 concur with previous research demonstrating the high quality of this depression questionnaire [40, 43]. However, differences between means (as used in longitudinal studies or for documenting the course of therapy) should be interpreted with caution due to curvilinear scaling properties. A rapid initial improvement in PHQ-9 sum scores, especially in severely depressed patients, may not correspond to an equally strong improvement in underlying depression.

Strengths and limitations

The strengths of our study are that we applied a state-of-the-art statistical approach, i.e., Item Response Theory, and used relatively large samples including a broad spectrum of depression severities. We evaluated the psychometric characteristics of two PHQ-9 language versions in depth for application in culturally diverse populations. Nonetheless, there are some limitations to our study. Our analyses only included people with a Turkish migration background or no migration background at all. Further differentiations between the influences of migration background and ethnicity (i.e. Turkish immigrants living in Germany vs. Turkish people living in Turkey) are lacking. When interpreting the results, it is important to consider that there is considerable heterogeneity in participant characteristics and participant capabilities in the data, which might affect the analyses. The presented results might be biased due to sociodemographic differences between the samples. Regarding gender, some studies report no or only a minor influence of gender on PHQ-9 scores [65, 66], while others report a significant influence [51]. However, none of these studies investigated Turkish immigrants. We did not adjust for sample differences in age, education, and employment, since these variables are not independent of the groups examined here: the T-G sample was substantially younger than the other groups, as more second- than first-generation Turkish immigrants chose to respond to questionnaires in German. DIF related to age has been reported for items 1, 2, and 4 in a UK sample [65], which might have influenced the results of our analyses. Among Turkish immigrants, the proportion of persons with only basic education or who are unemployed is greater than in the German general population [19]. According to Cameron et al. [65], the PHQ-9 is free of DIF related to education. The proportion of seriously ill persons in the samples might have affected analyses through sampling bias, as the proportion was higher in the Turkish immigrant samples. Last but not least, the sample without a migration background might include data from repatriated Russian Germans, since they are not classified as migrants in official statistics.

Furthermore, as no gold standard measure of depression was included in the original studies, we were unable to compare sensitivity and specificity for each of our samples. The addition of a gold standard would have resulted in a more sophisticated understanding of the implications of our findings for the accuracy of diagnostic recommendations of the PHQ-9. We did not test whether DIF had a consistent impact across levels of depression severity (uniform DIF) or whether the impact of DIF varied by symptom level (nonuniform DIF). Finally, the original studies rely on different settings and study designs, implying that data from different sources might not be fully comparable.

Conclusions

Based on the main findings of the present study, the PHQ-9 total sum score can be recommended as a cross-culturally and cross-linguistically valid screening tool for depression in Germans without a migration background and Turkish immigrants, regardless of whether they complete the Turkish or the German version. These results might be transferable to the comparability with the English version. When interpreting individual scores of Turkish immigrants in clinical practice or in comparative studies, the response categories ‘more than half the days’ and ‘nearly every day’ should both be considered as clinically meaningful responses, as suggested by the categorical algorithm for the diagnosis of depressive disorder according to DSM-IV [13]. According to our results, both response options should be regarded as equally important. Further analysis may evaluate whether both response options are necessary or whether they can be collapsed into one.
BMC Psychology (2018) 6:26 Page 11 of 13 immigrants seemed to be more willing to endorse some Competing interests The authors declare that they have no competing interests. of the PHQ-9 items. Consequently, there might be inter- cultural differences in the perception or expression of depression [8]. External or relational bias [67] with re- Publisher’sNote spect to second variables (e.g. symptom expression) may Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. exist. Any ensuing differences in the predictive validity of the PHQ-9 [60] might be subject of further research. Author details In summary, the PHQ-9 can be highly recommended as Department of Psychology, University of Marburg, Marburg, Germany. Outpatient Unit for Research, Teaching and Practice, Faculty of Psychology, a cross-cultural and cross-linguistic valid depression University of Vienna, Renngasse 6-8, 1010 Vienna, Austria. Institute of screener for the investigated samples. Medical Psychology, Medical School, University of Leipzig, Leipzig, Germany. Clinic and Policlinic for Psychosomatic Medicine and Psychotherapy, Abbreviations 5 University Medical Center Mainz, Mainz, Germany. Institute of Medical Δ : Delta (=difference) in CFI; a: item slope parameter; b: item location CFI Psychology, Justus-Liebig-University, Gießen, Germany. parameter; b : threshold from ‘not at all’ to ‘several days’ in the PHQ-9; b : threshold from ‘several days’ to ‘more than half the days’ in the PHQ-9; Received: 16 May 2018 Accepted: 22 May 2018 b : threshold from ‘more than half the days’ to ‘nearly every day’ in the PHQ- 9; BDI: Beck Depression Inventory; C.I.: Confidence Interval; CES-D: Center for epidemiological studies-depression measure; CFA: Confirmatory factor analysis; CFI: Confirmatory fit index; DFG: German Research Foundation; References DIF: Differential item functioning; DSM-5: Diagnostic and statistical manual of 1. 
Paykel ES, Brugha T, Fryers T. Size and burden of depressive disorders in mental disorders, 5th edition; DSM-IV: Diagnostic and statistical manual of Europe. Eur Neuropsychopharmacol. 2005;15:411–23. https://doi.org/10. mental disorders, 4th edition; e.g.: Latin abbreviation “exempli gratia”, which 1016/j.euroneuro.2005.04.008. means “for example”; G-G: Germans with no migration background 2. Wittchen H-U, Jacobi F. Size and burden of mental disorders in Europe - a completing the German version of the PHQ-9; GRM: Parametric graded- critical review and appraisal of 27 studies. Eur Neuropsychopharmacol. 2005; response model; HIV: Human immunodeficiency virus; i.e.: Latin abbreviation 15:357–76. https://doi.org/10.1016/j.euroneuro.2005.04.012. “id est”, which means “in other words”; ICC: Item characteristic curve; 3. World Health Organization [WHO]. The world health report: Mental Health: IRT: Item response theory; n: Sample size; p: Probability; PHQ-9: Patient health New understanding and hope. 2001. questionnaire-9 (9 designates the depression module); RCC: Response 4. Ayuso-Mateos JL, Vázquez-Barquero JL, Dowrick C, Lehtinen V, Dalgard OS, characteristic curve; RCT: Randomized controlled trial; RMSEA: Root Mean Casey P, et al. Depressive disorders in Europe: prevalence figures from the square error of approximation; SD: Standard deviation; SE: Standard error; ODIN study. Br J Psychiatry. 2001;179:308–16. TCC: Test characteristic curve; T-G: Turkish immigrants completing the 5. de Wit MAS, Tuinebreijer WC, Dekker J, Beekman AJTF, Gorissen WHM, German version of the PHQ-9; TIC: Test information curve; T-T: Turkish Schrier AC, et al. Depressive and anxiety disorders in different ethnic groups: immigrants completing the Turkish version of the PHQ-9; USUMA: Name of a a population based study among native Dutch, and Turkish, Moroccan and demographic consulting company in Berlin, Germany; vs.: versus; X : Chi Surinamese migrants in Amsterdam. 
Soc Psychiatry Psychiatr Epidemiol. Square; z-scale: Standard scale in statistics where the standard deviation is 2008;43:905–12. https://doi.org/10.1007/s00127-008-0382-5. one and the mean is zero; θ: Theta, person parameter 6. González HM, Tarraf W, Whitfield KE, Vega WWA, González HMH. The epidemiology of major depression and ethnicity in the United States. J Acknowledgements Psychiatr Res. 2010;44:1043–51. https://doi.org/10.1016/j.jpsychires.2010.03.017. We would like to thank PD Dr. Heide Gläsmer, Dipl.-Psych. Luisa Bockel, Dipl.- 7. Hasin D, Goodwin R, Stinson F, Grant B. Epidemiology of major depressive Psych. Johanna Laskawi, and Dipl.-Psych. Daniela Zürn for their collaboration disorder. Arch Gen Psychiatry. 2005;62:1097–106. in data collection. We also express our thanks to the cooperating clinic sites: 8. Deisenhammer E a, Coban-Başaran M, Mantar A, Prunnlechner R, Kemmler Vitos Clinic for Psychiatry and Psychotherapy Marburg (Medical Director: Prof. G, Alkın T, et al. Ethnic and migrational impact on the clinical manifestation Dr. Dr. Matthias J. Müller); “Parkland-Klinik” Bad Wildungen; Clinic for of depression. Soc Psychiatry Psychiatr Epidemiol 2011. doi:https://doi.org/ Psychiatry and Psychotherapy at the Hospital of Offenbach; Clinic for 10.1007/s00127-011-0417-1. Psychiatry, Psychotherapy, and Psychosomatics at the Hospital of Frankfurt 9. Zayas LH, Gulbas LE. Are suicide attempts by young Latinas a cultural idiom Höchst; and MEDIAN “Klinik am Südpark” Bad Nauheim. of distress? Transcult Psychiatry. 2012; https://doi.org/10.1177/ Funding 10. Kirmayer LJ, Young A. Culture and somatization: clinical, epidemiological, Study 1 was supported by a grant from the German Research Foundation and ethnographic perspectives. Psychosom Med. 1998;60:420–30. (DFG) to Prof. Rief (Grant RI 574/14–1). 11. Baas KD, Cramer AOJ, Koeter MWJ, van de Lisdonk EH, van Weert HC, Schene AH. 
Measurement invariance with respect to ethnicity of the patient Availability of data and materials health Questionnaire-9 (PHQ-9). J Affect Disord. 2011;129:229–35. https://doi. The datasets used and/or analysed during the current study are available org/10.1016/j.jad.2010.08.026. from the corresponding author on request. 12. Kroenke K, Spitzer RL, Williams JBW. The PHQ-9: validity of a brief depression severity measure. J Gen Intern Med. 2001;16:606–13. Authors’ contributions 13. Spitzer R, Kroenke K, Validation WJ. Utility of a self-report version of PRIME- HR analyzed and interpreted the data, and was a major contributor in MD: the PHQ primary care study. JAMA. 1999;282:1737–44. writing the manuscript. WR and EB were the principal investigators of study 14. Löwe B, Kroenke K, Herzog W, Gräfe K. Measuring depression outcome with 1 and critically revised earlier versions of the manuscript. RM was the a brief self-report instrument: sensitivity to change of the patient health principal investigator of studies 2–4, and was a major contributor in writing questionnaire (PHQ-9). J Affect Disord. 2004;81:61–6. https://doi.org/10.1016/ the manuscript. All authors read and approved the final manuscript. S0165-0327(03)00198-8. 15. Löwe B, Spitzer R, Gräfe K, Kroenke K, Quenter A, Zipfel S, et al. Comparative Ethics approval and consent to participate validity of three screening questionnaires for DSM-IV depressive disorders The institutional review board of the German Psychological Association and physicians’ diagnoses. J Affect Disord. 2004;78:131–40. https://doi.org/ (study 1) and the institutional review board of the Department of 10.1016/S0165-0327(02)00237-9. Psychology, Marburg University, Germany (studies 2 to 4) reviewed and 16. Löwe B, Unützer J, Callahan C, Perkins A, Kroenke K. Monitoring depression approved the study protocols. Participants of all studies provided written treatment outcomes with the patient health questionnaire-9. Med Care. informed consent. 
2004;42:1194–201. Reich et al. BMC Psychology (2018) 6:26 Page 12 of 13 17. American Psychiatric Association [APA]. Diagnostic and statistical manual of 38. Reich H, Bockel L, Mewes R. Motivation for psychotherapy and illness beliefs mental disorders, DSM-5. 2013. in Turkish immigrant inpatients in Germany: results of a cultural comparison 18. Pfizer Inc. Patient health questionnaires (PHQ) screeners, official website. study. J Racial Ethn Heal Disparities. 2015;2:112–23. https://doi.org/10.1007/ 2013. http://www.phqscreeners.com/overview.aspx?Screener=02_PHQ-9. s40615-014-0054-y. 39. Woellert F, Kröhnert S, Sippel L, Klingholz R. Ungenutzte Potenziale: Zur 19. Statistisches Bundesamt. Bevölkerung und Erwerbstätigkeit: Bevölkerung mit Lage der Integration in Deutschland. Berlin-Institut für Bevölkerung und Migrationshintergrund, Ergebnisse des Mikrozensus 2013. Wiesbaden: Entwicklung; 2009. Statistisches Bundesamt; 2014. 40. Kroenke K, Spitzer R, Williams J, Löwe B. The patient health questionnaire 20. Huang FY, Chung H, Kroenke K, Delucchi KL, Spitzer RL. Using the patient somatic, anxiety, and depressive symptom scales: a systematic review. health Questionnaire-9 to measure depression among racially and ethnically Psychiatry Prim Care. 2010;32:345–59. diverse primary care patients. J Gen Intern Med. 2006;21:547–52. https://doi. 41. Löwe B, Spitzer RL, Zipfel S, Herzog W. PRIME MD Patient Health org/10.1111/j.1525-1497.2006.00409.x. Questionnaire (PHQ) — German version Manual and materials. 2nd ed. 21. Arthurs E, Steele RJ, Hudson M, Baron M, Thombs BD. Are scores on English Karlsruhe: Pfizer; 2002. and French versions of the PHQ-9 comparable? An assessment of 42. Bracken BA, Barona A. State of the art procedures for translating, validating differential item functioning. PLoS One. 2012;7:e52028. https://doi.org/10. and using psychoeducational tests in cross-cultural assessment. Sch Psychol 1371/journal.pone.0052028. Int. 1991;12:119–32. 
https://doi.org/10.1177/0143034391121010. 22. Hirsch O, Donner-Banzhoff N, Bachmann V. Measurement equivalence of 43. Martin A, Rief W, Klaiberg A, Braehler E. Validity of the brief patient health four psychological questionnaires in native-born Germans, Russian-speaking questionnaire mood scale (PHQ-9) in the general population. Gen Hosp immigrants, and native-born Russians. J Transcult Nurs. 2013; https://doi.org/ Psychiatry. 2006;28:71–7. https://doi.org/10.1016/j.genhosppsych.2005.07.003. 10.1177/1043659613482003. 44. Löwe B, Gräfe K, Zipfel S, Witte S, Loerch B, Herzog W. Diagnosing ICD-10 23. Hepner KA, Morales LS, Hays RD, Edelen MO, Miranda J. Evaluating depressive episodes: superior criterion validity of the patient health differential item functioning of the PRIME-MD mood module among questionnaire. Psychother Psychosom. 2004;73:386–90. https://doi.org/10. impoverished black and white women in primary care. Women’s Heal 1159/000080393. Issues. 2008;18:53–61. https://doi.org/10.1016/j.whi.2007.10.001. 45. Henkel V, Mergl R, Kohnen R, Allgaier A-K, Möller H-J, Hegerl U. Use of brief 24. Mewes R, Christ O, Rief W, Brähler E, Martin A, Glaesmer H. Are depression depression screening tools in primary care: consideration of heterogeneity and somatisation equivalent for migrants and native Germans? An in performance in different patient groups. Gen Hosp Psychiatry. 2004;26: investigation of measurement invariance for the PHQ-9 and PHQ-15. 190–8. https://doi.org/10.1016/j.genhosppsych.2004.02.003. Diagnostica. 2010;56:230–9. https://doi.org/10.1026/0012-1924/a000026. 46. Çorapçıoğlu A, Özer GU. [Patient Health Questionnaire-9]. Patient Health 25. Crane PK, Gibbons LE, Willig JH, Mugavero MJ, Lawrence ST, Schumacher JE, Questionnaire (PHQ) Screeners. http://www.phqscreeners.com/sites/g/files/ et al. Measuring depression levels in HIV-infected patients as part of routine g10016261/f/201412/PHQ9_Turkish%20for%20Turkey.pdf. 
Accessed 12 Dec clinical care using the nine-item patient health questionnaire (PHQ-9). AIDS Care. 2010;22:874–85. https://doi.org/10.1080/09540120903483034. 47. Yazici Güleç M, Güleç H, Simşek G, Turhan M, Aydin Sünbül E. Psychometric 26. Holland P, Wainer H. Differential item functioning. In: Hillsdale (NJ): properties of the Turkish version of the patient health questionnaire- Lawrence Erlbaum associates; 1993. somatic, anxiety, and depressive symptoms. Compr Psychiatry. 2012;53:623– 27. Camilli G, Shepard L. Methods for identifying biased test items. Thousand 9. https://doi.org/10.1016/j.comppsych.2011.08.002. Oaks: Sage Publications; 1994. 48. Schenk L, Bau AM, Borde T, Butler J, Lampert T, Neuhauser H, et al. 28. Adler M, Hetta J, Isacsson G, Brodin U. An item response theory evaluation Mindestindikatorensatz zur Erfassung des Migrationsstatus [minimum set of of three depression assessment instruments in a clinical sample. BMC Med indicators for measuring the migration status]. Bundesgesundheitsblatt - Res Methodol. 2012;12:84. https://doi.org/10.1186/1471-2288-12-84. Gesundheitsforsch - Gesundheitsschutz. 2006;49:853–60. 29. Reise SP, Waller NG. Item response theory and clinical measurement. Annu 49. Cameron IM, Crawford JR, Lawton K, Reid IC. Psychometric comparison of Rev Clin Psychol. 2009;5:27–48. https://doi.org/10.1146/annurev.clinpsy. PHQ-9 and HADS for measuring depression severity in primary care. Br J 032408.153553. Gen Pract. 2008;58:32–6. https://doi.org/10.3399/bjgp08X263794. 30. Waller NG, Thompson JS, Wenk E. Using IRT to separate measurement bias 50. Dum M, Pickren J, Sobell LC, Sobell MB. Comparing the BDI-II and the PHQ- from true group differences on homogeneous and heterogeneous scales: 9 with outpatient substance abusers. Addict Behav. 2008;33:381–7. https:// an illustration with the MMPI. Psychol Methods. 2000;5:125–46. https://doi. doi.org/10.1016/j.addbeh.2007.09.017. org/10.1037//1082-989X.5.1.125. 51. 
Kocalevent R-D, Hinz A, Brähler E. Standardization of the depression screener 31. Meijer RR, Baneke JJ. Analyzing psychopathology items: a case for patient health questionnaire (PHQ-9) in the general population. Gen Hosp nonparametric item response theory modeling. Psychol Methods. 2004;9: Psychiatry. 2013;35:551–5. https://doi.org/10.1016/j.genhosppsych.2013.04.006. 354–68. https://doi.org/10.1037/1082-989X.9.3.354. 52. Castillo R, Waitzkin H, Ramirez Y, Escobar JI. Somatization in primary care, 32. Eurostat. Migrants in Europe. A statistical portrait of the first and second with a focus on immigrants and refugees. Arch Fam Med. 1995;4:637–46. generation. Luxembourg: European Union; 2011. 53. Kirmayer LJ, Sartorius N. Cultural models and somatic syndromes. 33. Aichberger MC, Schouler-Ocak M, Mundt A, Busch MA, Nickels E, Psychosom Med. 2007;69:832–40. https://doi.org/10.1097/PSY. Heimann HM, et al. Depression in middle-aged and older first 0b013e31815b002c. generation migrants in Europe: results from the survey of health, 54. Mewes R, Rief W. Are somatoform complaints and causal attributions in ageing and retirement in Europe (SHARE). Eur Psychiatry. 2010;25:468– Turkish migrants associated with their cultural background or the migration 75. https://doi.org/10.1016/j.eurpsy.2009.11.009. itself? Zeitschrift für Medizinische Psychol. 2009;18:135–9. https://content. 34. Lindert J, Von EOS, Priebe S, Mielck A, Brähler E. Depression and anxiety in iospress.com/articles/zeitschrift-fur-medizinische-psychologie/zmp18-3-4-07. labor migrants and refugees - a systematic review and meta-analysis. Soc Accessed 23 Feb 2012 Sci Med. 2009;69:246–57. https://doi.org/10.1016/j.socscimed.2009.04.032. 55. Behrens K, Machleidt W, Haltenhof H, Ziegenbein M, Calliess IT. 35. Weidacher A. Schlußfolgerungen und partizipationspolitischer Ausblick Somatization and vulnerability to offence in immigrants with mental [Conclusions and political forecast]. In: Weidacher A, editor. 
In Deutschland disorders - evidence or eminence? Nervenheilkunde. 2008;27:639–43. zu Hause: politische Orientierungen griechischer, italienischer, türkischer 56. Cheung GW, Rensvold RB. Evaluating goodness-of- fit indexes for testing und deutscher junger Erwachsener im Vergleich (DJI-Ausländersurvey) At measurement invariance. Struct Equ Model. 2002;(2):233–55. home in Germany: a comparison of political orientations in Greek, Italian, 57. Muthén L. Muthén B. Mplus User’s guide. Turkish and Germa. Opladen: Leske + Budrich; 2000. p. 265–72. 58. Samejima F. Estimation of latent ability using a response pattern of graded 36. Mewes R, Rief W, Stenzel N, Glaesmer H, Martin A, Brähler E. What is scores. Psychometric Monograph Np. 1969:17. “normal” disability? An investigation of disability in the general population. 59. Samejima F. The graded response model. In: van der Linden WJ, Pain. 2009;142:36–41. https://doi.org/10.1016/j.pain.2008.11.007. Hambleton RK, editors. Handbook of modern item response theory. 37. Mewes R, Asbrock F, Laskawi J. Perceived discrimination and impaired New York: Springer; 1996. mental health in Turkish immigrants and their descendents in Germany. 60. Embretson S, Reise S. Item response theory for psychologists. Hove: Compr Psychiatry. 2015;62:42–50. Psychology Press; 2013. Reich et al. BMC Psychology (2018) 6:26 Page 13 of 13 61. Cai L, Thissen D, du Toit SHC. IRTPRO 2.1 for windows (item response theory for patient-reported outcomes). 2014. 62. Kwakkenbos L, Arthurs E, van den Hoogen FHJ, Hudson M, van Lankveld WGJM, Baron M, et al. Cross-language measurement equivalence of the Center for Epidemiologic Studies Depression (CES-D) scale in systemic sclerosis: a comparison of Canadian and Dutch patients. PLoS One. 2013;8: e53923. https://doi.org/10.1371/journal.pone.0053923. 63. Azocar F, Areán P, Miranda J, Muñoz RF. Differential item functioning in a Spanish translation of the Beck depression inventory. J Clin Psychol. 2001;57: 355–65. 
https://doi.org/10.1002/jclp.1017 64. Zhong Q, Gelaye B, Fann JR, Sanchez SE, M a W. Cross-cultural validity of the Spanish version of PHQ-9 among pregnant Peruvian women: a Rasch item response theory analysis. J Affect Disord. 2014;158:148–53. https://doi. org/10.1016/j.jad.2014.02.012. 65. Cameron IM, Crawford JR, Lawton K, Reid IC. Differential item functioning of the HADS and PHQ-9: an investigation of age, gender and educational background in a clinical UK primary care sample. J Affect Disord. 2013;147: 262–8. https://doi.org/10.1016/j.jad.2012.11.015. 66. Thibodeau MA, Asmundson GJG. The PHQ-9 assesses depression similarly in men and women from the general population. Pers Individ Dif. 2014;56:149–53. 67. Drasgow F. Study of the measurement bias of two standardized psychological tests. J Appl Psychol. 1987;72:19–29.

Journal

BMC Psychology, Springer Journals

Published: Jun 5, 2018
