METHODOLOGICAL PAPER

Guidelines for choosing between multi-item and single-item scales for construct measurement: a predictive validity perspective

Adamantios Diamantopoulos, Marko Sarstedt, Christoph Fuchs, Petra Wilczynski, Sebastian Kaiser

Received: 1 June 2011 / Accepted: 27 December 2011 / Published online: 14 February 2012
© The Author(s) 2012. This article is published with open access at Springerlink.com

J. of the Acad. Mark. Sci. (2012) 40:434-449. DOI 10.1007/s11747-011-0300-3

Abstract  Establishing the predictive validity of measures is a major concern in marketing research. This paper investigates the conditions favoring the use of single items versus multi-item scales in terms of predictive validity. A series of complementary studies reveals that the predictive validity of single items varies considerably across different (concrete) constructs and stimulus objects. In an attempt to explain the observed instability, a comprehensive simulation study is conducted, aimed at identifying the influence of different factors on the predictive validity of single- versus multi-item measures. These include the average inter-item correlations in the predictor and criterion constructs, the number of items measuring these constructs, as well as the correlation patterns of multiple and single items between the predictor and criterion constructs. The simulation results show that, under most conditions typically encountered in practical applications, multi-item scales clearly outperform single items in terms of predictive validity. Only under very specific conditions do single items perform equally well as multi-item scales. Therefore, the use of single-item measures in empirical research should be approached with caution, and the use of such measures should be limited to special circumstances.

Keywords  Single items . Multi-item scales . Predictive validity . Measurement theory

Acknowledgments  The authors thank Edward E. Rigdon (Georgia State University), Udo Wagner (University of Vienna) and the anonymous reviewers for their helpful comments on previous versions of this paper.

A. Diamantopoulos, Department of Business Studies, University of Vienna, Bruenner Strasse 72, 1210 Vienna, Austria. e-mail: adamantios.diamantopoulos@univie.ac.at

M. Sarstedt (*), Institute for Market-based Management, Ludwig-Maximilians-University Munich, Kaulbachstrasse 45, 80539 Munich, Germany. e-mail: sarstedt@bwl.lmu.de

M. Sarstedt, Faculty of Business and Law, University of Newcastle, Newcastle, Australia

C. Fuchs, Rotterdam School of Management, Erasmus University, Burgemeester Oudlaan 50, 3062 PA Rotterdam, The Netherlands. e-mail: cfuchs@rsm.nl

P. Wilczynski, Institute for Market-based Management, Ludwig-Maximilians-University Munich, Kaulbachstrasse 45, 80539 Munich, Germany. e-mail: wilczynski@bwl.lmu.de

S. Kaiser, RSU Rating, Karlstrasse 35, 80333 Munich, Germany. e-mail: sebastian.kaiser@rsu-rating.de

Introduction

More than 30 years ago, in a widely cited Journal of Marketing article, Jacoby (1978, p. 93) alerted researchers to the "Folly of Single Indicants," arguing that "given the complexity of our subject matter, what makes us think that we can use responses to single items [...] as measures of these concepts, then relate these scores to a host of other variables, arrive at conclusions based on such an investigation, and get away calling what we have done 'quality research'?"
Marketing academia was quick to respond to Jacoby's (1978) criticism, as evidenced in a series of highly influential papers seeking to provide guidance to researchers in their measure development efforts (e.g., Churchill 1979; Churchill and Peter 1984; Peter 1979). The adoption of structural equation modeling techniques further encouraged the systematic psychometric assessment of multi-item (MI) scales in terms of dimensionality, reliability, and validity (e.g., Anderson and Gerbing 1982; Steenkamp and van Trijp 1991). Nowadays, the use of MI scales is standard practice in academic marketing research, as reflected both in relevant scale development monographs (e.g., Netemeyer et al. 2003; Viswanathan 2005) and in numerous handbooks containing compilations of marketing measures (e.g., Bearden et al. 2011; Bruner et al. 2005).

Recently, however, Bergkvist and Rossiter (2007, p. 183) challenged this conventional wisdom on both theoretical and empirical grounds and concluded that "theoretical tests and empirical findings would be unchanged if good single-item measures were substituted in place of commonly used multiple-item measures." Their theoretical challenge was based on the C-OAR-SE procedure (Rossiter 2002, 2011), according to which, if the object of the construct (e.g., a brand or an ad) can be conceptualized as concrete and singular and if the attribute of the construct (e.g., an attitude or a perception) can be designated as concrete, there is no need to use an MI scale to operationalize the construct. Furthermore, Bergkvist and Rossiter (2007, 2009) reported empirical findings indicating that single-item (SI) measures demonstrated predictive validity as high as that of MI scales. The authors' final conclusion was that "carefully crafted single-item measures — of doubly concrete constructs — are at least as valid as multi-item measures of the same constructs, and that the use of multiple items to measure them is unnecessary" (Bergkvist and Rossiter 2009, p. 618).

In light of Bergkvist and Rossiter's (2007, 2009) findings, researchers may be tempted to adopt SI measures, not least because the latter have numerous practical advantages such as parsimony and ease of administration (e.g., Drolet and Morrison 2001; Fuchs and Diamantopoulos 2009; Wanous et al. 1997). Given recent concerns regarding "over-surveying," decreasing response rates, and the high costs of surveying additional items (Rogelberg and Stanton 2007), the adoption of SI measures is clearly tempting. However, caution needs to be exercised before established MI scales are abandoned in favor of single items, for at least three reasons.

First, research in other fields shows that SI measures do not always perform as well as MI scales of the same construct. For example, in a study by Kwon and Trail (2005), sometimes the MI scale outperformed the SI measure, sometimes there was no difference between them, and sometimes the SI measure was a better predictor than the MI scale. Overall, the results varied both across constructs and depending upon the specific criterion variable under consideration (see also Gardner et al. 1989; Loo 2002).

Second, the response pattern of an item measuring a specific construct frequently carries over to the subsequent item measuring (the same or) another construct due to respondents' state dependence (De Jong et al. 2010). If the subsequent item is the only item measuring another construct (i.e., an SI measure), such carry-over effects might considerably affect the measure's (predictive) validity.
The use of multiple items, in contrast, may compensate for such effects.

Third, prior studies (Bergkvist and Rossiter 2007, 2009) have used Fisher's z-transformation test to compare correlation coefficients and R²-values when contrasting the predictive validity of SI versus MI measures. However, this test is only appropriate when correlations from two independent (as opposed to paired) samples are to be compared (e.g., Steiger 1980); for related correlation coefficients, Ferguson's (1971) or Meng et al.'s (1992) procedures should be employed.

Given the practical advantages of SI measures, evidence legitimating their use is clearly welcome. At the same time, evidence to the contrary cannot be ignored either. Against this background, the present study seeks to investigate under which conditions SI measures are likely to have predictive ability comparable to MI scales. We first replicate Bergkvist and Rossiter's (2007, 2009) analyses by comparing the predictive validity of SI and MI measures of attitude toward the ad (A_Ad) and attitude toward the brand (A_Brand). We then undertake a similar analysis using different (concrete) constructs to ascertain the robustness of our findings in different settings. We find evidence indicating that SI measures can have predictive ability similar to MI scales; however, we also observe that the latter significantly outperform the former in most of our empirical settings. Thus, whereas a particular SI may yield good results in one setting (e.g., in one product category), the same item's predictive validity may be disappointing in another.

To shed light on the observed instability, we subsequently conduct a simulation study to identify the influence of different design characteristics (e.g., the average inter-item correlation among the items of the predictor and criterion constructs, the number of items used to measure these constructs) on the predictive validity of SI versus MI measures. By systematically varying different combinations of these characteristics, we offer insights into the relative performance of SI and MI scales under different conditions. Based on our findings, we then provide marketing researchers with an empirically based guideline for the use of SI and MI scales in practical applications.

Theoretical background

According to conventional measurement theory, the (reflective) items comprising an MI measure of a focal construct represent a random selection from the hypothetical domain of all possible indicators of the construct (Nunnally and Bernstein 1994). Using multiple items helps to average out errors and specificities that are inherent in single items, thus leading to increased reliability and construct validity (DeVellis 2003). In this context, "in valid measures, items should have a common core (which increases reliability) but should also contribute some unique variance which is not tapped by other items" (Churchill and Peter 1984, p. 367). In practice, however, scale developers often place undue emphasis on attaining high reliability, resulting in semantically redundant items that adversely affect the measure's validity (Drolet and Morrison 2001). It is against this background that proponents of SI measures argue that "when an attribute is judged to be concrete, there is no need to use more than a single item [...] to measure it in the scale" (Rossiter 2002, p. 313).

Although the above recommendation is undoubtedly appealing from a pragmatic point of view, it is not without problems from a conceptual perspective.
Formally, given a single measure x₁ and an underlying latent variable η (representing the focal construct), the relevant measurement model is described by the following equation, where λ₁ is the loading of x₁ on η and ε₁ is measurement error, with COV(η, ε₁) = 0 and E(ε₁) = 0:

x₁ = λ₁η + ε₁    (1)

There are two possible ways of interpreting x₁ in Eq. 1, namely that (1) x₁ is somehow unique (i.e., no other item could possibly measure η) or (2) that x₁ is representative (i.e., it is interchangeable with other measures of η). The first interpretation is highly problematic because "an observable measure never fully exhausts everything that is meant by a construct" (Peter 1981, p. 134). Indeed, if x₁ were to be seen as the measure of η, "a concept becomes its measure and has no meaning beyond that measure" (Bagozzi 1982, p. 15). The second interpretation (x₁ as a representative measure of η) is more consistent with the domain sampling model but raises the question of how the item should be chosen. As Diamantopoulos (2005, p. 4) observes, "if ... a single 'good' item is to be chosen from a set of potential candidates (which implies that other items could, in principle, have been used instead), the question becomes how to choose the 'best' (or at least, a 'good') item."

One option is to choose a priori one item from a set of indicators based on face validity considerations (e.g., Bergkvist and Rossiter 2007). However, given that all items in an MI scale should conform to the domain sampling model (DeVellis 2003; Nunnally and Bernstein 1994), there is no reason why any one item should be conceptually superior to the others; assuming unidimensionality, scale items are essentially interchangeable from a content validity point of view (Bollen and Lennox 1991).

Another option is to ask a panel of experts to select the item that "best captures" or "most closely represents" the focal construct (e.g., Rossiter 2002). This approach has the advantage that it is based on empirical agreement among expert judges rather than solely on the preferences of the researchers. However, the conceptual issue as to why the chosen item is better than the rest of the items is still not addressed. Also, there is evidence showing that experts are not infallible (Chi et al. 1988).

A third option is to base item choice on statistical criteria such as an indicator's communality (e.g., Loo 2002) or the reliability of the indicator (e.g., Wanous et al. 1997). While this approach explicitly considers the psychometric properties of the various scale items, it is also subject to sampling variability; for example, the item displaying the highest communality in one sample may not do so in another sample. Thus, identifying a suitable SI prior to statistical analysis is hardly feasible.

A fourth option is to choose an item at random. Random choice would appear to be most conceptually consistent with the domain sampling model. However, according to congeneric measurement (Jöreskog 1971), items may differ from one another both in terms of how strongly they relate to the underlying construct and in terms of their susceptibility to measurement error (Darden et al. 1984); thus random choice may or may not pick the "best" item (i.e., the item with the strongest loading or the smallest error variance).

A final option is to look outside the MI scale and generate a tailor-made SI measure (e.g., Bergkvist and Rossiter 2009). However, given the plethora of MI scales available for most marketing constructs, it is unclear what additional benefit would be gained by generating extra (i.e., "stand-alone") SI measures. Moreover, there are no established procedures for the construction of SI measures in marketing.
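To make Eq. 1 and the congeneric measurement argument concrete, the sketch below simulates a four-item congeneric measure whose items differ in loadings and error variances, and compares how well each single item, versus the averaged scale score, recovers the latent variable η. This is an illustrative simulation, not part of the original studies; the loadings, error variances, and sample size are assumed values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000  # assumed sample size for the illustration

# Latent construct eta and a congeneric 4-item measure:
# x_i = lambda_i * eta + eps_i, with COV(eta, eps_i) = 0 (cf. Eq. 1).
eta = rng.standard_normal(n)
loadings = np.array([0.9, 0.8, 0.7, 0.6])  # assumed: items relate unequally to eta
err_sd = np.array([0.4, 0.5, 0.6, 0.7])    # assumed: items differ in error variance
items = loadings * eta[:, None] + err_sd * rng.standard_normal((n, 4))

# Scale score: averaging the items lets item-specific errors cancel out,
# which is the domain sampling rationale for multi-item measurement.
scale_score = items.mean(axis=1)

for i in range(4):
    r = np.corrcoef(items[:, i], eta)[0, 1]
    print(f"single item {i + 1}: corr with eta = {r:.3f}")
print(f"4-item scale:  corr with eta = {np.corrcoef(scale_score, eta)[0, 1]:.3f}")
```

Under these assumed parameters, the averaged scale correlates more strongly with η than any single item, and a random pick can land on the weakest item; this mirrors the point above that random choice may or may not select the "best" item.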
In what follows, we contrast the predictive ability of MI scales against that of each individual item comprising the scales. Evidence in favor of using an SI would be provided if (1) at least one item displays predictive validity comparable to the entire scale, (2) the item(s) concerned does so across different samples, and (3) the item(s) concerned does so across different stimuli (e.g., brands or ads). The stability implied by (2) and (3) is essential because, if SI performance is highly variable across settings, it becomes extremely difficult to select an SI ex ante as a measure of the focal construct in a planned study. Clearly, from a practical perspective, unless one can select a "good" item before the study is executed, the benefits of SI measures (e.g., parsimony, flexibility, less monotony, ease of administration) will not be reaped.

Study 1: replication of Bergkvist and Rossiter (2007, 2009)

Study 1 uses the same design, focal constructs, and measures as Bergkvist and Rossiter (2007, 2009). Specifically, we compare the predictive validity of SI versus MI measures of attitude toward the ad (A_Ad), brand attitude (A_Brand), and purchase intention (PI_Brand), measured on 7-point semantic differential scales. We drew our data from a survey of 520 university students (age: M = 22 years, 68% female) who were randomly exposed to two of four real advertisements of products in four different product categories: insurance, jeans, pain relievers, and coffee (Bergkvist and Rossiter 2007, 2009). The ads were taken from foreign countries to ensure that respondents knew neither the brands nor the ads.

We first confirmed the unidimensionality of the three MI scales using factor analysis and computed their internal consistencies, which were satisfactory (minimum α values of .87, .88, and .88 for A_Ad, A_Brand, and PI_Brand, respectively). We then computed the correlation (r) between the MI measures of A_Ad (predictor) and A_Brand (criterion) as well as between A_Brand (predictor) and PI_Brand (criterion). Next, we computed correlations between each individual item capturing A_Ad and the full A_Brand scale and compared the resulting correlation coefficient with that obtained in the previous step using Meng et al.'s (1992) test. We did the same for the items capturing A_Brand and the full PI_Brand scale. In line with measurement theorists (Bergkvist and Rossiter 2007, 2009; Carmines and Zeller 1979), we assume that the higher the correlations, the closer they are to the true correlations (in the population). We also undertook a bootstrap analysis (Cooil et al. 1987; Efron 1979, 1981) to evaluate the relative performance of SI and MI measures in slightly changed data constellations. Table 1 summarizes the results.
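For readers who wish to apply the comparison procedure just described, the following sketch implements Meng et al.'s (1992) z-test for two dependent correlations that share one variable (here, the criterion construct correlated once with a single item and once with the full predictor scale). The function name and the illustrative numbers are our own, and the formula is reproduced from the published test as commonly stated; it should be verified against Meng et al. (1992) before use.

```python
import numpy as np
from scipy.stats import norm

def meng_z_test(r_y_x1: float, r_y_x2: float, r_x1_x2: float, n: int):
    """Meng, Rosenthal & Rubin (1992) test for two correlated correlations
    r(y, x1) and r(y, x2) that share the variable y. Returns (z, two-sided p)."""
    z1, z2 = np.arctanh(r_y_x1), np.arctanh(r_y_x2)  # Fisher's z transform
    r2_bar = (r_y_x1**2 + r_y_x2**2) / 2.0
    f = min((1.0 - r_x1_x2) / (2.0 * (1.0 - r2_bar)), 1.0)  # f is capped at 1
    h = (1.0 - f * r2_bar) / (1.0 - r2_bar)
    z = (z1 - z2) * np.sqrt((n - 3) / (2.0 * (1.0 - r_x1_x2) * h))
    return z, 2.0 * norm.sf(abs(z))

# Hypothetical numbers for illustration only (not the paper's data):
# r(item, criterion) = .55, r(MI scale, criterion) = .65,
# r(item, MI scale) = .80, n = 260 respondents.
z, p = meng_z_test(0.55, 0.65, 0.80, 260)
print(f"z = {z:.2f}, p = {p:.4f}")
```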
The results relating to A_Ad (Table 1A-D) show that in three out of the four product categories, all individual items have significantly lower predictive validity than the full scale. Only for pain relievers is there a single instance (like/dislike) in which an SI achieves comparable performance. These findings are further supported by the bootstrapping results, which show that, in the vast majority of sample runs, the MI scale outperforms the individual items.

A similar picture emerges for the relationship between A_Brand and PI_Brand (Table 1E-H). For example, good/bad displays predictive validity comparable to the MI scale for pain relievers and coffee, but not for insurance and jeans. Similarly, pleasant/unpleasant performs equally well as the MI scale for pain relievers but not for any other product category; the other single items are consistently outperformed by their MI counterparts.

Our replication of Bergkvist and Rossiter (2007, 2009) thus reveals considerable variability in the performance of single items. Whereas, depending on the product category, some items have predictive validity similar to the MI scale, others consistently lag behind, suggesting that the relative performance of SI measures is context- and construct-specific.¹ We further examine this issue using different constructs, different stimuli (brands), and non-students as respondents in Studies 2 and 3 below.

Study 2

Our second empirical study is based on a consumer sample and uses the hedonic (HED) and utilitarian (UT) dimensions of consumer attitudes toward products (Batra and Ahtola 1991) as focal constructs. Conceptually, the hedonic dimension measures the experiential enjoyment of a product, while the utilitarian dimension captures its practical functionality (Batra and Ahtola 1991; Okada 2005; Voss et al. 2003). We selected these constructs because, under Rossiter's (2002) terminology, each dimension can be considered a doubly concrete construct in that the object and the attribute of the construct "are easily and uniformly imagined" (Bergkvist and Rossiter 2007, p. 176); consumers are likely to easily understand the meaning of the items measuring the two constructs (e.g., enjoyable, useful), as a set of expert raters also confirmed. Previous applications of the HED and UT scales have produced alphas above .80 (Voss et al. 2003), and some studies have even substituted the dimensions with single items (Okada 2005).

We used Voss et al.'s (2003) scales to capture the two dimensions (see Table 2) and a four-item measure of brand liking (good/bad, like/dislike, positive/negative, unfavorable/favorable) drawn from Holbrook and Batra (1987) as the criterion (7-point scales were applied throughout). One hundred consumers (age: M = 31 years; 52% female) were exposed to print ads of a car brand and asked to complete the HED and UT scales, as well as the brand liking scale. Factor analysis confirmed the unidimensionality of the three MI scales, and their internal consistencies were highly satisfactory (α_HED = .93, α_UT = .89, and α_BLiking = .94). We followed the same procedure as in Study 1 to compare the predictive validity of SI and MI measures of HED and UT, using brand liking as the criterion construct. The statistical power of our analysis was close to 1 (Cohen 1988), thus supporting the adequacy of the sample size. Table 2 summarizes the results.

¹ We also replicated Study 1 on a separate sample of 108 students from a major US university and found consistent results. The detailed results of this study are available from the authors upon request.
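Two computational steps recur in Studies 1 and 2: computing coefficient alpha for the MI scales and running the bootstrap comparison of SI versus MI predictive validity. The sketch below shows one plausible way to implement both, assuming a respondents-by-items data matrix; the data layout, the resampling count, and the simulated demo data are our assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for an (n_respondents, k_items) matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1.0 - item_vars / total_var)

def bootstrap_si_vs_mi(pred_items, criterion, item_idx, n_boot=2000, seed=1):
    """Share of bootstrap samples in which the full predictor scale correlates
    more strongly with the criterion than the single item at item_idx."""
    rng = np.random.default_rng(seed)
    n = pred_items.shape[0]
    wins = 0
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)        # resample respondents with replacement
        mi = pred_items[idx].mean(axis=1)  # MI scale score
        si = pred_items[idx, item_idx]     # single item
        crit = criterion[idx]
        wins += np.corrcoef(mi, crit)[0, 1] > np.corrcoef(si, crit)[0, 1]
    return wins / n_boot

# Hypothetical usage with simulated data (500 respondents, 4 predictor items):
rng = np.random.default_rng(0)
eta = rng.standard_normal(500)
X = 0.8 * eta[:, None] + 0.5 * rng.standard_normal((500, 4))
y = 0.7 * eta + 0.6 * rng.standard_normal(500)
print(f"alpha = {cronbach_alpha(X):.2f}")
print(f"MI scale beats item 1 in {bootstrap_si_vs_mi(X, y, 0):.0%} of runs")
```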