Scientific Publications

Reliability of online-administered questionnaires: More than a catchword?

Thomas Rodenhausen and Andreas Ohde
from planung & analyse MR/2000

Online administration of standard questionnaires is an ideal basis for conducting market research on an international level, especially if experience and knowledge have been collected in national studies. As is well known, however, merely translating the items does not necessarily yield comparable results. A key quality of a questionnaire is the reliability of its (sub)scales and the contribution of single items to this reliability. Reliability analysis offers the opportunity to optimize the quality of questionnaires and to check for cross-cultural comparability. The present paper describes the procedure of statistical reliability analysis and outlines the consequences for the international adaptation of instruments.

Reliability and validity are central criteria for the quality of standardized questionnaires. Both criteria have to be fulfilled regardless of the specific mode of use, be it as a traditional paper-and-pencil questionnaire or as a web-based tool. However, in many cases the fulfillment of these criteria is claimed without being supported by an adequate statistical analysis.

Analysis and documentation of the relevant statistical parameters become even more pressing in the context of the internationalization of market research. Standard instruments that have proven to yield high-quality results in the past are ideal candidates for international studies. Often a lot of knowledge has been accumulated, for instance in the form of benchmarks, which can significantly enhance the interpretation of results. Yet the mandatory prerequisite for adapting a questionnaire to other languages and cultures is the proven reliability of the original instrument. This paper describes the evaluation and optimization of a web-based tool for the comparison of corporate websites with respect to its statistical reliability. The results of this type of study can be used to assess the quality of the cross-cultural adaptation of an instrument on the level of single items, subscales and the total scale.

Evaluation of corporate websites
The corporate website is an essential link between customer and company. A few years ago an internet presence could lead to a competitive advantage even if its benefits were not apparent to the visitor. Times have changed, and the demands of consumers and the pressure from competitors are increasing. Visitors now expect comprehensive and detailed content as well as technical reliability and functionality, integrated by an appealing graphical design. In this connection it is important to emphasize that stable standards for the aesthetic appearance of websites are difficult to develop in a context which is evolving rapidly. Therefore, it is imperative for quality-conscious website providers to expose their websites to the users' judgement in order to identify strengths and weaknesses.

The study of user acceptance of websites offers an ideal field for panel-based online research, because three distinct advantages of this approach can be brought to bear. First, high external validity can be attained by studying the attitudes of users towards websites in their natural context: users can answer a questionnaire distributed online while interacting with the respective website. Second, by using an online panel it is easy to collect a sample of persons with high interest in the subject. Third, it is possible to control and stratify the sample for relevant characteristics. For these reasons it is not surprising that various online research companies offer this kind of analysis. MediaTransfer AG Netresearch & Consulting has conducted standardized studies for the evaluation of websites since 1996 and has pioneered work in this field. Since 1999 MediaTransfer AG Netresearch & Consulting and the German magazine Horizont have cooperated to publish a monthly study of general interest named Website Trend. Owing to MediaTransfer AG Netresearch & Consulting's experience, benchmarking on the basis of more than 350 studies can be offered to customers. At the same time, this body of data makes it possible to conduct a statistical analysis to determine the reliability based on classical test theory. The results of this analysis allow us to look at the characteristics not only of the whole instrument, but also of the subscales and of the single items.

These results and their future use are the subject of the present paper.

Our Website Trend compares six competing websites by means of a questionnaire made up of 24 ratings. Each rating represents an important aspect of websites. These items are grouped into six subscales which cover essential facets of websites:

  • Concept
  • Visual Realization
  • Usability
  • Content
  • Interactivity
  • Technical Quality

The 24 items are presented to 120 respondents recruited from our Interactive Dynamic Online Panel. Prior to the evaluation of the six websites, the importance of each of the 24 aspects is assessed, because this importance can be expected to vary according to the specific line of business to which a website belongs. Furthermore, this procedure is necessary to calculate a weighted total score, which allows the comparison of different websites even if they do not belong to the same sector. In the second step, each of the six websites is evaluated by means of the same 24 items. The respondents receive the online questionnaire including a link to the particular website and are instructed to use that website for at least three minutes, which is controlled by a timer. The respondents receive no information about the assignment of the items to the subscales. The only difference between the importance check (performed once for each respondent) and the six website-specific ratings is the labelling of the scale points. For importance, 1 means "is very important for me" and 6 means "is absolutely not important for me". For the website-specific ratings, 1 means "applies absolutely" and 6 means "applies not at all".

After the respondents have answered the 24 items, they are given the opportunity to specify likes and dislikes in an open format. The survey closes with a final question about the intent to revisit the particular website. In total, then, the 24-item questionnaire has to be answered seven times: once at the beginning for the importance check and six times for the website evaluation. The items are given in random order to avoid sequence effects. On average, the survey takes 45 minutes, of which at least 18 minutes are dedicated to the interaction with the websites. Obviously the respondents need some stamina to answer the questionnaire. On the other hand, the item list cannot be reduced any further because three of the six subscales include only four items, two of them only three. A further reduction would jeopardize the reliability of the subscale results and therefore call the validity of the comparison into question. All in all, relatively low drop-out rates of about 15 percent and the comprehensiveness of the open answers given at the end of the evaluation suggest a high quality of the data. More thorough insights will be provided by a statistical reliability analysis.

Reliability study
The primary focus of the following reliability study is on the six unweighted subscale scores (e.g. Concept, Visual Realization…) and the total score, which are obtained for each of the six websites. The weights will not be discussed further here because they have no impact on the reliability of the scores.

First of all, five studies were drawn at random from the pool of completed Website Trend studies. Every study comprises six websites. Therefore, 30 reliability values (Cronbach's alpha; 5 studies × 6 websites) can be calculated for each subscale score and for the total score. After averaging (via Fisher's Z-transformation) the values in diagram 1 are obtained.
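
The two computations named above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the function names and the tiny example data are hypothetical, and Cronbach's alpha is computed in its usual variance form, alpha = k/(k-1) * (1 - sum of item variances / variance of the sum score).

```python
import math
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for one (sub)scale.

    rows: one list of k item ratings per respondent (hypothetical layout).
    """
    k = len(rows[0])
    # Sample variance of each single item across respondents.
    item_vars = [variance([r[i] for r in rows]) for i in range(k)]
    # Sample variance of the unweighted sum score.
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def fisher_z_mean(coefficients):
    """Average coefficients via Fisher's Z: transform, average, back-transform.

    Values of exactly 1 or -1 cannot be transformed and raise an error.
    """
    zs = [math.atanh(c) for c in coefficients]
    return math.tanh(sum(zs) / len(zs))


# Hypothetical ratings of five respondents on a three-item subscale (1..6).
example_rows = [[1, 2, 1], [2, 2, 3], [3, 4, 3], [5, 4, 5], [6, 5, 6]]
example_alpha = cronbach_alpha(example_rows)

# Hypothetical alphas of one subscale across several websites.
example_mean = fisher_z_mean([0.5, 0.9])
```

Averaging on the Z scale gives more weight to high coefficients than a plain arithmetic mean, which is why the averaged value of .5 and .9 lies above .7.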

According to Lord and Novick (1968), reliabilities of at least rtt = .7 are necessary for comparisons on group level. Whereas the subscale Interactivity only just reaches this threshold, the remaining values are far above it. Overall, it can be concluded that the subscales are reliable indicators of the concepts to be measured. This is all the more true because some of the scales consist of only three items. As a matter of fact, the statistical reliability of a score increases with the number of homogeneous items of which it is composed. Thus it is much easier to obtain a high reliability with a long scale than with a short one. Three items represent the minimum number, which cannot be reduced any further. On the other hand, the respondents' willingness to cooperate is limited, and therefore the objective is to find the optimal compromise between the length and the reliability of the scales. In this sense, the comparison between the scales Interactivity and Technical Quality (see diagram 1) is revealing. Although both scales consist of only three items, Technical Quality reaches a much higher reliability. This finding is supported by the analysis of the unaggregated reliability parameters for the single observations (i.e. one out of six websites in one of the five studies): whereas the reliabilities for Interactivity are rtt = .6 or below in four cases (see diagram 2), no such low values are observed for Technical Quality. For Content, a four-item scale, low values can still be found in two cases.

Thus it can be concluded that the subscales Content and Interactivity leave room for improvement.
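
The relation between scale length and reliability described above is commonly expressed by the Spearman-Brown formula of classical test theory. The paper does not name this formula, so the following is only an illustrative sketch; it assumes that added items are of the same quality as the existing ones (parallel items), and the function name is hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability after changing the scale length.

    reliability:   reliability of the current scale
    length_factor: new number of items divided by the current number
    Assumes parallel items, i.e. new items behave like the existing ones.
    """
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


# A three-item scale with rtt = .7, extended to four comparable items.
predicted = spearman_brown(0.7, 4 / 3)
```

Under these assumptions, lengthening a three-item scale with rtt = .7 by one comparable item would be expected to yield a reliability of roughly .76.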

Optimization of the subscales
The optimization aims at constructing homogeneous subscales, i.e. each item of a subscale has to be closely connected to the content of the subscale to which it is assigned. The adequate statistical parameter is the corrected item-subscale correlation. Corrected means that before the item-subscale correlation is computed, the influence of this item is partialed out of the subscale score. The evaluation of the item-subscale correlations of the scale Content (see diagram 3) reveals that item 18 has a particularly low value. The same is true for item 19 and the subscale Interactivity (see diagram 4).
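
A common way to compute the corrected item-subscale correlation is to correlate each item with the sum of the remaining items of its subscale. The sketch below assumes this variant; the function names and the example data are hypothetical and not taken from the study.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def corrected_item_total(rows):
    """Correlate every item with the subscale score computed without it.

    rows: one list of k item ratings per respondent (hypothetical layout).
    """
    k = len(rows[0])
    correlations = []
    for i in range(k):
        item = [r[i] for r in rows]
        rest = [sum(r) - r[i] for r in rows]  # subscale score minus this item
        correlations.append(pearson(item, rest))
    return correlations


# Hypothetical four-item subscale; the last item does not fit the others.
example_rows = [[1, 2, 1, 4], [2, 2, 3, 2], [3, 4, 3, 5], [5, 4, 5, 1], [6, 5, 6, 3]]
example_correlations = corrected_item_total(example_rows)
```

In this made-up example the last item shows by far the lowest value, which is the pattern that flags an item for closer inspection.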

Inspection of the content of item 18 supports the statistical diagnosis. The wording of item 18 is: "Website X is entertaining". In contrast, the other three items refer to the quality and coverage of the content of the website. It should be pointed out that for many customers it is important whether a website is entertaining or not. In terms of content homogeneity, however, this item fits the rest of the subscale only partially, which lowers the reliability of the subscale. Therefore item 18 has been rephrased to "Website X informs, without being boring". Thus the entertainment aspect, to which the original item referred exclusively, is now tied more closely to the quality of the content.

Item 19 represents a similar case (see diagram 4): "Website X offers good opportunity to establish contact to the offerer of the website". It is plausible that this aspect does not play a major role for all websites. Therefore item 19 was modified to: "Website X offers good opportunity to establish contact to the offerer of the website, if required". Furthermore, one item was added to the subscale. The new item 22 is worded "Website X gives me the impression, that I can find a contact person to answer my questions". The subscale Technical Quality, which previously comprised only three items, was also strengthened by one additional item. Item 25 is worded "Website X looks as though it were maintained by professionals." To keep the total scale short, one item of the subscale Concept was removed. Based on the data of the previous studies, removal of this item reduces the reliability of the subscale Concept from .868 to .864, a decrease which is irrelevant in practice. However, it should be pointed out that this is a prognosis. Like the quality of the modified subscales, its validity has to be cross-validated with a fresh body of data.
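
A prognosis of this kind can be obtained by recomputing Cronbach's alpha on existing data with one item left out ("alpha if item deleted"). The following sketch illustrates the idea under stated assumptions; the function names and the example data are hypothetical, and the paper does not say how its prognosis was actually computed.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha; rows holds one list of item ratings per respondent."""
    k = len(rows[0])
    item_vars = [variance([r[i] for r in rows]) for i in range(k)]
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def alpha_if_item_deleted(rows):
    """Alpha of the shortened scale for each item removed in turn."""
    k = len(rows[0])
    return [
        cronbach_alpha([r[:i] + r[i + 1:] for r in rows])
        for i in range(k)
    ]


# Hypothetical four-item subscale; the last item does not fit the others.
example_rows = [[1, 2, 1, 4], [2, 2, 3, 2], [3, 4, 3, 5], [5, 4, 5, 1], [6, 5, 6, 3]]
full_alpha = cronbach_alpha(example_rows)
shortened_alphas = alpha_if_item_deleted(example_rows)
```

If removing an item leaves alpha almost unchanged, as reported above for the shortened subscale, the item can be dropped at little cost; if alpha rises clearly, as for the last item of this made-up example, the item weakens the scale.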

Cross-validation of the optimized scale
A standard website study was conducted with the new questionnaire. Results are based on an online sample of N=109 panelists, drawn according to the same quotas as the previous samples. As can be seen in diagram 5, the reliabilities achieved with the unmodified subscales Visual Realization and Usability are virtually identical to those of the previous studies.

The same applies to the subscale Concept, which has been shortened by one item. For the subscales Content, Interactivity and Technical Quality, reliability increases substantially, on the order of .1. The subscale Interactivity now meets all demands that can be placed on a scale of such brevity. Finally, the reliability of the total scale score has been raised to .97, thereby enabling the discrimination of even the finest differences between websites.

Conclusion
Reliability analysis grounded in classical test theory provides insight into the contribution of every single item to the reliability of the whole questionnaire. Besides the corrected item-subscale correlation, other statistical parameters (mean, standard deviation et cetera) have to be taken into account. Major discrepancies in these parameters between translated and original items point to cultural differences in the interpretation of the translated items. Thus, if international comparability is intended, cross-culturally unstable items can be modified to obtain consistent results.
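
The comparison of item parameters between an original and a translated version could be sketched as follows. This is only an illustration: the data layout, the function name and in particular the threshold of 0.5 scale points are hypothetical assumptions, since the paper does not specify what counts as a major discrepancy.

```python
from statistics import mean, stdev


def item_discrepancies(original, translated, max_mean_gap=0.5):
    """Flag items whose mean rating differs strongly between two versions.

    original, translated: dict mapping an item label to its list of ratings.
    max_mean_gap: hypothetical threshold in scale points, not from the paper.
    Returns (label, mean gap, standard deviation gap) for each flagged item.
    """
    flagged = []
    for label, ratings in original.items():
        other = translated[label]
        mean_gap = abs(mean(ratings) - mean(other))
        sd_gap = abs(stdev(ratings) - stdev(other))
        if mean_gap > max_mean_gap:
            flagged.append((label, round(mean_gap, 2), round(sd_gap, 2)))
    return flagged


# Hypothetical ratings for two items in the original and a translated version.
original_version = {"item 18": [2, 3, 2, 3], "item 19": [2, 2, 3, 3]}
translated_version = {"item 18": [2, 3, 3, 3], "item 19": [4, 5, 4, 5]}
flagged_items = item_discrepancies(original_version, translated_version)
```

Flagged items are candidates for a closer look at the translation; the corrected item-subscale correlations of both versions would have to be compared as well.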

Dr. Thomas Rodenhausen, a graduate psychologist, studied psychology at Technische Universität Berlin, focusing on psychological methods and educational psychology. At Freie Universität Berlin he earned his PhD with a dissertation on computer-based assessment. In April 2000 he joined the team of MediaTransfer AG Netresearch & Consulting in Hamburg as head of data services and tool development.

Andreas Ohde, a business school graduate, studied marketing and management in Berlin. After his studies he worked as a project manager at the Institut für Kommunikationsforschung Dr. von Keitz. Since 1999 he has been a project manager for quantitative research at MediaTransfer AG Netresearch & Consulting in Hamburg. Apart from classical offline topics he also works on aspects of commercial online strategy.

References
Lord, F.; Novick, M.: Statistical theories of mental test scores. Reading, MA, 1968.

For further information, please contact:
press@mediatransfer.de

 
