Reliability of online-administered questionnaires: More
than a catchword?
Thomas Rodenhausen and Andreas Ohde
from planung & analyse MR/2000
Online administration of standard questionnaires is an ideal basis for conducting
market research at an international level, especially if experience and
knowledge have been collected in national studies. As is well known, however,
merely translating the items does not necessarily yield comparable results. A
key quality of a questionnaire is the reliability of its (sub)scales and the
contribution of single items to this reliability. Analysis of reliability offers
the opportunity to optimize the quality of questionnaires and to check for
cross-cultural comparability. In the present paper the procedure of statistical
reliability analysis is described and consequences for international adaptation
of instruments are outlined.
Reliability and validity are central criteria for the quality of standardized
questionnaires. Both criteria have to be fulfilled regardless of the specific
mode of use, be it as a traditional paper-and-pencil questionnaire or as a
web-based tool. However, in many cases the fulfillment of these criteria is
claimed without being supported by an adequate statistical analysis.
Analysis and documentation of the relevant statistical parameters become even
more pressing in the context of internationalization of market research.
Standard instruments that have proven to yield high-quality results in the
past are especially suitable candidates for international studies. Often, a lot
of knowledge has been accumulated, for instance in the form of benchmarks which
can significantly enhance the interpretation of results. Yet the mandatory
prerequisite for the adaptation of a questionnaire to other languages and
cultures is the proven reliability of the original instrument. This paper
describes the evaluation and optimization, with respect to statistical
reliability, of a web-based tool for comparing corporate websites. The results of
this type of study can be used to assess the quality of the cross-cultural
adaptation of an instrument on the level of single items, subscales and the
total scale.
Evaluation of corporate websites
The corporate website is an essential link between customer and company. A few
years ago internet presence could lead to a competitive advantage even though
the benefits were not apparent to the visitor. Times have changed and consumer
and competitor demands are increasing. Expectations are high: comprehensive
and detailed content as well as technical reliability and functionality,
integrated in an appealing graphical design. In this connection, it is important
to emphasize that stable standards for the aesthetic appearance of websites are
difficult to develop in a rapidly evolving context. For the quality-conscious
website provider it is therefore imperative to expose the website to the
users' judgement in order to identify strengths and weaknesses.
The study of user acceptance of websites offers an ideal field for panel-based
online research, because three distinct advantages of this approach can be
brought to bear. First, high external validity can be attained by studying the
attitudes of users towards websites in their natural context: users can answer
an online questionnaire while interacting with the respective
website. Second, by using an online panel, it is easy to recruit a sample of
persons with high interest in the subject. Third, it is possible to control and
stratify the sample for relevant characteristics. For these reasons it is not
surprising that various online research companies offer this kind of analysis.
MediaTransfer AG Netresearch & Consulting has conducted standardized studies
for the evaluation of websites since 1996 and has pioneered work in this field.
Since 1999 MediaTransfer AG Netresearch & Consulting and the German magazine
Horizont have cooperated to publish a monthly study of general interest named
Website Trend. Thanks to MediaTransfer AG Netresearch & Consulting's experience,
benchmarking on the basis of more than 350 studies can be offered to customers.
At the same time it is possible to conduct a statistical analysis to determine
the reliability based on classical test theory. The results of this analysis
allow us to look at the characteristics not only of the whole instrument, but
also of the subscales and of the single items.
These results and their future use are the subject of the present paper.
Our Website Trend compares six competing websites by means of a questionnaire
made up of 24 ratings. Each rating represents an important aspect of websites.
These items are grouped into six subscales which cover essential facets of
websites:
- Concept
- Visual Realization
- Usability
- Content
- Interactivity
- Technical Quality
The 24 items are presented to 120 respondents, recruited from our Interactive
Dynamic Online Panel. Prior to the evaluation of the six websites the importance
of each of the 24 aspects is assessed, because it can be expected that this
importance varies according to the specific line of business to which a website
belongs. Furthermore, this procedure is necessary to calculate a weighted total
score, which allows the comparison of different websites, even though they do
not belong to the same sector. In the second step, each of the six websites is
evaluated by means of the same 24 items. The respondents receive the online
questionnaire including a link to the particular website and are instructed to
use that website for at least three minutes, which is controlled by a timer. The
respondents receive no information about the assignment of the items to the
subscales. The only difference between the importance check (performed once for
each respondent) and the six website-specific ratings is the labelling of the
scale points. For importance, 1 means "is very important for me" and
6 means "is absolutely not important for me". For the website-specific
ratings, 1 means "applies absolutely" and 6 means "does not apply at
all".
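The article does not state how the importance ratings enter the weighted total
score. As a purely illustrative sketch, one plausible scheme reverses the
importance rating (so that 1, "is very important for me", receives the largest
weight) and forms a weighted mean of the website ratings. The function name
weighted_total and the rule "weight = 7 - importance" are assumptions, not the
authors' documented procedure.

```python
def weighted_total(ratings, importance):
    """Weighted mean of item ratings on the 1 (best) to 6 (worst) scale.

    Hypothetical weighting rule: importance 1 ("very important") gets
    weight 6, importance 6 ("absolutely not important") gets weight 1.
    """
    if len(ratings) != len(importance):
        raise ValueError("ratings and importance must have the same length")
    weights = [7 - imp for imp in importance]
    return sum(w * r for w, r in zip(weights, ratings)) / sum(weights)


# Two items: a well-rated important item and a badly rated unimportant one.
# The important item dominates, so the result lies close to its rating of 1.
print(weighted_total([1, 6], [1, 6]))
```

Under this assumed rule, items that a respondent considers unimportant still
contribute, but with a small weight, so scores stay comparable across sectors
in which different aspects matter.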
After the respondents have answered the 24 items they are given the opportunity
to specify likes and dislikes in open format. The survey closes with the final
question about the intent to revisit the particular website. In total, then, the
24-item questionnaire has to be answered seven times: once at the beginning for
the importance check and six times for the website evaluation. The items are
presented in random order to avoid sequence effects. On average, the survey
takes 45 minutes, of which at least 18 minutes are devoted to interaction with
the websites. Obviously, the respondents need some stamina to complete the
questionnaire. On the other hand, the item list cannot be reduced any further
because three of the six subscales include only four items and two include only
three. A further reduction would jeopardize the reliability of the subscale
results and therefore call the validity of the comparison into question. Still,
relatively low drop-out rates of about 15 percent and the comprehensiveness of
the open answers given at the end of the evaluation suggest a high quality of
the data. More thorough insights are provided by a statistical reliability
analysis.
Reliability study
The primary focus of the following reliability study is the six unweighted
subscale scores (e.g. Concept, Visual Realization…) and the total score, which
are obtained for each of the six websites. The weights will not be discussed
further here because they have no impact on the reliability of the scores.
First, five studies were drawn at random from the pool of completed
Website Trend studies. Every study comprises six websites. Therefore, 30
reliability values (Cronbach's alpha) can be calculated for each subscale score
and for the total score. After averaging (via Fisher's Z transformation), the
values in diagram 1 are obtained.
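The two computational steps named above, Cronbach's alpha and averaging via
Fisher's Z transformation, can be sketched as follows. This is a minimal
illustration using the standard textbook formulas; the function names and the
toy ratings are assumptions and do not come from the Website Trend data.

```python
import math
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for one scale.

    rows: one list of item ratings per respondent (all the same length).
    """
    k = len(rows[0])
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def fisher_z_mean(coefficients):
    """Average coefficients on the Fisher Z scale and transform back.

    Each coefficient must lie strictly between -1 and 1.
    """
    z_values = [math.atanh(r) for r in coefficients]
    return math.tanh(sum(z_values) / len(z_values))


# Toy data: five respondents rating a three-item subscale (1..6).
ratings = [[1, 2, 1], [2, 2, 3], [3, 4, 3], [5, 4, 5], [6, 5, 6]]
print(cronbach_alpha(ratings))

# Averaging on the Z scale gives more weight to high coefficients than a
# plain arithmetic mean would.
print(fisher_z_mean([0.6, 0.9]))
```

In the study described here, fisher_z_mean would be applied to the 30 alpha
values of one subscale (six websites in each of five studies).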
According to Lord and Novick (1968), reliabilities of at least rtt = .7 are
necessary for comparisons at group level. Whereas the subscale Interactivity
only just reaches this threshold, the remaining values are well above it.
Overall, it can be concluded that the subscales are reliable indicators of the
concepts to be measured. This is all the more remarkable because some of the
scales consist of only three items. The statistical reliability of
a score increases with the number of homogeneous items of which it is composed.
Thus it is much easier to obtain a high reliability with a long scale than with
a short one. Three items represent the minimum, which cannot be reduced any
further. On the other hand, the respondents' willingness to cooperate is
limited, and the objective is therefore to find the optimal compromise between
the length and the reliability of the scales. In this respect, the comparison
between the scales Interactivity and Technical Quality (see diagram 1) is
revealing. Although both scales consist of only three items, Technical Quality
reaches a much higher reliability. This finding is supported by the analysis of
the unaggregated reliability parameters for the single observations (i.e. one
out of six websites in one of the five studies): whereas the reliabilities for
Interactivity are rtt = .6 or below in four cases (see diagram 2), for Technical
Quality such low values are not observed at all. For Content, a four-item
scale, low values can still be found in two cases.
Thus it can be concluded that Content and Interactivity leave room
for improvement.
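The dependence of reliability on scale length noted above is commonly expressed
by the Spearman-Brown formula of classical test theory. The article does not
name this formula, so the following is a sketch under that assumption; the
starting value of .70 and the function name are illustrative only.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability after changing the scale length.

    length_factor is the new number of items divided by the old number,
    assuming the added items are as homogeneous as the existing ones.
    """
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


# A three-item scale with reliability .70, lengthened to four items.
print(round(spearman_brown(0.70, 4 / 3), 3))  # 0.757
```

The formula only predicts a gain if the additional item fits the scale; an item
that does not match the content of the subscale can lower reliability instead.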
Optimization of the subscales
The optimization aims at constructing homogeneous subscales, i.e.
each item of a subscale has to be closely connected to the content of the
subscale to which it is assigned. The adequate statistical parameter is the
corrected item-subscale correlation. Corrected means that, before the
item-subscale correlation is computed, the item itself is removed from the
subscale score, so that the item is not correlated with itself. The evaluation
of the item-subscale correlations of the scale
Content (see diagram 3) reveals that item 18 has a particularly low value. The
same is true for item 19 and the subscale Interactivity (see diagram 4).
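A minimal sketch of the corrected item-subscale correlation: each item is
correlated with the sum of the remaining items of its subscale. The function
names and the toy data are assumptions; in the toy data the fourth item is
deliberately constructed so that it does not fit the other three.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long lists of numbers."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


def corrected_item_total(rows):
    """Correlate each item with the subscale score excluding that item."""
    correlations = []
    for i in range(len(rows[0])):
        item = [row[i] for row in rows]
        rest = [sum(row) - row[i] for row in rows]  # item removed from score
        correlations.append(pearson(item, rest))
    return correlations


# Toy four-item subscale; the last item is unrelated to the others.
ratings = [[1, 1, 2, 4], [2, 2, 2, 1], [3, 3, 4, 5], [4, 5, 4, 2], [5, 5, 6, 3]]
for number, r in enumerate(corrected_item_total(ratings), start=1):
    print(f"item {number}: {r:.2f}")
```

An item with a value close to zero, like the fourth item here, is a candidate
for rewording or removal, which is the diagnosis applied to items 18 and 19.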
Inspection of the content of item 18 supports the statistical diagnosis. The
wording of item 18 is: "Website X is entertaining". In contrast, the other
three items refer to the quality and coverage of the content of the website.
It should be pointed out that many customers do consider it important
whether a website is entertaining or not. In terms of homogeneity, however,
the content of this specific item fits the rest of the subscale only
partially, which lowers the subscale reliability. Therefore item 18 has been
rephrased to "Website X informs, without being boring". Thus the entertainment
aspect, to which the original item referred exclusively, is now tied more
closely to the quality of content.
Item 19 represents a similar case (see diagram 4): "Website X offers good
opportunity to establish contact to the offerer of the website". It is
plausible that this aspect does not play a major role for all websites. Therefore
item 19 was modified to: "Website X offers good opportunity to establish
contact to the offerer of the website, if required". Furthermore, one item
was added to the subscale. The new item 22 is worded "Website X gives me
the impression, that I can find a contact person to answer my questions".
The subscale Technical Quality, which previously comprised only three
items, was also strengthened by one additional item. Item 25 is worded "Website X
looks as though it were maintained by professionals." To keep the total
scale short, one item of the subscale Concept was removed. Based on the data of
the previous studies, removal of this item reduces the reliability of the
subscale Concept from .868 to .864, a decrease which is irrelevant in practice.
However, it should be pointed out that this is a prediction. Its validity has to
be cross-validated with a fresh body of data, in the same manner as the quality
of the modified subscales.
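A prediction of this kind can be obtained by recomputing Cronbach's alpha on
the existing data with one item left out ("alpha if item deleted"). The sketch
below uses toy data and assumed function names; it does not reproduce the
.868 and .864 reported above.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha; rows holds one list of item ratings per respondent."""
    k = len(rows[0])
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def alpha_if_item_deleted(rows):
    """Alpha of the shortened scale, once for each item that could be removed."""
    k = len(rows[0])
    return [
        cronbach_alpha([row[:i] + row[i + 1:] for row in rows])
        for i in range(k)
    ]


# Toy four-item subscale; the last item does not fit the other three.
ratings = [[1, 1, 2, 4], [2, 2, 2, 1], [3, 3, 4, 5], [4, 5, 4, 2], [5, 5, 6, 3]]
print(f"all items: {cronbach_alpha(ratings):.3f}")
for number, alpha in enumerate(alpha_if_item_deleted(ratings), start=1):
    print(f"without item {number}: {alpha:.3f}")
```

Removing a well-fitting item lowers alpha, while removing a poorly fitting one
raises it. Because the values are estimated from the same data that suggested
the change, they remain predictions until checked on a new sample.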
Cross-validation of the optimized scale
A standard website study was conducted with the new questionnaire. Results are
based on an online sample of N=109 panelists, drawn according to the same quotas
as the previous samples. As can be seen in diagram 5, the reliabilities achieved
with the unmodified subscales Visual Realization and Usability are virtually
identical to those of the previous studies.
The same applies to the subscale Concept, which has been shortened by one item.
For the subscales Content, Interactivity and Technical Quality, reliability
increases substantially, by the order of .1 units. The subscale Interactivity
now meets all demands that can be made of a scale of such brevity. Finally, the
reliability of the total scale score has been raised to .97, thereby enabling
the discrimination of even fine differences between websites.
Conclusion
Reliability analysis grounded on classical test theory provides insight into the
contribution of every single item to the reliability of the whole questionnaire.
Besides the corrected item-subscale correlation, other statistical parameters
(mean, standard deviation, etc.) have to be taken into account. Major
discrepancies in these parameters between translated and original items point to
cultural differences in the interpretation of the translated items. Thus, if
international comparability is intended, cross-culturally unstable items can be
modified to enable consistent results.
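As a rough illustration of such a comparison, the sketch below contrasts the
item means of an original and a translated version and flags items whose
standardized mean difference exceeds a cut-off. The function name, the toy data
and the cut-off of 0.5 are hypothetical; the article does not specify how large
a discrepancy counts as major.

```python
import math
from statistics import mean, variance


def flag_unstable_items(original, translated, threshold=0.5):
    """Return indices of items whose means differ markedly between versions.

    original, translated: one list of item ratings per respondent.
    threshold: hypothetical cut-off for the standardized mean difference.
    """
    flagged = []
    for i in range(len(original[0])):
        a = [row[i] for row in original]
        b = [row[i] for row in translated]
        # Simple average of the two variances as the common yardstick.
        pooled_sd = math.sqrt((variance(a) + variance(b)) / 2)
        difference = mean(b) - mean(a)
        if pooled_sd == 0:
            if difference != 0:
                flagged.append(i)
            continue
        if abs(difference) / pooled_sd > threshold:
            flagged.append(i)
    return flagged


# Toy data: the second item is rated far less favourably in translation.
original = [[2, 3], [3, 2], [2, 2], [3, 3]]
translated = [[2, 5], [3, 6], [3, 5], [2, 6]]
print(flag_unstable_items(original, translated))  # [1]
```

A flagged item is only a starting point for inspection: a real difference
between the two populations can produce the same pattern as a wording problem,
which is why the item-subscale correlations should be compared as well.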
Dr. Thomas Rodenhausen, a graduate psychologist, majored in psychology at
Technische Universität Berlin, focusing on psychological methods and educational
psychology. At Freie Universität Berlin he earned his PhD with a dissertation
on computer-based assessment. In April 2000 he joined the team of MediaTransfer
AG Netresearch & Consulting in Hamburg as head of data services and tool
development.
Andreas Ohde, a business school graduate, studied Marketing and Management in
Berlin. After his studies he worked as Project Manager at the Institut für
Kommunikationsforschung Dr. von Keitz. Since 1999 he has worked as Project Manager
for quantitative research at MediaTransfer AG Netresearch & Consulting in
Hamburg. Apart from classical offline topics he is also engaged in aspects of
commercial online strategy.
References
Lord, F.; Novick, M.: Statistical theories of mental test scores. Reading, MA,
1968.
For further information, please contact:
press@mediatransfer.de