This post represents some thoughts on the use of psychophysiology to evaluate the player experience during a computer game. As such, it’s tangential to the main business of this blog, but it’s a topic that I think is worth some discussion and debate, as it raises a whole bunch of pertinent issues for the design of physiological computer games.
Psychophysiological methods are combined with computer games in two types of context: applied psychology research and game evaluation in a commercial setting. With respect to the former, a researcher may use a computer game as a platform to study a psychological concept, such as the effects of game play on aggression or how playing against a friend rather than a stranger influences the experience of the player (see this recent issue of Entertainment Computing for examples). In both cases, we're dealing with the application of an experimental psychology methodology: the game merely represents an environment or virtual world within which to study human behaviour. This approach is characterised by several features: (1) comparisons are made between carefully controlled conditions; (2) statistical power is important (if you want to see your work published), so large numbers of participants are run through the design; (3) selection of participants is carefully controlled (equal numbers of males and females, comparable age ranges if groups are compared); and (4) designs are counterbalanced, i.e. if participants play 2 different games, half of them play game 1 then game 2 whilst the other half play game 2 then game 1; this is important because the order in which games are presented often influences the response of the participants.
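To make feature (4) concrete, here is a minimal Python sketch (hypothetical code, not from any of the studies discussed) of how participants might be assigned to counterbalanced presentation orders:

```python
import itertools

def counterbalanced_orders(games, n_participants):
    """Cycle through every permutation of the games so that each
    presentation order is used (approximately) equally often."""
    orders = list(itertools.permutations(games))
    return [orders[i % len(orders)] for i in range(n_participants)]

# With two games, half the participants get each order.
assignments = counterbalanced_orders(["game 1", "game 2"], 10)
```

With more than two or three conditions a full factorial of orders becomes impractical, which is why Latin-square designs are often used instead; the principle, however, is the same.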
Using Games as Psychological Research Tools
Let me give an example from my own work in collaboration with a social psychologist colleague, Dr. Andreas Kastenmueller. Andreas and I are interested in a psychological construct called self-activation, i.e. the extent to which the representation of self can be influenced by the appearance of avatars during game play. We did a study last year (currently unpublished) where we had 4 groups of players play Wii Sports: two groups played an aggressive sport (boxing) and the other two groups played a less aggressive sport (bowling). Within each group of boxers and bowlers, half of the players played with an aggressive avatar (Wii avatar with furrowed brow – "angry eyes") and half with a neutral avatar. The only difference between the gaming experiences was the facial expression of the avatar; everything else was held constant. We ran around 80 people through the design (20 per group, 10 males/10 females; all 4 groups had approximately equivalent mean ages) and matched the gender of the avatar to the gender of the person. We recorded heart rate, respiration rate and blood pressure, as well as acceleration in three axes. Briefly, we found all physiological indicators were significantly higher during the boxing compared to the bowling, but no effect for the expression of the avatar. However, when we controlled our analysis of physiological data for movement (using the acceleration data), the effect of game type disappeared and the effect of avatar reached statistical significance, i.e. blood pressure increased when people played with the angry avatar. In other words, identification with an aggressive avatar increased autonomic activation during the game – so why were our participants more physiologically activated when playing with the angry avatar? In truth, we don't know: it could be that the aggressive avatar augmented natural competitiveness, or that it was more emotionally arousing to play with an expressive avatar, or that players produced more testosterone when the avatar was angry.
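For readers curious about what "controlling for movement" means in practice, here is a rough Python sketch of the general idea, not our actual analysis pipeline: regress the physiological measure on a movement index derived from the acceleration data, then carry out the group comparison on the residuals. All variable names and numbers below are invented.

```python
import numpy as np

def residualise(physio, movement):
    """Remove the variance in a physiological measure that is linearly
    predictable from movement (e.g. summed accelerometer activity),
    leaving residuals that can be compared across groups."""
    X = np.column_stack([np.ones_like(movement), movement])
    beta, *_ = np.linalg.lstsq(X, physio, rcond=None)
    return physio - X @ beta

rng = np.random.default_rng(0)
movement = rng.normal(size=80)                      # movement index per participant
physio = 70 + 5 * movement + rng.normal(size=80)    # HR driven largely by movement
resid = residualise(physio, movement)               # movement effect removed
```

Group differences in `physio` here would mostly reflect how much people moved; differences in `resid` cannot, which is how a hidden effect (like our avatar effect) can emerge once the dominant movement signal is partialled out.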
This kind of research is exploratory, and as ever, we need another experiment…
The reason I present this summary is to make a general point: physiological variables are very sensitive measures. They respond to subtle psychological variables (appearance of avatar), individual differences (between participants), the game context (boxing vs. bowling) and physical activity (movement). This sensitivity is a large part of why physiological variables are useful for psychological research, but it is a double-edged sword. When your measures can be sensitive to so many things, you need careful experimental control if you want to interpret your data in a way that is robust and unambiguous, which are the qualities that will make your results meaningful to others.
Evaluation of Player Experience
The second context for psychophysiology and gaming is the evaluation of player experience as part of a design cycle. The goal here is to inform the process of game design in order to produce better games, and to be more specific, to confirm via play testing that the experience of the gamer conforms with the intentions of the team who were responsible for designing the game. Game designers may construct a gaming experience with the objective of inducing different cognitive/motivational/emotional experiences and psychophysiology represents one means of confirming that these experiences have been achieved. This type of testing can take place at the macro level (was my game scary?) or the micro level (was the part where the nice old lady turned into a zombie brandishing a chainsaw scary?).
This type of testing generally takes place in a commercial context (though there are research projects dedicated to this topic such as FUGA). Software companies generally indulge in play testing using observational methods in conjunction with post-game interviews, but there are exceptions to this rule such as the TRUE method developed by Microsoft Games Studios. A number of people working in this field have put forward strong arguments for using psychophysiological methods to evaluate gamer experience – see the talks from Pejman Mirza-Babaei and Lennart Nacke from our recent CHI workshop here as examples. Lennart was also the co-author of a recent methodological paper on how physiology could be combined with video and game events to triangulate player experience (i.e. to converge parallel data streams from physiology, observation and game events to understand player experience). Vertical Slice employ a similar approach known as Biometric Storyboarding (see the CHI link for Pejman’s presentation on this topic).
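The triangulation idea, converging a continuous physiological stream with discrete game events, boils down to aligning timestamps across data streams. Here is a minimal, hypothetical sketch of that alignment step (the data and function names are my own invention, not part of the TRUE method or Biometric Storyboarding):

```python
import bisect

def sample_at_events(timestamps, values, event_times):
    """For each logged game event, return the physiological sample
    nearest in time: a crude way to line up parallel data streams."""
    out = []
    for t in event_times:
        i = bisect.bisect_left(timestamps, t)
        # choose the closer of the two neighbouring samples
        candidates = [j for j in (i - 1, i) if 0 <= j < len(timestamps)]
        j = min(candidates, key=lambda j: abs(timestamps[j] - t))
        out.append(values[j])
    return out

# 1 Hz heart-rate stream and two game events (e.g. "zombie appears" at t=2.4 s)
hr_at_events = sample_at_events([0, 1, 2, 3, 4], [70, 72, 75, 90, 85], [2.4, 3.9])
```

In real testing one would average over a window around each event (and allow for the latency of the physiological response) rather than take a single sample, but the principle of anchoring physiology to game events is the same.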
Using physiology in the context of commercial usability testing is very different to using game software to explore psychological hypotheses, as we did with the Wii study. First of all, the purpose is to gain insight into player experience and then to convey that insight to game designers in a way which informs their practice; there are several challenges packed into that last sentence. We should also recognise that this research may be confirmatory rather than exploratory: in other words, it should lead to a clear conclusion about the game experience, not just another experiment. In the past I worked at a human factors research institute that included both a research branch and a commercial/consultancy arm; I worked on projects in both sectors and experienced first-hand how the evaluation of technology diverged across the academic and commercial worlds. A large part of this difference was the need for clarity ("how, exactly, will your test help me sell more products?") and expediency ("I need your results now"), which combined with financial restrictions to severely limit the number of participants that could be tested or the amount of time that could be devoted to analysis.
However, none of these limiting factors change anything I said earlier about the use of physiology (or biometrics as the game industry seem to call it) to evaluate user experience. These data remain volatile, variable and difficult to interpret without a high level of experimental control.
Recently the Escapist published this article, based on work by Vertical Slice, on which is the scariest game for the Xbox 360. For me, there were a number of problems with how physiology was used in this study (at least going from the description in the aforementioned article) that encapsulate the difficulties of player evaluation with commercial products. First of all,
“the study, performed across four games (Alan Wake, Dead Space 2, Condemned: Criminal Origins and Resident Evil 5) on six participants between the ages of 20 and 42, attempted to discover exactly which moments of the games were frightening.”
So we have four games to be compared, which are similar in theme but different in terms of mechanics. This is one big problem for evaluation using bespoke commercial products: the lack of systematic control across different software titles. The alternative is to compare different versions of a game world constructed from scratch, or to use an SDK to create a systematic variation in the game world. The number of participants in this study is also very low, especially as the group of six players varied considerably with respect to age and was further divided into casual vs. core gamers. This simply increases the level of unsystematic "noise" in the data. Although achieving statistical significance is perhaps not a pressing issue for this kind of commercial work, noise of this kind makes it very hard to find any consistency in the data, except by considering data from one individual. A description of the data collection is presented below, from the original article:
“To measure fear, Vertical Slice had each of the participants play about 30 minutes of each game in a counter-balanced order to reduce bias. During play, the participants were asked to think out loud; at the same time, their heart rate, skin surface temperature, Galvanic Skin Response (which measures excitement or frustration) and, in some cases, respiration were measured. After playing, the volunteers were asked to analyze their experience.”
It should be noted that the researchers counterbalanced the presentation of the games to control for order effects. I was surprised, however, to read that a think-aloud protocol had been used. Speaking exerts a profound effect on breathing rate, and therefore on heart rate, because breathing and heart rate are physiologically linked. In my view, these data would be effectively useless, as some people will talk a lot and some will say very little. In that case, I assume that GSR was mainly used for the high/low scare scale that appears in the figures accompanying the article. Here I have a major problem, due to what I perceive to be a gross simplification of the psychophysiological inference. As noted in the article, GSR is associated with excitement or frustration; according to this recent review article on autonomic markers of emotion, GSR has been associated with a range of high-activation emotions, including excitement, happiness, anger, frustration and fear. GSR basically measures activation of the sympathetic nervous system; it is inherently ambiguous with respect to emotional labels (i.e. a one-to-many relationship; see my 2009 paper for a full description of the complexity of this inference).
This is why experimental control is so important: context is everything for the interpretation of physiological data.
Why Experimental Control is Important for Everyone
I don't wish to bag the guys at Vertical Slice: I think the storyboarding technique they have developed is really interesting, and I am reading about their work from a secondary source. But I do want to raise a debate about how physiological measures should be used in a commercial game-testing situation. When I worked in the commercial realm, I was sometimes told by the old hands that a high level of experimental control was unnecessary in consultancy because this was "quick and dirty" testing. I always hated that phrase, because the financial stakes are very high in commercial work and researchers need to be as cautious and meticulous as they would be in a pure academic research setting (in fact, more so). On the other hand, is it realistic to expect a software company to allow me to test 80+ participants (at a consultancy rate) in order to figure out what kind of avatar they should use in their game? And then to charge them again for a follow-up experiment? Applied research on game-testing methodology (such as the FUGA project mentioned earlier) charts a middle path between these extremes of academic exploration and commercial confirmation, but no researcher who uses physiological measures can hope to dodge the requirement for real experimental control if they wish to present a confident interpretation of physiological data to their clients. The use of scientific tools is not the same as employing a scientific methodology, and gross simplification of what physiological measures mean will eventually damage the credibility of this approach in the domain of player evaluation, because people will figure out that interpretation is ambiguous and the lines on the chart are open to multiple interpretations.
What Kind of Evaluation Do Game Designers Want?
In the interest of transparency, and to prevent an accusation of criticising Vertical Slice from an academic ivory tower, I'd like to present some "quick and dirty" testing of our own, in order to end this post on yet another dilemma and to show that we're guilty of the same errors. Below you can see heart rate data for a single player. This person played the Sony PlayStation game "WipeOut HD" (a futuristic racing game, for those unfamiliar with the title) under three conditions of difficulty (defined as increased speed of opponents): venom is the easiest level and phantom is the hardest. We logged heart rate during each race and tagged it by the player's position in the race, reasoning that sympathetic activation would be highest when the task was hard and the player was in a high position in the race.
As you can see, the data conform nicely to our expectations: heart rate is higher for the most demanding condition (phantom) and rises as the person progresses from a lower to a higher race position. Although this is only one person, I'd be confident of seeing the same trend if we ran another 20 people through the design. The nice thing about these data is that we used a commercial piece of game software in a controlled way and, like Vertical Slice, we placed our physiological data in the context of specific gaming events. Leaving aside the issue of interpretation (we assume heart rate increased due to sympathetic activation in response to high demand, motivation or some mixture of both), to what extent are these data really useful for a game designer? At the very least, they confirm that player experience during the venom/rapier/phantom levels is different. But does that really inform design practice? Or is that aiming too high, and all commercial customers really want from these measures is to check their design assumptions? In my opinion, psychophysiological measures (or biometrics) will only really deliver insight when they are used to ask questions to which the industry doesn't already know the answers. And here's the kicker: if a team of game designers gave me five different versions of the same game or game scenario, which were identical but for one crucial feature (i.e. controlled in an experimental sense), then my psychophysiological measures might tell them something unexpected, something that would really enhance their understanding of the product.
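For what it's worth, the positional analysis described above amounts to binning heart-rate samples by difficulty condition and race position. A toy sketch with invented numbers (this is not our actual logging code, and the log format is hypothetical):

```python
from collections import defaultdict

def mean_hr_by_position(samples):
    """samples: (difficulty, race_position, heart_rate) tuples logged
    during play; returns mean heart rate per (difficulty, position) cell."""
    cells = defaultdict(list)
    for difficulty, position, hr in samples:
        cells[(difficulty, position)].append(hr)
    return {cell: sum(v) / len(v) for cell, v in cells.items()}

# invented samples: position 8 = back of the pack, position 1 = leading
log = [("venom", 8, 72), ("venom", 1, 78),
       ("phantom", 8, 84), ("phantom", 1, 95), ("phantom", 1, 93)]
means = mean_hr_by_position(log)
```

Each cell of the resulting dictionary corresponds to one point on the chart, which is exactly why the interpretation problem remains: the aggregation is trivial, but deciding what an elevated cell mean *signifies* is not.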
Call me cynical, but I don't expect it to happen anytime soon.