Biometrics, Game Evaluation and User XP: Approach with caution

This post represents some thoughts on the use of psychophysiology to evaluate the player experience during a computer game.  As such, it’s tangential to the main business of this blog, but it’s a topic that I think is worth some discussion and debate, as it raises a whole bunch of pertinent issues for the design of physiological computer games.

Psychophysiological methods are combined with computer games in two types of context: applied psychology research and game evaluation in a commercial context.  With respect to the former, a researcher may use a computer game as a platform to study a psychological concept, such as the effects of game play on aggression or how playing against a friend or a stranger influences the experience of the player (see this recent issue of Entertainment Computing for examples).  In both examples, we’re dealing with the application of an experimental psychology methodology, where the game merely represents a task or virtual world within which to study human behaviour.  This approach is characterised by several features: (1) comparisons are made between carefully controlled conditions, (2) statistical power is important (if you want to see your work published) so large numbers of participants are run through the design, (3) selection of participants is carefully controlled (equal numbers of males and females, comparable age ranges if groups are compared) and (4) designs are counterbalanced, i.e. if participants play 2 different games, half of them play game 1 then game 2 whilst the other half play game 2 and then game 1; this is important because the order in which games are presented often influences the response of the participants.
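The counterbalancing described in point (4) can be sketched in a few lines of Python. This is only an illustration: the participant IDs, the game labels and the `counterbalance` helper are hypothetical names, and the rule used here (cycling through every possible order) is one simple way to do it rather than a prescribed procedure.

```python
from itertools import cycle, permutations


def counterbalance(participants, conditions):
    """Give each participant a presentation order, cycling through every
    possible ordering so that each one is used about equally often."""
    orders = list(permutations(conditions))
    return {person: order for person, order in zip(participants, cycle(orders))}


# Hypothetical participant IDs and game labels, for illustration only.
participants = ["P1", "P2", "P3", "P4"]
schedule = counterbalance(participants, ["game 1", "game 2"])
for person, order in schedule.items():
    print(person, "->", " then ".join(order))
```

The orders are only perfectly balanced when the number of participants is a multiple of the number of possible orders. Two games give 2 orders, but four games already give 24, so a small sample cannot cover every order.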

Using Games as Psychological Research Tools

Let me give an example from my own work in collaboration with a social psychologist colleague, Dr. Andreas Kastenmueller.  Andreas and I are interested in a psychological construct called self-activation, i.e. the extent to which the representation of self can be influenced by the appearance of avatars during game play.  We did a study last year (currently unpublished) where we had 4 groups of players play Wii Sports: two groups played an aggressive sport (boxing) and the other two groups played a less aggressive sport (bowling).  Within the boxers and the bowlers, half of the players played with an aggressive avatar (Wii avatar with furrowed brow – “angry eyes”) and half with a neutral avatar.  The only difference between the gaming experiences was the facial expression of the avatar; everything else was held constant.  We ran around 80 people through the design (20 per group, 10 males/10 females; all 4 groups had approximately equivalent mean ages) and matched the gender of the avatar to the gender of the person.  We recorded heart rate, respiration rate and blood pressure as well as acceleration in three axes.  Briefly, we found all physiological indicators were significantly higher during the boxing compared to the bowling but no effect for the expression of the avatar.  However, when we controlled our analysis of physiological data for movement (using acceleration data), the effect of the game type disappeared and the effect of avatar reached statistical significance, i.e. blood pressure increased when people played with the angry avatar.  In other words, identification with an aggressive avatar increased autonomic activation during the game – why were our participants more physiologically activated when playing with the angry avatar?
In truth, we don’t know – it could be that the aggressive avatar augmented natural competitiveness or that it was more emotionally arousing to play with an expressive avatar or that they produced more testosterone when the avatar was angry.  This kind of research is exploratory, and as ever, we need another experiment…
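A toy illustration of what “controlling for movement” can mean is sketched below in Python. Everything in it is an assumption made for illustration. The numbers are synthetic and constructed by formula (they are not data from the study), the helper names are hypothetical, and simple residualisation on a single covariate stands in for whatever analysis was actually run, which the post does not describe.

```python
from statistics import mean

# Synthetic, made-up records: (game, avatar, movement, blood_pressure).
# Blood pressure is constructed as 100 + 2 * movement, plus 3 for an angry avatar.
records = [
    (game, avatar, move, 100 + 2 * move + (3 if avatar == "angry" else 0))
    for game, moves in (("boxing", (9, 10, 11)), ("bowling", (3, 4, 5)))
    for avatar in ("angry", "neutral")
    for move in moves
]


def residualise(rows):
    """Subtract the least-squares linear effect of movement from blood pressure."""
    x = [r[2] for r in rows]
    y = [r[3] for r in rows]
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return [(g, av, bp - (my + slope * (m - mx))) for g, av, m, bp in rows]


def group_mean(rows, index, level):
    """Mean of the last column for rows whose factor at `index` equals `level`."""
    return mean(r[-1] for r in rows if r[index] == level)


def gap(rows, index, first, second):
    """Difference in mean outcome between two levels of a factor."""
    return group_mean(rows, index, first) - group_mean(rows, index, second)


adjusted = residualise(records)
print("raw game gap:       ", gap(records, 0, "boxing", "bowling"))
print("adjusted game gap:  ", round(gap(adjusted, 0, "boxing", "bowling"), 6))
print("adjusted avatar gap:", round(gap(adjusted, 1, "angry", "neutral"), 6))
```

Because boxing involves more movement in this constructed example, the whole boxing/bowling gap is carried by the movement covariate and disappears once movement is removed, while the avatar gap is left standing. Real recordings are noisy, so the same logic would need a proper statistical model rather than a comparison of means.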

The reason I present this summary is to make a general point: physiological variables are very sensitive measures.  They respond to subtle psychological variables (appearance of avatar), individual differences (between participants), the game context (boxing vs. bowling) and physical activity (movement).  This sensitivity is a large part of why physiological variables are useful for psychological research, but it is a double-edged sword.  When your measures can be sensitive to so many things, you need careful experimental control if you want to interpret your data in a way that is robust and unambiguous, which are the qualities that will make your results meaningful to others.

Evaluation of Player Experience

The second context for psychophysiology and gaming is the evaluation of player experience as part of a design cycle.  The goal here is to inform the process of game design in order to produce better games, and to be more specific, to confirm via play testing that the experience of the gamer conforms with the intentions of the team who were responsible for designing the game.  Game designers may construct a gaming experience with the objective of inducing different cognitive/motivational/emotional experiences and psychophysiology represents one means of confirming that these experiences have been achieved.  This type of testing can take place at the macro level (was my game scary?) or the micro level (was the part where the nice old lady turned into a zombie brandishing a chainsaw scary?).

This type of testing generally takes place in a commercial context (though there are research projects dedicated to this topic such as FUGA).  Software companies generally indulge in play testing using observational methods in conjunction with post-game interviews, but there are exceptions to this rule such as the TRUE method developed by Microsoft Games Studios.  A number of people working in this field have put forward strong arguments for using psychophysiological methods to evaluate gamer experience – see the talks from Pejman Mirza-Babaei and Lennart Nacke from our recent CHI workshop here as examples.  Lennart was also the co-author of a recent methodological paper on how physiology could be combined with video and game events to triangulate player experience (i.e. to converge parallel data streams from physiology, observation and game events to understand player experience).  Vertical Slice employ a similar approach known as Biometric Storyboarding (see the CHI link for Pejman’s presentation on this topic).

Using physiology in the context of commercial usability testing is very different to using game software in order to explore psychological hypotheses as we did with the Wii study.  First of all, the purpose is to gain insight into player experience and then to convey that insight to game designers in a way which informs their practice.  There are several challenges to be overcome in that last sentence.  Also, we ought to recognise that this research may be confirmatory rather than exploratory – in other words, it should lead to a clear conclusion about the game experience, not just another experiment.  In the past I worked at a human factors research institute that included both a research branch and a commercial/consultancy arm.  I worked on projects in both sectors and experienced first-hand how the evaluation of technology diverged across academic and commercial sectors.  A large part of this difference was the need for clarity (how will your test help me sell more products – exactly) and expediency (I need your results now), which combined with financial restrictions to severely limit the number of participants that could be tested or the amount of time that could be devoted to analysis.
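To put the constraint on participant numbers in perspective, here is a rough power calculation in Python. It is a sketch under stated assumptions: it uses the standard normal approximation for a two-sided comparison of two group means, the function name is hypothetical, and the effect sizes are Cohen's conventional small, medium and large values rather than anything measured in the studies discussed in this post.

```python
from math import ceil
from statistics import NormalDist


def participants_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-group comparison of means,
    using the normal approximation n = 2 * ((z_alpha/2 + z_power) / d) ** 2."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


for d in (0.2, 0.5, 0.8):  # assumed effect sizes (Cohen's small / medium / large)
    print(f"d = {d}: about {participants_per_group(d)} participants per group")
```

An exact calculation based on the t distribution gives slightly larger numbers, but the order of magnitude is the point: unless the effect is very large, a between-groups comparison needs dozens of people per group.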

However, none of these limiting factors change anything I said earlier about the use of physiology (or biometrics as the game industry seem to call it) to evaluate user experience.  These data remain volatile, variable and difficult to interpret without a high level of experimental control.

Recently the Escapist published this article based on work by Vertical Slice on which is the scariest game for the Xbox 360.  For me, there were a number of problems with how physiology was used in the context of this study (at least going from the description in the aforementioned article) that encapsulate the problems of player evaluation with commercial products.  First of all,

“the study, performed across four games (Alan Wake, Dead Space 2, Condemned: Criminal Origins and Resident Evil 5) on six participants between the ages of 20 and 42, attempted to discover exactly which moments of the games were frightening.”

So we have four games to be compared, which are similar in theme but also different in terms of mechanics – this is one big problem for evaluation using bespoke commercial products, the lack of systematic control across different software titles.  The alternative is to compare different versions of the game world that may be constructed from scratch or to use an SDK to create a systematic variation in the game world.  Also the number of participants in this study is very low, especially as the group of six players varied considerably with respect to age and were further divided into casual vs. core gamers.  What this does is simply increase the level of unsystematic “noise” in the data.  Although achieving statistical significance is perhaps not a pressing issue for this kind of commercial work, the noise does make it very hard to find consistency in the data at all, except by considering data from one individual.  A description of the data collection is presented below from the original article:

“To measure fear, Vertical Slice had each of the participants play about 30 minutes of each game in a counter-balanced order to reduce bias. During play, the participants were asked to think out loud; at the same time, their heart rate, skin surface temperature, Galvanic Skin Response (which measures excitement or frustration) and, in some cases, respiration were measured. After playing, the volunteers were asked to analyze their experience.”

It should be noted that the researchers counterbalanced the presentation of the games to control for order effects.  I was very surprised to hear that a speak-aloud protocol had been used.  Speaking exerts a profound effect on breathing rate and, because breathing and heart rate are physiologically linked, on heart rate as well.  In my view, the heart rate and respiration data would be effectively useless, as some people will talk a lot and some will speak very little.  In that case, I assume that GSR was mainly used for the high/low scare scale that appears in the figures accompanying the article.  Now here I have a major problem due to what I perceive to be a gross simplification of the psychophysiological inference.  As noted in the article, GSR is associated with excitement or frustration; according to this recent review article on autonomic markers of emotion, GSR has been associated with a range of high activation emotions, including excitement, happiness, anger, frustration and fear.  GSR basically measures activation of the sympathetic nervous system, so it is inherently ambiguous with respect to emotional labels (i.e. a one-to-many relationship, see my 2009 paper for a full description of the complexity of this inference).

This is why experimental control is so important because context is everything for the interpretation of physiological data.

Why Experimental Control is Important for Everyone

I don’t wish to bag the guys at Vertical Slice; I think the storyboarding technique they have developed is really interesting and I am reading about their work from a secondary source.  But I do want to raise a debate about how physiological measures should be used in a commercial game testing situation.  When I worked in the commercial realm, I was sometimes told by the old hands that a high level of experimental control was unnecessary in the realm of consultancy because this was “quick and dirty” testing.  I always hated that phrase because the financial stakes are very high in commercial work and researchers need to be as cautious and meticulous as they would be in a pure academic research setting (in fact, more so).  On the other hand, is it realistic to expect a software company to allow me to test 80+ participants (at a consultancy rate) in order to figure out what kind of avatar they should use in their game?  And then to charge them again for a follow-up experiment?  The applied research on game testing methodology (such as the FUGA project mentioned earlier) charts a middle path between these extremes of academic exploration and commercial confirmation, but no researcher who is using physiological measures can hope to dodge the requirement for real experimental control if they wish to present a confident interpretation of physiological data to their clients.  The use of scientific tools is not the same as employing a scientific methodology, and gross simplification of what physiological measures mean will eventually damage the credibility of this approach in the domain of player evaluation – because people will figure out that interpretation is ambiguous and the lines on the chart are open to multiple interpretations.

What Kind of Evaluation Do Game Designers Want?

In the interest of transparency, and to prevent an accusation of criticising Vertical Slice from an academic ivory tower, I’d like to present some “quick and dirty” testing of our own in order to end this post on yet another dilemma and to show that we’re guilty of the same errors.  Below you can see heart rate data for a single player – this person played the Sony PlayStation game “WipeOut HD” (a futuristic racing game for those unfamiliar with the title) under three conditions of difficulty (defined as increased speed of opponents); venom is the easiest level, rapier sits in between and phantom is the hardest level.  We logged heart rate during each race by the player’s position in the race, reasoning that sympathetic activation would be highest when the task was hard and the player was in the highest position in the race.

As you can see, the data conform nicely to our expectations: heart rate is higher for the most demanding game (phantom) and rises as the person progresses from the lower to the higher race position.  Although this is only one person, I’d be confident in seeing the same trend if we ran another 20 people through the design.  The nice thing about these data is that we used a commercial piece of game software in a controlled way, and like Vertical Slice, we have placed our physiological data in the context of specific gaming events.  Leaving aside the issue of interpretation (we assume heart rate increased due to sympathetic activation as a response to high demand or motivation or some mixture of both), to what extent are these data really useful for a game designer?  At the very least, they confirm that player experience during the venom/rapier/phantom levels is different – but does that really inform their practice?  Or is that aiming too high, and all commercial customers really want from these measures is to check their design assumptions?  In my opinion, psychophysiological measures (or biometrics) will only really deliver insight when they are used to ask questions to which the industry doesn’t already know all the answers.  And here’s the kicker: if a team of game designers gave me five different versions of the same game or game scenario, which were identical except in one crucial respect (i.e. controlled in an experimental sense), then my psychophysiological measures might tell them something unexpected, something that would really enhance their understanding of the product.

Call me cynical but I don’t expect it to happen anytime soon.

19 thoughts on “Biometrics, Game Evaluation and User XP: Approach with caution”

  1. Lennart Nacke

    Steve, nicely written and controversial post. You touched nicely on the dilemma of making this type of research really meaningful for the game industry. I think we both agree that it is better to support companies that push these measures in the industry than rounding them up, and I think your call for caution in interpretation and experimental control is important.

    However, I also wonder when an experiment is highly controlled, how much impact for game design is really left. Take, for example, your study with the bowlers and boxers. I wonder about the impact of a comical facial expression as is found on Miis. Your study would indicate that even minor aggressive cues (I would call them minor since the information on a Mii face is definitely not as rich as in a fully rendered CGI avatar) lead to a blood pressure increase. Taking into account this knowledge is where the important step for game design would need to happen. And is there enough information to be taken away from this manipulation to design better games?

    I see this as the point where this research ties nicely back to physiological computing. As a game designer I want to create an illusion of possibilities for the player and therefore I do not want to make a decision as to whether it is better to include the angry or the neutral avatar (ideally the player can choose either one), but allow for the game to adapt to how the player reacts physiologically at some point X during the game. This is where a real challenge lies for physiological interaction design. At what game moments do we want the player to become more excited and at what moments more calm? In general, I think that studies on the micro level are more likely to inform these decisions, but as you pointed out, industry has certain constraints such as wanting clarity and expedience, which is why visual analytics (not statistical significance) is a real seller there. To close, I see some potential in the biometric storyboarding approach as it seems to inform decisions on how to find and interpret micro level events (something which we also do when we triangulate our physiological data with as many other sources as possible) and provides a visual overview, which is good for both fast and (relatively) clear results.

  2. Pejman Mirza-Babaei

    Hi Steve, really good article and thanks for liking the Biometric Storyboards idea 🙂

    Although I was not involved in conducting the “scariest game” study that you refer to in your article, I thought I would comment on this and clarify some points.

    The study you mentioned was neither academic nor commercial. It was a very small study, conducted in order to be part of a mass market magazine article aiming to make young readers excited about new tools available for game design/evaluation. It shouldn’t be treated as either an academic or a commercial study.

    Also, the biometric measures were not analysed in order to be interpreted as a particular emotion (in this case being scared). They were just used to structure the post-session interview for each player. So basically scary events were selected based on participants’ post-session comments and not on the change in their GSR etc.

  3. Lennart Nacke

    Pejman, I get your point about the experiment, but I would still say that it is important to be careful how to market psychophysiological measures to popular press because it is always too easy to infer causality from the fancy charts that can be produced based on these measures. Steve is just advocating research carefulness and scientific validity with his post, I think.

  4. Steve Fairclough Post author

    Hi Lennart

    I think you understand what I’m driving at very well. The post is about a double-dilemma revolving around what industry wants to do with these measures (with respect to UX evaluation) and how we (as researchers) present these measures to potential collaborators and customers in industry.

    I included the description of the Wii study as a contrast – not in terms of experimental control, statistical rigour etc. – but as psychophysiological research using a computer game that is essentially not directed towards informing game design, but using games as a platform to investigate human behaviour. In this case, we looked at whether physiological changes due to the aggressive Mii caused increased aggression after the game was over (it didn’t in this case).

    Obviously using psychophysiology to understand gamer experience leads directly into adaptive gaming software, where real-time changes in physiology are used to adaptively alter game events at the micro-level. But interpretation of physiological changes may be very important if the pathway through the game can fork in five different directions at your given point X; we want physiological adaptation to make games that are smart and unpredictable but not completely random.

    Your points in defence of visual analytics are well taken and I completely see the overlap between your triangulation approach and biometric storyboards. However, a few points about that. Firstly, in my opinion, the push towards clear visual representations is being driven by the need of the customer for a non-specialist summary or ‘take home’ message. Secondly, without proper control, the lines on the charts are open to multiple interpretations, which (in my opinion) fundamentally weakens the explanatory power of this kind of data. Finally, it should be at least possible to do a small-scale study with quick turnaround that is properly controlled so we can have confidence in the visual analytics, don’t you agree?

    Best,
    Steve

  5. Steve Fairclough Post author

    Hi Pejman

    I’m sure my post didn’t go down well with anyone at Vertical Slice and I apologise for that.

    Thanks for your clarification. I didn’t think this was an academic study but thanks for describing the context of the data collection. From what you say, GSR was used to identify points in the game (when sympathetic activity peaked) and these events were selected for post-session comments related to perceived fear. What completely threw me were the figures presented beneath the article that show a continuous line over elapsed time to denote level of fear. I don’t understand how a continuous line could denote post-interview comments and simply assumed it represented continuous physiological data.

    This work seemed to me to be quite different from your presentation at CHI, where game events were used to provide a context for the physiological data.

    Whilst I understand the purpose of the study was to provoke interest in these evaluation techniques, and that Vertical Slice cannot completely control what The Escapist or anyone else writes about their work, there was no clarification about where the data came from and very little in the way of caveats (except for a comment about low participant numbers).

    What I’m advocating in my post is the use of experimental control to aid data interpretation combined with a more cautious approach to inferring psychological states based on physiological data. These qualities are typically associated with traditional laboratory work but it doesn’t have to be that way and I believe there are real benefits for commercial game testing.

    Regards
    Steve

  6. Ben Lewis Evans

    Nice article Steve.

    I personally think that psychophysiological measures have promise for various areas (games, but also other areas such as driving), but that more work needs to be done before it is really useful in most commercial situations.

    Just to be a bit self promoting, I tried to make some of the same points about the issues around the use of psychophysiology in game research in a feature I wrote for Gamasutra back in April. Perhaps it would be interesting for you – http://www.gamasutra.com/view/feature/6341/game_testing_and_research_the_.php

    1. Steve Fairclough Post author

      Hi Ben

      Thanks for your comment. I wasn’t aware of your Gamasutra article, but thought you provided a good overview. By the way, I learned how to be a psychophysiologist working with driver behaviour (with Dick and Karel at your university). I just think that importing some techniques from that area to game evaluation would greatly improve the use of psychophysiology.

      Regards
      Steve

  7. Ben Lewis Evans

    Hi Steve,

    I was wondering if you were that particular Steve Fairclough. You have been working with several PhD colleagues of mine in REFLECT I believe (Arjan and Chris – along with Dick of course).

    I agree that some techniques from driver behaviour work could go towards game evaluation. Also perhaps games have something to offer work in driving – at the very least it would be nice if some of the technology that exists in driving games nowadays would trickle down into simulators 😉

  9. Steve Fairclough Post author

    Hi Ben

    Yeah, I spent some days eating pasta in Maranello with Chris and Arjan earlier in the year. I know what you mean about driving simulators; we have an old STI simulator and the graphics are getting to be quite embarrassing.

    Steve

  10. Ben Lewis Evans

    This is getting a bit off topic, but in terms of driving simulators graphics are nice of course, however I feel what would be most beneficial is getting more accurate physics simulation into them.

    This is where racing games nowadays really focus: getting that “road feel” of a physical object interacting in a real 3D environment. This can be done without moving bases, etc., and in my opinion really increases the impression of tires on a road, acceleration, lateral forces, and so on.

  12. Steve Fairclough Post author

    Hi Ben

    I can see the argument you’re making, but I haven’t been involved in driving simulation in terms of fidelity issues for years. Gaming software has always been years ahead of driving simulation; the small size of the market means that researchers must either build their own software or purchase from very specialised companies (who do not make games).

    Steve

  13. Tim Carter

    If you need to translate feelings into numbers, then you are incapable of understanding feelings.

    It’s like a person who can only understand sex by looking at pictures of it.

  14. Kiel Gilleade

    Hi Tim,

    Ah that old chestnut.

    How can we digitise feeling without losing the subjective experience of its sensation which by its very nature is a personal experience?

    Well we don’t technically do that. What we do is digitise the physiological markers associated with that experience in order to infer the sensation an individual feels. Psychophysiology is a research field which studies these relationships e.g. frontal theta activity in the brain can be used to infer mental effort (more activity more effort expended).

    Obviously this loses some of the details of that experience. However, does an adaptive system really need all that information in order to recommend an appropriate response, e.g. player is expending a low level of mental effort on this level, recommendation: increase game difficulty? Does a human, for that matter? Human social interaction works through inference of the overt markers associated with a subjective experience and we seem to muddle along reasonably well using just them. Physiological signals merely provide covert markers for us to infer a psychological state.

    The challenge in this field is in finding suitable relationships between feelings and physiological markers as psychophysiological responses are rather messy which often leads to erroneous interpretations about what a physiological response is actually telling you e.g. heartbeat rate is a useful measure of autonomic activation and can be used as a component of a subjective experience but you need to control for context otherwise any labels you attach to changes in the signal are pretty much meaningless.

    I hope this has explained our field a little better, though the last time I tried answering this one my office mate thought I was practising sorcery.

    – Kiel
