{"id":1698,"date":"2011-06-23T13:05:04","date_gmt":"2011-06-23T13:05:04","guid":{"rendered":"http:\/\/www.physiologicalcomputing.net\/?p=1698"},"modified":"2021-12-22T20:21:14","modified_gmt":"2021-12-22T20:21:14","slug":"biometrics-game-evaluation-and-user-xp-approach-with-caution","status":"publish","type":"post","link":"http:\/\/www.physiologicalcomputing.net\/?p=1698","title":{"rendered":"Biometrics, Game Evaluation and User XP:  Approach with caution"},"content":{"rendered":"<p>This post represents some thoughts on the use of psychophysiology to evaluate the player experience during a computer game. \u00a0As such, it&#8217;s tangential to the main business of this blog, but it&#8217;s a topic that I think is worth some discussion and debate, as it raises a whole bunch of pertinent issues for the design of physiological computer games.<\/p>\n<p>Psychophysiological methods are combined with computer games in two types of context: applied psychology research and game evaluation in a commercial context. \u00a0With respect to the former, a researcher may use a computer game as a platform to study a psychological concept, such as effects of game play on aggression or how playing against a friend or a stranger influences the experience of the player (see <a href=\"http:\/\/www.sciencedirect.com\/science\/journal\/18759521\">this<\/a> recent issue of Entertainment Computing for examples). \u00a0In both cases, we&#8217;re dealing with the application of an experimental psychology methodology to an issue where the game is used as a task or virtual world within which to study behaviour. \u00a0The computer game merely represents an environment or context in which to study human behaviour. 
\u00a0 This approach is characterised by several features: (1) comparisons are made between carefully controlled conditions, (2) statistical power is important (if you want to see your work published) so large numbers of participants are run through the design, (3) selection of participants is carefully controlled (equal number of males and females, comparable age ranges if groups are compared) and (4) designs are counterbalanced, i.e. if participants play 2 different games, half of them play game 1 then game 2 whilst the other half play game 2 and then game 1; this is important because the order in which games are presented often influences the response of the participants.<br \/>\n<!--more--><br \/>\n<strong>Using Games as Psychological Research Tools<\/strong><\/p>\n<p>Let me give an example from my own work in collaboration with a social psychologist colleague, Dr. Andreas Kastenmueller. \u00a0Andreas and I are interested in a psychological construct called self-activation, i.e. the extent to which the representation of self can be influenced by the appearance of avatars during game play. \u00a0We did a study last year (currently unpublished) where we had 4 groups of players play Wii Sports: two groups played an aggressive sport (boxing) and the other two groups played a less aggressive sport (bowling). \u00a0Within each group of boxers and bowlers, half of the players played with an aggressive avatar (Wii avatar with furrowed brow &#8211; &#8220;angry eyes&#8221;) and half with a neutral avatar. \u00a0The only difference between the gaming experiences was the facial expression of the avatar; everything else was held constant. \u00a0We ran around 80 people through the design (20 per group, 10 males\/10 females; all 4 groups had approximately equivalent mean ages) and matched the gender of the avatar to the gender of the person. \u00a0We recorded heart rate, respiration rate and blood pressure as well as acceleration in three axes. 
\u00a0Briefly, we found all physiological indicators were significantly higher during the boxing compared to the bowling but no effect for the expression of the avatar. \u00a0However, when we controlled our analysis of physiological data for movement (using acceleration data), the effect of the game type disappeared and the effect of avatar reached statistical significance, i.e. blood pressure increased when people played with the angry avatar. \u00a0In other words, identification with an aggressive avatar increased autonomic activation during the game &#8211; why were our participants more physiologically activated when playing with the angry avatar? \u00a0In truth, we don&#8217;t know &#8211; it could be that the aggressive avatar augmented natural competitiveness or that it was more emotionally arousing to play with an expressive avatar or that players produced more testosterone when the avatar was angry. \u00a0This kind of research is exploratory, and as ever, we need another experiment&#8230;<\/p>\n<p>The reason I present this summary is to make a general point: physiological variables are very sensitive measures. \u00a0They respond to subtle psychological variables (appearance of avatar), \u00a0individual differences (between participants), the game context (boxing vs. bowling) and physical activity (movement). \u00a0This sensitivity of physiological variables is a large part of why they are useful for psychological research, but this sensitivity is a double-edged sword. \u00a0When your measures can be sensitive to so many things, you need careful experimental control if you want to interpret your data in a way that is robust and unambiguous, which are the qualities that will make your results meaningful to others.<\/p>\n<p><strong>Evaluation of Player Experience<\/strong><\/p>\n<p>The second context for psychophysiology and gaming is the evaluation of player experience as part of a design cycle. 
\u00a0The goal here is to inform the process of game design in order to produce better games and, more specifically, to confirm via play testing that the experience of the gamer conforms to the intentions of the team who were responsible for designing the game. \u00a0Game designers may construct a gaming experience with the objective of inducing different cognitive\/motivational\/emotional experiences and psychophysiology represents one means of confirming that these experiences have been achieved. \u00a0This type of testing can take place at the macro level (was my game scary?) or the micro level (was the part where the nice old lady turned into a zombie brandishing a chainsaw scary?).<\/p>\n<p>This type of testing generally takes place in a commercial context (though there are research projects dedicated to this topic such as <a href=\"http:\/\/fuga.aalto.fi\/\">FUGA<\/a>). \u00a0Software companies generally carry out play testing using observational methods in conjunction with post-game interviews, but there are exceptions to this rule such as the <a href=\"http:\/\/www-module.cs.york.ac.uk\/ait\/classes\/class07\/materials\/p443-kim.pdf\">TRUE<\/a> method developed by Microsoft Game Studios. \u00a0A number of people working in this field have put forward strong arguments for using psychophysiological methods to evaluate gamer experience &#8211; see the talks from Pejman Mirza-Babaei\u00a0and Lennart Nacke from our recent CHI workshop <a href=\"http:\/\/www.physiologicalcomputing.net\/?p=1624\">here<\/a> as examples. \u00a0Lennart was also the co-author of a recent methodological <a href=\"http:\/\/hci.usask.ca\/publications\/view.php?id=209\">paper<\/a> on how physiology could be combined with video and game events to triangulate player experience (i.e. to converge parallel data streams from physiology, observation and game events to understand player experience). 
\u00a0<a href=\"http:\/\/www.verticalslice.co.uk\/\">Vertical Slice<\/a> employ a similar approach known as Biometric Storyboarding (see the CHI link for Pejman&#8217;s presentation on this topic).<\/p>\n<p>Using physiology in the context of commercial usability testing is very different to using game software in order to explore psychological hypotheses as we did with the Wii study. \u00a0First of all, the purpose is to gain insight into player experience and then to convey that insight to game designers in a way which informs their practice. \u00a0There are several challenges to be overcome in that last sentence. \u00a0Also, we ought to recognise that this research may be confirmatory rather than exploratory &#8211; in other words, it should lead to a clear conclusion about the game experience, not just another experiment. \u00a0In the past\u00a0I worked at a human factors research institute that included both a research branch and a commercial\/consultancy arm. \u00a0I worked on projects in both sectors and experienced first-hand how the evaluation of technology diverged across academic and commercial sectors. \u00a0A large part of this difference was the need for clarity (how will your test help me sell more products &#8211; exactly) and expediency (I need your results now), which combined with financial restrictions to severely limit the number of participants that could be tested or the amount of time that could be devoted to analysis.<\/p>\n<p>However, none of these limiting factors change anything I said earlier about the use of physiology (or biometrics as the game industry seems to call it) to evaluate user experience. 
\u00a0These data remain volatile, variable and difficult to interpret without a high level of experimental control.<\/p>\n<p>Recently the Escapist published this <a href=\"http:\/\/www.escapistmagazine.com\/news\/view\/110384-Study-Declares-Dead-Space-2-Scariest-360-Game\">article<\/a> based on work by Vertical Slice on which is the scariest game for the Xbox 360. \u00a0For me, there were a number of problems with how physiology was used in the context of this study (at least going from the description in the aforementioned article) that encapsulate the problems of player evaluation with commercial products. \u00a0First of all,<\/p>\n<p>&#8220;the study, performed across four games (<em>Alan Wake, Dead Space 2, Condemned: Criminal Origins<\/em> and\u00a0<em>Resident Evil 5<\/em>) on six participants between the ages of 20 and 42, attempted to discover exactly which moments of the games were frightening.&#8221;<\/p>\n<p>So we have four games to be compared, which are similar in theme but also different in terms of mechanics &#8211; this is one big problem for evaluation using bespoke commercial products, the lack of systematic control across different software titles. \u00a0The alternative is to compare different versions of the game world that may be constructed from scratch or to use an SDK to create a systematic variation in the game world. Also, the number of participants in this study is very low, especially as the group of six players varied considerably with respect to age and were further divided into casual vs. core gamers. \u00a0What this does is simply increase the level of unsystematic &#8220;noise&#8221; in the data. \u00a0Although achieving statistical significance is perhaps not a pressing issue for this kind of commercial work, the noise does make it very hard to find consistency in the data at all, except by considering data from one individual. 
\u00a0A description of the data collection is presented below from the original article:<\/p>\n<p>&#8220;To measure fear, Vertical Slice had each of the participants play about 30 minutes of each game in a counter-balanced order to reduce bias. During play, the participants were asked to think out loud; at the same time, their heart rate, skin surface temperature, Galvanic Skin Response (which measures excitement or frustration) and, in some cases, respiration were measured. After playing, the volunteers were asked to analyze their experience.&#8221;<\/p>\n<p>It should be noted that the researchers counterbalanced the presentation of the games to control for order effects. \u00a0I was very surprised to hear that a think-aloud protocol had been used. \u00a0Speaking exerts a profound effect on breathing rate and heart rate &#8211; since breathing and heart rate are physiologically linked. \u00a0In my view, these data would be effectively useless as some people will talk a lot and some will speak very little. \u00a0In that case, I assume that GSR was mainly used for the high\/low scare scale that appears in the figures accompanying the article. Now here I have a major problem due to what I perceive to be a gross simplification of the psycho-physiological inference. \u00a0As noted in the article, GSR is associated with excitement or frustration; according to <a href=\"http:\/\/www.sciencedirect.com\/science\/article\/pii\/S0301051110000827\">this<\/a> recent review article on autonomic markers of emotion, GSR has been associated with a range of high activation emotions, including excitement, happiness, anger, frustration and fear. \u00a0GSR basically measures activation of the sympathetic nervous system; it is inherently ambiguous with respect to emotional labels (i.e. 
a one-to-many relationship, see my 2009 <a href=\"http:\/\/web.mac.com\/shfairclough\/Stephen_Fairclough_Research\/Publications_physiological_computing_mental_effort_stephen_fairclough_files\/shf_IWC_final.pdf\">paper<\/a> for a full description of the complexity of this inference).<\/p>\n<p>This is why experimental control is so important because context is everything for the interpretation of physiological data.<\/p>\n<p><strong>Why Experimental Control is Important for Everyone<\/strong><\/p>\n<p>I don&#8217;t wish to bag the guys at Vertical Slice; I think the storyboarding technique they have developed is really interesting and I am reading about their work from a secondary source. \u00a0But I do want to raise a debate about how physiological measures should be used in a commercial game testing situation. \u00a0When I worked in the commercial realm, I was sometimes told by the old hands that a high level of experimental control was unnecessary in the realm of consultancy because this was &#8220;quick and dirty&#8221; testing. \u00a0I always hated that phrase because the financial stakes are very high in commercial work and researchers need to be as cautious and meticulous as they would be in a pure academic research setting (in fact, more so). \u00a0On the other hand, is it realistic to expect a software company to allow me to test 80+ participants (at a consultancy rate) in order to figure out what kind of avatar they should use in their game? \u00a0 And then to charge them again for a follow-up experiment? \u00a0The applied research on game testing methodology (such as the FUGA project mentioned earlier) charts a middle path between these extremes of academic exploration and commercial confirmation, but no researcher who is using physiological measures can hope to dodge the requirement for real experimental control if they wish to present a confident interpretation of physiological data to their clients. 
\u00a0The use of scientific tools is not the same as employing a scientific methodology, and gross simplification of what physiological measures mean will eventually damage the credibility of this approach in the domain of player evaluation &#8211; because people will figure out that interpretation is ambiguous and the lines on the chart are open to multiple interpretations.<\/p>\n<p><strong>What Kind of Evaluation Do Game Designers Want?<\/strong><\/p>\n<p>In the interest of transparency, and to prevent an accusation of criticising Vertical Slice from an academic ivory tower, I&#8217;d like to present some &#8220;quick and dirty&#8221; testing of our own in order to end this post on yet another dilemma and to show that we&#8217;re guilty of the same errors. \u00a0Below you can see heart rate data for a single player &#8211; this person played the Sony PlayStation game &#8220;WipeOut HD&#8221; (a futuristic racing game for those unfamiliar with the title) under three conditions of difficulty (defined as increased speed of opponents); venom is the easiest level, phantom is the hardest level. 
\u00a0We logged heart rate in each condition by the player&#8217;s position in the race, reasoning that sympathetic activation would be highest when the task was hard and the player was in the highest position in the race.<\/p>\n<p><a href=\"http:\/\/www.physiologicalcomputing.net\/wordpress\/wp-content\/uploads\/2011\/06\/Player-1-Position-HR-Bin1.jpg\"><img loading=\"lazy\" class=\"aligncenter size-full wp-image-1721\" title=\"Player 1 - Position - HR Bin - Wipeout\" src=\"http:\/\/www.physiologicalcomputing.net\/wordpress\/wp-content\/uploads\/2011\/06\/Player-1-Position-HR-Bin1.jpg\" alt=\"\" width=\"650\" height=\"429\" srcset=\"http:\/\/www.physiologicalcomputing.net\/wordpress\/wp-content\/uploads\/2011\/06\/Player-1-Position-HR-Bin1.jpg 650w, http:\/\/www.physiologicalcomputing.net\/wordpress\/wp-content\/uploads\/2011\/06\/Player-1-Position-HR-Bin1-300x198.jpg 300w\" sizes=\"(max-width: 650px) 100vw, 650px\" \/><\/a><\/p>\n<p>As you can see, the data conform nicely to our expectations: heart rate is higher for the most demanding game (phantom) and rises as the person progresses from the lower to the higher race position. \u00a0Although this is only one person, I&#8217;d be confident in seeing the same trend if we ran another 20 people through the design. \u00a0The nice thing about these data is that we used a commercial piece of game software in a controlled way, and like Vertical Slice, we have placed our physiological data in the context of specific gaming events. \u00a0Leaving aside the issue of interpretation (we assume heart rate increased due to sympathetic activation as a response to high demand or motivation or some mixture of both), to what extent are these data really useful for a game designer? \u00a0At the very least, they confirm that player experience differs across the venom\/rapier\/phantom levels &#8211; but does that really inform their practice? 
\u00a0Or is that aiming too high and all commercial customers really want from these measures is to check their design assumptions? \u00a0In my opinion, psychophysiological measures (or biometrics) will only really deliver insight when they are used to ask questions to which the industry doesn&#8217;t already know all the answers. \u00a0And here&#8217;s the kicker: if a team of game designers gave me five different versions of the same game or game scenario, which were identical except in one crucial respect (i.e. controlled in an experimental sense), then my psychophysiological measures might tell them something unexpected, something that would really enhance their understanding of the product.<\/p>\n<p>Call me cynical but I don&#8217;t expect it to happen anytime soon.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This post represents some thoughts on the use of psychophysiology to evaluate the player experience during a computer game. \u00a0As such, it&#8217;s tangential to the main business of this blog, but it&#8217;s a topic that I think is worth some discussion and debate, as it raises a whole bunch of pertinent issues for the design 
[&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":""},"categories":[5,7],"tags":[106,105,96],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/pY315-ro","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"http:\/\/www.physiologicalcomputing.net\/index.php?rest_route=\/wp\/v2\/posts\/1698"}],"collection":[{"href":"http:\/\/www.physiologicalcomputing.net\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.physiologicalcomputing.net\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.physiologicalcomputing.net\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"http:\/\/www.physiologicalcomputing.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1698"}],"version-history":[{"count":1,"href":"http:\/\/www.physiologicalcomputing.net\/index.php?rest_route=\/wp\/v2\/posts\/1698\/revisions"}],"predecessor-version":[{"id":4691,"href":"http:\/\/www.physiologicalcomputing.net\/index.php?rest_route=\/wp\/v2\/posts\/1698\/revisions\/4691"}],"wp:attachment":[{"href":"http:\/\/www.physiologicalcomputing.net\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1698"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.physiologicalcomputing.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1698"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.physiologicalcomputing.net\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1698"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}