Quantifying Effects of Exposure to the Third and First-Person Perspectives in Virtual-Reality-Based Training
JULY-SEPTEMBER 2010 (Vol. 3, No. 3) pp. 272-276
1939-1382/10/$31.00 © 2010 IEEE

Published by the IEEE Computer Society
Patrick Salamin

Tej Tadi

Olaf Blanke

Frédéric Vexo

Daniel Thalmann

Abstract—In recent years, the use of the third-person perspective (3PP) in virtual training methods has become increasingly viable. Despite the growing interest in virtual reality and in the graphics underlying 3PP usage, few studies have systematically examined the dynamics of, and differences between, the third-person and first-person perspectives (1PP). The current study was designed to quantify the differences between the effects induced by training participants in the third-person and first-person perspectives in a ball catching task. Our results show that, for a certain trajectory of the stimulus, the performance of the participants after 3PP training is similar to their performance after normal perspective training. Performance after 1PP training differs significantly from both 3PP and the normal perspective.

In video games and virtual environments (VEs), two visual perspectives are generally available to the users: a first-person perspective (1PP), in which the camera is in the position of the avatar's eye, and a third-person perspective (3PP), in which the camera follows the avatar with an adjustable distance and angle of view. The use of 3PP has become a topic of interest because of its technological impact on society; e.g., more therapists and medical professionals are using virtual reality for rehabilitation. This interest also extends to artists like Marc Owens, who seems to have been inspired by our first 3PP prototype. 1
A study on the effects of the egocentric-exocentric perspective dimension on collaborative navigation performance was conducted by Yang and Olson [ 1 ]. In 1999, Rouse stated in [ 2 ] that the first-person perspective was dead within the games VE, and a few years earlier, Bauman [ 3 ] wrote "The question is not 'Is your game 3D?' It is more on the lines of, 'What type of 3D will it be?'" Given the many different aspects involved and the choice of visual perspectives available to the common VR user, as a first step we exposed participants to the first and third-person perspectives as in video games, in order to verify whether naive users preferred the third-person perspective for moving actions and the first-person perspective for fine manipulations, as has been observed for gamers [ 4 ]. In 2008, Hemmert et al. proposed changing on-screen views through a single-sided eye closure [ 5 ]. Though this solution provided more interesting information to the viewer, the view streams could not be correlated. On the other hand, Yang and Olson [ 1 ] proposed the integration of different perspectives in one interface and designed an empirical study to test the effectiveness of different perspective displays on collaborative navigation performance.
We follow a similar design with our improved third-person perspective setup [ 6 ] that combines first and third-person perspectives. As we noticed that occlusion of the viewer's body could influence the performance in some tasks, e.g., fine manipulation with the hands, we represented the viewer as a transparent ghost by adding the first-person perspective video stream to the third-person perspective at the head location. We then presented both perspectives emulated in reality for our experiment [ 6 ]. In parallel, during simulations in VEs, the view presented to the participant results from a virtual camera in the VE. The position of this camera depends on two factors: the displacements of the avatar (representation of the participant in the VE) and the perspective we intended to provide to the participant. In the first option, the camera follows the avatar's position for every action (e.g., running, jumping, and staying) while the second adds an offset to this position (e.g., in front or behind the avatar).
During these simulations, we expected higher levels of "presence" (the "sense of being there" [ 7 ]) for 3PP training in comparison to 1PP exposure. Participants could see their own body within this perspective and we hypothesize that this fidelity between the representation of the body as seen in the Head-Mounted Display (HMD) with their own body movements contributes to an increase in the feeling of presence in the 3PP training condition as compared to the 1PP training condition.
Observations from our task indicate that the performance of the participants post3PP training was closer to their performance postnormal perspective training (subjects looked through the HMD with their own eyes with no video input from the camera) and their performance post1PP differed significantly from both 3PP and normal perspectives. We hope that the neural patterns recorded through the 16-electrode electroencephalogram (EEG) recordings will reveal further insights into this observation.
2. Previous Work
As humans navigate in virtual environments, they very often underevaluate the distances [ 8 ]. Initial studies indicated that the biases from display devices such as the Head-Mounted Displays were responsible for this problem. Distortions introduced by the HMD Field-of-View (FoV), the monovision in certain cases, or the HMD itself and the view designed by the experimenter were among the different parameters considered.
It was proved in 2004 that this underevaluation of the distance between the person and an object was not due to the HMD FoV. Knapp and Loomis [ 9 ] showed that the significant under perception of distance observed in several studies on distance perception in VE is not caused by the limited FoV of the HMD.
Other studies suggested that the distance underevaluation could also be a consequence of the lack of stereo vision. In 1991, Drascic [ 10 ] showed that stereo helps viewers evaluate distances faster, and Arsenault and Ware confirmed that stereo is very important for eye-hand coordination during fine manipulations like fishing [ 11 ]. In 2008, however, Willemsen et al. [ 12 ] found the contrary by manipulating stereo viewing conditions in an HMD: their results indicate that the amount of compression of distance judgments is unaffected by these manipulations. This study supported Creem-Regehr et al., who had claimed four years earlier that FoV and binocular restrictions do not largely contribute to distance underevaluation [ 13 ].
Willemsen et al. showed in [ 14 ] that, for the same task performed in the real world, distance judgments to objects on the ground are compressed in VE with an HMD, at least when indicated through visually directed walking tasks. Results of this experiment indicate that the mechanical aspects of HMDs cannot explain the full magnitude of distance underestimation seen in HMD-based VE, though they may account for a portion of the effect.
Regardless of the viewer and the VR equipment used during the simulations, Lampton et al. showed in 1995 that viewers consistently underevaluate distances in VE [ 15 ]. This was confirmed in 2003 by Loomis and Knapp [ 16 ] and Messing and Durgin [ 17 ]. In these experiments, in which a real environment was observed through an HMD via live video streaming, distances measured by visually directed walking were underestimated even when the perceived environment was known to be real and present. However, the underestimation was found to be linear, which could mean that higher order spatial perception effects might be preserved in VR. Thompson et al. also showed a year later that it is not the quality of graphics that leads to distance underevaluation [ 18 ]. It is important to note that accurate perception of the distance between an object and a nearby surface can increase a viewer's sense of presence in an immersive environment, particularly when a participant is performing actions that affect or are affected by this distance [ 19 ]. This underevaluation of distances can introduce biases and lead to breaks in presence [ 7 ].
3. Paradigm
The goal of the current experiment was to see whether exposure to the 3PP and 1PP, as opposed to the normal perspective, for as little as 15 minutes can alter the performance of naive participants in a "ball catching" task, thus giving us more insight into the advantages and effects of using these perspectives for training in virtual reality. In order to simulate both perspectives (1PP and 3PP), we used a camera that could be placed at two locations. These locations can be considered static with reference to the participant. We simulate a perspective close to the 1PP ( Fig. 1 a) with a camera attached to the center of the HMD, three centimeters in front of and five centimeters above the midpoint between the participant's eyes. This perspective corresponds approximately to the normal view in daily life scenarios. With the help of a rigid backpack, we placed a camera 80 cm behind and 60 cm above the participant's eyes, looking slightly down (at an angle of 7.3 degrees below the horizontal). We obtained these measures empirically so that the participant could see the head and shoulders of the avatar (him/her) as well as the environment around his/her body. The 3PP ( Fig. 1 b) has been used in video games for several years now. It seems to be preferred in action games while the avatar is moving in galleries [ 2 ] and provides a more global (wider) view of the environment despite the occlusions. This perspective seems unsettling at the beginning but, as shown in [ 4 ], participants adapt to it after a few minutes, and it even seems to facilitate their performance for certain tasks.

Fig. 1. View from the camera: (a) the simulated first-person perspective and (b) the simulated third-person perspective.

4. Methods
4.1 Participants
Twelve naive participants (aged 20 to 30 years) performed the experiment. All participants were right-handed, had normal or corrected-to-normal vision, and declared no history of neurological or psychiatric disorders. All participants gave written informed consent prior to inclusion in the study.
4.2 Role of the Backpack
In order to improve the participant's comfort, we built a rigid backpack with a strapped suitcase which held all the equipment. The backpack had a 1-meter-long arm, and a camera fixed on top of the arm provided the 3PP video stream (see Fig. 2 ). We used a rigid backpack to minimize oscillations of the setup, because camera movements induced when the participants moved could cause dizziness. It is important to note that (as shown in [ 4 ] and [ 5 ]) during the 3PP exposure, some participants reported discomfort and slight dizziness, as the camera fixed at the end of the arm follows (and amplifies) the movements of the participant's trunk, which can be perturbing while the participant walks in the environment or moves to catch a ball.

Fig. 2. VR equipment: (a) spy camera coupled with a HMD, (b) a mask, and (c) the backpack.

The camera was fixed 80 cm behind and 60 cm above the participant's eye position, pitched 7.3 degrees downward from the horizontal. The field of view was 60 degrees, which enabled the participant to see his/her shoulders, head, and objects in front of him/her at a distance greater than or equal to 1.5 m, corresponding to the footsteps. For the first-person perspective, we mounted the camera on the HMD, centered in front of the eyes.
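The two camera placements can be expressed as fixed offsets from the participant's eye position. The following is a minimal sketch of that geometry, not the authors' implementation: the world frame (x and y horizontal, z up), the heading convention, the function name camera_pose, and the zero pitch for the 1PP camera are assumptions made for illustration. Only the offsets themselves (3 cm forward and 5 cm up for 1PP; 80 cm behind, 60 cm up, and 7.3 degrees of downward pitch for 3PP) come from the text.

```python
import math

# Offsets from the text, in metres, relative to the participant's eye position.
# "forward" is along the participant's heading; "up" is vertical.
# The 1PP pitch is not stated in the text and is assumed to be zero here.
OFFSETS = {
    "1PP": {"forward": 0.03, "up": 0.05, "pitch_down_deg": 0.0},
    "3PP": {"forward": -0.80, "up": 0.60, "pitch_down_deg": 7.3},
}


def camera_pose(eye_xyz, heading_deg, perspective):
    """Return (position, view_direction) of the camera in world coordinates.

    Assumed world frame: x and y span the floor, z points up, and the
    heading is measured in the floor plane from +x toward +y.
    """
    off = OFFSETS[perspective]
    h = math.radians(heading_deg)
    fx, fy = math.cos(h), math.sin(h)  # unit forward vector on the floor plane
    x, y, z = eye_xyz
    position = (x + off["forward"] * fx, y + off["forward"] * fy, z + off["up"])
    p = math.radians(off["pitch_down_deg"])
    # Tilt the forward vector downward by the pitch angle.
    direction = (math.cos(p) * fx, math.cos(p) * fy, -math.sin(p))
    return position, direction
```

For a participant whose eyes are at a hypothetical height of 1.7 m and who faces along +x, camera_pose((0.0, 0.0, 1.7), 0.0, "3PP") places the camera 0.8 m behind and 0.6 m above the eyes, with a view direction that points slightly downward.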
4.3 Video Equipment
Because of the need for mobility during the exposure period, we chose wireless and battery powered equipment. We set up a radio color spy camera ( Fig. 2 a) with a wide FoV providing a video stream in PAL format (628 by 482 with a 62 pinhole). It weighs a few grams and is powered by a 9 V battery which lasts for two hours, and can thus be fixed on the HMD during the different exposure periods.
The video was then sent to the SONY Glasstron PLM-S700E HMD ( Fig. 2 b), with a resolution of 800 by 600 at a refresh rate of 60 Hz, via a receiver in the backpack powered by a 3.6 V battery. In order to occlude external visual distractors, we asked the participants to wear a mask. Thus, the participants saw only the video feed from the camera ( Fig. 4 ).
5. Procedure
5.1 Stimuli
The stimuli presented to the participant were a series of video clips based on a 3D virtual environment in which a ball originates from a fixed origin at an approximate distance of 20 m. They were displayed to the participant in the HMD. The ball traveled toward the participant from the fixed origin and arrived at one of three final distances on each side (left and right). The distances were 20, 60, and 150 cm (as shown in Fig. 3 ). For the first distance (20 cm), we hypothesized that participants would always be able to catch the ball with the arms extended and without moving the trunk. For the second distance (60 cm), we hypothesized that there would be some ambiguity introduced in judgment between the different training conditions. For the third distance (150 cm), we hypothesized that the participant would not be able to catch the ball with arms extended and without moving the trunk. Each stimulus video was presented for a duration of 200 ms, followed by a random interstimulus interval. Responses were recorded through a serial response box: the participant pressed either the left or the right button on the box to indicate whether or not he/she could catch the ball.

Fig. 3. Stimuli (at the three distances on the participant's left side) presented to the participant during the task. The figure presents the trajectories of the balls arriving at a lateral distance of 150, 60, and 20 cm from the participant.

Fig. 4. Experiment performed with the participants: walking in a corridor and a slalom between pillars, evaluating distance of a wall in front of them, and playing football and basketball at the 1PP, 3PP, and normal perspective.

We also counterbalanced the response hand across participants. They used the index and middle fingers to respond. The presentation order of the stimuli (20, 60, and 150 cm) was randomized within one continuous block of testing, and the order of perspective training (3PP, 1PP) was also randomized across participants. Subjective reports were recorded, and eight of the 12 participants confirmed that the 20 and 150 cm distances were easier to judge than the stimulus trajectory at a distance of 60 cm.
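The randomization and counterbalancing described above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: the number of repetitions per stimulus is not given in the text and is left as a parameter, the balanced assignment of training order and response hand (three participants per combination) is an assumed scheme, and the names make_block and assign_counterbalancing are hypothetical.

```python
import random
from itertools import product

DISTANCES_CM = (20, 60, 150)  # final lateral distances from the text
SIDES = ("left", "right")


def make_block(repetitions, rng):
    """Build one continuous, randomly ordered block of stimulus trials.

    `repetitions` (trials per distance/side pair) is an assumed parameter;
    the text does not state how many trials were presented.
    """
    trials = [
        {"distance_cm": d, "side": s}
        for d, s in product(DISTANCES_CM, SIDES)
        for _ in range(repetitions)
    ]
    rng.shuffle(trials)
    return trials


def assign_counterbalancing(n_participants, rng):
    """Assign training order and response hand evenly across participants."""
    orders = [("1PP", "3PP"), ("3PP", "1PP")]
    hands = ["left", "right"]
    cells = [(o, h) for o in orders for h in hands]
    # Cycle through the four order/hand combinations, then shuffle who gets which.
    plan = [cells[i % len(cells)] for i in range(n_participants)]
    rng.shuffle(plan)
    return [
        {"participant": i + 1, "order": o, "response_hand": h}
        for i, (o, h) in enumerate(plan)
    ]
```

Passing a seeded random.Random instance to both functions makes the generated trial order and assignment reproducible across runs.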
5.2 Timing and Walking Paths
The experiment lasted about 150 minutes for each participant. Participants were tested using a 16-channel EEG system after exposure to the different perspectives. The results from the EEG recordings are not discussed in this paper.
Baseline stimulus (5'). The participant does the "ball catching" task before being primed by any perspective.
Exposure (1PP/3PP) (20'). The participant performs different tasks while being exposed to one of our perspectives with the HMD. The order of this exposure is counterbalanced. The participant performs the "ball catching" task after the perspective training.
Exposure (3PP/1PP) (20'). The participant performs different tasks while being exposed to the other perspective with the HMD. The order of this exposure is counterbalanced. The participant performs the "ball catching" task again after the perspective training.
Control exposure (see-through) (20'). The participant performs the different tasks one last time with the HMD in see-through mode (without the camera video stream) and then does the "ball catching" task again.
5.3 Exposure Parameters
The exposure was composed of six steps: adaptation walk, slalom between pillars, distance evaluation with a door to open, football pass, basketball pass, and walking back. These tasks were undertaken across all the training conditions and the control exposure.
The exposure phase can be separated into three ordered steps. In the first step, "Adaptation," the subjects walked in a corridor and performed a slalom. In the second step, "Static," they evaluated the distance to a wall (while walking) and stopped before touching it. The final step, "Dynamic," consisted of online interaction with the experimenter using the football.
We chose to perform these steps in this predefined order as it corresponded to the difficulty level of each task. The participants went through all the steps for the different training conditions.
The first stage consisted of walking through a 50-meter long gallery composed of two 90-degree curves with some obstacles of several sizes on the ground and then a slalom between 10 pillars. The participants were given no prior instructions on the location of the obstacles. During the walking phase, we recorded subjective reports from the participants on their preferred perspective (1PP or 3PP). We made sure that participants did not run into obstacles and the walls. We also recorded the walking speed and the total time taken across different stages.
In the next stage, in which the participant interacted with a static environment, the participant walked toward a door and tried to open it. As mentioned in [ 9 ], distances were misjudged because of the bias induced by the perspective, which resulted either in a collision with the door or in the participant missing the handle because he/she was not yet close enough to the door.
In the last stage of the exposure, the participant interacted with the experimenter. Here, we sent a ball to the participant in two different ways: with the foot (rolling ball) and with the hand (flying ball) and verified if he/she could extrapolate the position of the approaching objects in the 3PP as compared to the 1PP.
5.4 Behavioral Data
We recorded response times (RTs) and error rates. $3 \times 3$ repeated-measures analyses of variance (ANOVAs) were run on RTs and error rates for the recorded trials, with visual perspective (3PP, 1PP, and Baseline) and stimuli (20, 60, and 150 cm) as within-participant factors. Since we observed similar results in terms of performance and response times for the two sides, we paired corresponding stimuli from the left and the right; hence, the 20, 60, and 150 cm L/R stimuli were paired together.
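For readers unfamiliar with the design, a two-way repeated-measures ANOVA with both factors within participants can be sketched as below. This is a textbook sum-of-squares decomposition written for illustration; the text does not say which software was used, and the function name rm_anova_two_way and the nested-list data layout are assumptions. The sketch returns F ratios and degrees of freedom only; it applies no sphericity correction, and converting F to a p-value would require the F distribution (for example from SciPy), which is not done here.

```python
from itertools import product
from statistics import mean


def rm_anova_two_way(y):
    """Two-way repeated-measures ANOVA, both factors within participants.

    y[s][a][b] holds one score for participant s, perspective level a, and
    distance level b (for example, a cell mean over left/right trials).
    """
    n, na, nb = len(y), len(y[0]), len(y[0][0])
    S, A, B = range(n), range(na), range(nb)

    # Grand mean and marginal means.
    g = mean(y[s][a][b] for s, a, b in product(S, A, B))
    m_s = [mean(y[s][a][b] for a, b in product(A, B)) for s in S]
    m_a = [mean(y[s][a][b] for s, b in product(S, B)) for a in A]
    m_b = [mean(y[s][a][b] for s, a in product(S, A)) for b in B]
    m_sa = [[mean(y[s][a]) for a in A] for s in S]
    m_sb = [[mean(y[s][a][b] for a in A) for b in B] for s in S]
    m_ab = [[mean(y[s][a][b] for s in S) for b in B] for a in A]

    # Effect sums of squares.
    ss_a = n * nb * sum((m_a[a] - g) ** 2 for a in A)
    ss_b = n * na * sum((m_b[b] - g) ** 2 for b in B)
    ss_ab = n * sum(
        (m_ab[a][b] - m_a[a] - m_b[b] + g) ** 2 for a, b in product(A, B)
    )
    # Error terms: each effect is tested against its interaction with subjects.
    ss_as = nb * sum(
        (m_sa[s][a] - m_s[s] - m_a[a] + g) ** 2 for s, a in product(S, A)
    )
    ss_bs = na * sum(
        (m_sb[s][b] - m_s[s] - m_b[b] + g) ** 2 for s, b in product(S, B)
    )
    ss_abs = sum(
        (y[s][a][b] - m_sa[s][a] - m_sb[s][b] - m_ab[a][b]
         + m_s[s] + m_a[a] + m_b[b] - g) ** 2
        for s, a, b in product(S, A, B)
    )

    def f_ratio(ss_eff, df_eff, ss_err, df_err):
        return {"F": (ss_eff / df_eff) / (ss_err / df_err), "df": (df_eff, df_err)}

    return {
        "perspective": f_ratio(ss_a, na - 1, ss_as, (na - 1) * (n - 1)),
        "distance": f_ratio(ss_b, nb - 1, ss_bs, (nb - 1) * (n - 1)),
        "interaction": f_ratio(
            ss_ab, (na - 1) * (nb - 1), ss_abs, (na - 1) * (nb - 1) * (n - 1)
        ),
    }
```

With 12 participants and a 3 x 3 design, the two main effects are tested on (2, 22) degrees of freedom and the interaction on (4, 44).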
6. Results
6.1 Performance
Fig. 5 shows the mean ${\rm \%NO}$ responses (percentage of times the participant answered that he/she could not catch the ball) for the correct responses of all participants across the different conditions (baseline, post1PP, and post3PP). There was a main effect of stimulus (20, 60, and 150 cm) $(p < .001)$ , with ${\rm \%NO}$ responses increasing with the distance of the stimulus video. There was a significant Perspective $\times$ Video interaction $(p < .001)$ . No significant main effect of perspective was found. Posthoc tests revealed a significant difference between 3PP and 1PP and between baseline and 1PP, but no effect was found between 3PP and baseline, suggesting that participants performed similarly for these two perspectives. The values are presented in Table 1 .
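The ${\rm \%NO}$ measure can be computed from trial records as in the sketch below. The record layout (dictionaries with the keys perspective, distance_cm, side, and answer) and the function name percent_no are hypothetical; the text does not describe how responses were stored. Left and right trials are pooled, as described in Section 5.4.

```python
from collections import defaultdict


def percent_no(trials):
    """Percentage of "cannot catch" answers per (perspective, distance) cell.

    Each trial is a dict with keys "perspective", "distance_cm", "side",
    and "answer" ("yes" or "no"). The side is ignored, so left and right
    trajectories at the same distance are pooled.
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [number of "no", total]
    for t in trials:
        cell = (t["perspective"], t["distance_cm"])
        counts[cell][0] += t["answer"] == "no"
        counts[cell][1] += 1
    return {cell: 100.0 * no / total for cell, (no, total) in counts.items()}
```

Running this once per participant gives the per-participant cell values that enter the repeated-measures analysis.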

Fig. 5. Participant performance: The graph shows the percentage of times participants thought they could not catch the ball at the three previously mentioned distances: 20, 60, and 150 cm. Left and right trajectories are combined. We see that the performance postbaseline and 3PP exposure in the "ball catching" task varies significantly from 1PP at 60 cm.

Table 1. Performance: The Table Shows the Mean Values (and the Standard Errors) After Each Simulation for Stimuli with the Trajectories at Different Distances

6.2 Response Times
Fig. 6 shows the mean response times (ms) for the correct responses of all participants across the different conditions (baseline, post1PP, and post3PP). There was no main effect of perspective $(p = .09)$ . No significant Perspective $\times$ Video interaction was found $(p = .43)$ . We did not really expect different patterns in the response times as the participant had to wait till the stimulus video was complete before they could respond. The values are listed in Table 2 .
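The mean values and standard errors reported for each condition can be obtained as in the following sketch. It assumes that the standard error is the sample standard deviation of the per-participant values divided by the square root of the number of participants; the text does not state the exact formula, and the function name mean_and_se is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def mean_and_se(values):
    """Return the mean and the standard error of the mean.

    `values` holds one number per participant for a given condition,
    for example that participant's mean RT (ms) on correct trials.
    The sample standard deviation (n - 1 denominator) is used.
    """
    n = len(values)
    return mean(values), stdev(values) / sqrt(n)
```

At least two values are needed per condition, since the sample standard deviation is undefined for a single observation.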

Fig. 6. Participant response times: The graph shows the average response times of the participants to the stimuli at the three previously mentioned distances: 20, 60, and 150 cm. There were no significant differences found across the different perspectives.

Table 2. Response Times: The Table Shows the Mean Values (and the Standard Errors) After Each Simulation for Stimuli with the Trajectories at Different Distances

7. Discussion
The main goal of the current study was to systematically quantify and explore the differences, if any, between the two visual perspectives (3PP and 1PP) that are now commonly used in games and virtual reality. We also looked further at the preference for the 3PP that was reported by different participants for the navigation task spread across different stages of our experiment. We measured behavioral patterns (performance and RTs) after 3PP and 1PP training of as little as 15 minutes in the same "ball catching" task. This revealed that, after 3PP training, the performance of the participants was similar to their performance in the normal daily life perspective, and that they were able to adapt to a different visual perspective in as little as 15 minutes. We are confident that the recorded neural activity will shed more light on the brain states underlying these processes.
Although 3PP has now become a more commonly used perspective in VE and 3D games, there have been no systematic empirical studies of the dynamics of its usage. Researchers in the field of presence have speculated on the levels of presence experienced between different perspectives [ 20 ]. Hence, we think that it is important as a first step to look at the differences between 3PP and 1PP exposures on participants performing the same task. In our previous studies [ 4 ], [ 6 ], we found that even though 3PP was less familiar and could introduce biases (e.g., occlusions due to the participant's body), it was often preferred by participants and even provided better results than 1PP during the simulations. Nine of the twelve participants reported that spatial navigation tasks were easier to perform because they could see their body. In the current study, we primed participants to different perspectives (3PP and 1PP) for the same duration (15 minutes) and then looked at their performance in a ball catching task. Since the conditions were similar across both training phases, we predicted that differences in the performance of the participants could be attributed to the effects induced by the perspective they were primed to just prior to the task.
In the error rates (performance), we did not find a significant difference across the three perspective training conditions (3PP, 1PP, and Baseline) for the stimulus trajectories at 20 and 150 cm. For the stimulus trajectory at a final distance of 60 cm from the participant's head, we found no significant difference between the 3PP and baseline training conditions, but there was a significant difference between 3PP and 1PP training and between the baseline and 1PP training conditions. Thus, for the ambiguous stimulus trajectory, the participants seemed to perform similarly in the 3PP and in their normal everyday perspective. This could explain to an extent the preference for 3PP usage in gaming and navigation in virtual environments. Hence, usage of the 3PP in training and learning methods might prove to be more effective, as we find that training with the 3PP facilitates performance and leads to quicker adaptation of distance evaluation in the extrapersonal space. In the current study, we also tested a goalkeeper, and training in the third-person perspective facilitated his performance for wider trajectories as compared to the naive participants. To pursue this hypothesis, future experiments will involve professional sportsmen (tennis, football, and basketball) and compare their performance and brain activity with those of the general naive population. In the current experiment, the duration of training was limited to around 15 minutes, and this led to shorter readaptation times for the participants (around five minutes); hence, they had to perform the task immediately after exposure to the different training conditions. In future experiments, we would also like to vary the training time between 15 minutes and a couple of hours to see the modulation in adaptation mechanisms.
We also think that some improvements need to be made to the equipment, especially the backpack, which is quite cumbersome and could introduce a bias by limiting the natural movement of the participants while they avoid different obstacles.
The successful mapping of interactions, such as geometrical mappings of the body with the environment and external objects, both within virtual environments and the real world and relative to each other, requires the study of fundamental components of these interactions, such as the origin of the spatial perspective (1PP and 3PP) and how these contribute to the user's performance in virtual environments. Hence, through this study, we have taken the first steps to quantify the performance of users in a VE after exposure to the different perspectives. The participants addressed the same task after exposure to different perspectives and learned to use the perspectives to solve the task. They performed differently for a particular ambiguous trajectory (60 cm) depending on the perspective they had just been exposed to. Hence, differences in performance should reflect the effects induced by the perspective the participants learned to use during the exposure. The results from the task give us an insight into the effectiveness of using the right perspective, which enables optimal performance across different sessions and does not depend on the order of exposure (general learning effect). We think that this has an important influence on training and learning procedures, as the study gives us an idea of which mappings are permissible and which elements degrade performance for this specific scenario (ball catching task) for exposures (as little as 15 minutes) to the commonly used perspectives (3PP and 1PP) in virtual reality. In future studies, it will be important to look at the malleability of these effects across different 3D techniques and interactions in virtual environments.
The choice of the optimal visual perspective is a fundamental parameter to study, especially for different learning and training methods in gaming and virtual reality, as the interactions of humans in these worlds need to be convincing for these methods to be effective. We are confident, based on our results, that the usage of the third-person perspective as compared to the first-person perspective for training methods in virtual reality can be a viable and efficient solution.


This work has been partially supported by the European Coordination Action FOCUS K3D. Patrick Salamin and Tej Tadi have contributed to this paper equally.

    P. Salamin, F. Vexo, and D. Thalmann are with VRLab, EPFL, EPFL-IC-ISIM-VRLab, Station 14, CH-1015 Lausanne.

    E-mail: {patrick.salamin, frederic.vexo, daniel.thalmann}@epfl.ch.

    T. Tadi and O. Blanke are with LNCO-BMI, EPFL, EPFL-SV2805-BMI-LNCO, Station 19, CH-1015 Lausanne.

    E-mail: {tej.tadi, olaf.blanke}@epfl.ch.

Manuscript received 24 Nov. 2009; revised 9 Mar. 2010; accepted 16 May 2010; published online 14 July 2010.

For information on obtaining reprints of this article, please send e-mail to: lt@computer.org, and reference IEEECS Log Number TLT-2009-11-0154.

Digital Object Identifier no. 10.1109/TLT.2010.13.

1. http://www.marcowens.co.uk/products.html.