Abstract
Keywords
1. Introduction
In recent years, research on human motion modeling, prediction, and interaction with social and service robots has grown rapidly, driven by industrial interests and the quest for safer algorithms in human–robot interaction settings. Various advanced automated systems, such as mobile robots (including autonomous vehicles), manipulators, and sensor networks, benefit from human motion models for safe and efficient operation in the presence of humans. Human motion data is central to human-aware path planning, collision avoidance, tracking, interaction, understanding human activities, and collaborating on shared tasks.
Modern approaches to modeling human motion require plentiful data, recorded in diverse environments and settings, for both training and evaluation (Rudenko et al., 2020b). Among the growing number of human trajectory datasets, most focus on capturing interactions between the moving agents in indoor (Brščić et al., 2013), outdoor (Robicquet et al., 2016), and automated driving (Bock et al., 2020) settings. These datasets are designed to study how people interact and avoid collisions in social settings by describing their motion through position and velocity information. Further datasets attempt to capture full-body motion in various activities and human–object interactions in household settings (Ehsanpour et al., 2022; Kratzer et al., 2020; Liu et al., 2019).
Human motion is influenced by many exogenous factors, which cumulatively amount to the
Furthermore, beyond the environment context, there are various aspects of the specific person—
Existing datasets in human motion analysis often lack the comprehensive inclusion of exogenous factors and target-agent cues necessary for holistic studies of human motion dynamics. This research gap hinders the development of robust models that capture the relationship between contextual cues and human behavior in different scenarios. To address this gap, we present a novel dataset incorporating a broader set of contextual features and multiple variations to support factor isolation. By integrating diverse modalities such as walking trajectories, eye-tracking data, and environmental sensory inputs captured by a mobile robot (see Figure 1), our dataset fosters the exploration and analysis of human motion in various scenarios with increased fidelity and granularity.

In this paper, we propose a novel dataset of accurate human and robot navigation and interaction in diverse indoor contexts, building on the previous THÖR dataset (Rudenko et al., 2020a). The THÖR dataset established a foundation for collecting open-source data on human social navigation toward randomized targets in a controlled setting using motion capture technology with minimal scripting. Building on prior studies (Mavrogiannis et al., 2019), which utilized reflective markers on helmets in small, spatially confined settings with a limited number of participants, the THÖR datasets extend this methodology and offer a broader scope. In particular, the THÖR-MAGNI dataset represents a significant advancement, enhancing data quality and features to provide rich insights into human motion and interactions within a larger room. The publicly available THÖR datasets, especially THÖR-MAGNI, facilitate more comprehensive research on human–robot interaction and human social navigation. The THÖR-MAGNI data collection is designed around systematic variation of environmental factors, allowing users to build cue-conditioned models of human motion and to verify hypotheses on factor impact.
To that end, we propose several scenarios in which the participants, in addition to primary navigation, need to move objects, interact with each other and the robot, and respond to remote instructions. The dataset includes differential and omnidirectional robot navigation, semantic zones, direction signs in the environment, and many other aspects. We provide position and head orientation for each moving agent, as well as 3D lidar scans and gaze tracking. Finally, we provide tools to visualize the dataset’s multiple modalities and preprocess the trajectory data. In total, THÖR-MAGNI captures 3.5 hours of motion of 40 participants over 5 days of recording, which is available for download.
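As a simple illustration of the kind of trajectory preprocessing such tools support, the sketch below derives per-agent velocities from raw position samples with pandas. The frame rate and column names here are assumptions for illustration only, not the dataset's actual schema:

```python
import numpy as np
import pandas as pd

# Hypothetical excerpt of one agent's track: timestamps (s) and 2D
# positions (m). The 100 Hz rate and column names are assumptions,
# not the dataset's documented schema.
FRAME_RATE = 100.0  # Hz

df = pd.DataFrame({
    "time": np.arange(5) / FRAME_RATE,
    "x": [0.000, 0.012, 0.025, 0.037, 0.050],
    "y": [0.000, 0.000, 0.001, 0.001, 0.002],
})

# Finite-difference velocities and speed: a common preprocessing step
# before feeding trajectories to a motion prediction model.
df["vx"] = df["x"].diff() * FRAME_RATE
df["vy"] = df["y"].diff() * FRAME_RATE
df["speed"] = np.hypot(df["vx"], df["vy"])
print(df[["time", "vx", "vy", "speed"]])
```

In practice one would also resample to a uniform rate and smooth the finite differences, since raw motion capture positions are noisy.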
Furthermore, we note the continuity between the THÖR and THÖR-MAGNI recordings due to their shared environment (in diverse configurations), motion capture system, and complementary scenario composition.

Figure 1. THÖR-MAGNI data modalities: (1) walking trajectories of participants in a workplace setting shared with other humans and robots; (2) lidar sweep recorded with a mobile robot; (3) snapshot from an eye tracker’s gaze overlay video; (4) fish-eye camera image from the mobile robot, showing object stashes and two goal points from our scenarios.
In this paper, we motivate and detail the THÖR-MAGNI data collection and sensor setup, describe the interfaces to the dataset, and compare it to the prior datasets. The paper is structured as follows: in Section 2, we review the prior state-of-the-art datasets, and in Section 3, we outline the target application domains. Section 4 provides all necessary information about the data collection, and Section 5 describes the data formats and tools used to visualize and preprocess the data. Finally, Section 6 presents a quantitative evaluation of the collected data followed by a conclusion in Section 7.
2. Related work
Multi-modal human motion datasets, covering gait patterns, gaze vectors, human–robot interactions, and robot sensor data, drive various research applications. These include human motion prediction (Kothari et al., 2022; Rudenko et al., 2020b), human motion representation for mobile robots (Kucner et al., 2023), human–robot interaction (Dahiya et al., 2023), human awareness in robot motion planning (Faroni et al., 2022; Heuer et al., 2023), and gaze-based prediction of human pose and locomotion mode (Li et al., 2022; Zheng et al., 2022).
2.1 Human trajectory datasets
Early datasets such as UCY (Lerner et al., 2007) and ETH (Pellegrini et al., 2009) have contributed significantly to our understanding of human movement in outdoor environments. Although these datasets encompass a range of human motion attributes such as trajectories, group identification, and goal points, social interactions play a minor role in shaping the human trajectories they contain (Makansi et al., 2022). The indoor ATC dataset introduced by Brščić et al. (2013) represents a data collection with high coverage and tracking accuracy due to the use of 49 range sensors for raw data acquisition. The tracking method involved the independent estimation of positions and body orientations from each sensor, which were subsequently fused. This fusion process increased the robustness of the primary estimates and ensured a high degree of accuracy in the resulting dataset. In contrast to the UCY and ETH datasets, our dataset contains many social interactions, as we always had multiple participants moving in the same space between goal points, deliberately allowing for frequent interactions between participants (see Section 3.1). Furthermore, unlike the ATC dataset, we have included a mobile robot in the scene, which allows the study of human–robot interaction scenarios (see Section 3.4).
Munaro and Menegatti (2014), Dondrup et al. (2015a), and Ehsanpour et al. (2022) have presented human motion datasets acquired through mobile robotic systems. While the datasets presented by Munaro and Menegatti (2014) and Dondrup et al. (2015a) consist of short acquisitions and have limited contextual information such as maps or environmental goals, Ehsanpour et al. (2022) have contributed a more comprehensive dataset. Their dataset includes detailed annotations of micro-actions and social group dynamics, offering a richer and more contextualized understanding of human motion patterns in diverse environments. However, in these datasets, human locations are based on detections in the sensor’s field of view onboard the mobile robot, which limits the scope of tracking due to occlusions. In contrast to these works, we used a motion capture system to track the moving agents (described in Section 4.5), which provides longer continuous tracking of each observed agent.
Kratzer et al. (2020) presented the MoGaze dataset, a notable advancement, by incorporating a motion capture system for full-body pose tracking and eye-tracking data for humans engaged in various activities. Similarly, Chen et al. (2022) proposed a human-tracking dataset for recording human–robot cooperation tasks in retail environments. However, neither dataset captures social interactions as they track only one person. In addition, MoGaze does not include a mobile robot in the scene. The absence of these elements hinders the study of downstream applications, for instance, robot motion planning methods in the “invisible robot” settings (Heuer et al., 2023), in which the humans do not react to the robot’s motion and location, but rather the full extent of collision avoidance falls on the robot. Similar to THÖR-MAGNI, the THÖR dataset introduced by Rudenko et al. (2020a) presents accurate human motion trajectories in the presence of a robot. While the THÖR dataset provides tracking accuracy in a socially dynamic environment, its limited recording duration (1 hour) poses challenges for in-depth studies, particularly concerning data-intensive deep learning-based methods for trajectory prediction.
2.2 Human–robot interaction
Understanding human motion is crucial in spatial human–robot interaction (sHRI). It allows robots to anticipate and adapt to human movements in shared environments, enhancing their safety, efficiency, and naturalness. This section situates the THÖR-MAGNI dataset within the context of existing HRI and robotics datasets.
Datasets like UF-Retail HRI (Chen et al., 2022), SiT (Bae et al., 2024), Mavrogiannis (Mavrogiannis et al., 2019), SoGRIN (Webb et al., 2023), and THÖR (Rudenko et al., 2020a) offer varied insights into HRI across different settings. UF-Retail HRI emphasizes social human navigation in retail environments using sensors like MoCap, eye tracking, and RGB cameras. SiT provides indoor and outdoor data to analyze pedestrian detection and trajectory prediction. SoGRIN investigates nonverbal social signals in group interactions, utilizing MoCap and RGB cameras to capture detailed motion and interaction cues. THÖR explores motion trajectories in shared spaces with a robot, using a MoCap system to enhance data accuracy. Several studies further emphasize the importance of understanding human social behaviors in HRI. For instance, Althaus et al. (2004) highlight the need for robots to exhibit behaviors that align with human social norms to enhance natural interaction in shared environments. Kretzschmar et al. (2016) present a novel approach to model cooperative behavior, highlighting the importance of understanding and imitating human interaction patterns for effective HRI. These studies emphasize the significance of data acquisitions like the ones from Bae et al. (2024), Dondrup et al. (2015b), and Yan et al. (2017), which provide critical data for analyzing pedestrian behaviors and interactions in shared spaces.
The THÖR-MAGNI dataset offers extensive indoor human–robot interaction data using MoCap, 3D lidar, and RGB-D cameras to record motion and social interactions in various contexts. It enriches the field by including scenario-based interactions, making it ideal for analyzing human social navigation and collaboration. In comparison to the predecessor THÖR dataset, THÖR-MAGNI represents a significant improvement, incorporating more exogenous factors such as lane markings and one-way passages, and introducing specific HRI scenarios. These scenarios involve participants navigating shared environments with a semi-autonomous mobile robot, supervised by an experimenter. The dataset explores robotic assistance in industrial settings, focusing on task efficiency and user experience in collaborative workflows. This makes THÖR-MAGNI uniquely valuable for advancing our understanding of human–robot interaction.
2.3 Comparison between THÖR and THÖR-MAGNI
Comparison of human trajectory datasets.
3. Context of the THÖR-MAGNI dataset
The THÖR-MAGNI dataset provides diverse navigation styles of a mobile robot and humans engaged in various activities in a shared environment with robotic agents. It incorporates multi-modal data for a complete representation. Following a comparative analysis of our dataset against state-of-the-art datasets in the evolving landscape of human motion research (see Section 2), this section supports users of our dataset with a detailed exploration of its features in the context of human motion, robot navigation, and interactions. We explain their significance in addressing the identified gaps before describing the dataset in Section 4.
3.1 Goal-directed human motion trajectories
Goal-directed human agents are crucial in human motion prediction (Chiara et al., 2022; Dendorfer et al., 2021; Zhao and Wildes, 2021). Traditional approaches often depict human agents as rational entities, acting logically and moving toward specific goals or destinations (Ziebart et al., 2009). Real-world recordings commonly show this directional traffic flow, characterized by distinct goal points, often resulting in a consistent and linear motion with limited diversity. In our dataset, we include scenes with seven different goal points distributed over a larger spatial volume and scenes where they are arranged in a more compact space (see Section 4). Goal points and static obstacles are positioned strategically to ensure that recorded trajectories are sufficiently long and topologically diverse, that is, covering a range of spatial arrangements and configurations. This approach allows for the inclusion of frequent interactions between the moving agents, contributing to a more comprehensive understanding of human motion dynamics.
3.2 Navigation of heterogeneous agents
Heterogeneous agents are dynamic entities that navigate with distinct motion patterns. This heterogeneity stems from various factors that affect the motion, such as tasks and ongoing activities performed by the agent (Almeida et al., 2023). For instance, several works have studied how humans move individually or as part of a social group (Moussaïd et al., 2010; Rudenko et al., 2018; Wang et al., 2022). It has been shown that humans can coordinate their movements as a group by following simple rules based on the visual perception of local motion (Boos et al., 2014). Previous research on the anatomy of leadership in collective behavior (Garland et al., 2018) describes human collective behavior as optimal coordination and leadership dynamics in various group scenarios. In particular, crowd dynamics are determined by physical constraints and significantly influenced by communicative and social interactions among individuals (Moussaïd et al., 2010). Autonomous driving datasets often highlight the motion of heterogeneous agents in mixed traffic (Chandra et al., 2019; Salzmann et al., 2020). In our dataset, we introduce roles for participants tailored for industrial tasks, such as navigating alone or in groups of different sizes, transporting various objects, and interacting with a robot. This heterogeneous social setting provides a novel way to study how specific industrial roles influence human motion, aligning with the work conducted by Almeida et al. (2023).
3.3 Navigation of a robotic agent
Human-aware robot motion planning is crucial for safe navigation in shared spaces, especially in narrow and crowded indoor environments (Cancelli et al., 2023). Understanding human interaction with robots of different driving styles promotes the design of socially acceptable motion planners (Möller et al., 2021). Analyzing participant behavior with robots of varied movement patterns reveals insights into how robot motion style affects human expectations (Karnan et al., 2022; Mavrogiannis et al., 2019), guiding the development of robots that interact safely and are well-received by people (Shah et al., 2023). Our dataset features scenarios with a mobile robot in teleoperated and semi-autonomous modes and two driving styles: differential drive (forward, backward, and turning) and omnidirectional mode (allowing the robot to drive in any direction while keeping its heading). This variety of motion modes (detailed in Section 4.2.2) extends the state-of-the-art datasets of teleoperated navigation which feature a single driving style (Karnan et al., 2022; Rudenko et al., 2020a). Lastly, while some parts of our dataset (Section 4.3.3) might be interesting for the field of social robot navigation (see Mavrogiannis et al., 2023, who recently surveyed this field), the main focus of the dataset is on human social navigation and spatial human–robot interaction in shared workplaces.
3.4 Spatial human–robot interaction in shared workplace settings
Industry 5.0 aims to prioritize human well-being in manufacturing systems (Leng et al., 2022). This requires enhancing the quality of human–machine and human–robot interactions in these environments. Designing robots that clearly express their intentions to human collaborators is a crucial step toward fostering mutual understanding and enhancing the well-being of workers who regularly interact with robots (Pascher et al., 2023). Furthermore, intuitive human–robot interaction (HRI) improves well-being and enhances safety and efficiency in collaborative settings (Haddadin et al., 2011).
Spatial HRI (sHRI) and navigation in shared environments are research areas that have an adherent need for accurate datasets of human motion tracking and prediction (Chen et al., 2022; Rudenko et al., 2020a) and for robots that understand the underlying physical interactions between nearby agents and objects (Castri et al., 2022). Our dataset contains recordings of explicit interactions between a mobile robot and individuals in shared workplace settings. THÖR-MAGNI is a valuable resource for studying human responses to robotic approaches and assistance initiatives, enabling researchers to analyze goal-oriented interactions between humans and robots.
3.5 Eye tracking and head orientation in navigation tasks
Eye tracking is a powerful method to study various aspects of human behavior, including attention, emotion, cognition, and decision-making, with applications spanning education, marketing, gaming, and healthcare (Duchowski, 2017). Eye tracking provides objective data about eye movements and positions and enables researchers to quantify visual information processing through various metrics (Duchowski, 2017; Mahanama et al., 2022). In HRI applications, human eye-gaze is an important nonverbal signal (Admoni and Scassellati, 2017). Our dataset aligns human gaze data with human motion trajectories, allowing us to study human gaze during visual exploration across dynamic tasks, activities, and scenarios.
Head orientation provides another essential modality of human behavior that is complementary to gaze direction and attentional focus. Head orientation plays a vital role in joint attention, that is, attention coordination between individuals focusing on the same point of interest (Tomasello, 2014). Furthermore, it is valuable for detecting interpersonal dynamics in multi-party interactions (Stiefelhagen and Zhu, 2002). Beyond its social implications, head orientation becomes a predictive indicator of walking motion goals (Holman et al., 2021) and can enhance human motion prediction through vision-based features (Salzmann et al., 2023). Using a state-of-the-art motion capture system and eye-tracking devices, our dataset provides highly accurate head poses and orientations aligned with the eye-tracking data.
3.6 Semantic environment cues
Crucial environmental information, conveyed by semantic cues such as doors, stairs, floor markings, and signs, is essential in guiding humans and robots within a given space. These cues, combined with obstacle configurations, influence human interactions with the environment, leading to actions like detouring, bypassing, overtaking, and avoiding specific areas. In our dataset, we include semantic cues such as floor markings that indicate areas requiring caution or one-way passages that limit the flow of motion to one direction. In this way, we enable the exploration of navigation and interactions in semantically rich environments. For instance, leveraging Maps of Dynamics (Kucner et al., 2023) allows the quantification of changes in motion patterns around these cues. This information, in turn, can be utilized to predict long-term human motion dynamics, as demonstrated by Zhu et al. (2023).
4. Description of the THÖR-MAGNI dataset
Amount of eye-tracking and trajectory data recorded for various activities with all three devices: Tobii 2, Tobii 3, and Pupil Invisible glasses.
In this section, we detail the environment in which we recorded the data (Section 4.1), the navigation and task design for the participants and the robot (Section 4.2), interactive scenarios to emphasize the various contextual aspects of human motion (Section 4.3), participants’ background and priming (Section 4.4), and the technical implementation of the recording pipeline and collection of motion capture and eye-tracking data (Section 4.5).
4.1 Environment design
We conducted the data acquisition in a laboratory at Örebro University, the same as in the THÖR dataset (Rudenko et al., 2020a). The laboratory has two different configurations. One features a smaller, free-space environment (see Figure 2, left). The other resembles an industrial logistics setting and promotes frequent interactions between human and robotic co-workers (see Section 4.3). Both room configurations have seven goal positions to drive purposeful human navigation through the available space, generating frequent interactions in the center. Additionally, we include several environmental layouts (i.e., obstacle maps) in the THÖR-MAGNI dataset, which vary the placement of static obstacles (robotic manipulators and tables) in the room to prevent walking between goals in a straight path. Apart from static obstacles, two robots are in the room: a static robotic arm near the podium and an omnidirectional mobile robot with a robotic arm on top (see Section 4.2.2). Our dataset thus comprehensively explores human–robot interaction in a shared workplace environment.
4.2 Navigation and interaction design
The interaction and navigation design in THÖR-MAGNI extends the weakly-scripted motion recording procedure introduced in the THÖR dataset (Rudenko et al., 2020a). This procedure facilitates realistic motion in controlled settings, in which accurate ground-truth motion capture and eye-tracking data are collected using specialized equipment (see Figure 2, right). Our key idea is to assign meaningful activities and tasks to the recording’s participants, allowing them to concentrate on a continuous activity while moving freely inside the room shared with other people and robots. To generate a diverse range of interactions, we developed several scenes that vary in the composition of tasks, robot operation, and other contextual cues, as discussed in Section 3.
4.2.1 Tasks, activities, and roles requiring search and navigation
We aimed to simulate authentic scenes that reflect the different activities individuals perform in a workplace environment. To that end, we designed several tasks that require search, navigation, and interaction with objects, other participants, and a mobile robot. Participants engaged in those tasks according to their assigned roles.
Our dataset has two types of roles: (1) participants in the role of Carrier transported various objects of different sizes and shapes.
4.2.2 Modes of robot navigation and HRI
Our dataset includes a mobile robot, “DARKO” (see Figure 4), which acts as a static obstacle in some scenes and moves in others. This range of behaviors enables the study of participants’ movements and gaze behaviors with respect to the stationary or mobile status of the robot. In certain scenes, the robot was teleoperated and moved omnidirectionally, enabling it to reach any 2D position from a stationary pose. In others, it moved differentially with a predetermined forward orientation. In yet others, the DARKO robot navigated semi-autonomously toward manually set goal points. An experimenter supervised the navigation of DARKO for safety reasons. When acting semi-autonomously, the robot interacted with participants through a communication intermediary called the “Anthropomorphic Robot Mock Driver” (ARMoD).

Figure 4. Robot used in and for the data collection (the “DARKO” robot), with an omnidirectional mobile base (RB-Kairos) of dimensions 760 × 665 × 690 mm (5), equipped with two sensor towers, one hosting two Azure Kinect RGB-D cameras (2) and one hosting an Ouster OS0-128 lidar and two Basler fish-eye RGB cameras (4). Additional equipment includes two Sick MicroScan 2D safety lidars (6), mecanum wheels (7), and a NAO robot (the “ARMoD”) for interaction with participants (3). Our recordings did not use the robotic arm, which has a maximum height of 855 mm (1).
The ARMoD is a small humanoid NAO robot, shown in Figure 4, which sat on top of the DARKO robot. The ARMoD displayed two behaviors during interactions: one using only the voice (
4.3 Scenario design
We address the context of agent movement by including both humans and robots, as previously discussed, in five specifically designed scenes we call “scenarios.” Scenario 1 captures the dynamics of motion arising from semantic attributes of the environment and sets up a baseline for goal-directed social human navigation. Scenario 2 adds role-specific motion for some participants navigating the environment. Subsequently, Scenario 3 explores the impact of different robot motion styles on these role-specific patterns. Figure 5 depicts a detailed overview of the room configuration and varying environmental layouts for Scenarios 1–3. Scenario 1’s conditions A and B capture regular social behavior in a static environment with and without additional floor markings and a one-way passage. Scenario 2 maintains the same layout as Scenario 1A but introduces individuals performing tasks, emulating industrial activities. Scenario 3 explores human–robot interactions by varying the driving modes of the mobile robot teleoperated by experimenters on a podium.

Figure 5. Varying environmental layouts for the room configuration of Scenarios 1–3.
Transitioning to the smaller room configuration, we present two scenarios to explore human motion and intended interactions between humans and robots: Scenarios 4 and 5. In Scenario 4, participants engaged in intermittent interaction with a mobile robot, which communicated in two interaction styles through an intermediary entity to mediate joint navigation with participants toward goal points. In Scenario 5, the robots and a human co-worker collaborated actively in transporting small storage bins. For a comprehensive overview of roles and scenarios, see Figure 6.

Figure 6. Scenario definitions in the THÖR-MAGNI dataset, including roles, robot motion status (e.g., autonomous or teleoperated), environment layout (i.e., obstacle maps), specific scenario conditions, and duration and recording days. Each recording day has a unique set of participants: day 1 has nine participants and days 2–4 have seven participants each. Three mobile eye-tracking devices were used daily for three participants; on day 5, two devices were used for two sets of participants. The duration of recorded trajectory and eye-tracking data is provided in Table 2.
We recorded multiple runs for each condition in Scenarios 1–5. Specifically, we recorded two runs per condition for Scenarios 1 and 3, two for Scenario 2, four per condition for Scenario 4, and four runs for Scenario 5. To counterbalance learning-based effects, we randomized the recording order of conditions for Scenarios 3 and 5. We implemented this systematic approach to ensure a broad and impartial exploration of the scenarios, capturing subtle interactions and behaviors in each setting.
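The condition-order randomization mentioned above can be sketched as a per-day shuffle of the run schedule (a generic illustration with hypothetical condition labels, not the protocol code actually used):

```python
import random

def randomized_order(conditions, runs_per_condition, seed=None):
    """Shuffle the run schedule to counterbalance learning effects."""
    rng = random.Random(seed)
    schedule = list(conditions) * runs_per_condition
    rng.shuffle(schedule)
    return schedule

# Hypothetical condition labels for a scenario with two driving styles:
order = randomized_order(["differential", "omnidirectional"],
                         runs_per_condition=2, seed=7)
print(order)
```

Fixing the seed per recording day would make a schedule reproducible while still varying it across days.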
4.3.1 Scenario 1: Capturing motion dynamics in the environment
Scenario 1 comprises two conditions:

Maps of dynamics created from one day of data acquisition.
4.3.2 Scenario 2: Role-specific motion patterns in industrial environments
Scenario 2 features the same environment layout as Scenario 1A (Figure 7 left). In addition to the goal-driven navigation (
In summary, this scenario presents role-specific tasks for participants and goal-driven navigation, creating a platform to study the impact of human occupation on their motion profiles and those of the other agents in a shared environment.
4.3.3 Scenario 3: Impact of mobile robot motion on human behavior
With Scenario 3, we introduce an opportunity to study the interplay between human activities and a mobile robot. In this scenario, the stationary DARKO robot of Scenarios 1 and 2 becomes mobile, enabling the study of changes in the humans’ motion patterns based on the mobile robot’s driving style. This scenario comprises two conditions, in which we modulated the way the mobile robot navigates:

Two types of mobile robot motion achievable with mecanum wheels (
4.3.4 Scenario 4: Spatial HRI in a shared environment
This scenario includes participants with the roles of
Participants assigned with the role of
Participants move either individually or in pairs between designated goal points. A specific card directs the individual participants (
To ensure safe and seamless interactions, the ARMoD’s behaviors were triggered by an experimenter using a controller (see Figure 9, left). The experimenter initiated actions like “Greet the closest participant” and “Talk to the participant,” guiding the ARMoD’s communication with participants. Concurrently, the mobile robot continued its autonomous navigation, albeit under the oversight of the experimenter, who could pause its movements if necessary.
Accurate tracking of individuals was essential for facilitating seamless interactions between the ARMoD and the participants. To determine the ARMoD’s position relative to individuals at any given moment, we leveraged the motion capture system’s data, broadcast into the local network using the Robot Operating System (ROS) (Quigley et al., 2009). This integration ensured precise transformations and provided position and orientation information, enabling the ARMoD to accurately point, look, and establish eye contact with its interaction partners. Figure 9 (right) illustrates an interaction between a participant and the ARMoD in this scenario. The position and orientation data of participants, robots, and the world frame are broadcast within the local network, providing essential information to the path planner for DARKO and the interaction scheduler for the ARMoD. In this figure, examples of established coordinate frames include (1) that of the participants’ helmets, defined based on the orientation of the markers, (2) a static coordinate frame for the ARMoD, derived from the DARKO robot’s frame through an offset, (3) DARKO’s coordinate frame, and (4) the motion capture reference frame, called the “QTM-World Frame.”
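The frame bookkeeping described above amounts to composing and inverting rigid-body transforms. A minimal planar (SE(2)) sketch, with made-up poses standing in for the QTM world frame, DARKO’s frame, and the ARMoD’s static offset, might look like this:

```python
import numpy as np

def se2(x, y, theta):
    """Homogeneous 2D transform mapping frame coordinates into the parent frame."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, x],
                     [s,  c, y],
                     [0.0, 0.0, 1.0]])

# Illustrative poses (all numeric values are made up for this sketch):
T_world_darko = se2(2.0, 1.0, np.pi / 2)    # DARKO's base in the world frame
T_darko_armod = se2(0.3, 0.0, 0.0)          # static offset of the ARMoD on DARKO
p_world_helmet = np.array([2.0, 3.0, 1.0])  # a helmet position (homogeneous)

# Compose the chain and express the helmet position in the ARMoD's frame,
# e.g., to orient the ARMoD's gaze toward its interaction partner.
T_world_armod = T_world_darko @ T_darko_armod
p_armod = np.linalg.inv(T_world_armod) @ p_world_helmet
print(p_armod[:2])  # helmet position relative to the ARMoD
```

In the actual pipeline, ROS handles such transforms through its tf machinery; the sketch only shows the underlying algebra.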
This scenario investigates free movement in a shared environment alongside the DARKO robot, exploring semi-autonomous navigation. Participants engaged in interactions with ARMoD under varied conditions. These allow for a study of human–robot interactions, navigation tasks, and the impact of different interaction styles on participants’ activities and movements.
4.3.5 Scenario 5: Spatial human–robot interaction, proactive robotic assistance
This scenario involves the roles:
4.4 Participants background and priming
The average age of the participants was 30.18 years, with a standard deviation of 6.73, indicating a relatively homogeneous age group. The dataset contains a balanced gender distribution among its 40 participants, of whom 21 are male and 19 female. Geographically, 23 participants are from Sweden. Ten participants are from other European countries, including the Czech Republic, Spain, Germany, and Italy, reflecting a diverse European representation. The remaining seven participants come from countries in Asia, Africa, and South America, providing a broader international scope. We recruited the participants from different areas of the campus. Their backgrounds varied considerably, including differences in their highest academic degree and primary subjects. At the beginning of each recording day, participants completed a demographic questionnaire. We used this information to create diverse group compositions, aiming for optimal allocation of eye-tracking devices across different roles (see Figure 10). For example, we ensured that groups of two or three participants contained only one participant equipped with an eye tracker and that at least one of the carriers was equipped with an eye tracker. Initial priming of participants was performed at the beginning of each recording day: participants were instructed about the experimental setting and the recording procedure, including a briefing on the tasks, establishing familiarity with the equipment, filling out consent forms, and an initial set of questionnaires.
At the beginning of each recording day, we provided standardized information to participants to ensure natural and unbiased behaviors. The instruction emphasized the experiment’s focus on testing the robot’s perception of humans, involving tasks such as navigating the laboratory and executing physical activities, with an estimated duration of 15 min.
During the data collection procedure, we guided the participants through a series of runs with specific instructions tailored to each scenario. Between successive runs, participants completed questionnaires while logistical preparations were made, such as removing floor markings, configuring a phone for voice chat using Discord (before Scenarios 2 and 3), monitoring and, if necessary, changing the batteries of the eye trackers, and preparing the robots for Scenarios 3–5. After completing the questionnaire, participants were assigned new roles in Scenarios 2 and 3. We gave each group a new starting point for the next run, from which they drew their first card. Participants unfamiliar with their roles received a brief recap of their task-related responsibilities. In Scenario 3, we informed participants that an experimenter monitored the robot’s motion for safety and teleoperated the robot. In Scenarios 4 and 5, participants were first briefed about their roles in the scenario (see Section 4.3) and then introduced to the ARMoD and the DARKO robot as co-workers in the room, with the ARMoD acting as a communicator on behalf of the DARKO robot.
After each run, participants completed the raw version of the NASA Task Load Index (RTLX) (Hart, 2006; Hart and Staveland, 1988). The scale consists of 21-point subscales [1 = low; 21 = high] assessing the mental demand, physical demand, temporal demand, and effort produced by the task as reported by the participant, as well as their self-perceived performance and frustration. After the last run of Scenarios 3 and 5, participants completed two additional mobile robot questionnaires. First, they completed the Godspeed Questionnaire Series (Bartneck et al., 2009), a set of 5-point semantic differential subscales that measures participants’ perceptions of the robot in terms of anthropomorphism, animacy, likeability, perceived intelligence, and perceived safety. Second, they completed a 5-point Likert scale [1 = strongly disagree; 5 = strongly agree] to assess trust in the robot in industrial human–robot collaborations (Charalambous et al., 2016). Participants completed all questionnaires on paper.
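As a minimal sketch of how such responses are typically scored (variable names and ratings here are illustrative, not taken from the dataset), the RTLX score is the unweighted mean of the six subscale ratings:

```python
# Raw NASA-TLX (RTLX): unweighted mean of the six 21-point subscale ratings.
def rtlx_score(ratings: dict) -> float:
    subscales = ["mental", "physical", "temporal", "performance", "effort", "frustration"]
    if set(ratings) != set(subscales):
        raise ValueError(f"expected ratings for {subscales}")
    if not all(1 <= ratings[s] <= 21 for s in subscales):
        raise ValueError("ratings must lie on the 21-point scale [1, 21]")
    return sum(ratings[s] for s in subscales) / len(subscales)

# Example: one participant's ratings after a run (illustrative values).
example = {"mental": 12, "physical": 7, "temporal": 9,
           "performance": 5, "effort": 10, "frustration": 5}
print(rtlx_score(example))  # 8.0
```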
4.5 System setup
4.5.1 Hardware and software configuration
We used a motion capture system from Qualisys with 10 infrared cameras (Oqus 7+) positioned around the room to track the moving agents, providing comprehensive coverage of the room volume. Reflective markers, arranged in distinct patterns, were attached to bicycle helmets to form six-degrees-of-freedom (6DoF) rigid bodies. These were tracked at 100 Hz with a spatial resolution of 1 mm. The coordinate frame of the system originated at ground level in the center of the room. Each participant and the robot were represented in the system as unique rigid bodies, identifiable through their specific patterns of passive reflective markers. This configuration enabled the precise capture of each participant’s 6DoF head position and orientation. We provided the participants with individualized helmets for the recording sessions; the specific helmet IDs used during each recording session are listed in Tables 3, 4, and 5 in the Appendix.
We captured eye-tracking data using three distinct models of eye-tracking devices: Tobii Pro Glasses 2 and 3 and Pupil Invisible. The Tobii Glasses models record raw gaze data at a frequency of 50 Hz and camera footage at 25 Hz, while the Pupil Glasses record gaze data at 100 Hz and camera footage at 30 Hz. To export the Tobii Glasses data, we used the I-VT Attention filter, optimized for dynamic situations, which classifies gaze points into fixations and saccades based on a velocity threshold of 100°/s.
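The velocity-threshold principle behind such a filter can be sketched as follows (a simplified illustration only; the actual Tobii I-VT filter includes additional gap-filling and merging steps, and the gaze sample format here is an assumption):

```python
import math

def classify_ivt(gaze, threshold_deg_s=100.0):
    """Classify gaze samples into fixations/saccades by angular velocity.

    gaze: list of (timestamp_s, azimuth_deg, elevation_deg) samples.
    Returns one label per sample; the first sample has no velocity estimate.
    """
    labels = ["undefined"]
    for (t0, az0, el0), (t1, az1, el1) in zip(gaze, gaze[1:]):
        dt = t1 - t0
        # Small-angle approximation of the angular distance between samples.
        dist = math.hypot(az1 - az0, el1 - el0)
        velocity = dist / dt if dt > 0 else float("inf")
        labels.append("saccade" if velocity >= threshold_deg_s else "fixation")
    return labels

# 50 Hz samples: small drift (fixation), then a fast ~4 degree jump (saccade).
samples = [(0.00, 10.0, 5.0), (0.02, 10.1, 5.0), (0.04, 14.0, 5.0)]
print(classify_ivt(samples))  # ['undefined', 'fixation', 'saccade']
```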
The DARKO robot integrates several sensors, including an Ouster OS0-128 lidar, two Azure Kinect RGB-D cameras (one of which was used in these recordings), two Basler fish-eye RGB cameras, and two Sick MicroScan 2D safety lidars. The Azure Kinect cameras have a resolution of 2048 × 1536 at 6 Hz, a horizontal field of view of 75°, and a tracking range of up to 5 m. The Basler fish-eye RGB cameras have a resolution of 1700 × 1536 at 20 Hz. The DARKO robot is augmented with a NAO robot acting as ARMoD for participant interaction. The NAO is attached to a seat on the DARKO robot, facilitating the communication of spatial motion intent. This arrangement aligns the ARMoD’s body orientation with the direction of movement in scenarios where DARKO employs a directional driving style.
Recordings from the DARKO robot and the motion capture system were synchronized using ROS timestamps. Taking advantage of the integration of the motion capture system with ROS 1 Melodic, we recorded all of the robot’s onboard sensor data and the 6DoF positions of the people using ROS bag files and in text form.
4.5.2 Sensor calibration
The precision of the data acquisition relied on sensor calibration procedures to ensure accurate measurements and reliable data interpretation throughout the experiments. This section describes our calibration methods for the motion capture system and the eye-tracking devices. We followed a separate calibration routine for each sensor, which ensured the robustness and reliability of our dataset and allowed for accurate analysis and interpretation of participants’ behaviors and interactions within the recorded scenarios.
For the eye-tracking devices, we followed the calibration procedures for both Tobii Glasses models (see Figure 11) as outlined in their respective user manuals to optimize eye-tracking accuracy (Tobii AB, accessed 2024-02-02a,b). To ensure accurate recordings with the Pupil Invisible Glasses, we followed the best calibration practices outlined by Pupil Labs and validated the calibrations with their dedicated software (Pupil Labs AB, accessed 2024-02-02b).
To ensure the data accuracy of the motion capture system, rigorous daily calibration routines were performed before the start of each recording session. We used the standard calibration kit with a 502.2 mm carbon fiber wand to fine-tune the system. These calibrations allowed us to define precise rigid bodies that enabled 6DoF tracking. This approach ensured the accurate capture of spatial dimensions (X, Y, Z) and rotational elements (roll, pitch, yaw) of objects within the 3D environment, resulting in an average residual tracking error of 2 mm. Rigid bodies of helmets and objects, such as the large objects for the carriers or the DARKO robot, were strategically designed to enable simultaneous and highly accurate capture of all object poses and locations.
4.6 Post processing
Multi-modal data synchronization was necessary in our data collection. We used ROS and custom Python scripts to align the data streams while maintaining temporal integrity. To synchronize the motion capture and eye-tracking data, we placed custom events with precise timestamps in the two data streams, using the respective software of the eye-tracking devices, Tobii Pro Lab (Tobii AB, accessed 2024-02-02c) and Pupil Player (Pupil Labs AB, accessed 2024-02-02a), as well as the Qualisys Track Manager (QTM) (Qualisys AB, accessed 2024-02-02) for the motion capture system. This procedure resulted in CSV files in which the timestamps of all modalities are synchronized to the motion capture system’s timestamp. Within these files, eye-tracking data is available for frames in which the motion capture system tracks all rigid body markers, since a correct head orientation is a prerequisite for determining the 3D gaze vector. The frame numbers of each eye tracker’s scene recording are indexed in the column named “SceneFNr” in the corresponding CSV file.
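Aligning two streams to a common clock can be sketched with a nearest-timestamp match (an illustrative simplification of our pipeline, not the actual scripts; the data layout is hypothetical):

```python
import bisect

def align_to_mocap(mocap_ts, gaze_samples, tolerance=0.02):
    """For each motion-capture timestamp, pick the closest gaze sample.

    mocap_ts: sorted list of mocap timestamps (seconds).
    gaze_samples: sorted list of (timestamp, value) tuples.
    Returns one matched value (or None) per mocap frame.
    """
    gaze_ts = [t for t, _ in gaze_samples]
    aligned = []
    for t in mocap_ts:
        i = bisect.bisect_left(gaze_ts, t)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(gaze_ts)]
        best = min(candidates, key=lambda j: abs(gaze_ts[j] - t), default=None)
        if best is not None and abs(gaze_ts[best] - t) <= tolerance:
            aligned.append(gaze_samples[best][1])
        else:
            aligned.append(None)  # no gaze sample close enough to this frame
    return aligned

mocap = [0.00, 0.01, 0.02, 0.03]          # 100 Hz mocap frames
gaze = [(0.000, "g0"), (0.021, "g1")]     # ~50 Hz gaze samples
print(align_to_mocap(mocap, gaze))  # ['g0', 'g0', 'g1', 'g1']
```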
To facilitate a thorough analysis of the eye-tracking data in our study, we offer access to the raw data from the Tobii glasses, along with essential synchronization details. The scene recordings are provided in a blurred format and with the audio removed to ensure data protection. Access to the raw data from the Pupil Invisible glasses can be granted upon individual request, ensuring careful and ethical distribution of sensitive data.
The data acquisition was followed by an extensive post-processing stage, including synchronization and alignment, which aimed to refine and validate the collected data and protect sensitive information. This stage involved several vital procedures, such as eliminating artifacts and noise caused by marker occlusion, lighting variations, and camera disruptions. We also rectified misidentified trajectories through spatial and temporal consistency evaluations, applying manual adjustments when needed.
5. Working with the THÖR-MAGNI dataset
5.1 Data formats
For dissemination, the dataset is organized into five recording scenarios (see Section 4 for a detailed description), aligned with the respective days of data collection. Each scenario’s data is stored in a separate folder containing the multiple acquisitions conducted over the 5 days of recording. The folders for the first three scenarios (1–3) contain acquisitions from 4 days (in May 2022), while the folders for the last two scenarios (4 and 5) contain recordings from 1 day (in September 2022). We recorded multiple runs for each scenario and condition to enhance the diversity of motion data and mitigate random artifacts. Note that all files are intended to be extracted into a common directory; this arrangement preserves the temporal structure of the recorded data.
Each run’s data includes a CSV file and up to two .mp4 videos with the recordings from the scene cameras of the Tobii eye trackers. If the robot was in motion (Scenarios 3–5), the data also includes continuous 3D point clouds from the Ouster lidar and the RGB videos from one of the fish-eye cameras. The structure of the recorded data is shown in Tables 3, 4, and 5 in the Appendix. In the following subsections, we provide more specific details on the usage and processing of the individual files.
5.1.1 Comma-separated value files
Each CSV file contains a header with critical metadata, including the number of frames in the recording, rigid body and marker details, units of measurement, role labels, and eye-tracking specifics (see Table 6 in the Appendix). The rest of each CSV file contains the merged data from the motion capture system and the eye-tracking devices, organized based on the rigid bodies of the participants’ helmets. The data of each rigid body is thus organized into columns containing the XYZ coordinates of all markers (e.g., “Helmet_1 – 2 X” indicating helmet one, marker two, axis X), the XYZ coordinates of the centroid of all markers, the 6DoF orientation of the rigid body’s local coordinate frame, and, if available, eye-tracking data including 2D gaze coordinates, 3D gaze vectors, the frame number of the scene recording, eye movement types (such as saccades or fixations), and IMU data (accelerometer, gyroscope, and magnetometer).
Missing data is indicated by either “N/A” (not available) or an empty cell. The temporal indexing in these files is provided by the “Time” or “Frame” column, which indicates the timestamp or frame number of the motion capture system, respectively.
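Reading such a file can be sketched with the standard library alone (a minimal illustration; the number of metadata header lines and the exact column labels here are assumptions and should be checked against the actual files):

```python
import csv

def load_magni_csv(text, n_header_lines):
    """Parse a THOR-MAGNI-style CSV: skip metadata lines, read rows as dicts.

    Missing values ("N/A" or empty cells) are mapped to None.
    """
    lines = text.splitlines()[n_header_lines:]
    rows = []
    for row in csv.DictReader(lines):
        rows.append({k: (None if v in ("", "N/A") else v) for k, v in row.items()})
    return rows

# Illustrative file with two metadata lines and one tracked helmet centroid.
example = """N_FRAMES,2
UNITS,mm
Frame,Helmet_1 X,Helmet_1 Y,Helmet_1 Z
0,1204.1,-350.2,1710.5
1,1205.0,-349.8,N/A
"""
rows = load_magni_csv(example, n_header_lines=2)
print(rows[1]["Helmet_1 Z"])  # None (missing sample)
```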
5.1.2 Robot sensor data
The sensor data from the robot includes lidar data and videos captured by the Azure Kinect camera and the Basler camera. Lidar 3D point clouds are provided in the Point Cloud Data (PCD) file format, one per timestamp. The lidar data for each run is supplied in a zip file, labeled with the same File ID as referenced in Tables 3, 4, and 5 in the Appendix. Regarding video data, the RGB-D and fish-eye camera video streams are unrectified, providing raw visual data, and are only available upon request to ensure suitable data protection.
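As a minimal sketch of consuming such files, an ASCII PCD file can be parsed in a few lines of Python (in practice, a dedicated library such as Open3D is more convenient and also handles binary PCD):

```python
def parse_ascii_pcd(text):
    """Parse a minimal ASCII PCD file into a list of (x, y, z) points.

    Only handles DATA ascii; binary PCD files need a dedicated library.
    """
    lines = text.strip().splitlines()
    data_start = next(i for i, l in enumerate(lines) if l.startswith("DATA")) + 1
    if not lines[data_start - 1].endswith("ascii"):
        raise ValueError("only ASCII PCD is handled in this sketch")
    return [tuple(float(v) for v in l.split()[:3]) for l in lines[data_start:]]

# Tiny illustrative PCD with two points.
pcd = """VERSION .7
FIELDS x y z
SIZE 4 4 4
TYPE F F F
COUNT 1 1 1
WIDTH 2
HEIGHT 1
POINTS 2
DATA ascii
1.0 2.0 3.0
-0.5 0.0 4.2
"""
print(parse_ascii_pcd(pcd))  # [(1.0, 2.0, 3.0), (-0.5, 0.0, 4.2)]
```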
5.1.3 Additional data
In addition to the CSV files containing the recorded data from the eye trackers and the motion capture system, we provide the scene recordings from most of the Tobii eye-tracking devices as .mp4 videos. The videos of the scene recordings were carefully post-processed: we blurred all faces of the participants using dedicated video-redaction software (“Caseguard”) to ensure data protection. The raw scene camera video from the Pupil Invisible Glasses has lens distortions that must be corrected; for this purpose, we provide JSON files with the necessary intrinsic camera parameters. All data from the Pupil Invisible eye-tracking devices and the remaining data from the Tobii devices are available upon request.
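Loading such intrinsics can be sketched as below (the JSON field names are hypothetical, not the dataset’s actual schema; the resulting parameters would typically be passed to an undistortion routine such as OpenCV’s `cv2.undistort`):

```python
import json

def camera_matrix_from_json(text):
    """Build a 3x3 pinhole camera matrix from intrinsic parameters."""
    p = json.loads(text)
    return [[p["fx"], 0.0,     p["cx"]],
            [0.0,     p["fy"], p["cy"]],
            [0.0,     0.0,     1.0]]

# Hypothetical intrinsics file content (field names are illustrative).
example = '{"fx": 820.4, "fy": 819.7, "cx": 544.0, "cy": 540.5}'
K = camera_matrix_from_json(example)
print(K[0][0], K[1][2])  # 820.4 540.5
```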
5.2 Development tools
Most existing datasets in the field lack a dedicated toolbox for streamlined visualization and preprocessing. Addressing this gap, we contribute a set of data visualization tools, including a dashboard, and introduce a specialized Python package named thor-magni-tools.
5.2.1 Data visualization
To provide researchers and users with an intuitive interface for the exploration of human movement, gaze patterns, and environmental perception of the THÖR-MAGNI dataset, we made a set of visualization tools publicly available.
Our visualization dashboard provides a user-friendly interface with four key interactive components.
In addition to data visualization, our dashboard contains concise scenario descriptions. Each scenario represents a unique context in which human motion data was captured (described in Section 4.3). These descriptions include information such as the physical environment, task objectives, social interactions, and specific conditions imposed on the participants (e.g., transporting objects between two goal points). Understanding these scenarios is vital for accurately interpreting the data and ensures that researchers can contextualize their analyses effectively.
5.2.2 Data filtering and preprocessing with thor-magni-tools
To facilitate the use of the agents’ trajectories in our dataset, we demonstrate the filtering methods of thor-magni-tools on a 4-minute recording from Scenario 1.
For both 3D and 6D tracks (X, Y, Z, and 3D orientation), we provide an interpolation method bounded by a predefined maximum number of consecutive untracked positions. This method fills in missing data points while maintaining the integrity of the motion patterns and ensuring continuity in the trajectories. The accompanying figure shows an example of this interpolation applied to a 4-minute helmet trajectory from Scenario 1.
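The gap-limited linear interpolation can be sketched as follows (a simplified illustration of the idea, not the actual thor-magni-tools implementation; gaps longer than `max_gap` are deliberately left unfilled):

```python
def interpolate_gaps(track, max_gap):
    """Linearly interpolate missing samples (None) in a 1D track.

    Only interior gaps of at most max_gap consecutive missing samples are
    filled; longer or boundary gaps stay None to avoid inventing motion.
    """
    track = list(track)
    i = 0
    while i < len(track):
        if track[i] is None:
            start = i
            while i < len(track) and track[i] is None:
                i += 1
            gap = i - start
            if 0 < start and i < len(track) and gap <= max_gap:
                a, b = track[start - 1], track[i]
                for k in range(gap):
                    track[start + k] = a + (b - a) * (k + 1) / (gap + 1)
        else:
            i += 1
    return track

print(interpolate_gaps([0.0, None, None, 3.0, None], max_gap=2))
# [0.0, 1.0, 2.0, 3.0, None]
```

In a full 3D track, the same procedure would be applied per coordinate axis; orientations would additionally require spherical interpolation.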
6. Analysis and comparison to existing human motion datasets
This section compares our THÖR-MAGNI dataset with popular human trajectory datasets, specifically the ETH/UCY benchmark and THÖR. Our analysis encompasses a multidimensional evaluation, covering various facets of the data recordings. These include trajectory continuity, social proxemics delineating interpersonal interactions, and motion characteristics such as velocity profiles and trajectory linearity. Through this comparison, we aim to situate THÖR-MAGNI among its predecessors, showing its potential for advancing human motion analysis and human–robot interaction research.
6.1 Metrics for trajectory data comparison
To evaluate the trajectory data of our dataset in comparison to previous data collections, we employ metrics proposed by Rudenko et al. (2020a) and Amirian et al. (2021): • tracking duration • minimal distance between people • motion speed • path efficiency (trajectory linearity) • number of non-overlapping tracklets
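Two of these metrics can be sketched for a single 2D tracklet as follows (an illustration of the metric definitions, not the exact benchmark code):

```python
import math

def path_length(track):
    """Total travelled distance along a polyline of (x, y) points."""
    return sum(math.dist(p, q) for p, q in zip(track, track[1:]))

def path_efficiency(track):
    """Ratio of straight-line displacement to travelled path length (<= 1)."""
    length = path_length(track)
    return math.dist(track[0], track[-1]) / length if length > 0 else 1.0

def mean_speed(track, dt):
    """Average speed of a tracklet sampled at a fixed interval dt (seconds)."""
    return path_length(track) / (dt * (len(track) - 1))

# L-shaped tracklet of three samples at dt = 0.4 s (illustrative values).
track = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
print(round(path_efficiency(track), 3))  # 0.707
print(mean_speed(track, dt=0.4))         # 2.5 (m/s)
```

A perfectly straight walk gives a path efficiency of 1.0; detours and hesitation lower it, which is why it serves as a proxy for trajectory linearity.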
6.2 Trajectory data comparison
We compare our dataset with the THÖR dataset and the ETH/UCY trajectory prediction benchmark. The THÖR dataset encompasses three distinct scenarios, each featuring participants performing different tasks, such as individual and group movement and box transportation, with different numbers of obstacles and a mobile robot in the environment. In THÖR Scenario 1 (THÖR-S1), participants navigate the environment with one static obstacle. THÖR Scenario 2 (THÖR-S2) introduces a mobile robot navigating around the static obstacle while participants continue their tasks. Finally, in THÖR Scenario 3 (THÖR-S3), the mobile robot becomes a static obstacle, and an additional obstacle is added to the scene. The ETH/UCY trajectory prediction benchmark consists of five scenes: ETH, HOTEL, UNIV, ZARA1, and ZARA2. These scenes represent five outdoor public spaces capturing natural human motion patterns, resulting in a benchmark widely used by the human trajectory prediction community (Almeida and Mozos, 2023; Dendorfer et al., 2021; Salzmann et al., 2020; Yue et al., 2022).
First, we show the tracking durations in Figure 14. THÖR presents consistent average tracking durations of around 15.5 to 17.6 seconds across the three scenarios. In contrast, THÖR-MAGNI shows wider variation: for instance, Scenario 4 features longer tracking durations (averaging 41.3 seconds), whereas Scenario 2 has the shortest (averaging 17.1 seconds). This variability can be attributed to participant density; Scenarios 4–5, involving fewer human agents in a smaller space, may allow higher-quality tracking. Nevertheless, THÖR-MAGNI has comparable or longer tracking durations than THÖR. Furthermore, compared to the ETH/UCY benchmark (i.e., the ETH, HOTEL, UNIV, ZARA1, and ZARA2 scenes), THÖR-MAGNI offers comparable or significantly longer tracking durations. This makes our dataset more valuable than its predecessors for tasks such as long-term human motion prediction and human–robot interaction.
Figure 14. Tracking durations (mean ± one standard deviation) across datasets in seconds. Scenarios 1–3 of THÖR-MAGNI provide comparable tracking durations to previous datasets, while Scenarios 4 and 5 provide longer tracks.
Second, we compare the minimal distance between people in Figure 15. Again, human density plays an important role: THÖR-MAGNI Scenarios 1–3 show low values comparable to those in ZARA1/ZARA2, while Scenarios 4 and 5 reach values similar to THÖR, ETH, and HOTEL. The higher participant density in THÖR-MAGNI Scenarios 1–3 results in reduced spatial navigational freedom, leading to increased interactions and decreased social distances between individuals.
Figure 15. Minimal distance between people (mean ± one standard deviation) across datasets in meters. Lower spatial navigational freedom in Scenarios 1–3 of THÖR-MAGNI potentiates reduced social distances between participants; these results are more consistent with the ZARA1 and ZARA2 scenes, while Scenarios 4 and 5 (with more spatial freedom) show results similar to the THÖR, ETH, and HOTEL datasets.
Third, the motion speed statistics are shown in Figure 16. Despite the higher participant density in Scenarios 1–3 of THÖR-MAGNI, these scenarios feature faster human agent navigation than THÖR, akin to the ETH, HOTEL, and ZARA1 scenes, possibly influenced by the object transportation tasks impacting the velocity profiles. Participants in Scenarios 4–5 of THÖR-MAGNI have an average velocity similar to those in THÖR, UNIV, and ZARA2. THÖR-MAGNI also generally shows comparable standard deviations in motion speed, indicating diverse and varied movement patterns among human agents. The similarity of the velocity profiles to previous datasets suggests that our dataset is likewise natural and diverse.
Figure 16. Motion speed (mean ± one standard deviation) for 8-second tracklets across datasets in meters per second.
Finally, we compare path efficiency and the number of tracklets in Figure 17. Regarding trajectory linearity, Scenarios 1–3 align with the THÖR and HOTEL datasets, while the other datasets from the ETH/UCY benchmark contain more linear and less complex trajectories. It is also worth noting that THÖR-MAGNI Scenarios 4 and 5 display the lowest average path efficiency (0.78 and 0.75, respectively). The presence of a moving robot might influence these scenarios, prompting human agents to navigate cautiously and align their motion with the robot’s motion profile. Furthermore, THÖR-MAGNI presents a much higher number of non-overlapping tracklets than the other datasets.
These distinctive features make our dataset uniquely challenging, diverse, and valuable as a benchmark for evaluating human trajectory prediction methods. The heightened complexity and diverse range of trajectories in THÖR-MAGNI can provide a robust platform for assessing the effectiveness of trajectory prediction methods, thereby increasing the breadth and depth of research in this area.
7. Conclusions
In this paper, we present THÖR-MAGNI, a comprehensive human and robot navigation and interaction dataset, extending THÖR (Rudenko et al., 2020a) with 3.5 times more motion data, novel interactive scenarios, and rich contextual annotations. Both datasets are accessible online at https://thor.oru.se/. To further support researchers, THÖR-MAGNI comes with a dedicated set of user-friendly tools: a dashboard and a specialized Python package called thor-magni-tools.
THÖR-MAGNI was created to fill a gap in human motion analysis datasets, limiting HRI research: a lack of comprehensive inclusion of exogenous factors and essential target agent cues, which hinders holistic studies of human motion dynamics. Unlike existing datasets, THÖR-MAGNI includes a broader set of contextual features and offers multiple variations to facilitate factor isolation. Our dataset integrates different modalities, such as walking trajectories, eye-tracking data, and environmental sensory inputs captured by a mobile robot.
THÖR-MAGNI comprehensively represents mobile robots’ and humans’ diverse navigation styles in shared environments using multi-modal data. Our dataset contributes to the evolving landscape of human motion research through a comparative analysis with state-of-the-art datasets. Furthermore, we discuss the features of our dataset in the context of human motion and robot interaction, highlighting their importance in addressing gaps in the existing literature. The THÖR-MAGNI dataset has already been used in research papers, demonstrating its usefulness for training role-conditioned motion prediction models (Almeida et al., 2023) and investigating visual attention during human–robot interaction and navigation in shared environments with robots (Schreiter et al., 2023, 2024).
In the future, we intend to propose a benchmark for multi-modal indoor trajectory prediction methods that leverage the rich contextual cues in THÖR-MAGNI. This work aims to advance the field by facilitating the development of more precise models of human motion. Future data acquisitions should encompass a broader range of environments, increase the size of individual scenario acquisitions, and include extensive coverage of vital modalities such as eye tracking to measure situational awareness and mutual intention of all participants. These efforts will enhance the generalizability of future generations of datasets. Additionally, transitioning data acquisitions from fixed laboratory environments to real-world settings under varying conditions will improve the collected data’s ecological validity and robustness.
