Abstract
Keywords
1. Introduction
In recent years, research on human motion modeling, prediction, and interaction with social and service robots has grown rapidly, driven by industrial interests and the quest for safer algorithms in human–robot interaction settings. Various advanced automated systems, such as mobile robots (including autonomous vehicles), manipulators, and sensor networks, benefit from human motion models for safe and efficient operation in the presence of humans. Human motion data is central to human-aware path planning, collision avoidance, tracking, interaction, understanding human activities, and collaborating on shared tasks.
Modern approaches to modeling human motion require plentiful data, recorded in diverse environments and settings, for both training and evaluation (Rudenko et al., 2020b). Among the growing number of human trajectory datasets, most focus on capturing interactions between the moving agents in indoor (Brščić et al., 2013), outdoor (Robicquet et al., 2016), and automated driving (Bock et al., 2020) settings. These datasets are designed to study how people interact and avoid collisions in social settings by describing their motion through position and velocity information. Further datasets attempt to capture full-body motion in various activities and human–object interactions in household settings (Ehsanpour et al., 2022; Kratzer et al., 2020; Liu et al., 2019).
Human motion is influenced by many exogenous factors, which cumulatively amount to the
Furthermore, beyond the environment context, there are various aspects of the specific person—
Existing datasets in human motion analysis often lack the comprehensive inclusion of exogenous factors and target-agent cues necessary for holistic studies of human motion dynamics. This research gap hinders the development of robust models that capture the relationship between contextual cues and human behavior in different scenarios. To address this gap, we present a novel dataset incorporating a broader set of contextual features and multiple variations to support factor isolation. By integrating diverse modalities such as walking trajectories, eye-tracking data, and environmental sensory inputs captured by a mobile robot (see Figure 1), our dataset fosters the exploration and analysis of human motion in various scenarios with increased fidelity and granularity.

In this paper, we propose a novel dataset of accurate human and robot navigation and interaction in diverse indoor contexts, building on the previous THÖR dataset (Rudenko et al., 2020a). The THÖR dataset established a foundation for collecting open-source data on human social navigation toward randomized targets in a controlled setting using motion capture technology with minimal scripting. Building on prior studies (Mavrogiannis et al., 2019), which utilized reflective markers on helmets in small, spatially confined settings with a limited number of participants, the THÖR datasets extend this methodology and offer a broader scope. In particular, the THÖR-MAGNI dataset represents a significant advancement, enhancing data quality and features to provide rich insights into human motion and interactions within a larger room. The publicly available THÖR datasets, especially THÖR-MAGNI, facilitate more comprehensive research on human–robot interaction and human social navigation. The THÖR-MAGNI data collection is designed around systematic variation of environmental factors, allowing users to build cue-conditioned models of human motion and to verify hypotheses on factor impact.
To that end, we propose several scenarios in which the participants, in addition to primary navigation, need to move objects, interact with each other and the robot, and respond to remote instructions. The dataset includes differential and omnidirectional robot navigation, semantic zones, direction signs in the environment, and many other aspects. We provide position and head orientation for each moving agent, as well as 3D lidar scans and gaze tracking. Finally, we provide tools to visualize the dataset’s multiple modalities and preprocess the trajectory data. In total, THÖR-MAGNI captures 3.5 hours of motion of 40 participants over 5 days of recording, which is available for download.
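As a simple illustration of the kind of trajectory preprocessing such tools support, the sketch below derives per-agent velocities from raw position samples with pandas. The frame rate and column names here are assumptions for illustration only, not the dataset's actual schema:

```python
import numpy as np
import pandas as pd

# Hypothetical excerpt of one agent's track: timestamps (s) and 2D
# positions (m). The 100 Hz rate and column names are assumptions,
# not the dataset's documented schema.
FRAME_RATE = 100.0  # Hz

df = pd.DataFrame({
    "time": np.arange(5) / FRAME_RATE,
    "x": [0.000, 0.012, 0.025, 0.037, 0.050],
    "y": [0.000, 0.000, 0.001, 0.001, 0.002],
})

# Finite-difference velocities and speed: a common preprocessing step
# before feeding trajectories to a motion prediction model.
df["vx"] = df["x"].diff() * FRAME_RATE
df["vy"] = df["y"].diff() * FRAME_RATE
df["speed"] = np.hypot(df["vx"], df["vy"])
print(df[["time", "vx", "vy", "speed"]])
```

In practice one would also resample to a uniform rate and smooth the finite differences, since raw motion capture positions are noisy.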
Furthermore, we note the continuity between the THÖR and THÖR-MAGNI recordings due to their shared environment (in diverse configurations), motion capture system, and complementary scenario composition.

Figure 1. THÖR-MAGNI data modalities: (1) walking trajectories of participants in a workplace setting shared with other humans and robots; (2) lidar sweep recorded with a mobile robot; (3) snapshot from an eye tracker’s gaze overlay video; (4) fish-eye camera image from the mobile robot, showing object stashes and two goal points from our scenarios.
In this paper, we motivate and detail the THÖR-MAGNI data collection and sensor setup, describe the interfaces to the dataset, and compare it to the prior datasets. The paper is structured as follows: in Section 2, we review the prior state-of-the-art datasets, and in Section 3, we outline the target application domains. Section 4 provides all necessary information about the data collection, and Section 5 describes the data formats and tools used to visualize and preprocess the data. Finally, Section 6 presents a quantitative evaluation of the collected data followed by a conclusion in Section 7.
2. Related work
Multi-modal human motion datasets, covering gait patterns, gaze vectors, human–robot interactions, and robot sensor data, drive various research applications. These include human motion prediction (Kothari et al., 2022; Rudenko et al., 2020b), human motion representation for mobile robots (Kucner et al., 2023), human–robot interaction (Dahiya et al., 2023), human awareness in robot motion planning (Faroni et al., 2022; Heuer et al., 2023), and gaze-based prediction of human pose and locomotion mode (Li et al., 2022; Zheng et al., 2022).
2.1 Human trajectory datasets
Early datasets such as UCY (Lerner et al., 2007) and ETH (Pellegrini et al., 2009) have contributed significantly to our understanding of human movement in outdoor environments. Although these datasets encompass a range of human motion attributes such as trajectories, group identification, and goal points, social interactions play a minor role in shaping the human trajectories they contain (Makansi et al., 2022). The indoor ATC dataset introduced by Brščić et al. (2013) represents a data collection with high coverage and tracking accuracy due to the use of 49 range sensors for raw data acquisition. The tracking method involved the independent estimation of positions and body orientations from each sensor, which were subsequently fused. This fusion process increased the robustness of the primary estimates and ensured a high degree of accuracy in the resulting dataset. In contrast to the UCY and ETH datasets, our dataset contains many social interactions, as we always had multiple participants moving in the same space between goal points, deliberately allowing for frequent interactions between participants (see Section 3.1). Furthermore, unlike the ATC dataset, we have included a mobile robot in the scene, which allows the study of human–robot interaction scenarios (see Section 3.4).
Munaro and Menegatti (2014), Dondrup et al. (2015a), and Ehsanpour et al. (2022) have presented human motion datasets acquired through mobile robotic systems. While the datasets presented by Munaro and Menegatti (2014) and Dondrup et al. (2015a) consist of short acquisitions and have limited contextual information such as maps or environmental goals, Ehsanpour et al. (2022) have contributed a more comprehensive dataset. Their dataset includes detailed annotations of micro-actions and social group dynamics, offering a richer and more contextualized understanding of human motion patterns in diverse environments. However, in these datasets, human locations are based on detections in the sensor’s field of view onboard the mobile robot, which limits the scope of tracking due to occlusions. In contrast to these works, we used a motion capture system to track the moving agents (described in Section 4.5), which provides longer continuous tracking of each observed agent.
Kratzer et al. (2020) presented the MoGaze dataset, a notable advancement, by incorporating a motion capture system for full-body pose tracking and eye-tracking data for humans engaged in various activities. Similarly, Chen et al. (2022) proposed a human-tracking dataset for recording human–robot cooperation tasks in retail environments. However, neither dataset captures social interactions as they track only one person. In addition, MoGaze does not include a mobile robot in the scene. The absence of these elements hinders the study of downstream applications, for instance, robot motion planning methods in the “invisible robot” settings (Heuer et al., 2023), in which the humans do not react to the robot’s motion and location, but rather the full extent of collision avoidance falls on the robot. Similar to THÖR-MAGNI, the THÖR dataset introduced by Rudenko et al. (2020a) presents accurate human motion trajectories in the presence of a robot. While the THÖR dataset provides tracking accuracy in a socially dynamic environment, its limited recording duration (1 hour) poses challenges for in-depth studies, particularly concerning data-intensive deep learning-based methods for trajectory prediction.
2.2 Human–robot interaction
Understanding human motion is crucial in spatial human–robot interaction (sHRI). It allows robots to anticipate and adapt to human movements in shared environments, enhancing their safety, efficiency, and naturalness. This section situates the THÖR-MAGNI dataset within the context of existing HRI and robotics datasets.
Datasets like UF-Retail HRI (Chen et al., 2022), SiT (Bae et al., 2024), Mavrogiannis (Mavrogiannis et al., 2019), SoGRIN (Webb et al., 2023), and THÖR (Rudenko et al., 2020a) offer varied insights into HRI across different settings. UF-Retail HRI emphasizes social human navigation in retail environments using sensors like MoCap, eye tracking, and RGB cameras. SiT provides indoor and outdoor data to analyze pedestrian detection and trajectory prediction. SoGRIN investigates nonverbal social signals in group interactions, utilizing MoCap and RGB cameras to capture detailed motion and interaction cues. THÖR explores motion trajectories in shared spaces with a robot, using a MoCap system to enhance data accuracy. Several studies further emphasize the importance of understanding human social behaviors in HRI. For instance, Althaus et al. (2004) highlight the need for robots to exhibit behaviors that align with human social norms to enhance natural interaction in shared environments. Kretzschmar et al. (2016) present a novel approach to model cooperative behavior, highlighting the importance of understanding and imitating human interaction patterns for effective HRI. These studies emphasize the significance of data acquisitions like the ones from Bae et al. (2024), Dondrup et al. (2015b), and Yan et al. (2017), which provide critical data for analyzing pedestrian behaviors and interactions in shared spaces.
The THÖR-MAGNI dataset offers extensive indoor human–robot interaction data using MoCap, 3D lidar, and RGB-D cameras to record motion and social interactions in various contexts. It enriches the field by including scenario-based interactions, making it ideal for analyzing human social navigation and collaboration. In comparison to the predecessor THÖR dataset, THÖR-MAGNI represents a significant improvement, incorporating more exogenous factors such as lane markings and one-way passages, and introducing specific HRI scenarios. These scenarios involve participants navigating shared environments with a semi-autonomous mobile robot, supervised by an experimenter. The dataset explores robotic assistance in industrial settings, focusing on task efficiency and user experience in collaborative workflows. This makes THÖR-MAGNI uniquely valuable for advancing our understanding of human–robot interaction.
2.3 Comparison between THÖR and THÖR-MAGNI
Comparison of human trajectory datasets.
3. Context of the THÖR-MAGNI dataset
The THÖR-MAGNI dataset provides diverse navigation styles of a mobile robot and humans engaged in various activities in a shared environment with robotic agents. It incorporates multi-modal data for a complete representation. Following a comparative analysis of our dataset against state-of-the-art datasets in the evolving landscape of human motion research (see Section 2), this section supports users of our dataset with a detailed exploration of its features in the context of human motion, robot navigation, and interactions. We explain their significance in addressing the identified gaps before describing the dataset in Section 4.
3.1 Goal-directed human motion trajectories
Goal-directed human agents are crucial in human motion prediction (Chiara et al., 2022; Dendorfer et al., 2021; Zhao and Wildes, 2021). Traditional approaches often depict human agents as rational entities, acting logically and moving toward specific goals or destinations (Ziebart et al., 2009). Real-world recordings commonly show this directional traffic flow, characterized by distinct goal points, often resulting in a consistent and linear motion with limited diversity. In our dataset, we include scenes with seven different goal points distributed over a larger spatial volume and scenes where they are arranged in a more compact space (see Section 4). Goal points and static obstacles are positioned strategically to ensure that recorded trajectories are sufficiently long and topologically diverse, that is, covering a range of spatial arrangements and configurations. This approach allows for the inclusion of frequent interactions between the moving agents, contributing to a more comprehensive understanding of human motion dynamics.
3.2 Navigation of heterogeneous agents
Heterogeneous agents are dynamic entities that navigate with distinct motion patterns. This heterogeneity stems from various factors that affect the motion, such as tasks and ongoing activities performed by the agent (Almeida et al., 2023). For instance, several works have studied how humans move individually or as part of a social group (Moussaïd et al., 2010; Rudenko et al., 2018; Wang et al., 2022). It has been shown that humans can coordinate their movements as a group by following simple rules based on the visual perception of local motion (Boos et al., 2014). Previous research on the anatomy of leadership in collective behavior (Garland et al., 2018) describes human collective behavior as optimal coordination and leadership dynamics in various group scenarios. In particular, crowd dynamics are determined by physical constraints and significantly influenced by communicative and social interactions among individuals (Moussaïd et al., 2010). Autonomous driving datasets often highlight the motion of heterogeneous agents in mixed traffic (Chandra et al., 2019; Salzmann et al., 2020). In our dataset, we introduce roles for participants tailored for industrial tasks, such as navigating alone or in groups of different sizes, transporting various objects, and interacting with a robot. This heterogeneous social setting provides a novel way to study how specific industrial roles influence human motion, aligning with the work conducted by Almeida et al. (2023).
3.3 Navigation of a robotic agent
Human-aware robot motion planning is crucial for safe navigation in shared spaces, especially in narrow and crowded indoor environments (Cancelli et al., 2023). Understanding human interaction with robots of different driving styles promotes the design of socially acceptable motion planners (Möller et al., 2021). Analyzing participant behavior with robots of varied movement patterns reveals insights into how robot motion style affects human expectations (Karnan et al., 2022; Mavrogiannis et al., 2019), guiding the development of robots that interact safely and are well-received by people (Shah et al., 2023). Our dataset features scenarios with a mobile robot in teleoperated and semi-autonomous modes and two driving styles: differential drive (forward, backward, and turning) and omnidirectional mode (allowing the robot to drive in any direction while keeping its heading). This variety of motion modes (detailed in Section 4.2.2) extends the state-of-the-art datasets of teleoperated navigation which feature a single driving style (Karnan et al., 2022; Rudenko et al., 2020a). Lastly, while some parts of our dataset (Section 4.3.3) might be interesting for the field of social robot navigation (see Mavrogiannis et al., 2023, who recently surveyed this field), the main focus of the dataset is on human social navigation and spatial human–robot interaction in shared workplaces.
3.4 Spatial human–robot interaction in shared workplace settings
Industry 5.0 aims to prioritize human well-being in manufacturing systems (Leng et al., 2022). This requires enhancing the quality of human–machine and human–robot interactions in these environments. Designing robots that clearly express their intentions to human collaborators is a crucial step toward fostering mutual understanding and enhancing the well-being of workers who regularly interact with robots (Pascher et al., 2023). Furthermore, intuitive human–robot interaction (HRI) improves well-being and enhances safety and efficiency in collaborative settings (Haddadin et al., 2011).
Spatial HRI (sHRI) and navigation in shared environments are research areas that have an adherent need for accurate datasets of human motion tracking and prediction (Chen et al., 2022; Rudenko et al., 2020a) and for robots that understand the underlying physical interactions between nearby agents and objects (Castri et al., 2022). Our dataset contains recordings of explicit interactions between a mobile robot and individuals in shared workplace settings. THÖR-MAGNI is a valuable resource for studying human responses to robotic approaches and assistance initiatives, enabling researchers to analyze goal-oriented interactions between humans and robots.
3.5 Eye tracking and head orientation in navigation tasks
Eye tracking is a powerful method to study various aspects of human behavior, including attention, emotion, cognition, and decision-making, with applications spanning education, marketing, gaming, and healthcare (Duchowski, 2017). Eye tracking provides objective data about eye movements and positions and enables researchers to quantify visual information processing through various metrics (Duchowski, 2017; Mahanama et al., 2022). In HRI applications, human eye-gaze is an important nonverbal signal (Admoni and Scassellati, 2017). Our dataset aligns human gaze data with human motion trajectories, allowing us to study human gaze during visual exploration across dynamic tasks, activities, and scenarios.
Head orientation provides another essential modality of human behavior that is complementary to gaze direction and attentional focus. Head orientation plays a vital role in joint attention, that is, attention coordination between individuals focusing on the same point of interest (Tomasello, 2014). Furthermore, it is valuable for detecting interpersonal dynamics in multi-party interactions (Stiefelhagen and Zhu, 2002). Beyond its social implications, head orientation becomes a predictive indicator of walking motion goals (Holman et al., 2021) and can enhance human motion prediction through vision-based features (Salzmann et al., 2023). Using a state-of-the-art motion capture system and eye-tracking devices, our dataset provides highly accurate head poses and orientations aligned with the eye-tracking data.
3.6 Semantic environment cues
Crucial environmental information, conveyed by semantic cues such as doors, stairs, floor markings, and signs, is essential in guiding humans and robots within a given space. These cues, combined with obstacle configurations, influence human interactions with the environment, leading to actions like detouring, bypassing, overtaking, and avoiding specific areas. In our dataset, we include semantic cues such as floor markings that indicate areas requiring caution or one-way passages that limit the flow of motion to one direction. In this way, we enable the exploration of navigation and interactions in semantically rich environments. For instance, leveraging Maps of Dynamics (Kucner et al., 2023) allows the quantification of changes in motion patterns around these cues. This information, in turn, can be utilized to predict long-term human motion dynamics, as demonstrated by Zhu et al. (2023).
4. Description of the THÖR-MAGNI dataset
Amount of eye-tracking and trajectory data recorded for various activities with all three devices: Tobii 2, Tobii 3, and Pupil Invisible glasses.
In this section, we detail the environment in which we recorded the data (Section 4.1), the navigation and task design for the participants and the robot (Section 4.2), interactive scenarios to emphasize the various contextual aspects of human motion (Section 4.3), participants’ background and priming (Section 4.4), and the technical implementation of the recording pipeline and collection of motion capture and eye-tracking data (Section 4.5).
4.1 Environment design
We conducted the data acquisition in a laboratory at Örebro University, the same as in the THÖR dataset (Rudenko et al., 2020a). The laboratory has two different configurations. One features a smaller, free-space environment (see Figure 2, left). The other resembles an industrial logistics setting and promotes frequent interactions between human and robotic co-workers (see Section 4.3). Both room configurations have seven goal positions to drive purposeful human navigation through the available space, generating frequent interactions in the center. Additionally, we include several environmental layouts (i.e., obstacle maps) in the THÖR-MAGNI dataset, which vary the placement of static obstacles (robotic manipulators and tables) in the room to prevent walking between goals in a straight path. Apart from static obstacles, two robots are in the room: a static robotic arm near the podium and an omnidirectional mobile robot with a robotic arm on top (see Section 4.2.2). Our dataset thus comprehensively explores human–robot interaction in a shared workplace environment.
4.2 Navigation and interaction design
The interaction and navigation design in THÖR-MAGNI extends the weakly-scripted motion recording procedure introduced in the THÖR dataset (Rudenko et al., 2020a). This procedure facilitates realistic motion in controlled settings, in which accurate ground-truth motion capture and eye-tracking data are collected using specialized equipment (see Figure 2, right). Our key idea is to assign meaningful activities and tasks to the recording’s participants, allowing them to concentrate on a continuous activity while moving freely inside the room shared with other people and robots. To generate a diverse range of interactions, we developed several scenes that vary in the composition of tasks, robot operation, and other contextual cues, as discussed in Section 3.
4.2.1 Tasks, activities, and roles requiring search and navigation
We aimed to simulate authentic scenes that reflect the different activities individuals perform in a workplace environment. To that end, we designed several tasks that require search, navigation, and interaction with objects, other participants, and a mobile robot. Participants engaged in those tasks according to their assigned roles.
Our dataset has two types of roles: (1) participants in the role of Carrier transported various objects of different sizes and shapes.
4.2.2 Modes of robot navigation and HRI
Our dataset includes a mobile robot, “DARKO” (see Figure 4), which acts as a static obstacle in some scenes and moves in others. This range of behaviors enables the study of participants’ movements and gaze behaviors with respect to the stationary or mobile status of the robot. In certain scenes, the robot was teleoperated and moved omnidirectionally, enabling it to reach any 2D position from a stationary pose. In others, it moved differentially with a predetermined forward orientation. In yet others, the DARKO robot navigated semi-autonomously toward manually set goal points. An experimenter supervised the navigation of DARKO for safety reasons. When acting semi-autonomously, the robot interacted with participants through a communication intermediary called the “Anthropomorphic Robot Mock Driver” (ARMoD).

Figure 4. Robot used in and for the data collection (the “DARKO” robot), with an omnidirectional mobile base (RB-Kairos) of dimensions 760 × 665 × 690 mm (5), equipped with two sensor towers, one hosting two Azure Kinect RGB-D cameras (2) and one hosting an Ouster OS0-128 lidar and two Basler fish-eye RGB cameras (4). Additional equipment includes two Sick MicroScan 2D safety lidars (6), mecanum wheels (7), and a NAO robot (the “ARMoD”) for interaction with participants (3). Our recordings did not use the robotic arm, which has a maximum height of 855 mm (1).
The ARMoD is a small humanoid NAO robot, shown in Figure 4, which sat on top of the DARKO robot. The ARMoD displayed two behaviors during interactions: one using only the voice (
4.3 Scenario design
We address the context of agent movement by including both humans and robots, as previously discussed, in five specifically designed scenes we call “scenarios.” Scenario 1 captures the dynamics of motion arising from semantic attributes of the environment and sets up a baseline for goal-directed social human navigation. Scenario 2 adds role-specific motion for some participants navigating the environment. Subsequently, Scenario 3 explores the impact of different robot motion styles on these role-specific patterns. Figure 5 depicts a detailed overview of the room configuration and varying environmental layouts for Scenarios 1–3. Scenario 1’s conditions A and B capture regular social behavior in a static environment with and without additional floor markings and a one-way passage. Scenario 2 maintains the same layout as Scenario 1A but introduces individuals performing tasks, emulating industrial activities. Scenario 3 explores human–robot interactions by varying the driving modes of the mobile robot teleoperated by experimenters on a podium.

Figure 5. Varying environmental layouts for the room configuration of Scenarios 1–3.
Transitioning to the smaller room configuration, we present two scenarios to explore human motion and intended interactions between humans and robots: Scenarios 4 and 5. In Scenario 4, participants engaged in intermittent interaction with a mobile robot, which communicated in two interaction styles through an intermediary entity to mediate joint navigation with participants toward goal points. In Scenario 5, the robots and a human co-worker collaborated actively in transporting small storage bins. For a comprehensive overview of roles and scenarios, see Figure 6.

Figure 6. Scenario definitions in the THÖR-MAGNI dataset, including roles, robot motion status (e.g., autonomous or teleoperated), environment layout (i.e., obstacle maps), specific scenario conditions, and duration and recording days. Each recording day has a unique set of participants: day 1 has nine participants and days 2–4 have seven participants each. Three mobile eye-tracking devices were used daily for three participants; on day 5, two devices were used for two sets of participants. The duration of recorded trajectory and eye-tracking data is provided in Table 2.
We recorded multiple runs for each condition in Scenarios 1–5. Specifically, we recorded two runs per condition for Scenarios 1 and 3, two for Scenario 2, four per condition for Scenario 4, and four runs for Scenario 5. To counterbalance learning-based effects, we randomized the recording order of conditions for Scenarios 3 and 5. We implemented this systematic approach to ensure a broad and impartial exploration of the scenarios, capturing subtle interactions and behaviors in each setting.
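The condition-order randomization mentioned above can be sketched as a per-day shuffle of the run schedule (a generic illustration with hypothetical condition labels, not the protocol code actually used):

```python
import random

def randomized_order(conditions, runs_per_condition, seed=None):
    """Shuffle the run schedule to counterbalance learning effects."""
    rng = random.Random(seed)
    schedule = list(conditions) * runs_per_condition
    rng.shuffle(schedule)
    return schedule

# Hypothetical condition labels for a scenario with two driving styles:
order = randomized_order(["differential", "omnidirectional"],
                         runs_per_condition=2, seed=7)
print(order)
```

Fixing the seed per recording day would make a schedule reproducible while still varying it across days.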
4.3.1 Scenario 1: Capturing motion dynamics in the environment
Scenario 1 comprises two conditions:

Maps of dynamics created from one day of data acquisition.
4.3.2 Scenario 2: Role-specific motion patterns in industrial environments
Scenario 2 features the same environment layout as Scenario 1A (Figure 7 left). In addition to the goal-driven navigation (
In summary, this scenario presents role-specific tasks for participants and goal-driven navigation, creating a platform to study the impact of human occupation on their motion profiles and those of the other agents in a shared environment.
4.3.3 Scenario 3: Impact of mobile robot motion on human behavior
With Scenario 3, we introduce an opportunity to study the interplay between human activities and a mobile robot. In this scenario, the stationary DARKO robot of Scenarios 1 and 2 becomes mobile, enabling the study of changes in the humans’ motion patterns based on the mobile robot’s driving style. This scenario comprises two conditions, in which we modulated the way the mobile robot navigates:

Two types of mobile robot motion achievable with mecanum wheels (
4.3.4 Scenario 4: Spatial HRI in a shared environment
This scenario includes participants with the roles of
Participants assigned with the role of
Participants move either individually or in pairs between designated goal points. A specific card directs the individual participants (
To ensure safe and seamless interactions, the ARMoD’s behaviors were triggered by an experimenter using a controller (see Figure 9, left). The experimenter initiated actions like “Greet the closest participant” and “Talk to the participant,” guiding the ARMoD’s communication with participants. Concurrently, the mobile robot continued its autonomous navigation, albeit under the oversight of the experimenter, who could pause its movements if necessary.
Accurate tracking of individuals was essential for facilitating seamless interactions between the ARMoD and the participants. To determine the ARMoD’s position relative to individuals at any given moment, we leveraged the motion capture system’s data, broadcast into the local network using the Robot Operating System (ROS) (Quigley et al., 2009). This integration ensured precise transformations and provided position and orientation information, enabling the ARMoD to accurately point, look, and establish eye contact with its interaction partners. Figure 9 (right) illustrates an interaction between a participant and the ARMoD in this scenario. The position and orientation data of participants, robots, and the world frame are broadcast within the local network, providing essential information to the path planner for DARKO and the interaction scheduler for the ARMoD. In this figure, examples of established coordinate frames include (1) that of the participants’ helmets, defined based on the orientation of the markers, (2) a static coordinate frame for the ARMoD, derived from the DARKO robot’s frame through an offset, (3) DARKO’s coordinate frame, and (4) the motion capture reference frame, called the “QTM-World Frame.”
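The frame bookkeeping described above amounts to composing and inverting rigid-body transforms. A minimal planar (SE(2)) sketch, with made-up poses standing in for the QTM world frame, DARKO’s frame, and the ARMoD’s static offset, might look like this:

```python
import numpy as np

def se2(x, y, theta):
    """Homogeneous 2D transform mapping frame coordinates into the parent frame."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, x],
                     [s,  c, y],
                     [0.0, 0.0, 1.0]])

# Illustrative poses (all numeric values are made up for this sketch):
T_world_darko = se2(2.0, 1.0, np.pi / 2)    # DARKO's base in the world frame
T_darko_armod = se2(0.3, 0.0, 0.0)          # static offset of the ARMoD on DARKO
p_world_helmet = np.array([2.0, 3.0, 1.0])  # a helmet position (homogeneous)

# Compose the chain and express the helmet position in the ARMoD's frame,
# e.g., to orient the ARMoD's gaze toward its interaction partner.
T_world_armod = T_world_darko @ T_darko_armod
p_armod = np.linalg.inv(T_world_armod) @ p_world_helmet
print(p_armod[:2])  # helmet position relative to the ARMoD
```

In the actual pipeline, ROS handles such transforms through its tf machinery; the sketch only shows the underlying algebra.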
This scenario investigates free movement in a shared environment alongside the DARKO robot, exploring semi-autonomous navigation. Participants engaged in interactions with ARMoD under varied conditions. These allow for a study of human–robot interactions, navigation tasks, and the impact of different interaction styles on participants’ activities and movements.
4.3.5 Scenario 5: Spatial human–robot interaction, proactive robotic assistance
This scenario involves the roles:
4.4 Participants background and priming
The average age of the participants was 30.18 years, with a standard deviation of 6.73, indicating a relatively homogeneous age group. The dataset contains a balanced gender distribution among its 40 participants, of whom 21 are male and 19 female. Geographically, 23 participants are from Sweden. Ten participants are from other European countries, including the Czech Republic, Spain, Germany, and Italy, reflecting a diverse European representation. The remaining seven participants come from countries in Asia, Africa, and South America, providing a broader international scope. We recruited the participants from different areas of the campus. Their backgrounds varied considerably, including differences in their highest academic degree and primary subjects. At the beginning of each recording day, participants completed a demographic questionnaire. We used this information to create diverse group compositions, aiming for optimal allocation of eye-tracking devices across different roles (see Figure 10). For example, we ensured that groups of two or three participants contained only one participant equipped with an eye tracker and that at least one of the carriers was equipped with an eye tracker. Initial priming of participants was performed at the beginning of each recording day: participants were instructed about the experimental setting and the recording procedure, including a briefing on the tasks, establishing familiarity with the equipment, filling out consent forms, and an initial set of questionnaires.
At the beginning of each recording day, we provided standardized information to participants to ensure natural and unbiased behaviors. The instruction emphasized the experiment’s focus on testing the robot’s perception of humans, involving tasks such as navigating the laboratory and executing physical activities, with an estimated duration of 15 min.
During the data collection procedure, we guided the participants through a series of runs with specific instructions tailored to each scenario. Between successive runs, participants completed questionnaires while logistical preparations were made, such as removing floor markings, configuring a phone for voice chat using Discord (before Scenarios 2 and 3), monitoring and, if necessary, changing the batteries of the eye trackers, and preparing the robots for Scenarios 3–5. After completing the questionnaire, participants were assigned new roles in Scenarios 2 and 3. We gave each group a new starting point for the next run, from which they drew their first card. Participants unfamiliar with their roles received a brief recap of their task-related responsibilities. In Scenario 3, we informed participants that an experimenter monitored the robot’s motion for safety and teleoperated the robot. In Scenarios 4 and 5, participants were first briefed about their roles in the scenario (see Section 4.3) and then introduced to the ARMoD and the DARKO robot as co-workers in the room, with the ARMoD acting as a communicator on behalf of the DARKO robot.
After each run, participants completed the raw version of the NASA Task Load Index (RTLX) (Hart, 2006; Hart and Staveland, 1988). The scale consists of 21-point subscales [1 = low; 21 = high] assessing the mental demand, physical demand, temporal demand, and effort produced by the task as reported by the participant, as well as their self-perceived performance and frustration. After the last run of Scenarios 3 and 5, participants completed two additional mobile robot questionnaires. First, they completed the Godspeed Questionnaire Series (Bartneck et al., 2009), a set of 5-point semantic differential subscales that measures participants’ perceptions of the robot in terms of anthropomorphism, animacy, likeability, perceived intelligence, and perceived safety. Second, they completed a 5-point Likert scale [1 = strongly disagree; 5 = strongly agree] to assess trust in the robot in industrial human–robot collaborations (Charalambous et al., 2016). Participants completed all questionnaires on paper.
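As a minimal sketch of how such responses are typically scored (variable names and ratings here are illustrative, not taken from the dataset), the RTLX score is the unweighted mean of the six subscale ratings:

```python
# Raw NASA-TLX (RTLX): unweighted mean of the six 21-point subscale ratings.
def rtlx_score(ratings: dict) -> float:
    subscales = ["mental", "physical", "temporal", "performance", "effort", "frustration"]
    if set(ratings) != set(subscales):
        raise ValueError(f"expected ratings for {subscales}")
    if not all(1 <= ratings[s] <= 21 for s in subscales):
        raise ValueError("ratings must lie on the 21-point scale [1, 21]")
    return sum(ratings[s] for s in subscales) / len(subscales)

# Example: one participant's ratings after a run (illustrative values).
example = {"mental": 12, "physical": 7, "temporal": 9,
           "performance": 5, "effort": 10, "frustration": 5}
print(rtlx_score(example))  # 8.0
```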
4.5 System setup
4.5.1 Hardware and software configuration
We used a motion capture system from Qualisys with 10 infrared cameras (Oqus 7+) positioned around the room to track the moving agents, providing comprehensive coverage of the room volume. Reflective markers, arranged in distinct patterns, were attached to bicycle helmets to form six-degrees-of-freedom (6DoF) rigid bodies. These were tracked at 100 Hz with a spatial resolution of 1 mm. The coordinate frame of the system originated at ground level in the center of the room. Each participant and the robot were represented in the system as unique rigid bodies, identifiable through their specific patterns of passive reflective markers. This configuration enabled the precise capture of each participant’s 6DoF head position and orientation. We provided the participants with individualized helmets for the recording sessions; the specific helmet IDs used during each recording session are listed in Tables 3, 4, and 5 in the Appendix.
We captured eye-tracking data using three distinct models of eye-tracking devices: Tobii Pro Glasses 2 and 3 and Pupil Invisible. The Tobii Glasses models record raw gaze data at a frequency of 50 Hz and camera footage at 25 Hz, while the Pupil Glasses record gaze data at 100 Hz and camera footage at 30 Hz. To export the Tobii Glasses data, we used the I-VT Attention filter, optimized for dynamic situations, which classifies gaze points into fixations and saccades based on a velocity threshold of 100°/s.
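The velocity-threshold principle behind such a filter can be sketched as follows (a simplified illustration only; the actual Tobii I-VT filter includes additional gap-filling and merging steps, and the gaze sample format here is an assumption):

```python
import math

def classify_ivt(gaze, threshold_deg_s=100.0):
    """Classify gaze samples into fixations/saccades by angular velocity.

    gaze: list of (timestamp_s, azimuth_deg, elevation_deg) samples.
    Returns one label per sample; the first sample has no velocity estimate.
    """
    labels = ["undefined"]
    for (t0, az0, el0), (t1, az1, el1) in zip(gaze, gaze[1:]):
        dt = t1 - t0
        # Small-angle approximation of the angular distance between samples.
        dist = math.hypot(az1 - az0, el1 - el0)
        velocity = dist / dt if dt > 0 else float("inf")
        labels.append("saccade" if velocity >= threshold_deg_s else "fixation")
    return labels

# 50 Hz samples: small drift (fixation), then a fast ~4 degree jump (saccade).
samples = [(0.00, 10.0, 5.0), (0.02, 10.1, 5.0), (0.04, 14.0, 5.0)]
print(classify_ivt(samples))  # ['undefined', 'fixation', 'saccade']
```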
The DARKO robot integrates several sensors, including an Ouster OS0-128 lidar, two Azure Kinect RGB-D cameras (one of which was used in these recordings), two Basler fish-eye RGB cameras, and two Sick MicroScan 2D safety lidars. The Azure Kinect cameras have a resolution of 2048 × 1536 at 6 Hz, a horizontal field of view of 75°, and a tracking range of up to 5 m. The Basler fish-eye RGB cameras have a resolution of 1700 × 1536 at 20 Hz. The DARKO robot is augmented with a NAO robot acting as ARMoD for participant interaction. The NAO is attached to a seat on the DARKO robot, facilitating the communication of spatial motion intent. This arrangement aligns the ARMoD’s body orientation with the direction of movement in scenarios where DARKO employs a directional driving style.
Recordings from the DARKO robot and the motion capture system were synchronized using ROS timestamps. Taking advantage of the integration of the motion capture system with ROS 1 Melodic, we recorded all of the robot’s onboard sensor data and the 6DoF positions of the people using ROS bag files and in text form.
4.5.2 Sensor calibration
The precision of the data acquisition relied on sensor calibration procedures to ensure accurate measurements and reliable data interpretation throughout the experiments. This section describes our calibration methods for the motion capture system and the eye-tracking devices. We followed a separate calibration routine for each sensor, which ensured the robustness and reliability of our dataset and allowed for accurate analysis and interpretation of participants’ behaviors and interactions within the recorded scenarios.
For the eye-tracking devices, we followed the calibration procedures for both Tobii Glasses models (see Figure 11) as outlined in their respective user manuals to optimize eye-tracking accuracy (Tobii AB, accessed 2024-02-02a,b). To ensure accurate recordings with the Pupil Invisible Glasses, we followed the best calibration practices outlined by Pupil Labs and validated the calibrations with their dedicated software (Pupil Labs AB, accessed 2024-02-02b).
To ensure the data accuracy of the motion capture system, rigorous daily calibration routines were performed before the start of each recording session. We used the standard calibration kit with a 502.2 mm carbon fiber wand to fine-tune the system. These calibrations allowed us to define precise rigid bodies that enabled 6DoF tracking. This approach ensured the accurate capture of spatial dimensions (X, Y, Z) and rotational elements (roll, pitch, yaw) of objects within the 3D environment, resulting in an average residual tracking error of 2 mm. Rigid bodies of helmets and objects, such as the large objects for the carriers or the DARKO robot, were strategically designed to enable simultaneous and highly accurate capture of all object poses and locations.
4.6 Post processing
Multi-modal data synchronization was necessary in our data collection. We used ROS and custom Python scripts to align the data streams while maintaining temporal integrity. To synchronize the motion capture and eye-tracking data, we placed custom events with precise timestamps in the two data streams, using the respective software of the eye-tracking devices, Tobii Pro Lab (Tobii AB, accessed 2024-02-02c) and Pupil Player (Pupil Labs AB, accessed 2024-02-02a), as well as the Qualisys Track Manager (QTM) (Qualisys AB, accessed 2024-02-02) for the motion capture system. This procedure resulted in CSV files in which the timestamps of all modalities are synchronized to the motion capture system’s timestamp. Within these files, eye-tracking data is available for frames in which the motion capture system tracks all rigid body markers, since a correct head orientation is a prerequisite for determining the 3D gaze vector. The frame numbers of each eye tracker’s scene recording are indexed in the column named “SceneFNr” in the corresponding CSV file.
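Aligning two streams to a common clock can be sketched with a nearest-timestamp match (an illustrative simplification of our pipeline, not the actual scripts; the data layout is hypothetical):

```python
import bisect

def align_to_mocap(mocap_ts, gaze_samples, tolerance=0.02):
    """For each motion-capture timestamp, pick the closest gaze sample.

    mocap_ts: sorted list of mocap timestamps (seconds).
    gaze_samples: sorted list of (timestamp, value) tuples.
    Returns one matched value (or None) per mocap frame.
    """
    gaze_ts = [t for t, _ in gaze_samples]
    aligned = []
    for t in mocap_ts:
        i = bisect.bisect_left(gaze_ts, t)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(gaze_ts)]
        best = min(candidates, key=lambda j: abs(gaze_ts[j] - t), default=None)
        if best is not None and abs(gaze_ts[best] - t) <= tolerance:
            aligned.append(gaze_samples[best][1])
        else:
            aligned.append(None)  # no gaze sample close enough to this frame
    return aligned

mocap = [0.00, 0.01, 0.02, 0.03]          # 100 Hz mocap frames
gaze = [(0.000, "g0"), (0.021, "g1")]     # ~50 Hz gaze samples
print(align_to_mocap(mocap, gaze))  # ['g0', 'g0', 'g1', 'g1']
```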
To facilitate a thorough analysis of the eye-tracking data in our study, we offer access to the raw data from the Tobii glasses, along with essential synchronization details. The scene recordings are provided in a blurred format and with the audio removed to ensure data protection. Access to the raw data from the Pupil Invisible glasses can be granted upon individual request, ensuring careful and ethical distribution of sensitive data.
The data acquisition was followed by an extensive post-processing stage, including synchronization and alignment, which aimed to refine and validate the collected data and protect sensitive information. This stage involved several vital procedures, such as eliminating artifacts and noise caused by marker occlusion, lighting variations, and camera disruptions. We also rectified misidentified trajectories through spatial and temporal consistency evaluations, applying manual adjustments when needed.
5. Working with the THÖR-MAGNI dataset
5.1 Data formats
For dissemination, the dataset is organized into five recording scenarios (see Section 4 for a detailed description), aligned with the respective days of data collection. Each scenario’s data is stored in a separate folder containing the multiple acquisitions conducted over the 5 days of recording. The folders for the first three scenarios (1–3) contain acquisitions from 4 days (in May 2022), while the folders for the last two scenarios (4 and 5) contain recordings from 1 day (in September 2022). We recorded multiple runs for each scenario and condition to enhance the diversity of motion data and mitigate random artifacts. Note that all files are intended to be extracted into a common directory; this arrangement preserves the temporal structure of the recorded data.
Each run’s data includes a CSV file and up to two .mp4 videos with the recordings from the scene cameras of the Tobii eye trackers. If the robot was in motion (Scenarios 3–5), the data also includes continuous 3D point clouds from the Ouster lidar and the RGB videos from one of the fish-eye cameras. The structure of the recorded data is shown in Tables 3, 4, and 5 in the Appendix. In the following subsections, we provide more specific details on the usage and processing of the individual files.
5.1.1 Comma-separated value files
Each CSV file contains a header with critical metadata, including the number of frames in the recording, rigid body and marker details, units of measurement, role labels, and eye-tracking specifics (see Table 6 in the Appendix). The rest of each CSV file contains the merged data from the motion capture system and the eye-tracking devices, organized based on the rigid bodies of the participants’ helmets. The data of each rigid body is thus organized into columns containing the XYZ coordinates of all markers (e.g., “Helmet_1 – 2 X” indicating helmet one, marker two, axis X), the XYZ coordinates of the centroid of all markers, the 6DoF orientation of the rigid body’s local coordinate frame, and, if available, eye-tracking data including 2D gaze coordinates, 3D gaze vectors, the frame number of the scene recording, eye movement types (such as saccades or fixations), and IMU data (accelerometer, gyroscope, and magnetometer).
Missing data is indicated by either “N/A” (not available) or an empty cell. The temporal indexing in these files is provided by the “Time” or “Frame” column, which indicates the timestamp or frame number of the motion capture system, respectively.
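Reading such a file can be sketched with the standard library alone (a minimal illustration; the number of metadata header lines and the exact column labels here are assumptions and should be checked against the actual files):

```python
import csv

def load_magni_csv(text, n_header_lines):
    """Parse a THOR-MAGNI-style CSV: skip metadata lines, read rows as dicts.

    Missing values ("N/A" or empty cells) are mapped to None.
    """
    lines = text.splitlines()[n_header_lines:]
    rows = []
    for row in csv.DictReader(lines):
        rows.append({k: (None if v in ("", "N/A") else v) for k, v in row.items()})
    return rows

# Illustrative file with two metadata lines and one tracked helmet centroid.
example = """N_FRAMES,2
UNITS,mm
Frame,Helmet_1 X,Helmet_1 Y,Helmet_1 Z
0,1204.1,-350.2,1710.5
1,1205.0,-349.8,N/A
"""
rows = load_magni_csv(example, n_header_lines=2)
print(rows[1]["Helmet_1 Z"])  # None (missing sample)
```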
5.1.2 Robot sensor data
The sensor data from the robot includes lidar data and videos captured by the Azure Kinect camera and the Basler camera. Lidar 3D point clouds are provided in the Point Cloud Data (PCD) file format, one per timestamp. The lidar data for each run is supplied in a zip file, labeled with the same File ID as referenced in Tables 3, 4, and 5 in the Appendix. Regarding video data, the RGB-D and fish-eye camera video streams are unrectified, providing raw visual data, and are only available upon request to ensure suitable data protection.
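As a minimal sketch of consuming such files, an ASCII PCD file can be parsed in a few lines of Python (in practice, a dedicated library such as Open3D is more convenient and also handles binary PCD):

```python
def parse_ascii_pcd(text):
    """Parse a minimal ASCII PCD file into a list of (x, y, z) points.

    Only handles DATA ascii; binary PCD files need a dedicated library.
    """
    lines = text.strip().splitlines()
    data_start = next(i for i, l in enumerate(lines) if l.startswith("DATA")) + 1
    if not lines[data_start - 1].endswith("ascii"):
        raise ValueError("only ASCII PCD is handled in this sketch")
    return [tuple(float(v) for v in l.split()[:3]) for l in lines[data_start:]]

# Tiny illustrative PCD with two points.
pcd = """VERSION .7
FIELDS x y z
SIZE 4 4 4
TYPE F F F
COUNT 1 1 1
WIDTH 2
HEIGHT 1
POINTS 2
DATA ascii
1.0 2.0 3.0
-0.5 0.0 4.2
"""
print(parse_ascii_pcd(pcd))  # [(1.0, 2.0, 3.0), (-0.5, 0.0, 4.2)]
```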
5.1.3 Additional data
In addition to the CSV files containing the recorded data from the eye trackers and the motion capture system, we provide the scene recordings from most of the Tobii eye-tracking devices as .mp4 videos. The videos of the scene recordings were carefully post-processed: we blurred all faces of the participants using dedicated video-redaction software (“Caseguard”) to ensure data protection. The raw scene camera video from the Pupil Invisible Glasses has lens distortions that must be corrected; for this purpose, we provide JSON files with the necessary intrinsic camera parameters. All data from the Pupil Invisible eye-tracking devices and the remaining data from the Tobii devices are available upon request.
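Loading such intrinsics can be sketched as below (the JSON field names are hypothetical, not the dataset’s actual schema; the resulting parameters would typically be passed to an undistortion routine such as OpenCV’s `cv2.undistort`):

```python
import json

def camera_matrix_from_json(text):
    """Build a 3x3 pinhole camera matrix from intrinsic parameters."""
    p = json.loads(text)
    return [[p["fx"], 0.0,     p["cx"]],
            [0.0,     p["fy"], p["cy"]],
            [0.0,     0.0,     1.0]]

# Hypothetical intrinsics file content (field names are illustrative).
example = '{"fx": 820.4, "fy": 819.7, "cx": 544.0, "cy": 540.5}'
K = camera_matrix_from_json(example)
print(K[0][0], K[1][2])  # 820.4 540.5
```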
5.2 Development tools
Most existing datasets in the field lack a dedicated toolbox for streamlined visualization and preprocessing. Addressing this gap, we contribute a set of data visualization tools, including a dashboard, and introduce a specialized Python package named thor-magni-tools.
5.2.1 Data visualization
To provide researchers and users with an intuitive interface for the exploration of human movement, gaze patterns, and environmental perception of the THÖR-MAGNI dataset, we made a set of visualization tools publicly available.
Our visualization dashboard provides a user-friendly interface with four key interactive components.
In addition to data visualization, our dashboard contains concise scenario descriptions. Each scenario represents a unique context in which human motion data was captured (described in Section 4.3). These descriptions include information such as the physical environment, task objectives, social interactions, and specific conditions imposed on the participants (e.g., transporting objects between two goal points). Understanding these scenarios is vital for accurately interpreting the data and ensures that researchers can contextualize their analyses effectively.
5.2.2 Data filtering and preprocessing with thor-magni-tools
To facilitate the use of the agents’ trajectories in our dataset, we demonstrate the filtering methods of thor-magni-tools on a 4-minute recording from Scenario 1.
For both 3D and 6D tracks (X, Y, Z, and 3D orientation), we provide an interpolation method bounded by a predefined maximum number of consecutive untracked positions. This method fills in missing data points while maintaining the integrity of the motion patterns and ensuring continuity in the trajectories. The accompanying figure shows an example of this interpolation applied to a 4-minute helmet trajectory from Scenario 1.
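The gap-limited linear interpolation can be sketched as follows (a simplified illustration of the idea, not the actual thor-magni-tools implementation; gaps longer than `max_gap` are deliberately left unfilled):

```python
def interpolate_gaps(track, max_gap):
    """Linearly interpolate missing samples (None) in a 1D track.

    Only interior gaps of at most max_gap consecutive missing samples are
    filled; longer or boundary gaps stay None to avoid inventing motion.
    """
    track = list(track)
    i = 0
    while i < len(track):
        if track[i] is None:
            start = i
            while i < len(track) and track[i] is None:
                i += 1
            gap = i - start
            if 0 < start and i < len(track) and gap <= max_gap:
                a, b = track[start - 1], track[i]
                for k in range(gap):
                    track[start + k] = a + (b - a) * (k + 1) / (gap + 1)
        else:
            i += 1
    return track

print(interpolate_gaps([0.0, None, None, 3.0, None], max_gap=2))
# [0.0, 1.0, 2.0, 3.0, None]
```

In a full 3D track, the same procedure would be applied per coordinate axis; orientations would additionally require spherical interpolation.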
6. Analysis and comparison to existing human motion datasets
This section compares our THÖR-MAGNI dataset with popular human trajectory datasets, specifically the ETH/UCY benchmark and THÖR. Our analysis encompasses a multidimensional evaluation, covering various facets of the data recordings. These include trajectory continuity, social proxemics delineating interpersonal interactions, and motion characteristics such as velocity profiles and trajectory linearity. Through this comparison, we aim to situate THÖR-MAGNI among its predecessors, showing its potential for advancing human motion analysis and human–robot interaction research.
6.1 Metrics for trajectory data comparison
To evaluate the trajectory data of our dataset in comparison to previous data collections, we employ metrics proposed by Rudenko et al. (2020a) and Amirian et al. (2021): • tracking duration • minimal distance between people • motion speed • path efficiency (trajectory linearity) • number of non-overlapping tracklets
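Two of these metrics can be sketched for a single 2D tracklet as follows (an illustration of the metric definitions, not the exact benchmark code):

```python
import math

def path_length(track):
    """Total travelled distance along a polyline of (x, y) points."""
    return sum(math.dist(p, q) for p, q in zip(track, track[1:]))

def path_efficiency(track):
    """Ratio of straight-line displacement to travelled path length (<= 1)."""
    length = path_length(track)
    return math.dist(track[0], track[-1]) / length if length > 0 else 1.0

def mean_speed(track, dt):
    """Average speed of a tracklet sampled at a fixed interval dt (seconds)."""
    return path_length(track) / (dt * (len(track) - 1))

# L-shaped tracklet of three samples at dt = 0.4 s (illustrative values).
track = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
print(round(path_efficiency(track), 3))  # 0.707
print(mean_speed(track, dt=0.4))         # 2.5 (m/s)
```

A perfectly straight walk gives a path efficiency of 1.0; detours and hesitation lower it, which is why it serves as a proxy for trajectory linearity.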
6.2 Trajectory data comparison
We compare our dataset with the THÖR dataset and the ETH/UCY trajectory prediction benchmark. The THÖR dataset encompasses three distinct scenarios, each featuring participants performing different tasks, such as individual and group movement and box transportation, with different numbers of obstacles and a mobile robot in the environment. In THÖR Scenario 1 (THÖR-S1), participants navigate the environment with one static obstacle. THÖR Scenario 2 (THÖR-S2) introduces a mobile robot navigating around the static obstacle while participants continue their tasks. Finally, in THÖR Scenario 3 (THÖR-S3), the mobile robot becomes a static obstacle, and an additional obstacle is added to the scene. The ETH/UCY trajectory prediction benchmark consists of five scenes: ETH, HOTEL, UNIV, ZARA1, and ZARA2. These scenes represent five outdoor public spaces capturing natural human motion patterns, resulting in a benchmark widely used by the human trajectory prediction community (Almeida and Mozos, 2023; Dendorfer et al., 2021; Salzmann et al., 2020; Yue et al., 2022).
First, we show the tracking durations in Figure 14. THÖR presents consistent average tracking durations of around 15.5 to 17.6 seconds across the three scenarios. In contrast, THÖR-MAGNI shows wider variation: for instance, Scenario 4 features longer tracking durations (averaging 41.3 seconds), whereas Scenario 2 has the shortest (averaging 17.1 seconds). This variability can be attributed to participant density; Scenarios 4–5, involving fewer human agents in a smaller space, may allow higher-quality tracking. Nevertheless, THÖR-MAGNI has comparable or longer tracking durations than THÖR. Furthermore, compared to the ETH/UCY benchmark (i.e., the ETH, HOTEL, UNIV, ZARA1, and ZARA2 scenes), THÖR-MAGNI offers comparable or significantly longer tracking durations. This makes our dataset more valuable than its predecessors for tasks such as long-term human motion prediction and human–robot interaction.
Figure 14. Tracking durations (mean ± one standard deviation) across datasets in seconds. Scenarios 1–3 of THÖR-MAGNI provide comparable tracking durations to previous datasets, while Scenarios 4 and 5 provide longer tracks.
Second, we compare the minimal distance between people in Figure 15. Again, human density plays an important role: THÖR-MAGNI Scenarios 1–3 show low values comparable to those in ZARA1/ZARA2, while Scenarios 4 and 5 reach values similar to THÖR, ETH, and HOTEL. The higher participant density in THÖR-MAGNI Scenarios 1–3 results in reduced spatial navigational freedom, leading to increased interactions and decreased social distances between individuals.
Figure 15. Minimal distance between people (mean ± one standard deviation) across datasets in meters. Lower spatial navigational freedom in Scenarios 1–3 of THÖR-MAGNI potentiates reduced social distances between participants; these results are more consistent with the ZARA1 and ZARA2 scenes, while Scenarios 4 and 5 (with more spatial freedom) show results similar to the THÖR, ETH, and HOTEL datasets.
Third, the motion speed statistics are shown in Figure 16. Despite the higher participant density in Scenarios 1–3 of THÖR-MAGNI, these scenarios feature faster human agent navigation than THÖR, akin to the ETH, HOTEL, and ZARA1 scenes, possibly influenced by the object transportation tasks impacting the velocity profiles. Participants in Scenarios 4–5 of THÖR-MAGNI have an average velocity similar to those in THÖR, UNIV, and ZARA2. THÖR-MAGNI also generally shows comparable standard deviations in motion speed, indicating diverse and varied movement patterns among human agents. The similarity of the velocity profiles to previous datasets suggests that our dataset is likewise natural and diverse.
Figure 16. Motion speed (mean ± one standard deviation) for 8-second tracklets across datasets in meters per second.
Finally, we compare path efficiency and the number of tracklets in Figure 17. Regarding trajectory linearity, Scenarios 1–3 align with the THÖR and HOTEL datasets, while the other datasets from the ETH/UCY benchmark contain more linear and less complex trajectories. It is also worth noting that THÖR-MAGNI Scenarios 4 and 5 display the lowest average path efficiency (0.78 and 0.75, respectively). The presence of a moving robot might influence these scenarios, prompting human agents to navigate cautiously and align their motion with the robot’s motion profile. Furthermore, THÖR-MAGNI presents a much higher number of non-overlapping tracklets than the other datasets.
These distinctive features make our dataset uniquely challenging, diverse, and valuable as a benchmark for evaluating human trajectory prediction methods. The heightened complexity and diverse range of trajectories in THÖR-MAGNI can provide a robust platform for assessing the effectiveness of trajectory prediction methods, thereby increasing the breadth and depth of research in this area.
7. Conclusions
In this paper, we present THÖR-MAGNI, a comprehensive human and robot navigation and interaction dataset, extending THÖR (Rudenko et al., 2020a) with 3.5 times more motion data, novel interactive scenarios, and rich contextual annotations. Both datasets are accessible online at https://thor.oru.se/. To further support researchers, THÖR-MAGNI comes with a dedicated set of user-friendly tools: a dashboard and a specialized Python package called thor-magni-tools.
THÖR-MAGNI was created to fill a gap in human motion analysis datasets, limiting HRI research: a lack of comprehensive inclusion of exogenous factors and essential target agent cues, which hinders holistic studies of human motion dynamics. Unlike existing datasets, THÖR-MAGNI includes a broader set of contextual features and offers multiple variations to facilitate factor isolation. Our dataset integrates different modalities, such as walking trajectories, eye-tracking data, and environmental sensory inputs captured by a mobile robot.
THÖR-MAGNI comprehensively represents mobile robots’ and humans’ diverse navigation styles in shared environments using multi-modal data. Our dataset contributes to the evolving landscape of human motion research through a comparative analysis with state-of-the-art datasets. Furthermore, we discuss the features of our dataset in the context of human motion and robot interaction, highlighting their importance in addressing gaps in the existing literature. The THÖR-MAGNI dataset has already been used in research papers, demonstrating its usefulness for training role-conditioned motion prediction models (Almeida et al., 2023) and investigating visual attention during human–robot interaction and navigation in shared environments with robots (Schreiter et al., 2023, 2024).
In the future, we intend to propose a benchmark for multi-modal indoor trajectory prediction methods that leverage the rich contextual cues in THÖR-MAGNI. This work aims to advance the field by facilitating the development of more precise models of human motion. Future data acquisitions should encompass a broader range of environments, increase the size of individual scenario acquisitions, and include extensive coverage of vital modalities such as eye tracking to measure situational awareness and mutual intention of all participants. These efforts will enhance the generalizability of future generations of datasets. Additionally, transitioning data acquisitions from fixed laboratory environments to real-world settings under varying conditions will improve the collected data’s ecological validity and robustness.
