Abstract
Keywords
1. Introduction
Important findings in brain and memory research over the last few decades have emphasized the potential health benefits of engaging in both cognitively and socially stimulating activities. For example, both social and cognitive stimuli have been found to promote the psychological well-being of older adults [1] and minimize the risk of social isolation, which can negatively impact an elderly individual's health, for example, through an increased risk of dementia [2] and a higher likelihood of coronary heart disease [3]. In addition, studies have shown that cognitive activity throughout one's lifetime, including the early and middle life stages, can help reduce the risk of late-life cognitive decline [4] and is related to a person's semantic memory, perceptual speed and visuospatial ability [5].
Our work focuses on providing needed insight into the use of innovative robotic technologies for person-centred cognitive interventions. Namely, our objective is to develop intelligent, socially assistive robots that can provide cognitive stimulation during social human-robot interactions (HRI) [6-9]. In the future, such robots could be used as aids in providing cognitive training and social interaction in both health-related and educational fields, for example, to: 1) older adults, including those suffering from cognitive impairments, 2) children and adults with attention deficit hyperactivity disorder, brain injuries or learning disabilities, and 3) individuals with major depressive disorder that affects cognitive functioning.
To date, only a handful of research groups (including our own) have focused on developing life-like socially assistive robots to engage different individuals in varying socially or cognitively stimulating activities [10-18]. For example, the seal-like robot Paro [10] has been designed to engage elderly persons, including those with dementia, in animal therapy scenarios by learning which of its behaviours (i.e., moving its body parts and making seal sounds) a person desires, based on how the person pets, holds or speaks to it. KASPAR, a child-sized tele-operated humanoid robot, engages children with autism in imitation play games by displaying various facial expressions, waving its hand, and drumming on a tambourine [13].
Our own recent work in this area has focused on the development of the intelligent human-like robot Brian 2.0 (Figure 1) [16-18]. Brian 2.0 is being designed as a therapeutic tool to engage people in personalized cognitively stimulating activities, providing them with an avenue to interact and socialize during the course of the activities. The significance of using a human-like social robot lies in the ability to directly incorporate a person's natural communication capabilities as well as his/her ability to understand these forms of communication.

The socially assistive robot Brian 2.0
In this paper, we present the design of a novel learning-based HRI control architecture for Brian 2.0 which will enable the robot to effectively engage an individual in one-on-one person-centred cognitively stimulating activities. In particular, the architecture allows Brian 2.0 to be a social motivator by providing assistance, encouragement and celebration during the course of an activity. A hierarchical reinforcement learning (HRL) approach is used in the architecture to provide the robot with the ability to: (i) learn appropriate assistive behaviours based on the structure of the activity, and (ii) personalize an interaction based on a person's user state as defined by a combination of affective arousal and activity performance. The architecture uniquely focuses on bidirectional emotion-based interactions between an individual and the robot in order to promote the cognitive and social well-being of a person.
The novelty of our proposed control architecture lies in the inclusion of: 1) a user state recognition and analysis module to allow the robot to be able to identify a person's user state during the course of an activity, 2) a robot emotional state module which provides the robot with emotional states that are consistent with its contextual assistive interactions and that will aim to elicit an appropriate response from the user while also responding appropriately to a person's user state, and 3) the first use of an on-line learning MAXQ, [19], HRL technique to provide the robot with the ability to adapt to new people and learn appropriate assistive behaviours in order to engage in personalized one-on-one HRI.
2. Learning Strategies for Socially Intelligent Robots
It is envisioned that robots will need to have social intelligence in order to be effectively integrated into human society. Social intelligence allows a robot to share information with, relate to, and interact with humans. HRI research involves empowering a robot with the social functionalities needed to engage human participants in different types of interactions. A number of these characteristics will need to be formulated via the study and development of social learning capabilities for robots.
Recently, a number of socially intelligent robots have been developed that are capable of learning their behaviours for social HRI scenarios. A common approach has been to utilize reinforcement learning (RL) strategies to solve HRI control problems that are modelled as either a Markov decision process (MDP) [20-23] or a partially observable Markov decision process (POMDP) [24], where the latter deals with noise and state uncertainty. Other approaches have focused on utilizing policy gradient reinforcement learning (PGRL) when there is no obvious notion of state, e.g., [12], [25].
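As a concrete illustration of the formulation these MDP-based approaches share, the sketch below shows a single tabular Q-learning update; the state and action names in the usage example are hypothetical placeholders, not taken from any of the cited systems.

```python
def q_learning_step(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    """One tabular Q-learning update for an MDP-modelled interaction.

    Q maps (state, action) pairs to values; unseen pairs default to 0.
    """
    # Greedy one-step lookahead over the actions available in the next state
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    old = Q.get((s, a), 0.0)
    # Standard temporal-difference update toward r + gamma * max_a' Q(s', a')
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)
    return Q[(s, a)]
```

For example, a robot rewarded (+1) for a prompt that re-engages the user would nudge the value of that (state, action) pair upward by alpha times the temporal-difference error.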
The main limitation of the RL approaches currently used in HRI applications is their scalability. RL algorithms treat the entire state space as one large search space; hence, the search space can grow exponentially with the number of state variables, increasing the complexity of the algorithm. This results in RL algorithms having slow learning rates and becoming intractable for large state spaces. Therefore, these methods can only be practically applied to small-scale real-world systems.
Hierarchical reinforcement learning (HRL) methods have also been proposed for HRI scenarios. In the case of HRL, the decision making problem is decomposed into a collection of smaller sub-problems so that they can be solved more efficiently [19]. This results in faster learning as the value function requires less data to be learned. For example, in [26], a hierarchical POMDP approach was implemented in the dialogue-based guidance task of the Pearl robot in order for the robot to perform tasks such as reminding a person of an appointment, navigation and/or information assistance. The control policy was computed off-line; hence, during task execution, the controller simply looked up the appropriate robot action to be implemented. No on-line training was implemented.
Our own initial work in this area has focused on using
In our present work, we propose the use of a MAXQ HRL approach, within a multi-layer control architecture, to enable the socially assistive robot Brian 2.0 to provide assistance and encouragement to individuals as they engage in a cognitively stimulating activity. Brian 2.0 is capable of encouraging natural interactions between an individual and itself through social learning and its physically expressive capabilities. The objective of the HRI controller is to determine an individual's user states during a cognitively stimulating interaction with Brian 2.0 and, in turn, determine the appropriate behaviour of the robot to reflect the task to be completed given a particular user state. A modular design approach is applied to the overall control architecture, allowing for the addition and/or substitution of different sensor modalities as needed based on the intended activity.
By using a MAXQ approach, the robot's overall assistive task can be decomposed into smaller, more manageable sub-tasks that it can learn concurrently. This makes our MAXQ approach scalable, allowing us to effectively expand our architecture to include more activities and robot behaviours. Since MAXQ also reduces memory requirements, we can also have the robot interact with a larger number of users. Furthermore, as our objective is to improve/maintain positive user states during the course of a cognitively stimulating activity with the robot, we perform on-line training so that the robot's assistive behaviours can be personalized to each individual's user states.
3. Social HRI Scenario
Our goal is to design a robotic social motivator to provide interventions that focus on maintaining and strengthening the cognitive abilities of a person, while promoting engagement in a cognitively stimulating leisure activity. We have used two criteria identified in the literature to design the cognitive intervention that Brian 2.0 can provide to individuals in order to better engage them in an activity of interest and increase positive affect: 1) we focus on matching the stimuli provided by the robot to a person's skill and interest level, and 2) the robot is designed to provide one-on-one social stimuli.
In this work, we have chosen the card game of memory as our cognitively stimulating activity. The game consists of 16 picture cards turned face down in a 4×4 grid formation. The objective is for the human player to flip over pairs of cards and match the pictures on the cards correctly. Once a pair has been matched, the two cards are removed from the game. The game is over when all cards have been matched. Individuals play the game as single players while the robot autonomously provides preferred amounts of social stimulation in order to keep these individuals engaged in the game. The memory functions within the brain that are trained while playing this card game include the visual object memory and the updating function of the central executive component of the working memory [27].
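The game logic itself is straightforward; the following minimal sketch (with hypothetical function names) illustrates the 4×4 board of paired picture cards and the match-checking rule described above.

```python
import random

def new_board(pairs=8, seed=0):
    """Build a 4x4 memory board: each picture id appears exactly twice,
    shuffled and laid out face down in row-major order."""
    rng = random.Random(seed)
    cards = list(range(pairs)) * 2
    rng.shuffle(cards)
    return cards  # index = grid position, value = picture id

def flip_pair(cards, matched, i, j):
    """Flip positions i and j; if the pictures match, remove them from play
    by recording their positions in the `matched` set."""
    if i != j and i not in matched and j not in matched and cards[i] == cards[j]:
        matched.update((i, j))
        return True
    return False
```

The game ends once `matched` contains all 16 positions.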
We aim to keep a person stimulated and engaged in the memory game activity. In order to do this, herein, we focus on reducing activity-induced stress of the person. Activity-induced stress is known to result in negative moods and lead to disturbances in motivation (e.g., loss of task interest) and cognition (e.g., worry) [28].
4. HRI Control Architecture
A generic modular learning-based HRI control architecture is proposed to allow the robot to provide encouragement and assistance to a person as he/she engages in a cognitively stimulating activity. The HRI control architecture focuses on determining the person's user state and his/her task performance during a cognitively stimulating interaction with Brian 2.0, and adjusting the behaviour of the robot to reflect the task to be completed given a particular user state. A modular design approach is utilized in the control architecture to allow for the addition and/or substitution of different sensor modalities as inputs into the user and activity state modules as needed based on the intended activity. Due to its generality, the architecture can be applied to different individuals or a combination of person-centred guidance-based activities in HRI scenarios.
In our proposed HRI control architecture, we apply a hybrid approach to resolve uncertainty at both the sensor data processing level and at the decision making level. At the sensor data processing level, we utilize sensor-specific algorithms to obtain the best possible state representation prior to the decision making process. These algorithms directly deal with uncertainties and noise acquired from raw sensor readings, resulting in a more accurate representation of the state of the interaction. At the decision making level, we have incorporated a knowledge clarification layer which uses clarification dialogue between the person and robot in order to reduce errors as a result of speech recognition. Furthermore, non-deterministic human behaviours are accounted for by the MAXQ algorithm. On-line training is also utilized to adapt to non-deterministic scenarios, as well as new users.
An overview of our proposed HRI control architecture is presented in Figure 2. For the current implementation of the control architecture to the memory card game scenario, sensory information is acquired for: (i) recognizing human verbal actions via a Logitech noise-cancelling microphone, (ii) user state recognition via an emWave ear-clip heart rate sensor, and (iii) activity state monitoring using a Logitech webcam. The heart rate sensor is utilized to determine a person's affective arousal level during activity engagement.

HRI control architecture
4.1 Activity State Module
The Activity State module monitors the state of the memory game during the interaction utilizing images provided by the 2D webcam. A feature recognition and clustering technique we have previously developed based on SIFT (Scale-Invariant Feature Transform) is used to determine the number, identity and location of the cards within the activity area [16]. Pairs of picture cards utilized in the memory game have unique SIFT keypoints, allowing them to be distinguishable from each other. The clustering technique utilizes a nearest neighbour search algorithm to define regions in the 2D images containing keypoints that may potentially represent cards that have been flipped over during the game. A database of the keypoints for each picture card is utilized to determine the identity of the flipped over cards. Card recognition errors can arise during activity state recognition when cards become obstructed. This mainly occurs due to the temporary presence of human hands. Uncertainty is minimized by capturing and analysing images of the activity area over consecutive frames, so that temporary occlusions do not corrupt the recognized game state.
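The nearest-neighbour grouping step can be illustrated with the simplified sketch below, which clusters 2D keypoint coordinates by distance to running cluster centres; the radius value and the greedy assignment order are illustrative assumptions, not the exact parameters of [16].

```python
from math import dist

def cluster_keypoints(points, radius=30.0):
    """Greedy nearest-neighbour grouping: a keypoint within `radius` pixels of
    an existing cluster centre joins that cluster (updating its centre);
    otherwise it seeds a new cluster. Returns lists of point indices, where
    each cluster is a candidate flipped-over card region."""
    clusters, centres = [], []
    for idx, p in enumerate(points):
        if centres:
            k = min(range(len(centres)), key=lambda c: dist(centres[c], p))
            if dist(centres[k], p) < radius:
                clusters[k].append(idx)
                members = [points[i] for i in clusters[k]]
                # Recompute the cluster centre as the mean of its members
                centres[k] = tuple(sum(v) / len(members) for v in zip(*members))
                continue
        clusters.append([idx])
        centres.append(tuple(p))
    return clusters
```

Each resulting cluster of keypoints would then be matched against the per-card keypoint database to identify the card.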
4.2 Speech Recognition and Analysis Module
Human speech is recognized via the Speech Recognition and Analysis module. Recognition is performed by Julius, a two-pass large vocabulary continuous speech recognition (LVCSR) decoder [29]. Words are recognized based on their phonemes and their approximate location in an utterance. The LVCSR software has been customized to support the vocabulary, dialogue and action-based context needed during game playing. In particular, the vocabulary and grammar definitions have been configured with the syntactic constraints of a response or question posed to Brian 2.0. Herein, we have utilized the person-independent VoXForge acoustic model [30], which is composed of statistical representations, created via Hidden Markov Models, for each phoneme in the English language to account for persons with different accents and speaking styles. The acoustic model has been trained using 625 unique voices.
The reliability of the spoken utterance is determined using word confidence scores provided by Julius, which are based on a combination of predictor features (e.g., acoustic and language model scores). We then determine the weighted average of all the confidence scores of the recognized utterance. If the weighted average is low, or if there are multiple results with similar weighted averages, this information is sent to the Knowledge Clarification layer in order to resolve the uncertainty via the robot asking clarification questions.
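A minimal sketch of this decision rule follows; the acceptance and margin thresholds are illustrative assumptions (the exact values are not reported here), and a plain mean is used in place of the unspecified weighting.

```python
def needs_clarification(hypotheses, accept=0.7, margin=0.05):
    """hypotheses: list of (utterance, [per-word confidence scores]) from the
    recognizer. Returns (best utterance, whether clarification is needed),
    scoring each hypothesis by its mean word confidence."""
    scored = sorted(((sum(c) / len(c), u) for u, c in hypotheses), reverse=True)
    best_score, best_utterance = scored[0]
    too_low = best_score < accept                      # single low-confidence result
    ambiguous = len(scored) > 1 and best_score - scored[1][0] < margin
    return best_utterance, too_low or ambiguous
```

When the second element returned is true, the utterance would be handed to the Knowledge Clarification layer rather than acted upon directly.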
4.3 User State Module
The User State module is used to determine a person's task-based user state during game playing. This is determined during the proposed activity using a combination of affective arousal and activity performance. Affective arousal is the intensity with which emotional stimuli are perceived [31]. Heart rate has a long history of being used as an index of arousal [32]. Heart rate data is gathered from the user during the interaction at the 2 Hz sampling rate provided by the sensor. The baseline heart rate, which is an average of 10 valid data points, is acquired before the start of the activity. Subsequent valid heart rate readings are compared to this baseline, with a threshold of 5 bpm, to determine if the person is in a high or low affective arousal state. Activity performance is determined by whether or not matching card pairs were found in the previous round of the memory game by the Activity State module (Table 1). The 5 bpm heart rate threshold, as well as the user states in Table 1, were developed through the monitoring of numerous experiments. In these experiments, we were able to detect increased heart rate when a person was faced with either a stressful or an exciting situation in an activity. In the context of the memory game, stress was directly related to the scenario in which a matching card pair could not be found, and excitement was directly related to matching a pair of cards.
Task-based User States
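Assuming the four user states are those reported later in the paper (excited, pleased, neutral and stressed), the mapping from arousal and activity performance can be sketched as follows; the exact assignments of Table 1 are assumed here, not quoted.

```python
def user_state(heart_rate, baseline, matched_last_round, threshold=5):
    """Combine affective arousal (heart rate vs. baseline, 5 bpm threshold)
    with activity performance (match found in the previous round) to obtain
    a task-based user state. The specific mapping is an assumption."""
    high_arousal = heart_rate - baseline >= threshold
    if matched_last_round:
        return "excited" if high_arousal else "pleased"
    return "stressed" if high_arousal else "neutral"
```

Under this mapping, elevated heart rate after a failed match yields the stressed state that triggers the robot's assistive behaviours.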
4.4 Robot Emotional State Module
The Robot Emotional State module uses the person's user state and the current assistive action of the robot to determine the emotional state of the robot. The objective of the emotional state module is to determine which robot emotion will elicit an appropriate response from the human in order to accomplish a given task while also responding appropriately to a person's user state. We utilize a finite-state machine approach to match the appropriate robot emotion to a given user state and the robot's assistive action within the context of the cognitively stimulating activity. For the memory game, the robot emotions are: happy, neutral and sad. For example, when the person finds a matching card pair and is in an excited state, the robot celebrates with him/her by being in a happy state. The robot is sad when it has to repeat an instruction after a long period of waiting. Sad is chosen for re-engagement based on human response to this emotion as outlined in empathy theories (wanting to help a person who is sad in order to relieve him/her of this emotion), as well as the desire to achieve an internal self-rewarding goal (feeling good about oneself by helping others). In general, in all cases when the user is stressed, regardless of the robot action to be implemented, the robot will try to improve the user state of the person by being in a happy state. For all other cases not mentioned here, the robot's emotional state is neutral.
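The finite-state mapping described above can be sketched as follows; the action labels are hypothetical placeholders for the robot's assistive behaviours.

```python
def robot_emotion(user_state, robot_action):
    """Finite-state mapping from (user state, assistive action) to robot
    emotion, following the cases described in the text; unlisted cases
    default to neutral."""
    if user_state == "stressed":
        return "happy"                 # always try to improve a stressed user's state
    if user_state == "excited" and robot_action == "celebrate":
        return "happy"                 # celebrate a correct match together
    if robot_action == "repeat_instruction":
        return "sad"                   # sadness chosen to re-engage (empathy response)
    return "neutral"
```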
4.5 Behaviour Deliberation Module
The Behaviour Deliberation module is the main decision making module within the HRI control architecture. This module requires inputs from all four of the aforementioned modules in order to determine the robot's effective assistive behaviour via a MAXQ hierarchical reinforcement learning approach [19]. The overall behaviour of the robot is physically implemented by the actuator control module using a combination of both verbal and non-verbal forms of communication. In the next section, we will discuss the detailed design of the Behaviour Deliberation module as it pertains to the robot engaging a person in the card game of memory.
5. The Behaviour Deliberation Module of Brian 2.0
The Behaviour Deliberation module is composed of two layers: (i) Knowledge Clarification and (ii) Intelligence.
5.1 Knowledge Clarification Layer
This particular layer is in charge of generating a clarification dialogue between a person and the robot in order to reduce errors as a result of speech recognition. Namely, if the average confidence score for the utterance by the person is low, as determined by the Speech Recognition and Analysis module, the robot will state the utterance that has the highest relative confidence score and ask the person to confirm his/her request by providing positive/negative feedback in the form of yes or no answers. In the case of multiple recognition results with similar confidence score averages, the robot will individually clarify the top three results to determine if the user is asking to recall, identify or localize a card. This allows the robot to match the utterance with its own stored activity-specific utterance templates and hence, increase the accuracy of speech recognition.
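A simplified sketch of this clarification flow is given below; the prompt wording is illustrative, not the robot's actual dialogue.

```python
def clarification_prompts(ranked_utterances, similar_scores=False):
    """Build the robot's clarification questions: confirm the single best
    hypothesis with a yes/no question, or step through the top three
    candidate intents (recall / identify / localize a card)."""
    if not similar_scores:
        return ["Did you say: '%s'? Please answer yes or no." % ranked_utterances[0]]
    intents = ("recall", "identify", "localize")
    return ["Are you asking me to %s a card? Please answer yes or no." % i
            for i in intents[:min(3, len(ranked_utterances))]]
```

A confirmed answer is then matched against the robot's stored activity-specific utterance templates.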
5.2 Intelligence Layer
The intelligence layer consists of the MAXQ HRL algorithm, which is capable of adapting the robot's behaviour to the current assistive interactive scenario. MAXQ is utilized to determine the overall behaviour of the robot as a function of both verbal (speech) and non-verbal (gestures, and facial expressions and intonation based on the robot's emotions) communication means.
5.2.1 The MAXQ Learning Algorithm
MAXQ provides a hierarchical decomposition of a given reinforcement learning problem (task) into a set of sub-problems (sub-tasks). With respect to the memory game, the overall assistive task aligns with the objective of the card game: to identify and check that cards flipped over result in a corresponding pair match. MAXQ is able to support temporal abstraction, state abstraction and sub-task abstraction, which are important in the decision making process for the socially assistive robot in the memory game scenario. The need for temporal abstraction exists since, depending on the player's skill level and style of play, some actions may take varying amounts of time to execute. State abstraction is beneficial since not all state variables are needed for certain tasks. For example, when instructing the player to flip back unmatched cards in the game, the identity of the cards is irrelevant and should not affect the robot's behaviour. Due to state abstraction, the overall value function for this task can be represented more effectively by utilizing only a subset of the state variables, reducing memory requirements. Sub-task abstraction is also necessary because it allows sub-tasks to be learned only once; the solution can then be shared by other sub-tasks.
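The decomposition can be illustrated with a minimal recursive evaluation of the decomposed value function, Q(i, s, a) = V(a, s) + C(i, s, a), following Dietterich's MAXQ formulation [19]; the sub-task names in the usage below are hypothetical.

```python
def maxq_value(node, s, V_prim, C, children):
    """Evaluate the decomposed MAXQ value function.

    Primitive actions return their stored expected one-step reward V(a, s);
    composite sub-tasks return max over child actions a of
    Q(node, s, a) = V(a, s) + C(node, s, a), where C is the completion
    function (expected reward for finishing `node` after `a` terminates)."""
    if node not in children:                       # primitive action
        return V_prim[(node, s)]
    return max(maxq_value(a, s, V_prim, C, children) + C.get((node, s, a), 0.0)
               for a in children[node])
```

For instance, with a root task whose children are hypothetical "help" and "wait" primitives, the root value is the best child value plus its learned completion term.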
This paper presents the first application of the MAXQ algorithm to multi-modal interactions with socially assistive robots. We propose a new two-stage training process for our learning strategy which includes both off-line and on-line learning using real user data (discussed in Sections 6 and 7) in order to allow the robot to personalize its interactions with different individuals.
5.2.2 Task and Value Function Decomposition
At the core of MAXQ is the value function decomposition, which describes how to decompose the overall value function of the root task into a set of value functions for the individual sub-tasks.

Hierarchical task graph for the memory game scenario (primitive robot actions on bottom row are defined in Table 2)
Examples of Primitive Robot Actions
5.2.3 State and Action Definitions
A set of states,
Table 2 and Figure 4 show examples of primitive robot actions. The primitive actions for the sub-tasks

Example robot behaviours: (a) providing celebration in a happy emotional state after a correct match, (b) providing instruction in a sad state when game disengagement occurs and (c) providing help in a neutral state.
Every sub-task in the task graph has a termination condition. For example, for the
At the start and end of the game, the Deliberation module implements the following behavioural actions for the robot: 1) at game start: “Hi, my name is Brian. I am glad you want to play the memory game with me. Let's start.”, and 2) at game end: “Congratulations, you have completed the memory game.”
6. MAXQ Training
We have implemented a two-stage training procedure for our MAXQ approach. In the 1st stage, we focus on determining appropriate behaviours for the robot based on the structure of the game. After the robot has learned its optimal behaviours with respect to the card game, the 2nd training stage focuses on developing personalized interactions for each person using his/her user state during game playing. Here, we discuss the 1st training stage. The 2nd training stage is detailed in Section 7.
The objective of the 1st training stage is to learn the robot's optimal behaviours based on human actions and activity states. On-line training would be unrealistic to use at this stage due to the large number of possible states and actions that need to be explored, as well as the extensive amount of experience required to learn the optimal strategy. Therefore, we utilize an off-line training procedure that incorporates a human user simulation model, error models for both speech recognition and activity state detection, and an epsilon-decreasing exploration strategy that can provide the extensive interaction experience needed for policy learning.
6.1 Human User Simulation Model
A simple probabilistic approach for user modelling is the bi-gram model, in which the probability of a person's action is conditioned on the robot's preceding action.
Wizard-of-Oz (WOz) experiments consisting of ten participants, each playing the memory game while interacting with the robot, were performed to acquire the necessary data for the bi-gram model. In these WOz experiments, a member of our research team sat in a different location and only controlled the decisions regarding the behaviours of the robot (i.e., the behaviour deliberation module); all other modules of the control architecture were autonomous. To promote natural interactions, we did not tell the participants how to behave; we merely requested that they play the memory game. In this bi-gram user model approach, a person's action is dependent only on the last robot action.
Bi-gram User Simulation Model
If there are 0 cards initially flipped over, this action is described as flipping over two cards at once.
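Sampling simulated user actions from such a bi-gram model is straightforward; the sketch below assumes the model is stored as a nested table of conditional probabilities, with hypothetical action names.

```python
import random

def sample_user_action(last_robot_action, bigram, seed=None):
    """Draw a simulated human action from P(human action | previous robot
    action), where `bigram` maps each robot action to a table of action
    probabilities estimated from the WOz data."""
    rng = random.Random(seed)
    table = bigram[last_robot_action]
    actions = list(table)
    return rng.choices(actions, weights=[table[a] for a in actions], k=1)[0]
```

During off-line training, each sampled action would then be perturbed by the speech recognition and activity state error models before being passed to the learner.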
6.2 Speech Recognition Error Model
To account for variations in recognition performance caused by noise and speaker-dependent differences, we use a speech recognition error model that assumes a new speaker for every game. For each recognition task (RT), the recognition rate (RR) is computed following [22].
The recognition results of ten different speakers are used to compute the overall RR and standard deviation for each recognition task.
Speech Recognition Rates for Recognition Tasks
6.3 Error Modelling for Activity State Detection
The activity state detection error is based on determining: 1) the identity of the cards in the game, and 2) the number of cards flipped over by the user. The card identification error is incorporated into the simulation model for when the robot must provide help to the user. The game area is split into a 4×4 grid, representing the location of the cards. Table 5 shows the detection rates for each section based on the results of ten detection trials per section.
Card Identity Detection Rates
Errors resulting from detecting an incorrect number of cards flipped over are also incorporated in our simulation model for when the robot must provide the appropriate instructions based on the activity state. Table 6 shows the detection rates for when 0, 1 or 2 cards are flipped over.
Detection Rates for the Number of Cards Flipped Over
6.4 Rewards
The aim of our reward system is to minimize the cost of the actions taken to reach the ultimate goal of completing the game. In the memory game, a desired action is defined as an appropriate action for the current state (e.g., the robot congratulating a person when he/she has found matching cards). Every completed primitive action is given a negative reward of −1, whereas undesired actions are given an additional negative reward of −20. Desired primitive actions are not further rewarded. A positive reward of +21 is given at 1st level sub-tasks if a person is asking a help-related question and the appropriate help-related sub-task is executed.
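The reward scheme above can be summarized in code as follows; this is a direct transcription of the stated values, with hypothetical predicate names.

```python
def primitive_reward(desired):
    """Every completed primitive action costs -1; an undesired action incurs
    an additional -20 penalty."""
    return -1 + (0 if desired else -20)

def help_subtask_bonus(help_question_asked, appropriate_subtask_invoked):
    """+21 bonus at a 1st level sub-task when a help request is answered by
    invoking the appropriate sub-task."""
    return 21 if help_question_asked and appropriate_subtask_invoked else 0
```

The +21 bonus more than offsets the -1 step cost, so answering a help request appropriately is a net-positive decision for the learner.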
6.5 Exploration Policy
An epsilon-decreasing exploration strategy is applied during off-line training. At the beginning, ε is set to 1 for the Root task and the 1st and 2nd level sub-tasks to encourage the maximum amount of exploration possible. 3rd level sub-tasks, which only evoke primitive actions, employ a greedy policy (i.e., ε = 0), in which the action with the highest potential reward is always selected.
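A single epsilon-decreasing action selection step can be sketched as follows; the multiplicative decay factor is an illustrative assumption, as the decay schedule is not specified here.

```python
import random

def epsilon_decreasing_action(Q, s, actions, epsilon, decay=0.999, rng=None):
    """With probability epsilon pick a uniformly random action (explore),
    otherwise pick the highest-valued action (exploit); epsilon is then
    shrunk multiplicatively for the next decision."""
    rng = rng or random.Random(0)
    if rng.random() < epsilon:
        a = rng.choice(actions)
    else:
        a = max(actions, key=lambda b: Q.get((s, b), 0.0))
    return a, epsilon * decay
```

Setting epsilon to 1 reproduces the fully exploratory behaviour used at the start of training for the upper-level sub-tasks, while epsilon = 0 gives the greedy policy used at the 3rd level.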
6.6 Performance Analysis
We have performed a study to compare the rate of convergence of our MAXQ approach versus a traditional flat Q-learning technique.

Comparison of MAXQ and flat Q-learning for the memory game
7. Social HRI Experiments
Once the 1st training stage has determined the robot's appropriate behaviours based on the structure of the memory game, the 2nd training stage is implemented. This on-line training stage is used to allow Brian 2.0 to learn its optimal assistive behaviours based on a person's user states during game engagement. The aim is to select the robot's behaviours in an attempt to maintain positive (i.e., pleased or excited) user states during game playing. We postulate that this will, in turn, allow a person to be more engaged in the cognitively stimulating activity.
7.1 Procedure
The on-line training procedure was tested on ten healthy adults (ranging in age from 20 to 35) as they played the memory game twice while interacting with the robot. A baseline heart rate was obtained for each participant prior to game initiation. A successful action is defined as a robot action that improves a person's user state from a stressed state to a non-stressed state.
In this experiment, a scenario involving activity-induced stress is simulated in order to demonstrate that, in such situations, the robot can be effectively used to minimize this type of stress. As our participants are healthy adults, we have imposed the following constraint on the game: each participant must try to win the game with five or fewer incorrect matches. This system performance experiment will allow us to verify the controller's ability to detect a user state and adapt the robot's behaviour accordingly based on this user state. Furthermore, these healthy adults can provide detailed comments on their experience and the performance of the robot via post-experiment surveys and self-studies. This experiment will provide us with valuable feedback on the functionality of the controller in order to optimize its design prior to conducting long-term cognitive training interventions with other potential end-users.
We have developed a novel on-line training procedure utilizing a person's user state to explore robot behaviours such as providing instruction or help when appropriate, and rewarding the behaviours that succeed at improving user state during the memory game. Exploration of behaviours is triggered by the robot detecting that the person is in a stressed state. At this user state, the exploration policy, ε, is non-greedy for the
At the end of the 1st stage of training, the
7.2 Results and Discussions
Preliminary experiments demonstrate that the proposed on-line training procedure allows the robot to learn its optimal assistive behaviours during personalized interactions. Namely, the robot successfully detects user states at every interaction, explores different behaviours, and is rewarded when its behaviours improve user states.
On average, the participants played the two games for approximately 40 minutes. Figure 6 shows the user states of all ten participants during the two games. One interaction is defined to include the robot detecting a user's action (which updates the activity state), as well as the robot's reaction during game playing. From Figure 6, we can see that the participants had unique user state responses. Some participants, such as A, C, D, E, F and I, felt more stressed at the beginning and/or middle of the overall game playing session and had more positive user states near the end of the session. Other participants, such as G, H and J, felt stressed throughout the game sessions, while participant B felt more stressed during the latter half of the game session.

Participant user states detected during the memory game.
Figure 7 provides a more detailed view of two sets of ten different interactions for each of the participants, i.e., one for each game. The robot was able to explore and determine appropriate behaviours during game playing utilizing the proposed MAXQ control architecture and on-line training procedure based on the participants' user states and activity states. For example, for Participant A, the robot was able to detect that the person was in a stressed state at interactions 8, 12 and 35, and provided assistance via the

Interaction details for all participants.
Figure 8 shows the rewards for the

Rewards for the

Rewards for the
A post-experiment assessment was administered after the HRI scenario, which included a self-study to assess the robot's ability to detect changes in user state throughout the activity and a questionnaire to obtain feedback on the robot's behaviour during game playing. For the self-study, each participant was asked to identify when he/she felt stressed (negative high arousal) or excited (positive high arousal) during the course of the activity using video playback. We compared the self-study results, as well as activity performance, to the user states detected by the robot in order to determine the average user state prediction accuracies for the participants. We found that the average state prediction accuracies were 82.8% for excited, 81.2% for pleased, 76.7% for neutral and 80.3% for stressed. From these results, we can see that high recognition accuracies were achieved when detecting the participants' changes in user state.
For the questionnaire, the participants were asked to choose their responses from a list of robot behaviours. The participants were first asked to identify the robot behaviours they felt were the most effective at relieving stress during game playing. Table 7 summarizes their responses based on a ranking of the total number of responses for each behaviour. The robot providing instructions was ranked the highest by the participants, which concurs with the rewards presented in Figure 8. Participants A and D mentioned that both the robot's instructions and help were effective at relieving stress. Even though the rewards for instruction did not increase for these two participants, the rewards for help did. The four participants (A, D, F and H) that stated that the robot's help behaviour was one of the most effective behaviours at relieving stress, also had increased rewards for the help sub-task during the interactions.
8. Conclusions
In this paper, we present the design of a novel modular learning-based control architecture for our socially assistive robot Brian 2.0, enabling the robot to be a social motivator by providing assistance, encouragement and celebration during the course of a cognitively stimulating activity. Namely, the control architecture utilizes a MAXQ hierarchical reinforcement learning approach in order for the robot to learn its own appropriate assistive behaviours based on the structure of an activity and further personalize the interaction based on a person's user state, where the latter is defined as a combination of affective arousal and activity performance. Results from off-line and on-line training validate the performance of the learning algorithm with respect to the robot's ability to learn its appropriate assistive behaviours to maintain positive user state during a memory card game. Our future work consists of designing a pilot study with the robot at our collaborative long-term care facility with elderly persons with mild cognitive impairment to observe the robot's ability, using the proposed controller, to be a social motivator and engage individuals in the memory game, as well as to study long-term human-robot relationships between Brian 2.0 and a user.
