Abstract
Introduction
Human–robot interaction (HRI) was first proposed in 1975. 1 HRI is an interdisciplinary research field spanning computer science, ergonomics, cognitive science, and other disciplines, and it is also an important topic in engineering psychology. At present, HRI is developing toward personification, intelligence, and naturalness.
Many researchers are now dedicating their efforts to studying interactive modalities such as facial expressions, natural language, and gestures, which makes communication between robots and people more natural. 2 Gunes et al. 3 analyzed human participants’ nonverbal behavior and predicted their facial action units, facial expressions, and personality in real time while they interacted with a small humanoid robot. Ali et al. 4 designed a sign language educational humanoid that possesses stereo vision and stereo microphones along with stereo audio for intuitive interaction. De Jong et al. 5 designed a humanoid robot for social interaction that combines vision, gesture, speech, and input from an onboard tablet, a remote mobile phone, and external microphones. Kumra and Kanan 6 presented a novel robotic grasp detection system that predicts the best grasping pose of a parallel-plate robotic gripper for novel objects using the RGB-D image of the scene. This article considers the sustainability of the product 7 and conducts interactive research on SHFR-III from the perspectives of positioning, emotion, and dialogue.
Robots should be lightweight, energy-efficient, and high-performing, 8 so we designed and built the humanoid emotional robot SHFR-III (see Figure 1). 9 SHFR-III has 22 degrees of freedom and can realize eight basic expressions, including calm and happiness.

Figure 1. Humanoid robot SHFR-III.
Target positioning is an important part of the research of humanoid robots. 10 In HRI, a robot first needs to recognize the interactive objects in its environment. Laurenzi et al. 11 introduced a set of modules based on visual object localization. Gala et al. 12 used auditory sensors for positioning. However, the positioning performance of a single sensor is strongly affected by external factors: 13 auditory positioning is limited by noise, and visual positioning is limited by illumination. A multi-sensor system can mitigate these limitations.
In real life, people usually do not judge each other’s emotional state based on a single modality; both visual information and voice signals are very important for emotional judgment. Multi-modal emotion recognition identifies emotional states by using multi-modal information. 14 Affective (emotional) computing was first proposed by Picard; 15 it measures and analyzes the external manifestations of human emotions and can also influence them.
As argued by Vinyals and Le, 16 current conversation systems are still unable to pass the Turing test, and the lack of consistent personal information is one of the most challenging constraints. In recent years, Li et al. 17 learned interactive object-specific conversational styles by embedding users into a sequence-to-sequence model. Al-Rfou et al. 18 used similar user-embedding techniques to simulate user personalization. Both studies required conversational data from each user to simulate her/his personality. Qian et al. 19 used bidirectional decoders to generate a predefined personality, but large amounts of data are needed to annotate the position of the personal information.
To make HRI more natural and harmonious, this article makes the following contributions for SHFR-III: (1) a multi-sensor positioning subsystem is designed to reduce dependence on the work environment and improve the overall positioning accuracy through data fusion of multiple sensors; (2) an emotion recognition model based on facial expression and speech handles situations in which a single modality would fail, and a fuzzy algorithm simulates emotional decision-making; (3) preset personal information solves the problem of inconsistent personal information in dialogue, and maximum mutual information is taken as the objective function to reduce meaningless replies in the dialogue model.
Overview
Our interactive system works as follows (see Figure 2): First, the multi-sensor positioning subsystem finds the exact location of the interactive object. The robot then adjusts its orientation toward the interactive object and obtains the object’s facial and voice information through cameras and microphones. The emotional interaction subsystem recognizes the object’s emotional state and makes an emotional decision, and SHFR-III displays the result of the emotional decision with a facial expression. The dialogue subsystem with personal information generates responses and presents them in the form of a voice.

Figure 2. The work process.
Multi-sensor positioning subsystem
In this article, a multi-sensor positioning subsystem is designed, which includes an infrared positioning module, an auditory positioning module, and a binocular vision positioning module.
Design of positioning module
Infrared positioning based on infrared sensor array
Four pyroelectric infrared sensors with the same parameters are selected to form the sensor array shown in Figure 3. The four sensors are distributed equidistantly in the vertical direction, and the horizontal angle between adjacent sensors is 30°. The output results of the four sensors, from top to bottom, are recorded as the four binary inputs of the positioning function.

Figure 3. The infrared positioning module.
The positioning function maps these four binary outputs to the angular position of the target; its truth table is given in Table 1.
Table 1. Positioning function truth table.
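As an illustration of how such a truth table can drive positioning, the sketch below (Python; the table entries and sector angles are hypothetical stand-ins, not the paper’s actual Table 1) maps the four binary sensor outputs to a coarse horizontal angle:

```python
# Illustrative sketch only: the sector angles and table entries are
# hypothetical stand-ins for the paper's Table 1, not its actual values.

def infrared_sector(outputs):
    """Map the four binary readings (top sensor first) to an approximate
    horizontal angle in degrees, or None if no sensor fired."""
    table = {
        (1, 0, 0, 0): -45.0,   # only the top sensor fires
        (1, 1, 0, 0): -30.0,   # overlap of sensors 1 and 2
        (0, 1, 0, 0): -15.0,
        (0, 1, 1, 0): 0.0,     # target between the middle lobes
        (0, 0, 1, 0): 15.0,
        (0, 0, 1, 1): 30.0,
        (0, 0, 0, 1): 45.0,    # only the bottom sensor fires
    }
    return table.get(tuple(outputs))
```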
Auditory positioning based on auditory sensor array
The three auditory sensors are arranged as an isosceles triangle in the vertical plane. Sensors 2 and 3 are placed horizontally, separated by a known baseline distance.

Figure 4. The auditory positioning module.
The time differences of arrival of the sound at the three sensors are measured from the received signals. Together with the known sensor geometry and the speed of sound, these time differences determine the distance from the source to the array and the polar angle of the sound source, giving the polar coordinates of the sound source in the horizontal plane.
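The derivation itself did not survive extraction; the standard time-difference-of-arrival (TDOA) relations behind such a three-sensor array can, however, be stated as follows (the symbols \(c\) for the speed of sound, \(d\) for the baseline between sensors 2 and 3, \(\tau_{ij}\) for arrival-time differences, and \(\mathbf{s}_i\) for the sensor positions are our notation, not necessarily the paper’s):

\[ c\,\tau_{ij} \;=\; \lVert \mathbf{p} - \mathbf{s}_i \rVert \;-\; \lVert \mathbf{p} - \mathbf{s}_j \rVert \]

Each measured pair constrains the source position \(\mathbf{p}\) to one branch of a hyperbola with the two sensors as foci; intersecting the curves from two independent pairs yields the source distance \(r\) and polar angle \(\theta\). In the far field, the horizontal pair alone already gives \(\theta \approx \arccos\left(c\,\tau_{32}/d\right)\).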
Visual positioning based on binocular stereovision
The model of binocular vision positioning is shown in Figure 5.

Figure 5. The vision positioning module.
Given the focal lengths of the left and right eyes and the coordinates of the target in the two image planes, the transformation relationship between the image-plane coordinates and the spatial coordinates of the target follows the binocular triangulation model; introducing the calibrated parameters of the left and right eyes into this relationship yields the target position.
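The triangulation equations were likewise lost in extraction; for a rectified parallel-axis stereo pair they take the standard form below (our notation: focal length \(f\), baseline \(b\), and image-plane coordinates \((x_l, y_l)\) and \((x_r, y_r)\) of the target in the left and right views):

\[ Z = \frac{f\,b}{x_l - x_r}, \qquad X = \frac{x_l\,Z}{f}, \qquad Y = \frac{y_l\,Z}{f} \]

where \(x_l - x_r\) is the disparity and \((X, Y, Z)\) are the target coordinates in the left-camera frame.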
Fusion strategy analysis of multi-sensor positioning subsystem
The working environment of the system is a closed room of 5 m × 6 m. The coverage of the positioning subsystem is shown in Figure 6, where O is the positioning center of the multi-sensor subsystem and rectangle DEGF represents the whole room. The coverage area of the visual positioning module is the triangular area ABC; the polar-angle positioning error of each module in each region is indicated in Figure 6.

Figure 6. Regional division and error distribution.
According to the working characteristics of the multi-sensor positioning subsystem, the multi-level positioning fusion method shown in Figure 7 is proposed. In the triangular ABC region, all three positioning modules work normally, and the three positioning results are fused. In the rectangular MNGF region, only the infrared and auditory positioning modules can work, so their two positioning results are fused. In the rectangular DENM region, only the auditory positioning module works normally, and its positioning result is output directly.

Figure 7. Distribution of fusion modes of the positioning subsystem.
Weighted fusion algorithms with variable weights
The final result of the multi-sensor positioning subsystem, that is, the coordinates of the interactive target relative to the robot in the horizontal plane, is expressed in polar coordinates as a weighted combination of the outputs of the individual positioning modules.
In this article, the three kinds of positioning modules are tested separately, and the corresponding positioning accuracy is calculated; the average positioning error of each module determines its weighting coefficient.
The weighting coefficients are inversely proportional to each module’s positioning error (a more accurate module receives a larger weight), and they are calculated so that their sum is 1.
The weighted fusion equation, equation (3), sums the three module outputs with these coefficients. In equation (3), the weighting coefficients are constant, so the equation is only applicable when all three positioning modules are working normally. However, as the location of the interactive target and the external environment change, one or more positioning modules may fail. Based on the positioning accuracy of each module determined by experiments, a weighted fusion algorithm with variable weights is proposed.
When the interactive object is out of range of the visual positioning module or in a dark environment, the visual module fails; its weight is set to zero and the remaining coefficients are recalculated in the same inverse-error proportion. When only the auditory positioning module works, the interactive object is in a dim environment and outside the range of the infrared positioning module, and the positioning subsystem directly outputs the result of the auditory module. In a noisy environment, the auditory positioning module stops working, and the remaining working modules are fused in the same way. When only the infrared positioning module works, its result is output directly.
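A minimal sketch of this variable-weight fusion, assuming inverse-error weights renormalized over whichever modules currently report a result (the error figures in the example are placeholders, not the paper’s measurements):

```python
def fuse_polar(estimates, errors):
    """Variable-weight fusion of polar estimates.

    estimates: dict module -> (r, theta), or None if the module has failed.
    errors: dict module -> positioning error measured in offline experiments.
    Returns the fused (r, theta), or None if every module has failed.
    """
    active = [m for m, est in estimates.items() if est is not None]
    if not active:
        return None
    inv = {m: 1.0 / errors[m] for m in active}      # inverse-error weights
    total = sum(inv.values())
    weights = {m: inv[m] / total for m in active}   # renormalize to sum to 1
    r = sum(weights[m] * estimates[m][0] for m in active)
    theta = sum(weights[m] * estimates[m][1] for m in active)
    return r, theta

# Example: vision fails in the dark, so infrared and audition are fused.
fused = fuse_polar(
    {"vision": None, "infrared": (2.1, 14.0), "audition": (2.3, 16.0)},
    {"vision": 0.05, "infrared": 0.12, "audition": 0.20},
)
```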
Emotional interaction subsystem
Bimodal emotion recognition
In this article, facial expression and voice emotion are fused by decision-level fusion.
Facial emotion recognition
The Noldus Facial Expression Analysis System (FaceReader) is used for facial emotion recognition. This article is based on discrete emotional classification: the FaceReader output follows Paul Ekman’s six basic expressions plus a calm expression, forming a seven-dimensional probability vector that describes the emotional state.
Speech emotion recognition
Speech emotion recognition is a new research hotspot involving traditional speech signal processing, pattern recognition, human psychology, artificial intelligence, and other fields. The research of speech emotion recognition is based on discrete emotion classification system.
The degree of confusion between two categories measures how often samples of one category are misclassified as the other: the higher the confusion, the more difficult it is to distinguish the two categories. When the degree of confusion between a certain category and the others is greater than 0.02 and less than 0.1, the classification cannot be judged directly. By calculating the total degree of confusion between the category and each candidate subclass, the grouping with the highest total confusion is selected.
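One plausible reading of the confusion degree, sketched under the assumption that it is the symmetric off-diagonal rate of a normalized confusion matrix C (rows are true classes, columns predicted; the paper’s exact definition did not survive extraction):

```python
import numpy as np

def confusion_degree(C, i, j):
    """Symmetric rate at which categories i and j are mistaken for each other."""
    return (C[i, j] + C[j, i]) / 2.0

def total_confusion(C, i, subclass):
    """Total confusion between category i and a candidate subclass grouping."""
    return sum(confusion_degree(C, i, j) for j in subclass if j != i)

# Example with a toy 3-class confusion matrix (rows sum to 1).
C = np.array([[0.90, 0.07, 0.03],
              [0.05, 0.85, 0.10],
              [0.02, 0.12, 0.86]])
print(confusion_degree(C, 1, 2))   # 0.11 -> classes 1 and 2 are hard to separate
```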
Decision-level fusion
This article chooses the weighted summation method through experiments.
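A minimal sketch of the weighted-summation fusion, assuming the two modalities each output a seven-dimensional probability vector and the modality weight is tuned experimentally (the value 0.6 below is a placeholder, not the paper’s result):

```python
import numpy as np

def fuse_decisions(p_face, p_speech, w_face=0.6):
    """Decision-level fusion: weighted sum of the two probability vectors."""
    p = w_face * np.asarray(p_face) + (1.0 - w_face) * np.asarray(p_speech)
    return int(np.argmax(p)), p   # winning emotion index and fused scores
```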
Fuzzy emotional decision-making model
Fuzzy emotional decision-making takes the robot’s initial emotional state and the recognized emotional state of the current interactive object as inputs and combines them through fuzzy reasoning rules to generate the robot’s emotion.
This article follows the emotional quantification approach of reference 22. The seven emotional states are quantified as interval values, and, considering external stimuli, the intervals are scaled proportionally onto [0, 7]. The Mamdani algorithm is used to construct the fuzzy emotional decision model.
After analysis, the normal-distribution curve of the Gauss function matches the characteristic that emotional influence gradually weakens from the central point outward, and it conveniently expresses the degree to which an emotional input belongs to each fuzzy emotional subset. Assuming that all the emotional centers have the same influence range and strength, the Gauss membership function is used.
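In its standard form (the typeset equation was lost in extraction; \(c_i\) and \(\sigma\) are our symbols for the center of the \(i\)-th emotional subset and the shared influence range), the membership of an emotional input \(x\) in the \(i\)-th fuzzy subset is

\[ \mu_i(x) = \exp\!\left( -\frac{(x - c_i)^2}{2\sigma^2} \right) \]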
Based on a male volunteer, the fuzzy rules for each orthogonal combination of inputs are formulated and the probability of each rule is evaluated. As presented in Table 2, the rows represent the robot’s emotional state at the previous moment, the columns represent the external stimulus, and the numerical values in the table represent the proportion of each rule.
Table 2. Fuzzy emotional decision rules.
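A compact sketch of the resulting Mamdani inference on the quantified axis [0, 7] (the centers, spread, and rule lookup below are illustrative stand-ins, not the paper’s Table 2):

```python
import numpy as np

CENTERS = np.linspace(0.5, 6.5, 7)   # one illustrative center per emotion
SIGMA = 0.5                          # shared influence range (assumed)

def membership(x):
    """Gaussian membership of input x in each of the seven fuzzy subsets."""
    return np.exp(-((x - CENTERS) ** 2) / (2 * SIGMA ** 2))

def decide(prev_emotion, stimulus, rules):
    """rules[i][j]: output emotion index for previous state i and stimulus j."""
    mu_prev, mu_stim = membership(prev_emotion), membership(stimulus)
    num = den = 0.0
    for i in range(7):
        for j in range(7):
            strength = min(mu_prev[i], mu_stim[j])   # Mamdani AND (min)
            num += strength * CENTERS[rules[i][j]]
            den += strength
    return num / den if den else prev_emotion       # centroid defuzzification
```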
Dialogue subsystem with personal information
In this article, a dialogue subsystem with personal information is proposed (see Figure 8). By giving the chat robot specific personal information, the robot can generate responses consistent with the given information. The system first uses a question classifier to decide whether the question needs to be handled by the personal information dialogue model. If so, the model retrieves the most similar question in the template and returns the answer for that question’s category; if not, the open-domain dialogue model generates the response.

Figure 8. Dialogue subsystem with personal information.
Question classification
The classification model determines whether the input question needs to be processed by the personal information dialogue model, which is a two-class classification problem.
Personal information dialogue model
This article constructs a personal information dialogue model based on the Siamese (twin) network idea. First, the two objects to be matched are each represented by a deep learning model; then their matching degree is obtained by computing the similarity between the two representations. This article uses bidirectional long short-term memory (BiLSTM) to represent the semantic information of sentences. 23
The loss function used is the contrastive loss, 24 which is commonly used in Siamese networks and effectively handles the relationship between paired data.
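In its standard form from the metric-learning literature (our transcription, since the paper’s typeset equation was lost; \(d_n\) is the distance between the two BiLSTM representations of pair \(n\), \(y_n \in \{0, 1\}\) the match label, \(m\) the margin, and \(N\) the number of pairs):

\[ L = \frac{1}{2N} \sum_{n=1}^{N} \left[ y_n\, d_n^2 + (1 - y_n)\, \max(m - d_n,\ 0)^2 \right] \]

Matched pairs are pulled together, while mismatched pairs are pushed apart until their distance exceeds the margin.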
Dialogue model based on maximum mutual information
The open-domain dialogue model is based on seq2seq. However, if we rely only on maximum likelihood estimation, then even when trained on a large amount of data, the seq2seq model is prone to generating safe answers such as “不知道” (“I don’t know”), “哈哈哈” (“ha ha ha”), and “好的” (“OK”). We therefore adopt the anti-language model proposed by Li et al. 25 and take maximum mutual information as the objective function of seq2seq. Words generated early have a greater impact on sentence diversity than words generated later; to preserve fluency as much as possible, only the high-probability candidate words generated in the early stage of decoding are penalized.
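The objective equations did not survive extraction; the MMI-antiLM formulation of Li et al., which the text describes, scores a candidate reply \(T\) to input \(S\) as follows (our transcription; \(\lambda\) is the penalty coefficient and \(g(k)\) restricts the anti-language-model penalty to the first \(\gamma\) generated words):

\[ \hat{T} = \arg\max_{T} \left\{ \log p(T \mid S) - \lambda \log U(T) \right\}, \qquad U(T) = \prod_{k=1}^{N_T} p(t_k \mid t_1, \ldots, t_{k-1})^{\,g(k)}, \qquad g(k) = \begin{cases} 1, & k \le \gamma \\ 0, & k > \gamma \end{cases} \]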
The model consists of an encoder and a decoder. The encoder adopts a two-layer BiLSTM network with 512 units per layer, and the word-vector dimension is set to 300. Bahdanau’s attention mechanism is used. During training, dropout is adopted with a retention rate of 0.5, the Adam learning rate is set to 0.001, the batch size is set to 32, and the number of data iterations is 128.
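A minimal sketch of this encoder configuration in PyTorch (the paper does not name a framework, so the library choice is ours; Bahdanau attention and the decoder are omitted for brevity):

```python
import torch.nn as nn

class Encoder(nn.Module):
    """Two-layer BiLSTM encoder: 300-dim word vectors, 512 units per layer.

    Training, per the text, would use dropout (retention 0.5), Adam with
    learning rate 0.001, and batch size 32.
    """

    def __init__(self, vocab_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 300)
        self.rnn = nn.LSTM(input_size=300, hidden_size=512, num_layers=2,
                           bidirectional=True, dropout=0.5, batch_first=True)

    def forward(self, tokens):
        # tokens: (batch, seq_len) integer ids -> outputs, (h_n, c_n)
        return self.rnn(self.embed(tokens))
```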
Experiment
Experimental results of multi-sensor positioning subsystem
Experiments on each positioning module in different environments
To verify that the multi-sensor positioning system can be applied to various working environments, target positioning experiments are carried out in a normal lighting environment, a dark environment, and a noisy environment. The experimental results are given in Table 3.
Table 3. Positioning results in different environments.
From the experimental results, we can see that individual positioning modules fail in particular scenarios, but the remaining modules continue to work normally, so the system has good stability.
Fusion experiment results of multi-sensor positioning subsystem
The multi-sensor positioning system is used to locate points in the environment, and the results are fused. Some statistical results are given in Table 4.
Table 4. Fusion experiment results of the multi-sensor positioning system.
The experimental results show that the positioning accuracy after fusion is higher than that of any single positioning module, and the stability of positioning is greatly improved.
Experiments on emotional interaction
Speech emotion recognition
This article uses the CASIA Chinese Emotional Corpus for model training, openSMILE for feature extraction, principal component analysis (PCA) with a 95% contribution threshold, and the LIBSVM toolkit developed by Professor Lin at National Taiwan University (the kernel function is a degree-3 polynomial). The degrees of confusion obtained by experiments are given in Table 5.
Table 5. The confusion between emotional categories.
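A hedged sketch of this pipeline with scikit-learn standing in for the LIBSVM toolkit (scikit-learn’s SVC is itself built on LIBSVM); X is an array of openSMILE feature vectors and y the emotion labels:

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

clf = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),        # keep components explaining 95% variance
    SVC(kernel="poly", degree=3),  # degree-3 polynomial kernel
)
# clf.fit(X_train, y_train); clf.predict(X_test)
```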
According to the degree of confusion, the hierarchical SVM as shown in Figure 9 is designed.

Figure 9. The hierarchical SVM.
The comprehensive recognition rate and recognition rate of each level classifier are given in Table 6.
Table 6. The recognition rate of each SVM and the hierarchical model.
Bimodal emotional fusion
Some data from the eNTERFACE’05 multi-modal emotion database are used to verify validity. The five emotions of happiness, surprise, fear, sadness, and anger are screened by a scoring principle, and thirty samples are selected for each emotion for bimodal emotion recognition.
The results of single-modal and bimodal emotion recognition are compared in Table 7. The experimental results show that bimodal emotion recognition performs better than either single modality, with an average recognition rate of 59.34%. Table 8 gives the results of emotion recognition for each emotion.
Table 7. Comparison of single-modal and bimodal recognition results.
Table 8. Different emotional recognition results.
When the probability of detecting disgust is the highest, speech emotion recognition is not carried out, and disgust is taken as the final recognition result.
Fuzzy emotional decision-making
After setting the initial emotional state of the robot, with the change of external stimulus, the current emotional state of the robot is simulated and calculated.
When the initial emotional state of the robot is calm and the emotional state of the interactive object changes in turn, the emotional change curve of the robot is given in Figure 10(a). When the robot changes to the state of surprise, as the external stimulus gradually turns to fear, the robot also changes to fear. When it encounters a sad external stimulus, the robot then turns to sadness and keeps the sad state under a disgust stimulus. Figure 10(b) shows the simulation results with a sad initial state.

Figure 10. The emotional change curve of the robot with initial emotion: (a) initial emotion as calm and (b) initial emotion as sadness.
The simulation results show that, under the fuzzy emotional decision-making model, the robot’s emotional state changes slowly and continuously under external stimuli, and the calculated results accord with the reasoning rules and with human emotional changes.
Experiments of dialogue subsystem
In this article, accuracy, F1 score, and area under the curve (AUC) are selected to evaluate the performance of the question classification model. The values are 87.69%, 0.8754, and 0.8797, respectively.
In the personal information reply model, since five identities are set in this article, accuracy measures whether the category of the most similar sentence matches the label category; the accuracy is 87.4%.
The open-domain dialogues are evaluated by manual scoring and the bilingual evaluation understudy (BLEU) method. In the experiment, the penalty coefficient of the maximum mutual information model is set to 0.5.
In the overall dialogue system, this article randomly selected some dialogues (given in Table 9) and asked volunteers to evaluate the responses on several aspects, including information consistency.
Table 9. Samples of dialogues.
From Table 10, we can see that our model is superior to the ordinary seq2seq model on every index, especially information consistency, because the added personal information reply model ensures the consistency of personal information.
Table 10. Evaluation of responses.
Conclusion and future work
This article designs an interactive system for the humanoid robot SHFR-III. The system uses a multi-sensor positioning subsystem to locate targets accurately in complex environments, uses the bimodal emotion recognition model and the fuzzy emotional decision-making model to carry out human–robot emotional interaction through the robot’s facial expressions, and uses the dialogue subsystem with personal information to generate responses consistent with the preset information. The system is easy to implement and offers good interactivity. It can be applied to the care of the elderly, where HRI helps us understand their emotional world and chatting can relieve their loneliness; it can also be used in robot teaching, autism treatment, and other fields.
Our work is only a small step toward achieving a harmonious HRI, and there are many future directions. For example, specific subsystems should be added for different application scenarios, such as robot hand grasping, motion trajectory control, and smart home control.
