Abstract
Keywords
1. Introduction
With the development of robotic theories and technologies, humanoid robots with facial expressions are emerging in areas such as aged care and childhood education. Many universities and organizations around the world focus on this research area, so many such robots have appeared [1].
In human-robot interaction, there is not only verbal interaction but also nonverbal interaction. Nonverbal behavior is an important factor influencing believability. Vinciarelli considered nonverbal behaviors to be relational attitudes exchanged between interacting individuals [2], including facial expressions and body posture. Duric adopted them to enhance operator performance and support rational decision-making [3]. Busso's view was similar: he pointed out that rigid head motion is a gesture that conveys important nonverbal information in human communication [4]. In addition, results showed that participants expect a high level of realistic, human-like verbal and nonverbal communicative behavior from human-like agents; a human-like body provides abundant nonverbal information and enables smooth communication with the robot [5, 6]. Since nonverbal behaviors increase the interest and believability of the interaction, research on how to match them well with speech (verbal interaction) is both helpful and necessary.

In human-robot interaction, researchers also consider the importance of affect. The related theory and technology are called affective computing, which includes emotion analysis, development, and expression. As shown in figure 1, the speech contents the robot will utter and the human's behaviors, regarded as internal and external stimuli respectively, influence the robot's emotion status simultaneously and then change his behaviors. This process is not the key point here and will be described in another paper.

Contents researched in this paper
In this paper, we focus mainly on expression, especially the integrated expression of speech and nonverbal behaviors linked by emotion. There are three contributions. First, a novel method for matching speech contents and nonverbal behaviors is proposed, based on a genetic algorithm with fixed gene loci; under certain constraints, speech and nonverbal behaviors can be matched well. Second, the effectiveness and robustness of this method are verified in simulation experiments under different conditions. Third, the proposed method is also applied to telling stories in practice.
The overall goal of our work is to implement a method that matches nonverbal behaviors to speech, linked by emotion status, for a humanoid robot. This paper is structured as follows. In section 2, an overview and a flow chart of the proposed method are introduced, and, as preparation, two relations are stated: one between emotion status and voice characteristics, the other between emotion status and nonverbal behaviors. On this basis, section 3 investigates how to match nonverbal behaviors to speech with a multi-objective genetic algorithm. In section 4, experiments are conducted to test and verify the method proposed in section 3. Finally, conclusions are given in section 5.
2. Overview of Emotion, Voice Characteristics and Nonverbal Behaviors
To make the proposed method clear, a flow chart is shown in figure 2. Once the humanoid robot's emotion status is calculated, the voice characteristics of his speech, such as speech rate, volume, and tone, are determined. However, many nonverbal behaviors are linked with the current emotion, and selecting suitable ones to match the speech contents is the key problem. It can be solved with the help of a multi-objective genetic algorithm.

Flow chart of the proposed method
Considering figures 1 and 2 together, the humanoid robot's emotion status is affected by internal and external stimuli. In this paper, we regard the speech contents as internal stimuli. Given these stimuli, the emotion engine, consisting of the emotional Markov Chain Model (eMCM) and the emotional Hidden Markov Model (eHMM) proposed by Teng [7], generates the robot's internal emotion status. Taking this emotion into account, the robot speaks with suitable speed, volume, and tone, and performs nonverbal behaviors vividly. There are thus two problems to solve: 1) the relation between emotions and the voice characteristics of speech; 2) the relation between emotions and nonverbal behaviors.
2.1. Emotion and Voice Characteristics
Alongside semantic emotion, voice emotion is the main channel for expressing hidden mental states. Following Ekman's six emotional expressions [8], we design ours: happiness, sadness, anger, surprise, pity, and disgust. In addition, there is a calm state for when the humanoid robot is not stimulated. Because the voice-characteristic range of the iFly TTS SDK we use is 0~10 for speech rate, volume, and tone, we set their values as shown in table 1 [9].
Voice characteristics to six types of emotion
To illustrate, we give some examples. If the robot is happy, he speaks the contents with parameters S8, V6, and T7. Correspondingly, if the robot is angry, the voice characteristics may be set to S7, V6, and T5.
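The emotion-to-voice mapping above can be sketched as a simple lookup table. Only the happiness and anger triples appear in the examples in the text; the calm entry here is an illustrative placeholder, not a value from table 1.

```python
# Hedged sketch: emotion -> (speech rate, volume, tone) in the 0~10
# range of the iFly TTS SDK. Happiness and anger values come from the
# worked examples above; the calm triple is an assumed placeholder.
VOICE_PARAMS = {
    "happiness": (8, 6, 7),   # S8, V6, T7 (example in the text)
    "anger":     (7, 6, 5),   # S7, V6, T5 (example in the text)
    "calm":      (5, 5, 5),   # placeholder: neutral mid-range values
}

def voice_for(emotion: str) -> tuple[int, int, int]:
    """Return (speech_rate, volume, tone), falling back to calm."""
    return VOICE_PARAMS.get(emotion, VOICE_PARAMS["calm"])
```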
2.2. Emotion and Nonverbal Behaviors
In real life, nonverbal behavior actually has no fixed pattern. For this preliminary research, however, we mainly consider facial behaviors. Given the limited control points and imperfect design of some humanoid robots, twenty kinds of facial nonverbal behaviors are programmed. These nonverbal cues are usually considered part of intention, reasoning, desire, emotion, and personality; they can also help the robot convey feelings and hidden mental states. Taking head movements as an example, the behavior “Chin down” may indicate humiliation, shame, shyness, refusal, or boredom. In other words, these nonverbal styles add extra channels of communication beyond language.
In addition, we divide the twenty nonverbal behaviors into three groups, listed in table 2: Primary Nonverbal Behaviors (PNB), Secondary Nonverbal Behaviors (SNB), and Basic Interaction Habits (BIH). PNB are related to the six types of emotions above and can express emotion directly. SNB express emotion indirectly; they are implicit emotional behaviors, comprising E+SNB and E−SNB corresponding to positive and negative emotion respectively. BIH were developed to form complex behaviors and are grouped according to the functional morphology described in [10].
When the humanoid robot is engaged in speaking with a human, a direct emotional expression is needed, so a type of PNB must be selected first. Then, to make his performance vivid, SNB and BIH should also be integrated as assistant nonverbal behaviors. When he is calm, however, PNB and SNB are not suitable, so only behaviors in BIH are selected and grouped.
Suppose each behavior has three properties: Time Property (TP), Expressivity Property (EP), and Preference Property (PP). TP indicates the time period needed to perform an action. EP refers to the intensity of its expressive force. PP shows how much the robot likes to perform this behavior, which can reflect his habits.
TP can be measured with the help of a timer in our testing program. When one of the twenty nonverbal behaviors starts, the timer begins; it stops when the behavior finishes. Thus we obtain the time period of the behavior, which is its Time Property. By this method we can obtain the TP of all twenty nonverbal behaviors.
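The timing procedure above amounts to bracketing the behavior's execution with a clock read; a minimal sketch, assuming the behavior is exposed as a callable motion routine:

```python
import time

def measure_tp(behavior) -> float:
    """Measure the Time Property (TP) of a behavior: wall-clock seconds
    from the start of its execution until it finishes."""
    start = time.perf_counter()
    behavior()                      # run the motion routine to completion
    return time.perf_counter() - start
```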
The Expressivity Property is determined mainly by testers. To reduce the subjectivity of any single test, mean values of 30 test results are calculated. We first record the behaviors on video and then show them separately to 30 subjects from our university (13 women and 17 men, aged 23 to 26), who mark the intensity of each behavior's expressive force on a scale from 1 (weak) to 5 (strong). Finally, we analyze these marks to determine the EP of the twenty nonverbal behaviors.
The Preference Property demonstrates our humanoid robot's habits. If the value for one behavior is bigger than for the others, the robot prefers to carry out this behavior. We let PP range from 1 to 5: 1 signifies no preference, and 5 means the robot very much likes to perform the behavior.
On this basis, the TP and EP of every behavior are also listed in table 2. PP will be adjusted in practice.
Classification and properties of nonverbal behaviors
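The per-behavior record described above can be sketched as a small data structure. The category names (PNB/SNB/BIH) come from the classification in the text; the sample values are hypothetical, since table 2 is not reproduced here.

```python
from dataclasses import dataclass

@dataclass
class Behavior:
    """One nonverbal behavior with its three properties."""
    name: str
    category: str   # "PNB", "SNB" (E+ or E-), or "BIH"
    tp: float       # Time Property: duration in seconds
    ep: float       # Expressivity Property: mean rating on a 1-5 scale
    pp: int = 1     # Preference Property: 1 (no preference) to 5

    def as_vector(self) -> tuple[float, float, int]:
        return (self.tp, self.ep, self.pp)

# Hypothetical sample entry (values not taken from table 2).
smile = Behavior("smile", "PNB", tp=1.2, ep=4.1)
```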
3. Integration of Nonverbal Behaviors and Speech
As stated above, nonverbal behaviors and speech have already been linked by emotions. However, the number of possible behavior combinations is large. For instance, if the robot wants to express happiness, the selected PNB is a smile. Suppose that we have
For the problem of matching nonverbal behaviors with speech, some properties of a behavior should be considered when it is selected. In this paper we consider only the three properties stated above, whose values are grouped as a vector p = (TP, EP, PP), where TP, EP, and PP are the Time, Expressivity, and Preference Properties of the behavior respectively.
The computational framework of the genetic algorithm consists of five major operators:
According to table 2, state 1 means that the behavior at the corresponding gene locus is selected, while state 0 means it is not.
In addition, the initial population size,
where the fitness value demonstrates the matching degree of speech contents and behaviors.
Fitness function
The normalized fitness distance (NFD) is a measure of solution convergence. It is analogous to the ratio of the improvement of the average fitness to the improvement of the best fitness in a population. NFD is defined by:
Where,
Where, the values of
The maximum number of offspring that can be generated in a population is equal to its population size
Where, the values of
The mutation operator is carried out on a single chromosome, so
The multi-objective genetic algorithm uses the five operators discussed above to perform an effective search and optimization with both stochastic and heuristic characteristics. The adaptive crossover and mutation techniques control the balance between search-space exploration and search-space exploitation. The effect of this matching method is illustrated in the experiment section.
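The overall loop can be sketched as follows. Since the paper's fitness function and operator parameters are not reproduced above, everything here is an illustrative assumption: a binary chromosome with one fixed locus per behavior, and a fitness that penalizes the gap between the selected behaviors' total duration and the speech duration while lightly rewarding expressivity (EP) and preference (PP).

```python
import random

random.seed(0)  # deterministic for the sketch

# Hypothetical property tables for 8 behaviors (the real system uses 20).
TP = [1.2, 0.8, 1.5, 0.6, 1.0, 0.9, 1.1, 0.7]   # durations in seconds
EP = [4, 3, 5, 2, 3, 4, 2, 3]                   # expressivity, 1-5 scale
PP = [1, 1, 1, 1, 1, 1, 2, 3]                   # preference, 1-5 scale

def fitness(chrom, speech_duration):
    """Assumed weighting: temporal match dominates, EP and PP assist."""
    delta = abs(sum(t for t, g in zip(TP, chrom) if g) - speech_duration)
    expr = sum(e * g for e, g in zip(EP, chrom))
    pref = sum(p * g for p, g in zip(PP, chrom))
    return -delta + 0.1 * expr + 0.1 * pref

def evolve(speech_duration, pop_size=30, generations=60, pc=0.8, pm=0.05):
    n = len(TP)
    pop = [[random.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda c: fitness(c, speech_duration), reverse=True)
        next_pop = pop[:2]                       # elitism: keep best two
        while len(next_pop) < pop_size:
            a, b = random.sample(pop[:10], 2)    # truncation selection
            cut = random.randrange(1, n)         # one-point crossover
            child = a[:cut] + b[cut:] if random.random() < pc else a[:]
            child = [g ^ (random.random() < pm) for g in child]  # bit-flip
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=lambda c: fitness(c, speech_duration))

best = evolve(speech_duration=4.0)  # chromosome best matching a 4 s utterance
```

The fixed gene loci make the search space a simple bit string, so standard one-point crossover and bit-flip mutation apply directly; the adaptive rate control described in the text is omitted from this sketch.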
4. Experiments
4.1. Simulation of the match method
To test the matching method based on the multi-objective genetic algorithm proposed in this paper, speech contents with different numbers of Chinese characters are adopted. With these sentence templates, the humanoid robot performs six times with the six types of emotion: happiness, sadness, anger, surprise, pity, and disgust. In addition, to illustrate the influence of different PP values, contrast tests are also carried out.
4.1.1. Robustness
A good match of speech contents and nonverbal behaviors mainly means that they are synchronous: within certain limits, speech and behaviors begin and stop simultaneously and coordinate with each other. To demonstrate the robustness of the matching method, experiments are carried out on sentence templates with different numbers of Chinese characters. First, supposing that our robot has no habit (PP=1), the match results for sentences with 25 and 35 Chinese characters are shown in figures 3 and 4.
Take figure 3 for example. Figure 3(a) shows that the fitness increases with the generation under the genetic algorithm; the x-coordinate is the generation and the y-coordinate the fitness. For all six types of emotion, the fitness reaches a high level and remains there, which demonstrates that the sentence and behaviors match well. In other words, from a macroscopic view, the three objectives stated above are satisfied by our matching method.

Match result for twenty-five words (PP=1)

Match result for thirty-five words (PP=1)
On the microscopic level, for instance Δ,
Second, supposing that our robot likes to perform BIH 12 and 13 and prefers BIH 13 (PP12=2, PP13=3), the match results for sentences with 25 and 35 Chinese characters are shown in figures 5 and 6.
Figure 5(a) shows that the fitness increases with the generation when PP12=2 and PP13=3, which means the three objectives stated above are again satisfied, with good effect. The matching method we propose is thus robust to various sentence lengths and to the robot's preferred behaviors.

Match result for twenty-five words (PP12=2, PP13=3)

Match result for thirty-five words (PP12=2, PP13=3)

Variation of the maximum Δ, the minimum Δ, and the average Δ
Finally, we examine the variation of the maximum Δ, the minimum Δ, and the average Δ to show robustness in detail. Their variation tendencies are shown in figure 7.
In figure 7, on the x-coordinate, 1 corresponds to the condition of figure 3, i.e., speech contents with 25 Chinese characters and PP=1. Similarly, 2, 3, and 4 correspond to the conditions of figures 5, 4, and 6 respectively.
On the whole, no matter how the number of Chinese characters and PP change, the variations of the maximum Δ, the minimum Δ, and the average Δ remain within a small range. These match errors between speech contents and nonverbal behaviors are almost imperceptible in practice, so the method is robust to changing conditions.
Note the decline between conditions 2 and 3. The speech duration of a sentence with 25 Chinese characters is obviously shorter than that of one with 35 characters. Because the TP of every behavior in table 2 is fixed, far fewer behavior combinations are available for a shorter speech duration; in other words, it is harder to match shorter speech contents with behaviors, so Δ is bigger.
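The match error discussed above reduces to the gap between the selected behaviors' total duration and the speech duration; a minimal sketch, assuming Δ is that absolute gap:

```python
# Sketch of the match error Δ: the absolute gap between the summed
# durations (TP) of the selected behaviors and the speech duration.
def match_error(selected_tps: list[float], speech_duration: float) -> float:
    return abs(sum(selected_tps) - speech_duration)

# A shorter utterance leaves fewer feasible behavior combinations,
# so its best achievable Δ tends to be larger.
short_delta = match_error([1.2, 1.5], 3.0)   # Δ ≈ 0.3
```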
In sum, the method for matching speech contents and nonverbal behaviors based on the multi-objective genetic algorithm with fixed gene loci has good performance.
4.1.2. Effect of Preference Property
The Preference Property indicates that the robot prefers certain behaviors. When PP equals 1, he selects behaviors randomly and has no habit; when the value for a certain behavior increases, the robot tends to adopt that behavior. Here we suppose two conditions: in one, the PP values of all behaviors equal 1 (PP=1); in the other, the PP values of BIH-12 and BIH-13 equal 2 and 3 respectively (PP12=2, PP13=3). The effects of changing the Preference Property are compared in figures 8 and 9 and analyzed below.

Comparison of selected behaviors for twenty-five words
When the robot speaks with emotion, a direct emotional expression is needed, so a type of PNB must be selected. From figure 8 we can see that PNB1-6 are all selected and do not change with changed PP, which is reasonable. Although BIH12 is selected under condition 2, the total number of BIH12 and BIH13 selections under both conditions is nearly unchanged. This is mainly because the speech duration of a sentence with 25 Chinese characters is short, so fewer optional behaviors fit; the objective of temporal matching must be satisfied first, which makes the effect of changed PP less obvious.

Comparison of selected behaviors for thirty-five words
In figure 9, however, the effect of changed PP is obvious: the total number of BIH12 and BIH13 selections under condition 2 is higher than under condition 1. Changing PP thus results in a different behavior selection for the same speech contents, which demonstrates our robot's habit.
To sum up, the matching method we propose satisfies all three goals stated above and is robust under different conditions. We will now apply this method to storytelling in practice; the actual results are recorded and analyzed in detail.
4.2. Telling a Story in Practice
4.2.1. Scenario Description
To illustrate the effects in practice, we randomly select a story from Aesop's Fables, The Wolf and the Lamb (in Chinese), and apply the proposed approach. Storyboard examples are shown in figure 10.

Storyboard examples
Given the specific characteristics of the humanoid robot, we derive and differentiate the dialogues, descriptions, and action data when conceiving the storyboard, annotate the story scripts, and consider the robot's current emotion status, as listed in table 3.
Scenario of the fable
Each sentence in the story corresponds to a type of emotion status. For example, before the robot speaks Sentence 3, some preparation is done. First, he looks up the emotion status in the storyboard; the mark Anger means this emotion status should accompany the sentence. Then the matching of speech contents and nonverbal behaviors is carried out, which makes the robot speak and act simultaneously with the current emotion. In this way Sentence 3 of the fable is expressed, and the robot immediately prepares the next sentence. In table 3, the mark “-” means retaining the current emotion.
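The per-sentence procedure above can be sketched as a loop over annotated sentences. The function names (`match_behaviors`, `speak_and_act`) are hypothetical placeholders for the matching step and the synchronized output step, and the sentences are stand-ins for the storyboard entries in table 3.

```python
# Sketch of the storytelling loop: each sentence carries an emotion
# annotation; "-" means keep the current emotion (as in table 3).
storyboard = [
    ("Sentence 1 ...", "calm"),
    ("Sentence 2 ...", "-"),
    ("Sentence 3 ...", "anger"),
]

def run_story(storyboard, match_behaviors, speak_and_act):
    emotion = "calm"
    for sentence, mark in storyboard:
        if mark != "-":
            emotion = mark          # update only on an explicit annotation
        behaviors = match_behaviors(sentence, emotion)
        speak_and_act(sentence, behaviors, emotion)
```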
4.2.2. Result analysis
Experiments are carried out according to the storyboard designed above. The PNB behaviors frown, surprise, and pity are selected because of the emotion status, and other behaviors in SNB or BIH are selected to supplement the behavior group. With our objectives satisfied, the storytelling process is vivid and fluent.
Experiment process in detail
Moreover, we record the experimental data shown in table 4 to reveal the process and obtain more information about our approach. The data in the table include Speech rate (S), Volume (V), Tone (T), the selected behaviors, Behavior Duration (BD), Word Count (WC) in a sentence, Sentence Duration (SD), Δ,
In table 4 the selected behaviors are recorded in detail. We can calculate BD and
Looking at the values in the Δ column, every one is smaller than 1, which means people cannot perceive the time difference between the nonverbal behaviors and the speech; this is good enough for the behavior-speech match. Similarly,
5. Conclusions
In this paper, a matching method for speech contents and nonverbal behaviors based on a multi-objective genetic algorithm is devised for use in human-robot interaction and cooperation. We have examined the relationships among emotion, voice characteristics, and nonverbal behaviors, and devised this matching method.
The focus of the present research is how to construct a hierarchical structure linking speech and behaviors on the robot. From our point of view, this problem is solved: behaviors including PNB, SNB, and BIH can be matched well with emotional speech, allowing the robot to exhibit intelligent-like ability, flexibility, and styled performance. Both in simulation experiments and in practice the results are good; the proposed method is effective and robust.
Future work includes two aspects. Because the system is still limited by the variety of the nonverbal behavior database, we will first extend it to give our robot richer expressive power. We will then take on the challenge of applying the same approach to an alphabet-based language such as English. Further experimental evaluations will also have to be carried out.
