Abstract
Keywords
1. Introduction
With the development of robotic theories and technologies, humanoid robots with facial expressions are emerging in areas such as aged care and childhood education. Many universities and organizations around the world focus on this research area, so many such robots have appeared [1].
In human-robot interaction, there is not only verbal interaction but also nonverbal interaction. Nonverbal behavior is an important factor influencing believability. Vinciarelli considered nonverbal behaviors to be relational attitudes exchanged between interacting individuals [2], including facial expressions and body posture. Duric adopted them to enhance operator performance and support rational decision-making [3]. Busso's view was similar: he pointed out that rigid head motion is a gesture that conveys important nonverbal information in human communication [4]. In addition, results showed that participants expect a high level of realistic, human-like verbal and nonverbal communicative behavior from human-like agents; a human-like body provides abundant nonverbal information and enables smooth communication with the robot [5, 6]. Since nonverbal behaviors increase the interest and believability of the interaction, research on how to match them well with speech (verbal interaction) is both helpful and necessary.

In human-robot interaction, researchers also consider the importance of affect. The related theory and technology are called affective computing, which includes emotion analysis, development, and expression. As shown in figure 1, the speech contents the robot will utter and the human's behaviors, regarded as internal and external stimuli respectively, influence the robot's emotion status simultaneously and then change his behaviors. This process is not the key point here and will be described in another paper.

Contents researched in this paper
In this paper, we focus mainly on expression, especially the integrated expression of speech and nonverbal behaviors linked by emotion. There are three contributions. First, a novel method for matching speech contents and nonverbal behaviors is proposed, based on a genetic algorithm with fixed gene loci; under certain constraints, speech and nonverbal behaviors can be matched well. Second, the effectiveness and robustness of this method are verified in simulation experiments under different conditions. Third, the proposed method is also applied to telling stories in practice.
The overall goal of our work is to implement a method that matches nonverbal behaviors to speech, linked by emotion status, for a humanoid robot. This paper is structured as follows. In section 2, an overview and a flow chart of the proposed method are introduced, and, as preparation, two relations are stated: one between emotion status and voice characteristics, the other between emotion status and nonverbal behaviors. On this basis, section 3 investigates how to match nonverbal behaviors to speech with a multi-objective genetic algorithm. In section 4, experiments are conducted to test and verify the method proposed in section 3. Finally, conclusions are given in section 5.
2. Overview of Emotion, Voice Characteristics and Nonverbal Behaviors
To make the proposed method clear, a flow chart is shown in figure 2. Once the humanoid robot's emotion status is calculated, the voice characteristics of his speech, such as speech rate, volume, and tone, are determined. However, many nonverbal behaviors are linked with the current emotion, and selecting suitable ones to match the speech contents is the key problem. It can be solved with the help of a multi-objective genetic algorithm.

Flow chart of the proposed method
Considering figures 1 and 2 together, the humanoid robot's emotion status is affected by internal and external stimuli. In this paper, we regard the speech contents as internal stimuli. Given these stimuli, the emotion engine, consisting of the emotional Markov Chain Model (eMCM) and the emotional Hidden Markov Model (eHMM) proposed by Teng [7], generates the robot's internal emotion status. Taking this emotion into account, the robot speaks with suitable speed, volume, and tone, and performs nonverbal behaviors vividly. There are thus two problems to solve: 1) the relation between emotions and the voice characteristics of speech; 2) the relation between emotions and nonverbal behaviors.
2.1. Emotion and Voice Characteristics
Alongside semantic emotion, voice emotion is the main channel for expressing hidden mental states. Following Ekman's six emotional expressions [8], we design ours: happiness, sadness, anger, surprise, pity, and disgust. In addition, there is a calm state for when the humanoid robot is not stimulated. Because the voice-characteristic range of the iFly TTS SDK we use is 0~10 for speech rate, volume, and tone, we set their values as shown in table 1 [9].
Voice characteristics to six types of emotion
To illustrate, we give some examples. If the robot is happy, he speaks the contents with parameters S8, V6, and T7. Correspondingly, if the robot is angry, the voice characteristics may be set to S7, V6, and T5.
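The emotion-to-voice mapping above can be sketched as a simple lookup table. Only the happiness and anger triples appear in the examples in the text; the calm entry here is an illustrative placeholder, not a value from table 1.

```python
# Hedged sketch: emotion -> (speech rate, volume, tone) in the 0~10
# range of the iFly TTS SDK. Happiness and anger values come from the
# worked examples above; the calm triple is an assumed placeholder.
VOICE_PARAMS = {
    "happiness": (8, 6, 7),   # S8, V6, T7 (example in the text)
    "anger":     (7, 6, 5),   # S7, V6, T5 (example in the text)
    "calm":      (5, 5, 5),   # placeholder: neutral mid-range values
}

def voice_for(emotion: str) -> tuple[int, int, int]:
    """Return (speech_rate, volume, tone), falling back to calm."""
    return VOICE_PARAMS.get(emotion, VOICE_PARAMS["calm"])
```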
2.2. Emotion and Nonverbal Behaviors
In real life, nonverbal behavior actually has no fixed pattern. For this preliminary research, however, we mainly consider facial behaviors. Given the limited control points and imperfect design of some humanoid robots, twenty kinds of facial nonverbal behaviors are programmed. These nonverbal cues are usually considered part of intention, reasoning, desire, emotion, and personality; they can also help the robot convey feelings and hidden mental states. Taking head movements as an example, the behavior “Chin down” may indicate humiliation, shame, shyness, refusal, or boredom. In other words, these nonverbal styles add extra channels of communication beyond language.
In addition, we divide the twenty nonverbal behaviors into three groups, listed in table 2: Primary Nonverbal Behaviors (PNB), Secondary Nonverbal Behaviors (SNB), and Basic Interaction Habits (BIH). PNB are related to the six types of emotions above and can express emotion directly. SNB express emotion indirectly; they are implicit emotional behaviors, comprising E+SNB and E−SNB corresponding to positive and negative emotion respectively. BIH were developed to form complex behaviors and are grouped according to the functional morphology described in [10].
When the humanoid robot is engaged in speaking with a human, a direct emotional expression is needed, so a type of PNB must be selected first. Then, to make his performance vivid, SNB and BIH should also be integrated as assistant nonverbal behaviors. When he is calm, however, PNB and SNB are not suitable, so only behaviors in BIH are selected and grouped.
Suppose each behavior has three properties: Time Property (TP), Expressivity Property (EP), and Preference Property (PP). TP indicates the time period needed to perform an action. EP refers to the intensity of its expressive force. PP shows how much the robot likes to perform this behavior, which can reflect his habits.
TP can be measured with the help of a timer in our testing program. When one of the twenty nonverbal behaviors starts, the timer begins; it stops when the behavior finishes. Thus we obtain the time period of the behavior, which is its Time Property. By this method we can obtain the TP of all twenty nonverbal behaviors.
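The timing procedure above amounts to bracketing the behavior's execution with a clock read; a minimal sketch, assuming the behavior is exposed as a callable motion routine:

```python
import time

def measure_tp(behavior) -> float:
    """Measure the Time Property (TP) of a behavior: wall-clock seconds
    from the start of its execution until it finishes."""
    start = time.perf_counter()
    behavior()                      # run the motion routine to completion
    return time.perf_counter() - start
```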
The Expressivity Property is determined mainly by testers. To reduce the subjectivity of any single test, mean values of 30 test results are calculated. We first record the behaviors on video and then show them separately to 30 subjects from our university (13 women and 17 men, aged 23 to 26), who mark the intensity of each behavior's expressive force on a scale from 1 (weak) to 5 (strong). Finally, we analyze these marks to determine the EP of the twenty nonverbal behaviors.
The Preference Property demonstrates our humanoid robot's habits. If the value for one behavior is bigger than for the others, the robot prefers to carry out this behavior. We let PP range from 1 to 5: 1 signifies no preference, and 5 means the robot very much likes to perform the behavior.
On this basis, the TP and EP of every behavior are also listed in table 2. PP will be adjusted in practice.
Classification and properties of nonverbal behaviors
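The per-behavior record described above can be sketched as a small data structure. The category names (PNB/SNB/BIH) come from the classification in the text; the sample values are hypothetical, since table 2 is not reproduced here.

```python
from dataclasses import dataclass

@dataclass
class Behavior:
    """One nonverbal behavior with its three properties."""
    name: str
    category: str   # "PNB", "SNB" (E+ or E-), or "BIH"
    tp: float       # Time Property: duration in seconds
    ep: float       # Expressivity Property: mean rating on a 1-5 scale
    pp: int = 1     # Preference Property: 1 (no preference) to 5

    def as_vector(self) -> tuple[float, float, int]:
        return (self.tp, self.ep, self.pp)

# Hypothetical sample entry (values not taken from table 2).
smile = Behavior("smile", "PNB", tp=1.2, ep=4.1)
```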
3. Integration of Nonverbal Behaviors and Speech
As stated above, nonverbal behaviors and speech have already been linked by emotions. However, the number of possible behavior combinations is large. For instance, if the robot wants to express happiness, the selected PNB is a smile. Suppose that we have
For the problem of matching nonverbal behaviors with speech, some properties of a behavior should be considered when it is selected. In this paper we consider only the three properties stated above, whose values are grouped as a vector p = (TP, EP, PP), where TP, EP, and PP are the Time, Expressivity, and Preference Properties of the behavior respectively.
The computational framework of the genetic algorithm consists of five major operators:
According to table 2, state 1 means that the behavior at the corresponding gene locus is selected, while state 0 means it is not.
In addition, the initial population size,
where the fitness value demonstrates the matching degree of speech contents and behaviors.
Fitness function
The normalized fitness distance (NFD) is a measure of solution convergence. It is analogous to the ratio of the improvement of the average fitness to the improvement of the best fitness in a population. NFD is defined by:
Where,
Where, the values of
The maximum number of offspring that can be generated in a population is equal to its population size
Where, the values of
The mutation operator is carried out on a single chromosome, so
The multi-objective genetic algorithm uses the five operators discussed above to perform an effective search and optimization with both stochastic and heuristic characteristics. The adaptive crossover and mutation techniques control the balance between search-space exploration and search-space exploitation. The effect of this matching method is illustrated in the experiment section.
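The overall loop can be sketched as follows. Since the paper's fitness function and operator parameters are not reproduced above, everything here is an illustrative assumption: a binary chromosome with one fixed locus per behavior, and a fitness that penalizes the gap between the selected behaviors' total duration and the speech duration while lightly rewarding expressivity (EP) and preference (PP).

```python
import random

random.seed(0)  # deterministic for the sketch

# Hypothetical property tables for 8 behaviors (the real system uses 20).
TP = [1.2, 0.8, 1.5, 0.6, 1.0, 0.9, 1.1, 0.7]   # durations in seconds
EP = [4, 3, 5, 2, 3, 4, 2, 3]                   # expressivity, 1-5 scale
PP = [1, 1, 1, 1, 1, 1, 2, 3]                   # preference, 1-5 scale

def fitness(chrom, speech_duration):
    """Assumed weighting: temporal match dominates, EP and PP assist."""
    delta = abs(sum(t for t, g in zip(TP, chrom) if g) - speech_duration)
    expr = sum(e * g for e, g in zip(EP, chrom))
    pref = sum(p * g for p, g in zip(PP, chrom))
    return -delta + 0.1 * expr + 0.1 * pref

def evolve(speech_duration, pop_size=30, generations=60, pc=0.8, pm=0.05):
    n = len(TP)
    pop = [[random.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda c: fitness(c, speech_duration), reverse=True)
        next_pop = pop[:2]                       # elitism: keep best two
        while len(next_pop) < pop_size:
            a, b = random.sample(pop[:10], 2)    # truncation selection
            cut = random.randrange(1, n)         # one-point crossover
            child = a[:cut] + b[cut:] if random.random() < pc else a[:]
            child = [g ^ (random.random() < pm) for g in child]  # bit-flip
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=lambda c: fitness(c, speech_duration))

best = evolve(speech_duration=4.0)  # chromosome best matching a 4 s utterance
```

The fixed gene loci make the search space a simple bit string, so standard one-point crossover and bit-flip mutation apply directly; the adaptive rate control described in the text is omitted from this sketch.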
4. Experiments
4.1. Simulation of the match method
To test the matching method based on the multi-objective genetic algorithm proposed in this paper, speech contents with different numbers of Chinese characters are adopted. With these sentence templates, the humanoid robot performs six times with the six types of emotion: happiness, sadness, anger, surprise, pity, and disgust. In addition, to illustrate the influence of different PP values, contrast tests are also carried out.
4.1.1. Robustness
A good match of speech contents and nonverbal behaviors mainly means that they are synchronous: within certain limits, speech and behaviors begin and stop simultaneously and coordinate with each other. To demonstrate the robustness of the matching method, experiments are carried out on sentence templates with different numbers of Chinese characters. First, supposing that our robot has no habit (PP=1), the match results for sentences with 25 and 35 Chinese characters are shown in figures 3 and 4.
Take figure 3 for example. Figure 3(a) shows that the fitness increases with the generation under the genetic algorithm; the x-coordinate is the generation and the y-coordinate the fitness. For all six types of emotion, the fitness reaches a high level and remains there, which demonstrates that the sentence and behaviors match well. In other words, from a macroscopic view, the three objectives stated above are satisfied by our matching method.

Match result for twenty-five words (PP=1)

Match result for thirty-five words (PP=1)
On the microscopic level, for instance Δ,
Second, supposing that our robot likes to perform BIH 12 and 13 and prefers BIH 13 (PP12=2, PP13=3), the match results for sentences with 25 and 35 Chinese characters are shown in figures 5 and 6.
Figure 5(a) shows that the fitness increases with the generation when PP12=2 and PP13=3, which means the three objectives stated above are again satisfied, with good effect. The matching method we propose is thus robust to various sentence lengths and to the robot's preferred behaviors.

Match result for twenty-five words (PP12=2, PP13=3)

Match result for thirty-five words (PP12=2, PP13=3)

Variation of the maximum Δ, the minimum Δ, and the average Δ
Finally, we examine the variation of the maximum Δ, the minimum Δ, and the average Δ to show robustness in detail. Their variation tendencies are shown in figure 7.
In figure 7, on the x-coordinate, 1 corresponds to the condition of figure 3, i.e., speech contents with 25 Chinese characters and PP=1. Similarly, 2, 3, and 4 correspond to the conditions of figures 5, 4, and 6 respectively.
On the whole, no matter how the number of Chinese characters and PP change, the variations of the maximum Δ, the minimum Δ, and the average Δ remain within a small range. These match errors between speech contents and nonverbal behaviors are almost imperceptible in practice, so the method is robust to changing conditions.
Note the decline between conditions 2 and 3. The speech duration of a sentence with 25 Chinese characters is obviously shorter than that of one with 35 characters. Because the TP of every behavior in table 2 is fixed, far fewer behavior combinations are available for a shorter speech duration; in other words, it is harder to match shorter speech contents with behaviors, so Δ is bigger.
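The match error discussed above reduces to the gap between the selected behaviors' total duration and the speech duration; a minimal sketch, assuming Δ is that absolute gap:

```python
# Sketch of the match error Δ: the absolute gap between the summed
# durations (TP) of the selected behaviors and the speech duration.
def match_error(selected_tps: list[float], speech_duration: float) -> float:
    return abs(sum(selected_tps) - speech_duration)

# A shorter utterance leaves fewer feasible behavior combinations,
# so its best achievable Δ tends to be larger.
short_delta = match_error([1.2, 1.5], 3.0)   # Δ ≈ 0.3
```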
In sum, the method for matching speech contents and nonverbal behaviors based on the multi-objective genetic algorithm with fixed gene loci has good performance.
4.1.2. Effect of Preference Property
The Preference Property indicates that the robot prefers certain behaviors. When PP equals 1, he selects behaviors randomly and has no habit; when the value for a certain behavior increases, the robot tends to adopt that behavior. Here we suppose two conditions: in one, the PP values of all behaviors equal 1 (PP=1); in the other, the PP values of BIH-12 and BIH-13 equal 2 and 3 respectively (PP12=2, PP13=3). The effects of changing the Preference Property are compared in figures 8 and 9 and analyzed below.

Comparison of selected behaviors for twenty-five words
When the robot speaks with emotion, a direct emotional expression is needed, so a type of PNB must be selected. From figure 8 we can see that PNB1-6 are all selected and do not change with changed PP, which is reasonable. Although BIH12 is selected under condition 2, the total number of BIH12 and BIH13 selections under both conditions is nearly unchanged. This is mainly because the speech duration of a sentence with 25 Chinese characters is short, so fewer optional behaviors fit; the objective of temporal matching must be satisfied first, which makes the effect of changed PP less obvious.

Comparison of selected behaviors for thirty-five words
In figure 9, however, the effect of changed PP is obvious: the total number of BIH12 and BIH13 selections under condition 2 is higher than under condition 1. Changing PP thus results in a different behavior selection for the same speech contents, which demonstrates our robot's habit.
To sum up, the matching method we propose satisfies all three goals stated above and is robust under different conditions. We will now apply this method to storytelling in practice; the actual results are recorded and analyzed in detail.
4.2. Telling a Story in Practice
4.2.1. Scenario Description
To illustrate the effects in practice, we randomly select a story from Aesop's Fables, The Wolf and the Lamb (in Chinese), and apply the proposed approach. Storyboard examples are shown in figure 10.

Storyboard examples
Given the specific characteristics of the humanoid robot, we derive and differentiate the dialogues, descriptions, and action data when conceiving the storyboard, annotate the story scripts, and consider the robot's current emotion status, as listed in table 3.
Scenario of the fable
Each sentence in the story corresponds to a type of emotion status. For example, before the robot speaks Sentence 3, some preparation is done. First, he looks up the emotion status in the storyboard; the mark Anger means this emotion status should accompany the sentence. Then the matching of speech contents and nonverbal behaviors is carried out, which makes the robot speak and act simultaneously with the current emotion. In this way Sentence 3 of the fable is expressed, and the robot immediately prepares the next sentence. In table 3, the mark “-” means retaining the current emotion.
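The per-sentence procedure above can be sketched as a loop over annotated sentences. The function names (`match_behaviors`, `speak_and_act`) are hypothetical placeholders for the matching step and the synchronized output step, and the sentences are stand-ins for the storyboard entries in table 3.

```python
# Sketch of the storytelling loop: each sentence carries an emotion
# annotation; "-" means keep the current emotion (as in table 3).
storyboard = [
    ("Sentence 1 ...", "calm"),
    ("Sentence 2 ...", "-"),
    ("Sentence 3 ...", "anger"),
]

def run_story(storyboard, match_behaviors, speak_and_act):
    emotion = "calm"
    for sentence, mark in storyboard:
        if mark != "-":
            emotion = mark          # update only on an explicit annotation
        behaviors = match_behaviors(sentence, emotion)
        speak_and_act(sentence, behaviors, emotion)
```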
4.2.2. Result analysis
Experiments are carried out according to the storyboard designed above. The PNB behaviors frown, surprise, and pity are selected because of the emotion status, and other behaviors in SNB or BIH are selected to supplement the behavior group. With our objectives satisfied, the storytelling process is vivid and fluent.
Experiment process in detail
Moreover, we record the experimental data shown in table 4 to reveal the process and obtain more information about our approach. The data in the table include Speech rate (S), Volume (V), Tone (T), the selected behaviors, Behavior Duration (BD), Word Count (WC) in a sentence, Sentence Duration (SD), Δ,
In table 4 the selected behaviors are recorded in detail. We can calculate BD and
Looking at the values in the Δ column, every one is smaller than 1, which means people cannot perceive the time difference between the nonverbal behaviors and the speech; this is good enough for the behavior-speech match. Similarly,
5. Conclusions
In this paper, a matching method for speech contents and nonverbal behaviors based on a multi-objective genetic algorithm is devised for use in human-robot interaction and cooperation. We have examined the relationships among emotion, voice characteristics, and nonverbal behaviors, and devised this matching method.
The focus of the present research is how to construct a hierarchical structure linking speech and behaviors on the robot. From our point of view, this problem is solved: behaviors including PNB, SNB, and BIH can be matched well with emotional speech, allowing the robot to exhibit intelligent-like ability, flexibility, and styled performance. Both in simulation experiments and in practice the results are good; the proposed method is effective and robust.
Future work includes two aspects. Because the system is still limited by the variety of the nonverbal behavior database, we will first extend it to give our robot richer expressive power. We will then take on the challenge of applying the same approach to an alphabet-based language such as English. Further experimental evaluations will also have to be carried out.
