Abstract
Introduction
Recently, Internet of Things (IoT) simulations have become important in many aspects of human life, not just in entertainment but also in medicine, education, the military, and training. The IoT consists of various devices that connect to each other and transmit huge amounts of data. Through simulation, the IoT is beneficial for finding solutions to potential problems, as well as for studying major effects, in autonomous cars, traffic control, and public transportation. The IoT requires algorithms that can extract knowledge and learn in real time from data collected from various resources such as temperature, traffic, and health. Algorithms frequently applied to data analysis and IoT cases, including machine learning and deep learning, provide the ability to learn without explicit programming. These algorithms can be categorized into
Motivations are the reasons for agents to act or behave in certain ways,7–11 affecting their perception, adaptability, and behavior. The topic of motivation in humans and animals has been studied in many different fields, including biology and sociology. Helbing et al. 11 discuss human behavior using models and data from crowd disasters and crimes for the purpose of saving human lives. Capraro and Perc 12 study cooperative behavior in human sociality for determining the ability and motivation to engage with others in social dilemma games. Scott 13 defines the concept of human motivation as the desire to do something. A “plan” is a series of steps in a hierarchical task network (HTN). 14 An HTN takes a problem as input and supplies a plan or set of tasks, such as traditional tasks that roughly correspond to actions, compound tasks, and goal tasks. 15
In this article, we combine artificial intelligence and a motivation method for IoT simulation applications. This research can be applied to many IoT systems, such as three-dimensional (3D) city simulation or the training of autonomous vehicles. In detail, this article proposes a motivation-based framework that makes agents behave naturally, which reduces unpredictable situations. The primary objective of this article is to build agent simulations. This article uses a Q-Network-based motivation (QNM) framework to have agents select motivations by considering their 3D virtual environments. The QNM framework does not require training data, which is difficult to collect. In addition, the proposed framework can be used to generate big data on animal behaviors for other training systems.
By combining the key techniques of the above methods, this article proposes a QNM framework to address the task of simulating agents that select motivations and goals and execute actions in the 3D virtual environment based on the selected goals. HTN planning is integrated into the QNM framework, acting as a planner for selecting actions based on the goals chosen after motivation selection and desire extraction. Agents, including humans and animals, can adapt and react in the 3D virtual environment, selecting actions based on the situations that arise in it.
This article is organized as follows. Section “Related work” describes related works and explains some of their disadvantages. Section “QNM system” describes the architecture of the proposed framework for agent simulations. Section “Experiments and results” presents experimental results using the proposed framework, and additionally implements experiments that apply a traditional Q-learning method rather than the Q-Network; both are used to verify the proposed framework by comparing the Q-Network against traditional Q-learning. Section “Discussion” discusses the QNM framework in light of the results. Conclusions are given in section “Conclusion.”
Related work
Motivation approaches
Motivation approaches enable agents to achieve different goals based on their motivations. These approaches can create new goals for exploring the environment more efficiently. Song et al. 16 propose motivation-based HTN planning that involves defining a list of motivations ordered by priority. At runtime, motivations are first chosen by the probability distribution of internal and external motivations and are then updated continuously. Goals and actions are generated using HTN planning. Although their method produces satisfactory results when simulating a single agent, choosing sudden, random events is not satisfactory for simulating agents in situations where unpredictable events occur, such as weather or interactions between agents.
De Sevin and Thalmann 17 propose a motivational model of action selection. The actions of the virtual agents are chosen by motivations, and each motivation is measured by a variable with a value between 0 and 100. This method chooses the motivation with the highest value and ignores the motivations with lower values. Motivations are defined for each agent to give the agents individuality. However, how an appropriate action is chosen remains a problem, as the action needs to satisfy the goals or situations arising from the environment; this model cannot solve those problems.
Graham et al. 18 propose a method of combining motivated learning with goal creation. They present the idea of motivation as the underlying force of a cognitive agent’s operations by combining goal creation. Motivation is determined by needs. Each agent begins with one or more needs, such as the need for money or food. Agents develop their own goals and motivations. The motivation and goal creation control the agents’ motivation and behavior. The results of this method indicate that it performs efficiently in environments without unpredictable situations.
Graham et al. 18 propose a method that considers all the needs of the agents. However, it differs from Song et al. 16 and De Sevin and Thalmann 17 in that it is only able to manage planned situations, without considering environmental information such as the locations of agents, weather events, and interactions between agents. In contrast, the framework proposed in this article can be applied to environments where such events or situations occur often; the problems of the previous approaches can be solved by choosing appropriate motivations based on events and situations.
Q-learning approaches
In recent years, reinforcement learning (RL)19,20 has played a key role in solving this problem, teaching agents to make appropriate decisions rather than relying solely on predetermined, programmed behaviors. The agents need to learn from their environments by taking actions, receiving rewards, and adapting their action-selection policy based on past, present, and future rewards. Value-based approaches attempt to learn the expected value of each state with the purpose of arriving at an optimal action-selection policy. A state is all the information used to make a decision leading to a new state. A policy is denoted as
Q-learning is one type of algorithm used to calculate state-action values. It is an off-policy method: it does not depend on the current policy to update the value function. Off-policy methods are advantageous because they can follow one policy while learning another. Q-learning belongs to the class of temporal difference (TD) algorithms, meaning that the time differences between taken actions and received rewards are involved. TD algorithms combine the computation of an optimal policy, using a given model of a Markov Decision Process (MDP), with Monte Carlo (MC) methods, which do not require complete knowledge of the environment. They require only experiences, as state-action pairs with their rewards, and make an update after every taken action: making a prediction based on previous experiences, taking an action based on that prediction, and updating the prediction after receiving a reward. In this article, a table that stores Q-values for every possible state-action pair is iteratively updated during training, as in MC methods. The policy
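As a concrete illustration of the TD update described above, the following is a minimal tabular Q-learning sketch; the state and action names are invented for illustration, not taken from the article:

```python
from collections import defaultdict

def q_learning_update(Q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9):
    """Off-policy TD update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])

# Q-values live in a table keyed by (state, action) and are updated
# after every taken action, without waiting for the episode to end.
Q = defaultdict(float)
actions = ["eat", "drink", "flee"]
q_learning_update(Q, "hungry", "eat", 1.0, "satisfied", actions)
```

Because the maximum over next-state actions is taken regardless of which action the behavior policy actually picks next, the update learns the greedy policy while following an exploratory one, which is the off-policy property noted above.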
More recently, deep RL has been used extensively in research, allowing agents to perform well on games such as Flappy Bird, Go, and various Atari games, using just the raw pixel data to make decisions.21–23 Mnih et al. 23 train agents to play Atari games using deep RL, which the agents do with a skill level that exceeds that of human expertise in many games. Here, the deep RL acts as an approximate function to describe Q-values in Q-learning, which uses the input of raw pixels and the output of a function estimating a future reward. In addition, the authors present techniques for improving training effectiveness and stability, using an experience replay mechanism to randomly sample previous transitions, thus smoothing the training distribution on past behaviors.
Kempka et al. 24 implement the Doom video game as a simulated environment for RL. It has been used as a platform for the study of reward shaping, curriculum learning, and predictive planning. In addition, the open-world game Minecraft also provides a platform for exploring RL and artificial intelligence (AI). Because of the nature of the simulation, some studies on curriculum learning and hierarchical planning have been conducted on this platform (Tessler et al., 25 Matiisen et al., 26 and Oh et al. 27 ).
Juliani et al. 28 implement a toolkit that leverages the Unity platform for creating simulation environments. Unity enables the development of learning environments with sensory and physical complexity and also supports multi-agent settings. Hessel et al.29,30 combine all of the deep RL improvements of previous studies to provide state-of-the-art performance on the Atari 2600 benchmarks, surpassing human-level performance. In contrast to the Q-learning with experience replay used in the proposed framework, traditional Q-learning 20 with online learning updates the value function immediately from the current transition instead of storing it in a replay buffer.
Motivation approaches are combined with RL to identify how motivations can be satisfied using RL. Recent research has attempted to combine RL with motivation to simulate agents. Merrick and Maher 31 propose motivated RL agents, which can explore their environment and learn pre-defined behaviors, such as moving to a chair or a house. However, these agents lack the ability to adapt to and experience their surroundings, leaving them unable to avoid dangerous situations, weather events, and so on.
In the proposed framework, the neural network plays the role of the Q-function. The neural network has many parameters, known as weights. Therefore, instead of iteratively updating values in a lookup table as MC methods do, the proposed framework updates the weights of the neural network so that it learns to provide better estimates of state-action values. In addition, the proposed framework uses gradient descent to train the Q neural network, as in any other neural network.
The neural network is implemented in the same way as the Atari playing algorithm by Google DeepMind.21–23 The network accepts a state and outputs a separate Q-value for each possible action in its output layer. Q-learning needs the maximum Q-value over the possible actions in the new state. The key feature that the proposed framework takes from deep RL21–23 is experience replay, which essentially replaces single-transition online updates with minibatch updates. Transitions sampled from the replay buffer are used as a labeled data set for learning. In addition, the proposed framework uses both the current transition and transitions from the replay buffer to update the value function at every time step.
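A minimal sketch of such a replay buffer, assuming a simple tuple format for transitions; the capacity and field layout are illustrative, not the article's implementation:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity memory of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions drop out first

    def store(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation between consecutive
        # transitions and smooths the training distribution over past behavior.
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

buffer = ReplayBuffer(capacity=100)
for t in range(5):
    buffer.store((t, "act", 0.0, t + 1, False))
minibatch = buffer.sample(3)  # used like a small labeled data set for the Q-Network
```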
In this article, a new framework is presented for simulating agents in a 3D virtual environment. Agents are simulated to act and adapt to various situations based on their pre-defined desires. In addition, the agents are capable of avoiding dangerous situations, events, and experiences in their 3D virtual environment.
QNM system
This section presents an overview of a QNM framework and then describes how to utilize the Q-Network for motivation selections.
System overview
This section presents an overview of the proposed framework that addresses tasks to develop agents capable of deciding motivations based on their desires, selecting goals based on the decided motivations, and performing actions based on the selected goals. First, the proposed framework uses the Q-Network framework to select goals based on motivations. Second, it uses HTN planning to allow agents to select actions to accomplish their goals.
Terms and notations
The term “agent” refers to anything that makes decisions by selecting and executing actions in its environment. The agents in this article are elements simulated in the 3D virtual environment, such as humans and animals.
In RL 20 and in deep RL, 23 the authors define the terms “state,” “terminal state,” “reward,” “transition,” and “replay buffer.” The term “state” refers to information about the environment, such as the locations of buildings, food, and vegetation; that is, everything the agent can interact with directly or indirectly. A change in the environment is known as a “state-transition,” occurring whenever the agent performs an action. After each action, the reinforcement function evaluates the state-transition caused by the action and assigns a “reward.” Every episode starts from an initial state and ends in a “terminal state.” A deep RL algorithm uses a deep neural network as a function approximator for learning. RL tasks have no pre-generated training sets to learn from, so the agent must keep records of all the state-transitions it collects and learn from them later; these records are stored in a memory buffer, referred to as experience replay or a replay buffer. Figure 1 denotes the relationships between states, desires, motivations, goals, and actions in the proposed framework.

Relationships between states, desires, motivations, goals, and actions.
The term “desire” describes a need that causes an agent to act. The desires in this article are variables that invoke decision-making processes leading to motivations, and they are utilized for selecting motivations. The agents in this article have pre-defined desires, such as hunger, thirst, and fear. Each desire is defined by
The desires should be pre-defined, given that motivations, goals, and actions are defined in this article by considering desires. The term “motivation” refers to the reason that agents act or behave in certain ways.7–10 In this article, the motivations are used as heuristic variables in the selection of goals. Motivations such as eating food, drinking water, and avoiding objects are selected by consideration of all desires and states. Each motivation is denoted by
The QNM framework has the limitation that each goal can select only one object at a time, even though there are potentially multiple objects for each goal. If an object has no location, the QNM framework does not operate on it.
The structure of the proposed framework
The QNM framework, as shown in Figure 2, is divided into a motivation layer, an intention layer, and an execution layer, for determining motivations, goals, and actions, respectively. The QNM framework receives state

QNM framework.
Motivation layer
First, desires are calculated from the state
where
Intention layer
The goal selector of the intention layer selects a goal based on the received motivation
Execution layer
In this layer, the selected goal is performed by utilizing HTN planning. A list of actions
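The goal-to-action decomposition performed by HTN planning in this layer can be sketched as follows; the task names, method definitions, and distance logic are illustrative assumptions, not the article's actual planning domain:

```python
# A minimal HTN-style planner: compound tasks decompose into subtasks via
# methods evaluated against the current state; primitive tasks become actions.
PRIMITIVES = {"move_to_lake", "look_around", "drink_water"}

METHODS = {
    # The "drink" goal decomposes differently depending on distance to water.
    "drink": lambda state: (["move_to_lake", "drink"]
                            if state["dist_to_lake"] > 1
                            else ["look_around", "drink_water"]),
}

def plan(task, state):
    """Recursively decompose `task` into a flat list of primitive actions."""
    if task in PRIMITIVES:
        if task == "move_to_lake":
            state["dist_to_lake"] -= 1  # moving reduces the remaining distance
        return [task]
    subplan = []
    for subtask in METHODS[task](state):
        subplan += plan(subtask, state)
    return subplan

plan("drink", {"dist_to_lake": 2})  # → ['move_to_lake', 'look_around', 'drink_water']
```

The recursive decomposition mirrors the HTN idea from the related-work section: compound tasks expand until only primitives (actions roughly corresponding to executable behaviors) remain.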
Q-network for motivation selections
The Q-Network operates as a function approximator to map a desire
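The article implements its network in Keras; the following framework-free NumPy sketch only illustrates the general idea of a Q-Network mapping a desire vector to one Q-value per motivation, trained by gradient descent. The layer sizes, learning rate, and example desire vector are assumptions, not the article's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Small MLP: a desire vector in, one Q-value per motivation out.
n_desires, n_hidden, n_motivations = 6, 16, 5
W1 = rng.normal(0.0, 0.1, (n_desires, n_hidden))
W2 = rng.normal(0.0, 0.1, (n_hidden, n_motivations))

def q_values(desires):
    h = np.maximum(0.0, desires @ W1)  # ReLU hidden layer
    return h @ W2                      # linear output: one Q-value per motivation

def train_step(desires, target_q, lr=0.01):
    """One gradient-descent step on the squared error against target Q-values."""
    global W1, W2
    h = np.maximum(0.0, desires @ W1)
    err = h @ W2 - target_q            # dLoss/dq for 0.5 * ||q - target||^2
    dh = (err @ W2.T) * (h > 0)        # backprop through ReLU (uses pre-update W2)
    W2 -= lr * np.outer(h, err)
    W1 -= lr * np.outer(desires, dh)

desires = np.array([0.8, 0.1, 0.0, 0.3, 0.0, 0.5])
motivation = int(np.argmax(q_values(desires)))  # greedy motivation selection
```

The `argmax` at the end corresponds to selecting the motivation with the highest Q-value, which is then passed to the intention layer.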

The network structure for motivation selections.
As a result, the motivation with the highest Q-value is selected for the intention layer. After calculating all desires, the desires are transformed into desire

Representing thresholds of each desire.
Figure 4 presents the representation of the thresholds for each desire. Each desire has a distinct minimal threshold
Here, three cases are considered. First, when the desire

The current desire is below the minimal threshold.
If the desire
Second, when desire

The current desire is above the maximal threshold.
Finally, when desire
Each desire must be maintained within these thresholds
The main objective of this study is to ensure all the desires stay as close as possible within the thresholds
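One way to express the three threshold cases as a learning signal is sketched below; the specific reward values are assumptions, not the article's equations. The only idea taken from the text is that staying inside the thresholds is good and deviation is penalized:

```python
def desire_reward(desire, t_min, t_max):
    """Reward shaping over the three threshold cases; desires range over [0, 100]."""
    if desire < t_min:          # case 1: below the minimal threshold
        return -(t_min - desire) / 100.0
    if desire > t_max:          # case 2: above the maximal threshold
        return -(desire - t_max) / 100.0
    return 1.0                  # case 3: within the thresholds

# Hypothetical thresholds for a "hunger" desire:
rewards = [desire_reward(d, 20, 80) for d in (10, 50, 95)]  # → [-0.1, 1.0, -0.15]
```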
Desire calculations
To calculate a desire such as that in equation (1), regarding “fear,” it is necessary to check for dangerous situations involving animals, humans, and vehicles.
This section describes in more detail how the proposed framework estimates and calculates the desires of each agent; for instance, the desire calculation for an animal's fear. As the proposed framework defines the range of each desire as being between 0 and 100, increases or decreases in a desire depend on the events and factors occurring around the agent. In the first case, “fear” increases in association with events such as the following: (a) dangerous animals, humans, or nearby vehicles could endanger the animal, or the animal is (b) hit or (c) hurt by vehicles, humans, or other animals. In the other cases, “fear” decreases with the distance from the dangerous situation, or (d) due to actions the animal takes to avoid danger, such as running away or going to a safe place. If the animal cannot avoid the danger, (e) it must fight when the situation becomes dangerous. Events (a)–(e) are transformed into equations (2)–(6), respectively, as below
where
As mentioned earlier regarding equation (1), the calculation of “fear” desire
where
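Since equations (2)–(6) are not reproduced here, the following sketch only illustrates the shape of such an event-driven fear update; every coefficient, the decay term, and the clamping scheme are assumptions, not the article's formulas:

```python
def update_fear(fear, nearby_danger, was_hit, was_hurt,
                dist_to_danger, is_avoiding, is_fighting):
    """Illustrative update for the "fear" desire, loosely following events (a)-(e)."""
    if nearby_danger:                  # (a) dangerous animal/human/vehicle nearby
        fear += 10.0
    if was_hit:                        # (b) hit by a vehicle, human, or animal
        fear += 20.0
    if was_hurt:                       # (c) hurt in the encounter
        fear += 15.0
    if is_avoiding or is_fighting:     # (d)/(e) running away, reaching safety, fighting
        fear -= 5.0
    fear -= min(dist_to_danger, 50.0) * 0.1  # farther from danger -> fear decays
    return max(0.0, min(100.0, fear))  # desires are kept within [0, 100]

f = update_fear(40.0, True, False, False, 0.0, False, False)  # → 50.0
```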
Experiments and results
This section explains the goals and design of the proposed framework. The results and discussion are then described.
Experiment goals
The goal of the experiments is to verify the proposed framework by applying it in 3D virtual environments. In addition, traditional Q-learning 20 was applied for comparison with the proposed framework, and the two methods were compared on their accuracy. Furthermore, the results of applying both frameworks in 3D virtual environments were also compared.
Experiment design
The QNM framework provides Python APIs to interact with a 3D virtual environment. The 3D virtual environment and the QNM framework communicate directly, while the QNM system receives feedback from the 3D virtual environment. The QNM framework is proposed for simulating multiple agents in virtual environments, including humans, vehicles, and animals. Over the past few years, humans and vehicles have been handled in many studies. However, there is little research on animal simulation, as it is not easy to apply algorithms to it. Therefore, we focus on animals to verify and evaluate our QNM framework and demonstrate its performance.
The QNM algorithm was implemented in Keras 32 and PyCharm-Community-2018.1.2, and the QNM models were trained on a PC comprising an Intel i7-6700 @ 3.40 GHz CPU, an Nvidia GeForce GTX 1060 GPU, and 16 GB RAM. The proposed framework used the RMSprop optimizer with a learning rate of 0.001 and
For learning the motivation-value for the taken motivation, a random sample of past experiences was used for training. All of the experiments were performed in Unity 3D, which provides 3D virtual environments for simulating agents such as animals. More than 100 animal agents of nine types were defined, and the desires, motivations, goals, objects, and actions were pre-defined for each animal agent. The minimum and maximum thresholds of the desires differ for each animal agent, but the desires, motivations, goals, objects, and actions are the same, as shown in Tables 1 and 2, respectively. Table 3 shows the list of goals, and Table 4 shows the list of actions.
Desires and their thresholds.
List of motivations.
List of goals.
List of actions.
Figure 7(a) shows an example of choosing actions by considering the distance between an agent (“me”) and its selected goal, “lake.” Figure 7(b)–(i) show further examples of goals and actions. In Figure 7, the parameter

The result of actions taken by selecting the goal: (a) HTN planning for a drinking goal; (b) HTN planning for an eating goal; (c) HTN planning for a relaxing goal; (d) HTN planning for an avoiding goal; (e) HTN planning for a urinating goal; (f) HTN planning for a fighting goal; (g) HTN planning for the following goal; (h) HTN planning for a wandering goal; and (i) HTN planning for a playing goal.
Experiment results
In the first experiment, the proposed framework was applied using the Q-Network. Figure 8 shows the accuracy results after training the QNM and traditional Q-learning models. The red line represents the results of the QNM, which achieved an average accuracy of 85.5% over 1000 time-steps. The blue line shows that the average accuracy of traditional Q-learning was 69.92%.

Accuracy of the QNM and traditional Q-learning.
Figure 9 shows the loss results after training the QNM and traditional Q-learning models. The red line shows that the average loss of the Q-Network models was approximately 8.3%, and the blue line shows that the loss of traditional Q-learning was 2.21%. Figure 10 shows screenshots taken after selecting and executing a list of actions using the proposed framework in 3D virtual simulations, showing the actions and motivation of a chicken. Figure 10(a) shows the chicken walking toward a road, while Figure 10(b) shows the chicken avoiding cars.

Loss of the QNM and traditional Q-learning.

Results of QNM selections: (a) The chicken walking toward a road and (b) The chicken avoiding cars.
Figure 11 shows the changing of desires by the selection of motivations. In addition, Figure 12 shows the changing of desires by selecting the motivations using the traditional Q-learning.

Result of changing desires by selecting motivations using the QNM.

Result of changing desires by selecting motivations using the traditional Q-learning.
The changing desires of the two frameworks, the QNM and traditional Q-learning, are shown in Figures 11 and 12, respectively. Figure 11 shows some desires changing over time depending on their thresholds. For example, a desire such as hunger, thirst, toileting, or aggression would change when it exceeded its minimal or maximal threshold, which was the expected result of the proposed framework. However, Figure 12 shows only the thirst desire changing, while the others increased over time; traditional Q-learning produced a similarly poor result. Comparing the two figures, with the same initial desires applied to both, the QNM framework shows better results than the traditional framework.
Figure 13 shows the goals determined by the selected motivations. For example, when the thirst motivation was selected, the goal “Going to drink from a stream” was selected, and a list of actions was executed based on HTN planning.

Result of goals chosen by selected motivations using the QNM.
In addition, Figure 13 shows the motivations changing over time. First, the agent had the “following” motivation, and a goal of following a vehicle was selected. After finishing that goal, the urinating motivation was chosen, and the agent needed to go urinate at a lake. At the time
Discussion
The proposed QNM framework is designed for simulating the actions of a single agent while considering its relationships with the multiple agents around it. Each agent can interact with other agents in dynamic environments, for example by fighting or avoiding them. Therefore, multiple agents can be simulated naturally in virtual environments. To simulate large groups of agents such as crowds, we plan to extend our framework in the near future.
Our QNM framework consists of three layers. In the motivation layer, the desire functions and desire collector involve only addition operators, so their performance is fast. The motivation selector contains some improvements to the Q-Network for suitable desire input; however, these changes do not make the original Q-Network more complex. The intention layer also consists of simple operators such as addition, subtraction, and logic, so its computational cost is small. In the execution layer, we apply the original HTN planner without modification. This algorithm incurs a high computational cost if thousands of agents run at the same time. In the near future, we plan to enhance the HTN algorithm to reduce its computational complexity.
Conclusion
In this article, the QNM framework was proposed to better define the desires of virtual agents, helping them act more realistically in 3D virtual IoT simulations. The desires were calculated based on states that contain the environment information. A Q-Network played a key role in selecting motivations by considering desires. Goals were then selected based on the selected motivations, and HTN planning was applied to select actions for the determined goals. In the experiments, the results of changing desires with the QNM were better than those with traditional Q-learning, with the QNM improving accuracy by 15.58%. In future research, adjustments and extensions of the QNM should be made to increase the ability of agents to adapt to more challenging situations. In addition, comparisons between the QNM framework and other methods should also be implemented.
