Abstract
Introduction
Recently, Internet of Things (IoT) simulations have become important in many aspects of human life, not just in entertainment but also in medicine, education, the military, and training. The IoT consists of various devices that connect to each other and transmit huge amounts of data. Through simulation, the IoT is beneficial for finding solutions to potential problems, as well as for studying major effects, in autonomous cars, traffic control, and public transportation. The IoT requires algorithms that can extract knowledge and learn in real time from data collected from various resources such as temperature, traffic, and health. Algorithms frequently applied to data analysis and IoT cases, including machine learning and deep learning, provide the ability to learn without explicit programming. These algorithms can be categorized into
Motivations are the reasons for agents to act or behave in certain ways,7–11 affecting their perception, adaptability, and behavior. The topic of motivation in humans and animals has been studied in many different fields, including biology and sociology. Helbing et al. 11 discuss human behavior using models and data from crowd disasters and crimes for the purpose of saving human lives. Capraro and Perc 12 study cooperative behavior in human sociality for determining the ability and motivation to engage with others in social dilemma games. Scott 13 defines the concept of human motivation as the desire to do something. A “plan” is a series of steps in a hierarchical task network (HTN). 14 An HTN takes a problem as input and supplies a plan or set of tasks, such as traditional tasks that roughly correspond to actions, compound tasks, and goal tasks. 15
In this article, we combine artificial intelligence and a motivation method for IoT simulation applications. This research can be applied to many IoT systems, such as three-dimensional (3D) city simulation or the training of autonomous vehicles. In detail, this article proposes a motivation-based framework that makes agents behave naturally, which reduces unpredictable situations. The primary objective of this article is to build agent simulations. This article uses a Q-Network-based motivation (QNM) framework to have agents select motivations by considering their 3D virtual environments. The QNM framework does not require training data, which is difficult to collect. In addition, the proposed framework can be used to generate big data on animal behaviors for other training systems.
By combining the key techniques of the above methods, this article proposes a QNM framework to address the task of simulating agents that select motivations and goals and execute actions in the 3D virtual environment based on the selected goals. HTN planning is integrated into the QNM framework, acting as a planner for selecting actions based on the goals chosen after motivation selection and desire extraction. Agents, including humans and animals, can adapt and react in the 3D virtual environment, selecting actions based on the situations that arise in it.
This article is organized as follows. Section “Related work” describes related works and explains some of their disadvantages. Section “QNM system” describes the architecture of the proposed framework for agent simulations. Section “Experiments and results” presents experimental results using the proposed framework, and additionally implements experiments that apply a traditional Q-learning method rather than the Q-Network; both are used to verify the proposed framework by comparing the Q-Network against traditional Q-learning. Section “Discussion” discusses the QNM framework in light of the results. Conclusions are given in section “Conclusion.”
Related work
Motivation approaches
Motivation approaches enable agents to achieve different goals based on their motivations. These approaches can create new goals for exploring the environment more efficiently. Song et al. 16 propose motivation-based HTN planning that involves defining a list of motivations ordered by priority. At runtime, motivations are first chosen by the probability distribution of internal and external motivations and are then updated continuously. Goals and actions are generated using HTN planning. Although their method produces satisfactory results when simulating a single agent, choosing sudden, random events is not satisfactory for simulating agents in situations where unpredictable events occur, such as weather or interactions between agents.
De Sevin and Thalmann 17 propose a motivational model of action selection. The actions of the virtual agents are chosen by motivations, and each motivation is measured by a variable with a value between 0 and 100. This method chooses the motivation with the highest value and ignores the motivations with lower values. Motivations are defined for each agent to give the agents individuality. However, how an appropriate action is chosen remains a problem, as the action needs to satisfy the goals or situations arising from the environment; this model cannot solve those problems.
Graham et al. 18 propose a method of combining motivated learning with goal creation. They present the idea of motivation as the underlying force of a cognitive agent’s operations by combining goal creation. Motivation is determined by needs. Each agent begins with one or more needs, such as the need for money or food. Agents develop their own goals and motivations. The motivation and goal creation control the agents’ motivation and behavior. The results of this method indicate that it performs efficiently in environments without unpredictable situations.
Graham et al. 18 propose a method that considers all the needs of the agents. However, it differs from Song et al. 16 and De Sevin and Thalmann 17 in that it is only able to manage planned situations, without considering environmental information such as the locations of agents, weather events, and interactions between agents. In contrast, the framework proposed in this article can be applied to environments where such events or situations occur often; the problems of the previous approaches can be solved by choosing appropriate motivations based on events and situations.
Q-learning approaches
In recent years, reinforcement learning (RL)19,20 has played a key role in solving this problem, teaching agents to make appropriate decisions rather than relying solely on predetermined, programmed behaviors. The agents need to learn from their environments by taking actions, receiving rewards, and adapting their action-selection policy based on past, present, and future rewards. Value-based approaches attempt to learn the expected value of each state with the purpose of arriving at an optimal action-selection policy. A state is all the information used to make a decision leading to a new state. A policy is denoted as
Q-learning is one type of algorithm used to calculate state-action values. It is an off-policy method: it does not depend on the current policy to update the value function. Off-policy methods are advantageous because they can follow one policy while learning another. Q-learning belongs to the class of temporal difference (TD) algorithms, meaning that the time differences between taken actions and received rewards are involved. TD algorithms combine the computation of an optimal policy, using a given model of a Markov Decision Process (MDP), with Monte Carlo (MC) methods, which do not require complete knowledge of the environment. They require only experiences, as state-action pairs with their rewards, and make an update after every taken action: making a prediction based on previous experiences, taking an action based on that prediction, and updating the prediction after receiving a reward. In this article, a table that stores Q-values for every possible state-action pair is iteratively updated during training, as in MC methods. The policy
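As a concrete illustration of the TD update described above, the following is a minimal tabular Q-learning sketch; the state and action names are invented for illustration, not taken from the article:

```python
from collections import defaultdict

def q_learning_update(Q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9):
    """Off-policy TD update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])

# Q-values live in a table keyed by (state, action) and are updated
# after every taken action, without waiting for the episode to end.
Q = defaultdict(float)
actions = ["eat", "drink", "flee"]
q_learning_update(Q, "hungry", "eat", 1.0, "satisfied", actions)
```

Because the maximum over next-state actions is taken regardless of which action the behavior policy actually picks next, the update learns the greedy policy while following an exploratory one, which is the off-policy property noted above.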
More recently, deep RL has been used extensively in research, allowing agents to perform well on games such as Flappy Bird, Go, and various Atari games, using just the raw pixel data to make decisions.21–23 Mnih et al. 23 train agents to play Atari games using deep RL, which the agents do with a skill level that exceeds that of human expertise in many games. Here, the deep RL acts as an approximate function to describe Q-values in Q-learning, which uses the input of raw pixels and the output of a function estimating a future reward. In addition, the authors present techniques for improving training effectiveness and stability, using an experience replay mechanism to randomly sample previous transitions, thus smoothing the training distribution on past behaviors.
Kempka et al. 24 implement the Doom video game as a simulated environment for RL. It has been used as a platform for the study of reward shaping, curriculum learning, and predictive planning. In addition, the open-world game Minecraft also provides a platform for exploring RL and artificial intelligence (AI). Because of the nature of the simulation, some studies on curriculum learning and hierarchical planning have been conducted on this platform (Tessler et al., 25 Matiisen et al., 26 and Oh et al. 27 ).
Juliani et al. 28 implement a toolkit that leverages the Unity platform for creating simulation environments. Unity enables the development of learning environments with sensory and physical complexity and also supports multi-agent settings. Hessel et al.29,30 combine all of the deep RL improvements of previous studies to provide state-of-the-art performance on the Atari 2600 benchmarks, surpassing human-level performance. In contrast to the Q-learning with experience replay used in the proposed framework, traditional Q-learning 20 with online learning updates the value function immediately from the current transition instead of storing it in a replay buffer.
Motivation approaches are combined with RL to identify how motivations can be satisfied using RL. Recent research has attempted to combine RL with motivation to simulate agents. Merrick and Maher 31 propose motivated RL agents, which can explore their environment and learn pre-defined behaviors, such as moving to a chair or a house. However, these agents lack the ability to adapt to and experience their surroundings, leaving them unable to avoid dangerous situations, weather events, and so on.
In the proposed framework, the neural network plays the role of the Q-function. The neural network has many parameters, known as weights. Therefore, instead of iteratively updating values in a lookup table as MC methods do, the proposed framework updates the weights of the neural network so that it learns to provide better estimates of state-action values. In addition, the proposed framework uses gradient descent to train the Q neural network, as in any other neural network.
The neural network is implemented in the same way as the Atari playing algorithm by Google DeepMind.21–23 The network accepts a state and outputs a separate Q-value for each possible action in its output layer. Q-learning needs the maximum Q-value over the possible actions in the new state. The key feature that the proposed framework takes from deep RL21–23 is experience replay, which essentially replaces single-transition online updates with minibatch updates. Transitions sampled from the replay buffer are used as a labeled data set for learning. In addition, the proposed framework uses both the current transition and transitions from the replay buffer to update the value function at every time step.
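A minimal sketch of such a replay buffer, assuming a simple tuple format for transitions; the capacity and field layout are illustrative, not the article's implementation:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity memory of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions drop out first

    def store(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation between consecutive
        # transitions and smooths the training distribution over past behavior.
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

buffer = ReplayBuffer(capacity=100)
for t in range(5):
    buffer.store((t, "act", 0.0, t + 1, False))
minibatch = buffer.sample(3)  # used like a small labeled data set for the Q-Network
```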
In this article, a new framework is presented for simulating agents in a 3D virtual environment. Agents are simulated to act and adapt to various situations based on their pre-defined desires. In addition, the agents are capable of avoiding dangerous situations, events, and experiences in their 3D virtual environment.
QNM system
This section presents an overview of a QNM framework and then describes how to utilize the Q-Network for motivation selections.
System overview
This section presents an overview of the proposed framework that addresses tasks to develop agents capable of deciding motivations based on their desires, selecting goals based on the decided motivations, and performing actions based on the selected goals. First, the proposed framework uses the Q-Network framework to select goals based on motivations. Second, it uses HTN planning to allow agents to select actions to accomplish their goals.
Terms and notations
The term “agent” refers to anything that makes decisions by selecting and executing actions in its environment. The agents in this article are elements simulated in the 3D virtual environment, such as humans and animals.
In RL 20 and in deep RL, 23 the authors define the terms “state,” “terminal state,” “reward,” “transition,” and “replay buffer.” The term “state” refers to information about the environment, such as the locations of buildings, food, and vegetation; that is, everything the agent can interact with directly or indirectly. A change in the environment is known as a “state-transition,” occurring whenever the agent performs an action. After each action, the reinforcement function evaluates the state-transition caused by the action and assigns a “reward.” Every episode starts from an initial state and ends in a “terminal state.” A deep RL algorithm uses a deep neural network as a function approximator for learning. RL tasks have no pre-generated training sets to learn from, so the agent must keep records of all the state-transitions it collects and learn from them later; these records are stored in a memory buffer, referred to as experience replay or a replay buffer. Figure 1 denotes the relationships between states, desires, motivations, goals, and actions in the proposed framework.

Relationships between states, desires, motivations, goals, and actions.
The term “desire” describes a need that causes an agent to act. The desires in this article are variables that invoke decision-making processes leading to motivations, and they are utilized for selecting motivations. The agents in this article have pre-defined desires, such as hunger, thirst, and fear. Each desire is defined by
The desires should be pre-defined, given that motivations, goals, and actions are defined in this article by considering desires. The term “motivation” refers to the reason that agents act or behave in certain ways.7–10 In this article, the motivations are used as heuristic variables in the selection of goals. Motivations such as eating food, drinking water, and avoiding objects are selected by consideration of all desires and states. Each motivation is denoted by
The QNM framework has the limitation that each goal can select only one object at a time, even though there are potentially multiple objects for each goal. If an object has no location, the QNM framework does not operate on it.
The structure of the proposed framework
The QNM framework, as shown in Figure 2, is divided into a motivation layer, an intention layer, and an execution layer, for determining motivations, goals, and actions, respectively. The QNM framework receives state

QNM framework.
Motivation layer
First, desires are calculated from the state
where
Intention layer
The goal selector of the intention layer selects a goal based on the received motivation
Execution layer
In this layer, the selected goal is performed by utilizing HTN planning. A list of actions
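The goal-to-action decomposition performed by HTN planning in this layer can be sketched as follows; the task names, method definitions, and distance logic are illustrative assumptions, not the article's actual planning domain:

```python
# A minimal HTN-style planner: compound tasks decompose into subtasks via
# methods evaluated against the current state; primitive tasks become actions.
PRIMITIVES = {"move_to_lake", "look_around", "drink_water"}

METHODS = {
    # The "drink" goal decomposes differently depending on distance to water.
    "drink": lambda state: (["move_to_lake", "drink"]
                            if state["dist_to_lake"] > 1
                            else ["look_around", "drink_water"]),
}

def plan(task, state):
    """Recursively decompose `task` into a flat list of primitive actions."""
    if task in PRIMITIVES:
        if task == "move_to_lake":
            state["dist_to_lake"] -= 1  # moving reduces the remaining distance
        return [task]
    subplan = []
    for subtask in METHODS[task](state):
        subplan += plan(subtask, state)
    return subplan

plan("drink", {"dist_to_lake": 2})  # → ['move_to_lake', 'look_around', 'drink_water']
```

The recursive decomposition mirrors the HTN idea from the related-work section: compound tasks expand until only primitives (actions roughly corresponding to executable behaviors) remain.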
Q-network for motivation selections
The Q-Network operates as a function approximator to map a desire
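The article implements its network in Keras; the following framework-free NumPy sketch only illustrates the general idea of a Q-Network mapping a desire vector to one Q-value per motivation, trained by gradient descent. The layer sizes, learning rate, and example desire vector are assumptions, not the article's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Small MLP: a desire vector in, one Q-value per motivation out.
n_desires, n_hidden, n_motivations = 6, 16, 5
W1 = rng.normal(0.0, 0.1, (n_desires, n_hidden))
W2 = rng.normal(0.0, 0.1, (n_hidden, n_motivations))

def q_values(desires):
    h = np.maximum(0.0, desires @ W1)  # ReLU hidden layer
    return h @ W2                      # linear output: one Q-value per motivation

def train_step(desires, target_q, lr=0.01):
    """One gradient-descent step on the squared error against target Q-values."""
    global W1, W2
    h = np.maximum(0.0, desires @ W1)
    err = h @ W2 - target_q            # dLoss/dq for 0.5 * ||q - target||^2
    dh = (err @ W2.T) * (h > 0)        # backprop through ReLU (uses pre-update W2)
    W2 -= lr * np.outer(h, err)
    W1 -= lr * np.outer(desires, dh)

desires = np.array([0.8, 0.1, 0.0, 0.3, 0.0, 0.5])
motivation = int(np.argmax(q_values(desires)))  # greedy motivation selection
```

The `argmax` at the end corresponds to selecting the motivation with the highest Q-value, which is then passed to the intention layer.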

The network structure for motivation selections.
As a result, the motivation with the highest Q-value is selected for the intention layer. After calculating all desires, the desires are transformed into desire

Representing thresholds of each desire.
Figure 4 presents the representation of the thresholds for each desire. Each desire has a distinct minimal threshold
Here, three cases are considered. First, when the desire

The current desire is below the minimal threshold.
If the desire
Second, when desire

The current desire is above the maximal threshold.
Finally, when desire
Each desire must be maintained within these thresholds
The main objective of this study is to ensure all the desires stay as close as possible within the thresholds
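One way to express the three threshold cases as a learning signal is sketched below; the specific reward values are assumptions, not the article's equations. The only idea taken from the text is that staying inside the thresholds is good and deviation is penalized:

```python
def desire_reward(desire, t_min, t_max):
    """Reward shaping over the three threshold cases; desires range over [0, 100]."""
    if desire < t_min:          # case 1: below the minimal threshold
        return -(t_min - desire) / 100.0
    if desire > t_max:          # case 2: above the maximal threshold
        return -(desire - t_max) / 100.0
    return 1.0                  # case 3: within the thresholds

# Hypothetical thresholds for a "hunger" desire:
rewards = [desire_reward(d, 20, 80) for d in (10, 50, 95)]  # → [-0.1, 1.0, -0.15]
```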
Desire calculations
To calculate a desire such as that in equation (1), regarding “fear,” it is necessary to check for dangerous situations involving animals, humans, and vehicles.
This section describes in more detail how the proposed framework estimates and calculates the desires of each agent; for instance, the desire calculation for an animal's fear. As the proposed framework defines the range of each desire as being between 0 and 100, increases or decreases in a desire depend on the events and factors occurring around the agent. In the first case, “fear” increases in association with events such as the following: (a) dangerous animals, humans, or nearby vehicles could endanger the animal, or the animal is (b) hit or (c) hurt by vehicles, humans, or other animals. In the other cases, “fear” decreases with the distance from the dangerous situation, or (d) due to actions the animal takes to avoid danger, such as running away or going to a safe place. If the animal cannot avoid the danger, (e) it must fight when the situation becomes dangerous. Events (a)–(e) are transformed into equations (2)–(6), respectively, as below
where
As mentioned earlier regarding equation (1), the calculation of “fear” desire
where
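Since equations (2)–(6) are not reproduced here, the following sketch only illustrates the shape of such an event-driven fear update; every coefficient, the decay term, and the clamping scheme are assumptions, not the article's formulas:

```python
def update_fear(fear, nearby_danger, was_hit, was_hurt,
                dist_to_danger, is_avoiding, is_fighting):
    """Illustrative update for the "fear" desire, loosely following events (a)-(e)."""
    if nearby_danger:                  # (a) dangerous animal/human/vehicle nearby
        fear += 10.0
    if was_hit:                        # (b) hit by a vehicle, human, or animal
        fear += 20.0
    if was_hurt:                       # (c) hurt in the encounter
        fear += 15.0
    if is_avoiding or is_fighting:     # (d)/(e) running away, reaching safety, fighting
        fear -= 5.0
    fear -= min(dist_to_danger, 50.0) * 0.1  # farther from danger -> fear decays
    return max(0.0, min(100.0, fear))  # desires are kept within [0, 100]

f = update_fear(40.0, True, False, False, 0.0, False, False)  # → 50.0
```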
Experiments and results
This section explains the goals and design of the proposed framework. The results and discussion are then described.
Experiment goals
The goal of the experiments is to verify the proposed framework by applying it in 3D virtual environments. In addition, traditional Q-learning 20 was applied for comparison with the proposed framework, and the two methods were compared on their accuracy. Furthermore, the results of applying both frameworks in 3D virtual environments were also compared.
Experiment design
The QNM framework provides Python APIs to interact with a 3D virtual environment. The 3D virtual environment and the QNM framework communicate directly, while the QNM system receives feedback from the 3D virtual environment. The QNM framework is proposed for simulating multiple agents in virtual environments, including humans, vehicles, and animals. Over the past few years, humans and vehicles have been handled in many studies. However, there is little research on animal simulation, as it is not easy to apply algorithms to it. Therefore, we focus on animals to verify and evaluate our QNM framework and demonstrate its performance.
The QNM algorithm was implemented in Keras 32 and PyCharm-Community-2018.1.2, and the QNM models were trained on a PC comprising an Intel i7-6700 @ 3.40 GHz CPU, an Nvidia GeForce GTX 1060 GPU, and 16 GB RAM. The proposed framework used the RMSprop optimizer with a learning rate of 0.001 and
For learning the motivation-value for the taken motivation, a random sample of past experiences was used for training. All of the experiments were performed in Unity 3D, which provides 3D virtual environments for simulating agents such as animals. More than 100 animal agents of nine types were defined, and the desires, motivations, goals, objects, and actions were pre-defined for each animal agent. The minimum and maximum thresholds of the desires differ for each animal agent, but the desires, motivations, goals, objects, and actions are the same, as shown in Tables 1 and 2, respectively. Table 3 shows the list of goals, and Table 4 shows the list of actions.
Desires and their thresholds.
List of motivations.
List of goals.
List of actions.
Figure 7(a) shows an example of choosing actions by considering the distance between an agent (“me”) and its selected goal, “lake.” Figure 7(b)–(i) show further examples of goals and actions. In Figure 7, the parameter

The result of actions taken by selecting the goal: (a) HTN planning for a drinking goal; (b) HTN planning for an eating goal; (c) HTN planning for a relaxing goal; (d) HTN planning for an avoiding goal; (e) HTN planning for a urinating goal; (f) HTN planning for a fighting goal; (g) HTN planning for the following goal; (h) HTN planning for a wandering goal; and (i) HTN planning for a playing goal.
Experiment results
In the first experiment, the proposed framework was applied using the Q-Network. Figure 8 shows the accuracy results after training the QNM and traditional Q-learning models. The red line represents the results of the QNM, which achieved an average accuracy of 85.5% over 1000 time-steps. The blue line shows that the average accuracy of traditional Q-learning was 69.92%.

Accuracy of the QNM and traditional Q-learning.
Figure 9 shows the loss results after training the QNM and traditional Q-learning models. The red line shows that the average loss of the Q-Network models was approximately 8.3%, and the blue line shows that the loss of traditional Q-learning was 2.21%. Figure 10 shows screenshots taken after selecting and executing a list of actions using the proposed framework in 3D virtual simulations, showing the actions and motivation of a chicken. Figure 10(a) shows the chicken walking toward a road, while Figure 10(b) shows the chicken avoiding cars.

Loss of the QNM and traditional Q-learning.

Results of QNM selections: (a) The chicken walking toward a road and (b) The chicken avoiding cars.
Figure 11 shows the changing of desires by the selection of motivations. In addition, Figure 12 shows the changing of desires by selecting the motivations using the traditional Q-learning.

Result of changing desires by selecting motivations using the QNM.

Result of changing desires by selecting motivations using the traditional Q-learning.
The changing desires of the two frameworks, the QNM and traditional Q-learning, are shown in Figures 11 and 12, respectively. Figure 11 shows some desires changing over time depending on their thresholds. For example, a desire such as hunger, thirst, toileting, or aggression would change when it exceeded its minimal or maximal threshold, which was the expected result of the proposed framework. However, Figure 12 shows only the thirst desire changing, while the others increased over time; traditional Q-learning produced a similarly poor result. Comparing the two figures, with the same initial desires applied to both, the QNM framework shows better results than the traditional framework.
Figure 13 shows the goals determined by the selected motivations. For example, when the thirst motivation was selected, the goal “Going to drink from a stream” was selected, and a list of actions was executed based on HTN planning.

Result of goals chosen by selected motivations using the QNM.
In addition, Figure 13 shows the motivations changing over time. First, the agent had the “following” motivation, and a goal of following a vehicle was selected. After finishing that goal, the urinating motivation was chosen, and the agent needed to go urinate at a lake. At the time
Discussion
The proposed QNM framework is designed for simulating the actions of a single agent while considering its relationships with the multiple agents around it. Each agent can interact with other agents in dynamic environments, for example by fighting or avoiding them. Therefore, multiple agents can be simulated naturally in virtual environments. To simulate large groups of agents such as crowds, we plan to extend our framework in the near future.
Our QNM framework consists of three layers. In the motivation layer, the desire functions and desire collector involve only addition operators, so their performance is fast. The motivation selector contains some improvements to the Q-Network for suitable desire input; however, these changes do not make the original Q-Network more complex. The intention layer also consists of simple operators such as addition, subtraction, and logic, so its computational cost is small. In the execution layer, we apply the original HTN planner without modification. This algorithm incurs a high computational cost if thousands of agents run at the same time. In the near future, we plan to enhance the HTN algorithm to reduce its computational complexity.
Conclusion
In this article, the QNM framework was proposed to better define the desires of virtual agents, helping them act more realistically in 3D virtual IoT simulations. The desires were calculated based on states that contain the environment information. A Q-Network played a key role in selecting motivations by considering desires. Goals were then selected based on the selected motivations, and HTN planning was applied to select actions for the determined goals. In the experiments, the results of changing desires with the QNM were better than those with traditional Q-learning, with the QNM improving accuracy by 15.58%. In future research, adjustments and extensions of the QNM should be made to increase the ability of agents to adapt to more challenging situations. In addition, comparisons between the QNM framework and other methods should also be implemented.
