Abstract
Introduction
Due to the constantly changing behavior of cyber-attacks, reactive approaches are desirable to detect and prevent malicious actors from gaining access to networks. Firewalls and intrusion detection and prevention systems (IDPS) are a line of defense in identifying and stopping suspicious internet traffic. When a suspicious event occurs, these devices generate a log file detailing which preprogrammed rules were violated and how the event was handled. 1 Such log files contain details of the event, e.g. source and destination IP addresses, port numbers, and protocols, but not the packet and data that led to the event. Of interest is cyber/digital forensics of logged events to understand their origin and magnitude.2,3 Suspicious events include both malicious and non-malicious activities, e.g. misconfigured routers; however, every event is logged, so finding malicious events for further analysis requires searching through all logged suspicious events.
Although advances have been made in applying text mining and advanced analytics to cyber log data analysis, cf. Suh-Lee et al., 4 Breier and Branišová, 5 and Villa et al., 6 the characteristics of cyber logs still result in much manual analysis for interpretation and response.7–10 When considering log data, cyber analysts rely on manual sorting and experiential knowledge to find possible threats in logged events to further investigate.7,11,12 Thus, cyber security is heavily experience-based and uses the innate ability of humans to process large amounts of complex data; 13 similarly, experience is critical, and novice analysts might miss intrusions and events that a veteran analyst would not. 14 Additionally, cyber intrusion detection is asymmetrical in nature: an attacker can focus on only one threat approach, while a defender (cyber analyst) must constantly protect all systems and prepare for many different types of attacks, vulnerabilities and threats. 15 Although system administrators and cyber analysts manually handle log data, this is becoming increasingly infeasible due to the big data nature of cyber traffic (unstructured, high volume and high velocity 16 ).
Normal behavior for cyber networks is generally not well defined and changes over time, resulting in high false positive detection rates. 17 Additionally, since firewall log events are the result of network abnormalities, one is thus necessarily interested in detecting the anomalies within the anomalies. Related research, c.f. Lazarevic et al., 18 Denning, 19 García-Teodoro et al., 20 Grimaila et al., 21 Moore et al., 22 Dube et al., 23 Shilland, 24 Shen et al., 25 Stewart et al. 26 has focused on anomaly detection at the device/software level, with little21,27–32 exploration into anomaly detection in the log files generated from the preexisting devices or software.
For analysis, data were used from a large scale distributed network with regional data nodes, much like the Microsoft Cyber Defense Center, the Verizon Network Operations Center, or the AT&T Global Network Operations Center. Currently, data are analyzed from enterprise-wide networks, which rely on a series of firewalls and IDPS to identify and stop intrusions. These devices, when triggered, generate a log file containing details of how they handled each incident, such as the source and destination IP addresses, port numbers, protocols, bytes transferred, etc. However, due to the wide variety of devices adding observations to the log, the data can be highly variable. In operation, analysts employ an experiential approach whereby large log files are manually sorted to find anomalies to further investigate; this process is conceptualized in Figure 1(a). However, due to the large size of the network and quantity of users, the data are of significant volume and emerge at high velocity, and are thus representative of a big data problem. Currently, analysts inspect numerous potential incidents on a daily basis, but have neither the time nor the resources available to analyze all incidents contained in the logs.

Figure 1. Cyber firewall log analysis methods: (a) standard, manually intensive cyber anomaly detection approach; (b) proposed methodology for analyst-aided multivariate firewall log anomaly detection.
This paper combines statistical and visual methods and integrates them into embedded analytic applications to assist analysts in the manual analysis of firewall logs. To this end, the authors develop a tabulated vector approach (TVA) that processes firewall log files to identify anomalies within the flagged firewall log event data. The TVA process employed by the authors is similar to that of Winding et al., 27 wherein pre-defined data attributes are considered. The developed process is automated, and data attributes are transformed into representative counts, e.g. the number of times a certain IP address appears within the timespan of interest. Descriptive statistics are then computed for these counts, with the result being the tabulated vector for a given time period.
The authors propose an analyst-aided solution to cue system administrators and analysts to anomalies for further analysis when manual log file analysis and forensics are employed. The end goal is that seen in Figure 1(b), wherein log files are selected, these are then divided into time blocks. From here, tabulated vectors are computed for the time blocks. These tabulated vectors are then processed through statistical and graphical methods. Finally, analysts are cued to various blocks of interest within a given log data file. The purpose of this approach is to efficiently analyze abnormal activities so that cyber analysts can dedicate their time to researching potential threats.
This paper is organized as follows: “Background” section reviews background details on cyber networks, cyber threats, and cyber technologies. “Developing a statistical framework for cyber anomaly detection” section presents statistical pattern recognition methods that consider handling unstructured data through numerical approaches. “TVA for firewall log analysis” section discusses the development of a framework to detect firewall log anomalies. “Embedded analytics” section discusses how the proposed methodology was embedded into analytic applications for use by cyber analysts, and “Conclusions” section concludes the paper.
Background
In order to analyze cyber log data, one must discuss the basics of cyber networks, firewalls, IDPS, and characteristics of cyber log data. In this paper, we will discuss the salient characteristics of these areas.
Cyber networks
Figure 2 presents a conceptualization of a basic cyber network where user PCs are protected by an intrusion detection system (IDS) or intrusion prevention system (IPS), collectively IDPS, and a firewall. 33 Each security device plays a crucial role in protecting the user’s computer from outside and inside threats. Both IDPS and firewalls monitor network traffic and either stop or flag events that violate their rules. When an event triggers a rule, details are logged along with the action taken by the firewall or IDPS.

Figure 2. Generic cyber network.
Firewalls
Firewalls provide a first level of protection between an internal (e.g. local area network (LAN)) and external (e.g. internet) network. Firewalls employ rules to determine the outcome of an event 34 and prevent risks, including: (1) an internal host system’s exposure to inherently insecure Internet protocols and services, and (2) probes and attacks launched from hosts on the Internet. 35 A wide variety of firewalls exist, including both commercially developed and open source systems. 36 Presently, firewalls employed in the operational network of interest include those manufactured by Palo Alto Networks, Cisco ASA, McAfee, and Norton 360.
Firewalls are of three general types: 35 (1) packet filtering (PF), (2) proxy gateway, and (3) circuit level inspection. In brief, PF firewalls consider each incoming and outgoing packet, apply predefined rules to analyze the packet and decide to allow it to proceed or not. 35 Proxy gateways, also known as servers, act as a security filter. 35 Circuit level inspection firewalls use a proxy server that employs an access control list to determine if a request is permitted.
Intrusion detection/prevention systems (IDPS)
Intrusion detection involves monitoring and logging network traffic to detect attempts to gain unauthorized network access, which are evidenced by violations of security policies and acceptable use policies. 37 Intrusion prevention goes one step further by attempting to stop such incidents. Therefore, an IDPS must identify possible incidents and, when one occurs, generate a log of information about the event and the course of action taken. 37 Similar to firewalls, an IDPS employs a set of rules related to signatures or anomalies. 37 IDPSs can be set up in two ways, host based (HIDS) or network based (NIDS), where the former is deployed on each individual computer while the latter is positioned along the network. 37
Cyber anomaly detection in firewall logs
While one could find that a given firewall log file contains entirely malicious events, one likely has the problem of too many false positives in the log data. False positive issues in cyber anomaly detection involve too many benign events being logged and thus obscuring the rare malicious activities. 38 Since firewall logs contain anomalous events detected within network traffic and since many of these are not threats from attackers, one is thus interested in finding anomalies among anomalies. Two general paradigms exist for this task: (i) experiential, or manually searching through logs based on subjective experiences and (ii) statistical or machine learning approaches to find observations of interest in the log data.
Experiential approaches
In general, the work of cyber analysts is manually intensive and involves queries and searches of a dataset.7,11 Experiential approaches work by taking log files, employing various sorting and analysis tools (e.g. Snort and Kibana), and incorporating contextual information to understand an event. 11 The process is conceptualized in Figure 1(a), where only two column searches are considered; this illustrates the manual nature of sorting by individual columns until one finds observations that display suspicious behavior. While such an approach could be highly accurate, it is time consuming and requires a year or more of on-the-job experience 11 and learning of various firewall forensics attributes. 39
Statistical data analysis
Statistical data analysis involves using mathematical approaches to find patterns in datasets.13,40 Approaches for doing so range from supervised (known classes/groups in the data) to unsupervised (unknown classes/groups in the data). Statistical data analysis thus includes classification methods where classes are known, at least in the training data, to clustering methods for which classes are unknown and one aims to find groupings in the data. 13 In cyber analysis, one can divide the field into academic and operational approaches. While academic approaches to cyber analysis frequently have the luxury of knowing if behaviors are threats or not, c.f. Grimaila et al., 21 Moore et al., 22 operational cyber analysis does not have the luxury of canonical truth. Thus, statistical analysis of cyber data is frequently unsupervised in operation.
Since a variety of methods have been proposed to analyze firewall logs via statistical or machine learning methods, of interest is thus leveraging past concepts and ideas to create a method to aid analysts in analyzing and interpreting firewall log data. A variety of approaches exist in this domain, c.f. Lazarevic et al., 18 García-Teodoro et al., 20 and include text analytics, 41 support vector machines, 18 random forests, 42 event correlation,21,30,43–46 dynamic rule creation, 29 and principal component analysis.20,47,48 Of particular interest is the work of Denning, 49 who originally proposed using anomaly detection methods in cyber security. This has been consistently extended and expanded upon, as seen in Lazarevic et al., 18 García-Teodoro et al., 20 Zhang and Zulkernine, 42 Shyu et al., 47 Wang and Battiti, 48 Lazarevic et al., 50 Ahmed et al., 51 Liao et al., 1 and Patcha and Park. 52
Cyber network and data of interest
Of interest to the authors are general firewall log files; one task is handling all firewall logs from the many networks the enterprise has worldwide. For context, the operational approach to the data collection process is conceptualized in Figure 3. For data handling, raw logs are first normalized into a structured data file by a connector, a stand-alone device or software that forwards data and sometimes converts from one format to another. These are then forwarded to regional centers (RC). RCs are organizations that provide regional services while simultaneously defending the network from cyber threats. 53 At the RCs, a regional security information and event management (SIEM) device aggregates, correlates, monitors, and generates alerts from the received data. Next, a second connector forwards the data to a global SIEM known as the integration center (IC). After the data is reprocessed at the IC, it is uploaded into a big data platform, a centralized database for managing big data, 7 both structured and unstructured, at high volume and high velocity. 16 From here, data can be queried and analyzed.

Figure 3. Generic representation of the data collection hierarchy.
Developing a statistical framework for cyber anomaly detection
In order to develop a statistical framework for firewall log analysis, the authors posit that a multivariate dataset containing only numeric values is necessary. To this aim, the authors work towards feature vector creation and then statistical and graphical analysis of this feature vector.
Feature vector creation
One technique to facilitate the application of statistical methods to log files is the feature vector method proposed by Winding et al. 27 and further applied in Breier and Branišová 29 and Syurahbil et al. 54 This approach aggregates log file observations into a set of feature vectors, which can then be analyzed through statistical approaches that require the data to be numeric. In brief, a feature vector is a count of occurrences for the unique values in a set of variables. 27 Inherently, this involves dividing the data into blocks or regions of sequential observations.
A conceptualization is presented in Figure 4. In Figure 4(a), we have an example of two columns of raw data in a given block. Field A is categorical and field B is numeric. A feature vector can be created by condensing these observations into a block row, Figure 4(b). Unique categorical features in field A become columns of block 1. The count of each unique categorical feature in that block then becomes the value. Numerical entries in field B of the original data are then summed, with that value placed in the column for B.

Figure 4. Generic feature vector creation process. (a) Example raw data; (b) resultant feature vector.
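The condensation of a block into a single row can be sketched in a few lines. The following is a minimal illustration, assuming records are dictionaries with one categorical field "A" and one numeric field "B"; the field values and the function name `feature_vector` are hypothetical and are not taken from Winding et al. 27

```python
from collections import Counter


def feature_vector(block, categorical_field, numeric_field):
    """Condense one block of records into a single feature-vector row.

    Each unique value of the categorical field becomes a column holding
    its count; the numeric field becomes one column holding its sum.
    Assumes no categorical level shares a name with the numeric field.
    """
    counts = Counter(rec[categorical_field] for rec in block)
    row = dict(sorted(counts.items()))
    row[numeric_field] = sum(rec[numeric_field] for rec in block)
    return row


# Hypothetical raw data for block 1 (illustrative values only).
block_1 = [
    {"A": "tcp", "B": 120},
    {"A": "udp", "B": 40},
    {"A": "tcp", "B": 300},
    {"A": "icmp", "B": 8},
    {"A": "tcp", "B": 64},
]

vector_1 = feature_vector(block_1, "A", "B")
print(vector_1)
```

Each additional block would produce one more row of this kind, so a log file becomes a numeric matrix with one row per block.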
When applying the feature vector approach to firewall log data, Winding et al. 27 took log file records with the following raw data fields:
Repeated attempts of access by a single IP, Number of source IPs per destination IP, Number of destination IPs per source IP, Number of destination ports on a given source/destination IP pair, Unique IPs, Maximum activity from a single IP, Failed and successful connections from the same IP, Attempts to access invalid IPs, Inbound/Outbound bytes per unit time.
and then condensed these into feature vectors with the following variables:
Source IP address, number of destination IP addresses, Destination IP, number of failed access attempts, Source IP, destination IP, Destination perspective vector (destination IP, count of source IPs, number of successful accesses, number of failed accesses, count of destination ports, number of bytes transferred (inbound), number of bytes transferred (outbound)).
Mahalanobis based anomaly detection
To find anomalies inside a feature vector, one approach is the Mahalanobis distance, a multivariate outlier detection measure that compares each observation by its distance from the data mean, independent of scale. 55 The Mahalanobis distance is computed as
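In its standard textbook form, the squared Mahalanobis distance of an observation $\mathbf{x}$ from the sample mean $\bar{\mathbf{x}}$ with sample covariance matrix $\mathbf{S}$ is $D^2(\mathbf{x}) = (\mathbf{x} - \bar{\mathbf{x}})^{\top} \mathbf{S}^{-1} (\mathbf{x} - \bar{\mathbf{x}})$; this notation is an assumption, since the authors' own symbols are not recoverable from the text. A minimal sketch of the computation follows, written with the Python standard library only. The helper names and the small data set are hypothetical, and a production analysis would more likely use a numerical library.

```python
from statistics import mean


def covariance_matrix(rows):
    """Return (column means, sample covariance matrix) for a list of rows."""
    n, p = len(rows), len(rows[0])
    means = [mean(col) for col in zip(*rows)]
    cov = [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
            for j in range(p)] for i in range(p)]
    return means, cov


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def squared_mahalanobis(rows):
    """Squared Mahalanobis distance of every row from the sample mean."""
    means, cov = covariance_matrix(rows)
    distances = []
    for r in rows:
        d = [x - mu for x, mu in zip(r, means)]
        z = solve(cov, d)  # z = S^{-1} d, without forming the inverse
        distances.append(sum(di * zi for di, zi in zip(d, z)))
    return distances


# Hypothetical two-variable feature vectors; the last row is an outlier.
rows = [[1, 2], [2, 1], [3, 4], [4, 3], [10, 20]]
d2 = squared_mahalanobis(rows)
most_anomalous = max(range(len(rows)), key=lambda i: d2[i])
print(most_anomalous, [round(v, 3) for v in d2])
```

Solving the linear system instead of inverting the covariance matrix keeps the sketch short, and it fails loudly when the matrix is singular, which is the situation the multicollinearity checks later in the paper are meant to prevent.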
Breakdown distance
However, one limitation is using
Principal components and factor analysis
Principal components analysis (PCA) is a linear transformation method that involves computing the eigen-solution of the data covariance, or correlation, matrix and projecting the data by the resulting eigenvectors. 56 The resultant projection is of uncorrelated components, with each component explaining successively less variation in the data, per the eigenvalues. 56 PCA is a dimensionality reduction method because one can select a small number of components that explain a large amount of the variance in the data. PCA was applied to IDPS event analysis by García-Teodoro et al. 20 Shyu et al. 47 proposed using minor components (those whose variance explained is less than 0.20), claiming that their method can distinguish whether an outlier is an extreme value or does not have the same correlation structure as the “normal” data.
Similar to PCA, factor analysis (FA) is another dimensionality reduction technique designed to identify underlying structure of the data. FA relates the correlations between variables through a set of factors to link together seemingly unrelated variables.56,57 An additional step seen in FA is that it can rotate the original solution seen in PCA, in order to possibly find more structure in the data. The basic FA model is
A factor loading matrix can be computed to understand how each original data variable is related to the resultant factors. 56 This can be computed as
To assess the quality of a factor analysis solution, Kaiser 60 proposed the index of factorial simplicity (IFS) that measures the tendency towards unifactoriality for both a given row and the entire matrix as a whole. Computing IFS values consistent with Kaiser, 60 we can evaluate the quality for a factor analysis solution with the heuristic labels shown in Table 1.
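As an illustration of how such an index can be evaluated, the sketch below computes the overall IFS for a loading matrix with q factors as the sum over rows of q·Σv⁴ − (Σv²)², divided by the sum over rows of (q − 1)·(Σv²)². Both this formula and the label cut-offs (0.90s marvelous, 0.80s meritorious, 0.70s middling, 0.60s mediocre, 0.50s miserable, below 0.50 unacceptable) are stated here from recollection of Kaiser 60 and should be treated as assumptions to be checked against that reference and Table 1. The function names are hypothetical.

```python
def index_of_factorial_simplicity(loadings):
    """Overall IFS for a loading matrix (rows = variables, cols = factors).

    Assumed form: sum_i [q*sum_s v^4 - (sum_s v^2)^2]
                  / sum_i [(q - 1)*(sum_s v^2)^2], with q >= 2 factors.
    """
    q = len(loadings[0])
    numerator = denominator = 0.0
    for row in loadings:
        s2 = sum(v ** 2 for v in row)
        s4 = sum(v ** 4 for v in row)
        numerator += q * s4 - s2 ** 2
        denominator += (q - 1) * s2 ** 2
    return numerator / denominator


def kaiser_label(ifs):
    """Heuristic label for an IFS value (cut-offs assumed, see lead-in)."""
    for cutoff, label in [(0.9, "marvelous"), (0.8, "meritorious"),
                          (0.7, "middling"), (0.6, "mediocre"),
                          (0.5, "miserable")]:
        if ifs >= cutoff:
            return label
    return "unacceptable"


# Each variable loads on exactly one factor: perfectly simple structure.
simple = index_of_factorial_simplicity([[0.8, 0.0], [0.0, 0.7]])
# Each variable loads equally on every factor: no simplicity at all.
diffuse = index_of_factorial_simplicity([[0.5, 0.5], [0.4, 0.4]])
print(simple, diffuse, kaiser_label(0.6125))
```

The two extreme cases bound the index between 0 and 1, which is why a larger value is read as a cleaner, more interpretable solution.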
Since the purpose of factor analysis is for dataset reduction, we consider the three generally accepted methods of determining the dimensionality for correlation matrix inputs.56,61 The first and most commonly used is Kaiser’s Criterion 62 which advises to retain those factors whose eigenvalues are greater than 1.0. Second is Cattell’s scree test 63 which involves graphing the eigenvalues and retaining those that form the steep curve. The third method is a modified scree test called Horn’s parallel analysis (i.e. Horn’s Curve), 64 that uses a Monte Carlo simulation to find the average eigenvalues. Due to the advantages of Horn’s parallel analysis, the authors employed this method herein to determine the number of factors to explore.
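The difference between the first and third retention rules can be shown with a small comparison. This is a sketch under stated assumptions: the observed and reference eigenvalues are made-up numbers for a six-variable correlation matrix, the reference values stand in for the averages that a Monte Carlo run of parallel analysis would produce (the simulation itself is not shown), and stopping at the first eigenvalue that falls below its reference is one common reading of the rule rather than something the text specifies.

```python
def kaiser_retained(eigenvalues, cutoff=1.0):
    """Kaiser's criterion: count eigenvalues greater than the cutoff."""
    return sum(1 for ev in eigenvalues if ev > cutoff)


def horn_retained(observed, reference):
    """Parallel-analysis style rule: count leading observed eigenvalues
    that are at least as large as the matching reference eigenvalue."""
    kept = 0
    pairs = zip(sorted(observed, reverse=True), sorted(reference, reverse=True))
    for obs, ref in pairs:
        if obs < ref:
            break
        kept += 1
    return kept


# Hypothetical eigenvalues; each list sums to the number of variables (6).
observed = [2.9, 1.4, 1.05, 0.4, 0.15, 0.1]
reference = [1.45, 1.2, 1.1, 0.95, 0.75, 0.55]

n_kaiser = kaiser_retained(observed)
n_horn = horn_retained(observed, reference)
print(n_kaiser, n_horn)
```

In this made-up case Kaiser's criterion keeps a third factor whose eigenvalue barely exceeds 1.0, while the reference curve rejects it because random data of the same shape would produce an eigenvalue at least that large.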
Visualizations
Graphical analytic tools enable an analyst to visualize insights that may not be readily apparent without manually examining the data. 65 Appropriate visualizations are key to cyber data analysis, c.f. Schweitzer and Fulton, 66 thus the authors present a selection of methods which will be employed to help analysts tell a story in firewall log data.
Heatmaps
Heatmaps are graphical representations of a data matrix through the use of a color scale and have been used for 100+ years as an effective visualization approach for a matrix. 67 In statistics, one common use for heatmaps is for correlation matrices, illustrating the relationship between variables ranging from −1 to 1.
Histogram matrix
A histogram matrix (HMAT) is a visualization technique developed by Frei and Rennhard, 28 that combines graphical and statistical techniques to aid security administrators in efficiently identifying anomalies. HMAT was designed to scan large log files that show steady normal behavior and examine the messages displayed for each observation. HMAT extends both heatmaps and histograms, where data values are represented through a series of circles on a grid with the radius directly corresponding to the magnitude. 28
An example HMAT relative to log files is presented in Figure 5 (taken from Frei and Rennhard 28 ); here HMAT visualizes time on the x-axis, and the number of words per message on the y-axis. The size of the circle in Figure 5 is related to the number of log messages with the corresponding number of words while the color serves as indication to the relative likelihood of the time slot. The authors in Frei and Rennhard 28 determined the color of a circle by comparing the distribution of the sizes of the circles in its column with previous columns. In Figure 5, the large red circle indicates an unusually large amount of messages, greater than 5 standard deviations from the norm. HMAT also provides user interaction, where an administrator can click on one of the circles to reveal all the log messages that define that circle.

Figure 5. Histogram matrix of mail server message distribution, from Abbott et al. 14
Network graphs
Network graphs are graphical models that depict a relationship between two or more nodes, connected by edges.68,69 Herein, the authors employ network graphs to illustrate the interaction between source and destination IP addresses. Of particular interest is identifying The Onion Router (TOR) related IP addresses, port scans, and network fingerprinting attempts. TOR is a network of servers that provides a user with anonymity by relaying their internet traffic through multiple encrypted servers.70,71 Probable TOR nodes can be found and might be related to attempts to access multiple computers. A port scan is the act of determining which ports on a network are open and is thus associated with one source IP connecting to many destination IPs over a short amount of time.72,73 Finally, fingerprinting a network is the act of revealing the presence of cyber security devices. 73 In the graph, each unique IP address is a node. An edge represents the interaction between a source and a destination, with its thickness denoting the frequency of interactions between the nodes. For related work, see Swanson. 74
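The node and edge construction described above can be sketched without a graph library. In this minimal example the event records, the obfuscated-style labels (S1, D1, ...) and the fan-out threshold of 3 are all hypothetical; the edge weight plays the role of edge thickness, and a source with many distinct destinations is only a candidate for scan-like behavior, not a confirmed port scan.

```python
from collections import Counter, defaultdict


def build_edges(events):
    """Edge weights: number of logged events per (source, destination) pair."""
    return Counter((e["src"], e["dst"]) for e in events)


def fan_out(edges):
    """Number of distinct destinations contacted by each source node."""
    destinations = defaultdict(set)
    for src, dst in edges:
        destinations[src].add(dst)
    return {src: len(dsts) for src, dsts in destinations.items()}


# Hypothetical events from one time block.
events = [
    {"src": "S1", "dst": "D1"}, {"src": "S1", "dst": "D1"},
    {"src": "S1", "dst": "D1"}, {"src": "S2", "dst": "D1"},
    {"src": "S3", "dst": "D1"}, {"src": "S3", "dst": "D2"},
    {"src": "S3", "dst": "D3"}, {"src": "S3", "dst": "D4"},
]

edges = build_edges(events)
degree = fan_out(edges)
# Assumed threshold: flag sources that reach 3 or more destinations.
suspects = [src for src, n in degree.items() if n >= 3]
print(edges[("S1", "D1")], degree, suspects)
```

Here S1 produces a thick edge to a single destination, while S3 produces several thin edges to different destinations, the pattern the text associates with scanning.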
TVA for firewall log analysis
Assembling the statistical methods from “Developing a statistical framework for cyber anomaly detection” section into a cohesive firewall log analysis framework involves a systems engineering approach. Here, the authors show how these methods fit together and then illustrate the utility of the framework with an example case study.
TVA approach and process
When incorporating the statistical, graphical, and analytical methods discussed in “Developing a statistical framework for cyber anomaly detection” section, the conceptualization that appeared in Figure 1(b) yields the flow chart seen in Figure 6, which presents the methodology used operationally to exploit log data. The process in Figure 6 is broken up into four subsections: pre-processing, within block analytics, across block analytics, and graphical analytics. Pre-processing takes the raw data and transforms it into a form that can be used for statistical analysis. Statistical analysis utilizes the statistical tools described in “Developing a statistical framework for cyber anomaly detection” section to build the HMAT for anomaly detection. Moving to within block analytics, histograms are utilized to compare the data between blocks in time. Across block analytics assesses the entire dataset for similarities or differences, while graphical analytics focuses on determining the link between observations and IP addresses. Developing a data analysis platform also involved systems engineering to select and incorporate available packages and tools for functionality. R 75 was used due to its growing popularity and open source nature; 76 additionally, R is available as a software tool for big data platforms.

Figure 6. Multivariate and graphical approach to firewall log anomaly detection.
TVA example case study
To illustrate the utility of the proposed firewall log anomaly detection process, the authors examine a representative log file with 39,304 observations and 400+ data features. Due to the nature of the data sources, IP addresses and data fields have been obfuscated to permit the presentation of real data results. Thus, IP addresses will not be in the traditional XXX.XXX.XX.XXX format; obfuscated values will appear in figures and nondescript names (e.g.,
Data pre-processing
Once data is retrieved, the data must be pre-processed and then time regional blocks are created, from which state vectors are extracted and data quality is considered (e.g. multicollinearity adjustments). These steps are necessary to incorporate multivariate and graphical methods for anomaly detection. In this step, variables of interest are either selected or created to aid in the discovery of anomalies.
The data used in this research have been collected from sensors located around the world. While over 400 data fields are collected, for illustrative purposes this research focuses on the fields shown in Table 2. These fields were selected based on (i) commonality between multiple log files and (ii) their ease of demonstrating the proposed methodology without the use of text mining techniques. Since some device vendors can have multiple products, we combine the fields
Table 2. Dataset variables.
Time block creation
Following pre-processing and data cleaning, one then creates time regional blocks. Here, the observations are divided into sequential time blocks of equal length. The 39,304 observations in the example log file were chronologically separated into 136 time blocks, each containing 289 observations. Figure 7 shows the categorical variables labeled as factors and the numerical variables labeled as numeric (integer, double, long, etc.). The number of levels associated with a categorical variable then denotes the number of unique entries. For example, the

Figure 7. Sample of time block creation.
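The block arithmetic of the case study (39,304 observations in blocks of 289 gives 136 blocks) can be reproduced with a short sketch. The record layout, the `timestamp` key and the function name are assumptions made for illustration; the authors' implementation was written in R and is not shown in the text.

```python
def make_time_blocks(records, block_size, time_key="timestamp"):
    """Sort records chronologically and cut them into consecutive blocks.

    Every block holds block_size records, except possibly the last one
    when the record count is not a multiple of block_size.
    """
    ordered = sorted(records, key=lambda rec: rec[time_key])
    return [ordered[i:i + block_size]
            for i in range(0, len(ordered), block_size)]


# Stand-in log with 39,304 records, deliberately supplied out of order.
records = [{"timestamp": t} for t in range(39304, 0, -1)]

blocks = make_time_blocks(records, block_size=289)
print(len(blocks), len(blocks[0]), blocks[0][0]["timestamp"])
```

Because 39,304 is an exact multiple of 289, no partial block is left over in this case; for other file sizes the final short block would need to be dropped or handled separately.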
State vector creation
From the time blocks, numerical matrices are extracted to prepare for statistical analysis. To apply the statistical methods discussed in “Developing a statistical framework for cyber anomaly detection” section, we employ TVA, which uses the feature vector creation method of “Feature vector creation” section to take the pre-defined data attributes and transform them into representative counts using descriptive statistics. Therefore, as each time block is generated, the categorical fields are separated by their levels and a count of occurrences for each level is recorded into a vector. All numerical fields, such as bytes in and bytes out, are recorded as a summation within the time block. Due to the large number of levels associated with IP addresses, only the top 10 source and destination IP address counts are recorded. These vectors are then aggregated into a single matrix, known as the state vector matrix, as seen in Table 3. In Table 3, one sees rows for time blocks 1–5 with a count of occurrences for each device found in the data.
Table 3. Example state vector matrix.
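A state vector matrix of this kind can be assembled as sketched below. Several details are assumptions rather than the authors' method: the field names are hypothetical, categorical columns are taken as the union of levels over all blocks so that every row has the same columns, and the top-N IP addresses are ranked over the whole file (the text does not say whether the ranking is per block or overall).

```python
from collections import Counter


def state_vector_matrix(blocks, categorical_fields, numeric_fields,
                        ip_fields, top_n=10):
    """Build (header, matrix) with one row of counts and sums per block."""
    columns = []
    for field in categorical_fields:
        levels = sorted({rec[field] for blk in blocks for rec in blk})
        columns += [(field, level) for level in levels]
    for field in ip_fields:
        totals = Counter(rec[field] for blk in blocks for rec in blk)
        columns += [(field, ip) for ip, _ in totals.most_common(top_n)]

    matrix = []
    for blk in blocks:
        counts = Counter()
        for rec in blk:
            for field in list(categorical_fields) + list(ip_fields):
                counts[(field, rec[field])] += 1
        row = [counts[col] for col in columns]  # missing levels count as 0
        row += [sum(rec[f] for rec in blk) for f in numeric_fields]
        matrix.append(row)

    header = [f"{f}={v}" for f, v in columns] + list(numeric_fields)
    return header, matrix


# Two tiny hypothetical time blocks.
blocks = [
    [{"device": "Device 1", "src": "S1", "bytes_in": 10},
     {"device": "Device 2", "src": "S1", "bytes_in": 5}],
    [{"device": "Device 2", "src": "S2", "bytes_in": 7},
     {"device": "Device 2", "src": "S1", "bytes_in": 1}],
]

header, matrix = state_vector_matrix(
    blocks, ["device"], ["bytes_in"], ["src"], top_n=1)
print(header)
print(matrix)
```

Fixing the column set before filling in the rows is what makes the result a proper matrix: a device that never fires in a given block still gets a zero in that block's row.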
Multicollinearity adjustment
Prior to any statistical analysis, we automatically inspect the state vector for multicollinearity issues. This prevents us from inadvertently having issues such as matrix singularity, rank deficiency, and strong correlation values; this also removes any columns that pose an issue. The conclusion of this step ensures the data is ready for statistical analysis.
Before the statistical tools, mentioned in “Developing a statistical framework for cyber anomaly detection” section, are applied to the state vector matrix, the columns of the state vector matrix must meet three criteria: (1) the columns must have a variance greater than 0 + Δ1 to avoid matrix singularity, where Δ1 ≤ 0.1; (2) the columns must be linearly independent to avoid computational errors associated with rank deficiency, consistent with; 60 (3) the values of the correlation matrix cannot exceed a threshold of 1 − Δ2, where Δ2 = 0.05. The identified columns are removed and the reduced state vector matrix is ready for multivariate analysis.
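Criteria (1) and (3) can be sketched as a simple column screen. This example makes several assumptions that the text does not state: Δ1 is set to its upper bound of 0.1, the correlation test uses the absolute value, and when two columns are too strongly correlated the later one is dropped. Criterion (2), linear independence, is not covered here because it needs a rank computation.

```python
from statistics import mean, variance


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def screen_columns(matrix, names, delta1=0.1, delta2=0.05):
    """Return the names of columns that pass the variance and correlation
    checks, examined in their original order."""
    cols = dict(zip(names, zip(*matrix)))
    kept = []
    for name in names:
        col = cols[name]
        if variance(col) <= delta1:
            continue  # criterion (1): near-constant column
        if any(abs(pearson(col, cols[k])) > 1 - delta2 for k in kept):
            continue  # criterion (3): too correlated with a kept column
        kept.append(name)
    return kept


# Hypothetical state vector matrix: rows are time blocks.
names = ["a", "b", "c", "d"]
matrix = [
    [1, 2.0, 3, 5],
    [2, 4.0, 3, 1],
    [3, 6.0, 3, 4],
    [4, 8.0, 3, 2],
    [5, 10.5, 3, 3],
]

kept = screen_columns(matrix, names)
print(kept)
```

Checking the variance first matters, since the correlation of a constant column is undefined and would otherwise cause a division by zero.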
Statistical analysis
Once the data has been pre-processed and made to conform to general data quality expectations, our data is ready for analysis. First we can build the HMAT to serve as the foundation to subsequent analysis. From here, the further analysis is analyst driven whereby three directions can be explored: within block analytics, across block analytics, and graphical analytics.
The HMAT in this research utilizes the squared Mahalanobis distance as an outlier detection metric to determine the color of each time block in the HMAT. The breakdown distance enhances the HMAT by adjusting the size of the circles according to its normalized value for each variable. Figure 8 shows the HMAT for the entire dataset, where each row refers to a time block and each column represents a variable. Using the Mahalanobis distance, the anomalous time blocks are distinguished by the darker shade of blue. Then, to determine the size of the circle, we normalize the breakdown distance for each column to distinguish which variables contributed most to the Mahalanobis distance value within each time block.

Figure 8. Histogram matrix (all time blocks).
Thus, from Figure 8, we can observe the big picture of potentially concentrated anomalies. Rows that are shaded darker are anomalous with respect to their Mahalanobis distance (MD), and columns with larger circles indicate the variables that are driving the MD for that particular time block. Looking at Figure 8, we identify the clearest anomalies, shown in the darker blue with the largest circles, as blocks 14, 27, 44, and 63. We also recognize instances where sequential rows appear in the top 20 time blocks, which suggests that the block size may be too small. Moving forward with our case study approach, we select block 14 for further investigation.
Statistical analysis: Within blocks analytics
Once a time block has been selected for analysis, we first explore within block analytics through the use of histograms, as seen in Figure 9. Here, we compare the counts of observations for particular attributes within neighboring time blocks. Once an anomalous time block is detected, histograms are generated to compare the state vector values relative to neighboring time blocks. The histogram shown in Figure 9 displays the frequency of occurrences for the top five columns with the largest

Figure 9. Block 14: top 5 breakdown distance columns.
Based on Figure 9, we further our recommendation for a larger block size since blocks 13 and 14 both have high values for Device 2 and the D2 variable. The destination IP address labeled as D2 within block 14 is destination IP
Statistical analysis: Across blocks analytics
The next direction we explore in statistical analysis is across block analytics. Using FA, we first explore the factor loadings (correlations between the columns of the state vector matrix and the suggested factors), then we compare the factor scores against one another for anomaly detection. To begin using factor analysis, the dimensions of the reduced state vector matrix are first passed to the Horn’s curve function to find the recommended set of eigenvalues. Next, the dimensionality is determined by finding the eigenvalues of the correlation matrix of the state vector matrix and retaining only those factors whose eigenvalues are greater than or equal to those produced by Horn’s curve. The reduced state vector matrix and the number of factors to retain are passed to the factor analysis function. Then, the factor analysis function generates two sets of factor scores and factor loadings, unrotated and rotated. Using the IFS values to assess the quality of our solutions, we select the set of scores and loadings associated with the larger IFS value. The factor loadings are displayed in a correlation heatmap for interpretation of the variable relationships. The factor scores for each factor are plotted against one another for graphical anomaly detection.
After performing factor analysis, we observe the IFS levels presented in Table 2 to assess the quality of our factor analysis solutions. The rotated IFS level is higher than the original IFS level, which is our rationale for using the rotated factor loadings and scores in the subsequent analysis. According to Table 4, a value of 0.6125 is deemed mediocre.
IFS results.
IFS: index of factorial simplicity.
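For readers who want to reproduce the comparison, the IFS can be computed directly from a loadings matrix. The sketch below assumes Kaiser's published definition of the index and his verbal labels for its ranges; the function names and the example loadings are hypothetical, and the labels are assumed to match the scale referenced in the text.

```python
def index_of_factorial_simplicity(loadings):
    """Kaiser's IFS for a loadings matrix (rows = variables, columns = factors).

    IFS = sum_i [q * sum_s v_is^4 - (sum_s v_is^2)^2]
          / sum_i [(q - 1) * (sum_s v_is^2)^2],  with q the number of factors.
    """
    q = len(loadings[0])
    numerator = 0.0
    denominator = 0.0
    for row in loadings:
        squares = [v * v for v in row]
        sum_sq = sum(squares)
        sum_fourth = sum(s * s for s in squares)
        numerator += q * sum_fourth - sum_sq ** 2
        denominator += (q - 1) * sum_sq ** 2
    return numerator / denominator


def ifs_label(value):
    """Verbal label for an IFS value (ASSUMED to follow Kaiser's scale)."""
    for threshold, label in [(0.9, "marvelous"), (0.8, "meritorious"),
                             (0.7, "middling"), (0.6, "mediocre"),
                             (0.5, "miserable")]:
        if value >= threshold:
            return label
    return "unacceptable"


# Each variable loads on exactly one factor: the simplest possible structure.
simple = [[0.8, 0.0], [0.7, 0.0], [0.0, 0.9]]
# Each variable loads equally on both factors: the least simple structure.
complex_ = [[0.5, 0.5], [0.6, 0.6]]

ifs_simple = index_of_factorial_simplicity(simple)
ifs_complex = index_of_factorial_simplicity(complex_)
```

Computing the index for both the unrotated and the rotated loadings and keeping the solution with the larger value mirrors the selection rule used here.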
The heatmap in Figure 10 shows the rotated factor loadings, that is, the correlations between the columns of the reduced state vector and the rotated factors. Strong negative correlations are depicted in red, while strong positive correlations are shown in blue. The factor loading breakdown provides insight into the relationships between variables. For example, in factor 5 we see that two devices, Device 4 and Device 13, are directly related to the geographic variables Country 7 and Country 10. While the true relationship between these variables is unknown, we may presume that these devices are set up to capture signatures from those locations. Looking at factor 1, we notice that four devices as well as the two main protocols are highly correlated with observations coming from the geographic locations of Country 16, Country 17, Country 18, and Country 29. It is highly likely that these locations are associated with known TOR exit nodes. Interestingly, factor 5 also reinforces the relationship seen in the histogram in Figure 8, where observations sourced from Country 10 and detected by Device 13 are correlated with high occurrences.

Heatmap of rotated factor loadings.
The next step of FA involves projecting the data onto the factors. Figure 11 contains four subplots for this step: the top left plots rotated scores 1 on the x-axis against rotated scores 2 on the y-axis; the top right plots rotated scores 3 against rotated scores 4; the bottom left plots rotated scores 5 against rotated scores 6; and the bottom right plots rotated scores 3 against rotated scores 5. Although rotated scores 1 explain the most variation in the data, followed by rotated scores 2 and so on, anomalies are not apparent until one examines rotated scores 3 and 5. Based on these plots, we can clearly see the anomalous time blocks, such as blocks 27, 63, 14, and 44.

Rotated factor score plots.
Statistical analysis: Graphical analysis
The final analysis direction we examine is graphical analytics. This encompasses both the HMAT and the IP network graphs. The purpose of the network graphs is to visualize the connections between source IP addresses and the destination IP(s) they attempted to connect to. While not directly a statistical technique, this method allows for rapid visual cues to understand the IP dynamics within the dataset or a specific time block.
In Figure 12, we display the network graph for time block 14. At first glance, there is a noticeably large cluster on the bottom left (circled in red), where multiple nodes are connected to a single node. We take a closer look at this region in Figure 13. Looking at Figure 1, we first pinpoint the source IP address

Block 14 IP network graph.

Block 14 IP network cluster investigation.
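The cluster that stands out visually in such a network graph corresponds to a destination node with many distinct incoming sources. A minimal sketch of that idea, assuming the log events for a time block are available as (source, destination) pairs, is shown below; the function names and the placeholder labels are hypothetical and do not come from the dataset.

```python
from collections import defaultdict


def build_ip_graph(events):
    """Map each destination to the set of distinct sources that contacted it."""
    sources_by_destination = defaultdict(set)
    for source_ip, destination_ip in events:
        sources_by_destination[destination_ip].add(source_ip)
    return sources_by_destination


def largest_clusters(events, k=1):
    """Return the k destinations with the most distinct sources, as (dest, count)."""
    graph = build_ip_graph(events)
    ranked = sorted(graph.items(), key=lambda item: len(item[1]), reverse=True)
    return [(destination, len(sources)) for destination, sources in ranked[:k]]


# Placeholder labels stand in for anonymized IP addresses within one time block.
events = [
    ("S1", "D1"),
    ("S2", "D2"),
    ("S3", "D2"),
    ("S4", "D2"),
    ("S5", "D2"),
    ("S1", "D3"),
    ("S2", "D2"),  # a repeated connection counts once toward the cluster size
]
hubs = largest_clusters(events)
```

Ranking destinations by the number of distinct sources gives a quick numeric counterpart to the visual cue, pointing the analyst to the nodes worth inspecting in the graph.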
Embedded analytics
Analytic capabilities within organizations have, historically, been dominated by proprietary software technologies. Unfortunately, these technologies often lack availability, innovation, interoperability, flexibility, and transparency. 77 Likewise, incorporating the analytic approach illustrated herein into the existing proprietary software used by cyber analysts would take significant resources (i.e. time, money), of which most organizations have little to spare. In recent years, there has been an increasing transition away from proprietary software and towards open source software, both within federal organizations and across industry. Open source software is software that is voluntarily developed and extended by users according to their organization’s needs and made freely available to all. 78 For analytic purposes, open source software allows analysts to customize analytic processes and products specific to their organization. Consequently, open source software has emerged as a major cultural and economic phenomenon 79 and illustrates the trend toward developing user innovation around analytic capabilities to increase an organization’s performance. 80 The collaborative model offered by the open source ecosystem can potentially change the analytic nature of organizations by increasing innovation and technology adoption even when resources are constrained. 81
The transition towards open source software allows us to operationalize and embed our analytic approaches into systems and business processes for more efficient analytic efforts. 82 In this research, the authors developed two forms of embedded analytics: an open source R package (anomalyDetection 83 ) and a Shiny application that is employed by cyber analysts for operational analysis of log data.
anomalyDetection R Package
anomalyDetection is an R package that provides quantitative cyber analysts the ability to effectively and efficiently implement our methodology. anomalyDetection provides 13 functions to aid in the detection of potential cyber anomalies. These functions employ the methods presented in this paper and described in Gutierrez et al. 84
Shiny Application
Due to the high volume of incoming data, cyber analysts may not always have the time to manually compute and analyze the data for anomaly detection using the anomalyDetection R package. To fully integrate the authors’ methodology into the workflow of cyber analysts operating on a big data platform, a web-based embedded analytic was developed so that analysts can execute the analytic approach efficiently over multiple time periods and data sources. R Shiny was used to develop this second form of embedded analytic. Shiny is an R package that provides a framework for building interactive web applications using R. The web application provides a means for the user to upload new data files and to adjust the block size and the number of IP addresses to consider. The web application then performs the analytic methodologies discussed throughout this paper and provides results in the form of interactive graphics and tables to help the cyber analyst detect anomalies. This gives cyber analysts an efficient way to analyze significant amounts of data while ensuring the methodological approach is valid and consistent. Example screenshots of the transitioned tool are presented in Figure 14.

Screenshots of embedded web analytics application.
Conclusions
Cyber attacks continue to be a growing concern for organizations. Unfortunately, the process of analyzing log files has, historically, been unorganized and lacked efficient approaches. This research presented an analyst-aided approach that makes the log file analysis process more efficient and facilitates the identification and analysis of potential anomalies. First, a state vector approach was developed to facilitate the identification and analysis of anomalies in log files. Second, multivariate statistics and graphical methods such as the Mahalanobis distance, factor analysis, and histogram matrices were combined in an analyst-centric approach for outlier detection. Third, this research introduced the breakdown distance heuristic as a decomposition of the Mahalanobis distance, which indicates which variables and time blocks contributed most to its value. Finally, we illustrated how open source programming was used to operationalize our methodology.
Consequently, this research contributes to the field of network intrusion detection by demonstrating a comprehensive systems engineering approach to prepare log file data, apply multivariate and graphical methods to narrow the search window for log file analysis, and embed the analytic process to ensure anomaly detection approaches are reproducible and efficiently deployed.
