Abstract
Introduction
The topic of exploratory data analysis (EDA) as a distinctive tool in applied statistics was created by John W. Tukey in his 1977 book “Exploratory Data Analysis.” That book renewed the topic of descriptive statistics and enlightened three main strategies that have become crucial in modern data science: (1) graphical presentation, (2) flexibility in viewpoint, and (3) intensive search for simplicity. The methods of EDA do not present
A major feature of modern applied statistics work is the widespread use of graphical displays of the data. This has been grounded on the methods of EDA developed by Tukey in the late 1980s, together with the advancement of graphical capabilities in computer sciences. The discipline of graphical displays has evolved as an entire discipline of statistics (e.g., Chambers et al., 1983; Cleveland, 1993; Downey, 2014; Healey & Enns, 2002; Hoaglin et al., 1983; Myatt & Johnson, 2014). All current statistical software (SPSS, Stata, etc.) have ways to display data that were not present just a few years ago.
This article revisits EDA tools, applying the boxplot and the scatterplot to databases with many observations. We focus on methods that assess variation across groups, point to outliers, disclose clustering of cases, and highlight non-linearities in relationships between variables. The role of EDA is to explore graphically data in ways that could reveal structural secrets and to gain new, often unsuspected, insight into the data. The EDA methods discussed align with recent arguments that graphs usually create “pre-attentive” visual processing in the brain, helping to focus on the important message (Camacho et al., 2015; Healey & Enns, 2002; Hussain & Prieto, 2016; Levine, 2018). Cheng et al. (2013) and Schwabish (2014) suggest that researchers who want to disseminate their research to a wider audience of non-specialists should think carefully about how to construct effective graphics. Similar recommendations are made in the papers of Jebb et al. (2017) and Varian (2014). In reflecting on statistics for management, George et al. (2014) conclude that with large data sets it is too easy to get false correlations when using typical statistical tools. The EDA methods can produce a set of visualizations that simplify the understanding of complex data, which will be increasingly useful to researchers in management, business, and social sciences in general, who often face the challenges associated with using large databases in their research.
The classical inferential methods of statistics become less useful in very large samples, since test results are nearly always significant, and the standard errors are very small. EDA could help in this case by emphasizing data visualization and dimension reduction methods. Traditional descriptive statistical analysis faces problems with large data sets. For example, a bivariate relationship that would be clear in a basic scatterplot with relatively few observations may not be readily apparent in a data set containing a million observations (e.g., see Figure 1 where a simple scatterplot is used). A large part of the literature on big data concentrates on computer-intensive methods, such as machine learning or regression trees, that emphasize prediction (for more details about these tools, see Qiu et al., 2016; Al-Jarrah et al., 2015). In our study, rather than big data, we concentrate on the case of large database (large sample size), see De Mauro et al. (2016) for terminology on big data. In contrast, EDA methods focus on methods that serve to explain and describe data and should be among the first steps when analysts want to explore large data sets. We believe that there is gap in the literature of big data analysis on the subject of exploring and presenting statistical features of large data sets. This article focuses on methods that can fill part of this gap. Our main contribution is to show how traditional, simple methods should be used in the context of the new paradigm of big data. It should be recognized that descriptive tools do not give yes/no answer to basic research questions, unlike the classical testing used in econometrics. However, the EDA methods have the ability to show shifts, heterogeneity, outliers, and non-linearities among variables, without the requirement of model assumptions. EDA methods should be seen as complementary, not only prior to the classical econometric methods but also to give support or rejection to a priori posed econometric models. We show how EDA methods are also useful after a regression fit in residual diagnostic plots.

Example of scatter plot of log-wage by age.
To better convey the practicalities of the methods proposed, we illustrate them by analyzing the variation of salaries in the Spanish Social Security (SS) database. In particular, we address such issues as whether the observed difference in wages can be explained by the following: the rigidity of labor market (temporary vs. permanent contract), discrimination (women vs. men), and age (old vs. young). Previous work involving the Spanish labor market has focused on fitting models to test aspects of labor economic theories (e.g., Dolado et al., 2002; Gehrke & Weber, 2018). In this article, we deviate from this testing approach in favor of descriptive methods that permit direct, visual assessments of the questions explored. The structure of the article is as follows. Section “SS data” describes the database. In section “The comparative boxplots: log-wage by age groups,” we discuss group variation in wages using the boxplots. Section “Scatterplot: The heat-map” discusses the use of heat-map scatterplots. In section “The heat-map scatterplot in regression diagnostics,” we apply the heat-map approach to a classical residual diagnostic plot, and to the comparison between OLS and Tobit regression. Finally, in the last section, conclusions are presented.
SS data
To illustrate the EDA methods, we use the Spanish SS data on labor, specifically, the “Muestra Continua de Vidas Laborales” (hereafter, MCVL) in 2010. 1 The data comes from the register of the SS System for people active in the labor market. Starting in 2004, SS records have been released for a 4% non-stratified random sample of the population who in that year have had any relationship with Spanish SS (individuals who are working, receiving unemployment benefits, or receiving a pension). Given the structure and magnitude of the MCVL, this is a useful data set with which to illustrate the EDA approach we advocate. The rest of this section is devoted to describing briefly the statistics and economic labor context of this database.
The data set gives information regarding historical relationships of individuals with the SS relating to work, unemployment benefits, and pensions, for around 1 million observations each year. It also contains information regarding the type of contract, sector of activity, qualifications, earnings, date of entering or leaving the job market, part-time or full-time status, and firm size. The MCVL also provides individual characteristics such as age, gender, residence, country of birth, and level of education. Information on educational attainment has improved in recent editions of the MCVL. The main outcome variable of interest in our data set is the earnings of Spanish people. The MCVL only provides information on the “social contribution base” (censored earnings), which captures monthly labor earnings plus 1/12 of bonuses received over the year. The censored earnings variable has minimum and maximum values, which vary over time. 2 We exclude pensioners, the self-employed, individuals receiving unemployment benefits, individuals who report outlying earning values (outside the minimum and maximum values), and individuals with missing information in relevant variables (age, education, type of contract, occupation). This leaves a sample of 541,457 individuals. We calculate the daily wage as our main earnings measure, computed as the ratio between the monthly contribution base and the number of days worked in that particular month. If the individual records more than one job at the same time, we sum the earnings. Other years and variables of course could be considered, but these will be beyond the scope of this article, which is not focused on exploring the Spanish labor market.
We analyze the variation of wage with respect to variables: age, gender, type of contract (fixed [permanent] or temporary), education (primary, secondary, and tertiary), and occupation (high-, medium-, and low-skilled). These variables are traditionally considered in the literature of labor markets (Heckman et al., 2006). Those variables that are found to be significant in a regression analysis of wages are reported below. Note, however, that significance within such a large sample does not provide reliable evidence on the substantive relevance of these variables. As such, visual inspection of the variation of wage with respect to those variables is necessary. We extract information about qualified and non-qualified employees, splitting the sample by qualification (high-, middle-, and low-skill levels) according to the International Standard Classification of Education (ISCED) classification and following the classification proposed by Garcia-Perez (2008) for the MCVL data. We consider the type of contract held (fixed or temporary) and we construct five age groups defined by the following intervals: (15, 25], (25, 35], (35, 45], (45, 55], and (55, 65]. 3 Table 1 reports descriptive statistics of the wage by gender, occupation, education, type of contract, and age.
Descriptive Statistics of Daily Wage.
Older people generally experience higher wages. Workers with a fixed contract and highly skilled jobs also have increased wages, as do workers who have reached higher education levels. This table shows a wage gap between male and female workers of around 17.50%. This observation of a gender gap in the Spanish labor market is well documented (see, for example, Amuedo-Dorantes & De la Rica, 2006).
The next sections illustrate the use of EDA for assessing the variation of wages across workers’ characteristics.
The comparative boxplots: log-wage by age groups
The boxplot was introduced by Tukey (1977) as a way to present graphically the distribution of a continuous variable and also to display the variation of a continuous variable across groups. Boxplots display the first, second, and third quartile; the interquartile range; and outliers of a database. The information displayed by the boxplot, and most of its variations, is based on the data’s median. The 50% central bulk of the data is represented by a box; two whiskers, one at side of the box, represent the two 25% extremes of the data distribution. A line dividing the box represents the median. Symmetry/skewness of the distribution is visualized graphically by the position of the dividing line of the box: symmetry occurs when the line divides the box in two equal halves, and skewness occurs when one half is larger than the other. The boxplot is very useful for visualizing group dispersion in a variable, where the boxplot of each group is set in parallel (vertically, or horizontally), one besides the other. The boxplot is of limited use when the sample size is small, thus displaying variation across groups using the boxplot requires a large sample.
Note that in the case of a very large sample, the
Wages can be presented using raw values or after a log transformation. The advantage of the log transformation for group comparison is the approximate normality of the distribution. Hence, the comparative boxplots will have reduced the skewness of the wage distributions and will make it easier to compare the ends of the wage scale (see Hubert & Vandervieren, 2008). We use the log base 10 transformation; however, it should be noted, that once a researcher finds interesting variation at the log scale, they can easily display the same data in the original scale.
Figure 2 shows parallel boxplots of log-wage for different age groups. For each cohort, the box and the whiskers span the whole variation of log-wage in the group. The edges of the box are the first and third quartile; the median is the dividing line of the box. Points exceeding the solid line of each of the whiskers are cases that could be categorized as outliers.
4
The

Boxplot of log-wage by age.
A feature we have added to the standard boxplot display is to make the width of the box proportional to the size of the group (number of individuals in each group) in the whole sample. This makes it easy to assess the relevance of a specific group; for example, we see that the age group (55,65) represents a smaller proportion of workers than the age group (35,45).
The following findings can be drawn from this diagram:
There is a slight curvilinear increase in the median level of log-wage with age, with the first age group (15,25] showing the lowest median log-wage that is different from the other groups.
There is slight variation across the groups in the dispersion of log-wage, as measured by the interquartile range (the length of the box). The highest dispersion arises in the two most extreme age groups.
The width of the boxplots shows the proportion of the population in the group, and we see that the groups (25,35] and (35,45] are the widest.
While there is a significant increase in wage when going from group (15,25] to group (25,35] (the youngest groups), there is a slight decrease in the median of log-wage when moving from (45,55] to (55,65].
The log transformation normalizes (symmetrizes) the distribution for all groups, except for the youngest group that still shows a long-tail on the right. This asymmetry for the salary of young workers could raise policy discussion.
These findings are congruent with previously established economic empirical research (Cabrales et al., 2017), which states that skills and experience (age would be a surrogate) impact positively on earnings. We now consider group variation when we cross two or more categorical variables.
Variation of log-wage by age groups and a third categorical variable
Log-wage by age can vary when controlling for different additional variables, such as occupation, level of education, type of employment, and gender. Figure 3 displays variation of log-wage on age for each of those aforementioned variables. Panel A shows that there is a gender gap in wages for all age groups except for the two youngest. The gap reaches a maximum of 24 points (of log-wage) for those aged 45 to 55 years. The graph shows that the median (log) salary gap between men and women increases with age, although it is important to note that we do not control for other variables.

Boxplot of log-wage—Panel A: age and gender; Panel B: age and occupation; Panel C: age and education; and Panel D: age and type of contract.
Panel B of Figure 3 depicts the variation of log-wage on age for occupations classified as low-, middle-, and high-skilled workers. The proportion of people in the different groups is shown at the top of the graph. We observe the middle skills group is the largest and accounts for 63% of all workers. The width of the boxplots represents the relative size of the group. The group of young people (ages 15–25 years) is the smallest and has the lowest log-wage, and this trend occurs for the three skill levels. As expected, the group of highly skilled workers is the one with highest earnings.
Panel C shows the variation of log-wage on age for the three educational levels: primary, secondary, and tertiary. As expected, people with a tertiary level of education receive the highest wages (see the median levels of each group). Note that, the size of the group decreases with age for all education levels (as is illustrated by the width of the box). Furthermore, the plot shows that the highest variation of wage by age occurs in the secondary education level.
Panel D shows the variation of log-wage on age for the two types of contract: permanent and temporary. We see that the highest proportion of contracts are permanent, with the highest salaries. In both groups of temporary and fixed contracts, median salaries remain fairly homogeneous across age, though in temporary contracts, we see a slight decrease of salary as age increases.
Figure 3 describes, in a four-display graph, the variation of wage conditional on age and one additional variable (specifically, gender, occupation, education, or type of contract). Note that one could also consider a five-display graph where each display shows the boxplot relative to one or two of the covariates for each age group. The necessity of such additional graphs would arise from the inspection of the ones at hand. As the main purpose of the article is to highlight useful EDA devices, we do not comment further on these additional graphs.
This graph allows researchers to assess directly how the distribution of salaries changes with type of contract in age group, or between men and women. Note that this type of comparison, considering variation of wages in groups defined by crossing many categorical variables, will be possible only in the context of large data. Otherwise, the sub-samples would be small and the boxplot would become non-informative. Variation conditional on a third variable will be introduced in the next section.
Exploring the gender gap in log-wage
In this section, we explore the gender wage gap. Reducing the gender wage gap is an important topic on the European political agenda. The persistence of the gender wage gap is the result of direct discrimination against women and/or a structural inequality, such as segregation in sectors or occupations, access to education and training, or biased evaluation and/or pay systems. We present evidence showing the difference between male and female earnings, considering several factors that could explain not only the wage gender gap but also the selection of women into jobs with certain characteristics.
In Panel A of Figure 4, we see a gender gap in wages for all age groups. The log-wage gaps between males and females persist after controlling for age. The proportion of workers who are young is larger for males than females, while among females there is a smaller proportion of older workers (this is seen from the width of the boxplot). There is a wage gap between younger females’ wages than wages of all other female age groups. This pattern does not occur for males. The dispersion of wages is similar across age and gender. Panels A to C of Figure 4 shows the variation of log-wage by gender, controlling for additional characteristics of workers: occupational skills, level of education, and type of contract.

Boxplot of log-wage and gender—Panel A: age, gender and type of contract; Panel B: age, gender and occupation; and Panel C: age, gender and education.
Panel A shows that, while the gender wage gap is not apparent in young groups with temporary contracts, it is clear among workers with permanent contracts. We also see that few women in the older age group hold permanent contracts.
In Panel B, we can see that the boxplots of high-skill groups are thinner. This reflects official statistics which report a lower proportion of workers in the high-skill group. Among high-skill workers, the gender wage gap is minimal. The maximum gender gap arises in the middle skills group and is largest among the young. The gender wage gap is also apparent among low-skill workers but is smaller.
Finally, Panel C shows that when controlling for the level of education, a gender gap in wage is still visible at all levels. The gap is at its highest for the second level of education and greatest for the younger women at this level. There are more women than men in the group defined by lowest age and higher level of education (see the thickness of the boxplot for the tertiary education level for females aged 15–25 years).
Panel A shows that, while the gender wage gap is not apparent in young groups with temporary contracts, it is clear among workers with permanent contract. We also see that few of the women in the higher age group hold permanent contracts.
Panel B shows that the high-skill groups have thinner boxplots. This reflects official statistics which report a lower proportion of workers in the high-skill group. Among high-skill workers, the gender wage gap is minimal. The maximum gender gap arises in the middle skills group and is largest among the young. The gender wage gap persists among low-skill workers but is smaller.
Finally, Panel C shows that, when controlling for the level of education, a gender gap in wage is still visible at all levels. The gap is at its highest for the second level of education and more for the younger women at this level. It is interesting to see that there are more women than men in the group who have a higher level of education and lowest age (see the thickness of the boxplot for the tertiary education level for females aged 15–25 years).
Scatterplot: the heat-map
So far, we have assessed the variation of a continuous variable

Heat-map of density plot for daily wage by age.
Using the heat-map approach, we can assess the variation of daily wage on age when controlling for other variables, as shown in Figure 6. Looking at Panel A of Figure 6, we can see that women have lower wages than men. Yet, we observe that there are few older women working. The modal concentration 7 at the top wage level is of lower intensity for women, showing that there is a smaller proportion of women in this “ceiling effect” group.

Heat-map of density plot for daily wage and gender—Panel A: gender and age; Panel B: education and age; Panel C: type of contract and age; and Panel D occupation and age.
Panels B and D show daily wage variation on age when controlling for occupation and education levels. They show that high-skilled workers and those with tertiary level of education have the largest earnings, with many in the ceiling group. The variation of wages across skill and education levels show, as expected, that wages increase with skill and education. Individuals with primary level of education, or in the low-skill group, present larger heterogeneity, not only across wage levels but also across age. Fitting a nonparametric regression line to each of the scatterplots, we see a steady, near-linear increase in wage with age. Panel C shows the variation of salary by age when controlling for the type of contract. We see that in contrast to permanent (fixed) contracts, workers with temporary contracts show significant wage dispersion, concentrated in a small range of age variation. Furthermore, the ceiling effect (of high salary workers) appears only to impact those with permanent contracts. The modal wage for temporary contracts is lower than the modal wage for permanent contract workers (if we discount the ceiling group, then the modal wage for the permanent contracts intersects with the highest modal wage salaries of the nonpermanent contracts). The results in this section are in line with many studies on the Spanish labor market. For more details see, for example, Cabrales et al. (2017).
The heat-map scatterplot in regression diagnostics
The relationship between wage and age is now explored using a regression approach. We assume that the expected value of the dependent variables
In this section, we illustrate the use of regression residual plots in the context of our labor data, especially when assessing the impact of age, gender, education, and other worker characteristics on our dependent variable, namely, the log of the daily wage.
We fit three regression models using the log daily wage as the dependent variable and the covariates listed in the first column of Table 2. In the covariates, we have categorical variables (represented in the regression by dummies of category) with reference categories detailed in the footnote of the table. The second column of this table shows the estimates for the fitted OLS regression using the entire sample, the third column reports the Tobit regression, and the fourth column reports the OLS regression excluding all the cases where the log daily wage is at the ceiling point 4.506. As expected, given the large sample, all the regression coefficients are statistically significant. We have not reported in the table the standard errors, since the estimates are 0.00 for all the coefficients when rounded at three decimal places (the rounding we use for all the numbers of the table). The significance has been computed using the robust standard errors to adjust for unknown heteroscedasticity.
Estimates of the three regression models. The dependent variable in all the models is the log daily wage. a .
The reference categories for the dummies of education, occupation, gender, type of contract, and sector are, respectively, primary education, high-skill occupation, male and temporary contract, and agriculture. The municipalities fixed effects are also included in all the regressions.
We have excluded in the sample all the cases when the log daily wage it is at the ceiling value at the log of 106 euros (4.663).
All the coefficients of the three regressions are significant at
After a regression fit, it is useful to inspect the residuals-versus-fitted-

Heat-map of the residuals of regression analysis for daily wage—Panel A: All sample; Panel B; Without censored data; and Panel C: Tobit model (all sample).
In Panel A, we note a bimodality of the bivariate distribution, with a high density of points on the right-hand side of fitted
As mentioned in the section above, earnings in the MCVL data suffer from censoring: any individual with a daily wage above the threshold of 106 euros is recorded to have a wage equal to this threshold. If we eliminate individuals from the regression who have been censored, we observe how the positive residuals disappear in the heat-map of residuals-versus-fitted-
The alternative estimates shown in Table 2 should be complemented with the residuals shown in Figure 7. The full picture of the fitted model cannot be assessed without the estimated residuals. Note that OLS estimates do not account for the censoring shown in the residuals of Figure 7 (Panel A) (where we see that a line of censored residuals on the top right). The residuals from a Tobit regression shown in Figure 7 (Panel C) have already accounted for the censoring. The censoring is not visible in Figure 7 (Panel B) because the censored cases have been removed from the estimation.
Conclusion
The analysis of large data sets is increasingly popular in business and social science research, and there are many new opportunities to extract useful empirical evidence from data. Extensive research in economics has been devoted to the econometrics of regression (or extended regression) models. In this article, we present how EDA methods that focus on graphical displays of data can help researchers to inspect large databases and explore heterogeneity across groups of variables. In particular, we focus on the use of boxplots and heat-maps. EDA methods are useful not only as ways to present descriptive statistics but can also help with robustness check by selecting appropriate models to use in regression analysis.
Using SS data from Spain, this article illustrates how EDA can be used to highlight group variation of critical variables in the labor market and how they interact. We use boxplots and heat-map scatterplots to assess the heterogeneity of earnings on basic covariates, such as gender, contract status, experience, skills, and others. We also show how the heat-map scatterplot can be used for the basic diagnostic plots on regression analysis when the researcher is confronted with a very large data set.
With large databases, there are problems associated with the use of traditional statistical models and EDA offers instruments that can help researchers overcome some of these issues. These methods complement predictive approaches to big data, such as machine learning, random forest. Although it should be noted that EDA methods have limitations, as they do not permit statistical confirmation or refusal of a causal hypothesis. For academic researchers, the EDA methods presented should contribute with intuitive and simple tools to extract useful information from the data, regarding issues such as assessing across-group variation, latent heterogeneity of the data, and post-model-fit analysis. This information should help researchers uncover aspects that contribute to a more solid descriptive analysis and more accurate statistical modeling.
The possible extensions of the EDA methods discussed in this article for large databases are many. We have concentrated just on few features which we feel are immediately available to practitioners using the current free software. One area where we are currently working, and that we feel is also a promising avenue for research, is the visual exploration of longitudinal data in the case of large sample size. The heat-map methods discussed above can serve to detect data heterogeneity on the longitudinal evolution of crucial variables, or residuals of a post model fit.
Supplemental Material
appendix – Supplemental material for Exploratory data analysis on large data sets: The example of salary variation in Spanish Social Security Data
Supplemental material, appendix for Exploratory data analysis on large data sets: The example of salary variation in Spanish Social Security Data by Catia Nicodemo and Albert Satorra in Business Research Quarterly
Supplemental Material
rcode – Supplemental material for Exploratory data analysis on large data sets: The example of salary variation in Spanish Social Security Data
Supplemental material, rcode for Exploratory data analysis on large data sets: The example of salary variation in Spanish Social Security Data by Catia Nicodemo and Albert Satorra in Business Research Quarterly
Footnotes
Authors’ Note
Declaration of conflicting interests
Funding
Supplemental material
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
