Abstract
Introduction
The etymology of ‘Big Data’ has been traced to the mid-1990s, first used by John Mashey, retired former Chief Scientist at Silicon Graphics, to refer to handling and analysis of massive datasets (Diebold, 2012). In 2001, Doug Laney detailed that Big Data were characterised by three traits: volume, velocity and variety (the ‘3Vs’).
Since then, others have attributed other qualities to Big Data, including exhaustivity, fine-grained resolution, indexicality, relationality, extensionality and scalability.
Uprichard (2013) notes several other v-words that have also been used to describe Big Data, including: ‘versatility, volatility, virtuosity, vitality, visionary, vigour, viability, vibrancy… virility… valueless, vampire-like, venomous, vulgar, violating and very violent.’ More recently, Lupton (2015) has suggested dropping v-words to adopt p-words to describe Big Data, detailing 13: portentous, perverse, personal, productive, partial, practices, predictive, political, provocative, privacy, polyvalent, polymorphous and playful. While useful entry points into thinking critically about Big Data, these additional v-words and new p-words are often descriptive of a broad set of issues associated with Big Data, rather than characterising the ontological traits of the data themselves.
Comparing small and Big Data.
Characteristics of survey, administrative and Big Data.
Source: Florescu et al. (2014: 2–3) and Kitchin (2015)
In contrast, rather than focusing on the ontological characteristics of Big Data, some define Big Data with respect to the computational difficulties of processing and analyzing it, or in storing it on a single machine (Strom, 2012). For example, Batty (2015) contends that Big Data challenge conventional statistical and visualization techniques, and push the limits of the computational power needed to analyze them. He thus contends that we have always had Big Data, with the massive datasets presently being produced merely the latest form of Big Data, which require new techniques to process, store and make sense of them. Murthy et al. (2014) categorise Big Data using a six-fold taxonomy that likewise focuses on its handling and processing rather than key traits: (1) data ((a) temporal latency for analysis: real-time, near real-time, batch; and (b) structure: structured, semi-structured, unstructured); (2) compute infrastructure (batch or streaming); (3) storage infrastructure (SQL, NoSQL, NewSQL); (4) analysis (supervised, semi-supervised, unsupervised or reinforcement machine learning; data mining; statistical techniques); (5) visualisation (maps, abstract, interactive, real-time); and (6) privacy and security (data privacy, management, security).
Regardless of how Big Data have been defined it is clear that, despite widespread use, the term is still rather loose in its ontological framing and definition, and it is being used as a catch-all label for a wide selection of data. The result is that these data are characterised as holding similar traits to each other and the term ‘Big Data’ is treated like an amorphous entity that lacks conceptual clarity. However, for those who work with and analyze datasets that have been labelled as Big Data it is apparent that, although they undoubtedly share many traits, they also vary in their characteristics and nature. Not all of the data types that have been declared as constituting Big Data have volume, velocity or variety, let alone the other characteristics noted above. Nor do they all overly challenge conventional statistical techniques or computational power in making sense of them. In other words, there are multiple forms of Big Data. However, while there has been some rudimentary work to identify the ‘genus’ of Big Data, as detailed above, there has been no attempt to separate out its various ‘species’ and their defining attributes.
Ontological traits of Big Data.
bkgrd comms: constant background passive monitoring.
Our aim in performing this analysis is not to determine a tightly constrained definition of Big Data – to definitively set out precisely the nature of Big Data and their essential qualities – but rather to explore the parameters, limits, and ‘species’ of Big Data. The analysis is thus an exercise in boundary work designed to test the edges of what might be considered Big Data and to internally tease apart what is presently an amorphous concept to reveal its inner diversity – its multiple forms. In other words, we consider in much more detail than previous studies the ontology of Big Data. This is an important exercise, we believe, as it enables the production of much more conceptual clarity about what constitutes Big Data, especially given the ongoing confusion over its traits and its amorphous description. In turn, acknowledging and detailing the various types of Big Data facilitates a much more nuanced understanding of its forms, its value, and how they might be analyzed and for what purposes.
The parameters of Big Data
In Table 3 we have mapped 26 sources of data, defined as Big Data within the literature, against the traits identified by Kitchin (2014) in Table 1. Through the process of evaluating each dataset against each characteristic it quickly became apparent that the categories of volume and velocity needed to be further teased apart. Similarly, while resolution and indexicality, and extensionality and scalability, are combined into two characteristics in Table 1, we consider them separately in Table 3 given that they are not synonymous traits.
In the context of Big Data, volume generally refers to the storage space required to record and store data. Big Data, it is commonly stated, typically require terabytes (2^40 bytes) or petabytes (2^50 bytes) of storage space (The Economist, 2010), far more than an average desktop computer can provide, with the data often stored in the cloud across several servers and locations. However, when we examine our 26 datasets it is clear that some of them, for example pollution and sound sensors, require very little storage space, maybe producing a gigabyte (2^30 bytes) of data per annum (easily storable on a datastick). Although each sensor might be producing a steady stream of readings, say once per minute, each record is very small, consisting of just a few kilobytes (a kilobyte being 2^10 bytes). Even summed over the course of a year, the sensor dataset would be relatively small in stored volume, in fact much smaller than many ‘small datasets’ such as a Census. As detailed in Table 3, we have thus teased apart volume into three dimensions: (1) the number of records (which is reflective of velocity and the number of generating devices), (2) the storage required per record, and (3) the total storage required (effectively the product of the first two).
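The threefold breakdown of volume can be made concrete with a little arithmetic. The following is a minimal sketch, not part of the original analysis: the names `VolumeProfile` and `records_per_year` are hypothetical, the record size of 2 KiB is an assumed stand-in for ‘a few kilobytes’, and total storage is taken as the number of records multiplied by the storage per record.

```python
from dataclasses import dataclass

KIB = 2**10  # bytes in a kilobyte (binary convention, as in the text)
GIB = 2**30  # bytes in a gigabyte


@dataclass(frozen=True)
class VolumeProfile:
    """Hypothetical container for the three dimensions of volume."""
    records: int           # dimension 1: number of records
    bytes_per_record: int  # dimension 2: storage required per record

    @property
    def total_bytes(self) -> int:
        # dimension 3: total storage required
        return self.records * self.bytes_per_record


def records_per_year(devices: int, seconds_between_readings: int) -> int:
    """Records generated in a 365-day year by devices reporting at a fixed interval."""
    seconds_per_year = 365 * 24 * 60 * 60
    return devices * (seconds_per_year // seconds_between_readings)


# Assumed example: one sensor, one reading per minute, 2 KiB per reading.
sensor = VolumeProfile(
    records=records_per_year(devices=1, seconds_between_readings=60),
    bytes_per_record=2 * KIB,
)

print(f"records per year: {sensor.records:,}")
print(f"total storage:    {sensor.total_bytes / GIB:.2f} GiB")
```

Under these assumptions a single sensor yields roughly half a million records but only about one gigabyte per annum, which illustrates how a dataset can be large in record count yet small in stored volume.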
Using this threefold classification of volume it is clear that the 26 Big Data sets have differing volume characteristics. Automated forms of Big Data generation, where records are created on a continual basis every few seconds or minutes, often across multiple sites or individuals, produce very large numbers of records. Human-mediated forms, such as creating administrative records (immigration, unemployment registration), might have a steady stream of new records, usually generated from a constrained number of sites (a small number of entry points to a country, unemployment offices), that produce much lower volumes than automated systems. Likewise, while each sensor record is generally very small in file size, imagery data (such as streaming video, photographs and satellite images) are typically quite large in file size, meaning that relatively low numbers of records soon scale into huge storage requirements. In many cases, although the volume per record is low, the sheer number of devices generating data produces very large storage volumes. For example, the million customers flowing through thousands of Walmart stores every hour generate 2.5 petabytes of transaction data (Open Data Center Alliance, 2012).
Velocity is considered a key attribute of Big Data. Rather than data being occasionally sampled (either on a one-off basis or with a large temporal gap between samples), Big Data are produced much more continually. When we examined our datasets, however, it became apparent that there are two kinds of velocity with respect to Big Data: (1) frequency of generation; (2) frequency of handling, recording, and publishing; and that the 26 datasets varied with respect to these two traits. In terms of frequency of generation, data can be generated in real-time constantly, for example recording a reading every 30 seconds or verifying location every 4 minutes (as many mobile phone apps do), or in real-time sporadically, for example at the point of use, such as clickstream data being generated in real-time but only while a user is clicking through websites, or an immigration system recording only when someone is scanning their documents.
In some cases, as the data are recorded, the system is updated in real-time and the new data are also published in real-time (with only a fraction of delay between the two). For example, as a tweet is tweeted it is recorded in Twitter’s data architecture and micro-seconds later it is published into user timelines. Here, even though the data generation is sporadic at the point of generation (each user might only produce a couple of tweets a day), it is far from the case at the point of recording by the company (the millions of Twitter users collectively generate thousands of tweets per second, meaning that the company databases and servers are constantly handling a data deluge). In other cases, the data are recorded in real-time, but their transmission to central servers and/or their processing or publication is delayed. For example, the HERE LIDAR scanning project involves 200 cars driving around cities taking a LIDAR scan every second to produce high definition mapping data (Nokia, 2015). A single LIDAR scan generally produces a million plus points of data (Cahalane et al., 2012). At the end of every day the local storage device is removed from the vehicle performing the scan and its data transferred to a data centre. Similarly, unemployment data are recorded at the time a person updates their status on the system, but the overall unemployment rate is published monthly and in an aggregated form. In some cases, even once the data are generated they are open to further editing, as with crowdsourced data within Wikipedia or OpenStreetMap, with the edits also recorded in real-time and becoming part of the dataset.
Perhaps unsurprisingly, there is a fair range of variety in the data form across our 26 datasets, including structured, semi-structured and unstructured data types. Of all the characteristics attributed to Big Data this seems to us to be the weakest attribute. Indeed, small data are also highly heterogeneous in nature, especially datasets common to humanities and social sciences where the handling and analyzing of qualitative data (text, images, etc.) is normal. Our suspicion is that this characteristic was attributed to Big Data because those scientists who first coined the term were used to handling structured data exclusively but were starting to encounter semi-structured and unstructured data as new data generation and collection systems were deployed.
As noted, small datasets consist of samples of representative data harvested from the total sum of potentially available data. Sampling is typically used because it is unfeasible in terms of time and resources to harvest a full dataset. In contrast, Big Data seek to capture the entire population.
All our 26 datasets hold the characteristic of exhaustivity.
As with exhaustivity, all 26 datasets hold the traits of fine-grained resolution (with the exception of employment data, which is fine-grained in the database but is published in aggregated form), indexicality and relationality. In each case, the data are accompanied by metadata that uniquely identify the device, site and time/date of generation, along with other characteristics such as device settings. These metadata inherently produce relationality, enabling data from the same and related devices but generated at different times/locales to be linked, but also entirely different datasets that share some common fields to be tied together and relationships between datasets to be identified. However, the data themselves might not provide unambiguous relationality or be easily machine-readable. For example, a tweet is composed of text and/or an image which requires either data analytics or human interpretation to identify the content and meaning of the tweet. Similarly, a CCTV feed will be indexical to a camera and be time, date, and place stamped, but the content of the feed will either require image recognition to identify content (e.g., using facial recognition software) or operator recognition to make the image content indexical.
Extensionality and scalability refer to the flexibility of data generation. A system that is highly adaptable in terms of what data are generated is said to possess strong extensionality (Marz and Warren, 2012). For example, web-based and mobile apps are constantly tweaking their designs and underlying algorithms, performing on-the-fly adaptive testing and rollout, as well as altering their terms and conditions and the metadata they capture. The result is that the data they generate are changeable, with new fields being added and removed as required. However, this is not a trait common to all big datasets. For example, many systems, such as smart meters, credit card readers and sensor-networks, are seeking rigid continuity in what data are generated to produce robust, comparable longitudinal datasets. Scalability refers to the extent to which a system can cope with varying data flow. Social media platforms such as Twitter need to be able to cope with ebbs and surges in data generation, scaling from managing a few thousand tweets at certain times of the day to tens of thousands during popular live events. Such rapid scaling is not required in systems that have a constant flow of data, such as a sensor network that produces data at set intervals (the timing can be altered, but the flow remains constant rather than surging). As such, some of the 26 datasets are generated and stored within rapidly scalable systems, but not others.
The forms and boundaries of Big Data
What is clear from examining each Big Data parameter with respect to the 26 datasets is that there is no one characteristic profile that all Big Data fit. Not all Big Data possess all of the seven traits detailed by Kitchin (2013, 2014). Indeed, not all data termed Big Data in the literature possess the 3Vs of volume, velocity and variety. If one looks across the rows in Table 3 then the diversity of Big Data becomes clear, with datasets possessing differing profiles, especially with regard to volume, velocity, variety, extensionality and scalability. Big Data are clearly then not an amorphous category and there are certainly different ‘species’ of Big Data.
Examining these profiles starts to suggest the boundary markers of what constitutes Big Data. Indeed, it may be the case that some of our 26 datasets might not be considered Big Data by some. Or it might be that some consider certain datasets to constitute Big Data that we would not, for example, national censuses (which have volume, exhaustivity, resolution, indexicality and relationality, but no velocity (generated once every 10 years and taking 1–2 years to process), no extensionality or scalability, and are published in aggregated form). It seems to us, based on the datasets that we have examined, that the key boundary characteristics of Big Data, which together differentiate it from small data, are velocity (both frequency of generation, and frequency of handling, recording, and publishing) and exhaustivity. Small data are slow and sampled. Big Data are quick and exhaustive.
These two traits, we believe, act as key Big Data boundary markers. In our own analysis of Table 3 it was the administrative datasets of the house price register, planning permissions and unemployment, as well as the satellite and LIDAR imagery that provoked the most discussion (we quite quickly rejected Census data, which we had initially included, due to its very long temporal gap in data generation). In the case of the administrative data, they are produced in real-time as entries are made into the system (as house sales are completed, planning permissions sought, and unemployed people sign on). However, the data are published either weekly or monthly and, in the case of unemployment, released in an aggregated form. Do data that are generated in real-time, but released monthly and in an aggregated form, constitute Big Data? Certainly they are at the point of collection, but what about at the point of publishing where they lack velocity? For some, such administrative data are Big Data (Economic and Social Research Council (ESRC), 2013), for others they are more marginal, and the key element in doubt is temporality. One month’s delay is still much quicker than most administrative data that are published quarterly or annually, and the dataset still holds most of the other characteristics of Big Data such as exhaustivity (the data refer to all houses sold, all planning permissions sought, and all unemployed people), but it is nonetheless far slower than data published in real-time.
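The boundary reasoning in this section can be expressed as a small decision rule. What follows is an illustrative sketch only, under stated assumptions: the names `Pace`, `DatasetProfile` and `boundary_reading`, and the three labels ‘inside’, ‘marginal’ and ‘outside’, are hypothetical conveniences and not a classification proposed in the analysis. The rule treats exhaustivity and real-time generation as necessary, and uses the frequency of publishing to separate clear cases from marginal ones.

```python
from dataclasses import dataclass
from enum import Enum


class Pace(Enum):
    """Assumed coarse levels of velocity, for illustration only."""
    REAL_TIME = "real-time"
    PERIODIC = "periodic"  # e.g. released weekly or monthly
    RARE = "rare"          # e.g. generated once a decade


@dataclass(frozen=True)
class DatasetProfile:
    name: str
    generation: Pace   # frequency of generation
    publishing: Pace   # frequency of handling, recording and publishing
    exhaustive: bool   # whole population rather than a sample


def boundary_reading(profile: DatasetProfile) -> str:
    """Return a hypothetical label based on velocity and exhaustivity alone."""
    if not profile.exhaustive or profile.generation is not Pace.REAL_TIME:
        return "outside"
    if profile.publishing is Pace.REAL_TIME:
        return "inside"
    # Generated in real-time but published with a delay: the doubtful cases.
    return "marginal"


tweets = DatasetProfile("tweets", Pace.REAL_TIME, Pace.REAL_TIME, True)
unemployment = DatasetProfile("unemployment register", Pace.REAL_TIME, Pace.PERIODIC, True)
census = DatasetProfile("national census", Pace.RARE, Pace.RARE, True)

for p in (tweets, unemployment, census):
    print(f"{p.name}: {boundary_reading(p)}")
```

The three example profiles follow the descriptions given in the text: tweets are recorded and published almost immediately, unemployment records are entered in real-time but published monthly, and a census is generated once every 10 years. A fuller treatment would need to weigh the remaining traits as well, which this sketch deliberately leaves out.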
Our discussion of satellite imagery and LIDAR focused in particular on coverage and repetition of gaze. In other forms of Big Data, what is being measured remains quite constant, with the gaze and the object under surveillance relatively fixed. In social media it is the contributions of every user, for credit cards it is the transactions of every card holder, for supermarkets it is the purchases of every shopper. However, the gaze of the satellite imagery moves, only returning to capture the same terrain after a set number of days. Nonetheless the surface of the entire planet is being repeatedly generated and data are processed constantly. In the case of LIDAR, that repetition is missing. The aim is to scan every road on the planet, but to do so only once. The data are generated in real-time, and are voluminous, indexical, relational, and they produce exhaustive spatial coverage (the aim is to create a 3D model of the whole road network and the architecture bordering this network) though no longitudinal data of the same places. In both cases, most would agree that satellite imagery and LIDAR scans constitute Big Data, but they are exhaustive in a particular way which distinguishes them from other types of Big Data. The same would also be the case with respect to large scientific experiments such as data generated by the Large Hadron Collider.
Interestingly, given the meme of the 3Vs of Big Data, having examined 26 types of Big Data, our conclusion is that two of those Vs – volume and variety – are not key defining characteristics of Big Data. It is certainly the case that Big Data often consist of very large numbers of records and that the storage volume required to store them is significant; however, this is not a necessary condition of Big Data. Rather, volume is a by-product of velocity and exhaustivity: the real-time flow of data across a whole system can produce a deluge of data, especially if each record is large in size. In some cases, however, the flow can be generated in real-time (e.g., every 30 seconds), but because the system is small (e.g., 30 sound sensors across a city) and each record is small in size, the storage volume is relatively small. The data generated by each sensor are also highly structured. Despite the lack of volume and variety, such sensor data are widely considered Big Data. Likewise, variety is not a distinguishing characteristic because small data possess just as much variety as Big Data.
Conclusion
To date, there has been very little work that has sought to examine in detail the ontology of Big Data, other than to suggest that they are data that possess certain broad characteristics (volume, velocity, variety, exhaustivity, etc.). Indeed, most studies that discuss Big Data treat the term as a catch-all, amorphous phrase that assumes that all Big Data share a set of general traits. Through an analysis that applied Kitchin’s (2013, 2014) typology of Big Data traits to 26 datasets, our study reveals that Big Data do not all share the same characteristics and that there are multiple forms of Big Data. Indeed, our analysis demonstrates that only a handful of the 26 datasets we examined held all seven traits identified by Kitchin. That said, for data to be classified as Big Data they do need to possess the majority of the traits set out in Table 1, of which velocity and exhaustivity are the most important. Volume and variety, we contend, are not necessary conditions of Big Data and, without velocity and exhaustivity, are not qualifying criteria. In other words, the 3Vs meme is actually false and misleading and, along with the term itself, is partially to blame for the confusion over the definitional boundaries of Big Data.
The observation that there are multiple forms of Big Data is perhaps no surprise given the wide variety of small data, the varying nature of the systems that generate Big Data, the differing purposes for which the data are generated, and the differing forms of the data generated. Nonetheless it is an observation that needs highlighting given that it has so far been ignored or taken for granted in the literature. Our analysis has revealed that Big Data as an analytical category needs to be unpacked, with the ‘genus’ of Big Data further delineated and its various ‘species’ identified. This is important work if we are to better understand what it is that we are talking about when we discuss and analyze Big Data, and if we want to produce more nuanced insights about and from the data. It is only through such ontological work, focused on shifting from broad generalities to specific qualities, that we will gain conceptual clarity about what constitutes Big Data and formulate how best to make sense of it and how it might be used to make sense of the world.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research for this paper was funded by a European Research Council Advanced Investigator Award, ‘The Programmable City’ (ERC-2012-AdG-323636).
