One of the current concerns among many domains (e.g. information design, technology, cognition, psychology) is how to make sense of extensive and complex sets of data in order to improve understanding and communication. While tools, systems and methods are designed to assist those purposes, the Big Data phenomenon seems to be the one to beat. ‘Big Data’ has become the term to refer to almost every current set of information. Everything seems to be related to or influenced by Big Data. If a problem does not deal with Big Data, it does not seem to generate much interest or attention, as it might be easy to solve. However, communication problems do not always depend on the size or complexity of the message, but on the way the information is transmitted. Misunderstandings are the result of lack of clarity and/or incomprehensible messages.
This situation made me start wondering what was Big Data? why is it considered the core of current problems? Have data sets become huge recently? When a set of data is considered big? Why its size seems to be so important the sensemaking process?
This post does not aim to answer all those questions, but to shed some light on the relationship among size, complexity, understanding and communication by proving a better understanding of the concept of Big Data and its relationship to understanding.
Size & Complexity. In the current information age, technology developments and the Internet have made data production highly varied and widely accessible. Consequently, information management and analytical skills have become essential tools for many people to deal with ‘large and fast-growing sources of data or information that also present a complex range of analysis and use problems’ (Villars et al., 2011). This type of large and complex sets of data is defined as Big Data. In other words, size and complexity are two factors that need to be taken into account to describe a dataset as large.
When Data is Big? Technically speaking, a dataset is considered large when its size, heterogeneity and level of complexity become a challenge to the person trying to analyse and make sense of it (Unwin et al., 2006:15). Similarly, Wikipedia defines a set of data as large and complex when its understanding cannot be achieved with traditional database management tools or data processing software. A dataset can be large in many ways: number of cases, number of variables or both (Unwin et al., 2006).
When Data is Complex? A set of data is considered as complex when it is formed from a varied range of sources from different formats. A complex set of data has a wide range of entities, variables and details (e.g. additional categories, special cases, etc), which generate diverse types of structure (temporal or spatial components, combinations of relational data tables) and the need for data cleaning as the quantity of irrelevant data augments. In complex sets of data, the high number of variables to consider may demand more time to read, retrieve, evaluate, display, manage and analyse data, and require tools to assist the understanding and analytical process.
Since when Data have been Big. Extensive datasets are not a phenomenon of the information age, as the concept of ‘large’ datasets has been around since the beginning of the XX century (or earlier). As an example, in 1959, a set of data of 1,000 cases was described as ‘large’, while in 1982 a dataset with less than 500 cases was considered ‘moderate’ and ‘large’ with more than 2,000 (Unwin et al., 2006:11). In 2006, Unwin et al. suggest one million (bytes, cases, variables, combinations or tests) as a guideline to refer to a dataset as ‘large’.
Each period of time has had their own particular problems and ways to deal with what was considered ‘large’ data. What has changed throughout time are the criteria to quantify the largeness of data (See table above). ‘What is large depends on the frame of reference’ stated Carr et al. (1987 in Unwin, 2006) back in the late 80s. Today, the following aspects can be used to determine the largeness of a dataset (Unwin et al. (2006):
- Type of data (e.g. cases, variables, observations, etc.)
- Amount of data that can be stored and identified
- Types of analyses used to explore and examine the data
- Required time to conduct analyses and gain understanding
Today. Few (2012) states that data ‘have always been big’ and already increasing at exponential rates 30 years ago. Lack of effective communication has been an on-going problem which does not seem to be related to the size of the data, but to the lack of thorough understanding. As the Big Data phenomenon implies more analyses and more results in less time, energy seems to be focused on developing automatic filtering and storing of results, rather than on developing communication strategies. To be effective these strategies need to combine the power of machines and technology with human strengths and input.
Largeness and complexity are relative and respond to sets of criteria determine in each specific moment and context. Therefore, these factors shouldn’t be seen as barriers to achieve transparent communication.
– Few, S. (2012) Big Data, Big Ruse. Visual Business Intelligence Newsletter. Perceptual Edge. July/August/September.
– Unwin, A., Theus, M abd Hofmann, H. (2006) Graphics of large datasets. Visualising a million. Series: Statistics and Computing. Springer: Singapore
– Villars, R.L., Olofson, C.W. and Eastwood, M. (2011) Big Data: What It Is and Why You Should Care (White paper) http://www.idc.com