A Brief History of Big Data
Just because cavemen didn’t have computers doesn’t mean the data wasn’t there
THE HISTORY OF “Big Data” as a term may be short, but many of the foundations it is built on were laid long ago. Many decades before computers as we know them today were commonplace, the idea that we were creating an ever-expanding body of knowledge ripe for analysis was popular in academia.
Our increasing ability to store and analyze information has been a gradual evolution—although things certainly sped up at the end of the last century, with the invention of digital storage and the Internet. Here, then, is a brief(-ish) look at the long history of thought and innovation that has led us to the dawn of the data age.
Ancient History of Data
C 18,000 BCE
The earliest examples we have of humans storing and analyzing data are tally sticks. The Ishango Bone, discovered in 1960 in what is now the Democratic Republic of the Congo, is thought to be one of the earliest pieces of evidence of prehistoric data storage. Paleolithic tribespeople would mark notches into sticks or bones to keep track of trading activity or supplies. They would compare sticks and notches to carry out rudimentary calculations, enabling them to make predictions such as how long their food supplies would last.
C 2400 BCE
The abacus—the first dedicated device constructed specifically for performing calculations—comes into use in Babylon. The first libraries also appear around this time, representing our first attempts at mass data storage.
C 300 – 48 BCE
The Library of Alexandria is perhaps the largest collection of data in the ancient world, housing perhaps half a million scrolls and covering pretty much everything humanity had learned up to that point. Unfortunately, in 48 BCE it is thought to have been destroyed by the invading Romans, perhaps accidentally. Contrary to common myth, not everything was lost—significant parts of the library’s collections were moved to other buildings in the city, or stolen and dispersed throughout the ancient world.
C 200 – 100 BCE
The Antikythera Mechanism, the earliest discovered mechanical computer, is produced, presumably by Greek scientists. Its “CPU” consists of 30 interlocking bronze gears, and it is thought to have been designed for astronomical purposes and for tracking the cycle of the Olympic Games. Its design suggests it is probably an evolution of earlier devices—but these so far remain undiscovered.
The Emergence of Statistics
1662
In London, John Graunt carries out the first recorded experiment in statistical data analysis. By recording information about mortality, he theorizes that he can design an early warning system for the bubonic plague ravaging Europe.
1865
The term “business intelligence” is used by Richard Millar Devens in his Cyclopaedia of Commercial and Business Anecdotes, describing how the banker Henry Furnese achieved an advantage over competitors by collecting and analyzing information relevant to his business activities in a structured manner. This is thought to be the first documented example of a business putting data analysis to use for commercial purposes.
1881
The US Census Bureau has a problem—it estimates that it will take eight years to crunch all the data collected in the 1880 census, and it predicts that the data generated by the 1890 census will take over 10 years to process, meaning the results will already be outdated by the time the 1900 census comes around. In 1881 a young engineer employed by the bureau—Herman Hollerith—produces what will become known as the Hollerith Tabulating Machine. Using punch cards, he reduces 10 years’ work to three months and earns his place in history as the father of modern automated computation. The company he founds will go on to become IBM.
The Early Days of Modern Data Storage
1926
Interviewed by Collier’s magazine, inventor Nikola Tesla states that when wireless technology is “perfectly applied the whole Earth will be converted into a huge brain, which in fact it is, all things being particles of a real and rhythmic whole … and the instruments through which we shall be able to do this will be amazingly simple compared to our present telephone. A man will be able to carry one in his vest pocket.”
1928
Fritz Pfleumer, a German-Austrian engineer, invents a method of storing information magnetically on tape. The principles he develops are still in use today, with the vast majority of digital data stored magnetically on computer hard disks.
1944
Fremont Rider, librarian at Wesleyan University, Connecticut, publishes a paper titled The Scholar and the Future of the Research Library. In one of the earliest attempts to quantify the amount of information being produced, he observes that in order to store all the academic and popular works of value being produced, American libraries would have to double their capacity every 16 years. This leads him to speculate that by 2040 the Yale Library will contain 200 million books spread over 6,000 miles of shelves.
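Rider’s projection is simple compound doubling, and the arithmetic can be checked in a few lines. The sketch below assumes a starting collection of roughly 3 million volumes in 1944 (an illustrative figure, not one given in the text):

```python
# Check Fremont Rider's projection: library capacity doubling every
# 16 years between 1944 and 2040.
START_YEAR = 1944
TARGET_YEAR = 2040
DOUBLING_PERIOD_YEARS = 16
initial_volumes = 3_000_000  # assumed Yale Library size in 1944 (illustrative)

doublings = (TARGET_YEAR - START_YEAR) // DOUBLING_PERIOD_YEARS  # 6 doublings
projected_volumes = initial_volumes * 2 ** doublings

print(f"{doublings} doublings -> {projected_volumes:,} volumes")
# Output: 6 doublings -> 192,000,000 volumes, close to Rider's 200 million
```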
The Beginnings of Business Intelligence
1958
IBM researcher Hans Peter Luhn defines Business Intelligence as “the ability to apprehend the interrelationships of presented facts in such a way as to guide action toward a desired goal.”
1962
The first steps are taken toward speech recognition when IBM engineer William C. Dersch presents the Shoebox Machine at the 1962 Seattle World’s Fair. It can interpret numbers and 16 spoken English words, converting them into digital information.
1964
An article in the New Statesman refers to the difficulty of managing the increasing amount of information becoming available.
The Start of Large Data Centers
1965
The US Government plans the world’s first data center, to store 742 million tax returns and 175 million sets of fingerprints on magnetic tape.
1970
IBM mathematician Edgar F. Codd presents his framework for a “relational database.” The model provides the framework that many modern data services use today: information is stored in tables of rows and columns, and it can be accessed by anyone who knows what they are looking for. Prior to this, accessing data from a computer’s memory banks usually required an expert.
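To make the relational idea concrete, here is a minimal sketch using Python’s built-in sqlite3 module; the table and its contents are illustrative, not drawn from Codd’s paper:

```python
import sqlite3

# An in-memory relational database: data lives in tables of rows and
# columns rather than in a navigational, expert-only structure.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE census (year INTEGER, population INTEGER)")
conn.executemany(
    "INSERT INTO census VALUES (?, ?)",
    [(1880, 50_189_209), (1890, 62_979_766), (1900, 76_212_168)],
)

# A declarative query: you state what you want, not where it is stored.
for year, population in conn.execute(
    "SELECT year, population FROM census WHERE year >= ?", (1890,)
):
    print(year, population)

conn.close()
```

The person asking the question needs no knowledge of how the data is physically laid out, which is exactly the shift away from expert-only access described above.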
1976
Material Requirements Planning (MRP) systems are becoming more commonly used across the business world, representing one of the first mainstream commercial uses of computers to speed up everyday processes and improve efficiency. Until now, most people have probably only encountered computers in research and development or academic settings.
1989
Possibly the first use of the term Big Data (without capitalization) in the way it is used today: international best-selling author Erik Larson pens an article for Harper’s Magazine speculating on the origin of the junk mail he receives. He writes: “The keepers of big data say they are doing it for the consumer’s benefit. But data have a way of being used for purposes other than originally intended.”
Additionally, “business intelligence”—a concept popular since the late 1950s—sees a surge of renewed interest with newly emerging software and systems for analyzing commercial and operational performance.
The Emergence of the Internet
1991
Computer scientist Tim Berners-Lee announces the birth of what will become the World Wide Web as we know it today. In a post in the Usenet group alt.hypertext he sets out the specifications for a worldwide, interconnected web of data, accessible to anyone from anywhere.
1996
According to R. J. T. Morris and B. J. Truskowski in their 2003 paper The Evolution of Storage Systems, this is the point at which digital storage becomes more cost-effective than paper.
1997
Michael Lesk publishes his paper How Much Information Is There in the World?, theorizing that the existence of 12,000 petabytes of data is “perhaps not an unreasonable guess.” He also points out that even at this early point in its development, the web is increasing in size 10-fold each year. Much of this data, he observes, will never be seen by anyone and will therefore yield no insight.
Google Search also debuts this year, and its name becomes shorthand for searching the Internet for data.
Early Ideas of Big Data
1999
A couple of years later, the term Big Data appears in Visually Exploring Gigabyte Data Sets in Real Time, published by the Association for Computing Machinery. Again, the propensity for storing large amounts of data with no adequate way of analyzing it is lamented. The paper goes on to quote computing pioneer Richard W. Hamming as saying: “The purpose of computing is insight, not numbers.”
This year also brings possibly the first use of the term “Internet of Things” to describe the growing number of devices online and the potential for them to communicate with one another, often without a human middleman. The term is used as the title of a presentation given to Procter & Gamble by RFID pioneer Kevin Ashton.
2000
In How Much Information?, Peter Lyman and Hal Varian (now chief economist at Google) attempt to quantify the amount of digital information in the world, and its rate of growth, for the first time. They conclude: “The world’s total yearly production of print, film, optical, and magnetic content would require roughly 1.5 billion gigabytes of storage. This is the equivalent of 250 megabytes per person for each man, woman, and child on Earth.”
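The per-person figure follows directly from the total. A quick sanity check, assuming a world population of roughly six billion in 2000 (an assumption for this sketch, not a figure from the report):

```python
# Sanity check of Lyman and Varian's estimate: total yearly content
# divided by world population.
total_gigabytes = 1.5e9   # 1.5 billion gigabytes of new content per year
world_population = 6.0e9  # assumed world population around 2000

per_person_gb = total_gigabytes / world_population
print(f"{per_person_gb * 1000:.0f} MB per person")  # 250 MB per person
```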
2001
In his paper 3D Data Management: Controlling Data Volume, Velocity, and Variety, Doug Laney, then an analyst at META Group (later acquired by Gartner), defines the three characteristics that will come to be commonly accepted as defining Big Data: volume, velocity, and variety.
This year also sees the first use of the term “software as a service”—a concept fundamental to many of the cloud-based applications that are industry-standard today—in the article “Strategic Backgrounder: Software as a Service” by the Software and Information Industry Association.
Web 2.0 Increases Data Volumes
2005
Commentators announce that we are witnessing the birth of “Web 2.0”—the user-generated web where the majority of content will be provided by users of services, rather than the service providers themselves. This is achieved through integration of traditional HTML-style web pages with vast back-end databases built on SQL. Some 5.5 million people are already using Facebook, launched a year earlier, to upload and share their own data with friends.
This year also sees the creation of Hadoop—the open source framework created specifically for storage and analysis of Big Data sets. Its flexibility makes it particularly useful for managing the unstructured data (voice, video, raw text, etc.) that we are increasingly generating and collecting.
Today’s Use of the Term “Big Data” Emerges
2008
Wired brings the concept of Big Data to the masses with its article “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete.”
The world’s servers process 9.57 zettabytes (9.57 trillion gigabytes) of information (the equivalent of 12 gigabytes per person, per day), according to the How Much Information? 2010 report. In International Production and Dissemination of Information, it is estimated that 14.7 exabytes of new information are produced this year.
2009
The average US company with over 1,000 employees is storing more than 200 terabytes of data, according to the report Big Data: The Next Frontier for Innovation, Competition, and Productivity by the McKinsey Global Institute.
2010
Eric Schmidt, executive chairman of Google, tells a conference that as much data is now being created every two days as was created from the beginning of human civilization to the year 2003.
2011
The McKinsey report states that by 2018 the US will face a shortfall of between 140,000 and 190,000 professional data scientists, and warns that issues including privacy, security, and intellectual property will have to be resolved before the full value of Big Data can be realized.
2014
This year marks the rise of the mobile machines, as for the first time more people are using mobile devices than office or home computers to access digital data. Eighty-eight percent of business executives surveyed by GE and Accenture report that big data analytics is a top priority for their business.
What this teaches us is that Big Data is not a new or isolated phenomenon, but part of a long evolution of capturing and using data. Like other key developments in data storage, data processing, and the Internet, Big Data is just a further step that will bring change to the way we run business and society. At the same time, it will lay the foundations on which many further evolutions will be built.
Bernard Marr is one of the world’s most highly respected experts when it comes to business performance, digital transformation, and the intelligent use of data in business. View his website at http://www.bernardmarr.com/.