Importance of Metadata in a Big Data World
Last week, EMC announced the 2014 EMC Digital Universe, a study with IDC that details what the Digital Universe will look like in 2020 and the opportunities it presents. One of the interesting challenges coming out of the study was the need to “tag” data in order to understand it in the context in which it was captured or gathered. The study predicts that by 2020, a growing majority of new data being generated will be unstructured. That means that, more often than not, we will know little about the data, unless it is somehow characterized or tagged—a practice that results in metadata.
Tagging Lessons Learned From Web Analytics
Tagging is a concept with which most web analytics users are familiar. Tagging is a method of tracking visitor activity on each page of the website (see Figure 1).
The advantages of tagging include:
- The tagging script, which runs in the visitor’s browser, may have access to additional information about the web client or the user that is not sent in the request, such as visitors’ screen sizes and the price of the goods they purchased.
- Tagging can report on events that do not involve a request to the web server, such as interactions within Flash movies, partial form completion, and browser events such as onClick, onMouseOver, onFocus, and onBlur.
- The tagging service manages the process of assigning cookies to visitors.
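To make these advantages concrete, here is a minimal sketch of how a tagging service might decode a single "beacon" request sent by a page tag. It is not any vendor's actual protocol: the URL, the parameter names (`ev`, `sw`, `sh`, `price`), and the `parse_beacon` helper are all hypothetical, chosen only to show client-side details, events that never hit the web server, and cookie assignment in one place.

```python
import uuid
from urllib.parse import urlparse, parse_qs


def parse_beacon(url, cookie_visitor_id=None):
    """Turn one hypothetical tag beacon URL into an event record."""
    params = parse_qs(urlparse(url).query)

    def first(name, default=None):
        # parse_qs returns a list per key; keep the first value only.
        return params.get(name, [default])[0]

    # The tagging service, not the website, assigns the visitor cookie.
    visitor_id = cookie_visitor_id or uuid.uuid4().hex

    price = first("price")
    return {
        "visitor_id": visitor_id,
        "is_new_visitor": cookie_visitor_id is None,
        "event": first("ev", "pageview"),  # e.g. onClick, onBlur
        "screen": (int(first("sw", 0)), int(first("sh", 0))),
        "price": float(price) if price is not None else None,
    }


# Hypothetical beacon for a click event that never reached the web server.
record = parse_beacon(
    "https://tags.example.com/b.gif?ev=onClick&sw=1920&sh=1080&price=19.99"
)
print(record["event"], record["screen"], record["price"])
```

A returning visitor would arrive with the cookie already set, in which case the sketch reuses that ID instead of generating a new one.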
There’s Gold In Them Thar Hills of Metadata!!
Sometimes it’s hard to imagine what metadata is and why it’s important. Let’s look at an example of the metadata associated with a 140-character tweet. 140 characters wouldn’t seem to be much data, even across a very large number of tweets. However, data volumes explode when you start coupling the tweet with all the metadata necessary to understand those 140 characters in the context of the conversation (see Figure 2).
Here is some of the metadata associated with a 140-character tweet:
- The screen name and user ID of the “replied to tweet” author
- Tweet’s creation date and time
- The author’s screen name
- The author’s user name
- The author’s biography
- The author’s URL
- The author’s location
- Rendering information for the author
- Account creation date
- Number of favorites this user has
- Number of users this user is following
- Time zone and offset for this user
- User’s selected language
- Whether the user is protected or not
- Number of followers for this user
- Place ID
- Printable name for this place
- Type of place
- The country for this place
- The application that sent the tweet
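One rough way to see the effect is to build a record with fields like those listed above and compare the size of the text with the size of everything around it. The sketch below does that with a made-up record. The field names and values are placeholders that loosely follow the list, not the schema of any real API, and the exact ratio it prints depends entirely on those placeholder values.

```python
import json

# Hypothetical record: field names loosely follow the list above,
# not the schema of any real API. All values are placeholders.
tweet = {
    "text": "x" * 140,  # the raw data: a full-length 140-character tweet
    "created_at": "2014-04-16T09:30:00Z",
    "in_reply_to": {"screen_name": "someone", "user_id": 12345},
    "author": {
        "screen_name": "example_user",
        "name": "Example User",
        "bio": "Writes about data.",
        "url": "https://example.com",
        "location": "Somewhere",
        "created_at": "2009-01-01T00:00:00Z",
        "favorites": 10,
        "following": 200,
        "followers": 150,
        "time_zone": "UTC",
        "utc_offset": 0,
        "language": "en",
        "protected": False,
    },
    "place": {
        "id": "abc123",
        "name": "Sample City",
        "type": "city",
        "country": "Nowhere",
    },
    "source": "example-app",
}

# Size of the tweet text alone versus the whole serialized record.
raw_bytes = len(tweet["text"].encode("utf-8"))
total_bytes = len(json.dumps(tweet).encode("utf-8"))
metadata_bytes = total_bytes - raw_bytes

print(f"raw text: {raw_bytes} bytes")
print(f"metadata: {metadata_bytes} bytes")
print(f"metadata is {metadata_bytes / raw_bytes:.1f}x the raw text")
```

Even with short placeholder values and a full-length tweet, the surrounding fields take up more space than the text itself.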
It’s easy to see how the volume of metadata can quickly dwarf the amount of raw data. The same thing happens when organizations start tagging more of their transactions and interactions in order to gain additional insight into the nature and context of each dialogue and interaction.
Untapped Data Examples
Not all data is necessarily useful for Big Data analytics. However, some data types are particularly ripe for analysis, such as:
- Surveillance footage. Typically, generic metadata (date, time, location, etc.) is automatically attached to a video file. However, as IP cameras continue to proliferate, there is greater opportunity to embed more intelligence into the camera (on the edge) so that footage can be captured, analyzed, and tagged in real time. This type of tagging can expedite criminal investigations, enhance retail Big Data analytics for consumer traffic patterns, and improve military intelligence as videos from drones across multiple geographies are compared for pattern correlations, crowd emergence and response, or measuring the effectiveness of counterinsurgency.
- Embedded and medical devices. In the future, sensors of all types (including those that may be implanted into the body) will capture vital and non-vital biometrics, track medicine effectiveness, correlate bodily activity with health, monitor potential outbreaks of viruses, and so on, all in real time.
- Entertainment and social media. Trends based on crowds or massive groups of individuals can be a great source of Big Data to help bring to market the “next big thing,” help pick winners and losers in the stock market, and yes, even predict the outcome of elections—all based on information users freely publish through social outlets.
- Consumer images. We say a lot about ourselves when we post pictures of ourselves or our families and friends. A picture used to be worth a thousand words, but the advent of Big Data has introduced a significant multiplier. The key will be the introduction of sophisticated tagging algorithms that can analyze images either in real time, when pictures are taken or uploaded, or en masse, after they are aggregated from various websites.
These are in addition to the transactional data that already runs through enterprise systems in the course of normal data processing today.
The IDC study states that from 2013 to 2020, the digital universe will grow by a factor of 10, from 4.4 trillion gigabytes to 44 trillion gigabytes. However, the study estimates that only 3% of the potentially useful data will be tagged.
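To put the growth figure in annual terms, the short calculation below takes the study's two endpoints and derives the implied yearly growth rate and doubling time. It assumes smooth compound growth between 2013 and 2020, which is a simplification of mine rather than something the study claims. Under that assumption the rate works out to roughly 39% a year, with the volume doubling about every two years.

```python
import math

start_gb = 4.4e12  # 2013, from the study: 4.4 trillion gigabytes
end_gb = 44e12     # 2020, from the study: 44 trillion gigabytes
years = 2020 - 2013

growth_factor = end_gb / start_gb

# Assumption: smooth compound growth, which the study does not claim.
annual_rate = growth_factor ** (1 / years) - 1
doubling_years = math.log(2) / math.log(1 + annual_rate)

print(f"growth factor: {growth_factor:.1f}x")
print(f"implied annual growth: {annual_rate:.1%}")
print(f"implied doubling time: {doubling_years:.1f} years")
```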
Call this the Big Data gap: information that is untapped and waiting for enterprising digital explorers to extract the value hidden within it. The bad news is that it will take extra work and investment to tag all of these new data sources. The good news is that, as the digital universe expands, so does the amount of useful data it contains, along with the valuable insights about your customers, products, markets, and operations that can be used to optimize key business processes and uncover new monetization opportunities.