The Imminent Future Revolves Around Big Data
With the evolution of the Internet, the ways businesses, economies, stock markets, and even governments operate have changed dramatically. It has also changed the way people live. The term ‘Big Data’ has been in the limelight for a while now; it is no longer new, and almost everyone knows something about it. Not long ago, most data was stored on paper, film, or other analog media; only one-quarter of all the world’s stored information was digital. Today, nearly everything is digital.
Let’s take a closer look at Big Data!
What is Big Data?
Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone.
This data comes from everywhere:
- Sensors used to gather climate information
- Posts to social media
- Digital pictures and videos
- Purchase transaction records
- Cell phone GPS signals, to name a few
This is Big Data. For example:
- Today, Twitter generates 12 TB of data every day
- An Airbus A380 generates 10 TB every 30 minutes of flight
- The NYSE generates a terabyte (TB) of data every month
Big Data is similar to small data, but bigger.
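To put the figures above on a common scale, here is a minimal sketch that converts them to terabytes per day. It assumes decimal (SI) units, and the A380 entry is a hypothetical aircraft flying non-stop for 24 hours, which is only there to make the rates comparable.

```python
# Decimal (SI) units are assumed: 1 TB = 10**12 bytes, 1 EB = 10**18 bytes.
TB = 10**12
EB = 10**18

HALF_HOURS_PER_DAY = 24 * 2

# Figures quoted in the text, normalized to bytes per day.
daily_bytes = {
    "worldwide": 25 * 10**17,  # 2.5 quintillion bytes
    "twitter": 12 * TB,
    # Hypothetical: an A380 flying non-stop for 24 hours at 10 TB per 30 minutes.
    "a380_nonstop": 10 * TB * HALF_HOURS_PER_DAY,
}


def to_terabytes(n_bytes):
    """Convert a byte count to terabytes."""
    return n_bytes / TB


for source, n_bytes in daily_bytes.items():
    print(f"{source}: {to_terabytes(n_bytes):,.0f} TB per day")
```

Seen this way, 2.5 quintillion bytes is 2.5 exabytes, or 2.5 million terabytes, every day.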
Definition
There is no single standard definition.
Big Data is data whose scale, diversity, and complexity require new architectures, techniques, algorithms, and analytics to manage it and to extract value and hidden knowledge from it. Put another way, Big Data consists of high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision-making. It spans three dimensions: volume, velocity, and variety.
Big Data is widely classified into 3 main types:
1. Structured data
2. Semi-structured data
3. Unstructured data
Let’s walk through each one.
1. Structured data
Any data that can be stored, accessed, and processed in a fixed format is termed “structured data”. Over time, computer scientists have had great success in developing techniques for working with this kind of data and deriving value from it. Nowadays, however, we are starting to see issues as the size of such data grows enormously, with typical sizes in the range of multiple zettabytes.
Sources of structured data
- RDBMS such as Oracle, MySQL, DB2, SQL Server, etc.
- Spreadsheets
- OLTP systems
Examples
- Relations/tables
- MS Access
- MS Excel
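As a minimal sketch of what "fixed format" means in practice, the example below uses Python's built-in `sqlite3` module. The `purchases` table, its columns, and the sample rows are made up for illustration. Every row must fit the declared schema, which is what makes querying straightforward.

```python
import sqlite3

# An in-memory database; the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases ("
    " id INTEGER PRIMARY KEY,"
    " customer TEXT NOT NULL,"
    " amount_cents INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO purchases (customer, amount_cents) VALUES (?, ?)",
    [("alice", 1999), ("bob", 500), ("alice", 4250)],
)

# Because every row has the same columns, aggregation is a one-line query.
totals = conn.execute(
    "SELECT customer, SUM(amount_cents) FROM purchases"
    " GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)

# A row that violates the schema (missing customer) is rejected.
try:
    conn.execute(
        "INSERT INTO purchases (customer, amount_cents) VALUES (?, ?)",
        (None, 100),
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print("row rejected:", rejected)
```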
2. Semi-structured data
Semi-structured data contains elements of both structured and unstructured data. It appears structured in form, but it is not defined by a fixed schema such as a table definition in a relational DBMS.
Sources of semi-structured data
- XML
- JSON
- Other markup languages
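A small JSON example may help here. In this sketch (the records and field names are invented), each record is self-describing, but the two records do not share the same set of fields, so the code has to handle missing keys rather than rely on a table definition.

```python
import json

# Two hypothetical records: same general shape, different fields.
raw = """[
  {"user": "alice", "email": "alice@example.com"},
  {"user": "bob", "phones": ["555-0100", "555-0101"]}
]"""

records = json.loads(raw)

# The set of fields is discovered from the data, not declared up front.
all_fields = sorted({key for record in records for key in record})
print(all_fields)

# Missing fields must be handled explicitly; .get() returns None here.
emails = [record.get("email") for record in records]
print(emails)
```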
3. Unstructured data
Any data with an unknown form or structure is classified as unstructured data. In addition to being huge in size, unstructured data is heterogeneous, typically combining simple text files, images, videos, and so on. Organizations today have a wealth of data available to them, but unfortunately they often don’t know how to derive value from it because the data is in a raw, unstructured format.
Sources of unstructured data
- Web pages
- Images
- Free-form text
- Audio/video
- Body of email
- Chats
- Social media data
- Word documents
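To illustrate what deriving value from raw text can look like, here is a minimal sketch that counts word frequencies in free-form customer feedback. The feedback strings are invented, and real text analytics would need far more than a word count; the point is only that structure has to be imposed by the analysis because the data does not carry it.

```python
import re
from collections import Counter

# Hypothetical free-form feedback; there is no schema to query against.
feedback = [
    "The battery dies too fast",
    "Great screen, weak battery",
    "Love the screen",
]

# Split each text into lowercase words and count them across all texts.
word_counts = Counter(
    word
    for text in feedback
    for word in re.findall(r"[a-z]+", text.lower())
)

print(word_counts["battery"], word_counts["screen"])
```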
Characteristics of Big Data
1. Volume
The name Big Data itself refers to enormous size. The size of data plays a crucial role in determining its value. Volume is the amount of data generated that must be understood in order to make data-based decisions. A text file is a few kilobytes, a sound file is a few megabytes, and a full-length movie is a few gigabytes. An extremely large volume of data is a major characteristic of big data.
E.g., Amazon handles clickstream data from 20 million customers per day to recommend products.
2. Velocity
Velocity measures how fast data is produced and modified, and the speed at which it needs to be processed. A growing number of data sources, both machine-generated and human-generated, drives velocity. Big Data velocity deals with the speed at which data flows in from sources such as business processes, application logs, networks, social media, sensors, mobile devices, etc. The flow of data is massive and continuous. Extremely high velocity is another major characteristic of big data.
E.g., 72 hours of video are uploaded to YouTube every minute.
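A quick back-of-the-envelope calculation, using only the 72 hours per minute figure above, shows why such a stream cannot be handled at human speed:

```python
HOURS_UPLOADED_PER_MINUTE = 72

# Total hours of new video arriving per day.
hours_per_day = HOURS_UPLOADED_PER_MINUTE * 60 * 24

# Video arrives this many times faster than one viewer can watch it:
# 72 hours = 72 * 60 minutes of footage for every minute of wall-clock time.
arrival_to_viewing_ratio = HOURS_UPLOADED_PER_MINUTE * 60

print(hours_per_day)
print(arrival_to_viewing_ratio)
```

In other words, one minute of uploads would take a single viewer three full days to watch.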
3. Variety
Variety refers to data coming from many sources, both inside and outside an enterprise. It can be structured, semi-structured, or unstructured, and it arrives in different formats such as text, video, images, and more. Storing and processing such non-uniform data in a traditional RDBMS is not easy. Variety is one of the important characteristics of big data.
- Structured data – Traditional transaction processing systems and RDBMS.
- Semi-Structured data – HyperText Markup Language, Extensible Markup Language (XML).
- Unstructured data – Unstructured text documents, audio, video, email, photos, PDF, and social media.
4. Variability
Variability refers to the inconsistency that data can show at times, which hampers the ability to handle and manage it effectively. Data can be messy, anomalous, and difficult to trust. Data flows can be highly inconsistent, with periodic peaks. With so many types of data being generated, quality and accuracy are difficult to control.
E.g., a Twitter post contains hashtags, typos, and abbreviations.
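One common way to tame this kind of inconsistency is to normalize text before analyzing it. The sketch below is only an illustration: the abbreviation table is hypothetical and tiny, and a real pipeline would need a much larger dictionary plus spelling correction.

```python
import re

# Hypothetical lookup table mapping abbreviations to full words.
ABBREVIATIONS = {"u": "you", "gr8": "great", "thx": "thanks"}


def normalize(post):
    """Lowercase a post, split it into tokens, and expand abbreviations.

    Hashtags are kept as single tokens (e.g. "#bigdata").
    """
    tokens = re.findall(r"#?\w+", post.lower())
    return [ABBREVIATIONS.get(token, token) for token in tokens]


print(normalize("Thx u, #BigData is GR8!"))
```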
5. Veracity
Veracity refers to the biases, noise, and abnormalities in data. In other words, it is the degree of reliability that the data has to offer. Big data, as large as it is, can contain wrong data too. The key question here is “Is all the data that I am analyzing trustworthy?”
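A simple first line of defense is to validate values before using them. The following sketch filters temperature readings against a plausible range; the readings and the bounds are assumptions made up for this example, not values from any real sensor.

```python
# Hypothetical sensor readings in degrees Celsius; some are clearly bad.
readings_c = [21.4, 22.0, -999.0, 21.8, 150.3, None]

# Assumed plausible bounds for an outdoor air temperature sensor.
MIN_PLAUSIBLE_C = -90.0
MAX_PLAUSIBLE_C = 60.0


def is_trustworthy(value):
    """Return True if the reading is present and within plausible bounds."""
    return value is not None and MIN_PLAUSIBLE_C <= value <= MAX_PLAUSIBLE_C


clean = [value for value in readings_c if is_trustworthy(value)]
reliability = len(clean) / len(readings_c)

print(clean)
print(f"{reliability:.0%} of readings passed validation")
```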
6. Volatility
Volatility deals with two questions: how long is the data valid, and how long should it be stored? Some data is needed for long-term decisions and remains valid for a long time. Other data becomes obsolete within minutes of being generated.
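In practice, volatility is often handled with a retention policy per kind of data. Here is a minimal sketch; the data kinds and retention periods are hypothetical and chosen only to contrast short-lived and long-lived data.

```python
from datetime import datetime, timedelta

# Hypothetical retention periods for two kinds of data.
RETENTION = {
    "stock_quote": timedelta(minutes=5),
    "purchase_record": timedelta(days=365 * 7),
}


def is_still_valid(kind, created_at, now):
    """Return True if data of this kind is still within its retention period."""
    return now - created_at <= RETENTION[kind]


now = datetime(2024, 1, 1, 12, 0)
one_hour_ago = now - timedelta(hours=1)

print(is_still_valid("stock_quote", one_hour_ago, now))
print(is_still_valid("purchase_record", one_hour_ago, now))
```

An hour-old stock quote is already stale under this policy, while an hour-old purchase record is kept.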
Conclusion
To conclude, each of the six characteristics of big data described above brings advantages, but none of them is free of challenges. These characteristics also help pinpoint the root causes of failures or defects in data in real time. I hope you have enjoyed this blog. For more information, just drop us a line and we will get back to you within 24 hours.