Big Data: Demystified

By Express Computer On Jul 25, 2012

Vivek Singh writes about how open source solutions can help harness the power of Big Data, with a real-life example

Big Data, besides being a challenge, is also a huge opportunity: insights generated from new and varied types of data can make businesses more agile than before. A simple, contemporary definition of Big Data would be ‘all the machine-generated data that gets populated rapidly, along with all the types of data whose difficulty lies in complexity rather than size or volume’. Examples of Big Data are found in industries such as e-commerce, telecom, social media and BFSI (banking, financial services and insurance), where the data being dealt with includes call logs, Web logs, Internet text, documents, search indexes, sensor network data, RFID data and social network data.

A typical example would be a Web commerce enterprise that needs near-real-time insights into its customers’ behavioral patterns and trends in order to shape its marketing campaigns, delivery model, pricing and product or service offerings.

Big Data can, and more often than not does, include a variety of ‘unusual’ data types. In that sense, it can be either structured or unstructured. Structured data includes high-volume transaction data such as call data records in telcos, retail transaction data and pharmaceutical drug test data. Data that is not fully structured is more difficult to process; this includes semi-structured data such as XML and HTML as well as unstructured data such as text, images, rich media and Web logs. The challenges here include the capture, storage, search, sharing, analysis and visualization of data sets. Though size (Volume), typically more than a couple of terabytes, is key to the primary definition of Big Data, the other critical dimensions are:

Latency (Velocity): For time-sensitive processes such as detecting fraud, Big Data must be used as it streams into your enterprise in order to maximize its value.

Variety: Big Data is any type of data; it includes structured as well as unstructured data in the form of text, sensor data, audio, video, click streams, log files and more. New insights are found when analyzing these data types together.

Complexity (Multi-Dimensionality): Big often refers to complexity rather than volume. A Big Data set can be small in size, and not every large dataset qualifies as Big Data.

The term Big Data is now applied more broadly to cover platforms that offer faster, cheaper, distributed processing power, clustered computing and low-cost storage with built-in fault tolerance, all running on commodity (cost-effective) hardware and networks. Hadoop is one such platform. It is supplemented by an ecosystem of Apache projects, such as Pig, Hive, HBase and ZooKeeper, that extend the power of Hadoop.

One of our clients, which generates about 2 TB/day of networking data, wanted to conduct a cost analysis of this data. It also wanted to store the data for 18 months. The data was time-series data and fit the definition of Big Data. Because it was not streaming data, we recommended GridFTP (an open source solution) as the data transport layer and suggested that the client set up a 50-node Apache Hadoop grid (each node with two quad-core CPUs, 8 GB RAM and 1 TB of disk space) for this purpose. Data was compressed using bzip compression.

Since the data was structured, we used Pig Latin for data quality checks and transformation. The client also wanted an SQL-like interface for the analysis, so the output of Pig was fed into Hive and Hive tables were created against which the client could run SQL-like statements. Jobs were scheduled with the Unix crontab utility, and new data was appended to the Hive tables after the completion of each job. The client is also looking at pre-defined, canned and ad hoc reporting.

We at GrayMatter are well placed in the Big Data space, having been firm advocates of and believers in open source systems and technologies since our inception, when open source was little known and not as mature as it is today.
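The retention requirement in the case study above (about 2 TB/day kept for 18 months) can be turned into a rough storage estimate. The sketch below is only a back-of-envelope illustration: the daily volume and retention period come from the article, while the compression ratio is a made-up placeholder and the replication factor of 3 is simply the HDFS default, not a figure reported for this project.

```python
# Back-of-envelope storage estimate for a retention requirement.
# Only the daily volume (2 TB) and the retention period (18 months) come from
# the article; the compression ratio and replication factor are assumptions.

DAYS_PER_MONTH = 365 / 12  # average month length, an approximation


def retained_raw_tb(daily_tb: float, months: float) -> float:
    """Uncompressed volume accumulated over the retention window."""
    return daily_tb * months * DAYS_PER_MONTH


def on_disk_tb(raw_tb: float, compression_ratio: float, replication: int) -> float:
    """Disk footprint after compression and block replication."""
    if compression_ratio <= 0 or replication < 1:
        raise ValueError("compression_ratio must be > 0 and replication >= 1")
    return raw_tb / compression_ratio * replication


raw = retained_raw_tb(daily_tb=2.0, months=18)
# ASSUMPTION: 5:1 compression is a placeholder; 3 replicas is the HDFS default.
footprint = on_disk_tb(raw, compression_ratio=5.0, replication=3)
print(f"raw: {raw:.0f} TB, on disk: {footprint:.0f} TB")
```

The result is very sensitive to the assumed compression ratio and replication factor, so both would need to be measured or chosen for the actual data before sizing a grid.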

Proven recommendations for Big Data projects

Unstructured search	Solr (search server built on Lucene)
Structured search (key-value pair)	Cassandra (not Hadoop supported), HBase
Document search	MongoDB (NoSQL), Solr
Transformation, data quality	Pig, Java MR code, Hadoop Streaming
SQL-like analysis	Hive, HBase
Stream input	Scribe, Flume
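To make the "Transformation, data quality" row more concrete, here is a minimal sketch of what a Hadoop Streaming mapper for a data quality step might look like in Python. Hadoop Streaming feeds input records to the mapper on standard input and expects key and value separated by a tab on standard output. The record layout (`timestamp,device_id,bytes`), the function names and the validation rules are all hypothetical and are not taken from the project described in this article, which used Pig Latin for this step.

```python
"""Hadoop Streaming style mapper: validate records, emit key<TAB>value lines."""
import sys
from datetime import datetime
from typing import Iterable, Iterator, Optional, Tuple

# ASSUMPTION: hypothetical record layout "timestamp,device_id,bytes",
# e.g. "2012-07-25T10:15:00,router-7,1500". Not taken from the article.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S"


def parse_record(line: str) -> Optional[Tuple[str, str, int]]:
    """Return (day, device_id, byte_count), or None if the record is invalid."""
    fields = line.strip().split(",")
    if len(fields) != 3:
        return None
    raw_ts, device_id, raw_bytes = (f.strip() for f in fields)
    try:
        day = datetime.strptime(raw_ts, TIMESTAMP_FORMAT).strftime("%Y-%m-%d")
        byte_count = int(raw_bytes)
    except ValueError:
        return None
    if not device_id or byte_count < 0:
        return None
    return day, device_id, byte_count


def map_lines(lines: Iterable[str]) -> Iterator[str]:
    """Drop invalid records; emit 'day|device<TAB>bytes' for valid ones."""
    for line in lines:
        record = parse_record(line)
        if record is None:
            continue
        day, device_id, byte_count = record
        yield f"{day}|{device_id}\t{byte_count}"


def main() -> None:
    """Entry point when run as a streaming mapper: stdin in, stdout out."""
    for out in map_lines(sys.stdin):
        print(out)


# Small in-memory demonstration; a real mapper script would call main() instead.
sample = [
    "2012-07-25T10:15:00,router-7,1500",
    "malformed line",
    "2012-07-25T10:16:00,router-7,-3",
]
for out in map_lines(sample):
    print(out)
```

In an actual job, a script like this would end with a call to `main()` and be supplied to the Hadoop Streaming jar through its `-mapper` option, with a reducer summing the byte counts per key. Keeping the parsing in a plain function makes the quality rules easy to test outside the cluster.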

The author is a Big Data Architect with GrayMatter Software Services Pvt. Ltd. and the author of a work of fiction, The Reverse Journey. He can be reached at [email protected]