
Big data: Hadoop


The open source framework that supports data-intensive distributed apps looks set to save enterprises from the oncoming data deluge

By Harshal Kallyanpur

For the past couple of years, Android and iOS fans have been at loggerheads over which operating system, and therefore which smartphone, is better than the other. These wars have spilled over into the comments sections of smartphone manufacturers’ customer forums, gadget review websites, social networking platforms and the like.

To most people this may seem like mindless banter, but for the companies manufacturing these devices, it is customer sentiment waiting to be tapped. By going through all this information, they can derive meaningful insights and translate them into features and functionality on upcoming models. So far so good. However, it is this “gather, filter, derive and translate” part that has most organizations tied up in knots.

The industry today is abuzz with talk of big data and how enterprises need to look at the famed ‘Vs’ of data: velocity, volume, variety and value, and then choose a path that makes the business case for Business Intelligence (BI) and analytics.

Enterprises today are seeing volumes of data being generated not only within their enterprise applications but also from customer and industry interactions on the Internet, social media and mobile devices, and from machines in the form of sensor data and logs.

According to Sanchit Vir Gogia, Chief Analyst & CEO of Greyhound Research, as consumers use more touch points such as mobile, the Internet and social media to obtain information, enterprises are looking to provide that information at every touch point, in context, to the consumer. Putting information in context is key to big data, as content in context is proving to be king.

However, to do so, an enterprise first needs to understand how it will draw information from various data sources and how that information can be used in a context that brings business value to the enterprise.

It is the diversity of this information, its structured or unstructured nature, the speed at which it is generated and its sheer volume that decide the approach an enterprise needs to take toward big data, and organizations are at various stages of understanding or implementing that approach. Here, we look at how Hadoop, as an approach to big data, is starting to make sense for enterprises on a big data journey.

Why go the Hadoop way?

Most enterprises looking at a big data analytics implementation would typically have an enterprise data warehouse or some sort of a structured database already in place, which gives them access to business data in a fairly defined way and allows them to make business decisions based on this information.

However, many organizations are now seeing data from multiple sources, much of it unstructured. On the consumer side, people access data from different devices and use channels such as social media, websites, Internet forums and video to express their sentiments. On the enterprise side, manufacturing equipment, network devices, smart devices and the like generate machine data which, though fairly structured, is coming into the enterprise at a rate beyond the capabilities of traditional database solutions.

Traditional data warehousing solutions, built to work with structured data, are not well equipped to handle unstructured data. Hadoop, with its Hadoop Distributed File System (HDFS), can store a wide variety of data, which can then be processed using tools written specifically for Hadoop-based file systems. This is what makes it an interesting proposition for big data analytics.
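To make the idea concrete, here is a minimal sketch of how data already sitting in HDFS might be processed with a Hadoop-native tool. It is a word-count job written for Hadoop Streaming in Python; the input and output paths in the comment are hypothetical, and the exact submission command depends on the cluster’s configuration.

```python
#!/usr/bin/env python3
"""Minimal word-count job for Hadoop Streaming (illustrative sketch).

A streaming job would run this script as both mapper and reducer,
for example (paths are hypothetical):
  hadoop jar hadoop-streaming.jar \
      -input /data/raw/forum_posts -output /data/out/word_counts \
      -mapper "wordcount.py map" -reducer "wordcount.py reduce" \
      -file wordcount.py
"""
import sys


def mapper():
    # Emit one "word<TAB>1" pair per token read from stdin.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word.lower()}\t1")


def reducer():
    # Streaming sorts mapper output by key, so counts for the same
    # word arrive as consecutive lines; sum them up per word.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")


if __name__ == "__main__":
    mapper() if len(sys.argv) > 1 and sys.argv[1] == "map" else reducer()
```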

According to Santhosh D’Souza, Director – Systems Engineering, NetApp India, the nature of data consumption, creation and analysis in web-scale environments has translated extremely quickly into enterprises as well. Enterprises have started experiencing the same problems of scale that Yahoo and Google faced earlier, problems that led those companies to develop web-scale approaches such as MapReduce.

“The increasing number of collaboration and social media type applications within enterprises has also added to the amount of data going into and coming out of enterprise applications, presenting a scenario where Hadoop becomes one of the answers,” says D’Souza.

Speaking about the merits of Hadoop and how vendors like NetApp are able to leverage it, he says, “A Hadoop system can scale out to a large extent. We can scale up to 69 petabytes in a single logical namespace and can build multiple such namespaces. It can also deal with structured data, and a wide variety of applications can use it as a data store.”

Amit Gupta, Senior Business Intelligence Architect, Persistent Systems, explains, “Hadoop’s MapReduce programming paradigm allows any type of data to be processed in any programming language, providing flexibility and a low entry barrier for developers. The Hive interface on Hadoop provides a layer of database metadata that can be used to define table structures on Hadoop files, and allows users to browse and process data using standard SQL commands.”
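As a rough illustration of the Hive layer Gupta describes, the sketch below declares a table structure over raw files already in HDFS and then queries it with ordinary SQL. It assumes a reachable HiveServer2 instance and uses the PyHive client; the host, port, file paths and column names are hypothetical.

```python
# Illustrative sketch: projecting a table structure onto HDFS files and
# querying it through Hive's SQL interface. Host, port, paths and columns
# are assumptions for the example, not a prescribed setup.
from pyhive import hive  # pip install "pyhive[hive]"

conn = hive.connect(host="hive-server.example.com", port=10000)
cur = conn.cursor()

# Declare a table over raw, tab-separated log files that already sit in HDFS.
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
        ts STRING,
        user_id STRING,
        url STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    LOCATION '/data/raw/web_logs'
""")

# Browse and aggregate with standard SQL; Hive turns this into jobs
# that run against the files in HDFS.
cur.execute("""
    SELECT url, COUNT(*) AS hits
    FROM web_logs
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10
""")
for url, hits in cur.fetchall():
    print(url, hits)

conn.close()
```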

“Many commercial and open source ETL tools use the Pig interface to push heavy data processing down to Hadoop for warehousing and data integration tasks. This helps bring down ETL server and license costs while achieving scalability with Hadoop,” adds Gupta.

Srinivasan Govindarajalu, Senior Director and Practice Head – DWBI, Virtusa Corporation, offers a similar opinion: “Most enterprises are quite happy using Hadoop for back-end processes like ETL and storage. Contrary to popular belief, Hadoop is being adopted a great deal from a storage point of view, so much so that many experts describe the use of Hadoop for storage as an unsupervised digital landfill.”

Anil Bajpai, Senior Vice President and Head of Research and Innovation, iGate, shares a similar opinion: “Hadoop can handle data better, especially if it is coming from multiple structured or unstructured sources. It is also becoming relevant for most organizations because it offers a cost-effective way of processing huge amounts of data. Cost can be an issue from an in-memory or enterprise data warehouse perspective, as anything over a terabyte runs into exorbitant costs.”

In-memory and Hadoop

Hadoop may offer a cost-effective way of storing data for big data analytics, but when it comes to performance, in-memory has been gaining popularity in India for the last year or more. Many organizations with traditional enterprise data warehouses have looked at adopting in-memory because it offers near real-time analytical performance and works with structured data, which today continues to deliver far more business value than unstructured data.

Says Gogia of Greyhound, “The biggest fallacy about big data is that it is all about unstructured data. The largest amount of big data analytics today is happening around structured data, and it continues to be the most critical decision-making factor.”

Therefore, in-memory is finding a lot of adoption in the country, where most enterprise environments still work largely with structured data. According to Seshadri Rangarajan, CTO, BIM, Global Service Line, Capgemini India, in-memory is finding greater adoption than Hadoop in India, particularly SAP’s HANA, as India is a major SAP market with many organizations having implemented SAP within their enterprise. For the performance-related challenges that enterprises face from an ERP or data warehouse perspective, HANA provides an immediate solution.

However, in-memory works best with structured data and has limits on memory, beyond which it becomes a cost-intensive proposition. As enterprises move further into predictive and advanced analytics, they will need large volumes of data to work with, and this is where the acceptance of Hadoop will grow.

According to Dinesh Jain, Country Manager, Teradata India, only 5 percent of unstructured data is usable. Hadoop can therefore be used for staging the unstructured data, extracting meaningful data from it and then running analytics on top of it. It is a tiered approach: the storage platform or staging area is Hadoop-based, and above it sits a data warehouse fronted by an in-memory layer. The in-memory database pulls data from the data mart, which in turn gets filtered data from the Hadoop store.

He explains this with an example: “If you have website data which is one petabyte and you are loading that into Hadoop, and you have a data warehouse which is 200 TB, only 5-10 terabytes of it will go in-memory.” Memory is expensive, he argues, so rather than putting hundreds of terabytes of memory behind the entire data warehouse, the enterprise can simply load into memory the part of it that business users ask for most frequently.
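The arithmetic behind that tiering is worth spelling out. The short sketch below, using illustrative figures in the same range as Jain’s example, shows how small the in-memory slice is relative to the raw data landed in Hadoop.

```python
# Back-of-the-envelope sketch of the tiered sizing Jain describes: raw data
# lands in Hadoop, a filtered subset feeds the warehouse, and only the
# hottest slice is held in memory. The figures are illustrative assumptions.
TB = 1
PB = 1024 * TB

raw_in_hadoop = 1 * PB   # e.g. a year of website/clickstream data
warehouse = 200 * TB     # cleansed, structured subset
in_memory = 7 * TB       # frequently queried slice (the 5-10 TB range)

print(f"Warehouse holds {warehouse / raw_in_hadoop:.1%} of the raw data")
print(f"In-memory tier holds {in_memory / raw_in_hadoop:.2%} of the raw data")
print(f"...and {in_memory / warehouse:.1%} of the warehouse")
```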

Rangarajan of Capgemini feels that at this point most enterprises are looking at real-time analytics rather than at petabytes of data, and in-memory technologies have been able to support more than 100 terabytes of data. Over time, however, Hadoop will be seen as a complementary advantage to in-memory as data volumes grow.

“In future, we see Hadoop becoming the underlying data store, while the data needed for real-time analytics will be available in memory. As the latency of this data increases, it will be moved to Hadoop-based storage, where it will be used for historical and predictive analytics,” says Rangarajan.

Greg Kleiman, Director – Strategy, Red Hat Storage, shares a similar opinion, saying that on a price-to-performance basis users can afford to pay a premium for in-memory because big data is usually data in motion: it comes in at a very high rate and has to be processed in near real time. Hadoop, on the other hand, is built for data at rest and for unstructured data.

“A lot of enterprises combine in-memory and Hadoop: they take the real-time data feed and use an in-memory database to analyze that data in real time. When they are done with the analytics, they put the results in Hadoop and can then run historical trends on that data, which is now at rest,” explains Kleiman.
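A toy sketch of that hand-off is shown below: the real-time step aggregates an event feed entirely in memory, and the finished result is then landed in Hadoop for later trend analysis. The event fields, HDFS paths and the use of the standard hdfs dfs -put command are assumptions for illustration, not a description of any vendor’s product.

```python
# Toy sketch of the pattern Kleiman describes: aggregate a live feed in
# memory, then land the finished result in Hadoop for historical analysis.
# The event source, paths and cluster setup are illustrative assumptions.
import json
import subprocess
import tempfile
from collections import Counter
from datetime import datetime, timezone


def analyze_in_memory(events):
    """Real-time step: keep running counts per product entirely in memory."""
    counts = Counter()
    for event in events:          # e.g. a stream of click events
        counts[event["product_id"]] += 1
    return counts


def land_in_hadoop(counts, hdfs_dir="/data/analytics/hourly_counts"):
    """Batch step: write the finished aggregate to HDFS for trend analysis."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump(counts, f)
        local_path = f.name
    # Uses the standard `hdfs dfs -put` CLI; requires a Hadoop client on PATH.
    subprocess.run(
        ["hdfs", "dfs", "-put", local_path, f"{hdfs_dir}/counts_{stamp}.json"],
        check=True,
    )


if __name__ == "__main__":
    feed = [{"product_id": "p1"}, {"product_id": "p2"}, {"product_id": "p1"}]
    land_in_hadoop(analyze_in_memory(feed))
```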

Rangarajan is also of the view that in the current scenario, data with low latency moves into in-memory and data with high latency moves into Hadoop. However, he believes we will move toward bringing the computing power to where the data is, rather than taking the data to where the computing power is: a scenario in which enterprises can generate data and run real-time analytics on it while also processing large volumes of data for historical analysis.

Tarun Kaura, Director – Technology Sales, India and SAARC, Symantec, believes that the decision to go with Hadoop or in-memory will depend on whether the organization is looking at structured or unstructured data, the response times it expects from the analytical framework and the associated costs.

Challenges around Hadoop

It may seem as though Hadoop will fit perfectly into an enterprise’s big data plans. However, the approach, though around for a while, has only recently started to find adoption. There are several reasons why it has taken this long for Hadoop to catch on with the Indian audience.

Firstly, the approach requires an in-depth understanding of Hadoop, HDFS and the associated tools, which in many cases need to be built from the ground up to suit the organization’s data warehousing requirements. Though popular with developers, the approach initially failed to find mainstream adoption because of the considerable amount of coding required.

Furthermore, Hadoop has largely been an open source effort, so its development and support have largely been community-based. As a result, Hadoop lacked enterprise-grade product offerings and support until recently, which further kept it from gaining traction among enterprise customers.

Finally, though data storage is cheaper with Hadoop, the investment in infrastructure and associated tools, the required skill sets and the management effort push up the total cost of ownership (TCO), making it somewhat unpopular with enterprises experimenting with the approach.

Jain of Teradata is of the view that Hadoop skills are not available in abundance. While people are getting trained and many system integrators are building competence in Hadoop, even once they are put to work, the usability of the platform for end users will remain a challenge for a long time.

“In a relational database, you can have the user look at the data directly and work with it. If the data is stored in a tabular format, it is very easy for everyone to understand. With the Hadoop file system, you need a great deal of technical manpower to pull meaningful information out of it. The fundamental architecture of the Hadoop file system is complex, and therefore this challenge will always exist,” he says.

Govindarajalu of Virtusa shares a similar perspective and says, “The Hadoop architecture is quite complex and deployment is not as straightforward as some of the proprietary BI solutions. It does not come neatly packaged, and you need developers and solution architects to put a Hadoop-based big data infrastructure together.”

He also feels that while the Hadoop approach is good for analyzing data, its visualization layer is not very mature, so enterprises need to interface with visualization tools from vendors such as QlikView, SAS and others.

Kleiman of Red Hat says, “With in-memory, the analytics part is much more mature, while with Hadoop it is still pretty new. An enterprise looking at an analytics solution won’t care much about the infrastructure; it will care about the value of the analytics and what it can get out of its data to help the business increase revenue or become more competitive.”

Talking about the skill set, he says, “The skillset on the IT side, on building a Hadoop infrastructure, is fairly mature. The area that is catching up is data analytics and the data scientist side of things. That part is still pretty immature, globally and in India. There is very good talent in India on the building side, but not much on the analytics and data scientist side.”

He also observes that, because of the complexities involved, an enterprise typically will not undertake a Hadoop implementation itself but will hire a system integrator to do it. Hadoop is still a developing skill, and many SIs are still learning how to do it, so it is going to take longer for Hadoop to find traction.

Finding acceptance

Despite all the challenges that make Hadoop adoption seem difficult, the benefits it promises in the long term have the industry working to make adoption easier. Many of the challenges stem from the immaturity of the Hadoop ecosystem from a big data perspective, something that will change over time as IT vendors develop solutions at every level of a Hadoop infrastructure. Because of its largely open nature, Hadoop is readily available to vendor organizations, which are building connectors that help enterprises interface a Hadoop architecture with their existing enterprise data warehouse or in-memory investments.

Explains D’Souza of NetApp, “Vendors like us are looking to arrive at a reference configuration for specific use cases and then depend on the data services and data management vendor to provide end-to-end support for the entire solution. Therefore, our focus has been to work with Hadoop distributions such as Cloudera and Hortonworks to develop reference architectures and validated configurations.”

What this would essentially do is allow the enterprise to deploy the entire gamut of data stores: in-memory, relational, NoSQL and Hadoop file system based. Front-end applications have already been developed, and people will continue to develop connectors to each of these data stores, so that once an enterprise has identified its disparate types of data, it will have the luxury of choosing which kind of data store to put that data into.

Bajpai explains iGate’s efforts in this area by describing a big data analytics framework the company is working on and expects to be ready in the next couple of months. The framework would take data from sources that are structured, unstructured, paper-based, electronic or machine-to-machine, and validate, verify, test and store this data using various Hadoop tools on a public cloud.

It would also look to present the data across different channels and devices, so that results are produced instantaneously, in various formats, in the way best suited to the user. This accelerated big data analytics platform can be used across various verticals. The framework will look at how enterprises take a business case and visualize data to deliver solutions for that business scenario.

Capgemini’s Rangarajan says, “The challenge with Hadoop is that SQL is the de facto standard for reading and analyzing data. When Hadoop came along, people needed to know Java and other technologies. However, the evolution has led to SQL becoming the standard for Hadoop as well. It is a relief for both system integrators and the industry, as a lot of investment has already gone into doing analytics from an SQL point of view. Almost all the leading vendors are developing SQL connectors on top of Hadoop.”

All of this is helping Hadoop find acceptance as an enterprise data store that complements the existing data warehouse. He believes that as enterprises start talking about the Internet of Things and next-generation applications, we will start seeing Hadoop embedded in the applications of the future. In a year or two, we will start seeing a connector to Hadoop available for most technologies.

While most vendors report interest from various industry verticals in leveraging Hadoop, it will be interesting to see a few real-world examples of how Hadoop has opened doors beyond the enterprise data warehouse and in-memory approaches.
