THE HUNT FOR a Hadoop tutorial online is proving tougher than I expected. At least, I’m not able to find one that will explain this software to a non-techie who barely knows a bit from a byte. What I do know (or rather, was told) is that this is a rapidly evolving platform that’s been making huge headway in India. Why don’t you profile its journey, asked my editor, and I agreed. The Asia head of a leading Hadoop vendor has agreed to talk to me, but I can’t show up with zero knowledge. Hence the scramble. There are YouTube videos aplenty that promise to explain how to install and use Hadoop, there are For Dummies books on it, and there are the ubiquitous discussion boards. None makes sense to me, so I ask my techie friends. Once the jargon they all seem to spout is stripped away, I learn that Hadoop is an open-source software framework hosted and maintained by the Apache Software Foundation, a community of developers. It allows computer systems with multiple nodes (or servers) to store and process terabytes of data fast. And, to make it seem accessible to people like me, it was named after a yellow stuffed elephant.
I refuse to be distracted by the mention of a soft toy (but I do read about how Doug Cutting, the man who co-developed Hadoop, so liked the name his son had given his favourite toy that he decided to name the software after it). Instead, I go back to hunting for digestible wisdom on Hadoop.
The one thing I find easily is that Big Data will drive the demand for Hadoop. With the huge increase in raw data, both structured and unstructured, and the rising need for analytics, the global market for Hadoop services is expected to grow at a compound annual rate of 58.2% between 2013 and 2020, from an estimated $2 billion (Rs 12,634 crore) in 2013 to $50.2 billion by 2020, according to a report by Allied Market Research. The Asia-Pacific region is expected to see the fastest growth during this period, an estimated 59.2% a year.
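For readers who want to check that the two dollar figures and the growth rate hang together, here is a small sketch in Python. It assumes the 58.2% is a compound annual rate applied over the seven years from 2013 to 2020; the variable names are mine, not the report's.

```python
# Figures quoted above (Allied Market Research estimate).
start_value_bn = 2.0    # estimated market size in 2013, in $ billion
end_value_bn = 50.2     # projected market size in 2020, in $ billion
years = 2020 - 2013     # seven years of compounding (assumption)

# Compound annual growth rate: the constant yearly rate that turns
# the starting value into the ending value over the period.
cagr = (end_value_bn / start_value_bn) ** (1 / years) - 1

print(f"Implied annual growth: {cagr:.1%}")
```

The implied rate comes out at roughly 58% a year, which is in line with the figure in the report.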
That’s some growth. Meanwhile, a Stack Overflow Developer Survey found Hadoop to be an “unusually widely used technology” in India. But nobody seems to be talking about it. Techie conversations, even in that most tech-obsessed city Bengaluru, revolve around Elon Musk’s Mars colonisation plans or Uber’s driverless cars. Ironically, I learn from media reports that Bengaluru is home to the largest number of Hadoop professionals in the world outside Silicon Valley.
So, why the coyness? One possible reason is that Hadoop is not a retail product. It’s one of those ‘for techies, by techies’ things, and so hasn’t quite captured the public imagination. That’s a shame, because Hadoop’s raison d’être is what we all want from tech: big, cheap.
The question Hadoop set out to answer was how to store and process huge amounts of data as cheaply as possible. While storage devices keep packing in more data, transfer speeds have not kept pace. In the 1990s, a typical drive transferred 4.4 MB per second. That has rocketed to 100 MB per second, but even then, it takes more than two and a half hours to read all the data off a 1 TB drive, Tom White says in his book Hadoop: The Definitive Guide.
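The arithmetic behind that read time is easy to reproduce. The sketch below assumes decimal units (1 TB = 1,000,000 MB) and a steady transfer rate; the 100-drive split at the end is an illustrative number of my own choosing, not a figure from this article.

```python
def hours_to_read(drive_bytes: float, bytes_per_second: float) -> float:
    """Return the hours needed to read a drive end to end at a steady rate."""
    return drive_bytes / bytes_per_second / 3600


MB = 10**6    # decimal megabyte (assumption)
TB = 10**12   # decimal terabyte (assumption)

# One drive holding the full terabyte, read at 100 MB per second.
single_drive_hours = hours_to_read(1 * TB, 100 * MB)

# The same terabyte spread evenly over 100 drives that are read at the
# same time (hypothetical drive count, for illustration only).
parallel_minutes = hours_to_read(1 * TB / 100, 100 * MB) * 60

print(f"One drive: {single_drive_hours:.2f} hours")    # One drive: 2.78 hours
print(f"100 drives: {parallel_minutes:.2f} minutes")   # 100 drives: 1.67 minutes
```

Spreading the data out is what shrinks the wait from hours to minutes, and that is the problem Hadoop goes after.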
Hadoop solves this with the oldest business management trick in the book: delegation. Rather than storing and reading data off a single drive, the framework splits the data into smaller tranches and stores them on different drives. Each drive now reads less data, and the whole job finishes far more quickly because thousands of ‘slave’ computers read different parts of the data at the same time and then put their work together, instead of a lone machine doing the whole task.
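The split-then-combine idea can be sketched in a few lines of plain Python. To be clear, this is not Hadoop's API: it is a toy word count in which threads on one machine stand in for the separate computers of a cluster, and every function name here is hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, num_chunks):
    """Divide the records into num_chunks roughly equal slices."""
    return [records[i::num_chunks] for i in range(num_chunks)]


def count_words(chunk):
    """The 'map' step: one worker counts words in its own slice only."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def combine(partial_counts):
    """The 'reduce' step: merge every worker's partial result."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


def parallel_word_count(records, num_workers=4):
    """Split the input, count each slice on its own worker, then merge."""
    chunks = split_into_chunks(records, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    return combine(partial_counts)


lines = [
    "the elephant stores data",
    "the elephant processes data",
    "data is split across drives",
]
result = parallel_word_count(lines, num_workers=3)
print(result["data"], result["elephant"])  # 3 2
```

Each worker sees only its own slice, and the final answer appears only once the partial counts are merged. On a real cluster the slices would sit on different machines, which is where the speed-up comes from.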
Hadoop’s code—the backbone of its existence—is free for anyone to access, modify, and sell. And that allows vendors to add layers of code to automate certain programming tasks, a process called abstraction. Hadoop, then, is the new-age IT manager.
Hadoop was officially born in 2005 and its next major upgrade, the relatively prosaically named Yet Another Resource Negotiator (YARN), came several years later with Hadoop 2.0, making it easier to build more applications on top of the basic architecture. “Earlier Hadoop could only run MapReduce but with YARN in place, it has become much more like an operating system, like the one on your mobile or laptop. It is [now] like any other OS where you can download applications on top of it,” explains Sharad Agarwal, co-creator of Hadoop 2.0. In its new avatar Hadoop is a many-armed, ready-to-use entity that can easily acquire even more limbs to process and store data faster through its “divide and conquer” strategy, he adds. Agarwal had earlier worked with Cutting, and currently looks after e-commerce giant Flipkart’s data architecture.
HADOOP 2.0 HAS THROWN OPEN big business opportunities, especially in emerging markets like India where the Big Data industry is growing at some 83% and is expected to cross $1 billion by the end of 2015, according to Crisil Global Research & Analytics. Now that so much can be built on Hadoop, early adopters, mostly those in BFSI (banking, financial services, and insurance), retail, telecom, travel, health care, social media, and IT/analytics services, are gunning for enterprise-level solutions where the code to power complex data processing and analysis is already written.
Rajeev Banduni, co-founder and CEO of India operations at GrowthEnabler, a consultancy advising entrepreneurs on setting up business, thinks the need for better insight into customer behaviour will drive Hadoop adoption more intensely in the near future. “The primary propagation will be [through] competition,” he says. “As some adopt Hadoop and do better, others will start using it.”
Meanwhile, big Hadoop vendors from the U.S. have been busy capturing the local market. For instance, San Jose-based MapR Technologies set up its Hyderabad office in 2009. Soon after, it became the official Hadoop distributor for the Unique Identification Authority of India (UIDAI) project, which runs one of the biggest Hadoop clusters in the world. “We’ve been following Hadoop since 2011, because we’ll always be a Big Data project,” says UIDAI’s chief product manager Vivek Raghavan. “We opted for the open-source stack when we started. But the platform became mission-critical about two years ago. So we set up a benchmark and decided that MapR’s distribution is the best for what we want to do.”
The UIDAI has more than a hundred nodes (a node is a single machine in a network, one of the many that together run Hadoop) in its cluster, which processes hundreds of terabytes of data, putting it in the same league as Facebook and Google, he adds.
“UIDAI was a very early project and it gave us an incredible foothold in India,” says Martin Darling, MapR’s vice president for the Asia-Pacific and Japan. “We’ve been on it for about two years now. [Judging by our experience] India is a seriously relevant market in APAC [for Hadoop].”
About 30 km from MapR’s offices is another Hadoop biggie, Cloudera, a Palo Alto-based firm that works with a wide range of companies, including digital businesses such as online travel agency Cleartrip. “We only focus on billion-dollar companies, and there are quite a few of them in India,” says Richard Jones, Cloudera’s regional vice president for Asia-Pacific and Japan. “In my experience, businesses in key sectors like banking and insurance are very familiar with Hadoop.” The company primarily earns from Hadoop training and support services. Last year its valuation was above $4 billion, while analysts forecast $329 million in global revenues in 2016.
The India focus does not surprise Satya Ramaswamy, who heads the digital enterprise arm of Tata Consultancy Services (TCS). “India is a Big Data market by definition,” he says, adding that “a large amount of work is now centred on Hadoop, which really started taking off in 2011”.
Interestingly, none of the global companies operating here is working solo. Both Cloudera and MapR have TCS as a systems integrator partner to help create applications on top of their distribution frameworks. TCS also brings in new clients, businesses which are just beginning to explore Hadoop. “We help analyse their requirements, and [recommend] the kind of distribution [that will work best for them],” says Ramaswamy. The IT major claims it is “close” to the big three, even though its Big Data division is currently building applications that can be integrated with Cloudera’s distribution.
What’s further attracting Indian businesses, especially startups, is the cost benefit. Prior to Hadoop, the key database technology was the RDBMS (relational database management system), invented at IBM in the 1970s. Oracle later released the first commercial RDBMS and went on to hold nearly 50% of the market, according to Gartner. But Oracle’s database machines are far more expensive than any standard Hadoop implementation. Back in 2008, the company came up with a 168 TB “data warehouse in a rack” with a price tag of $2.33 million. To that bill, add $95,000 a year for an Oracle engineer to manage the cluster, and the total cost of the system comes to about $2.62 million over three years.
Contrast this with the average cost of a Hadoop cluster. According to industry estimates, each node in a Hadoop cluster (containing a processor, a few hard disks, and a network card for Internet connectivity) costs around $4,000. The price can be kept low as Hadoop runs on commodity hardware—multipurpose, affordable components equivalent to off-the-rack clothes. Going by that pricing, a 75-node Hadoop cluster storing 300 TB of data would cost $1.05 million over three years (including hardware and annual operations cost). The difference is huge—Hadoop handles almost twice the data and it’s nearly 2.5 times cheaper.
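The comparison can be laid out as a short calculation. The sketch below uses only the figures quoted above; the Hadoop operations cost is not stated directly, so it is derived here as whatever remains of the $1.05 million total after hardware, which is my assumption about how that total breaks down.

```python
YEARS = 3

# Oracle "data warehouse in a rack" (2008 figures quoted above).
oracle_hardware = 2_330_000          # price tag in dollars
oracle_engineer_per_year = 95_000    # engineer to manage the cluster
oracle_capacity_tb = 168
oracle_total = oracle_hardware + oracle_engineer_per_year * YEARS

# Hadoop cluster on commodity hardware (industry estimates quoted above).
hadoop_nodes = 75
hadoop_cost_per_node = 4_000
hadoop_capacity_tb = 300
hadoop_total = 1_050_000             # three-year total quoted above
hadoop_hardware = hadoop_nodes * hadoop_cost_per_node
hadoop_operations = hadoop_total - hadoop_hardware   # derived, not quoted

cost_ratio = oracle_total / hadoop_total
data_ratio = hadoop_capacity_tb / oracle_capacity_tb
oracle_per_tb = oracle_total / oracle_capacity_tb
hadoop_per_tb = hadoop_total / hadoop_capacity_tb

print(f"Oracle, three years: ${oracle_total:,}")      # $2,615,000
print(f"Hadoop hardware:     ${hadoop_hardware:,}")   # $300,000
print(f"Cost ratio: {cost_ratio:.2f}x, data ratio: {data_ratio:.2f}x")
print(f"Per TB: Oracle ${oracle_per_tb:,.0f}, Hadoop ${hadoop_per_tb:,.0f}")
```

On these numbers the Oracle system costs about 2.5 times as much while holding a little over half the data, so the gap per terabyte stored is wider still: more than four to one.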
Cleartrip took advantage of Hadoop’s low-cost scalability and spent just Rs 2 crore to set up its cluster; operational costs are about Rs 25 lakh annually, excluding salaries. Overall, that’s almost six times less expensive than Oracle’s RDBMS. Cleartrip says the secret lies in Hadoop’s open-source origins.
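One way to arrive at that "six times" figure is sketched below. It rests on two assumptions of mine that the article does not spell out: the comparison covers three years, as the Oracle estimate does, and dollars are converted at the rate implied by this article's own "$2 billion (Rs 12,634 crore)".

```python
CRORE = 10_000_000   # 1 crore = 10 million rupees
LAKH = 100_000       # 1 lakh = 100 thousand rupees
YEARS = 3            # assumed comparison period

# Exchange rate implied by the article's "$2 billion (Rs 12,634 crore)".
rupees_per_dollar = (12_634 * CRORE) / 2_000_000_000

# Cleartrip: Rs 2 crore to set up, Rs 25 lakh a year to run (no salaries).
cleartrip_total = 2 * CRORE + 25 * LAKH * YEARS

# Oracle: about $2.62 million over three years, converted to rupees.
oracle_total_rupees = 2_620_000 * rupees_per_dollar

ratio = oracle_total_rupees / cleartrip_total

print(f"Implied rate: Rs {rupees_per_dollar:.2f} per dollar")
print(f"Cleartrip, three years: Rs {cleartrip_total / CRORE:.2f} crore")
print(f"Oracle, three years: Rs {oracle_total_rupees / CRORE:.2f} crore")
print(f"Oracle costs {ratio:.1f} times as much")
```

Under those assumptions the Oracle bill is about six times Cleartrip's. Note that the Oracle figure includes an engineer's pay while Cleartrip's excludes salaries, so the two totals are not strictly like for like.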
A PROMISING START notwithstanding, Indian businesses trying out Hadoop have been slow on the uptake. Cleartrip, for example, installed its 10-node cluster (runs on Cloudera’s platform) in the middle of 2013. The company is still largely using it for raw data storage and working with a skeletal team of three to four people. “We are finding it difficult to find people who are experienced in Hadoop and Big Data,” rues Ramesh Krishnamoorthy, Cleartrip’s senior vice president of technology. The company says it is exploring Hadoop’s more complex applications. “We have two or three types of data reporting tools. But right now, we’re trying to understand the different capabilities,” Krishnamoorthy says. He adds that Hadoop operations will expand rapidly as the company collects more data on everything from customer usage to flight availability.
Cleartrip sounds much like the companies Gartner surveyed: 70% of them indicated that fewer than 20 employees actually use Hadoop. Of course it is a niche technology, but a thin talent pool can be a major obstacle to continued Hadoop adoption. Raghavan of UIDAI admits that his project has struggled to cope with the crunch; Banduni of GrowthEnabler blames a fast-widening skills gap. “There’s a lack of data scientists here. There are those who can generate data, and others who can apply the findings to business decisions. The second kind is much harder to find,” says Banduni. Others, like Cutting’s early collaborator Agarwal, think the talent pool will begin to grow as the market for Hadoop deepens. “In Asia, Bengaluru would be among the top cities. There are plenty of committers here, a huge ecosystem, and a lot of development.”
THAT’S JUST ONE PART of the story. As of now Big Data (and therefore, Hadoop) is nowhere to be found on Gartner’s Hype Cycle for Emerging Technologies. Just two years ago, Big Data was at the top of the cycle, before being replaced by the new vogue Internet of Things. Three years ago, Cloudera co-founder Mike Olson bluntly stated that “the biggest winners in the Big Data world aren’t the Big Data technology vendors, but rather the companies that will leverage Big Data technology to create entirely new businesses or disrupt legacy businesses... Our customers don’t care nearly as much about Hadoop as they care about engaging better with their customers or preventing fraud in their transaction flows. People choose the platform because it solves problems that they care about.”
If Hadoop can evolve fast while staying close to that vision, it may be on to greater things. It’s not a programming language, but it boasts the kind of community of developers and users, and the kind of legacy, that widely used languages leave behind. (In the TIOBE Index that tracks and assesses the popularity of programming languages, many have risen and withered away during the Hadoop era. While Java fell, only to dramatically come to life as the most used programming language in 2015, C++ has revived slowly.)
But the question is whether Hadoop can deal with the onslaught of competing Big Data tools. The Apache community is already developing alternatives. Engines like Spark, which can run on top of Hadoop’s storage and resource management, are picking up. In June, IBM announced that it is committed to developing Apache Spark, calling it “the most important new open source project in a decade”.
“Hadoop is a good way to query unstructured data, but it isn’t good for decision making,” explains Banduni. “Also, it is not really tested to the nth level of data. Those case studies—like the UIDAI—are coming out only now.” Will Hadoop grow beyond its niche in India despite the global turmoil? Banduni thinks so. “What works for Hadoop is that there is no [viable] alternative right now. It’s hard to switch out of it.” He adds that digital natives “who are comfortable with technology, and corporates who are figuring it out” will drive adoption.
Meanwhile, I am downloading my first proper Hadoop tutorial. The elephant may yet teach me a lot about efficiency.