Hadoop Series Three: The Future of Technology

Technological change is a major factor in Hadoop's future. The development of Hadoop and related big data technologies over the past few years offers some clues about where things are heading.

Data Storage – Optimistic Prospects

In file storage, HDFS is stable and robust and has become the de facto standard for storing massive files. Other distributed file storage technologies, such as GlusterFS and Tachyon, are worth watching, but none of them poses a substantial threat to HDFS yet.

In contrast to file storage, where a single technology dominates, structured data storage is a field where a hundred flowers bloom. As mentioned earlier in this series, the most mature structured data store in the Hadoop ecosystem is HBase. You can think of it as a more flexible and extensible MySQL. Compared with popular NoSQL databases such as MongoDB and Cassandra, HBase keeps a relatively low profile. Still, I personally think HBase suits a wider range of uses, and its prospects remain very good. I will not go into NoSQL databases here; interested readers can refer to the book NoSQL Essence.
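
To make the "more flexible MySQL" comparison a little more concrete, here is a rough sketch of the kind of table HBase offers: rows are looked up by a row key, columns are grouped into column families fixed up front, and each row may carry a different set of columns. The class ToyWideColumnTable and its methods are made up for illustration; this is not the HBase client API, and it leaves out versions, regions and persistence.

```python
from collections import defaultdict


class ToyWideColumnTable:
    """Hypothetical in-memory stand-in for a wide-column table (not HBase's API)."""

    def __init__(self, column_families):
        # Column families are declared when the table is created.
        self.column_families = set(column_families)
        # row key -> {(family, qualifier): value}; rows are sparse.
        self.rows = defaultdict(dict)

    def put(self, row_key, family, qualifier, value):
        """Store one cell; qualifiers need no schema change, families do."""
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][(family, qualifier)] = value

    def get(self, row_key):
        """Return a copy of all cells stored for one row key."""
        return dict(self.rows.get(row_key, {}))


if __name__ == "__main__":
    users = ToyWideColumnTable(["info", "stats"])
    users.put("user-001", "info", "name", "Alice")
    users.put("user-001", "stats", "logins", 42)
    # A second row with a column the first row never had.
    users.put("user-002", "info", "email", "bob@example.com")
    print(users.get("user-001"))
    print(users.get("user-002"))
```

Unlike a MySQL table, no ALTER TABLE step is needed before user-002 gains an email column, which is the flexibility the comparison points at.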

Data Processing – Challenges

In data processing, MapReduce is no longer popular. The fundamental reason is that the MapReduce model is too simple, which paradoxically makes programming hard: even a simple word count program takes a lot of MapReduce code. Higher-level tools such as Pig and Cascading help, but MapReduce programming remains a headache. The simple model also makes it hard to optimize performance for specific kinds of data processing. In applications such as machine learning that pass over the same data repeatedly, reading and writing files between passes becomes the bottleneck. Spark, being both simple and efficient, has the potential to replace MapReduce as a general-purpose data processing engine. Hadoop itself has also launched new data processing engines, such as MRv2 (YARN) and Tez, but the future probably still belongs to Spark.
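
To give a feel for the model, here is word count written as three explicit phases: map, shuffle and reduce. This is only a single-process imitation in plain Python, not Hadoop code; the function names map_phase, shuffle_phase and reduce_phase are my own. In a real Hadoop job the shuffle is done by the framework, and the mapper and reducer are separate classes wired together with job configuration, which is where much of the extra code comes from.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in order over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["the quick brown fox", "the lazy dog", "The end"]
    print(word_count(sample))
```

Every job has to be squeezed into this map-then-reduce shape, and a multi-step computation becomes a chain of such jobs. Spark's RDD API expresses the same count as a short chain of flatMap, map and reduceByKey calls, which is a large part of its appeal.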

Resource Allocation – Full of Opportunities

Another problem with the old MapReduce was that its resource allocation mechanism performed poorly. To remove this bottleneck at its root, Hadoop's MapReduce framework was completely rewritten starting with version 0.23.0. The new framework is called MapReduceV2, or YARN.

Although YARN was created for MapReduce, it is in fact an independent resource management framework: in theory any distributed application can run on it, with YARN only allocating resources such as CPU and memory. Indeed, non-Hadoop applications such as Spark and Storm already support running on YARN. This gives Hadoop a way to bring other big data applications into its fold. YARN does not have the field to itself, however. Mesos is a competitor that cannot be ignored, and Mesosphere is about to release its data center operating system. One look at its demo gives a sense of where resource allocation is heading.
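
The idea of a resource manager that only hands out CPU and memory, without caring what the application does, can be sketched in a few lines. Everything here is hypothetical: ToyResourceManager, Node and Container are illustrative names, and the first-fit rule is a stand-in I chose for simplicity, not how YARN's or Mesos's schedulers actually decide.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_vcores: int
    free_memory_mb: int


@dataclass
class Container:
    app: str
    node: str
    vcores: int
    memory_mb: int


class ToyResourceManager:
    """Hypothetical first-fit allocator; not YARN's real scheduler or API."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def allocate(self, app, vcores, memory_mb):
        """Return a Container on the first node with enough room, else None."""
        for node in self.nodes:
            if node.free_vcores >= vcores and node.free_memory_mb >= memory_mb:
                node.free_vcores -= vcores
                node.free_memory_mb -= memory_mb
                return Container(app, node.name, vcores, memory_mb)
        return None  # the cluster cannot satisfy the request right now


if __name__ == "__main__":
    rm = ToyResourceManager([Node("node-1", 4, 8192), Node("node-2", 8, 16384)])
    # The manager treats every application the same way: as a resource request.
    print(rm.allocate("mapreduce-job", 2, 4096))
    print(rm.allocate("spark-job", 4, 8192))
    print(rm.allocate("storm-topology", 8, 16384))
```

Because the allocator never looks inside the application, a MapReduce job, a Spark job and a Storm topology can share one cluster, which is the point made above.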

Judging from how Hadoop's support for file storage and resource allocation is developing, we can imagine Hadoop becoming a low-level piece of infrastructure, much like today's operating systems.

Conclusion

As a platform and ecosystem for big data, Hadoop has passed its period of explosive growth and entered a stage of steady, rational growth. Like any other technology, it will face challenges both from its own renewal and from the new technologies around it. An open source community prospers only through a virtuous cycle: better software attracts more users, more users bring more contributors, and more contributors make the software better still. I hope Hadoop's continued prosperity will let small and medium-sized enterprises in every field process massive amounts of data easily and happily.

More
Hadoop Series One: The Birth of a Baby Elephant
Hadoop Series Two: Three Pillars
