Big data grew up alongside advances in science and technology. In less than 20 years its development has been smooth, and it has penetrated nearly every aspect of social production and daily life.
However, as the volume of information grows exponentially, big data has begun to face a series of problems: storage resources are running short, computing power is insufficient, and data-processing efficiency cannot keep up with business growth, giving rise to voices predicting its decline.
In recent years, container technology, with its light weight, easy migration, and fast scaling, combined with a distributed architecture that separates compute from storage, has allowed big data platforms to better play to their strengths in scenarios such as massive data sets, high concurrency, and real-time analytics.
Industries such as the Internet, automotive, insurance, electric power, and retail use massive amounts of information to analyze user characteristics and behavior patterns, developing service plans and business strategies that sit closer to users and delivering precisely targeted recommendations.
At present, most data analysis takes place within the Hadoop ecosystem. Thanks to that mature ecosystem, Hadoop is popular among users and has become the mainstream open-source big data platform, practically a synonym for big data itself.

However, since the first Hadoop release in 2006, big data has been developing for at least 13 years. The once-celebrated converged compute-storage architecture, along with its advanced data-analysis concepts and practices, has begun to face challenges:
1. Coupled compute and storage resources cannot be scaled independently; clusters can only grow in a fixed ratio, wasting whichever resource is in surplus.
2. Data centers are expensive to build and expensive to operate and maintain; their cost-effectiveness and flexibility lag behind public-cloud offerings.
3. In the Internet era, data is growing explosively; existing data centers run short of resources, which easily leads to job congestion and reduced computing efficiency.
4. Big data resources cannot be shared with other business resource pools, so multiple separate clusters must be maintained, further increasing operation and maintenance costs.
In addition, the rise of AI, machine learning, natural language processing (NLP), and related fields has put pressure on big data, and claims that "big data is dead" have never entirely gone away.

With the arrival of the 5G + cloud + AI era, data is becoming larger, more complex, and more refined. Big data is not dead; it matters to enterprises more than ever before. The urgent problem is how to handle this explosive data growth with a more efficient and practical solution.
Around this question, major companies have launched a new round of technological exploration and upgrades.
First, with the rapid development of basic networks, network transmission is no longer the bottleneck, and many companies have begun to separate storage from compute in their big data systems. Does it work? According to an IDC China report: "Decoupled compute and storage have proved useful in big data deployments, providing higher resource utilization, greater flexibility, and lower cost."
At the same time, as container technology matures and sees deeper adoption across industries, some enterprises are beginning to containerize their platforms, hoping to harness the advantages of containers to give the big data platform new momentum.
Combining the two, we seem to glimpse the dawn of big data's transformation.
A Journey of Metamorphosis

At present, the storage-compute separation approach is relatively mature, while containerization is still in the stage of exploration and small-scale application. Taking Spark as an example, there are generally two schemes:
The first is the Spark Standalone scheme, which only containerizes the deployment of the big data system. Thanks to containers' light weight, finer-grained compute management, and task isolation, a host can be divided into many small task units, so host resources are used more efficiently while users' existing habits are preserved.
However, this scheme requires allocating a fixed number of containers in advance and keeping them running continuously; containers cannot be managed dynamically. Resource utilization improves, but waste remains.
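The utilization gap between fixed pre-allocation and demand-driven allocation can be illustrated with a toy simulation. All workload numbers below are invented purely for illustration:

```python
# Toy simulation: fixed pre-allocated containers (Standalone-style) vs.
# demand-driven allocation. The workload numbers are made up for illustration.

hourly_demand = [2, 3, 8, 10, 6, 2]   # containers actually needed each hour
FIXED_POOL = 10                        # static scheme: pre-allocate for the peak

# Static scheme: FIXED_POOL containers run every hour regardless of demand.
static_container_hours = FIXED_POOL * len(hourly_demand)

# Dynamic scheme: only as many containers as each hour demands are running.
dynamic_container_hours = sum(hourly_demand)

used_hours = sum(hourly_demand)
static_utilization = used_hours / static_container_hours
dynamic_utilization = used_hours / dynamic_container_hours

print(f"static utilization:  {static_utilization:.0%}")
print(f"dynamic utilization: {dynamic_utilization:.0%}")
```

In this toy model the statically provisioned pool sits at roughly half utilization because it is sized for the peak hour, which is the waste the next scheme tries to eliminate.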
The other is the Spark on Kubernetes scheme, which replaces YARN with Kubernetes for unified resource orchestration and scheduling. Technically closer to mainstream container practice, it eliminates two-layer scheduling and can further improve the efficiency of resource management. Compared with the Standalone scheme, it manages container resources dynamically and optimizes resource allocation.
However, Kubernetes is not a Hadoop ecosystem component. Compared with traditional Spark on YARN, it has drawbacks such as the lack of task queues, the lack of an external shuffle service, and weaker performance. Applying it to a production system therefore requires substantial feature enhancement, scheduling work, and performance optimization to reach parity with a traditional big data platform.
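As a sketch of what such a submission looks like, the snippet below assembles a hypothetical spark-submit invocation for Kubernetes as a Python argument list. The API-server address, image name, and jar path are placeholders, and the shuffle-tracking setting is the Spark 3.x workaround for the missing external shuffle service mentioned above:

```python
# Hypothetical spark-submit invocation for Spark on Kubernetes, assembled as a
# Python list for illustration; addresses, image, and jar path are placeholders.
K8S_MASTER = "k8s://https://kubernetes.example.com:6443"  # placeholder API server
IMAGE = "example/spark:latest"                            # placeholder image

submit_cmd = [
    "spark-submit",
    "--master", K8S_MASTER,
    "--deploy-mode", "cluster",
    "--name", "spark-pi-demo",
    "--conf", "spark.executor.instances=3",
    "--conf", f"spark.kubernetes.container.image={IMAGE}",
    # Kubernetes has no YARN-style external shuffle service, so dynamic
    # allocation on Spark 3.x relies on shuffle tracking instead:
    "--conf", "spark.dynamicAllocation.enabled=true",
    "--conf", "spark.dynamicAllocation.shuffleTracking.enabled=true",
    "local:///opt/spark/examples/jars/spark-examples.jar",  # placeholder jar
]

print(" ".join(submit_cmd))
```

Here the Kubernetes API server acts as the cluster manager directly, which is what removes the second scheduling layer that a YARN-based deployment carries.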
To address the problems customers encounter during containerization, Huawei Cloud plans to launch the Huawei Cloud Kunpeng Big Data Container Solution, which, combined with BigData Pro, will provide a more complete containerized big data offering.
BigData Pro is the industry's first Kunpeng big data solution. It adopts a compute-storage-separation architecture on the public cloud, using elastically scalable Kunpeng compute as the computing resource and the OBS object storage service, with native multi-protocol support, as a unified data lake. The result is a new public-cloud big data solution characterized by compute-storage separation, extreme elasticity, and extreme efficiency. It significantly improves the resource utilization of big data clusters, can effectively address the bottlenecks facing the big data industry, and helps enterprises meet the new challenges of the 5G + cloud + AI era and achieve intelligent transformation and upgrading.