
Architecting a Modern Data Lake in a Post-Hadoop World

by MinIO, September 13th, 2024



The Modern Datalake is one-half data warehouse and one-half data lake and uses object storage for everything. The use of object storage to build a data warehouse is made possible by Open Table Formats (OTFs) like Apache Iceberg, Apache Hudi, and Delta Lake, which are specifications that, once implemented, make it seamless for object storage to be used as the underlying storage solution for a data warehouse. These specifications also provide features that may not exist in a conventional Data Warehouse - for example, snapshots (also known as time travel), schema evolution, partitions, partition evolution, and zero-copy branching.


As organizations build Modern Datalakes, here are some of the key factors we think they should be considering:


  1. Disaggregation of compute and storage
  2. Migration from monolithic frameworks to best-of-breed frameworks
  3. Data center consolidation - replace departmental solutions with a single corporate solution
  4. Seamless performance across small and large files/objects
  5. Software-defined, cloud-native solutions that scale horizontally


This paper covers the rise and fall of Hadoop HDFS and explains why high-performance object storage is its natural successor in the big data world.

Adoption of Hadoop

With the expansion of internet applications, advanced tech companies ran into their first major data storage and aggregation challenges 15 years ago. Traditional relational database management systems (RDBMS) could not scale to handle large amounts of data. Then came Hadoop, a highly scalable model. In the Hadoop model, a large amount of data is divided across multiple inexpensive machines in a cluster and then processed in parallel. The number of these machines, or nodes, can be increased or decreased according to the enterprise’s requirements.


Hadoop was open source and ran on commodity hardware, which made it cost-efficient, unlike traditional relational databases, which require expensive hardware and high-end processors to deal with big data. Because scaling the RDBMS model was so expensive, enterprises started discarding raw data. This led to suboptimal outcomes across a number of vectors.


In this regard, Hadoop provided a significant advantage over the RDBMS approach. It was more scalable from a cost perspective, without sacrificing performance.

The End of Hadoop

The advent of newer technologies like change data capture (CDC) and streaming data, generated primarily by social media companies like Twitter and Facebook, altered how data is ingested and stored. This created challenges in processing and consuming these even larger volumes of data.


A key challenge was batch processing. Batch processes run in the background and do not interact with the user. Hadoop handled batch processing of very large files efficiently but struggled with smaller files, in both efficiency and latency. That limitation effectively rendered it obsolete as enterprises sought processing and consumption frameworks that could ingest varied datasets, large and small, in batch, CDC, and real-time modes.


Separating compute and storage simply makes sense today. Storage needs can outpace compute needs by as much as ten to one. Meeting that ratio is highly inefficient in the Hadoop world, where you need one compute node for every storage node. Separating them means each can be tuned individually. The compute nodes are stateless and can be optimized with more CPU cores and memory. The storage nodes are stateful and can be I/O optimized with a greater number of denser drives and higher bandwidth.
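The sizing argument can be made concrete with a little arithmetic. The sketch below is a hypothetical illustration, not a MinIO or Hadoop tool: the function names and the demand figures are assumptions chosen only to show how a coupled cluster forces compute to scale along with storage.

```python
def coupled_nodes(storage_units, compute_units):
    """Coupled model: every node supplies one unit of storage AND one unit
    of compute, so the cluster is sized for whichever demand is larger."""
    return max(storage_units, compute_units)


def disaggregated_nodes(storage_units, compute_units):
    """Disaggregated model: the storage and compute tiers are sized
    independently of each other."""
    return storage_units, compute_units


# Hypothetical workload: storage demand is ten times compute demand.
storage_demand, compute_demand = 100, 10

coupled = coupled_nodes(storage_demand, compute_demand)
storage_tier, compute_tier = disaggregated_nodes(storage_demand, compute_demand)

# Compute capacity that the coupled cluster provisions but does not need.
idle_compute = coupled - compute_demand

print(f"coupled: {coupled} nodes, {idle_compute} compute units idle")
print(f"disaggregated: {storage_tier} storage nodes + {compute_tier} compute nodes")
```

In this made-up scenario the coupled cluster carries 90 units of compute it never uses, while the disaggregated layout provisions only the 10 it needs.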


By disaggregating, enterprises can achieve superior economics, better manageability, improved scalability, and lower total cost of ownership.


HDFS cannot make this transition. When you leave data locality, Hadoop HDFS’s strength becomes its weakness. Hadoop was designed for MapReduce computing, where data and compute had to be co-located. As a result, Hadoop needs its own job scheduler, resource manager, storage, and compute. This is fundamentally incompatible with container-based architectures, where everything is elastic, lightweight, and multi-tenant.


In contrast, MinIO was born cloud native and is designed for containers and orchestration via Kubernetes, making it the ideal technology to transition to when retiring legacy HDFS instances.


This has given rise to the Modern Datalake. It keeps the commodity hardware approach inherited from Hadoop but disaggregates storage and compute, thereby changing how data is processed, analyzed, and consumed.

Building a Modern Data Lake with MinIO

MinIO is a high-performance object storage system that was built from scratch to be scalable and cloud-native. The team that built MinIO also built one of the most successful file systems, GlusterFS, before evolving their thinking on storage. Their deep understanding of file systems and which processes were expensive or inefficient informed the architecture of MinIO, delivering performance and simplicity in the process.


MinIO uses erasure coding, which provides a better set of algorithms for managing storage efficiency and resiliency. Typically, erasure coding stores about 1.5 times the size of the original data, compared with 3 times under the three-way replication used in Hadoop clusters. This alone improves storage efficiency and reduces costs compared to Hadoop.
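The overhead comparison is simple arithmetic. In the sketch below, the 8 data + 4 parity shard layout is an assumption picked because it reproduces the 1.5 figure; actual layouts depend on how a deployment is configured. The function names and the 1,000 TB dataset are likewise illustrative.

```python
def erasure_overhead(data_shards, parity_shards):
    """Raw bytes stored per byte of user data under erasure coding."""
    return (data_shards + parity_shards) / data_shards


def replication_overhead(copies):
    """Raw bytes stored per byte of user data under full replication."""
    return float(copies)


# Assumed layout: 8 data shards + 4 parity shards per object.
ec = erasure_overhead(8, 4)      # 12 / 8 = 1.5
rep = replication_overhead(3)    # three full copies = 3.0

# Raw capacity needed for a hypothetical 1,000 TB of user data.
usable_tb = 1000
raw_ec_tb = usable_tb * ec       # 1500.0
raw_rep_tb = usable_tb * rep     # 3000.0

print(f"erasure coding: {ec}x overhead, {raw_ec_tb} TB raw")
print(f"replication:    {rep}x overhead, {raw_rep_tb} TB raw")
```

Under these assumptions, erasure coding needs half the raw capacity of three-way replication for the same amount of user data.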


From its inception, MinIO was designed for the cloud operating model. As a result, it runs on every cloud—public, private, on-prem, bare metal, and edge. This makes it ideal for multi-cloud and hybrid-cloud deployments. With a hybrid configuration, MinIO enables the migration of data analytics and data science workloads in accordance with approaches like the Strangler Fig Pattern popularized by Martin Fowler.


Below are several other reasons why MinIO is the basic building block for a Modern Datalake capable of supporting your AI data infrastructure as well as other analytical workloads such as business intelligence, data analytics, and data science.

Modern Data Ready

Hadoop was purpose-built for data where “unstructured data” means large (GiB to TiB-sized) log files. When used as a general-purpose storage platform where true unstructured data is in play, the prevalence of small objects (KB to MB) greatly impairs Hadoop HDFS, as the name nodes were never designed to scale in this fashion. MinIO excels at any file/object size (8KiB to 5TiB).
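To see why object size matters for metadata scale, it helps to count how many objects a fixed amount of data turns into. The sketch below is illustrative only: the 1 PiB dataset and the two average object sizes are hypothetical, and it makes no claim about how much memory a name node needs per object.

```python
KIB = 1024
GIB = 1024 ** 3
PIB = 1024 ** 5


def object_count(dataset_bytes, avg_object_bytes):
    """Number of objects needed to hold the dataset (ceiling division)."""
    return -(-dataset_bytes // avg_object_bytes)


# The same hypothetical 1 PiB dataset, stored two different ways.
large_objects = object_count(PIB, GIB)        # 1 GiB log-style files
small_objects = object_count(PIB, 64 * KIB)   # 64 KiB small objects

# How many more metadata entries the small-object layout requires.
ratio = small_objects // large_objects

print(f"1 GiB objects:  {large_objects:,}")
print(f"64 KiB objects: {small_objects:,}")
print(f"small-object layout tracks {ratio:,}x more entries")
```

The data volume is identical in both cases; only the number of entries the metadata layer must track changes.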

Open Source

The enterprises that adopted Hadoop did so out of a preference for open-source technologies. The ability to inspect, the freedom from lock-in, and the comfort that comes from tens of thousands of users all have real value. MinIO is also 100% open source, ensuring that organizations can stay true to their goals while upgrading their experience.

Simple

Simplicity is hard. It takes work, discipline, and above all, commitment. MinIO’s simplicity is legendary and is the result of a philosophical commitment to making our software easy to deploy, use, upgrade, and scale. Even Hadoop’s fans will tell you it is complex. To do more with less, you need to migrate to MinIO.

Performant

Hadoop rose to prominence because of its ability to deliver big data performance. It was, for the better part of a decade, the benchmark for enterprise-grade analytics. Not anymore. MinIO has proven in multiple benchmarks that it is materially faster than Hadoop. This means better performance for your Modern Datalake.

Lightweight

MinIO’s server binary is less than 100 MB. Despite its size, it is powerful enough to run the data center, yet still small enough to live comfortably at the edge. There is no such alternative in the Hadoop world. For enterprises, this means S3 applications can access data anywhere, anytime, with the same API. By deploying MinIO to an edge location, you can capture and filter data at the edge and use MinIO’s replication capabilities to ship it to your Modern Datalake for aggregation and further analytics.

Resilient

MinIO protects data with per-object, inline erasure coding, which is far more efficient than the HDFS alternatives that arrived after replication and never gained adoption. In addition, MinIO’s bitrot detection ensures that it will never read corrupted data, capturing and healing corrupted objects on the fly. MinIO also supports cross-region, active-active replication. Finally, MinIO supports a complete object-locking framework offering both Legal Hold and Retention (with Governance and Compliance modes).
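The general idea behind bitrot detection can be sketched as a checksum that is recorded at write time and verified at read time. The example below is a simplified illustration, not MinIO's implementation: SHA-256 from the Python standard library stands in for whatever hash a real system uses, the in-memory dictionary stands in for drives, and the healing step (rebuilding from parity) is not shown. All names are hypothetical.

```python
import hashlib


def write_object(store, key, data):
    """Store the payload together with a checksum computed at write time."""
    store[key] = (data, hashlib.sha256(data).hexdigest())


def read_object(store, key):
    """Recompute the checksum on read and refuse to return corrupted data."""
    data, expected = store[key]
    if hashlib.sha256(data).hexdigest() != expected:
        raise IOError(f"bitrot detected in {key!r}")
    return data


store = {}
write_object(store, "logs/a.bin", b"hello data lake")

# A clean read returns the original bytes.
clean = read_object(store, "logs/a.bin")

# Simulate silent corruption: the payload changes but the stored checksum does not.
_, digest = store["logs/a.bin"]
store["logs/a.bin"] = (b"jello data lake", digest)

try:
    read_object(store, "logs/a.bin")
    corruption_detected = False
except IOError:
    corruption_detected = True

print("corruption detected:", corruption_detected)
```

A production system would follow detection with repair, reconstructing the damaged shard from the remaining data and parity shards rather than simply raising an error.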

Software Defined

Hadoop HDFS’ successor isn’t a hardware appliance; it is software running on commodity hardware. That is what MinIO is — software. Like Hadoop HDFS, MinIO is designed to take full advantage of commodity servers. With the ability to leverage NVMe drives and 100 GbE networking, MinIO can shrink the data center — improving operational efficiency and manageability.

Secure

MinIO supports multiple, sophisticated server-side encryption schemes to protect data — wherever it may be — in flight or at rest. MinIO’s approach assures confidentiality, integrity, and authenticity with negligible performance overhead. Server-side and client-side encryption are supported using AES-256-GCM, ChaCha20-Poly1305, and AES-CBC, ensuring application compatibility. Furthermore, MinIO supports industry-leading key management systems (KMS).

Migrating from Hadoop to MinIO

The MinIO team has expertise in migrating from HDFS to MinIO. Customers that purchase an Enterprise license can get assistance from our engineers. To learn more about using MinIO to replace HDFS, check out this collection of resources.

Conclusion

Every enterprise is a data enterprise at this point. The storage of that data and the subsequent analysis need to be seamless, scalable, secure, and performant. The analytical tools spawned by the Hadoop ecosystem, like Spark, are more effective and efficient when paired with object storage-based data lakes. Technologies like Flink improve overall performance by providing a single runtime for both streaming and batch processing, which did not work well in the HDFS model. Frameworks like Apache Arrow are redefining how data is stored and processed, and Iceberg and Hudi are redefining how table formats allow for the efficient querying of data.


These technologies all require a modern, object storage-based data lake where compute and storage are disaggregated and workload-optimized. If you have any questions while architecting your own modern data lake, please feel free to reach out to us at hello@min.io or on our Slack channel.