Data Mesh vs Data Lake – Driving Business Insights at Scale

What is a data mesh?   

In much the same way that software engineering teams transitioned from monolithic applications to microservice architectures, the data mesh is, in many ways, the data platform version of microservices.

As first defined by Zhamak Dehghani, a ThoughtWorks consultant and the original architect of the term, a data mesh is a type of data platform architecture that embraces the ubiquity of data in the enterprise by leveraging a domain-oriented, self-serve design. Borrowing Eric Evans’ theory of domain-driven design, a paradigm that matches the structure and language of your code with its corresponding business domain, the data mesh is widely considered the next big architectural shift in data.

Unlike traditional monolithic data infrastructures that handle the consumption, storage, transformation, and output of data in one central data lake, a data mesh supports distributed, domain-specific data consumers and views “data-as-a-product,” with each domain handling its own data pipelines. The tissue connecting these domains and their associated data assets is a universal interoperability layer that applies the same syntax and data standards.
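To make the “data-as-a-product” idea more concrete, here is a minimal sketch in Python of the kind of contract a domain might publish for one of its data products. The field names, the example product, and the output location are illustrative assumptions rather than a prescribed standard; the point is that every domain describes its products with the same shape of metadata, which is what the interoperability layer relies on.

```python
# A hypothetical data product contract; all names and values are illustrative.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataProduct:
    """Standardized metadata a domain publishes for each data product it owns."""
    name: str                     # e.g. "orders.daily_revenue"
    owner_domain: str             # the team accountable for the pipeline and the data
    output_port: str              # where consumers read it: an object-store URI, table name, etc.
    schema: Dict[str, str]        # column name -> type, drawn from one agreed type vocabulary
    sla_freshness_hours: int      # how stale the data is allowed to become
    tags: List[str] = field(default_factory=list)


# Each domain registers its own products, but the contract shape is identical across domains.
orders_daily_revenue = DataProduct(
    name="orders.daily_revenue",
    owner_domain="orders",
    output_port="s3://example-mesh/orders/daily_revenue/",
    schema={"order_date": "date", "revenue": "decimal(18,2)", "currency": "string"},
    sla_freshness_hours=24,
    tags=["finance", "gold"],
)
```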

Data mesh is an architectural and organizational paradigm that challenges the age-old assumption that big analytical data must be centralized to be useful: that it must all live in one place, managed by a centralized data team, in order to deliver value. Data mesh claims that for big data to fuel innovation, its ownership must be federated among domain data owners who are accountable for providing their data as products (with the support of a self-serve data platform that abstracts the technical complexity involved in serving data products); it must also adopt a new form of federated governance through automation to enable interoperability of domain-oriented data products. Decentralization, interoperability, and a focus on the experience of data consumers are key to the democratization of innovation using data.
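As a rough illustration of what “federated governance through automation” might look like, the sketch below runs a shared policy check over a data product’s declared metadata before it is published. The specific rules and the function are assumptions for the sake of example; in practice each organization encodes its own standards, but the check itself is automated and identical for every domain.

```python
# A hypothetical automated governance check; the rules below are illustrative only.
from typing import Dict, List

REQUIRED_FIELDS = {"name", "owner_domain", "output_port", "schema", "sla_freshness_hours"}
ALLOWED_COLUMN_TYPES = {"string", "date", "timestamp", "bigint", "double", "decimal(18,2)"}


def validate_data_product(product: Dict) -> List[str]:
    """Return a list of policy violations; an empty list means the product is publishable."""
    violations = []
    missing = REQUIRED_FIELDS - product.keys()
    if missing:
        violations.append(f"missing required fields: {sorted(missing)}")
    for column, column_type in product.get("schema", {}).items():
        if column_type not in ALLOWED_COLUMN_TYPES:
            violations.append(f"column '{column}' uses non-standard type '{column_type}'")
    return violations


# Every domain runs the same check in its own deployment pipeline.
print(validate_data_product({
    "name": "orders.daily_revenue",
    "owner_domain": "orders",
    "output_port": "s3://example-mesh/orders/daily_revenue/",
    "schema": {"order_date": "date", "revenue": "decimal(18,2)", "currency": "string"},
    "sla_freshness_hours": 24,
}))  # -> []
```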

What is a data lake? 

A data lake is a central location that holds a large amount of data in its native, raw format. Compared to a hierarchical data warehouse, which stores data in files or folders, a data lake uses a flat architecture and object storage to store the data. Object storage stores data with metadata tags and a unique identifier, which makes it easier to locate and retrieve data across regions and improves performance. By leveraging inexpensive object storage and open formats, data lakes enable many applications to take advantage of the data.
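As a small example of what that looks like in code, the sketch below writes one raw record to an S3-compatible object store using the boto3 client, attaching metadata tags alongside the object’s unique key. The bucket name, key, and metadata values are hypothetical.

```python
# A minimal sketch, assuming an S3-compatible object store and boto3;
# the bucket, key, and metadata values are hypothetical.
import json
import boto3

s3 = boto3.client("s3")

record = {"user_id": "u-123", "event": "page_view", "ts": "2023-05-01T12:00:00Z"}

# The raw record is stored as-is; descriptive metadata travels with the object,
# and the flat key acts as its unique identifier.
s3.put_object(
    Bucket="example-data-lake",
    Key="raw/clickstream/2023/05/01/event-0001.json",
    Body=json.dumps(record).encode("utf-8"),
    Metadata={"source": "web", "schema-version": "1", "ingested-by": "clickstream-job"},
)
```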

Data lakes were developed in response to the limitations of data warehouses. While data warehouses provide businesses with highly performant and scalable analytics, they are expensive and proprietary, and they can’t handle the modern use cases most companies are looking to address. Data lakes are often used to consolidate all of an organization’s data in a single, central location, where it can be saved “as is,” without the need to impose a schema (i.e., a formal structure for how the data is organized) up front the way a data warehouse does. Data in all stages of the refinement process can be stored in a data lake: raw data can be ingested and stored right alongside an organization’s structured, tabular data sources (like database tables), as well as intermediate data tables generated in the process of refining raw data. Unlike most databases and data warehouses, data lakes can process all data types, including unstructured and semi-structured data like images, video, audio and documents, which are critical for today’s machine learning and advanced analytics use cases.
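The sketch below shows what this schema-on-read approach can look like in practice with PySpark: raw JSON that was landed in the lake as-is only gets a schema when it is read for a specific analysis. The lake path and column names are assumptions for illustration.

```python
# A schema-on-read sketch with PySpark; the path and columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("lake-schema-on-read").getOrCreate()

# The schema is declared at read time; nothing was enforced when the raw files landed.
clickstream_schema = StructType([
    StructField("user_id", StringType()),
    StructField("event", StringType()),
    StructField("revenue", DoubleType()),
])

events = (
    spark.read
    .schema(clickstream_schema)
    .json("s3://example-data-lake/raw/clickstream/")  # raw JSON stored as-is
)

events.groupBy("event").count().show()
```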