What is a Data Lake?

A Data Lake is a method of storing data within a system or repository, in its natural format, that facilitates the collocation of data in varying schemata and structural forms, usually as object blobs or files. The idea of a Data Lake is to have a single store of all data in the enterprise, ranging from raw data (an exact copy of source-system data) to transformed data used for tasks such as reporting, visualization, analytics, and machine learning.

The Data Lake includes structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs), and even binary data (images, audio, video), thus creating a centralized data store that accommodates all forms of data.
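The "store everything in its natural format" idea can be sketched as a landing zone that accepts files as-is, partitioned by source system and load date. The layout and function names below are illustrative assumptions, not a prescribed standard:

```python
import json
from datetime import date
from pathlib import Path

# Hypothetical landing-zone layout: raw files are kept in their native
# format, partitioned as lake/raw/<source>/<YYYY-MM-DD>/<name>.
LAKE_ROOT = Path("lake/raw")

def land_file(source: str, name: str, payload: bytes) -> Path:
    """Store a file byte-for-byte as received, without parsing it."""
    target = LAKE_ROOT / source / date.today().isoformat() / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target

# Structured rows exported as CSV and semi-structured events as JSON
# both land unchanged; schema is applied only at read time.
rows = "id,amount\n1,9.99\n2,4.50\n"
event = {"user": "u42", "action": "login"}

csv_path = land_file("billing_db", "invoices.csv", rows.encode())
json_path = land_file("web_app", "events.json", json.dumps(event).encode())
```

Because nothing is parsed on the way in, the same mechanism handles structured, semi-structured, unstructured, and binary content alike.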

A Data Swamp is a deteriorated Data Lake that is inaccessible to its intended users and provides little value. Typically, data swamps form when the Data Lake's metadata is not managed and there is no architecture or organization for each source feed.

Business Challenges with Data Lakes

Data Lake frameworks are typically a collection of technologies with loosely coupled goals; they are not fully connected to the enterprise vision and architecture. Data Lakes quickly become Data Swamps if the architecture is not integrated with other federated data stores and there is no clear delineation of roles. A tangible data integration and data sharing strategy is required to smoothly integrate a Data Lake with the rest of the business and its information stores.

The integration characteristics of a successful enterprise data lake will include:

  1. Common, well-understood methods and APIs for ingesting content
    1. Make it easy for external systems to push content into the Data Lake
    2. Provide frameworks to easily configure and test connectors to pull content into the Data Lake
  2. Corporate-wide schema management
    1. Methods for identifying and tracking metadata fields through business systems
    2. Tracking common lexicons and vocabularies correlated across multiple business systems
  3. A business-user interface for content processing
    1. Format conversion and parsing
    2. Collecting data definitions, lineage and operational statistics
  4. Mining the Data Lake to produce business insights that lead to business actions
    1. These insights and actions are expected to be communicated through data interfaces or reports
    2. A closed-loop process that turns data science findings into actionable data assets
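The first two characteristics above, a common ingestion entry point and metadata tracking, can be sketched together: every source pushes content through one function, which lands the payload and records lineage and operational statistics as a sidecar. All names and the sidecar format here are assumptions for illustration:

```python
import json
import time
from pathlib import Path

LAKE = Path("lake")

def ingest(source_system: str, dataset: str, records: list) -> dict:
    """Land records and write a metadata sidecar for lineage/statistics.

    A single, well-understood entry point: external systems push content
    here instead of writing their own point-to-point loaders.
    """
    ts = time.strftime("%Y%m%dT%H%M%S")
    data_path = LAKE / "raw" / source_system / f"{dataset}_{ts}.json"
    data_path.parent.mkdir(parents=True, exist_ok=True)
    data_path.write_text(json.dumps(records))

    # Operational metadata captured automatically at ingest time.
    metadata = {
        "source_system": source_system,
        "dataset": dataset,
        "record_count": len(records),
        "landed_at": ts,
        "path": str(data_path),
    }
    data_path.with_suffix(".meta.json").write_text(json.dumps(metadata))
    return metadata

meta = ingest("crm", "contacts", [{"id": 1}, {"id": 2}])
```

Collecting record counts, source names, and timestamps at the ingestion boundary is what keeps the lake searchable later, and is the simplest defense against it becoming a swamp.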

Data Lake managed by A2B Data™

A2B Data™ can be leveraged bi-directionally: to ingest large sets of information into the Data Lake and to extract analytical findings from the Data Lake into other data stores. A cohesive Data Lake architecture includes a tightly coupled connection with other corporate data assets found in Operational Data Stores, the Enterprise Data Warehouse, and Data Marts.

A2B Data™ serves as the integration hub that connects each of the data stores. Data Lake ingestion strategies are fully managed by A2B Data™ for latent, real-time, and streaming data. Not only does A2B Data™ manage the entire extract-and-load process, it also collects all the metadata for both technical lineage and operational statistics.

Typically, data stored in the Data Lake consists of simple images of source data. A2B Data™ goes beyond this, giving you the flexibility of maintaining the history of changed data in the Data Lake: it manages changes to data over time as point-in-time snapshots.
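The general point-in-time snapshot technique can be illustrated with versioned rows carrying validity dates, so any past state can be reconstructed with an "as-of" query. This is a sketch of the technique, not A2B Data™'s actual internals:

```python
from datetime import date

# Each version of a record carries valid_from/valid_to dates; the current
# version has valid_to = None. Column names here are illustrative.
history = [
    {"id": 1, "status": "new",    "valid_from": date(2023, 1, 1), "valid_to": date(2023, 6, 1)},
    {"id": 1, "status": "active", "valid_from": date(2023, 6, 1), "valid_to": None},
]

def as_of(rows, key, when):
    """Return the version of record `key` that was current on date `when`."""
    for row in rows:
        if row["id"] == key and row["valid_from"] <= when and (
            row["valid_to"] is None or when < row["valid_to"]
        ):
            return row
    return None
```

With this shape, a report can be rerun exactly as it would have looked on any historical date, instead of only against the latest source image.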

Data Lake environments prefer raw data structures and do not require predefined schemas. A2B Data™ supports this approach and maintains the structure of the source data, adding only control information.
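"Source structure plus control information" can be sketched as appending a few control fields to each row while leaving the source columns untouched. The control field names below are assumptions; A2B Data™'s actual column names may differ:

```python
import uuid
from datetime import datetime, timezone

def add_control_info(row: dict, source: str, batch_id: str) -> dict:
    """Return the row with illustrative control fields appended.

    The original source columns pass through unchanged; only audit
    metadata (source name, batch id, load timestamp) is added.
    """
    return {
        **row,  # source structure preserved as-is
        "_source_system": source,
        "_batch_id": batch_id,
        "_loaded_at": datetime.now(timezone.utc).isoformat(),
    }

batch = str(uuid.uuid4())
landed = [add_control_info(r, "erp", batch) for r in [{"sku": "A1", "qty": 3}]]
```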


  • Avoid writing point-to-point interfaces, since A2B Data™ automates the data extract, load, and upkeep of changed data
  • Multiple change data capture methods to detect source data changes
  • Multiple target design patterns for ingestion to the new system
  • Immediate access to legacy system data on the new environment
  • No need to write transformation logic to convert data types, A2B Data™ will automate that process
  • Focus resource effort on mining the data and not moving the data
  • A2B Data™ supports parallel and iterative executions
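One common change-data-capture method alluded to above is hash comparison: hash each source row and compare against the previous load to classify rows as inserts, updates, or unchanged. This is a generic sketch of the technique, not A2B Data™'s proprietary algorithm:

```python
import hashlib
import json

def row_hash(row: dict) -> str:
    """Deterministic fingerprint of a row's content."""
    return hashlib.sha256(json.dumps(row, sort_keys=True).encode()).hexdigest()

def detect_changes(previous: dict, current_rows: list) -> dict:
    """Compare current rows against prior hashes, keyed by id.

    previous maps row id -> hash from the last load.
    """
    changes = {"insert": [], "update": [], "unchanged": []}
    for row in current_rows:
        h = row_hash(row)
        old = previous.get(row["id"])
        if old is None:
            changes["insert"].append(row["id"])
        elif old != h:
            changes["update"].append(row["id"])
        else:
            changes["unchanged"].append(row["id"])
    return changes

# Prior load knew row 1 with qty 5; the new extract changes it and adds row 2.
prev = {1: row_hash({"id": 1, "qty": 5})}
result = detect_changes(prev, [{"id": 1, "qty": 6}, {"id": 2, "qty": 1}])
```

Other CDC methods, such as database logs, triggers, or timestamp columns, detect changes at the source instead; hash comparison is the fallback that works when the source offers no change signal at all.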