A data lake (“DL”) is a storage repository that can hold very large amounts of data. It stores every type of data in its native format, with no fixed limits on size or number of files.
A data lake can hold structured data, such as rows and columns from a relational database. It can also hold semi-structured data, for example CSV, logs, XML and JSON. Finally, it can store unstructured data, for example emails, documents, PDFs and binary data like images, audio and video. Current DL solutions include Azure Data Lake, Amazon S3 cloud storage and the Hadoop Distributed File System (HDFS).
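As a rough illustration of storing data in its native format, the sketch below writes one structured, one semi-structured and one unstructured file, byte for byte, into a local folder that stands in for a lake. The `land_raw` helper, the `raw/<source>/` folder layout and the sample file names are assumptions made for this example, not part of any particular product.

```python
import json
import tempfile
from pathlib import Path


def land_raw(lake_root: Path, source: str, filename: str, payload: bytes) -> Path:
    """Write the payload unchanged to <lake_root>/raw/<source>/<filename>."""
    target = lake_root / "raw" / source / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target


# A temporary local directory stands in for the lake (hypothetical layout).
lake = Path(tempfile.mkdtemp(prefix="lake_"))

# Structured: rows and columns exported from a relational table.
orders = land_raw(lake, "crm", "orders.csv", b"order_id,amount\n1,19.99\n2,5.00\n")

# Semi-structured: a JSON event.
event = land_raw(
    lake, "web", "click.json",
    json.dumps({"user": "u1", "page": "/home"}).encode("utf-8"),
)

# Unstructured: binary content (here, just the 8-byte PNG file signature).
image = land_raw(lake, "uploads", "logo.png", b"\x89PNG\r\n\x1a\n")

# List everything that landed, relative to the lake root.
stored = sorted(p.relative_to(lake).as_posix() for p in lake.rglob("*") if p.is_file())
print(stored)
```

Nothing is parsed or converted on the way in; each file can be read back exactly as it arrived.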
What Are the Benefits of Using a Lake?
- Data richness. A lake can store data from many sources and of many types, for example text, audio, images and video.
- Data democratization. A lake makes data available to the whole organization.
- Storage in native format. A lake doesn’t need the data to be modeled when it is loaded. Instead, the data is modeled when it is explored for analytics. Consequently, lakes offer the flexibility to ask new business questions and gain insight.
- Scalability. Lakes offer scalability at a modest price when compared to a traditional data warehouse.
- Advanced analytics. A lake makes large amounts of data available to deep learning algorithms, which can help with real-time decisions.
- Complementary to an existing data warehouse. Warehouses and lakes can work together, resulting in an integrated data strategy.
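The “storage in native format” benefit listed above is often described as schema-on-read. A minimal sketch of the idea, assuming raw events are kept as one JSON object per line: the same untouched records are projected onto two different schemas, one per business question. The `read_with_schema` helper and the sample records are hypothetical and only illustrate the pattern.

```python
import json

# Raw events as they might sit in the lake: one JSON object per line,
# with fields that differ from record to record.
RAW_LINES = [
    '{"user": "u1", "amount": "19.99", "channel": "web"}',
    '{"user": "u2", "amount": 5}',
    '{"user": "u3", "note": "no purchase"}',
]


def read_with_schema(lines, schema):
    """Project each raw record onto `schema` (field name -> converter) at read time."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, convert in schema.items():
            value = record.get(field)
            # Missing fields become None instead of failing the load.
            row[field] = convert(value) if value is not None else None
        rows.append(row)
    return rows


# Question 1: how much did each user spend?
spend = read_with_schema(RAW_LINES, {"user": str, "amount": float})

# Question 2: which channel did each user come from? Same raw data, different schema.
channels = read_with_schema(RAW_LINES, {"user": str, "channel": str})

total = sum(row["amount"] for row in spend if row["amount"] is not None)
print(round(total, 2))
```

Because the schema is supplied by the reader, a new question only needs a new schema dictionary; the stored data does not have to be reloaded or remodeled.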
How do Warehouses Compare to Lakes?
Depending on its requirements, an organization may need a data warehouse, a data lake, or both, because they serve different needs.
Characteristics | Traditional Data Warehouse | Modern Data Lake |
Type of Data | Relational data from transactional systems, databases, and business applications. | Non-relational and relational data from many sources. For example, IoT devices, web sites, mobile apps, social media, and others. |
Schema | Designed prior to the warehouse implementation. | Written at the time of analysis. |
Price Performance | Medium-speed query results using high-cost storage. | Query results getting faster while using low-cost storage. |
Data Quality | Highly curated data that serves as the one version of the truth. | Any data that may or may not be curated. |
Users | Business analysts. | Data scientists, data developers, and business analysts. |
Analytics | Batch reporting, BI and visualizations. | Machine learning, predictive analytics, data discovery and profiling. |
If you would like to know more, read about How JTA The Data Scientists does its work or have a look at some other FAQs. You might also like to read Wikipedia’s article on data lakes.
You could also explore our case studies or whitepapers.