In recent years, we have encountered different views of what a data lake is. Because "data lake" is a fashionable term today, it is discussed by IT specialists as well as by company management. Ideas of what it means, however, can be diametrically opposed: they range from the idea that a data lake is just smarter storage to the idea that it is a modern replacement for a data warehouse. Much has been written about the data lake, so let’s summarize the facts.
- In the past, the standard way of reporting and analyzing data was to build a data warehouse and data marts. One of the traditional rules of BI architecture was to store only the data relevant for analysis, then to identify the most interesting attributes and aggregate them into data marts. The disadvantage of this solution is that only a subset of attributes is examined, so only predetermined questions can be answered. In addition, data is aggregated in the data marts, so detail from the lowest levels is lost. Many as yet unknown questions will arise in the future that this architecture will not be able to answer quickly.
- Companies have many data consumers, from different departments and areas and with different levels of technical knowledge. Covering all their needs in the traditional way, for example by creating a universal data model in a data warehouse, costs a great deal of time and money. In addition, many data analyses will never be repeated, or will differ so much from one another that building a data model for them makes no sense.
- In the past, companies mostly dealt with structured or semi-structured data, and unstructured data was neglected, although it is an important source of information for the company.
- Over time, data volumes have grown so large that they no longer fit, technically and/or economically, into traditional relational databases.
It follows from the original idea of the concept that the data lake philosophy differs from that of the data warehouse. Data is stored in its original form and at the lowest possible level of detail. The data lake supports all forms of data: structured, semi-structured, and unstructured. Data storage is economically advantageous. The data lake is based on Hadoop’s philosophy and tools, although there is now a strong challenger in the field: Snowflake, a fast-growing company in the data space whose platform its proponents consider more capable and faster than Hadoop. The data lake can serve as a data source for a traditional data warehouse and, alongside the data warehouse, allows you to perform new types of analytical tasks. This original definition stated clearly what a data lake is and how it is to be used.
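The contrast between keeping raw detail and keeping only aggregates can be sketched in a few lines of Python. This is a minimal illustration, not a real data lake or data mart implementation: the `raw_events` records, their field names, and the `build_mart` helper are all hypothetical and chosen only for the example.

```python
from collections import defaultdict

# Hypothetical raw records, kept in their original form and at the
# lowest level of detail (one entry per order).
raw_events = [
    {"order_id": 1, "region": "north", "channel": "web", "amount": 120.0},
    {"order_id": 2, "region": "north", "channel": "store", "amount": 80.0},
    {"order_id": 3, "region": "south", "channel": "web", "amount": 50.0},
]


def build_mart(events, key):
    """Sum order amounts per value of one attribute, as a data mart would."""
    totals = defaultdict(float)
    for event in events:
        totals[event[key]] += event["amount"]
    return dict(totals)


# The mart is designed for one predetermined question: revenue per region.
revenue_by_region = build_mart(raw_events, "region")

# A new question (revenue per channel) cannot be answered from
# revenue_by_region, because the channel attribute was dropped during
# aggregation. With the raw records still available, it is simply
# another pass over the same data.
revenue_by_channel = build_mart(raw_events, "channel")
```

If only `revenue_by_region` had been stored, the per-channel breakdown would be unrecoverable. Retaining `raw_events` keeps such unplanned questions answerable.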
In practice, however, the definition of the data lake is still not precisely specified. Different consulting firms, Snowflake consultants among them, publish different definitions on their sites. The data lake is also very often tied to internal marketing and presented as an opportunity to start afresh where, for example, a DWH project has failed. Even so, concepts such as the DWH will still have a place next to the data lake and will continue to fulfill their function.