The integration of data and easy access to it is a critical competitive advantage for any insurance or financial services company. Whether it is financial, claims, policy, or customer data, a data lake can enable analytics, business transformation, and informed decision-making. In Capco’s experience, helping our clients understand what a data lake is and how it can help is the first step to transformation. Below, we answer some of our clients’ most frequently asked questions about data lakes.
A data lake allows for the storage of any type of data in its raw format. Data stored in a lake can be structured (existing relational databases, flat files exported from other systems, XML documents, JSON messages) or unstructured (recorded call center phone calls, videos, or every tweet mentioning your industry). Once integrated into the lake, the data can be analyzed in a variety of ways. Software products, such as columnar databases, have made creating and maintaining data lakes easier.
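As a rough illustration, the raw "landing" step of a lake can be as simple as copying source files in unchanged. The sketch below is a minimal Python example; the lake path, source names, and partitioning scheme are hypothetical stand-ins for whatever object store or distributed file system a given platform actually uses.

```python
import shutil
from datetime import date
from pathlib import Path

# Hypothetical lake root; in practice this would be an object store
# such as S3, ADLS, or HDFS rather than a local directory.
LAKE_ROOT = Path("/data/lake/raw")

def land_raw_file(source_system: str, source_file: Path) -> Path:
    """Copy a source file into the lake's raw zone, unchanged.

    The file keeps its original format (CSV, XML, JSON, audio, ...);
    we only organize it by source system and ingestion date so it
    can be found later.
    """
    target_dir = LAKE_ROOT / source_system / date.today().isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source_file.name
    shutil.copy2(source_file, target)
    return target

# Example: land a claims extract and a call-center recording.
# land_raw_file("claims_system", Path("claims_q1.csv"))
# land_raw_file("call_center", Path("call_0001.wav"))
```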
A data lake allows for the integration of all of a company’s data from disparate sources into one place for analysis. New data sources are easy to add because each source’s original format is maintained. Data lakes are used by those who are highly skilled at data manipulation and analysis.
Data lakes can also be used as part of an enterprise data warehouse architecture, allowing new data sources to be ingested quickly. While the native format is retained in the lake, the data can also be transformed and exported to a data warehouse in a more accessible format. For example, a data lake could store every tweet that includes the company name but export only the statistical analysis of those tweets to the warehouse.
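As a sketch of that pattern, the snippet below keeps the raw tweets in the lake and pushes only a daily summary into a warehouse table. The file layout, field names, and the use of SQLite as a stand-in warehouse are all assumptions for illustration; a real deployment would target the organization’s actual warehouse.

```python
import json
import sqlite3
from collections import Counter
from pathlib import Path

# Hypothetical locations: raw tweets stay in the lake as JSON lines;
# SQLite stands in for the enterprise data warehouse.
TWEETS_DIR = Path("/data/lake/raw/twitter")
WAREHOUSE = sqlite3.connect("warehouse.db")

WAREHOUSE.execute(
    """CREATE TABLE IF NOT EXISTS tweet_daily_summary (
           day TEXT, tweet_count INTEGER, distinct_users INTEGER)"""
)

def summarize_tweets() -> None:
    """Aggregate raw tweets by day and load only the summary."""
    counts: Counter = Counter()
    users: dict[str, set] = {}
    for path in TWEETS_DIR.glob("*.jsonl"):
        for line in path.read_text().splitlines():
            tweet = json.loads(line)
            day = tweet["created_at"][:10]   # assumed ISO timestamp
            counts[day] += 1
            users.setdefault(day, set()).add(tweet["user_id"])
    WAREHOUSE.executemany(
        "INSERT INTO tweet_daily_summary VALUES (?, ?, ?)",
        [(d, counts[d], len(users[d])) for d in counts],
    )
    WAREHOUSE.commit()
```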
Some of the benefits include:
- Integration of data from disparate sources into a single place for analysis
- Quick onboarding of new data sources, since data is kept in its original format
- Retention of complete source data, so changing requirements do not force a return trip to the source system
- The ability to transform and export analysis-ready data to an enterprise data warehouse
The two major differences between a data lake and a data warehouse are how data is stored and when it is transformed.
A data lake typically stores data in its original format; if transformation is needed for analysis or reporting, it happens when the data is read, just before it is integrated into the consuming system. This concept is called “schema on read.” The transformed output is then made available in a more usable format, while the raw data remains in the lake. Storing the raw data and transforming it in the lake allows all of a source’s data to be kept even if only some of it is needed at the time. When requirements inevitably change, there is no need to go back to the source to add an extra column, for example; the column is already in the lake and can simply be added to the consuming schema and the data transform.
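A minimal schema-on-read sketch: the lake holds raw policy records as JSON, and the “schema” lives in the read-side code. Adding a newly required field means touching only this consumer, not the ingestion path. The field names and file locations here are illustrative assumptions.

```python
import json
from pathlib import Path

# Assumed layout: each file holds a JSON array of raw policy records.
RAW_POLICIES = Path("/data/lake/raw/policy_system")

# The consuming schema: which raw fields this report cares about.
# When requirements change, we extend this mapping -- the raw data
# in the lake already contains every source column.
REPORT_SCHEMA = {
    "policy_id": "id",
    "premium": "annual_premium",
    # "effective_date": "eff_dt",   # uncomment when the report needs it
}

def read_policies():
    """Apply the report schema while reading the raw JSON records."""
    for path in RAW_POLICIES.glob("*.json"):
        for record in json.loads(path.read_text()):
            yield {out: record.get(src) for out, src in REPORT_SCHEMA.items()}
```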
A data warehouse, on the other hand, uses a “schema on write” process: before data is added to the warehouse, it is first mapped to a model. Different models might be used for each format or feeding system. Determining and building these models is the bulk of the work in creating a data warehouse. In a data lake, that modeling effort is postponed and, depending on how the data is used, may not be needed at all.
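By contrast, a schema-on-write load validates and maps each record to a fixed model before insertion, so anything that does not fit the model is rejected at load time. The table layout and source field names below are hypothetical, and SQLite again stands in for the warehouse.

```python
import sqlite3

WAREHOUSE = sqlite3.connect("warehouse.db")
WAREHOUSE.execute(
    """CREATE TABLE IF NOT EXISTS claims (
           claim_id TEXT NOT NULL,
           policy_id TEXT NOT NULL,
           amount REAL NOT NULL,
           status TEXT NOT NULL)"""
)

def load_claim(raw: dict) -> None:
    """Map a raw record to the warehouse model before writing.

    Schema on write: the transformation and validation happen here,
    at load time, so every row in the table conforms to the model.
    """
    row = (
        str(raw["claim_number"]),    # source field names are assumed
        str(raw["policy_number"]),
        float(raw["claim_amount"]),  # fails fast on non-numeric input
        raw.get("status", "OPEN"),
    )
    WAREHOUSE.execute("INSERT INTO claims VALUES (?, ?, ?, ?)", row)
    WAREHOUSE.commit()
```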
A data lake’s primary strength, its ability to integrate data in any format or structure, can become a liability if misused. The ability to add disparate sources of data does not negate the need for a plan for how that data will be used. Without proper governance, the ease of ingesting data can lead to disorganized, thoughtless data integration: the lake becomes a swamp. A lake, like any database, must be thoughtfully designed, with an eye to which data will be integrated and the purpose for adding it. Most importantly, proper governance processes are needed so that users and administrators understand what is in the lake, its location and origin, and how it is to be used.
Ideally, all data flowing through the enterprise would find its way to the lake, so the natural goal is to make the lake a central hub for data consumers. To ensure the available data has value, governance policies and metadata support are needed to validate data and to help users find what they are looking for.
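One lightweight way to picture that metadata support is a catalog recording what each dataset is, where it lives, and where it came from. The sketch below is an assumed, minimal in-memory version; real platforms use dedicated data catalog tools for this.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetEntry:
    """Minimal metadata a lake user needs to find and trust a dataset."""
    name: str
    location: str        # path or URI within the lake
    origin: str          # source system the data came from
    owner: str           # who answers questions about it
    description: str
    tags: list[str] = field(default_factory=list)

CATALOG: dict[str, DatasetEntry] = {}

def register(entry: DatasetEntry) -> None:
    CATALOG[entry.name] = entry

def search(keyword: str) -> list[DatasetEntry]:
    """Help consumers find data without guessing at file paths."""
    kw = keyword.lower()
    return [e for e in CATALOG.values()
            if kw in e.description.lower() or kw in e.tags]

# Example: record the raw tweet feed so analysts know it exists.
register(DatasetEntry(
    name="twitter_raw",
    location="/data/lake/raw/twitter",
    origin="Twitter streaming API",
    owner="data-engineering",
    description="Every tweet mentioning the company name, as JSON lines",
    tags=["social", "raw"],
))
```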