Intro to Data Lakes
Most people have heard of data warehouses. In case you have not: a data warehouse is a large database that combines data from various source databases. The data is typically normalized or standardized to simplify reporting and business intelligence.
What is a Data Lake?
When data comes from vendors, partners, or other integration touch points, it arrives in a raw format (XML, CSV, images, EDI claims, etc.). This is also known as source data. A data lake is essentially a large storage repository that holds vast amounts of raw data in its native format. Once the data arrives, additional metadata is applied, or tagged, to the raw data to describe what it means and its potential downstream uses. The data can then be replicated to downstream systems that store a more polished version of it in a suitable format. But why "lake," and where did the term come from? It originated as a marketing term in the Hadoop ecosystem, and the concept has since become an industry buzzword.
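The ingest step described above, storing the payload untouched and attaching metadata alongside it, can be sketched in a few lines of Python. This is a minimal illustration, not a standard API: the `ingest` function, the `raw/<source>` directory layout, and the metadata fields are all assumptions made for the example.

```python
import hashlib
import json
import time
from pathlib import Path

def ingest(lake_root: Path, source: str, payload: bytes, fmt: str) -> Path:
    """Land a raw payload in the lake unchanged and write a metadata sidecar."""
    digest = hashlib.sha256(payload).hexdigest()
    raw_dir = lake_root / "raw" / source
    raw_dir.mkdir(parents=True, exist_ok=True)

    # Store the data exactly as received, in its native format.
    raw_path = raw_dir / f"{digest}.{fmt}"
    raw_path.write_bytes(payload)

    # Tag it with metadata describing what it is and where it came from.
    metadata = {
        "source": source,
        "format": fmt,
        "sha256": digest,
        "size_bytes": len(payload),
        "ingested_at": time.time(),
    }
    sidecar = raw_path.with_name(raw_path.name + ".meta.json")
    sidecar.write_text(json.dumps(metadata))
    return raw_path

# Usage: land a raw CSV from a hypothetical vendor feed.
path = ingest(Path("/tmp/lake"), "vendor_claims", b"id,amount\n1,9.99\n", "csv")
```

Downstream systems would then read the sidecar metadata to decide how to parse and transform the raw file into a more polished form.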
So why create Data Lakes?
As a company captures raw data through normal business operations, the ways that data is consumed and used are constantly changing and evolving, as are the company's data strategy and data requirements. By storing the data in its raw format, it can be leveraged later as multi-year data strategies develop and come to fruition.
Data lakes also benefit highly regulated industries and public companies. Having data in its raw format helps facilitate audits and meet regulatory mandates such as the following:
- The Dodd-Frank Act requires firms to maintain records for at least five years.
- Basel guidelines mandate retention of risk and transaction data for three to five years.
- Sarbanes-Oxley requires firms to maintain audit work papers and required information for at least seven years.
- HIPAA requires data to be stored for at least six years.
- Freedom of Information Act requests and the respective general record retention schedule.
In the event of a cyber breach or another event that requires post-mortem research, having raw data in the data lake makes it much easier to reconstruct how data has changed from the original source.