The data lake – it’s a phrase that’s thrown around a lot right now, but is it just an empty buzzword, or does it actually bring real value?
Well, there are certainly some misconceptions around the concept of data lakes. Perhaps the biggest is “we can keep everything because storage is cheap”, the idea at the heart of data lakes. If you don’t manage your data lake, and if you don’t have the right governance and processes around it (not to mention the skills necessary to get value out of it), it can bring more bad news than benefit. Let’s take a look at some points to be aware of when managing your data lake.
A note on the difference between data and information
In the following paragraphs I’ll be referring to data and information. There is a seemingly subtle, but extremely important difference between the two. To save you some internet searching, data is an information carrier. Usually data carries more than one piece of information. Keep in mind, people are ultimately interested in the information data carries, not data itself. Imagine a sensor emitting its current state every second. Potentially there are two pieces of information attached to this datum – its state, obviously, (which gains significance only when it changes) and the sheer existence of the datum which suggests the sensor is working.
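To make the sensor example concrete, here is a minimal Python sketch of pulling both pieces of information out of the same data. The `Reading` type, the function names and the one-second threshold are illustrative assumptions, not part of any real sensor API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    timestamp: int  # seconds since an arbitrary start (assumed unit)
    state: str


def state_changes(readings):
    """Information 1: keep only the readings where the state differs from the previous one."""
    changes = []
    previous_state = None
    for reading in readings:
        if previous_state is None or reading.state != previous_state:
            changes.append(reading)
        previous_state = reading.state
    return changes


def silent_gaps(readings, max_gap=1):
    """Information 2: spans where no datum arrived, hinting the sensor was not working."""
    gaps = []
    for earlier, later in zip(readings, readings[1:]):
        if later.timestamp - earlier.timestamp > max_gap:
            gaps.append((earlier.timestamp, later.timestamp))
    return gaps


# Hypothetical data: the sensor goes quiet between seconds 2 and 5.
readings = [
    Reading(0, "OK"),
    Reading(1, "OK"),
    Reading(2, "HOT"),
    Reading(5, "HOT"),
    Reading(6, "OK"),
]
changes = state_changes(readings)
gaps = silent_gaps(readings)
```

The same five data points answer two different questions: when the state changed, and when the sensor stopped reporting.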
The similarities between a data lake and an actual lake
If we think of a data lake as an actual lake, like the one in the picture for instance, we can understand some of the ideas around it.
Let’s lay out the landscape first. We have a nice pure lake (our data lake), with several creeks and streams (our data sources) feeding it with water (data). At this very moment, our lake is filled with pure mountain water, so we can even see the bottom of it. People (data scientists) can observe its banks and tributaries (metadata and data sources) or dive directly into it (perform analysis). The parallel continues further. As with this beautiful mountain lake, only good swimmers and experienced divers (analysts and data professionals) should be allowed in. Their safety, and the preservation of the lake’s ecosystem, water quality and long-term sustainability, are all very real concerns. The same goes for a data lake. As beautiful and tempting as it is to dive into all that data, you need to understand how to interpret what you’re swimming in, and you must refrain from polluting the data lake in any way to keep it sustainable. On top of that, there are regions of the lake where only authorized personnel are allowed, due to further safety (security access) and privacy (anonymization, data encryption, etc.) regulations.
Take GDPR, a new, stringent set of regulations impacting any organization doing business in the EU. Backed by sky-high sanctions, GDPR imposes strict requirements both for the governance of personal data and for being transparent about the storage and processes that control the data. Pouring such data into a lake means it can’t be a free-for-all, and there’s a need to restrict swimmers (users) from accessing whatever they want at any time. There may be restrictions around what data can be stored, in what format, and in what combinations. Ultimately, data owners need to be accountable for a) what they’re putting into the data lake and b) who is swimming in it, and for what purpose.
Where did your water come from?
Data lineage (where your data came from, where it’s been and what’s happened to it on the way) could be a topic for a whole separate article. In a data lake especially, it is an absolute must to know your creek’s origin and exactly how it got to the lake. Does it flow past a chemical plant (has it been polluted on its journey)? How much mud (data carrying no or insignificant information) your creek brings into the lake is also a valuable indicator. This will determine how much and what kind of maintenance your data lake might need, and whether the creek is even trustworthy: you’d probably drink from a mountain spring, but not from the water at a freight port. If you allow too much garbage into your data lake, people will eventually stop using it. Even worse, they might ignore the muddy waters but fail to notice that some seemingly crystal-clear creeks are polluted with invisible toxic chemicals.
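One way to picture lineage is as a trail attached to each dataset. The sketch below is only an illustration: `Dataset`, `LineageStep` and the system names are hypothetical, and real lineage tools record far more detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageStep:
    system: str     # where the data passed through (hypothetical names below)
    operation: str  # what happened to it there


@dataclass(frozen=True)
class Dataset:
    name: str
    origin: str
    steps: tuple = ()

    def after(self, system, operation):
        """Return a copy of this dataset with one more step recorded on its journey."""
        return Dataset(self.name, self.origin, self.steps + (LineageStep(system, operation),))

    def passed_through(self, system):
        """Did the creek flow past this particular 'chemical plant'?"""
        return any(step.system == system for step in self.steps)


orders = (
    Dataset(name="orders", origin="web_shop")
    .after("legacy_export", "converted to CSV")
    .after("nightly_loader", "deduplicated")
)
journey = [step.system for step in orders.steps]
```

With the trail in place, a question such as `orders.passed_through("legacy_export")` can be answered before anyone decides to drink the water.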
This leads on to another important point: monitoring the data flowing into the lake. This obviously serves as an early warning to protect the quality of your lake, but there’s a broader and deeper benefit.
Constantly receiving corrupted or incomplete data could suggest there is something wrong with an upstream system. In a sense, this central hub should act as a health monitoring facility or internal standards authority for all connected systems. We’ve seen instances where analyzing the logs of a data loading process helped to identify low-performing branches of a large organization, just by revealing an above-average occurrence of errors in the data they were providing to the central data hub. This “meta information” can be as important and transformative as the data itself.
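As a rough illustration of that kind of log analysis, the following Python sketch computes an error rate per source and flags the sources that sit above the average. The log format (a source name plus a success flag) and the branch names are assumptions made up for the example.

```python
from collections import Counter


def error_rates(log_records):
    """Compute the share of failed loads per source from (source, succeeded) pairs."""
    totals = Counter()
    errors = Counter()
    for source, succeeded in log_records:
        totals[source] += 1
        if not succeeded:
            errors[source] += 1
    return {source: errors[source] / totals[source] for source in totals}


def above_average(rates):
    """Return the sources whose error rate is higher than the mean across all sources."""
    if not rates:
        return []
    mean = sum(rates.values()) / len(rates)
    return sorted(source for source, rate in rates.items() if rate > mean)


# Hypothetical load log: ten records per branch.
log_records = (
    [("branch_a", True)] * 9 + [("branch_a", False)] * 1
    + [("branch_b", True)] * 5 + [("branch_b", False)] * 5
    + [("branch_c", True)] * 10
)
rates = error_rates(log_records)
flagged = above_average(rates)
```

Comparing against the plain mean is a deliberately simple choice; a real monitor would likely want a threshold that accounts for how many records each source sends.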
In the worst case, you could end up with a data swamp (or shall we call it a “data dump”?). What is that? If your data lake is unmaintained, with no traces of where its contents come from, and with unreliable or even completely absent controlling mechanisms, then your lake gets polluted and hard to navigate. There’s a much higher chance of someone getting hurt or even drowning if the banks of your lake are uncultivated and left to the wilderness of nature. Thinking of a data lake as just a place to put data, without plans and processes for proper treatment, can quickly waste the investment in it.
What’s in your lake?
Assuming you’ve monitored the quality of your incoming water (or data), and you’ve fixed any problems, how do you go about getting the parts you want out again?
The big advantage of a data lake (as opposed to a data warehouse for example) is that you don’t have to spend too much time upfront organizing and structuring your data. But you do need to have *something* to organize it with, otherwise it’s just a mess. Even simple tagging of the data’s arrival day, time and source can be of enormous help. Imagine what could happen if there was also more general information about the data’s content. The benefit of a data lake – being able to sift through what you have and find what you need – becomes much more achievable when some basic measures are put in place to track what you have in your lake.
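Here is a minimal sketch of that kind of simple tagging, written in Python. The `Catalog` class, its method names and the file paths are hypothetical; the point is only that arrival time, source and a few content tags are enough to make files findable again.

```python
from dataclasses import dataclass, field
from datetime import date, datetime, timezone


@dataclass
class CatalogEntry:
    path: str
    source: str
    arrived_at: datetime
    tags: frozenset = field(default_factory=frozenset)


class Catalog:
    """A toy in-memory catalog; a real one would persist its entries somewhere."""

    def __init__(self):
        self._entries = []

    def register(self, path, source, arrived_at=None, tags=()):
        """Record a file as it lands in the lake, stamping the arrival time if none is given."""
        if arrived_at is None:
            arrived_at = datetime.now(timezone.utc)
        entry = CatalogEntry(path, source, arrived_at, frozenset(tags))
        self._entries.append(entry)
        return entry

    def find(self, source=None, day=None, tag=None):
        """Return the entries matching every filter that was supplied."""
        results = []
        for entry in self._entries:
            if source is not None and entry.source != source:
                continue
            if day is not None and entry.arrived_at.date() != day:
                continue
            if tag is not None and tag not in entry.tags:
                continue
            results.append(entry)
        return results


catalog = Catalog()
catalog.register("raw/crm/2024-05-01.json", "crm",
                 datetime(2024, 5, 1, 8, 30, tzinfo=timezone.utc), tags={"customers"})
catalog.register("raw/sensors/2024-05-01.csv", "sensors",
                 datetime(2024, 5, 1, 9, 0, tzinfo=timezone.utc), tags={"telemetry"})
catalog.register("raw/crm/2024-05-02.json", "crm",
                 datetime(2024, 5, 2, 8, 30, tzinfo=timezone.utc), tags={"customers"})

crm_files = catalog.find(source="crm")
may_first = catalog.find(day=date(2024, 5, 1))
```

Even this much lets you ask “what arrived from the CRM?” or “what landed on 1 May?” without opening a single file.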
Lifeguards on duty
Since a data lake contains just about all the data you have (to fulfill its promise of being able to combine anything with anything), it is also a great potential risk, and requires carefully defined rules to manage that risk. Remember when we talked about GDPR? The level of data governance you need can vary depending on your circumstances. Going back to our actual lake, sometimes we’d be fine with just a lifeguard making sure the swimmers and people around the lake are not doing anything they are not supposed to. Other times, a perimeter fence and guard dogs are required. The point is to fulfill legal and regulatory restrictions as well as internal guidelines.
Security also relates to making sure people are considerate and respectful to each other. The lifeguard’s duty is to make sure one person does not bully others and hog the whole lake for themselves. Sometimes, providing a smaller pool with samples of water from all the different parts of the lake can be a way to deal with complex hypotheses that require multiple iterations, without affecting the work of other people.
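The smaller pool can be sketched as taking a few records from every source, so that iterative experiments run against the sample rather than the whole lake. This is a simplified Python illustration: the `lake` dictionary and the `build_sample_pool` function are hypothetical, and a real lake would not fit in memory like this.

```python
import random


def build_sample_pool(lake, per_source=2, seed=0):
    """Draw up to `per_source` records from each source so every part of the lake is represented."""
    rng = random.Random(seed)  # fixed seed so repeated experiments see the same pool
    pool = {}
    for source, records in lake.items():
        sample_size = min(per_source, len(records))
        pool[source] = rng.sample(records, sample_size)
    return pool


# Hypothetical lake contents, keyed by source.
lake = {
    "crm": ["c1", "c2", "c3", "c4", "c5"],
    "sensors": ["s1", "s2", "s3"],
    "web_logs": ["w1"],
}
pool = build_sample_pool(lake, per_source=2)
```

Sampling per source, rather than from the lake as a whole, keeps small sources such as `web_logs` from disappearing from the pool.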
Are data lakes for everybody?
In a word – no. The ‘data lake’ is a great buzzword right now, but as we’ve seen, it’s not the magic bullet to solve all your data problems. You need to put the right processes, tools and governance in place to make the data lake work for you. And crucially, you need to have the right skills to be an effective user of the data lake.
If we leave the lake analogy for a minute and focus more on a cooking analogy (apologies for this, you’ve probably already guessed this is a very analogy-heavy blog post) - you might have all the ingredients and a book of recipes, but lacking the skillset to be a good cook (or an aspiration to become one), you’re not likely to make a great meal. Just hoarding data is like stockpiling ingredients in your pantry. You’re not going to get a single meal if you don’t step up and, well, start cooking.
Can the ‘data lake’ be a useful concept?
Yes, the data lake can be a powerful concept, but only if the right effort is put in to get the results out. To continue our cooking analogy, if you really want to cook, and especially if you want to cook exotic, never-tried-before meals, then a data lake is the way to go. It’s a giant pantry that allows you to store all sorts of ingredients to use in your experimental cooking. But if you’re NOT into cooking, hoarding ingredients won’t fill your belly with delicious meals.
Articles which oversimplify the situation and use phrases like “keep data and figure out later” or “self-service and on-demand access” just deepen the problem; they are likeable but unrealistic. I became quite a fan of the article “Are Data Lakes Fake News” by Uli Bethke, which challenges these and some other claims.
Still thinking about employing a data lake? Good. Just remember to keep track of your data, mind your data governance, catalogue data from day one, and keep people without enough skills away (by skills I mean a combination of business understanding of the data as well as technical background to work with the data lake effectively).
Above all, remember, a data lake is not a magical place where all data problems are going to be solved in the blink of an eye.