I had the opportunity to interview Dan Linstedt, one of the leading DW/BI gurus in the world. He is also the inventor of the Data Vault Model and Methodology.
”Beyond a year from now – the Hadoop NoSQL hybrid solutions will be the primary data management platforms we will be working with. Anyone who is not learning how to work with these solutions, or not gaining the skills today, will have difficulty finding a job in the future.”
Ari: I have seen Hadoop gaining more and more interest also in Finland and many projects are on their way. How do you see the role of Hadoop in DW solutions within a year and further?
Dan: The role of Hadoop is changing. Hadoop is part of embedded hybrid solutions produced by companies like HortonWorks and MapR, to name a few. The hybrid is an engineered data management solution that works with both document / multi-structured data and highly structured data, while leveraging the Hadoop scale-out principles. The hybrid brings in SQL abilities, real-time data handling, and a variety of other components (like ACID transactions). These hybrids will become the technology behind tomorrow's data warehouses. Vendors like IBM, Teradata, Microsoft, and Oracle have all seen this, and have begun announcing hybrid solutions. One year out, we will see a change in the market – as the hybrid solutions mature and become easier to use, install and manage. My personal opinion is that beyond a year from now, the Hadoop NoSQL hybrid solutions will be the primary data management platforms we will be working with. Anyone who is not learning how to work with these solutions, or not gaining the skills today, will have difficulty finding a job in the future.
Ari: Gartner predicts that most databases are going to include both SQL and NoSQL. Are relational databases eventually going to be replaced by Hadoop or other NoSQL databases?
Dan: As I stated above, we have been seeing both of these technologies coalesce. They will collide and hybridize / innovate, or die. Yes, the physically separate ”relational only” database is dying, but don't confuse that with the intellectual property and lessons learned in a relational system. Those lessons, that IP, and all those best practices (for the most part) will morph, adapt, and be reapplied to the NoSQL hybrid data management systems. Note: not all NoSQL engines are ”databases”; some are just file stores, others are document stores, others are graph engines, and so on. They are Data Management Systems, not just databases.
Ari: Data Vault seems to be on the rise. What are the main benefits of Data Vault in a nutshell?
Dan: Data Vault 2.0 provides the business with a solid foundation of repeatable and scalable best practices. It includes agility, adaptability, and flexibility at its core. It teaches those in the BI and EDW spaces how to create a future-proof enterprise vision for BI and analytics. At the heart of any good Data Vault solution is the Data Vault Model – which helps the BI team focus on business keys, business processes, and master data. The benefits include leaner, faster teams, more accurate BI solutions, and little to no re-engineering of the solution as it scales to Petabyte levels. All of this ties the data to the value chain of the business, and keeps the BI solution in check as the business grows and changes.
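To make the focus on business keys a bit more concrete, here is a minimal Python sketch of how a Data Vault 2.0-style hub row might be keyed by hashing the business key. The hash function, column names, and sample values are illustrative assumptions for this article, not something Dan prescribed in the interview.

```python
import hashlib

def hub_hash_key(*business_key_parts: str) -> str:
    """Derive a hub hash key from one or more business key parts.
    Parts are trimmed, upper-cased, and delimited before hashing, so the
    same business key always yields the same surrogate on any platform."""
    normalized = "||".join(part.strip().upper() for part in business_key_parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# A hub row carries only the hash key, the business key, a load date, and a
# record source; descriptive attributes live in satellites keyed by the same hash.
customer_hub_row = {
    "hub_customer_hk": hub_hash_key("CUST-1001"),
    "customer_bk": "CUST-1001",
    "load_dts": "2015-06-01T00:00:00",
    "record_source": "CRM",
}
print(customer_hub_row["hub_customer_hk"])
```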
Ari: How is hardware evolution (e.g. in-memory databases) affecting Data Warehouse design?
Dan: It's changing the game. The way we physically design data models is changing. Not just in-memory solutions, but high-speed real-time / streaming solutions, document solutions, and more cause us to re-think how the physical data needs to be structured or semi-structured in order to achieve the performance gains the business users demand. The focus of IT and the design initiatives should now literally be on the business processes, the business keys, and the logical design. The task (or job) at hand is the same as it always has been: the challenge of data integration. That charges us to consider just what data integration truly means, and at what levels it needs to happen.
In the end, hardware has always gotten better, faster, and cheaper – however, it has now reached a point where it is forcing us to re-think physical design (physical data models). In fact, it's suggesting that we worry less about physical data models, and focus more on logical models and the flow of the business, as I stated earlier.
Ari: Building ETL is often the most tedious phase in a Data Warehouse project. What are the most interesting developments in this area?
Dan: ”ETL” as a physical data movement tool is literally dead or dying, especially in the context of high volume and / or high-volume, high-speed (real-time) data. It takes too long to execute complex transformations in the middle of the pipeline – it simply slows things down far too much, and doesn't scale well. ETL as a ”principle” will be around as long as there are ”flat files” delivered. What is happening is EL-T: extracting and loading first (EL) is the only way to scale, and moving the ”T”, the transformation logic, to where the data lives is becoming a necessity.
Transformations are being embedded (and will continue to be embedded) into the Data Management systems (like Map/Reduce at the hardware and software layers). Netezza proved this concept YEARS ago when it moved its data-locating logic down into its firmware layers. Although they are a specialized device, they were ahead of the curve. Now, Hadoop is echoing the same sentiments, as are all the vendors leveraging Hadoop. Moving the Transformations, or ”BIG T”, to where the data lives is the only way forward – especially in Petabyte-scaled systems. ELT is one such execution component (think SQL in the database), but there too, relational vendors need to innovate and MOVE the execution of the logic out to where the data is physically stored, or they will reach the same bottlenecks reached by Amazon and Google years ago.
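As a rough illustration of the EL-T pattern Dan describes, the sketch below loads raw rows first and then issues the transformation as SQL so that it executes inside the engine, next to the data. SQLite stands in for the target platform here, and the table names are made up; in practice the engine would be Hive, an MPP appliance, or a Hadoop SQL hybrid.

```python
import sqlite3

# Stand-in for the target data management system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_sales (sale_id INTEGER, amount_cents INTEGER, country TEXT)")

# E + L: extract and load the raw data as-is; no transformation in the pipeline.
raw_rows = [(1, 1999, "fi"), (2, 5250, "se"), (3, 750, "fi")]
conn.executemany("INSERT INTO stg_sales VALUES (?, ?, ?)", raw_rows)

# T: the transformation is expressed as SQL and executed by the engine itself,
# instead of streaming every row through a middle-tier ETL server.
conn.execute("""
    CREATE TABLE sales_by_country AS
    SELECT UPPER(country) AS country, SUM(amount_cents) / 100.0 AS total_eur
    FROM stg_sales
    GROUP BY UPPER(country)
""")
print(conn.execute("SELECT * FROM sales_by_country ORDER BY country").fetchall())
```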
Engines like Talend that have ELT ”already built” as a foundational design layer will come out ahead of the game. Engines like AnalytixDS.com (capable of generating Map & Reduce, Hive and Pig scripts) will also be ahead of the game. Basically, software that can generate transformations to be executed WHERE the data lives will survive and live on. Engines that cannot make this transition fully will eventually erode and topple over.
Another interesting development here is the use of Data Vault standards to automate the generation of these principles. With an advanced / mature BI solution like Data Vault 2.0, BI teams are able to leverage pre-built templates for specific data integration and data movement techniques. Couple these teams with proper automation software (like AnalytixDS.com / Mapping Manager), and the teams become super powerful quickly. Automation of the ”data warehousing side” is changing the game.
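As a hedged illustration of what template-driven automation can look like – not the actual output of any of the tools Dan names – this sketch generates a repeatable hub-loading statement from a small metadata dictionary. The template text and metadata fields are invented for this example.

```python
# A minimal sketch of metadata-driven code generation: one template, many hubs.
HUB_LOAD_TEMPLATE = """
INSERT INTO {hub_table} ({hash_key}, {business_key}, load_dts, record_source)
SELECT DISTINCT {hash_expr}, src.{business_key}, CURRENT_TIMESTAMP, '{source}'
FROM {staging_table} src
WHERE NOT EXISTS (
    SELECT 1 FROM {hub_table} tgt WHERE tgt.{hash_key} = {hash_expr}
);
""".strip()

hub_metadata = {
    "hub_table": "hub_customer",
    "hash_key": "hub_customer_hk",
    "business_key": "customer_bk",
    "hash_expr": "md5(upper(trim(src.customer_bk)))",
    "staging_table": "stg_crm_customer",
    "source": "CRM",
}

# The same template, driven by different metadata rows, covers every hub in the model.
print(HUB_LOAD_TEMPLATE.format(**hub_metadata))
```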
Another piece to this puzzle is the upswell in Managed Self-Service BI. Some (not all) of these tools actually provide business users with a visual way to construct their own EL-T flows, and then generate them directly to Hadoop / Hadoop solutions. Managed Self-Service BI will continue to evolve over the next two years.
Ari: Do Data Virtualization and the concept of “Logical Data Warehouse” have a big impact on Data Warehousing?
Dan: In one word: Yes. At the same time, it all depends on the volume of data you have today, and the capabilities of the hardware and software underneath. Today, if you try to create a logical-only data warehouse (i.e. all data in the data warehouse is virtualized or in memory), you WILL be limited by the amount of data you can execute with. In other words, just try to ”virtualize in memory” a data warehouse that is just 500 Terabytes – it is not truly possible.
In the future, as technology continues to evolve, this too will change and shift. There is a blurring of the lines around what ”in memory” means. Just think of the SSD (solid state disk) – it IS memory, it just happens to run through a DISK bus (which is admittedly slow), but the SSD offers 10x to 100x the performance of a standard 7500 rpm disk. But just because the SSD is memory, and I put my ”database” on an SSD, does that mean that I have virtualized my entire data warehouse? No, not really.
To truly virtualize or better yet, provide a 100% logical data warehouse, you must put views over a staging area, then fire ad-hoc queries at those views with thousands of users hitting the virtual data warehouse – then, and only then, can you say that your data warehouse is 100% logical and virtualized.
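A minimal sketch of that ”views over a staging area” idea, again using SQLite only as a placeholder engine with invented table names: no data is copied into a warehouse layer, and every query is resolved against the staging data at the moment it is asked.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Staging area: raw data landed from the source systems, untransformed.
conn.execute("CREATE TABLE stg_orders (order_id INTEGER, customer_bk TEXT, amount REAL)")
conn.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)",
                 [(1, "CUST-1001", 19.99), (2, "CUST-1002", 52.50), (3, "CUST-1001", 8.00)])

# The 'logical data warehouse' layer: only a view definition, no physical copy.
conn.execute("""
    CREATE VIEW vw_customer_revenue AS
    SELECT customer_bk, SUM(amount) AS total_amount
    FROM stg_orders
    GROUP BY customer_bk
""")

# Ad-hoc queries hit the view; all the work happens at query time, which is
# exactly why volume and concurrency become the limiting factors Dan mentions.
print(conn.execute("SELECT * FROM vw_customer_revenue ORDER BY customer_bk").fetchall())
```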
It WILL become a force in the future; it's not a question of IF, but rather WHEN and how soon.
Ari: Some people tend to think they can build a BI environment without a Data Warehouse just by using tools like Qlikview or Tableau. Is this a viable option?
Dan: No, it really isn’t. It’s been tried before.
The issue has been, is, and always will be: INTEGRATED DATA. The next issue is Governance, and the next: Master Data Management, along with Master Metadata. If or when the business can construct a single ”logical enterprise view” of its data across multiple disparate systems, AND subject the models / views to governance, master data management, and metadata – then and only then will tools like the ones you asked about be able to claim: ”you can get by without a data warehouse”.
Until then, statements along these lines from these vendors are hot-air promises. Let me put it another way: if ”Enterprise BI solutions” were as easy as ”buy & install a federated query engine”, then companies would have already adopted this route and wouldn't still be trying to build BI and EDW solutions. Obviously there is more to this puzzle than a simple federated query engine. The BI specialists will have to learn more about the business, and the business will become immersed in learning how to apply the proper tooling (like QlikView and Tableau), but getting by without a data warehouse isn't going to happen any time soon.
Ari: Some proponents of the Agile movement suggest that database design can be carried out almost without data modeling. How do you feel about this?
Dan: The answer is: it depends. IF we are talking about ingesting and storing snapshots of data on arrival, then sure – build and implement a Hadoop solution, and load your data sets in a schema-less fashion. However, if we are talking about turning DATA into INFORMATION and making it useful for business, then no – data *must* be modeled or attached to some form of model in order to make it understandable by business. Let me ask you this: suppose you saw the hex bits of the following file:
0abf de11 9ed2 0c0a ab33 … and so on…
Can you tell me ”just by looking at these bytes” what this is? Is it a text file? a sentence? a binary? an image? a zipped file? a number? a date?
My point exactly: you *need* context. The next piece of context to decipher this ”data” and make sense of it is for me to tell you: it's the start of an image. (OK, I just gave you a MODEL to put it in – a very rudimentary model, a single-field model, but a model nonetheless.)
Now that you know it's an image, can you tell me: is it a JPEG, GIF, BMP, PIC, or something else? Nope. Not without MORE metadata, and to get more metadata, you need more modeling, i.e.: where and what are the header bits, how big is the file, where is the checksum located, and so on. More modeling is REQUIRED even for multi-structured data like this.
Then I ask you the BUSINESS question: can you tell me what this is a picture of? Just by looking at the bytes? The answer is no – by looking at the raw data, you have NO CONTEXT. You need a model, to run an algorithm, then to draw the data on the screen and bring the image to life. It is at the point at which you can view the image that we have turned DATA into INFORMATION, and it is in answering ”is it a frog, a horse, a car, a pig, a goat, etc.” that you leverage the model that makes this information understandable. I teach these concepts in my Data Vault 2.0 Boot Camp & Private Certification class; it's part of the course to learn how to deal with ”unstructured” and multi-structured data.
To complete the question, my answer is this: at the point where you want to turn data into usable and understandable information, you *must* build a model to represent the data, you *must* codify it and classify it. Otherwise, it's just a meaningless sequence of ones and zeros (bits). That means Schema-On-Read is the way forward: the model is simply applied when the data is read, rather than imposed when it is stored.
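To make the ”bytes without a model” point tangible, here is a small Python sketch that applies exactly the kind of minimal, single-field model Dan describes: a lookup table of well-known file-header signatures (magic numbers). The helper name and the sample bytes are ours, not Dan's.

```python
# Well-known file-header "magic numbers": a tiny model that turns raw bytes
# into the statement "this is a JPEG" or "this is a PNG".
MAGIC_NUMBERS = {
    b"\xff\xd8\xff": "JPEG image",
    b"GIF87a": "GIF image",
    b"GIF89a": "GIF image",
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"BM": "BMP image",
    b"PK\x03\x04": "ZIP archive",
}

def classify(raw: bytes) -> str:
    """Without this lookup (the model), the raw bytes carry no usable context."""
    for signature, description in MAGIC_NUMBERS.items():
        if raw.startswith(signature):
            return description
    return "unknown - more metadata (a richer model) is needed"

print(classify(b"\x89PNG\r\n\x1a\n...rest of file..."))  # -> PNG image
print(classify(bytes.fromhex("0abfde119ed20c0aab33")))   # -> unknown
```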
Ari: The dimensional modeling approach by Kimball has been very popular, but in my experience it does not express complicated data structures in an EDW well. What will be the role of dimensional modeling in the future?
Dan: Dimensional modeling will become more and more important (not less important), at least from a data visualization perspective. It allows manipulation and drill-in, drill-down, and drill-up components, along with pivot. It helps us categorize, classify, and place data into ontologies. This is fundamental (as per the answer to my last question) to turning data into information. Dimensional models themselves will become buried in the tooling (like Tableau, QlikView, Excel, and so on) – as they already are – as a background for manipulating and managing user information stores. Physical dimensional modeling will become passé; these tools will automagically manage and handle the dimensions and facts in accordance with the business logic built by the business users.
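For a rough sense of what ”the tool handles the dimensions and facts” can look like underneath, here is a sketch using pandas as a stand-in for a BI tool's internal engine. The fact and dimension tables and their contents are made up for illustration.

```python
import pandas as pd

# A tiny fact table and a dimension, roughly as a BI tool might hold them internally.
fact_sales = pd.DataFrame({
    "date": ["2015-01-01", "2015-01-15", "2015-02-01"],
    "product_id": [1, 2, 1],
    "amount": [100.0, 250.0, 80.0],
})
dim_product = pd.DataFrame({
    "product_id": [1, 2],
    "category": ["Hardware", "Software"],
})

# The drill/pivot operations happen behind the scenes: join fact to dimension,
# derive a month attribute, then pivot amount by month and category.
sales = fact_sales.merge(dim_product, on="product_id")
sales["month"] = pd.to_datetime(sales["date"]).dt.to_period("M")
print(sales.pivot_table(index="month", columns="category",
                        values="amount", aggfunc="sum"))
```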
Ari: What are the most important skills for a Data Warehouse specialist now and in the near future?
Dan: I believe the most important skills for a DW specialist today are divided into two categories: technical and business. In the technical category: understanding Hadoop, Big Data, NoSQL, nearSQL, and NewSQL technologies, understanding managed self-service BI, ontologies, canonical models, concept models, context, and data mining. Business skills they need today include: turning data into information, business processes, lean initiatives, cycle time reduction, agility, KPA & KPI components, and how to be transparent with their projects. I think they also need to understand how the business they are supporting is run, what it does, why it does what it does, and how the business keys and business processes map to people's jobs in the company.
In the future, I think the DW specialist will need to understand (in addition to all of the above), on the technical side: MPP, sharding, independent multi-threaded fully re-entrant code, code partitioning, Java, Python, ELT, and automation (including test automation). From a business perspective, they need to understand how to talk to the business, help the business spot issues / gaps between reality and expectations, and almost become power users and data miners themselves.
Dan teaches most of this in his Data Vault 2.0 Boot Camp & Private Certification classes. Students get exposed to many of these principles, which makes this course vital to those wishing to improve their skill sets for the immediate future.
You can read more here.