Datopian Vision Q&A: the need for Data Management, DataOS, Open Source and more
The following is a transcript of a Q&A with Datopian Founder and President Rufus Pollock about Datopian's long-term vision.
Key Takeaway
We think there is a huge opportunity around data management in general, because today data management is a mess inside many enterprises. We want to help solve that mess, to enable people to get more insight or to drive systems reliably off their data, whether it's big or small.
Specifically, what we have, what we are developing, is a pretty robust and powerful framework for managing data, for organizing data, for data engineering. And it’s open source, and we think that is a key advantage – in fact, essential for a framework in this area. And because you can think of this as a kind of core infrastructure for your data operations, we like the term “data operating system” (or DataOS) for this type of framework or product. It provides the basic systems, structures and patterns for organizations to enable and scale the flow of data within their enterprise.
Full Q&A Transcript
How did you get started?
Well, we're kind of lucky. The reason we exist at all as an open source company, despite some quite serious competitors in the past, is that we have an inbound leads channel, because the product we've built – CKAN – is quite successful and there's a significant installed base. People write in to us and say, “Hey, we want support, we need this or that.” So we don't do a lot of proactive marketing at the moment, although we'd like to do more.
Basically people have found out about CKAN, they've identified that it's a solution for their problem and they write to us saying either we already have CKAN and we want you to do something with it, or we need CKAN to solve a problem.
DataOS Vision (aka DataHub)
Now, that’s great, and there’s very solid demand for open data portals, as well as a growing demand for internal data portals within the enterprise.
But I think there is a much bigger opportunity, which is around data management in general, because today data management is a mess inside many enterprises – most, in fact. We want to help solve that mess, to enable people to get more insight or to drive systems reliably off their data, whether it's big or small.
Specifically, what we have, what we are developing, is a pretty robust and powerful framework for managing data, for organizing data, for data engineering. And it’s open source, and we think that is a key advantage – in fact, essential for a framework in this area. And because you can think of this as a kind of core infrastructure for your data operations, we like the term “data operating system” (or DataOS) for this type of framework or product. It provides the basic systems, structures and patterns for organizations to enable and scale the flow of data within their enterprise.
The need: the rapidly growing amount and diversity of data in their organizations
Right now, what people are using CKAN for is data management. Often they've discovered a hammer, which is CKAN, so they're always asking for the hammer. But if you actually find out from them what the nail is, the nail is the amount and diversity of data in their organization, and specifically the need for a way to manage and derive value from that data.
Governments have often said, “We need an open data portal because we're mandated to have an open data portal or to publish our data.” But if you look at the underlying motivation, you have organizations – be they governments, enterprises or other nonprofits – which have increasing amounts of data around, and in particular, data which is diverse.
Governments are a good example of that. Government open data portals were a single point of discovery for data that was spread across many departments. Group dynamics and politics (in the small p sense) meant that you could not get all the government data in one big place; whether that was a good idea or not, it wasn’t going to happen. As a data ecosystem, it’s too diverse. An (open) data portal provides a solution to that: even if you don’t centralize the data itself you can centralize the gateway to all that data by having a central catalog of all the datasets.
Now, this is spreading. Even if you go to a small business, even a non-tech one, they will have multiple sources of quite rich data, for example analytics data on their web traffic or detailed sales data. Or take Datopian, which is a small business: we ask questions like, “How much money per person did we make last month?” That involves accessing at least two systems: our accounting system and an HR system with a list of the people working at the company. So even in small enterprises there is an incredible increase in the data available and in the questions we want to ask of it.
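To make that concrete, here is a minimal sketch of what answering that question involves, assuming (hypothetically) that each system can export a CSV; the file names and column names are illustrative, not our actual systems:

```python
import csv

def monthly_revenue(accounting_csv: str, month: str) -> float:
    """Sum the revenue column for rows matching the given month."""
    with open(accounting_csv, newline="") as f:
        return sum(
            float(row["revenue"])
            for row in csv.DictReader(f)
            if row["month"] == month
        )

def headcount(hr_csv: str) -> int:
    """Count the people listed in the HR system's export."""
    with open(hr_csv, newline="") as f:
        return sum(1 for _ in csv.DictReader(f))

# Hypothetical exports from the two systems
print(monthly_revenue("accounting.csv", "2019-06") / headcount("people.csv"))
```

Even this trivial question forces you to join two systems that were never designed to talk to each other.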
An explosion in data variety (as well as data volume)
If you look at any kind of company, there's often a lot of confusion around data. Big data, small data, Hadoop or Excel spreadsheets, you name it. It’s chaos. And people invent other things like data lakes, so they can pull their data somewhere central, to try and manage it. But then you find that you are missing crucial metadata or it’s all in different formats, and it’s still chaos – it’s just a centralized chaos rather than a distributed chaos (and one more vendor).
And it’s because you now have a very diverse set of data sources.
And there’s an explosion in potential uses
And this is different from the past. Say twenty years ago, either you collected data for accounting purposes, or you were trying to analyse something very specific, for example credit scoring in a bank, or which customers were doing what. Even if you had a data warehouse, connected to the data from your OLTP system, you had a very particular flow. Contrast that with now, where a lot of data science is ad hoc: I want to do this analysis now because we have to address this particular question or problem, or generate a particular insight. In other words, you've also got a much more diverse set of data needs within your company.
In Summary: there’s a rapidly growing need for better data management
In summary, there's a lot of diversity on both sides, the users and the suppliers of data. And inside many enterprises, most in fact, it’s a mess. We want to help solve that mess, to enable people to get more insight or to drive systems reliably off their data, whether it's big or small.
The Data Management space needs open source tooling and a “small pieces loosely joined” approach (vs a monolithic one-tool-to-rule-them-all)
Now, that's a very big ambition and lots of other people are in that space. But that space isn't very productisable. What I mean by that is that the needs users have aren’t really suited to being solved by one big tool.
A lot of people who want to make a lot of money in tech startups want to build a product and then sell it lots of times to people in that space. It might be, “I'm going to build a fancy new data integration pipeline, or I'm going to build a new tool for wrangling data, or I'm going to build a new BI tool, a new Tableau”. Now, there is space for lots of these valuable tools, but I think that overall the space is messy, and it's more like the website and web application space.
With websites and web applications there are tools that build you a whole website (Squarespace etc). But overall, they are a minority of the market, and it's not like there was one product to rule them all. Why? Because web applications are quite diverse. There are common aspects, which is why we have frameworks: Ruby on Rails, Django, Flask, you name it.
Now that’s different from, say, the word processing space. There you have a pretty well-defined need and you can have a standard product that addresses most of what a user needs. As a result that space is dominated by product companies (with closed source products).
Now we think that the data management space, the data engineering space is more like the web applications space than it is like the word processing space. We think that data management and data engineering is going to be more about “frameworks” and specifically open source frameworks.
The Data Operating System (Data OS)
Operating systems here are maybe an even better metaphor than web applications for this frameworks approach. Frameworks are products in a way, but they're more a way of building solutions. They aren't a solution in themselves: they allow you to build that solution, the actual web application or website, but they don’t directly give you that.
And I think that the data ecosystem, even inside a company, looks more like that. It looks like you patching together different things, whether it's an Oracle database or other internal tooling for managing and integrating data, to build your data flows, your data factories, your data warehouses, whatever you want to call those internal things. You might stretch that to call it a data operating system. If you think of an operating system, it provides common layers, patterns and tooling that let you build and run applications. So I see the two as similar: the data ecosystem is like an operating system.
Open source is essential because this is a platform that you are going to build on and customize
I think there's a very strong thesis that this will need to be open source, because it’s a platform for so much else in your organization and you just can’t afford to have that be proprietary – not only because of the lock-in but because of the lack of flexibility. Open source is essential to give users the freedom to connect, adapt and grow.
Remember, the data operating system is a foundation on which other diverse applications run or connect. It helps glue your data operations together. If you have a proprietary system, there are these huge issues around lock-in, around customization, around integration.
This is different from the old days, and it links back to the explosion in the diversity of data, tooling and needs that I talked about earlier.
In the old days, you'd go to a particular vendor for your data warehouse and Oracle would sell you the whole nine yards: you'd have the database, the pipeline, and the BI. Now that's already broken down to some extent because there's just too much diversity just in the tooling – you want the best BI, the best pipeline etc. Plus you now don’t just have a single Oracle database as your source of data: you probably have a whole range of data sources (including some external ones). Finally, you don’t have one analytics pipeline: you’ve probably got dozens. Realistically you’re just not going to have one vendor who comes along and gives you all that (though some of the big cloud providers are trying to convince you they can).
An operating system is patterns (core APIs) plus associated libraries and utilities
When you look at it, or at least for the purposes of our analogy here, an operating system is a set of patterns plus associated libraries and tooling.
For example, an operating system will have a pattern for storing and organizing bytes on disk (files, directories and the file system), it will have a pattern for I/O and processing (e.g. Unix pipes), and it will have a pattern for adding and removing applications.
Then there will be some libraries for implementing these patterns: a library for doing I/O. Then you’ve got all the utilities, the porcelain to the plumbing in the libraries. For example, something that outputs the contents of a file, or lists the contents of a directory, or makes a directory etc etc.
Now when it comes to your data operating system, you’ve got a pattern for data files and datasets (collections of files), you’ve got a data catalog pattern (the filesystem), you’ve got a pattern for data flows (I/O and pipes) etc.
Then you’ve got your libraries for that. And finally, you’ve got utilities: for validating data, for creating datasets from raw files, for displaying datasets, for versioning datasets etc etc.
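To illustrate, here is a toy sketch of that layering – a dataset pattern, a small catalog library implementing it, and a utility on top. This is purely hypothetical, not our actual tooling:

```python
from dataclasses import dataclass, field

@dataclass
class Dataset:
    """The pattern: a dataset is a named collection of data files plus metadata."""
    name: str
    files: list[str] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)

class Catalog:
    """The library: the 'filesystem' of the data OS, mapping names to datasets."""
    def __init__(self) -> None:
        self._datasets: dict[str, Dataset] = {}

    def register(self, dataset: Dataset) -> None:
        self._datasets[dataset.name] = dataset

    def get(self, name: str) -> Dataset:
        return self._datasets[name]

def show(catalog: Catalog, name: str) -> None:
    """A utility, the porcelain on top: display a dataset, like `ls` for data."""
    dataset = catalog.get(name)
    print(dataset.name, dataset.metadata, *dataset.files, sep="\n  ")

catalog = Catalog()
catalog.register(Dataset("sales", ["2019-q1.csv"], {"owner": "finance"}))
show(catalog, "sales")
```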
And where do you start? One good place is the catalog.
And one of the good things is that you can start small and then build up. It’s just like a real world operating system where you can start with a “kernel” and then build on that incrementally.
For a data operating system, one really good place to start is the catalog.
Often, you've already got a whole bunch of data in your organization: spreadsheets, Google Analytics, an existing data warehouse etc. A great starting point is just to catalog it. And it's not just useful for bringing some order to this kind of chaos, it’s also very practical: all that data that is already out there is owned by people and is part of existing business processes. You don’t really want to mess with that to start with. You want to focus on making that data discoverable right where it is. You start doing some metadata management – what Gartner would call something like ‘enterprise metadata management’ – but it's essentially a lightweight discovery process.
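As a sketch of what that lightweight discovery can look like, imagine catalog entries that are pure metadata pointing at data where it already lives. The fields and locations here are illustrative assumptions, not a real schema:

```python
# Metadata-only records: the data itself stays in the systems that own it.
catalog = [
    {
        "name": "web-traffic",
        "location": "google-analytics://property/12345",
        "owner": "marketing",
        "description": "Site analytics, updated continuously.",
    },
    {
        "name": "monthly-sales",
        "location": "warehouse://finance/sales",
        "owner": "finance",
        "description": "Sales figures from the existing data warehouse.",
    },
]

def find(term: str) -> list[str]:
    """Discovery is just search over the metadata; no data gets moved."""
    return [d["name"] for d in catalog if term in d["description"].lower()]

print(find("sales"))  # -> ['monthly-sales']
```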
So we, a 40-person company, have CKAN, and our clients, some of whom are Fortune 500 companies, come to us and say, “Hey, we've tried CKAN, and we want to use it for this catalog functionality.”
Now, my dream is to drive a massive truck through that opportunity and say, “Hey, actually, you need an entire data operating system. You want the catalog, but really, you're missing patterns, you’re missing the libraries, you’re missing the utilities. You're hiring some people to write stuff in Jupyter notebooks over here and you're doing some other stuff over there and it’s a mess and it won’t scale. You need some order to your chaos.”
Now, we're not saying we're going to build all of this, but we have a vision, a system, a pattern and tooling.
Data Patterns and Frictionless Data
But you could start thinking of data systems in terms of their primitives. There’s the concept of a cell or a value, of a column or a row. And then you basically just make a list of those. One set of primitives is the data objects of your data system, and the other is the flows or transformations of that data. If you want to think of it in terms of functions, you've got your types, and you've got your functions over those types.
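As a toy rendering of that “types and functions” view, assuming nothing beyond plain Python: data objects as types, flows as functions between them:

```python
from typing import Callable

Row = dict[str, object]          # a row: cell values keyed by column name
Table = list[Row]                # a table: an ordered list of rows
Flow = Callable[[Table], Table]  # a flow: a transformation from table to table

def select(columns: list[str]) -> Flow:
    """A flow primitive: keep only the named columns."""
    def run(table: Table) -> Table:
        return [{c: row[c] for c in columns} for row in table]
    return run

def pipe(*flows: Flow) -> Flow:
    """Compose flows end to end, like Unix pipes for data."""
    def run(table: Table) -> Table:
        for flow in flows:
            table = flow(table)
        return table
    return run

table: Table = [{"month": "2019-06", "revenue": 100.0, "region": "EU"}]
print(pipe(select(["month", "revenue"]))(table))
```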
Even if a lot of what I’ve been describing sounds like a product, a lot of it is really a pattern – it's like the Unix philosophy, something I'm deeply inspired by, by the way. They wrote an awesome operating system, but their approach to building it lived on longer than the original tooling. And they also built a lot of tools: the small pieces loosely joined. There's the core OS, the core pattern, and then these tools on top of that.
One project we've got is called Frictionless Data, which we've worked on for quite a few years, and it's got quite a bit of traction. It's basically saying: no one had ever just written a simple, standalone schema for tables – outside of SQL, that is. You could use JSON Schema or something, but no one had written something that lets you exchange tables between different systems. Everyone wants to shove CSV around, but CSV doesn't have a schema. So to standardise the concept of a dataset, we wrote these things called a Data Package, a Table Schema, a Data Resource. If you think about it, when you build a data system you're basically passing data objects around, and when data objects move you can think of them as a byte stream plus metadata. There are formats like Parquet that embed that metadata in the file itself, but I don't think CSV is going away anytime soon.
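For instance, here is a minimal Data Package descriptor in the Frictionless Data style – metadata plus a Table Schema travelling alongside a plain CSV. The dataset and field names are illustrative; see the Frictionless Data specs for full details:

```python
import json

datapackage = {
    "name": "monthly-sales",
    "resources": [
        {
            "name": "sales",
            "path": "sales.csv",   # the byte stream...
            "schema": {            # ...plus the metadata that travels with it
                "fields": [
                    {"name": "month", "type": "yearmonth"},
                    {"name": "revenue", "type": "number"},
                ]
            },
        }
    ],
}

with open("datapackage.json", "w") as f:
    json.dump(datapackage, f, indent=2)
```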
The other primitive in data systems is data flows. And interestingly, though I may be wrong, there’s no widely adopted, common way of describing data flows. There's now a little bit of a YAML format for it, but there's no way of saying, for example, “Here’s the spec for this general task we want to do”.
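Purely as a hypothetical sketch of what such a spec could look like – written here as a Python dict for illustration, though in practice it would typically live in YAML; the step names and structure are assumptions, not an established standard:

```python
# A declarative description of a flow: what to run and with what parameters,
# kept separate from the code that executes it.
flow_spec = {
    "name": "monthly-sales-report",
    "steps": [
        {"run": "load", "parameters": {"source": "warehouse://finance/sales"}},
        {"run": "validate", "parameters": {"schema": "sales.tableschema.json"}},
        {"run": "aggregate", "parameters": {"group_by": "month", "sum": "revenue"}},
        {"run": "dump", "parameters": {"target": "reports/monthly-sales.csv"}},
    ],
}
```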
There's a lot of data and there are a lot of software engineering tools, but essentially a lot of the data work got taken over by data scientists. And data scientists don't really do good software engineering: they don't write tests, and they do stuff in Jupyter notebooks, which are great, but not for software engineering.
I think that defining those primitives in easy-to-use, Zen-like ways is important. One of the obsessions of Frictionless Data was to simplify these complicated languages, like RDF, and to standardise them. The challenge is to do things very simply.
But patterns are a hard sell in themselves – no one really cares about patterns per se. People never adopt patterns, they adopt a tool.