
After his 2015 appearance, Paul Dix, the creator of InfluxDB, will be on stage again at dotScale 2018. He answered a few questions for us:
dotConferences: You spoke at dotScale 2015 about why time series data is the worst and best use case in distributed databases. Do you still feel that way?
Paul Dix: I still do. All the same issues exist. The scale of the data, the large range scans in queries, queries that are frequently uncacheable because new data is constantly arriving, and frequent large deletions are all difficult problems to solve. But the advantages are still there too: it’s mostly an append-only workload, and it isn’t OLTP, so consistency constraints can be relaxed, and a data pipeline with a simple write queue like Kafka can be used with great success to ensure all data is eventually captured.
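As a rough illustration of the write-queue pattern Paul describes, here is a minimal Go sketch using the segmentio/kafka-go client. The broker address, topic name, and sample point are hypothetical assumptions for the example; this shows one way to stage time series writes, not InfluxDB’s actual ingestion path:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/segmentio/kafka-go" // assumed client; any Kafka producer works
)

func main() {
	// Append-only workload: every point is a new message, never an update.
	w := &kafka.Writer{
		Addr:     kafka.TCP("localhost:9092"), // hypothetical broker address
		Topic:    "timeseries-writes",         // hypothetical topic name
		Balancer: &kafka.LeastBytes{},
	}
	defer w.Close()

	// A point in InfluxDB line protocol: measurement,tags fields timestamp.
	point := fmt.Sprintf("cpu,host=server01 usage=0.64 %d", time.Now().UnixNano())

	// If the database is briefly unavailable, the message stays in the queue
	// and is replayed later, so every point is eventually captured.
	if err := w.WriteMessages(context.Background(),
		kafka.Message{Value: []byte(point)},
	); err != nil {
		panic(err)
	}
}
```

Because the workload is append-only, a consumer can replay the topic from any offset to rebuild state after a failure, which is what makes the relaxed consistency acceptable.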
InfluxDB has been through a lot since 2015. What have you learned since then and how did it influence InfluxDB?
So much has changed in that time. We created a new storage engine, we created a disk- and memory-backed index for the time series metadata, and we’ve dramatically improved performance. On the other parts of the stack, we’ve continued to develop Telegraf, our data collector, and it’s now the most successful OSS project we have. I would guess it’s deployed on millions of computers worldwide, and its community contributions come in at a higher velocity than the database’s. We’ve launched a processing engine (Kapacitor) and a UI for the stack (Chronograf) since then, and they have tens of thousands of users worldwide.
Much of what we’ve learned is being applied to version 2.0 of the platform, which we’ve been developing for a while now. The new query language and engine are being built with some key usability improvements and the ability to add more powerful features quickly. We’ve started the work to decouple storage from compute (query processing) to improve reliability and scalability. The new version of the platform will bring all four components together under a single consistent, documented API and a UI that ties it all together. Builds of individual components will still be available, but we’re thinking about it as a single cohesive product rather than four separate ones. These new features, along with what we’ve learned and what we’re building towards, are part of what I’ll talk about at dotScale.
So you decided to write your own storage engine, which is obviously a huge undertaking. Looking back, would you do it again?
Absolutely. We’ve achieved fantastic performance wins in write throughput, query speed, and compression. There’s still nothing out there in OSS that meets our unique set of requirements. Owning the storage engine is also what’s enabling some of the planned improvements for 2.0.
As you mentioned, InfluxDB has evolved into a whole suite of components, which you call the TICK stack. What made you decide to expand your scope beyond just the database?
That was actually already the plan for a bit over a year before I spoke at dotScale. When I was out raising our Series A round, from early May to early fall of 2014, my pitch to investors was to raise the round so we could hire a team to build out a platform solving the four problems I saw as common in time series data: how to collect it, how to store and query it, how to process it for ETL and monitoring, and how to visualize it. From about May 2014 on, the plan was never to be just a database company.
My philosophy is that if you solve problems for developers so they have to write less code and less configuration, you’ll win in the market. I’ve seen this with other open source projects like Ruby on Rails, Pandas, MongoDB, and many others. Make it easy to use, improve developer productivity by multiple factors, and adoption will follow (assuming you tell people about it). I’ve called this optimizing for developer happiness or, more recently, optimizing for time to awesome.
Congrats, by the way, on your recent $35M round of funding! How are you going to use it?
Thank you! We’re scaling out the team across all functions (engineering, marketing, sales, and back office) and plan to double in size this year. That’s partly the funding, but it’s also driven by fantastic revenue growth over the last year and our operating plan for this year.
You made an early bet on Go. Are you still happy with it? What's your main grievance?
For sure. We’ve benefited from Go’s growing popularity since 2013 and the Go team’s exceptional work over the last five years. As a language, I’m still very happy with how it’s designed and developed. My biggest grievance is dependency/package management. The team still hasn’t really gotten this right, and it’s so important that the solution should come from the language implementers. It has been a consistent source of pain. They should just pay Yehuda Katz (creator of Ruby’s Bundler and Rust’s Cargo) to develop the officially supported Go solution and be done with it. If he could be bribed with a king’s ransom to do the work, I’m sure the Go community would all kick in for it.
Do you think we are going to see more databases written in Go?
It’s already happening: key/value stores, distributed databases, distributed SQL databases. Hardly a week goes by without my hearing about some new database project started in Go. CockroachDB and etcd come to mind as the most popular new databases in the language.
What's your wishlist for InfluxDB 2.0?
This would require an entire series of blog posts to lay out, but I’ll list a few things briefly. I want to unify the language and the API for the stack. InfluxDB 2.0 looks less like a traditional database and more like a platform for solving problems with time series data. For operating at scale, it won’t be a monolithic piece of software like traditional databases. The API will be designed to be multi-tenant from day one. It will have endpoints for many things in addition to writing and querying data, like defining background processing tasks, which consolidates Kapacitor and continuous queries (a current DB feature) into one unified feature. You’ll be able to define collection rules for data that gets pulled from other sources. The API will support defining user dashboards, alerting, and notification rules. And you’ll be able to pull in and join data from third-party services like a relational database or REST APIs. Finally, we’ll have a built-in UI to manage all of it and get insight from your data.
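To make the “single consistent API” idea concrete, here is a hedged sketch of what registering a background task through such an API might look like. The endpoint, port, payload shape, and token below are purely illustrative assumptions; the 2.0 API had not been published at the time of this interview:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	// Hypothetical: a background processing task defined through the same
	// API used for writes and queries, instead of a separate Kapacitor setup.
	body := []byte(`{
	  "org":    "my-org",
	  "every":  "1h",
	  "script": "downsample cpu usage into hourly means"
	}`)

	req, err := http.NewRequest("POST",
		"http://localhost:9999/api/v2/tasks", // illustrative endpoint, not a published API
		bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	// Per-tenant auth token: the multi-tenant design means every request
	// carries the caller's identity. Token value is hypothetical.
	req.Header.Set("Authorization", "Token MY_TOKEN")
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```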
Can you tease us a bit about your upcoming talk at dotScale in June?
I’ll be talking about our new query language and engine and their design. The language is less like a query language and more like a scripting language for working with time series data. I’ll talk about the work we’re doing to decouple the query processing engine from the data tier to get better scalability and isolation. I’ll also talk about our use of Apache Arrow and the work we’re doing to optimize it as a data interchange format. We recently contributed the Go implementation to the Apache Software Foundation, so we’ll continue to improve that for the entire Arrow community as we build out 2.0. Of course, with the short format I’ll have to focus the message, so I’ll likely trim some of those topics as I develop the content.
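For readers unfamiliar with Arrow, here is a minimal sketch of building a columnar array with the Go implementation Paul mentions. It assumes the github.com/apache/arrow/go/arrow import path in use around the time of the contribution (later releases moved to versioned module paths), and the values are made up:

```go
package main

import (
	"fmt"

	"github.com/apache/arrow/go/arrow/array"
	"github.com/apache/arrow/go/arrow/memory"
)

func main() {
	pool := memory.NewGoAllocator()

	// Columnar builder: the values of one time series field are stored
	// contiguously, so engines can scan and exchange whole columns at once.
	b := array.NewFloat64Builder(pool)
	defer b.Release()

	b.AppendValues([]float64{0.61, 0.64, 0.59}, nil) // nil = no null entries
	values := b.NewFloat64Array()
	defer values.Release()

	fmt.Println(values) // prints the column: [0.61 0.64 0.59]
}
```

The columnar layout is what makes Arrow attractive as an interchange format: components on both sides of a boundary, such as a query engine and a storage tier, can hand off entire columns without per-row serialization.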
Thanks a lot, Paul!
We hope you all enjoyed this interview, and we look forward to seeing you at dotScale on June 1st!