Will the future be the world of time series databases?

Posted on 2022-03-30 by Admin

Introduction to Time Series Data

Time series databases have surged in popularity recently, and search interest in them is rising across search engines.

Ranking on DB-Engines:

[Figure: DB-Engines ranking list for time series databases]

There is a reason for the rise of time series databases. Take unmanned vehicles as an example. While operating, an unmanned vehicle must monitor many states, including coordinates, speed, direction, temperature, and humidity, and must record the monitored data at every moment to support later analysis. Each vehicle collects nearly 8 TB of data every day. Simply storing that data without querying it is already a significant cost. But if you need to quickly answer an aggregate query such as “which unmanned vehicles exceeded 60 km/h on Beijing Road at 2 o’clock this afternoon?”, a time series database is a good choice.
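To make the shape of that query concrete, here is a minimal in-memory sketch in Python. The record layout (`VehicleReading`), the field names, and the sample readings are all hypothetical and chosen only for illustration. A real time series database would answer the same question from indexed, compressed storage rather than a list scan.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class VehicleReading:
    vehicle_id: str      # hypothetical vehicle identifier
    road: str            # hypothetical road name attached to the reading
    timestamp: datetime  # when the reading was taken
    speed_kmh: float     # measured speed in km/h


def fast_vehicles(readings, road, start, end, min_speed_kmh):
    """Return sorted IDs of vehicles that exceeded min_speed_kmh on `road` in [start, end)."""
    return sorted({
        r.vehicle_id
        for r in readings
        if r.road == road
        and start <= r.timestamp < end
        and r.speed_kmh > min_speed_kmh
    })


# Made-up sample data for the sketch.
readings = [
    VehicleReading("car-1", "Beijing Road", datetime(2022, 3, 30, 14, 5), 72.0),
    VehicleReading("car-2", "Beijing Road", datetime(2022, 3, 30, 14, 10), 45.0),
    VehicleReading("car-3", "Beijing Road", datetime(2022, 3, 30, 9, 0), 80.0),
    VehicleReading("car-1", "Nanjing Road", datetime(2022, 3, 30, 14, 20), 90.0),
    VehicleReading("car-4", "Beijing Road", datetime(2022, 3, 30, 14, 45), 61.5),
]

result = fast_vehicles(
    readings,
    road="Beijing Road",
    start=datetime(2022, 3, 30, 14, 0),
    end=datetime(2022, 3, 30, 15, 0),
    min_speed_kmh=60,
)
print(result)  # ['car-1', 'car-4']
```

The query combines a time range, a dimension filter (the road), and a condition on a measured value, which is the typical pattern for time series workloads.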

Applications such as securities trading, smart homes, and "city brain" projects all rely on a form of data that measures how things change over time, where time is not just a metric but the primary axis.

That is time series data, and it is playing a growing role in our world. Time series databases (TSDBs) have become the fastest growing database category, and with the arrival of 5G they are likely to become even more popular.

Basic concepts and meanings of time series data

A time series data model has three important parts: the subject, the time point, and the measured value. Apply this model and you will find that you encounter this kind of data constantly in daily work and life.

If you are an investor, a stock's price is time series data: it records the price at each point in time.

If you work in operations, monitoring data is time series data. For example, CPU monitoring data records the actual CPU usage on a machine at each point in time.

Time series data connects isolated observations into a line along the time dimension, revealing how the state of hardware and software systems changes. An isolated observation is not time series data, but once many observations are strung together along a timeline, we can study and analyze their trends and patterns.

Mathematical models for time series data

How data is stored should follow from its mathematical model and characteristics, and time series data is no exception.

The following figure shows a segment of time series data. It records the incoming and outgoing traffic of each port on each machine in a cluster over a period of time, with one observation every half hour. We use the data in the figure to introduce the mathematical model of time series data. (The names of these basic concepts may differ between time series databases; the terminology here follows Tencent CTSDB.)

measurement: a dataset of measurements, similar to a table in a relational database;
point: a single data point, similar to a row in a relational database;
timestamp: the point in time at which the data was collected;
tag: a dimension column that describes where the data belongs and what its attributes are, such as which device or module generated it; tags generally do not change over time and are used in queries;
field: an indicator column that holds the measured value; fields fluctuate smoothly over time and are generally not used as query conditions.

The measurement for this dataset is Network, and each point consists of the following parts:

timestamp: the time of the observation
Two tags: host and port, which identify the machine and the port each point belongs to
Two fields: bytes_in and bytes_out, the measured values of the point, that is, the average incoming and outgoing traffic over the half hour

The same host and port generate one point every half hour, and over time the field values (bytes_in, bytes_out) keep changing.
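The model above can be sketched as a small Python data structure. This is only an illustration of the concepts, not CTSDB's actual API or storage format. The `Point` class, the `series_key` helper, the host names, and the sample values are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class Point:
    measurement: str          # dataset name, similar to a table
    timestamp: int            # collection time in Unix seconds
    tags: Dict[str, str]      # dimension columns, e.g. host and port
    fields: Dict[str, float]  # indicator columns, e.g. bytes_in and bytes_out

    def series_key(self) -> Tuple[str, Tuple[Tuple[str, str], ...]]:
        """Points with the same measurement and tag values belong to one series."""
        return (self.measurement, tuple(sorted(self.tags.items())))


HALF_HOUR = 30 * 60      # sampling interval in seconds
START = 1_648_598_400    # arbitrary example start time (Unix seconds)

# Hypothetical sample data: host-a port 80 reports three times, host-b once.
points = [
    Point("Network", START + i * HALF_HOUR,
          {"host": "host-a", "port": "80"},
          {"bytes_in": 100.0 + i, "bytes_out": 40.0 + i})
    for i in range(3)
]
points.append(
    Point("Network", START,
          {"host": "host-b", "port": "80"},
          {"bytes_in": 50.0, "bytes_out": 20.0})
)

# Group points into series by measurement + tags.
series = {}
for p in points:
    series.setdefault(p.series_key(), []).append(p)

for key, pts in series.items():
    print(key, [p.fields["bytes_in"] for p in pts])
```

Grouping by measurement and tags shows why tags stay fixed while fields change: the tags identify the series, and the fields are the values observed along it.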

Time series data characteristics

Data pattern: time series data grows with time, the same dimensions take values repeatedly, and indicators change smoothly. This can be seen in the Network data above.
Writes: writes are continuous and highly concurrent, with no updates. Time series databases often face real-time writes from millions or even tens of millions of terminal devices (for example, Mobike had tens of millions of bikes nationwide in 2017), but the data mostly represents device state and is not updated after it is written.
Queries: indicators are analyzed statistically along different dimensions, and there is a clear split between hot and cold data. In general, only recent data is queried frequently.
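The query pattern described above, aggregating an indicator along a chosen dimension while restricting the time range to recent data, can be sketched in plain Python. The row layout, the sample values, and the `mean_by_tag` helper are hypothetical; they reuse the host, port, and bytes_in names from the Network example.

```python
from collections import defaultdict

# Hypothetical rows: (timestamp in seconds, host, port, bytes_in)
rows = [
    (0,    "host-a", "80",  100.0),
    (1800, "host-a", "80",  110.0),
    (1800, "host-a", "443", 310.0),
    (3600, "host-a", "80",  120.0),
    (3600, "host-b", "80",  50.0),
    (5400, "host-b", "80",  70.0),
]


def mean_by_tag(rows, tag_index, since):
    """Average bytes_in per tag value, using only rows with timestamp >= since."""
    totals = defaultdict(lambda: [0.0, 0])  # tag value -> [sum, count]
    for row in rows:
        if row[0] >= since:
            acc = totals[row[tag_index]]
            acc[0] += row[3]
            acc[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Same recent data, analyzed along two different dimensions.
recent_by_host = mean_by_tag(rows, tag_index=1, since=1800)
recent_by_port = mean_by_tag(rows, tag_index=2, since=1800)
print(recent_by_host)  # {'host-a': 180.0, 'host-b': 60.0}
print(recent_by_port)  # {'80': 87.5, '443': 310.0}
```

The `since` cutoff stands in for the hot-data window: older rows are skipped entirely, which is why a database can keep them on cheaper storage.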

Problems existing in traditional databases in time series data scenarios

When the amount of data is small, adding a timestamp column to a table in a traditional relational database is enough to use it as a time series database. However, time series data is often generated by millions or even tens of millions of terminal devices with high write concurrency, which makes it a massive-data scenario.
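For the small-data case, the idea of a relational table with a timestamp column can be sketched using Python's built-in `sqlite3` module. SQLite stands in for any relational database here. The table name, column names, index, and sample rows are assumptions based on the Network example, not a recommended schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE network (
        ts        INTEGER NOT NULL,  -- Unix timestamp in seconds
        host      TEXT    NOT NULL,  -- tag
        port      TEXT    NOT NULL,  -- tag
        bytes_in  REAL    NOT NULL,  -- field
        bytes_out REAL    NOT NULL   -- field
    )
    """
)
# Index on the tags plus time so per-series range scans avoid a full table scan.
conn.execute("CREATE INDEX idx_network_series ON network (host, port, ts)")

rows = [
    (0,    "host-a", "80", 100.0, 40.0),
    (1800, "host-a", "80", 110.0, 45.0),
    (3600, "host-a", "80", 120.0, 50.0),
    (3600, "host-b", "80", 50.0,  20.0),
]
conn.executemany("INSERT INTO network VALUES (?, ?, ?, ?, ?)", rows)

# Average incoming traffic per host per hour (integer division buckets the time).
hourly = conn.execute(
    """
    SELECT host, ts / 3600 AS hour_bucket, AVG(bytes_in)
    FROM network
    GROUP BY host, hour_bucket
    ORDER BY host, hour_bucket
    """
).fetchall()
print(hourly)  # [('host-a', 0, 105.0), ('host-a', 1, 120.0), ('host-b', 1, 50.0)]
```

This works well at small scale. The problems described next appear once the row count and write rate grow far beyond what a single table on a single machine can handle.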

MySQL has the following problems in massive time series data scenarios:

High storage cost: time series data compresses poorly in MySQL, so storing it takes a lot of machine resources;
High maintenance cost: MySQL is a single-machine system, so data must be sharded manually across databases and tables at the application layer, which is expensive to maintain;
Low write throughput: a single machine's write throughput is low and can hardly keep up with writes from tens of millions of time series;
Poor query performance: MySQL is designed for transaction processing and performs poorly when aggregating and analyzing massive data.
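To illustrate what manual sharding at the application layer involves, here is a minimal routing sketch. It is an assumed scheme for illustration only: the number of databases, the hash-by-device rule, the month-based table names, and the function names are all hypothetical.

```python
import zlib
from datetime import datetime, timezone

NUM_DATABASES = 4  # hypothetical number of MySQL instances


def shard_for(device_id: str) -> int:
    """Pick a database by hashing the device ID (CRC32 is stable across runs)."""
    return zlib.crc32(device_id.encode("utf-8")) % NUM_DATABASES


def table_for(ts: int) -> str:
    """Pick a monthly table from a Unix timestamp in seconds (UTC)."""
    month = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y%m")
    return f"network_{month}"


def route(device_id: str, ts: int):
    """Return the (database, table) pair that a write for this point must target."""
    return f"db{shard_for(device_id)}", table_for(ts)


db_name, table_name = route("host-a", 1_648_598_400)
print(db_name, table_name)
```

Every write and every query has to go through logic like this, and a query that spans devices or months must fan out to several databases and merge the results itself. Changing `NUM_DATABASES` later also means moving existing data, which is part of the maintenance cost mentioned above.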

The Hadoop ecosystem (Hadoop, Spark, etc.) has the following problems when storing time series data:

High data latency: these are offline batch processing systems, so it takes hours or even days from data generation to analysis;
Poor query performance: indexes cannot be used effectively, queries rely on MapReduce jobs, and query times are generally measured in minutes.

A time series database needs to solve the following problems:

Writing time series data: how to support writes of tens of millions of data points per second.
Reading time series data: how to support grouping and aggregation over hundreds of millions of data points within seconds.
Cost sensitivity: massive data brings a storage cost problem, and storing this data at lower cost is a top priority for time series databases.

Time series databases were therefore created to address the shortcomings of traditional relational databases in storing and analyzing time series data.

Comparison of open source time series databases

At present, the more popular open source time series database products in the industry include InfluxDB, OpenTSDB, Prometheus, Graphite, etc. The comparison of their product features is shown in the following figure:

InfluxDB is an open source time series database written in Go. It is especially well suited to processing and analyzing time series data such as resource monitoring data, and it currently leads the time series database category.

Over time, the major cloud vendors have also launched their own time series databases. Alibaba's TSDB team launched the first version of its time series database in 2016 and has since gradually come to serve DBPaaS, Sunfire, and other businesses within the group. After a public beta in mid-2017, the product was officially commercialized at the end of March 2018. On the technical side, TSDB has continuously absorbed the strengths of other players in the time series field while developing its own self-built time series database.
