
My team is exploring options to create a robust "analytics" capability that is well-suited for our large quantities of sensor test data. I'd appreciate any suggestions for technologies that would perform well for my use case.

About my data:

  • For each test, we process binary recordings into flat files for each end-user (maybe 5 to 15 files per test, for hundreds of tests per year)
  • Each file contains time-series data for 100 to 1000 parameters
  • Parameter sample rates are anywhere from 20 samples per second to 10k sps
  • Each file contains one or multiple time cuts
  • Time cuts might be "recorder on to recorder off" (~ 2 hours long) or specific shorter events (20-60 seconds on avg.)
  • Sets of parameters share the same time array
  • Some parameters are continuously-changing (e.g. a temperature measurement), while others rarely change (e.g. a fault code)
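To give a rough sense of scale, the figures above can be turned into a back-of-the-envelope sample count. This is only a sketch: the 8 bytes per sample (uncompressed float64) is an assumption rather than something the format dictates, and `samples_in_cut` is a name made up for illustration.

```python
def samples_in_cut(rate_sps: int, duration_s: int) -> int:
    """Number of samples one parameter produces in one time cut."""
    return rate_sps * duration_s


TWO_HOURS_S = 2 * 60 * 60   # a "recorder on to recorder off" cut
BYTES_PER_SAMPLE = 8        # assumption: uncompressed float64 values

# Slowest and fastest sample rates mentioned in the list above.
for rate in (20, 10_000):
    n = samples_in_cut(rate, TWO_HOURS_S)
    raw_mb = n * BYTES_PER_SAMPLE / 1e6
    print(f"{rate} sps: {n:,} samples per parameter, about {raw_mb:.1f} MB raw")
```

Multiplying by the 100 to 1000 parameters per file shows why compression and shared time arrays matter so much here.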

Currently, the flat file format we use serves us very well in terms of compression, performance, and the quality of the data it provides to our end users. The format uses run-length encoding (RLE) to compress repeating values, and each time index array is stored once and shared by the parameters that use it rather than duplicated. The files are HDF5 with a defined internal structure.
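For readers unfamiliar with RLE, here is a minimal sketch of the idea applied to a rarely-changing parameter such as a fault code. It is illustrative only: the function names and the sample values are hypothetical, and it does not reproduce the actual HDF5 layout.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse runs of equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the full sample list."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out


# Hypothetical fault-code channel: mostly 0, with a brief fault 17.
fault_code = [0] * 5000 + [17] * 20 + [0] * 4980

encoded = rle_encode(fault_code)
print(len(fault_code), "samples stored as", len(encoded), "runs")

# The encoding is lossless: decoding restores the original samples.
assert rle_decode(encoded) == fault_code
```

A continuously-changing parameter such as a temperature gains little from this, since nearly every run has length 1, which is why RLE pays off mainly for the rarely-changing channels.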

What technologies would work well to open this data up to "data analytics"? I'm hoping to maximize efficiency, performance, data compression, and data mining capabilities.

We've experimented with InfluxDB, which has enabled great data mining capability out-of-the-box, but I/O seems pretty slow, and compression does not seem to be very effective (compared to the flat file format).

Thanks in advance for any leads!

1 Answer


For your large-scale sensor test data, I would suggest looking at InfluxDB 3.0. It has an entirely new engine built for high performance. According to InfluxData's published benchmarks, it can ingest billions of series while using fewer CPUs and less RAM, at a fraction of the storage cost, and it is optimized for sub-second query responses even on large, live datasets.

The engine leverages modern open source technologies like Apache Arrow for efficient data handling and Apache DataFusion for fast query execution, making it well-suited for your needs in data mining, analytics, and handling high-frequency, high-volume time-series data.

Reference: https://www.influxdata.com/benchmarks/

Suyash