
Good afternoon.

At work I'm currently developing a system which takes user input (well structured) and then stores it in memory to do some processing.

The input is basically a dataset formed by a matrix of dimension $p \times q$, with $p$ rows and $q$ columns of data, where each row has the following structure:

$n$ indexes, $m$ attributes, and $k$ classes, with $n, m, k \geq 1$

The system can handle inputs from several users at the same time, and the values of $n$, $m$, $k$ can differ per dataset. Whatever those values are, a single data structure has to handle the size of the dataset and store the data in memory efficiently, with particular emphasis on reads and inserts, which are the most frequent operations.

I'm currently struggling with two things:

  1. How to describe such a structure at a high level (without considering how it will be stored in memory) using only math equations.
  2. Finding an efficient data structure that can be created before the user input arrives. That is, as soon as the text fields are shown to the user, the data structure has to be created; after the user fills in every value of the dataset and presses Enter, the data needs to be inserted (quickly) into the data structure. However, I don't know if this is entirely possible or if it's the best approach.
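
To make the second point concrete, here is a minimal Python sketch of what "created before the user input" could look like. It assumes that $p$, $n$, $m$ and $k$ are already known when the text fields are shown and that $q = n + m + k$; the names `make_empty_dataset` and `fill_row` are hypothetical, not part of any existing system.

```python
from typing import List, Optional

def make_empty_dataset(p: int, n: int, m: int, k: int) -> List[List[Optional[str]]]:
    """Allocate p rows of q empty slots, assuming q = n + m + k."""
    q = n + m + k
    return [[None] * q for _ in range(p)]

def fill_row(dataset: List[List[Optional[str]]], t: int, values: List[str]) -> None:
    """Copy one submitted row into position t (0-based), checking its width."""
    if len(values) != len(dataset[t]):
        raise ValueError("row width does not match n + m + k")
    dataset[t][:] = values

# Created when the form is shown (hypothetical sizes).
dataset = make_empty_dataset(p=2, n=1, m=2, k=1)

# Filled when the user presses Enter (hypothetical values).
fill_row(dataset, 0, ["id_a", "x1", "y1", "c1"])
```

Since the slots already exist, each insert only overwrites one row in place and the outer list never grows. Whether this is fast enough for large datasets is exactly what I'm unsure about.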

I'm open to suggestions. Currently what I have thought of is:

  1. The system takes a dataset $d_{pq}^{nmk}$ as a matrix formed by $p$ rows and $q$ columns.

We define an index $id_{i}, i = 1\ldots n$ as follows:

$id_{i} = [a-zA-Z\_][a-zA-Z0-9\_]*$

In an analogous way, it's possible to define an attribute $a_{j}, j = 1\ldots m$ and a class $c_{l}, l = 1\ldots k$:

$a_{j} = [a-zA-Z\_][a-zA-Z0-9\_]*$
$c_{l} = [a-zA-Z\_][a-zA-Z0-9\_]*$
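
As a quick illustration, the pattern above can be checked directly with a regular expression. This is only a sketch using Python's standard `re` module; the helper name `is_valid_name` is hypothetical.

```python
import re

# Pattern copied from the definitions above: a letter or underscore,
# followed by any number of letters, digits, or underscores.
IDENTIFIER = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")

def is_valid_name(text: str) -> bool:
    """Return True only if the whole string matches the pattern."""
    return IDENTIFIER.fullmatch(text) is not None

print(is_valid_name("sample_id"))  # starts with a letter
print(is_valid_name("2nd_attr"))   # starts with a digit
```

The same check would apply to indexes, attributes, and classes, since all three share one pattern.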

Where each row of a dataset $d_{pq}^{nmk}$ is the union of exactly $n$ indexes, $m$ attributes and $k$ classes, as follows:

$ID = \bigcup_{i = 1}^{n} id_{i} $

$AT = \bigcup_{j = 1}^{m} a_{j} $

$CL = \bigcup_{l = 1}^{k} c_{l} $

Therefore, the $t$-th row of a dataset $d_{pq}^{nmk}$ can be expressed as:

$r_{t} = ID_{t} \cup AT_{t} \cup CL_{t}$

And thus, a dataset $d_{pq}^{nmk}$ can be expressed in a simplified way as:

$d_{pq}^{nmk} = \{r_{t}\}, t = 1\ldots p$

However, I'm not sure this definition is good enough as a formal expression of the dataset. Also, I feel it doesn't help in finding a good data structure to handle the dataset (other than a list of lists, of course).

  2. I was thinking the simplest approach would be to define a record with two fields: a label that indicates whether the value represents an index, an attribute or a class, and the value itself. The data structure could then be a list of lists, where each row is a list of such records and the dataset is a list of rows. However, if the dataset is big enough, I think this approach would be really slow, but I have not been able to find a better data structure.
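
A minimal Python sketch of this list-of-lists idea, to make it concrete. The names `Kind`, `Cell`, `make_row` and `values_of` are hypothetical, and the sample values are made up for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List

class Kind(Enum):
    """Label telling what a value represents."""
    INDEX = "index"
    ATTRIBUTE = "attribute"
    CLASS = "class"

@dataclass(frozen=True)
class Cell:
    """The two-field record: a label plus the value itself."""
    kind: Kind
    value: str

Row = List[Cell]

def make_row(ids: List[str], attrs: List[str], classes: List[str]) -> Row:
    """Build one row as n indexes, then m attributes, then k classes."""
    return ([Cell(Kind.INDEX, v) for v in ids]
            + [Cell(Kind.ATTRIBUTE, v) for v in attrs]
            + [Cell(Kind.CLASS, v) for v in classes])

def values_of(row: Row, kind: Kind) -> List[str]:
    """Read back every value in a row that carries the given label."""
    return [cell.value for cell in row if cell.kind is kind]

# The dataset is a list of rows; each row is a list of labelled cells.
dataset: List[Row] = []
dataset.append(make_row(["id_a"], ["x1", "y1"], ["c1"]))
dataset.append(make_row(["id_b"], ["x2", "y2"], ["c2"]))

print(values_of(dataset[0], Kind.ATTRIBUTE))
```

In this form every cell repeats its label, and reading all values of one kind scans the whole row, which is the overhead I'm worried about for large datasets.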

Any kind of help or resource for further research is welcomed.
