
Storage Model
If you're interested in the internal data storage model of the system, then this article is meant for you. From a user perspective, knowledge of this model isn't necessary, but it can help you understand certain aspects of the system and its behavior. Unlike many other systems, evitaDB uses its own data storage model built on the key-value store principle with variable value length. At the same time, data storage is strictly append-only, meaning once written, the data never changes.
Basic File Types and Their Relationships
evitaDB stores data in files on disk in a data directory specified in the configuration. The top level of this directory contains subdirectories for individual catalogs. Each catalog directory holds all the files required for working with that catalog (no external information outside of its directory is needed). The directory always contains:
- Bootstrap file – a file named after the catalog, with the .boot extension, containing critical pointers to the other files
- Write-ahead log (WAL) – files named {catalogName}_{index}.wal, where index is an ascending number starting from zero; each file contains a sequence of catalog changes over time
- Catalog data file – a file named {catalogName}_{index}.catalog, where index is an ascending number starting from zero; it contains data tied to the catalog, such as the catalog schema and global indexes
- Entity collection data files – files named {entityTypeName}_{index}.colection, where index is an ascending number starting from zero; these files contain all data associated with the given entity collection: its schema, indexes, and entity data
```mermaid
flowchart TD
    A["Bootstrap File\n(catalog.boot)"] -->|"Pointer to WAL"| B["WAL file\n(catalog_{index}.wal)"]
    A -->|"Pointer to offset index\nin the catalog"| C["Catalog data file\n(catalog_{index}.catalog)"]
    subgraph S["Entity collection data files"]
        D1["Entity collection data file\n(entity1_{index}.colection)"]
        D2["Entity collection data file\n(entity2_{index}.colection)"]
        Dn["Entity collection data file\n(entityN_{index}.colection)"]
    end
    C -->|"Catalog header\n(contains pointer to offset index)"| D1
    C -->|"Catalog header\n(contains pointer to offset index)"| D2
    C -->|"Catalog header\n(contains pointer to offset index)"| Dn
```
The content of each file type is described in more detail in the following sections.
Record Structure in the Storage
Information | Data type | Length in bytes |
---|---|---|
Record length in Bytes | int32 | 4B |
Control Byte | byte | 1B |
Generation Id | int64 | 8B |
Payload | byte[] | * |
Checksum – CRC32C | int64 | 8B |
Below is an explanation of the individual items:
- Record length in Bytes
- The length of the record in bytes. This is compared against the Record pointer: length value stored in the offset index and must match; otherwise, data integrity has been compromised.
- Control Byte
- This byte contains flags with key information about the nature of the record. The flags represent individual bits in this byte:
Bit no. | Meaning |
---|---|
#1 | the last record in a series of records |
#2 | a continuous record, with the payload continuing in the immediately following record |
#3 | a computed checksum is available for the record |
#4 | the record is compressed |
- Generation Id
- A generation number assigned to each record. This number is not actively used but can be utilized for possible data reconstruction. It typically matches the version of the offset index that points to this record.
- Payload
- The actual record data. This section can have variable length and contains the specific information corresponding to the record type. The payload has a maximum size limited by the size of the output buffer (see outputBufferSize).
- Checksum - CRC32C
- A checksum used to verify the integrity of data within the record. It is used to detect errors when reading data in the payload section.
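To make the layout concrete, here is a minimal Java sketch that reads a single record according to the table above. It is illustrative only: the class, the flag constants (assuming bit no. 1 is the least significant bit), and the assumption that the stored length covers the whole record including the length field itself are not taken from evitaDB's actual code.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32C;

/** Illustrative reader for the record layout described above. */
final class StorageRecordReader {

    // Flag bits of the control byte, assuming bit no. 1 is the least significant one.
    static final int LAST_RECORD      = 1;      // bit 1: last record in a series
    static final int CONTINUATION     = 1 << 1; // bit 2: payload continues in the next record
    static final int CHECKSUM_PRESENT = 1 << 2; // bit 3: a checksum is available
    static final int COMPRESSED       = 1 << 3; // bit 4: the payload is compressed

    static byte[] readRecord(ByteBuffer input) {
        // assumption: the stored length covers the whole record, including this field
        final int recordLength = input.getInt();   // 4B: record length in bytes
        final byte controlByte = input.get();      // 1B: control byte with flags
        final long generationId = input.getLong(); // 8B: generation id (not actively used)

        final boolean hasChecksum = (controlByte & CHECKSUM_PRESENT) != 0;
        final int payloadLength = recordLength - 4 - 1 - 8 - (hasChecksum ? 8 : 0);
        final byte[] payload = new byte[payloadLength];
        input.get(payload);

        if (hasChecksum) {
            final CRC32C crc = new CRC32C();
            crc.update(payload); // assuming the checksum covers the payload only
            if (crc.getValue() != input.getLong()) {
                throw new IllegalStateException("record checksum mismatch – data corrupted");
            }
        }
        return payload;
    }
}
```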
Reason for the Maximum Record Size Restriction
The maximum record size is limited by the fact that data is written to disk in a strictly append-only manner. The first piece of information in the record is its size, which is not known until the record is fully created. Practically, this means the record is formed in a memory buffer, the final size is then written to the first position of the record, and only afterward is the record written to disk.
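A minimal sketch of this write pattern might look as follows; the class name and the OUTPUT_BUFFER_SIZE constant are hypothetical, and the checksum step is omitted for brevity:

```java
import java.nio.ByteBuffer;

final class AppendOnlyWriteExample {
    // Hypothetical buffer size limit (see outputBufferSize in the configuration).
    static final int OUTPUT_BUFFER_SIZE = 4 * 1024;

    static ByteBuffer buildRecord(byte[] payload) {
        ByteBuffer buffer = ByteBuffer.allocate(OUTPUT_BUFFER_SIZE);
        buffer.putInt(0);               // placeholder – the final length is not yet known
        buffer.put((byte) 0);           // control byte
        buffer.putLong(1L);             // generation id
        buffer.put(payload);            // payload (checksum omitted for brevity)
        int recordLength = buffer.position();
        buffer.putInt(0, recordLength); // patch the final size into the first position
        buffer.flip();                  // the record can now be appended to the file
        return buffer;
    }
}
```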
Splitting the Payload into Multiple Records
There are numerous scenarios in which the amount of data in the payload exceeds the maximum allowed payload size. In such cases, the payload can be split across multiple records at serialization time, with the records placed consecutively. Each of these records is compressed individually and has its own checksum. The linkage between records is maintained by setting control bit no. 2, which tells the deserialization mechanism that it needs to load the next record.
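The following sketch illustrates how such chunking could work; the RecordSink interface and the bit-constant values are hypothetical stand-ins for the real serialization machinery:

```java
/** Hypothetical sink that appends one physical record per call. */
interface RecordSink {
    void write(byte controlByte, byte[] payload, int offset, int length);
}

final class ChunkedWriteExample {
    static final byte LAST_RECORD  = 1;      // bit 1: last record in the series
    static final byte CONTINUATION = 1 << 1; // bit 2: payload continues in the next record

    static void writeChunked(byte[] payload, int maxPayloadSize, RecordSink sink) {
        int offset = 0;
        while (offset < payload.length) {
            int chunkLength = Math.min(maxPayloadSize, payload.length - offset);
            boolean hasNext = offset + chunkLength < payload.length;
            // each physical record is compressed and checksummed on its own;
            // bit no. 2 tells the deserializer to keep reading the next record
            byte controlByte = hasNext ? CONTINUATION : LAST_RECORD;
            sink.write(controlByte, payload, offset, chunkLength);
            offset += chunkLength;
        }
    }
}
```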
Costs of Checksums and Compression
Bootstrap File
Information | Data type | Length in bytes |
---|---|---|
Storage protocol version | int32 | 4B |
Catalog version | int64 | 8B |
Catalog file index | int32 | 4B |
Timestamp | int64 | 8B |
Offset index pointer: start position | int64 | 8B |
Offset index pointer: length | int32 | 4B |
Below is an explanation of the individual items:
- Storage protocol version
- The version of the data format in which the catalog data is stored. This version changes only if there have been significant modifications in naming or structure of data files, or changes to the general structure of the records in the storage. This information allows us to detect when a running evitaDB instance expects data in a newer format than what is actually on disk. If such a situation arises, evitaDB contains a conversion mechanism to migrate data from the old format to the current one.
Currently, the data format version is 3.
- Catalog version
- The catalog version is incremented upon completion of each committed transaction that pushes the catalog to the next version. There isn’t necessarily one bootstrap record per transaction. If the system manages to process multiple transactions within a time frame, the jumps between consecutive catalog versions in the bootstrap file can be greater than 1.
If the catalog is in warm-up mode, each of the bootstrap records may have a catalog version set to 0.
- Catalog file index
- Contains the index of the catalog data file. Using this information, you can construct the file name corresponding to the catalog’s data file in the format catalogName_{index}.catalog. Multiple data files for the same catalog can coexist in the directory with different indexes, indicating the availability of the time travel feature.
- Timestamp
- A timestamp set to the time the bootstrap record was created, measured in milliseconds from 1970-01-01 00:00:00 UTC. This is used to locate the correct bootstrap record when performing time travel.
- Offset index pointer: start position
- A pointer to the first byte of the initiating record of the offset index in the catalog’s data file.
- Offset index pointer: length
- The length, in bytes, of the initiating record of the offset index in the catalog’s data file. This is crucial for properly reading the offset index from the catalog’s data file.
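Because every bootstrap record has the same fixed size (36 bytes per the table above), the record for a given catalog version or timestamp can be located cheaply, e.g., by binary search. A minimal, hypothetical Java sketch of reading one record:

```java
import java.nio.ByteBuffer;

/** Illustrative view of a single fixed-length bootstrap record (see the table above). */
record BootstrapRecord(
    int storageProtocolVersion,
    long catalogVersion,
    int catalogFileIndex,
    long timestamp,        // milliseconds since 1970-01-01 00:00:00 UTC
    long offsetIndexStart, // first byte of the initiating offset index record
    int offsetIndexLength  // length of that record in bytes
) {
    static BootstrapRecord read(ByteBuffer input) {
        return new BootstrapRecord(
            input.getInt(),  // storage protocol version
            input.getLong(), // catalog version
            input.getInt(),  // catalog file index
            input.getLong(), // timestamp
            input.getLong(), // offset index pointer: start position
            input.getInt()   // offset index pointer: length
        );
    }
}
```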
Data Files
Offset Index
Each entry in the offset index has the following structure:
Information | Data type | Length in bytes |
---|---|---|
Primary key | int64 | 8B |
Record type | byte | 1B |
Record pointer: start position | int64 | 8B |
Record pointer: length | int32 | 4B |
Each offset index fragment is always preceded by this header:
Information | Data type | Length in bytes |
---|---|---|
Effective length | int32 | 4B |
Record pointer: start position | int64 | 8B |
Record pointer: length | int32 | 4B |
Specific offset index entries have the following meaning:
- Primary key
- The primary key of the record. evitaDB typically represents primary keys as int32, but for some keys, two such identifiers are needed. In these cases, two int32 values are merged into a single int64, as shown in the sketch after this list.
- Record type
- The record type. It is used internally to distinguish the type of the record. Because the numeric mapping uses only positive numbers starting from 1, a negative value marks a “removed value.” The principle of loading and handling removed items is explained later.
- Record pointer: start position
- In the fragment header, a pointer to the first byte of the previous offset index fragment in the current data file; in an entry, a pointer to the first byte of the record itself.
- Record pointer: length
- In the fragment header, the length of the previous offset index fragment; in an entry, the length of the record.
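As noted above for the primary key, two int32 identifiers are sometimes merged into a single int64. A minimal sketch of one possible composition (the exact bit layout used by evitaDB is an assumption here):

```java
/** Illustrative merging of two int32 identifiers into a single int64 primary key. */
final class ComposedKeyExample {

    static long compose(int first, int second) {
        // assumption: the first identifier occupies the upper 32 bits, the second the lower ones
        return ((long) first << 32) | (second & 0xFFFFFFFFL);
    }

    static int first(long key)  { return (int) (key >>> 32); }
    static int second(long key) { return (int) key; }
}
```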
Reading all the available record information in the data file goes as follows:
- Load the initiating fragment of the offset index (typically the most recent one in the file).
- Read all pointer information for the records in this fragment:
- if the record type is negative, it indicates a removed record—this is noted in a hash table of removed records.
- Load the previous offset index fragment using its pointer and process it similarly:
- if a fragment entry references a record that appears in the removed records hash table, that record’s information is ignored during loading.
- Repeat this until you reach the very first offset index fragment, which no longer holds a pointer to a previous fragment.
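The following hypothetical Java sketch walks through this procedure; the Fragment/Entry types and the FragmentReader loader are illustrative stand-ins, and the absence of a predecessor is modeled as a zero-length pointer:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative reconstruction of the offset index by walking fragments backwards. */
final class OffsetIndexLoader {

    record Key(byte recordType, long primaryKey) {}
    record Entry(byte recordType, long primaryKey, long start, int length) {}
    record Fragment(Entry[] entries, long previousStart, int previousLength) {}

    /** Hypothetical loader that reads one fragment from the data file. */
    interface FragmentReader {
        Fragment read(long start, int length);
    }

    static Map<Key, Entry> load(FragmentReader reader, long initStart, int initLength) {
        Map<Key, Entry> index = new HashMap<>();
        Set<Key> removed = new HashSet<>();
        // start with the initiating (most recent) fragment referenced from the bootstrap record
        Fragment fragment = reader.read(initStart, initLength);
        while (fragment != null) {
            for (Entry entry : fragment.entries()) {
                if (entry.recordType() < 0) {
                    // a negative type marks a removed record – note it for older fragments
                    removed.add(new Key((byte) -entry.recordType(), entry.primaryKey()));
                } else {
                    Key key = new Key(entry.recordType(), entry.primaryKey());
                    // the newest fragment wins; entries of removed records are ignored
                    if (!removed.contains(key)) {
                        index.putIfAbsent(key, entry);
                    }
                }
            }
            // follow the pointer to the previous fragment until the very first one
            fragment = fragment.previousLength() == 0
                ? null
                : reader.read(fragment.previousStart(), fragment.previousLength());
        }
        return index;
    }
}
```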
Offset index fragments are usually written at the end of transaction processing (or a set of consecutive transactions, if handled within a dedicated time window). Only new/modified/removed records from that transaction set are stored in the fragment.
Data Records
Data records contain the actual data payload for each record type and are used to store schemas, entities, and all other infrastructural data structures, such as search indexes. The record itself does not indicate whether it is valid; this information is available only at the offset index level.
Write-Ahead Log (WAL)
If the end of a WAL file contains a partially written record—i.e., its size does not match the size specified in the transaction header or in a transaction mutation—the WAL file is truncated to the last valid WAL entry upon database startup.
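A minimal sketch of such a startup check, under the assumption that each WAL entry is prefixed with its int32 size (the real WAL format may differ in detail):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Illustrative startup check that truncates a torn record at the end of a WAL file. */
final class WalRecoveryExample {

    static void truncateToLastValidEntry(Path walFile) throws IOException {
        try (FileChannel channel = FileChannel.open(walFile,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            long validEnd = 0;
            ByteBuffer lengthBuffer = ByteBuffer.allocate(4);
            while (validEnd < channel.size()) {
                lengthBuffer.clear();
                if (channel.read(lengthBuffer, validEnd) < 4) break; // torn length prefix
                lengthBuffer.flip();
                int declaredLength = lengthBuffer.getInt();
                // the declared size must fit entirely within the file; otherwise the
                // trailing record was only partially written
                if (declaredLength <= 0 || validEnd + 4 + declaredLength > channel.size()) break;
                validEnd += 4 + declaredLength;
            }
            channel.truncate(validEnd); // drop the incomplete tail
        }
    }
}
```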
Data Mechanics
Data writes in evitaDB are strictly append-only, meaning once written, data is never overwritten. This has both positive and negative implications.
Deflate compression (part of the JDK) is used to compress data in the payload section. The payload size limit enforced by the output buffer remains in effect—compression only occurs once the buffer is filled or when the payload has been completely written. If the compressed payload ends up the same size or larger, the uncompressed version is used instead.
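This compress-only-if-it-pays-off decision can be sketched with the JDK's Deflater roughly as follows (illustrative, not the actual implementation):

```java
import java.util.zip.Deflater;

/** Illustrative "compress only if it pays off" decision for record payloads. */
final class PayloadCompressionExample {

    static byte[] compressIfSmaller(byte[] payload) {
        Deflater deflater = new Deflater();
        deflater.setInput(payload);
        deflater.finish();
        byte[] compressed = new byte[payload.length]; // never accept a larger result
        int compressedLength = deflater.deflate(compressed);
        // finished() is false when the output buffer filled up, i.e. the
        // compressed form would be at least as large as the original
        boolean worthIt = deflater.finished() && compressedLength < payload.length;
        deflater.end();
        if (worthIt) {
            byte[] result = new byte[compressedLength];
            System.arraycopy(compressed, 0, result, 0, compressedLength);
            return result; // the caller would set control bit no. 4 (compressed)
        }
        return payload;    // same size or larger – keep the uncompressed form
    }
}
```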
Cleaning Up the Clutter
This process takes place during transaction processing whenever these conditions are found to be met after a transaction completes. On the one hand, this prevents stale data from accumulating and the file from growing without bound. On the other hand, it means the transaction's completion may take longer than usual because it includes the compaction work.
Time Travel
This process is not heavily optimized for speed—rather, it simply takes advantage of the append-only nature of the data for historical record lookup (this feature alone does not make evitaDB a fully temporal database specialized in time-based queries). However, it does allow you to retroactively track the history of a record (or set of records), and also enables you to perform point-in-time backups of the database.
Backup and Restore
Working with the files this way also permits a naive backup method: simply copy the files in this order:
- The bootstrap file
- The catalog data files
- The entity collection data files
- The WAL files
Even while the database is running, copying the on-disk data this way captures the current state consistently. That’s because if the bootstrap file is copied first, it necessarily contains a correct pointer to fully written data at the respective location it references. The database writes the bootstrap record last—i.e., only after all data it references has been completely written. If additional data exists that nothing points to, the database won’t mind during the next startup. Moreover, if WAL files are copied last, you also capture the most recent changes. During restore, evitaDB attempts to replay them. Any partially written transaction at the end of a WAL file is automatically dropped, so even if a transaction was only half-written, it should not impede the database from starting up from these copied data files.
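A minimal, hypothetical sketch of this naive copy order (the class and method names are illustrative):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Illustrative naive backup that copies catalog files in the safe order described above. */
final class NaiveBackupExample {

    static void backupCatalog(Path catalogDir, Path targetDir) throws IOException {
        Files.createDirectories(targetDir);
        copyMatching(catalogDir, targetDir, "*.boot");      // 1. the bootstrap file first
        copyMatching(catalogDir, targetDir, "*.catalog");   // 2. the catalog data files
        copyMatching(catalogDir, targetDir, "*.colection"); // 3. the entity collection data files
        copyMatching(catalogDir, targetDir, "*.wal");       // 4. the WAL files last
    }

    private static void copyMatching(Path from, Path to, String glob) throws IOException {
        try (DirectoryStream<Path> files = Files.newDirectoryStream(from, glob)) {
            for (Path file : files) {
                Files.copy(file, to.resolve(file.getFileName()),
                        StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }
}
```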
There are two ways to handle backups:
- Full backup – a backup of all data files and WAL files, including historical ones.
- Active backup – a backup of only the currently used data files and WAL files.
A full backup employs the naive approach described earlier, but it can be quite large if there is a significant amount of historical data. On the other hand, it can be performed while the database is running since it’s purely a file-copying operation.