
Transactions
Transactions are a fundamental part of the database system. They ensure that the database remains in a consistent state even in the event of failures or concurrent access by multiple users. In this article, we will explore the concept of transactions, their properties, and how they are implemented in evitaDB.
The key concept of a transaction is that data written by a transaction is only visible within that transaction, and not in other simultaneous reader sessions/transactions. Once a transaction has been committed, its changes become visible to any readers opening a new transaction afterwards. In the terminology of multi-version concurrency control, this is called 'snapshot isolation': each client behaves as if it has a full copy of the database at the time the transaction starts. If the client does not open the transaction explicitly, it is opened implicitly for each query or update statement and closed automatically when the statement finishes.

This level of isolation prevents write-write conflicts: if two parallel transactions attempt to update the same record, one of them will be rolled back. However, it does not prevent read-write conflicts, also known as write skew. For example, if there are two transactions A and B, transaction A reads record X, adds 1 to the read value and stores the result to record Y, and transaction B reads record Y, subtracts 1 from the value and stores the result to record X, both transactions will be successfully committed without a conflict. We don't recommend using evitaDB for banking systems, but for most other use cases, this level of isolation is sufficient.
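To make the write-skew example concrete, here is a minimal, self-contained sketch that simulates both transactions against the same snapshot. The maps and the write-set conflict check are purely illustrative and are not part of the evitaDB API.

```java
import java.util.HashMap;
import java.util.Map;

public class WriteSkewDemo {

    public static void main(String[] args) {
        // committed state visible to both transactions when they start
        Map<String, Integer> snapshot = new HashMap<>(Map.of("X", 10, "Y", 20));

        // each transaction keeps its own private write set (snapshot isolation)
        Map<String, Integer> writesOfA = new HashMap<>();
        Map<String, Integer> writesOfB = new HashMap<>();

        // transaction A: reads X, writes X + 1 to Y
        writesOfA.put("Y", snapshot.get("X") + 1);

        // transaction B: reads Y, writes Y - 1 to X
        writesOfB.put("X", snapshot.get("Y") - 1);

        // write-write conflict detection: both commits succeed because the
        // transactions wrote to *different* records, even though each of them
        // read a record the other one modified (write skew)
        boolean conflict = writesOfA.keySet().stream().anyMatch(writesOfB::containsKey);
        System.out.println("conflict detected: " + conflict); // false

        // applying both write sets yields X=19, Y=11, a result that no serial
        // execution of A and B could produce
        Map<String, Integer> committed = new HashMap<>(snapshot);
        committed.putAll(writesOfA);
        committed.putAll(writesOfB);
        System.out.println(committed);
    }
}
```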
Lifecycle of a transaction
First, let's fast forward to the point at which the transaction is committed. Then, we will explain how we isolate the changes made by non-committed parallel transactions. A transaction commit is a multi-step process that ensures all changes made during the transaction are safely written to the database and made visible to other readers once fully integrated into all indexes. The commit process consists of the following steps:
- parallel transaction conflict resolution
- persistence of the transaction to the write-ahead log (WAL)
- processing contents of the WAL and building a new "version" of the database
- incorporating changes to the indexes and writing record payloads to the data storage files
- persisting changes in indexes to the data storage files
- compaction of the data storage files if necessary (described in more detail here)
- swapping the new version of the database with the current one
- propagating the changes to the read nodes in a cluster setup
In the following sections, we will describe each of these steps in detail.
1. Conflict resolution
The conflict resolver needs to examine all changes made between the start and end catalog versions of the transaction. All mutations in the transaction produce a so-called conflict key, which is compared with the conflict keys generated by previously committed transactions. If there is a conflict, an exception is raised and the transaction is rolled back. Conflict resolution can be configured with one of the following policies:
- deny changes to the same entity
- deny changes to the same entity part, namely:
  - single entity attribute
  - single associated data
  - single price
  - single reference
  - entity header (for direct entity properties, such as hierarchy placement, price inner record handling strategy and so on)
- allow conflicting changes (last writer wins) - this policy will never raise a conflict exception
The default conflict policy can be set for the entire database engine in its configuration, but this can be overridden for each catalog schema and even for each entity type or a sub-schema definition (attribute, associated data, etc.). We believe that this approach provides the necessary versatility and fine-grained control. There are also special types of safe mutation that can help minimize conflict, such as the "delta" attribute mutation, which can be applied safely in parallel.
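To illustrate the general idea of a conflict key under the "deny changes to the same entity part" policy, consider the following simplified sketch. The record shape and key format are assumptions made for illustration and do not mirror evitaDB's internal classes.

```java
import java.util.Set;

public class ConflictKeyDemo {

    /** Simplified mutation descriptor (illustrative only). */
    record Mutation(String entityType, int primaryKey, String part, String partKey) {

        /** Conflict key under the "deny changes to the same entity part" policy. */
        String conflictKey() {
            return entityType + ":" + primaryKey + ":" + part + ":" + partKey;
        }
    }

    public static void main(String[] args) {
        // keys produced by a previously committed, parallel transaction
        Set<String> committedKeys = Set.of(
            new Mutation("Product", 42, "attribute", "name").conflictKey(),
            new Mutation("Product", 42, "price", "basic/CZK").conflictKey()
        );

        // the transaction that is being committed now
        Mutation incoming = new Mutation("Product", 42, "attribute", "name");

        if (committedKeys.contains(incoming.conflictKey())) {
            // in evitaDB an exception would be raised and the transaction rolled back
            System.out.println("conflict on " + incoming.conflictKey());
        }
    }
}
```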
2. Write-ahead log persistence
In fact, there are two types of WAL file. The first type, an "isolated" WAL, is created for each transaction in a separate file. This enables transactions to be written to their respective isolated WALs simultaneously (in parallel, without blocking each other). The second type is the "global" WAL file, to which the isolated transaction WAL files are copied in a single low-level operation. This copy operation is executed sequentially in the "WAL persistence" step as each transaction is processed. Isolated WAL files are removed immediately after their contents are copied to the global WAL file.
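The copy into the global WAL can be pictured as a single file-channel transfer, as in the following simplified sketch. The method, file layout and flushing strategy are illustrative assumptions, not the actual evitaDB implementation.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class WalAppendDemo {

    /**
     * Appends the content of an isolated transaction WAL to the shared global WAL
     * and removes the isolated file afterwards. Called sequentially, one
     * transaction at a time.
     */
    static void appendToGlobalWal(Path isolatedWal, Path globalWal) throws IOException {
        try (FileChannel source = FileChannel.open(isolatedWal, StandardOpenOption.READ);
             FileChannel target = FileChannel.open(globalWal,
                 StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            long position = 0;
            long size = source.size();
            // transferTo may copy fewer bytes than requested, so loop until done
            while (position < size) {
                position += source.transferTo(position, size - position, target);
            }
            target.force(true); // flush to disk before the commit is acknowledged
        }
        Files.delete(isolatedWal); // the isolated WAL is no longer needed
    }
}
```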
WAL files are a crucial part of the database and are used for the following:
- applying a missed committed transaction to the indexes on the master node (recovery after a crash)
- change replication across multiple read nodes
- point-in-time backup and restore (PITR)
- change data capture (CDC) - streaming changes to external systems
- auditing in evitaLab
Up to this point, if an exception occurs, the transaction is effectively rolled back, but the world keeps spinning.
3. Processing contents of the WAL
In this phase, we read the unprocessed contents of the WAL file and apply changes to the indexes of the current catalog version and create payload records in shared data files. These changes are isolated and still invisible to other readers. This phase operates in a "time-window" manner, meaning that it tries to replay as many transactions as possible within a dedicated time window. When the time limit is reached or the entire contents of the WAL file have been processed, the transaction processor will create a new instance of the catalog data structure with a particular version (i.e. the last transaction processed will be assigned the value of the variable catalogVersion). This new instance is then passed to the next phase.
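The time-window behaviour can be sketched roughly as follows; the WAL iterator, the CatalogState interface and the method names are hypothetical stand-ins for the real engine components.

```java
import java.time.Duration;
import java.util.Iterator;

public class WalReplayDemo {

    interface Transaction { long catalogVersion(); void applyTo(CatalogState state); }
    interface CatalogState { CatalogState withVersion(long version); }

    /**
     * Replays as many committed transactions as fit into the time window and
     * returns a new catalog version built on top of the current one.
     */
    static CatalogState replay(Iterator<Transaction> unprocessedWal,
                               CatalogState currentCatalog,
                               Duration timeWindow) {
        long deadline = System.nanoTime() + timeWindow.toNanos();
        CatalogState workInProgress = currentCatalog;
        long lastProcessedVersion = -1;

        while (unprocessedWal.hasNext() && System.nanoTime() < deadline) {
            Transaction tx = unprocessedWal.next();
            tx.applyTo(workInProgress);                 // changes are still invisible to readers
            lastProcessedVersion = tx.catalogVersion(); // version of the last replayed transaction
        }

        return lastProcessedVersion == -1
            ? currentCatalog                                    // nothing to replay
            : workInProgress.withVersion(lastProcessedVersion); // handed to the propagation phase
    }
}
```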
If the engine fails at this stage, all transaction processing will be halted. The engine cannot skip the committed transaction persisted in the WAL file, nor can it progress to the next one. It will try indefinitely to replay the last transaction from the WAL file. This ultimately indicates an error in the database engine that needs to be analyzed and corrected. However, the mechanism does not trap the thread in an infinite loop; it always attempts to process the problematic transaction again once a new transaction has been committed.
The storage model of evitaDB is strictly append-only. Once a record has been written to the data file, it cannot be changed or deleted. This fundamental design decision allows us to avoid the complexity of synchronization between writer and reader threads. Records in the file are located using pointers stored in the indexes. If there is no pointer to a record in the index, the record is considered garbage and is automatically removed during the next compaction.
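The pointer-based lookup can be illustrated with a toy append-only store. The FileLocation record and the StringBuilder standing in for the data file are illustrative simplifications, not the actual storage format.

```java
import java.util.HashMap;
import java.util.Map;

public class AppendOnlyStoreDemo {

    /** Pointer to a record payload inside an append-only data file. */
    record FileLocation(long startPosition, int length) { }

    private final StringBuilder dataFile = new StringBuilder(); // stands in for the real data file
    private final Map<Integer, FileLocation> index = new HashMap<>();

    /** Appends a new version of the record; the previous version is never touched. */
    void store(int primaryKey, String payload) {
        FileLocation location = new FileLocation(dataFile.length(), payload.length());
        dataFile.append(payload);          // append only; existing bytes are never modified
        index.put(primaryKey, location);   // the index now points to the newest version
        // the previous version, if any, has no pointer left and becomes garbage
        // that the next compaction will drop
    }

    String read(int primaryKey) {
        FileLocation location = index.get(primaryKey);
        return dataFile.substring((int) location.startPosition(),
            (int) location.startPosition() + location.length());
    }
}
```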
4. New catalog version propagation
This phase is very quick and simply involves replacing the root pointer of the "current" catalog instance with the newly created one. Once complete, the newly created sessions/transactions will see the changes made in that particular transaction. The old catalog instance is not removed immediately but kept in memory until all sessions using this version of the catalog are closed (as required by the snapshot isolation level). Once there are no more sessions using the old catalog instance, it is removed from memory, after which the Java garbage collector will take care of it.
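Conceptually, the swap boils down to replacing an atomic root reference, as in the following sketch; the Catalog record and the method names are illustrative, not the actual evitaDB classes.

```java
import java.util.concurrent.atomic.AtomicReference;

public class CatalogSwapDemo {

    record Catalog(long version) { }

    // the single root pointer readers consult when a new session is opened
    private final AtomicReference<Catalog> currentCatalog =
        new AtomicReference<>(new Catalog(1));

    /** New sessions opened after this call will see the new catalog version. */
    void publish(Catalog newVersion) {
        Catalog previous = currentCatalog.getAndSet(newVersion);
        // `previous` is not discarded here; sessions that opened it are still
        // reading from it and it is released only after the last such session closes
    }

    /** Every new session pins whatever version is current at the moment it opens. */
    Catalog openSession() {
        return currentCatalog.get();
    }
}
```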
Horizontal scaling
Parallel transaction isolation in detail
Two areas relating to transactions need to be addressed:
- record payload data
- indexes referring to that data
Isolation of non-committed record payload data
Non-committed transactions write their record payload data to temporary memory, which is cleared when the transaction is finished (either through a commit or a rollback). Long transactions with large payloads can currently affect the health of the database engine, so we recommend avoiding such transactions.
Isolation of changes in the in-memory indexes
Software transactional memory
Atomic transactions
If the transaction is rolled back, we simply discard the entire memory block of the diff layer from the ThreadLocal variable and allow the Java garbage collector to take over.
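The general pattern (not the actual evitaDB classes) can be sketched like this: each transaction writes into a diff layer held in a ThreadLocal, and rollback merely drops that layer.

```java
import java.util.HashMap;
import java.util.Map;

public class DiffLayerDemo {

    /** Diff layer capturing the uncommitted changes of a single transaction. */
    static final class DiffLayer {
        final Map<String, String> changes = new HashMap<>();
    }

    // each writer thread/transaction gets its own private diff layer
    private static final ThreadLocal<DiffLayer> DIFF = ThreadLocal.withInitial(DiffLayer::new);

    private final Map<String, String> sharedState = new HashMap<>(); // committed data

    void put(String key, String value) {
        DIFF.get().changes.put(key, value); // writes never touch the shared state directly
    }

    String get(String key) {
        // the writer sees its own uncommitted changes layered over the committed state
        return DIFF.get().changes.getOrDefault(key, sharedState.get(key));
    }

    void commit() {
        sharedState.putAll(DIFF.get().changes); // materialize the diff into the new state
        DIFF.remove();
    }

    void rollback() {
        // discarding the diff layer is all that is needed; the garbage collector
        // reclaims the memory block
        DIFF.remove();
    }
}
```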
Preventing update loss
The key problem with the described approach is that updates can easily be lost if the diff layer is not applied to the original data structure and included in the new catalog instantiation process.
Such issues occur during development, so there must be a way to identify and solve these problems. This is actually very tricky, since there may be thousands of diff layers and assigning a specific layer to its creator/owner is challenging. Therefore, we assign a unique transactional object version ID when the diff layer is created and include it in the exception thrown for non-collected diff layers. When we replicate the problematic transaction in a test, the diff layer gets the same version ID repeatedly and we can track the exact moment and place of the layer's creation by placing a conditional breakpoint at the version ID generator class.
Testing
The testing sequence is always similar.
1. at the start of the test, both the tested instance and the test double instance are created
2. both are filled with the same initial randomized data
3. in an iteration with a randomized number of repetitions:
   - a random operation is selected and executed with a randomized value on both the tested instance and the test double (an example of such an operation might be "insert value X" or "remove value Y")
   - a test then checks that the changes are immediately visible on the tested instance
4. the transaction is committed
5. after the commit, the contents of both data structures are compared and must be equal
6. new instances of the data structures are created with initial data taken from the result of the commit
7. steps 3-5 are repeated infinitely
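A condensed, hypothetical sketch of such a test loop is shown below. In the real tests the tested instance is the transactional data structure under development and the test double is a trusted reference implementation; here plain maps stand in for both so that the skeleton of the loop stays visible.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Random;

public class RandomizedStmTest {

    public static void main(String[] args) {
        Random random = new Random(42);                     // fixed seed => reproducible run
        Map<Integer, String> testDouble = new HashMap<>();  // trusted reference implementation
        Map<Integer, String> tested = new HashMap<>();      // stand-in for the transactional structure

        while (true) {
            int operations = 1 + random.nextInt(10);
            for (int i = 0; i < operations; i++) {
                int key = random.nextInt(100);
                if (random.nextBoolean()) {
                    String value = "value-" + random.nextInt(1000);
                    tested.put(key, value);                 // "insert value X"
                    testDouble.put(key, value);
                } else {
                    tested.remove(key);                     // "remove value Y"
                    testDouble.remove(key);
                }
                // changes must be immediately visible on the tested instance
                if (!Objects.equals(tested.get(key), testDouble.get(key))) {
                    throw new AssertionError("visible state diverged at key " + key);
                }
            }
            // here the real test commits the transaction, compares the committed
            // result of the tested structure with the test double, and continues
            // with fresh instances seeded from the commit result
            if (!tested.equals(testDouble)) {
                throw new AssertionError("committed state diverged");
            }
        }
    }
}
```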
Transactional data structures
Transactional array (ordered)
- TransactionalIntArray: for storing primitive int numbers
- TransactionalObjectArray: for storing plain objects of any type
- TransactionalComplexObjectArray: for storing nested object structures that allow merging and the automatic removal of "empty" containers.
All arrays are naturally ordered by default; in the case of the object implementations, the objects must be comparable. The array implementations do not allow duplicate values. Therefore, for any insertion or removal, the array itself determines internally which indexes are affected; it is not possible to set a value at an index supplied by outside logic.
All these implementations share the same idea: in transactional mode, all updates go to a transactional overlay that traps:
- inserts at particular indexes, recorded in an internal array of inserted values
- deletions at particular indexes, recorded in an internal array of removed indexes
Using this information, the STM array can build a new array combining the original values with all the changes. To avoid creating a new array (and memory allocations) for each operation, there are optimized methods that operate directly on the diff:
- indexOf
- contains
- length
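A much-simplified sketch of such an overlay is shown below. Unlike the real TransactionalIntArray it records inserted and removed values rather than indexes, but it shows how length and contains can be answered from the diff without materializing the merged array.

```java
import java.util.Arrays;
import java.util.TreeSet;
import java.util.stream.IntStream;

public class IntArrayOverlayDemo {

    private final int[] original;              // committed, immutable array (sorted, distinct)
    private final TreeSet<Integer> inserted = new TreeSet<>();
    private final TreeSet<Integer> removed = new TreeSet<>();

    IntArrayOverlayDemo(int[] original) {
        this.original = original;
    }

    void add(int value) {
        if (!containsOriginal(value)) {
            inserted.add(value);
        }
        removed.remove(value);
    }

    void remove(int value) {
        if (containsOriginal(value)) {
            removed.add(value);
        }
        inserted.remove(value);
    }

    /** Answered from the diff without building the merged array. */
    int length() {
        return original.length + inserted.size() - removed.size();
    }

    /** Answered from the diff without building the merged array. */
    boolean contains(int value) {
        return inserted.contains(value)
            || (containsOriginal(value) && !removed.contains(value));
    }

    /** Builds the combined array from the original values and the diff. */
    int[] materialize() {
        return IntStream.concat(
                Arrays.stream(original).filter(v -> !removed.contains(v)),
                inserted.stream().mapToInt(Integer::intValue))
            .sorted()
            .toArray();
    }

    private boolean containsOriginal(int value) {
        return Arrays.binarySearch(original, value) >= 0;
    }
}
```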
The TransactionalComplexObjectArray additionally requires the following functions to manage its nested data:
- BiConsumer producer: this function takes two containers and combines them into one output container containing an aggregate of their nested data
- BiConsumer reducer: this function takes two containers and removes/subtracts the nested data of the second container from that of the first
- Predicate obsoleteChecker: this function tests whether the container contains any nested data. If not, the container may be considered removed; the predicate is consulted after the reduce operation
This implementation provides the ability to partially update the objects held within it. For example, consider a record with the following structure:
- String: label
- int[]: recordIds
If we then insert two such records into the TransactionalComplexObjectArray with the following data:
- label = "a", recordIds = [1, 2]
- label = "a", recordIds = [3, 4]

the producer function merges them into a single record with label = "a" and recordIds = [1, 2, 3, 4].
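A hypothetical sketch of the three functions for this record shape might look as follows; the class and constant names are illustrative only.

```java
import java.util.List;
import java.util.TreeSet;
import java.util.function.BiConsumer;
import java.util.function.Predicate;

public class ComplexObjectArrayDemo {

    /** Container with a label and nested record ids (mutable for the sake of the demo). */
    static final class LabeledRecords implements Comparable<LabeledRecords> {
        final String label;
        final TreeSet<Integer> recordIds = new TreeSet<>();

        LabeledRecords(String label, Integer... ids) {
            this.label = label;
            this.recordIds.addAll(List.of(ids));
        }

        @Override
        public int compareTo(LabeledRecords other) {
            return label.compareTo(other.label); // containers with the same label are "equal"
        }
    }

    // producer: combine the nested data of the second container into the first
    static final BiConsumer<LabeledRecords, LabeledRecords> PRODUCER =
        (target, source) -> target.recordIds.addAll(source.recordIds);

    // reducer: subtract the nested data of the second container from the first
    static final BiConsumer<LabeledRecords, LabeledRecords> REDUCER =
        (target, source) -> target.recordIds.removeAll(source.recordIds);

    // obsoleteChecker: a container with no nested data left may be dropped entirely
    static final Predicate<LabeledRecords> OBSOLETE_CHECKER =
        container -> container.recordIds.isEmpty();

    public static void main(String[] args) {
        LabeledRecords first = new LabeledRecords("a", 1, 2);
        LabeledRecords second = new LabeledRecords("a", 3, 4);

        PRODUCER.accept(first, second);          // merge => label "a", recordIds [1, 2, 3, 4]
        System.out.println(first.recordIds);

        REDUCER.accept(first, new LabeledRecords("a", 1, 2, 3, 4));
        System.out.println(OBSOLETE_CHECKER.test(first)); // true => container can be removed
    }
}
```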
Unfortunately, with this implementation, we cannot provide optimized methods such as:
- indexOf
- length
We have to compute the entire merged array first in order to access these properties. While this data structure could be subject to significant optimizations, it is also quite challenging to implement correctly due to the nature of nested structures.
Transactional unordered array
This array has a special, fast implementation that works on the diff layer for the following methods:
- indexOf
- contains
- length
Transactional list
The diff layer contains a sorted set of indexes that were removed from the original list, as well as a map of new values and the indexes at which they were inserted. When a new item is inserted or removed from the diff layer, all the indexes after this value need to be incremented or decremented. Therefore, the "add/remove first" operation always has O(N) complexity. Conversely, the unordered array splits inserts into multiple segments, so the complexity is usually lower — O(N) is only the worst-case complexity for an unordered array.
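A minimal sketch of such a diff layer follows, assuming removal indexes are kept in the original list's index space and insertion indexes in the resulting list's index space; the real bookkeeping in evitaDB may differ.

```java
import java.util.TreeMap;
import java.util.TreeSet;

public class ListDiffLayerDemo {

    // indexes of the ORIGINAL list whose values were removed in this transaction
    private final TreeSet<Integer> removedIndexes = new TreeSet<>();
    // values added in this transaction, keyed by their index in the RESULTING list
    private final TreeMap<Integer, String> insertedValues = new TreeMap<>();

    /** Inserting at the head shifts every recorded insertion index by one, hence the O(N) cost. */
    void addFirst(String value) {
        TreeMap<Integer, String> shifted = new TreeMap<>();
        insertedValues.forEach((index, v) -> shifted.put(index + 1, v));
        insertedValues.clear();
        insertedValues.putAll(shifted);
        insertedValues.put(0, value);
    }
}
```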
Transactional map
When logic attempts to retrieve a value from the map, the diff layer is consulted first to determine whether the key has been added to the layer, or whether it exists in the original map without a removal order recorded in the layer. The iterator of entries/keys/values first iterates over all existing, non-removed entries, and then over the entries added to the diff layer.
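A simplified sketch of this lookup order, with illustrative field names that do not mirror the real TransactionalMap internals:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class MapDiffLayerDemo<K, V> {

    private final Map<K, V> original;                          // committed, shared map
    private final Map<K, V> addedOrUpdated = new HashMap<>();  // diff layer: new/changed entries
    private final Set<K> removed = new HashSet<>();            // diff layer: removal orders

    MapDiffLayerDemo(Map<K, V> original) {
        this.original = original;
    }

    V get(K key) {
        // 1. values written in this transaction win
        if (addedOrUpdated.containsKey(key)) {
            return addedOrUpdated.get(key);
        }
        // 2. a key with a removal order in the layer is invisible
        if (removed.contains(key)) {
            return null;
        }
        // 3. otherwise fall back to the committed map
        return original.get(key);
    }

    void put(K key, V value) {
        removed.remove(key);
        addedOrUpdated.put(key, value);
    }

    void remove(K key) {
        addedOrUpdated.remove(key);
        if (original.containsKey(key)) {
            removed.add(key);
        }
    }
}
```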
Transactional set
Transactional bitmap
B+ tree
This is a standard B+ tree implementation with a transactional diff layer. The implementation creates a diff layer for each modified leaf segment of the tree, as well as for the parent chain to the root, with new instances that refer to the underlying diff layers (and the original, non-modified leaves). The new B+ tree is materialized at the moment of commit. The entire process resembles copy-on-write data structures, but the copied blocks are quite small compared to the entire tree. The B+ tree is used for all indexes that require a sorted order of keys.
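A drastically simplified sketch of the path-copying idea, limited to a two-level tree and ignoring routing keys, node splits and merges; the classes are illustrative, not the actual B+ tree implementation.

```java
import java.util.ArrayList;
import java.util.List;

public class PathCopyDemo {

    interface Node { }

    /** Immutable leaf holding sorted keys; a modification produces a new instance. */
    record Leaf(List<Integer> keys) implements Node {
        Leaf withKey(int key) {
            List<Integer> copy = new ArrayList<>(keys);
            copy.add(key);
            copy.sort(Integer::compareTo);
            return new Leaf(List.copyOf(copy));
        }
    }

    /** Immutable internal node; only the child on the modified path is replaced. */
    record Internal(List<Node> children) implements Node {
        Internal withChild(int index, Node newChild) {
            List<Node> copy = new ArrayList<>(children);
            copy.set(index, newChild);              // siblings stay shared with the old tree
            return new Internal(List.copyOf(copy));
        }
    }

    /** Inserts a key by copying only the leaf and its parent chain up to the root. */
    static Internal insert(Internal root, int childIndex, int key) {
        Leaf oldLeaf = (Leaf) root.children().get(childIndex);
        Leaf newLeaf = oldLeaf.withKey(key);        // the diff materialized as a new leaf
        return root.withChild(childIndex, newLeaf); // new root referencing the new leaf
    }
}
```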
Sequences
The sequences managed by evitaDB are:
- entity type sequence: assigns a new ID to each entity collection, allowing the collection to be addressed by a small number instead of duplicating and increasing the size of the string value
- entity primary key sequence: assigns a new ID to each created entity, in case the entity schema requires automatic primary key assignment
- index primary key sequence: assigns a new ID to each created internal index
- transaction ID sequence: assigns a new ID to each opened transaction
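A minimal sketch of the idea behind these sequences, using atomic counters; how the current sequence values are persisted alongside the catalog is not covered here, and the class is illustrative only.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

public class SequencesDemo {

    private final AtomicInteger entityTypeSequence = new AtomicInteger();
    private final AtomicInteger entityPrimaryKeySequence = new AtomicInteger();
    private final AtomicInteger indexPrimaryKeySequence = new AtomicInteger();
    private final AtomicLong transactionIdSequence = new AtomicLong();

    int nextEntityTypeId()     { return entityTypeSequence.incrementAndGet(); }
    int nextEntityPrimaryKey() { return entityPrimaryKeySequence.incrementAndGet(); }
    int nextIndexPrimaryKey()  { return indexPrimaryKeySequence.incrementAndGet(); }
    long nextTransactionId()   { return transactionIdSequence.incrementAndGet(); }
}
```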