Transactions

Transactions are a fundamental part of the database system. They ensure that the database remains in a consistent state even in the event of failures or concurrent access by multiple users. In this article, we will explore the concept of transactions, their properties, and how they are implemented in evitaDB.

Readers who are familiar with database isolation levels may remember that evitaDB only supports snapshot isolation, so they can skip this introductory chapter, which describes the context relating to this level of isolation.

The key concept of a transaction is that data written by a transaction is only visible within that transaction, and not in other simultaneous reader sessions/transactions. Once a transaction has been committed, its changes become visible to any readers opening a new transaction afterwards. In the terminology of multi-version concurrency control, this is called 'snapshot isolation': each client behaves as if it has a full copy of the database at the time the transaction starts. If the client does not open the transaction explicitly, it is opened implicitly for each query or update statement and closed automatically when the statement finishes. This level of isolation prevents write-write conflicts — if two parallel transactions attempt to update the same record, one of them will be rolled back. However, it does not prevent read-write conflicts, also known as write skew. For example, if there are two transactions A and B, transaction A reads record X, adds 1 to the read value and stores the result to record Y, and transaction B reads record Y, subtracts 1 from the value and stores the result to record X, both transactions will be successfully committed without a conflict. We don't recommend using evitaDB for banking systems, but for most other use cases, this level of isolation is sufficient.
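
To illustrate write skew, here is a minimal, self-contained Java sketch (not evitaDB code) in which two transactions read from the same snapshot and write disjoint records, so snapshot isolation lets both commit even though the combined result cannot be produced by any serial execution:

```java
import java.util.HashMap;
import java.util.Map;

public class WriteSkewDemo {
    public static void main(String[] args) {
        // both "transactions" read from the same immutable snapshot
        Map<String, Integer> snapshot = Map.of("X", 10, "Y", 20);

        // Transaction A: reads X, writes X + 1 into Y
        Map<String, Integer> changesA = new HashMap<>();
        changesA.put("Y", snapshot.get("X") + 1);          // Y = 11

        // Transaction B: reads Y, writes Y - 1 into X
        Map<String, Integer> changesB = new HashMap<>();
        changesB.put("X", snapshot.get("Y") - 1);          // X = 19

        // The write sets {Y} and {X} are disjoint, so no write-write conflict is
        // detected and both commits succeed, even though the combined result
        // (X=19, Y=11) could never be produced by running A and B one after the other.
        System.out.println("A writes " + changesA + ", B writes " + changesB);
    }
}
```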

Lifecycle of a transaction

First, let's fast forward to the point at which the transaction is committed. Then, we will explain how we isolate the changes made by non-committed parallel transactions. A transaction commit is a multi-step process that ensures all changes made during the transaction are safely written to the database and made visible to other readers once fully integrated into all indexes. The commit process consists of the following steps:

  1. parallel transaction conflict resolution
  2. persistence of the transaction to the write-ahead log (WAL)
  3. processing contents of the WAL and building a new "version" of the database
    • incorporating changes to the indexes and writing record payloads to the data storage files
    • persisting changes in indexes to the data storage files
    • compaction of the data storage files if necessary (described in more detail here)
  4. swapping the new version of the database with the current one
  5. propagating the changes to the read nodes in a cluster setup

In the following sections, we will describe each of these steps in detail.

1. Conflict resolution

The first step is for the transaction processor to check that the changes made by parallel transactions do not conflict with one another. Each transaction knows the version of the database (catalogVersion) that it started with. When a transaction is committed, it is assigned a new catalogVersion in which the changes made by that transaction will be incorporated. The catalog version is a monotonically increasing number that increments by one for each transaction.

The conflict resolver needs to examine all changes made between the start and end catalog versions of the transaction. All mutations in the transaction produce a so-called conflict key, which is compared with conflict keys generated by previously committed transactions. If there is a conflict, an exception is raised and the transaction is rolled back.

Conflict handling is a work in progress (see issue #503). We want to provide these conflict resolution policies:
  1. deny changes to the same entity
  2. deny changes to the same entity part, namely:
    • single entity attribute
    • single associated data
    • single price
    • single reference
    • entity header (for direct entity properties, such as hierarchy placement, price inner record handling strategy and so on)
  3. allow conflicting changes (last writer wins) - this policy will never raise a conflict exception

The default conflict policy can be set for the entire database engine in its configuration, but this can be overridden for each catalog schema and even for each entity type or a sub-schema definition (attribute, associated data, etc.). We believe that this approach provides the necessary versatility and fine-grained control. There are also special types of safe mutation that can help minimize conflict, such as the "delta" attribute mutation, which can be applied safely in parallel.
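
As a rough illustration of the mechanism, the following sketch (hypothetical types, not evitaDB's actual classes) compares the conflict keys of a committing transaction with the keys of transactions committed between its start and head catalog versions, assuming the "deny changes to the same entity" policy:

```java
import java.util.NavigableMap;
import java.util.Set;

// Hypothetical conflict key identifying the changed entity.
record ConflictKey(String entityType, int primaryKey) {}

class ConflictResolver {
    boolean hasConflict(Set<ConflictKey> committingKeys,
                        NavigableMap<Long, Set<ConflictKey>> committedKeysByVersion,
                        long startCatalogVersion, long headCatalogVersion) {
        // examine all transactions committed after this one started
        return committedKeysByVersion
            .subMap(startCatalogVersion, false, headCatalogVersion, true)
            .values()
            .stream()
            // any overlapping conflict key means the transaction must be rolled back
            .anyMatch(committed -> committingKeys.stream().anyMatch(committed::contains));
    }
}
```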

2. Write-ahead log persistence

All transactions and their mutations in evitaDB databases are first written to the write-ahead log. The write-ahead log (WAL for short) is described in more detail in the storage model article. Transactions are appended to the WAL in the order they are committed, with the catalog versions of subsequent transactions always being greater than those of previous transactions. Writing to the WAL is a blocking operation, meaning the transaction processor waits until the entire transaction has been successfully written to a WAL file and synchronized with the disk. This ensures that the transaction is durable and can be recovered in the event of a crash.

In fact, there are two types of WAL file. The first type, an "isolated" WAL, is created for each transaction in a separate file. This enables transactions to be written to their respective isolated WALs simultaneously (in-parallel without blocking each other). The second type is the "global" WAL file, to which the isolated transaction WAL files are copied in a single low-level operation. This copy operation is executed sequentially in the "WAL persistence" step as each transaction is processed. Isolated WAL files are removed immediately after their contents are copied to the global WAL file.
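
A simplified sketch of the blocking "global WAL" append, assuming plain java.nio file channels (the real storage code differs): the isolated WAL of a committed transaction is copied into the shared file, forced to disk, and only then removed.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class WalAppender {
    // globalWal is expected to be opened for appending to the shared WAL file
    void appendToGlobalWal(Path isolatedWal, FileChannel globalWal) throws IOException {
        try (FileChannel source = FileChannel.open(isolatedWal, StandardOpenOption.READ)) {
            long size = source.size();
            long position = 0;
            while (position < size) {
                // single low-level copy of the isolated WAL into the global WAL
                position += source.transferTo(position, size - position, globalWal);
            }
        }
        globalWal.force(true);      // fsync: only now is the transaction considered durable
        Files.delete(isolatedWal);  // the isolated WAL is no longer needed
    }
}
```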

Side note: We made life easier by enforcing a single writer for all data files. In other words, this means we can process transactions sequentially. If we knew that the transactions were being written to non-overlapping data files, we could write the data in parallel. However, as this is not expected to be a common use case, we decided not to implement it.

WAL does not participate in the standard compaction process, so it would grow indefinitely. This is why a threshold is configured to limit the maximum size of the WAL file. If the threshold is reached, evitaDB starts writing to a separate file (segment), but leaves the original file in place. A single transaction must always be fully written to the same WAL segment, so huge transactions may cause WAL file sizes to exceed their configured limits. The number of WAL files kept is limited, and this limit can only be exceeded if the changes in them have not yet been applied to the indexes. The WAL files are removed once the transaction processor confirms that all changes have been applied to the indexes (and if running in distributed mode, also propagated to all other nodes).

WAL files are a crucial part of the database: they guarantee that committed transactions are durable and can be recovered after a crash, and in a cluster setup they serve as the source of changes streamed to the read nodes.

Up to this point, if an exception occurs, the transaction is effectively rolled back, but the world keeps spinning.

3. Processing contents of the WAL

In this phase, we read the unprocessed contents of the WAL file, apply the changes to the indexes of the current catalog version and create payload records in the shared data files. These changes are isolated and still invisible to other readers. This phase operates in a "time-window" manner, meaning that it tries to replay as many transactions as possible within a dedicated time window. When the time limit is reached or the entire contents of the WAL file have been processed, the transaction processor creates a new instance of the catalog data structure with a particular version (the catalogVersion assigned to the last transaction that was processed). This new instance is then passed to the next phase.
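
A sketch of the time-window replay loop, using hypothetical types and method names; the real processor also writes payload records and updates many index structures, which is omitted here:

```java
import java.time.Duration;
import java.util.Iterator;
import java.util.List;

interface Mutation {}
record CommittedTransaction(long catalogVersion, List<Mutation> mutations) {}
interface Catalog {
    long version();
    Catalog withApplied(List<Mutation> mutations);  // updates indexes and payload files
    Catalog withVersion(long catalogVersion);
}

class WalReplayer {
    Catalog replayWindow(Catalog current, Iterator<CommittedTransaction> wal, Duration window) {
        long deadline = System.nanoTime() + window.toNanos();
        Catalog next = current;
        long lastAppliedVersion = current.version();
        // replay as many committed transactions as fit into the time window
        while (wal.hasNext() && System.nanoTime() - deadline < 0) {
            CommittedTransaction tx = wal.next();
            next = next.withApplied(tx.mutations());
            lastAppliedVersion = tx.catalogVersion();
        }
        // the single new catalog version covering all replayed transactions
        return next.withVersion(lastAppliedVersion);
    }
}
```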

If the engine fails at this stage, all transaction processing will be halted. The engine cannot skip the committed transaction persisted in the WAL file, nor can it progress to the next one. It will try indefinitely to replay the last transaction from the WAL file. This ultimately indicates an error in the database engine that needs to be analyzed and corrected. However, the mechanism does not trap the thread in an infinite loop; it always attempts to process the problematic transaction again once a new transaction has been committed.

The storage model of evitaDB is strictly append-only. Once a record has been written to the data file, it cannot be changed or deleted. This fundamental design decision allows us to avoid the complexity of synchronization between writer and reader threads. Records in the file are located using pointers stored in the indexes. If there is no pointer to a record in the index, the record is considered garbage and is automatically removed during the next compaction.

4. New catalog version propagation

This phase is very quick and simply involves replacing the root pointer of the "current" catalog instance with the newly created one. Once complete, the newly created sessions/transactions will see the changes made in that particular transaction. The old catalog instance is not removed immediately but kept in memory until all sessions using this version of the catalog are closed (as required by the snapshot isolation level). Once there are no more sessions using the old catalog instance, it is removed from memory, after which the Java garbage collector will take care of it.
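
Conceptually, the swap is a single compare-and-exchange on an atomic reference holding the "current" catalog; the sketch below uses hypothetical type and field names:

```java
import java.util.concurrent.atomic.AtomicReference;

record Catalog(long version) {}

class CatalogHolder {
    private final AtomicReference<Catalog> currentCatalog;

    CatalogHolder(Catalog initial) {
        this.currentCatalog = new AtomicReference<>(initial);
    }

    void publish(Catalog expected, Catalog newVersion) {
        // succeeds only if nobody replaced the catalog in the meantime
        Catalog witness = currentCatalog.compareAndExchange(expected, newVersion);
        if (witness != expected) {
            throw new IllegalStateException("Catalog was concurrently replaced");
        }
        // the old instance stays reachable through still-open sessions until they close
    }

    Catalog current() {
        return currentCatalog.get();
    }
}
```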

Horizontal scaling

Distributed mode of evitaDB is still a work in progress (see issue #109).

We plan to implement a replication model similar to streaming replication in PostgreSQL or statement-based replication in MySQL. evitaDB is designed for read-heavy environments, so we plan to stick to a single-master, multiple-read-node model with dynamic master (leader) election. All writes will target the master node, which will maintain the primary WAL.

All read nodes maintain an open connection to the master node, streaming the contents of the WAL file and replaying the changes locally. This connection is always open and downloads all changes present in the WAL file to each replica node. This means that, once transaction processing has reached the final phase (catalog propagation), mutations start to be streamed to the replicas. As all conflicts are resolved on the master node before the transaction is committed and written to the WAL file, all mutations are expected to be successfully processed by all replicas.

When a new replica node is added to the cluster, it selects another replica or the master node and fetches a binary copy of its data storage files, cleaned of obsolete records (see active backup). The streaming producer on the selected node discards obsolete records on the fly, so no obsolete data is streamed to the new node. The backup process is always pinned to valid data of a particular catalog version (including the position in the WAL file), and, thanks to the append-only nature of the data files, the source node does not need to pause the processing of new transactions and operates as usual. As all these operations work at a "binary level", creating a new replica is a reasonably fast process.

Parallel transaction isolation in detail

Two areas relating to transactions need to be addressed:

  1. record payload data
  2. indexes referring to that data

Isolation of non-committed record payload data

Non-committed transactions write their record payload data to temporary memory, which is cleared when the transaction is finished (either through a commit or a rollback). Long transactions with large payloads can currently affect the health of the database engine, so we recommend avoiding such transactions.

The transactional memory resides on the Java heap alongside all the other key evitaDB data structures. If the transaction is very large, it could consume all the available memory, causing an OutOfMemoryError and affecting the rest of the system, including read-only sessions. To avoid this, we need to limit the scope of the transaction. However, retrieving information about the size of data structures is not an easy task on the Java platform, and we can only retrieve a rough estimate. We plan to calculate an estimate of the transaction size and limit the total size of the transaction, as well as the total size of all transactions processed in parallel, to avoid exhausting all free memory. But this is still a work in progress - see issue #877.

Isolation of changes in the memory indexes

The second part — the indexes — is where all the complexity lies. A frequently used approach in databases is a table/row locking mechanism or a transaction ID number stored alongside the record pointer in a B-tree. This is consulted when a record is about to be read/updated in a particular transaction (see Goetz Graefe Modern B-Tree Techniques, or the Helsinki University of Technology team's Concurrency Control for B-Trees with Differential Indices).

Since we keep all indexes in memory, storing an 8-byte catalogVersion next to each record would require a lot of memory, so we take a different approach: software transactional memory. The idea is to keep the original data structure immutable and create a diff layer that is applied on the fly when reading modified data, but only by the thread that is handling the open transaction. This way, we can avoid locking the indexes and allow multiple transactions to read from and write to the shared "immutable" base concurrently.

Our indexes comprise multiple transactional data structures, which are explained in more detail in the reference section.

Software transactional memory

We isolate concurrent updates to the indexes made by different threads, placing them in separate memory blocks in the form of a read-write diff overlay that envelops the original, immutable data structure. When a thread reads data during a transaction, it accesses the data via an overlay that applies the diff in real time, enabling the transaction to dynamically view its own changes. If there are no updates in the transaction, there are also no diff layers, meaning the transaction reads directly from the underlying immutable data structure. As the diff overlays are stored in a ThreadLocal object bound to the thread processing a specific transaction, transactions cannot see each other's changes. This approach is often labelled software transactional memory (STM).
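
The following minimal sketch (not the real evitaDB classes) shows the idea of a read-write diff overlay kept in a ThreadLocal, so only the thread running the transaction sees its own changes while everyone else reads the shared immutable base:

```java
import java.util.HashMap;
import java.util.Map;

final class OverlayedMap<K, V> {
    private final Map<K, V> sharedBase;   // immutable base, shared by all readers
    private final ThreadLocal<Map<K, V>> diff = ThreadLocal.withInitial(HashMap::new);

    OverlayedMap(Map<K, V> sharedBase) {
        this.sharedBase = sharedBase;
    }

    void put(K key, V value) {
        // the change is trapped in the per-thread diff; other threads never see it
        diff.get().put(key, value);
    }

    V get(K key) {
        Map<K, V> local = diff.get();
        // the thread's own changes win over the shared base
        return local.containsKey(key) ? local.get(key) : sharedBase.get(key);
        // (removals would be trapped in a similar per-thread structure; omitted for brevity)
    }
    // on commit the diff is merged into a new immutable base; on rollback it is simply discarded
}
```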

Atomic transactions

The only way to make transactional changes atomic is to gather all changes in a volatile diff layer that is only used for the particular transaction. When a transaction is committed, a new instance of the entire catalog data structure must be built (i.e. new instances of updated entity collections, indexes, etc.). This new instance then replaces the current catalog instance with a single call to the AtomicReference#compareAndExchange method.

Although we state that the entire catalog data structure is to be reinstantiated, this is not entirely true. If that were the case, the transactions would be too expensive for large datasets, and the mechanism would not be feasible. In reality, we only create new instances for the modified parts of the catalog data structures and the catalog itself. Imagine the catalog data structure as a tree, with the catalog instance at the top and all the inner data as a directed acyclic graph. You will realize that the new instances are required only for the changed data structure itself, plus all its "parents" towards the catalog instance at the top of the graph. This technique is called path copying.
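
A toy illustration of path copying with hypothetical record types: only the changed collection and its parents up to the catalog root are re-created, while everything else is shared with the previous version.

```java
import java.util.HashMap;
import java.util.Map;

record EntityCollection(String name, Map<Integer, Object> entities) {
    EntityCollection withEntity(int primaryKey, Object entity) {
        Map<Integer, Object> copy = new HashMap<>(entities);
        copy.put(primaryKey, entity);
        return new EntityCollection(name, copy);   // new instance only for the changed leaf
    }
}

record Catalog(long version, Map<String, EntityCollection> collections) {
    Catalog withCollection(EntityCollection changed) {
        Map<String, EntityCollection> copy = new HashMap<>(collections);
        copy.put(changed.name(), changed);          // new parent referencing the new child
        return new Catalog(version + 1, copy);      // new root = new catalog version
        // untouched sibling collections are reused as-is from the previous version
    }
}
```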

If the transaction is rolled back, we simply discard the entire memory block of the diff layer from the ThreadLocal variable and allow the Java garbage collector to take over.

The diff layer approach is not used in one special case: the index that tracks the positions of all records in a data file in a single hash map, providing quick O(1) access to the payload records. Rebuilding this index through a diff layer would involve constantly reallocating the entire hash map, which would be inefficient. Therefore, this index only keeps the most recent record pointers and tracks all changes made to the map between the current and previous catalog versions. This history is retained for as long as there is a session using a particular old catalog version. A session targeting an old catalog version may land on a value it should not see; in that case it walks the history of the record and recovers the correct pointer for its catalog version.

Preventing update loss

The key problem with the described approach is that updates can easily be lost if the diff layer is not applied to the original data structure and included in the new catalog instantiation process.

To avoid this issue, we track every diff layer created in a particular transaction and mark them as consumed by the instantiation process when the transaction is committed. Finally, at the end of the transaction, we check that all diff layers have been marked as processed; if not, an exception is thrown and the transaction is rolled back. This resembles double-entry accounting, where each positive operation must be accompanied by a negative one.

Such issues occur during development, so there must be a way to identify and solve these problems. This is actually very tricky, since there may be thousands of diff layers and assigning a specific layer to its creator/owner is challenging. Therefore, we assign a unique transactional object version ID when the diff layer is created and include it in the exception thrown for non-collected diff layers. When we replicate the problematic transaction in a test, the diff layer gets the same version ID repeatedly and we can track the exact moment and place of the layer's creation by placing a conditional breakpoint at the version ID generator class.

Testing

The integrity of the data structures is vital for the database. In addition to the standard unit tests, there is always one "generational" test that uses a property-based testing approach. These tests use two data structures: our tested STM implementation and a test double represented by a well-proven external implementation. For instance, in the generational test of the transactional map, we use the JDK HashMap implementation as the test double.

The testing sequence is always similar.

  1. at the start of the test, both the tested instance and the test double instance are created
  2. both are filled with the same initial randomized data
  3. in an iteration with a randomized number of repetitions:
    • a random operation is selected and executed with a randomized value on both the tested instance and the test double (an example of such an operation might be "insert value X" or "remove value Y")
    • a test then checks that the changes are immediately visible on the tested instance
    • the transaction is committed
  4. after the commit, the contents of both data structures are compared and must be equal
  5. new instances of the data structures are created with initial data taken from the result of the commit
  6. steps 3-5 are repeated infinitely

Once this generational test has run for a few minutes without any problems, we can be confident that the STM data structure is correctly implemented. However, there is always a small chance that the test itself is incorrect. Quis custodiet ipsos custodes?
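
A condensed sketch of such a generational test, where java.util.HashMap plays the test double and TransactionalMap with commitToNewBase() stands in for the (here hypothetical) STM implementation under test:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Random;

// Stand-in interface for the tested STM implementation (names are hypothetical).
interface TransactionalMap<K, V> {
    void put(K key, V value);
    void remove(K key);
    V get(K key);
    Map<K, V> commitToNewBase();   // materializes the committed content
    static <K, V> TransactionalMap<K, V> wrap(Map<K, V> base) {
        throw new UnsupportedOperationException("stand-in for the tested implementation");
    }
}

class GenerationalTest {
    void run() {
        Random rnd = new Random(42);                             // fixed seed -> reproducible run
        Map<Integer, String> testDouble = new HashMap<>();
        TransactionalMap<Integer, String> tested = TransactionalMap.wrap(new HashMap<>());

        for (int generation = 0; generation < 1_000; generation++) {
            for (int i = 0, ops = rnd.nextInt(50) + 1; i < ops; i++) {
                int key = rnd.nextInt(100);
                if (rnd.nextBoolean()) {
                    tested.put(key, "v" + key);                  // random insert ...
                    testDouble.put(key, "v" + key);
                } else {
                    tested.remove(key);                          // ... or random removal
                    testDouble.remove(key);
                }
                // changes must be visible immediately within the transaction
                assert Objects.equals(tested.get(key), testDouble.get(key));
            }
            Map<Integer, String> committed = tested.commitToNewBase();
            assert committed.equals(testDouble);                 // both structures must match
            tested = TransactionalMap.wrap(committed);           // next generation starts from the result
        }
    }
}
```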

Transactional data structures

Data structures are planned to be replaced
Most transactional data structures are suboptimal because they copy their entire contents to a new instance of the class at the moment of committing (the original instance must remain unchanged for other readers). Our primary focus was on read performance, so write performance was not a priority. We plan to improve the performance of these data structures in the future (see issue #760) and to adopt data structures modelled on Clojure's proven, immutable-friendly collections (for example, HAMT).

Transactional array (ordered)

The transactional array mimics the behavior of a plain array, and there are multiple implementations of it:
  • TransactionalIntArray: for storing primitive int numbers
  • TransactionalObjectArray: for storing plain objects of any type
  • TransactionalComplexObjectArray: for storing nested object structures that allow merging and the automatic removal of "empty" containers.

All arrays are naturally ordered by default. In the case of object implementations, the object must be comparable. The array implementation does not allow duplicate values. Therefore, in the event of any insertions/removals, the array knows which indexes will be affected internally. It is not possible to set a value on an index passed from outside logic.

All these implementations share the same idea: in transactional mode, all updates go to the transactional overlay that traps:

  • inserts on certain indexes in an internal array of inserted values
  • deletions on certain indexes in an internal array of removed indexes.

Using this information, the STM array can build a new array combining the original values with all the changes. To avoid creating a new array (and memory allocations) for each operation, there are optimized methods that operate directly on the diff (a simplified sketch follows the list below):

  • indexOf
  • contains
  • length
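
Below is that simplified sketch; unlike the real implementation, it tracks inserted and removed values rather than indexes, and it assumes a value is never inserted twice or removed when absent:

```java
import java.util.Arrays;
import java.util.TreeSet;

final class OrderedIntArrayDiff {
    private final int[] original;                          // shared, immutable, sorted base
    private final TreeSet<Integer> inserted = new TreeSet<>();
    private final TreeSet<Integer> removed = new TreeSet<>();

    OrderedIntArrayDiff(int[] original) {
        this.original = original;
    }

    void add(int value) {
        if (!removed.remove(value)) {                      // re-adding a removed value cancels out
            inserted.add(value);
        }
    }

    void remove(int value) {
        if (!inserted.remove(value)) {                     // removing a fresh insert cancels out
            removed.add(value);
        }
    }

    boolean contains(int value) {
        if (inserted.contains(value)) return true;
        if (removed.contains(value)) return false;
        return Arrays.binarySearch(original, value) >= 0;  // the base is naturally ordered
    }

    int length() {
        // valid under the assumption stated above (no duplicate inserts, no bogus removals)
        return original.length + inserted.size() - removed.size();
    }
}
```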

The TransactionalComplexObjectArray class is much more complex, as it accepts functions that operate on nested structures:
  • BiConsumer producer: this function takes two containers and combines them into one output container containing an aggregate of their nested data
  • BiConsumer reducer: this function takes two containers and removes/subtracts the nested data of the second container from that of the first
  • Predicate obsoleteChecker: this function tests whether the container contains any nested data. If not, the container may be considered removed; the predicate is consulted after the reduce operation

This implementation provides the ability to partially update the objects held within it. For example, consider a record with the following structure:

  • String: label
  • int[]: recordIds

If we then insert two such records into the TransactionalComplexObjectArray with the following data:

  • label = "a", recordIds = [1, 2]
  • label = "a", recordIds = [3, 4]
The array will produce the result with one record: a: [1, 2, 3, 4].
If we remove the record a: [2, 4], the array will produce the result: a: [1, 3].
If we apply the removal of the record again - a: [1, 3], the array will produce an empty result.
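
The producer, reducer and obsolete checker for this example could look roughly like the following (hypothetical container type, not the actual evitaDB code):

```java
import java.util.TreeSet;
import java.util.function.BiConsumer;
import java.util.function.Predicate;

// Hypothetical container matching the label/recordIds example above.
record LabeledIds(String label, TreeSet<Integer> recordIds) {}

class LabeledIdsFunctions {
    // combines the nested record ids of two containers that share the same label
    static final BiConsumer<LabeledIds, LabeledIds> PRODUCER =
        (existing, added) -> existing.recordIds().addAll(added.recordIds());

    // subtracts the nested record ids of the second container from the first
    static final BiConsumer<LabeledIds, LabeledIds> REDUCER =
        (existing, removed) -> existing.recordIds().removeAll(removed.recordIds());

    // a container with no record ids left is considered removed
    static final Predicate<LabeledIds> OBSOLETE_CHECKER =
        container -> container.recordIds().isEmpty();
}
```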

Unfortunately, with this implementation, we cannot provide optimized methods such as:

  • indexOf
  • length

We have to compute the entire merged array first in order to access these properties. While this data structure could be subject to significant optimizations, it is also quite challenging to implement correctly due to the nature of nested structures.

Transactional unordered array

This version of the transactional array differs from the previous one in that it allows duplicate values. It is also unordered and enables the client to control where new values are inserted or existing ones removed.

The database requires only a single implementation of this structure.

The diff layer implementation for the unordered array is essentially the same as for the ordered array, with one exception. The inserted values retain information about the relative index within the segment inserted at a specific position within the original array.

This array has a special, fast implementation that works on the diff layer for the following methods:

  • indexOf
  • contains
  • length

Transactional list

The transactional list mimics the behaviour of the Java java.util.List interface, enabling it to contain any object. The list can contain duplicates and is unordered. The current implementation is suboptimal and could be improved in the same way as the unordered array.

The diff layer contains a sorted set of indexes that were removed from the original list, as well as a map of new values and the indexes at which they were inserted. When a new item is inserted or removed from the diff layer, all the indexes after this value need to be incremented or decremented. Therefore, the "add/remove first" operation always has O(N) complexity. Conversely, the unordered array splits inserts into multiple segments, so the complexity is usually lower — O(N) is only the worst-case complexity for an unordered array.

Transactional map

The transactional map mimics the behaviour of the Java java.util.Map interface, allowing it to contain any key/value pairs. In this case, the implementation is straightforward: the diff layer contains a set of all keys removed from the original map, as well as a map of all key/value pairs that have been updated or inserted into the map.

When logic attempts to retrieve a value from the map, the diff layer is first consulted to determine whether the key has been added to the layer, or whether it exists in the original map with no removal recorded in the layer. The iterator of entries/keys/values first iterates over all existing, non-removed entries, and then over the entries added to the diff layer.

Transactional set

The transactional set mimics the behaviour of the Java java.util.Set interface. In this case, the implementation is straightforward: the diff layer contains a set of all keys removed from the original set, as well as a set of added keys.

Transactional bitmap

A bitmap is a set of unique integers in ascending order. This data structure is similar to a transactional array, but is limited to integers only and enables much faster operations on the number set. This implementation wraps an instance of the internal class RoaringBitmap. The reasons for using this data structure, and more detailed information about RoaringBitmaps, are stated in the query evaluation chapter. RoaringBitmap is a mutable data structure from a third-party library. As we have no control over it, we wanted to hide it behind our own interface, ensuring that the entire codebase works with that interface instead of with RoaringBitmap directly.

The transactional bitmap allows insertions and deletions to be trapped in the diff layer without touching the original bitmap. When the bitmap needs to be read, a new RoaringBitmap is computed on the fly by applying the insertions with a boolean OR and the removals with a boolean AND NOT on the original bitmap. To avoid repeating this costly operation, the result is cached for subsequent read requests, but must be cleared upon the first subsequent write.
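
A sketch of that computation using the public static RoaringBitmap operations; each operation produces a fresh bitmap, which is why two full copies are created:

```java
import org.roaringbitmap.RoaringBitmap;

class TransactionalBitmapView {
    RoaringBitmap computeReadView(RoaringBitmap original,
                                  RoaringBitmap inserted,
                                  RoaringBitmap removed) {
        // the shared original bitmap is never mutated
        RoaringBitmap withInsertions = RoaringBitmap.or(original, inserted);    // first full copy
        RoaringBitmap readView = RoaringBitmap.andNot(withInsertions, removed); // second full copy
        return readView;   // cached for subsequent reads, invalidated by the next write
    }
}
```
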
This computational method clones the entire original RoaringBitmap twice and is therefore suboptimal. Unfortunately, RoaringBitmap does not provide us with better options. An ideal implementation would require RoaringBitmap to be internally immutable and produce a new instance every time a write operation occurs. As RoaringBitmaps work internally with separate blocks of data, the new immutable version could reuse all the blocks that were not affected by the write action, cloning or altering only the blocks where the changes occurred. However, this would require substantial changes to the internal implementation and would probably be dismissed by the authoring team.

B+ tree

This is a standard B+ tree implementation with a transactional diff layer. The implementation creates a diff layer for each modified leaf segment of the tree, as well as for the parent chain to the root, with new instances that refer to the underlying diff layers (and the original, non-modified leaves). The new B+ tree is materialized at the moment of commit. The entire process resembles copy-on-write data structures, but the copied blocks are quite small compared to the entire tree. The B+ tree is used for all indexes that require a sorted order of keys.

Sequences

evitaDB uses several sequences to assign unique, monotonic identifiers to various objects. These sequences are not part of the transaction process and their progress is not rolled back. All sequences are implemented internally as either an AtomicLong or an AtomicInteger, which allow the retrieval of incremented values in a thread-safe manner.
Currently, we do not plan to support multiple writer mode in the distributed setup, which means we can safely use Java's internal atomic types to manage sequences because new numbers will always be issued by a single master node (JVM).
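
A sequence therefore boils down to a single atomic counter; for example, issuing entity primary keys could look like this minimal sketch:

```java
import java.util.concurrent.atomic.AtomicInteger;

class EntityPrimaryKeySequence {
    private final AtomicInteger sequence = new AtomicInteger(0);

    int next() {
        // thread-safe, monotonic; sequence progress is never rolled back
        return sequence.incrementAndGet();
    }
}
```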

The sequences managed by evitaDB are:

  • entity type sequence: assigns a new ID to each entity collection, allowing the collection to be addressed by a small number instead of its longer, repeatedly duplicated string name
  • entity primary key sequence: assigns a new ID to each created entity, in case the entity schema requires automatic primary key assignment
  • index primary key sequence: assigns a new ID to each created internal index
  • transaction ID sequence: assigns a new ID to each opened transaction

Author: Jan Novotný

Date updated: 16.5.2025
