In-memory Implementation of Evita
The document contains the original research paper for the in-memory implementation of the evitaDB prototype. This version of the prototype proved to be viable and was transformed into the final evitaDB implementation.
The in-memory implementation is built from the ground up to achieve the desired evitaDB functionality. The key principles embodied in the design process were:
- Immutability: All key data structures are designed to be immutable. We use Java records everywhere it makes sense - records could open up significant memory savings once Project Valhalla is merged into the Java mainline. Immutability introduces implicit thread safety and opens the path to lock-free concurrent processing.
- Mutations: When immutable data needs to be updated, we wrap it in a temporary mutable object using a builder pattern. The builder is capable of generating a set of mutation operations that gradually modify the source data structure into its fully updated form. These mutations represent atomic operations upon the data structure and can be translated into a modification protocol sent over the network, stored in a WAL and so on (a minimal sketch follows this list).
- Append-only approach: File write operations are all designed to be append-only. There is no single place that would require seek & write in the file. Append-only approach guarantees very good performance characteristics and ability to easily back up the data on the fly. The disadvantage is that the size of the file quickly increases and the file gets fragmented over time (see vacuuming for mitigation steps).
- Minimize external dependencies: We purposefully limit the use of external dependencies to a bare minimum. However, we take a lot of inspiration from them. There are a lot of functions inspired by or copied from Apache Commons, Spring Framework or Guava. Using those functions from the original sources would grow the size of the evitaDB binary (even though we'd use only a tiny fragment of the original libraries) and complicate the usage as an embedded database, which is something we want to support. We aim for evitaDB to be as small a library as possible.
- Stick to primitives: Wherever possible we use Java primitive types and operate on them. The smaller the data type, the better. Since our implementation keeps all crucial data in memory, we need to carefully consider what data structures / data types to use to fit as much data in memory as possible. All indexes take advantage of the 32-bit int data type that offers a reasonable range while consuming moderate memory space. See the chapter on indexes for tactics used in this area.
- Indexes fully in memory: All data needed to resolve the primary keys of the entities matching the query are maintained in RAM. This is the main advantage of our implementation since RAM is pretty fast. Also, it is the main disadvantage - if our hardware does not have enough memory to host all the database indexes, evitaDB would be unusable. This is the key difference between our database system and other database systems available on the market. Alternative database systems usually operate with much larger data than the memory can hold and use data-structures / algorithms friendly to the rotating / solid state disk drives. They trade off querying speed for much larger data size limits, we do the opposite.
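The following sketch illustrates the builder/mutation principle; the type and method names are hypothetical and do not correspond to the real evitaDB API.

```java
// A minimal sketch of the builder/mutation principle; names are illustrative only.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

sealed interface AttributeMutation permits UpsertAttributeMutation, RemoveAttributeMutation {}
record UpsertAttributeMutation(String name, Object value) implements AttributeMutation {}
record RemoveAttributeMutation(String name) implements AttributeMutation {}

final class AttributesBuilder {
    private final Map<String, Object> original;                          // immutable source data
    private final List<AttributeMutation> mutations = new ArrayList<>(); // collected atomic changes

    AttributesBuilder(Map<String, Object> original) {
        // the original is kept so a real builder could drop no-op mutations
        this.original = Map.copyOf(original);
    }

    AttributesBuilder setAttribute(String name, Object value) {
        // instead of mutating the source, the change is recorded as a mutation
        mutations.add(new UpsertAttributeMutation(name, value));
        return this;
    }

    AttributesBuilder removeAttribute(String name) {
        mutations.add(new RemoveAttributeMutation(name));
        return this;
    }

    // the change set can be applied locally, sent over the network or stored in a WAL
    List<AttributeMutation> buildChangeSet() {
        return List.copyOf(mutations);
    }
}
```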
All the above-mentioned principles allow us to achieve exceptional performance in all e-commerce use-cases which we decided to solve in evitaDB. The green-field implementation allowed us to be in control of every detail of query processing and transaction handling.
Storage model
Binary format
Storage record structure
Each record follows this structure:
- int(4B) length: length of the record
- byte(1B) nodeId: reserved byte for identification of the master node where the record originated from (it may be used in a multi-master setup if we ever go this way)
- long(8B) transactionId: consistency transaction mark - see consistency verification
- byte(?B) payload: real record payload written by the Kryo serializer
- byte(1B) control: control byte (covered by CRC)
- long(8B) crc: CRC-32C checksum of the record from the control (exclusive) to end of the payload (inclusive) - see consistency verification
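A minimal sketch of appending one record in this layout, using a plain ByteBuffer and java.util.zip.CRC32C; the exact checksum coverage is an assumption of this sketch and the real writer serializes the payload with Kryo through the output buffer described below.

```java
// Appends one storage record: length, nodeId, transactionId, payload, control, CRC-32C.
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.util.zip.CRC32C;

final class StorageRecordWriter {

    static void append(OutputStream dataFile, byte nodeId, long transactionId,
                       byte control, byte[] payload) throws IOException {
        // 4B length + 1B nodeId + 8B transactionId + payload + 1B control + 8B crc
        final int length = 4 + 1 + 8 + payload.length + 1 + 8;
        final ByteBuffer buffer = ByteBuffer.allocate(length);
        buffer.putInt(length);
        buffer.put(nodeId);
        buffer.putLong(transactionId);
        buffer.put(payload);
        buffer.put(control);

        // checksum covering the payload and the control byte (an assumption of this sketch)
        final CRC32C crc = new CRC32C();
        crc.update(payload);
        crc.update(control);
        buffer.putLong(crc.getValue());

        dataFile.write(buffer.array());
    }
}
```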
The storage record size is limited by the size of the output stream buffer. The reason for this is that the MemTable is an append-only file, and we only need to set the length of the record prior to flushing it to the disk. The record length must be stored as the first piece of information in the record so that we can skip over records quickly without reading them all and deserializing them with Kryo deserializers. The length of the record serves as a consistency verification point and can also be used by the vacuuming process to skip over entire records if they're not found in the actual file index.
The reading process also uses a buffer (different from the write buffer), but this one is not limited by the buffer size and with a relatively small buffer (let's say 8kB) it can read storage records of any size.
Control byte
The control byte bits have the following meaning:
- 1st bit: if set to 1 the record represents the last record of the transaction
- 2nd bit: if set to 1 the record is part of a chain of multiple consecutive records (see chained records)
- 3rd - 8th bit: not used yet
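A small helper illustrating the control byte semantics above; mapping the 1st and 2nd bits to the least significant bits is an assumption of this sketch.

```java
// Bit flags of the control byte (bit numbering assumed from the least significant bit).
final class ControlByte {
    static final int TRANSACTION_CLOSING_BIT = 0b0000_0001; // last record of the transaction
    static final int CONTINUATION_BIT        = 0b0000_0010; // record is part of a chain

    static boolean closesTransaction(byte control) {
        return (control & TRANSACTION_CLOSING_BIT) != 0;
    }

    static boolean isChained(byte control) {
        return (control & CONTINUATION_BIT) != 0;
    }
}
```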
Chained records
The output buffer size greatly limits the content that can be stored in a single record. evitaDB indexes quickly become larger than a reasonably sized output buffer. There is a built-in mechanism that automatically splits larger payloads into multiple consecutively placed storage records and joins them back together during reading. The Kryo serialization library doesn't help much here. evitaDB hijacks the "require" check (which the Kryo library uses to verify that there is enough space in the write buffer) to interrupt the serialization process, finish the record and create a new empty one. Effectively there is no limit on the payload size that can be stored in the storage file - there is only a single limitation: the serializer must not store more bytes than the buffer size in a single write call (such as writeBytes(...) or writeInts(...)).
Consistency verification
The data file is just a file on the disk and can be corrupted any time for a variety of reasons:
- a process can finish prematurely in the middle of the record writing,
- the disk can be damaged,
- the payload might be corrupted when synchronized in the cluster,
- and other issues
Therefore, we need to detect these situations quickly to avoid corrupting valid files or detect file corruption as early as possible, which would be either during the evitaDB start-up sequence or immediately after the record is read and hasn't yet been passed on to the client.
Here is the list of mechanisms checking that the contents of the file are valid:
Write interruption resistance
The first level of defense is the Kryo deserialization process. Kryo will fail if it doesn't finish deserializing the record, and the input prematurely ends.
The transaction ID is the same for all the records in the same transaction and the transaction record list is closed by a record with the transactional control bit set to one. The transaction ID is monotonically increasing between different transactions. There is only one exception and that is "warm-up" mode, which is when evitaDB is initialized (filled up with an initial data set) and the transactional process hasn't yet been enabled. At this time the transactional ID is zero and has no ending record with the transactional bit set. At the warming up stage (non-transactional), the database is being created in isolation by a single process and problems can be easily detected by not being able to finish flush and transition to the go-live state.
When we read the data from the disk, we can be sure the entire transaction is complete at the moment when we read a record of the last known transaction that has a transactional bit set to one (the ID of the last known transaction is stored in the header stored in a different file). If not, we know that we encountered a problematic situation that requires a recovery procedure.
Disk / data corruption resistance
If the checksum doesn't match, an exception is thrown, and we know that we encountered a problematic situation that requires a recovery procedure.
Fragmentation and vacuuming
The data file is an append only file - it means that the data is only appended to the tail of the file and no data is ever written in the middle of the file. If we need to remove the record, we have to write a new record at the tail of the file that will contain the information that the record is considered to be removed from this moment on. Such a record is often called a tombstone. If we need to update the record, we have to write it again at the tail of the file combining old and updated contents.
An append-only approach is expected to be faster on HDD file systems and also allows us to avoid file gap management, which could be tricky. The disadvantage of this solution is that the file is going to get fragmented and grow larger and larger over time. To solve this problem, we need a vacuuming process that will scan the file and create a new, non-fragmented one.
Mem-table index
Because the data file is a stream of variable-size records, and there are multiple "versions" of the logically same record that are continuously updated, we need to maintain an index that tells us where the exact record locations are in the file. We call this index the "mem-table".
The mem-table index is a simple map with O(1) access time that maps living records to their locations in the data file and is kept in memory. It looks like this:
- RecordKey:
  - long (8B) primaryKey: unique identification of the record within its recordType
  - byte (1B) recordType: number representation of the enum MemTableRecordType that can be translated to a Java container class with a known serializer and deserializer
- FileLocation:
  - long (8B) startPosition: position of the first byte of the record in the data file
  - int (4B) length: size of the record
- int (4B) version: version of the record, that gets incremented when a new storage record sharing the same RecordKey gets written to the file
With this information, any variable-length record can be located in the data file. Thanks to the consistency checks, we make sure that the block we read represents a valid record. The mem-table index represents the key for reading the data file contents at a certain moment in time.
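A simplified sketch of the mem-table index; the version field and historical versions are omitted and the names are illustrative only.

```java
// O(1) lookup map from record key to its location in the append-only data file.
import java.util.concurrent.ConcurrentHashMap;

record RecordKey(long primaryKey, byte recordType) {}
record FileLocation(long startPosition, int length) {}

final class MemTable {
    private final ConcurrentHashMap<RecordKey, FileLocation> index = new ConcurrentHashMap<>();

    void put(long primaryKey, byte recordType, FileLocation location) {
        index.put(new RecordKey(primaryKey, recordType), location);
    }

    FileLocation locate(long primaryKey, byte recordType) {
        return index.get(new RecordKey(primaryKey, recordType));
    }
}
```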
Persistence procedure
The mem-table index is stored in the data file along with other types of records. It's the same kind of record as any other. The mem-table index could easily get huge and exceed the maximum buffer size. Also, due to the append-only characteristics of the data file, we don't want to rewrite the entire mem-table index every time it changes. That's why we keep the mem-table index in the data file in the form of an append-only log that can be split into multiple separate records pointing from the tail to the head.
The file index record consists of:
- long (8B) startPosition: location of the previous file index block
- int (4B) length: length of the previous file index block
- long (8B) lastTransactionId: information about the last transaction id known at the time the block is written
- collection of record pointers (1..N):
  - long (8B) primaryKey
  - byte (1B) recordType (negative when the record represents REMOVAL, i.e. tombstone)
  - long (8B) startPosition
  - int (4B) length
When the entity collection is created, the first mem-table index block is written, and its coordinates are saved. Then one or more transactions occur and a new mem-table index block is written. This block references the location of the previous mem-table block and contains only changes in the mem-table index. This includes coordinates of newly added records, updated records and also removed records (with the negative record type).
We need to always keep the location of the last mem-table index block written to the data file. This information is crucial for the mem-table index reconstruction and is usually kept in some external file - i.e. the catalog header.
Transaction persistence procedure
We achieve atomic, consistent persistence of a committed transaction across multiple data files by the following mechanism when the transaction is committed:
- each modified data file must append a new tail mem-table index block with the information of the last transaction id in that data file
- this produces the new information about the tail block, which needs to be written to the header
- the entity collection headers are written into the catalog index, whose modification produces its own tail mem-table block and the information about the last transaction id
- the catalog tail mem-table block is finally written into a special "header file", which contains fixed-size records with the location of the tail mem-table block in the catalog data file
This way, we can freely write to multiple data files simultaneously and, in case things go wrong, the last transaction is automatically rolled back, because the final operation - writing the location of the tail mem-table block of the catalog data file into the header file - didn't occur. When the database is restarted, it reads the last complete location from the "header file", which always represents the consistent committed state of all data files.
The ultimate header file
The header file is a binary file with a fixed record length. It contains 20B records of the following structure written by the Kryo serialization process:
- long (8B) startPosition: the location of the last catalog data file mem-table block
- int (4B) length: the length of the last catalog data file mem-table block
- long (8B) lastTransactionId: information about the last transaction id known at the time the block is written
Database reconstruction
When evitaDB is restarted and the mem-table index needs to be reconstructed, we only need the information about the location of the last mem-table index block to recreate the entire file index. Blocks are read from last to first (tail to head) and the file index is created during the process. When a RecordKey is read:
- it is looked up in the created mem-table index - if it's already present, it means it is an older version of the record and is ignored
- it is looked up in the removed records set - if it's there, it was removed
- when it's not present in any above places:
- if its recordType is negative it means it's a record removal, and it's added to the removed records set
- otherwise, it's added to the created mem-table index
When a block is exhausted and there is a pointer to the previous block, the algorithm proceeds with that block until the head block is reached. When the head block is processed, we have the complete actual mem-table index and the contents of the data file can be safely used.
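A sketch of this tail-to-head reconstruction, with hypothetical FileIndexBlock and RecordPointer types standing in for the real storage classes.

```java
// Rebuilds the mem-table index by walking the file index block chain from tail to head.
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

record RecordPointer(long primaryKey, byte recordType, long startPosition, int length) {}
record FileIndexBlock(long previousStartPosition, int previousLength, List<RecordPointer> pointers) {}

interface BlockReader {
    FileIndexBlock read(long startPosition, int length);
}

final class MemTableReconstructor {

    static Map<String, RecordPointer> reconstruct(BlockReader reader, long tailStart, int tailLength) {
        final Map<String, RecordPointer> living = new HashMap<>(); // reconstructed mem-table index
        final Set<String> removed = new HashSet<>();               // keys marked by a tombstone

        long start = tailStart;
        int length = tailLength;
        // the head block is assumed to point to position 0 with length 0
        while (length > 0) {
            final FileIndexBlock block = reader.read(start, length);
            for (RecordPointer pointer : block.pointers()) {
                final String key = Math.abs(pointer.recordType()) + ":" + pointer.primaryKey();
                if (living.containsKey(key) || removed.contains(key)) {
                    continue; // an older version of an already known or removed record
                }
                if (pointer.recordType() < 0) {
                    removed.add(key);         // tombstone - remember the removal
                } else {
                    living.put(key, pointer); // the most recent version of the record
                }
            }
            start = block.previousStartPosition();
            length = block.previousLength();
        }
        return living;
    }
}
```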
Due to the nature of the file index record chain, we can also recover evitaDB to any point in time, provided that we have kept the proper "last" file index record fragment for such an occasion. The vacuuming process limits our ability to go back in time in evitaDB contents, but when used properly this could be the key to implementing a safe recovery process.
In case there are a lot of record updates / removals, the mem-table index reconstruction may take a long time, most of which will be spent skipping already present or removed records. The closer a block is to the head block, the more this happens. This problem could be mitigated by the vacuuming process, which would create a new data file with actual data only. But if we need to keep historical records in place and still shorten the time needed for mem-table index reconstruction, we may create a new "snapshot" of the mem-table index in the data file by simply creating a new "head" block and writing all "actual" record locations valid for the time being after it.
Entire catalog reconstruction
The in-memory data structures of the catalog are reconstructed by the following procedure:
- locating the catalog tail mem-table block, which is read from the header file
- the catalog data file mem-table index is reconstructed
- the entity collection headers are read from it - their tail mem-table blocks are located within their headers
- the entity collection data file mem-table indexes are reconstructed
- the catalog and entity collection search indexes are fully loaded from the data files into memory
Many of these operations can be performed in parallel, but currently, the entire catalog loading procedure happens sequentially.
Backup & restore
Restoring the database requires only a simple copy of the entire catalog directory to the original place and starting evitaDB from it.
Data ingestion
We want to provide facilities for two data ingestion scenarios:
- quick initial filling of the database
- incremental updates of an existing dataset
In the second - transactional phase - we expect that evitaDB is under heavy reading load and that there are multiple simultaneous write requests as well. There may be one or more processes that propagate changes from the primary data store, and many other processes that update for example the quantity of goods in stock when orders are being placed.
The secondary transactional phase needs to ensure that all transactions are properly ordered and committed, that after each transaction the system is left in a consistent state, and that all user-defined constraints remain intact (such as the quantity of goods in stock remaining greater than or equal to zero).
Warm-up indexing
When the warm-up ingestion is done and the evitaDB session is closed, all index structures are persisted into data files, followed finally by the mem-table indexes. Since evitaDB needs to keep all working and mem-table indexes in RAM, this initial phase also serves to check that the system has enough memory to host all the necessary data.
The downside is that if the system is already operating one evitaDB database and we want to replace it with a freshly created one built in a background process, the system needs to have enough memory to accommodate two evitaDB catalogs at the same time. We can work around this situation by using a different machine for creating the replacement database and switching the machines via a load-balancer once the index is created, or by moving the index files to the production machine, closing the original catalog there and loading the new one, which would involve a few seconds of downtime.
Transactions
The key concept of the transaction is that the data written by the transaction is visible only within that transaction and is not visible to other simultaneous readers. Once the transaction is committed, the changes become visible to all readers that open a new transaction after the commit. In multi-version concurrency control terminology this is called snapshot isolation - each client behaves as if it had a full copy of the database at the time of the transaction start. If the client doesn't open the transaction explicitly, the transaction is opened implicitly for each query or execute statement and closed automatically when the statement is finished.
There are two places which need to be taken care of in relation to transactions:
- record payload data
- indexes referring to that data
The first part is trivial - we can store records to our data files without the fear of breaking ACID properties. The records on the disk are not visible until the indexes (most importantly the mem-table index) are aware of them. During a transaction, the records are appended directly to the end of the data files and the indexes are held and updated in volatile memory.
We made our life easier by enforcing a single writer for all data files together. In other words, this means we can process transactions sequentially. If we knew that the transactions were written into non-overlapping data files, we could write the data in parallel, but since this is not expected to be a common use-case, we decided not to implement it.
Software transactional memory
Our approach is totally different. We build on the premise that all indexes are held in memory, and we have a strong ally on our side - the Java garbage collector. We isolate the concurrent updates of the indexes made by different threads into separate memory blocks in the form of a read-write diff overlay enveloping the original immutable data structure.
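A conceptual sketch of such a diff overlay over an immutable map-like structure; the real transactional data structures described below are considerably richer.

```java
// Read-write diff overlay: reads consult the transaction-local diff first, then the shared snapshot.
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class DiffOverlayMap<K, V> {
    private final Map<K, V> original;                   // shared, immutable snapshot
    private final Map<K, V> upserted = new HashMap<>(); // transaction-local additions / updates
    private final Set<K> removed = new HashSet<>();     // transaction-local removals

    DiffOverlayMap(Map<K, V> original) {
        this.original = original;
    }

    V get(K key) {
        if (removed.contains(key)) {
            return null;
        }
        final V changed = upserted.get(key);
        return changed != null ? changed : original.get(key);
    }

    void put(K key, V value) {
        removed.remove(key);
        upserted.put(key, value);
    }

    void remove(K key) {
        upserted.remove(key);
        removed.add(key);
    }

    // on commit, a brand new immutable map combining the original with the diff is produced,
    // while the original stays untouched for concurrent readers
    Map<K, V> createCopyWithMergedTransactionalMemory() {
        final Map<K, V> merged = new HashMap<>(original);
        merged.keySet().removeAll(removed);
        merged.putAll(upserted);
        return Map.copyOf(merged);
    }
}
```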
Atomic transactions
Although we stated that the entire catalog data structure is re-instantiated, it's not entirely true. If it were like that, transactions would be too expensive for large datasets, and the mechanism wouldn't be feasible. In reality, we create new instances only for the modified parts of the catalog data structures and the catalog itself. If you imagine the catalog data structure as a tree with the catalog instance at the top and all the inner data as a directed acyclic graph, you'll realize that new instances are required only for the changed data structure itself plus all its "parents" towards the catalog instance at the top of the graph.
Preventing update loss
The key problem with the described approach is that updates can easily be lost if the diff layer is not applied to the original data structure and included in the new catalog instantiation process.
During development, such issues occur, and therefore there must be a way to uncover and solve these problems. This is actually very tricky, since there may be thousands of diff layers, and assigning a specific layer to its creator/owner is not easy. Therefore, we assign a unique transactional object version ID at the time the diff layer is created, and include it in the exception thrown for non-collected diff layers. When we replicate the problematic transaction in a test, the diff layer obtains the same version ID repeatedly, and we can track the origin (the exact moment and place) of the layer creation by placing a conditional breakpoint in the version ID generator class.
Testing
The testing sequence is always similar:
- at the start of the test, both tested and test-double instances are created
- both are filled with the same initial randomized data
- in an iteration with a randomized number of repetitions
- a random operation is selected and executed with a randomized value on both the tested and the test-double instance (an example of such an operation might be "insert value X" or "remove value Y")
- a test checks, that the changes are immediately visible on the tested instance
- the transaction is committed
- after the commit occurred, the contents of both data structures are compared and must equal one another
- new instances of the data structures are created with initial data taken from the product of the commit
- steps 3. - 5. are repeated infinitely
Transactional data structures
Transactional array (ordered)
The transactional array mimics plain array behavior, and there are multiple implementations of it:
- TransactionalIntArray: for storing primitive int numbers
- TransactionalObjectArray: for storing plain Objects of any type
- TransactionalComplexObjectArray: for storing nested Objects structures that allow merging and automatic removal of "empty" containers
All arrays are implicitly naturally ordered. In the case of object implementations, the Object is required to be Comparable. The array implementation doesn't allow duplicate values. So in case of any insertions / removals, the array knows internally which indexes will be affected. There is no possibility to set a value at an index passed from outside logic.
All these implementations share the same idea; in transactional mode, all updates go to the transactional overlay that traps:
- inserts on certain indexes in an internal array of inserted values
- deletions on certain indexes in an internal array of removed indexes
From this information, the STM array is able to build up a new array that combines the original values with all the changes. To avoid creating a new array (and memory allocations) for all operations there are optimized and frequently used methods that operate on the diff directly:
- indexOf
- contains
- length
- BiConsumer<T, T> producer - this function takes two containers and combines them together into one output container that contains an aggregate of their nested data
- BiConsumer<T, T> reducer - this function takes two containers and removes / subtracts nested data of the second container from the nested data of the first container
- Predicate<T> obsoleteChecker - this function tests whether the container contains any nested data - if not, the container might be considered removed; the predicate is consulted after the reduce operation
This implementation provides the ability to partially update the objects held in it. For example let's have a record with the following structure:
- String: label
- int[]: recordIds
- label = "a": recordIds = [1,2]
- label = "a": recordIds = [3,4]
Unfortunately in this implementation, we cannot provide optimized methods, such as:
- indexOf
- length
And we have to compute the entire merged array first in order to access these properties. This data structure might be subject to big optimizations, but is also quite hard to implement correctly due to the nature of nested structures.
Transactional unordered array
This array has a special and fast implementation working on the diff layer for methods:
- indexOf
- contains
- length
Transactional list
The diff layer contains a sorted set of indexes that were removed from the original list and a map of new values along with the indexes at which they were inserted. When a new item is inserted into or removed from the diff layer, all the indexes after this value need to be incremented or decremented. So, the operation "add/remove first" always has O(N) complexity. By contrast, the unordered array splits inserts into multiple segments and the complexity is usually lower - O(N) is only the worst-case complexity for an unordered array.
Transactional map
When the logic tries to retrieve a value from the map, the diff layer is first consulted to resolve whether the key exists in the original map and has no removal order in the layer, or whether it was added to the layer. The iterator of entries / keys / values first iterates over all existing non-removed entries and then iterates through entries added to the diff layer.
Transactional set
Transactional bitmap
The computational method clones the entire original RoaringBitmap twice and is thus far from optimal. Unfortunately, the RoaringBitmap doesn't provide us with better options. The ideal implementation would require the RoaringBitmap to be internally immutable, producing a new instance every time a write operation occurs. Because RoaringBitmaps internally work with separate blocks of data, the new immutable version could reuse all former blocks that were not affected by the write action and clone / alter only the few blocks where the changes really occurred. However, this would require substantial changes to the internal implementation and would probably be dismissed by the authoring team.
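A sketch of the merge computation described above, using the public RoaringBitmap API (clone, andNot, or); the diff bookkeeping is simplified.

```java
// Derives the merged bitmap from the shared snapshot and the transactional diff.
import org.roaringbitmap.RoaringBitmap;

final class TransactionalBitmapOverlay {
    private final RoaringBitmap original;                      // shared immutable snapshot
    private final RoaringBitmap added = new RoaringBitmap();   // ids added in this transaction
    private final RoaringBitmap removed = new RoaringBitmap(); // ids removed in this transaction

    TransactionalBitmapOverlay(RoaringBitmap original) {
        this.original = original;
    }

    void add(int recordId) {
        removed.remove(recordId);
        added.add(recordId);
    }

    void remove(int recordId) {
        added.remove(recordId);
        removed.add(recordId);
    }

    // combines the original with the diff - the clone is what makes this costly
    RoaringBitmap computeMerged() {
        final RoaringBitmap merged = original.clone();
        merged.andNot(removed);
        merged.or(added);
        return merged;
    }
}
```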
Sequences
The sequences managed by the evitaDB are:
- entity type sequence: assigns a new id to each created entity collection, allowing the collection to be addressed by a small number instead of a duplicated, larger String value
- entity primary key sequence: assigns a new id to each created entity - just in case the entity schema requires automatic primary key assignment
- index primary key sequence: assigns a new id to each created internal index
- transaction id sequence: assigns a new id to each opened transaction
Write ahead log
The vacuuming process removes WAL files once they have no future use. WAL files are used for:
- applying missed committed mutations on master node indexes (recovery after crash)
- change replication across multiple read nodes
When we know that all mutations are safely applied to the master indexes and the reader nodes have applied all changes to their indexes, we can safely remove the entire WAL segment.
Transaction processor
The purpose of the transaction processor is to concurrently handle all parallel transactions and verify that their mutations do not conflict with one another. It also serves as a buffer for all mutations of opened and not yet committed transactions. Each mutation provides a key that is compared to the keys of all other mutations in the transaction processor. If there is a mutation with the same key, the mutation is asked to handle the conflict with the previously recorded mutation. Some mutations can safely write to the same attribute (those that apply some form of delta information to it), while others cannot and will signal an exception which causes the transaction to be rolled back. The mutation keys target one of the following granularity levels:
- single entity attribute
- single associated data
- single price
- single reference
- single entity (for direct entity properties, such as hierarchy placement, price inner record handling strategy and so on)
We believe that this approach, in combination with "delta" mutations, might minimize the number of conflicts and rolled-back transactions. Writes will also never cause problems for the readers, who always read a consistent version of the database contents.
Distributed model
When a new replica node is added to the cluster, it selects any other replica or master node and fetches a binary version of its data storage files cleaned of obsolete records. Obsolete records are discarded on the fly by the streaming producer of the storage file maintainer (the selected replica or master node). All files will be consistent and will refer to the last committed transaction id known to the data storage file maintainer node at the moment the replica sync starts. When additional changes are recorded while the replica is syncing the database, they're written to the data storage file maintainer's WAL, and when all the files are correctly synced, the new replica also fetches the contents of the WAL file from the point of the last transaction id it knows about. Because all those operations work on a "binary level", spinning up a new replica might be a reasonably fast process.
Indexes
Indexes are key data structures for fast query processing. One of the key selling points of evitaDB is exceptional read throughput. Therefore, the database can only process queries that hit already prepared indexes. On the contrary, relational databases allow processing queries targeting non-indexed columns, which results in a full table scan that is unbearably slow on large tables.
Range index
The range index allows us to resolve queries for range types. A range type is a composite type that defines the start and the end in the form of a long type (a 64-bit numeric type). evitaDB recognizes multiple numeric-based ranges as well as the date-time range - all of which are convertible to the long type.
The range index stores the range information in the form of a long threshold and two arrays that contain the primary keys of records that start and that end at the threshold. Considering the following data ranges:
Range | Primary keys |
---|---|
1 - 10 | 1, 2 |
4 - 5 | 3 |
8 - 9 | 3, 4 |
10 - 12 | 5, 6 |
we would end up with the following range index:
Threshold | Starts | Ends |
---|---|---|
1 | 1, 2 | |
4 | 3 | |
5 | | 3 |
8 | 3, 4 | |
9 | | 3, 4 |
10 | 5, 6 | 1, 2 |
12 | | 5, 6 |
Now consider a single record (primary key 1) with the following two overlapping ranges:

Range | Primary keys |
---|---|
1 - 2 | 1 |
1 - 4 | 1 |
This would result in the following range index:
Threshold | Starts | Ends |
---|---|---|
1 | 1 | |
2 | | 1 |
4 | | 1 |
Then, if the range [1-4] is removed from the record, we end up with an invalid range index:
Threshold | Starts | Ends |
---|---|---|
2 | | 1 |
As such, the information about the record with primary key 1 would be completely lost. Therefore, the non-overlapping rule was introduced: a primary record MUST NOT be part of two ranges that overlap one another. Such a situation doesn't make sense anyway, because overlapping ranges A-B & C-D can easily be transformed into a single A-D range. This requirement is mandatory so that the index works correctly. The rule is only an internal one - the record may have ranges that violate it - the conversion to non-overlapping ranges is handled internally by the engine.
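A sketch of that internal conversion of overlapping ranges into non-overlapping ones, using a hypothetical LongRange record.

```java
// Merges overlapping (or touching) ranges of a single record into non-overlapping ones.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

record LongRange(long from, long to) {}

final class RangeNormalizer {

    static List<LongRange> mergeOverlapping(List<LongRange> ranges) {
        final List<LongRange> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingLong(LongRange::from));

        final List<LongRange> result = new ArrayList<>();
        for (LongRange range : sorted) {
            if (result.isEmpty() || result.get(result.size() - 1).to() < range.from()) {
                result.add(range); // no overlap with the previously merged range
            } else {
                // overlapping ranges A-B and C-D collapse into A-max(B,D)
                final LongRange last = result.remove(result.size() - 1);
                result.add(new LongRange(last.from(), Math.max(last.to(), range.to())));
            }
        }
        return result;
    }
}
```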
The range index is used to handle two types of queries (for Range data type) described in more detail later:
Inverted index
The range search is also available as a disjunction operation (boolean OR) of all bitmaps collected from/to the index of the found threshold.
The inverted index cannot contain the same record ID in multiple bitmaps. This prerequisite is not checked internally by this data structure - we check it on the entity contract level. If this prerequisite is not met, the inverted index may return confusing results.
The inverted index is used for resolving constraints, such as:
- equal to
- greater than
- greater than or equal to
- lesser than
- lesser than or equal to
- between
- in range
- in set
Facet index
The facet index maintains the following nested structure:
- entity reference name
  - non-grouped facets
    - facet id -> record IDs
  - group ID
    - facet id -> record IDs
This index allows for the processing of query constraints, such as:
- facet in set
- compute require constraint facet summary
Hierarchy index
The hierarchy index collocates information about the hierarchical tree structure of the entities. The index itself doesn't keep the information in the form of a tree, because we don't have a tree implementation that is transactional memory compliant.
This index allows processing of query constraints, such as:
- within hierarchy
- within root hierarchy
- compute require constraint parents
- compute require constraint hierarchy statistics
Price index
The price index contains data structures that allow for the processing of price related filtering and sorting constraints, such as:
- price in currency
- price in price list
- price between
- price valid in
- sorting by price
In order to decrease memory consumption in case of high price cardinality, there are two types of price indexes:
- super index - this index (or its inner indexes) contains the full price dataset (price bodies) and is self-sufficient. The super index relates to the global entity index, of which there is only one per entity collection
- ref index - this index (or its inner indexes) contains the subset of price pointers to the price bodies in the super index. Ref indexes are used in auxiliary entity indexes maintained for filtering with hierarchy or reference constraints (see entity index organization)
Price list and currency index
The index contains information used for filtering by price that is related to a specific price list and currency combination. Real world use-cases usually filter entities by price that is part of a specific set of price lists and currency. As such, we can greatly minimize the working set by separating the price indexes by this combination.
This index maintains multiple data structures, which allows for fast filtering, while also minimizing memory requirements.
- priceRecords is an array of all simplified price records
- indexedPriceEntityIds contains a bitmap of all record IDs for a particular priceList-innerRecordHandling-currency combination. This data structure can be used for queries which lack the price valid in constraint, or by indexes that maintain prices with innerRecordHandling equal to NONE
- indexedPriceIds contains a bitmap of all price IDs for a particular priceList-innerRecordHandling-currency combination. This data structure must be used for complex queries (except the ones handled by the previous paragraph)
- entityPrices contains a map where the key is a price ID and the value is a compound entityPrices structure
- validityIndex is a range index that maintains validity information about particular price IDs
Price record
The price record maintains only basic information required for filtering:
- internalPriceId - assigned internal price ID (see internal price identification)
- priceId - external price ID
- entityPrimaryKey - record ID of the entity the price belongs to
- priceWithTax - price with tax in a primitive integer form
- priceWithoutTax - price without tax in a primitive integer form
- innerRecordId - inner record ID grouping entity prices in subgroups
Entity prices
This data structure keeps information about all prices assigned to the same entity. It allows answering the following questions:
- return all lowest price records in the entity for the distinct inner record IDs in it
- identify, whether a particular price ID is a part of the entity prices
- identify, whether a particular inner record ID is used by any of the entity prices
In order to decrease memory consumption, there are multiple variants of this structure, depending on entities having only a single price, or multiple prices without an inner record specification or multiple prices with it.
Internal price identification
Each price has an external price ID identification. Unfortunately, we cannot count on price ID uniqueness among multiple entities of the same type. And thus, we require price IDs to be unique only within the same entity, due to business reasons. In real situations, a single price can be part of multiple indexed entities - there are often "virtual products" that aggregate multiple real products, and the client wants to index them both and filter different products in different situations.
In order to unequivocally identify the price, we would need a combination of the entity ID and the price ID - each of integer type (4B). Together they'd form a long type (8B). The original implementation worked with this interpretation, but was discarded, because:
- it led to significantly larger memory consumption
- the RoaringBitmaps working with long types are considerably slower than their integer counterparts
- it required multiple formula implementations (which required much more source code and tests) and complicated the code itself
Therefore, we generate our own internal primitive integer price IDs, which are used for computational and indexing purposes, and can be translated back to the full entity/price combination at any time.
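A sketch of how such an internal price ID registry might look; the names and the use of plain concurrent maps are assumptions of this example.

```java
// Assigns compact internal int price IDs and translates them back to the entity/price pair.
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

record PriceIdentification(int entityPrimaryKey, int priceId) {}

final class InternalPriceIdRegistry {
    private final AtomicInteger sequence = new AtomicInteger();
    private final ConcurrentHashMap<PriceIdentification, Integer> externalToInternal = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<Integer, PriceIdentification> internalToExternal = new ConcurrentHashMap<>();

    // assigns (or reuses) a compact int id that can be used in RoaringBitmaps and formulas
    int getOrAssign(int entityPrimaryKey, int priceId) {
        final PriceIdentification key = new PriceIdentification(entityPrimaryKey, priceId);
        return externalToInternal.computeIfAbsent(key, k -> {
            final int internalId = sequence.incrementAndGet();
            internalToExternal.put(internalId, k);
            return internalId;
        });
    }

    // translates the internal id back to the full entity / price combination
    PriceIdentification resolve(int internalPriceId) {
        return internalToExternal.get(internalPriceId);
    }
}
```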
Unique index
The unique index is used for operations such as:
- equal to
- in set
- is (null / not null)
Filter index
The filter index allows filtering of all filterable but non-unique attributes (there is one filter index per attribute schema). Internally it holds:
- invertedIndex - an inverted index of values (used when the value is a non-Range type)
- rangeIndex - a range index of validity values (used when the value is a Range type)
- valueIndex - an auxiliary map that translates a value to its record index in the histogram in O(1) complexity
The index delegates the key functionality either to a histogram or a range index, depending on the accepted task and attribute type it manages.
Sort index
Entity index
Each entity collection maintains a set of entity indexes that collect all information required to filter and sort the entities, as well as to compute extra results on top of them. All previously documented indexes are part of one of the entity indexes in various forms.
Global entity index
Each entity collection has a single "global" entity index that maintains all the information about all the entities in the collection. Because real-life queries frequently filter by hierarchical placement or by reference to another entity (join), we maintain special "ref" indexes in each entity collection, which maintain information only about the subset of entities that possess such a relation or hierarchy placement. This fact allows us to discard large quantities of data at the top level.
Using the global index for filtering/sorting is a "worst case" scenario - still much better than a full-scan, but it still needs to cope with all the data in the collection in some form or another. There are these flavors of our "ref" entity indexes:
Hierarchy node entity index
It only contains data about entities, which are directly related to the same referenced parent entity identified by the key, which consists of:
- relationName - name of the relation from the entity schema
- referencedPrimaryKey - primary key of the hierarchical entity (parent) to which the index relates
Referenced entity index
It only contains data about entities that are directly related to the same referenced entity identified by the key, which consists of:
- relationName - name of the relation from the entity schema
- referencedPrimaryKey - primary key of the referenced entity to which the index relates
The reference type index contains the same data as the referenced entity index, but instead of using the primary keys of the indexed entities it uses (indexes) the primary keys of the referenced entities of the indexed entities.
Implementation of query algorithms
Stage: query parsing
Phase: query deserialization
The database offers multiple ways (APIs) to pass the input query:
- Java API - used when evitaDB is running as an embedded database
- gRPC API - fast, but still a system API that is used in the Java driver when evitaDB runs in client / server mode. This API can also be used for microservice integration or fast inter-platform communication (for example C# application communicates with evitaDB)
- GraphQL - a rich API for web oriented applications, primarily in the JavaScript domain. This API is designed for ease of use, comprehensibility and querying / passing data using as few request/response turnarounds as possible
- REST API - a rich API for other domains than JavaScript that don't want to use the gRPC style. The API represents a conservative approach, since the REST API, even though it has several disadvantages, is still one of the most used protocols out there
This phase allows for accepting the API call in some form of input query, which is optimal for the protocol being used.
Phase: query parsing
The query in evitaDB is represented by a tree of nested "constraints" divided into three parts:
- filterBy - limits the amount of returned results
- orderBy - defines the order in which the results will be returned
- require - allows passing additional information on how much data the returned entities should contain, how many of them are required and what other computations should occur upon them
The description of query language constraints and possibilities is a part of the developer documentation, and thus is not part of this document.
Phase: query validation
This phase works on an already parsed query syntax tree. In this phase, we try to reveal logical errors in a query that is otherwise syntactically correct.
Stage: query planning
Phase: selecting indexes
Phase: formulating filtering tree
- boolean AND operations estimate their input counts as the lowest input record id count
- the boolean NOT operation estimates the input count as the superset count minus the subtracted count
- other operations estimate their inputs as the sum of all input record id counts
Phase: creating sorter (optional)
In case sorting is specified in the input query, we create an instance of the sorter allowing us to order the entities accordingly. If no sorting is specified, the entities will be returned in their natural order, which is always ascending order by entity primary key.
Phase: prefetch availability analysis
In case there is a selection formula present in the conjunction scope, we estimate the costs of fetching the necessary entity data from the disk and consider whether it wouldn't be more optimal to prefetch part of the entities from the disk and apply the predicate on them, instead of merging thousands of record bitmaps present in the indexes.
A conjunction scope is a part of the formula calculation tree that begins at the root and covers all formulas that are internally combined by conjunction (boolean AND). Let's see an example:
Root of formula calculation tree:
- and
  - attributeEquals('age', 18)
  - entityPrimaryKeyInSet(45, 12, 33, 19)
  - or
    - attributeEquals('sex', 'female')
    - entityPrimaryKeyInSet(66, 78)
We may take a speculative approach and optimistically prefetch those entities upfront, believing that all or most of them would be returned, and filter them using the predicate. This approach might be way faster than combining, let's say, thousands of record ids from the indexes.
During the planning phase, we collect all the information necessary to decide whether the prefetch is worthwhile or not. It may be counterintuitive, but accessing entities by one or by a set of their primary keys is quite a common use case, and this specific handling brings a significant performance boost for these cases.
Phase: preparing extra result producers
The final preparation phase creates instances of extra result producers. The producers capture pointers to all the data necessary to calculate the extra result object requested by the client. The creation is a fast operation, since no computation occurs yet.
Extra results in evitaDB provide access to additional data that is somehow interlinked with the current query result. By calculating both the main result and the additional data in one run, we could take advantage of intermediate sub-results and significantly reduce the computational cost compared to the situation when we would have to issue separate queries for all the data necessary (which is common in the relational databases world).
Some examples of extra results are:
- retrieve parent hierarchy trees for each returned record
- retrieve a set of facets present on the selected records and calculate their counts or selection impact
- compute the value of a histogram for attributes or prices of selected records
The extra result producer may internally manage multiple "computers" that handle the single computational case.
For example, the client asks for histograms of the attributes age, height and width. The server then instantiates a single extra result producer for attribute histogram computation that holds three computer instances:
- a computer for the attribute age histogram with 20 buckets requested
- a computer for the attribute height histogram with 30 buckets requested
- a computer for the attribute width histogram with 30 buckets requested
Stage: query execution
Phase: prefetching entities (optional)
Phase: filter computation
Phase: sorting and slicing
Phase: entity fetching (optional)
Entity fetching occurs only when a client requests any part of the entity body data. In case the client requests only entity primary keys, this step is omitted.
Entity storage decomposition and decoration
- entity body: maintains the primary key, hierarchical placement, a set of entity locales and a set of all available associated data keys
- global attributes: identified by the entity primary key. It contains the locale agnostic key-value tuples of all entity global attributes
- localized attributes: identified by the entity primary key and a locale. It contains locale-specific key-value tuples of attributes for a single locale (there are as many storage parts of this type as there are entity locales)
- associated data: identified by entity primary key and the associated data key (associated data name with optional locale). It contains the value of a single piece of associated data
- prices: identified by the entity primary key. It contains an entity price inner record handling value and all price records of the entity
- references and reference attributes: identified by the entity primary key. It contains the data about all the entity references (reference name, referenced entity primary key, referenced group primary key and all attributes specific for this reference).
Entities might be read and transformed into a full entity or left in a storage-related "binary" form. The gRPC protocol and the Java driver use an "experimental mode", where the unparsed binary entity parts are transported through the gRPC protocol directly to the client, which uses the same Kryo deserializers to reconstruct the entity into a form it can work with. This approach can be used only if the client is Java based, since it can reuse our Kryo deserializer implementations (Kryo is not easily portable to other platforms). On the other hand, this allows bypassing several transformation steps, and we hope it brings some performance gains. This needs to be tested and measured first.
Phase: extra results computation (optional)
Cache
The cache is one of the key components of evitaDB. It provides a common API to cache the results of many expensive operations using common concepts. Making all internal data structures immutable allows us to effectively determine whether a cached result can still be used as the result of our computation, or whether the underlying data has changed.
Cacheable elements
The following data structures are subject to caching:
Formula calculation results
We can say that any part of the formula calculation tree is defined by its structure, the set of constants from the input query and the set of transactional ids associated with the bitmaps at the tree leaves (either directly or transitively). This fact allows us to compute a single hash number recursively from the formula node, taking into account:
- the formula type (class)
- the child formula hashes
- the transactional ids of referenced bitmaps from indexes
By applying the above rules, we can compute, at impressive speeds, a single long number that uniquely identifies each formula in the tree even before its output is calculated.
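A conceptual sketch of the recursive hash computation, assuming a simplified Formula interface; the production code uses a dedicated, faster hash function.

```java
// Recursively combines the formula type, child hashes and referenced bitmap ids into one hash.
import java.util.List;

interface Formula {
    List<Formula> children();
    long[] referencedTransactionalIds(); // ids of the bitmaps the leaf formulas point to
}

final class FormulaHasher {

    static long computeHash(Formula formula) {
        long hash = formula.getClass().getName().hashCode();      // the formula type
        for (Formula child : formula.children()) {
            hash = 31 * hash + computeHash(child);                // the child formula hashes
        }
        for (long transactionalId : formula.referencedTransactionalIds()) {
            hash = 31 * hash + Long.hashCode(transactionalId);    // referenced bitmap ids
        }
        return hash;
    }
}
```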
Extra results
The hash of the computer instance is computed from:
- the computer type (class)
- the constants from require constraint
- the transactional ids of referenced bitmaps from indexes
- the hash of the referenced formula or its part
Entities
The cache can also hold entity bodies that are frequently fetched from the disk storage. The problem with the entity is that it is composed of multiple "storage parts", which can be modified and fetched separately. Multiple readers may read different parts of the entity with different query parameters, and that means that the returned entity is expected to contain a different structure for each one of the clients.
Let's discuss each of these problems separately. First, the problem with different views on the same entity.
Cache filling and invalidation
The process of filling and invalidating cache needs to handle the following actions:
- it needs to keep track of all the cacheable elements that are requested by the clients and count their repeated usage
- it needs to decide which of these elements are worth caching and propagate them to the cache
- it needs to invalidate / remove the existing cached elements that are no longer worth keeping there
- it needs to fit into the space allocated to the cache to avoid Java OutOfMemory situations
Anteroom
When the anteroom is asked for a cached value, it first asks Eden whether there is a cached record for the element (the key is a long produced by the hash function). If there is, it just increments its usage count and returns it. If there is no cached record, it looks into its own space (a concurrent hash map) to see whether there is an adept record for this element. If there is, the original element is returned without change. If there is no adept present, it clones the element on the input and adds to it a new "onCompute" callback that will register a new adept in the anteroom once the element is computed/fetched.
The adept record carries the following data:
- long(8B) recordHash - product of the hash function for the cacheable element
- long(8B) costToPerformanceRatio - key part for computing the worthiness of the cacheable element
- int(4B) sizeInBytes - estimated size of the object in memory in bytes
- int(12B + 4B) timesUsed - AtomicInteger that tracks number of usages of the record adept
As you can see, the cache adept doesn't keep the result data - only the meta-information about them, so it's fairly small - 100k cache adepts represent around 8MB of memory (including size estimation for concurrent hash map carrier).
There is a chance that the anteroom will fill up faster than Eden can keep up with the adepts' recalculations. Because the collection for re-evaluation is held in an AtomicReference and Eden's recalculation is never executed in parallel (it figures as a single task in the SchedulerExecutor), when such a situation occurs the newer collection simply replaces the previous one, causing the previous collection to be ignored. Nonetheless, this fact is reported in the metrics, because it probably means that the cache is not correctly configured or the system just can't keep up.
Eden
- long(8B) recordHash - a product of the hash function for the cacheable element
- long(8B) costToPerformanceRatio - a key part for computing the worthiness of a cacheable element
- int(4B) sizeInBytes - an estimated size of the object in memory in Bytes
- int(12B + 4B) timesUsed - an AtomicInteger that tracks number of usages of the record adept
- int(12B + 4B) cooling - an AtomicInteger that tracks number of iterations, where this element is not worthy enough to stay in cache
- pointer payload - the cached result payload
Cache re-evaluation
Eden first combines the passed adept collection with the existing cached records into one big array. It excludes all adepts whose size in bytes exceeds the maximal allowed size or whose usage count is lower than the minimal usage threshold. When the usage count is zero for an existing cached record, it is not immediately removed from the considered list, but its "coolness" factor gets increased. Only after this "coolness" factor exceeds the configured threshold is the cached record removed from the cache.
When the collection of the competing adept records is finished, it is sorted by the worthiness value in descending order and iterated. The sum of all adept record expected memory requirements is calculated during the iteration, and when the size reaches the maximal allowed limit, the current index in the adept array is remembered. All cached records after that index are removed from the cache, and all non-cached records before that index are inserted into the cache.
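A sketch of this selection step - candidates ordered by worthiness are admitted until the configured memory budget is exhausted; the record shape is simplified.

```java
// Keeps the most worthwhile candidates that fit into the cache memory budget.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

record CacheCandidate(long recordHash, long worthiness, int sizeInBytes) {}

final class CacheReevaluation {

    static List<CacheCandidate> selectRecordsToKeep(List<CacheCandidate> candidates, long maxSizeInBytes) {
        final List<CacheCandidate> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingLong(CacheCandidate::worthiness).reversed());

        final List<CacheCandidate> keep = new ArrayList<>();
        long usedBytes = 0;
        for (CacheCandidate candidate : sorted) {
            if (usedBytes + candidate.sizeInBytes() > maxSizeInBytes) {
                break; // everything from this index on is evicted or never admitted to the cache
            }
            usedBytes += candidate.sizeInBytes();
            keep.add(candidate);
        }
        return keep;
    }
}
```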
New records that are freshly inserted into the cache contain no payload - i.e. the computed results that could be used for that cached record in query processing. They only contain the estimated size of such a result, which was observed when the adept was registered by the anteroom. For initializing the payload, we use a similar mechanism as the anteroom does for initializing the estimated size. On the first use of the cached record, we only clone the input cacheable element and add to it a new "onCompute" callback that will fill the cached record payload once the element is computed/fetched.
Cache worthiness calculation functions
Invalidation
All cacheable elements and their input data are immutable. Bitmaps in indexes carry their transactional ids and when their data are changed and committed, a new bitmap is created with the new transactional id. There is also a new instance of the parent index, that points to that new bitmap and all other non-changed bitmaps (whose instances can be reused) and which is propagated to the new version of the entire catalog. So, a new transaction that issues the exact same query to the catalog will have the same formula calculation tree except for the one leaf formula, that will now point to the new bitmap with the new transactional id. This fact will cause the hash of the entire formula to differ from the previous one.
Similar rules apply to entity bodies - bodies are stored in immutable parts. When an entity changes, the modified new versions of their modified parts will be created and stored. The actual versions of the entity storage parts can be checked against mem-table index at any time, and we can safely verify whether the value in cache is a valid one.
This fact is the basis for automatic invalidation. If the data is modified, new queries will try to fetch the results by different hash values, due to the change in the transactional ids of the referenced bitmaps. The old cached values will stop being retrieved, and their worthiness decreases along with their usage; soon they will fail to reach the threshold necessary to be kept in the cache. They may also fail to reach the minimal usage threshold sooner, so their "cooling period" might kick in faster (and when they're cool enough, they're removed from the cache).
Cache persistence (future work)
In the case of a database restart, the newly started instance would start with the empty cache that could be fully warmed up after multiple iterations of adept evaluation that could take minutes. But, we could also serialize the cache along with the catalog data into a separate file and reconstruct it immediately after database restart. Loading the cache would add a few seconds to the start of the database (it depends on the size of the cache), but could immediately save a lot of computational resources that would be otherwise wasted for computing the results again.
External cache support (future work)
Consider a single product that is sold out and shouldn't be displayed to users anymore. If such a product was the last one in its category, the category itself should stop being displayed, and the same goes for the brand in the brand listing (the product might have been the only product of its manufacturer). The product might have been among the best-selling products of the main category, so it should stop being displayed there as well, or it might be referenced by other products as their alternative or spare part. As you can see, a tiny change can have big consequences on the site, and that's why caching is a hard task for e-commerce sites.
What if the database could proactively invalidate the external cache - or mark statically generated pages as obsolete? Would it open up a possibility to use statically generated resources a little more?
evitaDB could use a system similar to the one it uses for the internal cache for the external one as well. The external system would generate a unique random token at the start of page rendering and associate it with all the queries issued while the page is rendered. evitaDB would record all bitmap transaction ids associated with such a token. When transactions containing updates are committed and a bitmap with an assigned transaction id is about to be garbage collected, evitaDB could retrieve all tokens associated with this transaction id and invalidate them.
We don't know whether this concept would be feasible in practice, and it would require a lot of additional proofing and optimization. It may turn out that the cardinality of the transactional ids per token/page is too high, or that a single bitmap invalidation leads to more page invalidations than is practical.
Query evaluation in detail
Boolean algebra
There are two main reasons why we chose this library:
- it allows us to store int arrays in a more compressed format than a plain array of primitive integers,
- and it contains algorithms for fast boolean operations on these integer sets
The main prerequisite is that the integers are sequence based - i.e., there is a high probability that the integers in the array will be close to one another. This prerequisite holds for evitaDB datasets because we require each entity to have a unique int primary key that is generated either by our internal sequence or by an external one.
Conjunction (AND) / disjunction (OR)
- AND 116094.151 operations / sec
- OR 80640.199 operations / sec
Negation (NOT)
Our performance tests show the following results for two arrays of 100k pseudorandom integers that form a monotonic sequence (a short usage sketch of these operations follows the results):
- NOT 148431.438 operations / sec
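These operations map directly onto the library's static helpers; a minimal usage sketch with hypothetical record ID values (assuming the org.roaringbitmap artifact on the classpath):

```java
import org.roaringbitmap.RoaringBitmap;

public class RoaringBooleanOps {
	public static void main(String[] args) {
		RoaringBitmap categoryHits = RoaringBitmap.bitmapOf(1, 2, 3, 5, 8);
		RoaringBitmap stockHits = RoaringBitmap.bitmapOf(2, 3, 5, 7);

		RoaringBitmap and = RoaringBitmap.and(categoryHits, stockHits);     // {2,3,5}
		RoaringBitmap or = RoaringBitmap.or(categoryHits, stockHits);       // {1,2,3,5,7,8}
		RoaringBitmap not = RoaringBitmap.andNot(categoryHits, stockHits);  // {1,8} - conjunction with a negated operand

		System.out.println(and + " " + or + " " + not);
	}
}
```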
String queries on attributes (starts with / ends with / contains)
We need to introduce new data structures for these operations - probably compressed tries that would let us store the values more efficiently and also allow much faster lookups of the appropriate bitmaps.
Comparator queries on attributes (equals / in set / greater / lesser / between)
The comparator queries are applicable to non-range types only (either simple data types or arrays). Each constraint requires a slightly different approach (a simplified lookup sketch follows this list):
- equals - requires a single O(1) lookup in the filter index to retrieve the appropriate element index in the inverted index and then retrieving the bitmap of all record ids at that index; equals handling is therefore pretty efficient
- in set - requires multiple applications of the process described for equals
- greater than / lesser than - uses the inverted index directly: it applies a binary search on its values to identify the element index that represents the threshold for the results (either exclusive or inclusive, depending on the constraint type), then collects all bitmaps of record ids before / after the threshold and joins them using an OR formula
- between - applies the process described for the greater and lesser constraints: it performs two separate lookups into the inverted index and joins the bitmaps of record ids between the two found thresholds using an OR formula
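A simplified sketch of the lookups described above; the InvertedIndexSketch layout (sorted distinct values, each owning a bitmap of record IDs) is a hypothetical stand-in for the real index classes, and equals uses a binary search where the real filter index offers an O(1) lookup:

```java
import java.util.Arrays;
import org.roaringbitmap.RoaringBitmap;

// Hypothetical, simplified inverted index: values are kept sorted, each value owns a bitmap of record IDs.
final class InvertedIndexSketch {
	private final long[] sortedValues;          // distinct attribute values in ascending order
	private final RoaringBitmap[] recordIds;    // recordIds[i] = records having sortedValues[i]

	InvertedIndexSketch(long[] sortedValues, RoaringBitmap[] recordIds) {
		this.sortedValues = sortedValues;
		this.recordIds = recordIds;
	}

	/** equals: locate the value (here via binary search), then return the bitmap on that position. */
	RoaringBitmap equalTo(long value) {
		int index = Arrays.binarySearch(sortedValues, value);
		return index >= 0 ? recordIds[index] : new RoaringBitmap();
	}

	/** greater than (exclusive variant): binary search for the threshold, then OR all bitmaps after it. */
	RoaringBitmap greaterThan(long threshold) {
		int index = Arrays.binarySearch(sortedValues, threshold);
		int firstIncluded = index >= 0 ? index + 1 : -index - 1;
		RoaringBitmap result = new RoaringBitmap();
		for (int i = firstIncluded; i < sortedValues.length; i++) {
			result.or(recordIds[i]);
		}
		return result;
	}

	/** between (inclusive): locate the lower threshold and OR the bitmaps up to the upper threshold. */
	RoaringBitmap between(long from, long to) {
		int fromIndex = Arrays.binarySearch(sortedValues, from);
		int firstIncluded = fromIndex >= 0 ? fromIndex : -fromIndex - 1;
		RoaringBitmap result = new RoaringBitmap();
		for (int i = firstIncluded; i < sortedValues.length && sortedValues[i] <= to; i++) {
			result.or(recordIds[i]);
		}
		return result;
	}
}
```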
Range queries on attributes (equals / in set / greater / lesser / between / in range)
Within query
The within query returns all primary keys that have a range enveloping (inclusively) the threshold stated in the query. In other words, it returns all records that are valid at a certain moment in time.
This query is computed in the following way:
- all the record IDs that start and end before the threshold (inclusive) are collected into a dual array of bitmaps: array of recordIdsThatStartBefore and array of recordIdsThatEndBefore
- each of the two arrays of bitmaps (the bitmaps contain only distinct record ids) is merged into a sorted array with possible duplicates using a JOIN formula, producing two bitmaps: recordIdsThatStartBefore and recordIdsThatEndBefore
- these two bitmaps are merged together using a DISENTANGLE formula, which eliminates duplicate record IDs on the same positions in both of the bitmaps and produces a single RoaringBitmap with distinct record ids - this product represents all the record IDs that possess at least one range starting before (inclusive) the threshold and not ending before it as well - let's name this product recordIdsThatStartBeforeWithElimination
- all the record IDs that start and end after the threshold (inclusive) are collected into a dual array of bitmaps: array of recordIdsThatStartAfter and array of recordIdsThatEndAfter
- each of the two arrays of bitmaps (the bitmaps contain only distinct record ids) is merged into a sorted array with possible duplicates using a JOIN formula, producing two bitmaps: recordIdsThatStartAfter and recordIdsThatEndAfter
- these two bitmaps are merged together using a DISENTANGLE formula, that eliminates duplicate record IDs on the same positions in both bitmaps and produces a single RoaringBitmap with distinct record ids - this product represents all the record IDs that possess at least one range ending after (inclusive) the threshold and not starting after it as well - let's name this product as recordIdsThatEndAfterWithElimination
- now we can simply apply a RoaringBitmap conjunction (boolean AND) operation on both of the bitmaps, recordIdsThatStartBeforeWithElimination and recordIdsThatEndAfterWithElimination, and retrieve only the record IDs, in which at least one range contains the specified threshold
This query is manifested by a constraint targeting Range type:
- price valid in
- in range
Overlap query
The overlap query returns all primary keys whose range overlaps (boundaries included) the specified range. In other words, it returns all records that were valid during a certain period of time (at least partially).
This query is computed in the following way:
- we compute the result the same way as for the within query and name it recordIdsOverlappingWithElimination, with these exceptions:
- for the computation of recordIdsThatStartBeforeWithElimination, we use the start threshold of the input range as the threshold
- for the computation of recordIdsThatEndAfterWithElimination, we use the end threshold of the input range as the threshold
- we collect all the record IDs that start within the start and end threshold of the input range and name them recordIdsStarting
- we collect all the record IDs that end within the start and end threshold of the input range and name them recordIdsEnding
- we compute the disjunction (with a boolean OR) for recordIdsOverlappingWithElimination, recordIdsStarting and recordIdsEnding
This query is manifested by this constraint targeting Range type:
- between
Join formula
This formula accepts a set of bitmaps with distinct record ids and produces a bitmap with possibly duplicate record ids that still maintain ascending order.
Input bitmaps:
produce following result for JOIN operation:
This way, we combine the record IDs from all input bitmaps into a single ordered array that may contain duplicates.
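A sketch of the JOIN merge with hypothetical input values (the real formula merges the already sorted inputs in a single pass; sorting the concatenation shown here yields the same ordered-with-duplicates product):

```java
import java.util.Arrays;
import org.roaringbitmap.RoaringBitmap;

// Sketch of the JOIN formula: merges several bitmaps of distinct record IDs into one
// ascending int array that keeps duplicates.
final class JoinFormulaSketch {

	static int[] join(RoaringBitmap... bitmaps) {
		int totalSize = 0;
		for (RoaringBitmap bitmap : bitmaps) {
			totalSize += bitmap.getCardinality();
		}
		int[] result = new int[totalSize];
		int offset = 0;
		for (RoaringBitmap bitmap : bitmaps) {
			for (int recordId : bitmap.toArray()) {
				result[offset++] = recordId;
			}
		}
		Arrays.sort(result);
		return result;
	}

	public static void main(String[] args) {
		// hypothetical input bitmaps
		int[] joined = join(
			RoaringBitmap.bitmapOf(1, 3, 5),
			RoaringBitmap.bitmapOf(3, 4, 5)
		);
		System.out.println(Arrays.toString(joined)); // [1, 3, 3, 4, 5, 5]
	}
}
```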
Disentangle formula
Input bitmaps:
produce output:
The algorithm compares the record IDs at the current positions of both bitmaps and skips both when they are equal. Then it picks the next pair and compares again. The second bitmap's pointer advances only when it encounters an ID that is less than or equal to the record ID from the first bitmap.
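A sketch of the described two-pointer algorithm with hypothetical JOIN products as input (not the actual formula implementation):

```java
import org.roaringbitmap.RoaringBitmap;

// Sketch of the DISENTANGLE formula working on the ordered-with-duplicates products of JOIN.
// Record IDs from the first array that are paired with an equal ID in the second array are skipped;
// the rest are collected into a distinct RoaringBitmap.
final class DisentangleFormulaSketch {

	static RoaringBitmap disentangle(int[] first, int[] second) {
		RoaringBitmap result = new RoaringBitmap();
		int i = 0;
		int j = 0;
		while (i < first.length) {
			if (j < second.length && first[i] == second[j]) {
				// equal IDs on the current positions cancel each other out
				i++;
				j++;
			} else if (j < second.length && second[j] < first[i]) {
				// the second pointer advances only while its ID is lower than (or equal to) the first one
				j++;
			} else {
				// no counterpart in the second array - keep the record ID
				result.add(first[i]);
				i++;
			}
		}
		return result;
	}

	public static void main(String[] args) {
		// hypothetical products of the JOIN formula
		int[] startsBefore = {1, 1, 2, 3};
		int[] endsBefore = {1, 3};
		System.out.println(disentangle(startsBefore, endsBefore)); // {1,2}
	}
}
```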
Queries on reference attributes (reference having)
Hierarchical queries (within / within root)
Facet queries
The facet query handling is special in the sense that it produces different formulas depending on the require constraints:
- facet groups negation
- facet groups conjunction
- facet groups disjunction
Price queries
There are the following price-related constraints:
- price in currency
- price in price lists
- price valid in
- price between
Simple price evaluation
No price valid in constraint
Price valid in present
Translates to the following formula calculation tree:
Price between present
Inner record related price evaluation
Find first selling price calculation
The price inner record handling mode "FIRST_OCCURRENCE" selects a single price for each inner record ID (which means that a single entity primary key may have one or more price records to work with). The price with the lowest amount is then selected as the representative price of the entity primary key for the sake of sorting or evaluating the price between predicate.
Sum price calculation
The price inner record handling mode "SUM" selects a single price for each inner record ID (again, a single entity primary key may have one or more price records to work with) and creates a new virtual price as the sum of their amounts. This virtual price is then used as the representative price of the entity primary key for the sake of sorting or evaluating the price between predicate.
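A minimal sketch of both inner-record price calculations described above, using a hypothetical, simplified PriceRecord representation (the real implementation works with flat primitive structures and price list priorities resolved from the query):

```java
import java.math.BigDecimal;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

// Hypothetical, simplified price representation used only for this illustration.
record PriceRecord(int entityPrimaryKey, int innerRecordId, int priceListPriority, BigDecimal amount) {}

final class InnerRecordPriceSketch {

	/** Selects one price per inner record ID - the price from the most prioritized price list wins. */
	static Map<Integer, PriceRecord> pricePerInnerRecord(List<PriceRecord> entityPrices) {
		return entityPrices.stream().collect(Collectors.toMap(
			PriceRecord::innerRecordId,
			price -> price,
			(a, b) -> a.priceListPriority() <= b.priceListPriority() ? a : b
		));
	}

	/** FIRST_OCCURRENCE: the representative price of the entity is the cheapest of the selected prices. */
	static Optional<BigDecimal> firstOccurrence(List<PriceRecord> entityPrices) {
		return pricePerInnerRecord(entityPrices).values().stream()
			.map(PriceRecord::amount)
			.min(Comparator.naturalOrder());
	}

	/** SUM: the representative price of the entity is a virtual price summing all selected prices. */
	static BigDecimal sum(List<PriceRecord> entityPrices) {
		return pricePerInnerRecord(entityPrices).values().stream()
			.map(PriceRecord::amount)
			.reduce(BigDecimal.ZERO, BigDecimal::add);
	}
}
```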
Complex price query
Sorting
Presorted record sorter
We've written a performance test that compares quicksort on 100k pseudorandom strings of 10 to 50 characters with the pre-sorted approach, with the following results:
- QuickSortOrPresort.quickSort thrpt 17.576 ops/s
- QuickSortOrPresort.preSort thrpt 226.016 ops/s
We suspect that this sorting algorithm is still suboptimal and more research is needed here.
Random sorter
The random sorter takes an array of sorted record IDs, and for each record ID in the currently returned slice/page it randomly picks a number >= 0 and < the record ID count. Then it swaps the record ID at the current index with the record ID at the randomly chosen index. This produces a randomly sorted record set.
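A minimal sketch of this randomization, with hypothetical slice parameters:

```java
import java.util.Arrays;
import java.util.Random;

// Sketch of the random sorter: for every index of the requested slice, the record ID on that index
// is swapped with a record ID on a randomly chosen index of the whole array.
final class RandomSorterSketch {

	static int[] randomizeSlice(int[] sortedRecordIds, int sliceOffset, int sliceLength, long seed) {
		int[] workingCopy = sortedRecordIds.clone();
		Random random = new Random(seed);
		int sliceEnd = Math.min(sliceOffset + sliceLength, workingCopy.length);
		for (int index = sliceOffset; index < sliceEnd; index++) {
			int randomIndex = random.nextInt(workingCopy.length); // >= 0 and < record ID count
			int tmp = workingCopy[index];
			workingCopy[index] = workingCopy[randomIndex];
			workingCopy[randomIndex] = tmp;
		}
		// only the randomized slice is returned to the caller
		return Arrays.copyOfRange(workingCopy, sliceOffset, sliceEnd);
	}
}
```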
Price sorter
The price sorter must apply full sorting in real time; we found no data structure that would enable a different approach. The problem is that the client specifies multiple price lists in the input query and their order represents their priority. The price found in the foremost price list is used and prices in the following price lists are ignored. The logic is too complex to prepare any kind of presorted result, and the price cardinality is also very high.
When the prices are sorted, the result is sliced to a particular viewport and entity primary keys are retrieved from the price records - i.e., the price records are translated into entity record IDs.
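A simplified sketch of the final sort-slice-translate step, with a hypothetical SellingPrice record standing in for the real flat price structures:

```java
import java.math.BigDecimal;
import java.util.Comparator;
import java.util.List;

// Hypothetical flattened price record used only for illustration of the price sorter.
record SellingPrice(int entityPrimaryKey, BigDecimal amount) {}

final class PriceSorterSketch {

	/** Sorts the selected selling prices, slices the requested page and translates prices to entity PKs. */
	static int[] sortAndSlice(List<SellingPrice> sellingPrices, int pageOffset, int pageSize) {
		return sellingPrices.stream()
			.sorted(Comparator.comparing(SellingPrice::amount))
			.skip(pageOffset)
			.limit(pageSize)
			.mapToInt(SellingPrice::entityPrimaryKey)
			.toArray();
	}
}
```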
Extra result computation in detail
Extra results bring additional information along with the primary result of the query. Extra result computation can take advantage of the sub-results already computed in the formula calculation tree, which reduces the total load on the server and the latency compared to the load and the summed latencies of separate server requests.
Facet summary
Count overview
The added formula must respect these requirements in constraints:
- facet groups negation
- facet groups conjunction
- facet groups disjunction
What is important in this derived formula calculation process is the fact that the new tree carries along all the memoized sub-results from the original formula calculation tree and that for each facet (and there may be hundreds of them) only a part of the formula tree needs to be recalculated.
Facet computation performance optimization
Even if the original formula calculation tree sub-results are preserved, there may still be a lot of computation happening. Consider the following tree composition:
Even if we have the results of formulas A, B and C memoized and we only modify the contents of formula D and the enveloping AND container, we need to process 650k integers over and over. But if we first reconstruct the formula calculation tree into an optimized form, such as:
We use this formula calculation tree as the base tree from which the alternate facet trees are derived. That way we can reuse the memoized result of the new AND formula, which consists of only 75k records, and process only 125k integers for each facet calculation. This tiny change saves 52.5 million integers, on a mere 100 facets, that don't need to be processed at all!
Impact overview
The added formula must respect these requirements in constraints:
- facet groups negation
- facet groups conjunction
- facet groups disjunction
So, when the FacetGroupFormula (a container of inner formulas related to the facet selection for the same reference name and group id) is not yet present in the formula calculation tree and a new one is appended, a proper composition must be created.
The following compositions may occur when placing a new facet ID in the disjunction form into the existing formula calculation tree:
which will be transformed to:
which will be transformed to:
which will be transformed to:
The transformation process also needs to cope with complicated compositions when multiple source indexes are used - in such a case, the above-mentioned composition is nested within the OR containers that combine results from the multiple source indexes.
Hierarchy statistics
The hierarchy statistics producer computes the cardinalities of entities that match the passed query for each hierarchy node of the specified reference.
Histograms
Attribute histogram
Price histogram
Parents
The parents producer adds an array of parent IDs for each entity returned in the response (taking pagination into account). In practice, this feature is used for rendering breadcrumb navigation menus.
Obstacles encountered
When the work is done, a lot of things seem simple and clear, but the path to the final version was often more complicated than necessary and involved venturing into a lot of dead ends. We probably wrote twice as much code as is currently present in the main branch.
Kryo streams
Output
Input
Roaring bitmap
Honestly, RoaringBitmap is a fabulous piece of software performance-wise. It replaced our original implementation and brought us a two-orders-of-magnitude speed-up. We really are building on the shoulders of giants here.
Immutability
If a reference to the original RoaringBitmap leaks, a mutation can still occur, and since RoaringBitmap is not thread-safe, concurrency-related problems manifest as weird results (there is no safety net such as a ConcurrentModificationException).
Missing partial evaluation
The next problem with this approach is that the most complex queries in e-commerce include sorting, and sorting inherently requires calculating all filtered elements - even the entity with the highest primary key (i.e., the last entity primary key in the filtering result bitmap) may come out as the first element of a result sorted by a different property. In this case, calculating the entire result in one pass is more efficient than an iterative approach.
Long bitmaps memory exhaustion
This approach had three major disadvantages:
- the memory requirements more than doubled for data sets with high price cardinality (we needed 24GB of RAM for the same data set that we now fit into 8GB, though other optimizations also contributed)
- the RoaringBitmap performance went down considerably
- working with the long type required a lot of duplication of the data structures we worked with - because we stick to the primitive types, many classes (and tests) needed to be duplicated just for changing the primitive type (generics are not usable for primitive types in Java)
The first problem is to be expected; the second one is not so obvious. RoaringBitmap works best when the numbers are close to one another, and when we composed a long from integers that were "close to one another" in their respective "namespaces", we created longs that were very distant from each other. RoaringBitmap couldn't shine with such a series of numbers. Introducing internal price IDs resolved both of these problems.
Performance degradation due to repeated ID translations
The pricing computation was one of the hardest parts of query processing. The formula calculation tree can get pretty complex here, and for proper price inner record handling we have to take the inner record ID aggregation into account. The first versions of this calculation worked with the following multi-step ID translation mechanism:
This took many CPU cycles and brought non-essential complexity into the computation. We went through a number of rounds of cutting the computational tree down to a bare minimum.
Tuning transactional memory
All generational random tests follow a similar structure. First, we create a random instance with a seed that is remembered. Then we create a random initial data set using that random source. Both the tested transactional data structure and the reference "stable" data structure are fed with that initial random data. Then, in an infinite loop, we repeat this sequence:
- randomly select one of the available operations on the transactional data structure
- apply the operation with random inputs both on the transactional data structure and the reference Java data structure
- repeat steps 1 and 2 a random number of times
- commit the transaction
- verify that the committed data structure and Java data structure remain equivalent
- create new empty instances of both data structures, feed them with the products of step 5, and repeat the entire iteration
We also maintain a StringBuilder into which we print the source code corresponding to the described process with the generated random values. The StringBuilder is emptied at step 5 and starts from the beginning. In case of an error, this allows us to easily reproduce the problem: we can capture the contents of the StringBuilder, create a new JUnit test case based on the source in it, and have the problem reproduced.
The rollback doesn't need to be tested because it involves only dropping the change set that is held in a separate layer over the immutable data structure. Because the change layer is isolated, the rollback never represented a source of any error.
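A simplified sketch of one round of such a generational random test; TransactionalMap is a hypothetical stand-in for the tested transactional data structures and TreeMap plays the role of the reference implementation:

```java
import java.util.Random;
import java.util.TreeMap;

final class GenerationalRandomTestSketch {

	static void runRound(long seed, TransactionalMap tested, TreeMap<Integer, String> reference) {
		Random random = new Random(seed); // the seed is remembered so any failure can be replayed
		StringBuilder reproductionCode = new StringBuilder();

		int operationCount = 1 + random.nextInt(100);
		for (int i = 0; i < operationCount; i++) {
			int key = random.nextInt(1_000);
			if (random.nextBoolean()) {
				String value = "value-" + random.nextInt(1_000);
				tested.put(key, value);
				reference.put(key, value);
				reproductionCode.append("map.put(").append(key).append(", \"").append(value).append("\");\n");
			} else {
				tested.remove(key);
				reference.remove(key);
				reproductionCode.append("map.remove(").append(key).append(");\n");
			}
		}

		TreeMap<Integer, String> committed = tested.commit();
		if (!committed.equals(reference)) {
			// the captured source code can be pasted into a new JUnit test to reproduce the failure
			throw new AssertionError("Data structures diverged! Reproduction code:\n" + reproductionCode);
		}
	}

	// hypothetical minimal interface of the transactional structure under test
	interface TransactionalMap {
		void put(int key, String value);
		void remove(int key);
		TreeMap<Integer, String> commit();
	}
}
```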
Opportunities for improvement
Well, there are way too many of them - more than we know of and more than we mention in the following paragraphs. I personally have been in the business for more than 20 years, and I'm still learning a lot and still know too little.
Slow formulas improvement
Join / disentangle formulas
The join / disentangle formulas are used to evaluate range queries against entity data that might contain multiple non-overlapping ranges. We need to evaluate whether at least one range of the entity meets our constraint and discard the other ranges that are irrelevant to the query. Maybe there is a cleverer approach to this kind of problem than our join / disentangle combination.
The current implementation works, but it's not as fast as we would like it to be. Our advantage is a cache that captures the calculations of repeated range queries and dilutes the negative impact of these formulas.
Price termination / translation formulas
There is a lot of logic implemented in the price-related formulas, which means there might be a lot of room for improvement. Early concepts of evitaDB worked with priorities assigned directly to prices during indexing. This allowed better indexes to be prepared for querying, but it also complicated changes of the price list priorities, which were the main source of the price priorities in the first place. A single change of a price list priority led to tens of thousands of price updates in the indexes, and we realized that this approach is unmanageable. That's why the priority of prices is now controlled by the order of price list names mentioned in the input filtering constraint.
The price calculation logic (except for the "none" price inner record handling mode) is quite complex and hard to optimize. Besides producing the output of entity primary keys, we also need to filter out the associated price records, so the cost is effectively paid twice.
What we do know is that each optimization in these formulas leads to huge gains in overall performance on datasets with high price cardinalities.
Roaring bitmap range indexes
The change is not trivial since it includes changes in indexes, their transactional memory support and their serialization and deserialization. That's why we did not include RangeBitmap in this version of evitaDB, but it could bring a lot of calculation savings in the future.
Roaring bitmaps binary search / mask extension
We use a masking technique to filter out records from the pre-sorted array that are not part of the filtered result set. We need to hold extra data structures that allow us to quickly find the positions of the records in the pre-sorted array - structures that would be completely unnecessary if RoaringBitmap supported returning the indexes of values matching the boolean conjunction (AND) operation.
Currently, RoaringBitmap can perform the conjunction operation with the following output:
It would help us greatly if it could return only indexes of the matching records from the first bitmap like this:
This would allow us to look up positions in the pre-sorted array directly, without maintaining another, otherwise useless, array data structure.
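An illustration of the wish with hypothetical values - the index-returning variant does not exist in the RoaringBitmap API, so today we have to emulate it (e.g., via rank()) or keep extra structures:

```java
import org.roaringbitmap.RoaringBitmap;

public class ConjunctionIndexWish {
	public static void main(String[] args) {
		// hypothetical bitmaps
		RoaringBitmap first = RoaringBitmap.bitmapOf(3, 8, 12, 20, 45);
		RoaringBitmap second = RoaringBitmap.bitmapOf(8, 20, 99);

		// what the library offers today: the conjunction returns the matching values
		RoaringBitmap matches = RoaringBitmap.and(first, second);
		System.out.println(matches); // {8,20}

		// what we would like to get instead: positions of the matches inside the first bitmap,
		// i.e. 1 and 3 - emulated here via rank()
		for (int value : matches.toArray()) {
			System.out.print((first.rank(value) - 1) + " "); // prints: 1 3
		}
	}
}
```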
Vectorization
Project Valhalla and Lilliput
The amount of RAM available to rent from cloud providers for the same price grows each year, and when we factor in that the memory requirements of the Java ecosystem get lower over time, the costs of in-memory Java databases might become bearable even for large data sets.
Fulltext search
We haven't yet stepped into this territory, and we're still evaluating whether the incorporation of the Lucene engine is a good way for us to go. The Lucene engine has its own requirements for the storage layer and incorporates its own view on transactions which we would have to align with ours. But it's the state of the art in the field of fulltext search and many people are comfortable working with it. So, there are pros and cons to consider. Fulltext is a crucial part of the e-commerce ecosystem, so we will have to incorporate it one way or the other.
Currently, we're working around this problem on the client level (outside evitaDB) by first selecting the records with a fulltext engine and then passing the set of selected entity primary keys to an evitaDB query that computes the final result along with facets, parents and all the necessary information. This integration is, however, far from ideal.
Personalization
This area should be explored more in the future to enable personalization in the evitaDB but was not in the scope of the current project.
Graal - native image compilation
Recent developments in the Java ecosystem (thanks to another Czech, Jaroslav Tulach) allow compiling Java source code directly into native binary code. The compiled code bypasses the byte-code intermediate step the Java ecosystem introduced years ago and uses the native code of the target platform.
The native compilation offers faster startup time and lower memory footprint than the regular Java byte code with JIT. We don't know yet how Graal native image compilation would affect the evitaDB performance and memory consumption, but it's worth trying it.
Startup times might become crucial in situations when a new replica starts from scratch and tries to catch up with another node's state to take over part of the traffic as soon as possible.