Data model

This article describes the structure of the database entity (the equivalent of a record in a relational database or a document in some NoSQL databases). Understanding the entity structure is crucial for working with evitaDB.

Terms used in this document

facet: A facet is a property of an entity used for quick filtering by the user. It is displayed as a checkbox in the filter bar, or as a slider when there is a large number of distinct numerical values. Facets help the customer narrow down the current category listing, manufacturer listing, or full-text search results. Without them, the customer would have to go through dozens of pages of results and would probably be forced to look for a subcategory or a better search phrase, which is frustrating. With a few clicks on facets, the user can narrow the results down to the relevant ones. The key is to provide enough information to guide the user toward the most relevant facet combinations. It is very helpful to disable facets as soon as selecting them would return no results, or even to inform the user that selecting a particular facet would narrow the results to very few records and severely limit their freedom of choice.
facet group: A facet group groups facets of the same type and controls the mechanism of facet filtering. It defines whether the facets in the group are combined with a Boolean OR or AND relation when used in filtering, and how the group is combined with other facet groups in the same query (i.e. AND, OR, NOT). This Boolean logic affects the calculation of facet statistics and is a crucial part of facet evaluation.
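
The Boolean relations of facet groups can be illustrated with a small sketch. It is not evitaDB's implementation: the class `FacetFilterSketch`, the `GroupRelation` enum and both methods are hypothetical, plain `java.util.Set` stands in for the database indexes, and the groups are assumed to be combined with AND.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative only: none of these names belong to the evitaDB API. */
final class FacetFilterSketch {

    /** How the selected facets inside one group are combined. */
    enum GroupRelation { OR, AND }

    /**
     * Returns the primary keys of entities matching the selected facets of one group.
     * facetIndex maps a facet id to the primary keys of the entities carrying that facet.
     */
    static Set<Integer> matchGroup(Map<Integer, Set<Integer>> facetIndex,
                                   List<Integer> selectedFacets,
                                   GroupRelation relation) {
        Set<Integer> result = null;
        for (Integer facetId : selectedFacets) {
            Set<Integer> entities = facetIndex.getOrDefault(facetId, Set.of());
            if (result == null) {
                result = new HashSet<>(entities);   // the first facet seeds the result
            } else if (relation == GroupRelation.OR) {
                result.addAll(entities);            // OR = union
            } else {
                result.retainAll(entities);         // AND = intersection
            }
        }
        return result == null ? new HashSet<>() : result;
    }

    /** Combines the per-group results with AND (an assumed relation between groups). */
    static Set<Integer> combineGroups(List<Set<Integer>> groupResults) {
        if (groupResults.isEmpty()) {
            return new HashSet<>();
        }
        Set<Integer> result = new HashSet<>(groupResults.get(0));
        for (Set<Integer> groupResult : groupResults.subList(1, groupResults.size())) {
            result.retainAll(groupResult);
        }
        return result;
    }
}
```

With an OR relation, selecting two colors widens the result; with an AND relation it narrows it. This is why the relation chosen for a group also changes the facet statistics shown to the user.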

The evitaDB data model consists of three layers:

catalog
entity collection
entity (data)

Each catalog occupies a single folder within the evitaDB data folder. Each collection within this catalog is usually represented by a single file (key/value store) in this folder. Entities are stored in binary format in the collection file. More details about the storage format can be found in a separate chapter.

Catalog

The catalog is a top-level isolation layer. It's equivalent to a database in other database terms. The catalog contains a set of entity collections that maintain data for a single tenant. evitaDB doesn't support queries that could span multiple catalogs. The catalogs are completely isolated from each other on disk and in memory.

A catalog is described by its schema. Changes to the catalog structure can only be made using catalog schema mutations.

Collection

The collection is a storage unit for data related to the same entity type. It's equivalent to a collection in terms of other NoSQL databases like MongoDB. In the relational world, the closest term is table, but the collection in evitaDB manages much more data than a single relational table could. The correct projection in the relational world would be "a set of logically related linked tables".

Collections in evitaDB are not isolated, and entities in them can be related to entities in other collections. By default, the relationships are unidirectional.

Although evitaDB requires a schema for each entity type, it supports automatic evolution if you allow it. If you don't specify otherwise, evitaDB learns about entity attributes, their data types and all necessary relations as you add new data. Once the attributes, associated data or other contours of the entity are known, they are enforced by evitaDB. This mechanism is somewhat similar to the schema-less approach, but results in a much more consistent data store.
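
The "learn on first write, enforce afterwards" behaviour can be sketched as follows. This is a toy model under stated assumptions, not evitaDB's schema API: the class `EvolvingSchemaSketch` and its methods are hypothetical, and only attribute types are tracked.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: a toy model of automatic schema evolution. */
final class EvolvingSchemaSketch {

    private final Map<String, Class<?>> attributeTypes = new HashMap<>();
    private final boolean evolutionAllowed;

    EvolvingSchemaSketch(boolean evolutionAllowed) {
        this.evolutionAllowed = evolutionAllowed;
    }

    /** Validates one attribute value, learning its type when it is seen for the first time. */
    void accept(String attributeName, Object value) {
        Class<?> known = attributeTypes.get(attributeName);
        if (known == null) {
            if (!evolutionAllowed) {
                throw new IllegalArgumentException("Unknown attribute: " + attributeName);
            }
            attributeTypes.put(attributeName, value.getClass()); // the schema evolves
        } else if (!known.isInstance(value)) {
            // once known, the type is enforced
            throw new IllegalArgumentException(
                "Attribute " + attributeName + " must be of type " + known.getSimpleName());
        }
    }

    /** Returns the learned type, or null if the attribute is not known yet. */
    Class<?> typeOf(String attributeName) {
        return attributeTypes.get(attributeName);
    }
}
```

Unlike a truly schema-less store, the second write of a differently typed value is rejected, which is what keeps the data consistent.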

A collection is described by its schema. Changes to the entity type definition can only be made using entity schema mutations.

Entity

Minimal entity definition consists of:

Entity type
Primary key

Other entity data is purely optional and may not be used at all. The primary key can be set to NULL to let the database generate it automatically.

Entity type

Entity type must be of type String.

Entity type is the main business key (equivalent to a table name in a relational database): all data of entities of the same type is stored in a separate index. Within the entity type, the entity is uniquely identified by its primary key.

Primary key

The primary key must be a positive int number (max. 2³¹-1). It can be used for fast lookup of entities. The primary key must be unique within the same entity type.

It can be left NULL if it is to be generated automatically by the database. The primary key allows evitaDB to decide whether the entity should be inserted as a new entity or whether an existing entity should be updated instead.
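
The insert-or-update decision can be sketched like this. The class `CollectionSketch` is hypothetical, the entity body is reduced to a String, and the simple counter only stands in for whatever sequence the database uses internally.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: a toy collection keyed by primary key. */
final class CollectionSketch {

    private final Map<Integer, String> entitiesByPrimaryKey = new HashMap<>();
    private int lastGeneratedKey = 0;

    /**
     * Stores the body under the given primary key. A null key asks the collection
     * to generate one. Returns the primary key that was actually used.
     */
    int upsert(Integer primaryKey, String body) {
        int key;
        if (primaryKey == null) {
            // generate the next free key (skipping keys assigned explicitly)
            do {
                key = ++lastGeneratedKey;
            } while (entitiesByPrimaryKey.containsKey(key));
        } else {
            if (primaryKey <= 0) {
                throw new IllegalArgumentException("Primary key must be positive");
            }
            key = primaryKey;
        }
        // a known key replaces the existing entity, an unknown key inserts a new one
        entitiesByPrimaryKey.put(key, body);
        return key;
    }

    String get(int primaryKey) {
        return entitiesByPrimaryKey.get(primaryKey);
    }

    int size() {
        return entitiesByPrimaryKey.size();
    }
}
```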

All primary keys are stored in a data structure called RoaringBitmap. It was originally written in C by Daniel Lemire, and a team led by Richard Startin ported it to Java. The library is used in many existing databases for similar purposes (Lucene, Druid, Spark, Pinot and many others).

We chose this library for two main reasons:

it allows us to store int arrays in a more compressed format than a simple array of primitive integers,
it contains algorithms for fast Boolean operations on these integer sets

This data structure works best for integers that are close together. This fact plays well with the database sequences that produce numbers incremented by one. There is a variant of the same data structure that works with the long type, but it has two drawbacks:

it uses twice as much memory
it's much slower for Boolean operations

Since evitaDB is an in-memory database, we expect that the number of entities will not exceed two billion.
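
The Boolean operations on sets of primary keys can be sketched as follows. RoaringBitmap is a separate library, so to keep the example self-contained the standard `java.util.BitSet` is used as an uncompressed stand-in; the class `PrimaryKeySetSketch` and its helpers are hypothetical.

```java
import java.util.BitSet;

/** Illustrative only: java.util.BitSet stands in for a compressed bitmap. */
final class PrimaryKeySetSketch {

    /** Largest value an int primary key can take: 2^31 - 1. */
    static final int MAX_PRIMARY_KEY = Integer.MAX_VALUE;

    /** Builds a set of primary keys. */
    static BitSet of(int... primaryKeys) {
        BitSet set = new BitSet();
        for (int primaryKey : primaryKeys) {
            set.set(primaryKey);
        }
        return set;
    }

    /** Keys present in both sets, e.g. two filter conditions combined with AND. */
    static BitSet and(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.and(b);
        return result;
    }

    /** Keys present in at least one set, e.g. two conditions combined with OR. */
    static BitSet or(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.or(b);
        return result;
    }

    /** Keys present in a but not in b, e.g. a condition combined with NOT. */
    static BitSet andNot(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.andNot(b);
        return result;
    }
}
```

Each operation copies its first operand, so the indexed sets themselves are never modified by a query.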

Hierarchical placement

Entities can be organized hierarchically. This means that an entity can refer to a single parent entity and can be referred to by multiple child entities. A hierarchy always consists of entities of the same type.

Each entity can be part of at most one hierarchy (tree).

Most e-commerce systems organize their products in a hierarchical category system. The categories are the source for the catalog menus, and when users browse a category, they usually see the products of the entire subtree of that category. That's why hierarchies are directly supported by evitaDB.
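
The category subtree lookup described above can be sketched as follows. The class `HierarchySketch` and its methods are hypothetical and only model the rules stated in this section: one parent per entity, many children per parent.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative only: a toy hierarchy of same-type entities (e.g. categories). */
final class HierarchySketch {

    private final Map<Integer, Integer> parentByChild = new HashMap<>();
    private final Map<Integer, List<Integer>> childrenByParent = new HashMap<>();

    /** Places child under parent; an entity may have a single parent only. */
    void place(int child, int parent) {
        if (parentByChild.containsKey(child)) {
            throw new IllegalStateException("Entity " + child + " already has a parent");
        }
        // walking up from the parent must never reach the child (no cycles)
        for (Integer ancestor = parent; ancestor != null; ancestor = parentByChild.get(ancestor)) {
            if (ancestor == child) {
                throw new IllegalStateException("Placement would create a cycle");
            }
        }
        parentByChild.put(child, parent);
        childrenByParent.computeIfAbsent(parent, key -> new ArrayList<>()).add(child);
    }

    /** Returns the root and all its descendants in breadth-first order. */
    List<Integer> subtree(int root) {
        List<Integer> result = new ArrayList<>();
        Deque<Integer> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            int current = queue.poll();
            result.add(current);
            queue.addAll(childrenByParent.getOrDefault(current, List.of()));
        }
        return result;
    }

    /** Returns products assigned to the category or any of its subcategories. */
    Set<Integer> productsInSubtree(int rootCategory, Map<Integer, Integer> categoryByProduct) {
        Set<Integer> categories = new HashSet<>(subtree(rootCategory));
        Set<Integer> products = new TreeSet<>();
        for (Map.Entry<Integer, Integer> entry : categoryByProduct.entrySet()) {
            if (categories.contains(entry.getValue())) {
                products.add(entry.getKey());
            }
        }
        return products;
    }
}
```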

More details about hierarchy placement are described in the schema definition chapter.

Attributes (unique, filterable, sortable, localized)

The entity attributes allow you to define a set of data to be fetched in bulk along with the entity body. Each attribute schema can be marked as filterable to allow filtering by it, or sortable to allow sorting by it.

Attributes that are added automatically by the schema evolution mechanism are marked as filterable and sortable, so that the "I don't care" approach to the schema just works. However, filterable and sortable attributes require indexes that evitaDB keeps entirely in memory, so this approach wastes resources. We therefore recommend the schema-first approach, marking as filterable / sortable only those attributes that are actually used for filtering / sorting.

Attributes are also recommended for frequently used data that accompanies the entity (for example "name", "perex", "main motive"), even if you don't necessarily need it for filtering/sorting purposes. evitaDB stores and fetches all attributes in a single block, so keeping this frequently used data in attributes reduces the overall I/O.

More details about attributes are described in the schema definition chapter.

Localized attributes

An attribute can contain localized values. This means that different values should be used for filtering/sorting and should be returned together with the entity when a specific locale is used in the search query. Localized attributes are a standard part of most e-commerce systems and that's why evitaDB provides special treatment for them.
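
A minimal sketch of sorting by a localized attribute is shown here. The class `LocalizedAttributeSketch` is hypothetical, and two things are assumptions of the sketch rather than documented behaviour: entities without a value in the requested locale are left out, and plain String ordering is used instead of locale-aware collation.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

/** Illustrative only: one localized "name" attribute per entity. */
final class LocalizedAttributeSketch {

    /**
     * Returns primary keys of entities that have a name in the given locale,
     * sorted by that localized name.
     */
    static List<Integer> sortByLocalizedName(Map<Integer, Map<Locale, String>> namesByEntity,
                                             Locale locale) {
        return namesByEntity.entrySet().stream()
            // assumption: entities lacking the locale are not returned
            .filter(entry -> entry.getValue().containsKey(locale))
            .sorted(Comparator.comparing(
                (Map.Entry<Integer, Map<Locale, String>> entry) -> entry.getValue().get(locale)))
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }
}
```

The same entities come back in a different order, and possibly in a different number, depending on the locale used in the query.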

Data types in attributes

Attributes allow the use of a variety of data types and their arrays. The database supports all basic types, date-time types, and range types. Range values can be queried using a special filtering constraint, inRange, which matches entities whose range includes the given value.

Any of the supported data types can be wrapped into an array - that is, the attribute can represent multiple values at once. Such an attribute cannot be used for sorting, but it can be used for filtering, where it will satisfy the filter constraint if any of the values in the array match the constraint predicate. This is particularly useful for ranges, where you can simply define multiple validity periods, for example, and the inRange constraint will match all entities that have at least one period that includes the input date and time (this is another common use case in e-commerce systems).
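
The "any value in the array matches" rule for ranges can be sketched as follows. The record `ValidityPeriod` and the method `inRange` are hypothetical stand-ins, and treating both bounds as inclusive is an assumption of this sketch.

```java
import java.time.OffsetDateTime;
import java.util.List;

/** Illustrative only: an array attribute holding several validity periods. */
final class RangeArraySketch {

    /** A closed interval; both bounds are treated as inclusive (assumption). */
    record ValidityPeriod(OffsetDateTime from, OffsetDateTime to) {
        boolean contains(OffsetDateTime moment) {
            return !moment.isBefore(from) && !moment.isAfter(to);
        }
    }

    /** The entity matches if at least one of its periods contains the moment. */
    static boolean inRange(List<ValidityPeriod> periods, OffsetDateTime moment) {
        return periods.stream().anyMatch(period -> period.contains(moment));
    }
}
```

An entity with two separate validity periods therefore matches a date inside either of them, but not a date in the gap between them.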

Sortable attribute compounds

Sortable attribute compounds are not inserted into an entity. The database creates them automatically when an entity is inserted, and maintains an index over the defined entity or reference attribute values. Attribute compounds can only be used for sorting entities, in the same way as an attribute.
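
One way to picture a compound is as a single sort key derived from several attribute values. The sketch below assumes a compound defined over two attributes ("brand" ascending, then "priority" descending); the record `Product`, the comparator and the attribute names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative only: a compound derived from two existing attributes. */
final class CompoundSortSketch {

    record Product(int primaryKey, String brand, int priority) {}

    /** Assumed compound: brand ascending, then priority descending. */
    static final Comparator<Product> BRAND_THEN_PRIORITY =
        Comparator.comparing(Product::brand)
            .thenComparing(Comparator.comparingInt(Product::priority).reversed());

    /** Returns the primary keys in the order given by the compound. */
    static List<Integer> sortedKeys(List<Product> products) {
        List<Product> sorted = new ArrayList<>(products);
        sorted.sort(BRAND_THEN_PRIORITY);
        List<Integer> keys = new ArrayList<>();
        for (Product product : sorted) {
            keys.add(product.primaryKey());
        }
        return keys;
    }
}
```

The compound itself is never written by the client; it is computed from the attribute values that the entity already carries.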

Associated data

Associated data carry additional data entries that are never used for filtering / sorting but may need to be fetched along with the entity in order to display data to the target consumer (i.e. a user / API / bot). Associated data can store all basic data types as well as complex, document-like types.

The search query must contain a specific requirement to fetch the associated data along with the entity. Associated data are stored and fetched separately by their name and locale (if the associated data is localized).

More details about associated data are described in the schema definition chapter.

Localized associated data

An associated data value can be localized. This means that different values are returned along with the entity when a specific locale is used in the search query. Localized data is a standard part of most e-commerce systems and that's why evitaDB provides special treatment for it.

References

The references, as the name suggests, refer to other entities (of the same or a different entity type). References allow entities to be filtered by the attributes defined on the reference relation or by the attributes of the referenced entities. References also enable statistics computation if the facet index is enabled for the referenced entity type. A reference is uniquely identified by a positive int number (max. 2³¹-1) and a String entity type. It may represent a facet that is part of one or more facet groups, which are also identified by an int. The reference identifier in an entity is unique and belongs to a single group id. Across multiple entities, references to the same referenced entity may be part of different groups.

The referenced entity type can refer to another entity managed by evitaDB, or it can refer to any external entity that has a unique int key as its identifier. We expect that evitaDB will only partially manage data and that it will coexist with other systems at runtime, such as content management systems, warehouse systems, ERPs and so on.

References are unidirectional in nature, which means that if the reference points from entity A to entity B, it does not mean that entity B automatically references entity A. It is possible to set up a bi-directional reference by creating a so-called "reflected reference" on the other entity type and identifying the original reference that should be reflected.
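
Conceptually, a reflected reference is the inverse view of the original one. The sketch below derives that inverse view from a list of forward references; the record `Reference`, the method `reflect` and the sample entity type names are hypothetical and do not show how evitaDB maintains reflected references internally.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Illustrative only: deriving the inverse view of unidirectional references. */
final class ReflectedReferenceSketch {

    /** A forward reference from one entity to another entity of the given type. */
    record Reference(int sourcePrimaryKey, String referencedEntityType, int referencedPrimaryKey) {}

    /**
     * For every referenced entity of the given type, collects the primary keys
     * of the entities that point at it.
     */
    static Map<Integer, Set<Integer>> reflect(List<Reference> references, String referencedEntityType) {
        Map<Integer, Set<Integer>> reflected = new TreeMap<>();
        for (Reference reference : references) {
            if (reference.referencedEntityType().equals(referencedEntityType)) {
                reflected
                    .computeIfAbsent(reference.referencedPrimaryKey(), key -> new TreeSet<>())
                    .add(reference.sourcePrimaryKey());
            }
        }
        return reflected;
    }
}
```

If products 1 and 2 both reference category 10, the reflected view lets category 10 list products 1 and 2 without the category storing those references itself.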

The references may carry additional key-value data related to this entity relationship (e.g. number of items present on the relationship to a stock). The data on references is subject to the same rules as entity attributes.

More details about references are described in the schema definition chapter.

Prices

Prices are specific to very few entity types (usually products, shipping methods, and so on), but since correct price calculation is a very complex and important part of e-commerce systems and highly affects the performance of entity filtering and sorting, they deserve first-class support in the entity model. It is quite common in B2B systems that a single product has dozens of prices assigned to different customers.

The price has the following structure:

int priceId: Contains the identification of the price in the external systems. This ID is expected to be used for synchronization of the price with the primary source of the prices. The price ID must be unique within the same entity. Prices with the same ID in multiple entities should represent the same price in terms of other values, such as validity, currency, price list, the price itself, and all other properties. These values can differ for a limited time (for example, the prices of Entity A and Entity B can be the same, but Entity A is updated in a different session/transaction and at a different time than Entity B).
String priceList: Contains the identification of the price list in the external system. Every price must refer to a price list. The price list identification can refer to another evitaDB entity or contain any external price list identification (e.g. ID or unique name of the price list in the external system). A single entity is expected to have a single price for the price list unless "validity" is specified. In other words, it makes no sense to have multiple concurrently valid prices for the same entity that are rooted in the same price list.
Currency currency: Identification of the currency. Three-letter form according to ISO 4217.
int innerRecordId: Some special products (such as master products or product sets) may contain prices of all "child" products so that the aggregating product can display them in certain views of the product. In this case, it is necessary to distinguish the projected prices of the subordinate products in the product that represents them.
BigDecimal priceWithoutTax: Price without tax.
BigDecimal priceWithTax: Price with tax.
BigDecimal taxRate: Tax percentage (e.g. 19.00 for 19%).
DateTimeRange validity: Date and time interval for which the price is valid (inclusive).
boolean indexed: Controls whether the price is subject to filtering/sorting logic. Non-indexed prices are fetched along with the entity, but are not considered when evaluating the query. They can be used for "informational" prices, such as the reference price (the crossed out price often found on e-commerce sites as the "usual price"), which are not used as the "price for sale".

The algorithm for selecting the price for sale is quite complex and needs a lot of examples to understand. Therefore, there is a separate chapter on this subject.

More details about prices are described in the schema definition chapter.
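
The price structure listed above can be sketched as a plain record. This is a simplified model, not evitaDB's price class and not its price-for-sale algorithm: the names `PriceSketch`, `withTax` and `candidates` are hypothetical, validity is split into two nullable bounds where null means "unbounded" (an assumption), and the rounding in `withTax` is also an assumption. The three monetary values are stored independently; the helper only shows how they relate arithmetically.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.time.OffsetDateTime;
import java.util.Currency;
import java.util.List;
import java.util.stream.Collectors;

/** Illustrative only: a simplified price model. */
final class PriceSketch {

    record Price(int priceId, String priceList, Currency currency, Integer innerRecordId,
                 BigDecimal priceWithoutTax, BigDecimal priceWithTax, BigDecimal taxRate,
                 OffsetDateTime validFrom, OffsetDateTime validTo, boolean indexed) {

        /** Both bounds are inclusive; a null bound is treated as open (assumption). */
        boolean validAt(OffsetDateTime moment) {
            return (validFrom == null || !moment.isBefore(validFrom))
                && (validTo == null || !moment.isAfter(validTo));
        }
    }

    /** priceWithTax = priceWithoutTax * (1 + taxRate / 100), rounded to two decimals. */
    static BigDecimal withTax(BigDecimal priceWithoutTax, BigDecimal taxRate) {
        BigDecimal multiplier = BigDecimal.ONE.add(taxRate.movePointLeft(2));
        return priceWithoutTax.multiply(multiplier).setScale(2, RoundingMode.HALF_UP);
    }

    /** Keeps only indexed prices of the price list and currency that are valid at the moment. */
    static List<Price> candidates(List<Price> prices, String priceList,
                                  Currency currency, OffsetDateTime moment) {
        return prices.stream()
            .filter(Price::indexed)                          // informational prices are skipped
            .filter(price -> price.priceList().equals(priceList))
            .filter(price -> price.currency().equals(currency))
            .filter(price -> price.validAt(moment))
            .collect(Collectors.toList());
    }
}
```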

Scope

Scopes are separate areas of memory where entity indexes are stored. Scopes are used to separate live data from archived data. The scopes are used to handle so-called "soft deletes" - the application can choose between a hard delete and archiving the entity, which simply moves the entity to the archive scope. The details of the archiving process are described in the Archiving chapter and the reasons for this feature are explained in the dedicated blog post.
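
The difference between archiving and a hard delete can be sketched with two separate indexes. The class `ScopeSketch`, the `Scope` enum values and the method names are hypothetical; the entity body is reduced to a String.

```java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Illustrative only: one index per scope, entities move between them. */
final class ScopeSketch {

    enum Scope { LIVE, ARCHIVED }

    private final Map<Scope, Map<Integer, String>> indexes = new EnumMap<>(Scope.class);

    ScopeSketch() {
        for (Scope scope : Scope.values()) {
            indexes.put(scope, new HashMap<>());
        }
    }

    /** New entities start in the live scope. */
    void insert(int primaryKey, String body) {
        indexes.get(Scope.LIVE).put(primaryKey, body);
    }

    /** Soft delete: moves the entity from the live index to the archive index. */
    boolean archive(int primaryKey) {
        String body = indexes.get(Scope.LIVE).remove(primaryKey);
        if (body == null) {
            return false;
        }
        indexes.get(Scope.ARCHIVED).put(primaryKey, body);
        return true;
    }

    /** Hard delete: removes the entity from every scope. */
    boolean hardDelete(int primaryKey) {
        boolean removed = false;
        for (Map<Integer, String> index : indexes.values()) {
            removed |= index.remove(primaryKey) != null;
        }
        return removed;
    }

    /** Looks the entity up only in the scopes the caller asks for. */
    Optional<String> find(int primaryKey, Scope... scopes) {
        for (Scope scope : scopes) {
            String body = indexes.get(scope).get(primaryKey);
            if (body != null) {
                return Optional.of(body);
            }
        }
        return Optional.empty();
    }
}
```

A query that searches only the live scope no longer sees an archived entity, while a query that explicitly includes the archive scope still does.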

Author: Ing. Jan Novotný

Date updated: 17.1.2023
