Research assignment

The following information was available to the teams at the start of the research project, along with the Java API interfaces and the shared implementation layer.

Terms used in this document

brand: A Brand is an entity that represents the manufacturer or supplier of the product. Brands are often a key factor driving customers' choice process. E-commerce sites provide information about the brand in the product detail and very often provide specialized brand pages with all products of a brand with additional information about the manufacturing process, quality assurance and other marketing information.
cart: A cart is a place where items selected by a customer for later purchase are collected. The cart is an important part of each e-commerce application but since it's specific to the e-commerce store, there's no special support in evitaDB.
category: A category is an entity that forms a hierarchical tree and categorizes items on the e-commerce site into a form more accessible to the customer. A category in a store with electronic products may be Computers, which has sub-categories Laptops, Game consoles, Desktops and so on. It is common for upper categories on e-commerce sites to display items placed in all their sub-categories. In our example, it means that in the category Computers, the customer will see laptops, game consoles and desktops all at once.
complex entities: Complex entities are entities that represent a bulk of other entities. One example is a parametrized product, i.e. a product with several variants where only a variant can be bought. A real-life example of such a parametrized product is a T-shirt that comes in several sizes (S, M, L, XL, XXL) and different colors (blue, red). You want the T-shirt to be represented as a single product in the e-commerce listing, and the variant selection is performed at the moment of purchase. Different variants of this shirt may have different prices, so when filtering or sorting we need to select a single price that will be used for the parametrized product. For such a case, the FIRST_OCCURRENCE inner entity reference handling strategy is the best fit (see PriceInnerEntityReferenceHandling). A different example of a complex entity is a product set. A product set is a product that consists of several sub-products but is purchased as a whole. A real-life example of such a product set is a drawer: it consists of the body, the door and handles. The customer could even choose which type of doors or handles they want in the set, but there will always be some defaults. Again, we need some price assigned to the product set for the sake of a listing (i.e. filtering and sorting), but there may be none when the price is computed as an aggregation of the prices of sub-products. In that case, the SUM inner entity reference handling strategy is the best fit (see PriceInnerEntityReferenceHandling).
facet: A facet is a property of the entity that the customer uses for quick filtering of entities. It is represented as a checkbox in the filtering bar or as a slider in case of a large number of distinct numeric values. Facets help customers narrow down the listing of the current category, a manufacturer listing or the results of a fulltext search. Without them, the customer would have to go through dozens of pages of results and would probably be forced to look for some sub-category or find a better search phrase. That's frustrating for the user, and facets can ease this process: the user can narrow the results by a few clicks on relevant facets. The key aspect here is to provide enough information to guide the user to the most relevant facet combinations. It's very helpful to disable facets as soon as they would cause no results to be returned, or even to inform the user that selecting a particular facet would narrow the results to very few records and that their freedom of choice will be severely affected.
facet group: A facet group is used to group facets of the same type. Facet groups control the mechanisms of facet filtering. It means that facet groups allow defining whether facets in the group are combined with a boolean OR or AND relation when used in filtering. It also allows defining how this facet group will be combined with other facet groups in the same query (i.e. AND, OR, NOT). This type of boolean logic affects the statistics computation of the facets and is the crucial part of the facet evaluation.
fulltext search: Fulltext search is a process in which a user tries to locate an appropriate product/category/brand (or any other entity) by specifying a search phrase. The problem is that the user expects fulltext results to be as accurate as those Google provides in its search. Achieving a similar quality in an e-commerce application is a really hard task that requires both a top-notch fulltext engine and some form of self-learning algorithms (AI).
group: A group is an entity that references a set of products by some cross-cutting concern. As an example of groups, you can imagine: top products (displayed on the homepage), new products (last X products added to an inventory), gifts and so on. Groups are versatile units to display a bunch of products in different places of the web application.
product: A product is an entity that represents the item sold at an e-commerce store. The products represent the very core of each e-commerce application.
product with variants: A product with variants is a "virtual product" that cannot be bought directly. A customer must choose one of its variants instead. Products with variants are very often seen in e-commerce fashion stores where clothes come in various sizes and colors. A single product can have dozens of combinations of size and color. If each combination represented a standard product, a product listing in a category and other places would become unusable. In this situation, products with variants become very handy. This "virtual product" can be listed instead of the variants, and the variant selection is performed at the time of placing the goods into the cart. Let's have an example: We have a T-Shirt with a unicorn picture on it. The T-Shirt is produced in different sizes and colors - namely:

– size: S, M, L, XL, XXL
– color: blue, pink, violet

That represents 15 possible combinations (variants). Because we only want a single unicorn T-Shirt in our listings, we create a product with variants and enclose all variant combinations in this virtual product.
product set: A product set is a product that consists of several sub-products but is purchased as a whole. A real-life example of such a product set is a drawer: it consists of the body, the door and handles. A customer could even choose which type of doors or handles they want in the set, but there will always be some defaults.
When displaying and filtering by a product set in the listings on the e-commerce site, we need some price assigned to it, but there may be no exact price assigned to the set, and the e-commerce owner expects that the price will be computed as an aggregation of the prices of sub-products. This behaviour is supported by setting the proper PriceInnerEntityReferenceHandling.
property: An entity that represents an item property. Properties are handled as top entities, because we expect that properties may have additional attributes and localizations to different languages. Properties are usually referenced in other items' facets. Properties are usually composed of two parts:

– property group, referenced in facet group, e.g. color, size, sex
– property value, referenced in facet, e.g. blue, XXL, women
variant product: A variant product is a product that is enclosed within a product with variants and represents a single combination of a particular product. In case of the example used in a referenced chapter, it would be, for example, the Unicorn T-Shirt, pink, M size.

This document uses the future tense because it was originally written at the beginning of the project. The research phase of the project has now ended, so the future tense does not make sense now, but we have made only minimal changes to the text to preserve the original ideas and tone. Notes and references have been added at some points in the document to refer to a specific part of the research already conducted.

Research procedure

An API covering the required functionalities will be implemented on top of two general-purpose, freely available databases and compared with a single "greenfield" solution.

All competing solutions will be populated using a data pump with real datasets (both from B2C and B2B environments) of existing FG Forrest clients, who gave their explicit consent to do so. The client data sets will not contain any personal data - they will only represent a "sale catalog".

Evaluation methodology

A single, automated test suite will be implemented that will test all of the basic scenarios from a real-life e-commerce solution. This suite will then be run against individual implementations in a lab environment and the results will be recorded.

The criteria for evaluating the best option are:

cumulative response speed in the test scenarios (weight 75%)
memory space consumption (weight 20%)
disk space consumption (weight 5%)

The cumulative response time is measured over (at least) the following search queries:

category tree display (open to the current category) - menu rendering
category detail display (retrieving one full category entity) + product listing
product listing filterable by
- parameters
- brands
- tags
- prices
product ordering
- by selected attribute
- by price

Cache utilization is problematic in a filtering scenario because there are far too many combinations the user can select. Moreover, the data changes frequently, and the impact of these changes is hard to translate into cache record invalidations because of the complex relations between the records. The implementations thus can't rely on a caching layer when the tests are run and the performance is evaluated.

The final evaluation of the research is available here.

Expected record counts for performance tests

Entities

Our performance tests are run on data sets that are similar to these:

Entity	Senesi.cz	Signal-nabytek.cz	Fjallraven CZ	Rako CZ	Kili CZ
adjustedPricePolicy	2	2	3	1	110
brand	158	42	0	0	76
category	202	220	87	16	325
group	1190	20	7	1	265
parameterItem	19299	3477	1346	2044	2848
parameterType	255	558	32	48	39
paymentMethod	15	4	3	1	5
priceList	2	3	6	4	7044
product	64628	50852	31144	3587	26567
shippingMethod	3	15	6	1	52

Legend

entity: adjustedPricePolicy: price policy (discounts, special programs and so on)
entity: brand: brand (such as: Nokia, Samsung and so on)
entity: category: product category (such as: TV, Notebook and so on)
entity: group: product groups (such as: action ware, new items on stock and so on)
entity: parameterType: parameter facet group detail data (such as: color, size, resolution)
entity: parameterItem: parameter facet detail data (such as: blue, yellow, XXL, fullHD)
entity: paymentMethod: form of paying on the site (such as: by credit card, direct transfer and so on)
entity: priceList: entity aggregating prices of product sharing common trait (such as: VIP, sellout)
entity: product: entity that is being sold on the site
entity: shippingMethod: form of delivery of the goods (such as: DPD, PPL, Postal service and so on)

The final datasets were reduced to Senesi.cz and Signal-nabytek.cz, and a new dataset, KeramikaSoukup.cz, was added. The record cardinalities in some of those datasets were enlarged considerably (see evaluation results here).

Connected data cardinalities:

Type of data	Senesi.cz	Signal-nabytek.cz	Fjallraven CZ	Rako CZ	Kili CZ
price	68522	59018	193680	15885	1594502
associated data	479246	326205	209694	23535	298963
localized texts	258798	353597	88913	94803	102855
attributes	1351227	924281	574488	66820	572674
facets	876080	802583	334008	144161	466160

Localized texts are a part of associated data, but they are counted separately to show how large a share of the associated data they represent. They are not included in the associated data row.

Validation set

In the initial phase, the university teams will select one general-purpose relational database and one NoSQL database. The criteria for selecting a database engine are:

the license to run the e-commerce platform must be free of charge
the database resource must have good documentation
the database is expected to be further developed or supported in the next 5 to 10 years
it can run on Linux OS (ideally Ubuntu distribution)

Recommended technologies to start the investigation with are (the list is not complete and can be extended with other database engines):

relational database candidates:
- MySQL (MariaDB, Percona)
- PostgreSQL

The selection process is described in appendix A of the associated thesis. PostgreSQL was selected as the platform for the prototype implementation on top of a relational database.

NoSQL database candidates:
- MongoDB
- Elasticsearch (Lucene)

The selection process is described in appendix A of the associated thesis. Elasticsearch was selected as the platform for the prototype implementation on top of a NoSQL database.

Custom "greenfield" solution

The solution will be designed as an in-memory NoSQL database optimized to run on a single machine. The cluster mode will be designed as single writer, multiple reader replicas with the same dataset, which will be updated using an event stream from the primary source of truth.

The solution will be based on the assumption that the entire search index can be stored in RAM. Information that is not used for retrieval but only for displaying to the user may or may not be stored in memory (we consider the use of a memory mapped file).

The data will be stored in memory as a sorted array of numeric primary identifiers of entities (products, categories, tags, etc.) over which Boolean AND, OR, NOT operations will be performed, and then GROUP BY and SUM operations in the pricing area. All operations will be performed in a so-called "lazy" manner, which promises better performance in moments when complete results do not need to be computed (e.g. for the purpose of displaying the category tree). Optimized data structures such as the inverted index or the interval tree or range tree will be used for specific search operations.
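The Boolean operations over sorted arrays of primary keys mentioned above can be illustrated with a minimal sketch. It is not the evitaDB implementation: the class and method names are hypothetical, the arrays are assumed to be sorted and free of duplicates, and the evaluation here is eager, whereas the text describes a lazy approach.

```java
import java.util.Arrays;

/** Boolean operations over sorted arrays of distinct entity primary keys (hypothetical helper). */
final class SortedKeySets {

    /** AND: keys present in both inputs, found by a two-pointer merge. */
    static int[] and(int[] a, int[] b) {
        int[] out = new int[Math.min(a.length, b.length)];
        int i = 0, j = 0, n = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) i++;
            else if (a[i] > b[j]) j++;
            else { out[n++] = a[i]; i++; j++; }
        }
        return Arrays.copyOf(out, n);
    }

    /** OR: keys present in at least one input, each emitted once. */
    static int[] or(int[] a, int[] b) {
        int[] out = new int[a.length + b.length];
        int i = 0, j = 0, n = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) out[n++] = a[i++];
            else if (a[i] > b[j]) out[n++] = b[j++];
            else { out[n++] = a[i]; i++; j++; }
        }
        while (i < a.length) out[n++] = a[i++];
        while (j < b.length) out[n++] = b[j++];
        return Arrays.copyOf(out, n);
    }

    /** NOT: keys of the superset that are absent from the excluded set. */
    static int[] not(int[] superset, int[] excluded) {
        int[] out = new int[superset.length];
        int j = 0, n = 0;
        for (int key : superset) {
            while (j < excluded.length && excluded[j] < key) j++;
            if (j == excluded.length || excluded[j] != key) out[n++] = key;
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        int[] inCategory = {1, 3, 5, 7, 9};
        int[] ofBrand = {3, 4, 5, 6};
        int[] soldOut = {5};
        // (inCategory AND ofBrand) AND NOT soldOut
        System.out.println(Arrays.toString(not(and(inCategory, ofBrand), soldOut))); // prints [3]
    }
}
```

Because the inputs are sorted, each operation is a single linear pass over both arrays, which is what makes this representation attractive for combining many filtering conditions.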

The API and implementation will be designed so that as many related operations as possible are computed within a single request using common intermediate results. The client will not have to compose the functionality with a large number of database engine calls (in current solutions it is common for dozens of calls to a generic database engine to be needed to display category details with product listings).

Developing solutions for practical applicability

Based on the measurements, the best implementation option will be selected and refined to production quality, which requires extensive and clear documentation and coverage by automated unit, integration and performance tests.

The goal is:

to prepare HTTP APIs for communication with the outside world using commonly used formats:
- GraphQL
- REST
- gRPC
containerization of the distribution package (using Docker)
extending the implementation to a clustered solution (required on all major e-commerce sites)
publishing source files on generally accepted hosting platforms (GitHub, Gitlab or BitBucket)
API documentation
implementation of a sample e-commerce solution on top of this API with a basic dataset in JavaScript / Node.JS

Data model

See the more detailed API schema (updating/schema_api), describing the data model manipulation.

The minimal entity definition consists of: Entity type and Primary key (even this is optional and may be automatically generated by the database). Other entity data is purely optional and may not be used at all.

This combination is covered by this interface: . The full entity with data, references, attributes and associated data is represented by this interface: .

The schema for entities is described by:

Entity type

The entity type is a String that identifies the type of an entity. It is the main business key (equivalent to a table name in a relational database): all data of entities of the same type are stored in a separate index. Within the entity type, an entity is uniquely identified by its primary key.

Entity is described by its schema:

Although evitaDB requires a schema for each entity type, it supports automatic evolution when you allow it. If you don't specify otherwise, evitaDB learns about entity attributes, their data types and all necessary relations along the way when you insert new data into it. However, once the attribute, associated data or other contours of the entity are known, they are enforced by evitaDB. This mechanism somewhat resembles the schema-less approach, but results in a much more consistent data store.

The details about schema definition are part of a different document.

Primary key

A unique positive int number (max. 2³¹-1) that represents the entity. It can be used for fast lookup of an entity (or entities). The primary key must be unique within the same entity type.

May be left empty if it should be automatically generated by the database. The primary key allows evitaDB to decide whether the entity should be inserted as a new entity, or an update to an existing entity.

Hierarchical placement

Entities may be organized in a hierarchical fashion. That means that an entity may refer to a single parent entity and may be referred to by multiple child entities. A hierarchy is always composed of entities of the same type.

Each entity must be part of, at most, a single hierarchy (tree).

Hierarchy placement is represented by this interface: .

Most e-commerce systems organize their products in a hierarchical category system. The categories are the source for the catalog menus, and when users examine the category content, they usually see the products in the entire subtree of the category. That's why hierarchies are directly supported by evitaDB.
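To illustrate why a listing for an upper category involves the whole subtree, here is a minimal sketch that collects a category together with all of its descendants. The CategoryTree class and its methods are hypothetical and only model the parent/child relation described above; they are not part of the evitaDB API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical in-memory model of a single-type hierarchy keyed by primary keys. */
final class CategoryTree {
    private final Map<Integer, List<Integer>> childrenByParent = new HashMap<>();

    /** Registers a category; parent is null for a root category. */
    void add(int category, Integer parent) {
        if (parent != null) {
            childrenByParent.computeIfAbsent(parent, p -> new ArrayList<>()).add(category);
        }
    }

    /** Returns the given category and all of its descendants (iterative depth-first walk). */
    Set<Integer> subtree(int root) {
        Set<Integer> result = new TreeSet<>();
        Deque<Integer> toVisit = new ArrayDeque<>();
        toVisit.push(root);
        while (!toVisit.isEmpty()) {
            int category = toVisit.pop();
            if (result.add(category)) {
                for (Integer child : childrenByParent.getOrDefault(category, Collections.<Integer>emptyList())) {
                    toVisit.push(child);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        CategoryTree tree = new CategoryTree();
        tree.add(1, null); // Computers
        tree.add(2, 1);    // Laptops
        tree.add(3, 1);    // Desktops
        tree.add(4, 2);    // a sub-category of Laptops
        System.out.println(tree.subtree(1)); // prints [1, 2, 3, 4]
    }
}
```

A product listing for category 1 would then include every product placed in any of the returned categories.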

Attributes (unique, filterable, sortable, localized)

Entity attributes allow defining sets of data that are fetched in bulk along with the entity body. An attribute may be marked as filterable to enable filtering by it, or sortable to enable sorting by it. Attributes are not filterable or sortable automatically, in order not to waste precious memory space and to avoid the computational overhead of maintaining an index for data that will never be used in queries.

Attributes must be used for all data you want to filter or sort by. It is also recommended to use attributes for frequently used data associated with the entity (for example "name", "perex", "main motive"), even if you don't necessarily need it for querying purposes.

The attribute provider (entity or reference) is represented by this interface:

The attribute schema is described by this:

Allowed decimal places

The allowed decimal places setting represents optimizations that allow for converting rich numeric types (such as BigDecimal used for precise number representation) to a primitive int type that is much more compact and can be used for fast binary search in an array/bitset representation. The original rich format is still present in an attribute container, but internally, the database uses the primitive form when an attribute is a part of filtering or sorting constraints.

When a number cannot be converted to a compact form (for example it has more digits in the fractional part than expected), an exception is thrown and the entity update is rejected.
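The conversion described above can be sketched with the standard BigDecimal API. This is only an illustration of the idea, not the database's internal code; the class and method names are hypothetical, and ArithmeticException stands in for whatever exception the database actually throws.

```java
import java.math.BigDecimal;

/** Hypothetical helper converting rich numbers to a compact int form. */
final class CompactNumbers {

    /**
     * Shifts the decimal point right by the allowed number of decimal places
     * and returns the result as an int.
     * Throws ArithmeticException when the value has more fractional digits
     * than allowed or when the shifted value does not fit into an int.
     */
    static int toCompactInt(BigDecimal value, int allowedDecimalPlaces) {
        return value.movePointRight(allowedDecimalPlaces).intValueExact();
    }

    public static void main(String[] args) {
        System.out.println(toCompactInt(new BigDecimal("12.35"), 2)); // prints 1235
        try {
            toCompactInt(new BigDecimal("12.355"), 2);
        } catch (ArithmeticException e) {
            System.out.println("rejected: too many fractional digits");
        }
    }
}
```

With two allowed decimal places, 12.35 is stored as 1235 and compares correctly against other values converted the same way, while 12.355 cannot be represented without loss and is rejected.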

Localized attributes

An attribute may contain localized values. It means that different values should be used for filtering / sorting and should be returned along with the entity when a certain locale is used in the search query. Localized attributes are standard part of most of the e-commerce systems, and that's why evitaDB provides special treatment for those.

Data types in attributes

Attributes allow using a variety of data types and their arrays. The database supports all basic types, date-time types and range types. Range values are queried using a special search query filtering constraint, InRange.

This filtering constraint allows filtering entities that are inside the range bounds. For more information, see the InRange documentation.

Any of the supported data types may be wrapped in an array, which means that the attribute might represent multiple values at once. Such attributes cannot be used for sorting, but they can be used for filtering, where the attribute fulfills the filtering constraint when any of the values in the array matches the constraint predicate. This is particularly useful for ranges where you can, for example, simply define multiple periods of validity, and the InRange constraint will match all entities that have at least one period enveloping the input date and time (this is another frequently present use-case in e-commerce systems).
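The "any value matches" rule for an array of validity periods can be sketched as follows. The Range class is a hypothetical stand-in for the database's range type, and the bounds are assumed to be inclusive; the real InRange semantics are described in its own documentation.

```java
import java.time.LocalDateTime;

/** Hypothetical sketch of matching a moment against an array of validity ranges. */
final class ValidityRanges {

    /** A range of date-times with inclusive bounds (an assumption of this sketch). */
    static final class Range {
        final LocalDateTime from;
        final LocalDateTime to;

        Range(LocalDateTime from, LocalDateTime to) {
            this.from = from;
            this.to = to;
        }

        boolean envelops(LocalDateTime moment) {
            return !moment.isBefore(from) && !moment.isAfter(to);
        }
    }

    /** The array attribute matches when at least one of its ranges envelops the moment. */
    static boolean inRange(Range[] validity, LocalDateTime moment) {
        for (Range range : validity) {
            if (range.envelops(moment)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Range[] validity = {
            new Range(LocalDateTime.of(2021, 1, 1, 0, 0), LocalDateTime.of(2021, 1, 31, 23, 59)),
            new Range(LocalDateTime.of(2021, 12, 1, 0, 0), LocalDateTime.of(2021, 12, 31, 23, 59))
        };
        System.out.println(inRange(validity, LocalDateTime.of(2021, 12, 15, 12, 0))); // prints true
        System.out.println(inRange(validity, LocalDateTime.of(2021, 6, 15, 12, 0)));  // prints false
    }
}
```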

See the chapter about all supported data types for more information.

Associated data

Associated data carries additional data entries that are never used for filtering / sorting but may need to be fetched along with an entity in order to display data to the target consumer (i.e. a user / API / bot). Associated data allows storing all basic data types and also complex, document-like types.

The complex data type is used for rich objects, such as Java POJOs, and is automatically converted to an internal representation that is composed solely of supported data types (or other complex objects) and can be deserialized back to the client in a custom POJO on demand, provided that the POJO structure matches the original document format.

The associated data provider (entity) is represented by this interface:

The associated data schema is described by this:

The search query must contain specific requirements to fetch the associated data along with the entity. Associated data are stored and fetched separately by their name.

Localized associated data

The associated data value may contain localized values. It means that different values should be returned along with an entity when a certain locale is used in the search query. Localized data is a standard part of most of the e-commerce systems, and that's why evitaDB provides special treatment for those.

References

The references, as the name suggests, refer to other entities (of the same or a different entity type). References enable entity filtering by the attributes defined on the reference relation or by the attributes of the referenced entities. References also enable statistics computation if the facet index is enabled for the referenced entity type. A reference is uniquely identified by a positive int number (max. 2³¹-1) and a String entity type, and it can represent a facet that is a part of one or multiple facet groups, which are also identified by an int. The reference identifier in one entity is unique and belongs to a single group id. Among multiple entities, the reference to the same referenced entity may be a part of different groups.

The referenced entity type may relate to another entity managed by evitaDB, or it may refer to any external entity possessing a unique int key as its identifier. We expect that evitaDB will maintain data only partially, and that it will co-exist with other systems in one runtime - such as content management systems, warehouse systems, ERPs and so on.

The references may carry additional key-value data linked to this entity relation (e.g. the item count present on the relation to a stock). The data on references is subject to the same rules as entity attributes.

A reference is represented by this interface: . A reference schema is described by this:

Prices

Prices are specific to very few entity types (usually products, shipping methods and so on). However, correct price computation is a very complex and important part of e-commerce systems and strongly affects the performance of entity filtering and sorting, so prices deserve first-class support in the entity model. It is pretty common in B2B systems for a single product to have dozens of prices assigned for different customers.

A price provider is represented by this interface: Single price is represented by the interface:

A price schema is part of the main entity schema:

For detailed information about the price for sale computation, see this article.

Entity indexing

See a more detailed entity API describing entity manipulation.

The entity indexing is a mechanism that stores entity data into the persistent storage and prepares the data for searching. We distinguish two types of this mechanism:

bulk
incremental

Bulk indexing

Bulk indexing is used for rapid indexing of large quantities of source data. It's used for the initial catalog setup from an external (primary) data store. It doesn't need to support transactions. Whenever something goes wrong, the work in progress might be thrown away entirely without affecting any clients (because it's the initial DB setup, no client is reading from it yet). The goal here is to index hundreds or thousands of entities per second.

Bulk indexing is executed in a single-threaded fashion.

Incremental indexing

Incremental indexing is used for keeping the index up-to-date during its lifetime. We expect some form of change data capture process to be incorporated in the primary data store. Incremental indexing must support transactions with at least the read committed isolation level. Initial implementations may relax the concurrency and limit parallelism so that only a single write transaction may be open at a time. Multiple parallel read transactions must be supported in the final implementation from the beginning, and simultaneous read/write transactions must be supported as well.

A rolled-back transaction must not affect the working data set. A committed transaction must leave the data set in a consistent state and must be resistant to unexpected process termination or hardware failure (committing data should enforce an fsync to persistent disk storage).

Incremental indexing should be as fast as possible, but since it has stricter requirements, it is expected to be considerably slower than bulk indexing.

Data fetching

Only the primary keys of the entities are returned to the query result by default. Each entity in this simplest case is represented by the interface.

The client application can ask for entity bodies to be returned instead, but this must be explicitly requested by using a specific require constraint:

When such a require constraint is used, data is fetched greedily during the initial query. A response object will then contain entities in the form of an .

Lazy fetching (enrichment)

Attributes, associated data and prices can be fetched separately by providing the primary key of the entity. The initial entity, loaded by the entity fetch or with a limited set of requirements, can be lazily expanded (enriched) with additional data by so-called lazy loading.

This process loads the above-mentioned data separately and adds them to the entity object anytime after it was initially fetched from evitaDB. Due to the immutability characteristics enforced by the database design, the entity object enrichment leads to a new instance.
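The consequence of immutability for enrichment can be shown with a small sketch. The ProductView class and the enrichedWith method are hypothetical and do not mirror the actual entity interfaces; the point is only that enrichment returns a new object and leaves the original untouched.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical immutable entity view; enrichment produces a new instance. */
final class ProductView {
    final int primaryKey;
    final Map<String, Object> attributes;

    private ProductView(int primaryKey, Map<String, Object> attributes) {
        this.primaryKey = primaryKey;
        // Defensive copy wrapped as unmodifiable, so the view can never change.
        this.attributes = Collections.unmodifiableMap(new HashMap<>(attributes));
    }

    /** A thin entity that carries only its primary key. */
    static ProductView ofPrimaryKey(int primaryKey) {
        return new ProductView(primaryKey, Collections.<String, Object>emptyMap());
    }

    /** Returns a new, richer instance; this instance stays as it was. */
    ProductView enrichedWith(Map<String, ?> lazilyLoaded) {
        Map<String, Object> merged = new HashMap<>(attributes);
        merged.putAll(lazilyLoaded);
        return new ProductView(primaryKey, merged);
    }

    public static void main(String[] args) {
        ProductView thin = ProductView.ofPrimaryKey(1);
        ProductView rich = thin.enrichedWith(Collections.singletonMap("name", "Unicorn T-Shirt"));
        System.out.println(thin.attributes.size() + " -> " + rich.attributes.size()); // prints 0 -> 1
    }
}
```

Client code therefore has to keep the returned reference; holding on to the old one means still seeing the thin entity.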

Lazy fetching may not be necessary for a frontend designed using the MVC architecture, where all requirements for the page are known prior to rendering. But different architectures might fetch thinner entity forms and later discover that they need more data in them. While this approach is not optimal performance-wise, it might make life easier for developers, and it's much more efficient to just enrich an existing entity (using a lookup by primary key and fetching only the missing data) instead of re-fetching the entire entity again.

Query requirements

See more detailed query API and query language in separate chapters.

Querying is the heart of all databases, and therefore the core of the query language was designed upfront in the prototype implementation phase, along with the unified functional test suite. When the first versions of the prototype implementations were created, the functional suite was also accompanied by a performance test suite, first for an artificial data set and later for real customer data sets.

This chapter briefly describes the use-cases the e-commerce catalog frequently solves and which we try to cover in our query language document and the API document.

Attributes

We need to fully support basic boolean algebra (AND, OR, NOT) with parentheses grouping logic that supports predicates:

numeric: equals, greater than, less than, between, in range
temporal: equals, greater than, less than, between, in range
string: contains, starts with, ends with
boolean: is null, is not null

Localized search

The items in multi-language catalogs often only have a limited set of localizations (translations). The search engine must easily filter only those items that are available in the selected locale. Let's assume that a specific brand is referred to by items with the following localizations:

Product A: EN, CZ, IT
Product B: EN, DE, FR
Product C: EN, CZ, PL

When the query lists all items referring to that brand and specifies the locale equal to EN, all items need to be returned, while using the locale equal to CZ lists only Product A and Product C.
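The expected behaviour for this example can be written down as a minimal sketch. The class and method names are hypothetical, and locales are kept as the plain strings used in the example rather than as java.util.Locale values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of filtering items by the locale they are available in. */
final class LocalizedSearch {

    /** Returns the items that have a localization for the requested locale. */
    static List<String> availableIn(Map<String, Set<String>> localesByItem, String locale) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : localesByItem.entrySet()) {
            if (entry.getValue().contains(locale)) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    /** The three products from the example, in insertion order. */
    static Map<String, Set<String>> sampleCatalog() {
        Map<String, Set<String>> catalog = new LinkedHashMap<>();
        catalog.put("Product A", new HashSet<>(Arrays.asList("EN", "CZ", "IT")));
        catalog.put("Product B", new HashSet<>(Arrays.asList("EN", "DE", "FR")));
        catalog.put("Product C", new HashSet<>(Arrays.asList("EN", "CZ", "PL")));
        return catalog;
    }

    public static void main(String[] args) {
        System.out.println(availableIn(sampleCatalog(), "EN")); // prints [Product A, Product B, Product C]
        System.out.println(availableIn(sampleCatalog(), "CZ")); // prints [Product A, Product C]
    }
}
```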

Parameters (faceted search)

Faceted search is commonly used on all major e-commerce sites. Facets represent properties of items in a certain area that are significant for customers when picking the correct item to buy. Faceted search can usually be found in the category drill-down view, on specific group pages or on the fulltext search result page.

According to some studies, a properly implemented faceted search (sometimes called parametrized search) can increase conversions of e-commerce sites by 20%. Each product can have multiple parameters in the form of a key-value map. Values might represent:

discrete constants (for example color:black, size:XXL, OS:Android) visualized as checkboxes or selects
or numeric values that are spread out in some range and can be visualized as a slider with min and max boundaries

Parameters (facets) are organized and grouped into facet groups by their similarity (color, size, operating system). When the user enters the category page (or any other item listing view), they should only see facets (facet groups) that make sense in that category. In other words, the only facets visible in the filter are those that are linked to (referenced by) at least one item visible in the view (ignoring pagination settings).

To help customers see how a certain facet narrows the item listing, e-commerce sites display the number of items that have a certain property next to it - see this example (numbers in brackets):

Faceted search on Alzashop.com

An even better approach is to reflect the currently selected filter in the faceted filter itself. Facets that would return no results if selected are displayed as disabled (see the grayed-out properties with zero in brackets in the above example).

See the description of the facet summary object for better understanding.
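The counting rule described in this chapter can be sketched as follows: for every item in the view (ignoring pagination), each facet it references is counted once, so facets that no visible item references never appear in the result. The class name, the method name and the representation of facets as plain strings are assumptions of this sketch, not the facet summary object itself.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical sketch of computing per-facet item counts for a listing view. */
final class FacetCounts {

    /** Counts, for each facet, how many items in the view reference it. */
    static Map<String, Integer> count(Map<Integer, Set<String>> facetsByItem, Collection<Integer> itemsInView) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Integer item : itemsInView) {
            for (String facet : facetsByItem.getOrDefault(item, Collections.<String>emptySet())) {
                counts.merge(facet, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<Integer, Set<String>> facetsByItem = new HashMap<>();
        facetsByItem.put(1, new HashSet<>(Arrays.asList("color:black", "size:M")));
        facetsByItem.put(2, new HashSet<>(Arrays.asList("color:black")));
        facetsByItem.put(3, new HashSet<>(Arrays.asList("color:red")));
        // Item 3 is not part of the view, so color:red is not reported at all.
        System.out.println(count(facetsByItem, Arrays.asList(1, 2))); // prints {color:black=2, size:M=1}
    }
}
```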

Interval properties

E-commerce filters contain not only facets, but also sliders that allow the user to search for items having an attribute in a certain value range. For example, when you buy a refrigerator, you are usually constrained by the space at your disposal, and you need to set limits for the width, length and height of the refrigerator.

Example:

Faceted search on Senesi.cz

The search engine must:

return thresholds for each interval property (stored as entity attribute):
- highest value of the property in the item view
- lowest value of the property in the item view
optionally: compute a histogram of the attribute values from the items in this view (i.e. thresholds with a count of items having this property in the respective histogram interval)

See the histogram object for better understanding.
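
As a rough sketch of these requirements, the following Java code computes the two thresholds and a histogram for one interval property. It assumes equally wide buckets and plain `double` values, which the assignment does not prescribe. `IntervalHistogram` is a hypothetical name used only for this example.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration only; equal-width buckets are an assumption. */
final class IntervalHistogram {

    /** Returns {lowest, highest} value of the property among the items in the view. */
    static double[] thresholds(double[] values) {
        double min = Arrays.stream(values).min().getAsDouble();
        double max = Arrays.stream(values).max().getAsDouble();
        return new double[] {min, max};
    }

    /** Splits the range between the thresholds into equal buckets and counts items per bucket. */
    static int[] histogram(double[] values, int bucketCount) {
        double[] t = thresholds(values);
        double width = (t[1] - t[0]) / bucketCount;
        int[] counts = new int[bucketCount];
        for (double value : values) {
            int index = width == 0 ? 0 : (int) ((value - t[0]) / width);
            // the highest value falls exactly on the upper bound, so clamp it into the last bucket
            counts[Math.min(index, bucketCount - 1)]++;
        }
        return counts;
    }
}
```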

Inverted relations

Facets in the same facet group are usually combined by a boolean OR (disjunction) relation and facets in different groups are combined by a boolean AND (conjunction) relation. These relations might be inverted in some edge cases and the database must support the definition of inverted relations among facet groups and the facets within a certain group.
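
A minimal sketch of this evaluation follows. It covers only the inversion inside a group (OR switched to AND). Inverting the relation among groups is left out for brevity. The class `FacetRelations` and its parameters are hypothetical names chosen for this example.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical helper for illustration only; not part of the evitaDB API. */
final class FacetRelations {

    /**
     * @param itemFacets        facets referenced by the tested item
     * @param selected          selected facets keyed by their facet group
     * @param conjunctiveGroups groups whose facets are combined by AND instead of the default OR
     */
    static boolean matches(Set<String> itemFacets,
                           Map<String, Set<String>> selected,
                           Set<String> conjunctiveGroups) {
        for (Map.Entry<String, Set<String>> group : selected.entrySet()) {
            Set<String> wanted = group.getValue();
            boolean groupMatches = conjunctiveGroups.contains(group.getKey())
                ? itemFacets.containsAll(wanted)                    // inverted relation: AND
                : wanted.stream().anyMatch(itemFacets::contains);   // default relation: OR
            if (!groupMatches) {
                return false; // different groups are combined by AND
            }
        }
        return true;
    }
}
```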

Negative properties

Some facets might have a negative meaning: if a user selects them, they expect the listing to contain only items that don't have such a property (for example, selecting the facet allergen:gluten removes all items containing gluten from the listing).

When a facet with a negative meaning (defined on the property group) is selected in the filter, the search engine must:

EXCLUDE all items having such property from the result
properly compute the number of selected records
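
The exclusion itself can be illustrated with a short sketch, assuming again that each item is a set of facet identifiers. `NegativeFacets` is a hypothetical name. The size of the returned list is the number of selected records.

```java
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical helper for illustration only; not part of the evitaDB API. */
final class NegativeFacets {

    /** Keeps only the items that reference none of the selected negative facets. */
    static List<Set<String>> exclude(List<Set<String>> items, Set<String> selectedNegativeFacets) {
        return items.stream()
            .filter(item -> Collections.disjoint(item, selectedNegativeFacets))
            .collect(Collectors.toList());
    }
}
```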

Impact statistics

Facets that would further expand the number of matching items (if selected) display the difference with a plus sign, or the updated overall result count, next to them. See the example below (numbers in brackets):

Extended facet statistics on CZC.cz

When any facet filtering is applied in the passed database query, the search engine must:

return extended statistics computed for all other facets that contain information about:
- how many items would be added to the result if the facet were added to the filter
- how many items would be removed from the result if the facet were added to the filter
- how many items would remain in the result if the facet were added to the filter
correctly apply OR / AND / NOT relations defined by property groups
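
To make the expected numbers concrete, here is a naive sketch that simply re-evaluates the filter with one more facet selected. It assumes only the default relations (OR within a group, AND across groups) and is not how an optimized engine would compute the statistics. `FacetImpact` and its methods are hypothetical names.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper for illustration only; recounts the whole view for each facet. */
final class FacetImpact {

    /** Default relations: OR within a facet group, AND across groups. */
    static boolean matches(Set<String> itemFacets, Map<String, Set<String>> selected) {
        return selected.values().stream()
            .allMatch(facets -> facets.stream().anyMatch(itemFacets::contains));
    }

    static long count(List<Set<String>> items, Map<String, Set<String>> selected) {
        return items.stream().filter(item -> matches(item, selected)).count();
    }

    /**
     * Returns {difference, resulting count} for the case that the facet is added to the selection.
     * A positive difference means items are added, a negative one means items are removed.
     */
    static long[] impact(List<Set<String>> items, Map<String, Set<String>> selected,
                         String group, String facet) {
        Map<String, Set<String>> extended = new HashMap<>();
        selected.forEach((g, facets) -> extended.put(g, new HashSet<>(facets)));
        extended.computeIfAbsent(group, g -> new HashSet<>()).add(facet);
        long before = count(items, selected);
        long after = count(items, extended);
        return new long[] {after - before, after};
    }
}
```

Selecting another facet from an already used group can only widen the result, while selecting a facet from a new group can only narrow it. This is why the listing shows a plus sign in the first case.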

Prices

The search engine can search for products whose price for sale falls within the price range specified by the user. The price for sale must respect the selected currency, price date/time validity range and relates to one of the price list identifiers available to the user. All this information is part of the input query.

For master products, the lowest price of any product variant is used. For complete sets, dual behaviour must be supported:

if a price is set directly for a set, use this price
if not, the most preferred price of each item in the set must be added up to produce the resulting amount

The exact and detailed price for sale computation algorithm is described in a separate chapter.
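
The two rules for master products and complete sets can be sketched as follows. The sketch assumes that the most preferred price of each variant or set item has already been resolved for the requested currency, validity and price lists. `PriceForSale` is a hypothetical name and does not replace the detailed algorithm.

```java
import java.math.BigDecimal;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper for illustration only; inputs are already resolved prices. */
final class PriceForSale {

    /** Master product: the lowest price of any of its variants (empty if there are no variants). */
    static Optional<BigDecimal> forMasterProduct(List<BigDecimal> variantPrices) {
        return variantPrices.stream().min(BigDecimal::compareTo);
    }

    /**
     * Complete set: the price set directly for the set when present (non-null),
     * otherwise the sum of the prices of its items.
     */
    static BigDecimal forProductSet(BigDecimal ownPrice, List<BigDecimal> itemPrices) {
        if (ownPrice != null) {
            return ownPrice;
        }
        return itemPrices.stream().reduce(BigDecimal.ZERO, BigDecimal::add);
    }
}
```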

Price histogram

Optionally, we would like to be able to generate a histogram of prices similar to interval properties.

Brands / groups

E-commerce sites have special landing pages for brands / manufacturers or groups. These landing pages behave similarly to the category detail page and display all items directly related to the brand / group. The search engine must be able to compute facets and all other information for all entities that relate to (have a reference to) the specified external entity.

The number of products of a given brand in the categories

For any of the brands or groups, the search engine needs to look up all hierarchy placements (categories) where at least one product of that brand or group is (directly or transitively) located. The brand detail page frequently lists all categories of products the brand produces.

Fulltext search

The solution should make it easy to combine fulltext search with parametrized (faceted) search. Fulltext search may not be implemented initially, but the solution should offer a mechanism for integrating an external fulltext system.

Hierarchical search and tree exclusion

Categories are usually organized in a hierarchical fashion. A single product may be listed in one or more categories. E-commerce sites usually display all products in a specific category or its subcategories. Certain categories might be excluded from display by the site owner or might be accessible only to a subset of the frontend users (by design). The search engine should take all of these requirements into account and should transparently exclude all items that are only a part of the excluded category subtree.

The search engine should assist in menu generation from the hierarchical tree in real time, listing only those hierarchical entities (categories) that:

match the basic attribute predicate: for example, those that are marked as visible, are valid for display at that certain moment in time etc.
contain at least one visible product at any child level; the product must:
- match an independent attribute query
- match the requested locale
- produce a price for sale

For each returned entity (category), the search engine must:

compute the overall count of all items available in this entity (category)
- items that have an invisible parent (a category that doesn't match its own predicate) must not be counted
- a single item (product) may relate to more than one hierarchical parent (category); the fact that a product is not counted along one category axis must not affect its visibility along another axis
- counts of lower level nodes are automatically included in the overall count of their parent category
produce a result in a tree-like structure friendly for rendering
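
The counting rules can be illustrated by a small recursive sketch. It assumes that every category holds the identifiers of the products placed directly in it and that a product reachable through several subcategories is counted once in their common parent. The assignment does not state the second assumption explicitly. `CategoryCounts` and `Category` are hypothetical names.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper for illustration only; not part of the evitaDB API. */
final class CategoryCounts {

    static final class Category {
        final String name;
        final boolean visible;
        final Set<Integer> productIds;
        final List<Category> children;

        Category(String name, boolean visible, Set<Integer> productIds, List<Category> children) {
            this.name = name;
            this.visible = visible;
            this.productIds = productIds;
            this.children = children;
        }
    }

    /** Returns product counts for all visible categories that contain at least one product. */
    static Map<String, Integer> visibleCounts(Category root) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        collect(root, counts);
        return counts;
    }

    private static Set<Integer> collect(Category node, Map<String, Integer> counts) {
        if (!node.visible) {
            return Set.of(); // the whole subtree of an invisible category is excluded
        }
        Set<Integer> products = new HashSet<>(node.productIds);
        for (Category child : node.children) {
            products.addAll(collect(child, counts)); // child products count towards the parent
        }
        if (!products.isEmpty()) {
            counts.put(node.name, products.size());
        }
        return products;
    }
}
```

A product that sits in both an invisible and a visible category is still counted through the visible one, because each placement is evaluated independently.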

Product sorting

The search engine must be able to sort results by:

attribute: for example, the number of stars, number of sales
multiple attributes: there are special cases where one attribute contains less selective values and another attribute is required for predictable search results
price for sale: cheapest, most expensive
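
Sorting by multiple attributes can be sketched with a chained comparator. The example assumes two numeric attributes, the number of stars and the number of sales, both sorted in descending order. `ProductSorting` and `Product` are hypothetical names.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper for illustration only; not part of the evitaDB API. */
final class ProductSorting {

    static final class Product {
        final String name;
        final int stars;
        final int sales;

        Product(String name, int stars, int sales) {
            this.name = name;
            this.stars = stars;
            this.sales = sales;
        }
    }

    /** Sorts by stars (highest first); ties are resolved by the number of sales (highest first). */
    static List<Product> byStarsThenSales(List<Product> products) {
        Comparator<Product> order = Comparator
            .comparingInt((Product p) -> p.stars).reversed()
            .thenComparing(Comparator.comparingInt((Product p) -> p.sales).reversed());
        return products.stream().sorted(order).collect(Collectors.toList());
    }
}
```

The number of stars has only a few distinct values, so the second attribute is what makes the order predictable.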

Personalized sorting

As a part of the research, it would be useful to look at the possibilities of personalized sorting, which would present a user first with the results that are likely to be of interest to them, based on the site's previous experience with that user (i.e., based on their previous purchases or visits).

For this purpose, it will probably be necessary to use one of the "shallow" artificial intelligence (machine learning) algorithms and construct a personalised search index in such a way that the search is not slowed down.

This functionality is not critical to the evaluation of the winning approach - it is an add-on functionality that no existing database currently includes as part of its functionality. However, we know that global trends are moving in this direction.

The previous paragraph was written in 2018. As of now, Elasticsearch supports a k-nearest neighbour algorithm that could be used for personalized sorting. There is also Vespa.ai, a mature database focused on machine-learning-assisted fulltext search and recommendation.

Author: Ing. Jan Novotný (FG Forrest, a.s.)

Date updated: 15.12.2022
