evitaDB - Fast e-commerce database

Three independent teams implement the same functional specification of the API and query language, which was defined based on many years of operational experience at FG Forrest, a.s.:

  1. the first team builds its implementation on top of the PostgreSQL relational database
  2. the second team builds its implementation on top of the Elasticsearch NoSQL database
  3. the third team is building a greenfield implementation as an in-memory NoSQL database

Testing process

A common test suite is created and made available to the developers of all three teams:

  • functional tests
  • performance tests

Separately, each team creates its own set of unit tests, which are not part of the final assessment and serve only the development needs of that team.

Performance tests use the JMH tool and are twofold in nature:

  1. tests verifying the speed of specific individual database functions (e.g. only filtering by attributes, only by prices, or only by the facet filter); a minimal benchmark sketch follows this list
  2. synthetic tests verifying the overall speed of the database on a sample of real queries from the production environment
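To make the first kind of test concrete, the sketch below shows what a minimal single-function JMH benchmark could look like. The class name, the in-memory product list and the attribute filter are illustrative stand-ins only; the real benchmarks call each prototype's own query API over the packed data sets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

/**
 * Minimal single-function benchmark sketch: measures one isolated operation
 * (here a simple attribute filter over an in-memory product list) instead of
 * the whole query pipeline. The real tests call the prototype's query API.
 */
@State(Scope.Benchmark)
@BenchmarkMode({Mode.Throughput, Mode.AverageTime})
public class AttributeFilteringBenchmark {

    private List<Map<String, String>> products;

    @Setup
    public void prepareDataSet() {
        // in the real suite this step loads the anonymized production snapshot
        Random random = new Random(42);
        products = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) {
            products.add(Map.of("status", random.nextBoolean() ? "ACTIVE" : "HIDDEN"));
        }
    }

    @Benchmark
    public void filterByAttribute(Blackhole blackhole) {
        // consume the result so the JIT cannot optimize the filtering away
        blackhole.consume(
            products.stream().filter(p -> "ACTIVE".equals(p.get("status"))).count()
        );
    }
}
```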

Real queries were recorded in a log for several days on the existing e-shop solutions (see the next paragraphs). They were then converted into evitaDB query structures and saved back into the log file. This log file is loaded by the performance test, and the queries are split among parallel threads and replayed sequentially over a snapshot of the anonymized production data that was created during the same period as the traffic recording on the e-shop. The queries and the data are therefore roughly correlated.
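The replay mechanism can be pictured roughly as in the following sketch. It is only an illustration under assumed names (the QueryLogReplay class, the log file path, the round-robin split across threads); the real harness parses each logged line into an evitaDB query before executing it against the prototype.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/**
 * Rough sketch of the query log replay: the recorded log is loaded, split
 * among N worker threads, and each thread replays its share of queries
 * sequentially against the anonymized data snapshot.
 */
public class QueryLogReplay {

    public static void replay(Path queryLog, int threadCount, Consumer<String> queryExecutor)
            throws Exception {
        List<String> loggedQueries = Files.readAllLines(queryLog);
        ExecutorService pool = Executors.newFixedThreadPool(threadCount);

        for (int t = 0; t < threadCount; t++) {
            final int offset = t;
            pool.submit(() -> {
                // each worker sequentially replays every N-th query (round-robin split)
                for (int i = offset; i < loggedQueries.size(); i += threadCount) {
                    queryExecutor.accept(loggedQueries.get(i));
                }
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    public static void main(String[] args) throws Exception {
        // the real executor converts the logged line into an evitaDB query and runs it;
        // here a stand-in simply prints the query
        replay(Path.of("/data/query-log.txt"), Runtime.getRuntime().availableProcessors(),
                query -> System.out.println("replaying: " + query));
    }
}
```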

Performance tests use the following data sets:

  1. anonymized production database of the Senesi e-shop: 115 thousand products, 3.3 million prices, 3.8 million attributes, 965 thousand relations, 695 thousand JSON data fragments
  2. anonymized production database of the Signál nábytek e-shop: 15 thousand products, 120 thousand prices, 525 thousand attributes, 115 thousand relations, 90 thousand JSON data fragments
  3. anonymized production database of the Keramika Soukup e-shop: 13 thousand products, 10 thousand prices, 265 thousand attributes, 55 thousand relations, 43 thousand JSON data fragments
  4. artificial data set: a randomly generated set of 100 thousand products, 1.3 million prices, 800 thousand attributes, 680 thousand relations, 200 thousand JSON data fragments

Each implementation produces a separate JAR that can be run from the command line and builds its own Docker image on top of it. Along with the database prototype, the test datasets are packed into the image and JMH is started, which performs the performance tests mentioned above. The results are then stored on GitHub, where they also remain available retroactively. Thanks to Docker, the tests can easily be run on a local machine on all major operating systems. We are working on making it possible for you to run and verify these tests on your own machine.
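As an illustration of how such a self-contained benchmark JAR can start JMH from its command-line entry point, a main method along the following lines would suffice. The class name and the result file name are assumptions; the Runner and OptionsBuilder calls are standard JMH API.

```java
import org.openjdk.jmh.results.format.ResultFormatType;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

/**
 * Minimal entry point of a benchmark JAR: runs all benchmarks whose class
 * names match the pattern and writes the results as JSON so they can be
 * published (e.g. to a GitHub repository).
 */
public class BenchmarkRunner {

    public static void main(String[] args) throws Exception {
        Options options = new OptionsBuilder()
                .include("Benchmark")              // run every class containing "Benchmark"
                .forks(1)
                .resultFormat(ResultFormatType.JSON)
                .result("benchmark-results.json")
                .build();
        new Runner(options).run();
    }
}
```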

The running tests are monitored using Prometheus and Grafana, and the output includes visual graphs of CPU load and memory usage.

Test environment specification

  • 4x CPU (Intel Xeon Skylake or Cascade Lake processors with a base clock of 2.7 GHz)
  • 16 GB RAM
  • 25 GB SSD (non-NVMe)
  • OS: Linux (Ubuntu 20.04)
  • JDK 11

The research phase ends with final functional and performance testing that compares all evitaDB prototypes. evitaDB targets the low- to mid-range hosting plans recommended for e-commerce solutions, so we will select an appropriate Digital Ocean General Purpose Droplet for the final test. In order to obtain repeatable and comparable results, the test environment will use a droplet with a dedicated CPU. All three implementations are run sequentially on the same droplet, so their results should be comparable to each other.

Final evaluation priorities

  1. the prototype meets all necessary functions of the specification
  2. the prototype maximizes performance of read operations (see the measurement sketch after this list)
    • queries per second (throughput)
    • query latency
  3. the prototype meets the optional features of the specification
  4. the prototype maximizes indexing speed
    • mutations per second (throughput)
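To make the two read metrics concrete, the sketch below computes queries per second and a latency percentile from recorded per-query durations. It is only an illustration of what the terms mean, not the exact aggregation JMH performs.

```java
import java.util.Arrays;

/**
 * Illustrative computation of the two read metrics from recorded per-query
 * durations: throughput as queries per second and latency as a percentile.
 */
public class ReadMetrics {

    /** Queries per second over the whole measured interval. */
    static double throughput(long totalQueries, double elapsedSeconds) {
        return totalQueries / elapsedSeconds;
    }

    /** Latency percentile (e.g. 0.99 for p99) over per-query durations in milliseconds. */
    static double latencyPercentile(double[] durationsMillis, double percentile) {
        double[] sorted = durationsMillis.clone();
        Arrays.sort(sorted);
        int index = (int) Math.ceil(percentile * sorted.length) - 1;
        return sorted[Math.max(index, 0)];
    }

    public static void main(String[] args) {
        double[] durations = {1.2, 0.8, 3.5, 1.1, 0.9};
        System.out.printf("throughput: %.1f qps%n", throughput(5, 0.0075));
        System.out.printf("p99 latency: %.1f ms%n", latencyPercentile(durations, 0.99));
    }
}
```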

HW requirements

Along with the measured performance, telemetry data from the system on which the tests are running is also taken into account. We monitor:

  1. unused RAM capacity (weighted average)
  2. unused CPU capacity (weighted average); see the sketch of the weighted average after this list
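The weighted average can be pictured as in the following sketch, where each telemetry sample is weighted by the length of the interval it covers. This is an assumed interpretation of the aggregation, not the exact Prometheus query used.

```java
/**
 * Illustrative time-weighted average of unused capacity samples: each sample
 * of free RAM (or idle CPU) is weighted by the length of the interval it covers.
 */
public class UnusedCapacity {

    static double timeWeightedAverage(double[] freeCapacity, double[] intervalSeconds) {
        double weightedSum = 0.0;
        double totalTime = 0.0;
        for (int i = 0; i < freeCapacity.length; i++) {
            weightedSum += freeCapacity[i] * intervalSeconds[i];
            totalTime += intervalSeconds[i];
        }
        return weightedSum / totalTime;
    }

    public static void main(String[] args) {
        // e.g. free RAM in GB sampled over intervals of different length
        double[] freeRamGb = {6.0, 4.5, 5.2};
        double[] seconds = {60, 120, 60};
        System.out.printf("weighted average free RAM: %.2f GB%n",
                timeWeightedAverage(freeRamGb, seconds));
    }
}
```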

The implementation with lower requirements receives a better rating than the others.