Scylla DB modelling techniques for creating a multitenant analytical engine

Aug 07, 2024

Designing a multi-tenant analytical engine with ScyllaDB involves careful data modeling to ensure data isolation, efficient query performance, and scalability. Here are the key techniques and considerations:

Data Isolation with Tenant IDs

Each tenant's data should be isolated to ensure privacy and performance isolation. This is typically achieved by incorporating a `tenant_id` in the primary key.

Primary Key Design

The primary key in ScyllaDB is composed of a partition key and optional clustering columns. The partition key determines how data is distributed across nodes while clustering columns define the sorting order within each partition.

For a multi-tenant setup, a common pattern is to use the `tenant_id` as part of the partition key. Here are a few examples of table structures:

Save different metrics and sort by time

CREATE TABLE analytics_data (
    tenant_id UUID,
    metric_id UUID,
    timestamp TIMESTAMP,
    value DOUBLE,
    PRIMARY KEY ((tenant_id, metric_id), timestamp)
);

Save different events and sort by time

CREATE TABLE events_data (
    tenant_id UUID,
    event_type TEXT,
    timestamp TIMESTAMP,
    event: TEXT
    PRIMARY KEY ((tenant_id, event_type), timestamp)
);

In case of events_data we did not use event_id which could have been a random id as part of partitioning key as it will lead to random storage. This would mean that clusterting columns are not being utilized properly to sort the actual data but instead randomized data. Thus, resulting in an inefficient data model.

In general:

The tenant_id is used along with the logical key.

The Logical Key is used to identify the data of interest. Its usage with 'tenant_id', helps in the "logical grouping" of data. This grouping of data allows clustering columns to provide the desired sort order.

Data Partitioning

In the example above, the partition key is a composite of `tenant_id` and `metric_id`. This design ensures that data is partitioned by the tenant, reducing the risk of hot spots and improving query performance by localizing data for each tenant.

Clustering Order

The clustering column `timestamp` is used to order time-series data within each partition. This allows efficient “range queries” over time.

Data Ingestion

Efficiently ingesting data in a multi-tenant system involves batch operations and careful management of write patterns to avoid overloading any single node. Below, with tenant_id and metric_id the write pattern is majorly even unless a few of the metric_id(s) are outliers and create hot-spot.

INSERT INTO analytics_data (tenant_id, metric_id, timestamp, value)
VALUES (uuid(), uuid(), '2024-01-01T00:00:00Z', 123.45);

Query Design

Queries should be designed to leverage the partition key for efficient lookups.

Retrieve Metrics for a Specific Tenant

SELECT * FROM analytics_data
WHERE tenant_id = some_tenant_uuid
AND metric_id = some_metric_uuid
ORDER BY timestamp DESC;

Aggregate Data for a Tenant

SELECT AVG(value) FROM analytics_data
WHERE tenant_id = some_tenant_uuid
AND metric_id = some_metric_uuid
AND timestamp >= '2024-01-01T00:00:00Z'
AND timestamp <= '2024-01-01T23:59:59Z';

Secondary Indexes and Materialized Views

Secondary indexes and materialized views can be used to support additional query patterns, but they come with trade-offs in terms of write amplification and consistency.

Using a Materialized View

CREATE MATERIALIZED VIEW tenant_metrics_by_date AS
    SELECT tenant_id, metric_id, timestamp, value
    FROM analytics_data
    WHERE tenant_id IS NOT NULL AND metric_id IS NOT NULL AND timestamp IS NOT NULL
    PRIMARY KEY (tenant_id, timestamp, metric_id);

Handling Large Data Volumes

ScyllaDB is designed to handle large volumes of data. New nodes can be added to the cluster to handle an increase in required throughput. However, it's essential to monitor and manage cluster resources to ensure consistent performance.

Multi-Tenancy Best Practices

Resource Allocation: Ensure fair resource allocation among tenants by configuring appropriate limits and quotas.
Data Retention: Implement data retention policies to manage storage efficiently.
Security: Use role-based access control (RBAC) and encryption to ensure data security.

Use Case

Suppose we have a multi-tenant application that collects and analyzes web traffic data. The data model might look like this:

CREATE TABLE web_traffic (
    tenant_id UUID,
    page_id UUID,
    timestamp TIMESTAMP,
    visitor_count INT,
    PRIMARY KEY ((tenant_id, page_id), timestamp)
);

Ingesting data

INSERT INTO web_traffic (tenant_id, page_id, timestamp, visitor_count)
VALUES (uuid(), uuid(), '2024-01-01T00:00:00Z', 100);

Querying data for a tenant's specific page

SELECT * FROM web_traffic
WHERE tenant_id = some_tenant_uuid
AND page_id = some_page_uuid
ORDER BY timestamp DESC;

Aggregating visitor counts for a tenant over a day:

SELECT SUM(visitor_count) FROM web_traffic
WHERE tenant_id = some_tenant_uuid
AND page_id = some_page_uuid
AND timestamp >= '2024-01-01T00:00:00Z'
AND timestamp <= '2024-01-01T23:59:59Z';

By following these techniques, we can effectively design a scalable and efficient multi-tenant analytical engine with ScyllaDB.

Engineering Mojo

Discussion about this post

Ready for more?