Tuesday, September 28, 2021

Testing MariaDB ColumnStore performance

MariaDB's ColumnStore is an engine that stores data in a columnar fashion. Although it shines in distributed architectures where you have a massive amount of data, you can see non-trivial performance improvements in single-instance architectures. This article shows how to:

  • install ColumnStore using Docker in your development machine,
  • create a simple Java web application to generate realistic test data,
  • run test SQL queries to compare execution times of the ColumnStore vs InnoDB engines.

To follow this article, you need Java, a Java IDE, and Docker installed on your computer.

What is ColumnStore?

ColumnStore is a pluggable engine for MariaDB and MySQL. You can install it on top of an existing database instance. It is possible to have tables in the same database using multiple storage engines. For example, you can create a table called book for OLTP using the general-purpose InnoDB engine, and a table called book_analytics for OLTP using the column-oriented ColumnStore engine. Both tables can live in the same database schema and you can run queries that mix the two. Check these online resources to learn more about OLTP/OLAP and the different storage engines available in MariaDB:

Installing MariaDB and ColumnStore using Docker

Make sure you have Docker installed and running on your machine and use the following commands to download an image and create a container with CentOS, MariaDB and ColumnStore preconfigured:

docker pull mariadb/columnstore

docker run -d -p 3306:3306 --name mariadb_columnstore mariadb/columnstore

Configuring the database

If you want to configure the instance you can connect to the container using:

docker exec -it mariadb_columnstore bash

For example, you can edit the /etc/my.cnf file to increase the buffer pool size (this is an optional step):

[mysqld]
innodb_buffer_pool_size=1G

Remember to restart the container when you make this kind of change:

docker restart mariadb_columnstore

Connecting to the MariaDB instance

Use any SQL client that supports MariaDB to connect to the instance. For example, you can use IntelliJ IDEA's database view, or the command-line tool, mariadb. If you have MariaDB in your host machine, you already have the tool:

mariadb --protocol tcp -u user -p

Since a Docker container is not really a virtual machine, you connect to the database like if it was running natively on your matching, that is, the host is localhost.

Setting up the test user and tables

Create a new database user:

docker exec mariadb_columnstore mariadb -e "GRANT ALL PRIVILEGES ON *.* TO 'user'@'%' IDENTIFIED BY 'password';"

Create a new database schema and two tables, one using the InnoDB engine and another using ColumnStore:

CREATE DATABASE book_demo;

USE DATABASE book_demo;

CREATE TABLE book
(
    id           int(11) NOT NULL AUTO_INCREMENT,
    title        varchar(255) DEFAULT NULL,
    author       varchar(255) DEFAULT NULL,
    publish_date date         DEFAULT NULL,
    pages        int(8)       DEFAULT NULL, 
    image_data   longtext     DEFAULT NULL,
    PRIMARY KEY (id)
) engine = InnoDB;

CREATE TABLE book_analytics
(
    id           int(11) NOT NULL,
    title        varchar(255) DEFAULT NULL,
    author       varchar(255) DEFAULT NULL,
    publish_date date         DEFAULT NULL,
    pages        int(8)       DEFAULT NULL
) engine = ColumnStore;

Generating test data using Java

Create a new project using the Spring Initializr and add the Spring Data JPA, MariaDB Driver, Lombok, and Vaadin dependencies. Also, add the following dependency to the pom.xml file:

<dependency>
    <groupId>com.vaadin</groupId>
    <artifactId>exampledata</artifactId>
    <version>4.0.0</version>
</dependency>

Configure the database connection in the application.properties file:

spring.datasource.url=jdbc:mariadb://localhost:3306/book_demo
spring.datasource.username=user
spring.datasource.password=password

Create a new JPA Entity class to encapsulate the test data and persist it in a table row:

package com.example;

import lombok.Data;
import lombok.EqualsAndHashCode;

import javax.persistence.*;
import java.time.LocalDate;

@Data
@EqualsAndHashCode(onlyExplicitlyIncluded = true)
@Entity
@Table(name = "book")
public class Book {

    @EqualsAndHashCode.Include
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Integer id;

    private String title;

    private String author;

    private LocalDate publishDate;

    private Integer pages;

    @Lob
    private String imageData;

}

Create a new repository:

package com.example;

import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.stereotype.Repository;

@Repository
public interface BookRepository extends JpaRepository<Book, Integer> {
}

Create a service class with logic to generate realistic random test data in batches:

package com.example;

import com.vaadin.exampledata.ChanceIntegerType;
import com.vaadin.exampledata.DataType;
import com.vaadin.exampledata.ExampleDataGenerator;
import lombok.RequiredArgsConstructor;
import lombok.extern.log4j.Log4j2;
import org.springframework.stereotype.Service;

import java.time.LocalDateTime;
import java.util.List;
import java.util.Random;

@Service
@RequiredArgsConstructor
@Log4j2
public class GeneratorService {

    private final BookRepository repository;

    public void generate(int batchSize, int batches) {
        var generator = new ExampleDataGenerator<>(Book.class, LocalDateTime.now());
        generator.setData(Book::setTitle, DataType.BOOK_TITLE);
        generator.setData(Book::setAuthor, DataType.FULL_NAME);
        generator.setData(Book::setPublishDate, DataType.DATE_LAST_10_YEARS);
        generator.setData(Book::setPages, new ChanceIntegerType("integer", "{min: 20, max: 1000}"));
        generator.setData(Book::setImageData, DataType.BOOK_IMAGE_URL);

        for (int batchNumber = 1; batchNumber <= batches; batchNumber++) {
            List<Book> books = generator.create(batchSize, new Random().nextInt());
            repository.saveAllAndFlush(books);
            log.info("Batch " + batchNumber + " completed.");
        }
    }

}

Create a web UI:

package com.example;

import com.vaadin.flow.component.button.Button;
import com.vaadin.flow.component.html.H1;
import com.vaadin.flow.component.notification.Notification;
import com.vaadin.flow.component.orderedlayout.VerticalLayout;
import com.vaadin.flow.component.textfield.IntegerField;
import com.vaadin.flow.router.PageTitle;
import com.vaadin.flow.router.Route;
import lombok.extern.log4j.Log4j2;

@Route("")
@PageTitle("Data generator")
@Log4j2
public class GeneratorView extends VerticalLayout {

    public GeneratorView(GeneratorService service) {
        var batchSize = new IntegerField("Batch size");
        var batches = new IntegerField("Batches");

        Button start = new Button("Start");
        start.addClickListener(event -> {
            service.generate(batchSize.getValue(), batches.getValue());
            Notification.show("Data generated.");
        });

        add(new H1("Data generator"), batchSize, batches, start);
    }

}

Running the generator web application

The project includes an Application class with a standard Java entry point method. Run this application as you would with any other Java application using your IDE or the command line with Maven (make sure to use the name of the JAR file that your project generated):

mvn package

java -jar target/test-data-generator-0.0.1-SNAPSHOT.jar

The first build and run of the application take time, but further builds and runs will be faster. You can invoke the application in the browser at http://localhost:8080:

I generated 100 batches of 10000 rows each for a million rows in the database. This takes some time. Check the log if you want to see the progress.

Challenge: Try using the ProgressBar class, the @Push annotation, and the UI.access(Command) to show the progress in the browser.

Running a simple ETL process for analytics

To populate the book_analytics table, we can execute a simple SQL INSERT statement that serves as an Extract, Transform, Load (ETL) process. In concrete, the image_data column stores the image potentially used in a web application that shows the cover of a book. We don't need this data for OLAP, which is the reason we didn't include the image_data column in the book_analytics table. Populate the data with the following SQL statement:

INSERT INTO book_analytics(id, author, pages, publish_date, title)
    SELECT id, author, pages, publish_date, title FROM book;

The process should take only a few seconds for a million rows.

Comparing the performance of analytical queries

Below are some examples of queries that you can use to compare the performance of the ColumnStore engine vs InnoDB. I used an empiric approach and run each query several times (5 to 10 times) taking the fastest run for each. Here are the results:

SELECT COUNT(id) FROM book; -- 251 ms
SELECT COUNT(id) FROM book_analytics; -- 111 ms

SELECT publish_date, SUM(pages) FROM book GROUP BY publish_date; -- 767 ms
SELECT publish_date, SUM(pages) FROM book_analytics GROUP BY publish_date; -- 206 ms

SELECT author, COUNT(id) FROM book GROUP BY author HAVING COUNT(id) > 5; -- 20 s 218 ms
SELECT author, COUNT(id) FROM book_analytics GROUP BY author HAVING COUNT(id) > 5; -- 1 s 93 ms

The biggest difference is in the last query. 20 seconds vs 2 seconds approximately. All the queries show a non-trivial advantage in favor of ColumnStore.

Take into account that for other kinds of database operations, ColumnStore could be slower than InnoDB. For example, queries that read several columns without aggregate functions involved. Always make informed decisions and experiment with queries before deciding on InnoDB, ColumnStore, or other engines for your database tables.

Share:

2 comments:

Anonymous said...

You don't give any info about what resources you've allocated for your Docker container. Columnstore needs a lot of CPU. Same for the RAM situation. Which is curious because you've stolen 1GB for innodb_buffer_pool. Also 1 million records is woefully inadequate for an analytics benchmark. The data size should be in the terabytes.

Alejandro Duarte said...

Thanks for sharing your thoughts. I started the container as-is from the image. I didn't really configure 1G for the buffer but wanted to show how to configure parameters, so that was only an example. I'm aware that 1 million rows might be far from certain real-world scenarios, but as I mentioned in the initial paragraph, having massive amounts of data is not necessary in order to see important performance gains when using ColumnStore. Hopefully, the instructions in this article serve developers who want to start exploring the topic more in-depth.