January 15, 2023
10
min read

Database seeding for Preview Environments

In this article we discuss two strategies for seeding database instances for each of your ephemeral Preview Environments.

Summary

Depending on your application and test data requirements, you can use one of the following methods to seed relational database instances for each of your Preview Environments:

  • Method 1: Load an SQL dump file upon container initialization.
  • Method 2: Use a language/framework-specific migration tool.

With either of these methods, the test data can be stored in your repository or object storage, depending on the size of your data set and how frequently it gets updated.

Introduction

Previews simplify the testing process by replacing a polluted QA environment with a clean and ephemeral test environment for every feature branch. But what if your application depends on a stateful service like a database? How can you include persistent test data or other stateful configurations in each of your Preview Environments? In this guide we discuss important design considerations for database seeding in ephemeral environments along with specific techniques you might consider to add seed/test data to your Preview Environments.

Assumptions

This article assumes that your application or service uses an SQL relational database management system (RDBMS), such Postgres, MySQL, or MariaDB. Non-relational (NoSQL) databases and other types of datastores can also be seeded but are outside the scope of this article.

Where should you store seed data?

Before discussing specific seeding methods, it's first worth considering where to store data before it gets seeded into a database.

Storage Option 1: Your repository

One option is to store your seed data in your Git repository alongside your code. GitHub warns about files larger than 50MB and blocks files larger than 100MB, so this isn't a good option for larger data sets. For files larger than this, GitHub offers Large File Storage (LFS), which uses a pointer file committed to your repository that references your actual data, stored outside your repository. However, if your seed data changes frequently, such as when it's mirrored from production, storing within Git will generate many commits, each with a full duplicate. Therefore, storing seed data in your repository is only recommended for small datasets that change infrequently.

Storage Option 2: Object storage (recommended)

If your seed data is larger than 100MB or changes often, you should consider storing it in an object storage service, such Amazon S3 or Google Cloud Storage. In addition to being more scalable, object storage is also simpler and more cost effective than storing within a Git repository.

Seed Method 1: Load an SQL dump upon container initialization

Of the seeding methods discussed, loading an SQL dump file upon container initialization is the most broadly applicable, as it will work with any application that uses SQL like Postgres, MySQL, or MariaDB. This is because we rely on the database container to seed itself, instead of an application container performing the seed.

Pros

  • Will work with any application that uses SQL
  • Simple and straightforward process

Cons

  • Requires updating when your database schema changes

Initialization

The official Postgres, MySQL , and MariaDB images published on Docker Hub include facilities for populating the database on start-up: these containers will  automatically execute *.sql, *sql.gz, and *.sh files stored in the /docker-entrypoint-initdb.d directory.

Important
As of 1 May 2023, there is a known issue where Uffizzi will not mount individual files as volumes. This is currently being fixed. As a workaround you should mount the SQL dump as a directory that contains your *.sql, *sql.gz, or *.sh file(s).

Therefore, if you mount a directory containing one of these file types as a volume in  /docker-entrypoint-initdb.d, you will have a seeded database when the container starts up. Note that this method will work whether your SQL dump is mounted directly from your Git repositiory or whether a directory containing a script is mounted that fetches the dump file from object storage. See the following official Docker documentation for details:

Postgres initialization documentation

MySQL initialization documentation

MariaDB initialization documentation

As described in the links above, the database container will execute these initialization files in sorted name order and only if the data directory in the container is empty, i.e. if there are no pre-existing databases. This is useful for Uffizzi Preview Environments since a freshly seeded database will be created when a pull request is first opened, but any new data will not be overwritten when the environment is refreshed upon new commits to the branch. When the environment is destroyed (by merging or closing the pull request), the database and volumes are destroyed too. This allows teams to persist data during the lifetime of the Preview Environment, while ensuring that new pull requests get a freshly seeded database without data accumulation from the last environment.

Mounting volumes

To store initialization files (*.sql, *sql.gz, or *.sh ) in the /docker-entrypoint-initdb.d directory of your database container, you will need to mount them as Docker volumes. In Uffizzi, you can define a host mount in Docker Compose as follows:

# Mount a directory ./seed/ as a host mount volume that contains a seed.sh or *.sql.gz file
services:
  app:
    ...
  db:
    image: postgres:11.4
    volumes:
      - ./seed/:/docker-entrypoint-initdb.d

Figure 1: An example Docker Compose file that mounts a directory ./seed/ as a host mount volume that contains a seed.sh or *.sql.gz file. This script might fetch an SQL dump file (.sql) from object storage.

As noted above, currently you must mount a directory containing your SQL dump file or script, instead of the individual file(s). In Figure 1, a shell script seed.sh (stored in the ./seed/ directory) is mounted to the /docker-entrypoint-initdb.d directory of the Postgres container. When Uffizzi creates a Preview Environment from this Docker Compose file, the Postgres container will execute seed.sh to fetch seed data from object storage.

Caveats

  • If your database schema changes, you will need to create an new SQL dump file. You might consider adding a step to your CI process to do this automatically. If you expect your schema to change regularly, you might consider method 2.
  • Since scripts in /docker-entrypoint-initdb.d are only run if the container starts with an empty data directory, one problem that can occur is if one of your /docker-entrypoint-initdb.d scripts fails. This causes the container to restart with a data directory that is already initialized, resulting in a database that never gets properly seeded. To avoid this scenario, you can unmount the database volume shown in Figure 1; however, this will cause a database instance to lose any changes after the seed if your deployment restarts (as it does on new branch commits).

Seed Method 2: Use a language/framework-specific migration tool

If your application language or framework includes facilities for database seeding—such as an object-relational mapper (ORM), you may want to choose this method. Whereas in the previous method an SQL dump file is loaded by the database container itself, in this method the application container seeds the database. An advantage of this method is that if your schema changes, your application can easily perform a migration to update the database schema. Database migration and seed commands can be added to your Docker Compose file using a command definition, so that your application performs these steps before the application itself starts.  

Pros

  • Can handle schema migrations and seeding
  • Often very fast

Cons

  • Not every language/framework has these features
  • Can require extra logic in application or CI pipeline

Below are a selection of popular lanaguages/frameworks and their tools for database seeding. We plan to release detailed guides for these options soon.

Django

In Django you can load data using the framework's built-in object-relational mapper (ORM). The manage.py loaddata command lets you load a fixture in JSON format. To ensure that the seeding step happens before your application starts up on Uffizzi, include this command in your Docker Compose file.

# Django application that performs a database migration before seeding data from a JSON fixture.
services:
  app:
    image: "${APP_IMAGE}"
    commmand: ["python manage.py migrate \
    && python manage.py loaddata data.json && ..."]
  db:
    image: postgres:11.4
    volumes:
      - /var/lib/postgresql

Figure 2: An example Docker Compose file for a Django application that performs a database migration before seeding data from a JSON fixture.

Go

Atlas CLI is a declarative migrations tool for Go applications. You can declare the end-state schema  you want using atlas scheme apply or perform versioned migrations using atlas migrate apply. Install atlas in you application container, then run migrations and seeding steps before starting your application.

Rails

Rails has good support for bootstrapping databases, as well as defining and customizing fixtures for testing, using the built-in migrations features of Active Record. To add seed data to your preview databases, you can define a db/seeds.rb file and run rails db:seed after your application starts.

# Example database migration using Rails
services:
  app:
    image: "${APP_IMAGE}"
    commmand: ["bash", "-c", "bundle install \
    && bundle exec rails db:create db:migrate \
    && bundle exec rails db:seed && bundle exec puma -C config/puma.rb"]
  db:
    image: postgres:11.4
    volumes:
      - /var/lib/postgresql

Figure 3: An example Docker Compose file for a Rails application that performs a database migration before seeding test data from a db/seeds.rb.

Conclusion

If your application language or framework uses an ORM or declarative migration tool, you should consider using it for its dual ability to handle both schema migrations and database seeding. For a more generic solution that will work with any application that leverages Postgres, MySQL, or MariaDB, you can instead choose to load a SQL dump file from the /docker-entrypoint-initdb.d repository.

To store your seed data, it is recommended to use an object storage service such as Amazon S3 or Google Cloud Storage if your data is larger than 100MB or changes regularly. If your test data is small and changes infrequently, colocating your seed data with your code in your repository is probably sufficient.