In this article, I’m telling you:
- How to generate docs of database schema structure with SchemaSpy? Why you need it, if you use a relational database (PostgreSQL/Oracle/MySQL etc.) in your application?
- How to run SchemaSpy with Testcontainers?
- How to host the generated database documentation on GitHub Pages?
The stack I’m using in this project consists of:
- Spring Boot
- Spring Data JPA
- PostgreSQL
You can find the entire repository with code examples by this link. The test that generates the SchemaSpy docs is available by this link.
What is SchemaSpy?
SchemaSpy is a standalone application that connects to the database, scans its tables and schemas, and generated nicely composed HTML documentation. You can check out an example sample by this link but I’m showing you just one screenshot to clarify my point.
SchemaSpy visualizes the database structure detailedly. Besides, you can also deep dive into relations, constraints, and table/column comments (i.e. COMMENT ON COLUMN/TABLE statements).
Such documentation is helpful for a variety of specialists:
- System analysts want to understand the data processing and storage principles behind the business logic.
- When QA engineers face a bug, the database structure documentation helps to investigate the reason that cause the problem deeply. As a matter of fact, they can also attach additional details to the ticket to make its fixing more transparent.
- Data engineers have to be aware of tables structure to deal with Change Data Capture events correctly.
So, database documentation that is always relevant (because it’s generated) is excessively beneficent.
Database tables
Look at the DDL operations that create database tables. This is our application domain we’re going to document.
CREATE TABLE users
(
id BIGSERIAL PRIMARY KEY,
name VARCHAR(200) NOT NULL
);
CREATE TABLE community
(
id BIGSERIAL PRIMARY KEY,
name VARCHAR(200) NOT NULL
);
CREATE TABLE post
(
id BIGSERIAL PRIMARY KEY,
name VARCHAR(200) NOT NULL,
community_id BIGINT REFERENCES community (id) NOT NULL
);
CREATE TABLE community_role
(
id BIGSERIAL PRIMARY KEY,
user_id BIGINT REFERENCES users (id) NOT NULL,
community_id BIGINT REFERENCES community (id) NOT NULL,
type VARCHAR(50) NOT NULL,
UNIQUE (user_id, community_id, type)
);
CREATE TABLE post_role
(
id BIGSERIAL PRIMARY KEY,
user_id BIGINT REFERENCES users (id) NOT NULL,
post_id BIGINT REFERENCES post (id) NOT NULL,
type VARCHAR(50) NOT NULL,
UNIQUE (user_id, post_id, type)
);
As you can see, there are 3 core tables (users
, community
, and role
) and 2 linking tables (community_role
, post_role
).
Running SchemaSpy with Testcontainers
The easiest way to keep documentation in sync with the current database structure is updating it on each merged Pull Request. Therefore, we need somehow to run SchemaSpy during tests execution.
I’m showing you the algorithm step by step. But you can check out the entire suite by this link.
Firstly, we need to define the SchemaSpy container itself. Look at the code snippet below.
class SchemaSpyTest extends AbstractControllerTest {
@Test
@SneakyThrows
void schemaSpy() {
@Cleanup final var schemaSpy =
new GenericContainer<>(DockerImageName.parse("schemaspy/schemaspy:6.1.0"))
.withNetworkAliases("schemaspy")
.withNetwork(NETWORK)
.withLogConsumer(new Slf4jLogConsumer(LoggerFactory.getLogger("SchemaSpy")))
.withCreateContainerCmdModifier(cmd -> cmd.withEntrypoint(""))
.withCommand("sleep 500000");
...
}
The
AbstractControllerTest
contains PostgreSQL container configuration. You can see its source code by this link.The Cleanup annotation comes from the Lombok project. It generates try-finally statement.
I want to point out some important details here.
- The
withNetwork
clause assigns the container to the existing TestcontainersNETWORK
. This value inherits from theAbstractControllerTest
and the PostgreSQL runs with this network as well. It’s crucial because otherwise SchemaSpy won’t be able to connect to PostgreSQL. - Log consumer applies a logger to push logs from the container. It’s useful to track bugs and errors.
- The
withCreateContainerCmdModifier
is the primary part. By default, SchemaSpy container tries to connect to a database immediately and generate documentation. Then a container terminates. However, that’s not an acceptable behaviour, because the generation result remains inside container. Therefore, we need to copy it in the OS directory. But if a container has already stopped, it’s impossible. So, we have to override a default entry-point to make a container run (almost) indefinitely. That’s why I put thesleep 500000
command. Container will hang and do nothing on its start.
Now we need to trigger the generation process. Look at the code block below.
schemaSpy.start();
schemaSpy.execInContainer(
"java",
"-jar", "/schemaspy-6.1.0.jar",
"-t", "pgsql11",
"-db", POSTGRES.getDatabaseName(),
"-host", "postgres",
"-u", POSTGRES.getUsername(),
"-p", POSTGRES.getPassword(),
"-o", "/output",
"-dp", "/drivers_inc",
"-debug"
);
schemaSpy.execInContainer("tar", "-czvf", "/output/output.tar.gz", "/output");
Here is what happens:
- I start the container (remember that it hangs and does nothing).
- Then I execute the command that generates the documentation. The host equals to
postgres
because that’s the PostgreSQL container’s network alias (thewithNetworkAliases
method). The command execution happens in a separate process inside the container. So, thesleep
command is not terminated. - Finally, we put the directory with generated contents (HTML, CSS, JS) into a tarball. Testcontainers library allows to copy files from a container to the OS but not directories. That’s why we need an archive inside the SchemaSpy container.
It’s time to copy the result documentation in the OS directory and unpack the changes. Look at the code snippet below.
final var buildFolderPath =
Path.of(getClass().getResource("/").toURI()).toAbsolutePath();
schemaSpy.copyFileFromContainer(
"/output/output.tar.gz",
buildFolderPath.resolve("output.tar.gz").toString()
);
schemaSpy.stop();
final var archiver = ArchiverFactory.createArchiver("tar", "gz");
archiver.extract(
buildFolderPath.resolve("output.tar.gz").toFile(),
buildFolderPath.toFile()
);
The steps are:
- Define the
buildFolderPath
that points to thebuild/classes/java/test
directory (I use Gradle in this project). - Then I copy the tarball with documentation from the container to the
buildFolderPath
directory. - And finally, I unpack the archive contents to the same directory (I use jararchivelib library here).
In the end, we have a pretty documentation generated on the database schema structure. Look at the screenshot below.
Hosting result on GitHub Pages
We got generated documentation based in build/classes/java/test
folder. Anyway, it’s not that useful yet. We need to host it on GitHub Pages and update accordingly.
Look at the pipeline definition below.
name: Java CI with Gradle
on:
push:
branches: [ "master" ]
permissions:
contents: read
pages: write
id-token: write
jobs:
build-and-deploy:
environment:
name: github-pages
url: ${{ steps.deployment.outputs.page_url }}
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Set up JDK 17
uses: actions/setup-java@v3
with:
java-version: '17'
distribution: 'temurin'
- name: Build with Gradle
run: ./gradlew build
- name: Upload artifact
uses: actions/upload-pages-artifact@v1
with:
path: build/classes/java/test/
- name: Deploy to GitHub Pages
id: deployment
uses: actions/deploy-pages@v1
The actions are:
- Build the project and run tests with
Set up JDK 17
andBuild with Gradle
steps. - Then the
Upload artifact
step uploads thebuild/classes/java/test/
directory folder to GitHub registry (it contains the documentation). - And finally, deploy the artefact created on the previous step to the GitHub Pages.
That’s basically it. Now the SchemaSpy documentation is available on GitHub Pages and being updated automatically on each merged Pull Request.
Conclusion
That’s all I wanted to tell you about generating database schema documentation and hosting the result on GitHub Pages. Do you have any docs generation automations in your project? Is it SchemaSpy or something else? Do you host the result on GitHub or GitLab Pages? Tell your story in the comments. It’ll be interesting to discuss your experience as well.
Thanks for reading! If you like that piece, press the like button and share the link with your friends.
Resources
- The repository with code examples
- Code of the test that runs SchemaSpy container
- GitHub Pages final documentation
- SchemaSpy
- Testcontainers
- GitHub Pages
- Spring Data JPA
- PostgreSQL
- SchemaSpy documentation sample
- PostgreSQL COMMENT ON COLUMN/TABLE statements
- Change Data Capture events
- Cleanup annotation
- Try-finally statement
- GitHub issue: copyFileFromContainer – is there a way to copy a folder?
- Gradle
- Jararchivelib