
Go KonMari on your SQL queries, bruh!


I use JPA, I got all things SQL covered, no?

If you're reading this, chances are that you are an avid user of JPA, but completely unaware of the mess you are sitting on. The simple and clean situation we all want to be in is as few individual SQL statements executed as possible, yeah? Only then can our applications scale with the amount of data without the risk of suddenly going down unexpectedly and inexplicably. So, if you're not planning to heat your place with the machine your database is running on, this one is for you!

Spring Data, JPA, Hibernate got our backs to avoid such a mess though, right? HELL NO! All they do is teach you a lesson: never rely on other systems without verifying yourself that the fundamentals function as expected! There are ALWAYS pitfalls, and you do not want to iron out those kinks in production.

[meme: this one does not spark joy]

It’s entirely on you to clean up the mess you are sitting on. Just like Marie Kondo will not clean up your home, nobody will clean up your JPA generated SQL statements for you. But just like Marie Kondo, I will tell you how to do so, keep what sparks joy and smile awkwardly all the time.

But let's just be honest: nobody can really predict what JPA implementations do or do not do under the hood, so verifying the resulting SQL statements is simply a must if you want to deliver professional work. You do not have to know by heart what is happening, but as with everything, you need to be able to observe and verify, and have the fundamentals down.

WTF is happening and why it sparks no joy?

It’s mighty difficult to debug the performance of your database accesses when you cannot see which SQL statements are actually sent to the database. But that is exactly what our friendly lil’ helpers JPA, Hibernate & friends obfuscate. And that’s not even the last of it, cause we are also highly dependent on whether multiple INSERT or DELETE statements, for example, are transferred in reasonably sized batches or, in the worst case, individually. Yes, JpaRepository.saveAll() was not exactly what simple, naive me had expected.

That’s why proper logging of one’s SQL statements is crucial. Meaning, if you do NOT have that logging in place at all, there are 2 possibilities: either…

  1. you are a JPA guru with a bit of a cocky side or
  2. they should not let you near any other backend application any time soon.

A good and straightforward choice when working with JPA/Hibernate is the not-well-known-enough DataSourceProxy. It enables you to log generated database statements along with their binding parameters. This makes it very easy to see how many SQL queries your Java code actually produces and whether the intended batch mechanisms for modifying statements work as intended.

This also makes DataSourceProxy a useful tool for detecting N+1 loading problems when relying on an ORM such as JPA.

The N+1 problem basically occurs whenever more SQL statements are generated & executed by JPA than expected and necessary. A common example is an additional SELECT statement when using a @ManyToOne / @OneToOne association on an entity. The default FetchType for these to-one associations is EAGER (@OneToMany, in contrast, defaults to LAZY), so querying the owning entity will lead to at least 2 queries (for each entity!), because the linked entity is eagerly loaded in a separate query. This is a huge problem, because obviously we want to reduce the number of round trips to the database (network latency = sad face) to avoid performance issues.
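To make that concrete, here is a minimal sketch (entity and column names are made up) of a mapping that triggers the extra SELECTs, and the usual first fix: switching the association to LAZY.

import jakarta.persistence.*; // javax.persistence.* on older stacks

@Entity
public class CrochetPattern {

	@Id
	private Long id;

	// @ManyToOne defaults to FetchType.EAGER: every loaded pattern drags
	// in its designer, typically via a separate SELECT per pattern (hello N+1).
	// Explicit LAZY defers that query until the designer is actually accessed.
	@ManyToOne(fetch = FetchType.LAZY)
	@JoinColumn(name = "designer_id")
	private Designer designer;
}

@Entity
class Designer { // its own file in practice

	@Id
	private Long id;
}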

To integrate DataSourceProxy, the following steps are necessary:

  1. Add the dependency (example pom.xml in case of Maven)
<dependency>
	<groupId>net.ttddyy</groupId>
	<artifactId>datasource-proxy</artifactId>
	<version>1.10</version>
</dependency>
  2. Initialize the proxy and wrap the DataSource
import com.zaxxer.hikari.HikariDataSource;
import net.ttddyy.dsproxy.listener.logging.DefaultQueryLogEntryCreator;
import net.ttddyy.dsproxy.listener.logging.SLF4JQueryLoggingListener;
import net.ttddyy.dsproxy.support.ProxyDataSourceBuilder;

// inside your DataSource bean method; "updatedConfig" is your HikariConfig
SLF4JQueryLoggingListener loggingListener = new SLF4JQueryLoggingListener();
loggingListener.setQueryLogEntryCreator(new DefaultQueryLogEntryCreator());
return ProxyDataSourceBuilder
	.create(new HikariDataSource(updatedConfig))
	.name("SQL_POLICE")
	.listener(loggingListener)
	.build();
  3. Configure logging
<logger name="net.ttddyy" level="debug"></logger>
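With that in place, every statement shows up in the log roughly like this (based on the listener's default format; exact fields vary by version, and the query/params are made up):

Name:SQL_POLICE, Connection:1, Time:4, Success:True, Type:Prepared, Batch:True,
QuerySize:1, BatchSize:50, Query:["insert into crochet_pattern (id, name) values (?, ?)"],
Params:[(1,granny square),(2,amigurumi)]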

If you don’t trust your peers or yourself (which, btw, you shouldn’t), you can even build unit tests that validate and assert on the number of expected SQL queries.
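A minimal sketch of such a test, assuming the proxy from above was built with an additional .countQuery() call so that datasource-proxy collects per-thread statistics in its QueryCountHolder (the repository and test-data helper are hypothetical):

import java.util.List;
import net.ttddyy.dsproxy.QueryCount;
import net.ttddyy.dsproxy.QueryCountHolder;
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;

class CrochetPatternRepositoryTest {

	// injected Spring Data repository; wiring omitted for brevity
	CrochetPatternRepository repository;

	@Test
	void savingPatternsDoesNotTriggerHiddenSelects() {
		QueryCountHolder.clear(); // reset the thread-local counters

		repository.saveAll(create500Patterns());

		QueryCount count = QueryCountHolder.getGrandTotal();
		assertEquals(0, count.getSelect()); // no surprise SELECTs (N+1, nextval, ...)
		// assert on count.getInsert() too, once you have verified in the log
		// how your driver/batching setup reports batched statements
	}

	private List<CrochetPattern> create500Patterns() {
		return List.of(); // hypothetical test-data builder
	}
}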

Inserting Data in Batches

If you go beyond the most standard CRUD use cases, you will at some point be confronted with having to insert multiple rows of data at once. Of course we could do this via native queries, but often it’s a bit more endearing to stick to JPA and its perks. So, naively looking for a way to accomplish this, we immediately find the simple answer: we just gotta set hibernate.jdbc.batch_size in our Spring application properties and make use of an ID generator of type SEQUENCE. Done!
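In Spring Boot terms, that naive setup boils down to something like this (a sketch; the order_* flags are optional but keep interleaved entities from splitting up batches):

# application.properties
spring.jpa.properties.hibernate.jdbc.batch_size=50
# group statements by table so batches don't get broken up
spring.jpa.properties.hibernate.order_inserts=true
spring.jpa.properties.hibernate.order_updates=true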

Continuing on the assumption that all is now good is bold, but not uncommon. We are here to verify though, so let’s dive into 2 common pitfalls:

Generated primary key via Sequence

Let’s say we call JpaRepository.saveAll() for 500 entities with a configured batch size of 50. What we’d expect is 10 batches of insert statements, so 10 round trips to the database for 500 entities, which is relatively reasonable. But checking our DataSourceProxy log, we are greeted with a wall of SQL in the form of 500 additional and, most of all, unexpected SELECT queries being executed. Taking a closer look, we see that each query is fetching the next possible value for the sequence primary key from the database. The culprit here is a primary key sequence with an allocation size of 1. This leads to an additional select being executed for every single entity before it can be inserted, in order to determine the next available id.
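In the proxy log, that wall looks roughly like this (PostgreSQL dialect, hypothetical sequence name):

Name:SQL_POLICE, ... Query:["select nextval ('crochet_pattern_seq')"]
Name:SQL_POLICE, ... Query:["select nextval ('crochet_pattern_seq')"]
...498 more of these, one per entity, before a single INSERT is sent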

[meme: nextval crabs]

To avoid this mess, the allocationSize for sequence generation has to be appropriate for the use case of each entity. Knowing, for example, that we will generally insert 500 entities at a time, an allocationSize of 500 makes sense. But watch out: the configured allocationSize has to match the increment set for the linked sequence in the database.

@Entity
@Table(name = "crochet_pattern")
public class CrochetPattern {

	@Id
	@GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "cp_seq")
	@SequenceGenerator(name = "cp_seq", allocationSize = 500)
	private Long id;

	// ...
}

This oversight is often caused by using a single sequence across all tables. Generally, it is advised to use a dedicated sequence per table, unless it is desired to have unique primary keys across all tables of an application. BUT: be careful when making the change in an existing system! There might be id collisions when the start value of the new sequence is not set correctly, namely high enough to clear all existing ids.
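Putting both hints together, a sketch of a generator tied to a dedicated, matching sequence (names hypothetical):

// The sequence itself must be created with the same increment and a safe start,
// e.g.: CREATE SEQUENCE crochet_pattern_seq INCREMENT BY 500 START WITH <max(id) + 1>;
@SequenceGenerator(
	name = "cp_seq",
	sequenceName = "crochet_pattern_seq", // dedicated sequence for this table
	allocationSize = 500                  // must match the INCREMENT BY above
)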

Entities with a non-generated primary key

Checking back with our log, we find huge numbers of SELECT statements in addition to the expected INSERTs for a couple of other entities, though. There, sequence generation is not the issue, because those entities do not have a generated surrogate primary key. Why’s that? Let’s check the implementation of org.springframework.data.jpa.repository.support.SimpleJpaRepository.save() for answers:

@Transactional
public <S extends T> S save(S entity) {
	if (entityInformation.isNew(entity)) {
		em.persist(entity);
		return entity;
	} else {
		return em.merge(entity);
	}
}

When calling save() on one of those entities without a generated primary key, it must first be determined whether the entity to be saved already exists in the database or not. Because if it does, save() will internally call merge() and, if not, persist() on the EntityManager. So, we need to provide the information whether an entity is new or not without checking the database. There are 3 ways to do that:

  1. Implement the Persistable interface and its isNew method in those entities to directly control the result of the isNew check (a minimal sketch follows this list).

  2. Have a field in the entity that is annotated with @Version. As long as that field is not yet set, JPA knows that it is dealing with a new entity and has to persist instead of merge.

  3. Interact with the EntityManager directly and use persist in those use cases.
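A minimal sketch of option 1, using Spring Data’s Persistable together with JPA lifecycle callbacks (entity and field names are made up):

import jakarta.persistence.*;
import org.springframework.data.domain.Persistable;

@Entity
public class CrochetStitch implements Persistable<String> {

	@Id
	private String code; // natural key, nothing generated

	@Transient // not persisted, purely bookkeeping for isNew()
	private boolean isNew = true;

	@Override
	public String getId() {
		return code;
	}

	@Override
	public boolean isNew() {
		return isNew; // save() now calls persist() directly, no SELECT needed
	}

	@PostLoad
	@PostPersist
	void markNotNew() {
		this.isNew = false; // entities coming from the database are never new
	}
}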

The final piece of the puzzle is setting the reWriteBatchedInserts property of our data source to true (a PostgreSQL JDBC driver property). This converts a batch of 50 individual INSERT statements into a single statement with 50 tuples in its VALUES portion. That additionally improves performance, because the overhead of executing an insert is paid only once per batch instead of once per row.
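For a Spring Boot + Hikari + PostgreSQL setup, that could look like this (a sketch; either variant works):

# application.properties
spring.datasource.hikari.data-source-properties.reWriteBatchedInserts=true
# ...or directly as part of the JDBC URL:
# spring.datasource.url=jdbc:postgresql://localhost:5432/crochet?reWriteBatchedInserts=true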

Querying for multiple fields using IN

Now, being able to insert multiple entities in batches, we also want to speed up the querying part. This means selecting as many rows as possible with a single select while taking advantage of the indexes that are in place. This can be not so straightforward when we are dealing with indexes across multiple columns of a table:

The combination of fields that forms the tuples in the desired IN clause (and which should correspond to an existing database index) has to be put into a dedicated class annotated with @Embeddable. Then, replace the individual fields with a field of this newly created class and annotate it either with @EmbeddedId or with @Embedded, depending on whether you are dealing with the primary key of the table or not. Now, you can use the embeddable type as a parameter in a Spring Data JPA derived query to achieve SQL that looks as follows:

Query:[
	"select designer_nr, pattern_nr from public.crochet_patterns
	where (designer_nr, pattern_nr) in ((? , ?) , (? , ?))"
],
Params:[((0069, 007),(0084, 009))]
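A sketch of the mapping and repository behind such a query (names hypothetical; equals/hashCode of the embeddable omitted for brevity, but required in practice):

import jakarta.persistence.*;
import java.io.Serializable;
import java.util.Collection;
import java.util.List;
import org.springframework.data.jpa.repository.JpaRepository;

@Embeddable
class PatternKey implements Serializable {
	private String designerNr; // field order should mirror the composite index
	private String patternNr;
}

@Entity
class CrochetPattern {
	@EmbeddedId // use @Embedded instead if this is not the primary key
	private PatternKey key;
}

interface CrochetPatternRepository extends JpaRepository<CrochetPattern, PatternKey> {
	// derived query, rendered as a single SELECT ... WHERE (...) IN ((?,?),(?,?))
	List<CrochetPattern> findByKeyIn(Collection<PatternKey> keys);
}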

Don’t screw up the indexes you use though. So, repeat after me: INDIVIDUAL INDEXES ON 3 DIFFERENT COLUMNS DO NOTHING WHEN QUERYING ON ALL 3 COLUMNS. You need one composite index covering all three.

The $64,000 question: why am I telling you this?

There’s no easy way around this. JPA is still the de-facto standard in Java backends around the globe, and I am not sure you are popular enough to be different there (I know 4 sure I ain’t). Even if it were the right thing to do, going too radical won’t get that bandwagon going; it will just make you look like an oddball.

[meme: not popular to be different Homer]

So, let’s break it down as simply as we can. We need to be aware of what our software does, and that includes what the APIs and libraries we introduce to our stack are doing. This is even more true with staples such as JPA. You do not have to be an expert in each specific technology per se. It all comes down to knowing the fundamentals and being able to verify them. Everything from there is just filling in the blanks. We basically do test-driven development, just one level of abstraction higher. Even if it’s not your fault a mess exists in the first place, it absolutely is your fault if it stays that way. #noexcuses