PostgreSQL HA with pgpool-II - Part 1

This is the first part of  the tutorial. Other parts are:

Since the tutorial is rather long I must start with:

We'll get through. Invest a few hours in reading all six parts of this tutorial, and you'll be able not only to implement HA, but also to actually understand how it works.

0. Introduction

About two weeks ago I decided to spend day or two to implement PostgreSQL high availability (HA) cluster by using pgpool-II. Today I still don't have it implemented. It turned out to be much more complicated than I initially expected. Well, I often underestimate the work that needs to be done, but in this case it was especially painful since it happened when I'm already pretty short with time. It is true that I don't have any serious experience with PostgreSQL (besides basic usage - apt-get and basic settings in postgresql.conf and pg_hba.conf). However, it is also true that I'm kind of good with computers, and I've succeeded to implement things like Elasticsearch server and Cassandra cluster in less than two days each, although not having any prior experience. But PostgreSQL / pgpool-II turned out to be a different kind of a monster...

It is important to say that this tutorial is written with all details, without assuming any preexisting knowledge. It covers all the steps so that you won't need to search for other resources to be able to understand it. In short, this tutorial is truly Dummy-to-Expert kind of tutorial.

1.1. What is Actually HA?

Don't worry, I won't bother you with theory, but I must be precise about what we are actually trying to accomplish, and it'll also help us understand some basic terms, often used in this area. At the highest level of abstraction I will define the following expectation:

Database cluster should be implemented in such way that database remains available if any of servers goes down.

Fair enough. Let's see what it means at a bit lower, more technical level of abstraction:

  • The data should be distributed between cluster members in such way that all the members have the most recent data. (It's not 100% true, but let's leave it as such for now.) This part is accomplished by replication.
  • When primary server goes down, a standby server should take over its role. Note: PostgreSQL clustering always include a primary server and standby server(s). This part is accomplished by (preferably automatic) failover.
  • Not mentioned in our original request but assumed: the failed server should be easily replaceable / recoverable.
  • Optional: When the system is in its regular state (all servers are running), overall load should be distributed. It means that not all queries should be executed on the primary server, but some will be directed to a standby server(s). This is accomplished by load balancing.

A wise man said once: "knowing where you want to go will significantly increase your chances to actually get there." For this reason I will be more precise about what I want to accomplish (and I suggest you to do the same for yourself):

My primary aim is to get replication and failover up and running. At the moment I'm not interested in load-balancing, although the solution that will be implemented by procedure described here will actually allow load-balancing also.

1.2. Why Two Products, and Who Does What?

At the moment of this writing (PostgreSQL version 9.5) it is not possible to implement complete HA deployment by using only PostgreSQL. Looking at HA parts defined in the previous section we can say that:

  • PostgreSQL does offer a variety of replication options, so replication part is fully covered;
  • PostgreSQL implements an easy way to perform failover (as it will be shown later, it is enough to create a trigger file, and standby server will take over primary server role), but it does not perform it automatically. Even more, PostgreSQL does not implement any kind of tracking (awareness) if the primary server is down or not. It means that we need another product for automatic failover feature.
  • PostgreSQL itself does not implement any load-balancing feature. The server that receives a query will execute it. It means that we need another product for load-balancing feature (if we need it in the first place).

If you wonder why PostgreSQL does not implement mentioned features - the answer is pretty simple: Failover part in the whole HA implementation is very risky. A lot of problems (resulting in data loss) can arise if there is more than one primary server in the cluster, and it may happen if one of standby servers falsely concludes that primary server isn't working, and that it should take over its role. For example, it may happen if a standby server loses network connection causing it to conclude "wow, all other servers are down, I must go primary..." Another example would be that it may happen that the primary server actually goes down, but comes back again, still thinking that it is the primary server, although some other standby server has already taken over that role...

Anyway, PostgreSQL team obviously didn't want to include this risk in their product. This way they and their product are protected from such problems, being able to say "Well, it is not caused by our database. You shouldn't allow two primary servers in the same cluster." Honestly, I fully agree with their decision - it was probably one of the smartest, life-saving decisions they ever made.

1.2.1. Selecting The Other Product

As you can see at PostgreSQL wiki, there are quite a few products that can jump-in and help in accomplishing PostgreSQL HA. My reasons for selecting pgpool-II are:

  • I've wanted a product that relies on existing PostgreSQL replication instead of introducing its own. Many products (for example slony) are actually implementing their own replication mechanism instead of using the embedded one. In my opinion, no one can know better how to implement PostgreSQL replication than PostgreSQL team itself. pgpool-II relies on the embedded PostgreSQL replication.
  • At the moment of this writing, pgpool-II was the only product that implements all three additional features used in the comparison matrix: connection pooling, load balancing and query partitioning. Although at the moment I'm not interested in the latest two, it is generally good idea to be prepared if a need for such features arises.
  • Although pgpool-II is not too mature, it seems to be well supported and aggressively developed.

Nevertheless, I must admit that I haven't investigated all the products in details. It means that another factor was important - I feel lucky pick.

1.2.2. Selecting Replication Model

We've already decided to use PostgreSQL embedded replication mechanism, but not exactly which one of variety of them. You can literally spend weeks researching about all of them. Things are getting even more complicated when even the official PostgreSQL documentation starts introducing a new terms besides the ones mentioned in the list (i.e. binary replication). If you've clicked on the previous link, you've might noticed the following note: "NB: there is some duplication with the page on Streaming Replication". Really? So you can go towards discovering all nuts and bolts about all replication models, which will ultimately lead you to an interesting type of lunacy, or you can stick with me and my choice. After some research I've decided to go with Transaction Log Shipping by using streaming replication. The fact that influenced my decision the most (besides my quick and rough research) is that this replication model is the most popular in other online resources that are dealing with PostgreSQL HA.

1.3. Physical Infrastructure and Implementation Plan

I'll implement two-servers (single-standby) cluster, although the procedure is the same if you want to implement multiple standbys. In fact, I will prepare primary server to be able to accept more than one standby, just in case. The next question that needs to be answered is how to implement pgpool-II. Often scenario is to have single pgpool-II server in front of PostgreSQL cluster, as illustrated on the following image:

Image
Image 1: Multiple PostgreSQL servers behind a single pgpool-II instance.

However, this scenario again introduces single point of failure - pgpool-II server itself. If pgpool-II server goes down we'll lost database connectivity although both database servers are actually running. Nevertheless, if you are using a decent hardware dedicated to pgpool-II - this risk is not too big. pgpool-II (thanks to the fact that it does not torture hard drives) is one of turn-on-and-forget tools that can run for eons on dedicated hardware. But it stands only if pgpool has its own physical machine. In every other case you should use some redundancy. Actually, why I'm fooling around? It is always better to have more than one instance running. We can deploy them on the same servers PostgreSQL database is deployed on.

Long story in short: I'll use two Ubuntu 14.04 servers, each carrying PostgreSQL 9.5 and pgpool-II 3.5.2. The architecture is described in the following diagram (borrowed from this pgpool.net article):

Image
Image 2: PostgreSQL and pgpool-II sharing the same hosts.

BTW., the post the image is borrowed form is OK, but it deals only with pgpool-II, not covering any configuration needed at PostgreSQL side.

1.3.1. Watchdog

Before starting with the actual implementation it is important to demystify one pgpool-II component - watchdog. The purpose of this component is to periodically check all other pgpool-II instances (especially the active one) if they are still running. If the active instance fails standby instance will be aware of this failure and take over active role, thanks to the watchdog. If there's more than one standby instance running - the one with the highest priority will become the active one (we can configure priority of a particular instance in pgpool-II configuration file, as will be explained later). Honestly, I don't know what happens if multiple standby instances are configured with the same priority; hopefully this would be handled internally by pgpool-II in an appropriate way. Finally, to avoid any confusion, I will tell that watchdog checks other pgpool-II instances, not PostgreSQL databases. Health of the databases is checked by all pgpool-II instances. In a manner of speaking we can say that watchdog checks pgpool-II, which further checks PostgreSQL.

If you take a look at pgpool.net download page, you may notice that there's also a product called "pgpool-HA" (or something like that). This product was used with earlier versions of pgpool-II for the similar purpose the watchdog is used now. It means that pgpool-HA is now obsolete in favor of the watchdog. I'm not exactly sure in which version of pgpool-II is watchdog implemented for the first time (I think 3.1). Anyway, the watchdog is present during last few years, so if you're using pgpool-II chances are that it's a version with watchdog. On the other side, if you are just starting with pgpool-II, you'll start with the newest version, right?.

1.3.2. Virtual IP

Another term to explain is virtual IP. You've might heard about a similar term - floating IP, which is often used with server high availability. Virtual IP is actually the same exact thing. For those who don't know, I'll briefly explain how it works on an example with two servers, but the principle is the same for any number of servers. Our infrastructure will be installed on two servers with IP addresses ip1 and ip2. But besides these addresses we'll also introduce another IP address (let's call it ipV) that will be used by both servers. One may ask how can the same IP address be used with multiple servers? Well, it can't. In reality it is used only by the server where the active pgpool-II instance is running. But if it happens that this server fails, thanks to previously explained watchdog, another instance will become the active one, and it will also take over virtual IP. So it cannot happen that two servers are using virtual IP at the same moment.

The benefit of introducing virtual IP is obvious: all other applications and systems in our infrastructure will continue to use the same IP (virtual IP) for database access, even if active pgpool-II instance (or primary PostgreSQL instance) fails. It means that no reconfiguration of other systems is needed when failover happens. The only important thing related to virtual IP is that we must select an IP address that is not used by any other system/server in our network, of course.

1.3.3. What about pgpool-II???

Originally it wasn't my intention to describe pgpool-II. But since I've explained some of its parts, it would be unfair not to provide any explanation about pgpool-II itself. I won't go into details about all its features (i.e. load balancing, query partitioning, etc.). Instead I will explain only its basic role. Basically, pgpool-II behaves as PostgreSQL HA proxy. It means that pgpool-II exposes the same API to outside world as PostgreSQL does, so all database clients will actually be connected to pgpool-II, not to PostgreSQL itself, without even being aware of that. On the other side, when pgpool-II receives a query from the outside world, it decides what to do with it. It will know which PostgreSQL instance is down, which is primary, and to which it should forward the query. And it does that completely transparently for outside world.

The similar transparency stands from the database's point of view; from PostgreSQL's perspective pgpool-II is nothing more than another database client. Probably the only direct client, but still nothing more than a client. Basically, pgpool-II does a great job, still being completely invisible for all other participants in the party.

Where to Go Next?

Once we've met our enemy, thus being significantly less afraid, we can continue with installing PostgreSQL servers and establishing replication in PostgreSQL HA with pgpool-II - Part 2.