|Mark Smith (mark) wrote in dw_dev_training,|
@ 2009-08-09 11:11 pm UTC
|Entry tags:||tutorial: database|
Editor's note: Some things are simplified. Particularly around the handles returned in some cases. This is a beginner's tutorial!
Types of Database
Dreamwidth (and LiveJournal -- everything I say here is applicable to LJ development as well) has, at the most basic level, two different groupings of databases. What we call the global database and then the user clusters. Each of these groups is actually made up of potentially many individual database machines. But here, let's have a graphic:
I will explain what the graph is showing. First, at the top, you will see the User Clusters section. Notice that there are three separate pairs of databases. They have names such as dfw-db-a01 and dfw-db-a02, etc. Each of these pairs of machines is a separate cluster -- what we know as the Alpha Centauri cluster is actually those two machines I just listed. Betelgeuse is the dfw-db-b01/02 pair, etc.
The second thing you should notice is that the arrow between these machines goes both ways. This is what's known as a master-master replication setup. You don't really need to know what that means, but if you're curious, I'll write up something about it. It's illustrated in the graphic to demonstrate the differences between this and the global cluster.
The third thing to notice is that there are no lines between the pairs. The data on one pair of machines is not copied to any other pair. This is why, when you think you it might solve your problem to start joining tables, it probably won't work. If one user is on Alpha Centauri, and you try to join on data that is in Betelgeuse, it won't work. The databases just can't work that way.
Now, let's move on to the global cluster. It looks a lot more hairy. The main difference is that there is only one "cluster" here: and all of the machines contain the same data. However, since this is very popular data (user accounts, passwords, support, payments, edges [WTF data], reluser, etc etc), we need a lot of copies of it to handle the load. (In reality, Dreamwidth only has two of the machines pictured in this cluster right now, but we're planning for the future...)
The second thing to notice about the globals is that the arrows only go one way. This is called master-slave replication.
Now that we've talked about the types of databases, let's dig deeper.
Writing, Reading, and Arrows
Those arrows are now going to become important. In a nutshell, they tell you, as a programmer, which way data will "flow". This is very important to databases (and programmers!), because you want your data to make it off of one physical machine and onto a second so that you have two copies, just in case one of the machines dies, gets overloaded, or there's some other problem that comes up.
So, go back to those pictures. If you were to connect to the MySQL shell on one of the machines, let's say for example dfw-db-c01, that command will then be replicated down the arrow so that it happens on dfw-db-c02 too. This means that, for your one command, it now happens on two machines. Easy!
Now, picture what happens if you insert a row into a table on dfw-db04. Since the arrow is only one way, your change is stuck on that machine. Dead end. Not only is your change probably going to disappear one day, but since it only appears on one of the machines, you're going to see really weird behavior on your site! Picture, if you will, that you inserted a new post into a table on dfw-db04.
You post the entry (insert a row), then you refresh your journal. Well, since that row only exists on one of the five global databases, you will only see your entry on every fifth reload! Ouch! Obviously, that's not a state you want to be in.
Complicating Matters...Red, Green, and Grey
Now, let's complicate matters. Flip back to the graphic, and notice how some of the databases are Red, some are Green, and some are Grey? The Green ones are "masters" and are "write OK". In other words, you are allowed to insert/delete/update on those machines, and only those machines.
The red machines are "slaves" or "read only" -- the only way data gets onto these machines is by MySQL replication only. You never, ever write to them.
The grey machines are offline! This happens more often than you'd think. When we do a backup, or maintenance, or any number of other things we have to take machines offline for administrative tasks
Now, gah! This is a lot to keep track of, isn't it? Thankfully, you don't have to!
Writers and Readers, Again
The Dreamwidth code gives you very easy to use methods of getting database handles, and ensuring that you are always talking to the right machine at the right time. As long as you follow these few simple rules, you'll never have to wonder "is this machine grey? red? or green?", the DBI::Role module (and friends) will take care of it for you.
That sounds pretty simple, and it really is. So we're going to plow right into the next section...
Talking to User Clusters
If you're going to write to a user cluster, you get a master handle like this:
my $dbcm = LJ::get_cluster_master( $u );
Cluster master handles are always called $dbcm and it is always safe to write to one of these handles. If you use that function, in that way, then you know you are connected to a green machine. Write, write, write!
Now, if you just want to read some data and don't care about writing, you can get a red handle pretty easily too:
my $dbcr = LJ::get_cluster_reader( $u );
Done! You now have a reader, and you can use that to read. But you must never write to it. Otherwise you're writing to a red machine and things will go badly for you.
Talking to the Global Cluster
The global is pretty much the same, except you use different functions and you have different handle names. (This helps you distinguish between which set of databases you're talking to, so you know what data is available on them.)
my $dbh = LJ::get_db_writer();
my $dbr = LJ::get_db_reader();
Note that the $dbh is the writer/master, and it allows you to write. (That is the green database at the top of the pyramid!) And the $dbr is a red database. (Never a grey.)
That's simple! Now make sure you always write to the former, and never to the latter.
Reading From the Master
Another thing you will sometimes have to do is read from a green database. But first, let's talk briefly about replication in MySQL. The thing you have to know: it's not immediate. If you write something to your $dbh and then you immediately turn around and try to read that same thing from your $dbr, it won't show up. Because MySQL replication takes some time. We try to keep the replication delay (how far behind the red databases are) to a minimum, but sometimes it can be seconds or longer.
Because of this, if whatever code you're writing is doing something that requires a totally accurate view of the world, you have to read from the $dbh. As an example, let's say you're trying to update one table with the count of how many comments a user has made. Since you are going to be using your data to do a write, you should read from the master (green databases). This makes sure you always have the most up to date information possible and that your write, when you do it, is accurate.
The same goes for populating memcache. If you're going to look up who someone trusts and put that data in memcache, you should use a master. Using the slaves means that you can end up in a situation where memcache contains out of date information, which confuses users and leads to a bad experience.
I think that's the basics of database handles. Questions, comments? How can I make this tutorial better?