kareila: "Mom, I'm hungry." "Hush, I'm coding. You ate yesterday." (coding)
kareila ([personal profile] kareila) wrote in [site community profile] dw_dev_training2009-08-02 09:29 pm
Entry tags:

Kareila's intro to database design

[staff profile] denise sez:

I'd really like to see someone's thought processes while implementing something big/major that relies on working with DB stuff: what's the best practice bits that we just won't think about because we don't know enough? What sort of stuff should we do just because that's the best way to do it? Etc.


I am not a professional database designer, but I've read a few books and learned my way around MySQL, so I'll try to share my thoughts by way of the spec I've implemented for Bug 215: Implement v-gifts. Fair warning: although that spec has benefited from one round of peer review, it hasn't been approved yet, so it might will change again.

1. Objects, Properties, and Uniqueness



Each table in a database should completely describe the properties specific to one group of similar objects. Obviously the virtual gifts themselves comprise one table (vgiftids), and each gift transaction belongs in another table (vgifts). To come up with the needed properties, I look at the proposed functionality and figure out what data I need to store to provide that functionality. For the vgiftids, the obvious properties are a name, a description, a picture, and a cost. I've also decided to create space for some useful metadata such as creator (who designed the gift), creation date, and a couple of Y/N markers called "featured" and "active" that will let us control how the gift is displayed in the store.

I'll quote the spec for vgiftids here so you don't have to keep flipping back and forth:

CREATE TABLE vgiftids (
    vgiftid    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name       VARCHAR(80) NOT NULL UNIQUE,
    created    INT UNSIGNED NOT NULL,   #unixtime
    creator    INT UNSIGNED NOT NULL DEFAULT 0,
    active     ENUM('Y','N') NOT NULL DEFAULT 'N',
    featured   ENUM('Y','N') NOT NULL DEFAULT 'N',
    desc       VARCHAR(255),
    pickey     VARCHAR(255),
    cost       FLOAT(2,2),

    INDEX (name),
    INDEX (created)
)


I'm not going to explain the SQL syntax here, beyond the fact that the first word on each line is the name of a column in our table, and the second word is the type of data it will contain (numbers or strings). That's stuff you can copy out of other tables that contain similar data to what you want. What I want to focus on here are the phrases "NOT NULL", "PRIMARY KEY", "DEFAULT" and "UNIQUE". Judicious application of these keywords is crucial to a well-designed table.

Each row in a table should uniquely describe a single instance of an object. Having duplicate rows in a table is a bad, bad thing. The most common way to enforce this restriction is to define a column or a group of columns as the table's primary key, and the most common form of primary key is a unique ID number. In my vgiftids table, vgiftid is the primary key. Although the name column also contains unique values, names can be changed; primary keys are forever, until the row is deleted. (You can also think of userids vs. usernames as another illustration of this principle.) By further specifying vgiftid as AUTO_INCREMENT, I've told the database to keep track of this column for me. Every time I create a new row, it will automatically give the new row the next vgiftid number in the sequence.

Advanced topic: explain my primary key for the vgift_tagids table. (Hint: I explain it further down, but I want you to think about it here before continuing.)

Okay, before we move on, we need to consider not only desired database values, but allowed database values. What if we add a row that doesn't have a value specified in every column? The primary key is calculated via auto-increment, but any other column that is left blank will show up as NULL, unless we've specified a default value. In my vgiftids table, my "featured" and "active" flags default to "N", and if a creator is not specified, that defaults to 0 (which is an invalid userid).

You want to carefully consider which values could legitimately be allowed to be NULL. I've decided that a gift could have no description, picture or cost. No cost? On second thought, maybe that should also have a default of 0. A gift with no picture or description would be boring, but maybe it might have one but not the other? I've decided not to enforce a value here. All the other columns in my table have a NOT NULL qualifier, which means the database will throw an error if you try to create or update a record that has no value set for that column. I have two columns specified NOT NULL that have no DEFAULT (name & created), which means I must specify those every time I create a record.

Advanced topic: explain why "created" doesn't default to the current SQL timestamp. (Hint: it has to do with timezones.)

One last thing: defaults only apply upon record creation. You can later update a column with a default value to be NULL unless you've also specified it's NOT NULL. That's why you still have to specify NOT NULL even when you have a DEFAULT, if you want to be sure it has a value.

(I think PRIMARY KEY also enforces NOT NULL, but it doesn't hurt to specify both.)

2. And You Thought Your Relationships Were Complicated



In my vgifts table (the one that lists gift purchases), I've specified "transid" as the primary key. But I also have a column for vgiftid, which specifies which gift was given in the transaction. Guess what? The values in the vgiftid column for that table match up to the ones in the vgiftids table! That defines a relationship between the two tables.

Advanced topic: explain why vgiftid can't be a primary key for the vgifts table. (Hint: in my next version of the spec, I will probably rename this table vgift_trans.)

That means we can do something like this:

SELECT name, desc, pickey FROM vgiftids WHERE vgiftid IN ( SELECT vgiftid FROM vgifts WHERE rcptid = 3 ORDER BY delivery_t DESC LIMIT 1 );


That would, theoretically, return the properties describing the most recent gift given to [staff profile] denise; her list of gifts is in the vgifts table, but the gifts are actually described in the vgiftids table.

Advanced topic: name the two reasons the above query won't run as given. (Hint: I mention one of the problems in a comment on the bug, but don't give an explanation.)

Do you see why we've designed it this way? By having only the vgiftid stored in the transaction, and looking up the other data associated with the vgiftid when we need it, we avoid data duplication and make it easy to change the properties associated with a gift in that one row of the vgiftids table, instead of having to do a search and replace on every copy of the gift that was ever given in the vgifts table. Each table in a database should completely describe the properties specific to one group of similar objects. The name and description of the gift are not properties of the gift transaction.

Of course, one user could receive multiple copies of the same gift from different people, or the same person at different times, or even different gifts at the same time. That's why there are no unique keys in vgifts apart from the primary key, transid.

Have you looked yet at the other two tables, vgift_tagnames & vgift_tagids? I designed those in response to feedback that specifying gift categories as columns of the vgiftids table wasn't flexible enough. The vgift_tagnames table defines the names and corresponding ids of the tags, but what about vgift_tagids?

CREATE TABLE vgift_tagids (
    tagid      INT UNSIGNED NOT NULL,
    vgiftid    INT UNSIGNED NOT NULL,

    PRIMARY KEY (tagid, vgiftid)
)


This table defines the relationships between the tagids and the vgiftids, but one gift can have many tags, and one tag can apply to many gifts! There are no unique columns in this table; the constraint is that there can only be one row describing a given combination of the two (and no NULL values; the relation of a thing to nothing is not of interest). Each row in a table should uniquely describe a single instance of an object. In this case, the object is a mapping between two other types of object, and it's a complex enough relationship that we can't contain it in the other tables.

Okay, I've probably given you enough to think about for one post. If there's interest, I'll do a writeup on database queries after I go off and write the functions that will interact with my vgift tables.

And of course, please point out if you see anything I got wrong! I don't want to lead anyone astray.
yvi: Kaylee half-smiling, looking very pretty (Default)

[personal profile] yvi 2009-08-03 06:20 am (UTC)(link)
Oh, thanks for this :) I had a one-week SQL course this year, but we mainly did the WHERE stuff and not so much table design.

Although I'd now like to add the three rules of normal design: "Data should depend on the key, the whole key, and nothing but the key!" I love that expression.
cesy: "Cesy" - An old-fashioned quill and ink (Default)

[personal profile] cesy 2009-08-03 07:07 am (UTC)(link)
In your vgifts table, vgiftid would be a foreign key - it's basically a field that's a primary key in another table and defines the relationship between the tables.
cesy: "Cesy" - An old-fashioned quill and ink (Default)

[personal profile] cesy 2009-08-03 07:30 am (UTC)(link)
Yup.
janinedog: (Default)

[personal profile] janinedog 2009-08-03 03:10 pm (UTC)(link)
I think (though I could be entirely wrong) that DW/LJ doesn't use any actual foreign keys, in the sense that you could delete tables that have some sort of relationship without it yelling at you. In reality, they don't actually have a relationship...they do use the same fields (userid/journalid is one that's used in many tables), but there's no actual forced relationship.
afuna: Cat under a blanket. Text: "Cats are just little people with Fur and Fangs" (Default)

[personal profile] afuna 2009-08-03 07:11 am (UTC)(link)

I've also heard of foreign keys but I couldn't find a single occurrence of the word "foreign" in update-db-general.pl so it can't be that important!


I think [staff profile] mark agrees with you!

http://bugs.dwscoalition.org/show_bug.cgi?id=180#c17

yvi: Kaylee half-smiling, looking very pretty (Default)

[personal profile] yvi 2009-08-03 07:32 am (UTC)(link)
But, but, foreign keys... :)
owl: Stylized barn owl (Default)

[personal profile] owl 2009-08-03 08:18 am (UTC)(link)
O_O

Well, they are nice from stopping some clampett from entering an address with a state of BY or something, app then proceeds to go, "Wait, what, I can't sell X product in this BY place!", falls over, Blackberries go off....
allen: (explosiverunes)

[personal profile] allen 2009-08-03 05:35 pm (UTC)(link)
Just for reference, I respectfully disagree with [staff profile] mark here. I think this thread at Stack Overflow gives a pretty good overview on the pros and cons of using foreign keys.
yvi: Kaylee half-smiling, looking very pretty (Default)

[personal profile] yvi 2009-08-03 07:36 am (UTC)(link)
Yeah, I've read enough to have heard of normal design, but I never figured out exactly what the different normal forms were or anything. It seemed overly didactic for me. I'm more of the "do it this way, it works" school. :)


Heh, it's very philosophical sometimes :) It's shiny if you can get it to work, but it usually means having a lot of tables, which would probably just complicate things for DW devs as it makes getting stuff from the database harder. I think :)
owl: Motherboard and CD (computer)

[personal profile] owl 2009-08-03 10:46 am (UTC)(link)
Rule of thumb: more normalised == easier to do inserts and deletes
more denormalised == easier to do selects. This is because of the constraints you need to put in a denormalised db to ensure you have consistent data (always a good thing when you're dealing with, eg credit card payments). So your OLTP system where you do your selling should be more normalised (generally 3NF or thereabouts) than your reporting, 'read-mostly' db.

I think from the sounds of it LJ and DW's dbs are neither normalised nor denormalised, but un-normalised :D
siderea: (Default)

Normalization is Next to Godliness

[personal profile] siderea 2009-08-04 08:06 am (UTC)(link)
Yeah, I've read enough to have heard of normal design, but I never figured out exactly what the different normal forms were or anything.

Normalization is a way of making sure one doesn't make short term decisions that Sound Like a Good Idea at the Time, and are ruinously expensive to fix later. It's wonderfully fine stuff, and it's probably already something you do without having a name for it.

Of the five normal forms, you basically only care about 1-3.

The First Normal Form is so technologically and culturally enforced by our modern RDBMSes that the reason its hard to understand is that reasonable people ;) can't imagine doing it any other way.

The First Normal Form says that really, all tables are grids now, not trees. All records on a table have the same number of fields, and they're the same fields on all records.

The alternative is to do things like have one field which is specifier for what the rest of the fields in the row shall stand for, such that each row of the table can effectively have different fields. This is, my sweetie tells me, normal practice in pre-RDBMS mainframe dbs; our employer in common had just such a feature in their old Communications db, the one that lived in that barn in Rhode Island.

To use the later as an example, there was for each contact on the contacts table, there was a bit which stood for "individual" or "business". If the record was for an individual person, they got, say, both a set of "home" address fields and "work" address fields", while if the record was for "business", there would be only one set of address fields, but there might be fields for "alternative contacts", the names of other people at the company. In this way, what fields a record had were contingent. This is an egregious violation of the 1NF.

Another form of violating the 1NF is to have arrays in fields. This may be the most common way to violate the 1NF: it's an obvious thing to do, which generally leads to suffering, perdition, and having to do irritating amounts of parsing in your interface layer.

(As an aside, this leads to Siderea's First Important Principle of Database Design: If you find yourself doing common db operations in your interface layer, somebody has screwed up, maybe you.)

The Second and Third Normal Forms together are that The Only Stuff On The Table Is The Stuff That Should Be On That Table. What should be on a table? Facts exclusively about the primary key. This is the "Data should depend on the key, the whole key, and nothing but the key!" rule. If you start having facts about facts or facts about the normal key but also the key on some other table entirely you have violated the 2/3NF.

Classic example: you have a book store inventory system. You have a table of books, it's got sane things on it like a Title field and an ISBN field, and particularly useful, it's got a Publisher field, so you can figure out where to buy more. That's all within the 2/3NF. If you then have on that same book table have a field for publisher address, you have violated the 2NF, because that's not a fact about the book, it's a fact about a publisher. Why is this a problem? Because you have a bunch of books from the same publisher, and you now have a massive data integrity timebomb, trying to keep track of which "publisher address" field is the right one with the most valid data, or otherwise keeping them all in sync.

Violating 2/3NF is also one of the classic, "Gosh, maybe we should have thought that through", errors, which seeming like a good idea at the time, inevitably -- and usually pretty shortly -- results in swearing and crying and people angrily shaking budget sheets in meetings. Typically people violate it to save developer time/money in the short term. I have never caved to violating the 2/3NF but that I have bitterly, bitterly regretted it.

Very rarely, the 2/3NF will be violated for performance reasons. Really, the solution is to buy better hardware. ;)

If you conform to 1/2/3, you pretty much have what most people consider a reasonably normalized DB. Pretty much every discussion of the Five Normal Forms just handwaves through 4 and 5. :)
yvi: Kaylee half-smiling, looking very pretty (Default)

Re: Normalization is Next to Godliness

[personal profile] yvi 2009-08-04 10:26 am (UTC)(link)
Oh, nice comment. When I learned about it, it sounded very shiny, but I forgot quite a bit over time.
vlion: cut of the flammarion woodcut, colored (Default)

[personal profile] vlion 2009-08-04 02:11 pm (UTC)(link)
1-3 Normalization is actually not that didactic/pedagogic.
Essentially, the goal of normalization is to ensure data integrity/sanity across tables as your database goes through life.

Foreign keys:

Let's say we have the classic example of classes at the U. Classes have students and professors. Both students and professors may change between the time the class is announced and the time school starts.

So a git-er-done method looks like this...

create table class(student_id int, prof_id blob, class_id int).

Over time it will wind up looking something like...
class
|------------+------------+----------|
| student_id | prof_id    | class_id |
|------------+------------+----------|
|          1 | Dumbledore | 360F09   |
|          2 | Dumbledore | 360F09   |
|          3 | Dumbledore | 360F09   |
|------------+------------+----------|



Notice that prof_id and class_id don't change, yet they are in each row. Suppose the professor changes from Dumbledore to Hagrid - then we have to do some kind of select * from class where class_id = 360...and it gets gnarly - I didn't include a row_id so I'm not sure at the moment how to write out the rest!
We're in 1NF however - we don't have a group of student ids in one row.

A 3NF way of writing this would be this

create table class_prof(unique_id int, class_id int, prof_id int)

create table class_list(unique_id int, student_id, class_id int)



class_prof
|-----------+----------+------------|
| unique_id | class_id | prof_id    |
|         0 | 360F09   | Dumbledore |
|         1 | 440F09   | Snape      |
|         2 | 112F09   | Hagrid     |
|         3 | 404F09   | Trelawney  |
|-----------+----------+------------|



class_list
|-----------+------------+----------|
| unique_id | student_id | class_id |
|         0 |          1 | 360F09   |
|         1 |          2 | 360F09   |
|         2 |          3 | 360F09   |
|         3 |          3 | 404F09   |
|-----------+------------+----------|




That's *two* tables now instead of simply one. However, we have the significant advantage of being able to update the professor whenever we so please without making a major query out of it. The unique_ids let us do manipulation much easier; instead of making the entire row a primary key, we can simply use the unique ID.