dw_dev_training | DW object-oriented programming explained (Part 2)


Welcome to the second part of the series on object-oriented programming - or OO - as it applies to the Dreamwidth codebase. :)

If you haven't already read the first part, you'll want to do that before reading this part. I also realise that I never got around to explaining what 'methods' are in the first post, so I'm going to do that right now before delving into the main part of this post:

What are methods?

Recall from the previous post that each object, in both real life and OO, has what are called "properties" - pieces of information about the object. Each object constructed from the same class will have the same property *names* (eg. "number_of_pages"), but different *values* (one book might have 500 pages, another might have 150, etc).

But these aren't enough to fully allow an object to work. For example, take an iPod. It would have a number of properties, such as "color" and "disk_space", but they don't help describe what the iPod *does* - plays music.

When an iPod is used to play music, the user generally just selects a song and hits Play. That's all the user needs to know; the iPod itself takes care of the tricky parts, like making sure the status on the screen is up-to-date with what's happening, pumping audio through those earbuds, and turning your body into a silhouette. Okay, maybe it doesn't do that last one, but still, the point is that it knows how to deal when someone wants to play music. It's what it was designed to do, after all.

And that's what methods are for. Methods are there to deal with stuff that other programmers shouldn't have to care about - they can just tell an object to do something, and it does it. Methods are defined in the class - the blueprint - but when a programmer using an object invokes one of these methods, the method gets access to the object's memory store, which allows it to take action appropriate for that *particular* object.

That's a little confusing. Let me try to explain it in terms of the iPod. Let's say we have an "iPod" object and the class it was constructed from has a method called "play_song". When this method is invoked for a particular song, the code that's called isn't tailored for that specific iPod - it's the same code that runs for all "iPod" objects.^(*) But some magic in the programming language allows the code to gain access to the property values of that specific "iPod" object, which will have everything the iPod needs to know to play the song it was given, such as the current volume level, etc.

(Before I leave this subject, I wanted to note that in the comments on my last post, my quick explanation of methods involved having a "nextPage" method on a Book class. After further reflection, I figured that this probably wasn't quite accurate, because you can't ask a book to turn its own pages - you have to do that yourself. Hence, I used a new example here.)

^(*) Of course, in real life an iPod has an actual copy of the code to itself stored in a microchip. If you think of the construction process, however, each real-life iPod that's constructed will have the same code in its microchip, which isn't tailored for any particular manufactured iPod - so it still kinda makes sense.
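For anyone who'd like to see the iPod example as code, here's a rough sketch. The "iPod" class, its "volume" property and the return value of "play_song" are all made up for illustration, and the Perl details (the hashref and "bless") are explained later in this post. The point is only that there is one copy of "play_song", and it reads whatever values belong to the object it was invoked on.

```perl
use strict;
use warnings;

{
    package iPod;    # hypothetical class, not part of the DW codebase

    sub new {
        my ($class, %args) = @_;
        my $self = { color => $args{color}, volume => $args{volume} };
        return bless $self, $class;
    }

    # One copy of this code serves every iPod object; $self is the
    # particular object the method was invoked on.
    sub play_song {
        my ($self, $song) = @_;
        return sprintf("Playing %s at volume %d", $song, $self->{volume});
    }
}

my $mine  = iPod->new(color => 'black', volume => 7);
my $yours = iPod->new(color => 'white', volume => 3);

print $mine->play_song('Song A'), "\n";     # uses $mine's volume
print $yours->play_song('Song A'), "\n";    # same code, $yours's volume
```

Both calls run the same subroutine; only the object handed to it differs.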

As with the last post, if you have any questions on this, feel free to let me know in the comments!

So, with that explanation of methods out of the way, it's time to move on to our next topic - how it applies to the DW codebase.

I'm going to do this as a few posts, each dealing with its own topic, because I've got a fair amount to say about them. I'm still not entirely sure how many there'll be, but I'm writing them one at a time so there may be some time (a few days to a week) between each one.

A couple of things to note before I begin:

This post may require some basic knowledge of Perl and/or programming in general. Not much, I promise! (Things such as what a 'string' is, etc.) But all the same, if anybody finds themselves confused by anything I write, feel free to ask for clarification in the comments. I won't bite!

Secondly, if you're used to OO from another language, you'll find some things about Perl's implementation of OO to be strange and baffling. That's because Perl wasn't actually designed with OO in mind; OO support came later, and to be honest, it shows. Still, it's what we use, so I hope I can at least help with understanding it.^(**)

^(**) There is a version of Perl in the works which does a much better job of not only OO but a lot of other things - Perl 6 - but at the cost of revamping a lot of the language such that you probably wouldn't be able to use it without spending some time making sure your code conformed to it. For this series, therefore, I'll be concentrating on Perl 5, which is what most Perl developers - including DW and LJ - use.

With all that said, let's move on to our first topic!

What is an 'object' in Perl?

You may already have seen examples of OO in Perl in the Dreamwidth codebase. For example, when you see something like:

$ret .= "<td>" . $u->ljuser_display . "</td>";

...what you're actually seeing is the coder calling the 'ljuser_display' method on an object called $u. The value that method gives back is then inserted into a string.

But wait. Aren't Perl variables beginning with a dollar symbol supposed to be scalars (variables holding a single value), not objects?

To explain this, let me explain briefly the three different types of variables to be found in Perl:

Scalars: These variables begin with a dollar symbol ($), and represent a single value.
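Arrays: These variables begin with an at-sign (@), and represent a series of values which are accessed by number. (Perl also uses the word 'list' for the series of values itself, but the variable that holds them is called an array, just as in most other languages.)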
Hashes: These variables begin with a percent sign (%) and represent an unordered list of named values, and each value can be accessed by using its name. Other languages might know this as an 'associative array'.
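Here's a quick sketch of the three variable types from the list above. The variable names and values are made up for illustration. One thing that tends to surprise people: when you pull a *single* value out of an @ or % variable, you write it with a dollar sign, because the thing you're getting back is a scalar.

```perl
use strict;
use warnings;

my $title    = 'Some Book';                    # scalar: a single value
my @chapters = ('Intro', 'Middle', 'End');     # the @ kind: values accessed by number
my %book     = (number_of_pages => 500,        # hash: values accessed by name
                author          => 'A. Writer');

print "$title\n";
print "$chapters[0]\n";              # numbering starts at 0, so this is 'Intro'
print "$book{number_of_pages}\n";    # look up a value by its name
```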

So clearly, $u must be a scalar, because it begins with a dollar sign. But it's *also* an object, and that's not on the list above. Wha?

Here's the thing - unlike other languages, Perl doesn't have separate 'Object' types. Instead, when you create an object, what you're *really* doing is taking a scalar and "blessing" it as an object of a certain class. (Seriously, that's what it's called.) After that, you can use class methods on the scalar.

Why would anybody do such a thing? Because the scalar represents that object's internal memory store.

I didn't tell you this above, but although it's true that scalars can only represent a single value, that single value can be a reference to another variable. That's allowable because Perl does it by storing the memory location of that variable as the value. (Other languages can also do this, where they're usually known as 'pointers'.)

Recall from the last post that the memory store of an object consists of 'properties', which are named values, such as 'number_of_pages'. As such, the internal memory store of an object is best represented as a hash. But Perl's "bless" only works through a reference, so instead you create a reference to the hash (or in Perl parlance, a "hashref"), put it in a scalar, and then bless that. (Strictly speaking, it's the hash behind the reference that gets marked as belonging to the class, but the scalar holding the reference is what you pass around as "the object".) It's a roundabout way of doing it, but because of the convenience of having the memory store variable *right there*, it works.
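To make the blessing step a bit more concrete, here's a minimal sketch using a made-up "Book" class (it isn't from the DW codebase). It uses the core Scalar::Util module to show that, after blessing, the value still has a plain hash underneath, while also knowing which class it belongs to.

```perl
use strict;
use warnings;
use Scalar::Util qw(reftype);

{
    package Book;    # hypothetical class for illustration

    sub number_of_pages {
        my ($self) = @_;
        return $self->{number_of_pages};
    }
}

my %store = (number_of_pages => 500);    # the memory store: an ordinary hash
my $book  = \%store;                     # a hashref, held in a scalar

print ref($book), "\n";                  # HASH - just a reference so far
bless $book, 'Book';                     # now it's an object of class Book
print ref($book), "\n";                  # Book
print reftype($book), "\n";              # HASH - the store underneath is unchanged
print $book->number_of_pages, "\n";      # 500 - methods now work on it
```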

There's one problem with this, and that's that if you have the variable that represents the object, you also have access to its internal memory store, because you can still treat the scalar as an ordinary hashref by using a syntax such as:

my $id = $u->{'userid'};

Here, we're using the "->" syntax to say that we know that $u contains a reference of some kind, and that we want to get to the variable that it's pointing to. We then use that variable as a hash to get to the property named 'userid'.

Now, if this is code within the class itself, then this is generally fine. In most other cases, however, it's bad form to peek directly into the memory store of another object, even if you do have it right there. That's because you don't generally know how that object uses its memory store; it's possible that any information you grab might be out of date, for example. Worse, the layout of the memory store might change in the future; after all, it's only intended to be an *internal* memory store, and as long as the object knows how to deal with its own memory store, that's all that's really required.

Instead, most classes will supply methods that can get you the value you want. (Rather appropriately, they tend to be informally called "getters".) In the example above, although I didn't show its creation, I can tell you that $u is an LJ::User object, and the class for LJ::User defines a method called "id" that will get you the same information, so you can write the above line like so:

my $id = $u->id;

Perl reuses the "->" syntax even when you want to call a method; I'm not entirely sure why. In any case, here we're calling the "id" method to gain the userid instead of looking directly into the memory store, and the class itself gets to decide how to give us the information we want. With this, we can be sure that if LJ::User's memory store layout changes in the future, we'll still get what we need.

(In practice, this is unlikely to be an issue in DW's codebase, and indeed a lot of code in there *does* use the memory store instead of the appropriate method. It isn't a good idea, though, and it makes future code maintenance much easier if getters are used instead.)
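Here's a small sketch of why the getter is the safer choice. "OldUser" and "NewUser" are made-up classes (they are not the real LJ::User) that offer the same "id" method but lay out their memory stores differently. Code that goes through the getter works with both; code that reaches into the hash only works with the layout it was written against.

```perl
use strict;
use warnings;

{
    package OldUser;    # hypothetical: keeps the id under 'userid'
    sub new { my ($class, $id) = @_; return bless { userid => $id }, $class; }
    sub id  { my ($self) = @_; return $self->{userid}; }
}

{
    package NewUser;    # hypothetical: same interface, different layout
    sub new { my ($class, $id) = @_; return bless { ids => { user => $id } }, $class; }
    sub id  { my ($self) = @_; return $self->{ids}{user}; }
}

for my $u (OldUser->new(42), NewUser->new(42)) {
    my $direct = $u->{userid};    # peeking into the store; undef for NewUser
    printf "%s: getter=%s, direct=%s\n",
        ref($u), $u->id, defined $direct ? $direct : 'undef';
}
```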

That's about it for this post. There's a lot of stuff here so feel free to ask questions if there's anything you don't understand! My next post will probably talk about how you can create and use an object, as well as some examples of existing classes in the codebase.


What's the speed tradeoff for using methods vs. accessing the hash directly?

ISTR that zorkian recently optimised the comment accesses for posts with lots of comments by shifting from accessing a method to accessing a hash directly because it was quicker??

Accessing the hash directly is always going to be faster than using methods, because the method itself is going to have to access the hash too at the very least.

Beyond that, the speed tradeoff really depends on a number of factors, such as what the method has to do to bring you the information you want, and how often you're doing it. For example, if the method has to ask the database server for the info, that's going to slow it down a fair bit.

As far as I can make out from looking at the changelog in question, it looks like for each comment, the previous code was calling three methods each time round. If you then examined the methods themselves - specifically the "nodeid" method - you'd find that it called yet another method, "preload_rows". The idea behind the "preload_rows" method was to make sure that the only time the database server was contacted was when it needed to be. Normally, such a routine would only ever be called once, but the code in this loop was indirectly calling it lots of times, even though the only effect it had after the first time was to slow things down.

In addition, Mark looked at the code and saw that the other subs didn't actually do anything beyond returning the actual value from their data store (and, in the case of "nodeid", calling "preload_rows"). Normally, as in this post, it would be a bad idea to access the memory store directly. That said, when something like this is causing a 4-second delay when viewing posts with lots of comments, that's a very bad thing, so yes, optimising this is the right thing to do.

In that code change, Mark actually did three things:

1. He first changed the method calls to hash accesses, as you note.
2. He then added a single call to preload_rows before entering the loop, so he was only calling it once rather than 5,000 times.
3. Finally, because $u was never changing inside the loop, he saved the value of the userid to a separate scalar before entering the loop. That's quicker because it means Perl doesn't have to "dereference" the hashref and look up the hash value each time round; it can just get the value straight from the scalar.
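The third step can be sketched roughly like this. The "Counter" class is made up for illustration (it just counts how often its getter runs), and the loop is a stand-in rather than the real comment-loading code. Instead of timing anything, the sketch counts method calls, which is enough to show what hoisting the value out of the loop saves. If you want actual timings, Perl's core Benchmark module is one way to get them.

```perl
use strict;
use warnings;

{
    package Counter;    # hypothetical class that counts getter calls

    my $calls = 0;
    sub new         { my ($class, $id) = @_; return bless { userid => $id }, $class; }
    sub id          { my ($self) = @_; $calls++; return $self->{userid}; }
    sub calls       { return $calls; }
    sub reset_calls { $calls = 0; return; }
}

my $u        = Counter->new(42);
my @comments = (1 .. 5000);

# Before: the getter is invoked on every pass through the loop.
my $matches_before = 0;
for my $comment (@comments) {
    $matches_before++ if $u->id == 42;
}
my $calls_before = Counter->calls;

# After: $u never changes inside the loop, so copy the value out once.
Counter->reset_calls;
my $userid        = $u->{userid};
my $matches_after = 0;
for my $comment (@comments) {
    $matches_after++ if $userid == 42;
}
my $calls_after = Counter->calls;

print "getter calls before: $calls_before, after: $calls_after\n";
```

Both loops produce the same result; the second one simply never leaves the loop body to get it.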

He did note in the changelog that it was "kind of ugly", meaning that he was aware that in an ideal world, this isn't how you'd do it, and it's true that it'll make code maintenance harder in the future. I do think this was the right thing to do though; coding a webapp to real-world specs unfortunately means sometimes you have to put things you might have learned in Computer Science classes to the back of your mind.

Thanks, that's a great explanation!! I hadn't looked at the code at all, just the changelog itself, so it didn't make a lot of sense except that I had assumed the methods must have been mostly or solely "getters" or it wouldn't have worked.

I didn't realise dereferencing $u like that was actually optimisation too - I assumed that Perl would have optimised for that. Presumably it doesn't optimise because - at least in theory - one of the things done inside the loop might have changed $u so Perl has to check?

Thanks for these posts - they're really helpful.

That's what I believe, yes. I actually don't know how Perl optimises things, though, so you may actually be right. I don't think it does, though, because as you note, $u could theoretically change, and I don't think Perl will attempt to optimise that.

Hi.

I responded to sophie below, if you're interested. The basic gist of it was that the call to nodeid() really should have been just a getter, or a getter with a cheap check in it, and if it were then the gains from those optimizations probably would have been minimal. But nodeid() had a call to preload_rows(), and preload_rows() was potentially expensive, and avoiding that was the big win there.

Actually, at the time I made this change, I had already effectively optimized out preload_rows so that it ran through once on the first call, and later calls were effectively no-ops.

See http://changelog.dreamwidth.org/1135575.html

So, in this particular case, it was literally just invoking the getters that was slow. preload_rows didn't matter.

Yeah, in a sane world that would have fixed the problem, and in the importer codepath presumably it did. But in the read message codepath... Well, the logic you put in made it so each time you created a Comment object, it added it to @unloaded_singletons (among other things). Then at some point preload_rows() gets called, which does this:


     my @to_load =
         map  { [ $_->journal, $_->jtalkid ] }
         grep { ! $_->{_loaded_row} } @unloaded_singletons;
 
     # already loaded?
     return 1 unless @to_load;

     ...(call absorb_row on unloaded_singletons, which sets $_->{_loaded_row})

     @unloaded_singletons = ();

so it gets all of the entries in @unloaded_singletons that don't have $_->{_loaded_row} set (which should be all of them--they're unloaded, right?), and, assuming that there are any unloaded singletons, loads those, and then clears out @unloaded_singletons.

But when reading comments on an Entry, you load those Comments in get_talk_data(), which does


    my $make_comment_singleton = sub {
        my ($jtalkid, $row) = @_;
        return 1 unless $nodetype eq 'L';

        # at this point we have data for this comment loaded in memory
        # -- instantiate an LJ::Comment object as a singleton and absorb
        #    that data into the object
        my $comment = LJ::Comment->new($u, jtalkid => $jtalkid);
        # add important info to row
        $row->{nodetype} = $nodetype;
        $row->{nodeid}   = $nodeid;
        $comment->absorb_row(%$row);

        return 1;
    };

So it actually goes in and calls absorb_row() on each Comment, but, because it didn't go through preload_rows(), didn't remove the Comments from @unloaded_singletons. So here _all_ of the Comment objects in @unloaded_singletons have already had $_->{_loaded_row} set. Now that grep at the beginning of preload_rows() returns an empty array. So we return 1, @unloaded_singletons never gets cleared out, and next time we call preload_rows(), we go through all the Comment objects in @unloaded_singletons again, find that none of them still need loading, return 1...
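The behaviour described above can be reproduced with a toy model. This is a simplified sketch and not the real LJ::Comment code: the comments are plain hashes rather than objects, the subroutine bodies are cut down to the bare logic, and the $scanned counter is added purely to make the cost visible. The names @unloaded_singletons, preload_rows, absorb_row and _loaded_row are borrowed from the snippets above.

```perl
use strict;
use warnings;

my @unloaded_singletons;
my $scanned = 0;    # total number of objects preload_rows has looked at

sub new_comment {
    my $comment = { _loaded_row => 0 };
    push @unloaded_singletons, $comment;    # every new comment is registered
    return $comment;
}

sub absorb_row {
    my ($comment) = @_;
    $comment->{_loaded_row} = 1;
    return 1;
}

sub preload_rows {
    $scanned += scalar @unloaded_singletons;
    my @to_load = grep { !$_->{_loaded_row} } @unloaded_singletons;

    # already loaded? Note that this early return skips the clean-up below.
    return 1 unless @to_load;

    absorb_row($_) for @to_load;
    @unloaded_singletons = ();
    return 1;
}

# Load 100 comments the way get_talk_data() does: absorb each row directly,
# without ever going through preload_rows().
my @comments = map { my $c = new_comment(); absorb_row($c); $c } 1 .. 100;

# Then call preload_rows() once per comment, as a getter like nodeid() would.
preload_rows() for @comments;

print "still registered: ", scalar(@unloaded_singletons), "\n";
print "objects scanned:  $scanned\n";
```

With 100 comments, each of the 100 calls rescans all 100 registered objects, so the work grows with the square of the number of comments.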

Pretty terrible, huh? Maybe sophie should cover encapsulation next. :)

Let's form a pact. Next chance we get, we take this code out behind the woodshed and go Office Space on it. I think that would be cathartic.

Do you remember the time that I came out to OR to sleep on the beanbag in your and Michael's apartment for like three weeks, and we all went up into the state park to execute dead computer hardware? BECAUSE THAT WAS FREAKING AWESOME.

That was so much fun! I can't find the photos, alas, I thought they were on whitaker's scrapbook... guess not.

Woe!

Did I tell you about how when we were cleaning up after ourselves, I'd wound up scooping a bunch of casings into my backpack for later disposal, and must have missed one; two years ago I was going through security at BWI and they were Very Concerned that there was a spent bullet casing in my backpack. That had been there for like six years. That had been through security approximately 872 times since then.

(If I'd realized earlier that it was there, I would've made a pendant out of it or something!)

Let's form a pact. Next chance we get, we take this code out behind the woodshed and go Office Space on it. I think that would be cathartic.

AGREED!

Edited (made intent clearer) 2012-02-20 14:52 (UTC)

So I did some more work on that same code after mark touched it. It turns out that that preload_rows() call was a big problem--the implementation was that in order to see if we had any more comments to load, it scanned through all of the comments that had been touched on that request to see if any hadn't been loaded yet. Now, the method that mark modified was also scanning through all of the loaded comments to check for a different setting... So by having that nested call to preload_rows in there, that moved it from checking n comments to checking n^2 comments. Very very bad.

I don't remember if I fixed that in the update that I made, but it's certainly a fixable issue. If the calls to nodeid() (and therefore preload_rows()) are cheap, then calling nodeid() vs. $_->{nodeid} shouldn't make that much difference. I mean, it'll make some, and if you're really seriously optimizing that could be worth it, but chances are there's some other underlying problem.

Ahhhh. Thanks for that explanation! I had only been looking at the most recent version of nodeid/preload_rows (because I assumed they hadn't been changed), and it did seem odd to me that it was going as slowly as that. That makes a lot more sense, now.

It definitely seemed odd to me that pure getters should have *that* much of an effect! Glad to see this was just a mistake on my part.

Sophie's explanation is bang on and explains why I made that change. Thanks, Sophie!

I will take another moment here to make a comment about what "calling a method" actually does, behind the scenes, which helps explain why this particular case was so slow. (Wherein I ramble about computers.)

** What does this code do?

# this is my cool program, it returns my favorite number
return 4;

If you write that and save it as cool.pl and run it... what do you think it does? Well, let's try it:

Can't return outside a subroutine at cool.pl line 2.

This makes intuitive sense. You aren't in a subroutine -- so how can you return? You haven't gone anywhere to return from! The very word itself means "to go back" -- importantly, to a place you've already been.

** How do computers know how to do this?

To over-simplify, computers are very linear devices that go from point A to point B. You give them a set of instructions (commands) and they execute them, start to finish. Some of those instructions include things like "go execute this other code".

When that happens, the computer needs some way of knowing where it's been so that it can get back when you execute a return. The way they keep track of that is a thing called the stack. It is, in its most basic form, a way of keeping track of where you are in the program and where you've been so that you can get back. Kind of like a bookmark.

For this example, I'll be using the following small program:

1: sub cool {
2:     my $num = shift;
3:     return 0 if $num == 0;
4:     return cool($num - 1) + rand();
5: }
6: 
7: print cool(1);   # prints one random number
8: print cool(2);   # prints the sum of two random numbers
9: print cool(4);   # sum of four random numbers

This is a very silly example, but it gives us something recursive and (hopefully) easy to look at. Basically, it calls itself (recursively) the number of times you specify and returns the sum of that many random numbers.

** The stack, step by step!

When your program starts, the stack consists of one thing:

1. start of program [line 7, CURRENT]

I.e., "I'm at the beginning!" is what this bookmark says. Then it's going to do the first line of code: print cool(1);. This code tells it to execute the cool subroutine with the argument of 1.

To do this, the computer creates a new frame and pushes it on the stack. In other words, it creates a new bookmark so it can remember where it was and where it's going. The stack now looks like this:

1. start of program [line 7]
2. subroutine cool, arguments 1 [line 1, CURRENT]

Now, the code starts running. Yay! Eventually, it needs to leave the subroutine, i.e., it needs to return. To do that, the computer uses the stack so it knows where to go to leave. This is called popping from the stack. Once the computer does this, the stack looks like it did before:

1. start of program [line 7, CURRENT]

Now the computer knows where to go back to, so it can resume executing. It increments the current line and yay! Now you're on line 8!

** So... what next?

As you might expect, line 8 wants to call the cool subroutine again. It goes through the very same process: create a stack frame, push it onto the stack, jump to line 1, start executing. Eventually it returns by popping off the stack frame, jumping back to where it was, and continues execution of your program.
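If you'd like to watch the stack grow for yourself, here's a sketch built on the same cool subroutine. It uses Perl's built-in caller function, which reports on the frames currently on the stack. The stack_depth helper, $base and @depths are my own additions for illustration; depths are measured relative to the top level of the script.

```perl
use strict;
use warnings;

# Count the frames currently on the stack by asking caller() about each
# level in turn until it has nothing more to report.
sub stack_depth {
    my $depth = 0;
    while (my @frame = caller($depth)) {
        $depth++;
    }
    return $depth;
}

my $base = stack_depth();    # depth when called from the top level
my @depths;                  # depth seen at the start of each call to cool

sub cool {
    my $num = shift;
    push @depths, stack_depth() - $base;
    return 0 if $num == 0;
    return cool($num - 1) + rand();
}

cool(2);
print "@depths\n";    # one entry per call: cool(2), cool(1), cool(0)
```

Each nested call to cool sees one more frame than the call before it, and every one of those frames has to be popped again on the way back out.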

This is a lot of work. Every time you call a subroutine, it has to go through this pretty involved process to do all of the bookmarking required to properly jump around between different subroutines. For this reason (and others), calling a subroutine (or method) is a lot slower than you might expect.

In a tight loop like the one I changed the other day, all of that bouncing back and forth involves creating tens of thousands of stack frames. While individually they're very fast, 50,000 of anything is a lot slower than not doing it at all. In this case, it added up to approximately three seconds of time spent just doing bookkeeping. Oh well.

** The rest of the story I didn't talk about.

In reality, it's a lot slower due to dynamic/late binding. Perl has to look up exactly which line of code you are talking about when you tell it to call a method. Because of inheritance, it has to look in a bunch of places to see what method it should call.

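Because of polymorphism and because you can change an object's class pretty much at will, it has to go through this dispatch every time you call a method. Perl does cache the result of the lookup for each class, but that cache is thrown away whenever a class's subroutines or parents change, and even with it a method call costs more than a plain subroutine call.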
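To illustrate the lookup described above, here's a sketch with two made-up classes, "Device" and "MusicPlayer" (neither is from the DW codebase). Which code runs for a given method call depends on the object's class at the moment of the call, and on that class's parents, so Perl can't simply decide it once when the program is compiled.

```perl
use strict;
use warnings;

{
    package Device;    # hypothetical parent class
    sub new      { my ($class) = @_; return bless {}, $class; }
    sub describe { return 'generic device'; }
    sub power_on { return 'on'; }
}

{
    package MusicPlayer;    # hypothetical child class
    our @ISA = ('Device');  # inherit from Device
    sub describe { return 'music player'; }
}

my $player = MusicPlayer->new;          # 'new' itself is found in Device

my $own       = $player->describe;      # found in MusicPlayer
my $inherited = $player->power_on;      # not in MusicPlayer, so found in Device

# An object's class can be changed while the program is running...
bless $player, 'Device';
my $after_rebless = $player->describe;  # ...and the same call now finds Device's version

print "$own / $inherited / $after_rebless\n";
```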

I swear to God, every time you explain something like this, you answer questions I didn't even know I had.

Did you see allen's reply to me above? ( http://dw-dev-training.dreamwidth.org/32468.html?thread=235476#cmt235476 ) Apparently there was a nested call somewhere that caused it to change from scanning n comments to n^2 comments, which is what really sucked up the time. I'm not great at following this code in particular so I can't see it myself, but that would explain the biggest delay in this case.

In the old way of doing preload_rows, that's absolutely right. But I had already (on the 4th) optimized out that second part:

http://changelog.dreamwidth.org/1135575.html

This code creates an array and then loops over it once, loading everything. The second time through, that array will be empty, so it doesn't do any work other than calling the methods.
