Sophie ([personal profile] sophie) wrote in [site community profile] dw_dev_training, 2012-02-17 11:14 pm

DW object-oriented programming explained (Part 2)

Welcome to the second part of the series on object-oriented programming - or OO - as it applies to the Dreamwidth codebase. :)

If you haven't already read the first part, you'll want to do that before reading this part. I also realise that I never got around to explaining what 'methods' are in the first post, so I'm going to do that right now before delving into the main part of this post:

What are methods?

Recall from the previous post that each object, in both real life and OO, has what are called "properties" - pieces of information about the object. Each object constructed from the same class will have the same property *names* (e.g. "number_of_pages"), but different *values* (one book might have 500 pages, another might have 150, etc).

But these aren't enough to fully allow an object to work. For example, take an iPod. It would have a number of properties, such as "color" and "disk_space", but they don't help describe what the iPod *does* - plays music.

When an iPod is used to play music, the user generally just selects a song and hits Play. That's all the user needs to know; the iPod itself takes care of the tricky parts, like making sure the status on the screen is up-to-date with what's happening, pumping audio through those earbuds, and turning your body into a silhouette. Okay, maybe it doesn't do that last one, but still, the point is that it knows how to deal when someone wants to play music. It's what it was designed to do, after all.

And that's what methods are for. A class's methods are there to deal with stuff that other programmers shouldn't have to care about - they can just tell an object to do something, and it does it. Methods are defined in the class - the blueprint - but when a programmer using an object invokes one of these methods, the method gets access to the object's memory store, which allows it to take action appropriate for that *particular* object.

That's a little confusing. Let me try to explain it in terms of the iPod. Let's say we have an "iPod" object and the class it was constructed from has a method called "play_song". When this method is invoked for a particular song, the code that's called isn't tailored for that specific iPod - it's the same code that runs for all "iPod" objects.(*) But some magic in the programming language allows the code to gain access to the property values of that specific "iPod" object, which will have everything the iPod needs to know to play the song it was given, such as the current volume level, etc.
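To make that concrete, here's a minimal sketch in Perl. The "IPod" class, its properties and the song name are all made up for illustration (nothing here comes from the DW codebase), and the "bless" call is explained further down in this post. The thing to notice is that there's only one copy of play_song, but each object it's invoked on supplies its own property values:

```perl
use strict;
use warnings;

{
    package IPod;    # hypothetical class, for illustration only

    # Constructor: each object gets its own property values.
    sub new {
        my ($class, %props) = @_;
        my $self = { color => $props{color}, volume => $props{volume} };
        return bless $self, $class;
    }

    # One method, shared by every IPod object. $self is whichever
    # object the method was invoked on.
    sub play_song {
        my ($self, $song) = @_;
        return "Playing $song at volume $self->{volume}";
    }
}

my $mine  = IPod->new(color => 'silver', volume => 3);
my $yours = IPod->new(color => 'pink',   volume => 9);

print $mine->play_song('Song A'),  "\n";   # uses $mine's volume
print $yours->play_song('Song A'), "\n";   # same code, $yours's volume
```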

(Before I leave this subject, I wanted to note that in the comments on my last post, my quick explanation of methods involved having a "nextPage" method on a Book class. After further reflection, I figured that this probably wasn't quite accurate, because you can't ask a book to turn its own pages - you have to do that yourself. Hence, I used a new example here.)

(*) Of course, in real life an iPod has an actual copy of the code to itself stored in a microchip. If you think of the construction process, however, each real-life iPod that's constructed will have the same code in its microchip, which isn't tailored for any particular manufactured iPod - so it still kinda makes sense.


As with the last post, if you have any questions on this, feel free to let me know in the comments!


So, with that explanation of methods out of the way, it's time to move on to our next topic - how it applies to the DW codebase.

I'm going to do this as a few posts, each dealing with its own topic, because I've got a fair amount to say about them. I'm still not entirely sure how many there'll be, but I'm writing them one at a time so there may be some time (a few days to a week) between each one.

A couple of things to note before I begin:
  • This post may require some basic knowledge of Perl and/or programming in general. Not much, I promise! (Things such as what a 'string' is, etc.) But all the same, if anybody finds themselves confused by anything I write, feel free to ask for clarification in the comments. I won't bite!

  • Secondly, if you're used to OO from another language, you'll find some things about Perl's implementation of OO to be strange and baffling. That's because Perl wasn't actually designed with OO in mind; OO support came later, and to be honest, it shows. Still, it's what we use, so I hope I can at least help with understanding it.(**)

    (**) There is a version of Perl in the works which does a much better job of not only OO but a lot of other things - Perl 6 - but at the cost of revamping a lot of the language such that you probably wouldn't be able to use it without spending some time making sure your code conformed to it. For this series, therefore, I'll be concentrating on Perl 5, which is what most Perl developers - including DW and LJ - use.


With all that said, let's move on to our first topic!

What is an 'object' in Perl?

You may already have seen examples of OO in Perl in the Dreamwidth codebase. For example, when you see something like:

$ret .= "<td>" . $u->ljuser_display . "</td>";

...what you're actually seeing is the coder calling the 'ljuser_display' method on an object called $u. The value that method gives back is then inserted into a string.

But wait. Aren't Perl variables beginning with a dollar symbol supposed to be scalars (variables holding a single value), not objects?

To explain this, let me explain briefly the three different types of variables to be found in Perl:
  • Scalars: These variables begin with a dollar symbol ($), and represent a single value.
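  • Arrays: These variables begin with an at-sign (@), and represent a series of values which are accessed by number. (You'll sometimes hear these called 'lists'; strictly speaking, in Perl a list is the series of values and an array is the variable that holds one.)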
  • Hashes: These variables begin with a percent sign (%) and represent an unordered list of named values, and each value can be accessed by using its name. Other languages might know this as an 'associative array'.
So clearly, $u must be a scalar, because it begins with a dollar sign. But it's *also* an object, and that's not on the list above. Wha?
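As a quick illustration of those three types, here's a small sketch with made-up values. One thing that trips people up: when you pull a *single* value out of an array or hash, you write it with a dollar sign, because the thing you're getting back is a scalar.

```perl
use strict;
use warnings;

my $title = 'Dreamwidth';                  # scalar: a single value
my @pages = (150, 500, 320);               # array: values accessed by number
my %book  = (number_of_pages => 500);      # hash: values accessed by name

my $second = $pages[1];                    # arrays count from 0, so this is 500
my $count  = $book{number_of_pages};       # look up a value by its name

print "$title $second $count\n";
```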

Here's the thing - unlike other languages, Perl doesn't have a separate 'object' type. Instead, when you create an object, what you're *really* doing is taking a scalar that holds a reference and "blessing" the thing it refers to as an object of a certain class. (Seriously, that's what it's called.) After that, you can call the class's methods through the scalar.

Why would anybody do such a thing? Because the scalar represents that object's internal memory store.

I didn't tell you this above, but although it's true that scalars can only represent a single value, that single value can be a reference to another variable. That works because what Perl stores as the value is, in effect, the memory location of that other variable. (Other languages can also do this; there, such values are usually known as 'pointers'.)

Recall from the last post that the memory store of an object consists of 'properties', which are named values, such as 'number_of_pages'. As such, the internal memory store of an object is best represented as a hash. But Perl's "bless" only works through a reference, so instead you create a reference to the hash (or in Perl parlance, a "hashref"), put it in a scalar, and then bless it via that scalar. (Strictly speaking, it's the hash being pointed to that gets marked with the class, not the scalar itself, which is why every copy of the reference behaves as the same object.) It's a roundabout way of doing it, but because of the convenience of having the memory store variable *right there*, it works.
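Here's a minimal sketch of that blessing step, done by hand so you can see each stage. The "Book" class is hypothetical; "ref" and "bless" are Perl built-ins. Real classes normally hide all this inside a constructor.

```perl
use strict;
use warnings;

{
    package Book;    # hypothetical class, for illustration only
    sub number_of_pages { my ($self) = @_; return $self->{number_of_pages}; }
}

my %store = (number_of_pages => 500);   # the memory store: a plain hash
my $book  = \%store;                    # a hashref, held in a scalar

my $before = ref($book);                # 'HASH': just a reference so far
bless $book, 'Book';                    # now it's an object of class Book
my $after  = ref($book);                # 'Book'

# It's the hash itself that got blessed, so a fresh reference to the
# same hash is an object too.
my $other = ref(\%store);

print "before: $before, after: $after, other: $other\n";
print $book->number_of_pages, "\n";     # methods now work on $book
```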

There's one problem with this, and that's that if you have the variable that represents the object, you also have access to its internal memory store, because you can still use the scalar as a normal hashref by using a syntax such as:

my $id = $u->{'userid'};

Here, we're using the "->" syntax to say that we know that $u contains a reference of some kind, and that we want to get to the variable that it's pointing to. We then use that variable as a hash to get to the property named 'userid'.
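If references are new to you, here's a small sketch of what "->" is doing, using a plain hash with made-up values rather than a real user object:

```perl
use strict;
use warnings;

my %user = (userid => 42, user => 'example');   # made-up values
my $u    = \%user;                              # $u holds a reference to %user

my $id   = $u->{'userid'};     # follow the reference, then look up 'userid'
my $same = ${$u}{'userid'};    # an older, equivalent way of writing it
my @keys = sort keys %$u;      # treat the referenced variable as a whole hash

print "$id $same @keys\n";
```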

Now, if this is code within the class itself, then this is generally fine. In most other cases, however, it's bad form to peek directly into the memory store of another object, even if you do have it right there. That's because you don't generally know how that object uses its memory store; it's possible that any information you grab might be out of date, for example. Worse, the layout of the memory store might change in the future; after all, it's only intended to be an *internal* memory store, and as long as the object knows how to deal with its own memory store, that's all that's really required.

Instead, most classes will supply methods that can get you the value you want. (Rather appropriately, they tend to be informally called "getters".) In the example above, although I didn't show its creation, I can tell you that $u is an LJ::User object, and the class for LJ::User defines a method called "id" that will get you the same information, so you can write the above line like so:

my $id = $u->id;

Perl reuses the "->" syntax even when you want to call a method; I'm not entirely sure why. In any case, here we're calling the "id" method to gain the userid instead of looking directly into the memory store, and the class itself gets to decide how to give us the information we want. With this, we can be sure that if LJ::User's memory store layout changes in the future, we'll still get what we need.
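To show why the getter is the safer choice, here's a sketch using a made-up class, My::User, which is *not* the real LJ::User - I've deliberately given it a different internal layout, keeping the id under a key called '_uid'. Code that goes through the getter doesn't care; code that peeks into the hash and guesses the key is 'userid' quietly gets nothing.

```perl
use strict;
use warnings;

{
    package My::User;    # hypothetical stand-in, not the real LJ::User

    sub new {
        my ($class, %args) = @_;
        # Internal layout: the id is stored under '_uid'.
        return bless { _uid => $args{userid} }, $class;
    }

    # Getter: callers never need to know which key is used internally.
    sub id { my ($self) = @_; return $self->{_uid}; }
}

my $u      = My::User->new(userid => 42);
my $getter = $u->id;          # works whatever the layout is
my $peek   = $u->{userid};    # guesses at the layout, and guesses wrong

print "getter: $getter\n";
print "peek:   ", (defined $peek ? $peek : 'undef'), "\n";
```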

(In practice, this is unlikely to be an issue in DW's codebase, and indeed a lot of code in there *does* use the memory store instead of the appropriate method. It isn't a good idea, though, and it makes future code maintenance much easier if getters are used instead.)


That's about it for this post. There's a lot of stuff here so feel free to ask questions if there's anything you don't understand! My next post will probably talk about how you can create and use an object, as well as some examples of existing classes in the codebase.

[personal profile] jeshyr 2012-02-18 12:12 am (UTC)(link)
What's the speed tradeoff for using methods vs. accessing the hash directly?

ISTR that [personal profile] zorkian recently optimised the comment accesses for posts with lots of comments by shifting from accessing a method to accessing a hash directly because it was quicker??

[personal profile] jeshyr 2012-02-18 03:31 am (UTC)(link)
Thanks, that's a great explanation!! I hadn't looked at the code at all, just the changelog itself, so it didn't make a lot of sense except that I had assumed the methods must have been mostly or solely "getters" or it wouldn't have worked.

I didn't realise dereferencing $u like that was actually optimisation too - I assumed that Perl would have optimised for that. Presumably it doesn't optimise because - at least in theory - one of the things done inside the loop might have changed $u so Perl has to check?

Thanks for these posts - they're really helpful.

[personal profile] allen 2012-02-18 06:42 pm (UTC)(link)
Hi.

I responded to [personal profile] sophie below, if you're interested. The basic gist of it was that the call to nodeid() really should have been just a getter, or a getter with a cheap check in it, and if it were then the gains from those optimizations probably would have been minimal. But nodeid() had a call to preload_rows(), and preload_rows() was potentially expensive, and avoiding that was the big win there.

[staff profile] mark 2012-02-19 08:36 pm (UTC)(link)
Actually, at the time I made this change, I had already effectively optimized out preload_rows so that it ran through once on the first call, and later calls were effectively no-ops.

See http://changelog.dreamwidth.org/1135575.html

So, in this particular case, it was literally just invoking the getters that was slow. preload_rows didn't matter.

[personal profile] allen 2012-02-20 12:39 am (UTC)(link)
Yeah, in a sane world that would have fixed the problem, and in the importer codepath presumably it did. But in the read message codepath... Well, the logic you put in made it so each time you created a Comment object, it added it to @unloaded_singletons (among other things). Then at some point preload_rows() gets called, which does this:


my @to_load =
    map  { [ $_->journal, $_->jtalkid ] }
    grep { ! $_->{_loaded_row} } @unloaded_singletons;

# already loaded?
return 1 unless @to_load;

...(call absorb_row on unloaded_singletons, which sets $_->{_loaded_row})

@unloaded_singletons = ();


so it gets all of the entries in @unloaded_singletons that don't have $_->{_loaded_row} set (which should be all of them--they're unloaded, right?), and, assuming that there are any unloaded singletons, loads those, and then clears out @unloaded_singletons.

But when reading comments on an Entry, you load those Comments in get_talk_data(), which does


my $make_comment_singleton = sub {
    my ($jtalkid, $row) = @_;
    return 1 unless $nodetype eq 'L';

    # at this point we have data for this comment loaded in memory
    # -- instantiate an LJ::Comment object as a singleton and absorb
    # that data into the object
    my $comment = LJ::Comment->new($u, jtalkid => $jtalkid);
    # add important info to row
    $row->{nodetype} = $nodetype;
    $row->{nodeid} = $nodeid;
    $comment->absorb_row(%$row);

    return 1;
};


So it actually goes in and calls absorb_row() on each Comment, but, because it didn't go through preload_rows(), didn't remove the Comments from @unloaded_singletons. So here _all_ of the Comment objects in @unloaded_singletons have already had $_->{_loaded_row} set. Now that grep at the beginning of preload_rows() returns an empty array. So we return 1, @unloaded_singletons never gets cleared out, and next time we call preload_rows(), we go through all the Comment objects in @unloaded_singletons again, find that every one of them already has $_->{_loaded_row} set, return 1 without clearing anything...

Pretty terrible, huh? Maybe [personal profile] sophie should cover encapsulation next. :)
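For anyone who wants to watch that happen, here's a stripped-down model of the sequence described above. It's a sketch, not the real code: new_comment() and the $scanned counter are made up for illustration, the Comments are plain hashes, and "loading" is just setting a flag. What it keeps is the shape of the problem: an early return that skips the clear.

```perl
use strict;
use warnings;

my @unloaded_singletons;    # every new comment gets queued here
my $scanned = 0;            # how many objects preload_rows has inspected

# Hypothetical stand-in for creating a Comment singleton.
sub new_comment {
    my $c = { _loaded_row => 0 };
    push @unloaded_singletons, $c;
    return $c;
}

sub preload_rows {
    my @to_load = grep { $scanned++; !$_->{_loaded_row} } @unloaded_singletons;
    return 1 unless @to_load;           # early return skips the clear below
    $_->{_loaded_row} = 1 for @to_load;
    @unloaded_singletons = ();
    return 1;
}

# Load path that bypasses preload_rows(), the way get_talk_data() does.
my @comments = map { new_comment() } 1 .. 100;
$_->{_loaded_row} = 1 for @comments;

# Now call preload_rows() once per comment, as a getter in a loop would.
preload_rows() for 1 .. 100;

print "still queued: ", scalar(@unloaded_singletons), "\n";
print "objects scanned: $scanned\n";    # 100 calls x 100 objects each
```

With 100 comments that's 10,000 checks to load nothing at all, and it grows with the square of the comment count.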

[staff profile] mark 2012-02-20 08:28 am (UTC)(link)
Let's form a pact. Next chance we get, we take this code out behind the woodshed and go Office Space on it. I think that would be cathartic.

[staff profile] denise 2012-02-20 08:30 am (UTC)(link)
Do you remember the time that I came out to OR to sleep on the beanbag in your and Michael's apartment for like three weeks, and we all went up into the state park to execute dead computer hardware? BECAUSE THAT WAS FREAKING AWESOME.

[staff profile] mark 2012-02-20 08:42 am (UTC)(link)
That was so much fun! I can't find the photos, alas, I thought they were on [livejournal.com profile] whitaker's scrapbook... guess not.

[staff profile] denise 2012-02-20 08:45 am (UTC)(link)
Woe!

Did I tell you about how when we were cleaning up after ourselves, I'd wound up scooping a bunch of casings into my backpack for later disposal, and must have missed one; two years ago I was going through security at BWI and they were Very Concerned that there was a spent bullet casing in my backpack. That had been there for like six years. That had been through security approximately 872 times since then.

(If I'd realized earlier that it was there, I would've made a pendant out of it or something!)

[personal profile] allen 2012-02-20 02:51 pm (UTC)(link)
Let's form a pact. Next chance we get, we take this code out behind the woodshed and go Office Space on it. I think that would be cathartic.

AGREED!
Edited (made intent clearer) 2012-02-20 14:52 (UTC)

[personal profile] allen 2012-02-18 06:37 pm (UTC)(link)
So I did some more work on that same code after [staff profile] mark touched it. It turns out that that preload_rows() call was a big problem--the implementation was that in order to see if we had any more comments to load, it scanned through all of the comments that had been touched on that request to see if any hadn't been loaded yet. Now, the method that [staff profile] mark modified was also scanning through all of the loaded comments to check for a different setting... So by having that nested call to preload_rows in there, that moved it from checking n comments to checking n^2 comments. Very very bad.

I don't remember if I fixed that in the update that I made, but it's certainly a fixable issue. If the calls to nodeid() (and therefore preload_rows()) are cheap, then calling nodeid() vs. $_->{nodeid} shouldn't make that much difference. I mean, it'll make some, and if you're really seriously optimizing that could be worth it, but chances are there's some other underlying problem.

[staff profile] mark 2012-02-18 09:12 pm (UTC)(link)
Sophie's explanation is bang on and explains why I made that change. Thanks, Sophie!

I will take another moment here to make a comment about what "calling a method" actually does, behind the scenes, which helps explain why this particular case was so slow. (Wherein I ramble about computers.)

** What does this code do?

# this is my cool program, it returns my favorite number
return 4;


If you write that and save it as cool.pl and run it... what do you think it does? Well, let's try it:

Can't return outside a subroutine at cool.pl line 2.


This makes intuitive sense. You aren't in a subroutine -- so how can you return? You haven't gone anywhere to return from! The very word itself means "to go back" -- importantly, to a place you've already been.

** How do computers know how to do this?

To over-simplify, computers are very linear devices that go from point A to point B. You give them a set of instructions (commands) and they execute them, start to finish. Some of those instructions include things like "go execute this other code".

When that happens, the computer needs some way of knowing where it's been so that it can get back when you execute a return. The way they keep track of that is a thing called the stack. It is, in its most basic form, a way of keeping track of where you are in the program and where you've been so that you can get back. Kind of like a bookmark.

For this example, I'll be using the following small program:

1: sub cool {
2:     my $num = shift;
3:     return 0 if $num == 0;
4:     return cool($num - 1) + rand();
5: }
6: 
7: print cool(1);   # prints one random number
8: print cool(2);   # prints the sum of two random numbers
9: print cool(4);   # sum of four random numbers


This is a very silly example, but it gives us something recursive and (hopefully) easy to look at. Basically, it calls itself (recursively) the number of times you specify and returns the sum of that many random numbers.

** The stack, step by step!

When your program starts, the stack consists of one thing:

1. start of program [line 7, CURRENT]

I.e., "I'm at the beginning!" is what this bookmark says. Then it's going to do the first line of code: print cool(1);. This code tells it to execute the cool subroutine with the argument of 1.

To do this, the computer creates a new frame and pushes it on the stack. In other words, it creates a new bookmark so it can remember where it was and where it's going. The stack now looks like this:

1. start of program [line 7]
2. subroutine cool, arguments 1 [line 1, CURRENT]


Now, the code starts running. Yay! Eventually, it needs to leave the subroutine, i.e., it needs to return. To do that, the computer uses the stack so it knows where to go to leave. This is called popping from the stack. Once the computer does this, the stack looks like it did before:

1. start of program [line 7, CURRENT]

Now the computer knows where to go back to, so it can resume executing. It increments the current line and yay! Now you're on line 8!
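If you'd like to see the stack grow for yourself, Perl's built-in caller function lets a program peek at its own frames. Here's a sketch built on the same cool subroutine; the depth() helper and the @depths array are my additions for illustration, and the numbers are taken relative to the top level so they don't depend on how the script is run.

```perl
use strict;
use warnings;

# Count the subroutine frames currently on the call stack.
sub depth {
    my $d = 0;
    $d++ while caller($d);    # caller($d) is false once we run out of frames
    return $d - 1;            # don't count depth() itself
}

my @depths;                   # stack depth seen on each entry to cool()

sub cool {
    my $num = shift;
    push @depths, depth();
    return 0 if $num == 0;
    return cool($num - 1) + rand();
}

my $base = depth();           # depth at the top level of the program
cool(2);

# Each nested call adds one more frame: 1, 2, 3
print join(', ', map { $_ - $base } @depths), "\n";
```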

** So... what next?

As you might expect, line 8 wants to call the cool subroutine again. It goes through the very same process: create a stack frame, push it onto the stack, jump to line 1, start executing. Eventually it returns by popping off the stack frame, jumping back to where it was, and continues execution of your program.

This is a lot of work. Every time you call a subroutine, it has to go through this pretty involved process to do all of the bookmarking required to properly jump around between different subroutines. For this reason (and others), calling a subroutine (or method) is a lot slower than you might expect.

In a tight loop like the one I changed the other day, all of that bouncing back and forth involves creating tens of thousands of stack frames. While individually they're very fast, 50,000 of anything is a lot slower than not doing it at all. In this case, it added up to approximately three seconds of time spent just doing bookkeeping. Oh well.
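Here's a rough way to measure that difference yourself. It's a sketch under a few assumptions: the Node class is made up (it's not the real Comment code), the getter does nothing but return a value, and Time::HiRes is a core module. The timings will vary from machine to machine, so I won't promise particular numbers - the only thing guaranteed is that both loops compute the same total.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

{
    package Node;    # hypothetical class, for illustration only
    sub new    { my ($class, $id) = @_; return bless { nodeid => $id }, $class; }
    sub nodeid { my ($self) = @_; return $self->{nodeid}; }
}

my @nodes = map { Node->new($_) } 1 .. 50_000;

my $t0 = time;
my $via_method = 0;
$via_method += $_->nodeid for @nodes;      # one method call per object
my $t1 = time;

my $via_hash = 0;
$via_hash += $_->{nodeid} for @nodes;      # direct hash access, no call
my $t2 = time;

printf "method: %.4fs  hash: %.4fs  same total: %s\n",
    $t1 - $t0, $t2 - $t1, ($via_method == $via_hash ? 'yes' : 'no');
```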

** The rest of the story I didn't talk about.

In reality, it's a lot slower due to dynamic/late binding. Perl has to look up exactly which line of code you are talking about when you tell it to call a method. Because of inheritance, it has to look in a bunch of places to see what method it should call.

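Because of polymorphism and because you can change an object's class pretty much at will, it has to do this logic at run time, every time you call a method; it can't be worked out once when the code is compiled. Perl does cache the result of the inheritance search, so it doesn't re-walk the whole class tree on each call, but even with that cache a method call does noticeably more work than a plain hash access.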
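A small sketch of why the lookup can't be decided ahead of time. The Animal and Dog classes are made up for illustration. The same line of code, "$pet->speak", ends up running different subroutines depending on what class the object belongs to at the moment of the call:

```perl
use strict;
use warnings;

{
    package Animal;    # hypothetical base class
    sub new   { my ($class) = @_; return bless {}, $class; }
    sub speak { return 'generic noise'; }
    sub name  { return 'animal'; }
}
{
    package Dog;       # hypothetical subclass
    our @ISA = ('Animal');
    sub speak { return 'Woof'; }
}

my $pet = Dog->new;            # 'new' isn't in Dog, so Perl finds Animal::new

my $first  = $pet->speak;      # found directly in Dog
my $second = $pet->name;       # not in Dog, so found in Animal via @ISA

bless $pet, 'Animal';          # change the object's class at run time
my $third  = $pet->speak;      # same call, now resolves to Animal::speak

print "$first / $second / $third\n";
```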

[staff profile] denise 2012-02-18 09:24 pm (UTC)(link)
I swear to God, every time you explain something like this, you answer questions I didn't even know I had.

[staff profile] mark 2012-02-19 08:37 pm (UTC)(link)
In the old way of doing preload_rows, that's absolutely right. But I had already (on the 4th) optimized out that second part:

http://changelog.dreamwidth.org/1135575.html

This code creates an array and then loops over it once, loading everything. The second time through, that array will be empty, so it doesn't do any work other than calling the methods.