yvi: Dreamsheep with Replicator pattern (Dreamsheep - Replisheep)
yvi ([personal profile] yvi) wrote in [site community profile] dw_dev2009-08-23 04:44 pm
Entry tags:

where to put functions

So, I am currently in the first stages of working on http://bugs.dwscoalition.org/show_bug.cgi?id=1574 ( "Last active entries" module) and it's actually going quite nicely :)

However, I am a bit stumped on where to put a function I need. Namely, I need a function that, depending on what journal the module is in and who is viewing it, returns the last x active entries in that journal (maybe not the full Entry objects, though, still working on that).

My first thought was cgi-bin/LJ/User.pm , as it would be good to have $u->get_recent_active($remote) or something like that. But now I am leaning more towards putting it in cgi-bin/DW/Logic/LogItems.pm , which has the functions $u->watch_items and $u->recent_items . However, my idea only semi-fits the module description.

Or maybe it should be in cgi-bin/LJ/S2.pm?

Yes, I am still confused by where to put stuff. Any ideas?
mark: A photo of Mark kneeling on top of the Taal Volcano in the Philippines. It was a long hike. (Default)

[staff profile] mark 2009-08-23 04:48 pm (UTC)(link)
I'd suggest DW::Logic::LogItems.

And how are you implementing this? Where are you pulling the data from?
zorkian: Icon full of binary ones and zeros in no pattern. (Default)

[personal profile] zorkian 2009-08-23 05:02 pm (UTC)(link)
Latest comments from talk2? How are you getting that? Show me the SQL query. Does it look like this:

mysql> explain select journalid, nodeid, datepost
from talk2 where journalid = 35 order by datepost desc limit 5\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: talk2
type: ref
possible_keys: PRIMARY,journalid
key: PRIMARY
key_len: 4
ref: const
rows: 36862
Extra: Using where; Using filesort
1 row in set (0.00 sec)

If it does, you can't use it. See that "using filesort" bit? Yeah, that means it's doing an awkwardly large scan of data: for this query, every single comment on a journal has to be loaded from disk and then sorted.

As a general rule, you must have an index on the data you're searching and ordering on. In this case, there's no index that covers datepost so there's no way to do a "sorted by time" search of the comments using talk2.

I can't think of any other "easy" way of doing it, either. Sorting by the logprop "commentalter" won't work for similar reasons (you'd have to load every logprop for every entry and sort).

Only/best way to do it is probably going to involve very painful ALTER TABLE statements on log2 to add (commenttime) and then INDEX (journalid, commenttime) and then doing your sorts based on that... but that's pretty painful. (But probably the best way...)
badgerbag: (Default)

[personal profile] badgerbag 2009-08-23 05:56 pm (UTC)(link)
Ooo, thanks for sharing it, I'd be interested to see any replies/opinions.
mark: A photo of Mark kneeling on top of the Taal Volcano in the Philippines. It was a long hike. (Default)

[staff profile] mark 2009-08-24 12:40 am (UTC)(link)
Wow, that's some serious hardcore SQL! Nice!

Something to keep in mind, though, is that we don't really use joins (or subselects):

[dw @ mark-dev - ~/current] -> grep -r "SELECT " cgi-bin | wc -l
558
[dw @ mark-dev - ~/current] -> grep -r " JOIN " cgi-bin | wc -l
3


(That's skewed a lot since it only counts explicit JOINs and not implicit inner joins. But you get the idea...)

There's a lot of reasons not to do joins like that. Database performance can be pretty terrible (in MySQL) when you start doing tough queries like that, it's much hard to memcache the data, and writing (and figuring out/modifying later) the SQL is so hard. (Granted, a lot of this is because MyISAM uses table level locking, so table joins is a quick path to pain. But even InnoDB, in my experience, is subject to deadlock conditions that are no fun when you start getting fancy with the SQL.)

What you should do in a situation like this is break the process down into small steps, and then see how you can do it simply and easily. For example, we already have things that load lots of comments and entries quickly and efficiently on the database, relying on memcache, etc, so the only part you have to write is the code that figures out which comments to load.

This can be a tricky problem, though. There is no efficient way, right now, to find the "10 most recently commented on posts" since we don't have that index. What you can find, however, is the N most recent comments, and see what posts they're on. If you select enough comments from the database you can be reasonably certain you're going to find a few posts.

So, for example, I would implement this function like this:

# called $u->get_rec..blahblah, returns array of LJ::Entry objects, so you can do
# things like 'foreach my $entry ( $u->get_rec... ) { ... }'
#
sub get_recently_commented_on_entries {
    my $u = shift or die 'no user';

    # take arrayref input, return LJ::Entry objects
    my $make_entry_objs = sub {
        return map { LJ::Entry->new( $u, jitemid => $_ ) @{$_[0]};
    };

    # try memcache first
    my $recent = $u->memc_get( 'recently_commented_posts' );
    return $make_entry_objs->( $recent ) if $recent;

    # not in memcache, step load some post ids from comments
    my @ids = $u->selectcol_array(
        q{SELECT nodeid FROM talk2
          WHERE journalid = ? AND state <> 'D'
          ORDER BY jtalkid DESC LIMIT 50},
        undef, $u->id
    );

    # now we have nodeids in @ids, you can find the first five unique values
    # through whatever method seems best to you, here assuming @postids is the
    # output of "find five uniques"
    FIXME: code this

    # now memcache and return the results.  never memcache LJ::Entry objects,
    # just the ids...
    $u->memc_set( 'recently_commented_posts', \@postids );
    return $make_entry_objs->( \@postids );
}


I haven't tested the above, but that should work. It digs up the last 50 comments, finds what entries they were on, picks the first 5 unique entries (first five unique = five newest guaranteed because we ordered by jtalkid), and then returns LJ::Entry objects which you can use to do whatever.

This is easy on the database (one query, no joins, it's memcached), and it's fairly minimal as far as the code goes. Easy to debug and work on.

The drawback to this method is that it's "handwavey". If you have a really popular journal, you might have 50 comments on your most recent post. Selecting only the most recent 50 will end up with the "recently commented on posts" box having only a single entry in it. It's pretty easy to deal with that, though. If you really wanted, you could make that 'SELECT' into a 'SELECT DISTINCT' and change the LIMIT to 5. That guarantees you get five unique entries (if they exist), but it is theoretically unlimited in how many rows it sorts through in the database.

In our case, we know it won't be TOO terrible because we limit the number of comments on a single entry. So, at most, SELECT DISTINCT ... LIMIT 5 will go through 25,000 rows. Which is a lot, but is still much less than 'SELECT ... ORDER BY datepost' on some accounts.

(You also have to make sure that whenever someone comments on an entry, you delete the 'recently_commented_posts' memcache key for the journal they're commenting on. Otherwise, the person viewing the site won't see 'live' data on that box...)

...

I've now rambled on for half an hour. Thoughts, questions, comments, critiques?
mark: A photo of Mark kneeling on top of the Taal Volcano in the Philippines. It was a long hike. (Default)

[staff profile] mark 2009-08-24 05:56 am (UTC)(link)
Yeah, SELECT DISTINCT ... LIMIT 5 would be fine. If it turns out to be a heavy load on the database in the future, we can always tune it!
alierak: (Default)

[personal profile] alierak 2009-08-24 05:18 pm (UTC)(link)
Hmm. Your ideas are intriguing to me and I wish to subscribe to your newsletter...
mark: A photo of Mark kneeling on top of the Taal Volcano in the Philippines. It was a long hike. (Default)

[staff profile] mark 2009-08-26 04:53 pm (UTC)(link)
Sure, 10 is fine.

AND state <> 'D' and state <> 'S'

Prefer AND state NOT IN ('D', 'S')

Why can't you test memcache?
mark: A photo of Mark kneeling on top of the Taal Volcano in the Philippines. It was a long hike. (Default)

[staff profile] mark 2009-08-26 04:57 pm (UTC)(link)
Yep, and memcache is pretty easy. I'll try to write up a post for [site community profile] dw_dev_training about how it works, how to use it, when to use it, etc. :)