dw_dev | where to put functions

Entry tags:

coding

where to put functions

So, I am currently in the first stages of working on http://bugs.dwscoalition.org/show_bug.cgi?id=1574 ( "Last active entries" module) and it's actually going quite nicely :)

However, I am a bit stumped on where to put a function I need. Namely, I need a function that, depending on what journal the module is in and who is viewing it, returns the last x active entries in that journal (maybe not the full Entry objects, though, still working on that).

My first thought was cgi-bin/LJ/User.pm , as it would be good to have $u->get_recent_active($remote) or something like that. But now I am leaning more towards putting it in cgi-bin/DW/Logic/LogItems.pm , which has the functions $u->watch_items and $u->recent_items . However, my idea only semi-fits the module description.

Or maybe it should be in cgi-bin/LJ/S2.pm?

Yes, I am still confused by where to put stuff. Any ideas?

Flat | Top-Level Comments Only

I'd suggest DW::Logic::LogItems.

And how are you implementing this? Where are you pulling the data from?

I'd suggest DW::Logic::LogItems.

Great, thanks!

And how are you implementing this? Where are you pulling the data from?

Latest comments from talk2 and security from log2. I hope that's okay, since they are both on the user's cluster?

Latest comments from talk2? How are you getting that? Show me the SQL query. Does it look like this:

mysql> explain select journalid, nodeid, datepost
from talk2 where journalid = 35 order by datepost desc limit 5\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: talk2
type: ref
possible_keys: PRIMARY,journalid
key: PRIMARY
key_len: 4
ref: const
rows: 36862
Extra: Using where; Using filesort
1 row in set (0.00 sec)

If it does, you can't use it. See that "using filesort" bit? Yeah, that means it's doing an awkwardly large scan of data: for this query, every single comment on a journal has to be loaded from disk and then sorted.

As a general rule, you must have an index on the data you're searching and ordering on. In this case, there's no index that covers datepost so there's no way to do a "sorted by time" search of the comments using talk2.

I can't think of any other "easy" way of doing it, either. Sorting by the logprop "commentalter" won't work for similar reasons (you'd have to load every logprop for every entry and sort).

Only/best way to do it is probably going to involve very painful ALTER TABLE statements on log2 to add (commenttime) and then INDEX (journalid, commenttime) and then doing your sorts based on that... but that's pretty painful. (But probably the best way...)

Add: I also looked at recent_comments.bml today and am thinking about whether to do something similar as in there, as it also gets some entry information based on the last comments with

my $lrow = $logrow{"$u->{userid} $r->{nodeid}"} ||= LJ::get_log2_row($u, $r->{'nodeid'});

But, basically, any suggestions welcomed. I have an early version of an SQL query so I can see what I actually need, how to sort it, etc., but I am far from set on actually *using* it.

Okay, as per IRC, you wanted my SQL. Don't say I didn't warn you, my experience with SQL mis... limited.

SELECT * FROM
(SELECT talk2.*, logtext2.subject FROM talk2 INNER JOIN log2 INNER JOIN logtext2 
ON logtext2.journalid=log2.journalid and talk2.journalid=logtext2.journalid and talk2.nodeid=log2.jitemid and log2.jitemid=logtext2.jitemid
WHERE logtext2.journalid=$userid AND talk2.state<>"D" AND talk2.state<>"S" ORDER BY talk2.datepost DESC) innertable 
GROUP BY innertable.nodeid ORDER BY innertable.datepost DESC LIMIT 10;

[Yes, three-table join *winces* - I would love to have the subject from logtext2 without getting the whole Entry object first - but that can definitely be substituted!]

I tried with just substituting datepost with jtalkid and for my test case (http://activeentries.yvi.hack.dreamwidth.net/) the results it shows are fine and in the correct order. I am not sure about the nested table thing, though, which I found is necessary since I only want one comment for each post and only the most recent one and GROUP BY and ORDER BY don't play nicely together :(. Is that a problem?

And reading back over it it's probably really horrible and if I shouldn't use any joins I'll be more than happy to do without. This was more for me testing how the databases work together and where what is stored. recent_comments.bml does have some stuff that might work otherwise.

I'll shut up now.

Edited 2009-08-23 17:38 (UTC)

Ooo, thanks for sharing it, I'd be interested to see any replies/opinions.

Heh, nice to know it might help more than just me. The DW databases leave me confused - I have worked (and still occasionally do) with mySQL, but it's always small databases with maybe three tables, so I can do pretty much anything I want to without time/logistics problems.

Wow, that's some serious hardcore SQL! Nice!

Something to keep in mind, though, is that we don't really use joins (or subselects):

[dw @ mark-dev - ~/current] -> grep -r "SELECT " cgi-bin | wc -l 558 [dw @ mark-dev - ~/current] -> grep -r " JOIN " cgi-bin | wc -l 3

(That's skewed a lot since it only counts explicit JOINs and not implicit inner joins. But you get the idea...)

There's a lot of reasons not to do joins like that. Database performance can be pretty terrible (in MySQL) when you start doing tough queries like that, it's much hard to memcache the data, and writing (and figuring out/modifying later) the SQL is so hard. (Granted, a lot of this is because MyISAM uses table level locking, so table joins is a quick path to pain. But even InnoDB, in my experience, is subject to deadlock conditions that are no fun when you start getting fancy with the SQL.)

What you should do in a situation like this is break the process down into small steps, and then see how you can do it simply and easily. For example, we already have things that load lots of comments and entries quickly and efficiently on the database, relying on memcache, etc, so the only part you have to write is the code that figures out which comments to load.

This can be a tricky problem, though. There is no efficient way, right now, to find the "10 most recently commented on posts" since we don't have that index. What you can find, however, is the N most recent comments, and see what posts they're on. If you select enough comments from the database you can be reasonably certain you're going to find a few posts.

So, for example, I would implement this function like this:

# called $u->get_rec..blahblah, returns array of LJ::Entry objects, so you can do
# things like 'foreach my $entry ( $u->get_rec... ) { ... }'
#
sub get_recently_commented_on_entries {
    my $u = shift or die 'no user';

    # take arrayref input, return LJ::Entry objects
    my $make_entry_objs = sub {
        return map { LJ::Entry->new( $u, jitemid => $_ ) @{$_[0]};
    };

    # try memcache first
    my $recent = $u->memc_get( 'recently_commented_posts' );
    return $make_entry_objs->( $recent ) if $recent;

    # not in memcache, step load some post ids from comments
    my @ids = $u->selectcol_array(
        q{SELECT nodeid FROM talk2
          WHERE journalid = ? AND state <> 'D'
          ORDER BY jtalkid DESC LIMIT 50},
        undef, $u->id
    );

    # now we have nodeids in @ids, you can find the first five unique values
    # through whatever method seems best to you, here assuming @postids is the
    # output of "find five uniques"
    FIXME: code this

    # now memcache and return the results.  never memcache LJ::Entry objects,
    # just the ids...
    $u->memc_set( 'recently_commented_posts', \@postids );
    return $make_entry_objs->( \@postids );
}

I haven't tested the above, but that should work. It digs up the last 50 comments, finds what entries they were on, picks the first 5 unique entries (first five unique = five newest guaranteed because we ordered by jtalkid), and then returns LJ::Entry objects which you can use to do whatever.

This is easy on the database (one query, no joins, it's memcached), and it's fairly minimal as far as the code goes. Easy to debug and work on.

The drawback to this method is that it's "handwavey". If you have a really popular journal, you might have 50 comments on your most recent post. Selecting only the most recent 50 will end up with the "recently commented on posts" box having only a single entry in it. It's pretty easy to deal with that, though. If you really wanted, you could make that 'SELECT' into a 'SELECT DISTINCT' and change the LIMIT to 5. That guarantees you get five unique entries (if they exist), but it is theoretically unlimited in how many rows it sorts through in the database.

In our case, we know it won't be TOO terrible because we limit the number of comments on a single entry. So, at most, SELECT DISTINCT ... LIMIT 5 will go through 25,000 rows. Which is a lot, but is still much less than 'SELECT ... ORDER BY datepost' on some accounts.

(You also have to make sure that whenever someone comments on an entry, you delete the 'recently_commented_posts' memcache key for the journal they're commenting on. Otherwise, the person viewing the site won't see 'live' data on that box...)

...

I've now rambled on for half an hour. Thoughts, questions, comments, critiques?

[dw @ mark-dev - ~/current] -> grep -r " JOIN " cgi-bin | wc -l
3

Heh, yeah, I tried that, too. Found a few more, though, since there are some lowercase joins ;)

when you start doing tough queries like that, it's much hard to memcache the data, and writing (and figuring out/modifying later) the SQL is so hard.

Memcache, one of the last big misteries left on Earth ;)

If you select enough comments from the database you can be reasonably certain you're going to find a few posts.

Sounds good.

# try memcache first
my $recent = $u->memc_get( 'recently_commented_posts' );
return $make_entry_objs->( $recent ) if $recent;

...

$u->memc_set( 'recently_commented_posts', \@postids );

Ah, now I finally get how *that* works!

In our case, we know it won't be TOO terrible because we limit the number of comments on a single entry. So, at most, SELECT DISTINCT ... LIMIT 5 will go through 25,000 rows. Which is a lot, but is still much less than 'SELECT ... ORDER BY datepost' on some accounts.

*nods* So you think that would be fine? Limit 5 is good (I was planning on either 5 or 10). I'll take another look when it's not 7:30 and I am actually logged into my Dreamhack, but this sounds very reasonable. Thanks for the explanation!

Yeah, SELECT DISTINCT ... LIMIT 5 would be fine. If it turns out to be a heavy load on the database in the future, we can always tune it!

Hmm. Your ideas are intriguing to me and I wish to subscribe to your newsletter...

Okay, I have further questions (and I might comment here occasionally, if that is fine for you?):

SELECT DISTINCT * FROM talk2 WHERE journalid = ? AND state <> 'D' and state <> 'S' ORDER BY jtalkid DESC LIMIT 5;

does the trick, thank you. But only if all entries are public or the user has access to all 5 entries will all of those later show up in the list, as I will obviously discard ones that can't be viewed at a later step. So the list might come up very short for partially locked journals if we start from 5. Anyway, going through all is obviously data-base intensive, so we don't want to do that.

But, as a compromise, would it be possible to either a) pull out ten instead, or b) pull out 5 if the user is owner/maintainer and 10 if the user isn't and then narrow it down to no more than 5 later? I hope you get what I mean here. I'd just hate to see that module only come up with one entry regularly because I do think it's shiny :) [And if I had a biggish community it would totally be a reason to buy a paid community for it]

We can also do it with 5 first, though, and maybe talk about this a while after implementation again?

(You also have to make sure that whenever someone comments on an entry, you delete the 'recently_commented_posts' memcache key for the journal they're commenting on. Otherwise, the person viewing the site won't see 'live' data on that box...)

Okay, that needs to be done manually in talkpost then? memcache is a big mystery for me and I can't test it...

Yvi, lost in code and thought.

Sure, 10 is fine.

AND state <> 'D' and state <> 'S'

Prefer AND state NOT IN ('D', 'S')

Why can't you test memcache?

10 it is then :)

Prefer AND state NOT IN ('D', 'S')

Oh, new syntax, shiny :) Will use.

Why can't you test memcache?

I thought it wasn't set up on the Dreamhacks?

*searches on the Wiki* Oh, I can set it up. Cool. Will contact

sophie about it then.

Yep, and memcache is pretty easy. I'll try to write up a post for

dw_dev_training about how it works, how to use it, when to use it, etc. :)

Flat | Top-Level Comments Only

where to put functions

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject