Entry tags:
Stats: Active accounts by payment level
One of the stats we need to provide is active accounts by account level (free, paid, premium, seed). Currently, the active accounts come from a per-cluster table with the user id as primary key and the time of last activity, in which users are added/updated whenever they do something that makes them active.
DW::StatData::ActiveAccounts handle collecting those stats, basically running something like the following request to retrieve the number of accounts active in the last 30 days:
To avoid that, the obvious solution is to add the account level to that table. However, this could cause discrepancies if the account level changes after the latest activity. For instance, if a paid account expires after the last activity, it would still be counted as "active paid" in the breakdown of active accounts, but free in the accounts by level stats, possibly leading to more accounts counted as "active paid" than as "paid". The only way to fix that is to have the account level transitions (paid account expiring, payment received, etc...) "correct" the table, but that gets rather ugly as the number of stats that needs a breakdown by account level goes up (and may not be possible for some stats at all, depending on how they're collected). Also, it can lead to biased data. For instance, if a free account gets paid for the first time after the last activity but before the stats are collected, it will be counted as paid even though all its activity was as a free account.
Adding the account level to the primary key (and thus, keeping separate activity records for each user id) isn't feasible, because that field would be null for existing activity records, which makes it unsuitable for a primary key. In addition, that wouldn't really help, because we'd still need to choose between activity records for the same account, lest we count it twice (or more). If having (according to stats) more active paid accounts than paid accounts (or than active accounts) makes stats look unreliable, think how much worse it would be to have more active paid accounts than total accounts. (Can you hear the "DW is inflating its stats" cries from here? :-) )
Anyway: the best compromise (IMO) seems to capture the account level in the activity table at the time that table is updated, and to leave it unchanged afterward. Opinions? Questions? Blindingly obvious solutions that I missed?
DW::StatData::ActiveAccounts handle collecting those stats, basically running something like the following request to retrieve the number of accounts active in the last 30 days:
SELECT COUNT(*) FROM clustertrack2 WHERE timeactive > UNIX_TIMESTAMP()-30*86400That table doesn't include the account level, because (so far) it doesn't need to. This means that to break down active accounts by account level, I'd need to retrieve the complete list of active accounts and check the account level for each, which is much more costly than the current SQL.
To avoid that, the obvious solution is to add the account level to that table. However, this could cause discrepancies if the account level changes after the latest activity. For instance, if a paid account expires after the last activity, it would still be counted as "active paid" in the breakdown of active accounts, but free in the accounts by level stats, possibly leading to more accounts counted as "active paid" than as "paid". The only way to fix that is to have the account level transitions (paid account expiring, payment received, etc...) "correct" the table, but that gets rather ugly as the number of stats that needs a breakdown by account level goes up (and may not be possible for some stats at all, depending on how they're collected). Also, it can lead to biased data. For instance, if a free account gets paid for the first time after the last activity but before the stats are collected, it will be counted as paid even though all its activity was as a free account.
Adding the account level to the primary key (and thus, keeping separate activity records for each user id) isn't feasible, because that field would be null for existing activity records, which makes it unsuitable for a primary key. In addition, that wouldn't really help, because we'd still need to choose between activity records for the same account, lest we count it twice (or more). If having (according to stats) more active paid accounts than paid accounts (or than active accounts) makes stats look unreliable, think how much worse it would be to have more active paid accounts than total accounts. (Can you hear the "DW is inflating its stats" cries from here? :-) )
Anyway: the best compromise (IMO) seems to capture the account level in the activity table at the time that table is updated, and to leave it unchanged afterward. Opinions? Questions? Blindingly obvious solutions that I missed?
no subject
(Anonymous) 2009-09-27 02:59 am (UTC)(link)Build a memory table comprised of all (or all active) account numbers + status (presumably a single digit) and join to it when you want data by status. Hell, build it new every time you run stats. Can it be that costly?
no subject
no subject
no subject
no subject
(Also, your username looks very familiar.)
no subject
(Anonymous) 2009-09-27 08:46 pm (UTC)(link)I am more thinking of those of us who are "always" logged in.. does their account show as active once they hit a dreamwidth website? Or does it show them as active when they have to log in for real?
This isn't the sort of programming I do at work, so I could just be blowing smoke :)
no subject
no subject
no subject
Does this mean "using login.bml" or "sending a logged-in cookie"?
I tend to log in to LiveJournal/Dreamwidth about twice a year or something, even though I tend to visit them often.
no subject
no subject
r
PS
Miss you!
no subject
no subject
Not that I'm a dev - the "codiest" thing I've done so far is write CSS that's being turned into a journal theme by
no subject
no subject
no subject
This way you can keep track of when the account level was changed and when the last activity was, thus perhaps avoiding biased data? Because this way, you can select the active accounts with paid status that have the their last activity while in the paid period.
SELECT COUNT(*)
FROM clustertrack2
WHERE AccountLvl = 'Paid' AND
TimeActive > AccountLvStart AND
TimeActive < AccountLvEnd
(This code should select the paid users that used their accounts within their paid account period?)
But this relies on the updating the table every single time the status of the account is changed (account getting paid or expiring) and it will generate a lot of SQL entries.
I hope this helps, but, yeah I'm still not very good with SQL. If this was complete silly, then I apologize. *hides*
no subject
(And FTR, your SQL isn't silly. Implementation-wise, if I were to use your scheme, I would probably do a join with the existing account level history table which IIRC exists somewhere, rather than duplicate its content in clustertrack2. But if that table didn't exist, your implementation would make sense.)
no subject
SELECT COUNT(*)
FROM clustertrack2 ct2 JOIN accountHistory ah ON ct2.memberID = ah.memberID
WHERE ct2.AccountLvl = 'Paid' AND
ct2.LastActiveTime > ah.AccountLvStart AND
ct2.LastActiveTime < ah.AccountLvEnd AND
ah.AccountLvStart < getDate() AND
ah.AccountLvEnd > getDate();
This assumes the accountHistory table gets a new row added every time the account level changes, not overwriting the existing rows. For example, if user with ID DW12345 goes from free to paid on the 29th September 2009, the AccountLvEnd for DW12345's free record will go from NULL to 29th September 2009. Then there will be a new record added for DQ12345 for a paid account where the AccountLvStart is 29th September 2009 and the AccountLvEnd is the date at which the paid period expires. This should restrict a member to one type of account only at any one time.
The three lines:
WHERE ct2.AccountLvl = 'Paid' AND
ct2.LastActiveTime > ah.AccountLvStart AND
ct2.LastActiveTime < ah.AccountLvEnd AND
should be isolate the users that where last active in their paid period, since the last time they were active must fall between the date when they started their paid period and when their pay period ended.
The next two lines:
ah.AccountLvStart < getDate() AND
ah.AccountLvEnd > getDate();
would further restrict it so that the paid period falls on the date of when the statistics are taken. This also assumes that this code is run on the day that you want to gather the statistics. Otherwise, you can replace getDate() with the date that you intent for the cut-off point.
Actually, are the tables documented anywhere? Or at least the clustertrack2 and account history tables? It will be easier if I can know the structure of at least some of the tables.
no subject
Re your general idea:
that may work, if the performance isn't intolerably low. (Need to look at that in more depth, unless mark or one of the ops people already knows.)Never mind. That won't work, because clustertrack2 is clustered and the account level history table (IIRC) isn't, so we can't join them (different database servers).no subject
The active_user table:
CREATE TABLE active_user (
userid INT UNSIGNED NOT NULL,
type CHAR(1) NOT NULL,
time INT UNSIGNED NOT NULL,
KEY (userid),
KEY (time)
)
And the dw_paidstatus table:
CREATE TABLE dw_paidstatus (
userid int unsigned NOT NULL,
typeid smallint unsigned NOT NULL,
expiretime int unsigned,
permanent tinyint unsigned NOT NULL,
lastemail int unsigned,
PRIMARY KEY (userid),
INDEX (expiretime)
)
So if the assumption about the time attribute in the active_user table applies, then this:
SELECT COUNT(*)
FROM active_user au JOIN dw_paidstatus dwps ON au.memberID = dwps.memberID
WHERE dwps.typeid = (whatever the paid IDs are, several of them can be allowed using OR statements) AND
au.time > dwps.beginTime AND
au.time < dwps.expireTime AND
dwps.beginTime < getDate() AND
dwps.expireTime > getDate();
should work if we add a column to contain when that account level period (beginTime) started to the dw_paidstatus table, and that the records were kept by adding a new row for every account level change. But then, we'll need to change the primary key as well, because one user may have several account level changes, thus several entries, and the primary key needs to be unique. (This method will generate more entries than the existing method, but the new model will allow us to track and analyze trends like when people are buying paid accounts and etc.)
This might work? Maybe? I hope I'm not doing something completely stupid...
no subject
no subject
no subject