anarres wrote in dw_dev on 2010-04-09 03:29 pm

GSoC Proposal: Usage and Business Statistics

 

Dreamwidth Usage and Account Statistics


1. Introduction

The project has two sections:

-Usage Statistics: to give a detailed overview of how Dreamwidth is used, visible to all.

-Business Statistics: to give a detailed financial report over any desired time period, visible to site admins only.

Information will be presented both numerically and graphically.


2. Why this project?

Dreamwidth is a big project that has lots of people putting energy into it: coders, people doing the business side, documenters, and also end-users - Dreamwidth is a tool they use to express their dreams and ideas. All of these people will be able to spend their energy more effectively if they have useful, reliable information about how Dreamwidth is used. For instance, post-length: if most users are making short, Twitter-like posts, it would make sense to cater for this by having the simplest, quickest-to-use interface possible, and making it work with mobile phones and other devices. If users are including lots of images and video, it would make sense to cater for that as well. Similarly, succinctly presenting detailed financial information will make life easier for the people running the business side of things.


3. Graphics front-end

Graphing library: GD::Graph. A script will take one or more arrays of numerical values as its arguments and create a line, bar, or pie chart. Since essentially any set of numbers can be fed into this script, it is flexible enough to be used for any new statistics that are desired in future.

There is a sample script for generating graphs, oneBar.pl, at http://pastebin.com/vuugQ3UT and http://pastebin.com/52UUjMVU.
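To illustrate the "arrays in, chart out" idea without depending on the linked Perl script, here is a minimal sketch in Python. It is not the proposed GD::Graph implementation: it writes a bare-bones SVG bar chart by hand, and the function name `bar_chart_svg`, its parameters, and the sample numbers are all made up for illustration.

```python
from xml.sax.saxutils import escape


def bar_chart_svg(labels, values, width=400, height=200, bar_gap=10):
    """Return an SVG bar chart (as a string) for parallel lists of labels and values.

    Hypothetical helper for illustration only; the proposal itself uses GD::Graph.
    """
    if not values or len(labels) != len(values):
        raise ValueError("labels and values must be non-empty and the same length")
    if min(values) < 0:
        raise ValueError("this sketch only handles non-negative values")

    label_space = 20                 # room under the bars for the text labels
    top = max(values) or 1           # avoid dividing by zero when all values are 0
    slot = width / len(values)       # horizontal space given to each bar

    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">']
    for i, (label, value) in enumerate(zip(labels, values)):
        bar_height = (height - label_space) * value / top
        x = i * slot + bar_gap / 2
        y = height - label_space - bar_height
        parts.append(
            f'<rect x="{x:.1f}" y="{y:.1f}" width="{slot - bar_gap:.1f}" '
            f'height="{bar_height:.1f}" fill="steelblue"/>'
        )
        parts.append(
            f'<text x="{i * slot + slot / 2:.1f}" y="{height - 5}" '
            f'text-anchor="middle" font-size="10">{escape(label)}</text>'
        )
    parts.append("</svg>")
    return "\n".join(parts)


# Made-up numbers, one bar per account type.
svg = bar_chart_svg(["Paid", "Premium", "Seed"], [30, 12, 5])
```

Because the function only sees labels and numbers, the same call would work for any statistic added later, which is the flexibility the paragraph above is after.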



4. Back-end: quick overview of the statistics data storage and retrieval system

At the core of the statistics system is the database table site_stats, which stores key-value pairs, e.g.:

 

| Key | Value |
|---|---|
| Personal accounts | 46000 |
| ... | ... |

Table 1. A simplified representation of database table site_stats, showing that there are 46000 personal accounts (this number is made-up).

At present there are 26 keys, which all represent different types of account (not all mutually exclusive). Data is regularly (daily?) added to this table by three modules: AccountsByType.pm, ActiveAccounts.pm, and PaidAccounts.pm. The modules StatStore.pm and StatData.pm are also needed for collecting and retrieving this data.
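As a rough illustration of the key-value idea, the sketch below uses Python's built-in sqlite3 module. The table name `site_stats_sketch`, its columns, and the helpers `record` and `latest` are assumptions made for this example; they are not the real site_stats schema or the StatStore.pm / StatData.pm interfaces, and the numbers are made-up.

```python
import sqlite3


def make_store():
    """Create an in-memory stand-in for a key-value statistics table (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE site_stats_sketch ("
        " stat_date TEXT NOT NULL,"      # ISO date, so text order equals date order
        " stat_key TEXT NOT NULL,"
        " stat_value INTEGER NOT NULL,"
        " PRIMARY KEY (stat_date, stat_key))"
    )
    return conn


def record(conn, stat_date, stat_key, stat_value):
    """Store one value for one key on one day, replacing any earlier value for that day."""
    conn.execute(
        "INSERT OR REPLACE INTO site_stats_sketch VALUES (?, ?, ?)",
        (stat_date, stat_key, stat_value),
    )


def latest(conn, stat_key):
    """Return the most recently recorded value for a key, or None if there is none."""
    row = conn.execute(
        "SELECT stat_value FROM site_stats_sketch"
        " WHERE stat_key = ? ORDER BY stat_date DESC LIMIT 1",
        (stat_key,),
    ).fetchone()
    return None if row is None else row[0]


conn = make_store()
record(conn, "2010-04-08", "personal", 45900)   # made-up numbers
record(conn, "2010-04-09", "personal", 46000)
print(latest(conn, "personal"))
```

Keeping one row per key per day is what would make both the snapshot tables and the over-time graphs in section 5 possible from the same data.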


5. Usage Stats

 

5.1 Statistics using the existing statistics system, giving a snapshot of Dreamwidth usage right now.


The following two tables show data to be generated daily and displayed on /stats/site.bml (much of it is already there).


 

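| Account type (not all mutually exclusive) | Number | Percentage |
|---|---|---|
| Personal | | |
| External id | | |
| Active | | |
| Inactive | | |
| Paid | | |
| Unpaid | | |
| Redirect | | |
| Identity | | |
| x | | |
| x | | |
| x | | |
| TOTAL | 495927 | 100.00% |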

 

Table 2: Accounts. To be displayed on /stats/site.bml.



 

| Account type | Number of accounts | Percentage of active accounts | Percentage of total accounts |
|---|---|---|---|
| Paid - active | | | |
| Paid - inactive | | | |
| Paid - total | | | |
| Seed - active | | | |
| Seed - inactive | | | |
| Seed - total | | | |
| Premium - active | | | |
| Premium - inactive | | | |
| Premium - total | | | |
| Total active paid accounts | | | |
| Total inactive paid accounts | | | |
| TOTAL PAID ACCOUNTS | | | |




 

Table 3: Paid accounts. To be displayed on /stats/site.bml.




Fig.1: Example bar graph (using made-up numbers).
The graphs could be made to look better through better choice of colours, fonts, and whitespace. The code to generate this graph is at http://pastebin.com/vuugQ3UT and http://pastebin.com/52UUjMVU.

Things I'm not clear about:

-Do the account types (redirect, identity, personal, syndicated, community) refer to all accounts, or just to paid accounts?

-Which is better: bar charts or pie charts?

-Are there two types of paid account (premium, seed) or three types of paid account (paid, premium, seed)?


 

5.2 Statistics using the existing system, showing usage over time

The graphs above are all snapshots at a given moment in time. Trends over time can be shown using line graphs.




Fig. 2: an example of a line graph showing changes in the number of active and inactive paid accounts over time (the numbers are made-up).

Graphs like this could be generated automatically, for instance once per week, and displayed on /stats/site.bml. However, the number of possible graphs is very large, so another option would be to let the user select a time period and a set of desired statistics from drop-down menus, and generate the graph on the fly.


 

5.3 New accounts per day

It would also be interesting to show new accounts per day / per week of all the various account types. An easy way to do this would be to use the formula:

new accounts on Tuesday = Total accounts on Tuesday – Total accounts on Monday.

This assumes that no accounts are shut down, which might be reasonable if most people simply allow their accounts to become inactive rather than shutting them down.
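The subtraction formula above can be sketched as follows. This is Python for illustration only, with a hypothetical function name and made-up totals.

```python
def new_accounts_per_day(daily_totals):
    """Given [(day, total_accounts), ...] in date order, return [(day, new_accounts), ...].

    Each day's figure is that day's total minus the previous day's total,
    so the first day in the input has no entry in the output.
    """
    result = []
    for (_, previous_total), (day, total) in zip(daily_totals, daily_totals[1:]):
        result.append((day, total - previous_total))
    return result


# Made-up totals; Wednesday's total drops because some accounts were shut down.
totals = [("Monday", 46000), ("Tuesday", 46120), ("Wednesday", 46110)]
print(new_accounts_per_day(totals))
```

The negative figure for Wednesday in this made-up data shows where the "no accounts are shut down" assumption breaks: the difference is really new accounts minus closed accounts.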

A better way to do it would be to add some new keys to the database tables statkeylist and site_stats ('new paid', 'new premium', 'new seed', etc.) and to write a new module, similar to ActiveAccounts.pm, AccountsByType.pm and PaidAccounts.pm, to add this extra data to site_stats.


5.4 More possible Usage statistics

The statistics in sections 5.4.1 and 5.4.2 don't fit in with the statistics system described in section 4, so it might be more difficult to store and retrieve these statistics.

Many statistics involve getting some value (for instance, size of last post) for every user or every account. With around 500,000 accounts this could be memory-intensive, and it might be appropriate to select data from every 10th or every 100th account, for instance, when working out the average post size.
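A minimal sketch of the "every 10th or every 100th account" idea, in Python for illustration. The function name `sampled_mean` and the sample sizes are made up, and how post sizes would actually be read from the database is left open.

```python
from itertools import islice
from statistics import mean


def sampled_mean(values, step):
    """Average every `step`-th value from an iterable, starting with the first.

    `values` can be a generator, so the full data set never has to be held
    in memory at once; only the sampled values are kept.
    """
    if step < 1:
        raise ValueError("step must be at least 1")
    return mean(islice(values, 0, None, step))


# Made-up post sizes in characters; take every 2nd one.
post_sizes = [10, 20, 30, 40, 50, 60]
print(sampled_mean(post_sizes, 2))
```

Taking every Nth account is only a fair sample if account order is unrelated to the value being measured. If older accounts tend to post differently from newer ones, a random sample would be a safer choice.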


5.4.1 Usage of post features such as tags, cuts, images

For each feature a histogram could be made showing how many users use the feature with a given frequency. Usage of a given feature could be calculated per user, per account, or per blog post.

To find out which features are most popular, I'd suggest including only the top 30% (say) most active users. This is because I imagine there are a lot of users who set up a blog, but quickly lose interest in it after making only one or two posts, and these users are likely to have used a feature because it was on by default or easy to use, rather than because they really liked it. Another option would be to exclude the first few blog posts from usage statistics, since this could be seen as the period when people are trying things out and finding out what they like.

On the other hand if you wanted to find out which features are easiest to find and use, it would make sense to do the opposite, and look at what features are used in the first one or two posts.
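The two steps described in this section (restrict to the most active users, then build a histogram of feature usage) could look roughly like this. It is a Python sketch with hypothetical function names and made-up data, not a description of how Dreamwidth stores posts or tags.

```python
from collections import Counter


def most_active(post_counts, fraction=0.3):
    """Return the users in the top `fraction` by number of posts (at least one user)."""
    ranked = sorted(post_counts, key=post_counts.get, reverse=True)
    if not ranked:
        return []
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


def usage_histogram(feature_uses, users):
    """Map 'times the feature was used' to 'number of users with that count'."""
    return Counter(feature_uses.get(user, 0) for user in users)


# Made-up data: posts per user, and number of posts in which each user used tags.
post_counts = {"a": 50, "b": 2, "c": 31, "d": 1, "e": 12,
               "f": 7, "g": 1, "h": 3, "i": 90, "j": 4}
tag_uses = {"a": 20, "i": 20, "b": 1}

top_users = most_active(post_counts)            # the 3 most active of 10 users
print(usage_histogram(tag_uses, top_users))
```

For the opposite question (which features are easiest to find), the same histogram function could be fed counts taken only from each user's first one or two posts.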


5.4.2 Some more possible statistics:

Referral source (where visitors are referred from)

 

Average size of post

This could be either a number simply giving the average size, or a histogram showing the distribution of post sizes.

 

Time since last activity (or since last post)

This could be either an average or a histogram, and could be broken down by account type.

Usage of post features such as tags, cuts, images


6. Business statistics

The purpose is to give a detailed financial report over any desired time period, or to show trends over time. I don't yet have a clear sense of how this data is stored and collected, so I'll simply show a mock-up of how the information could be presented (this section is basically a write-up of the Admin Stats wish-list in Bug 124).

This information could be automatically generated (for example) once a week, or could be generated on the fly with the user choosing the desired time period and the desired statistics from drop-down boxes in an HTML form.

 

6.1 Mock financial report for period April 1 – April 7, 2010:


Total payments: $XXX

Total refunds: $XXX

Net money in: $XXX


 

Table 4: Payments for the period April 1 – April 7, 2010

| Payment type | Number of payments | Dollar amount |
|---|---|---|
| Seed - new | XXX | $XXX |
| Seed - not new | XXX | $XXX |
| Seed - total | XXX | $XXX |
| Paid - new | XXX | $XXX |
| Paid - not new | XXX | $XXX |
| Paid - total | XXX | $XXX |
| Premium - new | XXX | $XXX |
| Premium - not new | XXX | $XXX |
| Premium - total | XXX | $XXX |
| TOTAL | XXX | $XXX |


Table 5: Lapsed accounts: April 1 – April 7, 2010



Percent churn for the time period = (lapsed paid accounts that don't renew within 7 days / total paid accounts) * 100


(This would have to be calculated for a period ending at least 7 days before the current date.)
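The churn calculation above can be sketched as follows. This is Python for illustration; the function name, the dictionary-based inputs, and the dates are all assumptions, since how lapse and renewal dates are actually stored is not yet known.

```python
from datetime import date, timedelta


def percent_churn(lapsed, renewals, total_paid, grace_days=7):
    """Percentage of paid accounts that lapsed and did not renew within the grace period.

    lapsed:   {account_id: date the paid time ran out}   (assumed input shape)
    renewals: {account_id: date of the next payment}     (assumed input shape)
    """
    if total_paid <= 0:
        raise ValueError("total_paid must be positive")
    grace = timedelta(days=grace_days)
    not_renewed = 0
    for account, lapsed_on in lapsed.items():
        renewed_on = renewals.get(account)
        renewed_in_time = (
            renewed_on is not None and lapsed_on <= renewed_on <= lapsed_on + grace
        )
        if not renewed_in_time:
            not_renewed += 1
    return 100.0 * not_renewed / total_paid


# Made-up data for April 1 - April 7, 2010.
lapsed = {1: date(2010, 4, 1), 2: date(2010, 4, 3), 3: date(2010, 4, 7)}
renewals = {1: date(2010, 4, 5), 3: date(2010, 4, 20)}   # account 3 renews too late
print(percent_churn(lapsed, renewals, total_paid=200))
```

In this made-up data, account 1 renews after 4 days and is not counted, while accounts 2 and 3 count as churned.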


Table 6: Refunds: April 1 – April 7, 2010

| Account type | Amount | Fees | Refund type – chargeback / paypal? |
|---|---|---|---|
| Seed | $XX | X | X |
| Paid | $XX | X | X |
| Premium | $XX | X | X |
| TOTAL | $XX | X | X |


Working out the number of new accounts created in a given period is complicated by the fact that an account could be created via code and paid for quite a long time afterwards, or not paid for at all. A simple solution would be to count these accounts on the day they are first paid for rather than on the day they are created. Another approach: work out what fraction of accounts created in the past were eventually paid for. If the fraction is roughly constant across different time periods, it can be used to predict what fraction of accounts created in the most recent period (day, week, etc.) will eventually be paid for.
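The second approach could be sketched like this in Python. The function names, the tolerance used to decide whether the fraction is "roughly constant", and the cohort numbers are all made up for illustration.

```python
def paid_fractions(cohorts):
    """For each past period, the fraction of created accounts that were eventually paid for.

    cohorts: [(accounts_created, accounts_eventually_paid), ...]  (assumed input shape)
    """
    return [paid / created for created, paid in cohorts if created > 0]


def predict_paid(new_accounts, cohorts, tolerance=0.05):
    """Estimate how many of `new_accounts` will eventually be paid for.

    Returns None when the historical fractions differ by more than `tolerance`,
    i.e. when the "roughly constant" assumption does not hold.
    """
    fractions = paid_fractions(cohorts)
    if not fractions:
        raise ValueError("need at least one past period with created accounts")
    if max(fractions) - min(fractions) > tolerance:
        return None
    average = sum(fractions) / len(fractions)
    return new_accounts * average


# Made-up history: in each past period a quarter of created accounts were paid for.
history = [(200, 50), (400, 100), (100, 25)]
print(predict_paid(80, history))
```

Returning None rather than a number when the past fractions disagree keeps the report from showing a prediction that the data does not support.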


Table 7: New paid accounts: April 1 – April 7, 2010

 

| Account type | Number of accounts created |
|---|---|
| Premium - created via code | X |
| Premium - created via payment | X |
| Premium - total | X |
| Paid - created via code | X |
| Paid - created via payment | X |
| Paid - total | X |
| Seed - created via code | X |
| Seed - created via payment | X |
| Seed - total | X |
| TOTAL | X |


The data in the tables above could also be presented in bar charts or pie charts.


6.2 Showing financial trends over time

All the data above represents a snapshot over one short time period. Trends over long periods of time could be shown using line graphs similar to Fig. 2.


