fu ([personal profile] fu) wrote in [site community profile] dw_dev 2010-05-27 03:14 pm

Batch processing of huge tables

So, I just committed a code merge for an issue we found via LJ, where comment properties weren't being deleted when the comment was.

That takes care of comments deleted from now on, but we still have old orphaned properties hanging around! How do we handle those? I'm assuming that, given the size of talkprop2, doing:

DELETE talkprop2
  FROM talkprop2, talk2
 WHERE talkprop2.journalid = talk2.journalid
   AND talkprop2.jtalkid   = talk2.jtalkid
   AND talk2.state = "D";

would be a horrible idea, but what can we do to make it less horrible? Should we even try to delete?
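
One way to make it less horrible, sketched here for discussion: bound each statement to a single journal, since any one journal's deleted comments should touch only a small slice of talkprop2. This keeps the same multi-table DELETE shape as above; the journalid value is a made-up example, and the per-journal loop would live in a driver script.

DELETE talkprop2
  FROM talkprop2, talk2
 WHERE talkprop2.journalid = talk2.journalid
   AND talkprop2.jtalkid   = talk2.jtalkid
   AND talk2.state = "D"
   AND talk2.journalid = 12345;  -- made-up example; iterate over journals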

[personal profile] foxfirefey 2010-05-27 07:37 am (UTC)
Hrm... maybe there could be a worker that gets batches by the primary indexed key where talk2.state="D" and checks each one. Just slow and steady and low-key?
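
A minimal sketch of that kind of batching, assuming MySQL and an arbitrary batch size of 1000: the worker keeps a (journalid, jtalkid) cursor and walks talk2's primary key, so each pass is cheap and resumable.

-- Fetch the next batch of deleted comments past the saved cursor:
SELECT journalid, jtalkid
  FROM talk2
 WHERE state = "D"
   AND (journalid, jtalkid) > (?, ?)  -- cursor from the previous batch
 ORDER BY journalid, jtalkid
 LIMIT 1000;

-- Delete the matching property rows for just those keys:
DELETE FROM talkprop2
 WHERE (journalid, jtalkid) IN ((?, ?), (?, ?) /* ...keys from above */);

-- Save the last key returned as the new cursor, sleep briefly, and
-- repeat until the SELECT comes back empty.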

[personal profile] sophie 2010-05-27 10:17 am (UTC)
That sounds like it could work. Grab the list of state="D" comments *now* (since anything deleted in the future will have its properties removed by the new code anyway), then work through them steadily, at a slow enough rate that it doesn't affect the site too much. There shouldn't be any difference on the frontend, after all. (Unless export_comments.bml returns the properties, but I can't remember whether it does.)
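
That snapshot could be as simple as copying the current keys into a scratch table for the worker to drain. A hypothetical sketch: the purge_candidates table name and column types are invented here, not existing schema.

-- Snapshot the comments that are in state "D" *now*:
CREATE TABLE purge_candidates (
    journalid INT UNSIGNED NOT NULL,
    jtalkid   INT UNSIGNED NOT NULL,
    PRIMARY KEY (journalid, jtalkid)
);

INSERT INTO purge_candidates
SELECT journalid, jtalkid
  FROM talk2
 WHERE state = "D";

-- The worker deletes talkprop2 rows for small slices of purge_candidates,
-- removing each slice as it finishes, so progress survives restarts.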

[personal profile] zorkian 2010-05-27 03:07 pm (UTC)
IMO not really worth worrying about. We aren't hurting for those megabytes.

[staff profile] mark 2010-06-04 10:12 pm (UTC)
Given that we have backups of things going back for months, it's not really a big deal IMO.

However, I'm not opposed to purging the data. We can do it on the inactive database first to see how slow the operation is, and then run it at non-peak times on the active database.

I suppose the answer is yes, we should try to delete it.
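
For the dry run on the inactive database, a couple of sizing queries could bound the job before anything is actually purged (same tables as above; nothing here is existing tooling):

-- How many deleted comments are there?
SELECT COUNT(*) FROM talk2 WHERE state = "D";

-- And how many orphaned property rows would actually go away?
SELECT COUNT(*)
  FROM talkprop2, talk2
 WHERE talkprop2.journalid = talk2.journalid
   AND talkprop2.jtalkid   = talk2.jtalkid
   AND talk2.state = "D";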