alierak: (Default)
[personal profile] alierak
I previously noted an apparent memory leak in several workers under Perl 5.14.2. We upgraded the Encode module because it had a known memory leak bug and seemed relevant, and that did actually slow down the memory leak somewhat. However, I now need help brainstorming what could still be causing this. The affected workers seem to be synsuck, send-email, and resolve-extacct, so I've set cron jobs to restart those every so often.

Honestly, what can resolve-extacct be doing that causes it to grow to 9GB in size over the course of a week? We will probably have to resort to profiling memory usage to figure this one out.
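For a rough sense of scale before any real profiling, here is a minimal sketch that turns the 9GB-per-week figure into an hourly growth rate and a restart interval. It assumes roughly linear growth from a near-zero baseline, which is only an assumption (the real growth curve is unknown). The function names and the 2GB cap are hypothetical.

```python
def growth_rate_mb_per_hour(total_gb: float, days: float) -> float:
    """Average growth in MB/hour, assuming linear growth from about zero."""
    return total_gb * 1024 / (days * 24)


def hours_until_cap(cap_gb: float, rate_mb_per_hour: float) -> float:
    """Hours a freshly restarted worker needs to reach cap_gb at this rate."""
    return cap_gb * 1024 / rate_mb_per_hour


rate = growth_rate_mb_per_hour(9, 7)
print(f"{rate:.1f} MB/hour")
print(f"{hours_until_cap(2, rate):.1f} hours to reach 2GB")
```

Under those assumptions a worker grows by about 55MB per hour, so a cron restart every day or so would keep it under a 2GB cap.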
alierak: (Default)
[personal profile] alierak
Anyone remember when we deployed some servers with Perl 5.10.0 and saw all sorts of problems with memory usage due to a memory leak in map? I ended up drastically reducing the number of requests each Apache process would handle before it exits, [staff profile] mark and I set up various cron jobs that would unceremoniously kill off workers every so often, etc. As we upgraded and phased out older servers, I think it was no longer a problem (e.g., web08 had Perl 5.10.1).
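The request-limit workaround can be sketched as follows: a process handles a bounded number of requests and then returns, so a supervisor can replace it with a fresh process and the leaked memory goes back to the operating system. This is the idea behind Apache's MaxRequestsPerChild directive, though the post does not say which setting was changed. The function and its parameters are hypothetical.

```python
from itertools import islice
from typing import Callable, Iterable


def serve_with_limit(requests: Iterable[str],
                     handle: Callable[[str], None],
                     max_requests: int) -> int:
    """Handle at most max_requests items, then return so the process can exit.

    islice stops after max_requests items without consuming any more
    from the underlying iterator, so no request is silently dropped.
    """
    handled = 0
    for request in islice(requests, max_requests):
        handle(request)
        handled += 1
    return handled


seen = []
count = serve_with_limit(iter(["a", "b", "c", "d"]), seen.append, max_requests=3)
print(count, seen)
```

The lower the limit, the less memory a leaking process can accumulate, at the price of more frequent process startups.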

Well, it's looking like there's another Perl memory leak on the newest group of servers. They're running 5.14.2, and things like synsuck and resolve-extacct (really?) are using several gigs of RAM after a few days. I've set up an array of lovely new worker-restarting cron jobs, and I think this is the relevant Perl bug: perl: memory leak in Encode::decode. It is likely that we'll be able to patch this in production.

I guess if you have a choice, just skip to 5.14.4.
mark: A photo of Mark kneeling on top of the Taal Volcano in the Philippines. It was a long hike. (Default)
[staff profile] mark
FYI -- I've tinkered with the DNS for If you see anything untoward on the hacks or something, please let me know!

(This is part of my voyage in the name of performance on far off shores. I'm testing on our hack servers.)
mark: A photo of Mark kneeling on top of the Taal Volcano in the Philippines. It was a long hike. (Default)
[staff profile] mark
Hi all,

This is important. Please note!

If you are going to be touching bin/upgrading/ (and related scripts) -- you must respect the way this script works -- i.e., this script should never make any changes unless the -r flag is provided or it prompts the user running it.

This script is designed to be run by admins before code pushes, and it's supposed to spit out the "this is what I'm going to do" information. That way the admin can run it to see what might happen before it happens. It's not safe to just have it execute SQL without that flag.

This is a safety issue. I just ran it on production and it made some changes (regarding fixing certain edges) without warning me or requiring me to use the run flag. That's scary. :)
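The contract described above can be sketched like this: every planned change is printed, and nothing is executed unless -r is passed explicitly. This is an illustration in Python, not the actual script. The `apply_changes` helper, the `execute` callback, and the sample statement are all hypothetical.

```python
import argparse
from typing import Callable, List, Optional, Sequence


def apply_changes(statements: Sequence[str],
                  execute: Callable[[str], None],
                  argv: Optional[List[str]] = None) -> List[str]:
    """Print every planned statement; execute only when -r was passed.

    Returns the statements that were actually executed.
    """
    parser = argparse.ArgumentParser(description="dry run by default")
    parser.add_argument("-r", "--run", action="store_true",
                        help="actually apply the changes")
    args = parser.parse_args(argv)

    executed = []
    for sql in statements:
        if args.run:
            print(f"running: {sql}")
            execute(sql)
            executed.append(sql)
        else:
            print(f"would run: {sql}")
    return executed


# Hypothetical statement, used only to exercise the two modes.
planned = ["ALTER TABLE example ADD COLUMN note VARCHAR(255)"]
ran = []
apply_changes(planned, ran.append, argv=[])       # dry run: nothing executed
apply_changes(planned, ran.append, argv=["-r"])   # explicit opt-in
```

Keeping the decision in one place (the single `if args.run` branch) makes it harder for a new code path to run SQL without the flag.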

mark: A photo of Mark kneeling on top of the Taal Volcano in the Philippines. It was a long hike. (Default)
[staff profile] mark
Hi all,

I've moved Bugzilla's outgoing email to Mailgun. This should improve reliability of the emails because it now does SPF and DKIM on the outgoing messages. This should also reduce the headache of me trying to manage email in an EC2 environment, which is kind of a pain.

This should be a no-op for most of you, but if you stop getting bugmail or something weird happens, please let me know and I can investigate.
mark: A photo of Mark kneeling on top of the Taal Volcano in the Philippines. It was a long hike. (Default)
[staff profile] mark
Just to bring this top level -- I've had some issues with email from the Bugzilla machine. It is now hosted in EC2, but this is causing a few difficulties because sending email from EC2 turns out to be a real pain. They're on a perpetual blacklist by SpamHaus and presumably others.

Anyway, I've signed up for an SMTP relaying service through Dyn and am forwarding email through that. This should work fine unless we go over the daily quota -- which wasn't a problem yesterday, but I could see it being an issue on a day when someone generates a lot of bugmail.

I think the ultimate solution is for me to stop hosting Bugzilla myself and just have Dreamwidth do it. I'm working on that now, but it's going to take a bit of time. This does mean Bugzilla will take one more downtime to move, but it should end up in a fairly final location.

Sorry for the trouble, folks. You all rock.
mark: A photo of Mark kneeling on top of the Taal Volcano in the Philippines. It was a long hike. (Default)
[staff profile] mark
Hi all, I'm moving Bugzilla and so it'll be down for a few hours. Just FYI.

Denise: your email will be down for ~2 hours too, until DNS propagates. Sorry!

Update: This is done. DNS is rolling through the wild TTLs, everything should be back up and running in the next hour or two. If you see any problems please let me know.
szabgab: (Default)
[personal profile] szabgab

Having seen how complex application deployment is at various corporations, I am doing a little research into how open source applications manage that complexity. I was looking for applications that are complex in terms of deployment and development environment, to see how they handle the task and how they test their systems. Dreamwidth seems like a good case, so I jumped on the IRC channel, where [personal profile] pauamma gave me a quick list of the entities. Then I was volunteered to raise the issue here to get further clarification.

So here is the list I got:

  1. perlbal on the load balancer(s)

  2. webservers running Apache+modperl+DW webapp code

  3. database cluster for global data

  4. several (currently 3) database clusters for per-user data. Each cluster holds the same kind of data for a different set of users

  5. Gearman servers, to offload some web-synchronous compute-intensive tasks

  6. TheSchwartz server(s) for asynchronous or non-web tasks.

  7. MogileFS servers for blobby data not stored in a SQL database for various reasons.

  8. Postfix

I also found the Production Notes to be a good source of information, but I'd appreciate further details if the list is incomplete. I would be very interested to understand the responsibilities of the various entities in the system: what kind of data is kept in the various databases, which entities can be duplicated if load requires it, and how the jobs can be divided.
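To make items 3 and 4 of the list concrete, here is a minimal sketch of one common way to arrange such a split: the global data records which user cluster an account lives on, and every user cluster stores the same shape of data for its own users. This layout is an assumption for illustration, not something the list above confirms. All names and data are hypothetical.

```python
from typing import Dict, List

# Hypothetical global data: which user cluster each account lives on.
GLOBAL_USER_TO_CLUSTER: Dict[str, int] = {"alice": 1, "bob": 2, "carol": 3}

# Hypothetical per-user clusters: same structure, different users.
USER_CLUSTERS: Dict[int, Dict[str, List[str]]] = {
    1: {"alice": ["entry 1"]},
    2: {"bob": ["entry 2", "entry 3"]},
    3: {"carol": []},
}


def entries_for(user: str) -> List[str]:
    """Look up the user's cluster in the global data, then read from it."""
    cluster_id = GLOBAL_USER_TO_CLUSTER[user]
    return USER_CLUSTERS[cluster_id][user]


print(entries_for("bob"))
```

With a layout like this, adding capacity means adding another user cluster and assigning new accounts to it, while the global data stays in one place.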
mark: A photo of Mark kneeling on top of the Taal Volcano in the Philippines. It was a long hike. (Default)
[staff profile] mark
Bugzilla and Mercurial are running on a server that is going to get migrated to a new home sometime in the next few days. There will be a little downtime when this happens. I'll try to keep it to a minimum and will, of course, take proper precautions to ensure data integrity.

Thanks all.
mark: A photo of Mark kneeling on top of the Taal Volcano in the Philippines. It was a long hike. (Default)
[staff profile] mark
FYI, I'm upgrading our Bugzilla install. It might be a little bumpy for the next hour or so while I work on it.

New version will be 3.7.2.

Update: Upgrade complete.
mark: A photo of Mark kneeling on top of the Taal Volcano in the Philippines. It was a long hike. (Default)
[staff profile] mark
Hey -- how many of you have noticed the couple of times sb-web02 (one of our web servers) has died in the past few days? None of you? Oh good! That means things are working like they should.

It turns out one of our web servers has developed a faulty disk. It's been isolated and removed from serving, and the NOC staff have been dispatched to take a look and replace it. You shouldn't notice anything wrong on the site, though!

This is our first disk failure, I'm kind of proud! It's like we're all grown up.

(I realize this isn't really development related...but I thought it was fun.)
mark: A photo of Mark kneeling on top of the Taal Volcano in the Philippines. It was a long hike. (Default)
[staff profile] mark
When doing a global master change...

* don't forget MogileFS
* stop all jobs; anything still running with the old config (TheSchwartz workers!) will continue writing to the old master
* take the old master hard-offline and leave it that way to flush out any badly behaved clients? At the very least, monitor SHOW PROCESSLIST
* don't forget all slaves (mailserver slave)

Todo: figure out if there's a way to make a MySQL server fully read-only (except for replication?). I know there's a read-only config file option, but I don't think it works as expected.
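For the "monitor show processlist" step, here is a minimal sketch that filters processlist rows down to clients that should no longer be connected to the old master. It assumes the rows have already been fetched as dictionaries keyed by the SHOW PROCESSLIST column names. The function, the sample rows, and the account names are hypothetical.

```python
from typing import Dict, Iterable, List

Row = Dict[str, str]


def stray_clients(processlist: List[Row],
                  ignore_users: Iterable[str] = ()) -> List[Row]:
    """Return connections that are not replication threads or ignored users."""
    ignored = set(ignore_users)
    strays = []
    for row in processlist:
        if row["User"] == "system user":
            continue  # replication threads running on this server
        if row["Command"].startswith("Binlog Dump"):
            continue  # a slave reading this server's binary log
        if row["User"] in ignored:
            continue  # e.g. the admin session doing the monitoring
        strays.append(row)
    return strays


# Hypothetical sample rows for illustration.
rows = [
    {"Id": "1", "User": "system user", "Host": "", "Command": "Connect"},
    {"Id": "2", "User": "repl", "Host": "slave:3306", "Command": "Binlog Dump"},
    {"Id": "3", "User": "admin", "Host": "localhost", "Command": "Query"},
    {"Id": "4", "User": "dw", "Host": "jobs:51234", "Command": "Sleep"},
]
for row in stray_clients(rows, ignore_users={"admin"}):
    print(row["Id"], row["User"], row["Host"])
```

The Host column of each stray row points at the machine that is still using the old config.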
mark: A photo of Mark kneeling on top of the Taal Volcano in the Philippines. It was a long hike. (Default)
[staff profile] mark
For the curious, when Thursday rolls around, Open Beta will have...

1x Global Cluster (master/slave)
3x User Clusters (master/master)
20x Web Servers
4x Memcache Servers
6x MogileFS Stores
4x Job Servers
4x Perlbal Servers
+ a few miscellaneous machines

For a monthly cost of ~$8,000. This is most certainly overkill for Open Beta. Probably quite a bit overkill. But we decided we'd rather be safe than offline, and we will adjust the spend as appropriate during the rest of May, to bring things down to a sustainable level.

This brings the total number of servers (virtual servers, but still) that will be handling up to nearly fifty. That's a lot of oomph!
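The "nearly fifty" figure can be checked with a quick tally. This sketch assumes each database cluster is a pair of machines (master/slave or master/master), which the list implies but does not state. The miscellaneous machines are left out because their number is not given.

```python
# (number of units, machines per unit); the pair size is an assumption.
fleet = {
    "global cluster (master/slave)": (1, 2),
    "user clusters (master/master)": (3, 2),
    "web servers": (20, 1),
    "memcache servers": (4, 1),
    "MogileFS stores": (6, 1),
    "job servers": (4, 1),
    "Perlbal servers": (4, 1),
}

total = sum(units * machines for units, machines in fleet.values())
print(f"{total} servers before miscellaneous machines")
```

A few miscellaneous machines on top of that tally lands just under fifty.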
mark: A photo of Mark kneeling on top of the Taal Volcano in the Philippines. It was a long hike. (Default)
[staff profile] mark
For the curious... there's actually a lot more that goes into running a production setup than just running the Dreamwidth code. In particular, here's a list of the software we're using to manage the infrastructure:

* Puppet, a configuration management system. This software is responsible for installing packages, updating configuration files, and basically keeping all of the production machines in sync. Most of the work here has been done by [personal profile] xenacryst.

* Cacti, a performance/graphing system. Cacti is great: you can configure your servers and tell it to start graphing. It's actually fairly intense to set up (it took me a dozen hours or more to get it working for our setup), but once you get it going it's amazing. We have graphs of bandwidth (internal and external), CPU/disk/memory usage, and even non-system things such as Perlbal requests per second, how many items are in each of the memcached instances, and the replication lag in MySQL.

* Nagios, the gold standard in monitoring and alerting. This is the software you will hear me cursing at 3AM because it has found a failure in some part of the infrastructure and started paging me. Oh yes, there will be cursing. Generally, Nagios is a tool that does one thing really well: keep an eye on things, make sure they're up and running, and tell someone if they're not.

These tools are fairly standard in the industry. I've used all of them at previous jobs and have gotten fairly familiar with their ins and outs. As always, the configuration we're using is available in our source repository:

If you are particularly interested in this end of the system, in the esoteric details that go into running a production cluster, let me know. I'm looking for a few people who like this sort of thing and who are wanting to help make sure that our servers are the best they can be. :)
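To illustrate the kind of check Nagios runs, here is a minimal sketch of the decision logic for a MySQL replication lag check. Nagios plugins report status through their exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) plus one line of output. The function name and thresholds are hypothetical, and this is not taken from the configuration mentioned above.

```python
from typing import Optional, Tuple

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_replication_lag(lag_seconds: Optional[int],
                          warn: int = 30,
                          crit: int = 300) -> Tuple[int, str]:
    """Map a replication lag reading to a Nagios status code and message."""
    if lag_seconds is None:
        return UNKNOWN, "UNKNOWN - replication lag not available"
    if lag_seconds >= crit:
        return CRITICAL, f"CRITICAL - replication lag {lag_seconds}s"
    if lag_seconds >= warn:
        return WARNING, f"WARNING - replication lag {lag_seconds}s"
    return OK, f"OK - replication lag {lag_seconds}s"


code, message = check_replication_lag(45)
print(code, message)
```

A real plugin would print the message and exit with the code; fetching the lag from the database is left out of this sketch.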
mark: A photo of Mark kneeling on top of the Taal Volcano in the Philippines. It was a long hike. (Default)
[staff profile] mark
Someone on IRC inquired as to our current setup. Figured I'd describe the server architecture we're currently running, both for the record, and for people to learn / give feedback / etc. As this is [site community profile] dw_dev, I'm going to be fairly technical. But stop me if you have questions.

We have a total of 13 servers. These servers are what are called VPS -- virtual private servers. They're hosted by Slicehost, a neat company that's been doing VPS hosting for a while now. Henceforth you will hear me refer to the servers as slices -- since they are technically slices of bigger physical machines.

Anyway -- here's the breakdown of the slices we have right now.

dfw-admin01 (512MB RAM) ... this box is the administrative box. It runs the puppetmaster for our configuration management system, Cacti, and serves as the distribution point for pushing out code, managing the rest of the cluster, etc.

dfw-lb01 / dfw-lb02 (256MB RAM each) ... these two machines run Perlbal right now. They are also the frontend - the site's IP is hosted on one of these. They're configured with heartbeat for failover, so if one machine dies, the other takes over within a few seconds. (This isn't fully tested/deployed, but I'm working on it.) Static files are served by these machines.

dfw-web01 / dfw-web02 (1GB RAM each) ... as you might expect, these are the webservers. They run Apache, and that's it. All of the web requests (not static files) are served by these two slices.

dfw-jobs01 (1GB RAM) ... this box runs the TheSchwartz workers. It sends email, handles events/subscriptions/notifications, and various other things that go through workers. Oh, and all of the imports are handled by this machine too.

dfw-memc01 / dfw-memc02 (512MB RAM each) ... as you might imagine, these are memcache nodes. They're small right now, but will grow over time.

dfw-mog01 / dfw-mog02 (256MB RAM each) ... MogileFS storage nodes. While we do not yet have this system deployed on production, we will before open beta hits. Right now MogileFS is mostly used for storing and manipulating userpics.

dfw-mail01 (256MB RAM) ... the incoming mailserver. This box just handles incoming mail. It's a separate box for security reasons, and also so we can configure it differently.

dfw-db01 / dfw-db02 (1GB RAM each) ... our databases. We run a pair of them, and they will soon be configured with MySQL replication. I haven't yet decided how to set things up for Open Beta, though -- we'll probably deploy a couple more sets of databases...

Anyway. That's a basic tour of what we have in terms of physical units of separation. There are a lot more components that go into the production cluster as far as what gets installed where and how it works. That's beyond the scope of this post, but eventually it will get documented so other people can set up similar sites.

(PS, and because someone is going to ask: the dfw prefix is for Dallas/Ft. Worth, the data center the servers are located in. Years working at companies with globally distributed data centers has taught me how useful it is to know where the server you are talking to actually is located...)
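The naming scheme and the inventory above can be sketched together: each name splits into a site prefix, a role, and an index, and the RAM figures add up across the 13 slices. The regular expression and helper names are hypothetical. The hostnames and sizes come from the breakdown above.

```python
import re
from collections import Counter
from typing import NamedTuple


class Host(NamedTuple):
    site: str
    role: str
    index: int


HOST_RE = re.compile(r"^([a-z]+)-([a-z]+)(\d+)$")


def parse_host(name: str) -> Host:
    """Split a name like 'dfw-web01' into site, role, and index."""
    match = HOST_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized hostname: {name}")
    return Host(match.group(1), match.group(2), int(match.group(3)))


# RAM per slice in MB, as listed in the breakdown above.
SLICES_MB = {
    "dfw-admin01": 512,
    "dfw-lb01": 256, "dfw-lb02": 256,
    "dfw-web01": 1024, "dfw-web02": 1024,
    "dfw-jobs01": 1024,
    "dfw-memc01": 512, "dfw-memc02": 512,
    "dfw-mog01": 256, "dfw-mog02": 256,
    "dfw-mail01": 256,
    "dfw-db01": 1024, "dfw-db02": 1024,
}

by_role = Counter(parse_host(name).role for name in SLICES_MB)
total_mb = sum(SLICES_MB.values())
print(len(SLICES_MB), "slices,", total_mb, "MB RAM in total")
print(dict(by_role))
```

The whole cluster adds up to a little under 8GB of RAM across the 13 slices.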


dw_dev: The word "develop" using the Swirly D logo.  (Default)
Dreamwidth Open Source Development

Page generated Apr. 24th, 2019 07:02 am
Powered by Dreamwidth Studios