Last month, we launched a brand new site at our institution. This site was built by an outside contractor in conjunction with our team. The image above shows our pre-launch, launch, and post-launch plans.
As it goes, no launch is perfect. The launch was successful in that, by the next morning, we had the new site up for all the world to see. However, our plan was to begin the launch at 8PM and finish by 10PM. My team and I didn’t log off from our workstations until 4AM. We also had to work with our hosting provider to get emergency upsizing for our two dedicated webservers.
Accordingly, there were a few things we could have done better and quite a few things to take away from the experience.
What We Could Have Done Better
Environment Space Differences
Within an hour of starting the launch, we ran into our first issue. Content managers had been working in our Staging environment for the last few months, getting the site ready for launch. One step of the launch was to migrate all files from the Staging environment to the Production environment. We opted to simply tar up all the files on staging, scp the entire tarball to production, and untar the file there.
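The tar-and-ship flow looked roughly like this. The paths and the scp target are hypothetical stand-ins, and the fixture lines exist only so the sketch runs end-to-end locally:

```shell
# Sketch of the Stage-to-Prod file migration (hypothetical paths/hosts).
STAGE_FILES=/tmp/stage_files        # stand-in for the Stage files directory
TARBALL=/tmp/launch-files.tar.gz

# Demo fixture so this sketch is runnable end-to-end.
mkdir -p "$STAGE_FILES/sites/default/files"
echo "hello" > "$STAGE_FILES/sites/default/files/example.txt"

# 1. Tar up all files on Stage. This is where we hit the space limit:
#    the files and the tarball must coexist on the same volume.
tar -czf "$TARBALL" -C "$STAGE_FILES" .

# 2. Ship the tarball to Prod (commented out; requires real hosts):
# scp "$TARBALL" deploy@prod.example.edu:/tmp/

# 3. Extract on Prod (simulated locally here).
PROD_FILES=/tmp/prod_files
mkdir -p "$PROD_FILES"
tar -xzf "$TARBALL" -C "$PROD_FILES"
```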
All of our dev and stage environments on that box share a Gluster volume. While making the tar, we quickly learned that effectively doubling the space usage (by keeping all files plus a tar of all files) would quickly put us over our space limit. Despite all of our brainstorming and prep, no one on the team had considered file space! Our production boxen have a much higher space limit, so holding and extracting the tar there would be no problem. We just needed to make it first on Stage.
Our solution: We simply paused the tar and manually cleared out the files directories for the old version of our sites on the Stage server. Once we finished the launch, we could, if so desired, delete the tar and pull back down the files from Prod to Stage.
Lesson Learned: In the future, we should be wary of space limitations and perhaps even test the creation of the “launch files tarball”. Another option would be to use rsync to migrate the files, avoiding any moment in time when both the original files and the tar’d copy exist on the same server.
Cache Warming
Our biggest hurdle was the lack of a full cache at launch. Within ten minutes of turning off maintenance mode, our production box fell over. When our production boxen truly go down, we are at the mercy of our host to restart them.
This was because our site effectively relies on caching to perform. At our host, we accept lower origin server performance to save on cost, knowing that Varnish and our load balancers will bear the brunt of requests.
Of course, without said Varnish cache properly warmed, our production boxes are at high risk of going down under any significant traffic. This ties into Launch Timelines & Publicity, which I will speak more to in the next section.
Our solution: We overcame this hurdle by contacting our host and asking them to temporarily upsize our production boxes so we could withstand the higher-than-normal number of requests to origin and give Varnish a chance to warm up.
Lesson Learned: In the future, it would be wise to keep maintenance mode (or something akin to it, like a CDN) on and allow only a specific IP or IP Range. Then write and use a script to perform a slow-crawl of the site (or, at least, the high traffic pages). This would allow Varnish to warm up before opening it up to the world.
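A minimal warm-up crawl could be as simple as looping curl over a URL list. The `file://` fixture below is a stand-in for real site URLs, which would normally come from a sitemap or a list of high-traffic pages:

```shell
# Hypothetical slow-crawl to warm Varnish before opening the site to the world.
URLFILE=/tmp/warm_urls.txt

# Demo fixture: a local page reachable via file:// so this sketch runs offline.
mkdir -p /tmp/warm_site
echo "<html>home</html>" > /tmp/warm_site/index.html
echo "file:///tmp/warm_site/index.html" > "$URLFILE"

while read -r url; do
  # -s silences progress, -o /dev/null discards the body; we only care
  # that the request reaches Varnish and populates the cache.
  curl -s -o /dev/null "$url" && echo "warmed: $url"
  sleep 0.1   # throttle so origin isn't hammered while the cache is cold
done < "$URLFILE" > /tmp/warm_log.txt
```

With maintenance mode restricted to the crawler’s IP, this loop could run to completion before flipping the site public.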
Lesson Learned: Temporary, emergency (read: possible to implement in minutes or hours, not days) upsizing and/or horizontal scaling is a must when your site truly relies on Varnish to stay up.
Launch Timelines & Publicity
Many people were aware of the details of our launch plan. This led to (1) people spamming F5 at 10PM, which was only detrimental to our efforts to get the site up without a warmed Varnish cache, and (2) people asking questions about progress and reporting “the site is down! the sky is falling!” while we were obviously 100% aware of and working on the issues.
Lesson Learned: Inform only those who need to know of all the minute details of your launch plan. Most people would do fine knowing just that “The site will be live on Tuesday morning”, instead of “The site will be live by 10:01PM Monday night.”
Lesson Learned: It may be best to launch during “deep” off hours (i.e. 12am-5am). Doing so would result in fewer people spamming F5 at the tail end of the projected launch window and, likewise, would result in less exposure to the site’s downtime, meaning fewer frantic stakeholders and fewer duplicate issue reports from constituents.
The Cost of A Bootstrap
In the past, we served a proxies.pac file to all machines that might connect to our network. We either configured images that were pre-loaded onto our users’ laptops to request this file or gave them a tutorial to configure it on their own. We are seeing remnants of those days in that there are many systems out there that still contain this legacy configuration. In fact, legacy requests for this file outnumber all other requests to our webserver by two orders of magnitude.
In our previous version of the site, requests for this file would never be seen by Drupal because we would redirect the request away as soon as we knew it wasn’t for a Drupal site. In our new iteration, _everything_ bootstraps Drupal as long as it gets past .htaccess.
This meant reams of requests for proxies.pac (and its companions proxies-new.cgi and proxies-test.cgi) were bootstrapping Drupal!
Immediately after launch and before our production boxen fell over, I noticed that proxies.pac requests were spamming our logs. We had talked about implementing Fast 404 and deemed it unnecessary pre-launch. Hindsight is 20/20, and it turns out that this massive number of requests was one of the largest factors that sank our boxen.
Our Solution: We added a redirect in the top-level .htaccess to intercept these requests before bootstrapping Drupal.
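Conceptually, the intercepting rule was something like the sketch below (not our exact rule; the `[R=404]` flag requires Apache 2.4, and a redirect to the file’s real home, or `[G]` for 410 Gone, would work equally well):

```apache
# Top-level .htaccess, above Drupal's own rewrite rules.
# Answer legacy proxy-autoconfig requests directly so they
# never bootstrap Drupal.
RewriteEngine On
RewriteRule ^(proxies\.pac|proxies-new\.cgi|proxies-test\.cgi)$ - [R=404,L]
```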
Lesson Learned: Speak up! If I had called more attention to the massive number of requests as a potential issue, we might have been able to get it fixed and, thus, log off a lot earlier.
Lesson Learned: It wouldn’t have hurt anything at all to install and enable Fast 404 pre-launch. Looking back, I can’t think of any good reason other than laziness for why we didn’t have it enabled at launch.
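For reference, later Drupal 7 releases ship Fast 404 settings, commented out, in default.settings.php. The fragment below is a simplified sketch of those stock settings, with `.pac` and `.cgi` added as an assumption for our legacy proxy files (the stock extension list and regexes differ):

```php
// settings.php (Drupal 7). Simplified sketch of the stock Fast 404 settings;
// .pac and .cgi are added here for the legacy proxy-autoconfig requests.
$conf['404_fast_paths_exclude'] = '/\/(?:styles)\//';
$conf['404_fast_paths'] = '/\.(?:txt|png|gif|jpe?g|css|js|ico|pac|cgi)$/i';
$conf['404_fast_html'] = '<html><head><title>404 Not Found</title></head><body>Not found.</body></html>';
// Returns the lightweight 404 before a full Drupal bootstrap:
drupal_fast_404();
```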
What We Did Well
Not all was doom and gloom, though. In the grand scheme of things, this was a huge endeavor! With that in mind, it was one of the smoothest launches I’ve had the joy of working on and living through.
Be on Acquia
Being on Acquia instead of hosting within our own data center ensured a level of environment parity that we’ve never had before. We could be relatively certain that if it worked on Stage, it would work on Prod — save for elements that explicitly relied on being in Prod to test.
Rollback Plan
We spent some time developing a step-by-step rollback plan and decided together at what point in the night we would resort to reverting (it was 5AM, by the way; we just made it). This gave us a level of comfort in knowing that if everything got truly FUBAR, we could always roll back. Sure, at great cost, but at least we’d still have a site in the morning no matter how you sliced it.
Version Control
I really shouldn’t need to mention this, but we’ve only had our entire web codebase in version control since mid-2012. Our last site launch was done without any version control.
Our team has come a long way since then and I’ll once again preach the importance of version control. It allowed us to iteratively review the work of the vendor and quickly and easily deploy the new version of the site. It also provided a simple rollback path, if that was needed.
Code and Content Freeze
We were able to work with all our content managers and stakeholders to agree to a code and content freeze a good five hours before the start of the launch process. This meant ample time to wrap up the pre-launch tasks without having to keep track of updating databases during go-time.
Monitoring
We use New Relic to monitor our boxen. This helped us see what was actually bringing down our servers, instead of grepping through and squinting at log files.
Failure Mode Brainstorming
We spent nearly two hours just brainstorming all the possible things that could have gone wrong. We also brought up all the things that went wrong during previous launches. If we didn’t already have a measure in place to deal with a potential issue that came up, we made it a point to implement one before launch day.