
A few words about outages, servers, and duct tape


Twilight Sparkle ✨


We've had an unfortunate string of outages here at awkward times in recent memory. You trust us to provide a place to pony it up 24/7 and we've been failing to deliver on that. Sorry. :(

[Image: Turning it off and on again]

I'll get to a technical explanation of why these outages happened in a moment. The tl;dr: we tried a little too hard to keep our hosting costs down and got what we paid for when we duct-taped four boxes together and called it infrastructure.
 
So what are we doing about it? This week, we got two fancy new servers - Mount Everhoof and Sweetie Shores (that's Pixel Wavelength's hometown, for those of you following our mascots' canon) - that won't suffer from the issue that kept bringing down our current setup. Work is already underway to move everything over to them, which involves about 3 TB of data and over 20 virtual servers.
 
We're doing what we can to minimize downtime during this process but some maintenance windows will be unavoidable. Following @Poniverse on Twitter is the best way to keep up with these.
 
The downside to this, and the reason it was put off as a last resort, is that this move significantly increases our hosting costs. Along with Ponyville Live!'s bills, paying for more reliability means there's less to spend on other cool stuff like giveaways, contests, advertising, and convention presences. We're also interested in exploring redundant infrastructure across more servers now that we have a proper LAN, but that's not financially feasible right now.

The upgraded subscriber perks came at a good time, as subscriptions are a reliable, recurring revenue stream we can count on every month when budgeting. Because it recurs, even a modest subscription goes a long way when making decisions about how much to invest in the servers that keep Poniverse online. Whether or not perks are your thing, we'd appreciate it if you'd consider subscribing so we can keep things awesome for everyone. B)

 

[Image]

We feel like Buffy here when the servers are on fire.

 
A special thanks to our donors, everyone who has opted into ads here, our commission artists and their customers, and especially our 14 current subscribers - you're making this migration possible, for the benefit of thousands of users.
 
I'll update you all again once everything is settled into its new home. :)
 
 
Technical details, for those who like such things

The heart of our infrastructure is a Proxmox cluster comprising several dedicated servers. A private LAN is necessary both for the clustering functionality to work (which allows us to easily move VPSes between host nodes, among other useful things) and for our VPSes to be able to communicate with each other.
 
Our old host didn't offer private networking between servers so we used a peer-to-peer VPN called tinc to create one. Duct-taping the servers together, if you will.
 
tinc is an incredibly cool piece of software - I'm still a fan of it - but some of our servers were saturating their links during the nightly backups. Remember - the same links served stuff to the public Internet; carried internal traffic for databases, Proxmox, BungeeCord & Minecraft; and carried our backups. One of two outage-causing things sometimes happened when the links were saturated:

  • A set of very important NFS mounts became inaccessible and caused a host node's kernel to hang. This is particularly horrible because when the kernel hangs, the server stops responding to everything except pings, and needs to be power-cycled.
  • The tinc process on one of the host nodes entered a state in which it disconnected from the VPN and became unresponsive. This cascaded into various other things breaking, such as the Proxmox cluster, NFS (as above), and database connections.
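
The first failure mode above is nasty because anything that touches a dead NFS mount can block indefinitely. As a rough illustration only (this is not something the post says we ran), here is a minimal Python sketch of probing a mount from a throwaway child process with a timeout, so the caller itself never blocks. The function name `mount_responds` and the 5-second default are assumptions made for the example.

```python
import subprocess
import sys

# Child process body: a single stat() call on the path given as argv[1].
_PROBE = "import os, sys; os.stat(sys.argv[1])"


def mount_responds(path, timeout_s=5.0):
    """Return True if stat(path) succeeds within timeout_s seconds.

    The stat() runs in a separate process, so if the mount hangs only
    the child blocks, not the caller. (Hypothetical helper.)
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", _PROBE, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        return proc.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Best effort only: a process blocked on a dead NFS server may
        # not exit promptly, so we do not wait for it after the kill.
        proc.kill()
        return False
```

A monitoring script could call `mount_responds()` on each NFS mount point every minute or so and raise an alert on `False`. It would not prevent a kernel hang, but it might give some warning before one.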

Getting rid of backups is obviously not an option, so we made efforts to throttle those so they consume less bandwidth. We still had outages every few days but it was an improvement over having one nearly every day.
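
For the curious, throttling a transfer generally boils down to something like a token bucket. The sketch below is a generic Python illustration of that idea, not the actual mechanism we used on our backups (the post doesn't go into that); `TokenBucket`, `throttled_copy`, and all the numbers are made up for the example.

```python
import time


class TokenBucket:
    """Allow `rate` bytes per second on average, with bursts up to `capacity` bytes."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last call, up to capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def delay_for(self, nbytes):
        """Spend nbytes of budget; return how many seconds to sleep before sending."""
        self._refill()
        self.tokens -= nbytes
        if self.tokens >= 0:
            return 0.0
        # We are in debt: wait long enough for the refill to pay it off.
        return -self.tokens / self.rate


def throttled_copy(src, dst, bucket, chunk_size=64 * 1024, sleep=time.sleep):
    """Copy src to dst in chunks, sleeping as the bucket dictates. Returns bytes copied."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            return total
        wait = bucket.delay_for(len(chunk))
        if wait > 0:
            sleep(wait)
        dst.write(chunk)
        total += len(chunk)
```

With, say, `TokenBucket(rate=10 * 1024 * 1024, capacity=1024 * 1024)`, a copy would average roughly 10 MiB/s after an initial 1 MiB burst. The trade-off is the one described above: the link stays usable, but the backups take longer.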

 

At this point, the nightly backups were taking ~14-16 hours to run and failed whenever the LAN broke. We tried various other tweaks to improve our LAN's stability, and some of them did produce positive results, but... the problem was still far from solved.
 
In the name of making uptime mean something again and preserving what's left of the sysadmin team's sanity, we gave up on tinc and decided this week to replace the peer-to-peer VPN with a LAN that uses actual networking hardware for the task. To that end, we're replacing our four duct-taped nodes with two much fancier (and more powerful) nodes that have separate NICs for private and public networking, with separate bandwidth (and a lot more of it!) dedicated to each.

 

The standard deviation of ping times between the new nodes is four orders of magnitude smaller than over our old VPN. So far, this bodes well for the new servers' reliability, and I won't be surprised if we see some general performance improvements as well.
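
If you want to run the same kind of comparison on your own network, here is a small Python sketch of the arithmetic: take the standard deviation of each set of round-trip times, then the base-10 logarithm of their ratio. The function names and the sample numbers are invented for illustration; they are not our real measurements.

```python
import math
import statistics


def jitter_ratio(old_ms, new_ms):
    """Ratio of the sample standard deviations of two sets of ping times."""
    return statistics.stdev(old_ms) / statistics.stdev(new_ms)


def orders_of_magnitude(ratio):
    """How many powers of ten separate the two values."""
    return math.log10(ratio)


# Made-up round-trip times in milliseconds, purely for illustration.
old_vpn_ms = [20.0, 120.0, 45.0, 300.0, 80.0]
new_lan_ms = [0.20, 0.21, 0.19, 0.20, 0.22]

ratio = jitter_ratio(old_vpn_ms, new_lan_ms)
print(f"jitter ratio: {ratio:.0f}x, about {orders_of_magnitude(ratio):.1f} orders of magnitude")
```

A result of 4 from `orders_of_magnitude()` would mean the old jitter was about 10,000 times the new one, which is the scale of difference described above.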

 


Status update on this so far: MLP Forums, Equestria.tv, Pony.fm, Poniverse.net, and Poniverse's databases are all up and running on the new servers now!

 

What's next:

  • PoniArcade
  • Poniverse's giant NFS server
  • our partners' websites, including Ponyville Live! radio station homepages

Thanks for bearing with us through this process, and a special thanks to our new subscribers as well. :grin2: Uptime can only get better from here!


Yay! New server!

 

Less downtime, more roleplaying.

 

Lesson in all this: Duct tape is only good for taping long sticks onto pointy sticks, not servers.


If funding is really an issue, I don't see much of a point in making ads optional. They don't take up much space on the page, and I have no trouble with loading time on my smartphone. I think that you should check to see how much of a difference it will make for your budget, and if it's significant enough, open it up to discussion and see what the community as a whole thinks.


News:

 

Through calculations and many technical terms, we have arrived at a simple answer to the question everyone asks.

 

The Official answer to: How much better is the new server.

 

is:


I haven't been on much lately, so I haven't noticed the downtime. Interesting info, nonetheless.

 

I was very surprised when you said you had to move 3 terabytes of data to the new servers. :blink: I don't know if that includes the OS and server software, or if it's just the forum data. Either way, that's a lot.

Edited by Grepper

If funding is really an issue, I don't see much of a point in making ads optional. They don't take up much space on the page, and I have no trouble with loading time on my smartphone. I think that you should check to see how much of a difference it will make for your budget, and if it's significant enough, open it up to discussion and see what the community as a whole thinks.

 

Admittedly, this was a while ago, but I found opt-in ads to be more effective than forced ads. It seems counterintuitive but my own theory is that ad blockers are so incredibly prevalent nowadays that some users won't even know that a site has ads unless it's somehow pointed out to them. The option to opt into ads here makes their existence crystal clear and serves as a reminder to disable any ad blockers.

 

 

I haven't been on much lately, so I haven't noticed the downtime. Interesting info, nonetheless.

 

I was very surprised when you said you had to move 3 terabytes of data to the new servers. :blink: I don't know if that includes the OS and server software, or if it's just the forum data. Either way, that's a lot.

 

Some of the bigger chunks of data in that package:

  • every image and attachment ever posted on MLP Forums
  • Pony.fm's library of over 20,000 tracks, including the entire MLP Music Archive
  • all of PoniArcade's game servers, including >100 GB of Minecraft data
  • various archives and backups of everything

All this poni adds up over time. :)

 

@Feld0, out of curiosity, are your servers Linux/Unix based?

 

~ Miles 

 

Linux. Most of our servers run Ubuntu but we have some CentOS and Debian boxes as well.


  • 3 months later...
  • 5 months later...

Yay! New server!

 

Less downtime, more roleplaying.

 

Lesson in all this: Duct tape is only good for taping long sticks onto pointy sticks, not servers.

 

Why not roleplay on Skype? I used to do that with a friend of mine before he wanted to take a break.

 

this is what the server looks like now:

 

[Image: duct tape]

 

Oh boy, that is such a build... it's a darn shame about the old servers, and yay for the new ones. Let's just hope they won't break down as often as the old servers did.

edit: I derped, I saw the date of when this was posted AFTER posting this and now I feel like an idiot for posting it 11 months late ... xDDDDDDDDDDDDDD

Edited by Vera Veil

  • 2 weeks later...

Reminds me of how messed up my file server is, but it hasn't caught fire yet, so that's fantastic. And to your IT staff: I feel your pain.

 

Which picture posted? I'm getting a cheap dedicated server ($10/mo core2duo), but for now, I've got a VPS.
