
Resiliency Update


RadioRob
This topic is 757 days old and is no longer open for new replies.  Replies are automatically disabled after two years of inactivity.  Please create a new topic instead of posting here.  


We're coming close to a year since the loss of our beloved Daddy (@Guy Fawkes).  His efforts in running this place that many of us call home were nothing short of heroic.  He was the "one man band" who not only kept the lights on, but also helped keep us in line, managed the review site, and played mentor to many of us.  I know from firsthand experience that it must not have been easy and that it was truly a labor of love for him.  

Part of what made us love him so much, however, was ultimately a challenge that almost cost this community its survival.  Because he took on almost everything himself, his death nearly meant that everything he loved so much was lost.  When I agreed to help take over the technical operations of the site, improving the resiliency of this community was one of my top goals.  

I wanted to take a moment to share a brief update on where we stand with regard to ensuring the site goes on no matter what happens to anyone here.

Fixing the People Problem

Having a single person with the "keys to the kingdom" creates a single point of failure scenario.  In order to reduce the likelihood of this happening, the following changes have been made.  

  • We now have multiple full site admins.  Today, that is @Cooper and me.  Eventually I would like to add one more person so that, should something happen to any of us, covering for that person's loss would not take a significant amount of effort.  
  • By now you have most likely seen that we've grown our super moderator and moderator ranks quite a bit over the last year.  This was done intentionally for several reasons:
    • It provides additional moderator coverage around the clock for the site.  We now have people on the west coast, in the central US, and on the east coast, and even someone based outside of the US.  This gives us better coverage to handle spammers or other important issues that might come up while part of the world is sleeping.
    • It provides additional thoughts and viewpoints as we debate our values and our guiding principles.  
    • It reduces the workload of handling reported content for each of us, while also ensuring that if someone needs a break we don't end up in a "short-handed" situation where the loss of one person leaves the rest of us overburdened.  

Continuity of Operations

  • In the event something happens to me, I have documented all of the critical site components and usernames/passwords.  This information has been shared with @Cooper, who is storing it offline.  
  • In the event the site is unable to find someone who has system administration experience, I'm developing a plan under which the site could be transferred to Invision (the people who make the IPB software) and hosted as a managed service.  This is more expensive than the setup we have today, but it removes the hosting requirement as well as the need to find someone who knows how to manage servers, etc.  I'm still working on this documentation, as the original version is outdated from when the site was moved to the cloud.  

Fixing the Technology Problem

This past year has seen a number of scenarios where we've had "teething issues".  These have ranged from times when the server could not handle the traffic to the site, to bugs that introduced unintended consequences, to "oops" moments where someone (*cough* me *cough*) fat-fingered something that broke a part of the site.  

  • To provide additional scalability, the site is now hosted in Amazon EC2 (aka "the cloud").  When the site was first moved off of Daddy's server, I placed it on a VPS, which worked but did not have as many resources as we needed once the site began to grow.  As a result, I moved the site to EC2 and none of you even realized it!  :)  
  • The workload has been split across multiple servers so that we can scale as needed.  (We have separate web, database, and search tiers that can each be grown as needed.)
  • As we grow, I can add load balancing and split the work among multiple web servers, database servers, etc.  
  • Storage of files/uploads/attachments has been rearchitected so that we don't have to worry about running out of space.  Anything uploaded to the site (gallery images, file attachments, etc.) is stored independently from the site files themselves and won't affect the site's operation.
  • I have set up a development area that I can use to test/stage major changes, to reduce the risk of problems caused by updates and other feature changes.  

Improving Security / Protecting Data

  • I have implemented TLS (formerly known as SSL) to encrypt the traffic between you and the site.  Not only do we use TLS, but it's configured to the same standards used by banks so that no one can sniff the traffic between you and us.
  • In order to access ANY of the site's servers, you must first connect to a VPN.  There are no remote "back door" access points, so that no one can take over the server or do something catastrophic to the site.  
  • I have implemented two-factor authentication (2FA) for anything not directly behind our VPN, such as the IPB admin area.  Even if someone guesses a password, they would also need a one-time password that changes every 30 seconds.  This reduces the chance of someone taking control of the site through a brute-force attack.
  • I have set up multiple daily backups of all critical infrastructure components.  Specifically, I have set up cloud-based snapshots that back up the site daily and allow for quick restoration should something happen.  And on the off chance Amazon entirely disappears from the face of the earth, I have a separate daily backup that stores files off-site in a different location.  That backup would take much longer to restore, but it would only be needed if we lost our primary site AND our primary backups.
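
For anyone curious how a one-time password that changes every 30 seconds can work: codes like this are commonly generated with the time-based one-time password scheme (TOTP, RFC 6238).  I'm assuming that scheme here purely for illustration; the sketch below is not the site's actual 2FA code, and the function name and the sample secret are made up for the example.

```python
import hashlib
import hmac
import struct
import time
from typing import Optional


def totp(secret: bytes, for_time: Optional[float] = None,
         step: int = 30, digits: int = 6) -> str:
    """Return a TOTP code (RFC 6238, HMAC-SHA1) for the given Unix time.

    Illustrative sketch only: `secret` stands in for the shared key that
    an authenticator app and the server would both hold.
    """
    if for_time is None:
        for_time = time.time()
    # Number of whole 30-second windows since the Unix epoch.
    counter = int(for_time // step)
    # HMAC the counter (as an 8-byte big-endian integer) with the secret.
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation: the low 4 bits of the last byte pick a 4-byte slice.
    offset = digest[-1] & 0x0F
    value = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    # Keep the last `digits` decimal digits, left-padded with zeros.
    return str(value % 10 ** digits).zfill(digits)


if __name__ == "__main__":
    demo_secret = b"12345678901234567890"  # RFC test secret, not a real key
    print(totp(demo_secret))
```

Because the code depends on both the shared secret and the current 30-second window, a guessed password alone is not enough to log in, and a code that has been seen stops being useful once its window passes.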

Financial Continuity

  • This site has primarily been funded through member donations.  I sincerely appreciate EVERYONE'S contributions to this site.  We've had folks who have given their time, others who have given financial support, and even some who have done both.  Each of these has been critical to being able to manage and grow the site.  (THANK YOU as well to each of you for your gifts!)
  • In the coming weeks, I'm going to be adding a feature that will allow someone to set up a recurring donation.  This is not meant to be a way to "push people to give", but a convenience for those who would like to give without needing to remember to do it.  
  • I'm looking at ways to try and monetize the site.  If you've accessed the site while not logged in, you might have noticed ads embedded in the site.  These are shown only to guests and disappear once you log in.  While the ad revenue does not provide all of the funding the site needs, it does provide a nice supplement without impacting our logged-in members.  
    • I'm also looking at potentially adding features or data to the upcoming review site that might be available as part of some sort of "premium" tier.  I don't have details on how or even if this would be sustainable, but it's something that I'm considering.  