A maintenance page that works when sites are down like this morning
complete
Scott
We all work so hard to get customers to visit us online and in store. So it does not inspire confidence in your business when your website is down for 20 minutes and it's either a blank screen or a 500 error message.
PLEASE can we have a reliable maintenance page in place when Citrus Lime are doing necessary updates like this morning? I cannot see why this isn't already in place, and if there is a technical issue then really it should be overcome. We had a customer panicking as we have his servicing in, our site was down, and we are closed Mondays so not answering the phone. Bad luck I know, but we work so hard to get customers into our sites, and spend ££££ too; a maintenance page at least reassures potential and existing customers that it's a temporary thing and all will be well.
I cannot think of a multi-client platform that doesn't have this base covered (e.g. Shopify), so hopefully this can be done.
It also gives a professional look to our businesses, and to our developers too!
Thank you
Neil McQuillan
Scott,
Good to hear from you as always. I agree with your sentiment above.
We're in the middle of three significant projects here on the hosting side.
Firstly, we've recognised the significant growth our customers are seeing. July was a record month for revenue through POS, with each of the five vertical markets we serve (for example, cycle and outdoor) setting new records for year-on-year growth.
We've been working with HP Enterprise to standardise our private cloud around HPE SAN (storage) and their AMD Epyc servers. We did look at public cloud for everything, as we use Azure pretty extensively for microservice hosting, but it would have meant major price increases for our customers. This is worth a read: https://world.hey.com/dhh/why-we-re-leaving-the-cloud-654b47e0.
We started this process last year, but extremely long lead times meant the kit only arrived in late May. It's now installed, every server in our setup has been replaced, and all customers are migrated to the new setup. Our building is now full of Lenovo and Dell servers of varying ages :-) Literally a couple of hundred thousand pounds of now obsolete gear, but such is the IT game :-)
Why have we done this when it meant investing over a quarter of a million pounds, when we could have continued for a few years just adding servers here and there?
- To provide easier scale up;
- Full fault tolerance on every layer of our setup;
- We recognise that ecommerce reliability needs to improve; whilst our overall stats are good in industry terms, each problem causes frustration for customers and also stress and workload for my own team;
- Clear rules on when to buy new equipment based on resource usage (however, see below).
Now all the kit is installed, it's clear we are using more compute than our modelling has shown in the past, so yesterday I signed off an order for a further £60,000 of RAM and AMD Epyc servers. This will add another 4TB of RAM and 256 Epyc cores to the cluster (four new servers, two CPUs per server and 32 cores per CPU). The RAM should arrive within days (we had an initial batch, but HPE and our supplier Insight provided the wrong specification, which is indirectly what caused your problems above). We've ordered more in from a supplier in the USA, a benefit of Citrus-Lime Inc trading out of Delaware for our UK customers.
In addition to this, we've completely rewritten our image resizing technology and built our own custom load balancer around Microsoft's YARP project https://github.com/microsoft/reverse-proxy/tree/release/2.0. This allows us, for our Premium customers, to load balance automatically across multiple servers, with the load balancer registering new backend servers with no human intervention. We've also moved image and customer-specific content outside of the main web servers, which will greatly improve reliability, particularly around software upgrades.
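To give a flavour of what that looks like, here's a heavily simplified sketch of defining a cluster in code with YARP's in-memory configuration (the route, cluster IDs and addresses below are illustrative placeholders, not our production setup):

```csharp
// Illustrative sketch only: one route proxied to a cluster of two backend web
// servers, defined via YARP's in-memory config provider so destinations can be
// added or removed programmatically rather than by editing config files.
using Yarp.ReverseProxy.Configuration;

var builder = WebApplication.CreateBuilder(args);

var routes = new[]
{
    new RouteConfig
    {
        RouteId = "storefront",
        ClusterId = "storefront-cluster",
        Match = new RouteMatch { Path = "{**catch-all}" }
    }
};

var clusters = new[]
{
    new ClusterConfig
    {
        ClusterId = "storefront-cluster",
        LoadBalancingPolicy = "RoundRobin",
        Destinations = new Dictionary<string, DestinationConfig>
        {
            ["web1"] = new DestinationConfig { Address = "https://10.0.0.11/" },
            ["web2"] = new DestinationConfig { Address = "https://10.0.0.12/" }
        }
    }
};

builder.Services.AddReverseProxy().LoadFromMemory(routes, clusters);

var app = builder.Build();
app.MapReverseProxy();

// When provisioning brings a new backend online, automation can resolve the
// InMemoryConfigProvider and call Update() with a revised cluster list --
// that's the "no human intervention" part.
app.Run();
```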
A lot of our issues are historic one-off server setup customisations for customers, which then trip us and our customers up. We're getting these down to customisations managed by Google Tag Manager. This setup also automatically manages SSL registration and offloading, another frequent cause of problems.
This setup is in, working, and running on another major investment: Docker and a container cluster. Currently it just serves www.workingclassheroes.co.uk, but by the end of the year all customers will enjoy this setup. Here is our internal management interface showing the running instances.
As the solution is our own source code, we can add custom error pages for when the backend servers are offline; I'll raise an issue to address that.
So, a lot going on, but it's very much to improve the reliability and scalability of our ecommerce platform without losing our legendary agility in terms of delivering features and upgrades.
This post was marked as complete
Neil McQuillan
The maintenance page update is due to go live on the 20th of December.
Neil McQuillan
in progress
Neil McQuillan
under review
Neil McQuillan
Following our upgrades to load balancing, we can serve a maintenance page when all the destination servers are offline. I've had a play around with this over the weekend, and we can add custom middleware code to the load balancer to do it.
See my example below.
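Here's a simplified version of the idea (not the exact code we'll ship; the HTML, status code and retry hint below are placeholders):

```csharp
// Sketch: YARP proxy pipeline middleware that short-circuits to a maintenance
// page when the cluster has no available backend destinations, instead of
// letting the request fail with a blank page or a 500/502.
using Yarp.ReverseProxy.Model;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddReverseProxy()
    .LoadFromConfig(builder.Configuration.GetSection("ReverseProxy"));

var app = builder.Build();

app.MapReverseProxy(proxyPipeline =>
{
    proxyPipeline.Use(async (context, next) =>
    {
        var proxyFeature = context.Features.Get<IReverseProxyFeature>();

        // No healthy destinations left for this cluster: serve the maintenance page.
        if (proxyFeature is null || proxyFeature.AvailableDestinations.Count == 0)
        {
            context.Response.StatusCode = StatusCodes.Status503ServiceUnavailable;
            context.Response.Headers["Retry-After"] = "120";
            context.Response.ContentType = "text/html";
            await context.Response.WriteAsync(
                "<html><body><h1>Back shortly</h1>" +
                "<p>We're carrying out planned maintenance. Please try again in a few minutes.</p>" +
                "</body></html>");
            return;
        }

        await next();
    });

    // Standard YARP pipeline stages run after our check.
    proxyPipeline.UseSessionAffinity();
    proxyPipeline.UseLoadBalancing();
});

app.Run();
```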
Neil McQuillan
I've spent some more time looking at this, and I now have code ready for the QA team to test.
Scott
Neil McQuillan: great!
Neil McQuillan
We ordered the RAM from three sources and one has delivered today. There is another 1.25TB being installed tomorrow.
Scott Zesty
Neil McQuillan: Thanks for the detailed response, all good to hear as a client, and good to hear that we agree that the end customer needs to see something tangible and that it will be raised, as the end customer isn't privy to any of this extensive background work :)
Neil McQuillan
Scott Zesty: Just to update further, the RAM from the US is due to drop on Tuesday and will likely be installed the next day (depending on the arrival time). As it's coming in from the US there might be some variability here, but we shall see.
Servers are due in four to six weeks. As we'd not factored that kind of lead time into any of our scale-out plans, we're repurposing a very high-spec Dell 640 we have spare to act as overflow outside the cluster for non-essential services; we can put that as a standalone cluster on the SAN and pop non-critical virtual machines on it if we're short of capacity after a server order.
We have also found and resolved a cause of downtime: we've noted that if the content server is rebooted at exactly the same time as the web server application restarts, this prevents the site coming up until we manually intervene. We've scheduled our maintenance windows so this does not occur. Annoying that IIS works in this manner; why it does not retry I do not know!