Alphasat Backend Failure (Now Resolved)


#1

Alphasat (25E, Europe - West Asia - Africa) had a backend failure a couple of hours ago.

So far we know that the SSD failed. It's going to have to be replaced. The carousel is still up for now, but it's possible that you will stop receiving packets entirely while we work on replacing the drive. In the meantime, no new files will be added to the carousel.

This is not going to be very quick, unfortunately. My best estimate right now is 3 to 10 hours.


#3

It seems to be up again. Receiving packets without problems. Great!

Just a question, but why does it take so much time to replace a drive in the backend server?
Also, does the server have no backup disk? There should be a RAID, and maybe a hot-spare SSD.

regards, Manuel


#4

In my experience… and most people stick their fingers in their ears and don’t want to hear this… SSDs are crap and they ONLY last about 3 years!! A simple HDD will give a good 5 or even 10 years’ life for a FRACTION of the cost!
There I said it…
I BET you will find that the SSD in question has been in use for about three years… I would be interested to know :wink:
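For what it’s worth, SSD vendors rate write endurance in TBW (terabytes written) rather than years, so lifetime depends heavily on workload. A back-of-the-envelope sketch, with purely illustrative numbers (a 600 TBW rating and roughly 50 GB of writes per day - neither figure comes from this thread):

```shell
# Rated lifetime in years = (TBW rating * 1000 GB/TB) / (GB written per day) / 365
# Illustrative numbers only: 600 TBW drive, ~50 GB/day of writes.
echo $(( 600 * 1000 / 50 / 365 ))   # prints 32 (years of rated writes)
```

By that metric a lightly written SSD can far outlive three years; heavy, constant writes are what wear one out quickly.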

My advice is a simple RAID (not some proprietary rubbish!) using some HDDs, and you will have a very reliable system forever… If it is just a PC and a RAID array is not an option, then stick an HDD in it (not an SSD) and back it up every few hours. You will get two HUGE HDDs, easily, for the price of a moderate-capacity SSD LOL!
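For reference, the kind of plain software mirror suggested above can be sketched with Linux mdadm. The device names and mount point are placeholders, not details of the actual backend:

```shell
# Sketch: software RAID1 mirror over two HDDs, plus a hot spare.
# /dev/sdb, /dev/sdc, /dev/sdd are placeholder device names.
mdadm --create /dev/md0 --level=1 --raid-devices=2 --spare-devices=1 \
      /dev/sdb /dev/sdc /dev/sdd
mkfs.ext4 /dev/md0        # format the array
mount /dev/md0 /mnt/data  # placeholder mount point
cat /proc/mdstat          # watch sync/rebuild progress
```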


#5

I agree for the most part, but SSDs do have their uses.


#6

Speed of access, no fragmentation issues, and less power/heat for portable use, but I doubt that any of those is really an issue in this case :slight_smile:


#7

In this case I can't see a need for an SSD, you're right. Another use is for environments that are prone to vibration. A couple of years ago every retailer in the area was sold out of HDDs and SSDs because the county was repaving some roads and was using a 10-ton block of steel pulled behind a tractor to break up the old concrete. Those impacts could be felt from miles away.


#8

The system is currently working off a replacement drive. The server will be replaced over the coming weeks (which means there will be a short outage again at that point).

The carousel had to be rebuilt - which happens automatically, but it does mean that the files there may be received again if you had received them before the crash. Also, file iteration counts were reset, so some of the older files will go around a bit more frequently than they should over the next day or two.

Why it takes so long: the “backend” is a dedicated, specialized box, not a generic server. It's located at a European teleport. Everything has to be done remotely, including finding a temporary replacement for the drive, finding someone to swap it out, and then restoring the box image remotely.

I do agree about the SSDs though - in my experience their failures are more frequent than spinning disks'. Spinning disks are also better at warning about imminent failure. But there are obvious advantages as well - as @neil already mentioned. No, it's not a PC. Yes, RAID isn't currently a feature on the hardware. That's something we are taking up and want to resolve in the replacement box.
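On the “warning about imminent failure” point: both disk types expose SMART data, which can be polled remotely with smartmontools. A sketch, with a placeholder device name:

```shell
# Sketch: remote SMART health checks with smartmontools.
# /dev/sda is a placeholder device name.
smartctl -H /dev/sda        # overall health self-assessment (PASSED/FAILED)
smartctl -A /dev/sda        # attributes: reallocated sectors, wear level, etc.
smartctl -t short /dev/sda  # start a short offline self-test
```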

I am told that this is the first time ever a drive has failed in such a box. So there's that.


#10

Just assuming that you are using some flavor of Linux on the backend system, I would highly recommend looking into ZFS and giving RAID the skip, since you are not using a full-blown hardware RAID controller.
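In case it helps, a minimal sketch of what that could look like with OpenZFS - assuming a Linux box, and with the pool name and device names as placeholders:

```shell
# Sketch: a ZFS mirror with a hot spare, no hardware RAID controller.
# "tank" and the /dev/sdX names are placeholders.
zpool create tank mirror /dev/sdb /dev/sdc   # two-way mirror
zpool add tank spare /dev/sdd                # optional hot spare
zpool status tank                            # health / resilver state
zpool scrub tank                             # periodic integrity check
```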

I am told that this is the first time ever a drive has failed in such a box. So there's that.

Nothing like getting out ahead of the curve.

-C


#11

For me, as a young IT specialist, a backend that fails because of a single drive is very bad practice.
If you don't have physical access to a box/server that is a single point of failure, you should absolutely have a RAID.

If the speed of an SSD is needed (I guess it isn't important in this case), just use a second SSD as a backup, or even an HDD. A slower server is better than one that is offline :slight_smile:

A second option would be to use one SSD for the operating system and a second one to store and access the data. If an SSD is not written much, it will last a very, very long time. If you then add a third SSD as a hot spare, it could take over while the failed SSD is swapped out.

That the swap is time-consuming is easily understandable in this case. Just two questions:
How much storage capacity is needed for the backend?
Is there a strong limitation in the size of the Box/Server?

regards, Manuel