Server crash. UPDATE: New server ordered

Wodan · 19 Jul 2017, 20:16

webwit wrote: I am not, it's just how the hoster set it up.

Maybe get in touch with their support. Since we're on the default config maybe they can do the mdadm magic as well when replacing that hdd!
My experience with Hetzner support has been very good so far.

webwit · 19 Jul 2017, 20:41

Ok, I sent them a support request explaining the issues.

webwit · 19 Jul 2017, 20:48

OK, they are quick. What time is convenient?

Dear Client,

We would like to check the hard drives briefly from our side. Please tell us when we may turn off the server for approx 30-45 minutes in order to perform the test.

Kind regards

xxxxxx

webwit · 19 Jul 2017, 21:00

I told them you are all just silly people so any time is convenient, better sooner than later, but not at night, so I can check after reboot if everything is running well, and they should let me know one hour in advance, so I can announce the downtime.

Wodan · 19 Jul 2017, 21:06

Thanks for taking care of this. Really hoping Hetzner doesn't drop the ball!

webwit · 19 Jul 2017, 22:11

The server and thus deskthority will be down from 22:25 UTC July 19th (00:25 CEST July 20th, 18:25 EDT July 19th, 15:25 PDT July 19th) for an estimated 30 to 45 minutes, for a health check of our hard drives. This is 2 hours and 15 minutes from now. See you on the other side of the event horizon!

P.S. I just completed another off-site backup, just in case.

webwit · 19 Jul 2017, 23:22

matt3o wrote: and that's where raid on just 2 drives is a little pointless since not always the machine can tell which data is actually bad and which one is good (50-50).

I am just talking out of my ass here, since the last time I got deep into drive technology was Amiga floppy disks. If I remember correctly, such a disk was divided into a bunch of tracks, which were divided into a bunch of sectors. Each sector had a checksum. So when you read data from the sector and compared it with the checksum, you knew if the data was healthy or corrupt. I presume technology hasn't deteriorated and modern HDDs and SSDs also checksum or otherwise validate parts, so in a RAID 1 setup you know which disk has the right data and which has the broken data?
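A toy sketch of the idea in this post: if a known-good checksum is stored next to the data, it can tell a good mirror copy from a corrupted one. This only illustrates the principle and is not how mdadm or drive firmware is implemented; `pick_good_copy`, the file names and the sample data are all made up for the example.

```shell
#!/bin/sh
# Toy illustration only. pick_good_copy is a hypothetical helper,
# not part of mdadm or smartctl.
# Usage: pick_good_copy EXPECTED_CRC FILE...
pick_good_copy() {
  expected=$1
  shift
  for copy in "$@"; do
    # cksum prints "<crc> <byte count>"; keep only the CRC
    actual=$(cksum < "$copy" | awk '{ print $1 }')
    if [ "$actual" = "$expected" ]; then
      printf '%s\n' "$copy"
      return 0
    fi
  done
  return 1   # no copy matches the stored checksum
}

# Two "mirrors" of one sector, one of them silently corrupted.
tmp=$(mktemp -d)
printf 'deskthority sector data' > "$tmp/sda.bin"
printf 'deskthority sect0r data' > "$tmp/sdb.bin"
stored_crc=$(printf 'deskthority sector data' | cksum | awk '{ print $1 }')

# Prints the path ending in sda.bin, the copy that matches the checksum.
pick_good_copy "$stored_crc" "$tmp/sdb.bin" "$tmp/sda.bin"
rm -rf "$tmp"
```

Without a stored checksum (or an explicit read error from one drive), a two-way mirror with differing copies has nothing to break the tie, which is the 50-50 case matt3o describes.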

Wodan · 20 Jul 2017, 07:16

Any update?

And yeah, RAID1 would be pretty pointless if the bad drive couldn't be told apart from the good drive

webwit · 20 Jul 2017, 07:43

Yeah, they took 2 hours 15 minutes instead of 30-45 minutes, and then told me this:

Dear Client

Both hard drives are fine. We have started your server back into your installed system. But note there is currently a rebuild of one device running.

Kind regards
xxxxxxx

However, I just checked smartctl -a again, and the numbers seem significantly worse than yesterday.

Code:

root@server [~]# while true; do smartctl -a /dev/sdb |grep Raw_Read_Error_Rate; sleep 300; done
  1 Raw_Read_Error_Rate     0x000f   070   063   044    Pre-fail  Always       -       12163138

Wodan · 20 Jul 2017, 07:53

Aw rats. Did they take note of the SMART readouts?

Maybe we should get our shit together and move to a new server. I've really learned to appreciate AWS lately. Depending on the performance we need it might even be cheaper than a small Hetzner root EX server.

webwit · 20 Jul 2017, 08:16

I did send them yesterday's readouts. It's probably best to keep it simple right now and just hop to another Hetzner server. Not the right time for a bigger move.

matt3o · 20 Jul 2017, 08:26

Unfortunately it seems that Hetzner only checks smartctl self-test errors and not the individual values. Yesterday the Raw_Read_Error_Rate value was 78, today it's 70 already. Basically you have to wait until the HDD fails; at that point they will change it in a few minutes. At this rate, about 10 points per day, we should have 4-5 days of autonomy.

webwit wrote: I am just talking out of my ass here, since the last time I got deep into drive technology was Amiga floppy disks. If I remember correctly, such a disk was divided into a bunch of tracks, which were divided into a bunch of sectors. Each sector had a checksum. So when you read data from the sector and compared it with the checksum, you knew if the data was healthy or corrupt. I presume technology hasn't deteriorated and modern HDDs and SSDs also checksum or otherwise validate parts, so in a RAID 1 setup you know which disk has the right data and which has the broken data?

RAID is not a backup system, it's just a way to have some redundancy (or a nice way to be able to add disk space to an array).

Without a RAID, after yesterday's failure we would probably have a dead server. So hurrah for us! But if it worked the way you are saying, we wouldn't have corrupted data: the good bits would have been synced from the healthy HDD, yet we had data loss anyway. RAID1 is fine and dandy, but it doesn't save you from data loss. Actually, since the failure rate of an HDD is around 1.5-3%, with 2 HDDs we double our chances of a broken HDD. In a sense having just 1 new HDD is better than having 2 old ones... but Hetzner uses hard drives that have been running non-stop for ages, so RAID even with just two drives makes sense.

But if data loss is your concern, a backup is the only solution.
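The "double our chances" remark above can be checked with a short calculation. This is a sketch that reads the 1.5-3% figure as a per-drive failure probability p over some fixed period (the post does not state the period) and assumes the two drives fail independently.

```latex
\begin{align*}
P(\text{at least one of two drives fails}) &= 1 - (1 - p)^2 = 2p - p^2 \\
p = 0.015 &: \quad 1 - 0.985^2 = 0.029775 \approx 3.0\% \\
p = 0.03  &: \quad 1 - 0.97^2 = 0.0591 \approx 5.9\% \\
P(\text{both drives fail}) &= p^2 = 0.0009 \quad \text{for } p = 0.03
\end{align*}
```

So under these assumptions the chance of having to deal with a broken drive is almost exactly doubled, while the chance of losing both mirrors at once is far smaller than the chance of losing a single drive.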

webwit · 20 Jul 2017, 08:31

Weirdly the raw value is going up but the value is at 72 now.

Code:

root@server [~]# while true; do smartctl -a /dev/sdb |grep Raw_Read_Error_Rate; sleep 300; done
  1 Raw_Read_Error_Rate     0x000f   070   063   044    Pre-fail  Always       -       12163138
  1 Raw_Read_Error_Rate     0x000f   070   063   044    Pre-fail  Always       -       12518172
  1 Raw_Read_Error_Rate     0x000f   071   063   044    Pre-fail  Always       -       12762654
  1 Raw_Read_Error_Rate     0x000f   071   063   044    Pre-fail  Always       -       13082807
  1 Raw_Read_Error_Rate     0x000f   071   063   044    Pre-fail  Always       -       13765149
  1 Raw_Read_Error_Rate     0x000f   071   063   044    Pre-fail  Always       -       14005397
  1 Raw_Read_Error_Rate     0x000f   071   063   044    Pre-fail  Always       -       14182096
  1 Raw_Read_Error_Rate     0x000f   071   063   044    Pre-fail  Always       -       14432541
  1 Raw_Read_Error_Rate     0x000f   072   063   044    Pre-fail  Always       -       14697695
  1 Raw_Read_Error_Rate     0x000f   072   063   044    Pre-fail  Always       -       14840703

matt3o · 20 Jul 2017, 08:39

Yeah, the values fluctuate. The 4th column after the ID (063 here) is the worst value that has ever been registered, while the 5th (044) is the threshold that we should never reach.
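To make the columns concrete, here is a small sketch that labels the fields of one of the rows posted above. It assumes the usual smartctl attribute layout (ID, name, flag, value, worst, threshold, type, updated, when failed, raw value); `smart_fields` is a made-up helper name and the "margin" it prints is simply value minus threshold.

```shell
#!/bin/sh
# Hypothetical helper: label the normalized columns of smartctl attribute
# rows read from stdin.
# Assumed column order: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
smart_fields() {
  awk '{ printf "%s value=%d worst=%d thresh=%d margin=%d\n", $2, $4, $5, $6, $4 - $6 }'
}

# Last row from the readout posted in this thread.
line='  1 Raw_Read_Error_Rate     0x000f   072   063   044    Pre-fail  Always       -       14840703'
printf '%s\n' "$line" | smart_fields
# prints: Raw_Read_Error_Rate value=72 worst=63 thresh=44 margin=28
```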

webwit · 20 Jul 2017, 09:10

I ordered a new server.

matt3o · 20 Jul 2017, 09:47

check the hard drives before installing anything

Wodan · 20 Jul 2017, 10:14

matt3o wrote: check the hard drives before installing anything

Very good point, they re-use servers and we should request a brand new one considering their policy with worn out hdds!

webwit · 20 Jul 2017, 11:05

Both sda and sdb on the new server have a fluctuating Raw_Read_Error_Rate, which after a few queries stabilizes at 080.
I'll run some longer tests.

matt3o · 20 Jul 2017, 11:09

80 is fine if the raw value is more or less stable

seebart · 20 Jul 2017, 11:38

matt3o wrote: 80 is fine if the raw value is more or less stable

Great work webwit and matt3o, I promise I'll refrain from posting memes excessively if that helps.

XMIT · 20 Jul 2017, 17:22

So, not SSD time, yet?

If not as the primary, I'd love to see an SSD being used in a write-through cache.

webwit · 20 Jul 2017, 19:21

This one:
https://www.hetzner.de/dedicated-rootserver/ex41

When you order you can pick options such as extra SSD drive (cheapest one 250 GB 11,90 EUR per month), but the real question is, do we need it? In any case, that's a different discussion, priority is now to get a stable environment asap. I'm planning the move on Saturday or Sunday.

tobsn · 20 Jul 2017, 19:26

What do you think about hitting up AWS and getting Elastic Beanstalk hosting for free? I think that's a possibility... then you wouldn't ever have to bother with server hardware.

wobbled · 20 Jul 2017, 19:27

If you don't want to go SSD, at least get 15k SAS.

Wodan · 20 Jul 2017, 23:20

webwit wrote: This one:
https://www.hetzner.de/dedicated-rootserver/ex41

When you order you can pick options such as extra SSD drive (cheapest one 250 GB 11,90 EUR), but the real question is, do we need it? In any case, that's a different discussion, priority is now to get a stable environment asap. I'm planning the move on Saturday or Sunday.

Unless we are experiencing HDD performance bottlenecks I would prefer a good enterprise HDD over an SSD.
Most HDDs die slowly and give you time to react... while some SSDs just stop working and there is no way to recover your data.

Maybe get weekly SMART reports from the server as an early warning.
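A rough sketch of what such a weekly check could look like: filter the attribute rows and print only those whose normalized value is getting close to its threshold. The function name `near_threshold`, the margin of 27 points, the script path and the cron schedule are all assumptions for illustration; the sample rows are copied from the readouts in this thread.

```shell
#!/bin/sh
# near_threshold MARGIN: read smartctl attribute rows on stdin and print the
# ones whose normalized value is within MARGIN points of the threshold.
# (Hypothetical helper, not an existing tool.)
near_threshold() {
  awk -v margin="$1" '
    # Attribute rows start with a numeric ID and have at least 10 fields.
    # Rows with a threshold of 0 are skipped.
    $1 ~ /^[0-9]+$/ && NF >= 10 && $6 + 0 > 0 && ($4 - $6) < margin {
      printf "WARN %s value=%d thresh=%d\n", $2, $4, $6
    }'
}

# Two rows copied from the thread: only the first is within 27 points.
near_threshold 27 <<'EOF'
  1 Raw_Read_Error_Rate     0x000f   070   063   044    Pre-fail  Always       -       12163138
  1 Raw_Read_Error_Rate     0x000f   072   063   044    Pre-fail  Always       -       14840703
EOF

# On the server the input would come from something like
#   smartctl -A /dev/sdb | near_threshold 27
# run weekly, e.g. via a hypothetical crontab entry (Mondays 06:00):
#   0 6 * * 1 /usr/local/bin/smart-report
```

A fixed margin is a crude rule of thumb. As noted earlier in the thread, the normalized value fluctuates, so a trend over several weeks says more than a single reading.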

Norman_ · 20 Jul 2017, 23:56

Wodan wrote:
webwit wrote: This one:
https://www.hetzner.de/dedicated-rootserver/ex41

When you order you can pick options such as extra SSD drive (cheapest one 250 GB 11,90 EUR), but the real question is, do we need it? In any case, that's a different discussion, priority is now to get a stable environment asap. I'm planning the move on Saturday or Sunday.
Unless we are experiencing HDD performance bottlenecks I would prefer a good enterprise HDD over an SSD.
Most HDDs die slowly and give you time to react... while some SSDs just stop working and there is no way to recover your data.

Maybe get weekly SMART reports from the server as an early warning.

Losing your data to something like SSD failure as opposed to catching and replacing a failing HDD is kind of irrelevant IMO, because SSD failure is much less common than HDD failure, by like an order of magnitude, and I find that early warning measures for HDD failure generally aren't as reliable as one would hope. HDD failure can be just as sudden and unexpected as SSD failure.

Not to mention if you don't have some sort of redundancy/backup you might as well just delete everything manually.

And while I'm here, friendly reminder that raid is not a backup.

Of course, not that this entire conversation matters...because deskthority is actually pretty fast and doesn't even need SSDs at all lol.