Server crash. UPDATE: New server ordered
- webwit
- Wild Duck
- Location: The Netherlands
- Main keyboard: Model F62
- Favorite switch: IBM beam spring
- DT Pro Member: 0000
The server crashed this morning, sometime after 6:00 CET (the nightly backup had completed). I couldn't ssh into the server even after a restart. I used our hoster's rescue system, which boots over PXE and runs in the server's memory. I mounted the RAID disk and used fsck to repair it (it found various errors). Now it's up. Currently checking the database tables for corruption.
UPDATE 19 July 20:10 UTC:
The server and thus deskthority will be down from 22:25 UTC July 19th (00:25 CEST July 20th, 18:25 EDT July 19th, 15:25 PDT July 19th) for an estimated 30 to 45 minutes, for a health check of our hard drives. This is 2 hours and 15 minutes from now. See you on the other side of the event horizon!
- webwit
I only found 1 error in the database, in an unimportant visitor statistics table, which I repaired. There was also an error earlier in a wiki table, which kept the wiki DB backup from completing for the past four days. However, the fsck seems to have repaired it.
Please let me know if you find any missing data or if other things don't work as they should.
Wodan wrote: ↑Did you SMART query the HDD drives?
I don't even know what a SMART query is. I'm not much of a Linux admin.
- kbdfr
- The Tiproman
- Location: Berlin, Germany
- Main keyboard: Tipro MID-QM-128A + two Tipro matrix modules
- Main mouse: Contour Rollermouse Pro
- Favorite switch: Cherry black
- DT Pro Member: 0010
webwit wrote: ↑The server crashed this morning, sometime after 6:00 CET (nightly backup completed). I couldn't ssh into the server even after a restart. I used a rescue system by our hoster which uses PXE boot and runs in the memory of the server. Mounted raid disk, used fsck to repair disk (it found various errors). Now it's up. Currently checking database tables for any corruptions.
Thanks for fixing that, even if I don't understand much of what you did.
- chuckdee
- Location: USA
- Main keyboard: Clueboard/RS Ver.B
- Main mouse: Logitech g900
- Favorite switch: Cherry MX Brown
- DT Pro Member: 0151
webwit wrote: ↑I only found 1 error in the database, in some not important visitor statistics table, which I repaired. There was also an error earlier in a wiki table, which caused the wiki db backup not to complete the past four days. However, the fsck seemed to have repaired it.
Please let me know if you find any missing data or if other things don't work as they should.
Wodan wrote: ↑Did you SMART query the HDD drives?
I don't even know what a SMART query is. I'm not much of a linux admin.
https://en.wikipedia.org/wiki/S.M.A.R.T.
It allows you to predict whether a drive is in danger of failing. I don't know how to do it on Linux, though.
- matt3o
- -[°_°]-
- Location: Italy
- Main keyboard: WhiteFox
- Main mouse: Anywhere MX
- Favorite switch: Anything, really
- DT Pro Member: 0030
webwit wrote: ↑I don't even know what a SMART query is. I'm not much of a linux admin.
Run:
smartctl -t short /dev/sdX
(where X is the drive letter)
The test will take a few minutes. If it succeeds, run:
smartctl -t long /dev/sdX
This will take much longer.
If either test fails, copy the test result, send it to the host and ask for a replacement.
To check the test status, run:
smartctl -l selftest /dev/sdX
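If you would rather script that last check than read the log by eye, here is a minimal sketch. It assumes the self-test log looks like the `smartctl -l selftest` output posted in this thread (smartctl 5.43), and the function name `latest_selftest_ok` is made up for illustration, not something smartmontools provides.

```shell
#!/bin/sh
# Hypothetical helper (not part of smartmontools): reads the output of
# `smartctl -l selftest /dev/sdX` on stdin and exits 0 only if the
# newest log entry ("# 1") reads "Completed without error".
latest_selftest_ok() {
    awk '
        /^# *1 / {
            found = 1
            if ($0 ~ /Completed without error/) ok = 1
        }
        END {
            if (found && ok) exit 0
            exit 1
        }
    '
}

# Example (needs root and a real drive, so it is left commented out):
# smartctl -l selftest /dev/sda | latest_selftest_ok && echo "last test OK"
```

Note that it also returns non-zero while a test is still running or when the log is empty, so a failure here means "look at the log", not necessarily "the drive is bad".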
- webwit
Thanks. Short tests completed without error. Currently long testing first raid disk. "Please wait 316 minutes for test to complete" ...
- Wodan
- ISO Advocate
- Location: ISO-DE
- Main keyboard: Intense Rotation!!!
- Main mouse: Logitech G903
- Favorite switch: ALL OF THEM
- DT Pro Member: -
webwit wrote: ↑Thanks. Short tests completed without error. Currently long testing first raid disk. "Please wait 316 minutes for test to complete" ...
Thanks very much for taking care of this!
Some Linux distributions ship a SMART monitoring daemon (smartd, part of smartmontools) that you can configure to run tests periodically and send you the results.
- DanielT
- Un petit village gaulois d'Armorique…
- Location: Bucharest/Romania
- Main keyboard: Various custom 60%'s/HHKB
- Main mouse: MS Optical Mouse 200
- Favorite switch: Topre/Linear MX
- DT Pro Member: -
What make is the server? Depending on the manufacturer, there are some tools that can be used and are way better than the SMART stuff. This is what I do for a living, by the way.
- webwit
DanielT wrote: ↑What maker is the server ? Depending on the manufacturer there are some tools that can be used and are way better than the smart stuff . This is what I do for a living by the way
Code:
>lshw
server.deskthority.net
description: Desktop Computer
product: MS-7823 (To be filled by O.E.M.)
vendor: MSI
version: 1.0
serial: To be filled by O.E.M.
width: 64 bits
capabilities: smbios-2.8 dmi-2.7 vsyscall64 vsyscall32
configuration: administrator_password=disabled boot=normal chassis=desktop family=To be filled by O.E.M. frontpanel_password=disabled keyboard_password=disabled power-on_password=disabled sku=To be filled by O.E.M. uuid=00000000-0000-0000-0000-448A5BD4482E
*-core
description: Motherboard
product: B85M-G43 (MS-7823)
vendor: MSI
physical id: 0
version: 1.0
serial: To be filled by O.E.M.
slot: To be filled by O.E.M.
*-firmware
description: BIOS
vendor: American Megatrends Inc.
physical id: 0
version: V3.14B3
date: 06/23/2014
size: 64KiB
capacity: 15MiB
capabilities: pci upgrade shadowing cdboot bootselect socketedrom edd int13floppy1200 int13floppy720 int13floppy2880 int5printscreen int9keyboard int14serial int17printer acpi usb biosbootspecification uefi
*-cpu
description: CPU
product: Xeon (Fill By OEM)
vendor: Intel Corp.
vendor_id: GenuineIntel
physical id: 3d
bus info: cpu@0
version: Intel(R) Xeon(R) CPU E3-1246 v3 @ 3.50GHz
slot: SOCKET 0
size: 3500MHz
capacity: 3900MHz
width: 64 bits
clock: 100MHz
capabilities: x86-64 fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm cpufreq
configuration: cores=4 enabledcores=4 threads=8
*-cache:0
description: L2 cache
physical id: 3e
slot: CPU Internal L2
size: 1MiB
capacity: 1MiB
capabilities: internal write-back unified
*-cache:1
description: L1 cache
physical id: 3f
slot: CPU Internal L1
size: 256KiB
capacity: 256KiB
capabilities: internal write-back
*-cache:2
description: L3 cache
physical id: 40
slot: CPU Internal L3
size: 8MiB
capacity: 8MiB
capabilities: internal write-back unified
*-memory
description: System Memory
physical id: 41
slot: System board or motherboard
size: 32GiB
*-bank:0
description: DIMM DDR3 Synchronous 1600 MHz (0.6 ns)
product: CT102464BA160B.C16
vendor: Conexant (Rockwell)
physical id: 0
serial: AE008BBA
slot: ChannelA-DIMM0
size: 8GiB
width: 64 bits
clock: 1600MHz (0.6ns)
*-bank:1
description: DIMM DDR3 Synchronous 1600 MHz (0.6 ns)
product: CT102464BA160B.C16
vendor: Conexant (Rockwell)
physical id: 1
serial: A41163FD
slot: ChannelA-DIMM1
size: 8GiB
width: 64 bits
clock: 1600MHz (0.6ns)
*-bank:2
description: DIMM DDR3 Synchronous 1600 MHz (0.6 ns)
product: CT102464BA160B.C16
vendor: Conexant (Rockwell)
physical id: 2
serial: A10FE015
slot: ChannelB-DIMM0
size: 8GiB
width: 64 bits
clock: 1600MHz (0.6ns)
*-bank:3
description: DIMM DDR3 Synchronous 1600 MHz (0.6 ns)
product: CT102464BA160B.C16
vendor: Conexant (Rockwell)
physical id: 3
serial: AE008BB9
slot: ChannelB-DIMM1
size: 8GiB
width: 64 bits
clock: 1600MHz (0.6ns)
*-pci
description: Host bridge
product: Xeon E3-1200 v3 Processor DRAM Controller
vendor: Intel Corporation
physical id: 100
bus info: pci@0000:00:00.0
version: 06
width: 32 bits
clock: 33MHz
*-display UNCLAIMED
description: VGA compatible controller
product: Xeon E3-1200 v3 Processor Integrated Graphics Controller
vendor: Intel Corporation
physical id: 2
bus info: pci@0000:00:02.0
version: 06
width: 64 bits
clock: 33MHz
capabilities: msi pm vga_controller bus_master cap_list
configuration: latency=0
resources: memory:f7800000-f7bfffff memory:e0000000-efffffff(prefetchable) ioport:f000(size=64)
*-usb:0
description: USB controller
product: 8 Series/C220 Series Chipset Family USB xHCI
vendor: Intel Corporation
physical id: 14
bus info: pci@0000:00:14.0
version: 05
width: 64 bits
clock: 33MHz
capabilities: pm msi xhci bus_master cap_list
configuration: driver=xhci_hcd latency=0
resources: irq:33 memory:f7d00000-f7d0ffff
*-usbhost:0
product: xHCI Host Controller
vendor: Linux 2.6.32-696.3.2.el6.x86_64 xhci_hcd
physical id: 0
bus info: usb@4
logical name: usb4
version: 2.06
capabilities: usb-3.00
configuration: driver=hub slots=6 speed=5000Mbit/s
*-usbhost:1
product: xHCI Host Controller
vendor: Linux 2.6.32-696.3.2.el6.x86_64 xhci_hcd
physical id: 1
bus info: usb@3
logical name: usb3
version: 2.06
capabilities: usb-2.00
configuration: driver=hub slots=12 speed=480Mbit/s
*-communication UNCLAIMED
description: Communication controller
product: 8 Series/C220 Series Chipset Family MEI Controller #1
vendor: Intel Corporation
physical id: 16
bus info: pci@0000:00:16.0
version: 04
width: 64 bits
clock: 33MHz
capabilities: pm msi bus_master cap_list
configuration: latency=0
resources: memory:f7d16000-f7d1600f
*-usb:1
description: USB controller
product: 8 Series/C220 Series Chipset Family USB EHCI #2
vendor: Intel Corporation
physical id: 1a
bus info: pci@0000:00:1a.0
version: 05
width: 32 bits
clock: 33MHz
capabilities: pm debug ehci bus_master cap_list
configuration: driver=ehci_hcd latency=0
resources: irq:20 memory:f7d14000-f7d143ff
*-usbhost
product: EHCI Host Controller
vendor: Linux 2.6.32-696.3.2.el6.x86_64 ehci_hcd
physical id: 1
bus info: usb@1
logical name: usb1
version: 2.06
capabilities: usb-2.00
configuration: driver=hub slots=2 speed=480Mbit/s
*-usb
description: USB hub
vendor: Intel Corp.
physical id: 1
bus info: usb@1:1
version: 0.05
capabilities: usb-2.00
configuration: driver=hub slots=6 speed=480Mbit/s
*-pci:0
description: PCI bridge
product: 8 Series/C220 Series Chipset Family PCI Express Root Port #1
vendor: Intel Corporation
physical id: 1c
bus info: pci@0000:00:1c.0
version: d5
width: 32 bits
clock: 33MHz
capabilities: pci pciexpress msi pm normal_decode bus_master cap_list
configuration: driver=pcieport
resources: irq:31 ioport:2000(size=4096) memory:df200000-df3fffff ioport:df400000(size=2097152)
*-pci:1
description: PCI bridge
product: 8 Series/C220 Series Chipset Family PCI Express Root Port #5
vendor: Intel Corporation
physical id: 1c.4
bus info: pci@0000:00:1c.4
version: d5
width: 32 bits
clock: 33MHz
capabilities: pci pciexpress msi pm normal_decode bus_master cap_list
configuration: driver=pcieport
resources: irq:32 ioport:e000(size=4096) memory:f7c00000-f7cfffff ioport:f0000000(size=1048576)
*-network
description: Ethernet interface
product: RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
vendor: Realtek Semiconductor Co., Ltd.
physical id: 0
bus info: pci@0000:02:00.0
logical name: eth0
version: 0c
serial: 44:8a:5b:d4:48:2e
size: 1Gbit/s
capacity: 1Gbit/s
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress msix vpd bus_master cap_list ethernet physical tp mii 10bt 10bt-fd 100bt 100bt-fd 1000bt 1000bt-fd autonegotiation
configuration: autonegotiation=on broadcast=yes driver=r8169 driverversion=2.3LK-NAPI duplex=full firmware=rtl8168g-2_0.0.1 02/06/13 ip=136.243.20.197 latency=0 link=yes multicast=yes port=MII speed=1Gbit/s
resources: irq:35 ioport:e000(size=256) memory:f7c00000-f7c00fff memory:f0000000-f0003fff(prefetchable)
*-usb:2
description: USB controller
product: 8 Series/C220 Series Chipset Family USB EHCI #1
vendor: Intel Corporation
physical id: 1d
bus info: pci@0000:00:1d.0
version: 05
width: 32 bits
clock: 33MHz
capabilities: pm debug ehci bus_master cap_list
configuration: driver=ehci_hcd latency=0
resources: irq:23 memory:f7d13000-f7d133ff
*-usbhost
product: EHCI Host Controller
vendor: Linux 2.6.32-696.3.2.el6.x86_64 ehci_hcd
physical id: 1
bus info: usb@2
logical name: usb2
version: 2.06
capabilities: usb-2.00
configuration: driver=hub slots=2 speed=480Mbit/s
*-usb
description: USB hub
vendor: Intel Corp.
physical id: 1
bus info: usb@2:1
version: 0.05
capabilities: usb-2.00
configuration: driver=hub slots=6 speed=480Mbit/s
*-isa
description: ISA bridge
product: B85 Express LPC Controller
vendor: Intel Corporation
physical id: 1f
bus info: pci@0000:00:1f.0
version: 05
width: 32 bits
clock: 33MHz
capabilities: isa bus_master cap_list
configuration: driver=lpc_ich latency=0
resources: irq:0
*-storage
description: SATA controller
product: 8 Series/C220 Series Chipset Family 6-port SATA Controller 1 [AHCI mode]
vendor: Intel Corporation
physical id: 1f.2
bus info: pci@0000:00:1f.2
logical name: scsi0
logical name: scsi1
version: 05
width: 32 bits
clock: 66MHz
capabilities: storage msi pm ahci_1.0 bus_master cap_list emulated
configuration: driver=ahci latency=0
resources: irq:34 ioport:f0b0(size=8) ioport:f0a0(size=4) ioport:f090(size=8) ioport:f080(size=4) ioport:f060(size=32) memory:f7d12000-f7d127ff
*-disk:0
description: ATA Disk
product: HGST HUS724020AL
physical id: 0
bus info: scsi@0:0.0.0
logical name: /dev/sda
version: AA70
serial: PN1134P6JHVLRS
size: 1863GiB (2TB)
capabilities: partitioned partitioned:dos
configuration: ansiversion=5 logicalsectorsize=512 sectorsize=512 signature=00065bc0
*-volume:0
description: Linux swap volume
physical id: 1
bus info: scsi@0:0.0.0,1
logical name: /dev/sda1
version: 1
serial: 804f02eb-215f-4002-aa3f-87fd72ea6f60
size: 15GiB
capacity: 16GiB
capabilities: primary multi swap initialized
configuration: filesystem=swap pagesize=4096
*-volume:1
description: EXT3 volume
vendor: Linux
physical id: 2
bus info: scsi@0:0.0.0,2
logical name: /dev/sda2
version: 1.0
serial: f431e2a6-4fdc-4fab-8f06-d6df688fd284
size: 511MiB
capacity: 512MiB
capabilities: primary multi journaled extended_attributes recover ext3 ext2 initialized
configuration: created=2015-05-12 12:29:47 filesystem=ext3 lastmountpoint=/installimage.rMbiM/hdd/boot modified=2017-07-18 06:47:12 mounted=2017-07-18 06:47:12 state=clean
*-volume:2
description: EXT4 volume
vendor: Linux
physical id: 3
bus info: scsi@0:0.0.0,3
logical name: /dev/sda3
version: 1.0
serial: eb9ef821-4f84-49a1-9c52-6aa817286fec
size: 1846GiB
capacity: 1846GiB
capabilities: primary multi journaled extended_attributes large_files huge_files dir_nlink recover extents ext4 ext2 initialized
configuration: created=2015-05-12 12:29:57 filesystem=ext4 lastmountpoint=/ modified=2017-07-18 06:46:09 mounted=2017-07-18 06:47:12 state=clean
*-disk:1
description: ATA Disk
product: ST2000NM0033-9ZM
vendor: Seagate
physical id: 1
bus info: scsi@1:0.0.0
logical name: /dev/sdb
version: SN03
serial: Z1X0CRNR
size: 1863GiB (2TB)
capabilities: partitioned partitioned:dos
configuration: ansiversion=5 logicalsectorsize=512 sectorsize=512 signature=000b1474
*-volume:0
description: Linux swap volume
physical id: 1
bus info: scsi@1:0.0.0,1
logical name: /dev/sdb1
version: 1
serial: 804f02eb-215f-4002-aa3f-87fd72ea6f60
size: 15GiB
capacity: 16GiB
capabilities: primary multi swap initialized
configuration: filesystem=swap pagesize=4096
*-volume:1
description: EXT3 volume
vendor: Linux
physical id: 2
bus info: scsi@1:0.0.0,2
logical name: /dev/sdb2
version: 1.0
serial: f431e2a6-4fdc-4fab-8f06-d6df688fd284
size: 511MiB
capacity: 512MiB
capabilities: primary multi journaled extended_attributes recover ext3 ext2 initialized
configuration: created=2015-05-12 12:29:47 filesystem=ext3 lastmountpoint=/installimage.rMbiM/hdd/boot modified=2017-07-18 06:47:12 mounted=2017-07-18 06:47:12 state=clean
*-volume:2
description: EXT4 volume
vendor: Linux
physical id: 3
bus info: scsi@1:0.0.0,3
logical name: /dev/sdb3
version: 1.0
serial: eb9ef821-4f84-49a1-9c52-6aa817286fec
size: 1846GiB
capacity: 1846GiB
capabilities: primary multi journaled extended_attributes large_files huge_files dir_nlink recover extents ext4 ext2 initialized
configuration: created=2015-05-12 12:29:57 filesystem=ext4 lastmountpoint=/ modified=2017-07-18 06:46:09 mounted=2017-07-18 06:47:12 state=clean
*-serial UNCLAIMED
description: SMBus
product: 8 Series/C220 Series Chipset Family SMBus Controller
vendor: Intel Corporation
physical id: 1f.3
bus info: pci@0000:00:1f.3
version: 05
width: 64 bits
clock: 33MHz
configuration: latency=0
resources: memory:f7d11000-f7d110ff ioport:f040(size=32)
*-power UNCLAIMED
description: To Be Filled By O.E.M.
product: To Be Filled By O.E.M.
vendor: To Be Filled By O.E.M.
physical id: 1
version: To Be Filled By O.E.M.
serial: To Be Filled By O.E.M.
capacity: 32768mWh
- webwit
Doesn't look good: the database crashed again. The rest of the server was still up. I'll wait for the smartctl results.
- matt3o
webwit wrote: ↑Doesn't look well, the database crashed again. Rest of the server was still up. I'll wait for the smartctl results.
Want me to check the DB config?
- webwit
I don't think it's the config, just HD corruption? But feel free to check. Smartctl is still at 70% remaining...
- webwit
If this happens again I might just move the entire thing to a fresh server. Last time I did that it was only one or two clicks with cPanel WHM, moving over website, DB, mail, DNS etc. And we'll get some newer hardware for the same price.
- wobbled
- Location: USA
- Main keyboard: HHKB PD-KB300 Pro 1
- Main mouse: Logitech MX Master 3
- Favorite switch: Topre
- DT Pro Member: 0192
webwit wrote: ↑If this happens again I might just move the entire thing to a fresh server. Last time I did that it only was one or two clicks with cPanel WHM, moving over website, db, mail, dns etc. And we'll get some newer hardware for the same price.
If possible, go entirely solid state with a new server. HDDs are a ticking time bomb, honestly.
- webwit
That's what we had last time. It crashed. Also, it was small compared to HDD, we need capacity.
- matt3o
how did the test go?
- webwit
No errors on the first disk, now checking the second.
Edit: Second disk also without errors.
- matt3o
webwit wrote: ↑No errors on the first disk, now checking the second.
Edit: Second disk also without errors.
That is really weird. Let me check the MySQL config.
- webwit
There's one strange thing. /dev/sda took a lot longer than /dev/sdb, like 12 hours vs a couple of hours, while these are (I think) identical disks.
With both sda and sdb, the progress (like "70% remaining") is shown by smartctl -c /dev/sdX. But smartctl -l selftest /dev/sdX only listed the running test with its X% remaining for sdb (under Num # 1), not for sda (see below).
Code:
root@server [~]# smartctl -c /dev/sda
smartctl 5.43 2016-09-28 r4347 [x86_64-linux-2.6.32-696.3.2.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF READ SMART DATA SECTION ===
General SMART Values:
Offline data collection status: (0x80) Offline data collection activity
was never started.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 247) Self-test routine in progress...
70% of test remaining.
Total time to complete Offline
data collection: ( 24) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 316) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
root@server [~]# smartctl -l selftest /dev/sda
smartctl 5.43 2016-09-28 r4347 [x86_64-linux-2.6.32-696.3.2.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 22554 -
# 2 Short offline Completed without error 00% 22542 -
# 3 Extended offline Completed without error 00% 3393 -
# 4 Extended offline Completed without error 00% 3316 -
# 5 Extended offline Completed without error 00% 21 -
# 6 Extended offline Completed without error 00% 4 -
root@server [~]#
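For polling progress without reading the whole `smartctl -c` dump each time, a rough sketch follows. The function names are made up for illustration. The nibble layout of the status byte is my understanding of the ATA SMART data structure, and it agrees with the output in this thread (247 next to "70% of test remaining"), but treat it as an assumption.

```shell
#!/bin/sh
# Hypothetical helpers for reading self-test progress.

# Reads `smartctl -c /dev/sdX` output on stdin and prints the number in
# front of "% of test remaining" (prints nothing if no test is running).
selftest_remaining() {
    sed -n 's/^[[:space:]]*\([0-9][0-9]*\)% of test remaining.*/\1/p'
}

# Splits the self-test execution status byte, e.g. the 247 in "( 247)":
# high nibble = state (15 means "in progress"), low nibble = tenths of
# the test still remaining. Prints "<state> <percent remaining>".
decode_selftest_status() {
    status=$1
    echo "$(( status >> 4 )) $(( (status & 15) * 10 ))"
}

# Example: smartctl -c /dev/sda | selftest_remaining
```

Since `smartctl -c` reported progress for both drives while the self-test log did not, reading the status this way sidesteps the sda/sdb difference described above.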
- matt3o
would you post "smartctl -a /dev/sdX" for both drives? honestly 12 hours seems way too much for a smartctl test.
I had a look at the mysql config, it can be improved but I don't see anything too bad
- webwit
/dev/sda:
/dev/sdb:
PS: Currently re-testing /dev/sda
Code:
root@server [~]# smartctl -a /dev/sda
smartctl 5.43 2016-09-28 r4347 [x86_64-linux-2.6.32-696.3.2.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: Hitachi/HGST Ultrastar 7K4000
Device Model: HGST HUS724020ALA640
Serial Number: PN1134P6JHVLRS
LU WWN Device Id: 5 000cca 22de36471
Firmware Version: MF6OAA70
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Wed Jul 19 15:35:25 2017 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x80) Offline data collection activity
was never started.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 246) Self-test routine in progress...
60% of test remaining.
Total time to complete Offline
data collection: ( 24) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 316) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
2 Throughput_Performance 0x0005 138 138 054 Pre-fail Offline - 76
3 Spin_Up_Time 0x0007 152 152 024 Pre-fail Always - 456 (Average 364)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 10
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 142 142 020 Pre-fail Offline - 25
9 Power_On_Hours 0x0012 097 097 000 Old_age Always - 22572
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 10
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 222
193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 222
194 Temperature_Celsius 0x0002 176 176 000 Old_age Always - 34 (Min/Max 23/40)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 22554 -
# 2 Short offline Completed without error 00% 22542 -
# 3 Extended offline Completed without error 00% 3393 -
# 4 Extended offline Completed without error 00% 3316 -
# 5 Extended offline Completed without error 00% 21 -
# 6 Extended offline Completed without error 00% 4 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Code:
root@server [~]# smartctl -a /dev/sdb
smartctl 5.43 2016-09-28 r4347 [x86_64-linux-2.6.32-696.3.2.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: Seagate Constellation ES.3
Device Model: ST2000NM0033-9ZM175
Serial Number: Z1X0CRNR
LU WWN Device Id: 5 000c50 03cecc3c6
Firmware Version: SN03
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ACS-2 (revision not indicated)
Local Time is: Wed Jul 19 15:34:43 2017 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 592) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 254) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x50bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 078 063 044 Pre-fail Always - 67401833
3 Spin_Up_Time 0x0003 096 096 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 7
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 094 060 030 Pre-fail Always - 3001257799
9 Power_On_Hours 0x0032 075 075 000 Old_age Always - 22094
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 7
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 067 059 045 Old_age Always - 33 (Min/Max 31/33)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 5
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 946
194 Temperature_Celsius 0x0022 033 041 000 Old_age Always - 33 (0 22 0 0 0)
195 Hardware_ECC_Recovered 0x001a 046 015 000 Old_age Always - 67401833
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 22091 -
# 2 Short offline Completed without error 00% 22064 -
# 3 Extended offline Completed without error 00% 2914 -
# 4 Extended offline Completed without error 00% 2838 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
- Wodan
Okay, I am not a SMART readout expert, but this is worrying me (sdb):
Code:
1 Raw_Read_Error_Rate 0x000f 078 063 044 Pre-fail Always - 67401833
HDDs are in a RAID config?
Might make sense to request a replacement of sdb, rebuild the RAID and then have sda replaced?
- matt3o
/dev/sdb seems to be deteriorating. If you look at smartctl -a every 5-10 minutes, does the "VALUE" column drop over time?
- webwit
Code:
root@server [~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[0] sdb2[1]
524224 blocks super 1.0 [2/2] [UU]
md0 : active raid1 sda1[0] sdb1[1]
16777088 blocks super 1.0 [2/2] [UU]
md2 : active raid1 sda3[0] sdb3[1]
1936208832 blocks super 1.0 [2/2] [UU]
unused devices: <none>
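A quick way to confirm from a script that no mirror has dropped a member is sketched below. It assumes the usual /proc/mdstat layout shown above, where [UU] means both members are up and an underscore (for example [U_]) marks a missing one. The function name `mdstat_degraded` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reads /proc/mdstat on stdin and prints the name
# of every md array whose member map (e.g. [UU]) contains a "_",
# which marks a missing or failed member.
mdstat_degraded() {
    awk '
        /^md[0-9]+ :/ { name = $1 }
        match($0, /\[[U_]+\]/) {
            if (index(substr($0, RSTART, RLENGTH), "_")) print name
        }
    '
}

# Example: mdstat_degraded < /proc/mdstat
```

With the three arrays above, all at [UU], it prints nothing.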
Code:
root@server [~]# dmesg | grep raid
md: raid1 personality registered for level 1
md/raid1:md2: not clean -- starting background reconstruction
md/raid1:md2: active with 2 out of 2 mirrors
md/raid1:md1: active with 2 out of 2 mirrors
md/raid1:md0: active with 2 out of 2 mirrors
Code:
root@server [~]# while true; do smartctl -a /dev/sdb |grep Raw_Read_Error_Rate; sleep 300; done
1 Raw_Read_Error_Rate 0x000f 078 063 044 Pre-fail Always - 68872755
1 Raw_Read_Error_Rate 0x000f 078 063 044 Pre-fail Always - 69089054
1 Raw_Read_Error_Rate 0x000f 078 063 044 Pre-fail Always - 69227460
1 Raw_Read_Error_Rate 0x000f 078 063 044 Pre-fail Always - 69271452
1 Raw_Read_Error_Rate 0x000f 078 063 044 Pre-fail Always - 69324943
1 Raw_Read_Error_Rate 0x000f 078 063 044 Pre-fail Always - 69363286
1 Raw_Read_Error_Rate 0x000f 078 063 044 Pre-fail Always - 69486033
1 Raw_Read_Error_Rate 0x000f 078 063 044 Pre-fail Always - 69625038
1 Raw_Read_Error_Rate 0x000f 078 063 044 Pre-fail Always - 69693573
1 Raw_Read_Error_Rate 0x000f 078 063 044 Pre-fail Always - 69829568
1 Raw_Read_Error_Rate 0x000f 078 063 044 Pre-fail Always - 69901410
1 Raw_Read_Error_Rate 0x000f 078 063 044 Pre-fail Always - 69950639
1 Raw_Read_Error_Rate 0x000f 078 063 044 Pre-fail Always - 70037059
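To answer the "does VALUE drop over time" question without eyeballing the columns, the samples can be summarised with a small sketch like this one. It assumes the line layout printed by `smartctl -a` above (VALUE in the 4th column, raw value last), and `raw_read_trend` is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper: reads Raw_Read_Error_Rate lines (as printed by
# `smartctl -a`) on stdin and prints three fields:
#   first VALUE, last VALUE, and last raw value minus first raw value.
raw_read_trend() {
    awk '
        /Raw_Read_Error_Rate/ {
            if (!seen) { first_value = $4; first_raw = $NF; seen = 1 }
            last_value = $4; last_raw = $NF
        }
        END { if (seen) print first_value, last_value, last_raw - first_raw }
    '
}
```

Fed the thirteen samples above, it reports VALUE staying at 078 while the raw number grows by 1164304 over roughly an hour.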
- webwit
I am not, it's just how the hoster set it up.
- matt3o
webwit wrote: ↑I am not, it's just how the hoster set it up.
If you got errors on one HDD, the RAID reconstruction is understandable... and that's where RAID on just 2 drives is a little pointless, since the machine can't always tell which copy of the data is bad and which one is good (50-50).
A value of 78 in Raw_Read_Error_Rate is not terrible per se, but the raw value (70037059) is climbing quickly. I would proceed with the replacement of the HDD ASAP if they let you.
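One caveat before reading too much into that raw number: it is widely reported (though, as far as I know, not officially documented by Seagate) that on Seagate drives the 48-bit raw value of Raw_Read_Error_Rate and Seek_Error_Rate packs an error count into the upper 16 bits and an operation count into the lower 32 bits. Treat that layout as an assumption. Under it, a minimal sketch for splitting the number looks like this (the function name is made up, and it needs a shell with 64-bit arithmetic):

```shell
#!/bin/sh
# Hypothetical helper, based on the ASSUMED Seagate layout described
# above: prints "<upper 16 bits> <lower 32 bits>" of a 48-bit raw value,
# i.e. "<errors> <operations>" if the assumption holds.
split_seagate_raw() {
    raw=$1
    echo "$(( raw >> 32 )) $(( raw & 0xFFFFFFFF ))"
}

# Example: split_seagate_raw 70037059
```

If the assumption holds, a raw value below 4294967296 (2^32), like the ones in this thread, would mean the error part is still zero and only the operation counter is moving. That is not a reason to skip the replacement if the hoster offers one, just a reason not to rely on this single attribute.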