Koozali.org: home of the SME Server

Contribs.org Forums => General Discussion => Topic started by: SchulzStefan on February 10, 2014, 07:26:44 PM

Title: RAID1 Problem (again)
Post by: SchulzStefan on February 10, 2014, 07:26:44 PM

Running a SoftRAID1 on SME8 latest updates installed.

A few weeks ago, one of the disks in the server gave up. I managed to install an identical disk in the machine and added it to the RAID, and the server was up and running again. No errors were reported. On Saturday night, the machine ended up in a loop and no longer responded to anything. I could read on the console that there was no write access to /var/log/httpd/access_log. I tried to shut down the server from the console, but no luck.

I did a hard reset. The machine booted but told me to run a filesystem check because of an inconsistent file system, and dropped me to the maintenance console. I did not perform a check at all. I shut down the server and changed the disk connections on the board. I did not connect the old sda to the board. The server came up again and everything seemed to be fine. I shut down the machine again, connected the second drive to the board, booted again, and added the (new) disk to the RAID.

Syncing started, but after a few hours I got an email:

This email was generated by the smartd daemon running on:

host name: saturn
DNS domain: ivb.local
NIS domain: (none)

The following warning/error was logged by the smartd daemon:

Device: /dev/sda [SAT], 1 Currently unreadable (pending) sectors

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
Another email message will be sent in 1 days if the problem persists

and:

# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb1[1] sda1[0]
104320 blocks [2/2] [UU]

md2 : active raid1 sdb2[2](S) sda2[0]
976655488 blocks [2/1] [U_]

unused devices: <none>

To me this means that the RAID is still broken, and the reason must be the unreadable sectors on sda. Interesting, because this is the disk I was able to boot from. Users are reporting no errors while working. I checked the sdb drive with the manufacturer's disk tool; no errors were reported.
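For reference, a minimal sketch of how the mdstat output above can be read mechanically. It assumes the two-line layout shown in this post (array line, then a "blocks" line); the function name degraded_arrays is made up for illustration. An underscore in the [U_] field marks a member slot that is not in sync.

```shell
#!/usr/bin/env bash
# Print the name of every md array whose status field contains "_",
# i.e. at least one member slot is missing or not yet in sync.
# Reads mdstat-formatted text on stdin, so it can be tried on a saved copy.
degraded_arrays() {
  awk '
    /^md[0-9]+ :/ { name = $1 }                          # remember the array name
    /blocks/ && /\[[U_]*_[U_]*\]/ { print name }         # status like [U_] or [_U]
  '
}

# Sample text copied from the post above.
sample='Personalities : [raid1]
md1 : active raid1 sdb1[1] sda1[0]
104320 blocks [2/2] [UU]

md2 : active raid1 sdb2[2](S) sda2[0]
976655488 blocks [2/1] [U_]

unused devices: <none>'

degraded_arrays <<< "$sample"
```

On the live system the same function could be fed with `degraded_arrays < /proc/mdstat`.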

Am I right in assuming that with an unreadable sector on sda I will never be able to rebuild the RAID? And if so, what would be the easiest way (shortest downtime) to clone/copy/mirror sda to another disk? Remember, the server is up and running without errors.

Any help/hints would be great. Thanks in advance,
stefan

Title: Re: RAID1 Problem (again)
Post by: SchulzStefan on February 10, 2014, 09:59:45 PM

From /var/log/messages:

Feb 10 18:45:12 saturn kernel: sd 0:0:0:0: Unhandled sense code
Feb 10 18:45:12 saturn kernel: sd 0:0:0:0: SCSI error: return code = 0x08000002
Feb 10 18:45:12 saturn kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Feb 10 18:45:12 saturn kernel: sda: Current [descriptor]: sense key: Medium Error
Feb 10 18:45:12 saturn kernel: Add. Sense: Unrecovered read error - auto reallocate failed
Feb 10 18:45:12 saturn kernel:
Feb 10 18:45:12 saturn kernel: Descriptor sense data with sense descriptors (in hex):
Feb 10 18:45:12 saturn kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Feb 10 18:45:12 saturn kernel: 74 70 53 30
Feb 10 18:45:12 saturn kernel: ata1: EH complete
Feb 10 18:45:12 saturn kernel: raid1: sda: unrecoverable I/O read error for block 1953309440
Feb 10 18:45:12 saturn kernel: SCSI device sda: 1953525168 512-byte hdwr sectors (1000205 MB)
Feb 10 18:45:12 saturn kernel: sda: Write Protect is off
Feb 10 18:45:12 saturn kernel: sda: Mode Sense: 00 3a 00 00
Feb 10 18:45:12 saturn kernel: SCSI device sda: drive cache: write back
Feb 10 18:45:12 saturn kernel: SCSI device sda: 1953525168 512-byte hdwr sectors (1000205 MB)
Feb 10 18:45:12 saturn kernel: sda: Write Protect is off
Feb 10 18:45:12 saturn kernel: sda: Mode Sense: 00 3a 00 00
Feb 10 18:45:12 saturn kernel: SCSI device sda: drive cache: write back
Feb 10 18:45:12 saturn kernel: RAID1 conf printout:
Feb 10 18:45:12 saturn kernel: --- wd:1 rd:2
Feb 10 18:45:12 saturn kernel: disk 0, wo:0, o:1, dev:sda2
Feb 10 18:45:12 saturn kernel: disk 1, wo:1, o:1, dev:sdb2
Feb 10 18:45:13 saturn kernel: RAID1 conf printout:
Feb 10 18:45:13 saturn kernel: --- wd:1 rd:2
Feb 10 18:45:13 saturn kernel: disk 0, wo:0, o:1, dev:sda2
Feb 10 19:01:09 saturn smartd[2582]: Device: /dev/sda [SAT], 1 Currently unreadable (pending) sectors
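
A rough check of where the failed read sits inside md2, using only numbers from this thread. It assumes that the "block" in the raid1 message is a 512-byte sector offset within the array; the log does not state the unit, so treat this as a guess.

```shell
#!/usr/bin/env bash
# Values copied from the kernel log and from /proc/mdstat in this thread.
failed_sector=1953309440      # "unrecoverable I/O read error for block ..."
md2_kib=976655488             # md2 size from /proc/mdstat, in 1 KiB blocks

# ASSUMPTION: the reported block number counts 512-byte sectors.
md2_sectors=$(( md2_kib * 2 ))                 # 512-byte sectors in md2
remaining=$(( md2_sectors - failed_sector ))   # sectors between failure and end

echo "md2 has $md2_sectors sectors; failure is $remaining sectors ($(( remaining / 2 )) KiB) before the end"
```

If the assumption holds, the resync would have run almost to the very end of the array before it hit the unreadable sector, which would fit the sync failing only after a few hours.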

I assume, that sda is corrupted. Did some googling: https://raid.wiki.kernel.org/index.php/Recovering_a_failed_software_RAID

Quote

Hardware error

If the reading fails for a harddisk, you need to copy that harddisk to a new harddisk. Do that using GNU ddrescue. ddrescue can read forwards (fast) and backwards (slow). This is useful since you can sometimes only read a sector if you read it from "the other side". By giving ddrescue a log-file it will skip the parts that have already been copied successfully. Thereby it is OK to reboot your system, if the copying makes the system hang: The copying will continue where it left off.

ddrescue -r 3 /dev/old /dev/new my_log
ddrescue -R -r 3 /dev/old /dev/new my_log

where /dev/old is the harddisk with errors and /dev/new is the new empty harddisk.

Re-test that you can now read all sectors from /dev/new using 'dd', and remove /dev/old from the system. Then recompute $DEVICES to include the /dev/new:

UUID=$(mdadm -E /dev/sdj1|perl -ne '/Array UUID : (\S+)/ and print $1')
DEVICES=$(cat /proc/partitions | parallel --tagstring {5} --colsep ' +' mdadm -E /dev/{5} |grep $UUID | parallel --colsep '\t' echo /dev/{1})

and: http://www.gnu.org/software/ddrescue/manual/ddrescue_manual.html

Quote

3 Using ddrescue safely

Ddrescue is like any other power tool. You need to understand what it does, and you need to understand some things about the machines it does those things to, in order to use it safely.

Always use a logfile unless you know you won't need it. Without a logfile, ddrescue can't resume a rescue, only reinitiate it.

Never try to rescue a r/w mounted partition. The resulting copy may be useless.

Never try to repair a file system on a drive with I/O errors; you will probably lose even more data.

If you use a device or a partition as destination, any data stored there will be overwritten.

Some systems may change device names on reboot (eg. udev enabled systems). If you reboot, check the device names before restarting ddrescue.

If you interrupt the rescue and then reboot, any partially copied partitions should be hidden before allowing them to be touched by any operating system that tries to mount and "fix" the partitions it sees.

As I understand the procedure should be:

1. remove sdb from the RAID. (set it to faulty and then remove it)
2. boot the machine with both disks plugged, from a CD or USB where ddrescue can be run from.
3. mount sda (the disk with the errors) read only. Do I have to mount the disk? Or is ddrescue taking control itself?
4. mount sdb read/write. Same question: Do I have to mount the disk? Or is ddrescue taking control itself?
5. use ddrescue to copy the data from sda to sdb.

After copying, pull the corrupted disk out of the machine. Plug the good disk (the copy of sda) on the board to boot.
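
The copy step can be sketched as a dry run. The script below only prints the commands so that the device names can be checked first. The device names and the log file name are assumptions for illustration; the flags are the ones from the quoted wiki text, plus --force, which ddrescue requires before it overwrites a device. ddrescue works on the raw block devices, so this sketch assumes that neither disk is mounted.

```shell
#!/usr/bin/env bash
# Dry run: prints the rescue commands instead of executing them.
SRC=/dev/sda        # failing disk (assumed name, verify with fdisk -l)
DST=/dev/sdb        # target disk, everything on it will be overwritten
LOG=rescue.log      # hypothetical log file name; lets ddrescue resume

plan_rescue() {
  # Forward pass with up to 3 retries of bad areas.
  echo "ddrescue --force -r 3 $SRC $DST $LOG"
  # Reverse pass over whatever the first pass could not read.
  echo "ddrescue --force -R -r 3 $SRC $DST $LOG"
}

plan_rescue
```

Since udev may rename devices between boots, it seems safer to re-check SRC and DST every time before turning the printed lines into real commands.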

I've never done this before. I'm a bit more than a user, but surely no expert in these things. So my question is: does anybody have experience with this? Will it work this way? Did I miss something?

Title: Re: RAID1 Problem (again)
Post by: TerryF on February 10, 2014, 11:00:07 PM

Are you saying you were able to reboot with a single disk, in this case sdb?

If that is the case, why not remove the sda HD, reboot the machine on the single drive, power it down and simply install a new disk of the same size as sdb, then reboot and let the system resync the raid, as it should do when a new HD is introduced to a single drive setup..

http://wiki.contribs.org/Raid#Replacing_and_Upgrading_Hard_Drive_after_HD_fail see here and leave out the grow section..

Title: Re: RAID1 Problem (again)
Post by: SchulzStefan on February 11, 2014, 08:43:09 AM

TerryF thank you for your reply.

Because the original disk on the sda port ended in the maintenance console due to an inconsistent file system, I booted the machine from the sdb disk. To do that, I moved the sdb disk to the sda port on the board. The server booted. Then I tried to add the other disk back to the RAID. Syncing started but ended with:

Quote

This email was generated by the smartd daemon running on:

host name: saturn
DNS domain: ivb.local
NIS domain: (none)

The following warning/error was logged by the smartd daemon:

Device: /dev/sda [SAT], 1 Currently unreadable (pending) sectors

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
Another email message will be sent in 1 days if the problem persists

and:

# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb1[1] sda1[0]
104320 blocks [2/2] [UU]

md2 : active raid1 sdb2[2](S) sda2[0]
976655488 blocks [2/1] [U_]

unused devices: <none>

In my understanding the disk is not mirrored and therefore not usable. But I haven't tried that yet. Is it possible to boot from a disk whose mirroring did not finish? I don't know. Does it make sense to try this?

But all this means to me that I'm not able to add a physically good disk to a RAID1 while the other disk has an I/O error. The ridiculous thing is that the server is running on the disk which has the I/O errors on bad sectors.

Is there a way to force the mirroring or bypass the bad sectors while building the RAID? They are obviously (or hopefully) empty. Well, as I said, this is beyond my knowledge...

Title: Re: RAID1 Problem (again)
Post by: TerryF on February 11, 2014, 09:18:55 AM

Remove the faulty drive, Boot off the good sdb, you end up with a single disk raid 1..

Procure a new HD the same size as sdb. Use an old one if that's all you have, but make sure to "zero" it; buying a new one is a better idea.

Power down the machine, put it in the new HD, reboot the system, go to the console and check that the system has now identified a new HD has been added, let it sync up, will take some time..

You are back with a two drive Raid 1 system

Title: Re: RAID1 Problem (again)
Post by: SchulzStefan on February 11, 2014, 09:22:40 AM

How can I remove the faulty drive, while this is the disk the server is running with?

Title: Re: RAID1 Problem (again)
Post by: TerryF on February 11, 2014, 09:30:34 AM

Quote from: SchulzStefan on February 11, 2014, 08:43:09 AM

I booted the machine from the sdb disk. Therefore I changed the sdb to sda on the board. The server booted.

Sorry I don't get your last comment..according to what you said above the machine was boted up on the old sdb, is that not correct?

Throw the faulty drive in the rubbish, buy a new drive of the same size, put it in the machine..power it up.

Check that the new drive is found and has started to sync from the console..

Whats not to do..??

sorry spelling "booted"

Title: Re: RAID1 Problem (again)
Post by: SchulzStefan on February 11, 2014, 09:49:56 AM

Quote

the machine was boted up on the old sdb, is that not correct?

Yes, the machine booted up with the sdb disk.

Quote

Throw the faulty drive in the rubbish

The faulty drive is running the server.

Quote

Quote
Device: /dev/sda [SAT], 1 Currently unreadable (pending) sectors

Quote

buy a new drive of the same size, put it in the machine..power it up.

Check that the new drive is found and has started to sync from the console.

That's what I did.

IMHO syncing will not finish because of an I/O error on the sda disk, the server is running with.

Title: Re: RAID1 Problem (again)
Post by: SchulzStefan on February 11, 2014, 10:06:12 AM

Meanwhile I installed ddrescue from the EPEL repo.

1. Disable all services.

SVC='qpsmtpd sqpsmtpd crond imap pop3 imaps pop3s ftp httpd-e-smith atalk smb qmail'
for s in $SVC; do service $s stop; done

hylafax, zarafa and firebird are also running, as well as the bridge contrib; of course they have to be stopped too.

2. Stop the RAID

mdadm --stop /dev/md1
mdadm --stop /dev/md2

3. Copy the data from sda to sdb

ddrescue -r 3 /dev/sda /dev/sdb my_log
ddrescue -R -r 3 /dev/sda /dev/sdb my_log

Will this way work? Anybody done this before?

Title: Re: RAID1 Problem (again)
Post by: TerryF on February 11, 2014, 11:54:17 AM

Sorry I am totally confused..

If the sdb drive booted, why is the sda HD still in the server..why keep trying to use a faulty drive?

If the system starts with the sdb as the only disk present then you can put in another and sync it up..

Title: Re: RAID1 Problem (again)
Post by: SchulzStefan on February 11, 2014, 01:42:13 PM

TerryF, thank you for your patience. Maybe I didn't explain it correctly (English is not my mother tongue).

Step-by-step:

1. server was dead. sda ended on a maintenance console because of inconsistent file system.
2. pulled sda out of the server.
3. changed sdb to sda
4. the server booted and is still running. No user reported any error so far.
5. examined the disk which I pulled out with the WD tool. No errors reported on either the short or the long test.
6. put the examined disk back in the server.
7. added the disk with the server admin menu back to the RAID
8. mirroring started.
9. finally got an email, that the following warning/error was logged by the smartd daemon: Device: /dev/sda [SAT], 1 Currently unreadable (pending) sectors
10. # cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb1[1] sda1[0]
104320 blocks [2/2] [UU]

md2 : active raid1 sdb2[2](S) sda2[0]
976655488 blocks [2/1] [U_]

unused devices: <none>

11. sdb is not clean in the RAID
12. I assume that because of an unreadable sector of *sda* (see above) the mirroring will not finish.
13. While the server is running without errors I would like to know
a) is there a way to force mdadm to mirror, and if not
b) does it make sense to copy all data from sda to sdb with ddrescue in the described way.

EDITED: Server is running with an error:

SMART error (CurrentPendingSector) detected on host: saturn
This email was generated by the smartd daemon running on:

host name: saturn
DNS domain: ivb.local
NIS domain: (none)

The following warning/error was logged by the smartd daemon:

Device: /dev/sda [SAT], 1 Currently unreadable (pending) sectors

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
Another email message will be sent in 1 days if the problem persists

As there are no reported errors from the users, I assume that this unreadable sector does not affect any program.

Title: Re: RAID1 Problem (again)
Post by: SchulzStefan on February 11, 2014, 08:56:02 PM

I couldn't stop md2.

Quote

2. Stop the RAID

mdadm --stop /dev/md1
mdadm --stop /dev/md2

Couldn't find any hint in the forum how to unmount /dev/md2.

Decided to boot the machine from a USB stick. The OS on the stick is Puppy Linux, Slacko 5.6. I had installed ddrescue on it beforehand.

Puppy recognized both disks. I did not mount the disks and performed

ddrescue --force -r 3 /dev/sda /dev/sdb log.txt

Copying of data is still in progress. Average rate is 133 MB/s. Right now I have 0 B errorsize and 0 errors. Rescued data is 480000 MB, so almost halfway through the 1 TB. I'll report when it's finished. It's getting late now...
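
A back-of-the-envelope estimate of the remaining copy time, using the figures above and the disk size from the kernel log (1000205 MB). It assumes the average rate stays at 133 MB/s, which it may not do towards the inner tracks of the disk or if bad areas need retries.

```shell
#!/usr/bin/env bash
total_mb=1000205      # disk size reported in the kernel log, in MB
rescued_mb=480000     # "Rescued data" at the time of this post
rate_mb_s=133         # average rate reported by ddrescue

# Remaining data divided by the (assumed constant) rate, integer seconds.
remaining_s=$(( (total_mb - rescued_mb) / rate_mb_s ))

echo "about $(( remaining_s / 60 )) minutes left"
```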

Title: Re: RAID1 Problem (again)
Post by: TerryF on February 11, 2014, 09:41:54 PM

Quote from: SchulzStefan on February 11, 2014, 08:56:02 PM

I couldn't stop md2.

Couldn't find any hint in the forum how to unmount /dev/md2.

Decided to boot the machine from an USB-Stick. OS on the stick is a puppy linux, slacko 5.6. I installed ddrescue before.

Puppy recognized both disks. I did not mount the disks and performed

ddrescue --force -r 3 /dev/sda /dev/sdb log.txt

Copying of data is still in progress. Average rate is 133 MB/s. Right now I got 0 B errorsize and 0 errors. Rescued data is 480000 MB. On the way to the half of the TB. I'll report when it's finished. It's getting late now...

Am I just not seeing something, are you using the inbuilt Raid support that is part of SME8..

or are you using a third party Raid system?

Good Luck

Title: Re: RAID1 Problem (again)
Post by: SchulzStefan on February 12, 2014, 08:19:54 AM

Quote

Decided to boot the machine from an USB-Stick. OS on the stick is a puppy linux, slacko 5.6. I installed ddrescue before.

Puppy recognized both disks. I did not mount the disks and performed

ddrescue --force -r 3 /dev/sda /dev/sdb log.txt

Copying of data is still in progress. Average rate is 133 MB/s. Right now I got 0 B errorsize and 0 errors. Rescued data is 480000 MB. On the way to the half of the TB. I'll report when it's finished. It's getting late now...

That worked. I pulled out the disk with the I/O error. Server is running without errors so far.

Title: Re: RAID1 Problem (again)
Post by: SchulzStefan on February 12, 2014, 09:14:53 PM

Today I exchanged the disk with the I/O error for a new one at my dealer. It was still under warranty.

Put the new disk in the server as sdb and added it to the array.

Got an email which tells me:

This email was generated by the smartd daemon running on:

host name: saturn
DNS domain: ivb.local
NIS domain: (none)

The following warning/error was logged by the smartd daemon:

Device: /dev/sda [SAT], 1 Currently unreadable (pending) sectors

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
Another email message will be sent in 1 days if the problem persists

I'm totally confused now. Don't know what to do with this.

Remember: I tested the running sda disk in the server with the manufacturer's disk test tool (yes, I did the long test) and no errors were reported. With ddrescue I copied all data from the old sda, the one with the I/O error, to the new disk, which is now sda. Until I added sdb to the RAID, the server was running without errors.

How can it be, that there is an I/O error again?? Did ddrescue copy the I/O error on this disk? Does this make sense???

Right now there are two brand new disks in the server, and I'm not able to build a RAID???

Title: Re: RAID1 Problem (again)
Post by: janet on February 13, 2014, 12:25:24 AM

SchulzStefan

I suggest you perform a full backup to locally connected USB disk, BEFORE you lose data, as you are at a big risk now of losing data.

Then do a new install of sme8.0 from CD & reformat (erase) the existing drives in the process & wipe all data & partitions etc from your current system
Select software RAID1 option, although with two disks installed the installer should default to that option.
Do NOT choose the upgrade option.

You should then have a system with a properly (& automatically) configured & sync'd software RAID1 array

Run yum update
Then restore from USB backup
Then reinstall contribs

Your system should be fully functional with a correctly configured software RAID1 array

I advise you in future to stop playing with RAID arrays & let your system automatically configure itself.

The admin console (log in as admin) has a menu option to add a drive to an array when a drive has been replaced due to faults or errors.

REMEMBER every single time you reuse a drive, even when from the same system, & BEFORE adding it to a RAID array either automatically or manually, you MUST erase the partition & MBR information using the dd command here
http://wiki.contribs.org/Raid#Reusing_Hard_Drives
eg
dd if=/dev/zero of=/dev/sdb bs=512 count=1
(replace sdb with the location of the drive to be erased eg sda sdb sdc sdd etc)
Check drives installed & their details using
fdisk -l

The above step (erase MBR) is ESSENTIAL to do whenever the drive has been already used, all drives added to arrays should (MUST) be blank
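
To make clear what the dd command above touches, here is a small sketch that runs it against a scratch file instead of a real disk. The scratch file and the variable names are made up for illustration; conv=notrunc is added only so that the file keeps its length, as a block device would.

```shell
#!/usr/bin/env bash
# Demonstration on a scratch file (NOT a real disk).
img=$(mktemp)                                               # stand-in for /dev/sdX
dd if=/dev/urandom of="$img" bs=512 count=8 2>/dev/null     # 4 KiB of "old data"

# Same command as in the post, pointed at the scratch file.
dd if=/dev/zero of="$img" bs=512 count=1 conv=notrunc 2>/dev/null

# Count the non-zero bytes left in the first sector and in the rest.
first_nonzero=$(( $(head -c 512 "$img" | tr -d '\0' | wc -c) ))
rest_nonzero=$(( $(tail -c +513 "$img" | tr -d '\0' | wc -c) ))

echo "sector 0 non-zero bytes: $first_nonzero, remaining non-zero bytes: $rest_nonzero"
rm -f "$img"
```

Only the first 512 bytes (the MBR with its partition table) are cleared; everything after that stays as it was. If old RAID metadata inside a former partition is a concern, mdadm has a --zero-superblock option for that, but check the mdadm man page for your version before using it.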

It seems to me you have been breaking some (technical) rules, & so you are having problems.

Title: Re: RAID1 Problem (again)
Post by: SchulzStefan on February 13, 2014, 01:35:27 AM

janet, thank you for your reply.

Quote

I suggest you perform a full backup to locally connected USB disk, BEFORE you lose data, as you are at a big risk now of losing data.

I'm doing this twice a day, once with affa and once with the server-manager.

Quote

REMEMBER every single time you reuse a drive, even when from the same system, & BEFORE adding it to a RAID array either automatically or manually, you MUST erase the partition & MBR information using the dd command here
http://wiki.contribs.org/Raid#Reusing_Hard_Drives
eg
dd if=/dev/zero of=/dev/sdb bs=512 count=1
(replace sdb with the location of the drive to be erased eg sda sdb sdc sdd etc)
Check drives installed & their details using
fdisk -l

The above step (erase MBR) is ESSENTIAL to do whenever the drive has been already used, all drives added to arrays should (MUST) be blank

The ddrescue documentation clearly says that any data stored on the destination will be overwritten. I assumed that ddrescue is doing exactly that. Maybe that was wrong, I don't really know.

Quote

I advise you in future to stop playing with RAID arrays & let your system automatically configure itself.

The admin console (log in as admin) has a menu option to add a drive to an array when a drive has been replaced due to faults or errors.

Well, I bought a new drive, (didn't format it, didn't partition it) plugged it in the server, and did what you suggest.

What makes me wonder is that before I made the copy onto the (now) sda disk, I ran a long test from WD. It took around 3 hours. No errors were reported. Then I ran ddrescue with this disk as the target.

I booted the server with this disk, having taken the other one out before, and I got no errors from sda until I plugged the new disk into the board and added sdb to the RAID via the admin console. How can this be?

Maybe the Western Digital WD10EFRX Red 1TB are the problem. They are made for NAS and RAID. I don't know. Need some time for this try. If it fails, I'll buy two new, other than the WD Red, disks.

Quote

Then do a new install of sme8.0 from CD & reformat (erase) the existing drives in the process & wipe all data & partitions etc from your current system
Select software RAID1 option, although with two disks installed the installer should default to that option.

and

Quote

REMEMBER every single time you reuse a drive, even when from the same system, & BEFORE adding it to a RAID array either automatically or manually, you MUST erase the partition & MBR information using the dd command here
http://wiki.contribs.org/Raid#Reusing_Hard_Drives
eg
dd if=/dev/zero of=/dev/sdb bs=512 count=1
(replace sdb with the location of the drive to be erased eg sda sdb sdc sdd etc)
Check drives installed & their details using
fdisk -l

As both disks are used now, I'll zero them as you suggest. But I'm not really convinced. We'll see, I'll let you know.

Title: Re: RAID1 Problem (again)
Post by: janet on February 13, 2014, 03:42:50 AM

SchulzStefan

Quote

As both disks are used now, I'll zero them as you suggest. But I'm not really convinced. We'll see, I'll let you know.

At least you have finally heeded the advice to zero the disks using the recommended method (ie use dd command & not ddrescue).

Partitions & MBRs are not data, so while ddrescue may "delete all data on target drive", it may not be doing what is really needed. I do not know as I do not use it. The point here is that you are copying disks with ddrescue, which is not what rebuilding an array is all about.
If one disk has a problem & you copy it, then usually you will copy the problem, get it?
When you mirror that disk in RAID1 with the (copied) problem, then you mirror the problem.

That's why I said to start afresh with a newly built RAID1 array, without any data on it, and then restore your data to that good & clean RAID1 array.
To me it is the quickest & best way to resolve your issues.

No one can easily troubleshoot your problem via Internet forum as you have made so many changes. It's really a hands on job to examine your drives etc.

Edit: To me it's prudent or wise to even zero out a new drive, at least that way you have totally ruled out the possibility of the drive being problematic for related reasons.

Title: Re: RAID1 Problem (again)
Post by: SchulzStefan on February 13, 2014, 09:39:30 AM

janet,

Quote

Edit: To me it's prudent or wise to even zero out a new drive, at least that way you have totally ruled out the possibility of the drive being problematic for related reasons.

I'll try your advice.

I got this from the brand new drive which I plugged in yesterday:

Feb 13 05:23:22 saturn kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Feb 13 05:23:22 saturn kernel: ata2.00: BMDMA stat 0x25
Feb 13 05:23:22 saturn kernel: ata2.00: cmd 35/00:08:02:59:70/00:00:74:00:00/e0 tag 0 dma 4096 out
Feb 13 05:23:22 saturn kernel: res 51/10:08:02:59:70/10:00:74:00:00/e0 Emask 0x81 (invalid argument)
Feb 13 05:23:22 saturn kernel: ata2.00: status: { DRDY ERR }
Feb 13 05:23:22 saturn kernel: ata2.00: error: { IDNF }
Feb 13 05:23:36 saturn kernel: ata2.00: configured for UDMA/133
Feb 13 05:23:36 saturn kernel: sd 1:0:0:0: Unhandled sense code
Feb 13 05:23:36 saturn kernel: sd 1:0:0:0: SCSI error: return code = 0x08000002
Feb 13 05:23:36 saturn kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Feb 13 05:23:36 saturn kernel: sdb: Current [descriptor]: sense key: Aborted Command
Feb 13 05:23:36 saturn kernel: Add. Sense: Recorded entity not found
Feb 13 05:23:36 saturn kernel:
Feb 13 05:23:38 saturn kernel: Descriptor sense data with sense descriptors (in hex):
Feb 13 05:23:38 saturn kernel: 72 0b 14 00 00 00 00 0c 00 0a 80 00 00 00 00 00
Feb 13 05:23:39 saturn kernel: 74 70 59 02
Feb 13 05:23:39 saturn kernel: ata2: EH complete
Feb 13 05:23:39 saturn kernel: raid1: Disk failure on sdb2, disabling device.
Feb 13 05:23:39 saturn kernel:    Operation continuing on 1 devices
Feb 13 05:23:39 saturn kernel: SCSI device sdb: 1953525168 512-byte hdwr sectors (1000205 MB)
Feb 13 05:23:39 saturn kernel: sdb: Write Protect is off
Feb 13 05:23:39 saturn kernel: sdb: Mode Sense: 00 3a 00 00
Feb 13 05:23:39 saturn kernel: SCSI device sdb: drive cache: write back

# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb1[1] sda1[0]
104320 blocks [2/2] [UU]

md2 : active raid1 sdb2[2](F) sda2[0]
976655488 blocks [2/1] [U_]

unused devices: <none>
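
The flag after sdb2 has changed from (S), spare, in the earlier output to (F), faulty, here. A small sketch for picking these flags out of mdstat text follows; the function name member_flags is made up for illustration, and it assumes the array line has the form "mdN : active raid1 members...".

```shell
#!/usr/bin/env bash
# Print "array member flag" for every member marked (S) spare or (F) faulty.
# Reads mdstat-formatted text on stdin.
member_flags() {
  awk '/^md[0-9]+ :/ {
    for (i = 5; i <= NF; i++)                  # members start at field 5
      if (match($i, /\((S|F)\)/))
        print $1, substr($i, 1, index($i, "[") - 1), substr($i, RSTART + 1, 1)
  }'
}

# Array lines copied from the output above.
printf '%s\n' \
  'md1 : active raid1 sdb1[1] sda1[0]' \
  'md2 : active raid1 sdb2[2](F) sda2[0]' | member_flags
```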

smartctl -x /dev/sdb
smartctl 5.42 2011-10-20 r3458 [i686-linux-2.6.18-371.4.1.el5PAE] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model: WDC WD10EFRX-68PJCN0
Serial Number: WD-WMC4J0189055
LU WWN Device Id: 5 0014ee 25ebeaa72
Firmware Version: 01.01A01
User Capacity: 1.000.204.886.016 bytes [1,00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ACS-2 (revision not indicated)
Local Time is: Thu Feb 13 09:07:42 2014 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00)   Offline data collection activity
               was never started.
               Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0)   The previous self-test routine completed
               without error or no self-test has ever
               been run.
Total time to complete Offline
data collection:       (13980) seconds.
Offline data collection
capabilities:           (0x7b) SMART execute Offline immediate.
               Auto Offline data collection on/off support.
               Suspend Offline collection upon new
               command.
               Offline surface scan supported.
               Self-test supported.
               Conveyance Self-test supported.
               Selective Self-test supported.
SMART capabilities: (0x0003)   Saves SMART data before entering
               power-saving mode.
               Supports SMART auto save timer.
Error logging capability: (0x01)   Error logging supported.
               General Purpose Logging supported.
Short self-test routine
recommended polling time:     ( 2) minutes.
Extended self-test routine
recommended polling time:     ( 159) minutes.
Conveyance self-test routine
recommended polling time:     ( 5) minutes.
SCT capabilities:      (0x303d)   SCT Status supported.
               SCT Error Recovery Control supported.
               SCT Feature Control supported.
               SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
1 Raw_Read_Error_Rate POSR-K 100 253 051 - 0
3 Spin_Up_Time POS--K 100 253 021 - 0
4 Start_Stop_Count -O--CK 100 100 000 - 1
5 Reallocated_Sector_Ct PO--CK 200 200 140 - 0
7 Seek_Error_Rate -OSR-K 200 200 000 - 0
9 Power_On_Hours -O--CK 100 100 000 - 17
10 Spin_Retry_Count -O--CK 100 253 000 - 0
11 Calibration_Retry_Count -O--CK 100 253 000 - 0
12 Power_Cycle_Count -O--CK 100 100 000 - 1
192 Power-Off_Retract_Count -O--CK 200 200 000 - 0
193 Load_Cycle_Count -O--CK 200 200 000 - 1047
194 Temperature_Celsius -O---K 115 109 000 - 28
196 Reallocated_Event_Count -O--CK 200 200 000 - 0
197 Current_Pending_Sector -O--CK 200 200 000 - 0
198 Offline_Uncorrectable ----CK 100 253 000 - 0
199 UDMA_CRC_Error_Count -O--CK 200 200 000 - 0
200 Multi_Zone_Error_Rate ---R-- 100 253 000 - 0
||||||_ K auto-keep
|||||__ C event count
||||___ R error rate
|||____ S speed/performance
||_____ O updated online
|______ P prefailure warning
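
A minimal sketch for pulling raw values out of an attribute table like the one above. It assumes the eight-column layout printed by smartctl -x here (ID, name, flags, value, worst, thresh, fail, raw); the function name smart_raw is made up for illustration. Note that this table is for /dev/sdb, while the smartd mails complain about /dev/sda.

```shell
#!/usr/bin/env bash
# Print "attribute_name raw_value" for the comma-separated IDs given in $1.
# Reads the attribute table on stdin.
smart_raw() {
  awk -v ids="$1" '
    BEGIN { n = split(ids, a, ","); for (i = 1; i <= n; i++) want[a[i]] = 1 }
    ($1 in want) && NF >= 8 { print $2, $NF }   # last column is the raw value
  '
}

# A few rows copied from the table above.
sample='5 Reallocated_Sector_Ct PO--CK 200 200 140 - 0
9 Power_On_Hours -O--CK 100 100 000 - 17
193 Load_Cycle_Count -O--CK 200 200 000 - 1047
197 Current_Pending_Sector -O--CK 200 200 000 - 0'

smart_raw 5,193,197 <<< "$sample"
```

On the live system the input could come from `smartctl -A /dev/sda` instead of the sample text.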

General Purpose Log Directory Version 1
SMART Log Directory Version 1 [multi-sector log support]
GP/S Log at address 0x00 has 1 sectors [Log Directory]
SMART Log at address 0x01 has 1 sectors [Summary SMART error log]
SMART Log at address 0x02 has 5 sectors [Comprehensive SMART error log]
GP Log at address 0x03 has 6 sectors [Ext. Comprehensive SMART error log]
SMART Log at address 0x06 has 1 sectors [SMART self-test log]
GP Log at address 0x07 has 1 sectors [Extended self-test log]
SMART Log at address 0x09 has 1 sectors [Selective self-test log]
GP Log at address 0x10 has 1 sectors [NCQ Command Error log]
GP Log at address 0x11 has 1 sectors [SATA Phy Event Counters]
GP Log at address 0x21 has 1 sectors [Write stream error log]
GP Log at address 0x22 has 1 sectors [Read stream error log]
GP/S Log at address 0x80 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x81 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x82 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x83 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x84 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x85 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x86 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x87 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x88 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x89 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8a has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8b has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8c has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8d has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8e has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8f has 16 sectors [Host vendor specific log]
GP/S Log at address 0x90 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x91 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x92 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x93 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x94 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x95 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x96 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x97 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x98 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x99 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9a has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9b has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9c has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9d has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9e has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9f has 16 sectors [Host vendor specific log]
GP/S Log at address 0xa0 has 16 sectors [Device vendor specific log]
GP/S Log at address 0xa1 has 16 sectors [Device vendor specific log]
GP/S Log at address 0xa2 has 16 sectors [Device vendor specific log]
GP/S Log at address 0xa3 has 16 sectors [Device vendor specific log]
GP/S Log at address 0xa4 has 16 sectors [Device vendor specific log]
GP/S Log at address 0xa5 has 16 sectors [Device vendor specific log]
GP/S Log at address 0xa6 has 16 sectors [Device vendor specific log]
GP/S Log at address 0xa7 has 16 sectors [Device vendor specific log]
GP/S Log at address 0xa8 has 1 sectors [Device vendor specific log]
GP/S Log at address 0xa9 has 1 sectors [Device vendor specific log]
GP/S Log at address 0xaa has 1 sectors [Device vendor specific log]
GP/S Log at address 0xab has 1 sectors [Device vendor specific log]
GP/S Log at address 0xac has 1 sectors [Device vendor specific log]
GP/S Log at address 0xad has 1 sectors [Device vendor specific log]
GP/S Log at address 0xae has 1 sectors [Device vendor specific log]
GP/S Log at address 0xaf has 1 sectors [Device vendor specific log]
GP/S Log at address 0xb0 has 1 sectors [Device vendor specific log]
GP/S Log at address 0xb1 has 1 sectors [Device vendor specific log]
GP/S Log at address 0xb2 has 1 sectors [Device vendor specific log]
GP/S Log at address 0xb3 has 1 sectors [Device vendor specific log]
GP/S Log at address 0xb4 has 1 sectors [Device vendor specific log]
GP/S Log at address 0xb5 has 1 sectors [Device vendor specific log]
GP/S Log at address 0xb6 has 1 sectors [Device vendor specific log]
GP/S Log at address 0xb7 has 1 sectors [Device vendor specific log]
GP/S Log at address 0xbd has 1 sectors [Device vendor specific log]
GP/S Log at address 0xc0 has 1 sectors [Device vendor specific log]
GP Log at address 0xc1 has 93 sectors [Device vendor specific log]
GP/S Log at address 0xe0 has 1 sectors [SCT Command/Status]
GP/S Log at address 0xe1 has 1 sectors [SCT Data Transfer]

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
Device Error Count: 1
   CR = Command Register
   FEATR = Features Register
   COUNT = Count (was: Sector Count) Register
   LBA_48 = Upper bytes of LBA High/Mid/Low Registers ] ATA-8
   LH = LBA High (was: Cylinder High) Register ] LBA
   LM = LBA Mid (was: Cylinder Low) Register ] Register
   LL = LBA Low (was: Sector Number) Register ]
   DV = Device (was: Device/Head) Register
   DC = Device Control Register
   ER = Error register
   ST = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1

occurred at disk power-on lifetime: 13 hours (0 days + 13 hours)

When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
10 -- 51 00 08 00 00 74 70 59 02 e0 00 Error: IDNF 8 sectors at LBA = 0x74705902 = 1953519874

Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
35 00 00 00 08 00 00 74 70 59 02 e0 08 13:45:17.466 WRITE DMA EXT
ea 00 00 00 00 00 00 74 70 59 09 e0 08 13:45:12.148 FLUSH CACHE EXT
ea 00 00 00 00 00 00 74 70 59 09 e0 08 13:45:11.575 FLUSH CACHE EXT
35 00 00 00 08 00 00 74 70 59 02 e0 08 13:45:06.056 WRITE DMA EXT
ea 00 00 00 00 00 00 74 70 59 09 e0 08 13:45:04.380 FLUSH CACHE EXT

SMART Extended Self-test Log Version: 1 (1 sectors)
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version: 3
SCT Version (vendor specific): 258 (0x0102)
SCT Support Level: 1
Device State: Active (0)
Current Temperature: 28 Celsius
Power Cycle Min/Max Temperature: 25/34 Celsius
Lifetime Min/Max Temperature: 25/34 Celsius
Under/Over Temperature Limit Count: 0/0
SCT Temperature History Version: 2
Temperature Sampling Period: 1 minute
Temperature Logging Interval: 1 minute
Min/Max recommended Temperature: 0/60 Celsius
Min/Max Temperature Limit: -41/85 Celsius
Temperature History Size (Index): 478 (92)

Index Estimated Time Temperature Celsius
93 2014-02-13 01:10 29 **********
... ..( 3 skipped). .. **********
97 2014-02-13 01:14 29 **********
98 2014-02-13 01:15 30 ***********
... ..( 7 skipped). .. ***********
106 2014-02-13 01:23 30 ***********
107 2014-02-13 01:24 31 ************
108 2014-02-13 01:25 31 ************
109 2014-02-13 01:26 30 ***********
110 2014-02-13 01:27 31 ************
... ..( 19 skipped). .. ************
130 2014-02-13 01:47 31 ************
131 2014-02-13 01:48 32 *************
... ..( 34 skipped). .. *************
166 2014-02-13 02:23 32 *************
167 2014-02-13 02:24 31 ************
... ..( 14 skipped). .. ************
182 2014-02-13 02:39 31 ************
183 2014-02-13 02:40 30 ***********
... ..( 20 skipped). .. ***********
204 2014-02-13 03:01 30 ***********
205 2014-02-13 03:02 29 **********
... ..( 4 skipped). .. **********
210 2014-02-13 03:07 29 **********
211 2014-02-13 03:08 30 ***********
212 2014-02-13 03:09 30 ***********
213 2014-02-13 03:10 30 ***********
214 2014-02-13 03:11 29 **********
215 2014-02-13 03:12 30 ***********
... ..( 7 skipped). .. ***********
223 2014-02-13 03:20 30 ***********
224 2014-02-13 03:21 29 **********
225 2014-02-13 03:22 30 ***********
226 2014-02-13 03:23 29 **********
... ..( 87 skipped). .. **********
314 2014-02-13 04:51 29 **********
315 2014-02-13 04:52 28 *********
... ..( 10 skipped). .. *********
326 2014-02-13 05:03 28 *********
327 2014-02-13 05:04 29 **********
... ..(239 skipped). .. **********
89 2014-02-13 09:04 29 **********
90 2014-02-13 09:05 28 *********
91 2014-02-13 09:06 28 *********
92 2014-02-13 09:07 28 *********

SCT Error Recovery Control:
Read: 70 (7,0 seconds)
Write: 70 (7,0 seconds)

SATA Phy Event Counters (GP Log 0x11)
ID Size Value Description
0x0001 2 0 Command failed due to ICRC error
0x0002 2 0 R_ERR response for data FIS
0x0003 2 0 R_ERR response for device-to-host data FIS
0x0004 2 0 R_ERR response for host-to-device data FIS
0x0005 2 0 R_ERR response for non-data FIS
0x0006 2 0 R_ERR response for device-to-host non-data FIS
0x0007 2 0 R_ERR response for host-to-device non-data FIS
0x0008 2 0 Device-to-host non-data FIS retries
0x0009 2 5 Transition from drive PhyRdy to drive PhyNRdy
0x000a 2 6 Device-to-host register FISes sent due to a COMRESET
0x000b 2 0 CRC errors within host-to-device FIS
0x000f 2 0 R_ERR response for host-to-device data FIS, CRC
0x0012 2 0 R_ERR response for host-to-device non-data FIS, CRC
0x8000 4 62972 Vendor specific

Besides your hints, it seems to me that it could also be a misconfiguration or an incompatibility.
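To see which entries in the SATA Phy Event Counters table above are actually non-zero, a small filter helps. This is only a sketch, assuming Python 3 on any workstation and the table pasted in as text; the `nonzero_counters` helper and the `SAMPLE` excerpt are illustrative names, not part of smartctl.

```python
import re

# One table row: ID (hex), size in bytes, value, free-text description.
ROW = re.compile(r"^(0x[0-9a-fA-F]{4})\s+(\d+)\s+(\d+)\s+(.+)$")

def nonzero_counters(text):
    """Return (id, value, description) for every counter whose value is not 0."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m and int(m.group(3)) != 0:
            rows.append((m.group(1), int(m.group(3)), m.group(4)))
    return rows

# Excerpt copied from the smartctl output above (hypothetical variable name).
SAMPLE = """\
ID Size Value Description
0x0001 2 0 Command failed due to ICRC error
0x0009 2 5 Transition from drive PhyRdy to drive PhyNRdy
0x000a 2 6 Device-to-host register FISes sent due to a COMRESET
0x000b 2 0 CRC errors within host-to-device FIS
0x8000 4 62972 Vendor specific
"""

for ident, value, description in nonzero_counters(SAMPLE):
    print(ident, value, description)
```

In this output all CRC and R_ERR counters are zero, while the PhyRdy to PhyNRdy transitions (5) and COMRESETs (6) are not. Those counts may simply come from reboots and re-plugging the drive, so on their own they do not prove a cabling or link problem.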

Quote

Feb 13 05:23:22 saturn kernel: ata2.00: cmd 35/00:08:02:59:70/00:00:74:00:00/e0 tag 0 dma 4096 out
Feb 13 05:23:22 saturn kernel: res 51/10:08:02:59:70/10:00:74:00:00/e0 Emask 0x81 (invalid argument)

It does not seem normal to me that a brand-new disk has DMA errors after a few hours of running.
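The two kernel lines quoted above can be decoded to check that they describe the same event as the smartctl error log. The sketch below assumes the usual libata print order (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device) and Python 3 on any workstation; `parse_taskfile` and `lba48` are illustrative helper names.

```python
import re

# Field order of a libata "cmd"/"res" taskfile string. In a "res" line the
# first field is the status register and the second is the error register.
FIELDS = ("command", "feature", "nsect", "lbal", "lbam", "lbah",
          "hob_feature", "hob_nsect", "hob_lbal", "hob_lbam", "hob_lbah",
          "device")

def parse_taskfile(text):
    """Split a taskfile string into a dict of named integer fields."""
    parts = re.split(r"[/:]", text.strip())
    if len(parts) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(parts)))
    return dict(zip(FIELDS, (int(p, 16) for p in parts)))

def lba48(tf):
    """Assemble the 48-bit LBA from the low and high-order (hob) bytes."""
    return (tf["hob_lbah"] << 40 | tf["hob_lbam"] << 32 | tf["hob_lbal"] << 24
            | tf["lbah"] << 16 | tf["lbam"] << 8 | tf["lbal"])

cmd = parse_taskfile("35/00:08:02:59:70/00:00:74:00:00/e0")
res = parse_taskfile("51/10:08:02:59:70/10:00:74:00:00/e0")

IDNF = 0x10  # error register bit 4: ID not found

print("command   : 0x%02x" % cmd["command"])   # 0x35 is WRITE DMA EXT
print("sectors   : %d" % cmd["nsect"])
print("LBA       : %d" % lba48(cmd))
print("IDNF set  : %s" % bool(res["feature"] & IDNF))
```

The decoded LBA is 1953519874 and the error register has the IDNF bit set, which matches "Error: IDNF 8 sectors at LBA = 0x74705902" in the SMART log: a write of 8 sectors (4096 bytes) that the drive rejected with an address error.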

Title: Re: RAID1 Problem (again)
Post by: SchulzStefan on February 13, 2014, 10:04:43 AM

WD says the Western Digital Red WD10EFRX is Linux compatible.

BUT: the disks are SATA 6 Gb/s (SATA 3). The board is an ASUS P5Q SE2, which only has SATA 3 Gb/s. I assume that this may be the reason for the errors I have with these disks. If so, I'm wondering whether the disks are not backward compatible.

Edit: WD says also, they are backward compatible.

I asked my dealer if SATA 6 Gb/s is backward compatible. He told me there should be no problem. Perhaps it's a specific WD problem? I could buy a Seagate disk and see how that works. Don't know how to proceed...

Can anybody confirm?

Title: Re: RAID1 Problem (again)
Post by: janet on February 13, 2014, 12:29:39 PM

SchulzStefan

Quote

Don't know how to proceed...

You keep asking for help.
Did you follow the advice already given, i.e. "backup, rebuild & restore"?

Title: Re: RAID1 Problem (again)
Post by: SchulzStefan on February 13, 2014, 01:46:18 PM

janet,

I can't rebuild the server (with zeroing the disks) during the week. I have to wait for the weekend. I will then try what you suggested.

Quote

You keep asking for help.

Sure. If anybody could confirm that a DMA error is caused by an incompatibility between SATA 3 Gb/s and SATA 6 Gb/s with these disks, I could save the time of rebuilding the server twice.

Title: Re: RAID1 Problem (again)
Post by: janet on February 13, 2014, 01:57:31 PM

SchulzStefan

Quote

If anybody could confirm that a DMA error is caused by an incompatibility between SATA 3 Gb/s and SATA 6 Gb/s with these disks, I could save the time of rebuilding the server twice.

That is just a maximum rated data transfer speed. Any disk can run at a slower rate, and in practice it will anyway, depending on how much data is being read or saved.
I doubt that there are any speed incompatibility issues.
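To put numbers on the "maximum rated" speed: SATA uses 8b/10b encoding on the link, so every data byte costs 10 line bits. A rough sketch of the arithmetic, assuming Python 3 and ignoring protocol overhead (`payload_mb_per_s` is an illustrative name):

```python
def payload_mb_per_s(line_rate_gbit):
    """Upper bound on payload throughput for a SATA link using 8b/10b encoding."""
    line_bits_per_s = line_rate_gbit * 1e9
    bytes_per_s = line_bits_per_s / 10  # 10 line bits carry one data byte
    return bytes_per_s / 1e6

for rate in (3, 6):
    print("%d Gb/s link: at most %.0f MB/s" % (rate, payload_mb_per_s(rate)))
```

A 6 Gb/s drive on a 3 Gb/s port negotiates the lower link speed, so the ceiling is about 300 MB/s instead of about 600 MB/s.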

Title: Re: RAID1 Problem (again)
Post by: SchulzStefan on February 14, 2014, 08:45:01 PM

janet,

Quote

I suggest you perform a full backup to locally connected USB disk

Question: is it o.k. to use my workstation backup (configured via server-manager), or do you suggest to do a backup to USB via the admin-console?

Title: Re: RAID1 Problem (again)
Post by: janet on February 15, 2014, 12:03:29 AM

SchulzStefan

Quote

is it o.k. to use my workstation backup (configured via server-manager), or do you suggest to do a backup to USB via the admin-console?

Either method should work, there are pros & cons.
In order to use the workstation backup to restore from, you will need to configure your newly rebuilt SME server (with a clean OS install from CD) so that it has the same local IP & network configuration, server name & domain name etc. Then set up the identical backup job in server manager. Doing that will allow you to perform the restore from within the server manager panel.

Backup & restore to/from a workstation across a network connection will usually be slower. Restoring 500 GB could take anywhere from 4-24 hours depending on your system specs etc.
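The 4-24 hour range can be turned into the sustained throughput it implies. A minimal sketch of the arithmetic, assuming Python 3 and reading the 500 figure as decimal gigabytes (1 GB = 1000 MB); `required_mb_per_s` is an illustrative name:

```python
def required_mb_per_s(size_gb, hours):
    """Average throughput needed to move size_gb gigabytes in the given hours."""
    megabytes = size_gb * 1000.0
    seconds = hours * 3600.0
    return megabytes / seconds

for hours in (4, 24):
    print("%2d h -> %.1f MB/s" % (hours, required_mb_per_s(500, hours)))
```

That works out to roughly 35 MB/s for the 4 hour case and roughly 6 MB/s for the 24 hour case, so the estimate covers anything from a healthy gigabit or local USB transfer down to a slow network link.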

Using the same workstation backup server manager panel, but backing up to a locally connected USB drive, removes workstation & network speed issues. You still need to recreate the backup job in server manager on the new server (with identical details), in order to do the restore.

If you use the admin console "one off" backup to USB, then it is very easy to start the restore. At first boot of the new server after installing OS from CD, you will be asked once if you want to restore from local USB drive. That is when you connect the USB drive you performed the console backup to. It will start the restore immediately then without needing any further configuration of the server. So the admin console backup has an easier restore procedure.

This feature ideally suits rebuilding SME on newer or different hardware.

The method you use is your choice. I personally prefer one of the backup to locally connected USB methods, as there is less hassle with networks & workstations then.

Title: Re: RAID1 Problem (again)
Post by: SchulzStefan on February 15, 2014, 09:57:42 AM

janet,

I added several directories to the standard backup directories. Will they be saved via USB-Backup also? And what about the compression?

Title: Re: RAID1 Problem (again)
Post by: janet on February 15, 2014, 10:07:38 AM

SchulzStefan

Quote

I added several directories to the standard backup directories. Will they be saved via USB-Backup also? And what about the compression?

If you followed this
http://wiki.contribs.org/Backup_with_dar
& particularly this
http://wiki.contribs.org/Backup_with_dar#Adding_Files_and_Directories
then additional folders should be included in your backup, made using one of the standard default SME backup methods.

You can easily check a backup to see what is included.
IIRC dar files can be viewed in midnight commander & other ways too, Google for it.

Read this document for useful info about backup & restore, and alternative approaches.
It may help you to better understand the backup & restore process.
http://wiki.contribs.org/Backup_server_config

Title: Re: RAID1 Problem (again)
Post by: SchulzStefan on February 15, 2014, 09:05:40 PM

janet,

Quote

Run yum update
Then restore from USB backup
Then reinstall contribs

Edit:
Is a reboot required after the restore? Or should I install the contribs before rebooting?

Are you sure about the order? Why not install the contribs first and then restore?

Title: Re: RAID1 Problem (again)
Post by: janet on February 15, 2014, 11:33:49 PM

SchulzStefan

Quote

Are you sure about the order? Why not install the contribs first and then restore?

Why?
Do you have better knowledge of SME server than the developers ?

The procedure order:
Install clean OS
Restore from backup
Reinstall contribs
is and always has been the recommended procedure by devs for performing a SME server full backup & restore.

Note too that a full restore should NEVER be performed to a SME server that is configured with users, ibays & data & so on, other than basic network setup required to allow the restore to be performed (as that basic setup will be overwritten by the restored data).

A full restore expects the SME server file system to be in a certain predefined "clean" state, as exists after a fresh install from a CD.
Other non default settings & configuration (eg ibays that are not in the backup) are not overwritten or written to the databases, refer to the standard backup inclusion to see what is written back to the server.
You can end up with a server in an indeterminate state if you do a full restore to an existing (not clean) highly configured installation.

Here is an example to prove this design concept:
if you use the admin console backup, you are then given an opportunity to restore from USB on first boot of the server immediately after you install the fresh OS from CD, clearly well before any contribs have been or can be reinstalled.

Obviously you did not read the Backup Server Config wiki Howto I pointed you to, as it is described there also, and in many posts in these forums by key developers.

IIUC the rpm contrib package will reconfigure itself with new default data if no restored data exists.
If restore is done first, then when the contrib rpm is installed, the rpm will see the existing config & data for the contrib and use that instead, so no need to reconfigure the contrib.

Title: Re: RAID1 Problem (again)
Post by: SchulzStefan on February 16, 2014, 09:02:03 AM

janet,

Quote

Do you have better knowledge of SME server than the developers ?

Where did I say that?

I couldn't find an answer to my question:

Quote

Edit:
Is a reboot required after the restore? Or should I install the contribs before rebooting?

Quote

Obviously you did not read the Backup Server Config wiki Howto I pointed you to, as it is described there also, and in many posts in these forums by key developers.

Where has this question been answered?

It was a technical question. After restoring the server from the backup, settings for several contribs have also been restored. While installing the contribs the restored settings might be overwritten. And how do I get these settings back?

Example: Installing firebird will overwrite the restored settings. How about openvpn bridge, zarafa, phpki and so on?

Title: Re: RAID1 Problem (again)
Post by: janet on February 16, 2014, 09:48:57 AM

SchulzStefan

You edited your original post after I answered it, so I did not see that question until now.

After a full restore you need to do
signal-event post-upgrade
signal-event reboot

The restored data & config does not become effective until a post upgrade & reboot is performed.

After reboot then reinstall your contribs.

Quote

While installing the contribs the restored settings might be overwritten. And how do I get these settings back?

I already answered that, please read my answers carefully.
The rpm command takes care of that, it identifies the contrib being installed has existing data & uses that without changing it, & without using a new default data set.

I am referring to installing contribs from rpm packages that are built to behave this way, if correctly built by authors.
Maybe Firebird is not an rpm or does not conform to sme standards of behaviour, I do not use it or know it.

Please read a Linux manual all about the rpm command & package creation & install.
You should even read the SME Server Developers Guide & other useful contribs.org wiki articles about creating rpm packages.

You can do as much testing as you want to do to prove what I am saying, feel free to set up a test server & play, play, play until you are content & fully understand the way various aspects of SME server work.
I have been doing that for 14 years now.

Title: Re: RAID1 Problem (again)
Post by: janet on February 16, 2014, 09:58:29 AM

SchulzStefan

As you have so many questions & concerns & counter opinions about procedures & specifics of outcomes, I STRONGLY suggest you setup a test SME server & run through the whole backup & restore process, including installation of contribs etc, until you are fully conversant with it all & understand all steps involved to your complete satisfaction.
I think that will be the only way to satisfy your mind.

Title: Re: RAID1 Problem (again)
Post by: SchulzStefan on February 17, 2014, 08:46:23 AM

janet,

thank you for your help and your patience with me. I proceeded as you suggested.

The server is rebuilt and up. I used my backup from the server-manager. Before the install I zeroed both disks. The RAID is rebuilt and healthy, and I have seen no I/O errors so far.

Well, re-installing all contribs is quite a lot of work. Not all are that smooth to re-install. I had to manually adjust phpki, firebird (not a contrib), virtualbox, phpvirtualbox.

I'm surprised that there is no script to save a list of all installed contribs (or addons), with their specific settings and configurations, in a backup. Don't get me wrong, I'm just a user, not a developer nor a programmer. A script that would install all contribs (and/or addons) after a restore would make life a lot easier.

I will follow your hint trying things out in a virtual machine - there's only a little time problem.

stefan

Title: Re: RAID1 Problem (again)
Post by: janet on February 17, 2014, 12:12:54 PM

SchulzStefan

The SME server recommended (or standard) backup & restore procedure is designed to work between same versions (eg 8.0 to 8.0) or different versions (eg 7.6 to 8.0), & also within the same major version (eg 7.1 to 7.4, or 8.0 to 8.1).

Inherent in that design concept, is that a contrib designed to run on SME 7.x (el4 rpms) may not function correctly on SME 8.0 (el5 rpms), so if a system (including installed contribs) is "automatically" upgraded or restored, then there is a high likelihood that the contrib(s) will no longer function correctly (eg due to underlying system design changes, dependency requirements, etc, etc).

This means a compatible version of the contrib must be reinstalled.

Unfortunately there is no other simple technical way to cover all upgrade possibilities.

It would be a good thing to be able to include in any backup/restore/upgrade all possible contribs, add ons & unique changes that a myriad of users could possibly make.
It would mean the SME code would need to cater for ANYTHING that an end user could do, which in reality would be technically impossible to achieve from a practical or realistic viewpoint. The code would be staggeringly complex to cater for every unknown eventuality.

I refer you to the debacle with the SME 6.x version. IIRC there was an official v6.0 version & another point release, but there was also an unofficial iso created (called something like 6.01.1), which included a lot of contribs & non standard configuration. More "gung ho" users installed the unofficial release thinking it was a good idea & an easy way to install contribs etc in bulk, even when not needed.
IIRC separate to that iso there was also a large script of a few thousand lines that would automatically install every contrib that was available, again under the pretext that this "was a good thing to do".

Both of these caused a lot of problems for end users & for official developers at upgrade time when SME7 was released. Much additional effort had to be put into uninstalling contribs & removing specialised non standard configuration etc, so that the base install of SME 7 would work OK.

It's just not practical to implement your suggestion.
Even one errant contrib rpm or custom template piece of code can stop a server from functioning correctly. Make that 5 or 10 or a few dozen contribs, & your server could be impossible to troubleshoot for a novice admin and quite difficult for an experienced admin.

The practical answer (& long time recommended approach by developers), is to create an ibay called installedcontribs or similar. In that you put every contrib rpm you install on the server, and you maintain text files or similar that have all configuration & setup steps plus all changes & tweaks you make during the lifetime of the server. You will need a systematic approach to maintaining the data in this ibay, & need to make the time & effort to keep the information up to date whenever you make changes etc.
When restore or upgrade time comes (as this ibay is included in backups), you have all the information needed to reinstate contribs etc.
Simply look in the restored ibay & start reinstalling contribs & tweaks.
You will still need to obtain compatible contrib rpms etc, but at least you have everything you have done to that server "at your fingertips".
I think professionals call this (or equate it with) "keeping a server technical support log file".
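One low-effort way to feed such a log is to compare package lists rather than install anything automatically. The sketch below only reports which packages were added since a baseline; it assumes you saved the output of `rpm -qa` to a text file in the ibay right after the clean install and again later, and that the comparison runs with Python 3 on any workstation. The function names and the example package names are hypothetical.

```python
def parse_rpm_list(text):
    """Return the set of non-empty package lines from saved `rpm -qa` output."""
    return {line.strip() for line in text.splitlines() if line.strip()}

def added_since(baseline_text, current_text):
    """Packages present now but not in the baseline, sorted for readability."""
    return sorted(parse_rpm_list(current_text) - parse_rpm_list(baseline_text))

# Hypothetical package names, standing in for the contents of the two files.
baseline = """\
example-base-1.0-1.noarch
example-kernel-2.0-3.i686
"""
current = """\
example-base-1.0-1.noarch
example-kernel-2.0-3.i686
example-contrib-0.5-2.noarch
"""

for package in added_since(baseline, current):
    print(package)
```

The resulting list is only a reminder of what to reinstall. It does not capture configuration or tweaks, and, as explained above, a compatible version of each contrib still has to be obtained for the target SME release.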

Quote

Well, re-installing all contribs is quite a lot of work. Not all are that smooth to re-install. I had to manually adjust phpki, firebird (not a contrib), virtualbox, phpvirtualbox.

I'm surprised that there is no script to save a list of all installed contribs (or addons), with their specific settings and configurations, in a backup. Don't get me wrong, I'm just a user, not a developer nor a programmer. A script that would install all contribs (and/or addons) after a restore would make life a lot easier.