One of my colleagues has been facing to weird behaviour of ACFS during installation of two nodes RAC on RHEL 7 and OEL7 (started on RHEL, then tried OEL). ACFS volume creation (during mkfs execution) one of the nodes has either hung or thrown error like this:
mkfs.acfs: version = 12.1.0.2.0
mkfs.acfs: on-disk version = 39.0
mkfs.acfs: volume = /dev/asm/goldengate-32
mkfs.acfs: CLSU-00100: operating system function: ioctl failed with error data: 1
mkfs.acfs: CLSU-00101: operating system error message: Operation not permitted
mkfs.acfs: CLSU-00103: error location: OI_0
mkfs.acfs: ACFS-00546: failed to change on-disk signature
mkfs.acfs: ACFS-01004: /dev/asm/goldengate-32 was not formatted.
After couple of days of his struggling I decided to help him to find out the problem because my colleague is very experienced thus I knew that he has done everything correctly (well, definitely it wasn't his first rodeo). Moreover I have done similar installation just few days before, without any problem. And that was interesting. Firstly I checked logs and of course configuration (udev, multipath, etc.), it had been correct and traced (using one of my closest friends since 2000) that process with this result:
2216 stat("/dev/asm/goldengate-34", {st_mode=S_IFBLK|0770, st_rdev=makedev(251, 17409), ...}) = 0
2216 open("/dev/ofsctl", O_RDWR) = 9
2216 ioctl(9, 0xffffffffc1387015, 0x7ffdbee52880) = -1 EPERM (Operation not permitted)
And this was really weird as permissions for these devices are defined by Udev configuration which is created during installation.
Gathered all important information about environment I've created an action plan. Installation of the same environment on my virtual machine. Well I'd rather say almost the same as I decided to not install the clustered environment in first step and of course used hardware was different and hypervisor has been as well. Version of all software was pretty much same as in original (my colleague's) environment, except one so important thing - Linux kernel. I decided to use original (non UEK) kernel first.
Installation went smoothly, ACFS had smooth configuration and later functionality too, so two options (or unanswered question) remained - whether it's kernel related or RAC related problem where the first option was my preference according to previous tracing.
So I've installed exact version (3.8.13-118.6.2.el7uek) of Unbreakable Kernel provided by Oracle just as my colleague did. And here is the result (Note that it was an intention to perform all steps manually):
$ uname -a
Linux el1 3.8.13-118.6.2.el7uek.x86_64 #2 SMP Thu May 19 13:15:51 PDT 2016 x86_64 x86_64 x86_64 GNU/Linux
$ acfsdriverstate supported
ACFS-9200: Supported
# acfsroot install
ACFS-9300: ADVM/ACFS distribution files found.
ACFS-9307: Installing requested ADVM/ACFS software.
ACFS-9308: Loading installed ADVM/ACFS drivers.
ACFS-9321: Creating udev for ADVM/ACFS.
ACFS-9323: Creating module dependencies - this may take some time.
ACFS-9154: Loading 'oracleoks.ko' driver.
ACFS-9154: Loading 'oracleadvm.ko' driver.
ACFS-9154: Loading 'oracleacfs.ko' driver.
ACFS-9327: Verifying ADVM/ACFS devices.
ACFS-9156: Detecting control device '/dev/asm/.asm_ctl_spec'.
ACFS-9156: Detecting control device '/dev/ofsctl'.
ACFS-9309: ADVM/ACFS installation correctness verified.
# lsmod | grep ora
oracleacfs 3498177 0
oracleadvm 594197 0
oracleoks 503994 2 oracleacfs,oracleadvm
$ asmcmd volinfo --all
no volumes found
$ asmcmd volcreate -G GG GOLDENGATE -s 4G
$ asmcmd volinfo --all
Diskgroup Name: GG
Volume Name: GOLDENGATE
Volume Device: /dev/asm/goldengate-34
State: ENABLED
Size (MB): 4096
Resize Unit (MB): 64
Redundancy: UNPROT
Stripe Columns: 8
Stripe Width (K): 1024
Usage:
Mountpath:
$ mkfs -t acfs /dev/asm/goldengate-34
mkfs.acfs: version = 12.1.0.2.0
mkfs.acfs: on-disk version = 39.0
mkfs.acfs: volume = /dev/asm/goldengate-34
mkfs.acfs: CLSU-00100: operating system function: ioctl failed with error data: 1
mkfs.acfs: CLSU-00101: operating system error message: Operation not permitted
mkfs.acfs: CLSU-00103: error location: OI_0
mkfs.acfs: ACFS-00546: failed to change on-disk signature
mkfs.acfs: ACFS-01004: /dev/asm/goldengate-34 was not formatted.
Bingo! I've got the same problem on my test machine, so I was able to reproduce the problem. Now it's clear that is not related to a clustered configuration. Ok, let's see the trace output:
2061 stat("/dev/asm/goldengate-34", {st_mode=S_IFBLK|0770, st_rdev=makedev(251, 17409), ...}) = 0
2061 open("/dev/ofsctl", O_RDWR) = 9
2061 ioctl(9, 0xffffffffc1387015, 0x7ffd26305310) = -1 EPERM (Operation not permitted)
and permissions:
$ ll /dev/asm/.asm_ctl_spec
brwxrwx--- 1 root dba 251, 0 Jun 1 23:13 /dev/asm/.asm_ctl_spec
$ ll /dev/asm/goldengate-34
brwxrwx--- 1 root dba 251, 17409 Jun 1 23:24 /dev/asm/goldengate-34
$ ll /dev/ofsctl
brw-rw-r-- 1 root dba 250, 0 Jun 1 23:24 /dev/ofsctl
Output from tracing was the same and permissions are the same as defined in Udev rules and also the are correct. So according to this result I decided to remove ACFS related configuration and check the configuration and functionality of ACFS again using several older UEK3 kernels. Here is the result from the test of one of them (3.8.13-118.4.2.el7uek.x86_64):
$ uname -a
Linux el1 3.8.13-118.4.2.el7uek.x86_64 #2 SMP Tue Mar 22 20:46:48 PDT 2016 x86_64 x86_64 x86_64 GNU/Linux
$ acfsdriverstate supported
ACFS-9200: Supported
# acfsroot install
ACFS-9300: ADVM/ACFS distribution files found.
ACFS-9307: Installing requested ADVM/ACFS software.
ACFS-9308: Loading installed ADVM/ACFS drivers.
ACFS-9321: Creating udev for ADVM/ACFS.
ACFS-9323: Creating module dependencies - this may take some time.
ACFS-9154: Loading 'oracleoks.ko' driver.
ACFS-9154: Loading 'oracleadvm.ko' driver.
ACFS-9154: Loading 'oracleacfs.ko' driver.
ACFS-9327: Verifying ADVM/ACFS devices.
ACFS-9156: Detecting control device '/dev/asm/.asm_ctl_spec'.
ACFS-9156: Detecting control device '/dev/ofsctl'.
ACFS-9309: ADVM/ACFS installation correctness verified.
# lsmod | grep ora
oracleacfs 3498177 0
oracleadvm 594197 0
oracleoks 503994 2 oracleacfs,oracleadvm
$ asmcmd volinfo --all
no volumes found
$ asmcmd volcreate -G GG GOLDENGATE -s 4G
$ asmcmd volinfo --all
Diskgroup Name: GG
Volume Name: GOLDENGATE
Volume Device: /dev/asm/goldengate-34
State: ENABLED
Size (MB): 4096
Resize Unit (MB): 64
Redundancy: UNPROT
Stripe Columns: 8
Stripe Width (K): 1024
Usage:
Mountpath:
$ mkfs -t acfs /dev/asm/goldengate-34
mkfs.acfs: version = 12.1.0.2.0
mkfs.acfs: on-disk version = 39.0
mkfs.acfs: volume = /dev/asm/goldengate-34
mkfs.acfs: volume size = 4294967296 ( 4.00 GB )
mkfs.acfs: Format complete.
# mount -t acfs /dev/asm/goldengate-34 /mnt/
# mount | grep mnt
/dev/asm/goldengate-34 on /mnt type acfs (rw,relatime,device,rootsuid,ordered)
# touch /mnt/testfile
# ll /mnt/testfile
-rw-r--r-- 1 root root 0 Jun 2 00:52 /mnt/testfile
It works, we were able to format, mount and use the ACFS volume thus now we know that downgrade the UEK3 kernel is a workaround which solves the issue with ACFS until time when bug will be fixed by newer release of UEK3 kernel.
A bonus case: What if the Sys Admin will upgrade kernel thus we use the "buggy" version of UEK3 kernel on already configured ACFS volumes? I did several repeating tests on already configured and previously working ACFS volume. Let see what will happen:
1st run:
$ uname -a
Linux el1 3.8.13-118.6.2.el7uek.x86_64 #2 SMP Thu May 19 13:15:51 PDT 2016 x86_64 x86_64 x86_64 GNU/Linux
# acfsload start
ACFS-9391: Checking for existing ADVM/ACFS installation.
ACFS-9392: Validating ADVM/ACFS installation files for operating system.
ACFS-9393: Verifying ASM Administrator setup.
ACFS-9308: Loading installed ADVM/ACFS drivers.
ACFS-9154: Loading 'oracleoks.ko' driver.
ACFS-9154: Loading 'oracleadvm.ko' driver.
ACFS-9154: Loading 'oracleacfs.ko' driver.
ACFS-9327: Verifying ADVM/ACFS devices.
ACFS-9156: Detecting control device '/dev/asm/.asm_ctl_spec'.
ACFS-9156: Detecting control device '/dev/ofsctl'.
ACFS-9322: completed
$ asmcmd volenable -G GG GOLDENGATE
$ asmcmd volinfo --all
Diskgroup Name: GG
Volume Name: GOLDENGATE
Volume Device: /dev/asm/goldengate-34
State: ENABLED
Size (MB): 4096
Resize Unit (MB): 64
Redundancy: UNPROT
Stripe Columns: 8
Stripe Width (K): 1024
Usage: ACFS
Mountpath: /mnt
# mount -t acfs /dev/asm/goldengate-34 /mnt/
# mount | grep mnt
/dev/asm/goldengate-34 on /mnt type acfs (rw,relatime,device,rootsuid,ordered)
# touch /mnt/testfile2
# ll /mnt/testfile*
-rw-r--r-- 1 root root 0 Jun 2 00:52 /mnt/testfile
-rw-r--r-- 1 root root 0 Jun 2 01:18 /mnt/testfile2
Result of 1st run - working.
2nd run:
# Kernel panic - not syncing: Holding spin lock
rid: 1925, comm: mount Tainted: PF 0 3.8.13-118.6.2.e17uek.x86_64 #2
all Trace:
[<ffffffff81574fb0>] panic+Oxc8/0x1d7
[<ffffffffa0484a7a>] __KsPanic+0x9a/Oxa0 [oracleoks]
[<ffffffffa072275d>] ? OfsDoMountRecovery+Oxldd/Ox2b0 [oracleacfs]
[<ffffffffa0722a5d>] OfsDoPhaselRecovery+0x22d/Ox4e0 [oracleacfs]
[<ffffffffa0722e28>] OfsWaitForRecoveryToComplete+0x118/0x3c0 [oracleacfs]
[<ffffffffa0667446>] OfsSetupVolume+Oxc6/0x2ff0 [oracleacfs]
[<ffffffffa048507c>] ? KsSleepEvent+Ox8c/Oxce [oracleoks]
[<ffffffff81081e90>] ? wake_up_bit+Ox30/0x30
[<ffffffffa04851bc>] ? KsCreateSystemThread+Ox10c/Ox1b0 [oracleoks]
[<ffffffffa066b6dd>] OfsMountVolume+0x136d/Ox27e0 [oracleacfs]
[<ffffffffa0766e8f>] ofs_fill_sb+0x6f/Ox7e0 [oracleacfs]
[<ffffffff8118af58>] mount_bdev+0x1b8/0x200
[<ffffffffa0766e20>] ? ofs_parse_flags+0x140/0x140 [oracleacfs]
[<ffffffffa0763ec7>] ofs_mount+Ox107/0x2f0 [oracleacfs]
[<ffffffff8118b8e9>] mount_fs+0x39/0x1b0
[<ffffffff811a5dc7>] ? alloc_vfsmnt+Oxd7/0x1b0
[<ffffffff811a5f3f>] vfs_kern_mount+Ox5f/Oxf0
[<ffffffff811a8250>] do_mount+0x220/0xaf0
[<ffffffff8114343b>] ? strndup_user+Ox4b/Oxf0
[<ffffffff811a8ba3>] sys_mount+0x83/0xc0
[<ffffffff81587179>] system_call_fastpath+0x16/0x1b
Result of the 2nd run was kernel panic, machine hung of course.
3rd run:
# acfsload start
ACFS-9391: Checking for existing ADVM/ACFS installation.
ACFS-9392: Validating ADVM/ACFS installation files for operating system.
ACFS-9393: Verifying ASM Administrator setup.
ACFS-9308: Loading installed ADVM/ACFS drivers.
ACFS-9327: Verifying ADVM/ACFS devices.
ACFS-9156: Detecting control device '/dev/asm/.asm_ctl_spec'.
ACFS-9156: Detecting control device '/dev/ofsctl'.
acfsutil plogconfig: CLSU-00100: operating system function: ioctl failed with error data: 22
acfsutil plogconfig: CLSU-00101: operating system error message: Invalid argument
acfsutil plogconfig: CLSU-00103: error location: OI_0
acfsutil plogconfig: ACFS-03500: Unable to access kernel persistent log entries.
ACFS-9225: Failed to start OKS persistent logging.
ACFS-9322: completed
Result of 3rd run: finished with CLSU-00100: operating system function: ioctl failed with error data: 22 , CLSU-00101: operating system error message: Invalid argument, CLSU-00103: error location: OI_0 error messages. After several tries of manual startup it has started successfully.
I did seven more runs and final result was 4x working, 5x kernel panic, 1x error message during startup, so there is 60% chance that your system won't work.
Final conclusions
- problem is not related to clustered environment
- based on several tests (not shown in this article) 11gR2 (11.2.0.4 + APR 2016 PSU) and 12cR1 (12.1.0.2 + Apr 2016 PSU) are affected by this issue
- problem is related to specific UEK3 kernel versions, to be precise: 3.8.13-118.6.1.el7uek and 3.8.13-118.6.2.el7uek
- (currently) last properly working version of UEK3 is 3.8.13-118.4.2.el7uek
- supported version of default kernels (non UEK) is one of possible solution/workaround
- using older UEK3 kernel is a temporary workaround, not solution until the bug will be fixed. There are two reasons to have updated kernel (security, bugfix)
Update:
Updated kernel 3.8.13-118.8.1.el7uek.x86_64 has been released and this bug doesn't occur with this version (and probably later versions which will come in future).
Hope that helps...