Discussion:
Improper Naming in /dev/disk/by-id and Drives Offline
Brandon R Schwartz
2014-09-11 02:34:06 UTC
Hi,

I'm working on a particular issue (possibly two separate issues) where
our HDDs are (1) getting mislabeled in /dev/disk/by-id and (2)
dropping offline even though drive and controller logs show that the
drive is communicating and working as expected. I don't have much
knowledge on the udev side of things so it would be great if someone
could offer some insight into the way udev assigns device names and if
there are thoughts as to why the OS cannot see the drive in certain
cases (timing issue?).

The first issue, the mislabeling problem, is that on reboots or power
cycles we occasionally see our drives become mislabeled in
/dev/disk/by-id. We expect to see something like:

ata-ST3000DM001-1CH166_W1F26HKK
ata-ST3000DM001-1CH166_Z1F2FBBY

But instead we see:

ata-ST3000DM001-1CH166_W1F26HKK
scsi-35000c500668a9bdb

The "scsi" drive is assigned a drive letter and the OS can communicate
with the drive. Drive logs and controller logs show the drive is
working properly, but for some reason it's getting labeled incorrectly
in /dev/disk/by-id. We have looked through dmesg and enabled logging
in udev (udevadm control --log-priority=debug), but we have not seen
where these labels are coming from.
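
A minimal sketch for spotting the mislabeled drives after each boot: it groups the /dev/disk/by-id links by the device node they resolve to and reports nodes that have no "ata-" link. The function names are made up for illustration, and the "ata-" prefix is only an assumption that fits the SATA drives described here; other device types (for example device-mapper or optical drives) would also be reported.

```python
import os
from collections import defaultdict


def links_by_device(by_id_dir="/dev/disk/by-id"):
    """Map each resolved device node to the set of by-id link names pointing at it."""
    result = defaultdict(set)
    if not os.path.isdir(by_id_dir):
        return {}
    for name in os.listdir(by_id_dir):
        path = os.path.join(by_id_dir, name)
        if os.path.islink(path):
            result[os.path.realpath(path)].add(name)
    return dict(result)


def missing_prefix(device_links, prefix="ata-"):
    """Return the device nodes that have no link starting with the given prefix."""
    return sorted(
        dev for dev, names in device_links.items()
        if not any(n.startswith(prefix) for n in names)
    )


if __name__ == "__main__":
    for dev in missing_prefix(links_by_device()):
        print(dev, "has no ata- link")
```

Running this from a late boot script and logging the result would at least record which boots hit the problem.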

The second issue is slightly related to the first in that it appears
during the same power cycle/reboot test. We have noticed that on
occasion, our drives will not be detected by the OS (not listed in
/dev/disk/by-id) at all. However, if we look at drive logs and
controller logs, we don't see any issue. The controller is able to
see the drives and communicate with them, but the OS is unable to.
Any ideas as to why communication is not established?
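
One way to narrow this down is to check whether the kernel itself lists the disk. The sketch below compares the whole-disk entries under /sys/block with the targets of the by-id links. The function name and the "sd" name filter are assumptions for illustration. If a disk shows up in /sys/block but has no link, the kernel found it and the udev side is the more likely suspect; if it is missing from /sys/block as well, the problem is probably below udev (driver, controller or hardware).

```python
import os


def disks_without_links(sys_block="/sys/block",
                        by_id_dir="/dev/disk/by-id",
                        dev_dir="/dev"):
    """Return kernel-visible sd* disks that no by-id link resolves to."""
    linked = set()
    if os.path.isdir(by_id_dir):
        for name in os.listdir(by_id_dir):
            linked.add(os.path.realpath(os.path.join(by_id_dir, name)))

    missing = []
    for disk in sorted(os.listdir(sys_block)):
        if not disk.startswith("sd"):
            continue  # skip loop, ram, dm-* and similar devices
        node = os.path.realpath(os.path.join(dev_dir, disk))
        if node not in linked:
            missing.append(disk)
    return missing


if __name__ == "__main__":
    for disk in disks_without_links():
        print(disk, "is known to the kernel but has no by-id link")
```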

Also, is there a way to refresh the /dev/disk/by-id listing (udevadm
trigger?) once the OS has booted in order to rescan for attached
devices and repopulate it? Thanks for any information and let me know
if you need logs or anything else.

Regards,
Brandon
--
Brandon Schwartz
--
To unsubscribe from this list: send the line "unsubscribe linux-hotplug" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Greg KH
2014-09-11 02:53:42 UTC
Post by Brandon R Schwartz
Hi,
I'm working on a particular issue (possibly two separate issues) where
our HDDs are (1) getting mislabeled in /dev/disk/by-id and (2)
dropping offline even though drive and controller logs show that the
drive is communicating and working as expected. I don't have much
knowledge on the udev side of things so it would be great if someone
could offer some insight into the way udev assigns device names and if
there are thoughts as to why the OS cannot see the drive in certain
cases (timing issue?).
The first issue, the mislabeling problem, is that on reboots or power
cycles we occasionally see our drives become mislabeled in
/dev/disk/by-id. We expect to see something like:
ata-ST3000DM001-1CH166_W1F26HKK
ata-ST3000DM001-1CH166_Z1F2FBBY
But instead we see:
ata-ST3000DM001-1CH166_W1F26HKK
scsi-35000c500668a9bdb
The "scsi" drive is assigned a drive letter and the OS can communicate
with the drive. Drive logs and controller logs show the drive is
working properly, but for some reason it's getting labeled incorrectly
in /dev/disk/by-id. We have looked through dmesg and enabled logging
in udev (udevadm control --log-priority=debug), but we have not seen
where these labels are coming from.
Sounds like blkid didn't read the uuid properly. Is this happening in
your initrd? Is this a systemd init system, or something else? What
distro / version is this? What kernel version is this?
Post by Brandon R Schwartz
The second issue is slightly related to the first in that it appears
during the same power cycle/reboot test. We have noticed that on
occasion, our drives will not be detected by the OS (not listed in
/dev/disk/by-id) at all. However, if we look at drive logs and
controller logs, we don't see any issue. The controller is able to
see the drives and communicate with them, but the OS is unable to.
Any ideas as to why communication is not established?
Also, is there a way to refresh the /dev/disk/by-id listing (udevadm
trigger?) once the OS has booted in order to rescan for attached
devices and repopulate it? Thanks for any information and let me know
if you need logs or anything else.
That depends on your distro and how it's set up. You could "coldplug"
the by-id values by using udevadm trigger; have you tried that? You
shouldn't have to do it, as it sounds like you have a boot-time race
condition somewhere...
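
A minimal sketch of such a coldplug pass, under the assumption that only block devices need to be replayed: "udevadm trigger" asks the kernel to re-emit events for devices that already exist, so the udev rules run again and the symlinks are recreated, and "udevadm settle" waits for the event queue to drain. The function names and the 30-second timeout are illustrative, and the commands need root.

```python
import subprocess


def coldplug_commands(settle_timeout=30):
    """Build the command lines for replaying block-device events."""
    return [
        # Re-emit uevents for existing block devices only.
        ["udevadm", "trigger", "--subsystem-match=block"],
        # Wait until udev has processed the resulting events.
        ["udevadm", "settle", "--timeout={}".format(settle_timeout)],
    ]


def replay_block_events(run=subprocess.check_call):
    """Run the commands in order; 'run' can be swapped out for a dry run."""
    for cmd in coldplug_commands():
        run(cmd)


# Usage (as root): replay_block_events()
# Dry run:         replay_block_events(run=print)
```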

thanks,

greg k-h
Brandon R Schwartz
2014-09-12 17:52:30 UTC
Post by Greg KH
Sounds like blkid didn't read the uuid properly. Is this happening in
your initrd? Is this a systemd init system, or something else? What
distro / version is this? What kernel version is this?
Hi Greg,

The distro is RHEL 6.3 with kernel version 2.6.32. We have also seen
the issue on a Debian based system with kernel 3.2.45. We ran into
this issue again yesterday on RHEL and tested the command 'udevadm
trigger' and it repopulated /dev/disk/by-id with the correct
information. Is there another level of debugging that we can enable
to see where the information might be getting read improperly?
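
One thing that may help here is dumping the udev properties for an affected disk while it is mislabeled. The sketch below parses the output of "udevadm info --query=property" into a dict. It assumes, and this should be checked against the persistent-storage rules actually installed, that the by-id name for a whole disk is built from the ID_BUS and ID_SERIAL properties. The function names are made up for illustration.

```python
import subprocess


def parse_properties(text):
    """Turn KEY=value lines into a dict, ignoring lines without '='."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value
    return props


def device_properties(node):
    """Query udev's stored properties for a device node such as /dev/sdb."""
    out = subprocess.check_output(
        ["udevadm", "info", "--query=property", "--name={}".format(node)],
        universal_newlines=True,
    )
    return parse_properties(out)


def expected_by_id(props):
    """Assumed naming scheme: <ID_BUS>-<ID_SERIAL>; None if either is unset."""
    bus, serial = props.get("ID_BUS"), props.get("ID_SERIAL")
    return "{}-{}".format(bus, serial) if bus and serial else None


# Usage: print(expected_by_id(device_properties("/dev/sdb")))
```

Comparing the properties of a good and a bad disk from the same boot should show which value was set differently.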
Post by Greg KH
That depends on your distro, and how it's set up. You could "coldplug"
the by-id values by using udevadm trigger, have you tried that? You
shouldn't have to do it, as it sounds like you have a boot time race
condition somewhere...
What do you mean by 'coldplug' the by-id values with udevadm trigger?
The second issue (drives not detected at all) happens much less often,
so we are still waiting for a failure to test against. We are also
looking into ways to exacerbate the issue in case it is a boot-time
race condition.
Regards,
Brandon
--
Brandon Schwartz
Greg KH
2014-09-12 18:03:50 UTC
Post by Brandon R Schwartz
Hi Greg,
The distro is RHEL 6.3 with kernel version 2.6.32.
Then I strongly suggest you get support from Red Hat, as you are paying
for it :)
Post by Brandon R Schwartz
We have also seen
the issue on a Debian based system with kernel 3.2.45. We ran into
this issue again yesterday on RHEL and tested the command 'udevadm
trigger' and it repopulated /dev/disk/by-id with the correct
information. Is there another level of debugging that we can enable
to see where the information might be getting read improperly?
I don't know how RHEL is set up at all. It's such an old kernel and
userspace that the community can't help you out, sorry.

Work with Red Hat, you are paying them, might as well take advantage of
it.

good luck,

greg k-h
Brandon R Schwartz
2014-09-12 18:53:23 UTC
Post by Greg KH
I don't know how RHEL is set up at all, it's such an old kernel, and
userspace, the community can't help you out, sorry.
Haha, that is true, but we do see the failures more often on the
Debian based system. If you think we'd be better off working with the
RHEL community or the Debian forums we'll try our luck there. Thanks
for all the help so far!
Regards,
Brandon
--
Brandon Schwartz
Greg KH
2014-09-12 22:42:49 UTC
Post by Brandon R Schwartz
Haha, that is true, but we do see the failures more often on the
Debian based system. If you think we'd be better off working with the
RHEL community or the Debian forums we'll try our luck there. Thanks
for all the help so far!
The "RHEL community" is corporate support, which you are paying for,
use it!

As for the fact that it seems reproducible on two very different, and
both old, distros, it might be a hardware issue. Try using a more
"modern" distro to see whether it really is a kernel/udev issue or a
hardware one.

good luck,

greg k-h