Kernel update

You might start to see lines like this in your log: DMA: Out of SW-IOMMU space for 65536 bytes at device. This is definitely not good, but it is easily solved with a kernel update.

It is a known bug:

Sep 16 07:52:22 hal kernel: [549535.388490] ata1.00: status: { DRDY }
Sep 16 07:52:22 hal kernel: [549535.408287] ata2.00: configured for UDMA/133
Sep 16 07:52:22 hal kernel: [549535.408294] ata2: EH complete
Sep 16 07:52:22 hal kernel: [549535.408394] DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:00:1f.2
Sep 16 07:52:22 hal kernel: [549535.408417] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Sep 16 07:52:22 hal kernel: [549535.408420] ata2.00: failed command: WRITE DMA EXT
Sep 16 07:52:22 hal kernel: [549535.408426] ata2.00: cmd 35/00:98:d0:42:cf/00:00:18:00:00/e0 tag 0 dma 77824 out
Sep 16 07:52:22 hal kernel: [549535.408426]          res 50/00:00:af:6d:70/00:00:74:00:00/e0 Emask 0x40 (internal error)
Sep 16 07:52:22 hal kernel: [549535.408429] ata2.00: status: { DRDY }
Sep 16 07:52:22 hal kernel: [549535.412332] ata1.00: configured for UDMA/133
Sep 16 07:52:22 hal kernel: [549535.412339] ata1: EH complete
Sep 16 07:52:22 hal kernel: [549535.412440] DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:00:1f.2
Sep 16 07:52:22 hal kernel: [549535.412463] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Sep 16 07:52:22 hal kernel: [549535.412466] ata1.00: failed command: WRITE DMA EXT
Sep 16 07:52:22 hal kernel: [549535.412472] ata1.00: cmd 35/00:98:d0:42:cf/00:00:18:00:00/e0 tag 0 dma 77824 out
Sep 16 07:52:22 hal kernel: [549535.412472]          res 50/00:00:af:6d:70/00:00:74:00:00/e0 Emask 0x40 (internal error)
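To check quickly whether a machine shows the same symptom, a small helper can count the messages in a kernel log. This is only a sketch: the function name count_swiommu_errors is made up for illustration, and the default path /var/log/kern.log is an assumption (the usual location on Ubuntu), so pass your own log file if it lives elsewhere.

```shell
#!/bin/sh
# Count "Out of SW-IOMMU space" messages in a kernel log file.
# The default path is an assumption; pass another file as the first argument.
count_swiommu_errors() {
    log_file="${1:-/var/log/kern.log}"
    # grep -c prints the number of matching lines (0 if there are none).
    grep -c 'Out of SW-IOMMU space' "$log_file"
}
```

Calling count_swiommu_errors with no argument reads the default log; any number above zero means the messages have appeared.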

Now comes the interesting part: RAID corruption! :)

Sep 17 05:15:07 hal kernel: [37727.016503] ata1.00: status: { DRDY }
Sep 17 05:15:07 hal kernel: [37727.024414] ata2.00: configured for UDMA/133
Sep 17 05:15:07 hal kernel: [37727.024422] sd 1:0:0:0: [sdb]  
Sep 17 05:15:07 hal kernel: [37727.024424] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Sep 17 05:15:07 hal kernel: [37727.024426] sd 1:0:0:0: [sdb]  
Sep 17 05:15:07 hal kernel: [37727.024427] Sense Key : Aborted Command [current] [descriptor]
Sep 17 05:15:07 hal kernel: [37727.024431] Descriptor sense data with sense descriptors (in hex):
Sep 17 05:15:07 hal kernel: [37727.024432]         72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00 
Sep 17 05:15:07 hal kernel: [37727.024441]         00 70 6d af 
Sep 17 05:15:07 hal kernel: [37727.024445] sd 1:0:0:0: [sdb]  
Sep 17 05:15:07 hal kernel: [37727.024447] Add. Sense: No additional sense information
Sep 17 05:15:07 hal kernel: [37727.024449] sd 1:0:0:0: [sdb] CDB: 
Sep 17 05:15:07 hal kernel: [37727.024451] Write(10): 2a 00 01 7b ec e8 00 00 38 00
Sep 17 05:15:07 hal kernel: [37727.024458] end_request: I/O error, dev sdb, sector 24898792
Sep 17 05:15:07 hal kernel: [37727.024464] ata2: EH complete
Sep 17 05:15:07 hal kernel: [37727.040319] ata1.00: configured for UDMA/133
Sep 17 05:15:07 hal kernel: [37727.040326] sd 0:0:0:0: [sda]  
Sep 17 05:15:07 hal kernel: [37727.040328] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Sep 17 05:15:07 hal kernel: [37727.040330] sd 0:0:0:0: [sda]  
Sep 17 05:15:07 hal kernel: [37727.040332] Sense Key : Aborted Command [current] [descriptor]
Sep 17 05:15:07 hal kernel: [37727.040335] Descriptor sense data with sense descriptors (in hex):
Sep 17 05:15:07 hal kernel: [37727.040337]         72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00 
Sep 17 05:15:07 hal kernel: [37727.040346]         00 70 6d af 
Sep 17 05:15:07 hal kernel: [37727.040351] sd 0:0:0:0: [sda]  
Sep 17 05:15:07 hal kernel: [37727.040353] Add. Sense: No additional sense information
Sep 17 05:15:07 hal kernel: [37727.040355] sd 0:0:0:0: [sda] CDB: 
Sep 17 05:15:07 hal kernel: [37727.040357] Write(10): 2a 00 01 7b ec e8 00 00 38 00
Sep 17 05:15:07 hal kernel: [37727.040365] end_request: I/O error, dev sda, sector 24898792
Sep 17 05:15:07 hal kernel: [37727.040375] ata1: EH complete
Sep 17 05:15:07 hal kernel: [37727.040377] md/raid1:md1: Disk failure on sdb5, disabling device.
Sep 17 05:15:07 hal kernel: [37727.040377] md/raid1:md1: Operation continuing on 1 devices.
Sep 17 05:15:07 hal kernel: [37727.067568] RAID1 conf printout:
Sep 17 05:15:07 hal kernel: [37727.067571]  --- wd:1 rd:2
Sep 17 05:15:07 hal kernel: [37727.067573]  disk 0, wo:1, o:0, dev:sdb5
Sep 17 05:15:07 hal kernel: [37727.067576]  disk 1, wo:0, o:1, dev:sda5
Sep 17 05:15:07 hal kernel: [37727.067594] Aborting journal on device md1-8.
Sep 17 05:15:07 hal mdadm[1689]: Fail event detected on md device /dev/md/1
Sep 17 05:15:07 hal mdadm[1689]: FailSpare event detected on md device /dev/md/1, component device /dev/sdb5
Sep 17 05:15:07 hal kernel: [37727.068599] RAID1 conf printout:
Sep 17 05:15:07 hal kernel: [37727.068602]  --- wd:1 rd:2
Sep 17 05:15:07 hal kernel: [37727.068605]  disk 1, wo:0, o:1, dev:sda5

At this point the RAID may already be corrupted and the system is in a read-only state. Reboot now so that the RAID comes back degraded but writable:

_$: reboot

The first thing to do once the server has rebooted is to stop the WiFi:

_$: service hostapd stop

Mount the RAID again and check the kernel version:

_$: uname -r
3.8.0-29-generic
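Before mounting, it can help to confirm which array is running degraded. The following is a minimal sketch, assuming the standard /proc/mdstat status file; the function name list_degraded_arrays is hypothetical and not part of mdadm.

```shell
#!/bin/sh
# Print the name of every md array whose status shows a missing member.
# /proc/mdstat marks working members with "U" and missing ones with "_",
# for example "[2/1] [_U]" for a two-disk RAID1 running on one disk.
list_degraded_arrays() {
    mdstat_file="${1:-/proc/mdstat}"
    awk '
        /^md/ { name = $1 }                  # remember the current array name
        /\[[U_]*_[U_]*\]/ { print name }     # status map contains a "_"
    ' "$mdstat_file"
}
```

On the server from the logs above, this would be expected to print md1 for as long as sdb5 stays out of the array.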

We need to update the kernel to a version greater than 3.9 to avoid RAID corruption:

_$: apt-get install linux-generic-lts-trusty

If your server has a desktop environment, don’t forget to update the X server packages too:

_$: apt-get install xserver-xorg-lts-trusty
_$: reboot

Once your server has finished rebooting, check the kernel version:

_$: uname -r
3.13.0-36-generic
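If you want to script this check, the comparison against 3.9 can be sketched as below. The function name kernel_newer_than_3_9 is made up for this example, and it compares only the leading major.minor numbers of the release string.

```shell
#!/bin/sh
# Succeed (exit status 0) if the kernel release is strictly newer than
# the 3.9 series. Only the leading major.minor numbers are compared.
# With no argument, the running kernel (uname -r) is checked.
kernel_newer_than_3_9() {
    release="${1:-$(uname -r)}"
    major=${release%%.*}          # text before the first dot
    rest=${release#*.}            # text after the first dot
    minor=${rest%%[!0-9]*}        # leading digits of the remainder
    [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -gt 9 ]; }
}
```

For example, kernel_newer_than_3_9 "3.8.0-29-generic" fails, while kernel_newer_than_3_9 "3.13.0-36-generic" succeeds.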

With the newer kernel, the SW-IOMMU errors should no longer appear.