✍️ Evert Mouw

The sad story of getting a computer connected to the network

Keywords: Asus Z87-K motherboard, integrated Realtek RTL8111/8168B, DHCP, Ethernet link speed.

Required prerequisites: basic understanding of:

How this started

Experimenting with reliable data storage and setting up a home backup storage server is one of the things I still want to do. When finished, I might write a post about it on this weblog. For now, suffice it to say that I just installed Solaris 11.2 beta.

Asus Z87-K Motherboard with integrated LAN: Realtek RTL8111

Z87-K

The Asus Z87-K motherboard has a built-in Realtek RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 11).

No IP, lost in confusion

At first all was well: I got an IP address from my redundant failover DHCP setup. The syslog on the DHCP server showed these nice messages:

May  1 19:31:24 localip dhcpd: DHCPDISCOVER from e0:3f:49:0e:ee:75 via eth0
May  1 19:31:24 localip dhcpd: DHCPOFFER on 10.0.0.7 to e0:3f:49:0e:ee:75 via eth0
May  1 19:31:26 localip dhcpd: DHCPREQUEST for 10.0.0.7 (10.0.0.49) from e0:3f:49:0e:ee:75 via eth0
May  1 19:31:26 localip dhcpd: DHCPACK on 10.0.0.7 to e0:3f:49:0e:ee:75 via eth0

But after a reboot, things stopped working smoothly. I no longer got an IP address: in the Solaris console, ifconfig showed none, and I could not ping the outside world. So let’s inspect the DHCP log:

May  1 22:35:58 localip dhcpd: DHCPDISCOVER from e0:3f:49:0e:ee:75 via eth0
May  1 22:35:58 localip dhcpd: DHCPOFFER on 10.0.0.7 to e0:3f:49:0e:ee:75 via eth0
May  1 22:36:30 localip dhcpd: DHCPDISCOVER from e0:3f:49:0e:ee:75 via eth0
May  1 22:36:30 localip dhcpd: DHCPOFFER on 10.0.0.7 to e0:3f:49:0e:ee:75 via eth0
May  1 22:37:33 localip dhcpd: DHCPDISCOVER from e0:3f:49:0e:ee:75 via eth0
May  1 22:37:33 localip dhcpd: DHCPOFFER on 10.0.0.7 to e0:3f:49:0e:ee:75 via eth0

The pattern is clear: the DHCP server receives a DHCPDISCOVER from the Solaris box and answers with a DHCPOFFER, but the DHCPREQUEST and DHCPACK seen earlier never follow. Somehow the Solaris machine does not receive or understand the offer and keeps sending new discover messages.
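To spot this pattern without reading the log by eye, here is a minimal Python sketch. It assumes the dhcpd syslog format shown above; the function name `stalled_clients` is made up for illustration and is not part of any DHCP tooling.

```python
import re
from collections import defaultdict

# Matches the dhcpd message type and the client MAC address in log lines
# formatted like the ones shown above (an assumption about the log format).
LINE_RE = re.compile(
    r"dhcpd: (DHCPDISCOVER|DHCPOFFER|DHCPREQUEST|DHCPACK)\b.*?"
    r"((?:[0-9a-f]{2}:){5}[0-9a-f]{2})",
    re.IGNORECASE,
)


def stalled_clients(log_lines):
    """Return MAC addresses that were offered a lease but never got a DHCPACK."""
    seen = defaultdict(set)
    for line in log_lines:
        match = LINE_RE.search(line)
        if match:
            message, mac = match.groups()
            seen[mac.lower()].add(message.upper())
    return sorted(
        mac
        for mac, messages in seen.items()
        if "DHCPOFFER" in messages and "DHCPACK" not in messages
    )


log = [
    "May  1 22:35:58 localip dhcpd: DHCPDISCOVER from e0:3f:49:0e:ee:75 via eth0",
    "May  1 22:35:58 localip dhcpd: DHCPOFFER on 10.0.0.7 to e0:3f:49:0e:ee:75 via eth0",
    "May  1 22:36:30 localip dhcpd: DHCPDISCOVER from e0:3f:49:0e:ee:75 via eth0",
    "May  1 22:36:30 localip dhcpd: DHCPOFFER on 10.0.0.7 to e0:3f:49:0e:ee:75 via eth0",
]
print(stalled_clients(log))
```

Fed the healthy log from before the reboot, the same function returns an empty list, because that exchange ends with a DHCPACK.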

I re-installed Solaris. I ran Solaris from the installation USB stick. I tweaked lots of BIOS and UEFI settings. Nothing helped. Using YUMI, I installed Kali Linux on another USB memory stick. I needed a Linux distro with plenty of good network diagnostic tools, and for that, Kali is a very good candidate.

Kali Linux also did not receive an IP address. The same logs were produced on the DHCP server. I began to suspect the Ethernet layer. In Linux, you have ethtool for that, although most Linux distributions do not include it by default, and without an IP connection, an apt-get install or yum install or pacman -S is really not that useful. Of course, Kali just includes such programs.

ethtool showed alternating capabilities advertised by the link partner (a simple desktop network switch) when invoked after plugging in the UTP cable multiple times. Also, I saw that the link speed was incorrectly set.

So that explains the DHCP logs! The DHCP server receives the DHCPDISCOVER messages, because the switch doesn’t care that much at which link speed the Ethernet packets arrive. But when the DHCP server sends an answer back, the switch assumes the wrong link speed, and the Realtek NIC (Network Interface Card) does not receive the answer.
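When ethtool is not installed, the Linux kernel also exposes the negotiated link state under /sys/class/net. The following is only a sketch of how one could check it from Python: the interface name `eth0`, the 1000 Mbit/s expectation and both function names are assumptions for illustration.

```python
from pathlib import Path


def link_state(interface, sysfs_root="/sys/class/net"):
    """Read operstate, speed (Mbit/s) and duplex of a Linux network interface."""
    base = Path(sysfs_root) / interface
    state = {}
    for attribute in ("operstate", "speed", "duplex"):
        try:
            state[attribute] = (base / attribute).read_text().strip()
        except OSError:
            # Missing interface, or the kernel declines to report a value
            # (reading speed/duplex can fail while the link is down).
            state[attribute] = None
    return state


def looks_degraded(state, expected_mbit=1000):
    """True when the link is up but slower than expected or not full duplex."""
    if state.get("operstate") != "up" or state.get("speed") is None:
        return False
    return int(state["speed"]) < expected_mbit or state.get("duplex") != "full"


if __name__ == "__main__":
    current = link_state("eth0")  # "eth0" is an assumed interface name
    print(current, "degraded:", looks_degraded(current))
```

This only reports what the local NIC believes it negotiated. It says nothing about what the switch on the other end assumes, so it complements ethtool rather than replacing it.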

For completeness: the link partner is a 3Com 3CGSU05A gigabit switch. Other desktops and laptops which I have previously connected to the 3Com switch worked flawlessly. Still, I need more certainty about the offender: should I blame the 3Com switch or the Realtek NIC?

3Com desktop switch 3CGSU05A

It worked all right when I connected my box to a Fritz!Box 7340 modem/router that has two Gigabit LAN ports. So the 3Com switch is still a suspect.

Booting to Solaris again. When connected to either the 3Com switch or the Fritz!Box:

root@solaris:~# dladm show-ether -x net0
LINK              PTYPE    STATE    AUTO  SPEED-DUPLEX                    PAUSE
net0              current  up       no    1G-f                            none
--                capable  --       yes   1G-f,100M-fh,10M-fh             bi
--                adv      --       yes   1G-f,100M-fh,10M-fh             bi
--                peeradv  --       no    --                              none

No link-layer auto negotiation. So, it probably just remembers the last state, or maybe just picks a speed at random. The Realtek RTL8111 has this problem no matter if I connect it to the 3Com switch or the Fritz!Box. This adds suspicion towards the RTL8111.
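The same reading of the table can be automated. This is a rough Python sketch that parses the exact `dladm show-ether -x` output shown above. The function names are invented for illustration, and treating `--` in the peeradv row as "nothing received from the link partner" is my interpretation, not documented behaviour.

```python
def parse_show_ether(output):
    """Parse `dladm show-ether -x` text into {ptype: {column: value}}."""
    lines = [line for line in output.splitlines() if line.strip()]
    header = lines[0].split()
    rows = {}
    for line in lines[1:]:
        fields = dict(zip(header, line.split()))
        rows[fields["PTYPE"]] = fields
    return rows


def negotiation_problems(rows):
    """List the signs of failed autonegotiation visible in the parsed table."""
    problems = []
    if rows["current"]["AUTO"] != "yes":
        problems.append("autonegotiation is not active on the current link")
    if rows["peeradv"]["SPEED-DUPLEX"] == "--":
        # Assumption: "--" means no advertisement was seen from the peer.
        problems.append("no speed/duplex advertisement seen from the link partner")
    return problems


SAMPLE = """\
LINK              PTYPE    STATE    AUTO  SPEED-DUPLEX                    PAUSE
net0              current  up       no    1G-f                            none
--                capable  --       yes   1G-f,100M-fh,10M-fh             bi
--                adv      --       yes   1G-f,100M-fh,10M-fh             bi
--                peeradv  --       no    --                              none
"""

for problem in negotiation_problems(parse_show_ether(SAMPLE)):
    print(problem)
```

For the output above it reports both problems: the NIC is capable of autonegotiation and advertises it, but the current link did not use it.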

Let’s do additional searching. This specific Realtek NIC can be found in bug reports from some open source projects.

iPXE to the rescue

The stuff on the iPXE (open source network boot firmware) forum is very interesting.

Sebastian Nielsen wrote:

Found out that there was a setting “automatic link-down to 10MBIT on poweroff” that was set to “enabled”. Disabling this default setting, made the network card fully functional. Also the other setting causing problem was “receive-side scaling”. Seems like both setting caused problems in combination.

Michael Brown (mcb30) wrote:

Strange; those bits are ADVERTISE_PAUSE_CAP and ADVERTISE_PAUSE_ASYM: they control what flow control capabilities the NIC advertises. In the failure case, the NIC is therefore advertising that it does not support sending or receiving pause frames. This should not be causing the symptoms you are seeing.

So although it looked promising at first, they had to search further.

Michael Brown (mcb30) wrote:

The Receive Configuration Register looks to be the most likely culprit. The difference is in bit 24, which is listed in the 8168B datasheet as “reserved”, and isn’t defined in any other Realtek datasheet I can find.

After this discussion, the outcome was this Git commit, “[realtek] Clear bit 24 of RCR”:

On an Asus Z87-K motherboard with an onboard 8168 NIC, booting into Windows 7 and then warm rebooting into iPXE results in a broken RX datapath: packets can be transmitted successfully but garbage is received. A cold reboot clears the problem.

A dump of the PHY registers reveals only one difference: in the failure case the bits ADVERTISE_PAUSE_CAP and ADVERTISE_PAUSE_ASYM are cleared. Explicitly setting these bits does not fix the problem.

A dump of the MAC registers reveals a few differences, of which the most obvious culprit is the undocumented bit 24 of the Receive Configuration Register (RCR), which is set in the failure case. Explicitly clearing this bit does fix the problem.
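To make "clearing bit 24" concrete, here is a small Python illustration of the bit arithmetic on a 32-bit register value. It is not the iPXE driver code, which is written in C and talks to the hardware; the register value below is hypothetical and chosen only to show the operation.

```python
RCR_BIT_24 = 1 << 24  # the undocumented bit named in the commit message


def bit_24_is_set(rcr):
    """True if bit 24 is set in a 32-bit register value."""
    return bool(rcr & RCR_BIT_24)


def clear_bit_24(rcr):
    """Return the register value with bit 24 cleared and all other bits kept."""
    return rcr & ~RCR_BIT_24 & 0xFFFFFFFF


failing_rcr = 0x0100E70F  # hypothetical value, not taken from a real register dump
fixed_rcr = clear_bit_24(failing_rcr)
print(f"{failing_rcr:#010x} -> {fixed_rcr:#010x}")
```

Clearing is idempotent: applying it to a value where bit 24 is already zero leaves the value unchanged, so a driver can do it unconditionally on initialisation.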

Finally, it’s working

No easy tool exists to fix this. Disabling autonegotiation as documented for Solaris 11.1 does not work for me with Solaris 11.2 beta with this specific RTL8111.

root@solaris:~# dladm set-linkprop -p en_10hdx_cap=0 net0
dladm: warning: cannot set link property 'en_10hdx_cap' on 'net0': operation not supported

But so far I have not run into trouble again, and if I do, I know the cause.

Still, I conclude that the integrated Realtek RTL8111/8168B totally sucks. It should do better autonegotiation and/or not drop to 10 Mbit/s by default when powering down.


This blog post was migrated in December 2022 from WordPress to a Markdown-based method; errors or changes relative to the original blog post may have crept in during the process.