The sad story of getting a computer connected to the network
Keywords: Asus Z87-K motherboard, integrated Realtek RTL8111/8168B, DHCP, Ethernet link speed.
Prerequisites: a basic understanding of:
- Dynamic Host Configuration Protocol (DHCP)
- Basics of computer hardware
How this started
Experimenting with reliable data storage and setting up a home backup storage server is one of the things I still want to do. When finished, I might write a post about it on this weblog. For now, suffice it to say that I just installed Solaris 11.2 beta.
Asus Z87-K Motherboard with integrated LAN: Realtek RTL8111
The Asus Z87-K motherboard has a built-in Realtek RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 11).
No IP, lost in confusion
At first all was well: I got an IP address from my redundant failover DHCP setup. The syslog on the DHCP server showed these nice messages:
May 1 19:31:24 localip dhcpd: DHCPDISCOVER from e0:3f:49:0e:ee:75 via eth0
May 1 19:31:24 localip dhcpd: DHCPOFFER on 10.0.0.7 to e0:3f:49:0e:ee:75 via eth0
May 1 19:31:26 localip dhcpd: DHCPREQUEST for 10.0.0.7 (10.0.0.49) from e0:3f:49:0e:ee:75 via eth0
May 1 19:31:26 localip dhcpd: DHCPACK on 10.0.0.7 to e0:3f:49:0e:ee:75 via eth0
But after a reboot, things stopped working smoothly. I no longer got an IP address. In the Solaris console, ifconfig showed no IP address, and I could not ping the outside world. So let’s inspect the syslog on the DHCP server again:
May 1 22:35:58 localip dhcpd: DHCPDISCOVER from e0:3f:49:0e:ee:75 via eth0
May 1 22:35:58 localip dhcpd: DHCPOFFER on 10.0.0.7 to e0:3f:49:0e:ee:75 via eth0
May 1 22:36:30 localip dhcpd: DHCPDISCOVER from e0:3f:49:0e:ee:75 via eth0
May 1 22:36:30 localip dhcpd: DHCPOFFER on 10.0.0.7 to e0:3f:49:0e:ee:75 via eth0
May 1 22:37:33 localip dhcpd: DHCPDISCOVER from e0:3f:49:0e:ee:75 via eth0
May 1 22:37:33 localip dhcpd: DHCPOFFER on 10.0.0.7 to e0:3f:49:0e:ee:75 via eth0
It is obvious what happens: the DHCP server receives the DHCPDISCOVER from the Solaris box and answers with a DHCPOFFER, but somehow the Solaris machine does not receive or understand that answer. It never follows up with a DHCPREQUEST and just keeps sending new discovers.
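The difference between the two log excerpts can also be checked mechanically. Below is a minimal Python sketch, assuming ISC dhcpd syslog lines in the format shown above; the function names `handshake_summary` and `stalled_clients` are made up for this illustration.

```python
import re
from collections import defaultdict

# Matches the DHCP message type and the client MAC address in dhcpd syslog lines.
LINE_RE = re.compile(
    r"dhcpd: (DHCP[A-Z]+) .*?((?:[0-9a-f]{2}:){5}[0-9a-f]{2})", re.IGNORECASE
)

def handshake_summary(lines):
    """Count DHCP message types per client MAC address."""
    counts = defaultdict(lambda: defaultdict(int))
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[match.group(2).lower()][match.group(1).upper()] += 1
    return counts

def stalled_clients(lines):
    """Return MACs that were offered a lease but never sent a DHCPREQUEST."""
    return [
        mac
        for mac, seen in handshake_summary(lines).items()
        if seen["DHCPOFFER"] > 0 and seen["DHCPREQUEST"] == 0
    ]

# The two excerpts from this post: the working boot and the broken one.
GOOD_BOOT = [
    "May 1 19:31:24 localip dhcpd: DHCPDISCOVER from e0:3f:49:0e:ee:75 via eth0",
    "May 1 19:31:24 localip dhcpd: DHCPOFFER on 10.0.0.7 to e0:3f:49:0e:ee:75 via eth0",
    "May 1 19:31:26 localip dhcpd: DHCPREQUEST for 10.0.0.7 (10.0.0.49) from e0:3f:49:0e:ee:75 via eth0",
    "May 1 19:31:26 localip dhcpd: DHCPACK on 10.0.0.7 to e0:3f:49:0e:ee:75 via eth0",
]
BAD_BOOT = [
    "May 1 22:35:58 localip dhcpd: DHCPDISCOVER from e0:3f:49:0e:ee:75 via eth0",
    "May 1 22:35:58 localip dhcpd: DHCPOFFER on 10.0.0.7 to e0:3f:49:0e:ee:75 via eth0",
    "May 1 22:36:30 localip dhcpd: DHCPDISCOVER from e0:3f:49:0e:ee:75 via eth0",
    "May 1 22:36:30 localip dhcpd: DHCPOFFER on 10.0.0.7 to e0:3f:49:0e:ee:75 via eth0",
    "May 1 22:37:33 localip dhcpd: DHCPDISCOVER from e0:3f:49:0e:ee:75 via eth0",
    "May 1 22:37:33 localip dhcpd: DHCPOFFER on 10.0.0.7 to e0:3f:49:0e:ee:75 via eth0",
]

if __name__ == "__main__":
    print("stalled after good boot:", stalled_clients(GOOD_BOOT))
    print("stalled after bad boot:", stalled_clients(BAD_BOOT))
```

A client that shows up as stalled is one where traffic towards the server works but the answer does not seem to arrive, which is exactly the one-way pattern in the second excerpt.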
I re-installed Solaris. I ran Solaris from the installation USB stick. I tweaked lots of BIOS and UEFI settings. Nothing helped. Using YUMI, I installed Kali Linux on another USB memory stick. I needed a Linux distro with plenty of good network diagnostic tools, and Kali is a very good candidate for that.
Kali Linux also did not receive an IP address, and the same logs were produced on the DHCP server. I began to suspect the Ethernet layer. In Linux, you have ethtool for that, although most Linux distributions do not include it by default, and without an IP address, apt-get install or pacman -S is not much use. Of course, Kali simply includes such programs.
Each time I replugged the UTP cable and ran ethtool again, it showed different capabilities advertised by the link partner (a simple desktop network switch). I also saw that the link speed was set incorrectly.
So that explains the DHCP logs! The DHCP server receives the DHCPDISCOVER messages, because the switch doesn’t care that much at which link speed the Ethernet packets arrive. But when the DHCP server sends an answer back, the switch assumes the wrong link speed, and the Realtek NIC (Network Interface Card) does not receive the answer.
For completeness: the link partner is a 3Com 3CGSU05A gigabit switch. Other desktops and laptops that I have previously connected to the 3Com switch worked flawlessly. Still, I need to be more certain about the offender: should I blame the 3Com switch or the Realtek NIC?
It worked all right when I connected my box to a Fritz!Box 7340 modem/router that has two Gigabit LAN ports. So the 3Com switch is still a suspect.
Booting to Solaris again. When connected to either the 3Com switch or the Fritz!Box:
root@solaris:~# dladm show-ether -x net0
LINK     PTYPE    STATE    AUTO  SPEED-DUPLEX          PAUSE
net0     current  up       no    1G-f                  none
--       capable  --       yes   1G-f,100M-fh,10M-fh   bi
--       adv      --       yes   1G-f,100M-fh,10M-fh   bi
--       peeradv  --       no    --                    none
No link-layer auto-negotiation took place: the current link reports AUTO as "no", and nothing was received from the peer. So the NIC probably just remembers its last state, or maybe just picks a speed at random. The Realtek RTL8111 has this problem whether I connect it to the 3Com switch or the Fritz!Box. This adds to the suspicion towards the RTL8111.
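Reading that table by eye gets tedious after a few replugs. As a rough sketch, the same check can be done in Python by splitting the `dladm show-ether -x` output on whitespace. The helper names `parse_show_ether` and `negotiation_findings` are invented for this example, and the parser assumes every row has one value per header column, as in the output above.

```python
# Output of `dladm show-ether -x net0`, copied from this post.
DLADM_OUTPUT = """\
LINK PTYPE STATE AUTO SPEED-DUPLEX PAUSE
net0 current up no 1G-f none
-- capable -- yes 1G-f,100M-fh,10M-fh bi
-- adv -- yes 1G-f,100M-fh,10M-fh bi
-- peeradv -- no -- none
"""

def parse_show_ether(text):
    """Map each PTYPE row (current, capable, adv, peeradv) to its fields."""
    rows = [line.split() for line in text.strip().splitlines()]
    header, body = rows[0], rows[1:]
    return {row[1]: dict(zip(header, row)) for row in body}

def negotiation_findings(table):
    """List the signs that auto-negotiation did not take place."""
    findings = []
    if table["current"]["AUTO"] != "yes":
        findings.append("link came up without auto-negotiation")
    if table["peeradv"]["SPEED-DUPLEX"] == "--":
        findings.append("no speed/duplex advertisement seen from the link partner")
    return findings

if __name__ == "__main__":
    for finding in negotiation_findings(parse_show_ether(DLADM_OUTPUT)):
        print(finding)
```

For the output above, both findings are reported, even though the NIC itself claims to be capable of, and to advertise, auto-negotiation.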
Let’s do additional searching. This specific Realtek NIC can be found in bug reports from some open source projects.
- Ubuntu Bug #347711 reads “Network goes down even though mii-tool still says negotiated, appears to be related to the network driver/adapter”.
- iPXE forum discussion: Problems with Asus Z87-K MB: NIC Does not reset after boot.
iPXE to the rescue
The stuff on the iPXE (open source network boot firmware) forum is very interesting.
Sebastian Nielsen wrote:
Found out that there was a setting “automatic link-down to 10MBIT on poweroff” that was set to “enabled”. Disabling this default setting, made the network card fully functional. Also the other setting causing problem was “receive-side scaling”. Seems like both setting caused problems in combination.
Michael Brown (mcb30) wrote:
Strange; those bits are ADVERTISE_PAUSE_CAP and ADVERTISE_PAUSE_ASYM: they control what flow control capabilities the NIC advertises. In the failure case, the NIC is therefore advertising that it does not support sending or receiving pause frames. This should not be causing the symptoms you are seeing.
So although it looked promising at first, they had to search further.
Michael Brown (mcb30) wrote:
The Receive Configuration Register looks to be the most likely culprit. The difference is in bit 24, which is listed in the 8168B datasheet as “reserved”, and isn’t defined in any other Realtek datasheet I can find.
After this discussion, they arrived at this Git commit, “[realtek] Clear bit 24 of RCR”:
On an Asus Z87-K motherboard with an onboard 8168 NIC, booting into Windows 7 and then warm rebooting into iPXE results in a broken RX datapath: packets can be transmitted successfully but garbage is received. A cold reboot clears the problem.
A dump of the PHY registers reveals only one difference: in the failure case the bits ADVERTISE_PAUSE_CAP and ADVERTISE_PAUSE_ASYM are cleared. Explicitly setting these bits does not fix the problem.
A dump of the MAC registers reveals a few differences, of which the most obvious culprit is the undocumented bit 24 of the Receive Configuration Register (RCR), which is set in the failure case. Explicitly clearing this bit does fix the problem.
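To make the bit talk concrete, here is a small Python sketch of the two operations mentioned in the commit message: decoding the pause flags and clearing bit 24. This is an illustration, not the iPXE code. The pause flag values are the ones from Linux's `mii.h`; treating the RCR as a 32-bit register is my assumption, and the register values in the usage lines are hypothetical, not dumps from my NIC.

```python
# MII auto-negotiation advertisement register flags (values as in Linux's mii.h).
ADVERTISE_PAUSE_CAP = 0x0400   # NIC advertises support for pause frames
ADVERTISE_PAUSE_ASYM = 0x0800  # NIC advertises asymmetric pause support

# Bit 24 of the Receive Configuration Register (RCR), undocumented per the commit.
RCR_BIT_24 = 1 << 24

def pause_bits(advertise):
    """Decode the two flow-control flags from an advertisement register value."""
    return {
        "pause_cap": bool(advertise & ADVERTISE_PAUSE_CAP),
        "pause_asym": bool(advertise & ADVERTISE_PAUSE_ASYM),
    }

def clear_rcr_bit_24(rcr):
    """Return the RCR value with bit 24 cleared (assumes a 32-bit register)."""
    return rcr & ~RCR_BIT_24 & 0xFFFFFFFF

if __name__ == "__main__":
    # Hypothetical register values, chosen only to show both cases.
    print(pause_bits(0x0DE1))                 # both pause flags set
    print(pause_bits(0x01E1))                 # both pause flags cleared
    print(hex(clear_rcr_bit_24(0x0100E70E)))  # bit 24 set on input, cleared on output
```

A real driver would read the register from the hardware, clear the bit and write the value back; the sketch only shows the masking step.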
Finally, it’s working
No easy tool exists to fix this. Disabling auto-negotiation as documented for Solaris 11.1 does not work for me on Solaris 11.2 beta with this specific RTL8111.
root@solaris:~# dladm set-linkprop -p en_10hdx_cap=0 net0
dladm: warning: cannot set link property 'en_10hdx_cap' on 'net0': operation not supported
But so far I have not run into trouble again, and if I do, I know the cause.
Still, I conclude that the integrated Realtek RTL8111/8168B totally sucks. It should do better auto-negotiation and/or not fall back to 10 Mbps by default when powering down.
This blog post was migrated in December 2022 from WordPress to a Markdown-based setup; errors or changes relative to the original blog post may have been introduced in the process.