tcp_init_wnd_chk
Not everything in the world is brand new, and sometimes your brand-new server has to communicate with a really old TCP/IP-enabled device. Because such devices are often severely limited in resources, they sometimes don’t obey some of the best practices. This article describes an issue arising out of this and how to circumvent it.
Description of the problem
It all started with a customer calling: “Hey, in Solaris 10 a connection between an LDOM and such a TCP/IP device works like a charm. In a Solaris 11 LDOM it stopped working.” At first I thought “WTF”, but the solution is quite trivial. The customer had the following tcpdump (which I have obfuscated for obvious reasons):
client wants to establish a TCP connection with server. The application and server software aren’t relevant here, so I won’t talk about them.
Explanation
There are two important pieces of information in this tcpdump that are relevant for the explanation:
- client wants to establish a connection with a maximum segment size of 1460 bytes
- client wants to use a window size of 560
Why are both pieces of information relevant? In Solaris 11 a number of security mechanisms are activated by default. One of them checks the TCP window size. To protect the system against certain kinds of denial-of-service attacks, the networking stack inspects the connection setup for a pattern that is usually only used by such attacks.

Almost all TCP/IP stacks that really want to do some work, and not just eat away your resources, will request a TCP receive window significantly larger than the maximum segment size (like four times), because you get a steadier throughput with such a configuration. It’s a good “suggestion” (as in: Do it this way!) to size the receive window at an integer multiple of the maximum segment size. It’s really a good “best practice” and all sane TCP/IP stacks do it this way.

However, denial-of-service attacks rely on the attacked target running out of resources faster than the attacker. So the attacker works with small receive windows, because it has to allocate memory in the size of the receive window for each TCP/IP connection. This leads to a simple check: you look for a situation that is unlikely to occur in normal operation, like a client connecting with a receive window smaller than the maximum segment size, as this would have several negative impacts on performance.

So: when the maximum segment size is larger than the window size in the third step of the handshake, the system just ignores that step. There is no connection from the perspective of the server, so it doesn’t have to allocate resources. In addition, a timer is set that expires such a connection attempt earlier from the respective tables in the operating system.
On the other side, the client believes there is a connection (the server has answered the initial SYN with a SYN/ACK) and doesn’t know that the server doesn’t consider this connection fully established. So it’s normal that the client starts to send data, but it will never receive an ACK for it.

The described mechanism helps a lot against denial-of-service attacks. However, it’s based on the assumption that the communication partner obeys the best practices, and for 99.99% of all devices this is a correct assumption. But not all TCP/IP stacks do this … some have to deviate from these best practices in order to enable a device to communicate via TCP/IP despite having only minimal resources. The example above is such a case. The client is a device with minimal resources: as shown by the tcpdump above, the client specifies a maximum segment size of 1460, but a receive window of only 560. Obviously the connection attempt fails the test described before, and thus the connection will not be established. Solaris 10 didn’t know the described check. So it’s quite obvious why the TCP connection worked with Solaris 10 but not with Solaris 11.
Workaround
So you are in a situation: your device can’t communicate. But especially with older hardware there is often no way to change the client, and even when you can, often you don’t want to, because you don’t want to introduce even a single small software change. For this reason, there is a way to get around the check on the server side. It’s connected to a second check made by the TCP/IP stack. With larger segment sizes, more TCP stacks will not set a receive window larger than the maximum segment size, because every TCP connection requires an allocation of memory for the received data in the size of the receive window. These requests would fail as well. To get around this, Solaris 11 checks an additional value called tcp_init_wnd_chk. The ACK is considered invalid when the advertised window is smaller than both the maximum segment size and tcp_init_wnd_chk. By default the value of this variable is 4096. So a connection attempt with a window size of 4096 bytes or more will be accepted, even when the maximum segment size is larger (and thus would fail the described check without this second check). The value is dynamic, so you can change it.
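The combined rule is easier to see when written down as a predicate. The following is only a minimal model of the rule as described above, not the Solaris implementation; the function name ack_is_rejected and the example windows other than 560 and 8192 are made up for illustration.

```python
# Model of the rule described in the text (an assumption-level sketch, not
# kernel code): the handshake ACK is treated as invalid when the advertised
# window is smaller than BOTH the maximum segment size and tcp_init_wnd_chk.

DEFAULT_TCP_INIT_WND_CHK = 4096  # default value named in the article


def ack_is_rejected(window: int, mss: int,
                    init_wnd_chk: int = DEFAULT_TCP_INIT_WND_CHK) -> bool:
    """Return True if the modelled check would ignore the handshake ACK."""
    return window < mss and window < init_wnd_chk


if __name__ == "__main__":
    # (description, window, mss, tcp_init_wnd_chk); only 560/1460 and 8192
    # come from the article, the other rows are illustrative.
    cases = [
        ("embedded client from the article", 560, 1460, 4096),
        ("ordinary client", 8192, 1460, 4096),
        ("large MSS, window at the threshold", 4096, 8960, 4096),
        ("embedded client, tcp_init_wnd_chk set to 0", 560, 1460, 0),
    ]
    for description, window, mss, chk in cases:
        verdict = "rejected" if ack_is_rejected(window, mss, chk) else "accepted"
        print(f"{description}: window={window} mss={mss} chk={chk} -> {verdict}")
```

Setting the threshold to 0 makes the second comparison false for every window, which is why that value switches the check off in this model.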
So the solution for this customer was quite easy. Right after executing the command echo tcp_init_wnd_chk/W 0x0 | mdb -kw the connections were established without any problems again. With a set ip:tcp_init_wnd_chk = 0 in /etc/system the change was made boot persistent. Keep in mind before doing so that you have just deactivated a security check. Your system will be more susceptible to this special kind of DoS attack. On the other hand, you normally have such embedded systems in networks that are tightly controlled, and you would have a worse problem than this DoS when someone is able to kick off such an attack in your network (as in: the attacker has broken into your physical security perimeter). Nevertheless, I wouldn’t use 0 but the smallest receive window you expect to see in packets with a maximum segment size larger than the receive window size, minus 1 byte.
So far I have seen the necessity to set this parameter only with really, really old TCP/IP stacks (old in the sense of somewhere in the last century) or TCP/IP stacks of embedded systems.
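To make the “smallest window minus 1 byte” suggestion concrete, here is a small helper that derives such a value and prints the two settings in the same form as the commands quoted above. The helper names and the list of expected windows are hypothetical, and I am assuming that a non-zero value is written in exactly the same way as the 0 shown in the text.

```python
# Sketch: derive a tcp_init_wnd_chk value from the small windows you expect
# to see, following the "smallest window minus 1 byte" advice in the text.
# Function names and the sample window list are illustrative assumptions.

def suggested_init_wnd_chk(expected_small_windows):
    """Smallest expected receive window minus 1 byte, never below 0."""
    return max(min(expected_small_windows) - 1, 0)


def mdb_command(value):
    """Runtime change, same form as the mdb command quoted in the article."""
    return f"echo tcp_init_wnd_chk/W 0x{value:x} | mdb -kw"


def etc_system_line(value):
    """Boot-persistent setting, same form as the /etc/system line above."""
    return f"set ip:tcp_init_wnd_chk = {value}"


if __name__ == "__main__":
    # 560 is the window of the device in the article; 1024 is made up.
    value = suggested_init_wnd_chk([560, 1024])
    print(mdb_command(value))
    print(etc_system_line(value))
```

With this approach every window smaller than the chosen value is still rejected, so only the devices you know about get through.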
Demonstration
I wanted to check the solution after the customer told me “Everything is working again” and see the issue with my own eyes. However, I didn’t have the opportunity to look at the real hardware, so I simulated the issue by writing a small script. I needed tighter control over what the client was sending to the server, so I generated the TCP packets by hand. For a different project I’m playing around with Scapy. Scapy is a toolset in Python to generate packets the raw way, circumventing the TCP/IP stack. As I wrote on Facebook a few days ago: working with it to create TCP communication feels like bitbanging an I2C bus with GPIO pins.
The script was called simulation.py. It’s really a raw hack …
The script takes a single command line parameter: the size of the window the script will use for the connection. It essentially implements a client for the ECHO service provided by the inetd of a Solaris system.
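To show which two fields matter when you craft such a SYN by hand, here is a standalone sketch that builds only the TCP header with Python’s struct module. It is not simulation.py and it does not use Scapy or send anything; the port and sequence numbers are arbitrary assumptions, and the checksum is left at 0 because it depends on the IP addresses.

```python
# Sketch (not the author's script): build the TCP header of a SYN that
# advertises a chosen window and carries an MSS option, then read both
# values back. Ports and sequence number are arbitrary example values.
import struct

ECHO_PORT = 7  # well-known port of the ECHO service


def build_syn_header(src_port, dst_port, seq, window, mss):
    """Return a 24-byte TCP SYN header (checksum 0) with an MSS option."""
    data_offset = 6                  # 6 * 4 = 24 bytes: 20 fixed + 4 option
    syn_flag = 0x02
    offset_and_flags = (data_offset << 12) | syn_flag
    fixed = struct.pack(
        "!HHIIHHHH",
        src_port, dst_port,
        seq, 0,                      # sequence number, acknowledgment number
        offset_and_flags,
        window,                      # advertised receive window
        0, 0,                        # checksum (not computed), urgent pointer
    )
    mss_option = struct.pack("!BBH", 2, 4, mss)  # kind 2, length 4, value
    return fixed + mss_option


def parse_window_and_mss(header):
    """Extract the advertised window and the MSS option from such a header."""
    (window,) = struct.unpack("!H", header[14:16])
    kind, length, mss = struct.unpack("!BBH", header[20:24])
    if kind != 2 or length != 4:
        raise ValueError("first option is not an MSS option")
    return window, mss


if __name__ == "__main__":
    for window in (8192, 560):
        header = build_syn_header(40000, ECHO_PORT, 1000, window, 1460)
        print(window, header.hex(), parse_window_and_mss(header))
```

The pair returned by parse_window_and_mss is exactly what the server compares in the check described earlier: window 560 against MSS 1460 for the problematic device.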
Before you can use this script, there is an important prerequisite. The script totally circumvents the TCP/IP stack of the client. From the perspective of the client’s OS this TCP/IP connection doesn’t exist. When a SYN/ACK packet for a TCP/IP connection that doesn’t exist arrives at the client, a security mechanism of the OS may kick in: the client sends a RST packet to the server to terminate this connection, and nothing will work. You have to suppress these RST packets. I was using a notebook with Ubuntu as the client, thus I used the following iptables command.
So, when I’m running this script as ./simulation.py 8192 on my client system, the output is perfect.
However, if you start it as ./simulation.py 560, the situation looks a little bit different: the connection isn’t working at all … for example, there is no ECHO response.
It gets a little bit more obvious if you comment out all lines in the script after the first sleep(0.5) and run it with ./simulation.py 560. You will see the retransmission of the SYN/ACK packet, like we saw in the tcpdump of the customer.
The connection stays in SYN_RECV as shown by netstat -a:
The reason is obvious when you bring the state diagram of TCP to mind. The connection switches to SYN_RECV when the initial SYN is received and the SYN/ACK is sent. However, as the following ACK from the client is ignored, it never transitions to the ESTABLISHED state.
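The stuck handshake can be replayed with a toy model of the server side. This is only a sketch of the behaviour described in this article, with a hypothetical class name, and it assumes that the check compares the window of the ACK with the maximum segment size the client announced in its SYN.

```python
# Toy model of the server-side handshake states described in the text.
# Assumption: the window check uses the MSS announced in the client's SYN.

class ServerSideConnection:
    def __init__(self, init_wnd_chk=4096):
        self.state = "LISTEN"
        self.client_mss = None
        self.init_wnd_chk = init_wnd_chk

    def on_syn(self, mss):
        """Initial SYN received, SYN/ACK sent: LISTEN -> SYN_RECV."""
        if self.state == "LISTEN":
            self.client_mss = mss
            self.state = "SYN_RECV"

    def on_ack(self, window):
        """Third step of the handshake; ignored if the window check fails."""
        if self.state != "SYN_RECV":
            return
        if window < self.client_mss and window < self.init_wnd_chk:
            return  # ACK ignored, connection stays in SYN_RECV
        self.state = "ESTABLISHED"


if __name__ == "__main__":
    for chk in (4096, 0):
        conn = ServerSideConnection(init_wnd_chk=chk)
        conn.on_syn(mss=1460)
        conn.on_ack(window=560)
        print(f"tcp_init_wnd_chk={chk}: state after ACK is {conn.state}")
```

In this model the default threshold leaves the connection in SYN_RECV, while a threshold of 0 lets the same ACK complete the handshake.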
However, when you use the mdb statement I mentioned before, the connection will be established despite the small window, as the check I described in the beginning is essentially deactivated.
When I start the script with ./simulation.py 560, the tcpdump will look like this:
It pretty much looks like the connection made with a window of 8192. Changing the value of tcp_init_wnd_chk had the desired effect.