So my question is when exactly does the kernel trigger a write-ready for a socket?
tl;dr: As long as your socket has enough buffer space, writes succeed, and in the default level-triggered mode epoll_wait keeps returning events to say so. If the socket runs out of space, blocking writers sleep. When acknowledged data frees up space, the kernel wakes those processes (or delivers epoll events saying the socket is writable), but only if the socket had run out of space. As before, as long as the socket stays writable, level-triggered events keep pouring in even if no new notifications come from TCP.
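The level-triggered behaviour can be observed from userspace with a small sketch. It is only an illustration under stated assumptions: an AF_UNIX socketpair stands in for a TCP connection (so the kernel code path differs from the TCP one discussed here), and the helper name count_epollout_reports is made up for this example. It polls an idle, writable socket several times and counts how often EPOLLOUT is reported, with and without EPOLLET.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: poll an idle, writable socket `rounds` times and
 * return how many polls reported EPOLLOUT (-1 on setup error).
 * Pass 0 for level-triggered mode or EPOLLET for edge-triggered mode.
 * Assumption: an AF_UNIX socketpair stands in for a TCP connection.
 */
int count_epollout_reports(uint32_t extra_flags, int rounds)
{
    int sv[2];
    int reports = 0;

    assert(rounds >= 0);
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    int ep = epoll_create1(0);
    if (ep < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    struct epoll_event ev = { .events = EPOLLOUT | extra_flags,
                              .data.fd = sv[0] };
    if (epoll_ctl(ep, EPOLL_CTL_ADD, sv[0], &ev) < 0) {
        reports = -1;
        goto out;
    }

    for (int i = 0; i < rounds; i++) {
        struct epoll_event got;
        /* Nothing is written between polls; the socket just stays writable. */
        int n = epoll_wait(ep, &got, 1, 0);
        if (n == 1 && (got.events & EPOLLOUT))
            reports++;
    }

out:
    close(ep);
    close(sv[0]);
    close(sv[1]);
    return reports;
}
```

Calling count_epollout_reports(0, 5) should report EPOLLOUT on every poll, while count_epollout_reports(EPOLLET, 5) should report it only once, because nothing generates a new notification after the initial registration.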
The function that performs the actual notification is sk_write_space.
This is a member of struct sock and for TCP the relevant implementation is sk_stream_write_space in stream.c.
    ...
    if (skwq_has_sleeper(wq))
        wake_up_interruptible_poll(&wq->wait, EPOLLOUT |
                                   EPOLLWRNORM | EPOLLWRBAND);
    if (wq && wq->fasync_list && !(sk->sk_shutdown & SEND_SHUTDOWN))
        sock_wake_async(wq, SOCK_WAKE_SPACE, POLL_OUT);
    ...
This function wakes up any callers that might be waiting for memory.
(Compare this with sock_def_write_space.)
But when is sk_write_space called? There are a few call sites, but the most prominent is tcp_new_space, which is called by tcp_check_space, which is called by tcp_data_snd_check, which in turn is called from a bunch of places on the receive path. The function has a descriptive comment:
When incoming ACK allowed to free some skb from write_queue,
we remember this event in flag SOCK_QUEUE_SHRUNK and wake up socket
on the exit from tcp input handler.
tcp_check_space is interesting:
    if (sock_flag(sk, SOCK_QUEUE_SHRUNK)) {
        sock_reset_flag(sk, SOCK_QUEUE_SHRUNK);
        /* pairs with tcp_poll() */
        smp_mb();
        if (sk->sk_socket &&
            test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)) {
            tcp_new_space(sk);
            ...
    }
Some relevant bits here:
SOCK_QUEUE_SHRUNK is defined as "write queue has been shrunk recently" and is set when data is freed from the write queue (per the comment above, when an incoming ACK frees an skb). tcp_check_space checks and clears it.
SOCK_NOSPACE is set on the transmit path when we run out of buffer space.
The conclusion from all this is that tcp_check_space avoids sending events unless the socket was out of space.
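The gating described above can be restated as a tiny userspace model. This is a toy sketch, not kernel code: struct toy_sock and every toy_* function are invented for illustration, and the model assumes that delivering the notification also clears the no-space mark.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the two socket flags discussed above (not kernel code). */
struct toy_sock {
    bool queue_shrunk; /* models SOCK_QUEUE_SHRUNK */
    bool nospace;      /* models SOCK_NOSPACE */
    int  wakeups;      /* how many write-space notifications fired */
};

/* Transmit path: a write found the send buffer full. */
void toy_send_hit_full_buffer(struct toy_sock *s)
{
    assert(s != NULL);
    s->nospace = true;
}

/* Data was freed from the write queue, e.g. after an ACK. */
void toy_queue_shrunk(struct toy_sock *s)
{
    assert(s != NULL);
    s->queue_shrunk = true;
}

/*
 * Models the gate in tcp_check_space: consume the "shrunk" flag, and only
 * notify if a writer had previously run out of space.
 */
void toy_check_space(struct toy_sock *s)
{
    assert(s != NULL);
    if (s->queue_shrunk) {
        s->queue_shrunk = false;
        if (s->nospace) {
            /* Assumption: the notification path clears the no-space mark. */
            s->nospace = false;
            s->wakeups++;
        }
    }
}

/* Run `acks` shrink-then-check rounds and return the notification count. */
int toy_wakeups_after_acks(bool ran_out_of_space, int acks)
{
    struct toy_sock s = { false, false, 0 };

    assert(acks >= 0);
    if (ran_out_of_space)
        toy_send_hit_full_buffer(&s);
    for (int i = 0; i < acks; i++) {
        toy_queue_shrunk(&s);
        toy_check_space(&s);
    }
    return s.wakeups;
}
```

In this model, a socket that never ran out of space produces no notifications no matter how many ACKs arrive, while a socket that did run out gets exactly one until a writer hits a full buffer again.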
What about tcp_data_snd_check? During the steady state the most relevant calls are in tcp_rcv_established:
The fast-path:
https://github.com/torvalds/linux/blob/master/net/ipv4/tcp_input.c#L5575
The almost-fast path:
https://github.com/torvalds/linux/blob/master/net/ipv4/tcp_input.c#L5618
The slow-path:
https://github.com/torvalds/linux/blob/master/net/ipv4/tcp_input.c#L5658
All of these signal that data was successfully ACKed.
There are other callers of sk_write_space in TCP. do_tcp_sendpages and tcp_sendmsg_locked call it on error paths to make sure callers are woken up. do_tcp_setsockopt calls it when setting TCP_NOTSENT_LOWAT.
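For reference, this is roughly what setting TCP_NOTSENT_LOWAT looks like from userspace. It is a minimal sketch: the helper name set_notsent_lowat is made up, the socket is left unconnected for brevity, and the fallback definition of the option number assumes the Linux value of 25 in case the libc headers do not provide it.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef TCP_NOTSENT_LOWAT
#define TCP_NOTSENT_LOWAT 25 /* assumption: Linux option number */
#endif

/*
 * Hypothetical helper: set TCP_NOTSENT_LOWAT on a fresh TCP socket and
 * read it back. Returns the value the kernel reports, or -1 on error.
 */
int set_notsent_lowat(int bytes)
{
    assert(bytes >= 0);

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    int result = -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT,
                   &bytes, sizeof(bytes)) == 0) {
        int got = 0;
        socklen_t len = sizeof(got);
        if (getsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT, &got, &len) == 0)
            result = got;
    }

    close(fd);
    return result;
}
```

A real program would set the option on its connected socket rather than a throwaway one; the read-back here only confirms that the kernel accepted the value.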