connect() with unix-domain socket and full backlog

When the listen backlog of a SOCK_STREAM unix-domain socket is full, connect(2) fails with ECONNREFUSED on most systems. It would be preferable for it to fail with EAGAIN.

The reasoning is that it is highly useful to distinguish a dead socket (the node still exists in the filesystem, but no process is listening on it anymore) from a full backlog. I ran into this problem when porting some Linux software that has code to clean up dead sockets. That code becomes a security vulnerability if it can be tricked into deleting live sockets by someone spamming them until their backlog fills up.
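To make the distinction concrete, here is a minimal sketch of how cleanup code could classify a socket node before deciding whether to delete it. It assumes Linux semantics for a non-blocking connect() on a SOCK_STREAM socket: EAGAIN when the listener's backlog is full, ECONNREFUSED when nobody is listening. On other systems ECONNREFUSED is ambiguous, so the sketch refuses to call the socket dead there. The names probe_socket() and sock_state are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical classification of a socket node found in the filesystem. */
enum sock_state { SOCK_LIVE, SOCK_DEAD, SOCK_BUSY, SOCK_UNKNOWN };

/* Probe the AF_UNIX stream socket at `path` with a non-blocking connect(). */
enum sock_state probe_socket(const char *path)
{
    struct sockaddr_un addr;
    memset(&addr, 0, sizeof addr);
    addr.sun_family = AF_UNIX;
    if (strlen(path) >= sizeof addr.sun_path)
        return SOCK_UNKNOWN;
    strcpy(addr.sun_path, path);

    int s = socket(AF_UNIX, SOCK_STREAM, 0);
    if (s == -1)
        return SOCK_UNKNOWN;
    /* Non-blocking, so Linux reports a full backlog as EAGAIN
       instead of waiting for room in the queue. */
    fcntl(s, F_SETFL, O_NONBLOCK);

    enum sock_state st = SOCK_UNKNOWN;
    if (connect(s, (struct sockaddr *)&addr, sizeof addr) == 0)
        st = SOCK_LIVE;
    else if (errno == EAGAIN)
        st = SOCK_BUSY;              /* Linux: listener alive, backlog full */
    else if (errno == ECONNREFUSED) {
#ifdef __linux__
        st = SOCK_DEAD;              /* Linux: nobody is listening */
#else
        st = SOCK_UNKNOWN;           /* BSD-style: dead OR backlog full */
#endif
    }
    /* Any other errno (e.g. ENOENT) stays SOCK_UNKNOWN. */
    close(s);
    return st;
}
```

Only SOCK_DEAD should ever lead to an unlink(); a real cleanup routine would also lstat() the path first to make sure it is a socket at all.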

Only Linux returns EAGAIN; AIX, Solaris and Darwin follow the BSD behaviour (I just tested each of them).
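A test along these lines can be sketched as follows. This is an illustrative probe, not the asker's actual test; full_backlog_errno() and MAX_ATTEMPTS are hypothetical, and the caller is assumed to pass a writable path. It listens with a backlog of 1, never calls accept(), and opens non-blocking clients until connect() fails. The clients are non-blocking because on Linux a blocking connect() to a full backlog waits for room instead of failing.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

#define MAX_ATTEMPTS 64   /* arbitrary cap on queued connections */

/* Returns the errno of the first failing connect(), or 0 if
   MAX_ATTEMPTS connections were all queued without a failure. */
int full_backlog_errno(const char *path)
{
    struct sockaddr_un addr;
    memset(&addr, 0, sizeof addr);
    addr.sun_family = AF_UNIX;
    if (strlen(path) >= sizeof addr.sun_path)
        return ENAMETOOLONG;
    strcpy(addr.sun_path, path);

    unlink(path);
    int listener = socket(AF_UNIX, SOCK_STREAM, 0);
    if (listener == -1)
        return errno;
    if (bind(listener, (struct sockaddr *)&addr, sizeof addr) == -1 ||
        listen(listener, 1) == -1) {
        int err = errno;
        close(listener);
        return err;
    }

    int clients[MAX_ATTEMPTS];
    int n = 0, result = 0;
    while (n < MAX_ATTEMPTS) {
        int c = socket(AF_UNIX, SOCK_STREAM, 0);
        if (c == -1) {
            result = errno;
            break;
        }
        fcntl(c, F_SETFL, O_NONBLOCK);
        if (connect(c, (struct sockaddr *)&addr, sizeof addr) == -1) {
            result = errno;   /* EAGAIN on Linux, ECONNREFUSED on BSD-derived systems */
            close(c);
            break;
        }
        clients[n++] = c;     /* queued but never accepted */
    }

    while (n > 0)
        close(clients[--n]);
    close(listener);
    unlink(path);
    return result;
}
```

How many connections succeed before the failure depends on how each kernel interprets the backlog argument, which is why the probe loops instead of assuming a fixed count.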

POSIX doesn't list EAGAIN among the errors connect() may return (see the POSIX specification of connect()), so there may be a compliance issue here.

What's the best route to get everyone in line with Linux? I could file bug reports with Oracle and Apple, open a FreeBSD PR, and fight it out on each organisation's mailing lists. Or should I pester someone in a standards body (the Austin Group)? Is it even advisable to try to get everyone to change here, even though the advantage is clear?

Whether you try to change the standard or how vendors have implemented connect(), I would argue that from the software's point of view it makes no difference: ECONNREFUSED and EAGAIN should both be treated as a reason to retry.

Distinguishing between the two cases may allow you to write a more specific diagnostic message on the client, but the retry logic should be the same. Even if the listener doesn't currently exist, it may eventually exist, so a retry should be attempted.

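#include <errno.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>
/* Placeholders for the application's own retry policy and handlers. */
int should_try_again(int retries);
int connect_succeeded(int s, const struct sockaddr_un *addr);
int connect_failed(int s, int err);
int connect_with_retry(const struct sockaddr_un *addr)
{
    int retries = 0;
    for (;;) {
        /* A failed connect() leaves the socket unspecified: start fresh. */
        int s = socket(AF_UNIX, SOCK_STREAM, 0);
        if (s == -1)
            return connect_failed(s, errno);
        if (connect(s, (const struct sockaddr *)addr, sizeof(*addr)) == 0)
            return connect_succeeded(s, addr);
        int err = errno;
        switch (err) {
        case EINTR:                /* interrupted by a signal */
        case EAGAIN:               /* Linux: backlog full */
        case ECONNREFUSED:         /* no listener (or, outside Linux, backlog full) */
            if (err == EINTR || should_try_again(retries++)) {
                close(s);
                continue;
            }
            break;
        }
        return connect_failed(s, err);
    }
}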
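The code above leaves should_try_again() undefined. One possible policy is sketched below, assuming a bounded number of retries with exponential backoff; MAX_RETRIES and the 10 ms to 1 s delays are arbitrary choices, not part of the answer.

```c
#include <time.h>

#define MAX_RETRIES 8   /* hypothetical limit */

/* Returns 0 once `retries` reaches MAX_RETRIES; otherwise sleeps
   10 ms, 20 ms, 40 ms, ... (capped at 1 s) and returns 1. */
int should_try_again(int retries)
{
    if (retries >= MAX_RETRIES)
        return 0;
    long delay_ms = 10L << retries;     /* exponential backoff */
    if (delay_ms > 1000)
        delay_ms = 1000;
    struct timespec ts = { delay_ms / 1000, (delay_ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
    return 1;
}
```

Because both ECONNREFUSED and EAGAIN go through the same policy, a listener that is briefly overloaded and one that has not started yet are handled identically, which is the point being made here.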

As with most things, if you can convince the source to make the move, everyone downstream will have to follow. So getting POSIX to fix the issue would certainly be best: then every conforming implementation would have to comply.

Another solution is to use only the system on which it works, i.e. Linux in this case. With Linux you can also modify the kernel to make it behave one way or the other (which is also doable with BSD).

Now, my opinion on the matter: it seems to me that the main issue here is a rogue process opening the socket to receive messages in place of the intended service. My question is: how often does that happen?

If you use a service manager such as systemd, you will have one and only one instance of the service running. When your service creates its AF_UNIX socket at that point, it really is the only process trying to do so. Therefore, deleting a leftover socket file is a non-issue.

If you allow rogue software to run on your system (i.e. it is public and many users have access, as in the old days when people would telnet into a server), then maybe you need to use TCP or UDP connections instead.

Finally, if other software is able to open that AF_UNIX socket, I think you're in trouble anyway. While your service is running, that rogue software can (1) delete your socket and (2) bind() a new one at the same path, after which (3) your new clients talk to the rogue software instead of your service, even though you are still running and listening for connections... on a socket file that no longer has a name in the filesystem. (Note that existing clients keep talking to your service; any client that reconnects talks to the rogue software.)

So... where is your issue?
