What is the correct way of reading from a TCP socket in C/C++?

Here's my code:

// Not all headers are relevant to the code snippet.
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netdb.h>
#include <cstdlib>
#include <cstring>
#include <unistd.h>

char *buffer;
stringstream readStream;
bool readData = true;

while (readData)
{
    cout << "Receiving chunk... ";

    // Read a bit at a time, eventually "end" string will be received.
    bzero(buffer, BUFFER_SIZE);
    int readResult = read(socketFileDescriptor, buffer, BUFFER_SIZE);
    if (readResult < 0)
    {
        THROW_VIMRID_EX("Could not read from socket.");
    }

    // Concatenate the received data to the existing data.
    readStream << buffer;

    // Continue reading while end is not found.
    readData = readStream.str().find("end;") == string::npos;

    cout << "Done (length: " << readStream.str().length() << ")" << endl;
}

It's a little bit of C and C++, as you can tell. The BUFFER_SIZE is 256. Should I just increase the size? If so, what to? Does it matter?

I know that if "end" is not received for whatever reason, this will be an endless loop, which is bad, so if you could suggest a better way, please do.

Without knowing your full application it is hard to say what the best approach is, but a common technique is to use a header that starts with a fixed-length field denoting the length of the rest of your message.

Assume that your header consists only of a 4-byte integer that denotes the length of the rest of your message. Then simply do the following.

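#include <stdexcept>
#include <unistd.h>

// This assumes buffer is at least x bytes long,
// and that the socket is blocking.
void ReadXBytes(int socket, unsigned int x, void* buffer)
{
    // Arithmetic on a void* is not standard, so advance through a char*.
    char* bytes = static_cast<char*>(buffer);
    unsigned int bytesRead = 0;
    while (bytesRead < x)
    {
        ssize_t result = read(socket, bytes + bytesRead, x - bytesRead);
        if (result < 1)
        {
            // 0 means the peer closed the connection, -1 means an error.
            // Throw your error; std::runtime_error is only a placeholder.
            throw std::runtime_error("Could not read from socket.");
        }

        bytesRead += static_cast<unsigned int>(result);
    }
}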

Then later in the code:

unsigned int length = 0;
char* buffer = 0;
// we assume that sizeof(length) will return 4 here.
ReadXBytes(socketFileDescriptor, sizeof(length), (void*)(&length));
buffer = new char[length];
ReadXBytes(socketFileDescriptor, length, (void*)buffer);

// Then process the data as needed.

delete [] buffer;

This makes a few assumptions:

  • ints are the same size on the sender and receiver.
  • Endianness is the same on both the sender and receiver.
  • You have control of the protocol on both sides.
  • When you send a message you can calculate the length up front.
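
The first two assumptions can be relaxed if both sides agree that the length travels as exactly four bytes in network (big-endian) order. The following is a minimal sketch under that assumed wire format; decodeLength is a hypothetical helper name, not something from the answer above.

```cpp
#include <cstdint>

// Hypothetical helper: rebuild a 32-bit length from four bytes that the
// sender wrote in network (big-endian) order. It works byte by byte, so
// the result does not depend on the endianness of the receiving host.
std::uint32_t decodeLength(const unsigned char bytes[4])
{
    return (static_cast<std::uint32_t>(bytes[0]) << 24) |
           (static_cast<std::uint32_t>(bytes[1]) << 16) |
           (static_cast<std::uint32_t>(bytes[2]) << 8)  |
            static_cast<std::uint32_t>(bytes[3]);
}
```

With this, you would read the four header bytes into an unsigned char array with ReadXBytes and pass that array to decodeLength, instead of reading straight into an integer.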

Since it is common to want to know explicitly the size of the integers you are sending across the network, define them in a header file and use them explicitly, such as:

// These typedefs will vary across different platforms
// such as linux, win32, OS/X etc, but the idea
// is that a Int8 is always 8 bits, and a UInt32 is always
// 32 bits regardless of the platform you are on.
// These vary from compiler to compiler, so you have to 
// look them up in the compiler documentation.
typedef char Int8;
typedef short int Int16;
typedef int Int32;

typedef unsigned char UInt8;
typedef unsigned short int UInt16;
typedef unsigned int UInt32;

This would change the above to:

UInt32 length = 0;
char* buffer = 0;

ReadXBytes(socketFileDescriptor, sizeof(length), (void*)(&length));
buffer = new char[length];
ReadXBytes(socketFileDescriptor, length, (void*)buffer);

// process

delete [] buffer;

I hope this helps.

Several pointers:

You need to handle a return value of 0, which tells you that the remote host closed the socket.

For nonblocking sockets, you also need to check for an error return value (-1) and inspect errno: EAGAIN (or EWOULDBLOCK) is expected there and only means that no data is available yet.

You definitely need better error handling - you're potentially leaking the buffer pointed to by 'buffer'. Which, I noticed, you don't allocate anywhere in this code snippet.

Someone else made a good point about how your buffer isn't a null-terminated C string if your read() fills the entire buffer. That is indeed a problem, and a serious one.

Your buffer size is a bit small, but it should work as long as you don't try to read more than 256 bytes at a time, or whatever you allocate for it.

If you're worried about getting into an infinite loop when the remote host sends you a malformed message (a potential denial of service attack), then you should use select() with a timeout on the socket to check for readability, read only if data is available, and bail out if select() times out.

Something like this might work for you:

fd_set read_set;
struct timeval timeout;

timeout.tv_sec = 60; // Time out after a minute
timeout.tv_usec = 0;

FD_ZERO(&read_set);
FD_SET(socketFileDescriptor, &read_set);

int r=select(socketFileDescriptor+1, &read_set, NULL, NULL, &timeout);

if( r<0 ) {
    // Handle the error
}

if( r==0 ) {
    // Timeout - handle that. You could try waiting again, close the socket...
}

if( r>0 ) {
    // The socket is ready for reading - call read() on it.
}

Depending on the volume of data you expect to receive, the way you scan the entire message repeatedly for the "end;" token is very inefficient. This is better done with a state machine (the states being 'e'->'n'->'d'->';') so that you only look at each incoming character once.
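
A minimal sketch of such a state machine follows. EndTokenMatcher is a hypothetical name, and the simple reset-on-mismatch rule is only correct because the four characters of "end;" are all different; a token with repeated characters would need a proper failure table.

```cpp
#include <cstddef>

// Hypothetical incremental matcher for the "end;" terminator.
// The state is the number of token characters matched so far, so it
// survives across read() calls even if the token is split between chunks.
class EndTokenMatcher
{
public:
    // Feed one received byte; returns true when "end;" has just completed.
    bool feed(char c)
    {
        static const char token[] = "end;";
        if (c != token[state_])
            state_ = 0;          // mismatch: forget the partial match
        if (c == token[state_])
            ++state_;            // this byte extends (or restarts) a match
        if (state_ == 4)
        {
            state_ = 0;
            return true;
        }
        return false;
    }

private:
    std::size_t state_ = 0;
};
```

Each byte returned by read() is passed to feed() exactly once, so the cost stays linear in the amount of data received no matter how the data is chunked.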

And seriously, you should consider finding a library to do all this for you. It's not easy getting it right.

If you actually create the buffer as per dirk's suggestion, then:

  int readResult = read(socketFileDescriptor, buffer, BUFFER_SIZE);

may completely fill the buffer, possibly overwriting the terminating zero character which you depend on when extracting to a stringstream. You need:

  int readResult = read(socketFileDescriptor, buffer, BUFFER_SIZE - 1 );
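
Another option, which avoids depending on a terminating zero at all, is to append exactly the number of bytes that read() reported. The sketch below assumes a std::string accumulator instead of the stringstream; appendChunk is a hypothetical helper name.

```cpp
#include <cstddef>
#include <string>
#include <unistd.h>

// Hypothetical helper: read one chunk from fd and append exactly the bytes
// received to `out`, so embedded or missing '\0' bytes do not matter.
// Returns what read() returned (0 = peer closed, -1 = error).
ssize_t appendChunk(int fd, std::string& out)
{
    char buffer[256];
    ssize_t n = read(fd, buffer, sizeof(buffer));
    if (n > 0)
        out.append(buffer, static_cast<std::size_t>(n));
    return n;
}
```

Because the length comes from read() itself, the whole buffer can be used and no bzero() call is needed before each read.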

1) Others (especially dirkgently) have noted that buffer needs to be allocated some memory space. For a smallish buffer size N (say, N <= 4096), you can also allocate it on the stack:

#define BUFFER_SIZE 4096
char buffer[BUFFER_SIZE];

This saves you the worry of ensuring that you delete[] the buffer should an exception be thrown.

But remember that stacks are finite in size (so are heaps, but stacks are finiter), so you don't want to put too much there.

2) On a -1 return code, you should not simply return immediately (throwing an exception immediately is even more sketchy). There are certain normal conditions that you need to handle if your code is to be anything more than a short homework assignment. For example, EAGAIN may be returned in errno if no data is currently available on a non-blocking socket. Have a look at the man page for read(2).

Where are you allocating memory for your buffer? The line where you invoke bzero invokes undefined behavior, since buffer does not point to any valid region of memory.

char *buffer = new char[ BUFFER_SIZE ];
// do processing

// don't forget to release
delete[] buffer;
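
If you are compiling as C++, one way to avoid the manual delete[] is to let a std::vector own the memory. This is a sketch of that alternative rather than part of the suggestion above; readChunk is a hypothetical helper name.

```cpp
#include <cstddef>
#include <unistd.h>
#include <vector>

// Hypothetical helper: read up to `capacity` bytes into a vector-owned
// buffer. The vector releases its memory on every exit path, including
// when an exception is thrown, so there is nothing to delete[].
std::vector<char> readChunk(int fd, std::size_t capacity)
{
    std::vector<char> buffer(capacity);   // zero-initialised, like bzero()
    ssize_t n = read(fd, buffer.data(), buffer.size());
    // An empty result means end of stream or an error; real code should
    // check n and errno to tell the two apart.
    buffer.resize(n > 0 ? static_cast<std::size_t>(n) : 0);
    return buffer;
}
```

The returned vector also carries the number of bytes actually received in its size(), which removes the need for a terminating zero.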

This is an article that I always refer to when working with sockets:

THE WORLD OF SELECT()

It will show you how to reliably use select() and contains some other useful links at the bottom for further info on sockets.

Just to add to things from several of the posts above:

read() returns ssize_t, at least on my system. This is like size_t, except that it is signed. On my system, it's a long, not an int. You might get compiler warnings if you use int, depending on your system, your compiler, and what warnings you have turned on.

For any non-trivial application (i.e. the application must receive and handle different kinds of messages with different lengths), the solution to your particular problem isn't necessarily just a programming solution; it's a convention, i.e. a protocol.

In order to determine how many bytes you should pass to your read call, you should establish a common prefix, or header, that your application receives. That way, when a socket first has data available to read, you can make decisions about what to expect.

A binary example might look like this:

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>

enum MessageType {
    MESSAGE_FOO,
    MESSAGE_BAR,
};

struct MessageHeader {
    uint32_t type;
    uint32_t length;
};

/**
 * Attempts to continue reading a `socket` until `bytes` number
 * of bytes are read. Returns truthy on success, falsy on failure.
 *
 * Similar to @grieve's ReadXBytes.
 */
int readExpected(int socket, void *destination, size_t bytes)
{
    /*
    * Can't increment a void pointer, as incrementing
    * is done by the width of the pointed-to type -
    * and void doesn't have a width
    *
    * You can in GCC but it's not very portable
    */
    char *destinationBytes = destination;
    while (bytes) {
        ssize_t readBytes = read(socket, destinationBytes, bytes);
        if (readBytes < 1)
            return 0;
        destinationBytes += readBytes;
        bytes -= readBytes;
    }
    return 1;
}

int main(int argc, char **argv)
{
    int selectedFd;

    // use `select` or `poll` to wait on sockets
    // received a message on `selectedFd`, start reading

    char *fooMessage;
    struct {
        uint32_t a;
        uint32_t b;
    } barMessage;

    struct MessageHeader received;
    if (!readExpected (selectedFd, &received, sizeof(received))) {
        // handle error
    }
    // handle network/host byte order differences maybe
    received.type = ntohl(received.type);
    received.length = ntohl(received.length);

    switch (received.type) {
        case MESSAGE_FOO:
            // "foo" sends an ASCII string or something
            fooMessage = calloc(received.length + 1, 1);
            if (readExpected (selectedFd, fooMessage, received.length))
                puts(fooMessage);
            free(fooMessage);
            break;
        case MESSAGE_BAR:
            // "bar" sends a message of a fixed size
            if (readExpected (selectedFd, &barMessage, sizeof(barMessage))) {
                barMessage.a = ntohl(barMessage.a);
                barMessage.b = ntohl(barMessage.b);
                printf("a + b = %lu\n", (unsigned long)(barMessage.a + barMessage.b));
            }
            break;
        default:
            puts("Malformed type received");
            // kick the client out probably
    }
}

You can likely already see one disadvantage of using a binary format: for each attribute larger than a char that you read, you will have to ensure its byte order is correct using the ntohl or ntohs functions.

An alternative is to use byte-encoded messages, such as simple ASCII or UTF-8 strings, which avoid byte-order issues entirely but require extra effort to parse and validate.
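
As one possible illustration of the text-based approach, suppose each message is a single line terminated by '\n'. That framing is an assumption made for this sketch, not something the answer prescribes, and extractLine is a hypothetical helper name.

```cpp
#include <string>

// Hypothetical helper for newline-terminated text messages. `pending`
// accumulates everything received so far. If it holds a complete line,
// the line (without the '\n') is moved into `message` and removed from
// `pending`, and the function returns true.
bool extractLine(std::string& pending, std::string& message)
{
    std::string::size_type newline = pending.find('\n');
    if (newline == std::string::npos)
        return false;                 // message incomplete; keep reading
    message = pending.substr(0, newline);
    pending.erase(0, newline + 1);
    return true;
}
```

After every read() you would append the new bytes to pending and call extractLine in a loop until it returns false, parsing and validating each message as it comes out.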

There are two final considerations for network data in C.

The first is that some C types do not have fixed widths. For example, the width of the humble int is implementation-defined: the standard only guarantees at least 16 bits, and long is 32 bits on 64-bit Windows but 64 bits on 64-bit Linux. Good, portable code should have network data use fixed-width types, like those defined in stdint.h.

The second is struct padding. A struct whose members have different widths will have padding added between some members to maintain memory alignment, making the struct faster to use in the program but sometimes producing confusing results.

#include <stdio.h>
#include <stdint.h>

int main()
{
    struct A {
        char a;
        uint32_t b;
    } A;

    printf("sizeof(A): %zu\n", sizeof(A));
}

In this example, its actual width won't be 1 char + 4 uint32_t = 5 bytes, it'll be 8:

mharrison@mharrison-KATANA:~$ gcc -o padding padding.c
mharrison@mharrison-KATANA:~$ ./padding 
sizeof(A): 8

This is because 3 bytes are added after char a to make sure uint32_t b is memory-aligned.

So if you write a struct A, then attempt to read a char and a uint32_t on the other side, you'll get char a, and a uint32_t where the first three bytes are garbage and the last byte is the first byte of the actual integer you wrote.

Either document your data format explicitly as C struct types or, better yet, document any padding bytes they might contain.
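
A third option, sketched here as an assumption rather than something the answer proposes, is to never send the struct's memory at all and instead copy each field into a byte array with a layout you define yourself. The 5-byte layout and the packA/unpackA names below are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

struct A {
    char a;
    std::uint32_t b;
};

// Assumed wire layout: 1 byte for a, then b as 4 bytes in big-endian order.
const std::size_t A_WIRE_SIZE = 5;

// Copy the fields one by one, so compiler padding never reaches the wire.
void packA(const A& value, unsigned char out[A_WIRE_SIZE])
{
    out[0] = static_cast<unsigned char>(value.a);
    out[1] = static_cast<unsigned char>(value.b >> 24);
    out[2] = static_cast<unsigned char>(value.b >> 16);
    out[3] = static_cast<unsigned char>(value.b >> 8);
    out[4] = static_cast<unsigned char>(value.b);
}

// Rebuild the struct from the same 5-byte layout on the receiving side.
A unpackA(const unsigned char in[A_WIRE_SIZE])
{
    A value;
    value.a = static_cast<char>(in[0]);
    value.b = (static_cast<std::uint32_t>(in[1]) << 24) |
              (static_cast<std::uint32_t>(in[2]) << 16) |
              (static_cast<std::uint32_t>(in[3]) << 8)  |
               static_cast<std::uint32_t>(in[4]);
    return value;
}
```

The sender writes the 5 packed bytes and the receiver reads exactly 5 bytes before calling unpackA, so sizeof(A) and the host byte order no longer matter.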