This post explains the improvements that reduce to one the number of copies required to send data over AF_VSOCK. The current mechanism for sending packets through virtio-vsock makes two copies before the data reaches the virtio-vsock device. The first copy is from the user’s application to the socket buffer: SysSocketSend() copies the content from the user into a Packet structure, which holds both the content and the vsock header. The packet is then handed to the driver through SysNetworkSend(), which only accepts Packet structures. The second copy happens in the driver, from the packet to a buffer in the available ring of the transmission queue of the virtio-vsock device. To perform it, the driver takes a free pre-allocated buffer, copies the packet's content into it, and enqueues the buffer in the available ring.
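The two-copy path described above can be modelled with a small Python sketch. This is only a toy model and not Toro's code: socket_send, driver_send_two_copies and TxQueue are hypothetical stand-ins for SysSocketSend(), the driver's send routine and the transmission queue, and the 44-byte header size is an assumption based on the packed virtio_vsock_hdr layout.

```python
from dataclasses import dataclass

HDR_SIZE = 44  # assumed size of the packed vsock header


@dataclass
class Packet:
    data: bytearray  # vsock header followed by the payload


class TxQueue:
    """Toy transmission queue: pre-allocated buffers plus an available ring."""

    def __init__(self, num_buffers: int, buffer_size: int):
        self.free = [bytearray(buffer_size) for _ in range(num_buffers)]
        self.available = []  # (buffer, length) entries exposed to the device


def socket_send(user_data: bytes, stats: dict) -> Packet:
    """First copy: user buffer -> Packet (header space + payload)."""
    data = bytearray(HDR_SIZE + len(user_data))
    data[HDR_SIZE:] = user_data
    stats["copies"] += 1
    return Packet(data)


def driver_send_two_copies(queue: TxQueue, packet: Packet, stats: dict) -> None:
    """Second copy: Packet -> free pre-allocated buffer, then enqueue it."""
    buf = queue.free.pop()
    n = len(packet.data)
    if n > len(buf):
        queue.free.append(buf)
        raise ValueError("packet does not fit in a pre-allocated buffer")
    buf[:n] = packet.data
    stats["copies"] += 1
    queue.available.append((buf, n))


if __name__ == "__main__":
    stats = {"copies": 0}
    queue = TxQueue(num_buffers=4, buffer_size=4096)
    driver_send_two_copies(queue, socket_send(b"hello", stats), stats)
    print("copies made:", stats["copies"])
```

Counting the copies in a shared dictionary keeps the sketch short; the point is only that the payload is duplicated once per stage.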
To reduce the number of copies to one, we remove the need for the second copy by reusing the packet's content in the virtio-vsock driver. This modification is tracked in the GitHub issue https://github.com/torokernel/torokernel/issues/363. Instead of allocating a new buffer, the virtio-vsock driver reuses the packet's buffer. Note that the available ring contains indexes that identify descriptors in the descriptor ring, so we make the descriptor point to the packet's buffer. After the device consumes the buffer, the driver tells the kernel that the packet has been sent, and the kernel simply releases the memory allocated for that packet. The send path does not wait for the packet to be transmitted: it fills the available ring of the transmission queue with packets and returns immediately to the user.
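The single-copy path can be sketched in the same spirit. Again this is a toy model under stated assumptions: the class and method names are hypothetical, the descriptor simply holds a Python reference instead of a physical address and length, and device_consume stands in for the real device.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    data: bytearray  # vsock header followed by the payload


class TxQueue:
    """Toy transmission queue whose descriptors reference packet buffers."""

    def __init__(self, size: int):
        self.desc = [None] * size        # descriptor ring
        self.free_slots = list(range(size))
        self.available = []              # indexes into the descriptor ring
        self.used = []                   # indexes handed back by the device

    def send(self, packet: Packet) -> bool:
        """Enqueue the packet without copying it and return immediately."""
        if not self.free_slots:
            return False                 # ring is full, caller may retry later
        idx = self.free_slots.pop()
        self.desc[idx] = packet          # descriptor points at the packet itself
        self.available.append(idx)
        return True

    def device_consume(self) -> list:
        """Stand-in for the device: read each buffer and mark it as used."""
        transmitted = []
        while self.available:
            idx = self.available.pop(0)
            transmitted.append(bytes(self.desc[idx].data))
            self.used.append(idx)
        return transmitted

    def reclaim(self, on_sent) -> None:
        """Driver side: report each consumed packet so the kernel can free it."""
        while self.used:
            idx = self.used.pop(0)
            packet = self.desc[idx]
            self.desc[idx] = None
            self.free_slots.append(idx)
            on_sent(packet)


if __name__ == "__main__":
    queue = TxQueue(size=8)
    queue.send(Packet(bytearray(b"payload")))
    queue.device_consume()
    queue.reclaim(lambda pkt: print("released", len(pkt.data), "bytes"))
```

Because send only stores a reference and returns, the packet must stay alive until reclaim reports it, which is why the kernel releases the memory only after the driver's notification.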
We have benchmarked these changes, but the results do not show a significant improvement. We use ab to compare two instances of the WebServer appliance, which runs a web server over virtio-vsock and reads files through virtio-fs. This is the appliance that hosts Toro's website. For the benchmark, we use the following command:
ab -c 1 -n 1000 http://127.0.0.1:4000/images/apporikernel.png (133 kb)
In this test, the requests per second are 2.27 [#/sec] (mean) without the changes and 2.28 [#/sec] (mean) with the changes. The time per request is 440.995 [ms] (mean) without the changes and 438.213 [ms] (mean) with the changes. During this benchmark, we use socat to translate TCP connections into vsock connections.
We then removed socat and ran a ping-pong test with different packet sizes. These are the results with the changes:
request size: 8192 bytes, time: 1138.86 ms
request size: 16384 bytes, time: 1169.71 ms
These are the results without the changes:
request size: 8192 bytes, time: 1131.54 ms
request size: 16384 bytes, time: 1165.18 ms
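A ping-pong measurement of this kind can be sketched as follows. This is not the test program used for the numbers above: the function names, the number of rounds and the use of socket.socketpair() as a local stand-in for a vsock connection are all assumptions made so that the example runs anywhere. On a Linux host, the same ping_pong function could be given a socket created with socket.AF_VSOCK instead.

```python
import socket
import threading
import time


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, since recv may return fewer than requested."""
    data = bytearray()
    while len(data) < n:
        chunk = sock.recv(n - len(data))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        data += chunk
    return bytes(data)


def echo_server(sock: socket.socket, size: int, rounds: int) -> None:
    """Pong side: send back every request of `size` bytes."""
    for _ in range(rounds):
        sock.sendall(recv_exact(sock, size))


def ping_pong(sock: socket.socket, size: int, rounds: int) -> float:
    """Ping side: return the elapsed time in milliseconds for all rounds."""
    payload = bytes(size)
    start = time.perf_counter()
    for _ in range(rounds):
        sock.sendall(payload)
        recv_exact(sock, size)
    return (time.perf_counter() - start) * 1000.0


def run_demo(size: int = 8192, rounds: int = 10) -> float:
    """Run both sides locally over a socket pair (stand-in for vsock)."""
    client, server = socket.socketpair()
    worker = threading.Thread(target=echo_server, args=(server, size, rounds))
    worker.start()
    try:
        return ping_pong(client, size, rounds)
    finally:
        client.close()
        worker.join()
        server.close()


if __name__ == "__main__":
    for size in (8192, 16384):
        print(f"request size: {size} bytes, time: {run_demo(size):.2f} ms")
```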
For some reason, the instance that makes two copies takes slightly less time than the instance that makes a single copy. We need more benchmarks to understand why this improvement has no effect on these tests. The next step would be to achieve zero copy by reusing the user’s buffer. This is how the virtio-fs driver is implemented: the user’s buffer is shared with the device.
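To put the measurements in perspective, the short script below computes the relative change between the two variants for each figure reported in this post. The helper name relative_change is only illustrative. In every case the difference is below 1%, which may well be within run-to-run noise, although the post does not report variance.

```python
def relative_change(before: float, after: float) -> float:
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100.0


# (label, without the changes, with the changes), copied from the post.
MEASUREMENTS = [
    ("ab requests per second [#/sec]", 2.27, 2.28),
    ("ab time per request [ms]", 440.995, 438.213),
    ("ping-pong 8192 bytes [ms]", 1131.54, 1138.86),
    ("ping-pong 16384 bytes [ms]", 1165.18, 1169.71),
]


def report() -> list:
    """Return (label, percentage change) for every measurement."""
    return [
        (label, relative_change(before, after))
        for label, before, after in MEASUREMENTS
    ]


if __name__ == "__main__":
    for label, change in report():
        print(f"{label}: {change:+.2f}%")
```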