Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

A Study of I/O Performance of Virtual Machines

A Study of I/O Performance of Virtual Machines Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 © The British Computer Society 2017. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Advance Access publication on 28 September 2017 doi:10.1093/comjnl/bxx092 A Study of I/O Performance of Virtual Machines 1* 1 1,2 GIUSEPPE LETTIERI ,VINCENZO MAFFIONE AND LUIGI RIZZO Dipartimento di Ingegneria dell’Informazione, Università di Pisa, Italy Google, Inc., Mountain View, CA, USA (work performed while at Università di Pisa, Italy) Corresponding author: [email protected] In this study, we investigate some counterintuitive but frequent performance issues that arise when doing high-speed networking (or I/O in general) with Virtual Machines (VMs). VMs use one or more single-producer/single-consumer systems to exchange I/O data (e.g. network packets) with their hypervisor. We show that when the producer and the consumer process packets at different rates, the high cost required for synchronization (interrupts and ‘kicks’) may reduce throughput of the sys- tem well below the slowest of the two parties; moreover, accelerating the faster party may cause the throughput to decrease. Our work provides a model for throughput, efficiency and latency of produ- cer/consumer systems when notifications or sleeping are used as a synchronization mechanism; iden- tifies different operating regimes depending on the operating parameters; validates the accuracy of our model against a VirtIO-based prototype, taking into account most of the details of real-world deployments; provides practical and robust strategies to maximize throughput and minimize energy while keeping the latency under control, without depending on precise timing measurements nor unreasonable assumptions on the system’s behavior. The study is particularly interesting for Network Function Virtualization deployments, where high-rate producer/consumer systems in vir- tualized environments are the core components. Keywords: virtual machines; high-speed I/O; energy efficiency Received 20 January 2017; revised 23 July 2017; editorial decision 11 September 2017; Handling editor: Iain Stewart 1. INTRODUCTION which must be kept streaming to avoid abysmal performance (and mechanical wear) due to frequent start/stops. Large buffers Computer systems have many components that need to in that case came to help in achieving decent throughput; the exchange data and synchronize with each other, to determine inherently unidirectional (and sequential) nature of tape I/Odoes when new data can be sent or received. The timescales of not call for more sophisticated solutions. these interactions span from the nanosecond range for on- We are interested in a similar problem in the communication chip hardware (CPU, memory), to hundreds of nanosecond or between a process that runs in a VM, issuing I/O operations at microseconds for processes or Virtual Machines (VMs) and high speed, and the hypervisor software implementing the corre- their hypervisors, up to milliseconds or more for peripherals sponding ‘virtual’ I/O device. In these cases, we aim at through- with moving parts (such as disks or tapes), or long-distance puts of tens of Gigabits per second, millions of I/O operations communication. per second, and reasonably low delays (tens of microseconds) in Synchronization can be implicit, e.g. when a piece of hard- the delivery of data. The problem is particularly interesting ware has a guaranteed response time; or it can be explicit, rely- when thetypeofI/O is networking. The latency aspect, tightly ing on polling (i.e. repeatedly reading memory or I/Oregisters related with the bidirectional nature of network communication, to figure out when to proceed, possibly using short sleeps to low- is what makes the problem a hard one. Moreover, mechanisms er CPU usage) and/or asynchronous notifications,e.g.interrupts. that allow VMs to exchange network packets between each The cost of synchronization can be highly variable, and some- other at high speed are an enabler technology for the Network times even much larger than the data processing costs. This used Function Virtualization paradigm [1]. Any optimization to be a well-known problem when accessing magnetic tapes, SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 ASTUDY OF I/OPERFORMANCE OF VIRTUAL MACHINES 809 addressing these basic mechanisms can potentially impact thou- may obtain estimates of these parameters. In Section 5,we sands deployments, through popular cloud management soft- experimentally validate our models using a representative ware like OpenStack [2]. implementation of a VirtIO producer/consumer system. In Synchronization in these scenarios typically requires inter- Section 6, we relax some of the simplifying assumptions rupts, context switches and thread scheduling for incoming adopted in the model and study the consequences using both traffic, system calls and I/O register access (which translates our VirtIO implementation and a simulator; here we show in expensive ‘VM exits’ on VMs) for outgoing traffic. The how our model is useful to understand real-world perform- high cost of these operations (often in the microsecond range) ance issues. In Section 7, we present some practical method means we cannot afford a synchronization on each packet to identify the operating regime of a system; then we suggest without killing throughput. how to choose the synchronization method and the tunable Amortizing the synchronization cost on batches of packets parameters to improve performance depending on the regime. [3–5] greatly improves throughput, but has an impact on In Section 8, we apply these strategies to two representative latency, which is why several network I/O frameworks [6–9] design examples and experimentally show their benefits. In rely on busy waiting to remove the cost of asynchronous noti- Section 9, we discuss some of the limitations of the proposed fications and keep latency under control. model and suggest some possible extensions. Finally, Busy wait polling has, however, a significant drawback Sections 10 and 11 report related works and our conclusions. related to resource usage: it consumes a full CPU core, may keep busy the datapath to the device or the memory being mon- itored, and the power dissipated in the polling loop may prevent 2. SYSTEM MODEL the use of higher clock speeds on other cores on the same chip. To gain a better understanding of the problem of our interest, Using short sleeps instead of busy waiting can help reduce the in this section, we will study the behavior of a system made CPU consumption while preserving good throughput. of two communicating parties, as in Fig. 1:a Producer P and A middle ground between asynchronous notifications and a Consumer C, where P sends one or more messages at a busy waiting is implemented by modern ‘paravirtualized’ VM time to C through a shared FIFO queue with L slots. devices [10] and interrupt handling [11] strategies. In these The basic assumptions of the model are that P and C can solutions, the system uses polling under high load conditions, work in parallel and the the cost of inspecting the shared state but reverts to asynchronous notifications after some unsuc- (e.g. to ascertain the number of messages in the queue) is cessful poll cycles. negligible if compared with the cost of all other operations The key problem in these solutions is that strategies to that they must perform. These operations include the process- switch from one mechanism to another are normally not ing of the messages, sending and receiving notifications, adaptive, and very susceptible to fall into pathological situa- going to sleep, waking up and so on. These assumptions are tions where small variations in the speed of one party cause typically true in VM environments, where P and C are two significant throughput changes. In our tests, we have fre- threads that live on the opposite sides of a VM boundary in a quently seen systems moving from 100–200 Kpps to 1 Mpps multi-core system. In this environment, accessing shared with minuscule changes in operating conditions [4]. Even memory is much cheaper than, e.g. sending notifications. On when the throughput shows less dramatic variations, the sys- the contrary, non-virtualized I/O where either the producer tem’s resource usage may be heavily affected, which is why (for reception) or the consumer (for transmission) is imple- we need to understand and address this instability. mented as part of a peripheral device, does not perfectly Note that these kinds of problem mostly show up under extreme operating conditions, e.g. when a system is processing a large number of packets-per-second during a DOS attack. In R Producer Consumer those situations, real-world applications may suffer from a num- ber of other, unrelated problems. To isolate the synchronization problem from the rest, this study is limited to mathematical FIGURE 1. System model. Producer and consumer exchange mes- modeling, simulation and synthetic-workload experiments. sages through a queue, blocking, sleeping or busy waiting when The rest of the paper is organized as follows. In Section 2, full/empty, and possibly exchanging notifications to wake up the we provide a model for a single-producer/single-consumer blocked peer. The producer receives request to produce new mes- system under different synchronization mechanisms, explain- sages from a request queue R. ing how different operating regimes may arise and what kind of impact on performance comes by speed differences, delays Note that at very high message rates (say, several tens of millions of mes- and queues. In Section 3, we analyze our models and derive sages per second) the costs of accessing the shared memory can no longer be criteria to compare the different operating regimes against neglected, since the time spent producing and consuming each message each other, basing on the value of operating parameters. In becomes comparable to the time spent stalling on cache misses. Such scen- Section 4, we give suggestions on how the system designer arios are out of the scope of this paper. SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 810 G. LETTIERI et al. match this model. In fact, the peripherals can only access blocked or sleeping) because it had previously seen a full memory using relatively expensive DMA operations that go queue, or it may be busy sending a notification to C. The through the PCIe bus. Moreover, P can block and packets are main purpose of this additional queue is to decouple the time never dropped when the FIFO is full (packets may still be when new messages ‘should be produced’ from the time they dropped in the stages that come before P and after C, but this are actually produced when the other communication activ- activities are outside of our model). This is typically true for ities of P are taken into account. the VM/backend boundaries we are interested in, but it is Ideally, we would like our system to process messages at a conspicuously false if we map P to a network interface. For rate set by the slowest of the two parties, and with the min- this reason, models such as [12], that were developed in the imum possible latency and energy per message. As we will past for hardware-based interrupt and polling, do not transfer see, actual performance may be very far from our expecta- easily to virtualized environments and new models, such as tions and from optimal values. the one propose in this work, need to be studied. Before starting our analysis, we define below the para- In our model, because of the assumptions on threading and meters used to model the system (see Table 1). the cost of shared memory operations, many situations may We measure the cost (i.e. the amount of work) of the vari- arise in which P and C are able to work in parallel without ous operations in clock cycles rather than time. This will ease incurring any synchronization cost. Whenever C has finished reasoning about efficiency when our system has the option to processing the last message from the queue, it can inspect the use different clock speeds to achieve a given throughput. queue again and immediately see if new messages have Some parameter-specific additional assumptions: (i) all the become available, in which case it can process them right time spent in S and S is actual work that the CPU must per- away. Similarly, whenever P has finished producing a mes- form to complete the notification and schedule the notified task; sage that has filled the queue, it can look again and see if C (ii) the sleep cost Y , which is a system-dependent parameter, is has freed up some space in the meantime, allowing P to pro- also the minimum length of any sleep interval (Y or Y ). C P duce some more messages. Each message that P produces Throughput, energy and latency all depend on the pattern keeps C active for some more time, in turn giving more time of requests coming to the producer. For throughput and to P to produce more messages. In this way, P and C can sus- energy measurement, we assume greedy regimes, where R is tain each other for long. never empty, which means that the producer generates new If their speed does not match, however, the faster party messages continuously, and we observe the corresponding will eventually run out of work and will have to wait for the values at regime. slower one. C cannot proceed if it finds an empty queue after Regarding latency, we observe the time elapsed between the consumption of the last message, and dually P cannot pro- the moment a request reaches the extraction point of the R ceed if it finds a full queue. In these cases, the parties must queue and the moment the same request is served by C, for take special actions to find out when their activity is pos- any possible pattern of previous requests coming from R. The sible/needed again. We consider three kinds of special rationale of this definition is to study how much service delay actions: a latency-sensitive request can experience, especially when the system is under load—e.g. requests arriving on R at high rate. � polling by busy waiting, continuously checking the state of the queue without leaving the CPU core to any TABLE 1. The parameters used in the analysis. other task; � polling by sleeping for a fixed amount of time, pos- L The length of the queue sibly repeatedly, if nothing has changed after the wake W Cost for P to process one message and enqueue it up; Cost for C to dequeue one message and process it � blocking (yielding the CPU core to other tasks) and k Threshold used by P to notify C. When C is blocked and P asking for an explicit notification from the other party. queues a message, a notification is sent when the queue reaches k messages (typically k = 1) P P Busy waiting can waste large amounts of CPU cycles when Threshold used by C to notify P (notifications are sent when there is no communication. Notifications on the other hand slots are available) involve extra work to be sent and received, and may be deliv- N The cost for P to notify C about a queue state change ered with some delay. Sleeping, finally, may increase the N The cost for C to notify P about a queue state change latency of messages that arrive at the wrong time. S The cost for P to start after a notification from C In our model, P tries to produce a new message as soon as S The cost for C to start after a notification from P it receives a new request from a private, infinite queue R. The length of the sleep interval for C Once started, however, an operation cannot be interrupted. Y The length of the sleep interval for P Therefore, requests may queue up in R since P may be busy Y The cost of a sleep operation serving a previous request, or it may be inactive (either SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 ASTUDY OF I/OPERFORMANCE OF VIRTUAL MACHINES 811 The combinations of synchronization methods and para- representing the worst case service latency, where the meters can give rise to a large number of operating regimes, request arrives immediately after P has started to serve a which we describe next. As we will see, some regimes are previous request. more favorable than others, so we will try to determine the ... ... P: conditions that cause the system operate in a given regime x, ... ... C: and, for each of them, we will determine (i) the average time between messages, T (the inverse of the throughput); (ii) the In case (W < W ), the worst case latency has to take into PC total energy per message E (which includes the work of account the time needed by C to process the L messages both P and C); and (iii) an upper bound D for the latency (as already in the queue. Hence, we have defined above) experienced by any request. To study the evolution of the system, we will draw many TW ={ max ,W}, BW P C diagrams that show the parallel activities of P and C over E ==2, T WW+ +∣∣ WW- BW BW P C P C time, using the following symbols: ì ü ï2i WW+<fW W,ï PC C P ï ï D £ () 1 BW í ý ï(+LW1i ) fW >W.ï Symbol Description CC P îï þï producer processing a message In the BW regime, throughput and latency are optimal, with consumer processing a message latency only depending on the processing times and the producer busy waiting length of the shared queue. consumer busy waiting producer sleeping consumer sleeping producer sending a notification 2.2. Polling by sleeping consumer receiving a notification Here we assume that P and C synchronize by going to sleep consumer sending a notification for a fixed amount of time: Y units of time for the consumer producer receiving a notification and Y for the producer. We can identify three greedy regimes depending on whether the producer processes messages faster or than the consumer or not, and also depending on whether The length of the symbol measures the time spent by P or the queue between P and C is sufficiently long to absorb the C in the corresponding activity. sleep times Y and Y . C P For latency measurements, we focus on a single message When the queue is sufficiently long, the slowest party is and use the following additional symbols: always actively working, and the system throughput only depends on its processing time. The fastest party instead peri- Symbol Description odically sleeps, waiting for its peer to catch up and make message at extraction point of R/leaves the system more work available. If the fastest party sleeps for too long, producer processing the selected message however, also the slowest one will run out of work and sleep, consumer processing the selected message so that the system works at reduced throughput. In our model, the system may be in one of the three operat- ing regimes, depending on the relative size of the system parameters. The conditions to check can be grouped in three 2.1. Polling by busy waiting inequalities, whose possible states are summarized in Table 2 together with a corresponding acronym. Each regime corre- When the system uses busy waiting (BW), P and C are sponds to a different combination of the inequality condi- always active, and the slowest of the two spins for the other tions, and it is identified by an acronym (sleeping fast to be ready. On each message, this requires on average a number of cycles ∣WW - ∣ equal to the difference in pro- PC TABLE 2. Conditions for the sleeping-based regimes (‘−’ means cessing work between the two parties. ‘don’t care’). Detailed explanations are in Sections 2.2.1–2.2.3. In order to compute the latency as defined in Section 2,we consider all the possible states the system can be when a (-LW 1) -W (-LW 1) -W PC CP request arrives at the extraction point of R, and find the one W < W >Y − sFC CP C that has the worst latency. <Y − sLS An example of evolution of the system over time for W > W − <Y sLS CP P the case W < W is shown below, with rectangles repre- CP − >Y sFP senting processing and the two vertical arrows SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 812 G. LETTIERI et al. consumer (sFC), sleeping fast producer (sFP), long sleeps TW = , sFC P (sLS)) which is explained in the following sections. EW=+W + , sFC P C To find the worst service latency for each of the three regimes, all the possible internal states of the system must be DD £( max , 2W+Y+W). (4) sFC sI P C C examined, which means considering all the allowed combina- tions of P and C being active or sleeping and the number of Throughput is optimal because the system is processing mes- messages in the queue. In particular, if a request arrives when sages at the rate of the slowest party (P). Increasing Y the system is idle, independently of the regime, the upper bound reduces the energy, but increases maximum latency, so a D for the latency can be derived from the following diagram: sI trade-off is necessary. In any case, Y cannot be increased too much, to prevent P from fill up the queue and sleep. ... ... P: ... ... C: 2.2.2. Sleeping fast producer where the request arrives right after P starts to sleep, and C If W < W , we have a regime similar to sFC, but with the PC starts to sleep right before the request is published in the roles of P and C reversed. P is faster, so it eventually fills up queue. Hence, we have the queue and goes to sleep for Y cycles. Since the sleep interval is short enough (i.e.YL <( - 1)W -W ), C is PC P DY£+W +Y +W.2() sI P P C C never able to empty the queue, and can work at its maximum rate. In this regime, C is always active, while P alternatively This formula will be useful to describe the upper bound latency produces a batch of messages and sleeps. With a reasoning for sFC and sFP, as described in Sections 2.2.1 and 2.2.2. similar to the one reported in Section 2.2.1, we can derive the average batch size b = and write T and E . sFP sFP WW - CP 2.2.1. Sleeping fast consumer Note that as W approaches W , the batch b grows to infin- P C If P is slower than C (i.e. W < W ), C eventually empties the CP ity in both sFC and sFP, and P and C proceed in lockstep at queue and goes to sleep for Y cycles. Since the sleep interval is the ideal rate of one message every W = W cycles. PC not too long (i.e.YL <( - 1)W -W ), C never allows P to CP C The worst case service latency, depending on the relative fill up the queue, so that P can work at its maximum rate, produ- size of parameters, may show up when the system is idle or cing a packet every W cycles. Each time C wakes up, it quickly when a request has to wait for C to process the L packets empties the queue and goes to sleep. The evolution of the sys- already in the queue. The latter case happens when tem over time is shown below, with horizontal arrows represent- YY+<LW . Hence we have PC C ing sleeps. DD £( max ,(L+ 1)W). (5) sFP sI C ... ... P: ... ... C: Also in the sFP regime throughput is optimal, and energy decreases as Y increases. If the latency is bounded by While P is always active, C alternatively consumes a batch (+LW 1) , there is no dependency on Y and we can choose C P of messages and sleeps. The batch size is generally not con- its value as the maximum one that does not cause C to sleep. stant, but oscillating between two consecutive values. If , Otherwise, Y should be limited to bound latency as needed. and are rational numbers, the evolution is periodic. If W Y P C is the number of messages processed by C in a multiple of the period, and the number of sleeps in the same interval, 2.2.3. Long sleeps then b = is the average batch size, and we can write If the faster party sleeps for too long, also the slower one will run out of work and sleep. This clearly means that the nW=+ nW hY,3() CP C C C C throughput will not be optimal as it is for sFC and sFP, i.e. TW ³ if W < W andTW ³ if W < W . sLS P CP sLS C PC n Y C C from which we get b== . The batch size oscil- As confirmed by our simulations, sLS cause the system evo- h WW - C PC lates between ⌊b⌋ and ⌈b⌉, depending on how P and C inter- lution to be quite complex, although periodic. Closed formulas leave during the batch. Knowing b we can determine E , for T and E are hard to find and probably not much use- sFC sLS sLS considering that the sleep cost is amortized over a batch of b ful. Instead, we provide some upper and lower bounds by con- messages on average. sidering the best and the worst possible scenarios. If W ³ Y , the worst case service latency for sFC shows Throughput bounds for sLS: The best scenario is the PP up with a greedy input pattern, since the request has to wait one that maximizes the time for which P and C work in an additional Y before being served by C. Otherwise, if parallel, and the sleeps are perfectly aligned to make the W £ Y , the worst situation corresponds to the case when system process the same number of packets in each peri- PP the system is idle. In formulas, we have od, as shown in the figure below for the W < W case. CP SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 ASTUDY OF I/OPERFORMANCE OF VIRTUAL MACHINES 813 for E , we compute the energy assuming that both P and C m L sLS ... . .. P: are able to process their maximum batch each time they sleep, ... ... C: even if this is not actually possible. Thus, we can write In this scenario, the system processes L + m messages per Y Y EE EW>+W+ + .8() sLS P C (-LW 1) -W é CPù Lx + ⌈⌉ 2Lx + ⌈⌉ m = period, with . The number m is derived ê ú WW - PC ê ú noting that P starts filling the queue with a delay W and then Y opt E Considering that E ()YW= +W+ (as per sFC PC Lx + keeps working in parallel with C until C empties the queue. Equation (4)), it is enough to prove that following inequality Hence, we have holds (+Lm)W +Y CC C 11 1 T ³ =+ W .6() sLS C "³ x 0, < + ,9() Lm + Lm + L +x 2Lx + ⌈⌉ Lx + ⌈⌉ If W < W , we can write an analogous expression for m PC but this can be easily shown to be always true by means of and a lower bound for T by simply swapping P and C. sLS some algebraic manipulations. In the worst case scenario, P and C never work in parallel, Applying a specular reasoning to the case W < W , it can PC alternatively filling and emptying the whole queue in each opt be inferred that E >( EY ). In conclusion, we have sLS sFP P period. This can happen only ifYL >( - 1)W -W and CP C shown that the sLS regime is not convenient in terms of YL >( - 1)W -W (cf. Table 2). One of the two parties PC P energy efficiency. This is a useful information, because sLS sleeps only once per batch, while the other may sleep more is also not optimal for throughput, so that excluding it from times. As a consequence, the length of the period is not larger our solution space will not result into a trade-off. than max{+YLW ,Y +LW }, L packets are processed PP C C Latency upper bound for sLS: In the worst case, a request during each period and we have arrives at the extraction point of R when P has just started filling the last element in the queue, while C is sleeping (possible because Y Y P C TW£+ max , W+ . (7) sLS P C {} YL >( - 1)W -W ). CP C L L L − 1 h Y p p ... ... P: Equations (6)and (7) show that inter-message distance tends ... ... C: to increase linearly with the sleep interval length, and that, in the Y h Y c c c worst case, the sleep interval is amortized over L messages. P has to wait for C to wake up and empty the queue (pos- Energy lower bound for sLS: In Sections 2.2.1 and 2.2.2, sible becauseYL >( - 1)W -W ), before it can produce PC P we have seen that the most energy-efficient sleep length is opt opt the request. From the diagram, it is clear that YL =( - 1)W -W when W < W andYL =( - 1) PC C CP P YY³= ⟹h 1 (otherwise this would not be the worst PC P WW - when W < W . While lower and upper bounds for CP PC case). When P wakes up and serves the request, in the worst E could be obtained with techniques similar to the ones sLS case, C misses the new event and pays an additional sleep. If used for T , for our purposes, it is enough to show that sLS opt opt we ignore that P and C sleep together for a while before C E >( EY ) and that E >( EY ). This would sLS sFC sLS sFP C P starts draining the queue, pretending the two sleeps are serial- mean that the per-message energy for sLS is worse than the ized, then we have D £+YY +W+Y +W when sLS C P P C C best possible energy in sFC (or sFP), and consequently that YY ³ . Similarly,YY³= ⟹h 1. When C wakes up PC CP C the energy efficiency of the sleeping mechanism is optimal after the queue has been emptied, it will find the request pro- when the sleep interval of the faster party is the largest one duced and can serve it, hence D £+YLW ++Y W . sLS C C C C that still prevents the slower one to sleep. Using the inequalityYL >( - 1)W -W , we can upper PC P Focusing on the case W < W , we observe that the maximum CP (-LW 1) -W CP bound the term LW . In conclusion, independently on the batch size for C is L + ⌈x⌉,with x = , as described WW - PC relative size of Y and Y we have C P in the throughput lower bound scenario above. The maximum batch size for P is instead 2Lx + ⌈ ⌉, corresponding to the follow- DY£+22 Y+W+W. (10) sLS C P P C ing time diagram: L m L As a particular yet interesting case, ifYY » then we ... ... PC P: ... ... C: havehh== 1, which means that the worst case service PC delay is bounded by only two times the sleep interval: which is similar to the previous one, with the only difference DY£+2. W+W (11) sLS P C that P is not sleeping when C starts. To derive a lower bound SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 814 G. LETTIERI et al. 2.3. Notification-based regimes derived noting that C starts processing with an initial delay S , and then catches up draining the queue a little bit at a When the system uses notifications, we can identify five differ- time. Knowing b, it is easy to determine T and E , con- nFC nFC ent regimes. Similar to what we described in Section 2.2,the sidering that the notifications and startup costs are amortized regime depends on the relative size of W and W andalsoon P C over batches of b messages: whether the queue is able to absorb the startup times S and S . With a sufficiently long queue, also in this case, the slow- TW=+ ; nFC P est party will determine the overall throughput, but the need to periodically stop and restart using notifications will add an NS + PC EW=+W + .1()2 nFC P C overhead (which can be significantly large) to the average message processing time. When the queue becomes too short to absorb the notifica- A large b improves the performance of the system, and since tion latency, one party may block despite being slower than b ³ k we would like k to be large. However, systems nor- P P the other one, significantly reducing throughput. mally use k = 1 for two reasons: a larger k often increases P P Two non-intuitive results of our analysis are that (i) the sys- the latency of the system and more importantly, P often can- tem’s performance can be improved by slightly slowing down not tell whether there will be more messages to send after the the fastest party, in order to reduce the overhead of notifications, current one. and (ii) the threshold for notifications has opposite effects Assuming k = 1, the worst case delay experienced by a depending on whether we are in a long or short-queue regime. request at the head of R includes the cost of a producer notifi- As a consequence, correctly identifying the operating cation and a consumer startup. When m = 1, in particular, the regime is fundamental for properly tuning (either manually, request has to wait for two producer notifications before or automatically) the system’s parameters. being served, as illustrated in the following diagram: Similar to what we presented in Section 2.2,the five operat- ... ... P: ing regimes (notified fast consumer (nFC), notified fast producer ... ... C: (nFP), slow consumer startup (nSCS), slow producer startup (nSPS) and slow producer and consumer startup (nSS)) are told so that we have apart by means of three inequalities, summarized in Table 3. DW£+22N+S+W. (13) nFC P P C C 2.3.1. Notified fast consumer When C is faster than P (i.e. W < W ), C will start after the CP notification from P and eventually drain the queue and block. If 2.3.2. Notified fast producer C starts fast enough (i.e. S <(Lk - )W -W ),the queue CP P C When W > W , we can identify a different regime, which we CP will never become full and, therefore, P will never block. The call nFP (fast producer), which behaves like nFC but with periodic evolution of the system over time is shown below, the roles of P and C reversed. P is faster than C, so the queue with triangles indicating notifications and wake-ups. eventually fills up and P blocks. The notification from C to restart P is sent when there are k empty slots in the queue. If ... ... C P: ... ... P starts fast enough (i.e. S <(Lk - )W -W)), it refills C: PC C P b the queue before it becomes empty and, therefore, C never blocks. In this regime, P is always active, and periodically gener- We omit the T and E formulas for brevity, but the ana- ates notifications when C is blocked and the queue contains nFP nFP lysis and graphs in the rest of the article also cover this regime. k messages. The number of messages processed by C (and The latency analysis is more interesting, since a request at Sk +( - 1)W êCP Cú P) in each round is b = + k . The number b is ê ú WW - the head of R has to wait for C to process the L messages PC ë û already in the shared queue; since C periodically notifies P, the latency is delayed by a number of notifications that is pro- TABLE 3. Conditions for the notification-based regimes (‘−’ means ‘don’t care’). Detailed explanations are in Sections 2.3.1–2.3.5. portional to L and inversely proportional to the batch b: (-Lk)W -W (-Lk )W -W æ ö ê ú PP C CC P Lk - ç ÷ DW£+21 LW+N + . ()14 nFP P C C ê ú÷ è ê úø ë û W < W >S − nFC CP C <S >S nSCS C P >S <S W > W nSPS CP C P 2.3.3. Slow consumer startup − >S nFP Regime nSCS differs from nFC in that C is fast but has a − <S <S nSS C P long startup delay, so P can fill the queue before C has a SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 ASTUDY OF I/OPERFORMANCE OF VIRTUAL MACHINES 815 chance to remove the first message. This forces P to block Just looking at the equation, it might seem that there is a until k messages are drained and C generates a notification. good amortization of the notifications and wake-up costs The situation then repeats periodically once C has drained (once per L messages). However, the timing diagram shows the queue, as shown by the following diagram: clearly that P and C alternate their operation, making the throughput less than half of that of the fastest party. L − k k P m P ... ... In the worst case, latency scenario for nSS, assuming k = 1, P: ... ... C: a request arrives at the extraction point of R when P has just k L − k C C started producing the last available slot in the queue, while C is slowly starting up. Once C wakes up, it starts draining the queue, The cycle contains L + m messages, where notifying P after k messages. When P wakes up, it produces the request and notifies C that serves the request as soon as it wakes ê ú (-Lk )W -(S +W) CC P P up again. m = ê ú+( 1. 15) ê ú WW - PC ë û ... . .. P: ... ... C: We omit the formulas for T , E and D as they are nSCS nSCS nSCS k L − k C C long and not particularly useful. The worst case latency ana- It is clear from the diagram that the delay experienced by lysis reported in Section 2.3.5 is also also valid for nSCS.In the sensitive request includes all the notification (N , N ) and any case, important insights on throughput for this regime P C startup times (S , S ). The exact formula for nSS latency is come from the analysis of the above diagram and P C not shown, as it is more useful to present an upper bound that Equation (15). The slow party (P) has to wait because of a is also valid for the nSCS and nSPS regimes large S , and increasing k reduces m, thus extending the idle C C time for P and increasing the amortized cost of notifications DW£+21(k + )W+2S+N+N+S. (17) SQ P C C C C P P and startups. Note that k has opposite effects on performance in the In general, regimes with short queues are unfavorable and two regimes nSCS and nFP, due to the slow startup time: in we should avoid operating the system in them. nFP, a large k improves performance, whereas in nSCS,we should use a small k . 3. ANALYSIS OF THROUGHPUT, LATENCY AND 2.3.4. Slow producer startup EFFICIENCY This regime is symmetric to nSCS, and it appears when the producer is faster than the consumer, but slow to respond to a Using the equations for T , E and D derived in Section 2, x x x notification. For brevity, we omit the formulas, which be we now explore how throughput, efficiency and latency obtained from the nSCS case by swapping every P with C. change depending on the parameters, for each of the three The long startup time leads to different choices for the param- synchronization mechanisms modeled (busy waiting, sleep- eter k : in regime nFC, we aim for a large k , whereas in P P ing, notifications). We also compare the mechanisms against regime SPS, we should use a small value for that parameter. each other, highlighting advantages and drawbacks. 2.3.5. Slow producer and consumer startup This regime combines the previous two. P and C alternate 3.1. Throughput operation due to the large startup delays, and individual speeds only matter in relation to the startup times. An example of evo- We start our analysis looking at the average time between lution over time is shown in the following diagram. messages, T . Since busy waiting (BW) is optimal for this per- k L − k P P formance indicator, we first compare the other two mechan- ... . .. P: isms against BW. ... ... C: L − k k C C 3.1.1. Throughput for notification regimes Each round in this case comprises exactly L messages, and In Fig. 2,weplot T for notification-based regimes, for a and have a relatively simple form: T E given W (consumer processing time) and variable W (produ- nSS nSS C P cer processing time). The region to the left of W = W corre- PC kW++ k W N+ S+ N+ S PP CC P P C C sponds to a fast producer. T = ; nSS There are three curves of interest here. The dotted line at NS++N +S PP C C the bottom represents the minimum inter-message time, EW=+W + .1()6 nSS P C which isTW ={ max ,W }. This corresponds to the best BW P C throughput we can achieve if efficiency is not an issue, and SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 816 G. LETTIERI et al. can be obtained with busy waiting, i.e. keeping the fastest Since the equations governing the system are completely party continuously spinning for new opportunities to work. symmetric, the curves for a fixed W and variable W have a P C The next curve (solid line) represents T and T , corre- shape similar to those in Fig. 2. This shows that there are nFP nFC sponding to the first two notification regimes. Here the dis- regions of operation where increasing the processing costs tance between messages is higher than in the ideal case due (W in nFP, W in nFC) increases throughput. P C to the effect of notifications and startup times. These are While the graphs focus on variations of W and W , they P C amortized on the number b of messages per notification; b also show the sensitivity of the curves to other parameters. changes in a discrete way with the ratio W /W , hence, the As an example, the distance among T , T and the optimal nFP PC nFC curve has a staircase shape. value T is bounded by N /k and N /k , so we have CC BW PP It should be noted that depending on the queue size L and knobs to reduce the gaps. Also, the position of the last big the values of the operating parameters, we cannot guarantee jump in throughput in regime nFC can be controlled by that the system operates in regimes nFC or nFP. Sections increasing S . This means that all the rest being the same, a 2.3.3, 2.3.4 and 2.3.5 indicate the conditions for which we slower wake-up time improves performance. may enter one of the three regimes nSCS, nSPS or nSS, all of which have a larger inter-message time than nFC and nFP. 3.1.2. Throughput for sleeping regimes Hence, our third curve of interest is labeled T , which nSS In Fig. 3, we plot T as a function of Y for sleeping regimes, corresponds to W++WN/L+N/L and marks the best x PC P C C assuming W < W . While Y is small enough (i.e. possible performance in regime nSS. Operating curves for CP C YL £( - 1)W -W ), the curve corresponds to T , nSCS and nSPS are not shown for the sake of simplicity, but CP C sFC where the inter-message time is constant and optimal, match- they lay above T and T , and partially also above T . nFP nFC nSS ing the one achieved with busy polling. This happens because It is important to note that performance can jump among P works at full speed, never finding the queue full and never T , T and T even for small variations of the operating nFP nFC nSS spending time for synchronization. parameters. Hence, it is imperative to either make the region For larger values of Y , we reach the sLS regime, where the between the two curves small, or set parameters to minimize queue is not able to absorb the sleep interval anymore and P the likelihood of regime changes. repeatedly fills the queue and goes to sleep. As explained in Going back to the analysis of operating regimes, we note Section 2.2.3, the average distance between packets increases that both nFC and nFP have two different regions, separated and the exact dependency from Y is complex. Lower and by the vertical dotted lines in the figure. These boundaries upper bound curves (dashed lines) envelope the exact values occur when the batch of messages processed on each notifica- tion reaches the minimum value, respectively, k and k . The P C fact that k is usually 1 makes the jump much higher in regime nFC than in regime nFP. nFP nFP nFC nFC nSS nFC T T nFP BW W Y (L − 1)W − W P E P C W −S W k W + S C P C P C C FIGURE 3. The average time (T) and energy (C) for each message FIGURE 2. The time for each message as a function of W , for the as a function of Y , for the sleeping-based regimes. The dashed lines P C notification-based regimes. The message rate decreases as W moves are the lower and upper bounds for T in the region where both P and away from W . The curve for T represent the best case for regime C may sleep. T and C have similar shapes, but they do not differ by C nSS nSS, actual values may be much larger. a constant value. SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 ASTUDY OF I/OPERFORMANCE OF VIRTUAL MACHINES 817 of T , obtained by simulations. The upper bound grows lin- T . In any case, energy efficiency in sLS is worse than the sLS opt early with slope 1/L, while the lower bound has a smaller energy in Y , which is, therefore, the optimal sleep value. slope, which becomes close to zero as W tends to W and For reasons similar to those explained in Section 3.1.2, P C becomes close to 1/L as W diverges from W . This means describing how E depends from Y is not particularly inter- P C P that the variability (oscillation) of T in the sLS regime is lar- esting when W < W , since we want to stay away from sLS CP ger when W and W are close, and it is smaller otherwise. regimes in any case. P C If W < W , showing how T depends from Y is less inter- A specular analysis can be done for the case W < W , CP P PC esting, because (i) P may never be sleeping, so T may not since equations describing E are symmetrical. In conclusion, x x depend at all from Y ; (ii)to stayawayfromlong sleep regimes the energy efficiency analysis shows that Y and Y should be P P C we have to make sure Y is small enough so that P never sleeps, chosen small enough to avoid entering the unfavorable sLS and so we are interested in the dependency from Y . regimes, but close enough to the optimum value to amortize In the W < W case, since the equations for T are sym- the sleep cost as much as possible. PC metric, it will be useful to study T as a function of Y , and It should be also noted that choosing very distant values the shapes will be similar to the ones of Fig. 3. for Y and Y (e.g. different orders of magnitude) is not con- P C The analysis clearly shows that Y and Y should be chosen venient w.r.t. efficiency. If both peers happen to be sleeping, C P small enough to keep the throughput to the maximum. With the one with the shorter sleep interval will need to sleep proper tuning of the operating system—to make sure the many times if it is waiting for the other to wake up and sleep timeliness is respected—this can be actually achieved. advance the queue processing; this results into unnecessary Compared with notification-based regimes, sleeping has a energy consumption. If the sleep intervals are comparable, on significant advantage in terms of throughput: the slower party the contrary, one or two sleeps will usually suffice. does not need to slow down in order to notify its peer. The It is important to observe that the energy efficiency of faster party can indeed wake-up autonomously to poll the sFP/sFC regimes is always better than the efficiency of BW, queue for new work. with W and W being the same. This can be evinced from P C Y Y E E Equations (1) and (4), noting that both and are smaller Y Y C P than one. A meaningful comparison with notifications can be done once some estimates for the various parameters are 3.2. Efficiency known. In Section 4.3, we report some measurements for the While busy waiting (BW) has the highest throughput in gen- sleep cost Y and, in Section 5.2, the notification/startup costs eral, its performance may come at a high cost in terms of involved in nFP/nFC regimes that can be taken as a refer- CPU usage. In regime BW, the fast party must burn cycles ence to support the decision process. proportionally to the difference of processing times, ∣WW - ∣ . This can possibly double the total overall cost in CP 3.2.2. Efficiency for notification regimes terms of time/cycles, and can have even worse impact on In Fig. 4, we show the energy per message in different energy if the fast party has higher energy consumption per regimes. For simplicity, here we use only one graph with cycle. As an example, the fast party could be an expensive, variable W , having already established that the system is dedicated CPU/NIC/controller. symmetric and we can repeat the same reasoning for variable Therefore, it is important to also take into account the total W . Also in this case, we have three curves of interest, but energy consumption per message, i.e. the values E deter- they are not as nicely ordered as in Fig. 2. mined in Section 2. We see that the E values have the form The curve for BW (solid thin line) is no more the absolute W++WX, where the additional term X depends on the PC best in terms of efficiency. This is because the additional term operating regime. X in E is ∣WW - ∣, whereas in other cases the term X is CP BW Similar to the analysis conducted for throughput, we start upper bounded by some constant independent of the differ- by comparing notifications and sleeping against BW, and then ence W - W . As a consequence, the slope of E is twice CP BW compare them between each other. that of the other curves, and when W becomes too large (or more precisely, when ∣WW - ∣ becomes large) busy waiting CP 3.2.1. Efficiency for sleeping regimes is the worst option in terms of energy per message. Figure 3 shows E as a function of Y for sleeping regimes, The energy curve (solid thick in Fig. 4) for regimes nFC assuming W < W . For all the Y values smaller than and nFP has the same step-wise behavior as the ones for CP C opt YL =( - 1)W -W , the plot corresponds to the sFC inter-message time. The slope is however unitary (it grows as PC regime (E ), where P never sleeps and the energy is max{WW , }), and lies within the gray region in the figure sFC PC inversely proportional to Y , as the cost of each consumer depending on the actual parameters. As the graph shows, the sleep is amortized over a larger batch. For larger values of Y , curves for BW (solid thin) and notifications (solid thick) the system enters the sLS regime, where also P sleeps, and regimes may intersect in several points, whose values and the shape of E is irregular, roughly following the shape of position depend heavily on the actual parameters. sLS SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 818 G. LETTIERI et al. from when a request (that we can imagine is latency-sensitive) arrives at the extraction point of the R queue to when the E same request is serviced by the consumer, considering all the BW possible input patterns for R. For each regime, we have defined the worst case situation that can happen with the cor- nSS responding relative sizes of parameters (as specified in Tables 2 and 3), not assuming that R requests arrive greedily, 2W and we have expressed an upper bound for the latency. The BW regime, which is optimal for throughput, is also optimal in terms of latency, that is D ³ D . This happens xBW N +S C P because P and C do not have fixed-cost synchronization overheads (i.e. notifications, startups, sleeps); P can actually produce the high-priority request as soon as there is an avail- able slot and C can consume it as soon as it has processed all N +S the messages already pending in the queue. It is worth P C remarking that when W > W the latency-sensitive request CP needs to wait for C to process up to L elements before it can be serviced; although this delay (LW ) can be considerable in practice, no strategy can do better under our FIFO assump- FIGURE 4. The total energy per-message as a function of W . There may be regions where busy waiting (thin line) is more energy tion. We, therefore, compare the sleeping and notification efficient than notifications (thick and dotted lines). mechanisms against BW, in order to see how these mechan- isms introduce additional delays that make latency move The shape of the curves and their discontinuities make it away from the optimum. difficult to identify intervals in which one regime is preferable to another. We can compute them using the equations in 3.3.1. Latency for notification regimes Section 2.3, but these rely on perfect knowledge of the operat- The D inequalities and the evolution diagrams reported in ing parameters, hence the information is of little practical use. Section 2.3 show that for almost all the notification regimes Comparing the total energy per message in regime BW (nFC, nSCS, nSPS, nSS), the worst case service delay is upper with the other regimes, however, can give some useful prac- bounded by some linear combination of the notification/ tical insight. Busy waiting consumes an extra ∣WW - ∣ PC startup parameters (N , N , S , S ). Moreover, k has nega- P C C C cycles per message, so it is convenient when the cost is lower tive impact in short-queue regimes (nSPS, nSCS, nSS), since NS + PC than the extra notification and startup cost, which is in it delays the consumer notification that P needs to wake up NS + CP nFC, in nFP. Since in nFP we have b ³ k and k is C C and produce the request at the head of R. In particular, the typically large, it very unlikely that busy waiting can be nSS regime includes all these latency contributes, and so it is energy efficient. the most unfavorable one among the ones listed. Notifications with short queues: The energy efficiency The fast producer regime (nFP) deserves a separate ana- when the queue fills up is heavily dependent on the values of lysis. Since W < W , the high-priority request may need to PC the parameters. Equation (16) for E shows that the extra nSS wait C to consume L messages before it can be served. This term includes all the four startup and notification times instead is not an issue by itself, because also the busy waiting (opti- of only two of them for E and E . Given that we expect nFC nFP mal) mechanisms have the same limitation. However, C is one of S , S to be large, this might be a significant cost. C P slowed down by the notifications that it needs to send in On the other hand, the energy efficiency of these regimes order to wake up P periodically. The number of notifications is not too bad, because producer and consumer tend to have is not constant, but depends on the queue length and the significant idle times, and the overheads are amortized over batch size. The inequality (14) implies that notifications relatively large batches (e.g. the entire queue size in regime are needed in the worst case. Since D is not bounded by nFP nSS). This phenomenon is evidenced by the curve E (dot- nSS a linear combination of the parameters, like for the other ted) in Fig. 4, which also intercepts the others. notification regimes, is not possible to tell in general whether nFP is more favorable than nSS or not. It is useful to note that improves latency in nFP, while it has the 3.3. Latency opposite effect with short queues. We now complete our analysis with the latency, using the As already stated previously, in a real system, the para- upper bounds derived in Section 2. As explained, the worst meters are not exactly constant, and thus it is usually difficult case latency is defined as the maximum time that can elapse to guarantee that the system never ends up (even temporarily) SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 ASTUDY OF I/OPERFORMANCE OF VIRTUAL MACHINES 819 in a short-queue regime. As a result, the estimation of the platform has an Intel Core i7–3770 K CPU at 3.50 GHz (4 worst case service latency of a producer–consumer system cores/8 threads), 8 GB RAM DDR3 at 1.33 GHz, and runs based on the notification mechanism should take into account Linux 4.6.4. A recent version of the QEMU hypervisor (git the latency bounds for both nSS and nFP. master 9a48e3, June 2016) is used to run the guest VM using KVM hardware-assisted virtualization. The guest is given 1 3.3.2. Latency for sleeping regimes vCPU and runs Linux 4.6.4. In order to improve the reprodu- The D inequalities presented in Section 2.2 for sFC and sLS cibility of results, all the tests have been run with the follow- illustrate that the worst case delay for sleeping regimes is upper ing configuration (except when explicitly noted): bounded by a linear combination of the sleep intervals, namely Y and Y . As a notable case, we have also seen that ifYY » (1) No load on the machine other than essential operat- C P PC then the worst case latency does not exceed two times the sleep ing system services. interval, plus the time necessary to process the request. The lat- (2) Dynamic frequency scaling disabled, so that all the ter result is particularly interesting, because choosing the sleep CPUs run at maximum frequency. intervals similar to each other is also a good choice in terms of (3) Sleeping C-states disabled, that is all the CPUs in C0 energy efficiency, as discussed in Section 3.2. all the time; the host O.S. never issues the halt The sFP regime requires a separate discussion, similar to instruction to pause the CPU, even when there is no the corresponding fast producer notification regime (nFP). active process to schedule. This is not the default The worst case latency for sFP is optimal, because the behavior of Linux, and requires the idle=poll latency-sensitive request has to wait for C to consume all boot parameter to be specified. messages already in the queue, which is not distinguishable (4) Hyperthreading and turbo mode are disabled. from the behavior of BW. In other words, a larger Y does (5) Each thread part of the experiment is pinned to a dif- not impact latency, as long as C does not sleep (so that the ferent physical core. system does not enter the sLS regime). (6) KVM halt polling disabled by setting the Compared with BW, the latency of the sleeping mechanism halt_poll_ns module parameter to 0. This is is worse (idle system, sFC, sLS), but it can be kept under con- necessary to isolate the CPU utilization related to our trol by properly limiting Y and Y . A comparison between producer/consumer system, not including the cycles C P the notifications and sleeping mechanisms can be done with wasted by KVM because of this optimization that can some estimates of the notification parameters and the sleep take up to 60% of the CPU time in some pathological cost Y , using the D upper bounds. Sleeping can be conveni- cases. E x ent if the Y and Y interval values can be chosen sufficiently C P small w.r.t. the notification parameters. 4.2. Description of the system under study 4. ESTIMATING THE SYSTEM PARAMETERS VirtIO [10] is a widely used standard and API for I/Oparavir- tualization; most of Hypervisor software (QEMU, bhyve, The best mechanism for a given set of requirements— VirtualBox, Xen, etc.) and guest operating systems (Linux, throughput, energy and latency—can be chosen once the FreeBSD, Windows) are rapidly converging to VirtIO as the designer has some estimation of the system parameters, which default I/O infrastructure for VMs. Taking it as a reference for heavily depend on the producer and consumer implementa- experimentation is meant to maximize the impact of our work. tion, the host machine hardware and the O.S. implementation. VirtIO is a generic producer–consumer API that allows a In this section, we describe how these parameters can be guest O.S. to exchange data with its hypervisor (also referred obtained in a representative case. Since our work is primarily as the host). It provides a guest-side API and an hypervisor- focused on virtualization environments, we have chosen to side API that are used by the guest and the hypervisor, experiment with VirtIO systems, as illustrated in Section 4.2. respectively, to access VirtIO data structures. The main data structure is called virtqueue and is implemented in a portion of memory shared between the guest and the hypervisor; it is 4.1. Description of the test environment composed of two separate circular arrays (rings): the avail For all the experiments presented in this article, we have con- ring and the used ring. A guest driver inserts buffers (in the figured the testbed to minimize the noise introduced by the O. S. scheduler and by the power management features of the A feature [13] recently added to KVM that lets the vCPU thread polling for a while when the guest issues an halt instruction, instead of scheduling modern CPUs: this includes the frequency scaling and the out immediately. processor C-states (which are a significant source of latency, More precisely, a virtqueue also includes a descriptor table, which is an as several microseconds may be necessary for a core to array containing buffer descriptors. Each slot in the avail and used ring just recover from the deepest C-states). Our reference test references the head of a chain of descriptors (e.g. a scatter–gather list). SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 820 G. LETTIERI et al. form of scatter–gather lists) in the virtqueue avail ring, where � The hypervisor device implementation (consumer- the hypervisor can extract them (in FIFO order). Once the vhost.c), where the consumer (C) code runs in the hypervisor has consumed a buffer, it pushes it to the used context of a vhost thread. ring, where the guest can recover them (and possibly do some cleanup). Each virtqueue has a mechanism to let the guest Note that P and C run in two different threads, consistently send a notification to the hypervisor and vice versa. A VirtIO with our model. P and C can be configured to set different device may be composed of one or more virtqueues. As an values for the W , W , Y and Y parameters, and to choose P C P C example, the VirtIO network device has at least a virtqueue between the three strategies (notifications, sleeping, busy for packet transmission and another one for packet reception. waiting). In this way, once the W and W and D para- P C MAX To ease measurements and experimentation, we implemen- meters have been fixed, we can experiment with the different ted an ad hoc VirtIO producer/consumer device for QEMU/ strategies to optimize an objective function (cf. Section 7). It KVM on Linux, referred as virtio-pc in the following. The is worth noting that there is an implicit lower bound for the device has a single virtqueue, where only the producer and validity of the W and W parameters, related to the imple- P C consumer processing is emulated (by means of a program- mentation limits of the Linux guest-side and vhost mable busy wait); all the other operations involving the virt- hypervisor-side of the VirtIO API we are using. Our measure- queue are performed using the real VirtIO API. We have ments show that the virtqueue cannot process more than 8 chosen to use the QEMU/KVM Linux hypervisor and Linux millions items per second on our testing platform, even when as a guest O.S. for a valid reason: they provide the most com- all costly notifications are suppressed. As a consequence, it is plete, updated and optimized implementation of both VirtIO not meaningful to carry out experiments where W and W are P C APIs. In particular, the QEMU/KVM hypervisor supports a <125 ns. To stay safe and avoid possible border effects, we Linux-specific high-performance in-kernel implementation of will use values equal or greater than 200 ns. the VirtIO hypervisor-side API, known as vhost. With vhost, the hypervisor-side implementation of a VirtIO device runs in 4.2.1. Code instrumentation for time measurements a dedicated kernel thread, without requiring any intervention C is able to compute latencies which include both the W /W 4 PC from the associated user-space QEMU process. The guest costs and the queuing delay. To achieve this, P stores a time- can write into a VirtIO device register to notify the vhost stamp inside each buffer passed to C, so that the latter can thread; the register access is intercepted in host kernel space take its timestamp at the end of its processing cycle and com- by the KVM kernel module, which wakes up the vhost thread pute the difference. A distribution of latencies gets collected without the need to switch to user-space. If the vhost thread and the 98th percentile is computed as the representative of is scheduled to run on a different core than the one issuing 5 the worst case latency. Timestamps are samples using the the notification, an Inter Processor Interrupt (IPI) must be x86 TSC register, which is incremented at constant rate and sent to the destination core. Similarly, the vhost thread can is consistent across all the cores. However, TSC values read notify the guest directly instructing the KVM module to inject from the guest O.S. differ by a constant offset from the ones an interrupt. read on the bare metal. This TSC offset must be taken into Our producer/consumer experimentation framework is account when computing time difference; it can be obtained available as open source software at https://github.com/ using the Linux ftrace [14] tracing system, once the kvm/ vmaffione/qemu/tree/virtio-pc, and includes the following kvm_write_tsc_offset tracepoint is enabled. components: To validate our model correctly (and cross-check the mea- surements), virtio-pc has also been instrumented to measure � The driver for Linux guests (producer.c), exported all of the parameters we take into consideration. This is to user-space as as a character device (/dev/ important because sometimes the measured value differs from virtio-pc), where the producer (P) code runs the nominal one; for example, this is the case for Y and Y in C P entirelyinkernelspace,inthe contextofanioctl()sys- our testbed. In the following, we always use the measured tem call, which returns only when a test run is finished. values rather than the nominal ones. Parameter estimation is P is implemented by means of the Linux guest-side done both online and offline: W , N and Y are estimated by P P P VirtIO API. P with running averages; W , N , Y are measured by C in a C C C � The support in the QEMU hypervisor necessary to similar way; S is computed by C using timestamps put by P expose the VirtIO device to the guest O.S. as a PCI in the first packet of each batch of C (similar to how the device. latencies are computed). Finally, since k is greater than one, S cannot be measured online. As a part of the instrumentation, both P and C trace The usual hypervisor-side VirtIO implementation resides in user-space, which implies continuous transitions between the user-space VirtIO device implementation code and the kernel-space code which runs guest code using Higher percentiles are pruned to rule out rare large fluctuations due to hardware-assisted virtualization. interrupts and scheduling. SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 ASTUDY OF I/OPERFORMANCE OF VIRTUAL MACHINES 821 some events of interest, that is (i) P publishing a new item in undesirable, since we expect Y and Y to be in the range C P the shared queue; (ii) C seeing a new item in the shared 5–50sm in common scenarios. To remove this systematic queue; (iii) P completing a notification to C; (iv) P blocking source of delay, we have set timerslack to 0 for the entire or sleeping (queue full); (v) C completing a notification to P duration of the tests. and (vi) C blocking or sleeping (queue empty). An event is Figure 5 illustrates the results of the test runs, with the x made up of an event type, a TSC timestamp, and a sequence axis representing the nominal sleep interval (i.e. the argument number identifying the next item to be produced or con- passed to the system call) in microseconds. The first curve sumed. Both P and C store the events in a local large circular shows the measured average sleep interval Y, in microseconds. array (2 elements), so that the tracing overhead is negli- For nominal intervals <10ms, the kernel is not able to support gible. Once a test run terminates, the two event arrays are the sleep with a low relative error: the overheads involved in accessed offline using the ftrace facility and merged, taking programming the timer, updating the kernel data structures into account the TSC offset. The merged logs allow us to and perform the user-kernel context switches exceed2sm , and examine the whole evolution of the virtio-pc system, and in the the curve never goes below this value. As the nominal particular also to measure an average for S and all the other interval grows, the fixed costs are amortized more and that the parameters. relative error decreases; for nominal intervals over 50sm the relative error is close to zero. The second curve shows the average per-sleep cost. Up to ~m 2.5 s,YY » , which means that the CPU is nearly 100% 4.3. Estimating sleeping costs busy serving the nanosleep system call. No process sched- Using the sleeping mechanism requires the value of Y to be uling happens, since the expiration time is already passed measured, since (i) Y determines the energy efficiency; and when the call to the scheduler would be performed. For larger (ii) it is a lower bound for Y and Y ,i.e.itisthe minimum nominal intervals, the scheduling and wake-up start to hap- C P sleepintervalallowedbythe system.Inorder to evaluate Y pen, and CPU utilization decreases. As expected, the mea- () n and understand the behavior of the sleep primitive in our refer- sured Y is constantly 2.5ms, not depending on Y , at least () n ence test environment, we set up an experiment where a process up to ~m 20 s.As Y increases again, Y grows in a invokes the nanosleep system call N times in a tight loop, staircase-shape fashion. This is a consequence of the Linux with a fixed sleep length passed as argument. The number N is implementation of the timer subsystem, which hierarchically chosen large enough (in the order of 10 ) to collect meaningful groups expiration events depending of the order of magni- statistics. By measuring the total duration of the run (N sleeps), tude; a bigger order of magnitude means more operations are we can compute the average effective sleep interval length, needed to insert and remove the expiration event from the which is in general higher than the nominal length. internal data structures. To measure the sleep cost, we used the cpupower monitor From this analysis, we can conclude that Y»m 2.5 s on our tool (and in particular the Mperf high precision monitor), test platform, at least assuming that Y and Y are not chosen to C P which is able to compute, for each CPU, the fraction of time be larger than 20ms. If the latency requirements allow for worst the CPU is in the C0 state (i.e. actively executing instruc- tions). When the CPU is not in C0, it is in the C1 shallow sleep state; for this particular test, differently from what described in Section 4.1, we used the default value for the idle boot parameter, so that the O.S. is allowed to put the 1,000 CPUs in C1. Since the sleeping process is pinned to a CPU during the run, and there are no other processes using observ- able amounts of processing time on that CPU, we can com- 100 pute Y as the product between the measured average sleep E E interval and the fraction of time the CPU is in the state C0. The run is repeated for different values of the sleep interval, ranging from 900 ns to 1 ms; as we will see, this range is suf- ficient to illustrate the properties of the sleep primitive on our test platform. 1 10 100 1,000 (n) When an application asks the Linux kernel to sleep for Y relatively short intervals (e.g. <1ms), the timerslack per- process parameter must be considered. Unless the process FIGURE 5. Average effective sleep interval (Y) and per-sleep has real-time priority, the nanosleep Linux implementa- energy (Y ) versus nominal sleep interval. The system is not able to tion will silently add the value of this parameter, which deal with sleeps shorter than 2.5 ms, and the cost depends on the order of magnitude of the sleep interval. defaults to 50sm , to the sleep interval length. This is really SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 cycles Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 822 G. LETTIERI et al. case latencies larger than 40sm , Y can be estimated consider- constant values. In this section, we validate the model predic- ing another step of the curve, but this is not common for the tions by comparing them to actual measurements on the sys- kind of system under study in this work. Our analysis also con- tem introduced in Section 4.2. firms that it does not actually make sense to sleep for less than Y , because it would just be a convoluted and unreliable way of doing busy waiting. For the sake of completeness, we have 5.1. Validation of sleep-based regimes repeated the measurements giving real-time priorities This section presents an extended experiment meant to check (SCHED_FIFO) to the sleeping process, with the Linux kernel how much the virtio-pc system described in Section 4.2 built with real-time support (linux-rt). As expected, no measur- matches our model. For the purpose of validation (and also able differences have been observed, since the tests have been for the strategies presented in Section 7), we will slightly sim- run with the machine unloaded. plify our model, assuming that both P and C use the same sleep length, that isYY = . This practical simplification PC does not impact our study, because it has only effect when 4.4. Estimating notification costs the system operates in the sLS regimes that we want to avoid The values of notification parameters depend on how P, C in any case; moreover, usingYY = simplifies latency esti- PC and the queue are implemented (O.S. processes, VMs, shared mation and entails a simpler, staircase-shaped throughput memory, hardware controllers, etc.). The measurements curve than the one of Section 3.1.2. reported here are related to the virtio-pc reference system For the experiments, we have chosen a fast consumer scen- described in Section 4.2, and rely on its event tracing facil- ario with W =m 2s, W =m 1s and L = 512, while Y varies P C ities. In a virtualization environment notifications are quite between 4ms and 3 ms, so that we also check that the system expensive, involving VM exits, Inter-Processor Interrupt transitions to sLS regimes when Y goes beyond (IPI), calls to the host scheduler, and VM enters. LW=m 1024 s. Figure 6 shows that there is a very good In order to measure the four notification constants we have agreement between our model (values for the sLS regime are conducted two kinds of experiments. A fast consumer experi- obtained by simulation) and virtio-pc. In particular, both ment, with W = 2000 and W = 4000, is used to compute curve agrees on the fact that the average per-item time C P N and S ,as W - W is large enough that there is a notifi- increases approximately by W - W each time Y increases P C PC PC cation for each item. A different fast producer experiment, by L(-WW). The slight disagreement for large values of PC with W = 2000 and W = 500, is used to compute N and Y (which is not really interesting to us) is explained by the C P C S . Since k > 1, we do not have a C notification for each fact that the measured Y is actually quite larger thanYY = . P C P C item, and so we choose a small L = 8 to have enough sam- Figure 6 does not validate our energy model, which is ples in the event trace. especially interesting in the sFC/sFP regimes. A simple way Table 4 reports the measured average notification costs, to do that (without measuring CPU utilization) is to validate together with their standard deviations. As expected, the noti- the overall batch that is the average number of packets pro- fication cost is higher for P, since it involves an expensive cessed for each sleep, taking into account all of P and C VM enter and exit operation. The start-up cost for P is also extremely expensive, since it involves the cost of interrupt virtio-pc processing in the guest and context-switch to the user-space model process. The start-up cost for C is less expensive because it is mostly the time required to wake-up and schedule the kernel thread, and invoke the processing loop. 5. MODEL VALIDATION The model illustrated in Section 2 is a mathematical abstrac- tion where the operating parameters are assumed to be 500 1,000 1,500 2,000 2,500 TABLE 4. Measured average notification costs. Y [µs] N 1.10m 0.22 s FIGURE 6. Average per-item time versus sleep length, with N 0.58m 0.03 s Y = Y ; the dotted curve shows the measured values, whereas the PC S 28.0m 3.50 s continuous one shows the model prediction. The system enters LS S 0.42m 0.02 s regimes beyond 1024 ms. SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 T [µs] Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 ASTUDY OF I/OPERFORMANCE OF VIRTUAL MACHINES 823 sleeps. For sFC and sFP regimes, this batch corresponds to virtio-pc the b parameter described in Sections 2.2.1–2.2.2. The per- model item energy consumption is still connected to the overall batch b by the second equation in (4). Figure 7 shows again a very good match between model predictions and the measure- ments on virtio-pc, also for the sLS regimes. 5.2. Validation of notification regimes Similar to Section 5.1, we now try to validate the throughput behavior for nFP and notified fast consumer regimes, as 1.822.22.42.62.8 depicted in Fig. 2, to check to what extent a real system W [µs] matches our model. We use long enough queues (L = 512) to stay away from short-queue regimes. For the validation FIGURE 8. Average per-item time in the mathematical and the syn- experiment, we have chosen a fixed W = 2000 ns, while W C P thetic model (notification regimes). is fixed at , , W 2 ms L = 512 varies between 200 ns and 2900 ns; as we will show, this K = 1 and . Notification costs are taken from Table 4. K = 384 P C range is sufficient to show all the properties of the system, which depend on the difference between W and W . For each P C value of we have run 12 tests, each one 5 s long, measur- consistent with the fact that the interrupt rate is always very ing the average throughput, P and C notifications rate and small w.r.t. the processing rate, which is approximately 95th percentile of latency over the 5 s. Note that the valid- 500 000 items per second. In other words, the large k is very ation of the energy model comes as a consequence of the val- effective at amortizing the notifications from C to P. idation of throughput, since in nFC and nFP both throughput In the fast consumer zone (W<= W 2000 ns), the PC and energy have a strong dependency on the average batch virtio-pc system shows the effect of the increasing number of size b. notifications as the speed difference between the consumer The measured average per-packet time is depicted in and the producer increases, lowering the throughput in Fig. 8, which does not report variance as it is sufficiently accordance with the model. There are nonetheless some min- small (<3%). We can see that there is a very good agreement or deviations that need to be explained. The slope of the between the model and virtio-pc, with some minor deviations virtio-pc curve around 2.4ms is much more smooth than that will be explained later on. expected, but this is not very interesting, since it is only an For the fast producer zone ( ), the W<= W 2000 ns PC effect of random variations of the emulated W and W around P C throughput curve is mostly flat, with a very small negative the desired values (see Section 6). For values of W between slope, as the interrupt rate slowly decreases from ~570 to 2 and 2.2ms, instead, we note that the virtio-pc curve lies <10. This is a consequence of the very large used by slightly above the model curve, and it features spikes at each L = 384 VirtIO (it is set to ). The very small slope is 4 discontinuity point. This discrepancy is more interesting and it is due to unwanted notifications that the producer sends to an already running consumer. We call these notifications virtio-pc 1,000 model spurious: they are the effect of an unavoidable race in the ‘double check’ scheme used by the notification-suppression algorithm. When the consumer finds an empty queue and must therefore block (first check), it first re-enables notifi- cations, then checks the queue again (second check): if new items are found, it disables notifications again and processes the new items without blocking. If this double check were not performed, the consumer might block and the producer might not notify the next new item: this is the case if the producer pushes a new item after the first check 500 1,000 1,500 2,000 2,500 by the consumer, but before the notifications have been re- [µs] enabled. The double check avoids this possible stall, but it opens up the possibility of spurious notifications: these occur when the producer inserts a new item between the FIGURE 7. Average overall batch versus sleep length, with Y = Y ; the dotted curve shows the measured values, whereas the consumer first and the second check, and sends the notifi- PC continuous one shows the model prediction. cation between the enabling and the disabling of SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 batch size T [µs] Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 824 G. LETTIERI et al. 0.5 0.8 0.4 0.6 0.4 0.3 0.2 regular 0.2 spurious 22.22.42.62.8 0.1 W [µs] FIGURE 9. Regular and spurious notifications per-packet measured 0 2 2.2 2.4 2.6 2.8 during the fast-consumer experiments of Fig. 8. W [µs] notifications. A spurious notification is illustrated in the FIGURE 10. A plot of D() SW,,W,K with W =m 2s, K = 1 CP C P P P following diagram: and S taken from Table 4. spurious proper ... ... P: considering). Figure 10 shows a plot of Δ with W =m 2s ... ... C: and W varying in the fast consumer range of Fig. 8. We can The consumer sees no new item in the queue when the see that the probability of spurious notifications increases pre- spurious notification is received. Moreover, the consumer will cisely when Δ comes closer to zero. most likely not be able to see yet another packet after the first one, so it will go to sleep and the producer will have to send another notification, this time a proper one. Figure 9 shows 6. RELAXING THE ASSUMPTIONS the average number of per-packet spurious notifications The system used in Section 5 to validate the model still received by the consumer during the same set of experiments makes some important simplifications, namely: of Fig. 8 in the fast-consumer range. For reference, the figure also plots the ‘regular’ (i.e. non-spurious) notifications (1) the system parameters are independent of each other; received per-packet. Since spurious notifications cause add- (2) processing times (W and W ) are constant. itional work for the producer, they increase the per-item aver- P C age time. Therefore, the spurious curve in Fig. 9 clearly Assumption 1 does not hold in real systems, since features explains the differences between the model and the virtio-pc like frequency scaling or C-states may create complex rela- curves in Fig. 8. tions among the parameters. The close match of Fig. 8,in Even if the model does not account for spurious notifica- fact, is only possible because all advanced CPU features have tions, it helps in predicting them. Spurious notifications are been disabled. Nonetheless, the model can be useful to better more probable the closer the consumer and the producer understand the behavior of the system even with some of are when they look and update the empty queue between these features turned on. As an example, we now examine the them. The crucial observation here is that depending on the throughput obtained for the same experiments of Fig. 8, but difference between W and W , the model predicts that the P C with the idle=halt option instead of idle=poll. With instant t when the producer pushes the last packet in a idle=halt, the idle kernel thread will issue the hlt CPU batch and the instant t when the consumer misses, it (and instruction, putting the core into some C-state higher than 0 (C1 therefore goes to sleep) comes recurrently closer as in our case). This is a realistic example, since idle=poll W - W varies. Let us call D=-tt the interval between PC PC always keep the CPU busy and is not an option that should be these two instants, as shown in the following diagram: normally used. Figure 11 shows the new results. We can see t t C P ... ... that now, in the fast consumer region, the model and virtio-pc P: ... ... C: have significant discrepancies that become worse for higher values of W . We can also see that for these values of W ,there P P Interval Δ is a function of S , W , W and K (in the dia- is a somewhat better match if we plot the model for an higher C P C P gram we have assumed K = 1 as in the system we are value of S . This gives a clue on what is going on: the average P C SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 notifications per-packet Δ [µs] Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 ASTUDY OF I/OPERFORMANCE OF VIRTUAL MACHINES 825 value of S observed during the experiments now depends on model captures the most important effects, and it may be the value of W .Thisisconfirmed in Fig. 12,where we show used to better understand some of the secondary ones. the average values of S in the same set of experiments of Let us now explore some scenarios in which W and/or W C P C Fig. 11 (fast consumer region). The observed S is generally is not constant, therefore, relaxing Assumption 2. In order to higher than the one observed in the idle=poll experiments, examine a larger number of cases, we run these new experi- and also shows a complex dependency on W . This dependency ments in a simulator. Figure 13 shows the results obtained can be explained as follows, making use of the Δ function from the simulator when the system parameters are chosen to introduced above. When the consumer thread goes to sleep the be compatible with Fig. 8. The notification costs (N , N , S P C kernel will switch to the idle thread, which will execute the and S ) and the W and W parameters are now random vari- C P C hlt instruction, thus putting the CPU core in the C1 state. The ables, while L, K and K are as in Fig. 8. The notification P C notification IPI sent by the producer may reach the consumer costs are normally distributed; their averages and standard core either before or after the core has entered the C1 state. deviations are taken from Table 4. The W parameter is also This clearly affects S , since coming back from C1 may take normally distributed; in each experiment, the average is taken ~0.5ms [15]. Of course, the longer the elapsed time between from the x axis and the standard deviation is fixed at 5‰. The the instant the consumer decides to go to sleep and the instant average of W is 2sm in all experiments, but the distribution the producer sends the IPI, the higher is the probability that the is different for each curve: the first four curves use a normal consumer core will have entered the C1 state when the IPI is distribution with standard deviations of 5‰, 5%, 25% and received. Therefore, an high Δ should imply an higher (on average) S ,and alower Δ should cause a lower S ,which is C C essentially what we observe. For example, when W is between P S = .83 µs C   2.6ms and 2.8ms,the Δ is very high and the producer IPI 0.8 almost always find the consumer core already in C1, entailing a large S»m 0.83 s. This explains why the model with 0.6 S=m 0.83 s closelymatches virtio-pcinthisregionof Fig. 11. Note that the dependency of S on Δ is clear, but the 0.4 S = .42 µs correlation between Figs. 10 and 9 is only qualitative; this is   due to a couple of reasons: Fig. 10 is plottedassumingacon- 0.2 stant S , while we know that S varies; moreover, spurious C C notifications also affect Δ (and, therefore, S ), since they tend to increase the Δ for the next batch. In particular, this explains 2 2.2 2.4 2.6 2.8 the high values of S when W is close to W ,since,inthat C P C W [µs] region, there are as many spurious notifications as regular ones. In summary, we have seen that even if real systems are FIGURE 12. Measured average in the fast-consumer experiments much more complex than our simplified model, still the S of Fig. 11. σ =5 virtio-pc model (S = .42 µs) σ =5% C      model (S = .83 µs) σ =25% C      σ =50% 1.8 2 2.2 2.4 2.6 2.8 exponential W [µs] 1.8 2 2.2 2.4 2.6 2.8 W [µs] FIGURE 11. Average per-item time in the mathematical and the synthetic model (notification regimes) with idle=halt. w is fixed at 2 ms, L = 512, K = 1 and K = 384. The model curve is plot FIGURE 13. Average per-item time obtained by simulation with P C two times for two different values of S . The other notifications costs randomly distributed parameters. Each curve uses a different distri- are taken from Table 4. bution for the W parameter. SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 T [µs] average S [µs] T [µs] Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 826 G. LETTIERI et al. 50%; the fifth curve uses an exponential (Poisson) distribu- leads to some trade-offs. A reasonable choice can, therefore, tion. All normal distributions are truncated at zero, to exclude be done once the objective function to be optimized is clearly non-meaningful negative values. These experiments may defined. In this work, we want study how to simultaneously model a real-world packet capturing scenario in which we minimize average inter-message distance (T) and average can expect the incoming packets to arrive rather regularly, but per-message energy (E), while keeping worst case service where each packet may need a different amount of processing latency below an user-provided value D , focusing on the MAX in the consumer. case where the system is under high workload most of the The first curve (s = 5‰) closely matches the experimental time (i.e. P has almost always requests to serve). curve in Fig. 8 (once the spurious notifications are dis- The rationale behind this objective function is that we tar- counted) and is used to validate the simulator. We can now get packet processing systems requiring high throughput but precisely explain why the experimental curve of Fig. 8 does that do not want to resort to busy waiting, which may waste not feature the discontinuities of the theoretical curves considerable amount of energy when the load is low. obtained with constant parameters. In fact, when the system Examples of such systems come from the use-cases of NFV: is working near a discontinuity, the variability of the para- network middle-boxes like firewalls, Intrusion Detection meters randomly mixes the theoretical regimes expected Systems (IDSs), load balancers, routers, etc., which are com- before and after the critical point; as a result, the average T monly deployed by network service providers, Data Center may lie slightly above or slightly below the predicted value. environments and private business network infrastructures. A Something more interesting can be seen in the other curves solution which guarantees limited delay is still a good candi- produced by the simulator. While we move to higher values date for these systems, also considering that the overall of σ,at first the T curve simply becomes more smooth (e.g. latency experienced by the end users once the producer/con- see the curve for s = 5‰); for very high values of σ, how- sumer system is deployed in a real network is often in the ever, the entire T curve lies below the theoretical one, i.e. the order of hundreds of microseconds (or more) and not under throughput is consistently better than predicted. This can be control, because introduced by other network middle-boxes. easily understood for high values of W (W>m 2.4 s in On the other hand, when minimizing latency is the strongest P P Fig. 13). Recall that in a fast consumer scenario, any slow- requirement—which for instance is the case with high- down of the consumer is actually beneficial for throughput, frequency trading systems—the only acceptable solution is since it keeps the consumer running, relieving the producer busy waiting in any case. from the task of sending notifications, while a faster con- Taking into account the objective function as defined sumer may put more strain on the notification system. above and all the analysis carried out so far, we now illustrate However, if the system is already sending one notification for the high-level strategy that should drive the design and each packet, any W smaller than expected can do no add- deployment of high-performance producer–consumer systems itional harm; on the contrary, any W larger than expected under high workloads. may increase the producer batches and improve the through- put (as long as the queue is not overflowed). Therefore, for large values of W , the throughput must improve when larger 7.1. Regime identification variations of W become statistically more common. Similar, even if more complex, consideration can be made for the As a first step, it is necessary to understand whether the system smaller values of W . The main point is that the batch of pack- tends to behave as a fast producer or as a fast consumer. In real ets that the producer is able to put in the queue while the con- deployments, W and W arenot constant,so wecould at most P C sumer is waking up after a notification (i.e. during time S )are measure and average value for these parameters. However, able to absorb the lower values of W , while the higher values measuring W and W directly often requires some code instru- C P C of W continue to be beneficial. mentation, which should be avoided if possible. A better From these experiments, we can see that the theoretical approach would be to deduce the operational regime by meas- model actually captures a scenario that is typical more uring the rate of notifications in both directions. Fast-consumer demanding than usual and may be seen as ‘worst case’ in systems have a relative high number of P-to-C notifications, practice (even if it is not a worst case mathematically). and a low number of C-to-P notifications. The contrary is true for fast producer systems. The rate of notifications is, therefore, a simple way to roughly distinguish the two cases. Measuring these rates is usually easy in the scenarios we are focusing on, 7. DESIGN STRATEGIES that is with I/O devices emulated by an hypervisor, where P The discussion and comparisons reported in Sections 3.1–3.3 runs in the guest and C runs in the host (or the other way illustrate how the three mechanisms (busy waiting, sleeping, around). Notifications from C to P turn into interrupts in the notifications) have different properties in terms of throughput, guest, so that the average interrupt rate for a given workload energy efficiency and latency, a situation which naturally can be easily measured from within the guest using the tools SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 ASTUDY OF I/OPERFORMANCE OF VIRTUAL MACHINES 827 provided by the guest O.S. Also the hypervisor usually pro- vides statistics useful to measure the rate of notifications from P to C, since these kinds of notifications cause a VM exit event. Measuring notifications would also be easy in the case where 3 MAX theconsumer isanhardware device(e.g. aNIC), sinceinthat case interrupt O.S. statistics and device driver statistics would be available. Finally, the maximum between W and W can be deter- P C mined by measuring the system throughput when both P and C Y long sleeps 1 E use sleeping (so that notifications costs are not involved), with asufficiently short Y and Y (or with a sufficiently large L)to C P avoid the sLS regime. In practice, the designer can choose 10 20 30 40 50 60 70 YY== 200ms and measure the throughput while gradually Y = Y [µs] P C CP reducing the sleeping value (and may be gradually increasing L); once the throughput stops increasing with the sleeping FIGURE 14. Average per-item time and energy for the sleeping time, it means that the system is working in the sFP or sFC mechanism with Y = Y and variable Y . Dashed vertical bars PC C regime, and the maximum between W and W is the inverse of delimit the region of valid Y , while the solid one represents the P C C the measured throughput (expressed in items per second). user-specified latency constraint. sleeping time provided by the O.S.; it is therefore a good idea to stay away from the limit by a small value (e.g. 500 ns). 7.2. Fast-consumer design Also in this case the choice minimizes energy, maximizes throughput and limit latency as required by the user. If the system tends to behave as a fast consumer, increasing k is not an option (since P usually does not know when the next item will be produced), so a general strategy is to use sleeping on the consumer in order to avoid the notification storms that 7.3. Fast producer design are typical of this regime—anotification per item in the worst If the system tends to behave as a fast producer, our sug- case, which is also a common case. In fact, P-to-C notifications gested strategy is to use notifications, selecting a value for the are not used at all when C uses sleeping. To keep latency k parameter which is a large fraction (e.g. ) of the queue under control, we choose Y (and Y ) so that the worst case P C latency does not exceed the user-provided D ,which could length L. With this choice, the C-to-P notifications are suffi- MAX be in the 10–100ms range. Using inequality (11), we can ciently amortized over a large batch of packets, so that the derive a suitable value forYY = , once W=( max WW , ) CP CP throughput has little or no practical dependency on the has been estimated as described in Section 7.1.This means W - W difference, as explained in the following. As CP MAX selecting a sleeping length not larger thanYW =- . MAX described in Section 2.3.2, the number of packets processed Note that this strategy is only applicable when the resulting Sk +( - 1)W êPC Pú by P for each notification is b = + k , that is b ê ú YY > , that is when the O.S. supports sleeping times WW - MAX E CP ë û smaller than Y . If this is not true, it means that the latency MAX is the sum of two components. When k = L (i.e. k is in requirements are too stringent to use sleeping (or even unfeas- the 200–1000 range), b is already large because of the second ible), and resorting to busy waiting is unavoidable. component, irrespective of the value of the first component, The possible choices for the sleeping time are highlighted that could also be very large. The cost that C needs to pay for in Fig. 14, in the region where the latency constraint is met. notifications (N ), which is typically <1sm , is, therefore, If Y falls in the sFC region, we chooseYY==Y , MAX CP MAX amortized over at least 200–1000 packets, which result into to minimize energy and limit latency, while the throughput is <1–5 ns per packet. The effect of the first component of b on not affected by the choice. If Y falls beyond, in the sLS MAX the throughput is, therefore, expected to be very little in abso- region, we choose the largest Y which is still in the sFC lute numbers. As a result, the overall throughput is very close region. To make a robust choice we need to avoid the border to the optimal one , because C spends a very little time () effects that may result from the instability of the actual C to send notifications to P. For similar reasons, the per-item As an example, interrupts statistics on Linux are exported through the energy consumption is close to the optimum (W + W ), PC /proc/interrupts file. because N and S costs are amortized over a large b. As an example, the KVM kernel module on Linux exports detailed statis- As discussed in Section 3.3.1, with a large k (or a suffi- tics about the number of VM exits and injected guest interrupts that can be ciently small Y ), the latency of a fast producer system tends easily collected using the perf-stat tool (more info at http://www.linux- P kvm.org/page/Perf_events). to be dominated by the queuing delay LW , which is often in SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 time, energy [µs] Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 828 G. LETTIERI et al. the range 50–1000ms. The queuing delay does not depend case latency measured is relatively low (2240 ns) only includ- on the synchronization mechanism deployed, and so using ing W , N , S (600 ns on average) and W . The theoretical P P C C notifications or sleeping does not really make a difference in worst case would also include N (980 ns) and S , adding up practice. The only thing that can be done if the constraint on to~m 10 s. D is not met is to reduce L. The poor throughput of fast consumer is a common prob- MAX The discussion so far indicates that using the sleeping lem for VirtIO deployments, since it is common for the vhost mechanism in fast producer scenario does not really improve thread to quickly start and empty the avail ring. This example (nor worsen) the average throughput, energy or latency, at is, therefore, a good candidate to try using the sleeping strat- least assuming the system is under high workload. When the egy. We chooseYY== 5sm to make sure the worst case CP system is idle or has a very low workload, the sleeping mech- latency is approximately <10ms (cf. Section 2.2.3) and to anism easily becomes more energy inefficient, as both P and take into account the lower bound of 2.5ms related to the C repeatedly wake up and go to sleep again as there is almost sleeping costs (cf. Section 4.3). Our measurements show an never work to do, paying Y each time. In conclusion, the average throughput of~3.31 Mops, roughly corresponding to notification mechanism is a good candidate for fast producer 300 ns, which is the processing time of the slower party. As systems, since it provides near optimal throughput, energy predicted by our model (Section 2.2.1), the measured and latency, addressing both the high-workload and low- throughput is optimal. We measured an average of 50.5 items workload scenarios. processed by C for each sleep, whereas the model (using the nominal value of Y ) predicts 50. Actually, the average mea- sured value of Y is ~5007 ns, while the measured W - W C PC is actually 99 ns; plugging in these values in the batch for- 8. CASE STUDIES mula gives approximately 50.6, which is even a closer match. In order to validate the strategies presented in Section 7,we This batch corresponds to over 65 thousands sleeps per present some experimental examples of producer–consumer second, which may still be considered quite high with respect design, using the virtio-pc system presented in Section 4.2. to energy consumption. In any case, if relaxing the constraint on D is acceptable, it would be easy to increase the batch MAX (and thus reduce energy consumption) by increasing Y . The 8.1. Fast-consumer example energy measurement reports C using 76% of its CPU; since the system uses 1.76 CPUs to process 3.31 Mops, the average In the first example, we focus on a fast consumer case, with per-item energy consumption is ~531 ns, which is consider- W = 300 ns, W = 200 ns, and we also assume P C ably better than what could be obtained with the notification D =m 10 s. The values of W and W include ~100 ns of MAX P C strategy. Finally, the worst case latency measured is virtqueue processing plus 100–200 ns of useful work. These ~5500 ns, including Y and the processing costs, which is in numbers are realistic for network packet processing scenarios: line with our model. as an example, 100 ns may be needed by the consumer to In summary, this fast consumer example shows how the invoke a NIC driver to program packet transmission; the pro- sleeping strategy can be a better choice than notifications, as ducer may spend 200 ns to allocate (and deallocate) a packet it allows to optimize throughput and energy while keeping buffer in the guest O.S., look-up forwarding data structures the latency under control. and modify packets headers. Using the notification mechanism on both producer and consumer threads, we measured an average throughput of 8.2. Fast producer example ~1.81 Mops (millions operations per second), corresponding to 550 ns per item on average, which is almost twice slower In the second example, we will examine a fast producer scen- than the slowest party (P). As predicted by our model ario, with processing times similar to the ones used in the first (Section 2.3.1), this is due to the high cost of P notifications example, that is W = 200 ns and W = 300 ns. As reported P C (the measured N is ~1100 ns on average), amortized over in Section 4.2, the VirtIO uses an hardcoded k which is of relatively small batches (~5.3 items per batch), which means the virtqueue length; with L = 512, we have therefore that there are almost 350 thousands notifications per second. k = 384. In terms of energy, we found that C consumes 62% of its Using the notification mechanism, we measured an average CPU, while the CPU where P runs is busy all the time; in throughput of 3.32 Mops, corresponding to roughly 300 ns total, 1.62 CPUs running at 3.5 GHz are necessary to process per item on average, which matches the speed of the slowest 1.81 Mops, which means that on average 895 ns of CPU party (C). This is a good behavior and it is predicted by our cycles are spent for each item. Finally, as expected, the worst model, as each notification from C to P (interrupt) is amor- tized over a very large batch of items, so that P is not over- That is, 51.2ms when L = 256 and W = 200 ns, and 1 ms when L = 1024 and W =m 1s whelmed by the cost of notifications. More precisely, our SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 ASTUDY OF I/OPERFORMANCE OF VIRTUAL MACHINES 829 measurements report an average batch size of 1480 items, blocked because the FIFO leading to the next VM is full), whereas our model predicts batches of 1429 items (using and, therefore, the CPU utilization estimates would be off. It S =m 28 s, cf. Section 4.4). The measured latency is domi- is important to note, however, that these new, externally gen- nated by the queuing delay and it is ~152ms (512 items, erated, blocking situations never cause notifications not 300 ns each) as expected. Regarding energy, we measured already accounted by the model: even with chaining, notifica- that P consumes ~74% of its CPUs, which means that the tions only depend on the state of the FIFO between each per-item energy is 524 ns on average. Producer and Consumer pair. Using the sleeping mechanism with Y =m 20 s, we mana- We expect to observe counterintuitive effect also in chains of ged to remove even the few remaining interrupts (~2200 per VMs. For example, think of a chainPC ( /P)C (i.e. 11 2 2 second), and measured an average throughput of 3.33 Mops, Producer P in VM sending to a Consumer C in VM through 1 1 2 2 which is almost indistinguishable from the throughput mea- athread(/ CP) that acts both as Consumer and Producer) and sured with notifications. However, this choice of Y results assume that bothPC ( /P) and (/ CP) C show a P 11 2 12 2 into a batch of 200 elements, which is much smaller than the Fast-Consumer problem when run in isolation. Now, C may batch obtained with notifications; as a consequence, the num- slow down(/ CP) by forcing it to spend a lot of time sending ber sleep rate is relatively high (over 16 thousands sleeps per notifications, and, as a consequence, hide the Fast-Consumer second) which means an higher energy per item (87% of problem in thePC ( /P) path. Conversely, fixing the Fast- 11 2 CPU utilization for P, corresponding to 562 ns per item). In Consumer problem in the downstream path may expose it in the order to increase the batch (so lowering the energy consump- upstream one. It is clear that further study is necessary to address tion), we would need to increase Y to over 100ms. This is all the regimes that may be observed in such scenarios. feasible, but quite dangerous since it is not very far from the 152ms threshold for sLS regimes. Finally, since we have avoided sLS regimes, the latency behavior is again dominated 9.2. Batching by the queuing delay. In conclusion, this fast producer example shows how the Batching, i.e. sending several packets at once across an inter- notification mechanism—empowered with a large k —can be face, is widely used to improve throughput since it signifi- a better choice than sleeping, as the cost of each notification cantly amortizes fixed costs. Batching is a prominent feature is largely amortized over many items, so that the throughout in our model, as a single notification may be issued after any manages to follow the slower party and the energy consump- number of new packets have been inserted in the FIFO, or tion remains low. removed from it. Still, the model only accounts for the amortization of notifi- cation and sleep/wake-up costs (N , N , S , S and Y ). P C C E 9. LIMITATIONS Processing costs (W and W ) remain constant, independently C P of the number of packets that are processed in a single run. Even if our model matches precisely some important features Real systems may have many more fixed costs that are amor- of VM networking I/O, it does not of course encompass all tized when batches of packets are made available, thanks to possible scenarios. We discuss here some limitations and pos- caching effects, reduced context switching and other optimi- sible extensions that may significantly broaden the scope of zations. This may be modeled in at least two ways: by letting the model. W and W decrease depending on the number of packets C P already processed since last notification, wake-up or sleep; by assuming that each W and W box represents the processing C P 9.1. VM chaining of a batch of more than one packet. Virtualized networking I/O at high packet rates, which is the The latter approach is especially useful in modeling the main target of our study, is very important for NFV applica- behavior of APIs like netmap [3], where producer batching is tions. Our study covers the expected I/O performance of the controlled by the application and may be approximately taken input and output I/O paths of a single VM. However, com- as a constant, call it B, especially in the high packet rates plete NFV applications typically consist of chains of VMs scenarios, we are interested in. A FIFO of L packets between [16, 17]. Our Consumer can, therefore, be the Producer for the netmap producer and the consumer must now be modeled another VM down the chain. As a first approximation, the as a FIFO of L/B batches, and a large B may easily bring the throughput of each path can still be studied in isolation, using system in a ‘short-queue’ regime (one of nSPS, nSPS or nSS, our model, if the cumulative effect of the upstream and down- depending on the wake-up times), where the consumer and the stream VMs are modeled as random variations in the W and producer alternatively block without doing any work in paral- W parameters (using, e.g. the simulator of Section 6). The lel. In these situations, reducing the application batching can chaining, however, also introduces new possibilities for increase the throughput, by moving the system into a more blocking not considered by our model (e.g. a Consumer is favorable regime—yet another counterintuitive effect [5]. SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 830 G. LETTIERI et al. An aspect of batching that is neglected by our model is k = 1, k =/34 of queue occupation. We have shown in that large batches may lead to other reductions in throughput, Section 8 that this form of adaptivity is only effective with due to large packet drops in the internal queues of the high load and slow consumers. Recent versions of vhost (an Producer and/or Consumer when they are implemented by optimized in-kernel VirtIO hypervisor-side implementation), complex multi-layered software (like, e.g. the OS network included in the Linux kernel, support an optional short busy stack). These problems, however, should generally be wait to limit the amount of notifications showing up with fast addressed in the multi-layered software itself, by properly siz- consumers. This further confirms how the problem of produ- ing the queues and making sure that livelock problems are cer–consumer speed mismatch that we address in our work is avoided [18]. central to high-performance I/O virtualization. There is an extensive literature on the performance study and modeling for VMs [23], focusing on the general overhead of virtualization on CPU-intensive computations [24], but also on 10. RELATED WORK the performance of disk I/O[25], end-to-end networking [26] Pure polling (also known as ‘busy wait’ or ‘spinning’)isprob- and live migration [27]. To the best of our knowledge, how- ably the oldest form of synchronization, and the most expensive ever, little attention has been devoted to the modeling of the in terms of system resource usage. Its use is mostly justified by notification/synchronization I/O costs. The works most similar its simplicity and not reliance on any hardware support. Pure to our own remain the studies on hybrid interrupt/polling polling is used by a number of high-speed networking applica- schemes [12, 21, 28], where several options among interrupt tions and libraries such as the Click Modular Router [8], Intel’s and polling are modeled and compared. These studies apply to DPDK [6] and Luca Deri’s PFRING/DNA [7]. non-virtualized networking, and, as a consequence, they show Aside from high-energy consumption, polling may also several differences with our own. In particular, delays in notifi- abuse of shared resources, such memory or I/O buses. This cations are not accounted for, while we have found that they worsens the situation from a simple annoyance (high-energy have several counterintuitive effects in our model. Moreover, consumption) to a threat to other parts of the system, and those studies focus on the receive path only, while our model is requires some form of mitigation. more general and also encompasses transmission. In particular, In the FreeBSD polling architecture [18], polling occurs the fast consumer problem is usually encountered in the trans- periodically on timer interrupts and opportunistically on other mission path from a relatively slow producer running in the events. An adaptive limit on the maximum amount of work to VM with a fast backend consumer [29, 30]. be performed in each iteration is used to schedule the CPU between user processes and kernel activities. Adaptive polling schemes are also widely used in radio protocols, sensor net- 11. CONCLUSIONS works, multicast protocols. We have presented and analyzed a model for the operation of A seminal work on interrupt moderation [19] points out a producer and consumer in a typical VM environment, how mixed strategies (notifications to start processing, fol- focusing on three synchronization mechanisms: notifications, lowed by polling to process data as long as possible) can sleeping and busy waiting; described how throughput, effi- reduce system’s overhead. The Linux NAPI architecture ciency and latency are affected by operating parameters for [11, 20, 21] is based on the above ideas. When an interrupt the three mechanisms; and validated the model against a set comes, NAPI activates a kernel thread to process packets using of simulation experiments and a realistic VirtIO-based proto- polling, and disables further interrupts until done with pending type running on a hypervisor. packets. A bound on the maximum amount of work to be per- We have then discussed some strategies that can lead the formed by the polling thread in each round helps reducing design or optimization of a producer–consumer system under latency and fairness on systems with multiple interfaces. NAPI assumptions that are common for NFV scenarios, helping to does not use any special strategy to adapt the speed of produ- decide what synchronization mechanism to use and how to use cer and consumer, and as such, it is subject to the performance it. The main idea, exposed in Section 7,is to first identify the instabilities discussed in this paper, and, in particular, to the P- notification regime and then apply a different strategy according to-C notification storms typical of a fast consumer scenario (in to it. Finally, we have validated our strategies against our VirtIO this case, the NAPI thread is the consumer for network packets prototype to show the benefits of our analysis in practice. coming from a physical NIC or from a possibly paravirtualized NIC emulated by the hypervisor). The VirtIO framework [10, 22]isthe de facto standard ACKNOWLEDGEMENTS deployed to provide high-performance I/O in virtualized envir- onments, and uses a notification-based system which matches The authors would like to thank the anonymous reviewers for the one presented in Section 2, as explained in Section 4.2. their helpful and constructive comments that greatly contribu- The notification thresholds for VirtIO are typically chosen as ted to improving the final version of the paper. SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 ASTUDY OF I/OPERFORMANCE OF VIRTUAL MACHINES 831 [16] Herrera, J.G. and Botero, J.F. (2016) Resource allocation in NFV: FUNDING a comprehensive survey. IEEE Trans. Netw. Serv. Manage., 13, This work was supported by the European Union’s Horizon 518–532. 2020 research and innovation programme 2014–18 (Grant no. [17] Luizelli, M.C., Bays, L.R., Buriol, L.S. et al. (2015) Piecing 644866). This paper reflects only the authors’ views and the together the NFV Provisioning Puzzle: Efficient Placement and European Commission is not responsible for any use that Chaining of Virtual Network Functions. In Proc. IM’2015, may be made of the information it contains. Ottawa, Canada, May 11–15, pp. 98–106. IEEE. [18] Rizzo, L. (2001). Polling versus interrupts in network device drivers. http://info.iet.unipi.it/~luigi/polling/. accessed July 12, 2017. REFERENCES [19] Mogul, J.C. and Ramakrishnan, K. (1997) Eliminating receive livelock in an interrupt-driven kernel. ACM Trans. Comput. [1] Chiosi, M.ClarkeD.WillisP.et al2012, Network function vir- Syst., 15, 217–252. tualisation introductory white paper. http://portal.etsi.org/ [20] Salim, J.H., Olsson, R. and Kuznetsov, A. (2001) Beyond NFV/NFV_white_paper.pdf (accessed July 12, 2017). Softnet. In Proc. 5th Annual Linux Showcase & Conference, [2] Abdelrazik, A., Bunce, G., Cacciatore, K. et al. (2015). Adding pp. 165–172. speed and agility to virtualized infrastructure with OpenStack. [21] Salah, K. and Qahatan, A. (2009) Implementation and experi- https://www.openstack.org/assets/pdf-downloads/virtualization- mental performance evaluation of a hybrid interrupt-handling Integration-whitepaper-2015.pdf. accessed July 12, 2017. scheme. Comput. Commun., 32, 178–188. [3] Rizzo, L. (2012) netmap: A Novel Framework for Fast Packet [22] Motika, G. and Weiss, S. (2012) Virtio network paravirtualiza- I/O. In Proc. USENIX ATC’12, Boston, MA, June 13–15, pp. tion driver: implementation and performance of a de-facto 101–112. USENIX Association, Berkeley, CA. standard. Comput. Stand. Interfaces, 34,36–47. [4] Rizzo, L., Lettieri, G. and Maffione, V. (2013) Speeding up [23] Xu, F., Liu, F., Jin, H. and Vasilakos, A.V. (2013) Managing Packet I/O in Virtual Machines. In Proc. ANCS’13, San Jose, performance overhead of virtual machines in cloud computing: CA, 21–21 October, pp. 47–58. IEEE Press, Piscataway, NJ. a survey, state of the art, and future directions. Proc. IEEE, [5] Garzarella, S., Lettieri, G. and Rizzo, L. (2015) Virtual Device 102,11–31. Passthrough for High Speed VM Networking. In Proc. [24] Huber, N., von Quast, M., Hauck, M. and Kounev, S. (2011) ANCS’15, Oakland, CA, 7–8 May, pp. 99–110. IEEE Evaluating and Modeling Virtualization Performance Overhead Computer Society, Washington, DC. for Cloud Environments. In Proc. CLOSER’11, [6] DPDK web page. dpdk.org. Accessed: 2017-07-12. Noordwijkerhout, The Netherlands, May 7–9, pp. 563–573. [7] Deri, L. PF_RING DNA web page. http://www.ntop.org/ Scitepress, Setúbal, Portugal. products/pf_ring/dna/. accessed July 12, 2017. [25] Noorshams, Q., Rostami, K., Kounev, S. et al. (2013) I/O [8] Kohler, E., Morris, R., Chen, B., Jannotti, J. and Kaashoek, M. Performance Modeling of Virtualized Storage Systems. In (2000) The Click modular router. ACM Trans. Comput. Syst. Proc. MASCOTS’13, San Francisco, CA, August 14–16. (TOCS), 18, 263–297. IEEE. [9] Zhou, D., Fan, B., Lim, H., Kaminsky, M. and Andersen, D.G. [26] Wang, G. and Ng, T. (2010) The Impact of Virtualization on (2013) Scalable, High Performance Ethernet Forwarding with Network Performance of Amazon EC2 Data Center. In Proc. Cuckooswitch. In Proc. CoNEXT’13, Santa Barbara, CA, INFOCOM’10, San Diego, CA, March 14–19, pp. 1163–1171. December 9–12, pp. 97–108. ACM, New York, NY. IEEE Press, Piscataway, NJ. [10] Russell, R. (2008) virtio: towards a de-facto standard for virtual [27] Wu, Y. and Zhao, M. (2011) Performance Modeling of Virtual I/O devices. ACM SIGOPS Oper. Syst. Rev., 42,95–103. Machine Live Migration. In Proc. CLOUD’11, Washington [11] NAPI (‘New API’). http://www.linuxfoundation.org/networking/ DC, 4–9 July, pp. 492–499. IEEE. napi. accessed July 12, 2017. [28] Dovrolis, C., Thayer, B. and Ramanathan, P. (2001) HIP: [12] Salah, K., El-Badawi, K. and Haidari, F. (2007) Performance hybrid interrupt-polling for the network interface. ACM analysis and comparison of interrupt-handling schemes in giga- SIGOPS Oper. Syst. Rev., 35,50–60. bit networks. Comput. Commun., 30, 3425–3441. [29] Honda, M., Huici, F., Lettieri, G. and Rizzo, L. (2015) [13] Bonzini, P. KVM halt-poll optimization. https://lkml.org/ mSwitch: A Highly-scalable, Modular Software Switch. In lkml/2015/2/6/319. accessed July 12, 2017. Proc. SOSR’15, Santa Clara, CA, 17–18 June, pp. 1–13. ACM [14] Rostedt, S. (2008). ftrace - function tracer. https://www. New York, NY. kernel.org/doc/Documentation/trace/ftrace.txt. accessed July [30] Hwang, J., Ramakrishnan, K.K. and Wood, T. (2014) NetVM: 12, 2017. High Performance and Flexible Networking using [15] Schöne, R., Molka, D. and Werner, M. (2015) Wake-up laten- Virtualization on Commodity Platforms. In Proc. NSDI’14, cies for processor idle states on current x86 processors. Seattle, WA, April 2–4, pp. 445–458. USENIX Association, Comput. Sci. – Res. Dev., 30, 219–227. Berkeley, CA. SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png The Computer Journal Oxford University Press

A Study of I/O Performance of Virtual Machines

Loading next page...
 
/lp/ou_press/a-study-of-i-o-performance-of-virtual-machines-fO1eIusWUF

References (27)

Publisher
Oxford University Press
Copyright
Copyright © 2022 British Computer Society
ISSN
0010-4620
eISSN
1460-2067
DOI
10.1093/comjnl/bxx092
Publisher site
See Article on Publisher Site

Abstract

Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 © The British Computer Society 2017. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Advance Access publication on 28 September 2017 doi:10.1093/comjnl/bxx092 A Study of I/O Performance of Virtual Machines 1* 1 1,2 GIUSEPPE LETTIERI ,VINCENZO MAFFIONE AND LUIGI RIZZO Dipartimento di Ingegneria dell’Informazione, Università di Pisa, Italy Google, Inc., Mountain View, CA, USA (work performed while at Università di Pisa, Italy) Corresponding author: [email protected] In this study, we investigate some counterintuitive but frequent performance issues that arise when doing high-speed networking (or I/O in general) with Virtual Machines (VMs). VMs use one or more single-producer/single-consumer systems to exchange I/O data (e.g. network packets) with their hypervisor. We show that when the producer and the consumer process packets at different rates, the high cost required for synchronization (interrupts and ‘kicks’) may reduce throughput of the sys- tem well below the slowest of the two parties; moreover, accelerating the faster party may cause the throughput to decrease. Our work provides a model for throughput, efficiency and latency of produ- cer/consumer systems when notifications or sleeping are used as a synchronization mechanism; iden- tifies different operating regimes depending on the operating parameters; validates the accuracy of our model against a VirtIO-based prototype, taking into account most of the details of real-world deployments; provides practical and robust strategies to maximize throughput and minimize energy while keeping the latency under control, without depending on precise timing measurements nor unreasonable assumptions on the system’s behavior. The study is particularly interesting for Network Function Virtualization deployments, where high-rate producer/consumer systems in vir- tualized environments are the core components. Keywords: virtual machines; high-speed I/O; energy efficiency Received 20 January 2017; revised 23 July 2017; editorial decision 11 September 2017; Handling editor: Iain Stewart 1. INTRODUCTION which must be kept streaming to avoid abysmal performance (and mechanical wear) due to frequent start/stops. Large buffers Computer systems have many components that need to in that case came to help in achieving decent throughput; the exchange data and synchronize with each other, to determine inherently unidirectional (and sequential) nature of tape I/Odoes when new data can be sent or received. The timescales of not call for more sophisticated solutions. these interactions span from the nanosecond range for on- We are interested in a similar problem in the communication chip hardware (CPU, memory), to hundreds of nanosecond or between a process that runs in a VM, issuing I/O operations at microseconds for processes or Virtual Machines (VMs) and high speed, and the hypervisor software implementing the corre- their hypervisors, up to milliseconds or more for peripherals sponding ‘virtual’ I/O device. In these cases, we aim at through- with moving parts (such as disks or tapes), or long-distance puts of tens of Gigabits per second, millions of I/O operations communication. per second, and reasonably low delays (tens of microseconds) in Synchronization can be implicit, e.g. when a piece of hard- the delivery of data. The problem is particularly interesting ware has a guaranteed response time; or it can be explicit, rely- when thetypeofI/O is networking. The latency aspect, tightly ing on polling (i.e. repeatedly reading memory or I/Oregisters related with the bidirectional nature of network communication, to figure out when to proceed, possibly using short sleeps to low- is what makes the problem a hard one. Moreover, mechanisms er CPU usage) and/or asynchronous notifications,e.g.interrupts. that allow VMs to exchange network packets between each The cost of synchronization can be highly variable, and some- other at high speed are an enabler technology for the Network times even much larger than the data processing costs. This used Function Virtualization paradigm [1]. Any optimization to be a well-known problem when accessing magnetic tapes, SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 ASTUDY OF I/OPERFORMANCE OF VIRTUAL MACHINES 809 addressing these basic mechanisms can potentially impact thou- may obtain estimates of these parameters. In Section 5,we sands deployments, through popular cloud management soft- experimentally validate our models using a representative ware like OpenStack [2]. implementation of a VirtIO producer/consumer system. In Synchronization in these scenarios typically requires inter- Section 6, we relax some of the simplifying assumptions rupts, context switches and thread scheduling for incoming adopted in the model and study the consequences using both traffic, system calls and I/O register access (which translates our VirtIO implementation and a simulator; here we show in expensive ‘VM exits’ on VMs) for outgoing traffic. The how our model is useful to understand real-world perform- high cost of these operations (often in the microsecond range) ance issues. In Section 7, we present some practical method means we cannot afford a synchronization on each packet to identify the operating regime of a system; then we suggest without killing throughput. how to choose the synchronization method and the tunable Amortizing the synchronization cost on batches of packets parameters to improve performance depending on the regime. [3–5] greatly improves throughput, but has an impact on In Section 8, we apply these strategies to two representative latency, which is why several network I/O frameworks [6–9] design examples and experimentally show their benefits. In rely on busy waiting to remove the cost of asynchronous noti- Section 9, we discuss some of the limitations of the proposed fications and keep latency under control. model and suggest some possible extensions. Finally, Busy wait polling has, however, a significant drawback Sections 10 and 11 report related works and our conclusions. related to resource usage: it consumes a full CPU core, may keep busy the datapath to the device or the memory being mon- itored, and the power dissipated in the polling loop may prevent 2. SYSTEM MODEL the use of higher clock speeds on other cores on the same chip. To gain a better understanding of the problem of our interest, Using short sleeps instead of busy waiting can help reduce the in this section, we will study the behavior of a system made CPU consumption while preserving good throughput. of two communicating parties, as in Fig. 1:a Producer P and A middle ground between asynchronous notifications and a Consumer C, where P sends one or more messages at a busy waiting is implemented by modern ‘paravirtualized’ VM time to C through a shared FIFO queue with L slots. devices [10] and interrupt handling [11] strategies. In these The basic assumptions of the model are that P and C can solutions, the system uses polling under high load conditions, work in parallel and the the cost of inspecting the shared state but reverts to asynchronous notifications after some unsuc- (e.g. to ascertain the number of messages in the queue) is cessful poll cycles. negligible if compared with the cost of all other operations The key problem in these solutions is that strategies to that they must perform. These operations include the process- switch from one mechanism to another are normally not ing of the messages, sending and receiving notifications, adaptive, and very susceptible to fall into pathological situa- going to sleep, waking up and so on. These assumptions are tions where small variations in the speed of one party cause typically true in VM environments, where P and C are two significant throughput changes. In our tests, we have fre- threads that live on the opposite sides of a VM boundary in a quently seen systems moving from 100–200 Kpps to 1 Mpps multi-core system. In this environment, accessing shared with minuscule changes in operating conditions [4]. Even memory is much cheaper than, e.g. sending notifications. On when the throughput shows less dramatic variations, the sys- the contrary, non-virtualized I/O where either the producer tem’s resource usage may be heavily affected, which is why (for reception) or the consumer (for transmission) is imple- we need to understand and address this instability. mented as part of a peripheral device, does not perfectly Note that these kinds of problem mostly show up under extreme operating conditions, e.g. when a system is processing a large number of packets-per-second during a DOS attack. In R Producer Consumer those situations, real-world applications may suffer from a num- ber of other, unrelated problems. To isolate the synchronization problem from the rest, this study is limited to mathematical FIGURE 1. System model. Producer and consumer exchange mes- modeling, simulation and synthetic-workload experiments. sages through a queue, blocking, sleeping or busy waiting when The rest of the paper is organized as follows. In Section 2, full/empty, and possibly exchanging notifications to wake up the we provide a model for a single-producer/single-consumer blocked peer. The producer receives request to produce new mes- system under different synchronization mechanisms, explain- sages from a request queue R. ing how different operating regimes may arise and what kind of impact on performance comes by speed differences, delays Note that at very high message rates (say, several tens of millions of mes- and queues. In Section 3, we analyze our models and derive sages per second) the costs of accessing the shared memory can no longer be criteria to compare the different operating regimes against neglected, since the time spent producing and consuming each message each other, basing on the value of operating parameters. In becomes comparable to the time spent stalling on cache misses. Such scen- Section 4, we give suggestions on how the system designer arios are out of the scope of this paper. SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 810 G. LETTIERI et al. match this model. In fact, the peripherals can only access blocked or sleeping) because it had previously seen a full memory using relatively expensive DMA operations that go queue, or it may be busy sending a notification to C. The through the PCIe bus. Moreover, P can block and packets are main purpose of this additional queue is to decouple the time never dropped when the FIFO is full (packets may still be when new messages ‘should be produced’ from the time they dropped in the stages that come before P and after C, but this are actually produced when the other communication activ- activities are outside of our model). This is typically true for ities of P are taken into account. the VM/backend boundaries we are interested in, but it is Ideally, we would like our system to process messages at a conspicuously false if we map P to a network interface. For rate set by the slowest of the two parties, and with the min- this reason, models such as [12], that were developed in the imum possible latency and energy per message. As we will past for hardware-based interrupt and polling, do not transfer see, actual performance may be very far from our expecta- easily to virtualized environments and new models, such as tions and from optimal values. the one propose in this work, need to be studied. Before starting our analysis, we define below the para- In our model, because of the assumptions on threading and meters used to model the system (see Table 1). the cost of shared memory operations, many situations may We measure the cost (i.e. the amount of work) of the vari- arise in which P and C are able to work in parallel without ous operations in clock cycles rather than time. This will ease incurring any synchronization cost. Whenever C has finished reasoning about efficiency when our system has the option to processing the last message from the queue, it can inspect the use different clock speeds to achieve a given throughput. queue again and immediately see if new messages have Some parameter-specific additional assumptions: (i) all the become available, in which case it can process them right time spent in S and S is actual work that the CPU must per- away. Similarly, whenever P has finished producing a mes- form to complete the notification and schedule the notified task; sage that has filled the queue, it can look again and see if C (ii) the sleep cost Y , which is a system-dependent parameter, is has freed up some space in the meantime, allowing P to pro- also the minimum length of any sleep interval (Y or Y ). C P duce some more messages. Each message that P produces Throughput, energy and latency all depend on the pattern keeps C active for some more time, in turn giving more time of requests coming to the producer. For throughput and to P to produce more messages. In this way, P and C can sus- energy measurement, we assume greedy regimes, where R is tain each other for long. never empty, which means that the producer generates new If their speed does not match, however, the faster party messages continuously, and we observe the corresponding will eventually run out of work and will have to wait for the values at regime. slower one. C cannot proceed if it finds an empty queue after Regarding latency, we observe the time elapsed between the consumption of the last message, and dually P cannot pro- the moment a request reaches the extraction point of the R ceed if it finds a full queue. In these cases, the parties must queue and the moment the same request is served by C, for take special actions to find out when their activity is pos- any possible pattern of previous requests coming from R. The sible/needed again. We consider three kinds of special rationale of this definition is to study how much service delay actions: a latency-sensitive request can experience, especially when the system is under load—e.g. requests arriving on R at high rate. � polling by busy waiting, continuously checking the state of the queue without leaving the CPU core to any TABLE 1. The parameters used in the analysis. other task; � polling by sleeping for a fixed amount of time, pos- L The length of the queue sibly repeatedly, if nothing has changed after the wake W Cost for P to process one message and enqueue it up; Cost for C to dequeue one message and process it � blocking (yielding the CPU core to other tasks) and k Threshold used by P to notify C. When C is blocked and P asking for an explicit notification from the other party. queues a message, a notification is sent when the queue reaches k messages (typically k = 1) P P Busy waiting can waste large amounts of CPU cycles when Threshold used by C to notify P (notifications are sent when there is no communication. Notifications on the other hand slots are available) involve extra work to be sent and received, and may be deliv- N The cost for P to notify C about a queue state change ered with some delay. Sleeping, finally, may increase the N The cost for C to notify P about a queue state change latency of messages that arrive at the wrong time. S The cost for P to start after a notification from C In our model, P tries to produce a new message as soon as S The cost for C to start after a notification from P it receives a new request from a private, infinite queue R. The length of the sleep interval for C Once started, however, an operation cannot be interrupted. Y The length of the sleep interval for P Therefore, requests may queue up in R since P may be busy Y The cost of a sleep operation serving a previous request, or it may be inactive (either SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 ASTUDY OF I/OPERFORMANCE OF VIRTUAL MACHINES 811 The combinations of synchronization methods and para- representing the worst case service latency, where the meters can give rise to a large number of operating regimes, request arrives immediately after P has started to serve a which we describe next. As we will see, some regimes are previous request. more favorable than others, so we will try to determine the ... ... P: conditions that cause the system operate in a given regime x, ... ... C: and, for each of them, we will determine (i) the average time between messages, T (the inverse of the throughput); (ii) the In case (W < W ), the worst case latency has to take into PC total energy per message E (which includes the work of account the time needed by C to process the L messages both P and C); and (iii) an upper bound D for the latency (as already in the queue. Hence, we have defined above) experienced by any request. To study the evolution of the system, we will draw many TW ={ max ,W}, BW P C diagrams that show the parallel activities of P and C over E ==2, T WW+ +∣∣ WW- BW BW P C P C time, using the following symbols: ì ü ï2i WW+<fW W,ï PC C P ï ï D £ () 1 BW í ý ï(+LW1i ) fW >W.ï Symbol Description CC P îï þï producer processing a message In the BW regime, throughput and latency are optimal, with consumer processing a message latency only depending on the processing times and the producer busy waiting length of the shared queue. consumer busy waiting producer sleeping consumer sleeping producer sending a notification 2.2. Polling by sleeping consumer receiving a notification Here we assume that P and C synchronize by going to sleep consumer sending a notification for a fixed amount of time: Y units of time for the consumer producer receiving a notification and Y for the producer. We can identify three greedy regimes depending on whether the producer processes messages faster or than the consumer or not, and also depending on whether The length of the symbol measures the time spent by P or the queue between P and C is sufficiently long to absorb the C in the corresponding activity. sleep times Y and Y . C P For latency measurements, we focus on a single message When the queue is sufficiently long, the slowest party is and use the following additional symbols: always actively working, and the system throughput only depends on its processing time. The fastest party instead peri- Symbol Description odically sleeps, waiting for its peer to catch up and make message at extraction point of R/leaves the system more work available. If the fastest party sleeps for too long, producer processing the selected message however, also the slowest one will run out of work and sleep, consumer processing the selected message so that the system works at reduced throughput. In our model, the system may be in one of the three operat- ing regimes, depending on the relative size of the system parameters. The conditions to check can be grouped in three 2.1. Polling by busy waiting inequalities, whose possible states are summarized in Table 2 together with a corresponding acronym. Each regime corre- When the system uses busy waiting (BW), P and C are sponds to a different combination of the inequality condi- always active, and the slowest of the two spins for the other tions, and it is identified by an acronym (sleeping fast to be ready. On each message, this requires on average a number of cycles ∣WW - ∣ equal to the difference in pro- PC TABLE 2. Conditions for the sleeping-based regimes (‘−’ means cessing work between the two parties. ‘don’t care’). Detailed explanations are in Sections 2.2.1–2.2.3. In order to compute the latency as defined in Section 2,we consider all the possible states the system can be when a (-LW 1) -W (-LW 1) -W PC CP request arrives at the extraction point of R, and find the one W < W >Y − sFC CP C that has the worst latency. <Y − sLS An example of evolution of the system over time for W > W − <Y sLS CP P the case W < W is shown below, with rectangles repre- CP − >Y sFP senting processing and the two vertical arrows SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 812 G. LETTIERI et al. consumer (sFC), sleeping fast producer (sFP), long sleeps TW = , sFC P (sLS)) which is explained in the following sections. EW=+W + , sFC P C To find the worst service latency for each of the three regimes, all the possible internal states of the system must be DD £( max , 2W+Y+W). (4) sFC sI P C C examined, which means considering all the allowed combina- tions of P and C being active or sleeping and the number of Throughput is optimal because the system is processing mes- messages in the queue. In particular, if a request arrives when sages at the rate of the slowest party (P). Increasing Y the system is idle, independently of the regime, the upper bound reduces the energy, but increases maximum latency, so a D for the latency can be derived from the following diagram: sI trade-off is necessary. In any case, Y cannot be increased too much, to prevent P from fill up the queue and sleep. ... ... P: ... ... C: 2.2.2. Sleeping fast producer where the request arrives right after P starts to sleep, and C If W < W , we have a regime similar to sFC, but with the PC starts to sleep right before the request is published in the roles of P and C reversed. P is faster, so it eventually fills up queue. Hence, we have the queue and goes to sleep for Y cycles. Since the sleep interval is short enough (i.e.YL <( - 1)W -W ), C is PC P DY£+W +Y +W.2() sI P P C C never able to empty the queue, and can work at its maximum rate. In this regime, C is always active, while P alternatively This formula will be useful to describe the upper bound latency produces a batch of messages and sleeps. With a reasoning for sFC and sFP, as described in Sections 2.2.1 and 2.2.2. similar to the one reported in Section 2.2.1, we can derive the average batch size b = and write T and E . sFP sFP WW - CP 2.2.1. Sleeping fast consumer Note that as W approaches W , the batch b grows to infin- P C If P is slower than C (i.e. W < W ), C eventually empties the CP ity in both sFC and sFP, and P and C proceed in lockstep at queue and goes to sleep for Y cycles. Since the sleep interval is the ideal rate of one message every W = W cycles. PC not too long (i.e.YL <( - 1)W -W ), C never allows P to CP C The worst case service latency, depending on the relative fill up the queue, so that P can work at its maximum rate, produ- size of parameters, may show up when the system is idle or cing a packet every W cycles. Each time C wakes up, it quickly when a request has to wait for C to process the L packets empties the queue and goes to sleep. The evolution of the sys- already in the queue. The latter case happens when tem over time is shown below, with horizontal arrows represent- YY+<LW . Hence we have PC C ing sleeps. DD £( max ,(L+ 1)W). (5) sFP sI C ... ... P: ... ... C: Also in the sFP regime throughput is optimal, and energy decreases as Y increases. If the latency is bounded by While P is always active, C alternatively consumes a batch (+LW 1) , there is no dependency on Y and we can choose C P of messages and sleeps. The batch size is generally not con- its value as the maximum one that does not cause C to sleep. stant, but oscillating between two consecutive values. If , Otherwise, Y should be limited to bound latency as needed. and are rational numbers, the evolution is periodic. If W Y P C is the number of messages processed by C in a multiple of the period, and the number of sleeps in the same interval, 2.2.3. Long sleeps then b = is the average batch size, and we can write If the faster party sleeps for too long, also the slower one will run out of work and sleep. This clearly means that the nW=+ nW hY,3() CP C C C C throughput will not be optimal as it is for sFC and sFP, i.e. TW ³ if W < W andTW ³ if W < W . sLS P CP sLS C PC n Y C C from which we get b== . The batch size oscil- As confirmed by our simulations, sLS cause the system evo- h WW - C PC lates between ⌊b⌋ and ⌈b⌉, depending on how P and C inter- lution to be quite complex, although periodic. Closed formulas leave during the batch. Knowing b we can determine E , for T and E are hard to find and probably not much use- sFC sLS sLS considering that the sleep cost is amortized over a batch of b ful. Instead, we provide some upper and lower bounds by con- messages on average. sidering the best and the worst possible scenarios. If W ³ Y , the worst case service latency for sFC shows Throughput bounds for sLS: The best scenario is the PP up with a greedy input pattern, since the request has to wait one that maximizes the time for which P and C work in an additional Y before being served by C. Otherwise, if parallel, and the sleeps are perfectly aligned to make the W £ Y , the worst situation corresponds to the case when system process the same number of packets in each peri- PP the system is idle. In formulas, we have od, as shown in the figure below for the W < W case. CP SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 ASTUDY OF I/OPERFORMANCE OF VIRTUAL MACHINES 813 for E , we compute the energy assuming that both P and C m L sLS ... . .. P: are able to process their maximum batch each time they sleep, ... ... C: even if this is not actually possible. Thus, we can write In this scenario, the system processes L + m messages per Y Y EE EW>+W+ + .8() sLS P C (-LW 1) -W é CPù Lx + ⌈⌉ 2Lx + ⌈⌉ m = period, with . The number m is derived ê ú WW - PC ê ú noting that P starts filling the queue with a delay W and then Y opt E Considering that E ()YW= +W+ (as per sFC PC Lx + keeps working in parallel with C until C empties the queue. Equation (4)), it is enough to prove that following inequality Hence, we have holds (+Lm)W +Y CC C 11 1 T ³ =+ W .6() sLS C "³ x 0, < + ,9() Lm + Lm + L +x 2Lx + ⌈⌉ Lx + ⌈⌉ If W < W , we can write an analogous expression for m PC but this can be easily shown to be always true by means of and a lower bound for T by simply swapping P and C. sLS some algebraic manipulations. In the worst case scenario, P and C never work in parallel, Applying a specular reasoning to the case W < W , it can PC alternatively filling and emptying the whole queue in each opt be inferred that E >( EY ). In conclusion, we have sLS sFP P period. This can happen only ifYL >( - 1)W -W and CP C shown that the sLS regime is not convenient in terms of YL >( - 1)W -W (cf. Table 2). One of the two parties PC P energy efficiency. This is a useful information, because sLS sleeps only once per batch, while the other may sleep more is also not optimal for throughput, so that excluding it from times. As a consequence, the length of the period is not larger our solution space will not result into a trade-off. than max{+YLW ,Y +LW }, L packets are processed PP C C Latency upper bound for sLS: In the worst case, a request during each period and we have arrives at the extraction point of R when P has just started filling the last element in the queue, while C is sleeping (possible because Y Y P C TW£+ max , W+ . (7) sLS P C {} YL >( - 1)W -W ). CP C L L L − 1 h Y p p ... ... P: Equations (6)and (7) show that inter-message distance tends ... ... C: to increase linearly with the sleep interval length, and that, in the Y h Y c c c worst case, the sleep interval is amortized over L messages. P has to wait for C to wake up and empty the queue (pos- Energy lower bound for sLS: In Sections 2.2.1 and 2.2.2, sible becauseYL >( - 1)W -W ), before it can produce PC P we have seen that the most energy-efficient sleep length is opt opt the request. From the diagram, it is clear that YL =( - 1)W -W when W < W andYL =( - 1) PC C CP P YY³= ⟹h 1 (otherwise this would not be the worst PC P WW - when W < W . While lower and upper bounds for CP PC case). When P wakes up and serves the request, in the worst E could be obtained with techniques similar to the ones sLS case, C misses the new event and pays an additional sleep. If used for T , for our purposes, it is enough to show that sLS opt opt we ignore that P and C sleep together for a while before C E >( EY ) and that E >( EY ). This would sLS sFC sLS sFP C P starts draining the queue, pretending the two sleeps are serial- mean that the per-message energy for sLS is worse than the ized, then we have D £+YY +W+Y +W when sLS C P P C C best possible energy in sFC (or sFP), and consequently that YY ³ . Similarly,YY³= ⟹h 1. When C wakes up PC CP C the energy efficiency of the sleeping mechanism is optimal after the queue has been emptied, it will find the request pro- when the sleep interval of the faster party is the largest one duced and can serve it, hence D £+YLW ++Y W . sLS C C C C that still prevents the slower one to sleep. Using the inequalityYL >( - 1)W -W , we can upper PC P Focusing on the case W < W , we observe that the maximum CP (-LW 1) -W CP bound the term LW . In conclusion, independently on the batch size for C is L + ⌈x⌉,with x = , as described WW - PC relative size of Y and Y we have C P in the throughput lower bound scenario above. The maximum batch size for P is instead 2Lx + ⌈ ⌉, corresponding to the follow- DY£+22 Y+W+W. (10) sLS C P P C ing time diagram: L m L As a particular yet interesting case, ifYY » then we ... ... PC P: ... ... C: havehh== 1, which means that the worst case service PC delay is bounded by only two times the sleep interval: which is similar to the previous one, with the only difference DY£+2. W+W (11) sLS P C that P is not sleeping when C starts. To derive a lower bound SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 814 G. LETTIERI et al. 2.3. Notification-based regimes derived noting that C starts processing with an initial delay S , and then catches up draining the queue a little bit at a When the system uses notifications, we can identify five differ- time. Knowing b, it is easy to determine T and E , con- nFC nFC ent regimes. Similar to what we described in Section 2.2,the sidering that the notifications and startup costs are amortized regime depends on the relative size of W and W andalsoon P C over batches of b messages: whether the queue is able to absorb the startup times S and S . With a sufficiently long queue, also in this case, the slow- TW=+ ; nFC P est party will determine the overall throughput, but the need to periodically stop and restart using notifications will add an NS + PC EW=+W + .1()2 nFC P C overhead (which can be significantly large) to the average message processing time. When the queue becomes too short to absorb the notifica- A large b improves the performance of the system, and since tion latency, one party may block despite being slower than b ³ k we would like k to be large. However, systems nor- P P the other one, significantly reducing throughput. mally use k = 1 for two reasons: a larger k often increases P P Two non-intuitive results of our analysis are that (i) the sys- the latency of the system and more importantly, P often can- tem’s performance can be improved by slightly slowing down not tell whether there will be more messages to send after the the fastest party, in order to reduce the overhead of notifications, current one. and (ii) the threshold for notifications has opposite effects Assuming k = 1, the worst case delay experienced by a depending on whether we are in a long or short-queue regime. request at the head of R includes the cost of a producer notifi- As a consequence, correctly identifying the operating cation and a consumer startup. When m = 1, in particular, the regime is fundamental for properly tuning (either manually, request has to wait for two producer notifications before or automatically) the system’s parameters. being served, as illustrated in the following diagram: Similar to what we presented in Section 2.2,the five operat- ... ... P: ing regimes (notified fast consumer (nFC), notified fast producer ... ... C: (nFP), slow consumer startup (nSCS), slow producer startup (nSPS) and slow producer and consumer startup (nSS)) are told so that we have apart by means of three inequalities, summarized in Table 3. DW£+22N+S+W. (13) nFC P P C C 2.3.1. Notified fast consumer When C is faster than P (i.e. W < W ), C will start after the CP notification from P and eventually drain the queue and block. If 2.3.2. Notified fast producer C starts fast enough (i.e. S <(Lk - )W -W ),the queue CP P C When W > W , we can identify a different regime, which we CP will never become full and, therefore, P will never block. The call nFP (fast producer), which behaves like nFC but with periodic evolution of the system over time is shown below, the roles of P and C reversed. P is faster than C, so the queue with triangles indicating notifications and wake-ups. eventually fills up and P blocks. The notification from C to restart P is sent when there are k empty slots in the queue. If ... ... C P: ... ... P starts fast enough (i.e. S <(Lk - )W -W)), it refills C: PC C P b the queue before it becomes empty and, therefore, C never blocks. In this regime, P is always active, and periodically gener- We omit the T and E formulas for brevity, but the ana- ates notifications when C is blocked and the queue contains nFP nFP lysis and graphs in the rest of the article also cover this regime. k messages. The number of messages processed by C (and The latency analysis is more interesting, since a request at Sk +( - 1)W êCP Cú P) in each round is b = + k . The number b is ê ú WW - the head of R has to wait for C to process the L messages PC ë û already in the shared queue; since C periodically notifies P, the latency is delayed by a number of notifications that is pro- TABLE 3. Conditions for the notification-based regimes (‘−’ means ‘don’t care’). Detailed explanations are in Sections 2.3.1–2.3.5. portional to L and inversely proportional to the batch b: (-Lk)W -W (-Lk )W -W æ ö ê ú PP C CC P Lk - ç ÷ DW£+21 LW+N + . ()14 nFP P C C ê ú÷ è ê úø ë û W < W >S − nFC CP C <S >S nSCS C P >S <S W > W nSPS CP C P 2.3.3. Slow consumer startup − >S nFP Regime nSCS differs from nFC in that C is fast but has a − <S <S nSS C P long startup delay, so P can fill the queue before C has a SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 ASTUDY OF I/OPERFORMANCE OF VIRTUAL MACHINES 815 chance to remove the first message. This forces P to block Just looking at the equation, it might seem that there is a until k messages are drained and C generates a notification. good amortization of the notifications and wake-up costs The situation then repeats periodically once C has drained (once per L messages). However, the timing diagram shows the queue, as shown by the following diagram: clearly that P and C alternate their operation, making the throughput less than half of that of the fastest party. L − k k P m P ... ... In the worst case, latency scenario for nSS, assuming k = 1, P: ... ... C: a request arrives at the extraction point of R when P has just k L − k C C started producing the last available slot in the queue, while C is slowly starting up. Once C wakes up, it starts draining the queue, The cycle contains L + m messages, where notifying P after k messages. When P wakes up, it produces the request and notifies C that serves the request as soon as it wakes ê ú (-Lk )W -(S +W) CC P P up again. m = ê ú+( 1. 15) ê ú WW - PC ë û ... . .. P: ... ... C: We omit the formulas for T , E and D as they are nSCS nSCS nSCS k L − k C C long and not particularly useful. The worst case latency ana- It is clear from the diagram that the delay experienced by lysis reported in Section 2.3.5 is also also valid for nSCS.In the sensitive request includes all the notification (N , N ) and any case, important insights on throughput for this regime P C startup times (S , S ). The exact formula for nSS latency is come from the analysis of the above diagram and P C not shown, as it is more useful to present an upper bound that Equation (15). The slow party (P) has to wait because of a is also valid for the nSCS and nSPS regimes large S , and increasing k reduces m, thus extending the idle C C time for P and increasing the amortized cost of notifications DW£+21(k + )W+2S+N+N+S. (17) SQ P C C C C P P and startups. Note that k has opposite effects on performance in the In general, regimes with short queues are unfavorable and two regimes nSCS and nFP, due to the slow startup time: in we should avoid operating the system in them. nFP, a large k improves performance, whereas in nSCS,we should use a small k . 3. ANALYSIS OF THROUGHPUT, LATENCY AND 2.3.4. Slow producer startup EFFICIENCY This regime is symmetric to nSCS, and it appears when the producer is faster than the consumer, but slow to respond to a Using the equations for T , E and D derived in Section 2, x x x notification. For brevity, we omit the formulas, which be we now explore how throughput, efficiency and latency obtained from the nSCS case by swapping every P with C. change depending on the parameters, for each of the three The long startup time leads to different choices for the param- synchronization mechanisms modeled (busy waiting, sleep- eter k : in regime nFC, we aim for a large k , whereas in P P ing, notifications). We also compare the mechanisms against regime SPS, we should use a small value for that parameter. each other, highlighting advantages and drawbacks. 2.3.5. Slow producer and consumer startup This regime combines the previous two. P and C alternate 3.1. Throughput operation due to the large startup delays, and individual speeds only matter in relation to the startup times. An example of evo- We start our analysis looking at the average time between lution over time is shown in the following diagram. messages, T . Since busy waiting (BW) is optimal for this per- k L − k P P formance indicator, we first compare the other two mechan- ... . .. P: isms against BW. ... ... C: L − k k C C 3.1.1. Throughput for notification regimes Each round in this case comprises exactly L messages, and In Fig. 2,weplot T for notification-based regimes, for a and have a relatively simple form: T E given W (consumer processing time) and variable W (produ- nSS nSS C P cer processing time). The region to the left of W = W corre- PC kW++ k W N+ S+ N+ S PP CC P P C C sponds to a fast producer. T = ; nSS There are three curves of interest here. The dotted line at NS++N +S PP C C the bottom represents the minimum inter-message time, EW=+W + .1()6 nSS P C which isTW ={ max ,W }. This corresponds to the best BW P C throughput we can achieve if efficiency is not an issue, and SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 816 G. LETTIERI et al. can be obtained with busy waiting, i.e. keeping the fastest Since the equations governing the system are completely party continuously spinning for new opportunities to work. symmetric, the curves for a fixed W and variable W have a P C The next curve (solid line) represents T and T , corre- shape similar to those in Fig. 2. This shows that there are nFP nFC sponding to the first two notification regimes. Here the dis- regions of operation where increasing the processing costs tance between messages is higher than in the ideal case due (W in nFP, W in nFC) increases throughput. P C to the effect of notifications and startup times. These are While the graphs focus on variations of W and W , they P C amortized on the number b of messages per notification; b also show the sensitivity of the curves to other parameters. changes in a discrete way with the ratio W /W , hence, the As an example, the distance among T , T and the optimal nFP PC nFC curve has a staircase shape. value T is bounded by N /k and N /k , so we have CC BW PP It should be noted that depending on the queue size L and knobs to reduce the gaps. Also, the position of the last big the values of the operating parameters, we cannot guarantee jump in throughput in regime nFC can be controlled by that the system operates in regimes nFC or nFP. Sections increasing S . This means that all the rest being the same, a 2.3.3, 2.3.4 and 2.3.5 indicate the conditions for which we slower wake-up time improves performance. may enter one of the three regimes nSCS, nSPS or nSS, all of which have a larger inter-message time than nFC and nFP. 3.1.2. Throughput for sleeping regimes Hence, our third curve of interest is labeled T , which nSS In Fig. 3, we plot T as a function of Y for sleeping regimes, corresponds to W++WN/L+N/L and marks the best x PC P C C assuming W < W . While Y is small enough (i.e. possible performance in regime nSS. Operating curves for CP C YL £( - 1)W -W ), the curve corresponds to T , nSCS and nSPS are not shown for the sake of simplicity, but CP C sFC where the inter-message time is constant and optimal, match- they lay above T and T , and partially also above T . nFP nFC nSS ing the one achieved with busy polling. This happens because It is important to note that performance can jump among P works at full speed, never finding the queue full and never T , T and T even for small variations of the operating nFP nFC nSS spending time for synchronization. parameters. Hence, it is imperative to either make the region For larger values of Y , we reach the sLS regime, where the between the two curves small, or set parameters to minimize queue is not able to absorb the sleep interval anymore and P the likelihood of regime changes. repeatedly fills the queue and goes to sleep. As explained in Going back to the analysis of operating regimes, we note Section 2.2.3, the average distance between packets increases that both nFC and nFP have two different regions, separated and the exact dependency from Y is complex. Lower and by the vertical dotted lines in the figure. These boundaries upper bound curves (dashed lines) envelope the exact values occur when the batch of messages processed on each notifica- tion reaches the minimum value, respectively, k and k . The P C fact that k is usually 1 makes the jump much higher in regime nFC than in regime nFP. nFP nFP nFC nFC nSS nFC T T nFP BW W Y (L − 1)W − W P E P C W −S W k W + S C P C P C C FIGURE 3. The average time (T) and energy (C) for each message FIGURE 2. The time for each message as a function of W , for the as a function of Y , for the sleeping-based regimes. The dashed lines P C notification-based regimes. The message rate decreases as W moves are the lower and upper bounds for T in the region where both P and away from W . The curve for T represent the best case for regime C may sleep. T and C have similar shapes, but they do not differ by C nSS nSS, actual values may be much larger. a constant value. SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 ASTUDY OF I/OPERFORMANCE OF VIRTUAL MACHINES 817 of T , obtained by simulations. The upper bound grows lin- T . In any case, energy efficiency in sLS is worse than the sLS opt early with slope 1/L, while the lower bound has a smaller energy in Y , which is, therefore, the optimal sleep value. slope, which becomes close to zero as W tends to W and For reasons similar to those explained in Section 3.1.2, P C becomes close to 1/L as W diverges from W . This means describing how E depends from Y is not particularly inter- P C P that the variability (oscillation) of T in the sLS regime is lar- esting when W < W , since we want to stay away from sLS CP ger when W and W are close, and it is smaller otherwise. regimes in any case. P C If W < W , showing how T depends from Y is less inter- A specular analysis can be done for the case W < W , CP P PC esting, because (i) P may never be sleeping, so T may not since equations describing E are symmetrical. In conclusion, x x depend at all from Y ; (ii)to stayawayfromlong sleep regimes the energy efficiency analysis shows that Y and Y should be P P C we have to make sure Y is small enough so that P never sleeps, chosen small enough to avoid entering the unfavorable sLS and so we are interested in the dependency from Y . regimes, but close enough to the optimum value to amortize In the W < W case, since the equations for T are sym- the sleep cost as much as possible. PC metric, it will be useful to study T as a function of Y , and It should be also noted that choosing very distant values the shapes will be similar to the ones of Fig. 3. for Y and Y (e.g. different orders of magnitude) is not con- P C The analysis clearly shows that Y and Y should be chosen venient w.r.t. efficiency. If both peers happen to be sleeping, C P small enough to keep the throughput to the maximum. With the one with the shorter sleep interval will need to sleep proper tuning of the operating system—to make sure the many times if it is waiting for the other to wake up and sleep timeliness is respected—this can be actually achieved. advance the queue processing; this results into unnecessary Compared with notification-based regimes, sleeping has a energy consumption. If the sleep intervals are comparable, on significant advantage in terms of throughput: the slower party the contrary, one or two sleeps will usually suffice. does not need to slow down in order to notify its peer. The It is important to observe that the energy efficiency of faster party can indeed wake-up autonomously to poll the sFP/sFC regimes is always better than the efficiency of BW, queue for new work. with W and W being the same. This can be evinced from P C Y Y E E Equations (1) and (4), noting that both and are smaller Y Y C P than one. A meaningful comparison with notifications can be done once some estimates for the various parameters are 3.2. Efficiency known. In Section 4.3, we report some measurements for the While busy waiting (BW) has the highest throughput in gen- sleep cost Y and, in Section 5.2, the notification/startup costs eral, its performance may come at a high cost in terms of involved in nFP/nFC regimes that can be taken as a refer- CPU usage. In regime BW, the fast party must burn cycles ence to support the decision process. proportionally to the difference of processing times, ∣WW - ∣ . This can possibly double the total overall cost in CP 3.2.2. Efficiency for notification regimes terms of time/cycles, and can have even worse impact on In Fig. 4, we show the energy per message in different energy if the fast party has higher energy consumption per regimes. For simplicity, here we use only one graph with cycle. As an example, the fast party could be an expensive, variable W , having already established that the system is dedicated CPU/NIC/controller. symmetric and we can repeat the same reasoning for variable Therefore, it is important to also take into account the total W . Also in this case, we have three curves of interest, but energy consumption per message, i.e. the values E deter- they are not as nicely ordered as in Fig. 2. mined in Section 2. We see that the E values have the form The curve for BW (solid thin line) is no more the absolute W++WX, where the additional term X depends on the PC best in terms of efficiency. This is because the additional term operating regime. X in E is ∣WW - ∣, whereas in other cases the term X is CP BW Similar to the analysis conducted for throughput, we start upper bounded by some constant independent of the differ- by comparing notifications and sleeping against BW, and then ence W - W . As a consequence, the slope of E is twice CP BW compare them between each other. that of the other curves, and when W becomes too large (or more precisely, when ∣WW - ∣ becomes large) busy waiting CP 3.2.1. Efficiency for sleeping regimes is the worst option in terms of energy per message. Figure 3 shows E as a function of Y for sleeping regimes, The energy curve (solid thick in Fig. 4) for regimes nFC assuming W < W . For all the Y values smaller than and nFP has the same step-wise behavior as the ones for CP C opt YL =( - 1)W -W , the plot corresponds to the sFC inter-message time. The slope is however unitary (it grows as PC regime (E ), where P never sleeps and the energy is max{WW , }), and lies within the gray region in the figure sFC PC inversely proportional to Y , as the cost of each consumer depending on the actual parameters. As the graph shows, the sleep is amortized over a larger batch. For larger values of Y , curves for BW (solid thin) and notifications (solid thick) the system enters the sLS regime, where also P sleeps, and regimes may intersect in several points, whose values and the shape of E is irregular, roughly following the shape of position depend heavily on the actual parameters. sLS SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 818 G. LETTIERI et al. from when a request (that we can imagine is latency-sensitive) arrives at the extraction point of the R queue to when the E same request is serviced by the consumer, considering all the BW possible input patterns for R. For each regime, we have defined the worst case situation that can happen with the cor- nSS responding relative sizes of parameters (as specified in Tables 2 and 3), not assuming that R requests arrive greedily, 2W and we have expressed an upper bound for the latency. The BW regime, which is optimal for throughput, is also optimal in terms of latency, that is D ³ D . This happens xBW N +S C P because P and C do not have fixed-cost synchronization overheads (i.e. notifications, startups, sleeps); P can actually produce the high-priority request as soon as there is an avail- able slot and C can consume it as soon as it has processed all N +S the messages already pending in the queue. It is worth P C remarking that when W > W the latency-sensitive request CP needs to wait for C to process up to L elements before it can be serviced; although this delay (LW ) can be considerable in practice, no strategy can do better under our FIFO assump- FIGURE 4. The total energy per-message as a function of W . There may be regions where busy waiting (thin line) is more energy tion. We, therefore, compare the sleeping and notification efficient than notifications (thick and dotted lines). mechanisms against BW, in order to see how these mechan- isms introduce additional delays that make latency move The shape of the curves and their discontinuities make it away from the optimum. difficult to identify intervals in which one regime is preferable to another. We can compute them using the equations in 3.3.1. Latency for notification regimes Section 2.3, but these rely on perfect knowledge of the operat- The D inequalities and the evolution diagrams reported in ing parameters, hence the information is of little practical use. Section 2.3 show that for almost all the notification regimes Comparing the total energy per message in regime BW (nFC, nSCS, nSPS, nSS), the worst case service delay is upper with the other regimes, however, can give some useful prac- bounded by some linear combination of the notification/ tical insight. Busy waiting consumes an extra ∣WW - ∣ PC startup parameters (N , N , S , S ). Moreover, k has nega- P C C C cycles per message, so it is convenient when the cost is lower tive impact in short-queue regimes (nSPS, nSCS, nSS), since NS + PC than the extra notification and startup cost, which is in it delays the consumer notification that P needs to wake up NS + CP nFC, in nFP. Since in nFP we have b ³ k and k is C C and produce the request at the head of R. In particular, the typically large, it very unlikely that busy waiting can be nSS regime includes all these latency contributes, and so it is energy efficient. the most unfavorable one among the ones listed. Notifications with short queues: The energy efficiency The fast producer regime (nFP) deserves a separate ana- when the queue fills up is heavily dependent on the values of lysis. Since W < W , the high-priority request may need to PC the parameters. Equation (16) for E shows that the extra nSS wait C to consume L messages before it can be served. This term includes all the four startup and notification times instead is not an issue by itself, because also the busy waiting (opti- of only two of them for E and E . Given that we expect nFC nFP mal) mechanisms have the same limitation. However, C is one of S , S to be large, this might be a significant cost. C P slowed down by the notifications that it needs to send in On the other hand, the energy efficiency of these regimes order to wake up P periodically. The number of notifications is not too bad, because producer and consumer tend to have is not constant, but depends on the queue length and the significant idle times, and the overheads are amortized over batch size. The inequality (14) implies that notifications relatively large batches (e.g. the entire queue size in regime are needed in the worst case. Since D is not bounded by nFP nSS). This phenomenon is evidenced by the curve E (dot- nSS a linear combination of the parameters, like for the other ted) in Fig. 4, which also intercepts the others. notification regimes, is not possible to tell in general whether nFP is more favorable than nSS or not. It is useful to note that improves latency in nFP, while it has the 3.3. Latency opposite effect with short queues. We now complete our analysis with the latency, using the As already stated previously, in a real system, the para- upper bounds derived in Section 2. As explained, the worst meters are not exactly constant, and thus it is usually difficult case latency is defined as the maximum time that can elapse to guarantee that the system never ends up (even temporarily) SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 ASTUDY OF I/OPERFORMANCE OF VIRTUAL MACHINES 819 in a short-queue regime. As a result, the estimation of the platform has an Intel Core i7–3770 K CPU at 3.50 GHz (4 worst case service latency of a producer–consumer system cores/8 threads), 8 GB RAM DDR3 at 1.33 GHz, and runs based on the notification mechanism should take into account Linux 4.6.4. A recent version of the QEMU hypervisor (git the latency bounds for both nSS and nFP. master 9a48e3, June 2016) is used to run the guest VM using KVM hardware-assisted virtualization. The guest is given 1 3.3.2. Latency for sleeping regimes vCPU and runs Linux 4.6.4. In order to improve the reprodu- The D inequalities presented in Section 2.2 for sFC and sLS cibility of results, all the tests have been run with the follow- illustrate that the worst case delay for sleeping regimes is upper ing configuration (except when explicitly noted): bounded by a linear combination of the sleep intervals, namely Y and Y . As a notable case, we have also seen that ifYY » (1) No load on the machine other than essential operat- C P PC then the worst case latency does not exceed two times the sleep ing system services. interval, plus the time necessary to process the request. The lat- (2) Dynamic frequency scaling disabled, so that all the ter result is particularly interesting, because choosing the sleep CPUs run at maximum frequency. intervals similar to each other is also a good choice in terms of (3) Sleeping C-states disabled, that is all the CPUs in C0 energy efficiency, as discussed in Section 3.2. all the time; the host O.S. never issues the halt The sFP regime requires a separate discussion, similar to instruction to pause the CPU, even when there is no the corresponding fast producer notification regime (nFP). active process to schedule. This is not the default The worst case latency for sFP is optimal, because the behavior of Linux, and requires the idle=poll latency-sensitive request has to wait for C to consume all boot parameter to be specified. messages already in the queue, which is not distinguishable (4) Hyperthreading and turbo mode are disabled. from the behavior of BW. In other words, a larger Y does (5) Each thread part of the experiment is pinned to a dif- not impact latency, as long as C does not sleep (so that the ferent physical core. system does not enter the sLS regime). (6) KVM halt polling disabled by setting the Compared with BW, the latency of the sleeping mechanism halt_poll_ns module parameter to 0. This is is worse (idle system, sFC, sLS), but it can be kept under con- necessary to isolate the CPU utilization related to our trol by properly limiting Y and Y . A comparison between producer/consumer system, not including the cycles C P the notifications and sleeping mechanisms can be done with wasted by KVM because of this optimization that can some estimates of the notification parameters and the sleep take up to 60% of the CPU time in some pathological cost Y , using the D upper bounds. Sleeping can be conveni- cases. E x ent if the Y and Y interval values can be chosen sufficiently C P small w.r.t. the notification parameters. 4.2. Description of the system under study 4. ESTIMATING THE SYSTEM PARAMETERS VirtIO [10] is a widely used standard and API for I/Oparavir- tualization; most of Hypervisor software (QEMU, bhyve, The best mechanism for a given set of requirements— VirtualBox, Xen, etc.) and guest operating systems (Linux, throughput, energy and latency—can be chosen once the FreeBSD, Windows) are rapidly converging to VirtIO as the designer has some estimation of the system parameters, which default I/O infrastructure for VMs. Taking it as a reference for heavily depend on the producer and consumer implementa- experimentation is meant to maximize the impact of our work. tion, the host machine hardware and the O.S. implementation. VirtIO is a generic producer–consumer API that allows a In this section, we describe how these parameters can be guest O.S. to exchange data with its hypervisor (also referred obtained in a representative case. Since our work is primarily as the host). It provides a guest-side API and an hypervisor- focused on virtualization environments, we have chosen to side API that are used by the guest and the hypervisor, experiment with VirtIO systems, as illustrated in Section 4.2. respectively, to access VirtIO data structures. The main data structure is called virtqueue and is implemented in a portion of memory shared between the guest and the hypervisor; it is 4.1. Description of the test environment composed of two separate circular arrays (rings): the avail For all the experiments presented in this article, we have con- ring and the used ring. A guest driver inserts buffers (in the figured the testbed to minimize the noise introduced by the O. S. scheduler and by the power management features of the A feature [13] recently added to KVM that lets the vCPU thread polling for a while when the guest issues an halt instruction, instead of scheduling modern CPUs: this includes the frequency scaling and the out immediately. processor C-states (which are a significant source of latency, More precisely, a virtqueue also includes a descriptor table, which is an as several microseconds may be necessary for a core to array containing buffer descriptors. Each slot in the avail and used ring just recover from the deepest C-states). Our reference test references the head of a chain of descriptors (e.g. a scatter–gather list). SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 820 G. LETTIERI et al. form of scatter–gather lists) in the virtqueue avail ring, where � The hypervisor device implementation (consumer- the hypervisor can extract them (in FIFO order). Once the vhost.c), where the consumer (C) code runs in the hypervisor has consumed a buffer, it pushes it to the used context of a vhost thread. ring, where the guest can recover them (and possibly do some cleanup). Each virtqueue has a mechanism to let the guest Note that P and C run in two different threads, consistently send a notification to the hypervisor and vice versa. A VirtIO with our model. P and C can be configured to set different device may be composed of one or more virtqueues. As an values for the W , W , Y and Y parameters, and to choose P C P C example, the VirtIO network device has at least a virtqueue between the three strategies (notifications, sleeping, busy for packet transmission and another one for packet reception. waiting). In this way, once the W and W and D para- P C MAX To ease measurements and experimentation, we implemen- meters have been fixed, we can experiment with the different ted an ad hoc VirtIO producer/consumer device for QEMU/ strategies to optimize an objective function (cf. Section 7). It KVM on Linux, referred as virtio-pc in the following. The is worth noting that there is an implicit lower bound for the device has a single virtqueue, where only the producer and validity of the W and W parameters, related to the imple- P C consumer processing is emulated (by means of a program- mentation limits of the Linux guest-side and vhost mable busy wait); all the other operations involving the virt- hypervisor-side of the VirtIO API we are using. Our measure- queue are performed using the real VirtIO API. We have ments show that the virtqueue cannot process more than 8 chosen to use the QEMU/KVM Linux hypervisor and Linux millions items per second on our testing platform, even when as a guest O.S. for a valid reason: they provide the most com- all costly notifications are suppressed. As a consequence, it is plete, updated and optimized implementation of both VirtIO not meaningful to carry out experiments where W and W are P C APIs. In particular, the QEMU/KVM hypervisor supports a <125 ns. To stay safe and avoid possible border effects, we Linux-specific high-performance in-kernel implementation of will use values equal or greater than 200 ns. the VirtIO hypervisor-side API, known as vhost. With vhost, the hypervisor-side implementation of a VirtIO device runs in 4.2.1. Code instrumentation for time measurements a dedicated kernel thread, without requiring any intervention C is able to compute latencies which include both the W /W 4 PC from the associated user-space QEMU process. The guest costs and the queuing delay. To achieve this, P stores a time- can write into a VirtIO device register to notify the vhost stamp inside each buffer passed to C, so that the latter can thread; the register access is intercepted in host kernel space take its timestamp at the end of its processing cycle and com- by the KVM kernel module, which wakes up the vhost thread pute the difference. A distribution of latencies gets collected without the need to switch to user-space. If the vhost thread and the 98th percentile is computed as the representative of is scheduled to run on a different core than the one issuing 5 the worst case latency. Timestamps are samples using the the notification, an Inter Processor Interrupt (IPI) must be x86 TSC register, which is incremented at constant rate and sent to the destination core. Similarly, the vhost thread can is consistent across all the cores. However, TSC values read notify the guest directly instructing the KVM module to inject from the guest O.S. differ by a constant offset from the ones an interrupt. read on the bare metal. This TSC offset must be taken into Our producer/consumer experimentation framework is account when computing time difference; it can be obtained available as open source software at https://github.com/ using the Linux ftrace [14] tracing system, once the kvm/ vmaffione/qemu/tree/virtio-pc, and includes the following kvm_write_tsc_offset tracepoint is enabled. components: To validate our model correctly (and cross-check the mea- surements), virtio-pc has also been instrumented to measure � The driver for Linux guests (producer.c), exported all of the parameters we take into consideration. This is to user-space as as a character device (/dev/ important because sometimes the measured value differs from virtio-pc), where the producer (P) code runs the nominal one; for example, this is the case for Y and Y in C P entirelyinkernelspace,inthe contextofanioctl()sys- our testbed. In the following, we always use the measured tem call, which returns only when a test run is finished. values rather than the nominal ones. Parameter estimation is P is implemented by means of the Linux guest-side done both online and offline: W , N and Y are estimated by P P P VirtIO API. P with running averages; W , N , Y are measured by C in a C C C � The support in the QEMU hypervisor necessary to similar way; S is computed by C using timestamps put by P expose the VirtIO device to the guest O.S. as a PCI in the first packet of each batch of C (similar to how the device. latencies are computed). Finally, since k is greater than one, S cannot be measured online. As a part of the instrumentation, both P and C trace The usual hypervisor-side VirtIO implementation resides in user-space, which implies continuous transitions between the user-space VirtIO device implementation code and the kernel-space code which runs guest code using Higher percentiles are pruned to rule out rare large fluctuations due to hardware-assisted virtualization. interrupts and scheduling. SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 ASTUDY OF I/OPERFORMANCE OF VIRTUAL MACHINES 821 some events of interest, that is (i) P publishing a new item in undesirable, since we expect Y and Y to be in the range C P the shared queue; (ii) C seeing a new item in the shared 5–50sm in common scenarios. To remove this systematic queue; (iii) P completing a notification to C; (iv) P blocking source of delay, we have set timerslack to 0 for the entire or sleeping (queue full); (v) C completing a notification to P duration of the tests. and (vi) C blocking or sleeping (queue empty). An event is Figure 5 illustrates the results of the test runs, with the x made up of an event type, a TSC timestamp, and a sequence axis representing the nominal sleep interval (i.e. the argument number identifying the next item to be produced or con- passed to the system call) in microseconds. The first curve sumed. Both P and C store the events in a local large circular shows the measured average sleep interval Y, in microseconds. array (2 elements), so that the tracing overhead is negli- For nominal intervals <10ms, the kernel is not able to support gible. Once a test run terminates, the two event arrays are the sleep with a low relative error: the overheads involved in accessed offline using the ftrace facility and merged, taking programming the timer, updating the kernel data structures into account the TSC offset. The merged logs allow us to and perform the user-kernel context switches exceed2sm , and examine the whole evolution of the virtio-pc system, and in the the curve never goes below this value. As the nominal particular also to measure an average for S and all the other interval grows, the fixed costs are amortized more and that the parameters. relative error decreases; for nominal intervals over 50sm the relative error is close to zero. The second curve shows the average per-sleep cost. Up to ~m 2.5 s,YY » , which means that the CPU is nearly 100% 4.3. Estimating sleeping costs busy serving the nanosleep system call. No process sched- Using the sleeping mechanism requires the value of Y to be uling happens, since the expiration time is already passed measured, since (i) Y determines the energy efficiency; and when the call to the scheduler would be performed. For larger (ii) it is a lower bound for Y and Y ,i.e.itisthe minimum nominal intervals, the scheduling and wake-up start to hap- C P sleepintervalallowedbythe system.Inorder to evaluate Y pen, and CPU utilization decreases. As expected, the mea- () n and understand the behavior of the sleep primitive in our refer- sured Y is constantly 2.5ms, not depending on Y , at least () n ence test environment, we set up an experiment where a process up to ~m 20 s.As Y increases again, Y grows in a invokes the nanosleep system call N times in a tight loop, staircase-shape fashion. This is a consequence of the Linux with a fixed sleep length passed as argument. The number N is implementation of the timer subsystem, which hierarchically chosen large enough (in the order of 10 ) to collect meaningful groups expiration events depending of the order of magni- statistics. By measuring the total duration of the run (N sleeps), tude; a bigger order of magnitude means more operations are we can compute the average effective sleep interval length, needed to insert and remove the expiration event from the which is in general higher than the nominal length. internal data structures. To measure the sleep cost, we used the cpupower monitor From this analysis, we can conclude that Y»m 2.5 s on our tool (and in particular the Mperf high precision monitor), test platform, at least assuming that Y and Y are not chosen to C P which is able to compute, for each CPU, the fraction of time be larger than 20ms. If the latency requirements allow for worst the CPU is in the C0 state (i.e. actively executing instruc- tions). When the CPU is not in C0, it is in the C1 shallow sleep state; for this particular test, differently from what described in Section 4.1, we used the default value for the idle boot parameter, so that the O.S. is allowed to put the 1,000 CPUs in C1. Since the sleeping process is pinned to a CPU during the run, and there are no other processes using observ- able amounts of processing time on that CPU, we can com- 100 pute Y as the product between the measured average sleep E E interval and the fraction of time the CPU is in the state C0. The run is repeated for different values of the sleep interval, ranging from 900 ns to 1 ms; as we will see, this range is suf- ficient to illustrate the properties of the sleep primitive on our test platform. 1 10 100 1,000 (n) When an application asks the Linux kernel to sleep for Y relatively short intervals (e.g. <1ms), the timerslack per- process parameter must be considered. Unless the process FIGURE 5. Average effective sleep interval (Y) and per-sleep has real-time priority, the nanosleep Linux implementa- energy (Y ) versus nominal sleep interval. The system is not able to tion will silently add the value of this parameter, which deal with sleeps shorter than 2.5 ms, and the cost depends on the order of magnitude of the sleep interval. defaults to 50sm , to the sleep interval length. This is really SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 cycles Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 822 G. LETTIERI et al. case latencies larger than 40sm , Y can be estimated consider- constant values. In this section, we validate the model predic- ing another step of the curve, but this is not common for the tions by comparing them to actual measurements on the sys- kind of system under study in this work. Our analysis also con- tem introduced in Section 4.2. firms that it does not actually make sense to sleep for less than Y , because it would just be a convoluted and unreliable way of doing busy waiting. For the sake of completeness, we have 5.1. Validation of sleep-based regimes repeated the measurements giving real-time priorities This section presents an extended experiment meant to check (SCHED_FIFO) to the sleeping process, with the Linux kernel how much the virtio-pc system described in Section 4.2 built with real-time support (linux-rt). As expected, no measur- matches our model. For the purpose of validation (and also able differences have been observed, since the tests have been for the strategies presented in Section 7), we will slightly sim- run with the machine unloaded. plify our model, assuming that both P and C use the same sleep length, that isYY = . This practical simplification PC does not impact our study, because it has only effect when 4.4. Estimating notification costs the system operates in the sLS regimes that we want to avoid The values of notification parameters depend on how P, C in any case; moreover, usingYY = simplifies latency esti- PC and the queue are implemented (O.S. processes, VMs, shared mation and entails a simpler, staircase-shaped throughput memory, hardware controllers, etc.). The measurements curve than the one of Section 3.1.2. reported here are related to the virtio-pc reference system For the experiments, we have chosen a fast consumer scen- described in Section 4.2, and rely on its event tracing facil- ario with W =m 2s, W =m 1s and L = 512, while Y varies P C ities. In a virtualization environment notifications are quite between 4ms and 3 ms, so that we also check that the system expensive, involving VM exits, Inter-Processor Interrupt transitions to sLS regimes when Y goes beyond (IPI), calls to the host scheduler, and VM enters. LW=m 1024 s. Figure 6 shows that there is a very good In order to measure the four notification constants we have agreement between our model (values for the sLS regime are conducted two kinds of experiments. A fast consumer experi- obtained by simulation) and virtio-pc. In particular, both ment, with W = 2000 and W = 4000, is used to compute curve agrees on the fact that the average per-item time C P N and S ,as W - W is large enough that there is a notifi- increases approximately by W - W each time Y increases P C PC PC cation for each item. A different fast producer experiment, by L(-WW). The slight disagreement for large values of PC with W = 2000 and W = 500, is used to compute N and Y (which is not really interesting to us) is explained by the C P C S . Since k > 1, we do not have a C notification for each fact that the measured Y is actually quite larger thanYY = . P C P C item, and so we choose a small L = 8 to have enough sam- Figure 6 does not validate our energy model, which is ples in the event trace. especially interesting in the sFC/sFP regimes. A simple way Table 4 reports the measured average notification costs, to do that (without measuring CPU utilization) is to validate together with their standard deviations. As expected, the noti- the overall batch that is the average number of packets pro- fication cost is higher for P, since it involves an expensive cessed for each sleep, taking into account all of P and C VM enter and exit operation. The start-up cost for P is also extremely expensive, since it involves the cost of interrupt virtio-pc processing in the guest and context-switch to the user-space model process. The start-up cost for C is less expensive because it is mostly the time required to wake-up and schedule the kernel thread, and invoke the processing loop. 5. MODEL VALIDATION The model illustrated in Section 2 is a mathematical abstrac- tion where the operating parameters are assumed to be 500 1,000 1,500 2,000 2,500 TABLE 4. Measured average notification costs. Y [µs] N 1.10m 0.22 s FIGURE 6. Average per-item time versus sleep length, with N 0.58m 0.03 s Y = Y ; the dotted curve shows the measured values, whereas the PC S 28.0m 3.50 s continuous one shows the model prediction. The system enters LS S 0.42m 0.02 s regimes beyond 1024 ms. SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 T [µs] Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 ASTUDY OF I/OPERFORMANCE OF VIRTUAL MACHINES 823 sleeps. For sFC and sFP regimes, this batch corresponds to virtio-pc the b parameter described in Sections 2.2.1–2.2.2. The per- model item energy consumption is still connected to the overall batch b by the second equation in (4). Figure 7 shows again a very good match between model predictions and the measure- ments on virtio-pc, also for the sLS regimes. 5.2. Validation of notification regimes Similar to Section 5.1, we now try to validate the throughput behavior for nFP and notified fast consumer regimes, as 1.822.22.42.62.8 depicted in Fig. 2, to check to what extent a real system W [µs] matches our model. We use long enough queues (L = 512) to stay away from short-queue regimes. For the validation FIGURE 8. Average per-item time in the mathematical and the syn- experiment, we have chosen a fixed W = 2000 ns, while W C P thetic model (notification regimes). is fixed at , , W 2 ms L = 512 varies between 200 ns and 2900 ns; as we will show, this K = 1 and . Notification costs are taken from Table 4. K = 384 P C range is sufficient to show all the properties of the system, which depend on the difference between W and W . For each P C value of we have run 12 tests, each one 5 s long, measur- consistent with the fact that the interrupt rate is always very ing the average throughput, P and C notifications rate and small w.r.t. the processing rate, which is approximately 95th percentile of latency over the 5 s. Note that the valid- 500 000 items per second. In other words, the large k is very ation of the energy model comes as a consequence of the val- effective at amortizing the notifications from C to P. idation of throughput, since in nFC and nFP both throughput In the fast consumer zone (W<= W 2000 ns), the PC and energy have a strong dependency on the average batch virtio-pc system shows the effect of the increasing number of size b. notifications as the speed difference between the consumer The measured average per-packet time is depicted in and the producer increases, lowering the throughput in Fig. 8, which does not report variance as it is sufficiently accordance with the model. There are nonetheless some min- small (<3%). We can see that there is a very good agreement or deviations that need to be explained. The slope of the between the model and virtio-pc, with some minor deviations virtio-pc curve around 2.4ms is much more smooth than that will be explained later on. expected, but this is not very interesting, since it is only an For the fast producer zone ( ), the W<= W 2000 ns PC effect of random variations of the emulated W and W around P C throughput curve is mostly flat, with a very small negative the desired values (see Section 6). For values of W between slope, as the interrupt rate slowly decreases from ~570 to 2 and 2.2ms, instead, we note that the virtio-pc curve lies <10. This is a consequence of the very large used by slightly above the model curve, and it features spikes at each L = 384 VirtIO (it is set to ). The very small slope is 4 discontinuity point. This discrepancy is more interesting and it is due to unwanted notifications that the producer sends to an already running consumer. We call these notifications virtio-pc 1,000 model spurious: they are the effect of an unavoidable race in the ‘double check’ scheme used by the notification-suppression algorithm. When the consumer finds an empty queue and must therefore block (first check), it first re-enables notifi- cations, then checks the queue again (second check): if new items are found, it disables notifications again and processes the new items without blocking. If this double check were not performed, the consumer might block and the producer might not notify the next new item: this is the case if the producer pushes a new item after the first check 500 1,000 1,500 2,000 2,500 by the consumer, but before the notifications have been re- [µs] enabled. The double check avoids this possible stall, but it opens up the possibility of spurious notifications: these occur when the producer inserts a new item between the FIGURE 7. Average overall batch versus sleep length, with Y = Y ; the dotted curve shows the measured values, whereas the consumer first and the second check, and sends the notifi- PC continuous one shows the model prediction. cation between the enabling and the disabling of SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 batch size T [µs] Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 824 G. LETTIERI et al. 0.5 0.8 0.4 0.6 0.4 0.3 0.2 regular 0.2 spurious 22.22.42.62.8 0.1 W [µs] FIGURE 9. Regular and spurious notifications per-packet measured 0 2 2.2 2.4 2.6 2.8 during the fast-consumer experiments of Fig. 8. W [µs] notifications. A spurious notification is illustrated in the FIGURE 10. A plot of D() SW,,W,K with W =m 2s, K = 1 CP C P P P following diagram: and S taken from Table 4. spurious proper ... ... P: considering). Figure 10 shows a plot of Δ with W =m 2s ... ... C: and W varying in the fast consumer range of Fig. 8. We can The consumer sees no new item in the queue when the see that the probability of spurious notifications increases pre- spurious notification is received. Moreover, the consumer will cisely when Δ comes closer to zero. most likely not be able to see yet another packet after the first one, so it will go to sleep and the producer will have to send another notification, this time a proper one. Figure 9 shows 6. RELAXING THE ASSUMPTIONS the average number of per-packet spurious notifications The system used in Section 5 to validate the model still received by the consumer during the same set of experiments makes some important simplifications, namely: of Fig. 8 in the fast-consumer range. For reference, the figure also plots the ‘regular’ (i.e. non-spurious) notifications (1) the system parameters are independent of each other; received per-packet. Since spurious notifications cause add- (2) processing times (W and W ) are constant. itional work for the producer, they increase the per-item aver- P C age time. Therefore, the spurious curve in Fig. 9 clearly Assumption 1 does not hold in real systems, since features explains the differences between the model and the virtio-pc like frequency scaling or C-states may create complex rela- curves in Fig. 8. tions among the parameters. The close match of Fig. 8,in Even if the model does not account for spurious notifica- fact, is only possible because all advanced CPU features have tions, it helps in predicting them. Spurious notifications are been disabled. Nonetheless, the model can be useful to better more probable the closer the consumer and the producer understand the behavior of the system even with some of are when they look and update the empty queue between these features turned on. As an example, we now examine the them. The crucial observation here is that depending on the throughput obtained for the same experiments of Fig. 8, but difference between W and W , the model predicts that the P C with the idle=halt option instead of idle=poll. With instant t when the producer pushes the last packet in a idle=halt, the idle kernel thread will issue the hlt CPU batch and the instant t when the consumer misses, it (and instruction, putting the core into some C-state higher than 0 (C1 therefore goes to sleep) comes recurrently closer as in our case). This is a realistic example, since idle=poll W - W varies. Let us call D=-tt the interval between PC PC always keep the CPU busy and is not an option that should be these two instants, as shown in the following diagram: normally used. Figure 11 shows the new results. We can see t t C P ... ... that now, in the fast consumer region, the model and virtio-pc P: ... ... C: have significant discrepancies that become worse for higher values of W . We can also see that for these values of W ,there P P Interval Δ is a function of S , W , W and K (in the dia- is a somewhat better match if we plot the model for an higher C P C P gram we have assumed K = 1 as in the system we are value of S . This gives a clue on what is going on: the average P C SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 notifications per-packet Δ [µs] Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 ASTUDY OF I/OPERFORMANCE OF VIRTUAL MACHINES 825 value of S observed during the experiments now depends on model captures the most important effects, and it may be the value of W .Thisisconfirmed in Fig. 12,where we show used to better understand some of the secondary ones. the average values of S in the same set of experiments of Let us now explore some scenarios in which W and/or W C P C Fig. 11 (fast consumer region). The observed S is generally is not constant, therefore, relaxing Assumption 2. In order to higher than the one observed in the idle=poll experiments, examine a larger number of cases, we run these new experi- and also shows a complex dependency on W . This dependency ments in a simulator. Figure 13 shows the results obtained can be explained as follows, making use of the Δ function from the simulator when the system parameters are chosen to introduced above. When the consumer thread goes to sleep the be compatible with Fig. 8. The notification costs (N , N , S P C kernel will switch to the idle thread, which will execute the and S ) and the W and W parameters are now random vari- C P C hlt instruction, thus putting the CPU core in the C1 state. The ables, while L, K and K are as in Fig. 8. The notification P C notification IPI sent by the producer may reach the consumer costs are normally distributed; their averages and standard core either before or after the core has entered the C1 state. deviations are taken from Table 4. The W parameter is also This clearly affects S , since coming back from C1 may take normally distributed; in each experiment, the average is taken ~0.5ms [15]. Of course, the longer the elapsed time between from the x axis and the standard deviation is fixed at 5‰. The the instant the consumer decides to go to sleep and the instant average of W is 2sm in all experiments, but the distribution the producer sends the IPI, the higher is the probability that the is different for each curve: the first four curves use a normal consumer core will have entered the C1 state when the IPI is distribution with standard deviations of 5‰, 5%, 25% and received. Therefore, an high Δ should imply an higher (on average) S ,and alower Δ should cause a lower S ,which is C C essentially what we observe. For example, when W is between P S = .83 µs C   2.6ms and 2.8ms,the Δ is very high and the producer IPI 0.8 almost always find the consumer core already in C1, entailing a large S»m 0.83 s. This explains why the model with 0.6 S=m 0.83 s closelymatches virtio-pcinthisregionof Fig. 11. Note that the dependency of S on Δ is clear, but the 0.4 S = .42 µs correlation between Figs. 10 and 9 is only qualitative; this is   due to a couple of reasons: Fig. 10 is plottedassumingacon- 0.2 stant S , while we know that S varies; moreover, spurious C C notifications also affect Δ (and, therefore, S ), since they tend to increase the Δ for the next batch. In particular, this explains 2 2.2 2.4 2.6 2.8 the high values of S when W is close to W ,since,inthat C P C W [µs] region, there are as many spurious notifications as regular ones. In summary, we have seen that even if real systems are FIGURE 12. Measured average in the fast-consumer experiments much more complex than our simplified model, still the S of Fig. 11. σ =5 virtio-pc model (S = .42 µs) σ =5% C      model (S = .83 µs) σ =25% C      σ =50% 1.8 2 2.2 2.4 2.6 2.8 exponential W [µs] 1.8 2 2.2 2.4 2.6 2.8 W [µs] FIGURE 11. Average per-item time in the mathematical and the synthetic model (notification regimes) with idle=halt. w is fixed at 2 ms, L = 512, K = 1 and K = 384. The model curve is plot FIGURE 13. Average per-item time obtained by simulation with P C two times for two different values of S . The other notifications costs randomly distributed parameters. Each curve uses a different distri- are taken from Table 4. bution for the W parameter. SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 T [µs] average S [µs] T [µs] Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 826 G. LETTIERI et al. 50%; the fifth curve uses an exponential (Poisson) distribu- leads to some trade-offs. A reasonable choice can, therefore, tion. All normal distributions are truncated at zero, to exclude be done once the objective function to be optimized is clearly non-meaningful negative values. These experiments may defined. In this work, we want study how to simultaneously model a real-world packet capturing scenario in which we minimize average inter-message distance (T) and average can expect the incoming packets to arrive rather regularly, but per-message energy (E), while keeping worst case service where each packet may need a different amount of processing latency below an user-provided value D , focusing on the MAX in the consumer. case where the system is under high workload most of the The first curve (s = 5‰) closely matches the experimental time (i.e. P has almost always requests to serve). curve in Fig. 8 (once the spurious notifications are dis- The rationale behind this objective function is that we tar- counted) and is used to validate the simulator. We can now get packet processing systems requiring high throughput but precisely explain why the experimental curve of Fig. 8 does that do not want to resort to busy waiting, which may waste not feature the discontinuities of the theoretical curves considerable amount of energy when the load is low. obtained with constant parameters. In fact, when the system Examples of such systems come from the use-cases of NFV: is working near a discontinuity, the variability of the para- network middle-boxes like firewalls, Intrusion Detection meters randomly mixes the theoretical regimes expected Systems (IDSs), load balancers, routers, etc., which are com- before and after the critical point; as a result, the average T monly deployed by network service providers, Data Center may lie slightly above or slightly below the predicted value. environments and private business network infrastructures. A Something more interesting can be seen in the other curves solution which guarantees limited delay is still a good candi- produced by the simulator. While we move to higher values date for these systems, also considering that the overall of σ,at first the T curve simply becomes more smooth (e.g. latency experienced by the end users once the producer/con- see the curve for s = 5‰); for very high values of σ, how- sumer system is deployed in a real network is often in the ever, the entire T curve lies below the theoretical one, i.e. the order of hundreds of microseconds (or more) and not under throughput is consistently better than predicted. This can be control, because introduced by other network middle-boxes. easily understood for high values of W (W>m 2.4 s in On the other hand, when minimizing latency is the strongest P P Fig. 13). Recall that in a fast consumer scenario, any slow- requirement—which for instance is the case with high- down of the consumer is actually beneficial for throughput, frequency trading systems—the only acceptable solution is since it keeps the consumer running, relieving the producer busy waiting in any case. from the task of sending notifications, while a faster con- Taking into account the objective function as defined sumer may put more strain on the notification system. above and all the analysis carried out so far, we now illustrate However, if the system is already sending one notification for the high-level strategy that should drive the design and each packet, any W smaller than expected can do no add- deployment of high-performance producer–consumer systems itional harm; on the contrary, any W larger than expected under high workloads. may increase the producer batches and improve the through- put (as long as the queue is not overflowed). Therefore, for large values of W , the throughput must improve when larger 7.1. Regime identification variations of W become statistically more common. Similar, even if more complex, consideration can be made for the As a first step, it is necessary to understand whether the system smaller values of W . The main point is that the batch of pack- tends to behave as a fast producer or as a fast consumer. In real ets that the producer is able to put in the queue while the con- deployments, W and W arenot constant,so wecould at most P C sumer is waking up after a notification (i.e. during time S )are measure and average value for these parameters. However, able to absorb the lower values of W , while the higher values measuring W and W directly often requires some code instru- C P C of W continue to be beneficial. mentation, which should be avoided if possible. A better From these experiments, we can see that the theoretical approach would be to deduce the operational regime by meas- model actually captures a scenario that is typical more uring the rate of notifications in both directions. Fast-consumer demanding than usual and may be seen as ‘worst case’ in systems have a relative high number of P-to-C notifications, practice (even if it is not a worst case mathematically). and a low number of C-to-P notifications. The contrary is true for fast producer systems. The rate of notifications is, therefore, a simple way to roughly distinguish the two cases. Measuring these rates is usually easy in the scenarios we are focusing on, 7. DESIGN STRATEGIES that is with I/O devices emulated by an hypervisor, where P The discussion and comparisons reported in Sections 3.1–3.3 runs in the guest and C runs in the host (or the other way illustrate how the three mechanisms (busy waiting, sleeping, around). Notifications from C to P turn into interrupts in the notifications) have different properties in terms of throughput, guest, so that the average interrupt rate for a given workload energy efficiency and latency, a situation which naturally can be easily measured from within the guest using the tools SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 ASTUDY OF I/OPERFORMANCE OF VIRTUAL MACHINES 827 provided by the guest O.S. Also the hypervisor usually pro- vides statistics useful to measure the rate of notifications from P to C, since these kinds of notifications cause a VM exit event. Measuring notifications would also be easy in the case where 3 MAX theconsumer isanhardware device(e.g. aNIC), sinceinthat case interrupt O.S. statistics and device driver statistics would be available. Finally, the maximum between W and W can be deter- P C mined by measuring the system throughput when both P and C Y long sleeps 1 E use sleeping (so that notifications costs are not involved), with asufficiently short Y and Y (or with a sufficiently large L)to C P avoid the sLS regime. In practice, the designer can choose 10 20 30 40 50 60 70 YY== 200ms and measure the throughput while gradually Y = Y [µs] P C CP reducing the sleeping value (and may be gradually increasing L); once the throughput stops increasing with the sleeping FIGURE 14. Average per-item time and energy for the sleeping time, it means that the system is working in the sFP or sFC mechanism with Y = Y and variable Y . Dashed vertical bars PC C regime, and the maximum between W and W is the inverse of delimit the region of valid Y , while the solid one represents the P C C the measured throughput (expressed in items per second). user-specified latency constraint. sleeping time provided by the O.S.; it is therefore a good idea to stay away from the limit by a small value (e.g. 500 ns). 7.2. Fast-consumer design Also in this case the choice minimizes energy, maximizes throughput and limit latency as required by the user. If the system tends to behave as a fast consumer, increasing k is not an option (since P usually does not know when the next item will be produced), so a general strategy is to use sleeping on the consumer in order to avoid the notification storms that 7.3. Fast producer design are typical of this regime—anotification per item in the worst If the system tends to behave as a fast producer, our sug- case, which is also a common case. In fact, P-to-C notifications gested strategy is to use notifications, selecting a value for the are not used at all when C uses sleeping. To keep latency k parameter which is a large fraction (e.g. ) of the queue under control, we choose Y (and Y ) so that the worst case P C latency does not exceed the user-provided D ,which could length L. With this choice, the C-to-P notifications are suffi- MAX be in the 10–100ms range. Using inequality (11), we can ciently amortized over a large batch of packets, so that the derive a suitable value forYY = , once W=( max WW , ) CP CP throughput has little or no practical dependency on the has been estimated as described in Section 7.1.This means W - W difference, as explained in the following. As CP MAX selecting a sleeping length not larger thanYW =- . MAX described in Section 2.3.2, the number of packets processed Note that this strategy is only applicable when the resulting Sk +( - 1)W êPC Pú by P for each notification is b = + k , that is b ê ú YY > , that is when the O.S. supports sleeping times WW - MAX E CP ë û smaller than Y . If this is not true, it means that the latency MAX is the sum of two components. When k = L (i.e. k is in requirements are too stringent to use sleeping (or even unfeas- the 200–1000 range), b is already large because of the second ible), and resorting to busy waiting is unavoidable. component, irrespective of the value of the first component, The possible choices for the sleeping time are highlighted that could also be very large. The cost that C needs to pay for in Fig. 14, in the region where the latency constraint is met. notifications (N ), which is typically <1sm , is, therefore, If Y falls in the sFC region, we chooseYY==Y , MAX CP MAX amortized over at least 200–1000 packets, which result into to minimize energy and limit latency, while the throughput is <1–5 ns per packet. The effect of the first component of b on not affected by the choice. If Y falls beyond, in the sLS MAX the throughput is, therefore, expected to be very little in abso- region, we choose the largest Y which is still in the sFC lute numbers. As a result, the overall throughput is very close region. To make a robust choice we need to avoid the border to the optimal one , because C spends a very little time () effects that may result from the instability of the actual C to send notifications to P. For similar reasons, the per-item As an example, interrupts statistics on Linux are exported through the energy consumption is close to the optimum (W + W ), PC /proc/interrupts file. because N and S costs are amortized over a large b. As an example, the KVM kernel module on Linux exports detailed statis- As discussed in Section 3.3.1, with a large k (or a suffi- tics about the number of VM exits and injected guest interrupts that can be ciently small Y ), the latency of a fast producer system tends easily collected using the perf-stat tool (more info at http://www.linux- P kvm.org/page/Perf_events). to be dominated by the queuing delay LW , which is often in SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 time, energy [µs] Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 828 G. LETTIERI et al. the range 50–1000ms. The queuing delay does not depend case latency measured is relatively low (2240 ns) only includ- on the synchronization mechanism deployed, and so using ing W , N , S (600 ns on average) and W . The theoretical P P C C notifications or sleeping does not really make a difference in worst case would also include N (980 ns) and S , adding up practice. The only thing that can be done if the constraint on to~m 10 s. D is not met is to reduce L. The poor throughput of fast consumer is a common prob- MAX The discussion so far indicates that using the sleeping lem for VirtIO deployments, since it is common for the vhost mechanism in fast producer scenario does not really improve thread to quickly start and empty the avail ring. This example (nor worsen) the average throughput, energy or latency, at is, therefore, a good candidate to try using the sleeping strat- least assuming the system is under high workload. When the egy. We chooseYY== 5sm to make sure the worst case CP system is idle or has a very low workload, the sleeping mech- latency is approximately <10ms (cf. Section 2.2.3) and to anism easily becomes more energy inefficient, as both P and take into account the lower bound of 2.5ms related to the C repeatedly wake up and go to sleep again as there is almost sleeping costs (cf. Section 4.3). Our measurements show an never work to do, paying Y each time. In conclusion, the average throughput of~3.31 Mops, roughly corresponding to notification mechanism is a good candidate for fast producer 300 ns, which is the processing time of the slower party. As systems, since it provides near optimal throughput, energy predicted by our model (Section 2.2.1), the measured and latency, addressing both the high-workload and low- throughput is optimal. We measured an average of 50.5 items workload scenarios. processed by C for each sleep, whereas the model (using the nominal value of Y ) predicts 50. Actually, the average mea- sured value of Y is ~5007 ns, while the measured W - W C PC is actually 99 ns; plugging in these values in the batch for- 8. CASE STUDIES mula gives approximately 50.6, which is even a closer match. In order to validate the strategies presented in Section 7,we This batch corresponds to over 65 thousands sleeps per present some experimental examples of producer–consumer second, which may still be considered quite high with respect design, using the virtio-pc system presented in Section 4.2. to energy consumption. In any case, if relaxing the constraint on D is acceptable, it would be easy to increase the batch MAX (and thus reduce energy consumption) by increasing Y . The 8.1. Fast-consumer example energy measurement reports C using 76% of its CPU; since the system uses 1.76 CPUs to process 3.31 Mops, the average In the first example, we focus on a fast consumer case, with per-item energy consumption is ~531 ns, which is consider- W = 300 ns, W = 200 ns, and we also assume P C ably better than what could be obtained with the notification D =m 10 s. The values of W and W include ~100 ns of MAX P C strategy. Finally, the worst case latency measured is virtqueue processing plus 100–200 ns of useful work. These ~5500 ns, including Y and the processing costs, which is in numbers are realistic for network packet processing scenarios: line with our model. as an example, 100 ns may be needed by the consumer to In summary, this fast consumer example shows how the invoke a NIC driver to program packet transmission; the pro- sleeping strategy can be a better choice than notifications, as ducer may spend 200 ns to allocate (and deallocate) a packet it allows to optimize throughput and energy while keeping buffer in the guest O.S., look-up forwarding data structures the latency under control. and modify packets headers. Using the notification mechanism on both producer and consumer threads, we measured an average throughput of 8.2. Fast producer example ~1.81 Mops (millions operations per second), corresponding to 550 ns per item on average, which is almost twice slower In the second example, we will examine a fast producer scen- than the slowest party (P). As predicted by our model ario, with processing times similar to the ones used in the first (Section 2.3.1), this is due to the high cost of P notifications example, that is W = 200 ns and W = 300 ns. As reported P C (the measured N is ~1100 ns on average), amortized over in Section 4.2, the VirtIO uses an hardcoded k which is of relatively small batches (~5.3 items per batch), which means the virtqueue length; with L = 512, we have therefore that there are almost 350 thousands notifications per second. k = 384. In terms of energy, we found that C consumes 62% of its Using the notification mechanism, we measured an average CPU, while the CPU where P runs is busy all the time; in throughput of 3.32 Mops, corresponding to roughly 300 ns total, 1.62 CPUs running at 3.5 GHz are necessary to process per item on average, which matches the speed of the slowest 1.81 Mops, which means that on average 895 ns of CPU party (C). This is a good behavior and it is predicted by our cycles are spent for each item. Finally, as expected, the worst model, as each notification from C to P (interrupt) is amor- tized over a very large batch of items, so that P is not over- That is, 51.2ms when L = 256 and W = 200 ns, and 1 ms when L = 1024 and W =m 1s whelmed by the cost of notifications. More precisely, our SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 ASTUDY OF I/OPERFORMANCE OF VIRTUAL MACHINES 829 measurements report an average batch size of 1480 items, blocked because the FIFO leading to the next VM is full), whereas our model predicts batches of 1429 items (using and, therefore, the CPU utilization estimates would be off. It S =m 28 s, cf. Section 4.4). The measured latency is domi- is important to note, however, that these new, externally gen- nated by the queuing delay and it is ~152ms (512 items, erated, blocking situations never cause notifications not 300 ns each) as expected. Regarding energy, we measured already accounted by the model: even with chaining, notifica- that P consumes ~74% of its CPUs, which means that the tions only depend on the state of the FIFO between each per-item energy is 524 ns on average. Producer and Consumer pair. Using the sleeping mechanism with Y =m 20 s, we mana- We expect to observe counterintuitive effect also in chains of ged to remove even the few remaining interrupts (~2200 per VMs. For example, think of a chainPC ( /P)C (i.e. 11 2 2 second), and measured an average throughput of 3.33 Mops, Producer P in VM sending to a Consumer C in VM through 1 1 2 2 which is almost indistinguishable from the throughput mea- athread(/ CP) that acts both as Consumer and Producer) and sured with notifications. However, this choice of Y results assume that bothPC ( /P) and (/ CP) C show a P 11 2 12 2 into a batch of 200 elements, which is much smaller than the Fast-Consumer problem when run in isolation. Now, C may batch obtained with notifications; as a consequence, the num- slow down(/ CP) by forcing it to spend a lot of time sending ber sleep rate is relatively high (over 16 thousands sleeps per notifications, and, as a consequence, hide the Fast-Consumer second) which means an higher energy per item (87% of problem in thePC ( /P) path. Conversely, fixing the Fast- 11 2 CPU utilization for P, corresponding to 562 ns per item). In Consumer problem in the downstream path may expose it in the order to increase the batch (so lowering the energy consump- upstream one. It is clear that further study is necessary to address tion), we would need to increase Y to over 100ms. This is all the regimes that may be observed in such scenarios. feasible, but quite dangerous since it is not very far from the 152ms threshold for sLS regimes. Finally, since we have avoided sLS regimes, the latency behavior is again dominated 9.2. Batching by the queuing delay. In conclusion, this fast producer example shows how the Batching, i.e. sending several packets at once across an inter- notification mechanism—empowered with a large k —can be face, is widely used to improve throughput since it signifi- a better choice than sleeping, as the cost of each notification cantly amortizes fixed costs. Batching is a prominent feature is largely amortized over many items, so that the throughout in our model, as a single notification may be issued after any manages to follow the slower party and the energy consump- number of new packets have been inserted in the FIFO, or tion remains low. removed from it. Still, the model only accounts for the amortization of notifi- cation and sleep/wake-up costs (N , N , S , S and Y ). P C C E 9. LIMITATIONS Processing costs (W and W ) remain constant, independently C P of the number of packets that are processed in a single run. Even if our model matches precisely some important features Real systems may have many more fixed costs that are amor- of VM networking I/O, it does not of course encompass all tized when batches of packets are made available, thanks to possible scenarios. We discuss here some limitations and pos- caching effects, reduced context switching and other optimi- sible extensions that may significantly broaden the scope of zations. This may be modeled in at least two ways: by letting the model. W and W decrease depending on the number of packets C P already processed since last notification, wake-up or sleep; by assuming that each W and W box represents the processing C P 9.1. VM chaining of a batch of more than one packet. Virtualized networking I/O at high packet rates, which is the The latter approach is especially useful in modeling the main target of our study, is very important for NFV applica- behavior of APIs like netmap [3], where producer batching is tions. Our study covers the expected I/O performance of the controlled by the application and may be approximately taken input and output I/O paths of a single VM. However, com- as a constant, call it B, especially in the high packet rates plete NFV applications typically consist of chains of VMs scenarios, we are interested in. A FIFO of L packets between [16, 17]. Our Consumer can, therefore, be the Producer for the netmap producer and the consumer must now be modeled another VM down the chain. As a first approximation, the as a FIFO of L/B batches, and a large B may easily bring the throughput of each path can still be studied in isolation, using system in a ‘short-queue’ regime (one of nSPS, nSPS or nSS, our model, if the cumulative effect of the upstream and down- depending on the wake-up times), where the consumer and the stream VMs are modeled as random variations in the W and producer alternatively block without doing any work in paral- W parameters (using, e.g. the simulator of Section 6). The lel. In these situations, reducing the application batching can chaining, however, also introduces new possibilities for increase the throughput, by moving the system into a more blocking not considered by our model (e.g. a Consumer is favorable regime—yet another counterintuitive effect [5]. SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 830 G. LETTIERI et al. An aspect of batching that is neglected by our model is k = 1, k =/34 of queue occupation. We have shown in that large batches may lead to other reductions in throughput, Section 8 that this form of adaptivity is only effective with due to large packet drops in the internal queues of the high load and slow consumers. Recent versions of vhost (an Producer and/or Consumer when they are implemented by optimized in-kernel VirtIO hypervisor-side implementation), complex multi-layered software (like, e.g. the OS network included in the Linux kernel, support an optional short busy stack). These problems, however, should generally be wait to limit the amount of notifications showing up with fast addressed in the multi-layered software itself, by properly siz- consumers. This further confirms how the problem of produ- ing the queues and making sure that livelock problems are cer–consumer speed mismatch that we address in our work is avoided [18]. central to high-performance I/O virtualization. There is an extensive literature on the performance study and modeling for VMs [23], focusing on the general overhead of virtualization on CPU-intensive computations [24], but also on 10. RELATED WORK the performance of disk I/O[25], end-to-end networking [26] Pure polling (also known as ‘busy wait’ or ‘spinning’)isprob- and live migration [27]. To the best of our knowledge, how- ably the oldest form of synchronization, and the most expensive ever, little attention has been devoted to the modeling of the in terms of system resource usage. Its use is mostly justified by notification/synchronization I/O costs. The works most similar its simplicity and not reliance on any hardware support. Pure to our own remain the studies on hybrid interrupt/polling polling is used by a number of high-speed networking applica- schemes [12, 21, 28], where several options among interrupt tions and libraries such as the Click Modular Router [8], Intel’s and polling are modeled and compared. These studies apply to DPDK [6] and Luca Deri’s PFRING/DNA [7]. non-virtualized networking, and, as a consequence, they show Aside from high-energy consumption, polling may also several differences with our own. In particular, delays in notifi- abuse of shared resources, such memory or I/O buses. This cations are not accounted for, while we have found that they worsens the situation from a simple annoyance (high-energy have several counterintuitive effects in our model. Moreover, consumption) to a threat to other parts of the system, and those studies focus on the receive path only, while our model is requires some form of mitigation. more general and also encompasses transmission. In particular, In the FreeBSD polling architecture [18], polling occurs the fast consumer problem is usually encountered in the trans- periodically on timer interrupts and opportunistically on other mission path from a relatively slow producer running in the events. An adaptive limit on the maximum amount of work to VM with a fast backend consumer [29, 30]. be performed in each iteration is used to schedule the CPU between user processes and kernel activities. Adaptive polling schemes are also widely used in radio protocols, sensor net- 11. CONCLUSIONS works, multicast protocols. We have presented and analyzed a model for the operation of A seminal work on interrupt moderation [19] points out a producer and consumer in a typical VM environment, how mixed strategies (notifications to start processing, fol- focusing on three synchronization mechanisms: notifications, lowed by polling to process data as long as possible) can sleeping and busy waiting; described how throughput, effi- reduce system’s overhead. The Linux NAPI architecture ciency and latency are affected by operating parameters for [11, 20, 21] is based on the above ideas. When an interrupt the three mechanisms; and validated the model against a set comes, NAPI activates a kernel thread to process packets using of simulation experiments and a realistic VirtIO-based proto- polling, and disables further interrupts until done with pending type running on a hypervisor. packets. A bound on the maximum amount of work to be per- We have then discussed some strategies that can lead the formed by the polling thread in each round helps reducing design or optimization of a producer–consumer system under latency and fairness on systems with multiple interfaces. NAPI assumptions that are common for NFV scenarios, helping to does not use any special strategy to adapt the speed of produ- decide what synchronization mechanism to use and how to use cer and consumer, and as such, it is subject to the performance it. The main idea, exposed in Section 7,is to first identify the instabilities discussed in this paper, and, in particular, to the P- notification regime and then apply a different strategy according to-C notification storms typical of a fast consumer scenario (in to it. Finally, we have validated our strategies against our VirtIO this case, the NAPI thread is the consumer for network packets prototype to show the benefits of our analysis in practice. coming from a physical NIC or from a possibly paravirtualized NIC emulated by the hypervisor). The VirtIO framework [10, 22]isthe de facto standard ACKNOWLEDGEMENTS deployed to provide high-performance I/O in virtualized envir- onments, and uses a notification-based system which matches The authors would like to thank the anonymous reviewers for the one presented in Section 2, as explained in Section 4.2. their helpful and constructive comments that greatly contribu- The notification thresholds for VirtIO are typically chosen as ted to improving the final version of the paper. SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018 Downloaded from https://academic.oup.com/comjnl/article/61/6/808/4259797 by DeepDyve user on 14 July 2022 ASTUDY OF I/OPERFORMANCE OF VIRTUAL MACHINES 831 [16] Herrera, J.G. and Botero, J.F. (2016) Resource allocation in NFV: FUNDING a comprehensive survey. IEEE Trans. Netw. Serv. Manage., 13, This work was supported by the European Union’s Horizon 518–532. 2020 research and innovation programme 2014–18 (Grant no. [17] Luizelli, M.C., Bays, L.R., Buriol, L.S. et al. (2015) Piecing 644866). This paper reflects only the authors’ views and the together the NFV Provisioning Puzzle: Efficient Placement and European Commission is not responsible for any use that Chaining of Virtual Network Functions. In Proc. IM’2015, may be made of the information it contains. Ottawa, Canada, May 11–15, pp. 98–106. IEEE. [18] Rizzo, L. (2001). Polling versus interrupts in network device drivers. http://info.iet.unipi.it/~luigi/polling/. accessed July 12, 2017. REFERENCES [19] Mogul, J.C. and Ramakrishnan, K. (1997) Eliminating receive livelock in an interrupt-driven kernel. ACM Trans. Comput. [1] Chiosi, M.ClarkeD.WillisP.et al2012, Network function vir- Syst., 15, 217–252. tualisation introductory white paper. http://portal.etsi.org/ [20] Salim, J.H., Olsson, R. and Kuznetsov, A. (2001) Beyond NFV/NFV_white_paper.pdf (accessed July 12, 2017). Softnet. In Proc. 5th Annual Linux Showcase & Conference, [2] Abdelrazik, A., Bunce, G., Cacciatore, K. et al. (2015). Adding pp. 165–172. speed and agility to virtualized infrastructure with OpenStack. [21] Salah, K. and Qahatan, A. (2009) Implementation and experi- https://www.openstack.org/assets/pdf-downloads/virtualization- mental performance evaluation of a hybrid interrupt-handling Integration-whitepaper-2015.pdf. accessed July 12, 2017. scheme. Comput. Commun., 32, 178–188. [3] Rizzo, L. (2012) netmap: A Novel Framework for Fast Packet [22] Motika, G. and Weiss, S. (2012) Virtio network paravirtualiza- I/O. In Proc. USENIX ATC’12, Boston, MA, June 13–15, pp. tion driver: implementation and performance of a de-facto 101–112. USENIX Association, Berkeley, CA. standard. Comput. Stand. Interfaces, 34,36–47. [4] Rizzo, L., Lettieri, G. and Maffione, V. (2013) Speeding up [23] Xu, F., Liu, F., Jin, H. and Vasilakos, A.V. (2013) Managing Packet I/O in Virtual Machines. In Proc. ANCS’13, San Jose, performance overhead of virtual machines in cloud computing: CA, 21–21 October, pp. 47–58. IEEE Press, Piscataway, NJ. a survey, state of the art, and future directions. Proc. IEEE, [5] Garzarella, S., Lettieri, G. and Rizzo, L. (2015) Virtual Device 102,11–31. Passthrough for High Speed VM Networking. In Proc. [24] Huber, N., von Quast, M., Hauck, M. and Kounev, S. (2011) ANCS’15, Oakland, CA, 7–8 May, pp. 99–110. IEEE Evaluating and Modeling Virtualization Performance Overhead Computer Society, Washington, DC. for Cloud Environments. In Proc. CLOSER’11, [6] DPDK web page. dpdk.org. Accessed: 2017-07-12. Noordwijkerhout, The Netherlands, May 7–9, pp. 563–573. [7] Deri, L. PF_RING DNA web page. http://www.ntop.org/ Scitepress, Setúbal, Portugal. products/pf_ring/dna/. accessed July 12, 2017. [25] Noorshams, Q., Rostami, K., Kounev, S. et al. (2013) I/O [8] Kohler, E., Morris, R., Chen, B., Jannotti, J. and Kaashoek, M. Performance Modeling of Virtualized Storage Systems. In (2000) The Click modular router. ACM Trans. Comput. Syst. Proc. MASCOTS’13, San Francisco, CA, August 14–16. (TOCS), 18, 263–297. IEEE. [9] Zhou, D., Fan, B., Lim, H., Kaminsky, M. and Andersen, D.G. [26] Wang, G. and Ng, T. (2010) The Impact of Virtualization on (2013) Scalable, High Performance Ethernet Forwarding with Network Performance of Amazon EC2 Data Center. In Proc. Cuckooswitch. In Proc. CoNEXT’13, Santa Barbara, CA, INFOCOM’10, San Diego, CA, March 14–19, pp. 1163–1171. December 9–12, pp. 97–108. ACM, New York, NY. IEEE Press, Piscataway, NJ. [10] Russell, R. (2008) virtio: towards a de-facto standard for virtual [27] Wu, Y. and Zhao, M. (2011) Performance Modeling of Virtual I/O devices. ACM SIGOPS Oper. Syst. Rev., 42,95–103. Machine Live Migration. In Proc. CLOUD’11, Washington [11] NAPI (‘New API’). http://www.linuxfoundation.org/networking/ DC, 4–9 July, pp. 492–499. IEEE. napi. accessed July 12, 2017. [28] Dovrolis, C., Thayer, B. and Ramanathan, P. (2001) HIP: [12] Salah, K., El-Badawi, K. and Haidari, F. (2007) Performance hybrid interrupt-polling for the network interface. ACM analysis and comparison of interrupt-handling schemes in giga- SIGOPS Oper. Syst. Rev., 35,50–60. bit networks. Comput. Commun., 30, 3425–3441. [29] Honda, M., Huici, F., Lettieri, G. and Rizzo, L. (2015) [13] Bonzini, P. KVM halt-poll optimization. https://lkml.org/ mSwitch: A Highly-scalable, Modular Software Switch. In lkml/2015/2/6/319. accessed July 12, 2017. Proc. SOSR’15, Santa Clara, CA, 17–18 June, pp. 1–13. ACM [14] Rostedt, S. (2008). ftrace - function tracer. https://www. New York, NY. kernel.org/doc/Documentation/trace/ftrace.txt. accessed July [30] Hwang, J., Ramakrishnan, K.K. and Wood, T. (2014) NetVM: 12, 2017. High Performance and Flexible Networking using [15] Schöne, R., Molka, D. and Werner, M. (2015) Wake-up laten- Virtualization on Commodity Platforms. In Proc. NSDI’14, cies for processor idle states on current x86 processors. Seattle, WA, April 2–4, pp. 445–458. USENIX Association, Comput. Sci. – Res. Dev., 30, 219–227. Berkeley, CA. SECTION B: COMPUTER AND COMMUNICATIONS NETWORKS AND SYSTEMS THE COMPUTER JOURNAL,VOL.61NO. 6, 2018

Journal

The Computer JournalOxford University Press

Published: Jun 1, 2018

There are no references for this article.