Video Stream Quality Impacts Viewer Behavior: Inferring Causality Using Quasi-Experimental Designs


S. Shunmuga Krishnan (Akamai Technologies) and Ramesh K. Sitaraman (University of Massachusetts, Amherst & Akamai Technologies)

ABSTRACT

The distribution of videos over the Internet is drastically transforming how media is consumed and monetized. Content providers, such as media outlets and video subscription services, would like to ensure that their videos do not fail, start up quickly, and play without interruptions. In return for their investment in video stream quality, content providers expect less viewer abandonment, more viewer engagement, and a greater fraction of repeat viewers, resulting in greater revenues. The key question for a content provider or a CDN is whether and to what extent changes in video quality can cause changes in viewer behavior. Our work is the first to establish a causal relationship between video quality and viewer behavior, taking a step beyond purely correlational studies. To establish causality, we use Quasi-Experimental Designs, a novel technique adapted from the medical and social sciences.

We study the impact of video stream quality on viewer behavior in a scientific, data-driven manner by using extensive traces from Akamai's streaming network that include 23 million views from 6.7 million unique viewers. We show that viewers start to abandon a video if it takes more than 2 seconds to start up, with each incremental delay of 1 second resulting in a 5.8% increase in the abandonment rate. Further, we show that a moderate amount of interruptions can decrease the average play time of a viewer by a significant amount. A viewer who experiences a rebuffer delay equal to 1% of the video duration plays 5% less of the video in comparison to a similar viewer who experienced no rebuffering. Finally, we show that a viewer who experienced a failure is 2.32% less likely to revisit the same site within a week than a similar viewer who did not experience a failure.

Categories and Subject Descriptors

C.4 [Performance of Systems]: Measurement techniques, Performance attributes; C.2.4 [Computer-Communication Networks]: Distributed Systems—Client/server

Keywords

Video quality, Internet Content Delivery, User Behavior, Causal Inference, Quasi-Experimental Design, Streaming Video, Multimedia

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
IMC'12, November 14-16, 2012, Boston, Massachusetts, USA.
Copyright 2012 ACM 978-1-4503-1705-4/12/11 ...$15.00.

1. INTRODUCTION

The Internet is radically transforming all aspects of human society by enabling a wide range of applications for business, commerce, entertainment, news and social networking. Perhaps no industry has been transformed more radically than the media and entertainment segment of the economy. As media such as television and movies migrate to the Internet, there are twin challenges faced by content providers, whose ranks include major media companies (e.g., NBC, CBS), news outlets (e.g., CNN), sports organizations (e.g., NFL, MLB), and video subscription services (e.g., Netflix, Hulu).

The first major challenge for content providers is providing a high-quality streaming experience for their viewers, where videos are available without failure, start up quickly, and stream without interruptions [24]. A major technological innovation of the past decade that allows content providers to deliver higher-quality video streams to a global audience of viewers is the content delivery network (or, CDN for short) [8, 17]. CDNs are large distributed systems that consist of hundreds of thousands of servers placed in thousands of ISPs close to end users. CDNs employ several techniques for transporting [2, 12] media content from the content provider's origin to servers at the "edges" of the Internet, where it is cached and served with higher quality to the end user. (See [17] for a more detailed description of a typical CDN architecture.)

The second major challenge for a content provider is to actually monetize their video content through ad-based or subscription-based models. Content providers track key metrics of viewer behavior that lead to better monetization. Primary among them are metrics relating to viewer abandonment, engagement, and repeat viewership. Content providers know that reducing the abandonment rate, increasing the play time of each video watched, and enhancing the rate at which viewers return to their site all increase opportunities for advertising and upselling, leading to greater revenues. The key question is whether and by how much increased stream quality can cause changes in viewer behavior that are conducive to improved monetization. Relatively little is known from a scientific standpoint about the all-important causal link between video stream quality and viewer behavior for online media. Exploring the causal impact of quality on behavior

and developing tools for such an exploration are the primary foci of our work.

While understanding the link between stream quality and viewer behavior is of paramount importance to the content provider, it also has profound implications for how a CDN must be architected. An architect is often faced with trade-offs on which quality metrics need to be optimized by the CDN. A scientific study of which quality metrics have the most impact on viewer behavior can guide these choices. As an example of viewer behavior impacting CDN architecture, we performed small-scale controlled experiments on viewer behavior a decade ago that established the relative importance of the video starting up quickly and playing without interruptions. These behavioral studies motivated an architectural feature called prebursting [12], deployed on Akamai's live streaming network, that enabled the CDN to deliver streams to a media player at higher than the encoded rate for short periods of time to fill the media player's buffer more quickly, resulting in the stream starting up faster and playing with fewer interruptions. It is notable that the folklore on the importance of startup time and rebuffering was confirmed in two recent important large-scale scientific studies [9, 15]. Our current work sheds further light on the important nexus between stream quality and viewer behavior and, importantly, provides the first evidence of a causal impact of quality on behavior.

1.1 Measuring quality and viewer behavior

The advent of customizable media players supporting major formats such as Adobe Flash, Microsoft Silverlight, and Apple HTTP streaming has revolutionized our ability to perform truly large-scale studies of stream quality and viewer behavior as we do in this work, in a way not possible even a few years ago. It has become possible to instrument media players with an analytics plug-in that accurately measures and reports both quality and behavioral metrics from every viewer on a truly planetary scale.

1.2 From correlation to causality

The ability to measure stream quality and viewer behavior on a global scale allows us to correlate the two in a statistically significant way. For each video watched by a viewer, we are able to measure its quality, including whether the stream was available, how long the stream took to start up, and how much rebuffering occurred causing interruptions. We are also able to measure the viewer's behavior, including whether he/she abandoned the video and how long he/she watched the video.

As a first step, we begin by simply correlating important stream quality metrics experienced by the viewers to the behavior that they exhibit. For instance, we discover a strong correlation between an increase in the delay for the video to start up and an increase in the rate at which viewers abandon the video. Several of our results are the first quantitative demonstration that certain key streaming quality metrics are correlated with key behavioral metrics of the viewer.

However, the deeper question is not just whether quality and behavior are correlated but whether quality can causally impact viewer behavior. While correlation is an important first step, correlation does not necessarily imply causality. The holy grail of a content provider or a CDN architect is to discover causal relationships rather than just correlational ones, since they would like to know with some certainty that the significant effort expended in improving stream quality will in fact result in favorable viewer behavior.

In fact, a purely correlational relationship could even lead one astray if there is no convincing evidence of causality, leading to poor business decisions. For instance, both video quality (say, video bitrates) and viewer behavior (say, play time) have been steadily improving over the past decade and are hence correlated in a statistical sense. But that fact alone is not sufficient to conclude that higher bitrates cause viewers to watch longer, unless one can account for other potential "confounding" factors such as the available video content itself becoming more captivating over time.

While inferring causality is generally difficult, a key tool widely used in the social and medical sciences to infer causality from observational data is a Quasi-Experimental Design (QED) [23]. Intuitively, a QED is constructed to infer whether a particular "treatment" (i.e., cause) results in a particular "outcome" (i.e., effect) by pairing each person in the observational data who has had the treatment with a random untreated person who is "significantly identical" to the treated person in all other respects. The pairing eliminates the effect of the hidden confounding variables by ensuring that both members of a pair have sufficiently identical values for those variables. Evaluating the differential outcomes between treated and untreated pairs can then either strengthen or weaken a conclusion that the treatment causally impacts the outcome. While it is impossible to completely eliminate all hidden factors, our causal analysis using QEDs should be viewed as strengthening our correlational observations between treatments and outcomes by eliminating the common threats to a causal conclusion.

1.3 Our Contributions

Our study is one of the largest of its kind of video stream quality and viewer behavior: we collect and analyze a data set consisting of more than 23 million video playbacks from 6.7 million unique viewers who watched an aggregate of 216 million minutes of 102 thousand videos over 10 days.

To our knowledge, our work is the first to provide evidence that video stream quality causally impacts viewer behavior, a conclusion that is important to both content providers and CDNs. Further, our adaptation of Quasi-Experimental Designs (QEDs) is a unique contribution and is of independent interest. QEDs have been used extensively in medical research and the social sciences in the past decades. We expect that our adaptation of QEDs for measurement research in networked systems could be key in a variety of other domains that have so far been limited to correlational studies.

Our work is also the first to quantitatively explore viewer abandonment rates and repeat viewership in relation to stream quality, last-mile connectivity, and video duration. In addition, we also study viewer engagement (e.g., play time) in relation to stream quality (e.g., rebuffering), which has also been recently studied in [9] in a correlational setting, but we take a step beyond correlational analysis to establish a causal relationship between quality and engagement using QEDs. Our work makes the following specific contributions on the impact of stream quality on viewer behavior.

- We show that an increase in the startup delay beyond 2 seconds causes viewers to abandon the video. Using regression, we show that an additional increase of the startup delay by 1 second increases the abandonment rate by 5.8%.

- Viewers are less tolerant of startup delay for a short video such as a news clip than for a long video such as an hour-long TV episode. In a quasi experiment, the likelihood of a viewer of a short video abandoning earlier than a similar viewer of a long video exceeded the likelihood that the opposite happens by 11.5%.

- Viewers watching video on a better-connected computer or device have less patience for startup delay and abandon sooner. In particular, viewers on mobile devices have the most patience and abandon the least, while those on fiber-based broadband abandon the soonest. In a quasi experiment, the likelihood that a viewer on fiber abandoned earlier than a similar viewer on a mobile device exceeded the likelihood that the opposite happens by 38.25%.

- Viewers who experienced an increase in the normalized rebuffer delay, i.e., who experienced more interruptions in the video, played the video for less time. In a quasi experiment, a viewer who experienced a rebuffer delay that equals or exceeds 1% of the video duration played 5.02% less of the video in comparison to a similar viewer who experienced no rebuffering.

- A viewer who experienced a failed visit is less likely to return to the content provider's site to view more videos within a specified time period than a similar viewer who did not experience the failure. In a quasi experiment, the likelihood that a viewer who experienced failure returns to the content provider's site within a week is less than the likelihood for a similar viewer who did not experience failures by 2.32%.

We show that the above results are statistically significant using the sign test. Further, these results show a significant level of causal impact of stream quality on viewer behavior. In this regard, it is important to recall that small changes in viewer behavior can lead to large changes in monetization, since the impact of a few percentage points over tens of millions of viewers can accrue to a large impact over a period of time.

Finally, our work on deriving a causal relationship by systematically accounting for the confounding variables must not be viewed as a definitive proof of causality, as indeed there can be no definitive proof of causality. Rather, our work significantly increases the confidence in a causal conclusion by eliminating the effect of major confounding factors that could threaten such a conclusion.

2. BACKGROUND

We describe the process of a user watching a stream, defining terms along the way that we will use in this paper.

Viewer. A viewer is a user who watches one or more streams using a specific media player installed on the user's device. A viewer is uniquely identified and distinguished from other viewers by using a GUID (Globally Unique Identifier) value that is set as a cookie when the media player is accessed. To identify the viewer uniquely, the GUID value is generated to be distinct from other prior values in use.

Views. A view represents an attempt by a viewer to watch a specific video stream. A typical view would start with the viewer initiating the video playback, for instance, by clicking the play button of the media player(1) (see Figure 1). During a view, the media player begins in the startup state, where it connects to the server and downloads a certain specified amount of data before transitioning to the play state. In the play state, the player uses the data from its buffer and renders the video on the viewer's screen. Meanwhile, the player continues to download data from the server and stores it in the buffer. Poor network conditions between the server and the player could lead to a situation where the buffer is drained faster than it is being filled. This could lead to a condition where the buffer is empty, causing the player to enter the rebuffer state, where the viewer experiences an interruption or "freeze" in the video playback. While in the rebuffer state, the player continues to fill its buffer from the server. When the buffer has a specified amount of data, the player enters the play state and the video starts to play again (see Figure 1).

A view can end in three ways: a successful view ends normally when the video completes; a failed view ends with a failure or error due to a problem with the server, network, or content; and, finally, an abandoned view ends with the viewer voluntarily abandoning the stream either before the video starts up or after watching some portion of it. Note that a viewer may abandon the view by closing the browser, stopping the stream, or clicking on a different stream. There are other secondary player-initiated or viewer-initiated events that are part of the viewing process. For instance, a viewer could initiate actions such as pausing, fast-forwarding, or rewinding the video stream. Further, the player may switch the bitrate of the encoded media in response to network conditions, such as reducing the bitrate if there is packet loss. We do not explicitly analyze behaviors associated with these secondary events in this paper, though these could be part of future work.

Figure 1: Views and Visits. [A view passes through the states startup, play, and rebuffer, driven by the events: viewer clicks "play"; buffer filled, play starts; buffer empty, play freezes; buffer filled, play resumes; video ends, play stops. Views separated by more than T minutes of inactivity belong to different visits.]

Visits. A visit is intended to capture a single session of a viewer visiting a content provider's site to view videos. A visit is a maximal set of contiguous views from a viewer at a specific content provider site such that each visit is separated from the next visit by at least T minutes of inactivity, where we choose T = 30 minutes(2) (see Figure 1).

Stream Quality Metrics. At the level of a view, the key metrics that measure the quality perceived by the viewer are shown in Figure 2.

(1) For some content providers, a "pre-roll" advertisement is shown before the actual content video is requested by the media player. In that case, our view starts at the point where the actual video is requested.

(2) Our definition is similar to the standard notion of a visit (also called a session) in web analytics, where each visit is a set of page views separated by a period of idleness of at least 30 minutes (say) from the next visit.
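The grouping of views into visits described above can be sketched as a simple sessionization pass over a viewer's view start times (a minimal illustration of the T = 30 minute rule; the function and variable names are ours, not part of any Akamai tool):

```python
from datetime import datetime, timedelta

# Inactivity gap that separates two visits, per the definition above.
T = timedelta(minutes=30)

def group_into_visits(view_start_times):
    """Group a viewer's view start times into visits: maximal runs of
    contiguous views, where a gap of at least T starts a new visit."""
    visits = []
    for t in sorted(view_start_times):
        if visits and t - visits[-1][-1] < T:
            visits[-1].append(t)   # gap below threshold: same visit
        else:
            visits.append([t])     # gap of at least T: new visit
    return visits

views = [datetime(2012, 11, 14, h, m) for h, m in
         [(9, 0), (9, 10), (9, 25), (10, 30), (10, 40)]]
print(len(group_into_visits(views)))  # 2: the 9:25 -> 10:30 gap exceeds 30 min
```

Note that the rule is applied per viewer and per content provider site; the sketch above handles a single viewer's trace.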

Figure 2: View-level stream quality metrics.

  Failures: number (or percentage) of views that fail due to problems with the network, server, or content.
  Startup Delay: total time in the startup state.
  Average Bitrate: the average bitrate at which the video was watched.
  Normalized Rebuffer Delay: total time in the rebuffer state divided by the total duration of the video.

Failures address the question of whether the stream was available or if viewing of the video initiated by the viewer failed due to a problem with the network or the server or the content itself (such as a broken link). A failed view can be frustrating to the viewer as he/she is unable to watch the video. A second key metric is startup delay, which is the amount of time the viewer waits for the video to start up. Once the video starts playing, the average bitrate at which the video was rendered on the viewer's screen is a measure of the richness of the presented content. This metric is somewhat complex since it is a function of how the video was encoded, the network connectivity between the server and the client, and the heuristics for bitrate-switching employed by the player. Finally, a fourth type of metric quantifies the extent to which the viewer experienced rebuffering. Rebuffering is also frustrating for the viewer because the video stops playing and "freezes". We can quantify the rebuffering by computing the rebuffer delay, which is the total time spent in a rebuffer state, and normalizing it by dividing it by the duration of the video.

Many of the above view-level metrics can be easily extended to visit-level or viewer-level metrics. One key visit-level metric that we examine in this paper is a failed visit, which is a visit that ends with a failed view. A failed visit could have had successful views prior to the failed view(s). However, a failed visit is important because the viewer tries to play a video one or more times but is unable to do so and leaves the site right after the failure, presumably with a level of frustration.

In our paper, we use many of the key metrics in Figure 2 in our evaluation of the impact of quality on viewer behavior, though these are by no means the only metrics of stream quality. It should be noted that many of the above metrics were incorporated into measurement tools within Akamai and have been in use for more than a decade [24, 1]. The lack of client-side measurements in the early years led to measurements based on automated "agents" deployed around the Internet that simulated synthetic viewers [1, 19], which were then supplemented with server-side logs. In recent years, there is a broad consensus among content providers, CDNs, and analytics providers that these metrics or variations of these metrics matter.

Metrics for Viewer Behavior. Our metrics are focused on the key aspects of viewer behavior that are often tracked closely by content providers, which we place in three categories (see Figure 3). The first category is abandonment, where a viewer voluntarily decides to stop watching the video. Here we are primarily concerned with abandonment where the viewer abandons the video even before it starts playing. A viewer can also abandon a stream after watching a portion of the video, which results in a smaller play time, which we account for in the next category of metrics. The second category is viewer engagement, which can be measured by play time, which is simply the amount of video that the viewer watches. The final category speaks to the behavior of viewers over longer periods of time. A key metric is the return rate of viewers, measured as the probability that a viewer returns to the content provider's site over a period of time, say, returning within a day or returning within a week.

Figure 3: Key metrics for viewer behavior.

  Abandonment: Abandonment Rate, the percentage of views abandoned during startup.
  Engagement: Play time, the total time in the play state (per view).
  Repeat Viewers: Return Rate, the probability of return to the site within a time period.

3. DATA SETS

The data sets that we use for our analysis are collected from a large cross section of actual users around the world who play videos using media players that incorporate the widely-deployed Akamai client-side media analytics plugin(3). When content providers build their media player, they can choose to incorporate the plugin, which provides an accurate means for measuring a variety of stream quality and viewer behavioral metrics. When the viewer uses the media player to play a video, the plugin is loaded at the client side and it "listens" to and records a variety of events that can then be used to stitch together an accurate picture of the playback. For instance, player transitions between the startup, rebuffering, seek, pause, and play states are recorded so that one may compute the relevant metrics. Properties of the playback, such as the current bitrate, bitrate switching, and the state of the player's data buffer, are also recorded. Further, viewer-initiated actions that lead to abandonment, such as closing the browser or browser tab, clicking on a different link, etc., can also be accurately captured. Once the metrics are captured by the plugin, the information is "beaconed" to an analytics backend that can process huge volumes of data. From every media player, at the beginning and end of every view, the relevant measurements are sent to the analytics backend. Further, incremental updates are sent at a configurable periodicity even as the video is playing.

3.1 Data Characteristics

While the Akamai platform serves a significant amount of the world's enterprise streaming content, accounting for several million concurrent views during the day, we choose a smaller but representative slice of the data from 12 content providers that include major enterprises in a variety of verticals including news, entertainment, and movies. We consider only on-demand videos in this study, leaving live videos for future work. We tracked the viewers and views for the chosen content providers for a period of 10 days (see Figure 4).

(3) While all our data is from media players that are instrumented with Akamai's client-side plugin, the actual delivery of the streams could have used any platform and not necessarily just Akamai's CDN.

Our data set is extensive and captures 23 million views from 6.7 million unique viewers, where each viewer on average made 3.42 visits over the period and viewed a total of 32.2 minutes of video. In each visit, there were on average 2.39 views but only 1.96 unique videos viewed, indicating that sometimes the viewer saw the same video twice.

Figure 4: Summary of views, minutes watched, distinct videos, and bytes downloaded for our data set.

            Total         Avg Per Visit   Avg Per Viewer
  Views     23 million    2.39            3.42
  Minutes   216 million   22.48           32.2
  Videos    102 thousand  1.96            2.59
  Bytes     1431 TB       148 MB          213 MB

The geography of the viewers was mostly concentrated in North America, Europe and Asia, with small contributions from other continents (see Figure 5). More than half the views used cable, though fiber, mobile, and DSL were significant. The fiber category consisted mostly of AT&T Uverse and Verizon FiOS, which contributed in roughly equal proportion. The other connection types such as dialup were negligible (see Figure 6).

Figure 5: The geography of viewers in our trace at the continent-level.

  North America   78.85%
  Asia            12.80%
  Europe           7.75%
  Other            0.60%

Figure 6: Connection type as percent of views.

Video duration is the total length (in minutes) of the video (see Figure 7). We divide the videos into short videos that have a duration of less than 30 minutes and long videos that have a duration of more than 30 minutes. Examples of short videos include news clips, highlight reels for sports, and short television episodes. The median short-video duration was 1.8 minutes, though the mean duration was longer at 5.95 minutes. In contrast, long videos consist of long television episodes and movies. The median duration for long videos was 43.2 minutes and the mean was 47.8 minutes.

Figure 7: A CDF of the total video duration. The median duration is 19.92 minutes over all videos, 1.8 minutes for short, and 43.2 minutes for long videos.

4. ANALYSIS TECHNIQUES

A key goal is to establish a causal link between a stream quality metric X and a viewer behavior metric Y. The first key step is to establish a correlational link between X and Y using the statistical tools for correlation and regression. Next, in accordance with the maxim that "correlation does not imply causation", we do a more careful analysis to establish causation. We adapt the innovative tool of Quasi-Experimental Design (QED), used extensively in the social and medical sciences, to problem domains such as ours.

4.1 Correlational Analysis

To study the impact of a stream quality metric X (say, startup delay) on a viewer behavioral metric Y (say, abandonment rate), we start by visually plotting metric X versus metric Y in the observed data. The visual representations are a good initial step for estimating whether or not a correlation exists. As a next step, we also quantify the correlation between X and Y. There are many different ways to calculate the correlation. Primary among them are Pearson's correlation and Kendall's correlation, which is a type of rank correlation. As observed in [9], Kendall's correlation is more suitable for a situation such as ours since it does not assume any particular distributional relationship between the two variables. Pearson's correlation is more appropriate when the correlated variables are approximately linearly related, unlike the relationships that we explore in our work. Kendall's correlation measures whether the two variables X and Y are statistically dependent (i.e., correlated) without assuming any specific functional form of their relationship. Kendall's correlation coefficient τ takes values in the interval [−1, 1], where τ = 1 means that X and Y are perfectly concordant, i.e., larger values of X are always associated with larger values of Y; τ = −1 means that X and Y are perfectly discordant, i.e., larger values of X are always associated with smaller values of Y; and τ near 0 implies that X and Y are independent.
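As a concrete illustration, Kendall's τ can be computed directly from its definition by comparing all pairs of observations. This is a from-scratch sketch of the tie-free (tau-a) variant with made-up data; library routines such as scipy.stats.kendalltau additionally correct for ties:

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's tau-a: (#concordant pairs - #discordant pairs) / #pairs.
    tau = 1 for perfectly concordant data, -1 for perfectly discordant,
    and near 0 when X and Y are statistically independent."""
    n = len(xs)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if s > 0:
            concordant += 1    # the pair is ordered the same way in X and Y
        elif s < 0:
            discordant += 1    # the pair is ordered oppositely
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical data: abandonment rate rising with startup delay (seconds).
print(kendall_tau([1, 2, 3, 4], [0.02, 0.08, 0.13, 0.19]))  # 1.0
```

Because only the relative ordering of the values enters the computation, τ is insensitive to the functional form of the relationship, which is why it suits the non-linear relationships we study.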

4.2 Causal Analysis

A correlational analysis of a stream quality metric X (say, startup delay) and a viewer behavior metric Y (say, abandonment rate) could show that X and Y are associated with each other. A primary threat to a causal conclusion that an independent variable X causes the dependent variable Y is the existence of confounding variables that can impact both X and Y (see Figure 8).

Figure 8: An example of a QED Model. [Confounding variables C influence both the independent variable X (treatment) and the dependent variable Y (outcome).] The confounding variables are kept the same while the treatment variable is varied to observe the impact on the outcome.

To take a recent example from the medical literature, a study published in Nature [20] made the causal conclusion that children who sleep with the light on are more likely to develop myopia later in life. But, as it turns out, myopic parents tend to leave the light on more often, as well as pass their genetic predisposition to myopia on to their children. Accounting for the confounding variable of the parents' myopia, the causal results were subsequently invalidated or substantially weakened.

More relevant to our own work, let us consider a potential threat to a causal conclusion that a stream quality metric X (say, startup delay) results in a viewer behavior Y (say, abandonment). As a hypothetical example, suppose that mobile users tend to have less patience for videos to start up as they tend to be busy and are "on the go", resulting in greater abandonment. Further assume that mobile users tend to have larger startup delays due to poor wireless connectivity. In this situation, a correlation between startup delay and abandonment may not imply causality unless we can account for the confounding variable of how the viewer is connected to the Internet. In our causal analyses, we systematically identify and account for all or a subset of the following three categories of confounding variables as relevant (see Figure 8).

Content. The video(4) being watched could itself influence both quality and viewer behavior. For instance, some videos are more captivating than others, leading viewers to watch more of them. Or, some videos may have higher perceived value than others, leading viewers to tolerate more startup delay. The manner in which the video is encoded and the player heuristics used by the media player could also impact stream quality. For instance, the player heuristics, which can differ from one content provider to another, specify how much of the video needs to be buffered before the stream can start up or resume play after rebuffering.

Connection Type. The manner in which a viewer connects to the Internet, both the device used and the typical connectivity characteristics, can influence both stream quality and viewer behavior. We use the connection type of the viewer as a confounding variable, where the connection type can take discrete values such as mobile, DSL, cable, and fiber (such as AT&T's Uverse and Verizon's FiOS).

Geography. The geography of a viewer captures several social, economic, religious, and cultural aspects that can influence viewer behavior. For instance, it has been observed by social scientists that the level of patience that consumers exhibit towards a delay in receiving a product varies based on the geography of the consumer [5]. Such a phenomenon might well be of significance in the extent to which the viewer's behavior is altered by stream quality. In our work, we analyze a viewer's geography at the granularity of a country.

4.2.1 The Quasi-Experimental Design (QED) Method

A primary technique for showing that an independent variable X (called the treatment variable) has a causal impact on a dependent variable Y (called the outcome variable) is to design a controlled experiment. To design a true experiment in our context, one would have to randomly assign viewers to differing levels of stream quality (i.e., values of X) and observe the resultant viewer behaviors (values of Y). The random assignment in such an experiment removes any systematic bias due to the confounding variables that are threats to a causal conclusion. However, the level of control needed to perform such an experiment at scale for our problem is either prohibitively hard, expensive, or even impossible. In fact, there are legal, ethical, and other issues with intentionally degrading the stream quality of a set of viewers to do a controlled experiment. However, there are other domains where controlled experiments can be and are performed, for example, A/B testing of web page layouts [11].

Given the inability to perform true experiments, we adapt a technique called QED to discover causal relationships from observational data that already exists. QEDs were developed by social and medical scientists, as a similar inability to perform controlled experiments is very common in those domains [23]. In particular, we use a specific type of QED called the matched design [18], where a treated individual (in our case, a view or viewer) is randomly matched with an untreated individual, where both individuals have identical values for the confounding variables. Consequently, any difference in the outcome for this pair can be attributed to the treatment. Our population typically consists of views or viewers, and the treatment variable is typically binary. For instance, in Section 7, viewers who experienced "bad" stream quality in the form of a failed visit are deemed to be treated, and viewers who had a normal experience are untreated. We form comparison sets by randomly matching each treated viewer with an untreated viewer such that both viewers are as identical as possible on the confounding variables.

(4) Note that our notion of video content is URL-based and thus also incorporates the content provider. If the same movie is available from two content providers, they would constitute two different pieces of content for our analysis.
Need- two different pieces of content for our analysis. less to say, the more identical the viewers are in each pair

Note that matching ensures that the distributions of the confounding variables in the treated and untreated sets of viewers are identical, much as if viewers were randomly assigned to treated and untreated sets in a controlled experiment. Now, by studying the behavioral outcomes of matched pairs, one can deduce whether or not the treatment variable X has a causal effect on variable Y, with the influence of the confounding variables neutralized. Note that the treatment variable need not always be stream quality. Depending on the causal conclusion, we could choose the treatment variable to be content length or connection type if we would like to study their impact on viewer behavior.

Statistical Significance of the QED Analysis.

As with any statistical analysis, it is important to evaluate whether the results are statistically significant or whether they could have occurred by random chance. As is customary in hypothesis testing [14], we state a null hypothesis Ho that contradicts the assertion that we want to establish. That is, Ho contradicts the assertion that X impacts Y and states that the treatment variable X has no impact on the outcome variable Y. We then compute the "p-value", defined to be the probability that the null hypothesis Ho is consistent with the observed results. A "low" p-value lets us reject the null hypothesis, bolstering our conclusions from the QED analysis as being statistically significant. However, a "high" p-value would not allow us to reject the null hypothesis. That is, the QED results could have happened through random chance with a "sufficiently" high probability that we cannot reject Ho. In this case, we conclude that the results from the QED analysis are not statistically significant.

The definition of what constitutes a "low" p-value for a result to be considered statistically significant is somewhat arbitrary. It is customary in the medical sciences to conclude that a treatment is effective if the p-value is at most 0.05. The choice of 0.05 as the significance level is largely cultural and can be traced back to the classical work of R. A. Fisher about 90 years ago. Many have recently argued that the significance level must be much smaller. We concur and choose the much more stringent 0.001 as our significance level, a level achievable in our field given the large number of experimental subjects (tens of thousands of treated-untreated pairs) but rarely achievable in medicine with human subjects (usually on the order of hundreds of treated-untreated pairs). However, our results are unambiguously significant and not very sensitive to the choice of significance level. All our results turned out to be highly significant with p-values of 4 × 10^−5 or smaller, except for one conclusion with a larger p-value that we deemed statistically insignificant.

The primary technique that we employ for evaluating statistical significance is the sign test, a non-parametric test that makes no distributional assumptions and is particularly well suited for evaluating matched pairs in a QED setting [26]. We sketch the intuition of the technique here, while deferring the specifics to the technical sections. For each matched pair (u, v), where u received treatment and v did not, we define the differential outcome, denoted outcome(u, v), as the numerical difference between the outcome of u and the outcome of v. If Ho holds, then the outcomes of the treated and untreated individuals are identically distributed, since the treatment is assumed to have no impact on the outcome. Thus, the differential outcome is equally likely to be a positive number as a negative number. Hence, for n independently selected matched pairs, the number of positive values of the differential outcome (call it X) follows the binomial distribution with n trials and probability 1/2. In a measured sample consisting of a total of n non-zero values of the differential outcome, suppose that x have positive values. Given that Ho holds, the probability (i.e., p-value) of such an occurrence is at most Prob(|X − n/2| ≥ |x − n/2|), which is the sum of both tails of the binomial distribution. Evaluating this tail probability provides the required bound on the p-value. As an aside, note that a different, distribution-specific significance test called the paired t-test may be applicable in other QED situations. A paired t-test uses the Student's t distribution and requires that the differential outcome have a normal distribution. Since our differential outcome does not have a normal distribution, we rely on the distribution-free, non-parametric sign test, which is more generally applicable.

Some Caveats.

It is important to understand the limitations of our QED tools, or for that matter any experimental technique of inference. Care should be taken in designing the quasi-experiment to ensure that the major confounding variables are explicitly or implicitly captured in the analysis. If there exist confounding variables that are not easily measurable (for example, the gender of the viewer) and/or are not identified and controlled, these unaccounted-for dimensions could pose a risk to a causal conclusion, if indeed they turn out to be significant. Our work on deriving a causal relationship by systematically accounting for the confounding variables must not be viewed as a definitive proof of causality, as indeed there can be no definitive proof of causality. Rather, our work increases the confidence in a causal conclusion by accounting for potential major sources of confounding. This is of course a general caveat that holds for all domains across the sciences that attempt to infer causality from observational data.

5. VIEWER ABANDONMENT

We address the question of how long a viewer will wait for the stream to start up, a question of great importance that, to our knowledge, has not been studied systematically. However, the analogous problem of how long a user will wait for web content to download has received much attention. In 2006, Jupiter Research published a study, based on interviewing 1,058 online shoppers, that postulated what is known in the industry as the "4-second rule": an average online shopper is likely to abandon a web site if a web page does not download within 4 seconds [21]. But a more recent study [16] implied that users have become impatient over time and that even a 400 ms delay can make users search less. Our motivation is to derive analogous rules for streaming, where the startup delay for video is roughly analogous to the download time for web pages.

Assertion 5.1. An increase in startup delay causes more abandonment of viewers.

To investigate if our assertion holds, we classify each view into 1-second buckets based on its startup delay. We then compute, for each bucket, the percentage of views assigned to that bucket that were abandoned. From Figure 9, we see that the percent of abandoned views and startup delay are positively correlated, with a Kendall correlation of 0.72.
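The bucketing behind this correlation can be sketched as follows; the views are synthetic, and the hand-rolled Kendall computation (no tie correction, unlike the tau-b variant commonly reported) is illustrative only:

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall rank correlation: (concordant - discordant) / total pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    total = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / total

# Synthetic views: (startup delay in seconds, abandoned?).
views = [(0.5, False), (1.2, False), (2.4, False), (2.8, True),
         (4.1, True), (5.5, True), (6.3, True), (7.9, True)]
# Classify views into 1-second buckets and compute the percent
# of views in each bucket that were abandoned.
buckets = {}
for delay, abandoned in views:
    buckets.setdefault(int(delay), []).append(abandoned)
delays = sorted(buckets)
abandon_pct = [100 * sum(buckets[d]) / len(buckets[d]) for d in delays]
tau = kendall_tau(delays, abandon_pct)  # positive: more delay, more abandonment
```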

Figure 9: Percentage of abandoned views and startup delay are positively correlated.

Figure 10: Viewers start to abandon the video if the startup delay exceeds about 2 seconds. Beyond that point, a 1-second increase in delay results in roughly a 5.8% increase in the abandonment rate.

Suppose now that we build a media delivery service that provides a startup delay of exactly x seconds for every view. What percent of views delivered by this system will be abandoned? To estimate this metric, we define a function AbandonmentRate(x) that equals

100 × Impatient(x) / (Impatient(x) + Patient(x)),

where Impatient(x) is the number of views that were abandoned after experiencing less than x seconds of startup delay and Patient(x) is the number of views where the viewer waited at least x seconds without abandoning. That is, Impatient(x) (resp., Patient(x)) corresponds to views where the viewer did not (resp., did) demonstrate the patience to hold on for x seconds without abandoning. Note that a view in Patient(x) could still have been abandoned at some time greater than x. Also note that a view where the video started to play before x seconds does not provide any information on whether the viewer would have waited until x seconds or not, and so is considered neither patient nor impatient. Figure 10 shows the abandonment rate computed from our data, which is near zero for the first 2 seconds but starts to rise rapidly as the startup delay increases. Fitting a simple regression to the initial part of the curve shows that the abandonment rate increases by 5.8% for each 1-second increase in startup delay.

Assertion 5.2. Viewers are less tolerant of startup delay for short videos in comparison to longer videos.

Researchers who study the psychology of queuing [13] have shown that people have more patience for waiting in longer queues if the perceived value of the service that they are waiting for is greater. The duration of the service often influences its perceived value, with longer durations often perceived as having greater value. People often tolerate a 30-minute delay in the check-in process for a 4-hour plane ride but would find the same wait excessive for a 10-minute bus ride. On the same principle, is it true that viewers would be more patient for the video to start up if they expect to be watching the video for a longer period of time?

To investigate our assertion, we first classify the views based on whether the content is short, with duration smaller than 30 minutes (e.g., a news clip), or long, with duration longer than 30 minutes (e.g., a movie). The Kendall correlations between the two variables, percent of abandoned videos and startup delay, were 0.68 and 0.90 for short and long videos respectively, indicating a strong correlation for each category. Further, Figure 11 shows the abandonment rate for each type of content as a function of the startup delay. One can see that viewers typically abandon at a larger rate for short videos than for long videos.

Assertion 5.3. Viewers watching videos on a better-connected computer or device have less patience for startup delay and so abandon sooner.

The above assertion is plausible because there is some evidence that users who expect faster service are more likely to be disappointed when that service is slow. In fact, this is often touted as a reason why users are becoming less and less able to tolerate web pages that download slowly. To study whether or not this is true in a scientific manner, we segment our views into four categories based on their connection type, which indicates how the corresponding viewer is connected to the Internet. The categories, in roughly increasing order of connectivity, are mobile, DSL, cable modem, and fiber (such as Verizon FiOS or AT&T U-verse). In all four categories, we see a strong correlation between the two variables, percent of abandoned views and startup delay. The Kendall correlations for mobile, DSL, cable, and fiber are 0.68, 0.74, 0.71, and 0.75 respectively. Further, in Figure 12, we show the abandonment rate for each connection type. We can see that viewers abandon significantly less on mobile in comparison with the other categories for a given startup delay.
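The AbandonmentRate(x) estimator defined above can be sketched as follows; the (waited, abandoned) view tuples are a synthetic illustration of the data, not the paper's actual schema:

```python
def abandonment_rate(x, views):
    """Estimate the percent of views that would be abandoned if every
    view had a startup delay of exactly x seconds.  Each view is a
    (waited, abandoned) pair: `waited` is the seconds until playback
    began or the viewer left; `abandoned` is True if the viewer left
    while still waiting for startup."""
    impatient = sum(1 for waited, abandoned in views
                    if abandoned and waited < x)            # gave up before x
    patient = sum(1 for waited, _ in views if waited >= x)  # held on past x
    # Views where playback started before x seconds are neither patient
    # nor impatient: they say nothing about the viewer's patience at x.
    if impatient + patient == 0:
        return 0.0
    return 100 * impatient / (impatient + patient)

views = [(1.0, False), (2.5, True), (4.0, True), (6.0, False), (7.0, True)]
rate = abandonment_rate(5.0, views)  # 2 impatient views vs. 2 patient views
```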

Figure 11: Viewers abandon at a higher rate for short videos than for long videos.

Figure 12: Viewers who are better connected abandon sooner.

Some difference in abandonment is discernible between the other categories, in the rough order of cable, DSL, and fiber, though the differences are much smaller.

5.1 QED for Assertion 5.2

First, we devise a QED to study the impact of content length on abandonment (Assertion 5.2). Therefore, we make the content length (long or short) the treatment variable, and the outcome measures the patience of the viewer to startup delay. The viewer's patience to startup delay can be influenced by both the viewer's geography and connection type, which we use as the confounding variables. Specifically, we form matched pairs (u, v) such that view u is a short video that was abandoned, view v is a long video that was abandoned, u and v are watched by viewers from the same geography, and the viewers have the same connection type. The matching algorithm is described as follows.

1. Match step. Let the treated set T be all abandoned views for short content and let the untreated set C be all abandoned views for long content. For each u ∈ T we pick uniformly and randomly a v ∈ C such that u and v belong to viewers in the same geography and have the same connection type. The matched set of pairs M ⊆ T × C have the same attributes for the confounding variables and differ only on the treatment.

2. Score step. For each pair (u, v) ∈ M, we compute outcome(u, v) to be +1 if u was abandoned with a smaller startup delay than v, −1 if u was abandoned with a larger startup delay than v, and 0 if the startup delays when u and v were abandoned are equal. Now,

Net Outcome = 100 × ( Σ_{(u,v) ∈ M} outcome(u, v) ) / |M|.

Note that a positive value for the net outcome provides positive (supporting) evidence for Assertion 5.2, while a negative value provides negative evidence for the assertion. The matching algorithm produced a net outcome of 11.5%. That is, the matched pairs that support Assertion 5.2 exceed those that negate the assertion by 11.5%. The positive net outcome provides evidence of causality that was not provided by the prior correlational analysis alone, by eliminating the threats posed by the identified confounding variables.

To derive the statistical significance of the above QED result, we formulate a null hypothesis Ho that states that the treatment (long versus short video) has no impact on abandonment. If Ho holds, outcome(u, v) is equally likely to be positive (+1) as negative (−1). We now use the sign test described in Section 4.2 to derive a bound on the p-value. Since we matched n = 78,840 pairs, if Ho holds, the expected number of pairs with a positive outcome is n/2 = 78,840/2 = 39,420. Our observational data, however, had x = 43,954 pairs with positive scores, i.e., x − n/2 = 4,534 pairs in excess of the mean. We bound the p-value by showing that it is extremely unlikely to have 4,534 positive pairs in excess of the mean, by computing the two-sided tail of the binomial distribution with n trials and probability 1/2:

p-value ≤ Prob( |X − n/2| ≥ |x − n/2| ) ≤ 3.3 × 10^−229.   (1)

The above bound for the p-value is much smaller than the required significance level of 0.001 and leads us to reject the null hypothesis Ho. Thus, we conclude that our QED analysis is statistically significant.
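The two-sided binomial tail behind this bound can be checked with a short sketch; the arithmetic is exact via Python's big integers, and the toy n below is ours, not the paper's 78,840 pairs:

```python
import math

def sign_test_p_bound(n, x):
    """Sign-test bound on the p-value: Prob(|X - n/2| >= |x - n/2|)
    for X ~ Binomial(n, 1/2), i.e., the sum of both binomial tails."""
    dev = abs(2 * x - n)  # compare 2*|k - n/2| to stay in integers
    tail = sum(math.comb(n, k) for k in range(n + 1)
               if abs(2 * k - n) >= dev)
    return tail / 2 ** n

# Toy example: 15 positive differential outcomes out of 20 non-zero pairs.
p = sign_test_p_bound(20, 15)
# p is about 0.041: significant at 0.05 but not at the paper's 0.001 level.
```

The same computation at n = 78,840 and x = 43,954 yields the vanishingly small bound quoted in Equation (1).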

5.2 QED for Assertion 5.3

To investigate a causal conclusion for Assertion 5.3, we set up a QED where the treatment is the connection type of the user and the outcome measures the relative tolerance of the viewer to startup delay. For each pair of connection types A and B, we run a matching algorithm where the treated set T is the set of all abandoned views with connection type A and the untreated set is all abandoned views with connection type B. The matching algorithm used is identical to the one described earlier, except that the match criterion in step 1 is changed to match on identical content and identical geography. That is, for every matched pair (u, v), view u has connection type A and view v has connection type B, but both are views of the same video and belong to viewers in the same geography.

The results of the matching algorithm are shown in Figure 13. For instance, our results show that the likelihood that a mobile viewer exhibited more patience than a fiber viewer is greater than the likelihood that the opposite holds, by a margin of 38.25%. Much as in Section 5.1, we use the sign test to compute the p-value for each QED outcome in the table. All QED outcomes in Figure 13 turned out to be statistically significant with exceedingly small p-values, except the DSL-versus-cable comparison, which was inconclusive. Specifically, our results show that a mobile viewer exhibits more patience than other (non-mobile) viewers, and the result holds with exceedingly small p-values (< 10^−17). Our results also provide strong evidence for DSL and cable users being more patient than fiber users, though the p-value for DSL-versus-fiber was somewhat larger (4.6 × 10^−5) but still statistically significant. The DSL-versus-cable comparison was, however, inconclusive and not statistically significant, as the p-value of the score was 0.06, which is larger than our required significance level of 0.001.

Net QED outcome (percent) for each treated-untreated pair of connection types:

             Untreated
  Treated    dsl      cable    fiber
  mobile     33.81    35.40    38.25
  dsl        -        -0.75    2.67
  cable      -        -        3.65

Figure 13: Net QED outcomes support the causal impact of connection type on viewer patience for startup delay, though the impact is more pronounced between mobile and the rest. The p-values for all entries are very small (< 10^−17), except DSL-versus-cable (0.06) and DSL-versus-fiber (4.6 × 10^−5).

6. VIEWER ENGAGEMENT

We study the extent to which a viewer is engaged with the video content of the content provider. A simple metric that measures engagement is play time. Here we study play time on a per-view basis, though one could study play time aggregated over all views of a visit (called visit play time) or play time aggregated over all visits of a viewer (called viewer play time). Figure 14 shows the CDF of play time over our entire data set.

Figure 14: A significant fraction of the views have small duration.

A noticeable fact is that a significant number of views have very small play time, with a median play time of only 35.4 seconds. This is likely caused by "video surfing", where a viewer quickly views a sequence of videos to see what might be of interest to him/her before settling on the videos that he/she wants to watch. The fact that a viewer watched an average of 22.48 minutes per visit (cf. Figure 4) is consistent with this observation. Play time is clearly impacted by both the interest level of the viewer in the video and the stream quality. Viewer interest could itself be a function of complex factors; for instance, Italian viewers might be more interested in soccer World Cup videos than American viewers, even more so if the video is of a game where Italy is playing. In understanding the impact of stream quality on viewer engagement, the challenge is to neutralize the bias from confounding variables not related to stream quality, such as viewer interest, geography, and connection type. Since more rebuffer delay is expected of videos with a longer duration, we use the normalized rebuffer delay, which equals 100 × (rebuffer delay / video duration). (Note that the normalized rebuffer delay can exceed 100% if we rebuffer for longer than the total duration of the video.)

Assertion 6.1. An increase in (normalized) rebuffer delay can cause a decrease in play time.

To evaluate the above assertion, we first classify views by bucketing their normalized rebuffer delay into 1% buckets. Then we compute and plot the average play time for all views within each bucket (see Figure 15). The decreasing trend visualizes the negative correlation that exists between normalized rebuffer delay and play time. The Kendall correlation between the two metrics is −0.421, quantifying the negative correlation.

6.1 QED Analysis

To examine the causality of Assertion 6.1, we devise a QED where the treated set T consists of all views that suffered a normalized rebuffer delay of more than a certain threshold γ%. Given a value of γ as input, the treated views are matched with untreated views that did not experience rebuffering, as follows.

1. Match step. We form a set of matched pairs M as follows. Let T be the set of all views that have a normalized rebuffer delay of at least γ%. For each view u in T, suppose that u reaches the normalized rebuffer delay threshold γ% when viewing the tth second of the video.

That is, view u receives treatment after watching the first t seconds of the video, though more of the video could have been played after that point. We pick a view v uniformly and randomly from the set of all possible views such that

(a) the viewer of v has the same geography and connection type, and is watching the same video, as the viewer of u;

(b) view v has played at least t seconds of the video without rebuffering up to that point.

2. Score step. For each pair (u, v) ∈ M, we compute

outcome(u, v) = (play time of v − play time of u) / video duration,

Net Outcome = 100 × ( Σ_{(u,v) ∈ M} outcome(u, v) ) / |M|.

Note that the closer we can make the matched views u and v in variables other than the treatment, the more accurate our QED results. As a practical matter, though, adding too many matching parameters can greatly reduce the availability of matches, eventually impacting the statistical significance of the results. It is worth noting step 1(b) above, where we ensure that v watches the video to at least the same point as when u first received treatment. Thus, at the time both u and v play the tth second of the video, they have viewed the same content, and the only difference between them is that one had rebuffering and the other did not. The net outcome of the matching algorithm can be viewed as the difference in the play time of u and v expressed as a percent of the video duration.

Figure 15: Correlation of normalized rebuffer delay with play time.

Net outcomes and p-values for increasing rebuffer-delay thresholds γ:

  Normalized Rebuffer Delay γ (percent)   Net Outcome (percent)   P-Value
  1                                       5.02                    < 10^−143
  2                                       5.54                    < 10^−123
  3                                       5.7                     < 10^−87
  4                                       6.66                    < 10^−86
  5                                       6.27                    < 10^−57
  6                                       7.38                    < 10^−47
  7                                       7.48                    < 10^−36

Figure 16: A viewer who experienced more rebuffer delay on average watched less video than an identical viewer who had no rebuffering.

Figure 16 shows that on average a view that experienced a normalized rebuffer delay of 1% or more played 5.02% less of the video. There is a general upward trend in the net outcome as the treatment gets harsher with increasing values of γ. Much as in Section 5.1, we use the sign test to compute the p-values for each QED outcome. All p-values were extremely small, as shown in Figure 16, making the results statistically significant.

7. REPEAT VIEWERSHIP

We study the viewers who, after watching videos on a content provider's site, return after some period of time to watch more. Repeat viewers are highly valued by media content providers, as these viewers are more engaged and more loyal to the content provider's site. Even a small decrease (or increase) in the return rate of viewers can have a large impact on the business metrics of the content provider. Clearly, a number of factors, including how captivating the video content is to the viewer, influence whether or not a viewer returns. However, we show that stream quality can also influence whether or not a viewer returns.

The most drastic form of quality degradation is failure, when a viewer is unable to play a video successfully. Failures can be caused by a number of issues, including problems with the content (broken links, missing video files, etc.), the client software (media player bugs, etc.), or the infrastructure (network failure, server overload, etc.). More frustrating than a failed view is a failed visit, where a viewer tries to play videos from the content provider's site but fails and leaves the site immediately after the failure, presumably with some level of frustration. (Note that the definition of a failed visit does not preclude successful views earlier in that visit, before the last view(s) that failed.) We focus on the impact of a failed visit experienced by a viewer on his/her likelihood of returning to the content provider's site.

Assertion 7.1. A viewer who experienced a failed visit is less likely to return to the content provider's site to view more videos within a specified time period than a similar viewer who did not experience a failed visit.

To examine if the above assertion holds, we classify each of our views as either failed or normal (i.e., not failed). For each failed visit (resp., normal visit), we compute the return time, defined to be the next time the viewer returns to the content provider's site. (The return time could be infinite if the viewer does not return to the site within our trace window.) Figure 17 shows the CDF of the return time for both failed visits and normal visits. It can be seen that there is a significant reduction in the probability of return following a failed visit as opposed to a normal one. For instance, the probability of returning within 1 day after a failed visit is 8.0% versus 11% after a normal one.

Likewise, the probability of returning within 1 week after a failed visit is 25% versus 27% after a normal one.

Figure 17: Probability of return after a failed visit and after a normal visit. The probability of returning within a specified return time is distinctly smaller after a failed visit than after a normal one.

Figure 18: CDF of the viewer play time for all, treated, and untreated viewers.

7.1 QED Analysis

We perform a QED analysis to strengthen Assertion 7.1 by considering viewers with a failed visit to be the treated set T. (Note that in this matching we match viewers and not views, as we are evaluating the repeat viewership of a viewer over time.) For each u ∈ T we find a matching viewer v that is similar to u in all the confounding variables. As before, we ensure that viewers u and v are from the same geography, have the same connection type, and are viewing content from the same content provider. However, there is a subtle additional characteristic that needs to be matched. Specifically, we need to also ensure that the propensity of u to watch videos prior to when u received treatment is equivalent to the corresponding propensity of v. This ensures that any differential behavior after the treatment can be attributed to the treatment itself.

To reinforce the last point, a viewer who watches more video at a site is more likely to have had a failed view. Therefore, the treated set T of viewers has a bias towards containing more frequent visitors to the site, who also watch more video. Figure 18 shows the CDF of the aggregate play time of a viewer across all visits. It can be seen that the treated set T has viewers who have watched for more time in aggregate. To neutralize this effect, we match on the number of prior visits and the aggregate play time in step 1(c) below, making them near identical, so that we are comparing two viewers who have exhibited a similar propensity to visit the site prior to treatment. The use of a similarity metric of this kind for matching is common in QED analysis and is similar in spirit to the propensity score matching of [22].

The matching algorithm follows.

1. Match step. We produce a matched set of pairs M as follows. Let T be the set of all viewers who have had a failed visit. For each u ∈ T we pick the first failed visit of viewer u. We then pair u with a viewer v picked uniformly and randomly from the set of all possible viewers such that

(a) viewer v has the same geography and the same connection type as u, and is watching content from the same content provider as u;

(b) viewer v had a normal visit at about the same time (within ±3 hours) as the first failed visit of viewer u. We call the failed visit of u and the corresponding normal visit of v that occurred at a similar time matched visits;

(c) viewers u and v have the same number of visits and about the same total viewing time (±10 minutes) prior to their matched visits.

2. Score step. For each pair (u, v) ∈ M and each return time δ, we assign outcome(u, v, δ) to be −1 if u returns within the return time and v does not, +1 if v returns within the return time and u does not, and 0 otherwise. Then,

Net Outcome(δ) = 100 × ( Σ_{(u,v) ∈ M} outcome(u, v, δ) ) / |M|.

Figure 19 shows the outcome of the matching algorithm for various values of the return time δ. The positive values of the outcome provide strong evidence for the causality of Assertion 7.1, since they show that viewers who experienced a normal visit returned more often than their matched counterparts who experienced a failed visit. To take a numerical example, for δ = 1 day, 458,621 pairs were created. The pairs where the normal viewer returned but the matched failed-visit viewer did not outnumbered the pairs where the opposite happened by 10,909, which is 2.38% of the total pairs.
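The score step above can be sketched as follows; the scoring is as specified, while the matched viewer records are synthetic (pairs are assumed to have been formed as in the match step):

```python
def net_outcome(pairs, delta):
    """Score each matched (treated, untreated) viewer pair: -1 if only
    the failed-visit viewer returned within delta days, +1 if only the
    normal-visit viewer returned, and 0 otherwise."""
    def score(u, v):
        u_returned = u["return_time"] <= delta  # u had the failed visit
        v_returned = v["return_time"] <= delta  # v had the normal visit
        if u_returned and not v_returned:
            return -1
        if v_returned and not u_returned:
            return +1
        return 0
    return 100 * sum(score(u, v) for u, v in pairs) / len(pairs)

# return_time in days; float("inf") means the viewer never returned
# within the trace window.
pairs = [({"return_time": float("inf")}, {"return_time": 0.5}),
         ({"return_time": 2.0}, {"return_time": 1.0}),
         ({"return_time": 3.0}, {"return_time": 6.0})]
net = net_outcome(pairs, delta=1)  # 2 of 3 pairs favor the normal viewer
```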

the pairs where the opposite happened. The number of pairs in excess was 10,909, which is 2.38% of the total. Using the sign test, we show that the p-value is extremely small (2.2 × 10^-58), providing strong evidence of statistical significance for the outcome. Note that as δ increases, the outcome score remains in a similar range. One would expect that for very large values of δ the effect of the failed visit would wear off, but we did not analyze traces long enough to evaluate whether such a phenomenon occurs. All p-values remain significantly smaller than our significance threshold of 0.001, allowing us to conclude that the results are statistically significant.

Return Time δ (in days)   Outcome (percent)   P-Value
1                         2.38                < 10^-57
2                         2.51                < 10^-51
3                         2.42                < 10^-44
4                         2.35                < 10^-37
5                         2.15                < 10^-22
6                         1.90                < 10^-11
7                         2.32                < 10^-6

Figure 19: A viewer who experienced a failed visit is less likely to return within a given time period than a viewer who experienced a normal visit.

8. RELATED WORK

The quality metrics considered here have more than a dozen years of history within industry, where early measurement systems used synthetic "measurement agents" deployed around the world to measure metrics such as failures, startup delay, rebuffering, and bitrate; an example is Akamai's Stream Analyzer measurement system [1, 24]. There have been early studies at Akamai on streaming quality metrics using these tools [19]. However, truly large-scale studies were made possible only with the recent advent of client-side measurement technology that could measure and report detailed quality and behavioral data from actual viewers. To our knowledge, the first important large-scale study, and the closest in spirit to our work, is the study of viewer engagement published last year [9], which shows several correlational relationships between quality (such as rebuffering), content type (such as live and short/long VoD), and viewer engagement (such as play time). A recent sequel to that work [15] studies the use of quality metrics to enhance video delivery. A key differentiation of our work from prior work is our focus on establishing causal relationships, going a step beyond correlation. While viewer engagement was also studied correlationally in [9], our work takes the next step in ascertaining the causal impact of rebuffering on play time. Besides our results on viewer engagement, we also establish key assertions pertaining to viewer abandonment and repeat viewership that are the first quantitative results of their kind. However, it must be noted that [9] studies a larger set of quality metrics, including join time, average bitrate, and rendering quality, and a larger class of videos including live streaming, albeit without establishing causality.

The work on quasi-experimental design in the social and medical sciences has a long and distinguished history stretching over several decades that is well documented in [23], though its application to data mining is more recent. In [18], the authors use QEDs to answer questions about user behavior in social media such as Stack Overflow and Yahoo Answers. There are a number of other studies on perceived quality, though they tend to be small-scale or do not link quality to user behavior [10, 7]. There has also been prior work for other types of systems: for instance, on the relationship between page download times and user satisfaction for the web [3], and on quantifying user satisfaction for Skype [6]. There has also been work on correlating QoS with QoE (quality of experience) for multimedia systems using human subjects [27]. These of course have a very different focus from our work and do not show causal impact. Finally, there has been a significant amount of work on workload characterization of streaming media, P2P, and web workloads [25, 4]; even though we do characterize the workload to a degree, our focus is quality and viewer behavior.

9. CONCLUSIONS

Our work is the first to demonstrate a causal nexus between stream quality and viewer behavior. The results presented in our work are important because they are the first quantitative demonstration that key quality metrics causally impact viewer behavioral metrics that matter to both content providers and CDN operators. As all forms of media migrate to the Internet, both video monetization and the design of CDNs will increasingly demand a true causal understanding of this nexus. Establishing a causal relationship by systematically eliminating the confounding variables is immensely important, as mere correlational studies carry the potentially costly risk of reaching incorrect conclusions.

Our work breaks new ground in understanding viewer abandonment and repeat viewership. Further, it sheds more light on the known correlational impact of quality on viewer engagement by establishing its causal impact. Our work on startup delay shows that more delay causes more abandonment; for instance, a 1-second increase in delay increases the abandonment rate by 5.8%. We also showed the strong impact of rebuffering on video play time: for instance, a viewer experiencing a rebuffer delay that equals or exceeds 1% of the video duration played 5.02% less of the video in comparison with a similar viewer who experienced no rebuffering. Finally, we examined the impact of failed visits and showed that a viewer who experienced failures is less likely to return to the content provider's site than a similar viewer who did not experience failures. In particular, we showed that a failed visit decreased the likelihood of a viewer returning within a week by 2.32%. While reviewing these results, it is important to remember that small changes in viewer behavior can lead to large changes in monetization, since the impact of a few percentage points over tens of millions of viewers can accrue to a large effect over time.

As more and more data become available, we expect that our QED tools will play an increasingly large role in establishing key causal relationships that are key drivers of both the content provider's monetization framework and the CDN's next-generation delivery architecture. The increasing scale of the measured data greatly enhances the statistical significance of the derived conclusions and the efficacy of our tools. Further, we expect that our work provides an important tool for establishing causal relationships in other areas

of measurement research in networked systems that have so far been limited to correlational studies.

10. ACKNOWLEDGEMENTS

We thank Ethendra Bommaiah, Harish Kammanahalli, and David Jensen for insightful discussions about the work. Further, we thank our shepherd Meeyoung Cha and our anonymous referees for their detailed comments that resulted in significant improvements to the paper. Any opinions expressed in this work are solely those of the authors and not necessarily those of Akamai Technologies.

11. REFERENCES

[1] Akamai. Stream Analyzer Service Description. Stream_Analyzer_Service_Description.pdf.
[2] K. Andreev, B.M. Maggs, A. Meyerson, and R.K. Sitaraman. Designing overlay multicast networks for streaming. In Proceedings of the Fifteenth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 149–158. ACM, 2003.
[3] N. Bhatti, A. Bouch, and A. Kuchinsky. Integrating user-perceived quality into web server design. Computer Networks, 33(1):1–16, 2000.
[4] M. Cha, H. Kwak, P. Rodriguez, Y.Y. Ahn, and S. Moon. I Tube, You Tube, Everybody Tubes: Analyzing the World's Largest User Generated Content Video System. In Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement, pages 1–14, 2007.
[5] H. Chen, S. Ng, and A.R. Rao. Cultural differences in consumer impatience. Journal of Marketing Research, pages 291–301, 2005.
[6] K.T. Chen, C.Y. Huang, P. Huang, and C.L. Lei. Quantifying Skype user satisfaction. In ACM SIGCOMM Computer Communication Review, volume 36, pages 399–410. ACM, 2006.
[7] M. Claypool and J. Tanner. The effects of jitter on the perceptual quality of video. In Proceedings of the Seventh ACM International Conference on Multimedia (Part 2), pages 115–118. ACM, 1999.
[8] John Dilley, Bruce M. Maggs, Jay Parikh, Harald Prokop, Ramesh K. Sitaraman, and William E. Weihl. Globally distributed content delivery. IEEE Internet Computing, 6(5):50–58, 2002.
[9] Florin Dobrian, Vyas Sekar, Asad Awan, Ion Stoica, Dilip Joseph, Aditya Ganjam, Jibin Zhan, and Hui Zhang. Understanding the impact of video quality on user engagement. In Proceedings of the ACM SIGCOMM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, pages 362–373, New York, NY, USA, 2011. ACM.
[10] S.R. Gulliver and G. Ghinea. Defining user perception of distributed multimedia quality. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP), 2(4):241–257, 2006.
[11] R. Kohavi, R. Longbotham, D. Sommerfield, and R.M. Henne. Controlled experiments on the web: survey and practical guide. Data Mining and Knowledge Discovery, 18(1):140–181, 2009.
[12] L. Kontothanassis, R. Sitaraman, J. Wein, D. Hong, R. Kleinberg, B. Mancuso, D. Shaw, and D. Stodolsky. A transport layer for live streaming in a content delivery network. Proceedings of the IEEE, 92(9):1408–1419, 2004.
[13] R.C. Larson. Perspectives on queues: Social justice and the psychology of queueing. Operations Research, pages 895–905, 1987.
[14] E.L. Lehmann and J.P. Romano. Testing Statistical Hypotheses. Springer Verlag, 2005.
[15] X. Liu, F. Dobrian, H. Milner, J. Jiang, V. Sekar, I. Stoica, and H. Zhang. A case for a coordinated internet video control plane. In Proceedings of the ACM SIGCOMM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, pages 359–370, 2012.
[16] Steve Lohr. For impatient web users, an eye blink is just too long to wait. New York Times, February 2012.
[17] E. Nygren, R.K. Sitaraman, and J. Sun. The Akamai Network: A platform for high-performance Internet applications. ACM SIGOPS Operating Systems Review, 44(3):2–19, 2010.
[18] H. Oktay, B.J. Taylor, and D.D. Jensen. Causal discovery in social media using quasi-experimental designs. In Proceedings of the First Workshop on Social Media Analytics, pages 1–9. ACM, 2010.
[19] Akamai White Paper. Akamai Streaming: When Performance Matters, 2004. Streaming_Performance_Whitepaper.pdf.
[20] G.E. Quinn, C.H. Shin, M.G. Maguire, R.A. Stone, et al. Myopia and ambient lighting at night. Nature, 399(6732):113–113, 1999.
[21] Jupiter Research. Retail Web Site Performance, June 2006. releases/2006/press_110606.html.
[22] P.R. Rosenbaum and D.B. Rubin. Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. American Statistician, pages 33–38, 1985.
[23] W.R. Shadish, T.D. Cook, and D.T. Campbell. Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Houghton Mifflin Company, 2002.
[24] R.K. Sitaraman and R.W. Barton. Method and apparatus for measuring stream availability, quality and performance, February 2003. US Patent 7,010,598.
[25] K. Sripanidkulchai, B. Maggs, and H. Zhang. An analysis of live streaming workloads on the internet. In Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement, pages 41–54, 2004.
[26] D.A. Wolfe and M. Hollander. Nonparametric Statistical Methods. 1973.
[27] W. Wu, A. Arefin, R. Rivas, K. Nahrstedt, R. Sheppard, and Z. Yang. Quality of experience in distributed interactive multimedia environments: toward a theoretical framework. In Proceedings of the 17th ACM International Conference on Multimedia, pages 481–490, 2009.
