Why Let Resources Idle? Aggressive Cloning of Jobs with Dolly

Despite prior research on outlier mitigation, our analysis of jobs from the Facebook cluster shows that outliers still occur, especially in small jobs. Small jobs are particularly sensitive to long-running outlier tasks because of their interactive nature. Outlier mitigation strategies rely on comparing different tasks of the same job and launching speculative copies for the slower tasks. However, small jobs execute all their tasks simultaneously, thereby not providing sufficient time to observe and compare tasks. Building on the observation that clusters are underutilized, we take speculation to its logical extreme: run full clones of jobs to mitigate the effect of outliers. The heavy-tail distribution of job sizes implies that we can impact most jobs without using many resources. Trace-driven simulations show that average


