DSN 2014:Atlanta, GA, USA

44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2014, Atlanta, GA, USA, June 23-26, 2014. IEEE Computer Society 【DBLP Link】

Paper Num: 82 || Session Num: 5

(DCDV 2014) Session 1: Cloud Dependability 3 Session 1: Cloud Dependability 3)
(DSSO 2014) Session 1: Error Detection and Diagnosis 3 Session 1: Error Detection and Diagnosis 3)
(ToSG 2014) Session 4 Session 4)
Best Paper Award Session 3
DSN 2014
Session 2
Session 1: Checkpoint/Restart Modeling and Message Logging 2
Session 1A: Identifying Malicious Activity 3
Session 1B: Networking 3
Session 2: Application and Algorithm Resiliency 2
Session 2: Dependability Evaluation 3
Session 2: Design Strategies for Dependability 2
Session 2A: Software Vulnerabilities 3
Session 2B: Cyber-Physical Systems 3
Session 3: Hardware - Reliability Studies and Tailored Resilience Techniques 3
Session 3A: Apps Attacks 3
Session 3B: Memory 3
Session 4: Resiliency in HPC Messaging 2
Session 4A: Cloud Computing 4
Session 4B: Social and Online Services 4
Session 5A: State Machine Replication 3
Session 5B: Faults 3
Session 6A: Databases and Storage 3
Session 6B: GPUs 3
Session 7A: System Configuration and Provisioning 3
Session 7B: Formal Methods 3
Session 8A: System and Component Reliability 3
Session 8B: Miscellaneous 3
Session 9: Failure Analysis and Assurance 3
The First International Workshop on Dependability and Security of System Operation (DSSO 2014))
The First International Workshop on Trustworthiness of Smart Grids (ToSG 2014))
The Fourth Fault Tolerance for HPC at eXtreme Scale (FTXS) Workshop 2014 Workshop 2014)
The Fourth International Workshop on Dependability of Clouds, Data Centers and Virtual Machine Technology (DCDV 2014))

DSN 2014

Best Paper Award Session 3

1. Failure Analysis of Virtual and Physical Machines: Patterns, Causes and Characteristics.

【Paper Link】【Pages】:1-12

【Authors】: Robert Birke ; Ioana Giurgiu ; Lydia Y. Chen ; Dorothea Wiesmann ; Ton Engbersen

【Abstract】: In today's commercial data centers, the computation density grows continuously as the number of hardware components and workloads in units of virtual machines increase. The service availability guaranteed by data centers heavily depends on the reliability of the physical and virtual servers. In this study, we conduct an analysis on 10K virtual and physical machines hosted on five commercial data centers over an observation period of one year. Our objective is to establish a sound understanding of the differences and similarities between failures of physical and virtual machines. We first capture their failure patterns, i.e., the failure rates, the distributions of times between failures and of repair times, as well as, the time and space dependency of failures. Moreover, we correlate failures with the resource capacity and run-time usage to identify the characteristics of failing servers. Finally, we discuss how virtual machine management actions, i.e., consolidation and on/off frequency, impact virtual machine failures.

【Keywords】: failure root causes; Datacenters; VM failures

2. Reliability and Security Monitoring of Virtual Machines Using Hardware Architectural Invariants.

【Paper Link】【Pages】:13-24

【Authors】: Cuong Manh Pham ; Zachary Estrada ; Phuong Cao ; Zbigniew T. Kalbarczyk ; Ravishankar K. Iyer

【Abstract】: This paper presents a solution that simultaneously addresses both reliability and security (RnS) in a monitoring framework. We identify the commonalities between reliability and security to guide the design of Hyper Tap, a hyper visor-level framework that efficiently supports both types of monitoring in virtualization environments. In Hyper Tap, the logging of system events and states is common across monitors and constitutes the core of the framework. The audit phase of each monitor is implemented and operated independently. In addition, Hyper Tap relies on hardware invariants to provide a strongly isolated root of trust. Hyper Tap uses active monitoring, which can be adapted to enforce a wide spectrum of RnS policies. We validate Hyper Tap by introducing three example monitors: Guest OS Hang Detection (GOSHD), Hidden Root Kit Detection (HRKD), and Privilege Escalation Detection (PED). Our experiments with fault injection and real root kits/exploits demonstrate that Hyper Tap provides robust monitoring with low performance overhead.

【Keywords】: Fault Injection; Reliability; Security; Monitoring; Hypervisor; Invariant; Rootkit

3. Lazy Checkpointing: Exploiting Temporal Locality in Failures to Mitigate Checkpointing Overheads on Extreme-Scale Systems.

【Paper Link】【Pages】:25-36

【Authors】: Devesh Tiwari ; Saurabh Gupta ; Sudharshan S. Vazhkudai

【Abstract】: Continuing increase in the computational power of supercomputers has enabled large-scale scientific applications in the areas of astrophysics, fusion, climate and combustion to run larger and longer-running simulations, facilitating deeper scientific insights. However, these long-running simulations are often interrupted by multiple system failures. Therefore, these applications rely on "check pointing'" as a resilience mechanism to store application state to permanent storage and recover from failures. Unfortunately, check pointing incurs excessive I/O overhead on supercomputers due to large size of checkpoints, resulting in a sub-optimal performance and resource utilization. In this paper, we devise novel mechanisms to show how check pointing overhead can be mitigated significantly by exploiting the temporal characteristics of system failures. We provide new insights and detailed quantitative understanding of the check pointing overheads and trade-offs on large-scale machines. Our prototype implementation shows the viability of our approach on extreme-scale machines.

【Keywords】: storage; checkpointing; resilience; system failures; locality; extreme-scale; supercomputing

Session 1A: Identifying Malicious Activity 3

4. Titan: Enabling Low Overhead and Multi-faceted Network Fingerprinting of a Bot.

【Paper Link】【Pages】:37-44

【Authors】: Osama Haq ; Waqar Ahmed ; Affan A. Syed

【Abstract】: Botnets are an evolutionary form of malware, unique in requiring network connectivity for herding by a botmaster that allows coordinated attacks as well as dynamic evasion from detection. Thus, the most interesting features of a bot relate to its rapidly evolving network behavior. The few academic and commercial malware observation systems that exist, however, are either proprietary or have large cost and management overhead. Moreover, the network behavior of bots changes considerably under different operational contexts. We first identify these various contexts that can impact its fingerprint. We then present Titan: a system that generates faithful network fingerprints by recreating all these contexts and stressing the bot with different network settings and host interactions. This effort includes a semi-automated and tunable containment policy to prevent bot proliferation. Most importantly, Titan has low cost overhead as a minimal setup requires just two machines, while the provision of a user-friendly web interface reduces the setup and management overhead. We then show a fingerprint of the Crypto locker bot to demonstrate automatic detection of its domain generation algorithm (DGA). We also demonstrate the effective identification of context specific behavior with a controlled deployment of Zeus botnet.

【Keywords】: software defined networking; botnets; containment policy; malware fingerprint; testbed

5. pSigene: Webcrawling to Generalize SQL Injection Signatures.

【Paper Link】【Pages】:45-56

【Authors】: Gaspar Modelo-Howard ; Christopher N. Gutierrez ; Fahad A. Arshad ; Saurabh Bagchi ; Yuan Qi

【Abstract】: Intrusion detection systems (IDS) are an important component to effectively protect computer systems. Misuse detection is the most popular approach to detect intrusions, using a library of signatures to find attacks. The accuracy of the signatures is paramount for an effective IDS, still today's practitioners rely on manual techniques to improve and update those signatures. We present a system, called pSigene, for the automatic generation of intrusion signatures by mining the vast amount of public data available on attacks. It follows a four-step process to generate the signatures, by first crawling attack samples from multiple public cyber security web portals. Then, a feature set is created from existing detection signatures to model the samples, which are then grouped using a biclustering algorithm which also gives the distinctive features of each cluster. Finally the system automatically creates a set of signatures using regular expressions, one for each cluster. We tested our architecture for SQL injection attacks and found our signatures to have a True and False Positive Rates of 90.52% and 0.03%, respectively and compared our findings to other SQL injection signature sets from popular IDS and web application firewalls. Results show our system to be very competitive to existing signature sets.

【Keywords】: SQL injection; web application security; signature generalization; biclustering

6. Probabilistic Inference for Obfuscated Network Attack Sequences.

【Paper Link】【Pages】:57-67

【Authors】: Haitao Du ; Shanchieh Jay Yang

【Abstract】: Facing diverse network attack strategies and overwhelming alters, much work has been devoted to correlate observed malicious events to pre-defined scenarios, attempting to deduce the attack plans based on expert models of how network attacks may transpire. Sophisticated attackers can, however, employ a number of obfuscation techniques to confuse the alert correlation engine or classifier. Recognizing the need for a systematic analysis of the impact of attack obfuscation, this paper models attack strategies as general finite order Markov models, and treats obfuscated observations as noises. Taking into account that only finite observation window and limited computational time can be afforded, this work develops an algorithm to efficiently inference on the joint distribution of clean and obfuscated attack sequences. The inference algorithm recovers the optimal match of obfuscated sequences to attack models, and enables a systematic and quantitative analysis on the impact of obfuscation on attack classification.

【Keywords】: Hidden Markov models; Vectors; Inference algorithms; Probabilistic logic; Markov processes; Dynamic programming; Computational modeling

Session 1B: Networking 3

7. Anomaly Characterization in Large Scale Networks.

【Paper Link】【Pages】:68-79

【Authors】: Emmanuelle Anceaume ; Yann Busnel ; Erwan Le Merrer ; Romaric Ludinard ; Jean Louis Marchand ; Bruno Sericola

【Abstract】: The context of this work is the online characterization of errors in large scale systems. In particular, we address the following question: Given two successive configurations of the system, can we distinguish massive errors from isolated ones, the former ones impacting a large number of nodes while the second ones affect solely a small number of them, or even a single one? The rationale of this question is twofold. First, from a theoretical point of view, we characterize errors with respect to their neighbourhood, and we show that there are error scenarios for which isolated and massive errors are indistinguishable from an omniscient observer point of view. We then relax the definition of this problem by introducing unresolved configurations, and exhibit necessary and sufficient conditions that allow any node to determine the type of errors it has been impacted by. These conditions only depend on the close neighbourhood of each node and thus are locally computable. We present algorithms that implement these conditions, and show through extensive simulations, their performances. Now from a practical point of view, distinguishing isolated errors from massive ones is of utmost importance for networks providers. For instance, for Internet service providers that operate millions of home gateways, it would be very interesting to have procedures that allow gateways to self distinguish whether their dysfunction is caused by network-level errors or by their own hardware or software, and to notify the service provider only in the latter case.

【Keywords】: local algorithms; Error detection; large scale systems

8. SAND: A Fault-Tolerant Streaming Architecture for Network Traffic Analytics.

【Paper Link】【Pages】:80-87

【Authors】: Qin Liu ; John C. S. Lui ; Cheng He ; Lujia Pan ; Wei Fan ; Yunlong Shi

【Abstract】: Many long-running network analytics applications impose a high-throughput and high reliability requirements on stream processing systems. However, previous stream processing systems cannot sustain high-speed traffic at the core router level. Furthermore, their fault-tolerant schemes cannot provide strong consistency which is essential for network analytics. In this paper, we present the design and implementation of SAND, a fault-tolerant distributed stream processing system for network analytics. SAND is designed to operate under high-speed network traffic, and it uses a novel check pointing protocol which can perform failure recovery based on upstream backup and check pointing. We prove our fault-tolerant scheme provides strong consistency even under multiple node failure. We implement several real-world network analytics applications on SAND, evaluate their performance using network traffic captured from commercial cellular core networks, and demonstrate that SAND can sustain high-speed network traffic and that our fault-tolerant scheme is efficient.

【Keywords】: fault-tolerance; stream processing; network analytics

9. An Adaptable Rule Placement for Software-Defined Networks.

【Paper Link】【Pages】:88-99

【Authors】: Shuyuan Zhang ; Franjo Ivancic ; Cristian Lumezanu ; Yifei Yuan ; Aarti Gupta ; Sharad Malik

【Abstract】: There is a strong trend in networking to move towards Software-Defined Networks (SDN). SDNs enable easier network configuration through a separation between a centralized controller and a distributed data plane comprising a network of switches. The controller implements network policies through installing rules on switches. Recently the "Big Switch" abstraction [1] was proposed as a specification mechanism for high-level network behavior, i.e., the network policies. The network operating system or compiler can use his specification for placing rules on individual switches. However, this is constrained by the limited capacity of the Ternary Content Addressable Memories (TCAMs) used for rules in each switch. We propose an Integer Linear Programming (ILP) based solution for placing rules on switches for a given firewall policy while optimizing for the total number of rules and meeting the switch capacity constraints. Experimental results demonstrate that our approach is scalable to practical sized networks.

【Keywords】: Distributed Firewall; SDN; Big Switch Abstraction; Rule Placement

Session 2A: Software Vulnerabilities 3

10. Detecting Malicious Javascript in PDF through Document Instrumentation.

【Paper Link】【Pages】:100-111

【Authors】: Daiping Liu ; Haining Wang ; Angelos Stavrou

【Abstract】: An emerging threat vector, embedded malware inside popular document formats, has become rampant since 2008. Owed to its wide-spread use and Javascript support, PDF has been the primary vehicle for delivering embedded exploits. Unfortunately, existing defenses are limited in effectiveness, vulnerable to evasion, or computationally expensive to be employed as an on-line protection system. In this paper, we propose a context-aware approach for detection and confinement of malicious Javascript in PDF. Our approach statically extracts a set of static features and inserts context monitoring code into a document. When an instrumented document is opened, the context monitoring code inside will cooperate with our runtime monitor to detect potential infection attempts in the context of Javascript execution. Thus, our detector can identify malicious documents by using both static and runtime features. To validate the effectiveness of our approach in a real world setting, we first conduct a security analysis, showing that our system is able to remain effective in detection and be robust against evasion attempts even in the presence of sophisticated adversaries. We implement a prototype of the proposed system, and perform extensive experiments using 18623 benign PDF samples and 7370 malicious samples. Our evaluation results demonstrate that our approach can accurately detect and confine malicious Javascript in PDF with minor performance overhead.

【Keywords】: document instrumentation; Malcode bearing PDF; malicious Javascript; malware detection and confinement

11. Scriptless Timing Attacks on Web Browser Privacy.

【Paper Link】【Pages】:112-123

【Authors】: Bin Liang ; Wei You ; Liangkun Liu ; Wenchang Shi ; Mario Heiderich

【Abstract】: The existing Web timing attack methods are heavily dependent on executing client-side scripts to measure the time. However, many techniques have been proposed to block the executions of suspicious scripts recently. This paper presents a novel timing attack method to sniff users' browsing histories without executing any scripts. Our method is based on the fact that when a resource is loaded from the local cache, its rendering process should begin earlier than when it is loaded from a remote website. We leverage some Cascading Style Sheets (CSS) features to indirectly monitor the rendering of the target resource. Three practical attack vectors are developed for different attack scenarios and applied to six popular desktop and mobile browsers. The evaluation shows that our method can effectively sniff users' browsing histories with very high precision. We believe that modern browsers protected by script-blocking techniques are still likely to suffer serious privacy leakage threats.

【Keywords】: browsing history; timing attack; scriptless attack; Web privacy

12. Automatically Fixing C Buffer Overflows Using Program Transformations.

【Paper Link】【Pages】:124-135

【Authors】: Alex Shaw ; Dusten Doggett ; Munawar Hafiz

【Abstract】: Fixing C buffer overflows at source code level remains a manual activity, at best semi-automated. We present an automated approach to fix buffer overflows by describing two program transformations that automatically introduce two well-known security solutions to C source code. The transformations embrace the difficulties of correctly analyzing and modifying C source code considering pointers and aliasing. They are effective: they fixed all buffer overflows featured in 4,505 programs of NIST's SAMATE reference dataset, making the changes automatically on over 2.3 million lines of code (MLOC). They are also safe: we applied them to make hundreds of changes on four open source programs (1.7 MLOC) without breaking the programs. Automated transformations such as these can be used by developers during coding, and by maintainers to fix problems in legacy code. They can be applied on a case by case basis, or as a batch to fix the root causes behind buffer overflows, thereby improving the dependability of systems.

【Keywords】: security; buffer; overflow; dependability

Session 2B: Cyber-Physical Systems 3

13. Application-Level Autonomic Hardware to Predict and Preempt Software Attacks on Industrial Control Systems.

【Paper Link】【Pages】:136-147

【Authors】: Lee W. Lerner ; Zane R. Franklin ; William T. Baumann ; Cameron D. Patterson

【Abstract】: We mitigate malicious software threats to industrial control systems, not by bolstering perimeter security, but rather by using application-specific configurable hardware to monitor and possibly override software operations in real time at the lowest (I/O pin) level of a system-on-chip platform containing a micro controller augmented with configurable logic. The process specifications, stability-preserving backup controller, and switchover logic are specified and formally verified as C code commonly used in control systems, but synthesized into hardware to resist software reconfiguration attacks. In addition, a copy of the production controller task is optionally implemented in an on-chip, isolated soft processor, connected to a model of the physical process, and accelerated to preview what the controller will attempt to do in the near future. This prediction provides greater assurance that the backup controller can be invoked before the physical process becomes unstable. Adding trusted, application-tailored, software-invisible, autonomic hardware is well-supported in a commercial system-on-chip platform.

【Keywords】: formal analysis; industrial control system security; software threats; hardware root-of-trust

14. Monitor Based Oracles for Cyber-Physical System Testing: Practical Experience Report.

【Paper Link】【Pages】:148-155

【Authors】: Aaron Kane ; Thomas E. Fuhrman ; Philip Koopman

【Abstract】: Testing Cyber-Physical Systems is becoming increasingly challenging as they incorporate advanced autonomy features. We investigate using an external runtime monitor as a partial test oracle to detect violations of critical system behavioral requirements on an automotive development platform. Despite limited source code access and using only existing network messages, we were able to monitor a hardware-in-the-loop vehicle simulator and analyze prototype vehicle log data to detect violations of high-level critical properties. Interface robustness testing was useful to further exercise the monitors. Beyond demonstrating feasibility, the experience emphasized a number of remaining research challenges, including: approximating system intent based on limited system state observability, how to best balance the simplicity and expressiveness of the specification language used to define monitored properties, how to warm up monitoring of system variable state after mode change discontinuities, and managing the differences between simulation and real vehicles when conducting such tests.

【Keywords】: cyber-physical systems; runtime monitoring; testing

15. Security Threat Analytics and Countermeasure Synthesis for Power System State Estimation.

【Paper Link】【Pages】:156-167

【Authors】: Mohammad Ashiqur Rahman ; Ehab Al-Shaer ; Rajesh G. Kavasseri

【Abstract】: State estimation plays a critically important role in ensuring the secure and reliable operation of the power grid. However, recent works have shown that the widely used weighted least squares (WLS) estimator, which uses several system wide measurements, is vulnerable to cyber attacks wherein an adversary can alter certain measurements to corrupt the estimator's solution, but evade the estimator's existing bad data detection algorithms and thus remain invisible to the system operator. Realistically, such a stealthy attack in its most general form has several constraints, particularly in terms of an adversary's knowledge and resources for achieving a desired attack outcome. In this light, we present a formal framework to systematically investigate the feasibility of stealthy attacks considering constraints of the adversary. In addition, unlike prior works, our approach allows the modeling of attacks on topology mappings, where an adversary can drastically strengthen stealthy attacks by intentionally introducing topology errors. Moreover, we show that this framework allows an operator to synthesize cost-effective countermeasures based on given resource constraints and security requirements in order to resist stealthy attacks. The proposed approach is illustrated on standard IEEE test cases.

【Keywords】: Formal Method; Power Grid; State Estimation; False Data Injection Attack

Session 3A: Apps Attacks 3

16. You Can Call but You Can't Hide: Detecting Caller ID Spoofing Attacks.

【Paper Link】【Pages】:168-179

【Authors】: Hossen A. Mustafa ; Wenyuan Xu ; Ahmad-Reza Sadeghi ; Steffen Schulz

【Abstract】: Caller ID (caller identification) is a service provided by telephone carriers to transmit the phone number and/or the name of a caller to a callee. Today, most people trust the caller ID information, and it is increasingly used to authenticate customers (e.g., by banks or credit card companies). However, with the proliferation of smartphones and VoIP, it is easy to spoof caller ID by installing corresponding Apps on smartphones or by using fake ID providers. As telephone networks are fragmented between enterprises and countries, no mechanism is available today to easily detect such spoofing attacks. This vulnerability has already been exploited with crucial consequences such as faking caller IDs to emergency services (e.g., 9-1-1) or to commit fraud. In this paper, we propose an end-to-end caller ID verification mechanism CallerDec that works with existing combinations of landlines, cellular and VoIP networks. CallerDec can be deployed at the liberty of users, without any modification to the existing infrastructures. We implemented our scheme as an App for Android-based phones and validated the effectiveness of our solution in detecting spoofing attacks in various scenarios.

【Keywords】: Caller ID Spoofing; End-user Security

17. On Tracking Information Flows through JNI in Android Applications.

【Paper Link】【Pages】:180-191

【Authors】: Chenxiong Qian ; Xiapu Luo ; Yuru Shao ; Alvin T. S. Chan

【Abstract】: Android provides native development kit through JNI for developing high-performance applications (or simply apps). Although recent years have witnessed a considerable increase in the number of apps employing native libraries, only a few systems can examine them. However, none of them scrutinizes the interactions through JNI in them. In this paper, we conduct a systematic study on tracking information flows through JNI in apps. More precisely, we first perform a large-scale examination on apps using JNI and report interesting observations. Then, we identify scenarios where information flows uncaught by existing systems can result in information leakage. Based on these insights, we propose and implement NDroid, an efficient dynamic taint analysis system for checking information flows through JNI. The evaluation through real apps shows NDroid can effectively identify information leaks through JNI with low performance overheads.

【Keywords】: Java; Libraries; Context; Androids; Humanoid robots; Engines; Games

18. Optical Delusions: A Study of Malicious QR Codes in the Wild.

【Paper Link】【Pages】:192-203

【Authors】: Amin Kharraz ; Engin Kirda ; William K. Robertson ; Davide Balzarotti ; Aurélien Francillon

【Abstract】: QR codes, a form of 2D barcode, allow easy interaction between mobile devices and websites or printed material by removing the burden of manually typing a URL or contact information. QR codes are increasingly popular and are likely to be adopted by malware authors and cyber-criminals as well. In fact, while a link can "look" suspicious, malicious and benign QR codes cannot be distinguished by simply looking at them. However, despite public discussions about increasing use of QR codes for malicious purposes, the prevalence of malicious QR codes and the kinds of threats they pose are still unclear. In this paper, we examine attacks on the Internet that rely on QR codes. Using a crawler, we performed a large-scale experiment by analyzing QR codes across 14 million unique web pages over a ten-month period. Our results show that QR code technology is already used by attackers, for example to distribute malware or to lead users to phishing sites. However, the relatively few malicious QR codes we found in our experiments suggest that, on a global scale, the frequency of these attacks is not alarmingly high and users are rarely exposed to the threats distributed via QR codes while surfing the web.

【Keywords】: phishing; Mobile devices; malicious QR codes; malware

Session 3B: Memory 3

19. A Reliable 3D MLC PCM Architecture with Resistance Drift Predictor.

【Paper Link】【Pages】:204-215

【Authors】: Majid Jalili ; Mohammad Arjomand ; Hamid Sarbazi-Azad

【Abstract】: In this paper, we study the problem of resistance drift in an MLC Phase Change Memory (PCM) and propose a solution to circumvent its thermally-affected accelerated rate in 3D CMPs. Our scheme is based on the observation that instead of alleviating the problem of resistance drift by using large margins or error correction codes, the PCM read circuit can be reconfigured for tolerating most of the resistance drift errors in a dynamic manner. Through detailed characterization of memory access patterns for 22 applications, we propose an efficient mechanism to facilitate such reliable read scheme via tolerating (a) early-cycle resistance drifts by using narrow margins so that considerably saving energy of writes and improving cell endurance, and (b) late-cycle resistance drifts by accurately estimating resistance thresholds that separate states for sensing. Evaluations on a true 3D architecture, consisting of a 4-core CMP and a banked 2-bit PCM memory, show that our proposal provides 10⁶ × lower error rate compared to the state-of-the-art designs of PCMs.

【Keywords】: Reliability; Phase Change memory; Resistance Drift; Chip-Multiprocessor

20. Mitigating Write Disturbance in Super-Dense Phase Change Memories.

【Paper Link】【Pages】:216-227

【Authors】: Lei Jiang ; Youtao Zhang ; Jun Yang

【Abstract】: Constructing a highly scalable and dense main memory subsystem with large access bandwidth has become a major challenge for modern computing systems. Traditional memory technologies, like DRAM and NAND Flash, suffer from either poor scalability or limited access bandwidth. Recent studies have identified emerging Phase Change Memory (PCM) as one of the most promising low power main memory technology candidates, because of its short read latency and good scalability. However, PCM still faces serious write disturbance problem below 20nm technology. Write disturbance leads to more cell programming errors, and thus degrades write reliability. Simple solutions, such as allocating large inter-cell space and adopting strong error correction code (ECC), either reduce memory density or incur large performance overhead. In this paper, we propose DIN, a Data encoding based Insulation technique, to mitigate write disturbance in highly dense PCMs. DIN improves memory density by eliminating inter-cell thermal band along a word line. The non-negligible disturbance errors, are then minimized by disturbance-aware data encoding, based on how PCM cells are programmed at device level. Our experimental results show that DIN gains write disturbance resistance in high density PCM chips while achieving comparable performance for a wide range of applications.

【Keywords】: Error Correcting Code; Phase Change Memories; Write Disturbance

21. WL-Reviver: A Framework for Reviving any Wear-Leveling Techniques in the Face of Failures on Phase Change Memory.

【Paper Link】【Pages】:228-239

【Authors】: Jie Fan ; Song Jiang ; Jiwu Shu ; Long Sun ; Qingda Hu

【Abstract】: While Phase Change Memory (PCM) has emerged as one of most promising complements or even replacements of DRAM-based memory, it has only limited write endurance. Because of uneven write distribution, PCM is highly likely to have early failures, which can spread over the chip space and leave the entire chip unusable. Wear leveling is an indispensable technique to even out wear caused by the writes. However, because of process variation early failure cannot be fully avoided. State-of-the-art wear-leveling schemes, such as Start-Gap and Security Refresh, cease to function once even a single block failure occurs because their designs require persistent writ able address space for wear leveling operations. Existent solutions attempting to address the problem demand substantial OS supports, such as explicit space allocations and data migrations. The demand on substantial OS cooperation creates a barrier to widespread adoption of the PCM technique. While fault-tolerance techniques, such as FREE-p and zombie, that remap failed blocks to inaccessible but healthy space have the potential to address the wear-leveling issue by relocating data from failed blocks to healthy ones, they cannot work together with the wear-leveling schemes as data migration may change placement of relocated data. In this paper, we propose a framework, WL-Reviver, that allows any in-PCM wear-leveling scheme to keep delivering its designed leveling service even after failures occur in its working address space. The design is unique on two aspects: (1) it leverages the fault-tolerance techniques so that they can work together with the wear leveling schemes, and (2) it requires no OS supports additional to what're available to today's DRAM-based memory system. Furthermore, WL-Reviver is a lightweight framework of very low overhead. Our extensive experiments show that WLReviver can efficiently revive a wear-leveling scheme without compromising the scheme's wear-leveling effect.

【Keywords】: Wear Leveling; PCM; Fault Tolerance

Session 4A: Cloud Computing 4

22. Performance Sensitive Replication in Geo-distributed Cloud Datastores.

【Paper Link】【Pages】:240-251

【Authors】: Shankaranarayanan P. N. ; Ashiwan Sivakumar ; Sanjay G. Rao ; Mohit Tawarmalani

【Abstract】: Modern web applications face stringent requirements along many dimensions including latency, scalability, and availability. In response, several geo-distributed cloud data stores have emerged in recent years. Customizing data stores to meet application SLAs is challenging given the scale of applications, and their diverse and dynamic workloads. In this paper, we tackle these challenges in the context of quorum-based systems (e.g. Amazon Dynamo, Cassandra), an important class of cloud storage systems. We present models that optimize percentiles of response time under normal operation and under a data-center (DC) failure. Our models consider factors like the geographic spread of users, DC locations, consistency requirements and inter-DC communication costs. We evaluate our models using real-world traces of three applications: Twitter, Wikipedia and Go Walla on a Cassandra cluster deployed in Amazon EC2. Our results confirm the importance and effectiveness of our models, and highlight the benefits of customizing replication in cloud datastores.

【Keywords】: networks; geo-replicated datastores; cloud storage; distributed systems

23. POD-Diagnosis: Error Diagnosis of Sporadic Operations on Cloud Applications.

【Paper Link】【Pages】:252-263

【Authors】: Xiwei Xu ; Liming Zhu ; Ingo Weber ; Len Bass ; Daniel Sun

【Abstract】: Applications in the cloud are subject to sporadic changes due to operational activities such as upgrade, redeployment, and on-demand scaling. These operations are also subject to interferences from other simultaneous operations. Increasing the dependability of these sporadic operations is non-trivial, particularly since traditional anomaly-detection-based diagnosis techniques are less effective during sporadic operation periods. A wide range of legitimate changes confound anomaly diagnosis and make baseline establishment for "normal" operation difficult. The increasing frequency of these sporadic operations (e.g. due to continuous deployment) is exacerbating the problem. Diagnosing failures during sporadic operations relies heavily on logs, while log analysis challenges stemming from noisy, inconsistent and voluminous logs from multiple sources remain largely unsolved. In this paper, we propose Process Oriented Dependability (POD)-Diagnosis, an approach that explicitly models these sporadic operations as processes. These models allow us to (i) determine orderly execution of the process, and (ii) use the process context to filter logs, trigger assertion evaluations, visit fault trees and perform on-demand assertion evaluation for online error diagnosis and root cause analysis. We evaluated the approach on rolling upgrade operations in Amazon Web Services (A WS) while performing other simultaneous operations. During our evaluation, we correctly detected all of the 160 injected faults, as well as 46 interferences caused by concurrent operations. We did this with 91.95% precision. Of the correctly detected faults, the accuracy rate of error diagnosis is 96.55%.

【Keywords】: DevOps; system administration; cloud; deployment; process mining; error detection; error diagnosis

24. Catch Me If You Can: A Cloud-Enabled DDoS Defense.

【Paper Link】【Pages】:264-275

【Authors】: Quan Jia ; Huangxin Wang ; Dan Fleck ; Fei Li ; Angelos Stavrou ; Walter Powell

【Abstract】: We introduce a cloud-enabled defense mechanism for Internet services against network and computational Distributed Denial-of-Service (DDoS) attacks. Our approach performs selective server replication and intelligent client re-assignment, turning victim servers into moving targets for attack isolation. We introduce a novel system architecture that leverages a "shuffling" mechanism to compute the optimal re-assignment strategy for clients on attacked servers, effectively separating benign clients from even sophisticated adversaries that persistently follow the moving targets. We introduce a family of algorithms to optimize the runtime client-to-server re-assignment plans and minimize the number of shuffles to achieve attack mitigation. The proposed shuffling-based moving target mechanism enables effective attack containment using fewer resources than attack dilution strategies using pure server expansion. Our simulations and proof-of-concept prototype using Amazon EC2 [1] demonstrate that we can successfully mitigate large-scale DDoS attacks in a small number of shuffles, each of which incurs a few seconds of user-perceived latency.

【Keywords】: Cloud; DDoS; Moving Target Defense; Shuffling

25. Secure Ranked Multi-keyword Search for Multiple Data Owners in Cloud Computing.

【Paper Link】【Pages】:276-286

【Authors】: Wei Zhang ; Sheng Xiao ; Yaping Lin ; Ting ; Si-wang Zhou

【Abstract】: With the advent of cloud computing, it becomes increasingly popular for data owners to outsource their data to public cloud servers while allowing data users to retrieve these data. For privacy concerns, secure searches over encrypted cloud data motivated several researches under the single owner model. However, most cloud servers in practice do not just serve one owner, instead, they support multiple owners to share the benefits brought by cloud servers. In this paper, we propose schemes to deal with secure ranked multi-keyword search in a multi-owner model. To enable cloud servers to perform secure search without knowing the actual data of both keywords and trapdoors, we systematically construct a novel secure search protocol. To rank the search results and preserve the privacy of relevance scores between keywords and files, we propose a novel Additive Order and Privacy Preserving Function family. Extensive experiments on real-world datasets confirm the efficacy and efficiency of our proposed schemes.

【Keywords】: privacy and additive order preserving; cloud computing; secure ranked keyword search; multiple data owners

【Paper Link】【Pages】:287-298

【Authors】: Xiaojing Liao ; A. Selcuk Uluagac ; Raheem A. Beyah

【Abstract】: Mobile social services utilize profile matching to help users find friends with similar social attributes (e.g., interests, location, background). However, privacy concerns often hinder users from enabling this functionality. In this paper, we introduce S-MATCH, a novel framework for privacy-preserving profile matching based on property-preserving encryption (PPE). First, we illustrate that PPE should not be considered secure when directly used on social attribute data due to its key-sharing problem and information leakage problem. Then, we address the aforementioned problems of applying PPE to social network data and develop an efficient and verifiable privacy-preserving profile matching scheme. We implement both the client and server portions of S-MATCH and evaluate its performance under three real-world social network datasets. The results show that S-MATCH can achieve at least one order of magnitude better computational performance than the techniques that use homomorphic encryption.

【Keywords】: symmetric encryption; profile matching; privacy; property-preserving encryption

【Paper Link】【Pages】:299-310

【Authors】: Murtuza Jadliwala ; Anindya Maiti ; Vinod Namboodiri

【Abstract】: The increasing popularity of online social networks (OSNs) is spawning new security and privacy concerns. Currently, a majority of OSNs offer very naive access control mechanisms that are primarily based on static access control lists (ACL) or policies. But as the number of social connections grow, static ACL based approaches become ineffective and unappealing to OSN users. There is an increased need in social-networking and data-sharing applications to control access to data based on the associated context (e.g., event, location, and users involved), rather than solely on data ownership and social connections. Surveillance is another critical concern for OSN users, as the service provider may further scrutinize data posted or shared by users for personal gains (e.g., targeted advertisements), for use by corporate partners or to comply with legal orders. In this paper, we introduce a novel paradigm of context-based access control in OSNs, where users are able to access the shared data only if they have knowledge of the context associated with it. We propose two constructions for context-based access control in OSNs: the first is based on a novel application of Shamir's secret sharing scheme, whereas the second makes use of an attribute-based encryption scheme. For both constructions, we analyze their security properties, implement proof-of-concept applications for Facebook and empirically evaluate their functionality and performance. Our empirical measurements show that the proposed constructions execute efficiently on standard computing hardware, as well as, on portable mobile devices.

【Keywords】: Surveillance Resistance; Online Social Networks; Access Control; Privacy

28. Mining Historical Issue Repositories to Heal Large-Scale Online Service Systems.

【Paper Link】【Pages】:311-322

【Authors】: Rui Ding ; Qiang Fu ; Jian-Guang Lou ; Qingwei Lin ; Dongmei Zhang ; Tao Xie

【Abstract】: Online service systems have been increasingly popular and important nowadays. Reducing the MTTR (Mean Time to Restore) of a service remains one of the most important steps to assure the user-perceived availability of the service. To reduce the MTTR, a common practice is to restore the service by identifying and applying an appropriate healing action. In this paper, we present an automated mining-based approach for suggesting an appropriate healing action for a given new issue. Our approach suggests an appropriate healing action by adapting healing actions from the retrieved similar historical issues. We have applied our approach to a real-world and large-scale product online service. The studies on 243 real issues of the service show that our approach can effectively suggest appropriate healing actions (with 87% accuracy) to reduce the MTTR of the service. In addition, according to issue characteristics, we further study and categorize issues where automatic healing suggestion faces difficulties.

【Keywords】: incident management; Online service system; healing action; issue repository

29. Understanding Interoperability Issues of Web Service Frameworks.

【Paper Link】【Pages】:323-330

【Authors】: Ivano Alessandro Elia ; Nuno Laranjeiro ; Marco Vieira

【Abstract】: Web Services are a set of technologies designed to support the invocation of remote services by client applications, with the key goal of providing interoperable application-to-application interaction while supporting vendor and platform independence. The goal of this work is to study the real level of interoperability provided by these technologies through a massive experimental campaign involving a wide set of very popular frameworks for web services, implemented using seven different programming languages. We have tested the inter-operation of eleven client-side framework subsystems with three of the most widely used server-side implementations, each one hosting thousands of different services. The results highlight numerous situations where the goal of interoperability between different frameworks is not met due to problems both on the client and the server side. Moreover, we have identified issues also affecting interactions between the client and server subsystems of the same framework.

【Keywords】: web service framework; web service; interoperability; WS-I Basic Profile

Session 5A: State Machine Replication 3

30. Scalable State-Machine Replication.

【Paper Link】【Pages】:331-342

【Authors】: Carlos Eduardo Benevides Bezerra ; Fernando Pedone ; Robbert van Renesse

【Abstract】: State machine replication (SMR) is a well-known technique able to provide fault-tolerance. SMR consists of sequencing client requests and executing them against replicas in the same order, thanks to deterministic execution, every replica will reach the same state after the execution of each request. However, SMR is not scalable since any replica added to the system will execute all requests, and so throughput does not increase with the number of replicas. Scalable SMR (S-SMR) addresses this issue in two ways: (i) by partitioning the application state, while allowing every command to access any combination of partitions, and (ii) by using a caching algorithm to reduce the communication across partitions. We describe Eyrie, a library in Java that implements S-SMR, and Volery, an application that implements Zookeeper's API. We assess the performance of Volery and compare the results against Zookeeper. Our experiments show that Volery scales throughput with the number of partitions.

【Keywords】: Servers; Throughput; Partitioning algorithms; Law; Real-time systems; Optimization

31. Clock-RSM: Low-Latency Inter-datacenter State Machine Replication Using Loosely Synchronized Physical Clocks.

【Paper Link】【Pages】:343-354

【Authors】: Jiaqing Du ; Daniele Sciascia ; Sameh Elnikety ; Willy Zwaenepoel ; Fernando Pedone

【Abstract】: This paper proposes Clock-RSM, a new state machine replication protocol that uses loosely synchronized physical clocks to totally order commands for geo-replicated services. Clock-RSM assumes realistic non-uniform latencies among replicas located at different data centers. It provides low-latency linearizable replication by overlapping 1) logging a command at a majority of replicas, 2) determining the stable order of the command from the farthest replica, and 3) notifying the commit of the command to all replicas. We evaluate Clock-RSM analytically and derive the expected command replication latency. We also evaluate the protocol experimentally using a geo-replicated key-value store deployed across multiple Amazon EC2 data centers.

【Keywords】: Clocks; Protocols; Synchronization; Servers; Detectors; Optimization; Complexity theory

32. State Machine Replication for the Masses with BFT-SMART.

【Paper Link】【Pages】:355-362

【Authors】: Alysson Neves Bessani ; João Sousa ; Eduardo Adílio Pelinson Alchieri

【Abstract】: The last fifteen years have seen an impressive amount of work on protocols for Byzantine fault-tolerant (BFT) state machine replication (SMR). However, there is still a need for practical and reliable software libraries implementing this technique. BFT-SMART is an open-source Java-based library implementing robust BFT state machine replication. Some of the key features of this library that distinguishes it from similar works (e.g., PBFT and UpRight) are improved reliability, modularity as a first-class property, multicore-awareness, reconfiguration support and a flexible programming interface. When compared to other SMR libraries, BFT-SMART achieves better performance and is able to withstand a number of real-world faults that previous implementations cannot.

【Keywords】: byzantine fault tolerance; state machine replication

Session 5B: Faults 3

33. Hardware-Software Integrated Diagnosis for Intermittent Hardware Faults.

【Paper Link】【Pages】:363-374

【Authors】: Majid Dadashi ; Layali Rashid ; Karthik Pattabiraman ; Sathish Gopalakrishnan

【Abstract】: Intermittent hardware faults are hard to diagnose as they occur non-deterministically at the same location. Hardware-only diagnosis techniques incur significant power and area overheads. On the other hand, software-only diagnosis techniques have low power and area overheads, but have limited visibility into many micro-architectural structures and hence cannot diagnose faults in them. To overcome these limitations, we propose a hardware-software integrated framework for diagnosing intermittent faults. The hardware part of our framework, called SCRIBE continuously records the resource usage information of every instruction in the processor, and exposes it to the software layer. SCRIBE incurs a performance overhead of 12% and power overhead of 9%, on average. The software part of our framework is called SIED and uses backtracking from the program's crash dump to find the faulty micro-architectural resource. Our technique has an average accuracy of 84% in diagnosing the faulty resource, which in turn enables fine-grained deconfiguration with less than 2% performance loss after deconfiguration.

【Keywords】: Hardware/Software Co-design; Intermittent Faults; Backtracking; Dynamic Dependence Graphs

34. Quantifying the Accuracy of High-Level Fault Injection Techniques for Hardware Faults.

【Paper Link】【Pages】:375-382

【Authors】: Jiesheng Wei ; Anna Thomas ; Guanpeng Li ; Karthik Pattabiraman

【Abstract】: Hardware errors are on the rise with reducing feature sizes, however tolerating them in hardware is expensive. Researchers have explored software-based techniques for building error resilient applications. Many of these techniques leverage application-specific resilience characteristics to keep overheads low. Understanding application-specific resilience characteristics requires software fault-injection mechanisms that are both accurate and capable of operating at a high-level of abstraction to allow developers to reason about error resilience. In this paper, we quantify the accuracy of high-level software fault injection mechanisms vis-à-vis those that operate at the assembly or machine code levels. To represent high-level injection mechanisms, we built a fault injector tool based on the LLVM compiler, called LLFI. LLFI performs fault injection at the LLVM intermediate code level of the application, which is close to the source code. We quantitatively evaluate the accuracy of LLFI with respect to assembly level fault injection, and understand the reasons for the differences.

【Keywords】: comparison; Fault injection; LLVM; PIN

35. Hard Drive Failure Prediction Using Classification and Regression Trees.

【Paper Link】【Pages】:383-394

【Authors】: Jing Li ; Xinpu Ji ; Yuhan Jia ; Bingpeng Zhu ; Gang Wang ; Zhongwei Li ; Xiaoguang Liu

【Abstract】: Some statistical and machine learning methods have been proposed to build hard drive prediction models based on the SMART attributes, and have achieved good prediction performance. However, these models were not evaluated in the way as they are used in real-world data centers. Moreover, the hard drives deteriorate gradually, but these models can not describe this gradual change precisely. This paper proposes new hard drive failure prediction models based on Classification and Regression Trees, which perform better in prediction performance as well as stability and interpretability compared with the state-of the-art model, the Back propagation artificial neural network model. Experiments demonstrate that the Classification Tree (CT) model predicts over 95% of failures at a false alarm rate (FAR) under 0.1% on a real-world dataset containing 25,792 drives. Aiming at the practical application of prediction models, we test them with different drive families, with fewer number of drives, and with different model updating strategies. The CT model still shows steady and good performance. We propose a health degree model based on Regression Tree (RT) as well, which can give the drive a health assessment rather than a simple classification result. Therefore, the approach can deal with warnings raised by the prediction model in order of their health degrees. We implement a reliability model for RAID-6 systems with proactive fault tolerance and show that our CT model can significantly improve the reliability and/or reduce construction and maintenance cost of large-scale storage systems.

【Keywords】: Health degree; Hard drive failure prediction; SMART; CART

Session 6A: Databases and Storage 3

36. Developing Correctly Replicated Databases Using Formal Tools.

【Paper Link】【Pages】:395-406

【Authors】: Nicolas Schiper ; Vincent Rahli ; Robbert van Renesse ; Mark Bickford ; Robert L. Constable

【Abstract】: Fault-tolerant distributed systems often contain complex error handling code. Such code is hard to test or model-check because there are often too many possible failure scenarios to consider. As we will demonstrate in this paper, formal methods have evolved to a state in which it is possible to generate this code along with correctness guarantees. This paper describes our experience with building highly-available databases using replication protocols that were generated with the help of correct-by-construction formal methods. The goal of our project is to obtain databases with unsurpassed reliability while providing good performance. We report on our experience using a total order broadcast protocol based on Paxos and specified using a new formal language called Event ML. We compile Event ML specifications into a form that can be formally verified while simultaneously obtaining code that can be executed. We have developed two replicated databases based on this code and show that they have performance that is competitive with popular databases in one of the two considered benchmarks.

【Keywords】: formal tools; Fault-tolerance; database replication; correct-by-construction distributed protocols

37. The Energy Efficiency of Database Replication Protocols.

【Paper Link】【Pages】:407-418

【Authors】: Nicolas Schiper ; Fernando Pedone ; Robbert van Renesse

【Abstract】: Replication is a widely used technique to provide high-availability to online services. While being an effective way to mask failures, replication comes at a price: at least twice as much hardware and energy are required to mask a single failure. In a context where the electricity drawn by data centers worldwide is increasing each year, there is a need to maximize the amount of useful work done per Joule, a metric denoted as energy efficiency. In this paper, we review commonly-used database replication protocols and experimentally measure their energy efficiency. We observe that the most efficient replication protocol achieves less than 60% of the energy efficiency of a stand-alone server on the TPC-C benchmark. We identify algorithmic techniques that can be used by any protocol to improve its efficiency. Some approaches improve performance, others lower power consumption. Of particular interest is a technique derived from primary-backup replication that implements a transaction log on low-power backups. We demonstrate how this approach can lead to an energy efficiency that is 79% of the one of a stand-alone server. This constitutes an important step towards reconciling replication with energy efficiency.

【Keywords】: energy efficiency; Fault-tolerance; database replication

38. Degraded-First Scheduling for MapReduce in Erasure-Coded Storage Clusters.

【Paper Link】【Pages】:419-430

【Authors】: Runhui Li ; Patrick P. C. Lee ; Yuchong Hu

【Abstract】: We have witnessed an increasing adoption of erasure coding in modern clustered storage systems to reduce the storage overhead of traditional 3-way replication. However, it remains an open issue of how to customize the data analytics paradigm for erasure-coded storage, especially when the storage system operates in failure mode. We propose degraded-first scheduling, a new MapReduce scheduling scheme that improves MapReduce performance in erasure-coded clustered storage systems in failure mode. Its main idea is to launch degraded tasks earlier so as to leverage the unused network resources. We conduct mathematical analysis and discrete event simulation to show the performance gain of degraded-first scheduling over Hadoop's default locality-first scheduling. We further implement degraded-first scheduling on Hadoop and conduct test bed experiments in a 13-node cluster. We show that degraded-first scheduling reduces the MapReduce runtime of locality-first scheduling.

【Keywords】: Encoding; Switches; Scheduling; Runtime; Algorithm design and analysis; Mathematical analysis; Availability

Session 6B: GPUs 3

39. Warped-Shield: Tolerating Hard Faults in GPGPUs.

【Paper Link】【Pages】:431-442

【Authors】: Waleed Dweik ; Mohammad Abdel-Majeed ; Murali Annavaram

【Abstract】: Graphics processing units (GPUs) are rapidly becoming the parallel accelerators of choice to run general purpose applications. GPUs that run general purpose applications are termed as GPGPUs. Many mission-critical and long-running scientific application are being ported to run on GPGPUs. These applications demand strong computational integrity. GPGPUs, like many other digital components, face imminent reliability threats due to technology scaling. Of particular concern is the infield hard faults that are persistent and irreversible. GPGPUs comprise of dozens of streaming processors where each streaming processor employs tens of execution units, organized as single instruction multiple thread (SIMT) lanes to deliver massive parallel computational power. In this paper we exploit the massive replication of SIMT lanes to tolerate infield hard faults. First, we introduce thread shuffling to reroute threads, originally mapped to faulty SIMT lanes, to idle healthy lanes. Thread shuffling is insufficient when the number of healthy SIMT lanes is fewer than the number of active threads. To broaden the reach of thread shuffling, we propose dynamic warp deformation to split the warp into multiple sub-warps, each sub-warp uses fewer SIMT lanes thereby providing more opportunities to avoid using a faulty SIMT lane. Finally, we propose warp shuffling which exploits non-uniform degradation of different streaming processors by scheduling a warp to a streaming processor that requires fewer warp splits. Hence, warp shuffling helps to reduce the performance overhead associated with dynamic warp deformation. By deploying the proposed techniques, we can tolerate the worst case scenario of having up to three hard faults per four SIMT lane cluster with at most 36%performance degradation.

【Keywords】: warp shuffling; Single instruction multiple threads (SIMT); thread shuffling; warp deformation

40. A-ABFT: Autonomous Algorithm-Based Fault Tolerance for Matrix Multiplications on Graphics Processing Units.

【Paper Link】【Pages】:443-454

【Authors】: Claus Braun ; Sebastian Halder ; Hans-Joachim Wunderlich

【Abstract】: Graphics processing units (GPUs) enable large-scale scientific applications and simulations on the desktop. To allow scientific computing on GPUs with high performance and reliability requirements, the application of software-based fault tolerance is attractive. Algorithm-Based Fault Tolerance (ABFT) protects important scientific operations like matrix multiplications. However, the application to floating-point operations necessitates the runtime classification of errors into inevitable rounding errors, allowed compute errors in the magnitude of such rounding errors, and into critical errors that are larger than those and not tolerable. Hence, an ABFT scheme needs suitable rounding error bounds to detect errors reliably. The determination of such error bounds is a highly challenging task, especially since it has to be integrated tightly into the algorithm and executed autonomously with low performance overhead. In this work, A-ABFT for matrix multiplications on GPUs is introduced, which is a new, parallel ABFT scheme that determines rounding error bounds autonomously at runtime with low performance overhead and high error coverage.

【Keywords】: Matrix Multiplication; Algorithm-Based Fault Tolerance; ABFT; Rounding Error Estimation; GPU

41. Impact of GPUs Parallelism Management on Safety-Critical and HPC Applications Reliability.

【Paper Link】【Pages】:455-466

【Authors】: Paolo Rech ; Laércio Lima Pilla ; Philippe Olivier Alexandre Navaux ; Luigi Carro

【Abstract】: Graphics Processing Units (GPUs) offer high computational power but require high scheduling strain to manage parallel processes, which increases the GPU cross section. The results of extensive neutron radiation experiments performed on NVIDIA GPUs confirm this hypothesis. Reducing the application Degree Of Parallelism (DOP) reduces the scheduling strain but also modifies the GPU parallelism management, including memory latency, thread registers number, and the processors occupancy, which influence the sensitivity of the parallel application. An analysis on the overall GPU radiation sensitivity dependence on the code DOP is provided and the most reliable configuration is experimentally detected. Finally, modifying the parallel management affects the GPU cross section but also the code execution time and, thus, the exposure to radiation required to complete computation. The Mean Workload and Executions Between Failures metrics are introduced to evaluate the workload or the number of executions computed correctly by the GPU on a realistic application.

【Keywords】: parallel algorithms; GPGPUs; reliability; radiation

Session 7A: System Configuration and Provisioning 3

42. Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory.

【Paper Link】【Pages】:467-478

【Authors】: Yixin Luo ; Sriram Govindan ; Bikash Sharma ; Mark Santaniello ; Justin Meza ; Aman Kansal ; Jie Liu ; Badriddine Khessib ; Kushagra Vaid ; Onur Mutlu

【Abstract】: Memory devices represent a key component of datacenter total cost of ownership (TCO), and techniques used to reduce errors that occur on these devices increase this cost. Existing approaches to providing reliability for memory devices pessimistically treat all data as equally vulnerable to memory errors. Our key insight is that there exists a diverse spectrum of tolerance to memory errors in new data-intensive applications, and that traditional one-size-fits-all memory reliability techniques are inefficient in terms of cost. For example, we found that while traditional error protection increases memory system cost by 12.5%, some applications can achieve 99.00% availability on a single server with a large number of memory errors without any error protection. This presents an opportunity to greatly reduce server hardware cost by provisioning the right amount of memory reliability for different applications. Toward this end, in this paper, we make three main contributions to enable highly-reliable servers at low datacenter cost. First, we develop a new methodology to quantify the tolerance of applications to memory errors. Second, using our methodology, we perform a case study of three new dataintensive workloads (an interactive web search application, an in-memory key -- value store, and a graph mining framework) to identify new insights into the nature of application memory error vulnerability. Third, based on our insights, we propose several new hardware/software heterogeneous-reliability memory system designs to lower datacenter cost while achieving high reliability and discuss their trade-off. We show that our new techniques can reduce server hardware cost by 4.7% while achieving 99.90% single server availability.

【Keywords】: DRAM; memory errors; software reliability; memory architectures; soft errors; hard errors; datacenter cost

43. Ocasta: Clustering Configuration Settings for Error Recovery.

【Paper Link】【Pages】:479-490

【Authors】: Zhen Huang ; David Lie

【Abstract】: Effective machine-aided diagnosis and repair of configuration errors continues to elude computer systems designers. Most of the literature targets errors that can be attributed to a single erroneous configuration setting. However, a recent study found that a significant amount of configuration errors require fixing more than one setting together. To address this limitation, Ocasta statistically clusters dependent configuration settings based on the application's accesses to its configuration settings and utilizes the extracted clustering of configuration settings to fix configuration errors involving more than one configuration settings. Ocasta treats applications as black-boxes and only relies on the ability to observe application accesses to their configuration settings. We collected traces of real application usage from 24 Linux and 5 Windows desktops computers and found that Ocasta is able to correctly identify clusters with 88.6% accuracy. To demonstrate the effectiveness of Ocasta, we evaluated it on 16 real-world configuration errors of 11 Linux and Windows applications. Ocasta is able to successfully repair all evaluated configuration errors in 11 minutes on average and only requires the user to examine an average of 3 screenshots of the output of the application to confirm that the error is repaired. A user study we conducted shows that Ocasta is easy to use by both expert and non-expert users and is more efficient than manual configuration error troubleshooting.

【Keywords】: Software tools; Fault diagnosis; System recovery; Clustering algorithms

44. FACE-CHANGE: Application-Driven Dynamic Kernel View Switching in a Virtual Machine.

【Paper Link】【Pages】:491-502

【Authors】: Zhongshu Gu ; Brendan Saltaformaggio ; Xiangyu Zhang ; Dongyan Xu

【Abstract】: Kernel minimization has already been established as a practical approach to reducing the trusted computing base. Existing solutions have largely focused on whole-system profiling - generating a globally minimum kernel image that is being shared by all applications. However, since different applications use only part of the kernel's code base, the minimized kernel still includes an unnecessarily large attack surface. Furthermore, once the static minimized kernel is generated, it is not flexible enough to adapt to an altered execution environment (e.g., new workload). FACE-CHANGE is a virtualization-based system to facilitate dynamic switching at runtime among multiple minimized kernels, each customized for an individual application. Based on precedent profiling results, FACE-CHANGE transparently presents a customized kernel view for each application to confine its reach ability of kernel code. In the event that the application exceeds this boundary, FACE-CHANGE is able to recover the missing code and back trace its attack/exception provenance to analyze the anomalous behavior.

【Keywords】: Virtualization; Attack Surface Minimization; Attack Provenance

Session 7B: Formal Methods 3

45. Model Checking Stochastic Automata for Dependability and Performance Measures.

【Paper Link】【Pages】:503-514

【Authors】: Peter Buchholz ; Jan Kriege ; Dimitri Scheftelowitsch

【Abstract】: Model checking of Continuous Time Markov Chains (CTMCs) is a widely used approach in performance and dependability analysis and proves for which states of a CTMC a logical formula holds. This viewpoint might be too detailed in several practical situations, especially if the states of the CTMC do not correspond to physical states of the system since they are introduced for example to model non-exponential timing. The paper presents a general class of automata with stochastic timing realized by clocks. A state of an automaton is given by a logical state and by clock states. Clocks trigger transitions and are modeled by phase type distributions or more general state based stochastic processes. The class of stochastic processes underlying these automata contains CTMCs but also goes beyond Markov processes. The logic CSL is extended for model checking automata with clocks. A formula is then proved for an automata state and for the clock states that depend on the past behavior of the automaton. Basic algorithms to prove CSL formulas for logical automata states with complete or partial knowledge of the clock states are introduced. In some cases formulas can be proved efficiently by decomposing the model with respect to concurrently running clocks which is a way to avoid state space explosion.

【Keywords】: Numerical Methods; Model Checking; Stochastic Automata; CSL; Non-Markovian Models

46. Scalable Security Models for Assessing Effectiveness of Moving Target Defenses.

【Paper Link】【Pages】:515-526

【Authors】: Jin B. Hong ; Dong Seong Kim

【Abstract】: Moving Target Defense (MTD) changes the attack surface of a system that confuses intruders to thwart attacks. Various MTD techniques are developed to enhance the security of a networked system, but the effectiveness of these techniques is not well assessed. Security models (e.g., Attack Graphs (AGs)) provide formal methods of assessing security, but modeling the MTD techniques in security models has not been studied. In this paper, we incorporate the MTD techniques in security modeling and analysis using a scalable security model, namely Hierarchical Attack Representation Models (HARMs), to assess the effectiveness of the MTD techniques. In addition, we use importance measures (IMs) for scalable security analysis and deploying the MTD techniques in an effective manner. The performance comparison between the HARM and the AG is given. Also, we compare the performance of using the IMs and the exhaustive search method in simulations.

【Keywords】: Security Modeling Techniques; Attack Representation Model; Importance Measures; Moving Target Defense; Security Analysis

47. A Novel Variable Ordering Heuristic for BDD-based K-Terminal Reliability.

【Paper Link】【Pages】:527-537

【Authors】: Minh Lê ; Josef Weidendorfer ; Max Walter

【Abstract】: Modern exact methods solving the NP-hard k-terminal reliability problem are based on Binary Decision Diagrams (BDDs). The system redundancy structure represented by the input graph is converted into a BDD whose size highly depends on the predetermined variable ordering. As finding the optimal available ordering has exponential complexity, a heuristic must be used. Currently, the breadth-first-search is considered to be state-of-the-art. Based on Hardy's decomposition approach, we derive a novel static heuristic which yields significantly smaller BDD sizes for a wide variety of network structures, especially irregular ones. As a result, runtime and memory requirements can be drastically reduced for BDD-based reliability methods. Applying the decomposition method with the new heuristic to three medium-sized irregular networks from the literature, an average speedup of around 9,400 is gained and the memory consumption drops to less than 0.1 percent.

【Keywords】: dependability analysis; BDD; decomposition; k-terminal reliability

Session 8A: System and Component Reliability 3

48. Adaptive Low-Power Architecture for High-Performance and Reliable Embedded Computing.

【Paper Link】【Pages】:538-549

【Authors】: Ronaldo Rodrigues Ferreira ; Jean da Rolt ; Gabriel L. Nazar ; Álvaro Freitas Moreira ; Luigi Carro

【Abstract】: This paper presents the Matrix Operation Microprocessor Architecture (MoMa) for reliable embedded computing. MoMa introduces a software execution mechanism based on transactions, which provides a localized error correction scheme that leads to reduced error correction latency and hardware redundancy without incurring on expensive execution check pointing. Coupled to the transactional software execution is a dedicated adaptive core for matrix multiplication which is protected with a hardware implementation of the Algorithm-Based Fault Tolerance technique. MoMa drives the matrix core in an adaptive fashion based on dynamically turning it on only when high-performance computation is necessary, leading to ultimate power savings and error coverage. We performed an exhaustive FPGA-implemented fault injection campaign, in which we observed an error detection coverage of almost 100% and an error correction coverage of almost 98% on average. MoMa is also evaluated in terms of power, area, and performance, showing its competitiveness against a classical TMR solution.

【Keywords】: transaction; adaptive computing; error correction; hardening; matrix multiplication; radiation; soft error

49. HV Code: An All-Around MDS Code to Improve Efficiency and Reliability of RAID-6 Systems.

【Paper Link】【Pages】:550-561

【Authors】: Zhirong Shen ; Jiwu Shu

【Abstract】: The increasing expansion of data scale leads to the widespread deployment of storage systems with larger capacity and further induces the climbing probability of data loss or damage. The Maximum Distance Separable (MDS) code in RAID-6, which tolerates the concurrent failures of any two disks with minimal storage requirement, is one of the best candidates to enhance the data reliability. However, most of the existing works in this literature are more inclined to be specialized and cannot provide a satisfied performance under an all-round evaluation. Aiming at this problem, we propose an all-round MDS code named Horizontal-Vertical Code (HV Code) by taking advantage of horizontal parity and vertical parity. HV Code achieves the perfect I/O balancing and optimizes the operation of partial stripe writes to continuous data elements, while preserving the optimal encode/decode/update efficiency. Moreover, it owns a shorter parity chain which grants it a more efficient recovery for one disk failure. HV Code also behaves well for the degraded read operation and accelerates the process to reconstruct two disabled disks by executing four recovery chains in parallel. The performance evaluation demonstrates that HV Code well balances the I/O distribution and eliminates up to 27.6% and 32.4% I/O requests for partial stripe writes operation when compared with RDP Code and HDP Code. Moreover, compared to RDP Code, HDP Code, X-Code and H-Code, HV Code reduces up to 5.4%~39.8% I/O requests per element for the single disk reconstruction, decreases 6.6%~28.3% I/O requests for degraded read operations, and achieves the same efficiency of X-Code for double disk recovery by shortening 47.4%~59.7% recovery time compared with other three codes.

【Keywords】: Degraded Read; RAID-6; Storage System; Load Balancing; Partial Stripe Writes; Disk Recovery

50. Replication-Based Fault-Tolerance for Large-Scale Graph Processing.

【Paper Link】【Pages】:562-573

【Authors】: Peng Wang ; Kaiyuan Zhang ; Rong Chen ; Haibo Chen ; Haibing Guan

【Abstract】: The increasing algorithm complexity and dataset sizes necessitate the use of networked machines for many graph-parallel algorithms, which also makes fault tolerance a must due to the increasing scale of machines. Unfortunately, existing large-scale graph-parallel systems usually adopt a distributed checkpoint mechanism for fault tolerance, which incurs not only notable performance overhead but also lengthy recovery time. This paper observes that the vertex replicas created for distributed graph computation can be naturally extended for fast in-memory recovery of graph states. This paper proposes Imitator, a new fault tolerance mechanism, that supports cheaply maintenance of vertex states by replicating vertex states to their replicas during normal message exchanges, and provides fast in-memory reconstruction of failed vertices from replicas in other machines. Imitator has been implemented by extending Hama, a popular open-source clone of Pregel. Evaluation shows that Imitator incurs negligible performance overhead (less than 5% for all cases) and can recover from failures of more than one million of vertices with less than 3.4 seconds.

【Keywords】: fault-tolerance; graph-parallel system

Session 8B: Miscellaneous 3

51. System Call Redirection: A Practical Approach to Meeting Real-World Virtual Machine Introspection Needs.

【Paper Link】【Pages】:574-585

【Authors】: Rui Wu ; Ping Chen ; Peng Liu ; Bing Mao

【Abstract】: Existing VMI techniques have high overhead, and require customized introspection programs/tools for different guest OS versions - lack of generality. In this paper, we present Shadow Context, a system for close-to-real time manual-effort-free VMI. Shadow Context can meet several important real-world VMI needs which existing VMI techniques cannot. Compared to other automatic introspection tool generation techniques, Shadow Contexthas two merits: (1) Its overhead is significantly less. It achieves close-to-real time VMI. (2) It significantly improves the practical usefulness of introspection tools by allowing one introspection program to inspect a variety of guest OS versions. These merits are achieved via a new concept called "Shadow Context" which allows the guest OSessystem call code to be reused inside a "shadowed" portion of the context of the out-of-guest inspection program. Besides, Shadow Context is secure enough to defend against a variety of real world attacks. Shadow Context is designed, implemented and systematically evaluated. Experimental results show that the performance overhead is about 75%with a median initialization time of 0.117 milliseconds.

【Keywords】: Virtual Machine Introspection; Virtualization

52. Interoperability between Fingerprint Biometric Systems: An Empirical Study.

【Paper Link】【Pages】:586-597

【Authors】: Stephen Mason ; Ilir Gashi ; Luca Lugini ; Emanuela Marasco ; Bojan Cukic

【Abstract】: Fingerprints are likely the most widely used biometric in commercial as well as law enforcement applications. With the expected rapid growth of fingerprint authentication in mobile devices their importance justifies increased demands for dependability. An increasing number of new sensors, applications and a diverse user population also intensify concerns about the interoperability in fingerprint authentication. In most applications, fingerprints captured for user enrollment with one device may need to be "matched" with fingerprints captured with another device. We have performed a large-scale study with 494 participants whose fingerprints were captured with 4 different industry-standard optical fingerprint devices. We used two different image quality algorithms to evaluate fingerprint images, and then used three different matching algorithms to calculate match scores. In this paper we present a comprehensive analysis of dependability and interoperability attributes of fingerprint authentication and make empirically-supported recommendations on their deployment strategies.

【Keywords】: interoperability; biometric systems; empirical assessment; experimental results; design diversity

53. DNS Noise: Measuring the Pervasiveness of Disposable Domains in Modern DNS Traffic.

【Paper Link】【Pages】:598-609

【Authors】: Yizheng Chen ; Manos Antonakakis ; Roberto Perdisci ; Yacin Nadji ; David Dagon ; Wenke Lee

【Abstract】: In this paper, we present an analysis of a new class of domain names: disposable domains. We observe that popular web applications, along with other Internet services, systematically use this new class of domain names. Disposable domains are likely generated automatically, characterized by a "one-time use" pattern, and appear to be used as a way of "signaling" via DNS queries. To shed light on the pervasiveness of disposable domains, we study 24 days of live DNS traffic spanning a year observed at a large Internet Service Provider. We find that disposable domains increased from 23.1% to 27.6% of all queried domains, and from 27.6% to 37.2% of all resolved domains observed daily. While this creative use of DNS may enable new applications, it may also have unanticipated negative consequences on the DNS caching infrastructure, DNSSEC validating resolvers, and passive DNS data collection systems.

【Keywords】: Internet Measurement; Disposable Domain Name

Session 9: Failure Analysis and Assurance 3

54. Lessons Learned from the Analysis of System Failures at Petascale: The Case of Blue Waters.

【Paper Link】【Pages】:610-621

【Authors】: Catello Di Martino ; Zbigniew T. Kalbarczyk ; Ravishankar K. Iyer ; Fabio Baccanico ; Joseph Fullop ; William Kramer

【Abstract】: This paper provides an analysis of failures and their impact for Blue Waters, the Cray hybrid (CPU/GPU) supercomputer at the University of Illinois at Urbana-Champaign. The analysis is based on both manual failure reports and automatically generated event logs collected over 261 days. Results include i) a characterization of the root causes of single-node failures, ii) a direct assessment of the effectiveness of system-level fail over as well as memory, processor, network, GPU accelerator, and file system error resiliency, and iii) an analysis of system-wide outages. The major findings of this study are as follows. Hardware is not the main cause of system downtime. This is notwithstanding the fact that hardware-related failures are 42% of all failures. Failures caused by hardware were responsible for only 23% of the total repair time. These results are partially due to the fact that processor and memory protection mechanisms (x8 and x4 Chip kill, ECC, and parity) are able to handle a sustained rate of errors as high as 250 errors/h while providing a coverage of 99.997% out of a set of more than 1.5 million of analyzed errors. Only 28 multiple-bit errors bypassed the employed protection mechanisms. Software, on the other hand, was the largest contributor to the node repair hours (53%), despite being the cause of only 20% of the total number of failures. A total of 29 out of 39 system-wide outages involved the Lustre file system with 42% of them caused by the inadequacy of the automated fail over procedures.

【Keywords】: Nvidia GPU errors; Failure Analysis; Failure Reports; Cray XE6; Cray XK7; Supercomputer; Machine Check

55. GemFI: A Fault Injection Tool for Studying the Behavior of Applications on Unreliable Substrates.

【Paper Link】【Pages】:622-629

【Authors】: Konstantinos Parasyris ; Georgios Tziantzoulis ; Christos D. Antonopoulos ; Nikolaos Bellas

【Abstract】: Dependable computing on unreliable substrates is the next challenge the computing community needs to overcome due to both manufacturing limitations in low geometries and the necessity to aggressively minimize power consumption. System designers often need to analyze the way hardware faults manifest as errors at the architectural level and how these errors affect application correctness. This paper introduces GemFI, a fault injection tool based on the cycle accurate full system simulator Gem5. GemFI provides fault injection methods and is easily extensible to support future fault models. It also supports multiple processor models and ISAs and allows fault injection in both functional and cycle-accurate simulations. GemFI offers fast-forwarding of simulation campaigns via check pointing. Moreover, it facilitates the parallel execution of campaign experiments on a network of workstations. In order to validate and evaluate GemFI, we used it to apply fault injection on a series of real-world kernels and applications. The evaluation indicates that its overhead compared with Gem5 is minimal (up to 3.3%), whereas optimizations such as fast-forwarding via check pointing and execution on NoWs can significantly reduce simulation time of a fault injection campaign.

【Keywords】: full system; fault-injection; simulation; cycle accurate

56. A Design and Implementation of an Assurance Case Language.

【Paper Link】【Pages】:630-641

【Authors】: Yutaka Matsuno

【Abstract】: Assurance cases are documented bodies of evidence that provide valid and convincing arguments that a system is adequately dependable in a given application and environment. Assurance cases are widely required by regulation for safety-critical systems in the EU. There have been several graphical notation systems proposed for assurance cases. GSN (Goal Structuring Notation) and CAE (Claim, Argument, Evidence) are such two notation systems, and a standardization effort for these notation systems have been attempted in OMG (Object Management Group). However, these notation systems have not been defined in a formal way. This paper presents a formal definition of an assurance case language based on GSN and its pattern and module extensions. We take the framework of functional programming language as the basis of our study. The implementation has been done on an Eclipse based GSN editor. We report case studies on previous work done with GSN and show the applicability of the assurance case language.

【Keywords】: Functional Programming Lanugages; Assurance Cases; GSN (Goal Structuring Notation)

The Fourth International Workshop on Dependability of Clouds, Data Centers and Virtual Machine Technology (DCDV 2014)

(DCDV 2014) Session 1: Cloud Dependability 3

57. Recovery for Failures in Rolling Upgrade on Clouds.

【Paper Link】【Pages】:642-647

【Authors】: Min Fu ; Liming Zhu ; Len Bass ; Anna Liu

【Abstract】: When cloud consumers perform rolling upgrade operations on cloud applications, they may encounter failures due to cloud uncertainty, interfering operations and incorrect configurations. For example, unreliable cloud API calls can make the rolling upgrade operation fail in unpredictable ways due to a long time delay to respond to the API call. This paper proposes two recovery strategies for recovering from rolling upgrade failures. The strategies are Compensated Undo & Redo and Reparation. We evaluated our recovery strategies on Asgard-based rolling upgrade operation on Amazon Cloud based on two evaluation metrics: MTTR and Service Performance. The experiment results show that our strategies perform better than the recovery mechanisms provided by Asgard itself. We also conduct a comparison between the two recovery strategies based on the metrics.

【Keywords】: recovery strategy; cloud consumer; cloud API; rolling upgrade

58. Crosscheck: Hardening Replicated Multithreaded Services.

【Paper Link】【Pages】:648-653

【Authors】: Arthur Martens ; Christoph Borchert ; Tobias Oliver Geissler ; Daniel Lohmann ; Olaf Spinczyk ; Rüdiger Kapitza

【Abstract】: State-machine replication has received widespread attention for the provisioning of highly available services in data centers. However, current production systems focus on tolerating crash faults only and prominent service outages caused by state corruptions have indicated that this is a risky strategy. In the future, state corruptions due to transient faults (such as bit flips) become even more likely, caused by ongoing hardware trends regarding the shrinking of structure sizes and reduction of operating voltages. In this paper we present Crosscheck, an approach to tolerate arbitrary state corruption (ASC) in the context of fault-tolerant replication of multithreaded services. Crosscheck is able to detect silent data corruptions ahead of execution, and by crosschecking state changes with co-executing replicas, even ASCs can be detected. Finally, fault tolerance is achieved by a fine-grained recovery using fault-free replicas. Our implementation is transparent to the application by utilizing fine-grained software-hardening mechanisms using aspect-oriented programming. To validate Crosscheck we present a replicated multithreaded key-value store that is resilient to state corruptions.

【Keywords】: AspectC++; Replication; Software Error Hardening; Determinism; Multithreading

59. NV-Hypervisor: Hypervisor-Based Persistence for Virtual Machines.

【Paper Link】【Pages】:654-659

【Authors】: Vasily A. Sartakov ; Rüdiger Kapitza

【Abstract】: Power outages and subsequent recovery are major causes of service downtimes. This issue is amplified by the ongoing trend of steadily growing in-memory state of Internet-based services which increases the risk of data loss and extends recovery time. Protective measures against power outages, such as uninterruptible power supply are expensive, maintenance-intensive and often fragile. With the advent of non-volatile random-access memory (NVRAM) provided by commodity servers, there is a scalable, less costly and robust alternative to recover from power outages and other failures. However, as of today, off-the-shelf software is not ready for benefiting from NVRAM. We present NV-Hyper visor a lightweight hyper visor extension that transparently provides persistence for virtual machines. NV-Hyper visor paves the way for utilizing NVRAM in virtualized environments (i.e., infrastructure-as-a-service clouds) and protects stateful services such as key-value stores and databases from data loss and time-consuming recovery.

【Keywords】: Cloud Computing; NV-RAM; Hypervisor; Operating Systems

Session 2: Dependability Evaluation 3

60. A Markov Decision Process Approach for Optimal Data Backup Scheduling.

【Paper Link】【Pages】:660-665

【Authors】: Ruofan Xia ; Fumio Machida ; Kishor S. Trivedi

【Abstract】: The explosive growth of data generation and increasing reliance of business analysis on massive data make data loss more damaging than ever before. Nowadays many organizations start relying on cloud services for keeping their valuable data. It is a critical issue for cloud service provider to protect the data for individual users securely and effectively. To protect the data in a system with multiple data sources, backup schedule plays an important role for achieving the desired level of data protection while minimizing the impact on system operation. In this paper we investigate the use of Markov Decision Process (MDP) to guide the scheduling of data backup operation and propose a framework to automatically generate an MDP instance from system specifications and data protection requirements. We then demonstrate the benefits of the MDP approach.

【Keywords】: Markov decision process; data backup; scheduling

61. Availability Evaluation of Digital Library Cloud Services.

【Paper Link】【Pages】:666-671

【Authors】: Julian Araujo ; Paulo Romero Martins Maciel ; Matheus Torquato ; Gustavo Rau de Almeida Callou ; Ermeson C. Andrade

【Abstract】: Cloud computing is a new paradigm that provides services through the Internet. Such paradigm has the influence of the previous available technologies (e.g., cluster, peer-to-peer and grid computing) and has been adopted to reduce costs, to provide flexibility and to make management easier. Companies like Google, Amazon, Microsoft, IBM, HP, Yahoo, Oracle, and EMC have conducted significant investments on cloud infrastructure to provide services with high availability levels. The advantages of cloud computing allowed the construction of digital libraries that represent collections of information. This system demands high reliability and studies regarding analysis of availability are important due to the relevance of conservation and dissemination of the scientific and literature information. This paper proposes an approach to model and evaluate the availability of a digital library. A case study is conducted to show the applicability of the proposed approach. The obtained results are useful for the design of this system since missing data can lead to various errors and incalculable losses.

【Keywords】: Availability; Cloud computing; Digital Library; Accelerated Life Testing; Petri net; Reliability Block Diagram

62. Defects per Million (DPM) Evaluation for a Cloud Dealing with VM Failures Using Checkpointing.

【Paper Link】【Pages】:672-677

【Authors】: Subrota K. Mondal ; Jogesh K. Muppala

【Abstract】: In cloud computing systems, a user request goes through several cloud service provider specific processing steps from the instant it is submitted until the service is completed. In this paper, we use service-oriented metrics to characterize the dependability of cloud computing systems in order to find the pitfalls and improve the service. We find that it is not possible to fully reflect the impact of a cloud-service's dependability behavior through traditional dependability metrics like availability or reliability. We use a user-perceived dependability metric called Defects Per Million (DPM), defined as the number of user requests dropped out of a million. We demonstrate a new formulation for computing DPM metric in cloud computing systems. We incorporate check pointing scheme for job execution in the cloud to mitigate the impact of virtual machine failures, and compute DPM in order to characterize the improvement in the DPM due to the check pointing scheme compared to no-check pointing scheme.

【Keywords】: Deadline; Cloud Computing; Physical Machine; Virtual Machine; Defects Per Million; Checkpointing; Fault Tolerance; Reliability

The First International Workshop on Dependability and Security of System Operation (DSSO 2014)

(DSSO 2014) Session 1: Error Detection and Diagnosis 3

63. Predicting Incident Reports for IT Help Desk.

【Paper Link】【Pages】:678-683

【Authors】: Anneliese Amschler Andrews ; Joseph Lucente

【Abstract】: Costs associated with IT help desk operations present challenges to profitability goals in an organization. Minimizing software failures in an operational environment is important to customer experience, but more practically to the cost model for offering IT services. The user-facing behavior of software systems such as unscheduled downtime, slow performance and anomalous behavior affect the overall perception of a software product. Critical business applications and systems with human life at stake demand reliability and continuous availability of services. Software failures are costly. In this case study we investigate software reliability models and their applicability to improvement processes at an IT help desk. We propose a model selection process and demonstrate its success using real help desk incident data from eighteen desktop software applications. Our results show direct applicability to meeting cost challenges in IT help desk operations.

【Keywords】: industrial data; IT Help Desk; incident reports; SRGM; reliability model; process improvement

64. What Vulnerability Do We Need to Patch First?

【Paper Link】【Pages】:684-689

【Authors】: Jin B. Hong ; Dong Seong Kim ; Abdelkrim Haqiq

【Abstract】: Computing a prioritized set of vulnerabilities to patch is important for system administrators to determine the order of vulnerabilities to be patched that are more critical to the network security. One way to assess and analyze security to find vulnerabilities to be patched is to use attack representation models (ARMs). However, security solutions using ARMs are optimized for only the current state of the networked system. Therefore, the ARM must reanalyze the network security, causing multiple iterations of the same task to obtain the prioritized set of vulnerabilities to patch. To address this problem, we propose to use importance measures to rank network hosts and vulnerabilities, then combine these measures to prioritize the order of vulnerabilities to be patched. We show that nearly equivalent prioritized set of vulnerabilities can be computed in comparison to an exhaustive search method in various network scenarios, while the performance of computing the set is dramatically improved, while equivalent solutions are computed in various network scenarios.

【Keywords】: Vulnerability Patch; Attack Representation Model; Network Centrality; Security Analysis; Security Management; Security Metrics

65. What Logs Should You Look at When an Application Fails? Insights from an Industrial Case Study.

【Paper Link】【Pages】:690-695

【Authors】: Marcello Cinque ; Domenico Cotroneo ; Raffaele Della Corte ; Antonio Pecchia

【Abstract】: Event logs are the first place where to find useful information about application failures. Event logs are available at different system levels, such as application, middleware and operating system. In this paper we analyze the failure reporting capability of event logs collected at different levels of an industrial system in the Air Traffic Control (ATC) domain. The study is based on a data set of 3,159 failures induced in the system by means of software fault injection. Results indicate that the reporting ability of event logs collected at a given level is strongly affected by the type of failure observed at runtime. For example, even if operating system logs catch almost all application crashes, they are strongly ineffective in face of silent and erratic failures in the considered system.

【Keywords】: air traffic control; event log; fault injection; failure analysis; middleware

Session 2: Design Strategies for Dependability 2

66. Towards a Taxonomy of Cloud Recovery Strategies.

【Paper Link】【Pages】:696-701

【Authors】: Min Fu ; Len Bass ; Anna Liu

【Abstract】: Recovering from failures of sporadic operations such as rolling upgrade or migration is complicated by the fact that the application being upgraded or migrated must continue to provide service. This means that recovery strategies for sporadic operations must include facilities for recovering from normal operations as well. As a step in deriving methods for recovering from failures in sporadic operations, we classify existing methods into four categories according to their purposes and the life cycle phase for which they are applicable. Not only does this taxonomy facilitate the research on recoverability of cloud sporadic operations but also it can help better understand the existing cloud recovery strategies.

【Keywords】: sporadic operations; cloud recovery; consumer-initiated

67. Toward Design Decisions to Enable Deployability: Empirical Study of Three Projects Reaching for the Continuous Delivery Holy Grail.

【Paper Link】【Pages】:702-707

【Authors】: Stephany Bellomo ; Neil A. Ernst ; Robert L. Nord ; Rick Kazman

【Abstract】: There is growing interest in continuous delivery practices to enable rapid and reliable deployment. While practices are important, we suggest architectural design decisions are equally important for projects to achieve goals such continuous integration (CI) build, automated testing and reduced deployment-cycle time. Architectural design decisions that conflict with deploy ability goals can impede the team's ability to achieve the desired state of deployment and may result in substantial technical debt. To explore this assertion, we interviewed three project teams striving to practicing continuous delivery. In this paper, we summarize examples of the deploy ability goals for each project as well as the architectural decisions that they have made to enable deploy ability. We present the deploy ability goals, design decisions, and deploy ability tactics collected and summarize the design tactics derived from the interviews in the form of an initial draft version hierarchical deploy ability tactic tree.

【Keywords】: test automation; deployability; continuous integration; continuous delivery; architecture tactic

The Fourth Fault Tolerance for HPC at eXtreme Scale (FTXS) Workshop 2014

Session 1: Checkpoint/Restart Modeling and Message Logging 2

68. Coarse-Grained Energy Modeling of Rollback/Recovery Mechanisms.

【Paper Link】【Pages】:708-713

【Authors】: Dewan Ibtesham ; David Debonis ; Dorian C. Arnold ; Kurt B. Ferreira

【Abstract】: As high-performance computing systems continue to grow in size and complexity, energy efficiency and reliability have emerged as first-order concerns. Researchers have shown that data movement is a significant contributing factor to power consumption on these systems. Additionally, rollback/recovery protocols like checkpoint/restart can generate large volumes of data traffic exacerbating the energy and power concerns. In this work, we show that a coarse-grained model can be used effectively to speculate about the energy footprints of rollback/recovery protocols. Using our validated model, we evaluate the energy footprint of checkpoint compression, a method that incurs higher computational demand to reduce data volumes and data traffic. Specifically, we show that while checkpoint compression leads to more frequent checkpoints (as per the optimal checkpoint frequency) and increases per checkpoint energy cost, compression still yields a decrease in total application energy consumption due to the overall runtime decrease.

【Keywords】: Checkpoint Compression; Fault Tolerance; Modeling; Checkpoint Restart

69. Grid-Oriented Process Clustering System for Partial Message Logging.

【Paper Link】【Pages】:714-719

【Authors】: Hideyuki Jitsumoto ; Yuki Todoroki ; Yutaka Ishikawa ; Mitsuhisa Sato

【Abstract】: In a computer cluster composed of many nodes, the mean time between failures becomes shorter as the number of nodes increases. This may mean that lengthy tasks cannot be performed, because they will be interrupted by failure. Therefore, fault tolerance has become an essential part of high-performance computing. Partial message logging forms clusters of processes, and coordinates a series of checkpoints to log messages between groups. Our study proposes a system of two features to improve the efficiency of partial message logging: 1) the communication log used in the clustering is recorded at runtime, and 2) a graph partitioning algorithm reduces the complexity of the system by geometrically partitioning a grid graph. The proposed system is evaluated by executing a scientific application. The results of process clustering are compared to existing methods in terms of the clustering performance and quality.

【Keywords】: graph partition; fault tolerance; message logging

Session 2: Application and Algorithm Resiliency 2

70. Evaluating the Error Resilience of Parallel Programs.

【Paper Link】【Pages】:720-725

【Authors】: Bo Fang ; Karthik Pattabiraman ; Matei Ripeanu ; Sudhanva Gurumurthi

【Abstract】: As a consequence of increasing hardware fault rates, HPC systems face significant challenges in terms of reliability. Evaluating the error resilience of HPC applications is an essential step for building efficient fault-tolerant mechanisms for these applications. In this paper, we propose a methodology to characterize the resilience of OpenMP programs using fault-injection experiments. We find that the error resilience of OpenMP applications depends on the program structure and thread model, hence, these need to be taken into account while characterizing error resilience. We also report preliminary results about the correlation between the application's error resilience and the algorithm(s) used in the application.

【Keywords】: algorithms; Error Resilience; OpenMP

71. Comparison Criticality in Sorting Algorithms.

【Paper Link】【Pages】:726-731

【Authors】: Thomas B. Jones ; David H. Ackley

【Abstract】: Fault tolerance techniques often presume that the end-user computation must complete flawlessly. Though such strict correctness is natural and easy to explain, it's increasingly unaffordable for extreme-scale computations, and blind to possible preferences among errors, should they prove inevitable. In a case study on traditional sorting algorithms, we present explorations of a criticality measure defined over expected fault damage rather than probability of correctness. We discover novel 'error structure' in even the most familiar algorithms, and observe that different plausible error measures can qualitatively alter criticality relationships, suggesting the importance of explicit error measures and criticality in the wise deployment of the limited spare resources likely to be available in future extreme-scale computers.

【Keywords】: robust-first computing; Fault-intolerance; fault tolerance; criticality; sorting algorithms

Session 3: Hardware - Reliability Studies and Tailored Resilience Techniques 3

72. Radiation Sensitivity of High Performance Computing Applications on Kepler-Based GPGPUs.

【Paper Link】【Pages】:732-737

【Authors】: Daniel A. G. de Oliveira ; Caio B. Lunardi ; Laércio Lima Pilla ; Paolo Rech ; Philippe Olivier Alexandre Navaux ; Luigi Carro

【Abstract】: In this paper we assess and discuss the radiation sensitivity of a set of HPC applications executed on NVIDIA K20 GPGPUs. The occurrence of both radiation-induced silent data corruption and functional interruption will be experimentally addressed for Hotspot, LavaMD, and Matrix Transponse. Each of the tested codes requires a proper computational power and elaborates a different amount of data. Both these characteristics play a significant role in the application radiations sensitivity. Additionally, an evaluation of the error rate at sea level will be provided for all the tested codes.

【Keywords】: functional interruption; GPGPU; HPC; reliability; neutron sensitivity; silent data corruption

73. HeteroCheckpoint: Efficient Checkpointing for Accelerator-Based Systems.

【Paper Link】【Pages】:738-743

【Authors】: Sudarsun Kannan ; Naila Farooqui ; Ada Gavrilovska ; Karsten Schwan

【Abstract】: Moving toward exascale, the number of GPUs in HPC machines is bound to increase, and applications will spend increasing amounts of time running on those GPU devices. While GPU usage has already led to substantial speedup for HPC codes, their failure rates due to overheating are at least 10 times higher than those seen for the CPUs now commonly used on HPC machines. This makes it increasingly important for GPUs to have robust checkpoint/restart mechanisms. This paper introduces a unified CPU-GPU checkpoint mechanism, which can efficiently checkpoint the combined GPU-CPU memory state resident on machine nodes. Efficiency is gained in part by addressing the end-to-end data movements required for check pointing - from GPU to storage - by introducing novel pre-copy and checksum methods. These methods reduce checkpoint data movement cost seen by HPC applications, with initial measurements using different benchmark applications showing up to 60% reduced checkpoint overhead. Additional exploration of the use of next-generation storage, like NVM, show further promises of reduced check pointing overheads.

【Keywords】: Pre-Copy; NVM; Checkpoint; GPUs

74. Harnessing Unreliable Cores in Heterogeneous Architecture: The PyDac Programming Model and Runtime.

【Paper Link】【Pages】:744-749

【Authors】: Bin Huang ; Ron Sass ; Nathan DeBardeleben ; Sean Blanchard

【Abstract】: Heterogeneous many-core architectures combined with scratch-pad memories are attractive because they promise better energy efficiency than conventional architectures and a good balance between single-thread performance and multi-thread throughput. However, programmers will need an environment for finding and managing the large degree of parallelism, locality, and system resilience. We propose a Python-based task parallel programming model called PyDac to support these objectives. PyDac provides a two-level programming model based on the divide-and-conquer strategy. The PyDac runtime system allows threads to be run on unreliable hardware by dynamically checking the results without involvement from the programmer. To test this programming model and runtime, an unconventional heterogeneous architecture consisting of PowerPC and ARM cores was developed and emulated on an FPGA device. We inject faults during the execution of micro-benchmarks and show that through the use of double and triple modular redundancy we are able to complete the benchmarks with the correct results while only incurring a proportional performance penalty.

【Keywords】: Fault Tolerance; Heterogeneous Many-core Processor; Task-based Programming Model; Soft Error; Resilience

Session 4: Resiliency in HPC Messaging 2

75. Design and Evaluation of FA-MPI, a Transactional Resilience Scheme for Non-blocking MPI.

【Paper Link】【Pages】:750-755

【Authors】: Amin Hassani ; Anthony Skjellum ; Ron Brightwell

【Abstract】: With the rapid scale out of supercomputers comes a corresponding higher failure frequency. Fault-tolerant methods have evolved to adapt to high rates of failure, but the behavior of MPI, the most widely used scalable programming middleware, is insufficient when confronting such failures. We present FA-MPI (Fault-Aware MPI), a set of extensions to the MPI standard designed to enable applications to implement a wide range of fault-tolerant methods. FA-MPI introduces transactional concepts to the MPI programming model for the first time to address failure detection, isolation, mitigation, and recovery via application-driven policies. To reach the maximum achievable performance of these scalable machines, overlapping communication and I/O with computation through non-blocking operations (while reducing jitter) are design themes of growing importance. Therefore, we emphasize fault tolerant, non-blocking communication operations combined with a set of nest able lightweight transactional Try Block API extensions architected to exploit system and application hierarchy both for failure detection and recovery. This is to enable applications to run to completion with higher probability than otherwise. Scaling up and out and fault-free overhead are key concerns that can be managed by tuning transaction granularity, we provide a simulation of FA-MPI in a stencil 3D program to illustrate this. Supported failure models include but are not limited to process failures, a key difference from other proposed fault-tolerant extensions to MPI. Restriction to non-blocking operations is a current limitation as compared to other proposed approaches insofar as legacy applications are concerned, but FA-MPI aligns well with future-looking applications emphasizing Exascale. And, tools to evolve legacy MPI programs to this fault-aware paradigm will soon bridge that portability gap.

【Keywords】: Fault-Awareness; MPI; Fault-Tolerance

76. Extreme-Scale Viability of Collective Communication for Resilient Task Scheduling and Work Stealing.

【Paper Link】【Pages】:756-761

【Authors】: Jeremiah J. Wilke ; Janine Bennett ; Hemanth Kolla ; Keita Teranishi ; Nicole Slattengren ; John Floren

【Abstract】: Extreme-scale computing will bring significant changes to high performance computing system architectures. In particular, the increased number of system components is creating a need for software to demonstrate "pervasive parallelism" and resiliency. Asynchronous, many-task programming models show promise in addressing both the scalability and resiliency challenges, however, they introduce an enormously challenging distributed, resilient consistency problem. In this work, we explore the viability of resilient collective communication in task scheduling and work stealing and, through simulation with SST/macro, the performance of these collectives on speculative extreme-scale architectures.

【Keywords】: asynchronous programming models; fault tolerant collectives; structural simulation

The First International Workshop on Trustworthiness of Smart Grids (ToSG 2014)

(ToSG 2014) Session 4

77. Distributed Implementation of Wide-Area Monitoring Algorithms for Power Systems Using a US-Wide ExoGENI-WAMS Testbed.

【Paper Link】【Pages】:762-767

【Authors】: Jianhua Zhang ; Aranya Chakrabortty ; Yufeng Xin

【Abstract】: In this paper we address the problem of implementing wide-area oscillation monitoring algorithms for large power system networks using distributed processing of Synchrophasor measurements. We consider two computational approaches, namely decentralized least squares (DLS) and its recursive implementation (RLS). Both algorithms are executed using multiple phasor data concentrators (PDC), deployed as virtual computing machines communicating over a fiber-optic communication network. Results are demonstrated using the US-Wide ExoGENI communication network connected to a PMU test bed at NC State University, and analyze the end-to-end computational and communication delays for both algorithms.

【Keywords】: Phasor measurement units; Delays; Power systems; Computer architecture; Estimation; Monitoring; Oscillators

78. Stateful Data Delivery Service for Wide Area Monitoring and Control Applications.

【Paper Link】【Pages】:768-773

【Authors】: Yiming Wu ; Davood Babazadeh ; Lars Nordström

【Abstract】: Recent Information and Communication Technology (ICT) advances have enabled power system applications using measurement signals across the Wide Area Network (WAN). The application control performance relies on the quality of data delivery service. However, the characteristics of Quality of Service (QoS), such as latency, packet loss, and packet jitter, are unavoidable. It is a trend to take QoS metric requirement into the consideration of controller design. But, how to ensure the application receiving the data within the designed tolerant range is another challenge. This paper presents ongoing work on a novel Stateful Data Delivery Service (SDDS) for power system application to address the challenge from the side of the communication infrastructure. The SDDS monitors the QoS performance on-line and identifies the signals which satisfy the requirement for the application to use. As a proof of concept, a Power Oscillation Damping (POD) controller is connected to the SDDS. The result shows the improvement in robustness of the POD controller by application of the SDDS. The paper also shows the feasibility of applying SDDS to WAC applications.

【Keywords】: Stateful Data Delivery Service; Overlay Network; Power Oscillation Damping; Quality of Service; Robustness

79. Hybrid Control Network Intrusion Detection Systems for Automated Power Distribution Systems.

【Paper Link】【Pages】:774-779

【Authors】: Masood Parvania ; Georgia Koutsandria ; Vishak Muthukumar ; Sean Peisert ; Chuck McParland ; Anna Scaglione

【Abstract】: In this paper, we describe our novel use of network intrusion detection systems (NIDS) for protecting automated distribution systems (ADS) against certain types of cyber attacks in a new way. The novelty consists of using the hybrid control environment rules and model as the baseline for what is normal and what is an anomaly, tailoring the security policies to the physical operation of the system. NIDS sensors in our architecture continuously analyze traffic in the communication medium that comes from embedded controllers, checking if the data and commands exchanged conform to the expected structure of the controllers interactions, and evolution of the system's physical state. Considering its importance in future ADSs, we chose the fault location, isolation and service restoration (FLISR) process as our distribution automation case study for the NIDS deployment. To test our scheme, we emulated the FLISR process using real programmable logic controllers (PLCs) that interact with a simulated physical infrastructure. We used this test bed to examine the capability of our NIDS approach in several attack scenarios. The experimental analysis reveals that our approach is capable of detecting various attacks scenarios including the attacks initiated within the trusted perimeter of the automation network by attackers that have complete knowledge about the communication information exchanged.

【Keywords】: intrusion detection systems; Power distribution systems; distribution automation; network security

80. Towards Secure Metering Data Analysis via Distributed Differential Privacy.

【Paper Link】【Pages】:780-785

【Authors】: Xiaojing Liao ; David Formby ; Carson Day ; Raheem A. Beyah

【Abstract】: The future electrical grid, i.e., smart grid, will utilize appliance-level control to provide sustainable power usage and flexible energy utilization. However, load trace monitoring for appliance-level control poses privacy concerns with inferring private information. In this paper, we introduce a privacy-preserving and fine-grained power load data analysis mechanism for appliance-level peak-time load balance control in the smart grid. The proposed technique provides rigorous provable privacy and an accuracy guarantee based on distributed differential privacy. We simulate the scheme as privacy modules in the smart meter and the concentrator, and evaluate its performance under a real-world power usage dataset, which validates the efficiency and accuracy of the proposed scheme.

【Keywords】: Privacy; Home appliances; Smart grids; Smart meters; Noise; Power demand; Accuracy

Session 2

81. An Integrated Security Framework for GOSS Power Grid Analytics Platform.

【Paper Link】【Pages】:786-791

【Authors】: Tara D. Gibson ; Selim Ciraci ; Sharma Poorva ; Craig Allwardt ; Mark Rice ; Bora A. Akyol

【Abstract】: In power grid operations, security is an essential component for any middleware platform. Security protects data against unwanted access as well as cyber attacks. GridOpticsTM Software System (GOSS) is an open source power grid analytics platform that facilitates ease of access between applications and data sources and promotes development of advanced analytical applications. GOSS contains an API that abstracts many of the difficulties in connecting to various heterogeneous data sources. A number of applications and data sources have already been implemented to demonstrate functionality and ease of use. A security framework has been implemented which leverages widely accepted, robust Java TM security tools in a way such that they can be interchanged as needed. This framework supports the complex fine-grained, access control rules identified for the diverse data sources already in GOSS. Performance and reliability are also important considerations in any power grid architecture. An evaluation is done to determine the overhead cost caused by security within GOSS and ensure minimal impact to performance.

【Keywords】: pmu; power grid; smartgrid; security; jaas; middleware

82. Ironstack: Performance, Stability and Security for Power Grid Data Networks.

【Paper Link】【Pages】:792-797

【Authors】: Zhiyuan Teo ; Vera Kutsenko ; Ken Birman ; Robbert van Renesse

【Abstract】: Operators of the nationwide power grid use proprietary data networks to monitor and manage their power distribution systems. These purpose-built, wide area communication networks connect a complex array of equipment ranging from PMUs and synchrophasers to SCADA systems. Collectively, these equipment form part of an intricate feedback system that ensures the stability of the power grid. In support of this mission, the operational requirements of these networks mandates high performance, reliability, and security. We designed Iron Stack, a system to address these concerns. By using cutting-edge software defined networking technology, Iron Stack is able to use multiple network paths to improve communications bandwidth and latency, provide seamless failure recovery, and ensure signals security. Additionally, Iron Stack is incrementally deployable and backward-compatible with existing switching infrastructure.

【Keywords】: security; software-defined networking; SDNs; network performance; high-assurance computing