DSN 2015:Rio de Janeiro, Brazil

45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2015, Rio de Janeiro, Brazil, June 22-25, 2015. IEEE Computer Society 【DBLP Link】

Paper Num: 54 || Session Num: 19

Best Paper Award Session 3
Session 1A: Large-Scale Systems 2
Session 1B: Attack Mitigation 2
Session 1C: Modeling and Metrics 2
Session 2A: Secure Execution 3
Session 2B: Storage Systems 3
Session 2C: Web Services 3
Session 3A: Data Clusters, Servers and Clouds 3
Session 3B: Attack Prevention 3
Session 3C: Modeling and Assessment 3
Session 4A: Mitigation of Hardware Faults 3
Session 4B: Mobile Devices 3
Session 5A: Network Security and Consensus 3
Session 5B: Memory Errors 3
Session 6A: Fault Tolerance 3
Session 6B: Vulnerability Detection and Mitigation 3
Session 7A: Assurance of Complex Systems 3
Session 7B: Preventing Memory and Information Leakage 2
Workshops Summary 4

Best Paper Award Session 3

1. Leveraging State Information for Automated Attack Discovery in Transport Protocol Implementations.

【Paper Link】【Pages】:1-12

【Authors】: Samuel Jero ; Hyojeong Lee ; Cristina Nita-Rotaru

【Abstract】: We present a new method for finding attacks in unmodified transport protocol implementations using the specification of the protocol state machine to reduce the search space of possible attacks. Such reduction is obtained by appling malicious actions to all packets of the same type observed in the same state instead of applying them to individual packets. Our method requires knowledge of the packet formats and protocol state machine. We demonstrate our approach by developing SNAKE, a tool that automatically finds performance and resource exhaustion attacks on unmodified transport protocol implementations. SNAKE utilizes virtualization to run unmodified implementations in their intended environments and network emulation to create the network topology. SNAKE was able to find 9 attacks on 2 transport protocols, 5 of which we believe to be unknown in the literature.

【Keywords】: Transport protocols; Throughput; Reliability; Servers; Testing; Receivers

2. Δ-Encoding: Practical Encoded Processing.

【Paper Link】【Pages】:13-24

【Authors】: Dmitrii Kuvaiskii ; Christof Fetzer

【Abstract】: Transient and permanent errors in memory and CPUs occur with alarming frequency. Although most of these errors are masked at the hardware level or result in crashes, a non-negligible number of them leads to Silent Data Corruptions (SDCs), i.e., incorrect results of computations. Safety-critical programs require a very high level of confidence that such faults are detected and not propagated to the outside. Unfortunately, state-of-the-art fault detection techniques generally assume a limited Single Event Upset fault model, concentrating only on transient faults.We present Δ-encoding: a software-only approach to detect hardware faults with very high probability. Δ-encoding makes no assumptions on the rate and type of faults. Our approach combines AN codes and duplicated instructions to harden programs against transient and permanent hardware errors. Our evaluation shows that Δ-encoding detects 99.997% of all injected errors with performance slowdown of 2 - 4 times.

【Keywords】: duplicated instructions; fault tolerance; hardware errors; AN codes; encoded processing

3. Measuring and Understanding Extreme-Scale Application Resilience: A Field Study of 5, 000, 000 HPC Application Runs.

【Paper Link】【Pages】:25-36

【Authors】: Catello Di Martino ; William Kramer ; Zbigniew Kalbarczyk ; Ravishankar K. Iyer

【Abstract】: This paper presents an in-depth characterization of the resiliency of more than 5 million HPC application runs completed during the first 518 production days of Blue Waters, a 13.1 petaflop Cray hybrid supercomputer. Unlike past work, we measure the impact of system errors and failures on user applications, i.e., the compiled programs launched by user jobs that can execute across one or more XE (CPU) or XK (CPU+GPU) nodes. The characterization is performed by means of a joint analysis of several data sources, which include workload and error/failure logs. In order to relate system errors and failures to the executed applications, we developed LogDiver, a tool to automate the data pre-processing and metric computation. Some of the lessons learned in this study include: i) while about 1.53% of applications fail due to system problems, the failed applications contribute to about 9% of the production node hours executed in the measured period, i.e., the system consumes computing resources, and system-related issues represent a potentially significant energy cost for the work lost, ii) there is a dramatic increase in the application failure probability when executing full-scale applications: 20x (from 0.008 to 0.162) when scaling XE applications from 10,000 to 22,000 nodes, and 6x (from 0.02 to 0.129) when scaling GPU/hybrid applications from 2000 to 4224 nodes, and iii) the resiliency of hybrid applications is impaired by the lack of adequate error detection capabilities in hybrid nodes.

【Keywords】: data-driven resilience; supercomputer; resilience; data analysis; application resilience; extreme-scale; hybrid machines

Session 1A: Large-Scale Systems 2

4. Understanding and Exploiting Spatial Properties of System Failures on Extreme-Scale HPC Systems.

【Paper Link】【Pages】:37-44

【Authors】: Saurabh Gupta ; Devesh Tiwari ; Christopher Jantzi ; James H. Rogers ; Don Maxwell

【Abstract】: As we approach exascale, the scientific simulations are expected to experience more interruptions due to increased system failures. Designing better HPC resilience techniques requires understanding the key characteristics of system failures on these systems. While temporal properties of system failures on HPC systems have been well-investigated, there is limited understanding about the spatial characteristics of system failures and its impact on the resilience mechanisms. Therefore, we examine the spatial characteristics and behavior of system failures. We investigate the interaction between spatial and temporal characteristics of failures and its implications for system operations and resilience mechanisms on large-scale HPC systems. We show that system failures have "spatial locality" at different granularity in the system, study impact of different failure-types, and investigate the correlation among different failure-types. Finally, we propose a novel scheme that exploits the spatial locality in failures to improve application and system performance. Our evaluation shows that the proposed scheme significantly improves the system performance in a dynamic and production-level HPC system.

【Keywords】: High Performance Computing; Spatial Locality; System Failures; Resilience; Fault tolerance

5. Detection of Early-Stage Enterprise Infection by Mining Large-Scale Log Data.

【Paper Link】【Pages】:45-56

【Authors】: Alina Oprea ; Zhou Li ; Ting-Fang Yen ; Sang H. Chin ; Sumayah A. Alrwais

【Abstract】: Recent years have seen the rise of sophisticated attacks including advanced persistent threats (APT) which pose severe risks to organizations and governments. Additionally, new malware strains appear at a higher rate than ever before. Since many of these malware evade existing security products, traditional defenses deployed by enterprises today often fail at detecting infections at an early stage. We address the problem of detecting early-stage APT infection by proposing a new framework based on belief propagation inspired from graph theory. We demonstrate that our techniques perform well on two large datasets. We achieve high accuracy on two months of DNS logs released by Los Alamos National Lab (LANL), which include APT infection attacks simulated by LANL domain experts. We also apply our algorithms to 38TB of web proxy logs collected at the border of a large enterprise and identify hundreds of malicious domains overlooked by state-of-the-art security products.

【Keywords】: Belief Propagation; Advanced Persistent Threats; Data Analysis

Session 1B: Attack Mitigation 2

6. LEAPS: Detecting Camouflaged Attacks with Statistical Learning Guided by Program Analysis.

【Paper Link】【Pages】:57-68

【Authors】: Zhongshu Gu ; Kexin Pei ; Qifan Wang ; Luo Si ; Xiangyu Zhang ; Dongyan Xu

【Abstract】: Currently cyber infrastructures are facing increasingly stealthy attacks that implant malicious payloads under the cover of benign programs. Existing attack detection approaches based on statistical learning methods may generate misleading decision boundaries when processing noisy data with such a mixture of benign and malicious behaviors. On the other hand, attack detection based on formal program analysis may lack completeness or adaptivity when modelling attack behaviors. In light of these limitations, we have developed LEAPS, an attack detection system based on supervised statistical learning to classify benign and malicious system events. Furthermore, we leverage control flow graphs inferred from the system event logs to enable automatic pruning of the training data, which leads to a more accurate classification model when applied to the testing data. Our extensive evaluation shows that, compared with pure statistical learning models, LEAPS achieves consistently higher accuracy when detecting real-world camouflaged attacks with benign program cover-up.

【Keywords】: Program Analysis; Attack Detection; Statistical Learning

7. OnionBots: Subverting Privacy Infrastructure for Cyber Attacks.

【Paper Link】【Pages】:69-80

【Authors】: Amirali Sanatinia ; Guevara Noubir

【Abstract】: Over the last decade botnets survived by adopting a sequence of increasingly sophisticated strategies to evade detection and take overs, and to monetize their infrastructure. At the same time, the success of privacy infrastructures such as Tor opened the door to illegal activities, including botnets, ransomware, and a marketplace for drugs and contraband. We contend that the next waves of botnets will extensively attempt to subvert privacy infrastructure and cryptographic mechanisms. In this work we propose to preemptively investigate the design and mitigation of such botnets. We first, introduce OnionBots, what we believe will be the next generation of resilient, stealthy botnets. OnionBots use privacy infrastructures for cyber attacks by completely decoupling their operation from the infected host IP address and by carrying traffic that does not leak information about its source, destination, and nature. Such bots live symbiotically within the privacy infrastructures to evade detection, measurement, scale estimation, observation, and in general all IP-based current mitigation techniques. Furthermore, we show that with an adequate self-healing network maintenance scheme, that is simple to implement, OnionBots can achieve a low diameter and a low degree and be robust to partitioning under node deletions. We develop a mitigation technique, called SOAP, that neutralizes the nodes of the basic OnionBots. In light of the potential of such botnets, we believe that the research community should proactively develop detection and mitigation methods to thwart OnionBots, potentially making adjustments to privacy infrastructure.

【Keywords】: privacy infrastructure; botnet; Tor; self-healing network; cyber security

Session 1C: Modeling and Metrics 2

8. A Statistical Approach for Timed Reachability in AADL Models.

【Paper Link】【Pages】:81-88

【Authors】: Harold Bruintjes ; Joost-Pieter Katoen ; David Lesens

【Abstract】: We introduce a simulator (slimsim) for a subset of AADL extended with formalized behavioral semantics for nominal and error models. The simulator allows to perform probabilistic analysis using the Monte Carlo method, on linear-hybrid, stochastic models, which describe a combination of nominal and error behaviors of hard- and software components. The tool supports the use of different strategies, which control the behavior of the simulator when dealing with various forms of non-determinism. The simulator is tested using benchmarks of the COMPASS toolset, as well as a case study by Airbus Defense and Space.

【Keywords】: Data models; Analytical models; Delays; Semantics; Reactive power; Compass; Synchronization

9. Scalable Analysis of Fault Trees with Dynamic Features.

【Paper Link】【Pages】:89-100

【Authors】: Jan Krcál ; Pavel Krcál

【Abstract】: Fault trees constitute one of the essential formalisms for static safety analysis of various industrial systems. Dynamic fault trees (DFT) enrich the formalism by time-dependent behavior, e.g., repairs or functional dependencies. Analysis of DFT is so far limited to substantially smaller models than those required for, e.g., nuclear power plants. We propose a fault tree formalism that combines both static and dynamic features, called SD fault trees. It gives the user the freedom to express each equipment failure either statically, without modelling temporal information, or dynamically, allowing repairs and other timed interdependencies. We introduce an analysis algorithm for an important subclass of SD fault trees. The algorithm (1) scales similarly to static algorithms and (2) allows for a more realistic analysis compared to static algorithms as it takes into account temporal interdependencies. Finally, we demonstrate the applicability of the method by an experimental evaluation on fault trees of nuclear power plants.

【Keywords】: industrial risk assessment; static fault trees; dynamic fault trees; continuous time Markov chains

Session 2A: Secure Execution 3

10. The Power of Evil Choices in Bloom Filters.

【Paper Link】【Pages】:101-112

【Authors】: Thomas Gerbet ; Amrit Kumar ; Cédric Lauradoux

【Abstract】: A Bloom filter is a probabilistic hash-based data structure extensively used in software including online security applications. This paper raises the following important question: Are Bloom filters correctly designed in a security context? The answer is no and the reasons are multiple: bad choices of parameters, lack of adversary models and misused hash functions. Indeed, developers truncate cryptographic digests without a second thought on the security implications. This work constructs adversary models for Bloom filters and illustrates attacks on three applications, namely SCRAPY web spider, BITLY DABLOOMS spam filter and SQUID cache proxy. As a general impact, filters are forced to systematically exhibit worst-case behavior. One of the reasons being that Bloom filter parameters are always computed in the average case. We compute the worst-case parameters in adversarial settings, show how to securely and efficiently use cryptographic hash functions and propose several other countermeasures to mitigate our attacks.

【Keywords】: Denial-of-Service; Bloom filters; Hash functions; Digest truncation; Pre-image attack

11. Secure Dynamic Software Loading and Execution Using Cross Component Verification.

【Paper Link】【Pages】:113-124

【Authors】: Byungho Min ; Vijay Varadharajan

【Abstract】: In this paper, we propose a cross verification mechanism for secure execution and dynamic component loading. Our mechanism is based on a combination of code signing and same-origin policy, and it blocks several types of attacks from drive-by download attacks to malicious component loadings such as DLL hijacking, DLL side-loading, binary hijacking, typical DLL injection and loading of newly installed malware components, even when malicious components have valid digital signatures. Considering modern malware often uses stolen private keys to sign its binaries and bypass code signing mechanism, we believe the proposed mechanism can significantly improve the security of modern computing platforms. In addition, the proposed mechanism protects proprietary software components so that unauthorised use of such components cannot occur. We have implemented a prototype for Microsoft Windows 7 and XP SP3, and evaluated application execution and dynamic component loading behaviour under our security mechanism. The proposed mechanism is general, and can be applied to other major computing platforms including Android, Linux and Mac OS X.

【Keywords】: Malware; Loading; Digital signatures; Operating systems; Public key

12. Parallax: Implicit Code Integrity Verification Using Return-Oriented Programming.

【Paper Link】【Pages】:125-135

【Authors】: Dennis Andriesse ; Herbert Bos ; Asia Slowinska

【Abstract】: Parallax is a novel self-contained code integrity verification approach, that protects instructions by overlapping Return-Oriented Programming (ROP) gadgets with them. Our technique implicitly verifies integrity by translating selected code (verification code) into ROP code which uses gadgets scattered over the binary. Tampering with the protected instructions destroys the gadgets they contain, so that the verification code fails, thereby preventing the adversary from using the modified binary. Unlike prior solutions, Parallax does not rely on code checksumming, so it is not vulnerable to instruction cache modification attacks which affect checksumming techniques. Further, unlike previous algorithms which withstand such attacks, Parallax does not compute hashes of the execution state, and can thus protect code with non-deterministic state. Parallax limits performance overhead to the verification code, while the protected code executes at its normal speed. This allows us to protect performance-critical code, and confine the slowdown to other code regions. Our experiments show that Parallax can protect up to 90% of code bytes, including most control flow instructions, with a performance overhead of under 4%.

【Keywords】: return-oriented programming; Tamperproofing; reverse engineering; code verification

Session 2B: Storage Systems 3

13. TIP-Code: A Three Independent Parity Code to Tolerate Triple Disk Failures with Optimal Update Complextiy.

【Paper Link】【Pages】:136-147

【Authors】: Yongzhe Zhang ; Chentao Wu ; Jie Li ; Minyi Guo

【Abstract】: With the rapid expansion of data storages and the increasing risk of data failures, triple Disk Failure Tolerant arrays (3DFTs) become popular and widely used. They achieve high fault tolerance via erasure codes. One class of erasure codes called Maximum Distance Separable (MDS) codes, which aims to offer data protection with minimal storage overhead, is a typical choice to enhance the reliability of storage systems. However, existing 3DFTs based on MDS codes are inefficient in terms of update complexity, which results in poor write performance. In this paper, we present an efficient MDS coding scheme called TIP-code, which is purely based on XOR operations and can tolerate triple disk failures. It uses three independent parities (horizontal, diagonal and anti-diagonal parities), and offers optimal update complexity. To demonstrate the effectiveness of TIP-code, we conduct several quantitative analysis and experiments. The results show that, compared to typical MDS codes for 3DFTs (i.e., Cauchy-RS and STAR codes), TIP-code improves the single write performance by up to 46.6%.

【Keywords】: Performance Evaluation; RAID; Reliability; Erasure Code; MDS Code; Triple Disk Failure; Update Complexity

14. Enabling Efficient and Reliable Transition from Replication to Erasure Coding for Clustered File Systems.

【Paper Link】【Pages】:148-159

【Authors】: Runhui Li ; Yuchong Hu ; Patrick P. C. Lee

【Abstract】: To balance performance and storage efficiency, modern clustered file systems (CFSes) often first store data with random replication (i.e., distributing replicas across randomly selected nodes), followed by encoding the replicated data with erasure coding. We argue that random replication, while being commonly used, does not take into account erasure coding and hence will raise both performance and availability issues to the subsequent encoding operation. We propose encoding-aware replication, which carefully places the replicas so as to (i) avoid cross-rack downloads of data blocks during encoding, (ii) preserve availability without data relocation after encoding, and (iii) maintain load balancing as in random replication. We implement encoding-aware replication on HDFS, and show via tested experiments that it achieves significant encoding throughput gains over random replication. We also show via discrete-event simulations that encoding-aware replication remains effective under various parameter choices in a large-scale setting. We further show that encoding-aware replication evenly distributes replicas as in random replication.

【Keywords】: Encoding; Ear; Fault tolerance; Fault tolerant systems; Bipartite graph; Load management; Throughput

15. Grouping-Based Elastic Striping with Hotness Awareness for Improving SSD RAID Performance.

【Paper Link】【Pages】:160-171

【Authors】: Yubiao Pan ; Yongkun Li ; Yinlong Xu ; Zhipeng Li

【Abstract】: RAID provides a good option to provide device-level fault tolerance. Conventional RAID usually updates parities with read-modify-write or read-reconstruct-write, which may introduce a lot of extra I/Os and thus significantly degrade SSD RAID performance. The recently proposed elastic striping scheme reconstructs new stripes with updated new data chunks without updating old parity chunks. However, it necessitates RAID-level garbage collection which may incur a very high cost. In this paper, we propose a hotness-aware caching scheme to buffer incoming writes and categorize data chunks in buffers into multiple groups according to their hotness values. We then propose a grouping-based elastic striping scheme to separately write data chunks in different groups into SSDs. We deployed the proposed schemes on a RAID-5 array composed of eight commercial SSDs, and experimental results show that compared to elastic striping, our scheme reduces 26% -- 65% of chunk writes to SSDs, and also reduces the average response time by 17.2% -- 63.9%.

【Keywords】: Elastic Striping; SSD RAID; RAID-level Garbage Collection; Endurance; Performance

Session 2C: Web Services 3

16. Joza: Hybrid Taint Inference for Defeating Web Application SQL Injection Attacks.

【Paper Link】【Pages】:172-183

【Authors】: Abbas Naderi-Afooshteh ; Anh Nguyen-Tuong ; Mandana Bagheri-Marzijarani ; Jason D. Hiser ; Jack W. Davidson

【Abstract】: Despite years of research on taint-tracking techniques to detect SQL injection attacks, taint tracking is rarely used in practice because it suffers from high performance overhead, intrusive instrumentation, and other deployment issues. Taint inference techniques address these shortcomings by obviating the need to track the flow of data during program execution by inferring markings based on either the program's input (negative taint inference), or the program itself (positive taint inference). We show that existing taint inference techniques are insecure by developing new attacks that exploit inherent weaknesses of the inferencing process. To address these exposed weaknesses, we developed Joza, a novel hybrid taint inference approach that exploits the complementary nature of negative and positive taint inference to mitigate their respective weaknesses. Our evaluation shows that Joza prevents real-world SQL injection attacks, exhibits no false positives, incurs low performance overhead (4%), and is easy to deploy.

【Keywords】: Web application security; Taint inference; Taint tracking; SQL injection

17. Private Browsing Mode Not Really That Private: Dealing with Privacy Breach Caused by Browser Extensions.

【Paper Link】【Pages】:184-195

【Authors】: Bin Zhao ; Peng Liu

【Abstract】: Private Browsing Mode (PBM) is widely supported by all major commodity web browsers. However, browser extensions can greatly undermine PBM. In this paper, we propose an approach to comprehensively identify and stop privacy breaches under PBM caused by browser extensions. Our approach is primarily based on run-time behavior tracking. We combine dynamic analysis and symbolic execution to represent extensions' behavior to identify privacy breaches in PBM caused by extensions. Our analysis shows that many extensions have not fulfilled PBM's guidelines on handling private browsing data. To the best of our knowledge, our approach also provides the first work to stop privacy breaches through instrumentation. We implemented a prototype SoPB on top of Firefox and evaluated it with 1,912 extensions. The results show that our approach can effectively identify and stop privacy breaches under PBM caused by extensions, with almost negligible performance impact.

【Keywords】: Dynamic analysis; Private Browsing Mode; Browser extensions; Privacy breach

18. Test-Based Interoperability Certification for Web Services.

【Paper Link】【Pages】:196-206

【Authors】: Ivano Alessandro Elia ; Nuno Laranjeiro ; Marco Vieira

【Abstract】: Web Services are designed with the key goal of providing interoperable application-to-application interaction, regardless of the platforms involved. Although experience shows that interoperability is difficult to achieve, developers still have limited tools to assess the interoperability of their services and, to the best of our knowledge, none able to support end-to-end interoperability certification. In this paper, we lay the foundations of an interoperability certification process for Web services, which allows testing the interoperability level of a given Web service and also identifying possible interoperability issues. In practice, the process can be used by developers or providers to certify a given web service for interoperability, ensuring successful interaction with client-side platforms. We show the effectiveness of the process by conducting a large experimental evaluation to certify five different implementations of the services specified by the TPC-App benchmark, and about 2500 synthetic generated services.client-side platforms.

【Keywords】: Testing; Web Services; Interoperability; Certification

Session 3A: Data Clusters, Servers and Clouds 3

19. Understanding the Dark Side of Big Data Clusters: An Analysis beyond Failures.

【Paper Link】【Pages】:207-218

【Authors】: Andrea Rosà ; Lydia Y. Chen ; Walter Binder

【Abstract】: Motivated by the high system complexity of today's datacenters, a large body of related studies tries to understand workloads and resource utilization in datacenters. However, there is little work on exploring unsuccessful job and task executions. In this paper, we study three types of unsuccessful executions in traces of a Google datacenter, namely fail, kill, and eviction. The objective of our analysis is to identify their resource waste, impacts on application performance, and root causes. We first quantitatively show their strong negative impact on CPU, RAM, and DISK usage and on task slowdown. We analyze patterns of unsuccessful jobs and tasks, particularly focusing on their interdependency. Moreover, we uncover their root causes by inspecting key workload and system attributes such as machine locality and concurrency level. Our results help in the design of low-latency and fault-tolerant big-data systems.

【Keywords】: Random access memory; Time factors; Google; Predictive models; Measurement; Electronic mail; Resource management

20. Improving Reliability with Dynamic Syndrome Allocation in Intelligent Software Defined Data Centers.

【Paper Link】【Pages】:219-230

【Authors】: Ulya Bayram ; Dwight Divine ; Pin Zhou ; Eric William Davis Rozier

【Abstract】: We propose new algorithms for implementing a software-defined data center (SDDC) to improve the dependability of storage systems without the addition of new hardware. We define the construction of a system that can predict its future resource requirements and act on these predictions to allocate overprovisioned resources to improve reliability. We introduce algorithms for implementing a smart SDDC (SSDDC) that characterizes user I/O transactions (writes and deletes), and use these models to predict the level of overprovisioning within a system, overbooking excess resources to improve reliability, while mitigating the impact on quality of service. We compare several implementations of our methods experimentally, and discuss methods for improving the fault-tolerance of our S2DDC, present experimental results showcasing our ability to improve system reliability showing the decrease in expected annual block loss due to disk failures and latent sector errors, and highlight the benefit of dependence based usage models in estimating overprovisioning.

【Keywords】: markov modeling; reliability; dependability; storage; data center; big data; predictive modeling

21. Experiences with Building Disaster Recovery for Enterprise-Class Clouds.

【Paper Link】【Pages】:231-238

【Authors】: Long Wang ; Harigovind V. Ramasamy ; Richard E. Harper ; Mahesh Viswanathan ; Edmond Plattier

【Abstract】: The ability to recover from disasters is an important requirement for many enterprises. With enterprise-class workloads increasingly hosted on the cloud, cloud customers have come to expect disaster recovery (DR) as a necessary feature from cloud platforms. This paper identifies key challenges in providing DR as a service on enterprise cloud platforms, and portrays DR solutions for a managed cloud platform. In particular, we present the reference architecture for DR solutions, and describe our practical experiences in providing a portfolio of DR solutions for the cloud platform. The solutions cover diverse target recovery sites, such as an equivalent cloud site, a dedicated recovery site, and a customer-owned site. From the experiences, we provide insights into and lessons on implementing DR for enterprise-class clouds.

【Keywords】: Automation; Cloud; Disaster Recovery

Session 3B: Attack Prevention 3

22. FloodGuard: A DoS Attack Prevention Extension in Software-Defined Networks.

【Paper Link】【Pages】:239-250

【Authors】: Haopei Wang ; Lei Xu ; Guofei Gu

【Abstract】: This paper addresses one serious SDN-specific attack, i.e., data-to-control plane saturation attack, which overloads the infrastructure of SDN networks. In this attack, an attacker can produce a large amount of table-miss packet_in messages to consume resources in both control plane and data plane. To mitigate this security threat, we introduce an efficient, lightweight and protocol-independent defense framework for SDN networks. Our solution, called FloodGuard, contains two new techniques/modules: proactive flow rule analyzer and packet migration. To preserve network policy enforcement, proactive flow rule analyzer dynamically derives proactive flow rules by reasoning the runtime logic of the SDN/OpenFlow controller and its applications. To protect the controller from being overloaded, packet migration temporarily caches the flooding packets and submits them to the OpenFlow controller using rate limit and round-robin scheduling. We evaluate FloodGuard through a prototype implementation tested in both software and hardware environments. The results show that FloodGuard is effective with adding only minor overhead into the entire SDN/OpenFlow infrastructure.

【Keywords】: Denial-of-Service Attack; Software-Defined Networking; SDN; Security

23. Enhancing Software Dependability and Security with Hardware Supported Instruction Address Space Randomization.

【Paper Link】【Pages】:251-262

【Authors】: Seung-Hun Kim ; Lei Xu ; Ziyi Liu ; Zhiqiang Lin ; Won Woo Ro ; Weidong Shi

【Abstract】: We present a micro-architecture based lightweight framework to enhance dependability and security of software against code reuse attack. Different from the prior hardware based approaches for mitigating code reuse attacks, our solution is based on software diversity and instruction level control flow randomization. Generally, software based instruction location randomization (ILR) using binary emulator as a mediation layer has been shown to be effective for thwarting code reuse attacks like return oriented programming (ROP). However, our in-depth studies show that straightforward and naive implementation of ILR at the micro-architecture level will incur major performance deficiencies in terms of instruction fetch and cache utilization. For example, straightforward implementation of ILR increases the first level instruction cache miss rates on average by more than 9 times for a set of SPEC CPU2006 benchmarks. To address these issues, we present a novel micro-architecture design that can support native execution of control flow randomized software binary while at the same time preserve the performance of instruction fetch and efficient use of on-chip caches. The proposed design is evaluated by extending cycle based x86 architecture simulator, XIOSim with validated power simulation. Performance evaluation on SPEC CPU2006 benchmarks shows an average speedup of 1.63 times compared to the hardware implementation of ILR. Using the proposed approach, direct execution of ILR software incurs only 2.1% IPC performance slowdown with a very small hardware overhead.

【Keywords】: software security; Instruction location randomization; microarchitecture; code reuse attack

【Paper Link】【Pages】:263-274

【Authors】: Davide Frey ; Rachid Guerraoui ; Anne-Marie Kermarrec ; Antoine Rault ; François Taïani ; Jingjing Wang

【Abstract】: Computing k-nearest-neighbor graphs constitutes a fundamental operation in a variety of data-mining applications. As a prominent example, user-based collaborative-filtering provides recommendations by identifying the items appreciated by the closest neighbors of a target user. As this kind of applications evolve, they will require KNN algorithms to operate on more and more sensitive data. This has prompted researchers to propose decentralized peer-to-peer KNN solutions that avoid concentrating all information in the hands of one central organization. Unfortunately, such decentralized solutions remain vulnerable to malicious peers that attempt to collect and exploit information on participating users. In this paper, we seek to overcome this limitation by proposing H&S (Hide & Share), a novel landmark-based similarity mechanism for decentralized KNN computation. Landmarks allow users (and the associated peers) to estimate how close they lay to one another without disclosing their individual profiles. We evaluate H&S in the context of a user-based collaborative-filtering recommender with publicly available traces from existing recommendation systems. We show that although landmark-based similarity does disturb similarity values (to ensure privacy), the quality of the recommendations is not as significantly hampered. We also show that the mere fact of disturbing similarity values turns out to be an asset because it prevents a malicious user from performing a profile reconstruction attack against other users, thus reinforcing users' privacy. Finally, we provide a formal privacy guarantee by computing an upper bound on the amount of information revealed by H&S about a user's profile.

【Keywords】: Recommender systems; Data privacy; Nearest neighbor searches; Peer-to-peer computing

Session 3C: Modeling and Assessment 3

25. Energy Resilience Modelling for Smart Houses.

【Paper Link】【Pages】:275-286

【Authors】: Hamed Ghasemieh ; Boudewijn R. Haverkort ; Marijn R. Jongerden ; Anne Remke

【Abstract】: The use of renewable energy in houses and neighbourhoods is very much governed by national legislation and has recently led to enormous changes in the energy market and poses a serious threat to the stability of the grid at peak production times. One of the approaches towards a more balanced grid is, e.g., taken by the German government by subsidizing local storage for solar power. While the main interest of the energy operator and the government is to balance the grid, thereby ensuring its stability, the main interest of the client is twofold: the total cost for electricity should be as low as possible and the house should be as resilient as possible in the presence of power outages. Using local battery storage can help to overcome the effects of power outages. However, the resulting resilience highly depends on the battery usage strategy employed by the controller, taking into account the state of charge of the battery. We present a Hybrid Petri net model of a house (that is mainly powered by solar energy) with a local storage unit, and analyse the impact of different battery usage strategies on its resilience for different production and consumption patterns. Our analysis shows that there is a direct relationship between resilience and flexibility, since increased resilience, i.e., reserving battery capacity for backup, decreases the flexibility of the storage unit.

【Keywords】: Hybrid Petri nets; smart grid; resilience; survivability

26. Impact of Malfunction on the Energy Efficiency of Batch Processing Systems.

【Paper Link】【Pages】:287-298

【Authors】: Marcello Cinque ; Domenico Cotroneo ; Flavio Frattini ; Stefano Russo

【Abstract】: Energy efficiency of large processing systems is usually assessed as the relation between a performance and a power consumption metric, neglecting malfunction. Execution failures have a tangible cost in terms of wasted energy, however. They are often managed through fault tolerance mechanisms, which in turn consume electricity. We introduce the consumability attribute for batch processing systems, encompassing performance, consumption, and dependability aspects altogether. We propose a metric for its quantification and a methodology for its analysis. Using a real 500-node batch system as a case study, we show that consumability is representative of both efficiency and effectiveness, and we show the usefulness of the proposed metric and the suitability of the proposed methodology.

【Keywords】: cooling; Performance; consumption; dependability; fault tolerance; energy efficiency; green computing

27. phpSAFE: A Security Analysis Tool for OOP Web Application Plugins.

【Paper Link】【Pages】:299-306

【Authors】: Paulo Jorge Costa Nunes ; José Fonseca ; Marco Vieira

【Abstract】: There is nowadays an increasing pressure to develop complex Web applications at a fast pace. The vast majority is built using frameworks based on third-party server-side plugins that allow developers to easily add new features. However, as many plugin developers have limited programming skills, there is a spread of security vulnerabilities related to their use. Best practices advise the use of systematic code review for assure security, but free tools do not support OOP, which is how most Web applications are currently developed. To address this problem we propose phpSAFE, a static code analyzer that identifies vulnerabilities in PHP plugins developed using OOP. We evaluate phpSAFE against two well-known tools using 35 plugins for a widely used CMS. Results show that phpSAFE clearly outperforms other tools, and that plugins are being shipped with a considerable number of vulnerabilities, which tends to increase over time.

【Keywords】: vulnerabilities; Static analysis; web application plugins; security

Session 4A: Mitigation of Hardware Faults 3

28. BAAT: Towards Dynamically Managing Battery Aging in Green Datacenters.

【Paper Link】【Pages】:307-318

【Authors】: Longjun Liu ; Chao Li ; Hongbin Sun ; Yang Hu ; Juncheng Gu ; Tao Li

【Abstract】: Energy storage devices (batteries) have shown great promise in eliminating supply/demand power mismatch and reducing energy/power cost in green datacenters. These important components progressively age due to irregular usage patterns, which result in less effective capacity and even pose serious threat to server availability. Nevertheless, prior proposals largely ignore the aging issue of batteries or simply use ad-hoc discharge capping to extend their lifetime. To fill this critical void, we thoroughly investigate battery aging on a heavily instrumented prototype over an observation period of six months. We propose battery anti-aging treatment (BAAT), a novel framework for hiding, reducing, and planning the battery aging effects. We show that BAAT can extend battery lifetime by 69%. It enables datacenters to maximally utilize energy storage resources to enhance availability and boost performance. Moreover, it reduces 26% battery cost and allows datacenters to economically scale in the big data era.

【Keywords】: System Prototype; Battery Aging; Analysis; Green Datacenters; Power Management

29. Avoiding Pitfalls in Fault-Injection Based Comparison of Program Susceptibility to Soft Errors.

【Paper Link】【Pages】:319-330

【Authors】: Horst Schirmeier ; Christoph Borchert ; Olaf Spinczyk

【Abstract】: Since the first identification of physical causes for soft errors in memory circuits, fault injection (FI) has grown into a standard methodology to assess the fault resilience of computer systems. A variety of FI techniques trying to mimic these physical causes has been developed to measure and compare program susceptibility to soft errors. In this paper, we analyze the process of evaluating programs, which are hardened by software-based hardware fault-tolerance mechanisms, under a uniformly distributed soft-error model. We identify three pitfalls in FI result interpretation widespread in the literature, even published in renowned conference proceedings. Using a simple machine model and transient single-bit faults in memory, we find counterexamples that reveal the unfitness of common practices in the field, and substantiate our findings with real-world examples. In particular, we demonstrate that the fault coverage metric must be abolished for comparing programs. Instead, we propose to use extrapolated absolute failure counts as a valid comparison metric.

【Keywords】: SIHFT; Fault Injection; Program Susceptibility Comparison; Soft Errors; Fault Coverage Factor; Absolute Failure Count; Result Interpretation; Single-Bit Flips

30. Warped-RE: Low-Cost Error Detection and Correction in GPUs.

【Paper Link】【Pages】:331-342

【Authors】: Mohammad Abdel-Majeed ; Waleed Dweik ; Hyeran Jeon ; Murali Annavaram

【Abstract】: Graphics processing units (GPUs) are now the dominant computing fabric within many supercomputers. As such many mission critical applications run on GPUs, which demand stringent reliability and computational correctness guarantees from GPUs. Prior approaches to GPU reliability have tackled solely either error detection, or error correction assuming error detection is already present. In this paper we present Warped Redundant Execution (Warped-RE), a unified framework that is capable of detecting and then correcting transient and nontransient errors in the GPU execution lanes. Our work exploits two critical properties of applications running on GPUs. First, we observe that neighboring execution lanes in GPUs may operate on the same values. Thus when neighboring lanes execute the same instruction using the same values then these lanes provide inherent DMR (dual modular redundancy) or even inherent TMR (triple modular redundancy) opportunities. The second property we exploit is that due to insufficient parallelism or due to branch divergence, applications do not fully utilize all the available execution lanes. In this case it is possible to force DMR or TMR on unused execution lanes, when inherent redundancy is insufficient. During error-free execution, Warped-RE uses a combination of inherent and forced DMR to guarantee that every thread computation within every warp instruction will be verified. When an error is detected in a warp instruction, the instruction is re-executed in TMR mode in order to correct the error and identify execution lanes with potential non-transient errors. Our evaluations show 8.4% and 29% average performance overhead during the DMR and TMR operation modes, respectively. Compared to traditional DMR and TMR, Warped-RE reduces the power overhead by 42% and 40%, respectively.

【Keywords】: value similarity; GPUs; error detection; error correction

Session 4B: Mobile Devices 3

31. Decomposable Trust for Android Applications.

【Paper Link】【Pages】:343-354

【Authors】: Earlence Fernandes ; Ajit Aluri ; Alexander Crowell ; Atul Prakash

【Abstract】: Current operating system designs require applications (apps) to implicitly place trust in a large amount of code. Taking Android as an example, apps must trust both the kernel as well as privileged userspace services that consist of hundreds of thousands of lines of code. Malware apps, on the other hand, aim to exploit any vulnerabilities in the above large trusted base to escalate their privileges. Once malware escalates its privileges, additional attacks become feasible, such as stealing credentials by scanning memory pages or intercepting user interactions of sensitive apps, e.g., those used for banking or health management. This paper introduces a novel mechanism, called Anception, that strategically deprivileges a significant portion of the kernel and system services, moving them to an untrusted container, thereby significantly reducing the attack surface for privilege escalation available to malware. Anception supports unmodified apps, running on a modified Android kernel. It achieves performance close to native Android on several popular macro benchmarks and provides security against many types of known Android root exploits.

【Keywords】: Trust Decomposition; Android; Virtualization; Root Exploits

32. Reducing Refresh Power in Mobile Devices with Morphable ECC.

【Paper Link】【Pages】:355-366

【Authors】: Chia-Chen Chou ; Prashant J. Nair ; Moinuddin K. Qureshi

【Abstract】: Energy consumption is a primary consideration that determines the usability of emerging mobile computing devices such as smartphones. Refresh operations for main memory account for a significant fraction of the overall energy consumption, especially during idle periods, when processor can be switched off quickly, however, memory contents continue to get refreshed to avoid data loss. Given that mobile devices are idle most of the times, reducing refresh power in idle mode is critical to maximize the duration for which the device remains usable. The frequency of refresh operations in memory can be reduced significantly by using strong multi-bit error correction codes (ECC). Unfortunately, strong ECC codes incur high latency, which causes significant performance degradation (as high as 21%, and on average 10%). To obtain both low refresh power in idle periods and high performance in active periods, this paper proposes Morphable ECC (MECC). During idle periods, MECC keeps the memory protected with 6-bit ECC (ECC-6) and employs a refresh period of 1 second, instead of the typical refresh period of 64ms. During active operation, MECC reduces the refresh interval to 64ms, and converts memory from ECC-6 to weaker ECC (single-bit error correction) on a demand-basis, thus avoiding the high latency of ECC-6, except for the first access during the active mode. Our proposal reduces refresh operations during idle mode by 16x, memory power in idle mode by 2X, while retaining performance within 2% of a system that does not use any ECC.

【Keywords】: Memory Reliability; Mobile DRAM; DRAM Refresh Rate; Mobile Mem- ory System; Error Correction Code; DRAM Power Consumption

33. TrustICE: Hardware-Assisted Isolated Computing Environments on Mobile Devices.

【Paper Link】【Pages】:367-378

【Authors】: He Sun ; Kun Sun ; Yuewu Wang ; Jiwu Jing ; Haining Wang

【Abstract】: Mobile devices have been widely used to process sensitive data and perform important transactions. It is a challenge to protect secure code from a malicious mobile OS. ARM TrustZone technology can protect secure code in a secure domain from an untrusted normal domain. However, since the attack surface of the secure domain will increase along with the size of secure code, it becomes arduous to negotiate with OEMs to get new secure code installed. We propose a novel TrustZone-based isolation framework named TrustICE to create isolated computing environments (ICEs) in the normal domain. TrustICE securely isolates the secure code in an ICE from an untrusted Rich OS in the normal domain. The trusted computing base (TCB) of TrustICE remains small and unchanged regardless of the amount of secure code being protected. Our prototype shows that the switching time between an ICE and the Rich OS is less than 12 ms.

【Keywords】: TrustZone; Computing Environment; Isolation

Session 5A: Network Security and Consensus 3

34. Delving into Internet DDoS Attacks by Botnets: Characterization and Analysis.

【Paper Link】【Pages】:379-390

【Authors】: An Wang ; Aziz Mohaisen ; Wentao Chang ; Songqing Chen

【Abstract】: Internet Distributed Denial of Service (DDoS) at- tacks are prevalent but hard to defend against, partially due to the volatility of the attacking methods and patterns used by attackers. Understanding the latest DDoS attacks can provide new insights for effective defense. But most of existing understandings are based on indirect traffic measures (e.g., backscatters) or traffic seen locally. In this study, we present an in-depth analysis based on 50,704 different Internet DDoS attacks directly observed in a seven-month period. These attacks were launched by 674 botnets from 23 different botnet families with a total of 9,026 victim IPs belonging to 1,074 organizations in 186 countries. Our analysis reveals several interesting findings about today's Internet DDoS attacks. Some highlights include: (1) geolocation analysis shows that the geospatial distribution of the attacking sources follows certain patterns, which enables very accurate source prediction of future attacks for most active botnet families, (2) from the target perspective, multiple attacks to the same target also exhibit strong patterns of inter-attack time interval, allowing accurate start time prediction of the next anticipated attacks from certain botnet families, (3) there is a trend for different botnets to launch DDoS attacks targeting the same victim, simultaneously or in turn. These findings add to the existing literature on the understanding of today's Internet DDoS attacks, and offer new insights for designing new defense schemes at different levels.

【Keywords】: Computer crime; Internet; Geology; IP networks; Organizations; Malware; Cities and towns

35. Consensus Refined.

【Paper Link】【Pages】:391-402

【Authors】: Ognjen Maric ; Christoph Sprenger ; David A. Basin

【Abstract】: Algorithms for solving the consensus problem are fundamental to distributed computing. Despite their brevity, their ability to operate in concurrent, asynchronous, and failure-prone environments comes at the cost of complex and subtle behaviors. Accordingly, understanding how they work and proving their correctness is a non-trivial endeavor where abstraction is immensely helpful. Moreover, research on consensus has yielded a large number of algorithms, many of which appear to share common algorithmic ideas. A natural question is whether and how these similarities can be distilled and described in a precise, unified way. In this work, we combine stepwise refinement and lockstep models to provide an abstract and unified view of a sizeable family of consensus algorithms. Our models provide insights into the design choices underlying the different algorithms, and classify them based on those choices. All our results are formalized and verified in the theorem prover Isabelle/HOL, yielding precision and strong correctness guarantees.

【Keywords】: formal methods; distributed algorithms; consensus; fault tolerance; fault tolerant computing; benign fault; refinement

36. Segugio: Efficient Behavior-Based Tracking of Malware-Control Domains in Large ISP Networks.

【Paper Link】【Pages】:403-414

【Authors】: Babak Rahbarinia ; Roberto Perdisci ; Manos Antonakakis

【Abstract】: In this paper, we propose Segugio, a novel defense system that allows for efficiently tracking the occurrence of new malware-control domain names in very large ISP networks. Segugio passively monitors the DNS traffic to build a machine-domain bipartite graph representing who is querying what. After labelling nodes in this query behavior graph that are known to be either benign or malware-related, we propose a novel approach to accurately detect previously unknown malware-control domains. We implemented a proof-of-concept version of Segugio and deployed it in large ISP networks that serve millions of users. Our experimental results show that Segugio can track the occurrence of new malware-control domains with up to 94% true positives (TPs) at less than 0.1% false positives (FPs). In addition, we provide the following results: (1) we show that Segugio can also detect control domains related to new, previously unseen malware families, with 85% TPs at 0.1% FPs, (2) Segugio's detection models learned on traffic from a given ISP network can be deployed into a different ISP network and still achieve very high detection accuracy, (3) new malware-control domains can be detected days or even weeks before they appear in a large commercial domain name blacklist, and (4) we show that Segugio clearly outperforms Notos, a previously proposed domain name reputation system.

【Keywords】: Large-scale Data Analysis; Malware-control Domains; DNS; Graph Learning; Behavioral Learning

Session 5B: Memory Errors 3

37. Revisiting Memory Errors in Large-Scale Production Data Centers: Analysis and Modeling of New Trends from the Field.

【Paper Link】【Pages】:415-426

【Authors】: Justin Meza ; Qiang Wu ; Sanjeev Kumar ; Onur Mutlu

【Abstract】: Computing systems use dynamic random-access memory (DRAM) as main memory. As prior works have shown, failures in DRAM devices are an important source of errors in modern servers. To reduce the effects of memory errors, error correcting codes (ECC) have been developed to help detect and correct errors when they occur. In order to develop effective techniques, including new ECC mechanisms, to combat memory errors, it is important to understand the memory reliability trends in modern systems. In this paper, we analyze the memory errors in the entire fleet of servers at Facebook over the course of fourteen months, representing billions of device days. The systems we examine cover a wide range of devices commonly used in modern servers, with DIMMs manufactured by 4 vendors in capacities ranging from 2 GB to 24 GB that use the modern DDR3 communication protocol. We observe several new reliability trends for memory systems that have not been discussed before in literature. We show that (1) memory errors follow a power-law, specifically, a Pareto distribution with decreasing hazard rate, with average error rate exceeding median error rate by around 55×, (2) non-DRAM memory failures from the memory controller and memory channel cause the majority of errors, and the hardware and software overheads to handle such errors cause a kind of denial of service attack in some servers, (3) using our detailed analysis, we provide the first evidence that more recent DRAM cell fabrication technologies (as indicated by chip density) have substantially higher failure rates, increasing by 1.8× over the previous generation, (4) DIMM architecture decisions affect memory reliability: DIMMs with fewer chips and lower transfer widths have the lowest error rates, likely due to electrical noise reduction, (5) while CPU and memory utilization do not show clear trends with respect to failure rates, workload type can influence failure rate by up to 6:5×, suggesting certain memory access patterns may induce more errors, (6) we develop a model for memory reliability and show how system design choices such as using lower density DIMMs and fewer cores per chip can reduce failure rates of a baseline server by up to 57.7%, and (7) we perform the first implementation and real-system analysis of page offlining at scale, showing that it can reduce memory error rate by 67%, and identify several real-world impediments to the technique.

【Keywords】: warehouse-scale computing; DRAM; main memory; reliability

38. AVATAR: A Variable-Retention-Time (VRT) Aware Refresh for DRAM Systems.

【Paper Link】【Pages】:427-437

【Authors】: Moinuddin K. Qureshi ; Dae-Hyun Kim ; Samira Manabi Khan ; Prashant J. Nair ; Onur Mutlu

【Abstract】: Multirate refresh techniques exploit the non-uniformity in retention times of DRAM cells to reduce the DRAM refresh overheads. Such techniques rely on accurate profiling of retention times of cells, and perform faster refresh only for a few rows which have cells with low retention times. Unfortunately, retention times of some cells can change at runtime due to Variable Retention Time (VRT), which makes it impractical to reliably deploy multirate refresh. Based on experimental data from 24 DRAM chips, we develop architecture-level models for analyzing the impact of VRT. We show that simply relying on ECC DIMMs to correct VRT failures is unusable as it causes a data error once every few months. We propose AVATAR, a VRT-aware multirate refresh scheme that adaptively changes the refresh rate for different rows at runtime based on current VRT failures. AVATAR provides a time to failure in the regime of several tens of years while reducing refresh operations by 62%-72%.

【Keywords】: Memory Scrubbing; Dynamic Random Access Memory; Refresh Rate; Variable Retention Time; Error Correcting Codes; Performance

39. Read Disturb Errors in MLC NAND Flash Memory: Characterization, Mitigation, and Recovery.

【Paper Link】【Pages】:438-449

【Authors】: Yu Cai ; Yixin Luo ; Saugata Ghose ; Onur Mutlu

【Abstract】: NAND flash memory reliability continues to degrade as the memory is scaled down and more bits are programmed per cell. A key contributor to this reduced reliability is read disturb, where a read to one row of cells impacts the threshold voltages of unread flash cells in different rows of the same block. Such disturbances may shift the threshold voltages of these unread cells to different logical states than originally programmed, leading to read errors that hurt endurance. For the first time in open literature, this paper experimentally characterizes read disturb errors on state-of-the-art 2Y-nm (i.e., 20-24 nm) MLC NAND flash memory chips. Our findings (1) correlate the magnitude of threshold voltage shifts with read operation counts, (2) demonstrate how program/erase cycle count and retention age affect the read-disturb-induced error rate, and (3) identify that lowering pass-through voltage levels reduces the impact of read disturb and extend flash lifetime. Particularly, we find that the probability of read disturb errors increases with both higher wear-out and higher pass-through voltage levels. We leverage these findings to develop two new techniques. The first technique mitigates read disturb errors by dynamically tuning the pass-through voltage on a per-block basis. Using real workload traces, our evaluations show that this technique increases flash memory endurance by an average of 21%. The second technique recovers from previously-uncorrectable flash errors by identifying and probabilistically correcting cells susceptible to read disturb errors. Our evaluations show that this recovery technique reduces the raw bit error rate by 36%.

【Keywords】: error tolerance; NAND flash memory; read disturb

Session 6A: Fault Tolerance 3

40. Fine-Grained Characterization of Faults Causing Long Latency Crashes in Programs.

【Paper Link】【Pages】:450-461

【Authors】: Guanpeng Li ; Qining Lu ; Karthik Pattabiraman

【Abstract】: As the rate of transient hardware faults increases, researchers have investigated software techniques to tolerate these faults. An important class of faults are those that cause long- latency crashes (LLCs), or faults that can persist for a long time in the program before causing it to crash. In this paper, we develop a technique to automatically find program locations where LLC causing faults originate so that the locations can be protected to bound the program's crash latency. We first identify program code patterns that are responsible for the majority of LLC causing faults through an empirical study. We then build CRASHFINDER, a tool that finds LLC locations by statically searching the program for the patterns, and then refining the static analysis results with a dynamic analysis and selective fault injection-based approach. We find that CRASHFINDER can achieve an average of 9.29 orders of magnitude time reduction to identify more than 90% of LLC causing locations in the program, compared to exhaustive fault injection techniques, and has no false-positives.

【Keywords】: checkpoint corruption; long-latency crashes; hardware faults

41. User-Constraint and Self-Adaptive Fault Tolerance for Event Stream Processing Systems.

【Paper Link】【Pages】:462-473

【Authors】: Andre Martin ; Tiaraju Smaneoto ; Tobias Dietze ; Andrey Brito ; Christof Fetzer

【Abstract】: Event Stream Processing (ESP) Systems are currently enabling a renaissance in the data processing area as they provide results at low latency compared to the traditional MapReduce approach. Although the majority of ESP systems offer some form of fault tolerance to their users, the provided fault tolerance scheme is often not tailored to the application at hand. For example, active replication is well suited for critical applications where unresponsiveness due to a background recovery process is not acceptable. However, for other classes of applications without such tight constraints, the use of passive replication, based on checkpoints and logging, is a better choice as it can save a significant amount of resources compared to active replication. In this paper, we present StreamMine3G, a fault tolerant and elastic ESP system which employs several fault tolerance schemes, such as passive and active replication as well as intermediate alternatives such as active and passive standby. In order to free the user from the burden of choosing the correct scheme for the application at hand, StreamMine3G is equipped with a fault-tolerance controller that transitions between the employed schemes during runtime in response to the evolution of the given workload and the user's provided constraints (recovery time and semantics, i.e., gap or precise). Our evaluation shows that the overall resource footprint for fault tolerance can be considerably reduced using our adaptive approach without consequences to the recovery time.

【Keywords】: gap recovery; fault tolerance; active replication; passive replication; active standby; passive standby; adaptation; deterministic execution; precise recovery

42. Lightweight Memory Checkpointing.

【Paper Link】【Pages】:474-484

【Authors】: Dirk Vogt ; Cristiano Giuffrida ; Herbert Bos ; Andrew S. Tanenbaum

【Abstract】: Memory check pointing is a pivotal technique in systems reliability, with applications ranging from crash recovery to replay debugging. Unfortunately, many traditional memory check pointing use-cases require high-frequency checkpoints, something for which existing application-level solutions are not well-suited. The problem is that they incur either substantial run-time performance overhead, or poor memory usage guarantees. As a result, their application in practice is hampered. This paper presents Lightweight Memory Check pointing (LMC), a new user-level memory check pointing technique that combines low performance overhead with strong memory usage guarantees for high check pointing frequencies. To this end, LMC relies on compiler-based instrumentation to shadow the entire memory address space of the running program and incrementally checkpoint modified memory bytes in a LMC-maintained shadow state. Our evaluation on popular server applications demonstrates the viability of our approach in practice, confirming that LMC imposes low performance overhead with strictly bounded memory usage at runtime.

【Keywords】: Compiler- based instrumentation; Memory checkpointing; Shadow memory

Session 6B: Vulnerability Detection and Mitigation 3

43. HeapTherapy: An Efficient End-to-End Solution against Heap Buffer Overflows.

【Paper Link】【Pages】:485-496

【Authors】: Qiang Zeng ; Mingyi Zhao ; Peng Liu

【Abstract】: For decades buffer overflows have been one of the most prevalent and dangerous software vulnerabilities. Although many techniques have been proposed to address the problem, they mostly introduce a very high overhead while others assume the availability of a separate system to pinpoint attacks or provide detailed traces for defense generation, which is very slow in itself and requires considerable extra resources. We propose an efficient solution against heap buffer overflows that integrates exploit detection, defense generation, and overflow prevention in a single system, named Heap Therapy. During program execution it conducts on-the-fly lightweight trace collection and exploit detection, and initiates automated diagnosis upon detection to generate defenses in real-time. It can handle both over-write and over-read attacks, such as the recent Heartbleed attack. The system has no false positives, and keeps effective under polymorphic exploits.%as the generated defense captures semantic characteristics of exploits. It is compliant with mainstream hardware and operating systems, and does not rely on specific allocation algorithms. We evaluated Heap Therapy on a variety of services (database, web, and ftp) and benchmarks (SPEC CPU2006), it incurs a very low average overhead in terms of both speed (6.2%) and memory (7.7%).

【Keywords】: Context; Encoding; Resource management; Monitoring; Hardware; Computer bugs; Security

44. Smart-TV Security Analysis: Practical Experiments.

【Paper Link】【Pages】:497-504

【Authors】: Yann Bachy ; Frederic Basse ; Vincent Nicomette ; Eric Alata ; Mohamed Kaâniche ; Jean-Christophe Courrège ; Pierre Lukjanenko

【Abstract】: Modern home networks are becoming more and more complex with the integration of various types of interconnected smart devices, using heterogeneous networking technologies. Many of these devices are also connected to the Internet, generally through an integrated access device. Those smart devices are potentially vulnerable to several types of attacks. In this practical experience report we investigate the specific case of smart TVs. The main objective is to experimentally explore possible attack vectors and identify practically exploitable vulnerabilities and attack scenarios. In particular, the study covers local and remote attacks using different entry points, including the Digital Video Broadcasting (DVB) transmission channel and the copper-pair local loop. Several methods, allowing to observe and simulate service provider networks, are used to support several experiments considering four types of commercially available smart TVs for a comparative analysis. We also discuss several methods allowing to extract and analyze the embedded firmware, and obtain relevant information concerning target devices.

【Keywords】: Digital video broadcasting; TV; Security; Internet; Modulation; Copper; Multiplexing

45. On the Metrics for Benchmarking Vulnerability Detection Tools.

【Paper Link】【Pages】:505-516

【Authors】: Nuno Antunes ; Marco Vieira

【Abstract】: Research and practice show that the effectiveness of vulnerability detection tools depends on the concrete use scenario. Benchmarking can be used for selecting the most appropriate tool, helping assessing and comparing alternative solutions, but its effectiveness largely depends on the adequacy of the metrics. This paper studies the problem of selecting the metrics to be used in a benchmark for software vulnerability detection tools. First, a large set of metrics is gathered and analyzed according to the characteristics of a good metric for the vulnerability detection domain. Afterwards, the metrics are analyzed in the context of specific vulnerability detection scenarios to understand their effectiveness and to select the most adequate one for each scenario. Finally, an MCDA algorithm together with experts' judgment is applied to validate the conclusions. Results show that although some of the metrics traditionally used like precision and recall are adequate in some scenarios, others require alternative metrics that are seldom used in the benchmarking area.

【Keywords】: Vulnerability Detection; Automated Tools; Benchmarking; Security Metrics; Software Vulnerabilities

Session 7A: Assurance of Complex Systems 3

46. Formal Assurance Arguments: A Solution in Search of a Problem?

【Paper Link】【Pages】:517-528

【Authors】: Patrick John Graydon

【Abstract】: An assurance case comprises evidence and argument showing how that evidence supports assurance claims (e.g., about safety or security). It is unsurprising that some computer scientists have proposed formalising assurance arguments: most associate formality with rigour. But while engineers can sometimes prove that source code refines a formal specification, it is not clear that formalisation will improve assurance arguments or that this benefit is worth its cost. For example, formalisation might reduce the benefits of argumentation by limiting the audience to people who can read formal logic. In this paper, we present (1) a systematic survey of the literature surrounding formal assurance arguments, (2) an analysis of errors that formalism can help to eliminate, (3) a discussion of existing evidence, and (4) suggestions for experimental work to definitively answer the question.

【Keywords】: formal argumentation; safety case; security case; assurance argument

47. A Quality Control Engine for Complex Physical Systems.

【Paper Link】【Pages】:529-536

【Authors】: Haifeng Chen ; Mizoguchi Takehiko ; Yan Tan ; Kai Zhang ; Geoff Jiang

【Abstract】: This paper proposes a novel framework to automatically pinpoint suspicious sensors that lead to the quality change in physical systems such as manufacture plants. Our framework treats sensor readings as time series, and contains three main stages: time series transformation to feature series, feature ranking, and ranking score fusion. In the first step, we transform time series into a number of different feature series to describe the underlying dynamics of each sensor data. After that, the importance scores of all feature series are computed by utilizing several feature selection and ranking techniques, each of which discovers specific aspects of feature importance and their dependencies in the feature space. Finally we combine importance scores from all the rankers and all the features to obtain the final ranking of each sensor with respect to the system quality change. Our experiments based on synthetic time series as well as sensor data from a real system demonstrate the effectiveness of proposed method. In addition, we have implemented our framework as a production engine, and successfully applied it to several real physical systems.

【Keywords】: regularization; Time series; quality control; feature extraction; sliding window; feature selection

48. Discovering and Visualizing Operations Processes with POD-Discovery and POD-Viz.

【Paper Link】【Pages】:537-544

【Authors】: Ingo Weber ; Chao Li ; Len Bass ; Xiwei Xu ; Liming Zhu

【Abstract】: Understanding the behavior of an operations process and capturing it as an abstract process model has been shown to improve dependability significantly [1]. In particular, process context can be used for error detection, diagnosis, and even automated recovery. Creating the process model is an essential step in determining process context and, consequently, improving dependability. This paper describes two systems. The first, POD-Discovery, simplifies the creation of such an abstract process model from operations logs. An activity that previously required many manual steps can now be done largely automatically and in minutes. Using the discovered model, the second system, POD-Viz, provides operators with the ability to visualize the current state of an operations process in near-real-time and to replay a set of events to understand how the process context changed over time. This allows operators to trace the progress of an operations process easily, and helps in analyzing encountered errors.

【Keywords】: Monitoring; Dependability; System Operation; System Administration; Cloud Computing; Process Modelling

Session 7B: Preventing Memory and Information Leakage 2

49. Incinerator - Eliminating Stale References in Dynamic OSGi Applications.

【Paper Link】【Pages】:545-554

【Authors】: Koutheir Attouchi ; Gaël Thomas ; Gilles Muller ; Julia L. Lawall ; André Bottaro

【Abstract】: Java class loaders are commonly used in application servers to load, unload and update a set of classes as a unit. However, unloading or updating a class loader can introduce stale references to the objects of the outdated class loader. A stale reference leads to a memory leak and, for an update, to an inconsistency between the outdated classes and their replacements. To detect and eliminate stale references, we propose Incinerator, a Java virtual machine extension that introduces the notion of an outdated class loader. Incinerator detects stale references and sets them to null during a garbage collection cycle. We evaluate Incinerator in the context of the OSGi framework and show that Incinerator correctly detects and eliminates stale references, including a bug in Knopflerfish. We also evaluate the performance of Incinerator with the DaCapo benchmark on VMKit and show that Incinerator has an overhead of at most 3.3%.

【Keywords】: Incineration; Java; Monitoring; Middleware; Virtual machining; Context; Loading

50. Risk Assessment of Buffer "Heartbleed" Over-Read Vulnerabilities.

【Paper Link】【Pages】:555-562

【Authors】: Jun Wang ; Mingyi Zhao ; Qiang Zeng ; Dinghao Wu ; Peng Liu

【Abstract】: Buffer over-read vulnerabilities (e.g., Heartbleed) can lead to serious information leakage and monetary lost. Most of previous approaches focus on buffer overflow (i.e., over-write), which are either infeasible (e.g., canary) or impractical (e.g., bounds checking) in dealing with over-read vulnerabilities. As an emerging type of vulnerability, people need in-depth understanding of buffer over-read: the vulnerability, the security risk and the defense methods. This paper presents a systematic methodology to evaluate the potential risks of unknown buffer over-read vulnerabilities. Specifically, we model the buffer over-read vulnerabilities and focus on the quantification of how much information can be potentially leaked. We perform risk assessment using the RUBiS benchmark which is an auction site prototype modeled after eBay.com. We evaluate the effectiveness and performance of a few mitigation techniques and conduct a quantitative risk measurement study. We find that even simple techniques can achieve significant reduction on information leakage against over-read with reasonable performance penalty. We summarize our experience learned from the study, hoping to facilitate further studies on the over-read vulnerability.

【Keywords】: Risk management; Measurement; Payloads; Benchmark testing; Memory management; Heart rate variability; Entropy

Workshops Summary 4

51. Workshop on Dependability Issues on SDN and NFV (DISN).

【Paper Link】【Pages】:563-564

【Authors】: Elias Procópio Duarte Jr. ; Matti A. Hiltunen

【Abstract】: Software-Defined Networks (SDN) and Network Function Virtualization (NFV) are two technologies that have already had a deep impact on computer and telecommunication networks. Software Defined Networks (SDN) decouple network control from forwarding functions, enabling network control to become directly programmable and the underlying infrastructure to be abstracted from applications and network services. Network Function Virtualization (NFV) is a network architecture concept where IT virtualization techniques are used to implement network node functions as building blocks that may be combined, or chained, together to create communication services. SDN and NFV make it simpler and faster to deploy and manage new services, avoiding the cost and the long time frame required to design and implement hardwarebased network services. SDN and NVF introduce numerous dependability challenges. In terms of reliability, the challenges range from the design of reliable new SDN and NFV technologies to the adaptation of classical network functions to these technologies. The effective, dependable deployment of the virtual network on the physical substrate is particularly important. In terms of security, the challenges are enormous, as SDN and NFV are meant to be the very fabric of both the Internet and private networks. Threats, privacy concerns, authentication issues, and isolation - defining a truly secure virtualized network requires work on multiple fronts. The program of DISN'2015 consists of 3 technical papers and 2 keynotes, which are briefly described.

【Keywords】:

52. Workshop on Model Based Design for Cyber-Physical Systems (MB4CP).

【Paper Link】【Pages】:565-566

【Authors】: Alberto Avritzer ; Daniel Sadoc Menasché ; Kishor S. Trivedi ; Lucia Happe ; Sahra Sedigh Sarvestani

【Abstract】: This paper provides a summary of the First International Workshop on Model Based Design for Cyber- Physical Systems (MB4CP 2015) in conjunction with DSN 2015 conference in Rio de Janeiro, Brazil.

【Keywords】: Unified modeling language; Analytical models; Conferences; Cyber-physical systems; Computational modeling; Electronic mail; Adaptation models

53. Workshop on Recent Advances in the DependabIlity AssessmeNt of Complex systEms (RADIANCE).

【Paper Link】【Pages】:567-568

【Authors】: Ariadne M. B. R. Carvalho ; Nuno Antunes ; Andrea Ceccarelli ; András Zentai

【Abstract】: The workshop on Recent Advances in the DependabIlity AssessmeNt of Complex systEms (RADIANCE), in its first edition, aims to discuss novel dependability assessment approaches for complex systems and to promote their adoption in real-world settings through industrial and academic research. The main objective is to promote and foster discussion on novel ideas, constituting a forum where researchers can share both real problems and innovative solutions for the assessment of complex systems. The workshop focuses on assessing complex evolving systems, where increasing complexity and changes are due to the introduction of new components and sensors, and to the extensive usage of software OTS components or black box components in general. In this macro area, the workshop welcomed a broad list of applications ranging from agile development in critical systems to model-driven assessment approaches as well as new needs for verification, validation and certification of dynamic and evolving systems, which also includes solutions for automating the verification and validation processes. Finally, the workshop was interested inexperimental assessment of dependability and security at large.

【Keywords】: dynamic systems; dependability assessment; complex systems; fault injection

54. Workshop on Safety and Security of Intelligent Vehicles (SSIV).

【Paper Link】【Pages】:569-570

【Authors】: João Carlos Cunha ; Kalinka Branco ; Antonio Casimiro ; Urbano Nunes

【Abstract】: For intelligent vehicles to become a reality, further research and development must be performed, addressing the needs of multidisciplinary approaches like integrated control systems, communication and network, security algorithms, artificial intelligence, verification and validation, neural networks, safety assets and other technological concerns. The goal of this workshop is to explore the challenges and innovative solutions regarding intelligent vehicles, considering the implications of security and real-time issues on safety and certification, which emerge when introducing networked, autonomous and cooperative functionalities. It aims at joining together in an active debate, researchers and practitioners from several communities, namely dependability and security, realtime and embedded systems, intelligent transportation and mobile robot systems. This workshop is aimed at exploring the challenges and innovative solutions related to the security and safety of intelligent vehicles.