IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2012, Boston, MA, USA, June 25-28, 2012. IEEE Computer Society 【DBLP Link】
【Paper Link】 【Pages】:1-12
【Authors】: Thorsten Piper ; Stefan Winter ; Paul Manns ; Neeraj Suri
【Abstract】: The AUTOSAR standard guides the development of component-based automotive software. As automotive software typically implements safety-critical functions, it needs to fulfill high dependability requirements, and the effort put into the quality assurance of these systems is correspondingly high. Testing, fault injection (FI), and other techniques are employed for the experimental dependability assessment of these increasingly software-intensive systems. Having flexible and automated support for instrumentation is key in making these assessment techniques efficient. However, providing a usable, customizable and performant instrumentation for AUTOSAR is non-trivial due to the varied abstractions and high complexity of these systems. This paper develops a dependability assessment guidance framework tailored towards AUTOSAR that helps identify the applicability and effectiveness of instrumentation techniques at (a) varied levels of software abstraction and granularity, (b) at varied software access levels - black-box, grey-box, white-box, and (c) the application of interface wrappers for conducting FI.
【Keywords】: run-time monitoring; AUTOSAR; instrumentation; interface wrappers; fault injection
【Paper Link】 【Pages】:1-12
【Authors】: Markus Becker ; Christoph Kuznik ; Mabel M. Joy ; Tao Xie ; Wolfgang Müller
【Abstract】: This paper presents a novel mutation-based testing method through binary mutation. For this, a table of mutants is derived by control-flow analysis of a disassembled binary under test. Mutations are injected at runtime by dynamic translation. Thus, our approach relies neither on source code nor on a particular compiler. As instrumentation is avoided, testing results correspond to the original binary. In addition to high-level language faults, the proposed approach captures target-specific faults related to compiling and linking. We investigated the software of an automotive case study. For this, a taxonomy of mutation operators for the ARM instruction set is proposed. Our experimental results demonstrate 100% accuracy w.r.t. confidence metrics provided by conventional testing methods while avoiding significant mutant compilation overhead. Further speed-up is achieved by an efficient binary mutation testing framework that extends the open-source software emulator QEMU.
【Keywords】: test confidence; Embedded software verification; software emulation; fault-based testing; mutation analysis
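The mutation-table idea can be illustrated with a toy Python sketch (a hypothetical simplification, not the paper's framework, which derives mutants via control-flow analysis and injects them through dynamic translation in QEMU): a single mutation operator that rewrites the opcode field of 32-bit ARM data-processing instructions, turning ADD into SUB.

```python
# Toy mutation operator for 32-bit ARM data-processing instructions.
# In the ARM encoding, bits 21..24 hold the opcode: ADD = 0b0100,
# SUB = 0b0010.  (Illustrative sketch only.)
ADD, SUB = 0b0100, 0b0010

def mutate_add_to_sub(word):
    """If `word` encodes an ADD, return a SUB mutant, else None."""
    opcode = (word >> 21) & 0xF
    if opcode != ADD:
        return None
    return (word & ~(0xF << 21)) | (SUB << 21)

def build_mutant_table(words):
    """Scan a list of instruction words; record (offset, mutant) pairs."""
    table = []
    for off, w in enumerate(words):
        m = mutate_add_to_sub(w)
        if m is not None:
            table.append((off, m))
    return table
```

For example, `0xE0800001` (ADD r0, r0, r1) mutates to `0xE0400001` (SUB r0, r0, r1); a runtime translator would substitute the mutant word when executing the corresponding address.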
【Paper Link】 【Pages】:1-8
【Authors】: Hajime Fujita ; Yutaka Matsuno ; Toshihiro Hanawa ; Mitsuhisa Sato ; Shinpei Kato ; Yutaka Ishikawa
【Abstract】: Today's information systems have become large and complex because they must interact with each other via networks. This makes testing and assuring the dependability of systems much more difficult than ever before. DS-Bench Toolset has been developed to address this issue, and it includes D-Case Editor, DS-Bench, and D-Cloud. D-Case Editor is an assurance case editor. It forms a tool chain with DS-Bench and D-Cloud and exploits the test results as evidence of the dependability of the system. DS-Bench manages dependability benchmarking tools and anomaly loads according to benchmarking scenarios. D-Cloud is a test environment for performing rapid system tests controlled by DS-Bench. It combines both a cluster of real machines for performance-accurate benchmarks and a cloud computing environment as a group of virtual machines for exhaustive function testing with a fault-injection facility. DS-Bench Toolset enables us to test systems satisfactorily and to explain the dependability of the systems to the stakeholders.
【Keywords】: virtual machine; dependability benchmarking; assurance case; fault injection; distributed systems
【Paper Link】 【Pages】:1-12
【Authors】: Parisa Jalili Marandi ; Marco Primi ; Fernando Pedone
【Abstract】: This paper addresses the scalability of group communication protocols. Scalability has become an issue of prime importance as data centers become commonplace. By scalability we mean the ability to increase the throughput of a group communication protocol, measured in number of requests ordered per time unit, by adding resources (i.e., nodes). We claim that existing group communication protocols do not scale in this respect and introduce Multi-Ring Paxos, a protocol that orchestrates multiple instances of Ring Paxos in order to scale to a large number of nodes. In addition to presenting Multi-Ring Paxos, we describe a prototype of the system we have implemented and a detailed evaluation of its performance.
【Keywords】: Throughput; Protocols; Servers; Scalability; Databases; Receivers; Educational institutions
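The core of composing several independently ordered instances can be illustrated with a hedged sketch (not the protocol itself): each Ring Paxos instance decides its own stream, and learners deterministically interleave the streams in fixed ring order so that all learners deliver the same total order. The real protocol additionally handles idle rings (e.g., via skip messages), flow control, and failures.

```python
from collections import deque

def merge_rings(streams):
    """Deterministically interleave the decided messages of several
    independently ordered rings: one message per ring per round, in
    fixed ring order, so every learner computes the same total order.
    (Sketch only; Multi-Ring Paxos must also cope with rings that
    have nothing to deliver in a round.)"""
    queues = [deque(s) for s in streams]
    out = []
    while any(queues):
        for q in queues:
            if q:
                out.append(q.popleft())
    return out
```

With two rings deciding `["a1", "a2"]` and `["b1"]`, every learner delivers `["a1", "b1", "a2"]`.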
【Paper Link】 【Pages】:1-12
【Authors】: Srikar Tati ; Bong-Jun Ko ; Guohong Cao ; Ananthram Swami ; Thomas F. La Porta
【Abstract】: In this paper, we propose an algorithm to efficiently diagnose large-scale clustered failures. The algorithm, Cluster-MAX-COVERAGE (CMC), is based on a greedy approach. We address the challenge of determining faults with incomplete symptoms. CMC makes novel use of both positive and negative symptoms to quickly output a hypothesis list with low numbers of false negatives and false positives. CMC requires reports from about half as many nodes as other existing algorithms to determine failures with 100% accuracy. Moreover, CMC accomplishes this gain significantly faster (sometimes by two orders of magnitude) than an algorithm that matches its accuracy. Furthermore, we propose an adaptive algorithm called Adaptive-MAX-COVERAGE (AMC) that performs efficiently during both kinds of failures, i.e., independent and clustered. During a series of failures that includes both independent and clustered failures, AMC results in fewer false negatives and false positives.
【Keywords】: Clustered failures; Fault diagnosis; Large-scale failures; Incomplete information
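The greedy max-coverage idea behind CMC can be sketched as follows (a simplification under assumed inputs, not the paper's exact algorithm): negative symptoms are observations consistent with a failure, positive symptoms are observations of working components, and the algorithm repeatedly picks the candidate fault that explains the most still-unexplained negative symptoms while discarding candidates contradicted by positive symptoms.

```python
def cluster_max_coverage(negative, positive, candidates):
    """Greedy hypothesis selection in the spirit of CMC (sketch).
    `candidates` maps a candidate fault to the set of symptoms it
    would explain; returns (hypothesis list, unexplained symptoms)."""
    uncovered = set(negative)
    hypothesis = []
    # Rule out candidates contradicted by positive (working) symptoms.
    viable = {f: set(s) for f, s in candidates.items()
              if not set(s) & set(positive)}
    while uncovered:
        best = max(viable, key=lambda f: len(viable[f] & uncovered),
                   default=None)
        if best is None or not viable[best] & uncovered:
            break
        hypothesis.append(best)
        uncovered -= viable.pop(best)
    return hypothesis, uncovered
```

For instance, with faults `linkA -> {s1, s2}`, `linkB -> {s2, s3}`, `linkC -> {s4}`, positive symptom `s4`, and negative symptoms `{s1, s2, s3}`, the greedy pass rules out `linkC` and explains everything with `linkA` and `linkB`.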
【Paper Link】 【Pages】:1-8
【Authors】: Jesus Friginal ; Juan Carlos Ruiz ; David de Andrés ; Antonio Bustos
【Abstract】: Wireless mesh networks (WMNs) typically rely on proactive routing protocols to establish optimal communication routes between every pair of system nodes. These protocols integrate link-quality-based mechanisms to minimise the adverse effect of ambient noise on communications. This paper shows the limitations of such mechanisms by analysing the impact of ambient noise on three state-of-the-art proactive routing protocols: OLSR, B.A.T.M.A.N. and Babel. As will be shown, the lack of context-awareness in their link-quality mechanisms prevents the protocols from adjusting their behaviour to persistent levels of ambient noise, which may vary over time. Consequently, they cannot minimise the impact of such noise on the availability of network routes. This issue is serious for a WMN, since the loss of communication links may strongly increase the convergence time of the network. An adaptive extension to the studied link-quality-based mechanisms is proposed to avoid the loss of communication links in the presence of high levels of ambient noise. The effectiveness of the proposal is experimentally assessed, thus establishing a new method to reduce the impact of ambient noise on WMNs.
【Keywords】: Ambient Noise; Wireless Mesh Networks; Proactive Routing Protocols; Adaptive Fault Tolerance
【Paper Link】 【Pages】:1-12
【Authors】: George Amvrosiadis ; Alina Oprea ; Bianca Schroeder
【Abstract】: Latent sector errors (LSEs) are a common hard disk failure mode, where disk sectors become inaccessible while the rest of the disk remains unaffected. To protect against LSEs, commercial storage systems use scrubbers: background processes verifying disk data. The efficiency of different scrubbing algorithms in detecting LSEs has been studied in depth; however, no attempts have been made to evaluate or mitigate the impact of scrubbing on application performance. We provide the first known evaluation of the performance impact of different scrubbing policies in implementation, including guidelines on implementing a scrubber. To lessen this impact, we present an approach giving conclusive answers to the questions: when should scrubbing requests be issued, and at what size, to minimize impact and maximize scrubbing throughput for a given workload. Our approach achieves six times more throughput, and up to three orders of magnitude less slowdown than the default Linux I/O scheduler.
【Keywords】: background scheduling; scrubbing; hard disk failures; latent sector errors; idleness predictors
【Paper Link】 【Pages】:1-12
【Authors】: Cristina Basescu ; Christian Cachin ; Ittay Eyal ; Robert Haas ; Alessandro Sorniotti ; Marko Vukolic ; Ido Zachevsky
【Abstract】: A key-value store (KVS) offers functions for storing and retrieving values associated with unique keys. KVSs have become the most popular way to access Internet-scale “cloud” storage systems. We present an efficient wait-free algorithm that emulates multi-reader multi-writer storage from a set of potentially faulty KVS replicas in an asynchronous environment. Our implementation serves an unbounded number of clients that use the storage concurrently. It tolerates crashes of a minority of the KVSs and crashes of any number of clients. Our algorithm minimizes the space overhead at the KVSs and comes in two variants providing regular and atomic semantics, respectively. Compared with prior solutions, it is inherently scalable and allows clients to write concurrently. Because of the limited interface of a KVS, textbook-style solutions for reliable storage either do not work or incur a prohibitively large storage overhead. Our algorithm maintains two copies of the stored value per KVS in the common case, and we show that this is indeed necessary. If there are concurrent write operations, the maximum space complexity of the algorithm grows in proportion to the point contention. A series of simulations explore the behavior of the algorithm, and benchmarks obtained with KVS cloud-storage providers demonstrate its practicality.
【Keywords】: Registers; Emulation; Robustness; Complexity theory; Computer crashes; Semantics; Automata
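The flavor of the construction can be conveyed with a toy sketch (assumed names and dict-based fake KVSs; the paper's wait-free algorithm handles concurrency, crashes, and semantics rigorously, which this does not): values are stored under versioned keys, reads and writes contact a majority of replicas, and garbage collection retains the two newest copies per KVS.

```python
class KVSRegister:
    """Toy MRMW register over dict-based KVS replicas (sketch only):
    versioned keys (version, writer_id), majority access, and
    garbage collection keeping the two newest copies per KVS."""
    def __init__(self, replicas):
        self.replicas = replicas
        self.majority = len(replicas) // 2 + 1

    def read(self):
        seen = {}
        for kvs in self.replicas[:self.majority]:
            seen.update(kvs)
        return seen[max(seen)] if seen else None

    def write(self, writer_id, value):
        latest = max((k for kvs in self.replicas[:self.majority]
                      for k in kvs), default=(0, 0))
        ver = (latest[0] + 1, writer_id)
        for kvs in self.replicas:
            kvs[ver] = value
            for old in sorted(kvs)[:-2]:  # keep the two newest copies
                del kvs[old]
```

The two-copy bound in the common case mirrors the space bound the paper proves necessary.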
【Paper Link】 【Pages】:1-12
【Authors】: Eric Rozier ; William H. Sanders
【Abstract】: In this paper we present a framework for analyzing the fault tolerance of deduplicated storage systems. We discuss methods for building models of deduplicated storage systems by analyzing empirical data on a file category basis. We provide an algorithm for generating component-based models from this information and a specification of the storage system architecture. Given the complex nature of detailed models of deduplicated storage systems, finding a solution using traditional discrete event simulation or numerical solvers can be difficult. We introduce an algorithm which allows for a more efficient solution by exploiting the underlying structure of dependencies to decompose the model of the storage system. We present a case study of our framework for a real system. We analyze a production deduplicated storage system and propose extensions which improve fault tolerance while maintaining high storage efficiency.
【Keywords】: decomposition; storage; deduplication; reliability; simulation
【Paper Link】 【Pages】:1-6
【Authors】: Cristian Constantinescu ; Mike Butler ; Chris Weller
【Abstract】: Single-event upsets (SEU) and single-event transients (SET) may lead to crashes or even silent data corruption (SDC) in microprocessors. Error detection and recovery features are employed to mitigate the impact of SEU and SET. However, these features add performance, area, power, and cost overheads. As a result, designers must concentrate their efforts on protecting the most sensitive areas of the processor. Simulated error injection was used to study the propagation of the SEU-induced soft errors in the latest AMD microprocessor module, Bulldozer. This paper presents the Bulldozer architecture, error injection methodology, and experimental results. Propagation of soft errors is quantified by derating factors. Error injection is performed both at the module and unit level, derating factors and simulation times being compared. Accuracy is assessed by deriving confidence intervals of the derating factors. The experiments point out the most sensitive units of the Bulldozer module, and allow efficient implementation of the error-handling features.
【Keywords】: error handling; soft errors; simulated error injection; microprocessor design; RTL model
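The two quantities at the heart of such campaigns can be sketched in a few lines (an illustration of the standard estimators, not AMD's methodology): the derating factor is the fraction of injected upsets that are masked, and its confidence interval follows from the binomial distribution of injection outcomes.

```python
import math

def derating_with_ci(masked, injected, z=1.96):
    """Estimate a derating factor (fraction of injected SEUs that are
    masked) with a 95% normal-approximation binomial confidence
    interval.  Sketch of the textbook estimator; the paper's exact
    statistical treatment may differ."""
    p = masked / injected
    half = z * math.sqrt(p * (1 - p) / injected)
    return p, (max(0.0, p - half), min(1.0, p + half))
```

With 900 of 1000 injections masked, the estimate is 0.9 with an interval of roughly ±0.019, illustrating how the interval width guides how many injections a unit-level campaign needs.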
【Paper Link】 【Pages】:1-12
【Authors】: Xin Xu ; Man-Lap Li
【Abstract】: Extreme CMOS scaling is expected to significantly impact the reliability of future microprocessors, prompting recent research efforts on low-cost hardware-software cross-layer reliability solutions. To evaluate such solutions, statistical fault injection (SFI) is often used to estimate the error coverage of the underlying method. Unfortunately, because a significant number of errors injected by SFI are derated, the evaluation becomes less rigorous and less efficient. This paper makes the observation that many derated errors can be gracefully avoided, allowing the fault injection campaign to focus on likely non-derated faults that stress the method under test. We propose a biased injection framework called CriticalFault that employs vulnerability analysis to map out relevant faults for stress testing. With CriticalFault, our results show that the injection space is reduced by 29% and that 59% of the biased injections cause either software aborts or silent data corruption, both improvements over SFI. Moreover, we characterize the different propagation behaviors of these non-derated faults and discuss the implications for designing future cross-layer solutions. Overall, not only is CriticalFault highly effective in identifying relevant test cases for current systems, but reliability researchers and engineers can also use it to conduct more in-depth and meaningful analysis when developing future reliability solutions.
【Keywords】: Soft Error; Fault Injection; Fault Analysis; Vulnerability; Microarchitecture
【Paper Link】 【Pages】:1-8
【Authors】: Kuan-Yu Tseng ; Daniel Chen ; Zbigniew Kalbarczyk ; Ravishankar K. Iyer
【Abstract】: With the advent of modern technologies, microprocessor-based devices are used to monitor and control critical infrastructures, e.g., electric power grids and oil and gas distribution. However, the security and reliability of these microprocessor-based systems are significant issues, since such systems are more susceptible to transient errors and malicious attacks. An error in one of these systems could have a cascading and catastrophic impact on the whole infrastructure. This paper explores the error resiliency of power grid substation devices. A software-implemented fault injection technique is used to induce errors/faults inside devices used in power grid substations. The goal is to test the ability of these systems to compute through errors/faults. Our results demonstrate that a single error in a substation device may render the operator in the control center unable to control the operation of a relay in the substation.
【Keywords】: fault injection; reliability; security; power grid
【Paper Link】 【Pages】:1-12
【Authors】: Daniel Y. Deng ; G. Edward Suh
【Abstract】: This paper proposes Harmoni, a high performance hardware accelerator architecture that can support a broad range of run-time monitoring and bookkeeping functions. Unlike custom hardware, which offers very little configurability after it has been fabricated, Harmoni is highly configurable and can allow a wide range of different hardware monitoring and bookkeeping functions to be dynamically added to a processing core even after the chip has already been fabricated. The Harmoni architecture achieves much higher efficiency than software implementations and previously proposed monitoring platforms by closely matching the common characteristics of run-time monitoring functions that are based on the notion of tagging. We implemented an RTL prototype of Harmoni and evaluated it with several example monitoring functions for security and programmability. The prototype demonstrates that the architecture can support a wide range of monitoring functions with different characteristics. Harmoni takes moderate silicon area, has very high throughput, and incurs low overheads on monitored programs.
【Keywords】: Monitoring; Tagging; Registers; Hardware; Software; Computer architecture; Prototypes
【Paper Link】 【Pages】:1-12
【Authors】: Greg Bronevetsky ; Ignacio Laguna ; Bronis R. de Supinski ; Saurabh Bagchi
【Abstract】: Enterprise and high-performance computing systems are growing extremely large and complex, employing many processors and diverse software/hardware stacks. As these machines grow in scale, faults become more frequent and system complexity makes it difficult to detect and to diagnose them. The difficulty is particularly large for faults that degrade system performance or cause erratic behavior but do not cause outright crashes. The cost of these errors is high, since they significantly reduce system productivity, both initially and through the time required to resolve them. Current system management techniques do not work well since they require manual examination of system behavior and do not identify root causes. When a fault is manifested, system administrators need timely notification about the type of fault, the time period in which it occurred, and the processor on which it originated. Statistical modeling approaches can accurately characterize normal and abnormal system behavior. However, the complex effects of system faults are less amenable to these techniques. This paper demonstrates that the complexity of system faults makes traditional classification and clustering algorithms inadequate for characterizing them. We design novel techniques that combine classification algorithms with information on the abnormality of application behavior to improve detection and characterization accuracy significantly. Our experiments demonstrate that our techniques can detect and characterize faults with 85% accuracy, compared to just 12% accuracy for direct applications of traditional techniques.
【Keywords】: autonomic management; fault detection; root cause analysis; statistical modeling
【Paper Link】 【Pages】:1-12
【Authors】: Soila Kavulya ; Scott Daniels ; Kaustubh R. Joshi ; Matti A. Hiltunen ; Rajeev Gandhi ; Priya Narasimhan
【Abstract】: Chronics are recurrent problems that often fly under the radar of operations teams because they do not affect enough users or service invocations to set off alarm thresholds. In contrast with major outages that are rare, often have a single cause, and as a result are relatively easy to detect and diagnose quickly, chronic problems are elusive because they are often triggered by complex conditions, persist in a system for days or weeks, and coexist with other problems active at the same time. In this paper, we present Draco, a scalable engine to diagnose chronics that addresses these issues by using a “top-down” approach that starts by heuristically identifying user interactions that are likely to have failed, e.g., dropped calls, and drills down to identify groups of properties that best explain the difference between failed and successful interactions by using a scalable Bayesian learner. We have deployed Draco in production for the VoIP operations of a major ISP. In addition to providing examples of chronics that Draco has helped identify, we show via a comprehensive evaluation on production data that Draco provided 97% coverage, had fewer than 4% false positives, and outperformed state-of-the-art diagnostic techniques by up to 56% for complex chronics.
【Keywords】: Servers; Bayesian methods; IP networks; Mathematical model; Production; Equations; Probability distribution
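Draco's "drill-down" can be caricatured in a few lines (a much-simplified stand-in for its scalable Bayesian learner; the function and scoring rule here are illustrative assumptions): treat each interaction as a set of attribute=value properties and rank properties by a smoothed log-odds score of appearing in failed versus successful interactions.

```python
import math

def rank_properties(failed, successful, alpha=1.0):
    """Rank interaction properties by how well each explains the
    difference between failed and successful interactions, using a
    smoothed log-odds score (illustrative simplification of Draco's
    Bayesian learner).  Each interaction is a set of properties."""
    props = set()
    for interaction in failed + successful:
        props |= set(interaction)
    nf, ns = len(failed), len(successful)
    def score(p):
        f = sum(p in i for i in failed)
        s = sum(p in i for i in successful)
        return (math.log((f + alpha) / (nf + 2 * alpha))
                - math.log((s + alpha) / (ns + 2 * alpha)))
    return sorted(props, key=score, reverse=True)
```

A property concentrated in failed calls (say, a particular gateway) rises to the top even when it co-occurs with benign properties shared by successful calls.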
【Paper Link】 【Pages】:1-12
【Authors】: Danilo Ansaloni ; Lydia Y. Chen ; Evgenia Smirni ; Walter Binder
【Abstract】: Optimal resource allocation and application consolidation on modern multicore systems that host multiple applications is not easy. Striking a balance among conflicting targets such as maximizing system throughput and system utilization while minimizing application response times is a quandary for system administrators. The purpose of this work is to offer a methodology that can automate the difficult process of identifying how to best consolidate workloads in a multicore environment. We develop a simple approach that treats the hardware and the operating system as a black box and uses measurements to profile the application resource demands. The demands become input to a queueing network model that successfully predicts application scalability and that captures the performance impact of consolidated applications on shared on-chip and off-chip resources. Extensive analysis with the widely used DaCapo Java benchmarks on an IBM Power 7 system illustrates the model's ability to accurately predict the system's optimal application mix.
【Keywords】: multicores; performance modeling; queueing networks; application consolidation; Java
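A standard instance of the kind of queueing network model the paper parameterizes from black-box measurements is exact Mean Value Analysis for a closed product-form network (shown here as a generic textbook sketch, not the paper's model): given per-resource service demands, it predicts throughput as the number of concurrent users grows.

```python
def mva(demands, n_users):
    """Exact Mean Value Analysis for a closed product-form network of
    single-server queues.  `demands` are per-resource service demands
    (seconds per job); returns predicted system throughput."""
    q = [0.0] * len(demands)                 # mean queue lengths
    x = 0.0
    for n in range(1, n_users + 1):
        # residence time at each resource with n jobs in the system
        r = [d * (1 + qi) for d, qi in zip(demands, q)]
        x = n / sum(r)                       # throughput (Little's law)
        q = [x * ri for ri in r]
    return x
```

Feeding measured demands for on-chip and off-chip resources into such a model yields scalability predictions of the sort used to pick the optimal application mix.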
【Paper Link】 【Pages】:1-12
【Authors】: Peter Buchholz
【Abstract】: Continuous Time Markov Decision Processes (CTMDPs) are used to describe optimization problems in many applications including system maintenance and control. Often one is interested in a control strategy or policy to optimize the gain of a system over a finite interval which is denoted as finite horizon. The computation of an ε-optimal policy, i.e., a policy that reaches the optimal gain up to some small ε, is often hindered by state space explosion which means that state spaces of realistic models can be very large or even infinite. The paper presents new algorithms to compute approximately optimal policies for CTMDPs with large or infinite state spaces. The new approach allows one to compute bounds on the achievable gain and a policy to reach the lower bound using a variant of uniformization on a finite subset of the state space. It is also shown how the approach can be applied to models with unbounded rewards or transition rates for which uniformization cannot be applied per se.
【Keywords】: Finite Horizons; Continuous Time Markov Decision Processes; Numerical Analysis
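The role of uniformization can be illustrated with a minimal sketch (a textbook discretization under assumed inputs, not the paper's bounding algorithm, which additionally controls the error and handles infinite state spaces): with a uniformization rate at least every exit rate, each action's generator Q_a induces a stochastic matrix P_a = I + Q_a/rate, and value iteration over roughly rate × horizon steps approximates the optimal finite-horizon gain.

```python
def uniformized_value_iteration(Q, reward, horizon, rate):
    """Approximate the optimal finite-horizon gain of a CTMDP by value
    iteration on the uniformized chain.  `Q` maps each action to its
    generator matrix; `rate` must dominate every exit rate.  Sketch of
    the classical discretization only."""
    n = len(reward)
    v = [0.0] * n
    for _ in range(int(rate * horizon)):
        v = [reward[i] / rate + max(
                 sum(((1.0 if i == j else 0.0) + Qa[i][j] / rate) * v[j]
                     for j in range(n))
                 for Qa in Q.values())
             for i in range(n)]
    return v
```

In a two-state example where state 0 earns reward and the "move" action leaves it, the iteration correctly learns to stay.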
【Paper Link】 【Pages】:1-12
【Authors】: Miguel A. Erazo ; Ting Li ; Jason Liu ; Stephan Eidenbenz
【Abstract】: We present the design and implementation of FileSim, a simulation framework with detailed models of parallel file systems, capable of reproducing the complex I/O behavior at scale. FileSim aims to support comprehensive and accurate end-to-end I/O performance prediction and evaluation of exascale high-end computing systems. To this end, FileSim provides several key features, including detailed, pluggable models of contemporary parallel file systems, the support of trace-driven simulation, and the capability of running large-scale I/O systems using parallel and distributed simulation. We conducted extensive validation and performance studies, through which we show that the simulator is capable of reproducing important I/O system behaviors comparable to those measured from the real systems. We demonstrate the capabilities of FileSim as a tool for exploring the parameter space and design alternatives of large-scale parallel file systems.
【Keywords】: parallel simulation; Parallel file systems; PanFS; simulation and modeling
【Paper Link】 【Pages】:1-12
【Authors】: Daniele Sciascia ; Fernando Pedone ; Flavio Junqueira
【Abstract】: Deferred update replication is a well-known approach to building data management systems as it provides both high availability and high performance. High availability comes from the fact that any replica can execute client transactions; the crash of one or more replicas does not interrupt the system. High performance comes from the fact that only one replica executes a transaction; the others must only apply its updates. Since replicas execute transactions concurrently, transaction execution is distributed across the system. The main drawback of deferred update replication is that update transactions scale poorly with the number of replicas, although read-only transactions scale well. This paper proposes an extension to the technique that improves the scalability of update transactions. In addition to presenting a novel protocol, we detail its implementation and provide an extensive analysis of its performance.
【Keywords】: transactional systems; Database replication; scalable data store; fault tolerance; high performance
【Paper Link】 【Pages】:1-12
【Authors】: Moshe Gabel ; Assaf Schuster ; Ran Gilad-Bachrach ; Nikolaj Bjørner
【Abstract】: Unexpected machine failures, with their resulting service outages and data loss, pose challenges to datacenter management. Existing failure detection techniques rely on domain knowledge, precious (often unavailable) training data, textual console logs, or intrusive service modifications. We hypothesize that many machine failures are not a result of abrupt changes but rather a result of a long period of degraded performance. This is confirmed in our experiments, in which over 20% of machine failures were preceded by such latent faults. We propose a proactive approach for failure prevention. We present a novel framework for statistical latent fault detection using only ordinary machine counters collected as standard practice. We demonstrate three detection methods within this framework. Derived tests are domain-independent and unsupervised, require neither background information nor tuning, and scale to very large services. We prove strong guarantees on the false positive rates of our tests.
【Keywords】: statistical learning; fault detection; web services; statistical analysis; distributed computing
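One of the framework's key properties, unsupervised peer comparison over ordinary counters, can be sketched as follows (an illustrative robust-statistics stand-in, not one of the paper's three tests): at each sampling instant, score every machine against the population median of its peers, and flag machines that deviate persistently.

```python
import statistics

def latent_fault_suspects(counters, threshold=3.0):
    """Unsupervised latent-fault sketch: compare each machine's counter
    against its peers at the same instant with a robust median/MAD
    score; machines deviating in every sample are suspects.
    `counters[m]` is machine m's list of samples."""
    n_samples = len(next(iter(counters.values())))
    hits = {m: 0 for m in counters}
    for t in range(n_samples):
        snap = {m: v[t] for m, v in counters.items()}
        med = statistics.median(snap.values())
        mad = statistics.median(abs(v - med) for v in snap.values()) or 1.0
        for m, v in snap.items():
            if abs(v - med) / mad > threshold:
                hits[m] += 1
    return sorted(m for m, h in hits.items() if h == n_samples)
```

The test needs no training data, labels, or domain knowledge: a machine whose counters drift away from the herd for the whole observation window is reported as a latent-fault suspect.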
【Paper Link】 【Pages】:1-12
【Authors】: Tomas Hruby ; Dirk Vogt ; Herbert Bos ; Andrew S. Tanenbaum
【Abstract】: For many years, multiserver operating systems have demonstrated, by their design, high dependability and reliability. However, the design has inherent performance implications that have not been easy to overcome. Until now, context switching and kernel involvement in message passing have been the performance bottleneck keeping such systems from broader acceptance beyond niche domains. In contrast to other areas of software development, where fitting software to the available parallelism is difficult, the new multicore hardware is a great match for multiserver systems: individual servers can run on different cores. This opens more room for further decomposition of the existing servers, improving dependability and live-updatability. We discuss the general implications for multiserver system design and cover in detail the implementation and evaluation of a more dependable networking stack. We split the single stack into multiple servers that run on dedicated cores and communicate without kernel involvement. We argue that the performance problems that have dogged multiserver operating systems since their inception should be reconsidered: it is possible to make multiserver systems fast on multicores.
【Keywords】: System performance; Operating systems; Reliability; Computer network reliability
【Paper Link】 【Pages】:1-12
【Authors】: Yunfeng Zhu ; Patrick P. C. Lee ; Liping Xiang ; Yinlong Xu ; Lingling Gao
【Abstract】: Modern distributed storage systems provide large-scale, fault-tolerant data storage. To reduce the probability of data unavailability, it is important to recover the lost data of any failed storage node efficiently. In practice, storage nodes are of heterogeneous types and have different transmission bandwidths. Thus, traditional recovery solutions that simply minimize the number of data blocks being read may no longer be optimal in a heterogeneous environment. We propose a cost-based heterogeneous recovery (CHR) algorithm for RAID-6-coded storage systems. We formulate the recovery problem as an optimization model in which storage nodes are associated with generic costs. We narrow down the solution space of the model to make it practically tractable, while still achieving the global optimal solution in most cases. We implement different recovery algorithms and conduct testbed experiments on a real networked storage system with heterogeneous storage devices. We show that our CHR algorithm reduces the total recovery time of existing recovery solutions in various scenarios.
【Keywords】: experimentation; distributed storage system; RAID-6 codes; node heterogeneity; failure recovery
【Paper Link】 【Pages】:1-12
【Authors】: Marco Beccuti ; Andrea Bobbio ; Giuliana Franceschinis ; Roberta Terruggia
【Abstract】: In this paper we propose an improved BDD approach to network reliability analysis that allows the user to compute an exact solution, or an approximation based on reliability bounds when network complexity makes the exact solution practically impossible. To this purpose, a new algorithm for encoding the connectivity graph on a Binary Decision Diagram (BDD) has been developed; it reduces the peak computation memory with respect to previous approaches based on the same type of data structure without increasing the execution time, and it also allows us to derive from a subset of the minpaths/mincuts a lower/upper bound on the network reliability, so that the quality of the obtained approximation can be estimated. Finally, a fair and detailed comparison between our approach and another state-of-the-art approach from the literature is documented through a set of benchmarks.
【Keywords】: BDD; Network Reliability; Exact and Approximate Algorithms; Upper and Lower Bound
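The principle a BDD encoding exploits, Shannon expansion on one link at a time, can be shown without the BDD machinery (an exhaustive-factoring sketch that enumerates all link states, so it is exponential where a BDD would share subproblems):

```python
def network_reliability(edges, p, source, target):
    """Exact two-terminal reliability by recursive factoring (Shannon
    expansion on each edge).  `edges` is a tuple of (u, v) links, `p`
    the per-link reliability.  Illustrative sketch; a BDD makes this
    tractable by merging isomorphic subproblems."""
    def connected(up):
        seen, stack = {source}, [source]
        while stack:
            n = stack.pop()
            for (u, v), ok in zip(edges, up):
                if ok:
                    for a, b in ((u, v), (v, u)):
                        if a == n and b not in seen:
                            seen.add(b)
                            stack.append(b)
        return target in seen

    def rel(i, up):
        if i == len(edges):
            return 1.0 if connected(up) else 0.0
        return (p * rel(i + 1, up + (True,))
                + (1 - p) * rel(i + 1, up + (False,)))
    return rel(0, ())
```

Two parallel links of reliability 0.5 give 1 - 0.5² = 0.75; two in series give 0.25, matching the closed forms.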
【Paper Link】 【Pages】:1-8
【Authors】: Timothy K. Tsai ; Nawanol Theera-Ampornpunt ; Saurabh Bagchi
【Abstract】: Hard disk drives have multiple layers of fault tolerance mechanisms that protect against data loss. However, a few failures occasionally breach the entire set of mechanisms. To prevent such scenarios, we rely on failure prediction mechanisms to raise alarms with sufficient warning to allow the at-risk data to be copied to a safe location. A common failure prediction technique monitors the occurrence of soft errors and triggers an alarm when the soft error rate exceeds a specified threshold. This study uses data collected from a population of over 50,000 customer deployed disk drives to examine the relationship between soft errors and failures, in particular failures manifested as hard errors. The data analysis shows that soft errors alone cannot be used as a reliable predictor of hard errors. However, in those cases where soft errors do accurately predict hard errors, sufficient warning time exists for preventive actions.
【Keywords】: data mining; hard disk drive; failure prediction; soft errors; hard errors
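The threshold-style predictor the study evaluates can be sketched generically (an illustrative sliding-window rule, not the drives' actual firmware logic): raise an alarm the first time the soft-error count within a time window exceeds a threshold, and measure the warning time as the gap between the alarm and the eventual hard failure.

```python
def soft_error_alarm(events, window, threshold):
    """Threshold-based failure predictor sketch: return the time of
    the first alarm, i.e., the first instant at which more than
    `threshold` soft errors fall inside the trailing `window`.
    `events` is a sorted list of soft-error timestamps (e.g., hours);
    returns None if no alarm fires."""
    start = 0
    for end, t in enumerate(events):
        while t - events[start] > window:
            start += 1
        if end - start + 1 > threshold:
            return t
    return None
```

With errors at hours 1, 2, 3 and a burst at 50-53, a window of 10 hours and threshold of 3 fires at hour 53; subtracting the alarm time from the failure time gives the available warning for copying at-risk data.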
【Paper Link】 【Pages】:1-12
【Authors】: James S. Plank ; Catherine D. Schuman ; B. Devin Robison
【Abstract】: Large scale, archival and wide-area storage systems use erasure codes to protect users from losing data due to the inevitable failures that occur. All but the most basic erasure codes employ bit-matrices so that encoding and decoding may be effected solely with the bitwise exclusive-OR (XOR) operation. There are CPU savings that can result from strategically scheduling these XOR operations so that fewer XORs are performed. It is an open problem to derive a schedule from a bit-matrix that minimizes the number of XOR operations. We attack this open problem, deriving two new heuristics called Uber-CSHR and X-Sets to schedule encoding and decoding bit-matrices with reduced XOR operations. We evaluate these heuristics in a variety of realistic erasure coding settings and demonstrate that they are a significant improvement over previously published heuristics. We provide an open-source implementation of these heuristics so that practitioners may leverage our work.
【Keywords】: Bit-matrix scheduling; Erasure codes; Fault-tolerant storage; RAID; Disk; failures
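To see what the heuristics are optimizing, here is a minimal sketch of naive bit-matrix encoding together with its XOR count, the quantity that scheduling (Uber-CHRS, X-Sets) tries to reduce by reusing shared partial sums. This is not the paper's code; the representation (rows of 0/1 over data bits) is an assumption.

```python
def encode_naive(bitmatrix, data_bits):
    """Naive bit-matrix encoding: each output bit is the XOR of the
    data bits selected by one matrix row. Returns (outputs, xor_count).
    A scheduler would lower xor_count by sharing common sub-sums
    across rows; this sketch performs no such reuse."""
    outputs, xors = [], 0
    for row in bitmatrix:
        selected = [d for d, bit in zip(data_bits, row) if bit]
        acc = 0
        for b in selected:
            acc ^= b
        outputs.append(acc)
        xors += max(len(selected) - 1, 0)  # n selected bits need n-1 XORs
    return outputs, xors
```

For example, rows `[1,1,0]` and `[1,1,1]` both contain the pair (bit0, bit1); computing that pair once and reusing it would drop the count from three XORs to two, which is exactly the kind of saving a schedule captures.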
【Paper Link】 【Pages】:1-12
【Authors】: Joseph Sloan ; Rakesh Kumar ; Greg Bronevetsky
【Abstract】: The increasing size and complexity of High-Performance Computing systems is making it increasingly likely that individual circuits will produce erroneous results, especially when operated in a low energy mode. Previous techniques for Algorithm-Based Fault Tolerance (ABFT) [20] have been proposed for detecting errors in dense linear operations, but have high overhead in the context of sparse problems. In this paper, we propose a set of algorithmic techniques that minimize the overhead of fault detection for sparse problems. The techniques are based on two insights. First, many sparse problems are well structured (e.g. diagonal, banded diagonal, block diagonal), which allows for sampling techniques to produce good approximations of the checks used for fault detection. These approximate checks may be acceptable for many sparse linear algebra applications. Second, many linear applications have enough reuse that pre-conditioning techniques can be used to make these applications more amenable to low-cost algorithmic checks. The proposed techniques are shown to yield up to 2× reductions in performance overhead over traditional ABFT checks for a spectrum of sparse problems. A case study using common linear solvers further illustrates the benefits of the proposed algorithmic techniques.
【Keywords】: error detection; ABFT; sparse linear algebra; numerical methods
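The traditional ABFT check the paper builds on can be illustrated for matrix-vector products: a checksum of the output must equal the checksum row of the input matrix applied to the input vector. The sketch below uses the all-ones checksum vector; the paper's contribution (sampled, approximate checks for structured sparse matrices) is not reproduced here, and the function name and tolerance are assumptions.

```python
def abft_matvec_check(A, x, y, tol=1e-9):
    """Classic ABFT consistency check for y = A @ x.

    With checksum vector c = (1, ..., 1), correctness requires
    c . y == (c . A) . x, i.e. sum(y) equals the column sums of A
    dotted with x. A mismatch signals a fault in the computation."""
    col_sums = [sum(A[i][j] for i in range(len(A))) for j in range(len(x))]
    lhs = sum(y)                                   # checksum of the result
    rhs = sum(c * xj for c, xj in zip(col_sums, x))  # checksum predicted from inputs
    return abs(lhs - rhs) <= tol
```

For sparse, well-structured matrices the paper replaces the exact column sums with sampled approximations, trading a small detection-accuracy loss for much lower overhead.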
【Paper Link】 【Pages】:1-8
【Authors】: Ewen Denney ; Ganesh Pai ; Ibrahim Habli
【Abstract】: We describe our experience with the ongoing development of a safety case for an unmanned aircraft system (UAS), emphasizing autopilot software safety assurance. Our approach combines formal and non-formal reasoning, yielding a semi-automatically assembled safety case, in which part of the argument for autopilot software safety is automatically generated from formal methods. This paper provides a discussion of our experiences pertaining to (a) the methodology for creating and structuring safety arguments containing heterogeneous reasoning and information, (b) the comprehensibility of, and the confidence in, the arguments created, and (c) the implications of development and safety assurance processes. The considerations for assuring aviation software safety, when using an approach such as the one in this paper, are also discussed in the context of the relevant standards and existing (process-based) certification guidelines.
【Keywords】: Aviation software; Software safety; Safety cases; Unmanned aircraft; Formal methods
【Paper Link】 【Pages】:1-12
【Authors】: Arpan Roy ; Dong Seong Kim ; Kishor S. Trivedi
【Abstract】: Constraints such as limited security investment cost preclude a security decision maker from implementing all possible countermeasures in a system. Existing analytical model-based security optimization strategies fall short for the following reasons: (i) none of these model-based methods offers a way to find an optimal security solution in the absence of probability assignments to the model, (ii) the methods scale badly as the size of the system to be modeled increases, and (iii) some methods suffer because they use attack trees (AT) whose structure does not allow for the inclusion of countermeasures, while others translate the non-state-space model (e.g., attack response tree) into a state-space model, causing state-space explosion. In this paper, we use a novel AT paradigm called the attack countermeasure tree (ACT), whose structure takes into account attacks as well as countermeasures (in the form of detection and mitigation events). We use greedy and branch-and-bound techniques to study several objective functions, with goals such as minimizing the number of countermeasures or the security investment cost in the ACT, and maximizing the benefit from implementing a certain countermeasure set in the ACT under different constraints. We cast each optimization problem as an integer programming problem, which also allows us to find an optimal solution even in the absence of probability assignments to the model. Our method scales well for large ACTs, and we compare its efficiency with other approaches.
【Keywords】: security investment cost; attack countermeasure tree; branch and bound; integer programming; optimization
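The core 0-1 selection problem can be shown in miniature. The exhaustive search below is a tiny stand-in for the paper's integer-programming formulation (which scales to large ACTs; brute force does not): choose the countermeasure subset of maximum benefit whose total cost fits a budget. Function name and data layout are assumptions.

```python
from itertools import combinations

def best_countermeasure_set(costs, benefits, budget):
    """Exhaustive 0-1 countermeasure selection (illustrative only).

    costs[i]/benefits[i] describe countermeasure i. Returns the
    (subset, total_benefit) pair maximizing benefit subject to
    sum(costs) <= budget. An ILP solver replaces this loop at scale."""
    n = len(costs)
    best, best_benefit = (), 0
    for r in range(n + 1):
        for subset in combinations(range(n), r):
            cost = sum(costs[i] for i in subset)
            benefit = sum(benefits[i] for i in subset)
            if cost <= budget and benefit > best_benefit:
                best, best_benefit = subset, benefit
    return set(best), best_benefit
```

Note that this formulation needs no probability assignments at all, which mirrors the paper's point that an integer program remains solvable when the model lacks them.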
【Paper Link】 【Pages】:1-12
【Authors】: Saman A. Zonouz ; Amir Houmansadr ; Parisa Haghani
【Abstract】: To protect complex power-grid control networks, efficient security assessment techniques are required. However, efficiently making sure that calculated security measures match the expert knowledge is a challenging endeavor. In this paper, we present EliMet, a framework that combines information from different sources and estimates the extent to which a control network meets its security objective. Initially, during an offline phase, a state-based model of the network is generated, and security-level of each state is measured using a generic and easy-to-compute metric. EliMet then passively observes system operators' online reactive behavior against security incidents, and accordingly refines the calculated security measure values. Finally, to make the values comply with the expert knowledge, EliMet actively queries operators regarding those states for which sufficient information was not gained during the passive observation. Our experimental results show that EliMet can optimally make use of prior knowledge as well as automated inference techniques to minimize human involvement and efficiently deduce the expert knowledge regarding individual states of that particular system.
【Keywords】: situational awareness; Power grid critical infrastructure; intrusion detection and response; security metric
【Paper Link】 【Pages】:1-12
【Authors】: Massimiliano Albanese ; Sushil Jajodia ; Steven Noel
【Abstract】: Attack graph analysis has been established as a powerful tool for analyzing network vulnerability. However, previous approaches to network hardening look for exact solutions and thus do not scale. Further, hardening elements have been treated independently, which is inappropriate for real environments. For example, the cost for patching many systems may be nearly the same as for patching a single one. Or patching a vulnerability may have the same effect as blocking traffic with a firewall, while blocking a port may deny legitimate service. By failing to account for such hardening interdependencies, the resulting recommendations can be unrealistic and far from optimal. Instead, we formalize the notion of hardening strategy in terms of allowable actions, and define a cost model that takes into account the impact of interdependent hardening actions. We also introduce a near-optimal approximation algorithm that scales linearly with the size of the graphs, which we validate experimentally.
【Keywords】: reliability; network hardening; vulnerability analysis; attack graphs; intrusion prevention
【Paper Link】 【Pages】:1-12
【Authors】: Collin Mulliner ; Steffen Liebergeld ; Matthias Lange ; Jean-Pierre Seifert
【Abstract】: Malicious injection of cellular signaling traffic from mobile phones is an emerging security issue. The respective attacks can be performed by hijacked smartphones and by malware resident on mobile phones. To date, there are no protection mechanisms in place to prevent signaling-based attacks other than implementing expensive additions to the cellular core network. In this work we present a protection system that resides on the mobile phone. Our solution works by partitioning the phone software stack into an application operating system and a communication partition. The application system is a standard, fully featured Android system. Communication with the cellular network, on the other hand, is mediated by a flexible monitoring and enforcement system running on the communication partition. We implemented and evaluated our protection system on a real smartphone. Our evaluation shows that it can mitigate all currently known signaling-based attacks and, in addition, can protect users from cellular Trojans.
【Keywords】: System Virtualization; Smartphones; Cellular Signaling; Attack Mitigation; Operating Systems
【Paper Link】 【Pages】:1-12
【Authors】: Qiang Zheng ; Jing Zhao ; Guohong Cao
【Abstract】: Backup paths are widely used to protect IP links from failures. Existing solutions such as the commonly used independent and Shared Risk Link Group models do not accurately reflect the correlation between IP link failures, and thus may not choose reliable backup paths. We propose a cross-layer approach for IP link protection. We develop a correlated failure probability (CFP) model to quantify the impact of an IP link failure on the reliability of backup paths. With the CFP model, we propose two algorithms for selecting backup paths. The first algorithm focuses on choosing the backup paths with minimum failure probability. The second algorithm further considers the bandwidth constraint and aims at minimizing the traffic disruption caused by failures. It also ensures that the rerouted traffic load on each IP link does not exceed the usable bandwidth to avoid interfering with the normal traffic. Simulations based on real ISP networks show that our approach can choose backup paths that are more reliable and achieve better protection.
【Keywords】: model; IP link; protection; cross-layer; backup path
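The first algorithm's goal, choosing a backup path of minimum failure probability, has a standard reduction worth making explicit: maximizing a product of per-link success probabilities is a shortest-path problem under weights -log(p). The sketch below assumes independent link failures, which is precisely the simplification the paper's CFP model moves beyond; the function and its data layout are illustrative assumptions, not the paper's algorithm.

```python
import heapq
import math

def most_reliable_path(links, src, dst):
    """Dijkstra on weights -log(p) to maximize path reliability.

    links: {(u, v): p} undirected edges with success probability p.
    Returns (path, reliability) for the most reliable src->dst path."""
    graph = {}
    for (u, v), p in links.items():
        w = -math.log(p)  # product of probabilities -> sum of weights
        graph.setdefault(u, []).append((v, w))
        graph.setdefault(v, []).append((u, w))
    dist, prev = {src: 0.0}, {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in graph.get(u, []):
            if d + w < dist.get(v, math.inf):
                dist[v], prev[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    path, node = [dst], dst
    while node != src:
        node = prev[node]
        path.append(node)
    return path[::-1], math.exp(-dist[dst])
```

Two 0.9-reliable hops (joint reliability 0.81) beat one 0.5-reliable direct link, which a hop-count-shortest path would wrongly prefer; the CFP model additionally discounts paths whose links fail together.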
【Paper Link】 【Pages】:1-12
【Authors】: Vamsi Kambhampati ; Christos Papadopoulos ; Daniel Massey
【Abstract】: Critical services operating over the Internet are increasingly threatened by Distributed Denial of Service (DDoS) attacks. To protect them we propose Epiphany, an architecture that hides the service IP addresses so that attackers cannot locate and target them. Epiphany provides service access through numerous lightweight proxies, presenting a wide target to the attacker. Epiphany has strong location hiding properties; no proxy knows the service address. Instead, proxies communicate over ephemeral paths controlled by the service. If a specific proxy misbehaves or is attacked it can be promptly removed. Epiphany separates proxies into setup and data, and only makes setup proxies public, but these use anycast to create distinct network regions. Clients in clean networks are not affected by attackers in other networks. Data proxies are assigned to clients based on their trust. We evaluate the defense properties of Epiphany using simulations and implementations on PlanetLab and a router testbed.
【Keywords】: Hidden Paths; Critical Services; DDoS; Location Hiding; Proxies
【Paper Link】 【Pages】:1-12
【Authors】: Catello Di Martino ; Marcello Cinque ; Domenico Cotroneo
【Abstract】: This paper presents a novel approach to assess time coalescence techniques. These techniques are widely used to reconstruct the failure process of a system and to estimate dependability measurements from its event logs. The approach is based on the use of automatically generated logs, accompanied by the exact knowledge of the ground truth on the failure process. The assessment is conducted by comparing the presumed failure process, reconstructed via coalescence, with the ground truth. We focus on supercomputer logs, due to the increasing importance of automatic event log analysis for these systems. Experimental results show how the approach makes it possible to compare different time coalescence techniques and to identify their weaknesses with respect to given system settings. In addition, the results revealed an interesting correlation between errors caused by the coalescence and errors in the estimation of dependability measurements.
【Keywords】: dependability assessment; Event Log Analysis; supercomputer dependability; data coalescence
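The technique under assessment, time coalescence, is simple to state: log events closer together than a fixed window are presumed to stem from one failure and are grouped into a tuple. A minimal sketch (the classic single-window variant, not any specific technique evaluated in the paper; the name is an assumption):

```python
def coalesce(timestamps, window):
    """Time coalescence: group events whose gap from the previous
    event in the current tuple is at most `window` seconds.
    Returns a list of tuples (each a list of timestamps)."""
    tuples = []
    for t in sorted(timestamps):
        if tuples and t - tuples[-1][-1] <= window:
            tuples[-1].append(t)  # same presumed failure
        else:
            tuples.append([t])    # start a new tuple
    return tuples
```

The paper's contribution is the assessment loop around this: generate logs from a known ground-truth failure process, coalesce them, and measure how far the reconstructed tuples (and the dependability metrics derived from them) deviate from the truth for a given window and system setting.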
【Paper Link】 【Pages】:1-10
【Authors】: Robert G. Cain ; Aad P. A. van Moorsel
【Abstract】: Probabilistic and stochastic models are routinely used in performance, dependability and security evaluation, and determining appropriate values for model parameters is a long-standing problem in the practical use of such models. With the increasing emphasis on human aspects and business considerations, data collection to estimate parameter values often gets prohibitively expensive, since it may involve questionnaires, costly audits or additional monitoring and processing. In this paper we articulate a set of optimization problems related to data collection, and provide efficient algorithms to determine the optimal data collection strategy for a model. The main idea is to model the uncertainty of data sources and determine its influence on output accuracy by solving the model. This approach is particularly natural for data sources that rely on sampling, such as questionnaires or monitoring, since uncertainty can be expressed using the central limit theorem. We pay special attention to the efficiency of our optimization algorithm, using ideas inspired by importance sampling to derive optimal strategies for a range of parameter values from a single set of experiments.
【Keywords】: optimization; data collection; probabilistic modelling; dependability; information security
【Paper Link】 【Pages】:1-12
【Authors】: Li Yu ; Ziming Zheng ; Zhiling Lan ; Terry Jones ; Jim M. Brandt ; Ann C. Gentile
【Abstract】: Log data is an incredible asset for troubleshooting in large-scale systems. Nevertheless, due to the ever-growing system scale, the volume of such data becomes overwhelming, bringing enormous burdens on both data storage and data analysis. To address this problem, we present a 2-dimensional online filtering mechanism to remove redundant and noisy data via feature selection and instance selection. The objective of this work is two-fold: (i) to significantly reduce data volume without losing important information, and (ii) to effectively promote data analysis. We evaluate this new filtering mechanism by means of real environmental data from the production supercomputers at Oak Ridge National Laboratory and Sandia National Laboratory. Our preliminary results demonstrate that our method can reduce disk space usage by more than 85%, thereby significantly reducing analysis time. Moreover, it also improves failure prediction and diagnosis by more than 20%, as compared to the conventional predictive approach relying on RAS (Reliability, Availability, and Serviceability) events alone.
【Keywords】: Monitoring; Accuracy; Large-scale systems; Data analysis; Supercomputers; Correlation; Measurement
【Paper Link】 【Pages】:1-12
【Authors】: Rami G. Melhem ; Rakan Maddah ; Sangyeun Cho
【Abstract】: With their potential for high scalability and density, resistive memories are foreseen as a promising technology that overcomes the physical limitations confronted by charge-based DRAM and flash memory. Yet, a main burden towards the successful adoption and commercialization of resistive memories is their low cell reliability caused by process variation and limited write endurance. Typically, faulty and worn-out cells are permanently stuck at either 0 or 1. To overcome the challenge, a robust error correction scheme that can recover from many hard faults is required. In this paper, we propose and evaluate RDIS, a novel scheme to efficiently tolerate memory stuck-at faults. RDIS allows for the correct retrieval of data by recursively determining and efficiently keeping track of the positions of the bits that are stuck at a value different from the ones that are written, and then, at read time, by inverting the values read from those positions. RDIS is characterized by a very low probability of failure that increases slowly with the relative increase in the number of faults. Moreover, RDIS tolerates many more faults than the best existing scheme, by up to 95% on average at the same overhead level.
【Keywords】: Reliability; Error Correction Code; Hard Faults; Phase Change Memory; Fault Tolerance
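The inversion idea at the heart of the abstract can be shown in its simplest, single-level form: if the data disagrees with some stuck cells, try storing the bitwise complement plus an inversion flag, and invert again on read. This is only the base intuition; RDIS itself applies the idea recursively so it can handle words where neither polarity matches all stuck cells. Names and the fault representation are assumptions.

```python
def write_with_inversion(data, stuck):
    """Single-level inversion sketch of stuck-at masking.

    data: list of bits to write; stuck: {position: stuck_value}.
    Returns (stored_bits, inverted_flag) if one polarity agrees with
    every stuck cell, else None (a recursive scheme like RDIS is
    needed). Reading back: invert stored_bits when the flag is set."""
    def mismatches(bits):
        return [p for p, v in stuck.items() if bits[p] != v]

    if not mismatches(data):
        return data, False            # plain data fits the stuck cells
    flipped = [1 - b for b in data]
    if not mismatches(flipped):
        return flipped, True          # complement fits; set the flag
    return None
```

The recursion in RDIS partitions the word and tracks which regions to invert, which is why its failure probability grows only slowly with the fault count.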
【Paper Link】 【Pages】:1-12
【Authors】: Jie Chen ; Guru Venkataramani ; H. Howie Huang
【Abstract】: As main memory systems begin to face the scaling challenges from DRAM technology, future computer systems need to adapt to the emerging memory technologies like Phase-Change Memory (PCM or PRAM). While these newer technologies offer advantages such as storage density, non-volatility, and low energy consumption, they are constrained by limited write endurance that becomes more pronounced with process variation. In this paper, we propose a novel PRAM-based main memory system, RePRAM (Recycling PRAM), which leverages a group of faulty pages and recycles them in a managed way to significantly extend the PRAM lifetime while minimizing the performance impact. In particular, we explore two different dimensions of dynamic redundancy levels and group sizes, and design low-cost hardware and software support for RePRAM. Our proposed scheme involves minimal hardware modifications (with less than 1% on-chip and off-chip area overheads). Also, our schemes can improve the PRAM lifetime by up to 43× over a chip with no error correction capabilities, and outperform prior schemes such as DRM and ECP at a small fraction of the hardware cost. The performance overhead resulting from our scheme is less than 7% on average across 21 applications from the SPEC2006, Splash-2, and PARSEC benchmark suites.
【Keywords】: Performance; Phase Change Memory; Lifetime; Redundancy; Main memory
【Paper Link】 【Pages】:1-11
【Authors】: Ulya R. Karpuzcu ; Krishna B. Kolluru ; Nam Sung Kim ; Josep Torrellas
【Abstract】: Near-Threshold Computing (NTC), where the supply voltage is only slightly higher than the threshold voltage of transistors, is a promising approach to attain energy-efficient computing. Unfortunately, compared to the conventional Super-Threshold Computing (STC), NTC is more sensitive to process variations, which results in higher power consumption and lower frequencies than would otherwise be possible, and potentially a non-negligible fault rate. To help address variations at NTC at the architecture level, this paper presents the first microarchitectural model of process variations for NTC. The model, called VARIUS-NTV, extends the existing VARIUS variation model. Its key aspects include: (i) adopting a gate-delay model and an SRAM cell type that are tailored to NTC, (ii) modeling SRAM failure modes emerging at NTC, and (iii) accounting for the impact of leakage in SRAM models. We evaluate a simulated 11nm, 288-core tiled manycore at both NTC and STC. The results show higher frequency and power variations within the NTC chip. For example, the maximum difference in on-chip tile frequency is ≈2.3× at STC and ≈3.7× at NTC. We also validate our model against an experimental chip.
【Keywords】: Power constraints; Process variations; Near-threshold voltage; Manycore architectures; SRAM fault models
【Paper Link】 【Pages】:1-11
【Authors】: David J. Palframan ; Nam Sung Kim ; Mikko H. Lipasti
【Abstract】: Delay variation due to dopant fluctuation is expected to become more prominent in future technology generations. To regain performance lost due to within-die variations, many architectural techniques propose modified timing schemes such as time borrowing or variable latency execution. As an alternative that specifically targets random variation, we propose introducing redundancy along the processor datapath in the form of one or more extra bitslices. This approach allows us to leave dummy slices in the datapath unused to avoid excessively slow critical paths created by delay variations. We examine the benefits of applying this technique to potential critical paths such as the ALU and register file, and demonstrate that our technique can significantly reduce the delay penalty due to variation. By adding a single bitslice, for instance, we can reduce this delay penalty by 10%. Finally, we discuss heuristics for configuring our redundant design after fabrication.
【Keywords】: reliability; process variation; doping; bitsliced design; performance
【Paper Link】 【Pages】:1-12
【Authors】: Nuno Machado ; Paolo Romano ; Luís E. T. Rodrigues
【Abstract】: This paper presents CoopREP, a system that provides support for fault replication of concurrent programs, based on cooperative recording and partial log combination. CoopREP employs partial recording to reduce the amount of information that a given program instance is required to store in order to support deterministic replay. This substantially reduces the overhead imposed by the instrumentation of the code, but raises the problem of finding a combination of logs capable of replaying the fault. CoopREP tackles this issue by introducing several innovative statistical analysis techniques aimed at guiding the search for partial logs to be combined and used during the replay phase. CoopREP has been evaluated using both standard benchmarks for multi-threaded applications and a real-world application. The results highlight that CoopREP can successfully replay concurrency bugs involving tens of thousands of memory accesses, reducing logging overhead with respect to state-of-the-art non-cooperative logging schemes by up to 50 times in computationally intensive applications.
【Keywords】: performance; concurrency errors; deterministic replay; debugging
【Paper Link】 【Pages】:1-12
【Authors】: Mirko Montanari ; Roy H. Campbell
【Abstract】: Monitoring systems observe important information that could be a valuable resource to malicious users: attackers can use the knowledge of topology information, application logs, or configuration data to target attacks and make them hard to detect. The increasing need for correlating information across distributed systems to better detect potential attacks and to meet regulatory requirements can potentially exacerbate the problem if the monitoring is centralized. A single zero-day vulnerability would permit an attacker to access all information. This paper introduces a novel algorithm for performing policy-based security monitoring. We use policies to distribute information across several hosts, so that any host compromise has limited impact on the confidentiality of the data about the overall system. Experiments show that our solution spreads information uniformly across distributed monitoring hosts and forces attackers to perform multiple actions to acquire important data.
【Keywords】: distributed systems; security; monitoring; policy compliance; confidentiality
【Paper Link】 【Pages】:1-12
【Authors】: Chao Shen ; Zhongmin Cai ; Xiaohong Guan
【Abstract】: Mouse dynamics is the process of identifying individual users based on their mouse operating characteristics. Although previous work has reported some promising results, mouse dynamics is still a newly emerging technique and has not reached an acceptable level of performance. One of the major reasons is intrinsic behavioral variability. This study presents a novel approach that uses a pattern-growth-based mining method to extract frequent-behavior segments and obtain stable mouse characteristics, employing one-class classification algorithms to perform the task of continuous user authentication. Experimental results show that mouse characteristics extracted from frequent-behavior segments are much more stable than those from holistic behavior, and the approach achieves a practically useful level of performance with an FAR of 0.37% and an FRR of 1.12%. These findings suggest that mouse dynamics can serve as a significant enhancement for a traditional authentication system. Our dataset is publicly available to facilitate future research.
【Keywords】: human computer interaction; mouse dynamics; one-class learning; anomaly detection; pattern mining
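One-class classification, as used here, trains only on the genuine user's samples and flags anything too far from that profile. The distance-to-mean sketch below is the simplest possible one-class detector, chosen for illustration; the paper's actual classifiers, features, and thresholds differ, and all names here are assumptions.

```python
def train_profile(samples):
    """One-class training sketch: the profile is the per-feature mean
    of a user's feature vectors (e.g. extracted from frequent-behavior
    mouse segments)."""
    n = len(samples)
    return [sum(s[i] for s in samples) / n for i in range(len(samples[0]))]

def is_genuine(profile, sample, threshold):
    """Accept a new sample if its L1 distance to the profile is small;
    a large distance suggests a different user (anomaly)."""
    dist = sum(abs(p - x) for p, x in zip(profile, sample))
    return dist <= threshold
```

FAR then counts impostor samples wrongly accepted and FRR genuine samples wrongly rejected; the threshold trades one against the other.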
【Paper Link】 【Pages】:1-12
【Authors】: Palden Lama ; Xiaobo Zhou
【Abstract】: A virtualized data center faces the important but challenging issue of performance isolation among heterogeneous customer applications. Performance interference resulting from the contention for shared resources among co-located virtual servers has a significant impact on the dependability of application QoS. We propose and develop NINEPIN, a non-invasive and energy-efficient performance isolation mechanism that mitigates performance interference among heterogeneous applications hosted in virtualized servers. It is capable of increasing data center utility. Its novel hierarchical control framework aligns performance isolation goals with the incentive to regulate the system towards optimal operating conditions. The framework combines machine learning based self-adaptive modeling of performance interference and energy consumption, utility optimization based performance targeting, and a robust model predictive control based target tracking. We implement NINEPIN on a virtualized HP ProLiant blade server hosting SPEC CPU2006 and RUBiS benchmark applications. Experimental results demonstrate that NINEPIN outperforms a representative performance isolation approach, Q-Clouds, improving the overall system utility and reducing energy consumption.
【Keywords】: Fuzzy MIMO Control; Performance Isolation; Non-invasiveness; Virtualized Servers; Energy Efficiency; Robustness
【Paper Link】 【Pages】:1-12
【Authors】: Fabian Oboril ; Mehdi Baradaran Tahoori
【Abstract】: With shrinking feature sizes, transistor aging due to NBTI and HCI becomes a major reliability challenge for microprocessors. These processes lead to increased gate delays, more failures during runtime, and eventually reduced operational lifetime. Currently, to ensure correct functionality for a certain operational lifetime, additional timing margins are added to the design. However, this approach implies a significant performance loss and may fail to meet reliability requirements. Therefore, aging-aware microarchitecture design is inevitable. In this paper we present ExtraTime, a novel microarchitectural aging analysis framework, which can be used in early design phases, when detailed transistor-level information is not yet available, to model, analyze, and predict performance, power, and aging. Furthermore, as a case study, we use ExtraTime to conduct a comprehensive investigation of various clock- and power-gating strategies as well as aging-aware instruction scheduling policies, showing the impact of the architecture on aging.
【Keywords】: Performance Simulator; NBTI; HCI; Wearout Modeling; Microarchitecture
【Paper Link】 【Pages】:1-12
【Authors】: Yubin Xia ; Yutao Liu ; Haibo Chen ; Binyu Zang
【Abstract】: Many classic and emerging security attacks usually introduce illegal control flow to victim programs. This paper proposes an approach to detecting violation of control flow integrity based on hardware support for performance monitoring in modern processors. The key observation is that the abnormal control flow in security breaches can be precisely captured by performance monitoring units. Based on this observation, we design and implement a system called CFIMon, which is the first non-intrusive system that can detect and reason about a variety of attacks violating control flow integrity without any changes to applications (either source or binary code) or requiring special-purpose hardware. CFIMon combines static analysis and runtime training to collect legal control flow transfers, and leverages the branch tracing store mechanism in commodity processors to collect and analyze runtime traces on-the-fly to detect violation of control flow integrity. Security evaluation shows that CFIMon has low false-positive and false-negative rates when detecting several realistic security attacks. Performance results show that CFIMon incurs only 6.1% performance overhead on average for a set of typical server applications.
【Keywords】: Monitoring; Law; Security; Program processors; Runtime; Radiation detectors
【Paper Link】 【Pages】:1-12
【Authors】: Jiesheng Wei ; Karthik Pattabiraman
【Abstract】: The scaling of silicon devices has exacerbated the unreliability of modern computer systems, and power constraints have necessitated the involvement of software in hardware error detection. Simultaneously, the multi-core revolution has impelled software to become parallel. Therefore, there is a compelling need to protect parallel programs from hardware errors. Parallel programs' tasks have significant similarity in control data due to the use of high-level programming models. In this study, we propose BLOCKWATCH to leverage the similarity in a parallel program's control data for detecting hardware errors. BLOCKWATCH statically extracts the similarity among different threads of a parallel program and checks the similarity at runtime. We evaluate BLOCKWATCH on seven SPLASH-2 benchmarks to measure its performance overhead and error detection coverage. We find that BLOCKWATCH incurs an average overhead of 16% across all programs, and provides an average SDC coverage of 97% for faults in the control data.
【Keywords】: Runtime checks; Parallel programs; Control-data; SPMD; Static Analysis
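The runtime side of the idea can be sketched as a cross-thread agreement check: in an SPMD program, threads evaluating the same statically-identified branch on similar control data should agree, so a deviating outcome suggests a hardware error. This majority-vote check is a simplification; BLOCKWATCH's static analysis derives finer-grained similarity classes rather than assuming all threads must match. The function name is an assumption.

```python
from collections import Counter

def check_branch_similarity(outcomes):
    """Cross-thread branch-outcome check (sketch).

    outcomes[tid] is the branch outcome observed by thread tid at a
    statically-marked similar branch. Returns the ids of threads that
    deviate from the majority outcome (suspected hardware errors)."""
    majority, _ = Counter(outcomes).most_common(1)[0]
    return [tid for tid, o in enumerate(outcomes) if o != majority]
```

A deviation triggers a recovery action (e.g. recomputation) rather than letting the corrupted control flow silently propagate into the program's output.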
【Paper Link】 【Pages】:1-12
【Authors】: Siva Kumar Sastry Hari ; Sarita V. Adve ; Helia Naeimi
【Abstract】: With technology scaling, transient faults are becoming an increasing threat to hardware reliability. Commodity systems must be made resilient to these in-field faults through very low-cost resiliency solutions. Software-level symptom detection techniques have emerged as promising low-cost and effective solutions. While the current user-visible Silent Data Corruption (SDC) rates for these techniques are relatively low, eliminating or significantly lowering the SDC rate is crucial for these solutions to become practically successful. Identifying and understanding program sections that cause SDCs is key to reducing (or eliminating) SDCs in a cost-effective manner. This paper provides a detailed analysis of code sections that produce over 90% of SDCs for the six applications we studied. This analysis facilitated the development of program-level detectors that catch errors in quantities that are either accumulated or active for a long duration, amortizing the detection costs. These low-cost detectors significantly reduce the dependency on redundancy-based techniques and provide more practical and flexible choice points on the performance vs. reliability trade-off curve. For example, for an average of 90%, 99%, or 100% reduction of the baseline SDC rate, the average execution overheads of our approach versus redundancy alone are respectively 12% vs. 30%, 19% vs. 43%, and 27% vs. 51%.
【Keywords】: Application resiliency; Hardware reliability; Transient faults; Silent data corruptions; Symptom-based fault detection
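The amortization argument is concrete enough to sketch: rather than duplicating every addition into a long-lived accumulator, a program-level detector checks the accumulated value once against application-derived bounds, paying the detection cost a single time. This is an illustrative shape for such a detector, not the paper's implementation; the bounds and names are assumptions.

```python
def monitored_sum(values, lo, hi):
    """Program-level detector sketch for an accumulated quantity.

    Accumulates normally, then performs one range check at the end:
    a transient fault that corrupts the accumulator will typically
    push it outside the application-derived bounds [lo, hi].
    Raises RuntimeError on a suspected silent data corruption."""
    acc = 0
    for v in values:
        acc += v  # unprotected fast path; no per-iteration duplication
    if not (lo <= acc <= hi):
        raise RuntimeError("possible SDC: accumulator out of bounds")
    return acc
```

Because the check runs once per accumulation rather than once per operation, its overhead stays far below full redundancy, which is the source of the 12%-vs-30%-style gaps the abstract reports.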
【Paper Link】 【Pages】:1-8
【Authors】: Jing Zhang ; Robin Berthier ; Will Rhee ; Michael Bailey ; Partha P. Pal ; Farnam Jahanian ; William H. Sanders
【Abstract】: Whether it happens through malware or through phishing, loss of one's online identity is a real and present danger. While many attackers seek credentials to realize financial gain, an analysis of the compromised accounts at our own institutions reveals that perpetrators often steal university credentials to gain free and unfettered access to information. This nontraditional motivation for credential theft puts a special burden on the academic institutions that provide these accounts. In this paper, we describe the design, implementation, and evaluation of a system for safeguarding academic accounts and resources called the University Credential Abuse Auditing System (UCAAS). We evaluate UCAAS at two major research universities with tens of thousands of user accounts and millions of login events during a two-week period. We show the UCAAS to be useful in reducing this burden, having helped the university security teams identify a total of 125 compromised accounts with zero false positives during the trial.
【Keywords】: Virtual Private Network (VPN); compromised account; university; authentication
【Paper Link】 【Pages】:1-12
【Authors】: Jiang Wang ; Kun Sun ; Angelos Stavrou
【Abstract】: Due to performance constraints, host intrusion detection defenses depend on event- and polling-based tamper-proof mechanisms to detect security breaches. These defenses monitor the state of critical software components in an attempt to discover any deviations from a pristine or expected state. The checks can be both periodic and event-based, for instance triggered by hardware events. In this paper, we demonstrate that all software and hardware-assisted defenses that analyze non-contiguous state to infer intrusions are fundamentally vulnerable to a new class of attacks that we call “evasion attacks”. We detail two categories of evasion attacks: directly intercepting the defense's triggering mechanism and indirectly inferring its periodicity. We show that evasion attacks are applicable to a wide range of protection mechanisms, and we analyze their applicability to recent state-of-the-art hardware-assisted protection mechanisms. Finally, we quantify the performance of implemented proof-of-concept prototypes for all of the attacks and suggest potential countermeasures.
【Keywords】: Hardware-assisted & software defenses; Integrity protection; Evasion Attacks
【Paper Link】 【Pages】:1-12
【Authors】: Amiya Kumar Maji ; Fahad A. Arshad ; Saurabh Bagchi ; Jan S. Rellermeyer
【Abstract】: Over the last three years, Android has established itself as the largest-selling operating system for smartphones. It boasts a Linux-based robust kernel, a modular framework with multiple components in each application, and a security-conscious design where each application is isolated in its own virtual machine. However, all of these desirable properties would be rendered ineffectual if an application were to deliver erroneous messages to targeted applications and thus cause the target to behave incorrectly. In this paper, we present an empirical evaluation of the robustness of Inter-component Communication (ICC) in Android through a fuzz testing methodology, whereby parameters of the inter-component communication are changed to various incorrect values. We show that not only is exception handling a rarity in Android applications, but it is also possible to crash the Android runtime from unprivileged user processes. Based on our observations, we highlight some of the critical design issues in Android ICC and suggest solutions to alleviate these problems.
【Keywords】: exception; android; fuzz; security; smartphone; robustness