DSN 2011: Hong Kong, China

Proceedings of the 2011 IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2011, Hong Kong, China, June 27-30, 2011. IEEE Computer Society 【DBLP Link】

Paper Num: 59 || Session Num: 0

1. Uni-directional trusted path: Transaction confirmation on just one device.

【Paper Link】 【Pages】:1-12

【Authors】: Atanas Filyanov ; Jonathan M. McCune ; Ahmad-Reza Sadeghi ; Marcel Winandy

【Abstract】: Commodity computer systems today do not include a full trusted path capability. Consequently, malware can control the user's input and output in order to reveal sensitive information to malicious parties or to generate manipulated transaction requests to service providers. Recent hardware offers compelling features for remote attestation and isolated code execution; however, these mechanisms are not widely used in deployed systems to date. We show how to leverage these mechanisms to establish a “one-way” trusted path allowing service providers to gain assurance that users' transactions were indeed submitted by a human operating the computer, instead of by malware such as transaction generators. We design, implement, and evaluate our solution, and argue that it is practical and offers immediate value in e-commerce, as a replacement for captchas, and in other Internet scenarios.

【Keywords】: trusted computing; security; transaction confirmation; trusted path

2. Boundless memory allocations for memory safety and high availability.

【Paper Link】 【Pages】:13-24

【Authors】: Marc Brunink ; Martin Süßkraut ; Christof Fetzer

【Abstract】: Spatial memory errors (like buffer overflows) are still a major threat for applications written in C. Most recent work focuses on memory safety - when a memory error is detected at runtime, the application is aborted. Our goal is not only to increase the memory safety of applications but also to increase the application's availability. Therefore, we need to tolerate spatial memory errors at runtime. We have implemented a compiler extension, Boundless, that automatically adds the tolerance feature to C applications at compile time. We show that this can increase the availability of applications. Our measurements also indicate that Boundless has a lower performance overhead than SoftBound, a state-of-the-art approach to detect spatial memory errors. Our performance gains result from a novel way to represent pointers. Nevertheless, Boundless is compatible with existing C code. Additionally, Boundless provides a trade-off to reduce the runtime overhead even further: We introduce vulnerability specific patching for spatial memory errors to tolerate only known vulnerabilities. Vulnerability specific patching has an even lower runtime overhead than full tolerance.

【Keywords】: Software safety; Bounds checking; Fault tolerance; Compiler transformation; Availability
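
To make the tolerance idea in this abstract concrete, below is a minimal Python sketch (not the paper's compiler-based implementation, whose pointer representation works on C code): out-of-bounds accesses are redirected to a side store instead of aborting. The class and field names are illustrative only.

```python
class BoundlessBuffer:
    """Toy model of tolerating spatial memory errors instead of aborting.

    In-bounds accesses go to the real buffer; out-of-bounds accesses are
    transparently redirected to a side table (the "boundless" area), so the
    program neither corrupts adjacent memory nor crashes.
    """

    def __init__(self, size):
        self.size = size
        self.data = [0] * size
        self.overflow = {}   # index -> value for out-of-bounds accesses

    def write(self, index, value):
        if 0 <= index < self.size:
            self.data[index] = value
        else:
            self.overflow[index] = value   # tolerate instead of abort

    def read(self, index):
        if 0 <= index < self.size:
            return self.data[index]
        return self.overflow.get(index, 0)  # default value for untouched slots


buf = BoundlessBuffer(4)
for i in range(6):          # writes past the end would overflow a C array
    buf.write(i, i * i)
print(buf.read(5))          # 25, served from the side table
```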

3. A methodology for the generation of efficient error detection mechanisms.

【Paper Link】 【Pages】:25-36

【Authors】: Matthew Leeke ; Saima Arif ; Arshad Jhumka ; Sarabjot Singh Anand

【Abstract】: A dependable software system must contain error detection mechanisms and error recovery mechanisms. Software components for the detection of errors are typically designed based on a system specification or the experience of software engineers, with their efficiency typically being measured using fault injection and metrics such as coverage and latency. In this paper, we introduce a methodology for the design of highly efficient error detection mechanisms. The proposed methodology combines fault injection analysis and data mining techniques in order to generate predicates for efficient error detection mechanisms. The results presented demonstrate the viability of the methodology as an approach for the development of efficient error detection mechanisms, as the predicates generated yield a true positive rate of almost 100% and a false positive rate very close to 0% for the detection of failure-inducing states. The main advantage of the proposed methodology over current state-of-the-art approaches is that efficient detectors are obtained by design, rather than by using specification-based detector design or the experience of software engineers.

【Keywords】: Decision Tree Induction; Software Dependability; Fault Injection; Data Mining; Error Detection Mechanisms
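
A rough, illustrative Python sketch of the pipeline this abstract outlines: run a fault-injection campaign, label observed states as failure-inducing or not, and mine a detector predicate from the data. The single-threshold learner below is a stand-in for the decision tree induction used in the paper, and all variable names and distributions are invented for illustration.

```python
import random

# Illustrative fault-injection traces: (observed_variable_value, failure_observed)
def run_fault_injection_campaign(n=1000):
    samples = []
    for _ in range(n):
        value = random.gauss(50, 10)
        if random.random() < 0.2:            # injected fault corrupts the variable
            value += random.uniform(40, 80)
        samples.append((value, value > 90))  # failures follow corrupted states
    return samples

def mine_threshold_predicate(samples):
    """Pick the threshold that best separates failure-inducing states
    (a stand-in for the decision-tree induction used in the paper)."""
    best_t, best_score = None, -1.0
    for t in range(0, 151):
        tp = sum(1 for v, fail in samples if fail and v > t)
        fp = sum(1 for v, fail in samples if not fail and v > t)
        fn = sum(1 for v, fail in samples if fail and v <= t)
        f1 = 2 * tp / max(1, 2 * tp + fp + fn)
        if f1 > best_score:
            best_t, best_score = t, f1
    return best_t, best_score

threshold, f1 = mine_threshold_predicate(run_fault_injection_campaign())
print(f"detector predicate: assert variable <= {threshold} (F1={f1:.2f})")
```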

4. Fault tolerant WSN-based structural health monitoring.

【Paper Link】 【Pages】:37-48

【Authors】: X. Liu ; J. Cao ; M. Bhuiyan ; S. Lai ; H. Wu ; G. Wang

【Abstract】: Fault tolerance in wireless sensor networks (WSNs) has been studied extensively by computer science researchers, who have proposed many fault-tolerant schemes for various applications including target and event detection. However, these schemes would fail in a particular application of WSNs: structural health monitoring (SHM). Unlike other applications of WSNs, detecting structural damage requires a significant amount of civil domain knowledge and uses a different detection model. Meanwhile, researchers in civil engineering have also proposed some fault-tolerant SHM algorithms. However, these algorithms are all centralized and not applicable to resource-limited wireless sensor networks. To the best of our knowledge, we are the first to address the fault tolerance problem in WSN-based SHM. We target faulty sensor readings, one of the most difficult types of sensor fault to detect, and propose a fault-tolerant SHM approach. The proposed approach is lightweight and able to disambiguate structural damage from sensor faults. The effectiveness of the proposed approach is demonstrated through both simulation and a real implementation.

【Keywords】: clustering; wireless sensor networks; structural health monitoring; fault tolerance

5. Online detection of multi-component interactions in production systems.

【Paper Link】 【Pages】:49-60

【Authors】: Adam J. Oliner ; Alex Aiken

【Abstract】: We present an online, scalable method for inferring the interactions among the components of large production systems. We validate our approach on more than 1.3 billion lines of log files from eight unmodified production systems, showing that our approach efficiently identifies important relationships among components, handles very large systems with many simultaneous signals in real time, and produces information that is useful to system administrators.

【Keywords】: signal compression; System management; statistical correlation; modeling; anomalies

6. An empirical study of the performance, security and privacy implications of domain name prefetching.

【Paper Link】 【Pages】:61-72

【Authors】: Srinivas Krishnan ; Fabian Monrose

【Abstract】: An increasingly popular technique for decreasing user-perceived latency while browsing the Web is to optimistically pre-resolve (or prefetch) domain name resolutions. In this paper, we present a large-scale evaluation of this practice using data collected over the span of several months, and show that it leads to noticeable increases in load on name servers, with questionable caching benefits. Furthermore, to assess the impact that prefetching can have on the deployment of security extensions to DNS (DNSSEC), we use a custom-built cache simulator to perform trace-based simulations using millions of DNS requests and responses collected campus-wide. We also show that the adoption of domain name prefetching raises privacy issues. Specifically, we examine how prefetching amplifies information disclosure attacks to the point where it is possible to infer the context of searches issued by clients.

【Keywords】: Privacy; Domain Name System; Measurements; Security

7. Efficient model checking of fault-tolerant distributed protocols.

【Paper Link】 【Pages】:73-84

【Authors】: Péter Bokor ; Johannes Kinder ; Marco Serafini ; Neeraj Suri

【Abstract】: To aid the formal verification of fault-tolerant distributed protocols, we propose an approach that significantly reduces the costs of their model checking. These protocols often specify atomic, process-local events that consume a set of messages, change the state of a process, and send zero or more messages. We call such events quorum transitions and leverage them to optimize state exploration in two ways. First, we generate fewer states compared to models where quorum transitions are expressed by single-message transitions. Second, we refine transitions into a set of equivalent, finer-grained transitions that allow partial-order algorithms to achieve better reduction. We implement the MP-Basset model checker, which supports refined quorum transitions. We model check protocols representing core primitives of deployed reliable distributed systems, namely: Paxos consensus, regular storage, and Byzantine-tolerant multicast. We achieve up to 92% memory and 85% time reduction compared to model checking with standard unrefined single-message transitions.

【Keywords】: Protocols; Computational modeling; Semantics; Silicon; Arrays; Computer bugs; Syntactics
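
The state-space saving from quorum transitions can be illustrated with a small Python sketch (not the MP-Basset implementation): delivering three acknowledgements one message at a time explores every intermediate subset, while a quorum transition consumes the whole set atomically.

```python
from itertools import permutations

MESSAGES = ["ack1", "ack2", "ack3"]   # a quorum of acknowledgements

def explore_single_message():
    """Each message delivery is a separate transition: every interleaving
    of deliveries creates its own intermediate states."""
    states = set()
    for order in permutations(MESSAGES):
        received = frozenset()
        for msg in order:
            received = received | {msg}
            states.add(received)       # all intermediate states are explored
    return states

def explore_quorum_transition():
    """A quorum transition consumes the whole message set atomically,
    so only the pre- and post-states exist."""
    return {frozenset(), frozenset(MESSAGES)}

print(len(explore_single_message()), "states with single-message transitions")
print(len(explore_quorum_transition()), "states with a quorum transition")
```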

8. Unified debugging of distributed systems with Recon.

【Paper Link】 【Pages】:85-96

【Authors】: Kyu Hyung Lee ; Nick Sumner ; Xiangyu Zhang ; Patrick Eugster

【Abstract】: To scale to today's complex distributed software systems, debugging and replaying techniques mostly focus on single facets of software, e.g., local concurrency, distributed messaging, or data representation. This forces developers to tediously combine different technologies such as instruction-level dynamic tracing, event log analysis, or global state reconstruction to gradually explain non-trivial defects. This paper proposes Recon, a debugging system that provides iterative and interactive homogeneous debugging services. Like related systems, Recon promotes SQL-like queries for debugging distributed systems. Unlike other approaches, however, Recon allows for all system artifacts including nodes, communication channels, events, or instructions to be uniformly described by relations. Also, an application in Recon originally runs with a lightweight logger that only collects replay logs for individual nodes. Developers debug a complete program by replaying the execution with fine-grained instrumentation that is capable of exposing instruction-level information. We illustrate the effectiveness of Recon on programs as diverse as BerkeleyDB, i3/Chord, RandTree, and Pastry. Our evaluation includes executions in local clusters as well as in Amazon EC2 and exhibits an unreported bug in RandTree.

【Keywords】: instrumentation; Software reliability; distributed systems; debugging; replay

9. Improving Log-based Field Failure Data Analysis of multi-node computing systems.

【Paper Link】 【Pages】:97-108

【Authors】: Antonio Pecchia ; Domenico Cotroneo ; Zbigniew Kalbarczyk ; Ravishankar K. Iyer

【Abstract】: Log-based Field Failure Data Analysis (FFDA) is a widely-adopted methodology to assess dependability properties of an operational system. A key step in FFDA is filtering out entries that are not useful, as well as redundant error entries, from the log. The latter is challenging: a fault, once triggered, can generate multiple errors that propagate within the system. Grouping the error entries related to the same fault manifestation is crucial to obtain realistic measurements. This paper deals with the issues of the tuple heuristic, used to group the error entries in the log, in multi-node computing systems. We demonstrate that the tuple heuristic can group entries incorrectly; thus, an improved heuristic that adopts statistical indicators is proposed. We assess the impact of inaccurate grouping on dependability measurements by comparing the results obtained with both heuristics. The analysis encompasses the log of the Mercury cluster at the National Center for Supercomputing Applications.

【Keywords】: dependability measurements; Field Failure Data Analysis; supercomputer; tuple heuristic; collision
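
For readers unfamiliar with the tuple heuristic mentioned in this abstract, here is a minimal Python sketch of the basic time-window grouping it refers to; the window length and log format are illustrative, and the paper's contribution is to augment this heuristic with statistical indicators to avoid incorrect groupings (collisions).

```python
def tuple_heuristic(entries, window=60.0):
    """Group error log entries into tuples: an entry joins the current tuple
    if it is within `window` seconds of the previous entry, otherwise it
    starts a new tuple (one tuple ~ one fault manifestation)."""
    tuples, current = [], []
    last_ts = None
    for ts, node, msg in sorted(entries):
        if last_ts is not None and ts - last_ts > window:
            tuples.append(current)
            current = []
        current.append((ts, node, msg))
        last_ts = ts
    if current:
        tuples.append(current)
    return tuples

log = [
    (10.0, "node1", "ECC error"),
    (12.5, "node2", "ECC error"),       # same fault propagating -> same tuple
    (500.0, "node7", "link down"),      # unrelated fault -> new tuple
]
for i, t in enumerate(tuple_heuristic(log)):
    print(f"tuple {i}: {t}")
```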

10. An analysis of signature overlaps in Intrusion Detection Systems.

【Paper Link】 【Pages】:109-120

【Authors】: Frédéric Massicotte ; Yvan Labiche

【Abstract】: An Intrusion Detection System (IDS) protects computer networks against attacks and intrusions, in combination with firewalls and anti-virus systems. One class of IDS is called signature-based network IDSs, as they monitor network traffic, looking for evidence of malicious behaviour as specified in attack descriptions (referred to as signatures). Many studies report that IDSs, including signature-based network IDSs, have problems accurately identifying attacks. One possible reason that we observed in our past work, and that is worth investigating further, is that several signatures (i.e., several alarms) can be triggered on the same group of packets, a situation we coined overlapping signatures. This paper presents a technique to precisely and systematically quantify the signature overlapping problem of an IDS signature database. The solution we describe is based on set theory and finite state automaton theory, and we experiment with our technique on one widely-used and maintained IDS. Results show that our approach is effective at systematically quantifying the overlap problem in one IDS signature database, and can be potentially used on other IDSs.

【Keywords】: Automaton Theory; Intrusion Detection Signature; Set Theory

11. Detecting stealthy P2P botnets using statistical traffic fingerprints.

【Paper Link】 【Pages】:121-132

【Authors】: Junjie Zhang ; Roberto Perdisci ; Wenke Lee ; Unum Sarfraz ; Xiapu Luo

【Abstract】: Peer-to-peer (P2P) botnets have recently been adopted by botmasters for their resiliency to take-down efforts. Besides being harder to take down, modern botnets tend to be stealthier in the way they perform malicious activities, making current detection approaches ineffective. In this paper, we propose a novel botnet detection system that is able to identify stealthy P2P botnets, even when malicious activities may not be observable. First, our system identifies all hosts that are likely engaged in P2P communications. Then, we derive statistical fingerprints to profile different types of P2P traffic, and we leverage these fingerprints to distinguish between P2P botnet traffic and other legitimate P2P traffic. Unlike previous work, our system is able to detect stealthy P2P botnets even when the underlying compromised hosts are running legitimate P2P applications (e.g., Skype) and the P2P bot software at the same time. Our experimental evaluation based on real-world data shows that the proposed system can achieve high detection accuracy with a low false positive rate.

【Keywords】: Security; Botnet; P2P; Intrusion Detection

12. Applying game theory to analyze attacks and defenses in virtual coordinate systems.

【Paper Link】 【Pages】:133-144

【Authors】: Sheila Becker ; Jeff Seibert ; David Zage ; Cristina Nita-Rotaru ; Radu State

【Abstract】: Virtual coordinate systems provide an accurate and efficient service that allows hosts on the Internet to determine latency to arbitrary hosts based on information provided by a subset of participating nodes. Unfortunately, the accuracy of the service can be severely impacted by compromised nodes providing misleading information. We define and use a game theory framework in order to identify the best attack and defense strategies assuming that the attacker is aware of the defense mechanisms. Our approach leverages concepts derived from the Nash equilibrium to model more powerful adversaries. We consider attacks that target the latency estimation (inflation, deflation, oscillation) and defense mechanisms that combine outlier detection with control theory to deter adaptive adversaries. We apply the game theory framework to demonstrate the impact and efficiency of these attack and defense strategies using a well-known virtual coordinate system and real-life Internet data sets.

【Keywords】: security; virtual coordinate systems; game theory

13. CEC: Continuous eventual checkpointing for data stream processing operators.

【Paper Link】 【Pages】:145-156

【Authors】: Zoe Sebepou ; Kostas Magoutis

【Abstract】: The checkpoint roll-backward methodology is the underlying technology of several fault-tolerance solutions for continuous stream processing systems today, implemented either using the memories of replica nodes or a distributed file system. In this scheme the recovering node loads its most recent checkpoint and requests log replay to reach a consistent pre-failure state. Challenges with that technique include its complexity (typically implemented via copy-on-write), the associated overhead (exception handling under state updates), and limits to the frequency of checkpointing. The latter limit affects the amount of information that needs to be replayed leading to long recovery times. In this work we introduce continuous eventual checkpointing (CEC), a novel mechanism to provide fault-tolerance guarantees by taking continuous incremental state checkpoints with minimal pausing of operator processing. We achieve this by separating operator state into independent parts and producing frequent independent partial checkpoints of them. Our results show that our method can achieve low overhead fault-tolerance with adjustable checkpoint intensity, trading off recovery time with performance.

【Keywords】: Fault-Tolerance; Continuous Stream Processing
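
A toy Python sketch of the idea described in the abstract, splitting operator state into independent parts and checkpointing them incrementally rather than pausing for a full snapshot. Class names, the round-robin policy, and the in-memory "stable storage" are illustrative assumptions, not CEC's actual design.

```python
class PartitionedOperatorState:
    """Toy model of continuous eventual checkpointing: operator state is split
    into independent parts, and on every update one dirty part is checkpointed,
    instead of pausing to snapshot the whole state at once."""

    def __init__(self, num_parts=4):
        self.parts = [dict() for _ in range(num_parts)]
        self.dirty = set()
        self.stable_storage = {}          # part_id -> last partial checkpoint
        self.next_part = 0

    def update(self, key, value):
        part_id = hash(key) % len(self.parts)
        self.parts[part_id][key] = value
        self.dirty.add(part_id)
        self.maybe_checkpoint()

    def maybe_checkpoint(self):
        """Checkpoint at most one dirty part per call (round-robin)."""
        for _ in range(len(self.parts)):
            pid = self.next_part
            self.next_part = (self.next_part + 1) % len(self.parts)
            if pid in self.dirty:
                self.stable_storage[pid] = dict(self.parts[pid])  # partial checkpoint
                self.dirty.discard(pid)
                return

    def recover(self):
        """After a crash, rebuild state from the latest partial checkpoints."""
        self.parts = [dict(self.stable_storage.get(i, {})) for i in range(len(self.parts))]


op = PartitionedOperatorState()
for i in range(10):
    op.update(f"key{i}", i)
op.recover()
print(sum(len(p) for p in op.parts), "entries recovered from partial checkpoints")
```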

14. Coercing clients into facilitating failover for object delivery.

【Paper Link】 【Pages】:157-168

【Authors】: Wyatt Lloyd ; Michael J. Freedman

【Abstract】: Application-level protocols used for object delivery, such as HTTP, are built atop TCP/IP and inherit its host-to-host abstraction. Given that these services are replicated for scalability, this unnecessarily exposes failures of individual servers to their clients. While changes to both client and server applications can be used to mask such failures, this paper explores the feasibility of transparent recovery for unmodified object delivery services (TRODS). The key insight in TRODS is cross-layer visibility and control: TRODS carefully derives reliable storage for application-level state from the mechanics of the transport layer. This state is used to reconstruct object delivery sessions, which are then transparently spliced into the client's ongoing connection. TRODS is fully backwards-compatible, requiring no changes to the clients or server applications. Its performance is competitive with unmodified HTTP services, providing nearly identical throughput while enabling timely failover.

【Keywords】: Servers; Monitoring; Protocols; Biomedical monitoring; Kernel; IP networks; Web and internet services

15. Phase-based reboot: Reusing operating system execution phases for cheap reboot-based recovery.

【Paper Link】 【Pages】:169-180

【Authors】: Kazuya Yamakita ; Hiroshi Yamada ; Kenji Kono

【Abstract】: Although operating systems (OSes) are crucial to achieving high availability of computer systems, modern OSes are far from bug-free. Rebooting the OS is simple, powerful, and sometimes the only remedy for kernel failures. Once we accept reboot-based recovery as a fact of life, we should try to ensure that the downtime caused by reboots is as short as possible. This paper presents “phase-based” reboots that shorten the downtime caused by reboot-based recovery. The key idea is to divide a boot sequence into phases. The phase-based reboot reuses a system state in the previous boot if the next boot reproduces the same state. A prototype of the phase-based reboot was implemented on Xen 3.4.1 running para-virtualized Linux 2.6.18. Experiments with the prototype show that it successfully recovered from kernel transient failures inserted by a fault injector, and its downtime was 34.3 to 93.6% shorter than that of the normal reboot-based recovery.

【Keywords】: Virtualization; Reboot-based Recovery; Operating System Reliability

16. Communix: A framework for collaborative deadlock immunity.

【Paper Link】 【Pages】:181-188

【Authors】: Horatiu Jula ; Pinar Tözün ; George Candea

【Abstract】: We present Communix, a collaborative deadlock immunity framework for Java programs. Deadlock immunity enables applications to avoid deadlocks that they previously encountered. Dimmunix, our deadlock immunity system, detects deadlocks and saves their signatures at runtime, then avoids execution flows that match these signatures; a signature is an abstraction of the execution flow that led to deadlock. Dimmunix needs all the deadlock bugs in an application to manifest, in all possible ways, in order to provide full protection against deadlocks for that application. Communix addresses this shortcoming by distributing the deadlock signatures produced by Dimmunix. The signatures of a deadlock can protect against the deadlock any user connected to the Internet and running the same application, even if he/she did not experience the deadlock yet. Besides signature distribution, Communix provides signature validation and generalization. Signature validation ensures that the incoming signatures match the target applications, and protect the users against malicious signatures. Signature generalization keeps the repository of deadlock signatures compact, by merging multiple deadlock signatures into one signature. Communix is application agnostic, i.e., it is applicable to any Java application. Communix is efficient and scalable, and can effectively protect Java applications against malicious signatures.

【Keywords】: System recovery; Servers; Synchronization; Java; History; Computer bugs; Internet

17. CloudVal: A framework for validation of virtualization environment in cloud infrastructure.

【Paper Link】 【Pages】:189-196

【Authors】: Cuong Manh Pham ; Daniel Chen ; Zbigniew Kalbarczyk ; Ravishankar K. Iyer

【Abstract】: We present CloudVal, a framework to validate the reliability of the virtualization environment in Cloud Computing infrastructure. A case study, based on injecting faults in the KVM hypervisor and Xen hypervisor, was conducted to show the viability of the framework. The study shows that due to the architectural differences between KVM and Xen, a direct comparison of the two virtualization systems is not feasible. In order to confidently weigh the error resiliency of virtualization systems, more comprehensive studies are required. We believe, however, that the fault injection approach and the fault models proposed in this paper are a good starting point towards designing and implementing a benchmark which would enable the assessment of different virtualization infrastructures in a common manner.

【Keywords】: Xen hypervisor; Cloud computing; fault injection; KVM hypervisor

18. Helmet: A resistance drift resilient architecture for multi-level cell phase change memory system.

【Paper Link】 【Pages】:197-208

【Authors】: Wangyuan Zhang ; Tao Li

【Abstract】: Phase change memory (PCM) is emerging as a promising solution for future memory systems and disk caches. As a type of resistive memory, PCM relies on the electrical resistance of Ge2Sb2Te5 (GST) to represent stored information. With the adoption of multi-level programming PCM devices, unwanted resistance drift is becoming an increasing reliability concern in future high-density, multi-level cell PCM systems. To address this issue without incurring a significant storage and performance overhead in ECC, conventional design employs a conservative approach, which increases the resistance margin between two adjacent states to combat resistance drift. In this paper, we show that the wider margin adversely impacts the low-power benefit of PCM by incurring up to 2.3X power overhead and causes up to 100X lifetime reduction, thereby exacerbating the wear-out issue. To tolerate resistance drift, we propose Helmet, a multi-level cell phase change memory architecture that can cost-effectively reduce the readout error rate due to drift. Therefore, we can relax the requirement on margin size, while preserving the readout reliability of the conservative approach, and consequently minimize the power and endurance overhead due to drift. Simulation results show that our techniques are able to decrease the error rate by an average of 87%. Alternatively, for satisfying the same reliability target, our schemes can achieve 28% power savings and a 15X endurance enhancement due to the reduced margin size when compared to the conservative approach.

【Keywords】: computer architecture; phase change memory; multi-level cell; resistance drifting; reliability

19. HDP code: A Horizontal-Diagonal Parity Code to Optimize I/O load balancing in RAID-6.

【Paper Link】 【Pages】:209-220

【Authors】: Chentao Wu ; Xubin He ; Guanying Wu ; Shenggang Wan ; Xiaohua Liu ; Qiang Cao ; Changsheng Xie

【Abstract】: With higher reliability requirements in clusters and data centers, RAID-6 has gained popularity due to its capability to tolerate concurrent failures of any two disks, which has been shown to be of increasing importance in large scale storage systems. Among various implementations of erasure codes in RAID-6, a typical set of codes known as Maximum Distance Separable (MDS) codes aims to offer data protection against disk failures with optimal storage efficiency. However, because of the limitation of horizontal parity or diagonal/anti-diagonal parities used in MDS codes, storage systems based on RAID-6 suffer from unbalanced I/O and thus low performance and reliability. To address this issue, in this paper, we propose a new parity called Horizontal-Diagonal Parity (HDP), which takes advantage of both horizontal and diagonal/anti-diagonal parities. The corresponding MDS code, called HDP code, distributes parity elements uniformly in each disk to balance the I/O workloads. HDP also achieves high reliability by speeding up recovery under single or double disk failures. Our analysis shows that HDP provides better balanced I/O and higher reliability compared to other popular MDS codes.

【Keywords】: Reliability; RAID-6; MDS Code; Load Balancing; Horizontal Parity; Diagonal/Anti-diagonal Parity; Performance Evaluation
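
To illustrate the two parity directions the abstract builds on, here is a small Python sketch of XOR-based horizontal and (wrapping) diagonal parity over a toy stripe, plus a single-element recovery. The actual HDP layout, which distributes parity elements across all disks to balance I/O, differs from this simplified arrangement.

```python
from functools import reduce
from operator import xor

# A toy 4x4 stripe of data elements (rows = stripe rows, columns = disks).
stripe = [
    [0x11, 0x22, 0x33, 0x44],
    [0x55, 0x66, 0x77, 0x88],
    [0x99, 0xAA, 0xBB, 0xCC],
    [0xDD, 0xEE, 0xFF, 0x01],
]
n = len(stripe)

# Horizontal parity: XOR across each row.
horizontal = [reduce(xor, row) for row in stripe]

# Diagonal parity: XOR along wrapping diagonals (simplified; the real HDP
# code places both parity kinds inside the array to balance per-disk I/O).
diagonal = [reduce(xor, (stripe[r][(r + d) % n] for r in range(n))) for d in range(n)]

# Recover a single lost element using horizontal parity.
lost_row, lost_col = 2, 1
recovered = horizontal[lost_row]
for c in range(n):
    if c != lost_col:
        recovered ^= stripe[lost_row][c]
assert recovered == stripe[lost_row][lost_col]
print("recovered element:", hex(recovered))
```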

20. LLS: Cooperative integration of wear-leveling and salvaging for PCM main memory.

【Paper Link】 【Pages】:221-232

【Authors】: Lei Jiang ; Yu Du ; Youtao Zhang ; Bruce R. Childers ; Jun Yang

【Abstract】: Phase change memory (PCM) has emerged as a promising technology for main memory due to many advantages, such as better scalability, non-volatility and fast read access. However, PCM's limited write endurance restricts its immediate use as a replacement for DRAM. Recent studies have revealed that a PCM chip which integrates millions to billions of bit cells has non-negligible variations in write endurance. Wear leveling techniques have been proposed to balance write operations to different PCM regions. To further prolong the lifetime of a PCM device after the failure of weak cells, techniques have been proposed to remap failed lines to spares and to salvage a PCM device that has a large number of failed lines or pages with graceful degradation. However, current wear-leveling and salvaging schemes have not been designed and integrated to work cooperatively to achieve the best PCM device lifetime. In particular, a non-contiguous PCM space generated from salvaging complicates wear leveling and incurs large overhead. In this paper, we propose LLS, a Line-Level mapping and Salvaging design. By allocating a dynamic portion of total space in a PCM device as backup space, and mapping failed lines to backup PCM, LLS constructs a contiguous PCM space and masks lower-level failures from the OS and applications. LLS seamlessly integrates wear leveling and salvaging and copes well with modern OSs, including ones that support multiple page sizes. Our experimental results show that LLS achieves 24% longer lifetime than a state-of-the-art technique. It has negligible hardware cost and performance overhead.

【Keywords】: Reliability; Salvaging; Wear Leveling; Hard Faults; Phase Change Memory
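
A minimal Python sketch of the line-remapping idea in this abstract: failed lines are mapped to a backup region so that software continues to see a contiguous space. Sizes, the translation interface, and class names are illustrative, not the LLS hardware design.

```python
class LineLevelSalvager:
    """Toy model of line-level salvaging: failed lines are remapped to a
    backup region so software keeps seeing a contiguous memory space."""

    def __init__(self, total_lines=1024, backup_fraction=0.05):
        self.backup_size = int(total_lines * backup_fraction)
        self.visible_lines = total_lines - self.backup_size
        self.remap = {}                 # failed visible line -> backup line
        self.free_backup = list(range(self.visible_lines, total_lines))

    def report_failure(self, line):
        """Called when a write to `line` fails permanently."""
        if line not in self.remap and self.free_backup:
            self.remap[line] = self.free_backup.pop()

    def translate(self, line):
        """Address translation applied on every access."""
        return self.remap.get(line, line)


mem = LineLevelSalvager()
mem.report_failure(42)                  # a worn-out line
print("line 42 now maps to physical line", mem.translate(42))
print("healthy line 7 maps to", mem.translate(7))
```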

21. Get off my prefix! the need for dynamic, gerontocratic policies in inter-domain routing.

【Paper Link】 【Pages】:233-244

【Authors】: Edmund L. Wong ; Vitaly Shmatikov

【Abstract】: Inter-domain routing in today's Internet is plagued by security and reliability issues (e.g., prefix hijacking), which are often caused by malicious or Byzantine misbehavior. We argue that route selection policies must move beyond static preferences that select routes on the basis of static attributes such as route length or which neighboring AS is advertising the route. We prove that route convergence in the presence of Byzantine misbehavior requires that the route selection metric include the dynamics of route updates as a primary component. We then describe a class of simple dynamic policies which consider the observed “ages” of routes. These gerontocratic policies can be combined with static preferences and implemented without major infrastructural changes. They guarantee convergence when adopted universally, without sacrificing most of the flexibility that autonomous systems enjoy in route selection. We empirically demonstrate that even if adopted unilaterally by a single autonomous system, gerontocratic policies yield significantly more stable routes, are more effective at avoiding prefix hijacks, and are as responsive to legitimate route changes as other policies.

【Keywords】: Convergence; Routing protocols; Advertising; Routing; Measurement; Stability analysis; Cryptography
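
A toy Python sketch of a "gerontocratic" selection rule as described in the abstract: among candidate routes, prefer the one with the longest continuously observed age, using a static preference (AS-path length) only as a tie-breaker. The data structures and the hijack scenario are illustrative assumptions.

```python
import time

class GerontocraticSelector:
    """Toy route selector that prefers the route with the longest observed
    stable age, breaking ties with a static preference (AS-path length)."""

    def __init__(self):
        self.first_seen = {}      # (prefix, as_path) -> time route first appeared

    def update(self, prefix, as_path, now=None):
        now = time.time() if now is None else now
        self.first_seen.setdefault((prefix, tuple(as_path)), now)

    def withdraw(self, prefix, as_path):
        self.first_seen.pop((prefix, tuple(as_path)), None)   # age resets on flaps

    def select(self, prefix, candidates, now=None):
        now = time.time() if now is None else now
        def score(as_path):
            age = now - self.first_seen.get((prefix, tuple(as_path)), now)
            return (age, -len(as_path))        # older first, shorter path as tie-break
        return max(candidates, key=score)


sel = GerontocraticSelector()
sel.update("10.0.0.0/8", ["AS1", "AS2"], now=0)        # long-lived legitimate route
sel.update("10.0.0.0/8", ["AS666"], now=1000)          # fresh, suspiciously short hijack
print(sel.select("10.0.0.0/8", [["AS1", "AS2"], ["AS666"]], now=1010))
```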

22. Zab: High-performance broadcast for primary-backup systems.

【Paper Link】 【Pages】:245-256

【Authors】: Flavio Paiva Junqueira ; Benjamin C. Reed ; Marco Serafini

【Abstract】: Zab is a crash-recovery atomic broadcast algorithm we designed for the ZooKeeper coordination service. ZooKeeper implements a primary-backup scheme in which a primary process executes client operations and uses Zab to propagate the corresponding incremental state changes to backup processes. Due to the dependence of an incremental state change on the sequence of changes previously generated, Zab must guarantee that if it delivers a given state change, then all other changes it depends upon must be delivered first. Since primaries may crash, Zab must satisfy this requirement despite crashes of primaries.

【Keywords】: Atomic broadcast; Fault tolerance; Distributed algorithms; Primary backup; Asynchronous consensus
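
A minimal Python sketch of the delivery requirement stated in the abstract (not the Zab protocol itself): a primary numbers its incremental state changes, and a backup applies a change only after every change it depends on has been applied.

```python
import heapq

class Backup:
    """Applies incremental state changes only in the exact order the primary
    generated them (a change depends on all changes before it)."""

    def __init__(self):
        self.next_seq = 1
        self.pending = []          # min-heap of out-of-order deliveries
        self.state = {}

    def deliver(self, seq, change):
        heapq.heappush(self.pending, (seq, change))
        while self.pending and self.pending[0][0] == self.next_seq:
            _, (key, value) = heapq.heappop(self.pending)
            self.state[key] = value            # apply only when the prefix is complete
            self.next_seq += 1

class Primary:
    def __init__(self, backups):
        self.seq = 0
        self.backups = backups
        self.state = {}

    def execute(self, key, value):
        self.state[key] = value                # execute the client operation
        self.seq += 1
        for b in self.backups:                 # broadcast the incremental change
            b.deliver(self.seq, (key, value))


backup = Backup()
primary = Primary([backup])
primary.execute("x", 1)
primary.execute("x", 2)
print(backup.state)        # {'x': 2}: changes applied in generation order
```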

23. Deadline-aware scheduling for Software Transactional Memory.

【Paper Link】 【Pages】:257-268

【Authors】: Walther Maldonado ; Patrick Marlier ; Pascal Felber ; Julia L. Lawall ; Gilles Muller ; Etienne Riviere

【Abstract】: Software Transactional Memory (STM) is an optimistic concurrency control mechanism that simplifies the development of parallel programs. Still, the value of STM has not yet been demonstrated for reactive applications that require bounded response time for some of their operations. We propose to support such applications by allowing the developer to annotate some transaction blocks with deadlines. Based on previous execution statistics, we adjust the transaction execution strategy by decreasing the level of optimism as the deadlines near through two modes of conservative execution, without overly limiting the progress of concurrent transactions. Our implementation comprises an STM extension for gathering statistics and implementing the execution mode strategies. We have also extended the Linux scheduler to disable preemption or migration of threads that are executing transactions with deadlines. Our experimental evaluation shows that our approach significantly improves the chance of a transaction meeting its deadline when its progress is hampered by conflicts.

【Keywords】: Contention Management; Transactional Memory; Scheduling
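
An illustrative Python sketch of the kind of deadline-driven mode switch the abstract describes: past execution statistics and the remaining time decide how optimistic a transaction's execution should be. The thresholds and mode names are invented for illustration and do not come from the paper.

```python
import statistics, time

class DeadlineAwareScheduler:
    """Toy policy: pick an execution mode for a transaction from its deadline
    and the statistics of its past attempts (optimism is reduced as the
    deadline approaches)."""

    def __init__(self):
        self.history = {}     # transaction name -> list of past durations

    def record(self, name, duration):
        self.history.setdefault(name, []).append(duration)

    def choose_mode(self, name, deadline, now=None):
        now = time.time() if now is None else now
        remaining = deadline - now
        past = self.history.get(name, [0.0])
        expected = statistics.mean(past)
        worst = max(past)
        if remaining > 3 * worst:
            return "optimistic"              # plenty of slack: abort/retry is fine
        if remaining > 1.5 * expected:
            return "conservative"            # reduce optimism: detect conflicts early
        return "irrevocable"                 # almost late: run alone, cannot abort


sched = DeadlineAwareScheduler()
for d in (0.010, 0.012, 0.011):
    sched.record("update-order", d)
print(sched.choose_mode("update-order", deadline=1.0, now=0.98))
```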

24. A combinatorial approach to detecting buffer overflow vulnerabilities.

【Paper Link】 【Pages】:269-278

【Authors】: Wenhua Wang ; Yu Lei ; Donggang Liu ; David Chenho Kung ; Christoph Csallner ; Dazhi Zhang ; Raghu Kacker ; Rick Kuhn

【Abstract】: Buffer overflow vulnerabilities are program defects that can cause a buffer to overflow at runtime. Many security attacks exploit buffer overflow vulnerabilities to compromise critical data structures. In this paper, we present a black-box testing approach to detecting buffer overflow vulnerabilities. Our approach is motivated by a reflection on how buffer overflow vulnerabilities are exploited in practice. In most cases the attacker can influence the behavior of a target system only by controlling its external parameters. Therefore, launching a successful attack often amounts to a clever way of tweaking the values of external parameters. We simulate the process performed by the attacker, but in a more systematic manner. A novel aspect of our approach is that it adapts a general software testing technique called combinatorial testing to the domain of security testing. In particular, our approach exploits the fact that combinatorial testing often achieves a high level of code coverage. We have implemented our approach in a prototype tool called Tance. The results of applying Tance to five open-source programs show that our approach can be very effective in detecting buffer overflow vulnerabilities.

【Keywords】: Buffer Overflow Vulnerability; Software Security; Security Testing
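
A small Python sketch of applying pairwise (2-way) combinatorial testing to externally controllable parameters, mixing nominal and oversized values as the abstract suggests; the parameter names, the greedy covering procedure, and the test values are illustrative, not the Tance tool.

```python
from itertools import combinations, product

# Candidate values for each externally controllable parameter; the long
# strings are the "attack-like" extreme values.
PARAMS = {
    "username": ["alice", "A" * 10_000],
    "path":     ["/index.html", "/" + "B" * 65_536],
    "header":   ["gzip", "C" * 4_096],
}

def pairwise_tests(params):
    """Greedy sketch of pairwise combinatorial test selection: keep adding
    full test cases until every pair of parameter values appears in at
    least one selected test."""
    names = list(params)
    needed = set()
    for n1, n2 in combinations(names, 2):
        for v1, v2 in product(params[n1], params[n2]):
            needed.add(((n1, v1), (n2, v2)))
    tests = []
    for case in product(*(params[n] for n in names)):
        assignment = dict(zip(names, case))
        covered = {((a, assignment[a]), (b, assignment[b]))
                   for a, b in combinations(names, 2)}
        if covered & needed:
            tests.append(assignment)
            needed -= covered
        if not needed:
            break
    return tests

for t in pairwise_tests(PARAMS):
    print({k: (v if len(v) < 20 else f"<{len(v)} bytes>") for k, v in t.items()})
```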

25. Cross-layer resilience using wearout aware design flow.

【Paper Link】 【Pages】:279-290

【Authors】: Bardia Zandian ; Murali Annavaram

【Abstract】: As process technology shrinks devices, circuits experience accelerated wearout. Monitoring wearout will be critical for improving the efficiency of error detection and correction. The most effective wearout monitoring approach relies on continuously checking only the most critical circuit paths to detect timing degradation. However, circuits optimized for power and area efficiency have a steep critical path wall in some designs. Furthermore, wearout depends on dynamic conditions, such as the processor's operating environment and the application-specific path utilization profile. The dynamic nature of wearout coupled with steep critical path walls may result in an excessive number of paths that need to be monitored. In this paper we propose a novel cross-layer circuit design flow that uses path timing information and runtime path utilization data to significantly enhance monitoring efficiency. The proposed methodology uses the application-specific path utilization profile to select only a few paths to be monitored for wearout. We propose and evaluate four novel algorithms for selecting paths to be monitored. These four approaches allow designers to select the best group of paths under varying power, area and monitoring budget constraints.

【Keywords】: Cross-layer design; Wearout; Timing margin

26. Transparent dynamic binding with fault-tolerant cache coherence protocol for chip multiprocessors.

【Paper Link】 【Pages】:291-302

【Authors】: Shuchang Shan ; Yu Hu ; Xiaowei Li

【Abstract】: Aggressive technology scaling makes chip multiprocessors increasingly error-prone. Core-level fault-tolerant approaches bind two cores to implement redundant execution and error detection. However, along with more cores integrated into one chip, existing static and dynamic binding schemes suffer from the scalability problem when considering the violation effects caused by external write operations. In this paper, we present a transparent dynamic binding (TDB) mechanism to address the issue. Learning from static binding schemes, we use the private caches to hold identical data blocks, thus reducing the global master-slave consistency maintenance to the scale of the private caches. With our fault-tolerant cache coherence protocol, TDB satisfies the objective of private cache consistency, and therefore provides excellent scalability and flexibility. Experimental results show that, for a set of parallel workloads, the overall performance of our TDB scheme is very close to that of baseline fault-tolerant systems, outperforming dynamic core coupling by 9.2%, 10.4%, 18% and 37.1% when considering 4, 8, 16 and 32 cores respectively.

【Keywords】: cache coherence protocol; Chip multiprocessor; fault tolerance; transparent dynamic binding (TDB); master-slave memory consistency

27. Fault injection-based assessment of aspect-oriented implementation of fault tolerance.

【Paper Link】 【Pages】:303-314

【Authors】: Ruben Alexandersson ; Johan Karlsson

【Abstract】: Aspect-oriented programming provides an interesting approach for implementing software-based fault tolerance as it allows the core functionality of a program and its fault tolerance features to be coded separately. This paper presents a comprehensive fault injection study that estimates the fault coverage of two software implemented fault tolerance mechanisms designed to detect or mask transient and intermittent hardware faults. We compare their fault coverage for two target programs and for three implementation techniques: manual programming in C and two variants of aspect-oriented programming. We also compare the impact of different compiler optimization levels on the fault coverage. The software-implemented fault tolerance mechanisms investigated are: i) triple time-redundant execution with voting and forward recovery, and ii) a novel dual signature control flow checking mechanism. The study shows that the variations in fault coverage among the implementation techniques generally are small, while some variations for different compiler optimization levels are significant.

【Keywords】: control flow checking; aspect oriented programming; fault tolerance; fault injection; time-redundant execution

28. A framework for early stage quality-fault tolerance analysis of embedded control systems.

【Paper Link】 【Pages】:315-322

【Authors】: Satya Gautam Vadlamudi ; P. P. Chakrabarti ; Dipankar Das ; Purnendu Sinha

【Abstract】: This work presents a static-analysis based method for analyzing the robustness of a given embedded control system design, in the presence of quality-faults in sensors, software components, and inter-connections. The method characterizes the individual components of the system by storing the relations between the precision of inputs and the precision of outputs in what we call lookup tables (LUTs). The network of LUTs thus formed, which represents the given control system, is converted into a satisfiability modulo theory (SMT) instance, such that a satisfying assignment corresponds to a potential counterexample (the set of quality-faults which violate the given fault-tolerance requirements) or hot-spot in the design. Hot-spots obtained in this manner are counter-verified through simulation to filter the false-positives. Experimental results on the fault-tolerant fuel controller from the Simulink automotive library demonstrate the efficacy of the proposed approach.

【Keywords】: robustness; embedded systems; fault injection; fault tolerant systems; quality faults

29. Characterization of logical masking and error propagation in combinational circuits and effects on system vulnerability.

【Paper Link】 【Pages】:323-334

【Authors】: Nishant J. George ; John Lach

【Abstract】: Among the masking phenomena that render immunity to combinational logic circuits from soft errors, logical masking is the hardest to model and characterize. This is mainly attributed to the fact that the algorithmic complexity of analyzing a combinational circuit for such masking is quite high, even for modestly sized circuits. In this paper, we present a hierarchical statistical approach to characterize the vulnerability of combinational circuits given logical masking and error propagation. By conducting detailed analyses and fault simulations for circuits at lower levels, initial assumptions of 100% vulnerability with single random output errors are refined. Fault simulations performed on the ISCAS85 benchmark circuits and Kogge-Stone adders of various widths demonstrate the varied nature of vulnerability for different circuits. The analysis performed at the circuit level for a 32-bit Kogge-Stone adder is applied to a microarchitecture simulation to examine impact on system-level vulnerability.

【Keywords】: statistical fault injection; logical masking; soft error vulnerability

30. A scalable availability model for Infrastructure-as-a-Service cloud.

【Paper Link】 【Pages】:335-346

【Authors】: Francesco Longo ; Rahul Ghosh ; Vijay K. Naik ; Kishor S. Trivedi

【Abstract】: High availability is one of the key characteristics of Infrastructure-as-a-Service (IaaS) cloud. In this paper, we show a scalable method for availability analysis of large scale IaaS cloud using analytic models. To reduce the complexity of analysis and the solution time, we use an interacting Markov chain based approach. The construction and the solution of the Markov chains are facilitated by the use of a high-level Petri net based paradigm known as stochastic reward net (SRN). The overall solution is composed by iterating over individual SRN sub-model solutions. Dependencies among the sub-models are resolved using fixed-point iteration, for which existence of a solution is proved. We compare the solution obtained from the interacting sub-models with a monolithic model and show that errors introduced by decomposition are insignificant. Additionally, we provide closed form solutions of the sub-models and show that our approach can handle very large size IaaS clouds.

【Keywords】: Markov models; Analytic model; availability analysis; cloud; fixed-point iteration
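
A minimal Python sketch of the fixed-point iteration used to resolve dependencies between interacting sub-models: each sub-model is solved with the other's latest output as input until the values stop changing. The two placeholder functions are invented for illustration and are not the paper's SRN sub-models.

```python
def solve_sub_model_a(parameter_from_b):
    """Placeholder sub-model: an availability-like output that depends on a
    parameter produced by the other sub-model."""
    return 0.99 - 0.5 / (1.0 + parameter_from_b)

def solve_sub_model_b(output_from_a):
    """Placeholder sub-model: an effective-rate-like output that depends on
    the other sub-model's result."""
    return 2.0 + 3.0 * output_from_a

def fixed_point(tol=1e-9, max_iter=1000):
    parameter = 1.0                         # initial guess
    for i in range(max_iter):
        output_a = solve_sub_model_a(parameter)
        new_parameter = solve_sub_model_b(output_a)
        if abs(new_parameter - parameter) < tol:
            return output_a, new_parameter, i
        parameter = new_parameter
    raise RuntimeError("fixed-point iteration did not converge")

output_a, parameter, iters = fixed_point()
print(f"converged after {iters} iterations: sub-model A output = {output_a:.6f}")
```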

31. Modeling and evaluating targeted attacks in large scale dynamic systems.

【Paper Link】 【Pages】:347-358

【Authors】: Emmanuelle Anceaume ; Bruno Sericola ; Romaric Ludinard ; Frederic Tronel

【Abstract】: In this paper we consider the problem of targeted attacks in large scale peer-to-peer overlays. These attacks aim at exhausting key resources of targeted hosts to diminish their capacity to provide or receive services. To defend the system against such attacks, we rely on clustering and implement induced churn to preserve the randomness of node identifiers so that adversarial predictions are impossible. We propose robust join, leave, merge and split operations to discourage brute force denial of service and pollution attacks. We show that combining a small amount of randomization in the operations with adequate tuning of the sojourn time of peers in the same region of the overlay allows us, first, to decrease the effect of targeted attacks at the cluster level, and second, to prevent pollution propagation in the whole overlay.

【Keywords】: Markov chains; Clusterized P2P Overlays; Adversary; Churn; Collusion

32. Incremental quantitative verification for Markov decision processes.

【Paper Link】 【Pages】:359-370

【Authors】: Marta Z. Kwiatkowska ; David Parker ; Hongyang Qu

【Abstract】: Quantitative verification techniques provide an effective means of computing performance and reliability properties for a wide range of systems. However, the computation required can be expensive, particularly if it has to be performed multiple times, for example to determine optimal system parameters. We present efficient incremental techniques for quantitative verification of Markov decision processes, which are able to re-use results from previous verification runs, based on a decomposition of the model into its strongly connected components (SCCs). We also show how this SCC-based approach can be further optimised to improve verification speed and how it can be combined with symbolic data structures to offer better scalability. We illustrate the effectiveness of the approach on a selection of large case studies.

【Keywords】: probabilistic model checking; Quantitative verification; incremental verification; Markov decision processes; performance analysis

33. Hypervisor-assisted application checkpointing in virtualized environments.

【Paper Link】 【Pages】:371-382

【Authors】: Min Lee ; A. S. Krishnakumar ; Parameshwaran Krishnan ; Navjot Singh ; Shalini Yajnik

【Abstract】: There are two broad categories of approaches used for checkpointing: application-transparent and application-assisted. Typically, application-assisted approaches provide a more flexible and light-weight mechanism but require changes to the application. Although most applications run well under virtualization (e.g. Xen which is being adopted widely), the addition of application-assisted checkpointing - used for high availability - causes performance problems. This is due to the overhead of key system calls used by the checkpointing techniques under virtualization. To overcome this, we introduce the notion of hypervisor-assisted application checkpointing with no changes to the guest operating system. We present the design and a Xen-based implementation of our family of application checkpointing techniques. Our experiments show performance improvements of 4× to 13× in the primitives used for supporting high availability compared to purely user-level approaches.

【Keywords】: high-availability; virtualization; hypervisor; Xen; checkpointing

34. OS diversity for intrusion tolerance: Myth or reality?

【Paper Link】 【Pages】:383-394

【Authors】: Miguel Garcia ; Alysson Neves Bessani ; Ilir Gashi ; Nuno Ferreira Neves ; Rafael R. Obelheiro

【Abstract】: One of the key benefits of using intrusion-tolerant systems is the possibility of ensuring correct behavior in the presence of attacks and intrusions. These security gains are directly dependent on the components exhibiting failure diversity. To what extent failure diversity is observed in practical deployment depends on how diverse are the components that constitute the system. In this paper we present a study with operating systems (OS) vulnerability data from the NIST National Vulnerability Database. We have analyzed the vulnerabilities of 11 different OSes over a period of roughly 15 years, to check how many of these vulnerabilities occur in more than one OS. We found this number to be low for several combinations of OSes. Hence, our analysis provides a strong indication that building a system with diverse OSes may be a useful technique to improve its intrusion tolerance capabilities.

【Keywords】: Intrusion Tolerance; Diversity; Vulnerabilities; NVD; Operating Systems

35. Resource and virtualization costs up in the cloud: Models and design choices.

【Paper Link】 【Pages】:395-402

【Authors】: Daniel Gmach ; Jerry Rolia ; Ludmila Cherkasova

【Abstract】: Virtualization offers the potential for cost-effective service provisioning. For service providers who make significant investments in new virtualized data centers in support of private or public clouds, one of the serious challenges is the problem of recovering costs for new server hardware, software, network, storage, management, etc. Gaining visibility and accurately determining the cost of shared resources used by collocated services is essential for implementing a proper chargeback approach in cloud environments. We introduce and compare three different models for apportioning cost and champion the one that is least sensitive to workload placement decisions and provides the most robust and repeatable cost estimates. A detailed study involving 312 workloads from an HP customer environment demonstrates the result. Finally, we employ the cost model in a case study that evaluates the impact on the cost of exploiting different virtualization platform alternatives for the 312 workloads. For example, some workloads may cost more to host using certain virtualization platforms than on others or on standalone hosts. We demonstrate different decision points with potential cost savings of nearly 20% by “right-virtualizing” the workloads.

【Keywords】: Cost models; Resource Sharing; Workload Placement; Virtualization; Burstiness

36. WaRR: A tool for high-fidelity web application record and replay.

【Paper Link】 【Pages】:403-410

【Authors】: Silviu Andrica ; George Candea

【Abstract】: We introduce WaRR, a tool that records and replays with high fidelity the interaction between users and modern web applications. WaRR consists of two independent components: the WaRR Recorder and the WaRR Replayer. The WaRR Recorder is embedded in a web browser, thus having access to user actions, and provides a complete interaction trace; this confers high recording fidelity. The WaRR Replayer uses an enhanced, developer-specific web browser that enables realistic simulation of user interaction; this confers high replaying fidelity. We describe two usage scenarios for WaRR that help developers improve the dependability of web applications: testing web applications against realistic human errors and generating user experience reports. WaRR helped us discover bugs in widely-used web applications, such as Google Sites, and offers higher recording fidelity compared to current tools.

【Keywords】: testing; web applications; record & replay

37. Aaron: An adaptable execution environment.

【Paper Link】 【Pages】:411-421

【Authors】: Marc Brunink ; André Schmitt ; Thomas Knauth ; Martin Süßkraut ; Ute Schiffel ; Stephan Creutz ; Christof Fetzer

【Abstract】: Software bugs and hardware errors are the largest contributors to downtime, and can be permanent (e.g. deterministic memory violations, broken memory modules) or transient (e.g. race conditions, bitflips). Although a large variety of dependability mechanisms exist, only a few are used in practice. The existing techniques do not prevail for several reasons: (1) the introduced performance overhead is often not negligible, (2) the gained coverage is not sufficient, and (3) users cannot control and adapt the mechanism. Aaron tackles these challenges by detecting hardware and software errors using automatically diversified software components. It uses these software variants only if CPU spare cycles are present in the system. In this way, Aaron increases fault coverage without incurring a perceivable performance penalty. Our evaluation shows that Aaron provides the same throughput as an execution of the original application while checking a large percentage of requests, whenever load permits.

【Keywords】: Compiler transformation; Fault detection; Fault tolerance; Diversity methods; Adaptive algorithm

38. Toward dependability benchmarking of partitioning operating systems.

【Paper Link】 【Pages】:422-429

【Authors】: Raul Barbosa ; Johan Karlsson ; Qiu Yu ; Xiaozhen Mao

【Abstract】: This paper describes a dependability benchmark intended to evaluate partitioning operating systems. The benchmark includes both hardware and software faultloads and measures the spatial as well as the temporal isolation among tasks, provided by a given real-time operating system. To validate the benchmark, a prototype implementation is carried out and three targets are benchmarked according to the specified process. The results substantiate that the proposed benchmark is able to compare and rank the targets in an objective way, and that it provides the ability to identify aspects of the target systems that need improvement.

【Keywords】: fault tolerance; dependability benchmarking; fault injection; partitioning; operating systems

39. Modeling stream processing applications for dependability evaluation.

【Paper Link】 【Pages】:430-441

【Authors】: Gabriela Jacques-Silva ; Zbigniew Kalbarczyk ; Bugra Gedik ; Henrique Andrade ; Kun-Lung Wu ; Ravishankar K. Iyer

【Abstract】: This paper describes a modeling framework for evaluating the impact of faults on the output of streaming applications. Our model is based on three abstractions: stream operators, stream connections, and tuples. By composing these abstractions within a Stochastic Activity Network, we allow the modeling of complete applications. We consider faults that lead to data loss and to silent data corruption (SDC). Our framework captures how faults originating in one operator propagate to other operators down the stream processing graph. We demonstrate the extensibility of our framework by evaluating three different fault tolerance techniques: checkpointing, partial graph replication, and full graph replication. We show that under crashes that lead to data loss, partial graph replication has a great advantage in maintaining the accuracy of the application output when compared to checkpointing. We also show that SDC can break the no data duplication guarantees of a full graph replication-based fault tolerance technique.

【Keywords】: Computer crashes; Logic gates; Fault tolerance; Fault tolerant systems; Storage area networks; Data models; Stochastic processes

40. Modeling and analysis of the impact of failures in Electric Power Systems organized in interconnected regions.

【Paper Link】 【Pages】:442-453

【Authors】: Silvano Chiaradonna ; Felicita Di Giandomenico ; Nicola Nostro

【Abstract】: Analysis of interdependencies in Electric Power Systems (EPS) has been recognized as a crucial and challenging issue to improve their trustworthiness. The recent liberalization process in energy markets has promoted the entry of a variety of operators in the electricity industry. The resulting new organization contributed to an increase in complexity, heterogeneity and interconnection. This paper proposes a framework for analyzing EPS organized as a set of interconnected regions, both from the point of view of the electric power grid and of the cyber control infrastructure. The emphasis is on interdependencies and on assessing their impact on indicators representative of the QoS perceived by users. Taking a reference power grid as a test case, the effects of failures on selected measures are shown, both when the grid is partitioned into a number of regions and when it consists of a single region, to illustrate the behavior of different grid and control configurations.

【Keywords】: Blackout-size Assessment; Stochastic Modeling; Electric Power System; Infrastructures Dependencies

41. High performance state-machine replication.

【Paper Link】 【Pages】:454-465

【Authors】: Parisa Jalili Marandi ; Marco Primi ; Fernando Pedone

【Abstract】: State-machine replication is a well-established approach to fault tolerance. The idea is to replicate a service on multiple servers so that it remains available despite the failure of one or more servers. From a performance perspective, state-machine replication has two limitations. First, it introduces some overhead in service response time, due to the requirement to totally order commands. Second, service throughput cannot be augmented by adding replicas to the system. We address the two issues in this paper. We use speculative execution to reduce the response time and state partitioning to increase the throughput of state-machine replication. We illustrate these techniques with a highly available parallel B-tree service.

【Keywords】: Servers; Time factors; Throughput; Protocols; Context; Out of order; Fault tolerance

42. How secure are networked office devices?

【Paper Link】 【Pages】:465-472

【Authors】: Edward Condon ; Emily Cummins ; Zaïna Afoulki ; Michel Cukier

【Abstract】: Many office devices have a history of being networked (such as printers) and others without the same past are increasingly becoming networked (such as photocopiers). The modern networked versions of previously non-networked devices have much in common with traditional networked servers in terms of features and functions. While an organization may have policies and procedures for securing traditional network servers, securing networked office devices providing similar services can easily be overlooked. In this paper we present an evaluation of privacy and security risks found when examining over 1,800 networked office devices connected to a large university network. We use the STRIDE threat model to categorize threats and vulnerabilities and then we group the devices according to assessed risk from the perspective of the university. We found that while steps had been taken to secure some devices, many were using default or unsecured configurations.

【Keywords】: printers; networked devices; network security; risk assessment; privacy

43. A combinatorial approach to network covert communications with applications in Web Leaks.

【Paper Link】 【Pages】:474-485

【Authors】: Xiapu Luo ; Peng Zhou ; Edmond W. W. Chan ; Rocky K. C. Chang ; Wenke Lee

【Abstract】: Various effective network covert channels have recently demonstrated the feasibility of encoding messages into the timing or content of individual network objects, such as data packets and request messages. However, we show in this paper that more robust and stealthy network covert channels can be devised by exploiting the relationship of the network objects. In particular, we propose a combinatorial approach for devising a wide spectrum of covert channels which can meet different objectives based on the channel capacity and channel undetectability. To illustrate the approach, we design WebLeaks and ACKLeaks, two novel covert channels which can leak information through the data and acknowledgment traffic in a web session. We implement both channels and deploy them on the PlanetLab nodes for evaluation. Besides the channel capacity, we apply the state-of-the-art detection schemes to evaluate their camouflage capability. The experiment results show that their capacity can be boosted up by our combinatorial approach, and at the same time they can effectively evade the detection.

【Keywords】: Encoding; Decoding; Partitioning algorithms; Channel capacity; Timing; Algorithm design and analysis; Indexes

44. Crash graphs: An aggregated view of multiple crashes to improve crash triage.

Paper Link】 【Pages】:486-493

【Authors】: Sunghun Kim ; Thomas Zimmermann ; Nachiappan Nagappan

【Abstract】: Crash reporting systems play an important role in the overall reliability and dependability of software systems, helping to identify and debug crashes in software deployed in the field. At Microsoft, for example, the Windows Error Reporting (WER) system receives crash data from users, classifies it, and presents crash information for developers to fix. However, most crash reporting systems deal with crashes individually: they compare crashes one by one to classify them, which may cause misclassification. Developers also need to download multiple crash data files for debugging, which requires non-trivial effort. In this paper, we propose an approach based on crash graphs, which are an aggregated view of multiple crashes. Our experience with crash graphs indicates that they reduce misclassification and help identify fixable crashes in advance.
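
As a rough illustration (not the paper's exact construction or similarity metric), a crash graph can be approximated by the set of caller-callee edges aggregated over a bucket's stack traces; a new crash is then matched against a bucket by edge overlap.

    def build_crash_graph(stack_traces):
        """Aggregate a bucket's traces into one edge set (caller -> callee)."""
        edges = set()
        for trace in stack_traces:         # trace = [outermost, ..., crash frame]
            edges.update(zip(trace, trace[1:]))
        return edges

    def similarity(graph_edges, trace):
        """Fraction of the new trace's edges already present in the crash graph."""
        trace_edges = set(zip(trace, trace[1:]))
        if not trace_edges:
            return 0.0
        return len(trace_edges & graph_edges) / len(trace_edges)

    bucket = build_crash_graph([
        ["main", "parse", "read_token", "strlen"],
        ["main", "parse", "read_token", "memcpy"],
    ])
    print(similarity(bucket, ["main", "parse", "read_token", "strcpy"]))  # ~0.67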

【Keywords】: network; crash; graph; triaging

45. Amplifying limited expert input to sanitize large network traces.

Paper Link】 【Pages】:494-505

【Authors】: Xin Huang ; Fabian Monrose ; Michael K. Reiter

【Abstract】: We present a methodology for identifying sensitive data in packet payloads, motivated by the need to sanitize packets before releasing them (e.g., for network security/dependability analysis). Our methodology accommodates packets recorded from an incompletely documented protocol, in which case it will be necessary to consult a human expert to determine what packet data is sensitive. Since expert availability for such tasks is limited, however, our methodology adopts a hierarchical approach in which most packet inspection is done by less-trained workers whose designations of sensitive data in selected packets best match the expert's. At the core of our methodology is a data reduction and presentation algorithm that selects candidate workers based on their evaluations of a small number of packets; that solicits these workers' designations of sensitive data in a larger (but still minuscule) subset of packets; and then applies these designations to mark sensitive data in the entire data set. We detail our algorithms and evaluate them in a realistic user study.
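
A minimal sketch of the worker-selection step, under our own simplifying assumptions (agreement measured by Jaccard similarity over byte positions marked sensitive; the paper's actual selection criterion may differ).

    def jaccard(a, b):
        """Agreement between two sets of (packet_id, offset) positions marked sensitive."""
        return len(a & b) / len(a | b) if (a | b) else 1.0

    def select_workers(expert_marks, worker_marks, keep=2):
        """Rank candidate workers by agreement with the expert on a small seed set."""
        scored = sorted(worker_marks.items(),
                        key=lambda kv: jaccard(expert_marks, kv[1]),
                        reverse=True)
        return [name for name, _ in scored[:keep]]

    expert  = {("pkt1", 4), ("pkt1", 5), ("pkt2", 0)}
    workers = {
        "w1": {("pkt1", 4), ("pkt1", 5), ("pkt2", 0)},   # perfect agreement
        "w2": {("pkt1", 4), ("pkt2", 7)},                # partial agreement
        "w3": {("pkt3", 1)},                             # no overlap
    }
    print(select_workers(expert, workers))               # ['w1', 'w2']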

【Keywords】: sensitive data; sanitization; packet payloads

46. Analysis of security data from a large computing organization.

Paper Link】 【Pages】:506-517

【Authors】: Aashish Sharma ; Zbigniew Kalbarczyk ; James Barlow ; Ravishankar K. Iyer

【Abstract】: This paper presents an in-depth study of the forensic data on security incidents that have occurred over a period of 5 years at the National Center for Supercomputing Applications at the University of Illinois. The proposed methodology combines automated analysis of data from security monitors and system logs with human expertise to extract and process relevant data in order to: (i) determine the progression of an attack, (ii) establish incident categories and characterize their severity, (iii) associate alerts with incidents, and (iv) identify incidents missed by the monitoring tools and examine why they escaped detection. The analysis conducted provides the basis for incident modeling and the design of new techniques for security monitoring.

【Keywords】: large scale computing systems; incident/attack data analysis; security monitoring; alerts

47. Coerced Cache Eviction and discreet mode journaling: Dealing with misbehaving disks.

Paper Link】 【Pages】:518-529

【Authors】: Abhishek Rajimwale ; Vijay Chidambaram ; Deepak Ramamurthi ; Andrea C. Arpaci-Dusseau ; Remzi H. Arpaci-Dusseau

【Abstract】: We present Coerced Cache Eviction (CCE), a new method to force writes to disk in the presence of a disk cache that does not properly obey write-cache configuration or flush requests. We demonstrate the utility of CCE by building a new journaling mode within the Linux ext3 file system. When mounted in this discreet mode, ext3 uses CCEs to ensure that writes are properly ordered and thus maintains file system integrity despite the presence of an improperly behaving disk. We show that discreet mode journaling operates with acceptable overheads for most workloads.
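
The core intuition can be caricatured in user space (the paper's mechanism actually operates inside ext3's journaling layer and works at the block level): if the drive's write cache holds at most CACHE_SIZE bytes, writing more than that amount of throwaway data afterwards forces the earlier, important writes out of the cache and onto the platter even if the drive ignores flush commands. Sizes and file names below are hypothetical.

    import os

    CACHE_SIZE = 32 * 1024 * 1024   # assumed upper bound on the drive's write cache

    def coerced_flush(fd, eviction_file):
        """Illustrative coerced eviction: after the data we care about is written
        to fd, write > CACHE_SIZE of throwaway blocks so the drive must evict
        (i.e. persist) the earlier data even if it ignores flush requests."""
        os.fsync(fd)                                    # ask politely first
        junk = os.open(eviction_file, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
        block = b"\0" * (1024 * 1024)
        for _ in range(CACHE_SIZE // len(block) + 1):
            os.write(junk, block)
        os.fsync(junk)
        os.close(junk)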

【Keywords】: reliability; file systems; disks; journaling

48. Impact of temperature on hard disk drive reliability in large datacenters.

Paper Link】 【Pages】:530-537

【Authors】: Sriram Sankar ; Mark Shaw ; Kushagra Vaid

【Abstract】: When datacenters are pushed to their limits of operational efficiency, reducing failure rates becomes critical for maintaining high levels of healthy server operation. In this experience report, we present a dense storage case study from a large population of servers housing tens of thousands of disk drives. Previous studies have presented divergent results concerning the correlation between temperature and hard disk drive failures. In our paper, we specifically establish correlations between temperature and failures observed at different location granularities: a) drive locations within a server chassis, b) server locations within a rack, and c) multiple racks in a datacenter. We also establish that temperature correlates more strongly with drive failures than disk utilization does. Thus, we show that temperature-aware server and datacenter design plays a pivotal role in datacenter reliability. Following our case study, we present a reliability model for estimating hard disk drive failures correlated with datacenter operating temperature, using a physical Arrhenius model with empirically derived coefficients. We show an application of the model for selecting the datacenter inlet temperature setpoint for two different server storage configurations. Finally, with the help of a datacenter cost discussion, we highlight the need to incorporate reliability-aware datacenter design for increased efficiency in large-scale datacenters.
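
For reference, the standard Arrhenius form behind such models is sketched below in LaTeX; the paper fits its coefficients (notably the activation energy E_a) empirically, and those values are not reproduced here.

    % lambda(T): failure rate at absolute temperature T; k_B: Boltzmann constant;
    % E_a: activation energy (empirically fitted); A: scaling constant.
    \lambda(T) = A \, e^{-E_a/(k_B T)}, \qquad
    AF(T_1 \to T_2) = \frac{\lambda(T_2)}{\lambda(T_1)}
                    = \exp\!\left[\frac{E_a}{k_B}\left(\frac{1}{T_1} - \frac{1}{T_2}\right)\right]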

【Keywords】: Correlation; Datacenter; Reliability; Temperature; Hard Disk Drive

49. Merging ultra-low duty cycle networks.

Paper Link】 【Pages】:538-549

【Authors】: Matthew Dobson ; Spyros Voulgaris ; Maarten van Steen

【Abstract】: Energy is the scarcest resource in ad-hoc wireless networks, particularly in wireless sensor networks requiring a long lifetime. Intermittently switching the radio on and off is widely adopted as the most effective way to keep energy consumption low. This, however, defeats the very goal of communication unless nodes switch their radios on at synchronized intervals, a rather nontrivial coordination task. In this paper we address the problem of synchronizing node radios to a single universal schedule in very large-scale wireless ad-hoc networks. More specifically, we focus on how independently synchronized clusters of nodes can detect each other and merge to a common radio schedule. Our main contributions are identifying the fundamental subproblems that govern cluster merging, providing a detailed comparison of the respective policies and their combinations, and supporting them with extensive simulations. Energy consumption, convergence speed, and network scalability have been the driving factors in our evaluation. The proposed policies are extensively tested in networks of up to 4,096 nodes. Our work is based on GMAC, a gossip-based MAC protocol for wireless ad-hoc networks.

【Keywords】: Topology; Ad hoc networks

50. Modeling time correlation in passive network loss tomography.

Paper Link】 【Pages】:550-561

【Authors】: Jin Cao ; Aiyou Chen ; Patrick P. C. Lee

【Abstract】: We consider the problem of inferring link loss rates using passive measurements. Prior inference approaches are mainly built on the time correlation of packet losses. However, passive inference generally has limited control over the measurement process, and adapting loss rate inference to the impact of time correlation is a challenging issue. We address this issue and propose a new loss model that expresses an inferred link loss rate as a function of time correlation. We show that this loss model is identifiable and propose a novel profile-likelihood-based inference approach that can accurately infer link loss rates for various complex topologies (e.g., trees with many leaf branches). We validate the accuracy of our inference approach with model and network simulations.

【Keywords】: performance evaluation and assessment; network tomography; passive loss rate inference; time correlation; measurement and monitoring techniques

51. Simple bounds for a transient queue.

Paper Link】 【Pages】:562-573

【Authors】: Takayuki Osogami ; Rudy Raymond

【Abstract】: Bounds on the performance of a queueing model can provide useful information for guaranteeing quality of service in communication networks. We study bounds on the mean delay in a transient GI/GI/1 queue given the first two moments of the service time and the inter-arrival time, respectively. We establish a simple upper bound, which is then used to show that the true transient mean delay is at most four times larger than an asymptotic diffusion approximation. We also prove that the tight lower bound is zero as long as the service time and the inter-arrival time have finite variance and the load is below one. Tightness of the trivial lower bound contrasts with the stationary mean delay, which has a strictly positive lower bound when the service time is sufficiently variable. We also show how our results can be applied to analyze the transient mean delay of packets in the real-world Internet.
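
For context only (the paper's transient bounds themselves are not reproduced here), the classical stationary counterpart is Kingman's two-moment upper bound on the mean waiting time in a GI/GI/1 queue:

    % lambda: arrival rate; sigma_a^2, sigma_s^2: variances of inter-arrival
    % and service times; rho = lambda * E[S] < 1 is the load.
    \mathbb{E}[W_\infty] \;\le\; \frac{\lambda\,(\sigma_a^2 + \sigma_s^2)}{2\,(1 - \rho)}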

【Keywords】: moments; Queue; GI/GI/1; bounds

52. Approximate analysis of blocking queueing networks with temporal dependence.

Paper Link】 【Pages】:574-585

【Authors】: Vittoria De Nitto Persone ; Giuliano Casale ; Evgenia Smirni

【Abstract】: In this paper we extend the class of MAP queueing networks to include blocking models, which are useful for describing the performance of service instances that have a limited concurrency level. We consider two different blocking mechanisms: Repetitive Service-Random Destination (RS-RD) and Blocking After Service (BAS). We propose a methodology to evaluate MAP queueing networks with blocking based on the recently proposed Quadratic Reduction (QR), a state space transformation that decreases the number of states in the Markov chain underlying the queueing network model. From this reduced state space, we obtain boundable approximations of average performance indexes such as throughput, response time, and utilization. The two approximations that dramatically enhance the QR bounds are based on maximum entropy and on a novel minimum mutual information principle, respectively. Stress cases of increasing complexity illustrate the excellent accuracy of the proposed approximations on several models of practical interest.

【Keywords】: Queueing analysis; Markov processes; Approximation methods; Complexity theory; Computational modeling; Throughput; Joints

53. 5th Workshop on Recent Advances in Intrusion-Tolerant Systems WRAITS 2011.

Paper Link】 【Pages】:586-587

【Authors】: Alysson Neves Bessani ; Partha P. Pal

【Abstract】: The 5th Workshop on Recent Advances in Intrusion-Tolerant Systems, held in conjunction with DSN 2011, aims to continue the collaborative discourse on the challenges of building intrusion-tolerant systems and innovative ideas to address them.

【Keywords】: Conferences; Fault tolerance; Intrusion detection; Fault tolerant systems; Software engineering

54. Introduction to the fifth workshop on dependable and secure nanocomputing.

Paper Link】 【Pages】:588-589

【Authors】: Jean Arlat ; Cristian Constantinescu ; Johan Karlsson ; Takashi Nanya ; Alan Wood

【Abstract】: Nanocomputing and related enabling technologies hold the promise of higher performance and lower power consumption, as well as increased communication capabilities and functionality. In addition to its impact on today's computerized systems, nanocomputing is an essential lever to foster the emerging cyberphysical system paradigm. However, the dependability and security of these unprecedentedly small devices, of their deployment, and of their interconnection remain uncertain. The main sources of concern are: • Nanometer devices are expected to be highly sensitive to process variations. The guard-bands used today for avoiding the impact of such variations will not represent a feasible solution in the future. As a consequence, timing errors and their higher frequency of occurrence have to be addressed. • New and intricate failure modes, specific to new materials, are expected to raise serious challenges to design and test engineers. • Environment-induced errors, such as single event upsets (SEU), are likely to occur more frequently than in the case of more conventional semiconductor devices. • Hardware architectures encompassing resilience techniques are needed to achieve highly reliable, energy-efficient systems. • The increased complexity of systems based on nanotechnology will require improved computer-aided design (CAD) tools, as well as better validation techniques. • The security of nanocomputing systems may be threatened by malicious attacks targeting new vulnerable areas in the hardware.

【Keywords】: Conferences; Hardware; Testing; Computer architecture; Program processors; Materials

55. The First International Workshop on Dependability of Clouds, data centers and Virtual Computing Environments.

Paper Link】 【Pages】:590-591

【Authors】: Jogesh K. Muppala ; Matti A. Hiltunen ; Robert J. Stroud ; Ji Wang

【Abstract】: Cloud computing can be characterized as the culmination of the integration of computing and data infrastructures to provide a scalable, agile, and cost-effective approach to supporting the ever-growing critical IT needs (in terms of computation and storage) of both enterprises and the general public. Cloud computing introduces a paradigm shift in computing where the ownership of computing resources is no longer necessary for businesses and individuals to provide services to their end-users over the Internet. Cloud computing relieves its users from the burdens of provisioning and managing their own data centers and allows them to pay for resources only when they are actually needed and used. However, a shared cloud infrastructure introduces a number of new dependability challenges for both cloud providers and users. Indeed, all data gets created, stored, shared, and manipulated within the cloud.

【Keywords】: Conferences; Cloud computing; Security; Resource management; Computer architecture; Availability

56. Seventh workshop on hot topics in system dependability (HotDep'11).

Paper Link】 【Pages】:592

【Authors】: Andreas Haeberlen ; Mootaz Elnozahy

【Abstract】: We are pleased to present to the community the proceedings of HotDep'11, the seventh instance of the HotDep workshop series. The goals of HotDep are to bring forth cutting-edge research ideas spanning the various domains of dependable systems, and to build linkages between two communities with active interest in dependability research, namely researchers who attend traditional dependability conferences such as DSN and ISSRE, and those who attend mainstream systems conferences such as SOSP, OSDI, and EuroSys. To achieve these goals, the workshop has been alternating between a dependability conference in odd years and a systems conference in even years. This year, we have kept the tradition of previous meetings by selecting a program committee with a mix of researchers from both communities: • Lorenzo Alvisi, University of Texas at Austin • Remzi Arpaci-Dusseau, University of Wisconsin, Madison • Byung-Gon Chun, Intel Labs Berkeley • Arun Iyengar, IBM Research • Mohamed Kaâniche, LAAS-CNRS • Daniel Mossé, University of Pittsburgh • Priya Narasimhan, Intel Labs/Carnegie Mellon University • James Plank, University of Tennessee • André Schiper, EPFL • Alexander Shraer, Yahoo! Research • Atul Singh, NEC Labs, Princeton • Marco Vieira, University of Coimbra • Lidong Zhou, Microsoft Research Asia, Beijing

【Keywords】: Conferences; Communities; Software reliability; Couplings; Intrusion detection; Asia; Fault tolerant systems

57. WOSD 2011 the first international workshop on open systems dependability.

Paper Link】 【Pages】:593-594

【Authors】: Mario Tokoro ; Karama Kanoun ; Kimio Kuramitsu ; Jean-Charles Fabre

【Abstract】: Modern computer systems are increasing in complexity, spread, and scale in order to meet the diverse and sophisticated needs of users. In the development of these systems, we inevitably use legacy code and off-the-shelf software as black-box components in order to shorten development time and lower development cost. These systems are often connected via a network to utilize services provided by other systems, while services and network performance may change during operation. We often need to change the specification and implementation of a system due to changes in environments and users' requirements. In addition, threats caused by viruses and unauthorized access have to be properly removed. Therefore, modern computer systems inherently involve incompleteness of specifications and implementations and uncertainty of environments and requirements.

【Keywords】: Open systems; Unified modeling language; Conferences; Uncertainty; Computers; Computational modeling; Software

58. Third workshop on proactive failure avoidance, recovery, and maintenance (PFARM).

Paper Link】 【Pages】:595-596

【Authors】: Miroslaw Malek ; Felix Salfner ; Kishor S. Trivedi

【Abstract】: Over the last decade, research on dependable computing has undergone a shift from reactive towards proactive methods: in classical fault tolerance a system reacts to errors or component failures in order to prevent them from turning into system failures, and maintenance follows fixed, time-based plans. However, due to ever-increasing system complexity, the use of commercial off-the-shelf components, virtualization, ongoing system patches and updates, and dynamicity, such approaches have become difficult to apply. Therefore, a new area in dependability research has emerged focusing on proactive approaches that start acting before a problem arises in order to increase time-to-failure and/or reduce time-to-repair. These techniques frequently build on the anticipation of upcoming problems based on runtime monitoring. Industry and academia use several terms for such techniques, each focusing on different aspects, including self-* computing, autonomic computing, proactive fault management, trustworthy computing, software rejuvenation, and preventive/proactive maintenance.

【Keywords】: Conferences; Maintenance engineering; Games; Monitoring; Runtime; USA Councils; Availability

59. 5th international workshop on adaptive and dependable mobile ubiquitous systems ADAMUS 2011.

Paper Link】 【Pages】:597-598

【Authors】: Domenico Cotroneo ; Vincenzo De Florio

【Abstract】: The vision of mobile and ubiquitous systems is becoming a reality thanks to the recent advances in wireless communication and device miniaturization. However, the widespread industrial uptake of these systems is still hampered by the highly error-prone and heterogeneous mobile provisioning environments, which induce several impairments to normal operation. Thus, how to improve the dependability of these systems remains an open issue.

【Keywords】: Mobile communication; Sensors; Smart phones; Ad hoc networks; Adaptive systems; Collaboration; Conferences