mpm + availability   84

Building a Large-scale Distributed Storage System Based on Raft
In this article, I’d like to share some of our firsthand experience in designing a large-scale distributed storage system based on the Raft consensus algorithm.
storage  scaling  availability  consistency 
7 weeks ago by mpm
The Amazon Builders' Library
The Amazon Builders’ Library is a collection of living articles that describe how Amazon develops, architects, releases, and operates technology
architecture  queuing  availability  load-balancing 
11 weeks ago by mpm
Designing resilient systems: Circuit Breakers or Retries?
The third option is of course to adopt both circuit breaker and retry mechanisms.
availability  rpc 
january 2019 by mpm
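A minimal sketch of the "adopt both" option, assuming a consecutive-failure threshold and a fixed reset timeout. The names `CircuitBreaker`, `CircuitOpenError`, and `call_with_retry` are hypothetical and are not taken from the linked article.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without running."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def call_with_retry(breaker, fn, attempts=3):
    """Retry through the breaker; stop at once when the breaker is open."""
    last_error = None
    for _ in range(attempts):
        try:
            return breaker.call(fn)
        except CircuitOpenError:
            raise  # retrying an open circuit only adds load
        except Exception as exc:
            last_error = exc
    raise last_error
```

Routing the retries through the breaker means a failing dependency stops receiving retry traffic once the threshold is reached.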
Improving Cloud Service Resilience using Brownout-Aware Load-Balancing
In this paper we propose two novel brownout-aware load-balancing algorithms. To test their practical applicability, we extended the popular lighttpd web server and load-balancer, thus obtaining a production-ready implementation. Experimental evaluation shows that the approach enables cloud services to remain responsive despite cascading failures. Moreover, when compared to Shortest Queue First (SQF), believed to be near-optimal in the non-adaptive case, our algorithms improve user experience by 5%, with high statistical significance, while preserving response time predictability.
availability  resilience 
november 2018 by mpm
Brownout: Building More Robust Cloud Applications
In this paper, we introduce a self-adaptation programming paradigm called brownout. Using this paradigm, applications can be designed to robustly withstand unpredictable runtime variations, without over-provisioning. The paradigm is based on optional code that can be dynamically deactivated through decisions based on control theory.
availability  resilience 
november 2018 by mpm
Defining SLOs for services with dependencies
In this episode, we discuss how to define and manage SLOs for services with dependencies, each of which may (or may not!) have their own SLOs.
availability 
july 2018 by mpm
Performance Under Load
Enforcing concurrency limits is nothing new; the hard part is figuring out this limit in a large dynamic distributed system where concurrency and latency characteristics are constantly changing. The main purpose of our solution is to dynamically identify this concurrency limit.
concurrency  availability 
march 2018 by mpm
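One simple way to adjust a concurrency limit dynamically is additive-increase/multiplicative-decrease (AIMD), sketched below. This is an illustrative assumption, not the algorithm from the linked post; the class name `AIMDLimit`, the latency target, and the backoff ratio are hypothetical.

```python
class AIMDLimit:
    """Grow the limit by one on healthy samples, shrink it by a ratio on bad ones."""

    def __init__(self, initial=10, min_limit=1, max_limit=100,
                 backoff=0.9, latency_target=0.1):
        self.limit = initial
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.backoff = backoff                # multiplicative decrease factor
        self.latency_target = latency_target  # seconds; assumed acceptable latency

    def on_sample(self, latency, dropped=False):
        """Update the limit from one completed request and return the new limit."""
        if dropped or latency > self.latency_target:
            self.limit = max(self.min_limit, int(self.limit * self.backoff))
        else:
            self.limit = min(self.max_limit, self.limit + 1)
        return self.limit
```

A caller would admit a request only while the number of in-flight requests is below `limit`, and feed each completion back through `on_sample`.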
Just-Right Consistency: reconciling availability and safety
By the CAP Theorem, a distributed data storage system can ensure either Consistency under Partition (CP) or Availability under Partition (AP), but not both. This has led to a split between CP databases, in which updates are synchronous, and AP databases, where they are asynchronous. However, there is no inherent reason to treat all updates identically: simply, the system should be as available as possible, and synchronised just enough for the application to be correct. We offer a principled Just-Right Consistency approach to designing such applications, reconciling correctness with availability and performance, based on the following insights: (i) The Conflict-Free Replicated Data Type (CRDTs) data model supports asynchronous updates in an intuitive and principled way. (ii) Invariants involving joint or mutually-ordered updates are compatible with AP and can be guaranteed by Transactional Causal Consistency, the strongest consistency model that does not compromise availability. Regarding the remaining, "CAP-sensitive" invariants: (iii) For the common pattern of Bounded Counters, we provide an encapsulated data type that is proven correct and is efficient; (iv) in the general case, static analysis can identify when synchronisation is not necessary for correctness. Our Antidote cloud database system supports CRDTs, Transactional Causal Consistency and the Bounded Counter data type. Support tools help design applications by static analysis and proof of CAP-sensitive invariants. This system supports industrial-grade applications and has been tested experimentally with hundreds of servers across several geo-distributed data centres.
consistency  availability 
february 2018 by mpm
How Complex Systems Fail
Being a Short Treatise on the Nature of Failure; How Failure is Evaluated; How Failure is Attributed to Proximate Cause; and the Resulting New Understanding of Patient Safety
availability  systems 
september 2017 by mpm
EC-Cache: Load-balanced, Low-latency Cluster Caching with Online Erasure Coding
EC-Cache is a load-balanced, low latency cluster cache that uses online erasure coding to overcome the limitations of selective replication. EC-Cache employs erasure coding by: (i) splitting and erasure coding individual objects during writes, and (ii) late binding, wherein obtaining any k out of (k+r) splits of an object are sufficient, during reads. As compared to selective replication, EC-Cache improves load balancing by more than 3x and reduces the median and tail read latencies by more than 2x for typical parameters, while using the same amount of memory. EC-Cache does so using 10% additional bandwidth and a small increase in the amount of stored metadata. The benefits offered by EC-Cache are further amplified in the presence of background network load imbalance and server failures.
storage  availability  caching 
august 2017 by mpm
In search of a simple consensus algorithm
In this post: (1) covered an availability limitation of the Raft protocol (2) demonstrated that modern implementations of Raft are subject to it (3) described an existing simpler approach to the problem of consensus (4) showed that its toy 500-lines implementation has performance similar to Etcd but doesn't suffer from Raft's performance penalty
consensus  paxos  availability  actors 
april 2017 by mpm
dangsan
DangSan instruments programs written in C or C++ to invalidate pointers whenever a block of memory is freed, preventing dangling pointers. Instead, whenever such a pointer is dereferenced, it refers to unmapped memory and results in a crash. As a consequence, attackers can no longer exploit dangling pointers.
c++  memory  availability 
april 2017 by mpm
CORDS
File-system fault injection framework for distributed storage systems
storage  testing  availability 
march 2017 by mpm
eventuate
Eventuate is a toolkit for building applications composed of event-driven and event-sourced services that collaborate by exchanging events over shared event logs. Services can either be co-located on a single node or distributed up to global scale. Services can also be replicated with causal consistency and remain available for writes during network partitions
crdt  availability  event 
september 2016 by mpm
Putting Consistency Back into Eventual Consistency
Geo-replicated storage systems are at the core of current Internet services. The designers of the replication protocols used by these systems must choose between either supporting low-latency, eventually-consistent operations, or ensuring strong consistency to ease application correctness. We propose an alternative consistency model, Explicit Consistency, that strengthens eventual consistency with a guarantee to preserve specific invariants defined by the applications. Given these application-specific invariants, a system that supports Explicit Consistency identifies which operations would be unsafe under concurrent execution, and allows programmers to select either violation-avoidance or invariant-repair techniques. We show how to achieve the former, while allowing operations to complete locally in the common case, by relying on a reservation system that moves coordination off the critical path of operation execution. The latter, in turn, allows operations to execute without restriction, and restore invariants by applying a repair operation to the database state. We present the design and evaluation of Indigo, a middleware that provides Explicit Consistency on top of a causally-consistent data store. Indigo guarantees strong application invariants while providing similar latency to an eventually-consistent system in the common case
consistency  availability 
may 2015 by mpm
Algorithms for Replica Placement in High-Availability Storage
A new model of causal failure is presented, and used to solve a novel replica placement problem in data centers. The model describes dependencies among system components as a directed graph. A replica placement is defined as a subset of vertices in such a graph. A criterion for optimizing replica placements is formalized and explained. In this work, the optimization goal is to avoid choosing placements in which a single event is likely to wipe out multiple replicas. Using this criterion, a fast algorithm is given for the scenario in which the dependency model is a tree. The main contribution of the paper is an O(n + ρ log ρ) dynamic programming algorithm for placing ρ replicas on a tree with n vertices. This algorithm exhibits the interesting property that only two subproblems need to be recursively considered at each stage. An O(n²ρ) greedy algorithm is also briefly reported.
availability  storage 
march 2015 by mpm
Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency
Windows Azure Storage (WAS) is a cloud storage system that provides customers the ability to store seemingly limitless amounts of data for any duration of time. WAS customers have access to their data from anywhere at any time and only pay for what they use and store. In WAS, data is stored durably using both local and geographic replication to facilitate disaster recovery. Currently, WAS storage comes in the form of Blobs (files), Tables (structured storage), and Queues (message delivery). In this paper, we describe the WAS architecture, global namespace, and data model, as well as its resource provisioning, load balancing, and replication systems.
consistency  availability  storage 
march 2014 by mpm
Copysets and Chainsets: A Better Way to Replicate
The traditional technique for performing such partitioning and replication is to randomly assign data to replicas. Although such random assignment is relatively easy to implement, it suffers from a fatal drawback: as cluster size grows, it becomes almost guaranteed that a failure of a small percentage of the cluster will lead to permanent data loss.
database  replication  availability 
february 2014 by mpm
Reliability Models for Highly Fault-tolerant Storage Systems
We found that a reliability model commonly used to estimate Mean-Time-To-Data-Loss (MTTDL), while suitable for modeling RAID 0 and RAID 5, fails to accurately model systems having a fault-tolerance greater than 1. Therefore, to model the reliability of RAID 6, Triple-Replication, or k-of-n systems requires an alternate technique. In this paper, we explore some alternatives, and evaluate their efficacy by comparing their predictions to simulations. Our main result is a new formula which more accurately models storage system reliability
reliability  statistics  availability 
october 2013 by mpm
Non-Monotonic Snapshot Isolation
Many distributed applications require transactions. However, transactional protocols that require strong synchronization are costly in large scale environments. Two properties help with scalability of a transactional system: genuine partial replication (GPR), which leverages the intrinsic parallelism of a workload, and snapshot isolation (SI), which decreases the need for synchronization. We show that, under standard assumptions (data store accesses are not known in advance, and transactions may access arbitrary objects in the data store), it is impossible to have both SI and GPR. To circumvent this impossibility, we propose a weaker consistency criterion, called Non-monotonic Snapshot Isolation (NMSI). NMSI retains the most important properties of SI, i.e., read-only transactions always commit, and two write-conflicting updates do not both commit. We present a GPR protocol that ensures NMSI, and has lower message cost (i.e., it contacts fewer replicas and/or commits faster) than previous approaches.
consistency  availability  database 
july 2013 by mpm
XORing Elephants: Novel Erasure Codes for Big Data
Distributed storage systems for large clusters typically use replication to provide reliability. Recently, erasure codes have been used to reduce the large storage overhead of three-replicated systems. Reed-Solomon codes are the standard design choice and their high repair cost is often considered an unavoidable price to pay for high storage efficiency and high reliability. This paper shows how to overcome this limitation. We present a novel family of erasure codes that are efficiently repairable and offer higher reliability compared to Reed-Solomon codes. We show analytically that our codes are optimal on a recently identified tradeoff between locality and minimum distance. We implement our new codes in Hadoop HDFS and compare to a currently deployed HDFS module that uses Reed-Solomon codes. Our modified HDFS implementation shows a reduction of approximately 2x on the repair disk I/O and repair network traffic. The disadvantage of the new coding scheme is that it requires 14% more storage compared to Reed-Solomon codes, an overhead shown to be information theoretically optimal to obtain locality. Because the new codes repair failures faster, this provides higher reliability, which is orders of magnitude higher compared to replication
storage  data  performance  availability 
july 2013 by mpm
The network is reliable
much of what we know about the failure modes of real-world distributed systems is founded on guesswork and rumor. Sysadmins and developers will swap stories over beers, but detailed, public postmortems and comprehensive surveys of network availability are few and far between. In this post, we’d like to bring a few of these stories together. We believe this is a first step towards a more open and honest discussion of real-world partition behavior, and, ultimately, more robust distributed systems design
networking  availability  consistency  reliability  fault-tolerance  outage 
june 2013 by mpm
Viewstamped Replication Revisited
This paper presents an updated version of Viewstamped Replication, a replication technique that handles failures in which nodes crash. It describes how client requests are handled, how the group reorganizes when a replica fails, and how a failed replica is able to rejoin the group. The paper also describes a number of important optimizations and presents a protocol for handling reconfigurations that can change both the group membership and the number of failures the group is able to handle.
consensus  consistency  fault-tolerance  availability 
june 2013 by mpm
The CAP FAQ
The purpose of this FAQ is to explain what is known about CAP, so as to help those new to the theorem get up to speed quickly, and to settle some common misconceptions or points of disagreement
availability  consistency  fault-tolerance 
may 2013 by mpm
Dynamic Reconfiguration of Primary/Backup Clusters
Dynamically changing (reconfiguring) the membership of a replicated distributed system while preserving data consistency and system availability is a challenging problem. In this paper, we show that reconfiguration can be simplified by taking advantage of certain properties commonly provided by Primary/Backup systems. We describe a new reconfiguration protocol, recently implemented in Apache Zookeeper. It fully automates configuration changes and minimizes any interruption in service to clients while maintaining data consistency. By leveraging the properties already provided by Zookeeper our protocol is considerably simpler than state of the art.
zookeeper  consistency  availability 
march 2013 by mpm
HAT, not CAP: Highly Available Transactions
To provide high availability, many scalable data stores abandon traditional database functionality, often offering operations limited to single objects (or groups of co-located objects) with limited consistency. However, many applications benefit from transactions, or larger units of arbitrary combinations of multiple operations on multiple objects. While the CAP theorem is often interpreted to preclude the availability of transactions in a partition-prone environment, we show that highly available systems can provide transactional guarantees matching the majority of today's ACID databases. We propose Highly Available Transactions (HATs) that support many desirable semantic guarantees for arbitrary transactional sequences of read and write operations, execute with low latency, and remain available during partitions
availability  fault-tolerance 
february 2013 by mpm
Postmortem Porn
I've collected over a hundred outage and security related postmortems in this Pinboard feed.

There's no shortage of human-error examples in the collection. But better, there are many interesting (sometimes gripping) stories ranging from monitoring loops gone wild to freak hardware incidents to creeping issues undetectable in testing. Even an FBI raid.
outage  availability 
february 2013 by mpm
Amateur Hour at Github
When the network froze, many of our fileservers which are intentionally located in different racks for redundancy, exceeded their heartbeat timeouts and decided that they needed to take control of the fileserver resources. They issued STONITH commands to their partner nodes and attempted to take control of resources, however some of those commands were not delivered due to the compromised network. When the network recovered and the cluster messaging between nodes came back, a number of pairs were in a state where both nodes expected to be active for the same resource. This resulted in a race where the nodes terminated one another and we wound up with both nodes stopped for a number of our fileserver pairs
availability  outage 
january 2013 by mpm
Hystrix for Resilience Engineering
In a distributed environment, failure of any given service is inevitable. Hystrix is a library designed to control the interactions between these distributed services providing greater tolerance of latency and failure. Hystrix does this by isolating points of access between the services, stopping cascading failures across them, and providing fallback options, all of which improve the system's overall resiliency.
fault-tolerance  reliability  availability 
december 2012 by mpm
On Transaction Liveness in Replicated Databases
This paper makes a first attempt to give a precise characterisation of liveness in replicated database systems
consistency  availability 
november 2012 by mpm
Perses: Data Layout for Low Impact Failures
PERSES reduces the length of degradation from the reference frame of the user by clustering data on disks such that working sets are kept together as much as possible. During a device failure, this co-location reduces the number of impacted working sets. PERSES uses statistical properties of data accesses to automatically determine which data to co-locate, avoiding extra administrative overhead
data  storage  availability  fault-removal  mttr 
october 2012 by mpm
Chain Replication in Theory and in Practice
This paper is a case study of the implementation of the chain replication protocol in a distributed key-value store called Hibari. In theory, the chain replication algorithm is quite simple and should be straightforward to implement correctly. In practice, however, there were many implementation details that had effects both profound and subtle.
availability  replication 
july 2012 by mpm
Chain Replication for Supporting High Throughput and Availability
Chain replication is a new approach to coordinating clusters of fail-stop storage servers. The approach is intended for supporting large-scale storage services that exhibit high throughput and availability without sacrificing strong consistency guarantees
distributed  availability  replication 
july 2012 by mpm
Using lightweight modeling to understand chord
Correctness of the Chord ring-maintenance protocol would mean that the protocol can eventually repair all disruptions in the ring structure, given ample time and no further disruptions while it is working. In other words, it is "eventual reachability." Under the same assumptions about failure behavior as made in the Chord papers, no published version of Chord is correct
dht  availability  testing 
june 2012 by mpm
Berkeley Cloud Seminar
Presentations on various distributed & cloudy topics
distributed  availability  fault-tolerance  consistency 
april 2012 by mpm
Using Paxos to Build a Scalable, Consistent, and Highly Available Datastore
Spinnaker is an experimental datastore that is designed to run on a large cluster of commodity servers in a single datacenter. It features key-based range partitioning, 3-way replication, and a transactional get-put API with the option to choose either strong or timeline consistency on reads
paxos  consistency  availability 
april 2012 by mpm
The SMART way to migrate replicated stateful services
This paper describes SMART, a new technique for changing the set of machines where such a service runs, i.e., migrating the service
fault-tolerance  availability  consensus  consistency 
march 2012 by mpm
Efficient Replica Maintenance for Distributed Storage Systems
This paper considers replication strategies for storage systems that aggregate the disks of many nodes spread over the Internet. Maintaining replication in such systems can be prohibitively expensive, since every transient network or host failure could potentially lead to copying a server's worth of data over the Internet to maintain replication levels.
storage  availability  fault-tolerance 
march 2012 by mpm
Lower Bounds for Asynchronous Consensus
Impossibility results and best-case lower bounds are proved for the number of message delays and the number of processes required to reach agreement in an asynchronous consensus algorithm that tolerates non-Byzantine failure
consensus  availability  consistency  distributed  fault-tolerance 
march 2012 by mpm
Consistency, Availability, and Convergence
We examine the limits of consistency in fault-tolerant distributed storage systems. In particular, we identify fundamental tradeoffs among properties of consistency, availability, and convergence, and we close the gap between what is known to be impossible (i.e. CAP) and known systems that are highly-available but that provide weaker consistency such as causal. Specifically, in the asynchronous model with omission-failures and unreliable networks, we show the following tight bound: No consistency stronger than Real Time Causal Consistency (RTC) can be provided in an always-available, one-way convergent system and RTC can be provided in an always-available, one-way convergent system. In the asynchronous, Byzantine-failure model, we show that it is impossible to implement many of the recently introduced fork-based consistency semantics without sacrificing either availability or convergence; notably, proposed systems allow Byzantine nodes to permanently partition correct nodes from one another. To address this limitation, we introduce bounded fork join causal semantics that extends causal consistency to Byzantine environments while retaining availability and convergence
consistency  availability  distributed 
february 2012 by mpm
Client Puzzles: A Cryptographic Countermeasure Against Connection Depletion Attacks
When a server comes under attack, it distributes small cryptographic puzzles to clients making service requests. To complete its request, a client must solve its puzzle correctly
availability  protocol  networking 
august 2011 by mpm
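The puzzle idea can be illustrated with a simplified hash-based proof of work. This is a sketch under stated assumptions, not the construction from the linked paper: SHA-256, the 16-byte challenge, and the function names `make_puzzle`, `solve`, and `verify` are all hypothetical choices.

```python
import hashlib
import os


def leading_zero_bits(digest):
    """Count the leading zero bits of a byte string."""
    value = int.from_bytes(digest, "big")
    return len(digest) * 8 - value.bit_length()


def make_puzzle(difficulty_bits=12):
    """Server side: issue a fresh random challenge and a difficulty."""
    return os.urandom(16), difficulty_bits


def verify(challenge, difficulty_bits, solution):
    """Server side: a single hash checks the client's work."""
    digest = hashlib.sha256(challenge + solution.to_bytes(8, "big")).digest()
    return leading_zero_bits(digest) >= difficulty_bits


def solve(challenge, difficulty_bits):
    """Client side: brute-force search, about 2**difficulty_bits hashes on average."""
    candidate = 0
    while not verify(challenge, difficulty_bits, candidate):
        candidate += 1
    return candidate
```

The asymmetry is the point: verifying costs the server one hash, while solving costs the client many, so the server can raise `difficulty_bits` when it is under attack.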
Why mobile apps suck when you're mobile (TCP over 3G)
There are some crazy-high round trip times. The minimum round trip time was 107ms (which would put my home cable connection to shame) and even the median is pretty awesome at 239ms but the maximum was a whopping 20226 ms
networking  availability 
june 2011 by mpm
Understanding TCP Incast Throughput Collapse in Datacenter Networks
TCP Throughput Collapse, also known as Incast, is a pathological behavior of TCP that results in gross under-utilization of link capacity in certain many-to-one communication patterns.
networking  availability  performance 
june 2011 by mpm
upright - Making distributed systems Up (available) and Right (correct)
UpRight is an infrastructure and library for building fault tolerant distributed systems. The goal is to provide a simple library that can ensure that systems remain up (available) and right (correct) despite faults
distributed  availability  reliability 
june 2011 by mpm
doozer
Doozer is a highly-available, completely consistent store for small amounts of extremely important data. When the data changes, it can notify connected clients immediately (no polling), making it ideal for infrequently-updated data for which clients want real-time updates. Doozer is good for name service, database master elections, and configuration data shared between several machines
distributed  coordination  consistency  consensus  availability  go 
april 2011 by mpm
Netflix’s Transition to High-Availability Storage Systems
This paper addresses Netflix’s transition to AWS SimpleDB and S3, examples of AP storage systems.
availability 
october 2010 by mpm
You Can't Sacrifice Partition Tolerance
Of the CAP theorem’s Consistency, Availability, and Partition Tolerance, Partition Tolerance is mandatory in distributed systems. You cannot not choose it. Instead of CAP, you should think about your availability in terms of yield (percent of requests answered successfully) and harvest (percent of required data actually included in the responses) and which of these two your system will sacrifice when failures happen
availability  distributed 
october 2010 by mpm
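Yield and harvest, as defined in the description above, are both simple ratios. The sketch below computes them as fractions; the function and parameter names are hypothetical.

```python
def yield_fraction(answered, offered):
    """Fraction of requests answered successfully."""
    if offered <= 0:
        raise ValueError("offered must be positive")
    return answered / offered


def harvest_fraction(returned, required):
    """Fraction of the required data actually included in the responses."""
    if required <= 0:
        raise ValueError("required must be positive")
    return returned / required


# Example: one of 100 data shards is unreachable.
# Option A: reject requests that need the missing shard (yield drops, harvest stays full).
# Option B: answer every request from the 99 reachable shards (yield stays full, harvest drops).
option_b = (yield_fraction(1000, 1000), harvest_fraction(99, 100))
```

Which of the two a system gives up during a failure is the design decision the description asks you to make explicit.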
Availability in Globally Distributed Storage Systems
We characterize the availability properties of cloud storage systems based on an extensive one year study of Google’s main storage infrastructure and present statistical models that enable further insight into the impact of multiple design choices, such as data placement and replication strategies.
availability  distributed  storage 
october 2010 by mpm
gimli
Gimli is a crash tracing/analysis framework
fault-tolerance  fault-removal  deployment  availability 
july 2010 by mpm
The ϕ Accrual Failure Detector
The particularity of the ϕ failure detector is that it dynamically adjusts to current network conditions the scale on which the suspicion level is expressed.
availability  failure-detector  fault-forecasting 
may 2010 by mpm
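A minimal sketch of a ϕ-style suspicion level, assuming heartbeat inter-arrival times are modelled as normally distributed and ϕ is computed as -log10 of the probability that a heartbeat arrives later than the time already elapsed. The class name, window size, and `min_std` floor are hypothetical.

```python
import math
from collections import deque
from statistics import mean, pstdev


class PhiAccrualDetector:
    def __init__(self, window=100, min_std=0.01):
        self.intervals = deque(maxlen=window)  # recent heartbeat inter-arrival times
        self.last_arrival = None
        self.min_std = min_std                 # floor to avoid division by zero

    def heartbeat(self, now):
        """Record a heartbeat received at time `now` (seconds)."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """Suspicion level at time `now`; higher means more likely failed."""
        if not self.intervals:
            return 0.0
        mu = mean(self.intervals)
        sigma = max(pstdev(self.intervals), self.min_std)
        elapsed = now - self.last_arrival
        # P(next heartbeat arrives later than `elapsed`) under Normal(mu, sigma).
        p_later = 0.5 * math.erfc((elapsed - mu) / (sigma * math.sqrt(2)))
        if p_later <= 0.0:
            return math.inf  # probability underflowed to zero
        return -math.log10(p_later)
```

Because the mean and deviation come from a sliding window of observed intervals, the same threshold on ϕ adapts to slow or jittery networks; each application picks its own threshold.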
Vuurmuur
Vuurmuur is a powerful firewall manager built on top of iptables on Linux
linux  availability  networking 
october 2009 by mpm
More on today's Gmail issue
At about 12:30 pm Pacific a few of the request routers became overloaded and in effect told the rest of the system "stop sending us traffic, we're too slow!". This transferred the load onto the remaining request routers, causing a few more of them to also become overloaded, and within minutes nearly all of the request routers were overloaded.
outage  availability 
september 2009 by mpm
Pure Load Balancer
Pure Load Balancer [PLB] is a high-performance software load balancer for the HTTP and SMTP protocols.
linux  unix  web  cluster  networking  availability 
july 2009 by mpm
Slowloris HTTP DoS
Slowloris holds connections open by sending partial HTTP requests. It continues to send subsequent headers at regular intervals to keep the sockets from closing. In this way webservers can be quickly tied up
apache  http  networking  availability 
july 2009 by mpm
Service Management Facility
Self-healing services are delivered and managed on Solaris with the Service Management Facility (smf(5)). smf(5) augments the existing init.d(4) and inetd(1M) startup mechanisms, promoting the service to a first-class operating system object.
unix  monitoring  maintainability  availability 
june 2009 by mpm
IPtables Examples
Simpler rulesets are at the start, with more complex scripts near the end
networking  linux  confidentiality  availability 
december 2008 by mpm
Building on Quicksand
Reliable systems have always been built out of unreliable components. Early on, the reliable components were small such as mirrored disks or ECC (Error Correcting Codes) in core memory. These systems were designed such that failures of these small components were transparent to the application. Later, the size of the unreliable components grew larger and semantic challenges crept into the application when failures occurred.
reliability  safety  availability  integrity 
december 2008 by mpm
Zero-Downtime Restarts with HAProxy
The challenge with doing a rolling restart is in the coordination between your application servers, and an upstream reverse-balancer
proxy  web  availability  deployment  maintainability 
december 2008 by mpm
Security Maxims
Engineers don’t understand security. They think nature is the adversary, not people. They tend to work in solution space, not problem space. They think systems fail stochastically, not through deliberate, intelligent, malicious intent.
availability  integrity 
september 2008 by mpm
OpenAIS
OSI Certified implementation of the Service Availability Forum Application Interface Specification (AIS).
reliability  networking  messaging  linux  availability  cluster 
september 2008 by mpm
Confidence in the Cloud
# Some Observations about Reliable Process Pairs, # Less Is More, # N-Version Programming, # Availability Over Consistency, # Eventual Consistency,
distributed  base  integrity  availability 
september 2008 by mpm
Perspectives - Degraded Operations Mode
all services should expect to be overloaded and all services should expect mass failures. Very few do and I see related down-time in the news every month or so.
reliability  availability  safety 
september 2008 by mpm
Availability Enlightenment
The path to enlightenment is often only visible after failure, if at all.
availability 
august 2008 by mpm
Modular Software Upgrades for Distributed Systems
We present a methodology and infrastructure that make it possible to upgrade distributed systems automatically while limiting service disruption
deployment  distributed  maintainability  availability 
august 2008 by mpm
Scalaris
scalable and fault-tolerant structured storage with strong data consistency for online databases or Web 2.0 services.
database  distributed  availability  dht  scalability 
july 2008 by mpm
Continuent Community
open source community portal dedicated to improving the availability and performance of databases and database applications
cluster  database  availability 
july 2008 by mpm
Article: Could we have saved the Death Star?
Had Darth Vader employed formal methods to the design of his Death Star, perhaps it would not have been vulnerable to the Starfighter attack that led to its destruction.
availability  safety 
july 2008 by mpm
Why DNS Based Global Server Load Balancing (GSLB) Doesn't Work
High-availability GSLB of general Internet browser based services is best accomplished by including the use of multiple A records, but the use of multiple A records debilitates DNS based global server 'load balancing'
dns  availability 
july 2008 by mpm
Security as a System-Level Constraint
The essence of system-level design is the need to concurrently consider information from multiple engineering domains across multiple subsystems to assess holistic system properties
integrity  confidentiality  availability 
june 2008 by mpm
Linux HA
Provide a high availability (clustering) solution for Linux which promotes reliability, availability, and serviceability (RAS) through a community development effort.
cluster  distributed  linux  availability  reliability  maintainability  scalability 
may 2008 by mpm
The Computer Failure Data Repository (CFDR)
This paper analyzes failure data recently made publicly available by one of the largest high-performance computing sites
reliability  availability  outage 
april 2008 by mpm
Defense in Depth, Reconsidered: Is Information Security Anything Like War?
Despite repeated assertion, I am dubious about the standing of “defense in depth” as a core principle for security design.
integrity  confidentiality  availability 
april 2008 by mpm
Recovery-Oriented Computing (ROC) Project
The Recovery-Oriented Computing (ROC) project is a joint Berkeley/Stanford research project that is investigating novel techniques for building highly-dependable Internet services.
reliability  availability  maintainability 
april 2008 by mpm
The Cactus Project
integrated design and implementation framework for supporting customizable dynamic fine-grain Quality of Service (QoS) attributes related to dependability, real time, and security in distributed systems
gcs  reliability  confidentiality  integrity  availability 
march 2008 by mpm
HAProxy
free, very fast and reliable solution offering high availability, load balancing, and proxying for TCP and HTTP-based applications
cluster  availability  http  maintainability  proxy 
march 2008 by mpm
Perlbal
Perlbal is our Perl-based reverse proxy load balancer and web server.
http  availability  maintainability 
march 2008 by mpm
The Linux Virtual Server Project
The Linux Virtual Server is a highly scalable and highly available server built on a cluster of real servers, with the load balancer running on the Linux operating system.
linux  cluster  availability  maintainability  scalability 
february 2008 by mpm
Neat tricks with iptables
The result of this research has been the ongoing creation of a firewall to protect my laptop against open networks, and my Internet server from port scanning and DoS attacks. I’m pretty certain I haven’t even scratched the surface yet, but I have found some settings to protect against the most common attacks. Below I’ll summarize the major pieces of my new firewall, and the logic behind it.
linux  networking  integrity  availability 
october 2007 by mpm
Thoughts on Threat Modeling...
Remember that threat modeling is an analysis tool. You threat model to identify threats to your component, which then lets you know where you need to concentrate your resources
confidentiality  integrity  availability 
october 2007 by mpm
Threat Modeling: Uncover Security Design Flaws Using The STRIDE Approach
systematic approach to threat modeling developed in the Security Engineering and Communications group at Microsoft
confidentiality  integrity  availability 
october 2007 by mpm