cqrs   2692


The Log: What every software engineer should know about real-time data's unifying abstraction | LinkedIn Engineering
Excellent; the end kinda looks like CQRS + Event Sourcing

"Deterministic means that the processing isn't timing dependent and doesn't let any other "out of band" input influence its results. For example a program whose output is influenced by the particular order of execution of threads or by a call to gettimeofday or some other non-repeatable thing is generally best considered as non-deterministic."

"The purpose of the log here is to squeeze all the non-determinism out of the input stream to ensure that each replica processing this input stays in sync."

"One of the beautiful things about this approach is that the time stamps that index the log now act as the clock for the state of the replicas—you can describe each replica by a single number, the timestamp for the maximum log entry it has processed. This timestamp combined with the log uniquely captures the entire state of the replica."

"The "state machine model" usually refers to an active-active model where we keep a log of the incoming requests and each replica processes each request. A slight modification of this, called the "primary-backup model", is to elect one replica as the leader and allow this leader to process requests in the order they arrive and log out the changes to its state from processing the requests."

"the usefulness of the log comes from simple function that the log provides: producing a persistent, re-playable record of history. Surprisingly, at the core of these problems is the ability to have many machines playback history at their own rate in a deterministic manner."

"Let's say we write a record with log entry X and then need to do a read from the cache. If we want to guarantee we don't see stale data, we just need to ensure we don't read from any cache which has not replicated up to X."

"You can think of the log as acting as a kind of messaging system with durability guarantees and strong ordering semantics. In distributed systems, this model of communication sometimes goes by the (somewhat terrible) name of atomic broadcast."

"It's worth emphasizing that the log is still just the infrastructure. That isn't the end of the story of mastering data flow: the rest of the story is around metadata, schemas, compatibility, and all the details of handling data structure and evolution. But until there is a reliable, general way of handling the mechanics of data flow, the semantic details are secondary."

"The idea is that adding a new data system—be it a data source or a data destination—should create integration work only to connect it to a single pipeline instead of each consumer of data."

"The key problem for a data-centric organization is coupling the clean integrated data to the data warehouse. A data warehouse is a piece of batch query infrastructure which is well suited to many kinds of reporting and ad hoc analysis, particularly when the queries involve simple counting, aggregation, and filtering. But having a batch system be the only repository of clean complete data means the data is unavailable for systems requiring a real-time feed—real-time processing, search indexing, monitoring systems, etc."

"At LinkedIn, we have built our event data handling in a log-centric fashion. We are using Kafka as the central, multi-subscriber event log. We have defined several hundred event types, each capturing the unique attributes about a particular type of action. This covers everything from page views, ad impressions, and searches, to service invocations and application exceptions."

"The "event-driven" style provides an approach to simplifying this. The job display page now just shows a job and records the fact that a job was shown along with the relevant attributes of the job, the viewer, and any other useful facts about the display of the job. Each of the other interested systems—the recommendation system, the security system, the job poster analytics system, and the data warehouse—all just subscribe to the feed and do their processing. The display code need not be aware of these other systems, and needn't be changed if a new data consumer is added."

"Lack of a global order across partitions is a limitation, but we have not found it to be a major one. Indeed, interaction with the log typically comes from hundreds or thousands of distinct processes so it is not meaningful to talk about a total order over their behavior. Instead, the guarantees that we provide are that each partition is order preserving, and Kafka guarantees that appends to a particular partition from a single sender will be delivered in the order they are sent."

"The cumulative effect of these optimizations is that you can usually write and read data at the rate supported by the disk or network, even while maintaining data sets that vastly exceed memory."

"Even in the presence of a healthy batch processing ecosystem, I think the actual applicability of stream processing as an infrastructure style is quite broad. I think it covers the gap in infrastructure between real-time request/response services and offline batch processing. For modern internet companies, I think around 25% of their code falls into this category."

" First, it makes each dataset multi-subscriber and ordered. Recall our "state replication" principle to remember the importance of order. To make this more concrete, consider a stream of updates from a database—if we re-order two updates to the same record in our processing we may produce the wrong final output. This order is more permanent than what is provided by something like TCP as it is not limited to a single point-to-point link and survives beyond process failures and reconnections.

Second, the log provides buffering to the processes. This is very fundamental. If processing proceeds in an unsynchronized fashion it is likely to happen that an upstream data producing job will produce data more quickly than another downstream job can consume it. When this occurs processing must block, buffer, or drop data. Dropping data is likely not an option; blocking may cause the entire processing graph to grind to a halt. The log acts as a very, very large buffer that allows a process to be restarted or fail without slowing down other parts of the processing graph. This isolation is particularly important when extending this data flow to a larger organization, where processing is done by jobs built by many different teams. We cannot have one faulty job cause back-pressure that stops the entire processing flow.

Both Storm and Samza are built in this fashion and can use Kafka or other similar systems as their log."
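A minimal sketch of the buffering argument (hypothetical names): the producer appends without ever blocking, and each consumer advances its own offset, so a slow or restarted job lags without back-pressuring the rest of the graph.

```python
# Sketch: the log as a very large buffer with independent consumer offsets.
log = []
offsets = {"fast_job": 0, "slow_job": 0}

def produce(record):
    log.append(record)              # never blocks on consumers

def consume(consumer, max_records):
    start = offsets[consumer]
    batch = log[start:start + max_records]
    offsets[consumer] = start + len(batch)
    return batch

for i in range(1000):
    produce(i)

consume("fast_job", 1000)           # caught up
consume("slow_job", 10)             # far behind, but blocks nothing
assert offsets == {"fast_job": 1000, "slow_job": 10}
```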

"A stream processor can keep it's state in a local "table" or "index"—a bdb, leveldb, or even something more unusual such as a Lucene or fastbit index. The contents of this this store is fed from its input streams (after first perhaps applying arbitrary transformation). It can journal out a changelog for this local index it keeps to allow it to restore its state in the event of a crash and restart. This mechanism allows a generic mechanism for keeping co-partitioned state in arbitrary index types local with the incoming stream data."

"When the process fails, it restores its index from the changelog. The log is the transformation of the local state into a sort of incremental record at a time backup."

"However, retaining the complete log will use more and more space as time goes by, and the replay will take longer and longer. Hence, in Kafka, we support a different type of retention. Instead of simply throwing away the old log, we remove obsolete records—i.e. records whose primary key has a more recent update. By doing this, we still guarantee that the log contains a complete backup of the source system, but now we can no longer recreate all previous states of the source system, only the more recent ones. We call this feature log compaction."

"The client can get read-your-write semantics from any node by providing the timestamp of a write as part of its query—a serving node receiving such a query will compare the desired timestamp to its own index point and if necessary delay the request until it has indexed up to at least that time to avoid serving stale data."

"One of the trickier things a distributed system must do is handle restoring failed nodes or moving partitions from node to node. A typical approach would have the log retain only a fixed window of data and combine this with a snapshot of the data stored in the partition. It is equally possible for the log to retain a complete copy of data and garbage collect the log itself. This moves a significant amount of complexity out of the serving layer, which is system-specific, and into the log, which can be general purpose"

"I find this view of systems as factored into a log and query api to very revealing, as it lets you separate the query characteristics from the availability and consistency aspects of the system. I actually think this is even a useful way to mentally factor a system that isn't built this way to better understand it."
DistributedSystems  Scalability  Logs  architecture  events  Kinesis  Kafka  Samza  Messaging  EventSourcing  CQRS 
5 days ago by colin.jack
[no title]
Presentation on Event Sourcing in production
events  sourcing  ruby  cqrs  presentation  gotconf 
13 days ago by kjeldahl
GOTO 2014 • Event Sourcing • Greg Young - YouTube
An iconoclastic talk on event sourcing by Greg Young (who coined the term CQRS). Clears up ambiguity around DDD and object orientation.
coink  event  sourcing  cqrs 
16 days ago by paunit
Developing microservices with aggregates - Chris Richardson - YouTube
A good talk on microservices and the event-sourcing style of consistency, following the CQRS pattern. Several mentions of DDD as the functional-decomposition style, with events as first-class citizens.
coink  ddd  event  sourcing  microservice  cqrs 
16 days ago by paunit
When Microservices Meet Event Sourcing - YouTube
Talk presented at O'Reilly Software Architecture Conference 2017 in New York City.
coink  microservice  event  sourcing  cqrs 
16 days ago by paunit
Twitter
Commander: Better Distributed Applications through CQRS and Event Sourcing
CQRS  from twitter_favs
25 days ago by pdudits
