White Paper: Cheese on your Cheeseburger? Redundancy and Efficiencies in a Collaborative Real-Time Machine Learning System

Monir Hossain, Alireza Goli, Omid Ardakanian* and Hamzeh Khazaei
Department of Electrical and Computer Engineering
*Department of Computing Science
University of Alberta

Gunjan Kaur and Dmitriy Volinskiy
AI Lab
ATB Financial
Edmonton, Canada

Abstract

In this paper we describe the design of a collaborative real-time machine learning system and discuss preliminary results and experience gained from two redundancy-themed design studies conducted in the context of a financial institution. A rationale is provided for the chosen approaches, alternatives are discussed, and directions for future system development are outlined.

I. Introduction

Redundancy in the software and information domain used to be mostly about a system's fault tolerance. The increasing availability and affordability of cloud computing has changed many design paradigms and led to the proliferation of software systems running on cloud providers' "everything-as-a-service" platforms. Such systems can no longer be vertically architected and centrally managed. On one hand, this is nothing short of a dependability bliss as, arguably, design diversity becomes both deeply intrinsic and multi-faceted. On the other hand, the typically loose coupling of system components creates new challenges: the state the system is in, the total number of its states, as well as many other metrics turn from a quantity uncertain into a quantity unknown. Most importantly, the very meaning of certain key redundancy notions may change, as the system is now a loose agglomeration of modules built in-house, cloud services, APIs, distributed storage, and so on.

At the center of this short paper is the design of a real-time machine learning system which we dub "collaborative". Similar to how a social network enables its participants to interact in multiple planes, components of our system listen and/or publish to a variety of topics on a high-throughput pub/sub messaging bus. This leads to a de facto complete de-coupling, as any given component has no inherent knowledge regarding the existence, kind, or state of any other component except for the messaging bus. There is no coordination nor any direct information flow between any set of components. A key design feature this seemingly primordial architecture yields is the ease with which an author can connect a component of arbitrary design to the system, and the way the system naturally containerizes its connected software. The system will thus grow, optimize itself, and develop more functionality not according to a centrally provided blueprint, but as a result of the collaboration of multiple contributing authors, hence the "collaborative" piece of the nomenclature. As it is highly likely that multiple authors will supply multiple equivalent solutions, redundancy comes to the forefront.

The paper also covers two mini case studies: one related to software redundancy and the second discussing certain aspects of information redundancy. On the software redundancy front, we consider a case of two databases of very different kinds storing and retrieving identical information from the messaging bus. Not only do we comment on the setup needed to achieve this, given that the information requestor knows nothing about the existence, nature, and query syntax of the databases, we also consider how one can piggyback off this redundancy to handle data requests intelligently given the requestor's preference for either low latency or data consistency. The information redundancy that the second study deals with arises because our real-time system has no facility to synchronize or dispatch data flows in a particular way. Uncurated data get released into the system the moment the information becomes available, which may lead to it appearing with delays or in bursts. This may wreak havoc on machine learning models deployed at the orchestration level of the system: the models use time series data summarized over various time windows and have no way of telling an artifact due to a delay or burst from a meaningful change in the data-generating process. A technique we consider to remediate this employs streams of artificial, predicted data which are regularized and blended with the real data whenever the handler detects an irregularity in the respective data stream.

Concluding the paper, we offer some musings on the nature of cheese on a cheeseburger, the efficiency of collaborative content creation, and other non-technical aspects which, curiously, are quite important given the nature of our endeavor.

II. Collaborative Real-Time Machine Learning System

The present section outlines the general design of the collaborative real-time machine learning system (to be referred to as CMLS).

A. System Layers

Conceptually, CMLS is based on the two-layer architecture shown in Fig. 1.

Fig. 1. CMLS two-layer design.
  • Underneath it all, there is a fast Systems layer. The layer uses a message-oriented middleware (MOM) software infrastructure which allows the various components and modules of the system to exchange information while placing virtually no restrictions on how and where those components are distributed, what software or platforms they use, and so on. The Systems layer is thus designed with the software engineer in mind: the objective is to reduce or, where possible, eliminate any coupling of software components, and to make the system as a whole and its components independent of the hardware it runs on and of cloud or operating system characteristics. A key advantage of this design is, as is usual with MOM infrastructure, the asynchronicity of the system. Due to certain requirements and peculiarities of CMLS's target application, we need to allow for a variety of unrelated external sending systems which feed raw data, i.e., events, into a messaging bus. While it is generally possible to use time stamps to organize some of the data flows, synchronizing their arrival, transformation, and storage would result in a tremendous computational overhead and a patchwork of data lake dependencies.
  • Unlike the Systems layer, which prioritizes the flexibility of software engineering and integration, the Orchestration layer is the Data Scientist's lot. It is the business layer from the perspective of model development and deployment. In its future, ready state, the CMLS Orchestration layer is as non-specific as any interactive computational environment in which the Data Scientist can create their model's code. Data are provided to models as a service (DaaS): since the number, kind, and state of the systems sending data to the messaging queue, transforming those data, and storing them are unknown, no knowledge can exist in the Orchestration layer as to the structure of the underlying data. For example, where a data set needed to train a model is conventionally obtained by querying a known repository (database, schema, table, cube) in its native query language or by other means of lookup, a model process in the CMLS Orchestration layer posts a request, "provide me with this particular kind of data", to one topic and then retrieves the results, if any, from another (a minimal sketch of this pattern follows the list). And it is not only DaaS; more generally, the layer provides service orchestration for its Data Science users, hence the name of the layer.
  • Connectors allow for the transfer of control and the movement of data flows between the layers. From the vantage point of the Orchestration layer, the connectors are libraries that can be loaded by the model code at run time. The libraries provide access to services, exception and error handling, and so on. An important function of connectors is also to provide the means for deploying a model: as the user directs the system to deploy a production-ready model, a connector piece should package the model appropriately, create a new component listening to the messaging bus, and move the model there.
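To make the DaaS pattern above concrete, the following is a minimal sketch assuming the google-cloud-pubsub Python client; the project id, topic and subscription names, the JSON request format, and the function names are illustrative assumptions rather than part of the PoC.

```python
# Hypothetical sketch of the DaaS request/response pattern over Pub/Sub.
import json
from concurrent import futures
from google.cloud import pubsub_v1

PROJECT = "my-gcp-project"                 # placeholder project id
REQUEST_TOPIC = "data-requests"            # hypothetical "request" topic
RESULT_SUBSCRIPTION = "data-results-sub"   # hypothetical subscription on the "results" topic

publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()

def request_data(kind: str, window: str) -> None:
    """Post a 'provide me with this kind of data' request to the request topic."""
    payload = json.dumps({"kind": kind, "window": window}).encode("utf-8")
    publisher.publish(publisher.topic_path(PROJECT, REQUEST_TOPIC), payload)

def on_result(message) -> None:
    """Consume query results as they appear on the results topic."""
    print("received:", json.loads(message.data))
    message.ack()

# The model code never queries a database directly; its only dependency is the bus.
request_data(kind="hourly_event_counts", window="1h")
future = subscriber.subscribe(
    subscriber.subscription_path(PROJECT, RESULT_SUBSCRIPTION), callback=on_result
)
try:
    future.result(timeout=30)   # a real component would keep the pull open indefinitely
except futures.TimeoutError:
    future.cancel()
```

The point of the sketch is only that the model process addresses no database or component directly; everything it knows about the rest of the system is mediated by the messaging bus.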

B. De-coupling of data transformation and storage

In the preceding paragraphs we gave a rather non-specific, cursory overview of the system architecture. To understand the two case studies that follow, one needs to take a closer look at parts of the proof-of-concept (PoC) available at the moment of writing. The PoC implements a significant part of the system; all development and integration work was done in the Google Cloud Platform (GCP) using solely GCP's managed services.

Fig. 2. De-coupling of data transformation and storage.

Fig. 2 illustrates a part of the PoC build, showing how a derived feature gets calculated and stored. The role of the messaging bus is played by Google's managed service, Cloud Pub/Sub, a scalable event ingestion and delivery system. By providing many-to-many, asynchronous messaging that decouples senders and receivers, it allows for highly available communication among independently written applications. Cloud Dataflow is a managed service for executing a wide variety of data processing patterns; we use it to create streaming data processing pipelines. Pipelines are built using the Apache Beam software development kit (SDK) and then run on the Cloud Dataflow service.

To further illustrate the process, suppose a certain stochastic process generates and publishes to Pub/Sub events $x_{it}$ pertaining to an observational unit $i$ in discrete time indexed by $t$. The investigator wishes to obtain hourly counts of the events, $\sum_{t \in \text{hour}} x_{it}$. There are two Apache Beam pipelines: the first takes $x_{it}$ from the topic where it is published ("raw data"), applies the respective transforms, and publishes the result to another topic, "derived data". The second pipeline listens on the derived data topic, ingests the counts as they arrive, and channels them to external storage.
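For illustration only, the first of the two pipelines might look roughly like the following Apache Beam (Python SDK) sketch; the topic names, the event encoding, and the per-unit keying are assumptions made for the example, not the PoC's actual code.

```python
# Streaming Beam pipeline sketch: raw events in, hourly per-unit counts out.
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (p
     | "ReadRaw" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/raw-data")
     | "Parse" >> beam.Map(lambda b: json.loads(b.decode("utf-8")))      # e.g. {"unit": "i", ...}
     | "KeyByUnit" >> beam.Map(lambda e: (e["unit"], 1))
     | "HourlyWindows" >> beam.WindowInto(window.FixedWindows(60 * 60))   # fixed one-hour windows
     | "CountPerUnit" >> beam.CombinePerKey(sum)
     | "Encode" >> beam.Map(lambda kv: json.dumps({"unit": kv[0], "count": kv[1]}).encode("utf-8"))
     | "PublishDerived" >> beam.io.WriteToPubSub(topic="projects/my-project/topics/derived-data"))
```

A second, symmetric pipeline would read from the derived-data topic and write to storage; keeping the two pipelines separate is exactly the de-coupling motivated next.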

This may strike the reader as completely sub-optimal and redundant. Indeed, a pipeline is supposed to encapsulate the entire data processing task from start to finish, and the two stages, the transform and the output to storage, could easily be made parts of a single pipeline. However, the redundancy the system suffers to enable this de-coupling has its raison d'être, and the next section presents two case studies addressing different aspects of redundancy in CMLS.

III. (MINI) CASE STUDY I: DATABASES THAT COMPETE AND COOPERATE

As alluded to in the previous sections, this mini case study addresses software redundancy in the system. Most commonly, redundancy is a purposeful part of the system design; it may be beneficial to deliberately add redundancy to increase the system's fault tolerance. An example is N-version programming, which diversifies the design process to produce redundant functionality [1] and is itself a special case of the more general design diversity approach [2]. An alternative is to let redundancy develop naturally in the system and use it as needed. For example, a multitude of non-replicated components can be connected and reconnected in a multitude of ways, allowing a Lego-like redundancy to provide automatic workarounds in self-healing systems; see [3], [4].

To facilitate exposition, let us trivialize the problem and introduce only two competing data storage solutions, albeit of very different kinds. The solutions can be considered as coming from two different authors, as they have little, if anything, in common in terms of design and implementation.

  • Bigtable (the open-source version of which is known as Apache HBase) is a scalable, NoSQL wide-column database suitable for both real-time access and analytic workloads, featuring:
    • low-latency read/write access;
    • high-throughput analytics;
    • native time series support.
    Common workloads include:
    • Internet-of-Things data, such as usage reports from energy meters and home appliances.
    • Graph data, such as information about how users are connected to one another.
    • Time-series data, such as CPU and memory usage over time for multiple servers.
    • Machine learning data, i.e., data consumed and produced by machine learning algorithms.
    • Financial data, such as transaction histories (keeping historical behavior while continually updating fraud patterns and comparing them with real-time transactions).
  • Cloud Spanner is an enterprise-grade, globally distributed, horizontally scalable, and strongly consistent relational database service, featuring:
    • strong consistency and horizontal scalability;
    • support for relational schemas and transactions.
    Common workloads include:
    • Financial services operational data.
    • Adtech systems, recommendation systems, and customer surveys.
    • Geo-spatial data.
    • Global supply chain and retail data.

A practical rationale for introducing such a pair of storage solutions is the availability-versus-consistency dilemma. By the "CAP theorem", a distributed data store cannot simultaneously provide more than two of the following three guarantees: (a) consistency, (b) availability, and (c) partition tolerance [5].

Fig. 3. Competing storage? Cloud Bigtable and Cloud Spanner.

Fig. 3 zooms in on the part of the system PoC with the two connected databases. We add a Pub/Sub topic for data queries (Q) and one for data query results (A), and introduce a dummy module that creates data requests for items from a certain kind of transactional time series. Requests for different data elements also come with an urgency indicator; we use only two levels: zero for non-urgent requests and one for urgent ones. Urgent requests can be thought of as prioritizing availability, while non-urgent requests require consistency. No information about the kind of the databases is available to the data requesting module and, conversely, the handlers see only the data requests as published in the topic.
Data requests can now be fulfilled in one of three ways (a sketch of an urgency-aware request handler follows the list), namely:

  1. No redundancy will exist if Bigtable only serves urgent requests, with Cloud Spanner handling non-urgent ones. Unsurprisingly, with simple point or block queries that do not require a full table scan, the faster Bigtable easily outperforms Cloud Spanner at 0.002 seconds per query versus 1.855 seconds for Spanner.
  2. Parallel evaluation will result from both databases indiscriminately serving both types of requests.
  3. Predictive/Intelligent redundancy is possible if one adds a recommendation system for data requests. Data requests form longitudinal series $(z_{it}, I_{it})$, where $z_{it}$ is the requested value and $I_{it}$ is the stated request urgency. The identity of the requestors and the query times are also available as part of the response. Correspondingly, it is straightforward to implement a recommendation system, again as an independent module listening to the Pub/Sub topics, that provides direct source recommendations to the requestors.
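A handler implementing the first, non-redundant routing could look roughly like the sketch below, again assuming the google-cloud-pubsub client; the topic and subscription names, the JSON request format, and the two query functions are placeholders, not the PoC's code.

```python
# Hypothetical urgency-aware request handler for the "Q"/"A" topics.
import json
from google.cloud import pubsub_v1

PROJECT = "my-gcp-project"            # placeholder project id
REQUEST_SUB = "data-requests-sub"     # subscription on the query ("Q") topic
RESULT_TOPIC = "data-query-results"   # the results ("A") topic

publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()

def query_bigtable(item_id):
    """Placeholder for a low-latency point lookup against Bigtable."""
    raise NotImplementedError

def query_spanner(item_id):
    """Placeholder for a strongly consistent query against Cloud Spanner."""
    raise NotImplementedError

def handle_request(message):
    req = json.loads(message.data)    # e.g. {"item": "...", "urgency": 0 or 1}
    # Urgent requests favor availability and latency, non-urgent ones consistency.
    value = query_bigtable(req["item"]) if req["urgency"] == 1 else query_spanner(req["item"])
    payload = json.dumps({"item": req["item"], "value": value}).encode("utf-8")
    publisher.publish(publisher.topic_path(PROJECT, RESULT_TOPIC), payload)
    message.ack()

future = subscriber.subscribe(
    subscriber.subscription_path(PROJECT, REQUEST_SUB), callback=handle_request
)
future.result()   # a long-running handler simply blocks on the streaming pull
```

The parallel and predictive variants differ only in this routing decision: the former sends every request to both stores, while the latter replaces the urgency test with a recommendation learned from the $(z_{it}, I_{it})$ history.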

IV. (MINI) CASE STUDY II: CURATING DATA STREAMS

Perhaps counter-intuitively, the previous section dealt with software redundancy in CMLS using the example of redundant data storage. While early studies considered information (or data) redundancy an extension or development of the software redundancy concept (e.g., see [6]), it is currently viewed as a separate paradigm. Information redundancy includes the use of additional information carried with the data and the use of alternative forms of the data to assist in fault tolerance [7].

The canonical form of the alternate input generation process proceeds as follows. An input, $x$, to the Program (we will be using Model instead, as more relevant in the CMLS context) is re-expressed into a different form, $r = R(x)$, which is then supplied to the Model. The Model's output, $M(r)$, is adjusted by applying some $A(M(r))$ to compensate for distortions introduced at the re-expression stage. While these considerations remain quite relevant in our study, the real-time nature of the system introduces some unique challenges which lead to a significant departure from the canon.
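Schematically, and purely as a restatement of the formulas above, the canonical flow can be written as the following sketch, with $R$, $M$, and $A$ left as placeholders supplied by the caller.

```python
def run_with_data_diversity(x, R, M, A):
    """Canonical data-diversity flow: re-express, evaluate, adjust."""
    r = R(x)       # re-express the input into an alternative form
    y = M(r)       # evaluate the model on the re-expressed input
    return A(y)    # adjust the output to compensate for the re-expression
```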

Fig. 4. Smoothing out delays and bursts.
  • Information flows arrive with delays and in bursts. A variety of external systems will be expected to publish their data to the messaging bus. As is the case with any real-time system, it is generally not possible to ensure that the data, the input $x$, arrive synchronously and at regular intervals; Fig. 4 illustrates the point. If the input is missing altogether, no re-expression of the missing data can be carried out. The workaround is to generate an alternative, predicted stream of data, $x^{*}$, based on historical values and trends. Unlike $x$, the stream $x^{*}$ is always available. The predicted data are then blended with the actual $x$, which amounts to a re-expression, $r = R(x, x^{*})$ (a minimal blending sketch follows this list).
  • CMLS models are expected to be mainly those of stochastic processes. In models of this kind, inter-arrival times for events are as important as events themselves. Also, the models will be using data provided as DaaS, which means that deployed models will operate while being completely oblivious as to the state of the system’s data streams. A delay may be interpreted as a dramatic change in the distributional assumptions about the event-generating process, thus yielding unpredictable or meaningless results. On the other hand, encountering a data burst may be interpreted as a deliberate attack on the system, a fraudulent activity in progress, or an intrusion attempt.
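As one possible shape of the blending step, consider the toy sketch below; the gap test and the fixed blending weight are illustrative assumptions, not the handler actually deployed in CMLS.

```python
from typing import Optional

def curate(x_t: Optional[float], x_star_t: float, alpha: float = 0.7) -> float:
    """One step of the re-expression r_t = R(x_t, x*_t).

    If the real observation is delayed (missing), fall back entirely on the
    predicted stream; otherwise blend the two so that a late burst cannot
    swing the summarized series the model sees.
    """
    if x_t is None:                                   # delay: no real observation in this slot
        return x_star_t
    return alpha * x_t + (1.0 - alpha) * x_star_t     # blend real and predicted values
```

In the PoC this decision would sit inside the streaming pipeline and be driven by the irregularity detection discussed above.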

Technology-wise, Cloud Dataflow (whose programming model was open-sourced as Apache Beam) offers a unified model for defining both batch and streaming data-parallel processing pipelines, which is well suited for most of the data transformation in this exercise. The goal is to provide an easy-to-use but powerful model for data-parallel processing, both streaming and batch, portable across a variety of run-time platforms. Dataflow is designed with infinite data sets in mind and, as such, can deal with both bounded (batch) and unbounded (streaming) data in a reliable manner. This is in contrast to the Lambda Architecture [8], where the challenge, and at times a quagmire, is the need to build, provision, and maintain two independent versions of the pipeline and then somehow merge the results of the two at the end.

Figs. 5 and 6 show two out of, admittedly, many options to implement data stream curation.

Fig. 5. Data stream curation: linear design.

 

Fig. 6. Data stream curation: “beefed-up” design.

The linear design in Fig. 5 is straightforward and follows directly from our earlier discussion of what the alternative stream generating process would look like in CMLS. Note that the simulation, that is, the generation of the predicted data stream, happens in Cloud Datalab. Cloud Datalab is an interactive tool created to explore, analyze, transform, and visualize data and build machine learning models on Google Cloud Platform. It is not, however, a production tool, which makes it the system's weakest link. The design in Fig. 6 is more robust: the simulation moves to Google Compute Engine, Google's general computing infrastructure, Dataflow still handles the basic transforms, and Stackdriver is added to perform certain dispatch functions. This adds more modules to the system, but it also relieves the simulation and transformation modules from the need to constantly monitor all data streams.

V. CONCLUDING REMARKS

So, would the esteemed patron want cheese on their cheeseburger? It is always tempting to design a system optimally and efficiently. We all wield our Occam's razor, even subconsciously, looking for simple, minimalist designs. No different from collaborative content creation, attempts to create software through the collective action of many autonomous individuals with little coordination, no common affiliation, and no explicit incentives to collaborate should be futile, as aptly noted in [9]. So, the reader probably figures, it only makes sense to skip the talk and serve the cheeseburger to the customer straight away, without asking them about any added cheese. Or does it?

A collaborative process is by its nature inefficient, as it involves multiple parties. A lot of time and development effort is bound to be spent in discussions, negotiation, and the much-dreaded "back and forth". This essentially harks back to Waldo's argument [10], which posits a negative, linear relation between efficiency and democracy, rooted in the impossibility of reconciling efficiency from a managerial perspective with the concept of democracy through public engagement and dialog. Yet the success of such seemingly wasteful collaborative creation ecosystems as Wikipedia and GitHub may provide a wee testimonial to the contrary.

The collaborative real-time machine learning system is being built as we speak; the techniques we are discovering and the observations we are making have yet to reach the level of maturity needed to form a solid corpus of knowledge, one we anticipate will evolve rapidly. Nonetheless, we believe the present paper will be instrumental to practitioners in the field, hopefully shining some new light on familiar concepts as they relate to redundancy in software systems.

REFERENCES

[1] A. Avizienis, "The N-version approach to fault-tolerant software," IEEE Transactions on Software Engineering, no. 12, pp. 1491–1501, 1985.
[2] J. P. J. Kelly, T. I. McVittie, and W. I. Yamamoto, "Implementing design diversity to achieve fault tolerance," IEEE Software, vol. 8, no. 4, pp. 61–71, 1991.
[3] A. Carzaniga, A. Gorla, and M. Pezzè, "Self-healing by means of automatic workarounds," in Proceedings of the 2008 International Workshop on Software Engineering for Adaptive and Self-Managing Systems. ACM, 2008, pp. 17–24.
[4] A. Carzaniga, A. Gorla, N. Perino, and M. Pezzè, "RAW: runtime automatic workarounds," in Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering, vol. 2. IEEE, 2010, pp. 321–322.
[5] S. Gilbert and N. Lynch, "Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services," ACM SIGACT News, vol. 33, no. 2, pp. 51–59, 2002.
[6] P. E. Ammann and J. C. Knight, "Data diversity: an approach to software fault tolerance," IEEE Transactions on Computers, no. 4, pp. 418–425, 1988.
[7] L. L. Pullum, Software Fault Tolerance Techniques and Implementation. Artech House, 2001, ch. Data Diversity (2.3).
[8] E. Friedman and K. Tzoumas, Introduction to Apache Flink: Stream Processing for Real Time and Beyond. O'Reilly Media, Inc., 2016.
[9] C. Wagner and P. Prasarnphanich, "Innovating collaborative content creation: the role of altruism and wiki technology," in Proceedings of the 40th Annual Hawaii International Conference on System Sciences (HICSS). IEEE, 2007, pp. 18–18.
[10] D. Waldo, The Administrative State: A Study of the Political Theory of American Public Administration. Routledge, 2017.
