Analytic Strategy Partners

Improve your analytic operations and refine your analytic strategy


100 Millisecond Auctions, Your Privacy and Regulatory Risk

December 14, 2021 by Robert Grossman

The Regulatory Risks Posed by GDPR When Using Bidstream Data in Real Time Bidding (RTB)

You can download Mobilewalla and Ubermedia's data directories from the evidence we sent the DPC 13+ months ago here https://t.co/3ccOMbNyNC. Note: this is clearly the IAB standard. pic.twitter.com/aSHP3NzQgE

— Johnny Ryan (@johnnyryan) November 18, 2021
Figure 1. A tweet from Johnny Ryan of the Irish Council for Civil Liberties (ICCL) showing some of the data that is available in a TC consent string. The ICCL is involved in litigation under the GDPR against those using bidstream data for targeting.

In real time bidding (RTB), there is a 100 ms auction in which different advertisers bid against each other to place an ad on a webpage or other space offered by a publisher. This is why, when browsing, you sometimes see the page jump a bit as you start to read it: the page content can shift after the ad loads.

The data exchanged in this auction is called bidstream data and includes location information, information about the application and system you are using, and standardized codes developed by the Interactive Advertising Bureau (IAB). The information about the application and system you are using is enough for online device fingerprinting. The specificity of the IAB codes can be a bit shocking the first time you look at them. Here are some examples of IAB codes from the IAB Taxonomy (a simplified sketch of a bid request follows the list):

ID 281 Interest | Businesses and Finance | Bankruptcy
ID 351 Interest | Family and Relationships | Adoption and Fostering
ID 357 Interest | Family and Relationships | Special Needs Kids
ID 396 Interest | Health and Medical Services | Mental Health Services
ID 432 Interest | Hobbies & Interests | Scrapbooking
ID 434 Interest | Hobbies & Interests | Beekeeping
ID 565 Interest | Pharmaceuticals, Conditions, and Symptoms | STD
ID 568 Interest | Pharmaceuticals, Conditions, and Symptoms | Substance Abuse
ID 572 Interest | Pharmaceuticals, Conditions, and Symptoms | Cancer
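To make this concrete, here is a minimal sketch, in Python, of the kind of information a bid request can carry. The field names loosely follow the shape of the OpenRTB specification, but the structure is simplified and the values are made up; treat it as an illustration, not a verbatim example from any exchange.

```python
# Illustrative only: a simplified bid-request-like record showing the kinds of
# fields (location, device, interest/content categories) discussed above.
# Field names loosely follow OpenRTB conventions; all values are made up.
bid_request = {
    "id": "auction-12345",
    "imp": [{"id": "1", "banner": {"w": 300, "h": 250}, "bidfloor": 0.25}],
    "site": {
        "domain": "example.com",
        "page": "https://example.com/article",
        "cat": ["IAB7"],                      # publisher-declared content category
    },
    "device": {
        "ua": "Mozilla/5.0 (...)",            # user agent, usable for fingerprinting
        "os": "iOS",
        "geo": {"lat": 41.88, "lon": -87.63, "country": "USA"},
        "ifa": "38400000-8cf0-11bd-b23e-10b96e40000d",  # advertising identifier
    },
    "user": {"id": "hashed-user-id"},
}
```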

There have been concerns for some time about the leakage of private information in bidstream data and about the inadequacy of the pop-up consents that are supposed to give users some control.

Providing Consent for Bidstream Tracking Data

For those websites that follow the European General Data Protection Regulation (GDPR), users are presented with a pop-up banner that asks for consent for the collection of tracking information. In general, websites use this information both for internal purposes and for targeted advertising, such as supplying information for bidstream auctions.

In 2018, the IAB developed a “consent framework” called the Transparency and Consent Framework to standardize how advertisers and publishers collect and store the required consent information. The information is passed around in a format standardized by the IAB called the TC consent string.
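For a feel for what a TC string looks like mechanically, here is a minimal Python sketch that reads the version field from the core segment. It assumes the TCF v2 layout, in which the string is a dot-separated set of base64url-encoded segments and the first 6 bits of the core segment encode the version; a production decoder would use a maintained TCF library rather than this sketch.

```python
import base64

def tc_string_version(tc_string: str) -> int:
    """Return the version encoded in the first 6 bits of a TC string's core segment.

    Assumes the TCF v2 bit layout; this is a sketch, not a full decoder.
    """
    core = tc_string.split(".")[0]            # the core segment comes first
    core += "=" * (-len(core) % 4)            # restore base64 padding
    raw = base64.urlsafe_b64decode(core)
    return raw[0] >> 2                        # top 6 bits of the first byte

# Example: TCF v2 consent strings start with "C", which encodes version 2.
# print(tc_string_version("CPc..."))  # -> 2
```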

Changes in the Regulatory Framework for Using Bidstream Data

The Irish Council for Civil Liberties (ICCL) has brought a case under the GDPR claiming that the IAB Transparency and Consent Framework, and its use of the TC String in real time bidding, does not provide adequate consent [1]. The 174-page court filing [2] provides a very detailed explanation of the bidstream data that is exposed and why it should be considered a data breach under the GDPR. For a good overview of the ICCL case and its implications, see [3].

Upcoming Regulatory Changes and Your Analytic Strategy

As a result of the ICCL court case, there may be changes in the regulations for using bidstream data. One of the challenges from an analytic strategy point of view is to have a plan for updating your own strategy in anticipation of such regulatory changes.

Figure 2. The process by which advertisers (the brand) buy ad space from publishers. An advertiser can use a demand side platform to work with different networks of publishers, and a publisher can use a supply side platform to work with different networks of advertisers. Advertisers buying space for advertising and publishers selling space for advertising are matched by the ad exchange and use the bidstream data to determine the auction prices. The figure is from Wikipedia (CC-BY).

Problems with Bidstream Data

The exchange of bidstream data involves the advertiser, the publisher, the ad exchange, and, usually, a demand side and supply side platform. Advertisers buying space for advertising and publishers selling space for advertising are matched by the ad exchange. An advertiser can use a demand side platform to work with different networks of publishers and a publisher can use a supply side platform to work with different networks of advertisers. See Figure 2. With this many different entities involved in the auction, it is not surprising that there are often problems with bidstream data.

Inaccurate information. Given the amount of information exchanged with RTB and the monetary value involved, there is a significant level of inaccurate information being exchanged, and a significant level of activity in which bidstream data is used for purposes for which it was not intended and to which users did not consent. An example would be surveilling bidstream data and processing it with device fingerprints to create profiles of users.

Misuse of bidstream data. This is an example of adversarial analytics in the original sense (not the more modern GAN sense), in which two parties build analytic models that are at cross purposes. Here, the advertisers and publishers exchange bidstream information for real time auctions, while the adversaries monitor the information to build profiles of users that can be monetized.

Challenges with correcting it. Most websites provide a long list of advertising partners with which they exchange bidstream data (often hundreds) and leave it to the user to update or correct the information with each partner. This is almost never practical.

The Emerging Era of Post Third Party Cookies

We are moving to an era in which advertising will not be able to use third party cookies but will instead rely on other types of information for matching advertisers and publishers. From an analytic strategy point of view, it is critical to make sure that your analytic frameworks still provide the value you require when third party cookies are not supported. More generally, platforms will have to be updated so as not to rely on third party cookies.

Regulatory changes almost always take years, giving your organization time to put new analytic frameworks in place. The challenge, though, is usually not the technical one of transitioning to a new regulatory framework but rather figuring out how to extract the necessary business value within the new regulatory framework to provide the foundation for a successful analytic business.

References

[1] Irish Council for Civil Liberties, ICCL lawsuit takes aim at Google, Facebook, Amazon, Twitter and the entire online advertising industry, 15 June 2021, retrieved from ICCL.ie on December 6, 2021.

[2] ICCL complaint filed against IAB Technology Lab, Inc. and others in Hamburg District Court, May 15, 2021 (machine translated from the original German).

[3] Daniel Cooper, How a civil rights group is holding Europe’s online ad industry to account, Engadget, December 3, 2021. Retrieved from Engadget on Dec 14, 2021.


When You Need to Deploy Predictive Models Safely

August 10, 2020 by Robert Grossman

Little languages. A key insight in the development of Unix was that there was an important role for what became known as little languages, which are simple specialized languages for executing important types of tasks. The insight was that it is much easier to design a little language that can be implemented efficiently for a specific task than a general language that is designed to support all tasks. For example, in Unix there are specialized little languages and corresponding programs for:

  • pattern matching (regular expressions)
  • text line editing (ed/sed)
  • grammars for languages (lex/yacc)
  • shell services (sh)
  • text formatting (troff/nroff)
  • processing data records (awk)
  • processing data (S)

This point of view is explained clearly in an influential 1986 ACM article by Jon Bentley called Little Languages [1]. Of course, the downside is that software developers must learn the little languages.
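The same pattern shows up inside general-purpose languages today: a regular expression is a little language embedded in the host language. Here is a minimal Python sketch; the log line and pattern are made up for illustration.

```python
import re

# The regular expression below is a "little language": a compact, specialized
# notation for pattern matching, embedded in a general-purpose host language.
log_line = "2020-08-10 12:34:56 ERROR scoring-service model timeout"
pattern = r"(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (\w+) (.*)"

match = re.match(pattern, log_line)
if match:
    date, time, level, message = match.groups()
    print(level, message)   # ERROR scoring-service model timeout
```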

How should we view models in analytics, AI and data science? From the viewpoint of the practice of analytics, it is important to understand the different perspectives that different members in your organization have about analytic models.

  • If you are a modeler, your task is often to develop an analytic model from some data. So your input is data and your output is an analytic model. Historically, modelers split data into training and test (or validation) sets, but these days, with hyper-parameters to tune, it is more common to split into training, dev and test datasets (a minimal sketch of this three-way split follows this list).
  • If you are a member of an operations team (what I call analyticOps) that is deploying analytic models in products, services or internal operations, then you must manage multiple analytic models, make sure that they are processing data as required to produce scores, and make sure that the associated post-processing is in place to act on the scores.
  • If you are a member of the IT team that deploys analytic models, such as the AIOps or ModelOps team, then your task is to take the models developed by the modeling team, manage them as enterprise IT assets, and deploy them as needed into the required products, services and internal processes.
  • Finally, if you are developing an analytic strategy, then the models produced by the modeling team and the products, services and internal processes managed by the analytic operations team are part of a broader analytic ecosystem that might also include supply chain partners and product ecosystem partners that also use models that your organization develops.
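As mentioned in the first bullet, here is a minimal sketch of the three-way split, using scikit-learn on synthetic data. The 70/15/15 proportions and the synthetic dataset are illustrative choices, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the data handed to the modeler.
X, y = make_classification(n_samples=1_000, n_features=20, random_state=42)

# First carve off the training set, then split the remainder into dev and test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.30, random_state=42)
X_dev, X_test, y_dev, y_test = train_test_split(X_rest, y_rest, test_size=0.50, random_state=42)

print(len(X_train), len(X_dev), len(X_test))   # 700 150 150
```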

With this split (described in more detail in my post on the analytic diamond and in my primer Developing an AI Strategy, a Primer), there is one team that produces models and one or more teams that consume them. It is therefore natural to ask what types of efficiencies can be obtained by using little languages for expressing models and for managing models across an analytic enterprise and the analytic ecosystem it supports, including all the applications and systems that produce models (model producers) and all the applications and systems that consume models (model consumers).

Analytic models as code. With the critical importance of DevOps, there is another important way to view models. With this view, models are simply code, and the way to manage code is with a version control system, continuous integration (CI), and continuous deployment (CD). Model code is dockerized in a container, and the dockerized container is managed with the same systems that are used for the CI/CD of the rest of the code. There are many more software developers than modelers, and so the most common way of viewing analytic models these days is as code.
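As a deliberately minimal illustration of the "models as code" view, the sketch below wraps a pickled model in a small Flask scoring service that could be dockerized and pushed through the same CI/CD pipeline as any other service. The model file name and the request format are assumptions made for the example.

```python
# Minimal sketch: a model treated as code. The file "model.pkl" is an assumed
# artifact produced by the modeling team; the /score request contract is illustrative.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/score", methods=["POST"])
def score():
    features = request.get_json()["features"]   # e.g. {"features": [1.0, 2.0, 3.0]}
    prediction = model.predict([features])[0]
    return jsonify({"score": float(prediction)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```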

Analytic models as described in little languages. Returning to the first point of view, there are several little languages that have been developed for analytic models:

  • Predictive Model Markup Language (PMML). PMML is an XML language for expressing standard statistical, machine learning and data mining models that has been used for over 20 years. It is widely deployed and good at expressing the familiar machine learning models, such as decision trees, support vector machines and clusters, but it does not support arbitrary models and has only a limited ability to support the data transformations that are needed for preparing features for models.
  • Open Neural Network Exchange (ONNX). ONNX is a language for expressing deep learning neural network models and is supported by the major systems for deep learning, including TensorFlow, PyTorch and Keras. It is by far the most common language for expressing deep neural networks, but it does not support standard statistical and data mining models as well, nor does it fully support the data transformations that are often required in machine learning and analytics (a small export sketch follows this list).
  • Portable Format for Analytics (PFA). The Portable Format for Analytics is a newer little language and model interchange format for analytic models based upon JSON that is designed for the safe and secure execution of arbitrary analytic models and arbitrary data transformations. An overview of PFA presented at KDD 2016 also describes an open source PFA scoring engine [2]. PFA supports the safe and secure execution of models in several ways, including:
    • PFA models are strictly sandboxed
    • PFA models can only access data that is explicitly given to them
    • PFA models cannot manipulate anything beyond their own state
    • PFA models have no way to access the disk, network or operating system
    • PFA models have static data types and missing value safety
  • Combinations of little languages. In some situations, it might make sense to complement ONNX with PFA to support data transformations that are not available, or not available efficiently, in ONNX, and to support models that are not available in ONNX.
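As a small illustration of the interchange idea mentioned in the ONNX bullet above, here is a hedged sketch of exporting a toy PyTorch network to ONNX. The network and file name are placeholders; real models usually need more care with input shapes and opset versions.

```python
import torch
import torch.nn as nn

# A toy network standing in for a real model; the architecture is illustrative.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
dummy_input = torch.randn(1, 4)   # example input that fixes the expected shape

# Serialize the model to the ONNX interchange format so another runtime can score it.
torch.onnx.export(model, dummy_input, "model.onnx")
```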

When you need to deploy models safely. There are some situations in which it is critical to deploy code safely. It is well known that changing just a single line of code can bring down an enterprise system, so there is always a risk in deploying analytic models as code in a system that requires high availability with accurate results. As another example, when analytic models are deployed at the edge, including IoT, OT and consumer devices, there are strong arguments for deploying models in safe languages, such as PFA, or other little languages designed for this purpose.
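For a feel for what a PFA document looks like, here is a tiny "add ten to the input" model in the style of the examples in the PFA documentation, written as a Python dict. Because the whole model is just JSON data, it is straightforward to sandbox; a PFA scoring engine, such as the open source engine described in [2], would load the JSON and apply the action to each input record.

```python
import json

# A minimal PFA document in the style of the PFA documentation's examples:
# the model is pure data (JSON), which is what makes strict sandboxing possible.
pfa_doc = {
    "input": "double",
    "output": "double",
    "action": [{"+": ["input", 10]}],   # add ten to each input record
}

print(json.dumps(pfa_doc))
```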

References

[1] Bentley J. Programming pearls: little languages. Communications of the ACM. 1986 Aug 1;29(8):711-21.

[2] Pivarski J, Bennett C, Grossman RL. Deploying analytics with the portable format for analytics (PFA). In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2016 Aug 13 (pp. 579-588).


Deploying Analytic Models

November 4, 2019 by Robert Grossman

One of my defining experiences in analytics occurred over twenty years ago in 1998. At that time, it was very challenging to build analytic models if the data did not fit into the memory of a single computer. This was a time when clusters of distributed computers (then called Beowulf clusters) were still new and used only in academia, ensembles of models were still largely unknown, and Python and R were still used by only a few fringe communities. I was working at a venture-backed start-up that developed software for building trees and other common analytic models over clusters of workstations by using ensembles of models. One of our first customers gave us clean, nicely labeled data just when they said they would (in over twenty years this has only happened a handful of times). We built a model that outperformed what was used in production, and I (very) naively thought we were done. Little did I understand just how much work it would be to get the model into production and how much it would have to change to get it there. The figure above gives the general idea, but in practice getting the data and deploying the model is even harder than the figure indicates.

Most analytic and machine learning conferences even today focus on new algorithms and new software but very little on deploying analytic models. An exception is the workshop called “Common Model Infrastructure” which was held at KDD 2018 and ICDM 2019. The workshop describes its focus as “infrastructure for model lifecycle management—to support discovery, sharing, reuse, and reproducibility of machine learning, data mining, and data analytics models.”

I spoke at the CMI 2018 workshop at KDD 2018. My slides are on SlideShare and you can find them here. I singled out the following best practices:

Five Best Practices When Deploying Models

  1. Mature analytic organizations have an environment to automate testing and deployment of analytic models.
  2. Don’t think just about deploying analytic models, but make sure that you have a process for deploying analytic workflows.
  3. Focus not just on reducing Type 1 and Type 2 errors, but also on data input errors, data quality errors, software errors, systems errors and human errors. People only remember that the model didn’t work, not whose fault it was.
  4. Track value obtained by the deployed analytic model, even if it is not your explicit responsibility.
  5. It is often easier to increase the value of a deployed model by improving the pre- and post-processing than by chasing smaller improvements in the model’s lift curve.

During the talk, I identified five common approaches for deploying analytic models. I remember these with the acronym E3RW:

  1. Embed analytics in databases
  2. Export models and deploy them by importing into scoring engines
  3. Encapsulate models using containers or virtual machines
  4. Read a table of values that provides the parameters of the model (a small sketch of this approach follows the list)
  5. Wrap the code, workflow, or analytic system, and, perhaps, create a service
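As an illustration of approach 4 (reading a table of values), here is a minimal Python sketch in which the model's parameters live in a simple coefficient table and the scoring code just evaluates a logistic-regression-style formula. The feature names, coefficients, and record format are illustrative assumptions.

```python
import math

# Illustrative only: the "model" is just a table of (feature, coefficient) values
# that can be stored in a database or a flat file and reloaded without new code.
coefficients = {"intercept": -1.2, "age": 0.03, "balance": 0.0005}  # assumed values

def score(record: dict) -> float:
    """Logistic-regression-style score computed from the coefficient table."""
    z = coefficients["intercept"] + sum(
        coef * record.get(name, 0.0)
        for name, coef in coefficients.items()
        if name != "intercept"
    )
    return 1.0 / (1.0 + math.exp(-z))

print(score({"age": 35, "balance": 1500.0}))
```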

Although these days, with the popularity of continuous integration (CI) and continuous deployment (CD), encapsulating models in containers and using CI/CD tools is the approach du jour, each approach has its place.

I also identified five common mistakes when deploying analytic models.

Five Common Mistakes When Deploying Models

  1. Not understanding all the subtle differences between the data supplied to train the model and the actual run time data the model sees.
  2. Thinking that the features are fixed and all that you will need to do is update the parameters.
  3. Thinking the model is done and not realizing how much work is required to keep all of the required pre- and post-processing up to date.
  4. Not checking in production to see if the inputs to the model drift slowly over time (a minimal drift-check sketch follows this list).
  5. Not checking that the model will keep running despite missing values, garbage values, etc. (even values that should never be missing in the first place).
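As a minimal sketch of the check described in mistake 4, the snippet below compares the distribution of one model input in recent production data against the training data with a two-sample Kolmogorov-Smirnov test. The synthetic data and the 0.01 threshold are illustrative assumptions; in practice you would run this per feature on real samples.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=10_000)  # stand-in for training data
prod_feature = rng.normal(loc=0.3, scale=1.0, size=10_000)   # stand-in for recent production data

# Two-sample KS test: a small p-value suggests the production distribution of
# this input has drifted away from what the model was trained on.
stat, p_value = ks_2samp(train_feature, prod_feature)
if p_value < 0.01:
    print(f"Possible input drift: KS statistic {stat:.3f}, p-value {p_value:.2g}")
```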

Copyright 2019 Robert L. Grossman


Improving the Analytic Maturity Level of Your Company

October 2, 2019 by Robert Grossman

The figure shows the five levels of analytic maturity: Analytic Maturity Levels (AML) 1-5.

Over the past 10 years or so, I developed and used an analytic maturity model that was broadly motivated by the Capability Maturity Model (CMM) used in software development.

You can learn more about the model in this article: A Framework for Evaluating the Analytic Maturity of an Organization, International Journal of Information Management, 2018, which is open access and available here: doi.org/10.1016/j.ijinfomgt.2017.08.005. This article introduces five Analytic Maturity Levels (AML) and discusses the analytic processes required to achieve each level.

Over the past few years I have been working on a book called the Strategy and Practice of Analytics and have been thinking about analytic maturity levels again. Here are the five levels as defined in the book:

AML 1: Build reports. An AML 1 organization can analyze data, build reports summarizing the data, and make use of the reports to further the goals of the organization.

AML 2: Build models. An AML 2 organization can analyze data, build and validate analytic models from the data, and deploy a model.

AML 3: Repeatable analytics. An AML 3 organization follows a repeatable process for building, deploying and updating analytic models. In my experience, a repeatable process usually requires a functioning analytic governance process.

AML 4: Strategy driven repeatable analytics. An AML 4 organization has an analytic strategy that aligns with the corporate strategy, uses the analytic strategy to select which analytic opportunities to pursue, and follows a repeatable process for building, deploying and updating analytic models.

AML 5: Strategy driven enterprise analytics. An AML 5 organization uses analytics throughout the organization; analytic models are built with a common infrastructure and process whenever possible, deployed with a common infrastructure and process whenever possible, and their outputs are integrated together as required to optimize the goals of the organization as a whole. Analytics across the enterprise are coordinated by an analytic strategy and an analytic governance process.

Five ways to improve the analytic maturity level of your organization. Here are five suggestions for improving the analytic maturity level of your organization.

  1. Set up a committee to quantify the analytic maturity of your company. If you cannot measure it, you cannot improve it.
  2. Set up (or improve) the environment for deploying analytic models, using a model interchange format, analytic engines, or similar technology. (Helpful for AM Level 2).
  3. Set up your first analytic governance committee or improve the operational efficiency of your current analytic governance. (Helpful for AM Level 2).
  4. Set up SOPs for building and/or deploying analytic models so that the process is faster, repeatable & replicable. (AM Level 3 Requirement)
  5. Volunteer to lead a process to integrate two different models from two different parts of your company to improve the relevant actions. (Helpful for AM Level 5)

What’s different between the AMM-2012 and the newer AMM? Although the paper describing the analytic maturity model was published in 2017, the work dates back to the period 2010-2012. I first gave a talk about the AMM in 2012 at a Predictive Analytics World (PAW) conference that took place on June 25, 2012 in Chicago. Since the AMM didn’t change between 2012 and when it was published in 2017, let’s call this the AMM-2012 model.

In the AMM-2012, AML 1-3 are the same, but AML 4 is Enterprise Analytics and AML 5 is Strategy Driven Analytics. The problem with this approach is that strategy driven analytics can occur at any level, so it is not particularly helpful to reserve it for level 5. From one perspective, strategy can be integrated into processes for building reports (AML 1), for building models (AML 2), for building models in a repeatable fashion (AML 4), or for building models across the enterprise in a repeatable fashion (AML 5). With this approach, reports -> models -> repeatable models -> enterprise models would be one axis and the degree of strategy integration would be another axis. This seemed too complicated for most applications, so in the newer version of the AMM, AML 4 is simply strategy driven repeatable analytics and AML 5 is strategy driven enterprise analytics.

I speak about analytic maturity models from time to time, most recently at the Predictive Analytics World (PAW) Conference that took place in Chicago on June 29, 2017 and The Data Science Conference Chicago (TDSC) that took place in Chicago on April 20, 2017. I also occasionally give in-house training courses that include material about helping a company increase its analytic maturity level.

You can find more information about Analytic Maturity Models in Chapter 14 of my book The Strategy and Practice of Analytics.

Copyright 2019 Robert L. Grossman

