Analytic Strategy Partners

Improve your analytic operations and refine your analytic strategy



Two Years of Posts About Analytic Strategy

September 14, 2021 by Robert Grossman

I have written about one post per month about analytics for the past two years, and I thought it might be interesting to group these posts by categories and themes.

Theme 1: Running an algorithm over data is easy, building a useful AI-based system is quite hard

We train data scientists to transform data into rows and columns and then run an algorithm over the rectangular data, but as the articles below discuss, this is just one of many iterative steps in building a functional and useful AI-based system.

  • Why Didn’t AI Contribute More to COVID-19 Research? (July 14, 2021)
  • How to Navigate the Challenging Journey from an AI Algorithm to an AI Product (Feb 15, 2021)
  • Why Great Machine Learning Models are Never Enough: Three lessons About Data Science from Dr. Foege’s Letter (Oct 12, 2020)
  • Why Do So Many Analytic and AI Projects Fail? (May 11, 2020)
  • Deploying Analytic Models (Nov 4, 2019)

Theme 2: AI/ML Business Models

Developing a successful business model is hard in general, and AI/ML business models are no exception. I was a bit surprised that there were not more articles in this theme (for three years, I taught a course at the University of Chicago Booth School of Business that discussed AI/ML business models as one of its sections), and I intend to write more of these articles during the next two years.

  • Machine Learning vs AI Business Models — What’s New with the Economics of AI? (Nov 12, 2020)
  • Starting a Lean Analytic Start-Up: Four Key Questions to Ask (June 12, 2020)

Theme 3: Organizational Issues About AI/ML

  • Three Reasons All Corporate Boards Need Someone Who Understands Both Analytic Innovation and Analytic Strategy (Jan 14, 2021)
  • Five Steps to Improve the Analytic Maturity of Your Company — 2021 Edition (Dec 14, 2020)
  • Continuous Improvement, Innovation and Disruption in AI (July 6, 2020)
  • Analytic governance and why it matters (April 10, 2020)
  • Improving the Analytic Maturity Level of Your Company (Oct 2, 2019)

Theme 4: Profiles in AI/ML

I also spent some time researching and writing mini profiles of some of the advocates for data-driven decisions, analytics, and systems. These include posts (in alphabetical order) about:

  • W. Edwards Deming (May 10, 2021)
  • William H. Foege (Oct 12, 2020)
  • George Heilmeier (Dec 6, 2019)
  • Frank Knight (Mar 14, 2021)

Other Topics

I also wrote posts about the concept of a grand strategy in analytics and the challenges of crossing the data chasm in analytics. Looking back, I was a bit surprised that I didn’t write more articles about analytic strategy per se, something I intend to rectify over the next two years.

Please contact me via the contact form if you would like me to write about anything specific in the future.

Filed Under: Uncategorized Tagged With: AI Business Models, AI Organizational Issues, AI profiles, AI strategy, AI systems, Data Chasm

Hidden Dangers from Dark Patterns of Data Collection

August 11, 2021 by Robert Grossman

Hidden dangers.

The term “dark patterns” refers to user interfaces that are deliberately designed to mislead users.

A good definition of the term is contained in the article “Shining a light on dark patterns” [1]:

“Dark patterns are user interfaces whose designers knowingly confuse users, make it difficult for users to express their actual preferences, or manipulate users into taking certain actions [1].”

Getting the data you need for modeling is often the hardest part, and some organizations are tempted to use dark patterns to collect it. In Chapter 6 of my Primer on developing an AI strategy, I cover several ways that data can be legitimately collected. There are quite a few dark patterns of data collection; in this post, I look at three of them, in which the user is deliberately misled.

Dark Pattern 1: Promiscuous Data Collection

This is one of the most common dark patterns used for collecting data. There is no standard name for this dark pattern, but I call it promiscuous data collection. A good example is a weather application on your phone that provides the local weather but also collects your location data each day, extracts features from it to characterize your behavior, and then sells your location and behavior data to third parties. Another good example is a game on your phone that collects and sells your location data. The terms of service that allow this are usually, but not always, hidden in the click-through agreement when you install the app. One of the definitions in the Oxford English Dictionary for promiscuous is: “Of an agent or agency: making no distinctions; undiscriminating,” and this certainly describes mobile applications that collect and sell data in this way.

Dark Pattern 2: Collecting Data Through Browser Fingerprinting

Since cookies can be deleted, advertisers and advertising networks have designed other ways to track your behavior. There are about 300,000,000 people in the US, which is about 2^28, so about 28 bits of information are needed to identify someone in the US. Browser fingerprinting, also known as online fingerprinting or device fingerprinting, is the technique in which standard online scripts are used to collect information about the system you are using to browse the web, such as the extensions in your browser, the screen resolution and color depth of the system you are using, your time zone, the language you are using, and so on. With answers to enough questions like these, a “fingerprint” can be formed that uniquely identifies a device used by an individual. For example, the number of bits of information in the user agent string that a browser provides varies, but 10 bits is a good average [2].
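
To make the arithmetic concrete, here is a minimal Python sketch of how the bits add up. Following [2], a characteristic shared by a fraction p of users contributes log2(1/p) bits of identifying information; the fractions below are made-up illustrations, not measured values, and summing the bits assumes the characteristics are independent.

```python
import math

def surprisal_bits(fraction_of_users: float) -> float:
    """Bits of identifying information carried by a characteristic
    shared by the given fraction of users (its self-information)."""
    return -math.log2(fraction_of_users)

# Hypothetical fractions, for illustration only: e.g., roughly 1 in
# 1024 browsers share your user agent string (~10 bits, matching the
# average cited above).
characteristics = {
    "user agent string": 1 / 1024,               # ~10 bits
    "screen resolution + color depth": 1 / 64,   # ~6 bits
    "time zone": 1 / 16,                         # ~4 bits
    "browser extensions": 1 / 256,               # ~8 bits
}

total = 0.0
for name, fraction in characteristics.items():
    bits = surprisal_bits(fraction)
    total += bits  # valid only if the characteristics are independent
    print(f"{name}: {bits:.1f} bits (cumulative: {total:.1f})")

# With ~300,000,000 people in the US, log2(3e8) ~ 28 bits suffice.
print(f"bits needed to single out one US resident: {math.log2(3e8):.1f}")
```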

A good place to learn about browser and device fingerprinting is the privacy education website Am I Unique. When I visited Am I Unique, my operating system, browser (Firefox, Chrome, Safari, etc.), browser version, time zone, and preferred language (all available from the browser when I visited the website) uniquely identified me among the 4.2M fingerprints in their database. Am I Unique collects 23 characteristics of your browser and device, although only five of these were needed to uniquely identify me.

I view browser fingerprinting as a dark pattern because, unlike with cookies, there is less you can do to block it, although this has started to change. You can find a list of tools on the Am I Unique website that provide some protection from browser fingerprinting.

Dark Pattern 3: Dark Digital Twins

Digital twins are “digital replications of living as well as nonliving entities that enable data to be seamlessly transmitted between the physical and virtual worlds [3]”. AI and deep learning are enabling the development of more functional and more powerful digital twins.

By a dark digital twin, I mean an AI application that asks a series of questions or observes your behavior in one context with your consent, and then, through any of several techniques, builds a profile of you and uses this profile in another context without your consent. As a simple hypothetical example, suppose you are interacting with an AI application that gives you wine recommendations, but the information collected is then used by a dark digital twin to target you with vacation packages. Although this is a pretty innocent example, as digital twin profiling technology improves, the use of dark digital twins is likely to become more and more disconcerting.

These days, it is more common for a machine learning or AI model built with user consent for one purpose to be reused for another purpose without adequate user consent, but it is only a matter of time before the models become sophisticated enough to start acting as digital twins.

References

[1] Jamie Luguri and Lior Jacob Strahilevitz. “Shining a light on dark patterns.” Journal of Legal Analysis 13, no. 1 (2021): 43-109.

[2] Peter Eckersley, A Primer on Information Theory and Privacy, Electronic Frontier Foundation, 2010, retrieved from https://www.eff.org/deeplinks/2010/01/primer-information-theory-and-privacy on July 7, 2021.

[3] Abdulmotaleb El Saddik, “Digital twins: The convergence of multimedia technologies.” IEEE MultiMedia 25, no. 2 (2018): 87-92.

Filed Under: Uncategorized Tagged With: AI, browser fingerprinting, dark digital twins, dark patterns, digital twins, machine learning, promiscuous data collection

Why Didn’t AI Contribute More to COVID-19 Research?

July 14, 2021 by Robert Grossman

Figure 1: The staircase of data understanding.

Over the last several months, a number of reports and technical publications have appeared describing the lack of success applying machine learning and AI to COVID-19 research. Although many people are surprised, the readers of this blog should not be. Getting data is hard, understanding data is hard, building models is hard, embedding AI models in systems is hard, and extracting value from AI systems is hard. This is one of the themes of this blog and why there is a big difference between applying an AI algorithm to a dataset and developing a successful AI application.

BMJ review of 232 predictive models

If you haven’t seen any of these reports, a good place to start is the review article that appeared in the April 2021 issue of BMJ, which analyzed 169 studies describing 232 COVID prediction models [1]. The title of the article, “Prediction models for diagnosis and prognosis of covid-19: systematic review and critical appraisal,” gives a good idea of what is covered. What the authors found is that, of the 232 predictive models,

  1. “[the] models are poorly reported and at high risk of bias, raising concern that their predictions could be unreliable when applied in daily practice.”
  2. “Two prediction models (one for diagnosis and one for prognosis) were identified as being of higher quality than others and efforts should be made to validate these in other datasets.”

In other words, of the 232 predictive models studied, all but two could be unreliable when applied in clinical practice, and the remaining two still needed to be validated with data from additional datasets. This is not good news, but it is not surprising either.

Turing Institute Report

The Alan Turing Institute is a UK research institute focused on data science and AI with the tag line “We believe data science and artificial intelligence will change the world.” (So do I, by the way, but I also believe that this data driven revolution has been going on for the past thirty years.) The Turing Institute ran a series of workshops on COVID-19 research and produced a summary report [2]. Here are some of the conclusions of the report:

  1. “… the single most consistent message across the workshops was the importance – and at times lack – of robust and timely data. Problems around data availability, access and standardisation spanned the entire spectrum of data science activity during the pandemic. The message was clear: better data would enable a better response.”
  2. “… conventional data analysis has been at the heart of the COVID-19 response, not AI”.

Perhaps most noteworthy is what the report didn’t say. It didn’t describe successful AI tools and their impact, but rather the challenges of getting data, the difficulties of standardizing the data for analysis, and the many issues related to inequality and inclusion. It also described the “difficulty of communicating transparently with other researchers, policy makers and the public, particularly around issues of modelling and uncertainty [2].”

Don’t be Surprised When Black Box Frameworks Fail to Bring Value

Why should readers of this blog not be surprised? Several of the themes of this blog are about 1) the difficulties of getting data (Chapter 6, Crossing the Data Chasm, of my Primer), building models, deploying models in AI systems, and extracting value from AI systems, and 2) how to develop analytic strategies and implement best practices to overcome these difficulties. For example, my post “Why do so many analytic and AI projects fail?” suggests using a simple radar plot to track over time the increase or decrease in the likelihood that an AI project will fail.
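
As a rough illustration of that idea, here is a minimal matplotlib sketch of such a radar plot; the risk dimensions and the scores below are hypothetical stand-ins, not the ones used in the post.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical risk dimensions for an AI project; the actual dimensions
# are discussed in the post referenced above.
dimensions = ["data access", "data quality", "model performance",
              "deployment path", "stakeholder buy-in"]
# Scores at two points in time (0 = low risk, 5 = high risk); made up.
q1_scores = [4, 3, 2, 4, 3]
q3_scores = [2, 3, 2, 3, 1]

angles = np.linspace(0, 2 * np.pi, len(dimensions), endpoint=False).tolist()
angles += angles[:1]  # close the polygon

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
for label, scores in [("Q1", q1_scores), ("Q3", q3_scores)]:
    values = scores + scores[:1]
    ax.plot(angles, values, label=label)
    ax.fill(angles, values, alpha=0.1)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(dimensions)
ax.set_title("Project failure risk over time (illustrative)")
ax.legend()
plt.show()
```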

Today, we have good tools and software frameworks for building AI models, but it still requires a lot of work and a good understanding of the practice of analytics to build and deploy a useful AI system. As Figure 1 shows, one level of understanding comes from cleaning and preparing data for modeling, another from building a model with the data, another from deploying the model, and a still higher level when you can extract value from the model. Even for skilled modelers, data leakage is always a problem and can be quite hard to track down. Often, you do not understand how to clean, prepare, and harmonize data appropriately until you have built a model. Similarly, it can be hard to determine what features are actually available for a model running in the field until you build and deploy your first system that uses the model. Almost always, you do not understand how to build the right model, deploy the model correctly, and create the right actions around the model’s output until you see the system performing in practice, understand what works and what doesn’t, and start again.
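
As a small illustration of how easily leakage creeps in, here is a classic preprocessing mistake, sketched with scikit-learn: fitting a scaler on all of the data before the train/test split, so test-set statistics contaminate the training features. In this toy example the effect on the score is small, but the pattern is what matters.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_informative=5, random_state=0)

# Leaky pattern: the scaler is fit on *all* rows, so statistics from
# the test set leak into the features the model trains on.
X_scaled = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_scaled, y, random_state=0)
leaky_score = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)

# Correct pattern: split first, then fit the scaler on the training
# split only and apply it to both splits.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_tr)
clean_score = (LogisticRegression()
               .fit(scaler.transform(X_tr), y_tr)
               .score(scaler.transform(X_te), y_te))

print(f"leaky: {leaky_score:.3f}  clean: {clean_score:.3f}")
```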

With deep learning frameworks such as PyTorch, Keras, and TensorFlow, building models does not require engineering features. You can take someone else’s data to use in transfer learning and someone else’s modeling software, and treat both as black boxes. With this approach, you shouldn’t be surprised when the model, once deployed in the real world, doesn’t bring value. The staircase of data understanding (Figure 1) is about the real-world practice of analytics and the engineering of analytic systems, yet most of what we teach these days is machine learning and AI algorithms over someone else’s data.
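
For concreteness, here is a minimal PyTorch sketch of that black-box pattern (assuming torchvision 0.13 or later for the weights API): a model pretrained on someone else’s data is frozen and given a new head, with no feature engineering anywhere.

```python
import torch
import torch.nn as nn
from torchvision import models

# Someone else's model, trained on someone else's data (ImageNet),
# treated as a black box and frozen.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for p in backbone.parameters():
    p.requires_grad = False

# Swap in a new head for our own task (say, 2 classes); this is the
# only part that will be trained.
backbone.fc = nn.Linear(backbone.fc.in_features, 2)

optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
x = torch.randn(8, 3, 224, 224)       # stand-in for a batch of images
y = torch.randint(0, 2, (8,))
loss = nn.CrossEntropyLoss()(backbone(x), y)
loss.backward()
optimizer.step()
# Nothing in this loop tells you whether the deployed model will bring
# value in the field -- which is exactly the point above.
```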

References

[1] Laure Wynants, Ben Van Calster, Gary S. Collins, Richard D. Riley, Georg Heinze, Ewoud Schuit, Marc MJ Bonten et al. “Prediction models for diagnosis and prognosis of covid-19: systematic review and critical appraisal.” British Medical Journal 369 (2020).

[2] Alan Turing Institute, Data science and AI in the age of COVID-19. Reflections on the response of the UK’s data science and AI community to the COVID-19 pandemic, retrieved from: www.turing.ac.uk (https://www.turing.ac.uk/research/publications/data-science-and-ai-age-covid-19-report) on July 2, 2021.

Filed Under: Uncategorized Tagged With: AI COVID-19 models, AI models vs AI systems, black box AI frameworks, COVID-19 prediction models, COVID-19 research, data leakage, data science for COVID-19 research, practice of analytics

Crossing the Data Chasm for Cancer Research

June 15, 2021 by Robert Grossman


The Cancer Moonshot Program was announced in the State of the Union address on January 12, 2016. It was funded by the 21st Century Cures Act, which was passed by the U.S. Senate on December 7, 2016 and signed by President Obama on December 13, 2016. The act provided NIH an additional $1.8 billion over seven years in supplemental funding for Cancer Moonshot projects and initiatives. The project was led by Joe Biden, who was then Vice President of the US.

Data and data sharing played an important role in the Cancer Moonshot strategy. As we approach the fifth anniversary of the passage of the 21st Century Cures Act, it might be a good time to look back at components of the underlying strategy from an analytic strategy point of view.

With five years of effort behind us, it should be easier to see what worked well, what worked less well, and how we might fine-tune the current and planned activities.

As described on the National Cancer Institute’s Cancer Moonshot website [1], “The Cancer Moonshot has three ambitious goals: to accelerate scientific discovery in cancer, foster greater collaboration, and improve the sharing of data.”

As described in the White House’s announcement [2]: “Here’s the ultimate goal: To make a decade’s worth of advances in cancer prevention, diagnosis, and treatment, in five years.” A critical element of the strategy was to share cancer-related data in an effort to accelerate research.

In analytics and AI, the biggest challenge you face is usually not building a model, but collecting or acquiring the data you need to build the model. I call this crossing the data chasm, and one of the levers the Cancer Moonshot used was to require funded projects to share data. You can learn more about crossing the data chasm and its role in an analytic strategy in my Primer on Analytic Strategy.

Today as we plan the next five years, it is important to look at how we can leverage data sharing to continue the goal of accomplishing in five years what would normally take ten years.

The good news is that more and more data is being shared when research is funded with federal dollars. As an example, the NCI has developed a data sharing policy for all Cancer Moonshot funded projects [3]. The not-so-good news is that data sharing is often not required when research is funded by private foundations and philanthropy, and data is rarely shared when research is funded by industry. Given the size and complexity of the cancer industry, the amount of money at stake, the competitiveness of scientists and cancer centers, the challenges of deidentifying research data, and the legal risk of sharing human subject data collected by healthcare providers, most cancer data is not shared, and there is much less progress as a result.

An exception is pediatric cancer. Fortunately, pediatric cancer is relatively rare compared to adult cancers. With fewer cases, there is usually not enough pediatric cancer data at any one research center to provide the number of cases required for research projects, unless data is shared [4].

Some success stories

One of the success stories of the Cancer Moonshot is the BloodPAC Consortium, whose mission is to accelerate the development, validation, and accessibility of liquid biopsy assays to improve the outcomes of patients with cancer. The BloodPAC Consortium is now an independent 501(c)(3) organization with over 50 members organized into a number of working groups, and it operates the BloodPAC Data Commons to support data sharing among its members [5].

Other cancer data sharing success stories include the NCI Genomic Data Commons, AACR’s Project GENIE, and ASCO’s CancerLinQ. Each of these three projects provides large-scale data sharing that has accelerated cancer research and resulted in many significant publications.

Why is data sharing so hard?

The first question to ask is: “Why is data sharing so hard?” There are a number of reasons, but probably the most important are the following:

  1. It’s difficult and time-consuming for researchers to prepare data and to submit it for data sharing. Often researchers have moved on to new experiments, new analyses, and writing new papers, and sharing data from their last project always remains on their “B-list” of items to do.
  2. Investigators must balance the potential public good and benefit to patients when data is shared against the loss of momentum to their own careers if others publish results from their data that, given more time before sharing, they could have published themselves.
  3. Data is often not collected with consents that make it easy to share.
  4. Data sharing is often not required except for federally funded research, and even for federally funded research it is often not enforced [6].
  5. There is a risk that shared data contains sensitive information, or that third-party data sources can be used to re-identify some subjects in large shared datasets [7].
  6. Often, the computing infrastructure required for data sharing is not funded, so data sharing becomes an “unfunded mandate.”

What can change?

Perhaps the most important question we can ask is: “What can change?”

  1. Cancer research organizations can form data sharing coalitions to share data around specific cancers of interest in order to accelerate research. For example, several NCI Comprehensive Cancer Centers could self-organize and share data to achieve the critical mass needed to study cancers that would be harder to study with just their own patients. Although there are already several projects like this, in practice just a small fraction of the potential data is being shared in this way.
  2. Research projects can be collaborative with patients, and data sharing technologies such as Blue Button can be used so that patients themselves can directly share their data. This is sometimes called patient-partnered research, and for new research projects it is by far the preferred approach. For this to work, healthcare providers must make it easier for patients to share their data using Blue Button and other data sharing technologies.
  3. We can improve the data ecosystems that link together local cancer information collected by different projects and support federated learning (sketched after this list). This way, cancer data that cannot be easily shared with others can stay within the security and compliance boundaries of the organization that provides healthcare services, yet still be shared effectively with the broader research community to accelerate research. This might be a good project for the proposed ARPA-Health (ARPA-H).
  4. We can reduce the liability and fines associated with the inadvertent disclosure of health information, and create insurance pools to pay those fines, in order to protect research organizations that use best practices to protect data but whose health information is still exposed.
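
As a sketch of the idea behind point 3, here is a minimal federated averaging (FedAvg) loop in PyTorch: each site trains a copy of the model on its own data behind its own boundary, and only the model weights travel to a coordinator. This illustrates the general technique, not any particular production system, and the three "sites" below are hypothetical.

```python
import copy
import torch
import torch.nn as nn

def local_update(global_model, data, target, epochs=1, lr=0.01):
    """One site trains a copy of the global model on its own data.
    Only the resulting weights -- never the data -- leave the site."""
    local = copy.deepcopy(global_model)
    opt = torch.optim.SGD(local.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        nn.MSELoss()(local(data), target).backward()
        opt.step()
    return local.state_dict()

def federated_average(state_dicts):
    """The coordinator averages the sites' weights (FedAvg)."""
    with torch.no_grad():
        return {key: torch.stack([sd[key] for sd in state_dicts]).mean(dim=0)
                for key in state_dicts[0]}

# Toy setup: three hypothetical centers, each holding local data that
# never crosses its security and compliance boundary.
global_model = nn.Linear(10, 1)
sites = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(3)]

for _ in range(5):  # communication rounds
    local_weights = [local_update(global_model, x, y) for x, y in sites]
    global_model.load_state_dict(federated_average(local_weights))
```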

I’ll be returning to these four topics from time to time in future posts.

Disclaimers

I’m actively involved with both the BloodPAC Consortium and the NCI Genomic Data Commons.

References

[1] National Cancer Institute, Cancer Moonshot, retrieved from: https://www.cancer.gov/research/key-initiatives/moonshot-cancer-initiative on June 1, 2021.

[2] The White House, President Barack Obama, Join the Vice President’s Cancer Moonshot, retrieved from https://obamawhitehouse.archives.gov/cancermoonshot on June 1, 2021.

[3] NCI Cancer Moonshot Public Access and Data Sharing Policy, retrieved from https://www.cancer.gov/research/key-initiatives/moonshot-cancer-initiative/funding/public-access-policy on June 1, 2021.

[4] Samuel L. Volchenboum, Suzanne M. Cox, Allison Heath, Adam Resnick, Susan L. Cohn, and Robert Grossman, Data Commons to Support Pediatric Cancer Research, American Society of Clinical Oncology Educational Book 2017:37, 746-752

[5] Robert L. Grossman, Jonathan R. Dry, Sean E. Hanlon, Donald J. Johann, Anand Kolatkar, Jerry SH Lee, Christopher Meyer, Lea Salvatore, Walt Wells, and Lauren Leiman. “BloodPAC Data Commons for liquid biopsy data.” JCO Clinical Cancer Informatics 5 (2021): 479-486.

[6] Tammy M. Frisby and Jorge L. Contreras. “The National Cancer Institute Cancer Moonshot Public Access and Data Sharing Policy—Initial assessment and implications.” Data & Policy 2 (2020).

[7] Luc Rocher, Julien M. Hendrickx, and Yves-Alexandre de Montjoye. “Estimating the success of re-identifications in incomplete datasets using generative models.” Nature Communications 10, no. 1 (2019): 1-9.

Filed Under: Uncategorized Tagged With: analytic strategy, BloodPAC, Blue Button, Cancer Moonshot, cancer research, data commons, data ecosystems, data sharing, data sharing strategy
