Exploring the Data Frontier • SC23

August 7, 2023

Since earning an A.B. in Mathematics from Harvard and a Doctorate in Applied Mathematics from Princeton, Robert Grossman has spent more than 30 years working in data science, machine learning, big data, and data-intensive computing. Today, he is the Frederick H. Rawson Distinguished Service Professor in Medicine and Computer Science, the Jim and Karen Frank Director of the Center for Translational Data Science, and the Chief of the Section of Biomedical Data Science in the Department of Medicine at the University of Chicago. He is also Chair of the Open Commons Consortium, a not-for-profit that manages and operates cloud computing infrastructure to support scientific, medical, health care, and environmental research.

Decades of Expertise & Insight

Throughout his career, Grossman has authored or co-authored more than 200 publications and pioneered or contributed to an array of advancements in data management and analysis in computing, so he has many interesting stories about using HPC to advance data science.

As much of Grossman’s work focuses on health-care-related subject matter, and August is National Wellness Month, it seemed an ideal time to sit down and discuss some highlights from his career, including gleaning some insights about the fast-evolving use of HPC in healthcare. And, for those who want to dig a little deeper, Grossman provided links to several papers that capture noteworthy moments of impact in data science and beyond.

Robert Grossman, PhD

Frederick H. Rawson Distinguished Service Professor in Medicine and Computer Science, Jim and Karen Frank Director of the Center for Translational Data Science, Chief of the Section of Biomedical Data Science (Department of Medicine), University of Chicago; Chair, Open Commons Consortium

Q: How did you get started using HPC systems in your work?

Grossman: Beginning in the mid to late 1980s, I worked on the problem of how you can manage and analyze large amounts of scientific data. Back then, this was not a popular topic, and HPC was focused on using large machines for simulations and other specialized tasks.

I was interested in what types of questions you could answer and problems that you could solve by creating a database of scientific objects and then querying the database to make discoveries or to answer basic questions. This was a variant of the basic trade-off in algorithms of trading space for time. For example, if you wanted to quickly answer a question about a flow of a differential equation or control system, perhaps you could simply answer the question with a sequence of simple queries to a database that stored trajectory segments of the system. [1] [2]
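The space-for-time idea can be sketched in a few lines of Python. This is an illustration only, not the data structures described in [1] or [2]: it assumes a hypothetical one-dimensional system dx/dt = -x, a hypothetical grid of starting points, and a plain dict standing in for the database. Each stored segment records where a grid point lands after one fixed time step, so a longer-horizon question is answered by chaining lookups rather than by integrating the equation again.

```python
import math

DT = 0.1       # duration of one stored trajectory segment (assumed value)
GRID = 0.001   # spacing between stored starting points (assumed value)


def flow(x, t):
    """Exact flow of the toy system dx/dt = -x, used only to fill the database."""
    return x * math.exp(-t)


def key(x):
    """Map a state to the index of the nearest stored grid point."""
    return round(x / GRID)


def build_database(x_min, x_max):
    """Precompute one segment per grid point: start index -> state after DT."""
    db = {}
    for k in range(key(x_min), key(x_max) + 1):
        db[k] = flow(k * GRID, DT)
    return db


def query(db, x0, steps):
    """Answer 'where is the system after steps * DT?' with chained lookups."""
    x = x0
    for _ in range(steps):
        x = db[key(x)]   # one database lookup replaces one integration step
    return x


db = build_database(0.0, 2.0)
approx = query(db, 1.0, 10)   # ten lookups, no integration at query time
exact = flow(1.0, 1.0)        # reference value for comparison
```

The database costs memory proportional to the number of grid points, and each lookup snaps the state to the nearest grid point, so a finer grid buys accuracy at the price of more storage.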

A few years later, software infrastructure planning for the Superconducting Super Collider (SSC) was beginning. One of the challenges for the SSC was how to manage and analyze all the data it would generate. The SSC was designed to produce petabytes of data, which was a million times larger than the data that was routinely analyzed at that time. Drew Baden from the University of Maryland and I proposed what was then a radical approach: create a distributed database of all the events produced by the SSC. [3]

This led to several computer science challenges: 1) how do you create very large databases of distributed scientific data; 2) how do you transfer large scientific datasets over wide area networks; and 3) what are the algorithms to process, explore, analyze, and share the data in these databases?

These three questions are now familiar SC topics, and I ended up working on various aspects of these topics for much of the past 30 years.

Q: What are some of the key projects you have worked on through the years?

Grossman: I have worked on a number of projects over the past 30 years, so I’ll just give you highlights from a few that seem to have had a notable impact, including the Petabyte Access and Storage Solution (PASS) project, the National Scalable Cluster Project (NSCP), and a startup I founded called Magnify.

The PASS project, which the DOE [Department of Energy] funded, was born out of an effort to develop what was effectively big data technology for SSC with the hope that the technology could be used more generally. Collaborating with several institutions, we tackled the “foreseen data access problems of the next generation of scientific experiments…characterized by a large sample of complex event data (~10¹⁵ bytes), a dilute signal, and a large (~1000) and geographically distributed user community.” [4] Keep in mind that in the early 1990s, 1 TB of disk cost roughly $750,000. So, at the time, 1 PB of disk would have been approximately $750 million.

The PASS project led us to prototype three different technology approaches: scaling relational databases, scaling object-oriented databases, and developing object stores for managing large collections of data objects. My team eventually focused on the last of these, developing an open-source lightweight manager for large data object collections called PTool. Although its design sparked controversy, it was efficient, scalable, and enabled access to large amounts of data (it scaled 10x-100x larger than the first two prototype options). The SSC was canceled, but our core software design became generally accepted and was mostly picked up by the Large Hadron Collider (LHC) that was built and is operated by CERN.

After the SSC’s cancellation, we applied for an NSF [National Science Foundation] grant to develop a PASS-like software architecture for broad scientific use. The result was the NSCP, an early example of what later became known as “the grid.” [5] The NSCP-1 Meta-Cluster was completed in 1996 and interoperated three geographically distributed clusters. The first NSCP Meta-Cluster contained approximately 100 nodes and 3 TB of disk geographically distributed among the participating sites and was connected by laboratory, campus, and national ATM [asynchronous transfer mode] networks.

At the time, NSCP used PTool for data management to create what would now be called a data warehouse (at the time, we called it a lightweight object manager). Around the same time, I developed an open-source software tool called PSockets that used parallel TCP [Transmission Control Protocol] connections to increase bandwidth and move PTool-managed data over wide area networks. [6] PSockets and its successors ended up winning several bandwidth challenges at SC conferences.
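The idea behind parallel TCP connections, striping one payload across several sockets and reassembling it at the receiver, can be sketched as follows. This is not the PSockets API: the function names are hypothetical, and local `socket.socketpair()` pairs stand in for TCP connections across a wide area network so that the sketch runs on a single machine.

```python
import socket
import threading


def send_stripe(sock, chunk):
    """Send one stripe, then close so the receiver sees end-of-stream."""
    sock.sendall(chunk)
    sock.close()


def recv_stripe(sock, out, index):
    """Read one stripe until end-of-stream and store it at its position."""
    parts = []
    while True:
        data = sock.recv(65536)
        if not data:
            break
        parts.append(data)
    sock.close()
    out[index] = b"".join(parts)


def striped_transfer(payload, n_streams):
    """Split payload into contiguous stripes and move them in parallel."""
    size = -(-len(payload) // n_streams)   # ceiling division
    stripes = [payload[i * size:(i + 1) * size] for i in range(n_streams)]
    received = [b""] * n_streams
    threads = []
    for i, chunk in enumerate(stripes):
        tx, rx = socket.socketpair()       # stand-in for one TCP connection
        threads.append(threading.Thread(target=send_stripe, args=(tx, chunk)))
        threads.append(threading.Thread(target=recv_stripe, args=(rx, received, i)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return b"".join(received)              # reassemble in stripe order


payload = bytes(range(256)) * 4000         # about 1 MB of test data
result = striped_transfer(payload, n_streams=4)
```

On a single machine this shows only the mechanics, not a speedup. Over a long, high-bandwidth path, each connection is limited by its own TCP window, which is why several connections in parallel can carry more than one.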

In 1996, I founded a startup called Magnify to take the lessons learned from the PASS project and apply them to financial services and online advertising. We used commodity clusters of workstations along with PTool and similar techniques to manage out-of-memory data and build and “glue together” ensembles of machine learning models in a simple, elegant way. We also developed specialized software applications for deploying these models into operational environments. We came across the idea of using ensembles on our own and met quite a bit of resistance from statisticians and others who objected to the approach. Although others had also begun to use ensembles in machine learning and other areas, they were not well known at that time. Looking back, the idea of an ensemble is a simple and obvious one, which dates back in some form to Condorcet’s jury theorem from the 18th century.
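The link between ensembles and Condorcet's jury theorem can be made concrete with a short calculation. The sketch below is not Magnify's method. It assumes an idealized ensemble of n models that err independently and are each correct with the same probability p, which real models rarely satisfy, and it computes how often a simple majority vote is correct. The function name is hypothetical.

```python
from math import comb


def majority_accuracy(n, p):
    """Probability that a majority of n independent voters is correct,
    when each voter is correct with probability p. Requires odd n."""
    if n % 2 == 0:
        raise ValueError("use an odd n so that there are no ties")
    # Sum the binomial probabilities of more than half the voters being right.
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))


# Majority-vote accuracy for ensembles of increasing size at p = 0.6.
for n in (1, 3, 11, 101):
    print(n, round(majority_accuracy(n, 0.6), 3))
```

With p = 0.6, three voters already reach 0.648, and the accuracy keeps rising as n grows. With p below 0.5 the effect reverses, which is why an ensemble helps only when its members are individually better than chance.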

Q: Let’s shift gears and talk about your current health- and wellness-related work with the Genomic Data Commons (GDC). What is the GDC, and why has it been so effective?

Grossman: We were awarded a contract from the National Cancer Institute in 2014 to develop the Genomic Data Commons, which was launched in 2016 by then Vice President Biden as part of the Cancer Moonshot. Today, the GDC is used by over 60,000 researchers each month and, on average, over 2 PB of data are accessed or downloaded each month. [7]

Robert Grossman speaking to then Vice President Joe Biden at the launch of the Genomic Data Commons on June 6, 2016.

Importantly, we curate and harmonize all the data submitted to the GDC by running a common set of bioinformatics pipelines. [8] The importance of curating and harmonizing the data to build a successful data platform cannot be overemphasized. [9]

Building effective data platforms has always been about choosing the right set of questions to optimize, so users have efficient access to the data they need and a simple intuitive experience using the data platform. Sometimes, we are more successful with this approach than other times, but it is always one of the criteria in mind when developing new data systems. Jim Gray eloquently summarized this approach with his advice: Give me your 20 most important questions you would like to ask of your data system, and I will design the system for you. [10]

Given the value of the GDC to researchers, we have continued iterating the technology for wider use with the goal of making it better and easier to replicate. Our Gen3 version is open source, and today there are over 20 Gen3 data commons built by ourselves and others supporting research in cardiovascular disease, COVID, infectious diseases, irritable bowel disorder, opioid use disorder and pain management, and other diseases—in addition to cancer.

Our goal is to reach 200 Gen3 data commons by working with partners and by developing Gen3 as a Service.

Q: Where do you see the future of HPC in the next decade, particularly in relation to health care and wellness?

Grossman: The role of HPC will continue to be important in simulations that are the basis for understanding molecular interactions; identifying candidate drug targets; building agent-based simulations; creating digital twins of cells, tissues, organs and living systems; and a variety of other biomedical applications.

On the other hand, for many important problems in biology, medicine, and health care, we are not compute-limited, but rather data-limited. The HPC challenge is to build the data platforms that can manage, explore, analyze, and share biomedical data at the scale needed and with the governance, security, and compliance required so we can tease out interesting small effects.

Q: What are some important lessons you’ve learned throughout your career in HPC that you believe would benefit those just starting out in the field?

Grossman: Just as there is no substitute for actually programming, there is no substitute for actually exploring data, understanding data, building multiple models over the data, and understanding the advantages and disadvantages of the different models. Today, it is so easy to use software to build a model that you can fall into the trap of building a model that has serious problems because you don’t understand the data. In other words, the wonderful power of software today to build statistical, machine learning, and other models with such little effort can fool us into believing we understand the data when, in practice, this often requires a lot of work and effort.

It’s also easy to forget the wonderful flexibility researchers have at universities. They can tackle important, ambitious problems and fail. All that is required is that you succeed from time to time and then talk and write about your successes. There are very few other careers where you have this level of flexibility.

Learn More:

[1] R. Grossman, “Querying databases of trajectories of differential equations: Data structures for trajectories,” NAS 1.26:185040, Jun. 1989. Accessed: Jun. 21, 2023. [Online]. Available: https://ntrs.nasa.gov/citations/19890016401.

[2] R. Grossman, “Querying databases of trajectories of differential equations. I. Data structures for trajectories,” in Twenty-Third Annual Hawaii International Conference on System Sciences, IEEE Computer Society, 1990, pp. 18–24. doi: 10.1109/HICSS.1990.205171.

[3] D. Baden and R. Grossman, “A model for computing at the SSC (Superconducting Super Collider),” Superconducting Super Collider Lab., Dallas, TX (United States), SSCL-288, Jun. 1990. doi: 10.2172/6515278.

[4] D. R. Quarrie, C. T. Day, and S. Loken, The PASS project: A progress report. 1994. doi: 10.2172/10172158.

[5] I. Foster and C. Kesselman, The Grid 2: Blueprint for a new computing infrastructure. Elsevier, 2003. doi: 10.1016/B978-1-55860-933-4.X5000-7.

[6] H. Sivakumar, S. Bailey, and R. L. Grossman, “PSockets: The case for application-level network striping for data intensive applications using high speed wide area networks,” in SC’00: Proceedings of the 2000 ACM/IEEE Conference on Supercomputing, IEEE, 2000, pp. 38–38. doi: 10.1109/SC.2000.10040.

[7] A. P. Heath et al., “The NCI Genomic Data Commons,” Nat. Genet., pp. 1–6, Feb. 2021, doi: 10.1038/s41588-021-00791-5.

[8] Z. Zhang et al., “Uniform genomic data analysis in the NCI Genomic Data Commons,” Nat. Commun., vol. 12, no. 1, Art. no. 1, Feb. 2021, doi: 10.1038/s41467-021-21254-9.

[9] R. L. Grossman, “Ten lessons for data sharing with a data commons,” Sci. Data, vol. 10, no. 1, Art. no. 1, Mar. 2023, doi: 10.1038/s41597-023-02029-x.

[10] A. S. Szalay, “Jim Gray, astronomer,” Commun. ACM, vol. 51, no. 11, pp. 58–65, 2008. doi: 10.1145/1400214.1400231.
