McCombs Researcher Finds Meaning in a Vast Sea of Data

 

Takeaway

  • ‘Discrete outcomes’ are counted in whole numbers, like the number of flu cases diagnosed at a hospital or the number of tweets about a certain product
  • ‘Continuous outcomes’ such as speed and distance are measured in arbitrarily small increments
  • Scott’s NSF-funded research develops new ways to fit statistical models that predict discrete outcomes

When he first saw the nearly half-million-dollar email, James Scott wasn’t supposed to be checking his cell phone.

Coming back from a family trip to Northern Ireland, Scott, an assistant professor of statistics in the Department of Information, Risk, and Operations Management, was waiting in the passport control hall at Dallas/Fort Worth International Airport. The McCombs professor had been away for several weeks, and hadn’t kept up with email, so he was “cheekily” scanning his inbox.

That’s when Scott saw the notification informing him that he was being recommended for a National Science Foundation CAREER award, the most prestigious award given to junior faculty, along with $400,000 to further his research. “I very much had to contain my excitement in the passport check line, because we weren’t supposed to be using mobile phones,” Scott says.

It’s difficult not to share his excitement. Scott won the NSF’s CAREER award for his project, “Bringing Richly Structured Bayesian Models into the Discrete-Data Realm via New Data-Augmentation Theory and Algorithms.” Translation? Scott, who has also been honored for his teaching at McCombs, will work on creating software that solves data-analysis problems that have frustrated both industry professionals and researchers, such as forecasting disease rates in a specific location.

To learn more, we asked Scott about his award-winning research.    

How will the NSF award further your research?

Part of the nature of statistical research is that you try to identify features that are common to a wide class of problems. Once that kind of abstract structure is recognized, and once you have software that exploits it, it becomes part of the ecosystem. It becomes a tool that a practitioner in any area of science or industry can take off the shelf and apply to their problems. Fundamentally, the grant money pays for computers and graduate research assistants to help me do all that!

How would you explain your research?

There's a fundamental distinction in statistics between "discrete" and "continuous" outcomes. Marbles are discrete: we count them on our fingers and toes. Speed and distance are continuous, since we measure them in arbitrarily small increments. It turns out, for mathematical reasons not worth going into, that it’s much harder to fit statistical models for discrete outcomes.

That’s a bit counterintuitive: you might think, "What could be simpler than a yes or no outcome, or something you could count?" But that’s not the case, and the long history of statistics, going back at least to the early 18th century, has been shaped by the fact that it's much easier to do mathematics with continuous variables. As a field, we've sort of looked for our keys where the mathematical streetlight is shining. The same is true of more recent history, where everything has been driven by computers. It's much harder to find algorithms that will handle discrete outcomes.

Well, in a lot of the very large data sets that people are dealing with today, the variables are discrete. How many patients showed up at the hospital yesterday complaining of flu? How many tweets yesterday mentioned some brand of cereal or running shoes? Now how do these outcomes depend systematically upon other sources of information, like time of year or demographics? Our statistical modeling language for answering these kinds of questions is very rich and expressive, but not our computational abilities. My work is trying to help bridge that gap.
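To make that concrete, here is a minimal sketch of the standard textbook approach to this kind of question: a Poisson regression relating a discrete count outcome (daily flu cases) to a covariate (time of year), fit by maximum likelihood in Python with NumPy. Everything in it is hypothetical, including the seasonal covariate, the coefficients, and the simulated counts, and it illustrates the general modeling problem Scott describes, not his own algorithms.

```python
# Hypothetical illustration, not Scott's method: a Poisson regression,
# the standard statistical model for discrete count outcomes.
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: one year of daily flu-case counts whose underlying
# rate peaks in winter. Covariate and coefficients are made up.
days = np.arange(365)
season = np.cos(2 * np.pi * days / 365.0)      # time-of-year covariate
X = np.column_stack([np.ones(365), season])    # intercept + season
true_beta = np.array([2.0, 0.8])               # assumed "true" effects
counts = rng.poisson(np.exp(X @ true_beta))    # discrete outcome: counts

# Maximize the Poisson log-likelihood by Newton's method (the classic
# iteratively reweighted least squares algorithm for count models).
beta = np.array([np.log(counts.mean()), 0.0])  # stable starting point
for _ in range(25):
    rate = np.exp(X @ beta)                    # expected count each day
    score = X.T @ (counts - rate)              # gradient of log-likelihood
    fisher = X.T @ (rate[:, None] * X)         # Fisher information matrix
    beta += np.linalg.solve(fisher, score)     # Newton update

print("estimated coefficients:", beta)         # should land near [2.0, 0.8]
```

A point estimate like this is the easy version of the problem. The richly structured Bayesian models in Scott’s project ask for full probability distributions over quantities like these coefficients, and that is exactly where discrete outcomes become computationally hard.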

Does ‘big data’ play a role in your work?

Absolutely. Historically, statisticians have worried about the mathematical efficiency of procedures: How much data do I need in order to get a decent answer? But in a lot of problems, we’re no longer data-limited; we are computer-limited, we are algorithm-limited, and we are limited by the ability of people to figure out how to make their fancy methods run on new computational infrastructure. Of course, questions about mathematical efficiency are still important, and the era of "big data" certainly doesn’t mean that all of our old insights and all of our old body of knowledge are irrelevant. As the data sets get bigger, the questions you can ask get richer and more complex and more nuanced.

This article originally appeared on the McCombs TODAY website.

 

