Interview: Rick Stevens, Computing and Life Sciences Directorate Lead, Argonne National Laboratory

May 15, 2007

David Geer, ITworld.com

David Geer recently spoke with Rick Stevens, Computing and Life Sciences Directorate Lead at the Argonne National Laboratory and an internationally recognized expert who helps drive the national agenda on computing. Following is an edited transcript of that conversation.

David Geer: Today we're talking with Rick Stevens, Computing and Life Sciences Directorate Lead at the Argonne National Laboratory and an internationally recognized expert who helps drive the national agenda on computing. Hello, Rick.

Rick Stevens: Hi, David. Nice to be with you today.

Geer: Briefly describe your current work on petaflops.

Stevens: Well, let's see. At the sort of bottom of what we're doing, we're standing up very large supercomputers. We're in the process of acquiring a next-generation IBM Blue Gene system, which will be up and running later this year, and that has an architecture that will support petaflop-scale computing -- petascale computing. And then in addition to the hardware, our lab has been developing operating system software, system software, tools, file systems, that scale so we can run them on machines with hundreds of thousands of processors, which this machine will have. And then layering on top of that is a broad class of applications, ranging from basic physics to the modeling of nuclear reactors to the design of nano materials to the analysis of genomes to the screening of new drug compounds, and many other areas. So, it's probably good to think of this as sort of a stack -- not that different from the software stack that's on a laptop or a PC, except that in our case we're talking about hundreds of thousands of cores in the system.

Geer: What's most fascinating about your work as you've described it here?

Stevens: Well, what keeps it interesting is the computer architecture trajectory that we're on, or think of it more as the general IT technology trajectory, driven by Moore's Law and associated things, which we're trying to exploit to get ever more capable systems for doing science. So every day, in some sense, our job is to figure out what's next. What's the next class of architecture, what's the latest improvement in process technologies that will create the next turn of chips? And how do we harness that capability, that sort of underlying stuff, to do more interesting science?

And what's interesting about this is that if you go back 15, 20 years, the community has been on this track for a long time. Just as we can plot out the increasing performance of desktop or laptop machines or servers in a conventional environment, we can do the same thing for supercomputers, and they track that same trend because, at least nowadays, they're built on essentially the same technology, which wasn't true, say, 20 years ago.

But they're actually ahead of that track, and it's because in the scientific computing domain, the area in which we work, people have one more dimension in which they've been able to exploit performance, and that is scale, that is, taking parallelism to new levels. So ten years ago, a large system was a system with, like, a thousand processors. Today a large system is a system with a hundred thousand processors, and over the next five to ten years we'll be dealing with systems that have a million processors, or more than a million processor cores. And so not only are we getting performance improvements from architectures, from memory architectures, from faster clocks (at least up until recently), from smaller feature sizes and all of the normal stuff that's affecting the general base, we're also seeing the creation of systems at an unprecedented scale. What that does is push right back into the software space and say, now we have to have programming environments that make it, if not easy, at least straightforward to program a system with a hundred thousand CPU cores. And while the industry is still struggling with 32 or 64 or something like that, the scientific community is orders of magnitude ahead in terms of trying to figure out how to deal with these issues.

And then it has all kinds of downstream effects. Assume you can write a program that can effectively use a hundred thousand processors. Now you've got to generate I/O. And if you've got a hundred thousand processors generating I/O, what's the architecture of a file system that can take that, and what does your internal networking infrastructure look like coming off of a machine that might be capable of driving a terabit per second of I/O? We don't see those systems -- you can't go down to your Best Buy and buy a terabit per second of I/O, right? So it causes you to reinvent, sort of, everything around the computer every couple of years, and so we're constantly in motion. Nothing is ever the same.

And on the application side, once we have access to these very large machines, we can start thinking about doing things differently. Take climate modeling as a good example, because it's relevant to recent concerns. Historically, climate models took the ocean, sea ice, atmosphere and land processes and, over the last five years particularly, have integrated those together, so now we have sort of a climate systems model. But those models don't include humans. They don't include human consumption of energy. They don't include the processes that create, say, carbon dioxide in production of energy, and they don't include economic activity, and they don't include demographics. So it's sort of like looking at the earth as if humans didn't actually exist on it. But now that we have so much compute power, we can start thinking about whether we can add to these models, say, a model of a city or a model of an economy and how it might react to a change in climate, and then couple those feedbacks in.

And so I was using an example in climate modeling, but that same concept, that you've got enough compute power that you can start building integrative models that take ideas from many different disciplines and tie them together, is what's really exciting right now.

Geer: So, for example, detail how your work might advance environmental pollution modeling.

Stevens: Okay. There are lots of different kinds of environmental pollution, but let's focus on, say, the problem of understanding how much smog in California is actually coming from China. It turns out, right now a significant amount of it is coming from China. So how would you unravel that? One way you would do it is you would get a very high-resolution atmospheric model, with resolution at the level of maybe tens of kilometers or maybe better, if you could, and you would run that model for many months, maybe many years, of simulated time, maybe using weather data and climate data from the last, say, ten years, so we have a pretty good understanding of where the prevailing winds are. And then we would look at what we know about, say, construction of coal-fired electricity plants in China and we would put those sources on the map.

Then we would run this model and we would put into the atmosphere virtual tracer particles. So imagine that each of these plants is producing a stream of pollution -- sulfur dioxide, whatever's coming out of them -- and what we want to do is stick in the model these little weightless virtual particles that allow us to see where that pollution ends up. And we know sort of how fast that sulfur dioxide gets dispersed in the atmosphere, and so we can just watch the model and we see where it lands. And we know, for example, in Los Angeles there's a basin effect there that tends to trap things that flow from the Pacific into the Los Angeles area, and it sort of gets stuck there in between the mountains. And so you can run this model and it will show you that some significant fraction of the sulfur dioxide that's coming into that basin is actually coming from across the Pacific. That's an example. But you can do that in groundwater, you can do it in atmospheric pollution -- the sky's the limit. All it takes is the ability to construct a mathematical model, validate that model, and then collect the underlying data, and then ask the what-if questions.
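
The tracer-particle idea described in this answer can be illustrated with a toy sketch. This is not the atmospheric model Stevens refers to; it assumes a uniform wind, a simple random jitter standing in for dispersion, and a rectangular "basin," and every function name and parameter here is hypothetical.

```python
import random


def advect_tracers(n_particles, steps, wind=(1.0, 0.0), diffusion=0.5, seed=0):
    """Release weightless tracers at a source at the origin and move them
    with a uniform wind plus Gaussian jitter (a stand-in for dispersion)."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same result
    particles = [(0.0, 0.0)] * n_particles
    for _ in range(steps):
        particles = [
            (x + wind[0] + rng.gauss(0.0, diffusion),
             y + wind[1] + rng.gauss(0.0, diffusion))
            for x, y in particles
        ]
    return particles


def fraction_in_box(particles, x_min, x_max, y_min, y_max):
    """Fraction of tracers that ended up inside a rectangular target region."""
    inside = sum(
        1 for x, y in particles
        if x_min <= x <= x_max and y_min <= y <= y_max
    )
    return inside / len(particles)


if __name__ == "__main__":
    # Hypothetical setup: the "basin" is a 20 x 20 box centered 100 units downwind.
    final_positions = advect_tracers(n_particles=1000, steps=100)
    print(fraction_in_box(final_positions, 90.0, 110.0, -10.0, 10.0))
```

A real study would replace the uniform wind with fields from a validated atmospheric model and would track many sources at once, but the bookkeeping is the same: release tracers at the sources, move them with the flow, and count where they land.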

Geer: What are the positive impacts on society as far as taking the information and using it to better environments for people?

Stevens: For example, we want to make informed policy decisions, whether it's the Clean Air Act or whether it's trying to understand how global the global climate problem is. Or take watersheds: right now in the US, public policy doesn't recognize very well the existence of watersheds. The natural landscape is organized in terms of watersheds, but those watersheds cross political boundaries, whether it's states or countries, between the US and Canada or Mexico or whatever. An environmental phenomenon, whether it's acid rain or point-source pollution in the Mississippi River Basin or whatever, doesn't obey the political laws, it obeys the natural laws. So it's going to stay within a watershed or whatever. The models aren't perfect, and they're not going to be perfect for a long time, if ever. But our ability to actually build models and examine them gives us ways of seeing things in an integrated way, which is very difficult if you don't have them. And what it allows us to do is at least ask questions about whether or not our policies are optimal, given what our objectives are -- clean water, clean air, stable climate, or whatever. So that's a class of problems.

If you shift over into, say, something like biomedical research, an area that I do some work in is microbial pathogens. And, for example, right now in the US, if you're in a hospital, there is a likelihood that you would get an infection. And if you get an infection in a hospital, there's a likelihood that that infection will be resistant to antibiotics, or at least some antibiotics. So one of the real problems is, can we understand how the organisms develop antibiotic resistance so that we can develop treatments: ways of rendering organisms non-virulent so they don't hurt people, or keeping them from evolving resistance to antibiotics, or maybe creating new antibiotics? And it's a race. People are dying every day, every week, from these kinds of infections that are resisting antibiotics, and the question is, how fast can we come up with new ideas to fight that? So it's like a war.

And in that context, what do we use the computers for? We can use them for analyzing the genomes of these organisms. We can compare many different versions of the organism, different strains that occur in different regions or in different hospitals, and we can look at how they're different from each other and what's common. And we can then do what-if experiments: say, what if we knocked out this protein or this gene? How would it affect this organism? And we can design, say, drug strategies, and we can do that via simulation. If we're clever, we can do it in simulation, perhaps more effectively than we can do it without simulation.

Geer: You mentioned climate. What are the social benefits that can be achieved through your work from the study of long-term climate changes?

Stevens: Right now one of the challenges that we're trying to--I say we, I mean the world is trying to understand--is, what are our options? First of all, we want to understand, given the current state of, say, carbon emissions and the current level of economic growth, what is the likely sort of business-as-usual climate outcome? So a climate model allows you to project forward 50 or 100 years and get some statistical idea of what--of how the climate might be different from what it is today. So that gives you the ability to sort of see into the future, so that's one general benefit. And you're seeing into the future very fuzzily, in some sense, but you're at least getting predictions that have a basis to them. You can go back into the model and see whether your assumptions are good and so on. So once you have that, now you can try out different scenarios. You can say, well, what if we have a policy that somehow reduces CO2? How would that affect the world's climate in the future if we made these changes? And so again, you'd use the model under some different scenario to predict what that would look like. So that gives you a tool to evaluate different options in terms of how fast you're trying to, say, lower CO2 or some other greenhouse gas.

Another way you might use that, though, is if you took that model and looked at where the climate on the ground is likely to be most different -- so wetter areas might get wetter, drier areas might get drier, the average growing regions for things like wheat might move further north, and so forth -- so you're going to have a series of displacements, potential displacements, of human activity that the model predicts. And then you can do a secondary analysis and say, well, what would that mean? What if we can no longer grow wheat in Kansas or whatever? What does that mean? What would we do? How would you adjust to that? What could you grow in Kansas, or is Kansas going to be a desert? I'm just picking on Kansas here because everybody knows where it is. But the point is that these models allow you to ask those questions. But they're not a substitute for getting more data, they're not a substitute for taking action. They're just sort of a way to see what the implications are of a complex situation.

So in that sense, there are many systems that we can't do experiments on. We can't do experiments easily on the Earth. We can't easily do experiments on humans. So if we're trying to optimize a medical technology, and you have the ability to put that into a simulation and do the experiments in the computer as opposed to doing them in patients, that's sort of an obvious good thing. We don't experiment on kids, for example. And there are many areas of science where it's just not feasible to build an experiment. So, for example, we're working on fusion power. The ability to design a fusion reactor that might provide a much cleaner kind of energy source is really going to be dependent on our ability to understand plasma physics and to build models that would allow us to steer possible designs in a direction that might work. The same with fission power. It's likely that we'll want to build more fission reactors and will want to optimize them for safety -- make them even safer than they are now, make them maybe less expensive, easier to maintain, passively safe and so on -- and you can do all of that in simulation before you actually have to pour concrete.

Geer: Describe a few ways that your current work benefits the IT world and IT organizations today.

Stevens: If we look back, we see that many technologies that are now mainstream in the general IT community -- parallel servers, for example -- came out of the scientific computing world. The idea of high-speed networking was pushed early on by scientific computing. Even if we look at where the Web came from, it came from CERN and from NCSA: the underlying protocols from CERN, which is a physics lab, a science lab, and from NCSA, which was a place doing scientific computing. So many of the technologies and the ideas come from research, and some of the places where more aggressive research is being done are those locations that are trying to push the envelope on IT technology to address a problem. Not everything is coming from there -- databases, for example, didn't come from scientific computing, and transaction processing really didn't. But some of the underlying ideas of very large file systems, petabyte hierarchical storage systems, that's an area of really frontier work that's going on right now that's being driven by the needs of some of these very large simulations. So there's always sort of a technology transfer that's flowing from scientific computing into regular IT, but there are also ideas from regular IT that flow the other direction. Advanced ideas for system configuration and system management really grew out of the IT world and are now being adopted to make it possible to have a handful of people manage these really large computers.

Geer: And given a timeframe of two, three, five or ten years down the road, what might IT be able to do in the future because of your research?

Stevens: One concept that I think is likely to just take over -- and it was enabled by the kind of research that we've been doing, not now but ten years ago -- is this idea of making multi-core very practical. Pretty soon you won't be able to buy, aside from the embedded world, single-processor chips. Everything will be multi-core. It's just a question of how many cores -- 2, 4, 8, 16, whatever. The question is, how do you make it easy for people to program systems that, say, have 32 or 64 cores? The parallel computing world, the scientific computing world, has been programming that for a decade or more. And so we have a lot of tools, a lot of concepts, that came out of there -- compiler concepts, message passing tools and so on -- that will see much, much broader use in the next five to ten years, when everything becomes multi-core. That's going to be a real fundamental change.

Another idea that will take more time -- it will move more slowly but will clearly have impact -- is a shift towards massively parallel data environments, where we will be able to easily manage file systems that can handle thousands of concurrent I/O streams with good efficiency. And this means probably abandoning POSIX semantics. It means changing sort of how file systems work. But that's going to be another big impact area.

And another area that I think is -- again, it's unclear exactly when it's going to hit -- is changing the fundamental architecture. Future machines that we're envisioning are somewhat cluster-like. They have more dedicated networks, more balance. They have a different OS structure, typically, maybe specialized kernels for certain parts of the machine, so they're a more heterogeneous kind of environment. That may penetrate the commercial sector more. It also is the case that these sort of leading-edge supercomputers are just now being used or experimented with as large data mining engines, and that technology may penetrate back into the commercial sector and make it much, much easier to deal with, say, petabyte databases and do data mining quickly.

And then finally, right now we have a need to have very high-speed networking in and out of these machines, aggregate bandwidths that are hundreds of times what the individual connections are, and yet we want to treat them as sort of coherent bundles. And that's a technology that is still being developed, and maybe over the next five to ten years we'll see a transfer into commercial IT environments that would enable you, for example, to move a petabyte of data from New York to California overnight so you can have a snapshot backup of your data center, which now is just not feasible, but in five to ten years might be feasible.
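
To put a rough number on the overnight petabyte example in this answer, here is a back-of-the-envelope calculation. The figures are assumptions for illustration and do not come from the interview: a decimal petabyte (10^15 bytes), an 8-hour "overnight" window, hypothetical 10 Gbit/s individual links, and no allowance for protocol overhead.

```python
import math

PETABYTE = 10**15  # bytes; decimal petabyte (assumption)


def required_gbps(num_bytes, hours):
    """Sustained rate in Gbit/s needed to move num_bytes within the given window."""
    bits = num_bytes * 8
    seconds = hours * 3600
    return bits / seconds / 1e9


def links_needed(total_gbps, link_gbps):
    """Number of parallel links of link_gbps each needed to reach total_gbps."""
    return math.ceil(total_gbps / link_gbps)


# Assumed 8-hour window and 10 Gbit/s per link, ignoring protocol overhead.
rate = required_gbps(PETABYTE, hours=8)
print(round(rate, 1))                    # 277.8
print(links_needed(rate, link_gbps=10))  # 28
```

Under those assumptions the transfer needs a sustained aggregate of roughly 278 Gbit/s, which is why the answer frames the problem as bundling many individual connections into one coherent stream.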

Geer: I appreciate your time today, Rick, and thank you for speaking with us. If you would like to learn more about the Argonne National Lab, surf to http://www.anl.gov.