Abstracts
H.G. Kaper and S. Tipei, "An Abstract Approach to Music," Preprint ANL/MCS-P748-0399, March 1999. The notion of formalized music implies that a musical composition can be described in mathematical terms. In this article we explore some formal aspects of music and propose a framework for an abstract approach.
H.G. Kaper, S. Tipei, and E. Wiebel, "Data Sonification and Sound Visualization," Computing in Science and Engineering 1(4) (July/Aug 1999), pp. 48-58. This article describes a collaborative project between researchers in the Mathematics and Computer Science Division at Argonne National Laboratory and the Computer Music Project of the University of Illinois at Urbana-Champaign. The project focuses on the use of sound for the exploration and analysis of complex data sets in scientific computing. The article addresses digital sound synthesis in the context of DIASS, a Digital Instrument for Additive Sound Synthesis, and sound visualization in a virtual-reality environment by means of M4CAVE. It describes the procedures and preliminary results of some experiments in scientific sonification and sound visualization.
G. W. Crabtree, D. O. Gunter, H.G. Kaper, A. E. Koshelev, G. K. Leaf, and V. M. Vinokur, "Numerical Simulations of Driven Vortex Systems," Physical Review B 61, no. 2 (J an. 2000), pp. 1446-1455. This article reports on several large-scale numerical simulations of vortex systems that are driven through superconducting media with defects. The simulations are based on the time-dependent Ginzburg-Landau equations. The simulations demonstrate regimes of plastic and elastic steady-state motion in the presence of a twin boundary, show the effect of regular and irregular arrays of point defects on vortex trajectories, and show a mechanism by which vortices move through an array of columnar defects. Also presented are the results of some transient simulations in two and three dimensions, which show that, in the transition from the Meissner state to the vortex state, vortices are formed by a process of deposition.
D. O. Gunter, H.G. Kaper, and G. K. Leaf, "Implicit Integration of the Time-Dep endent Ginzburg-Landau Equations of Superconductivity," SIAM J. Sci. Comput. 23, no. 6 (2002) 1944-1959. This article is concerned with the integration of the time-dependent Ginzburg-Landau (TDGL) equations of superconductivity. Four algorithms, ranging from fully explicit to fully implicit, are presented and evaluated for stability, accuracy, and compute time. The benchmark problem for he evaluation is the equilibration of a vortex configuration in a superconductor that is embedded in a thin insulator and subject to an applied magnetic field.
M. Grimsditch, L. Giovannini, F. Montoncello, F. Nizzoli, G. K. Leaf, and H.G. Kaper, "Normal Modes in Ferromagnetic nanoparticles: A Dynamical Matrix Approach," Preprint ANL/MCS-P1116-0104, January 2004. We present a method to compute the magnetic normal modees of a ferromagnetic particle. The method is a hybrid of micromagnetic simuations and a "dynamical matrix" approach similar to that used for vibrational studies. We use the method to calculate the normal modes of an Fe parallelepiped and compare the results with the modees recently extracted from a purely micromagnetic simulation. The results of the two approaches are in excellent agreement. We discus the pros and cons of both approaches. We also present information on standing waves with wavevector perpendicular to the applied field and on a family of modes localized at the particle ends.
W. McCune, R. Padmanabhan, M. A. Rose, and R. Veroff, :Automated Discovery of Single Axioms for Ortholattices," Albegra univers. 52 (2004) 541-549. We present single short axioms for ortholattices, orthomodular lattices, and modular ortholattices, all in terms of the Sheffer stroke. The ortholattice axiom is the shortest possible. We also give multiequation bases in terms of the Sheffer stroke and in terms of join, meet, and complementation. Proofs are omitted but are available in an associated technical report and on the Web. We used computers extensively to find candidates, reject candidates, and search for proofs that candidates are single axioms.
H.G. Kaper and H. Nordborg, "The Ginzburg-Landau Equations of Superconductivity in the Limit of Weak Coupling near the Upper Critical Field," Preprint ANL/MCS-P786-0100, January 2000. This article is concerned with the Ginzburg-Landau (GL) equations of superconductivity. The equations provide a mathematical model for the study of magnetic flux vortices in superconductors. The focus is on the asymptotic case when the charge of the superconducting charge carriers (Cooper pairs) is vanishingly small and the applied magnetic field approaches the upper critical field. It is shown that the GL model reduces in the limit to the frozen-field model, where the superconducting phenomena are affected by the electromagnetic phenomena, but not vice versa. The convergence is second order in the small parameter. The analytical results are confirmed in some numerical examples.
D. Bonachea, P. Dickens, and R. Thakur, "High-Performance File I/O in Java: Existing Approaches and Bulk I/O Extensions," Preprint ANL/MCS-P840-0800, Argonne National Laboratory, August 2000. There is a growing interest in using Java as the language for developing high-performance computing applications. To be successful in the high-performance computing domain, however, Java must not only be able to provide high computational performance but also high-performance I/O. In this paper, we first examine several approaches that attempt tp provide high-performance I/O in Java--many of which are not obvious at first glance--and evaluate their performance on two parallel machines: the IBM SP and the SGI Origin2000. We then propose extensions to the Java I/O library that address the deficiencies in the Java I/O API and improve performance dramatically, The extensions add bulk (array) I/O operations to Java, thereby removing much of the overhead currently associated with array I/O in Java. We have implemented the extensions in two ways: in a standard JVM using the Java Native Interface and in a high-performance parallel dialect of Java called Titanium. We describe the two implementations and present performance results that demonstrate the benefits of the proposed extensions.
J. No, R. Thakur, and A. Choudhary, "High-Performance Scientific Data Management System," Preprint ANL/MCS-P973-0502, April 2003. Many scientific applications have large I/O requirements, in terms of both the size of data and the number of files or datasets. Management, storage, efficient access, and analysis of this data present an extremely challenging task. Traditionally, two different solutions have been used for this task: file I/O or databases. File I/O can provide high performance but is tedious to use with large numbers of file and larage and complex data sets. Databases can be convenient, flexible, and powerful but do not perform and scale well for parallel supercomputing applications. We have developed a software system, called Scientific Data Manager (SDM), that combines the good features of both file I/O and databases. SDM provides a high-level API to the user and, internally, uses a parallel file system to store real data (using various I/O optimizations available in MPI-IO) and a database to store application-related metadata. In order to support I/O in irregular applications, SDM makes extensive use of MPI-IO's noncontiguous collective I/O functions. Moreover, SDM uses the concept of a history file to optimize the cost of the index distribution using the metadata stored in the database. We describe the design and implementation of SDM and present performance results with two regular applications, ASTRO3D and an Euler solver, and with two irregular applications, a CFD code called FUN3D and a Rayleigh-Taylor instability code.
W. Ligon and R. Ross, "Parallel I/O and the Parallel Virtual File System", in Beowulf Cluster Computing with Linux, 2nd ed., edited by W. Gropp, E. Lusk, and T. Sterling, MIT Press, 2003. This chapter discusses some of the most important issues in parallel I/O systems. These include parallel access patterns, parallel I/O system components and architectures, and consistency semantics. Knowing how parallel I/O systems operate and the issues involved can be useful when performance tunign an application for a particular system or choosing an I/O solution to mach expected workloads. Following this general discussion, the authors delve into PVFS, covering management and tuning and approaches for narrowing the source of problems that crop up. The chapter concludes with discussion of some critical issues for parallel file systems and how PVFS2 attempts to address these.
J. Lee, X. Ma, R. Ross, R. Thakur, and M. Winslett," "RFS: Efficient and Flexible Remote File Access for MPI-IO," Preprint ANL/MCS-P1176-0604, June 2004. In this paper we present RFS, a high-performance remote I/O facility for ROMIO, a well-known MPI-IO implementation. The simple, portable, and flexible design eliminates the shortcomings of previous remote I/O efforts. In particular, RFS improves the remote I/O performance by adopting active buffering with threads (ABT), which hides I/O cost by aggressively buffering the output data using available memory and performing background I/O using threads while computation is taking place. Our experimental results show that RFS with ABT can significantly reduce the remote I/O visible cost, achieving up to 92% of the theoretical peak throughput. The compoutation slowdown caused by concurrent I/O activities was 0.2-6.2%, which is dwarfed by the overall performance improvement in application turnaround time.
M. S. Min, T. W. Lee, P. F. Fischer, and S. K. Gray,
"Fourier Spectral Simulations
and Gegenbauer Reconstructions for Electromagnetic Waves in the Presence of
a Metal Nanoparticle," Preprint ANL/MCS-P1202-1004. We deescribe Fourier psuedospectral time-domain simulations, carried out in order to study light interacting with a metallic nanoscale object. The difficulty of using Fourier
methods to accurately predict the electromagnetic scattering in such dielectric configuration arises from the discontinuity in the dielectric
function along the surface of the metallic object.
Standard Fourier methods lead to oscillatory behavior in approximating solutions that are nonsmooth or that have steep gradients.
By applying the Gegenbauer reconstruction technique as a postprocessing method to the Fourier pseudospectral solution, we successfully reduce the oscillations after postprocessing.
W. Gropp, R. Ross, and N. Miller, "Providing Efficient I/O Redundancy in MPI Environments," Preprint ANL/MCS-P1178-0604, June 2004. We demonstrate a scalable method for recovering from single disk failures that is optimized for typical scientific data sets. This approach exploits coarser-grained (but precise) semantics to reduce the overhead of constructing revovery data and uses parallel computation (proportional to the data size and independent of number of processors) to construct data. Experiments showing the efficiency of this approach on a cluster with independent disks are presented, and a technique for hiding the creation of redundant data within the MPI-IO implementation is described.
G. Almasi, C. Archer, J. G. Castanos, J. Gunnels, C. Erway, P. Heidelberger, X. Martorell, J. E. Moreira, K. Pinnow, J. Ratterman, B. Steinmacher-burow, W. Gropp, and B. Toonen, "The Design and Implementation of Message Passing Services for the BlueGene/L Supercomputer," Preprint ANL/MCS-P1183-0604, June 2004. OThe BlueGene/L supercomputer, with 65,536 dual-processor compoute nodes, was designed to support effocoemt execution of massively prallel message-passing programs. Part of this support is an optimized implementation of MPI that leverages the hardware features of BlueGene/L. MPI for BlueGene/L is implemented on top of a more basic message-passing infrastructure called the message layer. This layer can be used to implement other higher-level libraries and directly by applications. MPI and the message layer are used in the two modes of oepration of BlueGene/L: coprocessor mode and virtual node mode. Performance measurements show that our message-passing services deliver peformance close to the hardware limits of the machine. They also show that dedicating one of the processors of a node to communication functions (coprocessor mode) greatly improves the message-passing bandwidth and that running two processes per compute node (virtual node mode) can have a positive effect on application performance.
J. No, R. Thakur, A. Choudhary, "Integrating Parallel File I/O and Database Support for High-Performance Scientific Data Management," Preprint ANL/MCS-P798-0300, March 2000. Many scientific applications have large I/O requirements, in terms of both the size of data and the number of files or data sets. Management, storage, efficient access, and analysis of this data present an extremely challenging task. Traditionally, two different solutions are used for this problem: file I/O or databases. File I/O can provide high performance but is tedious to use within large numbers of files and large and complex data sets. Databases can be convenient, flexible, and powerful but do ot perform and scale well for parallel supercomputing applications. We have developed a software system, called Scientific Data Manager (SDM), that combines the good features of both file I/O and databases. SDM provides a thin layer of database-like functionality on top of a high-performance, parallel file-I/O interface (MPI-IO). As a result, users can access data with the convenience of databases and the performance of MPI-IO, without having to bother with the details of either. In this paper, we describe the design and implementation of SDM. With the help of two parallel application templates, ASTRO3D and an Euler solver, we illustrate how some of the design criteria affect performance.
R. Thakur, W. Gropp, and E. Lusk, "Data Sieving and Collective I/O in ROMIO, Preprint ANL/MCS-P723-0898, August 1998. The I/O access patterns of parallel programs often consist of accesses to a large number of small, noncontiguous pieces of data. If an application's I/O needs are met by making many small, distinct I/O requests, however, the I/O performance degrades drastically. To avoid this problem, MPI-IO allows users to access a noncontiguous data set with a single I/O function call. This feature provides MPI-IO implementations an opportunity to optimize data access. We describe how our MPI-IO implementation, ROMIO, delivers high performance in the presence of noncontiguous requests. We explain in detail the two key optimizations ROMIO performs: data sieving for noncontiguous requests from one process and collective I/O for noncontiguous requests from multiple processes. We describe how one can implement these optimizations portably on multiple machines and file systems, control their memory requirements, and also achieve high performance. We demonstrate the performance and portability with performance results for three applications---an astrophysics-application template (DIST3D), the NAS BTIO benchmark, and an unstructured code (UNSTRUC)---on five different parallel machines: HP Exemplar, IBM SP, Intel Paragon, NEC SX-4, and SGI Origin2000.
I. Foster, W. Gropp, C. Kesselman, "Message Passing and Threads," in CRPC Handbook of Parallel Computing, eds. J. Dongarra, I. Foster, G. Fox, K. Kennedy, L. Torczon, A. White, Morgan Kaufmann Publishers (to appear). Also Preprint ANL/MCS-P843-0700, July 2000. In this chapter, we examine two fundamental, although low-level, approaches to expressing parallelism in programs. Over the years, numerous different approaches to designing and implementing parallel prorams have been developed. However, over time, two dominant alternatives have emerged: message passing and multithreading.
W. D. Gropp, "The 2-D Poisson Problem," in CRPC Handbook of Parallel Computing, eds. J. Dongarra, I. Foster, G. Fox, K. Kennedy, L. Torczon, A. White, Morgan Kaufmann Publishers (to appear). Also Preprint ANL/MCS-P842-0700, July 2000. In this chapter we briefly describe how an approximate solution to a simple partial differential equation can be found when using parallel computing. This chapter will allow us to illustrate the issues of parallelizing an application and contrast the two major approaches.
J. Czyzyk, T. Wisniewski, and S. J. Wright, "Optimization Case Studies in the NEOS Guide," SIAM Review 41 (1999) 148-163. The point of applied mathematics is that the theoretical and algorithmic developments at the core of the subject are relevant to important applications in the real world. In studying the subject, we learn the usefulness of abstracting individual problem characteristics to a mathematical level. The connection to applications motivates us to tackle many of the conceptual difficulties that arise in our study of the mathematics.
Joseph Czyzyk, Michael P. Mesnier, and Jorge J. More', "The NEOS Server," IEEE Computational Science & Engineering 5 (1998) 68-75. The Network-Enabled Optimization System (NEOS) is an environment for solving optimization problems over the Internet. Users submit optimization problems to the NEOS Server via e-mail, the World Wide Web, or the NEOS Submission Tool. The NEOS Server locates the appropriate optimization solver, computes all additional information (for example, derivatives and sparsity patterns) required by the solver, links the optimization problem with the solver, and returns a solution. This article discusses the design and implementation of the NEOS Server.
William Gropp and Jorge J. More', "Optimization environments and the NEOS Server," Approximation Theory and Optimization, edited by M. D. Buhmann and A. Iserles, Cambridge University Press, 1997, pp. 167-182. IEEE Computational Science & Engineering 5 (1998) 68-75. The Network-Enabled Optimization System (NEOS) is an Internet-based service for optimization providing information, software, and problem-solving services for optimization. The main components of NEOS are the NEOS Guide and the NEOS Server. Additional information on the various services provided by NEOS can be obtained from the home page of the Optimization Technology Center.
J. J. More' and C.-J. Lin, "Incomplete Cholesky Factorizations with Limited Memory," SIAM Journal on Scientific Computing, 21 (1999) 24-45. We propose an incomplete Cholesky factorization for the solution of large-scale trust region subproblems and positive definite systems of linear equations. This factorization depends on a parameter p that specifies the amount of additional memory (in multiples of n, the dimension of the problem) that is available; there is no need to specify a drop tolerance. Our numerical results show that the number of conjugate gradient iterations and the computing time are reduced dramatically for small values of p. We also show that in contrast with drop tolerance strategies, the new approach is more stable in terms of number of iterations and memory requirements.
J. J. More' and C.-J. Lin, "Newton's Method for Large Bound-Constrained Optimization Problems," SIAM Journal on Optimization, 9 (1999) 1100-1127. We analyze a trust region version of Newton's method for bound-constrained problems. Our approach relies on the geometry of the feasible set, not on the particular representation in terms of constraints. The convergence theory holds for linearly constrained problems, and yields global and superlinear convergence without assuming neither strict complementarity nor linear independence of the active constraints. We also show that the convergence theory leads to an efficient implementation for large bound-constrained problems.
W. Anderson, W. Gropp, D. Kaushik, D. Keyes, and B. Smith, "Achieving High Sustained Performance in an Unstructured Mesh CFD Application," Proceedings of SC '99, 1999. Many applications of economic and national security importance require the solution of nonlinear partial differential equations (PDEs). In many cases, PDEs possess a wide range of time scales---some (e.g., acoustic) faster than the phenomena of prime interest (e.g., convective), suggesting the need for implicit methods. In addition, many applications are geometrically complex and possess a wide range of length scales, requiring an unstructured mesh to adequately resolve the problem without requiring an excessive number of mesh points and to accomplish mesh generation and adaptation (almost) automatically. The best algorithms for solving nonlinear implicit problems are often Newton Methods, which themselves require the solution of very large, sparse linear systems. The best algorithms for these sparse linear problems, particularly at very large sizes, are often preconditioned iterative methods. This nested hierarchy of tunable algorithms has proved effective in solving complex problems in areas such as aerodynamics, combustion, radiation transport, and global circulation. Typically, for steady-state solutions from a trivial initial guess, the number of "work units" (evaluations of the discrete residuals on the finest mesh on which the problem is represented) is around 10**3 (to achieve reductions in the norm of the residual of 10**-8 to 10**-12). Although these algorithms are efficient (in the sense of using relatively few floating-point operations to arrive at the final result), they do not necessarily achieve the absolute flops-per-second (flop/s) ratings that less efficient or less versatile algorithms may.
N. Desai, R. Bradshaw, A. Lusk, and E. Lusk, "MPI Cluster System Software," Preprint ANL/MCS-P1160-0504, May 2004. We describe the use of MPI for writing system software and tools, an area where it has not been previously applied. By "system software" we mean collections of tools used for system management and operations. We describe the common methodologies used for system software development, together with our experiences in implementing three items of system software with MPI. We demonstrate that MPI can bring significant performance and other benefits to system software.
W. Gropp, "Building Library Components That Can Use Any MPI Implementation," Preprint ANL/MCS-P956-0502," May 2002. The MPI standard for programming parallel computers is widely used for builiding both programs and libraries. Two of the strengths of MPI are its support for libraries and the existence of multiple implementations on many platforms. These two strengths are in conflict, however, when an application wants to use libraries built with different MPI implementations. This paper describes several solutions to this problem, based on minor changes to the API. These solutions also usggest design considerations for other standards, particularly those that expect to have multiple implementations and to be used in concert with other libraries.
W. Gropp and E. Lusk, "Reproducible Measurements of MPI Performance Characteristics," in Recent Advances in Parallel Virtual Machine and Message Passing Interface, Proceedings of the 6th European PVM/MPI Users Group Meeting, Springer Lecture Notes in Computer Science #1697, Springer, 1999, pp. 11-18. In this paper we describe the difficulties inherent in making accurate, reproducible measurements of message-passing performance. We describe some of the mistakes often made in attempting such measurements and the consequences of such mistakes. We describe mpptest, a suite of performance measurement programs developed at Argonne National Laboratory, that attempts to avoid such mistakes and obtain reproducible measures of MPI performance that can be useful to both MPI implementors and MPI application writers. We include a number of illustrative examples of its use.
L. Freitag, M. Jones, and P. Plassmann, "A Parallel Algorithm for Mesh Smoothing," SIAM Journal for Scientific Computing, 20, no. 6, 1999 Maintaining good mesh quality during the generation and refinement of unstructured meshes in finite-element application is an important aspect in obtaining accurate discretizations and well-conditioned linear systems. In this article, we present a mesh-smoothing algorithm based on nonsmooth optimization techniques and a scalable implementation of this algorithm. We prove that the parallel algorithm has a provably fast runtime bound and executes correctly for a PRAM computational model. We extend the PRAM algorithm to distributed memory computers and report results for two- and three-dimensional simplicial meshes that demonstrate the efficiency and scalability of this approach for a number of different test cases. We also examine the effect of different architectures on the parallel algorithm and present results for the IBM SP supercomputer and an ATM-connected network of SPARC Ultras.
C.-J. Lin and J. J. More', "Incomplete Cholesky Factorizations with Limited Memory," SIAM J. on Scientific Computing 21(1) (1999), pp. 24-45. We propose an incomplete Cholesky factorization for the solution of large-scale trust region subproblems and positive definite systems of linear equations. This factorization depends on a parameter p that specifies the amount of additional memory (in multiples of n, the dimension of the problem) that is available; there is no need to specify a drop tolerance. Our numerical results show that the number of conjugate gradient iterations and the computing time are reduced dramatically for small values of p. We also show that in contrast with drop tolerance strategies, the new approach is more stable in terms of number of iterations and memory requirements.
F. Jarre and S. J. Wright, "The Role of Linear Objective Functions in Barrier Methods," Mathematical Programming, Series A, 84 (1999) 357-373. To simplify the analysis of interior-point methods, one commonly formulates the problem so that the objective function is linear, by introducing a single extra variable if necessary. Here we show that a linear objective function makes the Newton direction for a barrier function a useful search direction if the current iterate is sufficiently close to the central path. Hence, there are two advantages to using a linear objective and staying close to the central path. First, the Newton direction (which coincides with the affine scaling direction on the central path) gives a very accurate approximation to the direction to the minimum. Second, a long step along the Newton direction is possible without violating the inequality constraints."
S. J. Wright, "Modified Cholesky Factorizations in Interior-Point Algorithms for Linear Programming," SIAM Journal on Optimization 9 (1999) 1159-1191. We investigate a modified Cholesky algorithm similar to those used in current interior-point codes for linear programming. Cholesky-based interior-point codes are popular for three reasons: their implementation requires only minimal changes to standard sparse Cholesky codes (allowing us to take full advantage of software written by specialists in that area); they tend to be more efficient than competing approaches that use different factorizations; and they perform robustly on most practical problems, yielding good interior-point steps even when the coefficient matrix is ill conditioned. We explain the surprisingly good performance of the Cholesky-based approach by using analytical tools from matrix perturbation theory and error analysis, illustrating our results with computational experiments. Finally, we point out the limitations of this approach.
D. Levine, W. Gropp, K. Forsman, and L. Kettunen, "Parallel Computation of Three-dimensional Nonlinear Magnetostatic Problems," Concurrency Practice and Experience, 11(2):109--120, February 1999. We describe a general-purpose parallel electromagnetic code for computing accurate solutions to large computationally demanding, 3D, nonlinear magnetostatic problems. The code, CORAL, is based on a volume integral equation formulation. Using an IBM SP parallel computer and iterative solution methods, we successfully solved the dense linear systems inherent in such formulations. A key component of our work was the use of the PETSc library, which provides parallel portability and access to the latest linear algebra solution technology.
S. Benson, M. Krishnan, L. McInnes, J. Nieplocha, and J. Sarich, "Using the GA and TAO Toolkits for Solving Large-Scale Optimization Proble ms on Parallel Computers," Preprint ANL/MCS-P1084-0903, September 2003. Challenges in the scalable solution of large-scale optimization problems include the development of innovative algorithms as well as efficient tools for parallel data manipulation. This paper discusses the combined use of two complementary toolkits from the collection of Advanced CompuTational Software (ACTS), namely, Global Arrays (GA) for parallel data management and the Toolkit for Advanced Optimization (TAO). TAO uses abstractions for vectors and matrics, so that optimization algorithms can easily interface to the external linear algebra support provided by the GA library. The GA/TAO interfaces are available both in the traditional library mode and as components compliant with the Common Component Architecture (CCA). We highlight the design of each toolkit, describe the interfaces between them, and evaluate performance for model problems involving bound-constrained optimization.
S. J. Benson, L. McInnes, J. J. More', and J. Sarich, "Scalable Algorithms in Optimization: Computational Experiments," Preprint ANL/MCS-P1175-0604, June 2004. We survey techniques in the Toolkit for Advanced Optimization (TAO) for developing scalable algorithms for mesh-based optimization problems on distributed architectures. We discuss the distribution of the mesh, the computation of the gradient and the Hessian matrix, and the use of preconditioners. We show that these techniques, together with mesh sequencing, can produce results that scale with mesh size.
P. Hovland, K. Keahey, L. C. McInnes, B. Norris, L. Freitag, and P. Raghaven, "A Quality-of-Service Architecture for High-Performance Numerical Components," Preprint Preprint ANL/MCS-P1028-0203, February 2003. We propose a model for QoS-based composition of high-performance numerical components. We define an architecture that relies on five key capabilities and services, including component characterization, component proxy services, component replacement, a decision module, and archival run information processing. We describe quality metrics that are important for high-performance numerical simulations, includinmg computational cost, accuracy, and rates of convergence and failure. We discuss the use of the architecture and quality metrics in the context of a driven cavity flow simulation, which has been shown to benefit from adaptive solution techniques that could be derived from a QoS architecture.
S. Balay, W. D. Gropp, L. C. McInnes, B. F. Smith, "Software for the Scalable Solution of PDEs," in CRPC Handbook of Parallel Computing, eds. J. Dongarra, I. Foster, G. Fox, K. Kennedy, L. Torczon, and A. White, Morgan Kaufmann Publishers (to appear). The numerical approximation of the solution of partial differential equation (PDEs), which can be used to model physical, chemical, and biological phenomena, is an important application of parallel computers. Early efforts to build programs to solve PDE problems had to start from scratch, building code for each algorithm used in the solution process. This custom approach has two major drawback: it limits the use of parallel computers to a small number of groups that have the resources and expertise to develop these codes, and it hampers the ability to take advantage of developments in parallel algorithms. In conventional, serial programming, both of these drawbacks were partially solved by developing libraries of routines that contained he best numerical analysis and implementation techniques. The same route is being followed for parallel libraries, though parallelism introduces additional complications. Handling these complications has caused many groups to rethink the structure of numerical libraries, leading to better software even for uniprocessor applications. Handling these complications has caused many groups to rethink structure of numerical libraries, leading to better software even for uniprocessor applications. In this chapter, we will cover some of the issues and solutions in the context of the Portable, Extensible Toolkit for Scientific Computation (PETSc), a collection of tools for the numerical solution of PDEs and related problems.
N. Karonis, B de Supinski, I. Foster, W. Gropp, E. Lusk, and J. Bresnahan, "Exploiting Hierarchy in Parallel Computer Networks to Optimize Collective Operation Performance," Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS2000), to appear. The efficient implementation of collective communication operations has received much attention. Initial efforts modeled network communication and produced "optimal" trees based on those models. However, the models used by these initial efforts assumed equal point-to-point latencies between any two processes. This assumption is violated in heterogeneous systems such as clusters of SMPs and wide-area "computational grids," and as a result, collective operations that utilize the trees generated by these models perform suboptimally. In response, more recent work has focused on creating topology-aware trees for collective operations that minimize communication across slower channels (e.g., a wide-area network). While these efforts have significant communication benefits, they all limit their view of the network to only two layers. We present a strategy based upon a multilayer view of the network. By creating multilevel topology trees we take advantage of communication cost differences at every level in the network. We used this strategy to implement topology-aware versions of several MPI collective operations in MPICH-G, the Globus-enabled version of the popular MPICH mplementation of the MPI standard. Using information about topology discovered by Globus, we construct these topology-aware trees automatically during execution, thus freeing the MPI application programmer from having to write special files or functions to describe the topology to the MPICH library. We present results demonstrating the advantages of our multilevel approach by comparing it to the default (topology-unaware) implementation provided by MPICH and a topology-aware two-layer implementation.
O. Zaki, E. Lusk, W. Gropp, and D. Swider, "Toward Scalable Performance Visualization with Jumpshot," Int'l J. of High-Performance Computing Applications (to appear). Also Preprint ANL/MCS-P763-0699, June 1999. Jumpshot is a graphical tool for understanding the performance of parallel programs. It is in the tradition of the upshot tool, but contains a number of extensions and enhancements that make it suitable for large-scale parallel computations. Jumpshot takes as input a new, more flexible logfile format, and comes with a library for generating such logfiles. An MPI profiling library is also included, enabling the automatic generation of such logfiles from MPI programs. Jumpshot is written in Java, and can easily be integrated as an applet into browser-based computing environments. The most novel feature of Jumpshot is its automatic detection of anomalous durations, drawing the user's attention to problem areas in a parallel execution. This capability is particularly useful in large-scale parallel computations containing very many events.
C. H. Bischof, H. M. Bücker, and P. D. Hovland, "On Combining Computational Differentiation and Toolkits for Parallel Scientific Computing," Preprint ANL/MCS-P797-0200, Feb. 2000. Automatic differentiation is a powerful technique for evaluating derivatives of functions given in the form of a high-level programming language such as Fortran, C, or C++. The program is treated as a potentially very long sequence of elementary statements to which the chain rule of differential calculus is applied over and over again. Combining automatic differentiation and the organizational structure of toolkits for parallel scientific computing provides a mechanism for evaluating derivatives by exploiting mathematical insight on a higher level. In these toolkits, algorithmic structures such as BLAS-like operations, linear and nonlinear solvers, or integrators for ordinary differential equations can be identified by their standardized interfaces and recognized as high-level mathematical objects rather than as a sequence of elementary statements. In this note, the differentiation of a linear solver with respect to some parameter vector is taken as an example. Mathematical insight is used to reformulate this problem into the solution of multiple linear systems that share the same coefficient matrix but differ in their right-hand sides. The experiments reported here use ADIC, a tool for the automatic differentiation of C programs, and PETSc, an object-oriented toolkit for the parallel solution of scientific problems modeled by partial differential equations.
C. H. Bischof and H. M. Bucker, "Computing Derivatives of Computer Programs, Preprint ANL/MCS-P81 3-0400, April 2000. Automatic differentiation is introduced as a powerful technique to compute derivatives of functions given in the form of a computer program in a high-level programming language such as Fortran, C, or C++. In contrast to traditional approaches such as handcoding of analytic expressions, numerical approximation by divided differences, or manipulation of symbolic algebraic expressions by computer algebra systems, automatic differentiation offers the following substantial benefits: it is accurate up to machine precision, efficient in terms of computational cost, applicable to a 1-line formula as well las to a 100,000-line code, and can be produced with minimal human effort.
P. H. Carns, W. B. Ligon III, R. B. Ross, and R. Thakur, "PVFS: A Parallel File System for Linux Clusters,& quot; Preprint ANL/MCS-P804-0400, April 2000. As Linux clusters have matured as platforms for low-cost, high-performance parallel computing, software packages to provide many key services have emerged, especially in areas such as message passing and networking. One area devoid of support, however, has been parallel file systems, which are critical for high-performance I/O on such clusters. We have developed a parallel file system for Linux clusters, called the Parallel Virtual File System (PVFS). PVFS is intended both as a high-performance parallel file system that anyone can download and use and as a tool for pursuing further research in parallel I/O and parallel file systems for Linux clusters. In this paper, we describe the design and implementation of PVFS and present performance results on the Chiba City cluster at Argonne. We provide performance results for a workload of concurrent reads and writes for various numbers of compute nodes, I/O nodes, and I/O request sizes. We also present performance results for MPI-IO on PVFS, both for a concurrent read/write workload and for the BTIO benchmark. We compare the I/O performance when using a Myrinet network versus a fast-ethernet network for I/O-related communication in PVFS. We obtained read and write bandwidths as high as 700 Mbytes/sec with Myrinet and 225 Mbytes/sec with fast ethernet.
L. Freitag and R. M. Loy, "Comparison of Remote Visualization Strategies for Interactive Exploration of Large Data Sets," Preprint ANL/MCS-P803-0300, March 2000. We compare three remote visualization strategies used for interactive exploration of large data sets: image-based rendering, parallel visualization servers, and subsampling. We review each strategy and provide details for an adaptive multiresolution subsampling technique that we have developed. To determine the problem regimes for which each approach is most cost effective, we develop performance models to analyze the costs of computation and communication associated with the common visualization task of isosurface generation. Using these models, we investigate a number of hardware system configurations and task complexity scenarios when parameters such as problem size, visualization demands, and network bandwidth change. For one particular strategy, subsampling, we further investigate the tradeoffs between multiresolution and uniform grid methods in terms of performance and approximation errors.
J. Cownie and W. Gropp, "A Standard Interface for Debugger Access to Message Queue Information in MPI," Preprint ANL/MCS-P754-0699, June 1999. This paper discusses the design and implementation of an interface that allows a debugger to obtain the information necessary to display the contents of the MPI message queues. The design has been implemented in the TotalView debugger, and dynamic libraries that conform to the interface exist for MPICH, as well as the proprietary MPI implementations from Compaq, IBM, and SGI.
L. McInnes, B. Norris, S. Bhowmick, and P. Raghaven "Adaptive Sparse Linear Solvers for Implicit CFD Using Newton-Krylov Algorithms," Preprint ANL/MCS-P998-0902, September 2002. We consider the simulation of three-dimensional transonic Euler flow using pseudeo-transient Newton-Krylov methods. The main computation involves solving a large, sparse linear system at each Newton (nonlinear) iteration. We develop a technique for adaptively selecting the linear solver method to match better the numeric properties of the linear systems as they evolve during the course of the nonlinear iterations. We show how such adaptive methods can be implemented using advanced software environments, leading to significant improvements in simulation time.
W. D. Gropp, D. E. Keyes, L. C. McInnes, and M. D. Tidriri, "Globalized Newton-Krylov-Schwarz Algorithms and Software for Parallel Implicit CFD," Int. J. High Perform. Comput. Appl. (to appear). Implicit solution methods are important in applications modeled by PDEs with disparate temporal and spatial scales. Because such applications require high resolution with reasonable turnaround, parallelization is essential. The pseudo-transient matrix-free Newton-Krylov-Schwarz algorithmic framework is presented as a widely applicable answer. This article shows that, for the classical problem of three-dimensional transonic Euler flow about an M6 wing, NKS can simultaneously deliver (1) globalized, asymptotically rapid convergence through adaptive pseudo-transient continuation and Newton's method; (2) easonable parallelizability for an implicit method through deferred synchronization and favorable communication-to-communication scaling in the Krylov linear solver; and (3) high per-processor performance through attention to distributed memory and cache locality, especially through the Schwarz preconditioner. Two discouraging features of NKS methods are their sensitivity to the coding of the underlying PDE discretization and the large number of parameters that must be selected to govern convergence. We therefore distill several recommendations from our experience and from our reading of the literature on various algorithmic components of NKS, and we describe a freely available, MPI-based portable parallel software implementation of the solver employed here.
L. G. de Pillis and A. Radunskaya, "A Mathematical Tumor Model with Immune Resistance and Drug Therapy: An Optimal Control Approach," Preprint ANL/MCS-P805-0400, April 2000. We present a competition model of cancer tumor growth that includes both the immune system response and drug therapy. This is a four-population model that includes tumor cells, host cells, immune cells, and drug interaction. We analyze the stability of the drug-free equilibria with respect to the immune response in order to look for target basins of attraction. One of our goals was to simulate qualitatively the asynchronous tumor-drug interaction known as "Jeff's phenomenon." The model we develop is successful in generating this asynchronous response behavior. Our other goal was to identify treatment protocols that could improve standard pulsed chemotherapy regimens. Using optimal control theory with constraints and numerical simulations, we obtain new therapy protocols that we then compare with traditional pulsed periodic treatment. The optimal control generated therapies produce larger oscillations in the tumor population over time. However, by th4e end of the treatment period, total tumor size is smaller than that achieved through raditional pulsed therapy, and the normal cell population suffers nearly no oscillations.
P. D. Cha and L. G. de Pillis, "Model Updating by Adding Known Masses," International Journal for Numerical Methods in Engineering (to appear). Also Preprint ANL/MCS-P841-0800, August 2000. New approaches are developed that use measured data to adjust the analytical mass and stiffness matrices of a system so that the agreement between the analytical modes of vibration and the modal survey is improved. By adding known masses to the structure of interest, measuring the modes of vibration of this mass-modified system, and finally using this set of new data in conjunction with the initial lmodal survey, the analytical mass matrix of the structure can be corrected, after which the analytical stiffness matrix can be readily updated. By manipulating the correction matrices into vector forms, the connectivity information can be enforced, thereby preserving the physical configuration of the system and reducing the sizes of the least squares problems that need to be solved. Solution techniques for updating the system matrices are introduced, and the numerical issues associated with solving overdetermined and underdetermined least squares problems are investigated. The effects of round-off errors are also studied, and heuristic criteria are given for determining the minimum number of modes that need to be measured in order to ensure accurate updated mass and stiffness matrices. Numerical experiments are presented to validate the proposed model-updating techniques, to illustrate the effects of the number of measured modes on the quality of the updated model, to show how the magnitudes and locations of the added masses influence the updated matrices, and to highlight the numerical issues discussed in this paper.
L. G. de Pillis and E. G. de Pillis, "A Mathematical Framework for Understanding Continuum Effects of Budget Fluctuations on a University," Preprint ANL/MCS-P806-0400, April 2000. As a large part of most state budgets, public universities are tempting targets of budget cuts. Using the theory of diffusion of innovation, we build a mathematical framework that allows us to simulate the continuum effects of such budget cuts. This technique of mathematical modeling can provide insights into the mechanisms that affect the beneficial impact of the university. Such insights are difficult to obtain from standard economic impact studies.
E. G. de Pillis and L. G. de Pillis, "The Long-Term Impact of University Budget Cuts: A Mathematical Model," Preprint ANL/MCS-P817-0500, May 2000. Policymakers acknowledge the regional benefits of the university, yet cut higher education budgets. Incorporating the theory of diffusion of innovation, we develop a mathematical model to explore the long-term effects of university budget cuts. Simulations indicate that the full impact of budget modifications may not be realized for several decades.
L. A. Freitag and R. M. Loy, "Using Desktop Graphics Workstations for Interactive Remote Exploration of Large Data Sets," Preprint ANL/MCS-P809-0400, April 2000. The interactive visualization and exploration of large scientific data sets is a challenging and difficult task; their size often far exceeds the performance and memory capacity of even the most powerful graphics workstations. To address this problem, we have created a technique that combines multiresolution data reduction methods with parallel computing to allow interacative exploration of large data sets while retaining full-resolution capability. We describe the creation of reduced data sets using several different criteria including user-specified error bounds or a preset performance criterion. We discuss the software architecture of the system with particular emphasis on the algorithms used to efficiently create a reduced data set and the software used to communicate between the remote data reduction server and the local graphics client. We present performance results for the visualization of Rayleigh-Taylor instability and hairpin vortex data sets.
Z. Ren, R. Sheng, and S. J. Wright, "Advanced Computational Techniques for Laue Diffraction," Preprint ANL/MCS-P740-0199, Feb. 1999. We describe LaueView, a code for processing the measured intensity data in Laue X-ray diffraction experiments to obtain corrected structure amplitudes for each reflection that take account of the various distortion effects. The resulting corrected intensity data can then be used to recover the molecular structure by isomorphous refinement or by solution of the phase problem. We describe the key numerical techniques used in LaueView and outline the improvements we made to obtain a new, more efficient, and parallel version of the code. We conclude with some computational results obtained on a real data set that illustrate our improvements. The basic principles of the Laue method are described in an appendix, where we outline the distortions in the measured intensity data due to effects such as blurring, overlap of the spots, the nonuniform distribution of intensities in the incident X-ray beam, and absorption effects of various types.
E. A. Yildirim and S. J. Wright, "Warm-start Strategies in Interior-Point Methods for Linear Programming," Preprint ANL/MCS-P799-0300, March 2000. We study the situation in which, having solved a linear program with an interior-point method, we are presented with a new problem instance whose data is slightly perturned from the original. We describe strategies for recovering a "warm-start" point for the perturbed problem instance from the iterates of the original problem instance. We obtain worst-case estimates of the number of iterations required to converge to a solution of the perturbation and on the conditioning and other properties of the problem instances.
S. J. Wright, "On Reduced Convex QP Formulations of Monotone LCP Problems," Preprint ANL/MCS-P808-0400, April 2000. Techniques for transforming convex quadratic programs (QPs) into monotone linear complementarity problems (LCPs) and vice versa are well known. We describe a class of LCPs for which a reduced QP formulation---one that has fewer constraints than the "standard" QP formulations---is available. We mention several instances of this class, including the known case in which the coefficient matrix in the LCP is symmetric.
S. J. Wright, "Recent Developments in Interior-Point Methods," Proceedings of the IFIP Conference on System Modelling and Optimization, M. J. D. Powell, ed., Kluwer, to appear. The modern era of interior-point methods dates to 1984, when Karmarkar proposed his algorithm for linear programming. In the years since then, algorithms and software for linear programming have become quite sophisticated, while extensions to more general classes of problems, such as convex quadratic programming, semidefinite programming, and nonconvex and nonlinear problems, have reached varying levels of maturity. Interior-point methodology has been used as part of the solution strategy in many other optimization contexts as well, including analytic center methods and column-generation algorithms for large linear programs. We review some core developments in the area and discuss their impact on these other problem areas.
I. Lee, P. Raghaven, S. Schofield, and P. Fischer, ""An O(n log n) Solution Algorithm for Spectral Element Methods," Preprint ANL/MCS-P1041-0403, April 2003. To leverage significant software development effort, general-purpose unstructured codes are often used in structured or semi-structured applications. We show that O(n log n) computational complexities, competitive with classic Fourier methods, are achievable for some classes of semi-structured spectral element applications.
J. W. Lottes and P. F. Fischer, "Hybrid Multigrid/Schwarz Algorithms for the Spectral Element Method," Preprint ANL/MCS-P1052-0403, April 2003. We study the performance of the multigrid method applied to spectral element discretizations of the Poisson equation. Smoothers based on finite element (FE) discretizations, overlapping Schwarz methods, and point-jacobi are considered in conjunction with conjugate gradient and GMRES acceleration techniques. It is found that Schwarz methods based on restrictions of the originating SE matrices converge faster than FE-based methods and that weighting the Schwarz matrices by the inverse of the diagonal coutning matrix is essential to effective Schwarz smoothing. Several of the methods considered achieve convergence rates comparable to those attained by classic multigrid on regular grids.
R. Armstrong, D. Gannon, A. Geist, K. Keahey, S. Kohn, L. McInnes, S. Parker, and B. Smolinski, "Toward a Common Component Architecture for High-Performance Scientific Computing," Preprint ANL/MCS-P759-0699, Proceedings of the 1999 High-Performance Distributed Computing Conference, to appear. This paper describes work in progress to develop a standard for interoperability among high-performance scientific components. This research stems from growing recognition that the scientific community needs to better manage the complexity of multidisciplinary simulations and better address scalable performance issues on parallel and distributed architectures. Driving forces are the need for fast connections among components that perform numerically intensive work and for parallel collective interactions among components that use multiple processes or threads. This paper focuses on the areas we believe are most crucial in this context, namely, an interface definition language that supports scientific abstractions for specifying component interfaces and a ports connections model for specifying component interactions.
H. M. Tufo and P. F. Fischer, "Terascale Spectral Element Algorithms and Implementations," Preprint ANL/MCS-P762-0699, June 1999. We describe the development and implementation of an efficient spectral element code for multimillion gridpoint simulations of incompressible flows in general two- and three-dimensional domains. We review basic and recently developed algorithmic underpinnings that have resulted in good parallel and vector performance on a broad range of architectures, including the terascale computing systems now coming on line at the DOE labs. Sustained performance of 219 GFLOPS has been recently achieved on 2048 nodes of the Intel ASCI-Red machine at Sandia.
H. M. Tufo, P. F. Fischer, M. E. Papka, and M. Szymanski, "Hairpin Vortex Formation, a Case Study for Unsteady Visualization," Preprint ANL/MCS-P774-0799, July 1999. To better understand the vortex dynamics of coherent structures in turbulent and transitional boundary layers, we consider direct numerical simulation of the interaction between a flat-plate-boundary-layer flow and an isolated hemispherical roughness element. Of principal interest is the evolution of hairpin vortices that form an interlacing pattern in the wake of the hemisphere, lift away from the wall, and are stretched by the shearing action of the boundary layer. Using animations of unsteady three-dimensional representations of this flow, produced by the vtk toolkit and enhanced to operate in a CAVE virtual environment, we identify and study several key features in the evolution of this complex vortex topology not previously observed in other visualization formats.
W. K. Anderson, W. D. Gropp, D. K. Kaushik, D. E. Keyes, and B. F. Smith, "Achieving High Sustained Performance in an Unstructured Mesh CFD Application," Preprint ANL/MCS-P776-0899, Aug. 1999. Overview: Many applications of economic and national security importance require the solution of nonlinear partial differential equations (PDEs). In many cases, PDEs possess a wide range of time scales---some (e.g., acoustic) faster than the phenomena of prime interest (e.g., convective), suggesting the need for implicit methods. In addition, many applications are geometrically complex and possess a wide range of length scales, requiring an unstructured mesh to adequately resolve the problem without requiring an excessive number of mesh points and to accomplish mesh generation and adaptation (almost) automatically. The best algorithms for solving nonlinear implicit problems are often Newton Methods, which themselves require the solution of very large, sparse linear systems. The best algorithms for these sparse linear problems, particularly at very large sizes, are often preconditioned iterative methods. This nested hierarchy of tunable algorithms has proved effective in solving complex problems in areas such as aerodynamics, combustion, radiation transport, and global circulation. Typically, for steady-state solutions from a trivial initial guess, the number of "work units" (evaluations of the discrete residuals on the finest mesh on which the problem is represented) is around 10**3 (to achieve reductions in the norm of the residual of 10**-8 to 10**-12). Although these algorithms are efficient (in the sense of using relatively few floating-point operations to arrive at the final result), they do not necessarily achieve the absolute flops-per-second (flop/s) ratings that less efficient or less versatile algorithms may.
P. F. Fischer and J. S. Mullen, "Filter-based Stabilization of Spectral Element Methods," C. R. Acad. Sci. Paris 332, series 1 (2001) 265-270. We present a simple filtering procedure for stabilizing the spectral element method (SEM) for the unsteady advection-diffusion and Navier-Stokes equations. A number of example applications are presented, along with basic analysis for the advection-diffusion case.
P. F. Fischer and H. M. Tufo, "High-Performance Spectral Element Algorithms and Implementations," Preprint ANL/MCS-P778-0899, Aug. 1999. We describe the development and implementation of a spectral element code for multimillion gridpoint simulations of incompressible flows in general two- and three-dimensional domains. Parallel performance is present on up to 2048 nodes of the Intel ASCI-Red machine at Sandia.
H. M. Tufo, P. F. Fischer, M. E. Papka, and K. Blom, "Numerical Simulation and Immersive Visualization of Hairpin Vortices," Preprint ANL/MCS-P779-0899, Aug. 1999. To better understand the vortex dynamics of coherent structures in turbulent and transitional boundary layers, we consider direct numerical simulation of the interaction between a flat-plate-boundary-layer flow and an isolated hemispherical roughness element. Of principal interest is the evolution of hairpin vortices that form an interlacing pattern in the wake of the hemisphere, lift away from the wall, and are stretched by the shearing action of the boundary layer. Using animations of unsteady three-dimensional representations of this flow, produced by the vtk toolkit and enhanced to operate in a CAVE virtual environment, we identify and study several key features in the evolution of this complex vortex topology not previously observed in other visualization formats.
L. Freitag and T. Urness, "Analyzing Industrial Furnace Efficiency Using Comparative Visualization in a Virtual Reality Environment," Preprint ANL/MCS-P744-0299, Feb. 1999. We describe an interactive toolkit used to perform comparative analysis of two or more data sets arising from numerical simulations. Several techniques have been incorporated into this toolkit, including (1) successive visualization of individual data sets, (2) data comparison techniques such as computation and visualization of the differences between data sets, and (3) image comparison methods such as scalar field height profiles plotted in a common coordinate system. We describe each technique in detail and show example usage in an industrial application aimed at designing an efficient, low-NOx burner for industrial furnaces. Critical insights are obtained by interactively adjusted color maps, data culling, and data manipulation. New paradigms for scaling small values in the data comparison technique are described. The display device used for this application was the CAVE virtual reality theater, and we describe the user interface to the visualization toolkit and the benefits of immersive 3D visualization for comparative analysis.
L. A. Freitag and P. Plassmann, "Local Optimization-Based Simplicial Mesh Untangling and Improvement," Preprint ANL/MCS-P749-0399, May 1999. We develop an optimization-based approach for mesh untangling formulated by maximizing the minimum area or volume of the simplicial elements in a local submesh. The method that the function level sets are convex regardless of the position of the free vertex, and hence the local subproblem is guaranteed to converge. Maximizing the minimum area of mesh elements, although well-suited for mesh untangling, is not ideal for mesh improvement and its use often results in poor quality meshes. We therefore combine the mesh untangling technique with optimization-based mesh improvement techniques, and expand previous results to show that the level sets for a commonly used mesh quality criterion is convex in the feasible region. Typical results showing the effectiveness of the combined untangling and smoothing techniques are given for both two- and three-dimensional meshes.
L. A. Freitag, W. D. Gropp, P. D. Hovland, L. C. McInnes, and B. F. Smith, "Infrastructure and Interfaces for Large-Scale Numerical Software," Preprint ANL/MCS-P751-0599, May 1999. The complexity of large-scale scientific simulations often necessitates the combined use of multiple software packages developed by different groups in areas such as adaptive mesh manipulations, scalable algebraic solvers, and optimization. Historically, these packages have been combined by using custom code. This practice inhibits experimentation with and comparison of multiple tools that provide similar functionality through different implementations. The ALICE project, a collaborative effort among researchers at Argonne National Laboratory, is exploring the use of component-based software engineering to provide better interoperability among numerical toolkits. We discuss some initial experiences in developing an infrastructure and interfaces for high-performance numerical computing.
L. A. Freitag and R. M. Loy, "Adaptive, Multiresolution Visualization of Large Data Sets Using Parallel Octrees," Preprint ANL/MCS-P752-0599, May 1999. The interactive visualization and exploration of large scientific data sets is a challenging and difficult task; their size often far exceeds the performance and memory capacity of even the most powerful graphics workstations. To address this problem, we have created a technique that combines hierarchical data reduction methods with parallel computing to allow interactive exploration of large data sets while retaining full-resolution capability. The hierarchical representation is built in parallel by strategically inserting field data into an octree data structure. We provide functionality that allows the user to interactively adapt the resolution of the reduced data sets so that resolution is increased in regions of interest without sacrificing local graphics performance. We describe the creation of the reduced data sets using a parallel octree, the software architecture of the system, and the performance of this system on the data from a Rayleigh-Taylor instability simulation.
L. A. Freitag and P. M. Knupp, "Tetrahedral Element Shape Optimization via the Jacobian Determinant and Condition Number," Preprint ANL/MCS-P769-0799, July 1999. We present a new shape measure for tetrahedral elements that is optimal in the sense that it gives the distance of a tetrahedron from the set of inverted elements. This measure is constructed from the condition number of the linear transformation between a unit equilateral tetrahedron and any tetrahedron with positive volume. We use this shape measure to formulate two optimization objective functions, which are differentiated by their goal: the first seeks to improve the average quality of the tetrahedral mesh, the second aims to improve the worst quality element in the mesh. Because the element condition number is not defined for tetrahedra with negative volume, these objective functions can be used only when the initial mesh is valid. Therefore, we formulate a third objective function using the determinant of the element Jacobian that is suitable for mesh untangling. We review the optimization techniques used with each objective function, and present experimental results that demonstrate the effectiveness of the mesh improvement and untangling methods. We show that a combined optimization approach that uses both condition number objective functions obtains the best quality meshes.
L. A. Freitag and P. M. Knupp, "Tetrahedral Mesh Improvement via Optimization of the Element Condition Number," Preprint ANL/MCS-P810-0500, May 2000. We present a new shape measure for tetrahedral elements that is optimal in that it gives the distance of a tetrahedron from the set of inverted elements. This measure is constructed from the condition number of the linear transformation between a unit equilateral tetrahedron and any tetrahedron with positive volume. Using this shape measure, we formulate two optimization objective functions that are differentiated by their goal: the first seeks to improve the average quality of the tetrahedral mesh; the second aims to improve the worst-quality element in the mesh. We review the optimization techniques used with each objective function and present experimental results that demonstrate the effectiveness of the mesh improvement methods. We show that a combined optimization approach that uses both objective functions obtains the best-quality meshes for several complex geometries.
R. Butler, W. Gropp, and E. Lusk, "A Scalable Process-Management Environment for Parallel Programs," Preprint ANL/MCS-P812-0400, April 2000. We present a process management system for parallel programs such as those written using MPI. A primary goal of the system, which we call MPD (for multipurpose daemon), is to be scalable. By this we mean that startup of interactive parallel jobs comprising a thousand processes is quick, that signals can be quickly delivered to processes, and that stdin, stdout, and stderr are managed intuitively. Our primary target is parallel machines made up of clusters of SMPs, but the system is also useful in more tightly integrated environments. We describe how MPD enables much faster startup and better runtime management of MPICH jobs. We show how close control of stdio can support the easy implementation of a number of convenient system utilities, even a parallel debugger. MPD is implemented and freely distributed with MPICH.
J. E. Flaherty, R. M. Loy, M. S. Shephard, and J. D. Teresco, "Software for the Parallel Adaptive Solution of Conservation Laws by Discontinuous Galerkin Methods," Preprint ANL/MCS-P773-0799, July 1999. We develop software tools for the solution of conservation laws using parallel adaptive discontinuous Galerkin methods. In particular, the Rensselaer Partition Model (RPM) provides parallel mesh structures within an adaptive framework to solve the Euler equations of compressible flow by a discontinuous Galerkin method (LOCO). Results are presented for a Rayleigh-Taylor flow instability for computations performed on 128 processors of an IBM SP computer. In addition to managing the distributed data and maintaining a load balance, RPM provides information about the parallel environment that can be used to tailor partition to a specific computational environment.
M. Anitescu, "On the Rate of Convergence of Sequential Quadratic Programming with Nondifferentiable Exact Penalty Function in the Presence of Constraint Degeneracy," Preprint ANL/MCS-P760-0699, June 1999. We analyze the convergence of the sequential quadratic programming (SQP) method for nonlinear programming for the case in which the Jacobian of the active constraints is rank deficient at the solution and/or strict complementarity does not hold for some or any feasible LaGrange multipliers. We use a nondifferentiable exact penalty function, and we prove that the sequence generated by the SQP is locally R-linearly convergent if the matrices of the quadratic program are uniformly positive definite and bounded, provided that the Mangasarian-Fromowitz constraint qualification and some second-order sufficiency conditions hold.
M. Anitescu, "Global Convergence of an Elastic Mode Approach for a Class of Mathematical Programs with Complementarity Constraints," Preprint ANL/MCS-P1143-0404, April 2004. We prove that any accumulation point of an elastic mode approach, applied to the optimization of a mixed P variational inequality that approximately solves the relaxed subproblems is a C-stationary point of the problem of optimizaing a parametric mixed P variational inequality. If, in addition, the accumulation point satisfies the MPCC-LICQ constraint qualitfication and if the solutions of the subproblem satisfy approximate second-order sufficient conditions, then the limiting point is an M-stationary point. Moreover, if the accumulation point satisfies the upper-level strict complementarity condition, the accumulation point will be a strongly stationary point. If we assume that the penalty function associated with the feasible set of the mathematical program with complementarity constraints has bounded level sets and if the objective function is bounded below, we show that the algorithm will produce bounded iterates and will therefore have at least one accumulation point. We prove that the obstable problem satisfies our assumptions for both a rigid and a deformable obstacle. The theoretical conclusions are validated by several numerical examples.
M. Anitescu and G. D. Hart, "A Fixed-Point Iteration Approach for Multibody Dynamics with Contact and Small Friction," Preprint ANL/MCS-P985-0802, February 2004. Acceleration-force setups for multi-rigid-body dynamics are known to be inconsistent for some configurations and sufficiently large friction coefficients (a Painleve paradox). This difficulty is circumvented by time-stepping methods using impulse-velocity approaches, which solve complementarity problems with possibly nonvconvex solution sets. We show that very simple configurations involving two bodies may have a nonconvex solution set for any nonzero value of the friction coefficient. We construct two fixed-point iteration algorithms that solve convex subproblems and that are guaranteed, for sufficiently small friction coefficients, to retrieve, at a linear convergence rate, the unique velocity solution of the nonconvex linear complementarity problem whenever the frictionless configuration can be disassembled. In addition, we show that one step of one of the interative algorithms provides an excellent approximation to the velocity solution of the original, possibly nonconvex, problem if for all contacts we have that either the friction coefficient is small or the slip velocity is small.
M. Anitescu, W. J. Layton, and F. Pahlevani, "Unconditional Stability for Numerical Scheme Combining Implicit Timestepping fo r Local Effects and Explicit Timestepping for Nonlocal Effects," Preprint ANL/MCS-P1093-0903, September 2003. A combination of implicit and explicit timestepping is analyzed for a system of ODEs motivated by ones arising from spatial discretizations of evolutionary partial differential equations. Loosely speaking, the method we consider is implicit in local and stabilizing terms in the underlying PDE and explicit in nonlocal and unstabilizing terms. Unconditional stability and convergence of the numerical scheme are proven by the energy method and by algebraic techniques. This stability result is surprising because usually when different methods are combined, the stability properties of the least stable method plays a determining role in the combination.
F. A. Potra, M. Anitescu, B. Gavrea, and J. Trinkle, "A Linearly Implicit Trapezoidal Method for Integrating Stiff Multibody Dynamics with Contact, Joints, and Friction," Preprint ANL/MCS-P1162-0504, May 2004. We present a hard constraint, linear complementarity-based method for the simulation of stiff multibody dynamics with contact, joints, and friction. The approach uses a linearization of the modified trapezoidal method, incorporates a Poisson restitution model at collision, and solves only one linear complementarity problem per time step when no collisions are encountered. We prove that the method has order two, a fact that is also demonstrated by our numerical simulations. When we use a special approximation of the Jacobian matrix for the case where the stiff forces originate in springs and dampers attached to two points in the system, we prove that the linear coplementarity problem can be solved for any value of the time step and that the method is stiffly stable. In particular, the numerical solution converges to the numerical solution of the contraint problem where the stiff springs and dampers have been replaced by rigid links, a fact that we also demonstrate by numerical simulations. The method was implemented in UMBBRA, an industrial-grade virtual prototyping software.
G. D. Hart and M. Anitescu, "A Hard-Constraint Time-Stepping Approach for Rigid Multibody Dynamics wit h Joints, Contact, and Friction," Preprint ANL/MCS-P1099-0903, October 2003. We present a method for simulating rigid multibody dynamics with joints, contact, and friction. In this work, the nonsmooth contact and frictional constraints are represented by hard constraints. The method requires the soluiton of only one linear complementarity problem per step an can progress at much larger time steps than explicit penalty methods, which are currently the method of choice for most of these simulations.
M. Anitescu, "Degenerate Nonlinear Programming with a Quadratic Growth Condition," Preprint ANL/MCS-P761-0699, June 1999. We show that the quadratic growth condition and the Mangasarian-Fromowitz constraint qualification imply that local minima of nonlinear programs are isolated stationary points. As a result, when started sufficiently close to such points, an L_\infty exact penalty sequential quadratic programming algorithm will induce at least R-linear convergence of the iterates to such a local minimum. We construct an example of a degenerate nonlinear program with a unique local minimum satisfying the quadratic growth and the Mangasarian-Fromowitz constraint qualification but for which no positive semidefinite augmented Lagrangian exists. We present numerical results obtained using several nonlinear programming packages on this example, and discuss its implications for some algorithms.
L. A. Freitag and P. Plassmann, "Local Optimization-Based Simplicial Mesh Untangling and Improvement," Preprint ANL/MCS-P749-0399, May 1999. We develop an optimization-based approach for mesh untangling formulated by maximizing the minimum area or volume of the simplicial elements in a local submesh. The method that the function level sets are convex regardless of the position of the free vertex, and hence the local subproblem is guaranteed to converge. Maximizing the minimum area of mesh elements, although well-suited for mesh untangling, is not ideal for mesh improvement and its use often results in poor quality meshes. We therefore combine the mesh untangling technique with optimization-based mesh improvement techniques, and expand previous results to show that the level sets for a commonly used mesh quality criterion is convex in the feasible region. Typical results showing the effectiveness of the combined untangling and smoothing techniques are given for both two- and three-dimensional meshes.
M. K. Mihçak, P. Moulin, M. Anitescu, and K. Ramchandran, "Rate-Distortion-Optimal Subband Coding without Perfect-Reconstruction Constraints," Preprint ANL/MCS-P766-0699, June 1999. We investigate the design of subband coders without the traditional perfect-reconstruction constraint on the filters. The coder uses scalar quantizers, and its filters and bit allocation are designed so as to optimize a rate-distortion criterion. Convexity properties play a central role in the analysis. Our results hold for a broad class of rate-distortion criteria. First, we show that optimality can be achieved using filter banks that are the cascade of a (paraunitary) principal component filter bank for the input spectral process and a set of pre- and post-filters surrounding each quantizer. An algorithm for computing the globally optimal filters and bit allocation is given. We then develop closed-form solutions for the special case of two-channel coders under an exponential rate-distortion model. Finally, we investigate a constrained-length version of the filter design problem, which is applicable to practical coding scenarios. While the optimal filter banks are nearly perfect-reconstruction at high rates, we demonstrate an apparently surprising advantage of optimal FIR filter banks: they significantly outperform optimal perfect-reconstruction FIR filter banks at all bit rates.
S. J. Wright, "Recent Developments in Interior-Point Methods," Proc. 19th IFIP TC7 Conference on System Modeling and Optimization, Kluwer Academic Publishers (to appear 2000). Also Preprint ANL/MCS-P783-0999, Sept. 1999. The modern era of interior-point methods dates to 1984, when Karmarkar proposed his algorithm for linear programming. In the years since then, algorithms and software for linear programming have become quite sophisticated, while extensions to more general classes of problems, such as convex quadratic programming, semidefinite programming, and nonconvex and nonlinear problems, have reached varying levels of maturity. Interior-point methodology has been used as part of the solution strategy in many other optimization contexts as well, including analytic center methods and column-generation algorithms for large linear programs. We review some core developments in the area and discuss their impact on these other problem areas.
F. A. Potra and S. J. Wright, Journal of Computational and Applied Mathematics 124 (2000) 281-302. The modern era of interior-point methods dates to 1984, when Karmarkar proposed his algorithm for linear programming. In the years since then, algorithms and software for linear programming have bec ome quite sophisticated, while extensions to more general classes of problems, s uch as convex quadratic programming, semidefinite programming, and nonconvex and nonlinear problems, have reached varying levels of maturity. Interior-point methodology has been used as part of the solution strategy in man y other optimization contexts as well, including analytic center methods and col umn-generation algorithms for large linear programs. We review some core developments in the area and discuss their impact on these o ther problem areas.
S. J. Wright and D. Orban, "Properties of the Log-Barrier Function on Degenerate Nonlinear Programs," Preprint ANL/MCS-P772-0799, July 1999. We examine the sequence of local minimizers of the log-barrier function for a nonlinear program near a solution at which second-order sufficient conditions and the Mangasarian-Fromowitz constraint qualifications are satisfied, but the active constraint gradients are not necessarily linearly independent. When a strict complementarity condition is satisfied, we show uniqueness of the local minimizer of the barrier function in the vicinity of the nonlinear program solution, and obtain a semi-explicit characterization of this point. When strict complementarity does not hold, we obtain several other interesting characterizations, in particular, an estimate of the distance between the minimizers of the barrier function and the nonlinear program in terms of the barrier parameter, and a result about the direction of approach of the sequence of minimizers of the barrier function to the nonlinear programming solution.
R. Thakur, W. Gropp, and E. Lusk, "Achieving High Performance with MPI-IO," Preprint ANL/MCS-P742-0299, Feb. 1999. The I/O access patterns of many parallel applications consist of accesses to a large number of small, noncontiguous pieces of data. If an application's I/O needs are met by making many small, distinct I/O requests, however, the I/O performance degrades drastically. To avoid this problem, MPI-IO allows users to access noncontiguous data with a single I/O function call, unlike in Unix I/O. In this paper, we explain how critical this feature of MPI-IO is for high performance and how it enables implementations to perform optimizations. An application can be written in many different ways with MPI-IO. We classify the different ways of expressing an application's I/O access pattern in MPI-IO into four levels, called level 0 through level 3. We demonstrate that, for applications with noncontiguous access patterns, the I/O performance improves significantly if users write the application such that it makes level-3 MPI-IO requests (noncontiguous, collective) rather than level-0 requests (Unix style). We describe how our MPI-IO implementation, ROMIO, delivers high performance for noncontiguous requests. We explain in detail the two key optimizations ROMIO performs: data sieving for noncontiguous requests from one process and collective I/O for noncontiguous requests from multiple processes. We describe how one can implement these optimizations portably on multiple machines and file systems, control their memory requirements, and also achieve high performance. We demonstrate the performance and portability with performance results for three applications---an astrophysics-application template (DIST3D), the NAS BTIO benchmark, and an unstructured code (UNSTRUC)---on five different parallel machines: HP Exemplar, IBM SP, Intel Paragon, NEC SX-4, and SGI Origin2000.
J. Cownie and W. Gropp, "A Standard Interface for Debugger Access to Message Queue Information in MPI," Preprint ANL/MCS-P754-0699, June 1999. This paper discusses the design and implementation of an interface that allows a debugger to obtain the information necessary to display the contents of the MPI message queues. The design has been implemented in the TotalView debugger, and dynamic libraries that conform to the interface exist for MPICH, as well as the proprietary MPI implementations from Compaq, IBM, and SGI.
A. S. Bondarenko, D. M. Bortz, and J. J. More', "COPS: Large-Scale Nonlinearly Constrained Optimization Problems," Technical Memorandum ANL/MCS-TM-237, Sept. 1998, revised Oct. 1999. We have started the development of COPS, a collection of large-scale nonlinearly Constrained Optimization ProblemS. The primary purpose of this collection is to provide difficult test cases for optimization software. Problems in the current version of the collection come from fluid dynamics, population dynamics, optimal design, and optimal control. For each problem we provide a short description of the problem, notes on the formulation of the problem, and results of computational experiments with general optimization solvers. We currently have results for DONLP2, LANCELOT, MINOS, SNOPT, and LOQO.
Y. Zhou, "Eigenvalue Computation from the Optimization Perspective: On Jacobi-David son, IIGD, RQI, and Newton Updates," Preprint ANL/MCS-P1074-0803, August 2003. We discuss the close connection between eigenvalue computation using the Newton method and subspace methods. From the connection we derive a new class of Newton updates. The new update formulation is similar to the well-known Jacobian-Davidson method. This similarity leads to simplified versions of the Jacobi-Davidson method and the inverse iteration generalized Davidson (IIGD) method. We prove that the projection subspace augmented by the updating direction from each of these methods is able to include the Rayleigh quotient iteration (RQI) direction. Hence, the locally quadratic (cubic for normal matrices) convergence rat of the RQI method is retained and strengthened by the subspace methods. The theory is supported by extensive numerical results. Preconditioned formulations are also briefly discussed for large-scale eigenvalue problems.
L. Freitag, "Users Manual for Opt-MS: Local Methods for Simplicial Mesh Smoothing and Untangling," Technical Memorandum ANL/MCS-TM-239, April 1999. Creating meshes containing good-quality elements is a challenging, yet critical, problem facing computational scientists today. Several researches have shown that the size of the mesh, the shape of the elements within that mesh, and their relationship to the physical application of interest can profoundly affect the efficiency and accuracy of many numerical approximation techniques. If the application contains anisotropic physics, the mesh can be improved by considering both local characteristics of the approximate application solution and the geometry of the computational domain. If the application is isotropic, regularly shaped elements in the mesh reduce the discretization error, and the mesh can be improved a priori by considering geometric criteria only. The Opt-MS package provides several local node point smoothing techniques that improve elements in the mesh by adjusting grid point location using geometric criteria. The package is easy to use; only three subroutine calls are required for the user to begin using the software. The package is also flexible; the user may change the technique, function, or dimension of the problem at any time during the mesh smoothing process. Opt-MS is designed to interface with C and C++ codes, and examples for both two- and three-dimensional meshes are provided.
W. D. Gropp, "Runtime Checking of Datatype Signatures in MPI," Recent Advances in PVM and MPI, eds. J. Dongarra, P. Kacsuk, N. Podhorszki, 7th European PVM/MPI User's Group Meeting, 9/10-13/00 (to appear). Also Preprint ANL/MCS-P826-0500, May 2000. The MPI standard provides a way to send and receive complex combinations of datatypes (e.g., integers and doubles) with a single communication operation. The MPI standard specifies that the type signature, that is, the basic datatypes (language-defined types such as int or DOUBLE PRECISION), must match in communication operations such as send/receive or broadcast. Because datatypes may be defined by the user in MPI, there is a limitless collection of possible type signatures. Detecting the programmer error of mismatched datatypes is difficult in this case; detecting all errors essentially requires sending a complete description of the type signature with a message. This paper discusses an alternative: send the value of a function of the type signature so that (a) identical type signatures always give the same function value, (b) different type signatures often give difference values, and (c) common cases (e.g., predefined datatypes) are handled exactly. Thus, erronneous programs are often (but not always) detected; correct programs never are flagged as erroneous. The method described is relatively inexpensive to compute and uses a small (and fixed, independent of the complexity of the datatype) amount of space in the message envelope.
Y. Chen, B. F. Hobbs, S. Leyffer, and T. S. Munson, "Leader-Follower Equilibria for Electric Power and NOx Allowances Markets," Preprint ANL/MCS-P1191-0804, August 2004. This paper investigates the ability of the largest producer in an electricity market to manipulate both the electricity and emission allowances markets to its advantage. The Stackelberg game to analyze this situation is constructed in which the largest firm plays the role of leader while the medium-sized firms are treated as Cournot followers with price-taking fringes that behave competitively in both markets. Analysis of the computed solution for the Pennsylvania-New Jersey-Maryland electricity market shows that the leader can gain substantial profits by withholding allowances and driving up NOx allowance costs for rival producers. The allowances price is higher than the corresponding price in the Nash-Cournot case, although the electricity prices are essentially the same.
T. Munson, "Mesh Shape-Quality Optimization Using the Inverse Mean-Ratio Metric," Preprint ANL/MCS-1136-0304, April 2004. We present a nonlinear fractional program that relocates the vertices of a given mesh to optimize the average element shape quality as measured by the inverse mean-ratio metric. To solve the resulting large-scale optimization problems we apply an efficient implementation of an inexact Newton algorithm using the conjugate gradient method with a block Jacobi preconditioner to compute the direction. Numerical results on several test meshes are presented and used to quantify the impact on solution time and memory requirements of using a modeling language and general-purpose algorithm to solve these problems.
S. J. Benson and T. S. Munson, "Flexible Complementarity Solvers for Large-Scale Applications," Preprint ANL/MCS-P1055-0603, June 2003. Discretizations of infinite-dimensional variational inequalities lead to linear and nonlinear complementarity problems with many degrees of freedom. To solve these problems in a parallel environment, we propose two active-set methods that solve only one linear system of equations per iteration. The linear solver, preconditioner, and matrix structures can be chosen by the user for a particular application to achieve high parallel performance. The parallel scalability of these methods is demonstrated for some discretizations of infinite-dimensional variational inequalities.
J. J. More' and T. S. Munson, "Computing Mountain passes and Transition States," Preprint ANL/MCS-P957-0502, May 2002. We propose the elastic string algorithm for computing mountain passes in finite-dimensional problems. We analyze the convergence properties and numerical performance of this algorithm for benchmark problems in chemistry and discretizations of infinite-dimensional variational problems. We show that any limit point of the elastic strring algorithm is a path that crosses a critical point at which the Hessian matrix is not positive definite.
S. J. Benson and Y. Ye, "Approximating Maximum Stable Set and Minimum Graph Coloring Problems with the Positive Semidefinite Relaxation," Preprint ANL/MCS-P830-0600, June 2000. We compute approximate solutions to the maximum stable set problem and the minimum graph coloring problem using a positive semidefinite relaxation. The positive semidefinite programs are solved using an implementation of the dual scaling algorithm that takes advantage of the sparsity inherent in most graphs and th4e structure inherent in the problem formulation. From the solution to the relaxation, we apply a randomized algorithm to find approximate maximum stable sets and a modification of a popular heuristic to find graph colorings. We obtained high quality answers for graphs with over 1000 vertices and over 6000 edges.
J. Nocedal and S. J. Wright, Numerical Optimization, Springer, 1999. Numerical Optimization presents a comprehensive and up-to-date description of the most effective methods in continuous optimization. It responds to the growing interest in optimization in engineering, science, and business by focusing on the methods that are best suited to practical problems.
Stephen J. Wright, Primal-Dual Interior-Point Methods, SIAM, 1996. In the past decade, primal-dual algorithms have emerged as the most important and useful algorithms from the interior-point class. This book presents the major primal-dual algorithms for linear programming in straightforward terms. A thorough description of the theoretical properties of these methods is given, as are a discussion of practical and computational aspects and a summary of current software.
B. Smith, P. Bjorstad, and W. Gropp, Domain Decomposition, Cambridge University Press, 1996. The emergence of parallel computers and their potential for the numerical solution of Grand Challenge problems has led to a large amount of research in domain decomposition methods. This book presents an easy-to-read discussion of domain decomposition algorithms, their implementation and analysis. This book is ideal for graduate students about to embark on a career in computational science. It will also be a valuable resource for all those interested in parallel computing and numerical computational methods.
S. J. Wright, "Algorithms and Software for Linear and Nonlinear Programming," Foundations of Computer-Aided Process Design, AIChE Symposium Series, CACHE Publications, 1999. We investigate a modified Cholesky algorithm similar to those used in current interior-point codes for linear programming. Cholesky-based interior-point codes are popular for three reasons: their implementation requires only minimal changes to standard sparse Cholesky codes (allowing us to take full advantage of software written by specialists in that area); they tend to be more efficient than competing approaches that use different factorizations; and they perform robustly on most practical problems, yielding good interior-point steps even when the coefficient matrix is ill conditioned. We explain the surprisingly good performance of the Cholesky-based approach by using analytical tools from matrix perturbation theory and error analysis, illustrating our results with computational experiments. Finally, we point out the limitations of this approach.
Using MPI, 2nd edition, William Gropp, Ewing Lusk, and Anthony Skjellum, MIT Press, 1999. Using MPI is a completely up-to-date version of the authors' 1994 introduction to the core functions of MPI. It adds material on the new C++ and Fortran 90 bindings for MPI throughout the book. It contains greater discussion of datatype extents, the most frequently misunderstood feature of MPI-1, as well as material on the new extensions to basic MPI functionality added by the MPI-2 Forum in the area of MPI datatypes and collective operations.
Using MPI-2, William Gropp, Ewing Lusk, and Rajeev Thakur, MIT Press, 1999. Using MPI-2 covers the new extensions to basic MPI. These include parallel I/O, remote memory access operations, and dynamic process management. The volume also includes material on tuning MPI applications for high performance on modern MPI implementations.
MPI: The Complete Reference - 2nd Edition, Volume 2 - The MPI-2 Extensions, William Gropp, Steven Huss-Lederman, Andrew Lumsdaine, Ewing Lusk, Bill Nitzberg, William Saphir, and Marc Snir, MIT Press, 1998. Since its release in summer 1994, the Message Passing Interface (MPI) specification has become a standard for message-passing libraries for parallel computations. There exist more than a dozen implementations on a variety of computing platforms, from the IBM SP-2 supercomputer to PCs running Windows NT. The MPI Forum, which has continued to work on MPI, has recently released MPI-2, a new definition that includes significant extensions, improvements, and clarifications. This volume presents a complete specification of the MPI-2 Standard. It is annotated with comments that clarify complicated issues, including why certain design choices were made, how users are intended to use the interface, and how they should construct their version of MPI. The volume also provides many detailed, illustrative programming examples.
Using MPI: Portable Parallel Programming with the Message Passing Interface, William Gropp, Ewing Lusk, and Anthony Skjellum, MIT Press, 1994. This book, now replaced by Using MPI, second ed., introduced the core functions of the Message Passing Interface.
W. Gropp, E. Lusk, and D. Swider, "Improving the Performance of MPI Derived Datatypes," Proceedings of the 3rd Annual MPI Developers and Users Conference, ed. A. Skjellum, P. V. Bangalore, and Y. S. Dandass, Starkville, Mis sissippi, March 10-12, 1999, pp. 25-30. Jumpshot is a graphical tool for understanding the performance of parallel programs. It is in the tradition of the upshot tool, but contains a number of extensions and enhancements that make it suitable for large-scale parallel computations. Jumpshot takes as input a new, more flexible logfile format, and comes with a library for generating such logfiles. An MPI profiling library is also included, enabling the automatic generation of such logfiles from MPI programs. Jumpshot is written in Java, and can easily be integrated as an applet into browser-based computing environments. The most novel feature of Jumpshot is its automatic detection of anomalous durations, drawing the user's attention to problem areas in a parallel execution. This capability is particularly useful in large-scale parallel computations containing very many events.
R. Thakur, W. Gropp, and E. Lusk, "On Implementing MPI-IO Portably and with High Performance," Proceedings of the 6th Workshop on I/O in Parallel and Distributed Systems, 1999, 23-32 (Preprint ANL/MCS-P732-1098). We discuss the issues involved in implementing MPI-IO portably on multiple machines and file systems and also achieving high performance. One way to implement MPI-IO portably is to implement it on top of the basic Unix I/O functions (open, lseek, read, write, and close), which are themselves portable. We argue that this approach has limitations in both functionality and performance. We instead advocate an implementation approach that combines a large portion of portable code and a small portion of code that is optimized separately for different machines and file systems. We have used such an approach to develop a high-performance, portable MPI-IO implementation, called ROMIO.
W. D. Gropp, D. K. Kaushik, D. E. Keyes, and B. F. Smith, "Performance Modeling and Tuning of an Unstructured Mesh CFD Application," Preprint ANL/MCS-P833-0700, July 2000.This paper describes performance tuning experiences with a three-dimensional unstructured grid Euler flow code from NASA, which we have reimplemented in the PETSc framework and ported to several large-scale machines, including the ASCI Red and Blue Pacific machines, the SGI Origin, the Cray T3E, and Beowulf clusters. The code achieves a respectable level of performance for sparse problems, typical of scientific and engineering codes based on partial differential equations, and scales well up to thousands of processors. Since the gap between CPU speed and memory access rate is widening, the code is analyzed from a memory-centric perspective (in contrast to traditional flop-orientation) to understand its sequential and parallel performance. Performance tuning is approached on three fronts: data layouts to enhance locality of reference, algorithmic parameters, and parallel programming model. This effort was guided partly by some simple performance models developed for the sparse matrix-vector product operation.
H.G. Kaper and H. Nordborg, "The Frozen-Field Approximation and the Ginzburg-Landau Equations of Superconductivity," J. Engineering Mathematics 39 (2001) 221-240. The Ginzburg-Landau (GL) equations of superconductivity provide a computational model for the study of magnetic flux vortices in type-II superconductors. In this article we show through numerical examples and rigorous mathematical analysis that the GL model reduces to the frozen-field model when the charge of the Cooper pairs (the superconducting charage ) goes to zero while the applied field stays near the upper critical field.
E. Dolan, P. Hovland, J. More', B. Norris, and B. Smith, "Remote Access to Mathematical Software," Preprint ANL/MCS-P889-0601, June 2001. The network-oriented application services paradigm is becoming increasingly common for scientific computing. The popularity of this approach can be attributed to the numerous advantages to both user and developer provided by network-enabled mathematical software. The burden of installing and maintaining complex systems is lifted from the user, while enabling developers to provide frequent updates without disrupting service. Access to software with similar functionality can be unified under the same interface. Remote servers can utilize potentially more powerful computing resources than may be available locally. We discuss some of the application services developed by the Mathematics and Computer Science Division at Argonne National Laboratory, including the Network Enabled Optimization System (NEOS) Server and the Automatic Differentiation of C (ADIC) Server, as well as preliminary work on Web access to the Portable Extensible Toolkit for Scientific Compuoting (PETSc). We also provide a brief survey of related work.
R. Thakur and W. Gropp, "Parallel I/O," in CRPC Handbook of Parallel Computing, J. Dongarra, I. Foster, G. Fox, K. Kennedy, L. Torczon, and A. White, Morgan Kaufmann Publishers, Inc. (to appear). Also Preprint ANL/MCS-P837-0700, July 2000. Many parallel applications need to access large amounts of data. In such applications, the I/O performance can play a significant role in the overall time to completion. Although I/O is always much slower than computation, it is still possible to achieve good I/O performance in parallel applications by using a combination of sufficient amount of high-speed I/O hardware, appropriate file-system software, appropriate API for I/O, a high-performance implementation of the API, and by using that API the right way. We explain in further detail. [from a preliminary draft of the CRPC Handbook of parallel Computing]
P. M. Dickens and R. Thakur, "On Implementing High-Performance Collective I/O," Preprint ANL/MCS-P852-0700, November 2000. In many parallel applications, the I/O requirements quite often present a significant obstacle in the way of achieving good performance. An important area of current research is collective I/O, where the processors cooperatively develop an I/O strategy that reduces the number, and increases the size, of I/O requests, thereby making a better use of the I/O subsystem. While many studies have been presented in the literature showing excellent results using collective I/O techniques, there has been little discussion of implementation techniques for collective I/O operations and the impact on performance of varioius implementation strategies. A closely related issue, which has not yet received sufficient attention, is whether the high cost of I/O can be further reduced by executing the collective I/O operation in the background, thus overlapping its execution with other computation occurring in the foreground. In this paper, we investigate both of these important issues. First, we explore the issues that arise in the implementation of a collective I/O library and show the impact on performance resulting from various implementation strategies. To make these results as general as possible, we investigate the performance of collective I/O implementations on four different parallel architectures: the IBM SP2, the Intel Paragon, the HP Exemplar, and the SGI Origin2000. We show that a naive implementation of collective I/O does not result in significant performance gains for any of the architectures, but that an optimized collective I/O implementatin provides excellent performance across all of the platforms under study. Furthermore, we demonstrate that there exists a single implementation strategy that provides the best performance for all computational platforms. Next, we explore the issues that arise in the implementation of thread-based collective I/O operations. We show that the most obvious implementation technique, which is to spawn a thread to execute the whole collective I/O operation in the background, frequently provides the worst performance, often performing much worse than just executing the collective I/O routine entirely in the foreground. To improve the performance of thread-based collective I/O, we develop an alternate approach where part of the collective I/O operation is performed in the background, and part is performed in the foreground. We demonstrate that this new technique can provide significant performance gains, offering up to 50% improvement when compared with implementations that do not attempt to overlap collective I/O and computation.
A. Roy, I. Foster, W. Gropp, N. Karonis, V. Sander, and B. Toonen, "MPICH-GQ: Quality-of-Service for Message Passing Programs," Preprint ANL/MCS-P838-0700, July 2000. Parallel programmers typically assume that all resources required for a program's execution are dedicated to that purpose. However, in local and wide area networks, contention for shared networks, CPUs, and I/O systems can result in significant variations in availability, with consequent adverse effects on overall performance. We describe a new message-passing architecture, MPICH-GQ, that uses quality of service (QoS) mechanisms to manage contention and hence improve performance of message passing interface (MPI) applications. MPICH-GQ combines new QoS specification, traffic shaping, QoS reservation, and QoS implementation techniques to deliver QoS capabilities to the high-bandwidth bursty flows, complex structures, and reliable protocols used in high-performance application---characteristics very different from the low-bandwidth, constant bit-rate media flows and unreliable protocols for which QoS mechanisms were designed. Results obtained on a differentiated services testbed demonstrate our ability to maintain application performance in the face of heavy network contention.
W. D. Gropp, D. K. Kaushik, D. E. Keyes, B. F. Smith, "Understanding the Parallel Scalability of an Implicit Unstructured Mesh CFD Code," Preprint ANL/MCS-P845-0900, Sept. 2000. In this paper, we identify the scalability bottlenecks of an unstructured grid CFD code (PETSc-FUN3D) by studying the impact of several algorithmic and architectural parameters and by examining different programming models. We discuss the basic performance characteristics of this PDE code with the help of simple performance models developed in our earlier work, presenting primarily experimental results. In addition to achieving good per-processor performance (which has been addressed in our cited work and without which scalability claims are suspect) we strive to improve the implementation and convergence scalability of PETSc-FUN3D on thousands of processors.
R. Butler, W. Gropp, and E. Lusk, "A Scalable Process-Management Environment for Parallel Programs," Preprint ANL/MCS-P812-0400, April 2000, We present a process management system for parallel programs such as those written using MPI. A primary goal of the system, which we call MPD (for multipurpose daemon), is to be scalable. By this we mean that startup of interactive parallel jobs comprising a thousand processes is quick, that signals can be quickly delivered to processes, and that stdin, stdout, and stderr are managed intuitively. Our primary target is parallel machines made up of clusters of SMPs, but the system is also useful in more tightly integrated environments. We describe how MPD enables much faster startup and beter runtime management of MPICH jobs. We show how close control of stdio can support the easy implementation of a number of convenient system utilities, even a parallel debugger. MPD is implemented and freely distributed with MPICH.
P. D. Hovland and L. C. McInnes, "Parallel Simulation of Compressible Flow Using Automatic Differentiation and PETSc," Preprint ANL/MCS-P796-0200, November 2000. Many aerospace applications require parallel implicit solution strategies and software. We consider the use of two computational tools, PETSc and ADIFOR, to implement a Newton-Krylov-Schwarz method with pseudo-transient continuation for a particular application, namely, a steady-state, fully implicit, three-dimensional compressible Euler model of flow over an M6 wing. We describe how automatic differentiation (AD) can be used within the PETSc framework to compute the required derivatives. We present performance data demonstrating the suitability of AD and PETSc for this problem. We conclude with a synopsis of our results and a description of opportunities for future work.
T. Iliescu, "Genuinely Nonlinear Models for Convection-dominated Problems," Preprint ANL/MCS-P857-1100, November 2000. This paper introduces a general, nonlinear subgrid-scale (SGS) model, having bounded artificial viscosity, for the numerical simulation of convection-dominated problems. We also present a numerical comparison (error analysis and numerical experiments) between this model and the most common SGS model of Smagorinsky, which uses a p-Laplacian regularization. The numerical experiments for the 2-D convection-dominated convection-diffusion test problem show a clear improvement in solution quality for the new SGS model. This improvement is consistent with the bounded amount of artificial viscosity introduced by the new SGS model in the sharp transition regions.
S. L. Lee and P. D. Hovland, "Sensitivity Analysis Using Parallel ODE Solvers an d Automatic Differentiation in C: SensPVODE and ADIC," Preprint P818-0500, May 2000. PVODE is a high-performance ordinary differential equation solver for the types of initial value problems (IVPs) that arise in large-scale computational simulations. Often, one wants to compute sensitivities with respect to certain parameters in the IVP. We discuss the use of automatic differentiation (AD) to compute these sensitivities in the context of PVODE. Results on a simple test problem indicate that the use of AD-generated derivative code can reduce the time to solution over finite difference approximations.
< name="seaice"> J. G. Kim and P. D. Hovland", "Sensitivity Analysis and Parameter Tuning of a Sea-Ice Model," Preprint ANL/MCS-P819-0500, May 2000. The values of many of the parameters in climate models are often not known with any great precision. We describe the use of automatic differentiation to examine the sensitivity of an uncoupled dynamic-thermodynamic sea-ice model to various parameters. We also illustrate the effectiveness of using these sensitivity derivatives with an optimization algorithm to tune the parameters to maximize the agreement between simulated results and observational data.
W. D. Gropp, D. K. Kaushik, D. E. Keyes, and B. F. Smith, "Performance Modeling and Tuning of an Unstructured Mesh CFD Application," Preprint ANL/MCS-P833-0700, July 2000. This paper describes performance tuning experiences with a three-dimensional unstructured grid Euler flow code from NASA, which we have reimplemented in the PETSc framework and ported to several large-scale machines, including the ASCI Red and Blue Pacific, the SCI Origin, the Cray T3E, and Beowulf clusters. The code achieves a respectable level of performance for sparse problems, typical of scientific and engineering codes based on PDEs, and scales well up to thousands of processors. Since the gap between CPU speed and memory access rate is widening, the code is analyzed from a memory-centric perspective (in contrast to traditional flop orientation) to understand its sequential and parallel performance. Performance tuning is approached on three fronts: data layouts to enhance locality of reference, algorithmic parameters, and parallel programming model. This effort was guided partly by some simple performance models developed for the sparse matrix-vector product operation. P. Dickens and R. Thakur, "On Implementing High-Performance Collective I/O," Preprint ANL/MCS-P849-1000, Argonne National Laboratory, September 2000. Java is quickly becoming the preferred language for writing distributed applications because of its inherent support for programming on distributed platforms. In particular, Java provides compile-time and run-time security, automatic garbage collection, inherent support for multithreading, support for persistent objects and object migration, and portability. Given these significant advantages of Java, there is a growing interest in using Java for high-performance computing applications. To be successful in the high-performance computing domain, however, Java must have the capability to efficiently handle the significant I/O requirements commonly found in high-performance computing applications. While there has been significant research in high-performance I/O using languages such as C, C++, and Fortran, there has been relatively little research into the I/O capabilities of Java. In this paper, we evaluate the I/O capabilities of Java for high-performance computing. We examine several approaches that attempt to provide high-performance I/O--many of which are not obvious at first glance--and investigate their performance in both parallel and multithreaded environments. We also provide suggestions for expanding the I/O capabilities of Java to better support the needs of high-performance computing applications.
H. M. Bucker, K. R. Buschelman, and P. D. Hovland, "A Matrix-Matrix Multiplication Approach to the Automatic Differentiation and Parallelization of Straight-Line Codes," Preprint ANL/MCS-P854-1000, Argonne National Laboratory, October 2000. A straight-line code, which consists of assignment, addition, and multiplication statements, is an abstraction of a serial computer program to compute a function with n inputs. Given a serial straight-line code with N statements, we derive an algorithm that automatically evaluates not only the function but also its first-order derivatives with respect to the n inputs on a parallel computer. The basic idea of the algorithm is to marry automatic computation of derivatives with automatic parallelization of serial programs.
W. D. Gropp, D. K. Kaushik, D. E. Keyes, and B. F. Smith, "Latency, Bandwidth, and Concurrent Issue Limitations in High-Performance CFD," Preprint ANL/MCS-P850-1000, October 2000. This paper presents some strategies that have proved effective in tolerating the latencies arising from the hierarchical memory system and networ. We compare the different programming models from a performance standpoint. The demonstration code, PETSc-FUN3D, solves the Euler and Navier-Stokes equations of fluid flow in incompressible and compressible forms with second-order flux-limited characteristics-based convection schemes and Galerkin-type diffusion on unstructured meshes. The solution algorithm is pseudo-transient Newton-Krylov-Schwartz with block-incomplete factorization on each subdomain of the Schwarz preconditioner and with varying degrees of overlap.
K. R. Buschelman, W. D. Gropp, L. C. McInnes, and B. F. Smith, "PETSc and Overture: Lessons Learned Developing an Interface between Components, " Preprint ANL/MCS-P858-1100, November 2000. We consider two software packages that interact with each other as components: Overture and PETSc. An interface between these two packages could be of tremendous value to application developers in that Overture provides a simple mechanism for generating the large, sparse systems of linear equations that correspond to discretizations of a PDE, and PETSc provides a powerful collection of methods for solving these systems. Two types of interfaces are discussed: the internal interface between components, and the external interface for the application developer. We compare three basic approaches to developing the internal interface between Overture and PETSc, the final one of which is a peer-to-peer model.
B. Norris and P. D. Hovland, "An Application Server for Automatic Differentiation," Preprint ANL/MCS-P856-1100, November 2000. The ADIC Application Server brings the accuracy and efficiency of automatic differentiation to the World Wide Web. Users of the ADIC Application Server can upload source code written in ANSI-C, manage remote files, differentiate selected functions, and download code augmented with derivative computations. Using a simple driver and linking to the appropriate libraries, the user can compile and run the differentiated code locally. We discuss the unique requirements for an automatic differentiation application server and describe the implementation of the ADIC Application Server.
P. D. Hovland, D. E. Keyes, L. C. McInnes, and W. Samyono, "Using Automatic Differentiation for Second-Order Matrix-Free Methods in PDE-constrained Optimization," Preprint ANL/MCS-P855-1100, November 2000. Classical methods of constrained optimization are often based on the assumption that projection onto the constraint manifold is routine but accessing second-derivative information is not. Both assumptions need revision for the application of optimization to systems constrained by PDEs, in the contemporary limit of millions of state variables and in the parallel setting. Large-scale PDE solvers are complex pieces of software that exploit detailed knowledge of architecture and application and cannot easily be modified to fit the interface requirements of a blackbox optimizer. Furthermore, in view of the expense of PDE analyses, optimization methods not using second derivatives may require too many iterations to be practical. For general problems, automatic differentiation is likely to be the most convenient means of exploiting second derivatives. We delineate a role for automatic differentiation in matrix-free optimization formulations involving Newton's method, in which little more storage is required than that for the analysis code alone.
J. J. More', "Automatic Differentiation Tools in Optimization Software," Preprint ANL/MCS-P859-1100, November 2000. We discuss the role of automatic differentiation tools in optimization software. We emphasize issues that are important to large-scale optimization and that have proved useful in the installation of nonlinear solvers in the NEOS Server. Our discussion centers on the computation of the gardient and Hessian matrix for partially separable functions and shows that the gradient and Hessian matrix can be computed with guaranteed bounds in time and memory requirements.
W. D. Gropp, D. K. Kaushik, D. E. Keyes, and B. F. Smith, "High-Performance Parallel Implicit CFD," Preprint ANL/MCS-P863-1200, December 2000. Fluid dynamical simulations based on finite discreteizations on (quasi-)static grids scale well in parallel, but execute at a disappointing percentage of per-processor peak floating point operation rates withiout special attention to layout and access ordering of data. We document both claims from our experience with an unstructured grid CFD code that is typical of the state of the practice at NASA. These basic performance characteristics of PDE-based codes can be understood with surprisingly simple models, for which we quote earlier work, presenting primarily experimental results herein. These perormance models and experimental results motivate algorithmic and software practices that lead to improvements in both parallel scalability and per node performance. This snapshot of ongoing work updates our 1999 Bell Prize-winning simulation on ASCI computers.
S. J. Wright, "Constraint Identification and Algorithm Stabilization for Degenerate Nonlinear Programs," Preprint ANL/MCS-P865-1200, Devember 2000. In the vicinity of a solution of a nonlinear programming problem at which both strict complementarity and linear independence of the active constraints may fail to hold, we describe a technique for distinguishing weakly active from strongly active constraints. We show that this information can be used to modify the sequential quadratic programming algorithim so that it exhibits superlinear convergence to the solution under assumptions weaker than those made in previous analyses.
S. J. Benson, L. C. McInnes, and J. J. More', "GPCG: A Case Study in the Performance and Scalability of Optimization Algorithms." Preprint ANL/MCS-P768-0799, September 2000. PCG is an algorithm within the Toolkit for Advanced Optimization (TAO) for solving bound constrained, convex quadratic problems. Originally developed by More' and Toraldo, this algorithm was designed for large-scale problems but had been implemented only for a single processor. The TAO implementation is available for a wide range of high-performance architecture, and has been tested on up to 64 processors to solve problems with over 2.5 million variables.
E. D. Dolan and J. J. More', "Benchmarking Optimization Software with COPS," Technical Memo ANL/MCS-TM-246, November 2000. We describe version 2.0 of the COPS set of nonlinearly constrained optimization problems. We have added new problems, as well as streamlined and improved most of the problems. We also provide a comparison of the LANCELOT, LOQO, MINOS, and SNOPT solvers on these problems.
M. C. Ferris and T. S. Munson, "Semismooth Support Vector Machines," Preprint ANL/MCS-P860-1100, November 2000. The linear support vector machine can be posed as a quadratic program in a variety of ways. In this paper, we look at a formulation using the two-norm for the misclassification error that leads to a positive definite quadratic program with a single equality constraint when the Wolfe dual is taken. The quadratic term is a small rank update to a positive definite matrix. We reformulate the optimality conditions as a semismooth system of equations using the Fischer-Burmeister function and apply a damped Newton method to solve the resulting problem. The algorithm is shown to converge from any starting point with a Q-quadratic rate of convergence. At each iteration, we use the Sherman-Morrison-Woodbury update formula to solve a single linear system of equations. Significant computational savings are realized as the inactive variables are identified and exploited during the solution process. Results for a 60 million variable problem are presented, demonstrating the effectiveness of the proposed method on a personal computer.
J. No, R. Thakur, D. Kaushik, L. Freitag, and A. Choudhary, "A Scientific Data Management System for Irregular Applications," Preprint ANL/MCS-P866-1000, October 2000. Many scientific applications are I/O intensive and generate or access large data sets, spanning hundreds or thousands of "files." Management, storage, efficient access, and analysis of this data present an extremely challenging task. We have developed a software system, called Scientific Data Manager (SDM), that uses a combination of parallel file I/O and database support for high-performance scientific data management. SDM provides a high-level API to the user and, internally, uses a parallel file system to store real data and a database to store application-related metadata. In this paper, we describe how we designed and implemented SDM to support irregular applications. SDM can efficiently handle the reading and writing of data in an irregular mesh as well as the distribution of index values. We describe the SDM user interface and how we implemented it to achieve high performance. SDM makes extensive use of MPI-IO's noncontiguous collective I/O . SDM also uses the concept of a history file to optimize the cost of the index distribution using the metadata stored in the database. We present performance results with two irregular applcations, a CFD code called FUN3D and a Rayleigh-Taylor instability code, on the SGI Origin2000 at Argonne National Laboratory.
W. D. Gropp, D. K. Kaushik, D. E. Keyes, and B. F. Smith, "Understanding the Parallel Scalability of an Implicit Unstructured Mesh CFD Code," Preprint ANL/MCS-P845-0900, Spetember 2000. In this paper, we identify the scalability bottlenecks of an unstructured grid CFD code (PETSc-FUN3D) by studying the impact of several algorithmic and architectural parameters and by examining different programming models. We discuss the basic performance characteristics of this PDE code with the help of simple performance models developed in our earlier work, presenting primarily experimental results. In addition to achieving good per-processor performance (which has been addressed in our cited work and without which scalability claims are suspect) we strive to improve the implementation and convergence scalability of PETSc-FUN3D on thousands of processors.
M. Anitescu, "On the Rate of Convergence of Sequential Quadratic Programming with Nondifferentiable Exact Penalty Function in the Presence of Constraint Degeneracy," Preprint ANL/MCS-P760-0699, June 1999. We investigate nonlinear programs that have a nonempty but possibly unbounded Lagrange multiplier set and that satisfy the quadratic growth condition. We show that such programs can be transformed, by relaxing the constraints and adding a linear penalty term to the objective function, into equivalent nonlinear programs that have differentiable data and a bounded Lagrange multiplier set and that satisfy the quadratic growth condition. As a result we can define, for this type of problem, algorithms that are linearly convergent, using only first-order information, and superlinearly convergent.
E. D. Dolan and J. J. More', "Benchmarking Optimization Software with Performance Profiles," Preprint ANL/MCS-P861-1200, January 2001. We propose performance profile---distribution functions for a performance metric---as a tool for benchmarking and comparing optimization software. We show that performance profiles combine the best features of other tools for performance evaluation.
M. Grimsditch, G. K.Leaf, H.G. Kaper, D. A. Karpeev, and R. E. Camley, "Normal Modes of Spin Excitations in Magnetic Nanoparticles," Preprint ANL/MCS-P1049-0503, May 2003. This article describes a technique to compute magnetic normal modes of nanosized particles. The technique is based on the Landau-Lifschitz formalism of micromagnetics and accounts fully for both the exchange and the dipolar field. It requires no more than the specification of the material parameters and the geometry of the sample; it does not require the specification of boundary conditions. It also allows the large-amplitude nonlinear regime to be proved. The technique is applied to a model of a polycrystalline iron particle, which is shown to possess a rich variety of normal modes. Some of these modes are reminiscent of standing waves, while others are more or less localized in parts of the sample, The variation of the mode frequencies with the applied field is analyzed and compared with existing approximations.
P. J. Moon, G. Sandi, D. Stevens, and R. Kizilel, "Computational Modeling of Ionic Transport in Continuous and Batch Electrodialysis," Preprint ANL/MCS-P1111-1203, December 2003. This article describes the transport of ions and dissociation of a single salt and a solvent solution in an electrodialysis stack. Three models for a single salt (KCl) areproposed: the first for a 1D and 2D continuous electrodialysis, and the third for batch electrodialysis. We examine the diffusion and electromigration of ions in the polarization region and consider electromigration and convection in the bulk region. We show that the ionic surface concentration of both membranes in the diute compartment is affected by two parameters: fow rate and current density. We also show that, in the diute compartment, concentration changes along the x-axi are geater than along the y-axis. In simulations, the dilute voltage drop accounts for more than 36 percent of the total voltage drop. This value is reduced to 7 percent as the dilute concentration increases, and contributions of both ion-exchange membranes drop account for more than 50 percent of the total cell voltage drop. All three models are validated experimentally.
A. Zagaris, H.G. Kaper, and T. J. Kaper, "Analysis of the Computatioal Singular Perturbation Reduction Method for Chemical Kinetics," J. Nonlinear Sci. 14 (2004) 59-91. This article is concerned with the asymptotic accuracy of the computational singular perturbation (CSP) method developed by Lam and Goussis to reduce the dimensionality of a system of chemical kinetics equations. The method exploits the presence of disparate time scales to model the dynamics by an evolution equation on a lower-dimensional slow manifold. In this article it is shown that the successive applications of the CSP algorithm generate, order by order, the asymptotic expansion of a slow manifold. The results are illustrated on the Michaelis-Menton-Henri equations of enzyme kinetics.
J. Samuel Jiang, H.G. Kaper, and G. K. Leaf, "Hysteresis in Layered Spring Magnets," Discrete and Continuous Dynamical Systems - Series 1, no. 2 (May 2001), 219-232. This article addresses a problem of micromagnetics: the reversal of magnetic moments in layered spring magnets. A one-dimensional model is used of a film consisting of several atomic layers of a soft material on top of several atomic layers of a hard material. Each atomic layer is taken to be uniformly magnetized, and spatial inhomogeneities within an atomic layer are neglected. The state of such a system is described by a chain of magnetic spin vectors. Each spin vector behaves like a spinning top driven locally by the effective magnetic field and subject to damping (Landau-Lifshitz-Gilbert equation). A numerical integration scheme for the LLG equation is presented that is unconditionally stable and preserves the magnitude of the magnetization vector at all times. The results of numerical investigations for a bilayer in a rotating in-plane magnetic field show hysteresis with a basic period of 2 pi at moderate fields and hysteresis with a basic period of pi at strong fields.
M. Garbey and H.G. Kaper, "Asymptotic-Numerical Study of Supersensitivity for Generalized Burgers' Equatio ns," SIAM J. Sci. Comput. 22, no. 1 (2000) 368-385. This article addresses some asymptotic and numerical issues related to the solition of Burgers' equation. The solutions presented exhibit a transition layer in the interior of the domain, whose position as t goes to infinity is supersensitive to the boundary perturbation. Algorithms are presented for the computation of the position of the transition layer at steady state. The algorithms generalize to viscous conservation laws with a convex nonlinearity and are scalable in a parallel comoputing environment.
W. M. Gertz, P. E. Gill, and J. Muetherig, "Users Guide for SnadiOpt: A Package Adding Automatic Differentiation to Snopt," Technical Memorandum ANL/MCS-TM-245, January 2001. SnadiOpt is a package that supports the use of the automatic differentiation package ADIFOR with the optimization package Snopt. Snopt is a general-purpose system for solving optimization problems with many variables and constraints. It minimizes a linear or nonlinear function subject to bounds on the variables and sparse linear or nonlinear constraints. It is suitable for large-scale linear and quadratic programming and for linearly constrained optimization, as well as for general nonlinear programs. The method used by Snopt requires the first derivatives of the objective and constraint functions to be available. The SnadiOpt package allows users to avoid the time-consuming and error-prone process of evaluating and coding these derivatives. Given Fortran code for evaluating only the values of the bjective and constraints, SnadiOpt automatically generates the code for evaluating the derivatives and builds the relevant Snopt input files and sparse data structures.
E. M. Gertz, "A Quasi-Newton Trust-Region Method," Preprint ANL/MCS-P873-0201", February 2001. The classical trust-region method for unconstrained minimization can be augmented with a line search that finds a point that satisfies the Wolfe conditions. One can use this new method to define an algorithm that simultaneously satisfies the quasi-Newton condition at each iteration and maintains a positive-definite approximation to the Hessian of the objective function. This new algorithm has strong global convergence properties and appears to be robust and efficient in practice
S. J. Wright, "Effects of Finite-Precision Arithmetic on Interior-Point Methods for Nonlinear Programming," Preprint ANL/MCS-P705-0198, Jan. 2001. We show that the effects of finite-precision arithmetic in forming and solving the linear system that arises at each iteration of primal-dual interior-point algorithms for nonlinear programming are benign, provided that the iterates satisfy centrality and feasibility conditions of the type usually associated with path-following methods. When we replace the standard assumption that the active constraint gradients are independent by the weaker Mangasarian-Fromovitz constraint qualification, rapid convergence usually is attainable, even when cancellation and roundoff errors occur during the calculations. In deriving our main results, we prove a key technical result about the size of the exact primal-dual step. This result can be used to modify existing analysis of primal-dual interior-point methods for convex programming, making it possible to extend the superlinear local convergence results to the nonconvex case.
S. J. Benson and Y.Ye, "DSDP3: Dual Scaling Algorithm for General Positive Semidefinite Programming," Preprint ANL/MCS-P851-1000, Feb. 2001. We implement a dual scaling algorithm for positive semidefinite programming to handle a broader class of problems than could be solved with previous implementations of the algorithm. With appropriate representations of constraint matrices, we can solve general semidefinite programs and still exploit the structure of large-scale combinatorial optimization problems. Computational results show that our preliminary implementation is competitive with primal-dual solvers on many problems requiring moderate precision in the solution and is superior to primal-dual solvers for several types of problems.
E. D. Dolan and J. J. More', "Benchmarking Optimization Software with Performance Profiles," Preprint ANL/MCS-P861-1200, Jan. 2001. We propose performance profile---distribution functions for a performance metric---as a tool for benchmarking and comparing optimization software. We show that performance profiles combine the best features of other tools for performance evaluation.
E. D. Dolan, "NEOS Server 4.0 Administrative Guide," Tech. Memo ANL/MCS-TM-250 (May 2001). The NEOS SErver 4.o provides a general Internet-based client/server as a link between users and software applications. Tnhe administrative guide covers the fundamental principles behiind the operation of the NEOS Server, installation and trouble-shooting of the Server software, and implementation details of potential interest to the NEOS Server administrator. The guide also discusses making new software applications available through the Server, including areas of concern to remote solver administrators such as maintaining security, providing usage instructions, and enforcing reasonable restrictions on jobs. The guide is intended both as an introduction to the NEOS Server and as a reference for use when running the Server.
A. Kawano, Y. Guo, D. L. Thompson, A. L. Wagner, and M. Minkoff, "Improving the Accuracy of Interpolated Potential Energy Surfaces by Using an Analytical Zeroth-Order Potential Function," Preprint ANL/MCS-P1137-0304, March 2004. We present a method for improving the accuracy and efficiency of interpolation methods, in which an analytical zeroth-order potential energy surface is employed as a reference surface. To investigate and test the method, we apply it to HOOH, where there exists an accurate analytical survace that we take as the "exact" surface for obtaiing the nergies and derivatives for fitting and assessing the accuracy. Examples are given for 4-D and 6-D surfaces interpolated by using eighter the modified Shepard or second-degree interpolating moving least squares approach, with comparisons for cases with and without using the zeroth-order potential.
R. Shepard, A. F. Wagner, J. L. Tilson, and M. Minkoff, "The Subspace Projected Approximate Matrix (SPAM) Modification of the Davidson Method," Journal of Computational Physics, Vol. 172, no. 2 (Sept. 2001), 472-514. A modification of the iterative matrix diagonalization method of Davidson is presented that is applicable to the symmetric eigenvalue problem. This method is based on subspace projections of a sequence of one or more approximate matrices. The purpose of these approximate matrices is to improve the efficiency of the solution of the desired eigenpairs by reducing the number of matrix-vector products that must be computed with the exact matrix. Several applications are presented. These are chosen to show the range of applicability of the method, the convergence behavior for a wide range of matrix types, and also the wide range of approaches that may be employed to generate approximate matrices.
S. J. Wright, "Solving Optimization Problems on Computational Grids," Preprint ANL/MCS-P874-0101 (January 2001). This article describes some of the results obtained during the first three years of the metaNEOS project. Efforts have led to development of the runtime support library MW, for implementing algorithms with master-worker control structure on Condor platforms. Other metaNEOS work includes new algorithms and codes for integer linear programming, the quadratic assignment problem, and stochastic linear programming.
J. Linderoth and S. Wright, "Decomposition Algorithms for Stochastic Programming on a Computational Grid," Preprint ANL/MCS-P875-0401 (April 2001). We describe algorithms for two-stage stochastic linear programming with recourse and their implementation on a grid computing platform. In particular, we examine serial and asynchronous versions of the L-shaped method and a trust-region method. The parallel platform of choice is the dynamic, heterogeneous, opportunistic platform provided by the Condor system. The algorithms are of master-worker type (with the workers being used to solve second-stage problems), and the MW runtime support library (which supports master-worker computations) is key to the implementation. Computational results are presented on large sample average approximations of problems from the literature.
T. Colin, C. Galusinski, and H.G. Kaper, "Waves in Ferromagnetic Media," Commun. in PDEs 27, no. 7-8 (2002) 1625-1658. It is shown that small perturbations of equi8librium states in ferromagnetic media give rise to standing and traveling waves that are stable for long times. The evolution of the wave profiles is governed by semilinear heat equations. The mathematical model underlying these results consists of the Landau-Lifshitz equation for the magnetization vector and Maxwell's equations for the electromagnetic field variables. The model belongs to a general class of hyperbolic equations for vector-valued functions, whose asymptotic properties are analyzed rigorously. The results are illustrated with numerical examples.
E. Ong, E. Lusk, and W. Gropp, "Scalable Unix Commands for Parallel Processors: A high-Performance Implementation," in Recent Advances in Parallel Virtual Machine and Message Passing Interface, eds. Y. Cotronis and J. Dongarra, Lecture Notes in Computer Science, Vol. 2131, Springer-Verlag, pp. 410-418, Sept. 2001. We describe a family of MPI applications we call the Parallel Unix Commands. These commands are natural parallel versions of common Unix user commands such as ls, ps, and find, together with a few similar commands particular to the parallel environment. We describe the design and implementation of these programs and present some performance results on a 256-node Linux cluster. The Parallel Unix Commands are open source and freely available.
E. Dolan, P. Hovland, J. More', B. Norris, and B. Smith, "Remote Access to Mathematical Software," Preprint ANL/MCS-P889-0601, June 2001. The NEOS Server provides access to optimization solvers through the Internet with a suite of interfaces. In particular, the Kestrel interface enables the remote solution of optimization problems within the AMPL and GAMS modeling languages. Problem generation, including the run-time detection of syntax errors, occurs on the local machine using any available modeling language facilities. Solution takes place on a remote machine, with the result returned in the native modeling language format for further processing. No significant differences exist between local and remote solutions. A byproduct of the Kestrel interface is the ability to solve in parallel multiple problems generated by a modeling language.
L. C. Berselli, G. P. Galdi, T. Iliescu, and W. J. Layton, "Mathematical Analysis for the Rational Large Eddy Simulation Model," Preprint ANL/MCS-P914-1001, October 2001. In this paper we consider the Rational Large Eddy Simulation model recently introduced by Galdi and Layton. We briefly present this model, which (in principle) is similar to others commonly used, and we prove the existence and uniqueness of a class of strong solutions. Contrary to the gradient model, the main feature of this model is that it allows a better control of the kinetic energy. Consequently, to prove existence of strong solutions, we do not need subgrid-scale regularization operators, as proposed by Smagorinsky. We also introduce some breakdown criteria that are related to the Euler and Navier-Stokes equations.
R. Thakur, W. Gropp, and E. Lusk, "Optimizing noncontiguous Accesses in MPI-IO," Parallel Computing 28 (2002) 83-105. The I/O access patterns of many parallel applications consist of accesses to a large number of small, noncontiguous pieces of data. If an application's I/O needs are met by making many small, distinct I/O requests, however, the I/O performance degrades drastically. To avoid this problem, MPI-IO allows users to access noncontiguous data with a single I/O function call, unlike in Unix I/O. In this paper we explain how critical this feature of MPI-IO is for high performance and how it enables implementations to perform optimizations. We first provide a classification of the different ways of expressing an application's I/O needs in MPI-IO -- we classify them in to four levels, called levels 0-3. We demonstrate that, for applications with noncontinguous access patterns, the I/O performance improves dramatically if users write their applications to make level-3 requests (noncontiguous, collective) rather than level-0 requests (Unix style). We then describe how our MPI-IO implementation, ROMIO, delivers high performance for noncontiguous requests. We explain in detail the two key optimizations ROMIO performs: data sieving for noncontiguous requests from one process and collective I/O for noncontiguous requests from multiple processes. We describe how we have implemented these optimizations portably on multiple machines and file systems, controlled their memory requirements, and also achieved high performance. We demonstrate the performace and portability with perormance results for three applications - an astrophysics-application template, the NAS BTIO benchmark, and an unstructured code -- on five different parallel machines: HP Exemplar, IBM SP, Intel Paragon, NEC SX-4, and SGI Origin2000.
T. Iliescu and P. Fischer, "Large Eddy Simulation of Turbulent Channel Flows by the Rational LES Model," Preprint ANL/MCS-P938-0302 (March 2002). The rational large eddy simulation (RLES) model is applied to turbulent channel flows. This approximate deconvolution model is based on a rational (subdiagonal Padé) approximation of the Fourier transform of the Gaussian filter and is proposed as an alternative to the gradient (also known as the nonlinear or tensor-diffusivity) model. We used a spectral element code to perform large eddy simulations of incompressible channel flows at Reynolds numbers based on the friction velocity and the channel half-width Ret = 180 and Ret = 395. We compared the RLES model with the gradient model. The RLES results showed a clear improvement over those corresponding to the gradient model, comparing well with the fine direct numerical simulation. For comparison, we also present results corresponding to a classical subgrid-scale eddy-viscosity model such as he standard Smagorinsky model.
E. M. Gertz and S. Wright, OOQP User Guide, Tech Memo ANL/MCS-TM-252, October 2001. OOQP is an object-oriented software package for solving convex quadratic programming problems (QP). We describe the design of OOQP, and document how to use OOQP in its default configuration. We further describe OOQP as a development framework, and outline how to develop custom solvers that solve QPs with exploitable structure or use specialized linear algebra.
B. Norris, S. Balay, S. Benson, L. Freitag, P. Hovland, L. McInnes, and B. Smith, "Parallel Components for PDEs and Optimization: Some Issues and Experiences, Preprint ANL/MCS-P932-0202, February 2002. High-performance simulations in computational science often involve the combined software contributions of multidisciplinary teams of scientists, engineers, mathematicians, and computer scientists. One goal of component-based software engineering in large-scale scientific simulations is to help manage such complexity by enabling better interoperability among codes developed by different groups. This paper discusses recent work on building component interfaces and implementations in parallel numerical toolkits for mesh manipulations, discretization, linear algebra, and optimization. We consider several motivating applications involving partial differential equations and unconstrained minimization to demonstrate this approach and evaluate performance.
J. M. Restrepo and G. K. Leaf, "Noise Effects on Wave-Generated Transport Induced by Ideal Waves," J. Phys. Oceanography 32 (2002) 2334-2349. We consider the transport velocity in boundary layer flows driven by either noisy monochromatic progressive or standing waves. The central issue addressed here is whether such flows are capable of sustaining a transport velocity when noise is present in the wave field and, if so, in what ways the noise affects the transport velocity, the mean wall shear stress, and the total mass flux. Specifically, we address the effect of noise due to unresolved processes. Our study is motivated by the fact that in the natural setting it is the norm rather than the exception that noise is present in the wave field. We find that when noise is added to standing waves, the transport in the boundary layer leads to a nonzero mass flux. On the other hand, noise due to progressive waves reduces the mass flux. Further, we find that the drift velocity will have two components: a deterministic one and a diffusive one.
G. K. Leaf, S. Obukhov, S. Scheidl, and V. M. Vinokur, "Transient Dynamics of Pinning of Domain Wall," J. Magnetism and Magnetic Mat. 241 (2002) 118-123. We study the evolution of an elastic string, serving as a model for a domain wall, into the pinned state at driving forces slightly below the depinning threshold force. We quantify the temporal evolution of the string by an actigity function representing the fraction of active nodes at time t and find three distinct dynammic regimes. There is an initial stage of fast decay of the activity; in the second, intermediate, regime, an exponential defcay of activity is observed; and eventually, the fast collapse of the string towards its final pinned state results in decay in the activity.
A. Rodriguezm D. Aulakhe, E. Marland, V. Nefedova, G. X. Yu, and N. Maltsev, "GADU -- Genome Analysis and Database Update Pipeline," Preprint ANL/MCS-P1029-0203. Realizing the enormous scientific potential of exponentially growing biological information requires the development of high-throughput automated computational environments that integrate large amounts of genomic and experimental data, and powerful tools for knowledge discovery and data mining. To assist high-throughput analysis of the genomes, we have developed the Genome Analyssis and Databases Update system. GADU efficiently automates major steps of genome analysis: data acquisition and data analysis by a variety of tools and algorithms, as well as data storage and annotation. We are developing a TeraGrid technology-based backend for large-scale computations using GADU. GADU can function in either an automated or interactive mode via a Web-based user interface. Programs monitor every operation in GADU and report the status of the process. This architecture ensures GADU's robust performance and allows simultaneous processing of a large number of sequenced genomes regardless of their size. <)> S. Leyffler, "Mathematical Programs with Complementarity Constraints," Preprint ANL/MCS-P1026-0203, February 2003. This short note surveys recent developments in the use of nonlinear program solvers for solving a large class of mathematical programs with complementarity constraints.
S. Leyffer, "The Penalty Interior-Point Method Fails to Converge," Preprint ANL/MCS-P1091-0903, September 2003. Equilibrium equations in the form of complementarity conditions oftwn appear as constraints in optimization problems. Problems of this type are commonly referred to as mathematical programs with complementarity constraints (MPCCs). A popular method for solving MPCCs is the penalty interior-point algorithm (PIPA). This paper presents a small example for which PIPA converges to a nonstationary point, providing a counterexample to the established theory. The reasons for this adverse behavior are discussed.
>M. Cohen, U. Naumann, and J. Riehme, "Towards Differentiation-Enabled Fortran 95 Compiler Technology," Proc. ACM Symp. on Applied Computing, Melbourne, FL, 3/9-12/2003, pp. 143-147 (also Preprint ANL/MCS-P935-0202, May 2003). We present a novel approach to generating derivative code for mathematical models implemented as Fortran 95 programs using Automatic Differentiation inside a compiler. This technique allows us to combine the advantages of both operator overloading and source transformation -based tools for automatic differentiation. Furthermore, the compiler's infrastructure for syntactic, semantic, and static data flow analysis can be built on.
L. Hascoet, U. Naumann, and V. Pascual, "TBR Analysis in Reverse-Mode Automatic Differentiation," Preprint ANL/MCS-P936-0202, February 2003. The automatic generation of adjoints of mathematical models that are implemented as computer programs is receiving a increased attention in the scientific and engineering communities. Reverse-mode automatic differentiation is of particular interest for large-scale optimization problems. It allows the computation of gradients at a small constant multiple of the cost for evaluating the objective function itself, that is independent of the number of input parameters. Source-to-source transformation tools for automatic differentiation are available to generate adjoint codes based on the adjoint version of every statement that can be built by applying simple differentiation rules. Therefore, a reversal of the control flow of the original program becomes necessary. To guarantee correctness, certain values that are computed and overwritten in the original program must be made available in the adjoint program. They can be determined by performing a static data flow analysis, the so-called TBR analysis. Overestimation of this set must be kept minimal to get efficient adjoint codes. For many real-world applications the applicability of source-to-source transformation tools for automatic differentiation cannot be achieved without this efficiency.
U. Naumann, "Optimal Accumulation of Jacobian Matrices by Elimination Methods on the Dual Computational Graph," Preprint ANL/MCS-P943-0402, April 2003. This paper introduces face elimination as the basic technique for accumulating Jacobian matices by using a minimal number of aarithmetic operations. Its superirity over both edge and vertex elimination methods is shown. The intention is to establish the conceptual basis for the ongoing development of algorithms for optimizing the computation of Jacobian matrices.
U. Naumann and P. Heimbach, "Coupling Tangent-Llinear and Adjoint Models," Preprint ANL/MCS-P1022-0203, June 2003. We consider the solution of a (generalized) eigenvalue problem arising in physical oceanography that involves the evaluation of both the tangent-linear and adjoint versions of the underlying numerical model. Two different approaches are discussed. First, tangent-linear and adjoint models are generated by the software tool TAF and used separately. SEcond, the two models are combined into a single derivative model based on optimally preaccumuulated local gradients of all scalar assignments. Th3e coupled tangent-linear/adjoint model promises to be a good solution for small or medium-sized problems. However, the simplicity of the example code at hand prevents us from observiing considerable run-time differences between the two approaches.
U. Naumann, "On Optimal Jacobian Accumulation for Single Expression Use Programs," Preprint ANL/MCS-P944-0402, April 2002. ADIFOR and ADIC, the widely used software tools for automatic differentiation, use assignment-level reverse mode to compute local gradients of scalar assignment. This pre-accumulation often results in very efficient forward-mode derivative code. Scalar assignments belong to the class of single expression use (SEU) programs. There, the values of all intermediate variables are read exactly once. Based on several theoretical results, we derive an algorithm for generating optimal Jacobian code for SEU programs. The number of floating-point operations performed during the accumulation of the Jacobian is minimized.
U. Naumann and A. Walther, "An Introduction to Using Software Tools for Automatic Differentiation," An Introduction to Using Software Tools for Automatic Differentiation", Preprint ANL/MCS-TM-254, July 2003. We give a gentle introduction to using various software tools for automatic differentiation (AD). Ready-to-use examples are discussed, and links to further information are presented. Our target audience includes all those who are looking for a straightforward way to get started using the available AD technology. The document is dynamic in that its content will be updated as the AD software evolves.
U. Naumann, "Statement-Level Optimality of Tangent-Linear and Adjoint Models," Preprint ANL/MCS-P-1066, June 2003. We discuss a combinatorial problem that arises in the optimization of first-derivative code generated by exploiting the associativity of the chain rule. Our objective is to minimize the arithmetic complexity of such codes. A hierarchical approach is taken that ensures optimality locally at the level of single scalar assignments. New optimal algorithms are presented for three important cases that represent the main components of any derivative code.
M. Anitescu, A. Miller, and G. D. Hart, "Constraint Stabilization for Time-Stepping Approaches for Rigid Multibody Dynamics with Joints, Contact, and Friction," in Proc. DETC'03, Chicago, September 2003. Also Preprint ANL/MCS-P1023-0403, April 2003. We present a method for achieving geometrical constraint stabilization for a linear-complementarity-based time-stepping scheme for rigid multibody dynamics with joints, contact, and friction. The method requires the solution of only one linear complementarity problem per step. Several examples are used to demonstrate the constraint stabilization effect.
M. Anitescu and G. D. Hart, "Solving Nonconvex Problems of Multibody Dyamics with Contact and Small Fr iction by Successive Convex Relaxation," Preprint ANL/MCS-P995-0902, September 2002. Timestepping methods using impulse-velocity approaches are guaranteed to have a solution for any friction coefficient, but they may have nonconvex solution set for any nonzero value of the friction coefficient. We construct an iterative algorithm that solves convex subproblems and that is guaranteed, for sufficinetly small friction coefficients, to retrieve, at a linear convergence rate, the velocity solution of the nonconvex linear complementarity problems whenever the frictionless configuration can be disassembled. In addition, we show that one step of the iterative algorithm provides an excellent approximation to the velocity solution of the original, possibly nonconvex, problem if the product between the friction coefficient and the slip velocity is small.
M. Anitescu and G. D. Hart, "A Constraint Stabilized Time-Stepping Approach for Multibody Dynamics with Cont act and Small Friction," Preprint ANL/MCS-P1002-1002, October 2002. We present a method for achieving constraint stabilization for a linear-complementarity-based time-stepping scheme for multi-rigid-body dynamics with joints, contact, and friction. The method requires the solution of only one linear complementarity problem per step. We show that under certain assumptions, which include the limited differentiability of the mappings and at most linear growth of the external forces, the velocity is bounded for a sufficiently small size of the timestep over a fixed time interval and the geometrical constraint infeasibility at step (t) is bounded above by a constant multiple of the square of the time step and the square of the norm of the current value of the velocity. If, in addition, the velocity is uniformly bounded with respect to the time interval of the simulation, then the geometrical constraint infeasibility is bounded by the same bound irrespective of the time interval of the simulation.
G. X. Yu and E. Marland, "Establishment of a Knowledge Base for Function Annotat ion in High-Throughput Sequence Analysis," Preprint ANL/MCS-P1004-1002, October 2002. We have developed a rule-based knowledge system specifically for automated high-throughput genetic sequence analysis. It includes more than 22,000 protein function groups and their evolutionary spaces (distributions), which are characterized by protein sequence conservations, the phylogenetic distribution of protein motifs and domains, and their relationships to biological functions. Our knowledge base demonstrates that tremendous variations exist among protein functional group. One of the important applications of our knowledge base is cross-verification of protein function annotations obtained by different computational tools.
J. N. Lyness and S. Joe, "The Number of Lattice Rules of Specified Upper Class and Rank," Preprint ANL/MCS-P979-0802, August 2002. The upper class of a lattice rule is a convenient entity for classification ad other purposes. The rank of a lattice rule is a basic characteristic, also used for classification. By introducing a rank proportionality factor and obtaining certain recurrrence relations, we show how many lattice rules of each rank exist in any prime upper class. The Sylow p-decomposition may be used to obtain corresponding results for any upper class.
J. N. Lyness, "Notes on Lattice Rules," J. Complexity 19 (2003) 321-331. An elementary introduction to lattices, integration lattices, and lattice rules is followed by a description of the role of the dual lattice in assessing the trigonometric degree of a lattice. The connection with the classical lattice-packing problem is established. J. N. Lyness and T. Sorevik, "Four-Dimensional Lattice Rules Generated by Skew-Circulant Matrices," Preprint ANL/MCS-P1012-0702, May 2003. We introduce the class of skew-circulant lattice rules. These are s-dimensional lattice rules that may be generated by the rows of an sxs skew-circulant matrix (a minor variant of the familiar circulant matrix). We present briefly some of the underlying theory of these matrices and rules. We are particularly interested in finding rules of specified trigonometric degree d. We describe some of the results of computer-based searches for optimal four-dimensional skew-circulant rules.
S. Byna, W. Gropp, X.-H. Sun, and R. Thakur, "Improving the Performance of MPI Derived Datatypes by Optimizing Memory-Access Cost," Preprint ANL/MCS-P1045-0403, April 2003. In this paper, we present a technique for improving the performance of derived datatypes by automatically using packing algorithms that are optimized for memory-access cost. The packing algorithms use memory-optimization techniques that the user cannot apply easily without advanced knowledge of the memory architecture. We present performance results for a matrix-transpose example that demonstrate that our implementation of derived datatypes significantly outperforms both manual packing by the user and the existing derived-datatype code in the MPI implementation MPICH.
R. Thakur and W. Gropp, "Improving the Performance of Collective Operations in M PICH," Preprint ANL/MCS-P1038-0405, April 2003. We report on our work on improving the performance of collective operations in MPICH on clusters connected by switched networks. For each collective operation, we use multiple algorithms depending on the message size, with the goal of minimizing latency for short messages and minimizing bandwidth usage for long messages. Although we have implemented new algorithms for all MPI collective operations, becuase of limited space we describe only the algorithms for allgather, broadcast, reduce-scatter, and reduce. We present performance results using the SKaMPI benchmark on a Myrinet-connected Linux cluster and an IBM SP. In all cases, the new algorithms significantly outperform the old algorithms used in MPICH on the Myrinet cluster and, in many cases, outperform the algorithms used in IBM's MPI on the SP.
J. N. Lyness and G. Monegato, "Asymptotic Expansions for Two-Dimensional Hypersi ngular Integrals," ANL/MCS-P1011-1102, November 2002. We define and examine two-dimensional hypersingular integrals on [0,1)^2 and on [0,inf)^2 and relate their Kadamard finite-part (HFP) values to Mellin transforms. The integrands have algebraic singularities of a possibly unintegrable nature on the axes and at the origin. Extending our work on one-dimensional integrals reported in 1998, we obtain variants of the classical Euler-Maclaurin expansion for various 2D integrals. In many cases the constant term in the expansion (which is not necessarily the leading term) provides the value of the HFP integral. These expansions may be used as the basis for the numerical evaluation of a class of HFP integrals by extrapolation.
R. Ross, N. Miller, and W. Gropp, "Implementing Fast and Reusable Datatype Processing," Preprint ANL/MCS-P1068-0703, July 2003. In the paper the authors descirbe an internal representation of types that aids in maintaining the highest apossible performance during processing. The performance of this system, used in the MPICH2 implementation, is compared to well-written manual processing routines and other available MPI implementations. The authors show that performance for most tested types is comparable to manual processing. They point to additional opportunities for optimization and other software where this implementatin could be leveraged.
J. Wu and P. Wyckoff and D. Panda and R. Ross, "Unifier: Unifying Cache Management and Communication Buffer Management for PVFS over InfinBand," Preprint ANL/MCS-P1122-0204, February 2004. The advent of networking technologies and high performance transport protocols facilitates the service of storage over networks. However, they pose challenges in integration and interaction among storage server application components and system components. In this paper, we put forward a component, called Unifier, to provide more efficient integration and better interaction among these components. Unifier has three notable features. (1) Unifier integrates cache management and communication buffer management. It offers a single copy data sharing among all components in a server application safely and concurrently. (2) It reduces memory registration and deregistration costs to enable applications to take full advantage of RDMA operations. (3) It provides means to achieve adaptation, application-specific optimization, and better cooperation among different components. This paper presents the design and implementation of Unifier. This component has been deployed and evaluated in a version of PVFS1 implementation over InfiniBand. Experimental results show performance improvements between 30% and 70% over other approaches. Better scalability is also achieved by the PVFS I/O servers.
A. Ching, A. Choudhary, W.-K. Liao, R. Ross, and W. Gropp, "Evaluating Structured I/O Methods for Parallel File Systems," Preprint ANL/MCS-P1125-0204, February 2004. The authors demonstrate an implementatin of strucutred data access support in the context of the Parallel Virtual File System (PVFS0. They call this support datatype I/O because of its similarity to MPI datatypes. This support is built by using a reusable datatype-processing component from the MPICH MPI implementation. Also presented is a comparison of I/O characteristics of modern high-performance noncontiguous I/O methods. The authors use their I/O characteristics comparison to assess all the methods using three ttest applications. Further optimizations could be leveraged for even more efficient operation.
J. Li, W.-K. Liao, A. Choudhary, R. Ross, R. Thakur, W. Gropp, and R. Latham, "Parallel netCDF: A Scientific High-Performance I/O Interface," Preprint ANL/MCS-P1048-0503, Argonne National Laboratory, May 2003. This work presents a new parallel interface for writing and reading netCDF datasets. The interface is derived with aminimal changes from the serial netCDF interface but defines semantics for parallel access and is tailored for high performance. The underlying parallel I/O is achieved through MPI-IO, allowing for dramatic performance gains through the use of collective I/O optimizations. We compare the implementation strategies with HDF5 and analyze both. Our tests indicate programming convenience and significant I/O performance improvement with this parallel netCDF interface.
P. D. Hovland, U. Naumann, and B. Norris, "An XML-Based Platform for Semantic Transformation for Numerical Programs," Preprint ANL/MCS-P950-0402, April 2002. We describe a simple component architecture for the development of tools for mathematically based semantic transformations of scientific software. This architecture consists of compiler-based, language-specific front- and backends for source transformation, loosely coupled with one or more language-indepenedent "plug-in" transformation modules. The coupling mechanism between the front- and backends and transformation modules is provided by XML. XAIF provides an abstract, language-independent representation of language constructs common in imperative languages, such as C and Fortran. We desicribe the use of this architecture in the construction of tools for automatic differentiation (AD) of programs written in Fortran 77 and ANSI C. The XAIF is particularly well suited for performing the source transformations needed for AD. Differentiations modules typically operate within the scope of statements or basic blocks, working at a level where procedural languages are very similar. Thus, it is possible to specify a common interface format for mathematically based semantic transformations that need not represent the union of all languages.
M. M. Strout and P. D. Hovland, "Metrics and Models for Reordering Transformations," Preprint ANL/MCS=P1167-0604, June 2004. This paper describes models for determining which combination of run-time data- and iteration-reordering heuristics will result in the best performance for a given dataset. We propose that the data- and iteration-reordering transformations be viewed as approximating minimal linear arrangements on two separate hypergraphs: a spatial locality hypergraph and a temporal locality hypergraph. Our results measure the efficacy of locality metrics based on these hypergraphs in guiding the selection of data- and iteration-reordering heuristics. We also introduce neew iteratin- and data-reordering heuristics based on the hypergraph models that result in better performance than do previous heuristics.
E. D. Dolan, R. Fourer, J.-P. Goux, and T. S. Munson, "Kestrel: An Interface from Modeling Systems to the NEOS Server," Preprint ANL/MCS-P986-0802, August 2002. The NEOS Server provides access to a variety of optimizaiton packages via the Internet. The new Kestrel interface to the NEOS Server extends the server's capabilities by permitting local programs to request optimization services and retrieve results. As a result, a locally running modeling system can have much the same access to remote NEOS solvers as to those installed locally. Kestrel clients have been implemented for the AMPL and GAMS modeling environments. Extensions to the client designs enable them to queue subproblems for execution and later retrieval of results, making possible a limited form of parallel processing. The creation of the Kestrel interface has also led to enhancements in the NEOS Server for submitting binary files and for tracking progress via the Web.
M. R. Paul, M. C. Cross, and P. F. Fischer, "Rayleigh-Benard Convection with a Radial Ramp in Plate Separation," Preprint ANL/MCS-P982-0802, August 2002. Pattern formation in Rayleigh-Benard convection in a large-aspect-ratio cylinder with a radial ramp in the plate separation is studied analytically and numerically by performing numerical simulations of the Boussinesq equations. A horizontal mean flow and a vertical large-scale counterflow are quantified and used to understand the pattern wavenumber. Our results suggest that the mean flow, generated by amplitude gradients, plays an important role in the roll compression observed as the control parameter is increased. Near threshold, the mean flow has a quadrupole dependence with a single vortex in each quadrant, while away from threshold the mean flow exhibits an octupole dependence with a counterrotating pair of vortices in each quadrant. This is confirmed analytically using the amplitude equation and Cross-Newell mean flow equation. By performing numerical experiments, the large-scale counterflow is also found to aid in the roll compression away from threshold but to a much lesser degree. Our results yield an understanding of the pattern wavenumbers observed in experiment away from threshold and suggest that near threshold the mean flow and large-sacle counterflow are not responsible for the observed shift to smaller than critical wavenumbers.
M. P. Friedlander and M. A. Saunders, "A Globally Convergent LCL Method for Nonlinear Optimization," ANL/MCS-P-1015, December 2002. For optimization problems with nonlinear constraints, linearly constrained Lagrangian (LCL) methods sequentially minimize a Lagrangian funciton subject to linearized constraints. These methods converge rapidly near a solution but may not be reliable from artibrary starting points. The well-known example MINOS has proven effective on many large problems. Its success motivates us to propose a globally convergent variant. Our stabilized LCL method possesses two important properties: the subproblems are always feasible, and they may be solved inexactly. These functions are present in MINOS only as heuristics. The new algorithm has been implemented in Matlab, with the option to use either the MINOS or SNOPT Fortran codes to solve the linearly constrained subproblems. Only first derivatives are required. We present numerical results on a nonlinear subset of the COPS, CUTE, and HSS test problem sets, which include many large examples. The results demonstrate the robustness and efficiency of the stabilized LCL procedure.
M. Vilayannur, R. B. Ross, P. H. Carns, R. Thakur, A. Sivasubramaniam, and M. Kandemir", "Improving the Performance of the POSIX I/O Interface to PVFS," Preprint ANL/MCS-P1010-1102, November 2002. This paper describes some of the key performance and scalability improvements implemented for the Unix I/O interface to PVFS. We present experimental results using Bonnie++, a commonly used file system benchmark to test file system throughput; a synthetic parallel I/O application for calculating aggregate read and write bandwidths; and a synthetic benchmark that calculates the time taken to untar the Linux kernel source tree to measure performance of large numbers of small file operations. We also compare the I/O performance of these techniques when using a Myrinet-based network and when using a fast Ehternet-based netowrk for I/O-related communications. With these techniques we achieve aggregate read and write bandwidth as high as 550 MB/s with Myrinet and 160 MB/s with fast Ethernet.
J. J. Evans, S. Baik, C. S. Hood, and W. Gropp, "Toward Understanding Soft Faults in High-Performance Cluster Networks," Preprint ANL/MCS-P10 17-0700, January 2003. Fault management in high-performance cluster networks has been focused on the notion of hard faults (i.e., link or node failures). Network degradations that negatively impact performance but do not result in failures often go unnoticed. In this paper we classify such degradations as soft faults. In addiiton, we identify consistent performance as an important requirement in cluster networks. Using this service requirement, we describe a comprehensive strategy for cluster management.
M. Anitescu, "A Fixed Time-Step Approach for Multibody Dynamics with Contact and Friction," Preprint ANL/MCS-P1034-0303, March 2003. We present a fixed time-step algorithm for the simulation of multi-rigid-body dynamics with joints, contact, collision, and friction. The method solves a linear complementarity problem (LCP) at each step. We show that the algorithm can be obtained as the stiff limit of fixed time-step schemes applied to regularized contact models. We do not perform collision detection. Instead, a noninterpenetration constraint stabilization is replaced by its linearization, which, together with a judicious choice of the active constraints, guarantees geometrical constraint stabilizaiton without the need to perform a reduction of the time step to detect new collision or stick-slip transition events. Partially elastic collisions are accommodated by a suitable modification of the free term of the LCP.
J. Liu, W. Jiang, P. Wyckoff, D. K. Panda, D. Ashton, D. Buntinas, W. Gropp, and B. Toonen, "Design and Implementation of MPICH2 over InfiniBand with RDMA Support," Preprint ANL/MCS-P1103-1003, October 2003. In this paper we present our experiences designing and implementing MPICH2 over Infiniband. Our focus is on optimizing the performance of MPI-1 functions in MPICH2. One of our objectives is to exploit Remote Direct Memory Access (RDMA) in Infiniband to achieve high performance. We have based our design on the RDMA Channel interface provided by MPICH2, which encapsulates architecture-dependent communication functionalities into a very small set of functions. We apply different optimizations and also propose a zero-copy-based design. We characterized the impact of our optimizations and designs using microbenchmarks. We have also performed an application-level evaluation using the NAS Parallel Benchmarks. Our optimized MPICH2 implementation achieves 7.6 microseconds latency and 857 MB/s bandwidth, which are close to the raw performance of the underlying InfiniBand layer. Our study shows that the RDMA Channel interface in MPICH2 provides a simple, yet powerful, abStraction that encapsulates implementations with high performance by exploiting RDMA operations in InfiniBand.
W. Yu, D. Buntinas, R. L. Graham, and D. K. Panda, "Efficient and Scalable Barrier over Quadrics and Myrinet with a New NIC-based Collective Message-Passing Protocol," Preprint ANL/MCS-P1121-0204, February 2004. This paper explores the characteristics of all-to-all broadcast and proposes new algorithms to exploit the potential advantages of Network Interface Card (NIC) programmability. Salient strategies have been used to provide scalable topology management, global buffer management, efficient communication processing, and message reliability. The algorithms have been incorporated into an NIC-based collective protocol over Myrinet/GM. The NIC-based all-to-all broadcast operations improve all-to-all broadcast bandwidth over 16 nodes by a factor of 3, compared to host-based all-to-all broadcast operations.
T. Ozgokmen, P. Fischer, J. Duan, and T. Iliescu", Three-Dimensional Turbulent Bottom Density Currents from a High-Order Non hydrostatic Spectral Element Model," Preprint ANL/MCS-P1106-1103, November 2003. In this study, nonhydrostatic 3D simulations of bottom gravity currents are carried out. Nek5000, a parallel high-order spectral element Navier-Stokes solver, is used as the basis of gthe simulations. Numerical experiments are conducted in an idealized setting focusing on the startup phase of a dense water mass released at the top of a sloping wedge. Three-dimensional results are compared with results from 2D experiments and laboratory experiements, based on propagation speed of the density front, growth rate of the characteristic head at the leading edge, turbulence overturing length scales, and entrainment parameters. Results are found to be in general agreement with those from laboratory tank experiments. In 2D simulations, the propagation speed is approximately 20% slower, the head growth rate is 3 times larger, Thorpe scales are 30-50% larger, and entrainment parameter is up to 2 times higher than those in the 3D experiments. The difference between 2D and 3D simulations are entirely due to internal factors associated with the truncation of the Navier-Stokes equations for 2D approximation. We conclude that in the absence of external factors that will trigger 3D circulation patterns, such as topographic features or rotation, 2D dynamics still represent a reasonable approximation to the general evolution of bottom gravity currents.
U. Naumann and P. Gottschling, "Simulated Annealing for Optimal Pivot Selection in Jacobian Accumulation," Preprint ANL/MCS-P1032-0303, March 2003. We report on new results in logarithmic simulated annealing applied to the optimal Jacobian accumulation problem. We discuss the optimal edge elimination problem in linearized computational graphs in the context of linear algebra. We introduce row and column pivoting on the extended Jacobian analogs to front and back edge elimination in linearized computational graphs. Neighborhood relations for simulated annealing are defined on a metagraph derived from the computational graph. All prerequisites for simulated annealing are fulfilled for dyadic pivoting, which is equivalent to vertex elimination in linearized computational graphs. For row and column pivoting, we cannot yet give a proof that the corresponding elimination sequences are polynomial. In practice, however, the likelihood for an exponential eliminationsequence to occur is neglibible. NBumerical results are presented for algorithms basd on both homogeneous and inhomogeneous Markov chains for all pivoting techniques. The superiority of row and column pivoting over dyadic pivoting can be observed when applying these techniques to Roe's numerical flux.
A. Zagaris, H.G. Kaper, and T. J. Kaper, "Fast and Slow Dynamics for the Computational Singular Perturbation Method,"Preprint ANL/MCS-P1114-0104, January 2004. The Computational Singular Perturbation (CSP) method of Lam and Goussis is an iterative method to reduce the dimensionality of systems of ordinary differentaion equations with multiple time scales. Recently, the authors showed that each iteration of the CSP algorithm improves the approximation of the slow manifold by one order. Here we shw that the CSP method simultaneously approximates the tangent spaces to the fast fibers along which solutions relax to the slow manifold. Again, each iteration adds one order of accuracy. In some studies, the output of the CSO algorithm is postprocessed by linearly projecting initial data onto the slow manifold along these approximate tangent spaces. These projections, in turn, also become successively more accurate.
A. A. Dunca, "Optimal Design of Fluid Flow Using Subprobvlems Reduced by Large Eddy Simulation," Preprint ANL/MCS-P1117-1003, October 2003. This report presents tests of a two-dimensional parallel PETSC code that computes the fluid velocity and pressure in a given domain subject to given inflow and outflow conditions. The code provides an approximation to the soluton of the Navier-Stokes equations, a classical large eddy simulation model (Smagorinsky's model_ as well a s a new LES model that was tested with good results) and whose mathematical analysis is under study. Also included is a code to compute the total kinetic energy of the flow and the total velocity (circulation per unit area) of the flow. Numerical tests results show the value of using LES as a reduced-order model in shape optimization.
J. P. Kenny, S. J. Benson, Y. Alexeev, J. Sarich, C. L. Janssen, L. C. McInnes, M. Krishnan, J. Nieplocha, E. Jurrus, C. Fahlstrom, and T. L. Windus, "Component-Based Integration of Chemistry and Optimization Software," Preprint ANL/MCS-P1148-0404, April 2004. Using the nwchem mpqc packages, we have produced components that provide capacity for energy derivative evaluation. Constructed geometry applications by integrating TAO, PETSC, global arrays which linear algebra capabilities. Present a brief overview component development process description abstract interfaces chemical optimizations. Conforming to these allow construction of diffferent mathematics packages interchangeably. Initial numerical results software demonstrate good performance highlight potential research enabled this platform.
M. R. Paul, K-H. Chiam, M. C. Cross, and P. F. Fischer, "Rayleigh-Benard Convection in Large-Aspect-Ratio Domains," Preprint ANL/MCS-P1144-0404, April 2004. The coarsening and wavenumber selection of striped states growing from random initial conditions are studied in a non-relaxational, spatially extended, and far-from-equilibrium system by performing large-scale numerical simulations of Rayleigh-Bénard convection in a large-aspect-ratio cylindrical domain with experimentally realistic boundaries. We find evidence that various measures of the coarsening dynamics scale in time with different power-law exponents, indicating that multiple length scales are required in describing the time dependent pattern evolution. The translational correlation length scales with time as t0.12, the orientational correlation length scales as t0.54, and the density of defects scale as t-0.45. The final pattern evolves toward the wavenumber where isolated dislocations become motionless, suggesting a possible wavenumber selection mechanism for large-aspect-ratio convection.
U. Naumann and J. Riehme, "A Differentiation-Enabled Fortran 95 Compiler," Preprint ANL/MCS-P1090-0903, September 2003. The availability of first derivatives of vector functions is crucial for the robustness and efficiency of a large number of numerical algorithms. A new differentiation-enabled Fortran 95 compiler is described that uses programming language extensions and a semantical code transformation known as automatic differentiation to provide Jacobians of numerical programs with machine accuracy. We describe the user interface as well as the relevant algorithmic details. The feasibility, robustness, and convenience of this approach is illustrated by various case studies.
A. Albrecht, P. Gottschling, and U. Naumann, "Markowitz-Type Heuristics for Computing Jacobian Matrices Efficiently," Preprint ANL/MCS-P933-0202, February 2002. We consider the problem of accumulating the Jacobian matrix of a nonlinear vector function by using a minimal number of arithmetic operations. Two new Markowitz-type heuristics are proposed for vertex elimination in linearized computational graphs, and their superiority over existing approaches is shown by several tests. Similar ideas are applied to derive new heuristics for edge elimination techniques. The well-known superiority of edge over bertex elimination can be observed only partially for the heuristics discussed in this paper. Nevertheless, significant improvements can be achieved by the new heuristics in terms of both the quality of the results and their robustness with respect to different tiebreaking criteria.
W. Yu, D. Buntinas, R. L. Graham, and D. K. Panda, "Efficient and Scalable Barrier over Quadrics and Myrinet with a New NIC-based Collective Message-Passing Protocol," Preprint ANL/MCS-P1121-0204, February 2004. Modern interconnects often have programmable processors in the network interface that can be utilized to offload communication processing from host CPU. In this paper, we explore different schemes to support collective operations at the network interface and propose a new collective protocol. With barrier as an initial case study, we have demonstrated that much of the communication processing can be greatly simplified with this collective protocol. Accordingly, we have designed and implemented efficient and scalable NIC-based barrier operations over two high-performance interconnects, Quadrics and Myrinet. Our evaluation shows that, over a Quadrics cluster of 8 nodes with Elan3 Network, the NIC-based barrier operation achieves a barrier latency of only 5.60 µs. This result is a 2.48 factor of impro9vement over the Elanlib tree-based barrier operation. Over a Myrinet cluster of 8 nodes with LANai-XP NIC cards, a barrier latency of 14.20 µs over 8 nodes is achieved. This is a 2.64 factor of improvement over the host-based barrier algorithm. Furthermore, an analytical model developed for the proposed scheme indicates that a NIC-based barrier operation on a 1024-node cluster can be performed with only 22.13 µs latency over Quadrics and with 38.94 µs latency over Myrinet. These results indicate the potential for developing high performance communication subsystems for next generation clusters.
R. Thakur, F. Rabenseifner, and W. Gropp, "Optimization of Collective Communication Operations in MPICH," Preprint ANL/MCS-P1140-0304, March 2004. We describe our work on optimizing the collective communication operations in MPICH for clusters connected by switched networks. For each collective operation, we use multiple algorithms depending on the message size, with the goal of minimizing latency for short messages and minimizing bandwidth use for long messages. Although we have implemented new algorithms for all MPI collective operations, because of limited space we describe only the algorithms for all-gather, broadcast, all-to-all, reduce-scatter, reduce, and allreduce. Performance results on a Myrinet-connected Linux cluster and an IBM SP indicate that, in all cases, the new algorithms significantly outperform the old algorithms used in MPICH on the Myrinet cluster, and, in many cases, they outperform the algorithms used in IBM's MPI on the SP. We also explore in further detail the optimization of two of the most commonly used collective operations, allreduce and reduce, particularly for long messages and non-power-of-two numbers of processes. The optimized algorithms for these operations perform several times better than the native algorithms on a Myrinet cluster, IBM SP, and Cray T3E. This work demonstrates that to achieve the best performance for a collective communication operation, we need to use a number of different algorithms and select the right algorithm for a particular message size and number of processes.
A.-V. Demiguel, M. P. Friedlander, F. J. Nogales, and S. Scholz, "An Interior-Point Method for MPECs Based on Strictly Feasible Relaxation," Preprint ANL/MCS-P1150-0404, April 2004. An interiro-point method for solving mathematical programs with equilibrium constraints (MPECs) is proposed. At each iteration of the algorithm, a single primal-dual step is computed from each subproblem of a sequence. Each subproblem is defined as a relaxation of the MPEC with a nonempty strictly feasible region. In contrast to previous approaches, the proposed relaxation scheme preserves the nonempty strict feasibility of each subproblem even in the limit. Local and superlinear convergence of the algorithm is proved even with a less restrictive strict complementarity condition thatn the standard one. Moreover, mechanisms for inducing global convergence in practice are proposed. Numerical results on the MacMPEC test problem set demonstrate the fast local convergence properties of the algorithm.
W. Gropp and E. Lusk, "Fault Tolerance in MPI Programs," Preprint ANL/MCS-P1154-0404, April 2004. This paper examines the topic of writing fault-tolerant MPI applications. We discuss the meaning of fault tolerance in general and what the MPI Standard has to say about it. We survey several approaches to this problem, namely checkpointing, restructurin g a class of standard MPI programs, modifying MPI semantics, and extending the MPI specificaiotn. We conclude that within certain constraints, MPI can provide a useful context for writing application programs that exhibit significant degrees of fault tolerance.
E. D. Dolan, J. J. More', and T. S. Munson, "Optimality Measures for Performance Profiles," Preprint ANL/MCS-P1155-0504, May 2004. We examine the importance of optimality measures when benchmarking a set of solvers and show that scaling requirements lead to a convergence test for nonlinearly constrained optimization solvers that uses a mixture of absolute and relative error measures. We demonstrate that this convergence test is well behaved at any point where the constraints satisfy the Mangasarian-Fromovitz constraint qualification and also avoids the explicit use of a complementarity measure. Computational experiments explore the impact of this convergence test on the benchmarking process with performance profiles.
T. Munson, "An Averaging Method for Nash Games with Hsared Decision Variables," Preprint ANL/MCS-P1134-0304, March 2004. This short communication concerns a class of Nash games in which the participants share some of the decision variables. An averaging method is developed in order to transform these games into standard Nash games. This technique is successfully applied to solve an example multileader, single-follower game described by Pang and Fukushima.
J. Peng, M. Anitescu, and S. Akella, "Optimal Control of Multiple Robot Systems with Friction Using MPCC," Preprint ANL/MCS-P1157-0504, April 2004. This paper studies optimal control of multiple robot systems with frictional contact. The robots have nonlinear dynamics, which may arise from the robot body dynamics, friction between robot and environment, and friction between robot and robot. Nonpenetration constraints between robots are imposed, and the robots are assumed rigid. The problem is modeled as a mathematical program with complementarity constraints (MPCCs). The MPCC model is solved by using its elastic mode, which is a well-behaved nonlinear programming problem. Preliminary results with this approach are illustrated on example problems. The main contributions of this paper are (1) a novel optimal control problem that can deal with friction in the multiple robot system, and (2) application of a new mathematical programming algorithm to solve the MPCC model effectively. The coordination model has potential applications in robot systems with friction, such as multifinger manipulation, manipulation with ropes, and multirobot pushing coordination.
R. Thakur, W. Gropp, and B. Toonen, "Minimizing Synchronization Overhead in the Implementation of MPI One-Disd ed Communication," Preprint ANL/MCS-P1158-0504, May 2004. The one-sided comunication operations in MPI are intended to provide the convenience of directly accessing remote memory and the potential for higher performance than regular point-to-point communication. Our performance measurements with three MPI implementations (IBM MPI, Sun MPI, and LAM) indicate, however, that one-sided communication can perform much worse than point-to-point communication if the associated synchronization calls are not implemented efficiently. In this paper, we describe our efforts to minimize the overhead of synchronization in our implementation of one-sided communication in MPICH-2. We describe our optimizations for all three synchronization mechanisisms defined in MPI: fence, post-stat-complete-wait, and lock-unlock. Our performance results demonstrate that, for short messages, MPICH-2 performs six times faster than LAM for fence synchronization and 50% faster for post-start-complete-wait synchronization, and it performs more than twice as fast as Sun MPI for all three synchronization methods.
U. Naumann, J. Utke, and A. Lyon, "Control Flow Reversal for Adjoint Code Generation," Preprint ANL/MCS-P1163-0504, May 2004. We describe an approach to the reversal of the control flow of structured programs. It is used to automatically generate adjoint code for numerical programs by semantic source transformation. After a short introduction to applications and the implementation tool set, we describe the building blocks using a simple example. We then illustrate the code reversal within basic blocks. The main part of the paper covers the reversal of structured control flow graphs. We show the algorithmic steps for simple branches and loops and give a detailed algorithm for the reversal of arbitrary combinations of loops and branches in a general control flow graph.
J. Utke, "OpenAD" Algorithm Implementation User Guide," Technical Memo ANL/MCS-TM-274, April 2004. OpenAD provides a framweork that allows for relative ease in the implementation of algorithms that operate on a representation of the numerical kernel of a program. Language independence is achieved by using an intermediate XML format and the abstraction of common compiler analyses in OpenAnalysis. The intermediate format is mapped to concrete programming languages via two front end/back end combinations. The design allows for reuse and combination of already implemented algorithms. We describe here the set of algorithms and basic functionality currently implemented in OpenAD and explain the necessary steps to add a new algorithm to the framework.
R. Latham, R. Ross, and R. Thakur, "The Impact of File Systems on MPI-IO Scalabilit," Preprint ANL/MCS-P1182-0604, June 2004. As the number of nodes in cluster systems continues to grow, leveraging scalable algorithms in all aspects of such systems becomes key to maintining performance. While scalable algorithms have been applied successfully tin some areas of parallel I/O, many operations are still performed in an uncoordinated manner. In this work we consider, in three file system scenarios, the possibilities for applying scalable algorithms to the many operations that make up the MPI-IO interface. From this evaluation we extract a set of file system characteristics that aid in the implementation of scalabl MPI-IO implementations.
W. Jiang, K. Liu, H.-W. Jin, D. K. Panda, D. Buntinas, R. Thakur, and W. Gropp, "Efficient Implementation of MPI-2 Passive One-Sided Communication on InfiniBand Clusters," Preprint ANL/MCS-P1164-0504, May 2004. In this paper we compare various design alternatives for synchronization in MPI-2 passive one-sided communication on InfiniBand clusters. We discuss several requirements for snchronization in passive one-sided communication. Based on these requirements, we present four design alternatives, which can be classified into two categories: thread-based and atomic operation-based. In thread-based designs, synchronization is achieved with the help of extra threads. In atomic operation-based designs, we exploit InfiniBand atomic operations such as Compare-and_Swap and Fetch-and-Add. Our performance evaluation results show that the atomic operation-based design can require less synchronization overhead, achieve better concurrency, and consume fewer computing resources compared with the thread-based design.
L. F. Diachin, P. Knupp, T. Munson, and S. Shontz, "A Comparison of Inexact Newton and Coordinate Descent Mesh Optimization Techniques," Preprint ANL/MCS-P1159-0504, May 2004. We compare inexact Newton and coordinate descent methods for optimizing the quality of a mesh by repositioning the vertices, where quality is measured by the harmonic mean of the mean-ratio metric. The effects of problem size, element size, heterogeneity, and various vertex displacement schemes on the performance of these algorithms are assessed for a series of tetrahedral meshes.
E. D. Dolan, J. J. More', and T. S. Munson, "Benchmarking Optimization Software with COPS 3.0," Technical memo ANL/MCS-TM-273, February 2004. We describe version 3.0 of the COPS set of nonlinearly constrained optimization problems. We have added new problems, as well as streamlined and improved most of the problems. We also probide a comparison of the FILTER, KNITRO. LOQO, MINOS, and SNOPT solvers on these problems.
M. Anitescu, "Optimization-Based Simulation of Nonsmooth Rigid Multibody Dynamics," Preprint ANL/MCS-P1161-0504, May 2004. We present a time-stepping method to simulate rigid multibody dynamics with inelastic collision, contact, and friction. The method progresses with fixed time step without backtracking for collision and solves at every step a strictly convex quadratic program. We prove that a solution sequence of the method converges to the solution of a measure differential inclusion. We present numerical results for a few examples, and we illustrate the difference between the results from our scheme and previous, linear-complementarity-based time-stepping schemes.
L. McInnes et al., "Parallel PDE-Based Simulations Using the Common Component Architecture," Preprint ANL/MCS-P1179-0704, July 2004. This paper discusses recent work on leveraging CCA efforts in parallel PDE-based simulations involving accelerator design, climate modeling, combustion, and accidental fires and explosions. We explain how component technology helps address the different challenges posed by each of these applications, and we highlight how component interfaces built on existing parallel toolkits facilitate the reuse of software for parallel mesh manipulation, discretization, linear algebra, integration, optimization, and parallel data redistribution. We also present performance data to demonstrate the suitability of this approach.
|