Computational Nano-optics: Parallel Simulations and Beyond

Modern theoretical optics often delves into the interaction between light and nanomaterials of complex geometries, and it relies on heavy numerical simulations. This article explores recent breakthroughs in the areas of plasmonics and nanopolaritonics and then describes how to assemble your own supercomputer in order to gather data on nanoscale optics.



The growing field of nano-optics owes its rapid development to the remarkable progress that has been made in nanofabrication techniques and advances in laser technology. With new tools, researchers can now go far beyond the diffraction limit, investigating the optical properties of nanoscale systems, from nanoparticles to individual atoms and molecules. Among several fascinating questions that we face about nanostructured materials is whether we can control light at the subdiffraction scale. If we can, we can aim to develop optical nanodevices and nanoscale coherent sources operating in the visible region of the spectrum.

A major part of the active research in nano-optics is in the area of plasmonics, which considers the general optical properties of nanoscale materials composed of noble metals. These metals are well known for their unique optical response, owing to the phenomenon known as surface plasmon-polariton resonance. Many research groups are extensively exploring this concept because of its many important potential applications, including integrated computing, nanolithography and light generation.

The local electromagnetic (EM) field is greatly enhanced near a metal/dielectric interface at the surface plasmon-polariton resonance. This feature is widely used in applications such as surface-enhanced Raman spectroscopy. The figure below illustrates the effect of the local EM field enhancement. Here, rigorous three-dimensional simulations have been performed for a thin silver-coated film deposited on an opal array formed from close-packed polystyrene nanospheres.


Figure: Opal array and simulation results. (Top) Actual image of the opal array: the silver-coated array of dielectric spheres self-assembled on a glass substrate. (Bottom) Simulation results of the local electromagnetic intensity enhancement at the surface of the unit cell. Enhancement reaches at least two orders of magnitude.

Researchers have shown, both theoretically and experimentally, that properly exploiting the optical properties of metal nanostructures at the subwavelength scale may lead to optical trapping of single atoms and molecules. We can now control the geometry of nanomaterials with outstanding precision, which is the key to successfully manipulating individual atoms and molecules. The underlying reason is the strong spatial dependence of evanescent EM fields on the environment, which provides field gradients large enough for optical trapping. For example, researchers recently showed that the local EM fields associated with metal nanoparticle dimers depend noticeably on particle sizes and particle-to-particle distances.

We now have the tools to manipulate EM radiation far beyond the diffraction limit. We can focus the light to small local spots that may be used to probe individual atoms and molecules. Using the polarization dependence of surface plasmon-polariton waves, we can also use a specifically designed plasmonic material to efficiently guide EM radiation along predefined pathways. Due to the material dispersion of metals, arrays of nanoparticles can also be used to shape laser pulses aiming for coherent control.

Moreover, through a new field called molecular nanopolaritonics—the study of how molecules influence field propagation—new tools are emerging for developing molecular switches. They use the nonadiabatic alignment of a molecule on a semiconductor surface under the tip of a scanning tunneling microscope. Recent advances in experimental techniques have made it possible to measure the optical response of current-carrying molecular junctions. This has laid the groundwork for the development of theoretical formulations that can simultaneously describe both the transport and optical properties of molecular devices.

Notwithstanding the progress, there is still much debate within the optics community on the dynamics of EM radiation in the near-field region, which is just a few nanometers from the surface of the material. In many experiments, researchers have shown that the properties of EM radiation are altered by the geometry of nanomaterials. (For example, spherical nanoparticles have noticeably different optical responses and EM near-field distributions than ellipsoidal particles.)

All these and many other fascinating features can and should be explored. However, the conventional paper-and-pencil approach is only useful in a limited number of cases. The analysis often fails due to the complexity of nanomaterials. One then has to deal with numerical simulations.

In Maxwell's equations we trust

It is now widely accepted that macroscopic Maxwell equations adequately describe optics at the nanoscale. Among the many numerical methods used in nano-optics, the finite-difference time-domain (FDTD) technique is considered the easiest. It has recently become an extensively used tool in nano-optics due to its straightforward code implementation and its ability to accurately model complex nanostructures.

The core of FDTD is in the special arrangement of electric and magnetic field components on a grid, taking into account the curl nature of EM radiation, where each component of the field is surrounded by its counterparts—the so-called Yee cells (proposed and implemented by Kane Yee in 1966). The beauty of this technique is its ability to automatically satisfy boundary conditions at each grid point, allowing it to tackle almost any imaginable geometry.
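To make the staggered arrangement concrete, here is a minimal one-dimensional sketch of the Yee scheme in Python. It is an illustration only, not the code used for the results in this article: the function name `run_fdtd_1d`, the normalized units, the grid size, the Courant number and the Gaussian source are all assumptions chosen for brevity.

```python
import math


def run_fdtd_1d(n_cells=200, n_steps=150, courant=0.5, source_cell=100):
    """Propagate a Gaussian pulse on a 1D Yee grid in vacuum (normalized units).

    All parameter values are illustrative assumptions.
    """
    ez = [0.0] * n_cells        # E_z sampled at integer grid points
    hy = [0.0] * (n_cells - 1)  # H_y sampled halfway between neighboring E_z points

    for step in range(n_steps):
        # Update H from the spatial difference (1D curl) of E.
        for i in range(n_cells - 1):
            hy[i] += courant * (ez[i + 1] - ez[i])
        # Update E from the spatial difference (1D curl) of H.
        # The two end points are never updated, so they act as perfect conductors.
        for i in range(1, n_cells - 1):
            ez[i] += courant * (hy[i] - hy[i - 1])
        # Soft source: add a Gaussian pulse in time at a single grid point.
        ez[source_cell] += math.exp(-((step - 30) / 10.0) ** 2)

    return ez, hy


ez, hy = run_fdtd_1d()
peak_cell = max(range(len(ez)), key=lambda i: abs(ez[i]))
print("largest |Ez| is found at cell", peak_cell)
```

Each field value is updated only from its nearest staggered neighbors, which is the property that later makes the scheme easy to split across processors.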

But there is always a price to pay: FDTD is a memory-intensive strategy. Moreover, if we are interested in long-time dynamics, such as a steady-state solution, we must propagate the equations for quite some time, which enormously increases the total execution time. For example, three-dimensional simulations on a 320 × 320 × 640 nm³ grid with a spatial step of 1 nm take several days to explore the EM dynamics of a silver nanoparticle within the first 500 fs.
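A rough estimate shows where the memory goes. The sketch below assumes double-precision storage (8 bytes per value) and counts only the six EM field components; auxiliary arrays, such as the polarization currents needed for dispersive materials, would add to the total. The function name is hypothetical.

```python
def fdtd_memory_bytes(nx, ny, nz, n_components=6, bytes_per_value=8):
    """Memory needed to hold the field arrays alone, in bytes.

    Assumes one value per grid cell for each field component.
    """
    return nx * ny * nz * n_components * bytes_per_value


# Grid from the example above: 320 x 320 x 640 nm^3 at a 1-nm step.
n_bytes = fdtd_memory_bytes(320, 320, 640)
print(f"{n_bytes / 1e9:.2f} GB for the six field arrays")
```

Under these assumptions the field arrays alone occupy roughly 3 GB, before any material-specific arrays are added.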


Figure: Parallel implementation. Schematic of the parallelization of FDTD, where each processor carries Nloc xy-slices. Send-and-receive MPI subroutines are implemented at the boundaries between processors.

Speeding the FDTD

Obviously, such long execution times are not acceptable. After all, we are eager to get the data as quickly as possible. The answer lies at the base of FDTD: its finite-difference nature. As illustrated in the figure above, the FDTD scheme is partitioned onto a parallel grid by dividing the simulation space into M interacting xy-slices, where M is the number of available processors. Each processor carries a given number of xy-planes, Nloc, in its memory and communicates with its neighbor processors via send-and-receive operations.

The conventional way to establish such operations is to use point-to-point communication through the message passing interface (MPI). MPI is a standard for which several open-source implementations are available. Coming back to the three-dimensional example, a proper utilization of 32 processors leads to a local grid of 320 × 320 × 20 nm³ per processor. Most importantly, all processors now propagate our equations at the same time, which speeds up the simulations.
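The decomposition and neighbor exchange can be emulated in a single Python process. In the sketch below, plain lists stand in for processors and list reads stand in for MPI send-and-receive calls, and the grid is one-dimensional rather than three-dimensional. All names are hypothetical, and a production code would use real MPI point-to-point calls instead. The point is only to show which boundary values must cross between neighbors at each time step, and that the split grid reproduces the serial result.

```python
import math


def serial_step(ez, hy, c):
    """One leapfrog step on the whole grid (reference solution)."""
    for i in range(len(hy)):
        hy[i] += c * (ez[i + 1] - ez[i])
    for i in range(1, len(ez) - 1):
        ez[i] += c * (hy[i] - hy[i - 1])


def parallel_step(ez_blocks, hy_blocks, c):
    """One leapfrog step on a grid split into blocks, one block per emulated rank."""
    n_ranks = len(ez_blocks)
    # Emulated exchange 1: each rank receives the first E value of its right neighbor.
    ghost_ez = [ez_blocks[r + 1][0] if r + 1 < n_ranks else None
                for r in range(n_ranks)]
    for r in range(n_ranks):
        ez, hy = ez_blocks[r], hy_blocks[r]
        for i in range(len(hy)):
            right = ez[i + 1] if i + 1 < len(ez) else ghost_ez[r]
            hy[i] += c * (right - ez[i])
    # Emulated exchange 2: each rank receives the last H value of its left neighbor.
    ghost_hy = [hy_blocks[r - 1][-1] if r > 0 else None for r in range(n_ranks)]
    for r in range(n_ranks):
        ez, hy = ez_blocks[r], hy_blocks[r]
        first = 1 if r == 0 else 0  # the two global end points stay fixed
        last = len(ez) - 1 if r == n_ranks - 1 else len(ez)
        for i in range(first, last):
            left = hy[i - 1] if i > 0 else ghost_hy[r]
            ez[i] += c * (hy[i] - left)


def max_difference(n_cells=64, n_ranks=4, n_steps=100, c=0.5):
    """Largest |E| difference between the serial and the block-split runs."""
    ez = [math.exp(-((i - n_cells / 2) / 4.0) ** 2) for i in range(n_cells)]
    ez[0] = ez[-1] = 0.0
    hy = [0.0] * (n_cells - 1)
    n_loc = n_cells // n_ranks  # assumes n_cells is divisible by n_ranks
    ez_blocks = [ez[r * n_loc:(r + 1) * n_loc] for r in range(n_ranks)]
    hy_blocks = [hy[r * n_loc:(r + 1) * n_loc] for r in range(n_ranks)]
    for _ in range(n_steps):
        serial_step(ez, hy, c)
        parallel_step(ez_blocks, hy_blocks, c)
    merged = [value for block in ez_blocks for value in block]
    return max(abs(a - b) for a, b in zip(ez, merged))


print("max difference between serial and split runs:", max_difference())
```

Only one boundary value per neighbor and per field is exchanged at each step here; in three dimensions the same role is played by a full xy-plane of field values.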

But it is not as simple as it may seem. Remember that we have to send and receive data at every time step, which takes some time, depending on how fast the communications between our processors are. For example, if each processor is connected to its neighbors via a very slow channel, it won't matter how fast a given processor performs the time iteration: it will have to wait for the others to send their data. It is thus important to scale your parallel code, that is, to run simulations on different numbers of processors and measure the execution time in order to find an optimal parallel configuration.

Many other tricks can also be implemented on top of parallel simulations. For example, when a portion of the computational grid is occupied by a dispersive dielectric, one needs to compute additional local polarization currents, which in turn slows down simulations. Simulations in free space, on the other hand, take less time. It is hence wise to adjust the local number of xy-planes, Nloc, so as to balance the number of floating-point operations across processors.
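One simple way to balance the load is sketched below. It assumes that a relative cost can be assigned to each xy-plane; the factor of 3 for planes containing dispersive material is a made-up number for illustration, and `balance_planes` is a hypothetical helper, not part of any library.

```python
def balance_planes(plane_costs, n_procs):
    """Split consecutive planes into n_procs contiguous slabs of similar total cost.

    Returns a list of (start, stop) plane indices, one pair per processor.
    """
    total = sum(plane_costs)
    bounds = [0]
    running = 0
    for k, cost in enumerate(plane_costs):
        running += cost
        # Close the current slab once its share of the cumulative cost is reached.
        if len(bounds) < n_procs and running >= total * len(bounds) / n_procs:
            bounds.append(k + 1)
    # Close the last slab (and pad with empty slabs in degenerate cases).
    while len(bounds) < n_procs + 1:
        bounds.append(len(plane_costs))
    return [(bounds[i], bounds[i + 1]) for i in range(n_procs)]


# 640 planes, of which planes 300-339 contain dispersive material (assumed 3x cost).
costs = [3 if 300 <= k < 340 else 1 for k in range(640)]
slabs = balance_planes(costs, 4)
loads = [sum(costs[start:stop]) for start, stop in slabs]
print(slabs)
print(loads)
```

In this example the processors that own the dispersive region receive fewer planes, so that every processor ends up with the same estimated amount of work per time step.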

There are many ways to measure how efficient a particular parallel code is. The easiest is to calculate how much faster the code gets when you increase the number of processors. To do so, compute the speedup factor, defined as Sn = Tm/Tn, where Tm is the execution time of the code on m processors (usually m = 1, a single-processor machine) and Tn is the time the code takes on n processors. Ideally, the speedup factor equals n/m (linear speedup). In practice, the sequential portion of the code (the subject of Amdahl's law), the latency of the communication channels between processors and other factors keep the real speedup below that.


Figure: FDTD codes as a function of number of processors. Speedup factor for three-dimensional FDTD codes as a function of the number of processors, on a double logarithmic scale, for a single silver nanodisk (red dots) and a periodic array of nanodisks. The ideal speedup factor is shown as black dots.

In rare cases, however, one can achieve so-called nonlinear speedup—a special case when the calculated speedup factor is higher than its ideal value. The nonlinear deviation of the simulated speedup from its ideal value can be explained as follows: By increasing the number of processors involved in the simulations while keeping the problem size constant, we effectively decrease the amount of swapping that the code performs. In other words, for those points at which the simulated speedup is higher than its ideal value, the memory occupied by the locally propagated matrices (consisting of the EM field components) is less than the cache size of each processor. This usually happens only for certain two-dimensional geometries.

It is also informative to calculate another important performance measure—namely, the F-factor that specifies the portion of the simulations that cannot be parallelized—in other words, the portion of the algorithm that is sequential in nature. It is defined as:

F = (n − Sn) / [Sn (n − m)].

The smaller the F-factor, the better and faster the code.
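As a worked example, the sketch below computes the speedup and the sequential fraction for a single-processor baseline (m = 1). In that case, Amdahl's law reads Sn = 1/[F + (1 − F)/n], and solving it for F gives F = (n − Sn)/[Sn(n − 1)]. The timings are hypothetical numbers, not measurements from this article, and the function names are chosen for illustration.

```python
def speedup(t_base, t_n):
    """Speedup factor: baseline execution time divided by the time on n processors."""
    return t_base / t_n


def sequential_fraction(s_n, n):
    """Sequential fraction F from a measured speedup on n processors (baseline m = 1)."""
    if n <= 1:
        raise ValueError("n must be greater than 1")
    return (n - s_n) / (s_n * (n - 1))


def amdahl_speedup(f, n):
    """Speedup predicted by Amdahl's law for sequential fraction f on n processors."""
    return 1.0 / (f + (1.0 - f) / n)


# Hypothetical timings: 1,000 s on one processor and 50 s on 32 processors.
t1, t32 = 1000.0, 50.0
s32 = speedup(t1, t32)
f = sequential_fraction(s32, 32)
print(f"speedup = {s32:.1f}, F = {f:.4f}")
print(f"Amdahl's law with this F on 32 processors: {amdahl_speedup(f, 32):.1f}")
```

With these numbers the ideal speedup would be 32, and the shortfall corresponds to a sequential fraction of about 2 percent.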

The figure above shows a general case of the three-dimensional FDTD speedup, where we compare the speedup factor for an open system with that for a periodic structure, both obtained on the BlueGene/L cluster at Argonne National Laboratory. Here, we simulate the interaction of a 35-fs-long laser pulse with silver nanodisks 40 nm in diameter and 40 nm in height. The laser radiation is modeled as a plane wave that is generated above the nanoparticles and that propagates along the negative z-direction.

The simulation domain size is 120 × 120 × 1,842 nm with a spatial step of 1.2 nm, and the total number of time iterations is 20,000. The observable of interest in these studies is a steady-state EM field distribution near interacting nanoparticles. The measured speedup factor is nearly ideal. It degrades with an increasing number of processors, since in the model simulations, the grid size was fixed and hence the number of xy-layers per processor decreased with increasing number of processors. This resulted in more sending and receiving MPI operations per iteration with consequent latency of the network. The calculated F-factor for both open and periodic codes shows outstanding parallelization. The largest F-factor is nearly 9 percent, which indicates that only 9 percent of the code is sequential.


Figure: Execution time of the FDTD codes. Execution time in seconds on Plasmon (blue circles) and Abe (red circles) clusters as a function of number of processors. Note that, for 120 processors, Plasmon is noticeably faster than Abe.

Assembling your own supercomputer

These days, almost any university has its own multiprocessor machine. There are also several national supercomputer facilities. But it is always nice to have your own "mini" supercomputer, where you can run test codes and whatnot. Sometimes those widely used computational facilities are overcrowded with dozens of simulation scripts piling up in a waiting queue. Once my code to simulate short-laser-pulse dynamics was sitting in a queue for two days, while actual execution time was just less than 2 hours.

As an example, here I describe how my students and I recently assembled the 128-core cluster we named "Plasmon." The first step is to decide whether your supercomputer will be composed of desktops or of blades connected via a network switch. The former is cheaper but requires more space. The latter, being relatively expensive, saves you some lab space. On the other hand, desktop clusters are usually much easier to cool, which is important to keep in mind: the better the cooling, the more stable your cluster is going to be. Another important aspect to consider is how many independent power lines you need to power your entire cluster. A good estimate is four nodes per line.

We have decided to proceed with interconnected desktops (usually referred to as nodes), where each node has the following characteristics:

  • Two AMD Opteron quad-core processors

  • SuperMicro motherboard H8DME-2 (the motherboard you choose has to have two independent outputs)

  • 24 GB of RAM, and

  • A 250-GB hard drive.

In addition, each node must have a power supply, a DVD drive and a PC case. All these items were ordered one by one online and put together on-site. The assembly (putting all the pieces for a single node together) is relatively easy and usually takes 15 to 20 minutes per node. We assembled 16 such nodes and connected them via a 1-Gb/s USRobotics network switch.

Probably the most crucial part is to figure out which operating system can manage your cluster. We found (eventually, by trying almost everything) that the NSF-funded project Rocks Cluster is the easiest to install and manage. In addition, it contains all essential components for FDTD simulations, including preinstalled MPI libraries.

Testing your cluster

A good test of both the parallelization scheme and a new cluster is to compare results with those obtained at a professionally managed supercomputer facility. The benchmarks presented below were obtained on both the Intel 64 Abe cluster at the National Center for Supercomputing Applications and Plasmon.

The model tested corresponds to the scattering and absorption of an EM plane wave by a periodic array of silver nanoparticles 50 nm in diameter. In this problem, the two-dimensional FDTD grid has a size of 792 × 792 nm, with a spatial step of 1.1 nm. The number of time iterations is 650,000, resulting in a total propagation time of 1.6 ps. The figure above shows the execution times of the parallelized FDTD code.
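These parameters can be checked against the standard FDTD stability condition. The sketch below derives the time step from the quoted propagation time and number of iterations, then compares the resulting Courant number, c·Δt/Δx, with the limit 1/√2 that applies to a uniform two-dimensional grid in vacuum. Because the 1.6 ps figure is rounded, the result is only approximate; the function name is hypothetical.

```python
import math

C_LIGHT = 299_792_458.0  # speed of light in vacuum, m/s


def courant_number(total_time_s, n_steps, dx_m):
    """Courant number c*dt/dx for a uniform grid, with dt = total_time / n_steps."""
    dt = total_time_s / n_steps
    return C_LIGHT * dt / dx_m


# Values quoted in the text: 1.6 ps, 650,000 iterations, 1.1-nm spatial step.
s = courant_number(1.6e-12, 650_000, 1.1e-9)
limit_2d = 1.0 / math.sqrt(2.0)
print(f"time step    = {1.6e-12 / 650_000 * 1e18:.2f} as")
print(f"Courant no.  = {s:.3f}")
print(f"2D limit     = {limit_2d:.3f}")
```

The implied time step of about 2.5 attoseconds sits just below the two-dimensional stability limit, which is consistent with the quoted parameters.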

As expected, the Abe cluster generally outperforms Plasmon (Abe's processors are faster than Plasmon's), but the difference in execution times is not as dramatic as one would expect, and it decreases as the number of processors increases. With 120 processors, Plasmon is actually faster than Abe. This is because many users were running jobs on Abe at the time of the test, and the resulting latency of Abe's network slowed down the simulations. All in all, having your own supercomputer has many benefits, ranging from the ability to run long simulations (we recently ran a code that took six days) to a unique learning experience for students.

Maxim Sukharev is with the department of applied sciences and mathematics at Arizona State University in Mesa, Ariz., U.S.A.

References and Resources

>> P.S. Pacheco. Parallel Programming with MPI, Morgan Kaufmann Publishing, 1993.
>> A. Taflove and S.C. Hagness. Computational Electrodynamics: The Finite-Difference Time-Domain Method, 3rd ed., Artech House, Boston, Mass., U.S.A., 2005.
>> L. Novotny and B. Hecht. Principles of Nano-Optics, Cambridge University Press, 2006.
>> W.A. Murray and W.L. Barnes. "Plasmonic Materials," Adv. Mater. 19, 3771 (2007).
>> M. Besbes et al. "Numerical analysis of a slit-groove diffraction problem," J. European Opt. Soc. 2, 07022 (2007).
