Millions, if not billions, of specimens reside in the world’s natural history collections, but most of these have not been carefully studied, or even looked at, in decades. While containing critical data for many scientific endeavors, most objects are quietly sitting in their own little cabinets of curiosity.
Thus, mass digitization of natural history collections has become a major goal at museums around the world. Having brought together numerous biologists, curators, volunteers and citizens scientists, such initiatives have already generated large datasets from these collections and provided unprecedented insight.
Now, a new study, recently published in the open access Biodiversity Data Journal, suggests that the latest advances in both digitization and machine learning might together be able to assist museum curators in their efforts to care for and learn from this incredible global resource.
A team of researchers from the Department of Botany at the Smithsonian’s National Museum of Natural History, and the Data Science Lab and Digitization Program Office of the Smithsonian Office of the Chief Information Officer, recently collaborated with NVIDIA Corp. to carry out a pilot project using deep learning approaches to dig into digitized herbarium specimens.
The project is among the first to describe the use of deep learning methods to enhance our understanding of digitized collection samples. It is also the first to demonstrate that a deep convolutional neural network (CNN)–a computing system modeled after the neuron activity in animal brains that can basically learn on its own–can effectively differentiate between similar plants with an amazing accuracy of nearly one- hundred percent.
In their paper, the scientists describe two different neural networks that they trained to perform tasks on the digitized portion (currently 1.2 million specimens) of the United States National Herbarium.
The team first trained a CNN to automatically recognize herbarium sheets that had been stained with mercury crystals, since mercury was commonly used by some early collectors to protect the plant collections from insect damage. The second CNN was trained to discriminate between two families of plants–clubmosses and spikemosses–that share a strikingly similar superficial appearance.
The trained CNN performed with 90 percent and 96 percent accuracy respectively (or 94 percent and 99 percent if the most challenging specimens were discarded), confirming that deep learning is a useful and important technology for the future analysis of digitized museum collections.
“Our stained vs. unstained network could theoretically be applied to digitized specimens in other herbaria to help identify mercury hotspots for potential remediation,” the scientists conclude in their paper. “Likewise, our family discrimination network has the potential to be further developed into a universal tool to identify unknowns or to flag specimens in need of additional study, in the United States National Herbarium and in other natural history collections.”
“This research paper is a wonderful proof of concept. We now know that we can apply machine learning to digitized natural history specimens to solve curatorial and identification problems,” says Laurence Dorr, chair of the Natural History Museum’s Department of Botany.“The future will be using these tools combined with large shared data sets to test fundamental hypotheses about the evolution and distribution of plants and animals.”