Applied Machine-Learning to Increase Consistency of Lab Results and to Shorten the Analytical Procedure
Background
One of the gemstone industry’s key criticisms of gem labs is their lack of consistency. Gem testing laboratories produce reports that might differ regarding color, treatment or country of origin. Not only do different gem labs show diverging results, but reports issued by one lab at different points in time might also vary. The reasons for such differences are manifold.
Every lab has its own analytical equipment, with slight variations in hardware and software, and its own testing protocols and guidelines for data interpretation. An even higher risk for inconsistent results is induced by the reference collection, its availability, completeness and level of authenticity of the stone samples, i.e. the relative certainty of their true provenance and treatment status 1.
Finally, another main source of inconsistent lab results is the human factor, i.e. individual gemologists having different backgrounds and experience levels, hence making different observations and weighing them differently, possibly leading to different results.
The rapid development of analytical technologies over the last two decades has multiplied the number of data points gathered in today’s modern gem-testing laboratories, further increasing the complexity of evaluating the wealth of data in a consistent way. Advanced trace and minor element analysis by ICP-MS alone adds more than 20 data points for each stone. Recognizing consistent patterns across a high double-digit number of data points is a challenge even for the most experienced human experts. Such overload can lead human experts to respond in undesirable ways, e.g. by oversimplifying. The use of binary plots to interpret trace element chemistry, the search for (wrongly assumed) “diagnostic” features, and sequential decision making are examples of how human experts tend to reduce complexity. Alternatively, they might become reluctant to reach a final conclusion, resulting in the unpopular “not determinable” calls on lab reports.
Modern machine- and deep-learning algorithms have revolutionized the analysis and interpretation of large and complex datasets in various fields, including earth and material sciences, allowing for more accurate and efficient data processing.
However, their application to gemology so far has mostly been restricted to handling a single data type at a time, such as chemical element concentrations or image evaluation. The more challenging tasks, involving the simultaneous evaluation of multiple data sources, still rely heavily on human expertise.
Project Gemtelligence
In 2020, the Gübelin Gem Lab started a project to build a deep learning-based method that automates the determination of the country of origin and detection of heat treatment of gemstones, in cooperation with CSEM 2, and with government funding by Innosuisse 3.
The resulting software, called Gemtelligence™, takes on the task of handling varied and multi-modal analytical data acquired from different testing devices. It determines the country of origin of ruby, sapphire and emerald, and detects heat treatment in ruby and sapphire.
The primary innovation of the proposed approach lies in its multi-modal design, custom-tailored to effectively process and integrate diverse analytical data acquired from different instruments: FT-IR and UV-Vis for spectroscopic analyses, and LA-ICP-MS and ED-XRF for chemical analyses (Figure 1).
Gemtelligence deploys a combination of strided convolutional neural networks and a variant of the Transformer architecture, the latter being well known from its application in large language models such as ChatGPT.
The software was trained on several tens of thousands of datasets collected in the Gübelin Gem Lab over the decades, both using data from client stones and from reference stones. We applied a so-called supervised learning approach, i.e. the software was fed with the available analytical data and the final result. It did not get any indication of what features to look at to determine origin or treatment. This approach guaranteed that the software had to establish its own pattern of features, avoiding merely reproducing the logic of the human experts.
One major task that kept us busy during the three years of the project was the cleaning of data. Although we have applied a relatively high level of standardization to data collection in our lab for many years, significant amounts of data still had to be filtered from the training set. The reasons included variations in hardware and software settings, changes in data processing methods, and data quality that was sufficient for visual interpretation by a gemologist but not suitable for machine reading and learning.
The entire process of studying data quality and identifying faulty data and its reasons was a fascinating journey back into the recent history of the gem lab. Seemingly minor changes in testing protocols, standard operating procedures and changes in lab staff became evident. We now have a much better understanding of our past data, and the requirements for data collection in the future. The current version of Gemtelligence comprises data of more than 50,000 stones.
Not every stone has a complete dataset, i.e. for some stones, some of the data types (e.g. ICP-MS) are absent; Gemtelligence is therefore designed to cope with incomplete datasets as well. For the validation set, a range of additional criteria was applied to reduce the risk of ground truth errors; among others, multiple expert gemologists had to independently reach the same conclusion through visual microscopic inspection, and results obtained from ICP-MS had to align with the findings of visual inspection.
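The two validation criteria named above can be sketched as a simple filter. This is only an illustration under stated assumptions: the record layout (`StoneRecord`, `expert_calls`, `icpms_call`), the minimum of two experts, and the choice to reject stones without an ICP-MS result are hypothetical and are not taken from the Gemtelligence code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StoneRecord:
    """Hypothetical record layout, used only for this illustration."""
    stone_id: str
    expert_calls: List[str]      # independent origin calls from microscopy
    icpms_call: Optional[str]    # origin indicated by ICP-MS, None if not measured


def is_reliable_ground_truth(record: StoneRecord, min_experts: int = 2) -> bool:
    """Return True if the stone passes both validation criteria."""
    # Criterion 1: several experts, all reaching the same conclusion.
    if len(record.expert_calls) < min_experts:
        return False
    if len(set(record.expert_calls)) != 1:
        return False
    # Criterion 2: ICP-MS must align with visual inspection.
    # Assumption: a stone without ICP-MS data cannot be confirmed and is rejected.
    if record.icpms_call is None:
        return False
    return record.icpms_call == record.expert_calls[0]


records = [
    StoneRecord("s1", ["Burma", "Burma"], "Burma"),
    StoneRecord("s2", ["Burma", "Madagascar"], "Burma"),
    StoneRecord("s3", ["Kashmir", "Kashmir"], None),
]
validation_set = [r.stone_id for r in records if is_reliable_ground_truth(r)]
```

Only the first stone would be retained here; the second fails on expert disagreement and the third on missing ICP-MS confirmation.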
Results
The results 4 are highly satisfying for the task of origin determination for a substantial portion of high-quality ruby, sapphire and emerald, and for the detection of heat treatment in ruby and sapphire. We demonstrated the capability of Gemtelligence to provide confident predictions on a substantially larger share of stones than human experts. A stone is considered confidently classified if the calculated probability of the stone belonging to a certain origin or being heat treated exceeds the threshold value.
The threshold has been determined by calibrating the model on the training data to match or surpass the accuracy levels reached by human experts on the test data. A stone is considered confidently classified by human experts if they are certain enough to reach a single, unambiguous conclusion, for example, assigning one country of origin and one thermal treatment state.
Different threshold values yield different trade-offs between accuracy and automation.
A lower threshold value results in more stones reaching automated classification, and hence less follow-up analysis by human experts, but at the same time lower accuracy and potentially higher error rates. For Gemtelligence, considerable effort went into defining threshold values for the tasks of origin determination and treatment detection for the three types of gemstones.
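The trade-off between accuracy and automation can be illustrated with a minimal sketch. The function names, the toy probabilities and the candidate thresholds below are assumptions made for illustration; they do not reproduce the calibration procedure actually used for Gemtelligence.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def confident_call(probs: Dict[str, float], threshold: float) -> Optional[str]:
    """Return the most probable class if its probability exceeds the threshold."""
    label, p = max(probs.items(), key=lambda kv: kv[1])
    return label if p > threshold else None


def coverage_and_accuracy(
    predictions: Sequence[Dict[str, float]],
    truths: Sequence[str],
    threshold: float,
) -> Tuple[float, float]:
    """Share of stones classified automatically, and accuracy on that share."""
    calls = [confident_call(p, threshold) for p in predictions]
    decided = [(c, t) for c, t in zip(calls, truths) if c is not None]
    coverage = len(decided) / len(predictions)
    accuracy = (
        sum(c == t for c, t in decided) / len(decided) if decided else float("nan")
    )
    return coverage, accuracy


def lowest_threshold_meeting(
    predictions: Sequence[Dict[str, float]],
    truths: Sequence[str],
    target_accuracy: float,
    candidates: Sequence[float],
) -> Optional[float]:
    """Pick the lowest candidate threshold whose accuracy reaches the target."""
    for th in sorted(candidates):
        cov, acc = coverage_and_accuracy(predictions, truths, th)
        if cov > 0 and acc >= target_accuracy:
            return th
    return None


# Toy data: model probabilities for two origins "A" and "B".
preds = [
    {"A": 0.95, "B": 0.05},
    {"A": 0.60, "B": 0.40},
    {"A": 0.30, "B": 0.70},
    {"A": 0.55, "B": 0.45},
]
truths = ["A", "B", "B", "A"]

low = coverage_and_accuracy(preds, truths, 0.5)    # all stones decided, one wrong
high = coverage_and_accuracy(preds, truths, 0.65)  # half decided, none wrong
chosen = lowest_threshold_meeting(preds, truths, 0.9, [0.5, 0.65, 0.9])
```

In this toy case, raising the threshold from 0.5 to 0.65 halves the share of automatically classified stones but removes the single error, which is the trade-off described above.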
Gemtelligence demonstrates that the interpretation of gemological analytical data can be automated while achieving comparable or even higher levels of accuracy (Figure 2). This is noteworthy because in professional gem labs, which need to adhere to best practice benchmarks, the time dedicated to the assessment and evaluation of the raw analysis is significant, and inconsistencies in this process are unavoidable.
Since gemstones are often analyzed multiple times during their lifespan, inconsistent results can raise doubts about the authenticity of the asset, leading to legal and financial complications. Hence, we evaluated whether Gemtelligence provides consistent results for the same gemstone when data is collected from different instruments, at various times and under varying conditions. To this end, we evaluated some 200 blue sapphires that we had tested multiple times over the last ten years. Applying our confidence thresholding methodology proved highly useful for optimizing the model to reach a high consistency of predictions, and for avoiding inconsistent classifications on stones about which the model is not highly confident.
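A consistency check of this kind could be sketched as follows. The data layout, the stone identifiers and the threshold values are hypothetical; the sketch only shows how discarding low-confidence predictions can raise the share of stones that receive the same call on every repeated test.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple

# One entry per test of a stone: (stone_id, predicted_origin, probability).
Measurement = Tuple[str, str, float]


def consistency_rate(
    measurements: Sequence[Measurement], threshold: float
) -> Optional[float]:
    """Share of repeatedly tested stones whose confident calls all agree."""
    calls: Dict[str, List[str]] = defaultdict(list)
    for stone_id, label, prob in measurements:
        if prob > threshold:  # keep confident predictions only
            calls[stone_id].append(label)
    # Only stones with at least two confident calls can be compared.
    repeated = [c for c in calls.values() if len(c) >= 2]
    if not repeated:
        return None
    consistent = sum(len(set(c)) == 1 for c in repeated)
    return consistent / len(repeated)


# Toy data for three stones tested on several occasions.
measurements = [
    ("s1", "Kashmir", 0.97),
    ("s1", "Kashmir", 0.93),
    ("s2", "Sri Lanka", 0.95),
    ("s2", "Madagascar", 0.55),
    ("s2", "Sri Lanka", 0.91),
    ("s3", "Burma", 0.92),
    ("s3", "Madagascar", 0.94),
]

loose = consistency_rate(measurements, 0.5)
strict = consistency_rate(measurements, 0.9)
```

With the stricter threshold, the low-confidence "Madagascar" call on stone s2 is set aside for human review instead of being reported, so s2 no longer counts as inconsistent.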
As stated above, the supervised learning approach prevented the software from simply applying gemologists’ established evaluation criteria. Instead, it forced the system to recognize its own, new pattern, which allowed for better results than expected. Not only does Gemtelligence achieve highly satisfying performance in terms of confidence and accuracy compared to the human experts, but it is also able to reach equally good results with less input data. In practical laboratory work on client stones, this allows the testing pipeline to be shortened for a significant share of the gemstones.
In this case, the two most labor-intensive analytical methods – optical microscopy and LA-ICP-MS – can be waived, as Gemtelligence is able to reach a qualified confidence level (i.e. matching the performance of human experts evaluating the complete dataset) on the basis of the datasets of the less complex – UV-Vis, FTIR and XRF – methods alone.
As the calculation process of such machine-learning models is essentially a “black box” that cannot be interpreted directly, the exact criteria of the recognized pattern cannot be determined, not even by the engineers or data scientists who created the algorithm. However, some insights can be gained through the concepts and tools offered by Explainable AI or Interpretable Machine Learning.
Figure 3 is an example, suggesting which areas of an FTIR spectrum of a blue sapphire contribute to the determination of heat treatment status. While human experts mainly focus on the area between 2500 and 4000 cm⁻¹, the algorithms seem to retrieve information from areas above 4000 cm⁻¹ as well. It can be speculated that it is these regions, which go largely unnoticed by human experts, that contribute to the relative outperformance of Gemtelligence compared to the human experts.
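The article does not state which explainability technique produced Figure 3. As an illustration of the general idea, the sketch below uses occlusion sensitivity, one common technique: each window of a spectrum is blanked out in turn, and the change in the model output is taken as that window's importance. The toy scoring function stands in for a trained model and is purely hypothetical.

```python
from typing import Callable, List, Sequence


def occlusion_importance(
    spectrum: Sequence[float],
    score_fn: Callable[[Sequence[float]], float],
    window: int,
    baseline: float = 0.0,
) -> List[float]:
    """Importance of each window = change in score when the window is blanked."""
    reference = score_fn(spectrum)
    importances = []
    for start in range(0, len(spectrum), window):
        occluded = list(spectrum)
        for i in range(start, min(start + window, len(spectrum))):
            occluded[i] = baseline
        importances.append(abs(reference - score_fn(occluded)))
    return importances


def toy_score(spectrum: Sequence[float]) -> float:
    """Hypothetical stand-in for a model: reacts only to the last two channels."""
    return sum(spectrum[6:8])


spectrum = [1.0] * 8  # eight absorbance values, ordered by wavenumber
importances = occlusion_importance(spectrum, toy_score, window=2)
```

Here only the last window receives a non-zero importance, mirroring how such a map can reveal that a model draws on a spectral region that an observer would not have expected.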
Table 1. Definitions
Machine Learning: A subdomain of artificial intelligence, machine learning is concerned with the development and study of statistical algorithms that can learn from data and be applied to unseen data, to perform tasks without explicit instructions.
Supervised Learning: A subcategory of machine learning and artificial intelligence, it is defined by its use of labeled datasets to train algorithms to classify data or predict outcomes accurately.
Accuracy: A machine-learning metric for performance on a classification task (e.g. determining the country of origin). Accuracy is the percentage of correct classifications.
Confidence: This metric represents the likelihood (typically expressed as a percentage) that the output of a machine-learning model is correct.
Ground Truth: Refers to the correct or “true” answer to a specific problem or question. In our case, the true country of origin and the true treatment state, i.e. heated or unheated.
Explainable AI: A set of processes and methods that allow users to comprehend and trust the results and output created by machine-learning algorithms, which are often black-box by nature.
Conclusions and Outlook
Gemtelligence has had a series of positive outcomes. The initial goal of increasing the consistency of results through automation of data interpretation was accomplished, and the human factor in the data evaluation step can be massively reduced.
A welcome additional benefit is the shortening of the analytical pipeline for a share of gemstones, which makes the microscopic assessment by gemologists and/or the application of LA-ICP-MS analysis partially obsolete. This leads to gains in efficiency, and can be applied in new, less expensive services for some gemstones. Furthermore, it frees gemologists for more rewarding tasks such as research and development, project work or field trips.
Notes
1 A more comprehensive article on the importance of reference collections for gem testing laboratories can be found in: Pardieu, Vincent (2020) Field Gemology – The Evolution of Data Collection. InColor 46, pp. 100-106.
2 CSEM (www.csem.ch) is a public-private, non-profit Swiss technology innovation center, developing and transferring world-class technologies for the industrial sector.
3 Innosuisse (www.innosuisse.ch) is the Swiss Innovation Agency. Its role is to promote science-based innovation in the interest of the economy and society in Switzerland.
4 A scientific article describing the detailed results of the Gemtelligence software for a subset of the blue sapphires will soon be published in Nature Communications Engineering. To access data and code relating to the article, click: https://github.com/TommasoBendinelli/Gemtelligence. A preliminary version by Bendinelli et al. is accessible on the open access platform arXiv: https://arxiv.org/abs/2306.06069.
CSEM and Gübelin Gem Lab hold the copyright of the figure and images shown here.