
Breast cancer is a major health concern worldwide, and early detection is crucial for improving survival rates. Advancements in deep learning and natural language processing (NLP) have opened new avenues for developing accurate and reliable models to diagnose and treat breast cancer. However, training these models requires large labeled datasets, which can be challenging to obtain, especially for low-resource languages like French.
In a groundbreaking study, researchers from Therapixel have developed a novel approach called MammoBERT to efficiently label French mammogram reports. The team combined the strengths of rule-based labeling systems and high-quality radiologist annotations to create a robust and scalable solution. By leveraging both the expansive scale of existing rule-based systems and the precision of radiologist annotations, MammoBERT significantly outperformed traditional methods, showcasing its potential to revolutionize medical imaging model development.
Tackling the Challenges of Mammogram Report Labeling
Breast cancer is the most common cancer among women worldwide, and early detection is crucial for improving survival rates. Mammography is a widely used imaging tool for the early diagnosis of breast diseases due to its high sensitivity. As deep learning models have shown promise in improving breast cancer detection, the demand for large, high-quality labeled datasets has become increasingly important.
However, the development of these models, especially for low-resource languages like French, is hindered by the limited availability of labeled datasets. Traditionally, the labeling of radiology reports has relied on sophisticated feature engineering or manual annotations by radiologists, both of which can be time-consuming and labor-intensive.
Introducing MammoBERT: A Hybrid Approach
To address these challenges, the researchers at Therapixel developed a novel approach called MammoBERT. This method combines the strengths of rule-based labeling systems and high-quality radiologist annotations to create an efficient and effective solution for French mammogram report labeling.

The MammoBERT approach consists of two phases:
1. Initial Fine-tuning on Radiologist Annotations: The team started by fine-tuning a pre-trained BERT-based model on a small dataset of radiologist annotations, ensuring the model’s initial learning was guided by expert-level knowledge.
2. Hybrid Fine-tuning with Rule-based Labels: In the second phase, the model underwent further fine-tuning through an active learning loop. This loop integrated a combination of manual annotations selected via uncertainty sampling and a larger set of automatic rule-based labels obtained through agreement sampling.
By leveraging both the scale of rule-based labels and the precision of radiologist annotations, the researchers were able to create a more comprehensive and reliable dataset for training the MammoBERT model.
Outperforming Traditional Methods
The team’s extensive experiments demonstrated the superiority of the MammoBERT approach. The model significantly outperformed traditional rule-based labeling systems, achieving an average F1 score of 0.99 for the surgery presence task and 0.98 for the surgery laterality task on the unseen test set.

Table 1 Comparative statistics and distribution of labels among classes. This table presents label statistics obtained by four strategies: labels obtained from the rule-based method, initial labels provided by radiologists, labels from the extended hybrid method, and labels from an unseen institute provided by radiologists.
The researchers identified two key sources of errors that were effectively addressed through targeted data augmentation and preprocessing:
1. Confusion between patients’ and relatives’ medical histories: The team created synthetic samples to improve the model’s ability to distinguish between a patient’s own surgical history and that of their family members.
2. Misinterpretation of surgeries on other body organs: The researchers implemented a preprocessing step to exclude irrelevant surgical history mentions from the report text, further enhancing the model’s performance.
Broader Implications and Future Directions
The success of the MammoBERT approach has far-reaching implications. The model’s architecture allows for adaptation to other medical report types and languages, making it a versatile solution for the medical imaging domain.
By retraining on specific datasets and adjusting extraction rules, the MammoBERT model can be applied to different medical contexts, such as lung or brain imaging reports. Additionally, the integration of image data with the model’s text-based analysis could provide further insights and optimization opportunities for future research.
The development of MammoBERT represents a significant step forward in the field of medical NLP, showcasing the potential of hybrid approaches that combine the strengths of rule-based systems and deep learning models. This innovative solution not only advances the state of the art in medical image report labeling but also offers an efficient and effective path for large-scale medical imaging model development, ultimately contributing to improved breast cancer detection and patient outcomes.
Author credit: This article is based on research by Nazanin Dehghani, Vera Saliba-Colombani, Aurélien Chick, Morgane Heng, Grégory Operto, Pierre Fillard.
For More Related Articles Click Here