Empowering Healthcare with Large Language Models:
Reducing Clinicians' Workload and Improving Stroke Patient Care
Despite the increasing importance of clinical language models, no multilingual pre-trained encoder or decoder models specialized for clinical text currently exist. Existing models are typically monolingual (mostly English) and rely on general-domain tokenizers that are poorly suited to the complexity of clinical language, which includes abbreviations, numerical data, drug names, and other specialized terminology. Moreover, clinical documentation is primarily written in the native language of each hospital, not only in English, underscoring the urgent need for multilingual solutions.
This project addresses these gaps by developing the first multilingual clinical encoder and decoder models, equipped with a novel tokenizer optimized specifically for clinical data. The project will collect multilingual clinical corpora, pre-train models on them, and benchmark the results across clinical tasks such as question answering, named entity recognition, and summarization. The goal is to release publicly available models that support both researchers and real-world hospital applications across diverse languages.
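To illustrate why a clinical-domain tokenizer matters, the sketch below trains a tiny byte-pair-encoding (BPE) style vocabulary on a toy corpus of dosage shorthand. The proposal does not specify the tokenizer algorithm, corpus, or toolkit, so this is only a minimal self-contained illustration of the general idea: frequent clinical units such as "mg" are learned as single tokens, whereas a general-domain tokenizer trained on web text may fragment them.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs over a {symbol-tuple: frequency} vocabulary."""
    pairs = Counter()
    for syms, freq in words.items():
        for a, b in zip(syms, syms[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    a, b = pair
    merged = {}
    for syms, freq in words.items():
        out, i = [], 0
        while i < len(syms):
            if i < len(syms) - 1 and syms[i] == a and syms[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(syms[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn `num_merges` greedy merges from a whitespace-split corpus."""
    words = dict(Counter(tuple(tok) for tok in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        words = merge_pair(words, best)
    return merges

def segment(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    syms = list(word)
    for pair in merges:
        syms = next(iter(merge_pair({tuple(syms): 1}, pair)))
    return list(syms)

# Hypothetical toy corpus of dosage/frequency shorthand (illustration only).
merges = train_bpe("750 mg bid 500 mg tid 250 mg qd", 8)
print(segment("mg", merges))  # → ['mg']: a frequent clinical unit becomes one token
```

A real clinical tokenizer would be trained on large multilingual corpora with a production library rather than this toy loop, but the principle is the same: the merge table, and hence the vocabulary, reflects the distribution of the training text, which is exactly why general-domain tokenizers handle clinical abbreviations and drug names poorly.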