“Training is only performed once, with users downloading and fine-tuning such language models to their specific task. In doing so, we are trusting large tech companies to train the base model responsibly since we have no control over this. This seems inherently undemocratic.” (de Vassimon Manela et al., 2021)

Large pre-trained language models and machine translation models are now in regular use powering search engines, mediating online disputes through the EU’s service, making recommendations, identifying hate speech, and evaluating résumés. These models are deployed by a variety of enterprises, including SMEs, but training very large language models is currently the province of a few large American and Chinese technology companies. This gives their modeling decisions exceptional market power. Their modeling decisions do not encode values expressed in the call text: multilinguality (Alabi et al., 2020), FAIR principles (Scott, 2020), minimising bias (Lucy and Bamman, 2021), and energy efficiency (Schwartz et al., 2020); Section 1.1.1 addresses these points further, along with our ambition. Third-party researchers attempting to address these issues generally lack the resources to fully validate their hypotheses by training and releasing a better model. The high-profile firings of Google ethics co-leads Timnit Gebru and Margaret Mitchell (who wrote a letter of support) over a paper (Bender et al., 2021) show that these companies cannot be trusted to openly research their modeling decisions either. Nonetheless, academia and industry continue to deploy these models for commercially important languages because they produce state-of-the-art results, and only weaker models (or none at all) are created for commercially less important languages.

While one machine translation model is generally cheaper to train than a large language model, making a viable machine translation product requires support for a large number of languages and therefore scale.

It is crucial to lower barriers to entry in training large language models and machine translation models, reducing the concentration of market power in a handful of companies. We propose a language data space that substantially lowers the three main barriers to training at scale: data gathering, compute, and reproducibility. Using our data space, we will produce free language and translation models supporting all official European languages and beyond.

Project web: https://hplt-project.org/

UNIVERZITA KARLOVA (CUNI) - Coordinator
PROMPSIT LANGUAGE ENGINEERING, SL (PROMPSIT)
UNIVERSITETET I OSLO (UiO)
HELSINGIN YLIOPISTO (UH)
TURUN YLIOPISTO (UTU)
CESNET ZAJMOVE SDRUZENI PRAVNICKYCH OSOB (CESNET)
UNINETT SIGMA2 AS (SIGMA2)
THE UNIVERSITY OF EDINBURGH (UEDIN), as the associated partner

Institute of Formal and Applied Linguistics

Charles University, Czech Republic
Faculty of Mathematics and Physics

