Welcome to the PRINCIPLE Project website
PRINCIPLE stands for Providing Resources in Irish, Norwegian, Croatian and Icelandic for Purposes of Language Engineering and is implemented by a five-member consortium.
Consortium Members: Dublin City University (Project Coordinator), University of Iceland, Faculty of Humanities and Social Sciences, University of Zagreb, National Library of Norway, Iconic Translation Machines Ltd.
PRINCIPLE is a two-year project funded by the Connecting Europe Facility (CEF) (Action 2018-EU-IA-0050, Grant Agreement No. INEA/CEF/ICT/A2018/1761837). Its main aim is to identify, collect and process high-quality Language Resources (LRs) for four under-resourced European languages:
- Croatian
- Icelandic
- Irish
- Norwegian (Bokmål and Nynorsk)
The project started in September 2019 and will finish in August 2021.
PRINCIPLE will produce these high-quality curated LRs in order to improve translation quality in the Digital Service Infrastructures of eJustice and eProcurement via domain-specific Machine Translation (MT) systems (CEF eTranslation engines).
These high-quality LRs will be identified through the following process:
- A number of national bodies and local stakeholders across Croatia, Iceland, Ireland and Norway have agreed to provide LRs to the PRINCIPLE consortium and become ‘early adopters’
- Iconic Translation Machines will develop Neural MT engines from the donated Language Resources in order to verify the quality of the resources. MT systems will be provided to the project’s ‘early adopters’ for the duration of the project in order to validate quality in real user scenarios and gather feedback.
- Language Resources will then be provided for CEF eTranslation engines through the ELRC-SHARE portal
Outline of PRINCIPLE Project
- Language Resources (LRs) are collected from data holders and ‘early adopters’ from each of the 4 countries involved (in the specific domains of eJustice and eProcurement)
- Machine Translation (MT) systems are produced from these LRs and evaluated to ensure high quality output
- The MT systems are provided to ‘early adopters’ for free for the duration of the project
- ‘Early adopters’ use the MT systems and provide feedback
- Based on MT output evaluation and early adopter feedback, high quality LRs are identified
- Parallel LRs of high quality are uploaded to the ELRC-SHARE portal in order to improve the automated translation system eTranslation.
The key activities in PRINCIPLE are listed below and will be rolled out in specific phases over the duration of the project.
- Activity 1: Project Implementation
- Activity 2: Use-case analysis, Data Requirements and Data Preparation
- Activity 3: Development, evaluation and deployment of MT systems
- Activity 4: Identification, Collection & Consolidation of Language Resources
- Activity 5: Exploitation & Sustainability
- Activity 6: Dissemination
Language resources have been collected from a number of data holders and early adopters across the 4 countries. This data has been analysed and prepared for MT development based on agreed use cases. Iconic Translation Machines has created bespoke neural MT systems for 10 Early Adopters:
- National University of Ireland Galway (Ireland)
- CIKLOPEA D.O.O. (Croatia)
- Ministry of Foreign Affairs (Iceland)
- Standards Norway
- Ministry of Foreign Affairs (Norway)
- Rannóg an Aistriúcháin (the Translation Section of the Houses of the Oireachtas)
- Foras na Gaeilge
- Ministry of Foreign and European Affairs (Croatia)
- Icelandic Standards
- Icelandic Meteorology Office
Data identified during the development of these engines as being of “high quality” will be uploaded to the ELRC-SHARE repository in June 2021.
- CEF website: https://ec.europa.eu/digital-single-market/en/connecting-europe-facility
- eTranslation website: https://ec.europa.eu/cefdigital/wiki/display/CEFDIGITAL/eTranslation
- INEA website: https://ec.europa.eu/inea/
- ELRC website: http://www.lr-coordination.eu/
- ELRC-SHARE Repository website: https://elrc-share.eu/
- ELRI website: http://www.elri-project.eu
Dates for workshops:
Project Launches took place across the four countries in 2021:
Croatia: The PRINCIPLE project launch for Croatia took place on July 1st, 2021. The launch was co-located with the Professional and Scientific Symposium for Lectors of Croatian as a Second and Foreign Language (SIH), within Session 2 of the first day, “Project and Book Presentations”. The agenda for the SIH symposium can be found here.
Providing Resources in Irish, Norwegian, Croatian and Icelandic for Purposes of Language Engineering – Filip Klubička presented an overview of the PRINCIPLE project with a focus on Croatian Language Resources (LRs). The process of identifying LRs in the public and business sector was explained. Early Adopters and data contributors were presented as well as details on LRs they donated (e.g. size, domain, etc.). The main issues experienced were presented and described, ranging from identifying data contributors in the public and business sectors in Croatia to the processing of collected LRs.
Iceland: The PRINCIPLE project launch for Iceland took place on May 18th, 2021. The launch was on the agenda of an online conference that was broadcast on Ruv.is, the Icelandic national television’s website. The conference was organized by SÍM (Consortium for Icelandic LT) and Almannarómur (Centre for Language Technology) and covered topics such as language technology, artificial intelligence, and the exploitation of university research in daily life.
Gauti Kristmannsson from UoI gave an overview of the PRINCIPLE project with emphasis on the process used to identify high quality language resources and the Early Adopters participating in the project. Níels Rúnar Gíslason from UoI presented the automatic and human evaluation scores and explained their meaning, as well as the importance of companies and public bodies donating specialized data. The agenda for the conference can be found here. The formal project launch was pre-recorded and edited for the video conference and can be found here. The presentation slides used during the recording can be found here.
On the day of the broadcast, Gauti Kristmannsson and Níels Rúnar Gíslason also gave talks for industry professionals at the annual meeting of the Association of Icelandic Court Interpreters and Translators (via teleconference). They also participated in the extended Q&A session. A recording of this meeting can be found here.
Ireland: The PRINCIPLE project launch for Ireland took place on June 24th, 2021. The launch was co-located with the ELRC workshop for Ireland and acted as Session 3 – LT in Ireland: the PRINCIPLE Project use-case. The agenda for the ELRC workshop can be found here. Three presentations on the PRINCIPLE project were given during Session 3:
- PRINCIPLE Project Overview – Jane Dunne from DCU presented an overview of the PRINCIPLE project as an example of how Language Technology has been used in the public sector in Ireland. Data collection and pre-processing were explained, as was the automatic and human evaluation of the various MT systems that have been built for the 10 Early Adopters across the four countries.
- Language Technology Use-case in the Public Sector – Micheál Ó Maolruanaigh from Foras na Gaeilge presented (through Irish) on the organisation’s experience of being an Early Adopter in the PRINCIPLE project. Translation needs were discussed (e.g. application forms, annual reports, policies, social media, etc.), followed by an explanation of the translation memory tools in use by their in-house translator (MemoQ and SDLTrados). Feedback on the experience of being part of the project was positive, and Micheál explained that they learned a lot about translation technology and about human evaluation of the translation systems’ output.
- Irish Language Technology: SMEs and the public sector – Róisin Moran from RWS Language Weaver (formerly Iconic Translation Machines) discussed the company’s role in the project and described how the MT systems were trained and evaluated. She reported that, on the whole, they performed significantly better than generic online MT systems.
Norway: The PRINCIPLE project launch for Norway took place on March 3rd, 2021. The launch was co-located with the ELRC workshop for Norway and the agenda can be found here. Two of the presentations in the workshop were related to the PRINCIPLE project:
- Using Machine translation in the Department of Foreign Affairs – Stein Gabrielsen from the Norwegian Ministry of Foreign Affairs presented the organization’s experience as an Early Adopter in the PRINCIPLE project. Stein described the translation tasks of the Section for European Economic Area Affairs (mainly translation of EU law into Norwegian), their previous use of CAT tools (SDLTrados), a short earlier experience with machine translation, and their current experience with the machine translation engine developed by Iconic as part of the PRINCIPLE project. Stein explained that their experience has so far been very positive. The engine produces good results that increase the section’s productivity, and the dialogue with Iconic in developing and improving the engine has been good and constructive.
- What is language data and how do we collect them? – Magnus Breder Birkenes from the Norwegian Language Bank at the National Library of Norway presented two examples of data collection that both contribute to the PRINCIPLE project. The first example was the collection of Norwegian tenders published on DOFFIN (the Norwegian notification database for public procurement) and their translations into English published on TED (the European public procurement journal). The data were received in XML format with document identifiers. Magnus described how the data were sentence-tokenized, aligned and then transformed into TMX format. The second example was the crawling of the websites of Norwegian state level public organizations. The primary goal of that crawling project is to evaluate the amount of information in Nynorsk and Bokmål, the two official Norwegian written languages, in order to make sure that the state organizations follow the Norwegian language law. However, the data also contain parallel texts in Norwegian and English, which will be extracted and made available as part of the PRINCIPLE project.
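The processing steps mentioned in the DOFFIN/TED example (sentence tokenization, alignment, conversion to TMX) can be illustrated with a minimal Python sketch. This is not the pipeline used by the National Library of Norway, whose tooling is not described here. The function names, the regex-based splitter, the index-based alignment and the header values are all assumptions made for illustration; a real pipeline would use a proper sentence tokenizer and a dedicated sentence aligner.

```python
import re
import xml.etree.ElementTree as ET

# ElementTree serializes this qualified name as the attribute xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def split_sentences(text):
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def align_by_position(src_sents, tgt_sents):
    """Placeholder alignment (assumption): pair sentences by index.

    Returns no pairs when the sentence counts differ, since positional
    pairing would then be unreliable.
    """
    if len(src_sents) != len(tgt_sents):
        return []
    return list(zip(src_sents, tgt_sents))


def build_tmx(pairs, src_lang, tgt_lang):
    """Serialize aligned sentence pairs as a TMX 1.4 document string."""
    tmx = ET.Element("tmx", {"version": "1.4"})
    # Header values below are illustrative placeholders.
    ET.SubElement(tmx, "header", {
        "creationtool": "example-script",
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        tu = ET.SubElement(body, "tu")  # one translation unit per sentence pair
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


if __name__ == "__main__":
    nb = "Dette er en test. Den har to setninger."
    en = "This is a test. It has two sentences."
    pairs = align_by_position(split_sentences(nb), split_sentences(en))
    print(build_tmx(pairs, "nb", "en"))
```

Discarding documents whose sentence counts differ is a deliberately conservative choice for a sketch: it trades recall for a lower risk of misaligned pairs entering the translation memory.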
Dates for conferences where PRINCIPLE was represented:
PRINCIPLE was presented at the poster session of the XVII Machine Translation Summit, which was held at Dublin City University (Ireland) on 19-23 August 2019. The accompanying paper was published in the conference proceedings: www.aclweb.org/anthology/W19-6718.pdf
PRINCIPLE was presented at EAMT 2020 in November 2020. A 2-page paper entitled “Progress of the PRINCIPLE Project: Promoting MT for Croatian, Icelandic, Irish and Norwegian” has been published in the proceedings of the 2020 EAMT Conference and is available here (pages 465-466): https://eamt2020.inesc-id.pt/proceedings-eamt2020.pdf
PRINCIPLE was invited to give a presentation at the 5th ELRC conference on 10th March 2021. The presentation is available here.
PRINCIPLE was presented at the ELRC conference in Norway on 3rd March 2021. Stein Gabrielsen from the Norwegian Ministry of Foreign Affairs presented the experiences of an early adopter in the PRINCIPLE project, and Magnus Breder Birkenes from the Norwegian Language Bank presented the collection and extraction of bilingual language data delivered to the project. Both presentations are available (in Norwegian) on YouTube.
The work carried out on the PRINCIPLE project was presented at MT Summit 2021. Páraic Sheridan (Iconic) gave a presentation entitled “Building MT systems in low resourced languages for Public Sector users in Croatia, Iceland, Ireland, and Norway”, in which he showcased all the MT models built throughout the PRINCIPLE project as well as the Early Adopters for each language. The presentation is available here.
Dublin City University (Ireland) – Andy Way, Project Coordinator (PC), email@example.com
Iconic Translation Machines Ltd. (Ireland) – Dana Davis Sheridan, firstname.lastname@example.org
Faculty of Humanities and Social Sciences, University of Zagreb (Croatia) – Petra Bago, Data Collection Coordinator (DCC), email@example.com
University of Iceland (Iceland) – Gauti Kristmannsson, firstname.lastname@example.org
National Library of Norway (Norway) – Jon Arild Olsen, email@example.com