Speech-to-text Recognition for the Creation of Subtitles in Basque: An Analysis of ADITU Based on the NER Model

This contribution analyses the speech-to-text recognition of news programmes on the regional channel ETB1 for subtitling in Basque using ADITU (2024), a technology developed by the Elhuyar Foundation, applying the NER model of analysis (Romero-Fresco and Martínez 2015). A total of 20 samples of approximately 5 minutes each were recorded from the regional channel ETB1 in May 2022. A total of 97 minutes and 1737 subtitles were analysed by applying criteria from the NER model. The results show an average accuracy rate of 94.63% if we take all errors into account, and 96.09% if we exclude punctuation errors. A qualitative analysis based on the quantitative data identifies room for improvement regarding the software's language models, punctuation, recognition of proper nouns and speaker identification. From the evidence it may be concluded that, although the quantitative data do not reach the threshold at which the NER model considers recognition quality fair or comprehensible, the results seem promising. When presenters speak with clear diction and standard language, accuracy rates are sufficient for a minority language like Basque, in which speech recognition software is still in the early phases of development.


Introduction
This contribution aims at analysing the speech-to-text recognition in Basque of recorded news programmes on the regional channel ETB1, applying the NER model of analysis (Romero-Fresco and Martínez 2015).
In the present study, the analysis focuses on the output (the subtitles) and does not delve into the speech-to-text recognition process. In this sense, this contribution is an evaluation of the quality of subtitles in Basque created using speech-to-text technology, which might help improve the automation of subtitling in Basque.
In this section we offer a brief history of Basque within the regional channels of ETB and the current situation and availability of subtitling in the different ETB channels. In section 2, we explain the NER model, which will serve as the basis for gathering quantitative and qualitative data for the analysis. In section 3, the methodology of how samples were gathered and analysed is explained in detail. Section 4 offers the analysis and discussion of the data. In that section we look specifically at quantitative data on the accuracy rate (AR) of the recognition software (4.1), recognition errors and corrected recognition (4.2), as well as subtitle speed (4.3). We also offer a qualitative analysis of where we believe there is room for improvement (4.4) in the recognition software for the creation of automatic subtitling in Basque. Throughout the analysis section, we refer to an interview with Igor Leturia, Head of Speech Technologies at Orai Natural Language Processing Technologies, who works on the development of ADITU (2024). This interview took place on December 14, 2022, after the quantitative analysis had been carried out and a report had been sent to the interviewee. To conclude, section 5 offers some final remarks emphasising the contributions of the paper.

Subtitling in Basque in ETB
The Basque language, or euskera, is a minority language spoken in Euskal Herria, a small Basque-speaking area of 20,664 km² that spans the banks of the Bay of Biscay across northern Spain and southern France (Zuazo 1995: 5). The Basque Statistics Institute (Eustat) divides Basque speakers into three categories, depending on their proficiency and fluency levels: Basque speakers, with full proficiency in Basque; quasi-Basque speakers, who face some difficulties when speaking the language; and non-Basque speakers, who can neither speak nor understand Basque (Lasagabaster 2010: 404). Table 1 shows the percentage of speakers from each category in Navarre, the Basque Autonomous Community (BAC) and Iparralde 1 (Euskararen erakunde publikoa, Eusko Jaurlaritza and Nafarroako Gobernua 2023a, 2023b, 2023c). In Spain, since Basque acquired co-official status in 1978, a number of efforts have been made to revive the language (Cenoz and Perales 1997). The creation of a television channel 2 entirely in Basque, along with the opening of primary schools that teach in Basque, is closely linked to this desire to revitalise the language (Costa 1986: 343).
On May 20th, 1982, the Basque Parliament unanimously adopted an act creating Euskal Irrati Telebista (EiTB). The first year of broadcasting constituted a trial period marked by coverage problems and the difficulty of complying with the initial plan to broadcast four hours a day (Larrañaga and Garitaonandía 2012: 34). Thus, the programming consisted, almost entirely, of foreign programmes (such as soap operas, cartoons or documentaries) dubbed into Basque and subtitled in Spanish (Barambones 2009: 89-91). However, over the following years the number of hours of broadcasting increased, as did the range of programmes offered. Just over three years after ETB's first broadcast, and after a string of political and sociolinguistic controversies, ETB2 was born with the aim of responding to the needs of a large percentage of the Basque population who neither spoke nor understood Basque, but wanted to access content that covered their closest reality (Cabanillas 2016: 57).
Nowadays, EiTB Media owns four broadcasting channels: ETB1, a generalist channel with Basque as the broadcasting language; ETB2, a generalist channel with Spanish as the broadcasting language; ETB3, a channel aimed at broadcasting children's and youth programmes, mostly in Basque; and ETB4, a channel that broadcasts reruns in both Basque and Spanish. With regard to subtitling practices, Table 2 and Figure 1 summarise current practices and the percentage of hours subtitled. For example, En Jake and Nos echamos a la calle are subtitled through respeaking by a company from Burgos.
The weather forecast, in turn, is subtitled without editing, using the speech-to-text programme Idazle, developed by Vicomtech. In the matter under discussion, the reality is that no live or semi-live subtitling is offered in Basque on ETB. All the subtitles included in ETB1 are pre-recorded: they have been automatically generated using speech recognition software and edited afterwards. Given the high error rate in speech recognition, it is imperative that they are edited before being broadcast, which is why recognition software is only used with recorded programmes, such as Mihiluze (Larrinaga, personal communication, October 14, 2022). An exception to this is the weather forecast. The weather broadcast (Eguraldia in Basque) is a pre-recorded section of the news programme that includes automatic subtitling without editing (Larrinaga, personal communication, October 14, 2022). Thus, without editing, the quality of the subtitling depends, to a large extent, on the number of interferences registered and the speaker's diction, as well as on thematic limitations (Larrinaga, personal communication, October 14, 2022). After an initial test period, and with a view to improving the quality of the subtitles for the weather forecast, ETB's Basque Service compiled a technical weather glossary, a list of popular sayings, expressions and words related to the weather, and a list of Basque toponyms, which was later uploaded into the software and improved the quality of this section's subtitles (Larrinaga, personal communication, October 14, 2022). These subtitles are not analysed in the present contribution, as we deal here with a different software.

In conclusion, current broadcasting does not offer automatic live subtitles in Basque generated by speech recognition without editing. Therefore, in order to gather samples to analyse current recognition accuracy rates in Basque, we need to turn to non-broadcast subtitles. How these materials were gathered is explained in detail in the methodology section.

The NER model
In recent decades, the use of intralingual subtitles, especially those for the deaf and hard-of-hearing, has increased considerably, driven in large part by the ongoing efforts of some countries in the field of audiovisual accessibility (Pedersen 2017: 212). However, once subtitling quotas were relatively met and quantity had reached acceptable standards, attention was drawn to the quality and accuracy of the subtitles (Pedersen 2017, Romero-Fresco 2012). In this context, the NER model was developed, a product-oriented model aimed at analysing the accuracy of live subtitles (Romero-Fresco and Martínez 2015). By adopting the basic principles of the WER model (see Dumouchel, Boulianne and Brousseau 2011), the NER model, which divides transcripts into idea units (word chunks that are recognised as one unit) rather than words, has been particularly tailored to suit respoken and automatic intralingual subtitling (Pedersen 2017, Romero-Fresco and Martínez 2015). It is based on the following formula to determine the quality of the subtitles:

NER = ((N − E − R) / N) × 100

In order to calculate the overall score, the number of editing errors (E) and recognition errors (R) is deducted from the total number of words (N) contained in the subtitles. The resulting number is divided by the N value, which is then multiplied by one hundred. An error-free subtitle would score a NER value of 100. It should be noted that the threshold set by the NER model for acceptable accuracy is 98% (Romero-Fresco and Martínez 2015, Romero-Fresco 2016). In contrast to the WER model, which is based on word errors regardless of their typology, the NER model differentiates between errors based on their severity, and some mismatches in recognition may not be considered errors.
In our particular case study, editing errors are not taken into account, as the automatically generated subtitles in our corpus are not edited. Editing errors can occur in live editing (either by a respeaker or by a person controlling the automatically generated subtitles live) or in semi-live and pre-recorded subtitling, as would be the case for some subtitling practices at ETB. For this contribution, only recognition errors (R) are taken into account.
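Since only recognition errors are counted here (E = 0), the calculation reduces to a simple function. The sketch below is illustrative only: the severity weights (0.25 for minor, 0.5 for standard, 1 for serious errors) follow the published NER model rather than anything stated above, and the figures in the usage example are invented.

```python
def ner_score(n_words, minor=0, standard=0, serious=0, editing=0.0):
    """NER accuracy rate (%): ((N - E - R) / N) x 100.

    Recognition errors (R) are weighted by severity, following the
    published NER model: 0.25 (minor), 0.5 (standard), 1 (serious).
    In this corpus editing errors (E) are always zero, as the
    automatic subtitles are unedited.
    """
    r = 0.25 * minor + 0.5 * standard + 1.0 * serious
    return (n_words - editing - r) / n_words * 100

# Invented figures: 1000 words, 20 minor and 10 standard recognition errors
print(round(ner_score(1000, minor=20, standard=10), 2))  # 99.0
```

An error-free sample scores 100; a sample needs a weighted error total of no more than 2% of its word count to reach the 98% threshold.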

Methodology
The analysis presented in this contribution was carried out within the project "QUALISUB: The Quality of Live Subtitling: A regional, national and international study". The methodology originally established for the project could not be followed: the aim was to analyse speech-to-text recognition on the regional news programmes broadcast by TVE, the Spanish public broadcaster, but the subtitles provided by TVE between May and December 2021 contained too few automatic subtitles in Basque 3 . This meant following a different methodology for Basque, compared to the other languages analysed in the project. The sociocultural reality in which live events and automatic subtitles take place in the Basque Country (see section 1.1) led to the need to use recorded programmes. How the samples were gathered is explained in detail in the next paragraphs.
News programmes from ETB1 (the regional TV channel broadcasting only in Basque) were recorded in May 2022 4 . Whole news programmes were recorded over 10 non-consecutive days, between May 2nd and May 15th, from all time slots (morning, afternoon and night) and from all days of the week. From each day, two samples of 5 minutes each were extracted, with the aim of having samples comparable with the other languages of the project. Samples contain different subgenres (sports news, weather forecast, interviews, reporters, headlines, etc.) and are heterogeneous in their use of language (monolingual samples, bilingual [Basque and Spanish] samples, different dialects, etc.). Our corpus, therefore, consists of 20 samples from the news programmes of ETB1 recorded in May 2022. The samples add up to 1 hour and 41 minutes (101 minutes) and their characteristics are shown in Table 3.
Since no real (broadcast) automatic subtitles in Basque are available for live events in the Basque Country (see section 1.1), we obtained automatic subtitles by contacting the Elhuyar Foundation. The Elhuyar Foundation is a private non-profit organisation. It was founded in 1972 as a cultural association with the aim of combining science and the Basque language, and became a foundation in 2002. Currently it offers services for the application of advanced knowledge and helps companies, social agents and administrations to seek innovative solutions to the challenges of globalisation from a multidisciplinary approach. The Elhuyar Foundation also works with professionals in science, artificial intelligence, lexicography, translation and language management.

Table 3. Summary of samples in the corpus
In March 2020, the Elhuyar Foundation launched ADITU, a speech recognition software that automatically transcribes and subtitles audio and video files. It was designed using artificial intelligence and neural network technology, and it works with both pre-recorded and live audio and video files (Aztiria, Jauregi and Leturia 2020). In its early version, the software recognised Basque and Spanish speech separately, but the latest version can also be used to transcribe and subtitle bilingual files. As with many other programmes, ADITU works with both an acoustic model and a language model. Put simply, the acoustic model identifies phonemes and the language model determines which word sequences are more frequent, based on the different texts that feed the programme. ADITU works with 4- or 5-word sequences to determine which set of words is more probable. The texts that feed the language model are mostly journalistic, and the language model is updated with more texts several times a year (Leturia, personal communication, December 14, 2022).
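The idea of choosing the most probable candidate from 4- or 5-word sequences can be illustrated with a toy n-gram counter. This is only a sketch of the general technique, not ADITU's actual implementation; the Basque mini-corpus and the window size of 3 are invented for brevity.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every n-word sequence in the training text."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sequence_score(candidate, counts, n):
    """Sum the training counts of the candidate's n-grams:
    the higher the score, the more frequent the word sequence."""
    return sum(counts[tuple(candidate[i:i + n])]
               for i in range(len(candidate) - n + 1))

# Invented mini-corpus (ADITU's language model is fed with journalistic texts)
corpus = "gaur euria egingo du bihar eguzkia egongo da gaur euria egingo du".split()
counts = ngram_counts(corpus, n=3)

# Between two candidates offered by an acoustic model, the language
# model prefers the sequence seen more often in training
print(sequence_score("gaur euria egingo du".split(), counts, n=3))  # 4
print(sequence_score("gaur euria egingo da".split(), counts, n=3))  # 2
```

A real system would work with smoothed probabilities over far larger corpora, but the principle of ranking alternative word sequences by their frequency in the training texts is the same.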
As mentioned above, we contacted the staff of Elhuyar in charge of the programme ADITU. With the aim of obtaining subtitles of as high a quality as possible for the Basque segments, the monolingual version of ADITU was chosen. All 20 samples were sent in .ts format 6 . Elhuyar provided raw .srt files 7 (with no segmentation), arranged .srt files (with segmentation into a maximum of two lines, with the maximum of 37 characters per line established by ADITU) and .txt files. In this particular study, the arranged, non-edited .srt files were analysed. One sample (ETB1_1) was not recognised by the programme ADITU and we have neither data nor subtitle files from that sample 8 . Therefore, only 19 samples could be analysed, making a total of 97 minutes and 1737 subtitles. In this case study, pop-on subtitles containing one or two lines were provided. Throughout the analysis, we use the term subtitle to refer to a one- or two-line pop-on block. It should be noted that ADITU can also create roll-up subtitles for live broadcasts. However, although the subtitles analysed in the present article were generated for programmes broadcast in a live setting, the insufficient availability of live subtitling in Basque meant that pop-on block subtitles were produced, as if for recorded programmes. Consequently, the results presented here may differ from those that would be obtained if broadcast live subtitles were analysed, as delay or editing errors, among other factors, would need to be considered.
For the data to fit into the NER model of analysis, transcription of all samples was performed manually. Only subtitles related to segments entirely in Basque are analysed in the present contribution. Therefore, segments containing audio/subtitles in Spanish or other languages are excluded from the analysis.
The quantitative analysis is based on the NER model, which establishes a threshold of 98% as the minimum AR for subtitles to be considered comprehensible. Following the NER model, in this analysis minor recognition errors (R-M) are those which do not hinder the correct comprehension of the audiovisual content. Standard recognition errors (R-ST) are those that the viewer will be able to recognise as errors but which will not allow for a correct comprehension of the audiovisual content, or may create confusion for the viewer. Serious recognition errors (R-SE) are those considered unlikely to be noticed by the viewer and which will cause viewers to infer a different meaning from that heard in the audio. Finally, we consider corrected recognition (CR) to be the omission or reduction of information that is redundant in the audio, such as markers of oral speech or unintended repetitions of words, which the software suppresses autonomously.
Labelling of the errors and CRs was carried out manually in Excel templates used specifically for analysis with the NER model. Based on the data provided by the Excel sheets and the information provided by Igor Leturia, the next sections offer a quantitative and qualitative analysis of the automatic subtitles in Basque for news events created by Elhuyar with the monolingual (Basque) version of ADITU at its May 2022 developmental stage.

Results and discussion
In this section, we include a quantitative and qualitative analysis of the 97 minutes and 1737 automatic subtitles in Basque following the NER model. Firstly, we examine the AR of automatic subtitles in Basque, with and without considering punctuation errors, and taking into account the NER threshold for acceptable recognition accuracy. Secondly, we look at the types and severity of errors, and we discuss possible explanations for the errors encountered. Thirdly, we examine the average subtitle speed of each programme and compare it with the maximum recommended by the UNE standard in Spain: 15 characters per second (CPS) (AENOR 2012). Finally, we offer a qualitative analysis of the data, bearing in mind the particularities of the Basque language and the programmes analysed, and looking at the possible comprehension of automatically generated subtitles beyond the numbers provided by the NER model.

Accuracy rates
The average AR for the automatic subtitles in this contribution is 94.63% if we take all errors into account, and 96.09% if we leave punctuation errors aside. The highest AR is 98.01% (99.12% without punctuation errors) and the lowest AR is 89.12% (90.84% without punctuation errors). The average AR for automatic subtitles in our sample, as well as the highest and lowest ARs, are below any data reported by Romero-Fresco and Fresno (2023) for English. The AR for each sample can be seen in Figure 2. The NER threshold for acceptable recognition accuracy is set at 98% (shown in grey). Only sample 7 (AR with punctuation errors = 98.01%, AR without punctuation errors = 99.12%) and sample 13 (AR without punctuation errors = 98.17%) can be considered fair or good taking into account only the quantitative measurements of the NER model. Both samples contain the delivery of news by anchors and reporters whose speech is prefabricated, very well structured and delivered with standard diction. Special attention should be given to samples 19 and 20, which score considerably lower than the rest, as these are the only two samples that contain news and spontaneous speech (interviews) of people from Iparralde (French Basque Country). The AR of sample 19 is especially low because most of the sample consists of an interview with a farmer from Iparralde, with a non-standard accent and dialect. Excluding samples 7, 19 and 20, all other samples score between 92.84% and 97.06% (94.41% and 98.17% without punctuation errors) which, although it might not translate into acceptable data for the NER model, seem satisfactory numbers and are similar to the 95% reported in Aztiria, Jauregi and Leturia (2020) for this same software. These numbers should be read bearing in mind the minoritised situation of Basque; the short trajectory of voice recognition software in this language (and, especially, of the software ADITU), when compared to hegemonic languages such as English or Spanish; and the recognition difficulties that may arise from accents and dialects that differ from unified Basque (euskera batua) 9 (see Aztiria, Jauregi and Leturia 2020).
If we take the sample with the highest AR (sample 7), we can infer which characteristics of the audiovisual product favour the absence of recognition errors and, therefore, a higher AR and better comprehension of the subtitles and the audiovisual text. When videos feature few speakers, all using normalised Basque (euskera batua), a formal register, clear and standard diction and a prefabricated linguistic code, the recognition errors observed are mostly related to punctuation. In fact, sample 7 contains only 17 non-punctuation errors (8 R-M and 9 R-ST) and 30 punctuation errors (29 R-M and 1 R-SE).

Errors and corrected recognition
In the 19 samples analysed, 1967 errors were found (1102 R-M, 768 R-ST, 97 R-SE), divided among the different samples as shown in the corresponding table. The severity of those errors is represented in percentages in Figure 3. As stated previously, minor errors (R-M) are those that the programme has not been able to recognise correctly but that do not hinder the correct comprehension of the subtitle and the audiovisual content. Standard errors (R-ST) are those which the viewer will be able to recognise as such, but which will not allow understanding of the subtitle and the audiovisual content, and will likely trigger confusion in the viewer. Serious errors (R-SE) will not be identifiable by the viewer as errors and will make the viewer understand something different from the audio. It is worth noting that a remarkable part of these errors (726 errors; 36.9%) are punctuation errors. For this analysis in particular, we treat as punctuation errors all those errors related to punctuation itself (a missing full stop at the end of a sentence, missing commas, commas that should be full stops, etc.); ortho-typographic errors that are not a consequence of poor recognition itself, but are related to the language models inserted in the programme (upper-case and lower-case errors, accents 10 , the substitution of C by K in proper names, etc.); and errors related to the misidentification of speakers. According to our conversation with Igor Leturia, a new punctuation system has recently been implemented and the number of punctuation errors has allegedly decreased (Leturia, personal communication, December 14, 2022). This new punctuation system has been trained following the ElhBERTeu model (explained in Urbizu et al. 2022). Hidalgo (2023) compared the punctuation output of both versions and concluded that, while the number of punctuation errors is still high, the use of commas is more prevalent in the newer version and the programme detects the duration of the speaker's pauses with greater accuracy. Considering this, we find it useful to look at the severity of errors excluding punctuation errors (a total of 1241 errors) and the severity of the punctuation errors (726 errors). These results are shown in Figures 4 and 5 (Figure 4: minor 35%, standard 59%, serious 6%). As can be seen, the majority of punctuation errors (92%) are minor errors that do not hinder comprehension of the audiovisual content. When comparing these results with the results of other languages using the NER model 11 , the high percentage of standard errors is noticeable, both taking into account the severity of all errors (see Figure 3) and the severity of errors without punctuation (see Figure 4). Some of these errors are due to the number of proper nouns (for instance, regarding the Ukrainian conflict in sample 17 or the Roe versus Wade case in sample 3); others are due to the variety of dialects and dictions that the programme has difficulty in recognising (such as those in samples 19 and 20, taken from Iparraldearen orena). A third explanation for the number of standard errors is the audio that can be heard underneath a voice-over translation (such as in sample 17, where there is a voice-over for German). ADITU tries to recognise that audio segment, which results in errors that generate confusion in the viewer. When we asked the developers of ADITU about these errors, they confirmed that dialects, proper nouns and bad audio quality (which may extend to voice-over tracks) are the main reasons for recognition errors in this software (Aztiria, Jauregi and Leturia 2020; Leturia, personal communication, December 14, 2022).
ADITU also applies corrected recognition (CR), the synthesis or omission of information that the software identifies as unnecessary and automatically suppresses from the subtitles. In most instances, this corresponds to marks of orality and word repetitions, as shown in Table 8. Nonetheless, it should be pointed out that the software sometimes fails to recognise certain marks of orality as such, resulting in errors at all levels of severity (see Table 9). ADITU was not specifically trained to dismiss repetitions or marks of orality. CRs in ADITU occur when the programme goes from the acoustic model to the language model. As the language model is mostly fed with journalistic texts (in which such marks of orality are not usual), the programme dismisses those words (Leturia, personal communication, December 14, 2022).

Subtitle speed 12
In this study, we analysed subtitle speed in CPS. Figure 6 shows the average CPS per sample. ADITU does not set a maximum CPS for its automatically generated subtitles, so subtitle speed varies depending on the speech rate and the information displayed in the subtitles (Leturia, personal communication, December 14, 2022). The maximum subtitle speed recommended in Spain (for Spanish) by the Standard UNE 153010 (AENOR 2012) is 15 CPS. Only 3 of the 19 samples analysed are below that recommendation. Nevertheless, it should be noted that this is a standard for Spanish, and it is difficult to know whether the same subtitle speed would be recommended for a language such as Basque. Moreover, research shows that setting a maximum recommended speed at a specific number is neither easy nor advisable. Many researchers and studies suggest that the recommended maximum depends on factors such as the information that needs to be inferred from the image, the amount of visual action at a given moment, the complexity of the vocabulary and syntax used in each subtitle, the audience at which the audiovisual product is aimed, the complexity of the topics tackled in the video, the audiovisual genre, etc. In a recent interview, researcher Agnieszka Szarkowska (episode 32, podcast En Sincronía, July 2022) stated that the maximum recommended speed usually lies somewhere between 12 and 20 CPS, depending on the factors mentioned above. Taking this into consideration, and without a reception study that could actually provide us with data on the comprehension of the subtitles and audiovisual text by deaf people, we cannot state that the subtitle speed of these samples is inadequate.
Table 10 shows, per sample, the minimum and maximum CPS values, the average CPS and the percentage of subtitles exceeding 15 CPS (which would go against the recommendations of AENOR), 20 CPS (which would go against academics' general recommendations for different languages, audiovisual media, genres and audiences) and 25 CPS. In a more qualitative analysis, and factoring in what has been said for other languages (see, for instance, Romero-Fresco 2016, which contains an analysis of speech rate and subtitle speed in words per minute of UK programmes), the average CPS of these samples does not appear overly high. Only one sample (ETB1_4) is above 18 CPS, and 16 out of 19 samples are below 17 CPS. It is of interest that the three samples above 17 CPS deal with the weather forecast and sports sections of the news programmes. The speech rate of these sections tends to be higher than that of other subgenres, and therefore the average CPS of these samples is also higher.
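For reference, the CPS of a pop-on block is simply its character count divided by its display time. The snippet below is a minimal sketch of the calculation; the subtitle text and timings are invented, and whether line breaks and spaces count as characters is a convention that varies between guidelines (here the line break is replaced by a space).

```python
def cps(text, start_s, end_s):
    """Characters per second of one subtitle block.

    Line breaks are counted as a single space; counting conventions
    vary between guidelines, so this is one reasonable choice.
    """
    chars = len(text.replace("\n", " "))
    return chars / (end_s - start_s)

# Invented two-line pop-on block displayed for 2.5 seconds
block = "Eguraldi ona izango dugu\nbihar arratsaldean"
print(round(cps(block, 10.0, 12.5), 2))  # 17.2, above the 15 CPS of UNE 153010
```

The per-sample figures reported in Table 10 correspond to aggregating this per-block value over all subtitles of a sample.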

When compared with guides other than AENOR (2012) that do not revolve around live subtitling, the average CPS values of the analysed samples are close to the maximum CPS values proposed, for instance, in the style guide for Basque subtitles by the streaming platform Netflix 13 , which sets a maximum of 17 CPS for adults and 13 CPS for children.
All the samples register, at some point, subtitles above 20 CPS, which would be considered too fast for any type of audience, medium and audiovisual genre. According to Table 10 above, the sample with the fewest subtitles exceeding 20 CPS is EITB1_19, with 1.33% of the subtitles between 20 and 25 CPS and 1.33% above 25 CPS. At the other end of the scale is ETB1_4, with 22.12% of the subtitles between 20 and 25 CPS and 6.20% above 25 CPS. Thus, almost one out of three subtitles in ETB1_4 is above what is considered readable for any audience and any audiovisual medium or genre. The peaks reached by all the samples analysed are equally illegible for any audience and medium, the lowest peak being found in sample ETB1_12, at 22.69 CPS, and the highest in sample ETB1_14, with a maximum of 53.85 CPS.

Qualitative analysis and possibilities for improvement
This section offers a qualitative analysis and prospects for improvement based on the quantitative data presented above. Different areas of possible improvement have been identified; here, we look into language-related issues and more technical issues. This qualitative analysis is not without limitations, as we did not conduct a reception study with deaf people. Nevertheless, these areas of interest were discussed in our online interview with Igor Leturia in December 2022.

Language-related issues
The programme ADITU tends to standardise spontaneous speech and speech coming from euskaldunzaharrak, native Basque speakers who generally speak in a given euskalki or dialect. Table 11 shows some examples of this standardisation found in our corpus. This standardisation has both advantages and disadvantages. On the one hand, taking into account the minoritised situation of Basque, it benefits language standardisation and normalisation, facilitates subtitle comprehension by a larger population and supports literacy in Basque. On the other hand, it might hinder access to content, as it does not provide information on how speech is delivered. The fact that a person speaking in Basque says "dan" or "azkenian", instead of the standardised versions of those words, provides information about their relationship with the language, their origins, the dialect they might prefer using, etc. The continuous standardisation and normalisation when text goes from oral to written language might limit access to some nuances related to the use of a language. In our corpus, nevertheless, this standardisation is not consistent, and excerpts with non-standardised language have been found. Consideration should be given to the future provision of access to subtitles in Basque and to the advantages and disadvantages of normalisation and standardisation. Igor Leturia and the staff working on ADITU at Elhuyar believe subtitles should be standardised to batua, as that is the norm in current subtitling practices (Leturia, personal communication, December 14, 2022).
It has also come to our attention that frequently used proper nouns (such as Estatu Batuak [United States] in sample 3 or Athletic in sample 8) are often misrecognised. Other misrecognitions of proper nouns have to do with the names of the reporters, names of places related to the war between Russia and Ukraine, Spanish names related to politics (for which, following the Libro de estilo de EiTB: información y actualidad style guide [Martín Sabarís et al. 2016], diacritics are often not employed), or proper names related to the latest news (Roe versus Wade, for example). Errors with proper nouns are one of ADITU's main problems, but they are not easy to overcome. On the one hand, proper nouns related to the latest news would require updating the language model with a regularity that is currently not possible at ADITU's developmental stage. Moreover, the case and declension system of Basque would require adding not only the non-declined word (for instance, Járkov), but also all the declined forms of that word (Járkovtik, Járkovera, Járkoven, etc.), occurring in different word sequences, for the language model to be able to choose correctly from the different possibilities that the acoustic model might offer (Leturia, personal communication, December 14, 2022). Regarding frequently used proper nouns in Basque (such as Estatu Batuak and Athletic), Igor Leturia stated that exceptions to the norms might be inserted into the programme. That would be the case for the pronunciation of Athletic, the spelling of which does not follow Basque conventions and might pose difficulties for the acoustic model, which would have to be trained specifically for that word and its declension cases. After our conversation, ADITU's developers stated that they would evaluate the special cases of Estatu Batuak and Athletic and try to improve the recognition of those two nouns.
Samples containing a weather forecast, for example, show a considerable number of standard errors. The most noticeable one is sample 4, which contains a total of 83 standard errors, the second highest number in the corpus. Some standard errors in sample 4 have to do with the topic (weather) and the specific language used in this subgenre (see Table 12). According to Igor Leturia (personal communication, December 14, 2022), the weather forecast should not pose more problems than any other subgenre, as it has been equally trained. A hypothesis posed by Leturia is that problems with recognition and standard errors in weather forecasts might be due to speech rate and not to the language model (an argument also stated in Aztiria, Jauregi and Leturia 2020). A more in-depth analysis of this subgenre might be useful to confirm such a hypothesis.
Finally, taking into account that Basque is an agglutinating language with a thorough case system, the declension of a word can be decisive for the meaning of a sentence, and a declension error (which may involve the misrecognition of just one letter) can change the meaning of a sentence completely. In sample 8, "Osasunak" (a proper noun in the ergative case) is recognised and subtitled as "Osasuna" (a proper noun in either the nominative or the accusative case), thus turning what should be the subject of the sentence into its direct object. Here it is labelled as a minor error, but a change in just one letter could also lead to standard or serious errors depending on the audiovisual context and the text surrounding the error. Therefore, any future training of respeakers in Basque would need to pay special attention to the correct pronunciation of declensions and verb conjugations.

Technical issues
Here, we will deal with punctuation, speaker identification and the special case of "ehuneko" (percentage). Punctuation errors are, in our sample, mostly minor errors. On some occasions, lower-case letters are used after a full stop or after the colon in a speaker identification tag. On other occasions, the full stop is missing in sentences. Other errors analysed under the punctuation category, such as missing diacritics ("Sanchez" instead of "Sánchez", for example) or upper-case lettering (some proper nouns appear with all letters capitalised, for instance), are also mostly minor errors. Nevertheless, the high number of punctuation errors raises the question of whether there is room for improvement in this matter. A new version of ADITU is now available with a new punctuation system (Leturia, personal communication, December 14, 2022). In a recent comparison of the two versions (2022 against 2023), Hidalgo (2023) concluded that, although the total number of punctuation errors decreased, the results and the severity of punctuation errors in both versions turn out to be very similar.
Speaker identification is predicted on the basis of acoustic information and silence gaps between interventions (Leturia, personal communication, December 14, 2022). Sometimes, this leads to (1) unnecessary identification (because the tag appears again after a short period of silence even if the same person is speaking); (2) identification in subtitles that may lead to confusion (because the tag appears in the middle of a sentence when a silence gap occurs); and (3) on some occasions, serious errors. For example, the same name tag (HZLR1, for instance) is sometimes used within a 5-minute sample to identify two different speakers. Most of these errors are minor, but the current criteria for identifying speakers also lead to standard and serious errors. According to Leturia (personal communication, December 14, 2022), some of the errors in speaker identification have been solved in the new version of ADITU. In this regard, Hidalgo (2023) concluded that, although the use of tags has improved in the new version, inaccuracies continue to be detected and a significant portion of the speaker identification tags used by ADITU are still incorrect.
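The silence-gap heuristic described above can be sketched as follows. This is our own minimal reconstruction under stated assumptions, not ADITU's actual algorithm: it assumes a single pause threshold and ignores the acoustic information that the real system also uses, which is precisely what makes the failure mode visible.

```python
def tag_speakers(utterances, gap_threshold=1.0):
    """utterances: list of (start_s, end_s) tuples sorted by start time.
    Returns one speaker tag per utterance; a pause longer than
    `gap_threshold` seconds triggers a new tag even when, in reality,
    the same person is still speaking."""
    tags = []
    speaker = 1
    prev_end = None
    for start, end in utterances:
        if prev_end is not None and start - prev_end > gap_threshold:
            speaker += 1  # silence gap => heuristic assumes a new speaker
        tags.append(f"HZLR{speaker}")
        prev_end = end
    return tags

# The same presenter pausing for 1.5 s is wrongly given a new tag:
print(tag_speakers([(0.0, 4.0), (5.5, 9.0), (9.2, 12.0)]))
# ['HZLR1', 'HZLR2', 'HZLR2']
```

The sketch shows why silence alone produces both unnecessary re-identification (a long pause by one speaker) and missed changes (a fast turn-taking exchange with no pause would keep a single tag for two different speakers).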
Regarding the special case of "ehuneko" (per cent), it came to our attention that this word is not subtitled at all, which results in serious errors (because a percentage is understood as an absolute number) and standard errors (because sentences lose meaning and the subtitle creates confusion). On some occasions, as in sample 5, these percentages are shown on a screen behind the presenters and the errors are then marked as minor, because they are detectable and the correct information can be inferred from the image and the audiovisual context. Nevertheless, this is only possible thanks to information redundancy; the subtitle, by itself, does not allow the correct information to be inferred. According to Leturia (personal communication, December 14, 2022), the problem with "ehuneko" was an exceptional one during the dates in which our .srt files were generated. The problem was solved in the following months and the new version of ADITU no longer presents those errors (Hidalgo 2023).

Final remarks
This contribution has analysed automatic subtitles generated by the monolingual (Basque) version of ADITU, a speech recognition software developed by Elhuyar and launched in 2020. Although the quantitative data do not reach the threshold to establish the quality of recognition as fair or comprehensible according to the NER model, both the quantitative and the qualitative analyses show that the results are promising, although there is room for improvement regarding language and technical issues. Some of these improvements have reportedly been implemented in a new version of ADITU, partially analysed in Hidalgo (2023). These rapid improvements and their implementation are expected to significantly increase the AR of speech recognition in Basque in the near future.
The analysis permits us to foresee some factors that may influence the accuracy rate, such as:
• Batua (standardised language) or the use of euskalkiak (dialects)
• Register
• Diction of speakers
• Prefabricated discourse or spontaneity
• Number of speakers
• Speakers being anchors, reporters or informants
• Background noise and language interference in the audio
• Proper nouns
The interview with Igor Leturia (personal communication, December 14, 2022) confirmed that proper nouns, poor audio quality and non-standardised language pose the greatest difficulties for speech recognition with ADITU. Our findings based on the NER model and the interview with Igor Leturia also support statements by Pérez Cernuda (2022) on the evaluation of bilingual automatic captions, as well as a previous publication on the launching of ADITU (Aztiria, Jauregi and Leturia 2020). These factors should be researched in depth, and isolated if possible, in order to identify the relationship between their variation and the AR in the NER model.
Basque society is plurilingual, and so are its audiovisual products (Larrinaga 2019). This contribution has only analysed excerpts in Basque, which gives us an idea of the current situation of the recognition software, but does not account for the reality of either the society or its products. Further analysis taking into account programmes that combine Basque with, at least, Spanish and French, and software that is compatible with that reality, would be of great benefit in the future.
In addition, a second round of analysis of these same samples once improvements have been applied to the software will allow us to analyse the evolution and development of the recognition programme. Other technologies (such as those developed by Vicomtech or Google) should also be taken into consideration in further analyses.

Figure 2. Accuracy rates per sample with and without punctuation

Figure 3. Severity of all errors

Figure 4. Severity of errors without punctuation

Table 4.
On average, our sample included 20.2 errors/minute, which is below the rate reported by Romero-Fresco and Fresno (2023) for automatic closed captions in English.

Table 12. Errors related to weather forecast terminology