430 books were used for the corpus. These were allocated National Curriculum levels and text categories by the expert panel of teachers and Welsh language specialists.

Following the terms agreed with the publishers, several samples were taken from each book and allocated random file names to create 1,720 text files in all.

Table showing number of books within each National Curriculum Level:

Level Number of Books

Table showing number of books within each text category:

Text Category  Number of Books
Children's Stories  
Children's Novel  
Picture Books  

Description of the text files

The text files are saved in Unicode format (rather than ASCII), which means that all the Welsh characters can be displayed using standard fonts (e.g. in Microsoft Word) without having to resort to specialised fonts.
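For readers who want to process the files with a script rather than a word processor, the sketch below shows one possible way of decoding them. It is only an illustration: the exact Unicode encoding of the files is not stated here, so the function (a hypothetical helper, `decode_corpus_bytes`) looks for a byte-order mark and otherwise assumes UTF-8.

```python
import codecs


def decode_corpus_bytes(raw: bytes) -> str:
    """Decode the raw bytes of a text file, using a byte-order mark if present."""
    # UTF-8 with a BOM: strip the three BOM bytes, then decode.
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):].decode("utf-8")
    # UTF-16 with a BOM: the "utf-16" codec reads the BOM to choose the
    # byte order and removes it from the result.
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        return raw.decode("utf-16")
    # No BOM found: fall back to UTF-8 (an assumption, not stated in the source).
    return raw.decode("utf-8")


# Example with Welsh accented characters (circumflex on y and w).
sample = "t\u0177 d\u0175r"
decoded = decode_corpus_bytes(sample.encode("utf-16"))
```

To use it on a real file, read the file in binary mode (`open(path, "rb").read()`) and pass the bytes to the function.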

Some of the text files are inevitably very small, as there are sometimes fewer than 100 words in the books for younger children.

NB Under the agreement with the publishers who kindly gave permission for the use of the texts, download and use of these files is bound by the following conditions:

The moral rights of the individual authors of these texts have been identified under the Copyright, Designs and Patents Act 1988, and any unauthorised use of these texts is prohibited:

  • Authorised Use: For research purposes
  • Unauthorised Use: Reproducing the texts for commercial use.

Analysis of data - Counts of Raw Word Forms

Word counts are provided for the corpus as a whole and also for each set of National Curriculum Level texts from 1 to 7.

These were created with the assistance of Mike Scott's Wordsmith software.

The word counts are based on the actual word forms occurring. These include spellings which represent dialectal forms, informal spellings of Welsh forms, foreign words (particularly from English), as well as wrongly spelled Welsh words (that is, misprints in the original texts).
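To make the distinction between word form tokens and word form types concrete, here is a minimal sketch in Python. It does not reproduce Wordsmith's procedure: the tokenisation rule (runs of letters with internal apostrophes or hyphens) and the lower-casing are assumptions made for the example, and the sample sentence is invented.

```python
import re
from collections import Counter

# Assumed tokenisation: one or more letters, optionally joined by an
# apostrophe or hyphen (so "Mae'r" stays a single word form).
WORD_RE = re.compile(r"[^\W\d_]+(?:['’\-][^\W\d_]+)*")


def count_word_forms(text: str) -> Counter:
    """Count raw word forms exactly as spelled, ignoring case (an assumption)."""
    return Counter(m.group(0).lower() for m in WORD_RE.finditer(text))


counts = count_word_forms("Mae'r ci yn y tŷ. Mae cath yn y tŷ hefyd.")
tokens = sum(counts.values())  # every occurrence counts once
types = len(counts)            # each distinct word form counts once

for form, freq in counts.most_common(3):
    print(form, freq)
```

Because no spelling is normalised, a dialect form or a misprint is counted as its own type, which is the behaviour described above.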

The total number of word form tokens in the whole corpus is

The total number of separate word form types in the whole corpus is

The 20 most frequent raw word forms in the whole corpus are:

Rank  Word  Frequency   Rank  Word  Frequency
1   11
2   12
3   13
4   14
5   15
6   16
7   17
8   18
9   19
10   20

At the lower end of the frequency range, there is a high proportion (%) of word forms that occur three times or less, as shown in the table below:

Number of occurrences  Number of words   Percentage of all words
x 3
x 2
x 1

The bulk of these are single-occurrence words - which are mainly a combination of unusual dialect forms, mis-spelt words, English words, place names and "made up" words such as "AAAH" (many varieties of this!).
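A breakdown of this kind can be derived from any table of word form counts. The sketch below is illustrative only: it assumes the percentage is taken over the number of distinct word forms (types), which is one reading of "Percentage of all words", and the counts used are invented toy data rather than corpus figures.

```python
from collections import Counter


def low_frequency_summary(counts, max_occurrences=3):
    """Return (occurrences, number of word forms, percentage of all word forms)."""
    total_types = len(counts)
    # Map "number of occurrences" -> "how many word forms occur that often".
    spectrum = Counter(counts.values())
    rows = []
    for n in range(max_occurrences, 0, -1):
        num_forms = spectrum.get(n, 0)
        pct = 100.0 * num_forms / total_types if total_types else 0.0
        rows.append((n, num_forms, pct))
    return rows


# Invented toy counts, not taken from the corpus.
toy_counts = {"a": 5, "b": 3, "c": 2, "d": 1, "e": 1}
rows = low_frequency_summary(toy_counts)
for n, num_forms, pct in rows:
    print(f"x {n}: {num_forms} word forms ({pct:.1f}%)")
```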

The scanned text files were checked for accuracy using a range of processes:

  • the OCR program's reporting of possibly inaccurate words
  • detailed check of each individual text file by researchers at NFER Welsh Unit with the assistance of the Cysill 3 spelling checker
  • further checks of word list analyses for frequently occurring errors

Despite all these efforts, we are aware that there are still a number of words that may have been incorrectly encoded - our estimate is that this is under 1%. 

Analysis of data - Lemma Counts

These counts were created using Mike Scott's Wordsmith software, in conjunction with a (huge!) lemma list kindly provided by Bill Hicks from Bangor University.

The lemma list allows the different mutations of word forms and different verb forms to be grouped with their root.

An illustration of the type of lemma analysis is shown in the table below:







blaen[1838] blaenau[285] flaen[2137] flaenau[92] mlaen[306]
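Entries in this `form[count]` layout can be read back into a script and summed to give a combined count for the group. The sketch below assumes that all the forms on the line belong to one lemma and that each entry is a word form followed by its count in square brackets; the function name is hypothetical.

```python
import re

# One entry: a run of non-space characters followed by a count in brackets.
ENTRY_RE = re.compile(r"(\S+?)\[(\d+)\]")


def parse_lemma_line(line: str) -> dict:
    """Map each word form on the line to its count."""
    return {form: int(n) for form, n in ENTRY_RE.findall(line)}


forms = parse_lemma_line(
    "blaen[1838] blaenau[285] flaen[2137] flaenau[92] mlaen[306]"
)
lemma_total = sum(forms.values())  # combined count across all listed forms
print(lemma_total)  # 4658
```

Here the mutated forms (flaen, flaenau, mlaen) and the plural (blaenau) are counted together with blaen, which is the kind of grouping the lemma list makes possible.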

We would also have liked to check these lists manually and to "tag" the lemma lists and the original text files with their parts of speech, but unfortunately this was well beyond the time, funds and scope of this project.

We are very grateful to Bill for his advice and for providing us with the lemma list.

Although the lemma list couldn't possibly cope with every eventuality or do all the work of a researcher sitting for months (if not years) going through the lists manually, we could not have made any attempt at lemmatising without it.

We are also very grateful to Mike Scott for his advice and for his Wordsmith software, both of which were invaluable.

We hope that other researchers will be interested in using the corpus for more detailed study, and carrying out a deeper analysis than we have been able to do so far.

The analysis files are in Unicode text format and can be loaded into e.g. Word or Excel for viewing, printing or further analysis.

The files included are:

  • Word Counts (counts of raw word forms)
  • Lemma Counts with Analysis (counts of lemmas, plus inflected forms and mutations)

Both are provided for each National Curriculum Level and for the corpus as a whole.

