Corpus of Children's Literature in Welsh


What is "Ein Geiriau NI"? 

"Ein Geiriau NI" was funded by a grant kindly provided by the Welsh Assembly Government. It contains a Corpus of Children's Literature in Welsh, a sample of over 3,000,000 words. 

These have been taken from a wide and representative range of books written for pre-school, primary and secondary aged children.

"Ein Geiriau NI" also includes word frequency analyses based on the whole sample, and on the individual National Curriculum Level sets. 

Why was it produced? 

Since 1997, the City and County of Swansea Educational Psychology Team has developed the All Wales Reading Test with joint funding from all the LEAs in Wales and in cooperation with the Welsh Unit of the National Foundation for Educational Research. 

When developing the English Medium test forms, the team had access to "state of the art" electronic linguistic analysis, carried out by the Birmingham University Linguistics Department on a collection of children's texts. This helped to speed up the development process considerably. 

However, for the Welsh Medium forms a suitably large and representative collection of children's texts was not available in electronic format. "Ein Geiriau NI" was developed by the team in order to meet this need and also to provide a new resource that could prove valuable in many ways, including: 

  • helping authors and publishers to develop original Welsh literacy materials for children
  • supporting translators in producing materials which reflect natural use of Welsh
  • assisting researchers investigating the development and usage of the Welsh language
  • helping to target texts aimed at different key stage areas of the National Curriculum



How was it developed? 

Four major publishers were approached and gave permission for a representative selection of texts from their books to be used. 

The books were scanned by computer and the texts separated from the pictures, with errors reduced as far as possible using the correction feature of the scanning software. 

The computer text files were then checked individually by researchers and amended as necessary with the assistance of the Cysill 3 Spelling checker. 

The books themselves were then assessed by a team of four Welsh Medium Teachers and two Welsh Language specialists in order to estimate the National Curriculum level for each book and categorise it (e.g. Children's Novel).

The text files were then analysed with the assistance of specialist software provided by Bill Hicks (Bangor University) and Mike Scott (Liverpool University) to produce word frequency lists. 
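
As a rough illustration of what producing word frequency lists involves, the sketch below counts word forms across a whole sample and within each National Curriculum level. It is not the specialist software mentioned above: the function names, the tokenising rule and the sample sentences are all assumptions made for this example, and the sentences are invented rather than taken from the corpus.

```python
import re
from collections import Counter

# Assumed tokenising rule: a word is a run of Unicode letters, optionally
# joined by internal apostrophes (e.g. "mae'r", "i'r"). Real corpus tools
# may use different rules.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def tokenize(text):
    """Return the lower-cased word forms found in text."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


def frequency_lists(texts_by_level):
    """Count word forms overall and separately for each level.

    texts_by_level maps a (hypothetical) level label to a list of texts.
    """
    overall = Counter()
    per_level = {}
    for level, texts in texts_by_level.items():
        counts = Counter()
        for text in texts:
            counts.update(tokenize(text))
        per_level[level] = counts
        overall.update(counts)
    return overall, per_level


# Invented sample sentences, for illustration only.
sample = {
    1: ["Mae'r gath yn y tŷ.", "Mae'r ci yn y tŷ."],
    2: ["Aeth y plant i'r ysgol."],
}

overall, per_level = frequency_lists(sample)

# Print the three most frequent word forms in the whole sample.
for word, count in overall.most_common(3):
    print(word, count)
```

Lower-casing every word form before counting is a simplification made here so that sentence-initial capitals do not split the counts; a fuller analysis might treat proper nouns and mutated forms differently.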

Where can I find out more about "Ein Geiriau NI"?

Where can I find technical information about "Ein Geiriau NI" and download the files? 

NB Under the terms and conditions agreed with the publishers who kindly gave permission for the use of the texts, you may download these files for research purposes only. Reproducing the texts for commercial use is prohibited. 

What next?

This site now includes a searchable database of many of the books used in the corpus, including National Curriculum level and Interest Age. Teachers may find this useful when selecting suitable books for children.

A range of publishers have also kindly agreed to supply us with the new books they produce, which we will add to the corpus and the database on a regular basis.