Development of "Ein Geiriau NI"

The project ran from September 2003 to February 2005. It was co-ordinated by, and drew contributions from, members of the All Wales Reading Test Working Group: Educational Psychologists from the City and County of Swansea Educational Psychology Team and Robat Powell, head of the NFER Welsh Unit.

The team has considerable experience in the development of Welsh Language assessment materials for children, including the All Wales Reading Test.

Sourcing the texts

The aim was to develop as large a selection of children's literature as possible within the confines of the time and budget available.

Although there are a number of publishers of Welsh Medium materials for children, given:
- the time and resources available
- the need to negotiate agreeable terms and conditions for use of their materials
the group decided to focus on four major publishers and to use as many of their books as possible to gain a representative sample.

These publishers were then approached and gave permission for selections of texts from their books to be used.

This was on condition that, when the texts were made available to others, they would be used for research purposes only, and that reproduction of the texts for commercial purposes would be prohibited.

Some of these books were loaned to us by schools, but the majority were purchased from the publishers. 

Scanning the texts

The books were then scanned using the ABBYY FineReader Pro 5.0 OCR software. This was selected particularly because of the availability of a "plugin" to allow it to read Welsh characters.

The scanning was a very long and laborious process, particularly because of the number of illustrations in children's books and the variety of layouts and fonts etc.

Initial accuracy checks were made within the OCR software: any characters the program highlighted as possibly inaccurate were checked against the book.

The resulting files were then exported to Microsoft Word.

"Cleaning" the texts

These files were then further checked by researchers at NFER Welsh Unit with the assistance of the Cysill 3 spelling checker. If a word did not appear in the Cysill 3 dictionary, it was checked against the original text: if it matched, it was left "as is"; otherwise it was corrected.
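The checking step above can be sketched in Python. This is only an illustration: Cysill 3 is a separate program and is not called here, so a small hypothetical word list (KNOWN_WORDS) stands in for its dictionary, and the function name and the tokenising pattern are assumptions. The sketch lists the words a researcher would then compare against the printed book.

```python
import re

# Hypothetical stand-in for the spelling checker's dictionary.
KNOWN_WORDS = {"mae'r", "gath", "yn", "cysgu"}

# Runs of letters (accented Welsh vowels included), optionally joined by
# apostrophes so that forms such as "Mae'r" stay in one piece.
WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")

def flag_for_manual_check(text, known_words=KNOWN_WORDS):
    """Return words missing from the word list, in order of first appearance."""
    flagged = []
    seen = set()
    for match in WORD_RE.finditer(text):
        word = match.group(0)
        key = word.lower()
        if key not in known_words and key not in seen:
            seen.add(key)
            flagged.append(word)
    return flagged

# "cysgw" imitates a scanning error; "AAAH" and "ci" are genuine words
# that simply are not in the tiny word list above.
sample = "Mae'r gath yn cysgu. AAAH! Mae'r ci yn cysgw."
print(flag_for_manual_check(sample))
```

A flagged word is not necessarily wrong. As described above, it is corrected only if it differs from the original text.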

Given the size of the corpus and the variety of books used, a large number of words occur only once. These are mainly a combination of unusual dialect forms, English words, place names and "made up" words such as "AAAH" (many varieties of this!).

These were left as they appeared in the text and are reported in the analyses "as is".

Initial word frequency lists were then created and these lists were also checked for errors.
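As a rough illustration of what a word frequency list involves, the sketch below counts word forms in a short sample and picks out those that occur only once. It is not the software used in the project. The tokenising pattern, the lower-casing of words and the function names are all assumptions made for the example.

```python
import re
from collections import Counter

# Runs of letters, optionally joined by apostrophes (e.g. "Mae'r").
WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")

def word_frequencies(text):
    """Count each word form; lower-casing is an assumption of this sketch."""
    return Counter(m.group(0).lower() for m in WORD_RE.finditer(text))

def words_occurring_once(freqs):
    """Return, in sorted order, the word forms that appear exactly once."""
    return sorted(word for word, count in freqs.items() if count == 1)

sample = "Mae'r ci yn rhedeg. Mae'r gath yn cysgu. AAAH!"
freqs = word_frequencies(sample)

# Most frequent forms first, as in a standard frequency list.
for word, count in freqs.most_common():
    print(word, count)

print(words_occurring_once(freqs))
```

Scanning the once-only words is a quick way to spot likely errors, although many of them will be genuine rare forms.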

Corrections were then made either by hand in the individual files or, where an error occurred very frequently (e.g. an inappropriate character), with software developed in house.
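The in-house software is not described further, so the following is only a minimal sketch of how frequent unwanted characters could be replaced in bulk. The replacement table is hypothetical: it does not list the characters actually replaced in the project.

```python
# Hypothetical replacement table, for illustration only.
REPLACEMENTS = {
    "\u2018": "'",  # left single quotation mark -> plain apostrophe
    "\u2019": "'",  # right single quotation mark -> plain apostrophe
    "\u00ad": "",   # soft hyphen left over from a line break -> removed
}

def clean_text(text, replacements=REPLACEMENTS):
    """Apply every single-character replacement in one pass over the text."""
    table = str.maketrans(replacements)
    return text.translate(table)

print(clean_text("Mae\u2019r gath yn cys\u00adgu"))
```

A table of this kind suits only errors that are always wrong wherever they occur. Anything context dependent still needs checking by hand.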

Despite all these efforts we are aware that there are still a number of words that may have been incorrectly encoded - our estimate is that this is under 1%.


Allocating National Curriculum Levels and Categories

As we wished to subdivide the Corpus to allow us (and others) to look at the word frequencies and clusters at different age levels, we brought together a team of experts to assess each book and allocate it:
- a National Curriculum Level (in terms of reading difficulty)
- a main category (Children's Stories, Children's Novel, Literature)

The team was led by Robat Powell and also included four Welsh Medium Teachers and a Welsh Language Adviser.

There was an initial "twilight meeting" to agree the protocol for assessment, and the assessment was then carried out over two days.

Team members initially assessed the same books as a whole group. Each gave an opinion, the majority decision was recorded, and Robat moderated where necessary.

After gaining experience with this process and demonstrating a high level of observer agreement (>90%), team members worked in pairs and passed any borderline decisions or disagreements to another pair to reach a group decision.
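The text does not say how observer agreement was calculated. One common and simple measure is percentage agreement: the share of books to which two assessors gave the same level. The sketch below assumes that measure, and the ratings in it are invented for illustration.

```python
def percent_agreement(levels_a, levels_b):
    """Percentage of books to which two assessors gave the same level."""
    if len(levels_a) != len(levels_b):
        raise ValueError("both assessors must rate the same books")
    if not levels_a:
        raise ValueError("at least one book is required")
    matches = sum(a == b for a, b in zip(levels_a, levels_b))
    return 100.0 * matches / len(levels_a)

# Hypothetical National Curriculum Levels given to ten books.
assessor_1 = [1, 2, 2, 3, 3, 4, 4, 5, 5, 2]
assessor_2 = [1, 2, 2, 3, 3, 4, 4, 5, 5, 3]

# The assessors differ on the last book only: 9 matches out of 10.
print(percent_agreement(assessor_1, assessor_2))
```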

These decisions were then used when subdividing the Corpus into National Curriculum Levels.

Analysis of the texts

We also produced analyses for the whole Corpus and individual National Curriculum Levels, in the following formats:
- standard word frequency lists
- lemmatised word frequency lists

These sets were created using Mike Scott's Wordsmith software, in conjunction with a (huge!) lemma list kindly provided by Bill Hicks from Bangor University.

The lemma list allows the different mutations of word forms and different verb forms to be grouped with their root.
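To make the idea concrete, here is a small sketch of how a lemma list turns a standard frequency list into a lemmatised one. It does not reproduce Wordsmith or the format of the actual lemma list. The mapping below is a tiny hypothetical extract covering the mutated forms of "cath" and "tŷ", and the counts are invented.

```python
from collections import Counter

# Hypothetical extract of a lemma list: each mutated form maps to its root.
LEMMA_MAP = {
    "gath": "cath",    # soft mutation
    "chath": "cath",   # aspirate mutation
    "nghath": "cath",  # nasal mutation
    "dŷ": "tŷ",
    "thŷ": "tŷ",
    "nhŷ": "tŷ",
}

def lemmatise_frequencies(freqs, lemma_map=LEMMA_MAP):
    """Add the count of every word form to the count of its root."""
    lemma_freqs = Counter()
    for word, count in freqs.items():
        # Forms missing from the lemma list are kept as they are.
        lemma_freqs[lemma_map.get(word, word)] += count
    return lemma_freqs

# Invented counts standing in for a standard word frequency list.
standard = Counter({"cath": 3, "gath": 5, "chath": 1, "nghath": 1,
                    "tŷ": 4, "dŷ": 2, "aaah": 1})
print(lemmatise_frequencies(standard))
```

A simple lookup like this cannot separate forms that belong to more than one root, which is one reason a lemma list cannot replace manual checking.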

We would also have liked to "tag" the lemma lists and the original text files with their parts of speech, but unfortunately this was well beyond the time, funds and scope of this project.

We are very grateful to Bill for his advice and for providing us with the lemma list. Although it couldn't possibly cope with every eventuality or do all the work of a researcher sitting for months (if not years) going through the lists manually, we could not have made any attempt at lemmatising without it.

We are also very grateful to Mike Scott for his advice and his Wordsmith software which were also invaluable.

We hope that other researchers will be interested in using the Corpus for more detailed study, and carrying out a deeper analysis than we have been able to do so far.

In addition to the analyses, the text files are also provided. Under the terms of our agreement with the publishers, the texts have been split into several separate files. Some of these are very small, as the number of words in some books for younger children may be fewer than 100.

The text files are saved in Unicode format (rather than ASCII) which means that all the Welsh characters can be displayed using standard fonts (e.g. in Microsoft Word), rather than having to resort to specialised fonts.
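For anyone reading the files with their own software, the sketch below shows one way to open a Unicode text file in Python. The exact encoding of the files is not stated here, so the function guesses it from the byte order mark at the start of the file and otherwise assumes UTF-8. The function name and the sample file are hypothetical.

```python
import codecs
import os
import tempfile

def read_corpus_text(path):
    """Read a text file, choosing the decoder from its byte order mark (BOM)."""
    with open(path, "rb") as handle:
        raw = handle.read()
    if raw.startswith(codecs.BOM_UTF8):
        return raw.decode("utf-8-sig")  # strips the UTF-8 BOM
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return raw.decode("utf-16")     # BOM selects the byte order and is stripped
    return raw.decode("utf-8")          # assumption: no BOM means plain UTF-8

# Demonstration with a temporary file standing in for a corpus file.
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "sample.txt")  # hypothetical file name
    with open(sample_path, "w", encoding="utf-16") as handle:
        handle.write("Mae'r tŷ yn fawr.")
    print(read_corpus_text(sample_path))
```

If a file opens with odd symbols in place of accented letters, the wrong encoding has probably been assumed.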

The analysis files are also in Unicode format, and they can be loaded into packages such as Microsoft Excel for further investigation.

Technical information about the corpus and the data files is available here.