Cronfa Electroneg o Gymraeg (CEG)
A 1 million word lexical database and frequency count for Welsh
Ellis, N. C., O'Dochartaigh, C., Hicks, W., Morgan, M., & Laporte, N. (2001)
- Brief Summary
- Background
- File formats and Character coding conventions
- Description of the text files
- The Raw and Tagged Datafiles
- Data quality
- Counts of Raw Word Forms
- Lemma Counts with analyses of inflections and mutations
- Download Word Form files
- Contact Information
- Use of these Materials
Brief Summary
This is a word frequency analysis
of 1,079,032 words of written Welsh prose, based on 500 samples
of approximately 2000 words each, selected from a representative range
of text types to illustrate modern (mainly post 1970) Welsh prose writing.
It was conceived as providing a Welsh parallel to the Kucera and Francis
analysis for American English, and the LOB corpus for British English,
in the expectation that such an analysed corpus would provide research
tools for a number of academic disciplines: psychology and psycholinguistics,
child and second language acquisition, general linguistics, and the linguistics
of Modern Welsh, including literary analysis.
The sample included materials from the fields of novels and short stories, religious writing, children's literature both factual and fiction, non-fiction materials in the fields of education, science, business, leisure activities, etc., public lectures, newspapers and magazines, both national and local, reminiscences, academic writing, and general administrative materials (letters, reports, minutes of meetings).
The resultant corpus was analysed to produce frequency counts of words both in their raw form and as counts of lemmas where each token is demutated and tagged to its root. This analysis also derives basic information concerning the frequencies of different word classes, inflections, mutations, and other grammatical features.
Articles based on the use of the database should cite:
Ellis, N. C., O'Dochartaigh, C., Hicks, W., Morgan, M., & Laporte, N. (2001). Cronfa Electroneg o Gymraeg (CEG): A 1 million word lexical database and frequency count for Welsh. [On-line]
Available: www.bangor.ac.uk/canolfanbedwyr/ceg.php.en
Background
This project was funded for the academic year
1993-94 by a grant of £21K from the Higher Education Funding Council
for Wales to Ellis, O'Dochartaigh & Hicks of the Welsh IT Unit and
the School of Psychology, University of Wales, Bangor. The researchers
began work on the project in October 1993, and after the sample range had
been identified in collaboration with Professor Gwyn Thomas of the Department
of Welsh, proceeded to collect the required range of texts. The original
intention was that this range of materials would be acquired in an electronic
form from Welsh language publishers and other bodies, such as local
authorities, governmental organizations, and papurau bro (locally produced newspapers).
However, it proved to be impossible to collect the necessary breadth of
materials in an electronic form, primarily because at that time Welsh language
publishers did not generally keep computer-based archive copies of books
which they may have published using electronic means.
Under these circumstances, having acquired around 200 usable samples from various bodies, it was decided to input the remainder by using both typists and an OCR system. The task of checking such typed copy, and in particular of correcting the errors introduced by the OCR software, was carried out by the researcher, assisted by the on-going development of the Welsh spelling-checker, CySill. The additional costs of this work were borne by funding from the Welsh IT Unit at Bangor.
Where material was obtained directly from publishers or from individual authors, permission was sought for the data to be included in the project analysis, with the understanding that if they were ever to be made available to a wider audience, then a formal request would be made to the copyright holders for this use. Where samples were taken either by typing or by OCR from published works, formal permission for their use has not yet been requested, as samples of 2000 words were in most cases considered to fall under "fair-dealing" for academic research purposes under the Copyright Acts. Any future public use of these materials will require the formal permission of their copyright holders.
It was decided to use the analytical software for Welsh which had been developed for a Welsh language spelling checker, then under way in the School of Psychology for Bwrdd yr Iaith Gymraeg / The Welsh Language Board. This spelling checker in its improved form involved a set of lemmatization algorithms for handling the language in a computer environment, and it was felt that these programs could be adapted for lemmatizing the CEG text samples. The basic program for the spelling checker was modified to allow it to process and analyze the texts interactively. This required the ability to present the original text on screen for inspection by the researcher, and to offer interactive dialogue boxes to solve two fundamental problems with the software: the appearance of words or word forms which were not in the spelling checker's own dictionary, and the possibility of homographs. The latter difficulty was solved by arranging for the software to identify a lemma by stripping off a particular ending and/or by demutating a word, and then to continue trying possible endings and initial mutations in combination with other lemmas to check for possible homographs, effectively on the fly. Any such forms identified were presented on screen to the researcher, with the original text still visible, to allow an informed choice to be made between the possibilities. In a similar way, the appearance of an unrecognized word or word form generated a dialogue box to allow the researcher to enter such words into a user dictionary, as well as allowing the forms to be incorporated into the tagged files which were produced from each separate text sample.
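The demutation step described above can be sketched as follows. This is only an illustration under stated assumptions, not the CEG or CySill code: the rule table is the standard set of Welsh initial mutations written in reverse, and `REVERSE_RULES`, `demutation_candidates` and the toy lexicon are hypothetical names introduced here. Ending-stripping is left out.

```python
# Reverse mutation rules: (mutated onset, radical onset, mutation name).
# Assumption: standard Welsh initial mutations; not taken from the CEG software.
REVERSE_RULES = [
    ("ngh", "c", "trwynol"), ("mh", "p", "trwynol"), ("nh", "t", "trwynol"),
    ("ng", "g", "trwynol"), ("m", "b", "trwynol"), ("n", "d", "trwynol"),
    ("ph", "p", "llaes"), ("th", "t", "llaes"), ("ch", "c", "llaes"),
    ("dd", "d", "meddal"), ("b", "p", "meddal"), ("d", "t", "meddal"),
    ("g", "c", "meddal"), ("f", "b", "meddal"), ("f", "m", "meddal"),
    ("l", "ll", "meddal"), ("r", "rh", "meddal"),
    ("", "g", "meddal"),  # soft mutation deletes an initial g
    ("h", "", "h-provection"),
]


def demutation_candidates(form, lexicon):
    """Return every (radical form, mutation) reading of `form` found in `lexicon`.

    More than one result signals a possible homograph that would need
    a human decision.
    """
    results = []
    if form in lexicon:
        results.append((form, None))  # the form may simply be unmutated
    for surface, radical, mutation in REVERSE_RULES:
        if form.startswith(surface):
            candidate = radical + form[len(surface):]
            if candidate != form and candidate in lexicon:
                results.append((candidate, mutation))
    return results


# Toy lexicon, for illustration only.
lexicon = {"ceg", "rhodio", "gan", "can"}
print(demutation_candidates("ngheg", lexicon))  # one reading: ceg, nasal mutation
print(demutation_candidates("gan", lexicon))    # two readings: gan itself, or soft-mutated can
```

A form that yields more than one candidate, such as `gan` here, is the kind of case the interactive dialogue box was designed to resolve.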
The main researcher worked on 350 of the 500 samples, and a part-time researcher was employed through the Welsh IT Unit to analyze the other 150. The average time for the analysis of each sample was around 1 hour, though the need to read over and correct typed or OCR-scanned text raised this to around 2 hours per sample.
File formats and Character coding conventions
All files are Windows files with <CR><LF> used as the line separator.
Accents are placed after the vowel (+ = circumflex, % = dieresis, / = acute accent, \ = grave accent).
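As a rough illustration of this convention, the sketch below converts the ASCII accent markers to Unicode accented letters. The function name `decode_accents` is hypothetical, and the vowel set (a, e, i, o, u, w, y) is an assumption based on Welsh orthography. A marker is only converted when it directly follows a vowel, so a literal `/` or `+` after a vowel in the source text would be misread and needs manual checking.

```python
import re
import unicodedata

# Marker character -> Unicode combining accent (assumed mapping of the CEG convention).
COMBINING = {
    "+": "\u0302",   # circumflex
    "%": "\u0308",   # dieresis
    "/": "\u0301",   # acute
    "\\": "\u0300",  # grave
}

# A Welsh vowel immediately followed by one of the four marker characters.
_ACCENTED = re.compile(r"([aeiouwyAEIOUWY])([+%/\\])")


def decode_accents(text):
    """Replace vowel+marker pairs with the precomposed accented letter."""
    combined = _ACCENTED.sub(lambda m: m.group(1) + COMBINING[m.group(2)], text)
    # NFC merges base letter + combining accent into a single code point.
    return unicodedata.normalize("NFC", combined)


print(decode_accents("o+l"))  # ôl
print(decode_accents("a+"))   # â
```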
Description of the text files
Details of the 500 text samples are provided in the files below, which list file number, text category, title, author and date. The description data can be downloaded in the following formats:
The text category codes are as follows:
Welsh category | Code | English category |
Gwasg - Gwyddonol | G Gw | Press - Scientific |
Gwasg - Adroddiad | G A | Press - Report |
Gwasg - Golygyddol | G G | Press - Editorial |
Gwasg - Adolygiad | G Ad | Press - Review |
Gwasg - Llythyrau | G Ll | Press - Letters |
Plant - Ffeithiol | P Ff | Children - Factual |
Ysgrythurol | Y | Scriptural |
Bro a Bywyd Gwerin | B | Community Life |
Gweinyddol - Adroddiad | Gw Ad | Administrative - Report |
Gweinyddol - Llythyrau | Gw Ll | Administrative - Letters |
Gweinyddol - Cofnodion/cytundebau | Gw C | Administrative - Minutes/contracts |
Academaidd | A | Academic |
Hunangofiant / Cofiant / Dyddiaduron / Atgofion | H | Autobiography / Biography / Diaries / Reminiscences |
Sgyrsiau/pigion | S | Discussions/ Highlights |
Medrau a Diddordebau | M | Skills and Interests |
Rhyddiaith Ddychmygol | Rh Dd | Fiction |
Nofelau | N | Novels |
Straeon Byrion | SB | Short Stories |
Plant - Nofel | PN | Children's Novel |
Plant - Straeon | PS | Children's Stories |
Dyddiadur Dychmygol | D | Fictitious Diaries |
Ysgrifau | YS | Articles/ Essays |
The Raw and Tagged Datafiles
Most users will probably only want to
access the processed results - the frequency counts of word forms or lemmas
presented below. However, we also provide the original text samples as
ASCII files along with the 500 tagged files for those who need to find
words or constructions in their original context or for scholars who wish
to correct or take forward the analyses presented here.
The 500 original text samples, each of approximately 2000 words:
- Original ASCII files (zipped) (2.1Mb)
The 500 tagged files have the following format:
Lemma [tab] Raw word [tab] Part Of Speech [ [tab] Mutation - if present ] [tab] Line Number
Each line shows the lemmatized form, the original word, the part of speech, type of mutation if present, and the location of the word (sample number, sentence number within sample, word number within sentence). For verbal forms, a number is used with the lemma to show the particular morphographemic form appearing.
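A minimal sketch of reading one line of this format is given below. It assumes that the location field is written as `[sample.sentence.word]`, as in the illustration that follows, and that a verb-form number is attached to the lemma after a colon (for example `bod:3`). `TaggedToken` and `parse_tagged_line` are hypothetical names, and the example line is made up for illustration.

```python
from typing import NamedTuple, Optional


class TaggedToken(NamedTuple):
    lemma: str
    verb_form: Optional[str]  # e.g. "3"; kept as text because codes like "4.1" occur
    raw: str
    pos: str
    mutation: Optional[str]
    sample: int
    sentence: int
    word: int


def parse_tagged_line(line):
    """Split one tab-delimited tagged line into its fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) == 4:      # no mutation field present
        lemma, raw, pos, location = fields
        mutation = None
    elif len(fields) == 5:    # optional mutation field present
        lemma, raw, pos, mutation, location = fields
    else:
        raise ValueError(f"expected 4 or 5 tab-separated fields, got {len(fields)}")
    lemma, _, verb_form = lemma.partition(":")
    sample, sentence, word = (int(part) for part in location.strip("[]").split("."))
    return TaggedToken(lemma, verb_form or None, raw, pos, mutation,
                       sample, sentence, word)


# Made-up example line (not copied from the corpus).
print(parse_tagged_line("ceg\tgeg\tnf\tmeddal\t[1.2.3]\r\n"))
```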
Illustration of a sample sentence from a text follows:
a | part |
bod:3 | vbf |
hynny | DemPron |
'n | vbadj |
golygu | vb |
bod | vb | [74.2.7] |
y | [74.2.8] |
rhai | [74.2.9] |
dagreuol | [74.2.10] |
yn | [74.2.11] |
ein | [74.2.12] |
plith | [74.2.13] |
yn | [74.2.14] |
iachach | [74.2.15] |
na | [74.2.16] |
'r | [74.2.17] |
rhai | [74.2.18] |
sych | [74.2.19] |
? |
We believe this text corpus is of value for an analysis
of Welsh prose sentence patterns, for co-occurrence analyses of both individual
lemmas and grammatical parts of speech in running texts, and for further
linguistic analysis by specialist researchers in the field of Welsh syntax
and child language acquisition. However, researchers must take note of
some limitations in data quality, particularly regarding the accuracy of
some of the lemma tags which were prejudiced by word form homography -
these limitations are described below.
- All Tagged Files (zipped) (All fields are tab delimited) - 8 Mb
Data quality
We believe that the accuracy of the raw word
forms in the database and their counts is quite high. Whatever errors (spelling
or typographical) there were in the original samples will be carried over
to the corpus. We must surely have introduced and failed to detect some
additional errors in input, but we have tried hard to keep this number
very low.
Tag quality is a different matter. High homography rates, a template-matching
lemmatiser with a limited window and few rules, and the need for skilled
linguistic analysis combined to produce a non-trivial number of tagging errors.
A preliminary analysis of 5% of the corpus indicates an error rate of 4% +/- 3%.
These tagging errors are by no means distributed equally across the database.
For example, inaccuracies in the tagging of yn, bod/fod, and a, and more
generally of the high frequency closed class words, are much more common than
inaccuracies with the open class words. Thus while the token error rate is
perhaps 4%, the type error rate is much lower.
We do not have the resources to correct these miscodings.
As well as noting the errors on a print-out of the output files, it would
be necessary for any corrections to be written back to the files, and we
estimate that a detailed correction of the full set would require two years'
work. Having tried to raise these resources, and waited too long, we have
decided to release the database as it now stands - it is certainly better
than nothing.
Nonetheless, researchers must take note of these limitations in data quality, particularly regarding the accuracy of some of the lemma tags.
We believe the Counts of raw word forms to be highly accurate.
The Lemma Counts with analysis of inflections and mutations run at about 96% accuracy,
with most problems on the high frequency closed class words.
Processed Results:
Counts of Raw Word Forms
The word counts are based on the actual word forms occurring. These words include spellings which represent dialectal forms, informal spellings of Welsh forms (generally following the suggestions of Cymraeg Byw, though this is by no means a universally applied standard for informal writing), foreign words (particularly from English), as well as wrongly spelled Welsh words (that is, misprints in the original texts).
The total number of word form tokens in the corpus is 1,079,032.
The total number of separate word form types is 37,195.
The 50 most frequent raw word forms are:
55588 | yn | . | 3821 | cael |
45945 | y | . | 3754 | yw |
33327 | i | . | 3546 | wrth |
33231 | a | . | 3545 | ni |
32573 | 'r | . | 3463 | hyn |
26927 | o | . | 3023 | na |
15888 | ar | . | 2870 | o+l |
14990 | ei | . | 2721 | hynny |
14845 | 'n | . | 2646 | fe |
14523 | yr | . | 2613 | er |
11785 | ac | . | 2594 | neu |
9922 | oedd | . | 2585 | nid |
9338 | bod | . | 2542 | at |
9056 | mae | . | 2511 | sy |
7751 | am | . | 2417 | 'w |
7093 | wedi | . | 2401 | hi |
6118 | ond | . | 2360 | dim |
5568 | un | . | 2278 | mynd |
5415 | 'i | . | 2240 | byddai |
5294 | eu | . | 2160 | gyda |
4991 | gan | . | 2137 | yng |
4988 | fel | . | 2110 | iawn |
4578 | mewn | . | 2066 | pob |
4149 | a+ | . | 2065 | lle |
4142 | roedd | . | 2027 | pan |
At the other end of the frequency
range, there is a very long tail of single occurrence forms, with 44% of
the total entries falling into this group. Between them, the single,
double and triple occurrence words make up 64% of the total
number of separate words (37,195). As might be expected, a large number
of these very low frequency words consist of foreign borrowings, mis-spellings,
dialectal forms and other types of variant spellings, and numbers. In
most cases, the analysis program does distinguish between several of these
categories (mis-spellings, foreign words, informal spellings), but such
entries would require further checking if 100% accuracy were essential.
16,316 words with a single occurrence : | 44% of separate words |
5,013 words with two occurrences : | 13% of separate words |
2,644 words showing three occurrences: | 7% of separate words |
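The percentages above are shares of the 37,195 separate word forms. A small sketch of how such a frequency spectrum can be derived from a list of tokens, and how the reported shares follow from the counts, is given here. The function names are hypothetical and the four-token list is a toy input, not corpus data.

```python
from collections import Counter


def frequency_spectrum(tokens):
    """Return (count per word form, number of word forms per frequency)."""
    form_counts = Counter(tokens)
    spectrum = Counter(form_counts.values())
    return form_counts, spectrum


def share_of_types(types_with_frequency, total_types):
    """Percentage of separate word forms that occur with a given frequency."""
    return 100.0 * types_with_frequency / total_types


# Toy input: "yn" occurs twice, "y" and "i" once each.
form_counts, spectrum = frequency_spectrum(["yn", "y", "yn", "i"])
print(spectrum[1], spectrum[2])  # 2 forms occur once, 1 form occurs twice

# Figures reported in the table above.
TOTAL_TYPES = 37195
for n_types in (16316, 5013, 2644):
    print(n_types, round(share_of_types(n_types, TOTAL_TYPES)))
```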
Lemma Counts with analyses of inflections and mutations
The lemmatizing software was
used to demutate and uninflect word forms in order to track them back to
their lemma. Examples of the resulting lemma analysis are shown for
illustration in the table below:
ceg | 118 | ceg | n | 118 | ceg | 109 | nf | ceg | 22 | nf | |
cheg | 21 | nf | llaes | ||||||||
geg | 56 | nf | meddal | ||||||||
ngheg | 10 | nf | trwynol | ||||||||
cegau | 9 | npl | cegau | 9 | npl | ||||||
rhodio | 16 | rhodio | vb | 16 | rhodia | 2 | vbf | rhodia | 1 | vbf :3 | |
rodia | 1 | vbf :3 | meddal | ||||||||
rhodiai | 1 | vbf | rodiai | 1 | vbf :10 | meddal | |||||
rhodio | 12 | vb | rhodio | 7 | vb | ||||||
rodio | 5 | vb | meddal | ||||||||
rhodiwn | 1 | vbf | rhodiwn | 1 | vbf :4.1 |
The lemma ceg appears 118 times, exclusively as a noun. 109 of these occurrences are as the feminine singular noun (ceg) and 9 as the plural noun (cegau). As the singular noun it appears 22 times in unmutated form, 21 times with aspirate mutation (llaes), 56 times with soft mutation (meddal), and 10 times with nasal mutation (trwynol).
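The counts for ceg nest in the way the table suggests: surface forms sum to their inflected form, and inflected forms sum to the lemma. The sketch below rolls up the ceg rows from the table to show this. The tuple layout and the name `roll_up` are assumptions made for the illustration, not the column layout of the downloadable file.

```python
from collections import defaultdict

# (lemma, inflected form, surface form, part of speech, mutation, count),
# transcribed from the ceg rows of the table above.
CEG_ROWS = [
    ("ceg", "ceg", "ceg", "nf", None, 22),
    ("ceg", "ceg", "cheg", "nf", "llaes", 21),
    ("ceg", "ceg", "geg", "nf", "meddal", 56),
    ("ceg", "ceg", "ngheg", "nf", "trwynol", 10),
    ("ceg", "cegau", "cegau", "npl", None, 9),
]


def roll_up(rows):
    """Sum surface-form counts up to inflected-form and lemma level."""
    by_lemma = defaultdict(int)
    by_form = defaultdict(int)
    for lemma, form, _surface, _pos, _mutation, count in rows:
        by_lemma[lemma] += count
        by_form[(lemma, form)] += count
    return dict(by_lemma), dict(by_form)


by_lemma, by_form = roll_up(CEG_ROWS)
print(by_lemma["ceg"])            # 118
print(by_form[("ceg", "ceg")])    # 109
print(by_form[("ceg", "cegau")])  # 9
```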
The lemma rhodio
appeared 16 times, always as a verb. Two of these occurrences were as the
third person singular present (rhodia) (once in unmutated form and
once with soft mutation), 1 occurrence was as the third person singular
imperfect in soft mutated form (rodiai), 12 occurrences were as the verb
noun rhodio (7 times unmutated and 5 times with soft mutation),
and 1 was as the first person plural present tense (rhodiwn). There
are many verb forms for Welsh - the full list of verb form codes is shown
below.
Verb-form Codes
The table of verb form codes is shown below:
1 | af | present tense first person singular |
2 | i | present tense second person singular |
3 | a | present tense third person singular |
4 | wn | present tense first person plural |
5 | wch | present tense second person plural |
6 | ant | present tense third person plural |
7 | ir | present tense impersonal |
8 | it | imperfect tense second person singular |
9 | et | imperfect tense second person singular |
10 | ai | imperfect tense third person singular |
11 | em | imperfect tense first person plural |
12 | ech | imperfect tense second person plural |
13 | ent | imperfect tense third person plural |
14 | id | imperfect tense impersonal |
15 | ais | past tense first person singular |
16 | aist | past tense second person singular |
17 | odd | past tense third person singular |
18 | asom | past tense first person plural |
19 | asoch | past tense second person plural |
20 | asant | past tense third person plural |
21 | wyd | past tense impersonal |
22 | aswn | pluperfect first person singular |
23 | asit | pluperfect second person singular |
24 | aset | pluperfect second person singular |
25 | asai | pluperfect third person singular |
26 | asem | pluperfect first person plural |
27 | asech | pluperfect second person plural |
28 | asent | pluperfect third person plural |
29 | asid | pluperfect impersonal |
30 | ed | impersonal imperative |
31 | wyf | subjunctive first person singular |
32 | ych | subjunctive second person singular |
33 | o | subjunctive third person singular |
34 | om | subjunctive first person plural |
35 | och | subjunctive second person plural |
36 | ont | subjunctive third person plural |
37 | er | subjunctive impersonal |
38 | es | past tense first person singular |
39 | est | past tense second person singular |
40 | ith | Informal third person singular |
41 | iff | Informal Future third person singular |
42 | on | Informal Past third person plural |
43 | an | Informal Future third person plural |
The file, Lemma
Counts with Analysis, downloadable below, is tab-separated and can
be imported into Excel where it can be readily manipulated to provide a
wide range of analyses. One example, based on a sort of the final field
(mutation), generates the following results for initial
mutations.
Initial mutations
Welsh words can exhibit one of four types of morphophonemic initial mutation, and the occurrences and relative frequencies of such forms in the sample are:
Soft mutation (Treiglad Meddal) | 134,349 | 12.45% |
Spirant mutation (Treiglad Llaes) | 9,123 | 0.85% |
Nasal mutation (Treiglad Trwynol) | 5,667 | 0.53% |
h-provection | 1,990 | 0.19% |
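The percentages in this table are shares of all 1,079,032 word form tokens (for example, 134,349 / 1,079,032 is about 12.45%). A comparable tally could be sketched as below. Note the assumptions: this version reads lines in the tagged-file format (four fields, or five when a mutation is present) rather than the Lemma Counts file, and `mutation_shares` is a hypothetical name. The input lines are made up.

```python
from collections import Counter


def mutation_shares(tagged_lines):
    """Count tokens per mutation type and give each as a % of all tokens."""
    counts = Counter()
    total = 0
    for line in tagged_lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) not in (4, 5):
            continue              # skip lines that do not match the format
        total += 1
        if len(fields) == 5:      # the fourth field is the mutation, when present
            counts[fields[3]] += 1
    return {name: (n, round(100.0 * n / total, 2)) for name, n in counts.items()}


# Made-up lines for illustration only.
lines = [
    "ceg\tgeg\tnf\tmeddal\t[1.1.1]",
    "ceg\tceg\tnf\t[1.1.2]",
    "ceg\tcheg\tnf\tllaes\t[1.1.3]",
    "rhodio\trodio\tvb\tmeddal\t[1.1.4]",
]
print(mutation_shares(lines))
```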
- Zip file containing: (890Kb)
- Word Counts (freq) - Counts of raw word forms sorted in decreasing frequency
- Word Counts (alpha) - Counts of raw word forms sorted in alphabetic order
- Lemma Counts with Analysis - Counts of lemmas, plus inflected forms, parts of speech and mutations
Use of these Materials
These materials have been produced
on a small budget for academic research. You are welcome to use the materials
for any non-commercial purpose. We have produced these analyses in good
faith to the best of our abilities given the limited resources. As we have
described above, you should be aware that there are some inaccuracies in
the taggings. We bear no responsibility for any damaging consequences that
may result from these.
We welcome further research to extend or correct these linguistic descriptions.
Articles based on the use of the database should cite:
Ellis, N. C., O'Dochartaigh, C., Hicks, W., Morgan, M., & Laporte, N. (2001). Cronfa Electroneg o Gymraeg (CEG): A 1 million word lexical database and frequency count for Welsh. [On-line]
Available: www.bangor.ac.uk/canolfanbedwyr/ceg.php.en