LANGUAGE ENGINEERING AND THE PROCESSING OF SPECIALIST TERMINOLOGY

Khurshid Ahmad,

Department of Mathematical and Computing Sciences

University of Surrey

Guildford, Surrey

UNITED KINGDOM

ABSTRACT

Ready access to the specialist terminology of a domain is crucial both for translating texts and for writing them. There is always a lapse of a few years between the time a term is coined, or undergoes a meaning shift, and the time when it is entered, defined and elaborated in a paper or electronic dictionary. However, most neologisms and meaning shifts are discussed at length in a range of text types used by a domain community. It is possible to use corpus-linguistic methods and tools to extract terms semi-automatically in a matter of hours. The need for treating texts and terms on the same footing is also highlighted.

INTRODUCTION

Language Engineering is a curious term in that it implies a discipline on a par with Electrical Engineering, the planning, design and production of artefacts that generate or use electricity; with Civil Engineering, concerned with the life cycle of the built environment; and with Mechanical Engineering, a subject that deals with the analysis, design and production of mechanical artefacts. Each of these established branches of engineering draws on the more fundamental sciences such as physics, chemistry and mathematics. So does language engineering deal with the planning, design and production of linguistic artefacts? This question is in part rhetorical and in part expresses the curiosity naturally aroused by the claim that an elusive cognitive activity, an abstract notion, a little-understood aspect of human behaviour, can somehow be planned, produced and maintained. If one does not object on philosophical and ethical grounds, and accepts that there can indeed be 'les industries de la langue' (roughly, the language industry), then, of course, artefacts are planned, designed and manufactured, or engineered, in that industry. Language engineering, at least in the context of this paper, refers to the engineering and use of systems that help us to manage the building blocks of specialist texts: the terminology of the specialist domain. The management of terminology has consequences for the authoring and translation of documents of a specialism.

The multi-lingual world of the next century will depend crucially upon the accurate translation of a wide range of documents, including scientific and technical documents. The efficient, cost-effective and timely translation of science and technology texts offers substantial opportunities for the vendors of software systems that can take on this major intellectual challenge. This challenge is essentially to our ability to understand and articulate how knowledge is disseminated through the medium of text across linguistic and cultural divides.

Documentation and translation of specialist knowledge are complex psychological tasks, in which feats of cognitive and perceptual processing are performed. Specialist documentation, including learned journals and technical manuals, involves mapping percepts onto concepts and vice versa, describing new ideas as a cohesive whole that maintains continuity with old ideas, and making novel use of the lexical inventory of the documenter's language. Once the novel idea is established and is used either as a policy instrument or in the production and sale of goods and services, further texts become part of the textual archive of the specialist domain in which the idea was generated. These include texts used to advertise goods and services, texts governing the sale and use of those goods and services, and texts used to introduce the lay person and the novice expert to the novel idea, including popular science and technology texts and textbooks. There are six major text types used in any specialist domain: learned papers, advanced texts, manuals, advertisements, official documentation (including legislation and product information) and textbooks.

Translation of specialist documentation involves not only an understanding of the grammatical structure and the lexical inventory of the source and target languages, but also an understanding of how a text is created and will be used, and of how to communicate ideas that may be foreign to the culture of the reader. Translation is, it appears, a complex cognitive process involving the simultaneous and integrated execution of linguistic, iconic and symbolic representation tasks, and it involves the deployment of episodic and semantic memory.

The documentation of knowledge, particularly in an emergent domain such as genetic engineering, sociobiology, chaos theory or artificial intelligence, and the translation of texts in these domains, bring to the fore the terminology-related problems faced by documenters, translators and the readers of the documents. The problems involve neologisms, changes in the nuances of established words and phrases, and the abandonment of established terms. But the neologisms, the nuances and the censoring are all the work of the members of the domain community. These members report discoveries, disclaimers, criticisms, negations and affirmations principally through the medium of text. Members of the community who communicate well carefully coin, define and elaborate neologisms and established terms. And gifted colleagues across the linguistic divide use these terms, either in the original language or by coining, defining and elaborating equivalents in their own native language, again using text as the medium. The terms are eventually standardised, usually after a gap of at least one or two years, and made available more generally. The use of such data in a machine translation system has to wait almost the same length of time again, in that the terminology data has to be adapted to the data structures and retrieval algorithms used in computer-based dictionaries.

Exciting new developments in corpus linguistics, stochastic models of language, lexicography and terminology, together with advances in software engineering and artificial intelligence, can be harnessed for building computer programs that assist in technical and scientific writing and in the translation of such documents. These subjects, all with different overall goals and specific objectives, have one thing in common: each is slowly becoming aware that processing a text through a computer system, whether for documentation or for automatic translation, requires an understanding of how texts are organised at various levels of linguistic description. These levels include the lexical, syntactic, semantic, pragmatic and so on.

The machine translation community has focused on the noble task of understanding the structure of language whilst assuming that knowledge of the contents of a document to be translated, and of the entirety of the document production and comprehension process, does not really have an impact on the translation process itself. The result is that although we have a variety of sophisticated models of language in general, and of the English language in particular, these models cannot be operationalised in the production of machine translation systems. Worse still, one sees a whole new industry, the word-processing industry, emerging that neither pays attention to the complex and sometimes relevant models of syntax and lexica developed by the linguistics community nor produces software that has the robustness of other common-or-garden software systems. But these word-processing systems fulfil a need, very much as brute-force systems like SYSTRAN and METAL fulfil another.

In this paper we focus on what we believe to be one of the key issues in the automatic translation of specialist documentation: the continual provision of up-to-date and well-elaborated terminology for a given specialism. Note that the medium of text, in all its diverse typology, is important for disseminating specialist knowledge, know-how and techniques. Therefore, it is important to understand how specialist texts are organised and structured and, perhaps more importantly, how to use the texts as a source of relevant terminology, including neologisms, of the specialist domain.

The establishment of a collection of terms should take into account the life cycle of individual terms: their coinage ('birth'), description ('growth'), currency ('maturity') and obsolescence ('death'). This should be reflected in the methods by which a term is acquired (e.g. from experts or learned texts), represented through its grammatical characteristics, explicated with the help of definitions and (contemporary) illustrative usage, and deployed through the use of standardized terms, that is, by avoiding deprecated terms and colloquial synonyms and by using an adequate (possibly standardized) foreign-language equivalent. The data associated with acquisition, representation, explication and deployment have to be recorded and made wholly or partly available on demand. The so-called 'record format', a term widely used in the terminology literature, refers to the template used by terminologists to record the above-mentioned attributes of a term: though, it must be said, usually without reference to the term's life cycle or any overt reference to epistemological, linguistic or knowledge-processing considerations.
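By way of illustration, a minimal sketch of such a record format is given below as a Python data structure; the field names are purely illustrative assumptions and do not follow any particular term bank standard or the record format used at Surrey.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TermRecord:
    # Illustrative terminological record: the fields mirror the acquisition,
    # representation, explication and deployment attributes discussed above
    # (field names are hypothetical).
    term: str                                    # the headword itself
    domain: str                                  # specialist domain
    source: str                                  # acquisition: expert, learned text, corpus
    part_of_speech: Optional[str] = None         # representation: grammatical category
    definition: Optional[str] = None             # explication: a definition
    contexts: List[str] = field(default_factory=list)        # explication: illustrative usage
    status: str = "candidate"                    # deployment: candidate / standardized / deprecated
    foreign_equivalents: dict = field(default_factory=dict)  # deployment: e.g. {"de": "..."}

record = TermRecord(
    term="catalytic converter",
    domain="automotive engineering",
    source="Surrey automotive engineering corpus",
    part_of_speech="noun",
    foreign_equivalents={"de": "Katalysator"},
)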

2. WHAT IS A TERM AND WHERE TO FIND IT?

Specialist domains of knowledge are differentiated by their terminologies: the words and phrases used for the dissemination of knowledge amongst the community closely identified with each domain. The community includes experts, students, administrators, technical authors, terminologists and translators. Indeed, an analysis of a corpus of written texts, or of spoken discourse, emanating from each community reveals the existence of a special language for each domain - a language for special purposes (LSP). Each domain LSP, although embedded firmly in the general language of the community, the so-called language for general purposes (LGP), comprises not only idiosyncratic words and phrases - the terms - but also the preferred use of certain syntactic constructs and a restricted semantics. Each domain LSP evolves to reduce the ambiguity inherent in LGP for the efficient and safe dissemination of specialist knowledge. This evolution also reflects the state of the specialist domain: new discoveries; novel dichotomies; unresolved and unexplained phenomena.

The number of terms which constitute the LSP vocabulary of a domain varies from 2,000 to 2,000,000. The attributes of a given term vary too: in conventional print dictionaries a term headword may be accompanied by a definition, grammatical category, bibliographic reference and possibly illustrative usage; in a typical term bank there may be as few as 40 attributes or as many as 76. These statistics are based on the typical contents of an LSP dictionary or the number of terms in a term bank (Table 1):

Table 1. Details of some well-known LSP dictionaries and terminology data banks


LSP Dictionaries
Title (Year of Publication)                                Approx. No. of Terms
The Penguin Dictionary of Psychology (1985)                        17,000
The Cambridge Illustrated Thesaurus of Computing Terms             10,000
A Dictionary of Linguistics and Phonetics (1989)                    2,000

Multilingual Terminology Data Banks
University of Surrey's Catalytic Converter Term Bank               12,000
  (English, German, Spanish)
Canadian Government's NORMATERM (French-English; Tech.)            47,000
Siemens GmbH's TEAM (English, German + 6 others;                2,000,000
  Elec./Electronic Eng.)



The terms used by any subject discipline are used more frequently, and sometimes exclusively, in that particular discipline. The categorisation of terms is usually carried out by postulating certain primitive, basic or fundamental irreducible concepts that characterise a specialism: for instance, elementary particles in physics, the kingdoms, families, genera and differentiae used in zoology, the nucleic acids RNA and DNA in molecular biology, text types in literary criticism, and so on. These postulated entities, usually abstract, are then used to make up hierarchies, and the nodes in the hierarchies serve as nuclei around which clusters of terms may form. Some disciplines have partonomies, or heterarchies, in which the relationship between parts and a postulated whole is described. Such heterarchies are usually based on analogies with the organisation of the human body and the functions of its parts. These hierarchies and partonomies are used to build up an abstract structure, usually called the conceptual structure of the specialism, which is in turn used to organise terms in a terminology data bank or in a thesaural or conceptual dictionary.

Philosophically, this approach subscribes to a Platonist or neo-Platonist view of the world: the abstract human mind in touch with an equally abstract external world, a world of ideas and fine thoughts accessible to the few experts who then pass this 'knowledge' on to the rest of us. The criticisms of a Platonist approach to knowledge and its growth notwithstanding, the concept-based organisation of terms provides a useful framework for building computer-based term banks. Indeed, the International Standards Organisation (ISO) and other bodies prefer a concept-based and normative approach to terminology. The only caveat here is that, in order to map this abstract notion of concepts, one needs data structures (on a computer system) that can cope with the complex and interconnected nature of these abstract concepts. In short, what is required is a knowledge representation formalism which is semantically and epistemologically well grounded. But that is another debate. The question we were addressing was how to extract terms from the domain archives: a concept-based categorisation of terms is excellent for storing and retrieving terms after they have been extracted and elaborated, but it does not directly help in identifying and elaborating the terms of the domain.

SENSE RELATED TERMINOLOGY

If one looks at specialist texts, it is not difficult to see that certain nominals are used with frequencies that can only be matched by the so-called closed-class words in general language: the hundred most frequent words in the 20-million-word corpus of contemporary British and American English texts created by the dictionary publishers Longman are closed-class words, and the same is true of other contemporary corpora of British English. However, a corpus of specialist texts will have open-class nominals amongst the fifty most frequent (single) words.

3. QUIRKS IN SCIENTIFIC TEXT CORPORA AND AUTOMATIC TERMINOLOGY EXTRACTION

The terminology of a given domain reflects the linguistic preferences of the early pioneers and of the later 'revolutionary scientists' (in the Kuhnian paradigm). The 19th-century specialisms (mathematics, biology, chemistry, geology) borrowed heavily from Greek and Latin; the early 20th-century specialisms (engineering, modern physics, genetics and psychology) show heavy borrowing from English and German; the later 20th-century specialisms, Intelligent Robotics and Factory Automation, show the increasing influence of Japanese.

For instance, the term atom was coined by the Greeks to talk about the indivisible and invisible building blocks of matter. The term was modified slightly by the British chemist John Dalton, who argued that the atom could not be divided by 'mechanical means'; but the meaning of the term atom was shifted quite dramatically by Ernest Rutherford and Niels Bohr, who postulated not only that the atom had an internal structure but that this structure can break down (cf. radioactivity). The term was still retained when Albert Einstein, Robert Oppenheimer and Enrico Fermi showed that not only could the atom be split but that its nucleus could be fissioned by bombarding it with one of the atom's constituents - the neutron, a 'fundamental particle' that itself decays! And in some cases a term is discarded altogether: no modern-day chemist will talk about phlogiston and its role in combustion, but will use the term oxygen; the dominance of Chomsky and his colleagues in linguistics effectively persuaded many others to discard terms like behaviour.

The growth of knowledge is always characterised by the enlargement of the community to include others of different nationalities (cf. Bohr from Denmark working with Rutherford in England, and Oppenheimer working with Fermi from Italy and Szilard from Hungary). This transnational cooperation renders the community multilingual, assuming the enlarged community is linguistically diverse. The consequence of this diversity is that terms need to be elaborated in at least two languages so that knowledge can be safely and accurately disseminated across the linguistic barrier. Here again we observe the lexicographical time lag between the first usage of terms and their appearance in term banks and specialist dictionaries.

The imaginative leaps of the scientists and the dexterous and deft moves of the technologists are not only manifested in their novel concepts and their pinpointing of dichotomies, but also in their writing. If we compare the relative frequency of the first six open-class words from the 25 most frequent words in the University of Surrey's (predominantly British English) automotive engineering corpus with their relative frequency in the LOB Corpus, we find that the coefficient of relative frequency is some guide to the quantitative differences between special-language texts and general-language texts. The very high values of this coefficient for some words indicate that these words are used almost exclusively in, say, automotive engineering. The high frequency of open-class words gives specialist texts a rather different texture from general-language texts: the use of these words makes the text rather opaque to the general public; a kind of weirdness is introduced by the use of these open-class words.

The high-frequency open-class words in Table 2a have a large coefficient of relative frequency, ranging between 16 and infinity. Terms such as emission and catalyst, together with related terms like autocatalyst, converter and hydrocarbon(s), have zero frequency in the LOB Corpus but a finite frequency in the automotive engineering corpus; hence the coefficient of weirdness for these terms, when computed by comparison against the LOB Corpus, is infinity.

Table 2a. The preponderance of open-class words in special-language literature.

               Surrey Auto. Engineering     Lancaster-Oslo/Bergen       Coefficient
               corpus (369,751)             (LOB) corpus (1,013,737)    of Weirdness
Word           Absolute     Relative        Absolute     Relative
               Freq. (a)    Freq. (b)       Freq. (c)    Freq. (d)      (b/d)

autocatalyst        27        0.01               0         0.00         Infinity
car              1,790        0.48             272         0.03         17.8895
catalyst         1,700        0.46               0         0.00         Infinity
control          1,517        0.41             199         0.02         20.8860
emission         2,194        0.59               0         0.00         Infinity
engine           2,083        0.56              70         0.01         81.0990
hydrocarbon        140        0.04               0         0.00         Infinity
hydrocarbons       290        0.08               0         0.00         Infinity
system           1,795        0.48             298         0.03         16.3286
vehicle          1,884        0.51              20         0.00         258.5029


(Figures in columns 'b' and 'd' have been rounded up)

In contrast, for the most frequently occurring closed-class words, like the, of, and, to, a and in, which together comprise just under 20% of both the LOB Corpus and the automotive engineering corpus, the coefficient of relative frequency is close to unity (Table 2b). For other closed-class words, like we, what and would, the coefficient of relative frequency is far less than unity: it appears that scientists in particular, and specialists in general, tend to 'suppress' the use of certain closed-class words (Table 2b). This suppression is as much an idiosyncrasy of specialist texts as is the preponderance of nominals in such texts: a kind of weirdness, a departure from the norm, a departure from the general language of everyday usage.

Table 2b. The 'suppression' of the closed-class words in special-language literature.

               Surrey Automotive            Lancaster-Oslo/Bergen       Coefficient
               Engineering corpus           (LOB) corpus (1,013,737)    of Weirdness
               (369,751)
Word           Absolute     Relative        Absolute     Relative
               Freq. (a)    Freq. (b)       Freq. (c)    Freq. (d)      (b/d)

the             26,634        7.15          68,351         6.74         1.0604
of              12,434        3.34          35,745         3.53         0.9472
and              8,792        2.36          27,873         2.75         0.8583
to               8,319        2.23          26,781         2.64         0.8441
a                7,100        1.92          22,647         2.23         0.8610
in               7,750        2.10          21,248         2.10         1.0000
about              431        0.12           1,898         0.19         0.6409
we                 278        0.07           3,128         0.31         0.2269
what               171        0.05           1,925         0.19         0.2633
would              464        0.12           2,799         0.28         0.4346


(Figures in columns 'b' and 'd' have been rounded up)
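The coefficient of weirdness used in Tables 2a and 2b is simply the ratio of a word's relative frequency in the specialist corpus to its relative frequency in the general-language corpus. The following minimal sketch shows the calculation; the corpus sizes and frequencies are those quoted above, and small differences from the published coefficients may arise from rounding.

import math

def weirdness(freq_special, size_special, freq_general, size_general):
    # Coefficient of weirdness: relative frequency in the specialist corpus
    # divided by relative frequency in the general-language corpus.
    # A word absent from the general corpus gets an infinite coefficient.
    if freq_general == 0:
        return math.inf
    return (freq_special / size_special) / (freq_general / size_general)

# Figures from Table 2a (automotive corpus: 369,751 tokens; LOB: 1,013,737 tokens)
print(weirdness(2083, 369751, 70, 1013737))   # 'engine'   -> roughly 81
print(weirdness(1884, 369751, 20, 1013737))   # 'vehicle'  -> roughly 258
print(weirdness(2194, 369751, 0, 1013737))    # 'emission' -> inf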

A text-based approach to terminology can help a terminologist not only in semi-automatically identifying terms, through the computation of the coefficient of weirdness, but also by making available other kinds of text-derived data. By viewing text fragments containing keywords in context (KWIC), a terminologist can deduce a variety of syntactic, semantic and pragmatic details far more easily than by manual scanning, introspection or the interrogation of domain experts.
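A keyword-in-context display of the kind mentioned above can be produced with a few lines of code. The sketch below is illustrative only and makes no claim to reproduce System Quirk's concordancer.

import re

def kwic(text, keyword, width=30):
    # Print each occurrence of `keyword` with `width` characters of left
    # and right context - a rudimentary keyword-in-context display.
    for match in re.finditer(r'\b' + re.escape(keyword) + r'\b', text, re.IGNORECASE):
        start, end = match.start(), match.end()
        left = text[max(0, start - width):start].rjust(width)
        right = text[end:end + width].ljust(width)
        print(f"{left} [{text[start:end]}] {right}")

sample = ("The catalytic converter reduces emission levels; the converter "
          "is mounted close to the engine so that the catalyst reaches "
          "operating temperature quickly.")
kwic(sample, "converter")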

Using such methods we have been able to create term banks in at least ten different subject fields, ranging from drug addiction to automotive engineering, and from information technology, including artificial intelligence, to environmental protection. This work has been supported by 'System Quirk' (Holmes-Higgin & Ahmad 1992), an intelligent terminology and lexical development system.

SYSTEM QUIRK: A TERMINOLOGY AND LEXICOGRAPHY SUPPORT SYSTEM

Corpus-based terminology and lexicography

System Quirk is essentially an integrated set of programs, or software 'tools', for examining and extracting relevant material from evidence sources, such as an organised special-language text corpus, and for creating, deleting, modifying and maintaining a reference source, such as the terms in a term bank. There are tools for dealing with each phase of a term's life cycle.

The literature in corpus linguistics, particularly in corpus-oriented lexicography, contains descriptions of software tools that are used for gathering data about lexicogrammatical properties of words. Leech has discussed the need for developing at least three different types of software tools that comprise a 'sophisticated computational environment' for retrieving data from a corpus and for processing linguistically the corpus itself. Leech's specification includes (i) general-purpose data retrieval tools, (ii) tools to facilitate corpus annotations at various levels and (iii) tools to provide interchange of information between corpora and lexical and grammatical databases (1991:22-23).

System Quirk contains general-purpose data retrieval tools and tools for the exchange of information between corpora and lexical (and terminology) databases. It does not contain any corpus annotation tools, but it can import and export texts encoded in SGML and terminology in a number of terminology interchange formats. In addition to the text (and term) analysis tools and the corpus and term bank organisation tools (see Table 3 for details), System Quirk contains so-called visualisation tools. These can be used for elaborating a term, for selectively browsing a text corpus or corpora, and for visualising the inter-relationships between terms. The visualisation tool has some facility for deducing new facts from old, through the use of knowledge representation formalisms, and for identifying semantic relations based on linguistic cues. Table 3 below shows the functional characteristics of the tools.

Table 3: Functional characteristics of the System Quirk toolbox


Organisational Tools
  Corpus Organisation:        Classification and representation of full texts;
                              organisation along pragmatic lines; SGML mark-up.
  Term Bank Organisation:     Creation, maintenance and quality control of term
                              banks; accessing other term banks; TIF mark-up.

Analysis Tools
  Text Analysis:              Concordance, collocation, statistical analysis;
                              term identification.
  Lexica/Term Analysis:       Relationships with other lexical items; foreign
                              language equivalents.

Visualisation Tools
  Selective Explication:      Access within and across corpora; goal-oriented
                              browsing; selectional constraints on fragments.
  Illustrative Explication:   Selection of illustrative text fragments
                              (contextual examples); use of semantic nets for
                              illustrating inter-term relations; publishing tools.



Text Analysis and Term Identification

A methodology was developed at the University of Surrey, under the aegis of the Translator's Workbench Project (Phase I: 1989-92; Phase II: 1992-94), for semi-automatically extracting candidate terms, and the associated explicational data, from a corpus of specialist texts. The methodology, that is the method, tools and techniques, involves a three-phase examination of domain texts and of extant terminology data banks.

The first phase involves the organisation of a text corpus and the setting up of a terminology data bank if none exists, or the addition of texts and terms to existing text corpora and term banks. System Quirk allows the creation and maintenance of text corpora along pragmatic, or use-oriented, lines, and the texts can be examined according to their pragmatic attributes: text type, specialist domain and sub-domain, length, publication date, the gender of the author, his or her first language and so on. All or some of the texts can be examined using the corpus management facilities in System Quirk. In a similar manner, the terms of a given domain can be created, browsed, updated, modified and deleted along pragmatic lines.
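A minimal sketch of how texts might be organised along such pragmatic lines is given below; the attribute names are illustrative assumptions and do not reproduce System Quirk's internal representation.

from dataclasses import dataclass

@dataclass
class CorpusText:
    # A text together with the pragmatic attributes used to organise the corpus.
    title: str
    text_type: str               # e.g. "learned paper", "manual", "advertisement"
    domain: str                  # e.g. "automotive engineering"
    year: int
    author_first_language: str
    body: str

corpus = [
    CorpusText("Exhaust after-treatment", "learned paper",
               "automotive engineering", 1991, "English", "..."),
    CorpusText("Fitting instructions", "manual",
               "automotive engineering", 1990, "German", "..."),
]

# Select a sub-corpus along pragmatic lines, e.g. manuals only
manuals = [t for t in corpus if t.text_type == "manual"]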

The second phase is the analysis phase, which is executed using the corpus and term bank analysis tools. Text analysis involves the analysis of individual texts in the corpora and the presentation of the results text by text or aggregated over corpora or sub-corpora. Additionally, System Quirk allows a contrastive analysis whereby texts in a specialist corpus, or rather the results of the analysis of those texts, are compared with a general-language corpus of roughly the same vintage. The analysis of individual texts can be performed in four different ways:

(a) concordance: an alphabetical list of all the words in a text shown together with their context and reference to line in source text;

(b) collocation: a list of the co-occurrences of specified terms within sentence boundaries;

(c) wordlist: an alphabetical or frequency-sorted list of words;

and,

(d) word index: as wordlist, but with references to lines in the source texts.

These operations can be refined and customised by a series of further options (a detailed discussion is presented in Ahmad and Rogers, forthcoming). A brief sketch of the collocation operation (b) follows.
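The sketch below counts the co-occurrences of a specified term with other words within sentence boundaries; splitting sentences on terminal punctuation is a simplifying assumption, not System Quirk's actual procedure.

import re
from collections import Counter

def collocates(text, term):
    # Count words co-occurring with `term` within the same sentence.
    counts = Counter()
    for sentence in re.split(r'[.!?]+', text):     # naive sentence split
        words = re.findall(r"[A-Za-z']+", sentence.lower())
        if term in words:
            counts.update(w for w in words if w != term)
    return counts

sample = ("The catalyst converts hydrocarbons. The catalyst needs heat. "
          "Engines emit hydrocarbons.")
print(collocates(sample, "catalyst").most_common(3))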

Contrastive analysis of specialist texts involves the following operations (a worked sketch follows the list):

(e) access a representative corpus of general language and compute the relative frequencies of the words in the corpus;

(f) generate wordlists (as in item c above) of specialist text(s) in the text corpora and compute relative frequencies of these words;

(g) calculate the coefficient of weirdness of individual words by dividing their relative frequency in the specialist text corpora by the corresponding relative frequency in general language;

and,

(h) generate word indexes together with the coefficient of weirdness, flagging those words with a high coefficient of weirdness as candidate terms.
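A compact sketch of steps (e) to (h) is given below. The general-language word list is toy data (a few figures taken from Tables 2a and 2b, the rest invented for illustration), and the threshold of 10 for flagging candidates is an arbitrary illustrative choice; with such a tiny sample some ordinary words are inevitably flagged as well.

import math
import re
from collections import Counter

def wordlist(text):
    # Step (f): frequency list of word forms in a text.
    return Counter(re.findall(r"[A-Za-z']+", text.lower()))

def flag_candidates(special_text, general_counts, general_size, threshold=10.0):
    # Steps (e)-(h): compare relative frequencies and flag candidate terms.
    special_counts = wordlist(special_text)
    special_size = sum(special_counts.values())
    candidates = {}
    for word, freq in special_counts.items():
        rel_special = freq / special_size
        rel_general = general_counts.get(word, 0) / general_size
        w = math.inf if rel_general == 0 else rel_special / rel_general
        if w >= threshold:
            candidates[word] = w
    return candidates

# Toy general-language word list standing in for the LOB wordlist of step (e)
general = Counter({"the": 68351, "of": 35745, "in": 21248, "is": 12000,
                   "by": 10000, "from": 9000, "engine": 70, "car": 272})
general_size = 1013737

text = "The emission from the engine is reduced by the catalyst in the engine."
print(flag_candidates(text, general, general_size))
# flags 'engine' and the words absent from the toy list ('emission',
# 'catalyst', 'reduced'); 'reduced' is noise from the tiny sample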

Fig. 2. Typical results from an information technology corpus showing all single word forms with a co-efficient of weirdness in excess of 10,000.

Candidate single-word and compound terms can also be extracted by System Quirk using the heuristic that candidate terms are the words occurring between any permutation of high-frequency general-language words and punctuation marks. The identification of compound terms can also be pursued by examining high-frequency collocates, excluding those that contain closed-class words, or by ignoring the closed-class words within them.
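The following is a minimal sketch of that heuristic; the stopword list is a tiny illustrative stand-in for a genuine high-frequency general-language word list, so some noise is to be expected.

import re

# Tiny illustrative stopword list; in practice this would be the set of
# high-frequency words drawn from a general-language corpus such as LOB.
STOPWORDS = {"the", "of", "and", "to", "a", "in", "is", "by", "from", "for", "with"}

def candidate_terms(text):
    # Collect word sequences bounded by stopwords or punctuation marks.
    candidates, current = [], []
    for token in re.findall(r"[A-Za-z-]+|[^\w\s]", text):
        if token.lower() in STOPWORDS or not token[0].isalpha():
            if current:
                candidates.append(" ".join(current))
                current = []
        else:
            current.append(token.lower())
    if current:
        candidates.append(" ".join(current))
    return candidates

print(candidate_terms("The emission of hydrocarbons from the engine is "
                      "controlled by the catalytic converter."))
# -> ['emission', 'hydrocarbons', 'engine', 'controlled', 'catalytic converter']
# ('controlled' is noise arising from the tiny stopword list)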

The third phase of the methodology involves the visualisation of the text corpora and terminology data banks, whereby a term is elaborated by browsing a text corpus or corpora to extract illustrative fragments of text that contain candidate or actual terms of a domain. Terminographical evidence regarding the existence of terms across or within specific text types can also be investigated using System Quirk. Furthermore, the System allows the creation of semantic networks - networks of arcs and nodes, where the nodes correspond to individual terms and the arcs are labelled with lexico-semantic relations. These relations include hyponymy, meronymy, causality, instrumental, material and a host of other semantic relations. The semantic networks, once developed, can be used for inferring relations that may not be apparent (cf. if a is related to b and b is related to c, then one can infer that a is related to c; this is currently carried out manually amongst term bank builders). The visualisation tasks can be carried out for individual terms or for groups of terms at a time. The variant of semantic networks used in System Quirk is the language-informed, theoretically well-grounded conceptual graphs due to John Sowa (1984). The conceptual graphs are governed by a graph grammar and the restricted syntax worked out by Sowa.
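The inference step mentioned above can be illustrated with a toy labelled network; this sketch is not Sowa's conceptual graph formalism, merely a set of relation triples with a transitive closure over one relation, and the terms and relations are invented for illustration.

# Toy labelled semantic network: (term, relation, term) triples.
relations = {
    ("autocatalyst", "is_a", "catalytic converter"),
    ("catalytic converter", "is_a", "emission control device"),
    ("catalytic converter", "part_of", "exhaust system"),
}

def infer_transitive(triples, relation="is_a"):
    # If a R b and b R c hold for the given relation, infer a R c.
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        for (a, r1, b) in list(inferred):
            for (b2, r2, c) in list(inferred):
                if r1 == r2 == relation and b == b2 and (a, relation, c) not in inferred:
                    inferred.add((a, relation, c))
                    changed = True
    return inferred - set(triples)

print(infer_transitive(relations))
# -> {('autocatalyst', 'is_a', 'emission control device')}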

CONCLUSION

The aim of this paper was to present a synthesis of methods and techniques from corpus linguistics and from studies of specialist texts, with a view to grounding the notion of 'terminological evidence'. 'Evidence' includes lexical and other data which are available to terminographers from a variety of sources. Corpus-linguistic methods and techniques can be used to identify a term, its grammatical environment and some of its pragmatic features. We have established a methodology that demonstrates how the use of a large body of texts, or corpus, stored in electronic form is equally well motivated, if not better motivated, in terminology as in general-purpose lexicography.

REFERENCES