TOOLS, TRAINING AND RESEARCH


TERMINOLOGY MANAGEMENT AND EXTRACTION TOOLS


Introduction

"Terminology management", itself a neologism, was coined to emphasise the need for a methodology to collect, validate, organise, store, update, exchange and retrieve individual terms or sets of terms for a given discipline. This methodology is put into operation through the use of computer-based information management systems called terminology management systems (TMS).

Substantial activity can be noted in the field of terminology management systems (TMSs): as many as 60 such systems are reported in the literature. Despite the fact that many of these systems are university/laboratory prototypes and have yet to be marketed, there are still a number of good European products on the market which can already make a considerable contribution to the efficient management of terminology within and across institutions and linguistic boundaries.

During the investigation, a set of criteria, regarded on the basis of experience as crucial to the management of terminology, was established, including technical, conceptual, linguistic and commercial factors. These criteria can also be used as an evaluation metric to assess the price/performance of a TMS and to help terminology users to determine the relevance of a TMS to their own organisation. Based on this set of criteria, a representative selection of TMSs was analysed. The results indicate that current TMSs, while operative, can be substantially improved.

The POINTER study has identified a number of problems. First, terminology exchange across organisations and across languages. Second, problems related to validation and verification. Third, problems related to the user-interface, especially that of localisation and customisation. Fourth, problems in extracting terminology from text corpora. Fifth, the need for using better computing techniques for storing and retrieving terms, including multimodal methods and techniques. Finally, there is a need for more seamless integration of TMSs into standard applications in addition to the word processors already supported: database management and query systems, spreadsheets, WWW authoring tools and office automation concepts, such as voice recognition systems.

Of the solutions identified by the study, the most important is the urgent need to define, adopt and refine standards for dealing with different authoring systems, for marking up terminology data using SGML, and for encoding linguistic data. It is important for TMS developers to interact with other sectors of language engineering, particularly machine translation and information retrieval, and for the other sectors to use terminology systematically. Facilities for using terminology databases across hardware platforms and across linguistic and geographical boundaries will lead to the creation of a terminology market place; developments in local and global computing networks will lead to and support this development. TMS-based solutions for validating and verifying terminology depend upon the development of protocols for these tasks. In the meantime, however, it is important to use or develop tools for checking for duplications, tools for facilitating access for experts to termbases, and so on.
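
As a rough illustration of the duplicate-checking tools mentioned above, the following Python sketch groups termbase records whose terms coincide after simple normalisation. The record layout (tuples of entry id, language code and term) and the function names are assumptions made for this example; they do not describe any of the systems surveyed.

```python
from collections import defaultdict
import unicodedata


def normalise(term):
    """Fold case, unify Unicode composition and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", term)
    return " ".join(text.casefold().split())


def find_duplicates(records):
    """Return {(language, normalised term): [entry ids]} for repeated terms.

    `records` is an iterable of (entry_id, language, term) tuples; this
    layout is hypothetical and chosen only to keep the sketch short.
    """
    groups = defaultdict(list)
    for entry_id, language, term in records:
        groups[(language, normalise(term))].append(entry_id)
    # Keep only the keys that occur in more than one entry.
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


records = [
    (1, "en", "Term Bank"),
    (2, "en", "term  bank"),
    (3, "de", "Termbank"),
    (4, "en", "glossary"),
]
duplicates = find_duplicates(records)
```

Duplicates are detected per language, so identical strings in different languages are not reported. A production tool would probably also need to compare inflected forms and spelling variants, which this sketch does not attempt.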

The POINTER Consortium recommends the following:

The POINTER TMS Survey

TMS: Life-Cycle Support


Terminology management is unique in the sense that it can be viewed simultaneously as the management of an artefact, a collection of terms, and as an inquiry into the nature and function of human language, and the role these languages play in the promotion of sciences, arts, trade and commerce, sports and recreation to name but a few areas of human endeavour. TMSs are an essential part of a terminology infrastructure in that such systems have a strong utilitarian aspect, that is production and dissemination of terminology, and have an equally strong methodological aspect, grounded in semantics and pragmatics on the one hand and on the other in philosophy of science and in library science and information retrieval.

The POINTER study took a broader view of TMSs and notes that whilst the most important strategic function of a TMS is to store and to retrieve terms, it is nevertheless equally important to consider how these systems can be used in validating terms, in exchanging terms across organisations and across applications like word processing and MT systems, in extracting terms from text corpora, and in organising conceptual schemata for arranging terms. A number of currently available TMS products and prototypes incorporate these wider terminological services and these are the object of the study.

Description of TMSs analysed


Instead of providing an unstructured list of available terminology management systems, this chapter presents detailed descriptions of seven systems which are considered as prototypical for different levels of sophistication and complexity with regard to functionalities, types of users, software platforms, data model, and entry structure.

The order in which the systems are presented reflects the development from a relatively simple DOS-based system to complex, multi-functional systems which are available on different platforms. In all, seven terminology management systems were analysed with a view to assessing whether or not these exemplar systems can help in the execution of the three key phases of terminology management.

The management of terminology spans three interdependent phases, and each phase can be identified by the existence of an artefact:

  1. acquisition including elaboration and validation;
  2. creation of the termbase;
  3. dissemination and exchange;

Each of the seven systems is capable of executing some or all of the tasks encountered in each of the phases.


   Terminology Management System               Special Features            
    (Organisation and Country of                                           
              Origin)                                                      

DANTERM (Institute of Economic       First termbase, trilingual,           
Research, Denmark)                   innovative record format, DOS-based   
                                     and works in conjunction with         
                                     WordPerfect                           

MTX-Reference (Eurolux Computer,     Simple entry structure, mono- &       
Luxembourg/Lingua Tech, Provo Utah)  bilingual look-up for translators,    
                                     single user                           

Multiterm for Windows Professional   Multi-user, multi-lingual,            
(Trados, Germany)                    import-export facilities              

System Quirk (InKE, United Kingdom)  Multi-user, multi-lingual,            
                                     import-export facilities, UNIX and    
                                     PC-based versions,                    
                                     entity-relationship model, links to   
                                     lexica and corpora                    

TermISys (University of              Single-user, multilingual,            
Saarbrücken, Germany)                pre-defined data categories, Flat     
                                     data model                            

Termstar (Star GmbH, Böblingen,      Multi-user, multi-lingual, 4-5 user   
Germany)                             definable categories, language block  
                                     orientated                            

TransTerm toolbox (LRE TRANSTERM     Multi-user, multi-lingual; enables    
consortium, main developer:          the production of resources for       
GSI/ERLI; Aerospatiale; EDF,         Language Engineering products by      
France)                              connecting terminological resources   
                                     with general purpose lexica. This     
                                     TMS also features a                   
                                     user-parameterisable data integrity   
                                     checking mechanism                    



Table 3 : A List of TMSs Analysed and their Special Features

Evaluation Criteria


This analysis was carried out with the help of five key criteria. These criteria are based on the conceptual, technical and commercial aspects of terminology management systems. Specifically, they are: technical considerations; terminology dissemination and exchange; facilities for the creation of terminology databases; acquisition; and commercial aspects, including pricing, availability and maintenance. Table 4 elaborates these key criteria into the specific sub-criteria applied to each of the terminology management systems that were analysed.


Technical description      Hardware and software requirements,       
                           databases                                 

Dissemination and          User interface: installation; manuals;    
exchange                   help system; tutorial                     
                           Retrieval: look-up features; information  
                           selection; information views; security;   
                           Information exchange: printing; data      
                           import/export; interaction with other     
                           programs e.g. word processors,            
                           translator workbenches, MT programs and   
                           AI programs                               

Creation of the termbase   Terminological aspects: data management;  
                           entry model                               
                           Input of information: terminology         
                           extraction                                

Acquisition including      Text analysis, text summarisation,        
elaboration and            corpus management                         
validation                 Validation and control                    

Commercial aspects         Price, delivery, current customers,       
                           product age                               



Table 4 : Criteria for evaluating TMSs

Summary of the Analysis


Table 5 describes each of the seven TMSs in terms of the criteria defined in Table 4. Note that Table 5 is a summary of typical characteristics; a more detailed analysis of all the sub-criteria is given in Appendix 3 and is included in one of the POINTER deliverables.

Note that PC-based TMSs generally require 4Mb RAM whereas Sun-based TMSs require 16Mb RAM. The typical user of these systems is either a translator or a technical writer. Interestingly, the survey showed that whilst few TMS users have access to Unix-based systems, market demand for these more powerful systems is certainly growing.


Column groups: Technical (technical description); Dissemination and Exchange (user interface, retrieval, exchange); Creation of the Termbase (terminological aspects, input); Acquisition (tools); Commercial (aspects).

DANterm Base

Technical description      PC (640kB RAM), MS/PC-DOS
User interface             Trilingual database
Retrieval                  Indexed searching / free-text search
Exchange                   Printing/import and export of data
Terminological aspects     Dictionary creation
Input                      New records can be stored before
                           validation
Tools                      Users are advised to use Danterm
                           classification
Commercial aspects         DM 250

MTX Reference

Technical description      PC (15kb RAM), MS-DOS
User interface             Windows/DOS
Retrieval                  Look-up procedure searches all 3
                           dictionaries of a bookshelf
Exchange                   MTX Reference can work with word
                           processors for printing
Terminological aspects     Unlimited dictionaries can be created
Input                      Warns against multiple use of headwords
Tools                      Eurolux distributor is available
Commercial aspects         DM 150

Multiterm '95

Technical description      PC (4Mb RAM), MS-DOS/Windows
User interface             Available in German, English, French,
                           Catalan, and Spanish
Retrieval                  A fuzzy search feature helps to find
                           similar terms
Exchange                   The user can import data into MTW95
                           from an ASCII file
Terminological aspects     No limit on the number of databases
                           which can be created
Input                      MTW95 generates a warning if there is a
                           danger of repeating the entry
Tools                      There are several tools available which
                           are cut-down versions of MTW95
Commercial aspects         DM 1800 (single user)

System Quirk

Technical description      Professional/Personal: PC Windows;
                           Corporate: SUN Solaris
User interface             Customisable interface; documentation
                           in English
Retrieval                  The system has an extensive searching
                           mechanism
Exchange                   System Quirk allows for direct printing
Terminological aspects     The only restrictions are imposed by
                           the underlying database management
                           system
Input                      System Quirk can extract terminology
                           from texts in text and SGML format
Tools                      Text analysis, corpus management and
                           WWW access are among the tools
                           available
Commercial aspects         Price to be announced in FQ 1/96

TermISys

Technical description      PC Windows 3.x
User interface             German
Retrieval                  An entry can only be found if terms in
                           the source and target languages exist
Exchange                   Dictionaries can be directly printed
                           out with the print module
Terminological aspects     Terminology is stored in a single
                           database. Within this, an unlimited
                           number of dictionaries can be defined
Input                      Data can only be entered manually
Tools                      Print module only
Commercial aspects         DM 599

TermStar

Technical description      PC Windows
User interface             English, French and German
Retrieval                  Users can access databases via the
                           different index fields
Exchange                   All databases, a single database or
                           parts thereof can be printed
Terminological aspects     The size and maximum number of
                           terminology collections are limited by
                           the user's disk capacity
Input                      As a stand-alone, TermStar does not
                           offer any routines for extraction
Tools                      Transit and a read-only version of
                           TermStar
Commercial aspects         DM 1650 (full), DM 390 (read-only)

Transterm Toolbox

Technical description      SUN Solaris 2, MOTIF and ObjectStore
User interface             Customisable interface; documentation
                           in French
Retrieval                  Information can be accessed via term,
                           concept or container names with case
                           sensitivity and regular expressions
Exchange                   Printing is via an exported SGML file
Terminological aspects     The only limits on the maximum number
                           of databanks, etc. are imposed by
                           hardware limitations
Input                      Terms can be analysed from SGML and
                           GRAALDOC formats for extraction of
                           French and English simple and compound
                           terms. Plain text entry also possible
Tools                      A linguistically grounded and
                           frequency-based term identification
                           tool with corpus management tools
Commercial aspects         Not determined yet



Table 5 : Typical results of the analysis

Recommendations

Functionality


Terminology management systems are still to a great extent tailored to the needs of translators and terminologists. Therefore, more weight should be given to the design of multi-functional systems that are intended for use not only by translators, but also by other professions involved in information management and information processing within a company or institution. TMSs should be stand-alone modules with defined interfaces for interaction with other (linguistic) applications, such as corpus-management systems, machine-aided translation systems, or machine translation systems. Integration into translator's workbench environments, including translation memories and translation editing tools, should also be possible. Both adjusting TMSs to the needs of a wider range of user groups and linking them to machine translation programs require the integration of further information elements. Methods for automatic deduction of (at least part of) these elements are necessary to facilitate the supplementation of existing terminology collections.

Support should be given to the development and enhancement of intelligent retrieval mechanisms. Statistics-oriented error-tolerant similarity search features can be improved by linguistic procedures. This will serve both the inexperienced user and automatic term recognition in machine-aided translation systems like translation memories.
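
To make the idea of an error-tolerant similarity search concrete, the sketch below uses the `difflib` module from the Python standard library to find stored terms that resemble a misspelt query. It is a purely character-based comparison of the statistics-oriented kind; the linguistic procedures recommended above (for example lemmatisation or compound analysis) are not shown. The function name, the sample termbase and the cut-off value are assumptions made for illustration.

```python
import difflib


def fuzzy_lookup(query, terms, limit=3, cutoff=0.7):
    """Return up to `limit` stored terms whose spelling is close to `query`.

    Matching is case-insensitive; `cutoff` is the minimum similarity ratio
    (between 0 and 1) accepted by difflib.get_close_matches.
    """
    by_key = {term.casefold(): term for term in terms}
    hits = difflib.get_close_matches(
        query.casefold(), list(by_key), n=limit, cutoff=cutoff
    )
    # Map the folded keys back to the terms as they are stored.
    return [by_key[hit] for hit in hits]


termbase = ["terminology", "termbase", "translation memory", "thesaurus"]
matches = fuzzy_lookup("terminolgy", termbase)
```

Here the misspelt query "terminolgy" is close enough to "terminology" to be retrieved, whereas a query with no similar entry returns an empty list. Raising the cut-off makes the search stricter; lowering it returns more, but less reliable, candidates.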

Moreover, data models for TMSs should allow links between bibliographical and terminological data or graphical and terminological data as well as links between terminology and text corpora or terminology and encyclopaedic knowledge. Correct handling of fuzzy equivalence relations between terms/concepts (both monolingual and across languages) may require a move from relational data models to other concepts like object-oriented models, semantic networks or neural networks (an approach being investigated by, among others, Trados and System Quirk). However, there is a need for further research and development in this field in order to provide terminology users with improved functionality.

Usability


Software developers should agree upon common strategies and procedures for human-computer interaction using graphical user interfaces such as Windows. Windows operating techniques should be implemented more consistently than is often the case today. Support should be given to software developments that make full use of on-line help features available today under Windows using hypertext techniques, context-sensitive help, intuitive operating procedures, etc. Learning the systems - not only operating skills, but also theoretical foundations and background in terms of terminology science - should be made more comfortable for the user by making use of tutorials, "guided tours", assistants, and so on.

Terminology Interchange


In order to facilitate the interchange of data between different platforms, different TMSs, and different linguistic and non-linguistic applications, as well as the import of data into print systems so that the parallel production of machine-readable and printed dictionaries will be enhanced, future TMSs should support a standard such as the SGML-based machine-readable terminology interchange format MARTIF (ISO DIS 12200), which is currently being developed.
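
The following Python sketch suggests what exporting a terminological entry to an SGML-style interchange file might involve. The element and attribute names (`termEntry`, `subject`, `langSet`, `term`) are illustrative assumptions only and should not be read as the MARTIF document type definition; a real implementation would follow the published standard.

```python
from xml.sax.saxutils import escape, quoteattr


def entry_to_markup(entry_id, subject, terms_by_lang):
    """Serialise one concept entry as tagged text.

    `terms_by_lang` maps a language code to a list of terms. The tag names
    are hypothetical and chosen only to show the principle of mark-up.
    """
    lines = [f"<termEntry id={quoteattr(entry_id)}>"]
    lines.append(f"  <subject>{escape(subject)}</subject>")
    for lang in sorted(terms_by_lang):
        lines.append(f"  <langSet lang={quoteattr(lang)}>")
        for term in terms_by_lang[lang]:
            # escape() protects the characters &, < and > in the data.
            lines.append(f"    <term>{escape(term)}</term>")
        lines.append("  </langSet>")
    lines.append("</termEntry>")
    return "\n".join(lines)


markup = entry_to_markup(
    "e1", "R&D", {"en": ["termbase"], "de": ["Termbank"]}
)
```

Because the output is plain tagged text, the same file could in principle be read by another TMS, a word processor or a print system, which is the point of agreeing on a common format.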

Language Support and Character Sets


In order to facilitate terminology work and the interchange of terminology in all languages, efforts should be undertaken to fully implement a 32-bit character code as defined in ISO 10646 (BMP = Unicode) and to support this character code in TMSs. This does not only mean that all character sets should be available in the TMS, but also that sorting orders for all languages should be customisable by the user.
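
A user-customisable sorting order can be sketched in a few lines of Python. The example assumes that the user supplies the alphabet of a language as a string, here the Swedish alphabet, in which å, ä and ö follow z. The function name is hypothetical, and the sketch ignores matters that a full collation scheme has to handle, such as letters written with more than one character.

```python
def make_sort_key(alphabet):
    """Build a sort key from a user-supplied alphabet string.

    Characters not listed in `alphabet` are placed after all listed ones
    and ordered among themselves by code point.
    """
    rank = {ch: position for position, ch in enumerate(alphabet)}
    fallback = len(alphabet)

    def key(term):
        return [(rank.get(ch, fallback), ch) for ch in term.casefold()]

    return key


SWEDISH = "abcdefghijklmnopqrstuvwxyzåäö"
terms = ["öl", "zebra", "ära", "apa", "åka"]
in_swedish_order = sorted(terms, key=make_sort_key(SWEDISH))
in_code_point_order = sorted(terms)
```

With the Swedish alphabet the result is apa, zebra, åka, ära, öl, whereas sorting by raw character code places ära before åka. Supplying a different alphabet string changes the order without changing the program.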

Electronic Networks


Bearing in mind the further development of the international information superhighway using, among other things, the Internet and the World Wide Web, TMSs have to be designed for use in wide area or global networks, in order to create a world-wide "terminology bazaar". This requires the implementation of features for controlling access to the terminological data, including procedures for calculating costs and fees, and of interfaces between TMSs and world-wide networks that help the inexperienced user who is not a language expert. In this context, multilingual terminology is to be regarded not only as the data the user wants to retrieve from a database, but also as a means of accessing information in an international information network. In the near future, keywords such as "teleworking" or "collaborative terminology work" will become increasingly important. Thus, terminology management systems should be designed to allow such world-wide collaborative work in networks as a means of compiling and distributing terminology across different software products and platforms.

New Media


In many subject fields, it is not enough to represent knowledge (concepts) only in textual form. TMSs have to be designed in a way that they can integrate graphics, sound, speech, animation, and video data as a means of representing concept-oriented information. This data should not only be accessible as a whole but also be split up into parts of objects using zooming techniques and links between objects and parts of objects. Retrieving terminological entries via (parts of) a graphics- or video-object may come closer to concept-oriented access to terminology than is the case in today's TMSs.

Proposed Solutions


There are five areas of work that will alleviate problems faced by the terminology community and by TMS developers: terminology interchange and other related standards, TMS evaluation criteria, the electronic interchange of terminology, the cross-fertilisation of ideas, and technical solutions including the use of corpus linguistic methods for term extraction and neologism identification, adaptation of multimodal technology, use of linguistic models for encoding grammatical data and the use of software engineering methods in TMS development.

The Use of Standards

TMSs should fully support a 32-bit character code as defined in ISO 10646 to allow for different writing systems or character sets.

TMSs should contain help text on the following ISO standards: Principles and methods of terminology (ISO 704), Presentation and layout of terminology standard (ISO 10241), Vocabulary of terminology (ISO 1087), and International harmonisation of concepts and terms (ISO 860). However, the developers would have to license this material from ISO.

Developers should be aware of, and be motivated to support, standards that emphasise the use of SGML, such as the ISO Machine-Readable Terminology Interchange Format (MARTIF, ISO DIS 12200).

TMS Evaluation Criteria

The more than 40 descriptors identified and used in the study should be distributed as widely as possible. It is essential that these descriptors are studied carefully and that appropriate additions and deletions are made where necessary. Further research is needed to operationalise these metrics.

Improved Exchange of Terminology Data

TMSs should be able to support multi-user access and editing in a networked environment. Models of configurations should include "parent" data stores, on which modifications may be made, and "child" data stores, which may contain subsets of data for read-only access.
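
The parent/child configuration can be pictured with the following minimal Python sketch, in which a parent store accepts modifications and hands out a read-only child store containing a subset of its entries. The class and method names are assumptions made for the example, and networking, locking and multi-user access control are deliberately left out.

```python
from types import MappingProxyType


class ParentStore:
    """Master data store: entries may be added or modified here."""

    def __init__(self):
        self._entries = {}

    def put(self, entry_id, entry):
        self._entries[entry_id] = dict(entry)

    def make_child(self, keep):
        """Copy the entries accepted by `keep` into a read-only child store."""
        subset = {
            entry_id: dict(entry)
            for entry_id, entry in self._entries.items()
            if keep(entry)
        }
        return ChildStore(subset)


class ChildStore:
    """Read-only subset of a parent store."""

    def __init__(self, entries):
        # MappingProxyType exposes each entry as a read-only view.
        self._entries = {
            entry_id: MappingProxyType(entry)
            for entry_id, entry in entries.items()
        }

    def get(self, entry_id):
        return self._entries[entry_id]

    def ids(self):
        return sorted(self._entries)


parent = ParentStore()
parent.put(1, {"term": "termbase", "subject": "terminology"})
parent.put(2, {"term": "gearbox", "subject": "automotive"})
child = parent.make_child(lambda entry: entry["subject"] == "terminology")
```

An attempt to assign to a field of an entry obtained from the child store raises a `TypeError`, while the parent store remains editable. The child holds a copy, so later changes to the parent reach it only when a new child is generated.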

TMSs should provide access to the emerging information superhighway and should explore the use of the superhighway for buying and selling terminology collections.

TMSs should use the latest encryption and security systems to protect existing terminology databases against damage or theft.

Cross-Fertilisation of Ideas

TMS developers need to exchange ideas with the language engineering and computing science communities. This exchange can be facilitated through ELRA, through cross-disciplinary conferences and through nationally and EU-sponsored interdisciplinary projects.

Technical Solutions

Software Engineering Methodologies

Corpus Linguistics Methodology

Language Modelling

Multimodal Technology

Terminology Validation

Recommendations


The recommendations fall into two categories, one low-risk and pro-active, the other high-risk and reactive.

The first set of recommendations deals with problems and solutions where the TMS developers and terminology users have to take the initiative. Here, the risks are low and benefits for the developers and users are more quantifiable. Actions include the further development of metrics for evaluating TMSs, the development and wider use of exchange formats, as well as the use of electronic networks, particularly client-server architectures, and of software engineering principles. Initiatives in these low-risk areas may include the formation of an organisation of TMS developers, academics and terminology users. Bodies such as ELRA, and its constituent terminology college, should take a reactive role in the above mentioned areas, liaising with the relevant EU R&D programmes.

The second set of recommendations deals with higher-risk solutions. These include access to globally-available communications networks, raising issues of security; the use of linguistic models for the enriched linguistic description of terms; the adaptation of corpus and statistical linguistic techniques for identification of terms, especially neologisms; the use of complex data structures for the visualisation of terminology data; and the exploitation of multimodal technology.
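
One simple statistical technique of the kind referred to above compares how often a word occurs in a specialist corpus with how often it occurs in a general-language corpus; words that are markedly more frequent in the specialist text are candidate terms, and words absent from the general corpus may be neologisms. The Python sketch below is an assumption-laden illustration of this idea, not the method of any system surveyed: the tokeniser, the smoothing and the thresholds are all choices made for the example.

```python
from collections import Counter
import re


def tokens(text):
    """Split text into lower-cased alphabetic words."""
    return re.findall(r"[^\W\d_]+", text.casefold())


def term_candidates(special_text, general_text, min_ratio=2.0, min_count=2):
    """Rank words by relative frequency in the special vs. general corpus."""
    special = Counter(tokens(special_text))
    general = Counter(tokens(general_text))
    special_total = sum(special.values()) or 1
    general_total = sum(general.values())
    scored = []
    for word, count in special.items():
        if count < min_count:
            continue
        relative_special = count / special_total
        # Add-one smoothing so that unseen words do not divide by zero.
        relative_general = (general[word] + 1) / (general_total + 1)
        ratio = relative_special / relative_general
        if ratio >= min_ratio:
            scored.append((word, ratio))
    # Highest ratio first; ties are broken alphabetically.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


special = "the termbase stores the term and the termbase exports the term"
general = "the cat sat on the mat and the dog sat on the log"
candidates = [word for word, _ in term_candidates(special, general)]
```

In this toy example "term" and "termbase" are proposed as candidates, while the function word "the" is rejected because it is about equally frequent in both texts. Real corpora would be far larger, and multi-word terms would need additional treatment.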

TRAINING


Training for language professionals and others engaged in terminology work will be an essential component of a future European infrastructure. The POINTER national surveys (cf. the discussion in Chap. 3.2: "National and Regional Aspects of Terminology Work") show that awareness of the need for training in tools and methodology is growing among terminology producers. While training is currently available, it is delivered in disparate and diffuse ways. Nevertheless, there are commonalities which suggest that a more coherent framework for training can be established, incorporating academic, professional and experiential aspects, and catering for a range of applications in which terminology plays an important or even central role. This section outlines how such a framework might look as the basis for a future accreditation scheme.

Present situation

Terminology training, where it is available, is often provided in the context of a broader curriculum, or as a part of a broader professional role. Any model of accreditation needs to take into account both the academic and professional aspects as well as the different levels of expertise which there are in practice.

Terminology is taught, sometimes implicitly, sometimes explicitly, at a number of levels and for a range of purposes. Sometimes the training is formal (involving an academic syllabus, textbooks, examinations, and so on); at other times, it takes the form of intensive post-experience short courses, or on-the-job training using in-house terminology management systems or through the vendors of such systems. The POINTER terminology case studies covering a range of nine training models in seven European countries show that, despite differences in the mode of delivery and varying target groups, a consensus about the core elements of a terminology curriculum, both theory and practice, exists, suggesting a synergy which will facilitate the mutual recognition - and accreditation - of courses. The most extensive training is available in the context of translation programmes. An interesting development is the emergence of courses for subject specialisms where terminology teaching and the creation and critiquing of terminology is used as a pedagogic device in its own right.

Fundamental to a consideration of terminology training is the concept of a "terminologist" and how this can be related not only to different levels of training and expertise, but also to a range of professional roles. Indeed, it is our view that the profile of a terminologist should be considered more as a role than an independent profession. This role could be played in different professional situations according to different training profiles, depending on the needs of each profession.

The main groups of professionals involved in terminology might be included within the following profiles:

The acquisition of experience at work is regarded by many professional bodies in other industries as an additional qualification: accordingly, such bodies accredit the experience gained in a formal manner. The accreditation schemes run by these professional bodies are recognised by their member organisations. This recognition helps the professionals to have a career development path and, indeed, to have a career development plan which is agreed by their peers and their employers. Any individual having accredited training, which means not only academic qualifications but work experience, can rightly claim promotion within an organisation or enjoy better job prospects outside the organisation. This possibility is, however, currently not available in what might be called the "terminology services industry".

The current situation in terminology training is as follows: (1) there are a large number of terminology courses (often as an annexe to translation or other types of courses) together with some in-house training (especially in large corporations); (2) many different types of organisations undertake terminology activities; (3) there is a degree of consistency of approach to training across different countries; (4) there is informal accreditation across corporations (within the same sectors, and across sectors in large corporations) and accreditation by professional bodies in some European countries and in UN/large organisations; (5) there are a number of EU-funded initiatives which currently fund terminology training directly or indirectly; for instance, the LEONARDO, ERASMUS, DELTA and COPERNICUS programmes fund technology transfer and teaching and learning projects that envisage training in terminology and the use of terminology; (6) there is much autodidacticism, especially at the "low end" of the market; and (7) trained terminologists are gradually appearing on the market, although the existing basis is largely untrained.

Problems

A number of problems were identified in the POINTER terminology training survey, in particular some disparity and diversity in modes and sources of delivery for training across institutions as well as nationally: training may, for example, be delivered in-house, through seminars, through intensive summer courses, in full-time courses, or as modules in wider courses. A lack of awareness of training and curricula in other institutions within the same sector - and to an even greater extent, across sectors and activities - was highlighted. This situation is not conducive to co-ordination and inter-institutional co-operation, thereby limiting the possibilities for mobility between sectors, languages and countries. Nor is it conducive to the establishment of a career development path for terminologists from entry to expert level, a situation which is in urgent need of remedial action in view of the wide range of applications in which terminology features, requiring different levels of training and different orientations and emphases. The wide variety of employers requiring staff with different abilities and specialities must also be taken into account. These include: higher education establishments; research centres/laboratories; library/documentation centres; ministry or equivalent official organisations; industrial enterprises; service enterprises; commercial enterprises; translation agencies or companies. The scale of the training task, given the current deficit, is daunting. In particular, the lack of qualified trainers, not only for terminology but also for terminology project management, is notable. Poor-quality terminology work is often attributable to lack of adequate training: to cope with the training deficit, a "train-the-trainers" concept needs to be developed.

While the number of those active in terminology work is large - although often hidden in other activities - there is at present no standard way in which their work can be compared and evaluated, and hence accredited across European borders. Furthermore, although terminology training exists, it naturally covers a range of objectives and is offered at many different types of institution. In many cases, just as for professional activity, terminology training is part of a broader framework, e.g. training programmes for translators, information scientists, and so on. Consequently, different professional profiles may emerge, according to professional exigencies. A "terminologist" may therefore be a specialist translator, a translation-oriented terminologist, a domain specialist, or a terminologically trained I & D specialist. While such diversity is clearly appropriate, the commonalities must also be recognised. At present, much terminological activity is carried out by language specialists or domain specialists on an ad hoc basis, or even by untrained administrators or managers.

From this it becomes clear that in order to optimise the value and benefit of the broad range of terminology work, there is a need for greater coherence - rather than standardisation - which acknowledges and accommodates diversity and commonality. One way of achieving this is through an accreditation scheme which incorporates at least two dimensions, namely, levels of experience and training, and areas of activity.

In order to clarify how such a scheme might be developed, we have looked at an established scheme already in operation for a different field of professional activity which seems to be a suitable and relevant model for establishing an accreditation scheme in terminology for a number of reasons, including:

This model is referred to in the next section outlining solutions.

Solutions

The framework for terminology training accreditation which is proposed here is based on a well-established professional development model used by the British Computer Society (BCS)(1). This model, as we have presented it in relation to terminology, comprises a series of matrices combining tasks and levels of experience. Professional progress can be tracked and accredited by moving between the cells which comprise the matrices of the model. The model is compatible with a four-stratum system of terminology work - comprising terminology acquisition, organisation, application, and education and research. Experience acquired and training received in each stratum will help a terminologist to develop professionally in a manner which will be visible to both the employers and the employees. More formal training can be integrated into the proposed model, as the POINTER case studies have shown.

The focus of our proposed solutions is a model for training accreditation which would take into account career development and ensure mobility. Central to this model is the definition of levels of professional development and the definition of the core tasks (and sub-tasks) in terminology training, organised in a system of matrices in which there is mobility between tasks and sub-tasks (horizontal and vertical) as well as between levels of development (vertical). The efficacy of this model would be bolstered and supplemented by the use of distance learning and by paying more attention to the training of trainers. Selected details of the Model (Core Tasks; Sub-tasks; Levels of Professional Development; Illustrations; Relationship between Tasks and Syllabuses; Matrices) are contained in Appendix 5.

It is proposed that the Model should consist of a set of performance standards designed to cover all functional areas of work carried out by practitioners and professionals engaged in occupations which fall wholly or mainly within the province of terminology.

The Model would be an independently-maintained set of standards of substantial value in planning and managing career development and training for those engaged in the vital work of recording, validating, standardising, updating and maintaining terminology.

The objectives of such a scheme would be at least six-fold:

We suggest that the proposed Model is well-motivated in the context of the "terminology services industry". The various terminology training case studies show the various curricula and syllabuses to have common core areas. These areas can be grouped broadly into: terminology acquisition; terminology organisation; the application of terminology resources, as in translation and documentation; and research into various aspects of terminology itself. The case studies also show that devising a programme for teaching terminology itself is an innovative task. The case studies clearly show that terminology draws from a range of disciplines.

A terminology training model, specifically a model which can be understood and be put into operation by a number of organisations, therefore needs to deal with the following Core Tasks:

These four tasks have to be viewed in a framework that is synergistic with on-the-job experience acquisition, is attuned to developments in tools (terminology management systems) and complies with standards including national, international, in-house and company standards.

The Model which is being proposed therefore aims to take into account two dimensions:

The first dimension will be known as "Core Tasks"; the second dimension as "Levels of Professional Development".

Following the precedent set by the BCS model, which has expanded to accommodate wider needs, additional uses may also be envisaged for the proposed terminology training Model. Amongst additional uses which also have relevance for the development of activities in terminology are:

The proposed Model is intended as an "industry"-wide career development model for use by employers and employees; it is certainly not intended to be a prescriptive set of standards. Many organisations could use it as a foundation and set of reference materials on which to build their own internal standards which would then aid the process of internal career development and training provision.

The proposed scheme can in principle be implemented by a training or similar organisation that is dedicated to the promotion of terminology in Europe. The European Language Resources Association (ELRA) may, for instance, have an interest in this enterprise through its "College of Terminology". Equally interested parties may include translators' organisations in the various EU countries and a number of technical documentation bodies. Standardisation bodies, like ISO, and national standardising bodies would be equally competent as possible administrators of the scheme.

However, it would be naive to suggest that the proposed model can - in its present form - be put directly or immediately into operation by any of the above organisations. The Model must be regarded as a first step towards defining a more elaborated and consensual scheme.

Recommendations

In pursuit of these solutions the following recommendations are made.

The commonalities which we have detected among the curricula and syllabuses investigated have encouraged us to outline a four-step framework for accrediting terminological qualifications within the proposed matrix Model:

  1. The exploitation of the commonalities and the understanding of the differences will, in our view, be the first step towards accreditation across and within organisations within the EU. This will help in the elaboration of the Core Tasks and the refinement of Sub-Tasks.
  2. The second step will be for a number of volunteering EU academic organisations to set up a pilot scheme that encourages mobility across national and sector boundaries, based on devising a system that will help each organisation to recognise the qualifications of other organisations for a higher degree. This will mean accepting, for example, graduates across geographical boundaries into postgraduate courses, which is to some extent already happening, but largely on an ad hoc basis. This step will help in clarifying issues of mobility across Core Tasks and within Sub-Tasks.
  3. The third step will involve a contract between volunteering EU academic and industrial/commercial organisations. Here, working terminologists - within the same country initially but later on across borders - will participate in a scheme for working in a business sector in which they have little or no experience. The participating academic organisations will devise evaluation criteria that will help in assessing the terminologists' basic grasp of principles and practices of terminology and, on the basis of the results of the evaluation, will recommend reading lists, attendance at lectures and participation in examinations. The examinations can be open book or closed book, may involve writing reports or the presentation of a seminar. This step will provide much-needed input regarding terminologists' on-the-job training and how it helps them in moving across sectors. Moreover, this step will help in quantifying how experience of work as a terminologist can be converted into equivalent academic training.
  4. The fourth step will be to actively enlist a public or private sector pan-European organisation that will be able to refine and market the accreditation programme. Like the marketing of any innovative product, the pan-European organisation will receive valuable feedback from customers, both satisfied and dissatisfied. It is expected that academic organisations will continue to be involved in analysing the feedback and suggesting improvements.

The implementation of these recommendations can be viewed as a first step towards the establishment of terminology as a fully-recognised part of the service industries in the multilingual information society.

RESEARCH ISSUES IN TERMINOLOGY


It is perhaps not easy to defend terminology as a research discipline which has a unified set of objectives, since the study of terms is relevant to a broad range of interests. Translators, for example, cite a range of problems they face in using terminology, including foreign language equivalents, synonyms, retronyms(2), ambiguity, effects of register and document type. For the information science community, terminology invokes a distinct yet related set of problems: the precision with which a term or "descriptor" can be used to locate a document, the recall associated with a given term. For technical documentalists, the variety of terminology across registers, from informal advertisement texts to turgid and formal learned papers, is one of the key problems; the frequency of usage of a term by a given target readership of a document is another. For the early pioneers in corpus linguistics, people like John Sinclair, Jan Svartvik, Sidney Greenbaum and Rodney Huddleston, scientific texts, their idiosyncrasies and analysis, were an important yet clearly delineated output of a linguistic community. For the structuralist Zellig Harris, the uniqueness and the frequency of noun phrases and other kindred grammatical structures in texts, comprising key terms of a scientific domain, are indicative of the structural basis.

Terms are also often studied without explicit acknowledgement or reference to the terminological literature. Computer scientists, particularly those involved in information management and in knowledge-based expert systems, extensively manipulate terms, and attempt to organise them in complex databases. However, the word "terminology" is seldom seen in computing science literature. Philosophers of science, from Ludwik Fleck to Thomas Kuhn, and scientists who became philosophers, including Alfred North Whitehead, Niels Bohr, Werner Heisenberg and Enrico Fermi, all appear to comment on language and how it is used in science or scientific discourse. Without mentioning terminology, computer scientists, philosophers of science, and key scientific figures of our time, have all commented on the coinage, organisation, uses and abuses of terminology.

Semanticists (linguistic and logical semanticists) talk about "natural-kind" and "nominal-kind" terms, proposing a strong correlation between taxonomies and natural-kind terms, and non-taxonomic hyponyms and nominal-kind terms. Despite a recent awakening of interest in terms in semantics, there is no correlation between the activities of semanticists and terminologists.

Lexicographers have also begun to show considerable interest in terminology, particularly since learners' dictionaries and "college" dictionaries contain up to 40% specialist terms. Terminology makes up a substantial proportion of neologisms and addenda to these dictionaries. The advent of encyclopaedic dictionaries, like the American Heritage and the Oxford Encyclopaedic, has meant that apart from biographies and descriptions of place names, much of the encyclopaedic dictionary deals exclusively with complex nominals. Computational lexicographers have discussed how grammatical, semantic and pragmatic data should be encoded with each of the lemmata in a lexicographical database. Such an encoding is of considerable import for machine translation systems. Terminology, as the study of terms, has recently begun to find increasing acceptance among the lexicographical and linguistic communities.

Terminology can be regarded as an enabling research discipline, since without the systematic coinage and usage of the terms of a domain, it is not possible to study or use scientific theories, instrumentation and matrices based on such theories. But since terminology is embedded in translation, information science, documentation, semantics, philosophy of science and so on, it is not possible to give it a totally independent status. Independent status or not, terminology-related research, much of it essentially European, can be of benefit not only in understanding the nature and function of language and knowledge but, more importantly, in building terminology databases and terminology management systems.

Overview of Activities

In recent years terminology research in Europe has gained considerable momentum and covers a wide spectrum of topics that reflect special socio-cultural contexts and specific linguistic requirements. In countries or regions with explicitly formulated language policies (e.g. those with language planning efforts, such as Iceland, Finland, Catalonia, the Basque Country, France, etc.), research topics such as the coinage of new terms (neologisms), the development of systematic terminology planning methods, socio-terminological studies on the acceptance of neologisms in user groups, the emancipation of minority languages via terminology planning, and language contact and its impact on terminology development receive the greatest attention. Research centres dealing with such topics include the universities of Paris XIII, Rouen, Rennes, Barcelona, Bergen, Vaasa, and many others.

In countries without such salient sociolinguistic contexts terminology research focuses both on individual subject fields (in particular economics and its related fields in numerous "economic universities" or similar institutions such as in Copenhagen, Kolding, Vienna, Bergen, or in "technical universities", etc.) and on specific aspects of terminology science that cover a heterogeneous range of topics. These include computer-assisted terminology management and the development of appropriate methodologies, including terminology interchange, comparative evaluation of terminology management systems and related systems such as translation memories etc. Other work is focused on translation-oriented terminology curricula such as in Saarbrücken, Cologne, Bozen, Mainz, Innsbruck, Paris XIII, Vienna etc. Other important areas are those of knowledge-based terminology management, corpus-based terminology acquisition and the re-use of terminology corpora in machine translation and knowledge bases (in particular Surrey, Manchester/UMIST, and Saarbrücken). The more theoretical level of terminology science as part of the philosophy of science is dealt with in particular in Vienna and Surrey, while hyperterminology research in social sciences is conducted in Tampere, Vaasa, Vienna and some extra-European centres such as Ottawa, Hawaii, Kent State University and many others.

Many other interesting and important topics (especially in LSP studies all over Europe) are of interest to terminology research, but are too extensive to be described here. Comprehensive overviews of the state-of-the-art of terminology research can be found in conference proceedings of the Language for Special Purposes (LSP) conferences held in 1993 in Bergen [Brekke et al. (eds.) 1994] and 1995 in Vienna [Budin (ed.) forthcoming], in the proceedings of the Terminology and Knowledge Engineering (TKE) conferences of 1987 and 1990 in Trier [Czap/Galinski (eds.) 1987], [Czap/Nedobity (eds.) 1990] and 1993 in Cologne [Schmitz (ed.) 1993], as well as in research journals, in particular Terminologies nouvelles (e.g. numbers 10 and 12 on phraseology and terminology planning), Terminology (since 1994) on a broad variety of research topics, and Terminology Science and Research (since 1990).

More Specific Examples of Research Issues

The range of issues with which terminology researchers are concerned is indeed broad, reflecting the pervasive nature of terminology and its interdependence with other disciplines. Some key issues may be grouped as follows, although it would be wrong to regard these either as comprehensive or as entirely discrete:

Text and Corpus-Related Issues


While all terminology work is said to start from the concept, concepts can only be communicated between members of a discourse community, or accessed by terminologists, through linguistic forms (and in some cases through symbols, formulae, and so on). Text (whether written or spoken) is therefore an important source of conceptual data, but mediated through the linguistic level. In terminology work, terminologists are required to "scan" texts - evaluated for their degree of authority - in order to extract conceptual data as one of the steps in compiling a thematic terminology, but little if any guidance is given on how to relate the linguistic and the conceptual levels. This is a problem which cannot be solved in an ad hoc way but needs input from text linguistics in the first instance to assess the interplay of lexical and syntactic functions with the epistemological role of special-language texts.

So-called "manual" scanning of texts is, however, not only labour-intensive, but also in some respects potentially unreliable, since many operations need to be performed simultaneously, i.e. the terminologist must look for many kinds of data (e.g. conceptual, syntactic, morphological, phraseological and usage). Lexicographers have, at least in the English-speaking world, been working in a semi-automated way for a number of years now, using computer support tools to extract, select and organise their data based on authentic textual material, i.e. an electronic corpus. The use of corpora for terminographical work is, however, still largely in the research domain. Issues that require particular attention include differences and similarities with the use of corpora for general-language lexicographical work, corpus design, type of corpus (e.g. fully parsed, "tagged" for word class or untagged), corpus management, and corpus size. Related to the issue of corpus size is the important question of representativeness. While lexicographers talk of larger and larger general-language corpora (100 million plus words) as being necessary in order to adequately map the linguistic and semantic patterns of general-language vocabulary, the question in terminology is still wide open: what is the optimal size for a special-language corpus for terminological purposes? Fruitful avenues of enquiry seem to be the role of domain and text type, linking the issue with that of corpus design.

Terminology Acquisition


"Acquiring" terms is often seen as one of the bottlenecks in the compilation of terminologies, as well as in related areas such as the building of expert systems. Yet there is an even earlier step which is often subsumed under the acquisition stage, that of term identification. The question "What is a term?" is one which cannot be easily answered in the abstract, since not only are terms used in a variety of communicative situations (ranging across the vertical dimension of language for special purposes) and domains (the horizontal dimension), they are also part of the lexicons of natural languages, being susceptible to all the influences of change to which general language is also open. In fact, it can be argued that special languages are even more susceptible to change than general language if we take into account the role of terminology in the development of emergent ideas and new disciplines and the frequent use of certain languages as special-language lingua francae by non-native speakers. The issue of "what is a term" (also referred to as degree of "termhood") has to date received little principled consideration, although it is well-known and acknowledged among the terminology research community.

Terminology acquisition has in the past few years become closely associated in the research community with the use of special-language corpora, as we have seen. Research has focused on the development of support tools for terminology work which can assist the terminologist to process texts in order to identify "term candidates". Such tools use statistical and contextual (linguistic) data as indicative of "termhood". Future research work is likely to focus on the identification of compounds (i.e. multi-word terms) and of semantically related terms. The importance of these two items for terminology is clear: the majority of terms are compounds (complex NPs), and lexical semantic relations (e.g. hyponymy, hyperonymy, meronymy) are likely to be indicative of underlying conceptual relations. The relevance of other word classes, notably verbs, to terminology work is a further important research issue in the linguistic domain.
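As a rough illustration of how statistical data can be used as an indicator of "termhood", the following minimal sketch ranks the words of a special-language text by comparing their relative frequency with that in a general-language reference text. The function names, the add-one smoothing and the toy corpora are assumptions made for this illustration; the sketch does not describe any particular support tool, and it handles single words only, not the compounds mentioned above.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_candidates(special_text, general_text, top_n=5):
    """Rank words by their relative frequency in the special corpus
    divided by their relative frequency in the general corpus.

    A high ratio suggests that a word is characteristic of the special
    language and may be worth presenting to a terminologist for review.
    """
    special = Counter(tokenize(special_text))
    general = Counter(tokenize(general_text))
    special_total = sum(special.values())
    general_total = sum(general.values())
    vocabulary_size = len(set(special) | set(general))

    scores = {}
    for word, count in special.items():
        # Add-one smoothing (an assumption of this sketch) keeps the ratio
        # defined for words that never occur in the reference corpus.
        special_rel = (count + 1) / (special_total + vocabulary_size)
        general_rel = (general[word] + 1) / (general_total + vocabulary_size)
        scores[word] = special_rel / general_rel
    # Highest ratio first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_n]


# Toy corpora, invented for illustration only.
special_corpus = "The catalyst lowers the activation energy. The catalyst is not consumed."
general_corpus = "The cat sat on the mat. The dog is not in the house. The energy bill is high."

for word, score in term_candidates(special_corpus, general_corpus, top_n=3):
    print(f"{word}: {score:.2f}")
```

A ranking of this kind only proposes candidates; the decision on whether a candidate is a term remains with the terminologist, and contextual (linguistic) evidence would have to be added separately.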

Terminology Interchange and Terminology Representation


Terminology databases generally consist of term records; each record contains data related to the various attributes of a term. The data are usually broadly divided into term, source, subject field, definition or explanation, synonyms, short forms and notes (cf. ISO 1087 - Terminology Vocabulary, 1989). These attributes can be divided into four categories: administrative attributes (date of term entry, terminologist's name, source details, etc.); linguistic attributes (lemma, grammatical details); usage-related or pragmatic attributes (synonyms, register, deprecated terms); and conceptual/semantic data (concept type, semantic relationships with other terms).
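The four attribute categories can be pictured as a nested record structure. The following is a minimal sketch only: the class and field names are assumptions chosen for illustration and are not taken from ISO 1087 or from any existing terminology management system, and the sample values are invented.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class AdministrativeData:
    entry_date: str      # date of term entry
    terminologist: str   # name or initials of the terminologist
    source: str          # source details


@dataclass
class LinguisticData:
    lemma: str
    grammar: str         # grammatical details, e.g. "noun, compound"


@dataclass
class UsageData:
    synonyms: list = field(default_factory=list)
    register: str = ""
    deprecated_terms: list = field(default_factory=list)


@dataclass
class ConceptualData:
    concept_type: str = ""
    # (relation, related term) pairs, e.g. ("hyperonym", "database")
    relations: list = field(default_factory=list)


@dataclass
class TermRecord:
    term: str
    subject_field: str
    definition: str
    administrative: AdministrativeData
    linguistic: LinguisticData
    usage: UsageData = field(default_factory=UsageData)
    conceptual: ConceptualData = field(default_factory=ConceptualData)


# Invented sample record, for illustration only.
record = TermRecord(
    term="terminology database",
    subject_field="terminology management",
    definition="A database consisting of term records.",
    administrative=AdministrativeData("1996-01-15", "A.B.", "illustrative example"),
    linguistic=LinguisticData(lemma="terminology database", grammar="noun, compound"),
    usage=UsageData(synonyms=["termbase"], register="neutral"),
    conceptual=ConceptualData(concept_type="artefact",
                              relations=[("hyperonym", "database")]),
)

# asdict() flattens the nested dataclasses into plain dictionaries.
print(asdict(record)["usage"])
```

Keeping the four categories as separate sub-structures makes it explicit which part of a record is affected when, for example, only the administrative data change.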

Terminology interchange can be defined as the reuse of extant terminology databases across organisations or amongst different computer applications like word processors, machine translation systems, and so on. Terminology representation can be defined as the unambiguous and explicit representation of the various attributes of an individual term and of the interrelationship that exists between different terms.

The reusability of terminology data is impeded because the meaning of these attributes is unclear. Even if there were agreement on meaning, it is still unclear how this meaning is to be translated into an equivalent model which will be used in the setting up of a terminology database (an issue of database design), since there are few standard descriptions of any of the four attribute classes. Data items associated with individual attributes are letters of an alphabet together with numbers and punctuation and other graphetic markups, or mnemonics that act as placeholders for dates, initials of terminologists or others, acronyms for institutions or country/language codes. Interchange difficulties arise because database managers set different maximum lengths for the various fields.
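The difficulty caused by differing field-length limits can be made concrete with a small check run before an exchange. This is a minimal sketch under stated assumptions: the field names, the limits and the sample record are hypothetical and do not describe any real database.

```python
def find_truncations(record, target_limits):
    """Return the fields whose values would not fit into the target database.

    The result maps each offending field name to a pair
    (actual length, maximum length allowed by the target).
    """
    problems = {}
    for field_name, value in record.items():
        limit = target_limits.get(field_name)
        if limit is not None and len(value) > limit:
            problems[field_name] = (len(value), limit)
    return problems


# Hypothetical record exported from a source database.
source_record = {
    "term": "terminology management system",
    "definition": "Computer-based information management system for terms.",
    "language": "en",
}

# Hypothetical maximum field lengths in the receiving database.
target_limits = {"term": 20, "definition": 255, "language": 2}

print(find_truncations(source_record, target_limits))
```

Reporting such mismatches before the transfer allows a decision to be taken on each case, rather than letting the receiving system silently cut the data.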

Furthermore, the inventory of attributes is often expanded in an ad hoc way, inconsistent with existing standards such as ISO 1087, resulting in extra fields or, even worse, synonyms for existing fields. Subject field labels are also notoriously polysemous.

Even if there is agreement on how to name the various attributes, there is no consensus on how the data related to the attributes should be handled. The level of detail of grammatical description of, for instance, a compound noun constrains the use of terminological data by, say, machine translation systems. Conceptual attributes present even more problems for data interchange. Library classification systems, often used to structure conceptual data, are themselves only models of reality and intended for human rather than machine interpretation. The level of detail is also often insufficient for terminological work. Usage-related attributes pose their own problems: the treatment of synonymy, for instance, is varied, synonyms sometimes being stored in a field and at other times as separate records in their own right. Text typological classifications also vary considerably.
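The two treatments of synonymy mentioned above can be related to each other mechanically. The following minimal sketch converts records that hold synonyms in a field into separate records that point back to the main entry; the dictionary keys ("term", "synonyms", "synonym_of") are assumptions made for illustration.

```python
def synonyms_to_records(records):
    """Expand synonyms stored in a field into separate records of their own.

    Each new synonym record copies the data of the main record and carries
    a 'synonym_of' cross-reference back to the main term.
    """
    expanded = []
    for record in records:
        # Copy everything except the synonym list itself.
        main = {key: value for key, value in record.items() if key != "synonyms"}
        expanded.append(main)
        for synonym in record.get("synonyms", []):
            entry = dict(main)
            entry["term"] = synonym
            entry["synonym_of"] = record["term"]
            expanded.append(entry)
    return expanded


# Invented example: one record with two synonyms held in a field.
field_style = [
    {
        "term": "terminology database",
        "subject_field": "terminology management",
        "synonyms": ["termbase", "term bank"],
    }
]

for entry in synonyms_to_records(field_style):
    print(entry)
```

A conversion in the opposite direction would need to decide which of several synonymous records is the main entry, which is a terminological decision rather than a technical one.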

Nevertheless, some steps have already been taken in progressing the interchange of terminological data. The emergence of ISO standards like ISO 8879 - Standard Generalised Markup Language (SGML, see, for instance [Goldfarb 1990]) and the ISO standard 12200 (MARTIF) which is based on SGML has enabled the users of different text-processing and word-processing systems not only to exchange "natural text" with each other, but also additional information indicating how the text is organised in paragraphs and pages, and more complex data like figures, artwork, tables, references with other documents. Terminology interchange is then essentially facilitated through a program that translates one set of SGML marked-up data, stored in a database, into the contents of another database.
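The idea of a program that translates marked-up data from one database into the contents of another can be sketched as follows. Two assumptions should be noted: XML is used as a stand-in for SGML because the Python standard library offers an XML parser, and the element names ("termEntry", "term", "subjectField", "definition") and the field mapping are hypothetical and are not taken from ISO 12200 (MARTIF).

```python
import xml.etree.ElementTree as ET


def export_record(record):
    """Serialise a record from the source database as a marked-up entry."""
    entry = ET.Element("termEntry")
    for name, value in record.items():
        child = ET.SubElement(entry, name)
        child.text = value
    return ET.tostring(entry, encoding="unicode")


def import_record(markup, field_map):
    """Parse a marked-up entry and rename its fields for the target database.

    Elements that have no counterpart in field_map are ignored.
    """
    entry = ET.fromstring(markup)
    return {
        field_map[child.tag]: child.text
        for child in entry
        if child.tag in field_map
    }


# Hypothetical mapping from the interchange element names to the
# field names used by the receiving database.
FIELD_MAP = {"term": "headword", "subjectField": "domain", "definition": "gloss"}

source_record = {
    "term": "hyponymy",
    "subjectField": "lexical semantics",
    "definition": "Relation between a specific term and a more general one.",
}

markup = export_record(source_record)
print(markup)
print(import_record(markup, FIELD_MAP))
```

The markup acts as a neutral intermediate form: each database needs only one mapping to and from it, rather than a separate converter for every other database.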

The problems of interchange that are caused by the ambiguous description of grammatical and certain semantic data can also be alleviated by the use of a number of proposed and emergent de facto standards for describing these linguistic data items. Although these standards were developed initially for use by lexicographers, they are still relevant for terminologists because both terms and lexical data in a given language conform, on the whole, to the same syntax and morphology. Prominent amongst the proposed standards is the standard based on the EUREKA-sponsored GENELEX model and an earlier protocol, the ESPRIT-sponsored MULTILEX model.

It is clear that currently available database management systems cannot cope with the complex requirements of storing and retrieving terminological data as indicated here: the methods and techniques of artificial intelligence, particularly knowledge representation, do, however, allow a degree of explication. While encoding, say, a conceptual classification scheme in SGML is a useful start to solving the problems of data interchange, in order to address the deeper problems related to the meaning of terms and their interrelationship with other terms, what is required is the representation of terms rather than mere encoding.

Representation is about making things explicit, about resolving ambiguities and above all, particularly in the context of artificial intelligence (AI), about creating a surrogate of a class of things that exist in the real world on a computer system. This surrogate should not contain ambiguities, either lexically or structurally; it should help explicate shared knowledge, since it is not possible to share knowledge between a human and a machine in the same way as is possible between humans; it should be content-addressable and heavily cross-referenced. There are a number of schemata for representing terms using knowledge representation formalisms such as frames and semantic networks. These schemata are a means of allowing end-users to visualise the rich interconnectivity of terms. The formalisms also have in-built reasoning algorithms for inferring new information from pre-stored data. Knowledge representation is a sub-discipline of AI, and researchers in knowledge representation build computer programs to simulate this aspect of human intelligence.
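A very small semantic network with one built-in inference, the inheritance of properties along "is-a" links, can be sketched as follows. The class, its method names and the sample terms are assumptions made for illustration; real knowledge representation formalisms are considerably richer.

```python
class SemanticNetwork:
    """Terms linked by 'is-a' relations, with inherited properties."""

    def __init__(self):
        self.is_a = {}        # term -> broader term (or None)
        self.properties = {}  # term -> {property name: value}

    def add_term(self, term, broader=None, **properties):
        self.is_a[term] = broader
        self.properties[term] = dict(properties)

    def ancestors(self, term):
        """Return the chain of broader terms, nearest first."""
        chain = []
        current = self.is_a.get(term)
        # The membership test guards against accidental cycles.
        while current is not None and current not in chain:
            chain.append(current)
            current = self.is_a.get(current)
        return chain

    def lookup(self, term, prop):
        """Return a property stored on the term itself or, failing that,
        inherited from its nearest broader term; None if not found."""
        for node in [term] + self.ancestors(term):
            if prop in self.properties.get(node, {}):
                return self.properties[node][prop]
        return None


# Invented sample data.
network = SemanticNetwork()
network.add_term("machine")
network.add_term("pump", broader="machine", function="moves fluids")
network.add_term("centrifugal pump", broader="pump", principle="rotating impeller")

print(network.ancestors("centrifugal pump"))
print(network.lookup("centrifugal pump", "function"))
```

The function of a centrifugal pump is never stored explicitly; it is inferred from the entry for the broader term, which is the kind of new information derived from pre-stored data referred to above.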

The best practical example of the use of semantic networks in the construction of a terminology base is that of the Unified Medical Language System (UMLS). One component of UMLS is the UMLS Semantic Network, a knowledge source that is used for categorising all concepts stored in a term bank containing over 67,000 medicine-related concepts together with over 220,000 terms, plus terms from various authoritative sources.

A terminology database that uses knowledge representation schemata will be able to help its users to find terms and related data more intelligently, because a so-called terminology knowledge base knows in a limited fashion what it has stored. This means that the typical user with a fuzzy query will receive some answer, and that the system will help in storing new terms more appropriately by highlighting conflicting data and inconsistencies. Knowledge engineers involved in building knowledge bases will vouch for the facilities provided by a knowledge representation system and its superiority over a data representation system.
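Two of the behaviours described here, answering an imprecise query and flagging conflicting data on storage, can be imitated in a simple way. In this minimal sketch the "fuzzy" matching is plain string similarity from the Python standard library module difflib, which is an assumption standing in for the knowledge-based matching a terminology knowledge base would use; the class name, method names and definitions are invented.

```python
import difflib


class TermBase:
    """A toy term store with approximate lookup and conflict detection."""

    def __init__(self):
        self.definitions = {}

    def store(self, term, definition):
        """Store a term; return a warning if it conflicts with existing data."""
        existing = self.definitions.get(term)
        if existing is not None and existing != definition:
            return f"conflict: '{term}' is already defined as '{existing}'"
        self.definitions[term] = definition
        return None

    def fuzzy_lookup(self, query, max_results=3, cutoff=0.6):
        """Return stored terms that resemble the query, best match first."""
        return difflib.get_close_matches(
            query, list(self.definitions), n=max_results, cutoff=cutoff
        )


# Invented sample data.
termbase = TermBase()
termbase.store("hyponymy", "relation of a specific term to a more general term")
termbase.store("meronymy", "relation of a part to a whole")
termbase.store("synonymy", "relation between terms with the same meaning")

# A misspelt query still finds a stored term.
print(termbase.fuzzy_lookup("hyponomy"))

# Storing a different definition for an existing term is reported.
print(termbase.store("meronymy", "relation of a whole to a part"))
```

String similarity only catches misspellings and near-variants; recognising that two differently worded queries refer to the same concept would require the conceptual data discussed in this section.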

If the data associated with individual terms could be represented more explicitly, so that a terminology management system could make some of the inferences which human beings make whilst browsing through a terminology database, then such a terminology base would act as a pro-active base of knowledge: a so-called terminology knowledge base. Thus a terminology knowledge base comprises processed and interpreted data, or has facilities to interpret the information, whereas a terminology database can only process data.

Bilingual and Multilingual Terminology


All the issues which apply to monolingual terminology are equally relevant to bilingual and multilingual terminology. The major additional question is that of equivalence: the nature of cross-linguistic equivalence, how it can be established, and how it is represented in terminologies and termbases. An important area of growth is the use of so-called "parallel" or "shadow" corpora in order to establish "candidate equivalents" on a semi-automatic basis. To date there has been little research on the use of parallel corpora in special languages. Issues include text-linguistic matters such as the nature of "parallelism", as well as technical matters concerned with linking texts across languages.
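One simple way of proposing candidate equivalents from a parallel corpus can be sketched as follows. The sketch assumes that the texts have already been aligned sentence by sentence, which is itself one of the technical matters mentioned above, and it ranks target-language words by the Dice coefficient of their co-occurrence with the source term. The function name, the choice of the Dice coefficient and the three-sentence English-German corpus are assumptions made for illustration; the report does not prescribe a particular method.

```python
from collections import Counter


def candidate_equivalents(aligned_pairs, source_term, top_n=3):
    """Rank target-language words by Dice co-occurrence with source_term.

    aligned_pairs is a list of (source sentence, target sentence) tuples.
    Dice = 2 * joint / (occurrences of source term + occurrences of target word),
    counted per aligned sentence pair.
    """
    source_count = 0
    target_counts = Counter()
    joint_counts = Counter()
    for source_sentence, target_sentence in aligned_pairs:
        source_words = set(source_sentence.lower().split())
        target_words = set(target_sentence.lower().split())
        target_counts.update(target_words)
        if source_term in source_words:
            source_count += 1
            joint_counts.update(target_words)
    scores = {
        word: 2 * joint / (source_count + target_counts[word])
        for word, joint in joint_counts.items()
    }
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_n]


# Hypothetical toy corpus of aligned sentence pairs.
pairs = [
    ("the valve is closed", "das ventil ist geschlossen"),
    ("check the valve", "ventil kontrollieren"),
    ("the pump is closed", "die pumpe ist geschlossen"),
]

for word, score in candidate_equivalents(pairs, "valve"):
    print(word, round(score, 3))
```

The output is a ranked list of candidates only. Words that merely happen to co-occur with the term also receive scores, which is why the procedure is semi-automatic: a terminologist still has to validate the proposals.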

Further issues in bilingual terminology are concerned with the nature of equivalence at both the linguistic and conceptual levels. Related issues here include those of the interrelation of synonyms across languages (given that synonym variation often appears arbitrary), and the terminological as well as conceptual lacunae which are often highlighted by bilingual or multilingual work but otherwise remain hidden.

The emphasis in terminology, particularly in the Vienna School, has been on ensuring efficient specialist communication (particularly in technical and scientific areas) through the description and then harmonisation of the concept, followed by standardisation of linguistic and other representations of the concept. By contrast, the Canadian School, for example, has, for reasons to do with the bilingual language policy of Canada, focused more on linguistic and text-related issues. Translation-related special language vocabularies have in the past sometimes been thought of as essentially word-based (i.e. lexicographical in nature), and indeed, the majority of bilingual and multilingual resources which are typically available for translators are still lexicographical. However, since most professional translators earn their living by translating special language texts, domain-related conceptual information is as important for their decoding of the source text and encoding of the target text as it is for standardisation experts. The needs of technical writers, including those writing in a second language, are likely to be similar to those of translators. What is needed in future are terminologies which are not only linguistically rich within entries and between entries, but also conceptually rich. The electronic medium is most likely to be able to meet these criteria (particularly concerning interconnectivity and conceptual richness).

Empirical Concept Analysis


The distinguishing characteristic of terminology as opposed to lexicography is that it is concept-based rather than word-based, reflecting the fact that it is concerned with areas of specialist knowledge. In practical terms, this means that in a terminological resource, polysemes and homonyms are given separate entries, whereas synonyms appear in the same entry. A classification system is also used to structure the terms and the concepts which they designate. In a dictionary, on the other hand, polysemes and homonyms share entries, whereas synonyms appear in different entries. Classification systems, if they are used, tend to be applied in a rather ad hoc fashion. The concept is basic not only to the presentation of terminological data to the user, but also to the process and organisation of terminological work. Yet in practice, as we have seen, access to concepts and the relations between them is through linguistic forms: the term as a kind of label of the concept, and running text, which encodes relations between concepts. Since there is no one-to-one mapping of these two levels, i.e. the linguistic and the conceptual (and this is not just a question of "usage"), the identification of concepts and their relations is partially dependent on linguistic issues. This is a topic which has been little researched outside of standardisation, which might be seen as an attempt to influence the encoding of concepts by future writers and speakers by uniquely matching terms and concepts. However, what terminologists do in their work is to decode text in order to codify terminological knowledge including concept organisation, and it is here that there is a research gap.
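The practical difference between the two ways of organising entries can be shown with a small sketch. The records, concept identifiers and subject fields below are hypothetical; the same four records are simply grouped in two ways, once by concept (as in a terminological resource) and once by word form (as in a dictionary).

```python
from collections import defaultdict

# Hypothetical records: (term, concept identifier, subject field).
records = [
    ("bus", "C1", "transport"),
    ("omnibus", "C1", "transport"),
    ("bus", "C2", "computing"),
    ("data bus", "C2", "computing"),
]


def group_by_concept(records):
    """Terminological view: one entry per concept, synonyms listed together."""
    entries = defaultdict(list)
    for term, concept_id, _field in records:
        entries[concept_id].append(term)
    return dict(entries)


def group_by_word(records):
    """Lexicographical view: one entry per word form, its senses listed together."""
    entries = defaultdict(list)
    for term, concept_id, _field in records:
        entries[term].append(concept_id)
    return dict(entries)


print(group_by_concept(records))
print(group_by_word(records))
```

In the concept-based grouping the polysemous form "bus" appears in two separate entries, each alongside its synonym, whereas in the word-based grouping "bus" is a single entry with two senses and the synonyms "omnibus" and "data bus" are filed separately.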

In concept analysis and representation, attention has traditionally focused on hierarchical relations, i.e. so-called logical or genus-species relations, and part-whole relations. Some research work has been carried out on other types of relation, such as cause-effect relations, but there is little agreement even about the inventory of non-hierarchical relations, let alone their more detailed workings. A great deal of work remains to be done in this area, which is essential if terminology work in non-taxonomic domains is to reflect their conceptual richness adequately.

The vocabulary of a special language, or terminology - as the aggregate of terms used in a particular domain - reflects the knowledge of the relevant domain. While there are on the one hand clear communicative reasons for attempting to stabilise terms (as well as term-concept and concept-concept relations), particularly in domains such as chemistry and medicine, on the other hand the state of human knowledge changes, shifts, progresses, and so on, as a natural part of its evolution. This tension between standardisation and evolution expresses itself in practical terms in the difficult day-to-day decisions taken in the compilation of terminologies, and in theoretical terms in attempts to describe and explain the complex way in which terms and concepts change over time. One rather concrete issue here is the question of how terms "grow": where and how do they emerge, how do they develop and how do they survive or die?

It has been noted that there are considerable synergies between terminology and knowledge engineering; hence the latter is an important source of research on the representation of conceptual relations (see above).

Conclusions

The use of terminology depends crucially on understanding the principles, the motivations, and the methods and techniques used by terminologists and language-users of all shades and opinions. One might even argue that the political and economic well-being of the European Union, where much of the terminology research takes place, depends in part on access to terminology databases. However, the uptake of terminology research and the support for terminology research at regional, national and EU level are inadequate.

At a more general level, terminology research can contribute to research in language and research in aspects of knowledge growth and dissemination. Terminology research can contribute to and benefit from research in linguistics, in studies of writing and innovation, and in information retrieval; open questions in semantics and pragmatics can be focused by working together with well-organised terminology databases, for instance. Knowledge growth and dissemination can benefit from research in neologisms and in changes in terminology usage; indeed, neologism growth can be used as a measure in scientometry.

The above discussion leads us to the following conclusions and recommendations:


(1)

1. The BCS model consists of a set of criteria which describe the industry of information systems engineering and its related training. The model contains a full matrix map which shows the different standards required for each area of activity; each matrix describes an area of work and shows at which level(s) it may be possible to change from one activity to another. It therefore represents the many possible career paths in the information systems engineering industry (cf. [Taylor 91]).

2. A retronym has been defined as "[a] noun fitted with an adjective it never used to need, but now cannot do without" (William Safire, as cited in Stan Kelly-Bootle (1995:185)). Examples include natural language, virtual reality, terrestrial television, etc.