KonText

KonText (Knowledge on Text) is a text analysis program that allows you to select one or more text files from a corpus, and to view, edit, print or analyse the text file(s) by performing tasks such as Key Word In Context (KWIC), Weirdness, wordlists, and indexes within user-defined constraints. KonText can scan texts for specified patterns, consolidate and collate the results, and display them in a variety of formats. KonText can operate on texts in a variety of languages and is extensible through add-on services.

Commands available from the KonText window are:

Main KonText window


KonText Tutorial

Start KonText from the System Quirk Language Engineering Workbench by clicking on the 'KonText' button. Make sure the language setting in the options window conforms to the language you wish to analyse. You can do this by clicking on Options in the main KonText window and changing the language under Input. We will use English here as an example.

Options dialog box

When you have pressed OK you will return automatically to the main KonText window.

If you press the Search Constraints button in the main KonText window, the Set Constraints window will appear. You can now enter text patterns to search for. You can type them in the Include dialog box, each followed by [RETURN].

Empty Set Constraints dialog box

From this dialog, press the lowermost Lists button. Another dialog will appear showing the list of available, pre-defined text patterns for English; select the 'closed.en' patterns by clicking on 'closed.en' and then pressing OK or [ENTER].

The lists dialog box

The Exclude dialog box should now have a list of text patterns, each pattern on a separate line. Press OK for these patterns to take effect.

The Set Constraints dialog box with the selected words from the list

Now, one or more text files need to be selected for scanning. The easiest way to do this is to press Selection in the main KonText window, which opens a file dialog. Select the file named 'demo.txt' by clicking on it and then pressing the OK button.

File dialog

The Select Source Texts window will then appear. Click on the text you want to select and press OK.

Select Source Texts dialog

All that remains is to select the task to perform. You can do this by selecting the required task from the Task pull-down list. Then press the Start button in the main KonText window.

Concordance output includes the frequency of the matched patterns, the context in which they were found (KWIC list) and line reference. Before starting this task, a dialog box will appear, asking for the "width". When you enter, for example, 5, five words will appear on either side of the matched pattern. These words form the context.

Index output includes the matched patterns, their frequency and line references. If Index was the selected task, these would be the results:

Index Results window

Wordlist output includes the matched patterns and their frequency.

Weirdness output includes the matched patterns, their frequency and the relative frequencies of those words compared to general language corpora.

Weirdness Results window

Selecting Index is a good first step. After that, you can try the others yourself and see what happens.

The results of the tasks are shown in a results window:

Index Results window with highlighted line reference

You can save the results as a whole by pressing Save to File, or copy them to QClip (a clipboard). You can also see where a word occurs in the source text by highlighting the relevant line reference and then pressing Show Source.

Show Source window with the output of the highlighted line

Source Documents

The source documents (i.e. texts) to be analysed by KonText may be collected in two ways:

Main KonText window with pull down menu

When you have selected Virtual Corpus from the Source Documents pull-down menu in the main KonText window and pressed Selection, the Virtual Corpus Browser will appear.

Virtual Corpus Browser dialog box

NOTE: Virtual Corpus Browser is different from the Virtual Corpus Manager. Virtual Corpus Browser allows text selection and viewing only.

Search Constraints

Search Constraints allow you to include or exclude certain words from the output of KWIC, wordlists, indexes, weirdness and collocations, with direct use of pre-stored lexica in search patterns. The lexica are created with Save List from the Set Constraints window.

Fuzzy searches are possible through the use of Wildcards.

If the Collocation check-box is checked, the items in the Include list are used to identify collocation patterns and the results window shows the matching output. If it is not checked, collocation patterns produce no output at all. Checking the box also makes the search find capital letters; otherwise it does not. The items should use the following symbols in collocations:
 
Symbol   Meaning               Example Use   Example Output
^        Any number of words   the ^ of      the book of
                                             the fastest car of
                                             the first and the last of
^~n^     Maximum n words       in ^~2^ of    in the event of
                                             in the first of

Include List

The Include List allows you to specify words, phrases or other strings to be searched for in the selected text(s). You can specify any number of words manually by typing them into the list (each followed by [RETURN]), by copying and pasting from another program, or by including previously created lexica. The resulting lexica may be stored in the KonText library. Wildcards may be used to broaden the search.

The Lists button allows you to add a pre-defined lexicon (generated by Save List) to the current list. The Save List button saves the words and patterns in the current list as a new lexicon, which can later be included with the Lists button. You are prompted for a name to identify the lexicon before saving. Pressing the Reset button removes the current Include lexicon.

The Exclude List provides complementary functionality.

Exclude List

The Exclude List allows you to specify words, phrases or other strings to be excluded from output when scanning the selected text(s). You can specify any number of words manually by typing them into the list, copying and pasting from another program, or by inclusion of previously created lexica. The resulting lexica may be stored in the KonText library. Wildcards may be used to broaden the search.

An Exclude List

The Lists button allows you to add a pre-defined lexicon (generated by Save List) to the current list. The Save List button saves the words and patterns in the current list as a new lexicon, which can later be included with the Lists button. You are prompted for a name to identify the lexicon before saving. The Reset button removes the current Exclude lexicon.

This facility can be used in conjunction with the Include List: for example, the include patterns can select a general pattern, with specific exclusions given in the exclude patterns.

The Lists button provides a file selection dialog for inclusion of existing lexica.

File Selection dialog box

Collocation

You can specify word patterns that will be searched for in the text by checking the Collocation check box in the Set Constraints window. A pattern may consist of any number of words, including single words. You add words to a pattern by entering them in the text field in the Include or Exclude window, separating each word by ^ (which stands for "any number of words"). You can use compounds in a pattern by entering the constituent words, separated by spaces or hyphens.

For some collocation analysis it is important to be able to limit the number of words between elements in a collocation pattern. You can do this by using the tilde (~) and a number, which expresses the maximum allowable number of words between the matched elements. For example, in the pattern below the asterisk makes the final element match go, going, gone etc., whereas without the * it would only match 'go':

to^~2^go*

will find "to boldly go", but not "to think before you go"
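The gap rules above can be sketched in a few lines of Python. This is an illustration only, not KonText's own implementation: the function names parse_collocation and find_collocations are hypothetical, a gap is assumed to allow zero words, only the * wildcard is handled (through Python's fnmatch module), letter case is ignored, and matching runs across the whole text rather than stopping at line or sentence boundaries.

```python
import re
from fnmatch import fnmatchcase


def parse_collocation(pattern):
    """Split e.g. 'to^~2^go*' into elements ['to', 'go*'] and gap limits [2].

    A plain ^ gives a limit of None, meaning any number of words.
    """
    parts = re.split(r"\s*\^(?:~(\d+)\^)?\s*", pattern.strip())
    elements = parts[0::2]
    gaps = [None if g is None else int(g) for g in parts[1::2]]
    return elements, gaps


def find_collocations(pattern, text):
    """Return every word sequence in text that satisfies the pattern."""
    elements, gaps = parse_collocation(pattern)
    words = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())

    def extend(pos, k):
        # pos is the index of the word matched by element k-1;
        # try to place element k within the allowed gap.
        if k == len(elements):
            return pos
        limit = gaps[k - 1]
        last = len(words) - 1
        if limit is not None:
            last = min(last, pos + 1 + limit)
        for j in range(pos + 1, last + 1):
            if fnmatchcase(words[j], elements[k]):
                end = extend(j, k + 1)
                if end is not None:
                    return end
        return None

    results = []
    for i, word in enumerate(words):
        if fnmatchcase(word, elements[0]):
            end = extend(i, 1)
            if end is not None:
                results.append(" ".join(words[i:end + 1]))
    return results


print(find_collocations("to^~2^go*", "We want to boldly go there"))  # ['to boldly go']
print(find_collocations("to^~2^go*", "to think before you go"))      # []
```

The second call returns nothing because three words separate "to" and "go", one more than the limit of two.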

The dialog box below gives an example of the use of collocation patterns.

Set Constraints dialog box with collocation patterns.

This is the result of the KWIC search with the collocation patterns included:

 

KWIC Results window with the use of collocation patterns

Normally, matched collocations are reported together, ignoring the number of words between the actual matches. You can report the number of words found in each gap by checking the Collocation gaps check box in the Keep group of the Options dialog box.

Punctuation marks may also be used as components of word patterns. Each line is taken to be a different text pattern. The patterns are active until they are Reset. Lists allows you to use a set of pre-created patterns. By pressing Save List you can store the current patterns in a file of your choice. The OK button must be pressed for the patterns to take effect.

Wildcards

When you specify strings for the Include and Exclude lists in the Search Constraints, you can use the following wildcards.

Wildcard   Description                        Example       Matches
*          Any number of letters              comput*       compute, computes, computer, computers, computing
%          Any one letter                     comput%%%     computers, computing
[ab]       Optional letters a, b or neither   compute[rs]   compute, computer, computes

Any number or combination of wildcards is valid for each pattern.

KonText can also work with compound words. Wherever a single word may be specified as a constraint, so may a compound. Compounds are taken to be any number of words that are entered with blank spaces between them, or words that are hyphenated. The words that make up a compound may also use wildcards. You can use compounds such as those below.

catalytic converter matches 'catalytic converter' or 'catalytic-converter'

cat* con* matches any two words together that start 'cat' and 'con'

cat*-con* has exactly the same effect as 'cat* con*'

catalytic %* finds two word compounds with 'catalytic'

You can also use wildcards within patterns in lexica. For further use, see Options.
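One way to picture how these wildcards behave is to translate each pattern into a regular expression. The Python sketch below is an assumption-laden illustration, not KonText's actual matcher: the function names are hypothetical, "letter" is taken to mean any alphabetic character, matching is case-sensitive, and a space or hyphen in a pattern is allowed to match either a space or a hyphen in the text, as described for compounds above.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a wildcard pattern into a compiled regular expression."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "*":
            parts.append(r"[^\W\d_]*")      # any number of letters
        elif ch == "%":
            parts.append(r"[^\W\d_]")       # exactly one letter
        elif ch == "[":
            end = pattern.index("]", i)
            letters = re.escape(pattern[i + 1:end])
            parts.append(f"[{letters}]?")   # one of the letters, or none
            i = end
        elif ch in " -":
            parts.append(r"[ -]")           # compound: space or hyphen
        else:
            parts.append(re.escape(ch))     # literal character
        i += 1
    return re.compile("".join(parts))


def matches(pattern, candidate):
    """True if the whole candidate string fits the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(candidate) is not None


print(matches("comput*", "computers"))                        # True
print(matches("comput%%%", "compute"))                        # False
print(matches("compute[rs]", "computes"))                     # True
print(matches("catalytic converter", "catalytic-converter"))  # True
```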

Tasks

The predefined tasks supplied with KonText are:

Index

KWIC

Wordlist

Weirdness

Main KonText window with pull down menu of task

Index

The task Index creates a list of all words in the vocabulary of the source documents (unless specific include or exclude lists are used) along with their frequency of occurrence and a line reference to each occurrence in the source document. If you check the Character XRef box in the Options window, you get a character reference instead of a line reference.

The diagram below shows the word count and the vocabulary at the top, followed by the source, which is the file name of the source text.

The output of the task shows a frequency match on the left and the line references on the right.

Index Results window
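As a rough illustration of what the Index task computes, the following Python sketch builds a word index with frequencies and line references. It is not KonText's code: the function names are hypothetical, words are assumed to be runs of the letters a to z with case ignored, and the tab-separated output layout is an assumption.

```python
import re
from collections import defaultdict


def build_index(text):
    """Map each word to the list of 1-based line numbers where it occurs."""
    index = defaultdict(list)
    for line_no, line in enumerate(text.splitlines(), start=1):
        for word in re.findall(r"[a-z]+", line.lower()):
            index[word].append(line_no)
    return dict(index)


def format_index(index):
    """One row per word: frequency, word, line references (alphabetical order)."""
    rows = []
    for word in sorted(index):
        refs = index[word]
        rows.append(f"{len(refs)}\t{word}\t{' '.join(map(str, refs))}")
    return "\n".join(rows)


sample = "the cat\nthe dog and the cat"
print(format_index(build_index(sample)))
```

A word that occurs twice on one line is listed with that line number twice, so the frequency always equals the number of references.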

KWIC

The task KWIC creates a list of all tokens - exemplars of the word, phrase or other string you selected - in the source document (unless specific include or exclude lists are used) and the token's surrounding text (the width of which you can specify). This gives the Key Word In Context (KWIC). You will also see a line reference to each occurrence in the source document, unless the Character XRef is checked; it is then a character reference.

The diagram below shows the word count and the vocabulary at the top, followed by the source, which is the file name of the source text.

The output of the task shows the line reference on the left and the context of the words in order of appearance.

KWIC Results window
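The idea of a keyword with a fixed width of context on either side can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not KonText's implementation: the function name kwic is hypothetical, only a single literal keyword is handled (no wildcards or lists), case is ignored, and the context is allowed to run across line boundaries.

```python
import re


def kwic(text, keyword, width=5):
    """Return (line_no, left_context, match, right_context) for each hit."""
    tokens = []  # (word, line number) pairs in order of appearance
    for line_no, line in enumerate(text.splitlines(), start=1):
        for word in re.findall(r"\w+", line):
            tokens.append((word, line_no))

    hits = []
    for i, (word, line_no) in enumerate(tokens):
        if word.lower() == keyword.lower():
            left = [w for w, _ in tokens[max(0, i - width):i]]
            right = [w for w, _ in tokens[i + 1:i + 1 + width]]
            hits.append((line_no, " ".join(left), word, " ".join(right)))
    return hits


sample = "The quick brown fox\njumps over the lazy dog"
for hit in kwic(sample, "fox", width=2):
    print(hit)  # (1, 'quick brown', 'fox', 'jumps over')
```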

Wordlist

The task Wordlist creates a list of all words in the vocabulary of the source document (unless specific include or exclude lists are used) along with their frequency of occurrence. The output is the same as that for the predefined Index task without any references to the occurrences in the text.

The diagram below shows the word count and the vocabulary at the top, followed by the source, which is the file name of the source text.

The output of the task shows the frequency match on the left and the words of the text on the right.

Wordlist Results window

Weirdness

The task Weirdness applies a statistical approach to terminology extraction from text, comparing the relative frequencies of words that occur in specialist texts with their relative frequencies in general language corpora.

The diagram below shows the word count and the vocabulary, followed by the source, which is the file name of the source text.

The output of the task shows the frequency of each match on the left, followed by the word itself, the frequency ratio of the word in this text and, on the right, the Weirdness of the word compared with its frequency in the general language corpus.

Weirdness Results window
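A comparison of relative frequencies like the one described above can be sketched in Python. The sketch assumes that the Weirdness score is the ratio of a word's relative frequency in the specialist text to its relative frequency in the general corpus; KonText's exact formula is not given here, and the function name, the treatment of words absent from the general corpus (reported as infinity) and the sample data are all illustrative assumptions. Both word lists are assumed to be non-empty.

```python
import math
from collections import Counter


def weirdness(special_words, general_words):
    """Score each word in the specialist text against a general corpus."""
    special = Counter(special_words)
    general = Counter(general_words)
    n_special = len(special_words)
    n_general = len(general_words)

    scores = {}
    for word, freq in special.items():
        rel_special = freq / n_special
        rel_general = general[word] / n_general
        # A word never seen in the general corpus has no finite ratio.
        scores[word] = rel_special / rel_general if rel_general else math.inf
    return scores


special = ["catalytic", "converter", "the", "converter"]
general = ["the"] * 50 + ["converter"] + ["of"] * 49
for word, score in sorted(weirdness(special, general).items()):
    print(word, score)
```

In this made-up sample, "the" scores below 1 because it is relatively rarer in the specialist text than in the general corpus, while "converter" scores far above 1, which marks it as a candidate term.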

Start Task

If you press Start Task, it runs the selected task on the source document(s). A progress bar shows how much of the task has been completed. When the task is complete, the results are displayed in the Results Window.

User File Selection

If User Files is selected from the pull-down menu under Source Documents in the main KonText window, an Open File dialog box is displayed. This allows you to create a list of files to be used as source documents by KonText.

Open File dialog box

You select a text by highlighting it and pressing the OK button. You then see the following window:

Select Source Texts dialog box

The following commands are available:

OK enables the options to take effect.

Add displays the standard Windows Open File dialog box (see below) enabling the user to browse local or networked drives for files to be added to the list of source documents.

Drop removes the selected file(s) from the list of source documents.

Drop all removes all files from the list of documents.

View allows you to view the selected text.

Details displays file properties for the selected file(s) such as name, read/write status and size.

Help displays the Windows help file for the Select Source Texts window.

You select a text and press the OK button.

When Virtual Corpus is selected from the pull-down menu under Source Documents (see main KonText window), the Virtual Corpus Browser is started, allowing you to select files to be used as source documents by KonText.

The Virtual Corpus Browser has restricted functionality as compared to Virtual Corpus Manager. It allows text selection and viewing only, whereas in VCM you can also edit the organisation of the texts.

Virtual Corpus Browser dialog box

Options

KonText has the following groups of options: Input; Keep; Sort; Output.

Input: you may specify whether KonText should consider source document input based on:

sentence line: physical line of text in the file (which may contain a complete paragraph if the source file has been generated from a word processor without line breaks);

paragraph: (for texts generated from a text editor or word processors saved with line breaks) or

words: (a block of).

language: may be set to English or German.

Keep: you may also specify whether or not to keep:

Punctuation,

Numbers,

Hyphens at end of line,

Hyphens in text,

Collocation gaps or

Letter case in the selected text(s).

Sort: enables the results to be sorted according to the:

Frequency

Ending (alphabetically from the end of the match backwards) or

Left Context (alphabetically on the first word to the left of the match for each occurrence of the match in KWIC output).

By default, the results are sorted alphabetically and for KWIC, alphabetically on the first word to the right of the match for each occurrence of the match.
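The three sort orders can be illustrated with short Python key functions. These are sketches under assumptions, not KonText's sorting code: the function names are hypothetical, frequency sorting is assumed to put the most frequent item first, and a KWIC hit is assumed to be a (left context, match, right context) tuple.

```python
def sort_by_frequency(freqs):
    """Words ordered by descending frequency, ties broken alphabetically."""
    return sorted(freqs, key=lambda w: (-freqs[w], w))


def sort_by_ending(words):
    """Alphabetical order read from the end of each word backwards."""
    return sorted(words, key=lambda w: w[::-1])


def sort_by_left_context(hits):
    """KWIC hits ordered by the first word to the left of the match."""
    def left_word(hit):
        left = hit[0].split()
        return left[-1].lower() if left else ""
    return sorted(hits, key=left_word)


print(sort_by_ending(["computing", "going", "computer", "go"]))
# ['going', 'computing', 'go', 'computer']
```

Sorting by ending groups words that share a suffix, here the two "-ing" forms, which is why it compares the reversed spelling.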

Output: if you use multiple input files, these may be merged for output by checking the Merge Files option. The Use Tabs option should be checked when you want to use the results in a table or spreadsheet; spaces between columns of output are then replaced with tabs.

Character XRef: when checked, shows the character numbers rather than line references in the index output.

It can often be useful to specify punctuation as constraints to be processed. Simply include the punctuation mark as if it were a word and check the Punctuation check box in the Keep group of the Options dialog box.

Comments may be placed anywhere in a pattern and are completely ignored by KonText. Comments must be contained between the starting characters /* and the terminating characters */.
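The effect of ignoring comments can be pictured with a small Python sketch that strips them from a pattern before it is used. The function name strip_comments is hypothetical, and trimming the surrounding whitespace is an assumption; how KonText itself treats whitespace around a comment is not stated here.

```python
import re


def strip_comments(pattern):
    """Remove every /* ... */ comment from a pattern line."""
    return re.sub(r"/\*.*?\*/", "", pattern, flags=re.DOTALL).strip()


print(strip_comments("comput* /* all forms of compute */"))  # comput*
print(strip_comments("/* motion verbs */to^~2^go*"))         # to^~2^go*
```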

Options dialog box

Set Up allows you to change how System Quirk is configured.

OK, when pressed, allows the selections you made to take effect.

Help displays Windows Help File for System Quirk.

QClip

The QClip button allows you to copy from the Results window to a Clipboard viewer. Whereas with the Windows Clipboard you would only be able to copy single items at a time, you can use QClip for copying multiple items to external editors or word processors.

QClip Results window

Results Window

Displays the result of the executed task in a scrollable window.

Close closes the window.

Show Source opens the source document in a window and highlights the line referenced, if the results include references to the source document.

Save to File saves the contents of the Results window to a file named by the user.

If you don't enter a full path name, the current working directory is assumed.

Copy to QClip copies the selected text from the Results window into the Clipboard (QClip).

For each file processed you will see a header in the output giving the filename, word count and vocabulary for words that matched the processing constraints:

Word Count: indicates the total number of words matched in the text.

Vocabulary: indicates the number of discrete words matched in the text.
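The difference between the two header figures can be shown with a short Python sketch that counts words after applying an exclude list. It is illustrative only: the function name wordlist is hypothetical, words are assumed to be runs of the letters a to z with case ignored, and the exclude list is assumed to contain plain words without wildcards.

```python
import re
from collections import Counter


def wordlist(text, exclude=()):
    """Return (header, frequencies) for the words that pass the exclude list."""
    excluded = {w.lower() for w in exclude}
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in excluded]
    freqs = Counter(words)
    header = {
        "word_count": len(words),   # total number of matched words
        "vocabulary": len(freqs),   # number of distinct matched words
    }
    return header, freqs


header, freqs = wordlist("The cat saw the other cat", exclude=["the"])
print(header)  # {'word_count': 4, 'vocabulary': 3}
```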

HTML/SGML

You can also use KonText to process texts that have been marked up with HTML/SGML. You can use HTML/SGML markers within the Include Words and Exclude Words constraints. Text between HTML/SGML markers can be included or excluded in the same way as words, using wildcards in the HTML/SGML markers as well. KonText will include or exclude text for a given HTML/SGML marker until its end marker is found. The following are some examples:

<h1> includes/excludes main header text

<h*> includes/excludes all header text (* = wildcard)

With this wildcard you can remove almost every HTML marker. When there are several words between the brackets, type as many asterisks, separated by spaces, as there are words. For instance: <* * *>.
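Including or excluding the text between a marker and its end marker can be sketched in Python as below. This is a simplified illustration, not KonText's parser: the function names are hypothetical, only single-word markers such as <h1> or <h*> are handled (not the multi-word <* * *> form), attributes inside the start marker are skipped, and nested markers of the same name are not supported.

```python
import re


def _marker_regex(marker):
    """Build a regex for a start marker, its text, and the matching end marker."""
    name = re.escape(marker.strip("<>")).replace(r"\*", r"\w*")
    return re.compile(rf"<({name})\b[^>]*>(.*?)</\1\s*>", re.IGNORECASE | re.DOTALL)


def include_marked_text(text, marker):
    """Return only the text found between matching markers."""
    return [m.group(2) for m in _marker_regex(marker).finditer(text)]


def exclude_marked_text(text, marker):
    """Return the text with matching markers and their contents removed."""
    return _marker_regex(marker).sub("", text)


page = "<h1>Title</h1><p>Body text</p><h2>Sub</h2>"
print(include_marked_text(page, "<h*>"))  # ['Title', 'Sub']
print(exclude_marked_text(page, "<h1>"))  # <p>Body text</p><h2>Sub</h2>
```

Note that in this sketch <h*> matches any marker whose name starts with "h", not only header markers.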