2.2.18. Read the text and discuss its basic ideas, relying on the analysis of the modifiers in 2.2.12 as well as on the questions and vocabulary notes given below.
THE PROPER PLACE OF MEN AND MACHINES IN LANGUAGE TECHNOLOGY
Computer-aided speech recognition has been in the research mainstream for the last few decades. "The proper place of men and machines in language translation", that is, the right distribution of labor between human translators and computer-assisted translation systems, is one of the key problems under investigation.
There is a statement attributed to Fred Jelinek: "Every time I fire a linguist, the results of speech recognition go up", i.e. explicit linguistic knowledge is dispensable. This sentiment reflects a paradigm shift that happened in computational linguistics at the beginning of the 1990s: with more and more data available and with advances in the methods of machine learning, approaches switched from careful encoding of linguistic phenomena to finding statistical correlations in texts (either annotated or raw). The vast majority of publications at major conferences on computational linguistics belong to this paradigm, and many present a fairly radical stance: encoding linguistic knowledge explicitly is redundant, since a completely automatic machine learning procedure can quickly produce a fast and reliable NLP component which rivals (and in some cases exceeds) the performance of hard-coded linguistic rules requiring the effort of many person-months (if not years). Hence, the efforts of linguists should be spent on creating data rather than writing rules. In this approach, human effort is invested in creating annotated corpora, representing the data and designing machine learning algorithms, while the machine learns the links within the data. In the end, linguistic knowledge is induced from annotated corpora rather than explicitly hand-crafted by linguists. In a similar way, corpora can be developed without manual selection of texts from a range of sources: texts can be collected by crawling the web or through the API of a search engine and then annotated automatically with respect to their domains and genres.
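To make the idea of inducing knowledge from annotated data concrete, here is a minimal sketch in Python of a trivially simple tagger. The toy corpus and tag set are invented for illustration; the point is only that the "model" consists entirely of word-tag counts collected from the data, with no hand-written rules.

```python
from collections import Counter, defaultdict

# Toy annotated corpus of (word, POS tag) pairs. In practice this would be
# thousands of sentences from a treebank; the data here is purely illustrative.
annotated_corpus = [
    ("the", "DET"), ("cat", "NOUN"), ("sat", "VERB"), ("on", "ADP"),
    ("the", "DET"), ("mat", "NOUN"), ("a", "DET"), ("cat", "NOUN"),
]

# "Learning": count how often each tag occurs with each word...
tag_counts = defaultdict(Counter)
for word, tag in annotated_corpus:
    tag_counts[word][tag] += 1

# ...and keep the most frequent tag per word. No rules are written by hand:
# the word-tag links are induced entirely from the annotated data.
model = {w: c.most_common(1)[0][0] for w, c in tag_counts.items()}

def tag(tokens):
    # Unknown words fall back to NOUN, a common default for open-class words.
    return [(t, model.get(t, "NOUN")) for t in tokens]

print(tag(["the", "cat", "sat", "on", "the", "mat"]))
```

A real statistical tagger is far more sophisticated, of course, but the division of labor is the same: humans annotate, the machine counts and generalizes.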
The automatically induced rules also do not take the form of hard constraints, separating the possible from the impossible, but rather that of graded constraints, distinguishing the more probable from the less probable. This makes the automatically acquired models more robust to noise.
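The contrast between hard and graded constraints can be sketched as follows; the bigram probabilities below are invented for illustration and are not taken from any real corpus.

```python
# A hard constraint either accepts or rejects an analysis outright,
# so a single noisy token can make a whole sentence unanalyzable.
def hard_constraint(tag_sequence):
    return "DET NOUN" in " ".join(tag_sequence)  # possible / impossible

# A graded constraint assigns a probability-like score instead, so noisy
# or unusual input simply gets a lower score rather than being ruled out.
bigram_prob = {("DET", "NOUN"): 0.60, ("DET", "VERB"): 0.01, ("NOUN", "VERB"): 0.35}

def graded_score(tag_sequence, floor=1e-6):
    score = 1.0
    for prev, cur in zip(tag_sequence, tag_sequence[1:]):
        score *= bigram_prob.get((prev, cur), floor)  # unseen pairs: small, not zero
    return score

print(graded_score(["DET", "NOUN", "VERB"]))  # high score: the usual pattern
print(graded_score(["DET", "VERB", "NOUN"]))  # low score, but not "impossible"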
Nevertheless, the approach in question needs to be taken with a pinch of salt. First, it has been reasonably successful because, in fact, it implicitly utilizes some information about the language. Second, data representation in terms of tag labeling is simple and efficient, but a tag label lacks information about the internal structure of linguistic phenomena. For example, agreement in case, number and gender may not be taken into account by a system processing noun phrases: if the set of training examples does not contain, say, the proper nominative singular masculine noun in such a sequence, the tagger will fail to treat the sequence as a noun phrase. Another problem with purely statistical methods is their reliance on the patterns present in the training data. Each training set has its own peculiarities, which do not necessarily match the peculiarities of the application domain, so the accuracy of taggers trained on particular corpora varies dramatically on other text genres. Annotating texts in the application domain to obtain more training data is expensive, so the tools are often used in new domains without formal evaluation of their accuracy. To the best of our knowledge, this problem is partly addressed by new approaches to machine learning based on domain adaptation, which use a training corpus from the source domain (with available annotated data), a small number of annotated examples from the target domain and a large number of unlabelled examples from the target domain.
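The text does not name a specific domain-adaptation method. One well-known representative of this family, shown below with hypothetical feature names, is the "frustratingly easy" feature-augmentation technique of Daumé III (2007); it is offered here only as a sketch of the general idea.

```python
# Feature augmentation (Daume III, 2007): every feature is duplicated into
# a shared copy and a domain-specific copy, so the learner can separate
# regularities that transfer across domains from those specific to the
# source or the target corpus. Feature names here are hypothetical.
def augment(features, domain):
    augmented = {}
    for name, value in features.items():
        augmented[f"shared:{name}"] = value    # version seen in all domains
        augmented[f"{domain}:{name}"] = value  # version tied to this domain
    return augmented

print(augment({"suffix=ing": 1, "capitalized": 0}, domain="news"))
print(augment({"suffix=ing": 1, "has_digit": 1}, domain="social_media"))
```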
In addition to this known problem of unknowns under domain mismatch, there is a problem of unknown knowns, namely when peculiarities inherent in the annotated set are not obvious, while machine learning is likely to emphasize them in making classification decisions. As a result, a system might achieve reasonably good accuracy on the held-out portion of the annotated set (since it is drawn from the same distribution), while this accuracy figure could be irrelevant outside the annotated set. For example, in the field of automatic genre classification it has been shown that a large number of texts on a particular topic under a genre heading can considerably affect the decisions made by the classifier, e.g. by treating texts on hurricanes and taxation as belonging to FAQs (Frequently Asked Questions). At the same time, a classifier based on POS trigrams is much less successful overall, but it suffers less from the transfer from one annotated set to another.
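The difference between topic-sensitive and topic-free features can be illustrated with a toy document; the tag lookup below is invented for the example, whereas a real setup would use a trained tagger.

```python
# Two feature views of the same toy document: surface words carry the topic
# (e.g. "taxes"), which a learner may latch onto as an "unknown known",
# while POS trigrams abstract away from topic and transfer better.
toy_tags = {"how": "ADV", "do": "AUX", "i": "PRON", "file": "VERB",
            "my": "DET", "taxes": "NOUN", "?": "PUNCT"}

doc = ["how", "do", "i", "file", "my", "taxes", "?"]

word_features = set(doc)                      # topical: ties FAQs to taxation
tags = [toy_tags[w] for w in doc]
pos_trigram_features = {tuple(tags[i:i + 3])  # topic-free: question structure
                        for i in range(len(tags) - 2)}

print(word_features)
print(pos_trigram_features)
```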
Finally, there are problems with correcting the results. An error produced by a rule-based tagger can be corrected by debugging: finding the incorrectly fired rule, modifying it and testing the performance again. A statistical model can be amended by modifying the learning parameters or by providing more data, but this is only indirectly related to the performance of the system on an individual problem.
In any case, the main contribution of the survey analysis is twofold. First, we reveal the advantages of a baseline for natural language processing that uses only statistical methods and minimal adjustment to the representation of the source data; in spite of its minimalism, the baseline outperforms the majority of rule-based systems. Second, the tools mentioned are available for linguistic research. This defines the entire pipeline, which starts with POS tagging of pre-tokenized texts, proceeds to lemmatization and ends with syntactic parsing.
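The shape of such a pipeline can be sketched as follows; the three stage functions are stubs standing in for trained statistical components, not the actual tools the survey describes.

```python
# Sketch of the pipeline: pre-tokenized text goes through POS tagging,
# then lemmatization, then syntactic parsing.
def pos_tag(tokens):
    return [(t, "NOUN") for t in tokens]           # stub tagger

def lemmatize(tagged):
    return [(w, t, w.lower()) for w, t in tagged]  # stub lemmatizer

def parse(lemmatized):
    return {"tokens": lemmatized, "tree": None}    # stub parser

def pipeline(tokens):
    return parse(lemmatize(pos_tag(tokens)))

print(pipeline(["Cats", "sleep"]))
```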
From arXiv, Cornell University Library, https://arxiv.org/abs/1505.03132
