Abstract
Spoken corpora have long been of interest to researchers but have also been more challenging to compile than written corpora. Early “corpora” (those developed before the 1980s, when corpora became computerized) were more often based on spoken than on written language, but they were quite small and focused primarily on the study of phonetic features (Ling 1999: 240, qtd. in McEnery, Xiao, and Tono 2006: 3). Such corpora were rightly criticized for being “skewed,” as they could not claim to be representative of speech as a whole or even in particular domains due to their small size and inability to be analyzed quantitatively (McEnery, Xiao, and Tono 2006: 4). Modern spoken corpora consist of much larger, transcribed texts stored on computers, which enables researchers to use quantitative methods of analysis. However, some issues remain: (1) obtaining consent for gathering spoken data, particularly in more sensitive domains such as legal and medical interactions; (2) the time-consuming nature of transcription; and (3) the lack of reliable automatic analysis tools for some spoken features, such as prosodic features. Due particularly to the first two limitations, spoken corpora are less numerous than written corpora and have tended to focus on more limited domains. A number of available spoken corpora contain face-to-face conversation, for example, the London-Lund Corpus (LLC), the Cambridge and Nottingham Corpus of Discourse in English (CANCODE), the British National Corpus (BNC), the Lancaster/IBM Spoken English Corpus (SEC), and the Santa Barbara Corpus of Spoken American English (SBCSAE). While these corpora also contain other spoken registers, face-to-face conversation forms the bulk of the texts in their spoken sections. However, a growing number of spoken corpora focus on other, more specialized registers of speech. Among the corpora that are publicly available, two useful examples are COCA and MICASE.
Although COCA simply calls its spoken subcorpus “speech,” it consists primarily of transcripts of news programs and talk shows, not face-to-face interaction.
Original language | English (US) |
---|---|
Title of host publication | The Cambridge Handbook of English Corpus Linguistics |
Publisher | Cambridge University Press |
Pages | 271-291 |
Number of pages | 21 |
ISBN (Electronic) | 9781139764377 |
ISBN (Print) | 9781107037380 |
State | Published - Jan 1 2015 |
Externally published | Yes |
ASJC Scopus subject areas
- General Arts and Humanities
- General Social Sciences