Spoken discourse

Research output: Chapter in Book/Report/Conference proceeding › Chapter

2 Scopus citations


Spoken corpora have long been of interest to researchers, but they are also more challenging to compile than written corpora. Early “corpora” (those developed before the 1980s, when corpora became computerized) were more often based on spoken than on written language, but they were quite small and focused primarily on the study of phonetic features (Ling 1999: 240, qtd. in McEnery, Xiao, and Tono 2006: 3). Such corpora were rightly criticized for being “skewed,” as they could not claim to be representative of speech as a whole or even in particular domains due to their small size and inability to be analyzed quantitatively (McEnery, Xiao, and Tono 2006: 4). Modern spoken corpora are much larger and consist of transcribed texts stored on computers, which enables researchers to use quantitative methods of analysis. However, several issues remain: (1) consent for gathering spoken data, particularly in more sensitive domains such as legal and medical interactions; (2) the time-consuming nature of transcription; and (3) the lack of reliable automatic analysis tools for some features of speech, such as prosody. Largely because of the first two limitations, spoken corpora are less numerous than written corpora and have tended to focus on more limited domains. A number of available spoken corpora contain face-to-face conversation, for example the London-Lund Corpus (LLC), the Cambridge and Nottingham Corpus of Discourse in English (CANCODE), the British National Corpus (BNC), the Lancaster/IBM Spoken English Corpus (SEC), and the Santa Barbara Corpus of Spoken American English (SBCSAE). While these corpora also contain other spoken registers, face-to-face conversation forms the bulk of the texts in the spoken sections. However, a growing number of spoken corpora focus on other, more specialized registers of speech. Of the corpora that are publicly available, two useful examples are COCA and MICASE.
Although COCA simply labels its spoken subcorpus “speech,” that subcorpus consists primarily of transcripts of news programs and talk shows, not face-to-face interaction.

Original language: English (US)
Title of host publication: The Cambridge Handbook of English Corpus Linguistics
Publisher: Cambridge University Press
Number of pages: 21
ISBN (Electronic): 9781139764377
ISBN (Print): 9781107037380
State: Published - Jan 1 2015
Externally published: Yes

ASJC Scopus subject areas

  • General Arts and Humanities
  • General Social Sciences

