Corpus design I

Corpus design
See
G Kennedy, Introduction to Corpus Linguistics, Ch. 2
CF Meyer, English Corpus Linguistics, Ch. 2
What is a corpus?
• Corpus (pl. corpora) = ‘body’
• Collection of written text or transcribed speech
• Usually but not necessarily purposefully collected
• Usually but not necessarily structured
• Usually but not necessarily annotated
• (Usually stored on and accessible via computer)
• Corpus ~ text archive
Issues in corpus design
• General purpose vs specialized
• Dynamic (monitor) vs static
• Representativeness and balance
• Size
• Storage and access
• Permission
• Text capture and markup
• Organizations
General purpose vs specialized
• Probably obvious how to assemble
specialized corpus: appropriateness of texts
for inclusion is self-defined
• General-purpose corpus implies very
careful planning to ensure balance
• Implies making some assumptions about the
nature of language, even though (as corpus
linguists) that may go against the grain
Dynamic vs static
• Static corpus will give a snapshot of language use
at a given time
– Easier to control balance of content
– May limit usefulness, esp. as time passes (eg Brown
corpus now of historical interest, in some respects BNC
already out of date)
• Dynamic corpus ever-changing
– Called “monitor” corpus because it allows us to monitor
language change over time
– But more or less impossible to ensure balance
Planned balance: example of BNC
• Sampling and representativeness very difficult to ensure
– BNC designers very explicit about their assumptions
– Acknowledge that many decisions are subjective in the end
• 100 m words of contemporary spoken and written British
English
• Representative of BrE “as a whole”
• Balanced with regard to genre, subject matter and style
• Also designed to be appropriate for a variety of uses:
lexicography, education, research, commercial applications
(computational tools)
BNC
• 4,124 texts: 90% written, 10% spoken
– Largest collection of spoken English ever collected
(10m words), but reflects typical imbalance in favour of
written text (for understandable practical reasons)
– Written portion: 75% informative, 25% imaginative
– Amount of fiction is slightly disproportionately high
compared to amount published during the sampling
period, justified because of cultural importance of
fiction and creative writing
Subject coverage
• Planned to reflect pattern of book
publishing in UK over last 20 years
Subject            Number of texts   % of total written
Imaginative        625               22
World affairs      453               18
Social science     510               15
Leisure            374               11
Applied science    364               8
Commerce           284               8
Arts               259               8
Natural science    144               4
Belief & thought   146               3
Unclassified       50                3
Sources of written material
• 60% books
• 25% periodicals
• 5% brochures and other ephemera
– eg bus tickets, produce containers, junk mail
• 5% unpublished letters, essays, minutes
• 5% plays, speeches (written to be spoken)
Register “levels”
• 30% literary or technical “high”
• 45% “middle”
• 25% informal “low”
• Obvious difficulty of how to judge levels a priori
Spoken corpus
• Context-governed material
– Lectures, tutorials, classrooms
– News reports
– Product demonstrations, consultations, interviews
– Sermons, political speeches, public meetings,
parliamentary debates
– Sports commentaries, phone-ins, chat shows
– Samples from 12 different regions
Spoken corpus
• Ordinary conversation
– 2000 hrs from 124 volunteers, 38 different regions
– Four different socio-economic groupings
– Equal male and female, age range 15 to 60+
– All conversations over a 2-day period recorded
– No secret recording, and allowed to erase
– Systematic details kept of time, location, details of participants
(sex, age, race, occupation, education, social group), topic, etc.
– Transcription issues:
• include false starts, hesitations, etc.
• some paralinguistic features (shouting, whispering)
• use of dialect words/grammar
• but no phonetic information
Another example: ICE
• Collection of samples of English as spoken/written
around the world
• Common design (as well as common annotation
scheme, and shared tools for exploitation)
– 500 texts of approximately 2,000 words each
– 60% spoken, 40% written
– Specific domains and genres prescribed
• Prescribing common design in this way makes the
corpora comparable
ICE text categories
(Each sample should be 2000 words)

Spoken (300)
  Dialogues (180)
    Private (100)
      Conversations (90)
      Phone calls (10)
    Public (80)
      Class lessons (20)
      Broadcast discussions (20)
      Broadcast interviews (10)
      Parliamentary debates (10)
      Cross-examinations (10)
      Business transactions (10)
  Monologues (120)
    Unscripted (70)
      Commentaries (20)
      Unscripted speeches (30)
      Demonstrations (10)
      Legal presentations (10)
    Scripted (50)
      Broadcast news (20)
      Broadcast talks (20)
      Non-broadcast talks (10)
Written (200)
  Non-printed (50)
    Student writing (20)
      Student essays (10)
      Exam scripts (10)
    Letters (30)
      Social letters (15)
      Business letters (15)
  Printed (150)
    Academic (40)
      Humanities (10)
      Social Sciences (10)
      Natural Sciences (10)
      Technology (10)
    Popular (40)
      Humanities (10)
      Social Sciences (10)
      Natural Sciences (10)
      Technology (10)
    Reportage (20)
      Press reports (20)
    Instructional (20)
      Administrative writing (10)
      Skills/hobbies (10)
    Persuasive (10)
      Editorials (10)
    Creative (20)
      Novels (20)
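
A prescribed design like this can be written down as a nested data structure and checked mechanically. The sketch below is illustrative only: the names ICE_DESIGN, WORDS_PER_TEXT and count_texts are made up for this example, and the counts are simply the ones listed above.

```python
# Number of texts per ICE category, copied from the category list above.
# The nesting mirrors the hierarchy; leaves are text counts.
ICE_DESIGN = {
    "Spoken": {
        "Dialogues": {
            "Private": {"Conversations": 90, "Phone calls": 10},
            "Public": {
                "Class lessons": 20,
                "Broadcast discussions": 20,
                "Broadcast interviews": 10,
                "Parliamentary debates": 10,
                "Cross-examinations": 10,
                "Business transactions": 10,
            },
        },
        "Monologues": {
            "Unscripted": {
                "Commentaries": 20,
                "Unscripted speeches": 30,
                "Demonstrations": 10,
                "Legal presentations": 10,
            },
            "Scripted": {
                "Broadcast news": 20,
                "Broadcast talks": 20,
                "Non-broadcast talks": 10,
            },
        },
    },
    "Written": {
        "Non-printed": {
            "Student writing": {"Student essays": 10, "Exam scripts": 10},
            "Letters": {"Social letters": 15, "Business letters": 15},
        },
        "Printed": {
            "Academic": {
                "Humanities": 10,
                "Social Sciences": 10,
                "Natural Sciences": 10,
                "Technology": 10,
            },
            "Popular": {
                "Humanities": 10,
                "Social Sciences": 10,
                "Natural Sciences": 10,
                "Technology": 10,
            },
            "Reportage": {"Press reports": 20},
            "Instructional": {"Administrative writing": 10, "Skills/hobbies": 10},
            "Persuasive": {"Editorials": 10},
            "Creative": {"Novels": 20},
        },
    },
}

WORDS_PER_TEXT = 2000  # approximate sample length prescribed by the design


def count_texts(node):
    """Recursively sum the text counts below a category."""
    if isinstance(node, int):
        return node
    return sum(count_texts(child) for child in node.values())


total_texts = count_texts(ICE_DESIGN)
print(total_texts, "texts,", total_texts * WORDS_PER_TEXT, "words")
print("spoken share:", count_texts(ICE_DESIGN["Spoken"]) / total_texts)
```

The subtotals agree with the figures in brackets (300 spoken, 200 written, 500 in all), which gives roughly one million words per ICE component.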
Length of corpus
• Resources available to create and manage corpus
determine how long it can be
– Funding, researchers, computing facilities
• Speech is easy to capture, but much more time-consuming
to process than written language
– Transcription and annotation requires 6 person-hours per 1 minute
of speech (Santa Barbara Corpus of Spoken American English)
– 4 person-hours per 1,000 words of written sample, but between 5
and 10 person-hours per 1,000 words of speech (more for
dialogues due to overlapping speech) (International Corpus of
English)
• On this basis, American component of ICE would take one
researcher working 40 hrs/week 3 years to complete
• BNC is 100 times bigger than that
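
The three-year figure can be reproduced roughly from the ICE rates quoted above. This is a back-of-the-envelope sketch, assuming the standard ICE design (300 spoken and 200 written texts of 2,000 words each) and a 52-week working year. Both are assumptions of this example, not figures from the ICE project.

```python
WORDS_PER_TEXT = 2000
SPOKEN_TEXTS, WRITTEN_TEXTS = 300, 200   # assumed: standard ICE design
HOURS_PER_1000_WRITTEN = 4               # person-hours per 1,000 written words
HOURS_PER_1000_SPOKEN = (5, 10)          # lower and upper rate for speech
HOURS_PER_WEEK = 40
WEEKS_PER_YEAR = 52                      # assumed: no holidays


def effort_years(spoken_rate):
    """Person-years for one researcher at the given spoken-text rate."""
    written_thousands = WRITTEN_TEXTS * WORDS_PER_TEXT / 1000   # 400
    spoken_thousands = SPOKEN_TEXTS * WORDS_PER_TEXT / 1000     # 600
    hours = (written_thousands * HOURS_PER_1000_WRITTEN
             + spoken_thousands * spoken_rate)
    return hours / HOURS_PER_WEEK / WEEKS_PER_YEAR


low, high = (effort_years(rate) for rate in HOURS_PER_1000_SPOKEN)
print(f"{low:.1f} to {high:.1f} person-years")  # prints "2.2 to 3.7 person-years"
```

The range brackets the three years quoted, and the spoken texts account for most of the effort at either rate.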
Length of corpus
• Length is also determined by the use to which it will be put
• Corpora for lexicographic use need to be (much) bigger
– Early corpora (1m words) seemed huge, mainly due to limitations
of computers to process them
– Sinclair (1991) described a 20m word corpus as “small but
nevertheless useful”
– Even in a billion-word corpus, data for some words/constructions
would be sparse
• How many tokens of a linguistic item are needed for
descriptive adequacy?
– Typically 40-50% of all word types occur only once in a given text
(or corpus)
– For polysemous words at least half of the possible meanings will
occur only once (if at all)
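
The proportion of word types that occur only once can be measured on any text. The sketch below is a minimal illustration: hapax_share is a made-up name, and the regular-expression tokenizer is a crude stand-in for proper tokenization.

```python
import re
from collections import Counter


def hapax_share(text):
    """Fraction of word types that occur exactly once in the text."""
    # Crude tokenizer (an assumption): lower-case runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    if not counts:
        return 0.0
    once_only = sum(1 for n in counts.values() if n == 1)
    return once_only / len(counts)


sample = "the cat sat on the mat and the dog sat on the log"
print(hapax_share(sample))  # 5 of the 8 types occur once: 0.625
```

A toy sentence like this says nothing about real corpora. The 40-50% figure above refers to full texts, where the share depends on text length and on how tokens are defined.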
“Type” and “token”
• “Token” means individual occurrence of a
word
• “Type” means a distinct word (all tokens of the same word count as one type)
• The man saw the girl with the telescope
– 8 tokens, 6 types
• “Type” may refer to lexeme, or individual
word form
– run, runs, ran, running: 1 or 4 types?
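
The counts for the example sentence, and the two answers to the run/runs/ran/running question, can be reproduced as follows. This is a minimal sketch: it splits on whitespace and ignores case, and the LEMMAS table is a hypothetical toy mapping, not a real lemmatizer.

```python
sentence = "The man saw the girl with the telescope"

tokens = sentence.lower().split()   # every occurrence counts
types = set(tokens)                 # each distinct word form counts once

print(len(tokens), "tokens,", len(types), "types")  # 8 tokens, 6 types

# Lexeme vs word form: the answer depends on what counts as "the same word".
forms = ["run", "runs", "ran", "running"]
LEMMAS = {"runs": "run", "ran": "run", "running": "run"}  # hypothetical toy table

word_form_types = set(forms)
lexeme_types = {LEMMAS.get(form, form) for form in forms}

print(len(word_form_types), "word-form types,", len(lexeme_types), "lexeme type")
```

Counting by word form gives 4 types and counting by lexeme gives 1, so type counts are only comparable when the definition of a type is stated.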
• Some attempts to base corpus size on known
statistics of existing corpora
– Biber (1993): “reliable information” on frequently
occurring linguistic items such as nouns can be obtained
from a 120k-word sample, while an infrequently
occurring construction such as a conditional clause would
need 2.4m words
– How are such figures arrived at?
– Observe point at which measures stabilise
• Also, how much data can a lexicographer absorb?
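
One simple way to “observe the point at which measures stabilise” is to track an item's relative frequency as the sample grows and note where it settles. The sketch below is an illustration under stated assumptions, not Biber's procedure: the function name, the checkpoint step and the 10% tolerance band are all choices made for this example.

```python
def stabilisation_point(tokens, item, step=1000, tolerance=0.1):
    """Smallest checked sample size from which the relative frequency of
    `item` stays within `tolerance` (relative) of its whole-sample value.

    Returns None if the token list is empty.
    """
    if not tokens:
        return None
    final = tokens.count(item) / len(tokens)
    hits = 0
    stable_from = None
    for size, token in enumerate(tokens, start=1):
        if token == item:
            hits += 1
        # Check the running frequency every `step` tokens and at the end.
        if size % step == 0 or size == len(tokens):
            rate = hits / size
            if abs(rate - final) <= tolerance * final:
                if stable_from is None:
                    stable_from = size
            else:
                stable_from = None  # drifted out of the band: start over
    return stable_from


# Synthetic data: "x" is over-represented early on, then settles at 50%.
data = ["x"] * 20 + ["y"] * 20 + ["x", "y"] * 30
print(stabilisation_point(data, "x", step=10))  # 40
```

On real data the stabilisation point depends heavily on the item's frequency, which fits the contrast above between nouns and conditional clauses.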