ROBUST SEGMENTATION AND RETRIEVAL
OF ENVIRONMENTAL SOUNDS
by
Gordon Wichern
A Dissertation Presented in Partial Fulfillment
of the Requirements for the Degree
Doctor of Philosophy
ARIZONA STATE UNIVERSITY
May 2010
ROBUST SEGMENTATION AND RETRIEVAL
OF ENVIRONMENTAL SOUNDS
by
Gordon Wichern
has been approved
March 2010
Graduate Supervisory Committee:
Harvey Thornburg, Co-Chair
Andreas Spanias, Co-Chair
Lina Karam
Antonia Papandreou-Suppappola
Gang Qian
ACCEPTED BY THE GRADUATE COLLEGE
ABSTRACT
The proliferation of mobile computing has provided much of the world with the ability
to record any sound of interest, or possibly every sound heard in a lifetime. The technology to continuously record the auditory world has applications in surveillance, biological
monitoring of non-human animal sounds, and urban planning. Unfortunately, the ability to
record anything has led to an audio data deluge, where there are more recordings than time
to listen. Thus, access to these archives depends on efficient techniques for segmentation
(determining where sound events begin and end), indexing (storing sufficient information
with each event to distinguish it from other events), and retrieval (searching for and finding desired events). While many such techniques have been developed for speech and music
sounds, the environmental and natural sounds that compose the majority of our aural world
are often overlooked.
The process of analyzing audio signals typically begins with acoustic
feature extraction, where a frame of raw audio (e.g., 50 milliseconds) is converted into a
feature vector summarizing the audio content. In this dissertation, a dynamic Bayesian
network (DBN) is used to monitor changes in acoustic features in order to determine the
segmentation of continuously recorded audio signals. Experiments demonstrate effective
segmentation performance on test sets of environmental sounds recorded in both indoor
and outdoor environments.
Once segmented, every sound event is indexed with a probabilistic model, summarizing
the evolution of acoustic features over the course of the event. Indexed sound events are then
retrieved from the database using different query modalities. Two important query types
are sound queries (query-by-example) and semantic queries (query-by-text). By treating
each sound event and semantic concept in the database as a node in an undirected graph, a
hybrid (content/semantic) network structure is developed. This hybrid network can retrieve
audio information from sound or text queries, and can automatically annotate an unlabeled
sound event with an appropriate semantic description. Successful retrieval of environmental
sounds from the hybrid network using several test databases is demonstrated and quantified
in terms of standard information retrieval metrics.
ACKNOWLEDGMENTS
First, I would like to express my sincere gratitude to my advisor, Professor Harvey
Thornburg, for his invaluable guidance and support. He has provided me with the skills
necessary to become a successful researcher, and helped me achieve more than I ever thought
possible.
Next, I would like to thank my co-chair, Professor Andreas Spanias, for helping me successfully navigate the many administrative hurdles present in the graduate school process.
I am extremely grateful for his time and support.
I would also like to thank Professors Gang Qian, Antonia Papandreou-Suppappola,
and Lina Karam for serving on my Ph.D. supervisory committee. Their assistance is greatly
appreciated.
I'd also like to thank the School of Arts, Media and Engineering for providing me with
the NSF IGERT traineeship that supported me while I completed this work.
My colleagues in the Arts, Media and Engineering (AME) and Electrical Engineering
departments deserve thanks for creating a fun and challenging work environment. Special
thanks is due Brandon Mechtley, Alex Fink, Jiachen Xue, Jinru Liu, Patrick Cheung, and
Harish Krishnamoorthi.
Finally, I would like to thank my parents Calvin and Carol Wichern and my sister
Elsbeth for always being there and helping me get to where I am today.
TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
CHAPTER 1  INTRODUCTION
  1.1. The Speech-Music-Soundscape Continuum
  1.2. Motivation and Applications
  1.3. Computational Environmental Sound Processing
    1.3.1. Segmentation
    1.3.2. Indexing and Retrieval
    1.3.3. Incorporating Semantic Information
  1.4. Overview and Organization
CHAPTER 2  ACOUSTIC FEATURE EXTRACTION
  2.1. Introduction
  2.2. Types of Audio Features
    2.2.1. Speech Features
    2.2.2. Music Features
    2.2.3. General Audio Features
  2.3. Choosing a Feature Set
    2.3.1. Illustrative Example
  2.4. Conclusion
CHAPTER 3  AUTOMATIC AUDIO SEGMENTATION
  3.1. Introduction
  3.2. Single Feature Segmentation
    3.2.1. Distributional Specifications
  3.3. Approximate Viterbi Inference
  3.4. Parameter Learning
  3.5. Multi-feature Segmentation
    3.5.1. Distributional Specifications
    3.5.2. Approximate Viterbi Inference and Parameter Learning
  3.6. Results and Discussions
  3.7. Conclusion
CHAPTER 4  EXAMPLE-BASED RETRIEVAL
  4.1. Introduction
  4.2. Likelihood-based Retrieval
  4.3. Template Construction
    4.3.1. Query Likelihood Model
  4.4. Cluster-based Indexing
    4.4.1. Modified Spectral Clustering
    4.4.2. Cluster Template Construction
  4.5. Results and Discussions
    4.5.1. Experimental Setup
    4.5.2. Clustering Performance
    4.5.3. BIC Determination of Number of Clusters
    4.5.4. Computational Cost
  4.6. Conclusion
CHAPTER 5  INCORPORATING SEMANTIC INFORMATION
  5.1. Introduction
  5.2. Probabilistic Retrieval Ontologies
  5.3. An Ontological Framework Connecting Semantics and Sound
  5.4. Acoustic Information: Sound-Sound Links
  5.5. Semantic Information: Concept-Concept Links
  5.6. Social Information: Sound-Concept Links
  5.7. Results and Discussion
    5.7.1. Experimental setup
    5.7.2. Annotation
    5.7.3. Text-based retrieval
    5.7.4. In vocabulary semantic information
  5.8. Conclusions and Future Work
CHAPTER 6  CONCLUSIONS
  6.1. Summary
  6.2. Future Directions
REFERENCES
APPENDIX A  INSTITUTIONAL REVIEW BOARD APPROVAL DOCUMENTS
LIST OF TABLES

2.1. Percentage of computation time for tasks in the feature extraction process
2.2. Mean feature values for four example events
3.1. Mode transition probabilities for P(mu_t^(k) | mu_{t-1}^(k), M_t, M_{t-1})
3.2. Segmentation performance for approximately two hours of continuous recordings in both indoor and outdoor environments
4.1. Mean average precision for manually and automatically segmented sounds
4.2. Mean average precision for indoor and outdoor queries
5.1. Database partitioning procedure for each cross-validation run
5.2. Annotation performance using out of vocabulary semantic concepts
5.3. Text-based retrieval performance using out of vocabulary semantic concepts
5.4. Top four results from Soundwalks data set for text-based retrieval with out of vocabulary query "rail". Parenthetical descriptions are not actual tags, but provided to give an idea of the acoustic content of the sound files
5.5. Performance of retrieval tasks with the Soundwalks dataset using WordNet connections between in vocabulary semantic concepts
LIST OF FIGURES

1.1. Schematic diagram concerning the segmentation, indexing, and retrieval of continuously recorded environments
2.1. Waveform and feature trajectories from an example recording
3.1. Markov transition diagram for global mode M_t, where p_new is the prior probability of a new onset, while p_off is the prior probability of a sound turning off, given that it is currently on
3.2. Directed acyclic graph for single feature audio segmentation model
3.3. Directed acyclic graph for feature fusion audio segmentation model
3.4. Spectral feature segmentation example in an indoor environment. (a) Signal waveform (upper panel) and global mode M_{1:T} (lower panel). (b)-(d) Extracted audio features (upper panel) along with fusion gating parameter mu_{1:T} (lower panel)
3.5. Multi-feature segmentation example for footsteps on metal sound recorded in a windy outdoor environment. (a) Signal waveform (upper panel) and global mode M_{1:T} (lower panel). (b)-(d) Extracted audio features (upper panel) along with fusion gating parameter mu_{1:T} (lower panel)
4.1. Example of LS quadratic fit (solid line) and corresponding control points for a harmonicity trajectory (dotted line)
4.2. Markov transition diagrams for A^(i,j) under the three possible polynomial fits of the feature trajectories
4.3. Fictitious metric space using D(F^(i), F^(j)) extended to incorporate query features Y_{1:T}. The F^(i) are plotted with solid circles, the Y_{1:T} with an 'x', and the cluster centroids with diamonds
4.4. Recall and precision curves averaged over ten example queries and five user relevancy rankings for indoor and outdoor sounds
4.5. BIC value vs. number of clusters
5.1. Operating modes of hybrid network for audio retrieval and annotation. Dashed lines indicate links added at query time
5.2. An example hybrid network illustrating the difference between in and out of vocabulary tags
5.3. Precision and recall curves for annotation of unlabeled sound files in the Soundwalks dataset averaged over 10 cross-validation splits
5.4. Recall and precision curves for text-based retrieval of unlabeled sound files in the Soundwalks dataset averaged over 10 cross-validation splits
1. INTRODUCTION
Current technology has provided a large percentage of the world population with the ability
to record, process, and publish on the internet any piece of audio that interests them. While
a majority of the research related to this new audio content has focused on the important
categories of speech and music, the natural and environmental
sounds that compose the majority of our aural world are often overlooked. However, the need to automatically organize
and extract information from environmental sounds is rapidly increasing, as evidenced by
the growth of the internet audio database Freesound.org where users contribute and share
recordings of "sounds, not songs", along with renewed interest in Vannevar Bush's MEMEX
device [1,2] where continuous sound recordings become an aid to thought and memory. Unfortunately, these applications have also created a so-called audio "data deluge" where there
is more sound than people have the time or patience to listen to.
Therefore, efficient indexing mechanisms are needed, which allow individual sound
events to be segmented and linked both to similar events and to the context of the original
recording. While many such mechanisms have been discussed in the context of speech and
musical sounds (e.g., automatic music transcription, speech recognition, music classification/discovery, and speaker identification), this dissertation presents a system especially
tailored to natural and environmental sounds. Specifically, we focus on the problems of
segmentation and joint content/semantic indexing. In the segmentation problem we infer
the onset and end times of sound events in continuous recordings. We then index each
segmented sound event using not only its audio content, but also the ways in which that
content corresponds to semantics and natural language, which allows for intuitive retrieval.
For example, assume microphones were used to augment the video surveillance system of
a parking lot where a car was stolen. To quickly find the point in time when the crime
occurred, a security official might use semantic queries such as "loud", "breaking glass", or
"car alarm" and then use the onset times of the sound events retrieved by these queries as
a starting point in reviewing the video footage.
We begin this chapter by formalizing our definition of natural and environmental
sounds, and then survey a diverse range of multi-disciplinary applications based around
the analysis of large environmental sound collections. These applications accentuate the
need for automatic techniques to organize audio databases, while also allowing for intuitive
and efficient sound retrieval. Next, we introduce a series of algorithms developed for the
segmentation, indexing, and retrieval of environmental and natural sounds. While these
algorithms utilize advanced signal processing techniques, they also provide a semantic interface where sounds can be automatically retrieved and described with natural language.
1.1 The Speech-Music-Soundscape Continuum
The title of this section was borrowed from Barry Truax's description of acoustic communication systems [3], and provides a convenient framework for introducing our definition of
environmental and natural sounds. The term soundscape was popularized by R. Murray
Schafer [4], who defined it simply as "the sonic environment", and technically as any
portion of an actual or abstract environment that can be regarded as a field of study. Truax
expanded this definition by describing the sonic environment
as "the aggregate of all sound
energy in any given context", while soundscape emphasizes how the sonic environment is
understood by the people who both create and live within it. This element of human understanding will be extremely important in our formal definition of environmental sounds, and
their differences from speech and music. We now summarize the speech-music-soundscape
continuum of [3].
There are many reasons for ordering the acoustic communication continuum as speech,
music, and soundscape [3]. The first reason is the number of possibilities each group can
take: approximately 40 phonemes in a language for speech, a large and ever growing number
of instruments used to create music, and the extremely wide variety of sounds that compose
our sonic environment.
A second reason relates to the rules or structure governing the
different sound categories. In order to be understood by listeners, speech must follow a very
specific rule set. Although the rules for music are less strict, they still often exist inside
of communities or within specific musical "styles". The rules governing soundscapes are
typically very broad and limited only by possible noise pollution laws or the sound sources
present in a given environment. A third justification for the ordering of the continuum is in
terms of information rate. Speech has a very large amount of information present in a single
second, and entire phrases typically take at most a few seconds to complete. Although it is
possible for music to contain a large number of notes in the time period of a few seconds,
it often takes much longer to understand the character and musical structure of a piece
with which the listener is unfamiliar. In a soundscape a listener can often only obtain very
broad qualities such as source density or loudness in a short period of time, while the actual
character of the soundscape can take days or years to fully develop, as external factors
influence the quality of sound transmission and sound source behavior.
In addition to the differences between speech, music and soundscapes in terms of
acoustic communication, there are also notable differences in the way humans experience
these different categories of sound. Perhaps the most obvious difference is that both speech
and musical sounds can be received and produced by humans, making it easier for us to
understand the existence of these types of sounds. Furthermore, while the actual sound in
speech is to a certain extent arbitrarily linked to the meaning of the sound, the acoustic
events in a soundscape are explicitly linked to the source that produced the sound and the
context in which the event occurred. Despite these differences, there is a growing body of
evidence from neuroscience and psychology that suggests the ways in which we perceive
and organize environmental sounds is similar to the ways in which we perceive and organize
speech and music.
Much of the work on the perceptual organization of sound follows Bregman's concept of
auditory stream segregation [5], i.e., the ability of the human auditory system to segregate
the complex sound source mixture that arrives at the ear into subsets or "streams" each
associated with a different physical source. Bregman further breaks down the components
used by our minds to organize sounds into "primitive" and "schema-based" segregation
components.
Primitive auditory stream segregation is constrained by the separation of
sound sources in both time and frequency, i.e., multiple sources that are closely grouped
in time and/or frequency will be perceived as a single source. Schema-based segregation
incorporates knowledge and rules built-up over the course of our auditory lives, e.g., we
might expect to hear an engine start-up after hearing jingling keys and car doors opening
and closing. Many researchers have also found that the rules of auditory stream segregation
share much in common with perception of the syntax, phonetics, and semantics used to
comprehend speech [6].
Additionally, Ballas and Howard [7] found that both environmental and linguistic
sounds share similar methods of both bottom-up and top-down perceptual processing.
Bottom-up processing consists of extracting low-level acoustic features and using these features to progressively extract additional information from the sound source. For example,
in speech we first use low-level acoustic features to recognize phonemes, then combine these
phonemes to recognize words, and finally combine words to obtain meaning from entire
speech segments. Ballas and Howard [7] found a somewhat similar process in identifying
environmental sounds by observing the process in which subjects used spectral and temporal
properties of environmental sounds to first determine a sound source, and then determine
an action from that source that caused the sound stimulus. Rosenblum [8] describes auditory perception in a manner similar to bottom-up perceptual processing by specifying that
we hear event primitives
such as low-level acoustic features when perceiving both speech
and non-speech events. Top-down processing consists of rules or expectations that extract
information from a sound stimulus. Temporal structure is particularly important in top-down processing of environmental sounds in ways that are similar to syntax in speech, as
shown by improved identification performance of sound sources and actions when a logical
structure was followed [7]. In a follow-up study identifying environmental sounds [9], Ballas
demonstrated that context and familiarity with environmental sounds is important in a
similar manner to language and speaker familiarity in speech. Furthermore, recent brain
imaging studies [10] have shown that similar networks in the brain activate for both verbal
(speech) and non-verbal (environmental) sounds.
Now that we have discussed several of the ways that environmental sounds compare
to speech and music, we present a formal definition of environmental and natural sounds.
An environmental sound is only a small part of the entire context-dependent auditory environment encompassed by the term soundscape discussed above. Specifically, we choose to
follow the definition criteria of Ballas and Howard [7], which has two important components:
(1) it is possible to specify the event (either physical or electronic) that produced the sound,
and (2) the sounds gain meaning by their association with these events. This definition has
several advantages in being both intuitive and also perceptually consistent with auditory
stream research by implying that perception recognizes sound events, not simply acoustic
patterns. An important property of these two criteria is that they do not specify a
definition based on exclusion, e.g., environmental sounds are not speech or music. In many
sound environments, speech and music can be considered environmental sounds and should
not be excluded by the definition criteria.
1.2 Motivation and Applications
Following our formal definition of environmental sounds, we now describe several applications where methods to automatically capture, process, and organize these types of sounds
would be useful. While some of the forthcoming applications have been explored by audio signal processing and machine learning researchers, many are research topics based on
sound analysis that could benefit from automated methods for the segmentation, indexing,
and retrieval of environmental and natural sounds. These applications include:
Surveillance
As previously mentioned, surveillance, or the ability to monitor an environment for
dangerous or significant events, represents an important use of audio information. While
acoustic "bugs" have been covertly hidden in telephones, walls, and clothing for several
decades [11], surveillance techniques continue to evolve. Currently, surveillance is often considered a visual problem (e.g. casino and parking lot monitoring), solved by video processing
and analysis. Video alone, however, is insufficient to characterize the semantics of certain
events, while also suffering from occlusion problems and large computation/transmission
requirements. For these reasons, acoustic surveillance has been explored as a viable alternative/addition to video [12-17] in a variety of recent security, safety, and robotics applications. In [13,14] microphones are placed in an elevator, and an outlier detection procedure
is used to detect suspicious behavior, while in [16] a microphone array is used to trigger an
alarm when an unwanted intruder enters a home by monitoring the energy, spectral content, and source direction of the audio streams. Additional applications include [17] where
microphones monitor and detect when elderly people fall down using Gaussian mixture
models, and [18] where audio and video sensors on a robot monitor elderly people living
alone. Acoustic surveillance techniques have also been used in office environments where
those being recorded are not in danger or under investigation, but acoustic information
aids automatic meeting summarization to eliminate the often tedious and undesirable task
of recording "meeting minutes" [19,20]. In [12] microphones continuously record an office
environment, and events are automatically segmented and archived to reveal long-term activity patterns such as when people are sick, when they socialize, and when they enter and
exit certain areas.
Life-Logging
Similar to the process of log-keeping in the office environment [12,19,20], a research
community has emerged around the concept of Continuous Archival and Recording of
Personal Experience (CARPE). Based on the revolutionary idea of a memory extender
(MEMEX) proposed by Vannevar Bush while detailing his roadmap for postwar science in
his 1945 article "As we may think" [1], the goal of the CARPE community is to archive
everything a person experiences throughout a lifetime. While realizing the MEMEX vision
could transform parts of human cognition by changing the ways we associate with past
events, significant challenges remain in terms of access. That is, finding a specific event or
memory in a lifetime's worth of multimedia is an extremely difficult task, and until these
navigational challenges are fully met the MEMEX vision will remain to a certain extent
unrealized. The most literal attempt at fulfilling the MEMEX vision is the work in [21,22]
where all documents, e-mails, photos, etc. are stored in addition to the use of a "SenseCam", which constantly records everything a person sees. In [2,23] the audio domain is
explicitly analyzed by people carrying portable recorders with them at all times. In this work
semantic scenes (e.g. "meeting", "supermarket", "library") are segmented and organized
from continuous recordings, but segmentation and organization at the individual event level
remains relatively unexplored for continuous audio recordings.
Biophony
The term biophony was originally proposed by Bernie Krause [24] and can be defined as
the collective non-human animal sounds created in a given environment. Several researchers
in the field of biophony have begun to study the ways in which humans are modifying animal
vocalizations by forcing birds [25,26] and frogs [27] to call at a higher pitch in the presence
of traffic noise and other types of human noise pollution. While these cases demonstrate the
ability of some species to adapt to the changing aural landscapes, animals such as bats who
rely on listening in their search for food are being severely threatened by noise pollution
that masks the rustling sound produced by insects [28]. Noise pollution not only harms
individual species, but also shifts the ecological balance in communities as some species
thrive due to the noise related decline of predators [29-31]. Furthermore, recording and
archiving animal species and environments that could become extinct is also important.
Unfortunately, the process of collecting high-quality recordings of animal vocalizations is
incredibly labor intensive, and often requires 500 hours of recordings to obtain 15 minutes
of usable material [24]. Therefore, any automated techniques to process, label, and organize
these natural sound recordings would be incredibly valuable to researchers studying acoustic
biology issues.
Architectural Design
Noise pollution can also cause discomfort amongst human listeners, and this was a
major impetus in the pioneering works on soundscapes [4] and acoustic communication [3].
In [32,33] among others, researchers have formally evaluated what properties of the soundscapes we live in both annoy and comfort us. Additionally, the interaction between environmental
soundscapes and human communities has been studied from a sociological
perspective [34]. Fortunately, the wide array of soundscape research has not been completely ignored by those responsible for constructing the environments we live in, and has
found much interest in the landscape architecture community [35-37]. Clearly, for landscape architects to make informed decisions about the soundscapes in which they work,
they must explore the temporal evolution of the soundscape at both daily and yearly time
scales. Thus, automated techniques with the ability to summarize the soundscape at large
time scales could be valuable not only to landscape architects, but in the much wider field
of aural architecture [38] as well.
Music and Multimedia
Musicians have been incorporating field recordings into their music for generations,
and not surprisingly R. Murray Schafer, the author of the seminal work on soundscapes [4],
is also a composer. Popular examples of music that involve manipulation of recorded sounds
include acousmatic composition [39] and its predecessor musique concrete [40], where traditionally non-musical sounds are recombined to compose music. A more recent example
is the TAPESTREA software [41], where specific time-frequency components of traditionally
non-musical sounds are used in the music composition process. There has also been interest
in synthesizing artificial soundscapes [42-44], where recordings of real soundscapes are used
as an analysis phase in constructing soundscape synthesis algorithms. Furthermore, environmental sound recordings have been used as an educational tool in interactive multimedia
environments [45] where students record the sounds around them and those sounds are then
used as feedback during learning exercises.
1.3 Computational Environmental Sound Processing
Given the wide range of applications based upon the characterization of sound activity,
interest in automatically organizing large collections of continuous environmental and natural sound recordings has recently emerged. In order to utilize these audio archives, users
must be able to determine when and where sound events occur, access all sound events
of a given type, and have the ability to explore the archive using natural language. That
is, there remain key technical challenges regarding segmentation (determining where sound
events begin or end), indexing (computing and storing sufficient information along with
each event to distinguish it from other events), and retrieval (obtaining all events of a given
type).
In this section we provide a description of the main components in a system for the
segmentation, indexing, and retrieval of environmental and natural sounds (Figure 1.1).
The purpose of this system is to organize sound recordings at the event level, and allow
individual sound events to be efficiently accessed. In the following three subsections we
present three principal system components: segmentation, acoustic similarity for indexing
and retrieval, and the incorporation of semantic information.
1.3.1 Segmentation
While signal segmentation is a broad area, the primary objective in this dissertation is
the identification of onset and end times for individual sound events. Clearly, given the
diversity of applications and the broad definition of environmental sounds, the term event
is somewhat ambiguous.
For speech, an event can be phonemes, words, sentences, or
other organizational units, while a music event could refer to notes, phrases, sections, or
entire pieces.
Rosenblum [8] defines a speech event in terms of the primitive qualities
of intended and/or realized sound producing gestures, e.g., the individual formants of a
phoneme.

Fig. 1.1. Schematic diagram concerning the segmentation, indexing, and retrieval of continuously
recorded environments (archival and indexing: feature extraction, segmentation, and HMM
indexing into an audio database with time/location descriptors and semantic sound labels from
the user community; retrieval: audio or time/location queries are matched and ranked against
the database).

While these "event primitives" are useful in understanding sound perception,
isolating events at this scale might make retrieving these events from a database difficult
and counterintuitive. Thus, in defining an environmental sound event we follow as closely
as possible Bregman's concept of an auditory stream, which is perceptually equivalent to
the sound emanating from a single physical source. A "speech event" is then a period
of continuous speech, and the event ends during a significant pause or when the speaker
changes. Thus, a "footstep event" consists of a cluster of footstep sounds as opposed to
individual footsteps. When multiple events overlap in time, e.g., the speech, footstep, and
engine sounds that occur simultaneously when a person walks on a busy street while talking
on the phone, a source separation or "de-mixing" process is required in addition to temporal
segmentation to perfectly isolate auditory streams. This process of source separation is often
referred to as computational auditory scene analysis (CASA) [46], and to a large extent
remains an unsolved problem attracting much research interest [47-50]. Therefore, in the
present study, we limit ourselves to only temporal segmentation, and overlapping events
are treated as follows: assume our recording consists of a crowd that begins by clapping,
but not cheering, and after a couple of seconds the crowd begins clapping and cheering
simultaneously. In the present study we would consider this as two events: (1) the crowd
clapping and (2) the crowd clapping and cheering simultaneously.
Prior work in audio segmentation has focused primarily on speech [51-53] and music
[54-57]. In these well-structured domains, events are typically defined as meaningful units
such as phonemes or musical notes.
Automatically segmenting periods of speech from
periods of music in radio and television broadcasts has also attracted research interest [53,
58,59]. Recently, approaches for the segmentation of continuously recorded environmental
sound have also begun to appear [2,12,23,60]. We divide these approaches into two main
categories: those that operate on the scene level [2,23,60] and those that operate at the
event level [12]. We have previously defined event-level segmentation, and by scene-level
segmentation we mean segmenting the continuous recording into periods where the recording
environment changes, e.g., in a life logging application segments could correspond to time
periods spent in meetings, supermarkets, libraries, or other acoustic environments [2,23]. We
now provide an overview of the techniques used to solve the automatic audio segmentation
problem.
One approach to speech phoneme segmentation [51] fits an autoregressive model [61]
to a 20ms section of speech samples, and test statistics are monitored to detect changes in the
model parameters. While several change detection applications [62,63] use this approach,
most audio applications prefer techniques that monitor the changes of features designed to
represent specific acoustic qualities. These acoustic features are often spectral or temporal
properties of the audio waveform, and typically calculated in units of frames every 20-50ms.
We will discuss the process of acoustic feature extraction in detail in Chapter 2. In [12] spectral features are monitored and event onsets are determined by frames where the feature
values are above a pre-specified threshold. In [54] a distance measure between consecutive frames of feature vectors is thresholded to determine segment boundaries. Originally
proposed in [52], audio segmentation based on the Bayesian information criterion (BIC)
is widely used in speaker diarization [64] and audio life-logging [2, 23] applications. The
BIC approach operates by determining whether sequential groups of audio frames are best
described by a single (no segment boundaries) or multiple (a segment boundary is present)
Gaussian distributions. While the BIC approach works especially well in speech applications, it sometimes fails at low SNR. Additionally, the complexity of the BIC approach is
quadratic in the number of audio frames, making it impractical for long recordings.
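To make the BIC test concrete, the following sketch scores a hypothesized boundary within a window of feature frames by comparing a single-Gaussian fit against a two-Gaussian fit with a complexity penalty. It is a generic illustration with placeholder frame counts, penalty weight, and regularization, not the exact formulation of [52] or the settings used later in this dissertation.

```python
import numpy as np

def bic_change_score(X, t, penalty=1.0):
    """Delta-BIC for a hypothesized boundary at frame index t within X.

    X is an (N, d) array of feature vectors. A positive score favors modeling
    the window with two Gaussians (i.e., placing a segment boundary at t).
    """
    N, d = X.shape

    def logdet_cov(Z):
        # Regularized full-covariance estimate to keep the determinant finite
        cov = np.cov(Z, rowvar=False) + 1e-8 * np.eye(d)
        return np.linalg.slogdet(cov)[1]

    R = 0.5 * (N * logdet_cov(X)
               - t * logdet_cov(X[:t])
               - (N - t) * logdet_cov(X[t:]))
    # Penalty for the extra mean and covariance parameters of the second Gaussian
    P = 0.5 * penalty * (d + 0.5 * d * (d + 1)) * np.log(N)
    return R - P

# Example: scan candidate boundaries over a window of (placeholder) MFCC frames
frames = np.random.randn(200, 13)
scores = [bic_change_score(frames, t) for t in range(20, 180)]
best_t = 20 + int(np.argmax(scores))   # accept a boundary only if its score > 0
```

Note that scanning every candidate boundary in every window is exactly what makes the cost grow quadratically with the number of frames.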
An alternative approach to feature-based audio segmentation assumes that all frames
of the input signal belong to a pre-determined number of classes, and then trains a separate
classifier for all possible categories [53,58,59]. Each frame is then independently classified,
and segment boundaries correspond to frames where the classification decision changes.
Clearly, this approach will not extend to situations when observed audio events were not
included in the classifier training phase, which limits its utility for environmental sound
event segmentation.
Furthermore, instead of classifying each frame independently, the
introduction of temporal dynamics can dramatically improve segmentation performance
[55-57]. Thus, for the segmentation algorithm discussed in Chapter 3 we monitor changes
in six low-level acoustic features, and determine segment boundaries using a custom dynamic
Bayesian network (DBN). Rather than determining membership in a pre-trained class for
each frame, we determine segment boundaries as the points in time when the mean value
of the feature trajectories change. Furthermore, the DBN segmentation model proposed in
this dissertation allows for uncertainties in terms of a) which features change upon an event
boundary, b) how fast these features change, and c) how each feature can vary during an
event.
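As a simple illustration of the idea that boundaries correspond to changes in the mean of a feature trajectory, the toy detector below flags frames where the local mean shifts by more than a fixed number of pooled standard deviations. The window length and threshold are arbitrary illustrative values; the DBN of Chapter 3 replaces this fixed-window heuristic with a probabilistic treatment of the uncertainties a)-c) above.

```python
import numpy as np

def mean_shift_boundaries(feature, win=20, threshold=2.0):
    """Flag frame t as a boundary when the mean of the next `win` frames
    differs from the mean of the previous `win` frames by more than
    `threshold` pooled standard deviations. Illustration only."""
    boundaries = []
    for t in range(win, len(feature) - win):
        past, future = feature[t - win:t], feature[t:t + win]
        pooled_std = np.sqrt(0.5 * (past.var() + future.var())) + 1e-8
        if abs(future.mean() - past.mean()) > threshold * pooled_std:
            boundaries.append(t)
    return boundaries

# e.g., a loudness trajectory that jumps when a sound event turns on
loudness = np.concatenate([np.random.normal(-60, 1, 100),
                           np.random.normal(-30, 1, 100)])
print(mean_shift_boundaries(loudness)[:3])   # boundaries near frame 100
```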
1.3.2 Indexing and Retrieval
For the indexing and retrieval problems, our goal is to simply help users find the sounds
they want quickly and intuitively.
While quantifying the speed of a retrieval system is
straightforward, measuring the "intuitiveness" is not so well-defined. In this dissertation,
"intuitive" is defined as the ability of users to quickly form the query necessary for searching
the environmental sound database. Given this objective, the indexing and retrieval problems
can operate in two fundamentally different ways: (a) query-by-metadata,
where a user must
listen to all recordings, describe them with text, and then retrieve sounds by matching
text queries with the text descriptions, and (b) query-by-example
(QBE) [65], where users
input recordings they consider similar to the sounds they want to retrieve, i.e., use sound
to search for sound. Metadata descriptions are often highly variable among different users,
while the quantification of sound similarities used in QBE often generalizes over multiple
users. Another important property in QBE systems is their ability to navigate the precision-recall tradeoff by varying the number of results returned to the user. The precision-recall
tradeoff is fundamental to the field of information retrieval [66], and can be described
using the following definitions: precision is the number of desired sounds retrieved divided
by the number of retrieved sounds and recall is the number of desired sounds retrieved
divided by the total number of desired sounds.
Each of these metrics can be trivially
maximized either by returning all sounds in the database (recall), or by retrieving only a
single desired sound (precision), thus, a truly effective retrieval system must attempt to
maximize both measures simultaneously, while also having high average precision.
For a
given query, average precision is defined as the average of the precision values at each point
in the ranked list where a relevant sound is located [66].
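As a worked example of these definitions, the sketch below computes precision, recall, and average precision for a single query given a ranked result list (the sound IDs in the usage comment are hypothetical).

```python
def retrieval_metrics(ranked_ids, relevant_ids):
    """Precision, recall, and average precision for one query.

    ranked_ids: database sound IDs in the order returned by the system.
    relevant_ids: the set of sounds the user actually wants.
    """
    relevant_ids = set(relevant_ids)
    hits, precisions = 0, []
    for rank, sid in enumerate(ranked_ids, start=1):
        if sid in relevant_ids:
            hits += 1
            precisions.append(hits / rank)   # precision at each relevant result
    precision = hits / len(ranked_ids) if ranked_ids else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    avg_precision = sum(precisions) / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall, avg_precision

# e.g. retrieval_metrics(["s3", "s7", "s1", "s9"], {"s3", "s1", "s5"})
# -> precision 0.5 (2 of 4 retrieved are relevant), recall ~0.67, AP ~0.56
```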
The QBE technique for audio information retrieval has achieved much success for
music and music information databases, with the Shazam "audio fingerprint" [67] currently
the most commercially successful. While these audio fingerprinting techniques have also
achieved success for databases of general continuously recorded audio [68], a major drawback
is the extreme breakdown in performance when the query does not exactly match a portion of
a sound file in the database. To overcome this drawback researchers have begun to combine
locality-sensitive hashing (LSH) with audio fingerprinting algorithms [69, 70] to improve
performance for inexact matches.
Another widely used technique in music information
retrieval is query-by-humming (QBH) where oral queries (i.e., singing) are used to match
music melodies stored in a database ([71-73], among others). Unlike the audio fingerprinting
algorithms, QBH will rank all melodies in the database in terms of their similarity to the
hummed query, and return the top K results to the user in rank-order. A common method
of ranking melodies in a database is to use probabilistic techniques to order the database
by the likelihood of the query given each database sound [71-73]. These likelihood-based
approaches to QBH melody retrieval will serve as the inspiration for the QBE techniques
for searching and organizing environmental audio databases explored in this dissertation.
Similar to segmentation, QBE retrieval methods also operate on acoustic features as
opposed to directly on the audio waveform. The indexing and retrieval techniques presented
in Chapter 4 utilize a flexible likelihood model for file-based query of environmental sounds,
which models query behavior and can be readily adapted for oral query. Specifically, each
sound in the archive is indexed with a hidden Markov model (HMM) which approximates
feature trends (does each feature stay constant, go up, down, or vary in more complex
ways?). Once presented with a query, our system extracts feature trajectories from that
query, and retrieves sounds in rank order. The ranking is determined by the likelihood that
the HMM of a database sound generated the query feature observations. While the QBE
retrieval method can be intuitive when the user has appropriate example files on hand or if
the query can be produced orally into an available microphone, this is not often the case.
Furthermore, QBE retrieval techniques will fail in cases where the user desires sounds that
are acoustically different than the query, but similar to the query in other meaningful ways
(e.g., the grunt of a horse might not retrieve the trotting sound of a horse running). In
these situations using semantic or text-based queries might be more appropriate.
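The likelihood-based ranking described above can be sketched as follows, using the hmmlearn package as one possible HMM implementation; the number of states, the diagonal-covariance choice, and the random placeholder feature trajectories are illustrative assumptions rather than the templates developed in Chapter 4.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM   # one possible off-the-shelf HMM library

def index_database(feature_trajectories, n_states=3):
    """Fit one HMM per database sound from its feature trajectory (frames x dims)."""
    models = {}
    for sound_id, feats in feature_trajectories.items():
        hmm = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=25)
        models[sound_id] = hmm.fit(feats)
    return models

def query_by_example(models, query_feats, top_k=5):
    """Rank database sounds by the log-likelihood that each sound's HMM
    generated the query's feature observations."""
    scores = {sid: m.score(query_feats) for sid, m in models.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy usage with random "feature trajectories" standing in for real sounds
rng = np.random.default_rng(0)
db = {f"sound_{i}": rng.normal(i, 1.0, size=(120, 6)) for i in range(4)}
models = index_database(db)
query = rng.normal(2, 1.0, size=(80, 6))          # resembles sound_2
print(query_by_example(models, query))
```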
1.3.3 Incorporating Semantic Information
If every sound file in the archive is labeled or "tagged" with words describing the content
of that sound then text-based audio retrieval becomes a simple string matching problem.
Unfortunately, for applications consisting of continuous recordings of environmental sounds,
labeling the entire sound archive is not realistic, and techniques that allow for text-based
retrieval of unlabeled audio are required. When attempting to develop a framework for
retrieving sound files from an unlabeled archive using text queries, there are two different
yet related techniques for solving the problem. The first technique, which we will refer
to as text-based retrieval, consists of developing a system that connects words and sounds,
and uses text queries to return the sound files most relevant to that query. A second
technique is annotation or "auto-tagging", where an algorithm automatically applies the
most appropriate set of tags to an unlabeled file, and then this auto-tagged sound can be
used in a traditional string matching retrieval system.
The problems of text-based retrieval and/or automatic annotation of audio archives
have recently begun to attract significant research attention as a natural extension to QBE
retrieval. In [74] semantic words are connected to audio features through hierarchical clusters, and several researchers have tackled the text-based retrieval problem using a multi-category classification approach where the acoustic features of all sound clips labeled with a
given word are used to train classifiers such as SVM or GMM for each word. This approach
has been successfully applied to automatic record reviews [75], annotation and retrieval of
music and sound effects [76], and large-scale search of audio on the internet [77]. A disadvantage of this approach is that retraining of the classifiers may be required with the
addition of new words or sound clips. In [78] a taxonomy based on WordNet [79] is used
to tag unknown sounds with the labels belonging to their nearest neighbor in an acoustic
feature space, and a modified version of this system was used for a study involving the
Freesound project [80].
These existing techniques for audio information retrieval will fail for cases such as
retrieving all "water" sounds from a given database, because sounds related to water are
extremely diverse in terms of acoustic content (e.g., rain drops, a flushing toilet, the call of
waterfowl, etc.) despite belonging to a single semantic category. Thus, the system described
in Chapter 5 uses a taxonomic approach that aims to capture all connections between
semantics and audio content for use in a flexible system for audio retrieval allowing for
QBE and/or text-based queries of an unlabeled sound database. Furthermore, our system
can be used for automatic annotation, which describes a sound file based on its audio content
and provides suggested tags to contributors. Inspired by the approach used to organize lexical
databases such as WordNet, we represent sounds in an audio database and their semantic
descriptions as an ontology, where sounds are linked to each other through a measure of
perceptual similarity, semantic tags are linked to each other through user-provided weights
or through an external ontology (e.g., WordNet), and sounds and tags are linked through
collected user preferences. We refer to this ontology of linked words and sounds as a hybrid
(content/semantic) network. Queries can be either sounds or words, and do not have to be
previously connected to the network, as long as they can be linked either perceptually if
sound-based, or lexically if word-based.
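As a toy illustration of such a hybrid network, the sketch below links sound and tag nodes with the three edge types just described and ranks sounds for a text query using a PageRank-style diffusion from networkx. The node names, edge weights, and diffusion-based scoring are illustrative stand-ins, not the link weighting or retrieval rule developed in Chapter 5.

```python
import networkx as nx

# Toy hybrid (content/semantic) network: sound-sound, concept-concept, and
# sound-concept links with made-up weights.
G = nx.Graph()
G.add_edge("snd:rain_on_roof", "snd:toilet_flush", weight=0.3)   # sound-sound
G.add_edge("tag:water", "tag:rain", weight=0.8)                  # concept-concept
G.add_edge("snd:rain_on_roof", "tag:rain", weight=1.0)           # sound-concept
G.add_edge("snd:toilet_flush", "tag:water", weight=1.0)          # sound-concept

def text_query(graph, tag, top_k=5):
    """Rank sound nodes by their affinity to a query tag via graph diffusion."""
    scores = nx.pagerank(graph, personalization={tag: 1.0}, weight="weight")
    sounds = [(n, s) for n, s in scores.items() if n.startswith("snd:")]
    return sorted(sounds, key=lambda x: -x[1])[:top_k]

print(text_query(G, "tag:water"))
```

Because relevance propagates through the graph, a sound labeled only with "rain" can still be returned for the query "water" via the concept-concept link, which is the behavior the hybrid network is designed to provide.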
1.4 Overview and Organization
In this chapter, we discussed how the proliferation of sound recording and storage technology
is changing how we characterize and experience our aural world. It is currently possible
to capture and archive all sound occurring in a human life or fixed space, but organizing
and accessing this information remains to a large extent an unsolved problem. This has
created a powerful need for automatic algorithms that perform audio event segmentation
and sound search. Currently, there are several content-based systems for segmenting and
searching databases of speech and music recordings, some of which have been extended
and tested with environmental and natural sounds. All of these systems are customized to
specific problems and have unique strengths and drawbacks.
To address some of these drawbacks, we have outlined an integrated framework for
the segmentation, indexing and retrieval of environmental and natural sounds. This system
takes as input continuous audio recordings, and automatically extracts and organizes sound
events from the raw audio data. The result is an informative index of sound environments,
which can be used as a memory aid, surveillance system, or acoustic ecology research tool.
Specifically, in this dissertation we explore algorithm performance in two situations: (1)
continuous recordings captured by a single user going about their daily life, and (2) an
internet audio database with recordings contributed by a large number of users.
In Chapter 2, we will describe the process of acoustic feature extraction, a fundamental
step in content-based audio processing. Furthermore, a summary of how the feature extraction
process differs between speech, music, and environmental sounds will also be presented
in this chapter. Chapter 3 fully details and rigorously evaluates the segmentation algorithm
used to extract sound events from continuous recordings. In Chapter 4 a procedure for
indexing sound events, and retrieving them using a query-by-example framework is discussed. Chapter 5 extends the retrieval framework of Chapter 4 by incorporating semantic
information, allowing for text-based queries in situations where appropriate sound queries
are unavailable. Finally, Chapter 6 concludes by analyzing how well the proposed system
organizes our aural world, and also provides future research directions.
2. ACOUSTIC FEATURE EXTRACTION
2.1 Introduction
Efficient indexing and classification schemes for audio are typically based on the fundamental
process of acoustic feature extraction, and techniques based on these features are often
referred to as content-based.
For example, low-level acoustic features have been used to
discriminate speech from music [58, 81], male/female speakers [82], musical genres [83],
musical instruments [84], and general audio [85]. Monitoring changes in acoustic features
also forms the basis of many audio segmentation [54,56,86,87] and continuous monitoring
[2,12] systems. Most acoustic feature extraction schemes rely on a frame-based analysis,
where a frame is a sequence of digital samples of the audio waveform. In audio applications
frames are typically 20-100ms in length, because in general, dynamic audio sources tend
to have some stationary characteristics, for instance spectral characteristics, at these time
scales.
While certain time domain features have proven useful in content-based audio analysis,
spectral features are often required to adequately summarize a frame. Thus, the short-time
Fourier transform (STFT) is often the fundamental step in the acoustic feature extraction
process.
For best results, the STFT requires that frames overlap in time, and that samples be
multiplied by a finite-length window function (e.g., Hamming, Blackman, rectangular, etc.)
and zero-padded prior to computing the discrete Fourier transform of each frame [88]. It
is often the case that the goal in content-based processing is for the automated system to
perform its task in a manner similar to a human listener, and this has inspired features
that operate on a warped frequency spectrum that approximates the response of the human
auditory system. The Mel [89] and Bark [90] scales are two well known representations of
human frequency perception that are widely used in content-based processing.
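A minimal sketch of this frame-based front end is given below: overlapping frames are windowed, zero-padded, and transformed with the FFT to produce one magnitude spectrum per frame. The 50 ms frame length, 50% overlap, Hamming window, and FFT size are illustrative choices, not the exact analysis settings used in later chapters.

```python
import numpy as np

def stft_frames(x, sr, frame_ms=50, hop_ms=25, n_fft=2048):
    """Return one magnitude spectrum per overlapping, windowed, zero-padded frame."""
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    window = np.hamming(frame_len)
    spectra = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len] * window      # overlap + window
        spectrum = np.fft.rfft(frame, n=n_fft)           # zero-padded DFT
        spectra.append(np.abs(spectrum))
    return np.array(spectra)                             # (num_frames, n_fft//2 + 1)

sr = 16000
x = np.random.randn(sr)          # one second of placeholder audio
S = stft_frames(x, sr)           # each row summarizes one analysis frame
```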
To a large extent the development of acoustic feature extraction techniques has followed
the speech-music-soundscape continuum discussed in Section 1.1, with features aimed at
capturing qualities of human speech receiving the most attention; these were then expanded
for music, and most recently broadened for use with environmental
and natural sounds. In this chapter we examine the process of acoustic feature extraction
through the lens of the speech-music-soundscape continuum with our main focus on relevant
features for the last level of this continuum. Section 2.2 discusses the history, development,
and technical details of features for speech, music, and environmental sound applications.
As the goals of the present study are to organize large collections of environmental and
natural sounds, we must take care in selecting appropriate features. In Section 2.3, criteria
for choosing appropriate features for our application are described. Additionally, this section
presents the feature set specifically tuned to environmental sounds that will form the basis
of all subsequent chapters. Concluding remarks on the acoustic feature extraction process
will be presented in Section 2.4.
2.2 Types of Audio Features
2.2.1 Speech Features
In trying to extract useful qualities from the sounds we hear, the words that most often come
to mind are pitch and loudness.
These qualities have an inherently perceptual meaning,
and several sophisticated psychoacoustic models exist [91]. Loudness is often described
in terms of two perceptually weighted scales [91,92]: the sone scale, where a sound with
loudness of two sones will be perceived as twice as loud as a sound with loudness of one sone,
and the phon scale, which is similar to the decibel (dB) scale. Therefore, a simple loudness
approximation is the sound energy/intensity converted to the dB scale, while more advanced
techniques integrate the excitation pattern (as a function of frequency) using bandpass filters
that approximate the frequency response at the human ear [93]. Regarding pitch, it is
commonly considered that frequencies below approximately 500 Hz are perceived according
to temporal periodicities, frequencies above approximately 1500 Hz are perceived according
to excitation along the basilar membrane, and both mechanisms operate simultaneously for
frequencies between these low and high extremes [94]. The perceptual pitch of a sound often
exhibits some type of correspondence with the fundamental frequency, which is typically
denoted by f_0, and defined for a periodic signal as the inverse of its period. For this
reason, the word "pitch" is often used to refer to f_0. If a signal is perfectly periodic, f_0 can
also be considered as the smallest positive time-shift that leaves the signal invariant [95].
Unfortunately, perfectly periodic signals do not exist, which makes f_0 estimation non-trivial.
Additionally, the fundamental frequency f_0 does not have to be present in a spectrum in
order to be perceived, as long as several of its harmonics are present. This holistic theory
of pitch perception is known as virtual pitch [96]. This concept of virtual pitch has also
been exploited by experimental electronic music composers such as Tristan Murail [97].
While methods of f_0 extraction have been well studied for the automatic processing
of speech signals [95,98,99], it is generally accepted that approaches can be divided into
either time-domain [95] or frequency-domain [96,98,100,101] techniques. The most widely
used algorithm for f_0 estimation is the classical autocorrelation method [102]. Because
periodic signals exhibit peaks in the autocorrelation function at multiples of the period, the
autocorrelation method chooses f_0 as the inverse of the lag of the highest non-zero-lag peak in the autocorrelation
function. A widely used variation of the autocorrelation function is the average magnitude
difference function (AMDF), which like the autocorrelation method is prone to underestimate the pitch period. A systematic evaluation of when the autocorrelation method fails,
and a way to correct it using a time-domain cumulative mean normalized difference function
is presented in [95].
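The classical autocorrelation method can be sketched in a few lines: compute the frame autocorrelation, restrict the search to lags corresponding to a plausible pitch range, and return the inverse of the best lag. The 50-500 Hz search bounds below are an illustrative assumption.

```python
import numpy as np

def autocorr_f0(frame, sr, fmin=50.0, fmax=500.0):
    """Classical autocorrelation pitch estimate for one analysis frame."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # lags >= 0
    lag_min = int(sr / fmax)                    # shortest period considered
    lag_max = min(int(sr / fmin), len(ac) - 1)  # longest period considered
    if lag_max <= lag_min:
        return 0.0
    best_lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))
    return sr / best_lag                        # f_0 in Hz

sr = 16000
t = np.arange(int(0.05 * sr)) / sr
frame = np.sin(2 * np.pi * 220.0 * t)     # 220 Hz test tone
print(autocorr_f0(frame, sr))             # approximately 220
```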
Primarily due to the computational expense inherent in the STFT, frequency-domain
methods for f_0 extraction have not received as much attention as their time-domain counterparts. However, in content-based approaches the STFT is often used in some portion
of the feature extraction process so frequency-domain f_0 extraction techniques make sense.
Two fundamental techniques based on using the harmonic pattern of the STFT spectrum
to estimate f_0 are the works of Goldstein [98] and Terhardt [96]. The concept of virtual
pitch forms the basis of Terhardt's algorithm, which combines the relative importance of
each frequency peak in the spectrum with masking information and harmonic confidence
estimates to compute the virtual pitch. Goldstein's technique also uses the peaks in the
frequency spectrum, but assumes that all peaks are harmonic components and each peak
is characterized by an independent Gaussian distribution. An optimal estimate of f_0 and
the harmonic number of each peak can then be obtained under this Gaussian assumption.
Unfortunately, as discussed in [100], accurately computing the Goldstein pitch estimate requires solving a computationally intensive combinatoric optimization problem, which has
led to the highly accurate MCMC approximations proposed in [57,100]. Later in this chapter an algorithm for measuring the "harmonicity" of a signal spectrum based on Goldstein's
f_0 estimation algorithm will be presented.
The frequencies, amplitudes, and phases of the peaks in the STFT spectrum can also
be used as a representation of an audio signal [103], which is often referred to as a sinusoidal
model. Because a large number of peaks are often necessary to accurately describe a frame,
the number of peaks is not constant from frame to frame, and spurious peaks can appear
due to noise or side-lobe interactions in the windowing process, parameters that specify
the spectral envelope are often used instead of an explicit sinusoidal model. The spectral
envelope can be defined as a smooth function that passes through the prominent peaks in
the spectrum [104], and is typically represented by a number of parameters much less than
those necessary for an explicit sinusoidal model.
Perhaps the most widely used spectral envelope feature set in speech processing is
the linear prediction coefficients (LPCs) [102,105]. The LPC features use an all-pole filter
of a given order to model a frame of speech. These predictor coefficients can then be used
as the fundamental parameters in an analysis-synthesis speech coding system [105], or as
input features in a speech recognition system [106]. To improve performance in speech
recognition systems, psychoacoustic pre-processing steps have been added to the LPC
feature extraction process, in methods entitled perceptual linear prediction (PLP) [107] and
RElative SpecTrAl (RASTA) PLP [108].
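For reference, the sketch below computes LPC coefficients for a single frame using the autocorrelation method and the Levinson-Durbin recursion; the model order of 12 is an illustrative choice, and refinements such as pre-emphasis and windowing are omitted.

```python
import numpy as np

def lpc_coefficients(frame, order=12):
    """All-pole (LPC) coefficients via the autocorrelation method and the
    Levinson-Durbin recursion. Returns a = [1, a1, ..., a_order] so that the
    spectral envelope is modeled by 1/A(z)."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                       # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]  # update previous coefficients
        a[i] = k
        err *= (1.0 - k * k)                 # updated prediction error
    return a

frame = np.random.randn(400)     # placeholder frame of audio samples
a = lpc_coefficients(frame)      # a[0] == 1.0
```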
Another popular feature set that was originally developed to parameterize speech is
the Mel frequency cepstral coefficients (MFCCs) [109], which typically provide a more distinguishable representation of the spectral envelope when compared to LPC features for
speaker verification [110] and speech recognition [111]. Following initial success in speech
recognition [112], MFCCs are also used in music applications [113], and as the benchmark
feature set in all types of content-based audio processing applications [15,114]. Cepstral
coefficients are based on the concept of cepstrum, which is the Fourier transform of the
log of the magnitude STFT spectrum, and was originally developed as a method for the
deconvolution of two signals [102,115]. The Mel in MFCC stands for the Mel scale [89],
which was experimentally derived by human listeners judging frequencies to be equal distance from one another, and frequencies in Hertz can be converted to frequencies in Mel
25
using the following relationship [114],
f(mel)
= 1127.0148/03 (l + ^ ^ )
•
(2-2.1)
To compute MFCC features the STFT magnitude spectrum at each frame is multiplied by
a series of triangular filters, which extract the energy in each band of the Mel scale. Scaling
frequencies using the Mel-scale allows MFCC features to account for the fact that human
perception is more sensitive to differences in lower frequencies than higher frequencies.
The log energy in each Mel band is then used as the input to a discrete cosine transform
(DCT), and the MFCCs are the output of this DCT. Using the DCT in the last step
decorrelates the components and reduces the dimension of the MFCC vector. It is typical
in most application to use around 12-20 MFCCs, where the low-order coefficients capture
the low-frequency formant energy, and the higher-order coefficients capture more harmonic
information [65]. Speech production can be considered as the convolution of the excitation
signal from the vocal chords with the filter of the vocal tract. In the cepstral domain this
convolution becomes addition with the vocal chord and vocal tract components often well
separated in the cepstral domain [115] providing additional motivation of MFCCs as speech
features.
2.2.2 M u s i c Features
Like human speech, the frequency spectrum of most musical instruments exhibit a harmonic
frequency structure, thus, /o estimation is also incredibly important in music applications,
most notably note onset detection [56,57]. Furthermore, as mentioned previously, MFCCs
have also been applied as music features [76, 113], where they are considered as a representation of the timbre in an audio signal [65]. Another important music feature is the
chromagram or pitch class profile (PCP) [65,116]. These features are based on two dimen-
26
sions of musical pitch perception: height and chroma [117]. Height describes the octave a
note belongs to, while chroma is the relation of the note to other notes within the same octave [118]. P C P features collapse the height dimension and represent the STFT spectrum of
the signal as a 12 dimensional vector. Each element of the vector represents the integrated
energy of all octaves in a single pitch class, and there are 12 pitch classes in western tonal
music. P C P features have proven most useful in automatic chord recognition [116, 118]
and cover song identification [119]. Another important task in content-based music analysis is beat tracking [65,120,121], where an onset detection or segmentation algorithm is
constrained to follow a specific rhythmic pattern.
2.2.3 General A u d i o Features
A seminal work in content-based indexing and retrieval of audio was the Muscle Fish
database [84]. In this work, a short-time analysis was performed on a database of approximately 400 sound files, employing loudness, pitch, brightness, bandwidth, and harmonicity
of the sound. Statistics (mean, variance, and autocorrelation at small lags) of each feature trajectory were stored and used for the indexing and classification of all sounds in the
database. Although, Muscle Fish contained a wide variety of sound types (speech, music,
natural sounds, etc.) over half of their audio classes were musical instruments, and the
system was generally biased towards musical sounds. In [58] a system that discriminated
between speech and music was proposed using 13 short-term features (four time domain,
seven spectral, and two cepstral). Each feature was specifically chosen to be salient for
only speech or music. An interesting result from [58] is that although 14 classification
schemes were tested on various subsets of the 13 features, only changing the employed
feature sets significantly varied classification results, the classifier made little difference in
27
performance. This demonstrated the importance of choosing an appropriate feature set for
a given content-based audio retrieval task.
Research on removing the "speech and music" bias from the acoustic feature extraction
and content-based retrieval problems first appeared in [85]. In this work four distinct feature
sets are compared in terms of classification performance for general audio (speech, noise,
crowds, etc.) classification and music genre classification. Although, the types of sounds
in the general audio classes were far from complete this paper presented for the first time
the extremely important result that features that work well for music do not perform well
on environmental and natural sounds. Furthermore, the feature set of [85] also became the
base feature set for the parameterization/classification components of the audio surveillance
system of [12], for monitoring a work environment.
A challenge in continuous audio archival is the long-term scale of these recordings, and
often summary features [2], which combine the short-term features over a longer window
(e.g. one minute) are used. In [2], short-term spectral and cepstral features are used and
summarized using the mean, standard deviation, and normalized deviation of the feature
trajectories. Additionally, the log of the feature trajectories over one minute windows is
utilized to place emphasis on very quiet regions of the recording. Even over these much
longer time scales the importance of acoustic features in segmentation and classification
tasks was clearly demonstrated. Recent work has also explored the use of time frequency
features for general sound detection and classification [15,70].
The rise in the number of digital multimedia files available on the internet, and the
large number of retrieval algorithms being developed to search for them led the ISO Motion
Picture Experts Group to create the MPEG-7 "Multimedia Content Description Interface"
standard in September 2001 [114]. The general goal of MPEG-7 is to standardize both
28
automatically extracted and manually annotated metadata for images, audio, and video.
While the audio portion of the MPEG-7 standard contains certain algorithms specifically
tailored to speech and music, a significant percentage of the standard is dedicated to lowlevel feature extraction and sound classification for general audio.
The standard refers
to extracted features as "descriptors", and MPEG-7 has 17 spectral and temporal lowlevel audio descriptors. The descriptors are divided into six categories: basic
which summarize time domain audio content, basic spectral descriptors
descriptors,
that describe the
envelope, centroid, spread and flatness of the audio spectrum, signal parameter
descriptors
which include harmonicity and fundamental frequency, timbral temporal descriptors
and
timbral spectral descriptors to approximate human timbre perception for musical sounds,
and finally spectral basis descriptors that represent low-dimensional projections of the audio
spectrum. The spectral basis descriptors are used in the MPEG-7 classification system,
and consist of summing the energy in logarithmically distributed S T F T bands, and then
using a dimensionality reduction technique such as principal component analysis (PCA)
to reduce the dimension of this feature set. While the audio feature extraction portion of
the MPEG-7 standard has found numerous applications [114], its lack of focus on natural
and environmental sounds has led to the development of a feature set tailored for the
organization of this type of audio, as discussed in the following section.
2.3 C h o o s i n g a Feature Set
The surveillance, biophony, and life-logging applications described in Section 1.2 require
continuous recording making it nearly impossible for a human to find events of interest
by the standard methods of listening or scanning visually via waveform display.
Thus,
a system is needed which allows individual sound events to be segmented, archived, and
linked both to similar events and to the context of the original recording. The selection of
29
a suitable feature set is the first step in this process. A fundamental goal of the feature set
in this dissertation is that features indicate the presence of a large variety of sounds while
not specifically adapting to a particular type of sound, e.g., MFCC for speech or chroma
features for music. Five dimensions of relevance for a feature set specifically adapted to
environmental sounds are as follows:
• Diversity: The particular feature or group should exhibit a large spread (entropy)
in the context of real-world data sets.
• Efficiency: We desire a small yet comprehensive feature set that can be efficiently
calculated. Thus, we should choose only one from a set of functionally redundant
features, for instance bandwidth and spectral sparsity, and avoid calculating an extremely large feature set at every frame, and then reducing the dimensionality using
a method such as PCA.
• Categoric relevance: Different categories of sound (e.g. voice, construction sounds,
traffic, wind noise) should have different average feature vectors.
• Usability: Feature values should possess intuitive meaning when searching for specific
values of a given feature, something that we do not have for example with the fourth
MFCC or third PCA coefficient.
• P e r c e p t u a l adaptation: Sounds that sound different should have different feature
vectors; i.e., feature distance varies with perceptual distance. Although not presented
in this dissertation, efforts to map out feature scales according to the concept of
just-noticeable difference (JND) is an important topic of future work.
30
Given these relevant dimensions as design goals, we now present six acoustic features
that provide an intuitive and minimal set for both segmentation and retrieval purposes.
When appropriate, examples of functionally redundant features from the literature are presented along with the selected features. Due to the diversity of sound sources, we have found
it necessary to calculate features at different time scales, from 40ms (short-term) to one second (long-term). Short-term features are computed either directly from the windowed time
series data or via S T F T using overlapping 40ms Hamming windows hopped every 20ms.
Long-term features are computed using a one-second sliding window to combine the data
from 49 of the 40ms windows, in order to capture slowly evolving textural characteristics of
the sound. Using 98% overlap for the sliding window, (i.e., slide in 20ms steps), both long
and short-term features remain synchronous, and can be concatenated into a feature vector
for each frame.
The first of our features is loudness, which we define as the RMS level in decibels of
the windowed frame indexed by t, i.e.,
Yt{1) = 20 log10(RM St).
(2.3.1)
Our loudness feature can be considered functionally redundant with the volume feature [122]
and the MPEG-7 audio power descriptor [114], but the conversion to decibels makes this
feature more perceptually accurate. Next, we compute from S T F T data the Bark-weighted
spectral centroid:
££i(&;-&,--i)l*t(J)l2
where bj is the frequency value of the center of STFT bin j in units of Barks [90]. The Barkweighted spectral centroid is a perceptually weighted version of the brightness feature in
[81,84], while also being functionally redundant with the MPEG-7 audio spectrum centroid
31
descriptor [114]. A third feature, spectral sparsity is calculated from the zero-padded STFT
data of each frame, via the ratio of £°° and i1 norms calculated over the magnitude STFT
spectrum (inspired by the common I1 sparsity metrics used for compression, blind source
separation, etc. [123]). Defining Xt(j)
as the M-point S T F T coefficient from frequency bin
j for frame t, the spectral sparsity is defined as follows,
y(3)
A max(|Xt(l)|,...,|Xt(M)|)
Ef=1\MJ)\
Spectral sparsity should be large for pure sine tones or bells, and smaller for sounds with
significant "noise" characteristics that imply a wide frequency spectrum. Spectral sparsity
tends to behave in a similar fashion to bandwidth features [81,84,122], spectral rolloff
frequency features [81,83, 122], the MPEG-7 audio spectrum spread descriptor, and the
MPEG-7 audio spectrum flatness descriptor [114].
As mentioned previously, our feature set also contains features computed using a onesecond sliding window that hops every 20ms, constructed by combining data from
N=49
of the 40ms frames. Temporal sparsity is one such feature, which we define as the ratio of
£°° and I1 norms calculated over the N small window RMS values in a given one second
window, i.e.,
y (4)
A
ma,x(RMSt_{N+1),...,RMSt)
Y,k=t-{N+\)RMSk
Temporal sparsity is large for sounds such as footsteps in relative silence and useful for
indexing and retrieving these types of sounds.
Transient index is computed by averaging over several frames of data the Euclidean
distance between the MFCC vectors of consecutive frames . That is, we define the transient
index for frame t using the MFCCs as
t
yt
(5)
=
Y,
fc=£-(N+2)
WMFCCk-MFCCk^h
(2.3.5)
32
where MFCCk
is the fifteenth-order MFCC vector for frame k, and N is & positive in-
teger signifying the number of short frames over which the transient index is computed.
The transient index should be useful in detecting and segmenting sounds with spectral
characteristics that exhibit consistent fluctuations between adjacent frames, e.g., crumpling
newspaper. Transient index helps to summarize the MFCC features into a single value per
frame, and behaves in similar manner to the spectral flux feature in [59].
Finally, harmonicity
is used to measure probabilistically whether or not the STFT
spectrum for a given frame exhibits a harmonic frequency structure. Harmonicity should
be large for speech, music, and certain machine noises and smaller for most other types of
environmental audio, as well as some musical sounds (bells). We calculate harmonicity based
on Goldstein's pitch detection algorithm, which in our case is computationally efficient since
the S T F T is already calculated for extraction of other features. Because we are calculating
harmonicity as a function of pitch, we can consider fundamental frequency as functionally
redundant to harmonicity, while other features that correlate with harmonicity include zero
crossing rate [58,83,122] and the MPEG-7 audio harmonicity descriptor [114].
The algorithm begins by choosing the L most prominent frequency peaks from the
STFT magnitude spectrum of frame t. Peak amplitudes and frequencies are determined
using quadratic interpolation [124], while amplitudes are required to be above both a global
and frame-adaptive threshold in order to provide robustness to noise. Then, the frequency
of each peak in Hz, is stored in the set pt = {/i,.., JL}- Using an adaptation of Goldstein's
algorithm [98,100], we first estimate the fundamental frequency f0, for which the fcth harmonic has frequency kf0.
As discussed in [100], accurately computing the Goldstein pitch
estimate requires solving a computationally intensive combinatoric optimization problem,
and although highly accurate MCMC approximations have been proposed [57,100], they
33
are still too slow for our application. To provide an efficient approximation that works well
in most real-world cases, we compute the Goldstein pitch estimate by searching pairwise
combinations of the L frequency peaks, and kmax < 2L possible harmonics. Assuming the
pair of peak frequencies denoted by {/i,/2} C pt, are harmonics of f0, with corresponding
harmonic numbers of k\ and &2, the Goldstein likelihood [98,100] can be computed from
2
P{.h, h\L hM) = I I -/(/j; kjf0, <TJ)
(2.3.6)
where j(x; /x, a) is the univariate Gaussian pdf with mean /z and standard deviation a
evaluated at x, and the standard deviation terms OJ are considered to be proportional to
frequency [98], i.e.,
<Jj = Ckjf0
(2.3.7)
with C a constant. In this work we set C = 0.01/v2, as this value was shown in [98] to
provide accurate estimates for a wide range of harmonic numbers. Given the form of aj
from (2.3.7), [98] shows that the fundamental frequency estimate f0 for any pair of frequency
peaks is
( / i A i ) 2 + (/ 2 /fc 2 ) 2
f
h/k\
The harmonicity Yt
,9„8,
+ J2/K2
for frame t is then given by
Y™±
max
J1J2MM
P{h,h\foMM
subject to :
1 < ki < k2 < kmax,
A < h,
Jo -> Jmin-
yZ.o.y)
34
The final constraint in (2.3.9) is used to avoid very low (1/2,1/4, etc.) estimates for f0,
while maximization of (2.3.9) is done by exhaustively evaluating (2.3.6) and (2.3.8) for
all valid combinations of / i , / 2 , &i, and ki- If less than two frequency peaks are found in
the magnitude spectrum for frame t, F4
is set to zero. This approach to harmonicity
calculation treats the peak assignment as a nuisance parameter that is resolved using a
"generalized" rather than a "marginalized" likelihood approach, where all frequency peaks
are considered explicitly in the algorithm; however, it is only the two most-likely peaks that
are used in the pitch estimate.
2.3.1 Illustrative E x a m p l e
We now provide an example of how the six features just presented satisfy the efficiency,
diversity, categoric relevance and usability dimensions presented at the beginning of this
section. As previously mentioned evaluation of the perceptual adaptation dimension remains
a topic of future work. Regarding efficiency, it should be clear from the feature definitions
that none are computationally prohibitive, and implementing a real-time feature extraction
system at 50 frames per second is easily achievable with modern hardware.
Table 2.1
shows the percentage of the computation time for each task during a frame of the feature
extraction process. The most computationally intensive task is the MFCC calculation used
for the transient index feature, while solving (2.3.9) to calculate harmonicity, and computing
the S T F T also requires significant resources.
Figure 2.1 displays the time domain waveform and the feature trajectories (all normalized between zero and one to aid presentation) for an example recording. This example
consists of four events: a whistle (2-3 seconds), a key jingle (4.5-7 seconds), a shout (9-10
seconds), and a metal knife hitting a metal sink (12-13 seconds). From Figure 2.1 we can
see the diversity of the feature set as the only feature that behaves in a consistent fashion
35
TABLE 2.1
Percentage of computation time for tasks in feature extraction process
% Computation time
24.60
26.21
47.45
0.05
0.29
1.01
0.39
Task
STFT
Harmonicity
Transient Index
Temporal Sparsity
Spectral Sparsity
Spectral Centroid
Loudness
Keys
Whistle
Waveform
Speech
Metal
•imp
JH.
Harmonicity
r
l«rth.-»<*. ••»A«»J
Transient Index K«A i»vau»»/k»» i»<J *'WMIW'lr^«^»<jA,
Temporal Sparsity
JH
Spectral Sparsity
Spectral Centroid
Loudness
0
Fig. 2.1.
_n
2
_rv J<4
4
6
8
Time (seconds)
10
12
Waveform and feature trajectories from an example recording.
36
across all four events is loudness, which because this recording was made in a low noise
environment we expect loudness to increase during events. We will provide examples in
later chapters of recordings made in noisy conditions where loudness is not so informative.
Table 2.2 provides the mean feature values averaged over the course of each event for the
same example recording. By jointly examining Table 2.2 and Figure 2.1 we also illustrate
how the presented feature set satisfies the categoric relevance and usability goals.
Harmonicity strongly reacts to the speech sound, and slightly to the metal tap sound
the only events that intuitively would have a harmonic spectrum. The "noise-like" qualities
in the key jingle sound seem to have the most effect on the transient index and spectral
centroid features. Spectral sparsity increases most during the inharmonic whistle sound,
which has a spectrum similar to a pure sine tone. Temporal sparsity has its highest average
value during the metal tap event, which is a short duration event surrounded by relative
silence. The example of Table 2.2 and Figure 2.1 should help to clarify the types of events
each feature responds to, and how these features can be used in the retrieval process.
TABLE 2.2
Mean feature values for four example events.
Harmonicity
Transient Index
Temporal Sparsity
Spectral Sparsity
Spectral Centroid
Loudness
Whistle
0.0000
4.3491
0.0804
0.2125
8.1649
-34.6946
Keys
0.0001
8.8732
0.116
0.0247
14.2722
-38.7279
Speech
0.0124
4.6863
0.0634
0.0580
3.7185
-42.3217
Metal
0.0016
4.4564
0.2004
0.1725
1.5878
-42.7817
2.4 Conclusion
The process of low-level feature extraction forms the first step in nearly all content-based
computer audition systems.
While many of the features proposed in the literature are
37
specifically geared towards speech and music sounds, features for natural and environmental
sounds have begun to appear. Given the ample number of features proposed in the computer
audition literature selecting an appropriate subset for use in a given application is a nontrivial task. To help overcome this challenge, five relevant dimensions for choosing a feature
set were described and subsequently exploited for the six-dimensional feature set proposed
in this chapter. In upcoming chapters this feature set will be used to segment sound events
from continuous recordings and retrieve sound events from large audio collections.
3. A U T O M A T I C A U D I O S E G M E N T A T I O N
3.1 I n t r o d u c t i o n
In the previous chapter the process of acoustic feature extraction, i.e., converting an acoustic
signal into a time-series of feature vectors was described. These features then serve as the
input data for more complicated audio processing tasks. Because the targeted applications
described in Section 1.2 require continuous recording, it is not inconceivable that we might
have to extract features for tens of thousands or possibly even millions of frames. Clearly, it
is difficult to archive a time-series of this length without some knowledge of when important
events occur. In this chapter we will use the extracted time-series of audio feature vectors
to segment the continuous recording into specific sound events.
For the purpose of this dissertation we define segmentation as the process of automatically determining onset and end times for individual sound events. To define a "sound
event" we follow as closely as possible Bregman's concept of an auditory stream, which is
perceptually equivalent to the sound emanating from a single physical source [5]. Thus, a
"typing event" would consist of clusters of keystrokes rather than a single keystroke, and a
"speech event" would consist of the continuous speech between pauses or transitions where
the speaker changes, not structural units such as phonemes, words, or sentences. It is also
common that events overlap in time, e.g., a recording of someone walking to a bus stop
could contain the sound mixture of a car driving by, the sound of footsteps, and birds
chirping in a nearby tree. In this case, a source separation or de-mixing process is also
necessary to isolate individual sound events. The source separation problem has attracted
much research interest [46-50], but reconstructing the separated sources to be artifact free in
real-world recording conditions has not yet been achieved. Therefore, we limit ourselves to
only temporal segmentation in order to avoid adding artifacts to the audio archive through
39
the source separation process. Thus in the present work, a "segmented sound event" could
in fact be a mixture of multiple events occurring simultaneously.
Similar to other computer audition tasks, prior work in audio segmentation has focused
primarily on speech [51-53] and music [54-57]. Although, methods for the segmentation of
environmental sounds have been proposed they often work for only specific event types and
recording conditions [2,12,60,125]. In order to develop a general event segmentation system
for natural and environmental sounds there are certain challenges that must be considered.
First, we should specifically address the context of the recording so the decisions made
regarding event boundaries use the surrounding features to help make the decision, i.e.,
we should account for dynamics in the time-series of feature vectors. Second, event onsets
and end points might not be instantaneous, but distributed over several frames of data,
i.e., most sound events typically begin and end gradually. Third, different features will be
responsive to different types of sounds, depending on the environment where the sound was
recorded, and segment boundaries between responsive features for a given sound event can
be delayed.
In order to tackle these challenges we must model not only the variations of individual
features over the course of the recording, but also the collective behavior of all features at the
event level. Thus, we hierarchically model how event boundaries influence changes in features, while allowing for uncertainties due to gradual event onsets and unresponsive/delayed
individual features with a dynamic Bayesian network (DBN) [126]. Bayesian networks are
directed graphical models where nodes in the graph represent variables and directed arcs
between nodes signify conditional dependencies between the connected variables. A DBN
consists of local Bayesian networks for each time-slice of sequential time-series data, and
the local networks are connected to represent conditional dependence between time-slices.
40
DBNs generalize other probabilistic models such as the hidden Markov model (HMM) and
linear Gauss-Markov model with Kalman filter inference process. Additionally, DBNs provide an optimal framework for inference of the sufficient statistics necessary to specify the
posterior distribution of certain hidden variables given an observed time-series, and the
learning of necessary probability distribution parameters [126]. The segmentation model
proposed in this chapter focuses on a specific subset of DBNs referred to as switching state
space models or switching dynamic systems [62,126-131]. Specifically, we monitor rapid
changes in acoustic features in order to infer when event onset and end times occur.
In the remainder of this chapter, we will present details of a custom DBN for event
segmentation in natural sound environments. Section 3.2 introduces our approach for event
segmentation using a single feature. This elementary case will serve to illustrate fundamental concepts in DBN theory, such as approximate Viterbi inference in Section 3.3 and DBN
parameter learning using the expectation maximization algorithm in Section 3.4.
Once
these concepts have been introduced we can then easily extend the segmentation model to
the multi-feature case as described in Section 3.5. A detailed analysis of how the segmentation algorithm performs on real-world continuous sound recordings is provided in Section
3.6, and concluding remarks are presented in Section 3.7.
3.2 Single Feature S e g m e n t a t i o n
Given a recording to be segmented into specific acoustic events, we assume that the feature
extraction process described in Chapter 2 has already been completed. By observing large
changes in the values of these features we can determine event onset and end times using
a generative DBN model. In this section we describe the case where only a single feature
is extracted from each frame of the audio signal, i.e., the time-series of feature vectors
simplifies to a time-series of feature scalars.
41
Let l j G R be the observed scalar feature value at the frame indexed by t. If the
recording is composed of T frames, then the feature time-series is represented as Y\:T,
where the notation i £ l : T will be used throughout this dissertation as shorthand for the
sequence of integers t £ {1,2, ...T}, e.g., Y\-T = {Y\, Y2, ...,Yp}.
In our underlying model,
we are not only concerned with separating regions of silence from regions where the sound
is on, but we also hope to account for new sound clips beginning before the previous clip
has turned off.
We model this by assigning a global mode, Mt to each audio frame of
the recording. We represent Mt as a discrete random variable, responsible for controlling
the underlying dynamics of the observed features. Similar to the note onset/continuation
formalism for segmenting musical note events [56,57] the mode, Mt in this work can take
three possible values:
• O l - T h e onset, or beginning of a new sound event;
• C l - T h e continuation of a sound event between two successive frames;
• 0 The absence of a perceptible sound event overlapping the given frame.
We use this three value model for Mt as opposed to Mt £ {on, off} because new sound events
can begin before previously sounding events end, and these onsets can only be detected by
splitting the mode to have three values. We assume Mt is a discrete first-order Markov
process [131], governed by the Markov chain shown in Figure 3.1. When a sound event begins
(transitions from Mt-\
from Mt-\
= 0 to Mt = Ol or Mt~\ = C\ to Mt = Ol) or ends (transition
= C I to Mt — 0) we expect to observe large changes in the observed feature
values at these frames. Additionally, by assigning a non-zero probability to transitions from
Mt~i = CI to Mt = 0 1 , our model explicitly accounts for sound event onsets that occur
42
Fig. 3.1. Markov transition diagram for global mode Mt, where p n e w is the prior probability
of a new onset, while poB is the prior probability of a sound turning off, given that it is
currently on.
before a different event has reached its endpoint. Once the sequence M\-,T is known for a
given recording, we can consider the segmentation complete.
As will be discussed throughout the remainder of this Chapter, points in the feature
time-series where large "jumps" are observed will be considered likely candidates for the
onset and endpoints of sound events. These observed jumps can also be considered considered as points where the mean value of the feature time-series changes, thus, we employ
a variation of the approach for segmentation in the changing mean model [62]. We model
(i)
the actual feature observations YtK' as inherent features corrupted by noise. This inherent feature is based upon a series of step functions, where the value of the step function
changes (jumps) anytime an event onset or endpoint occurs. Because event onsets are not
instantaneous, but rather fade in/out over multiple frames the inherent feature is actually
a low-pass filtered version of this series of step functions. Since the inherent feature is not
observed, we will refer to it as a hidden state, where St £ R signifies the hidden state at
frame t.
Given the definitions of the observed feature
M\-T
YI:T,
hidden state
SI-_T,
and global mode
sequences we can summarize the relationships between these variables using the di-
rected acyclic graph (DAG) [132] displayed in Figure 3.2. The presence of a directed arc
43
Fig. 3.2.
Directed acyclic graph for single feature audio segmentation model.
between two nodes in a DAG indicates conditional dependence between the random variables represented by those nodes, while the lack of an arc between two nodes represents
conditional independence. Thus, Figure 3.2 specifies the factorization of the joint distribution as
r-i
P(Y1:T,S1:T,M1:T)
r-i
= P(Mi)P(5i|Mi) J ] P(Mt+1\Mt) J ]
t-i
t=\
T
P{St+l\SuMuMt+l)J\P{Yt\St).
t=\
(3.2.1)
3.2.1 D i s t r i b u t i o n a l Specifications
To define the segmentation model of Figure 3.2 we must specify conditional probability
distributions P{Yt\St),
P(St\St-i,Mt,Mt-i),
and P(Mt+1\Mt).
We begin by presenting
the state-space equations for the dynamic system that relates the observed features to the
hidden state and global mode as
St+i = (1 - a)St + a qt(Mt, Mt+1)
(3.2.2)
Yt = St + rt.
(3.2.3)
44
Here, a is a low-pass filter coefficient, and if the a and (1 — a) terms from (3.2.2) are
removed the dynamic system becomes the changing mean model from [62]. The state noise
qt(Mt,Mt+i)
and measurement noise rt are independently distributed Gaussian processes:
qt{Mu Mt+1) ~ M(0, Q(Mu Mt+l))
(3.2.4)
rt~M(0,R).
(3.2.5)
In the system of (3.2.2) and (3.2.3), R is t h e measurement noise variance, and feature changes (jumps) are modeled by the state noise qt{Mt, Mt+i),
where the variance
Q(Mt, Mt+\) will be be large during event onset and end times (Mt ^ Mt+i)
when changes do not occur (Mt — Mt+\).
and small
In other words, the observed feature value at
time t will be close to the value at t + 1, when Mt = M i + i , and far from the value at t + 1
when Mt ^ Mt+\- The purpose of introducing the low-pass filter in (3.2.2) is to account
for the fact that in practice acoustic event onsets and end times are not instantaneous, but
rather distributed over several frames.
Given the Gauss-Markov assumption of (3.2.2) and (3.2.3) the conditional distributions
of the continuous variables from the DAG in Figure 3.2 are:
P(Yt\St)=H{SuR)
P(St+1\SuMt,A4t+1)
(3.2.6)
= Af(aSt,a2Q(Mt,Mt+1)
P(SI|MI)=JV(S0,QO(MI)).
(3.2.7)
(3.2.8)
where 5o and Qo(M\) are the initial mean and variance, respectively, of the hidden state.
Finally, we specify P(Mt+i\Mt)
as the Markov chain in Figure 3.1, which requires values
for p n e w and p off , the prior probabilities of a sound turning on and off, respectively.
45
3.3 A p p r o x i m a t e Viterbi Inference
Segmentation is achieved by estimating the global mode sequence M\-T- Ideally, our estimation criterion should preserve the correct number of segments, and the detected segment
boundaries should be near the true segment locations. In order to achieve these goals we
choose the maximum-a-posteriori (MAP) or global segmentation criterion to estimate M\:TThe MAP criterion can be defined as
MUT = a r g m a x P ( M 1 : T | r i ( : 1 T i r ) )
(3.3.1)
M1:T
and is typically more effective in estimating segmentation sequences that do not contain
boundaries in adjacent frames, in contrast to a local frame error segmentation criterion [56].
Unfortunately, computing the exact MAP estimate requires exponential-time complexity. A linear-time approximation nevertheless exists, using an approximate Viterbi inference scheme [130]. The Viterbi algorithm is widely used in the HMM setting [106], but we
will now summarize its application to the single feature segmentation model following the
switching linear dynamic system Viterbi approach of [130].
We begin by defining the following quantities
St\n(i) = E[St\Y1:n,Mt
St]n(i,j)
= E[St\Y1:n,Mt
= i]
= i,Mt-!
S t |„(i) = E[(St - SAn{i)f\Ylm,Mt
S t | n (i, j) = E[(St - St]n(i,j))2\Y1:n,Mt
(3.3.2)
= j]
= i]
= i , M t _ ! = j]
i,j G { 0 , 0 1 , C I }
where E[-] denotes the expectation with respect to the posterior distribution. From Kalman
estimation theory [62,131,133] the quantities in (3.3.2) are filtered estimates when n = t,
46
predicted estimates when t > n, and smoothed estimates when n > t. Because the values
of the global mode are assumed known, we can use the Kalman filter time update and
measurement
update recursions to calculate the quantities in (3.3.2) given the state space
formulation of (3.2.2) and (3.2.3). Furthermore, given the global mode is in state i at
frame t and state j at frame t — 1, we can interpret Stit(i,j)
and E t | t ( i , j ) as the "best"
filtered hidden state mean and variance at frame t, respectively, when t observations Y\.t
have been processed. Similarly, 5 t i t _ 1 (i) and St|t_i(i) are the "best" predicted state mean
and variance at frame t, respectively, given t — 1 observations and the global mode is in
state i at frame t.
The Viterbi algorithm is based on the partial cost at frame t, which is defined as
Jt(i)=
min
Ml:(-1,
P(Mt = i,M1:t-i,S1..t,Y1..t)
(3.3.3)
S1:t
which is the lowest cost (maximum probability) global mode path over all possible global
mode Mi : t_i and hidden state S\:t sequences ending at Mt = i, where i G {0, O l , C I } . This
cost can be computed recursively as
Jt(i) = imn{Jtit-i(i,j)
+ Jt-i(j)}
(3.3.4)
3
where Jt,t-i(i,j)
is the innovation
cost incurred from switching to Mt — i from Mt-\
=j ,
and will be defined in more detail later. Because our segmentation goal is to infer the
state sequence
MI-.T,
we must also keep track of the indices that minimize the cost in the
transition record
rpt{i) = a r g m i n { J M _ i ( i , j) + Jt-i(j)}.
i
(3.3.5)
47
We also store for each frame t, three lowest cost state and variance values
St\t{i) = SMt(i,1>t(i))
St|t(0 =
(3-3.6)
tt\t(i,ipt(i))
i£ {0,O1,C1}.
After the observations from all T frames have been combined, the total cost and global
mode value at frame T are given by
j;{i)
=minJT(i)
(3.3.7)
i
M? = arg min JT(i).
(3.3.8)
i
By backtracking through the transition record Vt,i, the estimated global mode sequence
M\;T = Mf.T is then obtained from
Mt* = MM;+l)
(3.3.9)
Now that the basic structure of the approximate Viterbi algorithm has been defined,
we provide the details necessary to compute the innovation cost Jttt-i(i,j),
heart of the inference scheme. One way of expanding the posterior
which forms the
P(MI:T\YI-.T),
is [62]
T-l
P(M1:T\Y1:T)
ex PiM^
H
[P(Mt+1\Mt)P(Yt+1\Yl:UM1:t+1)]
(3.3.10)
t=i
where the last term is the observation likelihood of the Kalman filter [62,130], which given
Mt = i and Mt-\
= j is distributed as
P{Yt\Ylu.uM1:t.2Mt-i=J,Mt
From
(3.3.10)
it is clear that
[P(Mt+i\Mt)P(Yt+i\Yi.,t,Mi:t+i)].
= i)=Af(Yt-St^1(iJ),i:t^1{iJ)
each new observation
+ R).
(3.3.11)
multiplies the posterior
by
Thus, in [130] the innovation cost is proportional to
48
the negative log of this quantity, i.e.,
Jt,t-i(iJ)
cc-log[P(Mt
= i\Mt-1=j)P(Yt\Y1:t-1,Mi:t-2,Mt-1=j,Mt
= i)]
(3.3.12)
where the first term is a Markov transition, and the second term is Gaussian as defined in
(3.3.11). The innovation cost is then
tt 1
. .!;_:^+rlog(St|t-i(t,j) + ^)-logiJ(Mt = ilMt-1=j)
Jt,t-i(i,j) = i~
(3.3.13)
3.4 P a r a m e t e r Learning
Our goal in the segmentation process is to infer the global mode sequence M\._T from the
feature sequence
Y\-.T-
Unfortunately, solving this inference problem requires values for
the parameters necessary to specify the model, which include the low-pass filter coefficient
(a), observation noise variance (R), mode transition probabilities (p new , Poa), and process
noise variances (Q(M t ,Mj+i)). A general and well-known technique for finding maximumlikelihood estimates of parameters in a probabilistic model is the expectation-maximization
(EM) algorithm [134]. The EM algorithm has found widespread use in the DBN literature [126,135,136], with the Baum-Welch algorithm for HMMs [106] a particularly wellknown example. Successful application of the EM algorithm in estimating parameters in
linear dynamic systems [137] form the basis for EM learning used in conjunction with the
approximate Viterbi inference in switching models presented in [130].
The EM algorithm alternates between obtaining sufficient statistics from the Viterbi
inference (E-step) and updating parameter estimates (M-step) until all parameter values
have converged. Unfortunately, the EM algorithm is only guaranteed to converge to a local
maximum on the likelihood surface, therefore, setting initial values of the parameters is
crucial if the EM algorithm is to find the global maximum. We now present some heuristics
for setting parameter values. For example, setting the global mode priors of Figure 3.1 to
49
Pncw = 0.001 poS = 0.002 roughly corresponds to incorporating into the model the prior
information that a sound event of length 500 frames occurs once every 1000 frames. In
general we have found that the segmentation model is relatively robust to incorrect initial
settings of the prior probabilities (p new , poff) within approximately two orders of magnitude.
We have also observed consistent model behavior for different values of the low-pass filter
coefficient a from (3.2.2). In general, an initial value around a = 0.95 seems to work rather
well for different features and event types.
To initially set the process noise variance we must provide three different values. We
now consider the three cases: (1) no events occur (Mt = Mt+i = 0), this variance will
be called Qoff, (2) an event is occurring (Mt = Mt+\ = C I ) , this variance will be called
Qon, and (3) a change occurs [Mt ^ Mt+\),
this variance will be called Qchange- Typically
Q'change is much greater (several orders of magnitude [62]) than Qon and Qoff, because this
process noise represents feature jumps. We have observed for some features, e.g., spectral
centroid, that the feature varies widely during sound events, thus, Qon should be set to a
much larger value then Qoff- Other features such as spectral sparsity behave differently,
i.e., the feature value is relatively constant during an event, and in these cases Q0ff should
be set to a larger value then Qon. The proposed model is not extremely sensitive to initial
values for Qchange (assuming it is much larger than Qon and Q0ff),
but the ratio of Qon
to Qoff is extremely important and the values of these variances should be determined
by examining feature trajectories from example events, and determining how the feature
behaves during events.
The behavior of the approximate Viterbi inference algorithm of Section 3.3 is most
sensitive to the setting of the observation noise variance R. Setting a R to a value that is
too large will lead to missed segments, as feature jumps might be mistaken for observation
50
noise.
Setting R to a value that is too small can cause false alarms, as random noise
variations are mistaken for the jumps of segment boundaries. The value of R is also very
dependent on recording conditions, as a recording captured outdoors on a windy day will
require a much higher value for R, than a clean recording captured indoors. A good initial
value of R is the sample variance of the feature trajectory.
Now that we have provided some heuristics for choosing initial parameter values, we
provide the EM re-estimation formulas for each parameter. We begin with the global mode
transition probabilities p n e w and p off .
The Markov chain governing the dynamics of Mt
shown in Figure 3.1 can also be represented as a transition matrix II, i.e.,
•*•
n=
rnew
o
Poff
/•'new
*-*
o
Pnew
(3.4.1)
I
J-
—
Poff
—
Pnew
We also define the three dimensional unit vectors e, as the vector of all zeros with a one
at the i t h position.
Once the global mode sequence M£.T is obtained from the Viterbi
algorithm (E-step), the matrix II is updated (M-step) as [130]
ft
= ( E e «; +1 eW ) dia§ ( E ew)
(3-4-2)
where ' denotes transpose and e ^ « is the unit vector corresponding to the inferred global
mode value at frame t. We can interpret the elements tljj of II from (3.4.2) as the expected
number of transitions from mode value j to mode value i divided by the expected number
of transitions from state i.
The remaining linear dynamic system parameters require the Rauch-Tung-Streibel
(RTS) smoothed statistics [133].
From the state and variance values that were stored
in (3.3.6) at each iteration of the Viterbi algorithm, we can obtain a single state sequence
51
and variance sequence, respectively, from
for t = 1,...,T.
5t*|t - St\t(Mt*)
(3.4.3)
Xt*|t = £ t | t (M t *)
(3.4.4)
We can then use S*,t and X*,t to obtain the RTS smoothed statistics
Stir' S*, r , and S * t _ 1 , r .
See [133, 137] for the well-known RTS recursions.
Given the
RTS smoothed statistics the maximum likelihood estimate of the observation noise variance
is [137]
R=f
1
T
E ^ - V ) 2 + S *|T]
(3-4.5)
To define the low-pass filter coefficient a, which in our case serves as the state transition
matrix, we define the following quantities:
T
A = ] T ( £ t * _ 1 | T + (S t *_ 1|T ) 2 )
(3.4.6)
t=i
T
t=\
c = £ ( s ; | T + (st*|T)2).
t=i
The low-pass filter coefficient is then determined from [130,137]
a = l-^r.
(3.4.7)
A
Calculating the process noise variances Q o n , Qoff,
an
d Qchange requires that we change the
limits of the sums in (3.4.6), and only sum over frames when the global mode is on, off,
or changing, respectively, to obtain values such as Aon, C0ff, etc. We can then obtain the
52
maximum likelihood estimates for the process noise variances as
where Non,
N0ff,
&2Qon = -^r- (Con ~ B2on/Aon)
(3.4.8)
&2Qoff =
T^-(C0//
(3.4.9)
a
=
Qchange
TT
^change
- B2off/Aoff)
{^change
~ ^change/
-^change)
(oA.W)
and Nchange are the number of frames where the global mode is on,
off, or changing, respectively. In this application we have ignored the maximum likelihood
estimation formulas for the initial state values, because we have observed that our model
is not overly sensitive to these (typically a mean of zero and variance one work well), but
these formulas can be found in [137].
3.5 Multi-feature S e g m e n t a t i o n
In practice a single audio feature will be insufficient to obtain an accurate segmentation of
a recording, thus, we must extend our model to the multi-feature case. By treating each
feature as conditionally independent given the global mode (rather than trying to model
possible dependencies that could negatively impact performance if modeled incorrectly), we
can easily extend the single feature model discussed in Sections 3.2-3.4 to the multi-feature
case. Our segmentation goal of estimating the global mode sequence M\:T remains the
same, but we add additional hidden discrete variables that govern the Markov switching
between on, off, and onset states for each individual feature, while the global mode Mt
fuses multiple features. Thus, each individual feature operates as a linear dynamic system
identical to that discussed in Sections 3.2-3.4, but the individual feature behavior follows
not only the global mode, but also the hidden variables we now describe.
Let K represent the number of features extracted from each frame of the signal, then
Yt{k\
for k <E 1:K are the observed audio features at time t. The global mode Mt remains
53
as a discrete random variable as previously denned, but it is now responsible for controlling
the underlying dynamics of all K features. Because of the variation in time scales and
meaning of the different features, it is possible that certain features lag behind the overall
mode Mt when turning on or off. Furthermore, even if there is a sound present at time t,
it is likely that some of the audio features will fail to respond at all. The discrete random
(k)
variables /ij , for k e 1: K serve as gates modeling the responsiveness and delays between
the onset and end times of the different features. Similarly to Mt, nt
£ {0, O l , C I } , where
the definitions of 0 , 0 1 , and C I are the same as for Mt.
Assuming T available data frames and K features, the sequence of observations can
be written as Y1T
, meaning a sequence of T observations for each of the K features.
Similar notation is used to represent the sequences of hidden variables, SVT
and \rVT
,
while all features share a common global mode sequence, M\-T- The DAG for our proposed
feature fusion segmentation model with K features is shown in Figure 3.3, and leads to the
following factorization of the joint distribution:
K
K)
,(!:*)
Q(1:K)
P(M1:T,$*\S&
\Y&
)
v
fc) ,.,ZlJ
^TJ
TfW'I
= ptA/T
P(M0
J ] P(
P(i/{
|Af!
(l:^ _
k=l
T
K
x l[P(Mt+l\Mt)
J ] [p(tft(fc)|Mt>Mt_i)] .
(k)
(3.5.1)
/ (k)
where H^ denotes all of the random variables for feature k, and P(H^
| M t , M t _ i ) at
frame t factors as
P(fft(fc)|Aftl Aft_i) =
P(4k)\f4%MuMt-i)
x P(S^%^\^1)P(Yt^\S^).
The prior P{H^\M{)
(3.5.2)
from (3.5.1) is defined as
P ( ^ f c ) | M ! ) ^ J P( M ( f c ) |M 1 )P(5fVi' c ) ) J P(^i ( f c ) |5i f c ) )-
(3-5-3)
54
(£+2)
Fig. 3.3.
fr(k+2)
Directed acyclic graph for feature fusion audio segmentation model.
55
3.5.1 Distributional Specifications
To define the segmentation model of Figure 3.3 we must specify conditional probability dis-
tributions P(Ytik)\S{tk)),
P ^ S ^ , ^ , ^ ) ,
As mentioned previously, P(Mt+i\Mt)
P ^ V j ^ M , ^ . ! ) , and
P(Mt+1\Mt).
is as specified in Figure 3.1, and functions identically
to the way it did for the single feature case only now it governs all K features. Similarly,
P{Yt{k)\S{tk))
is as specified in (3.2.6) and P ( s f
(k)
but with //J
^ S S . / ^ M J - I )
is as specified in (3.2.7),
(k)
and fJ.\_i replacing Mt and Mt-\TABLE 3.1
Mode transition probabilities for P{fit W,,(k)
(k)
Mt
Mt+i
0
0
0
Ol/Cl
Ol/Cl
Ol
CI
CI
CI
Ol/Cl
Ol/Cl
Ol
CI
0/O1/C1
0/O1/C1
0/C1
CI
o\
01
n^
= 0)
1
0
1 - B(k)
(k
11 - D
P l a "g +
0
^lag+
Wag-
\P\*R-
^ ( ^ i = Ol)
0
0
Pit^i
= CI)
0
(k)
P.ag-
0
Plag +
^lag+J
0
0
PlaK+
\/j,t_l,Mt,Mt-i).
/'lag+J
1 - »^ (l ka )g +
0
0
1 - v{k)
1
1
»(k)
• »(k)
Plag-
1
1
»
W
Mag+
-U(k)
Plag+
_i), we model possible lags between when a particular gate,
/4 , turns on after Mt has turned on as Poisson [138]. Letting p\*l+ be the probability
that the lag will continue for an additional frame (i.e., the Poisson rate parameter is 1P\ll+) >
tne
Poisson arrival time of the first event (in this case the end of the lag) follows an
exponential distribution with parameter l—p\*l+, thus, the expected additional lag becomes
1/(1 — p, (k) + ). Similarly, we model possible lags between when a particular gate, fi\ , turns
off after Mt has turned off as Poisson, with p[kg_ as the probability that the lag will continue
(k)
for an additional frame. A summary of P(nt
(k)
\fJ,t_\,Mt, Mt-i)
for all possible combinations
(k)
of n\_\, Mt, and Mt-\,
is shown in Table 3.1. For each feature, we must specify a value
56
for pl*l+ and p\*l_- As an example, we have observed that features based on spectral peaks
(e.g., harmonicity, spectral sparsity) tend to indicate event end times before features such
as spectral centroid by several frames, thus, a value of p[*l_ > 0.5 would be appropriate for
spectral centroid.
3.5.2 A p p r o x i m a t e Viterbi Inference and P a r a m e t e r Learning
To use the approximate Viterbi inference scheme for the feature fusion segmentation model,
we begin by combining the K + 1 discrete nodes Mt and /j,\ '
3.3 into a single discrete random variable \I/t = Mt x ^\'...
x /zj
from the DAG in Figure
, where x represents the
Cartesian product and * t can take |vl/| = 3K+1 possible values. We then write the posterior
of the sequence ^I.T as
(
p
K
p(*i:Tin :r)« ^i) n p(*t\*t-i) n
t=2
fc=l
T
pfrf'ftin^f'i*!.^!
t=2
(3.5.4)
where the last term in (3.5.4) is the observation likelihood of the Kalman filter [62,130].
The approximate Viterbi inference algorithm can then be used as described for the single
feature case in Section 3.3, except that the role of Mt should be replaced by ^ j , and the
observation likelihood is now a product over all K features. This changes the innovation
cost of (3.3.13) to be a sum over features, i.e.,
K
j t , t -i(»,j) = X )
fc=l
(if'-^M)) 2
,fc)
i i
+
log^i^ti+R^)
2(siJ ) _ 1 (t,j)+« ( f c ) ) ' 2
(3.5.5)
l o g P ( * t = i | * t - i =j)
The complexity of the approximate Viterbi inference algorithm is 0(|\I>| 2 T) for the multifeature case, thus, complexity is exponential in the number of features used for segmentation.
For this reason, it is desirable to use as few features as possible for segmentation, while all
57
six features presented in Chapter 2 are used for indexing and retrieval. In the upcoming
results section three features are used for segmentation.
Regarding parameter learning, the single feature approach of Section 3.4 also easily
extends to the multi-feature case, because of the independence assumption between features.
Thus, we can learn the transition matrix for \&t in a similar fashion to the approach presented
for Mt in Section 3.4. We can then learn a^k\
R^k\
Q^,
Q^f,
and Q<^ange exactly as
described in Section 3.4 using the smoothed trajectories for each feature.
3.6 R e s u l t s and Discussions
To illustrate the performance of the multi-feature segmentation algorithm discussed in this
chapter, we apply them to example audio recordings in both indoor and outdoor environments. All sound files were captured and processed at 16bits/44.1kHz uncompressed. Prior
to segmentation, all extracted features were normalized to [0,1], by linearly mapping the
empirical maximum and minimum values (obtained from several hours of test recordings)
of each feature to zero and one, respectively. The approximate Viterbi inference scheme
was used to infer the segmentation.
Figure 3.4 displays results for an indoor recording of a whistle (2-3 seconds), a key
jingle (4.5-7 seconds), and a shout (9-10 seconds). The time-domain waveform for this
sound file is shown in the upper plot of Figure 3.4(a), while the extracted global mode
sequence M\._T is displayed directly beneath the time domain waveform. Mode values of
Mt = 0 are plotted as zero, Mt = CI are plotted as one, and Mt = OX are represented by
dashed lines. The top plots of Figures 3.4(b)-(d) display the the audio feature sequences
corresponding to spectral centroid, spectral sparsity, and harmonicity, respectively.
The
bottom plots of Figures 3.4(b)-(d) contain the inferred feature fusion parameters n\'T
for
i=l,2,3, respectively.
58
The example from Figure 3.4 helps to illustrate the diversity of our chosen feature set
as each of the three spectral features indicate strongest for different events. The spectral
centroid (Figure 3.4(b)) detects the key jingle between 4.5 and 7 seconds, while only increasing slightly during the whistle and shout sounds. As expected, spectral sparsity (Figure
3.4(c)) is highest during the non-harmonic whistle sound, while only slightly indicating the
presence of the key jingle and shout. Harmonicity (Figure 3.4(d)) only contributes to the
segmentation and indicates the presence of a sound event during the shout sound, which as
it is generated by the human voice tends to exhibit harmonic structure.
A second example demonstrating the performance of the segmentation algorithm is
shown in Figure 3.5, which was recorded outdoors in particularly windy conditions. The
sound is footsteps on metal bleachers (14-19 seconds). Examining the time domain waveform
shown in the top panel of Figure 3.5(a), we see that there are no clearly visible sound events,
although the global mode sequence (segmentation) shown in the bottom panel does extract
the footsteps even in these low SNR conditions. Due to the low SNR, loudness is of no
use for segmentation as shown in Figure 3.5(b).
Fortunately, the spectral centroid and
spectral sparsity features shown in Figures 3.5(c) and (d), respectively, are able to detect
the presence of the footsteps. This example helps to demonstrate the performance of the
algorithm in low SNR conditions.
Next, we examine the performance of the proposed segmentation algorithm on approximately two hours of continuous recordings. Approximately one hour was recorded outdoors
in a park during a windy day and contains sounds of people playing sports, talking, and
footsteps amongst other sounds. The rest of the data was recorded indoors in an apartment
and contains sounds related to telephone conversations, cleaning, and cooking among others. The recordings were then manually segmented into 155 indoor events and 170 outdoor
59
Original Waveform
Spectral centroid
10
12
10
12
10
12
1
(1)
M,
n
2
4
6
8
10
12
0
2
4
time (seconds)
6
8
time (seconds)
(a) Global segmentation
(b) Spectral centroid
Spectral sparsity
Harmonicity
0.04
1
(2)
t
0
4
6
8
4
10
time (seconds)
(c) Spectral sparsity
6
time (seconds)
(d) Harmonicity
Fig. 3.4.
Spectral feature segmentation example in an indoor environment, (a) Signal
waveform (upper panel) and global mode M\-_i (lower panel), (b)-(d) Extracted audio
features (upper panel) along with fusion gating parameter, n\.'T
( l ° w e r panel).
60
Original Waveform
Loudness
10
15
20
10
15
20
M.
0
5
time (seconds)
time (seconds)
(b) Loudness
(a) Global segmentation
Spectral sparsity
Spectral centroid
1
)
0
5
10
15
time (seconds)
(c) Spectral Centroid
20
time (seconds)
(d) Spectral Sparsity
Fig. 3.5.
Multi-feature segmentation example for footsteps on metal sound recorded in
a windy outdoor environment, (a) Signal waveform (upper panel) and global mode M\:T
(lower panel), (b)-(d) Extracted audio features (upper panel) along with fusion gating
parameter n\.'T
( l ° w e r panel).
61
events, and we tested the performance of our proposed segmentation algorithm using the
manual segmentation as ground truth. Additionally, we used the BIC segmentation algorithm originally proposed for speaker segmentation [52], and used for sound environment
segmentation in [23] as a baseline system. Theoretically, the penalty term in the BIC segmentation algorithm should be one, but we found that this was far too sensitive for our
application, so we treated the penalty term as a tunable parameter [23].
Table 3.2 displays the recall and precision of the proposed segmentation approach and
the modified BIC approach with tunable penalty parameter.
For both approaches, the
loudness, spectral centroid, and spectral sparsity features were used as the inputs to the
algorithm. Approximately one minute of data was used to tune the prior and noise variance
parameters in the proposed DBN approach and the penalty parameter in the modified BIC
approach. For segmentation, precision is the number of correctly identified event onsets
divided by the total number of event onsets returned by the algorithm (similar to the
inverse of false alarm rate), and recall is the number of correctly identified event onsets
divided by the true number of event onsets (analogous to the detection rate).
We see
from Table 3.2 that the proposed DBN approach outperforms the modified BIC approach,
with the most notable improvements for the outdoor recordings. In this case, the loudness
feature is not useful for segmentation (due to a high amount of background and wind noise),
which seems to negatively influence the BIC approach, while our approach was designed
specifically to perform well in situations where certain features do not always indicate the
presence of an event. Furthermore, the proposed approach is based on approximate Viterbi
inference, which is a linear time algorithm, while the BIC segmentation algorithm has
quadratic time complexity. One downside of the proposed approach is that there are more
parameters to tune/learn (the prior for the global mode as well as the gate priors, low-pass
filter coefficients, and noise variances for each feature), while the BIC approach has a single
adjustable parameter.
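To make this evaluation concrete, the short Python sketch below computes onset-based precision and recall from lists of detected and ground-truth onset times; the 0.5-second matching tolerance is an assumed value for illustration, not a parameter taken from this work.

```python
import numpy as np

def onset_precision_recall(detected, reference, tol=0.5):
    """Greedy matching of detected onsets (seconds) to ground-truth onsets within +/- tol."""
    detected = np.sort(np.asarray(detected, dtype=float))
    reference = np.sort(np.asarray(reference, dtype=float))
    matched = np.zeros(len(reference), dtype=bool)
    hits = 0
    for d in detected:
        free = np.where(~matched)[0]              # reference onsets not yet matched
        if len(free) == 0:
            break
        j = free[np.argmin(np.abs(reference[free] - d))]
        if abs(reference[j] - d) <= tol:
            matched[j] = True
            hits += 1
    precision = hits / max(len(detected), 1)      # correct onsets / onsets returned
    recall = hits / max(len(reference), 1)        # correct onsets / true onsets
    return precision, recall

p, r = onset_precision_recall([1.1, 5.0, 9.7], [1.0, 4.8, 7.2, 9.5])
print(f"precision={p:.3f}, recall={r:.3f}")       # precision=1.000, recall=0.750
```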
TABLE 3.2
Segmentation performance for approximately two hours of continuous recordings in both
indoor and outdoor environments.

                            Indoor                     Outdoor
Segmentation Algorithm      Precision    Recall        Precision    Recall
Proposed DBN                0.664        0.560         0.5067       0.665
Modified BIC                0.607        0.467         0.3729       0.3860
3.7 Conclusion
This chapter presented a probabilistic model for robust multi-feature segmentation of continuous recordings using a DBN framework. The fundamental idea of this technique is to
find abrupt changes or jumps in acoustic feature trajectories, and associate these jumps with
the onset and end times of specific acoustic events. Furthermore, the proposed technique
can also account for unresponsive features or delayed jumps in the trajectories of different
features. The segmentation is determined by inferring the sequence of a hidden global mode
variable using a Viterbi approach, which also facilitates re-estimation of model parameters
using an EM algorithm. Once we have determined the segmentation of a continuous recording, individual sound events must be indexed in a database that allows for intuitive and
efficient retrieval. The remainder of this dissertation will be dedicated to different solutions
to this retrieval problem.
4. EXAMPLE-BASED RETRIEVAL
4.1 Introduction
In the previous chapters we focused on the processes of acoustic feature extraction and
segmentation, which are essential in transforming continuous audio recordings into distinct
events for easy storage and access. The next two chapters will tackle the problem of accessing
or retrieving these audio events in large databases where automated search and retrieval
techniques are necessary to overcome the "data deluge" of continuous recordings. In the
targeted continuous monitoring application, we have two goals for an automated retrieval
system: (1) the system should be intuitive so users can quickly form the queries that provide
their desired results, and (2) the system should be efficient in quickly retrieving the desired
sound clips.
The query-by-example (QBE) paradigm, where users query using a sound recording
they consider similar to the sounds they want to retrieve, has found much success in navigating music databases [65,71-73]. To develop a robust QBE model for environmental and
natural sounds we must account not only for the perceptual similarity between the acoustic
content of the query and the target sounds in the database, but also attempt to model the
entire range of possible query behavior a user might produce in an attempt to retrieve each
sound in the database. As discussed in Section 4.2, when only one relevant sound file exists
for a given query, a likelihood-based QBE strategy that computes the likelihood over all
possible queries that could be used to retrieve each sound in the database can be considered optimal in terms of precision and recall. An optimal retrieval system will maximize
the posterior $P(X \mid Y)$, where the query $Y$ is the observation, and $X$ is a hidden variable
modeling the database sound that generated the query. With the additional assumption
that all database sounds are equally likely (i.e., $P(X)$ is uniform), for a given query $Y$, we
can rank all sounds in the database with respect to the likelihood $P(Y \mid X)$.
Unfortunately, likelihood-based QBE requires that for a given query, we compute the
likelihood that it was generated to retrieve each sound in the database, i.e., retrieval time is
linear in the number of database sounds. For the large and constantly growing databases in
continuous monitoring applications, a faster approach is necessary. Assuming that a given
query is relevant to only a small percentage of sounds in the archive, a technique for improving retrieval efficiency is to recursively divide the database into clusters of perceptually
similar sounds. The search is then limited to those clusters that are perceptually related
to the given query. As the clusters are recursively split into smaller and smaller clusters,
retrieval complexity can be reduced to logarithmic in the number of database sounds.
In this chapter we present a likelihood-based QBE system for retrieval of environmental
and natural sounds. The observed query features Y are a time series where each sample
is a vector consisting of the six features discussed in Chapter 2. Each database sound
X is represented by a hidden Markov model (HMM) that approximates a general trend,
consisting of a zeroth, first, or second-order polynomial fit.
These fits encode whether
each feature trajectory is constant (high or low), increasing/decreasing, or exhibiting more
complex (up —> down; down —> up) behavior. Furthermore, we improve retrieval efficiency
using a modified spectral clustering technique [139,140], where the number of clusters are
automatically determined using the Bayesian information criterion (BIC). The input to
the spectral clustering algorithm is a similarity matrix obtained from a semi-metric that
approximates the distance between HMMs for all archived sounds in the database. Once
clustering is complete, we fit each cluster with its own HMM template, and when a query
is input into the system, only the cluster HMM which exhibits the highest likelihood has
its sounds returned and ranked.
Accurate evaluation of a multimedia information retrieval algorithm requires knowledge
of which database sounds are relevant to a given query [66]. In general, this relevance
information is not consistent among different users, thus, humans must be included in the
evaluation process. To this end, we conduct a formal study in which users rank all sounds
in two test databases captured under different recording conditions (indoor/outdoor) as
relevant/non-relevant to ten example queries. The results of this user study are discussed
in Section 4.5, and used to evaluate the presented cluster-based query-by-example system
in terms of common information retrieval metrics and computational cost.
4.2 Likelihood-based Retrieval
Assume our database consists of $M$ sounds, where $X^{(i)}$ represents sound $i$, and $F^{(i)}$ denotes
the feature set indexed with $X^{(i)}$. We denote the query sound by $Y$, while $G$ is a corresponding feature set that summarizes information in $Y$ relevant for the retrieval process. If
we assume that sound $X^{(i)}$ is relevant (desired by the user) with respect to query $Y$, while
sound $X^{(j)}$ is irrelevant, then we desire a score function such that

$$J(F^{(i)}, G) > J(F^{(j)}, G) \qquad (4.2.1)$$
That is, the score function should be highest for sounds relevant to the query.
One way to determine an effective score function is through the precision/recall criteria
used for evaluating information retrieval systems. As discussed in Section 1.3.2, precision
is the number of desired sounds retrieved divided by the number of retrieved sounds, recall
is the number of desired sounds retrieved divided by the total number of desired sounds,
and average precision summarizes precision performance when all sounds are returned. If
we assume only one relevant sound in the database and the retrieval system returns only
the top result for a given query, it should be clear that the probability of simultaneously
maximizing precision and recall reduces to the probability of retrieving the relevant sound.
An optimal retrieval system should maximize this probability, which is equivalent to maximizing
the posterior $P(F^{(i)} \mid G)$. If we assume $P(F^{(i)})$ is uniform, i.e., all database sounds
are equally likely, the optimal score function $J(F^{(i)}, G)$ reduces to the likelihood $P(G \mid F^{(i)})$.
Representing the relevant sound as $X^{(i^*)}$, we can retrieve it with the maximum likelihood
criterion, i.e.,

$$i^* = \underset{i \in 1:M}{\operatorname{argmax}}\ P(G \mid F^{(i)}). \qquad (4.2.2)$$

If there are multiple relevant sounds in the database, and the retrieval system returns the
top $R$ sounds, we can extend (4.2.2) to return the sounds with the $R$ greatest likelihoods
$P(G \mid F^{(i)})$.
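In code, this likelihood ranking amounts to the small sketch below; score_loglik is a hypothetical stand-in for whatever routine evaluates $P(G \mid F^{(i)})$ for one database template.

```python
import numpy as np

def retrieve_top_R(query_features, templates, score_loglik, R=10):
    """Return indices (and scores) of the R templates most likely to have generated the query."""
    logliks = np.array([score_loglik(query_features, F) for F in templates])
    order = np.argsort(logliks)[::-1]        # sort by descending log-likelihood
    return order[:R], logliks[order[:R]]
```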
In addition to considering the likelihood $P(G \mid F^{(i)})$ as a similarity score between query
and database sounds, we can interpret it as the actual distribution of query features a user
will produce to retrieve a database sound indexed with features $F^{(i)}$. When producing a
query, we assume that users provide a sound they consider similar to the database sound
(or general class of sounds) they are trying to retrieve. In general, this similarity is determined
by the similarity or difference between the low-level acoustic features introduced in
Chapter 2: loudness, spectral sparsity, spectral centroid, temporal sparsity, transient index,
and harmonicity. For certain sounds, dynamics in these features may also be salient. In
other words, both query and database features should depend, at most, on the audio feature
trajectories, $Y_{1:T_q}^{(1:P)}$ and $X_{1:T_i}^{(i,1:P)}$, where $Y_t^{(1:P)}$ is the query audio feature vector at frame $t$
of $T_q$ and $X_t^{(i,1:P)}$ is the $i$th database sound audio feature vector for frame $t$ of $T_i$. That
is, $G = g\left(Y_{1:T_q}^{(1:P)}\right)$ and $F^{(i)} = f^i\left(X_{1:T_i}^{(i,1:P)}\right)$ for suitably chosen functions $g(\cdot)$ and $f^i(\cdot)$.
Our model, in fact, chooses $G = Y_{1:T_q}^{(1:P)}$ and $F^{(i)}$ containing the parameters of a template
which encodes information about the "general trends" underlying the individual feature
trajectories, $X_{1:T_i}^{(i,j)}$, for $j \in 1:P$. Both database and query feature trajectories, then, are
modeled as noisy and distorted versions of this template, and the query likelihood model is
represented as $P\left(Y_{1:T_q}^{(1:P)} \mid F^{(i)}\right)$.
4.3 Template Construction
In constructing templates, we summarize feature trends separately, on a feature-by-feature
basis. That is, $F^{(i)} = F^{(i,1:P)}$, where $F^{(i,j)}$ summarizes the trend for the $j$th feature trajectory.
We consider only very simple trends: $X_{1:T_i}^{(i,j)}$ can be constant, up, down, up $\to$ down, or
down $\to$ up, as more complicated trends might not be intuitive to the user. Furthermore,
the mean value of the trajectory carries some importance, especially for constant trends.
For example, if a user remembers a particular sound as "grainy" or "crackly", it is very important
to the user that the sound has high values of transient index and temporal sparsity,
regardless of how these features vary over the course of a sound. Both general trend and
mean value information are summarized by subjecting each feature trajectory, $X_{1:T_i}^{(i,j)}$, to a
set of zeroth, first, and second-order least-squares polynomial fits, with the optimal order
chosen via the Akaike information criterion (AIC) [141]. Each polynomial fit is then sampled
evenly with one, three, or five control points, corresponding, respectively, to polynomial
orders zero, one, or two (see Figure 4.1 for an example). For the $k$th control point we let
$\phi^{(i,j)}(k) = \left[\hat{x}^{(i,j)}(k)\ \ \dot{\hat{x}}^{(i,j)}(k)\right]^T$, where $\hat{x}^{(i,j)}(k)$ is the value of the fit and $\dot{\hat{x}}^{(i,j)}(k)$ its
derivative at this point. The template $F^{(i,j)}$ is then a HMM representation of how query
features $Y_{1:T_q}^{(j)}$ advance through the sequence of control points (hidden states), defined by
$\phi^{(i,j)}(1:K^{(i,j)})$, where $K^{(i,j)}$ is the number of control points for the $j$th feature and the $i$th sound.
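A minimal sketch of this per-feature template construction is shown below; the specific Gaussian-residual form of the AIC and the small constant guarding against a zero residual are illustrative assumptions rather than details fixed by the text.

```python
import numpy as np

def fit_feature_template(x):
    """Least-squares fits of order 0-2, order selection by AIC, and control-point sampling.

    x : 1-D acoustic feature trajectory for one sound.
    Returns (order, control_points), each control point being [fit value, fit derivative].
    """
    t = np.arange(len(x), dtype=float)
    best = None
    for order in (0, 1, 2):
        coeffs = np.polyfit(t, x, order)
        rss = float(np.sum((x - np.polyval(coeffs, t)) ** 2)) + 1e-12
        aic = 2 * (order + 1) + len(x) * np.log(rss / len(x))   # Gaussian-residual AIC
        if best is None or aic < best[0]:
            best = (aic, order, coeffs)
    _, order, coeffs = best
    n_points = {0: 1, 1: 3, 2: 5}[order]            # one, three, or five control points
    sample_t = np.linspace(t[0], t[-1], n_points)
    deriv = np.polyder(coeffs) if order > 0 else np.zeros(1)
    control_points = np.stack(
        [np.polyval(coeffs, sample_t), np.polyval(deriv, sample_t)], axis=1)
    return order, control_points
```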
Fig. 4.1. Example of LS quadratic fit (solid line) and corresponding control points for a
harmonicity trajectory (dotted line).

In describing the HMM representation of template $F^{(i,j)}$, we must account for the
possibility that query features will advance through the control point sequence $\phi^{(i,j)}(1:K^{(i,j)})$
at a variable rate. We model this time-warping distortion by defining hidden states
$A_t^{(i,j)} \in \phi^{(i,j)}(1:K^{(i,j)})$ that represent the control point responsible for the query observation
at frame $t$. The advance through the control point sequence is modeled by the discrete Markov
process $P(A_t^{(i,j)} \mid A_{t-1}^{(i,j)})$, which follows the appropriate Markov transition diagram displayed
in Figure 4.2, depending on the order of the polynomial fit. The reason for characterizing
linear and quadratic fits with more control points than the minimum required to define
the curve is that we want the process of advancing through the control point sequence
to represent as closely as possible the process of advancing through the optimal curve fit.
The transition probabilities, $p_{ba} = P(A_t^{(i,j)} = b \mid A_{t-1}^{(i,j)} = a)$, are set by modeling the advance
through the control point sequence as a Poisson process. For the single jump case ($b = a+1$),
the rate parameter of the Poisson process is set to $1/d_{a+1}^{(i,j)}$, where $d_{a+1}^{(i,j)}$ is the frame difference
between control points $a$ and $b$ in the database feature trajectory, $X_{1:T_i}^{(i,j)}$ (Figure 4.1). For
the two jump case ($b = a+2$), the rate parameter is the average of $1/d_{a+1}^{(i,j)}$ and $1/d_{a+2}^{(i,j)}$.
Thus, the transition probabilities of advancing through the control point sequence can be
summarized as

$$p_{ba} = \begin{cases} \dfrac{1}{d_{a+1}^{(i,j)}}\exp\left(-\dfrac{1}{d_{a+1}^{(i,j)}}\right), & b = a+1 \\[2ex] \dfrac{1}{2}\left[\dfrac{1}{2}\left(\dfrac{1}{d_{a+1}^{(i,j)}}+\dfrac{1}{d_{a+2}^{(i,j)}}\right)\right]^2 \exp\left[-\dfrac{1}{2}\left(\dfrac{1}{d_{a+1}^{(i,j)}}+\dfrac{1}{d_{a+2}^{(i,j)}}\right)\right], & b = a+2 \end{cases} \qquad (4.3.1)$$

Finally, we assume that the initial hidden state corresponds to the first control point with
probability one, i.e., $P(A_1^{(i,j)} = \phi^{(i,j)}(1)) = 1$.
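The following sketch builds such a transition matrix from the control-point frame positions of a database trajectory; assigning the leftover probability mass to the self-transition is one plausible reading of the model, stated here as an assumption.

```python
import numpy as np

def control_point_transitions(frames):
    """P(A_t = b | A_{t-1} = a) for control points at the given (increasing) frame indices."""
    K = len(frames)
    P = np.zeros((K, K))
    for a in range(K):
        if a + 1 < K:
            lam1 = 1.0 / (frames[a + 1] - frames[a])
            P[a, a + 1] = lam1 * np.exp(-lam1)                 # Poisson prob. of exactly one jump
        if a + 2 < K:
            lam2 = 0.5 * (1.0 / (frames[a + 1] - frames[a]) +
                          1.0 / (frames[a + 2] - frames[a + 1]))
            P[a, a + 2] = 0.5 * lam2 ** 2 * np.exp(-lam2)      # Poisson prob. of exactly two jumps
        P[a, a] = 1.0 - P[a].sum()                             # remaining mass: stay put (assumption)
    return P
```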
If $A_t^{(i,j)} = k$, then the value and derivative of the observed query feature $Y_t^{(j)}$ will
be close to $\phi^{(i,j)}(k)$. Hence, we model the observation probability as
$P(Y_t^{(j)} \mid A_t^{(i,j)}) = P(\tilde{Y}_t^{(j)} \mid A_t^{(i,j)})$, where

$$\tilde{Y}_t^{(j)} = \left[\hat{y}_t^{(j)}\ \ \dot{\hat{y}}_t^{(j)}\right]^T \qquad (4.3.2)$$

and $\hat{y}_t^{(j)}$ is a 41-point, fourth-order, Savitzky-Golay (SG) [142] smoothed version of $Y_t^{(j)}$ and
$\dot{\hat{y}}_t^{(j)}$ the smoothed derivative. The SG smoothing filter provides an efficient framework for
derivative estimation, and removes noise in the observation trajectory with minimal loss of
high-frequency signal content. $P(\tilde{Y}_t^{(j)} \mid A_t^{(i,j)})$ is given as follows

$$P(\tilde{Y}_t^{(j)} \mid A_t^{(i,j)} = k) = \mathcal{N}\left(\tilde{Y}_t^{(j)};\ \phi^{(i,j)}(k), \Sigma^{(i,j)}\right). \qquad (4.3.3)$$

The covariance matrix, $\Sigma^{(i,j)} \in \mathbb{R}^{2\times 2}$, is designed to encode the joint uncertainties in value
and derivative between time-aligned versions of the template and Savitzky-Golay smoothed
observation trajectories. In our work, the covariance matrix is estimated as the sample
covariance of the residual vector $e_t^{(i,j)} = \tilde{X}_t^{(i,j)} - \left[\hat{x}_t^{(i,j)}\ \ \dot{\hat{x}}_t^{(i,j)}\right]^T$, i.e., the difference of the
smoothed feature trajectory and the chosen polynomial fit. The HMM template $F^{(i,j)}$ for
sound $i$ and feature $j$ is then composed of the control point sequence $\phi^{(i,j)}(1:K^{(i,j)})$, the
covariance matrix $\Sigma^{(i,j)}$, and the associated transition probabilities.
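Since [142] is the standard Savitzky-Golay filter, the smoothed value/derivative observations of (4.3.2) can be produced with an off-the-shelf routine, as in the sketch below; unit frame spacing and a trajectory longer than the 41-point window are assumed.

```python
import numpy as np
from scipy.signal import savgol_filter

def smoothed_observations(y, window=41, polyorder=4):
    """Return the (T, 2) sequence of [smoothed value, smoothed derivative] pairs for one feature."""
    y = np.asarray(y, dtype=float)
    value = savgol_filter(y, window, polyorder, deriv=0)
    slope = savgol_filter(y, window, polyorder, deriv=1, delta=1.0)  # derivative per frame
    return np.stack([value, slope], axis=1)
```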
Fig. 4.2. Markov transition diagrams for $A_t^{(i,j)}$ under the three possible polynomial fits of
the feature trajectories.
4.3.1 Query Likelihood Model
Given the template (which was constructed by encoding each feature trend separately), we
model the query feature trajectories as conditionally independent across different features.
Furthermore, we assume that each trajectory depends only on the template information for
its corresponding feature:
$$P\left(Y_{1:T_q}^{(1:P)} \mid F^{(i)}\right) = \prod_{j=1}^{P} P\left(Y_{1:T_q}^{(j)} \mid F^{(i,j)}\right) \qquad (4.3.4)$$
Thus, given query feature set $Y_{1:T_q}^{(1:P)}$, the likelihood from (4.3.4) can be computed in linear
time using the forward algorithm [106] for each database template $F^{(i)}$.
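A compact sketch of this evaluation is given below: a scaled forward recursion over one feature's HMM (Gaussian emissions around the control points, with the chain starting in the first control point), and per-feature log-likelihoods summed to realize the product in (4.3.4). The argument layout is illustrative rather than a description of an existing implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def forward_loglik(obs, control_points, trans, cov):
    """Log-likelihood of a (T, 2) observation sequence under one single-feature template HMM."""
    T, K = len(obs), len(control_points)
    emit = np.array([[multivariate_normal.pdf(obs[t], mean=control_points[k], cov=cov)
                      for k in range(K)] for t in range(T)])
    alpha = np.zeros(K)
    alpha[0] = emit[0, 0]                      # the chain starts in the first control point
    loglik = 0.0
    for t in range(T):
        if t > 0:
            alpha = (alpha @ trans) * emit[t]  # propagate, then weight by the emission
        scale = alpha.sum() + 1e-300
        loglik += np.log(scale)
        alpha /= scale                         # scaling keeps the recursion numerically stable
    return loglik

def query_loglik(obs_per_feature, template_per_feature):
    """Sum of per-feature log-likelihoods, i.e., the log of the product in (4.3.4)."""
    return sum(forward_loglik(obs, *tmpl)
               for obs, tmpl in zip(obs_per_feature, template_per_feature))
```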
4.4 Cluster-based Indexing
Retrieval consists of obtaining the R database sounds with the greatest query likelihoods.
Unfortunately, exact retrieval requires as many likelihood evaluations as there are sounds
in the database, which is problematic in the case of monitoring applications that involve
thousands of sounds. Instead, we have developed a cluster-based indexing strategy that
effectively reduces the number of likelihood evaluations per query. Our strategy is based
on partitioning the database into clusters for which, given any pair of database sounds in
different clusters, if the query likelihood for one sound is large, the query likelihood for
the other sound will be small. For retrieval, we would then restrict our search to sounds
in the first cluster. Applying this clustering recursively, our retrieval strategy would have
logarithmic, rather than linear, complexity in the number of database sounds.
In developing our strategy, we first present three challenges that provide necessary
conditions for an effective cluster-based indexing system. First, we must construct a discriminative
space based on a pseudo-metric $D(F^{(i)}, F^{(k)})$, for which the value will be small
only when both $P(Y_{1:T_q}^{(1:P)} \mid F^{(i)})$ and $P(Y_{1:T_q}^{(1:P)} \mid F^{(k)})$ are large, i.e., sounds indexed by $F^{(i)}$
and $F^{(k)}$ are both likely matches to the query sound described by feature trajectory $Y_{1:T_q}^{(1:P)}$.
These relations should hold as well if $F^{(k)}$ and $F^{(i)}$ are interchanged; we solve this issue by
constructing the pseudo-metric to be symmetric. Second, we must develop a way of clustering
the database using $D(F^{(i)}, F^{(k)})$,
so that the value of this pseudo-metric is smaller
between sounds in the same cluster than between sounds in different clusters. Third, we
must develop an efficient way to assign the query to a particular cluster so that the query
likelihood will be large only for sounds in that cluster, and likelihood evaluations for sounds
in other clusters can be skipped.
If we assume $D(F^{(i)}, F^{(k)})$ is a metric, we can easily address all challenges as shown
in Figure 4.3. Regarding the first challenge, $D(F^{(i)}, F^{(k)})$ is chosen to be both likelihood
based, and as close to a metric as possible. Concerning the second challenge, a standard
clustering approach, such as k-means or spectral clustering [139], can utilize $D(F^{(i)}, F^{(k)})$ to
determine the clusters. Regarding efficient assignment of query to cluster, we first represent
each cluster as a template similar to those described in Section 4.3. Then, only the cluster
template which exhibits the largest likelihood with respect to the query observations $Y_{1:T_q}^{(1:P)}$
will have the sounds belonging to that cluster ranked in terms of likelihood and returned
to the user.

Fig. 4.3. Fictitious metric space using $D(F^{(i)}, F^{(k)})$ extended to incorporate query features
$Y_{1:T_q}^{(1:P)}$. The $F^{(i)}$ are plotted with solid circles, the $Y_{1:T_q}^{(1:P)}$ with an 'x', and the cluster centroids
with diamonds.
We proceed in a manner analogous to that of Figure 4.3, as follows. Let $X_{1:T_i}^{(i,1:P)}$ be
the feature trajectory computed for the database sound indexed with $F^{(i)}$, and $L(i,k)$ a
corresponding query log likelihood with respect to $F^{(k)}$:

$$L(i,k) = \log\left[ P\left( X_{1:T_i}^{(i,1:P)} \mid F^{(k)} \right) \right] \qquad (4.4.1)$$

Similar to [143], we can define a semi-metric $D(F^{(i)}, F^{(k)})$ based on these query likelihoods,
as follows

$$D(F^{(i)}, F^{(k)}) = \frac{1}{T_i}\left[ L(i,i) - L(i,k) \right] + \frac{1}{T_k}\left[ L(k,k) - L(k,i) \right]. \qquad (4.4.2)$$

It is easily verified that the following properties hold: symmetry: $D(F^{(i)}, F^{(k)}) = D(F^{(k)}, F^{(i)})$,
and non-negativity: $D(F^{(i)}, F^{(k)}) \geq 0$. Regarding distinguishability, it is
true that $D(F^{(i)}, F^{(k)}) = 0$ if $F^{(i)} = F^{(k)}$. However, it is possible that $F^{(i)} \neq F^{(k)}$ but
$L(i,i) = L(i,k)$ and $L(k,k) = L(k,i)$, which will force $D(F^{(i)}, F^{(k)}) = 0$; however, this
event will occur with probability zero unless there is degeneracy in the model structure.
With these properties, $D(F^{(i)}, F^{(k)})$ is a semi-metric. It is not a metric, because it does
not satisfy the triangle inequality.
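Given a matrix of pairwise query log-likelihoods $L(i,k)$ and the trajectory lengths, the whole collection of pairwise distances can be assembled at once, as in the following sketch; the small clipping of negative values is a practical guard added here, not part of the definition.

```python
import numpy as np

def semimetric_from_logliks(L, lengths):
    """Pairwise semi-metric of (4.4.2): L[i, k] is the log-likelihood of sound i under template k."""
    lengths = np.asarray(lengths, dtype=float)
    self_ll = np.diag(L)
    D = ((self_ll[:, None] - L) / lengths[:, None] +
         (self_ll[None, :] - L.T) / lengths[None, :])
    np.fill_diagonal(D, 0.0)
    return np.maximum(D, 0.0)   # clip tiny negative values caused by estimation noise
```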
Nonetheless, $D(F^{(i)}, F^{(k)})$, as constructed via (4.4.1, 4.4.2), will in most cases address
the challenges outlined at the beginning of this section. Let $F^{(i)}$ be such that $P(Y_{1:T_q}^{(1:P)} \mid F^{(i)})$
is large; that is, the likelihood surface has a strong peak in the query feature space, and
$Y_{1:T_q}^{(1:P)}$ is near that peak. Furthermore, $L(i,i)$ and $L(k,k)$ are also large, since the feature
trajectory used to generate a template will also be near the peak in the likelihood surface.
If $D(F^{(i)}, F^{(k)})$ is small, then, we must have $L(i,i) \approx L(i,k)$ and $L(k,k) \approx L(k,i)$, which
indicates that the likelihood models become interchangeable, at least near peaks. Hence, it
is very likely that $P(Y_{1:T_q}^{(1:P)} \mid F^{(k)})$ will also be large. On the other hand, suppose $D(F^{(i)}, F^{(k)})$
is large; either $|L(i,i) - L(i,k)|$ or $|L(k,k) - L(k,i)|$ must be large. In this case, the likelihood
surfaces for templates $F^{(i)}$ and $F^{(k)}$ exhibit peaks in different regions of the feature space;
thus if $P(Y_{1:T_q}^{(1:P)} \mid F^{(i)})$ is large, we expect $P(Y_{1:T_q}^{(1:P)} \mid F^{(k)})$ to be small.
Considering the second (clustering) challenge, the sound clustering performance of the
k-means, non-negative matrix factorization, and spectral clustering algorithms was compared in [144]. Based on these results, we developed a modified version of the Ng-Jordan-Weiss (NJW) spectral clustering algorithm [139], as discussed in the following section.
4.4.1 Modified Spectral Clustering
Given $N$ database-sound feature sets, $F^{(1:N)}$, and (semi-metric) distance $D(F^{(i)}, F^{(j)})$, the
NJW spectral clustering algorithm operates on an $N \times N$ affinity matrix with the $i,j$ element
as follows: $A(i,j) = e^{-D(F^{(i)}, F^{(j)})/\sigma}$. Here $\sigma$ is a scaling factor, and the closer $F^{(i)}$ and
$F^{(j)}$ are in terms of $D(F^{(i)}, F^{(j)})$, the larger the affinity. In our adaptation of the NJW
algorithm, we make the following modifications, inspired by the "self-tuning" approaches
of [140]. First, we introduce local scaling [140] into the affinity matrix,
$$A(i,j) = e^{-\frac{D^2(F^{(i)}, F^{(j)})}{\sigma_i \sigma_j}} \qquad (4.4.3)$$

where $\sigma_i = D(F^{(i)}, F^{(i_M)})$ and $i_M$ is the $M$th nearest neighbor of sound $i$ ($\sigma_j$ is defined
similarly). Local scaling offers more flexibility in cases where database feature sets
exhibit multi-scale behavior, such as a concentrated cluster embedded in one that is more
diffuse [140]. Concentrated clusters can arise when there are many equivalent sounds, for
instance footsteps, while diffuse clusters can result from sounds in more flexible categories,
such as machine sounds.
Second, even if there exist distinct semantic categories, for instance footstep,
machine
and water, the diversity of sounds within each category (in particular, the water category)
may induce a greater number of clusters than there are categories, and this number is difficult to predict in advance. Therefore, we use the Bayesian Information Criterion (BIC) [145]
to automatically choose the number of clusters at each level of the process. At this point it
might be natural to ask why use BIC to estimate the number of clusters as opposed to AIC
which was used in Section 4.3 for estimating polynomial order. As discussed in [146], BIC
is preferred when the underlying model has finite dimension and belongs to the candidate
set (e.g., the number of sound clusters). Conversely, AIC is preferred when the underlying
model has infinite dimension and/or does not belong to the candidate set (e.g., whether a
constant, linear, or quadratic curve best describes a feature trajectory).
Given the affinity matrix constructed via (4.4.3), our adaptation of NJW spectral
clustering is shown in Algorithm 1.
Algorithm 1 Modified NJW spectral clustering algorithm.
1: Use (4.4.3) to form the locally scaled $N \times N$ affinity matrix $A$.
2: Define $B$ to be the diagonal matrix with $B(i,i) = \sum_j A(i,j)$ and construct the matrix $J = B^{-1/2} A B^{-1/2}$.
3: Find the $K$ largest eigenvectors of $J$, where $K$ is the maximum cluster number, and stack the eigenvectors in columns to form the $N \times K$ matrix $V$.
4: Re-normalize each row of $V$ to be of unit length to form the $N \times K$ matrix $U$, with elements $U(i,j) = V(i,j) / \left(\sum_j V^2(i,j)\right)^{1/2}$.
5: for $k = 1$ to $K$ do
6:   Use the first $k$ columns of the matrix $U$ to form the new matrix $U'$.
7:   Treat each row of $U'$ as a data point in $\mathbb{R}^k$ and cluster according to a Gaussian mixture model (GMM) using EM.
8:   Assign sound $i$ to cluster $m$ if the $i$th row of $U'$ was assigned to cluster $m$.
9:   Calculate the BIC value under the assumption of $k$ clusters.
10: end for
11: Choose the optimal number of clusters $k_{\mathrm{opt}}$ to be the $k$ which minimizes the corresponding BIC value.
In Algorithm 1, step (9), the BIC value is calculated as follows:

$$\mathrm{BIC} = -2 \times \log\left(P(\mathcal{V} \mid \Theta)\right) + l_k \times \log(N) \qquad (4.4.4)$$

Here $\Theta = \{\theta_1, \ldots, \theta_k\}$ denotes the collection of GMM parameters estimated via EM, and
$\mathcal{V} = \{u_1, \ldots, u_N\}$ is the collection of data points $u_i$, where $u_i$ represents the $i$th row of
matrix $U'$ with dimension $k$. We determine $P(\mathcal{V} \mid \Theta)$ according to:

$$P(\mathcal{V} \mid \Theta) = \prod_{i=1}^{N} \sum_{j=1}^{k} \alpha_j\, p_j(u_i \mid \mu_j, \Sigma_j)$$

where $\alpha_j$ is the weight for the $j$th Gaussian component in the GMM, and $p_j(u_i \mid \mu_j, \Sigma_j)$ is
the likelihood of $u_i$ evaluated with respect to the $j$th Gaussian component. From (4.4.4),
the number of free parameters to be estimated for the GMM with $k$ Gaussian components
is represented as $l_k$. For a $k$ component GMM with diagonal covariance matrices (which are
used in this work, because of better numerical behavior when compared to full covariance
matrices) $l_k = 2k^2 + k - 1$, and is a sum of the following: (a) the $k-1$ unique parameters
determining the weights $\alpha_j$, (b) the $k^2$ parameters for the mean vectors $\mu_j$ ($k$ mean vectors
each of dimension $k$), and (c) the $k^2$ parameters for the diagonal covariances (this number
will be greater if we use full covariance matrices). Traditionally, the number of clusters
is automatically selected via the eigenvalues or eigenvectors of the affinity matrix [140];
however, we found this method tended to highly overestimate the number of clusters when
compared with Algorithm 1. This is most likely due to the non-separability of real-world
sound clusters, which we account for by using the GMM procedure in step (7), as opposed
to k-means.
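The pieces of Section 4.4.1 can be strung together as in the sketch below, which uses scikit-learn's GaussianMixture and its built-in BIC; note that scikit-learn's parameter count differs slightly from the $l_k = 2k^2 + k - 1$ derivation above, and only a single EM restart is used, so this is an approximate illustration of Algorithm 1 rather than a faithful reimplementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def local_scaling_affinity(D, M=7):
    """Locally scaled affinity (4.4.3); sigma_i is the distance to the M-th nearest neighbor."""
    sigma = np.sort(D, axis=1)[:, M]             # column 0 holds the zero self-distance
    A = np.exp(-(D ** 2) / np.outer(sigma, sigma))
    np.fill_diagonal(A, 0.0)
    return A

def modified_njw_clustering(D, max_clusters=30, M=7, seed=0):
    """Cluster sounds from the pairwise semi-metric D, choosing the number of clusters by BIC."""
    A = local_scaling_affinity(D, M)
    B_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1) + 1e-12))
    J = B_inv_sqrt @ A @ B_inv_sqrt
    _, eigvecs = np.linalg.eigh(J)               # eigenvalues returned in ascending order
    V = eigvecs[:, ::-1][:, :max_clusters]       # K largest eigenvectors, stacked as columns
    U = V / (np.linalg.norm(V, axis=1, keepdims=True) + 1e-12)
    best = None
    for k in range(1, max_clusters + 1):
        gmm = GaussianMixture(n_components=k, covariance_type="diag", random_state=seed)
        labels = gmm.fit_predict(U[:, :k])
        bic = gmm.bic(U[:, :k])
        if best is None or bic < best[0]:
            best = (bic, k, labels)
    _, k_opt, labels = best
    return k_opt, labels
```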
4.4.2 Cluster Template Construction
Recall that the third and final challenge for efficient cluster-based indexing is to develop
a strategy for pre-assigning a query to a particular cluster, so that likelihood evaluations
for sounds in other clusters can be skipped. One plausible strategy is as follows. First, we
precompute for each cluster the marginalized query likelihood with respect to a uniform
prior over the sounds in the cluster. Second, we assign the query to the cluster with
the highest marginalized likelihood. While this strategy certainly guarantees that high
likelihood sounds will be retrieved, it still requires query likelihood evaluations for each
sound in the database.
As such, we need a query likelihood model for an entire cluster that "behaves" sufficiently like the marginalized likelihood, without incurring the expense of individual query
likelihood computations. We at least attempt to capture the common perceptual characteristics of all sounds in the cluster. First, we resample all feature trajectories to a common
length (the geometric mean of all lengths of sounds in the cluster) then within each feature,
we treat all trajectories over all sounds in the cluster as iid observations from a common
polynomial model and perform the polynomial fit accordingly. That is, let $\bar{T}$ be the common
length and $\bar{X}_{1:\bar{T}}^{(i,j)}$ the $j$th resampled feature trajectory for the $i$th sound. We concatenate
all trajectories for the $k$th cluster, $j$th feature, into the vector $C^{(k,j)} = \mathrm{vec}\{\bar{X}_{1:\bar{T}}^{(i,j)}\}_{i \in \mathcal{N}_k}$, where
$\mathcal{N}_k$ is the index set for cluster $k$, and use the process described in Section 4.3 to extract the
appropriate (constant, linear, or parabolic) polynomial fit.
The single-feature model is otherwise identical to that developed in Section 4.3, i.e., all
of the concatenated feature trajectories for a cluster are represented as a single HMM with
parameters determined based on the least squares polynomial fit. We denote the resultant
query likelihood as $P\left(Y_{1:T_q}^{(j)} \mid F_C^{(k,j)}\right)$, where the cluster template $F_C^{(k,j)}$ consists of the
collection of HMM parameters for each feature $j \in 1:P$. Similarly, the overall cluster/query
likelihood model factors as a product distribution over single-feature likelihoods.
We precompute and store the templates $F_C^{(k,j)}$. Given a query, we first evaluate the
cluster/query likelihood (i.e., the query features are the HMM observation sequence and
the likelihood is evaluated with respect to the cluster template model $F_C^{(k,j)}$) for each
cluster. All clusters except the one maximizing the cluster/query likelihood are eliminated
from future consideration. We then evaluate (4.3.4) for each sound within the maximum-likelihood cluster. For very large audio databases, we can make this procedure recursive,
resulting in a number of likelihood evaluations that is logarithmic in the database size.
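Put together, the two-stage retrieval looks like the short sketch below; cluster_templates, cluster_members, and sound_templates are assumed data structures (one template per cluster, the sound indices it covers, and the per-sound templates), and query_loglik is the hypothetical helper from the Section 4.3.1 sketch.

```python
def cluster_based_retrieval(query_obs, cluster_templates, cluster_members,
                            sound_templates, query_loglik, R=10):
    """Route the query to the most likely cluster, then rank only that cluster's sounds."""
    best_cluster = max(range(len(cluster_templates)),
                       key=lambda c: query_loglik(query_obs, cluster_templates[c]))
    scored = sorted(((query_loglik(query_obs, sound_templates[i]), i)
                     for i in cluster_members[best_cluster]), reverse=True)
    return [i for _, i in scored[:R]]
```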
4.5 Results and Discussions
We have evaluated the likelihood based retrieval algorithm discussed in the previous sections
using two audio databases, one containing automatically segmented sounds, and another
containing manually segmented sounds. The database of automatically segmented sounds
consists of all sound events greater than one second in length (71 outdoor sounds and
111 indoor sounds) obtained from the approximately two hours of continuous recordings
discussed in Section 3.6. The manually segmented sound database consists of 155 indoor
sounds and 145 outdoor sounds, with some of the sounds taken from the same continuous
recordings as the automatically segmented data set. Although we have no ground truth
labels, to provide an idea of the types of sounds we mention that each subset can be loosely
partitioned into six semantic categories: machine, speech, rhythmic,
scratchy, water, and
whistle/animal.
4.5.1 Experimental Setup
Our study involved ten users, adults ages 23-35 with no known hearing impairments. Each
user was presented with ten example audio queries and asked to rank either all 300 sounds
from the manually segmented database, or all 182 sounds from the automatically segmented database as relevant or non-relevant to the query. Thus, five user rankings for each
query/database sound combination were obtained. The same ten queries were used for both
databases, and consisted of five sounds that were present in the manually segmented test
database and five that were not, while none of the queries were present in the automatically
segmented database. Additionally, the ten queries were a combination of
indoor/outdoor
sounds from all of the semantic categories mentioned above. The volunteers listened to
sounds on personal computer speakers using a well known computer media player in an
acoustically isolated environment. Participants were asked to fill out a worksheet recording
their relevancy rankings for each of the ten example queries. The database sounds were
presented in random order without revealing any category information, and with indoor
and outdoor sounds mixed together, while the queries could be listened to "on-demand"
throughout the ranking process. Users were allowed to complete the study at their own
pace and it typically took between two and three hours (over multiple sessions) per subject
to finish the study.
Given this user-determined relevance information, we evaluated retrieval performance
on each database separately using the aforementioned precision, recall, and average precision
criteria as well as the average number of likelihood computations required to rank all sounds
in the database. For a given query we denote by $F_{Rel}$ the set of relevant sounds, and
$|F_{Rel}|$ the number of relevant sounds for that query. Assuming $N$ sounds in a database are
ranked in order of decreasing likelihood for a given query, recall and precision are computed
by truncating the list to the top $n$ sounds and counting the number of relevant sounds,
denoted by $|F_{Rel}^{(n)}|$. We can then define recall $= |F_{Rel}^{(n)}| / |F_{Rel}|$ and precision $= |F_{Rel}^{(n)}| / n$. Average
precision is then found by incrementing n and averaging the precision at all points in the
ranked list where a relevant sound is located.
As a baseline retrieval system we ranked sounds according to the Euclidean distance
between the average of the smoothed feature values and their derivative for each sound
event. Table 4.1 compares the mean over users and queries of the average precision values
for both the proposed HMM approach and the average feature approach. We see that in all
cases the HMM provided higher average precision when compared with the average feature
system, suggesting that the simple dynamic trends it models are important in returning
relevant sounds to users in a QBE framework. The average feature system will perform
best on sounds with feature trajectories that exhibit constant trends, and the number of
these sounds will vary to a certain extent based on recording conditions. In an outdoor environment, where factors such as wind and traffic noise lead to low SNR conditions, we might
expect the segmentation algorithm to find longer sound events, as the background noise
makes it difficult for the algorithm to decide when a sound "turns off." These longer sound
events are then likely to exhibit constant feature trends, as they might be the combination
of several events. Conversely, in a high SNR indoor environment more accurate segmentation performance might lead to shorter sound events containing the simple dynamic trends
modeled by the HMM. This might explain why the HMM retrieval system performs best on
automatically segmented indoor sounds, and manually segmented outdoor sounds (the data
sets with fewer constant feature trends), while the average feature retrieval system performs
best on manually selected indoor sounds and automatically selected outdoor sounds (the
data sets with more constant feature trends). We now examine in detail the influence of
cluster-based indexing on retrieval performance for the larger manually segmented database.
TABLE 4.1
Mean average precision for manually and automatically segmented sounds.

                        Indoor                        Outdoor
Segmentation Type       HMM      Average Feature      HMM      Average Feature
Automatic               0.424    0.247                0.282    0.261
Manual                  0.382    0.328                0.303    0.247
4.5.2 Clustering Performance
For the manually segmented indoor and outdoor databases we compared performance measures across two cases: exhaustive search and cluster-based indexing. For exhaustive
search,
the query likelihood was evaluated for each sound in the database, and the top n sounds
in terms of likelihood were retrieved. For cluster-based indexing, the method presented in
Section 4.4 was used to limit the search to only those Nc sounds indexed by the most likely
cluster. When n < Nc the sounds in this cluster were retrieved in rank order, while for
n > Nc all sounds outside of the most likely cluster were retrieved in random order.
Figure 4.4 shows recall and precision curves averaged over all queries and user rankings.
By examining the recall curves of Figures 4.4(a) and (b) we see that for the indoor database,
approximately 40% of the relevant sounds were retrieved in the top 20; for the outdoor
database, this number is about 35%. From the precision curves of Figures 4.4(c) and (d)
we see that, for the indoor database, more than 60% of the top 10 returned sounds were
marked as relevant, whereas for the outdoor database, this number is 40%. We conjecture
that performance is better with the indoor database because the outdoor sounds generally
contain more noise, or overlap between sound events. Considering the diverse perceptual
quality of environmental and natural sounds, however, overall results in both cases seem
quite promising. We also notice from Figures 4.4(b) and (d) that the cluster-based indexing
actually improves recall and precision for the top 20 retrieved outdoor sounds, rather than
causing a slight detriment as expected. We conjecture that this improvement lies in the fact
that clustering segregates perceptually relevant sounds from non-relevant sounds; however
further studies need to be performed before we can conclude this. As expected, clustering
begins to degrade overall performance much beyond 20 retrieved sounds as sounds outside of
the chosen cluster are returned randomly, however, we anticipate that in many applications
the user will wish to retrieve less than 20 sounds.
As mentioned previously, average precision for an exhaustive search is simply the average
of the precision values at all points in the ranked list where a relevant sound is located.
To compare cluster-based indexing and exhaustive search in terms of average precision we
must slightly modify the average precision criterion in the cluster-based indexing case, because
sounds are returned in random order outside of the first cluster. Thus the average
precision becomes

$$\mathrm{AvgP} = \frac{1}{|F_{Rel}|}\left[ \sum_{n=1}^{N_c} 1_{F_{Rel}}(n)\,\frac{|F_{Rel}^{(n)}|}{n} + \sum_{n=N_c+1}^{N} 1_{F_{Rel}}(n)\,P(n) \right] \qquad (4.5.1)$$

where $1_{F_{Rel}}(n)$ is an indicator function equal to one if the sound at position $n$ in the rank list
belongs to $F_{Rel}$ and zero otherwise, and $P(n) = \frac{1}{n}\left(|F_{Rel}^{(N_c)}| + (n - N_c)\,\frac{|F_{Rel}| - |F_{Rel}^{(N_c)}|}{N - N_c}\right)$ is the precision
value at position $n > N_c$ in the rank list, i.e., the precision after all sounds within the chosen
cluster have been returned and remaining sounds are returned randomly. For exhaustive
search the second term in the sum of (4.5.1) is unnecessary as $N_c = N$.
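For the exhaustive-search case, where (4.5.1) reduces to averaging the precision at every rank holding a relevant sound, the computation is a few lines, as sketched below.

```python
import numpy as np

def average_precision(relevance):
    """Average precision of a ranked list; relevance[n] is 1 if the n-th returned sound is relevant."""
    relevance = np.asarray(relevance, dtype=float)
    if relevance.sum() == 0:
        return 0.0
    precision_at_n = np.cumsum(relevance) / np.arange(1, len(relevance) + 1)
    return float((precision_at_n * relevance).sum() / relevance.sum())

print(average_precision([1, 0, 1, 0, 0, 1]))   # relevant at ranks 1, 3, 6 -> about 0.722
```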
Fig. 4.4. Recall and precision curves averaged over ten example queries and five user
relevancy rankings for indoor and outdoor sounds. (a) Recall (indoor); (b) recall (outdoor);
(c) precision (indoor); (d) precision (outdoor).
TABLE 4.2
Mean average precision for indoor and outdoor queries.

                     Indoor                                Outdoor
Query Type           Exhaustive    Spectral Clustering     Exhaustive    Spectral Clustering
Indoor Queries       0.5677        0.5038                  0.1474        0.3058
Outdoor Queries      0.1258        0.1108                  0.4173        0.4345
Table 4.2 displays the mean average precision for both the exhaustive and cluster-based
retrieval schemes using both the indoor and outdoor sound database and query sets. From
Table 4.2, we note that for the more noisy outdoor sound database, retrieval performance
is improved for both indoor and outdoor queries by using spectral clustering as certain
irrelevant sounds are removed (by not being in the selected cluster) from consideration of
a high position in the ranked list. We also note a general loss in performance using indoor
queries on the outdoor sound database and vice-versa as the sound types in these cases
might be very different. For example, the outdoor database contains several recordings of
birds, which are obviously not present in the indoor database, so using a bird query might
return sounds such as laughter, crying, etc., which were clearly not labeled as relevant to
the bird query by the user.
To put the results of Table 4.2 in perspective we briefly review the use of the average
precision criterion in audio information retrieval tasks. Much work in information retrieval
follows the NIST Text REtrieval Conference (TREC), which in 1997 began a spoken document retrieval task [147] where an audio database of speech documents was searched by text
concept. In this contest the mean average precision values ranged between approximately
0.1 and 0.55 [147]. Average precision is also used in music information retrieval to compare the performance of systems used to detect cover songs [119] with values ranging from
0.0017 to 0.521. Perhaps the application most similar to ours is [148] where both MFCC
features and semantic information were used to retrieve general audio data from the BBC
sound effects collection, where two sounds were considered relevant if they came from the
same CD of the collection. The results reported in [148] were a mean average precision of
0.165 using MFCC information alone and 0.186 when semantic information was included.
Therefore, we can conclude that our mean average precision values ranging from 0.1108 to
0.5038 when spectral clustering is used, and between 0.1258 to 0.5677 for exhaustive search
are quite good, and in the best case, can roughly be interpreted that for every relevant
sound retrieved our system retrieves only a single irrelevant sound along with it.
4.5.3 BIC Determination of Number of Clusters
Figure 4.5 plots BIC vs. number of clusters, where the minimum BIC value corresponds to
the optimum number of clusters. A single point on the BIC curve represents the average of
10 BIC values each corresponding to a different random starting point of the EM algorithm.
From Figure 4.5 we see that for the indoor database, the optimum number of clusters was
chosen as 16 and for the outdoor database 14. As we stated earlier, both databases have
six semantic categories, thus, the BIC can be considered as overestimating the "correct"
number of clusters. However, at least part of this overestimate can be explained by the
fact there exist sounds within each semantic category that are perceptually distinct, for
instance a printer sound, a vacuum cleaner sound, and an elevator sound are all machine
sounds with very different perceptual characteristics.

Fig. 4.5. BIC value vs. number of clusters.
4.5.4 Computational Cost
Finally, we compare average computational costs for the cluster-based indexing vs. exhaustive search in terms of the number of query likelihood evaluations. For exhaustive search,
the number of likelihood evaluations equals the number of sounds in the sound set; for
cluster-based indexing, the number of evaluations is Nc + K, where Nc is the number of
sounds in the chosen cluster, K is the number of clusters, and Nc is obtained by averaging
over the ten example queries in addition to using all 300 database sounds as queries for
computational cost analysis only. For indoor sounds, about six times as many evaluations
are required for exhaustive search vs. cluster-based indexing (155 vs. 25.80); for outdoor
sounds, over four times as many evaluations are necessary (145 vs. 35.10). Hence in applications where less than 20 sounds are to be retrieved, we have observed advantages in both
retrieval performance and computational cost, the cost advantage being substantial. We
have remarked that for much larger databases, the clustering can be made recursive and
thus lead to even greater computational savings. So far, such a large-scale application has
not been tested, due to the difficulty in obtaining the tens of thousands of subject-hours
required to establish relevancies for all of the sounds.
4.6 Conclusion
Once events from continuous recordings have been segmented and indexed, the primary
remaining challenge is retrieval. In this chapter, a likelihood-based approach for query-by-example was detailed and evaluated. Since every sound event is indexed with its own
probabilistic template, which requires inference evaluation every time a query is presented,
efficient retrieval can be difficult. To overcome this problem a modified spectral clustering
algorithm where the similarity between two indexed sound events is related to the distance
between their corresponding probabilistic models was proposed. Furthermore, these clusters
can be used as a generalized environmental sound classification system, where neither the
class types nor the number of classes are known a priori. The clustering system was shown to
provide large savings in search complexity, while minimally impacting retrieval accuracy on
two test databases captured under different recording conditions. Unfortunately, the query-by-example paradigm might not be convenient for all users in all situations. Therefore, in
the following chapter a series of semantic extensions to the current system are proposed.
5. INCORPORATING SEMANTIC INFORMATION
5.1 Introduction
With the advent of mobile computing, it is currently possible to record any sound event of
interest using the microphone onboard a smartphone, and immediately upload the audio
clip to a central server. Once uploaded, an online community can rate, describe, and reuse
the recording, appending social information to the acoustic content.
This kind of user-
contributed audio archive presents many advantages including open access, low cost entry
points for aspiring contributors, and community filtering to remove inappropriate content.
The challenge in using these archives is overcoming the "data deluge" that makes retrieving
specific recordings from a large database difficult.
The content-based query-by-example (QBE) technique where users query with sound
recordings they consider acoustically similar to those they hope to retrieve has achieved
much success for both music [65] and environmental sounds (Chapter 4).
Additionally,
content-based QBE is inherently unsupervised as no labels are required to rank sounds in
terms of their similarity to the query (although relevancy labels are required for formal
evaluation).
Unfortunately, even if suitable recordings are available they might still be
insufficient to retrieve certain environmental sounds. For example, suppose a user wants to
retrieve all of the "water" sounds from a given database. As sounds related to water are
extremely diverse in terms of acoustic content (e.g., rain drops, a flushing toilet, the call of
waterfowl, etc.), QBE is inefficient when compared to the simple text-based query "water."
Moreover, it is often the case that users do not have example recordings on hand, and in
these cases text-based semantic queries are more appropriate.
Assuming the sound files in the archive do not have textual metadata, a text-based retrieval system must relate sound files to text descriptions. Techniques that connect acoustic
content to semantic concepts present an additional challenge, in that learning the parameters of the retrieval system becomes a supervised learning problem as each training set
sound file must have semantic labels for parameter learning. Collecting these labels has
become its own research problem leading to the development of social games for collecting
the metadata that describes music [149,150].
Most previous systems for retrieving sound files using text queries, use a supervised
multi-category learning approach where a classifier is trained for each semantic concept
in the vocabulary. For example, in [74] semantic words are connected to audio features
through hierarchical clusters. Automatic record reviews of music are obtained in [75] by
using acoustic content to train a one versus all discriminative classifier for each semantic
concept in the vocabulary. An alternative generative approach that was successfully applied
to the annotation and retrieval of music and sound effects [76] consists of learning a Gaussian
mixture model (GMM) for each concept. In [151] support vector machine (SVM) classifiers
are trained for semantic and onomatopoeia labels when each sound file is represented as a
mixture of hidden acoustic topics. A large scale comparison of discriminative and generative classification approaches for text-based retrieval of general audio on the internet was
presented in [77].
One drawback of the multi-class learning approach is its inability to handle semantic
concepts that are not included in the training set without an additional training phase.
Furthermore, by not explicitly leveraging the semantic similarity between concepts, the
classifiers might miss important connections. For example, if the words "trot" and "grunt"
are never used as labels for the same sound, the retrieval system cannot model the information that these sounds may have been emitted from the same physical source (a horse),
even though they are widely separated in the acoustic feature space.
In an attempt to overcome these drawbacks we use a taxonomic approach similar to
that of [78,80] where unlabeled sounds are annotated with the semantic concepts belonging to their nearest neighbor in an acoustic feature space, and WordNet [79,152] is used
to extend the semantics. We aim to enhance this approach by introducing an ontological
framework where sounds are linked to each other through a measure of acoustic content
similarity, semantic concepts (tags) are linked to each other through a similarity metric
based on the WordNet ontology, and sounds are linked to tags based on descriptions from a
user community. We refer to this collection of linked concepts and sounds as a hybrid (content/semantic)
network [153,154] that possesses the ability to handle two query modalities.
When queries are sound files the system can be used for automatic annotation
or "auto-
tagging", which describes a sound file based on its audio content and provides suggested
tags for use as traditional metadata. When queries are concepts they can be used for textbased retrieval where a ranked list of unlabeled sounds that are most relevant to the query
concept is returned. Moreover, queries or new sounds/concepts can be efficiently connected
to the network, as long as they can be linked either perceptually if sound-based, or lexically
if word-based.
In describing our approach, we begin with a discussion of a novel retrieval framework
referred to as probabilistic retrieval ontologies in Section 5.2. Then a formal definition of
the related problems of automatic annotation and text-based retrieval of unlabeled audio,
followed by the application of our ontological framework solution follows in Section 5.3. The
proposed hybrid network architecture outputs a distribution over sounds given a concept
query (text-based retrieval) or a distribution over concepts given a sound query (annotation). The output distribution is determined from the shortest path distance between the
query and all output nodes (either sounds or concepts) of interest. The main challenge
of the hybrid network architecture is computing the link weights. Section 5.4 describes
an approach to determine the link weights connecting sounds to other sounds based on a
measure of acoustic content similarity, while Section 5.5 details how link weights between
semantic concepts are calculated using a WordNet similarity metric. It is these link weights
and similarity metrics that allow queries or new sounds/concepts to be efficiently connected
to the network. The third type of link weight in our network consists of those connecting sounds
to concepts. These weights are learned by attempting to match the output of the hybrid
network to semantic descriptions provided by a user community as outlined in Section 5.6.
We evaluate the performance of the hybrid network on a variety of information retrieval
tasks for two environmental sound databases. The first database contains environmental
sounds without post-processing, where all sounds were independently described multiple
times by a user community. This allows for greater resolution in associating concepts to
sounds as opposed to binary (yes/no) associations. This type of community information
is what we hope to represent in the hybrid network, but collecting this data remains an
arduous process and limits the size of the database.
In order to test our approach on a larger dataset, the second database consists of
several thousand sound files from the Freesound project [155]. While this dataset is larger
in terms of the numbers of sound files and semantic tags it is not as rich in terms of semantic
information as tags are applied to sounds in a binary fashion by the user-community. Given
the noisy nature (recording/encoding quality, various levels of post production, inconsistent
text labeling) of user-contributed environmental sounds, the results presented in Section 5.7
demonstrate that the hybrid network approach provides accurate retrieval performance. We
also test performance using semantic tags that are not previously included in the network,
i.e., out of vocabulary tags are used as queries in text-based retrieval and as the automatic
descriptions provided during annotation. Finally, conclusions and discussions of possible
topics of future work are provided in Section 5.8.
5.2 Probabilistic Retrieval Ontologies
The term ontology has a long philosophical history in the study of metaphysics, specifically
dealing with the categorization of existence [156]. Recently, the term has been widely used in
information science for the study, storage, analysis, and retrieval of all types of information.
In this context, ontology is often defined as the representation of shared knowledge, where
different types of information, e.g., concepts, definitions, websites, etc. are nodes of a graph
with the graph edges representing relationships between nodes [157]. When these edges
have some type of semantic meaning that relates two nodes they are referred to as concept
maps [158,159]. This type of model for information organization has been observed in how
humans organize linguistic categories [160] and memories [161,162].
Recent interest in ontologies emerged primarily around two main applications: (1) the
development of the semantic web and (2) natural language processing (NLP). The semantic
web extends traditional web-based description languages such as HTML by introducing a
hierarchical structure for describing documents (e.g., XML, RDF) in an attempt to allow
computers to extract meaning from data on the internet [163]. The NLP research field is
concerned with problems such as machine translation, automatic question answering, and
text summarization [156]. This interest in NLP led to the creation of WordNet [79,152],
a lexical database that serves as a combination dictionary/thesaurus and organizes sets
of synonyms into synsets, and then captures relationships between synsets. WordNet is
organized in a hierarchical fashion with the bottom level containing 100,000 synsets and
the top level of the hierarchy composed of 25 noun and 15 verb classes. Most NLP systems
follow similar hierarchical relationships, but these relationships are often only qualitative.
Attempts at quantifying ontological relationships have been proposed to compute the
similarity between words using WordNet [164,165]. These measures range from the number
of common words in the definitions to the shortest path distance between words in the
WordNet ontology. In information retrieval, inexact/descriptive semantic queries can be
probabilistically handled in relational databases by connecting nodes (database objects)
using link weights (edges) with a numeric value [166]. In this chapter, we propose a novel
probabilistic retrieval ontology that quantitatively connects database objects of different
modalities (e.g., text and sound), and retrieves a distribution over a subset of database
objects for a given query. Retrieving a distribution has several advantages, including easy
interpretation by the user and an optimal method of learning link weights, which we now
describe.
In information retrieval a query $q$ is used to search a database of $N$ objects $O = \{o_1, \ldots, o_N\}$
using a score function $f(q, o_j) \in \mathbb{R}$. In an optimal retrieval system the score
function will be such that

$$f(q, o_i) > f(q, o_j), \qquad o_i \in Z(q),\ o_j \in \bar{Z}(q) \qquad (5.2.1)$$

where $Z(q) \subset O$ and $\bar{Z}(q) \subset O$ are the subsets of relevant and irrelevant database objects,
respectively. That is, the score function should be highest for objects relevant to the query.
To determine effective score functions we again review the precision and recall criteria [66].
Precision is the number of desired sounds retrieved divided by the number of retrieved
sounds and recall is the number of desired sounds retrieved divided by the total number of
desired sounds. If we assume only one relevant object (either sound or concept) exists in the
database (denoted by $o_{i^*}$) and the retrieval system returns only the top result for a given
query, it should be clear that the probability of simultaneously maximizing precision and
recall reduces to the probability of retrieving the relevant document. An optimal retrieval
system should maximize this probability, which is equivalent to maximizing the posterior
$P(o_i \mid q)$, i.e., the relevant object is retrieved from the maximum-a-posteriori criterion, i.e.,

$$i^* = \underset{i \in 1:N}{\operatorname{argmax}}\ P(o_i \mid q). \qquad (5.2.2)$$
If there are multiple relevant objects in the database, and the retrieval system returns the
top R objects, we can return the objects with the R greatest posterior probabilities given
the query.
When the query and database objects are of the same modality (e.g., the QBE system
described in Chapter 4) the score function must only compute similarity between objects.
However, in several modern multimedia information retrieval applications the query is of
a different modality than the database objects, e.g., text queries can retrieve unlabeled
sounds [76]. In this case the score function must compare queries that are of a different
modality than the database objects being searched. The probabilistic ontologies proposed
in this section have been designed to specifically deal with these types of applications.
Formally, we define a probabilistic retrieval ontology as a graph consisting of a set of nodes or vertices (ovals and rectangles in the examples of Figure 5.1) denoted by $\mathcal{N}$. Two nodes $i, j \in \mathcal{N}$ can be connected by an undirected link with an associated nonnegative weight (also known as length or cost), which we denote by $W(i,j) = W(j,i) \in \mathbb{R}^+$. The smaller the value of $W(i,j)$, the stronger the connection between nodes $i$ and $j$. In Figures 5.1(a)-(c) the presence of an edge connecting node $i$ to node $j$ indicates a value of $0 < W(i,j) < \infty$, although the exact values for $W(i,j)$ are not indicated, while the dashed edges connecting the query node to the rest of the network are added at query time. If the query is already in the database, then the query node will be connected through the node representing it in the network by a single link of weight zero (meaning equivalence).
The posterior distribution (score function) is obtained from the hybrid network as
$$P(o_i \mid q) = \frac{e^{-d(q, o_i)}}{\sum_{o_j \in O} e^{-d(q, o_j)}}, \quad (5.2.3)$$
where (5.2.3) is the distribution over only those nodes in the network that contain the database objects to be retrieved, i.e., $O \subset \mathcal{N}$. In (5.2.3), $d(q, n)$ is the path distance between nodes $q$ and $n$. (Here a path is a connected sequence of nodes in which no node is visited more than once.) Currently, we represent $d(q, n)$ by the shortest path distance
$$d(q, n) = \min_k d_k(q, n), \quad (5.2.4)$$
where $k$ is the index among possible paths between nodes $q$ and $n$. Given starting node $q$, we can efficiently compute (5.2.4) for all $n \in \mathcal{N}$ using Dijkstra's algorithm [167]. We now describe how the link weights connecting nodes are determined.
Determination of the link weights depends on the ability to obtain "training distributions" of the form $P(o_i \mid o_j)$. These training distributions can be obtained either from formal user studies asking subjects to associate/label multimedia objects or from labeling games [149,150]; a specific example is provided in Section 5.6. The link weights $W(o_i, o_j)$ should then be set such that the probability distribution output by the ontological framework using (5.2.3) with $q = o_j$ is as close as possible to the training distribution $P(o_i \mid o_j)$, and the probability distribution output by the ontological framework using (5.2.3) with $q = o_i$ is as close as possible to the training distribution $P(o_j \mid o_i)$. The difference between probability distributions can be computed using the Kullback-Leibler (KL) divergence.
We define $\mathbf{w}$ to be the vector of all link weights we wish to learn, $Q_{ji}(\mathbf{w})$ as the probability distribution output by the ontological framework using (5.2.3) with $q = o_j$, and $P_{ji}(\mathbf{w})$ as the probability distribution output by the ontological framework using (5.2.3) with $q = o_i$. Furthermore, we define the probabilities obtained from the training distribution as $Q_{ji} = P(o_i \mid o_j)$ and $P_{ji} = P(o_j \mid o_i)$. Treating object $o_i$ as the query, the KL divergence between the distribution over database objects obtained from the network and the distribution obtained from the training distribution is
$$KL(o_i, \mathbf{w}) = \sum_{o_j \in O} P_{ji} \log\big[P_{ji} / P_{ji}(\mathbf{w})\big]. \quad (5.2.5)$$
Similarly, given object $o_j$ as the query, the KL divergence between the distribution of objects obtained from the network and the distribution obtained from the training data is
$$KL(o_j, \mathbf{w}) = \sum_{o_i \in O} Q_{ji} \log\big[Q_{ji} / Q_{ji}(\mathbf{w})\big]. \quad (5.2.6)$$
The network weights are then determined by solving the optimization problem
$$\min_{\mathbf{w}} \sum_{o_i \in O} \sum_{o_j \in O} KL(o_i, \mathbf{w}) + KL(o_j, \mathbf{w}). \quad (5.2.7)$$
The cost function in (5.2.7) is not convex, so convergence to the global minimum is not guaranteed [168], and care must be taken in choosing the initial weight vector for gradient descent techniques. Additionally, because calculating the cost function requires Dijkstra's algorithm to compute the ontological framework distributions, a closed-form expression for the gradient vector is not easily obtained, and the gradient must be computed numerically. We now describe how a probabilistic ontology can be applied to the semantic audio retrieval problem.
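As a deliberately simplified illustration of the weight-learning idea (not the full optimization over all shortest paths used in this work), the sketch below assumes a toy network in which the query reaches each database object through a single link, so that $d(q, o_i)$ equals the link weight itself and (5.2.3) reduces to a softmax of the negated weights; the KL cost is then minimized with a numerically estimated gradient.

```python
import numpy as np

def posterior_from_weights(w):
    # Toy single-link network: d(q, o_i) = w_i, so (5.2.3) is a softmax of -w.
    e = np.exp(-w)
    return e / e.sum()

def kl(p_target, p_model):
    # KL divergence between the training distribution and the network output.
    return float(np.sum(p_target * np.log(p_target / p_model)))

def learn_weights(p_target, n_iters=2000, step=0.1, eps=1e-6):
    """Finite-difference gradient descent on the KL cost, in the spirit of (5.2.7)."""
    w = np.ones_like(p_target)            # deliberately poor initialization
    for _ in range(n_iters):
        base = kl(p_target, posterior_from_weights(w))
        grad = np.zeros_like(w)
        for k in range(len(w)):
            w_eps = w.copy()
            w_eps[k] += eps
            grad[k] = (kl(p_target, posterior_from_weights(w_eps)) - base) / eps
        w = np.maximum(w - step * grad, 0.0)   # keep link weights nonnegative
    return w

# Fit weights so the toy network reproduces a hypothetical training distribution.
target = np.array([0.5, 0.3, 0.2])
w = learn_weights(target)
print(posterior_from_weights(w))   # approaches [0.5, 0.3, 0.2]
```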
5.3 An Ontological Framework Connecting Semantics and Sound
Fig. 5.1. Operating modes of the hybrid network for audio retrieval and annotation: (a) QBE retrieval, (b) text-based retrieval, (c) annotation. Dashed lines indicate links added at query time.

In content-based QBE, a sound query $q_s$ is used to search a database of $N$ sounds $S = \{s_1, \ldots, s_N\}$ using a score function $F(q_s, s_i) \in \mathbb{R}$. The score function must be designed
in such a way that two sound files can be compared in terms of their acoustic content. Furthermore, let $A(q_s) \subset S$ denote the subset of database sounds that are relevant to the query, while the remaining sounds $\bar{A}(q_s) \subset S$ are irrelevant. The score function should be highest for sounds relevant to the query, i.e.,
$$F(q_s, s_i) > F(q_s, s_j), \quad s_i \in A(q_s),\; s_j \in \bar{A}(q_s). \quad (5.3.1)$$
In text-based retrieval, the user inputs a semantic concept (descriptive tag) query $q_c$, and the database sound set $S$ is ranked in terms of relevance to the query concept. In this case, the score function $G(q_c, s_i) \in \mathbb{R}$ must relate concepts to sounds, and should be designed such that
$$G(q_c, s_i) > G(q_c, s_j), \quad s_i \in A(q_c),\; s_j \in \bar{A}(q_c). \quad (5.3.2)$$
Once a function $G(q_c, s_i)$ is known, it can be used for the related problem of annotating unlabeled sound files. Formally, a sound query $q_s$ is annotated using tags from a vocabulary of semantic concepts $C = \{c_1, \ldots, c_M\}$. Letting $B(q_s) \subset C$ be the subset of concepts relevant to the query, and $\bar{B}(q_s) \subset C$ the irrelevant concepts, the optimal annotation system satisfies
$$G(c_i, q_s) > G(c_j, q_s), \quad c_i \in B(q_s),\; c_j \in \bar{B}(q_s). \quad (5.3.3)$$
As described in Section 5.2, the posterior of a database object given the query is an optimal score function. Thus, each of the score functions in (5.3.1)-(5.3.3) for QBE, text-based retrieval, and annotation, respectively, reduces to the appropriate posterior:
$$F(q_s, s_i) = P(s_i \mid q_s), \quad G(q_c, s_i) = P(s_i \mid q_c), \quad G(c_i, q_s) = P(c_i \mid q_s). \quad (5.3.4)$$
Our goal with the ontological framework proposed in this chapter is to estimate all
posterior probabilities of (5.3.4) in a unified fashion. This is achieved using the procedure
described in Section 5.2 to construct a hybrid (content/semantic) network from all elements
in the sound database, the associated semantic tags, and the query (either concept or sound
file) as shown in Figures 5.1(a)-(c). In Figure 5.1(a) an audio sample is used to query the
system and the output is a probability distribution over all sounds in the database.
In
Figure 5.1(b) a word is the query with the system output again a probability distribution
over sounds. In Figure 5.1(c) a sound query is used to output a distribution over words.
The posterior distributions (score functions) from (5.3.4) are obtained from the hybrid network as:
$$P(s_i \mid q_s) = \frac{e^{-d(q_s, s_i)}}{\sum_{s_j \in S} e^{-d(q_s, s_j)}}, \quad (5.3.5)$$
$$P(s_i \mid q_c) = \frac{e^{-d(q_c, s_i)}}{\sum_{s_j \in S} e^{-d(q_c, s_j)}}, \quad (5.3.6)$$
$$P(c_i \mid q_s) = \frac{e^{-d(q_s, c_i)}}{\sum_{c_j \in C} e^{-d(q_s, c_j)}}, \quad (5.3.7)$$
where (5.3.5) is the distribution over sounds illustrated in Figure 5.1(a), (5.3.6) is the distribution over sounds illustrated in Figure 5.1(b), and (5.3.7) is the distribution over concepts illustrated in Figure 5.1(c). In (5.3.5)-(5.3.7), $d(q, n)$ is the shortest path distance between nodes $q$ and $n$. We now describe how the link weights connecting sounds and words are determined.
5.4 Acoustic Information: Sound-Sound Links
Determining the links connecting sounds to other sounds begins with extraction of the six low-level features (loudness, spectral centroid, spectral sparsity, temporal sparsity, transient index, and harmonicity) described in Chapter 2. Let $t \in 1:T_j$ be the frame index for a sound file of length $T_j$, and $\ell \in 1:P$ be the feature index; we define $Y_{t,j}^{(\ell)}$ as the $\ell$th observed feature for sound $s_j$ at time $t$. Thus, each sound file $s_j$ can be represented as a time-series of feature vectors denoted by $Y_{1:T_j, j}^{(1:P)}$. If all sound files in the database are equally likely, the maximum-a-posteriori retrieval criterion discussed in Section 5.3 reduces to maximum likelihood. Thus, sound-sound link weights should be determined using a likelihood-based technique. To compare environmental sounds in a likelihood-based manner, a hidden Markov model (HMM) $\lambda^{(j,\ell)}$ is estimated from the $\ell$th feature trajectory of sound $s_j$. These HMMs encode whether the feature trajectory varies in a constant (high or low), increasing/decreasing, or more complex (up $\to$ down; down $\to$ up) fashion. All features are modeled as conditionally independent given the corresponding HMM, i.e., the likelihood that the feature trajectory of sound $s_j$ was generated by the HMMs built to approximate the simple feature trends of sound $s_i$ is
$$L(s_j, s_i) = \log P\big(Y_{1:T_j, j}^{(1:P)} \mid \lambda^{(i, 1:P)}\big). \quad (5.4.1)$$
Details on the estimation of $\lambda^{(j,\ell)}$ and computation of (5.4.1) are described in Chapter 4. To make fast comparisons, in this chapter we allow only constant HMMs, so $\lambda^{(j,\ell)} = \{\mu^{(j,\ell)}, \sigma^{(j,\ell)}\}$, where $\mu^{(j,\ell)}$ and $\sigma^{(j,\ell)}$ are the sample mean and standard deviation of the $\ell$th feature trajectory for sound $s_j$. Thus,
$$P\big(Y_{1:T_j, j}^{(\ell)} \mid \lambda^{(j,\ell)}\big) = \prod_{t=1}^{T_j} \gamma\big(Y_{t,j}^{(\ell)}; \mu^{(j,\ell)}, \sigma^{(j,\ell)}\big), \quad (5.4.2)$$
where $\gamma(y; \mu, \sigma)$ is the univariate Gaussian pdf with mean $\mu$ and standard deviation $\sigma$ evaluated at $y$.
The ontological framework we have defined is an undirected, acyclic graph, which requires weights be symmetric ($W(s_i, s_j) = W(s_j, s_i)$) and nonnegative ($W(s_i, s_j) \geq 0$). Therefore, we cannot use the log-likelihood $L(s_i, s_j)$ as the link weight between nodes $s_i$ and $s_j$, because it is not guaranteed to be symmetric and nonnegative. Fortunately, a well known semi-metric that satisfies these properties and approximates the distance between HMMs exists [143,153]. Using this semi-metric we define the link weight between nodes $s_i$ and $s_j$ as
$$W(s_i, s_j) = \frac{1}{T_i}\big[L(s_i, s_i) - L(s_i, s_j)\big] + \frac{1}{T_j}\big[L(s_j, s_j) - L(s_j, s_i)\big], \quad (5.4.3)$$
where $T_i$ and $T_j$ represent the lengths of the feature trajectories for sounds $s_i$ and $s_j$, respectively. Although the semi-metric in (5.4.3) does not satisfy the triangle inequality, its properties are: (a) symmetry, $W(s_i, s_j) = W(s_j, s_i)$; (b) non-negativity, $W(s_i, s_j) \geq 0$; and (c) distinguishability, $W(s_i, s_j) = 0$ iff $s_i = s_j$.
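A minimal sketch of the constant-HMM likelihood of (5.4.2) and the symmetrized link weight of (5.4.3), assuming each sound is summarized by a NumPy array of per-frame feature vectors (the array names and shapes here are illustrative, not the actual implementation):

```python
import numpy as np

def gaussian_loglik(y, mu, sigma):
    """Sum of univariate Gaussian log-densities over one feature trajectory,
    i.e., the log of the product in (5.4.2)."""
    return float(np.sum(-0.5 * np.log(2.0 * np.pi * sigma ** 2)
                        - (y - mu) ** 2 / (2.0 * sigma ** 2)))

def loglik_under_model(traj, model_traj):
    """L(a, b): log-likelihood of `traj` under the constant HMM (per-feature
    sample mean and standard deviation) fit to `model_traj`.
    Both arrays have shape (num_frames, num_features)."""
    mu = model_traj.mean(axis=0)
    sigma = model_traj.std(axis=0) + 1e-9   # guard against zero variance
    return sum(gaussian_loglik(traj[:, f], mu[f], sigma[f])
               for f in range(traj.shape[1]))

def sound_sound_weight(traj_i, traj_j):
    """Symmetric, nonnegative link weight of (5.4.3)."""
    Ti, Tj = len(traj_i), len(traj_j)
    return ((loglik_under_model(traj_i, traj_i) - loglik_under_model(traj_i, traj_j)) / Ti
            + (loglik_under_model(traj_j, traj_j) - loglik_under_model(traj_j, traj_i)) / Tj)

# Example with random stand-in feature trajectories (T frames x 6 features).
rng = np.random.default_rng(0)
a, b = rng.normal(0, 1, (120, 6)), rng.normal(0.5, 1.2, (200, 6))
print(sound_sound_weight(a, b) >= 0, abs(sound_sound_weight(a, a)) < 1e-9)
```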
5.5 Semantic Information: Concept-Concept Links
One technique for determining concept-concept link weights is to assign a link of weight zero (meaning equivalence) to concepts with common stems, e.g., run/running and laugh/laughter, while other concepts are not linked. To calculate a wider variety of concept-to-concept link weights, we use a similarity metric from the WordNet::Similarity library [169]. A comparison of five similarity metrics from the WordNet::Similarity library in terms of audio information retrieval was studied in [154]. In that work the Jiang and Conrath (jcn) metric [164] performed best in terms of average precision, but had part-of-speech incompatibility problems that did not allow concept-to-concept comparisons for adverbs and adjectives. Therefore, in this work we use the vector metric, because it supports the comparison of adjectives and adverbs, which are commonly used to describe sounds. The vector metric computes the co-occurrence of two concepts within the collections of words used to describe other concepts (their glosses) [169]. For a full review of WordNet similarity, see [165,169].
Fig. 5.2. An example hybrid network illustrating the difference between in and out of vocabulary tags.
By defining $\mathrm{Sim}(c_i, c_j)$ as the WordNet similarity between the concepts represented by nodes $c_i$ and $c_j$, an appropriately scaled link weight between these nodes is
$$W(c_i, c_j) = -\log\left[\frac{\mathrm{Sim}(c_i, c_j)}{\max_{k,l} \mathrm{Sim}(c_k, c_l)}\right]. \quad (5.5.1)$$
The link weights between semantic concepts $W(c_i, c_j)$ allow the hybrid network to handle out of vocabulary tags, i.e., semantic tags that were not applied to the training sound files used to construct the retrieval system can still be used either as queries in text-based retrieval or as tags applied during the annotation process.
retrieval or as tags applied during the annotation process. This flexibility is an important
advantage of the hybrid network approach as compared to the multi-class supervised learning
approaches to audio information retrieval, e.g., [76,77]. Figure 5.2 displays an example
hybrid network illustrating the difference between in and out of vocabulary semantic tags.
While out of vocabulary tags are connected only to in vocabulary tags through links with
weights of the form of (5.5.1), in vocabulary tags are connected to sound files based on
information from the user community via the procedure described in the following section.
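A small sketch of (5.5.1); since the WordNet::Similarity vector measure is a Perl library, the similarity callable below is a toy stand-in, and any gloss-based similarity function could be substituted:

```python
import math
from itertools import product

def concept_weights(vocabulary, similarity):
    """Return W(c_i, c_j) = -log[ Sim(c_i, c_j) / max_{k,l} Sim(c_k, c_l) ]
    for every pair of distinct concepts, per (5.5.1).
    `similarity` is any callable returning a positive similarity score."""
    max_sim = max(similarity(a, b) for a, b in product(vocabulary, repeat=2) if a != b)
    weights = {}
    for a, b in product(vocabulary, repeat=2):
        if a != b:
            weights[(a, b)] = -math.log(similarity(a, b) / max_sim)
    return weights

# Placeholder similarity: character overlap (stands in for a gloss vector measure).
def toy_similarity(a, b):
    return len(set(a) & set(b)) / len(set(a) | set(b)) + 1e-6

print(concept_weights(["train", "rail", "bird"], toy_similarity))
```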
5.6 Social Information: Sound-Concept Links
We quantify the social information connecting sounds and concepts using an $M \times N$ dimensional votes matrix $V$, with elements $V_{ji}$ equal to the number of users who have tagged sound $s_i$ with concept $c_j$ divided by the total number of users who have tagged sound $s_i$. By appropriately normalizing the votes matrix, it can be interpreted probabilistically as
$$P(s_i, c_j) = V_{ji} \Big/ \sum_k \sum_l V_{kl}, \quad (5.6.1)$$
$$Q_{ji} = P(s_i \mid c_j) = V_{ji} \Big/ \sum_k V_{jk}, \quad (5.6.2)$$
$$P_{ji} = P(c_j \mid s_i) = V_{ji} \Big/ \sum_k V_{ki}, \quad (5.6.3)$$
where $P(s_i, c_j)$ is the joint probability of $s_i$ and $c_j$, $Q_{ji} = P(s_i \mid c_j)$ is the conditional probability of sound $s_i$ given concept $c_j$, and $P_{ji} = P(c_j \mid s_i)$ is defined similarly. Our goal in determining the social link weights connecting sounds and concepts $W(s_i, c_j)$ is that the hybrid network should perform both the annotation and text-based retrieval tasks in a manner consistent with the social information provided by the votes matrix. That is, the probability distribution output by the ontological framework using (5.3.6) with $q_c = c_j$ should be as close as possible to $Q_{ji}$ from (5.6.2), and the probability distribution output using (5.3.7) with $q_s = s_i$ should be as close as possible to $P_{ji}$ from (5.6.3). We now describe
how the weight learning technique of Section 5.2 can be applied to the hybrid audio retrieval
network.
We define $\mathbf{w} = \{W(s_i, c_j) \mid V_{ji} \neq 0\}$ to be the vector of all sound-word link weights, $Q_{ji}(\mathbf{w})$ as the probability distribution output by the ontological framework using (5.3.6) with $q_c = c_j$, and $P_{ji}(\mathbf{w})$ as the probability distribution output by the ontological framework using (5.3.7) with $q_s = s_i$. Treating sound $s_i$ as the query, the KL divergence between the distribution over database concepts obtained from the network and the distribution obtained from the user votes matrix is
$$KL(s_i, \mathbf{w}) = \sum_{c_j \in C} P_{ji} \log\big[P_{ji} / P_{ji}(\mathbf{w})\big]. \quad (5.6.4)$$
Similarly, given concept $c_j$ as the query, the KL divergence between the distribution of sounds obtained from the network and the distribution obtained from the user votes matrix is
$$KL(c_j, \mathbf{w}) = \sum_{s_i \in S} Q_{ji} \log\big[Q_{ji} / Q_{ji}(\mathbf{w})\big]. \quad (5.6.5)$$
The network weights are then determined by solving the optimization problem
$$\min_{\mathbf{w}} \sum_{s_i \in S} \sum_{c_j \in C} KL(s_i, \mathbf{w}) + KL(c_j, \mathbf{w}). \quad (5.6.6)$$
Empirically, we have found that setting the initial weight values to $W(s_i, c_j) = -\log P(s_i, c_j)$ leads to quick convergence. Furthermore, if resources are not available to use the KL weight learning technique, setting the sound-concept link weights to $W(s_i, c_j) = -\log P(s_i, c_j)$ provides a simple and effective approximation of the optimized weight.
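The normalization of (5.6.1)-(5.6.3) and the simple approximation $W(s_i, c_j) = -\log P(s_i, c_j)$ can be sketched as follows, with a toy votes matrix standing in for real user data:

```python
import numpy as np

def normalize_votes(V):
    """V: M x N votes matrix (rows = concepts c_j, columns = sounds s_i).
    Returns the joint of (5.6.1) and the conditionals of (5.6.2)-(5.6.3)."""
    V = np.asarray(V, dtype=float)
    joint = V / V.sum()                        # P(s_i, c_j)
    Q = V / V.sum(axis=1, keepdims=True)       # Q_ji = P(s_i | c_j): rows sum to 1
    P = V / V.sum(axis=0, keepdims=True)       # P_ji = P(c_j | s_i): columns sum to 1
    return joint, Q, P

def initial_sound_concept_weights(joint):
    """W(s_i, c_j) = -log P(s_i, c_j) wherever at least one vote exists;
    entries with no votes get no link (marked here with infinity)."""
    with np.errstate(divide="ignore"):
        return np.where(joint > 0, -np.log(joint), np.inf)

# Toy votes: 3 concepts (rows) x 4 sounds (columns).
V = np.array([[2, 0, 1, 0],
              [0, 3, 0, 1],
              [1, 1, 0, 0]])
joint, Q, P = normalize_votes(V)
print(initial_sound_concept_weights(joint))
```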
Presently, the votes matrix is obtained using only a simple tagging process. In the
future we hope to augment the votes matrix with other types of community activity, such
as discussions, rankings, or page navigation paths on a website. Furthermore, sound-to-concept link weights can be set as compositional parameters rather than learned from a
"training set" of tags provided by users. For example, sounds can be made equivalent
to certain emotional concepts (happy, angry, etc.)
through the addition of zero-weight
connections between specified sounds and concepts.
5.7 Results and Discussion
In this section, the performance of the hybrid network on the annotation and text-based
retrieval tasks will be evaluated. (QBE results were considered in Chapter 4 and are not
presented here).
5.7.1 Experimental setup
Two datasets are used in the evaluation process. The first dataset, which we will refer to as the Soundwalks dataset, contains 178 sound files uploaded to the Soundwalks.org website. The 178 sound files were recorded during seven separate field recording sessions, each lasting anywhere from 10 to 30 minutes and sampled at 44.1 kHz. Each session was recorded continuously and then hand-segmented into segments lasting between 2 and 60 seconds. The recordings took place at three light rail stops (75 segments), outside a stadium during a football game (60 segments), at a skatepark (16 segments), and at a college campus (27 segments). To obtain tags, study participants were directed to a website containing ten random sounds from the set and were asked to provide one or more single-word descriptive tags for each sound. With 90 responses, each sound was tagged an average of 4.62 times. We have used 88 of the most popular tags as our vocabulary.
Because the Soundwalks dataset contains 90 subject responses, a non-binary votes matrix can be used to determine the sound-concept link weights described in Section 5.6. Obtaining this votes matrix requires large amounts of subject time, thus limiting its size. To test the hybrid network performance on a larger dataset, we use 2064 sound files and a 377-tag vocabulary from Freesound.org. In the Freesound dataset, tags are applied in a binary (yes/no) manner to each sound file. The sound files were randomly selected from among all files (whether encoded in a lossless or lossy format) on the site containing any of the 50 most used tags and between 3-60 seconds in length. Additionally, each sound file contained between three and eight tags, and each of the 377 tags in the vocabulary was applied to at least five sound files.
To evaluate the performance of the hybrid network we adopt a two-fold cross validation
approach where all of the sound files in our dataset are partitioned into two non-overlapping
subsets. One of these subsets and its associated tags is then used to build the hybrid network via the procedure described in Sections 5.3-5.6.

TABLE 5.1
Database partitioning procedure for each cross validation run.

                               Soundwalks   Freesound
Number of sound files               178        2064
  In network (training)              89        1032
  Out of network (testing)           89        1032
Number of tags                       88         377
  In vocabulary                      18          75
  Out of vocabulary                  70         302
The remaining subset is then used to
test both the annotation and text-based retrieval performance for unlabeled environmental
sounds. Furthermore, an important novelty in this work is the ability of the hybrid network
to handle out of vocabulary tags. To test performance for out of vocabulary tags, a second
tier of cross validation is employed where all tags in the vocabulary are partitioned into five
random, non-overlapping subsets. One of these subsets is then used along with the subset of
sound files to build the hybrid network, while the remaining tags are held out of vocabulary.
This partitioning procedure is summarized in Table 5.1 for both the Soundwalks and Freesound datasets. Reported results are the average over these 10 (five tag, two sound
splits) cross validation runs. Relevance is determined to be positive if a held out sound file
was actually labeled with a tag.
5.7.2 Annotation
In annotation, each sound in the testing set is used as a query to provide an output distribution over semantic concepts. For a given sound query $q_s$ we denote by $B(q_s)$ the set of relevant tags, and $|B|$ the number of relevant tags for that query. Assuming the $M$ tags in the database are ranked in order of decreasing probability for a given query, by truncating the list to the top $n$ tags and counting the number of relevant tags, denoted by $|B^{(n)}|$, we define precision $= |B^{(n)}|/n$ and recall $= |B^{(n)}|/|B|$. Average precision is found by incrementing $n$ and averaging the precision at all points in the ranked list where a relevant tag is located. Additionally, the area under the receiver operating characteristics curve (AROC) is found by integrating the ROC curve, which plots the true positive rate versus the false positive rate for the ranked list of output tags.
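For reference, these ranking metrics can be computed in a few lines; the sketch below operates on generic ranked lists and is not the evaluation code used to produce the reported numbers:

```python
def precision_recall_at_n(ranked, relevant, n):
    """Precision and recall after truncating the ranked list to the top n items."""
    relevant = set(relevant)
    hits = sum(1 for item in ranked[:n] if item in relevant)
    return hits / n, hits / len(relevant)

def average_precision(ranked, relevant):
    """Precision averaged at every rank where a relevant item appears."""
    relevant = set(relevant)
    hits, precisions = 0, []
    for n, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / n)
    return sum(precisions) / len(precisions) if precisions else 0.0

def aroc(ranked, relevant):
    """Area under the ROC curve: the fraction of (relevant, irrelevant) pairs
    in which the relevant item is ranked above the irrelevant one."""
    relevant = set(relevant)
    pos_seen, pairs, negs = 0, 0, 0
    for item in ranked:
        if item in relevant:
            pos_seen += 1
        else:
            negs += 1
            pairs += pos_seen
    return pairs / (pos_seen * negs) if pos_seen and negs else 0.0

# Toy ranked list of tags with two relevant tags.
ranked_tags = ["train", "horn", "bird", "traffic", "voice"]
print(average_precision(ranked_tags, {"train", "traffic"}),
      aroc(ranked_tags, {"train", "traffic"}))
```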
Figures 5.3(a) and (b) display the precision and recall curves, respectively, averaged
over all sound queries and cross validation runs for the Soundwalks dataset.
The three
curves in Figure 5.3 represent three different ways of building the hybrid network. The in
vocabulary curve can be considered as an upper bound of annotation performance as all
tags are used in building the network. The out of vocabulary (WordNet)
curve uses only a
subset of tags to build the hybrid network, and remaining tags are connected only through
concept-concept links as described in Section 5.5. The out of vocabulary (Baseline)
curve
uses only a subset of tags to build the hybrid network, and remaining tags are returned in
random order. This is how the approach of training a classifier for each tag, e.g., [76,77,151]
would behave for out of vocabulary tags. From Figures 5.3(a) and (b) we see that out of
vocabulary performance is improved both in terms of precision and recall when WordNet
link weights are included. Additionally, from the precision curve of Figure 5.3(a) we see
that approximately 15% of the top 20 out of vocabulary tags are relevant, while for in
vocabulary tags this number is 25%. Considering the difficulty of the out of vocabulary problem, and that each sound file is labeled with far fewer than 20 tags, this performance is quite promising. From the recall curve of Figure 5.3(b) approximately 30% of relevant
out of vocabulary tags are returned in the top 20, compared to approximately 60% of in
vocabulary tags.
Fig. 5.3. Precision (a) and recall (b) curves for annotation of unlabeled sound files in the Soundwalks dataset, averaged over 10 cross-validation splits (x-axis: number of tags returned).
Table 5.2 contains the mean average precision (MAP) and mean area under the receiver
operating characteristics curve (MAROC) values for both the Soundwalks and Freesound
databases. We see that performance is comparable between the two datasets, despite the
Freesound set being an order of magnitude larger. The slightly better performance on the
Soundwalks dataset is most likely due to the large amount of social information contained in
the votes matrix, which is used to set sound-concept link weight values. The in vocabulary
MAP values of 0.4333 and 0.4113 compare favorably to the per-word MAP value of 0.179
reported in [76] for annotating BBC sound effects. Benchmarking the performance for out
of vocabulary tags is more difficult since this task is often not considered in the literature.
TABLE 5.2
Annotation performance using out of vocabulary semantic concepts.

                                   Soundwalks            Freesound
                                   MAP      MAROC        MAP      MAROC
In vocabulary (upper bound)        0.4333   0.7523       0.4113   0.8422
Out of vocabulary (WordNet)        0.2131   0.6322       0.1123   0.6279
Out of vocabulary (baseline)       0.1789   0.5353       0.1092   0.5387
Fig. 5.4. Precision (a) and recall (b) curves for text-based retrieval of unlabeled sound files in the Soundwalks dataset, averaged over 10 cross-validation splits (x-axis: number of sounds returned; curves: In Vocabulary, Out of Vocabulary (WordNet), Out of Vocabulary (Baseline)).
5.7.3 Text-based retrieval
In text-based retrieval, each semantic tag is used as a query to provide an output distribution over the test sounds. For a given query we denote by $A(q_c)$ the set of relevant test sounds that are labeled with the query word, and $|A|$ the number of relevant test sounds for that query. Precision, recall, MAP, and MAROC values are then computed as described above. Figures 5.4(a) and (b) display the precision and recall curves, respectively, averaged over all queries and cross validation runs for the Soundwalks dataset, while Table 5.3 displays the MAP and MAROC values. As with annotation, text-based retrieval with out of vocabulary concepts is significantly more difficult than with in vocabulary concepts, but including the concept-concept links based on the measure of WordNet similarity helps to improve retrieval performance.
TABLE 5.3
Text-based retrieval performance using out of vocabulary semantic concepts.

                                   Soundwalks            Freesound
                                   MAP      MAROC        MAP      MAROC
In vocabulary (upper bound)        0.2725   0.6846       0.2198   0.7100
Out of vocabulary (WordNet)        0.1707   0.6291       0.0681   0.5788
Out of vocabulary (baseline)       0.1283   0.5355       0.0547   0.5414

To demonstrate that actual retrieval performance is likely considerably better than the reported precision, recall, MAP, and MAROC values, which are averaged over noisy tags contributed by non-expert users, we provide the example of Table 5.4. Here, the word "rail" is
used as an out of vocabulary query to retrieve unlabeled sounds, and the top four results
are displayed. Additionally, Table 5.4 displays the posterior probability of each of the top
four results, the shortest path of nodes from the query to the output sounds, and whether or
not the output sound is relevant. The top result is a mixture of automobile traffic and a train horn, but it was not tagged by any users with the word "rail," even though, like the sounds actually tagged with "rail," it is a recording of a train station. Although filtering these types of results would improve quantitative performance, it would require listening to
thousands of sound files and overruling subjective decisions made by the users who listened
to and labeled the sounds.
TABLE 5.4
Top four results from the Soundwalks dataset for text-based retrieval with the out of vocabulary query "rail". Parenthetical descriptions are not actual tags, but are provided to give an idea of the acoustic content of the sound files.

Posterior Probability   Node Path                                                                                         Relevant
0.19                    rail => train => segment94.wav (train bell) => segment165.wav (traffic/train horn)                No
0.17                    rail => voice => segment136.wav (pa announcement) => segment133.wav (pa announcement)             Yes
0.15                    rail => train => segment40.wav (train brakes) => segment30.wav (train bell/brakes)                Yes
0.09                    rail => train => segment40.wav (train brakes) => segment147.wav (train horn)                      Yes
5.7.4 In vocabulary semantic information
Effective annotation and retrieval for out of vocabulary tags requires some method of relating the semantic similarity of tags, e.g., the WordNet similarity metric used in this work. In
this section we examine how the inclusion of semantic connections between in vocabulary
tags affects annotation and text-based retrieval performance. Table 5.5 compares the MAP
and MAROC values for the Soundwalks dataset where all tags are used in building the
network both with and without semantic links connecting tags. The results of Table 5.5
suggest that when the information connecting sounds and tags is available (i.e., tags are
in the vocabulary) the semantic links provided by WordNet confound the system by allowing for possibly irrelevant relationships between tags. This is not unlike the observations
of [170] where using WordNet did not significantly improve information retrieval performance. Comparing the environmental sound retrieval performance of WordNet similarity
with other techniques for computing prior semantic similarity (e.g., Google distance [171])
remains a topic of future work, since some measure of semantic similarity is necessary to
handle out of vocabulary tags.
TABLE 5.5
Performance of retrieval tasks with the Soundwalks dataset using WordNet connections between in vocabulary semantic concepts.

                      Text-based retrieval        Annotation
                      MAP       MAROC             MAP       MAROC
With WordNet          0.2166    0.6133            0.2983    0.6670
Without WordNet       0.3744    0.6656            0.4633    0.7978
5.8 Conclusions and Future Work
Currently, a significant portion of freely available environmental sound recordings are user-contributed and inherently noisy in terms of audio content and semantic descriptions. To
aid in the navigation of these audio databases we show the utility of a system that can
be used for text-based retrieval of unlabeled audio, content-based query-by-example, and
automatic audio annotation. Specifically, an ontological framework connects sounds to each
other based on a measure of perceptual similarity, tags are linked based on a measure of
semantic similarity, while tags and sounds are connected by optimizing link weights given
user preference data. An advantage of this approach is the ability of the system to flexibly
extend when new sounds and/or tags are added to the database. Specifically, unlabeled
sound files can be queried or annotated with out of vocabulary concepts, i.e., tags that do
not currently exist in the database.
6. CONCLUSIONS
6.1 Summary
As the ability to store and transmit data increases it has become possible to record, archive,
and share all sound that occurs in a given space or human life. This has inspired researchers
in the audio signal processing community to look beyond the speech and music sound categories that have dominated the research field up to this point, and begin to explore the
utility of environmental and natural sound research. Possible applications include audio
life-logging where people continuously record the sound experiences that make up their life,
biophony - the study of non-human animal sounds, multimedia
composition,
and surveillance. As all of these applications can benefit from the "always on" nature of continuous
recordings, the enormous amount of audio data produced must be organized and accessed.
The problems of organizing and accessing these continuously recorded sound archives
present three specific areas in need of technical innovation: segmentation
(determining
where sound events begin or end), indexing (computing and storing sufficient information
along with each event to distinguish it from other events), and retrieval (intuitively obtaining all events of a given type). All of these techniques are based on the extraction of
low-level acoustic features, where a frame of raw audio (e.g., 50 milliseconds) is converted
into a feature vector summarizing the audio content. By monitoring changes in the values
of these features we can determine all of the onset and end times for each sound event in
the continuous recording. Once segmented, each event is then indexed with a probabilistic
model summarizing the evolution of acoustic features over the course of the event. Every
indexed sound event is stored in a database, and all events of interest can be intuitively
retrieved.
In order to develop a truly intuitive sound information retrieval system, we must allow
for different query modalities. Two important query types are sound queries (query-by-
example) and semantic queries (query-by-text).
Using sound to search for sound is an
intuitive approach in many situations, but the number of possible queries is limited by
the sound files the user has on hand and the types of sounds they can produce/record in
their surrounding environment. In situations where the user does not have adequate sound
queries available, query-by-text is clearly the preferred approach. By treating each sound
event and semantic concept in the database as a node in an undirected graph we propose a
hybrid (acoustic content/semantic) network structure that can retrieve audio information
from sound or text queries. Additionally, this hybrid network structure can be used for
annotation,
i.e., automatically providing semantic descriptions for an unlabeled piece of
sound.
6.2 Future Directions
As the computer audition community has only recently begun extensive research into the
broad class of natural and environmental sounds, there are several interesting directions for
future work. While some of these directions are straightforward extensions of the algorithms
proposed in this dissertation, others are more indirectly related.
Feature extraction
Although a massive number of low-level acoustic features has been proposed in the computer audition literature and even standardized (e.g., [109,114]), very little is understood
about how humans perceive acoustic characteristics other than pitch and loudness. Both
the segmentation and retrieval algorithms proposed in this dissertation were based on some
measure of distance/similarity between the six low-level features proposed in Chapter 2.
Although we attempted to perceptually weight some features by using the decibel, Bark [90]
and Mel [89] frequency scales, we are still unsure whether or not feature distance varies with
perceptual distance. Ideally the scale of each feature should be determined using a just-
noticeable difference (JND) technique to find out what differences in feature values lead to
perceived differences in sound characteristics among most humans.
One challenge in mapping JND scales for the low-level acoustic features that are important in environmental sound perception is the lack of a clear understanding of how to
synthesize environmental sounds based on these features. As in other aspects of computer
audition, techniques for speech [172] and music [173] synthesis have attracted lots of attention, while environmental sound synthesis techniques have only recently drawn interest.
Perhaps the banded waveguide approach of [174], the impact/noise/chirp technique of [175],
or the filter statistics method of [176] will eventually provide an extensible framework for
synthesizing environmental sounds and mapping JND scales for the features that compose
them.
Segmentation
In this dissertation we focused on single-channel
algorithms, i.e., we assumed that a
single microphone is responsible for capturing the sound. While this assumption makes
sense for applications such as life-logging where a person carries a microphone with them;
for characterizing the sound activity in large fixed spaces multiple microphones located large
distances from each other might be necessary. That is, sounds originating in one part of
the space may not be perceptible in another, or may be perceptible only at high SNR. To
this end, we have extended the environmental sound segmentation algorithm proposed in
Chapter 3 to the multi-channel case in [177]. Here the sound is continuously recorded using
a microphone array distributed throughout the space. The method jointly infers onsets and
end times of the most prominent sound events in the space, along with the active subset
of microphones responsible for capturing the sound corresponding to each event.
Given a sound event segmented using this multi-channel algorithm, it is not yet clear how multi-channel events should be indexed and retrieved using the framework of this dissertation.
Perhaps the greatest downside to a multi-channel extension of the segmentation algorithm (and to a lesser extent the single-channel approach described in this dissertation) is the computational complexity of the approximate Viterbi inference scheme. In the approximate Viterbi inference scheme, every possible combination of discrete variables has a linear dynamic system matched to it. For the single-channel case with three features used for the segmentation, there are $|\Psi| = 81$ possible discrete states. For the five-microphone and three-feature segmentation experiments described in [177] there are $|\Psi| = 2592$ possible discrete states, and the order of complexity is $O(|\Psi|^2 T)$, which even in an offline inference situation makes adding additional microphones or acoustic features difficult. In order to make online inference more feasible, we would like to implement a Rao-Blackwellised particle filter (RBPF) [178], which is a very efficient inference methodology as the number of discrete states grows increasingly large. Unfortunately, optimizing a MAP criterion in an online environment using RBPF would require the incorporation of MCMC steps to obtain a "local MAP estimate," which is currently an unsolved problem.
Query-by-example retrieval
Another possible enhancement would be to explicitly extend the query-by-example
system proposed in Chapter 4 to the query-by-humming domain, i.e., users can search the
database with their voice. The current QBE retrieval system will work for human voice
generated queries under certain conditions, but in dealing with natural and environmental
sounds, there are certain acoustic qualities the human voice cannot imitate accurately, if
at all [179]. Thus, beginning with the feature set of Chapter 2 we would hope to establish
through extensive user studies, those features of environmental sounds that the human
voice can accurately imitate, and those which cannot be mimicked. We would also like
to investigate whether cross mappings occur, e.g., users are very likely to use pitch to
mimic a sound quality that is actually due to spectral centroid. We then hope to use this
information to improve the user experience for oral retrieval of natural and environmental
sounds. Further study of our distortion-aware models [180] for accommodating scale/level
distortion in acoustic feature trajectories might also help improve retrieval performance for
oral queries.
Semantic retrieval
A drawback of the ontological framework proposed in Chapter 5 is that computational
complexity grows linearly in the number of sounds and semantic tags in the database. One
possible solution would be to augment the hybrid network with a recursive clustering scheme
such as that described in Section 4.4. We have successfully tested this approach in [153],
where each cluster becomes a node in the hybrid network, and all sounds assigned to each
cluster are connected to the appropriate cluster node by a link of weight zero. These cluster
nodes are then linked to the nodes representing semantic tags. While this approach limits
the number of sound-tag weights that need to be learned, the additional cluster nodes and
links tend to cancel out this savings. Furthermore, when a new sound is added to the
network we still must compute its similarity to all sounds previously in the network (this
is also true for new tags). For sounds, it might be possible to represent each sound file
and sound cluster as a Gaussian distribution, and then use symmetric Kullback-Leibler
divergence to calculate the link weights connecting new sounds added to the network to
pre-existing clusters. Unfortunately, this approach would not extend to the concept nodes
in the hybrid network as we currently know of no technique for representing a semantic
tag as a Gaussian, even though the WordNet similarity metric could be used to cluster
the tags. Perhaps a technique where a fixed number of sound/tag nodes are sampled to
have link weights computed each time a new sound/tag is added to the network could help
scale the network. A link weight pruning approach might also help improve computational
complexity.
Another improvement to the hybrid network structure connecting semantics and sound
might be achieved by exploring different link weight learning techniques. Currently, we use
a "divide and conquer" approach where the three types of weights (sound-sound, conceptconcept, sound-concept) are learned independently. This could lead to scaling issues, especially if the network is expanded to contain different node types. One possible approach
to overcome these scaling issues could be to learn a dissimilarity function from ranking
data [181]. For example, using the sound similarity, user preference, and WordNet similarity data to find only rankings between words and sounds of the form "A is more like B than
C is D", we can learn a single dissimilarity function for the entire network that preserves
this rank information.
Applications
While several applications that could benefit from the techniques explored in this dissertation were described in Section 1.2, the proliferation of mobile computing is introducing
new uses for environmental audio databases. One idea that has recently risen to prominence
is "listening to maps". In this case, sounds are displayed on a map, based on the location in
which they were recorded [182-184]. Ideally users can then listen to points on the map and
gain a sense of the activity in a given location that is not available from visual information
alone. It is now possible for GPS-enabled smart-phones to record sound events of interest
using the onboard microphone and immediately upload the audio clip along with its location to a central server, where it can then be displayed on one of these "sound maps". As
this type of interaction grows in popularity the need for automatic segmentation onboard
the mobile device, along with indexing and retrieval frameworks on the database end will
be necessary.
The rise in mobile computing has also led to an explosion in the number of augmented
reality (AR) systems, where the real-world environment is mixed with computer generated
information [185]. The automatic environmental sound annotation system proposed in this
dissertation could be used as a mobile phone application for the deaf (or those who are
temporarily deafened by using in-ear headphones to listen to music as they explore the world), where the sound observed by the mobile phone microphone is automatically annotated and those descriptions are displayed on the mobile phone screen to provide the user with a computer-generated description of the sound activity around them.
REFERENCES

[1] V. Bush, "As we may think," Atlantic Monthly, July 1945.
[2] D. P. W. Ellis and K. Lee, "Minimal-impact audio-based personal archives," in First ACM Workshop on Continuous Archiving and Recording of Personal Experiences (CARPE-04), New York, October 2004.
[3] B. Truax, Acoustic Communication, Ablex Publishing, Norwood, NJ, 1984.
[4] R. M. Schafer, The Soundscape: Our Sonic Environment and the Tuning of the World, Destiny Books, Rochester, VT, 1994.
[5] A. S. Bregman, Auditory Scene Analysis: The Perceptual Organization of Sound, MIT Press, Cambridge, MA, 1990.
[6] A. M. Liberman, "On finding that speech is special," American Psychologist, vol. 37, no. 2, pp. 148-167, 1982.
[7] J. A. Ballas and J. H. Howard, "Interpreting the language of environmental sounds," Environment and Behavior, vol. 19, no. 1, pp. 91-114, 1987.
[8] L. D. Rosenblum, "Perceiving articulatory events: Lessons for an ecological psychoacoustics," in Ecological Psychoacoustics, J. G. Neuhoff, Ed. Elsevier Academic Press, 2004.
[9] J. A. Ballas, "Common factors in the identification of an assortment of brief everyday sounds," J. Exp. Psychol. Hum. Percept. Perform., vol. 19, no. 2, pp. 250-267, 1993.
[10] A. Cummings, R. Ceponienee, A. Koyama, A. P. Saygin, J. Townsend, and F. Dick,
"Auditory semantic networks for words and natural sounds," Brain Research, vol.
1115, no. 1, pp. 92-107, 2006.
[11] L. Lapin, How to Get Anything on Anybody, Auburn Wolfe, San Francisco, CA, 1983.
[12] A. Harma, M. F. McKinney, and J. Skowronek, "Automatic Surveillance of the acoustic
activity in our living environment," Amsterdam, The Netherlands, July 2005, IEEE
International Conference on Multimedia and Expo.
[13] R. Radhakrishnan, A. Divakaran, and P. Smaragdis, "Audio analysis for surveillance
applications," in Proceedings of the IEEE Wokshop on the Applications of Signal
Processing to Audio and Acoustic (WASPAA), New Paltz, NY, 2005.
[14] R. Radhakrishnan, A. Divakaran, and P. Smaragdis, "Systematic Acquisition of Audio classes for elevator surveillance," SPIE Image and Video Communications
and
Processing, vol. 5685, pp. 64-71, 2005.
[15] S. Chu, S. Narayanan, and C. C. J. Kuo, "Environmental Sound Recognition with
Time-Frequency Audio Features," IEEE Transactions on Audio, Speech and Language
Processing, vol. 17, no. 6, pp. 1142-1158, 2009.
[16] C. Zieger, C. Brutti, and P. Savizer, "Acoustic Based Surveillance System for Intrusion
detection," in Sixth IEEE International Conference on Advanced Video and Signal
Based Surveillance (AVSS), Genoa, Italy, 2009, 314-319.
[17] X. Zhuang, J. Huang, G. Potamianos, and M. Hasegawa-Johnson, "Acoustic fall detection using Gaussian mixture models and GMM supervectors," in IEEE
International
Conference on Acoustics, Speech and Signal Processing, Taipei, Taiwan, 2009.
[18] X. Wu, H. Gong, P. Chen, Z. Zhong, and Y. Xu, "Surveillance Robot Utilizing Video
and Audio Information," Journal of Intelligent and Robotic Systems, vol. 55, no. 4,
pp. 403-421, 2009.
[19] A. Dielmann and S. Renals, "Automatic Meeting Segmentation Using Dynamic
Bayesian Networks," IEEE Trans, on Multimedia, vol. 9, no. 1, pp. 25-36, 2007.
[20] D. Zhang, D. Gatica-Perez, S. Bengio, and I. McCowan, "Modeling Individual and
Group Actions in Meetings With Layered HMMs," IEEE Trans, on Multimedia, vol.
8, no. 3, pp. 509-520, 2006.
[21] J. Gemmell, G. Bell, R. Lueder, S. Drucker, and C. Wong, "MyLifeBits:Fulfilling the
Memex vision," in Proceedings of ACM Multimedia, Juan Les Pins, France, 2002.
[22] J. Gemmell, G. Bell, and R. Lueder, "MyLifeBits:a personal database for everything,"
Commun. ACM, vol. 49, no. 1, pp. 88-95, 2006.
[23] D. Ellis and K. Lee, "Accessing minimal-impact personal audio archives," IEEE Multimedia, vol. 13, no. 4, pp. 30-38, 2006.
[24] B. Krause, Into a Wild Sanctuary, Heyday Books, Berkeley, CA, 1998.
[25] H. Slabbekoorn and A. den Boer-Visser, "Cities change the songs of birds," Current Biology, vol. 16, no. 23, pp. 2326-2331, 2006.
[26] K. M. Parris and A. Schneider, "Impacts of traffic noise and traffic volume on birds of
roadside habitats," Ecology and Society, vol. 14, no. 1, 2009.
[27] K. M. Parris, M. Velik-Lord, and J. M. A. North, "Frogs call at a higher pitch in traffic
noise," Ecology and Society, vol. 14, no. 1, pp. article 25, 2009.
[28] G. Jones, "Sensory Ecology: Noise annoys foraging bats," Current Biology, vol. 18,
no. 23, pp. 1098-1100, 2008.
[29] J. R. Barber, K. R. Crooks, and K. M. Fristrup, "The costs of chronic noise exposure
for terrestrial organisms," Trends in Ecology and Evolution, vol. In Press, pp. 1-10,
2009.
[30] C D . Francis, C. P. Ortega, and A. Cruz, "Noise pollution changes avian communities
and species interactions," Current Biology, vol. 19, no. 16, pp. 1415-1419, 2009.
[31] H. Slabbekoorn and W. Halfwerk, "Behavioural Ecology: Noise annoys at community
level," Current Biology, vol. 19, no. 16, pp. 693-695, 2009.
[32] P. Lercher and B. Schulte-Forktamp, "The relevance of soundscape research to the
assessment of noise annoyance at the community level," in the Eighth International
Congress on Noise as a Public Health Problem, Rotterdam, The Netherlands, 2003.
[33] W. Yang and J. Kang, "Acoustic comfort evaluation in urban open public spaces,"
Applied Acoustics, vol. 66, no. 2, pp. 211-229, 2005.
[34] S. Feld, Sound and sentiment: birds, weeping, poetics, and song in Kaluli expression, University of Pennsylvania Press, Philadelphia, PA, 1990.
[35] P. Hedfors and P. Grahn, "Soundscapes in urban and rural planning and design - a
brief communication of a research project," in Northern Soundscapes: Yearbook of
Soundscape Studies, R.M. Schafer and H. Jarviluoma, Eds., vol. 1, pp. 67-82. 1998.
[36] P. Hedfors and P.G. Berg, "Site interpretation by skilled listeners - methods for
communicating soundscapes in landscape architecture and planning," in Soundscape
Studies and Methods, H. Jarviluoma and G. Wagstaff, Eds., pp. 91-114. Helsinki, 2002.
[37] P. Hedfors and P. G. Berg, "The Sounds of Two Landscape Settings: auditory concepts
for physical planning and design," Landscape Research, vol. 28, pp. 245-263, 2003.
[38] B. Blesser and L. R. Salter, Spaces speak, are you listening?, The MIT Presss, 2007.
[39] F. Dhomont, "Acousmatic update," Contact!, vol. 8, no. 2, 1995.
[40] P. Schaeffer, Introduction
a la musique concrete, Seuil, Paris, France, 1950.
[41] A. Misra, P. R. Cook, and G. Wang, "Musical Tapestry: Re-composing natural
sounds," in Proceedings of the International Computer Music Conference
(ICMC),
New Orleans, LA, USA, 2006.
[42] D. Birchfield, N. Mattar, and H. Sundaram, "Design of a generative model for soundscape creation," in Proceedings of the International Computer Music Conference
(ICMC), Barcelona, Spain, 2005.
[43] A. Valle, V. Lombardo, and M. Schirosa, "A graph-based system for the dynamic
generation of soundscapes," in Proceedings of the 15th International Conference on
Auditory Display, Copenhagen, 2009, pp. 217-224.
[44] A. Valle, M. Schirosa, and V. Lombardo, "A framework for soundscape analysis and
re-synthesis," in Proceedings of the Sound and Music Computing Conference, Porto,
Portugal, 2009.
[45] D. Birchfield, T. Ciufo, and H. Thornburg, "Sound and Interaction in K-12 mediated
education," in ICMC, New Orleans, LA, USA, 2006.
[46] D. F. Rosenthal and H. G. Okuno, Computational auditory scene analysis, Lawrence Erlbaum Associates, Mahwah, NJ, 1998.
[47] S. Roweis, "On Microphone Separation," in Advances in Neural Information Processing Systems, Vancouver, BC, 2001.
[48] A.S. Master, "Bayesian Two Source Modeling for Separation of N Sources from Stereo
Signals," in IEEE International Conference on Acoustics Speech and Signal Processing
(ICASSP), Montreal, 2004.
[49] A. S. Master, Stereo Music Source Separation via Bayesian Modeling.,
Stanford University, 2006.
Ph.D. thesis,
[50] D. Wang and G. J. Brown, Computational auditory scene analysis: principles, algorithms, and applications, Wiley Interscience, Hoboken, NJ, 2006.
[51] R. Andre-Obrecht, "A new statistical approach for automatic segmentation of continuous speech signals," IEEE Transactions on Acoustics, Speech and Signal Processing,
vol. 36, no. 1, pp. 29-40, 1988.
[52] S. Chen and P. Gopalakrishnan, "Speaker environment and channel change detection and clustering via the Bayesian information criterion," 1998, Proceedings of the
DARPA Broadcast News Transcription and Understanding Workshop.
[53] D. Kimber and L. Wilcox, "Acoustic segmentation for audio browsers," in Proc. Interface Conf., Sydney, Australia, July 1996.
[54] G. Tzanetakis and P. Cook, "Multifeature audio segmentation for browsing and annotation," in Proceedings of the IEEE Wokshop on the Applications of Signal Processing
to Audio and Acoustic (WASPAA), New Paltz, NY, October 1999.
[55] A. T. Cemgil, Bayesian music transcription.,
Nijmegen, 2004.
Ph.D. thesis, Radboud University of
[56] H. Thornburg, Detection and Modeling of Transient Audio Signals with Prior Information, Ph.D. thesis, Stanford University, 2005.
[57] H. Thornburg, R. Leistikow, and J. Berger, "Melody extraction and musical onset
detection via probabilistic models of STFT peak data," IEEE Transactions on Audio,
Speech, and Language Processing, vol. 15, no. 4, pp. 1257-1272, 2007.
[58] E. Scheirer and M. Slaney, "Construction and Evaluation of a Robust Multifeature
Speech/Music Discriminator," in IEEE International Conference on Acoustics Speech
and Signal Processing (ICASSP), Munich, Germany, 1997.
[59] L. Lu, H. J. Zhang, and H. Jiang, "Content analysis for audio classification and
segmentation," IEEE Transactions on Speech and Audio Processing, vol. 10, no. 7, pp.
504-516, 2002.
[60] L. Lu, R. Cai, and A. Hanjalic, "Audio Elements based Auditory Scene Segmentation," in IEEE International Conference on Acoustics, Speech and Signal Processing,
Toulouse, France, 2006.
[61] P. J. Brockwell and R. A. Davis, Introduction to Time Series and Forecasting, Springer,
2003.
[62] F. Gustaffsson, Adaptive Filtering and Change Detection, Wiley, New York, 2001.
[63] M. Basseville, "Detecting changes in signals and systems - a survey," Automatica, vol. 24, pp. 309-326, 1988.
[64] S. E. Tranter and D. A. Reynolds, "An overview of automatic speaker diarization
systems," IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no.
5, pp. 1557-1565, 2006.
[65] M. A. Casey, R. Veltkamp, M. Goto, M. Leman, C. Rhodes, and M. Slaney, "ContentBased Music Information Retrieval: Current Directions and Future Challenges," Proceedings of the IEEE, vol. 96, no. 4, pp. 668-696, 2008.
[66] C. J. Van Rijsbergen, Information
Retrieval, Butterwoths, London, 1979.
[67] A. Wang, "The Shazam music recognition service," Commun. ACM, vol. 49, no. 8, pp. 44-48, 2006.
[68] J. P. Ogle and D. P. W. Ellis, "Fingerprinting to identify repeated sound events in longduration personal audio recordings," in IEEE International Conference on Acoustics,
Speech and Signal Processing, Honolulu, Hawaii, 2007.
[69] M. Slaney and M. Casey, "Locality-Sensitive Hashing for Finding Nearest Neighbors,"
IEEE Signal Processing Magazine, vol. 25, no. 2, pp. 128-131, 2008.
[70] C. Cotton and D. P. W. Ellis, "Finding similar acoustic events using matching pursuit
and locality-sensitive hashing," in Proceedings of the IEEE Wokshop on the Applications of Signal Processing to Audio and Acoustic (WASPAA), 2009, pp. 125-128.
[71] A. S. Durey and M. A. Clements, "Melody Spotting Using Hidden Markov Models,"
in Int. Symposium on Music Information Retrieval (ISMIR), Bloomington, Indiana,
2001.
[72] J. Shifrin, B. Pardo, C. Meek, and W. Birmingham, "HMM-based musical query retrieval," in Proceedings of the 2nd A CM/IEEE-CS joint conference on Digital libraries,
Portland, Oregon, 2002.
[73] S. Shalev-Shwartz, S. Dubnov, N. Friedman, and Y. Singer, "Robust temporal and
spectral modeling for query By melody," in Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information
retrieval,
Tampere, Finland, 2002.
[74] M. Slaney, "Semantic-Audio Retrieval," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Orlando, Florida, 2002.
[75] B. Whitman and D. Ellis, "Automatic Record Reviews," in Int. Symposium on Music Information Retrieval (ISMIR), 2004, pp. 470-477.
[76] D. Turnbull, L. Barrington, D. Torres, and G. Lanckriet, "Semantic annotation and retrieval of music and sound effects," IEEE Transactions on Audio, Speech and Language
Processing, vol. 16, no. 2, pp. 467-476, 2008.
[77] G. Chechik, E. Le, M. Rehn, S. Bengio, and D. Lyon, "Large-scale content-based
audio retrieval from text queries," in Proc. of the ACM International Conferennce on
Multimedia Information Retrieval, Vancouver, BC, 2008.
[78] P. Cano, M. Koppenberger, S. Le Groux, J. Ricard, P. Herrera, and N. Wack, "Nearestneighbor generic sound classification with a WordNet-based taxonomy," in Proc. 116th
AES Convention, Berlin, Germany, 2004.
[79] "WordNet," h t t p : / / w o r d n e t . p r i n c e t o n . e d u / .
[80] E. Martinez, O. Celma, M. Sordo, B. de Jong, and X. Serra, "Extending the folksonomies of freesound.org using content- based audio analysis," in Sound and Music
Computing Conference, Porto, Portugal, 2009.
[81] D. Li, I. K. Sethi, N. Dimitrova, and T. McGee, "Classification of general audio data
for content-based retrieval," Pattern Recognition Letters, vol. 22, no. 5, pp. 533-544,
2001.
[82] J. Foote, "A similarity measure for automatic audio classification," in In Proc.AAAI
1997 Spring Symposium on Intelligent Integration and Use of Text, Image, Video, and
Audio Corpora., 1997.
[83] G. Tzanetakis and P. Cook, "Musical Genre classification of audio signals," IEEE
Transactions on Speech and Audio Processing, vol. 10, no. 5, pp. 293-302, 2002.
[84] E. Wold, T. Blum, D. Keislar, and J. Wheaton, "Content-based Classification, Search,
and Retrieval of Audio," IEEE Multimedia, vol. 3, no. 3, pp. 27-36, 1996.
[85] M. F. McKinney and J. Breebaart, "Features for Audio and Music Classification," in Proceedings of the 4th International Conference on Music Information Retrieval, Baltimore, MD, October 2003.
[86] P. Hanna, N. Louis, M. Desainte-Catherine, and J. Benois-Pineau, "Audio Features
for Noisy Sound Segmentation," in Proceedings of the 5th International Conference on
Music Information Retrieval, Barcelona, Spain, October 2004.
[87] M. Goodwin and J. Laroche, "Audio segmentation by feature-space clustering using
linear discriminant analysis and dynamic programming," in Proceedings of the IEEE
Wokshop on the Applications of Signal Processing to Audio and Acoustic
(WASPAA),
New Paltz, NY, October 2003.
[88] A. V. Oppenheim and R. W. Schafer, Discrete-time
1999.
Signal Processing, Prentice Hall,
[89] S. S. Stevens, J. Volkman, and E. Newman, "A scale for the measurement of the
psychological magnitude of pitch," Journal of the Acoustical Society of America, vol.
8, pp. 185-190, 1937.
[90] J. O. Smith III and J. S. Abel, "Bark and ERB Bilinear Transforms," IEEE Transactions on Speech and Audio Processing, vol. 7, no. 6, pp. 697-708, 1999.
[91] E. Zwicker and H. Fastl, Psychoacoustics: Facts and Models, Springer-Verlag, Berlin, 1990.
[92] M. Mathews, "What is Loudness?," in Music, Cognition, and Computerized Sound:
An Introduction to Psychoacoustics, P. Cook, Ed., pp. 71-78. MIT Press, Cambridge,
MA, 1999.
[93] B. C. J. Moore, B. R. Glasberg, and T. Baer, "A model for the prediction of thresholds,
loudness, and partial loudness," Journal of the Audio Engineering Society, vol. 45, no.
4, pp. 224-240, 1997.
[94] J. Pierce, "Introduction to Pitch Perception," in Music, Cognition, and Computerized
Sound: An Introduction to Psychoacoustics, P. Cook, Ed., pp. 57-70. MIT Press,
Cambridge, MA, 1999.
[95] A. de Cheveigne and H. Kawahara, "YIN, a fundamental frequency estimator for
speech and music," The Journal of the Acoustical Society of America, vol. 111, no. 4,
pp. 1917-1930, 2002.
[96] E. Terhardt, G. Stoll, and M. Seewann, "Algorithm for extraction of pitch and pitch
salience from complex tonal signals," Journal of the Acoustical Society of America,
vol. 71, no. 3, pp. 679-688, 1982.
[97] R. B. Smith and T. Murail, "An interview with Tristan Murail," Computer Music Journal, vol. 24, no. 1, pp. 11-19, 2000.
[98] J. L. Goldstein, "An optimum processor theory for the central formation of the pitch
of complex tones," Journal of the Acoustical Society of America, vol. 54, no. 6, pp.
1496-1516, 1973.
[99] W. Hess, Pitch Determination
of Speech Signals, Springer-Verlag, Berlin, 1983.
[100] H. Thornburg and R. J. Leistikow, "A new probabilistic spectral pitch estimator:
Exact and MCMC-approximate strategies," in Lecture Notes in Computer Science
3310, U. Kock Wiil, Ed. Springer-Verlag, 2005.
[101] T. Abe, T. Kobayashi, and S. Imai, "Harmonics tracking and pitch extraction based
on instantaneous frequency," in IEEE International Conference on Acoustics, Speech
and Signal Processing, Detroit, MI, 1995, pp. 756-759.
[102] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Prentice-Hall,
Englewood Cliffs, NJ, 1978.
[103] R. J. McAulay and T. F. Quatieri, "Speech Analysis/Synthesis Based on a Sinusoidal
Representation," IEEE Transactions on Acoustics, Speech and Signal Processing, vol.
34, no. 4, pp. 744-754, 1986.
[104] F. Villavicencio, A. Robel, and X. Rodet, "Improving LPC spectral envelope extraction of voiced speech by true-envelope estimation," in IEEE International
Conference
on Acoustics, Speech and Signal Processing, 2006, pp. 869-872.
[105] A. Spanias, "Speech coding: a tutorial review," Proceedings of the IEEE, vol. 82, no.
10, pp.1541-1582, 1994.
[106] L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in
Speech Recognition," Proc. IEEE, vol. 77, no. 2, pp. 257-286, Feb. 1989.
[107] H. Hermansky, "Perceptual linear predictive (PLP) analysis of speech," Journal of
the Acoustical Society of America, vol. 87, no. 4, pp. 1738-1752, 1990.
[108] H. Hermansky and N. Morgan, "RASTA processing of speech," IEEE Transactions on Speech and Audio Processing, vol. 2, no. 4, pp. 578-589, 1994.
[109] E. T. S. I. standard document, "Speech processing, transmission and quality aspects (STQ); distributed speech recognition; front-end feature extraction algorithm;
compression algorithms," ETSI ES 201 108 v1.1.3 (2003-09), 2003.
[110] B. S. Atal, "Effectiveness of linear prediction characteristics of the speech wave for
automatic speaker identification," Journal of the Acoustical Society of America, vol.
55, no. 6, pp. 1304-1312, 1974.
[111] L. R. Rabiner and B. H. Juang, "Recognition of isolated digits using hidden Markov
models and continuous mixture densities," AT&T Tech. Journal, vol. 64, no. 6, pp.
1211-1234, 1985.
[112] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on
Acoustics, Speech and Signal Processing, vol. 28, no. 4, pp. 357-366, 1980.
[113] B. Logan, "Mel frequency cepstral coefficients for music modeling," in Int. Symposium on Music Information Retrieval (ISMIR), 2000.
[114] H. G. Kim, N. Moreau, and T. Sikora, MPEG-7 Audio and beyond: audio content
indexing and retrieval, John Wiley & Sons Ltd., West Sussex, England, 2005.
[115] D. G. Childers, D. P. Skinner, and R. C. Kemerait, "The Cepstrum: A Guide to
Processing," Proceedings of the IEEE, vol. 65, no. 10, pp. 1428-1443, 1977.
[116] T. Fujishima, "Realtime chord recognition of musical sound: A system using common
lisp music," in Proc. Int. Computer Music Conf., Beijing, China, 1999.
[117] R. N. Shepard, "Demonstrations of circular components of pitch," Journal of the Audio Engineering Society, vol. 31, pp. 641-649, 1983.
[118] K. Lee, "Automatic Chord Recognition from audio using enhanced pitch class profile,"
in Proc. Int. Computer Music Conf., New Orleans, LA, 2006.
[119] J. H. Jensen, M. G. Christensen, D. Ellis, and S. H. Jensen, "A Tempo-Insensitive Distance Measure for Cover Song Identification based on Chroma Features," in IEEE International Conference on Acoustics, Speech and Signal Processing, Las Vegas, Nevada,
2008.
[120] M. F. McKinney, D. Moelants, M. E. P. Davies, and A. Klapuri, "Evaluation of audio
beat tracking and music tempo extraction algorithms," J. of New Music Res., vol. 36,
no. 1, pp. 1-16, 2007.
[121] D. P. W. Ellis, "Beat tracking by dynamic programming," J. of New Music Res., vol.
36, no. 1, pp. 51-60, 2007.
[122] Y. Wang, Z. Liu, and J. Huang, "Multimedia content analysis-using both audio and
visual clues," IEEE Signal Processing Magazine, vol. 17, no. 6, pp. 12-36, 2000.
[123] D. L. Donoho and M. Elad, "Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization," Proceedings of the National Academy of Sciences, USA, vol. 100, no. 5, pp. 2197-2202, 2003.
[124] M. Abe and J. O. Smith III, "Design criteria for simple sinusoidal parameter estimation based on quadratic interpolation of FFT magnitude peaks," in AES 117th
Convention, San Francisco, CA, 2004.
[125] R. Cai, L. Lu, and A. Hanjalic, "Unsupervised content discovery in composite audio,"
in Proceedings of ACM Multimedia, Singapore, 2005, pp. 628-637.
[126] K. P. Murphy, Dynamic Bayesian Networks: Representation, Inference and Learning, Ph.D. thesis, University of California-Berkeley, 2002.
[127] R. H. Shumway and D. S. Stoffer, "Dynamic linear models with switching," Journal of the American Statistical Association, vol. 86, no. 415, pp. 763-769, 1991.
[128] C. J. Kim, "Dynamic linear models with Markov-switching," Journal of Econometrics, vol. 60, pp. 1-22, 1994.
[129] Zoubin Ghahramani and Geoffrey E. Hinton, "Switching State-Space Models," Tech.
Rep., 6 King's College Road, Toronto M5S 3H5, Canada, 1998.
[130] V. Pavlovic, J. M. Rehg, and T. Cham, "A dynamic Bayesian network approach
to tracking learned switching dynamic models," in Proceedings of the International
Workshop on Hybrid Systems, Pittsburgh, PA, 2000.
[131] Y. Bar-Shalom, X. R. Li, and T. Kirubarajan, Estimation with Applications to Tracking and Navigation, Wiley Interscience, New York, 2001.
[132] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann Publishers, Inc., San Mateo, CA, 1988.
[133] B. D. O. Anderson and J. B. Moore, Optimal Filtering, Prentice-Hall, Englewood Cliffs, NJ, 1979.
[134] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete
data via the EM algorithm," Journal of the Royal Statistical Society B, vol. 39, no. 1,
pp. 1-38, 1977.
[135] R. M. Neal and G. E. Hinton, "A new view of the EM algorithm that justifies
incremental and other variants," in Learning in Graphical Models, M. Jordan, Ed., pp.
355-368. Kluwer Academic Publishers, 1998.
[136] Z. Ghahramani, "Learning Dynamic Bayesian Networks," Lecture Notes in Computer
Science, vol. 1387, pp. 168-197, 1998.
[137] R. H. Shumway and D. S. Stoffer, "An approach to time series smoothing and forecasting using the EM algorithm," J. Time Series Analysis, vol. 3, no. 4, pp. 253-264,
1982.
[138] A. Papoulis, Probability, Random Variables and Stochastic Processes, McGraw Hill, New York, 1984.
[139] A. Y. Ng, M. Jordan, and Y. Weiss, "On spectral clustering: analysis and an algorithm," in Advances in Neural Information Processing Systems, Vancouver, BC, 2002.
[140] L. Zelnik-Manor and P. Perona, "Self-tuning spectral clustering," in Advances in Neural Information Processing Systems, Whistler, BC, 2004.
[141] H. Akaike, "A new look at the statistical model identification," IEEE Trans. on Automatic Control, vol. 19, no. 6, Mar. 1974.
[142] A. Savitzky and M. J. Golay, "Smoothing and differentiation of data by simplified
least squares procedures," Analytical Chemistry, vol. 36, no. 8, pp. 1627-1639, 1964.
[143] B. H. Juang and L. R. Rabiner, "A probabilistic distance measure for hidden Markov
models," AT&T Tech. Journal, vol. 64, no. 2, pp. 1251-1270, 1985.
[144] J. Xue, G. Wichern, H. Thornburg, and A. S. Spanias, "Fast query-by-example
of environmental sounds via robust and efficient cluster-based indexing," in IEEE
International Conference on Acoustics, Speech and Signal Processing, Las Vegas, NV,
2008.
[145] Gideon Schwarz, "Estimating the dimension of a model," The Annals of Statistics, vol. 6, no. 2, pp. 461-464, 1978.
[146] Y. Yang, "Can the strengths of AIC and BIC be shared? A conflict between model
identification and regression estimation," Biometrika, vol. 92, no. 4, pp. 937-950,
2005.
[147] J. S. Garofolo, C. G. P. Auzanne, and E. M. Voorhees, "The TREC Spoken Document
Retrieval Track: A Success Story," in Proc. of the Eighth Text REtrieval Conference
(TREC), Gaithersburg, Maryland, 1999.
[148] L. Barrington, A. Chan, D. Turnbull, and G. R. G. Lanckriet, "Audio Information
Retrieval Using Semantic Similarity," in IEEE International Conference on Acoustics,
Speech and Signal Processing, Honolulu, Hawaii, 2007.
[149] D. Turnbull, R. Liu, L. Barrington, and G. Lanckriet, "A game-based approach for
collecting semantic annotations of music," in Int. Symposium on Music
Information
Retrieval (ISMIR), Vienna, 2007.
[150] M. Mandel and D. P. W. Ellis, "A web-based game for collecting music metadata,"
J. of New Music Res., vol. 37, no. 2, pp. 151-165, 2008.
[151] S. Kim, S. Narayanan, and S. Sundaram, "Acoustic topic model for audio information
retrieval," in IEEE Workshop on Applications of Signal Processing to Audio and
Acoustics, New Paltz, NY, 2009, pp. 37-40.
[152] Christiane Fellbaum, WordNet: an electronic lexical database, MIT Press, Cambridge,
MA, 1998.
[153] G. Wichern, H. Thornburg, and A. Spanias, "Unifying Semantic and Content-based
Approaches for Retrieval of Environmental Sounds," in Proceedings of the IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics (WASPAA), New
Paltz, NY, 2009.
[154] B. Mechtley, G. Wichern, H. Thornburg, and A. S. Spanias, "Combining Semantic,
Social, and Acoustic Similarity for Retrieval of Environmental Sounds," in IEEE
International Conference on Acoustics, Speech and Signal Processing, 2010.
[155] "Freesound," h t t p : / / w w w . f r e e s o u n d . o r g .
[156] S. Nirenburg and V. Raskin, Ontological Semantics,
MIT Press, Cambridge, 2004.
[157] M. Uschold and M. Gruninger, "Ontologies: Principles, methods, and applications,"
Knowledge Engineering Review, vol. 11, no. 2, pp. 93-136, 1996.
[158] H. Liu and P. Singh, "ConceptNet: A practical commonsense reasoning toolkit," BT
Technology Journal, vol. 22, no. 4, pp. 211-226, 2004.
[159] J. D. Novak and A. J. Canas, "The Theory Underlying Concept Maps and How
to Construct and Use Them," Tech. Rep. IHMC CmapTools 2006-01 Rev 01-2008,
Florida Institute for Human and Machine Cognition (IHMC), 2008.
[160] G. Lakoff, Women, Fire, and Dangerous Things, University of Chicago Press, Chicago, 1990.
[161] A. M. Collins and M. R. Quillian, "Retrieval time from semantic memory," Journal of Verbal Learning and Verbal Behavior, vol. 8, no. 2, pp. 240-247, 1969.
[162] W. F. Brewer and G. V. Nakamura, "The nature and functions of schemas," in
Handbook of Social Cognition, R. S. Wyer Jr. and T. K. Srull, Eds., vol. I, pp. 119-160.
Erlbaum, Hillsdale, NJ, 1984.
[163] D. Fensel, F. van Harmelen, I. Horrocks, D. L. McGuinness, and P. F. Patel-Schneider,
"OIL: an ontology infrastructure for the semantic web," IEEE Intelligent Systems, vol.
16, no. 2, pp. 38-45, 2001.
[164] J. Jiang and D. Conrath, "Semantic similarity based on corpus statistics and lexical
taxonomy," in International Conference on Research in Computational
Linguistics
(ROCLING X), Taiwan, 1997, pp. 19-33.
[165] A. Budanitsky and G. Hirst, "Semantic distance in WordNet: an experimental,
application-oriented evaluation of five measures," in Workshop on WordNet and Other
Lexical Resources, Second meeting of the North American Chapter of the Association
for Computational Linguistics, Pittsburgh, PA, 2001.
[166] O. Udrea, D. Yu, E. Hung, and V. S. Subrahmanian, "Probabilistic ontologies and
relational databases," Lecture Notes in Computer Science, vol. 3760, pp. 1-17, 2005.
[167] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms (2nd ed.), MIT Press and McGraw-Hill, Cambridge, 2001.
[168] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, New York, 2004.
[169] T. Pedersen, S. Patwardhan, and J. Michelizzi, "WordNet::Similarity - Measuring the
Relatedness of Concepts," in AAAI-04, Cambridge, MA, 2004, pp. 1024-1025, AAAI
Press.
[170] R. Mandala, T. Tokunaga, and H. Tanaka, "The use of WordNet in information
retrieval," in Proceedings of the Workshop on Usage of WordNet in Natural Language
Processing Systems, Montreal, 1998, pp. 31-37.
[171] R. L. Cilibrasi and P. M. B. Vitanyi, "The Google similarity distance," IEEE Trans. on Knowledge and Data Engineering, vol. 19, no. 3, pp. 380-383, 2007.
[172] W. B. Kleijn and K. K. Paliwal, Speech Coding and Synthesis, Elsevier Science, New York, 1995.
[173] C. Dodge and T. A. Jerse, Computer music: synthesis, composition, and performance, Macmillan Library Reference, 1985.
[174] A. Fink, H. Thornburg, and A. S. Spanias, "An individually tunable perfect reconstruction filterbank for banded waveguide synthesis," in IEEE International
Conference on Acoustics, Speech and Signal Processing, Dallas, TX, 2010.
[175] C. Verron, G. Pallone, M. Aramaki, and R. Kronland-Martinet, "Controlling a spatialized environmental sound synthesizer," in Proceedings of the IEEE Workshop on the
Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY,
2009, pp. 321-324.
[176] J. H. McDermott, A. J. Oxenham, and E. P. Simoncelli, "Sound texture synthesis via
filter statistics," in Proceedings of the IEEE Workshop on the Applications of Signal
Processing to Audio and Acoustics (WASPAA), New Paltz, NY, 2009, pp. 297-300.
[177] G. Wichern, H. Thornburg, and A. Spanias, "Multi-channel audio segmentation for
continuous observation and archival of large spaces," in IEEE International
Conference
on Acoustics, Speech and Signal Processing, Taipei, Taiwan, 2009.
[178] A. Doucet, N. de Freitas, K. Murphy, and S. Russell, "Rao-Blackwellised particle filtering for dynamic Bayesian networks," in Proceedings of the Conference on Uncertainty in Artificial Intelligence, Stanford, CA, 2000.
[179] K. Dobson, B. Whitman, and D. P. W. Ellis, "Learning auditory models of machine
voices," in IEEE WASPAA, New Paltz, NY, 2005.
[180] G. Wichern, J. Xue, H. Thornburg, and A. Spanias, "Distortion-aware query by
example for environmental sounds," in Proceedings of the IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics (WASPAA), New
Paltz, NY,
2007.
[181] H. Ouyang and A. Gray, "Learning dissimilarities by ranking: from SDP to QP," in
Proc. of the International Conference on Machine Learning (ICML), Helsinki, 2008.
[182] "Wild Sanctuary," h t t p : / / w w w . w i l d s a n c t u a r y . c o m .
[183] "Sound Around You," h t t p : //www. soundaroundyou. com.
[184] "Soundwalks," h t t p : //www. soundwalks. org.
[185] G. Papagiannakis, G. Singh, and N. Magnenat-Thalmann, "A survey of mobile and
wireless technologies for augmented reality systems," Computer Animation and Virtual
Worlds, vol. 19, no. 1, pp. 3-22, 2008.
APPENDIX A
INSTITUTIONAL REVIEW BOARD APPROVAL DOCUMENTS
Discovering Sound Relevancy Rankings
Consent Form - 1/3
INFORMED CONSENT FORM
MINIMAL RISK
ARIZONA STATE UNIVERSITY
DISCOVERING SOUND RELEVANCY RANKINGS
INTRODUCTION
The purposes of this form are to provide you (as a prospective research study participant)
information that may affect your decision as to whether or not to participate in this research and
to record the consent of those who agree to be involved in the study.
RESEARCHERS
Harvey Thornburg (Professor of Electrical Engineering/AME), Gordon Wichern (Graduate
Student in Electrical Engineering/AME), Jiachen Xue (Graduate Student in Electrical
Engineering/AME)
STUDY PURPOSE
The purpose of the research is to evaluate an audio database search algorithm for
environmental and natural sounds using the relevancy rankings between the sound files that
you will provide by listening to sound files at a desktop computer and recording your results.
DESCRIPTION OF RESEARCH STUDY
As a study participant you will join a study funded by the National Science Foundation involving
research in developing a better way to search for sound files. You will be provided two folders
("database sounds" and "Queries") each containing several short (1-60 second) audio files. You
will then listen to all the sounds in both folders and fill out the provided table by marking an "X"
in the appropriate space if you feel that the two corresponding sound files are related.
RISKS
There are minimal risks associated with this study, as only participants with normal hearing are
advised to participate.
BENEFITS
The possible/main benefits of your participation in the research are a better understanding of
how audio recording, archival, and interactive search can allow humans to gain a better
understanding of their activity in everyday spaces and how audio information alone can reveal
how these spaces are used.
CONFIDENTIALITY
All information obtained in this study is strictly confidential unless disclosure is compelled by
law. The results of this research study may be used in reports, presentations, and publications,
but the researchers will not identify you.
WITHDRAWAL PRIVILEGE
It is acceptable for you to say no to participation in this study. Even if you say yes now, you are
free to say no later, and withdraw from the study at any time. Your decision will not affect your
relationship with Arizona State University or otherwise cause a loss of benefits to which you
might otherwise be entitled. If you are a student, participation is voluntary and that withdrawal
from the study will not affect your grade. If you withdraw, all your data will be deleted and/or
destroyed.
COMPENSATION FOR ILLNESS AND INJURY
If you agree to participate in the study, then your consent does not waive any of your legal
rights. However, in the event of (harm, injury, illness) arising from this study neither Arizona
State University nor the researchers are able to give you any money, insurance coverage, free
medical care, or any compensation for such injury.
VOLUNTARY CONSENT
Any questions you have concerning the research study or your participation in the study, before
or after your consent, will be answered by the faculty: Harvey Thornburg
(Harvey.Thornburg@asu.edu, 480-727-7902), or one of the graduate students listed at the
bottom of the form. If you have questions about your rights as a subject/participant in this
research, or if you feel you have been placed at risk, you can contact the Chair of the Human
Subjects Institutional Review Board, through the ASU Office of Integrity and Assurance, at 480-965-6788. This form explains the nature, demands, benefits and any risk of the project. By
signing this form you agree knowingly to assume any risks involved. Remember, your
participation is voluntary. You may choose not to participate or to withdraw your consent and
discontinue participation at any time without penalty or loss of benefit. In signing this consent
form, you are not waiving any legal claims, rights, or remedies. A copy of this consent form will
be given (offered) to you.
Your signature below indicates that you consent to participate in the above study. Also, by
signing below, you are granting to the researchers the right to use your performance - whether
recorded on or transferred to audiotape or disk - for presenting or publishing this research (or for
whatever use).
I CONSENT TO PARTICIPATE IN THE AFOREMENTIONED STUDY
Subject's Signature
Printed Name
Date
INVESTIGATOR'S STATEMENT
"I certify that I have explained to the above individual the nature and purpose, the potential
benefits and possible risks associated with participation in this research study, have answered
any questions that have been raised, and have witnessed the above signature. These elements
of Informed Consent conform to the Assurance given by Arizona State University to the Office
for Human Research Protections to protect the rights of human subjects. I have provided
(offered) the subject/participant a copy of this signed consent document."
Signature of Investigator
Date
Harvey Thornburg (Harvey.Thornburg@asu.edu)
Gordon Wichern (Gordon.Wichern@asu.edu)
Jiachen Xue (icxue@asu.edu)
Instructions:
If you are hearing impaired please notify the facilitator, it is possible your
results might not be used in the study, but you are still welcome to
complete the experiment.
You have been provided two folders containing several 1-60 second sound clips labeled
"sound files" and "Queries". Please listen to all sounds in both folders in order to fill
out the following table. The columns of the table correspond to the filenames of the
sounds in the "Queries" folder, and the rows of the table correspond to the filenames of
the sounds in the "sound files" folder.
Mark an X in the corresponding box if you feel that the given sound file and query are
relevant to each other.
Leave blank or mark a 0 in the corresponding box if you feel that the given sound file
and query are NOT relevant to each other.
[Response table: one row for each of Sound 1 through Sound 28 and one column for each of Query 1 through Query 4; participants mark an "X" in each cell where the corresponding sound file and query are relevant to each other.]
ARIZONA STATE UNIVERSITY
RESEARCH AND ECONOMIC AFFAIRS
Office of Research Integrity and Assurance
To: Harvey Thornburg, BRICKYARD
From: Mark Roosa, Chair, Soc Beh IRB
Date: 02/20/2009
Committee Action: Renewal
Renewal Date: 02/20/2009
Review Type: Expedited F7
IRB Protocol #: 0803002762
Study Title: Discovering Sound Relevancy Rankings
Expiration Date: 03/13/2010
The above-referenced protocol was given renewed approval following Expedited Review by the Institutional
Review Board.
It is the Principal Investigator's responsibility to obtain review and continued approval of ongoing research
before the expiration noted above. Please allow sufficient time for reapproval. Research activity of any sort
may not continue beyond the expiration date without committee approval. Failure to receive approval for
continuation before the expiration date will result in the automatic suspension of the approval of this protocol on
the expiration date. Information collected following suspension is unapproved research and cannot be reported
or published as research data. If you do not wish continued approval, please notify the Committee of the study
termination.
This approval by the Soc Beh IRB does not replace or supersede any departmental or oversight committee
review that may be required by institutional policy.
Adverse Reactions: If any untoward incidents or severe reactions should develop as a result of this study, you
are required to notify the Soc Beh IRB immediately. If necessary a member of the IRB will be assigned to look
into the matter. If the problem is serious, approval may be withdrawn pending IRB review.
Amendments: If you wish to change any aspect of this study, such as the procedures, the consent forms, or the
investigators, please communicate your requested changes to the Soc Beh IRB. The new procedure is not to
be initiated until the IRB approval has been given.
Learning Semantic Relationships of Multimedia
Consent Form - 1/2
INFORMED CONSENT FORM
MINIMAL RISK
ARIZONA STATE UNIVERSITY
LEARNING SEMANTIC RELATIONSHIPS OF MULTIMEDIA
INTRODUCTION
The purposes of this form are to provide you (as a prospective research study participant)
information that may affect your decision as to whether or not to participate in this research and
to record the consent of those who agree to be involved in the study.
RESEARCHERS
Harvey Thornburg (Professor of Electrical Engineering/AME), Gordon Wichern (Graduate
Student in Electrical Engineering/AME), Jinru Liu (Graduate Student in Electrical
Engineering/AME), Brandon Mechtley (Graduate Student in Computer Science/AME)
STUDY PURPOSE
The purpose of this research is to understand and evaluate how a large collection of multimedia
(sounds, images, videos, gestures, words, etc.) is searched for and experienced. By providing
descriptions of an existing collection of multimedia and contributing multimedia you create, we
hope to develop fast and intuitive ways of searching multimedia collections.
DESCRIPTION OF RESEARCH STUDY
As a study participant you will join a study funded by the National Science Foundation involving
research in developing a better way to search for and experience multimedia (sounds, images,
videos, gestures, words, etc.). Your participation will last approximately 60 minutes and your
task will be to: (i) label an existing database of multimedia objects, e.g., listen to a sound and
describe it with words, (ii) create/record new sounds/videos/gestures using microphones,
cameras, or computers, and (iii) experience search in a virtual reality/gaming type environment
where video game controllers and microphones are used to sense activity and control projected
images and audio playback.
RISKS
There are minimal risks associated with this study, as only participants with normal hearing and
vision are advised to participate.
BENEFITS
If eligible, you will receive course credit for your participation. Additionally, benefits of your
participation in the research are a better understanding of how multimedia is captured, stored,
searched for, and ultimately experienced.
CONFIDENTIALITY
All information obtained in this study is strictly confidential unless disclosure is compelled by
law. The results of this research study may be used in reports, presentations, and publications,
but the researchers will not identify you.
WITHDRAWAL PRIVILEGE
It is acceptable for you to say no to participation in this study. Even if you say yes now, you are
free to say no later, and withdraw from the study at any time. Your decision will not affect your
relationship with Arizona State University or otherwise cause a loss of benefits to which you
might otherwise be entitled. If you are a student, participation is voluntary and that withdrawal
from the study will not affect your grade. If you withdraw, all your data will be deleted and/or
destroyed.
"—
ASU IRB
cSign
I Date
^CY\
Approved
v^l')
vf | a - n A « [ ^ |
7
U
l,rr_
COMPENSATION FOR ILLNESS AND INJURY
If you agree to participate in the study, then your consent does not waive any of your legal
rights. However, in the event of (harm, injury, illness) arising from this study neither Arizona
State University nor the researchers are able to give you any money, insurance coverage, free
medical care, or any compensation for such injury.
VOLUNTARY CONSENT
Any questions you have concerning the research study or your participation in the study, before
or after your consent, will be answered by the faculty: Harvey Thornburg
(Harvey.Thornburg@asu.edu, 480-727-7902), or one of the graduate students listed at the
bottom of the form. If you have questions about your rights as a subject/participant in this
research, or if you feel you have been placed at risk, you can contact the Chair of the Human
Subjects Institutional Review Board, through the ASU Office of Integrity and Assurance, at 480-965-6788. This form explains the nature, demands, benefits and any risk of the project. By
signing this form you agree knowingly to assume any risks involved. Remember, your
participation is voluntary. You may choose not to participate or to withdraw your consent and
discontinue participation at any time without penalty or loss of benefit. In signing this consent
form, you are not waiving any legal claims, rights, or remedies. A copy of this consent form will
be given (offered) to you.
Your signature below indicates that you consent to participate in the above study. Also, by
signing below, you are granting to the researchers the right to use your performance - whether
recorded on or transferred to audiotape or disk - for presenting or publishing this research (or for
whatever use).
I CONSENT TO PARTICIPATE IN THE AFOREMENTIONED STUDY
Subject's Signature
Printed Name
Date
INVESTIGATOR'S STATEMENT
"I certify that I have explained to the above individual the nature and purpose, the potential
benefits and possible risks associated with participation in this research study, have answered
any questions that have been raised, and have witnessed the above signature. These elements
of Informed Consent conform to the Assurance given by Arizona State University to the Office
for Human Research Protections to protect the rights of human subjects. I have provided
(offered) the subject/participant a copy of this signed consent document."
Signature of Investigator
Harvey Thornburg (Harvey.Thornburg@asu.edu)
Brandon Mechtley (Brandon.Mechtley@asu.edu)
Gordon Wichern (Gordon.Wichern@asu.edu)
Jinru Liu (Jinru.Liu@asu.edu)
Date
ARIZONA STATE UNIVERSITY
RESEARCH AND ECONOMIC AFFAIRS
Office of Research Integrity and Assurance
To: Harvey Thornburg, BRICKYARD
From: Mark Roosa, Chair, Soc Beh IRB
Date: 04/27/2009
Committee Action: Expedited Approval
Approval Date: 04/27/2009
Review Type: Expedited F7
IRB Protocol #: 0904003936
Study Title: Learning Semantic Relationships of Multimedia
Expiration Date: 04/26/2010
The above-referenced protocol was approved following expedited review by the Institutional Review Board.
It is the Principal Investigator's responsibility to obtain review and continued approval before the expiration
date. You may not continue any research activity beyond the expiration date without approval by the
Institutional Review Board.
Adverse Reactions: If any untoward incidents or severe reactions should develop as a result of this study, you
are required to notify the Soc Beh IRB immediately. If necessary a member of the IRB will be assigned to look
into the matter. If the problem is serious, approval may be withdrawn pending IRB review.
Amendments: If you wish to change any aspect of this study, such as the procedures, the consent forms, or the
investigators, please communicate your requested changes to the Soc Beh IRB. The new procedure is not to
be initiated until the IRB approval has been given.
Please retain a copy of this letter with your approved protocol.