Modeling Identity in Archival Collections of Email: A Preliminary Study

Tamer Elsayed and Douglas W. Oard
Institute for Advanced Computer Studies
Department of Computer Science
College of Information Studies
Conference on Email and Anti-Spam (CEAS), July 28th, 2006
Real Problem
[Figure: a search request on tobacco policy is run against the Clinton White House collection of 32 million emails. The diagram shows message sets labeled 200,000 and 80,000 and the note "National Archives hired 25 persons for 6 months …".]
Email Search
[Table: collections by type, for a searcher who is a Participant or a Non-participant]
Personal: My own emails; Shneiderman’s; Postel’s
Organizational: CS; UMIACS; White House; Enron
Public: TREC Enterprise; Usenet news; W3C
• Meaning → Modeling Content
• People → Modeling Identity
Identity
[Figure: an email connects a Sender, its Receivers, and a Mentioned person; each is described by an Email Address, a Name, and a Nickname. Relation labels: sent, received, mentions, mentioned, mentioned to, sent email to.]
Outline
• Problem
• Identity Resolution Architecture
• Evaluation
• Conclusion
Entity Example
[Figure: one entity drawn as linked attributes]
• Name: “Robert Bruce”
• Nickname: “Bob”
• Email Address: “robert.bruce@enron.com”
• Evidence labels shown: Main Headers (915), Quoted Headers (8), Salutations (7), Free Signatures (9), Static Signature (140)
• Signature Block:
  Robert E. Bruce
  Senior Counsel
  Enron North America Corp.
  T (713) 345-7780
  F (713) 646-3393
  robert.bruce@enron.com
Enron Collection
• Example of a large organizational collection
• CMU version
  ▪ about half a million emails
  ▪ 133,581 unique email addresses
• ~52% of emails are duplicates!
  ▪ same address, subject, body
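The duplicate criterion on this slide (same address, subject, body) can be sketched as a keyed filter. This is a minimal Python sketch, assuming each message is a dict with the hypothetical keys `sender`, `subject`, and `body`; the case and whitespace normalization is an assumption, not something the slides specify.

```python
def dedup_key(msg):
    """Build a comparison key from sender address, subject, and body."""
    # Assumption: addresses compare case-insensitively and runs of
    # whitespace are not significant. The slides do not say this.
    sender = msg["sender"].strip().lower()
    subject = " ".join(msg["subject"].split())
    body = " ".join(msg["body"].split())
    return (sender, subject, body)


def unique_emails(messages):
    """Keep the first message seen for each (address, subject, body) key."""
    seen = set()
    unique = []
    for msg in messages:
        key = dedup_key(msg)
        if key not in seen:
            seen.add(key)
            unique.append(msg)
    return unique
```

Keeping the first copy rather than the last is an arbitrary choice here; either way, every later stage sees each distinct message once.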
Typical Enron Email
[Message Header]
Message-ID: <1494.1584620.JavaMail.evans@thyme>
Date: Mon, 30 Jul 2001 12:40:48 -0700 (PDT)
From: elizabeth.sager@enron.com
To: sstack@reliant.com
Subject: RE: Shhhh.... it's a SURPRISE !
X-From: Sager, Elizabeth </O=ENRON/OU=NA/CN=RECIPIENTS/CN=ESAGER>
X-To: 'SStack@reliant.com@ENRON'

[Message Body: Salutation]
Hi Shari

[Message Body: Main Body]
Hope all is well.
Count me in for the group present.
See ya next week if not earlier

[Message Body: Signature Block]
Liza
Elizabeth Sager
713-853-6349

[Quoted Text: Quoted Header]
-----Original Message-----
From: SStack@reliant.com@ENRON
Sent: Monday, July 30, 2001 2:24 PM
To: Sager, Elizabeth; Murphy, Harlan; jcrespo@hess.com;
wfhenze@jonesday.com
Cc: ntillett@reliant.com
Subject: Shhhh.... it's a SURPRISE !

[Quoted Text: Quoted Main Body]
Please call me (713) 207-5233
Thanks!

[Quoted Text: Quoted Signature]
Shari
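The split between message body and quoted text in this example can be sketched by cutting at the first quoted-header marker. This is a minimal Python sketch; the two marker patterns are hypothetical and modeled only on the "Original Message" and "Forwarded by" lines that appear in this deck, since the slides do not list the heuristics actually used.

```python
import re

# Hypothetical marker patterns (assumption, not taken from the slides).
QUOTE_MARKERS = [
    re.compile(r"^-{2,}\s*Original Message\s*-{2,}"),
    re.compile(r"^-{2,}\s*Forwarded by\b"),
]


def split_body_and_quoted(text):
    """Split message text into (main_body, quoted_text) at the first marker line."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        stripped = line.strip()
        if any(pattern.match(stripped) for pattern in QUOTE_MARKERS):
            return "\n".join(lines[:i]).rstrip(), "\n".join(lines[i:])
    # No marker found: treat the whole text as the main body.
    return text.rstrip(), ""
```

Inline replies that quote with a leading `>` would need a line-by-line treatment instead; this sketch covers only the block-quoted style shown above.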
Identity Resolution Architecture
[Figure: processing pipeline, drawn on the slide from bottom to top]
• Duplicate Detection → unique emails
• Body and Quoted Text Separation → main body, quoted headers
• Salutation Line Detection and Signature Line Detection → salutation lines, signature lines
• Extraction from Main Header, Extraction from Quoted Header, Nickname Extraction
• Address-Address Associations, Address-Name Associations, Address-Nickname Associations
• Clustering → Entities
Extraction From Main Headers
Message-ID: <1486175.1075858665169.JavaMail.evans@thyme>
Date: Wed, 26 Sep 2001 09:25:19 -0700 (PDT)
From: jmathes@nbchamber.com
To: mark.vandini@enron.com, steve.urbon@enron.com,
sapienza.tony@enron.com, o'rourke.tom@enron.com, lyons.tom@enron.com
Subject: New Email Address
X-From: Jim Mathes <jmathes@nbchamber.com>
X-To: Vandini, Mark <Mark_Vandini@nstaronline.com>, Urbon Steve <surbon@s-t.com>,
Tony Sapienza <sapiena@gftusa.com>, Tom O'Rourke <tom@plymouthchamber.com>,
Tom Lyons <tlyons@frfive.com>, Tom Hodgson <sheriff@BCSO-MA.org>
X-cc:
X-bcc:
We have just launched our "New & Improved Website",
www.newbedfordchamber.com
and I have a new email address:
jmathes@newbedfordchamber.com
Please make the appropriate changes in your email address book.
Thank you,
Jim Mathes, President
New Bedford Area Chamber of Commerce
[Slide callouts: two Name-Address Associations and one Address-Address Association]
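Name-address associations of the kind called out on this slide can be pulled from headers such as `X-From` and `X-To` with a small pattern match. This is a minimal Python sketch, not the authors' extractor: the regular expression and the function name `name_address_pairs` are assumptions, and lowercasing the address is an illustrative normalization. A hand-written pattern is used because display names like `Vandini, Mark` contain an unquoted comma.

```python
import re

# Matches "display name <address>" items; a comma may follow each item.
NAME_ADDR = re.compile(r"([^<>]*?)\s*<([^<>\s]+@[^<>\s]+)>\s*,?\s*")


def name_address_pairs(header_value):
    """Return (name, address) pairs found in one header value."""
    pairs = []
    for match in NAME_ADDR.finditer(header_value):
        # Drop surrounding whitespace and quote characters from the name.
        name = match.group(1).strip().strip("'\"")
        address = match.group(2).lower()
        if name:
            pairs.append((name, address))
    return pairs
```

For the `X-From` value shown above, this yields one pair linking the name Jim Mathes to jmathes@nbchamber.com. Values without an `@` inside the angle brackets, such as the Exchange-style `X-From` in the earlier example, produce no pair.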
Extraction From Quoted Headers
Hi Jeff,
Did you get our registration packet? If not, stop by and pick one up
because you need it. Make sure you get the one for new students.
Shawn
On Wednesday, November 03, 1999 11:18 AM, Jeff Dasovich
[SMTP:jdasovic@enron.com] wrote:
>
>
> ok, don't shoot me, but what's the deadline for scheduling for classes?
>
> signed,
> clueless
---------------------- Forwarded by Elizabeth Sager/HOU/ECT on 02/09/2000
12:02 PM ---------------------------
"Patricia Young" <PYoung@eei.org> on 02/09/2000 08:50:59 AM
To: Elizabeth Sager/HOU/ECT@ECT
cc:
Subject: If possible, would you forward your resume to me electronically?
Thanks.
If possible, would you forward your resume
to me electronically? Thanks.
[Slide callouts: a Name-Address Association in each of the two quoted headers]
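The two quoted-header styles on this slide can each be matched with one pattern. This is a minimal Python sketch under stated assumptions: both regular expressions are hypothetical and cover only the two formats shown here, and the first one recognizes only simple capitalized names of two or more words.

```python
import re

# Style 1: Jeff Dasovich [SMTP:jdasovic@enron.com], possibly across a line break.
SMTP_STYLE = re.compile(
    r"([A-Z][a-z]+(?:\s+[A-Z][a-z]+)+)\s*\[SMTP:([^\]\s]+@[^\]\s]+)\]"
)
# Style 2: "Patricia Young" <PYoung@eei.org>
ANGLE_STYLE = re.compile(r'"([^"]+)"\s*<([^<>\s]+@[^<>\s]+)>')


def quoted_header_pairs(quoted_text):
    """Return (name, address) pairs found in quoted or forwarded headers."""
    pairs = []
    for pattern in (SMTP_STYLE, ANGLE_STYLE):
        for name, address in pattern.findall(quoted_text):
            # Collapse internal whitespace, including a wrapped line break.
            pairs.append((" ".join(name.split()), address.lower()))
    return pairs
```

Quoted headers are free text rather than structured fields, so a real extractor would need many more formats than these two.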
Signature & Salutation Detection
[Figure: three overlaid messages from susan.scott@enron.com. The bodies differ, but each ends with a closing ("Love," or "love,"), a free signature ("-Sooz", "sooz", or "Sooz"), and the same static signature block:
Procurement, Logistics, and Contracts
Enron Broadband Services, Inc.
1400 Smith, Suite EB-4573A
Houston, TX 77002]
Nickname Extraction
From: susan.scott@enron.com
Had another sleepless night Sun. and finally took some Unisom and had a good
night's sleep last night. What a relief. I have really never had this
problem before. It's good to have a lot of energy, but you have to shut down
sometime.
Am sending you my travel schedule for next week. The following week (May 29
- June 2) I'm planning to be in SF also, but I'm not sure I'll actually have
to be there that long.
Have a good afternoon!
love,
sooz [nickname]
Procurement, Logistics, and Contracts
Enron Broadband Services, Inc.
1400 Smith, Suite EB-4573A
Houston, TX 77002
3,151 address-nickname associations
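Nickname candidates can be read off a salutation line or off the line that follows a closing such as "love,". This is a minimal Python sketch, not the authors' method: the greeting and closing word lists and both function names are hypothetical, and the slides do not enumerate the cues actually used.

```python
import re

# Hypothetical cue words (assumption).
GREETINGS = ("hi", "hello", "dear", "hey")
CLOSINGS = ("love", "thanks", "thank you", "regards", "cheers")


def salutation_nickname(body):
    """Return the name greeted on the first non-empty line, or None."""
    for line in body.splitlines():
        if not line.strip():
            continue
        m = re.match(r"(\w+)\s+([A-Za-z]+)\s*[,:!]?\s*$", line.strip())
        if m and m.group(1).lower() in GREETINGS:
            return m.group(2).lower()
        return None  # only the first non-empty line is considered
    return None


def signature_nickname(body):
    """Return the one-word line that follows a closing line, or None."""
    lines = [l.strip() for l in body.splitlines() if l.strip()]
    for prev, cur in zip(lines, lines[1:]):
        if prev.rstrip(",!.").lower() in CLOSINGS:
            m = re.fullmatch(r"-?\s*([A-Za-z]+)", cur)
            if m:
                return m.group(1).lower()
    return None
```

Each candidate would then be paired with the sender's address (for signatures) or the recipient's address (for salutations) to form an address-nickname association.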
Identifying Entities
[Figure: one entity drawn as linked attributes, with two email addresses]
• Names and nicknames shown: “Robert Bruce”, “Bob”, “Robert”
• Email addresses shown: “robert.bruce@enron.com”, “rbruce@hotmail.com”
• Evidence labels shown: Main Headers (915), Quoted Headers (8), Salutations (7), Free Signatures (9), Static Signature (140), Main Headers (7), Quoted Headers (5)
• Signature Block:
  Robert E. Bruce
  Senior Counsel
  Enron North America Corp.
  T (713) 345-7780
  F (713) 646-3393
  robert.bruce@enron.com
• Association totals: 82,084 addr-name; 3,151 addr-nickname; 19,708 addr-addr
• 66,715 entities
Outline
• Problem
• Identity Resolution Architecture
• Evaluation
• Conclusion
• Future Work
Stratified Sampling
Entries are listed for Weakest Evidence, then Stronger Evidence; rows with a single entry show one value on the slide.
Address-Name Associations
• Main headers only: 50 / 29677, 50 / 31248
• Quoted headers only: 50 / 8042, 50 / 3828
• Both headers: 50 / 9289
Address-Nickname Associations
• Salutations only: 50 / 272, 50 / 465
• Signatures only: 50 / 172, 50 / 1754
• Both: 50 / 490
Address-Address Associations
• 50 / 6514, 50 / 4194
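Drawing a fixed number of associations from every stratum, as in the entries above, can be sketched with the standard library. This is a minimal Python sketch, assuming the strata are held in a dict from a label to a list of associations; the function name, the fixed seed, and uniform sampling within a stratum are assumptions rather than details given on the slide.

```python
import random


def stratified_sample(strata, per_stratum=50, seed=0):
    """Draw up to `per_stratum` items uniformly at random from each stratum."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    sample = {}
    for label, items in strata.items():
        k = min(per_stratum, len(items))  # small strata are taken whole
        sample[label] = rng.sample(list(items), k)
    return sample
```

Sampling the same number from each stratum means rare evidence types, such as the few hundred salutation-only associations, are judged as thoroughly as the tens of thousands of main-header ones.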
Judgment Process
Incorrect
kmpresto@msn.com ↔ "home email"
terrie.james@enron.com ↔ "alexis james-petty"
Correct but not informative
june-deadrick@reliantenergy.com ↔ "june deadrick"
robbie.lewis@enron.com ↔ "robbie lewis"
Correct and somewhat informative
terriecovarrubias@hotmail.com ↔ "terrie covarrubias"
randal.maffett@enron.com ↔ "randy"
Correct and very informative
lemelpe@nu.com ↔ "phyllis"
piazzet@wharton.upenn.edu ↔ "tom"
Evaluation Measures
[Figure: diagram relating Judged Associations, Correct, Informative, and Very Informative]
Accuracy
• 100% accuracy with multiple sources of evidence.
• Address-name association was nearly perfect
• 80% minimum accuracy in address-nickname
• 96.7% entity accuracy
[Figure: three bar charts of Percent Accuracy (0 to 100) with series Weakest evidence, Average evidence, and Stronger evidence. Panels: Address-Name Associations (Main Headers, Quoted Headers, Both, Overall); Address-Nickname Associations (Salutation, Signature, Both, Overall); Address-Address Associations (Main Headers).]
Informativeness
[Figure: bar charts of Percent Informative and Percent Very Informative (0 to 100) with series Weakest evidence, Average evidence, and Stronger evidence. Panels: Address-Name Associations (Main Headers, Quoted Headers, Both, Overall); Address-Nickname Associations (Salutation, Signature, Both, Overall); Address-Address Associations (Main Headers), shown with Percent Accuracy, Percent Informative, and Percent Very Informative.]
Outline
• Problem
• Identity Resolution Architecture
• Evaluation
• Conclusion
Conclusion
• Introduced a computational model of identity
  ▪ a set of simple techniques put together
  ▪ provides a useful baseline
  ▪ assessed its potential utility in the context of one fairly complex email collection
• Automatic detection of nicknames in salutations and signature lines
• The most informative results come from the weakest evidence, which is also the least accurate
• Accuracy and informativeness are both important
Limitations
• Email address associated with single identity
• Strength of evidence not exploited
• Heuristics hand-tuned for Enron collection
• Focus on personal attributes
• No reconciliation of multiple identities for single person
• No attempt to classify identities as machines or groups
• Recall?
Thank You!
Questions?
Backup
Future Work
• extend the model to exploit temporal features and behavioral evidence
• implement machine learning techniques
• perform ablation studies
• characterize the coverage of our methods in more detail
• replicate this work in other contexts
• integrate these techniques with the ultimate applications for which computational models of identity are needed (e.g., social network analysis).
Helping in Judgments
Identity Framework
[Figure: diagram with nodes labeled Entity, Group, Person, and Machine, three Identity nodes, and five Entity nodes marked as Candidates]
Modeling Identity
• Attributes (stable explicit features)
  ▪ email addresses, names, nicknames, contact info
• Associations
  ▪ Link attributes together
  ▪ Based on observations
• Entities
  ▪ Representation of an identity
  ▪ Set of attributes in an undirected graph
    • Linked by weighted associations
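The model on this slide, attributes as nodes of an undirected graph joined by weighted associations, can be sketched as a small class. This is a minimal Python sketch under stated assumptions: the class and method names are hypothetical, attributes are represented as `(kind, value)` tuples, and an observation count is used as the weight because the slide does not define the weighting.

```python
from collections import defaultdict


class Entity:
    """A set of attributes linked by weighted, undirected associations."""

    def __init__(self):
        self.attributes = set()          # (kind, value) pairs
        self.weights = defaultdict(int)  # frozenset({a, b}) -> observation count

    def observe(self, a, b, count=1):
        """Record `count` observations that link attributes a and b."""
        self.attributes.update((a, b))
        # A frozenset key makes the edge undirected: (a, b) equals (b, a).
        self.weights[frozenset((a, b))] += count

    def weight(self, a, b):
        """Return the accumulated weight between a and b (0 if unlinked)."""
        return self.weights.get(frozenset((a, b)), 0)
```

Keeping the edge key unordered means evidence observed in either direction accumulates on the same association.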
Identifying Entities
• First round
  ▪ limited transitive closure
• Merging associations
  ▪ based on unique attributes
  ▪ Address-address associations
• No use of strength of evidence yet
• 66,715 entities
  ▪ Covering 77,420 unique email addresses (58% of all addresses)
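Merging associated attributes into entities can be sketched with a union-find structure. This is a minimal Python sketch that computes a plain transitive closure; the slide's "limited" closure is not specified, so the limiting rule is not implemented here, and the function name and the use of plain strings for attributes are assumptions.

```python
def cluster_attributes(associations):
    """Group attributes into entities by merging every associated pair."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in associations:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # union the two groups

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())
```

An unrestricted closure like this one merges any two people who share a common nickname, which is presumably why the slide restricts merging to unique attributes and address-address associations; a caller would pass in only those pairs.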
Related Work
• Attribute/association extraction
• Name recognition and reference resolution
• Applications:
  ▪ Social network analysis
  ▪ Finding experts
Unjudged Associations
[Figure: bar chart of Unjudged Associations (0 to 5) with series Weakest evidence and Stronger evidence. Groups: Address-Name Associations (Main Headers, Quoted Headers); Address-Nickname Associations (Salutations, Signatures); Address-Address Associations (Main Headers).]
Only 19 → ~3%