Indexing with linguistic techniques in the classic Information Retrieval model
Julio Gonzalo, Anselmo Peñas y Felisa Verdejo
Grupo de Procesamiento de Lenguaje Natural
Dpto. Lenguajes y Sistemas Informáticos
UNED
Jornadas de Tratamiento y Recuperación de la Información JOTRI 2002
Content
• Goal
• Morpho-syntactic ambiguity in IR
• Phrase indexing
• Conceptual indexing
• Conclusions
Goal
Indexing with automatic linguistic techniques within the classic IR model
• POS tagging
• Phrase indexing
• WSD & Conceptual Indexing
[Diagram: Information need, Query formulation, Search engine, Docs., Document ranking, Refinement]
Bad strategies or too much error in automatic processing?
IR-Semcor, hand-annotated test collection
• Lemmas and phrases
• Senses
• Synsets
Morpho-syntactic ambiguity in IR
Texts
...particle crosses the wall...
...canadian red cross...
...boat to cross mississippi river...
Plain
Query: cross
...particl cross the wall... (matches)
...canadian red cross... (matches)
...boat to cross mississippi river... (matches)
POS Tagged
Query: cross_N
...particl_N cross_V the_D wall_N...
...canadian_ADJ red_ADJ cross_N... (matches)
...boat_N to_TO cross_V mississippi_N river_N...
Morpho-syntactic ambiguity in IR
Documents matched are ranked much higher (there are fewer competing documents)
Manual POS tagging misses relevant matches
• Query: ...talented baseball player... (talent_ADJ)
• Doc: ...top talents of the time... (talent_N)
• Missing match
Automatic tagging makes more mistakes, but they do not always correlate with a decrease in retrieval performance
• Query: summer_N shoes_N design_V (design_V)
• Doc: Italian_ADJ designed_V sandals_N (design_V)
• Match
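The missing match above can be reproduced with a minimal sketch. The term lists are hand-written from the slide example, and the tags for "top", "time", "baseball" and "player" as well as the `matches` helper are assumptions for illustration, not the indexing code used in the experiments.

```python
# Toy illustration: indexing terms as plain stems vs. stem_POS pairs.
# Term lists and tags are hand-written assumptions based on the slide example.

def matches(query_terms, doc_terms):
    """Return True if the query and the document share at least one index term."""
    return bool(set(query_terms) & set(doc_terms))

# Plain index terms (stems only)
plain_query = ["talent", "baseball", "player"]
plain_doc = ["top", "talent", "time"]

# POS-tagged index terms (stem_TAG)
tagged_query = ["talent_ADJ", "baseball_N", "player_N"]
tagged_doc = ["top_ADJ", "talent_N", "time_N"]

print(matches(plain_query, plain_doc))    # True: "talent" is shared
print(matches(tagged_query, tagged_doc))  # False: talent_ADJ != talent_N
```

Tagging makes the index terms more specific, which is what removes competing documents, but the same specificity is what drops this relevant one.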
Phrase indexing
Texts
...a guide for the fisher who...
...information on cat care...
...arboreal carnivorous called fisher cat...
Plain
Query: fisher
...a guide for the fisher who... (matches)
...arboreal carnivorous called fisher cat... (matches)
...information on cat care...
Phrase indexing
Query: fisher
...a guide for the fisher who... (matches)
...arboreal carnivorous called fisher_cat...
...information on cat care...
Phrase indexing
Phrase indexing sometimes harms retrieval
• Query: Candidate in governor’s_race
• Doc: Opened his race for governor
• Missing match
Phrase meaning is highly compositional
Needs semantic distinction
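A minimal sketch of why merging phrases into single index terms can lose a match. The `PHRASES` table, the simplified token lists, and the `index_terms` helper are assumptions for illustration; the slides do not describe the actual phrase recognizer.

```python
# Toy phrase indexer: joins known adjacent word pairs into single index terms.
# PHRASES and the token lists are illustrative assumptions.

PHRASES = {("governor", "race"): "governor_race", ("fisher", "cat"): "fisher_cat"}

def index_terms(tokens, use_phrases):
    """Return the set of index terms, optionally merging adjacent phrase tokens."""
    terms, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if use_phrases and pair in PHRASES:
            terms.append(PHRASES[pair])
            i += 2
        else:
            terms.append(tokens[i])
            i += 1
    return set(terms)

query = ["candidate", "governor", "race"]  # "Candidate in governor's race"
doc = ["open", "race", "governor"]         # "Opened his race for governor"

plain_overlap = index_terms(query, False) & index_terms(doc, False)
phrase_overlap = index_terms(query, True) & index_terms(doc, True)
print(sorted(plain_overlap))   # ['governor', 'race']
print(sorted(phrase_overlap))  # []
```

Because the meaning of "governor's race" is compositional, its components still carry the match; treating the phrase as an opaque unit hides them.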
Conceptual Indexing
This model can improve text retrieval (Gonzalo 1998; Gonzalo 1999), depending on the WSD error rate
Query: spring
Texts
• ...spring... / ...muelle...
• ...spring... / ...fountain... / ...fuente...
• ...spring... / ...springtime... / ...primavera...
WSD
Conceptual Index
• n03114639
• n05727069
• n09151839
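The idea of indexing by concept rather than by word can be sketched as follows. The concept ids (`C_COIL`, `C_FOUNTAIN`, `C_SEASON`), the sense labels, and both helper functions are hypothetical placeholders; they are not WordNet offsets, and the sense of each word is assumed to have been chosen already by a WSD step.

```python
# Toy conceptual index: surface words in any language map to a
# language-neutral concept id once their sense is known.
# Concept ids and sense labels are hypothetical placeholders.

SENSE_TO_CONCEPT = {
    ("spring", "device"): "C_COIL",
    ("muelle", "device"): "C_COIL",
    ("spring", "water"): "C_FOUNTAIN",
    ("fountain", "water"): "C_FOUNTAIN",
    ("fuente", "water"): "C_FOUNTAIN",
    ("spring", "season"): "C_SEASON",
    ("springtime", "season"): "C_SEASON",
    ("primavera", "season"): "C_SEASON",
}

def concept_of(word, sense):
    """Look up the concept for a word whose sense was chosen by (assumed) WSD."""
    return SENSE_TO_CONCEPT[(word, sense)]

def build_index(docs):
    """Map each concept id to the set of document ids that contain it."""
    index = {}
    for doc_id, tagged_words in docs.items():
        for word, sense in tagged_words:
            index.setdefault(concept_of(word, sense), set()).add(doc_id)
    return index

docs = {
    "d1": [("spring", "device")],
    "d2": [("primavera", "season")],
    "d3": [("springtime", "season")],
    "d4": [("fuente", "water")],
}
index = build_index(docs)

# The query "spring", disambiguated as the season, retrieves Spanish and English docs alike.
print(sorted(index[concept_of("spring", "season")]))  # ['d2', 'd3']
```

A wrong sense choice sends the word to the wrong concept, which is why retrieval quality in this model depends on the WSD error rate.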
Word Sense Disambiguation
(Sanderson 1994) introduced fixed error rates in pseudo-word disambiguation
• banana → banana/education/toy/gun/forest → WSD → toy
• Conclusion (over the Reuters collection): WSD must be above 90% accuracy
Reproduce Sanderson’s experiment (over IR-Semcor)
Compare retrieval precision over synsets with WSD errors
• spring: n07062238 {spring, springtime}
• WSD output: n04985670 {spring, hook} (error)
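Introducing disambiguation errors at a fixed rate can be sketched roughly as below. The function name, the seeded generator, and the error model (a uniform choice among the wrong components of the pseudo-word) are assumptions for illustration, not a description of the original experimental setup.

```python
import random

# Sketch of disambiguating a pseudo-word with a controlled error rate.
# The error model (uniform choice among wrong components) is an assumption.

def noisy_disambiguate(true_word, pseudo_word, error_rate, rng):
    """Return true_word, or a wrong component with probability error_rate."""
    components = pseudo_word.split("/")
    wrong = [w for w in components if w != true_word]
    if wrong and rng.random() < error_rate:
        return rng.choice(wrong)
    return true_word

rng = random.Random(0)
pseudo = "banana/education/toy/gun/forest"
outputs = [noisy_disambiguate("banana", pseudo, 0.10, rng) for _ in range(10_000)]
accuracy = outputs.count("banana") / len(outputs)
print(f"observed accuracy: {accuracy:.3f}")  # close to 0.90; exact value depends on the seed
```

Sweeping `error_rate` and measuring retrieval precision at each setting gives the kind of curve from which an accuracy threshold can be read off.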
Conceptual Indexing
Although explicit disambiguation strategies applied to indexing
• POS tagging
• Phrase indexing
• Word Sense Disambiguation
don’t produce a significant improvement in IR
Conceptual indexing based on synsets
• Needs automatic WSD accuracy close to the state of the art (60%)
• Permits Cross-Language Information Retrieval
Qualitative evaluation (Item Search engine)
• Some unsolved challenges (mainly WSD)
• Users perceive a slower and less transparent system
Conclusions
Think of users
– Even an improvement of 10% wouldn’t change users’ perception
– Don’t subordinate NLP to the classic IR model
– Find new paradigms in Information Access
– At a higher level, closer to users
• Consider users’ tasks
• Consider users’ interaction
IR-Semcor test collection
– 254 hand-annotated documents in English
– 82 hand-annotated queries in English with ~6.8 relevant documents each
Example
The Fulton County Grand Jury investigates possible irregularities in Atlanta’s primary election
Lemmas and phrase annotation
The Fulton_County_Grand_Jury investigate possible irregularity in atlanta primary_election
Sense annotation
Fulton_County_Grand_Jury investigate2 possible2 irregularity1 atlanta1 primary_election1
Synset annotation (actually synset offsets or ILI-records)
Fulton_County_Grand_Jury v00441414 a00036893 n00412042 n5608324 n00103176
{ investigate, carry_out_an_investigation_of }
{ irregularity, abnormality }
{ Atlanta, capital_of_Georgia }
{ primary_election, primary }
{ possible, potential }
IR-Semcor test collection
Semcor 1.5: Doc 1, Doc 2, ..., Doc ~100
Semcor 1.6: Doc 1, Doc 2, ..., Doc 83
IR-Semcor: Doc 1, Doc 2, ..., Doc 171, ..., Doc 254
Queries: Query 1, Query 2, ..., Query 82
Hand-annotated summaries only for chunked docs
Assume the summary of a text is relevant to all fragments of the original Semcor document
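The relevance assumption can be expressed as a small sketch. All identifiers below (`text_A`, `doc_1`, `query_1`, and the `build_relevance` helper) are hypothetical and do not correspond to actual IR-Semcor names.

```python
# Sketch of deriving relevance judgments under the stated assumption:
# a summary (used as a query) is relevant to every fragment of its source text.
# All identifiers are hypothetical, not actual IR-Semcor names.

def build_relevance(fragments_of, summary_source):
    """Map each query id to the set of fragment ids judged relevant."""
    return {
        query_id: set(fragments_of[source_id])
        for query_id, source_id in summary_source.items()
    }

fragments_of = {
    "text_A": ["doc_1", "doc_2", "doc_3"],
    "text_B": ["doc_4", "doc_5"],
}
summary_source = {"query_1": "text_A", "query_2": "text_B"}

qrels = build_relevance(fragments_of, summary_source)
print(sorted(qrels["query_1"]))  # ['doc_1', 'doc_2', 'doc_3']
```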
Textual representation: query is translated into the target language
Conceptual representation: query and documents are compared at a conceptual level
Selection of query language
Selection of WSD strategy
Selection of newspaper determines the target language
Retrieved documents
Approaches
Natural Language Processing
Disambiguation
Conceptual indexing
Terminology
Controlled vocabularies indexing & browsing
String Processing
Free text indexing
Information Retrieval
Phrase indexing & browsing (Phind)
Keyphrase navigation (Phrasier)
Automatic Terminology Extraction
Terminology Retrieval & Term browsing (WTB)
Semantic distinction of compounds
II. Experiments in Lexical Ambiguity and Indexing
Automatic classification through WordNet
Types of lexical compounds
• Endocentric: one component is a hyperonym (purchasing department is_a department)
• Appositional: all components are hyperonyms (aspirin powder is_a powder, is_a aspirin)
• Exocentric: no components are hyperonyms (fisher cat)
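The three-way classification can be sketched as a simple rule over a hypernym lookup. `HYPERNYMS` is a hand-made stand-in for a WordNet query, with entries assumed from the slide examples, and `classify_compound` is a hypothetical helper rather than the classifier used in the experiments.

```python
# Sketch of classifying a compound by which of its components are
# hypernyms of the compound itself. HYPERNYMS is a hand-made stand-in
# for a WordNet lookup; its entries are assumptions from the slide examples.

HYPERNYMS = {
    "purchasing_department": {"department"},
    "aspirin_powder": {"aspirin", "powder"},
    "fisher_cat": set(),  # assumed: neither component is listed as a hypernym
}

def classify_compound(compound):
    """Return 'endocentric', 'appositional', or 'exocentric'."""
    components = compound.split("_")
    hits = [c for c in components if c in HYPERNYMS.get(compound, set())]
    if len(hits) == len(components):
        return "appositional"
    if hits:
        return "endocentric"
    return "exocentric"

for name in HYPERNYMS:
    print(name, "->", classify_compound(name))
```

Under this rule, exocentric compounds such as fisher_cat are the ones whose components say little about the compound's meaning.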