Cross-Document Entity Identification and Training
Dan Roth of the University of Illinois Department of Computer Science is studying and developing theoretical and algorithmic approaches to support an intelligent text analysis tool that identifies entities of interest and their different mentions, and traces entities and the information relevant to them within and across documents.
For example, an interactive open-domain question answering system, given a question like "When was President Kennedy born?", searches a large collection of documents and newspaper articles in order to come up with the (correct) short answer: "on May 29, 1917." The sentence that holds this answer, and even the document that contains it, may not contain the name "President Kennedy"; it may refer to this entity as "Kennedy," "JFK," or "John Fitzgerald Kennedy." On the other hand, other documents may say that "John F. Kennedy Jr. was born on November 25, 1960," but this fact refers to the target entity's son. Other mentions, such as "Senator Kennedy" or "Mrs. Kennedy," are even closer to the "writing" of the target entity, but clearly refer to different entities. Even the statement "John Kennedy, born 5-29-1941" turns out to refer to a different entity, as one can tell by observing that it appears in a document that discusses Kennedy's batting statistics. Similar issues exist for many categories, such as names of locations and organizations.
Reading and understanding text well enough to identify entities of interest, and to trace them and the information relevant to them across documents, requires the ability to disambiguate at several levels, to abstract away details, and to use background knowledge in a variety of ways. One of the key difficulties, one that humans resolve instantaneously and unconsciously, is that of reading names. Most names, whether of people, locations, organizations or others, may have different "writings" that are used freely within and across documents.
This research views the technical problems from the perspective of a question-answering task and will address the following issues:
1. Entity recognition: Identify different types of entities and categories in text (e.g., this phrase represents the name of an organization; this phrase represents a location).
2. Entity identity: Determine whether name mentions A and B (typically occurring in different documents, or in a question and a document) refer to the same entity. This problem requires identifying both when different mentions refer to the same entity and when very similar or identical mentions refer to different entities.
3. Name expansion: Given a writing of a name (say, in a question), find the k most likely writings of the same name. This is important when dealing with a large collection of documents, or the Web, since the target entity may occur there in a different form than in the query. This is a difficult problem that may include dealing with titles (which change with time), context, etc.
4. Prominence: Given a question such as "Where was Poe born?" and a large collection of documents that, necessarily, contains several "Poes," there is a need to identify the prominent "Poe," perhaps given some context variables that restrict the range of candidates.
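To make the name expansion task concrete, here is a minimal Python sketch. It is a toy heuristic and not the method developed in this research: the function name `expand_name`, its parameter `k`, and the fixed ordering of candidates are all assumptions made for illustration. It generates a few surface variants of a personal name from its tokens, whereas a real system would rank the writings by likelihood and use titles and context.

```python
def expand_name(full_name, k=5):
    """Return up to k candidate writings of a personal name.

    Hypothetical helper for illustration only: the candidates come from
    simple token rules, in a fixed order, not from a learned model.
    """
    parts = full_name.split()
    if len(parts) < 2:
        # A single token (e.g. "Poe") has no obvious variants here.
        return parts[:k]

    first, last = parts[0], parts[-1]
    middle = parts[1:-1]

    candidates = [full_name]
    if middle:
        # Abbreviate middle names: "John Fitzgerald Kennedy" -> "John F. Kennedy"
        initials = " ".join(m[0] + "." for m in middle)
        candidates.append(f"{first} {initials} {last}")
        # Drop middle names entirely: "John Kennedy"
        candidates.append(f"{first} {last}")
    # Last name only: "Kennedy"
    candidates.append(last)
    if middle:
        # Initials of all tokens: "JFK"
        candidates.append("".join(p[0] for p in parts).upper())

    # Remove duplicates while preserving order.
    unique = []
    for c in candidates:
        if c not in unique:
            unique.append(c)
    return unique[:k]


print(expand_name("John Fitzgerald Kennedy"))
```

Surface expansion of this kind covers only part of the problem. A variant such as "John Kennedy" or "Kennedy" also matches mentions of other people, which is why the entity identity and prominence issues above have to be addressed together with it.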