|Abstract (English)|| |
As our society becomes increasingly digital, there is a growing number of textual information sources (e.g., breaking news, investigative stories, police reports, tweets, historical texts, electronic health records) that are filled with descriptions of events. The ability to automatically extract and analyse events from text is now more important than ever, with applications that range from security and intelligence to journalism, media analysis, and historical research. Efficiently satisfying event-oriented user information needs requires precise extraction of event-related information, which is a very demanding task considering the complexity, vagueness, and ambiguity of natural language. In text, real-world events are represented by so-called linguistic events, or event mentions. Due to the ambiguity and vagueness of natural language, the mapping of real-world events and their relations (temporal, causal, etc.) to their linguistic counterparts introduces a loss of information. Event mentions are structured: they consist of event anchors, the words bearing the core meaning of events, and event arguments, the phrases that denote the protagonists and circumstances (e.g., time and location) of events. Documents describing real-world events thus give rise to a structure in which there are relations between different event mentions as well as relations between anchors and arguments within individual event mentions. In this dissertation I have proposed the event graph, a structured representation of event-oriented documents that contains all informationally relevant aspects of real-world events. Vertices of event graphs denote individual event mentions extracted from text, whereas edges denote semantic relations that hold between event mentions. Although, model-wise, event graphs allow for any semantic relation between events, temporal relations have been considered in particular due to the inherent temporal aspect of events.
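To make the structure described above concrete, the following is a minimal sketch of an event graph as a data structure. It is an illustration under stated assumptions, not the dissertation's implementation: the class names `EventMention` and `EventGraph`, the argument role names, and the `BEFORE` relation label are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EventMention:
    """One vertex: an anchor word plus its (role, phrase) arguments."""
    anchor: str            # word bearing the core meaning of the event
    arguments: tuple = ()  # hypothetical roles, e.g. ("time", "on Monday")


@dataclass
class EventGraph:
    """Vertices are event mentions; edges carry a semantic relation label."""
    mentions: list = field(default_factory=list)
    relations: dict = field(default_factory=dict)  # (source, target) -> label

    def add_mention(self, mention):
        self.mentions.append(mention)
        return len(self.mentions) - 1  # index of the new vertex

    def add_relation(self, source, target, label):
        size = len(self.mentions)
        if not (0 <= source < size and 0 <= target < size):
            raise IndexError("relation endpoints must be existing vertices")
        self.relations[(source, target)] = label


graph = EventGraph()
robbery = graph.add_mention(
    EventMention("robbery", (("location", "in Zagreb"),))
)
arrest = graph.add_mention(
    EventMention("arrested", (("agent", "police"), ("time", "on Monday")))
)
# Assumed temporal relation label: the robbery precedes the arrest.
graph.add_relation(robbery, arrest, "BEFORE")
```

Storing relations as a mapping from vertex pairs to labels keeps the sketch open to relation types other than temporal ones, in line with the model allowing any semantic relation between events.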
Based on the event graph model, a fully automated procedure for constructing event graphs has been developed. Automated construction of event graphs includes four different information extraction models: (1) a supervised model for extracting event anchors, (2) a rule-based model for extracting event arguments, (3) a supervised model for extracting temporal relations between events, and (4) a supervised model for resolving coreference of event mentions. The models for extracting event anchors and temporal relations between event mentions are logistic regression models based on a rich set of lexical, syntactic, and semantic features. The argument extraction model is based on a set of syntactic extraction patterns and semantic disambiguation rules. The event coreference resolution model is a support vector machine model with a set of numeric features indicating the similarities between anchors, and between arguments with matching roles, of two event mentions. Each of the four models was thoroughly evaluated intrinsically using standard evaluation metrics: precision, recall, and F-score. Two novel metrics for evaluating the overall quality of the automated construction process have been proposed and empirically validated, and the overall quality of automatically constructed event graphs has been measured using these metrics. In order to develop and evaluate the information extraction models included in the construction of event graphs, a large corpus, named EvExtra, manually annotated with factual event mentions has been compiled. EvExtra is currently the largest corpus manually annotated with event-oriented information. It is approximately three times larger than the TimeBank corpus, which has typically been used in event extraction tasks. Comparison of documents describing real-world events is performed by comparing their corresponding event graphs.
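The standard metrics used for the intrinsic evaluation of the extraction models can be sketched as follows. This is a generic illustration of precision, recall, and F-score over sets of extracted items. The function name and the toy anchor sets are assumptions, and the example does not reproduce any result from the dissertation.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall, and balanced F-score over sets of extracted items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data: (sentence index, anchor word) pairs, invented for illustration.
gold_anchors = {(0, "arrested"), (0, "robbery"), (1, "fled")}
predicted_anchors = {(0, "arrested"), (1, "fled"), (1, "said")}

p, r, f = precision_recall_f1(predicted_anchors, gold_anchors)
```

Here two of the three predicted anchors are correct and two of the three gold anchors are found, so precision, recall, and F-score all equal 2/3.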
An innovative method for efficient comparison of event graphs, based on semantic extensions of graph kernels, has been designed and implemented. Two different graph kernels, the product graph kernel and the weighted decomposition kernel, have been semantically extended to account for event-specific semantics. Efficient information retrieval models based on the construction and comparison of event graphs have been proposed and evaluated on several information retrieval tasks. Experimental results show that the retrieval models based on event graphs and graph kernels outperform traditional retrieval models, such as vector space models, language models, and probabilistic models, which represent documents in an unstructured fashion (i.e., as bags of words). The usefulness of the structured event-centered document representation has been additionally verified on two different natural language processing tasks: multi-document summarization and text simplification. A novel algorithm for multi-document summarization, which exploits the event-oriented information and temporal structure contained in event graphs, has been developed. This event-based multi-document summarization algorithm outperforms competitive methods on standard summarization datasets. The algorithm for automated simplification of news stories eliminates all content not relating to event mentions and transforms individual event mentions into separate sentences in the simplified text. Human evaluation shows that texts produced with this simplification method are highly grammatical and contain only the most relevant information from the original text. The research covered in the dissertation focused on texts written in English. Although the event graph formalism itself is language independent, some parts of the models used for automated construction of event graphs are language dependent. Adapting the graph construction pipeline to another language is possible, although not an easy task.
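The idea behind comparing event graphs with a product graph kernel can be illustrated with a small sketch. It rests on several assumptions that are not taken from the dissertation: vertices are matched by exact anchor equality as a stand-in for semantic matching, edges are matched by equal relation labels, and the kernel is a truncated, decayed count of common walks rather than a closed-form solution. The names `product_graph` and `walk_kernel` are hypothetical.

```python
from itertools import product


def product_graph(g1, g2, same_vertex, same_edge):
    """Build the product graph of two graphs given as (vertices, edges).

    vertices: list of labels; edges: dict mapping (i, j) -> relation label.
    A product node pairs two matching vertices; a product edge exists when
    both graphs connect the paired vertices with matching relations.
    """
    v1, e1 = g1
    v2, e2 = g2
    nodes = [(i, j) for i, j in product(range(len(v1)), range(len(v2)))
             if same_vertex(v1[i], v2[j])]
    index = {node: k for k, node in enumerate(nodes)}
    adj = [[0] * len(nodes) for _ in nodes]
    for (a, b), label1 in e1.items():
        for (c, d), label2 in e2.items():
            if (a, c) in index and (b, d) in index and same_edge(label1, label2):
                adj[index[(a, c)]][index[(b, d)]] = 1
    return nodes, adj


def walk_kernel(adj, decay=0.5, max_len=3):
    """Sum of decay**k * (number of walks of length k) for k = 0..max_len."""
    n = len(adj)
    walks = [1.0] * n   # decayed count of walks ending at each node
    total = float(n)    # walks of length 0: one per product node
    for _ in range(max_len):
        walks = [decay * sum(adj[i][j] * walks[i] for i in range(n))
                 for j in range(n)]
        total += sum(walks)
    return total


def exact(x, y):
    return x == y


# Toy graphs: anchors as vertex labels, assumed "BEFORE" temporal edges.
doc_a = (["robbery", "arrested", "charged"],
         {(0, 1): "BEFORE", (1, 2): "BEFORE"})
doc_b = (["robbery", "arrested"], {(0, 1): "BEFORE"})

_, adj_ab = product_graph(doc_a, doc_b, exact, exact)
_, adj_aa = product_graph(doc_a, doc_a, exact, exact)
similarity_ab = walk_kernel(adj_ab)
similarity_aa = walk_kernel(adj_aa)
```

In this toy case the two documents share two event mentions and one temporal relation, so the cross-document score is lower than the score of the larger graph compared with itself. A semantic extension would replace `exact` with graded matching functions for event mentions and relations.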
One of the main directions of future work is adapting the automated graph construction pipeline to Croatian. This dissertation lays the foundation for structured event-based document analysis and uncovers many interesting directions for future research. Event graphs can be extended conceptually by considering relations between event mentions other than temporal relations (e.g., causality, subordination). Event graphs could also be applied to other natural language processing tasks (e.g., question answering) and other text domains (e.g., biographies). Finally, I envisage a formal framework, based on event graphs, that would enable the modeling of events in a continuous event space spanning from linguistic events at the lowest level to topics at the highest level. Such an event graph-based framework would enable a uniform and elegant treatment of both events and topics for the purpose of event-based document analysis.