|Abstract (English)|| |
The Information Age emerged due to exceptionally efficient digital storage and exchange of texts. The quantity of texts published daily in all areas of human society exceeds any individual's capacity to consume them in the traditional way. Visualization, a computer-based approach, helps reduce this gap by enabling people to discover knowledge in very large text collections. The research on text visualization presented in this dissertation is motivated by the fact that textual information often includes a temporal dimension, and by the fact that today's digitally available collections contain documents spanning long periods of time. This makes it possible to perform analyses that aim to discover temporal changes and constants in text content. During this research, CatViz, a novel visualization method, was designed and investigated. The CatViz method is based on correspondence analysis and can be used to visualize temporal changes in the content of a text collection. In order to use the CatViz method on text collections, text representation features were constructed using natural language processing methods. Since the CatViz method displays properties of both the semantic space approach and the term trend approach to text visualization, it is considered a fusion of the two. To illustrate the CatViz method, three case studies are presented in this dissertation. A very efficient visualization system was developed to investigate the capabilities of the CatViz method and to conduct an empirical evaluation. A user-oriented evaluation methodology was designed and used to evaluate the CatViz method. The usefulness of the CatViz method was demonstrated on tasks involving the analysis of large news text collections. When examining related work, it is easily recognized that creating successful text visualization methods or systems requires knowledge from many research fields.
First, texts are represented with features by means of information extraction techniques developed within the fields of natural language processing and computational linguistics. To draw a display, methods of multivariate statistics and computer engineering are used. Finally, good visualizations exhibit efficient interfaces, which are enhanced by work in the fields of human-computer interaction, cognitive and perceptual psychology, design, and aesthetics. The final result of a visualization is an increase in users' knowledge, so real users participate in empirical evaluations during which subjectivity has to be controlled. For that reason, evaluation approaches are drawn from the social sciences. It is desirable to evaluate visualizations on real data, with real users solving real analytical tasks. The corpus of related work shows a rising trend in research on temporal text visualization methods. A clear aspiration stands out in the most recent works: researchers aim to simultaneously display many text aspects such as topics, names, events, and time. The CatViz method, an extension of correspondence analysis, enables the analysis of any multivariate data with a temporal dimension. The method is scalable and enables efficient visualization of very large data sets. Besides the calculation and interpretation of CatViz plots, procedures for applying the method to text analysis tasks are explained. In this dissertation, text features based on named entity recognition, topic modeling, and clustering are proposed and explored. Feature construction is motivated by the concept of complete reporting, which seeks to answer the basic questions: Who?, Where?, When?, What?, Why?, and How? Also, to improve the robustness of the CatViz method, a smoothing option in the calculation of CatViz was investigated.
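For orientation, the correspondence analysis on which CatViz is based can be sketched on a small table of feature counts per time period. This is a minimal sketch of plain correspondence analysis using NumPy, not the CatViz extension itself. The function name, the toy counts, and the additive `smoothing` parameter are assumptions made for illustration, and the smoothing investigated in the dissertation may be defined differently.

```python
import numpy as np

def correspondence_analysis(counts, smoothing=0.0):
    """Plain correspondence analysis of a feature-by-period count table.

    `smoothing` is a hypothetical additive constant (an assumption, not
    necessarily the smoothing option investigated for CatViz).
    Every row and column must have a positive total.
    """
    N = np.asarray(counts, dtype=float) + smoothing
    P = N / N.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses (features)
    c = P.sum(axis=0)                    # column masses (time periods)
    expected = np.outer(r, c)            # independence model
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    # Principal coordinates of rows and columns
    row_coords = (U * sigma) / np.sqrt(r)[:, None]
    col_coords = (Vt.T * sigma) / np.sqrt(c)[:, None]
    inertia = sigma ** 2                 # inertia explained per dimension
    return row_coords, col_coords, inertia

# Toy data (hypothetical): rows are text features, columns are time periods.
counts = [
    [30, 10, 2],
    [5, 25, 8],
    [1, 6, 40],
    [12, 12, 12],
]
rows, cols, inertia = correspondence_analysis(counts)
```

Plotting the first two columns of `rows` and `cols` in one scatter plot gives the usual correspondence analysis map, in which features lie near the time periods where they are over-represented.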
In order to show the possibilities of the CatViz method, this dissertation presents case studies which include examples of plot interpretation, collection exploration, data restriction, source comparison, seasonality visualization, and text feature choice. It is shown that, with the CatViz method, clustering methods can be used to construct representations which show important events, as well as to emphasize features with constant temporal distributions. Furthermore, the case studies reveal an interesting point: strong seasonality in content can clearly be seen on CatViz plots. An example of comparing two large non-parallel corpora written in different languages that describe the same events confirms the robustness of the CatViz method. During this research, the CatViz System was developed to enable the evaluation of the CatViz method and the proposed text representation features. The CatViz System implements two visualization methods, intuitive text selection, easy parameter setting, display of features' temporal distributions, and reading access to texts in an advanced display. Two natural language processing tasks were solved in order to use the defined text features with the CatViz visualization. First, a rule-based named entity linking module for English was developed. It operates on rules for matching different name forms, supplemented by the frequencies of those names in the analyzed corpus. Second, a methodology and a program for manual labeling of topics were developed. The description of the architecture and functionality of the CatViz System illustrates the complexity of a development path that starts from a theoretical visualization method and ends in a production-ready visualization system. A few important conclusions are drawn. Firstly, visualization systems must have a good interface with intuitive interaction.
Secondly, because the sizes of available data are practically unlimited, computational complexity and the selection of data structures are very important considerations during initial method choice and system design. Thirdly, while developing a visualization system, communication with the end users is critical, since they give valuable advice and objective judgement on the advantages and drawbacks of a method or a system. Fourthly, the appropriateness of a client-server architecture for visualization systems is confirmed. Two user studies, classified as laboratory experiments with quantitative and qualitative methodologies, were performed using the CatViz System. These studies confirm the usefulness of the CatViz method paired with the proposed text features. The first study shows that users can use the CatViz method to discover and interpret important events in very large news text collections. The second study involved working with real users, on real tasks, using real data. This comparative evaluation shows a strong tendency for the CatViz method to outperform a baseline visualization (the temporal frequency plot) on complex analytical tasks with free-answer questions. For this study, an evaluation methodology was developed and employed which includes manual expert assessment of answers on the criteria of fact coverage, quality of inference, and quality of expression. The users of the CatViz System expressed satisfaction by giving high marks to the attributes of usefulness, intuitiveness, and enabling of knowledge discovery. The CatViz method is seen as adequate for displaying relevant names and for solving tasks where the temporal dimension is important. Among the many useful traits of the CatViz System, the users highlight the use of topic modeling for text representation. This research advances the field of text visualization by enabling individuals to efficiently and objectively discover knowledge in large collections.
The CatViz method is important because it enables both high-level overviews and detailed inspection, giving users the capability to explore millions of texts at a time and bringing us closer to an objective view of history and contemporary affairs. It is believed that the CatViz method will enrich historical research on text archives, media research on contemporary sources, and knowledge discovery from all other text collections.
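The temporal frequency plot, used as the baseline visualization in the second user study, is built from counts of a feature per time period. A minimal sketch of that counting step follows, assuming documents are given as (period, text) pairs and using naive whitespace tokenization. The function name and the toy documents are hypothetical and are not taken from the CatViz System.

```python
def temporal_frequencies(docs, term):
    """Count, per time period, the documents that contain `term`.

    `docs` is an iterable of (period, text) pairs. Matching is
    case-insensitive on whitespace-separated tokens, a simplification
    assumed here for illustration.
    """
    counts = {}
    for period, text in docs:
        counts.setdefault(period, 0)     # keep periods with zero matches
        if term.lower() in text.lower().split():
            counts[period] += 1
    return dict(sorted(counts.items()))

# Toy collection (hypothetical).
docs = [
    ("2008", "election results announced"),
    ("2008", "the election campaign ends"),
    ("2009", "markets react after election"),
    ("2009", "storm hits the coast"),
]
election_trend = temporal_frequencies(docs, "election")
storm_trend = temporal_frequencies(docs, "storm")
```

Each returned dictionary maps a period to a count, which can be drawn as one line of a frequency-over-time plot.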