Large Scale Streaming Text Analytics and Visualization System
The rise of the information society is continually transforming the way people communicate, reason, conduct business, accumulate influence, and pursue collective goals. The task of the information analyst is to observe, discern, and report on patterns of events that can explain why or how; more clearly describe what is; and predict when and what is to come. The patterns of events of interest are often hidden in the vast quantity of information, in its myriad forms and modes of transfer. These patterns are also obscured by the ever-quickening pace of information flows, including an ever-growing abundance of noise such as spam mail. Added to these challenges are intelligent adversaries who regularly exploit these complicating characteristics to avoid detection or propagate misinformation.

Effective and efficient pattern mining and visualization for large streams is a critical research and development initiative. The goal of this project is to develop a novel text mining and visualization system that combines low-level semantic text parsing, used to formulate and extract entities, events, and relationships, with the latest stream graph mining research, used to find hidden event patterns and predict relationships between patterns across large and heterogeneous textual information flows. These algorithms will be optimized to approach “one look” capabilities. The visualizations will be dynamic and near real-time, allowing exploration, discovery, prediction, and monitoring of event patterns.
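
As a rough illustration of what a single-pass ("one look") approach to relationship mining over a stream could look like, the sketch below keeps co-occurrence counts between entities for only the most recent documents. It is not the project's algorithm: the class name `CooccurrenceWindow`, its methods, and the sample entities are hypothetical, and entity extraction is assumed to have happened upstream.

```python
from collections import Counter, deque
from itertools import combinations


class CooccurrenceWindow:
    """Co-occurrence counts over the most recent `size` documents.

    Hypothetical sketch: each document is seen once, and its contribution
    is removed again when it slides out of the window.
    """

    def __init__(self, size):
        self.size = size
        self.window = deque()   # entity pairs of each document still in the window
        self.edges = Counter()  # (entity_a, entity_b) -> co-occurrence count

    def add(self, entities):
        # Sort so that each unordered pair always maps to the same key.
        pairs = list(combinations(sorted(set(entities)), 2))
        self.window.append(pairs)
        self.edges.update(pairs)
        if len(self.window) > self.size:
            # Expire the oldest document's pairs.
            for pair in self.window.popleft():
                self.edges[pair] -= 1
                if self.edges[pair] == 0:
                    del self.edges[pair]

    def strongest(self, n):
        """Return the n most frequent entity pairs currently in the window."""
        return self.edges.most_common(n)


# Sample stream of already-extracted entity lists (made-up data).
stream = [["NCSA", "Illinois"], ["NCSA", "Illinois", "D2K"], ["D2K", "Java"]]
w = CooccurrenceWindow(size=2)
for doc in stream:
    w.add(doc)
```

Bounding the window keeps memory proportional to recent traffic rather than to the whole stream, at the cost of forgetting older relationships.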

The system will be based on the Data-to-Knowledge™ (D2K™) dataflow framework, a rapid, flexible data mining and machine learning system that integrates analytical data mining methods for prediction, discovery, and deviation detection with data and information visualization tools. It offers a visual programming environment that allows users to connect programming modules together to build data mining applications, and it supplies a core set of modules, application templates, and a standard API for software component development. All D2K™ components are written in Java for maximum flexibility and portability.
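
The idea of connecting modules into a dataflow can be sketched in a few lines. The example below is only a conceptual illustration in Python and does not use or reproduce the D2K™ API, which is Java-based; the module names (`tokenize`, `drop_short`, `count_terms`) and the `connect` helper are hypothetical.

```python
def tokenize(lines):
    """Module: turn each line of text into a list of lowercase tokens."""
    for line in lines:
        yield line.lower().split()


def drop_short(token_lists, min_len=4):
    """Module: discard tokens shorter than min_len characters."""
    for tokens in token_lists:
        yield [t for t in tokens if len(t) >= min_len]


def count_terms(token_lists):
    """Module: accumulate term frequencies across all token lists."""
    counts = {}
    for tokens in token_lists:
        for t in tokens:
            counts[t] = counts.get(t, 0) + 1
    return counts


def connect(source, *modules):
    """Wire modules together: the output of each one feeds the next."""
    data = source
    for module in modules:
        data = module(data)
    return data


result = connect(
    ["Streams of text", "text streams flow"],
    tokenize,
    drop_short,
    count_terms,
)
```

Because the first two modules are generators, items flow through the chain one at a time, which loosely mirrors how a dataflow environment passes data between connected components.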

The visualization approach aims to make these large, fast-moving streams manageable and adaptable to the changing needs of the analyst. A variety of visualization techniques, such as interactive filtering, distortion, and linking and brushing, will be deployed in this environment to help analysts discern alarming or interesting patterns across information flows. Users will be able to generate visualizations of event patterns and correlations, explore them in multiple views, and gain insight into directly and indirectly linked events and entities. The user interface will be web-driven, permitting flexibility in the machine and data management resources that can be used for experiments in scale and interoperability.

Project Leads
Loretta Auvil, NCSA
Duane Searsmith, NCSA
Michael Welge, NCSA


