Term-Document Matrix Analysis using Generalized Haar-Walsh Wavelet Packet Dictionaries
PDE and Applied Math Seminar
Speaker: | Naoki Saito, UC Davis |
Location: | 1447 MSB |
Start time: | Fri, May 27 2016, 4:10PM |
Many modern data analysis tasks require one to efficiently handle and analyze large matrix-form datasets, such as term-document matrices. Since the rows and columns of such matrices are often shuffled and scrambled, they lack the spatial coherency and smoothness that ordinary images and photographs possess; consequently, conventional wavelets and their relatives cannot be used in practice. Instead, we propose to use our multiscale basis dictionaries for graphs, i.e., the Generalized Haar-Walsh Transform (GHWT), which is a true generalization of the conventional Haar-Walsh wavelet packet transform to the graph setting. In particular, we build such dictionaries for the columns and the rows, extract the column and row best bases from these dictionaries, and construct their tensor product, which turns out to reveal hidden dependencies and the underlying geometric structure in the given matrix data. We demonstrate the effectiveness of our approach using the Science News database, which consists of the relative frequencies of occurrence of 1153 words over 1042 documents categorized into eight subjects: Anthropology; Astronomy; Behavioral Sciences; Earth Sciences; Life Sciences; Math/CS; Medicine; and Physics. Finally, we will also examine the capability of the GHWT dictionary as a whole for classifying the documents into those eight categories using sparse multiclass logistic regression, which can be solved efficiently by the popular lasso method. [Joint work with Jeff Irion and Eugene Shvarts]
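To make the tensor-product idea concrete, here is a minimal sketch in Python. It is not the GHWT itself and not the speakers' code. It assumes the rows and columns have already been reordered so that recursive halving of the index range is a sensible hierarchical partition; the GHWT instead works from recursive partitions of graphs. It builds only a single Haar-like basis per dimension, rather than a full dictionary with best-basis selection. The names `haar_basis`, `analyze`, and `synthesize`, and the toy 8 x 6 matrix, are hypothetical.

```python
import math

def haar_basis(n):
    """Orthonormal Haar-like basis on indices 0..n-1 (assumed pre-ordered).

    Built from recursive halving of the index range; each inner list is
    one basis vector. This is a simplification, not the GHWT dictionary.
    """
    vectors = [[1.0 / math.sqrt(n)] * n]  # global scaling (constant) vector
    stack = [(0, n)]
    while stack:
        lo, hi = stack.pop()
        if hi - lo < 2:
            continue  # a single index cannot be split further
        mid = (lo + hi) // 2
        n1, n2 = mid - lo, hi - mid
        # Weights chosen so the vector has unit norm and sums to zero.
        a = math.sqrt(n2 / (n1 * (n1 + n2)))
        b = -math.sqrt(n1 / (n2 * (n1 + n2)))
        vectors.append(
            [a if lo <= i < mid else b if mid <= i < hi else 0.0 for i in range(n)]
        )
        stack.extend([(lo, mid), (mid, hi)])
    return vectors

def transpose(X):
    return [list(col) for col in zip(*X)]

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def analyze(A, Br, Bc):
    """Tensor-product coefficients: Br * A * Bc^T."""
    return matmul(matmul(Br, A), transpose(Bc))

def synthesize(C, Br, Bc):
    """Inverse of analyze for orthonormal bases: Br^T * C * Bc."""
    return matmul(matmul(transpose(Br), C), Bc)

# Toy "term-document" matrix with a 2 x 2 block structure (made-up values).
A = [[1.0] * 3 + [2.0] * 3 for _ in range(4)] + \
    [[3.0] * 3 + [4.0] * 3 for _ in range(4)]

Br = haar_basis(8)   # basis for the row (term) index
Bc = haar_basis(6)   # basis for the column (document) index
C = analyze(A, Br, Bc)

# Positions of coefficients that are not numerically zero.
significant = [(i, j) for i in range(8) for j in range(6) if abs(C[i][j]) > 1e-10]
A_rec = synthesize(C, Br, Bc)
```

In this toy case the block structure aligns with the coarsest split in each dimension, so the nonzero coefficients are confined to the two coarsest row vectors paired with the two coarsest column vectors, and `synthesize` recovers `A` up to rounding. Real term-document matrices have no such known ordering, which is why the abstract's approach builds the partitions and selects the bases from the data.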