The Internet gives us access to multimedia databases with billions of objects. This massive increase in the amount of data available to researchers is changing the face of multimedia. In many domains, most notably speech recognition, people have observed that the best way to improve their algorithms’ performance is to add more data. Starting with hidden Markov models (HMMs) and support vector machines, people have applied ever greater amounts of data to their problems and been rewarded with new levels of performance.
What are the new algorithms and ideas that are necessary to work with such large databases of imagery, video, music, speech, and text? How do we define the scope of a problem, and how do we apply modern clusters of processors to these problems? What does it take to collect, manage and deliver solutions with millions of objects and petabytes of data?
In this tutorial we present a range of algorithms and tools that make it easier to scale our work to Internet-sized collections of multimedia. The tutorial starts with an overview of, and pointers to, the tools that allow attendees to scale their work to modern datasets. It then discusses the theoretical and practical problems with large data, applications where large amounts of data are important to consider, types of algorithms that are practical with such large datasets, and examples of implementation techniques that make these algorithms practical. Many real-world examples and results illustrate the tutorial.
More specifically, our tutorial is divided into four types of web-scale processing options: simplified algorithms, randomized algorithms, parallel algorithms, and implementations. Web-scale multimedia has brought a resurgence of interest in simple algorithms that can work with large data. These include streaming algorithms that do not try to operate on all the data at once, and decision trees that do not seek a global optimum. Randomized algorithms are important because our data is large and redundant: we can often choose random directions (from the data) and come close to an optimum solution. A new set of algorithms is useful on the large parallel clusters of machines that are now available; rearranging an algorithm so that it can run on parallel computers makes a big difference. Finally, a new set of implementation techniques, such as parallel file systems, NoSQL databases, and the map/reduce and MPI frameworks, has made all these ideas realistic.
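As one illustration of the streaming idea described above, the following is a minimal sketch, not taken from the tutorial materials, of Welford's online algorithm for the mean and variance. It sees each value once and keeps only three numbers in memory, so it never needs the whole dataset at once. The class name `RunningStats` and its method names are hypothetical, chosen only for this example.

```python
class RunningStats:
    """Single-pass (streaming) mean and population variance.

    Hypothetical helper for illustration; uses Welford's update rule.
    """

    def __init__(self):
        self.count = 0      # number of values seen so far
        self.mean = 0.0     # running mean
        self._m2 = 0.0      # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new value into the running statistics."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        # Uses the old delta and the freshly updated mean.
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of the values seen so far (0.0 if empty)."""
        return self._m2 / self.count if self.count else 0.0


if __name__ == "__main__":
    stats = RunningStats()
    # Any iterable works here, including a generator over a file or socket.
    for value in [2, 4, 4, 4, 5, 5, 7, 9]:
        stats.update(value)
    print(stats.count, stats.mean, stats.variance)
```

Because `update` touches only constant-size state, the same loop could in principle consume a stream far larger than memory. Merging the statistics of several machines would need an additional combine step, which this sketch omits.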
This tutorial is based on successful tutorials that the presenters offered at previous conferences. Dr. Slaney presented a tutorial titled “MIR at the Scale of the Web,” which was attended by more than 100 people at the ISMIR 2009 meeting (International Society for Music Information Retrieval). Dr. Chang presented successful tutorials on “Large-scale Data Mining” at CIKM in 2009 and “Parallel Algorithms for Mining Large-scale Multimedia Datasets” at ACM MM 2009. This new tutorial combines their material and is taught by researchers from two of the largest Internet (and multimedia) companies in the world.
Audience
This tutorial is aimed at technical students, researchers, and others interested in the large-scale processing algorithms and tools needed to handle today’s multimedia-processing needs. Our goal is to introduce techniques to the ACM Multimedia community that will encourage new research and development on the large-scale databases available to today’s consumer and research communities.
Outline
The problem
  Size and curse of dimensionality
  Machine-learning issues (why more data, etc.)
  Applications
Algorithm Simplifications
  Boosted Decision Trees
  Streaming Algorithms
Randomized Algorithms
  Theory
  MinHash
  LSH
Parallel Algorithms
  PSVM
  PLDA*
  Spectral Clustering
  Combinational Collaborative Filtering
Implementations
  Map/Reduce
  MPI
  Google FS & Yahoo Sherpa
New Directions
  Tag ambiguity instead of tag suggestion
  Inverse ESP
  Deep vs. Shallow Learning Architecture
  Model-based vs. Data-driven models
  DMD: Deep Model-based and Data-driven Hybrid Model
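To give a flavor of the randomized algorithms listed in the outline, here is a minimal MinHash sketch. It is an illustration under stated assumptions, not the presenters' implementation. It estimates the Jaccard similarity of two sets from short signatures instead of comparing the full sets. All function names, the choice of character 3-gram shingles, and the linear hash family `(a*x + b) mod PRIME` are assumptions made for this example.

```python
import random
import zlib

# Mersenne prime, larger than any 32-bit CRC value (assumed modulus).
PRIME = (1 << 61) - 1


def shingles(text, k=3):
    """Return the set of overlapping k-character substrings of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs defining hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(items, params):
    """For each hash function, keep the smallest hash value over the set.

    Assumes `items` is a non-empty set of strings.
    """
    # Map each string to a deterministic integer ID first.
    ids = [zlib.crc32(item.encode("utf-8")) for item in items]
    return [min((a * x + b) % PRIME for x in ids) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which two signatures agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


if __name__ == "__main__":
    params = make_hash_params(128)
    sig_a = minhash_signature(shingles("web-scale multimedia data"), params)
    sig_b = minhash_signature(shingles("web-scale multimedia processing"), params)
    print(estimate_jaccard(sig_a, sig_b))
```

The signature length (128 here) trades accuracy for storage: more hash functions give a tighter estimate at the cost of a longer signature per object. Both signatures must be built with the same `params`, or the comparison is meaningless.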
Organizers/Presenters
Dr. Malcolm Slaney is a Principal Scientist at Yahoo! Research, where he has been working on multimedia analysis and music- and image-retrieval algorithms in databases with billions of items. He is a Fellow of the IEEE and an Associate Editor of IEEE Transactions on Audio, Speech and Signal Processing and IEEE Multimedia Magazine. He has given successful tutorials at ICASSP 1996 and 2009 on “Applications of Psychoacoustics to Signal Processing” and on “Multimedia Information Retrieval” at SIGIR and ICASSP. He is a coauthor, with A. C. Kak, of the IEEE book “Principles of Computerized Tomographic Imaging.” This book was recently republished by SIAM in their “Classics in Applied Mathematics” series. He is coeditor, with Steven Greenberg, of the book “Computational Models of Auditory Function.” Before Yahoo!, Dr. Slaney worked at Bell Laboratory, Schlumberger Palo Alto Research, Apple Computer, Interval Research, and IBM’s Almaden Research Center. For the last several years he has led the auditory group at the Telluride Neuromorphic Workshop. He is a (consulting) Professor at Stanford CCRMA, where he has led the Hearing Seminar for the last 20 years.
Dr. Edward Chang has headed Google Research in China since March 2006. He joined the Department of Electrical & Computer Engineering at the University of California, Santa Barbara, in 1999 after receiving his PhD from Stanford University. Ed received tenure in 2003 and was promoted to full professor of Electrical Engineering in 2006. His recent research activities are in the areas of distributed data mining and its applications to rich-media data management and social-network collaborative filtering. His research group (which consists of members from Google, UC, MIT, Tsinghua, PKU, and Zheda) recently parallelized SVMs (NIPS 07), PLSA (KDD 08), Association Mining (ACM RS 08), Spectral Clustering (ECML 08), and LDA (WWW 09) to run on thousands of machines for mining large-scale datasets (see MMDS/CIVR/EMMDS/AAIM/ADMA/CIKM keynote slides for details). His team at Google developed and launched Google Confucius (a Q&A system) in China, Russia, Thailand, and 17 Arabic countries. Ed has served on ACM (SIGMOD, KDD, MM, CIKM), VLDB, IEEE, WWW, and SIAM conference program committees, and co-chaired several conferences including MMM, ACM MM, ICDE, and WWW. Ed is a recipient of the IBM Faculty Partnership Award and the NSF Career Award.
T01 – Processing Web-Scale Multimedia Data