Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, July 25-28, 2010. ACM 【DBLP Link】
【Paper Link】 【Pages】:7-16
【Authors】: David Chan ; Rong Ge ; Ori Gershony ; Tim Hesterberg ; Diane Lambert
【Abstract】: Display ads proliferate on the web, but are they effective? Or are they irrelevant in light of all the other advertising that people see? We describe a way to answer these questions, quickly and accurately, without randomized experiments, surveys, focus groups or expert data analysts. Doubly robust estimation protects against the selection bias that is inherent in observational data, and a nonparametric test that is based on irrelevant outcomes provides further defense. Simulations based on realistic scenarios show that the resulting estimates are more robust to selection bias than traditional alternatives, such as regression modeling or propensity scoring. Moreover, computations are fast enough that all processing, from data retrieval through estimation, testing, validation and report generation, proceeds in an automated pipeline, without anyone needing to see the raw data.
【Keywords】: ad effectiveness; display ads; doubly robust estimation; irrelevant outcomes; observational data; propensity score; selection bias
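【Illustrative Sketch】: The abstract above relies on doubly robust estimation; below is a minimal numpy sketch of the standard doubly robust (augmented inverse-propensity-weighted) effect estimator, assuming pre-fitted propensity and outcome models. Variable names are illustrative assumptions, and the authors' full pipeline (irrelevant-outcome testing, automated reporting) is not reproduced.

```python
import numpy as np

def doubly_robust_effect(y, t, e_hat, m1_hat, m0_hat):
    """Textbook doubly robust estimate of an average treatment effect.

    y      : observed outcomes (e.g., conversion indicators)
    t      : exposure indicator (1 = saw the display ad, 0 = did not)
    e_hat  : estimated propensity scores P(t=1 | covariates)
    m1_hat : outcome-model predictions E[y | covariates, t=1]
    m0_hat : outcome-model predictions E[y | covariates, t=0]

    The estimate stays consistent if either the propensity model or the
    outcome model is correct, which is the "double robustness" property.
    """
    y, t = np.asarray(y, float), np.asarray(t, float)
    mu1 = m1_hat + t * (y - m1_hat) / e_hat              # augmented treated mean
    mu0 = m0_hat + (1 - t) * (y - m0_hat) / (1 - e_hat)  # augmented control mean
    return np.mean(mu1 - mu0)

# Hypothetical usage with pre-fitted propensity and outcome models:
# effect = doubly_robust_effect(y, exposed, propensity, pred_exposed, pred_unexposed)
```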
【Paper Link】 【Pages】:17-26
【Authors】: Diane Tang ; Ashish Agarwal ; Deirdre O'Brien ; Mike Meyer
【Abstract】: At Google, experimentation is practically a mantra; we evaluate almost every change that potentially affects what our users experience. Such changes include not only obvious user-visible changes such as modifications to a user interface, but also more subtle changes such as different machine learning algorithms that might affect ranking or content selection. Our insatiable appetite for experimentation has led us to tackle the problems of how to run more experiments, how to run experiments that produce better decisions, and how to run them faster. In this paper, we describe Google's overlapping experiment infrastructure that is a key component to solving these problems. In addition, because an experiment infrastructure alone is insufficient, we also discuss the associated tools and educational processes required to use it effectively. We conclude by describing trends that show the success of this overall experimental environment. While the paper specifically describes the experiment system and experimental processes we have in place at Google, we believe they can be generalized and applied by any entity interested in using experimentation to improve search engines and other web applications.
【Keywords】: a/b testing; controlled experiments; multivariable testing; website testing
【Paper Link】 【Pages】:27-36
【Authors】: Wei Li ; Xuerui Wang ; Ruofei Zhang ; Ying Cui ; Jianchang Mao ; Rong Jin
【Abstract】: The dynamic marketplace in online advertising calls for ranking systems that are optimized to consistently promote and capitalize on better performing ads. The streaming nature of online data inevitably makes an advertising system choose between maximizing its expected revenue according to its current knowledge in the short term (exploitation) and trying to learn more about the unknown to improve its knowledge (exploration), since the latter might increase its revenue in the future. The exploitation and exploration (EE) tradeoff has been extensively studied in the reinforcement learning community; however, it had not received much attention in online advertising until recently. In this paper, we develop two novel EE strategies for online advertising. Specifically, our methods can adaptively balance the two aspects of EE by automatically learning the optimal tradeoff and incorporating confidence metrics of historical performance. Within a deliberately designed offline simulation framework, we apply our algorithms to an industry-leading performance-based contextual advertising system and conduct extensive evaluations with real online event log data. The experimental results and detailed analysis reveal several important findings about EE behaviors in online advertising and demonstrate the superior performance of our algorithms in terms of ad reach and click-through rate (CTR).
【Keywords】: click feedback; click-through-rate; exploitation and exploration; online advertising
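【Illustrative Sketch】: The abstract does not spell out its two EE strategies; the sketch below shows one standard confidence-based exploration rule (an upper confidence bound on estimated CTR) to make the exploitation/exploration trade-off concrete. It is a generic illustration, not the authors' algorithm, and the statistics dictionary is an assumed toy setup.

```python
import math

def ucb_ad_choice(stats, total_impressions, c=1.0):
    """Pick an ad by an upper-confidence-bound rule on estimated CTR.

    stats: dict mapping ad_id -> (clicks, impressions).
    Ads with little history get a wide confidence bonus (exploration);
    ads with a strong, well-estimated CTR win otherwise (exploitation).
    """
    best_ad, best_score = None, -1.0
    for ad_id, (clicks, impressions) in stats.items():
        if impressions == 0:
            return ad_id                      # always try an unseen ad first
        ctr = clicks / impressions
        bonus = c * math.sqrt(2 * math.log(total_impressions) / impressions)
        score = ctr + bonus
        if score > best_score:
            best_ad, best_score = ad_id, score
    return best_ad

# Toy usage:
# stats = {"ad_1": (30, 1000), "ad_2": (2, 40), "ad_3": (0, 0)}
# next_ad = ucb_ad_choice(stats, total_impressions=1040)
```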
【Paper Link】 【Pages】:37-46
【Authors】: Hillol Kargupta ; Kakali Sarkar ; Michael Gilligan
【Abstract】: This paper describes the MineFleet distributed vehicle performance data mining system designed for commercial fleets. MineFleet analyzes high-throughput data streams onboard the vehicle, generates the analytics, sends them to a remote server over wide-area wireless networks, and offers them to fleet managers through stand-alone and web-based user interfaces. The paper describes the overall architecture of the system and the business needs, and shares experience from successful large-scale commercial deployments. MineFleet is probably one of the first commercially successful distributed data stream mining systems. This patented technology has been adopted, productized, and commercially offered by many large companies in the mobile resource management and GPS fleet tracking industry. This paper offers an overview of the system and a detailed analysis of what made it work.
【Keywords】: distributed data mining; telematics; vehicle data stream mining
【Paper Link】 【Pages】:47-56
【Authors】: Santanu Das ; Bryan L. Matthews ; Ashok N. Srivastava ; Nikunj C. Oza
【Abstract】: The world-wide aviation system is one of the most complex dynamical systems ever developed and is generating data at an extremely rapid rate. Most modern commercial aircraft record several hundred flight parameters including information from the guidance, navigation, and control systems, the avionics and propulsion systems, and the pilot inputs into the aircraft. These parameters may be continuous measurements or binary or categorical measurements recorded at one-second intervals for the duration of the flight. Currently, most approaches to aviation safety are reactive, meaning that they are designed to react to an aviation safety incident or accident. In this paper, we discuss a novel approach based on the theory of multiple kernel learning to detect potential safety anomalies in very large databases of discrete and continuous data from world-wide operations of commercial fleets. We pose a general anomaly detection problem which includes both discrete and continuous data streams, where we assume that the discrete streams have a causal influence on the continuous streams. We also assume that atypical sequences of events in the discrete streams can lead to off-nominal system performance. We discuss the application domain, the novel algorithms, and results on real-world data sets. Our algorithm uncovers operationally significant events in high-dimensional data streams in the aviation industry which are not detectable using state-of-the-art methods.
【Keywords】: aeronautics; anomaly detection; prediction; prognostics
【Paper Link】 【Pages】:57-64
【Authors】: Saurabh Goorha ; Lyle H. Ungar
【Abstract】: We describe a system that monitors social and mainstream media to determine shifts in what people are thinking about a product or company. We process over 100,000 news articles, blog posts, review sites, and tweets a day for mentions of items (e.g., products) of interest, extract phrases that are mentioned near them, and determine which of the phrases are of greatest possible interest to, for example, brand managers. Case studies show a good ability to rapidly pinpoint emerging subjects buried deep in large volumes of data and then highlight those that are rising or falling in significance as they relate to the firm's interests. The tool and algorithm improve the signal-to-noise ratio and pinpoint precisely the opportunities and risks that matter most to communications professionals and their organizations.
【Keywords】: brand management; buzz; communication; competitive intelligence; entity extraction; reputation management; sentiment detection; social media; text mining; trend detection
【Paper Link】 【Pages】:65-74
【Authors】: Mohit Kumar ; Rayid Ghani ; Zhu-Song Mei
【Abstract】: Health insurance costs across the world have increased alarmingly in recent years. A major cause of this increase is payment errors made by the insurance companies while processing claims. These errors often result in extra administrative effort to re-process (or rework) the claim, which accounts for up to 30% of the administrative staff in a typical health insurer. We describe a system that helps reduce these errors using machine learning techniques by predicting claims that will need to be reworked and generating explanations to help the auditors correct these claims, and we experiment with feature selection, concept drift, and active learning to collect feedback from the auditors and improve over time. We describe our framework, problem formulation, evaluation metrics, and experimental results on claims data from a large US health insurer. We show that our system yields an order of magnitude better precision (hit rate) than existing approaches, which is accurate enough to potentially result in over $15-25 million in savings for a typical insurer. We also describe interesting research problems in this domain as well as design choices made to make the system easily deployable across health insurance companies.
【Keywords】: claim rework identification; health insurance claims; machine learning
【Paper Link】 【Pages】:75-84
【Authors】: Naoki Abe ; Prem Melville ; Cezar Pendus ; Chandan K. Reddy ; David L. Jensen ; Vince P. Thomas ; James J. Bennett ; Gary F. Anderson ; Brent R. Cooley ; Melissa Kowalczyk ; Mark Domick ; Timothy Gardinier
【Abstract】: The problem of optimally managing the collections process by taxation authorities is one of prime importance, not only for the revenue it brings but also as a means to administer a fair taxing system. The analogous problem of debt collections management in the private sector, such as banks and credit card companies, is also increasingly gaining attention. With the recent successes in the applications of data analytics and optimization to various business areas, the question arises to what extent such collections processes can be improved by use of leading edge data modeling and optimization techniques. In this paper, we propose and develop a novel approach to this problem based on the framework of constrained Markov Decision Process (MDP), and report on our experience in an actual deployment of a tax collections optimization system at New York State Department of Taxation and Finance (NYS DTF).
【Keywords】: business analytics and optimization; constrained markov decision process; debt collection optimization; reinforcement learning
【Paper Link】 【Pages】:85-94
【Authors】: Longbing Cao ; Yuming Ou ; Philip S. Yu ; Gang Wei
【Abstract】: In capital market surveillance, an emerging trend is that a group of hidden manipulators collaborate with each other to manipulate three trading sequences: buy-orders, sell-orders and trades, through carefully arranging their prices, volumes and times, in order to mislead other investors, affect the instrument movement, and thus maximize personal benefits. If the focus is on only one of the above three sequences when attempting to analyze such hidden group-based behavior, or if they are merged into one sequence per investor, the coupling relationships among them indicated through trading actions and their prices/volumes/times would be missing, and the resulting findings would have a high probability of mismatching the genuine business facts. Therefore, typical sequence analysis approaches, which mainly identify patterns on a single sequence, cannot be used here. This paper addresses a novel topic, namely coupled behavior analysis in hidden groups. In particular, we propose a coupled Hidden Markov Model (HMM)-based approach to detect abnormal group-based trading behaviors. The resulting models cater for (1) multiple sequences from a group of people, (2) interactions among them, (3) sequence item properties, and (4) significant change among coupled sequences. We demonstrate our approach in detecting abnormal manipulative trading behaviors on orderbook-level stock data. The results are evaluated against alerts generated by the exchange's surveillance system from both technical and computational perspectives. They show that the proposed coupled and adaptive HMMs outperform a standard HMM modeling only a single sequence, as well as an HMM combining multiple single sequences without considering the coupling relationship. Further work on coupled behavior analysis, including coupled sequence/event analysis, hidden group analysis, and behavior dynamics, is critical.
【Keywords】: abnormal behavior detection; coupled behavior analysis; coupled hidden markov model; coupled sequence analysis; hidden group discovery; market manipulation; sequence change; sequence item property
【Paper Link】 【Pages】:95-104
【Authors】: Yanfang Ye ; Tao Li ; Yong Chen ; Qingshan Jiang
【Abstract】: In this paper, resting on the analysis of instruction frequency and function-based instruction sequences, we develop an Automatic Malware Categorization System (AMCS) for automatically grouping malware samples into families that share some common characteristics, using a cluster ensemble that aggregates the clustering solutions generated by different base clustering algorithms. We propose a principled cluster ensemble framework for combining individual clustering solutions based on the consensus partition. Domain knowledge in the form of sample-level constraints can be naturally incorporated in the ensemble framework. In addition, to account for the characteristics of feature representations, we propose a hybrid hierarchical clustering algorithm, which combines the merits of hierarchical clustering and k-medoids algorithms, and a weighted subspace K-medoids algorithm to generate base clusterings. The categorization results of our AMCS system can be used to generate signatures for malware families that are useful for malware detection. Case studies on a large, real daily malware collection from Kingsoft Anti-Virus Lab demonstrate the effectiveness and efficiency of our AMCS system.
【Keywords】: cluster ensemble; malware categorization; signature
【Paper Link】 【Pages】:105-114
【Authors】: Mehran Bozorgi ; Lawrence K. Saul ; Stefan Savage ; Geoffrey M. Voelker
【Abstract】: The security demands on modern system administration are enormous and getting worse. Chief among these demands, administrators must monitor the continual ongoing disclosure of software vulnerabilities that have the potential to compromise their systems in some way. Such vulnerabilities include buffer overflow errors, improperly validated inputs, and other unanticipated attack modalities. In 2008, over 7,400 new vulnerabilities were disclosed--well over 100 per week. While no enterprise is affected by all of these disclosures, administrators commonly face many outstanding vulnerabilities across the software systems they manage. Vulnerabilities can be addressed by patches, reconfigurations, and other workarounds; however, these actions may incur down-time or unforeseen side-effects. Thus, a key question for systems administrators is which vulnerabilities to prioritize. From publicly available databases that document past vulnerabilities, we show how to train classifiers that predict whether and how soon a vulnerability is likely to be exploited. As input, our classifiers operate on high dimensional feature vectors that we extract from the text fields, time stamps, cross references, and other entries in existing vulnerability disclosure reports. Compared to current industry-standard heuristics based on expert knowledge and static formulas, our classifiers predict much more accurately whether and how soon individual vulnerabilities are likely to be exploited.
【Keywords】: SVM; exploits; supervised learning; vulnerabilities
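【Illustrative Sketch】: A generic sketch of casting exploit prediction as text classification: TF-IDF features extracted from disclosure text feeding a linear SVM, matching the SVM keyword above. The way the authors combine text fields, time stamps, and cross references is not reproduced; the example reports and labels below are invented toy data.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical training data: free-text fields concatenated from vulnerability
# disclosure reports, and a binary label for "an exploit later appeared".
reports = [
    "stack buffer overflow in ftp server allows remote code execution",
    "improperly validated input in admin panel, local configuration issue",
]
exploited = [1, 0]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),  # high-dimensional sparse features
    LinearSVC(C=1.0),                               # linear SVM classifier
)
model.fit(reports, exploited)

# Rank newly disclosed vulnerabilities by decision value (higher = riskier):
new_reports = ["heap overflow in image parser reachable from the network"]
print(model.decision_function(new_reports))
```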
【Paper Link】 【Pages】:115-124
【Authors】: Evan K. Maxwell ; Godmar Back ; Naren Ramakrishnan
【Abstract】: Memory leaks are caused by software programs that prevent the reclamation of memory that is no longer in use. They can cause significant slowdowns, exhaustion of available storage space and, eventually, application crashes. Detecting memory leaks is challenging because real-world applications are built on multiple layers of software frameworks, making it difficult for a developer to know whether observed references to objects are legitimate or the cause of a leak. We present a graph mining solution to this problem wherein we analyze heap dumps to automatically identify subgraphs which could represent potential memory leak sources. Although heap dumps are commonly analyzed in existing heap profiling tools, our work is the first to apply a graph grammar mining solution to this problem. Unlike classical graph mining work, we show that it suffices to mine the dominator tree of the heap dump, which is significantly smaller than the underlying graph. Our approach identifies not just leaking candidates and their structure, but also provides aggregate information about the access path to the leaks. We demonstrate several synthetic as well as real-world examples of heap dumps for which our approach provides more insight into the problem than state-of-the-art tools such as Eclipse's MAT.
【Keywords】: dominator tree; graph grammars; graph mining; heap profiling; memory leaks
【Paper Link】 【Pages】:125-134
【Authors】: Li Zheng ; Chao Shen ; Liang Tang ; Tao Li ; Steven Luis ; Shu-Ching Chen ; Vagelis Hristidis
【Abstract】: Crisis Management and Disaster Recovery have gained immense importance in the wake of recent man-made and natural calamities. A critical problem in a crisis situation is how to efficiently discover, collect, organize, search and disseminate real-time disaster information. In this paper, we address several key problems which inhibit better information sharing and collaboration between both private and public sector participants for disaster management and recovery. We design and implement a web-based prototype of a Business Continuity Information Network (BCIN) system utilizing the latest advances in data mining technologies to create a user-friendly, Internet-based, information-rich service acting as a vital part of a company's business continuity process. Specifically, information extraction is used to integrate the input data from different sources; the content recommendation engine and the report summarization module provide users personalized and brief views of the disaster information; and the community generation module develops spatial clustering techniques to help users build dynamic communities during disasters. BCIN has been exercised at Miami-Dade County Emergency Management.
【Keywords】: critical information; data mining; disaster management; dynamic dashboard; information extraction; multi-document summarization; spatial clustering
【Paper Link】 【Pages】:135-144
【Authors】: Shen-Shyang Ho ; Wenqing Tang ; W. Timothy Liu
【Abstract】: The Earth Observing System Data and Information System (EOSDIS) is a comprehensive data and information system which archives, manages, and distributes Earth science data from the EOS spacecraft. One capability missing from EOSDIS is the retrieval of satellite sensor data based on the output of similarity queries over weather events (such as tropical cyclones). In this paper, we propose a framework to solve the similarity search problem given user-defined instance-level constraints for tropical cyclone events, represented by arbitrary-length multidimensional spatio-temporal data sequences. A critical component of such a problem is the similarity/metric function used to compare the data sequences. We describe a novel Longest Common Subsequence (LCSS) parameter learning approach driven by nonlinear dimensionality reduction and distance metric learning. Intuitively, arbitrary-length multidimensional data sequences are projected into a fixed-dimensional manifold for LCSS parameter learning. Similarity search is achieved through consensus among the (similar) instance-level constraints based on ranking orders computed using the LCSS-based similarity measure. Experimental results using a combination of synthetic and real tropical cyclone event data sequences are presented to demonstrate the feasibility of our parameter learning approach and its robustness to variability in the instance constraints. We then use a similarity query example on real tropical cyclone event data sequences from 2000 to 2008 to discuss (i) a problem of scientific interest, and (ii) challenges and issues related to the weather event similarity search problem.
【Keywords】: atmospheric events; dimensionality reduction; embedding method; ensemble method; metric learning; mining multi-dimensional data sequences; parameter learning; similarity search; spatio-temporal data mining; tropical cyclones
【Paper Link】 【Pages】:145-152
【Authors】: Collin Bennett ; Robert L. Grossman ; David Locke ; Jonathan Seidman ; Steve Vejcik
【Abstract】: Developing data mining algorithms that are suitable for cloud computing platforms is currently an active area of research, as is developing cloud computing platforms appropriate for data mining. Currently, the most common benchmark for cloud computing is the Terasort (and related) benchmarks. Although the Terasort Benchmark is quite useful, it was not designed for data mining per se. In this paper, we introduce a benchmark called MalStone that is specifically designed to measure the performance of cloud computing middleware that supports the type of data intensive computing common when building data mining models. We also introduce MalGen, which is a utility for generating data on clouds that can be used with MalStone.
【Keywords】: benchmarks; cloud computing benchmarks; data mining benchmarks
【Paper Link】 【Pages】:153-162
【Authors】: Furu Wei ; Shixia Liu ; Yangqiu Song ; Shimei Pan ; Michelle X. Zhou ; Weihong Qian ; Lei Shi ; Li Tan ; Qiang Zhang
【Abstract】: In this paper, we present a novel exploratory visual analytic system called TIARA (Text Insight via Automated Responsive Analytics), which combines text analytics and interactive visualization to help users explore and analyze large collections of text. Given a collection of documents, TIARA first uses topic analysis techniques to summarize the documents into a set of topics, each of which is represented by a set of keywords. In addition to extracting topics, TIARA derives time-sensitive keywords to depict the content evolution of each topic over time. To help users understand the topic-based summarization results, TIARA employs several interactive text visualization techniques to explain the summarization results and seamlessly link such results to the original text. We have applied TIARA to several real-world applications, including email summarization and patient record analysis. To measure the effectiveness of TIARA, we have conducted several experiments. Our experimental results and initial user feedback suggest that TIARA is effective in aiding users in their exploratory text analytic tasks.
【Keywords】: document summarization; exploratory analytics; interactive visualization; topic analysis; visual text analytics
【Paper Link】 【Pages】:163-172
【Authors】: Keith Henderson ; Tina Eliassi-Rad ; Christos Faloutsos ; Leman Akoglu ; Lei Li ; Koji Maruhashi ; B. Aditya Prakash ; Hanghang Tong
【Abstract】: Advances in data collection and storage capacity have made it increasingly possible to collect highly volatile graph data for analysis. Existing graph analysis techniques are not appropriate for such data, especially in cases where streaming or near-real-time results are required. An example that has drawn significant research interest is the cyber-security domain, where internet communication traces are collected and real-time discovery of events, behaviors, patterns, and anomalies is desired. We propose MetricForensics, a scalable framework for analysis of volatile graphs. MetricForensics combines a multi-level "drill down" approach, a collection of user-selected graph metrics, and a collection of analysis techniques. At each successive level, more sophisticated metrics are computed and the graph is viewed at finer temporal resolutions. In this way, MetricForensics scales to highly volatile graphs by only allocating resources for computationally expensive analysis when an interesting event is discovered at a coarser resolution first. We test MetricForensics on three real-world graphs: an enterprise IP trace, a trace of legitimate and malicious network traffic from a research institution, and the MIT Reality Mining proximity sensor data. Our largest graph has 3M vertices and 32M edges, spanning 4.5 days. The results demonstrate the scalability and capability of MetricForensics in analyzing volatile graphs; and highlight four novel phenomena in such graphs: elbows, broken correlations, prolonged spikes, and lightweight stars.
【Keywords】: graph mining; temporal analysis; volatile graphs
【Paper Link】 【Pages】:173-182
【Authors】: Byron C. Wallace ; Kevin Small ; Carla E. Brodley ; Thomas A. Trikalinos
【Abstract】: Active learning (AL) is an increasingly popular strategy for mitigating the amount of labeled data required to train classifiers, thereby reducing annotator effort. We describe a real-world, deployed application of AL to the problem of biomedical citation screening for systematic reviews at the Tufts Medical Center's Evidence-based Practice Center. We propose a novel active learning strategy that exploits a priori domain knowledge provided by the expert (specifically, labeled features) and extend this model via a Linear Programming algorithm for situations where the expert can provide ranked labeled features. Our methods outperform existing AL strategies on three real-world systematic review datasets. We argue that evaluation must be specific to the scenario under consideration. To this end, we propose a new evaluation framework for finite-pool scenarios, wherein the primary aim is to label a fixed set of examples rather than to simply induce a good predictive model. We use a method from medical decision theory for eliciting the relative costs of false positives and false negatives from the domain expert, constructing a utility measure of classification performance that integrates the expert preferences. Our findings suggest that the expert can, and should, provide more information than instance labels alone. In addition to achieving strong empirical results on the citation screening problem, this work outlines many important steps for moving away from simulated active learning and toward deploying AL for real-world applications.
【Keywords】: active learning; applications; medicine; text classification
【Paper Link】 【Pages】:183-192
【Authors】: Aditya Khosla ; Yu Cao ; Cliff Chiung-Yu Lin ; Hsu-Kuang Chiu ; Junling Hu ; Honglak Lee
【Abstract】: Stroke is the third leading cause of death and the principal cause of serious long-term disability in the United States. Accurate prediction of stroke is highly valuable for early intervention and treatment. In this study, we compare the Cox proportional hazards model with a machine learning approach for stroke prediction on the Cardiovascular Health Study (CHS) dataset. Specifically, we consider the common problems of data imputation, feature selection, and prediction in medical datasets. We propose a novel automatic feature selection algorithm that selects robust features based on our proposed heuristic: conservative mean. Combined with Support Vector Machines (SVMs), our proposed feature selection algorithm achieves a greater area under the ROC curve (AUC) as compared to the Cox proportional hazards model and L1 regularized Cox feature selection algorithm. Furthermore, we present a margin-based censored regression algorithm that combines the concept of margin-based classifiers with censored regression to achieve a better concordance index than the Cox model. Overall, our approach outperforms the current state-of-the-art in both metrics of AUC and concordance index. In addition, our work has also identified potential risk factors that have not been discovered by traditional approaches. Our method can be applied to clinical prediction of other diseases, where missing data are common and risk factors are not well understood.
【Keywords】: ROC; SVM; benchmark; classification; concordance index; data analysis; feature selection; healthcare; medical data analysis; prediction; stroke; stroke prediction
【Paper Link】 【Pages】:193-202
【Authors】: Yan Yan ; Glenn Fung ; Jennifer G. Dy ; Rómer Rosales
【Abstract】: Medical coding or classification is the process of transforming information contained in patient medical records into standard predefined medical codes. There are several worldwide accepted medical coding conventions associated with diagnoses and medical procedures; however, in the United States the Ninth Revision of the ICD (ICD-9) provides the standard for coding clinical records. Accurate medical coding is important since it is used by hospitals for insurance billing purposes. Since after discharge a patient can be assigned or classified to several ICD-9 codes, the coding problem can be seen as a multi-label classification problem. In this paper, we introduce a multi-label large-margin classifier that automatically learns the underlying inter-code structure and allows the controlled incorporation of prior knowledge about medical code relationships. In addition to refining and learning the code relationships, our classifier can also utilize this shared information to improve its performance. Experiments on a publicly available dataset containing clinical free text and their associated medical codes showed that our proposed multi-label classifier outperforms related multi-label models in this problem.
【Keywords】: L1 regularization; classification; large margin; medical coding; medical data mining; multi-label classification
【Paper Link】 【Pages】:203-212
【Authors】: Chi Wang ; Jiawei Han ; Yuntao Jia ; Jie Tang ; Duo Zhang ; Yintao Yu ; Jingyi Guo
【Abstract】: Information networks contain abundant knowledge about relationships among people or entities. Unfortunately, such knowledge is often hidden in a network where different kinds of relationships are not explicitly categorized. For example, in a research publication network, the advisor-advisee relationships among researchers are hidden in the coauthor network. Discovery of those relationships can benefit many interesting applications such as expert finding and research community analysis. In this paper, we take a computer science bibliographic network as an example, to analyze the roles of authors and to discover the likely advisor-advisee relationships. In particular, we propose a time-constrained probabilistic factor graph model (TPFG), which takes a research publication network as input and models the advisor-advisee relationship mining problem using a joint likelihood objective function. We further design an efficient learning algorithm to optimize the objective function. Based on that, our model suggests and ranks probable advisors for every author. Experimental results show that the proposed approach infers advisor-advisee relationships efficiently and achieves state-of-the-art accuracy (80-90%). We also apply the discovered advisor-advisee relationships to bole search, a specific expert-finding task, and an empirical study shows that the search performance can be effectively improved (+4.09% by NDCG@5).
【Keywords】: advisor-advisee prediction; coauthor network; relationship mining; time-constrained factor graph
【Paper Link】 【Pages】:213-222
【Authors】: Deepak Agarwal ; Rahul Agrawal ; Rajiv Khanna ; Nagaraj Kota
【Abstract】: We consider the problem of estimating rates of rare events for high dimensional, multivariate categorical data where several dimensions are hierarchical. Such problems are routine in several data mining applications including computational advertising, our main focus in this paper. We propose LMMH, a novel log-linear modeling method that scales to massive data applications with billions of training records and several million potential predictors in a map-reduce framework. Our method exploits correlations in aggregates observed at multiple resolutions when working with multiple hierarchies; stable estimates at coarser resolution provide informative prior information to improve estimates at finer resolutions. Other than prediction accuracy and scalability, our method has an inbuilt variable screening procedure based on a "spike and slab prior" that provides parsimony by removing non-informative predictors without hurting predictive accuracy. We perform large scale experiments on data from real computational advertising applications and illustrate our approach on datasets with several billion records and hundreds of millions of predictors. Extensive comparisons with other benchmark methods show significant improvements in prediction accuracy.
【Keywords】: computational advertising; count data; display advertising; gamma-poisson; sparse contingency tables; spike and slab prior
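【Illustrative Sketch】: The abstract's multi-resolution smoothing idea can be made concrete with textbook Gamma-Poisson shrinkage: a fine-resolution node's rare-event rate is pulled toward its parent's estimate in the hierarchy. This is a generic sketch under that reading, not the LMMH model or its map-reduce fitting procedure; the hierarchy and numbers below are invented.

```python
def smoothed_rate(clicks, impressions, parent_rate, strength=100.0):
    """Gamma-Poisson shrinkage of a rare-event rate toward its parent node.

    A Gamma(strength * parent_rate, strength) prior on the rate, combined with
    Poisson(clicks | rate * impressions) data, gives the posterior mean below.
    Sparse leaves stay close to the coarser parent estimate, while well-observed
    leaves are dominated by their own data.
    """
    return (clicks + strength * parent_rate) / (impressions + strength)

# Hypothetical hierarchy: (advertiser -> campaign -> ad), smoothed top-down.
advertiser_rate = smoothed_rate(clicks=480, impressions=1_200_000, parent_rate=3e-4)
campaign_rate   = smoothed_rate(clicks=3,   impressions=9_000,     parent_rate=advertiser_rate)
ad_rate         = smoothed_rate(clicks=0,   impressions=250,       parent_rate=campaign_rate)
print(ad_rate)
```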
【Paper Link】 【Pages】:223-232
【Authors】: Ramakrishnan Srikant ; Sugato Basu ; Ni Wang ; Daryl Pregibon
【Abstract】: There has been considerable work on user browsing models for search engine results, both organic and sponsored. The click-through rate (CTR) of a result is the product of the probability of examination (will the user look at the result) times the perceived relevance of the result (probability of a click given examination). Past papers have assumed that when the CTR of a result varies based on the pattern of clicks in prior positions, this variation is solely due to changes in the probability of examination. We show that, for sponsored search results, a substantial portion of the change in CTR when conditioned on prior clicks is in fact due to a change in the relevance of results for that query instance, not just due to a change in the probability of examination. We then propose three new user browsing models, which attribute CTR changes solely to changes in relevance, solely to changes in examination (with an enhanced model of user behavior), or to both changes in relevance and examination. The model that attributes all the CTR change to relevance yields substantially better predictors of CTR than models that attribute all the change to examination, and does only slightly worse than the model that attributes CTR change to both relevance and examination. For predicting relevance, the model that attributes all the CTR change to relevance again does better than the model that attributes the change to examination. Surprisingly, we also find that one model might do better than another in predicting CTR, but worse in predicting relevance. Thus it is essential to evaluate user browsing models with respect to accuracy in predicting relevance, not just CTR.
【Keywords】: examination; relevance; user browsing models
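【Illustrative Sketch】: A worked toy example of the decomposition stated in the abstract, CTR = P(examination) × P(click | examination), showing why a CTR change conditioned on a click above can be read either as a change in examination or as a change in relevance. All numbers below are made up.

```python
# CTR of a result = P(examined) * P(click | examined), as stated in the abstract.
p_exam_baseline = 0.40          # chance the user looks at position 2 at all
relevance       = 0.10          # P(click | examined) for the ad in position 2
ctr_baseline = p_exam_baseline * relevance             # 0.04

# Observed: CTR at position 2 doubles when position 1 was clicked.
ctr_after_click_above = 0.08

# Examination-only explanation: relevance fixed, examination must have doubled.
p_exam_implied = ctr_after_click_above / relevance      # 0.8

# Relevance-only explanation: examination fixed, relevance must have doubled.
relevance_implied = ctr_after_click_above / p_exam_baseline  # 0.2

print(ctr_baseline, p_exam_implied, relevance_implied)
```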
【Paper Link】 【Pages】:233-242
【Authors】: Maayan Roth ; Assaf Ben-David ; David Deutscher ; Guy Flysher ; Ilan Horn ; Ari Leichtberg ; Naty Leiser ; Yossi Matias ; Ron Merom
【Abstract】: Although users of online communication tools rarely categorize their contacts into groups such as "family", "co-workers", or "jogging buddies", they nonetheless implicitly cluster contacts, by virtue of their interactions with them, forming implicit groups. In this paper, we describe the implicit social graph which is formed by users' interactions with contacts and groups of contacts, and which is distinct from explicit social graphs in which users explicitly add other individuals as their "friends". We introduce an interaction-based metric for estimating a user's affinity to his contacts and groups. We then describe a novel friend suggestion algorithm that uses a user's implicit social graph to generate a friend group, given a small seed set of contacts which the user has already labeled as friends. We show experimental results that demonstrate the importance of both implicit group relationships and interaction-based affinity ranking in suggesting friends. Finally, we discuss two applications of the Friend Suggest algorithm that have been released as Gmail Labs features.
【Keywords】: contact group clustering; implicit social graph; tie strength
【Paper Link】 【Pages】:243-252
【Authors】: Ryan Lichtenwalter ; Jake T. Lussier ; Nitesh V. Chawla
【Abstract】: This paper examines important factors for link prediction in networks and provides a general, high-performance framework for the prediction task. Link prediction in sparse networks presents a significant challenge due to the inherent disproportion of links that can form to links that do form. Previous research has typically approached this as an unsupervised problem. While this is not the first work to explore supervised learning, many factors significant in influencing and guiding classification remain unexplored. In this paper, we consider these factors by first motivating the use of a supervised framework through a careful investigation of issues such as network observational period, generality of existing methods, variance reduction, topological causes and degrees of imbalance, and sampling approaches. We also present an effective flow-based predicting algorithm, offer formal bounds on imbalance in sparse network link prediction, and employ an evaluation method appropriate for the observed imbalance. Our careful consideration of the above issues ultimately leads to a completely general framework that outperforms unsupervised link prediction methods by more than 30% AUC.
【Keywords】: class imbalance; link prediction; machine learning; networks
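【Illustrative Sketch】: A generic sketch of the supervised framing the abstract advocates: topological features computed for candidate node pairs, a classifier with class weighting for the imbalance, and an AUC-style evaluation. The paper's flow-based predictor and sampling machinery are not reproduced; the toy graph and labels are assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def pair_features(adj, u, v):
    """Simple topological features for a candidate edge (u, v)."""
    nu, nv = adj.get(u, set()), adj.get(v, set())
    common = len(nu & nv)
    union = len(nu | nv) or 1
    return [common, common / union, len(nu) * len(nv)]  # CN, Jaccard, pref. attachment

# adj: adjacency sets from the observation period; pairs/labels: candidate
# node pairs and whether an edge formed later (heavily imbalanced in practice).
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
pairs  = [("a", "d"), ("b", "d"), ("a", "b")]
labels = [1, 0, 1]

X = [pair_features(adj, u, v) for u, v in pairs]
clf = RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=0)
clf.fit(X, labels)
scores = clf.predict_proba(X)[:, 1]
print(roc_auc_score(labels, scores))   # AUC-style evaluation suits the imbalance
```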
【Paper Link】 【Pages】:253-262
【Authors】: Vincent S. Tseng ; Cheng-Wei Wu ; Bai-En Shie ; Philip S. Yu
【Abstract】: Mining high utility itemsets from a transactional database refers to the discovery of itemsets with high utility like profits. Although a number of relevant approaches have been proposed in recent years, they incur the problem of producing a large number of candidate itemsets for high utility itemsets. Such a large number of candidate itemsets degrades the mining performance in terms of execution time and space requirement. The situation may become worse when the database contains lots of long transactions or long high utility itemsets. In this paper, we propose an efficient algorithm, namely UP-Growth (Utility Pattern Growth), for mining high utility itemsets with a set of techniques for pruning candidate itemsets. The information of high utility itemsets is maintained in a special data structure named UP-Tree (Utility Pattern Tree) such that the candidate itemsets can be generated efficiently with only two scans of the database. The performance of UP-Growth was evaluated in comparison with the state-of-the-art algorithms on different types of datasets. The experimental results show that UP-Growth not only reduces the number of candidates effectively but also outperforms other algorithms substantially in terms of execution time, especially when the database contains lots of long transactions.
【Keywords】: candidate pruning; frequent itemset; high utility itemset; utility mining
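【Illustrative Sketch】: UP-Growth and the UP-Tree are not reproduced here; the sketch below only makes the quantity being mined concrete, computing itemset utility (purchased quantity × unit profit, summed over the transactions containing the itemset) by brute force on an invented toy database.

```python
from itertools import combinations

# Toy transaction database: each transaction maps item -> purchased quantity.
transactions = [
    {"milk": 2, "bread": 1, "caviar": 1},
    {"milk": 1, "bread": 4},
    {"caviar": 2, "bread": 1},
]
profit = {"milk": 1.0, "bread": 0.5, "caviar": 10.0}   # external utility per unit

def utility(itemset, tx):
    """Utility of an itemset in one transaction (0 if not fully contained)."""
    if not set(itemset) <= tx.keys():
        return 0.0
    return sum(tx[i] * profit[i] for i in itemset)

def high_utility_itemsets(min_utility):
    """Brute-force enumeration; UP-Growth avoids this blow-up via the UP-Tree."""
    items = sorted({i for tx in transactions for i in tx})
    result = {}
    for k in range(1, len(items) + 1):
        for itemset in combinations(items, k):
            total = sum(utility(itemset, tx) for tx in transactions)
            if total >= min_utility:
                result[itemset] = total
    return result

print(high_utility_itemsets(min_utility=20.0))
```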
【Paper Link】 【Pages】:263-272
【Authors】: Salvatore Ruggieri
【Abstract】: Concise representations of frequent itemsets sacrifice readability and direct interpretability by a data analyst of the concise patterns extracted. In this paper, we introduce an extension of itemsets, called regular, with an immediate semantics and interpretability, and a conciseness comparable to closed itemsets. Regular itemsets allow for specifying that an item may or may not be present; that any subset of an itemset may be present; and that any non-empty subset of an itemset may be present. We devise a procedure, called RegularMine, for mining a set of regular itemsets that is a concise representation of frequent itemsets. The procedure computes a covering, in terms of regular itemsets, of the frequent itemsets in the class of equivalence of a closed one. We report experimental results on several standard dense and sparse datasets that validate the proposed approach.
【Keywords】: closed and free itemsets; concise representations
【Paper Link】 【Pages】:273-282
【Authors】: Liwen Sun ; Reynold Cheng ; David W. Cheung ; Jiefeng Cheng
【Abstract】: Data uncertainty is inherent in applications such as sensor monitoring systems, location-based services, and biological databases. To manage this vast amount of imprecise information, probabilistic databases have been recently developed. In this paper, we study the discovery of frequent patterns and association rules from probabilistic data under the Possible World Semantics. This is technically challenging, since a probabilistic database can have an exponential number of possible worlds. We propose two efficient algorithms, which discover frequent patterns in bottom-up and top-down manners. Both algorithms can be easily extended to discover maximal frequent patterns. We also explain how to use these patterns to generate association rules. Extensive experiments, using real and synthetic datasets, were conducted to validate the performance of our methods.
【Keywords】: association rule; frequent pattern; uncertain data
【Paper Link】 【Pages】:283-292
【Authors】: Hoang Thanh Lam ; Toon Calders
【Abstract】: We study the problem of finding the k most frequent items in a stream of items under the recently proposed max-frequency measure. The max-frequency of an item is counted over a sliding window whose length changes dynamically based on the properties of the item. Besides being parameterless, this way of measuring the support of items was shown to have the advantage of a faster detection of bursts in a stream, especially if the set of items is heterogeneous. The algorithm that was proposed for maintaining all frequent items, however, scales poorly when the number of items becomes large. Therefore, in this paper we propose, instead of reporting all frequent items, to mine only the top-k most frequent ones. First we prove that in order to solve this problem exactly, we still need a prohibitive amount of memory (at least linear in the number of items). Yet, under some reasonable conditions, we show both theoretically and empirically that a memory-efficient algorithm exists. A prototype of this algorithm is implemented and we present its performance with respect to memory efficiency on real-life data and in controlled experiments with synthetic data.
【Keywords】: data stream mining; top-k frequent items
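【Illustrative Sketch】: A naive sketch of the max-frequency measure as described in the abstract: for a target item, take the maximum of its frequency over all windows ending at the current position, so the window length adapts to the item. The code recomputes from scratch purely to make the definition concrete; it is not the paper's memory-efficient top-k algorithm.

```python
def max_frequency(stream, target):
    """Max, over all windows ending at the latest element, of the target's frequency.

    A burst of the target near the end of the stream is picked up immediately,
    because the maximizing window is allowed to be short.
    """
    best = 0.0
    count = 0
    # Scan windows from the most recent element backwards, growing one step at a time.
    for length, item in enumerate(reversed(stream), start=1):
        if item == target:
            count += 1
        best = max(best, count / length)
    return best

stream = list("aababbbcbb")          # 'b' bursts toward the end of the stream
print(max_frequency(stream, "b"))    # dominated by a short, recent window
print(max_frequency(stream, "a"))    # 'a' occurred early, so its max-frequency is lower
```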
【Paper Link】 【Pages】:293-302
【Authors】: Nikolaj Tatti
【Abstract】: One of the main current challenges in itemset mining is to discover a small set of high-quality itemsets. In this paper we propose a new and general approach for measuring the quality of itemsets. The method is solidly founded in Bayesian statistics and decreases monotonically, allowing for efficient discovery of all interesting itemsets. The measure is defined by connecting statistical models and collections of itemsets. This allows us to score individual itemsets with the probability of them occurring in random models built on the data. As a concrete example of this framework we use exponential models. This class of models possesses many desirable properties. Most importantly, Occam's razor in Bayesian model selection provides a defence against the pattern explosion. As general exponential models are infeasible in practice, we use decomposable models; a large sub-class for which the measure is solvable. For the actual computation of the score we sample models from the posterior distribution using an MCMC approach. Experimentation demonstrates that the measure works in practice and results in interpretable and insightful itemsets for both synthetic and real-world data.
【Keywords】: MCMC; bayesian model selection; decomposable models; exponential models; itemset mining; junction tree
【Paper Link】 【Pages】:303-312
【Authors】: Jun Zhu ; Ni Lao ; Eric P. Xing
【Abstract】: Feature selection is an important task in order to achieve better generalizability in high-dimensional learning, and structure learning of Markov random fields (MRFs) can automatically discover the inherent structures underlying complex data. Both problems can be cast as solving an l1-norm regularized parameter estimation problem. The existing Grafting method can avoid doing inference on dense graphs in structure learning by incrementally selecting new features. However, Grafting performs a greedy step to optimize over free parameters once new features are included. This greedy strategy results in low efficiency when parameter learning is itself non-trivial, such as in MRFs, in which parameter learning depends on an expensive subroutine to calculate gradients. The complexity of calculating gradients in MRFs is typically exponential in the size of the maximal cliques. In this paper, we present a fast algorithm called Grafting-Light to solve the l1-norm regularized maximum likelihood estimation of MRFs for efficient feature selection and structure learning. Grafting-Light iteratively performs one step of orthant-wise gradient descent over free parameters and selects new features. This lazy strategy is guaranteed to converge to the global optimum and can effectively select significant features. On both synthetic and real data sets, we show that Grafting-Light is much more efficient than Grafting for both feature selection and structure learning, and performs comparably with the optimal batch method that directly optimizes over all the features for feature selection, while being much more efficient and accurate for structure learning of MRFs.
【Keywords】: Markov random fields; feature selection; structure learning
【Paper Link】 【Pages】:313-322
【Authors】: Liang Sun ; Betul Ceran ; Jieping Ye
【Abstract】: Dimensionality reduction plays an important role in many data mining applications involving high-dimensional data. Many existing dimensionality reduction techniques can be formulated as a generalized eigenvalue problem, which does not scale to large-size problems. Prior work transforms the generalized eigenvalue problem into an equivalent least squares formulation, which can then be solved efficiently. However, the equivalence relationship only holds under certain assumptions without regularization, which severely limits their applicability in practice. In this paper, an efficient two-stage approach is proposed to solve a class of dimensionality reduction techniques, including Canonical Correlation Analysis, Orthonormal Partial Least Squares, Linear Discriminant Analysis, and Hypergraph Spectral Learning. The proposed two-stage approach scales linearly in terms of both the sample size and data dimensionality. The main contributions of this paper include (1) we rigorously establish the equivalence relationship between the proposed two-stage approach and the original formulation without any assumption; and (2) we show that the equivalence relationship still holds in the regularization setting. We have conducted extensive experiments using both synthetic and real-world data sets. Our experimental results confirm the equivalence relationship established in this paper. Results also demonstrate the scalability of the proposed two-stage approach.
【Keywords】: dimensionality reduction; generalized eigenvalue problem; least squares; regularization; scalability
【Paper Link】 【Pages】:323-332
【Authors】: Jun Liu ; Lei Yuan ; Jieping Ye
【Abstract】: The fused Lasso penalty enforces sparsity in both the coefficients and their successive differences, which is desirable for applications with features ordered in some meaningful way. The resulting problem is, however, challenging to solve, as the fused Lasso penalty is both non-smooth and non-separable. Existing algorithms have high computational complexity and do not scale to large-size problems. In this paper, we propose an Efficient Fused Lasso Algorithm (EFLA) for optimizing this class of problems. One key building block in the proposed EFLA is the Fused Lasso Signal Approximator (FLSA). To efficiently solve FLSA, we propose to reformulate it as the problem of finding an "appropriate" subgradient of the fused penalty at the minimizer, and develop a Subgradient Finding Algorithm (SFA). We further design a restart technique to accelerate the convergence of SFA, by exploiting the special "structures" of both the original and the reformulated FLSA problems. Our empirical evaluations show that, both SFA and EFLA significantly outperform existing solvers. We also demonstrate several applications of the fused Lasso.
【Keywords】: fused lasso; l1 regularization; restart; subgradient
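【Illustrative Sketch】: A minimal numpy sketch of the fused Lasso objective the abstract targets: a least-squares loss plus an L1 penalty on the coefficients and an L1 penalty on their successive differences. EFLA/SFA themselves are not reproduced; this only evaluates the objective on invented data, and a specialized solver would still be needed to minimize it.

```python
import numpy as np

def fused_lasso_objective(beta, X, y, lam1, lam2):
    """0.5*||y - X beta||^2 + lam1*sum|beta_j| + lam2*sum|beta_j - beta_{j-1}|.

    The second penalty is non-smooth and non-separable, which is what makes the
    problem hard and motivates specialized solvers such as the paper's EFLA.
    """
    residual = y - X @ beta
    sparsity = lam1 * np.sum(np.abs(beta))
    fusion = lam2 * np.sum(np.abs(np.diff(beta)))
    return 0.5 * residual @ residual + sparsity + fusion

# Toy check: a piecewise-constant coefficient vector pays little fusion penalty.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))
beta_smooth = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
y = X @ beta_smooth + 0.1 * rng.normal(size=20)
print(fused_lasso_objective(beta_smooth, X, y, lam1=0.1, lam2=0.1))
```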
【Paper Link】 【Pages】:333-342
【Authors】: Deng Cai ; Chiyuan Zhang ; Xiaofei He
【Abstract】: In many data analysis tasks, one is often confronted with very high dimensional data. Feature selection techniques are designed to find the relevant feature subset of the original features which can facilitate clustering, classification and retrieval. In this paper, we consider the feature selection problem in the unsupervised learning scenario, which is particularly difficult due to the absence of class labels that would guide the search for relevant information. The feature selection problem is essentially a combinatorial optimization problem which is computationally expensive. Traditional unsupervised feature selection methods address this issue by selecting the top-ranked features based on certain scores computed independently for each feature. These approaches neglect the possible correlation between different features and thus cannot produce an optimal feature subset. Inspired by the recent developments on manifold learning and L1-regularized models for subset selection, we propose in this paper a new approach, called Multi-Cluster Feature Selection (MCFS), for unsupervised feature selection. Specifically, we select those features such that the multi-cluster structure of the data can be best preserved. The corresponding optimization problem can be efficiently solved since it only involves a sparse eigen-problem and an L1-regularized least squares problem. Extensive experimental results over various real-life data sets have demonstrated the superiority of the proposed algorithm.
【Keywords】: clustering; feature selection; unsupervised
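【Illustrative Sketch】: The abstract states that MCFS reduces to a sparse eigen-problem plus an L1-regularized least-squares problem. The sketch below is one common reading of that recipe: a spectral embedding of a k-NN graph, then a sparse regression of each embedding dimension on the features, scoring each feature by its largest absolute coefficient. The graph construction details and parameters are assumptions, not the authors' exact settings.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from sklearn.manifold import spectral_embedding
from sklearn.linear_model import LassoLars

def mcfs_style_scores(X, n_clusters=3, n_neighbors=5, alpha=0.01):
    """Score features by how well they sparsely reproduce a spectral embedding."""
    # 1) Sparse eigen-problem: low-dimensional embedding of the k-NN graph.
    graph = kneighbors_graph(X, n_neighbors=n_neighbors, include_self=False)
    graph = 0.5 * (graph + graph.T)                 # symmetrize affinities
    Y = spectral_embedding(graph, n_components=n_clusters, drop_first=True)

    # 2) L1-regularized least squares: regress each embedding dimension on X.
    coefs = np.zeros((n_clusters, X.shape[1]))
    for k in range(n_clusters):
        coefs[k] = LassoLars(alpha=alpha).fit(X, Y[:, k]).coef_

    # 3) Feature score = largest absolute coefficient across embedding dimensions.
    return np.max(np.abs(coefs), axis=0)

# Hypothetical usage: keep the top-m scoring features before clustering.
# scores = mcfs_style_scores(X); selected = np.argsort(scores)[::-1][:m]
```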
【Paper Link】 【Pages】:343-352
【Authors】: Jian-Bo Yang ; Chong Jin Ong
【Abstract】: This paper presents a novel wrapper-based feature selection method for Support Vector Regression (SVR) using its probabilistic predictions. The method computes the importance of a feature by aggregating the difference, over the feature space, of the conditional density functions of the SVR prediction with and without the feature. As the exact computation of this importance measure is expensive, two approximations are proposed. The effectiveness of the measure using these approximations, in comparison to several other existing feature selection methods for SVR, is evaluated on both artificial and real-world problems. The result of the experiment shows that the proposed method generally performs better, and at least as well as the existing methods, with notable advantage when the data set is sparse.
【Keywords】: feature ranking; feature selection; probabilistic predictions; random permutation; support vector regression
【Paper Link】 【Pages】:353-362
【Authors】: Xin Jin ; Mingyang Zhang ; Nan Zhang ; Gautam Das
【Abstract】: Motivated by the insufficiency of the existing quasi-identifier/sensitive-attribute (QI-SA) framework on modeling real-world privacy requirements for data publishing, we propose a novel versatile publishing scheme with which privacy requirements can be specified as an arbitrary set of privacy rules over attributes in the microdata table. To enable versatile publishing, we introduce the Guardian Normal Form (GNF), a novel method of publishing multiple sub-tables such that each sub-table is anonymized by an existing QI-SA publishing algorithm, while the combination of all published tables guarantees all privacy rules. We devise two algorithms, Guardian Decomposition (GD) and Utility-aware Decomposition (UAD), for decomposing a microdata table into GNF, and present extensive experiments over real-world datasets to demonstrate the effectiveness of both algorithms.
【Keywords】: decomposition; guardian normal form; privacy preservation; versatile publishing
【Paper Link】 【Pages】:363-372
【Authors】: Keng-Pei Lin ; Ming-Syan Chen
【Abstract】: Outsourcing the training of support vector machines (SVM) to external service providers benefits the data owner who is not familiar with the techniques of the SVM or has limited computing resources. In outsourcing, the data privacy is a critical issue for some legal or commercial reasons since there may be sensitive information contained in the data. Existing privacy-preserving SVM works are either not applicable to outsourcing or weak in security. In this paper, we propose a scheme for privacy-preserving outsourcing the training of the SVM without disclosing the actual content of the data to the service provider. In the proposed scheme, the data sent to the service provider is perturbed by a random transformation, and the service provider trains the SVM for the data owner from the perturbed data. The proposed scheme is stronger in security than existing techniques, and incurs very little redundant communication and computation cost.
【Keywords】: classification; outsourcing; privacy-preserving data mining; support vector machines
【Paper Link】 【Pages】:373-382
【Authors】: Zhen Wen ; Ching-Yung Lin
【Abstract】: This paper intends to provide some insights into a scientific problem: how likely is it that one's interests can be inferred from his/her social connections -- friends, friends' friends, 3-degree friends, etc.? Is "Birds of a Feather Flock Together" the norm? We do not consider the friending activity on online social networking sites. Instead, we conduct this study by implementing a privacy-preserving, large, distributed social sensor system in a large global IT company to capture the multifaceted activities of 30,000+ people, including communications (e.g., emails, instant messaging, etc.) and Web 2.0 activities (e.g., social bookmarking, file sharing, blogging, etc.). These activities occupy the majority of employees' time at work, and thus provide a high-quality approximation to the real social connections of employees in the workplace context. In addition to such "informal networks", we investigated the "formal networks", such as the hierarchical structure, as well as demographic profile data such as geography, job role, self-specified interests, etc. Because user ID matching across multiple sources on the Internet is very difficult, and most user activity logs have to be anonymized before they are processed, no prior studies could collect comparable multifaceted activity data of individuals. That makes this study unique. In this paper, we present a technique to predict the inference quality by utilizing (1) network analysis and network autocorrelation modeling of informal and formal networks, and (2) regression models to predict user interest inference quality from network characteristics. We verify our findings with experiments on both implicit user interests indicated by the content of communications or Web 2.0 activities, and explicit user interests specified in user profiles. We demonstrate that the inference quality prediction increases the inference quality of implicit interests by 42.8%, and the inference quality of explicit interests by up to 101%.
【Keywords】: quality; social networks; user interest modeling
【Paper Link】 【Pages】:383-392
【Authors】: Smruti R. Sarangi ; Karin Murthy
【Abstract】: Large-scale sensor deployments and an increased use of privacy-preserving transformations have led to an increasing interest in mining uncertain time series data. Traditional distance measures such as Euclidean distance or dynamic time warping are not always effective for analyzing uncertain time series data. Recently, some measures have been proposed to account for uncertainty in time series data. However, we show in this paper that their applicability is limited. Specifically, these approaches do not provide an intuitive way to compare two uncertain time series and do not easily accommodate multiple error functions. In this paper, we first provide a theoretical framework that generalizes the notion of similarity between uncertain time series. Second, we propose DUST, a novel distance measure that accommodates uncertainty and degenerates to the Euclidean distance when the distance is large compared to the error. We provide an extensive experimental validation of our approach for the following applications: classification, top-k motif search, and top-k nearest-neighbor queries.
【Keywords】: data mining; distance measure; similarity; time series; uncertain data
【Paper Link】 【Pages】:393-402
【Authors】: Vincent Leroy ; Berkant Barla Cambazoglu ; Francesco Bonchi
【Abstract】: In the traditional link prediction problem, a snapshot of a social network is used as a starting point to predict, by means of graph-theoretic measures, the links that are likely to appear in the future. In this paper, we introduce cold start link prediction as the problem of predicting the structure of a social network when the network itself is totally missing while some other information regarding the nodes is available. We propose a two-phase method based on the bootstrap probabilistic graph. The first phase generates an implicit social network in the form of a probabilistic graph. The second phase applies probabilistic graph-based measures to produce the final prediction. We assess our method empirically over a large data collection obtained from Flickr, using interest groups as the initial information. The experiments confirm the effectiveness of our approach.
【Keywords】: link prediction; probabilistic graph; social networks
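【Code sketch】: A minimal, hypothetical illustration of the two-phase idea: edge probabilities are first derived from shared interest-group memberships (phase one), and candidate links are then scored with a probabilistic common-neighbour measure (phase two). The group data, the edge_prob formula, and the alpha parameter are invented for the sketch and are not the paper's bootstrap construction.

```python
from itertools import combinations

groups = {                      # hypothetical interest-group memberships
    "g1": {"a", "b", "c"},
    "g2": {"b", "c", "d"},
    "g3": {"a", "d"},
}

# Phase 1: edge probability grows with the number of shared groups.
def edge_prob(u, v, alpha=0.5):
    shared = sum(1 for members in groups.values() if u in members and v in members)
    return 1.0 - (1.0 - alpha) ** shared      # zero shared groups -> probability 0

users = sorted(set().union(*groups.values()))
p = {(u, v): edge_prob(u, v) for u, v in combinations(users, 2)}

def get_p(u, v):
    return p.get((u, v)) or p.get((v, u)) or 0.0

# Phase 2: probabilistic common-neighbour score over the implicit graph.
def prob_common_neighbours(u, v):
    return sum(get_p(u, w) * get_p(w, v) for w in users if w not in (u, v))

for pair in sorted(p):
    print(pair, round(prob_common_neighbours(*pair), 3))
```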
【Paper Link】 【Pages】:403-412
【Authors】: Xu-Ying Liu ; Zhi-Hua Zhou
【Abstract】: Existing cost-sensitive learning methods require that unequal misclassification costs be given as precise values. In many real-world applications, however, it is generally difficult to provide a precise cost value, since the user may only know that one type of mistake is much more severe than another type, yet finds it infeasible to give a precise description. In such situations, it is more meaningful to work with a cost interval instead of a precise cost value. In this paper we report the first study along this direction. We propose CISVM, a support vector machine that works with cost-interval information. Experiments show that when only cost intervals are available, CISVM is significantly superior to standard cost-sensitive SVMs trained with any of the minimal, mean, or maximal cost. Moreover, considering that in some cases other information about costs can be obtained in addition to cost intervals, such as the distribution of costs, we propose a general approach, CODIS, that uses the distribution information to help improve performance. Experiments show that this approach can reduce 60% more risk than the standard cost-sensitive SVM that assumes the expected cost is the true value.
【Keywords】: cost intervals; cost-sensitive learning; misclassification costs
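【Code sketch】: A minimal illustration of the baselines the abstract compares against, not of CISVM itself: standard cost-sensitive SVMs trained with the minimal, mean, and maximal value of a cost interval, expressed through per-class weights. The data set and the interval [2, 10] are invented for the sketch.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=1)

cost_interval = (2.0, 10.0)                 # hypothetical [c_min, c_max] for class 1
candidates = {"min": cost_interval[0],
              "mean": sum(cost_interval) / 2.0,
              "max": cost_interval[1]}

for name, cost in candidates.items():
    # Misclassifying class 1 is 'cost' times as expensive as misclassifying class 0.
    clf = SVC(kernel="linear", class_weight={0: 1.0, 1: cost}).fit(X, y)
    print(name, cost, round(clf.score(X, y), 3))
```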
【Paper Link】 【Pages】:413-422
【Authors】: Iris Adä ; Michael R. Berthold
【Abstract】: In this paper we introduce a modular, highly flexible, open-source environment for data generation. Using an existing graphical data flow tool, the user can combine various types of modules for numeric and categorical data generators. Additional functionality is added via the data processing framework in which the generator modules are embedded. The resulting data flows can be used to document, deploy, and reuse the resulting data generators. We describe the overall environment and individual modules and demonstrate how they can be used for the generation of a sample, complex customer/product database with corresponding shopping basket data, including various artifacts and outliers.
【Keywords】: artificial data; data generation; pipeline tool
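【Code sketch】: A minimal, plain-Python analogue of the modular idea (the actual environment is built on a graphical data-flow tool): small numeric and categorical generator "modules" are combined into a customer schema, with one module wrapped to inject outliers. All column names and parameters are invented for the sketch.

```python
import random

random.seed(0)

def numeric(mean, std):                 # numeric-column generator module
    return lambda: random.gauss(mean, std)

def categorical(values, weights):       # categorical-column generator module
    return lambda: random.choices(values, weights)[0]

def outlier(gen, rate=0.02, factor=10): # wraps a module to inject occasional outliers
    return lambda: gen() * factor if random.random() < rate else gen()

customer_schema = {
    "age": numeric(40, 12),
    "segment": categorical(["retail", "business"], [0.8, 0.2]),
    "basket_value": outlier(numeric(25, 8)),
}

customers = [{col: gen() for col, gen in customer_schema.items()} for _ in range(5)]
for row in customers:
    print(row)
```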
【Paper Link】 【Pages】:423-432
【Authors】: Josh Attenberg ; Foster J. Provost
【Abstract】: This paper analyses alternative techniques for deploying low-cost human resources for data acquisition for classifier induction in domains exhibiting extreme class imbalance - where traditional labeling strategies, such as active learning, can be ineffective. Consider the problem of building classifiers to help brands control the content adjacent to their on-line advertisements. Although frequent enough to worry advertisers, objectionable categories are rare in the distribution of impressions encountered by most on-line advertisers - so rare that traditional sampling techniques do not find enough positive examples to train effective models. An alternative way to deploy human resources for training-data acquisition is to have them "guide" the learning by searching explicitly for training examples of each class. We show that under extreme skew, even basic techniques for guided learning completely dominate smart (active) strategies for applying human resources to select cases for labeling. Therefore, it is critical to consider the relative cost of search versus labeling, and we demonstrate the tradeoffs for different relative costs. We show that in cost/skew settings where the choice between search and active labeling is equivocal, a hybrid strategy can combine the benefits.
【Keywords】: active learning; class imbalance; human resources; machine learning; micro-outsourcing; on-line advertising
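【Code sketch】: A small, hedged simulation of the contrast drawn above: under roughly 0.5% prevalence, a random labeling budget contains almost no positives, whereas a "guided" searcher who explicitly finds positives yields a balanced labeled set. The searcher is simulated with an oracle lookup, and the relative costs of search versus labeling are ignored here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=20000, weights=[0.995, 0.005],
                           n_informative=5, random_state=0)
budget = 100

# Random labeling: with ~0.5% prevalence, 100 random labels contain few positives.
idx_random = rng.choice(len(y), budget, replace=False)

# Guided learning: a searcher explicitly hunts for positives, so the labeled
# set is balanced (simulated here by sampling from the known positives).
pos = rng.choice(np.where(y == 1)[0], budget // 2, replace=False)
neg = rng.choice(np.where(y == 0)[0], budget // 2, replace=False)
idx_guided = np.concatenate([pos, neg])

for name, idx in [("random", idx_random), ("guided", idx_guided)]:
    if len(np.unique(y[idx])) < 2:
        print(name, "cannot train: only one class was labeled")
        continue
    clf = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    print(name, round(roc_auc_score(y, clf.predict_proba(X)[:, 1]), 3))
```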
【Paper Link】 【Pages】:433-442
【Authors】: Qiong Fang ; Wilfred Ng ; Jianlin Feng
【Abstract】: Mining order-preserving submatrix (OPSM) patterns has received much attention from researchers, since in many scientific applications, such as those involving gene expression data, it is natural to express the data in a matrix and important to find the order-preserving submatrix patterns. However, most current work assumes a noise-free OPSM model and thus is not practical in the many real situations where sample contamination exists. In this paper, we propose a relaxed OPSM model called ROPSM. The ROPSM model supports mining more reasonable noise-corrupted OPSM patterns than another well-known model called AOPC (approximate order-preserving cluster). While OPSM mining is known to be an NP-hard problem, mining ROPSM patterns is an even harder problem. We propose a novel method called ROPSM-Growth to mine ROPSM patterns. Specifically, two pattern-growing strategies, a column-centric strategy and a row-centric strategy, are presented, which effectively grow seed OPSMs into significant ROPSMs. An effective median-rank based method is also developed to discover the underlying true order of conditions involved in an ROPSM pattern. Our experiments on a biological dataset show that the ROPSM model better captures the characteristics of noise in gene expression data matrices than the AOPC model. Importantly, we find that our approach is able to detect more high-quality, biologically significant patterns with efficiency comparable to that of AOPC. Specifically, at least 26.6% (75 out of 282) of the patterns mined by our approach are strongly associated with more than 10 gene categories (high biological significance), which is 3 times better than the result obtained with the AOPC approach.
【Keywords】: backbone order; biclustering; order-preserving submatrices; relaxed order-preserving submatrices
【Paper Link】 【Pages】:443-452
【Authors】: Dan He ; Douglas Stott Parker Jr.
【Abstract】: For some time there has been increasing interest in the problem of monitoring the occurrence of topics in a stream of events, such as a stream of news articles. This has led to different models of bursts in these streams, i.e., periods of elevated occurrence of events. Today there are several burst definitions and detection algorithms, and their differences can produce very different results in topic streams. These definitions also share a fundamental problem: they define bursts in terms of an arrival rate. This approach is limiting; other stream dimensions can matter. We reconsider the idea of bursts from the standpoint of a simple kind of physics. Instead of focusing on arrival rates, we reconstruct bursts as a dynamic phenomenon, using kinetics concepts from physics -- mass and velocity -- and derive momentum, acceleration, and force from these. We refer to the result as topic dynamics, permitting a hierarchical, expressive model of bursts as intervals of increasing momentum. As a sample application, we present a topic dynamics model for the large PubMed/MEDLINE database of biomedical publications, using the MeSH (Medical Subject Heading) topic hierarchy. We show our model is able to detect bursts for MeSH terms accurately as well as efficiently.
【Keywords】: burst; hierarchy; momentum; pubmed; topic dynamic
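【Code sketch】: A minimal reading of the kinetics analogy, with made-up per-period counts: mass is the occurrence count, velocity its change, and intervals of rising momentum are flagged as bursts. The paper's hierarchical model over MeSH terms is far richer than this.

```python
import numpy as np

counts = np.array([3, 4, 4, 6, 11, 19, 30, 28, 15, 9, 8], dtype=float)

mass = counts
velocity = np.gradient(counts)          # rate of change of occurrence
momentum = mass * velocity
acceleration = np.gradient(velocity)
force = mass * acceleration

burst = np.diff(momentum) > 0           # periods where momentum is rising
print("momentum:", np.round(momentum, 1))
print("rising-momentum periods:", np.where(burst)[0] + 1)
```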
【Paper Link】 【Pages】:453-462
【Authors】: Naren Sundaravaradan ; K. S. M. Tozammel Hossain ; Vandana Sreedharan ; Douglas J. Slotta ; John Paul C. Vergara ; Lenwood S. Heath ; Naren Ramakrishnan
【Abstract】: Systems biology has made massive strides in recent years, with capabilities to model complex systems including cell division, stress response, energy metabolism, and signaling pathways. Concomitant with their improved modeling capabilities, however, such biochemical network models have also become notoriously complex for humans to comprehend. We propose network comprehension as a key problem for the KDD community, where the goal is to create explainable representations of complex biological networks. We formulate this problem as one of extracting temporal signatures from multi-variate time series data, where the signatures are composed of ordinal comparisons between time series components. We show how such signatures can be inferred by formulating the data mining problem as one of feature selection in rank-order space. We propose five new feature selection strategies for rank-order space and assess their selective superiorities. Experimental results on budding yeast cell cycle models demonstrate compelling results comparable to human interpretations of the cell cycle.
【Keywords】: biological networks; feature selection; rank-order spaces; systems biology; temporal signatures
【Paper Link】 【Pages】:463-472
【Authors】: Jinyan Li ; Qian Liu ; Tao Zeng
【Abstract】: This paper studies efficient mining of negative correlations that pace in collaboration. A collaborating negative correlation is a negative correlation between two sets of variables, rather than, as traditionally, between a pair of variables. It signifies a synchronized value rise or fall of all variables within one set whenever all variables in the other set jointly move in the opposite direction. The time complexity of this mining problem is exponential. The high efficiency of our algorithm is attributed to two factors: (i) the transformation of the original data into a bipartite graph database, and (ii) the mining of transpose closures from a wide transactional database. Applying the algorithm to yeast gene expression data, we evaluate the biological relevance of collaborating negative correlations, as an example among many real-life domains, using Pearson's correlation coefficient and p-values.
【Keywords】: collaborating negative correlations; transpose closures from a bipartite graph database
【Paper Link】 【Pages】:473-482
【Authors】: Chih-Hua Tai ; Philip S. Yu ; Ming-Syan Chen
【Abstract】: For any outsourcing service, privacy is a major concern. This paper focuses on outsourcing frequent itemset mining and examines how to protect privacy against attackers who have precise knowledge of the supports of some items. We propose a new approach, referred to as k-support anonymity, that protects each sensitive item with k-1 other items of similar support. To achieve k-support anonymity, we introduce a pseudo taxonomy tree and have the third party mine generalized frequent itemsets under the corresponding generalized association rules instead of plain association rules. The pseudo taxonomy is a construct to facilitate hiding the original items, where each original item can map to either a leaf node or an internal node in the taxonomy tree. The rationale for this approach is that, with a taxonomy tree, the k nodes satisfying k-support anonymity may be any k nodes in the tree with appropriate supports. This approach can therefore provide more candidates for k-support anonymity with few fake items, since only the leaf nodes, not the internal nodes, of the taxonomy tree need to appear in the transactions. In contrast, for ordinary association rule mining, the k nodes satisfying k-support anonymity would have to correspond to leaf nodes in the taxonomy tree, which is far more restrictive. The challenge is thus how to generate the pseudo taxonomy tree so as to facilitate k-support anonymity and ensure the conservation of the original frequent itemsets. The experimental results show that our methods for k-support anonymity achieve very good privacy protection with moderate storage overhead.
【Keywords】: frequent itemsets; outsourcing; privacy
【Paper Link】 【Pages】:483-492
【Authors】: Bin Yang ; Hiroshi Nakagawa ; Issei Sato ; Jun Sakuma
【Abstract】: Recent research in privacy-preserving data mining (PPDM) has become increasingly popular due to the wide application of data mining and the increased concern regarding the protection of private and personal information. Numerous methods of privacy-preserving data mining have been proposed. Most of these methods are based on the assumption that parties are semi-honest and do not collude; in other words, every party follows the protocol properly, except that it may keep a record of all its intermediate computations without sharing that record with others. In this paper, we focus our attention on the problem of collusion, in which some parties may collude and share their records to deduce the private information of other parties. In particular, we consider a general problem in PPDM: multiparty secure computation of functions of secure summations of data spread across multiple parties. To solve this problem, we propose a new method that provides a high level of security, namely full privacy: no sensitive information of a party is revealed even when all other parties collude. In addition, the method is efficient, with a running time of O(m). We also show that by applying this general method, a large number of problems in PPDM can be solved with enhanced security.
【Keywords】: collusion; data mining; distributed; privacy; secure multiparty computation
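【Code sketch】: A minimal illustration of the secure-summation building block that the paper's protocol strengthens; the plain additive secret sharing shown here does not by itself provide the collusion-resistant full-privacy guarantee described above. The values and the modulus are arbitrary.

```python
import random

random.seed(0)
MODULUS = 2**61 - 1
private_values = [42, 17, 99]           # one secret per party
n = len(private_values)

# Each party splits its value into n random additive shares summing to the value.
shares = []
for v in private_values:
    r = [random.randrange(MODULUS) for _ in range(n - 1)]
    shares.append(r + [(v - sum(r)) % MODULUS])

# Party j receives the j-th share from every party and publishes the subtotal.
subtotals = [sum(s[j] for s in shares) % MODULUS for j in range(n)]

# Anyone can reconstruct the global sum, but no single subtotal reveals a value.
print(sum(subtotals) % MODULUS, "==", sum(private_values))
```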
【Paper Link】 【Pages】:493-502
【Authors】: Arik Friedman ; Assaf Schuster
【Abstract】: We consider the problem of data mining with formal privacy guarantees, given a data access interface based on the differential privacy framework. Differential privacy requires that computations be insensitive to changes in any particular individual's record, thereby restricting data leaks through the results. The privacy-preserving interface ensures unconditionally safe access to the data and does not require any privacy expertise from the data miner. However, as we show in the paper, a naive utilization of the interface to construct privacy-preserving data mining algorithms can lead to inferior data mining results. We address this problem by considering the privacy and the algorithmic requirements simultaneously, focusing on decision tree induction as a sample application. The privacy mechanism has a profound effect on the performance of the methods chosen by the data miner. We demonstrate that this choice can make the difference between an accurate classifier and a completely useless one. Moreover, an improved algorithm can achieve the same level of accuracy and privacy as the naive implementation but with an order of magnitude fewer learning samples.
【Keywords】: data mining; decision trees; differential privacy
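【Code sketch】: A hedged illustration of the kind of differentially private access the abstract assumes: per-class counts for a candidate split are released with Laplace noise, and a split score is computed from noisy counts only. The epsilon values, the toy labels, and the use of the Gini index are choices made for the sketch, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_counts(labels, epsilon):
    """Release per-class counts with Laplace(1/epsilon) noise (count sensitivity 1)."""
    values, counts = np.unique(labels, return_counts=True)
    noise = rng.laplace(scale=1.0 / epsilon, size=len(counts))
    return dict(zip(values.tolist(), (counts + noise).tolist()))

def noisy_gini(labels, epsilon):
    c = np.maximum(list(noisy_counts(labels, epsilon).values()), 1e-9)
    p = c / c.sum()
    return 1.0 - np.sum(p ** 2)

labels_left = np.array([0] * 40 + [1] * 10)
labels_right = np.array([0] * 12 + [1] * 38)
for eps in (0.1, 1.0):
    # With a small budget the noisy score is unreliable, illustrating the
    # accuracy/privacy trade-off discussed in the abstract.
    print(eps, round(noisy_gini(labels_left, eps), 3),
          round(noisy_gini(labels_right, eps), 3))
```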
【Paper Link】 【Pages】:503-512
【Authors】: Raghav Bhaskar ; Srivatsan Laxman ; Adam D. Smith ; Abhradeep Thakurta
【Abstract】: Discovering frequent patterns from data is a popular exploratory technique in data mining. However, if the data are sensitive (e.g., patient health records, user behavior records), releasing information about significant patterns or trends carries significant risk to privacy. This paper shows how one can accurately discover and release the most significant patterns along with their frequencies in a data set containing sensitive information, while providing rigorous guarantees of privacy for the individuals whose information is stored there. We present two efficient algorithms for discovering the k most frequent patterns in a data set of sensitive records. Our algorithms satisfy differential privacy, a recently introduced definition that provides meaningful privacy guarantees in the presence of arbitrary external information. Differentially private algorithms require a degree of uncertainty in their output to preserve privacy. Our algorithms handle this by returning 'noisy' lists of patterns that are close to the actual list of the k most frequent patterns in the data. We define a new notion of utility that quantifies the output accuracy of private top-k pattern mining algorithms. In typical data sets, our utility criterion implies low false positive and false negative rates in the reported lists. We prove that our methods meet the new utility criterion; we also demonstrate the performance of our algorithms through extensive experiments on the transaction data sets from the FIMI repository. While the paper focuses on frequent pattern mining, the techniques developed here are relevant whenever the data mining output is a list of elements ordered according to an appropriately 'robust' measure of interest.
【Keywords】: differential privacy; exponential mechanism; frequent itemsets; frequent patterns; privacy
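【Code sketch】: A rough illustration, not the authors' algorithms: item supports receive Laplace noise and the k largest noisy supports are reported as the noisy top-k list. The toy supports and the privacy accounting (splitting the budget uniformly over the released counts) are simplifications made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

true_supports = {"bread": 812, "milk": 790, "beer": 430,
                 "diapers": 425, "eggs": 160, "tofu": 35}

def noisy_top_k(supports, k, epsilon):
    # Crude accounting: each released count gets an equal share of the budget;
    # a single-item support changes by at most 1 when one transaction changes.
    scale = len(supports) / epsilon
    noisy = {item: s + rng.laplace(scale=scale) for item, s in supports.items()}
    return sorted(noisy, key=noisy.get, reverse=True)[:k]

print(noisy_top_k(true_supports, k=3, epsilon=0.5))
```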
【Paper Link】 【Pages】:513-522
【Authors】: Purnamrita Sarkar ; Andrew W. Moore
【Abstract】: Link prediction, personalized graph search, fraud detection, and many such graph mining problems revolve around the computation of the most "similar" k nodes to a given query node. One widely used class of similarity measures is based on random walks on graphs, e.g., personalized pagerank, hitting and commute times, and simrank. There are two fundamental problems associated with these measures. First, existing online algorithms typically examine the local neighborhood of the query node which can become significantly slower whenever high-degree nodes are encountered (a common phenomenon in real-world graphs). We prove that turning high degree nodes into sinks results in only a small approximation error, while greatly improving running times. The second problem is that of computing similarities at query time when the graph is too large to be memory-resident. The obvious solution is to split the graph into clusters of nodes and store each cluster on a disk page; ideally random walks will rarely cross cluster boundaries and cause page-faults. Our contributions here are twofold: (a) we present an efficient deterministic algorithm to find the k closest neighbors (in terms of personalized pagerank) of any query node in such a clustered graph, and (b) we develop a clustering algorithm (RWDISK) that uses only sequential sweeps over data files. Empirical results on several large publicly available graphs like DBLP, Citeseer and Live-Journal (~ 90 M edges) demonstrate that turning high degree nodes into sinks not only improves running time of RWDISK by a factor of 3 but also boosts link prediction accuracy by a factor of 4 on average. We also show that RWDISK returns more desirable (high conductance and small size) clusters than the popular clustering algorithm METIS, while requiring much less memory. Finally our deterministic algorithm for computing nearest neighbors incurs far fewer page-faults (factor of 5) than actually simulating random walks.
【Keywords】: external memory; graphs; link prediction; random walks
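【Code sketch】: A small illustration of the sink idea on a toy graph: personalized PageRank is computed by power iteration, but nodes whose out-degree exceeds a threshold keep no outgoing edges, so walks terminate there. The graph, restart probability, and degree cap are arbitrary, and mass absorbed at sinks is simply dropped in this sketch.

```python
import numpy as np

# Toy adjacency list; node 3 is the (relatively) high-degree hub.
adj = {0: [1, 3], 1: [2, 3], 2: [0, 3], 3: [0, 1, 2, 4], 4: [3]}
n, alpha, degree_cap = len(adj), 0.15, 3

# Turn high-degree nodes into sinks by removing their outgoing edges.
out = {u: (vs if len(vs) <= degree_cap else []) for u, vs in adj.items()}

def personalized_pagerank(source, iters=100):
    p = np.zeros(n)
    p[source] = 1.0
    for _ in range(iters):
        new = np.zeros(n)
        new[source] = alpha                       # restart mass
        for u, vs in out.items():
            for v in vs:
                new[v] += (1 - alpha) * p[u] / len(vs)
        p = new
    return p

print(np.round(personalized_pagerank(0), 3))
```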
【Paper Link】 【Pages】:523-532
【Authors】: Saeed Alaei ; Ravi Kumar ; Azarakhsh Malekian ; Erik Vee
【Abstract】: Motivated by applications in guaranteed delivery in computational advertising, we consider the general problem of balanced allocation in a bipartite supply-demand setting. Our formulation captures the notion of deviation from being balanced by a convex penalty function. While this formulation admits a convex programming solution, we strive for more robust and scalable algorithms. For the case of L1 penalty functions we obtain a simple combinatorial algorithm based on min-cost flow in graphs and show how to precompute a linear amount of information such that the allocation along any edge can be approximated in constant time. We then extend our combinatorial solution to any convex function by solving a convex cost flow. These scalable methods may have applications in other contexts stipulating balanced allocation. We study the performance of our algorithms on large real-world graphs and show that they are efficient, scalable, and robust in practice.
【Keywords】: balanced allocation; convex flow; maximum flow
【Paper Link】 【Pages】:533-542
【Authors】: Hossein Maserrat ; Jian Pei
【Abstract】: Compressing social networks can substantially facilitate mining and advanced analysis of large social networks. Preferably, social networks should be compressed in a way that they still can be queried efficiently without decompression. Arguably, neighbor queries, which search for all neighbors of a query vertex, are the most essential operations on social networks. Can we compress social networks effectively in a neighbor query friendly manner, that is, neighbor queries still can be answered in sublinear time using the compression? In this paper, we develop an effective social network compression approach achieved by a novel Eulerian data structure using multi-position linearizations of directed graphs. Our method comes with a nontrivial theoretical bound on the compression rate. To the best of our knowledge, our approach is the first that can answer both out-neighbor and in-neighbor queries in sublinear time. An extensive empirical study on more than a dozen benchmark real data sets verifies our design.
【Keywords】: MP k linearization; compression; social networks
【Paper Link】 【Pages】:543-552
【Authors】: Guoming He ; Haijun Feng ; Cuiping Li ; Hong Chen
【Abstract】: Recently there has been a lot of interest in graph-based analysis. One of the most important aspects of graph-based analysis is to measure similarity between nodes in a graph. SimRank is a simple and influential measure of this kind, based on a solid graph theoretical model. However, existing methods on SimRank computation suffer from two limitations: 1) the computing cost can be very high in practice; and 2) they can only be applied on static graphs. In this paper, we exploit the inherent parallelism and high memory bandwidth of graphics processing units (GPU) to accelerate the computation of SimRank on large graphs. Furthermore, based on the observation that SimRank is essentially a first-order Markov Chain, we propose to utilize the iterative aggregation techniques for uncoupling Markov chains to compute SimRank scores in parallel for large graphs. The iterative aggregation method can be applied on dynamic graphs. Moreover, it can handle not only the link-updating problem but also the node-updating problem. Extensive experiments on synthetic and real data sets verify that the proposed methods are efficient and effective.
【Keywords】: gpu; graph; iterative aggregation; parallel; simrank
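【Code sketch】: The plain (CPU, dense) SimRank fixed-point iteration that the paper accelerates, shown on a toy directed graph; W is the in-degree-normalized adjacency matrix and C the decay factor. This is only the baseline computation, not the GPU or iterative-aggregation machinery.

```python
import numpy as np

A = np.array([[0, 1, 1, 0],     # A[i, j] = 1 if there is an edge i -> j
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

C, iters = 0.8, 20
in_deg = A.sum(axis=0)
W = np.divide(A, np.where(in_deg == 0, 1, in_deg))   # column-normalized adjacency

S = np.eye(len(A))
for _ in range(iters):
    S = C * (W.T @ S @ W)
    np.fill_diagonal(S, 1.0)          # SimRank of a node with itself is 1
print(np.round(S, 3))
```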
【Paper Link】 【Pages】:553-562
【Authors】: Ravi Kumar ; Mohammad Mahdian ; Mary McGlohon
【Abstract】: How do online conversations build? Is there a common model that human communication follows? In this work we explore these questions in detail. We analyze the structure of conversations in three different social datasets, namely, Usenet groups, Yahoo! Groups, and Twitter. We propose a simple mathematical model for the generation of basic conversation structures and then refine this model to take into account the identities of each member of the conversation.
【Keywords】: Twitter; conversations; graph models; groups; human response; threads; usenet
【Paper Link】 【Pages】:563-572
【Authors】: Xiang Wang ; Ian Davidson
【Abstract】: Constrained clustering has been well-studied for algorithms like K-means and hierarchical agglomerative clustering. However, how to encode constraints into spectral clustering remains a developing area. In this paper, we propose a flexible and generalized framework for constrained spectral clustering. In contrast to some previous efforts that implicitly encode Must-Link and Cannot-Link constraints by modifying the graph Laplacian or the resultant eigenspace, we present a more natural and principled formulation, which preserves the original graph Laplacian and explicitly encodes the constraints. Our method offers several practical advantages: it can encode the degree of belief (weight) in Must-Link and Cannot-Link constraints; it guarantees to lower-bound how well the given constraints are satisfied using a user-specified threshold; and it can be solved deterministically in polynomial time through generalized eigendecomposition. Furthermore, by inheriting the objective function from spectral clustering and explicitly encoding the constraints, much of the existing analysis of spectral clustering techniques is still valid. Consequently our work can be posed as a natural extension to unconstrained spectral clustering and be interpreted as finding the normalized min-cut of a labeled graph. We validate the effectiveness of our approach by empirical results on real-world data sets, with applications to constrained image segmentation and clustering benchmark data sets with both binary and degree-of-belief constraints.
【Keywords】: constrained clustering; spectral clustering
【Paper Link】 【Pages】:573-582
【Authors】: Xuan Hong Dang ; James Bailey
【Abstract】: Discovery of alternative clusterings is an important method for exploring complex datasets. It provides the capability for the user to view clustering behaviour from different perspectives and thus explore new hypotheses. However, current algorithms for alternative clustering have focused mainly on linear scenarios and may not perform as desired on datasets containing clusters with non-linear shapes. Our goal in this paper is to address this challenge of non-linearity. In particular, we propose a novel algorithm to uncover an alternative clustering that is distinctively different from an existing, reference clustering. Our technique is information-theory based and aims to ensure alternative clustering quality by maximizing the mutual information between clustering labels and data observations, whilst at the same time ensuring alternative clustering distinctiveness by minimizing the information sharing between the two clusterings. We perform experiments to assess our method against a large range of alternative clustering algorithms in the literature. We show that our technique's performance is generally better for non-linear scenarios and, furthermore, is highly competitive even for simpler, linear scenarios.
【Keywords】: alternative clustering; information theoretic learning; parzen window technique
【Paper Link】 【Pages】:583-592
【Authors】: Christian Böhm ; Claudia Plant ; Junming Shao ; Qinli Yang
【Abstract】: Synchronization is a powerful basic concept in nature regulating a large variety of complex processes ranging from the metabolism in the cell to social behavior in groups of individuals. Therefore, synchronization phenomena have been extensively studied and models robustly capturing the dynamical synchronization process have been proposed, e.g. the Extensive Kuramoto Model. Inspired by the powerful concept of synchronization, we propose Sync, a novel approach to clustering. The basic idea is to view each data object as a phase oscillator and simulate the interaction behavior of the objects over time. As time evolves, similar objects naturally synchronize together and form distinct clusters. Inherited from synchronization, Sync has several desirable properties: The clusters revealed by dynamic synchronization truly reflect the intrinsic structure of the data set, Sync does not rely on any distribution assumption and allows detecting clusters of arbitrary number, shape and size. Moreover, the concept of synchronization allows natural outlier handling, since outliers do not synchronize with cluster objects. For fully automatic clustering, we propose to combine Sync with the Minimum Description Length principle. Extensive experiments on synthetic and real world data demonstrate the effectiveness and efficiency of our approach.
【Keywords】: clustering; kuramoto model; synchronization
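【Code sketch】: A rough, simplified reading of the synchronization idea: every point is pulled toward its eps-neighbours with a Kuramoto-style sine update, and points that end up (numerically) at the same location are read off as one cluster. The data, eps, and iteration count are arbitrary; outlier handling and the MDL-based automation are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
               rng.normal(3.0, 0.1, (20, 2))])    # two well-separated groups
eps = 0.5

for _ in range(50):
    newX = X.copy()
    for i, x in enumerate(X):
        nbrs = X[np.linalg.norm(X - x, axis=1) <= eps]
        newX[i] = x + np.mean(np.sin(nbrs - x), axis=0)   # Kuramoto-style pull
    X = newX

# After synchronization, coincident points form the clusters.
_, labels = np.unique(np.round(X, 1), axis=0, return_inverse=True)
print(labels)
```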
【Paper Link】 【Pages】:593-602
【Authors】: M. Shahriar Hossain ; Satish Tadepalli ; Layne T. Watson ; Ian Davidson ; Richard F. Helm ; Naren Ramakrishnan
【Abstract】: Modern data mining settings involve a combination of attribute-valued descriptors over entities as well as specified relationships between these entities. We present an approach to cluster such non-homogeneous datasets by using the relationships to impose either dependent clustering or disparate clustering constraints. Unlike prior work that views constraints as boolean criteria, we present a formulation that allows constraints to be satisfied or violated in a smooth manner. This enables us to achieve dependent clustering and disparate clustering using the same optimization framework by merely maximizing versus minimizing the objective function. We present results on both synthetic data as well as several real-world datasets.
【Keywords】: clustering; contingency tables; multi-criteria optimization.; relational clustering
【Paper Link】 【Pages】:603-612
【Authors】: William B. March ; Parikshit Ram ; Alexander G. Gray
【Abstract】: The Euclidean Minimum Spanning Tree problem has applications in a wide range of fields, and many efficient algorithms have been developed to solve it. We present a new, fast, general EMST algorithm, motivated by the clustering and analysis of astronomical data. Large-scale astronomical surveys, including the Sloan Digital Sky Survey, and large simulations of the early universe, such as the Millennium Simulation, can contain millions of points and fill terabytes of storage. Traditional EMST methods scale quadratically, and more advanced methods lack rigorous runtime guarantees. We present a new dual-tree algorithm for efficiently computing the EMST, use adaptive algorithm analysis to prove the tightest (and possibly optimal) runtime bound for the EMST problem to-date, and demonstrate the scalability of our method on astronomical data sets.
【Keywords】: adaptive algorithm analysis; euclidean minimum spanning trees
【Paper Link】 【Pages】:613-622
【Authors】: Jian-Guang Lou ; Qiang Fu ; Shengqi Yang ; Jiang Li ; Bin Wu
【Abstract】: Successful software maintenance is becoming increasingly critical due to the increasing dependence of our society and economy on software systems. One key problem of software maintenance is the difficulty in understanding evolving software systems. Program workflows can help system operators and administrators understand system behaviors and verify system executions, and thus greatly facilitate system maintenance. In this paper, we propose an algorithm to automatically discover program workflows from event traces that record system events during execution. Different from existing workflow mining algorithms, our approach can construct concurrent workflows from traces of interleaved events. Our workflow mining approach is a three-step, coarse-to-fine algorithm. First, we mine temporal dependencies for each pair of events. Then, based on the mined pairwise temporal dependencies, we construct a basic workflow model with a breadth-first path-pruning algorithm. After that, we refine the workflow by verifying it against all training event traces. The refinement algorithm tries to find a workflow that can interpret all event traces with minimal state transitions and threads. The results on both simulation data and real program data show that our algorithm is highly effective.
【Keywords】: graphical behavior models; temporal properties; workflow mining
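【Code sketch】: A toy version of the first, coarse step described above: pairwise temporal dependencies ("a always occurs before b in every trace containing both") are mined from interleaved traces. The traces are invented, and the later model-construction and refinement steps are not shown.

```python
from itertools import permutations

traces = [                              # hypothetical event traces
    ["start", "load", "parse", "store", "end"],
    ["start", "parse", "load", "store", "end"],
    ["start", "load", "store", "parse", "end"],
]

events = sorted({e for t in traces for e in t})

def always_before(a, b):
    """True if, in every trace containing both, the first a occurs before the first b."""
    for t in traces:
        if a in t and b in t and t.index(a) >= t.index(b):
            return False
    return True

dependencies = [(a, b) for a, b in permutations(events, 2) if always_before(a, b)]
print(dependencies)
```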
【Paper Link】 【Pages】:623-632
【Authors】: Dafna Shahaf ; Carlos Guestrin
【Abstract】: The process of extracting useful knowledge from large datasets has become one of the most pressing problems in today's society. The problem spans entire sectors, from scientists to intelligence analysts and web users, all of whom are constantly struggling to keep up with the larger and larger amounts of content published every day. With this much data, it is often easy to miss the big picture. In this paper, we investigate methods for automatically connecting the dots -- providing a structured, easy way to navigate within a new topic and discover hidden connections. We focus on the news domain: given two news articles, our system automatically finds a coherent chain linking them together. For example, it can recover the chain of events starting with the decline of home prices (January 2007), and ending with the ongoing health-care debate. We formalize the characteristics of a good chain and provide an efficient algorithm (with theoretical guarantees) to connect two fixed endpoints. We incorporate user feedback into our framework, allowing the stories to be refined and personalized. Finally, we evaluate our algorithm over real news data. Our user studies demonstrate the algorithm's effectiveness in helping users understand the news.
【Keywords】: coherence; news
【Paper Link】 【Pages】:633-642
【Authors】: Zhaonian Zou ; Hong Gao ; Jianzhong Li
【Abstract】: Frequent subgraph mining has been extensively studied on certain graph data. However, uncertainties inherently accompany graph data in practice, and there is very little work on mining uncertain graph data. This paper investigates frequent subgraph mining on uncertain graphs under probabilistic semantics. Specifically, a measure called φ-frequent probability is introduced to evaluate the degree of recurrence of subgraphs. Given a set of uncertain graphs and two numbers 0 < φ,τ < 1, the goal is to quickly find all subgraphs with φ-frequent probability at least τ. Due to the NP-hardness of the problem, an approximate mining algorithm is proposed. Let 0 < δ < 1 be a parameter. The algorithm guarantees to find any frequent subgraph S with probability at least (1 - δ/2)^s, where s is the number of edges of S. In addition, it is thoroughly discussed how to set δ to guarantee the overall approximation quality of the algorithm. Extensive experiments on real uncertain graph data verify that the algorithm is efficient and that the mining results have very high quality.
【Keywords】: frequent subgraph; probabilistic semantics; uncertain graph
【Paper Link】 【Pages】:643-652
【Authors】: Hongliang Fei ; Jun Huan
【Abstract】: Boosting is a very successful classification algorithm that produces a linear combination of "weak" classifiers (a.k.a. base learners) to obtain high-quality classification models. In this paper we propose a new boosting algorithm in which the base learners have structural relationships in the functional space. Though such relationships are generic, our work is particularly motivated by the emerging topic of pattern-based classification for semi-structured data, including graphs. Towards an efficient incorporation of the structure information, we design a general model in which an undirected graph captures the relationships among subgraph-based base learners. In our method, we combine both an L1 norm penalty and a Laplacian-based L2 norm penalty with the logit loss function of LogitBoost. In this approach, we enforce model sparsity and smoothness in the functional space spanned by the basis functions. We derive efficient optimization algorithms based on coordinate descent for the new boosting formulation and theoretically prove that it exhibits a natural grouping effect for nearby spatial or overlapping features. Using a comprehensive experimental study, we demonstrate the effectiveness of the proposed learning methods.
【Keywords】: L1 regularization; boosting; feature selection; graph classification; semi-structured data
【Paper Link】 【Pages】:653-662
【Authors】: Seungil Huh ; Stephen E. Fienberg
【Abstract】: Topic modeling has been popularly used for data analysis in various domains including text documents. Previous topic models, such as probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDA), have shown impressive success in discovering low-rank hidden structures for modeling text documents. These models, however, do not take into account the manifold structure of data, which is generally informative for the non-linear dimensionality reduction mapping. More recent models, namely Laplacian PLSI (LapPLSI) and Locally-consistent Topic Model (LTM), have incorporated the local manifold structure into topic models and have shown the resulting benefits. But these approaches fall short of the full discriminating power of manifold learning as they only enhance the proximity between the low-rank representations of neighboring pairs without any consideration for non-neighboring pairs. In this paper, we propose Discriminative Topic Model (DTM) that separates non-neighboring pairs from each other in addition to bringing neighboring pairs closer together, thereby preserving the global manifold structure as well as improving the local consistency. We also present a novel model fitting algorithm based on the generalized EM and the concept of Pareto improvement. As a result, DTM achieves higher classification performance in a semi-supervised setting by effectively exposing the manifold structure of data. We provide empirical evidence on text corpora to demonstrate the success of DTM in terms of classification accuracy and robustness to parameters compared to state-of-the-art techniques.
【Keywords】: dimensionality reduction; document classification; semi-supervised learning; topic modeling
【Paper Link】 【Pages】:663-672
【Authors】: Tomoharu Iwata ; Takeshi Yamada ; Yasushi Sakurai ; Naonori Ueda
【Abstract】: We propose an online topic model for sequentially analyzing the time evolution of topics in document collections. Topics naturally evolve with multiple timescales. For example, some words may be used consistently over one hundred years, while other words emerge and disappear over periods of a few days. Thus, in the proposed model, current topic-specific distributions over words are assumed to be generated based on the multiscale word distributions of the previous epoch. Considering both the long-timescale dependency as well as the short-timescale dependency yields a more robust model. We derive efficient online inference procedures based on a stochastic EM algorithm, in which the model is sequentially updated using newly obtained data; this means that past data are not required to make the inference. We demonstrate the effectiveness of the proposed method in terms of predictive performance and computational efficiency by examining collections of real documents with timestamps.
【Keywords】: online learning; time-series analysis; topic model
【Paper Link】 【Pages】:673-682
【Authors】: Issei Sato ; Hiroshi Nakagawa
【Abstract】: One important approach to knowledge discovery and data mining is to estimate unobserved variables, because latent variables can indicate hidden specific properties of the observed data. The latent factor model assumes that each item in a record has a latent factor; the co-occurrence of items can then be modeled by latent factors. In document modeling, a record indicates a document represented as a "bag of words," meaning that the order of words is ignored; an item indicates a word, and a latent factor indicates a topic. Latent Dirichlet allocation (LDA) is a widely used Bayesian topic model that places a Dirichlet prior over the latent topic distribution of a document having multiple topics. LDA assumes that latent topics, i.e., discrete latent variables, are distributed according to a multinomial distribution whose parameters are generated from the Dirichlet distribution. LDA also models the word distribution with a multinomial distribution whose parameters follow the Dirichlet distribution. This Dirichlet-multinomial setting, however, cannot capture the power-law phenomenon of word distributions, known as Zipf's law in linguistics. We therefore propose a novel topic model using the Pitman-Yor (PY) process, called the PY topic model. The PY topic model captures two properties of a document: a power-law word distribution and the presence of multiple topics. In an experiment using real data, this model outperformed LDA in document modeling in terms of perplexity.
【Keywords】: Pitman-Yor process; latent dirichlet allocation; nonparametric bayes; power-law; topic model
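【Code sketch】: A small simulation of why a Pitman-Yor prior suits word frequencies: the Pitman-Yor Chinese restaurant process with discount d > 0 yields heavy-tailed, roughly Zipf-like counts, unlike the Dirichlet-multinomial setting criticized above. The discount and strength values are arbitrary, and this is not the PY topic model itself.

```python
import random

random.seed(0)

def pitman_yor_sample(n, d=0.7, theta=10.0):
    """Draw n table assignments from the Pitman-Yor CRP (d = discount, theta = strength)."""
    counts = []                       # customers per table ("word type" counts)
    for i in range(n):
        # Existing table k has weight counts[k] - d; a new table has weight theta + d*K.
        weights = [c - d for c in counts] + [theta + d * len(counts)]
        r = random.uniform(0, theta + i)
        acc = 0.0
        for table, w in enumerate(weights):
            acc += w
            if r <= acc:
                break
        if table == len(counts):
            counts.append(1)          # a new table corresponds to a new word type
        else:
            counts[table] += 1
    return counts

counts = sorted(pitman_yor_sample(20000), reverse=True)
# The sorted counts display a heavy, roughly power-law tail.
print(counts[:10], "...", len(counts), "types")
```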
【Paper Link】 【Pages】:683-692
【Authors】: Caimei Lu ; Xiaohua Hu ; Xin Chen ; Jung-ran Park ; Tingting He ; Zhoujun Li
【Abstract】: In this paper, we propose a new probabilistic generative model, called the Topic-Perspective Model, for simulating the generation process of social annotations. Different from other generative models, in our model the tag generation process is separated from the content-term generation process. While content terms are generated only from resource topics, social tags are generated by resource topics and user perspectives together. The proposed probabilistic model produces more useful information than previously proposed models. The parameters learned from this model include: (1) the topical distribution of each document, (2) the perspective distribution of each user, (3) the word distribution of each topic, (4) the tag distribution of each topic, (5) the tag distribution of each user perspective, and (6) the probability of each tag being generated from resource topics or user perspectives. Experimental results show that the proposed model has better generalization performance and tag-prediction ability than the two models proposed in previous research.
【Keywords】: perplexity; social annotation; social tagging; user modeling
【Paper Link】 【Pages】:693-702
【Authors】: Michael Jahrer ; Andreas Töscher ; Robert A. Legenstein
【Abstract】: We analyze the application of ensemble learning to recommender systems on the Netflix Prize dataset. For our analysis we use a set of diverse state-of-the-art collaborative filtering (CF) algorithms, which include: SVD, Neighborhood Based Approaches, Restricted Boltzmann Machine, Asymmetric Factor Model and Global Effects. We show that linearly combining (blending) a set of CF algorithms increases the accuracy and outperforms any single CF algorithm. Furthermore, we show how to use ensemble methods for blending predictors in order to outperform a single blending algorithm. The dataset and the source code for the ensemble blending are available online.
【Keywords】: Netflix; ensemble learning; recommender systems; supervised learning
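【Code sketch】: A minimal illustration of linear blending, with made-up predictor columns standing in for CF models (the paper blends real SVD, neighborhood, RBM and other predictors, and also studies nonlinear ensemble blenders). Ridge regression here is fitted and evaluated on the same probe ratings purely for brevity; a separate test split would be used in practice.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
true_ratings = rng.uniform(1, 5, size=1000)            # probe-set ratings

# Hypothetical predictions from three CF models, each a noisy, biased view of the truth.
preds = np.column_stack([
    true_ratings + rng.normal(0.0, 0.9, 1000),
    true_ratings + rng.normal(0.3, 0.7, 1000),
    true_ratings + rng.normal(-0.2, 1.1, 1000),
])

blender = Ridge(alpha=1.0).fit(preds, true_ratings)
rmse = lambda p: np.sqrt(np.mean((p - true_ratings) ** 2))
print("single models:", [round(rmse(preds[:, j]), 3) for j in range(3)])
print("blend:", round(rmse(blender.predict(preds)), 3))
```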
【Paper Link】 【Pages】:703-712
【Authors】: Deepak Agarwal ; Bee-Chung Chen ; Pradheep Elango
【Abstract】: Recommender problems with large and dynamic item pools are ubiquitous in web applications like content optimization, online advertising and web search. Despite the availability of rich item meta-data, excess heterogeneity at the item level often requires inclusion of item-specific "factors" (or weights) in the model. However, since estimating item factors is computationally intensive, it poses a challenge for time-sensitive recommender problems where it is important to rapidly learn factors for new items (e.g., news articles, event updates, tweets) in an online fashion. In this paper, we propose a novel method called FOBFM (Fast Online Bilinear Factor Model) to learn item-specific factors quickly through online regression. The online regression for each item can be performed independently and hence the procedure is fast, scalable and easily parallelizable. However, the convergence of these independent regressions can be slow due to high dimensionality. The central idea of our approach is to use a large amount of historical data to initialize the online models based on offline features and learn linear projections that can effectively reduce the dimensionality. We estimate the rank of our linear projections by taking recourse to online model selection based on optimizing predictive likelihood. Through extensive experiments, we show that our method significantly and uniformly outperforms other competitive methods and obtains relative lifts that are in the range of 10-15% in terms of predictive log-likelihood, 200-300% for a rank correlation metric on a proprietary My Yahoo! dataset; it obtains 9% reduction in root mean squared error over the previously best method on a benchmark MovieLens dataset using a time-based train/test data split.
【Keywords】: dyadic data; factorization; latent factor; recommender systems; reduced rank regression
【Paper Link】 【Pages】:713-722
【Authors】: Harald Steck
【Abstract】: Users typically rate only a small fraction of all available items. We show that the absence of ratings carries useful information for improving the top-k hit rate concerning all items, a natural accuracy measure for recommendations. For testing recommender systems, we present two performance measures that can be estimated, under mild assumptions, without bias from the data even when ratings are missing not at random (MNAR). For achieving optimal test results, we present appropriate surrogate objective functions for efficient training on MNAR data. Their main property is to account for all ratings - whether observed or missing in the data. Concerning the top-k hit rate on test data, our experiments indicate dramatic improvements over even sophisticated methods that are optimized on observed ratings only.
【Keywords】: recommender systems
【Paper Link】 【Pages】:723-732
【Authors】: Liang Xiang ; Quan Yuan ; Shiwan Zhao ; Li Chen ; Xiatian Zhang ; Qing Yang ; Jimeng Sun
【Abstract】: Accurately capturing user preferences over time is a great practical challenge in recommender systems. Simple correlation over time is typically not meaningful, since users change their preferences due to different external events. User behavior can often be determined by individual's long-term and short-term preferences. How to represent users' long-term and short-term preferences? How to leverage them for temporal recommendation? To address these challenges, we propose Session-based Temporal Graph (STG) which simultaneously models users' long-term and short-term preferences over time. Based on the STG model framework, we propose a novel recommendation algorithm Injected Preference Fusion (IPF) and extend the personalized Random Walk for temporal recommendation. Finally, we evaluate the effectiveness of our method using two real datasets on citations and social bookmarking, in which our proposed method IPF gives 15%-34% improvement over the previous state-of-the-art.
【Keywords】: graph; temporal recommendation; user preference
【Paper Link】 【Pages】:733-742
【Authors】: Gengxin Miao ; Louise E. Moser ; Xifeng Yan ; Shu Tao ; Yi Chen ; Nikos Anerousis
【Abstract】: Ticket resolution is a critical, yet challenging, aspect of the delivery of IT services. A large service provider needs to handle, on a daily basis, thousands of tickets that report various types of problems. Many of those tickets bounce among multiple expert groups before being transferred to the group with the right expertise to solve the problem. Finding a methodology that reduces such bouncing and hence shortens ticket resolution time is a long-standing challenge. In this paper, we present a unified generative model, the Optimized Network Model (ONM), that characterizes the lifecycle of a ticket, using both the content and the routing sequence of the ticket. ONM uses maximum likelihood estimation, to represent how the information contained in a ticket is used by human experts to make ticket routing decisions. Based on ONM, we develop a probabilistic algorithm to generate ticket routing recommendations for new tickets in a network of expert groups. Our algorithm calculates all possible routes to potential resolvers and makes globally optimal recommendations, in contrast to existing classification methods that make static and locally optimal recommendations. Experiments show that our method significantly outperforms existing solutions.
【Keywords】: expert network; generative model; ticket resolution
【Paper Link】 【Pages】:743-752
【Authors】: Chi-Hoon Lee
【Abstract】: Much research in data mining and machine learning has led to numerous practical applications. Spam filtering, fraud detection, and user query-intent analysis have relied heavily on machine-learned classifiers, resulting in improvements in robust classification accuracy. Combining multiple classifiers (a.k.a. ensemble learning) is well studied and known to improve the effectiveness of a classifier. To address two key challenges in ensemble learning -- (1) learning the weights of individual classifiers and (2) the rule for combining their weighted responses -- this paper proposes a novel ensemble classifier, EnLR, that computes weights for the responses of discriminative classifiers and combines their weighted responses to produce a single response for a test instance. The combination rule aggregates weighted responses, where the weight of an individual classifier is inversely related to the variance around its response. Here, variance quantifies the uncertainty of the discriminative classifier's parameters, which in turn depends on the training samples. As opposed to other ensemble methods, where the weight of each individual classifier is learned as part of parameter learning and the same weight is applied to all test instances, our model is actively adjusted as individual classifiers become confident in their decisions for a test instance. Our empirical experiments on various data sets demonstrate that the combined classifier produces effective results when compared with a single classifier. The novel classifier shows statistically significantly better accuracy than the well-known ensemble methods bagging and AdaBoost. In addition to robust accuracy, our model is extremely efficient when dealing with high volumes of training samples, due to the independent learning paradigm among its multiple classifiers, and it is simple to implement in a distributed computing environment such as Hadoop.
【Keywords】: classification; ensemble learning; logistic regression
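【Code sketch】: A hedged approximation of the instance-specific, inverse-variance weighting idea, not the authors' exact EnLR: each logistic-regression member's parameter uncertainty is approximated by the inverse Fisher information, the per-instance score variance by x'Cov x, and members with lower variance at a test point receive higher weight. The bootstrap training, the data set, and the small ridge term added for numerical stability are choices made for the sketch.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_test, y_test = X[-200:], y[-200:]
X, y = X[:-200], y[:-200]

members = []
for _ in range(5):                                  # train members independently on bootstraps
    idx = rng.choice(len(y), len(y), replace=True)
    clf = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    p = clf.predict_proba(X[idx])[:, 1]
    Xb = np.hstack([X[idx], np.ones((len(idx), 1))])        # intercept column appended
    fisher = Xb.T @ (Xb * (p * (1 - p))[:, None])            # observed Fisher information
    cov = np.linalg.inv(fisher + 1e-6 * np.eye(Xb.shape[1])) # approximate parameter covariance
    members.append((clf, cov))

Xt = np.hstack([X_test, np.ones((len(X_test), 1))])
probs, weights = [], []
for clf, cov in members:
    probs.append(clf.predict_proba(X_test)[:, 1])
    var = np.einsum("ij,jk,ik->i", Xt, cov, Xt)     # per-instance score variance
    weights.append(1.0 / (var + 1e-12))             # lower variance -> higher weight
probs, weights = np.array(probs), np.array(weights)

combined = (probs * weights).sum(axis=0) / weights.sum(axis=0)
print("accuracy:", np.mean((combined > 0.5) == y_test))
```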
【Paper Link】 【Pages】:753-762
【Authors】: Yuefeng Li ; Abdulmohsen Algarni ; Ning Zhong
【Abstract】: It is a big challenge to guarantee the quality of discovered relevance features in text documents for describing user preferences, because of the large number of terms, patterns, and noise. Most existing popular text mining and classification methods have adopted term-based approaches; however, they all suffer from the problems of polysemy and synonymy. Over the years, people have often held the hypothesis that pattern-based methods should perform better than term-based ones in describing user preferences, but many experiments do not support this hypothesis. The innovative technique presented in this paper makes a breakthrough on this difficulty. The technique discovers both positive and negative patterns in text documents as higher-level features and uses them to accurately weight low-level features (terms) based on their specificity and their distributions in the higher-level features. Substantial experiments using this technique on Reuters Corpus Volume 1 and TREC topics show that the proposed approach significantly outperforms both state-of-the-art term-based methods, underpinned by Okapi BM25, Rocchio or support vector machines, and pattern-based methods on precision, recall and F measures.
【Keywords】: pattern mining; relevance feature discovery; text mining
【Paper Link】 【Pages】:763-772
【Authors】: Guan Yu ; Rui-zhang Huang ; Zhaojun Wang
【Abstract】: One essential issue in document clustering is estimating the appropriate number of clusters for a document collection, i.e., the number of groups into which the documents should be partitioned. In this paper, we propose a novel approach, namely DPMFS, to address this issue. The proposed approach is designed 1) to group documents into a set of clusters whose number is determined automatically by a Dirichlet process mixture model, and 2) to identify discriminative words and separate them from irrelevant noise words via a stochastic search variable selection technique. We explore the performance of our proposed approach on both a synthetic dataset and several realistic document datasets. The comparison between our proposed approach and state-of-the-art document clustering approaches indicates that our approach is robust and effective for document clustering.
【Keywords】: dirichlet process mixture model; document clustering; feature selection
【Paper Link】 【Pages】:773-782
【Authors】: Frank Reichartz ; Hannes Korte ; Gerhard Paass
【Abstract】: An important step for understanding the semantic content of text is the extraction of semantic relations between entities in natural language documents. Automatic extraction techniques have to be able to identify different versions of the same relation which usually may be expressed in a great variety of ways. Therefore these techniques benefit from taking into account many syntactic and semantic features, especially parse trees generated by automatic sentence parsers. Typed dependency parse trees are edge and node labeled parse trees whose labels and topology contains valuable semantic clues. This information can be exploited for relation extraction by the use of kernels over structured data for classification. In this paper we present new tree kernels for relation extraction over typed dependency parse trees. On a public benchmark data set we are able to demonstrate a significant improvement in terms of relation extraction quality of our new kernels over other state-of-the-art kernels.
【Keywords】: information extraction; relation extraction; text mining
【Paper Link】 【Pages】:783-792
【Authors】: Hongning Wang ; Yue Lu ; Chengxiang Zhai
【Abstract】: In this paper, we define and study a new opinionated text data analysis problem called Latent Aspect Rating Analysis (LARA), which aims at analyzing opinions expressed about an entity in an online review at the level of topical aspects to discover each individual reviewer's latent opinion on each aspect as well as the relative emphasis on different aspects when forming the overall judgment of the entity. We propose a novel probabilistic rating regression model to solve this new text mining problem in a general way. Empirical experiments on a hotel review data set show that the proposed latent rating regression model can effectively solve the problem of LARA, and that the detailed analysis of opinions at the level of topical aspects enabled by the proposed model can support a wide range of application tasks, such as aspect opinion summarization, entity ranking based on aspect ratings, and analysis of reviewers rating behavior.
【Keywords】: algorithms; experimentation
【Paper Link】 【Pages】:793-802
【Authors】: Xiangnan Kong ; Philip S. Yu
【Abstract】: The problem of graph classification has attracted great interest in the last decade. Current research on graph classification assumes the existence of large amounts of labeled training graphs. However, in many applications, the labels of graph data are very expensive or difficult to obtain, while there are often copious amounts of unlabeled graph data available. In this paper, we study the problem of semi-supervised feature selection for graph classification and propose a novel solution, called gSSC, to efficiently search for optimal subgraph features with labeled and unlabeled graphs. Different from existing feature selection methods in vector spaces which assume the feature set is given, we perform semi-supervised feature selection for graph data in a progressive way together with the subgraph feature mining process. We derive a feature evaluation criterion, named gSemi, to estimate the usefulness of subgraph features based upon both labeled and unlabeled graphs. Then we propose a branch-and-bound algorithm to efficiently search for optimal subgraph features by judiciously pruning the subgraph search space. Empirical studies on several real-world tasks demonstrate that our semi-supervised feature selection approach can effectively boost graph classification performances with semi-supervised feature selection and is very efficient by pruning the subgraph search space using both labeled and unlabeled graphs.
【Keywords】: data mining; feature selection; graph classification; semi-supervised learning
【Paper Link】 【Pages】:803-812
【Authors】: Christopher DuBois ; Padhraic Smyth
【Abstract】: Many social networks can be characterized by a sequence of dyadic interactions between individuals. Techniques for analyzing such events are of increasing interest. In this paper, we describe a generative model for dyadic events, where each event arises from one of C latent classes, and the properties of the event (sender, recipient, and type) are chosen from distributions over these entities conditioned on the chosen class. We present two algorithms for inference in this model: an expectation-maximization algorithm as well as a Markov chain Monte Carlo procedure based on collapsed Gibbs sampling. To analyze the model's predictive accuracy, the algorithms are applied to multiple real-world data sets involving email communication, international political events, and animal behavior data.
【Keywords】: collapsed gibbs sampling; relational data
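To make the generative model above concrete, the following is a minimal EM sketch (not the authors' implementation, which also includes a collapsed Gibbs sampler) for a latent-class model in which each event (sender, recipient, type) is drawn from class-conditional categorical distributions; all function and variable names are hypothetical.

    import numpy as np

    def em_dyadic(events, n_senders, n_recipients, n_types, C=3, iters=50, seed=0):
        """EM for a latent-class model of dyadic events.
        Each event (s, r, t) arises from a latent class c with probability
        pi[c] * P(s|c) * P(r|c) * P(t|c); events is an (N, 3) integer array."""
        rng = np.random.default_rng(seed)
        E = np.asarray(events, dtype=int)
        N = len(E)
        pi = np.full(C, 1.0 / C)
        Ps = rng.dirichlet(np.ones(n_senders), C)      # P(sender | class), shape (C, n_senders)
        Pr = rng.dirichlet(np.ones(n_recipients), C)   # P(recipient | class)
        Pt = rng.dirichlet(np.ones(n_types), C)        # P(type | class)
        for _ in range(iters):
            # E-step: responsibilities gamma[n, c] proportional to pi[c] P(s|c) P(r|c) P(t|c)
            gamma = pi * Ps[:, E[:, 0]].T * Pr[:, E[:, 1]].T * Pt[:, E[:, 2]].T
            gamma /= gamma.sum(axis=1, keepdims=True)
            # M-step: re-estimate mixing weights and class-conditional categoricals
            pi = gamma.sum(axis=0) / N
            for c in range(C):
                Ps[c] = np.bincount(E[:, 0], weights=gamma[:, c], minlength=n_senders) + 1e-9
                Pr[c] = np.bincount(E[:, 1], weights=gamma[:, c], minlength=n_recipients) + 1e-9
                Pt[c] = np.bincount(E[:, 2], weights=gamma[:, c], minlength=n_types) + 1e-9
                Ps[c] /= Ps[c].sum(); Pr[c] /= Pr[c].sum(); Pt[c] /= Pt[c].sum()
        return pi, Ps, Pr, Pt, gamma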
【Paper Link】 【Pages】:813-822
【Authors】: Jing Gao ; Feng Liang ; Wei Fan ; Chi Wang ; Yizhou Sun ; Jiawei Han
【Abstract】: Linked or networked data are ubiquitous in many applications. Examples include web data or hypertext documents connected via hyperlinks, social networks or user profiles connected via friend links, co-authorship and citation information, blog data, movie reviews and so on. In these datasets (called "information networks"), closely related objects that share the same properties or interests form a community. For example, a community in the blogosphere could be users mostly interested in cell phone reviews and news. Outlier detection in information networks can reveal important anomalous and interesting behaviors that are not obvious if community information is ignored. An example could be a low-income person being friends with many rich people even though his income is not anomalously low when considered over the entire population. This paper first introduces the concept of community outliers (interesting points or rising stars for a more positive sense), and then shows that well-known baseline approaches without considering links or community information cannot find these community outliers. We propose an efficient solution by modeling networked data as a mixture model composed of multiple normal communities and a set of randomly generated outliers. The probabilistic model characterizes both data and links simultaneously by defining their joint distribution based on hidden Markov random fields (HMRF). Maximizing the data likelihood and the posterior of the model gives the solution to the outlier inference problem. We apply the model to both synthetic data and DBLP data sets, and the results demonstrate the importance of this concept, as well as the effectiveness and efficiency of the proposed approach.
【Keywords】: community discovery; information networks; outlier detection
【Paper Link】 【Pages】:823-832
【Authors】: Dan Preston ; Carla E. Brodley ; Roni Khardon ; Damien Sulla-Menashe ; Mark A. Friedl
【Abstract】: Two aspects are crucial when constructing any real world supervised classification task: the set of classes whose distinction might be useful for the domain expert, and the set of classifications that can actually be distinguished by the data. Often a set of labels is defined with some initial intuition but these are not the best match for the task. For example, labels have been assigned for land cover classification of the Earth but it has been suspected that these labels are not ideal and some classes may be best split into subclasses whereas others should be merged. This paper formalizes this problem using three ingredients: the existing class labels, the underlying separability in the data, and a special type of input from the domain expert. We require a domain expert to specify an L × L matrix of pairwise probabilistic constraints expressing their beliefs as to whether the L classes should be kept separate, merged, or split. This type of input is intuitive and easy for experts to supply. We then show that the problem can be solved by casting it as an instance of penalized probabilistic clustering (PPC). Our method, Class-Level PPC (CPPC) extends PPC showing how its time complexity can be reduced from O(N2) to O(NL) for the problem of class re-definition. We further extend the algorithm by presenting a heuristic to measure adherence to constraints, and providing a criterion for determining the model complexity (number of classes) for constraint-based clustering. We demonstrate and evaluate CPPC on artificial data and on our motivating domain of land cover classification. For the latter, an evaluation by domain experts shows that the algorithm discovers novel class definitions that are better suited to land cover classification than the original set of labels.
【Keywords】: class discovery; constraint-based clustering; kdd-process; mining scientific data; remote sensing
【Paper Link】 【Pages】:833-842
【Authors】: Hsiang-Fu Yu ; Cho-Jui Hsieh ; Kai-Wei Chang ; Chih-Jen Lin
【Abstract】: Recent advances in linear classification have shown that for applications such as document classification, the training can be extremely efficient. However, most of the existing training methods are designed by assuming that data can be stored in the computer memory. These methods cannot be easily applied to data larger than the memory capacity due to the random access to the disk. We propose and analyze a block minimization framework for data larger than the memory size. At each step a block of data is loaded from the disk and handled by certain learning methods. We investigate two implementations of the proposed framework for primal and dual SVMs, respectively. As data cannot fit in memory, many design considerations are very different from those for traditional algorithms. Experiments using data sets 20 times larger than the memory demonstrate the effectiveness of the proposed method.
【Keywords】: SVM; block minimization; large scale learning
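The block-minimization idea in the abstract above can be sketched as follows: data blocks live on disk, one block is loaded at a time, and a learner updates the model on that block before moving on. This is a hedged illustration only; a Pegasos-style sub-gradient pass stands in for the paper's per-block primal/dual solvers, and the .npz block format and all names are assumptions.

    import numpy as np

    def block_svm(block_files, dim, lam=1e-4, epochs=5, seed=0):
        """Out-of-core linear SVM via block minimization: stream blocks from disk
        and run a Pegasos-style sub-gradient pass on each block in turn."""
        rng = np.random.default_rng(seed)
        w = np.zeros(dim)
        t = 1
        for _ in range(epochs):
            for path in block_files:            # each .npz block holds X (n_i x dim) and y in {-1, +1}
                blk = np.load(path)
                X, y = blk["X"], blk["y"]
                for i in rng.permutation(len(y)):
                    eta = 1.0 / (lam * t)
                    margin = y[i] * X[i].dot(w)
                    w *= (1.0 - eta * lam)      # shrink: regularization part of the sub-gradient
                    if margin < 1.0:
                        w += eta * y[i] * X[i]  # hinge-loss part, only when the margin is violated
                    t += 1
        return w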
【Paper Link】 【Pages】:843-852
【Authors】: Ryan J. Prenger ; Tracy D. Lemmond ; Kush R. Varshney ; Barry Y. Chen ; William G. Hanley
【Abstract】: The generalization error, or probability of misclassification, of ensemble classifiers has been shown to be bounded above by a function of the mean correlation between the constituent (i.e., base) classifiers and their average strength. This bound suggests that increasing the strength and/or decreasing the correlation of an ensemble's base classifiers may yield improved performance under the assumption of equal error costs. However, this and other existing bounds do not directly address application spaces in which error costs are inherently unequal. For applications involving binary classification, Receiver Operating Characteristic (ROC) curves, performance curves that explicitly trade off false alarms and missed detections, are often utilized to support decision making. To address performance optimization in this context, we have developed a lower bound for the entire ROC curve that can be expressed in terms of the class-specific strength and correlation of the base classifiers. We present empirical analyses demonstrating the efficacy of these bounds in predicting relative classifier performance. In addition, we specify performance regions of the ROC curve that are naturally delineated by the class-specific strengths of the base classifiers and show that each of these regions can be associated with a unique set of guidelines for performance optimization of binary classifiers within unequal error cost regimes.
【Keywords】: classifiers; cost-specific; ensemble
【Paper Link】 【Pages】:853-860
【Authors】: Vikas C. Raykar ; Balaji Krishnapuram ; Shipeng Yu
【Abstract】: We propose a method to train a cascade of classifiers by simultaneously optimizing all its stages. The approach relies on the idea of optimizing soft cascades. In particular, instead of optimizing a deterministic hard cascade, we optimize a stochastic soft cascade where each stage accepts or rejects samples according to a probability distribution induced by the previous stage-specific classifier. The overall system accuracy is maximized while explicitly controlling the expected cost for feature acquisition. Experimental results on three clinically relevant problems show the effectiveness of our proposed approach in achieving the desired tradeoff between accuracy and feature acquisition cost.
【Keywords】: accuracy vs cost; cascade design; cost sensitive learning
【Paper Link】 【Pages】:861-870
【Authors】: Chuancong Gao ; Jianyong Wang
【Abstract】: Classification is one of the most essential tasks in data mining. Unlike other methods, associative classification tries to find all the frequent patterns in the input categorical data that satisfy a user-specified minimum support and/or other discrimination measures such as minimum confidence or information gain. Those patterns are later used either as rules for a rule-based classifier or as training features for a support vector machine (SVM) classifier, after a feature selection procedure that usually tries to cover as many of the input instances as possible with the most discriminative patterns. Several algorithms have also been proposed to mine the most discriminative patterns directly, without costly feature selection. Previous empirical results show that associative classification can provide better classification accuracy on many datasets. Recently, many studies have been conducted on uncertain data, where fields of uncertain attributes no longer have definite values; instead, probability distribution functions are adopted to represent the possible values and their corresponding probabilities. The uncertainty is usually caused by noise, measurement limits, or other factors. Several algorithms have recently been proposed to solve the classification problem on uncertain data, for example by extending traditional rule-based classifiers and decision trees to work on uncertain data. In this paper, we propose a novel algorithm, uHARMONY, which mines discriminative patterns directly and effectively from uncertain data as classification features/rules, to help train either an SVM or a rule-based classifier. Since patterns are discovered directly from the input database, feature selection, which usually takes a great amount of time, can be avoided completely. An effective method for computing the expected confidence of the mined patterns, which serves as the discrimination measure, is also proposed. Empirical results show that, using an SVM classifier, our algorithm uHARMONY significantly outperforms state-of-the-art uncertain data classification algorithms, with 4% to 10% average improvements in accuracy on 30 categorical datasets under varying degrees of uncertainty and numbers of uncertain attributes.
【Keywords】: associative classification; expected confidence; frequent pattern mining; uncertain data
【Paper Link】 【Pages】:871-880
【Authors】: Zhenyu Lu ; Xindong Wu ; Xingquan Zhu ; Josh Bongard
【Abstract】: An ensemble is a set of learned models that make decisions collectively. Although an ensemble is usually more accurate than a single learner, existing ensemble methods often tend to construct unnecessarily large ensembles, which increases the memory consumption and computational cost. Ensemble pruning tackles this problem by selecting a subset of ensemble members to form subensembles that consume fewer resources and respond faster, with accuracy similar to or better than that of the original ensemble. In this paper, we analyze the accuracy/diversity trade-off and prove that classifiers that are more accurate and make more predictions in the minority group are more important for subensemble construction. Based on the gained insights, a heuristic metric that considers both accuracy and diversity is proposed to explicitly evaluate each individual classifier's contribution to the whole ensemble. By incorporating ensemble members in decreasing order of their contributions, subensembles are formed such that users can select the top p percent of ensemble members, depending on their resource availability and tolerable waiting time, for predictions. Experimental results on 26 UCI data sets show that subensembles formed by the proposed EPIC (Ensemble Pruning via Individual Contribution ordering) algorithm outperform the original ensemble and a state-of-the-art ensemble pruning method, Orientation Ordering (OO).
【Keywords】: ensemble learning; ensemble pruning
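The ordering idea described above can be sketched generically: score every base classifier by an individual contribution that rewards accuracy and correct votes that disagree with the current majority, sort, and keep the top p fraction. The scoring rule below is an illustrative stand-in, not the paper's exact EPIC metric; integer class labels are assumed.

    import numpy as np

    def prune_ensemble(members, X_val, y_val, p=0.3):
        """Order base classifiers by an individual-contribution score and keep the
        top p fraction.  Score = accuracy + bonus for correct minority-group votes
        (a stand-in for the paper's contribution metric)."""
        votes = np.array([m.predict(X_val) for m in members])          # shape (M, n), integer labels
        majority = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
        scores = []
        for i in range(len(members)):
            correct = votes[i] == y_val
            minority_correct = correct & (votes[i] != majority)        # correct while disagreeing with majority
            scores.append(correct.mean() + minority_correct.mean())
        order = np.argsort(scores)[::-1]
        k = max(1, int(np.ceil(p * len(members))))
        return [members[j] for j in order[:k]]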
【Paper Link】 【Pages】:881-888
【Authors】: Ni Lao ; William W. Cohen
【Abstract】: Many recommendation and retrieval tasks can be represented as proximity queries on a labeled directed graph, with typed nodes representing documents, terms, and metadata, and labeled edges representing the relationships between them. Recent work has shown that the accuracy of the widely-used random-walk-based proximity measures can be improved by supervised learning - in particular, one especially effective learning technique is based on Path-Constrained Random Walks (PCRW), in which similarity is defined by a learned combination of constrained random walkers, each constrained to follow only a particular sequence of edge labels away from the query nodes. The PCRW based method significantly outperformed unsupervised random walk based queries, and models with learned edge weights. Unfortunately, PCRW query systems are expensive to evaluate. In this study we evaluate the use of approximations to the computation of the PCRW distributions, including fingerprinting, particle filtering, and truncation strategies. In experiments on several recommendation and retrieval problems using two large scientific publications corpora we show speedups of factors of 2 to 100 with little loss in accuracy.
【Keywords】: filtering and recommending; learning to rank; path-constrained random walks; relational retrieval
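One of the approximations discussed, truncation, is easy to sketch: propagate a probability distribution along the fixed sequence of edge labels and drop any entry whose mass falls below a threshold after each step. The graph representation and names below are assumptions for illustration.

    from collections import defaultdict

    def pcrw_truncated(graph, start_nodes, edge_label_path, eps=1e-4):
        """Path-constrained random walk with truncation.
        graph[node] is a dict mapping edge label -> list of neighbors.
        Returns the distribution over nodes reached by following the given
        label sequence, dropping mass below eps after every step."""
        dist = {n: 1.0 / len(start_nodes) for n in start_nodes}
        for label in edge_label_path:
            nxt = defaultdict(float)
            for node, prob in dist.items():
                nbrs = graph.get(node, {}).get(label, [])
                if not nbrs:
                    continue                       # the walker dies if no edge carries this label
                share = prob / len(nbrs)
                for v in nbrs:
                    nxt[v] += share
            dist = {v: q for v, q in nxt.items() if q >= eps}   # truncation step
        return dist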
【Paper Link】 【Pages】:889-898
【Authors】: Freddy Chong Tat Chua ; Ee-Peng Lim
【Abstract】: In an online rating system, raters assign ratings to objects contributed by other users. In addition, raters can develop trust and distrust in object contributors depending on a few rating- and trust-related factors. Previous studies have shown that ratings and trust links can influence each other, but a formal model relating these factors has been lacking. In this paper, we therefore propose the Trust Antecedent Factor (TAF) Model, a novel probabilistic model that generates ratings based on a number of rater and contributor factors. We demonstrate that the parameters of the model can be learned by collapsed Gibbs sampling. We then apply the model to predict trust and distrust between raters and review contributors using a real dataset. Our experiments show that the proposed model is capable of predicting both trust and distrust in a unified way. The model can also determine user factors that otherwise cannot be observed from the rating and trust data.
【Keywords】: probability; social network; statistics; trust prediction
【Paper Link】 【Pages】:899-908
【Authors】: Yong Ge ; Hui Xiong ; Alexander Tuzhilin ; Keli Xiao ; Marco Gruteser ; Michael J. Pazzani
【Abstract】: The increasing availability of large-scale location traces creates unprecedented opportunities to change the paradigm for knowledge discovery in transportation systems. A particularly promising area is extracting energy-efficient transportation patterns (green knowledge), which can be used as guidance for reducing inefficiencies in the energy consumption of transportation sectors. However, extracting green knowledge from location traces is not a trivial task. Conventional data analysis tools are usually not customized for handling the massive, complex, dynamic, and distributed nature of location traces. To that end, in this paper, we provide a focused study of extracting energy-efficient transportation patterns from location traces. Specifically, our initial focus is on mobile sequential recommendation. As a case study, we develop a mobile recommender system that can recommend a sequence of pick-up points for taxi drivers or a sequence of potential parking positions. The goal of this mobile recommendation system is to maximize the probability of business success. Along this line, we provide a Potential Travel Distance (PTD) function for evaluating each candidate sequence. This PTD function possesses a monotone property that can be used to effectively prune the search space. Based on this PTD function, we develop two algorithms, LCP and SkyRoute, for finding the recommended routes. Finally, experimental results show that the proposed system can provide effective mobile sequential recommendations, and the knowledge extracted from location traces can be used for coaching drivers, leading to more efficient use of energy.
【Keywords】: mobile recommender system; trajectory data analysis
【Paper Link】 【Pages】:909-918
【Authors】: Manas Somaiya ; Christopher M. Jermaine ; Sanjay Ranka
【Abstract】: Archived data often describe entities that participate in multiple roles. Each of these roles may influence various aspects of the data. For example, a register transaction collected at a retail store may have been initiated by a person who is a woman, a mother, an avid reader, and an action movie fan. Each of these roles can influence various aspects of the customer's purchase: the fact that the customer is a mother may greatly influence the purchase of a toddler-sized pair of pants, but have no influence on the purchase of an action-adventure novel. The fact that the customer is an action movie fan and an avid reader may influence the purchase of the novel, but will have no effect on the purchase of a shirt. In this paper, we present a generic, Bayesian framework for capturing exactly this situation. In our framework, it is assumed that multiple roles exist, and each data point corresponds to an entity (such as a retail customer, or an email, or a news article) that selects various roles which compete to influence the various attributes associated with the data point. We develop robust, MCMC algorithms for learning the models under the framework.
【Keywords】: high-dimensional data; mcmc; mixture models
【Paper Link】 【Pages】:919-928
【Authors】: Siyuan Liu ; Yunhuai Liu ; Lionel M. Ni ; Jianping Fan ; Minglu Li
【Abstract】: Identifying hot spots of moving vehicles in an urban area is essential to many smart city applications. Practical research on hot spots in smart cities presents many unique features, such as highly mobile environments, an extremely limited number of sample objects, and non-uniform, biased samples. All these features raise new challenges that cause traditional density-based clustering algorithms to fail to capture the real clustering properties of objects, making the results less meaningful. In this paper we propose a novel, non-density-based approach called mobility-based clustering. The key idea is that sample objects are employed as "sensors" to perceive the vehicle crowdedness in nearby areas using their instant mobility, rather than as "object representatives". As such, the mobility of samples is naturally incorporated. Several key factors beyond vehicle crowdedness are identified, and techniques to compensate for these effects are proposed. We evaluate the performance of mobility-based clustering on real traffic situations. Experimental results show that, using 0.3% of vehicles as samples, mobility-based clustering can accurately identify hot spots that can hardly be obtained by the latest representative algorithm, UMicro.
【Keywords】: crowdedness; mobility-based clustering; traffic detection; vehicle
【Paper Link】 【Pages】:929-938
【Authors】: Cindy Xide Lin ; Bo Zhao ; Qiaozhu Mei ; Jiawei Han
【Abstract】: User-generated information in online communities has been characterized as a mixture of a text stream and a network structure, both changing over time. A good example is a web-blogging community with daily blog posts and a social network of bloggers. An important task in analyzing an online community is to observe and track the popular events, or topics, that evolve over time in the community. Existing approaches usually focus on either the burstiness of topics or the evolution of networks, ignoring the interplay between textual topics and network structures. In this paper, we formally define the problem of popular event tracking in online communities (PET), focusing on the interplay between texts and networks. We propose a novel statistical method that models the popularity of events over time, taking into consideration the burstiness of user interest, information diffusion on the network structure, and the evolution of textual topics. Specifically, a Gibbs Random Field is defined to model the influence of historic status and the dependency relationships in the graph; thereafter a topic model generates the words in the text content of the event, regularized by the Gibbs Random Field. We prove that two classic models in information diffusion and text burstiness are special cases of our model under certain situations. Empirical experiments with two different communities and datasets (i.e., Twitter and DBLP) show that our approach is effective and outperforms existing approaches.
【Keywords】: PET; popular events tracking; social communities; topic modeling
【Paper Link】 【Pages】:939-948
【Authors】: Mauro Sozio ; Aristides Gionis
【Abstract】: A great deal of research in graph mining has been devoted to the discovery of communities. Most of this work has focused on the scenario where communities need to be discovered with reference only to the input graph. However, in many interesting applications one is interested in finding the community formed by a given set of nodes. In this paper we study a query-dependent variant of the community-detection problem, which we call the community-search problem: given a graph G and a set of query nodes in the graph, we seek to find a subgraph of G that contains the query nodes and is densely connected. We motivate a measure of density based on minimum degree and distance constraints, and we develop an optimum greedy algorithm for this measure. We proceed by characterizing a class of monotone constraints and generalize our algorithm to compute optimum solutions satisfying any set of monotone constraints. Finally, we modify the greedy algorithm and present two heuristic algorithms that find communities of size no greater than a specified upper bound. Our experimental evaluation on real datasets demonstrates the efficiency of the proposed algorithms and the quality of the solutions we obtain.
【Keywords】: community detection; graph algorithms; graph mining; social networks
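A minimal sketch of the greedy strategy for the minimum-degree measure described above: repeatedly peel off a minimum-degree node (never a query node here, a simplification of the paper's algorithm) and return the intermediate subgraph with the largest minimum degree among those that still contain and connect the query nodes. Names are illustrative and the distance and size constraints are omitted.

    from collections import deque

    def community_search(adj, query):
        """Greedy peeling for the community-search problem.
        adj: dict node -> set of neighbors (undirected, no self-loops).
        query: non-empty iterable of query nodes present in adj."""
        adj = {u: set(vs) for u, vs in adj.items()}
        query = set(query)
        nodes = set(adj)

        def query_connected(nodes):
            start = next(iter(query))
            seen, dq = {start}, deque([start])
            while dq:
                u = dq.popleft()
                for v in adj[u] & nodes:
                    if v not in seen:
                        seen.add(v)
                        dq.append(v)
            return query <= seen

        best, best_min_deg = None, -1
        while query <= nodes and query_connected(nodes):
            min_deg = min(len(adj[u] & nodes) for u in nodes)
            if min_deg > best_min_deg:
                best, best_min_deg = set(nodes), min_deg
            removable = [u for u in nodes
                         if u not in query and len(adj[u] & nodes) == min_deg]
            if not removable:
                break                      # only query nodes attain the minimum degree
            nodes.remove(removable[0])     # peel one minimum-degree non-query node
        return best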
【Paper Link】 【Pages】:949-958
【Authors】: Anon Plangprasopchok ; Kristina Lerman ; Lise Getoor
【Abstract】: Many social Web sites allow users to annotate content with descriptive metadata, such as tags, and more recently to organize content hierarchically. These types of structured metadata provide valuable evidence for learning how a community organizes knowledge. For instance, we can aggregate many personal hierarchies into a common taxonomy, also known as a folksonomy, that will aid users in visualizing and browsing social content, and also help them organize their own content. However, learning from social metadata presents several challenges, since it is sparse, shallow, ambiguous, noisy, and inconsistent. We describe an approach to folksonomy learning based on relational clustering, which exploits structured metadata contained in personal hierarchies. Our approach clusters similar hierarchies using their structure and tag statistics, then incrementally weaves them into a deeper, bushier tree. We study folksonomy learning using social metadata extracted from the photo-sharing site Flickr, and demonstrate that the proposed approach addresses the challenges. Moreover, compared to previous work, the approach produces larger, more accurate folksonomies, and in addition, scales better.
【Keywords】: collective knowledge; data mining; folksonomies; relational clustering; social information processing; social metadata; taxonomies
【Paper Link】 【Pages】:959-968
【Authors】: Dawei Yin ; Zhenzhen Xue ; Liangjie Hong ; Brian D. Davison
【Abstract】: Social tagging systems have become increasingly popular for sharing and organizing web resources. Tag prediction is a common feature of social tagging systems. Social tagging is by nature an incremental process, meaning that once a user has saved a web page with tags, the tagging system can provide more accurate predictions for the user based on the user's incremental behavior. However, existing tag prediction methods do not consider this important factor: their training and test datasets are either split by a fixed time stamp or randomly sampled from a larger corpus. In our temporal experiments, we perform time-sensitive sampling on an existing public dataset, resulting in a new scenario that is much closer to the real world. In this paper, we address the problem of tag prediction by proposing a probabilistic model for personalized tag prediction. The model is a Bayesian approach that integrates three factors - ego-centric effects, environmental effects and web page content. Two methods - both intuitive calculation and learning optimization - are provided for parameter estimation. Pure graph-based methods, which may have significant constraints (such as requiring every user, every item and every tag to occur in at least p posts), cannot make a prediction in most real-world cases, while our model improves the F-measure by over 30% compared to a leading algorithm in our real-world use case.
【Keywords】: personalized tag prediction; social tagging; tag prediction
【Paper Link】 【Pages】:969-978
【Authors】: Xiaojiang Liu ; Zaiqing Nie ; Nenghai Yu ; Ji-Rong Wen
【Abstract】: Internet users regularly have the need to find biographies and facts of people of interest. Wikipedia has become the first stop for celebrity biographies and facts. However, Wikipedia can only provide information for celebrities because of its neutral point of view (NPOV) editorial policy. In this paper we propose an integrated bootstrapping framework named BioSnowball to automatically summarize the Web to generate Wikipedia-style pages for any person with a modest web presence. In BioSnowball, biography ranking and fact extraction are performed together in a single integrated training and inference process using Markov Logic Networks (MLNs) as its underlying statistical model. The bootstrapping framework starts with only a small number of seeds and iteratively finds new facts and biographies. As biography paragraphs on the Web are composed of the most important facts, our joint summarization model can improve the accuracy of both fact extraction and biography ranking compared to decoupled methods in the literature. Empirical results on both a small labeled data set and a real Web-scale data set show the effectiveness of BioSnowball. We also empirically show that BioSnowball outperforms the decoupled methods.
【Keywords】: bootstrapping; fact extraction; markov logic networks; summarization
【Paper Link】 【Pages】:979-988
【Authors】: D. Sculley
【Abstract】: Many real-world data mining tasks require the achievement of two distinct goals when applied to unseen data: first, to induce an accurate preference ranking, and second to give good regression performance. In this paper, we give an efficient and effective Combined Regression and Ranking method (CRR) that optimizes regression and ranking objectives simultaneously. We demonstrate the effectiveness of CRR for both families of metrics on a range of large-scale tasks, including click prediction for online advertisements. Results show that CRR often achieves performance equivalent to the best of both ranking-only and regression-only approaches. In the case of rare events or skewed distributions, we also find that this combination can actually improve regression performance due to the addition of informative ranking constraints.
【Keywords】: large-scale data; ranking; regression
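The combined objective can be sketched as stochastic gradient descent that, at each step, optimizes a squared regression loss with probability alpha and a pairwise hinge ranking loss otherwise. The specific losses, step sizes, and names below are illustrative assumptions rather than the paper's exact formulation.

    import numpy as np

    def crr_sgd(X, y, alpha=0.5, lam=1e-4, eta=0.01, steps=100000, seed=0):
        """Combined regression and ranking via SGD on a convex combination of a
        squared loss on single examples and a pairwise hinge loss on example pairs."""
        rng = np.random.default_rng(seed)
        w = np.zeros(X.shape[1])
        n = len(y)
        for _ in range(steps):
            if rng.random() < alpha:
                i = rng.integers(n)                        # regression step on one example
                grad = (X[i].dot(w) - y[i]) * X[i]
            else:
                i, j = rng.integers(n), rng.integers(n)    # ranking step on a pair
                if y[i] == y[j]:
                    continue
                if y[i] < y[j]:
                    i, j = j, i                            # ensure example i should rank above j
                diff = X[i] - X[j]
                grad = -diff if diff.dot(w) < 1.0 else 0.0 # hinge on the score margin
            w -= eta * (grad + lam * w)
        return w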
【Paper Link】 【Pages】:989-998
【Authors】: Kai Ming Ting ; Guang-Tong Zhou ; Fei Tony Liu ; James Swee Chuan Tan
【Abstract】: This paper introduces mass estimation--a base modelling mechanism in data mining. It provides the theoretical basis of mass and an efficient method to estimate mass. We show that it solves problems very effectively in tasks such as information retrieval, regression and anomaly detection. The models, which use mass in these three tasks, perform at least as well as, and often better than, a total of eight state-of-the-art methods in terms of task-specific performance measures. In addition, mass estimation has constant time and space complexities.
【Keywords】: mass estimation
【Paper Link】 【Pages】:999-1008
【Authors】: Min-Ling Zhang ; Kun Zhang
【Abstract】: In multi-label learning, each training example is associated with a set of labels and the task is to predict the proper label set for the unseen example. Due to the tremendous (exponential) number of possible label sets, the task of learning from multi-label examples is rather challenging. Therefore, the key to successful multi-label learning is how to effectively exploit correlations between different labels to facilitate the learning process. In this paper, we propose to use a Bayesian network structure to efficiently encode the conditional dependencies of the labels as well as the feature set, with the feature set as the common parent of all labels. To make it practical, we give an approximate yet efficient procedure to find such a network structure. With the help of this network, multi-label learning is decomposed into a series of single-label classification problems, where a classifier is constructed for each label by incorporating its parental labels as additional features. Label sets of unseen examples are predicted recursively according to the label ordering given by the network. Extensive experiments on a broad range of data sets validate the effectiveness of our approach against other well-established methods.
【Keywords】: bayesian network; machine learning; multi-label learning
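The decomposition described above can be sketched with an assumed label ordering and parent sets: each label gets a binary classifier whose features are the input features plus its parent labels, and prediction proceeds along the ordering, reusing the already-predicted parents. A plain chain structure stands in for the learned Bayesian network; scikit-learn and the names below are illustrative choices.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def fit_label_chain(X, Y, parents):
        """Y: (n, L) binary label matrix.  parents[l] lists the label indices that
        label l depends on (all earlier in the ordering).  One classifier per label,
        trained with its parent labels appended as extra features."""
        models = []
        for l in range(Y.shape[1]):
            feats = np.hstack([X, Y[:, parents[l]]])
            models.append(LogisticRegression(max_iter=1000).fit(feats, Y[:, l]))
        return models

    def predict_label_chain(models, X, parents):
        n, L = X.shape[0], len(models)
        Y_hat = np.zeros((n, L), dtype=int)
        for l in range(L):                                 # follow the ordering, reusing predicted parents
            feats = np.hstack([X, Y_hat[:, parents[l]]])
            Y_hat[:, l] = models[l].predict(feats)
        return Y_hat

    # Example structure: a simple chain in which each label depends on the previous one.
    # parents = [[]] + [[l - 1] for l in range(1, L)]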
【Paper Link】 【Pages】:1009-1018
【Authors】: Qiaozhu Mei ; Jian Guo ; Dragomir R. Radev
【Abstract】: Information networks are widely used to characterize the relationships between data items such as text documents. Many important retrieval and mining tasks rely on ranking the data items based on their centrality or prestige in the network. Beyond prestige, diversity has been recognized as a crucial objective in ranking, aiming at providing a non-redundant and high coverage piece of information in the top ranked results. Nevertheless, existing network-based ranking approaches either disregard the concern of diversity, or handle it with non-optimized heuristics, usually based on greedy vertex selection. We propose a novel ranking algorithm, DivRank, based on a reinforced random walk in an information network. This model automatically balances the prestige and the diversity of the top ranked vertices in a principled way. DivRank not only has a clear optimization explanation, but also well connects to classical models in mathematics and network science. We evaluate DivRank using empirical experiments on three different networks as well as a text summarization task. DivRank outperforms existing network-based ranking methods in terms of enhancing diversity in prestige.
【Keywords】: diversity; information networks; ranking; reinforced random walk
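The reinforced random walk behind DivRank can be sketched as follows: at every iteration the organic transition matrix is re-weighted column-wise by the current score of the target node (a stand-in for the cumulative visit count), row-normalized, and combined with a teleport term. Parameter names follow common usage, and the cumulative-score approximation is an assumption, not the paper's exact update.

    import numpy as np

    def divrank(A, d=0.85, alpha=0.25, iters=200, tol=1e-10):
        """Reinforced random walk on a nonnegative adjacency matrix A (n x n).
        Edges toward already well-visited nodes are strengthened each iteration,
        which concentrates mass on a diverse set of prestigious vertices."""
        n = A.shape[0]
        deg = A.sum(axis=1, keepdims=True)
        deg[deg == 0] = 1.0
        p_star = A / deg                                   # row-stochastic organic transitions
        p0 = alpha * p_star + (1 - alpha) * np.eye(n)      # keep some self-transition mass
        pi = np.full(n, 1.0 / n)
        prior = np.full(n, 1.0 / n)                        # teleport / prior distribution
        for _ in range(iters):
            W = p0 * pi                                    # reinforce column j by the score of node j
            W /= W.sum(axis=1, keepdims=True)
            new_pi = (1 - d) * prior + d * (pi @ W)
            if np.abs(new_pi - pi).sum() < tol:
                return new_pi
            pi = new_pi
        return pi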
【Paper Link】 【Pages】:1019-1028
【Authors】: Manuel Gomez-Rodriguez ; Jure Leskovec ; Andreas Krause
【Abstract】: Information diffusion and virus propagation are fundamental processes taking place in networks. While it is often possible to directly observe when nodes become infected, observing individual transmissions (i.e., who infects whom or who influences whom) is typically very difficult. Furthermore, in many applications, the underlying network over which the diffusions and propagations spread is actually unobserved. We tackle these challenges by developing a method for tracing paths of diffusion and influence through networks and inferring the networks over which contagions propagate. Given the times when nodes adopt pieces of information or become infected, we identify the optimal network that best explains the observed infection times. Since the optimization problem is NP-hard to solve exactly, we develop an efficient approximation algorithm that scales to large datasets and in practice gives provably near-optimal performance. We demonstrate the effectiveness of our approach by tracing information cascades in a set of 170 million blogs and news articles over a one year period to infer how information flows through the online media space. We find that the diffusion network of news tends to have a core-periphery structure with a small set of core media sites that diffuse information to the rest of the Web. These sites tend to have stable circles of influence with more general news media sites acting as connectors between them.
【Keywords】: blogs; information cascades; meme-tracking; networks of diffusion; news media; social networks
【Paper Link】 【Pages】:1029-1038
【Authors】: Wei Chen ; Chi Wang ; Yajun Wang
【Abstract】: Influence maximization, defined by Kempe, Kleinberg, and Tardos (2003), is the problem of finding a small set of seed nodes in a social network that maximizes the spread of influence under certain influence cascade models. The scalability of influence maximization is a key factor for enabling prevalent viral marketing in large-scale online social networks. Prior solutions, such as the greedy algorithm of Kempe et al. (2003) and its improvements are slow and not scalable, while other heuristic algorithms do not provide consistently good performance on influence spreads. In this paper, we design a new heuristic algorithm that is easily scalable to millions of nodes and edges in our experiments. Our algorithm has a simple tunable parameter for users to control the balance between the running time and the influence spread of the algorithm. Our results from extensive simulations on several real-world and synthetic networks demonstrate that our algorithm is currently the best scalable solution to the influence maximization problem: (a) our algorithm scales beyond million-sized graphs where the greedy algorithm becomes infeasible, and (b) in all size ranges, our algorithm performs consistently well in influence spread --- it is always among the best algorithms, and in most cases it significantly outperforms all other scalable heuristics to as much as 100%--260% increase in influence spread.
【Keywords】: influence maximization; social networks; viral marketing
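For context, the slow baseline that such scalable heuristics improve on is the Monte-Carlo greedy algorithm of Kempe et al. under the independent cascade model, sketched below; the paper's own heuristic is not reproduced here, and the propagation probability, simulation count, and names are illustrative.

    import random

    def simulate_ic(adj, seeds, p=0.05, rng=random):
        """One independent-cascade run: each newly activated node gets a single
        chance to activate each inactive neighbor with probability p."""
        active, frontier = set(seeds), list(seeds)
        while frontier:
            nxt = []
            for u in frontier:
                for v in adj.get(u, ()):
                    if v not in active and rng.random() < p:
                        active.add(v)
                        nxt.append(v)
            frontier = nxt
        return len(active)

    def greedy_influence(adj, k=5, p=0.05, mc=200, seed=0):
        """Kempe-style greedy: repeatedly add the node with the largest
        Monte-Carlo-estimated marginal gain in expected spread."""
        rng = random.Random(seed)
        seeds = []
        for _ in range(k):
            best_node, best_spread = None, -1.0
            for v in adj:
                if v in seeds:
                    continue
                spread = sum(simulate_ic(adj, seeds + [v], p, rng) for _ in range(mc)) / mc
                if spread > best_spread:
                    best_node, best_spread = v, spread
            seeds.append(best_node)
        return seeds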
【Paper Link】 【Pages】:1039-1048
【Authors】: Yu Wang ; Gao Cong ; Guojie Song ; Kunqing Xie
【Abstract】: With the proliferation of mobile devices and wireless technologies, mobile social network systems are increasingly available. A mobile social network plays an essential role in the spread of information and influence in the form of "word of mouth". It is a fundamental issue to find a subset of influential individuals in a mobile social network such that targeting them initially (e.g. to adopt a new product) will maximize the spread of the influence (further adoptions of the new product). The problem of finding the most influential nodes is unfortunately NP-hard. It has been shown that a Greedy algorithm with provable approximation guarantees can give a good approximation; however, it is computationally expensive, if not prohibitive, to run the greedy algorithm on a large mobile network. In this paper we propose a new algorithm, called the Community-based Greedy algorithm, for mining top-K influential nodes. The proposed algorithm encompasses two components: 1) an algorithm for detecting communities in a social network by taking into account information diffusion; and 2) a dynamic programming algorithm for selecting communities to find influential nodes. We also provide provable approximation guarantees for our algorithm. Empirical studies on a large real-world mobile social network show that our algorithm is more than an order of magnitude faster than the state-of-the-art Greedy algorithm for finding top-K influential nodes, and the error of our approximate algorithm is small.
【Keywords】: community detection; influence maximization; social networks
【Paper Link】 【Pages】:1049-1058
【Authors】: Chenhao Tan ; Jie Tang ; Jimeng Sun ; Quan Lin ; Fengjiao Wang
【Abstract】: It is well known that users' behaviors (actions) in a social network are influenced by various factors such as personal interests, social influence, and global trends. However, few publications systematically study how social actions evolve in a dynamic social network and to what extent different factors affect the user actions. In this paper, we propose a Noise Tolerant Time-varying Factor Graph Model (NTT-FGM) for modeling and predicting social actions. NTT-FGM simultaneously models social network structure, user attributes and user action history for better prediction of the users' future actions. More specifically, a user's action at time t is generated by her latent state at t, which is influenced by her attributes, her own latent state at time t-1 and her neighbors' states at time t and t-1. Based on this intuition, we formalize the social action tracking problem using the NTT-FGM model; then present an efficient algorithm to learn the model, by combining the ideas from both continuous linear system and Markov random field. Finally, we present a case study of our model on predicting future social actions. We validate the model on three different types of real-world data sets. Qualitatively, our model can uncover some interesting patterns of the social dynamics. Quantitatively, experimental results show that the proposed method outperforms several baseline methods for action prediction.
【Keywords】: social action tracking; social influence analysis; time-varying factor graphs
【Paper Link】 【Pages】:1059-1068
【Authors】: Theodoros Lappas ; Evimaria Terzi ; Dimitrios Gunopulos ; Heikki Mannila
【Abstract】: Assume a network (V,E) where a subset of the nodes in V are active. We consider the problem of selecting a set of k active nodes that best explain the observed activation state, under a given information-propagation model. We call these nodes effectors. We formally define the k-Effectors problem and study its complexity for different types of graphs. We show that for arbitrary graphs the problem is not only NP-hard to solve optimally, but also NP-hard to approximate. We also show that, for some special cases, the problem can be solved optimally in polynomial time using a dynamic-programming algorithm. To the best of our knowledge, this is the first work to consider the k-Effectors problem in networks. We experimentally evaluate our algorithms using the DBLP co-authorship graph, where we search for effectors of topics that appear in research papers.
【Keywords】: graph algorithms; information propagation; social networks
【Paper Link】 【Pages】:1069-1078
【Authors】: Feng Chen ; Chang-Tien Lu ; Arnold P. Boedihardjo
【Abstract】: Local-based approaches form a major category of methods for spatial outlier detection (SOD). Currently, there is a lack of systematic analysis of the statistical properties of this framework. For example, most methods assume identical and independent normal distributions (i.i.d. normal) for the calculated local differences, but no justification for this critical assumption has been presented. The methods' detection performance on geostatistical data with linear or nonlinear trends is also not well studied. In addition, there is a lack of theoretical connections and empirical comparisons between local- and global-based SOD approaches. This paper discusses all these fundamental issues under the proposed Generalized Local Statistical (GLS) framework. Furthermore, robust estimation and outlier detection methods are designed for the new GLS model. Extensive simulations demonstrate that the SOD method based on the GLS model significantly outperforms all existing approaches when the spatial data exhibit a linear or nonlinear trend.
【Keywords】: spatial gaussian random field; spatial outlier detection
【Paper Link】 【Pages】:1079-1088
【Authors】: Jianwen Zhang ; Yangqiu Song ; Changshui Zhang ; Shixia Liu
【Abstract】: Mining cluster evolution from multiple correlated time-varying text corpora is important in exploratory text analytics. In this paper, we propose an approach called evolutionary hierarchical Dirichlet processes (EvoHDP) to discover interesting cluster evolution patterns from such text data. We formulate the EvoHDP as a series of hierarchical Dirichlet processes (HDP) by adding time dependencies to the adjacent epochs, and propose a cascaded Gibbs sampling scheme to infer the model. This approach can discover different evolving patterns of clusters, including emergence, disappearance, evolution within a corpus and across different corpora. Experiments over synthetic and real-world multiple correlated time-varying data sets illustrate the effectiveness of EvoHDP on discovering cluster evolution patterns.
【Keywords】: bayesian nonparametric methods; clustering; dirichlet processes; mixture models; multiple correlated time-varying corpora
【Paper Link】 【Pages】:1089-1098
【Authors】: Abdullah Mueen ; Eamonn J. Keogh
【Abstract】: The detection of repeated subsequences, time series motifs, is a problem which has been shown to have great utility for several higher-level data mining algorithms, including classification, clustering, segmentation, forecasting, and rule discovery. In recent years there has been significant research effort spent on efficiently discovering these motifs in static offline databases. However, for many domains, the inherent streaming nature of time series demands online discovery and maintenance of time series motifs. In this paper, we develop the first online motif discovery algorithm which monitors and maintains motifs exactly in real time over the most recent history of a stream. Our algorithm has a worst-case update time which is linear to the window size and is extendible to maintain more complex pattern structures. In contrast, the current offline algorithms either need significant update time or require very costly pre-processing steps which online algorithms simply cannot afford. Our core ideas allow useful extensions of our algorithm to deal with arbitrary data rates and discovering multidimensional motifs. We demonstrate the utility of our algorithms with a variety of case studies in the domains of robotics, acoustic monitoring and online compression.
【Keywords】: motifs; online algorithms; time series
【Paper Link】 【Pages】:1099-1108
【Authors】: Zhenhui Li ; Bolin Ding ; Jiawei Han ; Roland Kays ; Peter Nye
【Abstract】: Periodicity is a common phenomenon in the movement of objects. Finding periodic behaviors is essential to understanding object movements. However, periodic behaviors can be complicated, involving multiple interleaving periods, partial time spans, and spatiotemporal noise and outliers. In this paper, we address the problem of mining periodic behaviors for moving objects. It involves two sub-problems: how to detect the periods in complex movement, and how to mine periodic movement behaviors. Our main assumption is that the observed movement is generated from multiple interleaved periodic behaviors associated with certain reference locations. Based on this assumption, we propose a two-stage algorithm, Periodica, to solve the problem. At the first stage, the notion of observation spot is proposed to capture the reference locations. Through observation spots, multiple periods in the movement can be retrieved using a method that combines Fourier transform and autocorrelation. At the second stage, a probabilistic model is proposed to characterize the periodic behaviors. For a specific period, periodic behaviors are statistically generalized from partial movement sequences through hierarchical clustering. Empirical studies on both synthetic and real data sets demonstrate the effectiveness of our method.
【Keywords】: autocorrelation; fourier transform; hierarchical clustering; moving objects; observation spot; periodic behavior
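The first-stage period detection described above (Fourier transform combined with autocorrelation) can be sketched on a binary in/out-of-observation-spot sequence: candidate periods come from large periodogram peaks and are then snapped to nearby peaks of the autocorrelation function. Thresholds and names are illustrative assumptions.

    import numpy as np

    def detect_periods(seq, max_periods=3, power_quantile=0.99):
        """seq: 1-D binary sequence (1 when the object is at an observation spot).
        Returns candidate periods (in time steps) from the periodogram, each
        refined to the nearest autocorrelation peak."""
        x = np.asarray(seq, dtype=float)
        x = x - x.mean()
        n = len(x)
        spectrum = np.abs(np.fft.rfft(x)) ** 2              # periodogram
        acf = np.correlate(x, x, mode="full")[n - 1:]        # autocorrelation at lags 0..n-1
        acf = acf / acf[0]
        threshold = np.quantile(spectrum[1:], power_quantile)
        candidates = [(spectrum[i], n / i) for i in range(1, len(spectrum))
                      if spectrum[i] >= threshold]           # (power, period = n / frequency index)
        periods = []
        for _, T in sorted(candidates, reverse=True)[:max_periods]:
            lag = int(round(T))
            lo, hi = max(1, lag - 2), min(n - 1, lag + 3)
            refined = lo + int(np.argmax(acf[lo:hi]))        # snap to a nearby autocorrelation peak
            periods.append(refined)
        return sorted(set(periods))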
【Paper Link】 【Pages】:1109-1118
【Authors】: Zhenxing Wang ; Laiwan Chan
【Abstract】: Bayesian network learning algorithms have been widely used for causal discovery since the pioneering work [13,18]. Among all existing algorithms, the three-phase dependency analysis algorithm (TPDA) [5] is the most efficient in the sense that it has polynomial-time complexity. However, it still has some limitations. First, TPDA depends on mutual-information-based conditional independence (CI) tests, and so is not easy to apply to continuous data. In addition, TPDA uses two phases to get approximate skeletons of Bayesian networks, which is not efficient in practice. In this paper, we propose a two-phase algorithm with partial-correlation-based CI tests: the first phase constructs a Markov random field from data, which provides a close approximation to the structure of the true Bayesian network; in the second phase, the algorithm removes redundant edges according to CI tests to obtain the true Bayesian network. We show that the two-phase algorithm with partial-correlation-based CI tests can deal with continuous data following arbitrary distributions rather than only the Gaussian distribution.
【Keywords】: bayesian networks; causal modeling; graphical models
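The partial-correlation CI test the algorithm relies on can be sketched as: regress X and Y on the conditioning set Z, correlate the residuals, and apply Fisher's z-transform. The Gaussian reference distribution used below is the standard choice, though the abstract argues the approach extends beyond Gaussian data; names are illustrative.

    import numpy as np
    from scipy.stats import norm

    def partial_corr_ci_test(x, y, Z, alpha=0.05):
        """Test X independent of Y given Z via partial correlation.
        x, y: 1-D arrays of length n; Z: (n, k) array, possibly with k = 0.
        Returns (independent?, p_value)."""
        n = len(x)
        if Z is None or Z.size == 0:
            rx, ry, k = x - x.mean(), y - y.mean(), 0
        else:
            A = np.column_stack([np.ones(n), Z])
            rx = x - A @ np.linalg.lstsq(A, x, rcond=None)[0]   # residual of X given Z
            ry = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]   # residual of Y given Z
            k = Z.shape[1]
        r = np.clip(np.corrcoef(rx, ry)[0, 1], -0.999999, 0.999999)
        z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - k - 3)  # Fisher z statistic
        p_value = 2 * (1 - norm.cdf(abs(z)))
        return p_value > alpha, p_value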
【Paper Link】 【Pages】:1119-1128
【Authors】: Robert J. Durrant ; Ata Kaban
【Abstract】: We consider random projections in conjunction with classification, specifically the analysis of Fisher's Linear Discriminant (FLD) classifier in randomly projected data spaces. Unlike previous analyses of other classifiers in this setting, we avoid the unnatural effects that arise when one insists that all pairwise distances are approximately preserved under projection. We impose no sparsity or underlying low-dimensional structure constraints on the data; we instead take advantage of the class structure inherent in the problem. We obtain a reasonably tight upper bound on the estimated misclassification error on average over the random choice of the projection, which, in contrast to early distance preserving approaches, tightens in a natural way as the number of training examples increases. It follows that, for good generalisation of FLD, the required projection dimension grows logarithmically with the number of classes. We also show that the error contribution of a covariance misspecification is always no worse in the low-dimensional space than in the initial high-dimensional space. We contrast our findings to previous related work, and discuss our insights.
【Keywords】: classification; compressed learning; dimensionality reduction; linear discriminant analysis; random projection
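The setting analyzed above is simple to reproduce end to end: project the data with a Gaussian random matrix to a dimension that needs to grow only logarithmically with the number of classes, then fit Fisher's linear discriminant in the projected space. The scikit-learn calls are standard; the default dimension rule below is an illustrative assumption, not the paper's bound.

    import numpy as np
    from sklearn.random_projection import GaussianRandomProjection
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def projected_fld(X_train, y_train, X_test, n_components=None, seed=0):
        """Fit FLD in a randomly projected space and predict on test data."""
        if n_components is None:
            n_classes = len(np.unique(y_train))
            n_components = max(2, int(10 * np.ceil(np.log2(n_classes + 1))))  # O(log #classes) heuristic
        proj = GaussianRandomProjection(n_components=n_components, random_state=seed)
        Xp_train = proj.fit_transform(X_train)
        Xp_test = proj.transform(X_test)
        clf = LinearDiscriminantAnalysis().fit(Xp_train, y_train)
        return clf.predict(Xp_test)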
【Paper Link】 【Pages】:1129-1138
【Authors】: Junfeng He ; Wei Liu ; Shih-Fu Chang
【Abstract】: Scalable similarity search is the core of many large scale learning or data mining applications. Recently, many research results demonstrate that one promising approach is creating compact and efficient hash codes that preserve data similarity. By efficient, we refer to the low correlation (and thus low redundancy) among generated codes. However, most existing hash methods are designed only for vector data. In this paper, we develop a new hashing algorithm to create efficient codes for large scale data of general formats with any kernel function, including kernels on vectors, graphs, sequences, sets and so on. Starting with the idea analogous to spectral hashing, novel formulations and solutions are proposed such that a kernel based hash function can be explicitly represented and optimized, and directly applied to compute compact hash codes for new samples of general formats. Moreover, we incorporate efficient techniques, such as Nystrom approximation, to further reduce time and space complexity for indexing and search, making our algorithm scalable to huge data sets. Another important advantage of our method is the ability to handle diverse types of similarities according to actual task requirements, including both feature similarities and semantic similarities like label consistency. We evaluate our method using both vector and non-vector data sets at a large scale up to 1 million samples. Our comprehensive results show the proposed method outperforms several state-of-the-art approaches for all the tasks, with a significant gain for most tasks.
【Keywords】: hashing; indexing; kernel method; nearest neighbor; scalable; search; structure data
【Paper Link】 【Pages】:1139-1148
【Authors】: Wei Liu ; Shiqian Ma ; Dacheng Tao ; Jianzhuang Liu ; Peng Liu
【Abstract】: In many scenarios, data can be represented as vectors and then mathematically abstracted as points in a Euclidean space. Because a great number of machine learning and data mining applications need proximity measures over data, a simple and universal distance metric is desirable, and metric learning methods have been explored to produce sensible distance measures consistent with data relationships. However, most existing methods suffer from limited labeled data and expensive training. In this paper, we address these two issues by employing abundant unlabeled data and pursuing sparsity of metrics, resulting in a novel metric learning approach called semi-supervised sparse metric learning. Two important contributions of our approach are: 1) it propagates scarce prior affinities between data to the global scope and incorporates the full affinities into the metric learning; and 2) it uses an efficient alternating linearization method to directly optimize the sparse metric. Compared with conventional methods, ours can effectively take advantage of semi-supervision and automatically discover the sparse metric structure underlying input data patterns. We demonstrate the efficacy of the proposed approach with extensive experiments carried out on six datasets, obtaining clear performance gains over the state of the art.
【Keywords】: alternating linearization; metric learning; semi-supervised sparse metric learning; sparse inverse covariance estimation
【Paper Link】 【Pages】:1149-1158
【Authors】: Arvind Agarwal ; Jeff M. Phillips ; Suresh Venkatasubramanian
【Abstract】: In this paper, we propose a unified algorithmic framework for solving many known variants of MDS. Our algorithm is a simple iterative scheme with guaranteed convergence, and is modular; by changing the internals of a single subroutine in the algorithm, we can switch cost functions and target spaces easily. In addition to the formal guarantees of convergence, our algorithms are accurate; in most cases, they converge to better quality solutions than existing methods in comparable time. Moreover, they have a small memory footprint and scale effectively for large data sets. We expect that this framework will be useful for a number of MDS variants that have not yet been studied. Our framework extends to embedding high-dimensional points lying on a sphere to points on a lower dimensional sphere, preserving geodesic distances. As a complement to this result, we also extend the Johnson-Lindenstrauss Lemma to this spherical setting, by showing that projecting to a random O((1/ε²) log n)-dimensional sphere causes only an ε-distortion in the geodesic distances.
【Keywords】: dimensionality reduction; multi-dimensional scaling
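The spherical Johnson-Lindenstrauss statement at the end of the abstract can be checked numerically with a simple construction: map unit vectors through a scaled Gaussian random matrix, renormalize the images back onto the lower-dimensional sphere, and compare pairwise geodesic (arc) distances before and after. The renormalization step is an assumption about how the mapping is realized, not a restatement of the paper's proof.

    import numpy as np

    def sphere_project(points, target_dim, seed=0):
        """Map unit vectors to a lower-dimensional sphere: Gaussian random
        projection followed by renormalization to unit length."""
        rng = np.random.default_rng(seed)
        R = rng.standard_normal((points.shape[1], target_dim)) / np.sqrt(target_dim)
        Y = points @ R
        return Y / np.linalg.norm(Y, axis=1, keepdims=True)

    def geodesic_distortion(X, Y):
        """Maximum absolute change in pairwise geodesic (arc) distance."""
        def geo(P):
            return np.arccos(np.clip(P @ P.T, -1.0, 1.0))
        return np.max(np.abs(geo(X) - geo(Y)))

    # Example: 200 points on the unit sphere in R^500 projected to R^50.
    # X = np.random.default_rng(1).standard_normal((200, 500))
    # X /= np.linalg.norm(X, axis=1, keepdims=True)
    # print(geodesic_distortion(X, sphere_project(X, 50)))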
【Paper Link】 【Pages】:1159-1168
【Authors】: Tianbao Yang ; Rong Jin ; Anil K. Jain ; Yang Zhou ; Wei Tong
【Abstract】: We study the problem of building the classification model for a target class in the absence of any labeled training example for that class. To address this difficult learning problem, we extend the idea of transfer learning by assuming that the following side information is available: (i) a collection of labeled examples belonging to other classes in the problem domain, called the auxiliary classes; (ii) the class information including the prior of the target class and the correlation between the target class and the auxiliary classes. Our goal is to construct the classification model for the target class by leveraging the above data and information. We refer to this learning problem as unsupervised transfer classification. Our framework is based on the generalized maximum entropy model that is effective in transferring the label information of the auxiliary classes to the target class. A theoretical analysis shows that under certain assumption, the classification model obtained by the proposed approach converges to the optimal model when it is learned from the labeled examples for the target class. Empirical study on text categorization over four different data sets verifies the effectiveness of the proposed approach.
【Keywords】: generalized maximum entropy model; text categorization; unsupervised transfer classification
【Paper Link】 【Pages】:1169-1178
【Authors】: Sunil Kumar Gupta ; Dinh Q. Phung ; Brett Adams ; Truyen Tran ; Svetha Venkatesh
【Abstract】: Although tagging has become increasingly popular in online image and video sharing systems, tags are known to be noisy, ambiguous, incomplete and subjective. These factors can seriously affect the precision of a social tag-based web retrieval system. Therefore improving the precision performance of these social tag-based web retrieval systems has become an increasingly important research topic. To this end, we propose a shared subspace learning framework to leverage a secondary source to improve retrieval performance from a primary dataset. This is achieved by learning a shared subspace between the two sources under a joint Nonnegative Matrix Factorization in which the level of subspace sharing can be explicitly controlled. We derive an efficient algorithm for learning the factorization, analyze its complexity, and provide proof of convergence. We validate the framework on image and video retrieval tasks in which tags from the LabelMe dataset are used to improve image retrieval performance from a Flickr dataset and video retrieval performance from a YouTube dataset. This has implications for how to exploit and transfer knowledge from readily available auxiliary tagging resources to improve another social web retrieval system. Our shared subspace learning framework is applicable to a range of problems where one needs to exploit the strengths existing among multiple and heterogeneous datasets.
【Keywords】: image and video retrieval; nonnegative shared subspace learning; social media; transfer learning
【Paper Link】 【Pages】:1179-1188
【Authors】: Jianhui Chen ; Ji Liu ; Jieping Ye
【Abstract】: We consider the problem of learning incoherent sparse and low-rank patterns from multiple tasks. Our approach is based on a linear multi-task learning formulation, in which the sparse and low-rank patterns are induced by a cardinality regularization term and a low-rank constraint, respectively. This formulation is non-convex; we convert it into its convex surrogate, which can be routinely solved via semidefinite programming for small-size problems. We propose to employ the general projected gradient scheme to efficiently solve such a convex surrogate; however, in the optimization formulation, the objective function is non-differentiable and the feasible domain is non-trivial. We present the procedures for computing the projected gradient and ensuring the global convergence of the projected gradient scheme. The computation of projected gradient involves a constrained optimization problem; we show that the optimal solution to such a problem can be obtained via solving an unconstrained optimization subproblem and an Euclidean projection subproblem. In addition, we present two projected gradient algorithms and discuss their rates of convergence. Experimental results on benchmark data sets demonstrate the effectiveness of the proposed multi-task learning formulation and the efficiency of the proposed projected gradient algorithms.
【Keywords】: multi-task learning; sparse and low-rank patterns; trace norm
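The two subproblems identified in the abstract have familiar closed-form building blocks when the convex surrogate uses an l1 penalty for the sparse part and a trace-norm penalty for the low-rank part: entrywise soft-thresholding and singular-value thresholding, respectively. The proximal-gradient loop below is an illustrative sketch on a least-squares multi-task loss, not the paper's projected-gradient scheme or its convergence-rate variants.

    import numpy as np

    def soft_threshold(A, tau):
        """Proximal map of tau * ||A||_1 (entrywise soft-thresholding)."""
        return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

    def svd_threshold(A, tau):
        """Proximal map of tau * ||A||_* (singular-value soft-thresholding)."""
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

    def sparse_lowrank_multitask(Xs, ys, dim, lam_s=0.1, lam_r=0.1, eta=1e-3, iters=500):
        """Learn W = P + Q with one weight column per task: P sparse, Q low rank.
        Xs[t]: (n_t, dim) features of task t; ys[t]: (n_t,) targets."""
        T = len(Xs)
        P = np.zeros((dim, T))
        Q = np.zeros((dim, T))
        for _ in range(iters):
            G = np.zeros((dim, T))
            for t in range(T):
                w_t = P[:, t] + Q[:, t]
                G[:, t] = Xs[t].T @ (Xs[t] @ w_t - ys[t]) / len(ys[t])   # per-task least-squares gradient
            P = soft_threshold(P - eta * G, eta * lam_s)                 # sparse component update
            Q = svd_threshold(Q - eta * G, eta * lam_r)                  # low-rank component update
        return P, Q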
【Paper Link】 【Pages】:1189-1198
【Authors】: Olivier Chapelle ; Pannagadatta K. Shivaswamy ; Srinivas Vadrevu ; Kilian Q. Weinberger ; Ya Zhang ; Belle L. Tseng
【Abstract】: In this paper we propose a novel algorithm for multi-task learning with boosted decision trees. We learn several different learning tasks with a joint model, explicitly addressing the specifics of each learning task with task-specific parameters and the commonalities between them through shared parameters. This enables implicit data sharing and regularization. We evaluate our learning method on web-search ranking data sets from several countries. Here, multi-task learning is particularly helpful, as data sets from different countries vary widely in size because of the cost of editorial judgments. Our experiments validate that learning various tasks jointly can lead to significant improvements in performance with surprising reliability.
【Keywords】: boosting; decision trees; multi-task learning; ranking; web search
【Paper Link】 【Pages】:1199-1208
【Authors】: Yu Zhang ; Dit-Yan Yeung
【Abstract】: Distance metric learning plays a very crucial role in many data mining algorithms because the performance of an algorithm relies heavily on choosing a good metric. However, the labeled data available in many applications is scarce and hence the metrics learned are often unsatisfactory. In this paper, we consider a transfer learning setting in which some related source tasks with labeled data are available to help the learning of the target task. We first propose a convex formulation for multi-task metric learning by modeling the task relationships in the form of a task covariance matrix. Then we regard transfer learning as a special case of multi-task learning and adapt the formulation of multi-task metric learning to the transfer learning setting for our method, called transfer metric learning (TML). In TML, we learn the metric and the task covariances between the source tasks and the target task under a unified convex formulation. To solve the convex optimization problem, we use an alternating method in which each subproblem has an efficient solution. Experimental results on some commonly used transfer learning applications demonstrate the effectiveness of our method.
【Keywords】: metric learning; multi-task learning; transfer learning
【Paper Link】 【Pages】:1209-1212
【Authors】: Hillol Kargupta ; João Gama ; Wei Fan
【Abstract】:
【Keywords】: data mining; transportation systems