23. KDD 2017:Halifax, NS, Canada

Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, August 13 - 17, 2017. ACM 【DBLP Link

Paper Num: 232 || Session Num: 7

KDD 2017 Keynote Talks 3

1. What's Fair?

Paper Link】 【Pages】:1

【Authors】: Cynthia Dwork

【Abstract】: Data, algorithms, and systems have biases embedded within them reflecting designers' explicit and implicit choices, historical biases, and societal priorities. They form, literally and inexorably, a codification of values. "Unfairness" of algorithms -- for tasks ranging from advertising to recidivism prediction -- has attracted considerable attention in the popular press. The talk will discuss the nascent mathematically rigorous study of fairness in classification and scoring.

【Keywords】: algorithmic bias; classification fairness; data bias; data mining.; fairness; scoring fairness; system bias

2. The Future of Data Integration.

Paper Link】 【Pages】:3

【Authors】: Renée J. Miller

【Abstract】: The value of data explodes when it is integrated. In this talk, I present some important innovations in data integration over the last two decades. These include data exchange [1], which provides a foundation for reasoning about the correctness of transformed data, and the use of declarative mappings in integration [2]. I discuss how data mining has been used to facilitate data integration, including constraint discovery [3], mapping discovery [4], and in schema discovery to combat database decay and facilitate integration [5,6]. I present some important new data integration challenges that arise in data science. These include the use of mining for query and visualization recommendation over massive data lakes [7] and data set search, finding datasets of interest at interactive speeds [8].

【Keywords】: data science

3. Three Principles of Data Science: Predictability, Stability and Computability.

Paper Link】 【Pages】:5

【Authors】: Bin Yu

【Abstract】: In this talk, I'd like to discuss the intertwining importance and connections of three principles of data science in the title in data-driven decisions. Making prediction as its central task and embracing computation as its core, machine learning has enabled wide-ranging data-driven successes. Prediction is a useful way to check with reality. Good prediction implicitly assumes stability between past and future. Stability (relative to data and model perturbations) is also a minimum requirement for interpretability and reproducibility of data driven results (cf. Yu, 2013). It is closely related to uncertainty assessment. Obviously, both prediction and stability principles can not be employed without feasible computational algorithms, hence the importance of computability. The three principles will be demonstrated in the context of two neuroscience projects and through analytical connections. In particular, the first project adds stability to predictive modeling used for reconstruction of movies from fMRI brain signlas for interpretable models. The second project use predictive transfer learning that combines AlexNet, GoogleNet and VGG with single V4 neuron data for state-of-the-art prediction performance. Our results lend support, to a certain extent, to the resemblance of these CNNs to brain and at the same time provide stable pattern interpretations of neurons in the difficult primate visual cortex V4.

【Keywords】:

KDD 2017 Applied Invited Talks 12

4. Foreword to the Applied Data Science: Invited Talks Track at KDD-2017.

Paper Link】 【Pages】:7-8

【Authors】: Usama M. Fayyad ; Evangelos Simoudis ; Ashok Srivastava

【Abstract】: The Applied Data Science (ADS) Invited Talks Track at KDD-2017 is a continuation of what has now become a "7-year tradition" at KDD conferences. This is the second year the track operates under the ADS name, an evolution from its origins at KDD-2011 as the "Industry Practice Expo". The KDD Conference on Knowledge Discovery and Data Mining (KDD) is the world's first, largest and best conference on Data Science, Data Mining, and Knowledge Discovery. It brings together a healthy mix of academic researchers, industry and government researchers, and practitioners from a wide range of institutions and fields. The primary focus on KDD is on peer-reviewed research contributions and the academic advancement of the field. This is an important goal and in fact the KDD conference is now recognized as the most competitive and prestigious forum for presenting high quality research results. KDD, being fundamentally an applied field, needs the strong representation of applied work of big impact. Over the years of running the conference we observed that our initial speaker-selection approach needed to be re-thought because of the important contributions made to the field outside traditional academic, industrial and government research laboratories. The result of this re-thinking was to create a forum that exposes important contributions to Data Science through Big Data Applications that address strategic problems. We wanted to effectively capture the rising importance of Data Science and Machine Learning especially in the Big Data environment where structured and unstructured data create special challenges, and of course present new opportunities. The goal of the Invited Talks Track is to curate contributions from leaders in our field who have made important contributions through the development of a system, the creation of a new and important business, or the development and market introduction of a product,. Some of these important contributions may never see an academic paper or detailed peer-reviewed paper written about them, yet they are of critical importance to our very applied field. To give you an idea of how rapidly growing this area is, and how this sector of our industry and promises to be highly disruptive across many industries, we cite a couple of articles out of a plethora of such coverage: According to IDC, the global revenues from Big Data and business will grow from $130.1 billion in 2016 to more than $203 billion in 2020, at a compound annual growth rate (CAGR) of 11.7% [1]. Furthermore, to quote from a Forbes article: "Data monetization" will become a major source of revenues, as the world will create 180 zettabytes of data (or 180 trillion gigabytes) in 2025, up from less than 10 zettabytes in 2015.? [2]

【Keywords】:

5. More than the Sum of its Parts: Building Domino Data Lab.

Paper Link】 【Pages】:9

【Authors】: Eduardo Ariño de la Rubia

【Abstract】: Industry has always leveraged cutting edge quantitative research techniques. From finance and insurance, to marketing and manufacturing, efficiencies and advantages have been seized through measurement, prediction, and the generation of insights' but never at this scale. Organizations which previously may have employed one or two data scientists are now scaling the work to dozens if not hundreds of practitioners. Where previously only a handful of organizations could boast that they were leveraging machine learning and statistical models, now it's a rarity to find an untouched industry or player. Organizations are now faced with the challenges of empowering, scaling, and measuring this workforce to sustain the transformation to the prediction economy. In this talk, I will discuss how and why we built the Domino Data Lab platform. I will talk about the challenges we faced technologically, organizationally and culturally when bringing a system of record to data science.

【Keywords】: data science; data science organization.; data science scalability; domino data lab; industrial machine learning

6. Mining Big Data in NeuroGenetics to Understand Muscular Dystrophy.

Paper Link】 【Pages】:11

【Authors】: Andy Berglund

【Abstract】: The recent advances in genome sequencing and analyses of the billions of base pairs in genomic data have been a boon for moving forward our understanding of human disease. In this talk I will describe how genome sequencing has dramatically improved our understanding of the most common adult form of muscular dystrophy, which is myotonic dystrophy. Two different genetic mutations cause thousands of changes in the cells and tissues of myotonic dystrophy patients. Genome sequencing has allowed us to precisely determine the degree of changes across patients, correlate these changes to disease symptoms and allow us to determine quickly in cell and animal models the effectiveness of therapeutic strategies for myotonic dystrophy.

【Keywords】: data mining.; genome sequencing; genomic data; muscular dystrophy; myotonic dystrophy

7. Industrial Machine Learning.

Paper Link】 【Pages】:13

【Authors】: Josh Bloom

【Abstract】: The ongoing digitization of the industrial-scale machines that power and enable human activity is itself a major global transformation. But the real revolution-in efficiencies, in improved and saved lives-will happen as machine learning automation and insights are properly coupled to the complex systems of industrial data. Leveraging a systems view of real-world use cases from aviation to transportation, I contrast the needs and approaches of consumer versus industrial machine learning. Particularly, I focus on three key areas: combining physics-based models to data-driven models, differential privacy and secure ML (including edge-to-cloud strategies), and interpretability of model predictions.

【Keywords】: data mining.; data-driven models; differential privacy; industrial machine learning; interpretability; physics-based models; secure machine learning

8. Behavior Informatics to Discover Behavior Insight for Active and Tailored Client Management.

Paper Link】 【Pages】:15-16

【Authors】: Longbing Cao

【Abstract】: Behavior is ubiquitous, and behavior intelligence and insight play an important role in data understanding and business problem-solving. Behavior Informatics [1,2] emerges as an important tool for discovering behavior intelligence and behavior insight. As a computational concept, behavior captures the aspects of the demographics of behavioral subjects and objects; social relationships or norms governing the interactions between behaviors of an individual or a group; behavior sequences or networks and their dynamics; and the impact or effect generated by the behaviors undertaken by subjects on objects. Accordingly, a behavior model [2] captures the subject and the object of a behavior or behavior sequence, the activities conducted by its subject on objects, and the relationships between activities; behavior subject, object, activities and relationships are characterized by their respective attributes. As a result, a behavior is represented as a behavior attributes-based vector; and a subject's behaviors at a time period form a vector-based sequence, namely, represented as a behavior attribute vector-based matrix [3]. With such behavior modeling and from the informatics perspective, behavior informatics takes a top-down approach to systematically and deeply represent, model, reason about, and aggregate behaviors [4]; and a bottom-up approach to analyze and learn behavior occurrences, non-occurrences, dynamics, impact, and utility [2]. Accordingly, for a real-life problem, first, its data is converted to behavioral data according to the above behavior model, characterized by the relevant activities that form behavior sequences, and the properties of subjects, objects, activities, and relationships. Second, analytical tasks, such as behavior pattern analysis, abnormal behavior detection, coupled analysis of group behaviors, modeling of behavior impact and utility, discovery of high impact and high utility behaviors, analysis of non-occurring behaviors, and analysis of behavior evolution and dynamics, can be undertaken on such behavioral data. In this way, complex behaviors are quantifiable, computable [5], and manageable. This talk introduces some of real-life applications of behavior informatics in core business, capital markets and government services. It involves complex individual and group behaviors in relevant business, the interactions between clients and service providers, and relevant behavior sequences and attributes. The examples demonstrate the personalized and early prediction, the prevention and intervention of abnormal behaviors, and the active and tailored management of suspicious clients. Examples include the detection of pool manipulation through analyzing coupled sequences [3] of trading behaviors from multiple associated accounts in stock markets, the intervention of high-impact [6] behaviors in social security for preventing overpayments, the quantification and identification of high-utility [7] behaviors, the identification of and tailored intervention on self-finalizing versus non-self-finalizing taxpaying behaviors [8,9], and even the impact of non-occurring behaviors [10] in debt recovery and prevention. The real-life case studies show the value and potential of behavior informatics for handling complex and challenging risk management, fraud and non-compliance, and for active and tailored client management in business problems. The examples are associated with highly significant economic benefits and social impact as a result of applying the resultant behavior insight and behavior intelligence.

【Keywords】: behavior computing; behavior informatics; behavior insight; behavior intelligence; behavior modeling; coupled behavior analysis; non-occurring behavior; user modeling

9. It Takes More than Math and Engineering to Hit the Bullseye with Data.

Paper Link】 【Pages】:17

【Authors】: Paritosh Desai

【Abstract】: Adopting algorithmic decision-making in a large and complex enterprise such as a Fortune 50 retailer like Target takes much more than clean, reliable data and great data mining capabilities. Yet data practitioners too often start with advanced math and fancy algorithms, rather than working hand-in-hand with business partners to identify and understand the biggest business problems. (Then teams should move onto how algorithms can be applied to those problems.) Another key step for data scientists at large organizations: ensuring that their business partners -- the merchants, marketers and supply chain experts -- have a base-line understanding of advanced models as well as the proper analytical support tools. Obtaining widespread buy-in and enthusiasm also requires providing a user-friendly interface for business partners with optionality and flexibility that allows the intelligence to be applied to the many varied issues facing a modern retailer, from personalization to supply chain transformation to decisions on assortment and pricing. This talk will explore effective practices and processes -- the do's and don'ts -- for data scientists to succeed in large, complex organizations like a retailer with 1,800+ stores, major marketing campaigns across multiple channels and a fast growing online business.

【Keywords】: data mining.; data science; decision-making; online business; practices and processes

10. Planning and Learning under Uncertainty: Theory and Practice.

Paper Link】 【Pages】:19

【Authors】: Jonathan P. How

【Abstract】: This talk will describe recent progress on modeling, planning, learning, and control of autonomous systems operating in dynamic environments, with an emphasis on addressing the challenges faced on various timescales. For example, autonomous robotic agents need to plan/execute safe paths and avoid imminent collisions given noisy sensory information (short timescale), learn how to interact with other agents (possibly humans) with intents that are not known (medium timescale), and perform complex cooperative tasks given imperfect models and knowledge of the environment and teammate actions (long timescale). These tasks are often constrained to be done using onboard computation and perception, which can add significant complexity to the system. The talk will highlight several recently developed solutions to these challenges that have been implemented to demonstrate high-speed agile flight of a quadrotor in unknown, cluttered environments, autonomous navigation of a ground vehicle in complex indoor environments alongside pedestrians, and real-time cooperative multiagent planning with an onboard deep learning-based perception system.

【Keywords】: invited talk

11. Big Data in Climate: Opportunities and Challenges for Machine Learning.

Paper Link】 【Pages】:21-22

【Authors】: Anuj Karpatne ; Vipin Kumar

【Abstract】: The climate and Earth sciences have recently undergone a rapid transformation from a data-poor to a data-rich environment. In particular, massive amount of data about Earth and its environment is now continuously being generated by a large number of Earth observing satellites as well as physics-based earth system models running on large-scale computational platforms. These massive and information-rich datasets offer huge potential for understanding how the Earth's climate and ecosystem have been changing and how they are being impacted by humans actions. We discuss the challenges involved in analyzing these massive data sets as well as opportunities they present for both advancing machine learning as well as the science of climate change.

【Keywords】: climate science; earth observation data; machine learning

12. Addressing Challenges with Big Data for Media Measurement.

Paper Link】 【Pages】:23

【Authors】: Mainak Mazumdar

【Abstract】: The digital media and TV - which is increasingly digitized, have amassed and generating enormous amount of data. While extremely useful, the big data generated by these platforms poses unique challenges for Data Scientists working on developing measurement framework and metrics. Most practitioners optimize speed and scale at the expense of accuracy, which is critical for any measurement. And, the trade-off between bias and variance is not in consideration. In this paper, we will demonstrate how Nielsen is combining proprietary ground truth data and methodologies with Big Data to address the accuracy and bias/variance challenges. We argue that high quality ground truth or training set is pre-requisite to deploying Big Data for high quality media measurement. To illustrate the point, we will share how Nielsen is combining its proprietary high quality panels with Set Top Box for TV measurement in the U.S.

【Keywords】: big data; data mining; digital media; media measurement; proprietary data; tv

13. Machine Learning Software in Practice: Quo Vadis?

Paper Link】 【Pages】:25

【Authors】: Szilárd Pafka

【Abstract】: Due to the hype in our industry in the last couple of years, there is a growing mismatch between software tools machine learning practitioners wish for, what they would truly need for their work, what's available (either commercially or open source) and what tool developers and researchers focus on. In this talk we will give a couple of examples of this mismatch. Several surveys and anecdotal evidence show that most practitioners work most of the time (at least in the modeling phase) with datasets that t in the RAM of a single server, therefore distributed computing tools are very of- ten overkill. Our benchmarks (available on github [1]) of the most widely used open source tools for binary classification (various implementations of algorithms such as linear methods, random forests, gradient boosted trees and neural networks) on such data show over 10x speed and over 10x RAM usage difference between various tools, with "big data" tools being the most inefficient. Significant performance gains have been obtained by those tools that incorporate various low-level (close to CPU and memory architecture) optimizations. Nevertheless, we will show that even the best tools show degrading performance on the multi-socket servers featuring a high number of cores, systems that have become widely accessible more recently. Finally, while most of this talk is about performance, we will also argue that machine learning tools that feature high-level easy-to-use APIs provide increasing productivity for practitioners and therefore are preferable.

【Keywords】: accuracy; binary classification; memory footprint; software implementations; training speed

14. Designing AI at Scale to Power Everyday Life.

Paper Link】 【Pages】:27

【Authors】: Rajesh Parekh

【Abstract】: The majority of the experiences and interactions people have on Facebook today are made possible with AI. Well over 1 billion people enjoy unique, personalized experiences on Facebook that are powered by a wealth of AI and machine learning algorithms. AI is an incredibly fast-moving field: engineers and researchers across the company are turning the latest research breakthroughs into tools, platforms, and infrastructure that make it possible for anyone at Facebook to use AI in the experiences and products they build. This talk will look at how Facebook is conducting and applying industry-leading research to help drive advancements in AI disciplines like computer vision, language understanding, speech and video. We will also talk about building an infrastructure that anyone at Facebook can use to easily reuse algorithms in different products, scale to run thousands of simultaneous custom experiments, and give concrete examples of how employees across the company are able to leverage these platforms to build new AI products and services.

【Keywords】: artificial intelligence; computer vision; facebook; language understanding; machine learning infrastructure; speech; video

15. Spaceborne Data Enters the Mainstream.

Paper Link】 【Pages】:29

【Authors】: David Potere

【Abstract】: Thanks to a diverse constellation of Earth observing satellites, humanity is effectively looking everywhere, at everything, all the time. And we've been quietly doing so for a long time. The challenge is that for many years these massive data archives have been stranded from operational, cloud-based, modern data science. That is changing fast. In this session we'll do a rapid primer on satellite imagery as a source of novel data about our Earth and discuss how machine learning is a key force for translating all of this Earth data into real insights. We'll use the global agriculture system as a case in point, highlighting some of potential that a spaceborne perspective brings to this vital sector of the economy. Along the way, we will explore some of the beautiful imagery of our home planet that fuels this new class of insights.

【Keywords】: agriculture system.; data mining; machine learning; satellite imagery; spaceborne data

KDD 2017 Panels 2

16. Benchmarks and Process Management in Data Science: Will We Ever Get Over the Mess?

Paper Link】 【Pages】:31-32

【Authors】: Usama M. Fayyad ; Arno Candel ; Eduardo Ariño de la Rubia ; Szilárd Pafka ; Anthony Chong ; Jeong-Yoon Lee

【Abstract】: This panel aims to address areas that are widely acknowledged to be of critical importance to the success of Data Science projects and to the healthy growth of KDD/Data Science as a field of scientific research. However, despite this acknowledgement of their criticality, these areas receive insufficient attention in the major conferences in the field. Furthermore, there is a lack of actual actions and tools to address these areas in actual practice. These areas are summarized as follows:

  1. Ask any data scientist or machine learning practitioner what they spend the majority of their time working on, and you will most likely get an answer that indicates that 80% to 90% of their time is spent on "Data Chasing", "Data Sourcing", "Data Wrangling", "Data Cleaning" and generally what researchers would refer to-often dismissively-as "Data Preparation". The process of producing statistical or data mining models from data is typically "messy" and certainly lacks management tools to help manage, replicate, reconstruct, and capture all the knowledge that goes in 90% of activities of a Data Scientists. The intensive Data Engineering work that goes into exploring and determining the representation of problem and the significant amount of "data cleaning" that ensues creates a plethora of extracts, files, and many artifacts that are only meaningful to the data scientist.
  2. The severe lack of Benchmarks in the field, especially ones at big data scale is an impediment to true, objective, measurable progress on performance. The results of each paper are highly dependent on the large degree of freedom an author has on configuring competitive models and on determining which data sets to use (often Data that is not available to others to replicate results on)
  3. Monitoring the health of models in production, and deploying models into production environments efficiently and effectively is a black art and often an ignored area. Many models are effectively "orphans" with no means of getting appropriate health monitoring. The task of deploying a built model to production is frequently beyond the capabilities of a Data Scientists and the understanding of the IT team.

【Keywords】: accuracy; data benchmarks; memory footprint; model deployment and monitoring; model management; performance benchmarks; software implementations; training speed

17. The Future of Artificially Intelligent Assistants.

Paper Link】 【Pages】:33-34

【Authors】: Muthu Muthukrishnan ; Andrew Tomkins ; Larry P. Heck ; Alborz Geramifard ; Deepak Agarwal

【Abstract】: Artificial Intelligence has been present in literature at least since the ancient Greeks. Depictions present a wide range of perspectives of AI ranging from malefic overlords to depressive androids. Perhaps the most common recurring theme is the AI Assistant: C3PO from Star Wars; the Jetson's Rosie the Robot; the benign hyper-efficient Minds of Iain M. Banks's Culture novels; the eerie HAL 9000 of Arthur C. Clarke's 2001: A Space Odyssey. Today, artificially intelligent assistants are actual products in the marketplace, based on startling recent progress in technologies like speaker-independent speech recognition. These products are in their infancy, but are improving rapidly. In this panel, we will address the product and technology landscape, and will ask a series of experts in the field plus the members of the audience to take a stance on what the future of artificially intelligent assistants will look like.

【Keywords】: artificial intelligence; conversational bots; intelligent assistants; speech recognition

KDD 2017 Research Papers (Oral Papers) 64

18. Learning Certifiably Optimal Rule Lists.

Paper Link】 【Pages】:35-44

【Authors】: Elaine Angelino ; Nicholas Larus-Stone ; Daniel Alabi ; Margo Seltzer ; Cynthia Rudin

【Abstract】: We present the design and implementation of a custom discrete optimization technique for building rule lists over a categorical feature space. Our algorithm provides the optimal solution, with a certificate of optimality. By leveraging algorithmic bounds, efficient data structures, and computational reuse, we achieve several orders of magnitude speedup in time and a massive reduction of memory consumption. We demonstrate that our approach produces optimal rule lists on practical problems in seconds. This framework is a novel alternative to CART and other decision tree methods.

【Keywords】: decision trees; interpretable models; optimization; rule lists

19. Improved Degree Bounds and Full Spectrum Power Laws in Preferential Attachment Networks.

Paper Link】 【Pages】:45-53

【Authors】: Chen Avin ; Zvi Lotker ; Yinon Nahum ; David Peleg

【Abstract】: Consider a random preferential attachment model G(p) for network evolution that allows both node and edge arrivals. Starting with an arbitrary nonempty graph G0, at each time step, there are two possible events: with probability p > 0 a new node arrives and a new edge is added between the new node and an existing node, and with probability 1 - p a new edge is added between two existing nodes. In both cases, the involved existing nodes are chosen at random according to preferential attachment, i.e., with probability proportional to their degree. G(p) is known to generate power law networks, i.e., the fraction of nodes with degree k is proportional to k-β. Here β=(4-p)/(2-p) is in the range (2,3]. Denoting the number of nodes of degree k at time t by mk,t, we significantly improve some long-standing results. In particular, we show that mk,t is concentrated around its mean with a deviation of O(√t), which is independent of k. We also tightly bound the expectation Emk,t with an additive error of O(1/k), which is independent of t. These new bounds allow us to tightly estimate mk,t for a considerably larger k values than before. This, in turn, enables us to estimate other important quantities, e.g., the size of the k-rich club, namely, the set of all nodes with a degree at least k. Finally, we introduce a new generalized model, G(pt, rt, qt), which extends G(p) by allowing also time-varying probabilities for node and edge arrivals, as well as the formation of new components. We show that the extended model can produce power law networks with any exponent β in the range (1,∞). Furthermore, the concentration bounds established for mk,t in G(p) also apply in G(pt, rt, qt).

【Keywords】: networks; power law; preferential attachment

20. Unsupervised Network Discovery for Brain Imaging Data.

Paper Link】 【Pages】:55-64

【Authors】: Zilong Bai ; Peter B. Walker ; Anna E. Tschiffely ; Fei Wang ; Ian Davidson

【Abstract】: A common problem with spatiotemporal data is how to simplify the data to discover an underlying network that consists of cohesive spatial regions (nodes) and relationships between those regions (edges). This network discovery problem naturally exists in a multitude of domains including climate data (dipoles), astronomical data (gravitational lensing) and the focus of this paper, fMRI scans of human subjects. Whereas previous work requires strong supervision, we propose an unsupervised matrix tri-factorization formulation with complex constraints and spatial regularization. We show that this formulation works well in controlled experiments with synthetic networks and is able to recover the underlying ground-truth network. We then show that for real fMRI data our approach can reproduce well known results in neurology regarding the default mode network in resting-state healthy and Alzheimer affected individuals.

【Keywords】: brain; fmri; network discovery; spatial regularization; spatiotemporal data

21. Patient Subtyping via Time-Aware LSTM Networks.

Paper Link】 【Pages】:65-74

【Authors】: Inci M. Baytas ; Cao Xiao ; Xi Zhang ; Fei Wang ; Anil K. Jain ; Jiayu Zhou

【Abstract】: In the study of various diseases, heterogeneity among patients usually leads to different progression patterns and may require different types of therapeutic intervention. Therefore, it is important to study patient subtyping, which is grouping of patients into disease characterizing subtypes. Subtyping from complex patient data is challenging because of the information heterogeneity and temporal dynamics. Long-Short Term Memory (LSTM) has been successfully used in many domains for processing sequential data, and recently applied for analyzing longitudinal patient records. The LSTM units are designed to handle data with constant elapsed times between consecutive elements of a sequence. Given that time lapse between successive elements in patient records can vary from days to months, the design of traditional LSTM may lead to suboptimal performance. In this paper, we propose a novel LSTM unit called Time-Aware LSTM (T-LSTM) to handle irregular time intervals in longitudinal patient records. We learn a subspace decomposition of the cell memory which enables time decay to discount the memory content according to the elapsed time. We propose a patient subtyping model that leverages the proposed T-LSTM in an auto-encoder to learn a powerful single representation for sequential records of patients, which are then used to cluster patients into clinical subtypes. Experiments on synthetic and real world datasets show that the proposed T-LSTM architecture captures the underlying structures in the sequences with time irregularities.

【Keywords】: long-short term memory; patient subtyping; recurrent neural network

22. Robust Top-k Multiclass SVM for Visual Category Recognition.

Paper Link】 【Pages】:75-83

【Authors】: Xiaojun Chang ; Yaoliang Yu ; Yi Yang

【Abstract】: Classification problems with a large number of classes inevitably involve overlapping or similar classes. In such cases it seems reasonable to allow the learning algorithm to make mistakes on similar classes, as long as the true class is still among the top-k (say) predictions. Likewise, in applications such as search engine or ad display, we are allowed to present k predictions at a time and the customer would be satisfied as long as her interested prediction is included. Inspired by the recent work of [15], we propose a very generic, robust multiclass SVM formulation that directly aims at minimizing a weighted and truncated combination of the ordered prediction scores. Our method includes many previous works as special cases. Computationally, using the Jordan decomposition Lemma we show how to rewrite our objective as the difference of two convex functions, based on which we develop an efficient algorithm that allows incorporating many popular regularizers (such as the l2 and l1 norms). We conduct extensive experiments on four real large-scale visual category recognition datasets, and obtain very promising performances.

【Keywords】: top-k multiclass svm; visual category recognition

23. KATE: K-Competitive Autoencoder for Text.

Paper Link】 【Pages】:85-94

【Authors】: Yu Chen ; Mohammed J. Zaki

【Abstract】: Autoencoders have been successful in learning meaningful representations from image datasets. However, their performance on text datasets has not been widely studied. Traditional autoencoders tend to learn possibly trivial representations of text documents due to their confoundin properties such as high-dimensionality, sparsity and power-law word distributions. In this paper, we propose a novel k-competitive autoencoder, called KATE, for text documents. Due to the competition between the neurons in the hidden layer, each neuron becomes specialized in recognizing specific data patterns, and overall the model can learn meaningful representations of textual data. A comprehensive set of experiments show that KATE can learn better representations than traditional autoencoders including denoising, contractive, variational, and k-sparse autoencoders. Our model also outperforms deep generative models, probabilistic topic models, and even word representation models (e.g., Word2Vec) in terms of several downstream tasks such as document classification, regression, and retrieval.

【Keywords】: autoencoders; competitive learning; representation learning; text analytics

24. A Minimal Variance Estimator for the Cardinality of Big Data Set Intersection.

Paper Link】 【Pages】:95-103

【Authors】: Reuven Cohen ; Liran Katzir ; Aviv Yehezkel

【Abstract】: In recent years there has been a growing interest in developing "streaming algorithms" for efficient processing and querying of continuous data streams. These algorithms seek to provide accurate results while minimizing the required storage and the processing time, at the price of a small inaccuracy in their output. A fundamental query of interest is the intersection size of two big data streams. This problem arises in many different application areas, such as network monitoring, database systems, data integration and information retrieval. In this paper we develop a new algorithm for this problem, based on the Maximum Likelihood (ML) method. We show that this algorithm outperforms all known schemes in terms of the estimation's quality (lower variance) and that it asymptotically achieves the optimal variance.

【Keywords】: cardinality estimation; data mining; set intersection; streaming algorithms

25. HyperLogLog Hyperextended: Sketches for Concave Sublinear Frequency Statistics.

Paper Link】 【Pages】:105-114

【Authors】: Edith Cohen

【Abstract】: One of the most common statistics computed over data elements is the number of distinct keys. A thread of research pioneered by Flajolet and Martin three decades ago culminated in the design of optimal approximate counting sketches, which have size that is double logarithmic in the number of distinct keys and provide estimates with a small relative error. Moreover, the sketches are composable, and thus suitable for streamed, parallel, or distributed computation. We consider here all statistics of the frequency distribution of keys, where a contribution of a key to the aggregate is concave and grows (sub)linearly with its frequency. These fundamental aggregations are very common in text, graphs, and logs analysis and include logarithms, low frequency moments, and cap statistics. We design composable sketches of double-logarithmic size for all concave sublinear statistics. Our design combines theoretical optimality and practical simplicity. In a nutshell, we specify tailored mapping functions of data elements to output elements so that our target statistics on the data elements is approximated by the (max-) distinct statistics of the output elements, which can be approximated using off-the-shelf sketches. Our key insight is relating these target statistics to the complement Laplace transform of the input frequencies.

【Keywords】: concave sublinear; distributed aggregation; frequency statistics; laplace transform; streaming

26. Fast Enumeration of Large k-Plexes.

Paper Link】 【Pages】:115-124

【Authors】: Alessio Conte ; Donatella Firmani ; Caterina Mordente ; Maurizio Patrignani ; Riccardo Torlone

【Abstract】: K-plexes are a formal yet flexible way of defining communities in networks. They generalize the notion of cliques and are more appropriate in most real cases: while a node of a clique C is connected to all other nodes of C, a node of a k-plex may miss up to k connections. Unfortunately, computing all maximal k-plexes is a gruesome task and state-of-the-art algorithms can only process small-size networks. In this paper we propose a new approach for enumerating large k-plexes in networks that speeds up the search by several orders of magnitude, leveraging on (i) methods for strongly reducing the search space and (ii) efficient techniques for the computation of maximal cliques. Several experiments show that our strategy is effective and is able to increase the size of the networks for which the computation of large k-plexes is feasible from a few hundred to several hundred thousand nodes.

【Keywords】: cliques; communities; k-plexes

27. Matrix Profile V: A Generic Technique to Incorporate Domain Knowledge into Motif Discovery.

Paper Link】 【Pages】:125-134

【Authors】: Hoang Anh Dau ; Eamonn J. Keogh

【Abstract】: Time series motif discovery has emerged as perhaps the most used primitive for time series data mining, and has seen applications to domains as diverse as robotics, medicine and climatology. There has been recent significant progress on the scalability of motif discovery. However, we believe that the current definitions of motif discovery are limited, and can create a mismatch between the user's intent/expectations, and the motif discovery search outcomes. In this work, we explain the reasons behind these issues, and introduce a novel and general framework to address them. Our ideas can be used with current state-of-the-art algorithms with virtually no time or space overhead, and are fast enough to allow real-time interaction and hypotheses testing on massive datasets. We demonstrate the utility of our ideas on domains as diverse as seismology and epileptic seizure monitoring.

【Keywords】: interactive data mining; matrix profile; motif discovery; time series

28. metapath2vec: Scalable Representation Learning for Heterogeneous Networks.

Paper Link】 【Pages】:135-144

【Authors】: Yuxiao Dong ; Nitesh V. Chawla ; Ananthram Swami

【Abstract】: We study the problem of representation learning in heterogeneous networks. Its unique challenges come from the existence of multiple types of nodes and links, which limit the feasibility of the conventional network embedding techniques. We develop two scalable representation learning models, namely metapath2vec and metapath2vec++. The metapath2vec model formalizes meta-path-based random walks to construct the heterogeneous neighborhood of a node and then leverages a heterogeneous skip-gram model to perform node embeddings. The metapath2vec++ model further enables the simultaneous modeling of structural and semantic correlations in heterogeneous networks. Extensive experiments show that metapath2vec and metapath2vec++ are able to not only outperform state-of-the-art embedding models in various heterogeneous network mining tasks, such as node classification, clustering, and similarity search, but also discern the structural and semantic correlations between diverse network objects.

【Keywords】: feature learning; heterogeneous information networks; heterogeneous representation learning; latent representations; network embedding

29. Ego-Splitting Framework: from Non-Overlapping to Overlapping Clusters.

Paper Link】 【Pages】:145-154

【Authors】: Alessandro Epasto ; Silvio Lattanzi ; Renato Paes Leme

【Abstract】: We propose ego-splitting, a new framework for detecting clusters in complex networks which leverage the local structures known as ego-nets (i.e. the subgraph induced by the neighborhood of each node) to de-couple overlapping clusters. Ego-splitting is a highly scalable and flexible framework, with provable theoretical guarantees, that reduces the complex overlapping clustering problem to a simpler and more amenable non-overlapping (partitioning) problem. We can scale community detection to graphs with tens of billions of edges and outperform previous solutions based on ego-nets analysis. More precisely, our framework works in two steps: a local ego-net analysis phase, and a global graph partitioning phase. In the local step, we first partition the nodes' ego-nets using a partitioning algorithm. We then use the computed clusters to split each node into its persona nodes that represent the instantiations of the node in its communities. Finally, in the global step, we partition the newly created graph to obtain an overlapping clustering of the original graph.

【Keywords】: ego-nets; large-scale graph algorithms; overlapping clustering

30. Contextual Motifs: Increasing the Utility of Motifs using Contextual Data.

Paper Link】 【Pages】:155-164

【Authors】: Ian Fox ; Lynn Ang ; Mamta Jaiswal ; Rodica Pop-Busui ; Jenna Wiens

【Abstract】: Motifs are a powerful tool for analyzing physiological waveform data. Standard motif methods, however, ignore important contextual information (e.g., what the patient was doing at the time the data were collected). We hypothesize that these additional contextual data could increase the utility of motifs. Thus, we propose an extension to motifs, contextual motifs, that incorporates context. Recognizing that, oftentimes, context may be unobserved or unavailable, we focus on methods to jointly infer motifs and context. Applied to both simulated and real physiological data, our proposed approach improves upon existing motif methods in terms of the discriminative utility of the discovered motifs. In particular, we discovered contextual motifs in continuous glucose monitor (CGM) data collected from patients with type 1 diabetes. Compared to their contextless counterparts, these contextual motifs led to better predictions of hypo- and hyperglycemic events. Our results suggest that even when inferred, context is useful in both a long- and short-term prediction horizon when processing and interpreting physiological waveform data.

【Keywords】: blood glucose; contextual motifs; motif discovery

31. Unsupervised P2P Rental Recommendations via Integer Programming.

Paper Link】 【Pages】:165-173

【Authors】: Yanjie Fu ; Guannan Liu ; Mingfei Teng ; Charu C. Aggarwal

【Abstract】: Due to the sparseness of quality rating data, unsupervised recommender systems are used in many applications in Peer to Peer (P2P) rental marketplaces such as Airbnb, FlipKey, and HomeAway. We present an integer programming based recommender systems, where both accommodation benefits and community risks of lodging places are measured and incorporated into an objective function as utility measurements. More specifically, we first present an unsupervised fused scoring method for quantifying the accommodation benefits and community risks of a lodging with crowd-sourced geo-tagged data. In order to the utility of recommendations, we formulate the unsupervised P2P rental recommendations as a constrained integer programming problem, where the accommodation benefits of recommendations are maximized and the community risks of recommendations are minimized, while maintaining constraints on personalization. Furthermore, we provide an efficient solution for the optimization problem by developing a learning-to-integer-programming method for combining aggregated listwise learning to rank into branching variable selection. We apply the proposed approach to the Airbnb data of New York City and provide lodging recommendations to travelers. In our empirical experiments, we demonstrate both the efficiency and effectiveness of our method in terms of striving a trade-off between the user satisfaction, time on market, and the number of reviews, and achieving a balance between positive and negative sides.

【Keywords】: integer programming; learning to optimize; unsupervised recommendations

32. The Co-Evolution Model for Social Network Evolving and Opinion Migration.

Paper Link】 【Pages】:175-184

【Authors】: Yupeng Gu ; Yizhou Sun ; Jianxi Gao

【Abstract】: Almost all real-world social networks are dynamic and evolving with time, where new links may form and old links may drop, largely determined by the homophily of social actors (i.e., nodes in the network). Meanwhile, (latent) properties of social actors, such as their opinions, are changing along the time, partially due to social influence received from the network, which will in turn affect the network structure. Social network evolution and node property migration are usually treated as two orthogonal problems, and have been studied separately. In this paper, we propose a co-evolution model that closes the loop by modeling the two phenomena together, which contains two major components: (1) a network generative model when the node property is known; and (2) a property migration model when the social network structure is known. Simulation shows that our model has several nice properties: (1) it can model a broad range of phenomena such as opinion convergence (i.e., herding) and community-based opinion divergence; and (2) it allows to control the evolution via a set of factors such as social influence scope, opinion leader, and noise level. Finally, the usefulness of our model is demonstrated by an application of co-sponsorship prediction for legislative bills in Congress, which outperforms several state-of-the-art baselines.

【Keywords】: co-evolution; dynamic networks; network generation models

33. Groups-Keeping Solution Path Algorithm for Sparse Regression with Automatic Feature Grouping.

Paper Link】 【Pages】:185-193

【Authors】: Bin Gu ; Guodong Liu ; Heng Huang

【Abstract】: Feature selection is one of the most important data mining research topics with many applications. In practical problems, features often have group structure to effect the outcomes. Thus, it is crucial to automatically identify homogenous groups of features for high-dimensional data analysis. Octagonal shrinkage and clustering algorithm for regression (OSCAR) is an important sparse regression approach with automatic feature grouping and selection by ℓ1 norm and pairwise ℓ∞ norm. However, due to over-complex representation of the penalty (especially the pairwise ℓ∞ norm), so far OSCAR has no solution path algorithm which is mostly useful for tuning the model. To address this challenge, in this paper, we propose a groups-keeping solution path algorithm to solve the OSCAR model (OscarGKPath). Given a set of homogenous groups of features and an accuracy bound ε, OscarGKPath can fit the solutions in an interval of regularization parameters while keeping the feature groups. The entire solution path can be obtained by combining multiple such intervals. We prove that all solutions in the solution path produced by OscarGKPath can strictly satisfy the given accuracy bound ε. The experimental results on benchmark datasets not only confirm the effectiveness of our OscarGKPath algorithm, but also show the superiority of our OscarGKPath in cross validation compared with the existing batch algorithm.

【Keywords】: automatic feature grouping; feature selection; oscar; solution path; sparse regression

34. Clustering Individual Transactional Data for Masses of Users.

Paper Link】 【Pages】:195-204

【Authors】: Riccardo Guidotti ; Anna Monreale ; Mirco Nanni ; Fosca Giannotti ; Dino Pedreschi

【Abstract】: Mining a large number of datasets recording human activities for making sense of individual data is the key enabler of a new wave of personalized knowledge-based services. In this paper we focus on the problem of clustering individual transactional data for a large mass of users. Transactional data is a very pervasive kind of information that is collected by several services, often involving huge pools of users. We propose txmeans, a parameter-free clustering algorithm able to efficiently partitioning transactional data in a completely automatic way. Txmeans is designed for the case where clustering must be applied on a massive number of different datasets, for instance when a large set of users need to be analyzed individually and each of them has generated a long history of transactions. A deep experimentation on both real and synthetic datasets shows the practical effectiveness of txmeans for the mass clustering of different personal datasets, and suggests that txmeans outperforms existing methods in terms of quality and efficiency. Finally, we present a personal cart assistant application based on txmeans

【Keywords】: parameter-free clustering algorithm; personal cart assistant; representative basket; transactional clustering

35. Network Inference via the Time-Varying Graphical Lasso.

Paper Link】 【Pages】:205-213

【Authors】: David Hallac ; Youngsuk Park ; Stephen P. Boyd ; Jure Leskovec

【Abstract】: Many important problems can be modeled as a system of interconnected entities, where each entity is recording time-dependent observations or measurements. In order to spot trends, detect anomalies, and interpret the temporal dynamics of such data, it is essential to understand the relationships between the different entities and how these relationships evolve over time. In this paper, we introduce the time-varying graphical lasso (TVGL), a method of inferring time-varying networks from raw time series data. We cast the problem in terms of estimating a sparse time-varying inverse covariance matrix, which reveals a dynamic network of interdependencies between the entities. Since dynamic network inference is a computationally expensive task, we derive a scalable message-passing algorithm based on the Alternating Direction Method of Multipliers (ADMM) to solve this problem in an efficient way. We also discuss several extensions, including a streaming algorithm to update the model and incorporate new observations in real time. Finally, we evaluate our TVGL algorithm on both real and synthetic datasets, obtaining interpretable results and outperforming state-of-the-art baselines in terms of both accuracy and scalability.

【Keywords】: admm; convex optimization; graphical lasso; network inference; time series analysis

36. Toeplitz Inverse Covariance-Based Clustering of Multivariate Time Series Data.

Paper Link】 【Pages】:215-223

【Authors】: David Hallac ; Sagar Vare ; Stephen P. Boyd ; Jure Leskovec

【Abstract】: Subsequence clustering of multivariate time series is a useful tool for discovering repeated patterns in temporal data. Once these patterns have been discovered, seemingly complicated datasets can be interpreted as a temporal sequence of only a small number of states, or clusters. For example, raw sensor data from a fitness-tracking application can be expressed as a timeline of a select few actions (i.e., walking, sitting, running). However, discovering these patterns is challenging because it requires simultaneous segmentation and clustering of the time series. Furthermore, interpreting the resulting clusters is difficult, especially when the data is high-dimensional. Here we propose a new method of model-based clustering, which we call Toeplitz Inverse Covariance-based Clustering (TICC). Each cluster in the TICC method is defined by a correlation network, or Markov random field (MRF), characterizing the interdependencies between different observations in a typical subsequence of that cluster. Based on this graphical representation, TICC simultaneously segments and clusters the time series data. We solve the TICC problem through alternating minimization, using a variation of the expectation maximization (EM) algorithm. We derive closed-form solutions to efficiently solve the two resulting subproblems in a scalable way, through dynamic programming and the alternating direction method of multipliers (ADMM), respectively. We validate our approach by comparing TICC to several state-of-the-art baselines in a series of synthetic experiments, and we then demonstrate on an automobile sensor dataset how TICC can be used to learn interpretable clusters in real-world scenarios.

【Keywords】: alternating minimization; clustering; convex optimization; networks; time series analysis

37. Efficient Correlated Topic Modeling with Topic Embedding.

Paper Link】 【Pages】:225-233

【Authors】: Junxian He ; Zhiting Hu ; Taylor Berg-Kirkpatrick ; Ying Huang ; Eric P. Xing

【Abstract】: Correlated topic modeling has been limited to small model and problem sizes due to their high computational cost and poor scaling. In this paper, we propose a new model which learns compact topic embeddings and captures topic correlations through the closeness between the topic vectors. Our method enables efficient inference in the low-dimensional embedding space, reducing previous cubic or quadratic time complexity to linear w.r.t the topic size. We further speedup variational inference with a fast sampler to exploit sparsity of topic occurrence. Extensive experiments show that our approach is capable of handling model and data scales which are several orders of magnitude larger than existing correlation results, without sacrificing modeling quality by providing competitive or superior performance in document classification and retrieval.

【Keywords】: correlated topic models; scalability; topic embedding

38. Accelerating Innovation Through Analogy Mining.

Paper Link】 【Pages】:235-243

【Authors】: Tom Hope ; Joel Chan ; Aniket Kittur ; Dafna Shahaf

【Abstract】: The availability of large idea repositories (e.g., the U.S. patent database) could significantly accelerate innovation and discovery by providing people with inspiration from solutions to analogous problems. However, finding useful analogies in these large, messy, real-world repositories remains a persistent challenge for either human or automated methods. Previous approaches include costly hand-created databases that have high relational structure (e.g., predicate calculus representations) but are very sparse. Simpler machine-learning/information-retrieval similarity metrics can scale to large, natural-language datasets, but struggle to account for structural similarity, which is central to analogy. In this paper we explore the viability and value of learning simpler structural representations, specifically, "problem schemas", which specify the purpose of a product and the mechanisms by which it achieves that purpose. Our approach combines crowdsourcing and recurrent neural networks to extract purpose and mechanism vector representations from product descriptions. We demonstrate that these learned vectors allow us to find analogies with higher precision and recall than traditional information-retrieval methods. In an ideation experiment, analogies retrieved by our models significantly increased people's likelihood of generating creative ideas compared to analogies retrieved by traditional methods. Our results suggest a promising approach to enabling computational analogy at scale is to learn and leverage weaker structural representations.

【Keywords】: computational analogy; creativity; innovation; product dimensions; text embedding; text mining

39. Communication-Efficient Distributed Block Minimization for Nonlinear Kernel Machines.

Paper Link】 【Pages】:245-254

【Authors】: Cho-Jui Hsieh ; Si Si ; Inderjit S. Dhillon

【Abstract】: Nonlinear kernel machines often yield superior predictive performance on various tasks; however, they suffer from severe computational challenges. In this paper, we show how to overcome the important challenge of speeding up kernel machines using multiple computers. In particular, we develop a parallel block minimization framework, and demonstrate its good scalability in solving nonlinear kernel SVM and logistic regression. Our framework proceeds by dividing the problem into smaller subproblems by forming a block-diagonal approximation of the Hessian matrix. The subproblems are then solved approximately in parallel. After that, a communication efficient line search procedure is developed to ensure sufficient reduction of the objective function value by exploiting the problem structure of kernel machines. We prove global linear convergence rate of the proposed method with a wide class of subproblem solvers, and our analysis covers strongly convex and some non-strongly convex functions. We apply our algorithm to solve large-scale kernel SVM problems on distributed systems, and show a significant improvement over existing parallel solvers. As an example, on the covtype dataset with half-a-million samples, our algorithm can obtain an approximate solution with 96% accuracy in 20 seconds using 32 machines, while all the other parallel kernel SVM solvers require more than 2000 seconds to achieve a solution with 95% accuracy. Moreover, our algorithm is the first distributed kernel SVM solver that can scale to massive data sets. On the KDDB dataset (20 million samples and 30 million features), our parallel solver can compute the kernel SVM solution within half an hour using 32 machines with 640 cores in total, while existing solvers can not scale to this dataset.

【Keywords】: classification; distributed algorithm; kernel methods

40. A Hierarchical Algorithm for Extreme Clustering.

Paper Link】 【Pages】:255-264

【Authors】: Ari Kobren ; Nicholas Monath ; Akshay Krishnamurthy ; Andrew McCallum

【Abstract】: Many modern clustering methods scale well to a large number of data points, N, but not to a large number of clusters, K. This paper introduces PERCH, a new non-greedy, incremental algorithm for hierarchical clustering that scales to both massive N and K---a problem setting we term extreme clustering. Our algorithm efficiently routes new data points to the leaves of an incrementally-built tree. Motivated by the desire for both accuracy and speed, our approach performs tree rotations for the sake of enhancing subtree purity and encouraging balancedness. We prove that, under a natural separability assumption, our non-greedy algorithm will produce trees with perfect dendrogram purity regardless of data arrival order. Our experiments demonstrate that PERCH constructs more accurate trees than other tree-building clustering algorithms and scales well with both N and K, achieving a higher quality clustering than the strongest flat clustering competitor in nearly half the time.

【Keywords】: clustering; large-scale learning

41. Estimating Treatment Effect in the Wild via Differentiated Confounder Balancing.

Paper Link】 【Pages】:265-274

【Authors】: Kun Kuang ; Peng Cui ; Bo Li ; Meng Jiang ; Shiqiang Yang

【Abstract】: Estimating treatment effect plays an important role on decision making in many fields, such as social marketing, healthcare, and public policy. The key challenge on estimating treatment effect in the wild observational studies is to handle confounding bias induced by imbalance of the confounder distributions between treated and control units. Traditional methods remove confounding bias by re-weighting units with supposedly accurate propensity score estimation under the unconfoundedness assumption. Controlling high-dimensional variables may make the unconfoundedness assumption more plausible, but poses new challenge on accurate propensity score estimation. One strand of recent literature seeks to directly optimize weights to balance confounder distributions, bypassing propensity score estimation. But existing balancing methods fail to do selection and differentiation among the pool of a large number of potential confounders, leading to possible underperformance in many high dimensional settings. In this paper, we propose a data-driven Differentiated Confounder Balancing (DCB) algorithm to jointly select confounders, differentiate weights of confounders and balance confounder distributions for treatment effect estimation in the wild high dimensional settings. The synergistic learning algorithm we proposed is more capable of reducing the confounding bias in many observational studies. To validate the effectiveness of our DCB algorithm, we conduct extensive experiments on both synthetic and real datasets. The experimental results clearly demonstrate that our DCB algorithm outperforms the state-of-the-art methods. We further show that the top features ranked by our algorithm generate accurate prediction of online advertising effect.

【Keywords】: causal inference; high dimensional inference; treatment effect estimation

42. The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables.

Paper Link】 【Pages】:275-284

【Authors】: Himabindu Lakkaraju ; Jon M. Kleinberg ; Jure Leskovec ; Jens Ludwig ; Sendhil Mullainathan

【Abstract】: Evaluating whether machines improve on human performance is one of the central questions of machine learning. However, there are many domains where the data is selectively labeled, in the sense that the observed outcomes are themselves a consequence of the existing choices of the human decision-makers. For instance, in the context of judicial bail decisions, we observe the outcome of whether a defendant fails to return for their court appearance only if the human judge decides to release the defendant on bail. This selective labeling makes it harder to evaluate predictive models as the instances for which outcomes are observed do not represent a random sample of the population. Here we propose a novel framework for evaluating the performance of predictive models on selectively labeled data. We develop an approach called contraction which allows us to compare the performance of predictive models and human decision-makers without resorting to counterfactual inference. Our methodology harnesses the heterogeneity of human decision-makers and facilitates effective evaluation of predictive models even in the presence of unmeasured confounders (unobservables) which influence both human decisions and the resulting outcomes. Experimental results on real world datasets spanning diverse domains such as health care, insurance, and criminal justice demonstrate the utility of our evaluation metric in comparing human decisions and machine predictions.

【Keywords】: evaluating machine learning algorithms; selective labels; unmeasured confounders; unobservables

43. Constructivism Learning: A Learning Paradigm for Transparent Predictive Analytics.

Paper Link】 【Pages】:285-294

【Authors】: Xiaoli Li ; Jun Huan

【Abstract】: Developing transparent predictive analytics has attracted significant research attention recently. There have been multiple theories on how to model learning transparency but none of them aims to understand the internal and often complicated modeling processes. In this paper we adopt a contemporary philosophical concept called "constructivism", which is a theory regarding how human learns. We hypothesize that a critical aspect of transparent machine learning is to "reveal" model construction with two key process: (1) the assimilation process where we enhance our existing learning models and (2) the accommodation process where we create new learning models. With this intuition we propose a new learning paradigm, constructivism learning, using a Bayesian nonparametric model to dynamically handle the creation of new learning tasks. Our empirical study on both synthetic and real data sets demonstrate that the new learning algorithm is capable of delivering higher quality models (as compared to base lines and state-of-the-art) and at the same time increasing the transparency of the learning process.

【Keywords】: constructivisim learning; dynamic task construction; sequential dirichlet process; transparent machine learning

44. Is the Whole Greater Than the Sum of Its Parts?

Paper Link】 【Pages】:295-304

【Authors】: Liangyue Li ; Hanghang Tong ; Yong Wang ; Conglei Shi ; Nan Cao ; Norbou Buchler

【Abstract】: The PART-WHOLE relationship routinely finds itself in many disciplines, ranging from collaborative teams, crowdsourcing, autonomous systems to networked systems. From the algorithmic perspective, the existing work has primarily focused on predicting the outcomes of the whole and parts, by either separate models or linear joint models, which assume the outcome of the parts has a linear and independent effect on the outcome of the whole. In this paper, we propose a joint predictive method named PAROLE to simultaneously and mutually predict the part and whole outcomes. The proposed method offers two distinct advantages over the existing work. First (Model Generality), we formulate joint PART-WHOLE outcome prediction as a generic optimization problem, which is able to encode a variety of complex relationships between the outcome of the whole and parts, beyond the linear independence assumption. Second (Algorithm Efficacy), we propose an effective and efficient block coordinate descent algorithm, which is able to find the coordinate-wise optimum with a linear complexity in both time and space. Extensive empirical evaluations on real-world datasets demonstrate that the proposed PAROLE (1) leads to consistent prediction performance improvement by modeling the non-linear part-whole relationship as well as part-part interdependency, and (2) scales linearly in terms of the size of the training dataset.

【Keywords】: joint predictive model; part-whole relationship

45. Collaborative Variational Autoencoder for Recommender Systems.

Paper Link】 【Pages】:305-314

【Authors】: Xiaopeng Li ; James She

【Abstract】: Modern recommender systems usually employ collaborative filtering with rating information to recommend items to users due to its successful performance. However, because of the drawbacks of collaborative-based methods such as sparsity, cold start, etc., more attention has been drawn to hybrid methods that consider both the rating and content information. Most of the previous works in this area cannot learn a good representation from content for recommendation task or consider only text modality of the content, thus their methods are very limited in current multimedia scenario. This paper proposes a Bayesian generative model called collaborative variational autoencoder (CVAE) that considers both rating and content for recommendation in multimedia scenario. The model learns deep latent representations from content data in an unsupervised manner and also learns implicit relationships between items and users from both content and rating. Unlike previous works with denoising criteria, the proposed CVAE learns a latent distribution for content in latent space instead of observation space through an inference network and can be easily extended to other multimedia modalities other than text. Experiments show that CVAE is able to significantly outperform the state-of-the-art recommendation methods with more robust performance.

【Keywords】: autoencoder; bayesian; deep learning; generative models; recommender systems; variational inference

46. Linearized GMM Kernels and Normalized Random Fourier Features.

Paper Link】 【Pages】:315-324

【Authors】: Ping Li

【Abstract】: The method of "random Fourier features (RFF)" has become a popular tool for approximating the "radial basis function (RBF)" kernel. The variance of RFF is actually large. Interestingly, the variance can be substantially reduced by a simple normalization step as we theoretically demonstrate. We name the improved scheme as the "normalized RFF (NRFF)", and we provide a technical proof of the asymptotic variance of NRFF, as validated by simulations. We also propose the "generalized min-max (GMM)" kernel as a measure of data similarity, where data vectors can have both positive and negative entries. GMM is positive definite as there is an associate hashing method named "generalized consistent weighted sampling (GCWS)" which linearizes this (nonlinear) kernel. We provide an extensive empirical evaluation of the RBF and GMM kernels on more than 50 datasets. For a majority of the datasets, the (tuning-free) GMM kernel outperforms the best-tuned RBF kernel. We then conduct extensive classification experiments for comparing the linearized RBF kernel using NRFF with the linearized GMM kernel using GCWS. We observe that, in order to reach a similar accuracy, GCWS typically requires substantially fewer samples than NRFF, even on datasets where the original RBF kernel outperforms the original GMM kernel. As the training, storage, and processing costs are directly proportional to the sample size, our experiments can help demonstrate that GCWS would be a more practical scheme for large-scale machine learning applications. The empirical success of GCWS (compared to NRFF) can also be explained theoretically, from at least two aspects. Firstly, the relative variance (normalized by the squared expectation) of GCWS is substantially smaller than that of NRFF, except for the very high similarity region (where the variances of both methods approach zero). Secondly, if we are allowed to make a general model assumption on the data, then we can show analytically that GCWS exhibits much smaller variance than NRFF for estimating the same object (e.g., the RBF kernel), except for the very high similarity region.

【Keywords】: gmm kernel; hashing; large-scale machine learning; normalized random fourier features

47. Discrete Content-aware Matrix Factorization.

Paper Link】 【Pages】:325-334

【Authors】: Defu Lian ; Rui Liu ; Yong Ge ; Kai Zheng ; Xing Xie ; Longbing Cao

【Abstract】: Precisely recommending relevant items from massive candidates to a large number of users is an indispensable yet computationally expensive task in many online platforms (e.g., Amazon.com and Netflix.com). A promising way is to project users and items into a Hamming space and then recommend items via Hamming distance. However, previous studies didn't address the cold-start challenges and couldn't make the best use of preference data like implicit feedback. To fill this gap, we propose a Discrete Content-aware Matrix Factorization (DCMF) model, 1) to derive compact yet informative binary codes at the presence of user/item content information; 2) to support the classification task based on a local upper bound of logit loss; 3) to introduce an interaction regularization for dealing with the sparsity issue. We further develop an efficient discrete optimization algorithm for parameter learning. Based on extensive experiments on three real-world datasets, we show that DCFM outperforms the state-of-the-arts on both regression and classification tasks.

【Keywords】: collaborative filtering; content-based filtering; discrete hashing; recommendation

48. Effective and Real-time In-App Activity Analysis in Encrypted Internet Traffic Streams.

Paper Link】 【Pages】:335-344

【Authors】: Junming Liu ; Yanjie Fu ; Jingci Ming ; Yong Ren ; Leilei Sun ; Hui Xiong

【Abstract】: The mobile in-App service analysis, aiming at classifying mobile internet traffic into different types of service usages, has become a challenging and emergent task for mobile service providers due to the increasing adoption of secure protocols for in-App services. While some efforts have been made for the classification of mobile internet traffic, existing methods rely on complex feature construction and large storage cache, which lead to low processing speed, and thus not practical for online real-time scenarios. To this end, we develop an iterative analyzer for classifying encrypted mobile traffic in a real-time way. Specifically, we first select an optimal set of most discriminative features from raw features extracted from traffic packet sequences by a novel Maximizing Inner activity similarity and Minimizing Different activity similarity (MIMD) measurement. To develop the online analyzer, we first represent a traffic flow with a series of time windows, which are described by the optimal feature vector and are updated iteratively at the packet level. Instead of extracting feature elements from a series of raw traffic packets, our feature elements are updated when a new traffic packet is observed and the storage of raw traffic packets is not required. The time windows generated from the same service usage activity are grouped by our proposed method, namely, recursive time continuity constrained KMeans clustering (rCKC). The feature vectors of cluster centers are then fed into a random forest classifier to identify corresponding service usages. Finally, we provide extensive experiments on real-world Internet traffic data from Wechat, Whatsapp, and Facebook to demonstrate the effectiveness and efficiency of our approach. The results show that the proposed analyzer provides high accuracy in real-world scenarios, and has low storage cache requirement as well as fast processing speed.

【Keywords】: in-app analytics; internet traffic analysis; service usage classification; time series segmentation

49. Functional Annotation of Human Protein Coding Isoforms via Non-convex Multi-Instance Learning.

Paper Link】 【Pages】:345-354

【Authors】: Tingjin Luo ; Weizhong Zhang ; Shang Qiu ; Yang Yang ; Dongyun Yi ; Guangtao Wang ; Jieping Ye ; Jie Wang

【Abstract】: Functional annotation of human genes is fundamentally important for understanding the molecular basis of various genetic diseases. A major challenge in determining the functions of human genes lies in the functional diversity of proteins, that is, a gene can perform different functions as it may consist of multiple protein coding isoforms (PCIs). Therefore, differentiating functions of PCIs can significantly deepen our understanding of the functions of genes. However, due to the lack of isoform-level gold-standards (ground-truth annotation), many existing functional annotation approaches are developed at gene-level. In this paper, we propose a novel approach to differentiate the functions of PCIs by integrating sparse simplex projection---that is, a nonconvex sparsity-inducing regularizer---with the framework of multi-instance learning (MIL). Specifically, we label the genes that are annotated to the function under consideration as positive bags and the genes without the function as negative bags. Then, by sparse projections onto simplex, we learn a mapping that embeds the original bag space to a discriminative feature space. Our framework is flexible to incorporate various smooth and non-smooth loss functions such as logistic loss and hinge loss. To solve the resulting highly nontrivial non-convex and non-smooth optimization problem, we further develop an efficient block coordinate descent algorithm. Extensive experiments on human genome data demonstrate that the proposed approaches significantly outperform the state-of-the-art methods in terms of functional annotation accuracy of human PCIs and efficiency.

【Keywords】: alternative splicing; human pcis; key instance detection; multiple instance learning; non-convex problem

50. Discovering Reliable Approximate Functional Dependencies.

Paper Link】 【Pages】:355-363

【Authors】: Panagiotis Mandros ; Mario Boley ; Jilles Vreeken

【Abstract】: Given a database and a target attribute of interest, how can we tell whether there exists a functional, or approximately functional dependence of the target on any set of other attributes in the data? How can we reliably, without bias to sample size or dimensionality, measure the strength of such a dependence? And, how can we efficiently discover the optimal or α-approximate top-k dependencies? These are exactly the questions we answer in this paper. As we want to be agnostic on the form of the dependence, we adopt an information-theoretic approach, and construct a reliable, bias correcting score that can be efficiently computed. Moreover, we give an effective optimistic estimator of this score, by which for the first time we can mine the approximate functional dependencies from data with guarantees of optimality. Empirical evaluation shows that the derived score achieves a good bias for variance trade-off, can be used within an efficient discovery algorithm, and indeed discovers meaningful dependencies. Most important, it remains reliable in the face of data sparsity.

【Keywords】: information theory; pattern discovery

51. Towards an Optimal Subspace for K-Means.

Paper Link】 【Pages】:365-373

【Authors】: Dominik Mautz ; Wei Ye ; Claudia Plant ; Christian Böhm

【Abstract】: Is there an optimal dimensionality reduction for k-means, revealing the prominent cluster structure hidden in the data? We propose SUBKMEANS, which extends the classic k-means algorithm. The goal of this algorithm is twofold: find a sufficient k-means-style clustering partition and transform the clusters onto a common subspace, which is optimal for the cluster structure. Our solution is able to pursue these two goals simultaneously. The dimensionality of this subspace is found automatically and therefore the algorithm comes without the burden of additional parameters. At the same time this subspace helps to mitigate the curse of dimensionality. The SUBKMEANS optimization algorithm is intriguingly simple and efficient. It is easy to implement and can readily be adopted to the current situation. Furthermore, it is compatible to many existing extensions and improvements of k-means.

【Keywords】: clustering; dimensionality reduction; k-means; subspace

52. SPARTan: Scalable PARAFAC2 for Large & Sparse Data.

Paper Link】 【Pages】:375-384

【Authors】: Ioakeim Perros ; Evangelos E. Papalexakis ; Fei Wang ; Richard W. Vuduc ; Elizabeth Searles ; Michael Thompson ; Jimeng Sun

【Abstract】: In exploratory tensor mining, a common problem is how to analyze a set of variables across a set of subjects whose observations do not align naturally. For example, when modeling medical features across a set of patients, the number and duration of treatments may vary widely in time, meaning there is no meaningful way to align their clinical records across time points for analysis purposes. To handle such data, the state-of-the-art tensor model is the so-called PARAFAC2, which yields interpretable and robust output and can naturally handle sparse data. However, its main limitation up to now has been the lack of efficient algorithms that can handle large-scale datasets. In this work, we fill this gap by developing a scalable method to compute the PARAFAC2 decomposition of large and sparse datasets, called SPARTan. Our method exploits special structure within PARAFAC2, leading to a novel algorithmic reformulation that is both faster (in absolute time) and more memory-efficient than prior work. We evaluate SPARTan on both synthetic and real datasets, showing 22X performance gains over the best previous implementation and also handling larger problem instances for which the baseline fails. Furthermore, we are able to apply SPARTan to the mining of temporally-evolving phenotypes on data taken from real and medically complex pediatric patients. The clinical meaningfulness of the phenotypes identified in this process, as well as their temporal evolution over time for several patients, have been endorsed by clinical experts.

【Keywords】: parafac2; phenotyping; sparse tensor factorization; unsupervised learning

53. struc2vec: Learning Node Representations from Structural Identity.

Paper Link】 【Pages】:385-394

【Authors】: Leonardo Filipe Rodrigues Ribeiro ; Pedro H. P. Saverese ; Daniel R. Figueiredo

【Abstract】: Structural identity is a concept of symmetry in which network nodes are identified according to the network structure and their relationship to other nodes. Structural identity has been studied in theory and practice over the past decades, but only recently has it been addressed with representational learning techniques. This work presents struc2vec, a novel and flexible framework for learning latent representations for the structural identity of nodes. struc2vec uses a hierarchy to measure node similarity at different scales, and constructs a multilayer graph to encode structural similarities and generate structural context for nodes. Numerical experiments indicate that state-of-the-art techniques for learning node representations fail in capturing stronger notions of structural identity, while struc2vec exhibits much superior performance in this task, as it overcomes limitations of prior approaches. As a consequence, numerical experiments indicate that struc2vec improves performance on classification tasks that depend more on structural identity.

【Keywords】: feature learning; node embeddings; structural identity

54. Similarity Forests.

Paper Link】 【Pages】:395-403

【Authors】: Saket Sathe ; Charu C. Aggarwal

【Abstract】: Random forests are among the most successful methods used in data mining because of their extraordinary accuracy and effectiveness. However, their use is primarily limited to multidimensional data because they sample features from the original data set. In this paper, we propose a method for extending random forests to work with any arbitrary set of data objects, as long as similarities can be computed among the data objects. Furthermore, since it is understood that similarity computation between all O(n2) pairs of n objects might be expensive, our method computes only a very small fraction of the O(n2) pairwise similarities between objects to construct the forests. Our results show that the proposed similarity forest approach is very efficient and accurate on a wide variety of data sets. Therefore, this paper significantly extends the applicability of random forest methods to arbitrary data domains. Furthermore, the approach even outperforms traditional random forests on multidimensional data. We show that similarity forests are robust to the noisy similarity values that are ubiquitous in real-world applications. In many practical settings, the similarity values between objects are incompletely specified because of the difficulty in collecting such values. Similarity forests can be used in such cases with straightforward modifications.

【Keywords】: classification; data mining; random forests

55. Online Ranking with Constraints: A Primal-Dual Algorithm and Applications to Web Traffic-Shaping.

Paper Link】 【Pages】:405-414

【Authors】: Parikshit Shah ; Akshay Soni ; Troy Chevalier

【Abstract】: We study the online constrained ranking problem motivated by an application to web-traffic shaping: an online stream of sessions arrive in which, within each session, we are asked to rank items. The challenge involves optimizing the ranking in each session so that local vs. global objectives are controlled: within each session one wishes to maximize a reward (local) while satisfying certain constraints over the entire set of sessions (global). A typical application of this setup is that of page optimization in a web portal. We wish to rank items so that not only is user engagement maximized in each session, but also other business constraints (such as the number of views/clicks delivered to various publishing partners) are satisfied. We describe an online algorithm for performing this optimization. A novel element of our approach is the use of linear programming duality and connections to the celebrated Hungarian algorithm. This framework enables us to determine a set of shadow prices for each traffic-shaping constraint that can then be used directly in the final ranking function to assign near-optimal rankings. The (dual) linear program can be solved off-line periodically to determine the prices. At serving time these prices are used as weights to compute weighted rank-scores for the items, and the simplicity of the approach facilitates scalability to web applications. We provide rigorous theoretical guarantees for the performance of our online algorithm and validate our approach using numerical experiments on real web-traffic data from a prominent internet portal.

【Keywords】: constrained ranking; online ranking; primal dual algorithm; traffic shaping

56. On Finding Socially Tenuous Groups for Online Social Networks.

Paper Link】 【Pages】:415-424

【Authors】: Chih-Ya Shen ; Liang-Hao Huang ; De-Nian Yang ; Hong-Han Shuai ; Wang-Chien Lee ; Ming-Syan Chen

【Abstract】: Existing research on finding social groups mostly focuses on dense subgraphs in social networks. However, finding socially tenuous groups also has many important applications. In this paper, we introduce the notion of k-triangles to measure the tenuity of a group. We then formulate a new research problem, Minimum k-Triangle Disconnected Group (MkTG), to find a socially tenuous group from online social networks. We prove that MkTG is NP-Hard and inapproximable within any ratio in arbitrary graphs but polynomial-time tractable in threshold graphs. Two algorithms, namely TERA and TERA-ADV, are designed to exploit graph-theoretical approaches for solving MkTG on general graphs effectively and efficiently. Experimental results on seven real datasets manifest that the proposed algorithms outperform existing approaches in both efficiency and solution quality.

【Keywords】: k-triangles; socially tenuous groups

57. PReP: Path-Based Relevance from a Probabilistic Perspective in Heterogeneous Information Networks.

Paper Link】 【Pages】:425-434

【Authors】: Yu Shi ; Po-Wei Chan ; Honglei Zhuang ; Huan Gui ; Jiawei Han

【Abstract】: As a powerful representation paradigm for networked and multi-typed data, the heterogeneous information network (HIN) is ubiquitous. Meanwhile, defining proper relevance measures has always been a fundamental problem and of great pragmatic importance for network mining tasks. Inspired by our probabilistic interpretation of existing path-based relevance measures, we propose to study HIN relevance from a probabilistic perspective. We also identify, from real-world data, and propose to model cross-meta-path synergy, which is a characteristic important for defining path-based HIN relevance and has not been modeled by existing methods. A generative model is established to derive a novel path-based relevance measure, which is data-driven and tailored for each HIN. We develop an inference algorithm to find the maximum a posteriori (MAP) estimate of the model parameters, which entails non-trivial tricks. Experiments on two real-world datasets demonstrate the effectiveness of the proposed model and relevance measure.

【Keywords】: graph mining; heterogeneous information networks; meta-paths; relevance measures

58. Multi-Aspect Streaming Tensor Completion.

Paper Link】 【Pages】:435-443

【Authors】: Qingquan Song ; Xiao Huang ; Hancheng Ge ; James Caverlee ; Xia Hu

【Abstract】: Tensor completion has become an effective computational tool in many real-world data-driven applications. Beyond traditional static setting, with the increasing popularity of high velocity streaming data, it requires efficient online processing without reconstructing the whole model from scratch. Existing work on streaming tensor completion is usually built upon the assumption that tensors only grow in one mode. Unfortunately, the assumption does not hold in many real-world situations in which tensors may grow in multiple modes, i.e., multi-aspect streaming tensors. Efficiently modeling and completing these incremental tensors without sacrificing its effectiveness remains a challenging task due to the uncertainty of tensor mode changes and complex data structure of multi-aspect streaming tensors. To bridge this gap, we propose a Multi-Aspect Streaming Tensor completion framework (MAST) based on CANDECOMP/PARAFAC (CP) decomposition to track the subspace of general incremental tensors for completion. In addition, we investigate a special situation where time is one mode of the tensors, and leverage its extra structure information to improve the general framework towards higher effectiveness. Experimental results on four datasets collected from various real-world applications demonstrate the effectiveness and efficiency of the proposed framework.

【Keywords】: cp decomposition; dynamic tensor analysis; tensor completion

59. Scalable and Sustainable Deep Learning via Randomized Hashing.

Paper Link】 【Pages】:445-454

【Authors】: Ryan Spring ; Anshumali Shrivastava

【Abstract】: Current deep learning architectures are growing larger in order to learn from complex datasets. These architectures require giant matrix multiplication operations to train millions of parameters. Conversely, there is another growing trend to bring deep learning to low-power, embedded devices. The matrix operations, associated with the training and testing of deep networks, are very expensive from a computational and energy standpoint. We present a novel hashing-based technique to drastically reduce the amount of computation needed to train and test neural networks. Our approach combines two recent ideas, Adaptive Dropout and Randomized Hashing for Maximum Inner Product Search (MIPS), to select the nodes with the highest activations efficiently. Our new algorithm for deep learning reduces the overall computational cost of the forward and backward propagation steps by operating on significantly fewer nodes. As a consequence, our algorithm uses only 5% of the total multiplications, while keeping within 1% of the accuracy of the original model on average. A unique property of the proposed hashing-based back-propagation is that the updates are always sparse. Due to the sparse gradient updates, our algorithm is ideally suited for asynchronous, parallel training, leading to near-linear speedup, as the number of cores increases. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations on several datasets.

【Keywords】: deep learning; locality-sensitive hashing; neural networks; parallel computing; randomized algorithms

Paper Link】 【Pages】:455-464

【Authors】: Yukihiro Tagami

【Abstract】: Extreme multi-label classification methods have been widely used in Web-scale classification tasks such as Web page tagging and product recommendation. In this paper, we present a novel graph embedding method called "AnnexML". At the training step, AnnexML constructs a k-nearest neighbor graph of label vectors and attempts to reproduce the graph structure in the embedding space. The prediction is efficiently performed by using an approximate nearest neighbor search method that efficiently explores the learned k-nearest neighbor graph in the embedding space. We conducted evaluations on several large-scale real-world data sets and compared our method with recent state-of-the-art methods. Experimental results show that our AnnexML can significantly improve prediction accuracy, especially on data sets that have larger a label space. In addition, AnnexML improves the trade-off between prediction time and accuracy. At the same level of accuracy, the prediction time of AnnexML was up to 58 times faster than that of SLEEC, which is a state-of-the-art embedding-based method.

【Keywords】: approximate nearest neighbor search; extreme multi-label classification; k-nearest neighbor graph; learning-to-rank

61. Interpretable Predictions of Tree-based Ensembles via Actionable Feature Tweaking.

Paper Link】 【Pages】:465-474

【Authors】: Gabriele Tolomei ; Fabrizio Silvestri ; Andrew Haines ; Mounia Lalmas

【Abstract】: Machine-learned models are often described as "black boxes". In many real-world applications however, models may have to sacrifice predictive power in favour of human-interpretability. When this is the case, feature engineering becomes a crucial task, which requires significant and time-consuming human effort. Whilst some features are inherently static, representing properties that cannot be influenced (e.g., the age of an individual), others capture characteristics that could be adjusted (e.g., the daily amount of carbohydrates taken). Nonetheless, once a model is learned from the data, each prediction it makes on new instances is irreversible - assuming every instance to be a static point located in the chosen feature space. There are many circumstances however where it is important to understand (i) why a model outputs a certain prediction on a given instance, (ii) which adjustable features of that instance should be modified, and finally (iii) how to alter such a prediction when the mutated instance is input back to the model. In this paper, we present a technique that exploits the internals of a tree-based ensemble classifier to offer recommendations for transforming true negative instances into positively predicted ones. We demonstrate the validity of our approach using an online advertising application. First, we design a Random Forest classifier that effectively separates between two types of ads: low (negative) and high (positive) quality ads (instances). Then, we introduce an algorithm that provides recommendations that aim to transform a low quality ad (negative instance) into a high quality one (positive instance). Finally, we evaluate our approach on a subset of the active inventory of a large ad network, Yahoo Gemini.

【Keywords】: actionable feature tweaking; altering model predictions; model interpretability; random forest; recommending feature changes

62. Structural Deep Brain Network Mining.

Paper Link】 【Pages】:475-484

【Authors】: Shen Wang ; Lifang He ; Bokai Cao ; Chun-Ta Lu ; Philip S. Yu ; Ann B. Ragin

【Abstract】: Mining from neuroimaging data is becoming increasingly popular in the field of healthcare and bioinformatics, due to its potential to discover clinically meaningful structure patterns that could facilitate the understanding and diagnosis of neurological and neuropsychiatric disorders. Most recent research concentrates on applying subgraph mining techniques to discover connected subgraph patterns in the brain network. However, the underlying brain network structure is complicated. As a shallow linear model, subgraph mining cannot capture the highly non-linear structures, resulting in sub-optimal patterns. Therefore, how to learn representations that can capture the highly non-linearity of brain networks and preserve the underlying structures is a critical problem. In this paper, we propose a Structural Deep Brain Network mining method, namely SDBN, to learn highly non-linear and structure-preserving representations of brain networks. Specifically, we first introduce a novel graph reordering approach based on module identification, which rearranges the order of the nodes to preserve the modular structure of the graph. Next, we perform structural augmentation to further enhance the spatial information of the reordered graph. Then we propose a deep feature learning framework for combining supervised learning and unsupervised learning in a small-scale setting, by augmenting Convolutional Neural Network (CNN) with decoding pathways for reconstruction. With the help of the multiple layers of non-linear mapping, the proposed SDBN approach can capture the highly non-linear structure of brain networks. Further, it has better generalization capability for high-dimensional brain networks and works well even for small sample learning. Benefit from CNN's task-oriented learning style, the learned hierarchical representation is meaningful for the clinical task. To evaluate the proposed SDBN method, we conduct extensive experiments on four real brain network datasets for disease diagnoses. The experiment results show that SDBN can capture discriminative and meaningful structural graph representations for brain disorder diagnosis.

【Keywords】: brain network; deep learning; graph reordering

63. Randomized Feature Engineering as a Fast and Accurate Alternative to Kernel Methods.

Paper Link】 【Pages】:485-494

【Authors】: Suhang Wang ; Charu C. Aggarwal ; Huan Liu

【Abstract】: Feature engineering has found increasing interest in recent years because of its ability to improve the effectiveness of various machine learning models. Although tailored feature engineering methods have been designed for various domains, there are few that simulate the consistent effectiveness of kernel methods. At the core, the success of kernel methods is achieved by using similarity functions that emphasize local variations in similarity. Unfortunately, this ability comes at the price of the high level of computational resources required and the inflexibility of the representation as it only provides the similarity of two data points instead of vector representations of each data point; while the vector representations can be readily used as input to facilitate various models for different tasks. Furthermore, kernel methods are also highly susceptible to overfitting and noise and it cannot capture the variety of data locality. In this paper, we first analyze the inner working and weaknesses of kernel method, which serves as guidance for designing feature engineering. With the guidance, we explore the use of randomized methods for feature engineering by capturing multi-granular locality of data. This approach has the merit of being time and space efficient for feature construction. Furthermore, the approach is resistant to overfitting and noise because the randomized approach naturally enables fast and robust ensemble methods. Extensive experiments on a number of real world datasets are conducted to show the effectiveness of the approach for various tasks such as clustering, classification and outlier detection.

【Keywords】: randomized feature engineering; unsupervised feature learning

64. Human Mobility Synchronization and Trip Purpose Detection with Mixture of Hawkes Processes.

Paper Link】 【Pages】:495-503

【Authors】: Pengfei Wang ; Yanjie Fu ; Guannan Liu ; Wenqing Hu ; Charu C. Aggarwal

【Abstract】: While exploring human mobility can benefit many applications such as smart transportation, city planning, and urban economics, there are two key questions that need to be answered: (i) What is the nature of the spatial diffusion of human mobility across regions with different urban functions? (ii) How to spot and trace the trip purposes of human mobility trajectories? To answer these questions, we study large-scale and city-wide taxi trajectories; and furtherly organize them as arrival sequences according to the chronological arrival time. We figure out an important property across different regions from the arrival sequences, namely human mobility synchronization effect, which can be exploited to explain the phenomenon that two regions have similar arrival patterns in particular time periods if they share similar urban functions. In addition, the arrival sequences are mixed by arrival events with distinct trip purposes, which can be revealed by the regional environment of both the origins and destinations. To that end, in this paper, we develop a joint model that integrates Mixture of Hawkes Process (MHP) with a hierarchical topic model to capture the arrival sequences with mixed trip purposes. Essentially, the human mobility synchronization effect is encoded as a synchronization rate in the MHP; while the regional environment is modeled by introducing latent Trip Purpose and POI Topic to generate the Point of Interests (POIs) in the regions. Moreover, we provide an effective inference algorithm for parameter learning. Finally, we conduct intensive experiments on synthetic data and real-world data, and the experimental results have demonstrated the effectiveness of the proposed model.

【Keywords】: hawkes process; human mobility; synchronization; trip purpose; variational inference

65. FORA: Simple and Effective Approximate Single-Source Personalized PageRank.

Paper Link】 【Pages】:505-514

【Authors】: Sibo Wang ; Renchi Yang ; Xiaokui Xiao ; Zhewei Wei ; Yin Yang

【Abstract】: Given a graph G, a source node s and a target node t, the personalized PageRank (PPR) of t with respect to s is the probability that a random walk starting from s terminates at t. A single-source PPR (SSPPR) query enumerates all nodes in G, and returns the top-k nodes with the highest PPR values with respect to a given source node s. SSPPR has important applications in web search and social networks, e.g., in Twitter's Who-To-Follow recommendation service. However, SSPPR computation is immensely expensive, and at the same time resistant to indexing and materialization. So far, existing solutions either use heuristics, which do not guarantee result quality, or rely on the strong computing power of modern data centers, which is costly. Motivated by this, we propose FORA, a simple and effective index-based solution for approximate SSPPR processing, with rigorous guarantees on result quality. The basic idea of FORA is to combine two existing methods Forward Push (which is fast but does not guarantee quality) and Monte Carlo Random Walk (accurate but slow) in a simple and yet non-trivial way, leading to an algorithm that is both fast and accurate. Further, FORA includes a simple and effective indexing scheme, as well as a module for top-k selection with high pruning power. Extensive experiments demonstrate that FORA is orders of magnitude more efficient than its main competitors. Notably, on a billion-edge Twitter dataset, FORA answers a top-500 approximate SSPPR query within 5 seconds, using a single commodity server.

【Keywords】: forward push; personalized pagerank; random walk

66. Large-scale Collaborative Ranking in Near-Linear Time.

Paper Link】 【Pages】:515-524

【Authors】: Liwei Wu ; Cho-Jui Hsieh ; James Sharpnack

【Abstract】: In this paper, we consider the Collaborative Ranking (CR) problem for recommendation systems. Given a set of pairwise preferences between items for each user, collaborative ranking can be used to rank un-rated items for each user, and this ranking can be naturally used for recommendation. It is observed that collaborative ranking algorithms usually achieve better performance since they directly minimize the ranking loss; however, they are rarely used in practice due to the poor scalability. All the existing CR algorithms have time complexity at least O(|Ω|r) per iteration, where r is the target rank and |Ω| is number of pairs which grows quadratically with number of ratings per user. For example, the Netflix data contains totally 20 billion rating pairs, and at this scale all the current algorithms have to work with significant subsampling, resulting in poor prediction on testing data. In this paper, we propose a new collaborative ranking algorithm called Primal-CR that reduces the time complexity to O(|Ω|+d1 |d2 r), where d1 is number of users and |d2 is the averaged number of items rated by a user. Note that d1 |d2 is strictly smaller and often much smaller than |Ω|. Furthermore, by exploiting the fact that most data is in the form of numerical ratings instead of pairwise comparisons, we propose Primal-CR++ with O(d1|d2 (r+ log |d2)) time complexity. Both algorithms have better theoretical time complexity than existing approaches and also outperform existing approaches in terms of NDCG and pairwise error on real data sets. To the best of our knowledge, this is the first collaborative ranking algorithm capable of working on the full Netflix dataset using all the 20 billion rating pairs, and this leads to a model with much better recommendation compared with previous models trained on subsamples. Finally, compared with classical matrix factorization algorithm which also requires O(d1d2r) time, our algorithm has almost the same efficiency while making much better recommendations since we consider the ranking loss.

【Keywords】: collaborative ranking; large-scale; recommendation systems

67. HoORaYs: High-order Optimization of Rating Distance for Recommender Systems.

Paper Link】 【Pages】:525-534

【Authors】: Jingwei Xu ; Yuan Yao ; Hanghang Tong ; Xianping Tao ; Jian Lu

【Abstract】: Latent factor models have become a prevalent method in recommender systems, to predict users' preference on items based on the historical user feedback. Most of the existing methods, explicitly or implicitly, are built upon the first-order rating distance principle, which aims to minimize the difference between the estimated and real ratings. In this paper, we generalize such first-order rating distance principle and propose a new latent factor model (HoORaYs) for recommender systems. The core idea of the proposed method is to explore high-order rating distance, which aims to minimize not only (i) the difference between the estimated and real ratings of the same (user, item) pair (i.e., the first-order rating distance), but also (ii) the difference between the estimated and real rating difference of the same user across different items (i.e., the second-order rating distance). We formulate it as a regularized optimization problem, and propose an effective and scalable algorithm to solve it. Our analysis from the geometry and Bayesian perspectives indicate that by exploring the high-order rating distance, it helps to reduce the variance of the estimator, which in turns leads to better generalization performance (e.g., smaller prediction error). We evaluate the proposed method on four real-world data sets, two with explicit user feedback and the other two with implicit user feedback. Experimental results show that the proposed method consistently outperforms the state-of-the-art methods in terms of the prediction accuracy.

【Keywords】: collaborative filtering; high-order distance; latent factor model; recommender systems

68. Collaboratively Improving Topic Discovery and Word Embeddings by Coordinating Global and Local Contexts.

Paper Link】 【Pages】:535-543

【Authors】: Guangxu Xun ; Yaliang Li ; Jing Gao ; Aidong Zhang

【Abstract】: A text corpus typically contains two types of context information -- global context and local context. Global context carries topical information which can be utilized by topic models to discover topic structures from the text corpus, while local context can train word embeddings to capture semantic regularities reflected in the text corpus. This encourages us to exploit the useful information in both the global and the local context information. In this paper, we propose a unified language model based on matrix factorization techniques which 1) takes the complementary global and local context information into consideration simultaneously, and 2) models topics and learns word embeddings collaboratively. We empirically show that by incorporating both global and local context, this collaborative model can not only significantly improve the performance of topic discovery over the baseline topic models, but also learn better word embeddings than the baseline word embedding models. We also provide qualitative analysis that explains how the cooperation of global and local context information can result in better topic structures and word embeddings.

【Keywords】: global context; local context; topic modeling; unified language model; word embeddings

69. PPDsparse: A Parallel Primal-Dual Sparse Method for Extreme Classification.

Paper Link】 【Pages】:545-553

【Authors】: Ian En-Hsu Yen ; Xiangru Huang ; Wei Dai ; Pradeep Ravikumar ; Inderjit S. Dhillon ; Eric P. Xing

【Abstract】: Extreme Classification comprises multi-class or multi-label prediction where there is a large number of classes, and is increasingly relevant to many real-world applications such as text and image tagging. In this setting, standard classification methods, with complexity linear in the number of classes, become intractable, while enforcing structural constraints among classes (such as low-rank or tree-structure) to reduce complexity often sacrifices accuracy for efficiency. The recent PD-Sparse method addresses this via an algorithm that is sub-linear in the number of variables, by exploiting primal-dual sparsity inherent in a specific loss function, namely the max-margin loss. In this work, we extend PD-Sparse to be efficiently parallelized in large-scale distributed settings. By introducing separable loss functions, we can scale out the training, with network communication and space efficiency comparable to those in one-versus-all approaches while maintaining an overall complexity sub-linear in the number of classes. On several large-scale benchmarks our proposed method achieves accuracy competitive to the state-of-the-art while reducing the training time from days to tens of minutes compared with existing parallel or sparse methods on a cluster of 100 cores.

【Keywords】: extreme classification; primal dual method; sparse optimization

70. Local Higher-Order Graph Clustering.

Paper Link】 【Pages】:555-564

【Authors】: Hao Yin ; Austin R. Benson ; Jure Leskovec ; David F. Gleich

【Abstract】: Local graph clustering methods aim to find a cluster of nodes by exploring a small region of the graph. These methods are attractive because they enable targeted clustering around a given seed node and are faster than traditional global graph clustering methods because their runtime does not depend on the size of the input graph. However, current local graph partitioning methods are not designed to account for the higher-order structures crucial to the network, nor can they effectively handle directed networks. Here we introduce a new class of local graph clustering methods that address these issues by incorporating higher-order network information captured by small subgraphs, also called network motifs. We develop the Motif-based Approximate Personalized PageRank (MAPPR) algorithm that finds clusters containing a seed node with minimal \emph{motif conductance}, a generalization of the conductance metric for network motifs. We generalize existing theory to prove the fast running time (independent of the size of the graph) and obtain theoretical guarantees on the cluster quality (in terms of motif conductance). We also develop a theory of node neighborhoods for finding sets that have small motif conductance, and apply these results to the case of finding good seed nodes to use as input to the MAPPR algorithm. Experimental validation on community detection tasks in both synthetic and real-world networks, shows that our new framework MAPPR outperforms the current edge-based personalized PageRank methodology.

【Keywords】: clustering; community detection; higher-order structure; motifs

71. Long Short Memory Process: Modeling Growth Dynamics of Microscopic Social Connectivity.

Paper Link】 【Pages】:565-574

【Authors】: Chengxi Zang ; Peng Cui ; Christos Faloutsos ; Wenwu Zhu

【Abstract】: How do people make friends dynamically in social networks? What are the temporal patterns for an individual increasing its social connectivity? What are the basic mechanisms governing the formation of these temporal patterns? No matter cyber or physical social systems, their structure and dynamics are mainly driven by the connectivity dynamics of each individual. However, due to the lack of empirical data, little is known about the empirical dynamic patterns of social connectivity at microscopic level, let alone the regularities or models governing these microscopic dynamics. We examine the detailed growth process of "WeChat", the largest online social network in China, with 300 million users and 4.75 billion links spanning two years. We uncover a wide range of long-term power law growth and short-term bursty growth for the social connectivity of different users. We propose three key ingredients, namely average-effect, multiscale-effect and correlation-effect, which govern the observed growth patterns at microscopic level. As a result, we propose the long short memory process incorporating these ingredients, demonstrating that it successfully reproduces the complex growth patterns observed in the empirical data. By analyzing modeling parameters, we discover statistical regularities underlying the empirical growth dynamics. Our model and discoveries provide a foundation for the microscopic mechanisms of network growth dynamics, potentially leading to implications for prediction, clustering and outlier detection on human dynamics.

【Keywords】: burst; human dynamics; power law growth; social dynamics; social network

Paper Link】 【Pages】:575-583

【Authors】: Muhan Zhang ; Yixin Chen

【Abstract】: In this paper, we propose a next-generation link prediction method, Weisfeiler-Lehman Neural Machine (WLNM), which learns topological features in the form of graph patterns that promote the formation of links. WLNM has unmatched advantages including higher performance than state-of-the-art methods and universal applicability over various kinds of networks. WLNM extracts an enclosing subgraph of each target link and encodes the subgraph as an adjacency matrix. The key novelty of the encoding comes from a fast hashing-based Weisfeiler-Lehman (WL) algorithm that labels the vertices according to their structural roles in the subgraph while preserving the subgraph's intrinsic directionality. After that, a neural network is trained on these adjacency matrices to learn a predictive model. Compared with traditional link prediction methods, WLNM does not assume a particular link formation mechanism (such as common neighbors), but learns this mechanism from the graph itself. We conduct comprehensive experiments to show that WLNM not only outperforms a great number of state-of-the-art link prediction methods, but also consistently performs well across networks with different characteristics.

【Keywords】: color refinement; graph labeling; link prediction; neural network

73. EmbedJoin: Efficient Edit Similarity Joins via Embeddings.

Paper Link】 【Pages】:585-594

【Authors】: Haoyu Zhang ; Qin Zhang

【Abstract】: We study the problem of edit similarity joins, where given a set of strings and a threshold value K, we want to output all pairs of strings whose edit distances are at most K. Edit similarity join is a fundamental problem in data cleaning/integration, bioinformatics, collaborative filtering and natural language processing, and has been identified as a primitive operator for database systems. This problem has been studied extensively in the literature. However, we have observed that all the existing algorithms fall short on long strings and large distance thresholds. In this paper we propose an algorithm named EmbedJoin which scales very well with string length and distance threshold. Our algorithm is built on the recent advance of metric embeddings for edit distance, and is very different from all of the previous approaches. We demonstrate via an extensive set of experiments that EmbedJoin significantly outperforms the previous best algorithms on long strings and large distance thresholds.

【Keywords】: edit distance; similarity joins

74. TrioVecEvent: Embedding-Based Online Local Event Detection in Geo-Tagged Tweet Streams.

Paper Link】 【Pages】:595-604

【Authors】: Chao Zhang ; Liyuan Liu ; Dongming Lei ; Quan Yuan ; Honglei Zhuang ; Tim Hanratty ; Jiawei Han

【Abstract】: Detecting local events (e.g., protest, disaster) at their onsets is an important task for a wide spectrum of applications, ranging from disaster control to crime monitoring and place recommendation. Recent years have witnessed growing interest in leveraging geo-tagged tweet streams for online local event detection. Nevertheless, the accuracies of existing methods still remain unsatisfactory for building reliable local event detection systems. We propose TrioVecEvent, a method that leverages multimodal embeddings to achieve accurate online local event detection. The effectiveness of TrioVecEvent is underpinned by its two-step detection scheme. First, it ensures a high coverage of the underlying local events by dividing the tweets in the query window into coherent geo-topic clusters. To generate quality geo-topic clusters, we capture short-text semantics by learning multimodal embeddings of the location, time, and text, and then perform online clustering with a novel Bayesian mixture model. Second, TrioVecEvent considers the geo-topic clusters as candidate events and extracts a set of features for classifying the candidates. Leveraging the multimodal embeddings as background knowledge, we introduce discriminative features that can well characterize local events, which enables pinpointing true local events from the candidate pool with a small amount of training data. We have used crowdsourcing to evaluate TrioVecEvent, and found that it improves the performance of the state-of-the-art method by a large margin.

【Keywords】: event detection; multimodal embedding; representation learning; social media; spatiotemporal data mining; topic modeling

75. Graph Edge Partitioning via Neighborhood Heuristic.

Paper Link】 【Pages】:605-614

【Authors】: Chenzi Zhang ; Fan Wei ; Qin Liu ; Zhihao Gavin Tang ; Zhenguo Li

【Abstract】: We consider the edge partitioning problem that partitions the edges of an input graph into multiple balanced components, while minimizing the total number of vertices replicated (one vertex might appear in more than one partition). This problem is critical in minimizing communication costs and running time for several large-scale distributed graph computation platforms (e.g., PowerGraph, Spark GraphX). We first prove that this problem is NP-hard, and then present a new partitioning heuristic with polynomial running time. We provide a worst-case upper bound of replication factor for our heuristic on general graphs. To our knowledge, we are the first to provide such bound for edge partitioning algorithms on general graphs. Applying this bound to random power-law graphs greatly improves the previous bounds of expected replication factor. Extensive experiments demonstrated that our partitioning algorithm consistently produces much smaller replication factors on various benchmark data sets than the state-of-the-art. When deployed in the production graph engine, PowerGraph, in average it reduces replication factor, communication, and running time by 54%, 66%, and 21%, respectively.

【Keywords】: distributed graph mining; graph edge partitioning

76. Randomization or Condensation?: Linear-Cost Matrix Sketching Via Cascaded Compression Sampling.

Paper Link】 【Pages】:615-623

【Authors】: Kai Zhang ; Chuanren Liu ; Jie Zhang ; Hui Xiong ; Eric P. Xing ; Jieping Ye

【Abstract】: Matrix sketching is aimed at finding compact representations of a matrix while simultaneously preserving most of its properties, which is a fundamental building block in modern scientific computing. Randomized algorithms represent state-of-the-art and have attracted huge interest from the fields of machine learning, data mining, and theoretic computer science. However, it still requires the use of the entire input matrix in producing desired factorizations, which can be a major computational and memory bottleneck in truly large problems. In this paper, we uncover an interesting theoretic connection between matrix low-rank decomposition and lossy signal compression, based on which a cascaded compression sampling framework is devised to approximate an m-by-n matrix in only O(m+n) time and space. Indeed, the proposed method accesses only a small number of matrix rows and columns, which significantly improves the memory footprint. Meanwhile, by sequentially teaming two rounds of approximation procedures and upgrading the sampling strategy from a uniform probability to more sophisticated, encoding-orientated sampling, significant algorithmic boosting is achieved to uncover more granular structures in the data. Empirical results on a wide spectrum of real-world, large-scale matrices show that by taking only linear time and space, the accuracy of our method rivals those state-of-the-art randomized algorithms consuming a quadratic, O(mn), amount of resources.

【Keywords】: cascaded compression sampling; low-rank decomposition; matrix sketching; nystrom method; random projection; randomized algorithms

77. Tracking the Dynamics in Crowdfunding.

Paper Link】 【Pages】:625-634

【Authors】: Hongke Zhao ; Hefu Zhang ; Yong Ge ; Qi Liu ; Enhong Chen ; Huayu Li ; Le Wu

【Abstract】: Crowdfunding is an emerging Internet fundraising mechanism by raising monetary contributions from the crowd for projects or ventures. In these platforms, the dynamics, i.e., daily funding amount on campaigns and perks (backing options with rewards), are the most concerned issue for creators, backers and platforms. However, tracking the dynamics in crowdfunding is very challenging and still under-explored. To that end, in this paper, we present a focused study on this important problem. A special goal is to forecast the funding amount for a given campaign and its perks in the future days. Specifically, we formalize the dynamics in crowdfunding as a hierarchical time series, i.e., campaign level and perk level. Specific to each level, we develop a special regression by modeling the decision making process of the crowd (visitors and backing probability) and exploring various factors that impact the decision; on this basis, an enhanced switching regression is proposed at each level to address the heterogeneity of funding sequences. Further, we employ a revision matrix to combine the two-level base forecasts for the final forecasting. We conduct extensive experiments on a real-world crowdfunding data collected from Indiegogo.com. The experimental results clearly demonstrate the effectiveness of our approaches on tracking the dynamics in crowdfunding.

【Keywords】: crowdfunding; dynamics; hierarchical time series

78. Meta-Graph Based Recommendation Fusion over Heterogeneous Information Networks.

Paper Link】 【Pages】:635-644

【Authors】: Huan Zhao ; Quanming Yao ; Jianda Li ; Yangqiu Song ; Dik Lun Lee

【Abstract】: Heterogeneous Information Network (HIN) is a natural and general representation of data in modern large commercial recommender systems which involve heterogeneous types of data. HIN based recommenders face two problems: how to represent the high-level semantics of recommendations and how to fuse the heterogeneous information to make recommendations. In this paper, we solve the two problems by first introducing the concept of meta-graph to HIN-based recommendation, and then solving the information fusion problem with a "matrix factorization (MF) + factorization machine (FM)" approach. For the similarities generated by each meta-graph, we perform standard MF to generate latent features for both users and items. With different meta-graph based features, we propose to use FM with Group lasso (FMG) to automatically learn from the observed ratings to effectively select useful meta-graph based features. Experimental results on two real-world datasets, Amazon and Yelp, show the effectiveness of our approach compared to state-of-the-art FM and other HIN-based recommendation algorithms.

【Keywords】: collaborative filtering; factorization machine; heterogeneous information networks; recommendation system

79. Coresets for Kernel Regression.

Paper Link】 【Pages】:645-654

【Authors】: Yan Zheng ; Jeff M. Phillips

【Abstract】: Kernel regression is an essential and ubiquitous tool for non-parametric data analysis, particularly popular among time series and spatial data. However, the central operation which is performed many times, evaluating a kernel on the data set, takes linear time. This is impractical for modern large data sets. In this paper we describe coresets for kernel regression: compressed data sets which can be used as proxy for the original data and have provably bounded worst case error. The size of the coresets are independent of the raw number of data points; rather they only depend on the error guarantee, and in some cases the size of domain and amount of smoothing. We evaluate our methods on very large time series and spatial data, and demonstrate that they incur negligible error, can be constructed extremely efficiently, and allow for great computational gains.

【Keywords】: coreset; kernel regression; sample complexity

80. A Local Algorithm for Structure-Preserving Graph Cut.

Paper Link】 【Pages】:655-664

【Authors】: Dawei Zhou ; Si Zhang ; Mehmet Yigit Yildirim ; Scott Alcorn ; Hanghang Tong ; Hasan Davulcu ; Jingrui He

【Abstract】: Nowadays, large-scale graph data is being generated in a variety of real-world applications, from social networks to co-authorship networks, from protein-protein interaction networks to road traffic networks. Many existing works on graph mining focus on the vertices and edges, with the first-order Markov chain as the underlying model. They fail to explore the high-order network structures, which are of key importance in many high impact domains. For example, in bank customer personally identifiable information (PII) networks, the star structures often correspond to a set of synthetic identities; in financial transaction networks, the loop structures may indicate the existence of money laundering. In this paper, we focus on mining user-specified high-order network structures and aim to find a structure-rich subgraph which does not break many such structures by separating the subgraph from the rest. A key challenge associated with finding a structure-rich subgraph is the prohibitive computational cost. To address this problem, inspired by the family of local graph clustering algorithms for efficiently identifying a low-conductance cut without exploring the entire graph, we propose to generalize the key idea to model high-order network structures. In particular, we start with a generic definition of high-order conductance, and define the high-order diffusion core, which is based on a high-order random walk induced by user-specified high-order network structure. Then we propose a novel High-Order Structure-Preserving LOcal Cut (HOSPLOC) algorithm, which runs in polylogarithmic time with respect to the number of edges in the graph. It starts with a seed vertex and iteratively explores its neighborhood until a subgraph with a small high-order conductance is found. Furthermore, we analyze its performance in terms of both effectiveness and efficiency. The experimental results on both synthetic graphs and real graphs demonstrate the effectiveness and efficiency of our proposed HOSPLOC algorithm.

【Keywords】: high-order network structure; local clustering algorithm

81. Anomaly Detection with Robust Deep Autoencoders.

Paper Link】 【Pages】:665-674

【Authors】: Chong Zhou ; Randy C. Paffenroth

【Abstract】: Deep autoencoders, and other deep neural networks, have demonstrated their effectiveness in discovering non-linear features across many problem domains. However, in many real-world problems, large outliers and pervasive noise are commonplace, and one may not have access to clean training data as required by standard deep denoising autoencoders. Herein, we demonstrate novel extensions to deep autoencoders which not only maintain a deep autoencoders' ability to discover high quality, non-linear features but can also eliminate outliers and noise without access to any clean training data. Our model is inspired by Robust Principal Component Analysis, and we split the input data X into two parts, $X = L{D} + S$, where $L{D}$ can be effectively reconstructed by a deep autoencoder and $S$ contains the outliers and noise in the original data X. Since such splitting increases the robustness of standard deep autoencoders, we name our model a "Robust Deep Autoencoder (RDA)". Further, we present generalizations of our results to grouped sparsity norms which allow one to distinguish random anomalies from other types of structured corruptions, such as a collection of features being corrupted across many instances or a collection of instances having more corruptions than their fellows. Such "Group Robust Deep Autoencoders (GRDA)" give rise to novel anomaly detection approaches whose superior performance we demonstrate on a selection of benchmark problems.

【Keywords】: anomaly detection; autoencoders; denoising; group robust deep autoencoder; robust deep autoencoders

KDD 2017 Research Papers (Poster Papers) 66

82. Effective Evaluation Using Logged Bandit Feedback from Multiple Loggers.

Paper Link】 【Pages】:687-696

【Authors】: Aman Agarwal ; Soumya Basu ; Tobias Schnabel ; Thorsten Joachims

【Abstract】: Accurately evaluating new policies (e.g. ad-placement models, ranking functions, recommendation functions) is one of the key prerequisites for improving interactive systems. While the conventional approach to evaluation relies on online A/B tests, recent work has shown that counterfactual estimators can provide an inexpensive and fast alternative, since they can be applied offline using log data that was collected from a different policy fielded in the past. In this paper, we address the question of how to estimate the performance of a new target policy when we have log data from multiple historic policies. This question is of great relevance in practice, since policies get updated frequently in most online systems. We show that naively combining data from multiple logging policies can be highly suboptimal. In particular, we find that the standard Inverse Propensity Score (IPS) estimator suffers especially when logging and target policies diverge -- to a point where throwing away data improves the variance of the estimator. We therefore propose two alternative estimators which we characterize theoretically and compare experimentally. We find that the new estimators can provide substantially improved estimation accuracy.

【Keywords】: counterfactual estimators; implicit feedback; log data; off-policy evaluation

83. Tripoles: A New Class of Relationships in Time Series Data.

Paper Link】 【Pages】:697-706

【Authors】: Saurabh Agrawal ; Gowtham Atluri ; Anuj Karpatne ; William Haltom ; Stefan Liess ; Snigdhansu Chatterjee ; Vipin Kumar

【Abstract】: Mining relationships in time series data is of immense interest to several disciplines such as neuroscience, climate science, and transportation. Traditional approaches for mining relationships focus on discovering pair-wise relationships in the data. In this work, we define a novel relationship pattern involving three interacting time series, which we refer to as a tripole. We show that tripoles capture interesting relationship patterns in the data that are not possible to be captured using traditionally studied pair-wise relationships. We demonstrate the utility of tripoles in multiple real-world datasets from various domains including climate science and neuroscience. In particular, our approach is able to discover tripoles that are statistically significant, reproducible across multiple independent data sets, and lead to novel domain insights.

【Keywords】: climate teleconnections; correlation mining; fmri; multivariate linear patterns; spatio-temporal

84. Post Processing Recommender Systems for Diversity.

Paper Link】 【Pages】:707-716

【Authors】: Arda Antikacioglu ; R. Ravi

【Abstract】: Collaborative filtering is a broad and powerful framework for building recommendation systems that has seen widespread adoption. Over the past decade, the propensity of such systems for favoring popular products and thus creating echo chambers have been observed. This has given rise to an active area of research that seeks to diversify recommendations generated by such algorithms. We address the problem of increasing diversity in recom- mendation systems that are based on collaborative filtering that use past ratings to predict a rating quality for potential recommendations. Following our earlier work, we formulate recommendation system design as a subgraph selection problem from a candidate super-graph of potential recommendations where both diversity and rating quality are explicitly optimized: (1) On the modeling side, we define a new flexible notion of diversity that allows a system designer to prescribe the number of recommendations each item should receive, and smoothly penalizes deviations from this distribution. (2) On the algorithmic side, we show that minimum-cost network flow methods yield fast algorithms in theory and practice for designing recommendation subgraphs that optimize this notion of diversity. (3) On the empirical side, we show the effectiveness of our new model and method to increase diversity while maintaining high rating quality in standard rating data sets from Netflix and MovieLens.

【Keywords】: discrepancy; diversity; network flow; recommender systems; subgraph selection

85. Aspect Based Recommendations: Recommending Items with the Most Valuable Aspects Based on User Reviews.

Paper Link】 【Pages】:717-725

【Authors】: Konstantin Bauman ; Bing Liu ; Alexander Tuzhilin

【Abstract】: In this paper, we propose a recommendation technique that not only can recommend items of interest to the user as traditional recommendation systems do but also specific aspects of consumption of the items to further enhance the user experience with those items. For example, it can recommend the user to go to a specific restaurant (item) and also order some specific foods there, e.g., seafood (an aspect of consumption). Our method is called Sentiment Utility Logistic Model (SULM). As its name suggests, SULM uses sentiment analysis of user reviews. It first predicts the sentiment that the user may have about the item based on what he/she might express about the aspects of the item and then identifies the most valuable aspects of the user's potential experience with that item. Furthermore, the method can recommend items together with those most important aspects over which the user has control and can potentially select them, such as the time to go to a restaurant, e.g. lunch vs. dinner, and what to order there, e.g., seafood. We tested the proposed method on three applications (restaurant, hotel, and beauty & spa) and experimentally showed that those users who followed our recommendations of the most valuable aspects while consuming the items, had better experiences, as defined by the overall rating.

【Keywords】: aspects of user experience; recommender systems; sentiment analysis; user experience; user reviews; user-controlled aspects

86. Bolt: Accelerated Data Mining with Fast Vector Compression.

Paper Link】 【Pages】:727-735

【Authors】: Davis W. Blalock ; John V. Guttag

【Abstract】: Vectors of data are at the heart of machine learning and data mining. Recently, vector quantization methods have shown great promise in reducing both the time and space costs of operating on vectors. We introduce a vector quantization algorithm that can compress vectors over 12x faster than existing techniques while also accelerating approximate vector operations such as distance and dot product computations by up to 10x. Because it can encode over 2GB of vectors per second, it makes vector quantization cheap enough to employ in many more circumstances. For example, using our technique to compute approximate dot products in a nested loop can multiply matrices faster than a state-of-the-art BLAS implementation, even when our algorithm must first compress the matrices. In addition to showing the above speedups, we demonstrate that our approach can accelerate nearest neighbor search and maximum inner product search by over 100x compared to floating point operations and 10x compared to other vector quantization methods. Our approximate Euclidean distance and dot product computations are not only faster than those of related algorithms with slower encodings, but also faster than Hamming distance computations, which have direct hardware support on the tested platforms. We also assess the errors of our algorithm's approximate distances and dot products, and find that it is competitive with existing, slower vector quantization algorithms.

【Keywords】: compression; nearest neighbor search; scalability; vector quantization

87. Robust Spectral Clustering for Noisy Data: Modeling Sparse Corruptions Improves Latent Embeddings.

Paper Link】 【Pages】:737-746

【Authors】: Aleksandar Bojchevski ; Yves Matkovic ; Stephan Günnemann

【Abstract】: Spectral clustering is one of the most prominent clustering approaches. However, it is highly sensitive to noisy input data. In this work, we propose a robust spectral clustering technique able to handle such scenarios. To achieve this goal, we propose a sparse and latent decomposition of the similarity graph used in spectral clustering. In our model, we jointly learn the spectral embedding as well as the corrupted data - thus, enhancing the clustering performance overall. We propose algorithmic solutions to all three established variants of spectral clustering, each showing linear complexity in the number of edges. Our experimental analysis confirms the significant potential of our approach for robust spectral clustering. Supplementary material is available at www.kdd.in.tum.de/RSC.

【Keywords】: embedding; graph embedding; latent decomposition; noise; noisy data; robust learning; robustness; sparsity; spectral clustering; spectral learning

88. DeepMood: Modeling Mobile Phone Typing Dynamics for Mood Detection.

Paper Link】 【Pages】:747-755

【Authors】: Bokai Cao ; Lei Zheng ; Chenwei Zhang ; Philip S. Yu ; Andrea Piscitello ; John Zulueta ; Olu Ajilore ; Kelly Ryan ; Alex D. Leow

【Abstract】: The increasing use of electronic forms of communication presents new opportunities in the study of mental health, including the ability to investigate the manifestations of psychiatric diseases unobtrusively and in the setting of patients' daily lives. A pilot study to explore the possible connections between bipolar affective disorder and mobile phone usage was conducted. In this study, participants were provided a mobile phone to use as their primary phone. This phone was loaded with a custom keyboard that collected metadata consisting of keypress entry time and accelerometer movement. Individual character data with the exceptions of the backspace key and space bar were not collected due to privacy concerns. We propose an end-to-end deep architecture based on late fusion, named DeepMood, to model the multi-view metadata for the prediction of mood scores. Experimental results show that 90.31% prediction accuracy on the depression score can be achieved based on session-level mobile phone typing dynamics which is typically less than one minute. It demonstrates the feasibility of using mobile phone metadata to infer mood disturbance and severity.

【Keywords】: bipolar disorder; recurrent network; sequence prediction; typing dynamics

89. Fast Newton Hard Thresholding Pursuit for Sparsity Constrained Nonconvex Optimization.

Paper Link】 【Pages】:757-766

【Authors】: Jinghui Chen ; Quanquan Gu

【Abstract】: We propose a fast Newton hard thresholding pursuit algorithm for sparsity constrained nonconvex optimization. Our proposed algorithm reduces the per-iteration time complexity to linear in the data dimension d compared with cubic time complexity in Newton's method, while preserving faster computational and statistical convergence rates. In particular, we prove that the proposed algorithm converges to the unknown sparse model parameter at a composite rate, namely quadratic at first and linear when it gets close to the true parameter, up to the minimax optimal statistical precision of the underlying model. Thorough experiments on both synthetic and real datasets demonstrate that our algorithm outperforms the state-of-the-art optimization algorithms for sparsity constrained optimization.

【Keywords】: newton's method; nonconvex optimization; sparse learning; sparsity constrained optimization

90. On Sampling Strategies for Neural Network-based Collaborative Filtering.

Paper Link】 【Pages】:767-776

【Authors】: Ting Chen ; Yizhou Sun ; Yue Shi ; Liangjie Hong

【Abstract】: Recent advances in neural networks have inspired people to design hybrid recommendation algorithms that can incorporate both (1) user-item interaction information and (2) content information including image, audio, and text. Despite their promising results, neural network-based recommendation algorithms pose extensive computational costs, making it challenging to scale and improve upon. In this paper, we propose a general neural network-based recommendation framework, which subsumes several existing state-of-the-art recommendation algorithms, and address the efficiency issue by investigating sampling strategies in the stochastic gradient descent training for the framework. We tackle this issue by first establishing a connection between the loss functions and the user-item interaction bipartite graph, where the loss function terms are defined on links while major computation burdens are located at nodes. We call this type of loss functions "graph-based" loss functions, for which varied mini-batch sampling strategies can have different computational costs. Based on the insight, three novel sampling strategies are proposed, which can significantly improve the training efficiency of the proposed framework (up to $\times 30$ times speedup in our experiments), as well as improving the recommendation performance. Theoretical analysis is also provided for both the computational cost and the convergence. We believe the study of sampling strategies have further implications on general graph-based loss functions, and would also enable more research under the neural network-based recommendation framework.

【Keywords】: collaborative filtering; neural networks; sampling strategies

91. Unsupervised Feature Selection in Signed Social Networks.

Paper Link】 【Pages】:777-786

【Authors】: Kewei Cheng ; Jundong Li ; Huan Liu

【Abstract】: The rapid growth of social media services brings a large amount of high-dimensional social media data at an unprecedented rate. Feature selection is powerful to prepare high-dimensional data by finding a subset of relevant features. A vast majority of existing feature selection algorithms for social media data exclusively focus on positive interactions among linked instances such as friendships and user following relations. However, in many real-world social networks, instances may also be negatively interconnected. Recent work shows that negative links have an added value over positive links in advancing many learning tasks. In this paper, we study a novel problem of unsupervised feature selection in signed social networks and propose a novel framework SignedFS. In particular, we provide a principled way to model positive and negative links for user latent representation learning. Then we embed the user latent representations into feature selection when label information is not available. Also, we revisit the principle of homophily and balance theory in signed social networks and incorporate the signed graph regularization into the feature selection framework to capture the first-order and the second-order proximity among users in signed social networks. Experiments on two real-world signed social networks demonstrate the effectiveness of our proposed framework. Further experiments are conducted to understand the impacts of different components of SignedFS.

【Keywords】: feature selection; signed social networks; unsupervised learning

92. GRAM: Graph-based Attention Model for Healthcare Representation Learning.

Paper Link】 【Pages】:787-795

【Authors】: Edward Choi ; Mohammad Taha Bahadori ; Le Song ; Walter F. Stewart ; Jimeng Sun

【Abstract】: Deep learning methods exhibit promising performance for predictive modeling in healthcare, but two important challenges remain: - Data insufficiency: Often in healthcare predictive modeling, the sample size is insufficient for deep learning methods to achieve satisfactory results. - Interpretation: The representations learned by deep learning methods should align with medical knowledge. To address these challenges, we propose GRaph-based Attention Model (GRAM) that supplements electronic health records (EHR) with hierarchical information inherent to medical ontologies. Based on the data volume and the ontology structure, GRAM represents a medical concept as a combination of its ancestors in the ontology via an attention mechanism. We compared predictive performance (i.e. accuracy, data needs, interpretability) of GRAM to various methods including the recurrent neural network (RNN) in two sequential diagnoses prediction tasks and one heart failure prediction task. Compared to the basic RNN, GRAM achieved 10% higher accuracy for predicting diseases rarely observed in the training data and 3% improved area under the ROC curve for predicting heart failure using an order of magnitude less training data. Additionally, unlike other methods, the medical concept representations learned by GRAM are well aligned with the medical ontology. Finally, GRAM exhibits intuitive attention behaviors by adaptively generalizing to higher level concepts when facing data insufficiency at the lower level concepts.

【Keywords】: attention model; electronic health records; graph; predictive healthcare

93. Algorithmic Decision Making and the Cost of Fairness.

Paper Link】 【Pages】:797-806

【Authors】: Sam Corbett-Davies ; Emma Pierson ; Avi Feller ; Sharad Goel ; Aziz Huq

【Abstract】: Algorithms are now regularly used to decide whether defendants awaiting trial are too dangerous to be released back into the community. In some cases, black defendants are substantially more likely than white defendants to be incorrectly classified as high risk. To mitigate such disparities, several techniques have recently been proposed to achieve algorithmic fairness. Here we reformulate algorithmic fairness as constrained optimization: the objective is to maximize public safety while satisfying formal fairness constraints designed to reduce racial disparities. We show that for several past definitions of fairness, the optimal algorithms that result require detaining defendants above race-specific risk thresholds. We further show that the optimal unconstrained algorithm requires applying a single, uniform threshold to all defendants. The unconstrained algorithm thus maximizes public safety while also satisfying one important understanding of equality: that all individuals are held to the same standard, irrespective of race. Because the optimal constrained and unconstrained algorithms generally differ, there is tension between improving public safety and satisfying prevailing notions of algorithmic fairness. By examining data from Broward County, Florida, we show that this trade-off can be large in practice. We focus on algorithms for pretrial release decisions, but the principles we discuss apply to other domains, and also to human decision makers carrying out structured decision rules.

【Keywords】: algorithmic fairness; discrimination; disparate impact; pretrial detention; risk assessment

94. Structural Diversity and Homophily: A Study Across More Than One Hundred Big Networks.

Paper Link】 【Pages】:807-816

【Authors】: Yuxiao Dong ; Reid A. Johnson ; Jian Xu ; Nitesh V. Chawla

【Abstract】: A widely recognized organizing principle of networks is structural homophily, which suggests that people with more common neighbors are more likely to connect with each other. However, what influence the diverse structures embedded in common neighbors have on link formation is much less well-understood. To explore this problem, we begin by characterizing the structural diversity of common neighborhoods. Using a collection of 120 large-scale networks, we demonstrate that the impact of the common neighborhood diversity on link existence can vary substantially across networks. We find that its positive effect on Facebook and negative effect on LinkedIn suggest different underlying networking needs in these networks. We also discover striking cases where diversity violates the principle of homophily---that is, where fewer mutual connections may lead to a higher tendency to link with each other. We then leverage structural diversity to develop a common neighborhood signature (CNS), which we apply to a large set of networks to uncover unique network superfamilies not discoverable by conventional methods. Our findings shed light on the pursuit to understand the ways in which network structures are organized and formed, pointing to potential advancement in designing graph generation models and recommender systems.

【Keywords】: big data; embeddedness; link prediction; network diversity; network signature; social networks; triadic closure

95. Revisiting Power-law Distributions in Spectra of Real World Networks.

Paper Link】 【Pages】:817-826

【Authors】: Nicole Eikmeier ; David F. Gleich

【Abstract】: By studying a large number of real world graphs, we find empirical evidence that most real world graphs have a statistically significant power-law distribution with a cutoff in the singular values of the adjacency matrix and eigenvalues of the Laplacian matrix in addition to the commonly conjectured power-law in the degrees. Among these results, power-laws in the singular values appear more consistently than in the degree distribution. The exponents of the power-law distributions are much larger than previously observed. We find a surprising direct relationship between the power-law in the degree distribution and the power-law in the eigenvalues of the Laplacian that was theorized in simple models but is extremely accurate in practice. We investigate these findings in large networks by studying the cutoff value itself, which shows a scaling law for the number of elements involved in these power-laws. Using the scaling law enables us to compute only a subset of eigenvalues of large networks, up to tens of millions of vertices and billions of edges, where we find that those too show evidence of statistically significant power-laws.

【Keywords】: degree distributions; graph spectrum; power-law distributions; real world networks

96. REMIX: Automated Exploration for Interactive Outlier Detection.

Paper Link】 【Pages】:827-835

【Authors】: Yanjie Fu ; Charu C. Aggarwal ; Srinivasan Parthasarathy ; Deepak S. Turaga ; Hui Xiong

【Abstract】: Outlier detection is the identification of points in a dataset that do not conform to the norm. Outlier detection is highly sensitive to the choice of the detection algorithm and the feature subspace used by the algorithm. Extracting domain-relevant insights from outliers needs systematic exploration of these choices since diverse outlier sets could lead to complementary insights. This challenge is especially acute in an interactive setting, where the choices must be explored in a time-constrained manner. In this work, we present REMIX, the first system to address the problem of outlier detection in an interactive setting. REMIX uses a novel mixed integer programming (MIP) formulation for automatically selecting and executing a diverse set of outlier detectors within a time limit. This formulation incorporates multiple aspects such as (i) an upper limit on the total execution time of detectors (ii) diversity in the space of algorithms and features, and (iii) meta-learning for evaluating the cost and utility of detectors. REMIX provides two distinct ways for the analyst to consume its results: (i) a partitioning of the detectors explored by REMIX into perspectives through low-rank non-negative matrix factorization; each perspective can be easily visualized as an intuitive heatmap of experiments versus outliers, and (ii) an ensembled set of outliers which combines outlier scores from all detectors. We demonstrate the benefits of REMIX through extensive empirical validation on real-world data.

【Keywords】: ensemble; exploration; interactive; matrix factorization; outlier detection; visualization

97. Anarchists, Unite: Practical Entropy Approximation for Distributed Streams.

Paper Link】 【Pages】:837-846

【Authors】: Moshe Gabel ; Daniel Keren ; Assaf Schuster

【Abstract】: Entropy is a fundamental property of data and a key metric in many scientific and engineering fields. Entropy estimation has been extensively studied, but almost always under the assumption that there is a single data stream, seen in its entirety by one node running the estimation algorithm. Multiple distributed data sources are becoming increasingly common, however, with applications in signal processing, computer science, medicine, physics, and more. Centralizing all data can be infeasible, for example in networks of battery or bandwidth limited sensors, so entropy estimation in distributed streams requires new, communication-efficient approaches. We propose a practical communication-efficient algorithm for continuously approximating the entropy of distributed streams, with deterministic, user-defined error bounds. Unlike previous streaming methods, it supports deletions and variable-sized time-based sliding windows, while still avoiding communication when possible. Moreover, it optionally incorporates a state-of-the-art entropy sketch, allowing for both bandwidth reduction and monitoring very high dimensional problems. Finally, it provides the approximation to all nodes, rather than to a centralized location, which is important in settings such as wireless sensor networks. Evaluation on several public datasets from real application domains shows that our adaptive algorithm can often reduce the number of messages by two orders of magnitude, compared to centralizing all data in one node.

【Keywords】: data mining; distributed streams; entropy estimation

98. Recurrent Poisson Factorization for Temporal Recommendation.

Paper Link】 【Pages】:847-855

【Authors】: Seyyed Abbas Hosseini ; Keivan Alizadeh ; Ali Khodadadi ; Ali Arabzadeh ; Mehrdad Farajtabar ; Hongyuan Zha ; Hamid R. Rabiee

【Abstract】: Poisson factorization is a probabilistic model of users and items for recommendation systems, where the so-called implicit consumer data is modeled by a factorized Poisson distribution. There are many variants of Poisson factorization methods who show state-of-the-art performance on real-world recommendation tasks. However, most of them do not explicitly take into account the temporal behavior and the recurrent activities of users which is essential to recommend the right item to the right user at the right time. In this paper, we introduce Recurrent Poisson Factorization (RPF) framework that generalizes the classical PF methods by utilizing a Poisson process for modeling the implicit feedback. RPF treats time as a natural constituent of the model and brings to the table a rich family of time-sensitive factorization models. To elaborate, we instantiate several variants of RPF who are capable of handling dynamic user preferences and item specification (DRPF), modeling the social-aspect of product adoption (SRPF), and capturing the consumption heterogeneity among users and items (HRPF). We also develop a variational algorithm for approximate posterior inference that scales up to massive data sets. Furthermore, we demonstrate RPF's superior performance over many state-of-the-art methods on synthetic dataset, and large scale real-world datasets on music streaming logs, and user-item interactions in M-Commerce platforms.

【Keywords】: poisson factorization; poisson process; temporal recommender system

99. SPOT: Sparse Optimal Transformations for High Dimensional Variable Selection and Exploratory Regression Analysis.

Paper Link】 【Pages】:857-865

【Authors】: Qiming Huang ; Michael Zhu

【Abstract】: We develop a novel method called SParse Optimal Transformations (SPOT) to simultaneously select important variables and explore relationships between the response and predictor variables in high dimensional nonparametric regression analysis. Not only are the optimal transformations identified by SPOT interpretable, they can also be used for response prediction. We further show that SPOT achieves consistency in both variable selection and parameter estimation. Numerical experiments and real data applications demonstrate that SPOT outperforms other existing methods and can serve as an effective tool in practice.

【Keywords】: monotone transformation; optimal transformation; regression analysis; spline; variable selection

100. Incremental Dual-memory LSTM in Land Cover Prediction.

Paper Link】 【Pages】:867-876

【Authors】: Xiaowei Jia ; Ankush Khandelwal ; Guruprasad Nayak ; James Gerber ; Kimberly Carlson ; Paul C. West ; Vipin Kumar

【Abstract】: Land cover prediction is essential for monitoring global environmental change. Unfortunately, traditional classification models are plagued by temporal variation and emergence of novel/unseen land cover classes in the prediction process. In this paper, we propose an LSTM-based spatio-temporal learning framework with a dual-memory structure. The dual-memory structure captures both long-term and short-term temporal variation patterns, and is updated incrementally to adapt the model to the ever-changing environment. Moreover, we integrate zero-shot learning to identify unseen classes even without labelled samples. Experiments on both synthetic and real-world datasets demonstrate the superiority of the proposed framework over multiple baselines in land cover prediction.

【Keywords】: land cover; lstm; zero-short learning

101. MetaPAD: Meta Pattern Discovery from Massive Text Corpora.

Paper Link】 【Pages】:877-886

【Authors】: Meng Jiang ; Jingbo Shang ; Taylor Cassidy ; Xiang Ren ; Lance M. Kaplan ; Timothy P. Hanratty ; Jiawei Han

【Abstract】: Mining textual patterns in news, tweets, papers, and many other kinds of text corpora has been an active theme in text mining and NLP research. Previous studies adopt a dependency parsing-based pattern discovery approach. However, the parsing results lose rich context around entities in the patterns, and the process is costly for a corpus of large scale. In this study, we propose a novel typed textual pattern structure, called meta pattern, which is extended to a frequent, informative, and precise subsequence pattern in certain context. We propose an efficient framework, called MetaPAD, which discovers meta patterns from massive corpora with three techniques: (1) it develops a context-aware segmentation method to carefully determine the boundaries of patterns with a learnt pattern quality assessment function, which avoids costly dependency parsing and generates high-quality patterns; (2) it identifies and groups synonymous meta patterns from multiple facets---their types, contexts, and extractions; and (3) it examines type distributions of entities in the instances extracted by each group of patterns, and looks for appropriate type levels to make discovered patterns precise. Experiments demonstrate that our proposed framework discovers high-quality typed textual patterns efficiently from different genres of massive corpora and facilitates information extraction.

【Keywords】: attribute discovery; context-aware segmentation; information extraction; meta pattern; natural language processing; pattern discovery; synonymous meta pattern; text corpora; text mining; textual pattern

102. Federated Tensor Factorization for Computational Phenotyping.

Paper Link】 【Pages】:887-895

【Authors】: Yejin Kim ; Jimeng Sun ; Hwanjo Yu ; Xiaoqian Jiang

【Abstract】: Tensor factorization models offer an effective approach to convert massive electronic health records into meaningful clinical concepts (phenotypes) for data analysis. These models need a large amount of diverse samples to avoid population bias. An open challenge is how to derive phenotypes jointly across multiple hospitals, in which direct patient-level data sharing is not possible (e.g., due to institutional policies). In this paper, we developed a novel solution to enable federated tensor factorization for computational phenotyping without sharing patient-level data. We developed secure data harmonization and federated computation procedures based on alternating direction method of multipliers (ADMM). Using this method, the multiple hospitals iteratively update tensors and transfer secure summarized information to a central server, and the server aggregates the information to generate phenotypes. We demonstrated with real medical datasets that our method resembles the centralized training model (based on combined datasets) in terms of accuracy and phenotypes discovery while respecting privacy.

【Keywords】: admm; federated approach; phenotype

103. Statistical Emerging Pattern Mining with Multiple Testing Correction.

Paper Link】 【Pages】:897-906

【Authors】: Junpei Komiyama ; Masakazu Ishihata ; Hiroki Arimura ; Takashi Nishibayashi ; Shin-ichi Minato

【Abstract】: Emerging patterns are patterns whose support significantly differs between two databases. We study the problem of listing emerging patterns with a multiple testing guarantee. Recently, Terada et al., proposed the Limitless Arity Multiple-testing Procedure (LAMP) that controls the family-wise error rate (FWER) in statistical association mining. LAMP reduces the number of "untestable" hypotheses without compromising its statistical power. Still, FWER is restrictive, and as a result, its statistical power is inherently unsatisfying when the number of patterns is large. On the other hand, the false discovery rate (FDR) is less restrictive than FWER, and thus controlling FDR yields a larger number of significant patterns. We propose two emerging pattern mining methods: the first one controls FWER, and the second one controls FDR. The effectiveness of the methods is verified in computer simulations with real-world datasets.

【Keywords】: emerging pattern mining; multiple testing; statistical pattern mining

104. Semi-Supervised Techniques for Mining Learning Outcomes and Prerequisites.

Paper Link】 【Pages】:907-915

【Authors】: Igor Labutov ; Yun Huang ; Peter Brusilovsky ; Daqing He

【Abstract】: Educational content of today no longer only resides in textbooks and classrooms; more and more learning material is found in a free, accessible form on the Internet. Our long-standing vision is to transform this web of educational content into an adaptive, web-scale "textbook", that can guide its readers to most relevant "pages" according to their learning goal and current knowledge. In this paper, we address one core, long-standing problem towards this goal: identifying outcome and prerequisite concepts within a piece of educational content (e.g., a tutorial). Specifically, we propose a novel approach that leverages textbooks as a source of distant supervision, but learns a model that can generalize to arbitrary documents (such as those on the web). As such, our model can take advantage of any existing textbook, without requiring expert annotation. At the task of predicting outcome and prerequisite concepts, we demonstrate improvements over a number of baselines on six textbooks, especially in the regime of little to no ground-truth labels available. Finally, we demonstrate the utility of a model learned using our approach at the task of identifying prerequisite documents for adaptive content recommendation --- an important step towards our vision of the "web as a textbook".

【Keywords】: adaptive hypermedia; educational data mining; graphical models; open corpus adaptive hypermedia; semi-supervised machine learning

105. Prospecting the Career Development of Talents: A Survival Analysis Perspective.

Paper Link】 【Pages】:917-925

【Authors】: Huayu Li ; Yong Ge ; Hengshu Zhu ; Hui Xiong ; Hongke Zhao

【Abstract】: The study of career development has become more important during a time of rising competition. Even with the help of newly available big data in the field of human resources, it is challenging to prospect the career development of talents in an effective manner, since the nature and structure of talent careers can change quickly. To this end, in this paper, we propose a novel survival analysis approach to model the talent career paths, with a focus on two critical issues in talent management, namely turnover and career progression. Specifically, for modeling the talent turnover behaviors, we formulate the prediction of survival status at a sequence of time intervals as a multi-task learning problem by considering the prediction at each time interval as a task. Also, we impose the ranking constraints to model both censored and uncensored data, and capture the intrinsic properties exhibited in general lifetime modeling with non-recurrent and recurrent events. Similarly, for modeling the talent career progression, each task concerns the prediction of a relative occupational level at each time interval. The ranking constraints imposed on different occupational levels can help to reduce the prediction error. Finally, we evaluate our approach with several state-of-the-art baseline methods on real-world talent data. The experimental results clearly demonstrate the effectiveness of the proposed models for predicting the turnover and career progression of talents.

【Keywords】: career development; career path modeling; multi-task learning; ranking; survival analysis

106. A Context-aware Attention Network for Interactive Question Answering.

Paper Link】 【Pages】:927-935

【Authors】: Huayu Li ; Martin Renqiang Min ; Yong Ge ; Asim Kadav

【Abstract】: Neural network based sequence-to-sequence models in an encoder-decoder framework have been successfully applied to solve Question Answering (QA) problems, predicting answers from statements and questions. However, almost all previous models have failed to consider detailed context information and unknown states under which systems do not have enough information to answer given questions. These scenarios with incomplete or ambiguous information are very common in the setting of Interactive Question Answering (IQA). To address this challenge, we develop a novel model, employing context-dependent word-level attention for more accurate statement representations and question-guided sentence-level attention for better context modeling. We also generate unique IQA datasets to test our model, which will be made publicly available. Employing these attention mechanisms, our model accurately understands when it can output an answer or when it requires generating a supplementary question for additional input depending on different contexts. When available, user's feedback is encoded and directly applied to update sentence-level attention to infer an answer. Extensive experiments on QA and IQA datasets quantitatively demonstrate the effectiveness of our model with significant improvement over state-of-the-art conventional QA models.

【Keywords】: attention; interactive question answering; question answering; recurrent neural network

107. Distributed Multi-Task Relationship Learning.

Paper Link】 【Pages】:937-946

【Authors】: Sulin Liu ; Sinno Jialin Pan ; Qirong Ho

【Abstract】: Multi-task learning aims to learn multiple tasks jointly by exploiting their relatedness to improve the generalization performance for each task. Traditionally, to perform multi-task learning, one needs to centralize data from all the tasks to a single machine. However, in many real-world applications, data of different tasks may be geo-distributed over different local machines. Due to heavy communication caused by transmitting the data and the issue of data privacy and security, it is impossible to send data of different task to a master machine to perform multi-task learning. Therefore, in this paper, we propose a distributed multi-task learning framework that simultaneously learns predictive models for each task as well as task relationships between tasks alternatingly in the parameter server paradigm. In our framework, we first offer a general dual form for a family of regularized multi-task relationship learning methods. Subsequently, we propose a communication-efficient primal-dual distributed optimization algorithm to solve the dual problem by carefully designing local subproblems to make the dual problem decomposable. Moreover, we provide a theoretical convergence analysis for the proposed algorithm, which is specific for distributed multi-task relationship learning. We conduct extensive experiments on both synthetic and real-world datasets to evaluate our proposed framework in terms of effectiveness and convergence.

【Keywords】: distributed multi-task learning; transfer learning

108. Point-of-Interest Demand Modeling with Human Mobility Patterns.

Paper Link】 【Pages】:947-955

【Authors】: Yanchi Liu ; Chuanren Liu ; Xinjiang Lu ; Mingfei Teng ; Hengshu Zhu ; Hui Xiong

【Abstract】: Point-of-Interest (POI) demand modeling in urban regions is critical for many applications such as business site selection and real estate investment. While some efforts have been made for the demand analysis of some specific POI categories, such as restaurants, it lacks systematic means to support POI demand modeling. To this end, in this paper, we develop a systematic POI demand modeling framework, named Region POI Demand Identification (RPDI), to model POI demands by exploiting the daily needs of people identified from their large-scale mobility data. Specifically, we first partition the urban space into spatially differentiated neighborhood regions formed by many small local communities. Then, the daily activity patterns of people traveling in the city will be extracted from human mobility data. Since the trip activities, even aggregated, are sparse and insufficient to directly identify the POI demands, especially for underdeveloped regions, we develop a latent factor model that integrates human mobility data, POI profiles, and demographic data to robustly model the POI demand of urban regions in a holistic way. In this model, POI preferences and supplies are used together with demographic features to estimate the POI demands simultaneously for all the urban regions interconnected in the city. Moreover, we also design efficient algorithms to optimize the latent model for large-scale data. Finally, experimental results on real-world data in New York City (NYC) show that our method is effective for identifying POI demands for different regions.

【Keywords】: human mobility; point-of-interest; region demand

109. Functional Zone Based Hierarchical Demand Prediction For Bike System Expansion.

Paper Link】 【Pages】:957-966

【Authors】: Junming Liu ; Leilei Sun ; Qiao Li ; Jingci Ming ; Yanchi Liu ; Hui Xiong

【Abstract】: Bike sharing systems, aiming at providing the missing links in public transportation systems, are becoming popular in urban cities. Many providers of bike sharing systems are ready to expand their bike stations from the existing service area to surrounding regions. A key to success for a bike sharing systems expansion is the bike demand prediction for expansion areas. There are two major challenges in this demand prediction problem: First. the bike transition records are not available for the expansion area and second. station level bike demand have big variances across the urban city. Previous research efforts mainly focus on discovering global features, assuming the station bike demands react equally to the global features, which brings large prediction error when the urban area is large and highly diversified. To address these challenges, in this paper, we develop a hierarchical station bike demand predictor which analyzes bike demands from functional zone level to station level. Specifically, we first divide the studied bike stations into functional zones by a novel Bi-clustering algorithm which is designed to cluster bike stations with similar POI characteristics and close geographical distances together. Then, the hourly bike check-ins and check-outs of functional zones are predicted by integrating three influential factors: distance preference, zone-to-zone preference, and zone characteristics. The station demand is estimated by studying the demand distributions among the stations within the same functional zone. Finally, the extensive experimental results on the NYC Citi Bike system with two expansion stages show the advantages of our approach on station demand and balance prediction for bike sharing system expansions.

【Keywords】: bike sharing system; clustering; demand prediction

110. Unsupervised Discovery of Drug Side-Effects from Heterogeneous Data Sources.

Paper Link】 【Pages】:967-976

【Authors】: Fenglong Ma ; Chuishi Meng ; Houping Xiao ; Qi Li ; Jing Gao ; Lu Su ; Aidong Zhang

【Abstract】: Drug side-effects become a worldwide public health concern, which are the fourth leading cause of death in the United States. Pharmaceutical industry has paid tremendous effort to identify drug side-effects during the drug development. However, it is impossible and impractical to identify all of them. Fortunately, drug side-effects can also be reported on heterogeneous platforms (i.e., data sources), such as FDA Adverse Event Reporting System and various online communities. However, existing supervised and semi-supervised approaches are not practical as annotating labels are expensive in the medical field. In this paper, we propose a novel and effective unsupervised model Sifter to automatically discover drug side-effects. Sifter enhances the estimation on drug side-effects by learning from various online platforms and measuring platform-level and user-level quality simultaneously. In this way, Sifter demonstrates better performance compared with existing approaches in terms of correctly identifying drug side-effects. Experimental results on five real-world datasets show that Sifter can significantly improve the performance of identifying side-effects compared with the state-of-the-art approaches.

【Keywords】: drug side-effects; healthcare informatics; probabilistic graphical model; truth discovery

111. Let's See Your Digits: Anomalous-State Detection using Benford's Law.

Paper Link】 【Pages】:977-986

【Authors】: Samuel Maurus ; Claudia Plant

【Abstract】: Benford's Law explains a curious phenomenon in which the leading digits of "naturally-occurring" numerical data are distributed in a precise fashion. In this paper we begin by showing that system metrics generated by many modern information systems like Twitter, Wikipedia, YouTube and GitHub obey this law. We then propose a novel unsupervised approach called BenFound that exploits this property to detect anomalous system events. BenFound tracks the "Benfordness" of key system metrics, like the follower counts of tweeting Twitter users or the change deltas in Wikipedia page edits. It then applies a novel Benford-conformity test in real-time to identify "non-Benford events". We investigate a variety of such events, showing that they correspond to unnatural and often undesirable system interactions like spamming, hashtag-hijacking and denial-of-service attacks. The result is a technically-uncomplicated and effective "red flagging" technique that can be used to complement existing anomaly-detection approaches. Although not without its limitations, it is highly efficient and requires neither obscure parameters, nor text streams, nor natural-language processing.

【Keywords】: anomaly detection; benford's law; data streams; nonparametric statistical tests; time series data

112. Mixture Factorized Ornstein-Uhlenbeck Processes for Time-Series Forecasting.

Paper Link】 【Pages】:987-995

【Authors】: Guo-Jun Qi ; Jiliang Tang ; Jingdong Wang ; Jiebo Luo

【Abstract】: Forecasting the future observations of time-series data can be performed by modeling the trend and fluctuations from the observed data. Many classical time-series analysis models like Autoregressive model (AR) and its variants have been developed to achieve such forecasting ability. While they are often based on the white noise assumption to model the data fluctuations, a more general Brownian motion has been adopted that results in Ornstein-Uhlenbeck (OU) process. The OU process has gained huge successes in predicting the future observations over many genres of time series, however, it is still limited in modeling simple diffusion dynamics driven by a single persistent factor that never evolves over time. However, in many real problems, a mixture of hidden factors are usually present, and when and how frequently they appear or disappear are unknown ahead of time. This imposes a challenge that inspires us to develop a Mixture Factorized OU process (MFOUP) to model evolving factors. The new model is able to capture the changing states of multiple mixed hidden factors, from which we can infer their roles in driving the movements of time series. We conduct experiments on three forecasting problems, covering sensor and market data streams. The results show its competitive performance on predicting future observations and capturing evolution patterns of hidden factors as compared with the other algorithms.

【Keywords】: mixture factorized ornstein-uhlenbeck process (mfoup)

113. Automatic Synonym Discovery with Knowledge Bases.

Paper Link】 【Pages】:997-1005

【Authors】: Meng Qu ; Xiang Ren ; Jiawei Han

【Abstract】: Recognizing entity synonyms from text has become a crucial task in many entity-leveraging applications. However, discovering entity synonyms from domain-specific text corpora (e.g., news articles, scientific papers) is rather challenging. Current systems take an entity name string as input to find out other names that are synonymous, ignoring the fact that often times a name string can refer to multiple entities (e.g., "apple" could refer to both Apple Inc. and the fruit apple). Moreover, most existing methods require training data manually created by domain experts to construct supervised-learning systems. In this paper, we study the problem of automatic synonym discovery with knowledge bases, that is, identifying synonyms for knowledge base entities in a given domain-specific corpus. The manually-curated synonyms for each entity stored in a knowledge base not only form a set of name strings to disambiguate the meaning for each other, but also can serve as "distant" supervision to help determine important features for the task. We propose a novel framework, called DPE, to integrate two kinds of mutually-complementing signals for synonym discovery, i.e., distributional features based on corpus-level statistics and textual patterns based on local contexts. In particular, DPE jointly optimizes the two kinds of signals in conjunction with distant supervision, so that they can mutually enhance each other in the training stage. At the inference stage, both signals will be utilized to discover synonyms for the given entities. Experimental results prove the effectiveness of the proposed framework.

【Keywords】: distant supervision; distributional based approach; knowledge base; pattern based approach; synonym discovery

114. An Alternative to NCD for Large Sequences, Lempel-Ziv Jaccard Distance.

Paper Link】 【Pages】:1007-1015

【Authors】: Edward Raff ; Charles K. Nicholas

【Abstract】: The Normalized Compression Distance (NCD) has been used in a number of domains to compare objects with varying feature types. This flexibility comes from the use of general purpose compression algorithms as the means of computing distances between byte sequences. Such flexibility makes NCD particularly attractive for cases where the right features to use are not obvious, such as malware classification. However, NCD can be computationally demanding, thereby restricting the scale at which it can be applied. We introduce an alternative metric also inspired by compression, the Lempel-Ziv Jaccard Distance (LZJD). We show that this new distance has desirable theoretical properties, as well as comparable or superior performance for malware classification, while being easy to implement and orders of magnitude faster in practice.

【Keywords】: cyber security; jaccard similarity; lempel-ziv; malware classification; normalized compression distance

115. Inferring the Strength of Social Ties: A Community-Driven Approach.

Paper Link】 【Pages】:1017-1025

【Authors】: Polina Rozenshtein ; Nikolaj Tatti ; Aristides Gionis

【Abstract】: Online social networks are growing and becoming denser.The social connections of a given person may have very high variability: from close friends and relatives to acquaintances to people who hardly know. Inferring the strength of social ties is an important ingredient for modeling the interaction of users in a network and understanding their behavior. Furthermore, the problem has applications in computational social science, viral marketing, and people recommendation. In this paper we study the problem of inferring the strength of social ties in a given network. Our work is motivated by a recent approach by Sintos et. al [24], which leverages the Strong Triadic Closure} STC principle, a hypothesis rooted in social psychology. To guide our inference process, in addition to the network structure, we also consider as input a collection of tight communities. Those are sets of vertices that we expect to be connected via strong ties. Such communities appear in different situations, e.g., when being part of a community implies a strong connection to one of the existing members. We consider two related problem formalizations that reflect the assumptions of our setting: small number of STC violations and strong-tie connectivity in the input communities. We show that both problem formulations are NP-hard. We also show that one problem formulation is hard to approximate, while for the second we develop an algorithm with approximation guarantee. We validate the proposed method on real-world datasets by comparing with baselines that optimize STC violations and community connectivity separately.

【Keywords】: approximation algorithms; network inference; social network analysis; strong triadic closure

116. Detecting Network Effects: Randomizing Over Randomized Experiments.

Paper Link】 【Pages】:1027-1035

【Authors】: Martin Saveski ; Jean Pouget-Abadie ; Guillaume Saint-Jacques ; Weitao Duan ; Souvik Ghosh ; Ya Xu ; Edoardo M. Airoldi

【Abstract】: Randomized experiments, or A/B tests, are the standard approach for evaluating the causal effects of new product features, i.e., treatments. The validity of these tests rests on the "stable unit treatment value assumption" (SUTVA), which implies that the treatment only affects the behavior of treated users, and does not affect the behavior of their connections. Violations of SUTVA, common in features that exhibit network effects, result in inaccurate estimates of the causal effect of treatment. In this paper, we leverage a new experimental design for testing whether SUTVA holds, without making any assumptions on how treatment effects may spill over between the treatment and the control group. To achieve this, we simultaneously run both a completely randomized and a cluster-based randomized experiment, and then we compare the difference of the resulting estimates. We present a statistical test for measuring the significance of this difference and offer theoretical bounds on the Type I error rate. We provide practical guidelines for implementing our methodology on large-scale experimentation platforms. Importantly, the proposed methodology can be applied to settings in which a network is not necessarily observed but, if available, can be used in the analysis. Finally, we deploy this design to LinkedIn's experimentation platform and apply it to two online experiments, highlighting the presence of network effects and bias in standard A/B testing approaches in a real-world setting.

【Keywords】: a/b testing; experimental design; network effects; network interference; randomized experiment; spillovers; sutva

117. When is a Network a Network?: Multi-Order Graphical Model Selection in Pathways and Temporal Networks.

Paper Link】 【Pages】:1037-1046

【Authors】: Ingo Scholtes

【Abstract】: We introduce a framework for the modeling of sequential data capturing pathways of varying lengths observed in a network. Such data are important, e.g., when studying click streams in the Web, travel patterns in transportation systems, information cascades in social networks, biological pathways, or time-stamped social interactions. While it is common to apply graph analytics and network analysis to such data, recent works have shown that temporal correlations can invalidate the results of such methods. This raises a fundamental question: When is a network abstraction of sequential data justified?Addressing this open question, we propose a framework that combines Markov chains of multiple, higher orders into a multi-layer graphical model that captures temporal correlations in pathways at multiple length scales simultaneously. We develop a model selection technique to infer the optimal number of layers of such a model and show that it outperforms baseline Markov order detection techniques. An application to eight real-world data sets on pathways and temporal networks shows that it allows to infer graphical models that capture both topological and temporal characteristics of such data. Our work highlights fallacies of network abstractions and provides a principled answer to the open question when they are justified. Generalizing network representations to multi-order graphical models, it opens perspectives for new data mining and knowledge discovery algorithms.

【Keywords】: click paths; click stream analysis; data mining; dynamic graph; dynamic network analysis; graph analytics; graph mining; graph theory; graphical model; higher-order model; higher-order network; information systems; knowledge discovery; markov chain; network analysis; network science; pagerank; pathway data; relational data; sequence mining; sequential data; social network analysis; temporal data; temporal networks; time series analysis; time series data

118. ReasoNet: Learning to Stop Reading in Machine Comprehension.

Paper Link】 【Pages】:1047-1055

【Authors】: Yelong Shen ; Po-Sen Huang ; Jianfeng Gao ; Weizhu Chen

【Abstract】: Teaching a computer to read and answer general questions pertaining to a document is a challenging yet unsolved problem. In this paper, we describe a novel neural network architecture called the Reasoning Network (ReasoNet) for machine comprehension tasks. ReasoNets make use of multiple turns to effectively exploit and then reason over the relation among queries, documents, and answers. Different from previous approaches using a fixed number of turns during inference, ReasoNets introduce a termination state to relax this constraint on the reasoning depth. With the use of reinforcement learning, ReasoNets can dynamically determine whether to continue the comprehension process after digesting intermediate results, or to terminate reading when it concludes that existing information is adequate to produce an answer. ReasoNets achieve superior performance in machine comprehension datasets, including unstructured CNN and Daily Mail datasets, the Stanford SQuAD dataset, and a structured Graph Reachability dataset.

【Keywords】: deep reinforcement learning; machine reading comprehension; reasonet

119. DenseAlert: Incremental Dense-Subtensor Detection in Tensor Streams.

Paper Link】 【Pages】:1057-1066

【Authors】: Kijung Shin ; Bryan Hooi ; Jisu Kim ; Christos Faloutsos

【Abstract】: Consider a stream of retweet events - how can we spot fraudulent lock-step behavior in such multi-aspect data (i.e., tensors) evolving over time? Can we detect it in real time, with an accuracy guarantee? Past studies have shown that dense subtensors tend to indicate anomalous or even fraudulent behavior in many tensor data, including social media, Wikipedia, and TCP dumps. Thus, several algorithms have been proposed for detecting dense subtensors rapidly and accurately. However, existing algorithms assume that tensors are static, while many real-world tensors, including those mentioned above, evolve over time. We propose DENSESTREAM, an incremental algorithm that maintains and updates a dense subtensor in a tensor stream (i.e., a sequence of changes in a tensor), and DENSESALERT, an incremental algorithm spotting the sudden appearances of dense subtensors. Our algorithms are: (1) Fast and "any time": updates by our algorithms are up to a million times faster than the fastest batch algorithms, (2) Provably accurate: our algorithms guarantee a lower bound on the density of the subtensor they maintain, and (3) Effective: our DENSESALERT successfully spots anomalies in real-world tensors, especially those overlooked by existing algorithms.

【Keywords】: anomaly detection; dense-subtensor detection; fraud detection; incremental algorithm; tensor stream

120. Anomaly Detection in Streams with Extreme Value Theory.

Paper Link】 【Pages】:1067-1075

【Authors】: Alban Siffer ; Pierre-Alain Fouque ; Alexandre Termier ; Christine Largouët

【Abstract】: Anomaly detection in time series has attracted considerable attention due to its importance in many real-world applications including intrusion detection, energy management and finance. Most approaches for detecting outliers rely on either manually set thresholds or assumptions on the distribution of data according to Chandola, Banerjee and Kumar. Here, we propose a new approach to detect outliers in streaming univariate time series based on Extreme Value Theory that does not require to hand-set thresholds and makes no assumption on the distribution: the main parameter is only the risk, controlling the number of false positives. Our approach can be used for outlier detection, but more generally for automatically setting thresholds, making it useful in wide number of situations. We also experiment our algorithms on various real-world datasets which confirm its soundness and efficiency.

【Keywords】: extreme value theory; outliers in time series; streaming

121. Relay-Linking Models for Prominence and Obsolescence in Evolving Networks.

Paper Link】 【Pages】:1077-1086

【Authors】: Mayank Singh ; Rajdeep Sarkar ; Pawan Goyal ; Animesh Mukherjee ; Soumen Chakrabarti

【Abstract】: The rate at which nodes in evolving social networks acquire links (friends, citations) shows complex temporal dynamics. Preferential attachment and link copying models, while enabling elegant analysis, only capture rich-gets-richer effects, not aging and decline. Recent aging models are complex and heavily parameterized; most involve estimating 1-3 parameters per node. These parameters are intrinsic: they explain decline in terms of events in the past of the same node, and do not explain, using the network, where the linking attention might go instead. We argue that traditional characterization of linking dynamics are insufficient to judge the faithfulness of models. We propose a new temporal sketch of an evolving graph, and introduce several new characterizations of a network's temporal dynamics. Then we propose a new family of frugal aging models with no per-node parameters and only two global parameters. Our model is based on a surprising inversion or undoing of triangle completion, where an old node relays a citation to a younger follower in its immediate vicinity. Despite very few parameters, the new family of models shows remarkably better fit with real data. Before concluding, we analyze temporal signatures for various research communities yielding further insights into their comparative dynamics. To facilitate reproducible research, we shall soon make all the codes and the processed dataset available in the public domain.

【Keywords】: aging models; network growth models; relay-link

122. PAMAE: Parallel k-Medoids Clustering with High Accuracy and Efficiency.

Paper Link】 【Pages】:1087-1096

【Authors】: Hwanjun Song ; Jae-Gil Lee ; Wook-Shin Han

【Abstract】: The k-medoids algorithm is one of the best-known clustering algorithms. Despite this, however, it is not as widely used for big data analytics as the k-means algorithm, mainly because of its high computational complexity. Many studies have attempted to solve the efficiency problem of the k-medoids algorithm, but all such studies have improved efficiency at the expense of accuracy. In this paper, we propose a novel parallel k-medoids algorithm, which we call PAMAE, that achieves both high accuracy and high efficiency. We identify two factors---"global search" and "entire data"---that are essential to achieving high accuracy, but are also very time-consuming if considered simultaneously. Thus, our key idea is to apply them individually through two phases: parallel seeding and parallel refinement, neither of which is costly. The first phase performs global search over sampled data, and the second phase performs local search over entire data. Our theoretical analysis proves that this serial execution of the two phases leads to an accurate solution that would be achieved by global search over entire data. In order to validate the merit of our approach, we implement PAMAE on Spark as well as Hadoop and conduct extensive experiments using various real-world data sets on 12 Microsoft Azure machines (48 cores). The results show that PAMAE significantly outperforms most of recent parallel algorithms and, at the same time, produces a clustering quality as comparable as the previous most-accurate algorithm. The source code and data are available at https://github.com/jaegil/k-Medoid.

【Keywords】: hadoop; k-medoids; pam; parallelization; sampling; spark

123. Sparse Compositional Local Metric Learning.

Paper Link】 【Pages】:1097-1104

【Authors】: Joseph St. Amand ; Jun Huan

【Abstract】: Mahalanobis distance metric learning becomes an especially challenging problem as the dimension of the feature space p is scaled upwards. The number of parameters to optimize grows with space complexity of order O (p 2), making storage infeasible, interpretability poor, and causing the model to have a high tendency to overfit. Additionally, optimization while maintaining feasibility of the solution becomes prohibitively expensive, requiring a projection onto the positive semi-definite cone after every iteration. In addition to the obvious space and computational challenges, vanilla distance metric learning is unable to model complex and multi-modal trends in the data. Inspired by the recent resurgence of Frank-Wolfe style optimization, we propose a new method for sparse compositional local Mahalanobis distance metric learning. Our proposed technique learns a set of distance metrics which are composed of local and global components. We capture local interactions in the feature space, while ensuring that all metrics share a global component, which may act as a regularizer. We optimize our model using an alternating pairwise Frank-Wolfe style algorithm. This serves a dual purpose, we can control the sparsity of our solution, and altogether avoid any expensive projection operations. Finally, we conduct an empirical evaluation of our method with the current state of the art and present the results on five datasets from varying domains.

【Keywords】: frank-wolfe optimization; mahalanobis distance; sparse metric learning

124. End-to-end Learning for Short Text Expansion.

Paper Link】 【Pages】:1105-1113

【Authors】: Jian Tang ; Yue Wang ; Kai Zheng ; Qiaozhu Mei

【Abstract】: Effectively making sense of short texts is a critical task for many real world applications such as search engines, social media services, and recommender systems. The task is particularly challenging as a short text contains very sparse information, often too sparse for a machine learning algorithm to pick up useful signals. A common practice for analyzing short text is to first expand it with external information, which is usually harvested from a large collection of longer texts. In literature, short text expansion has been done with all kinds of heuristics. We propose an end-to-end solution that automatically learns how to expand short text to optimize a given learning task. A novel deep memory network is proposed to automatically find relevant information from a collection of longer documents and reformulate the short text through a gating mechanism. Using short text classification as a demonstrating task, we show that the deep memory network significantly outperforms classical text expansion methods with comprehensive experiments on real world data sets.

【Keywords】: memory networks; query expansion; short text classification

125. Construction of Directed 2K Graphs.

Paper Link】 【Pages】:1115-1124

【Authors】: Bálint Tillman ; Athina Markopoulou ; Carter T. Butts ; Minas Gjoka

【Abstract】: We study the problem of generating synthetic graphs that resemble real-world directed graphs in terms of their degree correlations. In order to capture degree correlation specifically for directed graphs, we define directed 2K (D2K) as those graphs with a given directed degree sequence (DDS) and a given target joint degree and attribute matrix (JDAM). We provide necessary and sufficient conditions for a target D2K to be realizable and we design an efficient algorithm that generates graph realizations with exactly the target D2K. We apply our algorithm to generate synthetic graphs that target real-world directed graphs (such as Twitter), and we demonstrate its benefits compared to state-of-the-art construction algorithms.

【Keywords】: construction algorithms; directed graphs; graph realizations

126. Optimized Risk Scores.

Paper Link】 【Pages】:1125-1134

【Authors】: Berk Ustun ; Cynthia Rudin

【Abstract】: Risk scores are simple classification models that let users quickly assess risk by adding, subtracting, and multiplying a few small numbers. Such models are widely used in healthcare and criminal justice, but are often built ad hoc. In this paper, we present a principled approach to learn risk scores that are fully optimized for feature selection, integer coefficients, and operational constraints. We formulate the risk score problem as a mixed integer nonlinear program, and present a new cutting plane algorithm to efficiently recover its optimal solution. Our approach can fit optimized risk scores in a way that scales linearly with the sample size of a dataset, provides a proof of optimality, and obeys complex constraints without parameter tuning. We illustrate these benefits through an extensive set of numerical experiments, and an application where we build a customized risk score for ICU seizure prediction.

【Keywords】: classification; cutting plane methods; interpretable models; mixed integer non-linear programming; risk scores; seizure prediction

127. A Location-Sentiment-Aware Recommender System for Both Home-Town and Out-of-Town Users.

Paper Link】 【Pages】:1135-1143

【Authors】: Hao Wang ; Yanmei Fu ; Qinyong Wang ; Hongzhi Yin ; Changying Du ; Hui Xiong

【Abstract】: Spatial item recommendation has become an important means to help people discover interesting locations, especially when people pay a visit to unfamiliar regions. Some current researches are focusing on modelling individual and collective geographical preferences for spatial item recommendation based on users' check-in records, but they fail to explore the phenomenon of user interest drift across geographical regions, i.e., users would show different interests when they travel to different regions. Besides, they ignore the influence of public comments for subsequent users' check-in behaviors. Specifically, it is intuitive that users would refuse to check in to a spatial item whose historical reviews seem negative overall, even though it might fit their interests. Therefore, it is necessary to recommend the right item to the right user at the right location. In this paper, we propose a latent probabilistic generative model called LSARS to mimic the decision-making process of users' check-in activities both in home-town and out-of-town scenarios by adapting to user interest drift and crowd sentiments, which can learn location-aware and sentiment-aware individual interests from the contents of spatial items and user reviews. Due to the sparsity of user activities in out-of-town regions, LSARS is further designed to incorporate the public preferences learned from local users' check-in behaviors. Finally, we deploy LSARS into two practical application scenes: spatial item recommendation and target user discovery. Extensive experiments on two large-scale location-based social networks (LBSNs) datasets show that LSARS achieves better performance than existing state-of-the-art methods.

【Keywords】: check-in behavior; crowd sentiment; recommendation; user interest drift

128. Adversary Resistant Deep Neural Networks with an Application to Malware Detection.

Paper Link】 【Pages】:1145-1153

【Authors】: Qinglong Wang ; Wenbo Guo ; Kaixuan Zhang ; Alexander G. Ororbia II ; Xinyu Xing ; Xue Liu ; C. Lee Giles

【Abstract】: Outside the highly publicized victories in the game of Go, there have been numerous successful applications of deep learning in the fields of information retrieval, computer vision, and speech recognition. In cybersecurity, an increasing number of companies have begun exploring the use of deep learning (DL) in a variety of security tasks with malware detection among the more popular. These companies claim that deep neural networks (DNNs) could help turn the tide in the war against malware infection. However, DNNs are vulnerable to adversarial samples, a shortcoming that plagues most, if not all, statistical and machine learning models. Recent research has demonstrated that those with malicious intent can easily circumvent deep learning-powered malware detection by exploiting this weakness. To address this problem, previous work developed defense mechanisms that are based on augmenting training data or enhancing model complexity. However, after analyzing DNN susceptibility to adversarial samples, we discover that the current defense mechanisms are limited and, more importantly, cannot provide theoretical guarantees of robustness against adversarial sampled-based attacks. As such, we propose a new adversary resistant technique that obstructs attackers from constructing impactful adversarial samples by randomly nullifying features within data vectors. Our proposed technique is evaluated on a real world dataset with 14,679 malware variants and 17,399 benign programs. We theoretically validate the robustness of our technique, and empirically show that our technique significantly boosts DNN robustness to adversarial samples while maintaining high accuracy in classification. To demonstrate the general applicability of our proposed method, we also conduct experiments using the MNIST and CIFAR-10 datasets, widely used in image recognition research.

【Keywords】: adversarial sample; deep neural networks; image recognition; malware classification

129. Multi-Modality Disease Modeling via Collective Deep Matrix Factorization.

Paper Link】 【Pages】:1155-1164

【Authors】: Qi Wang ; Mengying Sun ; Liang Zhan ; Paul Thompson ; Shuiwang Ji ; Jiayu Zhou

【Abstract】: Alzheimer's disease (AD), one of the most common causes of dementia, is a severe irreversible neurodegenerative disease that results in loss of mental functions. The transitional stage between the expected cognitive decline of normal aging and AD, mild cognitive impairment (MCI), has been widely regarded as a suitable time for possible therapeutic intervention. The challenging task of MCI detection is therefore of great clinical importance, where the key is to effectively fuse predictive information from multiple heterogeneous data sources collected from the patients. In this paper, we propose a framework to fuse multiple data modalities for predictive modeling using deep matrix factorization, which explores the non-linear interactions among the modalities and exploits such interactions to transfer knowledge and enable high performance prediction. Specifically, the proposed collective deep matrix factorization decomposes all modalities simultaneously to capture non-linear structures of the modalities in a supervised manner, and learns a modality specific component for each modality and a modality invariant component across all modalities. The modality invariant component serves as a compact feature representation of patients that has high predictive power. The modality specific components provide an effective means to explore imaging genetics, yielding insights into how imaging and genotype interact with each other non-linearly in the AD pathology. Extensive empirical studies using various data modalities provided by Alzheimer's Disease Neuroimaging Initiative (ADNI) demonstrate the effectiveness of the proposed method for fusing heterogeneous modalities.

【Keywords】: deep neural network; genetics; matrix factorization; medical imaging; mild cognitive impairment; multi-modality

130. Decomposed Normalized Maximum Likelihood Codelength Criterion for Selecting Hierarchical Latent Variable Models.

Paper Link】 【Pages】:1165-1174

【Authors】: Tianyi Wu ; Shinya Sugawara ; Kenji Yamanishi

【Abstract】: We propose a new model selection criterion based on the minimum description length principle in a name of the decomposed normalized maximum likelihood criterion. Our criterion can be applied to a large class of hierarchical latent variable models, such as the Naive Bayes models, stochastic block models and latent Dirichlet allocations, for which many conventional information criteria cannot be straightforwardly applied due to irregularity of latent variable models. Our method also has an advantage that it can be exactly evaluated without asymptotic approximation with small time complexity. Our experiments using synthetic and real data demonstrated validity of our method in terms of computational efficiency and model selection accuracy, while our criterion especially dominated the other criteria when sample size is small and when data are noisy.

【Keywords】: clustering; model selection; topic and latent variable models

131. Structural Event Detection from Log Messages.

Paper Link】 【Pages】:1175-1184

【Authors】: Fei Wu ; Pranay Anchuri ; Zhenhui Li

【Abstract】: A wide range of modern web applications are only possible because of the composable nature of the web services they are built upon. It is, therefore, often critical to ensure proper functioning of these web services. As often, the server-side of web services is not directly accessible, several log message based analysis have been developed to monitor the status of web services. Existing techniques focus on using clusters of messages (log patterns) to detect important system events. We argue that meaningful system events are often representable by groups of cohesive log messages and the relationships among these groups. We propose a novel method to mine structural events as directed workflow graphs (where nodes represent log patterns, and edges represent relations among patterns). The structural events are inclusive and correspond to interpretable episodes in the system. The problem is non-trivial due to the nature of log data: (i) Individual log messages contain limited information, and (ii) Log messages in a large scale web system are often interleaved even though the log messages from individual components are ordered. As a result, the patterns and relationships mined directly from the messages and their ordering can be erroneous and unreliable in practice. Our solution is based on the observation that meaningful log patterns and relations often form workflow structures that are connected. Our method directly models the overall quality of structural events. Through both qualitative and quantitative experiments on real world datasets, we demonstrate the effectiveness and the expressiveness of our event detection method.

【Keywords】: log summarization; structural event detection

132. Retrospective Higher-Order Markov Processes for User Trails.

Paper Link】 【Pages】:1185-1194

【Authors】: Tao Wu ; David F. Gleich

【Abstract】: Users form information trails as they checkin with a geolocation, rate items, or consume media. A common problem is to predict what a user might do next for the purposes of guidance, recommendation, or prefetching. Markov chains models have been widely used methods to study such sequences of data. First-order Markov chains are easy to estimate, but lack accuracy when history matters. Higher-order Markov chains, in contrast, have too many parameters and suffer from overfitting the training data. Fitting these parameters with regularization and smoothing only offers mild improvements. In this paper we propose the retrospective higher-order Markov process (RHOMP) as a low-parameter model for such sequences. This model is a special case of a higher-order Markov chain where the transitions depend retrospectively on a single history state instead of an arbitrary combination of history states. There are two immediate computational advantages: the number of parameters is linear in the order of the Markov chain and the model can be fit to large state spaces. Furthermore, by providing a specific structure to the higher-order chain, RHOMPs improve the model accuracy by efficiently utilizing history states without risks of overfitting the data. We demonstrate how to estimate a RHOMP from data and we demonstrate the effectiveness of our method on various real application datasets spanning geolocation data, review sequences, and business locations. The RHOMP model uniformly outperforms higher-order Markov chains, Kneser-Ney regularization, and tensor factorizations in terms of prediction accuracy.

【Keywords】: higher-order markov chains; tensor factorization; user models

133. Privacy-Preserving Distributed Multi-Task Learning with Asynchronous Updates.

Paper Link】 【Pages】:1195-1204

【Authors】: Liyang Xie ; Inci M. Baytas ; Kaixiang Lin ; Jiayu Zhou

【Abstract】: Many data mining applications involve a set of related learning tasks. Multi-task learning (MTL) is a learning paradigm that improves generalization performance by transferring knowledge among those tasks. MTL has attracted so much attention in the community, and various algorithms have been successfully developed. Recently, distributed MTL has also been studied for related tasks whose data is distributed across different geographical regions. One prominent challenge of the distributed MTL frameworks is to maintain the privacy of the data. The distributed data may contain sensitive and private information such as patients' records and registers of a company. In such cases, distributed MTL frameworks are required to preserve the privacy of the data. In this paper, we propose a novel privacy-preserving distributed MTL framework to address this challenge. A privacy-preserving proximal gradient algorithm, which asynchronously updates models of the learning tasks, is introduced to solve a general class of MTL formulations. The proposed asynchronous approach is robust against network delays and provides a guaranteed differential privacy through carefully designed perturbation. Theoretical guarantees of the proposed algorithm are derived and supported by the extensive experimental results.

【Keywords】: asynchronous proximal optimization; differential privacy; multi-task learning

134. Evaluating U.S. Electoral Representation with a Joint Statistical Model of Congressional Roll-Calls, Legislative Text, and Voter Registration Data.

Paper Link】 【Pages】:1205-1214

【Authors】: Zhengming Xing ; Sunshine Hillygus ; Lawrence Carin

【Abstract】: Extensive information on 3 million randomly sampled United States citizens is used to construct a statistical model of constituent preferences for each U.S. congressional district. This model is linked to the legislative voting record of the legislator from each district, yielding an integrated model for constituency data, legislative roll-call votes, and the text of the legislation. The model is used to examine the extent to which legislators' voting records are aligned with constituent preferences, and the implications of that alignment (or lack thereof) on subsequent election outcomes. The analysis is based on a Bayesian formalism, with fast inference via a stochastic variational Bayesian analysis.

【Keywords】: hierarchical dirichlet process; ideal point; matrix factorization; multiplicative gamma process; stochastic variational inference; topic model

135. Convex Factorization Machine for Toxicogenomics Prediction.

Paper Link】 【Pages】:1215-1224

【Authors】: Makoto Yamada ; Wenzhao Lian ; Amit Goyal ; Jianhui Chen ; Kishan Wimalawarne ; Suleiman A. Khan ; Samuel Kaski ; Hiroshi Mamitsuka ; Yi Chang

【Abstract】: We introduce the convex factorization machine (CFM), which is a convex variant of the widely used Factorization Machines (FMs). Specifically, we employ a linear+quadratic model and regularize the linear term with the ℓ2-regularizer and the quadratic term with the trace norm regularizer. Then, we formulate the CFM optimization as a semidefinite programming problem and propose an efficient optimization procedure with Hazan's algorithm. A key advantage of CFM over existing FMs is that it can find a globally optimal solution, while FMs may get a poor locally optimal solution since the objective function of FMs is non-convex. In addition, the proposed algorithm is simple yet effective and can be implemented easily. Finally, CFM is a general factorization method and can also be used for other factorization problems, including multi-view matrix factorization and tensor completion problems, in various domains including toxicogenomics and bioinformatics. Through synthetic and traditionally used movielens datasets, we first show that the proposed CFM achieves results competitive to FMs. We then show in a toxicogenomics prediction task that CFM predicts the toxic outcomes of a collection of drugs better than a state-of-the-art tensor factorization method.

【Keywords】: convex; factorization machines; toxicogenomics prediction

136. Distributed Local Outlier Detection in Big Data.

Paper Link】 【Pages】:1225-1234

【Authors】: Yizhou Yan ; Lei Cao ; Caitlin Kuhlman ; Elke A. Rundensteiner

【Abstract】: In this work, we present the first distributed solution for the Local Outlier Factor (LOF) method -- a popular outlier detection technique shown to be very effective for datasets with skewed distributions. As datasets increase radically in size, highly scalable LOF algorithms leveraging modern distributed infrastructures are required. This poses significant challenges due to the complexity of the LOF definition, and a lack of access to the entire dataset at any individual compute machine. Our solution features a distributed LOF pipeline framework, called DLOF. Each stage of the LOF computation is conducted in a fully distributed fashion by leveraging our invariant observation for intermediate value management. Furthermore, we propose a data assignment strategy which ensures that each machine is self-sufficient in all stages of the LOF pipeline, while minimizing the number of data replicas. Based on the convergence property derived from analyzing this strategy in the context of real world datasets, we introduce a number of data-driven optimization strategies. These strategies not only minimize the computation costs within each stage, but also eliminate unnecessary communication costs by aggressively pushing the LOF computation into the early stages of the DLOF pipeline. Our comprehensive experimental study using both real and synthetic datasets confirms the efficiency and scalability of our approach to terabyte level data.

【Keywords】: big data; distributed processing; local outlier

137. Scalable Top-n Local Outlier Detection.

Paper Link】 【Pages】:1235-1244

【Authors】: Yizhou Yan ; Lei Cao ; Elke A. Rundensteiner

【Abstract】: Local Outlier Factor (LOF) method that labels all points with their respective LOF scores to indicate their status is known to be very effective for identifying outliers in datasets with a skewed distribution. Since outliers by definition are the absolute minority in a dataset, the concept of Top-N local outlier was proposed to discover the n points with the largest LOF scores. The detection of the Top-N local outliers is prohibitively expensive, since it requires huge number of high complexity k-nearest neighbor (kNN) searches. In this work, we present the first scalable Top-N local outlier detection approach called TOLF. The key innovation of TOLF is a multi-granularity pruning strategy that quickly prunes most points from the set of potential outlier candidates without computing their exact LOF scores or even without conducting any kNN search for them. Our customized density-aware indexing structure not only effectively supports the pruning strategy, but also accelerates the $k$NN search. Our extensive experimental evaluation on OpenStreetMap, SDSS, and TIGER datasets demonstrates the effectiveness of TOLF - up to 35 times faster than the state-of-the-art methods.

【Keywords】: local outlier factor; pruning strategy; top-n

138. Bridging Collaborative Filtering and Semi-Supervised Learning: A Neural Approach for POI Recommendation.

Paper Link】 【Pages】:1245-1254

【Authors】: Carl Yang ; Lanxiao Bai ; Chao Zhang ; Quan Yuan ; Jiawei Han

【Abstract】: Recommender system is one of the most popular data mining topics that keep drawing extensive attention from both academia and industry. Among them, POI (point of interest) recommendation is extremely practical but challenging: it greatly benefits both users and businesses in real-world life, but it is hard due to data scarcity and various context. While a number of algorithms attempt to tackle the problem w.r.t. specific data and problem settings, they often fail when the scenarios change. In this work, we propose to devise a general and principled SSL (semi-supervised learning) framework, to alleviate data scarcity via smoothing among neighboring users and POIs, and treat various context by regularizing user preference based on context graphs. To enable such a framework, we develop PACE (Preference And Context Embedding), a deep neural architecture that jointly learns the embeddings of users and POIs to predict both user preference over POIs and various context associated with users and POIs. We show that PACE successfully bridges CF (collaborative filtering) and SSL by generalizing the de facto methods matrix factorization of CF and graph Laplacian regularization of SSL. Extensive experiments on two real location-based social network datasets demonstrate the effectiveness of PACE.

【Keywords】: collaborative filtering; neural networks; recommender systems; semi-supervised learning

139. Multi-task Function-on-function Regression with Co-grouping Structured Sparsity.

Paper Link】 【Pages】:1255-1264

【Authors】: Pei Yang ; Qi Tan ; Jingrui He

【Abstract】: The growing importance of functional data has fueled the rapid development of functional data analysis, which treats the infinite-dimensional data as continuous functions rather than discrete, finite-dimensional vectors. On the other hand, heterogeneity is an intrinsic property of functional data due to the variety of sources to collect the data. In this paper, we propose a novel multi-task function-on-function regression approach to model both the functionality and heterogeneity of data. The basic idea is to simultaneously model the relatedness among tasks and correlations among basis functions by using the co-grouping structured sparsity to encourage similar tasks to behave similarly in shrinking the basis functions. The resulting optimization problem is challenging due to the non-smoothness and non-separability of the co-grouping structured sparsity. We present an efficient algorithm to solve the problem, and prove its separability, convexity, and global convergence. The proposed algorithm is applicable to a wide spectrum of structured sparsity regularized techniques, such as structured ℓ2,p norm and structured Schatten p-norm. The effectiveness of the proposed approach is verified on benchmark functional data sets collected from various domains.

【Keywords】: co-grouping structured sparsity; function-on-function regression; generalized schatten norm; multi-task learning

140. Learning from Labeled and Unlabeled Vertices in Networks.

Paper Link】 【Pages】:1265-1274

【Authors】: Wei Ye ; Linfei Zhou ; Dominik Mautz ; Claudia Plant ; Christian Böhm

【Abstract】: Networks such as social networks, citation networks, protein-protein interaction networks, etc., are prevalent in real world. However, only very few vertices have labels compared to large amounts of unlabeled vertices. For example, in social networks, not every user provides his/her profile information such as the personal interests which are relevant for targeted advertising. Can we leverage the limited user information and friendship network wisely to infer the labels of unlabeled users? In this paper, we propose a semi-supervised learning framework called weighted-vote Geometric Neighbor classifier (wvGN) to infer the likely labels of unlabeled vertices in sparsely labeled networks. wvGN exploits random walks to explore not only local but also global neighborhood information of a vertex. Then the label of the vertex is determined by the accumulated local and global neighborhood information. Specifically, wvGN optimizes a proposed objective function by a search strategy which is based on the gradient and coordinate descent methods. The search strategy iteratively conducts a coarse search and a fine search to escape from local optima. Extensive experiments on various synthetic and real-world data verify the effectiveness of wvGN compared to state-of-the-art approaches.

【Keywords】: geometric neighborhood; random walk; regularization; semi-supervised learning; support vector machines

141. Small Batch or Large Batch?: Gaussian Walk with Rebound Can Teach.

Paper Link】 【Pages】:1275-1284

【Authors】: Peifeng Yin ; Ping Luo ; Taiga Nakamura

【Abstract】: Efficiency of large-scale learning is a hot topic in both academic and industry. The stochastic gradient descent (SGD) algorithm, and its extension mini-batch SGD, allow the model to be updated without scanning the whole data set. However, the use of approximate gradient leads to the uncertainty issue, slowing down the decreasing of objective function. Furthermore, such uncertainty may result in a high frequency of meaningless update on the model, causing a communication issue in parallel learning environment. In this work, we develop a batch-adaptive stochastic gradient descent (BA-SGD) algorithm, which can dynamically choose a proper batch size as learning proceeds. Particularly on the basis of Taylor extension and central limit theorem, it models the decrease of objective value as a Gaussian random walk game with rebound. In this game, a heuristic strategy of determining batch size is adopted to maximize the utility of each incremental sampling. By evaluation on multiple real data sets, we demonstrate that by smartly choosing the batch size, the BA-SGD not only conserves the fast convergence of SGD algorithm but also avoids too frequent model updates.

【Keywords】: batch-adaptive sgd; gaussian random walk; rebound

142. Learning from Multiple Teacher Networks.

Paper Link】 【Pages】:1285-1294

【Authors】: Shan You ; Chang Xu ; Chao Xu ; Dacheng Tao

【Abstract】: Training thin deep networks following the student-teacher learning paradigm has received intensive attention because of its excellent performance. However, to the best of our knowledge, most existing work mainly considers one single teacher network. In practice, a student may access multiple teachers, and multiple teacher networks together provide comprehensive guidance that is beneficial for training the student network. In this paper, we present a method to train a thin deep network by incorporating multiple teacher networks not only in output layer by averaging the softened outputs (dark knowledge) from different networks, but also in the intermediate layers by imposing a constraint about the dissimilarity among examples. We suggest that the relative dissimilarity between intermediate representations of different examples serves as a more flexible and appropriate guidance from teacher networks. Then triplets are utilized to encourage the consistence of these relative dissimilarity relationships between the student network and teacher networks. Moreover, we leverage a voting strategy to unify multiple relative dissimilarity information provided by multiple teacher networks, which realizes their incorporation in the intermediate layers. Extensive experimental results demonstrated that our method is capable of generating a well-performed student network, with the classification accuracy comparable or even superior to all teacher networks, yet having much fewer parameters and being much faster in running.

【Keywords】: deep learning; knowledge transfer; multiple teacher networks; triplet loss

143. A Temporally Heterogeneous Survival Framework with Application to Social Behavior Dynamics.

Paper Link】 【Pages】:1295-1304

【Authors】: Linyun Yu ; Peng Cui ; Chaoming Song ; Tianyang Zhang ; Shiqiang Yang

【Abstract】: Social behavior dynamics is one of the central building blocks in understanding and modeling complex social dynamic phenomena, such as information spreading, opinion formation, and social mobilization. While a wide range of models for social behavior dynamics have been proposed in recent years, the essential ingredients and the minimum model for social behavior dynamics is still largely unanswered. Here, we find that human interaction behavior dynamics exhibit rich complexities over the response time dimension and natural time dimension by exploring a large scale social communication dataset. To tackle this challenge, we develop a temporal Heterogeneous Survival framework where the regularities in response time dimension and natural time dimension can be organically integrated. We apply our model in two online social communication datasets. Our model can successfully regenerate the interaction patterns in the social communication datasets, and the results demonstrate that the proposed method can significantly outperform other state-of-the-art baselines. Meanwhile, the learnt parameters and discovered statistical regularities can lead to multiple potential applications.

【Keywords】: human behaviors analysis; information diffusion modeling; social dynamics

144. Inductive Semi-supervised Multi-Label Learning with Co-Training.

Paper Link】 【Pages】:1305-1314

【Authors】: Wang Zhan ; Min-Ling Zhang

【Abstract】: In multi-label learning, each training example is associated with multiple class labels and the task is to learn a mapping from the feature space to the power set of label space. It is generally demanding and time-consuming to obtain labels for training examples, especially for multi-label learning task where a number of class labels need to be annotated for the instance. To circumvent this difficulty, semi-supervised multi-label learning aims to exploit the readily-available unlabeled data to help build multi-label predictive model. Nonetheless, most semi-supervised solutions to multi-label learning work under transductive setting, which only focus on making predictions on existing unlabeled data and cannot generalize to unseen instances. In this paper, a novel approach named COINS is proposed to learning from labeled and unlabeled data by adapting the well-known co-training strategy which naturally works under inductive setting. In each co-training round, a dichotomy over the feature space is learned by maximizing the diversity between the two classifiers induced on either dichotomized feature subset. After that, pairwise ranking predictions on unlabeled data are communicated between either classifier for model refinement. Extensive experiments on a number of benchmark data sets show that COINS performs favorably against state-of-the-art multi-label learning approaches.

【Keywords】: co-training; inductive setting; multi-label learning; semi-supervised learning

145. LEAP: Learning to Prescribe Effective and Safe Treatment Combinations for Multimorbidity.

Paper Link】 【Pages】:1315-1324

【Authors】: Yutao Zhang ; Robert Chen ; Jie Tang ; Walter F. Stewart ; Jimeng Sun

【Abstract】: Managing patients with complex multimorbidity has long been recognized as a difficult problem due to complex disease and medication dependencies and the potential risk of adverse drug interactions. Existing work either uses complicated rule-based protocols which are hard to implement and maintain, or simple statistical models that treat each disease independently, which may lead to sub-optimal or even harmful drug combinations. In this work, we propose the LEAP (LEArn to Prescribe) algorithm to decompose the treatment recommendation into a sequential decision-making process while automatically determining the appropriate number of medications. A recurrent decoder is used to model label dependencies and content-based attention is used to capture label instance mapping. We further leverage reinforcement learning to fine tune the model parameters to ensure accuracy and completeness. We incorporate external clinical knowledge into the design of the reinforcement reward to effectively prevent generating unfavorable drug combinations. Both quantitative experiments and qualitative case studies are conducted on two real world electronic health record datasets to verify the effectiveness of our solution. On both datasets, LEAP significantly outperforms baselines by up to 10-30% in terms of mean Jaccard coefficient and removes 99.8% adverse drug interactions in the recommended treatment sets.

【Keywords】: multi-instance multilabel learning; multimorbidity; treatment recommendation

146. Visualizing Attributed Graphs via Terrain Metaphor.

Paper Link】 【Pages】:1325-1334

【Authors】: Yang Zhang ; Yusu Wang ; Srinivasan Parthasarathy

【Abstract】: The value proposition of a dataset often resides in the implicit interconnections or explicit relationships (patterns) among individual entities, and is often modeled as a graph. Effective visualization of such graphs can lead to key insights uncovering such value. In this article we propose a visualization method to explore attributed graphs with numerical attributes associated with nodes (or edges). Such numerical attributes can represent raw content information, similarities, or derived information reflecting important network measures such as triangle density and centrality. The proposed visualization strategy seeks to simultaneously uncover the relationship between attribute values and graph topology, and relies on transforming the network to generate a terrain map. A key objective here is to ensure that the terrain map reveals the overall distribution of components-of-interest (e.g. dense subgraphs, k-cores) and the relationships among them while being sensitive to the attribute values over the graph. We also design extensions that can capture the relationship across multiple numerical attributes. We demonstrate the efficacy of our method on several real-world data science tasks while scaling to large graphs with millions of nodes.

【Keywords】: attributed graph; graph visualization

147. Achieving Non-Discrimination in Data Release.

Paper Link】 【Pages】:1335-1344

【Authors】: Lu Zhang ; Yongkai Wu ; Xintao Wu

【Abstract】: Discrimination discovery and prevention/removal are increasingly important tasks in data mining. Discrimination discovery aims to unveil discriminatory practices on the protected attribute (e.g., gender) by analyzing the dataset of historical decision records, and discrimination prevention aims to remove discrimination by modifying the biased data before conducting predictive analysis. In this paper, we show that the key to discrimination discovery and prevention is to find the meaningful partitions that can be used to provide quantitative evidences for the judgment of discrimination. With the support of the causal graph, we present a graphical condition for identifying a meaningful partition. Based on that, we develop a simple criterion for the claim of non-discrimination, and propose discrimination removal algorithms which accurately remove discrimination while retaining good data utility. Experiments using real datasets show the effectiveness of our approaches.

【Keywords】: causal graph; discrimination discovery and removal

KDD 2017 Applied Data Science Papers (Oral Papers) 36

148. Using Convolutional Networks and Satellite Imagery to Identify Patterns in Urban Environments at a Large Scale.

Paper Link】 【Pages】:1357-1366

【Authors】: Adrian Albert ; Jasleen Kaur ; Marta C. Gonzalez

【Abstract】: Urban planning applications (energy audits, investment, etc.) require an understanding of built infrastructure and its environment, i.e., both low-level, physical features (amount of vegetation, building area and geometry etc.), as well as higher-level concepts such as land use classes (which encode expert understanding of socio-economic end uses). This kind of data is expensive and labor-intensive to obtain, which limits its availability (particularly in developing countries). We analyze patterns in land use in urban neighborhoods using large-scale satellite imagery data (which is available worldwide from third-party providers) and state-of-the-art computer vision techniques based on deep convolutional neural networks. For supervision, given the limited availability of standard benchmarks for remote-sensing data, we obtain ground truth land use class labels carefully sampled from open-source surveys, in particular the Urban Atlas land classification dataset of $20$ land use classes across $~300$ European cities. We use this data to train and compare deep architectures which have recently shown good performance on standard computer vision tasks (image classification and segmentation), including on geospatial data. Furthermore, we show that the deep representations extracted from satellite imagery of urban environments can be used to compare neighborhoods across several cities. We make our dataset available for other machine learning researchers to use for remote-sensing applications.

【Keywords】: computer vision; convolutional neural networks; land use classification; satellite imagery

149. Luck is Hard to Beat: The Difficulty of Sports Prediction.

Paper Link】 【Pages】:1367-1376

【Authors】: Raquel Y. S. Aoki ; Renato Martins Assunção ; Pedro O. S. Vaz de Melo

【Abstract】: Predicting the outcome of sports events is a hard task. We quantify this difficulty with a coefficient that measures the distance between the observed final results of sports leagues and idealized perfectly balanced competitions in terms of skill. This indicates the relative presence of luck and skill. We collected and analyzed all games from 198 sports leagues comprising 1503 seasons from 84 countries of 4 different sports: basketball, soccer, volleyball and handball. We measured the competitiveness by countries and sports. We also identify in each season which teams, if removed from its league, result in a completely random tournament. Surprisingly, not many of them are needed. As another contribution of this paper, we propose a probabilistic graphical model to learn about the teams' skills and to decompose the relative weights of luck and skill in each game. We break down the skill component into factors associated with the teams' characteristics. The model also allows to estimate as 0.36 the probability that an underdog team wins in the NBA league, with a home advantage adding 0.09 to this probability. As shown in the first part of the paper, luck is substantially present even in the most competitive championships, which partially explains why sophisticated and complex feature-based models hardly beat simple models in the task of forecasting sports' outcomes.

【Keywords】: graphical model; sports analytics; uncertainty quantification

150. Planning Bike Lanes based on Sharing-Bikes' Trajectories.

Paper Link】 【Pages】:1377-1386

【Authors】: Jie Bao ; Tianfu He ; Sijie Ruan ; Yanhua Li ; Yu Zheng

【Abstract】: Cycling as a green transportation mode has been promoted by many governments all over the world. As a result, constructing effective bike lanes has become a crucial task for governments promoting the cycling life style, as well-planned bike paths can reduce traffic congestion and decrease safety risks for both cyclists and motor vehicle drivers. Unfortunately, existing trajectory mining approaches for bike lane planning do not consider key realistic government constraints: 1) budget limitations, 2) construction convenience, and 3) bike lane utilization. In this paper, we propose a data-driven approach to develop bike lane construction plans based on large-scale real world bike trajectory data. We enforce these constraints to formulate our problem and introduce a flexible objective function to tune the benefit between coverage of the number of users and the length of their trajectories. We prove the NP-hardness of the problem and propose greedy-based heuristics to address it. Finally, we deploy our system on Microsoft Azure, providing extensive experiments and case studies to demonstrate the effectiveness of our approach.

【Keywords】: trajectory data mining; urban computing; urban planning

151. TFX: A TensorFlow-Based Production-Scale Machine Learning Platform.

Paper Link】 【Pages】:1387-1395

【Authors】: Denis Baylor ; Eric Breck ; Heng-Tze Cheng ; Noah Fiedel ; Chuan Yu Foo ; Zakaria Haque ; Salem Haykal ; Mustafa Ispir ; Vihan Jain ; Levent Koc ; Chiu Yuen Koo ; Lukasz Lew ; Clemens Mewald ; Akshay Naresh Modi ; Neoklis Polyzotis ; Sukriti Ramesh ; Sudip Roy ; Steven Euijong Whang ; Martin Wicke ; Jarek Wilkiewicz ; Xin Zhang ; Martin Zinkevich

【Abstract】: Creating and maintaining a platform for reliably producing and deploying machine learning models requires careful orchestration of many components---a learner for generating models based on training data, modules for analyzing and validating both data as well as models, and finally infrastructure for serving models in production. This becomes particularly challenging when data changes over time and fresh models need to be produced continuously. Unfortunately, such orchestration is often done ad hoc using glue code and custom scripts developed by individual teams for specific use cases, leading to duplicated effort and fragile systems with high technical debt. We present TensorFlow Extended (TFX), a TensorFlow-based general-purpose machine learning platform implemented at Google. By integrating the aforementioned components into one platform, we were able to standardize the components, simplify the platform configuration, and reduce the time to production from the order of months to weeks, while providing platform stability that minimizes disruptions. We present the case study of one deployment of TFX in the Google Play app store, where the machine learning models are refreshed continuously as new data arrive. Deploying TFX led to reduced custom code, faster experiment cycles, and a 2% increase in app installs resulting from improved data and model analysis.

【Keywords】: continuous training; end-to-end platform; large-scale machine learning

152. LiJAR: A System for Job Application Redistribution towards Efficient Career Marketplace.

Paper Link】 【Pages】:1397-1406

【Authors】: Fedor Borisyuk ; Liang Zhang ; Krishnaram Kenthapadi

【Abstract】: Online professional social networks such as LinkedIn serve as a marketplace, wherein job seekers can find right career opportunities and job providers can reach out to potential candidates. LinkedIn's job recommendations product is a key vehicle for efficient matching between potential candidates and job postings. However, we have observed in practice that a subset of job postings receive too many applications (due to several reasons such as the popularity of the company, nature of the job, etc.), while some other job postings receive too few applications. Both cases can result in job poster dissatisfaction and may lead to discontinuation of the associated job posting contracts. At the same time, if too many job seekers compete for the same job posting, each job seeker's chance of getting this job will be reduced. In the long term, this reduces the chance of users finding jobs that they really like on the site. Therefore, it becomes beneficial for the job recommendation system to consider values provided to both job seekers as well as job posters in the marketplace. In this paper, we propose the job application redistribution problem, with the goal of ensuring that job postings do not receive too many or too few applications, while still providing job recommendations to users with the same level of relevance. We present a dynamic forecasting model to estimate the expected number of applications at the job expiration date, and algorithms to either promote or penalize jobs based on the output of the forecasting model. We also describe the system design and architecture for LiJAR, LinkedIn's Job Applications Forecasting and Redistribution system, which we have implemented and deployed in production. We perform extensive evaluation of LiJAR through both offline and online A/B testing experiments. Our production deployment of this system as part of LinkedIn's job recommendation engine has resulted in significant increase in the engagement of users for underserved jobs (6.5%) without affecting the user engagement in terms of the total number of job applications, thereby addressing the needs of job seekers as well as job providers simultaneously.

【Keywords】: job application forecasting; job application redistribution; personalized search and recommendation systems

153. A Data Science Approach to Understanding Residential Water Contamination in Flint.

Paper Link】 【Pages】:1407-1416

【Authors】: Alex Chojnacki ; Chengyu Dai ; Arya Farahi ; Guangsha Shi ; Jared Webb ; Daniel T. Zhang ; Jacob D. Abernethy ; Eric Schwartz

【Abstract】: When the residents of Flint learned that lead had contaminated their water system, the local government made water-testing kits available to them free of charge. The city government published the results of these tests, creating a valuable dataset that is key to understanding the causes and extent of the lead contamination event in Flint. This is the nation's largest dataset on lead in a municipal water system. In this paper, we predict the lead contamination for each household's water supply, and we study several related aspects of Flint's water troubles, many of which generalize well beyond this one city. For example, we show that elevated lead risks can be (weakly) predicted from observable home attributes. Then we explore the factors associated with elevated lead. These risk assessments were developed in part via a crowd sourced prediction challenge at the University of Michigan. To inform Flint residents of these assessments, they have been incorporated into a web and mobile application funded by Google.org. We also explore questions of self-selection in the residential testing program, examining which factors are linked to when and how frequently residents voluntarily sample their water.

【Keywords】: flint water crisis; machine learning; public policy; risk assessment; sampling bias; water quality

154. Estimation of Recent Ancestral Origins of Individuals on a Large Scale.

Paper Link】 【Pages】:1417-1425

【Authors】: Ross E. Curtis ; Ahna R. Girshick

【Abstract】: The last ten years have seen an exponential growth of direct-to-consumer genomics. One popular feature of these tests is the report of a distant ancestral inference profile-a breakdown of the regions of the world where the test-taker's ancestors may have lived. While current methods and products generally focus on the more distant past (e.g., thousands of years ago), we have recently demonstrated that by leveraging network analysis tools such as community detection, more recent ancestry can be identified. However, using a network analysis tool like community detection on a large network with potentially millions of nodes is not feasible in a live production environment where hundreds or thousands of new genotypes are processed every day. In this study, we describe a classification method that leverages network features to assign individuals to communities in a large network corresponding to recent ancestry. We recently launched a beta version of this research as a new product feature at AncestryDNA.

【Keywords】: applied machine learning; community detection; computational genomics; random forest classification; scalability of large systems

155. A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments.

Paper Link】 【Pages】:1427-1436

【Authors】: Pavel Dmitriev ; Somit Gupta ; Dong Woo Kim ; Garnet J. Vaz

【Abstract】: Online controlled experiments (e.g., A/B tests) are now regularly used to guide product development and accelerate innovation in software. Product ideas are evaluated as scientific hypotheses, and tested in web sites, mobile applications, desktop applications, services, and operating systems. One of the key challenges for organizations that run controlled experiments is to come up with the right set of metrics [1] [2] [3]. Having good metrics, however, is not enough. In our experience of running thousands of experiments with many teams across Microsoft, we observed again and again how incorrect interpretations of metric movements may lead to wrong conclusions about the experiment's outcome, which if deployed could hurt the business by millions of dollars. Inspired by Steven Goodman's twelve p-value misconceptions [4], in this paper, we share twelve common metric interpretation pitfalls which we observed repeatedly in our experiments. We illustrate each pitfall with a puzzling example from a real experiment, and describe processes, metric design principles, and guidelines that can be used to detect and avoid the pitfall. With this paper, we aim to increase the experimenters' awareness of metric interpretation issues, leading to improved quality and trustworthiness of experiment results and better data-driven decisions.

【Keywords】: a/b testing; controlled experiments; metrics; online experiments

156. A Century of Science: Globalization of Scientific Collaborations, Citations, and Innovations.

Paper Link】 【Pages】:1437-1446

【Authors】: Yuxiao Dong ; Hao Ma ; Zhihong Shen ; Kuansan Wang

【Abstract】: Progress in science has advanced the development of human society across history, with dramatic revolutions shaped by information theory, genetic cloning, and artificial intelligence, among the many scientific achievements produced in the 20th century. However, the way that science advances itself is much less well-understood. In this work, we study the evolution of scientific development over the past century by presenting an anatomy of 89 million digitalized papers published between 1900 and 2015. We find that science has benefited from the shift from individual work to collaborative effort, with over 90% of the world-leading innovations generated by collaborations in this century, nearly four times higher than they were in the 1900s. We discover that rather than the frequent myopic- and self-referencing that was common in the early 20th century, modern scientists instead tend to look for literature further back and farther around. Finally, we also observe the globalization of scientific development from 1900 to 2015, including 25-fold and 7-fold increases in international collaborations and citations, respectively, as well as a dramatic decline in the dominant accumulation of citations by the US, the UK, and Germany, from ~95% to ~50% over the same period. Our discoveries are meant to serve as a starter for exploring the visionary ways in which science has developed throughout the past century, generating insight into and an impact upon the current scientific innovations and funding policies.

【Keywords】: big data; diversity in science; funding policy; microsoft academic graph; science of science; scientific impact

157. FIRST: Fast Interactive Attributed Subgraph Matching.

Paper Link】 【Pages】:1447-1456

【Authors】: Boxin Du ; Si Zhang ; Nan Cao ; Hanghang Tong

【Abstract】: Attributed subgraph matching is a powerful tool for explorative mining of large attributed networks. In many applications (e.g., network science of teams, intelligence analysis, finance informatics), the user might not know what exactly s/he is looking for, and thus require the user to constantly revise the initial query graph based on what s/he finds from the current matching results. A major bottleneck in such an interactive matching scenario is the efficiency, as simply rerunning the matching algorithm on the revised query graph is computationally prohibitive. In this paper, we propose a family of effective and efficient algorithms (FIRST) to support interactive attributed subgraph matching. There are two key ideas behind the proposed methods. The first is to recast the attributed subgraph matching problem as a cross-network node similarity problem, whose major computation lies in solving a Sylvester equation for the query graph and the underlying data graph. The second key idea is to explore the smoothness between the initial and revised queries, which allows us to solve the new/updated Sylvester equation incrementally, without re-solving it from scratch. Experimental results show that our method can achieve (1) up to 16x speed-up when applying on networks with 6M$+$ nodes; (2) preserving more than 90% accuracy compared with existing methods; and (3) scales linearly with respect to the size of the data graph.

【Keywords】: graph mining; healthcare; large scale data mining; multimedia; social networks

158. Prognosis and Diagnosis of Parkinson's Disease Using Multi-Task Learning.

Paper Link】 【Pages】:1457-1466

【Authors】: Saba Emrani ; Anya McGuirk ; Wei Xiao

【Abstract】: Parkinson's disease (PD) is a debilitating neurodegenerative disease excessively affecting millions of patients. Early diagnosis of PD is critical as manifestation of symptoms occur many years after the onset of neurodegenration, when more than 60\% of dopaminergic neurons are lost. Since there is no definite diagnosis of PD, the early management of disease is a significant challenge in the field of PD therapeutics. Therefore, identifying valid biomarkers that can characterize the progression of PD has lately received growing attentions in PD research community. In this paper, we employ a multi-task learning regression framework for prediction of Parkinson's disease progression, where each task is the prediction of PD rating scales at one future time point. We then use the model to identify the important biomarkers predictive of disease progression. We adopt a graph regularization approach to capture the relationship between different tasks and penalize large variations of the model at consecutive future time points. We have carried out comprehensive experiments using different categories of measurements at baseline from Parkinson's Progression Markers Initiative (PPMI) database to predict the severity of PD, measured by unified PD rating scale. We use the learned model to identify the biomarkers with significant contribution in prediction of PD progression. Our results confirm some of the important biomarkers identified in existing medical studies, validate some of the biomarkers that have been observed as a potential marker of PD and discover new biomarkers that have not yet been investigated.

【Keywords】: biomarker identification; multi-task learning; parkinson's disease; regression

159. A Data Mining Framework for Valuing Large Portfolios of Variable Annuities.

Paper Link】 【Pages】:1467-1475

【Authors】: Guojun Gan ; Jimmy Xiangji Huang

【Abstract】: A variable annuity is a tax-deferred retirement vehicle created to address concerns that many people have about outliving their assets. In the past decade, the rapid growth of variable annuities has posed great challenges to insurance companies especially when it comes to valuing the complex guarantees embedded in these products. In this paper, we propose a novel data mining framework to address the computational issue associated with the valuation of large portfolios of variable annuity contracts. The data mining framework consists of two major components: a data clustering algorithm which is used to select representative variable annuity contracts, and a regression model which is used to predict quantities of interest for the whole portfolio based on the representative contracts. A series of numerical experiments are conducted on a portfolio of synthetic variable annuity contracts to demonstrate the performance of our proposed data mining framework in terms of accuracy and speed. The experimental results show that our proposed framework is able to produce accurate estimates of various quantities of interest and can reduce the runtime significantly.

【Keywords】: data clustering; data mining; kriging; portfolio valuation; variable annuity

160. GELL: Automatic Extraction of Epidemiological Line Lists from Open Sources.

Paper Link】 【Pages】:1477-1485

【Authors】: Saurav Ghosh ; Prithwish Chakraborty ; Bryan L. Lewis ; Maimuna S. Majumder ; Emily Cohn ; John S. Brownstein ; Madhav V. Marathe ; Naren Ramakrishnan

【Abstract】: Real-time monitoring and responses to emerging public health threats rely on the availability of timely surveillance data. During the early stages of an epidemic, the ready availability of line lists with detailed tabular information about laboratory-confirmed cases can assist epidemiologists in making reliable inferences and forecasts. Such inferences are crucial to understand the epidemiology of a specific disease early enough to stop or control the outbreak. However, construction of such line lists requires considerable human supervision and therefore, difficult to generate in real-time. In this paper, we motivate Guided Epidemiological Line List (GELL), the first tool for building automated line lists (in near real-time) from open source reports of emerging disease outbreaks. Specifically, we focus on deriving epidemiological characteristics of an emerging disease and the affected population from reports of illness. GELL uses distributed vector representations (ala word2vec) to discover a set of indicators for each line list feature. This discovery of indicators is followed by the use of dependency parsing based techniques for final extraction in tabular form. We evaluate the performance of GELL against a human annotated line list provided by HealthMap corresponding to MERS outbreaks in Saudi Arabia. We demonstrate that GELL extracts line list features with increased accuracy compared to a baseline method. We further show how these automatically extracted line list features can be used for making epidemiological inferences, such as inferring demographics and symptoms-to-hospitalization period of affected individuals.

【Keywords】: dependency parsing; epidemiological line lists; information extraction; negation detection; word embeddings

161. Google Vizier: A Service for Black-Box Optimization.

Paper Link】 【Pages】:1487-1495

【Authors】: Daniel Golovin ; Benjamin Solnik ; Subhodeep Moitra ; Greg Kochanski ; John Karro ; D. Sculley

【Abstract】: Any sufficiently complex system acts as a black box when it becomes easier to experiment with than to understand. Hence, black-box optimization has become increasingly important as systems have become more complex. In this paper we describe Google Vizier, a Google-internal service for performing black-box optimization that has become the de facto parameter tuning engine at Google. Google Vizier is used to optimize many of our machine learning models and other systems, and also provides core capabilities to Google's Cloud Machine Learning HyperTune subsystem. We discuss our requirements, infrastructure design, underlying algorithms, and advanced features such as transfer learning and automated early stopping that the service provides.

【Keywords】: bayesian inference; black-box optimization; gaussian process; machine learning

162. Predicting Clinical Outcomes Across Changing Electronic Health Record Systems.

Paper Link】 【Pages】:1497-1505

【Authors】: Jen J. Gong ; Tristan Naumann ; Peter Szolovits ; John V. Guttag

【Abstract】: Existing machine learning methods typically assume consistency in how semantically equivalent information is encoded. However, the way information is recorded in databases differs across institutions and over time, often rendering potentially useful data obsolescent. To address this problem, we map database-specific representations of information to a shared set of semantic concepts, thus allowing models to be built from or transition across different databases. We demonstrate our method on machine learning models developed in a healthcare setting. In particular, we evaluate our method using two different intensive care unit (ICU) databases and on two clinically relevant tasks, in-hospital mortality and prolonged length of stay. For both outcomes, a feature representation mapping EHR-specific events to a shared set of clinical concepts yields better results than using EHR-specific events alone.

【Keywords】: clinical risk models; electronic health records; machine learning; model portability

163. HinDroid: An Intelligent Android Malware Detection System Based on Structured Heterogeneous Information Network.

Paper Link】 【Pages】:1507-1515

【Authors】: Shifu Hou ; Yanfang Ye ; Yangqiu Song ; Melih Abdulhayoglu

【Abstract】: With explosive growth of Android malware and due to the severity of its damages to smart phone users, the detection of Android malware has become increasingly important in cybersecurity. The increasing sophistication of Android malware calls for new defensive techniques that are capable against novel threats and harder to evade. In this paper, to detect Android malware, instead of using Application Programming Interface (API) calls only, we further analyze the different relationships between them and create higher-level semantics which require more effort for attackers to evade the detection. We represent the Android applications (apps), related APIs, and their rich relationships as a structured heterogeneous information network (HIN). Then we use a meta-path based approach to characterize the semantic relatedness of apps and APIs. We use each meta-path to formulate a similarity measure over Android apps, and aggregate different similarities using multi-kernel learning. Then each meta-path is automatically weighted by the learning algorithm to make predictions. To the best of our knowledge, this is the first work to use structured HIN for Android malware detection. Comprehensive experiments on real sample collections from Comodo Cloud Security Center are conducted to compare various malware detection approaches. Promising experimental results demonstrate that our developed system HinDroid outperforms other alternative Android malware detection techniques.

【Keywords】: android malware detection; application programming interface calls; heterogeneous information network; relation analysis

164. Peeking at A/B Tests: Why it matters, and what to do about it.

Paper Link】 【Pages】:1517-1525

【Authors】: Ramesh Johari ; Pete Koomen ; Leonid Pekelis ; David Walsh

【Abstract】: This paper reports on novel statistical methodology, which has been deployed by the commercial A/B testing platform Optimizely to communicate experimental results to their customers. Our methodology addresses the issue that traditional p-values and confidence intervals give unreliable inference. This is because users of A/B testing software are known to continuously monitor these measures as the experiment is running. We provide always valid p-values and confidence intervals that are provably robust to this effect. Not only does this make it safe for a user to continuously monitor, but it empowers her to detect true effects more efficiently. This paper provides simulations and numerical studies on Optimizely's data, demonstrating an improvement in detection performance over traditional methods.

【Keywords】: a/b testing; confidence intervals; p-values; sequential hypothesis testing

165. PNP: Fast Path Ensemble Method for Movie Design.

Paper Link】 【Pages】:1527-1536

【Authors】: Danai Koutra ; Abhilash Dighe ; Smriti Bhagat ; Udi Weinsberg ; Stratis Ioannidis ; Christos Faloutsos ; Jean Bolot

【Abstract】: How can we design a product or movie that will attract, for example, the interest of Pennsylvania adolescents or liberal newspaper critics? What should be the genre of that movie and who should be in the cast? In this work, we seek to identify how we can design new movies with features tailored to a specific user population. We formulate the movie design as an optimization problem over the inference of user-feature scores and selection of the features that maximize the number of attracted users. Our approach, PNP, is based on a heterogeneous, tripartite graph of users, movies, and features (e.g. actors, directors, genres), where users rate movies and features contribute to movies. We learn the preferences by leveraging user similarities defined through different types of relations, and show that our method outperforms state-of-the-art approaches, including matrix factorization and other heterogeneous graph-based analysis. We evaluate PNP on publicly available real-world data and show that it is highly scalable and effectively provides movie designs oriented towards different groups of users, including men, women, and adolescents.

【Keywords】: heterogeneous networks; movies; product design; recommendations; user modeling; user preferences

166. Pharmacovigilance via Baseline Regularization with Large-Scale Longitudinal Observational Data.

Paper Link】 【Pages】:1537-1546

【Authors】: Zhaobin Kuang ; Peggy L. Peissig ; Vítor Santos Costa ; Richard Maclin ; David Page

【Abstract】: Several prominent public health incidents that occurred at the beginning of this century due to adverse drug events (ADEs) have raised international awareness of governments and industries about pharmacovigilance (PhV), the science and activities to monitor and prevent adverse events caused by pharmaceutical products after they are introduced to the market. A major data source for PhV is large-scale longitudinal observational databases (LODs) such as electronic health records (EHRs) and medical insurance claim databases. Inspired by the Multiple Self-Controlled Case Series (MSCCS) model, arguably the leading method for ADE discovery from LODs, we propose baseline regularization, a regularized generalized linear model that leverages the diverse health profiles available in LODs across different individuals at different times. We apply the proposed method as well as MSCCS to the Marshfield Clinic EHR. Experimental results suggest that incorporating the heterogeneity among different patients and different times help to improve the performance in identifying benchmark ADEs from the Observational Medical Outcomes Partnership ground truth

【Keywords】: adverse drug event discovery; baseline regularization; electronic health records; longitudinal data; pharmacovigilance

167. FLAP: An End-to-End Event Log Analysis Platform for System Management.

Paper Link】 【Pages】:1547-1556

【Authors】: Tao Li ; Yexi Jiang ; Chunqiu Zeng ; Bin Xia ; Zheng Liu ; Wubai Zhou ; Xiaolong Zhu ; Wentao Wang ; Liang Zhang ; Jun Wu ; Li Xue ; Dewei Bao

【Abstract】: Many systems, such as distributed operating systems, complex networks, and high throughput web-based applications, are continuously generating large volume of event logs. These logs contain useful information to help system administrators to understand the system running status and to pinpoint the system failures. Generally, due to the scale and complexity of modern systems, the generated logs are beyond the analytic power of human beings. Therefore, it is imperative to develop a comprehensive log analysis system to support effective system management. Although a number of log mining techniques have been proposed to address specific log analysis use cases, few research and industrial efforts have been paid on providing integrated systems with an end-to-end solution to facilitate the log analysis routines. In this paper, we design and implement an integrated system, called FIU Log Analysis Platform (a.k.a. FLAP), that aims to facilitate the data analytics for system event logs. FLAP provides an end-to-end solution that utilizes advanced data mining techniques to assist log analysts to conveniently, timely, and accurately conduct event log knowledge discovery, system status investigation, and system failure diagnosis. Specifically, in FLAP, state-of-the-art template learning techniques are used to extract useful information from unstructured raw logs; advanced data transformation techniques are proposed and leveraged for event transformation and storage; effective event pattern mining, event summarization, event querying, and failure prediction techniques are designed and integrated for log analytics; and user-friendly interfaces are utilized to present the informative analysis results intuitively and vividly. Since 2016, FLAP has been used by Huawei Technologies Co. Ltd for internal event log analysis, and has provided effective support in its system operation and workflow optimization.

【Keywords】: data mining system; event mining; event summarization; log processing

168. Cascade Ranking for Operational E-commerce Search.

Paper Link】 【Pages】:1557-1565

【Authors】: Shichen Liu ; Fei Xiao ; Wenwu Ou ; Luo Si

【Abstract】: In the 'Big Data' era, many real-world applications like search involve the ranking problem for a large number of items. It is important to obtain effective ranking results and at the same time obtain the results efficiently in a timely manner for providing good user experience and saving computational costs. Valuable prior research has been conducted for learning to efficiently rank like the cascade ranking (learning) model, which uses a sequence of ranking functions to progressively filter some items and rank the remaining items. However, most existing research of learning to efficiently rank in search is studied in a relatively small computing environments with simulated user queries. This paper presents novel research and thorough study of designing and deploying a Cascade model in a Large-scale Operational E-commerce Search application (CLOES), which deals with hundreds of millions of user queries per day with hundreds of servers. The challenge of the real-world application provides new insights for research: 1). Real-world search applications often involve multiple factors of preferences or constraints with respect to user experience and computational costs such as search accuracy, search latency, size of search results and total CPU cost, while most existing search solutions only address one or two factors; 2). Effectiveness of e-commerce search involves multiple types of user behaviors such as click and purchase, while most existing cascade ranking in search only models the click behavior. Based on these observations, a novel cascade ranking model is designed and deployed in an operational e-commerce search application. An extensive set of experiments demonstrate the advantage of the proposed work to address multiple factors of effectiveness, efficiency and user experience in the real-world application.

【Keywords】: cascade ranking; effectiveness and efficiency; operational e-commerce search system; user experience

169. Developing a Comprehensive Framework for Multimodal Feature Extraction.

Paper Link】 【Pages】:1567-1574

【Authors】: Quinten McNamara ; Alejandro de la Vega ; Tal Yarkoni

【Abstract】: Feature extraction is a critical component of many applied data science workflows. In recent years, rapid advances in artificial intelligence and machine learning have led to an explosion of feature extraction tools and services that allow data scientists to cheaply and effectively annotate their data along a vast array of dimensions--ranging from detecting faces in images to analyzing the sentiment expressed in coherent text. Unfortunately, the proliferation of powerful feature extraction services has been mirrored by a corresponding expansion in the number of distinct interfaces to feature extraction services. In a world where nearly every new service has its own API, documentation, and/or client library, data scientists who need to combine diverse features obtained from multiple sources are often forced to write and maintain ever more elaborate feature extraction pipelines. To address this challenge, we introduce a new open-source framework for comprehensive multimodal feature extraction. Pliers is an open-source Python package that supports standardized annotation of diverse data types (videos, images, audio, and text), and is expressly implemented with both ease-of-use and extensibility in mind. Users can apply a wide range of pre-existing feature extraction tools to their data in just a few lines of Python code, and can also easily add their own custom extractors by writing modular classes. A graph-based API enables rapid development of feature extraction pipelines that output results in a single, standardized format. We describe the package's architecture, detail its advantages over previous feature extraction toolboxes, and use a sample application to a large functional MRI dataset to illustrate how pliers can significantly reduce the time and effort required to construct simple feature extraction workflows while increasing code clarity and maintainability.

【Keywords】: feature extraction; multimodal retrieval; python; standardization; wrappers

170. Deep Choice Model Using Pointer Networks for Airline Itinerary Prediction.

Paper Link】 【Pages】:1575-1583

【Authors】: Alejandro Mottini ; Rodrigo Acuna-Agost

【Abstract】: Travel providers such as airlines and on-line travel agents are becoming more and more interested in understanding how passengers choose among alternative itineraries when searching for flights. This knowledge helps them better display and adapt their offer, taking into account market conditions and customer needs. Some common applications are not only filtering and sorting alternatives, but also changing certain attributes in real-time (e.g., changing the price). In this paper, we concentrate with the problem of modeling air passenger choices of flight itineraries. This problem has historically been tackled using classical Discrete Choice Modelling techniques. Traditional statistical approaches, in particular the Multinomial Logit model (MNL), is widely used in industrial applications due to its simplicity and general good performance. However, MNL models present several shortcomings and assumptions that might not hold in real applications. To overcome these difficulties, we present a new choice model based on Pointer Networks. Given an input sequence, this type of deep neural architecture combines Recurrent Neural Networks with the Attention Mechanism to learn the conditional probability of an output whose values correspond to positions in an input sequence. Therefore, given a sequence of different alternatives presented to a customer, the model can learn to point to the one most likely to be chosen by the customer. The proposed method was evaluated on a real dataset that combines on-line user search logs and airline flight bookings. Experimental results show that the proposed model outperforms the traditional MNL model on several metrics.

【Keywords】: discrete choice modeling; pointer networks; recurrent neural networks

171. Compass: Spatio Temporal Sentiment Analysis of US Election What Twitter Says!

Paper Link】 【Pages】:1585-1594

【Authors】: Debjyoti Paul ; Feifei Li ; Murali Krishna Teja ; Xin Yu ; Richie Frost

【Abstract】: With the widespread growth of various social network tools and platforms, analyzing and understanding societal response and crowd reaction to important and emerging social issues and events through social media data is increasingly an important problem. However, there are numerous challenges towards realizing this goal effectively and efficiently, due to the unstructured and noisy nature of social media data. The large volume of the underlying data also presents a fundamental challenge. Furthermore, in many application scenarios, it is often interesting, and in some cases critical, to discover patterns and trends based on geographical and/or temporal partitions, and keep track of how they will change overtime. This brings up the interesting problem of spatio-temporal sentiment analysis from large-scale social media data. This paper investigates this problem through a data science project called "US Election 2016, What Twitter Says". The objective is to discover sentiment on Twitter towards either the democratic or the republican party at US county and state levels over any arbitrary temporal intervals, using a large collection of geotagged tweets from a period of 6 months leading up to the US Presidential Election in 2016. Our results demonstrate that by integrating and developing a combination of machine learning and data management techniques, it ispossible to do this at scale with effective outcomes. The results of our project have the potential to be adapted towards solving and influencing other interesting social issues such as building neighborhood happiness and health indicators.

【Keywords】: bursty events; crowd sentiment; election; scalable framework; sentiment analysis; spatio-temporal

172. Backpage and Bitcoin: Uncovering Human Traffickers.

Paper Link】 【Pages】:1595-1604

【Authors】: Rebecca S. Portnoff ; Danny Yuxing Huang ; Periwinkle Doerfler ; Sadia Afroz ; Damon McCoy

【Abstract】: Sites for online classified ads selling sex are widely used by human traffickers to support their pernicious business. The sheer quantity of ads makes manual exploration and analysis unscalable. In addition, discerning whether an ad is advertising a trafficked victim or an independent sex worker is a very difficult task. Very little concrete ground truth (i.e., ads definitively known to be posted by a trafficker) exists in this space. In this work, we develop tools and techniques that can be used separately and in conjunction to group sex ads by their true owner (and not the claimed author in the ad). Specifically, we develop a machine learning classifier that uses stylometry to distinguish between ads posted by the same vs. different authors with 90% TPR and 1% FPR. We also design a linking technique that takes advantage of leakages from the Bitcoin mempool, blockchain and sex ad site, to link a subset of sex ads to Bitcoin public wallets and transactions. Finally, we demonstrate via a 4-week proof of concept using Backpage as the sex ad site, how an analyst can use these automated approaches to potentially find human traffickers.

【Keywords】: bitcoin; human trafficking; machine learning; security

173. Not All Passes Are Created Equal: Objectively Measuring the Risk and Reward of Passes in Soccer from Tracking Data.

Paper Link】 【Pages】:1605-1613

【Authors】: Paul Power ; Héctor Ruiz ; Xinyu Wei ; Patrick Lucey

【Abstract】: In soccer, the most frequent event that occurs is a pass. For a trained eye, there are a myriad of adjectives which could describe this event (e.g., "majestic pass", "conservative" to "poor-ball"). However, as these events are needed to be coded live and in real-time (most often by human annotators), the current method of grading passes is restricted to the binary labels 0 (unsuccessful) or 1 (successful). Obviously, this is sub-optimal because the quality of a pass needs to be measured on a continuous spectrum (i.e., 0 to 100%) and not a binary value. Additionally, a pass can be measured across multiple dimensions, namely: i) risk -- the likelihood of executing a pass in a given situation, and ii) reward -- the likelihood of a pass creating a chance. In this paper, we show how we estimate both the risk and reward of a pass across two seasons of tracking data captured from a recent professional soccer league with state-of-the-art performance, then showcase various use cases of our deployed passing system.

【Keywords】: clustering; passing; soccer; tracking data; unsupervised

174. MARAS: Signaling Multi-Drug Adverse Reactions.

Paper Link】 【Pages】:1615-1623

【Authors】: Xiao Qin ; Tabassum Kakar ; Susmitha Wunnava ; Elke A. Rundensteiner ; Lei Cao

【Abstract】: There is a growing need for computing-supported methods that facilitate the automated signaling of Adverse Drug Reactions (ADRs) otherwise left undiscovered from the exploding amount of ADR reports filed by patients, medical professionals and drug manufacturers. In this research, we design a Multi-Drug Adverse Reaction Analytics Strategy, called MARAS, to signal severe unknown ADRs triggered by the usage of a combination of drugs, also known as Multi-Drug Adverse Reactions (MDAR). First, MARAS features an efficient signal generation algorithm based on association rule learning that extracts non-spurious MDAR associations. Second, MARAS incorporates contextual information to detect drug combinations that are strongly associated with a set of ADRs. It groups related associations into Contextual Association Clusters (CACs) that then avail contextual information to evaluate the significance of the discovered MDAR Associations. Lastly, we use this contextual significance to rank discoveries by their notion of interestingness to signal the most compelling MDARs. To demonstrate the utility of MARAS, it is compared with state-of-the-art techniques and evaluated via case studies on datasets collected by U.S. Food and Drug Administration Adverse Event Reporting System (FAERS).

【Keywords】: adverse drug reaction; association rule learning; interestingness of association; public health surveillance

Paper Link】 【Pages】:1625-1631

【Authors】: Parikshit Shah ; Ming Yang ; Sachidanand Alle ; Adwait Ratnaparkhi ; Ben Shahshahani ; Rohit Chandra

【Abstract】: In this paper, we describe an exploration system that was implemented by the search-advertising team of a prominent web-portal to address the cold ads problem. The cold ads problem refers to the situation where, when new ads are injected into the system by advertisers, the system is unable to assign an accurate quality to the ad (in our case, the click probability). As a consequence, the advertiser may suffer from low impression volumes for these cold ads, and the overall system may perform sub-optimally if the click probabilities for new ads are not learnt rapidly. We designed a new exploration system that was adapted to search advertising and the serving constraints of the system. In this paper, we define the problem, discuss the design details of the exploration system, new evaluation criteria, and present the performance metrics that were observed by us.

【Keywords】: cold ads; exploration; search advertising

176. MOLIERE: Automatic Biomedical Hypothesis Generation System.

Paper Link】 【Pages】:1633-1642

【Authors】: Justin Sybrandt ; Michael Shtutman ; Ilya Safro

【Abstract】: Hypothesis generation is becoming a crucial time-saving technique which allows biomedical researchers to quickly discover implicit connections between important concepts. Typically, these systems operate on domain-specific fractions of public medical data. MOLIERE, in contrast, utilizes information from over 24.5 million documents and does not limit the document vocabulary. At the heart of our approach lies a multi-modal and multi-relational network of biomedical objects extracted from several heterogeneous datasets from the National Center for Biotechnology Information (NCBI). These objects include but are not limited to scientific papers, keywords, genes, proteins, diseases, and diagnoses. We model hypotheses using Latent Dirichlet Allocation applied on abstracts found near shortest paths discovered within this network. We demonstrate the effectiveness of MOLIERE by performing hypothesis generation on historical data. Our network, implementation, and resulting data are all publicly available for the broad scientific community.

【Keywords】: hypothesis generation; mining scientific publications; topic modeling; undiscovered public knowledge

177. Quick Access: Building a Smart Experience for Google Drive.

Paper Link】 【Pages】:1643-1651

【Authors】: Sandeep Tata ; Alexandrin Popescul ; Marc Najork ; Mike Colagrosso ; Julian Gibbons ; Alan Green ; Alexandre Mah ; Michael Smith ; Divanshu Garg ; Cayden Meyer ; Reuben Kan

【Abstract】: Google Drive is a cloud storage and collaboration service used by hundreds of millions of users around the world. Quick Access is a new feature in Google Drive that surfaces the most relevant documents when a user visits the home screen. Our metrics show that users locate their documents in half the time with this feature compared to previous approaches. The development of Quick Access illustrates many general challenges and constraints associated with practical machine learning such as protecting user privacy, working with data services that are not designed with machine learning in mind, and evolving product definitions. We believe that the lessons learned from this experience will be useful to practitioners tackling a wide range of applied machine learning problems.

【Keywords】: applied machine learning; neural networks; private data

178. The Simpler The Better: A Unified Approach to Predicting Original Taxi Demands based on Large-Scale Online Platforms.

Paper Link】 【Pages】:1653-1662

【Authors】: Yongxin Tong ; Yuqiang Chen ; Zimu Zhou ; Lei Chen ; Jie Wang ; Qiang Yang ; Jieping Ye ; Weifeng Lv

【Abstract】: Taxi-calling apps are gaining increasing popularity for their efficiency in dispatching idle taxis to passengers in need. To precisely balance the supply and the demand of taxis, online taxicab platforms need to predict the Unit Original Taxi Demand (UOTD), which refers to the number of taxi-calling requirements submitted per unit time (e.g., every hour) and per unit region (e.g., each POI). Predicting UOTD is non-trivial for large-scale industrial online taxicab platforms because both accuracy and flexibility are essential. Complex non-linear models such as GBRT and deep learning are generally accurate, yet require labor-intensive model redesign after scenario changes (e.g., extra constraints due to new regulations). To accurately predict UOTD while remaining flexible to scenario changes, we propose LinUOTD, a unified linear regression model with more than 200 million dimensions of features. The simple model structure eliminates the need of repeated model redesign, while the high-dimensional features contribute to accurate UOTD prediction. We further design a series of optimization techniques for efficient model training and updating. Evaluations on two large-scale datasets from an industrial online taxicab platform verify that LinUOTD outperforms popular non-linear models in accuracy. We envision our experiences to adopt simple linear models with high-dimensional features in UOTD prediction as a pilot study and can shed insights upon other industrial large-scale spatio-temporal prediction problems.

【Keywords】: feature engineering; prediction; unit original taxi demands

179. DeepSD: Generating High Resolution Climate Change Projections through Single Image Super-Resolution.

Paper Link】 【Pages】:1663-1672

【Authors】: Thomas Vandal ; Evan Kodra ; Sangram Ganguly ; Andrew R. Michaelis ; Ramakrishna R. Nemani ; Auroop R. Ganguly

【Abstract】: The impacts of climate change are felt by most critical systems, such as infrastructure, ecological systems, and power-plants. However, contemporary Earth System Models (ESM) are run at spatial resolutions too coarse for assessing effects this localized. Local scale projections can be obtained using statistical downscaling, a technique which uses historical climate observations to learn a low-resolution to high-resolution mapping. Depending on statistical modeling choices, downscaled projections have been shown to vary significantly terms of accuracy and reliability. The spatio-temporal nature of the climate system motivates the adaptation of super-resolution image processing techniques to statistical downscaling. In our work, we present DeepSD, a generalized stacked super resolution convolutional neural network (SRCNN) framework for statistical downscaling of climate variables. DeepSD augments SRCNN with multi-scale input channels to maximize predictability in statistical downscaling. We provide a comparison with Bias Correction Spatial Disaggregation as well as three Automated-Statistical Downscaling approaches in downscaling daily precipitation from 1 degree (~100km) to 1/8 degrees (~12.5km) over the Continental United States. Furthermore, a framework using the NASA Earth Exchange (NEX) platform is discussed for downscaling more than 20 ESM models with multiple emission scenarios.

【Keywords】: climate statistical downscaling; daily precipitation; deep learning; super-resolution

180. No Longer Sleeping with a Bomb: A Duet System for Protecting Urban Safety from Dangerous Goods.

Paper Link】 【Pages】:1673-1681

【Authors】: Jingyuan Wang ; Chao Chen ; Junjie Wu ; Zhang Xiong

【Abstract】: Recent years have witnessed the continuous growth of megalopolises worldwide, which makes urban safety a top priority in modern city life. Among various threats, dangerous goods such as gas and hazardous chemicals transported through and around cities have increasingly become the deadly "bomb" we sleep with every day. In both academia and government, tremendous efforts have been dedicated to dealing with dangerous goods transportation (DGT) issues, but further study is still in great need to quantify the problem and explore its intrinsic dynamics in a big data perspective. In this paper, we present a novel system called DGeye, which features a "duet" between DGT trajectory data and human mobility data for risky zones identification. Moreover, DGeye innovatively takes risky patterns as the keystones in DGT management, and builds causality networks among them for pain point identification, attribution and prediction. Experiments on both Beijing and Tianjin cities demonstrate the effectiveness of DGeye. In particular, the report generated by DGeye driven the Beijing government to lay down gas pipelines for the famous Guijie food street.

【Keywords】: causal analysis; intelligent transportation; pattern mining; urban safety

181. A Quasi-experimental Estimate of the Impact of P2P Transportation Platforms on Urban Consumer Patterns.

Paper Link】 【Pages】:1683-1692

【Authors】: Zhe Zhang ; Beibei Li

【Abstract】: With the pervasiveness of mobile technology and location-based computing, new forms of smart urban transportation, such as Uber & Lyft, have become increasingly popular. These new forms of urban infrastructure can influence individuals' movement frictions and patterns, in turn influencing local consumption patterns and the economic performance of local businesses. To gain insights about future impact of urban transportation changes, in this paper, we utilize a novel dataset and econometric analysis methods to present a quasi-experimental examination of how the emerging growth of peer-to-peer car sharing services may have affected local consumer mobility and consumption patterns.

【Keywords】: econometrics; lyft; offline consumption; sharing economy; societal impact of technolgy; transportation; uber

182. KunPeng: Parameter Server based Distributed Learning Systems and Its Applications in Alibaba and Ant Financial.

Paper Link】 【Pages】:1693-1702

【Authors】: Jun Zhou ; Xiaolong Li ; Peilin Zhao ; Chaochao Chen ; Longfei Li ; Xinxing Yang ; Qing Cui ; Jin Yu ; Xu Chen ; Yi Ding ; Yuan Alan Qi

【Abstract】: In recent years, due to the emergence of Big Data (terabytes or petabytes) and Big Model (tens of billions of parameters), there has been an ever-increasing need of parallelizing machine learning (ML) algorithms in both academia and industry. Although there are some existing distributed computing systems, such as Hadoop and Spark, for parallelizing ML algorithms, they only provide synchronous and coarse-grained operators (e.g., Map, Reduce, and Join, etc.), which may hinder developers from implementing more efficient algorithms. This motivated us to design a universal distributed platform termed KunPeng, that combines both distributed systems and parallel optimization algorithms to deal with the complexities that arise from large-scale ML. Specifically, KunPeng not only encapsulates the characteristics of data/model parallelism, load balancing, model sync-up, sparse representation, industrial fault-tolerance, etc., but also provides easy-to-use interface to empower users to focus on the core ML logics. Empirical results on terabytes of real datasets with billions of samples and features demonstrate that, such a design brings compelling performance improvements on ML programs ranging from Follow-the-Regularized-Leader Proximal algorithm to Sparse Logistic Regression and Multiple Additive Regression Trees. Furthermore, KunPeng's encouraging performance is also shown for several real-world applications including the Alibaba's Double 11 Online Shopping Festival and Ant Financial's transaction risk estimation.

【Keywords】: distributed computing system; distributed optimization algorithm; parameter server

183. Deep Embedding Forest: Forest-based Serving with Deep Embedding Features.

Paper Link】 【Pages】:1703-1711

【Authors】: Jie Zhu ; Ying Shan ; J. C. Mao ; Dong Yu ; Holakou Rahmanian ; Yi Zhang

【Abstract】: Deep Neural Networks (DNN) have demonstrated superior ability to extract high level embedding vectors from low level features. Despite the success, the serving time is still the bottleneck due to expensive run-time computation of multiple layers of dense matrices. GPGPU, FPGA, or ASIC-based serving systems require additional hardware that are not in the mainstream design of most commercial applications. In contrast, tree or forest-based models are widely adopted because of low serving cost, but heavily depend on carefully engineered features. This work proposes a Deep Embedding Forest model that benefits from the best of both worlds. The model consists of a number of embedding layers and a forest/tree layer. The former maps high dimensional (hundreds of thousands to millions) and heterogeneous low-level features to the lower dimensional (thousands) vectors, and the latter ensures fast serving. Built on top of a representative DNN model called Deep Crossing, and two forest/tree-based models including XGBoost and LightGBM, a two-step Deep Embedding Forest algorithm is demonstrated to achieve on-par or slightly better performance as compared with the DNN counterpart, with only a fraction of serving time on conventional hardware. After comparing with a joint optimization algorithm called partial fuzzification, also proposed in this paper, it is concluded that the two-step Deep Embedding Forest has achieved near optimal performance. Experiments based on large scale data sets (up to 1 billion samples) from a major sponsored search engine proves the efficacy of the proposed model.

【Keywords】: cntk; deep crossing; deep learning; deep neural network (dnn); gpu; gradient boosting machine; lightgbm; xgboost

KDD 2017 Applied Data Science Papers (Poster Papers) 49

184. A Practical Algorithm for Solving the Incoherence Problem of Topic Models In Industrial Applications.

Paper Link】 【Pages】:1713-1721

【Authors】: Amr Ahmed ; James Long ; Daniel Silva ; Yuan Wang

【Abstract】: Topic models are often applied in industrial settings to discover user profiles from activity logs where documents correspond to users and words to complex objects such as web sites and installed apps. Standard topic models ignore the content-based similarity structure between these objects largely because of the inability of the Dirichlet prior to capture such side information of word-word correlation. Several approaches were proposed to replace the Dirichlet prior with more expressive alternatives. However, this added expressivity comes with a heavy premium: inference becomes intractable and sparsity is lost which renders these alternatives not suitable for industrial scale applications. In this paper we take a radically different approach to incorporating word-word correlation in topic models by applying this side information at the posterior level rather than at the prior level. We show that this choice preserves sparsity and results in a graph-based sampler for LDA whose computational complexity is asymptotically on bar with the state of the art Alias base sampler for LDA \cite{aliasLDA}. We illustrate the efficacy of our approach over real industrial datasets that span up to billion of users, tens of millions of words and thousands of topics. To the best of our knowledge, our approach provides the first practical and scalable solution to this important problem.

【Keywords】: big data; interpretable models; knowledge representation; latent variable models; topic models; user modeling

185. Machine Learning for Encrypted Malware Traffic Classification: Accounting for Noisy Labels and Non-Stationarity.

Paper Link】 【Pages】:1723-1732

【Authors】: Blake Anderson ; David A. McGrew

【Abstract】: The application of machine learning for the detection of malicious network traffic has been well researched over the past several decades; it is particularly appealing when the traffic is encrypted because traditional pattern-matching approaches cannot be used. Unfortunately, the promise of machine learning has been slow to materialize in the network security domain. In this paper, we highlight two primary reasons why this is the case: inaccurate ground truth and a highly non-stationary data distribution. To demonstrate and understand the effect that these pitfalls have on popular machine learning algorithms, we design and carry out experiments that show how six common algorithms perform when confronted with real network data. With our experimental results, we identify the situations in which certain classes of algorithms underperform on the task of encrypted malware traffic classification. We offer concrete recommendations for practitioners given the real-world constraints outlined. From an algorithmic perspective, we find that the random forest ensemble method outperformed competing methods. More importantly, feature engineering was decisive; we found that iterating on the initial feature set, and including features suggested by domain experts, had a much greater impact on the performance of the classification system. For example, linear regression using the more expressive feature set easily outperformed the random forest method using a standard network traffic representation on all criteria considered. Our analysis is based on millions of TLS encrypted sessions collected over 12 months from a commercial malware sandbox and two geographically distinct, large enterprise networks.

【Keywords】: machine learning; malware detection; network security; tls

186. Extremely Fast Decision Tree Mining for Evolving Data Streams.

Paper Link】 【Pages】:1733-1742

【Authors】: Albert Bifet ; Jiajin Zhang ; Wei Fan ; Cheng He ; Jianfeng Zhang ; Jianfeng Qian ; Geoff Holmes ; Bernhard Pfahringer

【Abstract】: Nowadays real-time industrial applications are generating a huge amount of data continuously every day. To process these large data streams, we need fast and efficient methodologies and systems. A useful feature desired for data scientists and analysts is to have easy to visualize and understand machine learning models. Decision trees are preferred in many real-time applications for this reason, and also, because combined in an ensemble, they are one of the most powerful methods in machine learning. In this paper, we present a new system called STREAMDM-C++, that implements decision trees for data streams in C++, and that has been used extensively at Huawei. Streaming decision trees adapt to changes on streams, a huge advantage since standard decision trees are built using a snapshot of data, and can not evolve over time. STREAMDM-C++ is easy to extend, and contains more powerful ensemble methods, and a more efficient and easy to use adaptive decision trees. We compare our new implementation with VFML, the current state of the art implementation in C, and show how our new system outperforms VFML in speed using less resources.

【Keywords】: classification; data streams; decision trees; online learning

187. Real-Time Optimization of Web Publisher RTB Revenues.

Paper Link】 【Pages】:1743-1751

【Authors】: Pedro Chahuara ; Nicolas Grislain ; Gregoire Jauvion ; Jean-Michel Renders

【Abstract】: This paper describes an engine to optimize web publisher revenues from second-price auctions. These auctions are widely used to sell online ad spaces in a mechanism called real-time bidding (RTB). Optimization within these auctions is crucial for web publishers, because setting appropriate reserve prices can significantly increase revenue. We consider a practical real-world setting where the only available information before an auction occurs consists of a user identifier and an ad placement identifier. The real-world challenges we had to tackle consist mainly of tracking the dependencies on both the user and placement in an highly non-stationary environment and of dealing with censored bid observations. These challenges led us to make the following design choices: (i) we adopted a relatively simple non-parametric regression model of auction revenue based on an incremental time-weighted matrix factorization which implicitly builds adaptive users' and placements' profiles; (ii) we jointly used a non-parametric model to estimate the first and second bids' distribution when they are censored, based on an on-line extension of the Aalen's Additive model. Our engine is a component of a deployed system handling hundreds of web publishers across the world, serving billions of ads a day to hundreds of millions of visitors. The engine is able to predict, for each auction, an optimal reserve price in approximately one millisecond and yields a significant revenue increase for the web publishers.

【Keywords】: ad-tech; auctions; big-data; online learning; real-time

188. Customer Lifetime Value Prediction Using Embeddings.

Paper Link】 【Pages】:1753-1762

【Authors】: Benjamin Paul Chamberlain ; Ângelo Cardoso ; C. H. Bryan Liu ; Roberto Pagliari ; Marc Peter Deisenroth

【Abstract】: We describe the Customer LifeTime Value (CLTV) prediction system deployed at ASOS.com, a global online fashion retailer. CLTV prediction is an important problem in e-commerce where an accurate estimate of future value allows retailers to effectively allocate marketing spend, identify and nurture high value customers and mitigate exposure to losses. The system at ASOS provides daily estimates of the future value of every customer and is one of the cornerstones of the personalised shopping experience. The state of the art in this domain uses large numbers of handcrafted features and ensemble regressors to forecast value, predict churn and evaluate customer loyalty. Recently, domains including language, vision and speech have shown dramatic advances by replacing handcrafted features with features that are learned automatically from data. We detail the system deployed at ASOS and show that learning feature representations is a promising extension to the state of the art in CLTV modelling. We propose a novel way to generate embeddings of customers, which addresses the issue of the ever changing product catalogue and obtain a significant improvement over an exhaustive set of handcrafted features.

【Keywords】: customer lifetime value; e-commerce; embeddings; neural networks; random forests

189. TensorFlow Estimators: Managing Simplicity vs. Flexibility in High-Level Machine Learning Frameworks.

Paper Link】 【Pages】:1763-1771

【Authors】: Heng-Tze Cheng ; Zakaria Haque ; Lichan Hong ; Mustafa Ispir ; Clemens Mewald ; Illia Polosukhin ; Georgios Roumpos ; D. Sculley ; Jamie Smith ; David Soergel ; Yuan Tang ; Philipp Tucker ; Martin Wicke ; Cassandra Xia ; Jianwei Xie

【Abstract】: We present a framework for specifying, training, evaluating, and deploying machine learning models. Our focus is on simplifying cutting edge machine learning for practitioners in order to bring such technologies into production. Recognizing the fast evolution of the field of deep learning, we make no attempt to capture the design space of all possible model architectures in a domain-specific language (DSL) or similar configuration language. We allow users to write code to define their models, but provide abstractions that guide developers to write models in ways conducive to productionization. We also provide a unifying Estimator interface, making it possible to write downstream infrastructure (e.g. distributed training, hyperparameter tuning) independent of the model implementation. We balance the competing demands for flexibility and simplicity by offering APIs at different levels of abstraction, making common model architectures available out of the box, while providing a library of utilities designed to speed up experimentation with model architectures. To make out of the box models flexible and usable across a wide range of problems, these canned Estimators are parameterized not only over traditional hyperparameters, but also using feature columns, a declarative specification describing how to interpret input data We discuss our experience in using this framework in research and production environments, and show the impact on code health, maintainability, and development speed.

【Keywords】: deep learning; high level api; machine learning framework

190. Learning Tree-Structured Detection Cascades for Heterogeneous Networks of Embedded Devices.

Paper Link】 【Pages】:1773-1781

【Authors】: Hamid Dadkhahi ; Benjamin M. Marlin

【Abstract】: In this paper, we present a new approach to learning cascaded classifiers for use in computing environments that involve networks of heterogeneous and resource-constrained, low-power embedded compute and sensing nodes. We present a generalization of the classical linear detection cascade to the case of tree-structured cascades where different branches of the tree execute on different physical compute nodes in the network. Different nodes have access to different features, as well as access to potentially different computation and energy resources. We concentrate on the problem of jointly learning the parameters for all of the classifiers in the cascade given a fixed cascade architecture and a known set of costs required to carry out the computation at each node. To accomplish the objective of joint learning of all detectors, we propose a novel approach to combining classifier outputs during training that better matches the hard cascade setting in which the learned system will be deployed. This work is motivated by research in the area of mobile health where energy efficient real time detectors integrating information from multiple wireless on-body sensors and a smart phone are needed for real-time monitoring and the delivery of just-in-time adaptive interventions. We evaluate our framework on mobile sensor-based human activity recognition and mobile health detector learning problems.

【Keywords】: cascaded classification; low-power embedded sensing networks; mobile health

191. AESOP: Automatic Policy Learning for Predicting and Mitigating Network Service Impairments.

Paper Link】 【Pages】:1783-1792

【Authors】: Supratim Deb ; Zihui Ge ; Sastry Isukapalli ; Sarat C. Puthenpura ; Shobha Venkataraman ; He Yan ; Jennifer Yates

【Abstract】: Efficient management and control of modern and next-gen networks is of paramount importance as networks have to maintain highly reliable service quality whilst supporting rapid growth in traffic demand and new application services. Rapid mitigation of network service degradations is a key factor in delivering high service quality. Automation is vital to achieving rapid mitigation of issues, particularly at the network edge where the scale and diversity is the greatest. This automation involves the rapid detection, localization and (where possible) repair of service-impacting faults and performance impairments. However, the most significant challenge here is knowing what events to detect, how to correlate events to localize an issue and what mitigation actions should be performed in response to the identified issues. These are defined as policies to systems such as ECOMP. In this paper, we present AESOP, a data-driven intelligent system to facilitate automatic learning of policies and rules for triggering remedial actions in networks. AESOP combines best operational practices (domain knowledge) with a variety of measurement data to learn and validate operational policies to mitigate service issues in networks. AESOP's design addresses the following key challenges: (i) learning from high-dimensional noisy data, (ii) capturing multiple fault models, (iii) modeling the high service-cost of false positives, and (iv) accounting for the evolving network infrastructure. We present the design of our system and show results from our ongoing experiments to show the effectiveness of our policy leaning framework.

【Keywords】: network management; policy learning; supervised learning

192. Automated Categorization of Onion Sites for Analyzing the Darkweb Ecosystem.

Paper Link】 【Pages】:1793-1802

【Authors】: Shalini Ghosh ; Ariyam Das ; Phillip A. Porras ; Vinod Yegneswaran ; Ashish Gehani

【Abstract】: Onion sites on the darkweb operate using the Tor Hidden Service (HS) protocol to shield their locations on the Internet, which (among other features) enables these sites to host malicious and illegal content while being resistant to legal action and seizure. Identifying and monitoring such illicit sites in the darkweb is of high relevance to the Computer Security and Law Enforcement communities. We have developed an automated infrastructure that crawls and indexes content from onion sites into a large-scale data repository, called LIGHTS, with over 100M pages. In this paper we describe Automated Tool for Onion Labeling (ATOL), a novel scalable analysis service developed to conduct a thematic assessment of the content of onion sites in the LIGHTS repository. ATOL has three core components -- (a) a novel keyword discovery mechanism (ATOLKeyword) which extends analyst-provided keywords for different categories by suggesting new descriptive and discriminative keywords that are relevant for the categories; (b) a classification framework (ATOLClassify) that uses the discovered keywords to map onion site content to a set of categories when sufficient labeled data is available; (c) a clustering framework (ATOLCluster) that can leverage information from multiple external heterogeneous knowledge sources, ranging from domain expertise to Bitcoin transaction data, to categorize onion content in the absence of sufficient supervised data. The paper presents empirical results of ATOL on onion datasets derived from the LIGHTS repository, and additionally benchmarks ATOL's algorithms on the publicly available 20 Newsgroups dataset to demonstrate the reproducibility of its results. On the LIGHTS dataset, ATOLClassify gives a 12% performance gain over an analyst-provided baseline, while ATOLCluster gives a 7% improvement over state-of-the-art semi-supervised clustering algorithms. We also discuss how ATOL has been deployed and externally evaluated, as part of the LIGHTS system.

【Keywords】: classification; clustering; darkweb; keyword discovery; onion sites; semi-supervised learning

193. Toward Automated Fact-Checking: Detecting Check-worthy Factual Claims by ClaimBuster.

Paper Link】 【Pages】:1803-1812

【Authors】: Naeemul Hassan ; Fatma Arslan ; Chengkai Li ; Mark Tremayne

【Abstract】: This paper introduces how ClaimBuster, a fact-checking platform, uses natural language processing and supervised learning to detect important factual claims in political discourses. The claim spotting model is built using a human-labeled dataset of check-worthy factual claims from the U.S. general election debate transcripts. The paper explains the architecture and the components of the system and the evaluation of the model. It presents a case study of how ClaimBuster live covers the 2016 U.S. presidential election debates and monitors social media and Australian Hansard for factual claims. It also describes the current status and the long-term goals of ClaimBuster as we keep developing and expanding it.

【Keywords】: computational journalism; fact-checking; natural language processing; text classification; text mining

194. An Efficient Bandit Algorithm for Realtime Multivariate Optimization.

Paper Link】 【Pages】:1813-1821

【Authors】: Daniel N. Hill ; Houssam Nassif ; Yi Liu ; Anand Iyer ; S. V. N. Vishwanathan

【Abstract】: Optimization is commonly employed to determine the content of web pages, such as to maximize conversions on landing pages or click-through rates on search engine result pages. Often the layout of these pages can be decoupled into several separate decisions. For example, the composition of a landing page may involve deciding which image to show, which wording to use, what color background to display, etc. Such optimization is a combinatorial problem over an exponentially large decision space. Randomized experiments do not scale well to this setting, and therefore, in practice, one is typically limited to optimizing a single aspect of a web page at a time. This represents a missed opportunity in both the speed of experimentation and the exploitation of possible interactions between layout decisions Here we focus on multivariate optimization of interactive web pages. We formulate an approach where the possible interactions between different components of the page are modeled explicitly. We apply bandit methodology to explore the layout space efficiently and use hill-climbing to select optimal content in realtime. Our algorithm also extends to contextualization and personalization of layout selection. Simulation results show the suitability of our approach to large decision spaces with strong interactions between content. We further apply our algorithm to optimize a message that promotes adoption of an Amazon service. After only a single week of online optimization, we saw a 21% conversion increase compared to the median layout. Our technique is currently being deployed to optimize content across several locations at Amazon.com.

【Keywords】: a/b testing; hill-climbing; multi-armed bandit; multivariate optimization

195. Large Scale Sentiment Learning with Limited Labels.

Paper Link】 【Pages】:1823-1832

【Authors】: Vasileios Iosifidis ; Eirini Ntoutsi

【Abstract】: Sentiment analysis is an important task in order to gain insights over the huge amounts of opinions that are generated in the social media on a daily basis. Although there is a lot of work on sentiment analysis, there are no many datasets available which one can use for developing new methods and for evaluation. To the best of our knowledge, the largest dataset for sentiment analysis is TSentiment [8], a 1.6 millions machine-annotated tweets dataset covering a period of about 3 months in 2009. This dataset however is too short and therefore insufficient to study heterogeneous, fast evolving streams. Therefore, we annotated the Twitter dataset of 2015 (228 million tweets without retweets and 275 million with retweets) and we make it publicly available for research. For the annotation we leverage the power of unlabeled data, together with labeled data using semi-supervised learning and in particular, Self-Learning and Co-Training. Our main contribution is the provision of the TSentiment15 dataset together with insights from the analysis, which includes a batch and a stream-processing of the data. In the former, all labeled and unlabeled data are available to the algorithms from the beginning, whereas in the later, they are revealed gradually based on their arrival time in the stream.

【Keywords】: co-training; self-learning; semi-supervised learning; sentiment analysis

196. Optimization Beyond Prediction: Prescriptive Price Optimization.

Paper Link】 【Pages】:1833-1841

【Authors】: Shinji Ito ; Ryohei Fujimaki

【Abstract】: This paper addresses a novel data science problem, prescriptive price optimization, which derives the optimal price strategy to maximize future profit/revenue on the basis of massive predictive formulas produced by machine learning. The prescriptive price optimization first builds sales forecast formulas of multiple products, on the basis of historical data, which reveal complex relationships between sales and prices, such as price elasticity of demand and cannibalization. Then, it constructs a mathematical optimization problem on the basis of those predictive formulas. We present that the optimization problem can be formulated as an instance of binary quadratic programming (BQP). Although BQP problems are NP-hard in general and computationally intractable, we propose a fast approximation algorithm using a semi-definite programming (SDP) relaxation. Our experiments on simulation and real retail datasets show that our prescriptive price optimization simultaneously derives the optimal prices of tens/hundreds products with practical computational time, that potentially improve approximately 30% of gross profit of those products.

【Keywords】: decision making; mathematical optimization; scheduling

197. Finding Precursors to Anomalous Drop in Airspeed During a Flight's Takeoff.

Paper Link】 【Pages】:1843-1852

【Authors】: Vijay Manikandan Janakiraman ; Bryan L. Matthews ; Nikunj C. Oza

【Abstract】: Aerodynamic stall based loss of control in flight is a major cause of fatal flight accidents. In a typical takeoff, a flight's airspeed continues to increase as it gains altitude. However, in some cases, the airspeed may drop immediately after takeoff and when left uncorrected, the flight gets close to a stall condition which is extremely risky. The takeoff is a high workload period for the flight crew involving frequent monitoring, control and communication with the ground control tower. Although there exists secondary safety systems and specialized recovery maneuvers, current technology is reactive; often based on simple threshold detection and does not provide the crew with sufficient lead time. Further, with increasing complexity of automation, the crew may not be aware of the true states of the automation to take corrective actions in time. At NASA, we aim to develop decision support tools by mining historic flight data to proactively identify and manage high risk situations encountered in flight. In this paper, we present our work on finding precursors to the anomalous drop-in-airspeed (ADA) event using the ADOPT (Automatic Discovery of Precursors in Time series) algorithm. ADOPT works by converting the precursor discovery problem into a search for sub-optimal decision making in the time series data, which is modeled using reinforcement learning. We give insights about the flight data, feature selection, ADOPT modeling and results on precursor discovery. Some improvements to ADOPT algorithm are implemented that reduces its computational complexity and enables forecasting of the adverse event. Using ADOPT analysis, we have identified some interesting precursor patterns that were validated to be operationally significant by subject matter experts. The performance of ADOPT is evaluated by using the precursor scores as features to predict the drop in airspeed events.

【Keywords】: aircraft stall; anomaly detection; applications; aviation data mining; aviation safety; decision support; phm; precursor; prognostics; reinforcement learning; time series; vehicle health management

198. Ad Serving with Multiple KPIs.

Paper Link】 【Pages】:1853-1861

【Authors】: Brendan Kitts ; Michael Krishnan ; Ishadutta Yadav ; Yongbo Zeng ; Garrett Badeau ; Andrew Potter ; Sergey Tolkachov ; Ethan Thornburg ; Satyanarayana Reddy Janga

【Abstract】: Ad-servers have to satisfy many different targeting criteria, and the combination can often result in no feasible solution. We hypothesize that advertisers may be defining these metrics to create a kind of "proxy target". We therefore reformulate the standard ad-serving problem to one where we attempt to get as close as possible to the advertiser's multi-dimensional target inclusive of delivery. We use a simple simulation to illustrate the behavior of this algorithm compared to Constraint and Pacing strategies. The system is then deployed in one of the largest video ad-servers in the United States and we show experimental results from live test ads, as well as 6 months of production performance across hundreds of ads. We find that the live ad-server tests match the simulation, and we report significant gains in multi-KPI performance from using the error minimization strategy.

【Keywords】: advertising; iab; internet advertising bureau; optimization; prediction; targeting; viewability

199. Discovering Pollution Sources and Propagation Patterns in Urban Area.

Paper Link】 【Pages】:1863-1872

【Authors】: Xiucheng Li ; Yun Cheng ; Gao Cong ; Lisi Chen

【Abstract】: Air quality is one of the most important environmental concerns in the world, and it has deteriorated substantially over the past years in many countries. For example, Chinese Academy of Social Sciences reports that the problem of haze and fog in China is hitting a record level, and China is currently suffering from the worst air pollution. Among the various causal factors of air quality, particulate matter with a diameter of 2.5 micrometers or less (i.e., PM2.5) is a very important factor; governments and people are increasingly concerned with the concentration of PM2.5. In many cities, stations for monitoring PM2.5 concentration have been built by governments or companies to monitor urban air quality. Apart from monitoring, there is a rising demand for finding pollution sources of PM2.5 and discovering the transmission of PM2.5 based on the data from PM$_{2.5}$ monitoring stations. However, to the best of our knowledge, none of previous work proposes a solution to the problem of detecting pollution sources and mining pollution propagation patterns from such monitoring data. In this work, we propose the first solution for the problem, which comprises two steps. The first step is to extract the uptrend intervals and calculate the causal strengths among spatially distributed sensors; The second step is to construct causality graphs and perform frequent subgraphs mining on these causality graphs to find pollution sources and propagation patterns. We use real-life monitoring data collected by a company in our experiments. Our experimental results demonstrate significant findings regarding pollution sources and pollutant propagations in Beijing, which will be useful for governments to make policy and govern pollution sources.

【Keywords】: pollution sources; propagation patterns; real deployed system

200. Discovering Enterprise Concepts Using Spreadsheet Tables.

Paper Link】 【Pages】:1873-1882

【Authors】: Keqian Li ; Yeye He ; Kris Ganjam

【Abstract】: Existing work on knowledge discovery focuses on using natural language techniques to extract entities and relationships from textual documents. However, today relational tables are abundant in quantities, and are often well-structured with coherent data values. So far these rich relational tables have been largely overlooked for the purpose of knowledge discovery. In this work, we study the problem of building concept hierarchies using a large corpus of enterprise spreadsheet tables. Our method first groups distinct values from tables into a large hierarchical tre based on co-occurrence statistics. We then "summarize" the large tree by selecting important tree nodes that are likely good concepts based on how well they "describe" the original corpus. The result is a small concept hierarchy that is easy for humans to understand and curate. Our end-to-end algorithms are designed to run on Map-Reduce and to scale to large corpus. Experiments using real enterprise spreadsheet corpus show that proposed approach can generate concepts with high quality.

【Keywords】: clustering; enterprise concepts; hierarchical tree; knowledge discovery

201. Supporting Employer Name Normalization at both Entity and Cluster Level.

Paper Link】 【Pages】:1883-1892

【Authors】: Qiaoling Liu ; Faizan Javed ; Vachik S. Dave ; Ankita Joshi

【Abstract】: In the recruitment domain, the employer name normalization task, which links employer names in job postings or resumes to entities in an employer knowledge base (KB), is important to many business applications. In previous work, we proposed the CompanyDepot system, which used machine learning techniques to address the problem. After applying it to several applications at CareerBuilder, we faced several new challenges: 1) how to avoid duplicate normalization results when the KB is noisy and contains many duplicate entities; 2) how to address the vocabulary gap between query names and entity names in the KB; and 3) how to use the context available in jobs and resumes to improve normalization quality. To address these challenges, in this paper we extend the previous CompanyDepot system to normalize employer names not only at entity level, but also at cluster level by mapping a query to a cluster in the KB that best matches the query. We also propose a new metric based on success rate and diversity reduction ratio for evaluating the cluster-level normalization. Moreover, we perform query expansion based on five data sources to address the vocabulary gap challenge and leverage the url context for the employer names in many jobs and resumes to improve normalization quality. We show that the proposed CompanyDepot-V2 system outperforms the previous CompanyDepot system and several other baseline systems over multiple real-world datasets. We also demonstrate the large improvement on normalization quality from entity-level to cluster-level normalization.

【Keywords】: cluster-level normalization; diversity reduction ratio; employer name normalization; entity linking; success rate

202. BDT: Gradient Boosted Decision Tables for High Accuracy and Scoring Efficiency.

Paper Link】 【Pages】:1893-1901

【Authors】: Yin Lou ; Mikhail Obukhov

【Abstract】: In this paper we present gradient boosted decision tables (BDTs). A d-dimensional decision table is essentially a mapping from a sequence of d boolean tests to a real value in {R}. We propose novel algorithms to fit decision tables. Our thorough empirical study suggests that decision tables are better weak learners in the gradient boosting framework and can improve the accuracy of the boosted ensemble. In addition, we develop an efficient data structure to represent decision tables and propose a novel fast algorithm to improve the scoring efficiency for boosted ensemble of decision tables. Experiments on public classification and regression datasets demonstrate that our method is able to achieve 1.5x to 6x speedups over the boosted regression trees baseline. We complement our experimental evaluation with a bias-variance analysis that explains how different weak models influence the predictive power of the boosted ensemble. Our experiments suggest gradient boosting with randomly backfitted decision tables distinguishes itself as the most accurate method on a number of classification and regression problems. We have deployed a BDT model to LinkedIn news feed system and achieved significant lift on key metrics.

【Keywords】: classification; decision table; gradient boosting; regression

203. Dipole: Diagnosis Prediction in Healthcare via Attention-based Bidirectional Recurrent Neural Networks.

Paper Link】 【Pages】:1903-1911

【Authors】: Fenglong Ma ; Radha Chitta ; Jing Zhou ; Quanzeng You ; Tong Sun ; Jing Gao

【Abstract】: Predicting the future health information of patients from the historical Electronic Health Records (EHR) is a core research task in the development of personalized healthcare. Patient EHR data consist of sequences of visits over time, where each visit contains multiple medical codes, including diagnosis, medication, and procedure codes. The most important challenges for this task are to model the temporality and high dimensionality of sequential EHR data and to interpret the prediction results. Existing work solves this problem by employing recurrent neural networks (RNNs) to model EHR data and utilizing simple attention mechanism to interpret the results. However, RNN-based approaches suffer from the problem that the performance of RNNs drops when the length of sequences is large, and the relationships between subsequent visits are ignored by current RNN-based approaches. To address these issues, we propose Dipole, an end-to-end, simple and robust model for predicting patients' future health information. Dipole employs bidirectional recurrent neural networks to remember all the information of both the past visits and the future visits, and it introduces three attention mechanisms to measure the relationships of different visits for the prediction. With the attention mechanisms, Dipole can interpret the prediction results effectively. Dipole also allows us to interpret the learned medical code representations which are confirmed positively by medical experts. Experimental results on two real world EHR datasets show that the proposed Dipole can significantly improve the prediction accuracy compared with the state-of-the-art diagnosis prediction approaches and provide clinically meaningful interpretation.

【Keywords】: attention mechanism; bidirectional recurrent neural networks; healthcare informatics

204. Internet Device Graphs.

Paper Link】 【Pages】:1913-1921

【Authors】: Matthew Malloy ; Paul Barford ; Enis Ceyhun Alp ; Jonathan Koller ; Adria Jewell

【Abstract】: Internet device graphs identify relationships between user-centric internet connected devices such as desktops, laptops, smartphones, tablets, gaming consoles, TV's, etc. The ability to create such graphs is compelling for online advertising, content customization, recommendation systems, security, and operations. We begin by describing an algorithm for generating a device graph based on IP-colocation, and then apply the algorithm to a corpus of over 2.5 trillion internet events collected over the period of six weeks in the United States. The resulting graph exhibits immense scale with greater than 7.3 billion edges (pair-wise relationships) between more than 1.2 billion nodes (devices), accounting for the vast majority of internet connected devices in the US. Next, we apply community detection algorithms to the graph resulting in a partitioning of internet devices into 100 million small communities representing physical households. We validate this partition with a unique ground truth dataset. We report on the characteristics of the graph and the communities. Lastly, we discuss the important issues of ethics and privacy that must be considered when creating and studying device graphs, and suggest further opportunities for device graph enrichment and application.

【Keywords】: internet device graphs

205. RUSH!: Targeted Time-limited Coupons via Purchase Forecasts.

Paper Link】 【Pages】:1923-1931

【Authors】: Emaad A. Manzoor ; Leman Akoglu

【Abstract】: Time-limited promotions that exploit consumers' sense of urgency to boost sales account for billions of dollars in consumer spending each year. However, it is challenging to discover the right timing and duration of a promotion to increase its chances of being redeemed. In this work, we consider the problem of delivering time-limited discount coupons, where we partner with a large national bank functioning as a commission-based third-party coupon provider. Specifically, we use large-scale anonymized transaction records to model consumer spending and forecast future purchases, based on which we generate data-driven, personalized coupons. Our proposed model RUSH! (1) predicts {both the time and category} of the next event; (2) captures correlations between purchases in different categories (such as shopping triggering dining purchases); (3) incorporates temporal dynamics of purchase behavior (such as increased spending on weekends); (4) is composed of additive factors that are easily interpretable; and finally (5) scales linearly to millions of transactions. We design a cost-benefit framework that facilitates systematic evaluation in terms of our application, and show that RUSH! provides higher expected value than various baselines that do not jointly model time and category information.

【Keywords】: cost-benefit analysis; targeted promotions and discounts; temporal point processes; transaction data

206. Embedding-based News Recommendation for Millions of Users.

Paper Link】 【Pages】:1933-1942

【Authors】: Shumpei Okura ; Yukihiro Tagami ; Shingo Ono ; Akira Tajima

【Abstract】: It is necessary to understand the content of articles and user preferences to make effective news recommendations. While ID-based methods, such as collaborative filtering and low-rank factorization, are well known for making recommendations, they are not suitable for news recommendations because candidate articles expire quickly and are replaced with new ones within short spans of time. Word-based methods, which are often used in information retrieval settings, are good candidates in terms of system performance but have issues such as their ability to cope with synonyms and orthographical variants and define "queries" from users' historical activities. This paper proposes an embedding-based method to use distributed representations in a three step end-to-end manner: (i) start with distributed representations of articles based on a variant of a denoising autoencoder, (ii) generate user representations by using a recurrent neural network (RNN) with browsing histories as input sequences, and (iii) match and list articles for users based on inner-product operations by taking system performance into consideration. The proposed method performed well in an experimental offline evaluation using past access data on Yahoo! JAPAN's homepage. We implemented it on our actual news distribution system based on these experimental results and compared its online performance with a method that was conventionally incorporated into the system. As a result, the click-through rate (CTR) improved by 23% and the total duration improved by 10%, compared with the conventionally incorporated method. Services that incorporated the method we propose are already open to all users and provide recommendations to over ten million individual users per day who make billions of accesses per month.

【Keywords】: distributed representations; large-scale services; neural networks; news recommendations

207. Learning to Count Mosquitoes for the Sterile Insect Technique.

Paper Link】 【Pages】:1943-1949

【Authors】: Yaniv Ovadia ; Yoni Halpern ; Dilip Krishnan ; Josh Livni ; Daniel Newburger ; Ryan Poplin ; Tiantian Zha ; D. Sculley

【Abstract】: Mosquito-borne illnesses such as dengue, chikungunya, and Zika are major global health problems, which are not yet addressable with vaccines and must be countered by reducing mosquito populations. The Sterile Insect Technique (SIT) is a promising alternative to pesticides; however, effective SIT relies on minimal releases of female insects. This paper describes a multi-objective convolutional neural net to significantly streamline the process of counting male and female mosquitoes released from a SIT factory and provides a statistical basis for verifying strict contamination rate limits from these counts despite measurement noise. These results are a promising indication that such methods may dramatically reduce the cost of effective SIT methods in practice.

【Keywords】: counting from images; image modeling; quality assurance

208. An Intelligent Customer Care Assistant System for Large-Scale Cellular Network Diagnosis.

Paper Link】 【Pages】:1951-1959

【Authors】: Lujia Pan ; Jianfeng Zhang ; Patrick P. C. Lee ; Hong Cheng ; Cheng He ; Caifeng He ; Keli Zhang

【Abstract】: With the advent of cellular network technologies, mobile Internet access becomes the norm in everyday life. In the meantime, the complaints made by subscribers about unsatisfactory cellular network access also become increasingly frequent. From a network operator's perspective, achieving accurate and timely cellular network diagnosis about the causes of the complaints is critical for both improving subscriber-perceived experience and maintaining network robustness. We present the Intelligent Customer Care Assistant (ICCA), a distributed fault classification system that exploits a data-driven approach to perform large-scale cellular network diagnosis. ICCA takes massive network data as input, and realizes both offline model training and online feature computation to distinguish between user and network faults in real time. ICCA is currently deployed in a metropolitan LTE network in China that is serving around 50 million subscribers. We show via evaluation that ICCA achieves high classification accuracy (85.3%) and fast query response time (less than 2.3 seconds). We also report our experiences learned from the deployment.

【Keywords】: cellular network diagnosis; fault classification; sequential pattern mining

209. Deep Design: Product Aesthetics for Heterogeneous Markets.

Paper Link】 【Pages】:1961-1970

【Authors】: Yanxin Pan ; Alexander Burnap ; Jeffrey Hartley ; Richard Gonzalez ; Panos Y. Papalambros

【Abstract】: Aesthetic appeal is a primary driver of customer consideration for products such as automobiles. Product designers must accordingly convey design attributes (e.g., 'Sportiness'), a challenging proposition given the subjective nature of aesthetics and heterogeneous market segments with potentially different aesthetic preferences. We introduce a scalable deep learning approach that predicts how customers across different market segments perceive aesthetic designs and provides a visualization that can aid in product design. We tested this approach using a large-scale product design and crowdsourced customer data set with a Siamese neural network architecture containing a pair of conditional generative adversarial networks. The results show that the model predicts aesthetic design attributes of customers in heterogeneous market segments and provides a visualization of these aesthetic perceptions. This suggests that the proposed deep learning approach provides a scalable method for understanding customer aesthetic perceptions.

【Keywords】: automobile design; crowdsourcing; deep learning; heterogeneous markets; product aesthetics

210. Collecting and Analyzing Millions of mHealth Data Streams.

Paper Link】 【Pages】:1971-1980

【Authors】: Tom Quisel ; Luca Foschini ; Alessio Signorini ; David C. Kale

【Abstract】: Players across the health ecosystem are initiating studies of thousands, even millions, of participants to gather diverse types of data, including biomedical, behavioral, and lifestyle in order to advance medical research. These efforts to collect multi-modal data sets on large cohorts coincide with the rise of broad activity and behavior tracking across industries, particularly in healthcare and the growing field of mobile health (mHealth). Government and pharmaceutical sponsored, as well as patient-driven group studies in this arena leverage the ability of mobile technology to continuously track behaviors and environmental factors with minimal participant burden. However, the adoption of mHealth has been constrained by the lack of robust solutions for large-scale data collection in free-living conditions and concerns around data quality. In this work, we describe the infrastructure Evidation Health has developed to collect mHealth data from millions of users through hundreds of different mobile devices and apps. Additionally, we provide evidence of the utility of the data for inferring individual traits pertaining to health, wellness, and behavior. To this end, we introduce and evaluate deep neural network models that achieve high prediction performance without requiring any feature engineering when trained directly on the densely sampled multivariate mHealth time series data. We believe that the present work substantiates both the feasibility and the utility of creating a very large mHealth research cohort, as envisioned by the many large cohort studies currently underway across therapeutic areas and conditions.

【Keywords】: mhealth; neural networks; time series; wearables

211. Dispatch with Confidence: Integration of Machine Learning, Optimization and Simulation for Open Pit Mines.

Paper Link】 【Pages】:1981-1989

【Authors】: Kosta Ristovski ; Chetan Gupta ; Kunihiko Harada ; Hsiu-Khuern Tang

【Abstract】: Open pit mining operations require utilization of extremely expensive equipment such as large trucks, shovels and loaders. To remain competitive, mining companies are under pressure to increase equipment utilization and reduce operational costs. The key to this in mining operations is to have sophisticated truck assignment strategies which will ensure that equipment is utilized efficiently with minimum operating cost. To address this problem, we have implemented truck assignment approach which integrates machine learning, linear/integer programming and simulation. Our truck assignment approach takes into consideration the number of trucks and their sizes, shovels and dump locations as well as stochastic activity times during the operations. Machine learning is used to predict probability distributions of equipment activity duration. We have validated the approach using data collected from two open pit mines. Our experimental results show that our approach offers increase of 10% in efficiency. Presented results demonstrate that machine learning can bring significant value to mining industry.

【Keywords】: machine learning; mine simulator; optimization

212. "The Leicester City Fairytale?": Utilizing New Soccer Analytics Tools to Compare Performance in the 15/16 & 16/17 EPL Seasons.

Paper Link】 【Pages】:1991-2000

【Authors】: Héctor Ruiz ; Paul Power ; Xinyu Wei ; Patrick Lucey

【Abstract】: The last two years have been somewhat of a rollercoaster for English Premier League (EPL) team Leicester City. In the 2015/16 season, against all odds and logic, they won the league to much fan-fare. Fast-forward nine months later, and they are battling relegation. What could describe this fluctuating form? As soccer is a very complex and strategic game, common statistics (e.g., passes, shots, possession) do not really tell the full story on how a team succeeds and fails. However, using machine learning tools and a plethora of data, it is now possible to obtain some insights into how a team performs. To showcase the utility of these new tools (i.e., expected goal value, expected save value, strategy-plots and passing quality measures), we first analyze the EPL 2015/16 season which a specific emphasis on the champions Leicester City, and then compare it to the current one. Finally, we show how these features can be used to predict future performance.

【Keywords】: prediction; sports analytics; unstructured data

Paper Link】 【Pages】:2001-2009

【Authors】: Hesam Salehian ; Patrick D. Howell ; Chul Lee

【Abstract】: We study the problem of how to match a formally structured restaurant menu item to a large database of less structured food items that has been collected via crowd-sourcing. At first glance, this problem scenario looks like a typical text matching problem that might possibly be solved with existing text similarity learning approaches. However, due to the unique nature of our scenario and the need for scalability, our problem imposes certain restrictions on possible machine learning approaches that we can employ. We propose a novel, practical, and scalable machine learning solution architecture, consisting of two major steps. First we use a query generation approach, based on a Markov Decision Process algorithm, to reduce the time complexity of searching for matching candidates. That is then followed by a re-ranking step, using deep learning techniques, to meet our required matching quality goals. It is important to note that our proposed solution architecture has already been deployed in a real application system serving tens of millions of users, and shows great potential for practical cases of user-entered text to structured text matching, especially when scalability is crucial.

【Keywords】: convolutional neural networks; markov decision process; nutrition estimation; short text matching

214. The Fake vs Real Goods Problem: Microscopy and Machine Learning to the Rescue.

Paper Link】 【Pages】:2011-2019

【Authors】: Ashlesh Sharma ; Vidyuth Srinivasan ; Vishal Kanchan ; Lakshminarayanan Subramanian

【Abstract】: Counterfeiting of physical goods is a global problem amounting to nearly 7% of world trade. While there have been a variety of overt technologies like holograms and specialized barcodes and covert technologies like taggants and PUFs, these solutions have had a limited impact on the counterfeit market due to a variety of factors - clonability, cost or adoption barriers. In this paper, we introduce a new mechanism that uses machine learning algorithms on microscopic images of physical objects to distinguish between genuine and counterfeit versions of the same product. The underlying principle of our system stems from the idea that microscopic characteristics in a genuine product or a class of products (corresponding to the same larger product line), exhibit inherent similarities that can be used to distinguish these products from their corresponding counterfeit versions. A key building block for our system is a wide-angle microscopy device compatible with a mobile device that enables a user to easily capture the microscopic image of a large area of a physical object. Based on the captured microscopic images, we show that using machine learning algorithms (ConvNets and bag of words), one can generate a highly accurate classification engine for separating the genuine versions of a product from the counterfeit ones; this property also holds for "super-fake" counterfeits observed in the marketplace that are not easily discernible from the human eye. We describe the design of an end-to-end physical authentication system leveraging mobile devices, portable hardware and a cloud-based object verification ecosystem. We evaluate our system using a large dataset of 3 million images across various objects and materials such as fabrics, leather, pills, electronics, toys and shoes. The classification accuracy is more than 98% and we show how our system works with a cellphone to verify the authenticity of everyday objects.

【Keywords】: computer vision; conventional neural networks; microscopy; physical authentication

215. Automatic Application Identification from Billions of Files.

Paper Link】 【Pages】:2021-2030

【Authors】: Kyle Soska ; Christopher S. Gates ; Kevin A. Roundy ; Nicolas Christin

【Abstract】: Understanding how to group a set of binary files into the piece of software they belong to is highly desirable for software profiling, malware detection, or enterprise audits, among many other applications. Unfortunately, it is also extremely challenging: there is absolutely no uniformity in the ways different applications rely on different files, in how binaries are signed, or in the versioning schemes used across different pieces of software. In this paper, we show that, by combining information gleaned from a large number of endpoints (millions of computers), we can accomplish large-scale application identification automatically and reliably. Our approach relies on collecting metadata on billions of files every day, summarizing it into much smaller "sketches", and performing approximate k-nearest neighbor clustering on non-metric space representations derived from these sketches. We design and implement our proposed system using Apache Spark, show that it can process billions of files in a matter of hours, and thus could be used for daily processing. We further show our system manages to successfully identify which files belong to which application with very high precision, and adequate recall.

【Keywords】: aknn; application; big data; clustering; files; hsnw; knn; malware; security; sketch

216. Learning to Generate Rock Descriptions from Multivariate Well Logs with Hierarchical Attention.

Paper Link】 【Pages】:2031-2040

【Authors】: Bin Tong ; Martin Klinkigt ; Makoto Iwayama ; Toshihiko Yanase ; Yoshiyuki Kobayashi ; Anshuman Sahu ; Ravigopal Vennelakanti

【Abstract】: In the shale oil & gas industry, operators are looking toward big data analytics to optimize operations and reduce cost. In this paper, we mainly focus on how to assist operators in understanding the subsurface formation, thereby helping them make optimal decisions. A large number of geology reports and well logs describing the sub-surface have been accumulated over years. Issuing geology reports is more time consuming and depends more on the expertise of engineers than acquiring the well logs. To assist in issuing geology reports, we propose an encoder-decoder-based model to automatically generate rock descriptions in human-readable format from multivariate well logs. Due to the different formats of data, this task differs dramatically from image and video captioning. The challenges are how to model structured rock descriptions and leverage the information in multivariate well logs. To achieve this, we design a hierarchical structure and two forms of attention for the decoder. Extensive validations are conducted on public well data of North Dakota in the United States. We show that our model is effective in generating rock descriptions. The two forms of attention enable the provision of a better insight into relations between well-log types and rock properties with our model from a data-driven perspective.

【Keywords】: attention; geology report; neural generation; well log

217. Multi-view Learning over Retinal Thickness and Visual Sensitivity on Glaucomatous Eyes.

Paper Link】 【Pages】:2041-2050

【Authors】: Toshimitsu Uesaka ; Kai Morino ; Hiroki Sugiura ; Taichi Kiwaki ; Hiroshi Murata ; Ryo Asaoka ; Kenji Yamanishi

【Abstract】: Dense measurements of visual-field, which is necessary to detect glaucoma, is known as very costly and labor intensive. Recently, measurement of retinal-thickness can be less costly than measurement of visual-field. Thus, it is sincerely desired that the retinal-thickness could be transformed into visual-sensitivity data somehow. In this paper, we propose two novel methods to estimate the sensitivity of the visual-field with SITA-Standard mode 10-2 resolution using retinal-thickness data measured with optical coherence tomography (OCT). The first method called Affine-Structured Non-negative Matrix Factorization (ASNMF) which is able to cope with both the estimation of visual-field and the discovery of deep glaucoma knowledge. While, the second is based on Convolutional Neural Networks (CNNs) which demonstrates very high estimation performance. These methods are kinds of multi-view learning methods because they utilize visual-field and retinal thickness data simultaneously. We experimentally tested the performance of our methods from several perspectives. We found that ASNMF worked better for relatively small data size while CNNs did for relatively large data size. In addition, some clinical knowledge are discovered via ASNMF. To the best of our knowledge, this is the first paper to address the dense estimation of the visual-field based on the retinal-thickness data.

【Keywords】: convolutional neural networks (cnns); glaucoma; non-negative matrix factorization (nmf); retinal thickness

218. Dynamic Attention Deep Model for Article Recommendation by Learning Human Editors' Demonstration.

Paper Link】 【Pages】:2051-2059

【Authors】: Xuejian Wang ; Lantao Yu ; Kan Ren ; Guanyu Tao ; Weinan Zhang ; Yong Yu ; Jun Wang

【Abstract】: As aggregators, online news portals face great challenges in continuously selecting a pool of candidate articles to be shown to their users. Typically, those candidate articles are recommended manually by platform editors from a much larger pool of articles aggregated from multiple sources. Such a hand-pick process is labor intensive and time-consuming. In this paper, we study the editor article selection behavior and propose a learning by demonstration system to automatically select a subset of articles from the large pool. Our data analysis shows that (i) editors' selection criteria are non-explicit, which are less based only on the keywords or topics, but more depend on the quality and attractiveness of the writing from the candidate article, which is hard to capture based on traditional bag-of-words article representation. And (ii) editors' article selection behaviors are dynamic: articles with different data distribution come into the pool everyday and the editors' preference varies, which are driven by some underlying periodic or occasional patterns. To address such problems, we propose a meta-attention model across multiple deep neural nets to (i) automatically catch the editors' underlying selection criteria via the automatic representation learning of each article and its interaction with the meta data and (ii) adaptively capture the change of such criteria via a hybrid attention model. The attention model strategically incorporates multiple prediction models, which are trained in previous days. The system has been deployed in a commercial article feed platform. A 9-day A/B testing has demonstrated the consistent superiority of our proposed model over several strong baselines.

【Keywords】: attention models; convolutional neural network; learning by demonstration; recommendation

219. A Hybrid Framework for Text Modeling with Convolutional RNN.

Paper Link】 【Pages】:2061-2069

【Authors】: Chenglong Wang ; Feijun Jiang ; Hongxia Yang

【Abstract】: In this paper, we introduce a generic inference hybrid framework for Convolutional Recurrent Neural Network (conv-RNN) of semantic modeling of text, seamless integrating the merits on extracting different aspects of linguistic information from both convolutional and recurrent neural network structures and thus strengthening the semantic understanding power of the new framework. Besides, based on conv-RNN, we also propose a novel sentence classification model and an attention based answer selection model with strengthening power for the sentence matching and classification respectively. We validate the proposed models on a very wide variety of data sets, including two challenging tasks of answer selection (AS) and five benchmark datasets for sentence classification (SC). To the best of our knowledge, it is by far the most complete comparison results in both AS and SC. We empirically show superior performances of conv-RNN in these different challenging tasks and benchmark datasets and also summarize insights on the performances of other state-of-the-arts methodologies.

【Keywords】: answer selection; convolution neural network; hybrid framework; recurrent neural network; sentence classification; text modeling

220. Formative Essay Feedback Using Predictive Scoring Models.

Paper Link】 【Pages】:2071-2080

【Authors】: Bronwyn Woods ; David Adamson ; Shayne Miel ; Elijah Mayfield

【Abstract】: A major component of secondary education is learning to write effectively, a skill which is bolstered by repeated practice with formative guidance. However, providing focused feedback to every student on multiple drafts of each essay throughout the school year is a challenge for even the most dedicated of teachers. This paper first establishes a new ordinal essay scoring model and its state of the art performance compared to recent results in the Automated Essay Scoring field. Extending this model, we describe a method for using prediction on realistic essay variants to give rubric-specific formative feedback to writers. This method is used in Revision Assistant, a deployed data-driven educational product that provides immediate, rubric-specific, sentence-level feedback to students to supplement teacher guidance. We present initial evaluations of this feedback generation, both offline and in deployment.

【Keywords】: automated essay scoring; automated writing evaluation; automated writing feedback

221. Learning Temporal State of Diabetes Patients via Combining Behavioral and Demographic Data.

Paper Link】 【Pages】:2081-2089

【Authors】: Houping Xiao ; Jing Gao ; Long H. Vu ; Deepak S. Turaga

【Abstract】: Diabetes is a serious disease affecting a large number of people. Although there is no cure for diabetes, it can be managed. Especially, with advances in sensor technology, lots of data may lead to the improvement of diabetes management, if properly mined. However, there usually exists noise or errors in the observed behavioral data which poses challenges in extracting meaningful knowledge. To overcome this challenge, we learn the latent state which represents the patient's condition. Such states should be inferred from the behavioral data but unknown a priori. In this paper, we propose a novel framework to capture the trajectory of latent states for patients from behavioral data while exploiting their demographic differences and similarities to other patients. We conduct a hypothesis test to illustrate the importance of the demographic data in diabetes management, and validate that each behavioral feature follows an exponential or a Gaussian distribution. Integrating these aspects, we use a Demographic feature restricted hidden Markov model (DfrHMM) to estimate the trajectory of latent states by integrating the demographic and behavioral data. In DfrHMM, the latent state is mainly determined by the previous state and the demographic features in a nonlinear way. Markov Chain Monte Carlo techniques are used for model parameter estimation. Experiments on synthetic and real datasets show that DfrHMM is effective in diabetes management.

【Keywords】: diabetes management; hidden markov model; hypothesis testing

222. Local Algorithm for User Action Prediction Towards Display Ads.

Paper Link】 【Pages】:2091-2099

【Authors】: Hongxia Yang ; Yada Zhu ; Jingrui He

【Abstract】: User behavior modeling is essential in computational advertisement, which builds users' profiles by tracking their online behaviors and then delivers the relevant ads according to each user's interests and needs. Accurate models will lead to higher targeting accuracy and thus improved advertising performance. Intuitively, similar users tend to have similar behaviors towards the displayed ads (e.g., impression, click, conversion). However, to the best of our knowledge, there is not much previous work that explicitly investigates such similarities of various types of user behaviors, and incorporates them into ad response targeting and prediction, largely due to the prohibitive scale of the problem. To bridge this gap, in this paper, we use bipartite graphs to represent historical user behaviors, which consist of both user nodes and advertiser campaign nodes, as well as edges reflecting various types of user-campaign interactions in the past. Based on this representation, we study random-walk-based local algorithms for user behavior modeling and action prediction, whose computational complexity depends only on the size of the output cluster, rather than the entire graph. Our goal is to improve action prediction by leveraging historical user-user, campaign-campaign, and user-campaign interactions. In particular, we propose the bipartite graphs AdvUserGraph accompanied with the ADNI algorithm. ADNI extends the NIBBLE algorithm to AdvUserGraph, and it is able to find the local cluster consisting of interested users towards a specific advertiser campaign. We also propose two extensions of ADNI with improved efficiencies. The performance of the proposed algorithms is demonstrated on both synthetic data and a world leading Demand Side Platform (DSP), showing that they are able to discriminate extremely rare events in terms of their action propensity.

【Keywords】: computational advertisement; large scale; local graph algorithm; user action prediction

Paper Link】 【Pages】:2101-2110

【Authors】: Fan Yang ; Ajinkya Kale ; Yury Bubnov ; Leon Stein ; Qiaosong Wang ; M. Hadi Kiapour ; Robinson Piramuthu

【Abstract】: In this paper, we propose a novel end-to-end approach for scalable visual search infrastructure. We discuss the challenges we faced for a massive volatile inventory like at eBay and present our solution to overcome those. We harness the availability of large image collection of eBay listings and state-of-the-art deep learning techniques to perform visual search at scale. Supervised approach for optimized search limited to top predicted categories and also for compact binary signature are key to scale up without compromising accuracy and precision. Both use a common deep neural network requiring only a single forward inference. The system architecture is presented with in-depth discussions of its basic components and optimizations for a trade-off between search relevance and latency. This solution is currently deployed in a distributed cloud infrastructure and fuels visual search in eBay ShopBot and Close5. We show benchmark on ImageNet dataset on which our approach is faster and more accurate than several unsupervised baselines. We share our learnings with the hope that visual search becomes a first class citizen for all large scale search engines rather than an afterthought.

【Keywords】: deep learning; e-commerce; search engine; semantics; visual search

224. A Data-driven Process Recommender Framework.

Paper Link】 【Pages】:2111-2120

【Authors】: Sen Yang ; Xin Dong ; Leilei Sun ; Yichen Zhou ; Richard A. Farneth ; Hui Xiong ; Randall S. Burd ; Ivan Marsic

【Abstract】: We present an approach for improving the performance of complex knowledge-based processes by providing data-driven step-by-step recommendations. Our framework uses the associations between similar historic process performances and contextual information to determine the prototypical way of enacting the process. We introduce a novel similarity metric for grouping traces into clusters that incorporates temporal information about activity performance and handles concurrent activities. Our data-driven recommender system selects the appropriate prototype performance of the process based on user-provided context attributes. Our approach for determining the prototypes discovers the commonly performed activities and their temporal relationships. We tested our system on data from three real-world medical processes and achieved recommendation accuracy up to an F1 score of 0.77 (compared to an F1 score of 0.37 using ZeroR) with 63.2% of recommended enactments being within the first five neighbors of the actual historic enactments in a set of 87 cases. Our framework works as an interactive visual analytic tool for process mining. This work shows the feasibility of data-driven decision support system for complex knowledge-based processes.

【Keywords】: emergency medical process analysis.; process prototype extraction; process recommender system; process trace clustering

225. Predicting Optimal Facility Location without Customer Locations.

Paper Link】 【Pages】:2121-2130

【Authors】: Emre Yilmaz ; Sanem Elbasi ; Hakan Ferhatosmanoglu

【Abstract】: Deriving meaningful insights from location data helps businesses make better decisions. One critical decision made by a business is choosing a location for its new facility. Optimal location queries ask for a location to build a new facility that optimizes an objective function. Most of the existing works on optimal location queries propose solutions to return best location when the set of existing facilities and the set of customers are given. However, most businesses do not know the locations of their customers. In this paper, we introduce a new problem setting for optimal location queries by removing the assumption that the customer locations are known. We propose an optimal location predictor which accepts partial information about customer locations and returns a location for the new facility. The predictor generates synthetic customer locations by using given partial information and it runs optimal location queries with generated location data. Experiments with real data show that the predictor can find the optimal location when sufficient information is provided.

【Keywords】: data generation; location analytics; optimal location queries; prediction; uncertainty

226. DeepProbe: Information Directed Sequence Understanding and Chatbot Design via Recurrent Neural Networks.

Paper Link】 【Pages】:2131-2139

【Authors】: Zi Yin ; Keng-hao Chang ; Ruofei Zhang

【Abstract】: Information extraction and user intention identification is a central topic in modern query understanding and recommendation systems. In this paper, we propose DeepProbe, a generic information-directed interaction framework which is built around an attention-based sequence to sequence (seq2seq) recurrent neural network. DeepProbe can rephrase, evaluate, and even actively ask questions, leveraging the generative ability and likelihood estimation made possible by seq2seq models. DeepProbe makes decisions based on a derived uncertainty (entropy) measure conditioned on user inputs, possibly with multiple rounds of interactions. Three applications, namely a rewritter, a relevance scorer and a chatbot for ad recommendation, were built around DeepProbe, with the first two serving as precursory building blocks for the third. We first use the seq2seq model in DeepProbe to rewrite a user query into one of standard query form, which is submitted to an ordinary recommendation system. Secondly, we evaluate DeepProbe's seq2seq model-based relevance scoring. Finally, we build a chatbot prototype capable of making active user interactions, which can ask questions that maximize information gain, allowing for a more efficient user intention idenfication process. We evaluate first two applications by 1) comparing with baselines by BLEU and AUC, and 2) human judge evaluation. Both demonstrate significant improvements compared with current state-of-the-art systems, proving their values as useful tools on their own, and at the same time laying a good foundation for the ongoing chatbot application.

【Keywords】: attention mechanism; chatbot; deep learning; information gain; online advertising; probabilistic scoring; query rewriting; recommendation system; rnn; seq2seq; sponsored search

227. Stock Price Prediction via Discovering Multi-Frequency Trading Patterns.

Paper Link】 【Pages】:2141-2149

【Authors】: Liheng Zhang ; Charu C. Aggarwal ; Guo-Jun Qi

【Abstract】: Stock prices are formed based on short and/or long-term commercial and trading activities that reflect different frequencies of trading patterns. However, these patterns are often elusive as they are affected by many uncertain political-economic factors in the real world, such as corporate performances, government policies, and even breaking news circulated across markets. Moreover, time series of stock prices are non-stationary and non-linear, making the prediction of future price trends much challenging. To address them, we propose a novel State Frequency Memory (SFM) recurrent network to capture the multi-frequency trading patterns from past market data to make long and short term predictions over time. Inspired by Discrete Fourier Transform (DFT), the SFM decomposes the hidden states of memory cells into multiple frequency components, each of which models a particular frequency of latent trading pattern underlying the fluctuation of stock price. Then the future stock prices are predicted as a nonlinear mapping of the combination of these components in an Inverse Fourier Transform (IFT) fashion. Modeling multi-frequency trading patterns can enable more accurate predictions for various time ranges: while a short-term prediction usually depends on high frequency trading patterns, a long-term prediction should focus more on the low frequency trading patterns targeting at long-term return. Unfortunately, no existing model explicitly distinguishes between various frequencies of trading patterns to make dynamic predictions in literature. The experiments on the real market data also demonstrate more competitive performance by the SFM as compared with the state-of-the-art methods.

【Keywords】: multi-frequency trading patterns; state frequency memory; stock price prediction

228. A Taxi Order Dispatch Model based On Combinatorial Optimization.

Paper Link】 【Pages】:2151-2159

【Authors】: Lingyu Zhang ; Tao Hu ; Yue Min ; Guobin Wu ; Junying Zhang ; Pengcheng Feng ; Pinghua Gong ; Jieping Ye

【Abstract】: Taxi-booking apps have been very popular all over the world as they provide convenience such as fast response time to the users. The key component of a taxi-booking app is the dispatch system which aims to provide optimal matches between drivers and riders. Traditional dispatch systems sequentially dispatch taxis to riders and aim to maximize the driver acceptance rate for each individual order. However, the traditional systems may lead to a low global success rate, which degrades the rider experience when using the app. In this paper, we propose a novel system that attempts to optimally dispatch taxis to serve multiple bookings. The proposed system aims to maximize the global success rate, thus it optimizes the overall travel efficiency, leading to enhanced user experience. To further enhance users' experience, we also propose a method to predict destinations of a user once the taxi-booking APP is started. The proposed method employs the Bayesian framework to model the distribution of a user's destination based on his/her travel histories. We use rigorous A/B tests to compare our new taxi dispatch method with state-of-the-art models using data collected in Beijing. Experimental results show that the proposed method is significantly better than other state-of-the art models in terms of global success rate (increased from 80% to 84%). Moreover, we have also achieved significant improvement on other metrics such as user's waiting-time and pick-up distance. For our destination prediction algorithm, we show that our proposed model is superior to the baseline model by improving the top-3 accuracy from 89% to 93%. The proposed taxi dispatch and destination prediction algorithms are both deployed in our online systems and serve tens of millions of users everyday.

【Keywords】: circular distribution; combinatorial optimization; destination prediction; taxi dispatch

229. Contextual Spatial Outlier Detection with Metric Learning.

Paper Link】 【Pages】:2161-2170

【Authors】: Guanjie Zheng ; Susan L. Brantley ; Thomas Lauvaux ; Zhenhui Li

【Abstract】: Hydraulic fracturing (or "fracking") is a revolutionary well stimulation technique for shale gas extraction, but has spawned controversy in environmental contamination. If methane from gas wells leaks extensively, this greenhouse gas can impact drinking water wells and enhance global warming. Our work is motivated by this heated debate on environmental issue and focuses on general data analytical techniques to detect anomalous spatial data samples (e.g., water samples related to potential leakages). Specifically, we propose a spatial outlier detection method based on contextual neighbors. Different from existing work, our approach utilizes both spatial attributes and non-spatial contextual attributes to define neighbors. We further use robust metric learning to combine different contextual attributes in order to find meaningful neighbors. Our technique can be applied to any spatial dataset. Extensive experimental results on five real-world datasets demonstrate the effectiveness of our approach. We also show some interesting case studies, including one case linking to leakage of a gas well.

【Keywords】: metric learning; outlier detection

230. Resolving the Bias in Electronic Medical Records.

Paper Link】 【Pages】:2171-2180

【Authors】: Kaiping Zheng ; Jinyang Gao ; Kee Yuan Ngiam ; Beng Chin Ooi ; James Wei Luen Yip

【Abstract】: Electronic Medical Records (EMR) are the most fundamental resources used in healthcare data analytics. Since people visit hospital more frequently when they feel sick and doctors prescribe lab examinations when they feel necessary, we argue that there could be a strong bias in EMR observations compared with the hidden conditions of patients. Directly using such EMR for analytical tasks without considering the bias may lead to misinterpretation. To this end, we propose a general method to resolve the bias by transforming EMR to regular patient hidden condition series using a Hidden Markov Model (HMM) variant. Compared with the biased EMR series with irregular time stamps, the unbiased regular time series is much easier to be processed by most analytical models and yields better results. Extensive experimental results demonstrate that our bias resolving method imputes missing data more accurately than baselines and improves the performance of the state-of-the-art methods on typical medical data analytics.

【Keywords】: data analytics; healthcare; time series

231. STAR: A System for Ticket Analysis and Resolution.

Paper Link】 【Pages】:2181-2190

【Authors】: Wubai Zhou ; Wei Xue ; Ramesh Baral ; Qing Wang ; Chunqiu Zeng ; Tao Li ; Jian Xu ; Zheng Liu ; Larisa Shwartz ; Genady Ya. Grabarnik

【Abstract】: In large scale and complex IT service environments, a problematic incident is logged as a ticket and contains the ticket summary (system status and problem description). The system administrators log the step-wise resolution description when such tickets are resolved. The repeating service events are most likely resolved by inferring similar historical tickets. With the availability of reasonably large ticket datasets, we can have an automated system to recommend the best matching resolution for a given ticket summary. In this paper, we first identify the challenges in real-world ticket analysis and develop an integrated framework to efficiently handle those challenges. The framework first quantifies the quality of ticket resolutions using a regression model built on carefully designed features. The tickets, along with their quality scores obtained from the resolution quality quantification, are then used to train a deep neural network ranking model that outputs the matching scores of ticket summary and resolution pairs. This ranking model allows us to leverage the resolution quality in historical tickets when recommending resolutions for an incoming incident ticket. In addition, the feature vectors derived from the deep neural ranking model can be effectively used in other ticket analysis tasks, such as ticket classification and clustering. The proposed framework is extensively evaluated with a large real-world dataset.

【Keywords】: convolutional neural network; it service management; sentence model; ticket resolution

232. Optimized Cost per Click in Taobao Display Advertising.

Paper Link】 【Pages】:2191-2200

【Authors】: Han Zhu ; Junqi Jin ; Chang Tan ; Fei Pan ; Yifan Zeng ; Han Li ; Kun Gai

【Abstract】: Taobao, as the largest online retail platform in the world, provides billions of online display advertising impressions for millions of advertisers every day. For commercial purposes, the advertisers bid for specific spots and target crowds to compete for business traffic. The platform chooses the most suitable ads to display in tens of milliseconds. Common pricing methods include cost per mille (CPM) and cost per click (CPC). Traditional advertising systems target certain traits of users and ad placements with fixed bids, essentially regarded as coarse-grained matching of bid and traffic quality. However, the fixed bids set by the advertisers competing for different quality requests cannot fully optimize the advertisers' key requirements. Moreover, the platform has to be responsible for the business revenue and user experience. Thus, we proposed a bid optimizing strategy called optimized cost per click (OCPC) which automatically adjusts the bid to achieve finer matching of bid and traffic quality of page view (PV) request granularity. Our approach optimizes advertisers' demands, platform business revenue and user experience and as a whole improves traffic allocation efficiency. We have validated our approach in Taobao display advertising system in production. The online A/B test shows our algorithm yields substantially better results than previous fixed bid manner.

【Keywords】: bid optimization; display advertising; probability estimation