22. KDD 2016:San Francisco, CA, USA

Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016. ACM 【DBLP Link

Paper Num: 236 || Session Num: 8

Keynote Talks 5

1. Graphons and Machine Learning: Modeling and Estimation of Sparse Massive Networks.

Paper Link】 【Pages】:1

【Authors】: Jennifer T. Chayes

【Abstract】: There are numerous examples of sparse massive networks, in particular the Internet, WWW and online social networks. How do we model and learn these networks? In contrast to conventional learning problems, where we have many independent samples, it is often the case for these networks that we can get only one independent sample. How do we use a single snapshot today to learn a model for the network, and therefore be able to predict a similar, but larger network in the future? In the case of relatively small or moderately sized networks, it's appropriate to model the network parametrically, and attempt to learn these parameters. For massive networks, a non-parametric representation is more appropriate. In this talk, we first review the theory of graphons, developed over the last decade to describe limits of dense graphs, and the more the recent theory describing sparse graphs of unbounded average degree, including power-law graphs. We then show how to use these graphons as non-parametric models for sparse networks. Finally, we show how to get consistent estimators of these non-parametric models, and moreover how to do this in a way that protects the privacy of individuals on the network.

【Keywords】: data mining; graphons; machine learning; massive networks; sparse networks

2. Learning to Learn and Compositionality with Deep Recurrent Neural Networks: Learning to Learn and Compositionality.

Paper Link】 【Pages】:3

【Authors】: Nando de Freitas

【Abstract】: Deep neural network representations play an important role in computer vision, speech, computational linguistics, robotics, reinforcement learning and many other data-rich domains. In this talk I will show that learning-to-learn and compositionality are key ingredients for dealing with knowledge transfer so as to solve a wide range of tasks, for dealing with small-data regimes, and for continual learning. I will demonstrate this with several examples from my research team: learning to learn by gradient descent by gradient descent, neural programmers and interpreters, and learning communication.

【Keywords】: compositionality; continual learning; deep learning; learning to learn; neural programmers; optimization; program induction; recurrent neural networks; transfer learning

3. The Evolving Meaning of Information Security.

Paper Link】 【Pages】:5

【Authors】: Whitfield Diffie

【Abstract】: When you are developing security systems, new penetration techniques seem to appear as responses to new security measures but in general the flow is the other way around: security exists and evolves because of the evolution of threats. Beginning with the rise of radio in the 20th Century attacks on communication networks have shown two forms: those that go for the big kill --- such as the breaking of Enigma --- and those that assemble small seemingly innocuous leaks of information into a comprehensive understanding of the target's behavior. We will analyze the way in which these trends interact with others to create a situation in which what is possible in security and even the meaning of security in communication networks needs reexamination.

【Keywords】: communication security; information security; security systems

4. People, Computers, and The Hot Mess of Real Data.

Paper Link】 【Pages】:7

【Authors】: Joseph M. Hellerstein

【Abstract】: In practice, end-to-end data analysis is rarely a cleanly engineered process. Acquiring data can be tricky. Data assessment, wrangling and feature extraction are time-consuming and subjective. Models and algorithms used to derive data products are highly contextualized by time-varying properties of data sources, code and application needs. All of these issues would ideally benefit from an organizational view, but are often driven by individual users. Viewed holistically, both agile analytics and the establishment of analytic pipelines involve interactions between people, computation and infrastructure. In this talk I'll share some anecdotes from our research, user studies, and field experience with companies (Trifacta, Captricity), as well as an emerging open-source project (Ground).

【Keywords】: artificial intelligence; data mining; data science; real data; user studies

5. A VC View of Investing in ML.

Paper Link】 【Pages】:9

【Authors】: Greg Papadopoulos

【Abstract】: We are seeing a remarkable watershed in the application of data science across markets and industries. A trifecta of advances in algorithms, cheap cycles, and the capture of networked data from everywhere are no doubt the catalysts. The results for many are continuous improvements in efficiencies, and for some are a fundamental re-imagination and disruption of just about every industry. This talk will give examples we are seeing (and funding!) for the latter, and then focus on our views of the ecosystem of value-from-data infrastructure and end-application companies. A big question is whether the enormous collective advances in tools, techniques and education are in-fact converting would-be differentiated products into democratized features used everywhere. We'll follow the value and make our own predictions on future as ML as a business.

【Keywords】: data mining; data science; investing; machine learning; venture

Panel 1

6. Big Data Needs Big Dreamers: Lessons from Successful Big Data Investors.

Paper Link】 【Pages】:11-12

【Authors】: Evangelos Simoudis ; Mark Gorenberg ; Tim Guleri ; Matt Ocko ; Greg Sands

【Abstract】:

【Keywords】: panel

Applied Data Science Track Full Papers 40

7. Designing Policy Recommendations to Reduce Home Abandonment in Mexico.

Paper Link】 【Pages】:13-20

【Authors】: Klaus Ackermann ; Eduardo Blancas Reyes ; Sue He ; Thomas Anderson Keller ; Paul van der Boor ; Romana Khan ; Rayid Ghani ; José Carlos González

【Abstract】: Infonavit, the largest provider of mortgages in Mexico, assists working families to obtain low-interest rate housing solutions. An increasingly prevalent problem is home abandonment: when a homeowner decides to leave their property and forego their investment. A major causal factor of this outcome is a mismatch between the homeowner's needs, in terms of access to services and employment, and the location characteristics of the home. This paper describes our collaboration with Infonavit to reduce home abandonment at two levels: develop policy recommendations for targeted improvements in location characteristics, and develop a decision-support tool to assist the homeowner in the home location decision. Using 20 years of mortgage history data combined with surveys, census, and location information, we develop a model to predict the probability of home abandonment based on both individual and location characteristics. The model is used to develop a tool that provides Infonavit the ability to give advice to Mexican workers when they apply for a loan, evaluate and improve the locations of new housing developments, and provide data-driven recommendations to the federal government to influence local development initiatives and infrastructure investments. The result is improving economic outcomes for the citizens of Mexico by pre-emptively identifying at-risk home mortgages, thereby allowing them to be altered or remedied before they result in abandonment.

【Keywords】: Mexico; data science; home abandonment; machine learning; social good

8. Aircraft Trajectory Prediction Made Easy with Predictive Analytics.

Paper Link】 【Pages】:21-30

【Authors】: Samet Ayhan ; Hanan Samet

【Abstract】: At the heart of Air Traffic Management (ATM) lies the Decision Support Systems (DST) that rely upon accurate trajectory prediction to determine how the airspace will look like in the future to make better decisions and advisories. Dealing with airspace that is prone to congestion due to environmental factors still remains the challenge especially when a deterministic approach is used in the trajectory prediction process. In this paper, we describe a novel stochastic trajectory prediction approach for ATM that can be used for more efficient and realistic flight planning and to assist airspace flow management, potentially resulting in higher safety, capacity, and efficiency commensurate with fuel savings thereby reducing emissions for a better environment. Our approach considers airspace as a 3D grid network, where each grid point is a location of a weather observation. We hypothetically build cubes around these grid points, so the entire airspace can be considered as a set of cubes. Each cube is defined by its centroid, the original grid point, and associated weather parameters that remain homogeneous within the cube during a period of time. Then, we align raw trajectories to a set of cube centroids which are basically fixed 3D positions independent of trajectory data. This creates a new form of trajectories which are 4D joint cubes, where each cube is a segment that is associated with not only spatio-temporal attributes but also with weather parameters. Next, we exploit machine learning techniques to train inference models from historical data and apply a stochastic model, a Hidden Markov Model (HMM), to predict trajectories taking environmental uncertainties into account. During the process, we apply time series clustering to generate input observations from an excessive set of weather parameters to feed into the Viterbi algorithm. Our experiments use a real trajectory dataset with pertaining weather observations and demonstrate the effectiveness of our approach to the trajectory prediction process for ATM.

【Keywords】: air traffic management; aircraft trajectory prediction; hidden markov model; predictive analytics; time series

9. Matrix Computations and Optimization in Apache Spark.

Paper Link】 【Pages】:31-38

【Authors】: Reza Bosagh Zadeh ; Xiangrui Meng ; Alexander Ulanov ; Burak Yavuz ; Li Pu ; Shivaram Venkataraman ; Evan R. Sparks ; Aaron Staple ; Matei Zaharia

【Abstract】: We describe matrix computations available in the cluster programming framework, Apache Spark. Out of the box, Spark provides abstractions and implementations for distributed matrices and optimization routines using these matrices. When translating single-node algorithms to run on a distributed cluster, we observe that often a simple idea is enough: separating matrix operations from vector operations and shipping the matrix operations to be ran on the cluster, while keeping vector operations local to the driver. In the case of the Singular Value Decomposition, by taking this idea to an extreme, we are able to exploit the computational power of a cluster, while running code written decades ago for a single core. Another example is our Spark port of the popular TFOCS optimization package, originally built for MATLAB, which allows for solving Linear programs as well as a variety of other convex programs. We conclude with a comprehensive set of benchmarks for hardware accelerated matrix computations from the JVM, which is interesting in its own right, as many cluster programming frameworks use the JVM. The contributions described in this paper are already merged into Apache Spark and available on Spark installations by default, and commercially supported by a slew of companies which provide further services.

【Keywords】: distributed linear algebra; machine learning; matrix computations; mllib; optimization; spark

10. Predicting Disk Replacement towards Reliable Data Centers.

Paper Link】 【Pages】:39-48

【Authors】: Mirela Madalina Botezatu ; Ioana Giurgiu ; Jasmina Bogojeska ; Dorothea Wiesmann

【Abstract】: Disks are among the most frequently failing components in today's IT environments. Despite a set of defense mechanisms such as RAID, the availability and reliability of the system are still often impacted severely. In this paper, we present a highly accurate SMART-based analysis pipeline that can correctly predict the necessity of a disk replacement even 10-15 days in advance. Our method has been built and evaluated on more than 30000 disks from two major manufacturers, monitored over 17 months. Our approach employs statistical techniques to automatically detect which SMART parameters correlate with disk replacement and uses them to predict the replacement of a disk with even 98% accuracy.

【Keywords】: changepoint; classification; disk replacement; time series

11. Developing a Data-Driven Player Ranking in Soccer Using Predictive Model Weights.

Paper Link】 【Pages】:49-55

【Authors】: Joel Brooks ; Matthew Kerr ; John V. Guttag

【Abstract】: Quantitative evaluation of the ability of soccer players to contribute to team offensive performance is typically based on goals scored, assists made, and shots taken. In this paper, we describe a novel player ranking system based entirely on the value of passes completed. This value is derived based on the relationship of pass locations in a possession and shot opportunities generated. This relationship is learned by applying a supervised machine learning model to pass locations in event data from the 2012-2013 La Liga season. Interestingly, though this metric is based entirely on passes, the derived player rankings are largely consistent with general perceptions of offensive ability, e.g., Messi and Ronaldo are near the top. Additionally, when used to rank midfielders, it separates the more offensively-minded players from others.

【Keywords】: machine learning; soccer analytics; sports analytics

12. The Legislative Influence Detector: Finding Text Reuse in State Legislation.

Paper Link】 【Pages】:57-66

【Authors】: Matthew Burgess ; Eugenia Giraudy ; Julian Katz-Samuels ; Joe Walsh ; Derek Willis ; Lauren Haynes ; Rayid Ghani

【Abstract】: State legislatures introduce at least 45,000 bills each year. However, we lack a clear understanding of who is actually writing those bills. As legislators often lack the time and staff to draft each bill, they frequently copy text written by other states or interest groups. However, existing approaches to detect text reuse are slow, biased, and incomplete. Journalists or researchers who want to know where a particular bill originated must perform a largely manual search. Watchdog organizations even hire armies of volunteers to monitor legislation for matches. Given the time-consuming nature of the analysis, journalists and researchers tend to limit their analysis to a subset of topics (e.g. abortion or gun control) or a few interest groups. This paper presents the Legislative Influence Detector (LID). LID uses the Smith-Waterman local alignment algorithm to detect sequences of text that occur in model legislation and state bills. As it is computationally too expensive to run this algorithm on a large corpus of data, we use a search engine built using Elasticsearch to limit the number of comparisons. We show how system has found 45,405 instances of bill-to-bill text reuse and 14,137 instances of model-legislation-to-bill text reuse. System reduces the time it takes to manually find text reuse from days to seconds.

【Keywords】: government transparency; social good

13. Identifying Police Officers at Risk of Adverse Events.

Paper Link】 【Pages】:67-76

【Authors】: Samuel Carton ; Jennifer Helsby ; Kenneth Joseph ; Ayesha Mahmud ; Youngsoo Park ; Joe Walsh ; Crystal Cody ; C. P. T. Estella Patterson ; Lauren Haynes ; Rayid Ghani

【Abstract】: Adverse events between police and the public, such as deadly shootings or instances of racial profiling, can cause serious or deadly harm, damage police legitimacy, and result in costly litigation. Evidence suggests these events can be prevented by targeting interventions based on an Early Intervention System (EIS) that flags police officers who are at a high risk for involvement in such adverse events. Today's EIS are not data-driven and typically rely on simple thresholds based entirely on expert intuition. In this paper, we describe our work with the Charlotte-Mecklenburg Police Department (CMPD) to develop a machine learning model to predict which officers are at risk for an adverse event. Our approach significantly outperforms CMPD's existing EIS, increasing true positives by ~12% and decreasing false positives by ~32%. Our work also sheds light on features related to officer characteristics, situational factors, and neighborhood factors that are predictive of adverse events. This work provides a starting point for police departments to take a comprehensive, data-driven approach to improve policing and reduce harm to both officers and members of the public.

【Keywords】: law enforcement; police

14. Data-Driven Metric Development for Online Controlled Experiments: Seven Lessons Learned.

Paper Link】 【Pages】:77-86

【Authors】: Alex Deng ; Xiaolin Shi

【Abstract】: Online controlled experiments, also called A/B testing, have been established as the mantra for data-driven decision making in many web-facing companies. In recent years, there are emerging research works focusing on building the platform and scaling it up, best practices and lessons learned to obtain trustworthy results, and experiment design techniques and various issues related to statistical inference and testing. However, despite playing a central role in online controlled experiments, there is little published work on treating metric development itself as a data-driven process. In this paper, we focus on the topic of how to develop meaningful and useful metrics for online services in their online experiments, and show how data-driven techniques and criteria can be applied in metric development process. In particular, we emphasize two fundamental qualities for the goal metrics (or Overall Evaluation Criteria) of any online service: directionality and sensitivity. We share lessons on why these two qualities are critical, how to measure these two qualities of metrics of interest, how to develop metrics with clear directionality and high sensitivity by using approaches based on user behavior models and data-driven calibration, and how to choose the right goal metrics for the entire online services.

【Keywords】: A/B testing; controlled experiment; online metrics; user experience evaluation; web measurement

15. Catch Me If You Can: Detecting Pickpocket Suspects from Large-Scale Transit Records.

Paper Link】 【Pages】:87-96

【Authors】: Bowen Du ; Chuanren Liu ; Wenjun Zhou ; Zhenshan Hou ; Hui Xiong

【Abstract】: Massive data collected by automated fare collection (AFC) systems provide opportunities for studying both personal traveling behaviors and collective mobility patterns in the urban area. Existing studies on the AFC data have primarily focused on identifying passengers' movement patterns. In this paper, however, we creatively leveraged such data for identifying thieves in the public transit systems. Indeed, stopping pickpockets in the public transit systems has been critical for improving passenger satisfaction and public safety. However, it is challenging to tell thieves from regular passengers in practice. To this end, we developed a suspect detection and surveillance system, which can identify pick-pocket suspects based on their daily transit records. Specifically, we first extracted a number of features from each passenger's daily activities in the transit systems. Then, we took a two-step approach that exploits the strengths of unsupervised outlier detection and supervised classification models to identify thieves, who exhibit abnormal traveling behaviors. Experimental results demonstrated the effective- ness of our method. We also developed a prototype system with a user-friendly interface for the security personnel.

【Keywords】: anomaly detection; automated fare collection; mobility patterns; public safety; travel behaviors

16. Email Volume Optimization at LinkedIn.

Paper Link】 【Pages】:97-106

【Authors】: Rupesh Gupta ; Guanfeng Liang ; Hsiao-Ping Tseng ; Ravi Kiran Holur Vijay ; Xiaoyu Chen ; Rómer Rosales

【Abstract】: Online social networking services distribute various types of messages to their members. Common types of messages include news, connection requests, membership notifications, promotions and event notifications. Such communication, if used judiciously, can provide an enormous value to members thereby keeping them engaged. However sending a message for every instance of news, connection request, or the like can result in an overwhelming number of messages in a member's mailbox. This may result in reduced effectiveness of communication if the messages are not sufficiently relevant to the member's interests. It may also result in a poor brand perception of the networking service. In this paper we discuss our strategy and experience with regard to the problem of email volume optimization at LinkedIn. In particular, we present a cost-benefit analysis of sending emails, the key factors to administer an effective volume optimization, our algorithm for volume optimization, the architecture of the supporting system and experimental results from online A/B tests.

【Keywords】: email; machine learning; optimization

17. Large-Scale Item Categorization in e-Commerce Using Multiple Recurrent Neural Networks.

Paper Link】 【Pages】:107-115

【Authors】: JungWoo Ha ; Hyuna Pyo ; Jeonghee Kim

【Abstract】: Precise item categorization is a key issue in e-commerce domains. However, it still remains a challenging problem due to data size, category skewness, and noisy metadata. Here, we demonstrate a successful report on a deep learning-based item categorization method, i.e., deep categorization network (DeepCN), in an e-commerce website. DeepCN is an end-to-end model using multiple recurrent neural networks (RNNs) dedicated to metadata attributes for generating features from text metadata and fully connected layers for classifying item categories from the generated features. The categorization errors are propagated back through the fully connected layers to the RNNs for weight update in the learning process. This deep learning-based approach allows diverse attributes to be integrated into a common representation, thus overcoming sparsity and scalability problems. We evaluate DeepCN on large-scale real-world data including more than 94 million items with approximately 4,100 leaf categories from a Korean e-commerce website. Experiment results show our method improves the categorization accuracy compared to the model using single RNN as well as a standard classification model using unigram-based bag-of-words. Furthermore, we investigate how much the model parameters and the used attributes influence categorization performances.

【Keywords】: deep learning; e-commerce; large-scale item categorization; recurrent neural networks

18. Online Dual Decomposition for Performance and Delivery-Based Distributed Ad Allocation.

Paper Link】 【Pages】:117-126

【Authors】: Jim C. Huang ; Rodolphe Jenatton ; Cédric Archambeau

【Abstract】: Online optimization is central to display advertising, where we must sequentially allocate ad impressions to maximize the total welfare among advertisers, while respecting various advertiser-specified long-term constraints (e.g., total amount of the ad's budget that is consumed at the end of the campaign). In this paper, we present the online dual decomposition (ODD) framework for large-scale, online, distributed ad allocation, which combines dual decomposition and online convex optimization. ODD allows us to account for the distributed and the online nature of the ad allocation problem and is extensible to a variety of ad allocation problems arising in real-world display advertising systems. Moreover, ODD does not require assumptions about auction dynamics, stochastic or adversarial feedback, or any other characteristics of the ad marketplace. We further provide guarantees for the online solution as measured by bounds on cumulative regret. The regret analysis accounts for the impact of having to estimate constraints in an online setting before they are observed and for the dependence on the smoothness with which constraints and constraint violations are generated. We provide an extensive set of results from a large-scale production advertising system at Amazon to validate the framework and compare its behavior to various ad allocation algorithms.

【Keywords】: ad allocation; budget pacing; campaign optimization; demand-side platform; display advertising; distributed optimization; long-term constraints

Paper Link】 【Pages】:127-136

【Authors】: Bo Jin ; Chao Che ; Kuifei Yu ; Yue Qu ; Li Guo ; Cuili Yao ; Ruiyun Yu ; Qiang Zhang

【Abstract】: Patent litigation not only covers legal and technical issues, it is also a key consideration for managers of high-technology (high-tech) companies when making strategic decisions. Patent litigation influences the market value of high-tech companies. However, this raises unique challenges. To this end, in this paper, we develop a novel recommendation framework to solve the problem of litigation risk prediction. We will introduce a specific type of patent-related litigation, that is, Section 337 investigations, which prohibit all acts of unfair competition, or any unfair trade practices, when exporting products to the United States. To build this recommendation framework, we collect and exploit a large amount of published information related to almost all Section 337 investigation cases. This study has two aims: (1) to predict the litigation risk in a specific industry category for high-tech companies and (2) to predict the litigation risk from competitors for high-tech companies. These aims can be achieved by mining historical investigation cases and related patents. Specifically, we propose two methods to meet the needs of both aims: a proximal slope one predictor and a time-aware predictor. Several factors are considered in the proposed methods, including the litigation risk if a company wants to enter a new market and the risk that a potential competitor would file a lawsuit against the new entrant. Comparative experiments using real-world data demonstrate that the proposed methods outperform several baselines with a significant margin.

【Keywords】: collaborative filtering; patent litigation; proximal slope one predictor; section 337 investigations; time-aware predictor

20. Ranking Universities Based on Career Outcomes of Graduates.

Paper Link】 【Pages】:137-144

【Authors】: Navneet Kapur ; Nikita I. Lytkin ; Bee-Chung Chen ; Deepak Agarwal ; Igor Perisic

【Abstract】: Every year, millions of new students enter higher educational programs. Publicly available rankings of academic programs play a key role in prospective students' decisions regarding which universities to apply to and enroll in. While surveys indicate that majority of freshmen enter college to get good jobs after graduation, established methodologies for ranking universities rely on indirect indicators of career outcomes such as reputational assessments of the universities among academic peers, acceptance and graduation rates, learning environment, and availability of research funding. In addition, many of these methodologies rely on arbitrary choices of weighting factors for the different ranking indicators, and suffer from lack of analyses of statistical stability. In this paper, we addresses these challenges holistically by developing a novel methodology for ranking and recommending universities for different professions on the basis of career outcomes of professionals who graduated from those schools. Our methodology incorporates a number of techniques for achieving statistical stability, and represents a step towards personalized educational recommendations based on interests and ambitions of individuals. We have applied this methodology on LinkedIn's Economic Graph data of over 400 million professional from around the world. The resulting university rankings have been made available to the public and demonstrate that there are valuable insights to be gleaned from professional career data on LinkedIn.

【Keywords】: company rankings; educational recommendations; statistics; university rankings

21. Predictors without Borders: Behavioral Modeling of Product Adoption in Three Developing Countries.

Paper Link】 【Pages】:145-154

【Authors】: Muhammad R. Khan ; Joshua E. Blumenstock

【Abstract】: Billions of people around the world live without access to banks or other formal financial institutions. In the past several years, many mobile operators have launched "Mobile Money" platforms that deliver basic financial services over the mobile phone network. While many believe that these services can improve the lives of the poor, in many countries adoption of Mobile Money still remains anemic. In this paper, we develop a predictive model of Mobile Money adoption that uses billions of mobile phone communications records to understand the behavioral determinants of adoption. We describe a novel approach to feature engineering that uses a Deterministic Finite Automaton to construct thousands of behavioral metrics of phone use from a concise set of recursive rules. These features provide the foundation for a predictive model that is tested on mobile phone operators logs from Ghana, Pakistan, and Zambia, three very different developing-country contexts. The results highlight the key correlates of Mobile Money use in each country, as well as the potential for such methods to predict and drive adoption. More generally, our analysis provides insight into the extent to which homogenized supervised learning methods can generalize across geographic contexts. We find that without careful tuning, a model that performs very well in one country frequently does not generalize to another.

【Keywords】: feature engineering; gradient boosting; mobilemoney; product adoption; supervised learning

22. Repeat Buyer Prediction for E-Commerce.

Paper Link】 【Pages】:155-164

【Authors】: Guimei Liu ; Tam T. Nguyen ; Gang Zhao ; Wei Zha ; Jianbo Yang ; Jianneng Cao ; Min Wu ; Peilin Zhao ; Wei Chen

【Abstract】: A large number of new buyers are often acquired by merchants during promotions. However, many of the attracted buyers are one-time deal hunters, and the promotions may have little long-lasting impact on sales. It is important for merchants to identify who can be converted to regular loyal buyers and then target them to reduce promotion cost and increase the return on investment (ROI). At International Joint Conferences on Artificial Intelligence (IJCAI) 2015, Alibaba hosted an international competition for repeat buyer prediction based on the sales data of the ``Double 11" shopping event in 2014 at Tmall.com. We won the first place at stage 1 of the competition out of 753 teams. In this paper, we present our winning solution, which consists of comprehensive feature engineering and model training. We created profiles for users, merchants, brands, categories, items and their interactions via extensive feature engineering. These profiles are not only useful for this particular prediction task, but can also be used for other important tasks in e-commerce, such as customer segmentation, product recommendation, and customer base augmentation for brands. Feature engineering is often the most important factor for the success of a prediction task, but not much work can be found in the literature on feature engineering for prediction tasks in e-commerce. Our work provides some useful hints and insights for data science practitioners in e-commerce.

【Keywords】: e-commerce; feature engineering; repeat buyer prediction

23. Audience Expansion for Online Social Network Advertising.

Paper Link】 【Pages】:165-174

【Authors】: Haishan Liu ; David Pardoe ; Kun Liu ; Manoj Thakur ; Frank Cao ; Chongzhe Li

【Abstract】: Online social network advertising platforms, such as that provided by LinkedIn, generally allow marketers to specify targeting options so that their ads appear to a desired demographic. Audience Expansion is a technique developed at LinkedIn to simplify targeting and identify new audiences with similar attributes to the original target audience. We developed two methods to achieve Audience Expansion: campaign-agnostic expansion and campaign-aware expansion. In this paper, we describe the details of these methods, present in-depth analysis of their trade-offs, and demonstrate a hybrid strategy that possesses the combined strength of both methods. Through large scale online experiments, we show the effectiveness of the proposed approach, and as a result, the benefits it brings to the whole marketplace including both LinkedIn and advertisers. The achieved benefits can be characterized as: 1) simplified targeting process and increased reach for advertisers, and 2) better utilization of LinkedIn's ads inventory and higher and more efficient market participation.

【Keywords】: audience expansion; lookalike modeling; online advertising

24. From Online Behaviors to Offline Retailing.

Paper Link】 【Pages】:175-184

【Authors】: Ping Luo ; Su Yan ; Zhiqiang Liu ; Zhiyong Shen ; Shengwen Yang ; Qing He

【Abstract】: To combat the ease of online shopping in pajamas, offline mall owners focus increasingly on driving satisfaction and improving retention by identifying customers' preferences. However, most of these studies are based on customers' offline consuming history only. Benefiting from the internet, we can also get customers' online behaviors, such as the search logs, web browsing logs, online shopping logs, and so on. Might these seemingly irrelevant information from two different modalities (i.e. online and offline) be somehow interrelated? How can we make use of the online behaviors and offline actions jointly to promote recommendation for offline retailing? In this study, we formulate this task as a cross-modality recommendation problem, and present its solution via a proposed probabilistic graphical model, called Online-to-Offline Topic Modeling (O2OTM). Specifically, this method explicitly models the relationships between online and offline topics so that the likelihood of both online and offline behaviors is maximized. Then, the recommendation is made only based on the pairs of online and offline topics, denoted by (t,l), with high values of lift, such that the existence of the online topic $t$ greatly increases the response on the corresponding offline topic $l$ compared with the average response for the population without the online topic t. Furthermore, we evaluate this solution in both live and retrospect experiments. The real-world deployment of this model for the anniversary promotion campaign of a famous shopping mall in Beijing shows that our approach increases the occurred customer purchases per promotion message by 29.75\% compared with the baseline. Also, our model finds some interesting interpretable relationships between the online search topics and offline brand topics.

【Keywords】: brands recommendation; recommendation explanation; topic modeling

25. Firebird: Predicting Fire Risk and Prioritizing Fire Inspections in Atlanta.

Paper Link】 【Pages】:185-194

【Authors】: Michael Madaio ; Shang-Tse Chen ; Oliver L. Haimson ; Wenwen Zhang ; Xiang Cheng ; Matthew Hinds-Aldrich ; Duen Horng Chau ; Bistra Dilkina

【Abstract】: The Atlanta Fire Rescue Department (AFRD), like many municipal fire departments, actively works to reduce fire risk by inspecting commercial properties for potential hazards and fire code violations. However, AFRD's fire inspection practices relied on tradition and intuition, with no existing data-driven process for prioritizing fire inspections or identifying new properties requiring inspection. In collaboration with AFRD, we developed the Firebird framework to help municipal fire departments identify and prioritize commercial property fire inspections, using machine learning, geocoding, and information visualization. Firebird computes fire risk scores for over 5,000 buildings in the city, with true positive rates of up to 71% in predicting fires. It has identified 6,096 new potential commercial properties to inspect, based on AFRD's criteria for inspection. Furthermore, through an interactive map, Firebird integrates and visualizes fire incidents, property information and risk scores to help AFRD make informed decisions about fire inspections. Firebird has already begun to make positive impact at both local and national levels. It is improving AFRD's inspection processes and Atlanta residents' safety, and was highlighted by National Fire Protection Association (NFPA) as a best practice for using data to inform fire inspections.

【Keywords】: data science; fire risk; government innovation; interactive visualization; predictive modeling

26. DopeLearning: A Computational Approach to Rap Lyrics Generation.

Paper Link】 【Pages】:195-204

【Authors】: Eric Malmi ; Pyry Takala ; Hannu Toivonen ; Tapani Raiko ; Aristides Gionis

【Abstract】: Writing rap lyrics requires both creativity to construct a meaningful, interesting story and lyrical skills to produce complex rhyme patterns, which form the cornerstone of good flow. We present a rap lyrics generation method that captures both of these aspects. First, we develop a prediction model to identify the next line of existing lyrics from a set of candidate next lines. This model is based on two machine-learning techniques: the RankSVM algorithm and a deep neural network model with a novel structure. Results show that the prediction model can identify the true next line among 299 randomly selected lines with an accuracy of 17%, i.e., over 50 times more likely than by random. Second, we employ the prediction model to combine lines from existing songs, producing lyrics with rhyme and a meaning. An evaluation of the produced lyrics shows that in terms of quantitative rhyme density, the method outperforms the best human rappers by 21%. The rap lyrics generator has been deployed as an online tool called DeepBeat, and the performance of the tool has been assessed by analyzing its usage logs. This analysis shows that machine-learned rankings correlate with user preferences.

【Keywords】: deep learning; information retrieval; lyrics generation; natural language processing; rankSVM; rap

27. EMBERS at 4 years: Experiences operating an Open Source Indicators Forecasting System.

Paper Link】 【Pages】:205-214

【Authors】: Sathappan Muthiah ; Patrick Butler ; Rupinder Paul Khandpur ; Parang Saraf ; Nathan Self ; Alla Rozovskaya ; Liang Zhao ; Jose Cadena ; Chang-Tien Lu ; Anil Vullikanti ; Achla Marathe ; Kristen Maria Summers ; Graham Katz ; Andy Doyle ; Jaime Arredondo ; Dipak K. Gupta ; David Mares ; Naren Ramakrishnan

【Abstract】: EMBERS is an anticipatory intelligence system forecasting population-level events in multiple countries of Latin America. A deployed system from 2012, EMBERS has been generating alerts 24x7 by ingesting a broad range of data sources including news, blogs, tweets, machine coded events,currency rates, and food prices. In this paper, we describe our experiences operating EMBERS continuously for nearly 4 years, with specific attention to the discoveries it has enabled, correct as well as missed forecasts, lessons learnt from participating in a forecasting tournament, and our perspectives on the limits of forecasting including ethical considerations.

【Keywords】: civil unrest; event forecasting; open source indicators

28. Anomaly Detection Using Program Control Flow Graph Mining From Execution Logs.

Paper Link】 【Pages】:215-224

【Authors】: Animesh Nandi ; Atri Mandal ; Shubham Atreja ; Gargi Banerjee Dasgupta ; Subhrajit Bhattacharya

【Abstract】: We focus on the problem of detecting anomalous run-time behavior of distributed applications from their execution logs. Specifically we mine templates and template sequences from logs to form a control flow graph (cfg) spanning distributed components. This cfg represents the baseline healthy system state and is used to flag deviations from the expected behavior of runtime logs. The novelty in our work stems from the new techniques employed to: (1) overcome the instrumentation requirements or application specific assumptions made in prior log mining approaches, (2) improve the accuracy of mined templates and the cfg in the presence of long parameters and high amount of interleaving respectively, and (3) improve by orders of magnitude the scalability of the cfg mining process in terms of volume of log data that can be processed per day. We evaluate our approach using (a) synthetic log traces and (b) multiple real-world log datasets collected at different layers of application stack. Results demonstrate that our template mining, cfg mining, and anomaly detection algorithms have high accuracy. The distributed implementation of our pipeline is highly scalable and has more than 500 GB/day of log data processing capability even on a 10 low-end VM based (Spark + Hadoop) cluster. We also demonstrate the efficacy of our end-to-end system using a case study with the Openstack VM provisioning system.

【Keywords】: application monitoring; control flow mining; distributed applications; execution logs; log analytics; spark; template

29. Engagement Capacity and Engaging Team Formation for Reach Maximization of Online Social Media Platforms.

Paper Link】 【Pages】:225-234

【Authors】: Alexander G. Nikolaev ; Shounak Gore ; Venu Govindaraju

【Abstract】: The challenges of assessing the "health" of online social media platforms and strategically growing them are recognized by many practitioners and researchers. For those platforms that primarily rely on user-generated content, the reach -- the degree of participation referring to the percentage and involvement of users -- is a key indicator of success. This paper lays a theoretical foundation for measuring engagement as a driver of reach that achieves growth via positive externality effects. The paper takes a game theoretic approach to quantifying engagement, viewing a platform's social capital as a cooperatively created value and finding a fair distribution of this value among the contributors. It introduces engagement capacity, a measure of the ability of users and user groups to engage peers, and formulates the Engaging Team Formation Problem (EngTFP) to identify the sets of users that "make a platform go". We show how engagement capacity can be useful in characterizing forum user behavior and in the reach maximization efforts. We also stress how engagement analysis differs from influence measurement. Computational investigations with Twitter and Health Forum data reveal the properties of engagement capacity and the utility of EngTFP.

【Keywords】: engagement; influence maximization; reach; social networks; team formation problem

30. Boosted Decision Tree Regression Adjustment for Variance Reduction in Online Controlled Experiments.

Paper Link】 【Pages】:235-244

【Authors】: Alexey Poyarkov ; Alexey Drutsa ; Andrey Khalyavin ; Gleb Gusev ; Pavel Serdyukov

【Abstract】: Nowadays, the development of most leading web services is controlled by online experiments that qualify and quantify the steady stream of their updates achieving more than a thousand concurrent experiments per day. Despite the increasing need for running more experiments, these services are limited in their user traffic. This situation leads to the problem of finding a new or improving existing key performance metric with a higher sensitivity and lower variance. We focus on the problem of variance reduction for engagement metrics of user loyalty that are widely used in A/B testing of web services. We develop a general framework that is based on evaluation of the mean difference between the actual and the approximated values of the key performance metric (instead of the mean of this metric). On the one hand, it allows us to incorporate the state-of-the-art techniques widely used in randomized experiments of clinical and social research, but limitedly used in online evaluation. On the other hand, we propose a new class of methods based on advanced machine learning algorithms, including ensembles of decision trees, that, to the best of our knowledge, have not been applied earlier to the problem of variance reduction. We validate the variance reduction approaches on a very large set of real large-scale A/B experiments run at Yandex for different engagement metrics of user loyalty. Our best approach demonstrates $63\%$ average variance reduction (which is equivalent to 63% saved user traffic) and detects the treatment effect in $2$ times more A/B experiments.

【Keywords】: A/B test; prediction; variance reduction

31. Dynamic and Robust Wildfire Risk Prediction System: An Unsupervised Approach.

Paper Link】 【Pages】:245-254

【Authors】: Mahsa Salehi ; Laura Irina Rusu ; Timothy M. Lynar ; Anna Phan

【Abstract】: Ability to predict the risk of damaging events (e.g. wildfires) is crucial in helping emergency services in their decision making processes, to mitigate and reduce the impact of such events. Today, wildfire rating systems have been in operation extensively in many countries around the world to estimate the danger of wildfires. In this paper we propose a data-driven approach to predict wildfire risk using weather data. We show how we address the inherent challenge arising due to the temporal dynamicity of weather data. Weather observations naturally change in time, with finer-scale variation (e.g. stationary day or night) or large variations (nonstationary day or night), and this determines a temporal variation of the predicted wildfire danger. We show how our dynamic wildfire danger prediction model addresses the aforementioned challenge using context-based anomaly detection techniques. We call our predictive model a Context-Based Fire Risk (CBFR) model. The advantage of our model is that it maintains multiple historical models for different temporal variations (e.g. day versus night), and uses ensemble learning techniques to predict wildfire risk with high accuracy. In addition, it is completely unsupervised and does not rely on expert knowledge, which makes it flexible and easily applied to any region of interest. Our CBFR model is also scalable and can potentially be parallelised to speed up computation. We have considered multiple wildfire locations in the Blue Mountains, Australia as a case study, and compared the results of our system with the existing well-established Australian wildfire rating system. The experimental results show that our predictive model has a substantially higher accuracy in predicting wildfire risk, which makes it an effective model to supplement the operational Australian wildfire rating system.

【Keywords】: bushfires; context-based anomaly detection; data stream mining; risk prediction; unsupervised learning; wildfires

32. Deep Crossing: Web-Scale Modeling without Manually Crafted Combinatorial Features.

Paper Link】 【Pages】:255-262

【Authors】: Ying Shan ; T. Ryan Hoens ; Jian Jiao ; Haijing Wang ; Dong Yu ; J. C. Mao

【Abstract】: Manually crafted combinatorial features have been the "secret sauce" behind many successful models. For web-scale applications, however, the variety and volume of features make these manually crafted features expensive to create, maintain, and deploy. This paper proposes the Deep Crossing model which is a deep neural network that automatically combines features to produce superior models. The input of Deep Crossing is a set of individual features that can be either dense or sparse. The important crossing features are discovered implicitly by the networks, which are comprised of an embedding and stacking layer, as well as a cascade of Residual Units. Deep Crossing is implemented with a modeling tool called the Computational Network Tool Kit (CNTK), powered by a multi-GPU platform. It was able to build, from scratch, two web-scale models for a major paid search engine, and achieve superior results with only a sub-set of the features used in the production models. This demonstrates the potential of using Deep Crossing as a general modeling paradigm to improve existing products, as well as to speed up the development of new models with a fraction of the investment in feature engineering and acquisition of deep domain knowledge.

【Keywords】: CNTK; DSSM; GPU; combinatorial features; convolutional neural network (CNN); deep crossing; deep learning; multi-GPU platform; neural networks; residual net

33. Question Independent Grading using Machine Learning: The Case of Computer Program Grading.

Paper Link】 【Pages】:263-272

【Authors】: Gursimran Singh ; Shashank Srikant ; Varun Aggarwal

【Abstract】: Learning supervised models to grade open-ended responses is an expensive process. A model has to be trained for every prompt/question separately, which in turn requires graded samples. In automatic programming evaluation specifically, the focus of this work, this issue is amplified. The models have to be trained not only for every question but also for every language the question is offered in. Moreover, the availability and time taken by experts to create a labeled set of programs for each question is a major bottleneck in scaling such a system. We address this issue by presenting a method to grade computer programs which requires no manually assigned labeled samples for grading responses to a new, unseen question. We extend our previous work [25] wherein we introduced a grammar of features to learn question specific models. In this work, we propose a method to transform those features into a set of features that maintain their structural relation with the labels across questions. Using these features we learn one supervised model, across questions for a given language, which can then be applied to an ungraded response to an unseen question. We show that our method rivals the performance of both, question specific models and the consensus among human experts while substantially outperforming extant ways of evaluating codes. We demonstrate the system single s value by deploying it to grade programs in a high stakes assessment. The learning from this work is transferable to other grading tasks such as math question grading and also provides a new variation to the supervised learning approach.

【Keywords】: MOOC; automatic grading; feature engineering; one-class learning; question independent learning; recruitment; supervised learning

34. Contextual Intent Tracking for Personal Assistants.

Paper Link】 【Pages】:273-282

【Authors】: Yu Sun ; Nicholas Jing Yuan ; Yingzi Wang ; Xing Xie ; Kieran McDonald ; Rui Zhang

【Abstract】: A new paradigm of recommendation is emerging in intelligent personal assistants such as Apple's Siri, Google Now, and Microsoft Cortana, which recommends "the right information at the right time" and proactively helps you "get things done". This type of recommendation requires precisely tracking users' contemporaneous intent, i.e., what type of information (e.g., weather, stock prices) users currently intend to know, and what tasks (e.g., playing music, getting taxis) they intend to do. Users' intent is closely related to context, which includes both external environments such as time and location, and users' internal activities that can be sensed by personal assistants. The relationship between context and intent exhibits complicated co-occurring and sequential correlation, and contextual signals are also heterogeneous and sparse, which makes modeling the context intent relationship a challenging task. To solve the intent tracking problem, we propose the Kalman filter regularized PARAFAC2 (KP2) nowcasting model, which compactly represents the structure and co-movement of context and intent. The KP2 model utilizes collaborative capabilities among users, and learns for each user a personalized dynamic system that enables efficient nowcasting of users' intent. Extensive experiments using real-world data sets from a commercial personal assistant show that the KP2 model significantly outperforms various methods, and provides inspiring implications for deploying large-scale proactive recommendation systems in personal assistants.

【Keywords】: context-aware recommendation; intelligent personal assistant; intent tracking; multi-task learning; nowcasting; proactive triggers

35. An Empirical Study on Recommendation with Multiple Types of Feedback.

Paper Link】 【Pages】:283-292

【Authors】: Liang Tang ; Bo Long ; Bee-Chung Chen ; Deepak Agarwal

【Abstract】: User feedback like clicks and ratings on recommended items provides important information for recommender systems to predict users' interests in unseen items. Most systems rely on models trained using a single type of feedback, e.g., ratings for movie recommendation and clicks for online news recommendation. However, in addition to the primary feedback, many systems also allow users to provide other types of feedback, e.g., liking or sharing an article, or hiding all articles from a source. These additional feedback potentially provides extra information for the recommendation models. To optimize user experience and business objectives, it is important for a recommender system to use both the primary feedback and additional feedback. This paper presents an empirical study on various training methods for incorporating multiple user feedback types based on LinkedIn recommendation products. We study three important problems that we face at LinkedIn: (1) Whether to send an email based on clicks and complaints, (2) how to rank updates in LinkedIn feeds based on clicks and hides and (3) how jointly optimize for viral actions and clicks in LinkedIn feeds. Extensive offline experiments on historical data show the effectiveness of these methods in different situations. Online A/B testing results further demonstrate the impact of these methods on LinkedIn production systems.

【Keywords】: multi-objective optimization; personalized recommendation; recommender system

36. An Engagement-Based Customer Lifetime Value System for E-commerce.

Paper Link】 【Pages】:293-302

【Authors】: Ali Vanderveld ; Addhyan Pandey ; Angela Han ; Rajesh Parekh

【Abstract】: A comprehensive understanding of individual customer value is crucial to any successful customer relationship management strategy. It is also the key to building products for long-term value returns. Modeling customer lifetime value (CLTV) can be fraught with technical difficulties, however, due to both the noisy nature of user-level behavior and the potentially large customer base. Here we describe a new CLTV system that solves these problems. This was built at Groupon, a large global e-commerce company, where confronting the unique challenges of local commerce means quickly iterating on new products and the optimal inventory to appeal to a wide and diverse audience. Given current purchaser frequency we need a faster way to determine the health of individual customers, and given finite resources we need to know where to focus our energy. Our CLTV system predicts future value on an individual user basis with a random forest model which includes features that account for nearly all aspects of each customer's relationship with our platform. This feature set includes those quantifying engagement via email and our mobile app, which give us the ability to predict changes in value far more quickly than models based solely on purchase behavior. We further model different customer types, such as one-time buyers and power users, separately so as to allow for different feature weights and to enhance the interpretability of our results. Additionally, we developed an economical scoring framework wherein we re-score a user when any trigger events occur and apply a decay function otherwise, to enable frequent scoring of a large customer base with a complex model. This system is deployed, predicting the value of hundreds of millions of users on a daily cadence, and is actively being used across our products and business initiatives.

【Keywords】: customer lifetime value; e-commerce; random forests

37. Identifying Earmarks in Congressional Bills.

Paper Link】 【Pages】:303-311

【Authors】: Ellery Wulczyn ; Madian Khabsa ; Vrushank Vora ; Matthew Heston ; Joe Walsh ; Christopher Berry ; Rayid Ghani

【Abstract】: Earmarks are legislative provisions that direct federal funds to specific projects, circumventing the competitive grant-making process of federal agencies. Identifying and cataloging earmarks is a tedious, time-consuming process carried out by experts from public interest groups. In this paper, we present a machine learning system for automatically extracting earmarks from congressional bills and reports. We first describe a table-parsing algorithm for extracting budget allocations from appropriations tables in congressional bills. We then use machine learning classifiers to identify budget allocations as earmarked objects with an out of sample ROC AUC score of 0.89. Using this system, we construct the first publicly available database of earmarks dating back to 1995. Our machine learning approach adds transparency, accuracy, and speed to the congressional appropriations process.

【Keywords】: earmarks; information extraction; machine learning; natural language processing

38. Evaluating Mobile Apps with A/B and Quasi A/B Tests.

Paper Link】 【Pages】:313-322

【Authors】: Ya Xu ; Nanyu Chen

【Abstract】: We have seen an explosive growth of mobile usage, particularly on mobile apps. It is more important than ever to be able to properly evaluate mobile app release. A/B testing is a standard framework to evaluate new ideas. We have seen much of its applications in the online world across the industry [9,10,12]. Running A/B tests on mobile apps turns out to be quite different, and much of it is attributed to the fact that we cannot ship code easily to mobile apps other than going through a lengthy build, review and release process. Mobile infrastructure and user behavior differences also contribute to how A/B tests are conducted differently on mobile apps, which will be discussed in details in this paper. In addition to measuring features individually in the new app version through randomized A/B tests, we have a unique opportunity to evaluate the mobile app as a whole using the quasi-experimental framework [21]. Not all features can be A/B tested due to infrastructure changes and wholistic product redesign. We propose and establish quasi-experimental techniques for measuring impact from mobile app release, with results shared from a recent major app launch at LinkedIn.

【Keywords】: A/B testing; causal inference; mobile; quasi-experiments

39. Ranking Relevance in Yahoo Search.

Paper Link】 【Pages】:323-332

【Authors】: Dawei Yin ; Yuening Hu ; Jiliang Tang ; Tim Daly Jr. ; Mianwei Zhou ; Hua Ouyang ; Jianhui Chen ; Changsung Kang ; Hongbo Deng ; Chikashi Nobata ; Jean-Marc Langlois ; Yi Chang

【Abstract】: Search engines play a crucial role in our daily lives. Relevance is the core problem of a commercial search engine. It has attracted thousands of researchers from both academia and industry and has been studied for decades. Relevance in a modern search engine has gone far beyond text matching, and now involves tremendous challenges. The semantic gap between queries and URLs is the main barrier for improving base relevance. Clicks help provide hints to improve relevance, but unfortunately for most tail queries, the click information is too sparse, noisy, or missing entirely. For comprehensive relevance, the recency and location sensitivity of results is also critical. In this paper, we give an overview of the solutions for relevance in the Yahoo search engine. We introduce three key techniques for base relevance -- ranking functions, semantic matching features and query rewriting. We also describe solutions for recency sensitive relevance and location sensitive relevance. This work builds upon 20 years of existing efforts on Yahoo search, summarizes the most recent advances and provides a series of practical relevance solutions. The performance reported is based on Yahoo's commercial search engine, where tens of billions of urls are indexed and served by the ranking system.

【Keywords】: deep learning; learning to rank; query rewriting; semantic matching

40. Identifying Decision Makers from Professional Social Networks.

Paper Link】 【Pages】:333-342

【Authors】: Shipeng Yu ; Evangelia Christakopoulou ; Abhishek Gupta

【Abstract】: Sales professionals help organizations win clients for products and services. Generating new clients starts with identifying the right decision makers at the target organization. For the past decade, online professional networks have collected tremendous amount of data on people's identity, their network and behavior data of buyers and sellers building relationships with each other for a variety of use-cases. Sales professionals are increasingly relying on these networks to research, identify and reach out to potential prospects, but it is often hard to find the right people effectively and efficiently. In this paper we present LDMS, the LinkedIn Decision Maker Score, to quantify the ability of making a sales decision for each of the 400M+ LinkedIn members. It is the key data-driven technology underlying Sales Navigator, a proprietary LinkedIn product that is designed for sales professionals. We will specifically discuss the modeling challenges of LDMS, and present two graph-based approaches to tackle this problem by leveraging the professional network data at LinkedIn. Both approaches are able to leverage both the graph information and the contextual information on the vertices, deal with small amount of labels on the graph, and handle heterogeneous graphs among different types of vertices. We will show some offline evaluations of LDMS on historical data, and also discuss its online usage in multiple applications in live production systems as well as future use cases within the LinkedIn ecosystem.

【Keywords】: decision makers; graph mining; social network mining

41. Batch Model for Batched Timestamps Data Analysis with Application to the SSA Disability Program.

Paper Link】 【Pages】:343-352

【Authors】: Qingqi Yue ; Ao Yuan ; Xuan Che ; Minh Huynh ; Chunxiao Zhou

【Abstract】: The Office of Disability Adjudication and Review (ODAR) is responsible for holding hearings, issuing decisions, and reviewing appeals as part of the Social Security Administration's disability determining process. In order to control and process cases, the ODAR has established a Case Processing and Management System (CPMS) to record management information since December 2003. The CPMS provides a detailed case status history for each case. Due to the large number of appeal requests and limited resources, the number of pending claims at ODAR was over one million cases by March 31, 2015. Our National Institutes of Health (NIH) team collaborated with SSA and developed a Case Status Change Model (CSCM) project to meet the ODAR's urgent need of reducing backlogs and improve hearings and appeals process. One of the key issues in our CSCM project is to estimate the expected service time and its variation for each case status code. The challenge is that the system's recorded job departure times may not be the true job finished times. As the CPMS timestamps data of case status codes showed apparent batch patterns, we proposed a batch model and applied the constrained least squares method to estimate the mean service times and the variances. We also proposed a batch search algorithm to determine the optimal batch partition, as no batch partition was given in the real data. Simulation studies were conducted to evaluate the performance of the proposed methods. Finally, we applied the method to analyze a real CPMS data from ODAR/SSA.

【Keywords】: batch information matrix; batch model; batched timestamps data; case processing and management system; constrained least squares estimation; disability determining process; service time

42. Collaborative Knowledge Base Embedding for Recommender Systems.

Paper Link】 【Pages】:353-362

【Authors】: Fuzheng Zhang ; Nicholas Jing Yuan ; Defu Lian ; Xing Xie ; Wei-Ying Ma

【Abstract】: Among different recommendation techniques, collaborative filtering usually suffer from limited performance due to the sparsity of user-item interactions. To address the issues, auxiliary information is usually used to boost the performance. Due to the rapid collection of information on the web, the knowledge base provides heterogeneous information including both structured and unstructured data with different semantics, which can be consumed by various applications. In this paper, we investigate how to leverage the heterogeneous information in a knowledge base to improve the quality of recommender systems. First, by exploiting the knowledge base, we design three components to extract items' semantic representations from structural content, textual content and visual content, respectively. To be specific, we adopt a heterogeneous network embedding method, termed as TransR, to extract items' structural representations by considering the heterogeneity of both nodes and relationships. We apply stacked denoising auto-encoders and stacked convolutional auto-encoders, which are two types of deep learning based embedding techniques, to extract items' textual representations and visual representations, respectively. Finally, we propose our final integrated framework, which is termed as Collaborative Knowledge Base Embedding (CKE), to jointly learn the latent representations in collaborative filtering as well as items' semantic representations from the knowledge base. To evaluate the performance of each embedding component as well as the whole system, we conduct extensive experiments with two real-world datasets from different scenarios. The results reveal that our approaches outperform several widely adopted state-of-the-art recommendation methods.

【Keywords】: collaborative joint learning; knowledge base embedding; recommender systems

43. GLMix: Generalized Linear Mixed Models For Large-Scale Response Prediction.

Paper Link】 【Pages】:363-372

【Authors】: XianXing Zhang ; Yitong Zhou ; Yiming Ma ; Bee-Chung Chen ; Liang Zhang ; Deepak Agarwal

【Abstract】: Generalized linear model (GLM) is a widely used class of models for statistical inference and response prediction problems. For instance, in order to recommend relevant content to a user or optimize for revenue, many web companies use logistic regression models to predict the probability of the user's clicking on an item (e.g., ad, news article, job). In scenarios where the data is abundant, having a more fine-grained model at the user or item level would potentially lead to more accurate prediction, as the user's personal preferences on items and the item's specific attraction for users can be better captured. One common approach is to introduce ID-level regression coefficients in addition to the global regression coefficients in a GLM setting, and such models are called generalized linear mixed models (GLMix) in the statistical literature. However, for big data sets with a large number of ID-level coefficients, fitting a GLMix model can be computationally challenging. In this paper, we report how we successfully overcame the scalability bottleneck by applying parallelized block coordinate descent under the Bulk Synchronous Parallel (BSP) paradigm. We deployed the model in the LinkedIn job recommender system, and generated 20% to 40% more job applications for job seekers on LinkedIn.

【Keywords】: big data; spark; statistical model

44. A Non-parametric Approach to Detect Epileptogenic Lesions using Restricted Boltzmann Machines.

Paper Link】 【Pages】:373-382

【Authors】: Yijun Zhao ; Bilal Ahmed ; Thomas Thesen ; Karen E. Blackmon ; Jennifer G. Dy ; Carla E. Brodley ; Ruben Kuzniecky ; Orrin Devinsky

【Abstract】: Visual detection of lesional areas on a cortical surface is critical in rendering a successful surgical operation for Treatment Resistant Epilepsy (TRE) patients. Unfortunately, 45% of Focal Cortical Dysplasia (FCD, the most common kind of TRE) patients have no visual abnormalities in their brains' 3D-MRI images. We collaborate with doctors from NYU Langone's Comprehensive Epilepsy Center and apply machine learning methodologies to identify the resective zones for these {MRI-negative} FCD patients. Our task is particularly challenging because MRI images can only provide a limited number of features. Furthermore, data from different patients often exhibit inter-patient variabilities due to age, gender, left/right handedness, etc. In this paper, we introduce a new approach which combines the restricted Boltzmann machines and a Bayesian non-parametric mixture model to address these issues. We demonstrate the efficacy of our model by applying it to a retrospective dataset of MRI-negative FCD patients who are seizure free after surgery.

【Keywords】: Bayesian non-parametric; mixture models; predictive medicine; restricted Boltzmann machine; semi-supervised learning and application

45. Recruitment Market Trend Analysis with Sequential Latent Variable Models.

Paper Link】 【Pages】:383-392

【Authors】: Chen Zhu ; Hengshu Zhu ; Hui Xiong ; Pengliang Ding ; Fang Xie

【Abstract】: Recruitment market analysis provides valuable understanding of industry-specific economic growth and plays an important role for both employers and job seekers. With the rapid development of online recruitment services, massive recruitment data have been accumulated and enable a new paradigm for recruitment market analysis. However, traditional methods for recruitment market analysis largely rely on the knowledge of domain experts and classic statistical models, which are usually too general to model large-scale dynamic recruitment data, and have difficulties to capture the fine-grained market trends. To this end, in this paper, we propose a new research paradigm for recruitment market analysis by leveraging unsupervised learning techniques for automatically discovering recruitment market trends based on large-scale recruitment data. Specifically, we develop a novel sequential latent variable model, named MTLVM, which is designed for capturing the sequential dependencies of corporate recruitment states and is able to automatically learn the latent recruitment topics within a Bayesian generative framework. In particular, to capture the variability of recruitment topics over time, we design hierarchical dirichlet processes for MTLVM. These processes allow to dynamically generate the evolving recruitment topics. Finally, we implement a prototype system to empirically evaluate our approach based on real-world recruitment data in China. Indeed, by visualizing the results from MTLVM, we can successfully reveal many interesting findings, such as the popularity of LBS related jobs reached the peak in the 2nd half of 2014, and decreased in 2015.

【Keywords】: latent variable model; recruitment market; trend analysis

46. Days on Market: Measuring Liquidity in Real Estate Markets.

Paper Link】 【Pages】:393-402

【Authors】: Hengshu Zhu ; Hui Xiong ; Fangshuang Tang ; Qi Liu ; Yong Ge ; Enhong Chen ; Yanjie Fu

【Abstract】: Days on Market (DOM) refers to the number of days a property is on the active market, which is an important measurement of market liquidity in real estate industry. Indeed, at the micro level, DOM is not only a special concern of house sellers, but also a useful indicator for potential buyers to evaluate the popularity of a house. At the macro level, DOM is an important indicator of real estate market status. However, it is very challenging to measure DOM, since there are a variety of factors which can impact on the DOM of a property. To this end, in this paper, we aim to measure real estate liquidity by examining multiple factors in a holistic manner. A special goal is to predict the DOM of a given property listing. Specifically, we first extract key features from multiple types of heterogeneous real estate-related data, such as house profiles and geo-social information of residential communities. Then, based on these features, we develop a multi-task learning based regression approach for predicting the DOM of real estates. This approach can effectively learn district-aware models for different property listings by considering multiple factors. Finally, we conduct extensive experiments on real-world real estate data collected in Beijing and develop a prototype system for practical use. The experimental results clearly validate the effectiveness of the proposed approach for measuring liquidity in real estate markets.

【Keywords】: days on market; multi-task learning; real estate

Applied Data Science Track Invited Talks 9

47. Can You Teach the Elephant to Dance? AKA: Culture Eats Data Science for Breakfast.

Paper Link】 【Pages】:403

【Authors】: Jonathan D. Becher

【Abstract】: In the past 20 years, the practical examples of KDD/data mining have become so ubiquitous that it's almost impossible to imagine a new venture that isn't based on data science. Uber, Facebook, 23andMe, Tesla -- they aren't just technology companies; they are data companies. And yet the reality is that these companies are still anomalies. Large, successful companies usually still treat KDD as either an afterthought or as an experiment. It's not core to how they run the business. As practitioners we compound this problem by concentrating our efforts on valuable business problems; but ones which are usually on the periphery of the business. We do this because changing the heart of how a company operates requires more than just process or technology changes. It requires cultural changes. And these cultural changes usually trigger corporate antibodies adverse to anything new. This talk will review some practical realities of instituting data-driven decisions in a very large multi-national company.

【Keywords】: analytics; enterprise software; mobile technology

48. How Machine Learning has Finally Solved Wanamaker's Dilemma.

Paper Link】 【Pages】:405

【Authors】: Oliver Downs

【Abstract】: It has become a cliché to talk about Wanamaker's dilemma, his famous quote that "Half the money I spend on advertising is wasted; the trouble is I don't know which half" is well known. So why talk about it today? Well, I'll show you that some of the best targeted consumer marketing still suffers exactly from this problem, and tell you why -- it's not humanly possible to solve it! At Amplero we've finally solved it and made that solution accessible to marketers, using machine learning in combination with the revolution in online experimentation that the advent of the multi-armed bandit has brought about.

【Keywords】: Wanamaker's dilemma; marketing optimization; multi-armed bandit experimentation

49. Learning Sparse Models at Scale.

Paper Link】 【Pages】:407

【Authors】: Ralf Herbrich

【Abstract】: Recently, learning deep models from dense data has received a lot of attention in tasks such as object recognition and signal processing. However, when dealing with non-sensory data about real-world entities, data is often sparse; for example people interaction with products in e-Commerce, people interacting with each other in social networks or word sequences in natural language. In this talk, I will share lessons learned over the past 10 years when learning predictive models based on sparse data: 1) how to scale the inference algorithms to distributed data setting, 2) how to automate the learning process by reducing the amount of hyper-parameters to zero, 3) how to deal with Zipf distributions when learning resource-constrained models, and 4) how to combine dense and sparse-learning algorithms. The talk will be drawing from many real-world experiences I gathered over the past decade in applications of the techniques in gaming, search, advertising and recommendations of systems developed at Microsoft, Facebook and Amazon.

【Keywords】: Zipf distribution; advertising; optimizing hyper-parameters; recommendations; scalable machine learning; sparse models

50. Profiling Users from Online Social Behaviors with Applications for Tencent Social Ads.

Paper Link】 【Pages】:409

【Authors】: Ching Law

【Abstract】: QQ and Wechat are the two largest instant messaging & social networks in China. Tencent Social Ads is the advertising platform for both Wechat and QQ, serving over 10B page views per day for several hundred million users. We strive to understand as much as possible on our users' characteristics, so as to serve the best personalized ads for them. The rich user behaviors on Tencent's diverse products lay a foundation in our user profiles on many dimensions, including demographics, interests, intents, transactions, physical locations, and devices, etc. In this talk, we will share our experience in large scale user data mining based on online social activities. We will discuss the challenges we face and the solutions we have devised so far. Some demographics data are obtained from user input, and thus would have gaps in both accuracy and coverage. We discuss the techniques in calibrating and verifying these data. We infer user interests from their social behaviors. For example, most QQ groups are not labelled properly, but by applying a large-scale topic model on the QQ memberships, we can effectively classify most QQ groups into an interest taxonomy. We also infer user interests from user's physical location check-ins and uploaded photos. User data can be collected from many diverse sources, including behaviors in various Tencent products, click and conversion in ad platform, and even seed customers collected by advertisers. We'll discuss the systems to merge these diverse data to provide a coherent view for our advertisers. High quality user labels are usually sparse. We implemented an algorithm for advertisers to reach more potential customers through user similarity computation based on user features as well as social graph inferences. We'll describe the system's contributions to ads quality. Top advertisers demand rich audience targeting solutions in combination of their own customer data, Tencent data, and possibly 3rd-party data. We'll discuss the data exchange platform that can facilitate collaborative applications with 3rd-party DSPs and DMPs.

【Keywords】: advertising technology; audience targeting; social networks

51. The Wisdom of Crowds: Best Practices for Data Prep & Machine Learning Derived from Millions of Data Science Workflows.

Paper Link】 【Pages】:411

【Authors】: Ingo Mierswa

【Abstract】: With hundreds of thousands of users, RapidMiner is the most frequently used visual workflow platform for machine learning. It covers the full spectrum of analytics from data preparation to machine learning and model validation. In this presentation, I will take you on a tour of machine learning which spans the last 15 years of research and industry applications and share key insights with you about how data scientists perform their daily analysis tasks. These patterns are extracted from mining millions of analytical workflows that have been created with RapidMiner over the past years. This talk will address important questions around the data mining process such as: What are the most frequently used solutions for typical data quality problems? How often are analysts using decision trees or neural networks? And does this behavior change over time or depend on the users experience level?

【Keywords】: analytics; data visualization; machine learning tools; visual workflow; wisdom of the crowds

52. Bayesian Optimization and Embedded Learning Systems.

Paper Link】 【Pages】:413

【Authors】: Jeff Schneider

【Abstract】: An important property of embedded learning systems is the ever-changing environment they create for all algorithms operating in the system. Optimizing the performance of those algorithms becomes a perpetual on-line activity rather than a one-off task. I will review some of these challenges in autonomous vehicles. I will discuss Bayesian optimization methods and their application in robotics and scientific applications, focusing on scaling up the dimensionality and managing multi-fidelity evaluations. I will finish with lessons learned and thoughts on future directions as these methods move into embedded systems.

【Keywords】: Bayesian optimization; autonomous vehicles; robotics

53. Accelerating the Race to Autonomous Cars.

Paper Link】 【Pages】:415

【Authors】: Danny Shapiro

【Abstract】: Every automaker is working on driver assistance systems and self-driving cars. Conventional computer vision used for ADAS is reaching its threshold because it is impossible to write code for every possible scenario as a vehicle navigates. In order to develop a truly autonomous car, deep learning and artificial intelligence are required. With deep learning, the vehicle can be trained to have super human levels of perception, driving safer than anyone on the road. An end-to-end artificial intelligence platform based on supercomputers in the cloud and in the vehicle enables cars to get smarter and smarter. Coupled with an extensive software development kit with vision and AI libraries and software modules, automakers, tier 1s, and startups can build scalable systems from ADAS to full autonomy.

【Keywords】: artificial intelligence; deep learning; scalability; self-driving cars

54. Large-Scale Machine Learning at Verizon: Theory and Applications.

Paper Link】 【Pages】:417

【Authors】: Ashok Srivastava

【Abstract】: This talk will cover recent innovations in large-scale machine learning and their applications on massive, real-world data sets at Verizon. These applications power new revenue generating products and services for the company and are hosted on a massive computing and storage platform known as Orion. We will discuss the architecture of Orion and the underlying algorithmic framework. We will also cover some of the real-world aspects of building a new organization dedicated to creating new product lines based on data science.

【Keywords】: Orion; large-scale machine learning; revenue generation

55. Computational Social Science: Exciting Progress and Future Challenges.

Paper Link】 【Pages】:419

【Authors】: Duncan Watts

【Abstract】: The past 15 years have witnessed a remarkable increase in both the scale and scope of social and behavioral data available to researchers, leading some to herald the emergence of a new field: "computational social science." Against these exciting developments stands a stubborn fact: that in spite of many thousands of published papers, there has been surprisingly little progress on the "big" questions that motivated the field in the first place?questions concerning systemic risk in financial systems, problem solving in complex organizations, and the dynamics of epidemics or social movements, among others. In this talk I highlight some examples of research that would not have been possible just a handful of years ago and that illustrate the promise of CSS. At the same time, they illustrate its limitations. I then conclude with some thoughts on how CSS can bridge the gap between its current state and its potential.

【Keywords】: computational social science; social data

Applied Data Science Track Posters 26

56. MAP: Frequency-Based Maximization of Airline Profits based on an Ensemble Forecasting Approach.

Paper Link】 【Pages】:421-430

【Authors】: Bo An ; Haipeng Chen ; Noseong Park ; V. S. Subrahmanian

【Abstract】: Though there are numerous traditional models to predict market share and demand along airline routes, the prediction of existing models is not precise enough and, to the best of our knowledge, there is no use of data-mining based forecasting techniques to improve airline profitability. We propose the MAP (Maximizing Airline Profits) architecture designed to help airlines and make two key contributions in airline market share and route demand prediction and prediction-based airline profit optimization. Compared with past methods to forecast market share and demand along airline routes, we introduce a novel Ensemble Forecasting (MAP-EF) approach considering two new classes of features: (i) features derived from clusters of similar routes, and (ii) features based on equilibrium pricing. We show that MAP-EF achieves much better Pearson Correlation Coefficients (over 0.95 vs. 0.82 for market share, 0.98 vs. 0.77 for demand) and R2-values compared with three state-of-the-art works for forecasting market share and demand, while showing much lower variance. Using the results of MAP-EF, we develop MAP-Bilevel Branch and Bound (MAP-BBB) and MAP-Greedy (MAP-G) algorithms to optimally allocate flight frequencies over multiple routes, to maximize an airline's profit. Experimental results show that airlines can increase profits by a significant margin. All experiments were conducted with data aggregated from four sources: US Bureau of Transportation Statistics (BTS), US Bureau of Economic Analysis (BEA), the National Transportation Safety Board (NTSB), and the US Census Bureau (CB).

【Keywords】: airline demand and market share prediction; airline profit maximization; ensemble prediction; regression

57. Gemello: Creating a Detailed Energy Breakdown from Just the Monthly Electricity Bill.

Paper Link】 【Pages】:431-440

【Authors】: Nipun Batra ; Amarjeet Singh ; Kamin Whitehouse

【Abstract】: The first step to saving energy in the home is often to create an energy breakdown: the amount of energy used by each individual appliance in the home. Unfortunately, current techniques that produce an energy breakdown are not scalable: they require hardware to be installed in each and every home. In this paper, we propose a more scalable solution called Gemello that estimates the energy breakdown for one home by matching it with similar homes for which the breakdown is already known. This matching requires only the monthly energy bill and household characteristics such as square footage of the home and the size of the household. We evaluate this approach using 57 homes and results indicate that the accuracy of Gemello is comparable to or better than existing techniques that use sensing infrastructure in each home. The information required by Gemello is often publicly available and, as such, it can be immediately applied to many homes around the world.

【Keywords】: energy disaggregation; energy metering; feedback; nilm

58. CaSMoS: A Framework for Learning Candidate Selection Models over Structured Queries and Documents.

Paper Link】 【Pages】:441-450

【Authors】: Fedor Borisyuk ; Krishnaram Kenthapadi ; David Stein ; Bo Zhao

【Abstract】: User experience at social media and web platforms such as LinkedIn is heavily dependent on the performance and scalability of its products. Applications such as personalized search and recommendations require real-time scoring of millions of structured candidate documents associated with each query, with strict latency constraints. In such applications, the query incorporates the context of the user (in addition to search keywords if present), and hence can become very large, comprising of thousands of Boolean clauses over hundreds of document attributes. Consequently, candidate selection techniques need to be applied since it is infeasible to retrieve and score all matching documents from the underlying inverted index. We propose CaSMoS, a machine learned candidate selection framework that makes use of Weighted AND (WAND) query. Our framework is designed to prune irrelevant documents and retrieve documents that are likely to be part of the top-k results for the query. We apply a constrained feature selection algorithm to learn positive weights for feature combinations that are used as part of the weighted candidate selection query. We have implemented and deployed this system to be executed in real time using LinkedIn's Galene search platform. We perform extensive evaluation with different training data approaches and parameter settings, and investigate the scalability of the proposed candidate selection model. Our deployment of this system as part of LinkedIn's job recommendation engine has resulted in significant reduction in latency (up to 25%) without sacrificing the quality of the retrieved results, thereby paving the way for more sophisticated scoring models.

【Keywords】: candidate selection; learning query models; personalized search and recommendation systems

59. Domain Adaptation in the Absence of Source Domain Data.

Paper Link】 【Pages】:451-460

【Authors】: Boris Chidlovskii ; Stéphane Clinchant ; Gabriela Csurka

【Abstract】: The overwhelming majority of existing domain adaptation methods makes an assumption of freely available source domain data. An equal access to both source and target data makes it possible to measure the discrepancy between their distributions and to build representations common to both target and source domains. In reality, such a simplifying assumption rarely holds, since source data are routinely a subject of legal and contractual constraints between data owners and data customers. When source domain data can not be accessed, decision making procedures are often available for adaptation nevertheless. These procedures are often presented in the form of classification, identification, ranking etc. rules trained on source data and made ready for a direct deployment and later reuse. In other cases, the owner of a source data is allowed to share a few representative examples such as class means. In this paper we address the domain adaptation problem in real world applications, where the reuse of source domain data is limited to classification rules or a few representative examples. We extend the recent techniques of feature corruption and their marginalization, both in supervised and unsupervised settings. We test and compare them on private and publicly available source datasets and show that significant performance gains can be achieved despite the absence of source data and shortage of labeled target data.

【Keywords】: classification; domain adaptation; emerging applications; machine learning; marginalization

Paper Link】 【Pages】:461-470

【Authors】: Steven H. H. Ding ; Benjamin C. M. Fung ; Philippe Charland

【Abstract】: Assembly code analysis is one of the critical processes for detecting and proving software plagiarism and software patent infringements when the source code is unavailable. It is also a common practice to discover exploits and vulnerabilities in existing software. However, it is a manually intensive and time-consuming process even for experienced reverse engineers. An effective and efficient assembly code clone search engine can greatly reduce the effort of this process, since it can identify the cloned parts that have been previously analyzed. The assembly code clone search problem belongs to the field of software engineering. However, it strongly depends on practical nearest neighbor search techniques in data mining and databases. By closely collaborating with reverse engineers and Defence Research and Development Canada (DRDC), we study the concerns and challenges that make existing assembly code clone approaches not practically applicable from the perspective of data mining. We propose a new variant of LSH scheme and incorporate it with graph matching to address these challenges. We implement an integrated assembly clone search engine called Kam1n0. It is the first clone search engine that can efficiently identify the given query assembly function's subgraph clones from a large assembly code repository. Kam1n0 is built upon the Apache Spark computation framework and Cassandra-like key-value distributed storage. A deployed demo system is publicly available. Extensive experimental results suggest that Kam1n0 is accurate, efficient, and scalable for handling large volume of assembly code.

【Keywords】: assembly clone search; information retrieval; mining software repositorie

61. Joint Optimization of Multiple Performance Metrics in Online Video Advertising.

Paper Link】 【Pages】:471-480

【Authors】: Sahin Cem Geyik ; Sergey Faleev ; Jianqiang Shen ; Sean O'Donnell ; Santanu Kolay

【Abstract】: The field of online advertising, in essence, deals with the problem of presenting ads to online users in the most appropriate contexts to achieve a multitude of advertiser goals. A vast amount of work in online advertising has been focused on optimizing banner display advertising campaigns where the main goal lies in direct response metrics, often as clicks or conversions. In this paper, we explore the newly popularized space of online video advertising, where brand recognition is the key focus. We propose a framework based on a feedback mechanism where we optimize multiple video specific performance indicators while making sure the delivery constraints (budget and user reach) of advertisers are satisfied. While our main focus is on improving metrics such as engagement (amount of view time), and viewability (whether a campaign is within eyesight of a user), we also discuss the possibilities of expanding to other metrics. We demonstrate the benefit of our framework via empirical results in multiple real-world advertising campaigns. To the best of our knowledge, this is the first paper that deals with the unique challenges arising from the nature of online video advertising.

【Keywords】: online advertising; performance optimization; video advertising

62. Convolutional Neural Networks for Steady Flow Approximation.

Paper Link】 【Pages】:481-490

【Authors】: Xiaoxiao Guo ; Wei Li ; Francesco Iorio

【Abstract】: In aerodynamics related design, analysis and optimization problems, flow fields are simulated using computational fluid dynamics (CFD) solvers. However, CFD simulation is usually a computationally expensive, memory demanding and time consuming iterative process. These drawbacks of CFD limit opportunities for design space exploration and forbid interactive design. We propose a general and flexible approximation model for real-time prediction of non-uniform steady laminar flow in a 2D or 3D domain based on convolutional neural networks (CNNs). We explored alternatives for the geometry representation and the network architecture of CNNs. We show that convolutional neural networks can estimate the velocity field two orders of magnitude faster than a GPU-accelerated CFD solver and four orders of magnitude faster than a CPU-based CFD solver at a cost of a low error rate. This approach can provide immediate feedback for real-time design iterations at the early stage of design. Compared with existing approximation models in the aerodynamics domain, CNNs enable an efficient estimation for the entire velocity field. Furthermore, designers and engineers can directly apply the CNN approximation model in their design space exploration algorithms without training extra lower-dimensional surrogate models.

【Keywords】: computational fluid dynamics; convolutional neural networks; machine learning; surrogate models

63. Computational Drug Repositioning Using Continuous Self-Controlled Case Series.

Paper Link】 【Pages】:491-500

【Authors】: Zhaobin Kuang ; James A. Thomson ; Michael Caldwell ; Peggy L. Peissig ; Ron M. Stewart ; David Page

【Abstract】: Computational Drug Repositioning (CDR) is the task of discovering potential new indications for existing drugs by mining large-scale heterogeneous drug-related data sources. Leveraging the patient-level temporal ordering information between numeric physiological measurements and various drug prescriptions provided in Electronic Health Records (EHRs), we propose a Continuous Self-controlled Case Series (CSCCS) model for CDR. As an initial evaluation, we look for drugs that can control Fasting Blood Glucose (FBG) level in our experiments. Applying CSCCS to the Marshfield Clinic EHR, well-known drugs that are indicated for controlling blood glucose level are rediscovered. Furthermore, some drugs with recent literature support for the potential effect of blood glucose level control are also identified.

【Keywords】: computational drug repositioning; longitudinal data; self-controlled case series

Paper Link】 【Pages】:501-510

【Authors】: Jia Li ; Dhruv Arya ; Viet Ha-Thuc ; Shakti Sinha

【Abstract】: This paper proposes an approach to applying standardized entity data to improve job search quality and to make search results more personalized. Specifically, we explore three types of entity-aware features and incorporate them into the job search ranking function. The first is query-job matching features which extract and standardize entities mentioned in queries and documents, then semantically match them based on these entities. The second type, searcher-job expertise homophily, aims to capture the fact that job searchers tend to be interested in the jobs requiring similar expertise as theirs. To measure the similarity, we use standardized skills in job descriptions and searchers' profiles as well as skills that we infer searchers might have but not explicitly list in their profiles. Third, we propose a concept of entity-faceted historical click-through-rates (CTRs) to capture job document quality. Faceting jobs by their standardized companies, titles, locations, etc., and computing historical CTRs at the facet level instead of individual job level alleviate sparseness issue in historical action data. This is particularly important in job search where job lifetime is typically short. Both offline and online experiments confirm the effectiveness of the features. In offline experiment, using the entity-aware features gives improvements of +20%, +12.1% and +8.3% on Precision@1, MRR and NDCG@25, respectively. Online A/B test shows that a new model with these features is +11.3% and +5.3% better than the baseline in terms of click-through-rate and apply rate.

【Keywords】: information retrieval; learning-to-ranking; personalization

65. Scalable Fast Rank-1 Dictionary Learning for fMRI Big Data Analysis.

Paper Link】 【Pages】:511-519

【Authors】: Xiang Li ; Milad Makkie ; Binbin Lin ; Mojtaba Sedigh Fazli ; Ian Davidson ; Jieping Ye ; Tianming Liu ; Shannon Quinn

【Abstract】: It has been shown from various functional neuroimaging studies that sparsity-regularized dictionary learning could achieve superior performance in decomposing comprehensive and neuroscientifically meaningful functional networks from massive fMRI signals. However, the computational cost for solving the dictionary learning problem has been known to be very demanding, especially when dealing with large-scale data sets. Thus in this work, we propose a novel distributed rank-1 dictionary learning (D-r1DL) model and apply it for fMRI big data analysis. The model estimates one rank-1 basis vector with sparsity constraint on its loading coefficient from the input data at each learning step through alternating least squares updates. By iteratively learning the rank-1 basis and deflating the input data at each step, the model is then capable of decomposing the whole set of functional networks. We implement and parallelize the rank-1 dictionary learning algorithm using Spark engine and deployed the resilient distributed dataset (RDDs) abstracts for the data distribution and operations. Experimental results from applying the model on the Human Connectome Project (HCP) data show that the proposed D-r1DL model is efficient and scalable towards fMRI big data analytics, thus enabling data-driven neuroscientific discovery from massive fMRI big data in the future.

【Keywords】: algorithm parallelization; distributed computation; fMRI; sparse coding

66. CompanyDepot: Employer Name Normalization in the Online Recruitment Industry.

Paper Link】 【Pages】:521-530

【Authors】: Qiaoling Liu ; Faizan Javed ; Matt McNair

【Abstract】: Entity linking links entity mentions in text to the corresponding entities in a knowledge base (KB) and has many applications in both open domain and specific domains. For example, in the recruitment domain, linking employer names in job postings or resumes to entities in an employer KB is very important to many business applications. In this paper, we focus on this employer name normalization task, which has several unique challenges: handling employer names from both job postings and resumes, leveraging the corresponding location context, and handling name variations, irrelevant input data, and noises in the KB. We present a system called CompanyDepot which contains a machine learning based approach CompanyDepot-ML and a heuristic approach CompanyDepot-H to address these challenges in three steps: (1) searching for candidate entities based on a customized search engine for the KB; (2) ranking the candidate entities using learning-to-rank methods or heuristics; and (3) validating the top-ranked entity via binary classification or heuristics. While CompanyDepot-ML shows better extendability and flexibility, CompanyDepot-H serves as a strong baseline and useful way to collect training data for CompanyDepot-ML. The proposed system achieves 2.5%-21.4% higher coverage at the same precision level compared to an existing system used at CareerBuilder over multiple real-world datasets. Applying the system to a similar task of academic institution name normalization further shows the generalization ability of the method.

【Keywords】: employer name normalization; entity linking; learning to rank; named entity disambiguation

67. Understanding Behaviors that Lead to Purchasing: A Case Study of Pinterest.

Paper Link】 【Pages】:531-540

【Authors】: Caroline Lo ; Dan Frankowski ; Jure Leskovec

【Abstract】: Online e-commerce applications are becoming a primary vehicle for people to find, compare, and ultimately purchase products. One of the fundamental questions that arises in e-commerce is to characterize, understand, and model user long-term purchasing intent, which is important as it allows for personalized and context relevant e-commerce services. In this paper we study user activity and purchasing behavior with the goal of building models of time-varying user purchasing intent. We analyze the purchasing behavior of nearly three million Pinterest users to determine short-term and long-term signals in user behavior that indicate higher purchase intent. We find that users with long-term purchasing intent tend to save and clickthrough on more content. However, as users approach the time of purchase their activity becomes more topically focused and actions shift from saves to searches. We further find that purchase signals in online behavior can exist weeks before a purchase is made and can also be traced across different purchase categories. Finally, we synthesize these insights in predictive models of user purchasing intent. Taken together, our work identifies a set of general principles and signals that can be used to model user purchasing intent across many content discovery applications.

【Keywords】: content discovery application; pinterest; purchase intent; purchasing; purchasing behavior; purchasing intent

68. Images Don't Lie: Transferring Deep Visual Semantic Features to Large-Scale Multimodal Learning to Rank.

Paper Link】 【Pages】:541-548

【Authors】: Corey Lynch ; Kamelia Aryafar ; Josh Attenberg

【Abstract】: Search is at the heart of modern e-commerce. As a result, the task of ranking search results automatically (learning to rank) is a multibillion dollar machine learning problem. Traditional models optimize over a few hand-constructed features based on the item's text. In this paper, we introduce a multimodal learning to rank model that combines these traditional features with visual semantic features transferred from a deep convolutional neural network. In a large scale experiment using data from the online marketplace Etsy, we verify that moving to a multimodal representation significantly improves ranking quality. We show how image features can capture fine-grained style information not available in a text-only representation. In addition, we show concrete examples of how image information can successfully disentangle pairs of highly different items that are ranked similarly by a text-only model.

【Keywords】: computer vision; deep learning; learning to rank

69. Text Mining in Clinical Domain: Dealing with Noise.

Paper Link】 【Pages】:549-558

【Authors】: Hoang Nguyen ; Jon Patrick

【Abstract】: Text mining in clinical domain is usually more difficult than general domains (e.g. newswire reports and scientific literature) because of the high level of noise in both the corpus and training data for machine learning (ML). A large number of unknown word, non-word and poor grammatical sentences made up the noise in the clinical corpus. Unknown words are usually complex medical vocabularies, misspellings, acronyms and abbreviations where unknown non-words are generally the clinical patterns including scores and measures. This noise produces obstacles in the initial lexical processing step as well as subsequent semantic analysis. Furthermore, the labelled data used to build ML models is very costly to obtain because it requires intensive clinical knowledge from the annotators. And even created by experts, the training examples usually contain errors and inconsistencies due to the variations in human annotators' attentiveness. Clinical domain also suffers from the nature of the imbalanced data distribution problem. These kinds of noise are very popular and potentially affect the overall information extraction performance but they were not carefully investigated in most presented health informatics systems. This paper introduces a general clinical data mining architecture which is potential of addressing all of these challenges using: automatic proof-reading process, trainable finite state pattern recogniser, iterative model development and active learning. The reportability classifier based on this architecture achieved 98.25% sensitivity and 96.14% specificity on an Australian cancer registry's held-out test set and up to 92% of training data provided for supervised ML was saved by active learning.

【Keywords】: active learning; clinical; named-entity recognition; natural languages processing; text classification

Paper Link】 【Pages】:559-568

【Authors】: John Paparrizos ; Ryen W. White ; Eric Horvitz

【Abstract】: Web search queries can offer a unique population-scale window onto streams of evidence that are useful for detecting the emergence of health conditions. We explore the promise of harnessing behavioral signals in search logs to provide advance warning about the presence of devastating diseases such as pancreatic cancer. Pancreatic cancer is often diagnosed too late to be treated effectively as the cancer has usually metastasized by the time of diagnosis. Symptoms of the early stages of the illness are often subtle and nonspecific. We identify searchers who issue credible, first-person diagnostic queries for pancreatic cancer and we learn models from prior search histories that predict which searchers will later input such queries. We show that we can infer the likelihood of seeing the rise of diagnostic queries months before they appear and characterize the tradeoff between predictivity and false positive rate. The findings highlight the potential of harnessing search logs for the early detection of pancreatic cancer and more generally for harnessing search systems to reduce health risks for individuals.

【Keywords】: behavioral data; digital disease detection; health screening; pancreatic cancer; search logs; temporal analysis

Paper Link】 【Pages】:569-578

【Authors】: Bryan Perozzi ; Michael Schueppert ; Jack Saalweachter ; Mayur Thakur

【Abstract】: We present a secondary ranking system to find and remove erroneous suggestions from a geospatial recommendation system. We discover such anomalous links by "double checking" the recommendation system's output to ensure that it is both structurally cohesive, and semantically consistent. Our approach is designed for the Google Related Places Graph, a geographic recommendation system which provides results for hundreds of millions of queries a day. We model the quality of a recommendation between two geographic entities as a function of their structure in the Related Places Graph, and their semantic relationship in the Google Knowledge Graph. To evaluate our approach, we perform a large scale human evaluation of such an anomalous link detection system. For the long tail of unpopular entities, our models can predict the recommendations users will consider poor with up to 42\% higher mean precision (29 raw points) than the live system. Results from our study reveal that structural and semantic features capture different facets of relatedness to human judges. We characterize our performance with a qualitative analysis detailing the categories of real-world anomalies our system is able to detect, and provide a discussion of additional applications of our method.

【Keywords】: anomaly detection; knowledge graph; link prediction; recommendation systems

72. Deploying Analytics with the Portable Format for Analytics (PFA).

Paper Link】 【Pages】:579-588

【Authors】: Jim Pivarski ; Collin Bennett ; Robert L. Grossman

【Abstract】: We introduce a new language for deploying analytic models into products, services and operational systems called the Portable Format for Analytics (PFA). PFA is an example of what is sometimes called a model interchange format, a language for describing analytic models that is independent of specific tools, applications or systems. Model interchange formats allow one application (the model producer) to export models and another application (the model consumer or scoring engine) to import models. The core idea behind PFA is to support the safe execution of statistical functions, mathematical functions, and machine learning algorithms and their compositions within a safe execution environment. With this approach, the common analytic models used in data science can be implemented, as well as the data transformations and data aggregations required for pre- and post-processing data. PFA compliant scoring engines can be extended by adding new user defined functions described in PFA. We describe the design of PFA. A Data Mining Group (DMG) Working Group is developing the PFA standard. The current version is 0.8.1 and contains many of the commonly used statistical and machine learning models, including regression, clustering, support vector machines, neural networks, etc. We also describe two implementations of Hadrian, one in Scala and one in Python. We discuss four case studies that use PFA and Hadrian to specify analytic models, including two that are deployed in operations at client sites.

【Keywords】: PFA; PMML; deploying analytics; model producers; portable format for analytics; scoring engines

73. Singapore in Motion: Insights on Public Transport Service Level Through Farecard and Mobile Data Analytics.

Paper Link】 【Pages】:589-598

【Authors】: Hasan Poonawala ; Vinay Kolar ; Sebastien Blandin ; Laura Wynter ; Sambit Sahu

【Abstract】: Given the changing dynamics of mobility patterns and rapid growth of cities, transport agencies seek to respond more rapidly to needs of the public with the goal of offering an effective and competitive public transport system. A more data-centric approach for transport planning is part of the evolution of this process. In particular, the vast penetration of mobile phones provides an opportunity to monitor and derive insights on transport usage. Real time and historical analyses of such data can give a detailed understanding of mobility patterns of people and also suggest improvements to current transit systems. On its own, however, mobile geolocation data has a number of limitations. We thus propose a joint telco-and-farecard-based learning approach to understanding urban mobility. The approach enhances telecommunications data by leveraging it jointly with other sources of real-time data. The approach is illustrated on the First- and last-mile problem as well as route choice estimation within a densely-connected train network.

【Keywords】: big data; mapmatching; mobility; public transport route choice

74. EMBERS AutoGSR: Automated Coding of Civil Unrest Events.

Paper Link】 【Pages】:599-608

【Authors】: Parang Saraf ; Naren Ramakrishnan

【Abstract】: We describe the EMBERS AutoGSR system that conducts automated coding of civil unrest events from news articles published in multiple languages. The nuts and bolts of the AutoGSR system constitute an ecosystem of filtering, ranking, and recommendation models to determine if an article reports a civil unrest event and, if so, proceed to identify and encode specific characteristics of the civil unrest event such as the when, where, who, and why of the protest. AutoGSR is a deployed system for the past 6 months continually processing data 24x7 in languages such as Spanish, Portuguese, English and encoding civil unrest events in 10 countries of Latin America: Argentina, Brazil, Chile, Colombia, Ecuador, El Salvador, Mexico, Paraguay, Uruguay, and Venezuela. We demonstrate the superiority of AutoGSR over both manual approaches and other state-of-the-art encoding systems for civil unrest.

【Keywords】: event encoding; event extraction; text mining

75. Compute Job Memory Recommender System Using Machine Learning.

Paper Link】 【Pages】:609-616

【Authors】: Taraneh Taghavi ; Maria Lupetini ; Yaron Kretchmer

【Abstract】: This paper presents a machine learning approach to predict the amount of compute memory needed by jobs which are submitted to Load Sharing Facility (LSF® ) with a high level of accuracy. LSF® is the compute resource manager and job scheduler for Qualcomm chip design process. It schedules the jobs based on available resources: CPU, memory, storage, and software licenses. Memory is one of the key resources and its proper utilization leads to a substantial improvement in saving machine resources which in turn results in a significant reduction in overall job pending time. In addition, efficient memory utilization helps to reduce the operations cost by decreasing the number of servers needed for the end-to-end design process. In this paper, we explored a suite of statistical and machine learning techniques to develop a Compute Memory Recommender System for the Qualcomm chip design process with over 90% accuracy in predicting the amount of memory a job needs. Moreover, it demonstrates the potential to significantly reduce job pending time.

【Keywords】: compute memory recommender system; grid computing.; memory utilization; statistical and machine learning techniques

76. Scalable Time-Decaying Adaptive Prediction Algorithm.

Paper Link】 【Pages】:617-626

【Authors】: Yinyan Tan ; Zhe Fan ; Guilin Li ; Fangshan Wang ; Zhengbing Li ; Shikai Liu ; Qiuling Pan ; Eric P. Xing ; Qirong Ho

【Abstract】: Online learning is used in a wide range of real applications, e.g., predicting ad click-through rates (CTR) and personalized recommendations. Based on the analysis of users' behaviors in Video-On-Demand (VoD) recommender systems,we discover that the most recent users' actions can better reflect users' current intentions and preferences. Under this observation, we thereby propose a novel time-decaying online learning algorithm derived from the state-of-the-art FTRL-proximal algorithm, called Time-Decaying Adaptive Prediction (TDAP) algorithm. To scale Big Data, we further parallelize our algorithm following the data parallel scheme under both BSP and SSP consistency model. We experimentally evaluate our TDAP algorithm on real IPTV VoD datasets using two state-of-the-art distributed computing platforms, TDAP achieves good accuracy: it improves at least 5.6% in terms of prediction accuracy, compared to FTRL-proximal algorithm; and TDAP scales well: it runs 4 times faster when the number of machines increases from 2 to 10.

【Keywords】: time decaying algorithm

77. Analyzing Volleyball Match Data from the 2014 World Championships Using Machine Learning Techniques.

Paper Link】 【Pages】:627-634

【Authors】: Jan Van Haaren ; Horesh Ben Shitrit ; Jesse Davis ; Pascal Fua

【Abstract】: This paper proposes a relational-learning based approach for discovering strategies in volleyball matches based on optical tracking data. In contrast to most existing methods, our approach permits discovering patterns that account for both spatial (that is, partial configurations of the players on the court) and temporal (that is, the order of events and positions) aspects of the game. We analyze both the men's and women's final match from the 2014 FIVB Volleyball World Championships, and are able to identify several interesting and relevant strategies from the matches.

【Keywords】: spatial data; sports analytics; strategy detection

78. Crime Rate Inference with Big Data.

Paper Link】 【Pages】:635-644

【Authors】: Hongjian Wang ; Daniel Kifer ; Corina Graif ; Zhenhui Li

【Abstract】: Crime is one of the most important social problems in the country, affecting public safety, children development, and adult socioeconomic status. Understanding what factors cause higher crime is critical for policy makers in their efforts to reduce crime and increase citizens' life quality. We tackle a fundamental problem in our paper: crime rate inference at the neighborhood level. Traditional approaches have used demographics and geographical influences to estimate crime rates in a region. With the fast development of positioning technology and prevalence of mobile devices, a large amount of modern urban data have been collected and such big data can provide new perspectives for understanding crime. In this paper, we used large-scale Point-Of-Interest data and taxi flow data in the city of Chicago, IL in the USA. We observed significantly improved performance in crime rate inference compared to using traditional features. Such an improvement is consistent over multiple years. We also show that these new features are significant in the feature importance analysis.

【Keywords】: big data; crime inference; heterogeneous data; spatial-temporal data

79. Improving the Sensitivity of Online Controlled Experiments: Case Studies at Netflix.

Paper Link】 【Pages】:645-654

【Authors】: Huizhi Xie ; Juliette Aurisset

【Abstract】: Controlled experiments are widely regarded as the most scientific way to establish a true causal relationship between product changes and their impact on business metrics. Many technology companies rely on such experiments as their main data-driven decision-making tool. The sensitivity of a controlled experiment refers to its ability to detect differences in business metrics due to product changes. At Netflix, with tens of millions of users, increasing the sensitivity of controlled experiments is critical as failure to detect a small effect, either positive or negative, can have a substantial revenue impact. This paper focuses on methods to increase sensitivity by reducing the sampling variance of business metrics. We define Netflix business metrics and share context around the critical need for improved sensitivity. We review popular variance reduction techniques that are broadly applicable to any type of controlled experiment and metric. We describe an innovative implementation of stratified sampling at Netflix where users are assigned to experiments in real time and discuss some surprising challenges with the implementation. We conduct case studies to compare these variance reduction techniques on a few Netflix datasets. Based on the empirical results, we recommend to use post-assignment variance reduction techniques such as post stratification and CUPED instead of at-assignment variance reduction techniques such as stratified sampling in large-scale controlled experiments.

【Keywords】: a/b testing; controlled experiment; randomized experiment; sensitivity; variance reduction

80. Talent Circle Detection in Job Transition Networks.

Paper Link】 【Pages】:655-664

【Authors】: Huang Xu ; Zhiwen Yu ; Jingyuan Yang ; Hui Xiong ; Hengshu Zhu

【Abstract】: With the high mobility of talent, it becomes critical for the recruitment team to find the right talent from the right source in an efficient manner. The prevalence of Online Professional Networks (OPNs), such as LinkedIn, enables the new paradigm for talent recruitment and job search. However, the dynamic and complex nature of such talent information imposes significant challenges to identify prospective talent sources from large-scale professional networks. Therefore, in this paper, we propose to create a job transition network where vertices stand for organizations and a directed edge represents the talent flow between two organizations for a time period. By analyzing this job transition network, it is able to extract talent circles in a way such that every circle includes the organizations with similar talent exchange patterns. Then, the characteristics of these talent circles can be used for talent recruitment and job search. To this end, we develop a talent circle detection model and design the corresponding learning method by maximizing the Normalized Discounted Cumulative Gain (NDCG) of inferred probability for the edge existence based on edge weights. Then, the identified circles will be labeled by the representative organizations as well as keywords in job descriptions. Moreover, based on these identified circles, we develop a talent exchange prediction method for talent recommendation. Finally, we have performed extensive experiments on real-world data. The results show that, our method can achieve much higher modularity when comparing to the benchmark approaches, as well as high precision and recall for talent exchange prediction.

【Keywords】: people analytics; talent circle detection

81. Bid-aware Gradient Descent for Unbiased Learning with Censored Data in Display Advertising.

Paper Link】 【Pages】:665-674

【Authors】: Weinan Zhang ; Tianxiong Zhou ; Jun Wang ; Jian Xu

【Abstract】: In real-time display advertising, ad slots are sold per impression via an auction mechanism. For an advertiser, the campaign information is incomplete --- the user responses (e.g, clicks or conversions) and the market price of each ad impression are observed only if the advertiser's bid had won the corresponding ad auction. The predictions, such as bid landscape forecasting, click-through rate (CTR) estimation, and bid optimisation, are all operated in the pre-bid stage with full-volume bid request data. However, the training data is gathered in the post-bid stage with a strong bias towards the winning impressions. A common solution for learning over such censored data is to reweight data instances to correct the discrepancy between training and prediction. However, little study has been done on how to obtain the weights independent of previous bidding strategies and consequently integrate them into the final CTR prediction and bid generation steps. In this paper, we formulate CTR estimation and bid optimisation under such censored auction data. Derived from a survival model, we show that historic bid information is naturally incorporated to produce Bid-aware Gradient Descents (BGD) which controls both the importance and the direction of the gradient to achieve unbiased learning. The empirical study based on two large-scale real-world datasets demonstrates remarkable performance gains from our solution. The learning framework has been deployed on Yahoo!'s real-time bidding platform and provided 2.97% AUC lift for CTR estimation and 9.30% eCPC drop for bid optimisation in an online A/B test.

【Keywords】: censored data; display advertising; real-time bidding; unbiased learning

Research Track Full Papers 70

82. Compact and Scalable Graph Neighborhood Sketching.

Paper Link】 【Pages】:685-694

【Authors】: Takuya Akiba ; Yosuke Yano

【Abstract】: The all-distances sketch (ADS) has recently emerged as a promising paradigm of graph neighborhood sketching. An ADS is a probabilistic data structure that is defined for each vertex of a graph. ADSs facilitate accurate estimation of many useful indicators for network analysis with the guarantee of accuracy, and the ADSs for all the vertices in a graph can be computed in near-linear time. Because of these useful properties, ADS has attracted considerable attention. However, a critical drawback of ADS is its space requirement, which tends to be much larger than that of the graph itself. In the present study, we address this issue by designing a new graph sketching scheme, namely, sketch retrieval shortcuts (SRS). Although SRSs are more space-efficient than ADSs by an order of magnitude, an ADS of any vertex can be quickly retrieved from the SRSs. The retrieved ADSs can be used to estimate the aforementioned indicators in exactly the same manner as with plain ADSs, inheriting the same accuracy guarantee. Our experiments on real-world networks demonstrate the usefulness of SRSs as a practical back-end of large-scale graph data mining.

【Keywords】: all-distances sketches; graphs; min-hash sketches

83. Streaming-LDA: A Copula-based Approach to Modeling Topic Dependencies in Document Streams.

Paper Link】 【Pages】:695-704

【Authors】: Hesam Amoualian ; Marianne Clausel ; Éric Gaussier ; Massih-Reza Amini

【Abstract】: We propose in this paper two new models for modeling topic and word-topic dependencies between consecutive documents in document streams. The first model is a direct extension of Latent Dirichlet Allocation model (LDA) and makes use of a Dirichlet distribution to balance the influence of the LDA prior parameters wrt to topic and word-topic distribution of the previous document. The second extension makes use of copulas, which constitute a generic tools to model dependencies between random variables. We rely here on Archimedean copulas, and more precisely on Franck copulas, as they are symmetric and associative and are thus appropriate for exchangeable random variables. Our experiments, conducted on three standard collections that have been used in several studies on topic modeling, show that our proposals outperform previous ones (as dynamic topic models and temporal \LDA), both in terms of perplexity and for tracking similar topics in a document stream.

【Keywords】: copulas; document streams; latent dirichlet allocation; topic dependencies

84. Assessing Human Error Against a Benchmark of Perfection.

Paper Link】 【Pages】:705-714

【Authors】: Ashton Anderson ; Jon M. Kleinberg ; Sendhil Mullainathan

【Abstract】: An increasing number of domains are providing us with detailed trace data on human decisions in settings where we can evaluate the quality of these decisions via an algorithm. Motivated by this development, an emerging line of work has begun to consider whether we can characterize and predict the kinds of decisions where people are likely to make errors. To investigate what a general framework for human error prediction might look like, we focus on a model system with a rich history in the behavioral sciences: the decisions made by chess players as they select moves in a game. We carry out our analysis at a large scale, employing datasets with several million recorded games, and using chess tablebases to acquire a form of ground truth for a subset of chess positions that have been completely solved by computers but remain challenging even for the best players in the world. We organize our analysis around three categories of features that we argue are present in most settings where the analysis of human error is applicable: the skill of the decision-maker, the time available to make the decision, and the inherent difficulty of the decision. We identify rich structure in all three of these categories of features, and find strong evidence that in our domain, features describing the inherent difficulty of an instance are significantly more powerful than features based on skill or time.

【Keywords】: error prediction; game-playing; human decision-making; human error

85. Inferring Network Effects from Observational Data.

Paper Link】 【Pages】:715-724

【Authors】: David T. Arbour ; Dan Garant ; David D. Jensen

【Abstract】: We present Relational Covariate Adjustment (RCA), a general method for estimating causal effects in relational data. Relational Covariate Adjustment is implemented through two high-level operations: identification of an adjustment set and relational regression adjustment. The former is achieved through an extension of Pearl's back-door criterion to relational domains. We demonstrate how this extended definition can be used to estimate causal effects in the presence of network interference and confounding. RCA is agnostic to functional form, and it can easily model both discrete and continuous treatments as well as estimate the effects of a wider array of network interventions than existing experimental approaches. We show that RCA can yield robust estimates of causal effects using common regression models without extensive parameter tuning. Through a series of simulation experiments on a variety of synthetic and real-world network structures, we show that causal effects estimated on observational data with RCA are nearly as accurate as those estimated from well-designed network experiments

【Keywords】: causality; relational learning

86. Communication Efficient Distributed Kernel Principal Component Analysis.

Paper Link】 【Pages】:725-734

【Authors】: Maria-Florina Balcan ; Yingyu Liang ; Le Song ; David P. Woodruff ; Bo Xie

【Abstract】: Kernel Principal Component Analysis (KPCA) is a key machine learning algorithm for extracting nonlinear features from data. In the presence of a large volume of high dimensional data collected in a distributed fashion, it becomes very costly to communicate all of this data to a single data center and then perform kernel PCA. Can we perform kernel PCA on the entire dataset in a distributed and communication efficient fashion while maintaining provable and strong guarantees in solution quality? In this paper, we give an affirmative answer to the question by developing a communication efficient algorithm to perform kernel PCA in the distributed setting. The algorithm is a clever combination of subspace embedding and adaptive sampling techniques, and we show that the algorithm can take as input an arbitrary configuration of distributed datasets, and compute a set of global kernel principal components with relative error guarantees independent of the dimension of the feature space or the total number of data points. In particular, computing k principal components with relative error ε over s workers has communication cost Õ(spk/ε+sk2/ε3) words, where p is the average number of nonzero entries in each data point. Furthermore, we experimented the algorithm with large-scale real world datasets and showed that the algorithm produces a high quality kernel PCA solution while using significantly less communication than alternative approaches.

【Keywords】: distributed computing; kernel method; principal component analysis

87. Keeping it Short and Simple: Summarising Complex Event Sequences with Multivariate Patterns.

Paper Link】 【Pages】:735-744

【Authors】: Roel Bertens ; Jilles Vreeken ; Arno Siebes

【Abstract】: We study how to obtain concise descriptions of discrete multivariate sequential data. In particular, how to do so in terms of rich multivariate sequential patterns that can capture potentially highly interesting (cor)relations between sequences. To this end we allow our pattern language to span over the domains (alphabets) of all sequences, allow patterns to overlap temporally, as well as allow for gaps in their occurrences. We formalise our goal by the Minimum Description Length principle, by which our objective is to discover the set of patterns that provides the most succinct description of the data. To discover high-quality pattern sets directly from data, we introduce Ditto, a highly efficient algorithm that approximates the ideal result very well. Experiments show that Ditto correctly discovers the patterns planted in synthetic data. Moreover, it scales favourably with the length of the data, the number of attributes, the alphabet sizes. On real data, ranging from sensor networks to annotated text, Ditto discovers easily interpretable summaries that provide clear insight in both the univariate and multivariate structure.

【Keywords】: MDL; multivariate patterns; multivariate sequences; summarisation

88. The Limits of Popularity-Based Recommendations, and the Role of Social Ties.

Paper Link】 【Pages】:745-754

【Authors】: Marco Bressan ; Stefano Leucci ; Alessandro Panconesi ; Prabhakar Raghavan ; Erisa Terolli

【Abstract】: In this paper we introduce a mathematical model that captures some of the salient features of recommender systems that are based on popularity and that try to exploit social ties among the users. We show that, under very general conditions, the market always converges to a steady state, for which we are able to give an explicit form. Thanks to this we can tell rather precisely how much a market is altered by a recommendation system, and determine the power of users to influence others. Our theoretical results are complemented by experiments with real world social networks showing that social graphs prevent large market distortions in spite of the presence of highly influential users.

【Keywords】: market equilibrium; pagerank; popularity effect; recommender systems; shapley value; social influence; social ties

89. Positive-Unlabeled Learning in Streaming Networks.

Paper Link】 【Pages】:755-764

【Authors】: Shiyu Chang ; Yang Zhang ; Jiliang Tang ; Dawei Yin ; Yi Chang ; Mark A. Hasegawa-Johnson ; Thomas S. Huang

【Abstract】: Data of many problems in real-world systems such as link prediction and one-class recommendation share common characteristics. First, data are in the form of positive unlabeled (PU) measurements (e.g. Twitter "following", Facebook "like", etc.) that do not provide negative information, which can be naturally represented as networks. Second, in the era of big data, such data are generated temporally-ordered, continuously and rapidly, which determines its streaming nature. These common characteristics allow us to unify many problems into a novel framework -- PU learning in streaming networks. In this paper, a principled probabilistic approach SPU is proposed to leverage the characteristics of the streaming PU inputs. In particular, SPU captures temporal dynamics and provides real-time adaptations and predictions by identifying the potential negative signals concealed in unlabeled data. Our empirical results on various real-world datasets demonstrate the effectiveness of the proposed framework over other state-of-the-art methods in both link prediction and recommendation.

【Keywords】: PU learning; continuous time; dynamic network; online learning; streaming link prediction; streaming recommendation

90. FASCINATE: Fast Cross-Layer Dependency Inference on Multi-layered Networks.

Paper Link】 【Pages】:765-774

【Authors】: Chen Chen ; Hanghang Tong ; Lei Xie ; Lei Ying ; Qing He

【Abstract】: Multi-layered networks have recently emerged as a new network model, which naturally finds itself in many high-impact application domains, ranging from critical inter-dependent infrastructure networks, biological systems, organization-level collaborations, to cross-platform e-commerce, etc. Cross-layer dependency, which describes the dependencies or the associations between nodes across different layers/networks, often plays a central role in many data mining tasks on such multi-layered networks. Yet, it remains a daunting task to accurately know the cross-layer dependency a prior. In this paper, we address the problem of inferring the missing cross-layer dependencies on multi-layered networks. The key idea behind our method is to view it as a collective collaborative filtering problem. By formulating the problem into a regularized optimization model, we propose an effective algorithm to find the local optima with linear complexity. Furthermore, we derive an online algorithm to accommodate newly arrived nodes, whose complexity is just linear wrt the size of the neighborhood of the new node. We perform extensive empirical evaluations to demonstrate the effectiveness and the efficiency of the proposed methods.

【Keywords】: dependency inference; multi-layered network

91. Predicting Matchups and Preferences in Context.

Paper Link】 【Pages】:775-784

【Authors】: Shuo Chen ; Thorsten Joachims

【Abstract】: We present a general probabilistic framework for predicting the outcome of pairwise matchups (e.g. two-player sport matches) and pairwise preferences (e.g. product preferences), both of which have widespread applications ranging from matchmaking in computer games to recommendation in e-commerce. Unlike existing models for these tasks, our model not only learns representations of the items in a more expressive latent vector space, but also models how context modifies matchup and preference outcomes. For example, the context "weather" may alter the winning probability in a tennis match, or the fact that the user is on a mobile device may alter his preferences among restaurants. More generally, the model is capable of handling any symmetric game/comparison problem that can be described by vectorized player/item and game/context features. We provide a comprehensive evaluation of its predictive performance with real datasets from both domains to show its ability to predict preference and game outcomes more accurately than existing models. Furthermore, we demonstrate on synthetic datasets the expressiveness of the model when compared against theoretical limits.

【Keywords】: games; matchup; pairwise comparison; preference; ranking; representation learning; sports

92. XGBoost: A Scalable Tree Boosting System.

Paper Link】 【Pages】:785-794

【Authors】: Tianqi Chen ; Carlos Guestrin

【Abstract】: Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems.

【Keywords】: large-scale machine learning

93. Robust Influence Maximization.

Paper Link】 【Pages】:795-804

【Authors】: Wei Chen ; Tian Lin ; Zihan Tan ; Mingfei Zhao ; Xuren Zhou

【Abstract】: In this paper, we address the important issue of uncertainty in the edge influence probability estimates for the well studied influence maximization problem --- the task of finding k seed nodes in a social network to maximize the influence spread. We propose the problem of robust influence maximization, which maximizes the worst-case ratio between the influence spread of the chosen seed set and the optimal seed set, given the uncertainty of the parameter input. We design an algorithm that solves this problem with a solution-dependent bound. We further study uniform sampling and adaptive sampling methods to effectively reduce the uncertainty on parameters and improve the robustness of the influence maximization task. Our empirical results show that parameter uncertainty may greatly affect influence maximization performance and prior studies that learned influence probabilities could lead to poor performance in robust influence maximization due to relatively large uncertainty in parameter estimates, and information cascade based adaptive sampling method may be an effective way to improve the robustness of influence maximization.

【Keywords】: influence maximization; information diffusion; robust optimization; social networks

94. Ranking Causal Anomalies via Temporal and Dynamical Analysis on Vanishing Correlations.

Paper Link】 【Pages】:805-814

【Authors】: Wei Cheng ; Kai Zhang ; Haifeng Chen ; Guofei Jiang ; Zhengzhang Chen ; Wei Wang

【Abstract】: Modern world has witnessed a dramatic increase in our ability to collect, transmit and distribute real-time monitoring and surveillance data from large-scale information systems and cyber-physical systems. Detecting system anomalies thus attracts significant amount of interest in many fields such as security, fault management, and industrial optimization. Recently, invariant network has shown to be a powerful way in characterizing complex system behaviours. In the invariant network, a node represents a system component and an edge indicates a stable, significant interaction between two components. Structures and evolutions of the invariance network, in particular the vanishing correlations, can shed important light on locating causal anomalies and performing diagnosis. However, existing approaches to detect causal anomalies with the invariant network often use the percentage of vanishing correlations to rank possible casual components, which have several limitations: 1) fault propagation in the network is ignored; 2) the root casual anomalies may not always be the nodes with a high-percentage of vanishing correlations; 3) temporal patterns of vanishing correlations are not exploited for robust detection. To address these limitations, in this paper we propose a network diffusion based framework to identify significant causal anomalies and rank them. Our approach can effectively model fault propagation over the entire invariant network, and can perform joint inference on both the structural, and the time-evolving broken invariance patterns. As a result, it can locate high-confidence anomalies that are truly responsible for the vanishing correlations, and can compensate for unstructured measurement noise in the system. Extensive experiments on synthetic datasets, bank information system datasets, and coal plant cyber-physical system datasets demonstrate the effectiveness of our approach.

【Keywords】: causal anomalies ranking; label propagation; nonnegative matrix factorization

95. Towards Conversational Recommender Systems.

Paper Link】 【Pages】:815-824

【Authors】: Konstantina Christakopoulou ; Filip Radlinski ; Katja Hofmann

【Abstract】: People often ask others for restaurant recommendations as a way to discover new dining experiences. This makes restaurant recommendation an exciting scenario for recommender systems and has led to substantial research in this area. However, most such systems behave very differently from a human when asked for a recommendation. The goal of this paper is to begin to reduce this gap. In particular, humans can quickly establish preferences when asked to make a recommendation for someone they do not know. We address this cold-start recommendation problem in an online learning setting. We develop a preference elicitation framework to identify which questions to ask a new user to quickly learn their preferences. Taking advantage of latent structure in the recommendation space using a probabilistic latent factor model, our experiments with both synthetic and real world data compare different types of feedback and question selection strategies. We find that our framework can make very effective use of online user feedback, improving personalized recommendations over a static model by 25% after asking only 2 questions. Our results demonstrate dramatic benefits of starting from offline embeddings, and highlight the benefit of bandit-based explore-exploit strategies in this setting.

【Keywords】: cold-start; online learning; recommender systems

96. TRIÈST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fixed Memory Size.

Paper Link】 【Pages】:825-834

【Authors】: Lorenzo De Stefani ; Alessandro Epasto ; Matteo Riondato ; Eli Upfal

【Abstract】: We present TRIEST, a suite of one-pass streaming algorithms to compute unbiased, low-variance, high-quality approximations of the global and local (i.e., incident to each vertex) number of triangles in a fully-dynamic graph represented as an adversarial stream of edge insertions and deletions. Our algorithms use reservoir sampling and its variants to exploit the user-specified memory space at all times. This is in contrast with previous approaches, which require hard-to-choose parameters (e.g., a fixed sampling probability) and offer no guarantees on the amount of memory they use. We analyze the variance of the estimations and show novel concentration bounds for these quantities. Our experimental results on very large graphs demonstrate that TRIEST outperforms state-of-the-art approaches in accuracy and exhibits a small update time.

【Keywords】: cycle counting; data stream mining; dynamic graph algorithms; graph enumeration; probabilistic algorithms; reservoir sampling; social networks; subgraph counting; triangle counting

97. A Subsequence Interleaving Model for Sequential Pattern Mining.

Paper Link】 【Pages】:835-844

【Authors】: Jaroslav M. Fowkes ; Charles A. Sutton

【Abstract】: Recent sequential pattern mining methods have used the minimum description length (MDL) principle to define an encoding scheme which describes an algorithm for mining the most compressing patterns in a database. We present a novel subsequence interleaving model based on a probabilistic model of the sequence database, which allows us to search for the most compressing set of patterns without designing a specific encoding scheme. Our proposed algorithm is able to efficiently mine the most relevant sequential patterns and rank them using an associated measure of interestingness. The efficient inference in our model is a direct result of our use of a structural expectation-maximization framework, in which the expectation-step takes the form of a submodular optimization problem subject to a coverage constraint. We show on both synthetic and real world datasets that our model mines a set of sequential patterns with low spuriousness and redundancy, high interpretability and usefulness in real-world applications. Furthermore, we demonstrate that the quality of the patterns from our approach is comparable to, if not better than, existing state of the art sequential pattern mining algorithms.

【Keywords】: exploratory data analysis; generative model; sequential pattern mining

98. Efficient Frequent Directions Algorithm for Sparse Matrices.

Paper Link】 【Pages】:845-854

【Authors】: Mina Ghashami ; Edo Liberty ; Jeff M. Phillips

【Abstract】: This paper describes Sparse Frequent Directions, a variant of Frequent Directions for sketching sparse matrices. It resembles the original algorithm in many ways: both receive the rows of an input matrix An x d one by one in the streaming setting and compute a small sketch B ∈ Rl x d. Both share the same strong (provably optimal) asymptotic guarantees with respect to the space-accuracy tradeoff in the streaming setting. However, unlike Frequent Directions which runs in O(ndl) time regardless of the sparsity of the input matrix A, Sparse Frequent Directions runs in Õ(nnz(A)l + nl2) time. Our analysis loosens the dependence on computing the Singular Value Decomposition (SVD) as a black box within the Frequent Directions algorithm. Our bounds require recent results on the properties of fast approximate SVD computations. Finally, we empirically demonstrate that these asymptotic improvements are practical and significant on real and synthetic data.

【Keywords】: frequent directions; matrix sketching; sparse matrix

99. node2vec: Scalable Feature Learning for Networks.

Paper Link】 【Pages】:855-864

【Authors】: Aditya Grover ; Jure Leskovec

【Abstract】: Prediction tasks over nodes and edges in networks require careful effort in engineering features used by learning algorithms. Recent research in the broader field of representation learning has led to significant progress in automating prediction by learning the features themselves. However, present feature learning approaches are not expressive enough to capture the diversity of connectivity patterns observed in networks. Here we propose node2vec, an algorithmic framework for learning continuous feature representations for nodes in networks. In node2vec, we learn a mapping of nodes to a low-dimensional space of features that maximizes the likelihood of preserving network neighborhoods of nodes. We define a flexible notion of a node's network neighborhood and design a biased random walk procedure, which efficiently explores diverse neighborhoods. Our algorithm generalizes prior work which is based on rigid notions of network neighborhoods, and we argue that the added flexibility in exploring neighborhoods is the key to learning richer representations. We demonstrate the efficacy of node2vec over existing state-of-the-art techniques on multi-label classification and link prediction in several real-world networks from diverse domains. Taken together, our work represents a new way for efficiently learning state-of-the-art task-independent representations in complex networks.

【Keywords】: feature learning; graph representations; information networks; node embeddings

100. Generalized Hierarchical Sparse Model for Arbitrary-Order Interactive Antigenic Sites Identification in Flu Virus Data.

Paper Link】 【Pages】:865-874

【Authors】: Lei Han ; Yu Zhang ; Xiu-Feng Wan ; Tong Zhang

【Abstract】: Recent statistical evidence has shown that a regression model by incorporating the interactions among the original covariates (features) can significantly improve the interpretability for biological data. One major challenge is the exponentially expanded feature space when adding high-order feature interactions to the model. To tackle the huge dimensionality, Hierarchical Sparse Models (HSM) are developed by enforcing sparsity under heredity structures in the interactions among the covariates. However, existing methods only consider pairwise interactions, making the discovery of important high-order interactions a non-trivial open problem. In this paper, we propose a Generalized Hierarchical Sparse Model (GHSM) as a generalization of the HSM models to learn arbitrary-order interactions. The GHSM applies the l1 penalty to all the model coefficients under a constraint that given any covariate, if none of its associated kth-order interactions contribute to the regression model, then neither do its associated higher-order interactions. The resulting objective function is non-convex with a challenge lying in the coupled variables appearing in the arbitrary-order hierarchical constraints and we devise an efficient optimization algorithm to directly solve it. Specifically, we decouple the variables in the constraints via both the GIST and ADMM methods into three subproblems, each of which is proved to admit an efficiently analytical solution. We evaluate the GHSM method in both synthetic problem and the antigenic sites identification problem for the flu virus data, where we expand the feature space up to the 5th-order interactions. Empirical results demonstrate the effectiveness and efficiency of the proposed method and the learned high-order interactions have meaningful synergistic covariate patterns in the virus antigenicity.

【Keywords】: antigenic sites identification; heredity structure; hierarchical sparsity; high-order interaction

101. Joint Community and Structural Hole Spanner Detection via Harmonic Modularity.

Paper Link】 【Pages】:875-884

【Authors】: Lifang He ; Chun-Ta Lu ; Jiaqi Ma ; Jianping Cao ; Linlin Shen ; Philip S. Yu

【Abstract】: Detecting communities (or modular structures) and structural hole spanners, the nodes bridging different communities in a network, are two essential tasks in the realm of network analytics. Due to the topological nature of communities and structural hole spanners, these two tasks are naturally tangled with each other, while there has been little synergy between them. In this paper, we propose a novel harmonic modularity method to tackle both tasks simultaneously. Specifically, we apply a harmonic function to measure the smoothness of community structure and to obtain the community indicator. We then investigate the sparsity level of the interactions between communities, with particular emphasis on the nodes connecting to multiple communities, to discriminate the indicator of SH spanners and assist the community guidance. Extensive experiments on real-world networks demonstrate that our proposed method outperforms several state-of-the-art methods in the community detection task and also in the SH spanner identification task (even the methods that require the supervised community information). Furthermore, by removing the SH spanners spotted by our method, we show that the quality of other community detection methods can be further improved.

【Keywords】: community detection; harmonic function; modularity; social network; structural hole

102. Robust Influence Maximization.

Paper Link】 【Pages】:885-894

【Authors】: Xinran He ; David Kempe

【Abstract】: Uncertainty about models and data is ubiquitous in the computational social sciences, and it creates a need for robust social network algorithms, which can simultaneously provide guarantees across a spectrum of models and parameter settings. We begin an investigation into this broad domain by studying robust algorithms for the Influence Maximization problem, in which the goal is to identify a set of k nodes in a social network whose joint influence on the network is maximized. We define a Robust Influence Maximization framework wherein an algorithm is presented with a set of influence functions, typically derived from different influence models or different parameter settings for the same model. The different parameter settings could be derived from observed cascades on different topics, under different conditions, or at different times. The algorithm's goal is to identify a set of k nodes who are simultaneously influential for all influence functions, compared to the (function-specific) optimum solutions. We show strong approximation hardness results for this problem unless the algorithm gets to select at least a logarithmic factor more seeds than the optimum solution. However, when enough extra seeds may be selected, we show that techniques of Krause et al. can be used to approximate the optimum robust influence to within a factor of 1-1/e. We evaluate this bicriteria approximation algorithm against natural heuristics on several real-world data sets. Our experiments indicate that the worst-case hardness does not necessarily translate into bad performance on real-world data sets; all algorithms perform fairly well.

【Keywords】: influence maximization; noise; robust optimization; submodular optimization; uncertainty

103. FRAUDAR: Bounding Graph Fraud in the Face of Camouflage.

Paper Link】 【Pages】:895-904

【Authors】: Bryan Hooi ; Hyun Ah Song ; Alex Beutel ; Neil Shah ; Kijung Shin ; Christos Faloutsos

【Abstract】: Given a bipartite graph of users and the products that they review, or followers and followees, how can we detect fake reviews or follows? Existing fraud detection methods (spectral, etc.) try to identify dense subgraphs of nodes that are sparsely connected to the remaining graph. Fraudsters can evade these methods using camouflage, by adding reviews or follows with honest targets so that they look "normal". Even worse, some fraudsters use hijacked accounts from honest users, and then the camouflage is indeed organic. Our focus is to spot fraudsters in the presence of camouflage or hijacked accounts. We propose FRAUDAR, an algorithm that (a) is camouflage-resistant, (b) provides upper bounds on the effectiveness of fraudsters, and (c) is effective in real-world data. Experimental results under various attacks show that FRAUDAR outperforms the top competitor in accuracy of detecting both camouflaged and non-camouflaged fraud. Additionally, in real-world experiments with a Twitter follower-followee graph of 1.47 billion edges, FRAUDAR successfully detected a subgraph of more than 4000 detected accounts, of which a majority had tweets showing that they used follower-buying services.

【Keywords】: content ranking; fraud detection

104. Temporal Order-based First-Take-All Hashing for Fast Attention-Deficit-Hyperactive-Disorder Detection.

Paper Link】 【Pages】:905-914

【Authors】: Hao Hu ; Joey Velez-Ginorio ; Guo-Jun Qi

【Abstract】: Attention Deficit Hyperactive Disorder (ADHD) is one of the most common childhood disorders and can continue through adolescence and adulthood. Although the root cause of the problem still remains unknown, recent advancements in brain imaging technology reveal there exists differences between neural activities of Typically Developing Children (TDC) and ADHD subjects. Inspired by this, we propose a novel First-Take-All (FTA) hashing framework to investigate the problem of fast ADHD subjects detection through the fMRI time-series of neuron activities. By hashing time courses from regions of interests (ROIs) in the brain into fixed-size hash codes, FTA can compactly encode the temporal order differences between the neural activity patterns that are key to distinguish TDC and ADHD subjects. Such patterns can be directly learned via minimizing the training loss incurred by the generated FTA codes. By conducting similarity search on the resultant FTA codes, data-driven ADHD detection can be achieved in an efficient fashion. The experiments' results on real-world ADHD detection benchmarks demonstrate the FTA can outperform the state-of-the-art baselines using only neural activity time series without any phenotypic information.

【Keywords】: ADHD detection; first-take-all; time series hashing

105. When Social Influence Meets Item Inference.

Paper Link】 【Pages】:915-924

【Authors】: Hui-Ju Hung ; Hong-Han Shuai ; De-Nian Yang ; Liang-Hao Huang ; Wang-Chien Lee ; Jian Pei ; Ming-Syan Chen

【Abstract】: Research issues and data mining techniques for product recommendation and viral marketing have been widely studied. Existing works on seed selection in social networks do not take into account the effect of product recommendations in e-commerce stores. In this paper, we investigate the seed selection problem for viral marketing that considers both effects of social influence and item inference (for product recommendation). We develop a new model, Social Item Graph (SIG), that captures both effects in the form of hyperedges. Accordingly, we formulate a seed selection problem, called Social Item Maximization Problem (SIMP), and prove the hardness of SIMP. We design an efficient algorithm with performance guarantee, called Hyperedge-Aware Greedy (HAG), for SIMP and develop a new index structure, called SIG-index, to accelerate the computation of diffusion process in HAG. Moreover, to construct realistic SIG models for SIMP, we develop a statistical inference based framework to learn the weights of hyperedges from data. Finally, we perform a comprehensive evaluation on our proposals with various baselines. Experimental result validates our ideas and demonstrates the effectiveness and efficiency of the proposed model and algorithms over baselines.

【Keywords】: frequent pattern; product recommendation; viral marketing

106. Privacy-preserving Class Ratio Estimation.

Paper Link】 【Pages】:925-934

【Authors】: Arun Shankar Iyer ; J. Saketha Nath ; Sunita Sarawagi

【Abstract】: In this paper we present learning models for the class ratio estimation problem, which takes as input an unlabeled set of instances and predicts the proportions of instances in the set belonging to the different classes. This problem has applications in social and commercial data analysis. Existing models for class-ratio estimation however require instance-level supervision. Whereas in domains like politics, and demography, set-level supervision is more common. We present a new method for directly estimating class-ratios using set-level supervision. Another serious limitation in applying these techniques to sensitive domains like health is data privacy. We propose a novel label privacy-preserving mechanism that is well-suited for supervised class ratio estimation and has guarantees for achieving efficient differential privacy, provided the per-class counts are large enough. We derive learning bounds for the estimation with and without privacy constraints, which lead to important insights for the data-publisher. Extensive empirical evaluation shows that our model is more accurate than existing methods and that the proposed privacy mechanism and learning model are well-suited for each other.

【Keywords】: differential privacy; kernel method; learning theory; maximum mean discrepancy; statistical estimation

107. Extreme Multi-label Loss Functions for Recommendation, Tagging, Ranking & Other Missing Label Applications.

Paper Link】 【Pages】:935-944

【Authors】: Himanshu Jain ; Yashoteja Prabhu ; Manik Varma

【Abstract】: The choice of the loss function is critical in extreme multi-label learning where the objective is to annotate each data point with the most relevant subset of labels from an extremely large label set. Unfortunately, existing loss functions, such as the Hamming loss, are unsuitable for learning, model selection, hyperparameter tuning and performance evaluation. This paper addresses the issue by developing propensity scored losses which: (a) prioritize predicting the few relevant labels over the large number of irrelevant ones; (b) do not erroneously treat missing labels as irrelevant but instead provide unbiased estimates of the true loss function even when ground truth labels go missing under arbitrary probabilistic label noise models; and (c) promote the accurate prediction of infrequently occurring, hard to predict, but rewarding tail labels. Another contribution is the development of algorithms which efficiently scale to extremely large datasets with up to 9 million labels, 70 million points and 2 million dimensions and which give significant improvements over the state-of-the-art. This paper's results also apply to tagging, recommendation and ranking which are the motivating applications for extreme multi-label learning. They generalize previous attempts at deriving unbiased losses under the restrictive assumption that labels go missing uniformly at random from the ground truth. Furthermore, they provide a sound theoretical justification for popular label weighting heuristics used to recommend rare items. Finally, they demonstrate that the proposed contributions align with real world applications by achieving superior clickthrough rates on sponsored search advertising in Bing.

【Keywords】: extreme classification; multi-label learning; ranking; recommendation; tagging

108. CatchTartan: Representing and Summarizing Dynamic Multicontextual Behaviors.

Paper Link】 【Pages】:945-954

【Authors】: Meng Jiang ; Christos Faloutsos ; Jiawei Han

【Abstract】: Representing and summarizing human behaviors with rich contexts facilitates behavioral sciences and user-oriented services. Traditional behavioral modeling represents a behavior as a tuple in which each element is one contextual factor of one type, and the tensor-based summaries look for high-order dense blocks by clustering the values (including timestamps) in each dimension. However, the human behaviors are multicontextual and dynamic: (1) each behavior takes place within multiple contexts in a few dimensions, which requires the representation to enable non-value and set-values for each dimension; (2) many behavior collections, such as tweets or papers, evolve over time. In this paper, we represent the behavioral data as a two-level matrix (temporal-behaviors by dimensional-values) and propose a novel representation for behavioral summary called Tartan that includes a set of dimensions, the values in each dimension, a list of consecutive time slices and the behaviors in each slice. We further develop a propagation method CatchTartan to catch the dynamic multicontextual patterns from the temporal multidimensional data in a principled and scalable way: it determines the meaningfulness of updating every element in the Tartan by minimizing the encoding cost in a compression manner. CatchTartan outperforms the baselines on both the accuracy and speed. We apply CatchTartan to four Twitter datasets up to 10 million tweets and the DBLP data, providing comprehensive summaries for the events, human life and scientific development.

【Keywords】: behavior representation; behavior summarization; minimum description length

109. Smart Reply: Automated Response Suggestion for Email.

Paper Link】 【Pages】:955-964

【Authors】: Anjuli Kannan ; Karol Kurach ; Sujith Ravi ; Tobias Kaufmann ; Andrew Tomkins ; Balint Miklos ; Greg Corrado ; László Lukács ; Marina Ganea ; Peter Young ; Vivek Ramavajjala

【Abstract】: In this paper we propose and investigate a novel end-to-end method for automatically generating short email responses, called Smart Reply. It generates semantically diverse suggestions that can be used as complete email responses with just one tap on mobile. The system is currently used in Inbox by Gmail and is responsible for assisting with 10% of all mobile responses. It is designed to work at very high throughput and process hundreds of millions of messages daily. The system exploits state-of-the-art, large-scale deep learning. We describe the architecture of the system as well as the challenges that we faced while building it, like response diversity and scalability. We also introduce a new method for semantic clustering of user-generated content that requires only a modest amount of explicitly labeled data.

【Keywords】: clustering; deep learning; email; lstm; semantics

110. Mining Subgroups with Exceptional Transition Behavior.

Paper Link】 【Pages】:965-974

【Authors】: Florian Lemmerich ; Martin Becker ; Philipp Singer ; Denis Helic ; Andreas Hotho ; Markus Strohmaier

【Abstract】: We present a new method for detecting interpretable subgroups with exceptional transition behavior in sequential data. Identifying such patterns has many potential applications, e.g., for studying human mobility or analyzing the behavior of internet users. To tackle this task, we employ exceptional model mining, which is a general approach for identifying interpretable data subsets that exhibit unusual interactions between a set of target attributes with respect to a certain model class. Although exceptional model mining provides a well-suited framework for our problem, previously investigated model classes cannot capture transition behavior. To that end, we introduce first-order Markov chains as a novel model class for exceptional model mining and present a new interestingness measure that quantifies the exceptionality of transition subgroups. The measure compares the distance between the Markov transition matrix of a subgroup and the respective matrix of the entire data with the distance of random dataset samples. In addition, our method can be adapted to find subgroups that match or contradict given transition hypotheses. We demonstrate that our method is consistently able to recover subgroups with exceptional transition models from synthetic data and illustrate its potential in two application examples. Our work is relevant for researchers and practitioners interested in detecting exceptional transition behavior in sequential data.

【Keywords】: Markov chains; exceptional model mining; sequential data; subgroup discovery; transitions

111. Point-of-Interest Recommendations: Learning Potential Check-ins from Friends.

Paper Link】 【Pages】:975-984

【Authors】: Huayu Li ; Yong Ge ; Richang Hong ; Hengshu Zhu

【Abstract】: The emergence of Location-based Social Network (LBSN) services provides a wonderful opportunity to build personalized Point-of-Interest (POI) recommender systems. Although a personalized POI recommender system can significantly facilitate users' outdoor activities, it faces many challenging problems, such as the hardness to model user's POI decision making process and the difficulty to address data sparsity and user/location cold-start problem. To cope with these challenges, we define three types of friends (i.e., social friends, location friends, and neighboring friends) in LBSN, and develop a two-step framework to leverage the information of friends to improve POI recommendation accuracy and address cold-start problem. Specifically, we first propose to learn a set of potential locations that each individual's friends have checked-in before and this individual is most interested in. Then we incorporate three types of check-ins (i.e., observed check-ins, potential check-ins and other unobserved check-ins) into matrix factorization model using two different loss functions (i.e., the square error based loss and the ranking error based loss). To evaluate the proposed model, we conduct extensive experiments with many state-of-the-art baseline methods and evaluation metrics on two real-world data sets. The experimental results demonstrate the effectiveness of our methods.

【Keywords】: matrix factorization; point-of-interest; recommendation

112. QUINT: On Query-Specific Optimal Networks.

Paper Link】 【Pages】:985-994

【Authors】: Liangyue Li ; Yuan Yao ; Jie Tang ; Wei Fan ; Hanghang Tong

【Abstract】: Measuring node proximity on large scale networks is a fundamental building block in many application domains, ranging from computer vision, e-commerce, social networks, software engineering, disaster management to biology and epidemiology. The state of the art (e.g., random walk based methods) typically assumes the input network is given a priori, with the known network topology and the associated edge weights. A few recent works aim to further infer the optimal edge weights based on the side information. This paper generalizes the challenge in multiple dimensions, aiming to learn optimal networks for node proximity measures. First (optimization scope), our proposed formulation explores a much larger parameter space, so that it is able to simultaneously infer the optimal network topology and the associated edge weights. This is important as a noisy or missing edge could greatly mislead the network node proximity measures. Second (optimization granularity), while all the existing works assume one common optimal network, be it given as the input or learned by the algorithms, exists for all queries, our method performs optimization at a much finer granularity, essentially being able to infer an optimal network that is specific to a given query. Third (optimization efficiency), we carefully design our algorithms with a linear complexity wrt the neighborhood size of the user preference set. We perform extensive empirical evaluations on a diverse set of 10+ real networks, which show that the proposed algorithms (1) consistently outperform the existing methods on all six commonly used metrics; (2) empirically scale sub-linearly to billion-scale networks and (3) respond in a fraction of a second.

【Keywords】: node proximity; optimal networks

113. Dynamic Clustering of Streaming Short Documents.

Paper Link】 【Pages】:995-1004

【Authors】: Shangsong Liang ; Emine Yilmaz ; Evangelos Kanoulas

【Abstract】: Clustering technology has found numerous applications in mining textual data. It was shown to enhance the performance of retrieval systems in various different ways, such as identifying different query aspects in search result diversification, improving smoothing in the context of language modeling, matching queries with documents in a latent topic space in ad-hoc retrieval, summarizing documents etc. The vast majority of clustering methods have been developed under the assumption of a static corpus of long (and hence textually rich) documents. Little attention has been given to streaming corpora of short text, which is the predominant type of data in Web 2.0 applications, such as social media, forums, and blogs. In this paper, we consider the problem of dynamically clustering a streaming corpus of short documents. The short length of documents makes the inference of the latent topic distribution challenging, while the temporal dynamics of streams allow topic distributions to change over time. To tackle these two challenges we propose a new dynamic clustering topic model - DCT - that enables tracking the time-varying distributions of topics over documents and words over topics. DCT models temporal dynamics by a short-term or long-term dependency model over sequential data, and overcomes the difficulty of handling short text by assigning a single topic to each short document and using the distributions inferred at a certain point in time as priors for the next inference, allowing the aggregation of information. At the same time, taking a Bayesian approach allows evidence obtained from new streaming documents to change the topic distribution. Our experimental results demonstrate that the proposed clustering algorithm outperforms state-of-the-art dynamic and non-dynamic clustering topic models in terms of perplexity and when integrated in a cluster-based query likelihood model it also outperforms state-of-the-art models in terms of retrieval quality.

【Keywords】: cluster-based retrieval; clustering; streaming text; topic models

114. Rebalancing Bike Sharing Systems: A Multi-source Data Smart Optimization.

Paper Link】 【Pages】:1005-1014

【Authors】: Junming Liu ; Leilei Sun ; Weiwei Chen ; Hui Xiong

【Abstract】: Bike sharing systems, aiming at providing the missing links in public transportation systems, are becoming popular in urban cities. A key to success for a bike sharing systems is the effectiveness of rebalancing operations, that is, the efforts of restoring the number of bikes in each station to its target value by routing vehicles through pick-up and drop-off operations. There are two major issues for this bike rebalancing problem: the determination of station inventory target level and the large scale multiple capacitated vehicle routing optimization with outlier stations. The key challenges include demand prediction accuracy for inventory target level determination, and an effective optimizer for vehicle routing with hundreds of stations. To this end, in this paper, we develop a Meteorology Similarity Weighted K-Nearest-Neighbor (MSWK) regressor to predict the station pick-up demand based on large-scale historic trip records. Based on further analysis on the station network constructed by station-station connections and the trip duration, we propose an inter station bike transition (ISBT) model to predict the station drop-off demand. Then, we provide a mixed integer nonlinear programming (MINLP) formulation of multiple capacitated bike routing problem with the objective of minimizing total travel distance. To solve it, we propose an Adaptive Capacity Constrained K-centers Clustering (AdaCCKC) algorithm to separate outlier stations (the demands of these stations are very large and make the optimization infeasible) and group the rest stations into clusters within which one vehicle is scheduled to redistribute bikes between stations. In this way, the large scale multiple vehicle routing problem is reduced to inner cluster one vehicle routing problem with guaranteed feasible solutions. Finally, the extensive experimental results on the NYC Citi Bike system show the advantages of our approach for bike demand prediction and large-scale bike rebalancing optimization.

【Keywords】: bike sharing system; clustering; optimization

115. Unified Point-of-Interest Recommendation with Temporal Interval Assessment.

Paper Link】 【Pages】:1015-1024

【Authors】: Yanchi Liu ; Chuanren Liu ; Bin Liu ; Meng Qu ; Hui Xiong

【Abstract】: Point-of-interest (POI) recommendation, which helps mobile users explore new places, has become an important location-based service. Existing approaches for POI recommendation have been mainly focused on exploiting the information about user preferences, social influence, and geographical influence. However, these approaches cannot handle the scenario where users are expecting to have POI recommendation for a specific time period. To this end, in this paper, we propose a unified recommender system, named the 'Where and When to gO' (WWO) recommender system, to integrate the user interests and their evolving sequential preferences with temporal interval assessment. As a result, the WWO system can make recommendations dynamically for a specific time period and the traditional POI recommender system can be treated as the special case of the WWO system by setting this time period long enough. Specifically, to quantify users' sequential preferences, we consider the distributions of the temporal intervals between dependent POIs in the historical check-in sequences. Then, to estimate the distributions with only sparse observations, we develop the low-rank graph construction model, which identifies a set of bi-weighted graph bases so as to learn the static user preferences and the dynamic sequential preferences in a coherent way. Finally, we evaluate the proposed approach using real-world data sets from several location-based social networks (LBSNs). The experimental results show that our method outperforms the state-of-the-art approaches for POI recommendation in terms of various metrics, such as F-measure and NDCG, with a significant margin.

【Keywords】: poi recommendation; sequential preference

116. AnyDBC: An Efficient Anytime Density-based Clustering Algorithm for Very Large Complex Datasets.

Paper Link】 【Pages】:1025-1034

【Authors】: Son T. Mai ; Ira Assent ; Martin Storgaard

【Abstract】: The density-based clustering algorithm DBSCAN is a state-of-the-art data clustering technique with numerous applications in many fields. However, its O(n2) time complexity still remains a severe weakness. In this paper, we propose a novel anytime approach to cope with this problem by reducing both the range query and the label propagation time of DBSCAN. Our algorithm, called AnyDBC, compresses the data into smaller density-connected subsets called primitive clusters and labels objects based on connected components of these primitive clusters for reducing the label propagation time. Moreover, instead of passively performing the range query for all objects like existing techniques, AnyDBC iteratively and actively learns the current cluster structure of the data and selects a few most promising objects for refining clusters at each iteration. Thus, in the end, it performs substantially fewer range queries compared to DBSCAN while still guaranteeing the exact final result of DBSCAN. Experiments show speedup factors of orders of magnitude compared to DBSCAN and its fastest variants on very large real and synthetic complex datasets.

【Keywords】: DBSCAN; active learning; anytime algorithm; data mining; density-based clustering

117. Fast Memory-efficient Anomaly Detection in Streaming Heterogeneous Graphs.

Paper Link】 【Pages】:1035-1044

【Authors】: Emaad A. Manzoor ; Sadegh M. Milajerdi ; Leman Akoglu

【Abstract】: Given a stream of heterogeneous graphs containing different types of nodes and edges, how can we spot anomalous ones in real-time while consuming bounded memory? This problem is motivated by and generalizes from its application in security to host-level advanced persistent threat (APT) detection. We propose StreamSpot, a clustering based anomaly detection approach that addresses challenges in two key fronts: (1) heterogeneity, and (2) streaming nature. We introduce a new similarity function for heterogeneous graphs that compares two graphs based on their relative frequency of local substructures, represented as short strings. This function lends itself to a vector representation of a graph, which is (a) fast to compute, and (b) amenable to a sketched version with bounded size that preserves similarity. StreamSpot exhibits desirable properties that a streaming application requires: it is (i) fully-streaming; processing the stream one edge at a time as it arrives, (ii) memory-efficient; requiring constant space for the sketches and the clustering, (iii) fast; taking constant time to update the graph sketches and the cluster summaries that can process over 100,000 edges per second, and (iv) online; scoring and flagging anomalies in real time. Experiments on datasets containing simulated system-call flow graphs from normal browser activity and various attack scenarios (ground truth) show that StreamSpot is high-performance; achieving above 95% detection accuracy with small delay, as well as competitive time and memory usage.

【Keywords】: anomaly detection; dynamic networks; evolving graphs; graph sketches; heterogenous graphs; streamspot; temporal networks; typed graphs

118. Regime Shifts in Streams: Real-time Forecasting of Co-evolving Time Sequences.

Paper Link】 【Pages】:1045-1054

【Authors】: Yasuko Matsubara ; Yasushi Sakurai

【Abstract】: Given a large, online stream of multiple co-evolving event sequences, such as sensor data and Web-click logs, that contains various types of non-linear dynamic evolving patterns of different durations, how can we efficiently and effectively capture important patterns? How do we go about forecasting long-term future events? In this paper, we present REGIMECAST, an efficient and effective method for forecasting co-evolving data streams. REGIMECAST is designed as an adaptive non-linear dynamical system, which is inspired by the concept of "regime shifts" in natural dynamical systems. Our method has the following properties: (a) Effective: it operates on large data streams, captures important patterns and performs long-term forecasting; (b) Adaptive: it automatically and incrementally recognizes the latent trends and dynamic evolution patterns (i.e., regimes) that are unknown in advance; (c) Scalable: it is fast and the computation cost does not depend on the length of data streams; (d) Any-time: it provides a response at any time and generates long-range future events. Extensive experiments on real datasets demonstrate that REGIMECAST does indeed make long-range forecasts, and it outperforms state-of-the-art competitors as regards accuracy and speed.

【Keywords】: real-time forecasting; regime shifts; time series

119. Skinny-dip: Clustering in a Sea of Noise.

Paper Link】 【Pages】:1055-1064

【Authors】: Samuel Maurus ; Claudia Plant

【Abstract】: Can we find heterogeneous clusters hidden in data sets with 80% noise? Although such settings occur in the real-world, we struggle to find methods from the abundance of clustering techniques that perform well with noise at this level. Indeed, perhaps this is enough of a departure from classical clustering to warrant its study as a separate problem. In this paper we present SkinnyDip which, based on Hartigan's elegant dip test of unimodality, represents an intriguing approach to clustering with an attractive set of properties. Specifically, SkinnyDip is highly noise-robust, practically parameter-free and completely deterministic. SkinnyDip never performs multivariate distance calculations, but rather employs insightful recursion based on "dips" into univariate projections of the data. It is able to detect a range of cluster shapes and densities, assuming only that each cluster admits a unimodal shape. Practically, its run-time grows linearly with the data. Finally, for high-dimensional data, continuity properties of the dip enable SkinnyDip to exploit multimodal projection pursuit in order to find an appropriate basis for clustering. Although not without its limitations, SkinnyDip compares favorably to a variety of clustering approaches on synthetic and real data, particularly in high-noise settings.

【Keywords】: clustering; high-noise; modality-testing; parameter-free

120. Semi-Markov Switching Vector Autoregressive Model-Based Anomaly Detection in Aviation Systems.

Paper Link】 【Pages】:1065-1074

【Authors】: Igor Melnyk ; Arindam Banerjee ; Bryan L. Matthews ; Nikunj C. Oza

【Abstract】: In this work we consider the problem of anomaly detection in heterogeneous, multivariate, variable-length time series datasets. Our focus is on the aviation safety domain, where data objects are flights and time series are sensor readings and pilot switches. In this context the goal is to detect anomalous flight segments, due to mechanical, environmental, or human factors in order to identifying operationally significant events and highlight potential safety risks. For this purpose, we propose a framework which represents each flight using a semi-Markov switching vector autoregressive (SMS-VAR) model. Detection of anomalies is then based on measuring dissimilarities between the model's prediction and data observation. The framework is scalable, due to the inherent parallel nature of most computations, and can be used to perform online anomaly detection. Extensive experimental results on simulated and real datasets illustrate that the framework can detect various types of anomalies along with the key parameters involved.

【Keywords】: anomaly detection; graphical model; time series analysis

121. Continuous Experience-aware Language Model.

Paper Link】 【Pages】:1075-1084

【Authors】: Subhabrata Mukherjee ; Stephan Günnemann ; Gerhard Weikum

【Abstract】: Online review communities are dynamic as users join and leave, adopt new vocabulary, and adapt to evolving trends. Recent work has shown that recommender systems benefit from explicit consideration of user experience. However, prior work assumes a fixed number of discrete experience levels, whereas in reality users gain experience and mature continuously over time. This paper presents a new model that captures the continuous evolution of user experience, and the resulting language model in reviews and other posts. Our model is unsupervised and combines principles of Geometric Brownian Motion, Brownian Motion, and Latent Dirichlet Allocation to trace a smooth temporal progression of user experience and language model respectively. We develop practical algorithms for estimating the model parameters from data and for inference with our model (e.g., to recommend items). Extensive experiments with five real-world datasets show that our model not only fits data better than discrete-model baselines, but also outperforms state-of-the-art methods for predicting item ratings.

【Keywords】: brownian motion; language evolution; recommendation; review community; topic modeling; user experience

122. Structural Neighborhood Based Classification of Nodes in a Network.

Paper Link】 【Pages】:1085-1094

【Authors】: Sharad Nandanwar ; M. Narasimha Murty

【Abstract】: Classification of entities based on the underlying network structure is an important problem. Networks encountered in practice are sparse and have many missing and noisy links. Statistical learning techniques have been used in intra-network classification; however, they typically exploit only the local neighborhood, so may not perform well. In this paper, we propose a novel structural neighborhood-based classifier learning using a random walk. For classifying a node, we take a random walk from the node and make a decision based on how nodes in the respective k^th-level neighborhood are labeled. We observe that random walks of short length are helpful in classification. Emphasizing role of longer random walks may cause the underlying Markov chain to converge to a stationary distribution. Considering this, we take a lazy random walk based approach with variable termination probability for each node, based on the node's structural properties including its degree. Our experimental study on real world datasets demonstrates the superiority of the proposed approach over the existing state-of-the-art approaches.

【Keywords】: collective classification; graph-based semi-supervised learning; relational learning

123. Modeling Precursors for Event Forecasting via Nested Multi-Instance Learning.

Paper Link】 【Pages】:1095-1104

【Authors】: Yue Ning ; Sathappan Muthiah ; Huzefa Rangwala ; Naren Ramakrishnan

【Abstract】: Forecasting large-scale societal events like civil unrest movements, disease outbreaks, and elections is an important and challenging problem. From the perspective of human analysts and policy makers, forecasting algorithms must not only make accurate predictions but must also provide supporting evidence, e.g., the causal factors related to the event of interest. We develop a novel multiple instance learning based approach that jointly tackles the problem of identifying evidence-based precursors and forecasts events into the future. Specifically, given a collection of streaming news articles from multiple sources we develop a nested multiple instance learning approach to forecast significant societal events such as protests. Using data from three countries in Latin America, we demonstrate how our approach is able to consistently identify news articles considered as precursors for protests. Our empirical evaluation demonstrates the strengths of our proposed approach in filtering candidate precursors, in forecasting the occurrence of events with a lead time advantage and in accurately predicting the characteristics of civil unrest events.

【Keywords】: event detection; multi-instance learning; text mining

124. Asymmetric Transitivity Preserving Graph Embedding.

Paper Link】 【Pages】:1105-1114

【Authors】: Mingdong Ou ; Peng Cui ; Jian Pei ; Ziwei Zhang ; Wenwu Zhu

【Abstract】: Graph embedding algorithms embed a graph into a vector space where the structure and the inherent properties of the graph are preserved. The existing graph embedding methods cannot preserve the asymmetric transitivity well, which is a critical property of directed graphs. Asymmetric transitivity depicts the correlation among directed edges, that is, if there is a directed path from u to v, then there is likely a directed edge from u to v. Asymmetric transitivity can help in capturing structures of graphs and recovering from partially observed graphs. To tackle this challenge, we propose the idea of preserving asymmetric transitivity by approximating high-order proximity which are based on asymmetric transitivity. In particular, we develop a novel graph embedding algorithm, High-Order Proximity preserved Embedding (HOPE for short), which is scalable to preserve high-order proximities of large scale graphs and capable of capturing the asymmetric transitivity. More specifically, we first derive a general formulation that cover multiple popular high-order proximity measurements, then propose a scalable embedding algorithm to approximate the high-order proximity measurements based on their general formulation. Moreover, we provide a theoretical upper bound on the RMSE (Root Mean Squared Error) of the approximation. Our empirical experiments on a synthetic dataset and three real-world datasets demonstrate that HOPE can approximate the high-order proximities significantly better than the state-of-art algorithms and outperform the state-of-art algorithms in tasks of reconstruction, link prediction and vertex recommendation.

【Keywords】: asymmetric transitivity; directed graph; graph embedding; high-order proximity

125. PTE: Enumerating Trillion Triangles On Distributed Systems.

Paper Link】 【Pages】:1115-1124

【Authors】: Ha-Myung Park ; Sung-Hyon Myaeng ; U. Kang

【Abstract】: How can we enumerate triangles from an enormous graph with billions of vertices and edges? Triangle enumeration is an important task for graph data analysis with many applications including identifying suspicious users in social networks, detecting web spams, finding communities, etc. However, recent networks are so large that most of the previous algorithms fail to process them. Recently, several MapReduce algorithms have been proposed to address such large networks; however, they suffer from the massive shuffled data resulting in a very long processing time. In this paper, we propose PTE (Pre-partitioned Triangle Enumeration), a new distributed algorithm for enumerating triangles in enormous graphs by resolving the structural inefficiency of the previous MapReduce algorithms. PTE enumerates trillions of triangles in a billion scale graph by decreasing three factors: the amount of shuffled data, total work, and network read. Experimental results show that PTE provides up to 47 times faster performance than recent distributed algorithms on real world graphs, and succeeds in enumerating more than 3 trillion triangles on the ClueWeb12 graph with 6.3 billion vertices and 72 billion edges, which any previous triangle computation algorithm fail to process.

【Keywords】: big data; distributed algorithm; graph algorithm; network analysis; scalable algorithm; triangle enumeration

126. Robust Large-Scale Machine Learning in the Cloud.

Paper Link】 【Pages】:1125-1134

【Authors】: Steffen Rendle ; Dennis Fetterly ; Eugene J. Shekita ; Bor-Yiing Su

【Abstract】: The convergence behavior of many distributed machine learning (ML) algorithms can be sensitive to the number of machines being used or to changes in the computing environment. As a result, scaling to a large number of machines can be challenging. In this paper, we describe a new scalable coordinate descent (SCD) algorithm for generalized linear models whose convergence behavior is always the same, regardless of how much SCD is scaled out and regardless of the computing environment. This makes SCD highly robust and enables it to scale to massive datasets on low-cost commodity servers. Experimental results on a real advertising dataset in Google are used to demonstrate SCD's cost effectiveness and scalability. Using Google's internal cloud, we show that SCD can provide near linear scaling using thousands of cores for 1 trillion training examples on a petabyte of compressed data. This represents 10,000x more training examples than the 'large-scale' Netflix prize dataset. We also show that SCD can learn a model for 20 billion training examples in two hours for about $10.

【Keywords】: coordinate descent; distributed machine learning; linear regression

127. "Why Should I Trust You?": Explaining the Predictions of Any Classifier.

Paper Link】 【Pages】:1135-1144

【Authors】: Marco Túlio Ribeiro ; Sameer Singh ; Carlos Guestrin

【Abstract】: Despite widespread adoption, machine learning models remain mostly black boxes. Understanding the reasons behind predictions is, however, quite important in assessing trust, which is fundamental if one plans to take action based on a prediction, or when choosing whether to deploy a new model. Such understanding also provides insights into the model, which can be used to transform an untrustworthy model or prediction into a trustworthy one. In this work, we propose LIME, a novel explanation technique that explains the predictions of any classifier in an interpretable and faithful manner, by learning an interpretable model locally varound the prediction. We also propose a method to explain models by presenting representative individual predictions and their explanations in a non-redundant way, framing the task as a submodular optimization problem. We demonstrate the flexibility of these methods by explaining different models for text (e.g. random forests) and image classification (e.g. neural networks). We show the utility of explanations via novel experiments, both simulated and with human subjects, on various scenarios that require trust: deciding if one should trust a prediction, choosing between models, improving an untrustworthy classifier, and identifying why a classifier should not be trusted.

【Keywords】: black box classifier; explaining machine learning; interpretability; interpretable machine learning

128. ABRA: Approximating Betweenness Centrality in Static and Dynamic Graphs with Rademacher Averages.

Paper Link】 【Pages】:1145-1154

【Authors】: Matteo Riondato ; Eli Upfal

【Abstract】: We present ABRA, a suite of algorithms to compute and maintain probabilistically-guaranteed, high-quality, approximations of the betweenness centrality of all nodes (or edges) on both static and fully dynamic graphs. Our algorithms use progressive random sampling and their analysis rely on Rademacher averages and pseudodimension, fundamental concepts from statistical learning theory. To our knowledge, this is the first application of these concepts to the field of graph analysis. Our experimental results show that ABRA is much faster than exact methods, and vastly outperforms, in both runtime and number of samples, state-of-the-art algorithms with the same quality guarantees.

【Keywords】: betweenness; centrality; centrality measures; progressive sampling; pseudodimension; rademacher averages; sample complexity; sampling

129. Sampling of Attributed Networks from Hierarchical Generative Models.

Paper Link】 【Pages】:1155-1164

【Authors】: Pablo Robles-Granda ; Sebastián Moreno ; Jennifer Neville

【Abstract】: Network sampling is a widely used procedure in social network analysis where a random network is sampled from a generative network model (GNM). Recently proposed GNMs, allow generation of networks with more realistic structural characteristics than earlier ones. This facilitates tasks such as hypothesis testing and sensitivity analysis. However, sampling of networks with correlated vertex attributes remains a challenging problem. While the recent work of \cite{Pfeiffer:14} has provided a promising approach for attributed-network sampling, the approach was developed for use with relatively simple GNMs and does not work well with more complex hierarchical GNMs (which can model the range of characteristics and variation observed in real world networks more accurately). In contrast to simple GNMs where the probability mass is spread throughout the space of edges more evenly, hierarchical GNMs concentrate the mass to smaller regions of the space to reflect dependencies among edges in the network---this produces more realistic network characteristics, but also makes it more difficult to identify candidate networks from the sampling space. In this paper, we propose a novel sampling method, CSAG, to sample from hierarchical GNMs and generate networks with correlated attributes. CSAG constrains every step of the sampling process to consider the structure of the GNM---in order to bias the search to regions of the space with higher likelihood. We implemented CSAG using mixed Kronecker Product Graph Models and evaluated our approach on three real-world datasets. The results show that CSAG jointly models the correlation and structure of the networks better than the state of the art. Specifically, CSAG maintains the variability of the underlying GNM while providing a ≥ 5X reduction in attribute correlation error.

【Keywords】: attributed networks; complex networks; network sampling; statistical network models

130. Goal-Directed Inductive Matrix Completion.

Paper Link】 【Pages】:1165-1174

【Authors】: Si Si ; Kai-Yang Chiang ; Cho-Jui Hsieh ; Nikhil Rao ; Inderjit S. Dhillon

【Abstract】: Matrix completion (MC) with additional information has found wide applicability in several machine learning applications. Among algorithms for solving such problems, Inductive Matrix Completion(IMC) has drawn a considerable amount of attention, not only for its well established theoretical guarantees but also for its superior performance in various real-world applications. However, IMC based methods usually place very strong constraints on the quality of the features(side information) to ensure accurate recovery, which might not be met in practice. In this paper, we propose Goal-directed Inductive Matrix Completion(GIMC) to learn a nonlinear mapping of the features so that they satisfy the required properties. A key distinction between GIMC and IMC is that the feature mapping is learnt in a supervised manner, deviating from the traditional approach of unsupervised feature learning followed by model training. We establish the superiority of our method on several popular machine learning applications including multi-label learning, multi-class classification, and semi-supervised clustering.

【Keywords】: matrix completion

131. Graph Wavelets via Sparse Cuts.

Paper Link】 【Pages】:1175-1184

【Authors】: Arlei Silva ; Xuan Hong Dang ; Prithwish Basu ; Ambuj K. Singh ; Ananthram Swami

【Abstract】: Modeling information that resides on vertices of large graphs is a key problem in several real-life applications, ranging from social networks to the Internet-of-things. Signal Processing on Graphs and, in particular, graph wavelets can exploit the intrinsic smoothness of these datasets in order to represent them in a compact and accurate manner. However, how to discover wavelet bases that capture the geometry of the data with respect to the signal as well as the graph structure remains an open problem. In this paper, we study the problem of computing graph wavelet bases via sparse cuts in order to produce low-dimensional encodings of data-driven bases. This problem is connected to known hard problems in graph theory (e.g. multiway cuts) and thus requires an efficient heuristic. We formulate the basis discovery task as a relaxation of a vector optimization problem, which leads to an elegant solution as a regularized eigenvalue computation. Moreover, we propose several strategies in order to scale our algorithm to large graphs. Experimental results show that the proposed algorithm can effectively encode both the graph structure and signal, producing compressed and accurate representations for vertex values in a wide range of datasets (e.g. sensor and gene networks) and significantly outperforming the best baseline.

【Keywords】: graph mining; spectral theory; wavelets

132. Lexis: An Optimization Framework for Discovering the Hierarchical Structure of Sequential Data.

Paper Link】 【Pages】:1185-1194

【Authors】: Payam Siyari ; Bistra Dilkina ; Constantine Dovrolis

【Abstract】: Data represented as strings abounds in biology, linguistics, document mining, web search and many other fields. Such data often have a hierarchical structure, either because they were artificially designed and composed in a hierarchical manner or because there is an underlying evolutionary process that creates repeatedly more complex strings from simpler substrings. We propose a framework, referred to as Lexis, that produces an optimized hierarchical representation of a given set of "target" strings. The resulting hierarchy, "Lexis-DAG", shows how to construct each target through the concatenation of intermediate substrings, minimizing the total number of such concatenations or DAG edges. The Lexis optimization problem is related to the smallest grammar problem. After we prove its NP-hardness for two cost formulations, we propose an efficient greedy algorithm for the construction of Lexis-DAGs. We also consider the problem of identifying the set of intermediate nodes (substrings) that collectively form the "core" of a Lexis-DAG, which is important in the analysis of Lexis-DAGs. We show that the Lexis framework can be applied in diverse applications such as optimized synthesis of DNA fragments in genomic libraries, hierarchical structure discovery in protein sequences, dictionary-based text compression, and feature extraction from a set of documents.

【Keywords】: DNA synthesis; centrality; compression; directed acyclic graph; feature extraction; hierarchical structure; optimization; structure discovery

133. Towards Optimal Cardinality Estimation of Unions and Intersections with Sketches.

Paper Link】 【Pages】:1195-1204

【Authors】: Daniel Ting

【Abstract】: Estimating the cardinality of unions and intersections of sets is a problem of interest in OLAP. Large data applications often require the use of approximate methods based on small sketches of the data. We give new estimators for the cardinality of unions and intersection and show they approximate an optimal estimation procedure. These estimators enable the improved accuracy of the streaming MinCount sketch to be exploited in distributed settings. Both theoretical and empirical results demonstrate substantial improvements over existing methods.

【Keywords】: cardinality estimation; data sketching; randomized algorithms

134. Overcoming Key Weaknesses of Distance-based Neighbourhood Methods using a Data Dependent Dissimilarity Measure.

Paper Link】 【Pages】:1205-1214

【Authors】: Kai Ming Ting ; Ye Zhu ; Mark James Carman ; Yue Zhu ; Zhi-Hua Zhou

【Abstract】: This paper introduces the first generic version of data dependent dissimilarity and shows that it provides a better closest match than distance measures for three existing algorithms in clustering, anomaly detection and multi-label classification. For each algorithm, we show that by simply replacing the distance measure with the data dependent dissimilarity measure, it overcomes a key weakness of the otherwise unchanged algorithm.

【Keywords】: data dependent dissimilarity; distance measure; distance-based neighbourhood; k nearest neighbours.; probability-mass-based neighbourhood

135. Just One More: Modeling Binge Watching Behavior.

Paper Link】 【Pages】:1215-1224

【Authors】: William Trouleau ; Azin Ashkan ; Weicong Ding ; Brian Eriksson

【Abstract】: Easy accessibility can often lead to over-consumption, as seen in food and alcohol habits. On video on-demand (VOD) services, this has recently been referred to as binge watching, where potentially entire seasons of TV shows are consumed in a single viewing session. While a user viewership model may reveal this binging behavior, creating an accurate model has several challenges, including censored data, deviations in the population, and the need to consider external influences on consumption habits. In this paper, we introduce a novel statistical mixture model that incorporates these factors and presents a first of its kind characterization of viewer consumption behavior using a real-world dataset that includes playback data from a VOD service. From our modeling, we tackle various predictive tasks to infer the consumption decisions of a user in a viewing session, including estimating the number of episodes they watch and classifying if they continue watching another episode. Using these insights, we then identify binge watching sessions based on deviation from normal viewing behavior. We observe different types of binging behavior, that binge watchers often view certain content out-of-order, and that binge watching is not a consistent behavior among our users. These insights and our findings have application in VOD revenue generation, consumer health applications, and customer retention analysis.

【Keywords】: binge watching; censored poisson regression; mixture model

136. Structural Deep Network Embedding.

Paper Link】 【Pages】:1225-1234

【Authors】: Daixin Wang ; Peng Cui ; Wenwu Zhu

【Abstract】: Network embedding is an important method to learn low-dimensional representations of vertexes in networks, aiming to capture and preserve the network structure. Almost all the existing network embedding methods adopt shallow models. However, since the underlying network structure is complex, shallow models cannot capture the highly non-linear network structure, resulting in sub-optimal network representations. Therefore, how to find a method that is able to effectively capture the highly non-linear network structure and preserve the global and local structure is an open yet important problem. To solve this problem, in this paper we propose a Structural Deep Network Embedding method, namely SDNE. More specifically, we first propose a semi-supervised deep model, which has multiple layers of non-linear functions, thereby being able to capture the highly non-linear network structure. Then we propose to exploit the first-order and second-order proximity jointly to preserve the network structure. The second-order proximity is used by the unsupervised component to capture the global network structure. While the first-order proximity is used as the supervised information in the supervised component to preserve the local network structure. By jointly optimizing them in the semi-supervised deep model, our method can preserve both the local and global network structure and is robust to sparse networks. Empirically, we conduct the experiments on five real-world networks, including a language network, a citation network and three social networks. The results show that compared to the baselines, our method can reconstruct the original network significantly better and achieves substantial gains in three applications, i.e. multi-label classification, link prediction and visualization.

【Keywords】: deep learning; network analysis; network embedding

137. Targeted Topic Modeling for Focused Analysis.

Paper Link】 【Pages】:1235-1244

【Authors】: Shuai Wang ; Zhiyuan Chen ; Geli Fei ; Bing Liu ; Sherry Emery

【Abstract】: One of the overarching tasks of document analysis is to find what topics people talk about. One of the main techniques for this purpose is topic modeling. So far many models have been proposed. However, the existing models typically perform full analysis on the whole data to find all topics. This is certainly useful, but in practice we found that the user almost always also wants to perform more detailed analyses on some specific aspects, which we refer to as targets (or targeted aspects). Current full-analysis models are not suitable for such analyses as their generated topics are often too coarse and may not even be on target. For example, given a set of tweets about e-cigarette, one may want to find out what topics under discussion are specifically related to children. Likewise, given a collection of online reviews about a camera, a consumer or camera manufacturer may be interested in finding out all topics about the camera's screen, the targeted aspect. As we will see in our experiments, current full topic models are ineffective for such targeted analyses. This paper studies this problem and proposes a novel targeted topic model (TTM) to enable focused analyses on any specific aspect of interest. Our experimental results demonstrate the effectiveness of the TTM.

【Keywords】: focused analysis; targeted analysis; targeted aspect; targeted modeling; targeted topic modeling; topic modeling

138. Structured Doubly Stochastic Matrix for Graph Based Clustering: Structured Doubly Stochastic Matrix.

Paper Link】 【Pages】:1245-1254

【Authors】: Xiaoqian Wang ; Feiping Nie ; Heng Huang

【Abstract】: As one of the most significant machine learning topics, clustering has been extensively employed in various kinds of area. Its prevalent application in scientific research as well as industrial practice has drawn high attention in this day and age. A multitude of clustering methods have been developed, among which the graph based clustering method using the affinity matrix has been laid great emphasis on. Recent research work used the doubly stochastic matrix to normalize the input affinity matrix and enhance the graph based clustering models. Although the doubly stochastic matrix can improve the clustering performance, the clustering structure in the doubly stochastic matrix is not clear as expected. Thus, post processing step is required to extract the final clustering results, which may not be optimal. To address this problem, in this paper, we propose a novel convex model to learn the structured doubly stochastic matrix by imposing low-rank constraint on the graph Laplacian matrix. Our new structured doubly stochastic matrix can explicitly uncover the clustering structure and encode the probabilities of pair-wise data points to be connected, such that the clustering results are enhanced. An efficient optimization algorithm is derived to solve our new objective. Also, we provide theoretical discussions that when the input differs, our method possesses interesting connections with K-means and spectral graph cut models respectively. We conduct experiments on both synthetic and benchmark datasets to validate the performance of our proposed method. The empirical results demonstrate that our model provides an approach to better solving the K-mean clustering problem. By using the cluster indicator provided by our model as initialization, K-means converges to a smaller objective function value with better clustering performance. Moreover, we compare the clustering performance of our model with spectral clustering and related double stochastic model. On all datasets, our method performs equally or better than the related methods.

【Keywords】: doubly stochastic matrix; graph laplacian; k-means clustering; spectral clustering

139. A Multiple Test Correction for Streams and Cascades of Statistical Hypothesis Tests.

Paper Link】 【Pages】:1255-1264

【Authors】: Geoffrey I. Webb ; François Petitjean

【Abstract】: Statistical hypothesis testing is a popular and powerful tool for inferring knowledge from data. For every such test performed, there is always a non-zero probability of making a false discovery, i.e.~rejecting a null hypothesis in error. Familywise error rate (FWER) is the probability of making at least one false discovery during an inference process. The expected FWER grows exponentially with the number of hypothesis tests that are performed, almost guaranteeing that an error will be committed if the number of tests is big enough and the risk is not managed; a problem known as the multiple testing problem. State-of-the-art methods for controlling FWER in multiple comparison settings require that the set of hypotheses be predetermined. This greatly hinders statistical testing for many modern applications of statistical inference, such as model selection, because neither the set of hypotheses that will be tested, nor even the number of hypotheses, can be known in advance. This paper introduces Subfamilywise Multiple Testing, a multiple-testing correction that can be used in applications for which there are repeated pools of null hypotheses from each of which a single null hypothesis is to be rejected and neither the specific hypotheses nor their number are known until the final rejection decision is completed. To demonstrate the importance and relevance of this work to current machine learning problems, we further refine the theory to the problem of model selection and show how to use Subfamilywise Multiple Testing for learning graphical models. We assess its ability to discover graphical models on more than 7,000 datasets, studying the ability of Subfamilywise Multiple Testing to outperform the state of the art on data with varying size and dimensionality, as well as with varying density and power of the present correlations. Subfamilywise Multiple Testing provides a significant improvement in statistical efficiency, often requiring only half as much data to discover the same model, while strictly controlling FWER.

【Keywords】: hypothesis testing; model selection; multiple testing

140. Revisiting Random Binning Features: Fast Convergence and Strong Parallelizability.

Paper Link】 【Pages】:1265-1274

【Authors】: Lingfei Wu ; Ian En-Hsu Yen ; Jie Chen ; Rui Yan

【Abstract】: Kernel method has been developed as one of the standard approaches for nonlinear learning, which however, does not scale to large data set due to its quadratic complexity in the number of samples. A number of kernel approximation methods have thus been proposed in the recent years, among which the random features method gains much popularity due to its simplicity and direct reduction of nonlinear problem to a linear one. Different random feature functions have since been proposed to approximate a variety of kernel functions. Among them the Random Binning (RB) feature, proposed in the first random-feature paper [21], has drawn much less attention than the Random Fourier (RF) feature proposed also in [21]. In this work, we observe that the RB features, with right choice of optimization solver, could be orders-of-magnitude more efficient than other random features and kernel approximation methods under the same requirement of accuracy. We thus propose the first analysis of RB from the perspective of optimization, which by interpreting RB as a Randomized Block Coordinate Descent in the infinite-dimensional space, gives a faster convergence rate compared to that of other random features. In particular, we show that by drawing R random grids with at least κ number of non-empty bins per grid in expectation, RB method achieves a convergence rate of O(1/κ R)), which not only sharpens its O(1/√R) rate from Monte Carlo analysis, but also shows a κ times speedup over other random features under the same analysis framework. In addition, we demonstrate another advantage of RB in the L1-regularized setting, where unlike other random features, a RB-based Coordinate Descent solver can be parallelized with guaranteed speedup proportional to κ. Our extensive experiments demonstrate the superior performance of the RB features over other random features and kernel approximation methods.

【Keywords】: faster convergence; kernel approximation; large-scale machine learning; random binning features; strong parallelizability

141. Robust Extreme Multi-label Learning.

Paper Link】 【Pages】:1275-1284

【Authors】: Chang Xu ; Dacheng Tao ; Chao Xu

【Abstract】: Tail labels in the multi-label learning problem undermine the low-rank assumption. Nevertheless, this problem has rarely been investigated. In addition to using the low-rank structure to depict label correlations, this paper explores and exploits an additional sparse component to handle tail labels behaving as outliers, in order to make the classical low-rank principle in multi-label learning valid. The divide-and-conquer optimization technique is employed to increase the scalability of the proposed algorithm while theoretically guaranteeing its performance. A theoretical analysis of the generalizability of the proposed algorithm suggests that it can be improved by the low-rank and sparse decomposition given tail labels. Experimental results on real-world data demonstrate the significance of investigating tail labels and the effectiveness of the proposed algorithm.

【Keywords】: low-rank algorithm; multi-label learning

142. Taxi Driving Behavior Analysis in Latent Vehicle-to-Vehicle Networks: A Social Influence Perspective.

Paper Link】 【Pages】:1285-1294

【Authors】: Tong Xu ; Hengshu Zhu ; Xiangyu Zhao ; Qi Liu ; Hao Zhong ; Enhong Chen ; Hui Xiong

【Abstract】: With recent advances in mobile and sensor technologies, a large amount of efforts have been made on developing intelligent applications for taxi drivers, which provide beneficial guide and opportunity to improve the profit and work efficiency. However, limited scopes focus on the latent social interaction within cab drivers, and corresponding social propagation scheme to share driving behaviors has been largely ignored. To that end, in this paper, we propose a comprehensive study to reveal how the social propagation affects for better prediction of cab drivers' future behaviors. To be specific, we first investigate the correlation between drivers' skills and their mutual interactions in the latent vehicle-to-vehicle network, which intuitively indicates the effects of social influences. Along this line, by leveraging the classic social influence theory, we develop a two-stage framework for quantitatively revealing the latent driving pattern propagation within taxi drivers. Comprehensive experiments on a real-word data set collected from the New York City clearly validate the effectiveness of our proposed framework on predicting future taxi driving behaviors, which also support the hypothesis that social factors indeed improve the predictability of driving behaviors.

【Keywords】: mobile data mining; social influence; taxi trajectories

143. DeepIntent: Learning Attentions for Online Advertising with Recurrent Neural Networks.

Paper Link】 【Pages】:1295-1304

【Authors】: Shuangfei Zhai ; Keng-hao Chang ; Ruofei Zhang ; Zhongfei (Mark) Zhang

【Abstract】: In this paper, we investigate the use of recurrent neural networks (RNNs) in the context of search-based online advertising. We use RNNs to map both queries and ads to real valued vectors, with which the relevance of a given (query, ad) pair can be easily computed. On top of the RNN, we propose a novel attention network, which learns to assign attention scores to different word locations according to their intent importance (hence the name DeepIntent). The vector output of a sequence is thus computed by a weighted sum of the hidden states of the RNN at each word according their attention scores. We perform end-to-end training of both the RNN and attention network under the guidance of user click logs, which are sampled from a commercial search engine. We show that in most cases the attention network improves the quality of learned vector representations, evaluated by AUC on a manually labeled dataset. Moreover, we highlight the effectiveness of the learned attention scores from two aspects: query rewriting and a modified BM25 metric. We show that using the learned attention scores, one is able to produce sub-queries that are of better qualities than those of the state-of-the-art methods. Also, by modifying the term frequency with the attention scores in a standard BM25 formula, one is able to improve its performance evaluated by AUC.

【Keywords】: BM25; RNN; attention mechanism; deep learning; online advertising; query rewriting; sponsored search

144. GMove: Group-Level Mobility Modeling Using Geo-Tagged Social Media.

Paper Link】 【Pages】:1305-1314

【Authors】: Chao Zhang ; Keyang Zhang ; Quan Yuan ; Luming Zhang ; Tim Hanratty ; Jiawei Han

【Abstract】: Understanding human mobility is of great importance to various applications, such as urban planning, traffic scheduling, and location prediction. While there has been fruitful research on modeling human mobility using tracking data (e.g., GPS traces), the recent growth of geo-tagged social media (GeoSM) brings new opportunities to this task because of its sheer size and multi-dimensional nature. Nevertheless, how to obtain quality mobility models from the highly sparse and complex GeoSM data remains a challenge that cannot be readily addressed by existing techniques. We propose GMove, a group-level mobility modeling method using GeoSM data. Our insight is that the GeoSM data usually contains multiple user groups, where the users within the same group share significant movement regularity. Meanwhile, user grouping and mobility modeling are two intertwined tasks: (1) better user grouping offers better within-group data consistency and thus leads to more reliable mobility models; and (2) better mobility models serve as useful guidance that helps infer the group a user belongs to. GMove thus alternates between user grouping and mobility modeling, and generates an ensemble of Hidden Markov Models (HMMs) to characterize group-level movement regularity. Furthermore, to reduce text sparsity of GeoSM data, GMove also features a text augmenter. The augmenter computes keyword correlations by examining their spatiotemporal distributions. With such correlations as auxiliary knowledge, it performs sampling-based augmentation to alleviate text sparsity and produce high-quality HMMs. Our extensive experiments on two real-life data sets demonstrate that GMove can effectively generate meaningful group-level mobility models. Moreover, with context-aware location prediction as an example application, we find that GMove significantly outperforms baseline mobility models in terms of prediction accuracy.

【Keywords】: human mobility; movement pattern; social media; spatiotemporal data; twitter

145. Approximate Personalized PageRank on Dynamic Graphs.

Paper Link】 【Pages】:1315-1324

【Authors】: Hongyang Zhang ; Peter Lofgren ; Ashish Goel

【Abstract】: We propose and analyze two algorithms for maintaining approximate Personalized PageRank (PPR) vectors on a dynamic graph, where edges are added or deleted. Our algorithms are natural dynamic versions of two known local variations of power iteration. One, Forward Push, propagates probability mass forwards along edges from a source node, while the other, Reverse Push, propagates local changes backwards along edges from a target. In both variations, we maintain an invariant between two vectors, and when an edge is updated, our algorithm first modifies the vectors to restore the invariant, then performs any needed local push operations to restore accuracy. For Reverse Push, we prove that for an arbitrary directed graph in a random edge model, or for an arbitrary undirected graph, given a uniformly random target node t, the cost to maintain a PPR vector to t of additive error ε as k edges are updated is O(k + d/ε, where d is the average degree of the graph. This is O(1) work per update, plus the cost of computing a reverse vector once on a static graph. For Forward Push, we show that on an arbitrary undirected graph, given a uniformly random start node s, the cost to maintain a PPR vector from s of degree-normalized error ε as k edges are updated is O(k + 1/ε, which is again O(1) per update plus the cost of computing a PPR vector once on a static graph.

【Keywords】: dynamic graph; local push; personalized pagerank; random walks.

146. Annealed Sparsity via Adaptive and Dynamic Shrinking.

Paper Link】 【Pages】:1325-1334

【Authors】: Kai Zhang ; Shandian Zhe ; Chaoran Cheng ; Zhi Wei ; Zhengzhang Chen ; Haifeng Chen ; Guofei Jiang ; Yuan Qi ; Jieping Ye

【Abstract】: Sparse learning has received tremendous amount of interest in high-dimensional data analysis due to its model interpretability and the low-computational cost. Among the various techniques, adaptive l1-regularization is an effective framework to improve the convergence behaviour of the LASSO, by using varying strength of regularization across different features. In the meantime, the adaptive structure makes it very powerful in modelling grouped sparsity patterns as well, being particularly useful in high-dimensional multi-task problems. However, choosing an appropriate, global regularization weight is still an open problem. In this paper, inspired by the annealing technique in material science, we propose to achieve "annealed sparsity" by designing a dynamic shrinking scheme that simultaneously optimizes the regularization weights and model coefficients in sparse (multi-task) learning. The dynamic structures of our algorithm are twofold. Feature-wise (spatially), the regularization weights are updated interactively with model coefficients, allowing us to improve the global regularization structure. Iteration-wise (temporally), such interaction is coupled with gradually boosted l1-regularization by adjusting an equality norm-constraint, achieving an annealing effect to further improve model selection. This renders interesting shrinking behaviour in the whole solution path. Our method competes favorably with state-of-the-art methods in sparse (multi-task) learning. We also apply it in expression quantitative trait loci analysis (eQTL), which gives useful biological insights in human cancer (melanoma) study.

【Keywords】: adaptive lasso; annealing; compressive sensing; dynamic shrinking; feature selection; lasso; multitask learning; regularization path; sparse regression

147. Partial Label Learning via Feature-Aware Disambiguation.

Paper Link】 【Pages】:1335-1344

【Authors】: Min-Ling Zhang ; Bin-Bin Zhou ; Xu-Ying Liu

【Abstract】: Partial label learning deals with the problem where each training example is represented by a feature vector while associated with a set of candidate labels, among which only one label is valid. To learn from such ambiguous labeling information, the key is to try to disambiguate the candidate label sets of partial label training examples. Existing disambiguation strategies work by either identifying the ground-truth label iteratively or treating each candidate label equally. Nonetheless, the disambiguation process is generally conducted by focusing on manipulating the label space, and thus ignores making full use of potentially useful information from the feature space. In this paper, a novel two-stage approach is proposed to learning from partial label examples based on feature-aware disambiguation. In the first stage, the manifold structure of feature space is utilized to generate normalized labeling confidences over candidate label set. In the second stage, the predictive model is learned by performing regularized multi-output regression over the generated labeling confidences. Extensive experiments on artificial as well as real-world partial label data sets clearly validate the superiority of the proposed feature-aware disambiguation approach.

【Keywords】: disambiguation; manifold; partial label learning; weak supervision

148. FINAL: Fast Attributed Network Alignment.

Paper Link】 【Pages】:1345-1354

【Authors】: Si Zhang ; Hanghang Tong

【Abstract】: Multiple networks naturally appear in numerous high-impact applications. Network alignment (i.e., finding the node correspondence across different networks) is often the very first step for many data mining tasks. Most, if not all, of the existing alignment methods are solely based on the topology of the underlying networks. Nonetheless, many real networks often have rich attribute information on nodes and/or edges. In this paper, we propose a family of algorithms FINAL to align attributed networks. The key idea is to leverage the node/edge attribute information to guide (topology-based) alignment process. We formulate this problem from an optimization perspective based on the alignment consistency principle, and develop effective and scalable algorithms to solve it. Our experiments on real networks show that (1) by leveraging the attribute information, our algorithms can significantly improve the alignment accuracy (i.e., up to a 30% improvement over the existing methods); (2) compared with the exact solution, our proposed fast alignment algorithm leads to a more than 10 times speed-up, while preserving a 95% accuracy; and (3) our on-query alignment method scales linearly, with an around 90% ranking accuracy compared with our exact full alignment method and a near real-time response time.

【Keywords】: alignment consistency; attributed network alignment; on-query alignment

149. Come-and-Go Patterns of Group Evolution: A Dynamic Model.

Paper Link】 【Pages】:1355-1364

【Authors】: Tianyang Zhang ; Peng Cui ; Christos Faloutsos ; Yunfei Lu ; Hao Ye ; Wenwu Zhu ; Shiqiang Yang

【Abstract】: How do social groups, such as Facebook groups and Wechat groups, dynamically evolve over time? How do people join the social groups, uniformly or with burst? What is the pattern of people quitting from groups? Is there a simple universal model to depict the come-and-go patterns of various groups? In this paper, we examine temporal evolution patterns of more than 100 thousands social groups with more than 10 million users. We surprisingly find that the evolution patterns of real social groups goes far beyond the classic dynamic models like SI and SIR. For example, we observe both diffusion and non-diffusion mechanism in the group joining process, and power-law decay in group quitting process, rather than exponential decay as expected in SIR model. Therefore we propose a new model comeNgo, a concise yet flexible dynamic model for group evolution. Our model has the following advantages: (a) unification power: it generalizes earlier theoretical models and different joining and quitting mechanisms we find from observation. (b) succinctness and interpretability: it contains only six parameters with clear physical meanings. (c) accuracy: it can capture various kinds of group evolution patterns preciously and the goodness of fit increase by 58% over baseline. (d) usefulness: it can be used in multiple application scenarios such as forecasting and pattern discovery.

【Keywords】: dynamic model; group evolution; temporal patterns

150. NetCycle: Collective Evolution Inference in Heterogeneous Information Networks.

Paper Link】 【Pages】:1365-1374

【Authors】: Yizhou Zhang ; Yun Xiong ; Xiangnan Kong ; Yangyong Zhu

【Abstract】: Collective inference has attracted considerable attention in the last decade, where the response variables within a group of instances are correlated and should be inferred collectively, instead of independently. Previous works on collective inference mainly focus on exploiting the autocorrelation among instances in a static network during the inference process. There are also approaches on time series prediction, which mainly exploit the autocorrelation within an instance at different time points during the inference process. However, in many real-world applications, the response variables of related instances can co-evolve over time and their evolutions are not following a static correlation across time, but are following an internal life cycle. In this paper, we study the problem of collective evolution inference, where the goal is to predict the values of the response variables for a group of related instances at the end of their life cycles. This problem is extremely important for various applications, e.g., predicting fund-raising results in crowd-funding and predicting gene-expression levels in bioinformatics. This problem is also highly challenging because different instances in the network can co-evolve over time and they can be at different stages of their life cycles and thus have different evolving patterns. Moreover, the instances in collective evolution inference problems are usually connected through heterogeneous information networks, which involve complex relationships among the instances interconnected by multiple types of links. We propose an approach, called NetCycle, by incorporating information from both the correlation among related instances and their life cycles. We compared our approach with existing methods of collective inference and time series analysis on two real-world networks. The results demonstrate that our proposed approach can improve the inference performance by considering the autocorrelation through networks and the life cycles of the instances.

【Keywords】: collective inference; data mining; evolution inference; graph mining; heterogeneous information networks

151. Accelerating Online CP Decompositions for Higher Order Tensors.

Paper Link】 【Pages】:1375-1384

【Authors】: Shuo Zhou ; Nguyen Xuan Vinh ; James Bailey ; Yunzhe Jia ; Ian Davidson

【Abstract】: Tensors are a natural representation for multidimensional data. In recent years, CANDECOMP/PARAFAC (CP) decomposition, one of the most popular tools for analyzing multi-way data, has been extensively studied and widely applied. However, today's datasets are often dynamically changing over time. Tracking the CP decomposition for such dynamic tensors is a crucial but challenging task, due to the large scale of the tensor and the velocity of new data arriving. Traditional techniques, such as Alternating Least Squares (ALS), cannot be directly applied to this problem because of their poor scalability in terms of time and memory. Additionally, existing online approaches have only partially addressed this problem and can only be deployed on third-order tensors. To fill this gap, we propose an efficient online algorithm that can incrementally track the CP decompositions of dynamic tensors with an arbitrary number of dimensions. In terms of effectiveness, our algorithm demonstrates comparable results with the most accurate algorithm, ALS, whilst being computationally much more efficient. Specifically, on small and moderate datasets, our approach is tens to hundreds of times faster than ALS, while for large-scale datasets, the speedup can be more than 3,000 times. Compared to other state-of-the-art online approaches, our method shows not only significantly better decomposition quality, but also better performance in terms of stability, efficiency and scalability.

【Keywords】: CP decomposition; online learning; tensor decomposition

Research Track Posters 72

152. Optimal Reserve Prices in Upstream Auctions: Empirical Application on Online Video Advertising.

Paper Link】 【Pages】:1395-1404

【Authors】: Miguel Angel Alcobendas Lisbona ; Sheide Chammas ; Kuang-chih Lee

【Abstract】: We consider optimal reserve prices in BrightRoll Video Exchange when the inventory opportunity comes from other exchanges (downstream marketplaces). We show that the existence of downstream auctions impacts the optimal floor. Moreover, it renders the classical derivation of the floor set by a monopolist inadequate and suboptimal. We derive the new downstream-corrected reserve price and compare its performance with respect to existing floors and the classical optimal monopoly price. In our application, the downstream-corrected reserve price proves superior to both. The proposed model also deals with data challenges commonly faced by exchanges: limited number of logged bids in an auction, and uncertainty regarding the bidding behavior in other exchanges. The relevance of this study transcends its particular context and is applicable to a wide range of scenarios where sequential auctions exist, and where marketplaces interact with each other.

【Keywords】: auctions; online advertising; optimal auctions; reserve price; revenue optimization; video advertising

153. Burstiness Scale: A Parsimonious Model for Characterizing Random Series of Events.

Paper Link】 【Pages】:1405-1414

【Authors】: Rodrigo Augusto da Silva Alves ; Renato Martins Assunção ; Pedro Olmo Stancioli Vaz de Melo

【Abstract】: The problem to accurately and parsimoniously characterize random series of events (RSEs) seen in the Web, such as Yelp reviews or Twitter hashtags, is not trivial. Reports found in the literature reveal two apparent conflicting visions of how RSEs should be modeled. From one side, the Poissonian processes, of which consecutive events follow each other at a relatively regular time and should not be correlated. On the other side, the self-exciting processes, which are able to generate bursts of correlated events. The existence of many and sometimes conflicting approaches to model RSEs is a consequence of the unpredictability of the aggregated dynamics of our individual and routine activities, which sometimes show simple patterns, but sometimes results in irregular rising and falling trends. In this paper we propose a parsimonious way to characterize general RSEs, namely the Burstiness Scale (BuSca) model. BuSca views each RSE as a mix of two independent process: a Poissonian and a self-exciting one. Here we describe a fast method to extract the two parameters of BuSca that, together, gives the burstiness scale ψ, which represents how much of the RSE is due to bursty and viral effects. We validated our method in eight diverse and large datasets containing real random series of events seen in Twitter, Yelp, e-mail conversations, Digg, and online forums. Results showed that, even using only two parameters, BuSca is able to accurately describe RSEs seen in these diverse systems, what can leverage many applications.

【Keywords】: communication dynamics; self-exciting point process; social media

154. MANTRA: A Scalable Approach to Mining Temporally Anomalous Sub-trajectories.

Paper Link】 【Pages】:1415-1424

【Authors】: Prithu Banerjee ; Pranali Yawalkar ; Sayan Ranu

【Abstract】: In this paper, we study the problem of mining temporally anomalous sub-trajectory patterns from an input trajectory in a scalable manner. Given the prevailing road conditions, a sub-trajectory is temporally anomalous if its travel time deviates significantly from the expected time. Mining these patterns requires us to delve into the sub-trajectory space, which is not scalable for real-time analytics. To overcome this scalability challenge, we design a technique called MANTRA. We study the properties unique to anomalous sub-trajectories and utilize them in MANTRA to iteratively refine the search space into a disjoint set of sub-trajectory islands. The expensive enumeration of all possible sub-trajectories is performed only on the islands to compute the answer set of maximal anomalous sub-trajectories. Extensive experiments on both real and synthetic datasets establish MANTRA as more than 3 orders of magnitude faster than baseline techniques. Moreover, through trajectory classification and segmentation, we demonstrate that the proposed model conforms to human intuition.

【Keywords】: anomalous trajectory; gps

155. From Prediction to Action: A Closed-Loop Approach for Data-Guided Network Resource Allocation.

Paper Link】 【Pages】:1425-1434

【Authors】: Yanan Bao ; Huasen Wu ; Xin Liu

【Abstract】: Machine learning methods have been widely used in modeling and predicting network user experience. In this paper, moving beyond user experience prediction, we propose a closed-loop approach that uses data-generated prediction models to explicitly guide resource allocation for user experience improvement. The closed-loop approach leverages and verifies the causal relation that often exists between certain feature values (e.g., bandwidth) and user experience in computer networks. The approach consists of three components: we train a neural network classifier to predict user experience, utilize the trained neural network classifier as the objective function to allocate network resource, and then evaluate user experience with allocated resource to (in)validate and adjust the original model. Specifically, we propose a dual decomposition algorithm to solve the neural network-based resource optimization problem, which is complex and non-convex. We further develop an iterative mechanism for classifier optimization. Numerical results show that the dual algorithm reduces the expected number of unsatisfied users by up to 2x compared with the baseline, and the optimized classifier further improves the performance by 50%.

【Keywords】: classification; computer networks; resource optimization

156. Towards Robust and Versatile Causal Discovery for Business Applications.

Paper Link】 【Pages】:1435-1444

【Authors】: Giorgos Borboudakis ; Ioannis Tsamardinos

【Abstract】: Causal discovery algorithms can induce some of the causal relations from the data, commonly in the form of a causal network such as a causal Bayesian network. Arguably however, all such algorithms lack far behind what is necessary for a true business application. We develop an initial version of a new, general causal discovery algorithm called ETIO with many features suitable for business applications. These include (a) ability to accept prior causal knowledge (e.g., taking senior driving courses improves driving skills), (b) admitting the presence of latent confounding factors, (c) admitting the possibility of (a certain type of) selection bias in the data (e.g., clients sampled mostly from a given region), (d) ability to analyze data with missing-by-design (i.e., not planned to measure) values (e.g., if two companies merge and their databases measure different attributes), and (e) ability to analyze data from different interventions (e.g., prior and posterior to an advertisement campaign). ETIO is an instance of the logical approach to integrative causal discovery that has been relatively recently introduced and enables the solution of complex reverse-engineering problems in causal discovery. ETIO is compared against the state-of-the-art and is shown to be more effective in terms of speed, with only a slight degradation in terms of learning accuracy, while incorporating all the features above. The code is available on the mensxmachina.org website.

【Keywords】: answer set programming; bayes-ball algorithm; causal discovery; interventional data; latent variables; multiple datasets; selection bias; semi-markov causal models

157. Deep Visual-Semantic Hashing for Cross-Modal Retrieval.

Paper Link】 【Pages】:1445-1454

【Authors】: Yue Cao ; Mingsheng Long ; Jianmin Wang ; Qiang Yang ; Philip S. Yu

【Abstract】: Due to the storage and retrieval efficiency, hashing has been widely applied to approximate nearest neighbor search for large-scale multimedia retrieval. Cross-modal hashing, which enables efficient retrieval of images in response to text queries or vice versa, has received increasing attention recently. Most existing work on cross-modal hashing does not capture the spatial dependency of images and temporal dynamics of text sentences for learning powerful feature representations and cross-modal embeddings that mitigate the heterogeneity of different modalities. This paper presents a new Deep Visual-Semantic Hashing (DVSH) model that generates compact hash codes of images and sentences in an end-to-end deep learning architecture, which capture the intrinsic cross-modal correspondences between visual data and natural language. DVSH is a hybrid deep architecture that constitutes a visual-semantic fusion network for learning joint embedding space of images and text sentences, and two modality-specific hashing networks for learning hash functions to generate compact binary codes. Our architecture effectively unifies joint multimodal embedding and cross-modal hashing, which is based on a novel combination of Convolutional Neural Networks over images, Recurrent Neural Networks over sentences, and a structured max-margin objective that integrates all things together to enable learning of similarity-preserving and high-quality hash codes. Extensive empirical evidence shows that our DVSH approach yields state of the art results in cross-modal retrieval experiments on image-sentences datasets, i.e. standard IAPR TC-12 and large-scale Microsoft COCO.

【Keywords】: cross-modal retrieval; deep hashing; multimodal embedding

158. Predicting Socio-Economic Indicators using News Events.

Paper Link】 【Pages】:1455-1464

【Authors】: Sunandan Chakraborty ; Ashwin Venkataraman ; Srikanth Jagabathula ; Lakshminarayanan Subramanian

【Abstract】: Many socio-economic indicators are sensitive to real-world events. Proper characterization of the events can help to identify the relevant events that drive fluctuations in these indicators. In this paper, we propose a novel generative model of real-world events and employ it to extract events from a large corpus of news articles. We introduce the notion of an event class, which is an abstract grouping of similarly themed events. These event classes are manifested in news articles in the form of event triggers which are specific words that describe the actions or incidents reported in any article. We use the extracted events to predict fluctuations in different socio-economic indicators. Specifically, we focus on food prices and predict the price of 12 different crops based on real-world events that potentially influence food price volatility, such as transport strikes, festivals etc. Our experiments demonstrate that incorporating event information in the prediction tasks reduces the root mean square error (RMSE) of prediction by 22% compared to the standard ARIMA model. We also predict sudden increases in the food prices (i.e. spikes) using events as features, and achieve an average 5-10% increase in accuracy compared to baseline models, including an LDA topic-model based predictive model.

【Keywords】: event model; food price; news analytics; predictive model: socio-economic indicators

159. City-Scale Map Creation and Updating using GPS Collections.

Paper Link】 【Pages】:1465-1474

【Authors】: Chen Chen ; Cewu Lu ; Qixing Huang ; Qiang Yang ; Dimitrios Gunopulos ; Leonidas J. Guibas

【Abstract】: Applications such as autonomous driving or real-time route recommendations require up-to-date and accurate digital maps. However, manually creating and updating such maps is too costly to meet the rising demands. As large collections of GPS trajectories become widely available, constructing and updating maps using such trajectory collections can greatly reduce the cost of such maps. Unfortunately, due to GPS noise and varying trajectory sampling rates, inferring maps from GPS trajectories can be very challenging. In this paper, we present a framework to create up-to-date maps with rich knowledge from GPS trajectory collections. Starting from an unstructured GPS point cloud, we discover road segments using novel graph-based clustering techniques with prior knowledge on road design. Based on road segments, we develop a scale- and orientation-invariant traj-SIFT feature to localize and recognize junctions using a supervised learning framework. Maps with rich knowledge are created based on discovered road segments and junctions. Compared to state-of-the-art methods, our approach can efficiently construct high-quality maps at city scales from large collections of GPS trajectories.

【Keywords】: gps trajectories; map construction; traj-sift feature

160. Compressing Convolutional Neural Networks in the Frequency Domain.

Paper Link】 【Pages】:1475-1484

【Authors】: Wenlin Chen ; James T. Wilson ; Stephen Tyree ; Kilian Q. Weinberger ; Yixin Chen

【Abstract】: Convolutional neural networks (CNN) are increasingly used in many areas of computer vision. They are particularly attractive because of their ability to "absorb" great quantities of labeled data through millions of parameters. However, as model sizes increase, so do the storage and memory requirements of the classifiers, hindering many applications such as image and speech recognition on mobile phones and other devices. In this paper, we present a novel net- work architecture, Frequency-Sensitive Hashed Nets (FreshNets), which exploits inherent redundancy in both convolutional layers and fully-connected layers of a deep learning model, leading to dramatic savings in memory and storage consumption. Based on the key observation that the weights of learned convolutional filters are typically smooth and low-frequency, we first convert filter weights to the frequency domain with a discrete cosine transform (DCT) and use a low-cost hash function to randomly group frequency parameters into hash buckets. All parameters assigned the same hash bucket share a single value learned with standard back-propagation. To further reduce model size, we allocate fewer hash buckets to high-frequency components, which are generally less important. We evaluate FreshNets on eight data sets, and show that it leads to better compressed performance than several relevant baselines.

【Keywords】: convolutional neural networks; hashing; model compression

161. Parallel Dual Coordinate Descent Method for Large-scale Linear Classification in Multi-core Environments.

Paper Link】 【Pages】:1485-1494

【Authors】: Wei-Lin Chiang ; Mu-Chu Lee ; Chih-Jen Lin

【Abstract】: Dual coordinate descent method is one of the most effective approaches for large-scale linear classification. However, its sequential design makes the parallelization difficult. In this work, we target at the parallelization in a multi-core environment. After pointing out difficulties faced in some existing approaches, we propose a new framework to parallelize the dual coordinate descent method. The key idea is to make the majority of all operations (gradient calculation here) parallelizable. The proposed framework is shown to be theoretically sound. Further, we demonstrate through experiments that the new framework is robust and efficient in a multi-core environment.

【Keywords】: dual coordinate descent; linear classification; multi-core computing

162. Multi-layer Representation Learning for Medical Concepts.

Paper Link】 【Pages】:1495-1504

【Authors】: Edward Choi ; Mohammad Taha Bahadori ; Elizabeth Searles ; Catherine Coffey ; Michael Thompson ; James Bost ; Javier Tejedor-Sojo ; Jimeng Sun

【Abstract】: Proper representations of medical concepts such as diagnosis, medication, procedure codes and visits from Electronic Health Records (EHR) has broad applications in healthcare analytics. Patient EHR data consists of a sequence of visits over time, where each visit includes multiple medical concepts, e.g., diagnosis, procedure, and medication codes. This hierarchical structure provides two types of relational information, namely sequential order of visits and co-occurrence of the codes within a visit. In this work, we propose Med2Vec, which not only learns the representations for both medical codes and visits from large EHR datasets with over million visits, but also allows us to interpret the learned representations confirmed positively by clinical experts. In the experiments, Med2Vec shows significant improvement in prediction accuracy in clinical applications compared to baselines such as Skip-gram, GloVe, and stacked autoencoder, while providing clinically meaningful interpretation.

【Keywords】: healthcare analytics; medical concepts; neural networks; representation learning

163. Finding Gangs in War from Signed Networks.

Paper Link】 【Pages】:1505-1514

【Authors】: Lingyang Chu ; Zhefeng Wang ; Jian Pei ; Jiannan Wang ; Zijin Zhao ; Enhong Chen

【Abstract】: Given a signed network where edges are weighted in real number, and positive weights indicate cohesion between vertices and negative weights indicate opposition, we are interested in finding k-Oppositive Cohesive Groups (k-OCG). Each k-OCG is a group of k subgraphs such that (1) the edges within each subgraph are dense and cohesive; and (2) the edges crossing different subgraphs are dense and oppositive. Finding k-OCGs is challenging since the subgraphs are often small, there are multiple k-OCGs in a large signed network, and many existing dense subgraph extraction methods cannot handle edges of two signs. We model k-OCG finding task as a quadratic optimization problem. However, the classical Proximal Gradient method is very costly since it has to use the entire adjacency matrix, which is huge on large networks. Thus, we develop FOCG, an algorithm that is two orders of magnitudes faster than the Proximal Gradient method. The main idea is to only search in small subgraphs and thus avoids using a major portion of the adjacency matrix. Our experimental results on synthetic and real data sets as well as a case study clearly demonstrate the effectiveness and efficiency of our method.

【Keywords】: clustering; community detection; dense subgraph; signed network

164. Efficient Processing of Network Proximity Queries via Chebyshev Acceleration.

Paper Link】 【Pages】:1515-1524

【Authors】: Mustafa Coskun ; Ananth Grama ; Mehmet Koyutürk

【Abstract】: Network proximity is at the heart of a large class of network analytics and information retrieval techniques, including node/ edge rankings, network alignment, and randomwalk based proximity queries, among many others. Owing to its importance, significant effort has been devoted to accelerating iterative processes underlying network proximity computations. These techniques rely on numerical properties of power iterations, as well as structural properties of the networks to reduce the run time of iterative algorithms. In this paper, we present an alternate approach to acceleration of network proximity queries using Chebyshev polynomials. We show that our approach, called CHOPPER, yields asymptotically faster convergence in theory, and significantly reduced convergence times in practice. We also show that other existing acceleration techniques can be used in conjunction with Chopper to further reduce runtime. Using a number of large real-world networks, and top-k proximity queries as the benchmark problem, we show that CHOPPER outperforms existing methods for wide ranges of parameter values. CHOPPER is implemented in Matlab and is freely available at http://compbio.case.edu/chopper/.

【Keywords】: chebyshev polynomials; network proximity; random walk with restarts

165. Latent Space Model for Road Networks to Predict Time-Varying Traffic.

Paper Link】 【Pages】:1525-1534

【Authors】: Dingxiong Deng ; Cyrus Shahabi ; Ugur Demiryurek ; Linhong Zhu ; Rose Yu ; Yan Liu

【Abstract】: Real-time traffic prediction from high-fidelity spatiotemporal traffic sensor datasets is an important problem for intelligent transportation systems and sustainability. However, it is challenging due to the complex topological dependencies and high dynamism associated with changing road conditions. In this paper, we propose a Latent Space Model for Road Networks (LSM-RN) to address these challenges holistically. In particular, given a series of road network snapshots, we learn the attributes of vertices in latent spaces which capture both topological and temporal properties. As these latent attributes are time-dependent, they can estimate how traffic patterns form and evolve. In addition, we present an incremental online algorithm which sequentially and adaptively learns the latent attributes from the temporal graph changes. Our framework enables real-time traffic prediction by 1) exploiting real-time sensor readings to adjust/update the existing latent spaces, and 2) training as data arrives and making predictions on-the-fly. By conducting extensive experiments with a large volume of real-world traffic sensor data, we demonstrate the superiority of our framework for real-time traffic prediction on large road networks over competitors as well as baseline graph-based LSM's.

【Keywords】: latent space model; real-time traffic forecasting; road network

166. Compressing Graphs and Indexes with Recursive Graph Bisection.

Paper Link】 【Pages】:1535-1544

【Authors】: Laxman Dhulipala ; Igor Kabiljo ; Brian Karrer ; Giuseppe Ottaviano ; Sergey Pupyrev ; Alon Shalita

【Abstract】: Graph reordering is a powerful technique to increase the locality of the representations of graphs, which can be helpful in several applications. We study how the technique can be used to improve compression of graphs and inverted indexes. We extend the recent theoretical model of Chierichetti et al. (KDD 2009) for graph compression, and show how it can be employed for compression-friendly reordering of social networks and web graphs and for assigning document identifiers in inverted indexes. We design and implement a novel theoretically sound reordering algorithm that is based on recursive graph bisection. Our experiments show a significant improvement of the compression rate of graph and indexes over existing heuristics. The new method is relatively simple and allows efficient parallel and distributed implementations, which is demonstrated on graphs with billions of vertices and hundreds of billions of edges.

【Keywords】: algorithms; approximation algorithms; compression; data mining; distributed algorithms; experimentation; graph algorithms; linear arrangement; social networks

167. Fast Unsupervised Online Drift Detection Using Incremental Kolmogorov-Smirnov Test.

Paper Link】 【Pages】:1545-1554

【Authors】: Denis Moreira dos Reis ; Peter A. Flach ; Stan Matwin ; Gustavo Batista

【Abstract】: Data stream research has grown rapidly over the last decade. Two major features distinguish data stream from batch learning: stream data are generated on the fly, possibly in a fast and variable rate; and the underlying data distribution can be non-stationary, leading to a phenomenon known as concept drift. Therefore, most of the research on data stream classification focuses on proposing efficient models that can adapt to concept drifts and maintain a stable performance over time. However, specifically for the classification task, the majority of such methods rely on the instantaneous availability of true labels for all already classified instances. This is a strong assumption that is rarely fulfilled in practical applications. Hence there is a clear need for efficient methods that can detect concept drifts in an unsupervised way. One possibility is the well-known Kolmogorov-Smirnov test, a statistical hypothesis test that checks whether two samples differ. This work has two main contributions. The first one is the Incremental Kolmogorov-Smirnov algorithm that allows performing the Kolmogorov-Smirnov hypothesis test instantly using two samples that change over time, where the change is an insertion and/or removal of an observation. Our algorithm employs a randomized tree and is able to perform the insertion and removal operations in O(log N) with high probability and calculate the Kolmogorov-Smirnov test in O(1), where N is the number of sample observations. This is a significant speed-up compared to the O(N log N) cost of the non-incremental implementation. The second contribution is the use of the Incremental Kolmogorov-Smirnov test to detect concept drifts without true labels. Classification algorithms adapted to use the test rely on a limited portion of those labels just to update the classification model after a concept drift is detected.

【Keywords】: cartesian tree; concept drift; data stream; kolmogorov-smirnov; lazy propagation; treap

168. Recurrent Marked Temporal Point Processes: Embedding Event History to Vector.

Paper Link】 【Pages】:1555-1564

【Authors】: Nan Du ; Hanjun Dai ; Rakshit Trivedi ; Utkarsh Upadhyay ; Manuel Gomez-Rodriguez ; Le Song

【Abstract】: Large volumes of event data are becoming increasingly available in a wide variety of applications, such as healthcare analytics, smart cities and social network analysis. The precise time interval or the exact distance between two events carries a great deal of information about the dynamics of the underlying systems. These characteristics make such data fundamentally different from independently and identically distributed data and time-series data where time and space are treated as indexes rather than random variables. Marked temporal point processes are the mathematical framework for modeling event data with covariates. However, typical point process models often make strong assumptions about the generative processes of the event data, which may or may not reflect the reality, and the specifically fixed parametric assumptions also have restricted the expressive power of the respective processes. Can we obtain a more expressive model of marked temporal point processes? How can we learn such a model from massive data? In this paper, we propose the Recurrent Marked Temporal Point Process (RMTPP) to simultaneously model the event timings and the markers. The key idea of our approach is to view the intensity function of a temporal point process as a nonlinear function of the history, and use a recurrent neural network to automatically learn a representation of influences from the event history. We develop an efficient stochastic gradient algorithm for learning the model parameters which can readily scale up to millions of events. Using both synthetic and real world datasets, we show that, in the case where the true models have parametric specifications, RMTPP can learn the dynamics of such models without the need to know the actual parametric forms; and in the case where the true models are unknown, RMTPP can also learn the dynamics and achieve better predictive performance than other parametric alternatives based on particular prior assumptions.

【Keywords】: marked temporal point process; recurrent neural network; stochastic process

169. Learning Cumulatively to Become More Knowledgeable.

Paper Link】 【Pages】:1565-1574

【Authors】: Geli Fei ; Shuai Wang ; Bing Liu

【Abstract】: In classic supervised learning, a learning algorithm takes a fixed training data of several classes to build a classifier. In this paper, we propose to study a new problem, i.e., building a learning system that learns cumulatively. As time goes by, the system sees and learns more and more classes of data and becomes more and more knowledgeable. We believe that this is similar to human learning. We humans learn continuously, retaining the learned knowledge, identifying and learning new things, and updating the existing knowledge with new experiences. Over time, we cumulate more and more knowledge. A learning system should be able to do the same. As algorithmic learning matures, it is time to tackle this cumulative machine learning (or simply cumulative learning) problem, which is a kind of lifelong machine learning problem. It presents two major challenges. First, the system must be able to detect data from unseen classes in the test set. Classic supervised learning, however, assumes all classes in testing are known or seen at the training time. Second, the system needs to be able to selectively update its models whenever a new class of data arrives without re-training the whole system using the entire past and present training data. This paper proposes a novel approach and system to tackle these challenges. Experimental results on two datasets with learning from 2 classes to up to 100 classes show that the proposed approach is highly promising in terms of both classification accuracy and computational efficiency.

【Keywords】: cumulative machine learning; lifelong machine learning; open world classification; unseen classes

170. Squish: Near-Optimal Compression for Archival of Relational Datasets.

Paper Link】 【Pages】:1575-1584

【Authors】: Yihan Gao ; Aditya G. Parameswaran

【Abstract】: Relational datasets are being generated at an alarmingly rapid rate across organizations and industries. Compressing these datasets could significantly reduce storage and archival costs. Traditional compression algorithms, e.g., gzip, are suboptimal for compressing relational datasets since they ignore the table structure and relationships between attributes. We study compression algorithms that leverage the relational structure to compress datasets to a much greater extent. We develop Squish, a system that uses a combination of Bayesian Networks and Arithmetic Coding to capture multiple kinds of dependencies among attributes and achieve near-entropy compression rate. Squish also supports user-defined attributes: users can instantiate new data types by simply implementing five functions for a new class interface. We prove the asymptotic optimality of our compression algorithm and conduct experiments to show the effectiveness of our system: Squish achieves a reduction of over 50% in storage size relative to systems developed in prior work on a variety of real datasets.

【Keywords】: arithmetic coding; bayesian network; data compression; lossy compression

171. Fast Component Pursuit for Large-Scale Inverse Covariance Estimation.

Paper Link】 【Pages】:1585-1594

【Authors】: Lei Han ; Yu Zhang ; Tong Zhang

【Abstract】: The maximum likelihood estimation (MLE) for the Gaussian graphical model, which is also known as the inverse covariance estimation problem, has gained increasing interest recently. Most existing works assume that inverse covariance estimators contain sparse structure and then construct models with the l 1 regularization. In this paper, different from existing works, we study the inverse covariance estimation problem from another perspective by efficiently modeling the low-rank structure in the inverse covariance, which is assumed to be a combination of a low-rank part and a diagonal matrix. One motivation for this assumption is that the low-rank structure is common in many applications including the climate and financial analysis, and another one is that such assumption can reduce the computational complexity when computing its inverse. Specifically, we propose an efficient COmponent Pursuit (COP) method to obtain the low-rank part, where each component can be sparse. For optimization, the COP method greedily learns a rank-one component in each iteration by maximizing the log-likelihood. Moreover, the COP algorithm enjoys several appealing properties including the existence of an efficient solution in each iteration and the theoretical guarantee on the convergence of this greedy approach. Experiments on large-scale synthetic and real-world datasets including thousands of millions variables show that the COP method is faster than the state-of-the-art techniques for the inverse covariance estimation problem when achieving comparable log-likelihood on test data.

【Keywords】: component pursuit; greedy algorithm; inverse covariance estimation; large-scale data

172. Meta Structure: Computing Relevance in Large Heterogeneous Information Networks.

Paper Link】 【Pages】:1595-1604

【Authors】: Zhipeng Huang ; Yudian Zheng ; Reynold Cheng ; Yizhou Sun ; Nikos Mamoulis ; Xiang Li

【Abstract】: A heterogeneous information network (HIN) is a graph model in which objects and edges are annotated with types. Large and complex databases, such as YAGO and DBLP, can be modeled as HINs. A fundamental problem in HINs is the computation of closeness, or relevance, between two HIN objects. Relevance measures can be used in various applications, including entity resolution, recommendation, and information retrieval. Several studies have investigated the use of HIN information for relevance computation, however, most of them only utilize simple structure, such as path, to measure the similarity between objects. In this paper, we propose to use meta structure, which is a directed acyclic graph of object types with edge types connecting in between, to measure the proximity between objects. The strength of meta structure is that it can describe complex relationship between two HIN objects (e.g., two papers in DBLP share the same authors and topics). We develop three relevance measures based on meta structure. Due to the computational complexity of these measures, we further design an algorithm with data structures proposed to support their evaluation. Our extensive experiments on YAGO and DBLP show that meta structure-based relevance is more effective than state-of-the-art approaches, and can be efficiently computed.

【Keywords】: heterogeneous information network; meta path; meta structure; relevance

173. Robust and Effective Metric Learning Using Capped Trace Norm: Metric Learning via Capped Trace Norm.

Paper Link】 【Pages】:1605-1614

【Authors】: Zhouyuan Huo ; Feiping Nie ; Heng Huang

【Abstract】: Metric learning aims at automatically learning a metric from pair or triplet based constraints in data, and it can be potentially beneficial whenever the notion of metric between instances plays a nontrivial role. In Mahalanobis distance metric learning, distance matrix M is in symmetric positive semi-definite cone, and in order to avoid overfitting and to learn a better Mahalanobis distance from weakly supervised constraints, the low-rank regularization has been often imposed on matrix M to learn the correlations between features and samples. As the approximations of the rank minimization function, the trace norm and Fantope have been utilized to regularize the metric learning objectives and achieve good performance. However, these low-rank regularization models are either not tight enough to approximate rank minimization or time-consuming to tune an optimal rank. In this paper, we introduce a novel metric learning model using the capped trace norm based regularization, which uses a singular value threshold to constraint the metric matrix M as low-rank explicitly such that the rank of matrix M is stable when the large singular values vary. The capped trace norm regularization can also be viewed as the adaptive Fantope regularization. We minimize singular values which are less than threshold value and the rank of M is not necessary to be k, thus our method is more stable and applicable in practice when we do not know the optimal rank of matrix M. We derive an efficient optimization algorithm to solve the proposed new model and the algorithm convergence proof is also provided in this paper. We evaluate our method on a variety of challenging benchmarks, such as LFW and Pubfig datasets. Face verification experiments are performed and results show that our method consistently outperforms the state-of-the-art metric learning algorithms.

【Keywords】: capped trace norm; face verification; low-rank approximation; metric learning

174. Subjectively Interesting Component Analysis: Data Projections that Contrast with Prior Expectations.

Paper Link】 【Pages】:1615-1624

【Authors】: Bo Kang ; Jefrey Lijffijt ; Raúl Santos-Rodriguez ; Tijl De Bie

【Abstract】: Methods that find insightful low-dimensional projections are essential to effectively explore high-dimensional data. Principal Component Analysis is used pervasively to find low-dimensional projections, not only because it is straightforward to use, but it is also often effective, because the variance in data is often dominated by relevant structure. However, even if the projections highlight real structure in the data, not all structure is interesting to every user. If a user is already aware of, or not interested in the dominant structure, Principal Component Analysis is less effective for finding interesting components. We introduce a new method called Subjectively Interesting Component Analysis (SICA), designed to find data projections that are subjectively interesting, i.e, projections that truly surprise the end-user. It is rooted in information theory and employs an explicit model of a user's prior expectations about the data. The corresponding optimization problem is a simple eigenvalue problem, and the result is a trade-off between explained variance and novelty. We present five case studies on synthetic data, images, time-series, and spatial data, to illustrate how SICA enables users to find (subjectively) interesting projections.

【Keywords】: dimensionality reduction; exploratory data mining; information theory; subjective interestingness

175. Online Optimization Methods for the Quantification Problem.

Paper Link】 【Pages】:1625-1634

【Authors】: Purushottam Kar ; Shuai Li ; Harikrishna Narasimhan ; Sanjay Chawla ; Fabrizio Sebastiani

【Abstract】: The estimation of class prevalence, i.e., of the fraction of a population that belongs to a certain class, is an important task in data analytics, and finds applications in many domains such as the social sciences, market research, epidemiology, and others. For example, in sentiment analysis the goal is often not to estimate whether a specific text conveys a positive or a negative sentiment, but rather to estimate the overall distribution of positive and negative sentiments, e.g., in a certain time frame. A popular way of performing the above task, often dubbed quantification, is to use supervised learning in order to train a prevalence estimator from labeled data. In the literature there are several performance metrics for measuring the success of such prevalence estimators. In this paper we propose the first online stochastic algorithms for directly optimizing these quantification-specific performance measures. We also provide algorithms that optimize hybrid performance measures that seek to balance quantification and classification performance. Our algorithms present a significant advancement in the theory of multivariate optimization; we show, via a rigorous theoretical analysis, that they exhibit optimal convergence. We also report extensive experiments on benchmark and real data sets which demonstrate that our methods significantly outperform existing optimization techniques used for these performance measures.

【Keywords】: class imbalance; classification; multivariate optimization; quantification; stochastic optimization

176. Smart Broadcasting: Do You Want to be Seen?

Paper Link】 【Pages】:1635-1644

【Authors】: Mohammad Reza Karimi ; Erfan Tavakoli ; Mehrdad Farajtabar ; Le Song ; Manuel Gomez-Rodriguez

【Abstract】: Many users in online social networks are constantly trying to gain attention from their followers by broadcasting posts to them. These broadcasters are likely to gain greater attention if their posts can remain visible for a longer period of time among their followers' most recent feeds. Then when to post? In this paper, we study the problem of smart broadcasting using the framework of temporal point processes, where we model users feeds and posts as discrete events occurring in continuous time. Based on such continuous-time model, then choosing a broadcasting strategy for a user becomes a problem of designing the conditional intensity of her posting events. We derive a novel formula which links this conditional intensity with the visibility of the user in her followers' feeds. Furthermore, by exploiting this formula, we develop an efficient convex optimization framework for the when-to-post problem. Our method can find broadcasting strategies that reach a desired visibility level with provable guarantees. We experimented with data gathered from Twitter, and show that our framework can consistently make broadcasters' post more visible than alternatives.

【Keywords】: broadcast; information networks; social networks; visibility

177. How to Compete Online for News Audience: Modeling Words that Attract Clicks.

Paper Link】 【Pages】:1645-1654

【Authors】: Joon Hee Kim ; Amin Mantrach ; Alejandro Jaimes ; Alice H. Oh

【Abstract】: Headlines are particularly important for online news outlets where there are many similar news stories competing for users' attention. Traditionally, journalists have followed rules-of-thumb and experience to master the art of crafting catchy headlines, but with the valuable resource of large-scale click-through data of online news articles, we can apply quantitative analysis and text mining techniques to acquire an in-depth understanding of headlines. In this paper, we conduct a large-scale analysis and modeling of 150K news articles published over a period of four months on the Yahoo home page. We define a simple method to measure click-value of individual words, and analyze how temporal trends and linguistic attributes affect click-through rate (CTR). We then propose a novel generative model, headline click-based topic model (HCTM), that extends latent Dirichlet allocation (LDA) to reveal the effect of topical context on the click-value of words in headlines. HCTM leverages clicks in aggregate on previously published headlines to identify words for headlines that will generate more clicks in the future. We show that by jointly taking topics and clicks into account we can detect changes in user interests within topics. We evaluate HCTM in two different experimental settings and compare its performance with ALDA (adapted LDA), LDA, and TextRank. The first task, full headline, is to retrieve full headline used for a news article given the body of news article. The second task, good headline, is to specifically identify words in the headline that have high click values for current news audience. For full headline task, our model performs on par with ALDA, a state-of-the art web-page summarization method that utilizes click-through information. For good headline task, which is of more practical importance to both individual journalists and online news outlets, our model significantly outperforms all other comparative methods.

【Keywords】: click-through rate; headline prediction; large-scale analysis; online news analysis

178. Causal Clustering for 1-Factor Measurement Models.

Paper Link】 【Pages】:1655-1664

【Authors】: Erich Kummerfeld ; Joseph Ramsey

【Abstract】: Many scientific research programs aim to learn the causal structure of real world phenomena. This learning problem is made more difficult when the target of study cannot be directly observed. One strategy commonly used by social scientists is to create measurable ``indicator'' variables that covary with the latent variables of interest. Before leveraging the indicator variables to learn about the latent variables, however, one needs a measurement model of the causal relations between the indicators and their corresponding latents. These measurement models are a special class of Bayesian networks. This paper addresses the problem of reliably inferring measurement models from measured indicators, without prior knowledge of the causal relations or the number of latent variables. We present a provably correct novel algorithm, FindOneFactorClusters (FOFC), for solving this inference problem. Compared to other state of the art algorithms, FOFC is faster, scales to larger sets of indicators, and is more reliable at small sample sizes. We also present the first correctness proofs for this problem that do not assume linearity or acyclicity among the latent variables.

【Keywords】: causal search; factor models

179. Optimally Discriminative Choice Sets in Discrete Choice Models: Application to Data-Driven Test Design.

Paper Link】 【Pages】:1665-1674

【Authors】: Igor Labutov ; Frans Schalekamp ; Kelvin Luu ; Hod Lipson ; Christoph Studer

【Abstract】: Difficult multiple-choice (MC) questions can be made easy by providing a set of answer options of which most are obviously wrong. In the education literature, a plethora of instructional guides exist for crafting a suitable set of wrong choices (distractors) that enable the assessment of the students' understanding. The art of MC question design thus hinges on the question-maker's experience and knowledge of the potential misconceptions. In contrast, we advocate a data-driven approach, where correct and incorrect options are assembled directly from the students' own past submissions. Large-scale online classroom settings, such as massively open online courses (MOOCs), provide an opportunity to design optimal and adaptive multiple-choice questions that are maximally informative about the students' level of understanding of the material. In this work, we (i) develop a multinomial-logit discrete choice model for the setting of MC testing, (ii) derive an optimization objective for selecting optimally discriminative option sets, (iii) propose an algorithm for finding a globally-optimal solution, and (iv) demonstrate the effectiveness of our approach via synthetic experiments and a user study. We finally showcase an application of our approach to crowd-sourcing tests from technical online forums.

【Keywords】: adaptive learning; assessment; crowdsourcing; optimal testing

180. Interpretable Decision Sets: A Joint Framework for Description and Prediction.

Paper Link】 【Pages】:1675-1684

【Authors】: Himabindu Lakkaraju ; Stephen H. Bach ; Jure Leskovec

【Abstract】: One of the most important obstacles to deploying predictive models is the fact that humans do not understand and trust them. Knowing which variables are important in a model's prediction and how they are combined can be very powerful in helping people understand and trust automatic decision making systems. Here we propose interpretable decision sets, a framework for building predictive models that are highly accurate, yet also highly interpretable. Decision sets are sets of independent if-then rules. Because each rule can be applied independently, decision sets are simple, concise, and easily interpretable. We formalize decision set learning through an objective function that simultaneously optimizes accuracy and interpretability of the rules. In particular, our approach learns short, accurate, and non-overlapping rules that cover the whole feature space and pay attention to small but important classes. Moreover, we prove that our objective is a non-monotone submodular function, which we efficiently optimize to find a near-optimal set of rules. Experiments show that interpretable decision sets are as accurate at classification as state-of-the-art machine learning techniques. They are also three times smaller on average than rule-based models learned by other methods. Finally, results of a user study show that people are able to answer multiple-choice questions about the decision boundaries of interpretable decision sets and write descriptions of classes based on them faster and more accurately than with other rule-based models that were designed for interpretability. Overall, our framework provides a new approach to interpretable machine learning that balances accuracy, interpretability, and computational efficiency.

【Keywords】: classification; decision sets; interpretable machine learning; submodularity

181. Lightweight Monitoring of Distributed Streams.

Paper Link】 【Pages】:1685-1694

【Authors】: Arnon Lazerson ; Daniel Keren ; Assaf Schuster

【Abstract】: As data becomes dynamic, large, and distributed, there is increasing demand for what have become known as distributed stream algorithms. Since continuously collecting the data to a central server and processing it there incurs very high communication and computation complexities, it is advantageous to define local conditions at the nodes, such that -- as long as they are maintained -- some desirable global condition holds. A generic algorithm which proved very useful for reducing communication in distributed streaming environments is geometric monitoring (GM). Alas, applying GM to many important tasks is computationally very demanding, as it requires solving a notoriously difficult problem -- computing the distance between a point and a surface, which is often very time-consuming even in low dimensions. Thus, while useful for reducing communication, GM often suffers from exceedingly heavy computational burden at the nodes, which renders it very problematic to apply, especially for thin'', battery-operated sensors, which are prevalent in numerous applications, including theInternet of Things'' paradigm. Here we propose a very different approach, designated CB (for Convex/Concave Bounds). CB is based on directly bounding the monitored function by suitably chosen convex and concave functions, that naturally enable monitoring distributed streams. These functions can be checked on the fly, yielding far simpler local conditions than those applied by GM. CB's superiority over GM is demonstrated in reducing computational complexity, by several orders of magnitude in some cases. As an added bonus, CB also reduced communication overhead in all application scenarios we tested.

【Keywords】: disributed streams; distributed monitoring; resource limited devices

182. Bayesian Inference of Arrival Rate and Substitution Behavior from Sales Transaction Data with Stockouts.

Paper Link】 【Pages】:1695-1704

【Authors】: Benjamin Letham ; Lydia M. Letham ; Cynthia Rudin

【Abstract】: When an item goes out of stock, sales transaction data no longer reflect the original customer demand, since some customers leave with no purchase while others substitute alternative products for the one that was out of stock. Here we develop a Bayesian hierarchical model for inferring the underlying customer arrival rate and choice model from sales transaction data and the corresponding stock levels. The model uses a nonhomogeneous Poisson process to allow the arrival rate to vary throughout the day, and allows for a variety of choice models. Model parameters are inferred using a stochastic gradient MCMC algorithm that can scale to large transaction databases. We fit the model to data from a local bakery and show that it is able to make accurate out-of-sample predictions, and to provide actionable insight into lost cookie sales.

【Keywords】: bayesian analysis; langevin dynamics; markov chain monte carlo; sales transaction data

183. Parallel Lasso Screening for Big Data Optimization.

Paper Link】 【Pages】:1705-1714

【Authors】: Qingyang Li ; Shuang Qiu ; Shuiwang Ji ; Paul M. Thompson ; Jieping Ye ; Jie Wang

【Abstract】: Lasso regression is a widely used technique in data mining for model selection and feature extraction. In many applications, it remains challenging to apply the regression model to large-scale problems that have massive data samples with high-dimensional features. One popular and promising strategy is to solve the Lasso problem in parallel. Parallel solvers run multiple cores in parallel on a shared memory system to speedup the computation, while the practical usage is limited by the huge dimension in the feature space. Screening is a promising method to solve the problem of high dimensionality by discarding the inactive features and removing them from optimization. However, when integrating screening methods with parallel solvers, most of solvers cannot guarantee the convergence on the reduced feature matrix. In this paper, we propose a novel parallel framework by parallelizing screening methods and integrating it with our proposed parallel solver. We propose two parallel screening algorithms: Parallel Strong Rule (PSR) and Parallel Dual Polytope Projection (PDPP). For the parallel solver, we proposed an Asynchronous Grouped Coordinate Descent method (AGCD) to optimize the regression problem in parallel on the reduced feature matrix. AGCD is based on a grouped selection strategy to select the coordinate that has the maximum descent for the objective function in a group of candidates. Empirical studies on the real-world datasets demonstrate that the proposed parallel framework has a superior performance compared to the state-of-the-art parallel solvers.

【Keywords】: aynchronized coordinate descent; coordinate descent; lasso regression; parallel computing; screening rules

184. A Multi-Task Learning Formulation for Survival Analysis.

Paper Link】 【Pages】:1715-1724

【Authors】: Yan Li ; Jie Wang ; Jieping Ye ; Chandan K. Reddy

【Abstract】: Predicting the occurrence of a particular event of interest at future time points is the primary goal of survival analysis. The presence of incomplete observations due to time limitations or loss of data traces is known as censoring which brings unique challenges in this domain and differentiates survival analysis from other standard regression methods. The popularly used survival analysis methods such as Cox proportional hazard model and parametric survival regression suffer from some strict assumptions and hypotheses that are not realistic in most of the real-world applications. To overcome the weaknesses of these two types of methods, in this paper, we reformulate the survival analysis problem as a multi-task learning problem and propose a new multi-task learning based formulation to predict the survival time by estimating the survival status at each time interval during the study duration. We propose an indicator matrix to enable the multi-task learning algorithm to handle censored instances and incorporate some of the important characteristics of survival problems such as non-negative non-increasing list structure into our model through max-heap projection. We employ the L2,1-norm penalty which enables the model to learn a shared representation across related tasks and hence select important features and alleviate over-fitting in high-dimensional feature spaces; thus, reducing the prediction error of each task. To efficiently handle the two non-smooth constraints, in this paper, we propose an optimization method which employs Alternating Direction Method of Multipliers (ADMM) algorithm to solve the proposed multi-task learning problem. We demonstrate the performance of the proposed method using real-world microarray gene expression high-dimensional benchmark datasets and show that our method outperforms state-of-the-art methods.

【Keywords】: high-dimensional data; multi-task learning; regularization; survival analysis

185. A Real Linear and Parallel Multiple Longest Common Subsequences (MLCS) Algorithm.

Paper Link】 【Pages】:1725-1734

【Authors】: Yanni Li ; Hui Li ; Tihua Duan ; Sheng Wang ; Zhi Wang ; Yang Cheng

【Abstract】: Information in various applications is often expressed as character sequences over a finite alphabet (e.g., DNA or protein sequences). In Big Data era, the lengths and sizes of these sequences are growing explosively, leading to grand challenges for the classical NP-hard problem, namely searching for the Multiple Longest Common Subsequences (MLCS) from multiple sequences. In this paper, we first unveil the fact that the state-of-the-art MLCS algorithms are unable to be applied to long and large-scale sequences alignments. To overcome their defects and tackle the longer and large-scale or even big sequences alignments, based on the proposed novel problem-solving model and various strategies, e.g., parallel topological sorting, optimal calculating, reuse of intermediate results, subsection calculation and serialization, etc., we present a novel parallel MLCS algorithm. Exhaustive experiments on the datasets of both synthetic and real-world biological sequences demonstrate that both the time and space of the proposed algorithm are only linear in the number of dominants from aligned sequences, and the proposed algorithm significantly outperforms the state-of-the-art MLCS algorithms, being applicable to longer and large-scale sequences alignments.

【Keywords】: multiple longest common subsequences (mlcs); non-redu-ndant common subsequence graph (ncsg); subsection calculation and serialization; topological sorting

186. Multi-Task Feature Interaction Learning.

Paper Link】 【Pages】:1735-1744

【Authors】: Kaixiang Lin ; Jianpeng Xu ; Inci M. Baytas ; Shuiwang Ji ; Jiayu Zhou

【Abstract】: One major limitation of linear models is the lack of capability to capture predictive information from interactions between features. While introducing high-order feature interaction terms can overcome this limitation, this approach tremendously increases the model complexity and imposes significant challenges in the learning against overfitting. In this paper, we proposed a novel Multi-Task feature Interaction Learning~(MTIL) framework to exploit the task relatedness from high-order feature interactions, which provides better generalization performance by inductive transfer among tasks via shared representations of feature interactions. We formulate two concrete approaches under this framework and provide efficient algorithms: the shared interaction approach and the embedded interaction approach. The former assumes tasks share the same set of interactions, and the latter assumes feature interactions from multiple tasks come from a shared subspace. We have provided efficient algorithms for solving the two approaches. Extensive empirical studies on both synthetic and real datasets have demonstrated the effectiveness of the proposed framework.

【Keywords】: feature interaction; muti-task learning; structured regularization; tensor norm

187. Infinite Ensemble for Image Clustering.

Paper Link】 【Pages】:1745-1754

【Authors】: Hongfu Liu ; Ming Shao ; Sheng Li ; Yun Fu

【Abstract】: Image clustering has been a critical preprocessing step for vision tasks, e.g., visual concept discovery, content-based image retrieval. Conventional image clustering methods use handcraft visual descriptors as basic features via K-means, or build the graph within spectral clustering. Recently, representation learning with deep structure shows appealing performance in unsupervised feature pre-treatment. However, few studies have discussed how to deploy deep representation learning to image clustering problems, especially the unified framework which integrates both representation learning and ensemble clustering for efficient image clustering still remains void. In addition, even though it is widely recognized that with the increasing number of basic partitions, ensemble clustering gets better performance and lower variances, the best number of basic partitions for a given data set is a pending problem. In light of this, we propose the Infinite Ensemble Clustering (IEC), which incorporates the power of deep representation and ensemble clustering in a one-step framework to fuse infinite basic partitions. Generally speaking, a set of basic partitions is firstly generated from the image data, then by converting the basic partitions to the 1-of-K codings, we link the marginalized auto-encoder to the infinite ensemble clustering with i.i.d. basic partitions, which can be approached by the closed-form solutions, finally we follow the layer-wise training procedure and feed the concatenated deep features to K-means for final clustering. Extensive experiments on diverse vision data sets with different levels of visual descriptors demonstrate both the time efficiency and superior performance of IEC compared to the state-of-the-art ensemble clustering and deep clustering methods.

【Keywords】: ensemble clustering; k-means; marginalized denoising auto-encoder

188. Scalable Pattern Matching over Compressed Graphs via Dedensification.

Paper Link】 【Pages】:1755-1764

【Authors】: Antonio Maccioni ; Daniel J. Abadi

【Abstract】: One of the most common operations on graph databases is graph pattern matching (e.g., graph isomorphism and more general types of "subgraph pattern matching"). In fact, in some graph query languages every single query is expressed as a graph matching operation. Consequently, there has been a significant amount of research effort in optimizing graph matching operations in graph database systems. As graph databases have scaled in recent years, so too has recent work on scaling graph matching operations. However, the performance of recent proposals for scaling graph pattern matching is limited by the presence of high-degree nodes. These high-degree nodes result in an explosion of intermediate result sizes during query execution, and therefore significant performance bottlenecks. In this paper we present a dedensification technique that losslessly compresses the neighborhood around high-degree nodes. Furthermore, we introduce a query processing technique that enables direct operation of graph query processing operations over the compressed data, without ever having to decompress the data. For pattern matching operations, we show how this technique can be implemented as a layer above existing graph database systems, so that the end-user can benefit from this technique without requiring modifications to the core graph database engine code. Our technique reduces the size of the intermediate result sets during query processing, and thereby improves query performance.

【Keywords】: big graphs; graph databases; graph pattern matching; power-law; rdf; scale-free graphs; sparql

189. Scalable Betweenness Centrality Maximization via Sampling.

Paper Link】 【Pages】:1765-1773

【Authors】: Ahmad Mahmoody ; Charalampos E. Tsourakakis ; Eli Upfal

【Abstract】: Betweenness centrality (BWC) is a fundamental centrality measure in social network analysis. Given a large-scale network, how can we find the most central nodes? This question is of great importance to many key applications that rely on BWC, including community detection and understanding graph vulnerability. Despite the large amount of work on scalable approximation algorithm design for BWC, estimating BWC on large-scale networks remains a computational challenge. In this paper, we study the Centrality Maximization problem (CMP): given a graph G = (V,E) and a positive integer k, find a set S ⊆ V that maximizes BWC subject to the cardinality constraint |S| ≤ k. We present an efficient randomized algorithm that provides a (1 -- 1/e -- ε)-approximation with high probability, where ε > 0. Our results improve the current state-of-the-art result [40]. Furthermore, we provide the first theoretical evidence for the validity of a crucial assumption in betweenness centrality estimation, namely that in real-world networks O(|V|2) shortest paths pass through the top-k central nodes, where k is a constant. This also explains why our algorithm runs in near linear time on real-world networks. We also show that our algorithm and analysis can be applied to a wider range of centrality measures, by providing a general analytical framework. On the experimental side, we perform an extensive experimental analysis of our method on real-world networks, demonstrate its accuracy and scalability, and study different properties of central nodes. Then, we compare the sampling method used by the state-of-the-art algorithm with our method. Furthermore, we perform a study of BWC in time evolving networks, and see how the centrality of the central nodes in the graphs changes over time. Finally, we compare the performance of the stochastic Kronecker model [28] to real data, and observe that it generates a similar growth pattern.

【Keywords】: centrality; optimization; sampling; social network

190. User Identity Linkage by Latent User Space Modelling.

Paper Link】 【Pages】:1775-1784

【Authors】: Xin Mu ; Feida Zhu ; Ee-Peng Lim ; Jing Xiao ; Jianzong Wang ; Zhi-Hua Zhou

【Abstract】: User identity linkage across social platforms is an important problem of great research challenge and practical value. In real applications, the task often assumes an extra degree of difficulty by requiring linkage across multiple platforms. While pair-wise user linkage between two platforms, which has been the focus of most existing solutions, provides reasonably convincing linkage, the result depends by nature on the order of platform pairs in execution with no theoretical guarantee on its stability. In this paper, we explore a new concept of ``Latent User Space'' to more naturally model the relationship between the underlying real users and their observed projections onto the varied social platforms, such that the more similar the real users, the closer their profiles in the latent user space. We propose two effective algorithms, a batch model(ULink) and an online model(ULink-On), based on latent user space modelling. Two simple yet effective optimization methods are used for optimizing objective function: the first one based on the constrained concave-convex procedure(CCCP) and the second on accelerated proximal gradient. To our best knowledge, this is the first work to propose a unified framework to address the following two important aspects of the multi-platform user identity linkage problem --- (I) the platform multiplicity and (II) online data generation. We present experimental evaluations on real-world data sets for not only traditional pairwise-platform linkage but also multi-platform linkage. The results demonstrate the superiority of our proposed method over the state-of-the-art ones.

【Keywords】: latent user space; social network; user identity linkage

191. Safe Pattern Pruning: An Efficient Approach for Predictive Pattern Mining.

Paper Link】 【Pages】:1785-1794

【Authors】: Kazuya Nakagawa ; Shinya Suzumura ; Masayuki Karasuyama ; Koji Tsuda ; Ichiro Takeuchi

【Abstract】: In this paper we study predictive pattern mining problems where the goal is to construct a predictive model based on a subset of predictive patterns in the database. Our main contribution is to introduce a novel method called safe pattern pruning (SPP) for a class of predictive pattern mining problems. The SPP method allows us to efficiently find a superset of all the predictive patterns in the database that are needed for the optimal predictive model. The advantage of the SPP method over existing boosting-type method is that the former can find the superset by a single search over the database, while the latter requires multiple searches. The SPP method is inspired by recent development of safe feature screening. In order to extend the idea of safe feature screening into predictive pattern mining, we derive a novel pruning rule called safe pattern pruning (SPP) rule that can be used for searching over the tree defined among patterns in the database. The SPP rule has a property that, if a node corresponding to a pattern in the database is pruned out by the SPP rule, then it is guaranteed that all the patterns corresponding to its descendant nodes are never needed for the optimal predictive model. We apply the SPP method to graph mining and item-set mining problems, and demonstrate its computational advantage.

【Keywords】: convex optimization; graph mining; item-set mining; predictive pattern mining; safe screening; sparse learning

192. Predict Risk of Relapse for Patients with Multiple Stages of Treatment of Depression.

Paper Link】 【Pages】:1795-1804

【Authors】: Zhi Nie ; Pinghua Gong ; Jieping Ye

【Abstract】: Depression is a serious mood disorder afflicting millions of people around the globe. Medications of different types and with different effects on neural activity have been developed for its treatments during the past few decades. Due to the heterogeneity of the disorder, many patients cannot achieve symptomatic remission from a single clinical trial. Instead they need multiple clinical trials to achieve remission, resulting in a multiple stage treatment pattern. Furthermore those who indeed achieve symptom remission are still faced with substantial risk of relapse. One promising approach to predicting the risk of relapse is censored regression. Traditional censored regression typically applies only to situations in which the exact time of event of interest is known. However, follow-up studies that track the patients' relapse status can only provide an interval of time during which relapse occurs. The exact time of relapse is usually unknown. In this paper, we present a censored regression approach with a truncated $l_1$ loss function that can handle the uncertainty of relapse time. Based on this general loss function, we develop a gradient boosting algorithm and a stochastic dual coordinate ascent algorithm when the hypothesis in the loss function is represented as (1) an ensemble of decision trees and (2) a linear combination of covariates, respectively. As an extension of our linear model, a multi-stage linear approach is further proposed to harness the data collected from multiple stages of treatment. We evaluate the proposed algorithms using a real-world clinical trial dataset. Results show that our methods outperform the well-known Cox proportional hazard model. In addition, the risk factors identified by our multi-stage linear model not only corroborate findings from recent research but also yield some new insights into how to develop effective measures for prevention of relapse among patients after their initial remission from the acute treatment stage.

【Keywords】: censored regression; major depressive disorder; relapse; survival analysis

193. Lossless Separation of Web Pages into Layout Code and Data.

Paper Link】 【Pages】:1805-1814

【Authors】: Adi Omari ; Benny Kimelfeld ; Eran Yahav ; Sharon Shoham

【Abstract】: A modern web page is often served by running layout code on data, producing an HTML document that enhances the data with front/back matters and layout/style operations. In this paper, we consider the opposite task: separating a given web page into a data component and a layout program. This separation has various important applications: page encoding may be significantly more compact (reducing web traffic), data representation is normalized across web designs (facilitating wrapping, retrieval and extraction), and repetitions are diminished (expediting site updates and redesign). We present a framework for defining the separation task, and devise an algorithm for synthesizing layout code from a web page while distilling its data in a lossless manner. The main idea is to synthesize layout code hierarchically for parts of the page, and use a combined program-data representation cost to decide whether to align intermediate programs. When intermediate programs are aligned, they are transformed into a single program, possibly with loops and conditionals. At the same time, differences between the aligned programs are captured by the data component such that executing the layout code on the data results in the original page. We have implemented our approach and conducted a thorough experimental study of its effectiveness. Our experiments show that our approach features state of the art (and higher) performance in both size compression and record extraction.

【Keywords】: data extraction; data mining; json; lossless; program synthesis; separation; tree alignment; wrapper induction

194. Unbounded Human Learning: Optimal Scheduling for Spaced Repetition.

Paper Link】 【Pages】:1815-1824

【Authors】: Siddharth Reddy ; Igor Labutov ; Siddhartha Banerjee ; Thorsten Joachims

【Abstract】: In the study of human learning, there is broad evidence that our ability to retain information improves with repeated exposure and decays with delay since last exposure. This plays a crucial role in the design of educational software, leading to a trade-off between teaching new material and reviewing what has already been taught. A common way to balance this trade-off is spaced repetition, which uses periodic review of content to improve long-term retention. Though spaced repetition is widely used in practice, e.g., in electronic flashcard software, there is little formal understanding of the design of these systems. Our paper addresses this gap in three ways. First, we mine log data from spaced repetition software to establish the functional dependence of retention on reinforcement and delay. Second, we use this memory model to develop a stochastic model for spaced repetition systems. We propose a queueing network model of the Leitner system for reviewing flashcards, along with a heuristic approximation that admits a tractable optimization problem for review scheduling. Finally, we empirically evaluate our queueing model through a Mechanical Turk experiment, verifying a key qualitative prediction of our model: the existence of a sharp phase transition in learning outcomes upon increasing the rate of new item introductions.

【Keywords】: human memory; queueing models; spaced repetition

195. Label Noise Reduction in Entity Typing by Heterogeneous Partial-Label Embedding.

Paper Link】 【Pages】:1825-1834

【Authors】: Xiang Ren ; Wenqi He ; Meng Qu ; Clare R. Voss ; Heng Ji ; Jiawei Han

【Abstract】: Current systems of fine-grained entity typing use distant supervision in conjunction with existing knowledge bases to assign categories (type labels) to entity mentions. However, the type labels so obtained from knowledge bases are often noisy (i.e., incorrect for the entity mention's local context). We define a new task, Label Noise Reduction in Entity Typing (LNR), to be the automatic identification of correct type labels (type-paths) for training examples, given the set of candidate type labels obtained by distant supervision with a given type hierarchy. The unknown type labels for individual entity mentions and the semantic similarity between entity types pose unique challenges for solving the LNR task. We propose a general framework, called PLE, to jointly embed entity mentions, text features and entity types into the same low-dimensional space where, in that space, objects whose types are semantically close have similar representations. Then we estimate the type-path for each training example in a top-down manner using the learned embeddings. We formulate a global objective for learning the embeddings from text corpora and knowledge bases, which adopts a novel margin-based loss that is robust to noisy labels and faithfully models type correlation derived from knowledge bases. Our experiments on three public typing datasets demonstrate the effectiveness and robustness of PLE, with an average of 25% improvement in accuracy compared to next best method.

【Keywords】: distant supervision; entity typing; fine-grained entity typing; knowledge base; label noise reduction

196. Reconstructing an Epidemic Over Time.

Paper Link】 【Pages】:1835-1844

【Authors】: Polina Rozenshtein ; Aristides Gionis ; B. Aditya Prakash ; Jilles Vreeken

【Abstract】: We consider the problem of reconstructing an epidemic over time, or, more general, reconstructing the propagation of an activity in a network. Our input consists of a temporal network, which contains information about when two nodes interacted, and a sample of nodes that have been reported as infected. The goal is to recover the flow of the spread, including discovering the starting nodes, and identifying other likely-infected nodes that are not reported. The problem we consider has multiple applications, from public health to social media and viral marketing purposes. Previous work explicitly factor-in many unrealistic assumptions: it is assumed that (a) the underlying network does not change;(b) we have access to perfect noise-free data; or (c) we know the exact propagation model. In contrast, we avoid these simplifications: we take into account the temporal network, we require only a small sample of reported infections, and we do not make any restrictive assumptions about the propagation model. We develop CulT, a scalable and effective algorithm to reconstruct epidemics that is also suited for online settings. CulT works by formulating the problem as that of a temporal Steiner-tree computation, for which we design a fast algorithm leveraging the specific problem structure. We demonstrate the efficacy of the proposed approach through extensive experiments on diverse datasets.

【Keywords】: approximation algorithms; propagation reconstruction; spreading process; temporal network; temporal path; temporal steiner-tree

197. Improving Survey Aggregation with Sparsely Represented Signals.

Paper Link】 【Pages】:1845-1854

【Authors】: Tianlin Shi ; Forest Agostinelli ; Matthew Staib ; David P. Wipf ; Thomas Moscibroda

【Abstract】: In this paper, we develop a new aggregation technique to reduce the cost of surveying. Our method aims to jointly estimate a vector of target quantities such as public opinion or voter intent across time and maintain good estimates when using only a fraction of the data. Inspired by the James-Stein estimator, we resolve this challenge by shrinking the estimates to a global mean which is assumed to have a sparse representation in some known basis. This assumption has lead to two different methods for estimating the global mean: orthogonal matching pursuit and deep learning. Both of which significantly reduce the number of samples needed to achieve good estimates of the true means of the data and, in the case of presidential elections, can estimate the outcome of the 2012 United States elections while saving hundreds of thousands of samples and maintaining accuracy.

【Keywords】: compressive sensing; deep learning; james stein estimator; multi-task learning; presidential elections; survey aggregation

198. Dynamics of Large Multi-View Social Networks: Synergy, Cannibalization and Cross-View Interplay.

Paper Link】 【Pages】:1855-1864

【Authors】: Yu Shi ; Myunghwan Kim ; Shaunak Chatterjee ; Mitul Tiwari ; Souvik Ghosh ; Rómer Rosales

【Abstract】: Most social networking services support multiple types of relationships between users, such as getting connected, sending messages, and consuming feed updates. These users and relationships can be naturally represented as a dynamic multi-view network, which is a set of weighted graphs with shared common nodes but having their own respective edges. Different network views, representing structural relationship and interaction types, could have very distinctive properties individually and these properties may change due to interplay across views. Therefore, it is of interest to study how multiple views interact and affect network dynamics and, in addition, explore possible applications to social networking. In this paper, we propose approaches to capture and analyze multi-view network dynamics from various aspects. Through our proposed descriptors, we observe the synergy and cannibalization between different user groups and network views from LinkedIn dataset. We then develop models that consider the synergy and cannibalization per new relationship, and show the outperforming predictive capability of our models compared to baseline models. Finally, the proposed models allow us to understand the interplay among different views where they dynamically change over time.

【Keywords】: multi-view networks; network dynamics; social networks

199. Data-driven Automatic Treatment Regimen Development and Recommendation.

Paper Link】 【Pages】:1865-1874

【Authors】: Leilei Sun ; Chuanren Liu ; Chonghui Guo ; Hui Xiong ; Yanming Xie

【Abstract】: The analysis of large-scale Electrical Medical Records (EMRs) has the potential to develop and optimize clinical treatment regimens. A treatment regimen usually includes a series of doctor orders containing rich temporal and heterogeneous information. However, in many existing studies, a doctor order is simplified as an event code and a treatment record is simplified as a code sequence. Thus, the information inherent in doctor orders is not fully used for in-depth analysis. In this paper, we aim at exploiting the rich information in doctor orders and developing data-driven approaches for improving clinical treatments. To this end, we first propose a novel method to measure the similarities between treatment records with consideration of sequential and multifaceted information in doctor orders. Then, we propose an efficient density-based clustering algorithm to summarize large-scale treatment records, and extract a semantic representation of each treatment cluster. Finally, we develop a unified framework to evaluate the discovered treatment regimens, and find the most effective treatment regimen for new patients. In the empirical study, we validate our methods with EMRs of 27,678 patients from 14 hospitals. The results show that: 1) Our method can successfully extract typical treatment regimens from large-scale treatment records. The extracted treatment regimens are intuitive and provide managerial implications for treatment regimen design and optimization. 2) By recommending the most effective treatment regimens, the total cure rate in our data improves from 19.89% to 21.28%, and the effective rate increases up to 98.29%.

【Keywords】: electronic medical records; temporal sets.; treatment recommendation; treatment regimen

200. Scalable Partial Least Squares Regression on Grammar-Compressed Data Matrices.

Paper Link】 【Pages】:1875-1884

【Authors】: Yasuo Tabei ; Hiroto Saigo ; Yoshihiro Yamanishi ; Simon J. Puglisi

【Abstract】: With massive high-dimensional data now commonplace in research and industry, there is a strong and growing demand for more scalable computational techniques for data analysis and knowledge discovery. Key to turning these data into knowledge is the ability to learn statistical models with high interpretability. Current methods for learning statistical models either produce models that are not interpretable or have prohibitive computational costs when applied to massive data. In this paper we address this need by presenting a scalable algorithm for partial least squares regression (PLS), which we call compression-based PLS (cPLS), to learn predictive linear models with a high interpretability from massive high-dimensional data. We propose a novel grammar-compressed representation of data matrices that supports fast row and column access while the data matrix is in a compressed form. The original data matrix is grammar-compressed and then the linear model in PLS is learned on the compressed data matrix, which results in a significant reduction in working space, greatly improving scalability. We experimentally test cPLS on its ability to learn linear models for classification, regression and feature extraction with various massive high-dimensional data, and show that cPLS performs superiorly in terms of prediction accuracy, computational efficiency, and interpretability.

【Keywords】: big data; grammar compression; large-scale machine learning; partial least squares regression

201. From Truth Discovery to Trustworthy Opinion Discovery: An Uncertainty-Aware Quantitative Modeling Approach.

Paper Link】 【Pages】:1885-1894

【Authors】: Mengting Wan ; Xiangyu Chen ; Lance M. Kaplan ; Jiawei Han ; Jing Gao ; Bo Zhao

【Abstract】: In this era of information explosion, conflicts are often encountered when information is provided by multiple sources. Traditional truth discovery task aims to identify the truth the most trustworthy information, from conflicting sources in different scenarios. In this kind of tasks, truth is regarded as a fixed value or a set of fixed values. However, in a number of real-world cases, objective truth existence cannot be ensured and we can only identify single or multiple reliable facts from opinions. Different from traditional truth discovery task, we address this uncertainty and introduce the concept of trustworthy opinion of an entity, treat it as a random variable, and use its distribution to describe consistency or controversy, which is particularly difficult for data which can be numerically measured, i.e. quantitative information. In this study, we focus on the quantitative opinion, propose an uncertainty-aware approach called Kernel Density Estimation from Multiple Sources (KDEm) to estimate its probability distribution, and summarize trustworthy information based on this distribution. Experiments indicate that KDEm not only has outstanding performance on the classical numeric truth discovery task, but also shows good performance on multi-modality detection and anomaly detection in the uncertain-opinion setting.

【Keywords】: kernel density estimation; source reliability; truth discovery

202. The Million Domain Challenge: Broadcast Email Prioritization by Cross-domain Recommendation.

Paper Link】 【Pages】:1895-1904

【Authors】: Beidou Wang ; Martin Ester ; Yikang Liao ; Jiajun Bu ; Yu Zhu ; Ziyu Guan ; Deng Cai

【Abstract】: With email overload becoming a billion-level drag on the economy, personalized email prioritization is of urgent need to help predict the importance level of an email. Despite lots of previous effort on the topic, broadcast email, an important type of emails with its unique challenges and intriguing opportunities, has been overlooked. The most salient opportunity lies in that effective collaborative filtering can be exploited due to thousands of receivers of a typical broadcast email. However, every broadcast email is completely cold and it is very costly to obtain users' preference feedback. Fortunately, there exist up to million-level broadcast mailing lists in a real life email system. Similar mailing lists can provide useful extra information for broadcast email prioritization in a target mailing list. How to mine such useful extra information is a challenging problem that has never been touched. In this work, we propose the first broadcast email prioritization framework considering large numbers of mailing lists by formulating this problem as a cross domain recommendation problem. An optimization framework is proposed to select the optimal set of source domains considering multiple criteria including overlap of users, feedback pattern similarity and coverage of users. Our method is thoroughly evaluated on a real world industrial dataset from Samsung Electronics and is proved highly effective and outperforms all the baselines.

【Keywords】: collaborative filtering; cross domain recommendation; email prioritization; source domain selection

203. Transfer Knowledge between Cities.

Paper Link】 【Pages】:1905-1914

【Authors】: Ying Wei ; Yu Zheng ; Qiang Yang

【Abstract】: The rapid urbanization has motivated extensive research on urban computing. It is critical for urban computing tasks to unlock the power of the diversity of data modalities generated by different sources in urban spaces, such as vehicles and humans. However, we are more likely to encounter the label scarcity problem and the data insufficiency problem when solving an urban computing task in a city where services and infrastructures are not ready or just built. In this paper, we propose a FLexible multimOdal tRAnsfer Learning (FLORAL) method to transfer knowledge from a city where there exist sufficient multimodal data and labels, to this kind of cities to fully alleviate the two problems. FLORAL learns semantically related dictionaries for multiple modalities from a source domain, and simultaneously transfers the dictionaries and labelled instances from the source into a target domain. We evaluate the proposed method with a case study of air quality prediction.

【Keywords】: multi-modality; transfer learning; urban computing

204. Probabilistic Robust Route Recovery with Spatio-Temporal Dynamics.

Paper Link】 【Pages】:1915-1924

【Authors】: Hao Wu ; Jiangyun Mao ; Weiwei Sun ; Baihua Zheng ; Hanyuan Zhang ; Ziyang Chen ; Wei Wang

【Abstract】: Vehicle trajectories are one of the most important data in location-based services. The quality of trajectories directly affects the services. However, in the real applications, trajectory data are not always sampled densely. In this paper, we study the problem of recovering the entire route between two distant consecutive locations in a trajectory. Most existing works solve the problem without using those informative historical data or solve it in an empirical way. We claim that a data-driven and probabilistic approach is actually more suitable as long as data sparsity can be well handled. We propose a novel route recovery system in a fully probabilistic way which incorporates both temporal and spatial dynamics and addresses all the data sparsity problem introduced by the probabilistic method. It outperforms the existing works with a high accuracy (over 80%) and shows a strong robustness even when the length of routes to be recovered is very long (about 30 road segments) or the data is very sparse.

【Keywords】: location-basedservices; routerecovery; spatio-temporal; trajectory

205. A Truth Discovery Approach with Theoretical Guarantee.

Paper Link】 【Pages】:1925-1934

【Authors】: Houping Xiao ; Jing Gao ; Zhaoran Wang ; Shiyu Wang ; Lu Su ; Han Liu

【Abstract】: In the information age, people can easily collect information about the same set of entities from multiple sources, among which conflicts are inevitable. This leads to an important task, truth discovery, i.e., to identify true facts (truths) via iteratively updating truths and source reliability. However, the convergence to the truths is never discussed in existing work, and thus there is no theoretical guarantee in the results of these truth discovery approaches. In contrast, in this paper we propose a truth discovery approach with theoretical guarantee. We propose a randomized gaussian mixture model (RGMM) to represent multi-source data, where truths are model parameters. We incorporate source bias which captures its reliability degree into RGMM formulation. The truth discovery task is then modeled as seeking the maximum likelihood estimate (MLE) of the truths. Based on expectation-maximization (EM) techniques, we propose population-based (i.e., on the limit of infinite data) and sample-based (i.e., on a finite set of samples) solutions for the MLE. Theoretically, we prove that both solutions are contractive to an ε-ball around the MLE, under certain conditions. Experimentally, we evaluate our method on both simulated and real-world datasets. Experimental results show that our method achieves high accuracy in identifying truths with convergence guarantee.

【Keywords】: asymptotic consistency; mixture model; truth discovery

206. Towards Confidence in the Truth: A Bootstrapping based Truth Discovery Approach.

Paper Link】 【Pages】:1935-1944

【Authors】: Houping Xiao ; Jing Gao ; Qi Li ; Fenglong Ma ; Lu Su ; Yunlong Feng ; Aidong Zhang

【Abstract】: The demand for automatic extraction of true information (i.e., truths) from conflicting multi-source data has soared recently. A variety of truth discovery methods have witnessed great successes via jointly estimating source reliability and truths. All existing truth discovery methods focus on providing a point estimator for each object's truth, but in many real-world applications, confidence interval estimation of truths is more desirable, since confidence interval contains richer information. To address this challenge, in this paper, we propose a novel truth discovery method (ETCIBoot) to construct confidence interval estimates as well as identify truths, where the bootstrapping techniques are nicely integrated into the truth discovery procedure. Due to the properties of bootstrapping, the estimators obtained by ETCIBoot are more accurate and robust compared with the state-of-the-art truth discovery approaches. Theoretically, we prove the asymptotical consistency of the confidence interval obtained by ETCIBoot. Experimentally, we demonstrate that ETCIBoot is not only effective in constructing confidence intervals but also able to obtain better truth estimates.

【Keywords】: bootstrapping; confidence interval; truth discovery

207. Online Feature Selection: A Limited-Memory Substitution Algorithm and Its Asynchronous Parallel Variation.

Paper Link】 【Pages】:1945-1954

【Authors】: Haichuan Yang ; Ryohei Fujimaki ; Yukitaka Kusumura ; Ji Liu

【Abstract】: This paper considers the feature selection scenario where only a few features are accessible at any time point. For example, features are generated sequentially and visible one by one. Therefore, one has to make an online decision to identify key features after all features are only scanned once or twice. The optimization based approach is a powerful tool for the online feature selection. However, most existing optimization based algorithms explicitly or implicitly adopt L1 norm regularization to identify important features, and suffer two main disadvantages: 1) the penalty term for L1 norm term is hard to choose; and 2) the memory usage is hard to control or predict. To overcome these two drawbacks, this paper proposes a limited-memory and model parameter free online feature selection algorithm, namely online substitution (OS) algorithm. To improve the selection efficiency, an asynchronous parallel extension for OS (Asy-OS) is proposed. Convergence guarantees are provided for both algorithms. Empirical study suggests that the performance of OS and Asy-OS is comparable to the benchmark algorithm Grafting, but requires much less memory cost and can be easily extended to the parallel implementation.

【Keywords】: asynchronous parallel optimization; feature selection; online learning

208. Absolute Fused Lasso and Its Application to Genome-Wide Association Studies.

Paper Link】 【Pages】:1955-1964

【Authors】: Tao Yang ; Jun Liu ; Pinghua Gong ; Ruiwen Zhang ; Xiaotong Shen ; Jieping Ye

【Abstract】: In many real-world applications, the samples/features acquired are in spatial or temporal order. In such cases, the magnitudes of adjacent samples/features are typically close to each other. Meanwhile, in the high-dimensional scenario, identifying the most relevant samples/features is also desired. In this paper, we consider a regularized model which can simultaneously identify important features and group similar features together. The model is based on a penalty called Absolute Fused Lasso (AFL). The AFL penalty encourages sparsity in the coefficients as well as their successive differences of absolute values' i.e., local constancy of the coefficient components in absolute values. Due to the non-convexity of AFL, it is challenging to develop efficient algorithms to solve the optimization problem. To this end, we employ the Difference of Convex functions (DC) programming to optimize the proposed non-convex problem. At each DC iteration, we adopt the proximal algorithm to solve a convex regularized sub-problem. One of the major contributions of this paper is to develop a highly efficient algorithm to compute the proximal operator. Empirical studies on both synthetic and real-world data sets from Genome-Wide Association Studies demonstrate the efficiency and effectiveness of the proposed approach in simultaneous identifying important features and grouping similar features.

【Keywords】: absolute fused lasso; gwas; non-convex optimization; proximal operator

209. Diversified Temporal Subgraph Pattern Mining.

Paper Link】 【Pages】:1965-1974

【Authors】: Yi Yang ; Da Yan ; Huanhuan Wu ; James Cheng ; Shuigeng Zhou ; John C. S. Lui

【Abstract】: Many graphs in real-world applications, such as telecommunications networks, social-interaction graphs and co-authorship graphs, contain temporal information. However, existing graph mining algorithms fail to exploit these temporal information and the resulting subgraph patterns do not contain any temporal attribute. In this paper, we study the problem of mining a set of diversified temporal subgraph patterns from a temporal graph, where each subgraph is associated with the time interval that the pattern spans. This problem motivates important applications such as finding social trends in social networks, or detecting temporal hotspots in telecommunications networks. We propose a divide-and-conquer algorithm along with effective pruning techniques, and our approach runs 2 to 3 orders of magnitude faster than a baseline algorithm and obtains high-quality temporal subgraph patterns in real temporal graphs.

【Keywords】: dense subgraph mining; quasi-clique; temporal graph

210. Distributing the Stochastic Gradient Sampler for Large-Scale LDA.

Paper Link】 【Pages】:1975-1984

【Authors】: Yuan Yang ; Jianfei Chen ; Jun Zhu

【Abstract】: Learning large-scale Latent Dirichlet Allocation (LDA) models is beneficial for many applications that involve large collections of documents.Recent work has been focusing on developing distributed algorithms in the batch setting, while leaving stochastic methods behind, which can effectively explore statistical redundancy in big data and thereby are complementary to distributed computing.The distributed stochastic gradient Langevin dynamics (DSGLD) represents one attempt to combine stochastic sampling and distributed computing, but it suffers from drawbacks such as excessive communications and sensitivity to partitioning of datasets across nodes. DSGLD is typically limited to learn small models that have about 103 topics and $10^3$ vocabulary size. In this paper, we present embarrassingly parallel SGLD (EPSGLD), a novel distributed stochastic gradient sampling method for topic models. Our sampler is built upon a divide-and-conquer architecture which enables us to produce robust and asymptotically exact samples with less communication overhead than DSGLD. We further propose several techniques to reduce the overhead in I/O and memory usage. Experiments on Wikipedia and ClueWeb12 documents demonstrate that, EPSGLD can scale up to large models with 1010 parameters (i.e., 105 topics, 105 vocabulary size), four orders of magnitude larger than DSGLD, and converge faster.

【Keywords】: embarrassingly parallel mcmc; large-scale lda; stochastic gradient sampler

211. FUSE: Full Spectral Clustering.

Paper Link】 【Pages】:1985-1994

【Authors】: Wei Ye ; Sebastian Goebl ; Claudia Plant ; Christian Böhm

【Abstract】: Multi-scale data which contains structures at different scales of size and density is a big challenge for spectral clustering. Even given a suitable locally scaled affinity matrix, the first k eigenvectors of such a matrix still cannot separate clusters well. Thus, in this paper, we exploit the fusion of the cluster-separation information from all eigenvectors to achieve a better clustering result. Our method FUll Spectral ClustEring (FUSE) is based on Power Iteration (PI) and Independent Component Analysis (ICA). PI is used to fuse all eigenvectors to one pseudo-eigenvector which inherits all the cluster-separation information. To conquer the cluster-collision problem, we utilize PI to generate p (p > k) pseudo-eigenvectors. Since these pseudo-eigenvectors are redundant and the cluster-separation information is contaminated with noise, ICA is adopted to rotate the pseudo-eigenvectors to make them pairwise statistically independent. To let ICA overcome local optima and speed up the search process, we develop a self-adaptive and self-learning greedy search method. Finally, we select k rotated pseudo-eigenvectors (independent components) which have more cluster-separation information measured by kurtosis for clustering. Various synthetic and real-world data verifies the effectiveness and efficiency of our FUSE method.

【Keywords】: givens rotation; ica; multi-scale data; power iteration; spectral clustering

212. A Text Clustering Algorithm Using an Online Clustering Scheme for Initialization.

Paper Link】 【Pages】:1995-2004

【Authors】: Jianhua Yin ; Jianyong Wang

【Abstract】: In this paper, we propose a text clustering algorithm using an online clustering scheme for initialization called FGSDMM+. FGSDMM+ assumes that there are at most Kmax clusters in the corpus, and regards these Kmax potential clusters as one large potential cluster at the beginning. During initialization, FGSDMM+ processes the documents one by one in an online clustering scheme. The first document will choose the potential cluster, and FGSDMM+ will create a new cluster to store this document. Later documents will choose one of the non-empty clusters or the potential cluster with probabilities derived from the Dirichlet multinomial mixture model. Each time a document chooses the potential cluster, FGSDMM+ will create a new cluster to store that document and decrease the probability of later documents choosing the potential cluster. After initialization, FGSDMM+ will run a collapsed Gibbs sampling algorithm several times to obtain the final clustering result. Our extensive experimental study shows that FGSDMM+ can achieve better performance than three other clustering methods on both short and long text datasets.

【Keywords】: dirichlet multinomial mixture; gibbs sampling; text clustering

213. Convex Optimization for Linear Query Processing under Approximate Differential Privacy.

Paper Link】 【Pages】:2005-2014

【Authors】: Ganzhao Yuan ; Yin Yang ; Zhenjie Zhang ; Zhifeng Hao

【Abstract】: Differential privacy enables organizations to collect accurate aggregates over sensitive data with strong, rigorous guarantees on individuals' privacy. Previous work has found that under differential privacy, computing multiple correlated aggregates as a batch, using an appropriate strategy, may yield higher accuracy than computing each of them independently. However, finding the best strategy that maximizes result accuracy is non-trivial, as it involves solving a complex constrained optimization program that appears to be non-convex. Hence, in the past much effort has been devoted in solving this non-convex optimization program. Existing approaches include various sophisticated heuristics and expensive numerical solutions. None of them, however, guarantees to find the optimal solution of this optimization problem. This paper points out that under (ε, ཬ)-differential privacy, the optimal solution of the above constrained optimization problem in search of a suitable strategy can be found, rather surprisingly, by solving a simple and elegant convex optimization program. Then, we propose an efficient algorithm based on Newton's method, which we prove to always converge to the optimal solution with linear global convergence rate and quadratic local convergence rate. Empirical evaluations demonstrate the accuracy and efficiency of the proposed solution.

【Keywords】: convergence analysis; convex optimization; data privacy; differential privacy; nonsmooth optimization; semidefinite optimization

214. Beyond Sigmoids: The NetTide Model for Social Network Growth, and Its Applications.

Paper Link】 【Pages】:2015-2024

【Authors】: Chengxi Zang ; Peng Cui ; Christos Faloutsos

【Abstract】: What is the growth pattern of social networks, like Facebook and WeChat? Does it truly exhibit exponential early growth, as predicted by textbook models like the Bass model, SI, or the Branching Process? How about the count of links, over time, for which there are few published models? We examine the growth of several real networks, including one of the world's largest online social network, ``WeChat'', with 300 million nodes and 4.75 billion links by 2013; and we observe power law growth for both nodes and links, a fact that completely breaks the sigmoid models (like SI, and Bass). In its place, we propose NETTIDE, along with differential equations for the growth of the count of nodes, as well as links. Our model accurately fits the growth patterns of real graphs; it is general, encompassing as special cases all the known, traditional models (including Bass, SI, log-logistic growth); while still remaining parsimonious, requiring only a handful of parameters. Moreover, our NETTIDE for link growth is the first one of its kind, accurately fitting real data, and naturally leading to the densification phenomenon. We validate our model with four real, time-evolving social networks, where NETTIDE gives good fitting accuracy, and, more importantly, applied on the WeChat data, our NETTIDE forecasted more than 730 days into the future, with 3% error.

【Keywords】: fizzle logistic; growth model; link growth; power law growth; social networks

215. Online Context-Aware Recommendation with Time Varying Multi-Armed Bandit.

Paper Link】 【Pages】:2025-2034

【Authors】: Chunqiu Zeng ; Qing Wang ; Shekoofeh Mokhtari ; Tao Li

【Abstract】: Contextual multi-armed bandit problems have gained increasing popularity and attention in recent years due to their capability of leveraging contextual information to deliver online personalized recommendation services (e.g., online advertising and news article selection). To predict the reward of each arm given a particular context, existing relevant research studies for contextual multi-armed bandit problems often assume the existence of a fixed yet unknown reward mapping function. However, this assumption rarely holds in practice, since real-world problems often involve underlying processes that are dynamically evolving over time. In this paper, we study the time varying contextual multi-armed problem where the reward mapping function changes over time. In particular, we propose a dynamical context drift model based on particle learning. In the proposed model, the drift on the reward mapping function is explicitly modeled as a set of random walk particles, where good fitted particles are selected to learn the mapping dynamically. Taking advantage of the fully adaptive inference strategy of particle learning, our model is able to effectively capture the context change and learn the latent parameters. In addition, those learnt parameters can be naturally integrated into existing multi-arm selection strategies such as LinUCB and Thompson sampling. Empirical studies on two real-world applications, including online personalized advertising and news recommendation, demonstrate the effectiveness of our proposed approach. The experimental results also show that our algorithm can dynamically track the changing reward over time and consequently improve the click-through rate.

【Keywords】: particle learning; personalization; probability matching; recommender system; time varying contextual bandit

216. Accelerated Stochastic Block Coordinate Descent with Optimal Sampling.

Paper Link】 【Pages】:2035-2044

【Authors】: Aston Zhang ; Quanquan Gu

【Abstract】: We study the composite minimization problem where the objective function is the sum of two convex functions: one is the sum of a finite number of strongly convex and smooth functions, and the other is a general convex function that is non-differentiable. Specifically, we consider the case where the non-differentiable function is block separable and admits a simple proximal mapping for each block. This type of composite optimization is common in many data mining and machine learning problems, and can be solved by block coordinate descent algorithms. We propose an accelerated stochastic block coordinate descent (ASBCD) algorithm, which incorporates the incrementally averaged partial derivative into the stochastic partial derivative and exploits optimal sampling. We prove that ASBCD attains a linear rate of convergence. In contrast to uniform sampling, we reveal that the optimal non-uniform sampling can be employed to achieve a lower iteration complexity. Experimental results on different large-scale real data sets support our theory.

【Keywords】: sampling; stochastic block coordinate descent

217. Collaborative Multi-View Denoising.

Paper Link】 【Pages】:2045-2054

【Authors】: Lei Zhang ; Shupeng Wang ; Xiaoyu Zhang ; Yong Wang ; Binbin Li ; Dinggang Shen ; Shuiwang Ji

【Abstract】: In multi-view learning applications, like multimedia analysis and information retrieval, we often encounter the corrupted view problem in which the data are corrupted by two different types of noises, i.e., the intra- and inter-view noises. The noises may affect these applications that commonly acquire complementary representations from different views. Therefore, how to denoise corrupted views from multi-view data is of great importance for applications that integrate and analyze representations from different views. However, the heterogeneity among multi-view representations brings a significant challenge on denoising corrupted views. To address this challenge, we propose a general framework to jointly denoise corrupted views in this paper. Specifically, aiming at capturing the semantic complementarity and distributional similarity among different views, a novel Heterogeneous Linear Metric Learning (HLML) model with low-rank regularization, leave-one-out validation, and pseudo-metric constraints is proposed. Our method linearly maps multi-view data to a high-dimensional feature-homogeneous space that embeds the complementary information from different views. Furthermore, to remove the intra- and inter-view noises, we present a new Multi-view Semi-supervised Collaborative Denoising (MSCD) method with elementary transformation constraints and gradient energy competition to establish the complementary relationship among the heterogeneous representations. Experimental results demonstrate that our proposed methods are effective and efficient.

【Keywords】: denoising; heterogeneity; metric learning; multi-view

218. Online Asymmetric Active Learning with Imbalanced Data.

Paper Link】 【Pages】:2055-2064

【Authors】: Xiaoxuan Zhang ; Tianbao Yang ; Padmini Srinivasan

【Abstract】: This paper considers online learning with imbalanced streaming data under a query budget, where the act of querying for labels is constrained to a budget limit. We study different active querying strategies for classification. In particular, we propose an asymmetric active querying strategy that assigns different probabilities for query to examples predicted as positive and negative. To corroborate the proposed asymmetric query model, we provide a theoretical analysis on a weighted mistake bound. We conduct extensive evaluations of the proposed asymmetric active querying strategy in comparison with several baseline querying strategies and with previous online learning algorithms for imbalanced data. In particular, we perform two types of evaluations according to which examples appear as positive"/negative''. In push evaluation only the positive predictions given to the user are taken into account; in push and query evaluation the decision to query is also considered for evaluation. The push and query evaluation strategy is particularly suited for a recommendation setting because the items selected for querying for labels may go to the end-user to enable customization and personalization. These would not be shown any differently to the end-user compared to recommended content (i.e., the examples predicated as positive). Additionally, given our interest in imbalanced data we measure F-score instead of accuracy that is traditionally considered by online classification algorithms. We also compare the querying strategies on five classification tasks from different domains, and show that the probabilistic query strategy achieves higher F-scores on both types of evaluation than deterministic strategy, especially when the budget is small, and the asymmetric query model further improves performance. When compared to the state-of-the-art cost-sensitive online learning algorithm under a budget, our online classification algorithm with asymmetric querying achieves a higher F-score on four of the five tasks, especially on the push evaluation.

【Keywords】: f-score; imbalanced data; online learning; query budget

219. FLASH: Fast Bayesian Optimization for Data Analytic Pipelines.

Paper Link】 【Pages】:2065-2074

【Authors】: Yuyu Zhang ; Mohammad Taha Bahadori ; Hang Su ; Jimeng Sun

【Abstract】: Modern data science relies on data analytic pipelines to organize interdependent computational steps. Such analytic pipelines often involve different algorithms across multiple steps, each with its own hyperparameters. To achieve the best performance, it is often critical to select optimal algorithms and to set appropriate hyperparameters, which requires large computational efforts. Bayesian optimization provides a principled way for searching optimal hyperparameters for a single algorithm. However, many challenges remain in solving pipeline optimization problems with high-dimensional and highly conditional search space. In this work, we propose Fast LineAr SearcH (FLASH), an efficient method for tuning analytic pipelines. FLASH is a two-layer Bayesian optimization framework, which firstly uses a parametric model to select promising algorithms, then computes a nonparametric model to fine-tune hyperparameters of the promising algorithms. FLASH also includes an effective caching algorithm which can further accelerate the search process. Extensive experiments on a number of benchmark datasets have demonstrated that FLASH significantly outperforms previous state-of-the-art methods in both search speed and accuracy. Using 50% of the time budget, FLASH achieves up to 20% improvement on test error rate compared to the baselines. FLASH also yields state-of-the-art performance on a real-world application for healthcare predictive modeling.

【Keywords】: automated hyperparameter tuning; bayesian optimization; data analytic pipeline; health analytics

220. Portfolio Selections in P2P Lending: A Multi-Objective Perspective.

Paper Link】 【Pages】:2075-2084

【Authors】: Hongke Zhao ; Qi Liu ; Guifeng Wang ; Yong Ge ; Enhong Chen

【Abstract】: P2P lending is an emerging wealth-management service for individuals, which allows lenders to directly bid and invest on the loans created by borrowers. In these platforms, lenders often pursue multiple objectives (e.g., non-default probability, fully-funded probability and winning-bid probability) when they select loans to invest. How to automatically assess loans from these objectives and help lenders select loan portfolios is a very important but challenging problem. To that end, in this paper, we present a holistic study on portfolio selections in P2P lending. Specifically, we first propose to adapt gradient boosting decision tree, which combines both static features and dynamic features, to assess loans from multiple objectives. Then, we propose two strategies, i.e., weighted objective optimization strategy and multi-objective optimization strategy, to select portfolios for lenders. For each lender, the first strategy attempts to provide one optimal portfolio while the second strategy attempts to provide a Pareto-optimal portfolio set. Further, we design two algorithms, namely DPA and EVA, which can efficiently resolve the optimizations in these two strategies, respectively. Finally, extensive experiments on a large-scale real-world data set demonstrate the effectiveness of our solutions.

【Keywords】: dynamic feature; multi-objective optimization; p2p lending; portfolio selection

221. Hierarchical Incomplete Multi-source Feature Learning for Spatiotemporal Event Forecasting.

Paper Link】 【Pages】:2085-2094

【Authors】: Liang Zhao ; Jieping Ye ; Feng Chen ; Chang-Tien Lu ; Naren Ramakrishnan

【Abstract】: Forecasting significant societal events is an interesting and challenging problem as it taking into consideration multiple aspects of a society, including its economics, politics, and culture. Traditional forecasting methods based on a single data source find it hard to cover all these aspects comprehensively, thus limiting model performance. Multi source event forecasting has proven promising but still suffers from several challenges, including 1) geographical hierarchies in multi-source data features, 2) missing values, and 3) characterization of structured feature sparsity. This paper proposes a novel feature learning model that concurrently addresses all the above challenges. Specifically, given multi-source data from different geographical levels, we design a new forecasting model by characterizing the lower-level features' dependence on higher-level features. To handle the correlations amidst structured feature sets and deal with missing values among the coupled features, we propose a novel feature learning model based on an $N$th-order strong hierarchy and fused-overlapping group Lasso. An efficient algorithm is developed to optimize model parameters and ensure global optima. Extensive experiments on 10 datasets in different domains demonstrate the effectiveness and efficiency of the proposed model.

【Keywords】: event forecasting; feature selection.; multiple data sources

222. Efficient Shift-Invariant Dictionary Learning.

Paper Link】 【Pages】:2095-2104

【Authors】: Guoqing Zheng ; Yiming Yang ; Jaime G. Carbonell

【Abstract】: Shift-invariant dictionary learning (SIDL) refers to the problem of discovering a set of latent basis vectors (the dictionary) that captures informative local patterns at different locations of the input sequences, and a sparse coding for each sequence as a linear combination of the latent basis elements. It differs from conventional dictionary learning and sparse coding where the latent basis has the same dimension as the input vectors, where the focus is on global patterns instead of shift-invariant local patterns. Unsupervised discovery of shift-invariant dictionary and the corresponding sparse coding has been an open challenge as the number of candidate local patterns is extremely large, and the number of possible linear combinations of such local patterns is even more so. In this paper we propose a new framework for unsupervised discovery of both the shift-invariant basis and the sparse coding of input data, with efficient algorithms for tractable optimization. Empirical evaluations on multiple time series data sets demonstrate the effectiveness and efficiency of the proposed method.

【Keywords】: dictionary learning; sparse coding; time series

223. Topic Modeling of Short Texts: A Pseudo-Document View.

Paper Link】 【Pages】:2105-2114

【Authors】: Yuan Zuo ; Junjie Wu ; Hui Zhang ; Hao Lin ; Fei Wang ; Ke Xu ; Hui Xiong

【Abstract】: Recent years have witnessed the unprecedented growth of online social media, which empower short texts as the prevalent format for information of Internet. Given the nature of sparsity, however, short text topic modeling remains a critical yet much-watched challenge in both academy and industry. Rich research efforts have been put on building different types of probabilistic topic models for short texts, among which the self aggregation methods without using auxiliary information become an emerging solution for providing informative cross-text word co-occurrences. However, models along this line are still rarely seen, and the representative one Self-Aggregation Topic Model (SATM) is prone to overfitting and computationally expensive. In light of this, in this paper, we propose a novel probabilistic model called Pseudo-document-based Topic Model (PTM) for short text topic modeling. PTM introduces the concept of pseudo document to implicitly aggregate short texts against data sparsity. By modeling the topic distributions of latent pseudo documents rather than short texts, PTM is expected to gain excellent performance in both accuracy and efficiency. A Sparsity-enhanced PTM (SPTM for short) is also proposed by applying Spike and Slab prior, with the purpose of eliminating undesired correlations between pseudo documents and latent topics. Extensive experiments on various real-world data sets with state-of-the-art baselines demonstrate the high quality of topics learned by PTM and its robustness with reduced training samples. It is also interesting to show that i) SPTM gains a clear edge over PTM when the number of pseudo documents is relatively small, and ii) the constraint that a short text belongs to only one pseudo document is critically important for the success of PTM. We finally take an in-depth semantic analysis to unveil directly the fabulous function of pseudo documents in finding cross-text word co-occurrences for topic modeling.

【Keywords】: latent dirichlet allocation; pseudo document; short texts; topic modeling

Tutorials 13

224. Scalable Data Analytics Using R: Single Machines to Hadoop Spark Clusters.

Paper Link】 【Pages】:2115

【Authors】: John-Mark Agosta ; Debraj GuhaThakurta ; Robert Horton ; Mario Inchiosa ; Srini Kumar ; Mengyue Zhao

【Abstract】: R is one of the most popular languages in the data science, statistical and machine learning (ML) community. However, when it comes to scalable data analysis and ML using R, many data scientists are blocked or hindered by (a) its limitations of available functions to handle large datasets efficiently, and (b) knowledge about the appropriate computing environments to scale R scripts from desktop exploratory analysis to elastic and distributed cloud services. In this tutorial we will discuss solutions that demonstrate the use of distributed compute environments and end to end solutions for R. We will present the topics through presentations and worked-out examples with sample code. In addition, we will provide a public code repository that attendees will be able to access and adapt to their own practice. We believe this tutorial will be of strong interest to a large and growing community of data scientists and developers using R for data analysis and modeling.

【Keywords】: HADOOP; R; SQL; advanced analytics; distributed systems; hierarchical time series; learning curves; machine learning; parallel computing; predictive analytics; scalability; spark; statistical computing; statistical modeling; visualization; yarn

225. Lifelong Machine Learning and Computer Reading the Web.

Paper Link】 【Pages】:2117-2118

【Authors】: Zhiyuan Chen ; Estevam R. Hruschka Jr. ; Bing Liu

【Abstract】: This tutorial introduces Lifelong Machine Learning (LML) and Machine Reading. The core idea of LML is to learn continuously and accumulate the learned knowledge, and to use the knowledge to help future learning, which is perhaps the hallmark of human learning and human intelligence. By us- ing prior knowledge seamlessly and effortlessly, we humans can learn without a lot of training data, but current machine learning algorithms tend to need a huge amount of training data. LML aims to mimic this human capability. Machine Reading is a research area with the goal of building systems to read natural language text. Among different approaches employed in Machine Reading, this tutorial focuses on projects and approaches that use the idea of LML. Most current machine learning (ML) algorithms learn in isolation. They are designed to address a specific problem using a single dataset. That is, given a dataset, an ML algorithm is executed on the dataset to build a model. Although this type of isolated learning is very useful, it does not have the ability to accumulate past knowledge and to make use of the knowledge for future learning, which we believe are critical for the future of machine learning and data mining. LML aims to design and develop computational systems and algorithms with this capability, i.e., to learn as humans do in a lifelong manner. In this tutorial, we introduce this important problem and the existing LML techniques and discuss opportunities and challenges of big data for lifelong machine learning. We also want to motivate researchers and practitioners to actively explore LML as the big data provides us a golden opportunity to learn a large volume of diverse knowledge, to connect different pieces of it, and to use it to raise data mining and machine learning to a new level.

【Keywords】: computer reading; lifelong machine learning; multi-task learning.; never-ending learning; transfer learning

226. IoT Big Data Stream Mining.

Paper Link】 【Pages】:2119-2120

【Authors】: Gianmarco De Francisci Morales ; Albert Bifet ; Latifur Khan ; João Gama ; Wei Fan

【Abstract】: The challenge of deriving insights from the Internet of Things (IoT) has been recognized as one of the most exciting and key opportunities for both academia and industry. Advanced analysis of big data streams from sensors and devices is bound to become a key area of data mining research as the number of applications requiring such processing increases. Dealing with the evolution over time of such data streams, i.e., with concepts that drift or change completely, is one of the core issues in IoT stream mining. This tutorial is a gentle introduction to mining IoT big data streams. The first part introduces data stream learners for classification, regression, clustering, and frequent pattern mining. The second part deals with scalability issues inherent in IoT applications, and discusses how to mine data streams on distributed engines such as Spark, Flink, Storm, and Samza.

【Keywords】: IoT; big data; data science; data streams

227. Mining Reliable Information from Passively and Actively Crowdsourced Data.

Paper Link】 【Pages】:2121-2122

【Authors】: Jing Gao ; Qi Li ; Bo Zhao ; Wei Fan ; Jiawei Han

【Abstract】: Recent years have witnessed an astonishing growth of crowd-contributed data, which has become a powerful information source that covers almost every aspect of our lives. This big treasure trove of information has fundamentally changed the ways in which we learn about our world. Crowdsourcing has attracted considerable attentions with various approaches developed to utilize these enormous crowdsourced data from different perspectives. From the data collection perspective, crowdsourced data can be divided into two types: "passively" crowdsourced data and "actively" crowdsourced data; from task perspective, crowdsourcing research includes information aggregation, budget allocation, worker incentive mechanism, etc. To answer the need of a systematic introduction of the field and comparison of the techniques, we will present an organized picture on crowdsourcing methods in this tutorial. The covered topics will be interested for both advanced researchers and beginners in this field.

【Keywords】: crowdsourcing

228. Streaming Analytics.

Paper Link】 【Pages】:2123

【Authors】: Ashish Gupta ; Neera Agarwal

【Abstract】: Recently we have seen emergence and huge adoption of social media, internet of things for home, industrial internet of things, mobile applications and online transactions. These systems generate streaming data at very large scale. Building technologies and distributed systems that can capture, process and analyze this streaming data in real time is very important for gaining real time insights. Real-time analysis of streaming data can be used for applications as diverse as fraud detection, in-session targeting and recommendations, control systems for transportation systems and smarter cities, earthquake prediction and control of autonomous vehicles. This tutorial will provide overview of streaming systems and hands on tutorial on building streaming analytics systems using open source technologies.

【Keywords】: IoT; analytics; distributed systems; machine learning; stream

229. Algorithmic Bias: From Discrimination Discovery to Fairness-aware Data Mining.

Paper Link】 【Pages】:2125-2126

【Authors】: Sara Hajian ; Francesco Bonchi ; Carlos Castillo

【Abstract】: Algorithms and decision making based on Big Data have become pervasive in all aspects of our daily lives lives (offline and online), as they have become essential tools in personal finance, health care, hiring, housing, education, and policies. It is therefore of societal and ethical importance to ask whether these algorithms can be discriminative on grounds such as gender, ethnicity, or health status. It turns out that the answer is positive: for instance, recent studies in the context of online advertising show that ads for high-income jobs are presented to men much more often than to women [Datta et al., 2015]; and ads for arrest records are significantly more likely to show up on searches for distinctively black names [Sweeney, 2013]. This algorithmic bias exists even when there is no discrimination intention in the developer of the algorithm. Sometimes it may be inherent to the data sources used (software making decisions based on data can reflect, or even amplify, the results of historical discrimination), but even when the sensitive attributes have been suppressed from the input, a well trained machine learning algorithm may still discriminate on the basis of such sensitive attributes because of correlations existing in the data. These considerations call for the development of data mining systems which are discrimination-conscious by-design. This is a novel and challenging research area for the data mining community. The aim of this tutorial is to survey algorithmic bias, presenting its most common variants, with an emphasis on the algorithmic techniques and key ideas developed to derive efficient solutions. The tutorial covers two main complementary approaches: algorithms for discrimination discovery and discrimination prevention by means of fairness-aware data mining. We conclude by summarizing promising paths for future research.

【Keywords】: algorithmic bias; discrimination discovery; discrimination prevention

230. Collective Sensemaking via Social Sensors: Extracting, Profiling, Analyzing, and Predicting Real-world Events.

Paper Link】 【Pages】:2127-2128

【Authors】: Yuheng Hu ; Yu-Ru Lin ; Jiebo Luo

【Abstract】: Social media platforms like Twitter and Facebook have emerged as some of the most important platforms for people to discover, report, share, and communicate with others about various public events, be they of global or local interest (some high profile examples include the U.S Presidential debates, the Boston bombings, the hurricane Sandy, etc). The burst of social media reaction can be seen as a valuable real-time reflection of events as they happen, and can be used for a variety of applications such as computational journalism. Until now, such analysis has been mostly done manually or through primitive tools. Scalable and automated approaches are needed given the massive amounts of both event and reaction information. These approaches must also be able to conduct in-depth analysis of complex interactions between an event and its audience. Supporting such automation and examination however poses several computational challenges. In recent years, research communities have witnessed a growing interest in tackling these challenges. Furthermore, much recent research has begun to focus on solving more complex event analytics tasks such as post-event effect quantification and event progress prediction. This tutorial aims to review and examine current state of the research progress on this emerging topic.

【Keywords】: event analysis; event modeling; event prediction; social media

231. Extracting Optimal Performance from Dynamic Time Warping.

Paper Link】 【Pages】:2129-2130

【Authors】: Abdullah Mueen ; Eamonn J. Keogh

【Abstract】: Dynamic Time Warping (DTW) is a distance measure that compares two time series after optimally aligning them. DTW is being used for decades in thousands of academic and industrial projects despite the very expensive computational complexity, O(n2). These applications include data mining, image processing, signal processing, robotics and computer graphics among many others. In spite of all this research effort, there are many myths and misunderstanding about DTW in the literature, for example "it is too slow to be useful" or "the warping window size does not matter much." In this tutorial, we correct these misunderstandings and we summarize the research efforts in optimizing both the efficiency and effectiveness of both the basic DTW algorithm, and of the higher-level algorithms that exploit DTW such as similarity search, clustering and classification. We will discuss variants of DTW such as constrained DTW, multidimensional DTW and asynchronous DTW, and optimization techniques such as lower bounding, early abandoning, run-length encoding, bounded approximation and hardware optimization. We will discuss a multitude of application areas including physiological monitoring, social media mining, activity recognition and animal sound processing. The optimization techniques are generalizable to other domains on various data types and problems.

【Keywords】: approximation; dynamic time warping; lower bounds; pruning; time series

232. Scalable Learning of Graphical Models.

Paper Link】 【Pages】:2131-2132

【Authors】: François Petitjean ; Geoffrey I. Webb

【Abstract】: From understanding the structure of data, to classification and topic modeling, graphical models are core tools in machine learning and data mining. They combine probability and graph theories to form a compact representation of probability distributions. In the last decade, as data stores became larger and higher-dimensional, traditional algorithms for learning graphical models from data, with their lack of scalability, became less and less usable, thus directly decreasing the potential benefits of this core technology. To scale graphical modeling techniques to the size and dimensionality of most modern data stores, data science researchers and practitioners now have to meld the most recent advances in numerous specialized fields including graph theory, statistics, pattern mining and graphical modeling. This tutorial covers the core building blocks that are necessary to build and use scalable graphical modeling technologies on large and high-dimensional data.

【Keywords】: graphical models; high-dimensional data; structure learning

233. Leveraging Propagation for Data Mining: Models, Algorithms and Applications.

Paper Link】 【Pages】:2133-2134

【Authors】: B. Aditya Prakash ; Naren Ramakrishnan

【Abstract】: Can we infer if a user is sick from her tweet? How do opinions get formed in online forums? Which people should we immunize to prevent an epidemic as fast as possible? How do we quickly zoom out of a graph? Graphs---also known as networks---are powerful tools for modeling processes and situations of interest in real life domains of social systems, cyber-security, epidemiology, and biology. They are ubiquitous, from online social networks, gene-regulatory networks, to router graphs. This tutorial will cover recent and state-of-the-art research on how propagation-like processes can help big-data mining specifically involving large networks and time-series, algorithms behind network problems, and their practical applications in various diverse settings. Topics include diffusion and virus propagation in networks, anomaly and outbreak detection, event prediction and connections with work in public health, the web and online media, social sciences, humanities, and cyber-security.

【Keywords】: cyber security; diffusion; dynamical processes; graph mining; propagation; public health; social media

234. CNTK: Microsoft's Open-Source Deep-Learning Toolkit.

Paper Link】 【Pages】:2135

【Authors】: Frank Seide ; Amit Agarwal

【Abstract】: This tutorial will introduce the Computational Network Toolkit, or CNTK, Microsoft's cutting-edge open-source deep-learning toolkit for Windows and Linux. CNTK is a powerful computation-graph based deep-learning toolkit for training and evaluating deep neural networks. Microsoft product groups use CNTK, for example to create the Cortana speech models and web ranking. CNTK supports feed-forward, convolutional, and recurrent networks for speech, image, and text workloads, also in combination. Popular network types are supported either natively (convolution) or can be described as a CNTK configuration (LSTM, sequence-to-sequence). CNTK scales to multiple GPU servers and is designed around efficiency. The tutorial will give an overview of CNTK's general architecture and describe the specific methods and algorithms used for automatic differentiation, recurrent-loop inference and execution, memory sharing, on-the-fly randomization of large corpora, and multi-server parallelization. We will then show how typical uses looks like for relevant tasks like image recognition, sequence-to-sequence modeling, and speech recognition.

【Keywords】: CNTK; computational networks; deep learning; neural network learning toolkit; neural networks

235. Healthcare Data Mining with Matrix Models.

Paper Link】 【Pages】:2137-2138

【Authors】: Fei Wang ; Ping Zhang ; Joel Dudley

【Abstract】: In the last decade, advances in high-throughput technologies, growth of clinical data warehouses, and rapid accumulation of biomedical knowledge provided unprecedented opportunities and challenges to researchers in biomedical informatics. One distinct solution, to efficiently conduct big data analytics for biomedical problems, is the application of matrix computation and factorization methods such as non-negative matrix factorization, joint matrix factorization, tensor factorization. Compared to probabilistic and information theoretic approaches, matrix-based methods are fast, easy to understand and implement. In this tutorial, we provide a review of recent advances in algorithms and methods using matrix and their potential applications in biomedical informatics. We survey various related articles from data mining venues as well as from biomedical informatics venues to share with the audience key problems and trends in matrix computation research, with different novel applications such as drug repositioning, personalized medicine, and electronic phenotyping.

【Keywords】: healthcare data mining; matrix model

236. Business Applications of Predictive Modeling at Scale.

Paper Link】 【Pages】:2139-2140

【Authors】: Qiang Zhu ; Songtao Guo ; Paul Ogilvie ; Yan Liu

【Abstract】: Predictive modeling is the art of building statistical models that forecast probabilities and trends of future events. It has broad applications in industry across different domains. Some popular examples include user intention predictions, lead scoring, churn analysis, etc. In this tutorial, we will focus on the best practice of predictive modeling in the big data era and its applications in industry, with motivating examples across a range of business tasks and relevance products. We will start with an overview of how predictive modeling helps power and drive various key business use cases. We will introduce the essential concepts and state of the art in building end-to-end predictive modeling solutions, and discuss the challenges, key technologies, and lessons learned from our practice, including case studies of LinkedIn feed relevance and a platform for email response prediction. Moreover, we will discuss some practical solutions of building predictive modeling platform to scale the modeling efforts for data scientists and analysts, along with an overview of popular tools and platforms used across the industry.

【Keywords】: business analytics; machine learning; machine learning platforms; predictive modeling