25. WWW 2017:Perth, Australia

Proceedings of the 26th International Conference on World Wide Web, WWW 2017, Perth, Australia, April 3-7, 2017. ACM 【DBLP Link

Paper Num: 167 || Session Num: 41

Keynote Talks 3

1. Taming the Data Deluge to Unravel the Mysteries of the Universe.

Paper Link】 【Pages】:1

【Authors】: Melanie Johnston-Hollitt

【Abstract】: Modern Astrophysics is one of the most data intensive research fields in the world and is driving many of the required innovations in the "big data" space. Foremost in astronomy in terms of data generation is radio astronomy, and in the last decade an increase in global interest and investment in the field had led to a large number of new or upgraded facilities which are each currently generating petabytes of data per annum. The peak of this so-called 'radio renaissance' will be the Square Kilometre Array (SKA) -- a global observatory designed to uncover the mysteries of the Universe. The SKA will create the highest resolution, fastest frame rate movie of the evolving Universe ever and in doing so will generate 160 terrabytes of data a day, or close to 5 zettabytes of data per annum. Furthermore, due to the extreme faintness of extraterrestrial radio signals, the telescope elements for the SKA must be located in radio quite parts of the world with very low population density. Thus the project aims to build the most data intensive scientific experiment ever, in some of the most remote places on Earth. Generating and serving scientific data products of this scale to a global community of researchers from remote locations is just the first of the "big data" challenges the project faces. Coordination of a global network of tiered data resources will be required along with software tools to exploit the vast sea of results generated. In fact, to fully realize the enormous scientific potential of this project, we will need not only better data distribution and coordination mechanisms, but also improved algorithms, artificial intelligence and ontologies to extract knowledge in an automated way at a scale not yet attempted in science. In this keynote I will present an overview of the SKA project, outline the "big data" challenges the project faces and discuss some of the approaches we are taking to tame the astronomical data deluge we face.

【Keywords】: big data; data compression; design and analysis of algorithms; machine learning; pattern matching; radio astronomy; square kilometre array

2. The Web-Wide World.

Paper Link】 【Pages】:3

【Authors】: Mark Pesce

【Abstract】: The great project of the World Wide Web has succeeded - a large portion of the world's information is now instantly accessible through open protocols and open presentation formats. The Web is as Sir Tim Berners-Lee envisioned it, a vast resource of interconnected knowledge. Yet that resource exists in a universe of its own. Meanwhile the real world has become crowded with connected devices, none more significant than the smartphone - bringing the Web to eighty percent of the planet's adult population by the end of this decade. Smartphones have become fantastically adept at navigating cyberspace, but - with the singular exception of maps - have few real connections to the world immediately at hand. In 2017 we live in two worlds: the Web, and the real. The time has come to knit these two together. To begin that integration, our first step must be a deep moment of contemplation about what the Web and the real world have to offer one another. How can each amplify the value and capacity of the other? Because of the Web, the real world is pregnant with data and knowledge - what does that world look like? How do we use it? How does it change the way we think and behave? In this simple act of design thinking - toward a "Web-wide world" - we can reframe the possibilities of what both the Web and the real world can offer - and what we can offer both. This is the next great project for the Web - finding its place in the world.

【Keywords】: augmented reality; autonomy; development; discovery; distributed ledger; geodata; geofencing; geolocation; gps; hypertext; metadata; mixed reality; mobile web; mrs; navigability; navigation; protocols; usability; virtual reality; vrml

3. Web Mail is not Dead!: It's Just Not Human Anymore.

Paper Link】 【Pages】:5

【Authors】: Yoelle Maarek

【Abstract】: Many have noticed that personal communications have slowly moved from mail to social media and instant messaging platforms, especially with younger generation [6]. Yet Web Mail traffic continues to steadily grow. A paradox? Not really. We have observed at Yahoo Research that the nature of email traffic has significantly changed in the last two decades, and it is now dominated by machine-generated messages. These messages include hotel newsletters, from which users forgot to unsubscribe, repeated, and often annoying, notifications from a social media site, or critical information such as a flight e-ticket, a purchase invoice, or a telephone bill. In this talk, I first share some elements of this journey that led us to this critical finding that 90% of today's Web Mail is sent by automatic scripts [1]. I then discuss the challenges and opportunities this drastic change offers. First the key challenge: namely, the need for Web mail services to revisit their usage assumptions and their traditional features in light of this change. An obvious example is the "reply" button being displayed by default below messages sent from a "no-reply@" sender. Another feature is mail classification, which has finally experienced some changes in the last few years, [4]. I then discuss the opportunities in this era of big data. One first insight is that messages that have been generated by a same script, share some semantic commonality. Being able to automatically cluster such messages, and map such clusters into "templates" brings great value for discovering meaning, for generalizing findings and predicting behaviors [5]. A second insight is that within this commonality, the differences bring even more value, which allows highlighting what makes individuals unique within a crowd. In particular we discuss extraction techniques that automatically identify these unique elements [2]. Yet, they also present a clear risk in terms of privacy and I describe the absolute need for guaranteeing k-anonymity in our mining techniques, [3]. I conclude by encouraging the research community to explore this new domain of Web mail search and data mining.

【Keywords】: machine-generated email; web mail

1B: Ad Auctions 5

4. Deals or No Deals: Contract Design for Online Advertising.

Paper Link】 【Pages】:7-14

【Authors】: Vahab S. Mirrokni ; Hamid Nazerzadeh

【Abstract】: Billions of dollars worth of display advertising are sold via contracts and deals. This paper presents a formal study of preferred deals, a new generation of contracts for selling online advertisement, that generalize the traditional reservation contracts; these contracts are suitable for advertisers with advanced targeting capabilities. We propose a constant-factor approximation algorithm for maximizing the revenue that can be obtained from these deals. We show, both theoretically and via data analysis, that deals, with appropriately chosen minimum-purchase guarantees, can yield significantly higher revenue than auctions. We evaluate our algorithm using data from Google's ad exchange platform. Our algorithm obtains about 90% of the optimal revenue where the second-price auction, even with personalized reserve, obtains at most 52% of the benchmark.

【Keywords】:

5. Budget Management Strategies in Repeated Auctions.

Paper Link】 【Pages】:15-23

【Authors】: Santiago R. Balseiro ; Anthony Kim ; Mohammad Mahdian ; Vahab S. Mirrokni

【Abstract】: In online advertising, advertisers purchase ad placements by participating in a long sequence of repeated auctions. One of the most important features advertising platforms often provide, and advertisers often use, is budget management, which allows advertisers to control their cumulative expenditures. Advertisers typically declare the maximum daily amount they are willing to pay, and the platform adjusts allocations and payments to guarantee that cumulative expenditures do not exceed budgets. There are multiple ways to achieve this goal, and each one, when applied to all budget-constrained advertisers simultaneously, steers the system toward a different equilibrium. While previous research focused on online stochastic optimization techniques or game-theoretic equilibria of such settings, our goal in this paper is to compare the ``system equilibria'' of a range of budget management strategies in terms of the seller's profit and buyers' utility. In particular, we consider six different budget management strategies including probabilistic throttling, thresholding, bid shading, reserve pricing, and multiplicative boosting. We show these methods admit a system equilibrium in a rather general setting, and prove dominance relations between them in a simplified setting. Our study sheds light on the impact of budget management strategies on the tradeoff between the seller's profit and buyers' utility. Finally, we also empirically compare the system equilibria of these strategies using real ad auction data in sponsored search and randomly generated bids. The empirical study confirms our theoretical findings about the relative performances of budget management strategies.

【Keywords】: ad auctions; budget management; online advertising

6. GSP: The Cinderella of Mechanism Design.

Paper Link】 【Pages】:25-32

【Authors】: Christopher A. Wilkens ; Ruggiero Cavallo ; Rad Niazadeh

【Abstract】: Nearly fifteen years ago, Google unveiled the generalized second price (GSP) auction. By all theoretical accounts including their own [Varian 14], this was the wrong auction --- the Vickrey-Clarke-Groves (VCG) auction would have been the proper choice --- yet GSP has succeeded spectacularly. We give a deep justification for GSP's success: advertisers' preferences map to a model we call value maximization; they do not maximize profit as the standard theory would believe. For value maximizers, GSP is the truthful auction [Aggarwal 09]. Moreover, this implies an axiomatization of GSP --- it is an auction whose prices are truthful for value maximizers --- that can be applied much more broadly than the simple model for which GSP was originally designed. In particular, applying it to arbitrary single-parameter domains recovers the folklore definition of GSP. Through the lens of value maximization, GSP metamorphosizes into a powerful auction, sound in its principles and elegant in its simplicity.

【Keywords】: generalized second price auction; incentive compatibility; mechanism design; sponsored search; value maximizers

7. Horizon-Independent Optimal Pricing in Repeated Auctions with Truthful and Strategic Buyers.

Paper Link】 【Pages】:33-42

【Authors】: Alexey Drutsa

【Abstract】: We study revenue optimization learning algorithms for repeated posted-price auctions where a seller interacts with a (truthful or strategic) buyer that holds a fixed valuation. We focus on a practical situation in which the seller does not know in advance the number of played rounds (the time horizon) and has thus to use a horizon-independent pricing. First, we consider straightforward modifications of previously best known algorithms and show that these horizon-independent modifications have worser or even linear regret bounds. Second, we provide a thorough theoretical analysis of some broad families of consistent algorithms and show that there does not exist a no-regret horizon-independent algorithm in those families. Finally, we introduce a novel deterministic pricing algorithm that, on the one hand, is independent of the time horizon T and, on the other hand, has an optimal strategic regret upper bound in O(log log T). This result closes the logarithmic gap between the previously best known upper and lower bounds on strategic regret.

【Keywords】: horizon-independent pricing; posted-price auction; repeated auctions; reserve price; revenue optimization; strategic regret

Paper Link】 【Pages】:43-51

【Authors】: Ruggiero Cavallo ; Prabhakar Krishnamurthy ; Maxim Sviridenko ; Christopher A. Wilkens

【Abstract】: The generalized second price (GSP) auction has served as the core selling mechanism for sponsored search ads for over a decade. However, recent trends expanding the set of allowed ad formats---to include a variety of sizes, decorations, and other distinguishing features---have raised critical problems for GSP-based platforms. Alternatives such as the Vickrey-Clarke-Groves (VCG) auction raise different complications because they fundamentally change the way prices are computed. In this paper we report on our efforts to redesign a search ad selling system from the ground up in this new context, proposing a mechanism that optimizes an entire slate of ads globally and computes prices that achieve properties analogous to those held by GSP in the original, simpler setting of uniform ads. A careful algorithmic coupling of allocation-optimization and pricing-computation allows our auction to operate within the strict timing constraints inherent in real-time ad auctions. We report performance results of the auction in Yahoo's Gemini Search platform.

【Keywords】: ad auctions; gsp; local search; mechanism design; online auctions; pricing; sponsored search auctions

1C: Cloud and Sharing Economy 4

9. Prices and Subsidies in the Sharing Economy.

Paper Link】 【Pages】:53-62

【Authors】: Zhixuan Fang ; Longbo Huang ; Adam Wierman

【Abstract】: The growth of the sharing economy is driven by the emergence of sharing platforms, e.g., Uber and Lyft, that match owners looking to share their resources with customers looking to rent them. The design of such platforms is a complex mixture of economics and engineering, and how to "optimally" design such platforms is still an open problem. In this paper, we focus on the design of prices and subsidies in sharing platforms. Our results provide insights into the tradeoff between revenue maximizing prices and social welfare maximizing prices. Specifically, we introduce a novel model of sharing platforms and characterize the profit and social welfare maximizing prices in this model. Further, we bound the efficiency loss under profit maximizing prices, showing that there is a strong alignment between profit and efficiency in practical settings. Our results highlight that the revenue of platforms may be limited in practice due to supply short- ages; thus platforms have a strong incentive to encourage sharing via subsidies. We provide an analytic characterization of when such subsidies are valuable and show how to optimize the size of the subsidy provided. Finally, we validate the insights from our analysis using data from Didi Chuxing, the largest ridesharing platform in China.

【Keywords】: game theory; sharing economy

10. Segmenting Two-Sided Markets.

Paper Link】 【Pages】:63-72

【Authors】: Siddhartha Banerjee ; Sreenivas Gollapudi ; Kostas Kollias ; Kamesh Munagala

【Abstract】: Recent years have witnessed the rise of many successful e-commerce marketplace platforms like the Amazon marketplace, AirBnB, Uber/Lyft, and Upwork, where a central platform mediates economic transactions between buyers and sellers. A common feature of many of these two-sided marketplaces is that the platform has full control over search and discovery, but prices are determined by the buyers and sellers. Motivated by this, we study the algorithmic aspects of market segmentation via directed discovery in two-sided markets with endogenous prices. We consider a model where an online platform knows each buyer/seller's characteristics, and associated demand/supply elasticities. Moreover, the platform can use discovery mechanisms (search, recommendation, etc.) to control which buyers/sellers are visible to each other. We develop efficient algorithms for throughput (i.e. volume of trade) and welfare maximization with provable guarantees under a variety of assumptions on the demand and supply functions. We also test the validity of our assumptions on demand curves inferred from NYC taxicab log-data, as well as show the performance of our algorithms on synthetic experiments.

【Keywords】: directed discovery; market segmentation; online marketplaces

11. An Experimental Evaluation of Regret-Based Econometrics.

Paper Link】 【Pages】:73-81

【Authors】: Noam Nisan ; Gali Noti

【Abstract】: Using data obtained in a controlled ad-auction experiment that we ran, we evaluate the regret-based approach to econometrics that was recently suggested by Nekipelov, Syrgkanis, and Tardos (EC 2015). We found that despite the weak regret-based assumptions, the results were (at least) as accurate as those obtained using classic equilibrium-based assumptions. En route we studied to what extent humans actually minimize regret in our ad auction, and found a significant difference between the high types'' (players with a high valuation) who indeed rationally minimized regret and thelow types'' who significantly overbid. We suggest that correcting for these biases and adjusting the regret-based econometric method may improve the accuracy of estimated values.

【Keywords】: behavioral economics; cognitive biases; econometrics; regret; sponsored search auctions

12. Usage Patterns and the Economics of the Public Cloud.

Paper Link】 【Pages】:83-91

【Authors】: Cinar Kilcioglu ; Justin M. Rao ; Aadharsh Kannan ; R. Preston McAfee

【Abstract】: We examine the economics of demand and supply in cloud computing. The public cloud offers three main benefits to firms: 1) utilization can be scaled up or down easily; 2) capital expenditure (on-premises servers) can be converted to operating expenses, with the capital incurred by a specialist; 3) software can be ``pay-as-you-go.'' These benefits increase with the firm's ability to dynamically scale resource utilization and thus point to the need for dynamic prices to shape demand to the (short-run) fixed datacenter supply. Detailed utilization analysis reveals the large swings in utilization at the hourly, daily or weekly level are very rare at the customer level and non-existent at the datacenter level. Furthermore, few customers show volatility patterns that are excessively correlated with the market. These results explain why fixed prices currently prevail despite the seeming need for time-varying dynamics. Examining the actual CPU utilization provides a lens into the future. Here utilization varies by order half the datacenter capacity, but most firms are not dynamically scaling their assigned resources at-present to take advantage of these changes. If these gains are realized, demand fluctuations would be on par with the three classic industries where dynamic pricing is important (hotels, electricity, airlines) and dynamic prices would be essential for efficiency.

【Keywords】: cloud computing; economics; infrastructure-as-a-service; platform-as-a-service; scale economies

1D: Computational Health 4

13. Understanding and Discovering Deliberate Self-harm Content in Social Media.

Paper Link】 【Pages】:93-102

【Authors】: Yilin Wang ; Jiliang Tang ; Jundong Li ; Baoxin Li ; Yali Wan ; Clayton Mellina ; Neil O'Hare ; Yi Chang

【Abstract】: Studies suggest that self-harm users found it easier to discuss self-harm-related thoughts and behaviors using social media than in the physical world. Given the enormous and increasing volume of social media data, on-line self-harm content is likely to be buried rapidly by other normal content. To enable voices of self-harm users to be heard, it is important to distinguish self-harm content from other types of content. In this paper, we aim to understand self-harm content and provide automatic approaches to its detection. We first perform a comprehensive analysis on self-harm social media using different input cues. Our analysis, the first of its kind in large scale, reveals a number of important findings. Then we propose frameworks that incorporate the findings to discover self-harm content under both supervised and unsupervised settings. Our experimental results on a large social media dataset from Flickr demonstrate the effectiveness of the proposed frameworks and the importance of our findings in discovering self-harm content.

【Keywords】: mental health; self-harm detection; social media mining; user modeling

14. Mobile Sensing at the Service of Mental Well-being: a Large-scale Longitudinal Study.

Paper Link】 【Pages】:103-112

【Authors】: Sandra Servia Rodríguez ; Kiran K. Rachuri ; Cecilia Mascolo ; Peter J. Rentfrow ; Neal Lathia ; Gillian M. Sandstrom

【Abstract】: Measuring mental well-being with mobile sensing has been an increasingly active research topic. Pervasiveness of smartphones combined with the convenience of mobile app distribution platforms (e.g., Google Play) provide a tremendous opportunity to reach out to millions of users. However, the studies at the confluence of mental health and mobile sensing have been longitudinally limited, controlled, or confined to a small number of participants. In this paper we report on what we believe is the largest longitudinal in-the-wild study of mood through smartphones. We describe an Android app to collect participants' self-reported moods and system triggered experience sampling data while passively measuring their physical activity, sociability, and mobility via their device's sensors. We report the results of a large-scale analysis of the data collected for about three years from 18,000 users. The paper makes three primary contributions. First, we show how we used physical and software sensors in smartphones to automatically and accurately identify routines. Then, we demonstrate the strong correlation between these routines and users' personality, well-being perception, and other psychological variables. Finally, we explore predictability of users' mood using their passive sensing data. Our findings show that, especially for weekends, mobile sensing can be used to predict users' mood with an accuracy of about 70%. These results have the potential to impact the design of future mobile apps for mood/behavior tracking and interventions.

【Keywords】: affective computing; behavioural monitoring; mental wellbeing; mobile sensing

15. Harnessing the Web for Population-Scale Physiological Sensing: A Case Study of Sleep and Performance.

Paper Link】 【Pages】:113-122

【Authors】: Tim Althoff ; Eric Horvitz ; Ryen W. White ; Jamie Zeitzer

【Abstract】: Human cognitive performance is critical to productivity, learning, and accident avoidance. Cognitive performance varies throughout each day and is in part driven by intrinsic, near 24-hour circadian rhythms. Prior research on the impact of sleep and circadian rhythms on cognitive performance has typically been restricted to small-scale laboratory-based studies that do not capture the variability of real-world conditions, such as environmental factors, motivation, and sleep patterns in real-world settings. Given these limitations, leading sleep researchers have called for larger in situ monitoring of sleep and performance. We present the largest study to date on the impact of objectively measured real-world sleep on performance enabled through a reframing of everyday interactions with a web search engine as a series of performance tasks. Our analysis includes 3 million nights of sleep and 75 million interaction tasks. We measure cognitive performance through the speed of keystroke and click interactions on a web search engine and correlate them to wearable device-defined sleep measures over time. We demonstrate that real-world performance varies throughout the day and is influenced by both circadian rhythms, chronotype (morning/evening preference), and prior sleep duration and timing. We develop a statistical model that operationalizes a large body of work on sleep and performance and demonstrates that our estimates of circadian rhythms, homeostatic sleep drive, and sleep inertia align with expectations from laboratory-based sleep studies. Further, we quantify the impact of insufficient sleep on real-world performance and show that two consecutive nights with less than six hours of sleep are associated with decreases in performance which last for a period of six days. This work demonstrates the feasibility of using online interactions for large-scale physiological sensing.

【Keywords】: cognitive performance; performance; physiological sensing; search log; sleep; wearable; web search log

16. Cataloguing Treatments Discussed and Used in Online Autism Communities.

Paper Link】 【Pages】:123-131

【Authors】: Shaodian Zhang ; Tian Kang ; Lin Qiu ; Weinan Zhang ; Yong Yu ; Noémie Elhadad

【Abstract】: A large number of patients discuss treatments in online health communities (OHCs). One research question of interest to health researchers is whether treatments being discussed in OHCs are eventually used by community members in their real lives. In this paper, we rely on machine learning methods to automatically identify attributions of mentions of treatments from an online autism community. The context of our work is online autism communities, where parents exchange support for the care of their children with autism spectrum disorder. Our methods are able to distinguish discussions of treatments that are associated with patients, caregivers, and others, as well as identify whether a treatment is actually taken. We investigate treatments that are not just discussed but also used by patients according to two types of content analysis, cross-sectional and longitudinal. The treatments identified through our content analysis help create a catalogue of real-world treatments. This study results lay the foundation for future research to compare real-world drug usage with established clinical guidelines.

【Keywords】: autism; conditional random fields; natural language processing; online health community; treatment

1G: Ubiquitous and Mobile 1 4

17. Web Application Migration with Closure Reconstruction.

Paper Link】 【Pages】:133-142

【Authors】: Jin-woo Kwon ; Soo-Mook Moon

【Abstract】: Due to its high portability and simplicity, web application (app) based on HTML/JavaScript/CSS has been widely used for various smart-device platforms. To take advantage of its wide platform pool, a new idea called app migration has been proposed for the web platform. Web app migration is a framework to serialize a web app running on a device and restore it in another device to continue its execution. In JavaScript semantics, one of the language features that does not allow easy app migration is a closure. A JavaScript function can access variables defined in its outer function even if the execution of the outer function is terminated. It is allowed because the inner function is created as a closure such that it contains the outer function's environment. This feature is widely used in web app development because it is the most common way to implement data encapsulation in web programming. Closures are not easy to serialize because environments can be shared by a number of closures and environments can be created in a nested way. In this paper, we propose a novel approach to fully serialize closures. We created mechanisms to extract information from a closure's environment through the JavaScript engine and to serialize the information in a proper order so that the original relationship between closures and environments can be restored properly. We implemented our mechanism on the WebKit browser and successfully migrated Octane benchmarks and seven real web apps which heavily exploit closures. We also show that our mechanism works correctly even for some extreme, closure-heavy cases.

【Keywords】: app migration; closure; javascript; web application

18. AppHolmes: Detecting and Characterizing App Collusion among Third-Party Android Markets.

Paper Link】 【Pages】:143-152

【Authors】: Mengwei Xu ; Yun Ma ; Xuanzhe Liu ; Felix Xiaozhu Lin ; Yunxin Liu

【Abstract】: Background activities on smartphones are essential to today's "always-on" mobile device experience. Yet, there lacks a clear understanding of the cooperative behaviors among background activities as well as a quantification of the consequences. In this paper, we present the first in-depth study of app collusion, in which one app surreptitiously launches others in the background without user's awareness. To enable the study, we develop AppHolmes, a static analysis tool for detecting app collusion by examining the app binaries. By analyzing 10,000 apps from top third-party app markets, we found that i) covert, cooperative behaviors in background app launch are surprisingly pervasive, ii) most collusion is caused by shared services, libraries, or common interest among apps, and iii) collusion has serious impact on performance, efficiency, and security. Overall, our work presents a strong implication on future mobile system design.

【Keywords】: community detection; mobile computing; program analysis

19. The Long-Standing Privacy Debate: Mobile Websites vs Mobile Apps.

Paper Link】 【Pages】:153-162

【Authors】: Elias P. Papadopoulos ; Michalis Diamantaris ; Panagiotis Papadopoulos ; Thanasis Petsas ; Sotiris Ioannidis ; Evangelos P. Markatos

【Abstract】: The vast majority of online services nowadays, provide both a mobile friendly website and a mobile application to their users. Both of these choices are usually released for free, with their developers, usually gaining revenue by allowing advertisements from ad networks to be embedded into their content. In order to provide more personalized and thus more effective advertisements, ad networks usually deploy pervasive user tracking, raising this way significant privacy concerns. As a consequence, the users do not have to think only their convenience before deciding which choice to use while accessing a service: web or app, but also which one harms their privacy the least. In this paper, we aim to respond to this question: which of the two options protects the users' privacy in the best way apps or browsers? To tackle this question, we study a broad range of privacy related leaks in a comparison of several popular apps and their web counterpart. These leaks may contain not only personally identifying information (PII) but also device-specific information, able to cross-application and cross-site track the user into the network, and allow third parties to link web with app sessions. Finally, we propose an anti-tracking mechanism that enable the users to access an online service through a mobile app without risking their privacy. Our evaluation shows that our approach is able to preserve the privacy of the user by reducing the leaking identifiers of apps by 27.41% on average, while it imposes a practically negligible latency of less than 1 millisecond per request.

【Keywords】: anti-tracking; device tracking; mobile applications; mobile browsers; mobile privacy; mobile web; privacy leaks; user privacy

20. An Explorative Study of the Mobile App Ecosystem from App Developers' Perspective.

Paper Link】 【Pages】:163-172

【Authors】: Haoyu Wang ; Zhe Liu ; Yao Guo ; Xiangqun Chen ; Miao Zhang ; Guoai Xu ; Jason Hong

【Abstract】: With the prevalence of smartphones, app markets such as Apple App Store and Google Play has become the center stage in the mobile app ecosystem, with millions of apps developed by tens of thousands of app developers in each major market. This paper presents a study of the mobile app ecosystem from the perspective of app developers. Based on over one million Android apps and 320,000 developers from Google Play, we analyzed the Android app ecosystem from different aspects. Our analysis shows that while over half of the developers have released only one app in the market, many of them have released hundreds of apps. We classified developers into different groups based on the number of apps they have released, and compared their characteristics. Specially, we have analyzed the group of aggressive developers who have released more than 50 apps, trying to understand how and why they create so many apps. We also investigated the privacy behaviors of app developers, showing that some developers have a habit of producing apps with low privacy ratings. Our study shows that understanding the behavior of mobile developers can be helpful to not only other app developers, but also to app markets and mobile users.

【Keywords】: android; app clone; app developers; app ecosystem; google play; mobile apps; mobile privacy

1H: Recommeder Systems 1 4

21. Neural Collaborative Filtering.

Paper Link】 【Pages】:173-182

【Authors】: Xiangnan He ; Lizi Liao ; Hanwang Zhang ; Liqiang Nie ; Xia Hu ; Tat-Seng Chua

【Abstract】: In recent years, deep neural networks have yielded immense success on speech recognition, computer vision and natural language processing. However, the exploration of deep neural networks on recommender systems has received relatively less scrutiny. In this work, we strive to develop techniques based on neural networks to tackle the key problem in recommendation --- collaborative filtering --- on the basis of implicit feedback. Although some recent work has employed deep learning for recommendation, they primarily used it to model auxiliary information, such as textual descriptions of items and acoustic features of musics. When it comes to model the key factor in collaborative filtering --- the interaction between user and item features, they still resorted to matrix factorization and applied an inner product on the latent features of users and items. By replacing the inner product with a neural architecture that can learn an arbitrary function from data, we present a general framework named NCF, short for Neural network-based Collaborative Filtering. NCF is generic and can express and generalize matrix factorization under its framework. To supercharge NCF modelling with non-linearities, we propose to leverage a multi-layer perceptron to learn the user-item interaction function. Extensive experiments on two real-world datasets show significant improvements of our proposed NCF framework over the state-of-the-art methods. Empirical evidence shows that using deeper layers of neural networks offers better recommendation performance.

【Keywords】: collaborative filtering; deep learning; implicit feedback; matrix factorization; neural networks

22. Learning to Recommend Accurate and Diverse Items.

Paper Link】 【Pages】:183-192

【Authors】: Peizhe Cheng ; Shuaiqiang Wang ; Jun Ma ; Jiankai Sun ; Hui Xiong

【Abstract】: In this study, we investigate diversified recommendation problem by supervised learning, seeking significant improvement in diversity while maintaining accuracy. In particular, we regard each user as a training instance, and heuristically choose a subset of accurate and diverse items as ground-truth for each user. We then represent each user or item as a vector resulted from the factorization of the user-item rating matrix. In our paper, we try to discover a factorization for matching the following supervised learning task. In doing this, we define two coupled optimization problems, parameterized matrix factorization and structural learning, to formulate our task. And we propose a diversified collaborative filtering algorithm (DCF) to solve the coupled problems. We also introduce a new pairwise accuracy metric and a normalized topic coverage diversity metric to measure the performance of accuracy and diversity respectively. Extensive experiments on benchmark datasets show the performance gains of DCF in comparison with the state-of-the-art algorithms.

【Keywords】: collaborative filtering; diversity; recommender systems; structural svm

23. Collaborative Metric Learning.

Paper Link】 【Pages】:193-201

【Authors】: Cheng-Kang Hsieh ; Longqi Yang ; Yin Cui ; Tsung-Yi Lin ; Serge J. Belongie ; Deborah Estrin

【Abstract】: Metric learning algorithms produce distance metrics that capture the important relationships among data. In this work, we study the connection between metric learning and collaborative filtering. We propose Collaborative Metric Learning (CML) which learns a joint metric space to encode not only users' preferences but also the user-user and item-item similarity. The proposed algorithm outperforms state-of-the-art collaborative filtering algorithms on a wide range of recommendation tasks and uncovers the underlying spectrum of users' fine-grained preferences. CML also achieves significant speedup for Top-K recommendation tasks using off-the-shelf, approximate nearest-neighbor search, with negligible accuracy reduction.

【Keywords】: collaborative filtering; collaborative metric learning; metric learning; recommendation systems

24. Beyond Globally Optimal: Focused Learning for Improved Recommendations.

Paper Link】 【Pages】:203-212

【Authors】: Alex Beutel ; Ed Huai-hsin Chi ; Zhiyuan Cheng ; Hubert Pham ; John Anderson

【Abstract】: When building a recommender system, how can we ensure that all items are modeled well? Classically, recommender systems are built, optimized, and tuned to improve a global prediction objective, such as root mean squared error. However, as we demonstrate, these recommender systems often leave many items badly-modeled and thus under-served. Further, we give both empirical and theoretical evidence that no single matrix factorization, under current state-of-the-art methods, gives optimal results for each item. As a result, we ask: how can we learn additional models to improve the recommendation quality for a specified subset of items? We offer a new technique called focused learning, based on hyperparameter optimization and a customized matrix factorization objective. Applying focused learning on top of weighted matrix factorization, factorization machines, and LLORMA, we demonstrate prediction accuracy improvements on multiple datasets. For instance, on MovieLens we achieve as much as a 17% improvement in prediction accuracy for niche movies, cold-start items, and even the most badly-modeled items in the original model.

【Keywords】: hyperparameter optimization; recommendation; recommender systems; regularization

2B: Algorithms and Theory 1 4

25. AutoCyclone: Automatic Mining of Cyclic Online Activities with Robust Tensor Factorization.

Paper Link】 【Pages】:213-221

【Authors】: Tsubasa Takahashi ; Bryan Hooi ; Christos Faloutsos

【Abstract】: Given a collection of seasonal time-series, how can we find regular (cyclic) patterns and outliers (i.e. rare events)? These two types of patterns are hidden and mixed in the time-varying activities. How can we robustly separate regular patterns and outliers, without requiring any prior information? We present CycloneM, a unifying model to capture both cyclic patterns and outliers, and CycloneFact, a novel algorithm which solves the above problem. We also present an automatic mining framework AutoCyclone, based on CycloneM and CycloneFact. Our method has the following properties; (a) effective: it captures important cyclic features such as trend and seasonality, and distinguishes regular patterns and rare events clearly; (b) robust and accurate: it detects the above features and patterns accurately against outliers; (c) fast: CycloneFact takes linear time in the data size and typically converges in a few iterations; (d) parameter free: our modeling framework frees the user from having to provide parameter values. Extensive experiments on 4 real datasets demonstrate the benefits of the proposed model and algorithm, in that the model can capture latent cyclic patterns, trends and rare events, and the algorithm outperforms the existing state-of-the-art approaches. CycloneFact was up to 5 times more accurate and 20 times faster than top competitors.

【Keywords】: anomaly detection; tensor factorization; time-series

26. Assessing Percolation Threshold Based on High-Order Non-Backtracking Matrices.

Paper Link】 【Pages】:223-232

【Authors】: Yuan Lin ; Wei Chen ; Zhongzhi Zhang

【Abstract】: Percolation threshold of a network is the critical value such that when nodes or edges are randomly selected with probability below the value, the network is fragmented but when the probability is above the value, a giant component connecting a large portion of the network would emerge. Assessing the percolation threshold of networks has wide applications in network reliability, information spread, epidemic control, etc. The theoretical approach so far to assess the percolation threshold is mainly based on spectral radius of adjacency matrix or non-backtracking matrix, which is limited to dense graphs or locally treelike graphs, and is less effective for sparse networks with non-negligible amount of triangles and loops. In this paper, we study high-order non-backtracking matrices and their application to assessing percolation threshold. We first define high-order non-backtracking matrices and study the properties of their spectral radii. Then we focus on the 2nd-order non-backtracking matrix and demonstrate analytically that the reciprocal of its spectral radius gives a tighter lower bound than those of adjacency and standard non-backtracking matrices. We further build a smaller size matrix with the same largest eigenvalue as the 2nd-order non-backtracking matrix to improve computation efficiency. Finally, we use both synthetic networks and 42 real networks to illustrate that the use of the 2nd-order non-backtracking matrix does give better lower bound for assessing percolation threshold than adjacency and standard non-backtracking matrices.

【Keywords】: high-order non-backtracking matrix; information and inuence diffusion; non-backtracking matrix; percolation theory; percolation threshold

27. Large Scale Density-friendly Graph Decomposition via Convex Programming.

Paper Link】 【Pages】:233-242

【Authors】: Maximilien Danisch ; T.-H. Hubert Chan ; Mauro Sozio

【Abstract】: Algorithms for finding dense regions in an input graph have proved to be effective tools in graph mining and data analysis. Recently, Tatti and Gionis [WWW 2015] presented a novel graph decomposition (known as the locally-dense decomposition) that is similar to the well-known k-core decomposition, with the additional property that its components are arranged in order of their densities. Such a decomposition provides a valuable tool in graph mining. Unfortunately, their algorithm for computing the exact decomposition is based on a maximum-flow algorithm which cannot scale to massive graphs, while the approximate decomposition defined by the same authors misses several interesting properties. This calls for scalable algorithms for computing such a decomposition. In our work, we devise an efficient algorithm which is able to compute exact locally-dense decompositions in real-world graphs containing up to billions of edges. Moreover, we provide a new definition of approximate locally-dense decomposition which retains most of the properties of an exact decomposition, for which we devise an algorithm that can scale to real-world graphs containing up to tens of billions of edges. Our algorithm is based on the classic Frank-Wolfe algorithm which is similar to gradient descent and can be efficiently implemented in most of the modern architectures dealing with massive graphs. We provide a rigorous study of our algorithms and their convergence rates. We conduct an extensive experimental evaluation on multi-core architectures showing that our algorithms converge much faster in practice than their worst-case analysis. Our algorithm is even more efficient for the more specialized problem of computing a densest subgraph.

【Keywords】: density-friendly graph decomposition; frank-wolfe algorithm; large graph mining

28. nTD: Noise-Profile Adaptive Tensor Decomposition.

Paper Link】 【Pages】:243-252

【Authors】: Xinsheng Li ; K. Selçuk Candan ; Maria Luisa Sapino

【Abstract】: Tensor decomposition is used for many web and user data analysis operations from clustering, trend detection, anomaly detection, to correlation analysis. However, many of the tensor decomposition schemes are sensitive to noisy data, an inevitable problem in the real world that can lead to false conclusions. The problem is compounded by over-fitting when the user data is sparse. Recent research has shown that it is possible to avoid over-fitting by relying on probabilistic techniques. However, these have two major deficiencies: (a) firstly, they assume that all the data and intermediary results can fit in the main memory, and (b) they treat the entire tensor uniformly, ignoring potential non-uniformities in the noise distribution. In this paper, we propose a Noise-Profile Adaptive Tensor Decomposition (nTD) method, which aims to tackle both of these challenges. In particular, nTD leverages a grid-based two-phase decomposition strategy for two complementary purposes: firstly, the grid partitioning helps ensure that the memory footprint of the decomposition is kept low; secondly (and perhaps more importantly) any a priori knowledge about the noise profiles of the grid partitions enable us to develop a sample assignment strategy (or s-strategy) that best suits the noise distribution of the given tensor. Experiments show that nTD's performance is significantly better than conventional CP decomposition techniques on noisy user data tensors.

【Keywords】: noise profile; tensor decomposition

2C: Systems and Infrastructure 1 4

29. Measuring and Improving the Reliability of Wide-Area Cloud Paths.

Paper Link】 【Pages】:253-262

【Authors】: Osama Haq ; Mamoon Raja ; Fahad R. Dogar

【Abstract】: Many popular cloud applications use inter-data center paths; yet, little is known about the characteristics of these ``cloud paths''. Over an eighteen month period, we measure the inter-continental cloud paths of three providers (Amazon, Google, and Microsoft) using client side (VM-to-VM) measurements. We find that cloud paths are more predictable compared to public Internet paths, with an order of magnitude lower loss rate and jitter at the tail (95th percentile and beyond) compared to public Internet paths. We also investigate the nature of packet losses on these paths (e.g., random vs. bursty) and potential reasons why these paths may be better in quality. Based on our insights, we consider how we can further improve the quality of these paths with the help of existing loss mitigation techniques. We demonstrate that using the cloud path in conjunction with a detour path can mask most of the cloud losses, resulting in up to five 9's of network availability for applications.

【Keywords】: bandwidth; cloud availability; cloud paths reliability; detour routing; inter-data center networks; latency; loss rate

30. Blotter: Low Latency Transactions for Geo-Replicated Storage.

Paper Link】 【Pages】:263-272

【Authors】: Henrique Moniz ; João Leitão ; Ricardo J. Dias ; Johannes Gehrke ; Nuno M. Preguiça ; Rodrigo Rodrigues

【Abstract】: Most geo-replicated storage systems use weak consistency to avoid the performance penalty of coordinating replicas in different data centers. This departure from strong semantics poses problems to application programmers, who need to address the anomalies enabled by weak consistency. In this paper we use a recently proposed isolation level, called Non-Monotonic Snapshot Isolation, to achieve ACID transactions with low latency. To this end, we present Blotter, a geo-replicated system that leverages these semantics in the design of a new concurrency control protocol that leaves a small amount of local state during reads to make commits more efficient, which is combined with a configuration of Paxos that is tailored for good performance in wide area settings. Read operations always run on the local data center, and update transactions complete in a small number of message steps to a subset of the replicas. We implemented Blotter as an extension to Cassandra. Our experimental evaluation shows that Blotter has a small overhead at the data center scale, and performs better across data centers when compared with our implementations of the core Spanner protocol and of Snapshot Isolation on the same codebase.

【Keywords】: concurrency control; distributed database systems; distributed transactions; geo-replicated storage; paxos

Paper Link】 【Pages】:273-281

【Authors】: Charles L. A. Clarke ; Gordon V. Cormack ; Jimmy J. Lin ; Adam Roegiest

【Abstract】: This paper explores a simple question: How would we provide a high-quality search experience on Mars, where the fundamental physical limit is speed-of-light propagation delays on the order of tens of minutes? On Earth, users are accustomed to nearly instantaneous responses from web services. Is it possible to overcome orders-of-magnitude longer latency to provide a tolerable user experience on Mars? In this paper, we formulate the searching from Mars problem as a tradeoff between "effort" (waiting for responses from Earth) and "data transfer" (pre-fetching or caching data on Mars). The contribution of our work is articulating this design space and presenting two case studies that explore the effectiveness of baseline techniques, using publicly available data from the TREC Total Recall and Sessions Tracks. We intend for this research problem to be aspirational as well as inspirational---even if one is not convinced by the premise of Mars colonization, there are Earth-based scenarios such as searching from rural villages in India that share similar constraints, thus making the problem worthy of exploration and attention from researchers.

【Keywords】: mars; search sessions; simulation

32. Legion: Enriching Internet Services with Peer-to-Peer Interactions.

Paper Link】 【Pages】:283-292

【Authors】: Albert van der Linde ; Pedro Fouto ; João Leitão ; Nuno M. Preguiça ; Santiago J. Castiñeira ; Annette Bieniusa

【Abstract】: Many web applications are built around direct interactions among users, from collaborative applications and social networks to multi-user games. Despite being user-centric, these applications are usually supported by services running on servers that mediate all interactions among clients. When users are in close vicinity of each other, relying on a centralized infrastructure for mediating user interactions leads to unnecessarily high latency while hampering fault-tolerance and scalability. In this paper, we propose to extend user-centric Internet services with peer-to-peer interactions. We have designed a framework named Legion that enables client web applications to securely replicate data from servers, and synchronize these replicas directly among them. Legion allows for client-side modules, that we dub adapters, to leverage existing web platforms for storing data and to assist in Legion operation. Using these adapters, legacy applications accessing directly the web platforms can co-exist with new applications that use our framework, while accessing the same shared objects.Our experimental evaluation shows that, besides supporting direct client interactions, even when disconnected from the servers, Legion provides lower latency for update propagation with decreased network traffic for servers.

【Keywords】: crdts; frameworks; peer-to-peer systems; web applications

2D: Computational Health 2 4

Paper Link】 【Pages】:293-301

【Authors】: Luca Soldaini ; Elad Yom-Tov

【Abstract】: Internet data has surfaced as a primary source for investigation of different aspects of human behavior. A crucial step in such studies is finding a suitable cohort (i.e., a set of users) that shares a common trait of interest to researchers. However, direct identification of users sharing this trait is often impossible, as the data available to researchers is usually anonymized to preserve user privacy. To facilitate research on specific topics of interest, especially in medicine, we introduce an algorithm for identifying a trait of interest in anonymous users. We illustrate how a small set of labeled examples, together with statistical information about the entire population, can be aggregated to obtain labels on unseen examples. We validate our approach using labeled data from the political domain. We provide two applications of the proposed algorithm to the medical domain. In the first, we demonstrate how to identify users whose search patterns indicate they might be suffering from certain types of cancer. This shows, for the first time, that search queries can be used as a screening device for diseases that are currently often discovered too late, because no early screening tests exists. In the second, we detail an algorithm to predict the distribution of diseases given their incidence in a subset of the population at study, making it possible to predict disease spread from partial epidemiological data.

【Keywords】: disease screening; health informatics; query log analysis

34. Using Participatory Web-based Surveillance Data to Improve Seasonal Influenza Forecasting in Italy.

Paper Link】 【Pages】:303-310

【Authors】: Daniela Perrotta ; Michele Tizzoni ; Daniela Paolotti

【Abstract】: Traditional surveillance of seasonal influenza is generally affected by reporting lags of at least one week and by continuous revisions of the numbers initially released. As a consequence, influenza forecasts are often limited by the time required to collect new and accurate data. On the other hand, the availability of novel data streams for disease detection can help in overcoming these issues by capturing an additional surveillance signal that can be used to complement data collected by public health agencies. In this study, we investigate how combining both traditional and participatory Web-based surveillance data can provide accurate predictions for seasonal influenza in real-time fashion. To this aim, we use two data sources available in Italy from two different monitoring systems: traditional surveillance data based on sentinel doctors reports and digital surveillance data deriving from a participatory system that monitors the influenza activity through Internet-based surveys. We integrate such digital component in a linear autoregressive exogenous (ARX) model based on traditional surveillance data and evaluate its predictive ability over the course of four influenza seasons in Italy, from 2012-2013 to 2015-2016, for each of the four weekly time horizons. Our results show that by using data extracted from a Web-based participatory surveillance system, which are usually available one week in advance with respect to traditional surveillance, it is possible to obtain accurate weekly predictions of influenza activity at national level up to four weeks in advance. Compared to a model that is only based on data from sentinel doctors, our approach significantly improves real-time forecasts of influenza activity, by increasing the Pearson's correlation up to 30% and by reducing the Mean Absolute Error up to 43% for the four weekly time horizons.

【Keywords】: forecasts; influenza; modeling; participatory surveillance; web-based surveillance

35. Forecasting Seasonal Influenza Fusing Digital Indicators and a Mechanistic Disease Model.

Paper Link】 【Pages】:311-319

【Authors】: Qian Zhang ; Nicola Perra ; Daniela Perrotta ; Michele Tizzoni ; Daniela Paolotti ; Alessandro Vespignani

【Abstract】: The availability of novel digital data streams that can be used as proxy for monitoring infectious disease incidence is ushering in a new era for real-time forecast approaches to disease spreading. Here, we propose the first seasonal influenza forecast framework based on a stochastic, spatially structured mechanistic model (individual level microsimulation) initialized with geo-localized microblogging data. The framework provides for more than 600 census areas in the United States, Italy and Spain, the initial conditions for a stochastic epidemic computational model that generates an ensemble of forecasts for the main indicators of the epidemic season: peak time and intensity. We evaluate the forecasts accuracy and reliability by comparing the results with the data from the official influenza surveillance systems in the US, Italy and Spain in the seasons 2014/15 and 2015/16. In all countries studied, the proposed framework provides reliable results with leads of up to 6 weeks that became more stable and accurate with progression of the season. The results for the United States have been generated in real-time in the context of the Centers for Disease Control and Prevention ``Forecasting the Influenza Season Challenge''. A characteristic feature of the mechanistic modeling approach is in the explicit estimate of key epidemiological parameters relevant for public health decision-making that cannot be achieved with statistical models that do not consider the disease dynamic. Furthermore, the presented framework allows the fusion of multiple data streams in the initialization stage and can be enriched with census, weather and socioeconomic data.

【Keywords】: computational epidemiology; influenza modeling; real-time forecasting; social media

36. PhLeGrA: Graph Analytics in Pharmacology over the Web of Life Sciences Linked Open Data.

Paper Link】 【Pages】:321-329

【Authors】: Maulik R. Kamdar ; Mark A. Musen

【Abstract】: Integrated approaches for pharmacology are required for the mechanism-based predictions of adverse drug reactions that manifest due to concomitant intake of multiple drugs. These approaches require the integration and analysis of biomedical data and knowledge from multiple, heterogeneous sources with varying schemas, entity notations, and formats. To tackle these integrative challenges, the Semantic Web community has published and linked several datasets in the Life Sciences Linked Open Data (LSLOD) cloud using established W3C standards. We present the PhLeGrA platform for Linked Graph Analytics in Pharmacology in this paper. Through query federation, we integrate four sources from the LSLOD cloud and extract a drug-reaction network, composed of distinct entities. We represent this graph as a hidden conditional random field (HCRF), a discriminative latent variable model that is used for structured output predictions. We calculate the underlying probability distributions in the drug-reaction HCRF using the datasets from the U.S. Food and Drug Administration's Adverse Event Reporting System. We predict the occurrence of 146 adverse reactions due to multiple drug intake with an AUROC statistic greater than 0.75. The PhLeGrA platform can be extended to incorporate other sources published using Semantic Web technologies, as well as to discover other types of pharmacological associations.

【Keywords】: data mining; drug-drug interactions; federated querying; graph analysis; semantic web

2G: Ubiquitous and Mobile 2 4

37. Reducing Latency by Eliminating Synchrony.

Paper Link】 【Pages】:331-340

【Authors】: Min Hong Yun ; Songtao He ; Lin Zhong

【Abstract】: Drawing or dragging an object on a mobile device is annoying today because the latency is manifested spatially with an obvious gap between the touch point and the line head or dragged object. This work identifies the multiple synchronization points in the input to display path of modern mobile systems as a major source of latency, contributing about 30 ms to the overall latency. We present Presto, an asynchronous design of the input to display path. By focusing on the main application and relaxing conventional requirements of no frame drop and no tearing effects, Presto is able to eliminate much of the latency due to synchrony. By carefully guarding against consecutive frame drops and limiting the risk of tearing to a small region around the touch point, Presto is able to reduce their visual impact to barely noticeable. Using a prototype based on Android 5, we are able to quantify the effectiveness, overhead and user experience of Presto through both objective measurements and subjective user assessment. We show that Presto is able to reduce the latency of legacy Android applications by close to half; and more importantly, we show this reduction is orthogonal to that by other popular approaches. When combined with touch prediction, Presto is able to reduce the touch latency below 10 ms, a remarkable achievement without any hardware support.

【Keywords】: android; input to display path; synchrony; touchscreen; user-perceived latency

38. Almond: The Architecture of an Open, Crowdsourced, Privacy-Preserving, Programmable Virtual Assistant.

Paper Link】 【Pages】:341-350

【Authors】: Giovanni Campagna ; Rakesh Ramesh ; Silei Xu ; Michael Fischer ; Monica S. Lam

【Abstract】: This paper presents the architecture of Almond, an open, crowdsourced, privacy-preserving and programmable virtual assistant for online services and the Internet of Things (IoT). Included in Almond is Thingpedia, a crowdsourced public knowledge base of natural language interfaces and open APIs. Our proposal addresses four challenges in virtual assistant technology: generality, interoperability, privacy, and usability. Generality is addressed by crowdsourcing Thingpedia, while interoperability is provided by ThingTalk, a high-level domain-specific language that connects multiple devices or services via open APIs. For privacy, user credentials and user data are managed by our open-source ThingSystem, which can be run on personal phones or home servers. Finally, we address usability by providing a natural language interface, whose capability can be extended via training with the help of a menu-driven interface. We have created a fully working prototype, and crowdsourced a set of 187 functions across 45 different kinds of devices. Almond is the first virtual assistant that lets users specify trigger-action tasks in natural language. Despite the lack of real usage data, our experiment suggests that Almond can understand about 40% of the complex tasks when uttered by a user familiar with its capability.

【Keywords】: conversational agents; crowdsourcing; internet of things; natural language programming; virtual assistant

39. DeepSense: A Unified Deep Learning Framework for Time-Series Mobile Sensing Data Processing.

Paper Link】 【Pages】:351-360

【Authors】: Shuochao Yao ; Shaohan Hu ; Yiran Zhao ; Aston Zhang ; Tarek F. Abdelzaher

【Abstract】: Mobile sensing and computing applications usually require time-series inputs from sensors, such as accelerometers, gyroscopes, and magnetometers. Some applications, such as tracking, can use sensed acceleration and rate of rotation to calculate displacement based on physical system models. Other applications, such as activity recognition, extract manually designed features from sensor inputs for classification. Such applications face two challenges. On one hand, on-device sensor measurements are noisy. For many mobile applications, it is hard to find a distribution that exactly describes the noise in practice. Unfortunately, calculating target quantities based on physical system and noise models is only as accurate as the noise assumptions. Similarly, in classification applications, although manually designed features have proven to be effective, it is not always straightforward to find the most robust features to accommodate diverse sensor noise patterns and heterogeneous user behaviors. To this end, we propose DeepSense, a deep learning framework that directly addresses the aforementioned noise and feature customization challenges in a unified manner. DeepSense integrates convolutional and recurrent neural networks to exploit local interactions among similar mobile sensors, merge local interactions of different sensory modalities into global interactions, and extract temporal relationships to model signal dynamics. DeepSense thus provides a general signal estimation and classification framework that accommodates a wide range of applications. We demonstrate the effectiveness of DeepSense using three representative and challenging tasks: car tracking with motion sensors, heterogeneous human activity recognition, and user identification with biometric motion analysis. DeepSense significantly outperforms the state-of-the-art methods for all three tasks. In addition, we show that DeepSense is feasible to implement on smartphones and embedded devices thanks to its moderate energy consumption and low latency.

【Keywords】: activity recognition; deep learning; internet of things; mobile computing; mobile sensing; tracking; user identification

40. Regions, Periods, Activities: Uncovering Urban Dynamics via Cross-Modal Representation Learning.

Paper Link】 【Pages】:361-370

【Authors】: Chao Zhang ; Keyang Zhang ; Quan Yuan ; Haoruo Peng ; Yu Zheng ; Tim Hanratty ; Shaowen Wang ; Jiawei Han

【Abstract】: With the ever-increasing urbanization process, systematically modeling people's activities in the urban space is being recognized as a crucial socioeconomic task. This task was nearly impossible years ago due to the lack of reliable data sources, yet the emergence of geo-tagged social media (GTSM) data sheds new light on it. Recently, there have been fruitful studies on discovering geographical topics from GTSM data. However, their high computational costs and strong distributional assumptions about the latent topics hinder them from fully unleashing the power of GTSM. To bridge the gap, we present CrossMap, a novel cross-modal representation learning method that uncovers urban dynamics with massive GTSM data. CrossMap first employs an accelerated mode seeking procedure to detect spatiotemporal hotspots underlying people's activities. Those detected hotspots not only address spatiotemporal variations, but also largely alleviate the sparsity of the GTSM data. With the detected hotspots, CrossMap then jointly embeds all spatial, temporal, and textual units into the same space using two different strategies: one is reconstruction-based and the other is graph-based. Both strategies capture the correlations among the units by encoding their co-occurrence and neighborhood relationships, and learn low-dimensional representations to preserve such correlations. Our experiments demonstrate that CrossMap not only significantly outperforms state-of-the-art methods for activity recovery and classification, but also achieves much better efficiency.

【Keywords】: activity; geographical topic; representation learning; social media; spatiotemporal data; twitter; urban dynamics

2H: Recommender Systems 2 4

41. Fairness in Package-to-Group Recommendations.

Paper Link】 【Pages】:371-379

【Authors】: Dimitris Serbos ; Shuyao Qi ; Nikos Mamoulis ; Evaggelia Pitoura ; Panayiotis Tsaparas

【Abstract】: Recommending packages of items to groups of users has several applications, including recommending vacation packages to groups of tourists, entertainment packages to groups of friends, or sets of courses to groups of students. In this paper, we focus on a novel aspect of package-to-group recommendations, that of fairness. Specifically, when we recommend a package to a group of people, we ask that this recommendation is fair in the sense that every group member is satisfied by a sufficient number of items in the package. We explore two definitions of fairness and show that for either definition the problem of finding the most fair package is NP-hard. We exploit the fact that our problem can be modeled as a coverage problem, and we propose greedy algorithms that find approximate solutions within reasonable time. In addition, we study two extensions of the problem, where we impose category or spatial constraints on the items to be included in the recommended packages. We evaluate the appropriateness of the fairness models and the performance of the proposed algorithms using real data from Yelp, and a user study.

【Keywords】: envy-freeness; fairness; package-to-group; proportionality; recommendation systems

42. Streaming Recommender Systems.

Paper Link】 【Pages】:381-389

【Authors】: Shiyu Chang ; Yang Zhang ; Jiliang Tang ; Dawei Yin ; Yi Chang ; Mark A. Hasegawa-Johnson ; Thomas S. Huang

【Abstract】: The increasing popularity of real-world recommender systems produces data continuously and rapidly, and it becomes more realistic to study recommender systems under streaming scenarios. Data streams present distinct properties such as temporally ordered, continuous and high-velocity, which poses tremendous challenges to traditional recommender systems. In this paper, we investigate the problem of recommendation with stream inputs. In particular, we provide a principled framework termed sRec, which provides explicit continuous-time random process models of the creation of users and topics, and of the evolution of their interests. A variational Bayesian approach called recursive meanfield approximation is proposed, which permits computationally efficient instantaneous on-line inference. Experimental results on several real-world datasets demonstrate the advantages of our sRec over other state-of-the-arts.

【Keywords】: continuous time; data stream.; online learning; recommender system; streaming

43. What Your Images Reveal: Exploiting Visual Contents for Point-of-Interest Recommendation.

Paper Link】 【Pages】:391-400

【Authors】: Suhang Wang ; Yilin Wang ; Jiliang Tang ; Kai Shu ; Suhas Ranganath ; Huan Liu

【Abstract】: The rapid growth of Location-based Social Networks (LBSNs) provides a vast amount of check-in data, which facilitates the study of point-of-interest (POI) recommendation. The majority of the existing POI recommendation methods focus on four aspects, i.e., temporal patterns, geographical influence, social correlations and textual content indications. For example, user's visits to locations have temporal patterns and users are likely to visit POIs near them. In real-world LBSNs such as Instagram, users can upload photos associating with locations. Photos not only reflect users' interests but also provide informative descriptions about locations. For example, a user who posts many architecture photos is more likely to visit famous landmarks; while a user posts lots of images about food has more incentive to visit restaurants. Thus, images have potentials to improve the performance of POI recommendation. However, little work exists for POI recommendation by exploiting images. In this paper, we study the problem of enhancing POI recommendation with visual contents. In particular, we propose a new framework Visual Content Enhanced POI recommendation (VPOI), which incorporates visual contents for POI recommendations. Experimental results on real-world datasets demonstrate the effectiveness of the proposed framework.

【Keywords】: location-based social networks; poi recommendation; visual contents

44. A General Model for Out-of-town Region Recommendation.

Paper Link】 【Pages】:401-410

【Authors】: Tuan-Anh Nguyen Pham ; Xutao Li ; Gao Cong

【Abstract】: With the rapid growth of location-based social networks (LBSNs), it is now available to analyze and understand user mobility behavior in real world. Studies show that users usually visit nearby points of interest (POIs), located in small regions, especially when they travel out of their hometowns. However, previous out-of-town recommendation systems mainly focus on recommending individual POIs that may reside far from each other, which makes the recommendation results less useful. In this paper, we introduce a novel problem called Region Recommendation, which aims to recommend an out-of-town region of POIs that are likely to be visited by a user. The proximity characteristic of user mobility behavior implies that the probability of visiting one POI depends on those of nearby POIs. Thus, to make accurate region recommendation, our proposed model exploits the influence between POIs, instead of treating them individually. Moreover, to overcome the efficiency problem of searching the best region, we propose a sweeping line-based method, and subsequently an constant-bounded algorithm for better efficiency. Experiments on two real-world datasets demonstrate the improved effectiveness of our models over baseline methods and efficiency of the approximate algorithm.

【Keywords】: location-based social networks; out-of-town recommendation; region recommendation

3B: Algorithms and Theory 2 4

45. Linear Additive Markov Processes.

Paper Link】 【Pages】:411-419

【Authors】: Ravi Kumar ; Maithra Raghu ; Tamás Sarlós ; Andrew Tomkins

【Abstract】: We introduce LAMP: the Linear Additive Markov Process. Transitions in LAMP may be influenced by states visited in the distant history of the process, but unlike higher-order Markov processes, LAMP retains an efficient parameterization. LAMP also allows the specific dependence on history to be learned efficiently from data. We characterize some theoretical properties of LAMP, including its steady-state and mixing time. We then give an algorithm based on alternating minimization to learn LAMP models from data. Finally, we perform a series of real-world experiments to show that LAMP is more powerful than first-order Markov processes, and even holds its own against deep sequential models (LSTMs) with a negligible increase in parameter complexity.

【Keywords】: browsing models; markov chains; probabilistic models; random walks; sequential models

46. Submodular Optimization Over Sliding Windows.

Paper Link】 【Pages】:421-430

【Authors】: Alessandro Epasto ; Silvio Lattanzi ; Sergei Vassilvitskii ; Morteza Zadimoghaddam

【Abstract】: Maximizing submodular functions under cardinality constraints lies at the core of numerous data mining and machine learning applications, including data diversification, data summarization, and coverage problems. In this work, we study this question in the context of data streams, where elements arrive one at a time, and we want to design low-memory and fast update-time algorithms that maintain a good solution. Specifically, we focus on the sliding window model, where we are asked to maintain a solution that considers only the last W items. In this context, we provide the first non-trivial algorithm that maintains a provable approximation of the optimum using space sublinear in the size of the window. In particular we give a 1/3 - ε approximation algorithm that uses space polylogarithmic in the spread of the values of the elements, δ, and linear in the solution size k for any constant ε > 0. At the same time, processing each element only requires a polylogarithmic number of evaluations of the function itself. When a better approximation is desired, we show a different algorithm that, at the cost of using more memory, provides a 1/2 - ε approximation, and allows a tunable trade-off between average update time and space. This algorithm matches the best known approximation guarantees for submodular optimization in insertion-only streams, a less general formulation of the problem. We demonstrate the efficacy of the algorithms on a number of real world datasets, showing that their practical performance far exceeds the theoretical bounds. The algorithms preserve high quality solutions in streams with millions of items, while storing a negligible fraction of them.

【Keywords】: sliding-window streams; streaming algorithms; submodular maximization

47. When Hashes Met Wedges: A Distributed Algorithm for Finding High Similarity Vectors.

Paper Link】 【Pages】:431-440

【Authors】: Aneesh Sharma ; C. Seshadhri ; Ashish Goel

【Abstract】: Finding similar user pairs is a fundamental task in social networks, with numerous applications in ranking and personalization tasks such as link prediction and tie strength detection. A common manifestation of user similarity is based upon network structure: each user is represented by a vector that represents the user's network connections, where pairwise cosine similarity among these vectors defines user similarity. The predominant task for user similarity applications is to discover all similar pairs that have a pairwise cosine similarity value larger than a given threshold τ. In contrast to previous work where τ is assumed to be quite close to 1, we focus on recommendation applications where τ is small, but still meaningful. The all pairs cosine similarity problem is computationally challenging on networks with billions of edges, and especially so for settings with small τ. To the best of our knowledge, there is no practical solution for computing all user pairs with, say τ = 0.2 on large social networks, even using the power of distributed algorithms. Our work directly addresses this challenge by introducing a new algorithm --- WHIMP --- that solves this problem efficiently in the MapReduce model. The key insight in WHIMP is to combine the ``wedge-sampling" approach of Cohen-Lewis for approximate matrix multiplication with the SimHash random projection techniques of Charikar. We provide a theoretical analysis of WHIMP, proving that it has near optimal communication costs while maintaining computation cost comparable with the state of the art. We also empirically demonstrate WHIMP's scalability by computing all highly similar pairs on four massive data sets, and show that it accurately finds high similarity pairs. In particular, we note that WHIMP successfully processes the entire Twitter network, which has tens of billions of edges.

【Keywords】: matrix multiplication; nearest neighbor search; similarity search; wedge sampling

48. A Fast and Provable Method for Estimating Clique Counts Using Turán's Theorem.

Paper Link】 【Pages】:441-449

【Authors】: Shweta Jain ; C. Seshadhri

【Abstract】: Clique counts reveal important properties about the structure of massive graphs, especially social networks. The simple setting of just 3-cliques (triangles) has received much attention from the research community. For larger cliques (even, say 6-cliques) the problem quickly becomes intractable because of combinatorial explosion. Most methods used for triangle counting do not scale for large cliques, and existing algorithms require massive parallelism to be feasible. We present a new randomized algorithm that provably approximates the number of k-cliques, for any constant k. The key insight is the use of (strengthenings of) the classic Turán's theorem: this claims that if the edge density of a graph is sufficiently high, the k-clique density must be non-trivial. We define a combinatorial structure called a Turàn shadow, the construction of which leads to fast algorithms for clique counting. We design a practical heuristic, called TURÀN-SHADOW, based on this theoretical algorithm, and test it on a large class of test graphs. In all cases, TURÀN-SHADOW has less than 2% error, in a fraction of the time used by well-tuned exact algorithms. We do detailed comparisons with a range of other sampling algorithms, and find that TURÀN-SHADOW is generally much faster and more accurate. For example, TURÀN-SHADOW estimates all cliques numbers up to size 10 in social network with over a hundred million edges. This is done in less than three hours on a single commodity machine.

【Keywords】: cliques; graphs; sampling; turán's theorem

3C: Systems and Infrastructure 2 4

49. Exploring HTTP Header Manipulation In-The-Wild.

Paper Link】 【Pages】:451-458

【Authors】: Gareth Tyson ; Shan Huang ; Félix Cuadrado ; Ignacio Castro ; Vasile Claudiu Perta ; Arjuna Sathiaseelan ; Steve Uhlig

【Abstract】: Headers are a critical part of HTTP, and it has been shown that they are increasingly subject to middlebox manipulation. Although this is well known, little is understood about the general regional and network trends that underpin these manipulations. In this paper, we collect data on thousands of networks to understand how they intercept HTTP headers in-the-wild. Our analysis reveals that 25% of measured ASes modify HTTP headers. Beyond this, we witness distinct trends among different regions and AS types; e.g., we observe high numbers of cache headers in poorly connected regions. Finally, we perform an in-depth analysis of the types of manipulations and how they differ across regions.

【Keywords】: hola; http; middleboxes; web

50. Push or Request: An Investigation of HTTP/2 Server Push for Improving Mobile Performance.

Paper Link】 【Pages】:459-468

【Authors】: Sanae Rosen ; Bo Han ; Shuai Hao ; Zhuoqing Morley Mao ; Feng Qian

【Abstract】: In HTTP/1.1, it is necessary for the client to request an object (e.g. an image in a page) in order for the server to send it, even if the server knows in advance what the client will need. Server Push is a feature introduced in HTTP/2 that promises to improve page load times (PLT) by having the server push content to the browser in advance. In this paper, we investigate the benefits and challenges of using Server Push on mobile devices. We first examine whether pushing all content or just the CSS and Javascript files performs better, and find the former leads to much better web performance. Also, we find that sites making use of domain sharding or which otherwise have content divided across many servers do not benefit much from Server Push, a major challenge for Server Push going forward. Network performance characteristics also play a major role. Server Push is especially effective at improving performance at high loss rates (16% median PLT reduction with a 2% loss rate) and high latencies (14% PLT reduction with 100 ms latency), and has little benefit for high-speed Ethernet connections. This motivates its use on mobile devices, although we also find the limited processing power of these devices limits the benefits of Server Push. Server Push also offers modest energy benefits, with energy savings of 9% on LTE for one device. Overall, Server Push is a promising approach for improving web performance in mobile networks, but there are a number of challenges in achieving the full benefits of Server Push.

【Keywords】: http/2; mobile networking performance; server push; wireless networking performance

51. Performance Monitoring and Root Cause Analysis for Cloud-hosted Web Applications.

Paper Link】 【Pages】:469-478

【Authors】: Hiranya Jayathilaka ; Chandra Krintz ; Rich Wolski

【Abstract】: In this paper, we describe Roots - a system for automatically identifying the "root cause" of performance anomalies in web applications deployed in Platform-as-a-Service (PaaS) clouds. Roots does not require application-level instrumentation. Instead, it tracks events within the PaaS cloud that are triggered by application requests using a combination of metadata injection and platform-level instrumentation. We describe the extensible architecture of Roots, a prototype implementation of the system, and a statistical methodology for performance anomaly detection and diagnosis. We evaluate the efficacy of Roots using a set of PaaS-hosted web applications, and detail the performance overhead and scalability of the implementation.

【Keywords】: application performance monitoring; cloud computing; platform-as-a-service; root cause analysis; web services

52. BOAT: Building Auto-Tuners with Structured Bayesian Optimization.

Paper Link】 【Pages】:479-488

【Authors】: Valentin Dalibard ; Michael Schaarschmidt ; Eiko Yoneki

【Abstract】: Due to their complexity, modern systems expose many configuration parameters which users must tune to maximize performance. Auto-tuning has emerged as an alternative in which a black-box optimizer iteratively evaluates configurations to find efficient ones. Unfortunately, for many systems, such as distributed systems, evaluating performance takes too long and the space of configurations is too large for the optimizer to converge within a reasonable time. We present BOAT, a framework which allows developers to build efficient bespoke auto-tuners for their system, in situations where generic auto-tuners fail. At BOAT's core is structured Bayesian optimization (SBO), a novel extension of the Bayesian optimization algorithm. SBO leverages contextual information provided by system developers, in the form of a probabilistic model of the system's behavior, to make informed decisions about which configurations to evaluate. In a case study, we tune the scheduling of a neural network computation on a heterogeneous cluster. Our auto-tuner converges within ten iterations. The optimized configurations outperform those found by generic auto-tuners in thirty iterations by up to 2X.

【Keywords】: auto-tuning; bayesian optimization; distributed stochastic gradient descent; distributed systems; neural networks; probabilistic programming

3D: Computational Health 3 4

53. Investigating the Healthiness of Internet-Sourced Recipes: Implications for Meal Planning and Recommender Systems.

Paper Link】 【Pages】:489-498

【Authors】: Christoph Trattner ; David Elsweiler

【Abstract】: Food recommenders have the potential to positively influence the eating habits of users. To achieve this, however, we need to understand how healthy recommendations are and the factors which influence this. Focusing on two approaches from the literature (single item and daily meal plan recommendation) and utilizing a large Internet sourced dataset from Allrecipes.com, we show how algorithmic solutions relate to the healthiness of the underlying recipe collection. First, we analyze the healthiness of Allrecipes.com recipes using nutritional standards from the World Health Organisation and the United Kingdom Food Standards Agency. Second, we investigate user interaction patterns and how these relate to the healthiness of recipes. Third, we experiment with both recommendation approaches. Our results indicate that overall the recipes in the collection are quite unhealthy, but this varies across categories on the website. Users in general tend to interact most often with the least healthy recipes. Recommender algorithms tend to score popular items highly and thus on average promote unhealthy items. This can be tempered, however, with simple post-filtering approaches, which we show by experiment are better suited to some algorithms than others. Similarly, we show that the generation of meal plans can dramatically increase the number of healthy options open to users. One of the main findings is, nevertheless, that the utility of both approaches is strongly restricted by the recipe collection. Based on our findings we draw conclusions how researchers should attempt to make food recommendation systems promote healthy nutrition.

【Keywords】: meal planning; online recipes; public health; recommender systems

54. Sangoshthi: Empowering Community Health Workers through Peer Learning in Rural India.

Paper Link】 【Pages】:499-508

【Authors】: Deepika Yadav ; Pushpendra Singh ; Kyle Montague ; Vijay Kumar ; Deepak Sood ; Madeline Balaam ; Drishti Sharma ; Mona Duggal ; Tom Bartindale ; Delvin Varghese ; Patrick Olivier

【Abstract】: The Healthcare system of India provides outreach services to the rural population with a key focus on the maternal and child health through its flagship program of Community Health Workers (CHWs). The program since its launch has reached a scale of over 900000 health workers across the country and observed significant benefits on the health indicators. However, traditional face to face training mechanisms face persistent challenge in providing adequate training and capacity building opportunities to CHWs which leads to their sub-optimal knowledge and skill sets. In this paper, we propose Sangoshthi, a low-cost mobile based training and learning platform that fits well into the environment of low-Internet access. Sangoshthi leverages the architecture that combines Internet and IVR technology to host real time training sessions with the CHWs having access to basic phones only. We present our findings of a four week long field deployment with 40 CHWs using both qualitative and quantitative methods. Sangoshthi offers a lively environment of peer learning that was well received by the CHW community and resulted into their knowledge gains (16%) and increased confidence levels to handle the cases. Our study highlights the potential of complementary training platforms that can empower CHWs in-situ without the need of additional infrastructure.

【Keywords】: chw; ict4d; ivr; mhealth; mlearning; peer learning

55. Is Saki #delicious?: The Food Perception Gap on Instagram and Its Relation to Health.

Paper Link】 【Pages】:509-518

【Authors】: Ferda Ofli ; Yusuf Aytar ; Ingmar Weber ; Raggi al Hammouri ; Antonio Torralba

【Abstract】: Food is an integral part of our life and what and how much we eat crucially affects our health. Our food choices largely depend on how we perceive certain characteristics of food, such as whether it is healthy, delicious or if it qualifies as a salad. But these perceptions differ from person to person and one person's "single lettuce leaf" might be another person's "side salad". Studying how food is perceived in relation to what it actually is typically involves a laboratory setup. Here we propose to use recent advances in image recognition to tackle this problem. Concretely, we use data for 1.9 million images from Instagram from the US to look at systematic differences in how a machine would objectively label an image compared to how a human subjectively does. We show that this difference, which we call the "perception gap", relates to a number of health outcomes observed at the county level. To the best of our knowledge, this is the first time that image recognition is being used to study the "misalignment" of how people describe food images vs. what they actually depict.

【Keywords】: computer vision; food; instagram; public health

56. The Spread of Physical Activity Through Social Networks.

Paper Link】 【Pages】:519-528

【Authors】: David Stück ; Haraldur Tómas Hallgrímsson ; Greg Ver Steeg ; Alessandro Epasto ; Luca Foschini

【Abstract】: Many behaviors that lead to worsened health outcomes are modifiable, social, and visible. Social influence has thus the potential to foster adoption of habits that promote health and improve disease management. In this study, we consider the evolution of the physical activity of 44.5 thousand Fitbit users as they interact on the Fitbit social network, in relation to their health status. The users collectively recorded 9.3 million days of steps over the period of a year through a Fitbit device. 7,515 of the users also self-reported whether they were diagnosed with a major chronic condition. A time-aggregated analysis shows that ego net size, average alter physical activity, gender, and body mass index (BMI) are significantly predictive of ego physical activity. For users who self-reported chronic conditions, the direction and effect size of associations varied depending on the condition, with diabetic users specifically showing almost a 6-fold increase in additional daily steps for each additional social tie. Subsequently, we consider the co-evolution of activity and friendship longitudinally on a month by month basis. We show that the fluctuations in average alter activity significantly predict fluctuations in ego activity. By leveraging a class of novel non-parametric statistical tests we investigate the causal factors in these fluctuations. We find that under certain stationarity assumptions, non-null causal dependence exists between ego and alter's activity, even in the presence of unobserved stationary individual traits. We believe that our findings provide evidence that the study of online social networks have the potential to improve our understanding of factors affecting adoption of positive habits, especially in the context of chronic condition management.

【Keywords】: digital health; dynamic networks; network science; wearables

57. Drawing Sound Conclusions from Noisy Judgments.

Paper Link】 【Pages】:529-537

【Authors】: David Goldberg ; Andrew Trotman ; Xiao Wang ; Wei Min ; Zongru Wan

【Abstract】: The quality of a search engine is typically evaluated using hand-labeled data sets, where the labels indicate the relevance of documents to queries. Often the number of labels needed is too large to be created by the best annotators, and so less accurate labels (e.g. from crowdsourcing) must be used. This introduces errors in the labels, and thus errors in standard precision metrics (such as P@k and DCG); the lower the quality of the judge, the more errorful the labels, consequently the more inaccurate the metric. We introduce equations and algorithms that can adjust the metrics to the values they would have had if there were no annotation errors. This is especially important when two search engines are compared by comparing their metrics. We give examples where one engine appeared to be statistically significantly better than the other, but the effect disappeared after the metrics were corrected for annotation error. In other words the evidence supporting a statistical difference was illusory, and caused by a failure to account for annotation error.

【Keywords】: precision; standard error; statistical significance

Paper Link】 【Pages】:539-548

【Authors】: Liangda Li ; Hongbo Deng ; Anlei Dong ; Yi Chang ; Ricardo A. Baeza-Yates ; Hongyuan Zha

【Abstract】: Contextual data plays an important role in modeling search engine users' behaviors on both query auto-completion (QAC) log and normal query (click) log. User's recent search history on each log has been widely studied individually as the context to benefit the modeling of users' behaviors on that log. However, there is no existing work that explores or incorporates both logs together for contextual data. As QAC and click logs actually record users' sequential behaviors while interacting with a search engine, the available context of a user's current behavior based on the same type of log can be strengthened from the user's recent search history shown on the other type of log. Our paper proposes to model users' behaviors on both QAC and click logs simultaneously by utilizing both logs as the contextual data of each other. The key idea is to capture the correlation between users' behavior patterns on both logs. We model such correlation through a novel probabilistic model based on the Latent Dirichlet allocation (LDA) model. The learned users' behavior patterns on both logs are utilized to address not only the application of query auto-completion on QAC logs, but also the click prediction and relevance ranking of web documents on click logs. Experiments on real-world logs demonstrate the effectiveness of the proposed model on both applications.

【Keywords】: context; latent dirichlet allocation; query auto-completion

Paper Link】 【Pages】:549-557

【Authors】: Adam Fourney ; Meredith Ringel Morris ; Ryen W. White

【Abstract】: Many people rely on web search engines to check the spelling or grammatical correctness of input phrases. For example, one might search [recurring or reoccurring] to decide between these similar words. While language-related queries are common, they have low click-through rates, lack a strong intent signal, and are generally challenging to study. Perhaps for these reasons, they have yet to be characterized in the literature. In this paper we report the results of two surveys that investigate how, when, and why people use web search to support low-level, language-related tasks. The first survey was distributed by email, and asked participants to reflect on a recent search task. The second survey was embedded directly in search result pages, and captured information about searchers' intents in-situ. Our analysis confirms that language-related search tasks are indeed common, accounting for at least 2.7% of all queries posed by our respondents. Survey responses also reveal: (1) the range of language-related tasks people perform with search, (2) the contexts in which these tasks arise, and (3), the reasons why people elect to use web search rather than relying on traditional proofing tools (e.g., spelling and grammar checkers).

【Keywords】: grammar; language-related queries; spelling; web search

60. Query Expansion Based on a Feedback Concept Model for Microblog Retrieval.

Paper Link】 【Pages】:559-568

【Authors】: Yashen Wang ; Heyan Huang ; Chong Feng

【Abstract】: We tackle the problem of improving microblog retrieval algorithms by proposing a Feedback Concept Model for query expansion. In particular, we expand the query using knowledge information derived from Probase so that the expanded one could better reflect users' search intent, which allows for microblog retrieval at a concept-level, rather than term-level. In the proposed feedback concept model: (i) we mine the concept information implicit in short-texts based on the external knowledge bases; (ii) with the relevant concepts associated with short-texts, a mixture model is generated to estimate a concept language model; (iii) finally, we utilize the concept language model for query expansion. Moreover, we incorporate temporal prior into the proposed query expansion method to satisfy real-time information need. Finally, we test the generalization power of the feedback concept model on the TREC Microblog corpora. The experimental results demonstrate that the proposed model outperforms the previous methods for microblog retrieval significantly.

【Keywords】: microblog retrieval; pseudo-relevance feedback; query expansion; short-text conceptualization

3H: Information Cascades 4

61. Why Do Cascade Sizes Follow a Power-Law?

Paper Link】 【Pages】:569-576

【Authors】: Karol Wegrzycki ; Piotr Sankowski ; Andrzej Pacuk ; Piotr Wygocki

【Abstract】: We introduce random directed acyclic graph and use it to model the information diffusion network. Subsequently, we analyze the cascade generation model (CGM) introduced by Leskovec et al. [19]. Until now only empirical studies of this model were done. In this paper, we present the first theoretical proof that the sizes of cascades generated by the CGM follow the power-law distribution, which is consistent with multiple empirical analysis of the large social networks. We compared the assumptions of our model with the Twitter social network and tested the goodness of approximation.

【Keywords】: information diffusion; modelling and validation; social networks; twitter

62. DeepCas: An End-to-end Predictor of Information Cascades.

Paper Link】 【Pages】:577-586

【Authors】: Cheng Li ; Jiaqi Ma ; Xiaoxiao Guo ; Qiaozhu Mei

【Abstract】: Information cascades, effectively facilitated by most social network platforms, are recognized as a major factor in almost every social success and disaster in these networks. Can cascades be predicted? While many believe that they are inherently unpredictable, recent work has shown that some key properties of information cascades, such as size, growth, and shape, can be predicted by a machine learning algorithm that combines many features. These predictors all depend on a bag of hand-crafting features to represent the cascade network and the global network structures. Such features, always carefully and sometimes mysteriously designed, are not easy to extend or to generalize to a different platform or domain. Inspired by the recent successes of deep learning in multiple data mining tasks, we investigate whether an end-to-end deep learning approach could effectively predict the future size of cascades. Such a method automatically learns the representation of individual cascade graphs in the context of the global network structure, without hand-crafted features or heuristics. We find that node embeddings fall short of predictive power, and it is critical to learn the representation of a cascade graph as a whole. We present algorithms that learn the representation of cascade graphs in an end-to-end manner, which significantly improve the performance of cascade prediction over strong baselines including feature based methods, node embedding methods, and graph kernel methods. Our results also provide interesting implications for cascade prediction in general.

【Keywords】: cascade prediction; deep learning; graph representation

63. Cascades: A View from Audience.

Paper Link】 【Pages】:587-596

【Authors】: Rahmtin Rotabi ; Krishna Kamath ; Jon M. Kleinberg ; Aneesh Sharma

【Abstract】: Cascades on social and information networks have been a tremendously popular subject of study in the past decade, and there is a considerable literature on phenomena such as diffusion mechanisms, virality, cascade prediction, and peer network effects. Against the backdrop of this research, a basic question has received comparatively little attention: how desirable are cascades on a social media platform from the point of view of users' While versions of this question have been considered from the perspective of the producers of cascades, any answer to this question must also take into account the effect of cascades on their audience --- the viewers of the cascade who do not directly participate in generating the content that launched it. In this work, we seek to fill this gap by providing a consumer perspective of information cascades. Users on social and information networks play the dual role of producers and consumers, and our work focuses on how users perceive cascades as consumers. Starting from this perspective, we perform an empirical study of the interaction of Twitter users with retweet cascades. We measure how often users observe retweets in their home timeline, and observe a phenomenon that we term the Impressions Paradox: the share of impressions for cascades of size k decays much more slowly than frequency of cascades of size k. Thus, the audience for cascades can be quite large even for rare large cascades. We also measure audience engagement with retweet cascades in comparison to non-retweeted or organic content. Our results show that cascades often rival or exceed organic content in engagement received per impression. This result is perhaps surprising in that consumers didn't opt in to see tweets from these authors. Furthermore, although cascading content is widely popular, one would expect it to eventually reach parts of the audience that may not be interested in the content. Motivated by the tension in these empirical findings, we posit a simple theoretical model that focuses on the effect of cascades on the audience (rather than the cascade producers). Our results on this model highlight the balance between retweeting as a high-quality content selection mechanism and the role of network users in filtering irrelevant content. In particular, the results suggest that together these two effects enable the audience to consume a high quality stream of content in the presence of cascades.

【Keywords】: cascade models; consumer; twitter

64. Detecting Large Reshare Cascades in Social Networks.

Paper Link】 【Pages】:597-605

【Authors】: Karthik Subbian ; B. Aditya Prakash ; Lada A. Adamic

【Abstract】: Detecting large reshare cascades is an important problem in online social networks. There are a variety of attempts to model this problem, from using time series analysis methods to stochastic processes. Most of these approaches heavily depend on the underlying network features and use network information to detect the virality of cascades. In most cases, however, getting such detailed network information can be hard or even impossible. In contrast, in this paper, we propose SANSNET, a network-agnostic approach instead. Our method can be used to answer two important questions: (1) Will a cascade go viral? and (2) How early can we predict it? We use techniques from survival analysis to build a supervised classifier in the space of survival probabilities and show that the optimal decision boundary is a survival function. A notable feature of our approach is that it does not use any network-based features for the prediction tasks, making it very cheap to implement. Finally, we evaluate our approach on several real-life data sets, including popular social networks like Facebook and Twitter, on metrics like recall, F-measure and breakout coverage. We find that network agnostic SANSNET classifier outperforms several non-trivial competitors and baselines which utilize network information.

【Keywords】: detecting cascades; reshare cascades; social networks; survival model analysis

4B: Crowdsourcing 1 4

65. Identifying Value in Crowdsourced Wireless Signal Measurements.

Paper Link】 【Pages】:607-616

【Authors】: Zhijing Li ; Ana Nika ; Xinyi Zhang ; Yanzi Zhu ; Yuanshun Yao ; Ben Y. Zhao ; Haitao Zheng

【Abstract】: While crowdsourcing is an attractive approach to collect large-scale wireless measurements, understanding the quality and variance of the resulting data is difficult. Our work analyzes the quality of crowdsourced cellular signal measurements in the context of basestation localization, using large international public datasets (419M signal measurements and 1M cells) and corresponding ground truth values. Performing localization using raw received signal strength (RSS) data produces poor results and very high variance. Applying supervised learning improves results moderately, but variance remains high. Instead, we propose feature clustering, a novel application of unsupervised learning to detect hidden correlation between measurement instances, their features, and localization accuracy. Our results identify RSS standard deviation and RSS-weighted dispersion mean as key features that correlate with highly predictive measurement samples for both sparse and dense measurements respectively. Finally, we show how optimizing crowdsourcing measurements for these two features dramatically improves localization accuracy and reduces variance.

【Keywords】: network measurement; networks

66. Collaborative Optimization for Collective Decision-making in Continuous Spaces.

Paper Link】 【Pages】:617-626

【Authors】: Nikhil Garg ; Vijay Kamble ; Ashish Goel ; David Marn ; Kamesh Munagala

【Abstract】: Many societal decision problems lie in high-dimensional continuous spaces not amenable to the voting techniques common for their discrete or single-dimensional counterparts. These problems are typically discretized before running an election or decided upon through negotiation by representatives. We propose a meta-algorithm called Iterative Local Voting for collective decision-making in this setting, in which voters are sequentially sampled and asked to modify a candidate solution within some local neighborhood of its current value, as defined by a ball in some chosen norm. In general, such schemes do not converge, or, when they do, the resulting solution does not have a natural description. We first prove the convergence of this algorithm under appropriate choices of neighborhoods to plausible solutions in certain natural settings: when the voters' utilities can be expressed in terms of some form of distance from their ideal solution, and when these utilities are additively decomposable across dimensions. In many of these cases, we obtain convergence to the societal welfare maximizing solution. We then describe an experiment in which we test our algorithm for the decision of the U.S. Federal Budget on Mechanical Turk with over 4,000 workers, employing neighborhoods defined by §L1, §L2 and §L∞ balls. We make several observations that inform future implementations of such a procedure.

【Keywords】: collective intelligence; crowdsourced democracy; crowdsourcing; participatory budgeting; societal decision-making

67. Location Privacy-Preserving Task Allocation for Mobile Crowdsensing with Differential Geo-Obfuscation.

Paper Link】 【Pages】:627-636

【Authors】: Leye Wang ; Dingqi Yang ; Xiao Han ; Tianben Wang ; Daqing Zhang ; Xiaojuan Ma

【Abstract】: In traditional mobile crowdsensing applications, organizers need participants' precise locations for optimal task allocation, e.g., minimizing selected workers' travel distance to task locations. However, the exposure of their locations raises privacy concerns. Especially for those who are not eventually selected for any task, their location privacy is sacrificed in vain. Hence, in this paper, we propose a location privacy-preserving task allocation framework with geo-obfuscation to protect users' locations during task assignments. Specifically, we make participants obfuscate their reported locations under the guarantee of differential privacy, which can provide privacy protection regardless of adversaries' prior knowledge and without the involvement of any third-part entity. In order to achieve optimal task allocation with such differential geo-obfuscation, we formulate a mixed-integer non-linear programming problem to minimize the expected travel distance of the selected workers under the constraint of differential privacy. Evaluation results on both simulation and real-world user mobility traces show the effectiveness of our proposed framework. Particularly, our framework outperforms Laplace obfuscation, a state-of-the-art differential geo-obfuscation mechanism, by achieving 45% less average travel distance on the real-world data.

【Keywords】: crowdsensing; differential location privacy; task allocation

68. PaRE: A System for Personalized Route Guidance.

Paper Link】 【Pages】:637-646

【Authors】: Yaguang Li ; Han Su ; Ugur Demiryurek ; Bolong Zheng ; Tieke He ; Cyrus Shahabi

【Abstract】: The turn-by-turn directions provided in existing navigation applications are exclusively derived from underlying road network topology information, i.e., the connectivity of edges to each other. Therefore, the turn-by-turn directions are simplified as metric translation of physical world (e.g. distance/time to turn) to spoken language. Such translation - that ignores human cognition of the geographic space - is often verbose and redundant for the drivers who have knowledge about the geographical areas. In this paper, we study a Personalized RoutE Guidance System dubbed PaRE - with which the goal is to generate more customized and intuitive directions based on user generated content. PaRE utilizes a wealth of user generated historical trajectory data to extract namely "landmarks" (e.g., point of interests or intersections) and frequently visited routes between them from the road network. The extracted information is used to obtain cognitive customized directions for each user. We formalize this task as a problem of finding the optimal partition for a given route that maximizes the familiarity while minimizing the number of segments in the partition, and propose two efficient algorithms to solve it. For empirical study, we apply our solution to both real and synthetic trajectory datasets to evaluate the performance and effectiveness of PaRE.

【Keywords】: gis; personalized route guide; trajectory

4C: Security 1 4

69. Who Controls the Internet?: Analyzing Global Threats using Property Graph Traversals.

Paper Link】 【Pages】:647-656

【Authors】: Milivoj Simeonovski ; Giancarlo Pellegrino ; Christian Rossow ; Michael Backes

【Abstract】: The Internet is built on top of intertwined network services, e.g., email, DNS, and content distribution networks operated by private or governmental organizations. Recent events have shown that these organizations may, knowingly or unknowingly, be part of global-scale security incidents including state-sponsored mass surveillance programs and large-scale DDoS attacks. For example, in March 2015 the Great Cannon attack has shown that an Internet service provider can weaponize millions of Web browsers and turn them into DDoS bots by injecting malicious JavaScript code into transiting TCP connections. While attack techniques and root cause vulnerabilities are routinely studied, we still lack models and algorithms to study the intricate dependencies between services and providers, reason on their abuse, and assess the attack impact. To close this gap, we present a technique that models services, providers, and dependencies as a property graph. Moreover, we present a taint-style propagation-based technique to query the model, and present an evaluation of our framework on the top 100k Alexa domains.

【Keywords】: (dos) denial of service attacks; cyber-attacks; property graph traversals

70. Tools for Automated Analysis of Cybercriminal Markets.

Paper Link】 【Pages】:657-666

【Authors】: Rebecca S. Portnoff ; Sadia Afroz ; Greg Durrett ; Jonathan K. Kummerfeld ; Taylor Berg-Kirkpatrick ; Damon McCoy ; Kirill Levchenko ; Vern Paxson

【Abstract】: Underground forums are widely used by criminals to buy and sell a host of stolen items, datasets, resources, and criminal services. These forums contain important resources for understanding cybercrime. However, the number of forums, their size, and the domain expertise required to understand the markets makes manual exploration of these forums unscalable. In this work, we propose an automated, top-down approach for analyzing underground forums. Our approach uses natural language processing and machine learning to automatically generate high-level information about underground forums, first identifying posts related to transactions, and then extracting products and prices. We also demonstrate, via a pair of case studies, how an analyst can use these automated approaches to investigate other categories of products and transactions. We use eight distinct forums to assess our tools: Antichat, Blackhat World, Carders, Darkode, Hack Forums, Hell, L33tCrew and Nulled. Our automated approach is fast and accurate, achieving over 80% accuracy in detecting post category, product, and prices.

【Keywords】: cybercrime; machine learning/nlp; measurement

71. Tracking Phishing Attacks Over Time.

Paper Link】 【Pages】:667-676

【Authors】: Qian Cui ; Guy-Vincent Jourdan ; Gregor von Bochmann ; Russell Couturier ; Iosif-Viorel Onut

【Abstract】: The so-called ``phishing'' attacks are one of the important threats to individuals and corporations in today's Internet. Combatting phishing is thus a top-priority, and has been the focus of much work, both on the academic and on the industry sides. In this paper, we look at this problem from a new angle. We have monitored a total of 19,066 phishing attacks over a period of ten months and found that over 90% of these attacks were actually replicas or variations of other attacks in the database. This provides several opportunities and insights for the fight against phishing: first, quickly and efficiently detecting replicas is a very effective prevention tool. We detail one such tool in this paper. Second, the widely held belief that phishing attacks are dealt with promptly is but an illusion. We have recorded numerous attacks that stay active throughout our observation period. This shows that the current prevention techniques are ineffective and need to be overhauled. We provide some suggestions in this direction. Third, our observation give a new perspective into the modus operandi of attackers. In particular, some of our observations suggest that a small group of attackers could be behind a large part of the current attacks. Taking down that group could potentially have a large impact on the phishing attacks observed today.

【Keywords】: clustering; phishing detection

72. Security Challenges in an Increasingly Tangled Web.

Paper Link】 【Pages】:677-684

【Authors】: Deepak Kumar ; Zane Ma ; Zakir Durumeric ; Ariana Mirian ; Joshua Mason ; J. Alex Halderman ; Michael Bailey

【Abstract】: Over the past 20 years, websites have grown increasingly complex and interconnected. In 2016, only a negligible number of sites are dependency free, and over 90% of sites rely on external content. In this paper, we investigate the current state of web dependencies and explore two security challenges associated with the increasing reliance on external services: (1) the expanded attack surface associated with serving unknown, implicitly trusted third-party content, and (2) how the increased set of external dependencies impacts HTTPS adoption. We hope that by shedding light on these issues, we can encourage developers to consider the security risks associated with serving third-party content and prompt service providers to more widely deploy HTTPS.

【Keywords】: https adoption; privacy/tracking; website complexity

4D: Computational Health 4 4

73. Blood Pressure Prediction via Recurrent Models with Contextual Layer.

Paper Link】 【Pages】:685-693

【Authors】: Xiaohan Li ; Shu Wu ; Liang Wang

【Abstract】: Recently, the percentage of people with hypertension is increasing, and this phenomenon is widely concerned. At the same time, wireless home Blood Pressure (BP) monitors become accessible in people's life. Since machine learning methods have made important contributions in different fields, many researchers have tried to employ them in dealing with medical problems. However, the existing studies for BP prediction are all based on clinical data with short time ranges. Besides, there do not exist works which can jointly make use of historical measurement data (e.g. BP and heart rate) and contextual data (e.g. age, gender, BMI and altitude). Recurrent Neural Networks (RNNs), especially those using Long Short-Term Memory (LSTM) units, can capture long range dependencies, so they are effective in modeling variable-length sequences. In this paper, we propose a novel model named recurrent models with contextual layer, which can model the sequential measurement data and contextual data simultaneously to predict the trend of users' BP. We conduct our experiments on the BP data set collected from a type of wireless home BP monitors, and experimental results show that the proposed models outperform several competitive compared methods.

【Keywords】: privacy/tracking

74. Enhancing Feature Selection Using Word Embeddings: The Case of Flu Surveillance.

Paper Link】 【Pages】:695-704

【Authors】: Vasileios Lampos ; Bin Zou ; Ingemar Johansson Cox

【Abstract】: Health surveillance systems based on online user-generated content often rely on the identification of textual markers that are related to a target disease. Given the high volume of available data, these systems benefit from an automatic feature selection process. This is accomplished either by applying statistical learning techniques, which do not consider the semantic relationship between the selected features and the inference task, or by developing labour-intensive text classifiers. In this paper, we use neural word embeddings, trained on social media content from Twitter, to determine, in an unsupervised manner, how strongly textual features are semantically linked to an underlying health concept. We then refine conventional feature selection methods by a priori operating on textual variables that are sufficiently close to a target concept. Our experiments focus on the supervised learning problem of estimating influenza-like illness rates from Google search queries. A "flu infection" concept is formulated and used to reduce spurious and potentially confounding features that were selected by previously applied approaches. In this way, we also address forms of scepticism regarding the appropriateness of the feature space, alleviating potential cases of overfitting. Ultimately, the proposed hybrid feature selection method creates a more reliable model that, according to our empirical analysis, improves the inference performance (Mean Absolute Error) of linear and nonlinear regressors by 12% and 28.7%, respectively.

【Keywords】: computational health; feature selection; gaussian processes; influenza-like illness; regularised regression; search query logs; syndromic surveillance; user-generated content; word embeddings

75. Adverse Drug Event Detection in Tweets with Semi-Supervised Convolutional Neural Networks.

Paper Link】 【Pages】:705-714

【Authors】: Kathy Lee ; Ashequl Qadir ; Sadid A. Hasan ; Vivek V. Datla ; Aaditya Prakash ; Joey Liu ; Oladimeji Farri

【Abstract】: Current Adverse Drug Events (ADE) surveillance systems are often associated with a sizable time lag before such events are published. Online social media such as Twitter could describe adverse drug events in real-time, prior to official reporting. Deep learning has significantly improved text classification performance in recent years and can potentially enhance ADE classification in tweets. However, these models typically require large corpora with human expert-derived labels, and such resources are very expensive to generate and are hardly available. Semi-supervised deep learning models, which offer a plausible alternative to fully supervised models, involve the use of a small set of labeled data and a relatively larger collection of unlabeled data for training. Traditionally, these models are trained on labeled and unlabeled data from similar topics or domains. In reality, millions of tweets generated daily often focus on disparate topics, and this could present a challenge for building deep learning models for ADE classification with random Twitter stream as unlabeled training data. In this work, we build several semi-supervised convolutional neural network (CNN) models for ADE classification in tweets, specifically leveraging different types of unlabeled data in developing the models to address the problem. We demonstrate that, with the selective use of a variety of unlabeled data, our semi-supervised CNN models outperform a strong state-of-the-art supervised classification model by +9.9% F1-score. We evaluated our models on the Twitter data set used in the PSB 2016 Social Media Shared Task. Our results present the new state-of-the-art for this data set.

【Keywords】: adverse drug events; healthcare; pharmacovigilance; semi-supervised convolutional neural networks; social media; text classification

76. DeepMood: Forecasting Depressed Mood Based on Self-Reported Histories via Recurrent Neural Networks.

Paper Link】 【Pages】:715-724

【Authors】: Yoshihiko Suhara ; Yinzhan Xu ; Alex Sandy Pentland

【Abstract】: Depression is a prevailing issue and is an increasing problem in many people's lives. Without observable diagnostic criteria, the signs of depression may go unnoticed, resulting in high demand for detecting depression in advance automatically. This paper tackles the challenging problem of forecasting severely depressed moods based on self-reported histories. Despite the large amount of research on understanding individual moods including depression, anxiety, and stress based on behavioral logs collected by pervasive computing devices such as smartphones, forecasting depressed moods is still an open question. This paper develops a recurrent neural network algorithm that incorporates categorical embedding layers for forecasting depression. We collected large-scale records from 2,382 self-declared depressed people to conduct the experiment. Experimental results show that our method forecast the severely depressed mood of a user based on self-reported histories, with higher accuracy than SVM. The results also showed that the long-term historical information of a user improves the accuracy of forecasting depressed mood.

【Keywords】: depression; mobile applications; neural networks

4G: Social 1 4

77. GPOP: Scalable Group-level Popularity Prediction for Online Content in Social Networks.

Paper Link】 【Pages】:725-733

【Authors】: Minh X. Hoang ; Xuan Hong Dang ; Xiang Wu ; Zhenyu Yan ; Ambuj K. Singh

【Abstract】: Predicting the popularity of online content in social networks is important in many applications, ranging from ad campaign design, web content caching and prefetching, to web-search result ranking. Earlier studies target this problem by learning models that either generalize behaviors of the entire network population or capture behaviors of each individual user. In this paper, we claim that a novel approach based on group-level popularity is necessary and more practical, given that users naturally organize themselves into clusters and that users within a cluster react to online content in a uniform manner. We develop a novel framework by first grouping users into cohesive clusters, and then adopt tensor decomposition to make predictions. In order to minimize the impact of noisy data and be more flexible in capturing changes in users' interests, our framework exploits both the network topology and interaction among users in learning a robust user clustering. The PARAFAC tensor decomposition is adapted to work with hierarchical constraint over user groups, and we show that optimizing this constrained function via gradient descent achieves faster convergence and leads to more stable solutions. Extensive experimental results over two social networks demonstrate that our framework is scalable, finds meaningful user groups, and significantly outperforms eight baseline methods in terms of prediction accuracy.

【Keywords】: content prediction; graph clustering; tensor decomposition

78. Expecting to be HIP: Hawkes Intensity Processes for Social Media Popularity.

Paper Link】 【Pages】:735-744

【Authors】: Marian-Andrei Rizoiu ; Lexing Xie ; Scott Sanner ; Manuel Cebrián ; Honglin Yu ; Pascal Van Hentenryck

【Abstract】: Modeling and predicting the popularity of online content is a significant problem for the practice of information dissemination, advertising, and consumption. Recent work analyzing massive datasets advances our understanding of popularity, but one major gap remains: To precisely quantify the relationship between the popularity of an online item and the external promotions it receives. This work supplies the missing link between exogenous inputs from public social media platforms, such as Twitter, and endogenous responses within the content platform, such as YouTube. We develop a novel mathematical model, the Hawkes intensity process, which can explain the complex popularity history of each video according to its type of content, network of diffusion, and sensitivity to promotion. Our model supplies a prototypical description of videos, called an endo-exo map. This map explains popularity as the result of an extrinsic factor -- the amount of promotions from the outside world that the video receives, acting upon two intrinsic factors -- sensitivity to promotion, and inherent virality. We use this model to forecast future popularity given promotions on a large 5-months feed of the most-tweeted videos, and found it to lower the average error by 28.6% from approaches based on popularity history. Finally, we can identify videos that have a high potential to become viral, as well as those for which promotions will have hardly any effect.

【Keywords】: hawkes intensity process; item virality; popularity forecasting; popularity modeling; self-exciting processes

79. Taming the Unpredictability of Cultural Markets with Social Influence.

Paper Link】 【Pages】:745-754

【Authors】: Andrés Abeliuk ; Gerardo Berbeglia ; Pascal Van Hentenryck ; Tad Hogg ; Kristina Lerman

【Abstract】: Unpredictability is often portrayed as an undesirable outcome of social influence in cultural markets. Unpredictability stems from the "rich get richer" effect, whereby small fluctuations in the market share or popularity of products are amplified over time by social influence. In this paper, we report results of an experimental study that shows that unpredictability is not an inherent property of social influence. We investigate strategies for creating markets in which the popularity of products is better-and more predictably-aligned with their underlying quality. For our study, we created a cultural market of science stories and conducted randomized experiments on different policies for presenting the stories to study participants. Specifically, we varied how the stories were ranked, and whether or not participants were shown the ratings these stories received from others. We present a policy that leverages social influence and product positioning to help distinguish the product's market share (popularity) from underlying quality. Highlighting products with the highest estimated quality reduces the "rich get richer" effect highlighting popular products. We show that this policy allows us to more robustly and predictably identify high quality products and promote blockbusters. The policy can be used to create more efficient online cultural markets with a better allocation of resources to products.

【Keywords】: rich get richer effect; cultural markets; experimental study; mathew effect; position bias; ranking policies; social influence; unpredictability

80. Predicting the Success of Online Petitions Leveraging Multidimensional Time-Series.

Paper Link】 【Pages】:755-764

【Authors】: Julia Proskurnia ; Przemyslaw A. Grabowicz ; Ryota Kobayashi ; Carlos Castillo ; Philippe Cudré-Mauroux ; Karl Aberer

【Abstract】: Applying classical time-series analysis techniques to online content is challenging, as web data tends to have data quality issues and is often incomplete, noisy, or poorly aligned. In this paper, we tackle the problem of predicting the evolution of a time series of user activity on the web in a manner that is both accurate and interpretable, using related time series to produce a more accurate prediction. We test our methods in the context of predicting signatures for online petitions using data from thousands of petitions posted on The Petition Site - one of the largest platforms of its kind. We observe that the success of these petitions is driven by a number of factors, including promotion through social media channels and on the front page of the petitions platform. We propose an interpretable model that incorporates seasonality, aging effects, self-excitation, and external effects. The interpretability of the model is important for understanding the elements that drives the activity of an online content. We show through an extensive empirical evaluation that our model is significantly better at predicting the outcome of a petition than state-of-the-art techniques.

【Keywords】: online petitions; time series prediction; web applications

4H: Semantics and Knowledge 5

81. Dynamic Key-Value Memory Networks for Knowledge Tracing.

Paper Link】 【Pages】:765-774

【Authors】: Jiani Zhang ; Xingjian Shi ; Irwin King ; Dit-Yan Yeung

【Abstract】: Knowledge Tracing (KT) is a task of tracing evolving knowledge state of students with respect to one or more concepts as they engage in a sequence of learning activities. One important purpose of KT is to personalize the practice sequence to help students learn knowledge concepts efficiently. However, existing methods such as Bayesian Knowledge Tracing and Deep Knowledge Tracing either model knowledge state for each predefined concept separately or fail to pinpoint exactly which concepts a student is good at or unfamiliar with. To solve these problems, this work introduces a new model called Dynamic Key-Value Memory Networks (DKVMN) that can exploit the relationships between underlying concepts and directly output a student's mastery level of each concept. Unlike standard memory-augmented neural networks that facilitate a single memory matrix or two static memory matrices, our model has one static matrix called key, which stores the knowledge concepts and the other dynamic matrix called value, which stores and updates the mastery levels of corresponding concepts. Experiments show that our model consistently outperforms the state-of-the-art model in a range of KT datasets. Moreover, the DKVMN model can automatically discover underlying concepts of exercises typically performed by human annotations and depict the changing knowledge state of a student.

【Keywords】: deep learning; dynamic key-value memory networks; knowledge tracing; massive open online courses

82. How Users Explore Ontologies on the Web: A Study of NCBO's BioPortal Usage Logs.

Paper Link】 【Pages】:775-784

【Authors】: Simon Walk ; Lisette Espin Noboa ; Denis Helic ; Markus Strohmaier ; Mark A. Musen

【Abstract】: Ontologies in the biomedical domain are numerous, highly specialized and very expensive to develop. Thus, a crucial prerequisite for ontology adoption and reuse is effective support for exploring and finding existing ontologies. Towards that goal, the National Center for Biomedical Ontology (NCBO) has developed BioPortal---an online repository containing more than 500 biomedical ontologies. In 2016, BioPortal represents one of the largest portals for exploration of semantic biomedical vocabularies and terminologies, which is used by many researchers and practitioners. While usage of this portal is high, we know very little about how exactly users search and explore ontologies and what kind of usage patterns or user groups exist in the first place. Deeper insights into user behavior on such portals can provide valuable information to devise strategies for a better support of users in exploring and finding existing ontologies, and thereby enable better ontology reuse. To that end, we study and group users according to their browsing behavior on BioPortal and use data mining techniques to characterize and compare exploration strategies across ontologies. In particular, we were able to identify seven distinct browsing types, all relying on different functionality provided by BioPortal. For example, Search Explorers extensively use the search functionality while Ontology Tree Explorers mainly rely on the class hierarchy for exploring ontologies. Further, we show that specific characteristics of ontologies influence the way users explore and interact with the website. Our results may guide the development of more user-oriented systems for ontology exploration on the Web.

【Keywords】: bioportal; browsing behavior; clustering; markov chain; semantic web; stationary distribution

83. Type-based Semantic Optimization for Scalable RDF Graph Pattern Matching.

Paper Link】 【Pages】:785-793

【Authors】: HyeongSik Kim ; Padmashree Ravindra ; Kemafor Anyanwu

【Abstract】: Scalable query processing relies on early and aggressive determination and pruning of query-irrelevant data. Besides the traditional space-pruning techniques such as indexing, type-based optimizations that exploit integrity constraints defined on the types can be used to rewrite queries into more efficient ones. However, such optimizations are only applicable in strongly-typed data and query models which make it a challenge for semi-structured models such as RDF. Consequently, developing techniques for enabling typebased query optimizations will contribute new insight to improving the scalability of RDF processing systems. In this paper, we address the challenge of type-based query optimization for RDF graph pattern queries. The approach comprises of (i) a novel type system for RDF data induced from data and ontologies and (ii) a query optimization and evaluation framework for evaluating graph pattern queries using type-based optimizations. An implementation of this approach integrated into Apache Pig is presented and evaluated. Comprehensive experiments conducted on real-world and synthetic benchmark datasets show that our approach is up to 500X faster than existing approaches

【Keywords】: hadoop; mapreduce; ontology; rdf; schema; semantic optimization; semantics; sparql; type-based optimization; typing

84. Extracting Emerging Knowledge from Social Media.

Paper Link】 【Pages】:795-804

【Authors】: Marco Brambilla ; Stefano Ceri ; Emanuele Della Valle ; Riccardo Volonterio ; Felix Xavier Acero Salazar

【Abstract】: Massive data integration technologies have been recently used to produce very large ontologies. However, knowledge in the world continuously evolves, and ontologies are largely incomplete for what concerns low-frequency data, belonging to the so-called long tail. Socially produced content is an excellent source for discovering emerging knowledge: it is huge, and immediately reflects the relevant changes which hide emerging entities. Thus, we propose a method for discovering emerging entities by extracting them from social content. Once instrumented by experts through very simple initialization, the method is capable of finding emerging entities; we use a purely syntactic method as a baseline, and we propose several semantics-based variants. The method uses seeds, i.e. prototypes of emerging entities provided by experts, for generating candidates; then, it associates candidates to feature vectors, built by using terms occurring in their social content, and then ranks the candidates by using their distance from the centroid of seeds, returning the top candidates as result. The method can be continuously or periodically iterated, using the results as new seeds. We validate our method by applying it to a set of diverse domain-specific application scenarios, spanning fashion, literature, and exhibitions.

【Keywords】: big data; content analysis; emerging knowledge; entity typing; evolving knowledge; knowledge extraction; online communities; social media analysis; web data mining; web science

85. Distilling Task Knowledge from How-To Communities.

Paper Link】 【Pages】:805-814

【Authors】: Cuong Xuan Chu ; Niket Tandon ; Gerhard Weikum

【Abstract】: Knowledge graphs have become a fundamental asset for search engines. A fair amount of user queries seek information on problem-solving tasks such as building a fence or repairing a bicycle. However, knowledge graphs completely lack this kind of how-to knowledge. This paper presents a method for automatically constructing a formal knowledge base on tasks and task-solving steps, by tapping the contents of online communities such as WikiHow. We employ Open-IE techniques to extract noisy candidates for tasks, steps and the required tools and other items. For cleaning and properly organizing this data, we devise embedding-based clustering techniques. The resulting knowledge base, HowToKB, includes a hierarchical taxonomy of disambiguated tasks, temporal orders of sub-tasks, and attributes for involved items. A comprehensive evaluation of HowToKB shows high accuracy. As an extrinsic use case, we evaluate automatically searching related YouTube videos for HowToKB tasks.

【Keywords】: howto commonsense; knowledge base construction

5B: Crowdsourcing 2 4

86. Constructing and Evaluating a Novel Crowdsourcing-based Paraphrased Opinion Spam Dataset.

Paper Link】 【Pages】:827-836

【Authors】: Seongsoon Kim ; Seongwoon Lee ; Donghyeon Park ; Jaewoo Kang

【Abstract】: Opinion spam, intentionally written by spammers who do not have actual experience with services or products, has recently become a factor that undermines the credibility of information online. In recent years, studies have attempted to detect opinion spam using machine learning algorithms. However, limitations of gold-standard spam datasets still prove to be a major obstacle in opinion spam research. In this paper, we introduce a novel dataset called Paraphrased OPinion Spam (POPS), which contains a new type of review spam that imitates real human opinions using crowdsourcing. To create such a seemingly truthful review spam dataset, we asked task participants to paraphrase truthful reviews, and include factual information and domain knowledge in their reviews. The classification experiments and semantic analysis results show that our POPS dataset most linguistically and semantically resembles truthful reviews. We believe that our new deceptive opinion spam dataset will help advance opinion spam research.

【Keywords】: crowdsourcing; deceptive opinion spam; paraphrased opinion spam

87. Optimizing the Recency-Relevancy Trade-off in Online News Recommendations.

Paper Link】 【Pages】:837-846

【Authors】: Abhijnan Chakraborty ; Saptarshi Ghosh ; Niloy Ganguly ; Krishna P. Gummadi

【Abstract】: Online news media sites are emerging as the primary source of news for a large number of users. The selection of 'front-page' stories on these media sites usually takes into consideration several crowdsourced popularity metrics, such as number of views or shares by the readers. In this work, we focus on automatically recommending front-page stories in such media websites. When recommending news stories, there are two basic metrics of interest - recency and relevancy. Ideally, recommender systems should recommend the most relevant stories soon after they are published. However, the relevancy of a story only becomes evident as the story ages, thereby creating a tension between recency and relevancy. A systematic analysis of popular recommendation strategies in use today reveals that they lead to poor trade-offs between recency and relevancy in practice. So, in this paper, we propose a new recommendation strategy (called Highest Future-Impact) which attempts to optimize on both the axes. To implement our proposed strategy in practice, we develop an optimization framework combining the predicted future-impact of the stories with the uncertainties in the predictions. Evaluations over three real-world news datasets show that our implementation achieves good performance trade-offs between recency and relevancy.

【Keywords】: front-page news selection; non-personalized recommender systems; recency; relevancy

88. Distilling Information Reliability and Source Trustworthiness from Digital Traces.

Paper Link】 【Pages】:847-855

【Authors】: Behzad Tabibian ; Isabel Valera ; Mehrdad Farajtabar ; Le Song ; Bernhard Schölkopf ; Manuel Gomez-Rodriguez

【Abstract】: Online knowledge repositories typically rely on their users or dedicated editors to evaluate the reliability of their contents. These explicit feedback mechanisms can be viewed as noisy measurements of both information reliability and information source trustworthiness. Can we leverage these noisy measurements, often biased, to distill a robust, unbiased and interpretable measure of both notions? In this paper, we argue that the large volume of digital traces left by the users within knowledge repositories also reflect information reliability and source trustworthiness. In particular, we propose a temporal point process modeling framework which links the temporal behavior of the users to information reliability and source trustworthiness. Furthermore, we develop an efficient convex optimization procedure to learn the parameters of the model from historical traces of the evaluations provided by these users. Experiments on real-world data gathered from Wikipedia and Stack Overflow show that our modeling framework accurately predicts evaluation events, provides an interpretable measure of information reliability and source trustworthiness, and yields interesting insights about real-world events.

【Keywords】: information reliability; point processes; source trustworthiness

89. An Army of Me: Sockpuppets in Online Discussion Communities.

Paper Link】 【Pages】:857-866

【Authors】: Srijan Kumar ; Justin Cheng ; Jure Leskovec ; V. S. Subrahmanian

【Abstract】: In online discussion communities, users can interact and share information and opinions on a wide variety of topics. However, some users may create multiple identities, or sockpuppets, and engage in undesired behavior by deceiving others or manipulating discussions. In this work, we study sockpuppetry across nine discussion communities, and show that sockpuppets differ from ordinary users in terms of their posting behavior, linguistic traits, as well as social network structure. Sockpuppets tend to start fewer discussions, write shorter posts, use more personal pronouns such as ``I'', and have more clustered ego-networks. Further, pairs of sockpuppets controlled by the same individual are more likely to interact on the same discussion at the same time than pairs of ordinary users. Our analysis suggests a taxonomy of deceptive behavior in discussion communities. Pairs of sockpuppets can vary in their deceptiveness, i.e., whether they pretend to be different users, or their supportiveness, i.e., if they support arguments of other sockpuppets controlled by the same user. We apply these findings to a series of prediction tasks, notably, to identify whether a pair of accounts belongs to the same underlying user or not. Altogether, this work presents a data-driven view of deception in online discussion communities and paves the way towards the automatic detection of sockpuppets.

【Keywords】: antisocial behavior; malicious users; multiple account use

5C: Security 2 4

90. SMARTGEN: Exposing Server URLs of Mobile Apps With Selective Symbolic Execution.

Paper Link】 【Pages】:867-876

【Authors】: Chaoshun Zuo ; Zhiqiang Lin

【Abstract】: Server URLs including domain names, resource path, and query parameters are important to many security applications such as hidden service identification, malicious website detection, and server vulnerability fuzzing. Unlike traditional desktop web apps in which server URLs are often directly visible, the server URLs of mobile apps are often hidden, only being exposed when the corresponding app code gets executed. Therefore, it is important to automatically analyze the mobile app code to expose the server URLs and enable the security applications with them. We have thus developed SMARTGEN to feature selective symbolic execution for the purpose of automatically generate server request messages to expose the server URLs by extracting and solving user input constraints in mobile apps. Our evaluation with 5,000 top-ranked mobile apps (each with over one million installs) in Google Play shows that with SMARTGEN we are able to reveal 297,780 URLs in total for these apps. We have then submitted all of these exposed URLs to a harmful URL detection service provided by VirusTotal, which further identified 8634 URLs being harmful. Among them, Phising belong to phishing sites, 3,722 malware sites and 3,228 malicious sites (there are 387 overlapped sites between malware and malicious sites).

【Keywords】: mobile app; symbolic execution; url security

91. On the Content Security Policy Violations due to the Same-Origin Policy.

Paper Link】 【Pages】:877-886

【Authors】: Dolière Francis Somé ; Nataliia Bielova ; Tamara Rezk

【Abstract】: Modern browsers implement different security policies such as the Content Security Policy (CSP), a mechanism designed to mitigate popular web vulnerabilities, and the Same Origin Policy (SOP), a mechanism that governs interactions between resources of web pages. In this work, we describe how CSP may be violated due to the SOP when a page contains an embedded iframe from the same origin. We analyse 1 million pages from 10,000 top Alexa sites and report that at least 31.1% of current CSP-enabled pages are potentially vulnerable to CSP violations. Further considering real-world situations where those pages are involved in same-origin nested browsing contexts, we found that in at least 23.5% of the cases, CSP violations are possible. During our study, we also identified a divergence among browsers implementations in the enforcement of CSP in srcdoc sandboxed iframes, which actually reveals a problem in Gecko-based browsers CSP implementation. To ameliorate the problematic conflicts of the security mechanisms, we discuss measures to avoid CSP violations.

【Keywords】: content security policy; same origin policy; security and privacy; web application security

92. Transparent Web Service Auditing via Network Provenance Functions.

Paper Link】 【Pages】:887-895

【Authors】: Adam M. Bates ; Wajih Ul Hassan ; Kevin R. B. Butler ; Alin Dobra ; Bradley Reaves ; Patrick T. Cable II ; Thomas Moyer ; Nabil Schear

【Abstract】: Detecting and explaining the nature of attacks in distributed web services is often difficult -- determining the nature of suspicious activity requires following the trail of an attacker through a chain of heterogeneous software components including load balancers, proxies, worker nodes, and storage services. Unfortunately, existing forensic solutions cannot provide the necessary context to link events across complex workflows, particularly in instances where application layer semantics (e.g., SQL queries, RPCs) are needed to understand the attack. In this work, we present a transparent provenance-based approach for auditing web services through the introduction of Network Provenance Functions (NPFs). NPFs are a distributed architecture for capturing detailed data provenance for web service components, leveraging the key insight that mediation of an application's protocols can be used to infer its activities without requiring invasive instrumentation or developer cooperation. We design and implement NPF with consideration for the complexity of modern cloud-based web services, and evaluate our architecture against a variety of applications including DVDStore, RUBiS, and WikiBench to show that our system imposes as little as 9.3% average end-to-end overhead on connections for realistic workloads. Finally, we consider several scenarios in which our system can be used to concisely explain attacks. NPF thus enables the hassle-free deployment of semantically rich provenance-based auditing for complex applications workflows in the Cloud.

【Keywords】: audit; data provenance; security

93. J-Force: Forced Execution on JavaScript.

Paper Link】 【Pages】:897-906

【Authors】: Kyungtae Kim ; I Luk Kim ; Chung Hwan Kim ; Yonghwi Kwon ; Yunhui Zheng ; Xiangyu Zhang ; Dongyan Xu

【Abstract】: Web-based malware equipped with stealthy cloaking and obfuscation techniques is becoming more sophisticated nowadays. In this paper, we propose J-FORCE, a crash-free forced JavaScript execution engine to systematically explore possible execution paths and reveal malicious behaviors in such malware. In particular, J-FORCE records branch outcomes and mutates them for further explorations. J-FORCE inspects function parameter values that may reveal malicious intentions and expose suspicious DOM injections. We addressed a number of technical challenges encountered. For instance, we keep track of missing objects and DOM elements, and create them on demand. To verify the efficacy of our techniques, we apply J-FORCE to detect Exploit Kit (EK) attacks and malicious Chrome extensions. We observe that J-FORCE is more effective compared to the existing tools.

【Keywords】: evasion; javascript; malware; security

5D: User Modeling 1 4

94. User Personalized Satisfaction Prediction via Multiple Instance Deep Learning.

Paper Link】 【Pages】:907-915

【Authors】: Zheqian Chen ; Ben Gao ; Huimin Zhang ; Zhou Zhao ; Haifeng Liu ; Deng Cai

【Abstract】: Community question answering(CQA) services have arisen as a popular knowledge sharing pattern for netizens. With abundant interactions among users, individuals are capable of obtaining satisfactory information. However, it is not effective for users to attain satisfying answers within minutes. Users have to check the progress over time until the appropriate answers submitted. We address this problem as a user personalized satisfaction prediction task. Existing methods usually exploit manual feature selection. It is not desirable as it requires careful design and is labor intensive. In this paper, we settle this issue by developing a new multiple instance deep learning framework. Specifically, in our settings, each question follows a multiple instance learning assumption, where its obtained answers can be regarded as instance sets in a bag and we define the question resolved with at least one satisfactory answer. We design an efficient framework exploiting multiple instance learning property with deep learning tactic to model the question-answer pairs relevance and rank the asker's satisfaction possibility. Extensive experiments on large-scale datasets from different forums of Stack Exchange demonstrate the feasibility of our proposed framework in predicting asker personalized satisfaction.

【Keywords】: deep learning; multiple instance learning; user satisfaction prediction

Paper Link】 【Pages】:917-926

【Authors】: Dimitar Dimitrov ; Philipp Singer ; Florian Lemmerich ; Markus Strohmaier

【Abstract】: While a plethora of hypertext links exist on the Web, only a small amount of them are regularly clicked. Starting from this observation, we set out to study large-scale click data from Wikipedia in order to understand what makes a link successful. We systematically analyze effects of link properties on the popularity of links. By utilizing mixed-effects hurdle models supplemented with descriptive insights, we find evidence of user preference towards links leading to the periphery of the network, towards links leading to semantically similar articles, and towards links in the top and left-side of the screen. We integrate these findings as Bayesian priors into a navigational Markov chain model and by doing so successfully improve the model fits. We further adapt and improve the well-known classic PageRank algorithm that assumes random navigation by accounting for observed navigational preferences of users in a weighted variation. This work facilitates understanding navigational click behavior and thus can contribute to improving link structures and algorithms utilizing these structures.

【Keywords】: human click behavior; navigation; wikipedia

96. Cats and Captions vs. Creators and the Clock: Comparing Multimodal Content to Context in Predicting Relative Popularity.

Paper Link】 【Pages】:927-936

【Authors】: Jack Hessel ; Lillian Lee ; David M. Mimno

【Abstract】: The content of today's social media is becoming more and more rich, increasingly mixing text, images, videos, and audio. It is an intriguing research question to model the interplay between these different modes in attracting user attention and engagement. But in order to pursue this study of multimodal content, we must also account for context: timing effects, community preferences, and social factors (e.g., which authors are already popular) also affect the amount of feedback and reaction that social-media posts receive. In this work, we separate out the influence of these non-content factors in several ways. First, we focus on ranking pairs of submissions posted to the same community in quick succession, e.g., within 30 seconds; this framing encourages models to focus on time-agnostic and community-specific content features. Within that setting, we determine the relative performance of author vs. content features. We find that victory usually belongs to "cats and captions," as visual and textual features together tend to outperform identity-based features. Moreover, our experiments show that when considered in isolation, simple unigram text features and deep neural network visual features yield the highest accuracy individually, and that the combination of the two modalities generally leads to the best accuracies overall.

【Keywords】: image processing; language modeling; multimodal; reddit; social media

97. Clustered Model Adaption for Personalized Sentiment Analysis.

Paper Link】 【Pages】:937-946

【Authors】: Lin Gong ; Benjamin Haines ; Hongning Wang

【Abstract】: We propose to capture humans' variable and idiosyncratic sentiment via building personalized sentiment classification models at a group level. Our solution roots in the social comparison theory that humans tend to form groups with others of similar minds and ability, and the cognitive consistency theory that mutual influence inside groups will eventually shape group norms and attitudes, with which group members will all shift to align. We formalize personalized sentiment classification as a multi-task learning problem. In particular, to exploit the clustering property of users' opinions, we impose a non-parametric Dirichlet Process prior over the personalized models, in which group members share the same customized sentiment model adapted from a global classifier. Extensive experimental evaluations on large collections of Amazon and Yelp reviews confirm the effectiveness of the proposed solution: it outperformed user-independent classification solutions, and several state-of-the-art model adaptation and multi-task learning algorithms.

【Keywords】: model adaptation; multi-task learning; sentiment analysis

5G: Social 2 4

98. Exact Computation of Influence Spread by Binary Decision Diagrams.

Paper Link】 【Pages】:947-956

【Authors】: Takanori Maehara ; Hirofumi Suzuki ; Masakazu Ishihata

【Abstract】: Evaluating influence spread in social networks is a fundamental procedure to estimate the word-of-mouth effect in viral marketing. There are enormous studies about this topic; however, under the standard stochastic cascade models, the exact computation of influence spread is known to be #P-hard. Thus, the existing studies have used Monte-Carlo simulation-based approximations to avoid exact computation. We propose the first algorithm to compute influence spread exactly under the independent cascade model. The algorithm first constructs binary decision diagrams (BDDs) for all possible realizations of influence spread, then computes influence spread by dynamic programming on the constructed BDDs. To construct the BDDs efficiently, we designed a new frontier-based search-type procedure. The constructed BDDs can also be used to solve other influence-spread related problems, such as random sampling without rejection, conditional influence spread evaluation, dynamic probability update, and gradient computation for probability optimization problems. We conducted computational experiments to evaluate the proposed algorithm. The algorithm successfully computed influence spread on real-world networks with a hundred edges in a reasonable time, which is quite impossible by the naive algorithm. We also conducted an experiment to evaluate the accuracy of the Monte-Carlo simulation-based approximation by comparing exact influence spread obtained by the proposed algorithm.

【Keywords】: binary decision diagram; enumeration; influence spread

99. Secure Centrality Computation Over Multiple Networks.

Paper Link】 【Pages】:957-966

【Authors】: Gilad Asharov ; Francesco Bonchi ; David García-Soriano ; Tamir Tassa

【Abstract】: Consider a multi-layered graph, where the different layers correspond to different proprietary social networks on the same ground set of users. Suppose that the owners of the different networks (called hosts) are mutually non-trusting parties: how can they compute a centrality score for each of the users using all the layers, but without disclosing information about their private graphs? Under this setting we study a suite of three centrality measures whose algebraic structure allows performing that computation with provable security and efficiency. The first measure counts the nodes reachable from a node within a given radius. The second measure extends the first one by counting the number of paths between any two nodes. The final one is a generalization to the multi-layered graph case: not only the number of paths is counted, but also the multiplicity of these paths in the different layers is considered. We devise a suite of multiparty protocols to compute those centrality measures, which are all provably secure in the information-theoretic sense. One typical challenge and limitation of secure multiparty computation protocols is their scalability. We tackle this problem and devise a protocol which is highly scalable and still provably secure. We test our protocols on several real-world multi-layered graphs: interestingly, the protocol to compute the most sensitive measure (i.e., the multi-layered centrality) is also the most scalable one and can be efficiently run on very large networks.

【Keywords】: centrality measures; multi-layered graphs; secure multiparty protocols; social networks

100. Interplay between Social Influence and Network Centrality: A Comparative Study on Shapley Centrality and Single-Node-Influence Centrality.

Paper Link】 【Pages】:967-976

【Authors】: Wei Chen ; Shang-Hua Teng

【Abstract】: We study network centrality based on dynamic influence propagation models in social networks. To illustrate our integrated mathematical-algorithmic approach for understanding the fundamental interplay between dynamic influence processes and static network structures, we focus on two basic centrality measures: (a) Single Node Influence (SNI) centrality, which measures each node's significance by its influence spread; and (b) Shapley Centrality, which uses the Shapley value of the influence spread function --- formulated based on a fundamental cooperative-game-theoretical concept --- to measure the significance of nodes. We present a comprehensive comparative study of these two centrality measures. Mathematically, we present axiomatic characterizations, which precisely capture the essence of these two centrality measures and their fundamental differences. Algorithmically, we provide scalable algorithms for approximating them for a large family of social-influence instances. Empirically, we demonstrate their similarity and differences in a number of real-world social networks, as well as the efficiency of our scalable algorithms. Our results shed light on their applicability: SNI centrality is suitable for assessing individual influence in isolation while Shapley centrality assesses individuals' performance in group influence settings.

【Keywords】: influence diffusion model; interplay between network and influence model; network centrality; scalable algorithms; shapley values; social influence; social network

101. Portfolio Optimization for Influence Spread.

Paper Link】 【Pages】:977-985

【Authors】: Naoto Ohsaka ; Yuichi Yoshida

【Abstract】: Motivated by viral marketing, stochastic diffusion processes that model influence spread on a network have been studied intensively. The primary interest in such models has been to find a seed set of a fixed size that maximizes the expected size of the cascade from it. Practically, however, it is not desirable to have the risk of ending with a small cascade, even if the expected size of the cascade is large. To address this issue, we adopt conditional value at risk (CVaR) as a risk measure, and propose an algorithm that computes a portfolio over seed sets with a provable guarantee on its CVaR. Using real-world social networks, we demonstrate that the portfolio computed by our algorithm has a significantly better CVaR than seed sets computed by other baseline methods.

【Keywords】: conditional value at risk; diffusion process; influence maximization; portfolio optimization; social networks

5H: Information Extraction 4

102. Extracting and Ranking Travel Tips from User-Generated Reviews.

Paper Link】 【Pages】:987-996

【Authors】: Ido Guy ; Avihai Mejer ; Alexander Nus ; Fiana Raiber

【Abstract】: User-generated reviews are a key driving force behind some of the leading websites, such as Amazon, TripAdvisor, and Yelp. Yet, the proliferation of user reviews in such sites also poses an information overload challenge: many items, especially popular ones, have a large number of reviews, which cannot all be read by the user. In this work, we propose to extract short practical tips from user reviews. We focus on tips for travel attractions extracted from user reviews on TripAdvisor. Our method infers a list of templates from a small gold set of tips and applies them to user reviews to extract tip candidates. For each attraction, the associated candidates are then ranked according to their predicted usefulness. Evaluation based on labeling by professional annotators shows that our method produces high-quality tips, with good coverage of cities and attractions.

【Keywords】: advice extraction; social recommender systems; travel assistants; user-generated content; web mining

103. Information Extraction in Illicit Web Domains.

Paper Link】 【Pages】:997-1006

【Authors】: Mayank Kejriwal ; Pedro Szekely

【Abstract】: Extracting useful entities and attribute values from illicit domains such as human trafficking is a challenging problem with the potential for widespread social impact. Such domains employ atypical language models, have 'long tails' and suffer from the problem of concept drift. In this paper, we propose a lightweight, feature-agnostic Information Extraction (IE) paradigm specifically designed for such domains. Our approach uses raw, unlabeled text from an initial corpus, and a few (12-120) seed annotations per domain-specific attribute, to learn robust IE models for unobserved pages and websites. Empirically, we demonstrate that our approach can outperform feature-centric Conditional Random Field baselines by over 18% F-Measure on five annotated sets of real-world human trafficking datasets in both low-supervision and high-supervision settings. We also show that our approach is demonstrably robust to concept drift, and can be efficiently bootstrapped even in a serial computing environment.

【Keywords】: distributional semantics; feature-agnostic; illicit domains; information extraction; named entity recognition

104. Learning to Extract Events from Knowledge Base Revisions.

Paper Link】 【Pages】:1007-1014

【Authors】: Alexander Konovalov ; Benjamin Strauss ; Alan Ritter ; Brendan O'Connor

【Abstract】: Broad-coverage knowledge bases (KBs) such as Wikipedia, Freebase, Microsoft's Satori and Google's Knowledge Graph contain structured data describing real-world entities. These data sources have become increasingly important for a wide range of intelligent systems: from information retrieval and question answering, to Facebook's Graph Search, IBM's Watson, and more. Previous work on learning to populate knowledge bases from text has, for the most part, made the simplifying assumption that facts remain constant over time. But this is inaccurate -- we live in a rapidly changing world. Knowledge should not be viewed as a static snapshot, but instead a rapidly evolving set of facts that must change as the world changes. In this paper we demonstrate the feasibility of accurately identifying entity-transition-events, from real-time news and social media text streams, that drive changes to a knowledge base. We use Wikipedia's edit history as distant supervision to learn event extractors, and evaluate the extractors based on their ability to predict online updates. Our weakly supervised event extractors are able to predict 10 KB revisions per month at 0.8 precision. By lowering our confidence threshold, we can suggest 34.3 correct edits per month at 0.4 precision. 64% of predicted edits were detected before they were added to Wikipedia. The average lead time of our forecasted knowledge revisions over Wikipedia's editors is 40 days, demonstrating the utility of our method for suggesting edits that can be quickly verified and added to the knowledge graph.

【Keywords】: database management; information extraction; knowledge base population; natural language processing

105. CoType: Joint Extraction of Typed Entities and Relations with Knowledge Bases.

Paper Link】 【Pages】:1015-1024

【Authors】: Xiang Ren ; Zeqiu Wu ; Wenqi He ; Meng Qu ; Clare R. Voss ; Heng Ji ; Tarek F. Abdelzaher ; Jiawei Han

【Abstract】: Extracting entities and relations for types of interest from text is important for understanding massive text corpora. Traditionally, systems of entity relation extraction have relied on human-annotated corpora for training and adopted an incremental pipeline. Such systems require additional human expertise to be ported to a new domain, and are vulnerable to errors cascading down the pipeline. In this paper, we investigate joint extraction of typed entities and relations with labeled data heuristically obtained from knowledge bases (i.e., distant supervision). As our algorithm for type labeling via distant supervision is context-agnostic, noisy training data poses unique challenges for the task. We propose a novel domain-independent framework, called CoType, that runs a data-driven text segmentation algorithm to extract entity mentions, and jointly embeds entity mentions, relation mentions, text features and type labels into two low-dimensional spaces (for entity and relation mentions respectively), where, in each space, objects whose types are close will also have similar representations. CoType, then using these learned embeddings, estimates the types of test (unlinkable) mentions. We formulate a joint optimization problem to learn embeddings from text corpora and knowledge bases, adopting a novel partial-label loss function for noisy labeled data and introducing an object "translation" function to capture the cross-constraints of entities and relations on each other. Experiments on three public datasets demonstrate the effectiveness of CoType across different domains (e.g., news, biomedical), with an average of 25% improvement in F1 score compared to the next best method.

【Keywords】: distant supervision; entity recognition and typing; information extraction; joint embedding; relation extraction

6B: Machine Learning 4

106. Correlation Clustering with Low-Rank Matrices.

Paper Link】 【Pages】:1025-1034

【Authors】: Nate Veldt ; Anthony Ian Wirth ; David F. Gleich

【Abstract】: Correlation clustering is a technique for aggregating data based on qualitative information about which pairs of objects are labeled similar' ordissimilar.' Because the optimization problem is NP-hard, much of the previous literature focuses on finding approximation algorithms. In this paper we explore how to solve the correlation clustering objective exactly when the data to be clustered can be represented by a low-rank matrix. We prove in particular that correlation clustering can be solved in polynomial time when the underlying matrix is positive semidefinite with small constant rank, but that the task remains NP-hard in the presence of even one negative eigenvalue. Based on our theoretical results, we develop an algorithm for efficiently ``solving'' low-rank positive semidefinite correlation clustering by employing a procedure for zonotope vertex enumeration. We demonstrate the effectiveness and speed of our algorithm by using it to solve several clustering problems on both synthetic and real-world data.

【Keywords】: correlation clustering; low-rank matrices

107. Consistent Weighted Sampling Made More Practical.

Paper Link】 【Pages】:1035-1043

【Authors】: Wei Wu ; Bin Li ; Ling Chen ; Chengqi Zhang

【Abstract】: Min-Hash, which is widely used for efficiently estimating similarities of bag-of-words represented data, plays an increasingly important role in the era of big data. It has been extended to deal with real-value weighted sets -- Improved Consistent Weighted Sampling (ICWS) is considered as the state-of-the-art for this problem. In this paper, we propose a Practical CWS (PCWS) algorithm. We first transform the original form of ICWS into an equivalent expression, based on which we find some interesting properties that inspire us to make the ICWS algorithm simpler and more efficient in both space and time complexities. PCWS is not only mathematically equivalent to ICWS and preserves the same theoretical properties, but also saves 20% memory footprint and substantial computational cost compared to ICWS. The experimental results on a number of real-world text data sets demonstrate that PCWS obtains the same (even better) classification and retrieval performance as ICWS with 1/5~1/3 reduced empirical runtime.

【Keywords】: consistent weighted sampling; lsh; weighted min-hash

108. Leveraging Large Amounts of Weakly Supervised Data for Multi-Language Sentiment Classification.

Paper Link】 【Pages】:1045-1052

【Authors】: Jan Deriu ; Aurélien Lucchi ; Valeria De Luca ; Aliaksei Severyn ; Simon Müller ; Mark Cieliebak ; Thomas Hofmann ; Martin Jaggi

【Abstract】: This paper presents a novel approach for multi-lingual sentiment classification in short texts. This is a challenging task as the amount of training data in languages other than English is very limited. Previously proposed multi-lingual approaches typically require to establish a correspondence to English for which powerful classifiers are already available. In contrast, our method does not require such supervision. We leverage large amounts of weakly-supervised data in various languages to train a multi-layer convolutional network and demonstrate the importance of using pre-training of such networks. We thoroughly evaluate our approach on various multi-lingual datasets, including the recent SemEval-2016 sentiment prediction benchmark (Task 4), where we achieved state-of-the-art performance. We also compare the performance of our model trained individually for each language to a variant trained for all languages at once. We show that the latter model reaches slightly worse - but still acceptable - performance when compared to the single language model, while benefiting from better generalization properties across languages.

【Keywords】: multi-language; neural networks; sentiment classification; weak supervision

109. Theory of the GMM Kernel.

Paper Link】 【Pages】:1053-1062

【Authors】: Ping Li ; Cun-Hui Zhang

【Abstract】: In web search, data mining, and machine learning, two popular measures of data similarity are the cosine and the resemblance (the latter is for binary data). In this study, we develop theoretical results for both the cosine and the GMM (generalized min-max) kernel, which is a generalization of the resemblance. GMM has direct applications in machine learning as a positive definite kernel and can be efficiently linearized via probabilistic hashing to handle big data. Owing to its discrete nature, the hashed values can also be used to build hash tables for efficient near neighbor search. We prove the theoretical limit of GMM and the consistency result, assuming that the data follow an elliptical distribution, which is a general family of distributions and includes the multivariate normal and t-distribution as special cases. The consistency result holds as long as the data have bounded first moment (an assumption which typically holds for data commonly encountered in practice). Furthermore, we establish the asymptotic normality of GMM. We also prove the limit of cosine under elliptical distributions. In comparison, the consistency of GMM requires much weaker conditions. For example, when data follow a t-distribution with ν degrees of freedom, GMM typically provides a better estimate of similarity than cosine when ν < 8 (ν = 8 means the distribution is very close to normal). These theoretical results help explain the recent success of GMM and lay the foundation for further research.

【Keywords】: hashing; random fourier features; similarity; the gmm kernel

6C: Spam Detection 4

110. Bimodal Distribution and Co-Bursting in Review Spam Detection.

Paper Link】 【Pages】:1063-1072

【Authors】: Huayi Li ; Geli Fei ; Shuai Wang ; Bing Liu ; Weixiang Shao ; Arjun Mukherjee ; Jidong Shao

【Abstract】: Online reviews play a crucial role in helping consumers evaluate and compare products and services. This critical importance of reviews also incentivizes fraudsters (or spammers) to write fake or spam reviews to secretly promote or demote some target products and services. Existing approaches to detecting spam reviews and reviewers employed review contents, reviewer behaviors, star rating patterns, and reviewer-product networks for detection. In this research, we further discovered that reviewers' posting rates (number of reviews written in a period of time) also follow an interesting distribution pattern, which has not been reported before. That is, their posting rates are bimodal. Multiple spammers also tend to collectively and actively post reviews to the same set of products within a short time frame, which we call co-bursting. Furthermore, we found some other interesting patterns in individual reviewers' temporal dynamics and their co-bursting behaviors with other reviewers. Inspired by these findings, we first propose a two-mode Labeled Hidden Markov Model to model spamming using only individual reviewers' review posting times. We then extend it to the Coupled Hidden Markov Model to capture both reviewer posting behaviors and co-bursting signals. Our experiments show that the proposed model significantly outperforms state-of-the-art baselines in identifying individual spammers. Furthermore, we propose a co-bursting network based on co-bursting relations, which helps detect groups of spammers more effectively than existing approaches.

【Keywords】: hidden markov model; review spam; spam groups

111. Detecting Collusive Spamming Activities in Community Question Answering.

Paper Link】 【Pages】:1073-1082

【Authors】: Yuli Liu ; Yiqun Liu ; Ke Zhou ; Min Zhang ; Shaoping Ma

【Abstract】: Community Question Answering (CQA) portals provide rich sources of information on a variety of topics. However, the authenticity and quality of questions and answers (Q&As) has proven hard to control. In a troubling direction, the widespread growth of crowdsourcing websites has created a large-scale, potentially difficult-to-detect workforce to manipulate malicious contents in CQA. The crowd workers who join the same crowdsourcing task about promotion campaigns in CQA collusively manipulate deceptive Q&As for promoting a target (product or service). The collusive spamming group can fully control the sentiment of the target. How to utilize the structure and the attributes for detecting manipulated Q&As? How to detect the collusive group and leverage the group information for the detection task? To shed light on these research questions, we propose a unified framework to tackle the challenge of detecting collusive spamming activities of CQA. First, we interpret the questions and answers in CQA as two independent networks. Second, we detect collusive question groups and answer groups from these two networks respectively by measuring the similarity of the contents posted within a short duration. Third, using attributes (individual-level and group-level) and correlations (user-based and content-based), we proposed a combined factor graph model to detect deceptive Q&As simultaneously by combining two independent factor graphs. With a large-scale practical data set, we find that the proposed framework can detect deceptive contents at early stage, and outperforms a number of competitive baselines.

【Keywords】: community question answering; crowdsourcing manipulation; factor graph.; spam detection

112. FLOCK: Combating Astroturfing on Livestreaming Platforms.

Paper Link】 【Pages】:1083-1091

【Authors】: Neil Shah

【Abstract】: Livestreaming platforms have become increasingly popular in recent years as a means of sharing and advertising creative content. Popular content streamers who attract large viewership to their live broadcasts can earn a living by means of ad revenue, donations and channel subscriptions. Unfortunately, this incentivized popularity has simultaneously resulted in incentive for fraudsters to provide services to astroturf, or artificially inflate viewership metrics by providing fake ``live'' views to customers. Our work provides a number of major contributions: (a) formulation: we are the first to introduce and characterize the viewbot fraud problem in livestreaming platforms, (b) methodology: we propose FLOCK, a principled and unsupervised method which efficiently and effectively identifies botted broadcasts and their constituent botted views, and (c) practicality: our approach achieves over 98% precision in identifying botted broadcasts and over 90% precision/recall against sizable synthetically generated viewbot attacks on a real-world livestreaming workload of over 16 million views and 92 thousand broadcasts. FLOCK successfully operates on larger datasets in practice and is regularly used at a large, undisclosed livestreaming corporation.

【Keywords】: anomaly detection; livestreaming; viewbots

113. Can You Spot the Fakes?: On the Limitations of User Feedback in Online Social Networks.

Paper Link】 【Pages】:1093-1102

【Authors】: David Mandell Freeman

【Abstract】: Online social networks (OSNs) are appealing platforms for spammers and fraudsters, who typically use fake or compromised accounts to connect with and defraud real users. To combat such abuse, OSNs allow users to report fraudulent profiles or activity. The OSN can then use reporting data to review and/or limit activity of reported accounts. Previous authors have suggested that an OSN can augment its takedown algorithms by identifying a "trusted set" of users whose reports are weighted more heavily in the disposition of flagged accounts. Such identification would allow the OSN to improve both speed and accuracy of fake account detection and thus reduce the impact of spam on users. In this work we provide the first public, data-driven assessment of whether the above assumption is true: are some users better at reporting than others? Specifically, is reporting skill both measurable, i.e., possible to distinguish from random guessing; and repeatable, i.e., persistent over repeated sampling? Our main contributions are to develop a statistical framework that describes these properties and to apply this framework to data from LinkedIn, the professional social network. Our data includes member reports of fake profiles as well as the more voluminous, albeit weaker, signal of member responses to connection requests. We find that members demonstrating measurable, repeatable skill in identifying fake profiles do exist but are rare: at most 2.4% of those reporting fakes and at most 1.3% of those rejecting connection requests. We conclude that any reliable "trusted set" of members will be too small to have noticeable impact on spam metrics.

【Keywords】: fake accounts; online trust; reputation systems; social networks; spam detection

6D: User Modeling 2 4

114. Modeling Consumer Preferences and Price Sensitivities from Large-Scale Grocery Shopping Transaction Logs.

Paper Link】 【Pages】:1103-1112

【Authors】: Mengting Wan ; Di Wang ; Matthew Goldman ; Matt Taddy ; Justin Rao ; Jie Liu ; Dimitrios Lymberopoulos ; Julian McAuley

【Abstract】: In order to match shoppers with desired products and provide personalized promotions, whether in online or offline shopping worlds, it is critical to model both consumer preferences and price sensitivities simultaneously. Personalized preferences have been thoroughly studied in the field of recommender systems, though price (and price sensitivity) has received relatively little attention. At the same time, price sensitivity has been richly explored in the area of economics, though typically not in the context of developing scalable, working systems to generate recommendations. In this study, we seek to bridge the gap between large-scale recommender systems and established consumer theories from economics, and propose a nested feature-based matrix factorization framework to model both preferences and price sensitivities. Quantitative and qualitative results indicate the proposed personalized, interpretable and scalable framework is capable of providing satisfying recommendations (on two datasets of grocery transactions) and can be applied to obtain economic insights into consumer behavior.

【Keywords】: consumer behavior; matrix factorization; price elasticity; recommender system

115. Do "Also-Viewed" Products Help User Rating Prediction?

Paper Link】 【Pages】:1113-1122

【Authors】: Chanyoung Park ; Dong Hyun Kim ; Jinoh Oh ; Hwanjo Yu

【Abstract】: For online product recommendation engines, learning high-quality product embedding that captures various aspects of the product is critical to improving the accuracy of user rating prediction. In recent research, in conjunction with user feedback, the appearance of a product as side information has been shown to be helpful for learning product embedding. However, since a product has a variety of aspects such as functionality and specifications, taking into account only its appearance as side information does not suffice to accurately learn its embedding. In this paper, we propose a matrix co-factorization method that leverages information hidden in the so-called "also-viewed" products, i.e., a list of products that has also been viewed by users who have viewed a target product. "Also-viewed" products reflect various aspects of a given product that have been overlooked by visually-aware recommendation methods proposed in past research. Experiments on multiple real-world datasets demonstrate that our proposed method outperforms state-of-the-art baselines in terms of user rating prediction. We also perform classification on the product embedding learned by our method, and compare it with a state-of-the-art baseline to demonstrate the superiority of our method in generating high-quality product embedding that better represents the product.

【Keywords】: collaborative filtering; online shopping; product embedding

116. Monetary Discount Strategies for Real-Time Promotion Campaign.

Paper Link】 【Pages】:1123-1132

【Authors】: Ying-Chun Lin ; Chi-Hsuan Huang ; Chu-Cheng Hsieh ; Yu-Chen Shu ; Kun-Ta Chuang

【Abstract】: The effectiveness of monetary promotions has been well reported in the literature to affect shopping decisions for products in real life experience. Nowadays, e-commerce retailers are facing more fierce competition on price promotion in that consumers can easily use a search engine to find another merchant selling an identical product for comparing price. To achieve more effectiveness on real-time promotion in pursuit of better profits, we propose two discount-giving strategies: an algorithm based on Kernel density estimation, and the other algorithm based on Thompson sampling strategy. We show that, given a pre-determined discount budget, our algorithms can significantly acquire better revenue in return than classical strategies with simply fixed discount on label price. We then demonstrate its feasibility to be a promising deployment in e-commerce services for real-time promotion.

【Keywords】: discount-giving strategy; online shopping; real-time promotion; user behavior modeling

117. Predicting Latent Structured Intents from Shopping Queries.

Paper Link】 【Pages】:1133-1141

【Authors】: Chao-Yuan Wu ; Amr Ahmed ; Gowtham Ramani Kumar ; Ritendra Datta

【Abstract】: In online shopping, users usually express their intent through search queries. However, these queries are often ambiguous. For example, it is more likely (and easier) for users to write a query like "high-end bike" than "21 speed carbon frames jamis or giant road bike". It is challenging to interpret these ambiguous queries and thus search result accuracy suffers. A user oftentimes needs to go through the frustrating process of refining search queries or self-teaching from possibly unstructured information. However, shopping is indeed a structured domain, that is composed of category hierarchy, brands, product lines, features, etc. It would be much better if a shopping site could understand users' intent through this structure, present organized information, and then find the items with the right categories, brands or features. In this paper we study the problem of inferring the latent intent from unstructured queries and mapping them to structured attributes. We present a novel framework that jointly learns this knowledge from user consumption behaviors and product metadata. We present a hybrid Long Short-term Memory (LSTM) joint model that is accurate and robust, even though user queries are noisy and product catalog is rapidly growing. Our study is conducted on a large-scale dataset from Google Shopping, that is composed of millions of items and user queries along with their click responses. Extensive qualitative and quantitative evaluation shows that the proposed model is more accurate, concise, and robust than multiple possible alternatives. In terms of information retrieval (IR) performance, our model is able to improve the quality of current Google Shopping production system, which is a very strong baseline.

【Keywords】: autoencoder; entity relationship modeling; query understanding; recurrent neural networks; shopping

6G: Social 3 5

118. EOMM: An Engagement Optimized Matchmaking Framework.

Paper Link】 【Pages】:1143-1150

【Authors】: Zhengxing Chen ; Su Xue ; John Kolen ; Navid Aghdaie ; Kazi A. Zaman ; Yizhou Sun ; Magy Seif El-Nasr

【Abstract】: Matchmaking connects multiple players to participate in online player-versus-player games. Current matchmaking systems depend on a single core strategy: create fair games at all times. These systems pair similarly skilled players on the assumption that a fair game is best player experience. We will demonstrate, however, that this intuitive assumption sometimes fails and that matchmaking based on fairness is not optimal for engagement. In this paper, we propose an Engagement Optimized Matchmaking (EOMM) framework that maximizes overall player engagement. We prove that equal-skill based matchmaking is a special case of EOMM on a highly simplified assumption that rarely holds in reality. Our simulation on real data from a popular game made by Electronic Arts, Inc. (EA) supports our theoretical results, showing significant improvement in enhancing player engagement compared to existing matchmaking methods.

【Keywords】: matchmaking; player engagement; video games

119. Back To The Source: An Online Approach for Sensor Placement and Source Localization.

Paper Link】 【Pages】:1151-1160

【Authors】: Brunella Spinelli ; L. Elisa Celis ; Patrick Thiran

【Abstract】: Source localization, the act of finding the originator of a disease or rumor in a network, has become an important problem in sociology and epidemiology. The localization is done using the infection state and time of infection of a few designated sensor nodes; however, maintaining sensors can be very costly in practice. We propose the first online approach to source localization: We deploy a priori only a small number of sensors (which reveal if they are reached by an infection) and then iteratively choose the best location to place a new sensor in order to localize the source. This approach allows for source localization with a very small number of sensors; moreover, the source can be found while the epidemic is still ongoing. Our method applies to a general network topology and performs well even with random transmission delays.

【Keywords】: epidemics; online source localization; sensor placement

120. What's in a Name?: Understanding Profile Name Reuse on Twitter.

Paper Link】 【Pages】:1161-1170

【Authors】: Enrico Mariconti ; Jeremiah Onaolapo ; Syed Sharique Ahmad ; Nicolas Nikiforou ; Manuel Egele ; Nick Nikiforakis ; Gianluca Stringhini

【Abstract】: Users on Twitter are commonly identified by their profile names. These names are used when directly addressing users on Twitter, are part of their profile page URLs, and can become a trademark for popular accounts, with people referring to celebrities by their real name and their profile name, interchangeably. Twitter, however, has chosen to not permanently link profile names to their corresponding user accounts. In fact, Twitter allows users to change their profile name, and afterwards makes the old profile names available for other users to take. In this paper, we provide a large-scale study of the phenomenon of profile name reuse on Twitter. We show that this phenomenon is not uncommon, investigate the dynamics of profile name reuse, and characterize the accounts that are involved in it. We find that many of these accounts adopt abandoned profile names for questionable purposes, such as spreading malicious content, and using the profile name's popularity for search engine optimization. Finally, we show that this problem is not unique to Twitter (as other popular online social networks also release profile names) and argue that the risks involved with profile-name reuse outnumber the advantages provided by this feature.

【Keywords】: impersonation; measurement; osn; profile name; security; social network

121. Fairness Beyond Disparate Treatment & Disparate Impact: Learning Classification without Disparate Mistreatment.

Paper Link】 【Pages】:1171-1180

【Authors】: Muhammad Bilal Zafar ; Isabel Valera ; Manuel Gomez-Rodriguez ; Krishna P. Gummadi

【Abstract】: Automated data-driven decision making systems are increasingly being used to assist, or even replace humans in many settings. These systems function by learning from historical decisions, often taken by humans. In order to maximize the utility of these systems (or, classifiers), their training involves minimizing the errors (or, misclassifications) over the given historical data. However, it is quite possible that the optimally trained classifier makes decisions for people belonging to different social groups with different misclassification rates (e.g., misclassification rates for females are higher than for males), thereby placing these groups at an unfair disadvantage. To account for and avoid such unfairness, in this paper, we introduce a new notion of unfairness, disparate mistreatment, which is defined in terms of misclassification rates. We then propose intuitive measures of disparate mistreatment for decision boundary-based classifiers, which can be easily incorporated into their formulation as convex-concave constraints. Experiments on synthetic as well as real world datasets show that our methodology is effective at avoiding disparate mistreatment, often at a small cost in terms of accuracy.

【Keywords】: algorithmic decision making; discrimination in decision making; fair classification; fair decision making; machine learning and law

122. Sampling from Social Networks with Attributes.

Paper Link】 【Pages】:1181-1190

【Authors】: Claudia Wagner ; Philipp Singer ; Fariba Karimi ; Jürgen Pfeffer ; Markus Strohmaier

【Abstract】: Sampling from large networks represents a fundamental challenge for social network research. In this paper, we explore the sensitivity of different sampling techniques (node sampling, edge sampling, random walk sampling, and snowball sampling) on social networks with attributes. We consider the special case of networks (i) where we have one attribute with two values (e.g., male and female in the case of gender), (ii) where the size of the two groups is unequal (e.g., a male majority and a female minority), and (iii) where nodes with the same or different attribute value attract or repel each other (i.e., homophilic or heterophilic behavior). We evaluate the different sampling techniques with respect to conserving the position of nodes and the visibility of groups in such networks. Experiments are conducted both on synthetic and empirical social networks. Our results provide evidence that different network sampling techniques are highly sensitive with regard to capturing the expected centrality of nodes, and that their accuracy depends on relative group size differences and on the level of homophily that can be observed in the network. We conclude that uninformed sampling from social networks with attributes thus can significantly impair the ability of researchers to draw valid conclusions about the centrality of nodes and the visibility or invisibility of groups in social networks.

【Keywords】: homophily; sampling bias; sampling methods; social networks

6H: Question Answering and Topic Modeling 4

123. Automated Template Generation for Question Answering over Knowledge Graphs.

Paper Link】 【Pages】:1191-1200

【Authors】: Abdalghani Abujabal ; Mohamed Yahya ; Mirek Riedewald ; Gerhard Weikum

【Abstract】: Templates are an important asset for question answering over knowledge graphs, simplifying the semantic parsing of input utterances and generating structured queries for interpretable answers. State-of-the-art methods rely on hand-crafted templates with limited coverage. This paper presents QUINT, a system that automatically learns utterance-query templates solely from user questions paired with their answers. Additionally, QUINT is able to harness language compositionality for answering complex questions without having any templates for the entire question. Experiments with different benchmarks demonstrate the high quality of QUINT.

【Keywords】: knowledge graphs; question answering; semantic parsing

124. A Semantic Graph-Based Approach for Mining Common Topics from Multiple Asynchronous Text Streams.

Paper Link】 【Pages】:1201-1209

【Authors】: Long Chen ; Joemon M. Jose ; Haitao Yu ; Fajie Yuan

【Abstract】: In the age of Web 2.0, a substantial amount of unstructured content are distributed through multiple text streams in an asynchronous fashion, which makes it increasingly difficult to glean and distill useful information. An effective way to explore the information in text streams is topic modelling, which can further facilitate other applications such as search, information browsing, and pattern mining. In this paper, we propose a semantic graph based topic modelling approach for structuring asynchronous text streams. Our model integrates topic mining and time synchronization, two core modules for addressing the problem, into a unified model. Specifically, for handling the lexical gap issues, we use global semantic graphs of each timestamp for capturing the hidden interaction among entities from all the text streams. For dealing with the sources asynchronism problem, local semantic graphs are employed to discover similar topics of different entities that can be potentially separated by time gaps. Our experiment on two real-world datasets shows that the proposed model significantly outperforms the existing ones.

【Keywords】: knowledge repository; language modelling; topic modelling

125. Neural Network-based Question Answering over Knowledge Graphs on Word and Character Level.

Paper Link】 【Pages】:1211-1220

【Authors】: Denis Lukovnikov ; Asja Fischer ; Jens Lehmann ; Sören Auer

【Abstract】: Question Answering (QA) systems over Knowledge Graphs (KG) automatically answer natural language questions using facts contained in a knowledge graph. Simple questions, which can be answered by the extraction of a single fact, constitute a large part of questions asked on the web but still pose challenges to QA systems, especially when asked against a large knowledge resource. Existing QA systems usually rely on various components each specialised in solving different sub-tasks of the problem (such as segmentation, entity recognition, disambiguation, and relation classification etc.). In this work, we follow a quite different approach: We train a neural network for answering simple questions in an end-to-end manner, leaving all decisions to the model. It learns to rank subject-predicate pairs to enable the retrieval of relevant facts given a question. The network contains a nested word/character-level question encoder which allows to handle out-of-vocabulary and rare word problems while still being able to exploit word-level semantics. Our approach achieves results competitive with state-of-the-art end-to-end approaches that rely on an attention mechanism.

【Keywords】: knowledge graphs; question answering

126. Detecting Duplicate Posts in Programming QA Communities via Latent Semantics and Association Rules.

Paper Link】 【Pages】:1221-1229

【Authors】: Wei Emma Zhang ; Quan Z. Sheng ; Jey Han Lau ; Ermyas Abebe

【Abstract】: Programming community-based question-answering (PCQA) websites such as Stack Overflow enable programmers to find working solutions to their questions. Despite detailed posting guidelines, duplicate questions that have been answered are frequently created. To tackle this problem, Stack Overflow provides a mechanism for reputable users to manually mark duplicate questions. This is a laborious effort, and leads to many duplicate questions remain undetected. Existing duplicate detection methodologies from traditional community based question-answering (CQA) websites are difficult to be adopted directly to PCQA, as PCQA posts often contain source code which is linguistically very different from natural languages. In this paper, we propose a methodology designed for the PCQA domain to detect duplicate questions. We model the detection as a classification problem over question pairs. To extract features for question pairs, our methodology leverages continuous word vectors from the deep learning literature, topic model features and phrases pairs that co-occur frequently in duplicate questions mined using machine translation systems. These features capture semantic similarities between questions and produce a strong performance for duplicate detection. Experiments on a range of real-world datasets demonstrate that our method works very well; in some cases over 30% improvement compared to state-of-the-art benchmarks. As a product of one of the proposed features, the association score feature, we have mined a set of associated phrases from duplicate questions on Stack Overflow and open the dataset to the public.

【Keywords】: association rules; classification; community-based question answering; latent semantics; question quality

7B: Privacy 1 4

127. How Public Is My Private Life?: Privacy in Online Dating.

Paper Link】 【Pages】:1231-1240

【Authors】: Camille Cobb ; Tadayoshi Kohno

【Abstract】: Online dating services let users expand their dating pool beyond their social network and specify important characteristics of potential partners. To assess compatibility, users share personal information -- e.g., identifying details or sensitive opinions about sexual preferences or worldviews -- in profiles or in one-on-one communication. Thus, participating in online dating poses inherent privacy risks. How people reason about these privacy risks in modern online dating ecosystems has not been extensively studied. We present the results of a survey we designed to examine privacy-related risks, practices, and expectations of people who use or have used online dating, then delve deeper using semi-structured interviews. We additionally analyzed 400 Tinder profiles to explore how these issues manifest in practice. Our results reveal tensions between privacy and competing user values and goals, and we demonstrate how these results can inform future designs.

【Keywords】: human values; information disclosure; online dating; privacy; security

128. Trajectory Recovery From Ash: User Privacy Is NOT Preserved in Aggregated Mobility Data.

Paper Link】 【Pages】:1241-1250

【Authors】: Fengli Xu ; Zhen Tu ; Yong Li ; Pengyu Zhang ; Xiaoming Fu ; Depeng Jin

【Abstract】: Human mobility data has been ubiquitously collected through cellular networks and mobile applications, and publicly released for academic research and commercial purposes for the last decade. Since releasing individual's mobility records usually gives rise to privacy issues, datasets owners tend to only publish aggregated mobility data, such as the number of users covered by a cellular tower at a specific timestamp, which is believed to be sufficient for preserving users' privacy. However, in this paper, we argue and prove that even publishing aggregated mobility data could lead to privacy breach in individuals' trajectories. We develop an attack system that is able to exploit the uniqueness and regularity of human mobility to recover individual's trajectories from the aggregated mobility data without any prior knowledge. By conducting experiments on two real-world datasets collected from both mobile application and cellular network, we reveal that the attack system is able to recover users' trajectories with accuracy about 73%~91% at the scale of tens of thousands to hundreds of thousands users, which indicates severe privacy leakage in such datasets. Through the investigation on aggregated mobility data, our work recognizes a novel privacy problem in publishing statistic data, which appeals for immediate attentions from both academy and industry.

【Keywords】: aggregated mobility data; statistic data privacy; trajectory privacy

129. The Onions Have Eyes: A Comprehensive Structure and Privacy Analysis of Tor Hidden Services.

Paper Link】 【Pages】:1251-1260

【Authors】: Iskander Sánchez-Rola ; Davide Balzarotti ; Igor Santos

【Abstract】: Tor is a well known and widely used darknet, known for its anonymity. However, while its protocol and relay security have already been extensively studied, to date there is no comprehensive analysis of the structure and privacy of its Web Hidden Service. To fill this gap, we developed a dedicated analysis platform and used it to crawl and analyze over 1.5M URLs hosted in 7257 onion domains. For each page we analyzed its links, resources, and redirections graphs, as well as the language and category distribution. According to our experiments, Tor hidden services are organized in a sparse but highly connected graph, in which around 10% of the onions sites are completely isolated. Our study also measures for the first time the tight connection that exists between Tor hidden services and the Surface Web. In fact, more than 20% of the onion domains we visited imported resources from the Surface Web, and links to the Surface Web are even more prevalent than to other onion domains. Finally, we measured for the first time the prevalence and the nature of web tracking in Tor hidden services, showing that, albeit not as widespread as in the Surface Web, tracking is notably present also in the Dark Web: more than 40% of the scripts are used for this purpose, with the 70% of them being completely new tracking scripts unknown by existing anti-tracking solutions.

【Keywords】: browser security & privacy; dark web; privacy

130. De-anonymizing Web Browsing Data with Social Networks.

Paper Link】 【Pages】:1261-1269

【Authors】: Jessica Su ; Ansh Shukla ; Sharad Goel ; Arvind Narayanan

【Abstract】: Can online trackers and network adversaries de-anonymize web browsing data readily available to them? We show---theoretically, via simulation, and through experiments on real user data---that de-identified web browsing histories can be linked to social media profiles using only publicly available data. Our approach is based on a simple observation: each person has a distinctive social network, and thus the set of links appearing in one's feed is unique. Assuming users visit links in their feed with higher probability than a random user, browsing histories contain tell-tale marks of identity. We formalize this intuition by specifying a model of web browsing behavior and then deriving the maximum likelihood estimate of a user's social profile. We evaluate this strategy on simulated browsing histories, and show that given a history with 30 links originating from Twitter, we can deduce the corresponding Twitter profile more than 50% of the time.To gauge the real-world effectiveness of this approach, we recruited nearly 400 people to donate their web browsing histories, and we were able to correctly identify more than 70% of them. We further show that several online trackers are embedded on sufficiently many websites to carry out this attack with high accuracy. Our theoretical contribution applies to any type of transactional data and is robust to noisy observations, generalizing a wide range of previous de-anonymization attacks. Finally, since our attack attempts to find the correct Twitter profile out of over 300 million candidates, it is---to our knowledge---the largest-scale demonstrated de-anonymization to date.

【Keywords】: de-anonymization; deanonymization; privacy; social network; social networking; social networks; twitter

Paper Link】 【Pages】:1271-1279

【Authors】: Chenyan Xiong ; Russell Power ; Jamie Callan

【Abstract】: This paper introduces Explicit Semantic Ranking (ESR), a new ranking technique that leverages knowledge graph embedding. Analysis of the query log from our academic search engine, SemanticScholar.org, reveals that a major error source is its inability to understand the meaning of research concepts in queries. To addresses this challenge, ESR represents queries and documents in the entity space and ranks them based on their semantic connections from their knowledge graph embedding. Experiments demonstrate ESR's ability in improving Semantic Scholar's online production system, especially on hard queries where word-based ranking fails.

【Keywords】: academic search; entity-based ranking; knowledge graph

Paper Link】 【Pages】:1281-1290

【Authors】: Sourav Dutta ; Pratik Nayek ; Arnab Bhattacharya

【Abstract】: Labeled graphs provide a natural way of representing entities, relationships and structures within real datasets such as knowledge graphs and protein interactions. Applications such as question answering, semantic search, and motif discovery entail efficient approaches for subgraph matching involving both label and structural similarities. Given the NP-completeness of subgraph isomorphism and the presence of noise, approximate graph matching techniques are required to handle queries in a robust and real-time manner. This paper presents a novel technique to characterize the subgraph similarity based on statistical significance captured by chi-square statistic. The statistical significance model takes into account the background structure and label distribution in the neighborhood of vertices to obtain the best matching subgraph and, therefore, robustly handles partial label and structural mismatches. Based on the model, we propose two algorithms, VELSET and NAGA, that, given a query graph, return the top-k most similar subgraphs from a (large) database graph. While VELSET is more accurate and robust to noise, NAGA is faster and more applicable for scenarios with low label noise. Experiments on large real-life graph datasets depict significant improvements in terms of accuracy and running time in comparison to the state-of-the-art methods.

【Keywords】: approximate subgraph matching; chi-square statistic; statistical significance; subgraph similarity

133. Learning to Match using Local and Distributed Representations of Text for Web Search.

Paper Link】 【Pages】:1291-1299

【Authors】: Bhaskar Mitra ; Fernando Diaz ; Nick Craswell

【Abstract】: Models such as latent semantic analysis and those based on neural embeddings learn distributed representations of text, and match the query against the document in the latent semantic space. In traditional information retrieval models, on the other hand, terms have discrete or local representations, and the relevance of a document is determined by the exact matches of query terms in the body text. We hypothesize that matching with distributed representations complements matching with traditional local representations, and that a combination of the two is favourable. We propose a novel document ranking model composed of two separate deep neural networks, one that matches the query and the document using a local representation, and another that matches the query and the document using learned distributed representations. The two networks are jointly trained as part of a single neural network. We show that this combination or 'duet' performs significantly better than either neural network individually on a Web page ranking task, and significantly outperforms traditional baselines and other recently proposed models based on neural networks.

【Keywords】: document ranking; information retrieval; neural networks

134. Using the Delay in a Treatment Effect to Improve Sensitivity and Preserve Directionality of Engagement Metrics in A/B Experiments.

Paper Link】 【Pages】:1301-1310

【Authors】: Alexey Drutsa ; Gleb Gusev ; Pavel Serdyukov

【Abstract】: State-of-the-art user engagement metrics (such as session-per-user) are widely used by modern Internet companies to evaluate ongoing updates of their web services via A/B testing. These metrics are predictive of companies' long-term goals, but suffer from this property due to slow user learning of an evaluated treatment, which causes a delay in the treatment effect. That, in turn, causes low sensitivity of the metrics and requires to conduct A/B experiments with longer duration or larger set of users from a limited traffic. In this paper, we study how the delay property of user learning can be used to improve sensitivity of several popular metrics of user loyalty and activity. We consider both novel and previously known modifications of these metrics, including different methods of quantifying a trend in a metric's time series and delaying its calculation. These modifications are analyzed with respect to their sensitivity and directionality on a large set of A/B tests run on real users of Yandex. We discover that mostly loyalty metrics gain profit from the considered modifications. We find such modifications that both increase sensitivity of the source metric and are consistent with the sign of its average treatment effect as well.

【Keywords】: a/b test; delay; dft; directionality; online controlled experiment; quality metric; sensitivity; time series; trend; user engagement

7D: User Modeling 3 4

135. GB-CENT: Gradient Boosted Categorical Embedding and Numerical Trees.

Paper Link】 【Pages】:1311-1319

【Authors】: Qian Zhao ; Yue Shi ; Liangjie Hong

【Abstract】: Latent factor models and decision tree based models are widely used in tasks of prediction, ranking and recommendation. Latent factor models have the advantage of interpreting categorical features by a low-dimensional representation, while such an interpretation does not naturally fit numerical features. In contrast, decision tree based models enjoy the advantage of capturing the nonlinear interactions of numerical features, while their capability of handling categorical features is limited by the cardinality of those features. Since in real-world applications we usually have both abundant numerical features and categorical features with large cardinality (e.g. geolocations, IDs, tags etc.), we design a new model, called GB-CENT, which leverages latent factor embedding and tree components to achieve the merits of both while avoiding their demerits. With two real-world data sets, we demonstrate that GB-CENT can effectively (i.e. fast and accurately) achieve better accuracy than state-of-the-art matrix factorization, decision tree based models and their ensemble.

【Keywords】: decision trees; gradient boosting; large cardinality; low-dimensional embedding; matrix factorization; numerical and categorical features; recommender systems

136. Decoupled Collaborative Ranking.

Paper Link】 【Pages】:1321-1329

【Authors】: Jun Hu ; Ping Li

【Abstract】: We propose a new pointwise collaborative ranking approach for recommender systems, which focuses on improving ranking performance at the top of recommended list. Our approach is different from common pointwise methods in that we consider user ratings as ordinal rather than viewing them as real values or categorical labels. In addition, positively rated items (higher rating scores) are emphasized more in our method in order to improve the performance at the top of recommended list. In our method, user ratings are modeled based on an ordinal classification framework, which is made up of a sequence of binary classification problems in which one discriminates between ratings no less than a specific ordinal category c and ratings below that category ({̥ c}vs.{< c}). The results are used subsequently to generate a ranking score that puts higher weights on the output of those binary classification problems concerning high values of c so as to improve the ranking performance at the top of list. As our method crucially builds on a decomposition into binary classification problems, we call our proposed method as Decoupled Collaborative Ranking (DCR). As an extension, we impose pairwise learning on DCR, which yields further improvement with regard to the ranking performance of the proposed method. We demonstrate through extensive experiments on benchmark datasets that our method outperforms many considered state-of-the-art collaborative ranking algorithms in terms of the NDCG metric.

【Keywords】: collaborative ranking; matrix factorization; recommender systems

137. LETOR Methods for Unsupervised Rank Aggregation.

Paper Link】 【Pages】:1331-1340

【Authors】: Avradeep Bhowmik ; Joydeep Ghosh

【Abstract】: Learning the true rank ordering among objects by aggregating a set of expert opinion rank order lists is an important and ubiquitous problem in many applications ranging from social choice theory to recommendation systems and search aggregation. We study the problem of unsupervised rank aggregation where no ground truth ordering information in available, neither about the true preference ordering among any set of objects nor about the quality of individual rank lists. Aggregating the often inconsistent and poor quality rank lists in such an unsupervised manner is a highly challenging problem, and standard consensus-based methods fall short in terms of both quality and scalability. In this manuscript we propose a novel framework to bypass these issues by using object attributes to augment the standard rank aggregation framework. We design algorithms that learn joint models on both rank lists and object features to obtain an aggregated rank ordering that is more accurate and robust, and also helps weed out rank lists of dubious validity. We validate our techniques on synthetic datasets where our algorithm is able to estimate the true rank ordering even when the rank lists are corrupted. Experiments on three real datasets, MQ2008, MQ2007 and OHSUMED, show that using object features can result in significant improvement in performance over existing rank aggregation methods that do not use object information. Furthermore, when at least some of the rank lists are of high quality, our methods are able to effectively exploit such information to output an aggregated rank ordering of high accuracy.

【Keywords】: learning to rank; monotone retargeting; rank aggregation; unsupervised methods

138. A Generic Coordinate Descent Framework for Learning from Implicit Feedback.

Paper Link】 【Pages】:1341-1350

【Authors】: Immanuel Bayer ; Xiangnan He ; Bhargav Kanagal ; Steffen Rendle

【Abstract】: In recent years, interest in recommender research has shifted from explicit feedback towards implicit feedback data. A diversity of complex models has been proposed for a wide variety of applications. Despite this, learning from implicit feedback is still computationally challenging. So far, most work relies on stochastic gradient descent (SGD) solvers which are easy to derive, but in practice challenging to apply, especially for tasks with many items. For the simple matrix factorization model, an efficient coordinate descent (CD) solver has been previously proposed. However, efficient CD approaches have not been derived for more complex models. In this paper, we provide a new framework for deriving efficient CD algorithms for complex recommender models. We identify and introduce the property of k-separable models. We show that k-separability is a sufficient property to allow efficient optimization of implicit recommender problems with CD. We illustrate this framework on a variety of state-of-the-art models including factorization machines and Tucker decomposition. To summarize, our work provides the theory and building blocks to derive efficient implicit CD algorithms for complex recommender models.

【Keywords】: coordinate descent; factorization machine; implicit feedback; matrix factorization; recommender systems

7G: Social 4 4

139. On Analyzing User Topic-Specific Platform Preferences Across Multiple Social Media Sites.

Paper Link】 【Pages】:1351-1359

【Authors】: Roy Ka-Wei Lee ; Tuan-Anh Hoang ; Ee-Peng Lim

【Abstract】: Topic modeling has traditionally been studied for single text collections and applied to social media data represented in the form of text documents. With the emergence of many social media platforms, users find themselves using different social media for posting content and for social interaction. While many topics may be shared across social media platforms, users typically show preferences of certain social media platform(s) over others for certain topics. Such platform preferences may even be found at the individual level. To model social media topics as well as platform preferences of users, we propose a new topic model known as MultiPlatform-LDA (MultiLDA). Instead of just merging all posts from different social media platforms into a single text collection, MultiLDA keeps one text collection for each social media platform but allowing these platforms to share a common set of topics. MultiLDA further learns the user-specific platform preferences for each topic. We evaluate MultiLDA against TwitterLDA, the state-of-the-art method for social media content modeling, on two aspects: (i) the effectiveness in modeling topics across social media platforms, and (ii) the ability to predict platform choices for each post. We conduct experiments on three real-world datasets from Twitter, Instagram and Tumblr sharing a set of common users. Our experiments results show that the MultiLDA outperforms in both topic modeling and platform choice prediction tasks. We also show empirically that among the three social media platforms, "Daily matters" and "Relationship matters" are dominant topics in Twitter, "Social gathering", "Outing" and "Fashion" are dominant topics in Instagram, and "Music", "Entertainment" and "Fashion" are dominant topics in Tumblr.

【Keywords】: multiple social networks; topic modeling; user preference

140. Competition and Selection Among Conventions.

Paper Link】 【Pages】:1361-1370

【Authors】: Rahmtin Rotabi ; Cristian Danescu-Niculescu-Mizil ; Jon M. Kleinberg

【Abstract】: In many domains, a latent competition among different conventions determines which one will come to dominate. One sees such effects in the success of community jargon, of competing frames in political rhetoric, or of terminology in technical contexts. These effects have become widespread in the on-line domain, where the ease of information transmission makes them particularly forceful, and where the available data offers the potential to study competition among conventions at a fine-grained level. In analyzing the dynamics of conventions over time, however, even with detailed on-line data, one encounters two significant challenges. First, as conventions evolve, the underlying substance of their meaning tends to change as well; and such substantive changes confound investigations of social effects. Second, the selection of a convention takes place through the complex interactions of individuals within a community, and contention between the users of competing conventions plays a key role in the convention's evolution. Any analysis of the overall dynamics must take place in the presence of these two issues. In this work we study a setting in which we can cleanly track the competition among conventions while explicitly taking these sources of complexity into account. Our analysis is based on the spread of low-level authoring conventions in the e-print arXiv over 24 years and roughly a million posted papers: by tracking the spread of macros and other author-defined conventions, we are able to study conventions that vary even as the underlying meaning remains constant. We find that the interaction among co-authors over time plays a crucial role in the selection of conventions; the distinction between more and less experienced members of the community, and the distinction between conventions with visible versus invisible effects, are both central to the underlying processes. Through our analysis we make predictions at the population level about the ultimate success of different synonymous conventions over time --- and at the individual level about the outcome of ``fights'' between people over convention choices.

【Keywords】: arxiv; conventions; diffusion of information; innovations

141. Discussion Quality Diffuses in the Digital Public Square.

Paper Link】 【Pages】:1371-1380

【Authors】: George Berry ; Sean J. Taylor

【Abstract】: Studies of online social influence have demonstrated that friends have important effects on many types of behavior in a wide variety of settings. However, we know much less about how influence works among relative strangers in digital public squares, despite important conversations happening in such spaces. We present the results of a study on large public Facebook Pages where we randomly used two different methods---most recent and social feedback---to order comments on posts. We find that the social feedback condition results in higher quality viewed comments and response comments. After measuring the average quality of comments written by users before the study, we find that social feedback has a positive effect on response quality for both low and high quality commenters. We draw on a theoretical framework of social norms to explain this empirical result. In order to examine the influence mechanism further, we measure the similarity between comments viewed and written during the study, finding that similarity increases for the highest quality contributors under the social feedback condition. This suggests that, in addition to norms, some individuals may respond with increased relevance to high-quality comments.

【Keywords】: online discussions; comment ranking; social influence; social norms

142. When Confidence and Competence Collide: Effects on Online Decision-Making Discussions.

Paper Link】 【Pages】:1381-1390

【Authors】: Liye Fu ; Lillian Lee ; Cristian Danescu-Niculescu-Mizil

【Abstract】: Group discussions are a way for individuals to exchange ideas and arguments in order to reach better decisions than they could on their own. One of the premises of productive discussions is that better solutions will prevail, and that the idea selection process is mediated by the (relative) competence of the individuals involved. However, since people may not know their actual competence on a new task, their behavior is influenced by their self-estimated competence -- that is, their confidence -- which can be misaligned with their actual competence. Our goal in this work is to understand the effects of confidence-competence misalignment on the dynamics and outcomes of discussions. To this end, we design a large-scale natural setting, in the form of an online team-based geography game, that allows usto disentangle confidence from competence and thus separate their effects. We find that in task-oriented discussions, the more-confident individuals have a larger impact on the group's decisions even when these individuals are at the same level of competence as their teammates. Furthermore, this unjustified role of confidence in the decision-making process often leads teams to under-perform. We explore this phenomenon by investigating the effects of confidence on conversational dynamics. For example, we take up the question: do more-confident people introduce more ideas than the less-confident, or do they introduce the same number of ideas but their ideas get more uptake? Moreover, we show that the language people use is more predictive of a person's confidence level than their actual competence. This also suggests potential practical applications, given that in many settings, true competence cannot be assessed before the task is completed, whereas the conversation can be tracked during the course of the problem-solving process.

【Keywords】: collaboration; confidence; conversations; decision-making; group dynamics; ideas; linguistic; overconfidence; synergy; teams

7H: Web Mining 4

143. Ex Machina: Personal Attacks Seen at Scale.

Paper Link】 【Pages】:1391-1399

【Authors】: Ellery Wulczyn ; Nithum Thain ; Lucas Dixon

【Abstract】: The damage personal attacks cause to online discourse motivates many platforms to try to curb the phenomenon. However, understanding the prevalence and impact of personal attacks in online platforms at scale remains surprisingly difficult. The contribution of this paper is to develop and illustrate a method that combines crowdsourcing and machine learning to analyze personal attacks at scale. We show an evaluation method for a classifier in terms of the aggregated number of crowd-workers it can approximate. We apply our methodology to English Wikipedia, generating a corpus of over 100k high quality human-labeled comments and 63M machine-labeled ones from a classifier that is as good as the aggregate of 3 crowd-workers, as measured by the area under the ROC curve and Spearman correlation. Using this corpus of machine-labeled scores, our methodology allows us to explore some of the open questions about the nature of online personal attacks. This reveals that the majority of personal attacks on Wikipedia are not the result of a few malicious users, nor primarily the consequence of allowing anonymous contributions from unregistered users.

【Keywords】: online discussions; online harassment; wikipedia

144. Temporal Effects on Hashtag Reuse in Twitter: A Cognitive-Inspired Hashtag Recommendation Approach.

Paper Link】 【Pages】:1401-1410

【Authors】: Dominik Kowald ; Subhash Chandra Pujari ; Elisabeth Lex

【Abstract】: Hashtags have become a powerful tool in social platforms such as Twitter to categorize and search for content, and to spread short messages across members of the social network. In this paper, we study temporal hashtag usage practices in Twitter with the aim of designing a cognitive-inspired hashtag recommendation algorithm we call BLLi,s. Our main idea is to incorporate the effect of time on (i) individual hashtag reuse (i.e., reusing own hashtags), and (ii) social hashtag reuse (i.e., reusing hashtags, which has been previously used by a followee) into a predictive model. For this, we turn to the Base-Level Learning (BLL) equation from the cognitive architecture ACT-R, which accounts for the time-dependent decay of item exposure in human memory. We validate BLLI,S using two crawled Twitter datasets in two evaluation scenarios. Firstly, only temporal usage patterns of past hashtag assignments are utilized and secondly, these patterns are combined with a content-based analysis of the current tweet. In both evaluation scenarios, we find not only that temporal effects play an important role for both individual and social hashtag reuse but also that our BLLI,S approach provides significantly better prediction accuracy and ranking results than current state-of-the-art hashtag recommendation methods.

【Keywords】: act-r; bll equation; hashtag recommendation; hashtag reuse prediction; hashtag usage recency; hashtags; recommender systems; temporal dynamics; tf-idf; twitter

145. Exploring Rated Datasets with Rating Maps.

Paper Link】 【Pages】:1411-1419

【Authors】: Sihem Amer-Yahia ; Sofia Kleisarchaki ; Naresh Kumar Kolloju ; Laks V. S. Lakshmanan ; Ruben H. Zamar

【Abstract】: Online rated datasets have become a source for large-scale population studies for analysts and a means for end-users to achieve routine tasks such as finding a book club. Existing systems however only provide limited insights into the opinions of different segments of the rater population. In this paper, we develop a framework for finding and exploring population segments and their opinions. We propose rating maps, a collection of (population segment, rating distribution) pairs, where a segment, e.g., {18-29 year old males in CA} has a rating distribution in the form of a histogram that aggregates its ratings for a set of items (e.g., movies starring Russel Crowe). We formalize the problem of building rating maps dynamically given desired input distributions. Our problem raises two challenges: (i) the choice of an appropriate measure for comparing rating distributions, and (ii) the design of efficient algorithms to find segments. We show that the Earth Mover's Distance (EMD) is well-adapted to comparing rating distributions and prove that finding segments whose rating distribution is close to input ones is NP-complete. We propose an efficient algorithm for building Partition Decision Trees and heuristics for combining the resulting partitions to further improve their quality. Our experiments on real and synthetic datasets validate the utility of rating maps for both analysts and end-users.

【Keywords】: earth's mover distance; partition decision tree; rated datasets; rating distribution comparison

146. Modeling the Dynamics of Learning Activity on the Web.

Paper Link】 【Pages】:1421-1430

【Authors】: Charalampos Mavroforakis ; Isabel Valera ; Manuel Gomez-Rodriguez

【Abstract】: People are increasingly relying on social media and the Web to find solutions to their problems in a wide range of domains. In this setting, closely related problems often lead to the same characteristic learning pattern --- people sharing a similar problem visit closely related pieces of information, perform almost identical queries or, more generally, take a series of similar actions at a similar pace. In this paper, we introduce a novel modeling framework for clustering continuous-time grouped streaming data, the Hierarchical Dirichlet Hawkes process (HDHP), which allows us to automatically uncover a wide variety of learning patterns from detailed traces of learning activity. Our model allows for efficient inference, scaling to millions of actions and thousands of users. Experiments on real data from Stack Overflow reveal that our framework recovers meaningful learning patterns, accurately tracks users' interests and goals over time and achieves better predictive performance than the state of the art.

【Keywords】: continuous-time data clustering; learning activity modeling; user interest tracking

7I: Graph Algorithms 4

147. ESCAPE: Efficiently Counting All 5-Vertex Subgraphs.

Paper Link】 【Pages】:1431-1440

【Authors】: Ali Pinar ; C. Seshadhri ; Vaidyanathan Vishal

【Abstract】: Counting the frequency of small subgraphs is a fundamental technique in network analysis across various domains, most notably in bioinformatics and social networks. The special case of triangle counting has received much attention. Getting results for 4-vertex or 5-vertex patterns is highly challenging, and there are few practical results known that can scale to massive sizes. We introduce an algorithmic framework that can be adopted to count any small pattern in a graph and apply this framework to compute exact counts for all 5-vertex subgraphs. Our framework is built on cutting a pattern into smaller ones, and using counts of smaller patterns to get larger counts. Furthermore, we exploit degree orientations of the graph to reduce runtimes even further. These methods avoid the combinatorial explosion that typical subgraph counting algorithms face. We prove that it suffices to enumerate only four specific subgraphs (three of them have less than 5 vertices) to exactly count all 5-vertex patterns. We perform extensive empirical experiments on a variety of real-world graphs. We are able to compute counts of graphs with tens of millions of edges in minutes on a commodity machine. To the best of our knowledge, this is the first practical algorithm for 5-vertex pattern counting that runs at this scale. A stepping stone to our main algorithm is a fast method for counting all 4-vertex patterns. This algorithm is typically ten times faster than the state of the art 4-vertex counters.

【Keywords】: graph orientations; motif analysis; subgraph counting

148. The k-peak Decomposition: Mapping the Global Structure of Graphs.

Paper Link】 【Pages】:1441-1450

【Authors】: Priya Govindan ; Chenghong Wang ; Chumeng Xu ; Hongyu Duan ; Sucheta Soundarajan

【Abstract】: The structure of real-world complex networks has long been an area of interest, and one common way to describe the structure of a network has been with the k-core decomposition. The core number of a node can be thought of as a measure of its centrality and importance, and is used by applications such as community detection, understanding viral spreads, and detecting fraudsters. However, we observe that the k-core decomposition suffers from an important flaw: namely, it is calculated globally, and so if the network contains distinct regions of different densities, the sparser among these regions may be neglected. To resolve this issue, we propose the k-peak graph decomposition method, based on the k-core algorithm, which finds the centers of distinct regions in the graph. Our contributions are as follows: (1) We present a novel graph decomposition- the k-peak decomposition- and corresponding algorithm, and perform a theoretical analysis of its properties. (2) We describe a new visualization method, the "Mountain Plot", which can be used to better understand the global structure of a graph. (3) We perform an extensive empirical analysis of real-world graphs, including technological, social, biological, and collaboration graphs, and show how the k-peak decomposition gives insight into the structures of these graphs. (4) We demonstrate the advantage of using the k-peak decomposition in various applications, including community detection, contagion and identifying essential proteins.

【Keywords】: graph visualization; graphs; k-core; k-peak

149. Scalable Motif-aware Graph Clustering.

Paper Link】 【Pages】:1451-1460

【Authors】: Charalampos E. Tsourakakis ; Jakub Pachocki ; Michael Mitzenmacher

【Abstract】: We develop new methods based on graph motifs for graph clustering, allowing more efficient detection of communities within networks. We focus on triangles within graphs, but our techniques extend to other clique motifs as well. Our intuition, which has been suggested but not formalized similarly in previous works, is that triangles are a better signature of community than edges. We therefore generalize the notion of conductance for a graph to triangle conductance, where the edges are weighted according to the number of triangles containing the edge. This methodology allows us to develop variations of several existing clustering techniques, including spectral clustering, that minimize triangles split by the cluster instead of edges cut by the cluster. We provide theoretical results in a planted partition model to demonstrate the potential for triangle conductance in clustering problems. We then show experimentally the effectiveness of our methods to multiple applications in machine learning and graph mining.

【Keywords】: community detection; expanders; graph clustering; large-scale graph mining

150. Indexing Public-Private Graphs.

Paper Link】 【Pages】:1461-1470

【Authors】: Aaron Archer ; Silvio Lattanzi ; Peter Likarish ; Sergei Vassilvitskii

【Abstract】: We consider the reachability indexing problem for private-public directed graphs. In these graphs nodes come in three flavors: public--nodes visible to all users, private--nodes visible to a specific set of users, and protected--nodes visible to any user who can see at least one of the node's parents. We are interested in computing the set of nodes visible to a specific user online. There are two obvious algorithms: precompute the result for every user, or run a reachability algorithm at query time. This paper explores the trade-off between these two strategies. Our approach is to identify a set of additional visible seed nodes for each user. The online reachability algorithm explores the graph starting at these nodes. We first formulate the problem as asymmetric k-center with outliers, and then give an efficient and practical algorithm. We prove new theoretical guarantees for this problem and show empirically that it performs very well in practice.

【Keywords】: community detection

8B: Privacy 2 4

151. Pinning Down Abuse on Google Maps.

Paper Link】 【Pages】:1471-1479

【Authors】: Danny Yuxing Huang ; Doug Grundman ; Kurt Thomas ; Abhishek Kumar ; Elie Bursztein ; Kirill Levchenko ; Alex C. Snoeren

【Abstract】: In this paper, we investigate a new form of blackhat search engine optimization that targets local listing services like Google Maps. Miscreants register abusive business listings in an attempt to siphon search traffic away from legitimate businesses and funnel it to deceptive service industries---such as unaccredited locksmiths---or to traffic-referral scams, often for the restaurant and hotel industry. In order to understand the prevalence and scope of this threat, we obtain access to over a hundred-thousand business listings on Google Maps that were suspended for abuse. We categorize the types of abuse affecting Google Maps; analyze how miscreants circumvented the protections against fraudulent business registration such as postcard mail verification; identify the volume of search queries affected; and ultimately explore how miscreants generated a profit from traffic that necessitates physical proximity to the victim. This physical requirement leads to unique abusive behaviors that are distinct from other online fraud such as pharmaceutical and luxury product scams.

【Keywords】: abuse; affiliate fraud; local listings; online map

152. Extended Tracking Powers: Measuring the Privacy Diffusion Enabled by Browser Extensions.

Paper Link】 【Pages】:1481-1490

【Authors】: Oleksii Starov ; Nick Nikiforakis

【Abstract】: Users have come to rely on browser extensions to realize features that are not implemented by browser vendors. Extensions offer users the ability to, among others, block ads, de-clutter websites, enrich pages with third-party content, and take screenshots. At the same time, because of their privileged position inside a user's browser, extensions have access to content and functionality that is not available to webpages, such as, the ability to conduct and read cross-origin requests, as well as get access to a browser's history and cookie jar. In this paper, we report on the first large-scale study of privacy leakage enabled by extensions. By using dynamic analysis and simulated user interactions, we investigate the leaking happening by the 10,000 most popular browser extensions of Google Chrome and find that a non-negligible fraction leaks sensitive information about the user's browsing habits, such as, their browsing history and search-engine queries. We identify common ways that extensions use to obfuscate this leakage and discover that, while some leakage happens on purpose, a large fraction of it is accidental because of the way that extensions attempt to introduce third-party content to a page's DOM. To counter the inference of a user's interests and private information enabled by this leakage, we design, implement, and evaluate BrowsingFog, a browser extension that automatically browses the web in a way that conceals a user's true interests, from a vantage point of history-stealing, third-party trackers.

【Keywords】: browser extensions; privacy diffusion; web tracking

Paper Link】 【Pages】:1491-1500

【Authors】: Li Chang ; Hsu-Chun Hsiao ; Wei Jeng ; Tiffany Hyun-Jin Kim ; Wei-Hsi Lin

【Abstract】: URL redirection is a popular technique that automatically navigates users to an intended destination webpage with- out user awareness. However, such a seemingly advantageous feature may offer inadequate protection from security vulnerabilities unless every redirection is performed over HTTPS. Even worse, as long as the final redirection to a website is performed over HTTPS, the browser's URL bar indicates that the website is secure regardless of the security of prior redirections, which may provide users with a false sense of security. This paper reports a well-rounded investigation to analyze the wellness of URL redirection security. As an initial large-scale investigation, we screened the integrity and consistency of URL redirections for the Alexa top one million (1M) websites, and further examined 10,000 (10K) websites with their login features. Our results suggest that 1) the majority (83.3% in the 1M dataset and 78.6% in the 10K dataset) of redirection trails among web- sites that support only HTTPS are vulnerable to attacks, and 2) current incoherent practices (e.g., naked domains and www subdomains being redirected to different destinations with varying security levels) undermine the security guarantees provided by HTTPS and HSTS.

【Keywords】: hsts; https; redirection trail; ssl/tls; url redirection

154. Some Recipes Can Do More Than Spoil Your Appetite: Analyzing the Security and Privacy Risks of IFTTT Recipes.

Paper Link】 【Pages】:1501-1510

【Authors】: Milijana Surbatovich ; Jassim Aljuraidan ; Lujo Bauer ; Anupam Das ; Limin Jia

【Abstract】: The use of end-user programming, such as if-this-then-that (IFTTT), is becoming increasingly common. Services like IFTTT allow users to easily create new functionality by connecting arbitrary Internet-of-Things (IoT) devices and online services using simple if-then rules, commonly known as recipes. However, such convenience at times comes at the cost of security and privacy risks for end users. To gain an in-depth understanding of the potential security and privacy risks, we build an information-flow model to analyze how often IFTTT recipes involve potential integrity or secrecy violations. Our analysis finds that around 50% of the 19,323 unique recipes we examined are potentially unsafe, as they contain a secrecy violation, an integrity violation, or both. We next categorize the types of harm that these potentially unsafe recipes can cause to users. After manually examining a random selection of potentially unsafe recipes, we find that recipes can not only lead to harms such as personal embarrassment but can also be exploited by an attacker, e.g., to distribute malware or carry out denial-of-service attacks. The use of IoT devices and services like IFTTT is expected only to grow in the near future; our analysis suggests users need to be both informed about and protected from these emerging threats to which they could be unwittingly exposing themselves.

【Keywords】: end-user programming; ifttt service; information-flow; internet of things (iot)

Paper Link】 【Pages】:1511-1520

【Authors】: Qingyao Ai ; Susan T. Dumais ; Nick Craswell ; Daniel J. Liebling

【Abstract】: As the number of email users and messages continues to grow, search is becoming more important for finding information in personal archives. In spite of its importance, email search is much less studied than web search, particularly using large-scale behavioral log analysis. In this paper we report the results of a large-scale log analysis of email search and complement this with a survey to better understand email search intent and success. We characterize email search behaviors and highlight differences from web search. When searching for email, people know many attributes about what they are looking for; they often look for specific known items; their queries are shorter and they click on fewer items than in web search. Although repeat queries are common in both email and web search, repeat visits to the same search result are much less common in email search suggesting that the same query is used for different search intents over time. We consider search intent from multiple angles. In email search logs, we find that people use email search not just to find information but also to perform tasks such as cleanup or organization, and that the distribution of actions they perform depends on the type of query. In our survey, people reported that they looked for specific information in both email search and web search, but they were much less likely to search for general information on a topic in email. The differences in overall behavior, re-finding patterns and search intents we observed between email and web search have important implications for the design of email search algorithms and interfaces.

【Keywords】: email search; query log analysis; re-finding; user evaluation

156. Template Induction over Unstructured Email Corpora.

Paper Link】 【Pages】:1521-1530

【Authors】: Julia Proskurnia ; Marc-Allen Cartright ; Lluis Garcia Pueyo ; Ivo Krka ; James Bradley Wendt ; Tobias Kaufmann ; Balint Miklos

【Abstract】: Unsupervised template induction over email data is a central component in applications such as information extraction, document classification, and auto-reply. The benefits of automatically generating such templates are known for structured data, e.g. machine generated HTML emails. However much less work has been done in performing the same task over unstructured email data. We propose a technique for inducing high quality templates from plain text emails at scale based on the suffix array data structure. We evaluate this method against an industry-standard approach for finding similar content based on shingling, running both algorithms over two corpora: a synthetically created email corpus for a high level of experimental control, as well as user-generated emails from the well-known Enron email corpus. Our experimental results show that the proposed method is more robust to variations in cluster quality than the baseline and templates contain more text from the emails, which would benefit extraction tasks by identifying transient parts of the emails. Our study indicates templates induced using suffix arrays contain approximately half as much noise (measured as entropy) as templates induced using shingling. Furthermore, the suffix array approach is substantially more scalable, proving to be an order of magnitude faster than shingling even for modestly-sized training clusters. Public corpus analysis shows that email clusters contain on average 4 segments of common phrases, where each of the segments contains on average 9 words, thus showing that templatization could help users reduce the email writing effort by an average of 35 words per email in an assistance or auto-reply related task.

【Keywords】: enron corpus; fixed phrase extraction; human-generated email; structural template; suffix array generalization; templatization

157. Situational Context for Ranking in Personal Search.

Paper Link】 【Pages】:1531-1540

【Authors】: Hamed Zamani ; Michael Bendersky ; Xuanhui Wang ; Mingyang Zhang

【Abstract】: Modern search engines leverage a variety of sources, beyond the conventional query-document content similarity, to improve their ranking performance. Among them, query context has attracted attention in prior work. Previously, query context was mainly modeled by user search history, either long-term or short-term, to help the ranking of future queries. In this paper, we focus on situational context, i.e., the contextual features of the current search request that are independent from both query content and user history. As an example, situational context can depend on search request time and location. We propose two context-aware ranking models based on neural networks. The first model learns a low-dimensional deep representation from the combination of contextual features. The second model extends the first one by leveraging binarized contextual features in addition to the high-level abstractions learned using a deep network. The existing context-aware ranking models are mainly based on search history, especially click data that can be gathered from the search engine logs. Although context-aware models have been widely explored in web search, their influence on search scenarios where click data is highly sparse is relatively unstudied. The focus of this paper, personal search (e.g., email search or on-device search), is one of such scenarios. We evaluate our models using the click data collected from one of the world's largest personal search engines. The experiments demonstrate that the proposed models significantly outperform the baselines which do not take context into account. These results indicate the importance of situational context for personal search, and open up a venue for further exploration of situational context in other search scenarios.

【Keywords】: contextual information; deep learning; email search; personal search; query context

Paper Link】 【Pages】:1541-1549

【Authors】: David Carmel ; Liane Lewin-Eytan ; Alex Libov ; Yoelle Maarek ; Ariel Raviv

【Abstract】: Web mail search is an emerging topic, which has not been the object of as many studies as traditional Web search. In particular, little is known about the characteristics of mail searchers and of the queries they issue. We study here the characteristics of Web mail searchers, and explore how demographic signals such as location, age, gender, and inferred income, influence their search behavior. We try to understand for instance, whether women exhibit different mail search patterns than men, or whether senior people formulate more precise queries than younger people. We compare our results, obtained from the analysis of a Yahoo Web mail search query log, to similar work conducted in Web and Twitter search. In addition, we demonstrate the value of the user's personal query log, as well as of the global query log and of the demographic signals, in a key search task: dynamic query auto-completion. We discuss how going beyond users' personal query logs (their search history) significantly improves the quality of suggestions, in spite of the fact that a user's mailbox is perceived as being highly personal. In particular, we note the striking value of demographic features for queries relating to companies/organizations, thus verifying our assumption that query completion benefits from leveraging queries issued by ``people like me". We believe that demographics and other such global features can be leveraged in other mail applications, and hope that this work is a first step in this direction.

【Keywords】: mail search; mail search demographics; query auto-completion; query suggestion

159. Promoting Relevant Results in Time-Ranked Mail Search.

Paper Link】 【Pages】:1551-1559

【Authors】: David Carmel ; Liane Lewin-Eytan ; Alex Libov ; Yoelle Maarek ; Ariel Raviv

【Abstract】: Mail search has traditionally served time-ranked results, even if it has been shown that relevance ranking provides higher retrieval quality on average. Some Web mail services have recently started to provide relevance ranking options such as the relevance toggle in the search results page of Yahoo Mail, or the top results" section in Inbox by Gmail. Yet, ranking results by relevance is not accepted by all, either in mail search, or in in other domains such as social media, where it has even triggered some public outcry. Given the sensitivity of the topic, we propose here to investigate a mixed approach of promoting the most relevant results, to which we refer asheroes'', on top of time-ranked results. We argue that this approach represents a good compromise to mail searchers, supporting on one hand the time sorted paradigm they are familiar with, while being almost as effective as full relevance ranking view that Web mail users seem to be reluctant to adopt. We describe three hero-selection algorithms we have devised and the associated experiments we have conducted in Yahoo mail. We measure retrieval success via two metrics: MRR (Mean Reciprocal Rank) and Success@k, and verify agreement between these metrics and users' direct feedback. We demonstrate that supplementing time-sorted results with hero results leads to a higher MRR than the traditional time-sorted view. We additionally show that MRR better reflects users' perception of quality than Success@k. Finally, we report on online results following the successful launch of one of our hero-selection algorithms for all Yahoo enterprise mail users and a few million Yahoo Web mail users.

【Keywords】:

8D: User Modeling 4 4

160. AttriInfer: Inferring User Attributes in Online Social Networks Using Markov Random Fields.

Paper Link】 【Pages】:1561-1569

【Authors】: Jinyuan Jia ; Binghui Wang ; Le Zhang ; Neil Zhenqiang Gong

【Abstract】: In the attribute inference problem, we aim to infer users' private attributes (e.g., locations, sexual orientation, and interests) using their public data in online social networks. State-of-the-art methods leverage a user's both public friends and public behaviors (e.g., page likes on Facebook, apps that the user reviewed on Google Play) to infer the user's private attributes. However, these methods suffer from two key limitations: 1) suppose we aim to infer a certain attribute for a target user using a training dataset, they only leverage the labeled users who have the attribute, while ignoring the label information of users who do not have the attribute; 2) they are inefficient because they infer attributes for target users one by one. As a result, they have limited accuracies and applicability in real-world social networks. In this work, we propose AttriInfer, a new method to infer user attributes in online social networks. AttriInfer can leverage both friends and behaviors, as well as the label information of training users who have an attribute and who do not have the attribute. Specifically, we model a social network as a pairwise Markov Random Field (pMRF). Given a training dataset, which consists of some users who have a certain attribute and some users who do not have a certain attribute, we compute the posterior probability that a target user has the attribute and use the posterior probability to infer attributes. In the basic version of AttriInfer, we use Loopy Belief Propagation (LBP) to compute the posterior probability. However, LBP is not scalable to very large-scale real-world social networks and not guaranteed to converge. Therefore, we further optimize LBP to be scalable and guaranteed to converge. We evaluated our method and compare it with state-of-the-art methods using a real-world Google+ dataset with 5.7M users. Our results demonstrate that our method substantially outperforms state-of-the-art methods in terms of both accuracy and efficiency.

【Keywords】: attribute inference; machine learning as a privacy attack; privacy in online social networks

161. Neural Underpinnings of Website Legitimacy and Familiarity Detection: An fNIRS Study.

Paper Link】 【Pages】:1571-1580

【Authors】: Ajaya Neupane ; Nitesh Saxena ; Leanne M. Hirshfield

【Abstract】: In this paper, we study the neural underpinnings relevant to user-centered web security through the lens of functional near-infrared spectroscopy (fNIRS). Specifically, we design and conduct an fNIRS study to pursue a thorough investigation of users' processing of legitimate vs. illegitimate and familiar vs. unfamiliar websites. We pinpoint the neural activity in these tasks as well as the brain areas that control such activity. We show that, at the neurological level, users process the legitimate websites differently from the illegitimate websites when subject to phishing attacks. Similarly, we show that users exhibit marked differences in the way their brains process the previously familiar websites from unfamiliar websites. These findings have several defensive and offensive implications. In particular, we discuss how these differences may be used by the system designers in the future to differentiate between legitimate and illegitimate websites automatically based on neural signals. Similarly, we discuss the potential for future malicious attackers, with access to neural signals, in compromising the privacy of users by detecting whether a website is previously familiar or unfamiliar to the user. Compared to prior research, our novelty lies in several aspects. First, we employ a neuroimaging methodology (fNIRS) not tapped into by prior security research for the problem domain we are studying. Second, we provide a focused study design and comprehensive investigation of the neural processing underlying the specific tasks of legitimate vs. illegitimate and familiar vs. unfamiliar websites. Third, we use an experimental set-up much more amenable to real-world settings, compared to previous fMRI studies. Beyond these scientific innovations, our work also serves to corroborate and extend several of the findings of the prior literature with independent methodologies, tools, and settings.

【Keywords】: fnirs; phishing detection; privacy attacks

162. Probabilistic Visitor Stitching on Cross-Device Web Logs.

Paper Link】 【Pages】:1581-1589

【Authors】: Sungchul Kim ; Nikhil Kini ; Jay Pujara ; Eunyee Koh ; Lise Getoor

【Abstract】: Personalization -- the customization of experiences, interfaces, and content to individual users -- has catalyzed user growth and engagement for many web services. A critical prerequisite to personalization is establishing user identity. However the variety of devices, including mobile phones, appliances, and smart watches, from which users access web services from both anonymous and logged-in sessions poses a significant obstacle to user identification. The resulting entity resolution task of establishing user identity across devices and sessions is commonly referred to as ``visitor stitching.'' We introduce a general, probabilistic approach to visitor stitching using features and attributes commonly contained in web logs. Using web logs from two real-world corporate websites, we motivate the need for probabilistic models by quantifying the difficulties posed by noise, ambiguity, and missing information in deployment. Next, we introduce our approach using probabilistic soft logic (PSL), a statistical relational learning framework capable of capturing similarities across many sessions and enforcing transitivity. We present a detailed description of model features and design choices relevant to the visitor stitching problem. Finally, we evaluate our PSL model on binary classification performance for two real-world visitor stitching datasets. Our model demonstrates significantly better performance than several state-of-the-art classifiers, and we show how this advantage results from collective reasoning across sessions.

【Keywords】: cross-device users; personalization; visitor stitching

163. Why We Read Wikipedia.

Paper Link】 【Pages】:1591-1600

【Authors】: Philipp Singer ; Florian Lemmerich ; Robert West ; Leila Zia ; Ellery Wulczyn ; Markus Strohmaier ; Jure Leskovec

【Abstract】: Wikipedia is one of the most popular sites on the Web, with millions of users relying on it to satisfy a broad range of information needs every day. Although it is crucial to understand what exactly these needs are in order to be able to meet them, little is currently known about why users visit Wikipedia. The goal of this paper is to fill this gap by combining a survey of Wikipedia readers with a log-based analysis of user activity. Based on an initial series of user surveys, we build a taxonomy of Wikipedia use cases along several dimensions, capturing users' motivations to visit Wikipedia, the depth of knowledge they are seeking, and their knowledge of the topic of interest prior to visiting Wikipedia. Then, we quantify the prevalence of these use cases via a large-scale user survey conducted on live Wikipedia with almost 30,000 responses. Our analyses highlight the variety of factors driving users to Wikipedia, such as current events, media coverage of a topic, personal curiosity, work or school assignments, or boredom. Finally, we match survey responses to the respondents' digital traces in Wikipedia's server logs, enabling the discovery of behavioral patterns associated with specific use cases. For instance, we observe long and fast-paced page sequences across topics for users who are bored or exploring randomly, whereas those using Wikipedia for work or school spend more time on individual articles focused on topics such as science. Our findings advance our understanding of reader motivations and behavior on Wikipedia and can have implications for developers aiming to improve Wikipedia's user experience, editors striving to cater to their readers' needs, third-party services (such as search engines) providing access to Wikipedia content, and researchers aiming to build tools such as recommendation engines.

【Keywords】: log analysis; motivation; survey; wikipedia

8G: Social 5 4

164. Learning Personalized Preference of Strong and Weak Ties for Social Recommendation.

Paper Link】 【Pages】:1601-1610

【Authors】: Xin Wang ; Steven C. H. Hoi ; Martin Ester ; Jiajun Bu ; Chun Chen

【Abstract】: Recent years have seen a surge of research on social recommendation techniques for improving recommender systems due to the growing influence of social networks to our daily life. The intuition of social recommendation is that users tend to show affinities with items favored by their social ties due to social influence. Despite the extensive studies, no existing work has attempted to distinguish and learn the personalized preferences between strong and weak ties, two important terms widely used in social sciences, for each individual in social recommendation. In this paper, we first highlight the importance of different types of ties in social relations originated from social sciences, and then propose anovel social recommendation method based on a new Probabilistic Matrix Factorization model that incorporates the distinction of strong and weak ties for improving recommendation performance. The proposed method is capable of simultaneously classifying different types of social ties in a social network w.r.t. optimal recommendation accuracy, and learning a personalized tie type preference for each user in addition to other parameters. We conduct extensive experiments on four real-world datasets by comparing our method with state-of-the-art approaches, and find encouraging results that validate the efficacy of the proposed method in exploiting the personalized preferences of strong and weak ties for social recommendation.

【Keywords】: personalization; social recommendation; strong and weak ties; user behavior modeling

Paper Link】 【Pages】:1611-1619

【Authors】: Xiaokai Wei ; Linchuan Xu ; Bokai Cao ; Philip S. Yu

【Abstract】: Link Prediction has been an important task for social and information networks. Existing approaches usually assume the completeness of network structure. However, in many real-world networks, the links and node attributes can usually be partially observable. In this paper, we study the problem of Cross View Link Prediction (CVLP) on partially observable networks, where the focus is to recommend nodes with only links to nodes with only attributes (or vice versa). We aim to bridge the information gap by learning a robust consensus for link-based and attribute-based representations so that nodes become comparable in the latent space. Also, the link-based and attribute-based representations can lend strength to each other via this consensus learning. Moreover, attribute selection is performed jointly with the representation learning to alleviate the effect of noisy high-dimensional attributes. We present two instantiations of this framework with different loss functions and develop an alternating optimization framework to solve the problem. Experimental results on four real-world datasets show the proposed algorithm outperforms the baseline methods significantly for cross-view link prediction.

【Keywords】: information network; link prediction; partially observable network; social network

166. Semi-supervised Clustering in Attributed Heterogeneous Information Networks.

Paper Link】 【Pages】:1621-1629

【Authors】: Xiang Li ; Yao Wu ; Martin Ester ; Ben Kao ; Xin Wang ; Yudian Zheng

【Abstract】: A heterogeneous information network (HIN) is one whose nodes model objects of different types and whose links model objects' relationships. In many applications, such as social networks and RDF-based knowledge bases, information can be modeled as HINs. To enrich its information content, objects (as represented by nodes) in an HIN are typically associated with additional attributes. We call such an HIN an Attributed HIN or AHIN. We study the problem of clustering objects in an AHIN, taking into account objects' similarities with respect to both object attribute values and their structural connectedness in the network. We show how supervision signal, expressed in the form of a must-link set and a cannot-link set, can be leveraged to improve clustering results. We put forward the SCHAIN algorithm to solve the clustering problem. We conduct extensive experiments comparing SCHAIN with other state-of-the-art clustering algorithms and show that SCHAIN outperforms the others in clustering quality.

【Keywords】: attributed heterogeneous information network; network structure; object attributes; semi-supervised clustering

167. An Efficient Approach to Event Detection and Forecasting in Dynamic Multivariate Social Media Networks.

Paper Link】 【Pages】:1631-1639

【Authors】: Minglai Shao ; Jianxin Li ; Feng Chen ; Hongyi Huang ; Shuai Zhang ; Xunxun Chen

【Abstract】: Anomalous subgraph detection has been successfully applied to event detection in social media. However, the subgraph detection problembecomes challenging when the social media network incorporates abundant attributes, which leads to a multivariate network. The multivariate characteristic makes most existing methods incapable to tackle this problem effectively and efficiently, as it involves joint feature selection and subgraph detection that has not been well addressed in the current literature, especially, in the dynamic multivariate networks in which attributes evolve over time. This paper presents a generic framework, namely dynamic multivariate evolving anomalous subgraphs scanning (DMGraphScan), to addressthis problem in dynamic multivariate social media networks. We generalize traditional nonparametric statistics, and propose a new class of scan statistic functions for measuring the joint significance of evolving subgraphs and subsets of attributes to indicate the ongoing or forthcoming event in dynamic multivariate networks. We reformulate each scan statistic function as a sequence of subproblems with provable guarantees, and then propose an efficient approximation algorithm for tackling each subproblem. This algorithm resorts to the Lagrangian relaxation and a dynamic programming based on tree-shaped priors. As a case study, we conduct extensive experiments to demonstrate the performance of our proposed approach on two real-world applications (flu outbreak detection, haze detection) in different domains.

【Keywords】: approximation algorithm.; dynamic multivariate networks; evolving sub-graphs detection; feature selection; nonparametric statistics; social media