Nitish Upreti

A daydreamer exploring Algorithms, Software Systems Engineering and Data Science.

Kinara - Coke Studio

| Comments

The song Kinara by Atif Aslam and Riaz Ali Khan from Coke Studio Pakistan is simply amazing. I am in love with the part from 3:52 onwards. Listen to it here.

What do you do when you are in love with a song? Especially if it is one of those not-so-popular songs (at least in my friend circle)?

You write a blog post and hope that people stumble onto the page (thanks, Google) and share your love for the song. Say hi in the comments if you love the song! :)

Cheers!

Readings on Graph Mining

| Comments

I recently took a class on graph mining. In this post, I am aggregating all the lecture topics, which make up a full four months' worth of material. I hope this serves as a good archive of hand-picked topics from several different areas of graph mining.

Thanks to Dr. Kamesh Madduri for this course!

Introduction

Slides from Georgia Tech Network Science class

Link1 and Link2

Chapters 1 and 2 of Networks, Markets, Crowds book

Brandes et al., What is Network Science?

Estrada et al., Complex Networks: an invitation

Lecture blurbs of Barabasi’s Network Science class

Mislove et al., Measurement and Analysis of Online Social Networks, Proc. IMC 2007

Broder et al., Graph structure in the web, Proc. WWW 2000

Meusel et al., Graph structure in the web – revisited, Proc. WWW 2014

Centrality

Centrality chapters in Network Analysis: Methodological foundations book Link1 and Link2

Freeman’s paper on centrality in social networks

PageRank papers: Link1 and Link2

Gleich, PageRank beyond the web, 2014

The Wikipedia page on PageRank has some clarifications about the formula to use

Hoeffding’s inequality (used to analyze randomized algorithms)

Brandes’ Betweenness centrality algorithm paper

Faust, Centrality in affiliation networks

Papers/slides on Harmonic centrality by Boldi and Vigna: Link1 and Link2
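Since several of the readings above revolve around the PageRank formula, here is a minimal sketch of it as power iteration. This is an illustration of the basic idea only (toy three-node graph, the usual damping factor d = 0.85), not the implementation from any of the papers listed:

```python
# Minimal PageRank by power iteration -- an illustrative sketch of the
# formula, not an implementation from any paper above.
def pagerank(graph, d=0.85, iters=100):
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    for _ in range(iters):
        new = {v: (1 - d) / n for v in graph}
        for v, outs in graph.items():
            if outs:
                share = d * rank[v] / len(outs)
                for u in outs:
                    new[u] += share
            else:
                # dangling node: spread its rank uniformly
                for u in graph:
                    new[u] += d * rank[v] / n
        rank = new
    return rank

# Toy graph: a -> b, a -> c, b -> c, c -> a
pr = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

The ranks sum to 1, and node c, which has two in-links, ends up ranked above b, which has one.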

Community detection

Fortunato, Community detection in graphs, 2010

The paper that introduces the modularity measure: Newman, Fast algorithm for detecting community structure in networks, 2003

Newman presents the divisive spectral approach to modularity maximization: Newman, Modularity and community structure in networks, 2006

The agglomerative clustering based CNM algorithm for modularity maximization: Clauset, Newman, Moore, Finding community structure in very large networks, 2004

Introduces the Louvain algorithm, a “multilevel” method for modularity maximization: Blondel et al., Fast unfolding of communities in complex networks, 2008

A fast and simple algorithm for community detection that doesn’t explicitly try to maximize modularity: Raghavan, Albert, Kumara, Near linear time algorithm to detect community structure in large-scale networks, 2007

A comparative evaluation of community structure in a collection of large-scale networks: Leskovec et al., Community Structure in large networks: natural cluster sizes and the absence of well-defined clusters, 2008

Various implementations for the partitioning and community detection problems were compared here. Note that graph partitioning and community detection are two different problems. The 10th DIMACS Implementation Challenge: Graph Partitioning and Graph Clustering

Winner of the DIMACS challenge for community detection. If you had to pick one implementation for non-overlapping community detection, this would be it. Ovelgonne and Geyer-Schulz, An ensemble learning strategy for community detection

Review paper on overlapping community detection. Xie et al., Overlapping community detection in networks: the state of the art and comparative study, 2011

One possible algorithm for overlapping community detection. Yang and Leskovec, Defining and evaluating network communities based on ground-truth, Proc. ICDM 2012.

A very nice exposition of this problem. An extension of the above paper. Yang, Community Structure of Large Networks, PhD dissertation, 2014.
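Since most of the papers above optimize Newman's modularity, a short sketch of computing Q for a given partition may help fix the definition. This is a deliberately naive O(n²) illustration (toy graph: two triangles joined by a single edge), not an optimized implementation:

```python
# Modularity Q = (1/2m) * sum_ij [A_ij - k_i*k_j/(2m)] * delta(c_i, c_j)
# for an undirected graph given as an edge list -- a sketch of the
# definition, not an optimized implementation.
def modularity(edges, community):
    m = len(edges)
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    # observed fraction of intra-community edges
    q = sum(1.0 / m for u, v in edges if community[u] == community[v])
    # minus the expected fraction under the configuration (null) model
    for u in deg:
        for v in deg:
            if community[u] == community[v]:
                q -= deg[u] * deg[v] / (4.0 * m * m)
    return q

# Two triangles joined by one edge, split into their natural communities
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
comm = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
q = modularity(edges, comm)  # 6/7 - 1/2, roughly 0.357
```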

Miscellaneous analysis algorithms

Assortativity

Newman, Mixing patterns in networks, 2003

Latapy, Main-memory triangle counting for very large sparse power-law graphs, 2008

Seshadhri et al., Wedge sampling for computing clustering coefficients and triangle counts on large graphs, 2013.

Tsourakakis et al., DOULION: Counting Triangles in Massive Graphs with a coin, 2009.

Degeneracy/k-core decomposition

Batagelj and Zaversnik’s (Pajek) 2002 k-core algorithm paper

Sariyuce et al., Streaming algorithms for k-core decomposition, 2013

Rossi et al., Fast maximum clique algorithms for large graphs, 2014

Cohen, Cohesive subgraphs for social network analysis, 2008

Sariyuce et al., Finding the Hierarchy of Dense Subgraphs using Nucleus decompositions, 2014
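The k-core papers above all build on the same peeling idea: repeatedly remove a minimum-degree vertex, and record the running maximum of those minimum degrees as its core number. A simple quadratic sketch of that idea follows (the Batagelj–Zaversnik paper shows how to do the same in O(m) time with bucketed degrees):

```python
# k-core decomposition by naive peeling -- a quadratic sketch of the
# idea; Batagelj and Zaversnik's bucketed version runs in O(m).
def core_numbers(adj):
    adj = {v: set(ns) for v, ns in adj.items()}
    core, k = {}, 0
    while adj:
        v = min(adj, key=lambda u: len(adj[u]))  # a min-degree vertex
        k = max(k, len(adj[v]))                  # core numbers never decrease
        core[v] = k
        for n in adj.pop(v):                     # remove v from the graph
            adj[n].discard(v)
    return core

# A triangle {0, 1, 2} with a pendant vertex 3 attached to 0:
cores = core_numbers({0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]})
# The pendant sits in the 1-core; the triangle forms the 2-core.
```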

Network Models

A model for the small-world phenomenon: Watts and Strogatz, Collective dynamics of ‘small-world’ networks, 1998

The preferential attachment paper: Barabasi and Albert, Emergence of scaling in random networks, 1999

Another model explaining the small-world nature: Kleinberg, The small-world phenomenon: an algorithmic perspective, 2000

Web graph modeling: Kleinberg et al., The web as a graph: measurements, models, and methods, 2000

Observations about the internet router-level topology: Doyle et al., The “robust yet fragile” nature of the internet, 2005

Some results about power-law random graphs: Aiello, Chung, Lu, A random graph model for power law graphs, 1999

Robins et al., An Introduction to exponential random graph models for social networks, 2007

Newman, Structure and function of complex networks, 2003

Chapter 4 on Random graphs from Foundations of Data Science book by Hopcroft and Kannan, 2014

Link prediction

Liben-Nowell, Kleinberg, The link prediction problem for social networks, 2004.

Lu, Zhou, Link prediction in complex networks: a survey, 2010.

Al Hasan, Zaki, A survey of link prediction in social networks 2010.

Wang et al., Human mobility, social ties, and link prediction, Proc. KDD 2011.

Sadilek et al., Finding your friends and following them to where you are, Proc. WSDM 2012.

WTF: the Who to Follow service at Twitter, Proc. WWW 2014 : Link1 and Link2

Chakrabarti et al., Joint inference of multiple label types in large networks, 2014.

Schifanella et al., Folks in folksonomies: social link prediction with shared metadata, 2010.

Dong et al., Link prediction and recommendation across heterogeneous social networks, Proc. ICDM 2012.

Cannistraci et al., From link-prediction in brain connectomes and protein interactomes to the local-community-paradigm in complex networks, 2014

Subgraph mining

The Wikipedia page on network motifs

Milo et al., Network motifs: Simple building blocks of complex networks, 2002

Yan, Han, gspan: Graph-based substructure pattern mining, 2002

Kuramochi and Karypis, Finding frequent patterns in a large sparse graph, 2005

Gibson, Kumar, Tomkins, Discovering large dense subgraphs in massive graphs, 2005.

Alon, Yuster, Zwick, Color-coding, 1995.

Ray, Holder, Choudhury, Frequent subgraph discovery in large attributed streaming graphs, 2014.

Pattabiraman et al., Fast algorithms for the maximum clique problem on massive graphs, 2013.

Yang et al., Schemaless and structureless graph querying, 2014.

Applications

Merritt, Clauset, Social network dynamics in a massive online game: network turnover, non-densification, and team engagement in Halo Reach, 2013.

Chapter 6 of Networks, Crowds, and Markets book

Gale, Kariv, Financial networks.

Zhu, Wilkinson, Shreiber, Pan, Large-scale parallel collaborative filtering for the Netflix prize, Proc. AAIM 2008.

Koren, Bell, Volinsky, Matrix factorization techniques for recommender systems, IEEE Computer, 2009.

Natarajan, Dhillon, Inductive matrix completion for predicting gene-disease associations, 2014.

Singh-Blom et al., Prediction and Validation of Gene-Disease Associations Using Methods Inspired by Social Network Analyses, 2013.

Graph storage and analysis frameworks

Bronson et al., TAO: Facebook’s distributed graph data store for the social graph, Proc. ATC 2013.

Curtiss et al., Unicorn: a system for searching the social graph, Proc. VLDB 2013.

Gonzalez et al., PowerGraph: distributed graph-parallel computation on natural graphs, Proc. OSDI 2012.

Gonzalez et al., GraphX: Graph processing in a distributed dataflow framework, Proc. OSDI 2014.

Graph partitioning

Buluc et al., Recent advances in graph partitioning, 2015

The 10th DIMACS Implementation Challenge: Graph Partitioning and Graph Clustering

Graph compression & visualization

Boldi, Vigna, The WebGraph Framework I: Compression Techniques, Proc. WWW 2004.

Buehrer, Chellapilla, A scalable pattern mining approach to web graph compression with communities, Proc. WSDM 2008

Hu, Efficient and High-quality force-directed graph drawing.

Ahmed et al., Visualisation and Analysis of the Internet Movie Database, Proc. APVIS 2007

Herr, Ke, Hardy, Borner, Movies and Actors: Mapping the Internet Movie Database, Proc. IV 2007

Freebase US film data analytics using GraphLab

Introduction to Apache Spark

| Comments

A couple of months back, I gave an introductory talk on Apache Spark. Spark, designed at Berkeley’s AMPLab, is a fast, general engine for large-scale data processing. This slide deck covers the background behind Apache Spark and discusses two academic papers, Spark: Cluster Computing with Working Sets and Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, that provide deep insights into the architecture. There is also a sneak peek of BlinkDB, which I previously covered in depth in another post.

I am also really excited to be attending Spark Summit East in New York this March. Special thanks to folks at Databricks for sponsoring my ticket! :-)

Here are the embedded slides:

How Secure Is Your Password ?

| Comments

If you are reading this, chances are that you are wise enough not to use “123456” or “dragon” as your password. Even for folks who are somewhat paranoid about online security and believe they are indeed using “strong” passwords, there is still room for improvement (with a better understanding of the science behind them). This post scratches the surface of the state of the art in research on choosing a secure password. The idea is to document the key lessons and provide pointers to academic papers for diving in deeper.

Our assumption for the threat model: the attacker has obtained a list of hashed passwords. They can launch offline attacks at their leisure in an attempt to crack these passwords. Also, most users tend to reuse passwords across different accounts, which helps an adversary maximize the possible damage.

How strong is your password?

A strong password is not easily “guessable”. Thus, the time the most efficient password-cracking algorithms need to recover the plaintext password can quantify a password’s strength.

Here are some of the findings from the security community you can use for a quick self-assessment. We discuss all of these issues in more detail below.

  1. Any dictionary-based password is inherently insecure.
  2. As a rule of thumb, a longer password is more secure. That said, it is still quite possible for a user with the best of intentions to choose a long (non-dictionary) password that is easy to break.
  3. Passwords that follow a “predictable” structure are less secure.
  4. Many websites enforce a password-expiration policy. When choosing a new password, most users generate it incrementally from their old passwords; most of the time, the new passwords are a few simple tweaks on the old ones. (Almost everyone does this, right?)
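To put rough numbers on points 1 and 2, here is a back-of-the-envelope comparison of search-space sizes. The dictionary size and the character alphabet are assumptions for illustration, not figures from any study:

```python
import math

# Assumed, illustrative numbers: a 100k-word dictionary versus a
# password of 10 characters drawn uniformly from 72 printable symbols.
dict_guesses = 100_000
random_guesses = 72 ** 10

# Entropy in bits = log2 of the number of equally likely choices
dict_bits = math.log2(dict_guesses)      # about 16.6 bits
random_bits = math.log2(random_guesses)  # about 61.7 bits
```

An offline attacker testing billions of hashes per second exhausts the first space almost instantly; the second is far out of reach.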

So how do the bad guys exploit these weaknesses?

In an ideal world, every user chooses a perfectly random password, perhaps something like “cmMs&59my@6Ue*T” or “3%RaKYn5zMBB&9P”. Randomness is indeed our friend. There is just one small flaw: these passwords are incredibly hard to remember.

Not so surprisingly, dictionary-based passwords and attacks are the most common. An attacker will first try to guess the password using a set of dictionary words. Even most lengthy non-dictionary passwords have a predictable structure with a non-uniform (in simple words: predictable) distribution of characters. Thus, the most efficient cracking tools use heuristics and do not explore all possible passwords (brute force).

Instead, they use clever techniques like:

Markov Chains

“For example, in English, ‘w’ is more likely to be followed by ‘h’ than by another ‘w’. The concept can be extended to character sequences of arbitrary length, called n-grams; the 3-gram (or tri-gram) ‘thr’ is far more likely to be followed by a vowel than by a consonant. By training the Markov chain on known password lists, dictionaries, or both, these distributions can be estimated and used to generate a list of possible passwords that is significantly more effective than random guessing or the publicly available programs like John the Ripper”[1]
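To make the quoted idea concrete, here is a toy character-bigram model trained on a tiny made-up password list. Real tools train on millions of leaked passwords; the corpus here is purely illustrative:

```python
from collections import defaultdict

# Count character-bigram frequencies in a (toy) password corpus.
def train_bigrams(passwords):
    counts = defaultdict(lambda: defaultdict(int))
    for pw in passwords:
        for a, b in zip(pw, pw[1:]):
            counts[a][b] += 1
    return counts

# The most frequent next character after `ch` under the trained model.
def most_likely_next(counts, ch):
    return max(counts[ch], key=counts[ch].get)

model = train_bigrams(["password", "passw0rd", "pass1234"])
# In this corpus, 'p' is always followed by 'a', so a guesser extending
# a candidate starting with 'p' would try 'pa' first.
```

A real cracker walks these conditional distributions in order of probability, emitting high-likelihood candidate passwords before unlikely ones.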

Probabilistic Context Free Grammars

“This algorithm is based on the observation that passwords tend to have predictable “structures”. The structure of a password is defined as the way in which the password can be broken into strings (or tokens) of letters, digits, and symbols (e.g. S1U1L6D2 represents a special character followed by an uppercase letter, then 6 lowercase letters, and ending with two digits). PCFG attacks are highly effective because when a password composition policy requires that a digit or symbol be included in a password, users are far more likely to append the required character to an existing password rather than place it in the middle. Similarly, uppercase letters are predominantly used at the beginning of an alphabetic string. By tabulating the number of occurrences of each distinct structure found in a training set, an attacker can gain valuable insight into the likely distribution of structures of the passwords.”[1]

Thus, even with strict password policies enforced by websites, PCFG-based attacks can be highly effective, exploiting the biases users have when choosing passwords.
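The “structure” notation from the quote is easy to reproduce: tokenize a password into runs of lowercase (L), uppercase (U), digits (D), and symbols (S), each tagged with its run length. A small sketch, with the tag letters following the quoted example:

```python
import re

# Map a password to its PCFG structure string,
# e.g. "Password12!" -> "U1L7D2S1"
def structure(pw):
    out = []
    for run in re.finditer(r"[a-z]+|[A-Z]+|[0-9]+|[^a-zA-Z0-9]+", pw):
        s = run.group()
        tag = ("L" if s[0].islower() else
               "U" if s[0].isupper() else
               "D" if s[0].isdigit() else "S")
        out.append(f"{tag}{len(s)}")
    return "".join(out)

sig = structure("Password12!")  # "U1L7D2S1"
```

An attacker tabulates these signatures over a leaked corpus and enumerates guesses for the most frequent structures first.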

Transformation based Algorithmic Attacks

These attacks, when targeted at an individual user, make small modifications to that user’s existing leaked passwords, exploiting their incremental nature and making the attack far more efficient.
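A handful of the most common transformations can be sketched directly. This rule set is illustrative and far from exhaustive, and the rules are my own examples, not those of any specific cracking tool:

```python
# Generate candidate successors of an old leaked password by applying
# a few common user tweaks -- illustrative, far from exhaustive.
def candidate_tweaks(old):
    cands = {old + "1", old + "!"}        # append a digit or a symbol
    cands.add(old.capitalize())           # capitalize the first letter
    if old and old[-1].isdigit():         # increment a trailing digit
        cands.add(old[:-1] + str((int(old[-1]) + 1) % 10))
    return cands

guesses = candidate_tweaks("hunter2")
# Includes "hunter3", "Hunter2", "hunter2!", "hunter21"
```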

Summing Up

Choosing a strong password has quite a few nuances. It is not simply a choice between a shorter and a longer passphrase. It takes a little more thought and effort to ensure our passwords are indeed strong and resistant to offline attacks.

Suggested Readings:

References: [1]

BlinkDB: Queries With Bounded Errors and Bounded Response Times on Very Large Data

| Comments

I recently gave an introductory talk on ‘BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data’. BlinkDB, designed at Berkeley’s AMPLab, provides an approximate query engine for running interactive SQL queries on large volumes of data. Kudos to Sameer Agarwal for his great work!

Here are my slides covering an overview of BlinkDB:

How Does Your Company Make Money?

| Comments

As a developer working in a very large company, it is quite common to be oblivious to how your company really makes money. How do the folks in sales actually woo new customers? What exactly do these customers want? What sort of competition exists in the market? How does your company differentiate itself from its competitors?

Recently, I was having a discussion with a Program Manager at Microsoft when it all struck me. A chain of thoughts took me further down this lane: if I want to spend the first three to four years of my career working as a developer in a huge company, I should do some business reading to build a background in something other than pure engineering.

A quick Google search took me to this page: http://personalmba.com/ . I am reading the book now and hope that it serves as a gateway for me to dig deeper down the road. Bon voyage! ;)

Passenger - Rolling Stone

| Comments

Last weekend, while working from a Starbucks, I heard this song and have been in love with it ever since.

The lyrics are beautiful…

Sometimes I feel I’m going nowhere
Sometimes I’m sure I never will
She said it’s cause I’m always moving
I never notice cause I never stand still

Sometimes I feel like I’m falling
Falling fast and falling free
She says my darling you’re not falling
It always looked like you were flying to me

But I fear I’ve grown a rolling stone inside of me
She said oh don’t you know
that rolling stones stop at the sea
And that’s where I’ll be

Sometimes (I feel/) I’m sure I know no one
A thousand faces but no names
She said my love you do know someone
Oh and I know you back just the same

But I’m scared I said, what if this stone don’t slow down
Just be aware she said, what goes up will come down
and when you do I’ll be around

Oh when I’ve dragged this rolling stone across this land
I’ll make sure I leave this stone in her hands
For we both know too well, rolling stones turn in to sand
if they don’t find a place to stand

Hope the song resonates with your soul :)

Say NO to Notifications

| Comments

Technology has always been my bread and butter. I strive to be an ‘early adopter’, trying to integrate every piece of meaningful technology into my daily life. However, over the past few months I have started thinking deeply about productivity and how our interactions with cellphones, wearable gadgets, the internet, and the like affect it. After reading a bunch of the literature, one lesson is very obvious and clear:

Say NO to Notifications

Yes, I am referring to the luring notifications from Facebook, Quora, LinkedIn, < Your Favorite Messenger App >, < Your Favorite News App > (the list continues)… on your smartphone, tablet, and email. They are always on the hook to lure you into further engagement. A very simple key to productivity, then, is turning them off.

I have had a very positive experience after turning off most of the notifications I was subscribed to. As of now, only two major applications on my cellphone still have notifications: WhatsApp and Twitter. I have cut down on everything else, and it does feel good.

Wish you a productive day! :)

Beauty of Tail Recursion

| Comments

I always overlooked Tail Recursion whenever I encountered it. Also, I am ashamed to admit that for a very long time I had a very wrong idea of exactly what was being optimized by Tail Recursion.

Today, while following FP with Scala on Coursera, I finally took the time to think through the idea, which cleared up my long-standing misconception.

I will quickly (and lazily) summarize the difference between the two flavors (recursion with and without tail-call optimization) here.

Factorial without tail-call optimization

```scala
def factorial(x: Int): Int =
  if (x == 0) 1 else x * factorial(x - 1)

factorial(5)
```
Factorial with tail-call optimization

```scala
import scala.annotation.tailrec

def factorial(x: Int): Int = {
  @tailrec
  def factorial(accumulator: Int, x: Int): Int =
    if (x == 1 || x == 0) accumulator
    else factorial(x * accumulator, x - 1)

  factorial(1, x)
}

println(factorial(5))
```

Consequence of the second implementation: if a function calls itself as its last action, the function’s stack frame can be reused, so the recursion runs in constant stack space instead of growing with every call.

Now that I have a blog post on Tail Recursion, I am free from the guilt of misunderstanding this concept for so long. :D