Nitish Upreti

A daydreamer exploring Algorithms, Software Systems Engineering and Data Science.

Best Practices for Moving Data Encrypted With Always Encrypted

| Comments

Being a part of the Azure SQL DB and SQL Server product team at Microsoft, I recently authored an article on the SQL Security blog titled “Best Practices for Moving Data Encrypted with Always Encrypted”. As the title suggests, the article summarizes best practices to follow when moving data encrypted with SQL16’s new Always Encrypted feature.

Would love to hear any feedback and suggestions @nitish or comments on the blog post. :)

Running Inside Windows Debugging Sample Programs in Windows 10

| Comments

If you are trying to run the sample programs from “Inside Windows Debugging” in windows 10, you are at the right place. This post will save you troubleshooting time.

  1. Install WDK (Windows Driver Kit) from msdn downloads.
  2. Locate the binary “NMake2MSBuild.exe” in your installation directory.
  3. Run “NMake2MSBuild.exe dirs” (where dirs is the top level file in the book’s sample code directory)

This will generate appropriate VStudio solution file in the code directory. When you build this soltution, some projects might fail with: Warning treated as error. Disable this with: “Treat Warning as Errors” options from Visual Studio.

My Observation of Indians Working in Silicon Valley/Technology

| Comments

I am an Indian (desi) living in United States and working in technology. Most(All?) of my friends are Indians who are working in engineering positions for top technology companies. All of them are brilliant folks. I happen to be around 6-10 years younger to all of them. They studied from top tier schools, got the best of jobs and simply aced everything that they ever put their hands on.

Here is my general observation : All of them started off brilliant and hungry. Over the years with multiple years of experience and working on the same job, ‘work’ is now just ‘work’ and nothing more. It is definitely not a passion. This does not mean that they suck at their jobs. They are pragmatic, efficient and highly skilled engineers.

This scenario would definitely not just apply to Indians and perhaps is true for any worker in technology(any job?). It also makes a lot of sense. You have a family to take care of, you also want to enjoy your life and fulfill other responsibilities. Life is definitely more than just work/professional life. You simply have more commitments.

That makes me wonder :
Should a guy with 10 years or experience be as obsessed about his work as a fresh graduate ? Is it even possible to share a similar enthusiasm after years of experience?

I think it boils down to personal temperament and choices. There is no right or wrong answer.

What do you think?


| Comments

Ghazals are losing prominence in India. To the best of my understanding, people in my generation (and coming?) simply cannot relate to Ghazals. Everyone finds the music too slow and lyrics too cheesy. However, sometimes you just want to slow down and appreciate the music with childlike innocence. To me ghazals bring back the nostalgic 90s childhood I spent in India. :)

I would end this post with a Ghazal I recently stumbled on and fell in love with.

चांद के साथ कई दर्द पुराने निकले,

कितने गम थे जो तेरे गम के बहाने निकले

फ़सल-ए-गुल आई फ़िर एक बार असीनाने-वफ़ा,

अपने ही खून के दरिया में नहाने निकले

दिल ने एक ईंट से तामीर किया हसीं ताजमहल,

तुने एक बात कही लाख फसाने निकले

दश्त-ए-तन्हाई ये हिजरा में खडा सोचता हुँ,

हाय क्या लोग मेरा साथ निभाने निकले ?

Kinara - Coke Studio

| Comments

The song Kinara by Atif Aslam and Riaz Ali Khan from Coke Studio Pakistan is simply amazing. I am in love with the part from 3:52 onwards. Listen to it here.

What do you do when you are in love with a song ? Especially, if it is one of those not so popular song (atleast in my friend circle)?

You write a blog post, hope that people stumble onto the page (thanks Google) and share your love for the song. Say Hi on the comments if you love the song! :)

Cheers !

Readings on Graph Mining

| Comments

I recently took a class on graph mining. In this post, I am aggregating all the lecture topics that make a complete four months worth of material. Hope this serves as a good archive on hand picked topics around several different areas of graph mining.

Thanks to Dr. Kamesh Madduri for this course!


Slides from Georgia Tech Network Science class

Link1 and Link2

Chapters 1 and 2 of Networks, Markets, Crowds book

Brandes et al., What is Network Science?

Estrada et al., Complex Networks: an invitation

Lecture blurbs of Barabasi’s Network Science class

Mislove et al., Measurement and Analysis of Online Social Networks, Proc. IMC 2007

Broder et al., Graph structure in the web, Proc. WWW 2000

Meusel et al., Graph structure in the web – revisited, Proc. WWW 2014


Centrality chapters in Network Analysis: Methodological foundations book Link1 and Link2

Freeman’s paper on centrality in social networks

PageRank papers : Link1 and Link2

Gleich, PageRank beyond the web, 2014

The Wikipedia page on PageRank has some clarifications about the formula to use

Hoeffding’s inequality (used to analyze randomized algorithms)

Brandes’ Betweenness centrality algorithm paper

Faust, Centrality in affiliation networks

Papers/slides on Harmonic centrality by Boldi and Vigna : Link1 and Link2

Community detection

Fortunato, Community detection in graphs, 2010

The paper that introduces the modularity measure Newman, Fast algorithm for detecting community structure in networks, 2003

Newman presents the divisive spectral approach to modularity maximization Newman, Modularity and community structure in networks, 2006

The agglomerative clustering based CNM algorithm for modularity maximization Clauset, Newman, Moore, Finding community structure in very large networks,2004

Introduces the Louvain algorithm, a “multilevel” method for modularity maximization Blondel et al., Fast unfolding of communities in complex networks, 2008

A fast and simple algorithm for community detection. Doesn’t explicitly try to maximize modularity. Raghavan, Albert, Kumara, Near linear time algorithm to detect community structure in large-scale networks, 2007

Comparative evaluation of community structure in a collection of large-scale networks Leskovec et al., Community Structure in large networks: natural cluster sizes and the absence of well-defined clusters, 2008

Various implementations for the partitioning and community detection problems were compared here. Note that graph partitioning and community detection are two different problems. The 10th DIMACS Implementation Challenge: Graph Partitioning and Graph Clustering

Winner of the DIMACS challenge for community detection. If you had to pick one implementation for non-overlapping community detection, this would be it. Ovelgonne and Geyer-Schulz, An ensemble learning strategy for community detection

Review paper on overlapping community detection. Xie et al., Overlapping community detection in networks: the state of the art and comparative study, 2011

One possible algorithm for overlapping community detection. Yang and Leskovec, Defining and evaluating network communities based on ground-truth, Proc. ICDM 2012.

A very nice exposition of this problem. An extension of the above paper. Yang, Community Structure of Large Networks, PhD dissertation, 2014.

Miscellaneous analysis algorithms


Newman, Mixing patterns in networks, 2003

Latapy, Main-memory triangle counting for very large sparse power-law graphs, 2008

Seshadhri et al., Wedge sampling for computing clustering coefficients and triangle counts on large graphs, 2013.

Tsourakakis et al., DOULION: Counting Triangles in Massive Graphs with a coin, 2009.

Degeneracy/k-core decomposition

Batagelj and Zaversnik’s (Pajek) 2002 k-core algorithm paper:

Sariyuce et al., Streaming algorithms for k-core decomposition, 2013

Rossi et al., Fast maximum clique algorithms for large graphs, 2014

Cohen, Cohesive subgraphs for social network analysis, 2008

Sariyuce et al., Finding the Hierarchy of Dense Subgraphs using Nucleus decompositions, 2014

Network Models

Watts and Strogatz, The Collective dynamics of small-world networks, 1998

Model for small-world phenomenon : Barabasi and Albert, Emergence of scaling in random networks, 1999

The preferential attachment paper : Kleinberg, The small-world phenomenon: an algorithmic perspective, 2000

Another model explaining small-world nature : Kleinberg et al., The web as a graph: measurements, models, and methods, 2000

Web graph modeling : Doyle et al., The “robust yet fragile” nature of the internet, 2005

Observations about the internet router-level topology Aiello, Chung, Lu, A random graph model for power law graphs, 1999

Some results about power-law random graphs Robins et al., An Introduction to exponential random graph models for social networks, 2007

Newman, Structure and function of complex networks, 2003

Chapter 4 on Random graphs from Foundations of Data Science book by Hopcroft and Kannan, 2014

Link prediction

Kleinberg, The link prediction problem for social networks, 2004.

Lu, Zhou, Link prediction in complex networks: a survey, 2010.

Al Hasan, Zaki, A survey of link prediction in social networks 2010.

Wang et al., Human mobility, social ties, and link prediction, Proc. KDD 2011.

Sadilek et al., Finding your friends and following them to where you are, Proc. WSDM 2012.

WTF: the Who to Follow service at Twitter, Proc. WWW 2014 : Link1 and Link2

Chakrabarti et al., Joint inference of multiple label types in large networks, 2014.

Schifanella et al., Folks in folksonomies: social link prediction with shared metadata, 2010.

Dong et al., Link prediction and recommendation across heterogeneous social networks, Proc. ICDM 2012.

Cannistraci et al., From link-prediction in brain connectomes and protein interactomes to the local-community-paradigm in complex networks, 2014

Subgraph mining

The Wikipedia page on network motifs:

Milo et al., Network motifs: Simple building blocks of complex networks, 2002

Yan, Han, gspan: Graph-based substructure pattern mining, 2002

Kuramochi and Karypis, Finding frequent patterns in a large sparse graph, 2005

Gibson, Kumar, Tomkins, Discovering large dense subgraphs in massive graphs, 2005.

Alon, Yuster, Zwick, Color-coding, 1995.

Ray, Holder, Choudhury, Frequent subgraph discovery in large attributed streaming graphs, 2014.

Pattabiraman et al., Fast algorithms for the maximum clique problem on massive graphs, 2013.

Yang et al., Schemaless and structureless graph querying, 2014.


Merritt, Clauset, Social network dynamics in a massive online game: network turnover, non-densification, and team engagement in Halo Reach, 2013.

Chapter 6 of Networks, Crowds, and Markets book

Gale, Kariv, Financial networks.

Zhu, Wilkinson, Shreiber, Pan, Large-scale parallel collaborative filtering for the Netflix prize, Proc. AAIM 2008.

Koren, Bell, Volinsky, Matrix factorization techniques for recommender systems, IEEE Computer, 2009.

Natarajan, Dhillon, Inductive matrix completion for predicting gene-disease associations, 2014.

Singh-Blom et al., Prediction and Validation of Gene-Disease Associations Using Methods Inspired by Social Network Analyses, 2013.

Graph storage and analysis frameworks

Bronson et al., TAO: Facebook’s distributed graph data store for the social graph, Proc. ATC 2013.

Curtiss et al., Unicorn: a system for searching the social graph, Proc. VLDB 2013.

Gonzales et al., PowerGraph: distributed graph-parallel computation on natural graphs, Proc. OSDI 2012.

Gonzales et al., GraphX: Graph processing in a distributed dataflow framework, Proc. OSDI 2014.

Graph partitioning

Buluc et al., Recent advances in graph partitioning, 2015

The 10th DIMACS Implementation Challenge: Graph Partitioning and Graph Clustering

Graph compression & visualization

Boldi, Vigna, The WebGraph Framework I: Compression Techniques, Proc. WWW 2014.

Buehrer, Chellapilla, A scalable pattern mining approach to web graph compression with communities, Proc. WSDM 2008

Hu, Efficient and High-quality force-directed graph drawing.

Ahmed et al., Visualisation and Analysis of the Internet Movie Database, Proc. APVIS 2007

Herr, Ke, Hardy, Borner, Movies and Actors: Mapping the Internet Movie Database, Proc. IV 2007

Freebase US film data analytics using GraphLab

Introducion to Apache Spark

| Comments

A couple of months back, I gave an introductory talk on Apache Spark. Spark, designed at Berkeley’s amplab is a fast general engine for large-scale data processing. This slide deck covers the background behind Apache Spark and discuss two academic papers Spark: Cluster Computing with Working Sets and RDD: A Fault-Tolerant Abstraction for In-Memory Cluster Computing that provide deep insights into the architecture. There is a also sneak peek of BlinkDB which I previously covered in-depth in an another post.

I am also really excited to be attending Spark Summit East in New York this March. Special thanks to folks at Databricks for sponsoring my ticket! :-)

Here are the embedded slides :

How Secure Is Your Password ?

| Comments

If you are reading this, chances are that you are wise enough not to use “123456” and “dragon” for your password. Even for the folks who are somewhat paranoid about online security and believe they are indeed using “strong” passwords, there is a still a chance of improvement (with a better understanding of the science behind). This post scratches the surface on “state of the art research for choosing a secure password”. The idea is to document the key lessons and provide pointers to academic journals to dive in deeper.

Our Assumption for the “Threat Model” : We assume that the attacker has obtained access to a list of hashed passwords. They can easily launch offline attacks in their own sweet time as an attempt to crack these passwords. Also most users tend to reuse passwords across different accounts, which helps an adversary maximize the possible damage.

How strong is your password?

A strong password is not easily “guessable”. Thus the time needed for some of the most efficient password-cracking algorithms to reveal the plaintext password can quantify a password’s strength.

Here are some of the findings from the security community you can use for a quick self assessment. We discuss the all issues more clearly subsequently.

  1. Any dictionary based password is inherently insecure.
  2. As a thumb rule, having a longer password makes it more secure. That said, it is still quite possible that a user with the best of intentions chooses a long password (non dictionary) which is still easy to break
  3. Password that follow a “predictable” structure are more insecure.
  4. Multiple websites force a password expiration policy. Most of the users when choosing a new password, tend to generate future passwords incrementally based on old passwords. Most of the time, these new passwords are a few simple tweaks on the old ones. (Almost everyone does this, right?)

So how do the bad guys exploit the weakness?

In an ideal world, every user chooses a perfectly random password. Perhaps something like : “cmMs&59my@6Ue*T” or “3%RaKYn5zMBB&9P”. Randomness is indeed our friend. There is a just small flaw, these password are incredibly hard to remember.

Not so surprisingly, dictionary based passwords and attacks are most common. Any attacker will try to guess the password by initially using a set of dictionary words. Even most of the lengthy non dictionary passwords usually have a predictable structure with a non-uniform (in simple words: predictable) distribution of characters. Thus the most efficient cracking tools use heuristics and do not explore all possible passwords(brute force).

They instead use clever techniques like :

Markov Chains

“For example, in English, ‘w’ is more likely to be followed by ‘h’ than by another ‘w’. The concept can be extended to character sequences of arbitrary length, called n-grams; the 3-gram (or tri-gram). ‘thr’ is far more likely to be followed by a vowel than by a consonant. By training the Markov chain on known password lists, dictionaries, or both, these distributions can be estimated and used to generate a list of possible passwords that is significantly more effective than random guessing or the publicly available programs like John the Ripper”[1]

Probabilistic Context Free Grammars

“This algorithm is based on the observation that passwords tend to have predictable “structures”. The structure of a password is defined as the way in which the password can be broken into strings(or tokens) of letters, digits, and symbols (e.g. S1U1L6D2 represents a special character followed by an uppercase letter,then 6 lowercase letters, and ending with two digits). PCFG attacks are highly effective because when a password composition policy requires that a digit or symbol be included in a password, users are far more likely to append the required character to an existing password rather than place it in the middle. Similarly, uppercase letters are predominantly used at the beginning of an alphabetic string. By tabulating the number of occurrences of each distinct structure found in a training set, an attacker can gain valuable insight into the likely distribution of structures of the passwords.”[1]

Thus even with strict password policies enforced by websites, PCFG based attacks can be highly lethal, exploiting the bias users have when choosing passwords.

Transformation based Algorithmic Attacks

These attacks when targeted on an individual user, make small modifications to existing leaked passwords, exploiting incremental nature and making the attacks more efficient.

Summing Up

Choosing a strong password has quite a few associated nuances. It is simply not a choice between a shorter and a longer passphrase. It takes a little more thought and effort to ensure our passwords are indeed strong and resistant to offline attacks.

Suggested Readings :

References : 1

BlinkDB: Queries With Bounded Errors and Bounded Response Times on Very Large Data

| Comments

I recently gave an introductory talk on ‘BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data’. BlinkDb, designed at Berkeley’s amplab provides an approximate query engine for running interactive SQL queries on large volumes of data. Kudos to Sameer Agarwal for his great work!

Here are my slides covering an overview of BlinkDb :