cshalizi + network_data_analysis   219

Estimating the Causal Effects of Social Interaction with Endogenous Networks
"Identifying causal effects attributable to network membership is a key challenge in empirical studies of social networks. In this article, we examine the consequences of endogeneity for inferences about the effects of networks on network members’ behavior. Using the House office lottery (in which newly elected members select their office spaces in a randomly chosen order) as an instrumental variable to estimate the causal impact of legislative networks on roll call behavior and cosponsorship decisions in the 105th–112th Houses, we find no evidence that office proximity affects patterns of legislative behavior. These results contrast with decades of congressional scholarship and recent empirical studies. Our analysis demonstrates the importance of accounting for selection processes and omitted variables in estimating the causal impact of networks."
to:NB  causal_inference  re:critique_of_diffusion  social_influence  congress  network_data_analysis  social_networks  homophily  re:homophily_and_confounding 
13 days ago by cshalizi
[1205.1406] Graph Prediction in a Low-Rank and Autoregressive Setting
"We study the problem of prediction for evolving graph data. We formulate the problem as the minimization of a convex objective encouraging sparsity and low-rank of the solution, that reflect natural graph properties. The convex formulation allows to obtain oracle inequalities and efficient solvers. We provide empirical results for our algorithm and comparison with competing methods, and point out two open questions related to compressed sensing and algebra of low-rank and sparse matrices."
to:NB  network_data_analysis  prediction  statistics  low-rank_approximation 
19 days ago by cshalizi
Uncovering Structure in High-Dimensions: Networks and Multi-task Learning Problems
"Extracting knowledge and providing insights into complex mechanisms underlying noisy high-dimensional data sets is of utmost importance in many scientific domains. Statistical modeling has become ubiquitous in the analysis of high-dimensional functional data in search of better understanding of cognition mechanisms, in the exploration of large-scale gene regulatory networks in hope of developing drugs for lethal diseases, and in prediction of volatility in stock market in hope of beating the market. Statistical analysis in these high-dimensional data sets is possible only if an estimation procedure exploits hidden structures underlying data.
"This thesis develops flexible estimation procedures with provable theoretical guarantees for uncovering unknown hidden structures underlying data generating process. Of particular interest are procedures that can be used on high-dimensional data sets where the number of samples n much smaller than the ambient dimension p. Learning in high-dimensions is difficult due to the curse of dimensionality, however, the special problem structure makes inference possible. Due to its importance for scientific discovery, we put emphasis on consistent structure recovery throughout the thesis. Particular focus is given to two important problems, semi-parametric estimation of networks and feature selection in multi-task learning."
to_read  network_data_analysis  machine_learning  high-dimensional_statistics  kolar.mladen  kith_and_kin  relational_learning 
25 days ago by cshalizi
"Network Coevolution and Democracy: A Spatial Econometric Approach" by Aya Kachi
"Regime transitions are contagious according to the diffusion-of-democracy literature: a country's regime is affected by others' through various predefined networks (e.g. geographical proximity), as well as by the country's own political, economic and social attributes (e.g. GDP levels). My account departs from the existing diffusion theory by allowing for countries' self-selection into peer regime networks based on their democracy levels in the past. For example, a country can form stronger dependency ties with countries that demonstrated similar democracy levels in the past (homophily). In the longitudinal setting, the traditional diffusion mechanism with the presence of self-selection generates the "co-evolutionary dynamic" between country networks and democracy levels. With this recursive feedback process between tie formation and democracy levels, it becomes extremely difficult to evaluate empirically how each country's level of democracy is determined, because we need to distinguish the following three processes statistically. First, country-specific attributes determine the level of democracy as in the earliest democratization studies. Second, other states' democracy levels also predict a country's regime as demonstrated in the conventional diffusion studies. Finally with my theory of endogenous network formation, the seeming diffusion effect is partially a consequence of their self-selection into peer networks. A newer spatial econometric model, an "M-STAR + Co-Evolution" model, is one of the first that allows us to test for all of these three dynamics behind democratization. In my first-cut analysis, I find that all three processes indeed exist."

ETA: It's good to recognize the problem exists, but the model used here does not make it go away, and still fails to identify the influence effect (if one exists).
to:NB  to_read  political_science  network_data_analysis  homophily  contagion  re:critique_of_diffusion  democracy 
4 weeks ago by cshalizi
[1201.5871] Null models for network data
"The analysis of datasets taking the form of simple, undirected graphs continues to gain in importance across a variety of disciplines. Two choices of null model, the logistic-linear model and the implicit log-linear model, have come into common use for analyzing such network data, in part because each accounts for the heterogeneity of network node degrees typically observed in practice. Here we show how these both may be viewed as instances of a broader class of null models, with the property that all members of this class give rise to essentially the same likelihood-based estimates of link probabilities in sparse graph regimes. This facilitates likelihood-based computation and inference, and enables practitioners to choose the most appropriate null model from this family based on application context. Comparative model fits for a variety of network datasets demonstrate the practical implications of our results."
in_NB  network_data_analysis  have_read  statistics  estimation  approximation  re:smoothing_adjacency_matrices 
5 weeks ago by cshalizi
[1204.3941] A Log-Linear Graphical Model for Inferring Genetic Networks from High-Throughput Sequencing Data
"We develop a novel method for estimating high-dimensional Poisson graphical models, the Log-Linear Graphical Model, allowing us to infer networks based on high-throughput sequencing data. Our model assumes that conditional on all other genes, each gene is Poisson, jointly defining a pair-wise Poisson Markov random field. We estimate our genetic networks via neighborhood selection by fitting `1-norm penalized log-linear models, an approach we call the Poisson Graphical Lasso. Additionally, we develop a fast parallel algorithm, permitting us to fit our graphical models to high-dimensional genomic data sets."
in_NB  graphical_models  gene_expression_data_analysis  lasso  network_data_analysis  statistics  regression 
5 weeks ago by cshalizi
[1204.2255] Identifying edge clusters in networks via edge graphlet degree vectors (edge-GDVs) and edge-GDV-similarities
"Inference of new biological knowledge, e.g., prediction of protein function, from protein-protein interaction (PPI) networks has received attention in the post-genomic era. A popular strategy has been to cluster the network into functionally coherent groups of proteins and predict protein function from the clusters. Traditionally, network research has focused on clustering of nodes. However, why favor nodes over edges, when clustering of edges may be preferred? For example, nodes belong to multiple functional groups, but clustering of nodes typically cannot capture the group overlap, while clustering of edges can. Clustering of adjacent edges that share many neighbors was proposed recently, outperforming different node clustering methods. However, since some biological processes can have characteristic "signatures" throughout the network, not just locally, it may be of interest to consider edges that are not necessarily adjacent. Hence, we design a sensitive measure of the "topological similarity" of edges that can deal with edges that are not necessarily adjacent. We cluster edges that are similar according to our measure in different baker's yeast PPI networks, outperforming existing node and edge clustering approaches."
to:NB  network_data_analysis  community_discovery 
6 weeks ago by cshalizi
[1204.2296] Co-clustering for Directed Graphs; the Stochastic Co-Blockmodel and a Spectral Algorithm
"This paper extends the spectral clustering algorithm to directed networks in a way that co-clusters or bi-clusters the rows and columns of a graph Laplacian. Co-clustering leverages the increased complexity of asymmetric relationships to gain new insight into the structure of the directed network. To understand this algorithm and to study its asymptotic properties in a canonical setting, we propose the Stochastic Co-Blockmodel to encode co-clustering structure. This is the first statistical model of co-clustering and it is derived using the concept of stochastic equivalence that motivated the original Stochastic Blockmodel. Although directed spectral clustering is not derived from the Stochastic Co-Blockmodel, we show that, asymptotically, the algorithm can estimate the blocks in a high dimensional asymptotic setting in which the number of blocks grows with the number of nodes. The algorithm, model, and asymptotic results can all be extended to bipartite graphs."
in_NB  relational_learning  network_data_analysis  statistics  clustering  community_discovery  spectral_clustering  yu.bin 
6 weeks ago by cshalizi
[1204.2581] Modeling Relational Data via Latent Factor Blockmodel
"In this paper we address the problem of modeling relational data, which appear in many applications such as social network analysis, recommender systems and bioinformatics. Previous studies either consider latent feature based models but disregarding local structure in the network, or focus exclusively on capturing local structure of objects based on latent blockmodels without coupling with latent characteristics of objects. To combine the benefits of the previous work, we propose a novel model that can simultaneously incorporate the effect of latent features and covariates if any, as well as the effect of latent structure that may exist in the data. To achieve this, we model the relation graph as a function of both latent feature factors and latent cluster memberships of objects to collectively discover globally predictive intrinsic properties of objects and capture latent block structure in the network to improve prediction performance. We also develop an optimization transfer algorithm based on the generalized EM-style strategy to learn the latent factors. We prove the efficacy of our proposed model through the link prediction task and cluster analysis task, and extensive experiments on the synthetic data and several real world datasets suggest that our proposed LFBM model outperforms the other state of the art approaches in the evaluated tasks."
in_NB  network_data_analysis  community_discovery  statistics  inference_to_latent_objects  factor_analysis  relational_learning 
6 weeks ago by cshalizi
Bonacich, P. and Lu, P.: Introduction to Mathematical Sociology.
Judging from the table of contents (which is unfair), a weird mix of reviewing truly elementary concepts and some actually interesting stuff. (And yes, I know who Bonacich is.)
to:NB  books:noted  sociology  networks  modeling  to_be_shot_after_a_fair_trial  network_data_analysis 
6 weeks ago by cshalizi
Phys. Rev. E 85, 046103 (2012): Unbiased degree-preserving randomization of directed binary networks
"Randomizing networks using a naive “accept-all” edge-swap algorithm is generally biased. Building on recent results for nondirected graphs, we construct an ergodic detailed balance Markov chain with nontrivial acceptance probabilities for directed graphs, which converges to a strictly uniform measure and is based on edge swaps that conserve all in and out degrees. The acceptance probabilities can also be generalized to define Markov chains that target any alternative desired measure on the space of directed graphs in order to generate graphs with more sophisticated topological features. This is demonstrated by defining a process tailored to the production of directed graphs with specified degree-degree correlation functions. The theory is implemented numerically and tested on synthetic and biological network examples."
to:NB  network_data_analysis  graph_theory  monte_carlo 
7 weeks ago by cshalizi
[1203.3083] Clustering in networks with the collapsed Stochastic Block Model
"We present an efficient MCMC algorithm to cluster the nodes of a network such that nodes with similar role in the network are clustered together. This is known as block-modelling or block-clustering. We extend the stochastic blockmodel (SBM) of Nowicki & Snijders (2001), by exploiting parameter collapsing to integrate out block parameters. The resulting model defines a posterior over the number of clusters and cluster memberships. Sampling from this model is simpler than from the original SBM as transdimensional MCMC can be avoided. Moreover, our extensions allow the number of clusters to be directly estimated, rather than given as an input parameter. The algorithm is based on the allocation sampler of Nobile & Fearnside (2007). We use synthetic and real data to test the speed and accuracy of our model and algorithm, including the ability to estimate the number of clusters. The algorithm can scale to networks with up to ten thousand nodes."
in_NB  network_data_analysis  community_discovery  statistics 
7 weeks ago by cshalizi
[1203.2200] Role-Dynamics: Fast Mining of Large Dynamic Networks
"To understand the structural dynamics of a large-scale social, biological or technological network, it may be useful to discover behavioral roles representing the main connectivity patterns present over time. In this paper, we propose a scalable non-parametric approach to automatically learn the structural dynamics of the network and individual nodes. Roles may represent structural or behavioral patterns such as the center of a star, peripheral nodes, or bridge nodes that connect different communities. Our novel approach learns the appropriate structural role dynamics for any arbitrary network and tracks the changes over time. In particular, we uncover the specific global network dynamics and the local node dynamics of a technological, communication, and social network. We identify interesting node and network patterns such as stationary and non-stationary roles, spikes/steps in role-memberships (perhaps indicating anomalies), increasing/decreasing role trends, among many others. Our results indicate that the nodes in each of these networks have distinct connectivity patterns that are non-stationary and evolve considerably over time. Overall, the experiments demonstrate the effectiveness of our approach for fast mining and tracking of the dynamics in large networks. Furthermore, the dynamic structural representation provides a basis for building more sophisticated models and tools that are fast for exploring large dynamic networks."
in_NB  network_data_analysis  data_mining  community_discovery  neville.jennifer 
7 weeks ago by cshalizi
Network Analysis of Corticocortical Connections Reveals Ventral and Dorsal Processing Streams in Mouse Visual Cortex
"Much of the information used for visual perception and visually guided actions is processed in complex networks of connections within the cortex. To understand how this works in the normal brain and to determine the impact of disease, mice are promising models. In primate visual cortex, information is processed in a dorsal stream specialized for visuospatial processing and guided action and a ventral stream for object recognition. Here, we traced the outputs of 10 visual areas and used quantitative graph analytic tools of modern network science to determine, from the projection strengths in 39 cortical targets, the community structure of the network. We found a high density of the cortical graph that exceeded that shown previously in monkey. Each source area showed a unique distribution of projection weights across its targets (i.e., connectivity profile) that was well fit by a lognormal function. Importantly, the community structure was strongly dependent on the location of the source area: outputs from medial/anterior extrastriate areas were more strongly linked to parietal, motor, and limbic cortices, whereas lateral extrastriate areas were preferentially connected to temporal and parahippocampal cortices. These two subnetworks resemble dorsal and ventral cortical streams in primates, demonstrating that the basic layout of cortical networks is conserved across species."
to:NB  neuroscience  neural_data_analysis  network_data_analysis 
8 weeks ago by cshalizi
[1203.5351] Activity driven modeling of dynamic networks
"Network modeling plays a critical role in identifying statistical regularities and structural principles common to many systems. The large majority of recent modeling approaches are connectivity driven, in the sense that the structural pattern of the network is at the basis of the mechanisms ruling the network formation. Connectivity driven models necessarily provide a time-aggregated representation that may fail to describe the instantaneous and fluctuating dynamics of many networks. We address this challenge by defining the activity potential, a time invariant function characterizing the agents' interactions in real-world networks and constructing an activity driven model capable of encoding the instantaneous time description of the network dynamics. The model provides an explanation of structural features such as the presence of hubs, which simply originate from the heterogeneous activity of agents. Additionally, we find that diffusive processes in highly dynamical networks can be described analytically in terms of the activity potential, allowing a quantitative discussion of the biases induced by the time-aggregated network representation in the analysis of dynamical processes in evolving networks."
to:NB  network_data_analysis  networks  stochastic_processes  markov_models  transaction_networks  to_read  re:stacs 
8 weeks ago by cshalizi
[1203.5974] The Concentration and Stability of the Community Detecting Functions on Random Networks
"We propose a general form of community detecting functions for finding the communities or the optimal partition of a random network, and examine the concentration and stability of the function values using the bounded difference martingale method. We derive LDP inequalities for both the general case and several specific community detecting functions: modularity, graph bipartitioning and q-Potts community structure. We also discuss the concentration and stability of community detecting functions on different types of random networks: the sparse and non-sparse networks and some examples such as ER and CL networks."
in_NB  to_read  community_discovery  network_data_analysis  statistics 
8 weeks ago by cshalizi
[1203.5438] A Regularization Approach for Prediction of Edges and Node Features in Dynamic Graphs
"We consider the two problems of predicting links in a dynamic graph sequence and predicting functions defined at each node of the graph. In many applications, the solution of one problem is useful for solving the other. Indeed, if these functions reflect node features, then they are related through the graph structure. In this paper, we formulate a hybrid approach that simultaneously learns the structure of the graph and predicts the values of the node-related functions. Our approach is based on the optimization of a joint regularization objective. We empirically test the benefits of the proposed method with both synthetic and real data. The results indicate that joint regularization improves prediction performance over the graph evolution and the node features."
to:NB  network_data_analysis 
8 weeks ago by cshalizi
[1203.6093] Consensus clustering in complex networks
"The community structure of complex networks reveals both their organization and hidden relationships among their constituents. Most community detection methods currently available are not deterministic, and their results typically depend on the specific random seeds, initial conditions and tie-break rules adopted for their execution. Consensus clustering is used in data analysis to generate stable results out of a set of partitions delivered by stochastic methods. Here we show that consensus clustering can be combined with any existing method in a self-consistent way, enhancing considerably both the stability and the accuracy of the resulting partitions. This framework is also particularly suitable to monitor the evolution of community structure in temporal networks. An application of consensus clustering to a large citation network of physics papers demonstrates its capability to keep track of the birth, death and diversification of topics."
in_NB  community_discovery  network_data_analysis  clustering  data_mining 
8 weeks ago by cshalizi
Taylor & Francis Online :: Nonparametric Regression on a Graph - Journal of Computational and Graphical Statistics - Volume 20, Issue 2
"The ‘Signal plus Noise’ model for nonparametric regression can be extended to the case of observations taken at the vertices of a graph. This model includes many familiar regression problems. This article discusses the use of the edges of a graph to measure roughness in penalized regression. Distance between estimate and observation is measured at every vertex in the L2 norm, and roughness is penalized on every edge in the L1 norm. Thus the ideas of total variation penalization can be extended to a graph. The resulting minimization problem presents special computational challenges, so we describe a new and fast algorithm and demonstrate its use with examples.

The examples include image analysis, a simulation applicable to discrete spatial variation, and classification. In our examples, penalized regression improves upon kernel smoothing in terms of identifying local extreme values on planar graphs. In all examples we use fully automatic procedures for setting the smoothing parameters."
to:NB  statistics  network_data_analysis  smoothing  regression 
8 weeks ago by cshalizi
Taylor & Francis Online :: Statistical Inference on Random Graphs: Comparative Power Analyses via Monte Carlo - Journal of Computational and Graphical Statistics - Volume 20, Issue 2
"We present a comparative power analysis, via Monte Carlo, of various graph invariants used as statistics for testing graph homogeneity versus a “chatter” alternative—the existence of a local region of excessive activity. Our results indicate that statistical inference on random graphs, even in a relatively simple setting, can be decidedly nontrivial. We find that none of the graph invariants considered is uniformly most powerful throughout our space of alternatives. Code for reproducing all the simulation results presented in this article is available online."
to:NB  re:network_differences  statistics  hypothesis_testing  network_data_analysis 
8 weeks ago by cshalizi
[1202.4805] Fast Generation of Large Scale Social Networks with Clustering
"A key challenge within the social network literature is the problem of network generation - that is, how can we create synthetic networks that match characteristics traditionally found in most real world networks? Important characteristics that are present in social networks include a power law degree distribution, small diameter and large amounts of clustering; however, most current network generators, such as the Chung Lu and Kronecker models, largely ignore the clustering present in a graph and choose to focus on preserving other network statistics, such as the power law distribution. Models such as the exponential random graph model have a transitivity parameter, but are computationally difficult to learn, making scaling to large real world networks intractable. In this work, we propose an extension to the Chung Lu ran- dom graph model, the Transitive Chung Lu (TCL) model, which incorporates the notion of a random transitive edge. That is, with some probability it will choose to connect to a node exactly two hops away, having been introduced to a 'friend of a friend'. In all other cases it will follow the standard Chung Lu model, selecting a 'random surfer' from anywhere in the graph according to the given invariant distribution. We prove TCL's expected degree distribution is equal to the degree distribution of the original graph, while being able to capture the clustering present in the network. The single parameter required by our model can be learned in seconds on graphs with millions of edges, while networks can be generated in time that is linear in the number of edges. We demonstrate the performance TCL on four real- world social networks, including an email dataset with hundreds of thousands of nodes and millions of edges, showing TCL generates graphs that match the degree distribution, clustering coefficients and hop plots of the original networks."
to:NB  network_data_analysis 
12 weeks ago by cshalizi
Phys. Rev. E 78, 046102 (2008): Network quotients: Structural skeletons of complex systems
"A defining feature of many large empirical networks is their intrinsic complexity. However, many networks also contain a large degree of structural repetition. An immediate question then arises: can we characterize essential network complexity while excluding structural redundancy? In this article we utilize inherent network symmetry to collapse all redundant information from a network, resulting in a coarse graining which we show to carry the essential structural information of the “parent” network. In the context of algebraic combinatorics, this coarse-graining is known as the “quotient.” We systematically explore the theoretical properties of network quotients and summarize key statistics of a variety of “real-world” quotients with respect to those of their parent networks. In particular, we find that quotients can be substantially smaller than their parent networks yet typically preserve various key functional properties such as complexity (heterogeneity and hub vertices) and communication (diameter and mean geodesic distance), suggesting that quotients constitute the essential structural skeletons of their parent networks. We summarize with a discussion of potential uses of quotients in analysis of biological regulatory networks and ways in which using quotients can reduce the computational complexity of network algorithms."
in_NB  network_data_analysis 
february 2012 by cshalizi
Phylogenetic Networks - Academic and Professional Books - Cambridge University Press
"The evolutionary history of species is traditionally represented using a rooted phylogenetic tree. However, when reticulate events such as hybridization, horizontal gene transfer or recombination are believed to be involved, phylogenetic networks that can accommodate non-treelike evolution have an important role to play. This book provides the first interdisciplinary overview of phylogenetic networks. Beginning with a concise introduction to both phylogenetic trees and phylogenetic networks, the fundamental concepts and results are then presented for both rooted and unrooted phylogenetic networks. Current approaches and algorithms available for computing phylogenetic networks from different types of datasets are then discussed, accompanied by examples of their application to real biological datasets. The book also summarises the algorithms used for drawing phylogenetic networks, along with the existing software for their computation and evaluation. All datasets, examples and other additional information and links are available from the book's companion website at www.phylogenetic-networks.org."
in_NB  phylogenetics  network_data_analysis  evolutionary_biology  cultural_evolution  books:noted 
february 2012 by cshalizi
Okabayashi , Geyer : Long range search for maximum likelihood in exponential families
"Exponential families are often used to model data sets with complex dependence. Maximum likelihood estimators (MLE) can be difficult to estimate when the likelihood is expensive to compute. Markov chain Monte Carlo (MCMC) methods based on the MCMC-MLE algorithm in [17] are guaranteed to converge in theory under certain conditions when starting from any value, but in practice such an algorithm may labor to converge when given a poor starting value. We present a simple line search algorithm to find the MLE of a regular exponential family when the MLE exists and is unique. The algorithm can be started from any initial value and avoids the trial and error experimentation associated with calibrating algorithms like stochastic approximation. Unlike many optimization algorithms, this approach utilizes first derivative information only, evaluating neither the likelihood function itself nor derivatives of higher order than first. We show convergence of the algorithm for the case where the gradient can be calculated exactly. When it cannot, it has a particularly convenient form that is easily estimable with MCMC, making the algorithm still useful to a practitioner."
to:NB  statistics  exponential_families  exponential_family_random_graphs  network_data_analysis  estimation  monte_carlo  optimization  geyer.charles 
february 2012 by cshalizi
[1112.6028] The entropy of stochastic blockmodel ensembles
"Stochastic blockmodels are generative network models where the vertices are separated into discrete groups, and the probability of an edge existing between two vertices is determined solely by their group membership. In this paper, we derive expressions for the entropy of stochastic blockmodel ensembles. We consider several ensemble variants, including the traditional model as well as the newly introduced degree-corrected version [Karrer et al. Phys. Rev. E {bf 83}, 016107 (2011)], which imposes a degree sequence on the vertices, in addition to the block structure. The imposed degree sequence is implemented both as "soft" constraints, where only the expected degrees are imposed, and as "hard" constraints, where they are required to be the same on all samples of the ensemble. We also consider generalizations to multigraphs and directed graphs. We illustrate one of many applications of this measure by directly deriving a log-likelihood function from the entropy expression, and using it to infer latent block structure in observed data. Due to the general nature of the ensembles considered, the method works well for ensembles with intrinsic degree correlations (i.e. with entropic origin) as well as extrinsic degree correlations, which go beyond the block structure."
in_NB  network_data_analysis  exponential_family_random_graphs  re:your_favorite_ergm_sucks  community_discovery 
january 2012 by cshalizi
[1201.2931] A glocal distance for network comparison
"When comparing networks (with the same number of nodes) with direct methods, a number of possible distances is already available in literature. Among others, two of the most common families are the set of edit-like distances and the spectral distances. The functions in the former family quantitatively evaluate the differences between two networks in terms of minimum number of edit operations (with possibly different costs) transforming one network into the other, that is, deletion and insertion of links, while spectral measures relies on functions of the eigenvalues of one of the connectivity matrices of the underlying graph.
A noticeable issue affecting edit distance is the fact of being local, i.e. not taking into account the global structure of the networks but only summing the contributions coming from each single link. On the other hand, spectral measures cannot distinguish isomorphic or isospectral graphs. We propose here a possible solution to overcome both issues: combining together an edit and a spectral distance in a product metric we will define textit{glocal}. In what follows we define the two components and the glocal metric itself, with a few examples of applications."
to:NB  network_data_analysis  re:network_differences 
january 2012 by cshalizi
[1201.1927] Respondent-driven Sampling on Directed Networks
"Respondent-driven sampling (RDS) is a commonly used substitute for random sampling when studying hidden populations, such as injective drug users or men who have sex with men, for which no sampling frame is known. The method works like a snowball sample but can, given that some assumptions are met, generate unbiased population estimates. One key assumption, not likely to be met, is that the acquaintance network in which the recruitment process takes place is undirected, meaning that all recruiters should have the potential to be recruited by the person they recruit. Here we investigate the potential bias of directedness by simulating RDS on real and artificial network structures. We show that directedness is likely to generate bias that cannot be compensated for unless the sampled individuals know how many potentially may have recruited them (i.e. their indegree), which is unlikely in most situations. We propose three indegree-based estimators for RDS on directed networks, and show that they can be used in the situation when only outdegrees are observed, either together with prior information or in a sensitivity analysis taking uncertainty of indegree properties of the network into account."
to:NB  network_data_analysis  network_sampling  statistics 
january 2012 by cshalizi
[1201.2788] Inferring global network properties from egocentric data with applications to epidemics
"Social networks are rarely observed in full detail. In many situations properties are known for only a sample of the individuals in the network and it is desirable to induce global properties of the full social network from this "egocentric" network data. In the current paper we study a few different types of egocentric data, and show what global network properties are consistent with those egocentric data. Two global network properties are considered: the size of the largest connected component in the network (the giant), and secondly, the possible size of an epidemic outbreak taking place on the network, in which transmission occurs only between network neighbours, and with probability $p$. The main conclusion is that in most cases, egocentric data allow for a large range of possible sizes of the giant and the outbreak. However, there is an upper bound for the latter. For the case that the network is selected uniformly among networks with prescribed egocentric data (satisfying some conditions), the asymptotic size of the giant and the outbreak is characterised."
to:NB  network_data_analysis  network_sampling  epidemic_models 
january 2012 by cshalizi
[1201.2770] Bergm: Bayesian Exponential Random Graphs in R
"In this paper we describe the basic features of the Bergm package for the open-source R software which provides a comprehensive framework for Bayesian analysis for exponential random graph models: tools for parameter estimation, model selection and goodness-of-fit diagnostics. We illustrate the capabilities of this package describing the algorithms that drive the package through a tutorial analysis of two well-known network datasets."
to:NB  network_data_analysis  statistics  exponential_family_random_graphs 
january 2012 by cshalizi
[1112.3265] Predicting Links and Inferring Attributes using a Social-Attribute Network (SAN)
"The effects of homophily and social influence suggest that both network structure and node attribute information can inform the tasks of link prediction and node attribute inference. However, the algorithmic question of how to efficiently incorporate these two sources of information remains largely unanswered. In this paper, we propose a Social-Attribute Network (SAN) model that gracefully integrates node attributes with network structure to predict network links and infer node attributes. We adapt several leading unsupervised link prediction algorithms to the SAN model and demonstrate performance improvement for each algorithm. We also generalize these algorithms to infer node attributes via the SAN model and show that we can further improve link prediction accuracy by first inferring attributes for nodes with missing attributes. We evaluate these algorithms on a novel Google+ network dataset and achieve state- of-the-art performance, thus demonstrating that the SAN model effectively integrates network structure and node attribute data."
in_NB  network_data_analysis  machine_learning  to_be_shot_after_a_fair_trial  homophily 
december 2011 by cshalizi
[1112.3308] Spatial correlations in attribute communities
"Community detection is an important tool for exploring and classifying the properties of large complex networks and should be of great help for spatial networks. Indeed, in addition to their location, nodes in spatial networks can have attributes such as the language for individuals, or any other socio-economical feature that we would like to identify in communities. We discuss in this paper a crucial aspect which was not considered in previous studies which is the possible existence of correlations between space and attributes. Introducing a simple toy model in which both space and node attributes are considered, we discuss the effect of space-attribute correlations on the results of various community detection methods proposed for spatial networks in this paper and in previous studies. When space is irrelevant, our model is equivalent to the stochastic block model which has been shown to display a detectability-non detectability transition. In the regime where space dominates the link formation process, most methods can fail to recover the communities, an effect which is particularly marked when space-attributes correlations are strong. In this latter case, community detection methods which remove the spatial component of the network can miss a large part of the community structure and can lead to incorrect results."
in_NB  to_read  statistics  community_discovery  network_data_analysis  spatial_statistics 
december 2011 by cshalizi
[1112.2774] Measuring Tie Strength in Implicit Social Networks
"Given a set of people and a set of events they attend, we address the problem of measuring connectedness or tie strength between each pair of persons given that attendance at mutual events gives an implicit social network between people. We take an axiomatic approach to this problem. Starting from a list of axioms that a measure of tie strength must satisfy, we characterize functions that satisfy all the axioms and show that there is a range of measures that satisfy this characterization. A measure of tie strength induces a ranking on the edges (and on the set of neighbors for every person). We show that for applications where the ranking, and not the absolute value of the tie strength, is the important thing about the measure, the axioms are equivalent to a natural partial order. Also, to settle on a particular measure, we must make a non-obvious decision about extending this partial order to a total order, and that this decision is best left to particular applications. We classify measures found in prior literature according to the axioms that they satisfy. In our experiments, we measure tie strength and the coverage of our axioms in several datasets. Also, for each dataset, we bound the maximum Kendall's Tau divergence (which measures the number of pairwise disagreements between two lists) between all measures that satisfy the axioms using the partial order. This informs us if particular datasets are well behaved where we do not have to worry about which measure to choose, or we have to be careful about the exact choice of measure we make."
to:NB  network_data_analysis  inference_to_latent_objects 
december 2011 by cshalizi
[1112.2755] Using Proximity to Predict Activity in Social Networks
"The structure of a social network contains information useful for predicting its evolution. Nodes that are "close" in some sense are more likely to become linked in the future than more distant nodes. We show that structural information can also help predict node activity. We use proximity to capture the degree to which two nodes are "close" to each other in the network. In addition to standard proximity metrics used in the link prediction task, such as neighborhood overlap, we introduce new metrics that model different types of interactions that can occur between network nodes. We argue that the "closer" nodes are in a social network, the more similar will be their activity. We study this claim using data about URL recommendation on social media sites Digg and Twitter. We show that structural proximity of two users in the follower graph is related to similarity of their activity, i.e., how many URLs they both recommend. We also show that given friends' activity, knowing their proximity to the user can help better predict which URLs the user will recommend. We compare the performance of different proximity metrics on the activity prediction task and find that some metrics lead to substantial performance improvements."
to:NB  network_data_analysis  machine_learning  lerman.kristina 
december 2011 by cshalizi
[1112.1831] Finding Overlapping Communities in Social Networks: Toward a Rigorous Approach
"A "community" in a social network is usually understood to be a group of nodes more densely connected with each other than with the rest of the network. This is an important concept in most domains where networks arise: social, technological, biological, etc. For many years algorithms for finding communities implicitly assumed communities are nonoverlapping (leading to use of clustering-based approaches) but there is increasing interest in finding overlapping communities. A barrier to finding communities is that the solution concept is often defined in terms of an NP-complete problem such as Clique or Hierarchical Clustering.
This paper seeks to initiate a rigorous approach to the problem of finding overlapping communities, where "rigorous" means that we clearly state the following: (a) the object sought by our algorithm (b) the assumptions about the underlying network (c) the (worst-case) running time.
Our assumptions about the network lie between worst-case and average-case. An average case analysis would require a precise probabilistic model of the network, on which there is currently no consensus. However, some plausible assumptions about network parameters can be gleaned from a long body of work in the sociology community spanning five decades focusing on the study of individual communities and ego-centric networks. Thus our assumptions are somewhat "local" in nature. Nevertheless they suffice to permit a rigorous analysis of running time of algorithms that recover global structure.
Our algorithms use random sampling similar to that in property testing and algorithms for dense graphs. However, our networks are not necessarily dense graphs, not even in local neighborhoods.
Our algorithms explore a local-global relationship between ego-centric and socio-centric networks that we hope will provide a fruitful framework for future work both in computer science and sociology."
in_NB  community_discovery  theoretical_computer_science  network_data_analysis 
december 2011 by cshalizi
[1112.1047] Network Inference and Biological Dynamics
"Network inference approaches are now widely used in biological applications to probe regulatory relationships between molecular components such as genes or proteins. Many methods have been proposed for this setting, but the connections and differences between their statistical formulations have received less attention. In this paper, we show how a broad class of statistical network inference methods, including a number of existing approaches, can be described in terms of variable selection for the linear model. This reveals some subtle but important differences between the methods, including the treatment of time intervals in discretely observed data. In developing a general formulation, we also explore the relationship between single-cell stochastic dynamics and network inference on averages over cells. This clarifies the link between biochemical networks as they operate at the cellular level and network inference as carried out on data that are averages over populations of cells. We present empirical results, comparing thirty-two network inference methods that are instances of the general formulation we describe, using two published dynamical models. Our investigation sheds light on the applicability and limitations of network inference and provides guidance for practitioners and suggestions for experimental design."
in_NB  have_read  biochemical_networks  network_data_analysis 
december 2011 by cshalizi
[1112.1115] Social-Topical Affiliations: The Interplay between Structure and Popularity
"Information popularity and social relationships are intimately connected. However, measuring the extent to which they affect each other has remained an open question. Because we now have access to rich and large data sets from online social networks, we can begin to quantitatively understand the interplay between them. We examine the interface of two decisive structures forming the backbone of online social media: the graph structure of social networks - who is friends with whom - and the set structure of topical affiliations - who talks about what. In studying this interface, we identify key relationships whereby each of these structures can be understood in terms of the other. The context for our study is Twitter, where we look at the social network of both follower relationships and communication relationships, alongside the affiliations outlined by the hashtags used by people to label their communications. On Twitter, we demonstrate how the hashtags that a user adopts can be used to predict their social relationships, and also how the social relationships between the adopters of a hashtag can be used to predict the future popularity of that hashtag. Importantly, we find that both relationships are driven by highly computationally simple structural determinants. While our analysis focuses on Twitter, we view our analysis of social-topical affiliations as broadly applicable to a host of diverse affiliations, including the movies people watch, the brands people like, or the locations people frequent."
in_NB  network_data_analysis  social_media  text_mining  community_discovery 
december 2011 by cshalizi
Phys. Rev. E 84, 056111 (2011): Exploring the structural regularities in networks
"In this paper, we consider the problem of exploring structural regularities of networks by dividing the nodes of a network into groups such that the members of each group have similar patterns of connections to other groups. Specifically, we propose a general statistical model to describe network structure. In this model, a group is viewed as a hidden or unobserved quantity and it is learned by fitting the observed network data using the expectation-maximization algorithm. Compared with existing models, the most prominent strength of our model is the high flexibility. This strength enables it to possess the advantages of existing models and to overcome their shortcomings in a unified way. As a result, not only can broad types of structure be detected without prior knowledge of the type of intrinsic regularities existing in the target network, but also the type of identified structure can be directly learned from the network. Moreover, by differentiating outgoing edges from incoming edges, our model can detect several types of structural regularities beyond competing models. Tests on a number of real world and artificial networks demonstrate that our model outperforms the state-of-the-art model in shedding light on the structural regularities of networks, including the overlapping community structure, multipartite structure, and several other types of structure, which are beyond the capability of existing models." [At a guess from the abstract: block models]
to:NB  network_data_analysis  community_discovery 
december 2011 by cshalizi
[1112.0840] On the question of effective sample size in network modeling
"We raise the issue of effective sample size in network graph modeling and inference and illustrate, using simple models and arguments, how this issue can quickly become nontrivial."
in_NB  network_data_analysis  have_read  estimation  statistics  fisher_information  exponential_family_random_graphs  kolaczyk.eric  krivitsky.pavel 
december 2011 by cshalizi
[1111.6118] Sample-to-sample fluctuations in real-network ensembles
"Network modeling based on ensemble averages tacitly assumes that the networks meant to be modeled are typical in the ensemble. Previous research on network eigenvalues, which govern a range of dynamical phenomena, has shown that this is indeed the case for uncorrelated networks with minimum degree $ge 3$. Here we focus on real networks, which generally have both structural correlations and low-degree nodes. We show that: (i) the ensemble distribution of the dynamically most important eigenvalues can be not only broad and far apart from the real eigenvalue but also highly structured, often with a multimodal rather than bell-shaped form; (ii) these interesting properties are found to be due to low-degree nodes, mainly those with degree $< 3$, and network communities, which is a common form of structural correlation found in real networks. In addition to having implications for ensemble-based approaches, this shows that low-degree nodes may have a stronger influence on collective dynamics than previously anticipated from the study of computer-generated networks."
to:NB  network_data_analysis 
december 2011 by cshalizi
[1111.6115] Discovering Network Structure Beyond Communities
"To understand the formation, evolution, and function of complex systems, it is crucial to understand the internal organization of their interaction networks. Partly due to the impossibility of visualizing large complex networks, resolving network structure remains a challenging problem. Here we overcome this difficulty by combining the visual pattern recognition ability of humans with the high processing speed of computers to develop an exploratory method for discovering groups of nodes characterized by common network properties, including but not limited to communities of densely connected nodes. Without any prior information about the nature of the groups, the method simultaneously identifies the number of groups, the group assignment, and the properties that define these groups. The results of applying our method to real networks suggest the possibility that most group structures lurk undiscovered in the fast-growing inventory of social, biological, and technological networks of scientific interest."
in_NB  network_data_analysis  community_discovery 
december 2011 by cshalizi
[1111.5899] Sampling, Filtering and Sparse Approximations on Combinatorial Graphs
In this paper we address sampling and approximation of functions on combinatorial graphs. We develop filtering on graphs by using Schr"odinger's group of operators generated by combinatorial Laplace operator. Then we construct a sampling theory by proving Poincare and Plancherel-Polya-type inequalities for functions on graphs. These results lead to a theory of sparse approximations on graphs and have potential applications to filtering, denoising, data dimension reduction, image processing, image compression, computer graphics, visualization and learning theory.
to:NB  network_data_analysis  approximation  graph_theory 
december 2011 by cshalizi
Social network models predict movement and connectivity in ecological landscapes
"Network analysis is on the rise across scientific disciplines because of its ability to reveal complex, and often emergent, patterns and dynamics. Nonetheless, a growing concern in network analysis is the use of limited data for constructing networks. This concern is strikingly relevant to ecology and conservation biology, where network analysis is used to infer connectivity across landscapes. In this context, movement among patches is the crucial parameter for interpreting connectivity but because of the difficulty of collecting reliable movement data, most network analysis proceeds with only indirect information on movement across landscapes rather than using observed movement to construct networks. Statistical models developed for social networks provide promising alternatives for landscape network construction because they can leverage limited movement information to predict linkages. Using two mark-recapture datasets on individual movement and connectivity across landscapes, we test whether commonly used network constructions for interpreting connectivity can predict actual linkages and network structure, and we contrast these approaches to social network models. We find that currently applied network constructions for assessing connectivity consistently, and substantially, overpredict actual connectivity, resulting in considerable overestimation of metapopulation lifetime. Furthermore, social network models provide accurate predictions of network structure, and can do so with remarkably limited data on movement. Social network models offer a flexible and powerful way for not only understanding the factors influencing connectivity but also for providing more reliable estimates of connectivity and metapopulation persistence in the face of limited data."
to:NB  network_data_analysis  ecology 
december 2011 by cshalizi
[1111.5312] Representations and Ensemble Methods for Dynamic Relational Classification
"Temporal networks are ubiquitous and evolve over time by the addition, deletion, and changing of links, nodes, and attributes. Although many relational datasets contain temporal information, the majority of existing techniques in relational learning focus on static snapshots and ignore the temporal dynamics. We propose a framework for discovering temporal representations of relational data to increase the accuracy of statistical relational learning algorithms. The temporal relational representations serve as a basis for classification, ensembles, and pattern mining in evolving domains. The framework includes (1) selecting the time-varying relational components (links, attributes, nodes), (2) selecting the temporal granularity, (3) predicting the temporal influence of each time-varying relational component, and (4) choosing the weighted relational classifier. Additionally, we propose temporal ensemble methods that exploit the temporal-dimension of relational data. These ensembles outperform traditional and more sophisticated relational ensembles while avoiding the issue of learning the most optimal representation. Finally, the space of temporal-relational models are evaluated using a sample of classifiers. In all cases, the proposed temporal-relational classifiers outperform competing models that ignore the temporal information. The results demonstrate the capability and necessity of the temporal-relational representations for classification, ensembles, and for mining temporal datasets."
in_NB  to_read  relational_learning  network_data_analysis  transaction_networks  neville.jennifer  machine_learning  ensemble_methods  time_series  classifiers 
november 2011 by cshalizi
PLoS ONE: The Small World of Psychopathology
"Background
Mental disorders are highly comorbid: people having one disorder are likely to have another as well. We explain empirical comorbidity patterns based on a network model of psychiatric symptoms, derived from an analysis of symptom overlap in the Diagnostic and Statistical Manual of Mental Disorders-IV (DSM-IV).

Principal Findings
We show that a) half of the symptoms in the DSM-IV network are connected, b) the architecture of these connections conforms to a small world structure, featuring a high degree of clustering but a short average path length, and c) distances between disorders in this structure predict empirical comorbidity rates. Network simulations of Major Depressive Episode and Generalized Anxiety Disorder show that the model faithfully reproduces empirical population statistics for these disorders.

Conclusions
In the network model, mental disorders are inherently complex. This explains the limited successes of genetic, neuroscientific, and etiological approaches to unravel their causes. We outline a psychosystems approach to investigate the structure and dynamics of mental disorders."
to:NB  psychometrics  psychiatry  network_data_analysis  inference_to_latent_objects  borsboom.denny  have_read  to:blog 
november 2011 by cshalizi
[1111.4181] Locating privileged information spreaders during political protests on an Online Social Network
"Although the phrase "Twitter revolution" was coined back in 2009 to refer to the mass mobilizations in Moldova and soon after in Iran, year 2011 has confirmed the connection between social media and social unrest. Undoubtedly, the "Arab spring" or the "Spanish revolution" --which has spread throughout and culminated with Liberty Square occupation in New York-- cannot be understood without the role of social networking sites to help protesters self-organize and attain a critical mass of participants. In this context, we need to distinguish dynamic activity (which comprises actual information exchange) from the underlying structure (which reflects relatively stable relationships between users --who follows who). We provide a quantitative analysis which stems from complex network theory to scrutinize the mobilization's temporal evolution and its resulting structure and dynamics both at the macro- and micro-scale levels. Most importantly, we study the interplay between the structural and dynamic levels to decipher how the former facilitates the latter's success, understood as efficiency in information spreading. We discuss who (and why) has privileged spreading capabilities when it comes to information diffusion, based on the analysis of empirical data."
to:NB  to_read  social_media  network_data_analysis  social_influence  arab_spring 
november 2011 by cshalizi
[1111.3652] Features and heterogeneities in growing network models
"we provide a generalization of growing network models with preferential attachment that includes the effect of heterogeneous features of the nodes. The main effect of heterogeneity is the emergence of an "effective fitness" for each class of nodes, determining the rate at which nodes acquire new links. Beyond the degree distribution, in this paper we give a full characterization of the other relevant properties of the model. We evaluate the clustering coefficient and show that it disappears for large network size, a property shared with the Barab'asi-Albert model. Negative degree correlations are also present in the studied class of models, along with non-trivial mixing patterns among features. We therefore conclude that both small clustering coefficients and disassortative mixing are outcomes of the preferential attachment mechanism in general growing networks."
to:NB  network_data_analysis  cumulative_advantage  network_growth  homophily 
november 2011 by cshalizi
[1111.3054] Consistency under Sampling of Exponential Random Graph Models
"The growing availability of network data and of scientific interest in distributed systems has led to the rapid development of statistical models of network structure. Typically, however, these are models for the entire network, while the data consists only of a sampled sub-network. Parameters for the whole network, which is what is of interest, are estimated by applying the model to the sub-network. This assumes that the model is consistent under sampling, or, in terms of the theory of stochastic processes, that it defines a projective family. Focussing on the popular class of exponential random graph models (ERGMs), we show that this apparently trivial condition is in fact violated by many popular and scientifically appealing models, and that satisfying it drastically limits ERGM's expressive power. These results are actually special cases of more general ones about exponential families of dependent random variables, which we also prove. Using such results, we offer easily checked conditions for the consistency of maximum likelihood estimation in ERGMs, and discuss some possible constructive responses."
in_NB  self-promotion  exponential_family_random_graphs  exponential_families  statistical_inference_for_stochastic_processes  statistics  network_data_analysis  re:your_favorite_ergm_sucks  estimation  large_deviations 
november 2011 by cshalizi
Bickel , Chen , Levina : The method of moments and degree distributions for network models
"Probability models on graphs are becoming increasingly important in many applications, but statistical tools for fitting such models are not yet well developed. Here we propose a general method of moments approach that can be used to fit a large class of probability models through empirical counts of certain patterns in a graph. We establish some general asymptotic properties of empirical graph moments and prove consistency of the estimates as the graph size grows for all ranges of the average degree including Omega(1). Additional results are obtained for the important special case of degree distributions."

After reading this, I note that they do not go through even one example of actually estimating anything. I think this is because the inversion from moments to graphons, while mathematically well-defined, is hellish to calculate (and probably very numerically unstable).
network_data_analysis  statistics  estimation  bickel.peter  levina.elizaveta  re:smoothing_adjacency_matrices  in_NB  have_read 
november 2011 by cshalizi
[1111.1352] Max-plus objects to study the complexity of graphs
"Given an undirected graph $G$, we define a new object $H_G$, called the mp-chart of $G$, in the max-plus algebra. We use it, together with the max-plus permanent, to describe the complexity of graphs. We show how to compute the mean and the variance of $H_G$ in terms of the adjacency matrix of $G$ and we give a central limit theorem for $H_G$. Finally, we show that the mp-chart is easily tractable also for the complement graph." --- I have no idea what this means either.
in_NB  network_data_analysis  graph_theory  algebra 
november 2011 by cshalizi
[1111.2018] Intrinsically Dynamic Network Communities
"Community finding algorithms for networks have recently been extended to dynamic data. Most of these recent methods aim at exhibiting community partitions from successive graph snapshots and thereafter connecting or smoothing these partitions using clever time-dependent features and sampling techniques. These approaches are nonetheless achieving longitudinal rather than dynamic community detection. We assume that communities are fundamentally defined by the repetition of interactions among a set of nodes over time. According to this definition, analyzing the data by considering successive snapshots induces a significant loss of information: we suggest that it blurs essentially dynamic phenomena - such as communities based on repeated inter-temporal interactions, nodes switching from a community to another across time, or the possibility that a community survives while its members are being integrally replaced over a longer time period. We propose a formalism which aims at tackling this issue in the context of time-directed datasets (such as citation networks), and present several illustrations on both empirical and synthetic dynamic networks. We eventually introduce intrinsically dynamic metrics to qualify temporal community structure and emphasize their possible role as an estimator of the quality of the community detection - taking into account the fact that various empirical contexts may call for distinct `community' definitions and detection criteria."
in_NB  community_discovery  network_data_analysis  transaction_networks  roth.camille 
november 2011 by cshalizi
Evolving Cluster Mixed-Membership Blockmodel for Time-Evolving Networks
"Time-evolving networks are a natural presentation for dynamic social and biological interactions. While latent space models are gaining popularity in network modeling and analysis, previous works mostly ignore networks with temporal behavior and multi-modal actor roles. Furthermore, prior knowledge, such as division and grouping of social actors or biological specificity of molecular functions, has not been systematically exploited in network modeling. In this paper, we develop a network model featuring a state space mixture prior that tracks complex actor latent role changes through time. We provide a fast variational inference algorithm for learning our model, and validate it with simulations and held-out likelihood comparisons on real-world time-evolving networks. Finally, we demonstrate our model's utility as a network analysis tool, by applying it to United States Congress voting data. "
in_NB  to_read  community_discovery  network_data_analysis  machine_learning  xing.eric 
november 2011 by cshalizi
Multiscale Community Blockmodel for Network Exploration
"Real world networks exhibit a complex set of phenomena such as underlying hierarchical organization, multiscale interaction, and varying topologies of communities. Most existing methods do not adequately capture the intrinsic interplay among such phenomena. We propose a nonparametric Multiscale Community Blockmodel (MSCB) to model the generation of hierarchies in social communities, selective membership of actors to subsets of these communities, and the resultant networks due to within- and cross- community interactions. By using the nested Chinese Restaurant Process, our model automatically infers the hierarchy structure from the data. We develop a collapsed Gibbs sampling algorithm for posterior inference, conduct extensive validation using synthetic networks, and demonstrate the utility of our model in real-world datasets such as predator-prey networks and citation networks. "
in_NB  to_read  community_discovery  network_data_analysis  machine_learning  statistics  xing.eric 
november 2011 by cshalizi
[1110.6517] Classification and estimation in the Stochastic Block Model based on the empirical degrees
"The Stochastic Block Model (Holland et al., 1983) is a mixture model for heterogeneous network data. Unlike the usual statistical framework, new nodes give additional information about the previous ones in this model. Thereby the distribution of the degrees concentrates in points conditionally on the node class. We show under a mild assumption that classification, estimation and model selection can actually be achieved with no more than the empirical degree data. We provide an algorithm able to process very large networks and consistent estimators based on it. In particular, we prove a bound of the probability of misclassification of at least one node, including when the number of classes grows."
in_NB  to_read  network_data_analysis  community_discovery  statistics 
november 2011 by cshalizi
[1110.5383] Quilting Stochastic Kronecker Product Graphs to Generate Multiplicative Attribute Graphs
"We describe the first sub-quadratic sampling algorithm for the Multiplicative Attribute Graph Model (MAGM) of Kim and Leskovec (2010). We exploit the close connection between MAGM and the Kronecker Product Graph Model (KPGM) of Leskovec et al. (2010), and show that to sample a graph from a MAGM it suffices to sample small number of KPGM graphs and emph{quilt} them together. Under mild technical conditions our algorithm runs in $O((log_2(n))^3 |E|)$ time, where $n$ is the number of nodes and $|E|$ is the number of edges in the sampled graph. We demonstrate the scalability of our algorithm via extensive empirical evaluation; we can sample a MAGM graph with 8 million nodes and 20 billion edges in under 6 hours."
in_NB  to_read  network_data_analysis  re:smoothing_adjacency_matrices 
october 2011 by cshalizi
[1110.5186] Removing spurious interactions in complex networks
"Identifying and removing spurious links in complex networks is a meaningful problem for many real applications and is crucial for improving the reliability of network data, which in turn can lead to a better understanding of the highly interconnected nature of various social, biological and communication systems. In this work we study the features of different simple spurious link elimination methods, revealing that they may lead to the distortion of networks' structural and dynamical properties. Accordingly, we propose a hybrid method which combines similarity-based index and edge-betweenness centrality. We show that our method can effectively eliminate the spurious interactions while leaving the network connected and preserving the network's functionalities."
in_NB  network_data_analysis 
october 2011 by cshalizi
[1110.3864] Parameter-free identification of salient features in complex networks
"Large scale complex networks in natural, social, and technological systems generically exhibit an overabundance of rich information. Extracting essential and meaningful structural features from network data is one of the most challenging tasks in network theory. In this context, a variety of methods and concepts have been proposed, including centrality statistics, motif identification, community detection algorithms, hierarchical models, and backbone-extraction methods. Typically these classification schemes rely on external and often arbitrary parameters, such as centrality thresholds. However, parameter-dependent classifications are often problematic, since the resulting classifications of network elements depend sensitively on the parameter, and it is also unknown whether typical networks permit the classification of elements without external intervention. Here we introduce the concept of link salience, a parameter-free approach for classifying network elements based on a consensus estimate of all nodes. We show that a wide range of empirical networks exhibit a natural, network-implicit, and robust classification of links into two qualitatively distinct groups. We show that despite significant differences in the networks' tologogy and statistical features, their salient skeletons exhibit universal topological and statistical features. In addition to a parameter- free method for network reduction, link salience points the way towards a better understanding of universal, hidden features in real world networks that are masked by their complexity."
in_NB  network_data_analysis  to_be_shot_after_a_fair_trial 
october 2011 by cshalizi
[1110.4088] Infinitely exchangeable random graphs generated from a Poisson point process on monotone sets and applications to cluster analysis for networks
"We construct an infinitely exchangeable process on the set $cate$ of subsets of the power set of the natural numbers $mathbb{N}$ via a Poisson point process with mean measure $Lambda$ on the power set of $mathbb{N}$. Each $Eincate$ has a least monotone cover in $catf$, the collection of monotone subsets of $cate$, and every monotone subset maps to an undirected graph $Gincatg$, the space of undirected graphs with vertex set $mathbb{N}$. We show a natural mapping $caterightarrowcatfrightarrowcatg$ which induces an infinitely exchangeable measure on the projective system $catg^{rest}$ of graphs $catg$ under permutation and restriction mappings given an infinitely exchangeable family of measures on the projective system $cate^{rest}$ of subsets with permutation and restriction maps. We show potential connections of this process to applications in cluster analysis, machine learning, classification and Bayesian inference."
to:NB  to_read  stochastic_processes  networks  network_data_analysis  point_processes  graph_limits  re:smoothing_adjacency_matrices  re:your_favorite_ergm_sucks 
october 2011 by cshalizi
[1110.3860] Contending Parties: A Logistic Choice Analysis of Inter- and Intra-group Blog Citation Dynamics in the 2004 US Presidential Election
"The 2004 US Presidential Election cycle marked the debut of Internet-based media such as blogs and social networking websites as institutionally recognized features of the American political landscape. Using a longitudinal sample of all DNC/RNC-designated blog-citation networks we are able to test the influence of various strategic, institutional, and balance-theoretic mechanisms and exogenous factors such as seasonality and political events on the propensity of blogs to cite one another over time. Capitalizing on the temporal resolution of our data, we utilize an autoregressive network regression framework to carry out inference for a logistic choice process. Using a combination of deviance-based model selection criteria and simulation-based model adequacy tests, we identify the combination of processes that best characterizes the choice behavior of the contending blogs."
to:NB  network_data_analysis  blogs  us_politics  model_selection  simulation 
october 2011 by cshalizi
[1110.3225] Mining Patterns in Networks using Homomorphism
"In recent years many algorithms have been developed for finding patterns in graphs and networks. A disadvantage of these algorithms is that they use subgraph isomorphism to determine the support of a graph pattern; subgraph isomorphism is a well-known NP complete problem. In this paper, we propose an alternative approach which mines tree patterns in networks by using subgraph homomorphism. The advantage of homomorphism is that it can be computed in polynomial time, which allows us to develop an algorithm that mines tree patterns in arbitrary graphs in incremental polynomial time. Homomorphism however entails two problems not found when using isomorphism: (1) two patterns of different size can be equivalent; (2) patterns of unbounded size can be frequent. In this paper we formalize these problems and study solutions that easily fit within our algorithm."
in_NB  to_read  re:smoothing_adjacency_matrices  network_data_analysis  data_mining  graph_theory 
october 2011 by cshalizi
[1110.2558] Epidemic centrality and the underestimated epidemic impact on network peripheral nodes
"Studies of disease spreading on complex networks have provided a deep insight into the conditions of onset, dynamics and prevention of epidemics in human populations and malicious software propagation in computer networks. Identifying nodes which, when initially infected, infect the largest part of the network and ranking them according to their epidemic impact is a priority for public health policies. In simulations of the disease spreading in SIR model on studied empirical complex networks, it is shown that the ranking depends on the dynamical regime of the disease spreading. A possible mechanism leading to this dynamical dependence is illustrated in an analytically tractable example. A measure called epidemic centrality, averaging the epidemic impact over all possible disease spreading regimes, is introduced as a basis of epidemic ranking. Contrary to standard notion, the epidemic centrality of nodes with high degree, k-cores value or betweenness, which are structurally central, is comparable to epidemic centrality of structurally peripheral nodes. These findings indicate that the impact of an epidemic starting at structurally peripheral nodes may be considerably underestimated. Network periphery should gain a more prominent role in the allocation of resources in future epidemic preparedness plans."
in_NB  to_read  network_data_analysis  epidemic_models  re:social-networks-as-sensor-networks 
october 2011 by cshalizi
Weisfiler-Lehman Graph Kernels
"In this article, we propose a family of efficient kernels for large graphs with discrete node labels. Key to our method is a rapid feature extraction scheme based on the Weisfeiler-Lehman test of isomorphism on graphs. It maps the original graph to a sequence of graphs, whose node attributes capture topological and label information. A family of kernels can be defined based on this Weisfeiler-Lehman sequence of graphs, including a highly efficient kernel comparing subtree-like patterns. Its runtime scales only linearly in the number of edges of the graphs and the length of the Weisfeiler-Lehman graph sequence. In our experimental evaluation, our kernels outperform state-of-the-art graph kernels on several graph classification benchmark data sets in terms of accuracy and runtime. Our kernels open the door to large-scale applications of graph kernels in various disciplines such as computational biology and social network analysis."
in_NB  network_data_analysis  kernel_methods  data_analysis  graph_limits  machine_learning  re:smoothing_adjacency_matrices  to_read 
october 2011 by cshalizi
New consistent and asymptotically normal parameter estimates for random-graph mixture models - Ambroise - 2011 - Journal of the Royal Statistical Society: Series B (Statistical Methodology) - Wiley Online Library
"Random-graph mixture models are very popular for modelling real data networks. Parameter estimation procedures usually rely on variational approximations, either combined with the expectation–maximization (em) algorithm or with Bayesian approaches. Despite good results on synthetic data, the validity of the variational approximation is, however, not established. Moreover, these variational approaches aim at approximating the maximum likelihood or the maximum a posteriori estimators, whose behaviour in an asymptotic framework (as the sample size increases to ∞) remains unknown for these models. In this work, we show that, in many different affiliation contexts (for binary or weighted graphs), parameter estimators based either on moment equations or on the maximization of some composite likelihood are strongly consistent and √n convergent, when the number n of nodes increases to ∞. As a consequence, our result establishes that the overall structure of an affiliation model can be (asymptotically) caught by the description of the network in terms of its number of triads (order 3 structures) and edges (order 2 structures). Moreover, these parameter estimates are either explicit (as for the moment estimators) or may be approximated by using a simple em algorithm, whose convergence properties are known. We illustrate the efficiency of our method on simulated data and compare its performances with other existing procedures. A data set of cross-citations among economics journals is also analysed." --- Do they just mean block models?
network_data_analysis  statistics  to:NB  to_read  re:your_favorite_ergm_sucks 
october 2011 by cshalizi
[1109.3911] Benefits of Bias: Towards Better Characterization of Network Sampling
"From social networks to P2P systems, network sampling arises in many settings. We present a detailed study on the nature of biases in network sampling strategies to shed light on how best to sample from networks. We investigate connections between specific biases and various measures of structural representativeness. We show that certain biases are, in fact, beneficial for many applications, as they "push" the sampling process towards inclusion of desired properties. Finally, we describe how these sampling biases can be exploited in several, real-world applications including disease outbreak detection and market research."
network_data_analysis  in_NB  network_sampling 
september 2011 by cshalizi
[1108.2228] A consistent dot product embedding for stochastic blockmodel graphs
"We present a method to estimate block membership of nodes in a random graph generated by a stochastic blockmodel. We use an embedding procedure motivated by the random dot product graph model, a particular example of the latent position model. The embedded vectors are clustered through minimization of a mean square error/criteria. We prove that this method is consistent for assigning nodes to blocks, as only a negligible number of nodes will be mis-assigned. We prove consistency of the method for directed and undirected graphs. The consistent block assignment makes possible consistent parameter estimation for a stochastic blockmodel. We extend the result for when the number of blocks grows slowly with the number of nodes. Our method is also computationally feasible even for very large graphs."
community_discovery  network_data_analysis  in_NB  statistics  re:homophily_and_confounding 
august 2011 by cshalizi
Phys. Rev. E 84, 026103 (2011): Constrained randomization of weighted networks
"We propose a Markov chain method to efficiently generate surrogate networks that are random under the constraint of given vertex strengths. With these strength-preserving surrogates and with edge-weight-preserving surrogates we investigate the clustering coefficient and the average shortest path length of functional networks of the human brain as well as of the International Trade Networks. We demonstrate that surrogate networks can provide additional information about network-specific characteristics and thus help interpreting empirical weighted networks."
network_data_analysis  surrogate_data  in_NB 
august 2011 by cshalizi
« earlier      

related tags

20th_century_history  adamic.lada  afghanistan  ahmed.amr  airoldi.edo  algebra  algebraic_statistics  amaral.luis  anomaly_detection  approximation  arab_spring  basu.sumit  belief_propagation  bender-de_moll.skye  bergstrom.carl  bibliometry  bickel.peter  biochemical_networks  bioinformatics  blitzstein.joseph  blogged  blogging  blogs  books:noted  books:recommended  bootstrap  borsboom.denny  brandeis.louis  butts.carter_t  campaign_finance  carnegie_mellon  causal_inference  chatterjee.souav  choudhury.tanzeem  citation_networks  clarkson.brian  classifiers  clauset.aaron  clustering  community_discovery  complexity  computational_complexity  computational_humanities  computational_statistics  conferences  confounding  congress  contagion  context-free_grammars  corporations  counter-insurgency  creeping_authoritarianism  criminal_conspiracies  cultural_evolution  cumulative_advantage  darpa  data_analysis  data_mining  data_sets  democracy  diaconis.persi  discretization  dynamical_systems  eagle.nathan  eckles.dean  ecology  econometrics  em_algorithm  enron  ensemble_methods  epidemic_models  epidemiology  estimation  ethology  events  evisceration  evolutionary_biology  exchangeable_arrays  experimental_sociology  exponential_families  exponential_family_random_graphs  factor_analysis  FBI  feith.douglas  fienberg.steve  finance  fisher_information  flickr  fmri  food_webs  fraud  freeman.linton_c.  functional_connectivity  functional_data_analysis  gene_expression_data_analysis  getoor.lise  geyer.charles  gigs  goldenberg.anna  goodreau.steven_m  grants  graphical_models  graph_limits  graph_theory  handcock.mark  handcock.mark_s  have_read  heard_the_talk  heavy_tails  hierarchical_structure  high-dimensional_statistics  hilbert_space  historical_linguistics  hofman.jake  holme.petter  homophily  humanities  hunter.david_r  hypergraphs  hypothesis_testing  identifiability  increasing_returns  inference_to_latent_objects  influence  information_theory  in_NB  ising_model  kernel_methods  kirshner.sergey  kith_and_kin  kolaczyk.eric  kolar.mladen  kossinets.gueorgi  krivitsky.pavel  kronecker_graphs  large_deviations  lasso  latent_variables  lauritzen.steffen  lazer.david  learning  learning_theory  leenders.roger  lerman.kristina  levina.elizaveta  levina.liza  link_prediction  logistic_regression  low-rank_approximation  machine_learning  manifold_learning  map-reduce  markov_models  mathematics  mcpherson.miller  mean-field_theory  menczer.filippo  military_industrial_complex  minimum_description_length  model-checking  modeling  model_selection  monte_carlo  moody.james  moore.cris  morris.martina  national_surveillance_state  natural_history_of_truthiness  networks  network_data_analysis  network_formation  network_growth  network_sampling  neural_data_analysis  neuropsychology  neuroscience  neville.jennifer  newman.mark  nexus-7  noel.hans  nonparametrics  NSA  nyhan.brendan  optimization  oscillators  owen.art  parallel_computing  pattison.philippa  pentland.alex  phase_transitions  phylogenetics  pittsburgh  point_processes  political_networks  political_science  prediction  preferential_attachment  psychiatry  psychometrics  random_fields  random_graphs  random_walks  re:almost_none  re:critique_of_diffusion  re:donor_networks  re:functional_communities  re:homophily_and_confounding  re:network_differences  re:sensor-networks-as-social-networks  re:smoothing_adjacency_matrices  re:social-networks-as-sensor-networks  re:sporns_review  re:stacs  re:XV_for_networks  re:your_favorite_ergm_sucks  regression  relational_learning  rinaldo.alessandro  robins.james  rosvall.martin  roth.camille  running_dogs_of_reaction  sampling  self-centered  self-promotion  semantics_from_syntax  simulation  smoothing  snijders.tom  social_influence  social_life_of_the_mind  social_media  social_networks  sociology  software  song.le  sparsity  spatial_statistics  spectral_clustering  statistical_inference_for_stochastic_processes  statistical_mechanics  statistics  stochastic_processes  sufficiency  summer_schools  surrogate_data  surveillance  tagging  taskar.ben  terrorism_fears  text_mining  theoretical_computer_science  the_continuing_crises  thomas.andrew  time_series  to:blog  to:NB  tofias.michael  to_be_shot_after_a_fair_trial  to_read  to_teach:complexity-and-inference  to_teach:data-mining  traceroute  track_down_references  transaction_networks  us_military  us_politics  utter_stupidity  van_handel.ramon  vast_right-wing_conspiracy  via:?  via:ale  via:ariddell  via:arinaldo  via:arthegall  via:cwiggins  via:deaneckles  via:dsparks  via:guslacerda  via:iqss  via:john-burke  via:kevin_drum  via:laura_rozen  via:mason  via:wiggins  visual_display_of_quantitative_information  voter_model  watts.duncan  wurmser.david  xing.eric  yu.bin  zheng.alice  zhu.ji 

Copy this bookmark:



description:


tags: