cshalizi + network_data_analysis   832

Network dependence testing via diffusion maps and distance-based correlations | Biometrika | Oxford Academic
"Deciphering the associations between network connectivity and nodal attributes is one of the core problems in network science. The dependency structure and high dimensionality of networks pose unique challenges to traditional dependency tests in terms of theoretical guarantees and empirical performance. We propose an approach to test network dependence via diffusion maps and distance-based correlations. We prove that the new method yields a consistent test statistic under mild distributional assumptions on the graph structure, and demonstrate that it is able to efficiently identify the most informative graph embedding with respect to the diffusion time. The methodology is illustrated on both simulated and real data."
to:NB  network_data_analysis  statistics  to_read  to_teach:baby-nets 
6 days ago by cshalizi
[1703.10146] Community detection and stochastic block models: recent developments
"The stochastic block model (SBM) is a random graph model with planted clusters. It is widely employed as a canonical model to study clustering and community detection, and provides generally a fertile ground to study the statistical and computational tradeoffs that arise in network and data sciences.
"This note surveys the recent developments that establish the fundamental limits for community detection in the SBM, both with respect to information-theoretic and computational thresholds, and for various recovery requirements such as exact, partial and weak recovery (a.k.a., detection). The main results discussed are the phase transitions for exact recovery at the Chernoff-Hellinger threshold, the phase transition for weak recovery at the Kesten-Stigum threshold, the optimal distortion-SNR tradeoff for partial recovery, the learning of the SBM parameters and the gap between information-theoretic and computational thresholds.
"The note also covers some of the algorithms developed in the quest of achieving the limits, in particular two-round algorithms via graph-splitting, semi-definite programming, linearized belief propagation, classical and nonbacktracking spectral methods. A few open problems are also discussed."
to:NB  to_read  community_discovery  network_data_analysis  to_teach:baby-nets  via:liza_levina  stochastic_block_models 
28 days ago by cshalizi
[1903.07902] A Comparative Study for Unsupervised Network Representation Learning
"There has been appreciable progress in unsupervised network representation learning (UNRL) approaches over graphs recently with flexible random-walk approaches, new optimization objectives and deep architectures. However, there is no common ground for systematic comparison of embeddings to understand their behavior for different graphs and tasks. In this paper we theoretically group different approaches under a unifying framework and empirically investigate the effectiveness of different network representation methods. In particular, we argue that most of the UNRL approaches either explicitly or implicit model and exploit context information of a node. Consequently, we propose a framework that casts a variety of approaches -- random walk based, matrix factorization and deep learning based -- into a unified context-based optimization function. We systematically group the methods based on their similarities and differences. We study the differences among these methods in detail which we later use to explain their performance differences (on downstream tasks). We conduct a large-scale empirical study considering 9 popular and recent UNRL techniques and 11 real-world datasets with varying structural properties and two common tasks -- node classification and link prediction. We find that there is no single method that is a clear winner and that the choice of a suitable method is dictated by certain properties of the embedding methods, task and structural properties of the underlying graph. In addition we also report the common pitfalls in evaluation of UNRL methods and come up with suggestions for experimental design and interpretation of results."
to:NB  network_data_analysis  graph_embedding  link_prediction 
28 days ago by cshalizi
Veitch , Roy : Sampling and estimation for (sparse) exchangeable graphs
"Sparse exchangeable graphs on ℝ+R+, and the associated graphex framework for sparse graphs, generalize exchangeable graphs on ℕN, and the associated graphon framework for dense graphs. We develop the graphex framework as a tool for statistical network analysis by identifying the sampling scheme that is naturally associated with the models of the framework, formalizing two natural notions of consistent estimation of the parameter (the graphex) underlying these models, and identifying general consistent estimators in each case. The sampling scheme is a modification of independent vertex sampling that throws away vertices that are isolated in the sampled subgraph. The estimators are variants of the empirical graphon estimator, which is known to be a consistent estimator for the distribution of dense exchangeable graphs; both can be understood as graph analogues to the empirical distribution in the i.i.d. sequence setting. Our results may be viewed as a generalization of consistent estimation via the empirical graphon from the dense graph regime to also include sparse graphs."
to:NB  graph_limits  network_data_analysis  graphons  nonparametrics  statistics  veitch.victor  to_teach:graphons 
5 weeks ago by cshalizi
[1810.01509] Hierarchical community detection by recursive partitioning
"The problem of community detection in networks is usually formulated as finding a single partition of the network into some "correct" number of communities. We argue that it is more interpretable and in some regimes more accurate to construct a hierarchical tree of communities instead. This can be done with a simple top-down recursive partitioning algorithm, starting with a single community and separating the nodes into two communities by spectral clustering repeatedly, until a stopping rule suggests there are no further communities. This class of algorithms is model-free, computationally efficient, and requires no tuning other than selecting a stopping rule. We show that there are regimes where this approach outperforms K-way spectral clustering, and propose a natural framework for analyzing the algorithm's theoretical performance, the binary tree stochastic block model. Under this model, we prove that the algorithm correctly recovers the entire community tree under relatively mild assumptions. We also apply the algorithm to a dataset of statistics papers to construct a hierarchical tree of statistical research communities."
to:NB  community_discovery  statistics  network_data_analysis  levina.elizaveta  sarkar.purna  bickel.peter_j. 
5 weeks ago by cshalizi
[1910.12091] Understanding Isomorphism Bias in Graph Data Sets
"In recent years there has been a rapid increase in classification methods on graph structured data. Both in graph kernels and graph neural networks, one of the implicit assumptions of successful state-of-the-art models was that incorporating graph isomorphism features into the architecture leads to better empirical performance. However, as we discover in this work, commonly used data sets for graph classification have repeating instances which cause the problem of isomorphism bias, i.e. artificially increasing the accuracy of the models by memorizing target information from the training set. This prevents fair competition of the algorithms and raises a question of the validity of the obtained results. We analyze 54 data sets, previously extensively used for graph-related tasks, on the existence of isomorphism bias, give a set of recommendations to machine learning practitioners to properly set up their models, and open source new data sets for the future experiments."
to:NB  network_data_analysis 
5 weeks ago by cshalizi
[1910.11445] Finite Mixtures of ERGMs for Ensembles of Networks
"Ensembles of networks arise in many scientific fields, but currently there are few statistical models aimed at understanding their generative processes. To fill in this gap, we propose characterizing network ensembles via finite mixtures of exponential family random graph models, employing a Metropolis-within-Gibbs algorithm to conduct Bayesian inference. Simulation studies show that the proposed procedure can recover the true cluster assignments and cluster-specific parameters. We demonstrate the utility of the proposed approach using an ensemble of political co-voting networks among U.S. Senators."
to:NB  exponential_family_random_graphs  ensemble_methods  network_data_analysis  statistics  butts.carter  to_teach:baby-nets 
5 weeks ago by cshalizi
[1910.12711] A framework for second order eigenvector centralities and clustering coefficients
"We propose and analyse a general tensor-based framework for incorporating second order features into network measures. This approach allows us to combine traditional pairwise links with information that records whether triples of nodes are involved in wedges or triangles. Our treatment covers classical spectral methods and recently proposed cases from the literature, but we also identify many interesting extensions. In particular, we define a mutually-reinforcing (spectral) version of the classical clustering coefficient. The underlying object of study is a constrained nonlinear eigenvalue problem associated with a cubic tensor. Using recent results from nonlinear Perron--Frobenius theory, we establish existence and uniqueness under appropriate conditions, and show that the new spectral measures can be computed efficiently with a nonlinear power method. To illustrate the added value of the new formulation, we analyse the measures on a class of synthetic networks. We also give computational results on centrality and link prediction for real-world networks."
to:NB  linear_algebra  network_data_analysis  to_teach:baby-nets 
5 weeks ago by cshalizi
[1810.09569] Perturbation Bounds for Procrustes, Classical Scaling, and Trilateration, with Applications to Manifold Learning
"One of the common tasks in unsupervised learning is dimensionality reduction, where the goal is to find meaningful low-dimensional structures hidden in high-dimensional data. Sometimes referred to as manifold learning, this problem is closely related to the problem of localization, which aims at embedding a weighted graph into a low-dimensional Euclidean space. Several methods have been proposed for localization, and also manifold learning. Nonetheless, the robustness property of most of them is little understood. In this paper, we obtain perturbation bounds for classical scaling and trilateration, which are then applied to derive performance bounds for Isomap, Landmark Isomap, and Maximum Variance Unfolding. A new perturbation bound for procrustes analysis plays a key role."
to:NB  graph_embedding  network_data_analysis  linear_algebra  perturbation_theory  manifold_learning 
5 weeks ago by cshalizi
[1910.12128] Joint Analysis of Social and Item Response Networks with Latent Space Models
"The adjustment of students to a school environment is fundamentally linked to the friendship networks they form with their peers. Consequently, the complete picture of a student's adjustment can only be obtained by taking into account both their friendship network and their reported perceptions of the school environment. However, there is a lack of flexible statistical models and methods that can jointly analyze a social network with an item-response data matrix. In this paper, we propose a latent space model for heterogeneous (multimodal) networks (LSMH) and its extension LSMH-Item, which combine the framework of latent space modeling in network analysis with item response theory in psychometrics. Using LSMH, we summarize the information from the social network and the item responses in a person-item joint latent space. We use a Variational Bayesian Expectation-Maximization estimation algorithm to estimate the item and person locations in the joint latent space. This methodology allows for effective integration, informative visualization, and prediction of social networks and item responses. We apply the proposed methodology to data collected from 16 third-grade classrooms comprised of 451 third-grade students' self-reported friendships and school liking, which were collected as part of the Early Learning Ohio project. Through the person-item joint latent space, we are able to identify students with potential adjustment difficulties and found consistent connection between students' friendship networks and their well-being. We believe that using LSMH, researchers will be able to easily identify students in need of intervention and revolutionize the understanding of social behaviors."
to:NB  social_networks  network_data_analysis  psychometrics  statistics 
6 weeks ago by cshalizi
[1910.11299] Fraud Detection in Networks: State-of-the-art
"Financial fraud detection represents the challenge of finding anomalies in networks of financial transactions. In general, the anomaly detection (AD) is the problem of distinguishing between normal data samples with well defined patterns or signatures and those that do not conform to the expected profiles. The fraudulent behaviour in money laundering may manifest itself through unusual patterns in financial transaction networks. In such networks, nodes represents customers and the edges are transactions: a directed edge between two nodes illustrates that there is a money transfer in the respective direction, where the weight on the edge is the transferred amount. In this paper we present a survey on the fundamental anomaly detection techniques and then present briefly the relevant literature in connection with fraud detection context."
to:NB  fraud  classifiers  network_data_analysis  relational_learning  statistics 
6 weeks ago by cshalizi
[1910.02301] Change Detection in Noisy Dynamic Networks: A Spectral Embedding Approach
"Change detection in dynamic networks is an important problem in many areas, such as fraud detection, cyber intrusion detection and health care monitoring. It is a challenging problem because it involves a time sequence of graphs, each of which is usually very large and sparse with heterogeneous vertex degrees, resulting in a complex, high dimensional mathematical object. Spectral embedding methods provide an effective way to transform a graph to a lower dimensional latent Euclidean space that preserves the underlying structure of the network. Although change detection methods that use spectral embedding are available, they do not address sparsity and degree heterogeneity that usually occur in noisy real-world graphs and a majority of these methods focus on changes in the behaviour of the overall network.
"In this paper, we adapt previously developed techniques in spectral graph theory and propose a novel concept of applying Procrustes techniques to embedded points for vertices in a graph to detect changes in entity behaviour. Our spectral embedding approach not only addresses sparsity and degree heterogeneity issues, but also obtains an estimate of the appropriate embedding dimension. We call this method CDP (change detection using Procrustes analysis). We demonstrate the performance of CDP through extensive simulation experiments and a real-world application. CDP successfully detects various types of vertex-based changes including (i) changes in vertex degree, (ii) changes in community membership of vertices, and (iii) unusual increase or decrease in edge weight between vertices. The change detection performance of CDP is compared with two other baseline methods that employ alternative spectral embedding approaches. In both cases, CDP generally shows superior performance."
to:NB  network_data_analysis  spectral_methods  re:network_differences  change-point_problem  statistics 
6 weeks ago by cshalizi
[1910.09679] Sparse Networks with Core-Periphery Structure
"We propose a statistical model for graphs with a core-periphery structure. To do this we define a precise notion of what it means for a graph to have this structure, based on the sparsity properties of the subgraphs of core and periphery nodes. We present a class of sparse graphs with such properties, and provide methods to simulate from this class, and to perform posterior inference. We demonstrate that our model can detect core-periphery structure in simulated and real-world networks."
to:NB  network_data_analysis  statistics  to_teach:baby-nets 
6 weeks ago by cshalizi
[1910.08657] Temporal Network Sampling
"Temporal networks representing a stream of timestamped edges are seemingly ubiquitous in the real-world. However, the massive size and continuous nature of these networks make them fundamentally challenging to analyze and leverage for descriptive and predictive modeling tasks. In this work, we propose a general framework for temporal network sampling with unbiased estimation. We develop online, single-pass sampling algorithms and unbiased estimators for temporal network sampling. The proposed algorithms enable fast, accurate, and memory-efficient statistical estimation of temporal network patterns and properties. In addition, we propose a temporally decaying sampling algorithm with unbiased estimators for studying networks that evolve in continuous time, where the strength of links is a function of time, and the motif patterns are temporally-weighted. In contrast to the prior notion of a △t-temporal motif, the proposed formulation and algorithms for counting temporally weighted motifs are useful for forecasting tasks in networks such as predicting future links, or a future time-series variable of nodes and links. Finally, extensive experiments on a variety of temporal networks from different domains demonstrate the effectiveness of the proposed algorithms."
to:NB  network_data_analysis  statistics  computational_statistics  to_teach:baby-nets 
6 weeks ago by cshalizi
Longepierre , Matias : Consistency of the maximum likelihood and variational estimators in a dynamic stochastic block model
"We consider a dynamic version of the stochastic block model, in which the nodes are partitioned into latent classes and the connection between two nodes is drawn from a Bernoulli distribution depending on the classes of these two nodes. The temporal evolution is modeled through a hidden Markov chain on the nodes memberships. We prove the consistency (as the number of nodes and time steps increase) of the maximum likelihood and variational estimators of the model parameters, and obtain upper bounds on the rates of convergence of these estimators. We also explore the particular case where the number of time steps is fixed and connectivity parameters are allowed to vary."
to:NB  stochastic_block_models  network_data_analysis  statistics  to_teach:baby-nets 
6 weeks ago by cshalizi
[1910.09538] The Transsortative Structure of Networks
"Network topologies can be non-trivial, due to the complex underlying behaviors that form them. While past research has shown that some processes on networks may be characterized by low-order statistics describing nodes and their neighbors, such as degree assortativity, these quantities fail to capture important sources of variation in network structure. We introduce a property called transsortativity that describes correlations among a node's neighbors, generalizing these statistics from immediate one-hop neighbors to two-hop neighbors. We describe how transsortativity can be systematically varied, independently of the network's degree distribution and assortativity. Moreover, we show that it can significantly impact the spread of contagions as well as the perceptions of neighbors, known as the majority illusion. Our work improves our ability to create and analyze more realistic models of complex networks."
to:NB  networks  network_data_analysis  lerman.kristina  to_read  to_teach:baby-nets 
6 weeks ago by cshalizi
[1910.09483] Sampling random graph homomorphisms and applications to network data analysis
"A graph homomorphism is a map between two graphs that preserves adjacency relations. We consider the problem of sampling a random graph homomorphism from a graph F into a large network . When  is the complete graph with q nodes, this becomes the well-known problem of sampling uniform q-colorings of F. We propose two complementary MCMC algorithms for sampling a random graph homomorphisms and establish bounds on their mixing times and concentration of their time averages. Based on our sampling algorithms, we propose a novel framework for network data analysis that circumvents some of the drawbacks in methods based on independent and neigborhood sampling. Various time averages of the MCMC trajectory give us real-, function-, and network-valued computable observables, including well-known ones such as homomorphism density and average clustering coefficient. One of the main observable we propose is called the conditional homomorphism density profile, which reveals hierarchical structure of the network. Furthermore, we show that these network observables are stable with respect to a suitably renormalized cut distance between networks. We also provide various examples and simulations demonstrating our framework through synthetic and real-world networks. For instance, we apply our framework to analyze Word Adjacency Networks of a 45 novels data set and propose an authorship attribution scheme using motif sampling and conditional homomorphism density profiles."
to:NB  network_data_analysis  network_sampling  graph_theory  computational_statistics  statistics 
6 weeks ago by cshalizi
Phys. Rev. E 100, 042306 (2019) - Backbone reconstruction in temporal networks from epidemic data
"Many complex systems are characterized by time-varying patterns of interactions. These interactions comprise strong ties, driven by dyadic relationships, and weak ties, based on node-specific attributes. The interplay between strong and weak ties plays an important role on dynamical processes that could unfold on complex systems. However, seldom do we have access to precise information about the time-varying topology of interaction patterns. A particularly elusive question is to distinguish strong from weak ties, on the basis of the sole node dynamics. Building upon analytical results, we propose a statistically-principled algorithm to reconstruct the backbone of strong ties from data of a spreading process, consisting of the time series of individuals' states. Our method is numerically validated over a range of synthetic datasets, encapsulating salient features of real-world systems. Motivated by compelling evidence, we propose the integration of our algorithm in a targeted immunization strategy that prioritizes influential nodes in the inferred backbone. Through Monte Carlo simulations on synthetic networks and a real-world case study, we demonstrate the viability of our approach."
in_NB  network_data_analysis  statistics  epidemics_on_networks 
7 weeks ago by cshalizi
Arroyo Relión , Kessler , Levina , Taylor : Network classification with applications to brain connectomics
"While statistical analysis of a single network has received a lot of attention in recent years, with a focus on social networks, analysis of a sample of networks presents its own challenges which require a different set of analytic tools. Here we study the problem of classification of networks with labeled nodes, motivated by applications in neuroimaging. Brain networks are constructed from imaging data to represent functional connectivity between regions of the brain, and previous work has shown the potential of such networks to distinguish between various brain disorders, giving rise to a network classification problem. Existing approaches tend to either treat all edge weights as a long vector, ignoring the network structure, or focus on graph topology as represented by summary measures while ignoring the edge weights. Our goal is to design a classification method that uses both the individual edge information and the network structure of the data in a computationally efficient way, and that can produce a parsimonious and interpretable representation of differences in brain connectivity patterns between classes. We propose a graph classification method that uses edge weights as predictors but incorporates the network nature of the data via penalties that promote sparsity in the number of nodes, in addition to the usual sparsity penalties that encourage selection of edges. We implement the method via efficient convex optimization and provide a detailed analysis of data from two fMRI studies of schizophrenia."
to:NB  network_data_analysis  re:network_differences  statistics  levina.elizaveta 
7 weeks ago by cshalizi
[1910.07781] Econometric Models of Network Formation
"This article provides a selective review on the recent literature on econometric models of network formation. The survey starts with a brief exposition on basic concepts and tools for the statistical description of networks. I then offer a review of dyadic models, focussing on statistical models on pairs of nodes and describe several developments of interest to the econometrics literature. The article also presents a discussion of non-dyadic models where link formation might be influenced by the presence or absence of additional links, which themselves are subject to similar influences. This is related to the statistical literature on conditionally specified models and the econometrics of game theoretical models. I close with a (non-exhaustive) discussion of potential areas for further development."
to:NB  social_networks  network_data_analysis  economics  statistics 
7 weeks ago by cshalizi
[1703.03862] Joint Embedding of Graphs
"Feature extraction and dimension reduction for networks is critical in a wide variety of domains. Efficiently and accurately learning features for multiple graphs has important applications in statistical inference on graphs. We propose a method to jointly embed multiple undirected graphs. Given a set of graphs, the joint embedding method identifies a linear subspace spanned by rank one symmetric matrices and projects adjacency matrices of graphs into this subspace. The projection coefficients can be treated as features of the graphs, while the embedding components can represent vertex features. We also propose a random graph model for multiple graphs that generalizes other classical models for graphs. We show through theory and numerical experiments that under the model, the joint embedding method produces estimates of parameters with small errors. Via simulation experiments, we demonstrate that the joint embedding method produces features which lead to state of the art performance in classifying graphs. Applying the joint embedding method to human brain graphs, we find it extracts interpretable features with good prediction accuracy in different tasks."
to:NB  network_visualization  re:network_differences  network_data_analysis 
7 weeks ago by cshalizi
[1910.07452] Identifying Network Ties from Panel Data: Theory and an Application to Tax Competition
"Social interactions determine many economic behaviors, but information on social ties does not exist in most publicly available and widely used datasets. We present results on the identification of social networks from observational panel data that contains no information on social ties between agents. In the context of a canonical social interactions model, we provide sufficient conditions under which the social interactions matrix, endogenous and exogenous social effect parameters are all globally identified. While this result is relevant across different estimation strategies, we then describe how high-dimensional estimation techniques can be used to estimate the interactions model based on the Adaptive Elastic Net GMM method. We employ the method to study tax competition across US states. We find the identified social interactions matrix implies tax competition differs markedly from the common assumption of competition between geographically neighboring states, providing further insights for the long-standing debate on the relative roles of factor mobility and yardstick competition in driving tax setting behavior across states. Most broadly, our identification and application show the analysis of social interactions can be extended to economic realms where no network data exists."
to:NB  to_be_shot_after_a_fair_trial  network_data_analysis  causal_inference  time_series  statistics 
7 weeks ago by cshalizi
[1412.5647] Nonlinear Factor Models for Network and Panel Data
"Factor structures or interactive effects are convenient devices to incorporate latent variables in panel data models. We consider fixed effect estimation of nonlinear panel single-index models with factor structures in the unobservables, which include logit, probit, ordered probit and Poisson specifications. We establish that fixed effect estimators of model parameters and average partial effects have normal distributions when the two dimensions of the panel grow large, but might suffer of incidental parameter bias. We show how models with factor structures can also be applied to capture important features of network data such as reciprocity, degree heterogeneity, homophily in latent variables and clustering. We illustrate this applicability with an empirical example to the estimation of a gravity equation of international trade between countries using a Poisson model with multiple factors."
to:NB  factor_analysis  network_data_analysis  statistics  to_be_shot_after_a_fair_trial 
7 weeks ago by cshalizi
[1910.06538] Network Mediation Analysis Using Model-based Eigenvalue Decomposition
"This paper proposes a new two-stage network mediation method based on the use of a latent network approach -- model-based eigenvalue decomposition -- for analyzing social network data with nodal covariates. In the decomposition stage of the observed network, no assumption on the metric of the latent space structure is required. In the mediation stage, the most important eigenvectors of a network are used as mediators. This method further offers an innovative way for controlling for the conditional covariates and it only considers the information left in the network. We demonstrate this approach in a detailed tutorial R code provided for four separate cases -- unconditional and conditional model-based eigenvalue decompositions for either a continuous outcome or a binary outcome -- to show its applicability to empirical network data."
to:NB  network_data_analysis  causal_inference  statistics  to_be_shot_after_a_fair_trial  re:community_control 
7 weeks ago by cshalizi
[1910.06593] Iterative procedure for network inference
"When a network is reconstructed from data, two types of errors can occur: false positive and false negative errors about the presence or absence of links. In this paper, the vertex degree distribution of the true underlying network is analytically reconstructed using an iterative procedure. Such procedure is based on the inferred network and estimates for the probabilities α and β of type I and type II errors, respectively. The iteration procedure consists of choosing various values for α to perform the iteration steps of the network reconstruction. For the first step, the standard value for α of 0.05 can be chosen as an example. The result of this first step gives a first estimate of the network topology of interest. For the second iteration step the value for α is adjusted according to the findings of the first step. This procedure is iterated, ultimately leading to a reconstruction of the vertex degree distribution tailored to its previously unknown network topology."
to:NB  network_data_analysis  to_be_shot_after_a_fair_trial  statistics 
7 weeks ago by cshalizi
[1910.05870] Network Modularity Controls the Speed of Information Diffusion
"The rapid diffusion of information and the adoption of ideas are of critical importance in situations as diverse as emergencies, collective actions, or advertising and marketing. Although the dynamics of large cascades have been extensively studied in various contexts, few have examined the mechanisms that govern the efficiency of information diffusion. Here, by employing the linear threshold model on networks with communities, we demonstrate that a prominent network feature---the modular structure---strongly affects the speed of information diffusion. Our simulation results show that, when global cascades are enabled, there exists an optimal network modularity for the most efficient information spreading process. Beyond this critical value, either a stronger or a weaker modular structure actually hinders the speed of global cascades. These results are further confirmed by predictions using an analytical approach. Our findings have practical implications in disciplines from marketing to epidemics, from neuroscience to engineering, where the understanding of the structural design of complex systems focuses on the efficiency of information propagation."
in_NB  information_cascades  community_discovery  network_data_analysis  epidemics_on_networks  re:do-institutions-evolve 
7 weeks ago by cshalizi
[1905.03353] Regression from Dependent Observations
"The standard linear and logistic regression models assume that the response variables are independent, but share the same linear relationship to their corresponding vectors of covariates. The assumption that the response variables are independent is, however, too strong. In many applications, these responses are collected on nodes of a network, or some spatial or temporal domain, and are dependent. Examples abound in financial and meteorological applications, and dependencies naturally arise in social networks through peer effects. Regression with dependent responses has thus received a lot of attention in the Statistics and Economics literature, but there are no strong consistency results unless multiple independent samples of the vectors of dependent responses can be collected from these models. We present computationally and statistically efficient methods for linear and logistic regression models when the response variables are dependent on a network. Given one sample from a networked linear or logistic regression model and under mild assumptions, we prove strong consistency results for recovering the vector of coefficients and the strength of the dependencies, recovering the rates of standard regression under independent observations. We use projected gradient descent on the negative log-likelihood, or negative log-pseudolikelihood, and establish their strong convexity and consistency using concentration of measure for dependent random variables."

--- Umm? Spatial statistics is a thing? From a very quick skim, they seem to just be reproducing the usual asymptotics about M-estimators, but maybe the concentration results will be of interest. (Note: published in a CS theory conference, not a stats journal or even an ML conference.)
to:NB  network_data_analysis  regression  statistics  to_be_shot_after_a_fair_trial 
8 weeks ago by cshalizi
[1910.01692] Algebraic statistics, tables, and networks: The Fienberg advantage
"Stephen Fienberg's affinity for contingency table problems and reinterpreting models with a fresh look gave rise to a new approach for hypothesis testing of network models that are linear exponential families. We outline his vision and influence in this fundamental problem, as well as generalizations to multigraphs and hypergraphs."
to:NB  statistics  network_data_analysis  hypothesis_testing  fienberg.stephen 
8 weeks ago by cshalizi
Smith , Asta , Calder : The Geometry of Continuous Latent Space Models for Network Data
"We review the class of continuous latent space (statistical) models for network data, paying particular attention to the role of the geometry of the latent space. In these models, the presence/absence of network dyadic ties are assumed to be conditionally independent given the dyads’ unobserved positions in a latent space. In this way, these models provide a probabilistic framework for embedding network nodes in a continuous space equipped with a geometry that facilitates the description of dependence between random dyadic ties. Specifically, these models naturally capture homophilous tendencies and triadic clustering, among other common properties of observed networks. In addition to reviewing the literature on continuous latent space models from a geometric perspective, we highlight the important role the geometry of the latent space plays on properties of networks arising from these models via intuition and simulation. Finally, we discuss results from spectral graph theory that allow us to explore the role of the geometry of the latent space, independent of network size. We conclude with conjectures about how these results might be used to infer the appropriate latent space geometry from observed networks."
to:NB  network_data_analysis  kith_and_kin  asta.dena 
8 weeks ago by cshalizi
[1910.00423] Limit theorems for out-of-sample extensions of the adjacency and Laplacian spectral embeddings
"Graph embeddings, a class of dimensionality reduction techniques designed for relational data, have proven useful in exploring and modeling network structure. Most dimensionality reduction methods allow out-of-sample extensions, by which an embedding can be applied to observations not present in the training set. Applied to graphs, the out-of-sample extension problem concerns how to compute the embedding of a vertex that is added to the graph after an embedding has already been computed. In this paper, we consider the out-of-sample extension problem for two graph embedding procedures: the adjacency spectral embedding and the Laplacian spectral embedding. In both cases, we prove that when the underlying graph is generated according to a latent space model called the random dot product graph, which includes the popular stochastic block model as a special case, an out-of-sample extension based on a least-squares objective obeys a central limit theorem about the true latent position of the out-of-sample vertex. In addition, we prove a concentration inequality for the out-of-sample extension of the adjacency spectral embedding based on a maximum-likelihood objective. Our results also yield a convenient framework in which to analyze trade-offs between estimation accuracy and computational expense, which we explore briefly."
to:NB  graph_embedding  inference_to_latent_objects  statistics  network_data_analysis  re:hyperbolic_networks 
9 weeks ago by cshalizi
[1910.00544] A machine learning approach to predicting dynamical observables from network structure
"Estimating the outcome of a given dynamical process from structural features is a key unsolved challenge in network science. The goal is hindered by difficulties associated to nonlinearities, correlations and feedbacks between the structure and dynamics of complex systems. In this work, we develop an approach based on machine learning algorithms that is shown to provide an answer to the previous challenge. Specifically, we show that it is possible to estimate the outbreak size of a disease starting from a single node as well as the degree of synchronicity of a system made up of Kuramoto oscillators. In doing so, we show which topological features of the network are key for this estimation, and provide a rank of the importance of network metrics with higher accuracy than previously done. Our approach is general and can be applied to any dynamical process running on top of complex networks. Likewise, our work constitutes an important step towards the application of machine learning methods to unravel dynamical patterns emerging in complex networked systems."

--- I don't see any way this can be done which won't be entirely at the mercy of data-set shift.
to:NB  network_data_analysis  macro_from_micro  statistics  to_be_shot_after_a_fair_trial  kurths.jurgen 
9 weeks ago by cshalizi
[1812.10519] Maximum Likelihood Estimation and Graph Matching in Errorfully Observed Networks
"Given a pair of graphs with the same number of vertices, the inexact graph matching problem consists in finding a correspondence between the vertices of these graphs that minimizes the total number of induced edge disagreements. We study this problem from a statistical framework in which one of the graphs is an errorfully observed copy of the other. We introduce a corrupting channel model, and show that in this model framework, the solution to the graph matching problem is a maximum likelihood estimator. Necessary and sufficient conditions for consistency of this MLE are presented, as well as a relaxed notion of consistency in which a negligible fraction of the vertices need not be matched correctly. The results are used to study matchability in several families of random graphs, including edge independent models, random regular graphs and small-world networks. We also use these results to introduce measures of matching feasibility, and experimentally validate the results on simulated and real-world networks."
to:NB  network_data_analysis  statistics  to_teach:baby-nets 
9 weeks ago by cshalizi
[1910.00452] On the Equivalence between Node Embeddings and Structural Graph Representations
"This work provides the first unifying theoretical framework for node embeddings and structural graph representations, bridging methods like matrix factorization and graph neural networks. Using invariant theory, we show that the relationship between structural representations and node embeddings is analogous to that of a distribution and its samples. We prove that all tasks that can be performed by node embeddings can also be performed by structural representations and vice-versa. We also show that the concept of transductive and inductive learning is unrelated to node embeddings and graph representations, clearing another source of confusion in the literature. Finally, we introduce new practical guidelines to generating and using node embeddings, which fixes significant shortcomings of standard operating procedures used today."

--- So I can forget about node2vec, right?
to:NB  to_read  network_data_analysis  statistics  re:hyperbolic_networks 
9 weeks ago by cshalizi
[1708.02107] Adaptive Estimation of Nonparametric Geometric Graphs
"This article studies the recovery of graphons when they are convolution kernels on compact (symmetric) metric spaces. This case is of particular interest since it covers the situation where the probability of an edge depends only on some unknown nonparametric function of the distance between latent points, referred to as Nonparametric Geometric Graphs (NGG).
"In this setting, adaptive estimation of NGG is possible using a spectral procedure combined with a Goldenshluger-Lepski adaptation method. The latent spaces covered by our framework encompass (among others) compact symmetric spaces of rank one, namely real spheres and projective spaces. For these latter, explicit computations of the eigen-basis and of the model complexity can be achieved, leading to quantitative non-asymptotic results. The time complexity of our method scales cubicly in the size of the graph and exponentially in the regularity of the graphon. Hence, this paper offers an algorithmically and theoretically efficient procedure to estimate smooth NGG.
"As a by product, this paper shows a non-asymptotic concentration result on the spectrum of integral operators defined by symmetric kernels (not necessarily positive)."
to:NB  network_data_analysis  nonparametrics  statistics  re:smoothing_adjacency_matrices  to_read 
9 weeks ago by cshalizi
[1909.13456] Improving Textual Network Learning with Variational Homophilic Embeddings
"The performance of many network learning applications crucially hinges on the success of network embedding algorithms, which aim to encode rich network information into low-dimensional vertex-based vector representations. This paper considers a novel variational formulation of network embeddings, with special focus on textual networks. Different from most existing methods that optimize a discriminative objective, we introduce Variational Homophilic Embedding (VHE), a fully generative model that learns network embeddings by modeling the semantic (textual) information with a variational autoencoder, while accounting for the structural (topology) information through a novel homophilic prior design. Homophilic vertex embeddings encourage similar embedding vectors for related (connected) vertices. The proposed VHE promises better generalization for downstream tasks, robustness to incomplete observations, and the ability to generalize to unseen vertices. Extensive experiments on real-world networks, for multiple tasks, demonstrate that the proposed method consistently achieves superior performance relative to competing state-of-the-art approaches."
to:NB  text_mining  network_data_analysis  joint_modeling_of_text_and_networks 
9 weeks ago by cshalizi
[1909.11894] Social Network Analysis for Social Neuroscientists
"Although social neuroscience is concerned with understanding how the brain interacts with its social environment, prevailing research in the field has primarily considered the human brain in isolation, deprived of its rich social context. Emerging work in social neuroscience that leverages tools from network analysis has begun to pursue this issue, advancing knowledge of how the human brain influences and is influenced by the structures of its social environment. In this paper, we provide an overview of key theory and methods in network analysis (especially for social systems) as an introduction for social neuroscientists who are interested in relating individual cognition to the structures of an individual's social environments. We also highlight some exciting new work as examples of how to productively use these tools to investigate questions of relevance to social neuroscientists. We include tutorials to help with practical implementation of the concepts that we discuss. We conclude by highlighting the broad range of exciting research opportunities for social neuroscientists who are interested in using network analysis to study social systems."
to:NB  network_data_analysis  social_life_of_the_mind  social_networks  porter.mason_a. 
9 weeks ago by cshalizi
[1909.13253] Changing the tune: mixtures of network models that vary in time
"Many of the complex systems we study in their representation as networks are growing objects, evolving by the addition of nodes and links over time. The rules governing this growth are attributed to mechanisms such as preferential attachment and triangle closure. We demonstrate a method for estimating the relative roles of these mechanisms, and further, investigating how they change as the network evolves. We show that a rich class of network evolution models can be built from a weighted mixture of these model mechanisms. Using a likelihood based formulation we show how to calculate the optimal mixture for a given set of observations of network data, and show that this framework can be used to distinguish competing models that are indistinguishable by their summary statistics. Using real data from Facebook user interactions, we show that we can improve the ability of a model to reproduce network statistics using tuned model mixtures. We further investigate the idea that the underlying model of a network can change in time, for example, a technology based network might respond to changes in the underlying technology or a financial network might respond to economic shocks. Using artificial data we show that we can recapture the time at which a known change occurred. We use the Enron email dataset to show that we can estimate how mixtures of models change over time."
to:NB  network_formation  time_series  statistics  network_data_analysis  re:network_differences 
9 weeks ago by cshalizi
[1903.01059] Limit Theorems for Network Dependent Random Variables
"This paper considers a general form of network dependence where dependence between two sets of random variables becomes weaker as their distance in a network gets longer. We show that such network dependence cannot be embedded as a random field on a lattice in a Euclidean space with a fixed dimension when the maximum clique increases in size as the network grows. This paper applies Doukhan and Louhichi (1999)'s weak dependence notion to network dependence by measuring dependence strength by the covariance between nonlinearly transformed random variables. While this approach covers examples such as strong mixing random fields on a graph and conditional dependency graphs, it is most useful when dependence arises through a large functional-causal system of equations. The main results of our paper include the law of large numbers, and the central limit theorem. We also propose a heteroskedasticity-autocorrelation consistent variance estimator and prove its consistency under regularity conditions. The finite sample performance of this latter estimator is investigated through a Monte Carlo simulation study."
to:NB  network_data_analysis  ergodic_theory  to_read  probability  statistics  to_teach:baby-nets 
9 weeks ago by cshalizi
[1909.13464] Network Differential Connectivity Analysis
"Identifying differences in networks has become a canonical problem in many biological applications. Here, we focus on testing whether two Gaussian graphical models are the same. Existing methods try to accomplish this goal by either directly comparing their estimated structures, or testing the null hypothesis that the partial correlation matrices are equal. However, estimation approaches do not provide measures of uncertainty, e.g., p-values, which are crucial in drawing scientific conclusions. On the other hand, existing testing approaches could lead to misleading results in some cases. To address these shortcomings, we propose a qualitative hypothesis testing framework, which tests whether the connectivity patterns in the two networks are the same. Our framework is especially appropriate if the goal is to identify nodes or edges that are differentially connected. No existing approach could test such hypotheses and provide corresponding measures of uncertainty, e.g., p-values. We investigate theoretical and numerical properties of our proposal and illustrate its utility in biological applications. Theoretically, we show that under appropriate conditions, our proposal correctly controls the type-I error rate in testing the qualitative hypothesis. Empirically, we demonstrate the performance of our proposal using simulation datasets and applications in cancer genetics and brain imaging studies."
to:NB  network_data_analysis  hypothesis_testing  two-sample_tests  statistics  re:network_differences 
9 weeks ago by cshalizi
[1909.12907] A Quotient Space Formulation for Statistical Analysis of Graphical Data
"Complex analyses involving multiple, dependent random quantities often lead to graphical models: a set of nodes denoting variables of interest, and corresponding edges denoting statistical interactions between nodes. To develop statistical analyses for graphical data, one needs mathematical representations and metrics for matching and comparing graphs, and other geometrical tools, such as geodesics, means, and covariances, on representation spaces of graphs. This paper utilizes a quotient structure to develop efficient algorithms for computing these quantities, leading to useful statistical tools, including principal component analysis, linear dimension reduction, and analytical statistical modeling. The efficacy of this framework is demonstrated using datasets taken from several problem areas, including alphabets, video summaries, social networks, and biochemical structures."
to:NB  network_data_analysis 
9 weeks ago by cshalizi
[1909.11414] Inequality is rising where social network segregation interacts with urban topology
"Social networks amplify inequalities due to fundamental mechanisms of social tie formation such as homophily and triadic closure. These forces sharpen social segregation reflected in network fragmentation. Yet, little is known about what structural factors facilitate fragmentation. In this paper we use big data from a widely-used online social network to demonstrate that there is a significant relationship between social network fragmentation and income inequality in cities and towns. We find that the organization of the physical urban space has a stronger relationship with fragmentation than unequal access to education, political segregation, or the presence of ethnic and religious minorities. Fragmentation of social networks is significantly higher in towns in which residential neighborhoods are divided by physical barriers such as rivers and railroads and are relatively distant from the center of town. Towns in which amenities are spatially concentrated are also typically more socially segregated. These relationships suggest how urban planning may be a useful point of intervention to mitigate inequalities in the long run."

--- How on Earth are those exogenous instruments? (And, of course, neither a theoretical argument for linear functional forms nor an empirical examination of the goodness of fit of the linear models anywhere.)
to:NB  have_skimmed  inequality  social_networks  network_data_analysis  instrumental_variables  statistics  to_be_shot_after_a_fair_trial 
10 weeks ago by cshalizi
[1909.10325] Graph Signal Processing -- Part II: Processing and Analyzing Signals on Graphs
"The focus of Part I of this monograph has been on both the fundamental properties, graph topologies, and spectral representations of graphs. Part II embarks on these concepts to address the algorithmic and practical issues centered round data/signal processing on graphs, that is, the focus is on the analysis and estimation of both deterministic and random data on graphs. The fundamental ideas related to graph signals are introduced through a simple and intuitive, yet illustrative and general enough case study of multisensor temperature field estimation. The concept of systems on graph is defined using graph signal shift operators, which generalize the corresponding principles from traditional learning systems. At the core of the spectral domain representation of graph signals and systems is the Graph Discrete Fourier Transform (GDFT). The spectral domain representations are then used as the basis to introduce graph signal filtering concepts and address their design, including Chebyshev polynomial approximation series. Ideas related to the sampling of graph signals are presented and further linked with compressive sensing. Localized graph signal analysis in the joint vertex-spectral domain is referred to as the vertex-frequency analysis, since it can be considered as an extension of classical time-frequency analysis to the graph domain of a signal. Important topics related to the local graph Fourier transform (LGFT) are covered, together with its various forms including the graph spectral and vertex domain windows and the inversion conditions and relations. A link between the LGFT with spectral varying window and the spectral graph wavelet transform (SGWT) is also established. Realizations of the LGFT and SGWT using polynomial (Chebyshev) approximations of the spectral functions are further considered. Finally, energy versions of the vertex-frequency representations are introduced."
to:NB  network_data_analysis  to_be_shot_after_a_fair_trial 
10 weeks ago by cshalizi
Phys. Rev. E 100, 032308 (2019) - Scale-variant topological information for characterizing the structure of complex networks
"The structure of real-world networks is usually difficult to characterize owing to the variation of topological scales, the nondyadic complex interactions, and the fluctuations in the network. We aim to address these problems by introducing a general framework using a method based on topological data analysis. By considering the diffusion process at a single specified timescale in a network, we map the network nodes to a finite set of points that contains the topological information of the network at a single scale. Subsequently, we study the shape of these point sets over variable timescales that provide scale-variant topological information, to understand the varying topological scales and the complex interactions in the network. We conduct experiments on synthetic and real-world data to demonstrate the effectiveness of the proposed framework in identifying network models, classifying real-world networks, and detecting transition points in time-evolving networks. Overall, our study presents a unified analysis that can be applied to more complex network structures, as in the case of multilayer and multiplex networks."
to:NB  network_data_analysis 
10 weeks ago by cshalizi
[1909.09844] Universal Lossless Compression of Graphical Data
"Graphical data is comprised of a graph with marks on its edges and vertices. The mark indicates the value of some attribute associated to the respective edge or vertex. Examples of such data arise in social networks, molecular and systems biology, and web graphs, as well as in several other application areas. Our goal is to design schemes that can efficiently compress such graphical data without making assumptions about its stochastic properties. Namely, we wish to develop a universal compression algorithm for graphical data sources. To formalize this goal, we employ the framework of local weak convergence, also called the objective method, which provides a technique to think of a marked graph as a kind of stationary stochastic processes, stationary with respect to movement between vertices of the graph. In recent work, we have generalized a notion of entropy for unmarked graphs in this framework, due to Bordenave and Caputo, to the case of marked graphs. We use this notion to evaluate the efficiency of a compression scheme. The lossless compression scheme we propose in this paper is then proved to be universally optimal in a precise technical sense. It is also capable of performing local data queries in the compressed form."
to:NB  network_data_analysis  information_theory 
10 weeks ago by cshalizi
[1909.06841] Latent Distance Estimation for Random Geometric Graphs
"Random geometric graphs are a popular choice for a latent points generative model for networks. Their definition is based on a sample of n points X1,X2,⋯,Xn on the Euclidean sphere~𝕊d−1 which represents the latent positions of nodes of the network. The connection probabilities between the nodes are determined by an unknown function (referred to as the "link" function) evaluated at the distance between the latent points. We introduce a spectral estimator of the pairwise distance between latent points and we prove that its rate of convergence is the same as the nonparametric estimation of a function on 𝕊d−1, up to a logarithmic factor. In addition, we provide an efficient spectral algorithm to compute this estimator without any knowledge on the nonparametric link function. As a byproduct, our method can also consistently estimate the dimension d of the latent space."
to:NB  network_data_analysis  re:hyperbolic_networks  statistics  inference_to_latent_objects  to_read 
10 weeks ago by cshalizi
[1903.10833] Network reconstruction and community detection from dynamics
"We present a scalable nonparametric Bayesian method to perform network reconstruction from observed functional behavior that at the same time infers the communities present in the network. We show that the joint reconstruction with community detection has a synergistic effect, where the edge correlations used to inform the existence of communities are also inherently used to improve the accuracy of the reconstruction which, in turn, can better inform the uncovering of communities. We illustrate the use of our method with observations arising from epidemic models and the Ising model, both on synthetic and empirical networks, as well as on data containing only functional information."
to:NB  network_data_analysis  community_discovery  to_read 
10 weeks ago by cshalizi
[1909.07578] Stacking Models for Nearly Optimal Link Prediction in Complex Networks
"Most real-world networks are incompletely observed. Algorithms that can accurately predict which links are missing can dramatically speedup the collection of network data and improve the validity of network models. Many algorithms now exist for predicting missing links, given a partially observed network, but it has remained unknown whether a single best predictor exists, how link predictability varies across methods and networks from different domains, and how close to optimality current methods are. We answer these questions by systematically evaluating 203 individual link predictor algorithms, representing three popular families of methods, applied to a large corpus of 548 structurally diverse networks from six scientific domains. We first show that individual algorithms exhibit a broad diversity of prediction errors, such that no one predictor or family is best, or worst, across all realistic inputs. We then exploit this diversity via meta-learning to construct a series of "stacked" models that combine predictors into a single algorithm. Applied to a broad range of synthetic networks, for which we may analytically calculate optimal performance, these stacked models achieve optimal or nearly optimal levels of accuracy. Applied to real-world networks, stacked models are also superior, but their accuracy varies strongly by domain, suggesting that link prediction may be fundamentally easier in social networks than in biological or technological networks. These results indicate that the state-of-the-art for link prediction comes from combining individual algorithms, which achieves nearly optimal predictions. We close with a brief discussion of limitations and opportunities for further improvement of these results."
to:NB  ensemble_methods  network_data_analysis  link_prediction  statistics  clauset.aaron  airoldi.edo  galstyan.aram 
10 weeks ago by cshalizi
[1707.04800] Exponential-Family Models of Random Graphs: Inference in Finite-, Super-, and Infinite Population Scenarios
"Exponential-family Random Graph Models (ERGMs) constitute a large statistical framework for modeling sparse and dense random graphs, short- and long-tailed degree distributions, covariates, and a wide range of complex dependencies. Special cases of ERGMs are generalized linear models (GLMs), Bernoulli random graphs, β-models, p1-models, and models related to Markov random fields in spatial statistics and other areas of statistics. While widely used in practice, questions have been raised about the theoretical properties of ERGMs. These include concerns that some ERGMs are near-degenerate and that many ERGMs are non-projective. To address them, careful attention must be paid to model specifications and their underlying assumptions, and in which inferential settings models are employed. As we discuss, near-degeneracy can affect simplistic ERGMs lacking structure, but well-posed ERGMs with additional structure can be well-behaved. Likewise, lack of projectivity can affect non-likelihood-based inference, but likelihood-based inference does not require projectivity. Here, we review well-posed ERGMs along with likelihood-based inference. We first clarify the core statistical notions of "sample" and "population" in the ERGM framework, and separate the process that generates the population graph from the observation process. We then review likelihood-based inference in finite-, super-, and infinite-population scenarios. We conclude with consistency results, and an application to human brain networks"

--- Needless (?) to say, I am unconvinced by the arguments that projectivity doesn't matter.
to:NB  exponential_family_random_graphs  network_data_analysis  statistics  to_teach:baby-nets  re:your_favorite_ergm_sucks 
12 weeks ago by cshalizi
[1908.01596v3] A Unifying Analysis of Shift Operators on a Graph
The maximum entropy principle is employed to introduce a general class of shift operators (GSO) for random signals on a graph. By virtue of the assumed probabilistic framework, the proposed GSO is shown to be both bounded and to exhibit the desired property of asymptotic power preservation over graph shifts. For rigour, the sensitivity of the GSO to a graph topology misspecification is also addressed. The advantages of the proposed operator are demonstrated in a real-world multi-sensor signal average setting.
network_data_analysis  to:NB 
12 weeks ago by cshalizi
[1908.01297] A Restricted Black-box Adversarial Framework Towards Attacking Graph Embedding Models
"With the great success of graph embedding model on both academic and industry area, the robustness of graph embedding against adversarial attack inevitably becomes a central problem in graph learning domain. Regardless of the fruitful progress, most of the current works perform the attack in a white-box fashion: they need to access the model predictions and labels to construct their adversarial loss. However, the inaccessibility of model predictions in real systems makes the white-box attack impractical to real graph learning system. This paper promotes current frameworks in a more general and flexible sense -- we demand to attack various kinds of graph embedding model with black-box driven. To this end, we begin by investigating the theoretical connections between graph signal processing and graph embedding models in a principled way and formulate the graph embedding model as a general graph signal process with corresponding graph filter. As such, a generalized adversarial attacker: GF-Attack is constructed by the graph filter and feature matrix. Instead of accessing any knowledge of the target classifiers used in graph embedding, GF-Attack performs the attack only on the graph filter in a black-box attack fashion. To validate the generalization of GF-Attack, we construct the attacker on four popular graph embedding models. Extensive experimental results validate the effectiveness of our attacker on several benchmark datasets. Particularly by using our attack, even small graph perturbations like one-edge flip is able to consistently make a strong attack in performance to different graph embedding models."
in_NB  network_data_analysis  adversarial_examples 
12 weeks ago by cshalizi
[1909.03217] Community detection in inhomogeneous random graphs
"We study the problem of detecting whether an inhomogeneous random graph contains a planted community. Specifically, we observe a single realization of a graph. Under the null hypothesis, this graph is a sample from an inhomogeneous random graph, whereas under the alternative, there exists a small subgraph where the edge probabilities are increased by a multiplicative scaling factor. We present a scan test that is able to detect the presence of such a planted community, even when this community is very small and the underlying graph is inhomogeneous. We also derive an information theoretic lower bound for this problem which shows that in some regimes the scan test is almost asymptotically optimal. We illustrate our results through examples and numerical experiments."
to:NB  community_discovery  network_data_analysis  statistics 
12 weeks ago by cshalizi
[1909.02900] On the Estimation of Network Complexity: Dimension of Graphons
"Network complexity has been studied for over half a century and has found a wide range of applications. Many methods have been developed to characterize and estimate the complexity of networks. However, there has been little research with statistical guarantees. In this paper, we develop a statistical theory of graph complexity in a general model of random graphs, the so-called graphon model. Given a graphon, we endow the latent space of the nodes with the so-called neighborhood distance that measures the propensity of two nodes to be connected with similar nodes. Our complexity index is then based on the covering number and the Minkowski dimension of (a purified version of) this metric space. Although the latent space is not identifiable, these indices turn out to be identifiable. This notion of complexity has simple interpretations on popular examples of random graphs: it matches the number of communities in stochastic block models; the dimension of the Euclidean space in random geometric graphs; the regularity of the link function in Hölder graphon models. From a single observation of the graph, we construct an estimator of the neighborhood-distance and show universal non-asymptotic bounds for its risk, matching minimax lower bounds. Based on this estimated distance, we compute the corresponding covering number and Minkowski dimension and we provide optimal non-asymptotic error bounds for these two plug-in estimators."
to:NB  to_read  complexity_measures  graph_limits  network_data_analysis  re:smoothing_adjacency_matrices 
september 2019 by cshalizi
[1909.00958] Graph Representation Learning: A Survey
"Research on graph representation learning has received a lot of attention in recent years since many data in real-world applications come in form of graphs. High-dimensional graph data are often in irregular form, which makes them more difficult to analyze than image/video/audio data defined on regular lattices. Various graph embedding techniques have been developed to convert the raw graph data into a low-dimensional vector representation while preserving the intrinsic graph properties. In this review, we first explain the graph embedding task and its challenges. Next, we review a wide range of graph embedding techniques with insights. Then, we evaluate several state-of-the-art methods against small and large datasets and compare their performance. Finally, potential applications and future directions are presented."
to:NB  graph_embedding  network_data_analysis 
september 2019 by cshalizi
[1909.01076] Link Prediction in Networks Using Effective Transitions
"We introduce a new method for predicting the formation of links in real-world networks, which we refer to as the method of effective transitions. This method relies on the theory of isospectral matrix reductions to compute the probability of eventually transitioning from one vertex to another in a (biased) random walk on the network. Unlike the large majority of link prediction techniques, this method can be used to predict links in networks that are directed or undirected which are either weighted or unweighted. We apply this method to a number of social, technological, and natural networks and show that it is competitive with other link predictors often outperforming them. We also provide a method of approximating our effective transition method and show that aside from having much lower temporal complexity, this approximation often provides more accurate predictions than the original effective transition method. We also, prove a number of mathematical results regarding our effective transition algorithm and its approximation."
to:NB  network_data_analysis  link_prediction 
september 2019 by cshalizi
[1909.00226] Small worlds and clustering in spatial networks
"Networks with underlying metric spaces attract increasing research attention in network science, statistical physics, applied mathematics, computer science, sociology, and other fields. This attention is further amplified by the current surge of activity in graph embedding. In the vast realm of spatial network models, only a few reproduce even the most basic properties of real-world networks. Here, we focus on three such properties---sparsity, small worldness, and clustering---and identify the general subclass of spatial homogeneous and heterogeneous network models that are sparse small worlds and that have nonzero clustering in the thermodynamic limit. We rely on the maximum entropy approach where network links correspond to noninteracting fermions whose energy dependence on spatial distances determines network small worldness and clustering."
to:NB  network_data_analysis  network_formation  probability  krioukov.dmitri  to_teach:baby-nets 
september 2019 by cshalizi
[1707.03186] Community Discovery in Dynamic Networks: a Survey
"Networks built to model real world phenomena are characeterised by some properties that have attracted the attention of the scientific community: (i) they are organised according to community structure and (ii) their structure evolves with time. Many researchers have worked on methods that can efficiently unveil substructures in complex networks, giving birth to the field of community discovery. A novel and challenging problem started capturing researcher interest recently: the identification of evolving communities. To model the evolution of a system, dynamic networks can be used: nodes and edges are mutable and their presence, or absence, deeply impacts the community structure that composes them. The aim of this survey is to present the distinctive features and challenges of dynamic community discovery, and propose a classification of published approaches. As a "user manual", this work organizes state of art methodologies into a taxonomy, based on their rationale, and their specific instanciation. Given a desired definition of network dynamics, community characteristics and analytical needs, this survey will support researchers to identify the set of approaches that best fit their needs. The proposed classification could also help researchers to choose in which direction should future research be oriented."
to:NB  community_discovery  network_data_analysis 
september 2019 by cshalizi
[1908.10129] Network Communities of Dynamical Influence
"Fuelled by a desire for greater connectivity, networked systems now pervade our society at an unprecedented level that will affect it in ways we do not yet understand. Nature, in contrast, has already developed efficient and resilient large-scale networks, including brain connectomes and bird flocks. These natural systems rely on the stimulation of key elements, which access effective pathways of communication, to instigate response and consensus. In this paper, we explore the link between network structure and dynamical influence to further our understanding of these effective networks. Our technique identifies key vertices that rapidly drive a network to consensus, and the communities that form under their dynamical influence, by investigating the relationships between the system's dominant eigenvectors. These communities of dynamical influence enable the clear identification of human subjects from their brain connectomes and provides an insight into functional activity. They are also used to highlighted the effectiveness of starling flocks, where increasing the outdegree is likely to produce a less responsive flock that has the most influential birds poorly positioned to observe a predator and, hence, instigate an evasion manoeuvre."
to:NB  social_influence  community_discovery  network_data_analysis  to_be_shot_after_a_fair_trial 
august 2019 by cshalizi
[1908.09867] Consistency of community structure in complex networks
"The most widely used techniques for community detection in networks, including methods based on modularity, statistical inference, and information theoretic arguments, all work by optimizing objective functions that measure the quality of network partitions. There is a good case to be made, however, that one should not look solely at the single optimal community structure under such an objective function, but rather at a selection of high-scoring structures. If one does this one typically finds that the resulting structures show considerable variation, and this has been taken as evidence that these community detection methods are unreliable, since they do not appear to give consistent answers. Here we argue that, upon closer inspection, the structures found are in fact consistent in a certain way. Specifically, we show that they can all be assembled from a set of underlying "building blocks", groups of network nodes that are usually found together in the same community. Different community structures correspond to different arrangements of blocks, but the blocks themselves are largely invariant. We propose an information theoretic method for discovering the building blocks in specific networks and demonstrate it with several example applications. We conclude that traditional community detection is not the failure some have suggested it is, and that in fact it gives a significant amount of insight into network structure, although perhaps not in exactly the way previously imagined."
to:NB  community_discovery  information_theory  network_data_analysis  kith_and_kin  newman.mark  statistics  to_teach:baby-nets  to_read 
august 2019 by cshalizi
[1906.10258] Policy Targeting under Network Interference
"This paper discusses the problem of estimating individualized treatment allocation rules under network interference. We propose a method with several appealing features for applications: we let treatment and spillover effects be heterogeneous in the population, and we construct targeting rules that exploit such heterogeneity; we accommodate for arbitrary, possibly non-linear, regression models, and we propose estimators that are robust to model misspecification; treatment allocation rules depend on an arbitrary set of individual, neighbors' and network characteristics, and we allow for general constraints on the policy function and capacity constraints on the number of treated units; the proposed methodology is valid also when only local information of the network is observed. From a theoretical perspective, we establish the first set of guarantees on the utilitarian regret under interference, and we show that it achieves the min-max optimal rate in scenarios of practical and theoretical interest. We provide a mixed-integer linear program formulation of the optimization problem, that can be solved using standard optimization routines. We discuss the empirical performance in simulations, and we illustrate our method by investigating the role of social networks in micro-finance decisions."
to:NB  social_influence  causal_inference  network_data_analysis  statistics  econometrics 
august 2019 by cshalizi
[1908.08572] From Community to Role-based Graph Embeddings
"Roles are sets of structurally similar nodes that are more similar to nodes inside the set than outside, whereas communities are sets of nodes with more connections inside the set than outside (based on proximity/closeness, density). Roles and communities are fundamentally different but important complementary notions. Recently, the notion of roles has become increasingly important and has gained a lot of attention due to the proliferation of work on learning representations (node/edge embeddings) from graphs that preserve the notion of roles. Unfortunately, recent work has sometimes confused the notion of roles and communities leading to misleading or incorrect claims about the capabilities of network embedding methods. As such, this manuscript seeks to clarify the differences between roles and communities, and formalize the general mechanisms (e.g., random walks, feature diffusion) that give rise to community or role-based embeddings. We show mathematically why embedding methods based on these identified mechanisms are either community or role-based. These mechanisms are typically easy to identify and can help researchers quickly determine whether a method is more prone to learn community or role-based embeddings. Furthermore, they also serve as a basis for developing new and better methods for community or role-based embeddings. Finally, we analyze and discuss the applications and data characteristics where community or role-based embeddings are most appropriate."
to:NB  graph_theory  community_discovery  network_data_analysis 
august 2019 by cshalizi
[1703.06558] Using Maximum Entry-Wise Deviation to Test the Goodness-of-Fit for Stochastic Block Models
"The stochastic block model is widely used for detecting community structures in network data. How to test the goodness-of-fit of the model is one of the fundamental problems and has gained growing interests in recent years. In this article, we propose a novel goodness-of-fit test based on the maximum entry of the centered and re-scaled adjacency matrix for the stochastic block model. One noticeable advantage of the proposed test is that the number of communities can be allowed to grow linearly with the number of nodes ignoring a logarithmic factor. We prove that the null distribution of the test statistic converges in distribution to a Gumbel distribution, and we show that both the number of communities and the membership vector can be tested via the proposed method. Further, we show that the proposed test has asymptotic power guarantee against a class of alternatives. We also demonstrate that the proposed method can be extended to the degree-corrected stochastic block model. Both simulation studies and real-world data examples indicate that the proposed method works well."
to:NB  network_data_analysis  stochastic_block_models  goodness-of-fit  statistics  to_teach:baby-nets 
august 2019 by cshalizi
[1908.09029] Dyadic Regression
"Dyadic data, where outcomes reflecting pairwise interaction among sampled units are of primary interest, arise frequently in social science research. Regression analyses with such data feature prominently in many research literatures (e.g., gravity models of trade). The dependence structure associated with dyadic data raises special estimation and, especially, inference issues. This chapter reviews currently available methods for (parametric) dyadic regression analysis and presents guidelines for empirical researchers."
to:NB  regression  network_data_analysis  statistics 
august 2019 by cshalizi
[1908.08429] On the Structural Properties of Social Networks and their Measurement-calibrated Synthetic Counterparts
"Data-driven analysis of large social networks has attracted a great deal of research interest. In this paper, we investigate 120 real social networks and their measurement-calibrated synthetic counterparts generated by four well-known network models. We investigate the structural properties of the networks revealing the correlation profiles of graph metrics across various social domains (friendship networks, communication networks, and collaboration networks). We find that the correlation patterns differ across domains. We identify a non-redundant set of metrics to describe social networks. We study which topological characteristics of real networks the models can or cannot capture. We find that the goodness-of-fit of the network models depends on the domains. Furthermore, while 2K and stochastic block models lack the capability of generating graphs with large diameter and high clustering coefficient at the same time, they can still be used to mimic social networks relatively efficiently."
to:NB  network_data_analysis  to_teach:baby-nets  to_read 
august 2019 by cshalizi
[1908.08478] Two Decades of Network Science as seen through the co-authorship network of network scientists
"Complex networks have attracted a great deal of research interest in the last two decades since Watts & Strogatz, Barabási & Albert and Girvan & Newman published their highly-cited seminal papers on small-world networks, on scale-free networks and on the community structure of complex networks, respectively. These fundamental papers initiated a new era of research establishing an interdisciplinary field called network science. Due to the multidisciplinary nature of the field, a diverse but not divided network science community has emerged in the past 20 years. This paper honors the contributions of network science by exploring the evolution of this community as seen through the growing co-authorship network of network scientists (here the notion refers to a scholar with at least one paper citing at least one of the three aforementioned milestone papers). After investigating various characteristics of 29,528 network science papers, we construct the co-authorship network of 52,406 network scientists and we analyze its topology and dynamics. We shed light on the collaboration patterns of the last 20 years of network science by investigating numerous structural properties of the co-authorship network and by using enhanced data visualization techniques. We also identify the most central authors, the largest communities, investigate the spatiotemporal changes, and compare the properties of the network to scientometric indicators."

--- Haven't read this, but at the very least the data would be useful for teaching
to:NB  to_read  bibliometry  sociology_of_science  network_data_analysis  social_networks  to_teach:baby-nets 
august 2019 by cshalizi
[1709.01577] Auto-G-Computation of Causal Effects on a Network
"Methods for inferring average causal effects have traditionally relied on two key assumptions: (i) the intervention received by one unit cannot causally influence the outcome of another; and (ii) units can be organized into non-overlapping groups such that outcomes of units in separate groups are independent. In this paper, we develop new statistical methods for causal inference based on a single realization of a network of connected units for which neither assumption (i) nor (ii) holds. The proposed approach allows both for arbitrary forms of interference, whereby the outcome of a unit may depend on interventions received by other units with whom a network path through connected units exists; and long range dependence, whereby outcomes for any two units likewise connected by a path in the network may be dependent. Under network versions of consistency and no unobserved confounding, inference is made tractable by an assumption that the network's outcome, treatment and covariate vectors are a single realization of a certain chain graph model. This assumption allows inferences about various network causal effects via the auto-g-computation algorithm, a network generalization of Robins' well-known g-computation algorithm previously described for causal inference under assumptions (i) and (ii)."

--- I feel like I've read something very similar in another paper by Ilya & Eric, but maybe I just heard them talk about it?
to:NB  graphical_models  causal_inference  network_data_analysis  statistics 
august 2019 by cshalizi
[1908.08880] Graphical Construction of Spatial Gibbs Random Graphs
"We present a Spatial Gibbs Random Graphs Model that incorporates the interplay between the statistics of the graph and the underlying space where the vertices are located. We propose a graphical construction of a model with vertices located in a finite subset of ℤ2 that penalizes edges between distant vertices as well as other structures such as stars or triangles. We prove the existence and uniqueness of a measure defined on graphs with vertices in ℤ2 as the limit along the measures over graphs with finite vertex set. Moreover, a perfect simulation algorithm is obtained in order to sample a subgraph from the measure defined on graphs with vertex set ℤ2."
to:NB  network_data_analysis  statistics 
august 2019 by cshalizi
[1808.10593] Asymptotic Seed Bias in Respondent-driven Sampling
"Respondent-driven sampling (RDS) collects a sample of individuals in a networked population by incentivizing the sampled individuals to refer their contacts into the sample. This iterative process is initialized from some seed node(s). Sometimes, this selection creates a large amount of seed bias. Other times, the seed bias is small. This paper gains a deeper understanding of this bias by characterizing its effect on the limiting distribution of various RDS estimators. Using classical tools and results from multi-type branching processes (Kesten and Stigum, 1966), we show that the seed bias is negligible for the Generalized Least Squares (GLS) estimator and non-negligible for both the inverse probability weighted and Volz-Heckathorn (VH) estimators. In particular, we show that (i) above a critical threshold, VH converge to a non-trivial mixture distribution, where the mixture component depends on the seed node, and the mixture distribution is possibly multi-modal. Moreover, (ii) GLS converges to a Gaussian distribution independent of the seed node, under a certain condition on the Markov process. Numerical experiments with both simulated data and empirical social networks suggest that these results appear to hold beyond the Markov conditions of the theorems."
to:NB  respondent-driven_sampling  network_data_analysis  rohe.karl  statistics 
august 2019 by cshalizi
Kreiß , Mammen , Polonik : Nonparametric inference for continuous-time event counting and link-based dynamic network models
"A flexible approach for modeling both dynamic event counting and dynamic link-based networks based on counting processes is proposed, and estimation in these models is studied. We consider nonparametric likelihood based estimation of parameter functions via kernel smoothing. The asymptotic behavior of these estimators is rigorously analyzed in an asymptotic framework where the number of nodes tends to infinity. The finite sample performance of the estimators is illustrated through an empirical analysis of bike share data."
to:NB  nonparametrics  network_data_analysis  time_series  statistics 
august 2019 by cshalizi
[1812.06406] Community Detection with Dependent Connectivity
"In network analysis, within-community members are more likely to be connected than between-community members, which is reflected in that the edges within a community are intercorrelated. However, existing probabilistic models for community detection such as the stochastic block model (SBM) are not designed to capture the dependence among edges. In this paper, we propose a new community detection approach to incorporate within-community dependence of connectivities through the Bahadur representation. The proposed method does not require specifying the likelihood function, which could be intractable for correlated binary connectivities. In addition, the proposed method allows for heterogeneity among edges between different communities. In theory, we show that incorporating correlation information can lower estimation bias and accelerate algorithm convergence. Our simulation studies show that the proposed algorithm outperforms the popular variational EM algorithm assuming conditional independence among edges. We also demonstrate the application of the proposed method to agricultural product trading networks from different countries."
to:NB  community_discovery  network_data_analysis  statistics 
august 2019 by cshalizi
[1904.05523] Relational flexibility of network elements based on inconsistent community detection
"Community identification of network components enables us to understand the mesoscale clustering structure of networks. A number of algorithms have been developed to determine the most likely community structures in networks. Such a probabilistic or stochastic nature of this problem can naturally involve the ambiguity in resultant community structures. More specifically, stochastic algorithms can result in different community structures for each realization in principle. In this study, instead of trying to "solve" this community degeneracy problem, we turn the tables by taking the degeneracy as a chance to quantify how strong companionship each node has with other nodes. For that purpose, we define the concept of companionship inconsistency that indicates how inconsistently a node is identified as a member of a community regarding the other nodes. Analyzing model and real networks, we show that companionship inconsistency discloses unique characteristics of nodes, thus we suggest it as a new type of node centrality. In social networks, for example, companionship inconsistency can classify outsider nodes without firm community membership and promiscuous nodes with multiple connections to several communities. In infrastructure networks such as power grids, it can diagnose how the connection structure is evenly balanced in terms of power transmission. Companionship inconsistency, therefore, abstracts individual nodes' intrinsic property on its relationship to a higher-order organization of the network."
to:NB  community_discovery  network_data_analysis 
august 2019 by cshalizi
[1908.06940] Consistent Community Detection in Continuous-Time Networks of Relational Events
"In many application settings involving networks, such as messages between users of an on-line social network or transactions between traders in financial markets, the observed data are in the form of relational events with timestamps, which form a continuous-time network. We propose the Community Hawkes Independent Pairs (CHIP) model for community detection on such timestamped relational event data. We demonstrate that applying spectral clustering to adjacency matrices constructed from relational events generated by the CHIP model provides consistent community detection for a growing number of nodes. In particular, we obtain explicit non-asymptotic upper bounds on the misclustering rates based on the separation conditions required on the parameters of the model for consistent community detection. We also develop consistent and computationally efficient estimators for the parameters of the model. We demonstrate that our proposed CHIP model and estimation procedure scales to large networks with tens of thousands of nodes and provides superior fits compared to existing continuous-time network models on several real networks."
to:NB  network_data_analysis  community_discovery  point_processes  statistics 
august 2019 by cshalizi
[1908.06339] Indetermination of networks structure from the dynamics perspective
"Networks are universally considered as complex structures of interactions of large multi-component systems. In order to determine the role that each node has inside a complex network, several centrality measures have been developed. Such topological features are also important for their role in the dynamical processes occurring in networked systems. In this paper, we argue that the dynamical activity of the nodes may strongly reshape their relevance inside the network making centrality measures in many cases misleading. We show that when the dynamics taking place at the local level of the node is slower than the global one between the nodes, then the system may lose track of the structural features. On the contrary, when that ratio is reversed only global properties such as the shortest distances can be recovered. From the perspective of networks inference, this constitutes an uncertainty principle, in the sense that it limits the extraction of multi-resolution information about the structure, particularly in the presence of noise. For illustration purposes, we show that for networks with different time-scale structures such as strong modularity, the existence of fast global dynamics can imply that precise inference of the community structure is impossible."

--- This tends to reinforce my long-standing gut skepticism about the universal value of centrality measures.
to:NB  network_data_analysis  dynamical_systems  to_teach:baby-nets  to_read 
august 2019 by cshalizi
[1902.03002] Covariance and Correlation Kernels on a Graph in the Generalized Bag-of-Paths Formalism
"This work derives closed-form expressions computing the expectation of co-presences and of number of co-occurences of nodes on paths sampled from a network according to general path weights (a bag of paths). The underlying idea is that two nodes are considered as similar when they appear together on (preferably short) paths of the network. The results are obtained for both regular and hitting paths and serve as a basis for computing new covariance and correlation measures between nodes. Experiments on semi-supervised classification show that the introduced similarity measures provide competitive performances compared to other state-of-the-art distances and similarities."
to:NB  network_data_analysis  to_teach:baby-nets  probability  graph_theory 
august 2019 by cshalizi
[1908.06438] Spectral inference for large Stochastic Blockmodels with nodal covariates
"In many applications of network analysis, it is important to distinguish between observed and unobserved factors affecting network structure. To this end, we develop spectral estimators for both unobserved blocks and the effect of covariates in stochastic blockmodels. Our main strategy is to reformulate the stochastic blockmodel estimation problem as recovery of latent positions in a generalized random dot product graph. On the theoretical side, we establish asymptotic normality of our estimators for the subsequent purpose of performing inference. On the applied side, we show that computing our estimator is much faster than standard variational expectation--maximization algorithms and scales well for large networks. The results in this paper provide a foundation to estimate the effect of observed covariates as well as unobserved latent community structure on the probability of link formation in networks."
to:NB  community_discovery  stochastic_block_models  network_data_analysis  spectral_methods  priebe.carey_e.  statistics  to_teach:baby-nets 
august 2019 by cshalizi
[1907.03902] Uncertainty and causal emergence in complex networks
"The connectivity of a network conveys information about the dependencies between nodes. We show that this information can be analyzed by measuring the uncertainty (and certainty) contained in paths along nodes and links in a network. Specifically, we derive from first principles a measure known as effective information and describe its behavior in common network models. Networks with higher effective information contain more information within the dependencies between nodes. We show how subgraphs of nodes can be grouped into macro-nodes, reducing the size of a network while increasing its effective information, a phenomenon known as causal emergence. We find that causal emergence is common in simulated and real networks across biological, social, informational, and technological domains. Ultimately, these results show that the emergence of higher scales in networks can be directly assessed, and that these higher scales offer a way to create certainty out of uncertainty."
to:NB  information_theory  network_data_analysis  emergence  to_be_shot_after_a_fair_trial  via:vaguery 
august 2019 by cshalizi
[1908.05873] Selection of Exponential-Family Random Graph Models via Held-Out Predictive Evaluation (HOPE)
"Statistical models for networks with complex dependencies pose particular challenges for model selection and evaluation. In particular, many well-established statistical tools for selecting between models assume conditional independence of observations and/or conventional asymptotics, and their theoretical foundations are not always applicable in a network modeling context. While simulation-based approaches to model adequacy assessment are now widely used, there remains a need for procedures that quantify a model's performance in a manner suitable for selecting among competing models. Here, we propose to address this issue by developing a predictive evaluation strategy for exponential family random graph models that is analogous to cross-validation. Our approach builds on the held-out predictive evaluation (HOPE) scheme introduced by Wang et al. (2016) to assess imputation performance. We systematically hold out parts of the observed network to: evaluate how well the model is able to predict the held-out data; identify where the model performs poorly based on which data are held-out, indicating e.g. potential weaknesses; and calculate general summaries of predictive performance that can be used for model selection. As such, HOPE can assist researchers in improving models by indicating where a model performs poorly, and by quantitatively comparing predictive performance across competing models. The proposed method is applied to model selection problem of two well-known data sets, and the results are compared to those obtained via nominal AIC and BIC scores."

--- But unless the ERGM is projective, with hold-out you'd have to fit it by explicitly marginalizing out the configuration of the held-out portion of the graph, which is going to be computationally very expensive...
to:NB  cross-validation  exponential_family_random_graphs  network_data_analysis  statistics  re:your_favorite_ergm_sucks 
august 2019 by cshalizi
Corrected Bayesian Information Criterion for Stochastic Block Models: Journal of the American Statistical Association: Vol 0, No 0
"Estimating the number of communities is one of the fundamental problems in community detection. We re-examine the Bayesian paradigm for stochastic block models (SBMs) and propose a “corrected Bayesian information criterion” (CBIC), to determine the number of communities and show that the proposed criterion is consistent under mild conditions as the size of the network and the number of communities go to infinity. The CBIC outperforms those used in Wang and Bickel and Saldana, Yu, and Feng which tend to underestimate and overestimate the number of communities, respectively. The results are further extended to degree corrected SBMs. Numerical studies demonstrate our theoretical results."
to:NB  information_criteria  community_discovery  network_data_analysis  statistics  stochastic_block_models 
august 2019 by cshalizi
[1908.04901] On community structure in complex networks: challenges and opportunities
"Community structure is one of the most relevant features encountered in numerous real-world applications of networked systems. Despite the tremendous effort of a large interdisciplinary community of scientists working on this subject over the past few decades, more investigations are needed in order to better understand the impact of their structure and dynamics on networked systems. Here, in the first section, we review the work on generative models of communities and their role in developing strong foundation for community detection algorithms. We discuss modularity and algorithms based on modularity maximization. Then, we follow with an overview of the Stochastic Block Model and its different variants as well as inference of communities structures from the model. The following section focuses on time evolving networks, where existing nodes and links can disappear, and in parallel new nodes and links may be introduced. The extraction of communities under such circumstances poses an interesting and non-trivial problem that has gained considerable interest over the last decade. We briefly discuss considerable advances made in this field recently. In the last section, we discuss immunization strategies essential for targeting the influential spreaders of epidemics in modular networks. Their main goal is to select and immunize a small proportion of individuals from the whole network to control the discussion process. Various strategies have emerged over the years suggesting different ways to immunize nodes in networks with overlapping and non-overlapping community structure. We first discuss stochastic strategies that require little or no information about the network topology at the expense of their performance. Then, we introduce deterministic strategies that have proven to be very efficient in controlling the epidemic outbreaks, but require complete knowledge of the network."
in_NB  community_discovery  network_data_analysis  epidemics_on_networks 
august 2019 by cshalizi
[1704.07114] Regular Decomposition: an information and graph theoretic approach to stochastic block models
"A method for compression of large graphs and non-negative matrices to a block structure is proposed. Szemerédi's regularity lemma is used as heuristic motivation of the significance of stochastic block models. Another ingredient of the method is Rissanen's minimum description length principle (MDL). We propose practical algorithms and provide theoretical results on the accuracy of the method."

--- How is this different from the old Rosvall & Bergstrom MDL approach to community discovery?
to:NB  network_data_analysis  MDL  community_discovery  stochastic_block_models  statistics 
august 2019 by cshalizi
[1908.03152] Analysis of Networks via the Sparse $β$-Model
"Data in the form of networks are increasingly available in a variety of areas, yet statistical models allowing for parameter estimates with desirable statistical properties for sparse networks remain scarce. To address this, we propose the Sparse β-Model (SβM), a new network model that interpolates the celebrated Erdős-Rényi model and the β-model that assigns one different parameter to each node. By a novel reparameterization of the β-model to distinguish global and local parameters, our SβM can drastically reduce the dimensionality of the β-model by requiring some of the local parameters to be zero. We derive the asymptotic distribution of the maximum likelihood estimator of the SβM when the support of the parameter vector is known. When the support is unknown, we formulate a penalized likelihood approach with the ℓ0-penalty. Remarkably, we show via a monotonicity lemma that the seemingly combinatorial computational problem due to the ℓ0-penalty can be overcome by assigning nonzero parameters to those nodes with the largest degrees. We further show that a β-min condition guarantees our method to identify the true model and provide excess risk bounds for the estimated parameters. The estimation procedure enjoys good finite sample properties as shown by simulation studies. The usefulness of the SβM is further illustrated via the analysis of a microfinance take up example."
to:NB  network_data_analysis  statistics 
august 2019 by cshalizi
« earlier      
per page:    204080120160

related tags

20th_century_history  academia  adamic.lada  adversarial_examples  afghanistan  agent-based_models  ahmed.amr  airoldi.edo  algebra  algebraic_statistics  amaral.luis  american_history  american_southwest  ancel_meyers.lauren  ancient_trade  anomaly_detection  anthropology  approximation  arab_spring  archaeology  arms_trade  aronow.peter  asta.dena  backstrom.lars  bad_data_analysis  bad_science  bakshy.eytan  barabasi.albert-laszlo  bassett.danielle_s.  belief_propagation  belkin.mikhail  bender-de_moll.skye  bergstrom.carl_t.  bibliometry  bickel.peter_j.  biochemical_networks  bioinformatics  biology  bitcoin  blei.david  blitzstein.joseph  blogged  blogging  blogs  books:noted  books:owned  books:recommended  books:reviewed  boosting  bootstrap  borgs.christian  borsboom.denny  brandeis.louis  brantingham.jeff  butts.carter  butts.carter_t.  campaign_finance  can't_remember_if_I_read_it_or_not  carnegie_mellon  cats  causal_inference  cellular_automata  cesa-bianchi.nicolo  chandrasekhar.arun  chandrashekar.arun  change-point_problem  chatterjee.sourav  chayes.jennifer  choi.david  choi.david_s.  citation_networks  classifiers  clauset.aaron  climate_change  clustering  cmu  cohn.henry  collaborative_filtering  collective_cognition  community_discovery  complexity  complexity_measures  compressed_sensing  computational_complexity  computational_humanities  computational_statistics  conferences  confidence_sets  confounding  congress  contagion  context-free_grammars  contingency_tables  control_theory_and_control_engineering  convergence_of_stochastic_processes  convexity  corporations  corruption  counter-insurgency  counter-terrorism  crawford.forrest  creeping_authoritarianism  crime  criminal_conspiracies  cross-validation  cultural_evolution  cumulative_advantage  darpa  data_analysis  data_mining  data_sets  deceiving_us_has_become_an_industrial_process  democracy  density_estimation  deviation_inequalities  diaconis.persi  diffusion_of_innovations  dimension_reduction  discretization  distributed_systems  drugs  dudoit.sandrine  dynamical_systems  dynamic_networks  eagle.nathan  eckles.dean  ecology  econometrics  economics  economic_history  electronic_copy  eliassi-rad.tina  emergence  em_algorithm  enron  ensemble_methods  entableted  environmental_management  epidemics_on_networks  epidemic_models  epidemiology  epidemiology_of_representations  ergodic_theory  estimation  ethology  events  evisceration  evolutionary_biology  evolutionary_game_theory  evolution_of_cooperation  exchangeability  exchangeable_arrays  experimental_design  experimental_psychology  experimental_sociology  exponential_families  exponential_family_random_graphs  facebook  factor_analysis  farrell.henry  FBI  feith.douglas  fienberg.stephen  fienberg.stephen_e.  filtering  finance  financial_crisis_of_2007--  financial_markets  fish  fisher_information  flickr  fmri  food_webs  fosdick.bailey  fraud  freeman.linton_c.  functional_connectivity  functional_data_analysis  funny:geeky  funny:laughing_instead_of_screaming  galesic.mirta  galstyan.aram  gene_expression_data_analysis  gentile.claudio  geometry  getoor.lise  geyer.charles_j.  ghahramani.zoubin  gigs  gile.krista  gile.krista_j.  going_to_miss_the_talk  goldenberg.anna  goodness-of-fit  goodreau.steven_m  grants  graphical_models  graphons  graph_embedding  graph_limits  graph_spectra  graph_theory  gutfraind.sasha  handcock.mark  harrison.matthew_t.  have_read  have_read_too_many_times  have_skimmed  have_written  healy.kieran  heard_the_talk  heavy_tails  hero.alfred_o._iii  hierarchical_statistical_models  hierarchical_structure  high-dimensional_statistics  hilbert_space  historical_linguistics  historiography  history  hoff.peter  hofman.jake  holme.petter  holmes.susan  homophily  hooi.bryan  hsu.daniel  humanities  human_genetics  hunter.david_r.  hyperbolic_geometry  hypergraphs  hypothesis_testing  identifiability  ideology  increasing_returns  inequality  inference_to_latent_objects  influence  information_cascades  information_criteria  information_theory  instrumental_variables  intelligence_(spying)  interacting_particle_systems  inverse_problems  in_library  in_NB  ising_model  i_knew_him_when  jackson.matthew_o.  jacobs.abigail_z.  jin.jiashun  joint_modeling_of_text_and_networks  jordan.michael_i.  kakade.sham  kallenberg.olav  karrer.brian  kernel_methods  kirshner.sergey  kith_and_kin  kleinberg.jon  kolaczyk.eric  kolar.mladen  kossinets.gueorgi  krioukov.dmitri  krivitsky.pavel  kronecker_graphs  kurths.jurgen  large_deviations  lasso  latent_space_network_models  latent_variables  lauritzen.steffen  lazer.david  learning  learning_theory  leenders.roger  lei.jing  lerman.kristina  leskovec.jure  levina.elizaveta  levina.liza  likelihood  linear_algebra  linear_regression  linguistics  link_prediction  literary_history  liu.fuchen  lizardo.omar  locality-sensitive_hashing  logistic_regression  lombardi.mark  lovasz.laszlo  low-rank_approximation  low-regret_learning  lunde.robert  machine_learning  macro_from_micro  mahoney.michael  mahoney.michael_w.  manifold_learning  mankad.shawn  map-reduce  markov_models  mason.winter  mathematics  matrix_and_tensor_factorization  mccormick.tyler  mccormick.tyler_h.  mcpherson.miller  mdl  mean-field_theory  measurement  meila.marina  menczer.filippo  mental_illness  military_industrial_complex  minimax  minimum_description_length  mixture_models  modeling  model_checking  model_selection  modernism  monte_carlo  moody.james  moore.cris  moore.cristopher  morris.martina  mucha.peter  mucha.peter_j.  mukherjee.sayan  multidimensional_scaling  multiple_testing  mythology  national_surveillance_state  natural_history_of_truthiness  networked_life  networks  networks_over_time  network_alignment  network_comparison  network_data_analysis  network_differences  network_experiments  network_formation  network_growth  network_sampling  network_visualization  neural_data_analysis  neural_networks  neuropsychology  neuroscience  neville.jennifer  newman.mark  nexus-7  niezink.nynke  niyogi.partha  nobel.andrew  noel.hans  nonparametrics  NSA  null_models  nyhan.brendan  oatley.thomas  oconnor.brendan  ogburn.elizabeth  ok_not_quite_scooped  online_learning  optimization  orbanz.peter  oscillators  owen.art  parallel_computing  particle_filters  pattison.philippa  peer_production  pentland.alex  perturbation_theory  phase_transitions  phylogenetics  pittsburgh  please_dont_let_me_be_scooped  poetry  point_processes  poldrack.russell  political_economy  political_networks  political_science  porter.mason_a.  pr0n  practices_relating_to_the_transmission_of_genetic_information  prediction  preferential_attachment  pretty_pictures  priebe.carey_e.  principal_components  probability  programming  psychiatry  psychometrics  R  raftery.adrian  random_fields  random_graphs  random_matrices  random_projections  random_walks  rauchway.eric  re:6dfb  re:ADAfaEPoV  re:almost_none  re:bootstrapping_graphs  re:community_control  re:critique_of_diffusion  re:democratic_cognition  re:do-institutions-evolve  re:donor_networks  re:do_not_adjust_your_receiver  re:fractal_asymptotics_of_networks  re:fractal_network_asymptotics  re:friday_science_cat_blogging  re:functional_communities  re:homophily_and_confounding  re:hyperbolic_networks  re:network_bootstraps  re:network_differences  re:neutral_cultural_networks  re:neutral_model_of_inquiry  re:small-area_estimation_by_smoothing  re:smoothing_adjacency_matrices  re:social_networks_as_sensor_networks  re:sporns_review  re:stacs  re:XV_for_networks  re:your_favorite_ergm_sucks  recommender_systems  regression  relational_learning  replicator_dynamics  respondent-driven_sampling  rinaldo.alessandro  robin.stephane  robins.james  roeder.kathryn  rohe.karl  rosenbaum.paul  rosvall.martin  roth.camille  running_dogs_of_reaction  sadeghi.kayvan  salganik.matthew_j.  samii.cyrus  sampling  sarkar.purna  sarkar.purnamrita  sat_on_the_thesis_committee  schizophrenia  schweinberger.michael  science_as_a_social_process  scooped?  self-centered  self-promotion  semantics_from_syntax  semi-supervised_learning  sentiment_analysis  sharpnack.james  shore.jesse_c.  shot_after_a_fair_trial  simulation  singh.aarti  smoothing  smyth.padhraic  snijders.t.a.b.  snijders.tom  snowball_sampling  social_influence  social_life_of_the_mind  social_measurement  social_media  social_movements  social_networks  social_science_methodology  sociolinguistics  sociology  sociology_of_science  software  song.le  spam  sparse_graph_models  sparsity  spatial_statistics  spatio-temporal_statistics  spectral_clustering  spectral_graph_theory  spectral_methods  splines  sra.suvrit  stability_of_learning  state-space_models  state_estimation  statistical_computing  statistical_inference_for_stochastic_processes  statistical_mechanics  statistics  stochastic_block_models  stochastic_differential_equations  stochastic_processes  stovel.katherine  stupid_security  sufficiency  summer_schools  surrogate_data  surveillance  surveys  survey_sampling  survival_analysis  tagging  taskar.ben  terrorism  terrorism_fears  text_mining  theoretical_computer_science  the_continuing_crises  thomas.a.c.  tibshirani.ryan  time-series  time_series  to:blog  to:NB  tofias.michael  topic_models  topology  to_be_shot_after_a_fair_trial  to_read  to_teach  to_teach:baby-nets  to_teach:complexity-and-inference  to_teach:data-mining  to_teach:data_over_space_and_time  to_teach:graphons  to_teach:statcomp  to_teach:undergrad-ADA  traceroute  track_down_references  transaction_networks  two-sample_tests  ugander.johan  us-iraq_war  us_military  us_politics  utter_stupidity  vanderweele.tyler  van_der_laan.mark  van_handel.ramon  variance_estimation  variational_inference  vast_right-wing_conspiracy  veitch.victor  via:?  via:aeo  via:ale  via:arinaldo  via:arthegall  via:benjaminlind  via:chl  via:cris_moore  via:cwiggins  via:deaneckles  via:dsparks  via:gabriel_rossman  via:guslacerda  via:henry_farrell  via:iqss  via:jbdelong  via:john-burke  via:kevin_drum  via:kjhealy  via:krackhardt  via:laura_rozen  via:liza_levina  via:magistra_et_mater  via:mason_porter  via:nyhan  via:ogburn  via:phnk  via:rvenkat  via:shivak  via:vaguery  via:wiggins  via:yorksranter  vicious_parody  visual_display_of_quantitative_information  voter_model  wasserman.larry  was_on_the_committee  watts.duncan  wheel_reinvention  when_you_walk_through_the_garden_you_gotta_watch_your_back  where_by_"heard_the_talk"_I_mean_had_it_explained_over_coffee  why_oh_why_cant_we_have_a_better_academic_publishing_system  willett.rebecca_m.  wolfe.patrick_j.  wurmser.david  xing.eric  your_favorite_deep_neural_network_sucks  yu.bin 

Copy this bookmark: