Modern Network Inference

Stockholm University

August 2018

Daniel Morgan

Network Inference

???

Gardner 2005

Mechanistic

identify direct interactions

Influence

capture information flow to understand control system 


*(not necessarily measured or direct)

Purpose

Gardner 2005

Biology

  • Cells grow, replicate, and die
  • To do this, they must express genes, which confer traits
  • their capabilities in responding to the environment are robust, but not unlimited

Next-Gen Sequencing

  • genome sequencing
  • ChIP-seq
  • RNA-Seq/ transcriptome
  • epigenome characterization
                  Advantage                Disadvantage
Time              Ion, 454                 SOLiD
Coverage/reads
Cost per base     Ion, Illumina, SOLiD     Sanger
Read length
Accuracy          Ion, SOLiD, Sanger       Nanopore, single-molecule

Levels of Biology

Linde 2015

Biology

  • We digitize these gene expression measurements, much as we would characterize hair or eye color

Fractionate, Tag, Bind for Selection

Selection, stabilize/amplify, relative measures

Measure, put back together, analyze

Interactions = Data

More

Smet 2010

Informed Hypothesis

Simulation

Tune/ optimize parameters

Experiment

Analysis

Infer Network

Confirm

Why is network inference difficult?

Time

Environment (food)

Environment (neighbors)

digitizing methods

redundancy / robustness

Pre-Processing

Feature Selection

e.g. pre-processing by Differential Expression Analysis

  1. Set threshold and ceiling values.
  2. Convert each expression value to the log base 2 of the value.
  3. Remove genes below the given threshold.
  4. Remove genes below the minimum fold change.
  5. Discretize or normalize the data.
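A minimal sketch of such a pre-processing pass, assuming a genes-by-samples matrix in pandas; the floor, ceiling, and fold-change values here are illustrative only:

    import numpy as np
    import pandas as pd

    def de_preprocess(expr: pd.DataFrame, floor=20, ceiling=16000, min_fold=3):
        """expr: genes x samples raw expression values (illustrative thresholds)."""
        clipped = expr.clip(lower=floor, upper=ceiling)        # 1. threshold and ceiling
        logged = np.log2(clipped)                              # 2. log base 2
        expressed = clipped.max(axis=1) > floor                # 3. drop genes never above the floor
        variable = (clipped.max(axis=1) / clipped.min(axis=1)) >= min_fold   # 4. minimum fold change
        kept = logged[expressed & variable]
        # 5. normalize each gene to zero mean, unit variance (one of several options)
        return kept.sub(kept.mean(axis=1), axis=0).div(kept.std(axis=1), axis=0)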

Feature Selection/Mapping

(clustering)

  • Hierarchical clustering groups elements based on how close they are to one another. The result is a tree structure, referred to as a dendrogram.
  • K-means clustering groups elements into a specified number of clusters, which can be useful when you know or suspect the number of clusters in the data.
  • Non-negative matrix factorization (NMF) clusters the data by breaking it down into metagenes or metasamples, each of which represents a group of genes or samples, respectively.
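A hedged sketch of the three approaches on a toy genes-by-samples matrix, assuming SciPy and scikit-learn; the cluster count of 5 is arbitrary:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from sklearn.cluster import KMeans
    from sklearn.decomposition import NMF

    X = np.abs(np.random.randn(100, 20))     # toy non-negative matrix: 100 genes x 20 samples

    # Hierarchical: build the dendrogram, then cut it into 5 gene clusters.
    Z = linkage(X, method="average", metric="euclidean")
    hier_labels = fcluster(Z, t=5, criterion="maxclust")

    # K-means: the number of clusters must be specified up front.
    km_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

    # NMF: X ~ W H; columns of W group genes into metagenes, rows of H give their
    # activity across samples (metasamples when run on the transposed matrix).
    nmf = NMF(n_components=5, init="nndsvd", random_state=0, max_iter=500)
    W, H = nmf.fit_transform(X), nmf.components_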

Parameter Optimization

(Structure Optimization)

Fitting Model to Data

Hecker 2009

influence-based reverse-engineering approaches

Influence-based

  • Identify functional modules

    • subsets of genes regulating one another through interactions

      • likely multiple

      • not necessarily direct

  • Predict behavior of system following perturbation

    • predict response of network to external change

    • identify genes directly targeted by perturbation

      • which genes interact directly with drug

  • Identify real physical interactions by integrating the gene network with additional information from sequence data and other data (ChIP, Y2H, etc.)

network construction

Bellot 2015

main GRN model arch.

Hecker 2009

Information Theory (MI)

(Probabilistic) Boolean

(Dynamic) Bayesian

Ordinary Differential Equations

Neural Networks

Mutual Info

  • The area contained by both circles is the joint entropy H(X,Y)
  • An individual circle is the individual entropy H(X) or H(Y)
    •  the unique area is the conditional entropy H(X|Y) or H(Y|X)
  • The violet (overlapping) area is the mutual information I(X;Y)
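In terms of these areas, I(X;Y) = H(X) + H(Y) - H(X,Y); a minimal sketch on a toy discrete joint distribution:

    import numpy as np

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    # Toy joint distribution P(X, Y) over two binary variables.
    pxy = np.array([[0.30, 0.10],
                    [0.15, 0.45]])

    H_xy = entropy(pxy.ravel())        # joint entropy H(X,Y)
    H_x = entropy(pxy.sum(axis=1))     # marginal entropy H(X)
    H_y = entropy(pxy.sum(axis=0))     # marginal entropy H(Y)
    I_xy = H_x + H_y - H_xy            # mutual information I(X;Y)
    H_x_given_y = H_xy - H_y           # conditional entropy H(X|Y)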

Boolean

YES or NO

  • State space of a Boolean Network
  • Thin (black) arrows symbolise the inputs of the Boolean function.
  • The thick (grey) arrows show what a synchronous update does.
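A minimal sketch of one synchronous update on a toy three-gene network; the Boolean rules here are invented for illustration, not taken from a real system:

    # Each gene's next state is a Boolean function of the current state vector.
    rules = {
        "A": lambda s: not s["C"],          # C represses A
        "B": lambda s: s["A"],              # A activates B
        "C": lambda s: s["A"] and s["B"],   # A AND B activate C
    }

    def synchronous_step(state):
        """All genes update at once from the same current state (the thick grey arrows)."""
        return {gene: bool(f(state)) for gene, f in rules.items()}

    # Walk the state space from one start state until a state repeats (attractor reached).
    state, seen = {"A": True, "B": False, "C": False}, []
    while tuple(state.values()) not in seen:
        seen.append(tuple(state.values()))
        state = synchronous_step(state)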

Bayesian

conditional probabilities per link > DAGs

I (A; E)

I (B; D | A,E)

I (C; A,D,E | B)

I (D; B,C,E | A)

I (E; A, D)

Friedman 2000
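Reading these as the Markov independencies of a DAG in which A and E are parents of B, A is the parent of D, and B is the parent of C (one graph consistent with the list above), the joint distribution factorizes as:

    P(A, B, C, D, E) = P(A) P(E) P(B | A, E) P(D | A) P(C | B)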

System of Equations

coupled ODEs relating the [mRNA] of each gene to that of all other genes

Penfold 2011
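A minimal sketch of such a system, assuming linear dynamics dx/dt = Ax + p (A the gene-gene interaction matrix, p a constant perturbation) and SciPy's integrator; the numbers are toy values:

    import numpy as np
    from scipy.integrate import solve_ivp

    A = np.array([[-1.0,  0.0,  0.5],    # toy 3-gene interaction matrix;
                  [ 0.8, -1.0,  0.0],    # row i describes how every gene affects gene i
                  [ 0.0, -0.6, -1.0]])
    p = np.array([0.0, 0.0, 1.0])        # constant perturbation of gene 3

    def dxdt(t, x):
        return A @ x + p                 # dx/dt = A x + p

    sol = solve_ivp(dxdt, t_span=(0, 20), y0=np.zeros(3), t_eval=np.linspace(0, 20, 50))
    steady_state = -np.linalg.solve(A, p)   # at dx/dt = 0, x = -A^{-1} p

The closed-form steady state -A^{-1}p is the same algebra that reappears in the perturbation model later in the talk.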

Classifiers > Neural Nets

(x) hidden layers connecting input and output

supervised

  • infer the mapping implied by the (training) data

unsupervised

  • inferring a function to describe hidden structure from "unlabeled" data

State of the Art

Lecca 2013

So why bother?

fission yeast cell cycle network:

10 genes, 23 links

14 TP, 3 FP, 9 FN

Pros & Cons

(pick a metric)

Smet 2010

Lots of methods

Universal benchmark (tool) needed

GeneSpider
NestBoot
MYC

L1000

GRN

Inference

LASSO (Glmnet)

(T)LSCO, RNI, ARACNe, Genie3

Nested Bootstrapping

Perturbation (sh/siRNA)

SNR, Condition No. (IAA)

MCC, AUROC, wRSS, R^2
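A minimal sketch of two of these scores on an inferred vs. gold-standard network, assuming scikit-learn and treating every possible link as one binary prediction; the 0.5 cut-off is arbitrary:

    import numpy as np
    from sklearn.metrics import matthews_corrcoef, roc_auc_score

    true_net = np.array([[0, 1, 0],
                         [0, 0, 1],
                         [1, 0, 0]])        # gold-standard adjacency (1 = link)
    scores = np.array([[0.1, 0.9, 0.2],
                       [0.0, 0.1, 0.7],
                       [0.4, 0.2, 0.1]])    # inferred link confidences

    y_true, y_score = true_net.ravel(), scores.ravel()
    auroc = roc_auc_score(y_true, y_score)                        # threshold-free ranking quality
    mcc = matthews_corrcoef(y_true, (y_score > 0.5).astype(int))  # TP/TN/FP/FN balance at one cut-off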

Terms:

GeneSpider

Generation and Simulation Package for Informative Data ExploRation

GS case study

via some 200 networks & 600 expression sets, consisting of 4 different topologies, with varied SNR, IAA degrees, and sizes

RNICO

Robust Network Inference: decouples the model selection problem from parameter estimation; very harsh, but among the best methods when noise is low

ARACNe

focuses on the mutual information between gene pairs, link by link, rather than on the system as a whole; it also disregards self-regulating elements (self-loops)

LASSO/Glmnet

Least Absolute Shrinkage & Selection Operator: minimizes the RSS while penalizing the absolute values of the coefficients rather than their squares, making it the harshest penalty (coefficients can be shrunk exactly to zero)
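A minimal sketch of LASSO-style inference, assuming scikit-learn rather than the Glmnet package named above: each gene is regressed on all other genes with an L1 penalty, and the surviving non-zero coefficients become candidate links; the penalty strength is arbitrary here:

    import numpy as np
    from sklearn.linear_model import Lasso

    Y = np.random.randn(60, 10)           # toy data: 60 samples x 10 genes
    A_hat = np.zeros((10, 10))            # inferred network; row i = regulators of gene i

    for i in range(Y.shape[1]):
        others = np.delete(np.arange(Y.shape[1]), i)
        fit = Lasso(alpha=0.1).fit(Y[:, others], Y[:, i])   # L1 penalty on |coefficients|
        A_hat[i, others] = fit.coef_                        # exact zeros drop links entirely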

(T)LSCO

Least squares with cut-off (LSCO) fits a regression minimizing the residuals along the Y axis only; total least squares with cut-off (TLSCO) minimizes the orthogonal residuals, i.e. the errors on both the X and Y axes
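A minimal sketch of the ordinary vs. total least-squares fit for a single predictor (via SVD); the cut-off step of (T)LSCO is not shown:

    import numpy as np

    x = np.linspace(0, 1, 50)
    y = 2.0 * x + 0.1 * np.random.randn(50)

    # Ordinary least squares: minimize vertical (Y-axis) residuals only.
    slope_ols = np.polyfit(x, y, 1)[0]

    # Total least squares: minimize orthogonal residuals (errors in both X and Y).
    D = np.column_stack([x - x.mean(), y - y.mean()])
    _, _, Vt = np.linalg.svd(D, full_matrices=False)
    slope_tls = Vt[0, 1] / Vt[0, 0]      # direction of largest variance defines the TLS line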

GS case study

with self-loops, ARACNe returns a null network

NestBoot

A Generalized Framework for Controlling FDR in Gene Regulatory Network Inference

???

NestBoot

Threshold

(T)LSCO

Glmnet - LASSO

ARACNe


N10 Performance

N45 Performance

B. subtilis

biological time series dataset

  • 1/3 experimental duplicates
  • 1/3 with 4 time points
  • 1/3 with 3 time points
  • use the first and last time points as background & steady state

collapse from 16k genes to 28

  • w/ single perturbations and replicates
  • via Schur method to maintain network properties

Real Data Performance

Myc

Perturbation-based gene regulatory network inference to reliably predict oncogenic mechanisms

qRT-PCR of 40 genes, singly & doubly knocked down via siRNA

  • 3 biological replicates
  • 2 technical replicates

--> gene fold change & variance of expression measures


Y: expression data
A: network
P: perturbation matrix
E: input noise estimate
F: output noise estimate

Y = -mA^{-1}(P + F) + E
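A minimal sketch of simulating expression data from this model for a known network A and a single-gene knockdown design P, taking m = 1 and drawing the noise terms E and F as Gaussians (both assumptions made for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    N = 10                                      # genes = experiments (one knockdown each)
    A = -np.eye(N) + 0.3 * rng.standard_normal((N, N)) * (rng.random((N, N)) < 0.2)
    P = -np.eye(N)                              # perturbation design: gene i knocked down in expt i
    F = 0.01 * rng.standard_normal(P.shape)     # input noise on the perturbations
    E = 0.05 * rng.standard_normal(P.shape)     # output (measurement) noise

    Y = -np.linalg.solve(A, P + F) + E          # Y = -A^{-1}(P + F) + E, with m = 1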

comparison to random networks


Making sense of Landmark 978 (L1000)

Platform

Making sense of Landmark 978 (L1000)

Data

Making sense of Landmark 978 (L1000)

Pipeline

To account for:

screening plate, bead arrays, cell passage, drug batch, equipment units, personnel

4 RUV methods' performance quantified by 7 endpoint measures compared to 4 standard normalization methods

Y = Xβ + Wα + ε
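A hedged sketch of the general "remove unwanted variation" recipe this model describes: estimate the unwanted factors W from genes assumed to be unaffected by the treatment (negative controls), estimate their per-gene effects α by least squares, and subtract them. This is the generic idea only, not the specific RUV3 procedure used in the study; the control-gene indices are an assumed input.

    import numpy as np

    def ruv_correct(Y, control_idx, k=2):
        """Y: samples x genes; control_idx: columns of negative-control genes; k: # unwanted factors."""
        # 1. Estimate the unwanted factors W from the (centered) control genes via SVD.
        Yc = Y[:, control_idx] - Y[:, control_idx].mean(axis=0)
        U, S, _ = np.linalg.svd(Yc, full_matrices=False)
        W = U[:, :k] * S[:k]                       # top-k unwanted factors per sample
        # 2. Estimate alpha (their effect on every gene) and regress it out.
        alpha, *_ = np.linalg.lstsq(W, Y, rcond=None)
        return Y - W @ alpha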

FC = log2 ((expression * ? + systematic noise + Gaussian noise) / control )

4 RUV methods' performance quantified by 7 endpoint measures compared to 4 standard normalization methods

(1) MAD - mean absolute deviation from zero (for reference)

heatmap patterns

(2) SlopeVerti & (3) SlopeHoriz

knockdown controls

(4) AdistKS - Kolmogorov-Smirnov distance between 2 subsets

(5) Q3P -  third quartile of p-values differentiating targeted knockdowns from zero

p-values

(6) UnifKS -  Kolmogorov-Smirnov distance between P>0.001 subsets

(7) Lambda - inflation of median p-value

all vs all

aim: high AdistKS, low lambda, unifKS, slopeHoriz, slopeVerti & MAD
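A minimal sketch of two of the simpler endpoints on toy fold-change vectors, assuming SciPy; the exact gene subsets compared in the study are not reproduced here:

    import numpy as np
    from scipy.stats import ks_2samp

    fc_targeted = np.random.randn(200) + 1.0    # toy fold changes of targeted knockdowns
    fc_other = np.random.randn(2000)            # toy fold changes of everything else

    mad = np.mean(np.abs(fc_other))                        # (1) mean absolute deviation from zero
    adistks = ks_2samp(fc_targeted, fc_other).statistic    # (4) KS distance between the two subsets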

calculating fold change, increasing SNR

  • analysis of fold change profiles estimated by a RUV method
  • remove unwanted variation
  • separates signal from unwanted effects
  • This greatly improves signal-to-noise over standard methods
  • Initially, rather than trying to minimize off-target effects of siRNA for use as replicates, simply use cell lines as replicates for the noise estimation.
  • Use a single normalization method on all cell lines, optimized to these 7 criteria
  • choose the RUV3 method, which does not require replicate experiments for the alpha calculation
    • home in on cell-specific networks once SNR has been increased in this way

L1000: A global perspective

  1. Treat cell lines as replicates for GRNI
  2. Delta between drug expression and empty vector (control)
    • match DE genes to GRN hubs
  3. Treat drugs as multiple perturbations
    • aggregated into a new expression matrix
    • assemble the P matrix by comparing each experiment to the control (see the sketch after this list)
  4. Pathway Enrichment??
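A hedged sketch of steps 2-3, under the assumption that the control column defines the baseline and that a gene is marked as perturbed in P whenever its response to a drug exceeds an arbitrary cut-off; the matrix shapes and threshold are illustrative only:

    import numpy as np

    expr = np.random.rand(978, 26) + 0.1         # toy: 978 landmark genes x (25 drugs + 1 control)
    control = expr[:, 0]

    # Step 2: delta expression of each drug experiment vs. the empty-vector control.
    Y = np.log2(expr[:, 1:] / control[:, None])  # genes x experiments

    # Step 3: assemble P by comparing each experiment to the control; genes whose
    # response exceeds the cut-off are flagged as (putatively) perturbed, with sign.
    P = (np.abs(Y) > 1.0) * np.sign(Y)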

L1000: A global perspective

Next Step

In Conclusion

GeneSPIDER

benchmarking environment

NestBoot

FDR-informed inference

Perturbation Inference

MYC siRNA dataset & novel interaction

L1000

large scale inference & generic cancer network