Hetionet in Neo4j


Hetionet v1.0

Hetionet is a network of biology, disease, and pharmacology. Knowledge from millions of biomedical studies over the last half century has been encoded into a single hetnet. Version 1.0 contains 47,031 nodes of 11 types and 2,250,197 relationships of 24 types.

We created Hetionet v1.0 for Project Rephetio (publication) — where we systematically looked at why drugs work and predicted new uses for existing drugs.

Neo4j

Neo4j is a database designed for hetnets (graphs with multiple node or relationship types). You’re currently interacting with Hetionet through the Neo4j Browser, which provides a web interface to the database. This is a read-only instance, so you can’t modify the network. However, you can run queries and explore the network.

If you are new to Neo4j, check out the following guides:

Metagraph

The metagraph shows the data model (schema) for Hetionet v1.0. For an interactive version, run CALL db.schema().

Explore Hetionet


Label counts

In Neo4j, node types are called labels. The following query counts the number of nodes per label. Run it by clicking the text box to prefill the command and then clicking the play button in the upper right.

MATCH (node)
RETURN
  head(labels(node)) AS label,
  count(*) AS count
ORDER BY count DESC

Relationship type counts

Play the following query to count the number of relationships per type.

MATCH ()-[rel]->()
RETURN
  type(rel) AS rel_type,
  count(*) AS count
ORDER BY count DESC

Notice the suffixes (e.g. GpPW in PARTICIPATES_GpPW) which we include to ensure that relationship types between different labels are distinct.
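
The suffix can also be read mechanically. The Python sketch below assumes, based only on the relationship types that appear in this guide, that the uppercase letters abbreviate the source and target labels and the lowercase letters between them abbreviate the relationship kind. The function name split_rel_type is hypothetical and is not part of Hetionet or Neo4j.

```python
import re

# Assumed pattern: KIND_<SourceAbbrev><kindAbbrev><TargetAbbrev>,
# for example PARTICIPATES_GpPW.
_REL_TYPE = re.compile(r"^([A-Z]+)_([A-Z]+)([a-z]+)([A-Z]+)$")


def split_rel_type(rel_type):
    """Split a relationship type into (kind, source, kind_abbrev, target)."""
    match = _REL_TYPE.match(rel_type)
    if match is None:
        raise ValueError(f"unexpected relationship type: {rel_type!r}")
    return match.groups()


print(split_rel_type("PARTICIPATES_GpPW"))
# prints ('PARTICIPATES', 'G', 'p', 'PW')
```

Relationship types that do not follow the assumed pattern raise a ValueError rather than being split silently.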

Random relationships

The following query retrieves a random relationship of each type. The query goes through every relationship and thus may take several seconds.

MATCH ()-[rel]->()
WITH type(rel) AS rel_type, collect(rel) AS rels
WITH rels[toInteger(rand() * size(rels))] AS rel
RETURN startNode(rel), rel, endNode(rel)
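
For readers more familiar with Python than Cypher, the same collect-then-index logic can be sketched on a small in-memory list. The triples below are made up for illustration and are not Hetionet data, and random_rel_per_type is a hypothetical helper name.

```python
import random
from collections import defaultdict

# Toy (source, type, target) triples, invented for illustration only.
relationships = [
    ("disease 1", "ASSOCIATES_DaG", "gene 1"),
    ("disease 1", "ASSOCIATES_DaG", "gene 2"),
    ("disease 1", "LOCALIZES_DlA", "anatomy 1"),
]


def random_rel_per_type(rels, rng=random):
    """Group relationships by type, then pick one at random from each group."""
    by_type = defaultdict(list)  # plays the role of collect(rel) per type(rel)
    for rel in rels:
        by_type[rel[1]].append(rel)
    # int(rng.random() * len(group)) mirrors toInteger(rand() * size(rels))
    return {
        rel_type: group[int(rng.random() * len(group))]
        for rel_type, group in by_type.items()
    }


sample = random_rel_per_type(relationships)
```

As in the Cypher version, every relationship is visited once while grouping, which is why the query scales with the total number of relationships.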

By default, the Hetionet Neo4j Browser only shows relationships that were returned by the query. To show every relationship between the displayed nodes instead, select Connect result nodes under settings.

Project Rephetio


Project Rephetio uses Hetionet to predict the probability that each compound treats each disease. The approach uses hetnet edge prediction — an algorithm that learns which types of paths occur more frequently between known treatments. Use the Prediction Browser to browse the predicted probabilities of treatment for 209,168 compound–disease pairs.

Each prediction has a corresponding Neo4j guide, which provides additional details and visualization. For example, play the following commands to see the evidence for

  • bupropion for nicotine dependence (read more):

    :play https://neo4j.het.io/guides/rep/DB01156/DOID_0050742.html
  • clofarabine treating multiple sclerosis (read more):

    :play https://neo4j.het.io/guides/rep/DB00631/DOID_2377.html
  • nortriptyline treating migraine (read more):

    :play https://neo4j.het.io/guides/rep/DB00540/DOID_6364.html
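
The three commands above share a URL pattern, which can be sketched as a small Python helper. This is an assumption inferred from those three examples only: the compound identifier is used as-is, and the disease identifier appears with an underscore in place of a colon. The function name rephetio_guide_command is hypothetical, and a guide may not exist for every pair of identifiers.

```python
def rephetio_guide_command(compound_id, disease_id):
    """Build the :play command for one compound-disease prediction guide."""
    # Assumption: the URL uses an underscore where the disease ID has a colon.
    disease_part = disease_id.replace(":", "_")
    return (
        ":play https://neo4j.het.io/guides/rep/"
        f"{compound_id}/{disease_part}.html"
    )


print(rephetio_guide_command("DB01156", "DOID:0050742"))
# prints :play https://neo4j.het.io/guides/rep/DB01156/DOID_0050742.html
```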

Learn how to query Hetionet using Cypher


Here are some simple queries to help new users get acquainted with Cypher and Hetionet.

  1. Retrieve the Disease node named lung cancer:

    MATCH (node:Disease {name: "lung cancer"}) RETURN node

    Which is equivalent to:

    MATCH (node:Disease)
    WHERE node.name = "lung cancer"
    RETURN node
  2. Find the anatomies (tissue types) where lung cancer localizes:

    MATCH path = (:Disease {name: 'lung cancer'})-[:LOCALIZES_DlA]->()
    RETURN path
  3. Find all genes associated with spinal cancer:

    MATCH path = (:Disease {name: 'spinal cancer'})-[:ASSOCIATES_DaG]->()
    RETURN path
  4. Find all genes associated with both liver and kidney cancer (return results as a table):

    MATCH (source:Disease)-[:ASSOCIATES_DaG]-(gene:Gene)-[:ASSOCIATES_DaG]-(target:Disease)
    WHERE source.name = 'liver cancer'
      AND target.name = 'kidney cancer'
    RETURN
      gene.name AS gene_symbol,
      gene.description AS gene_name,
      gene.url AS url
    ORDER BY gene_symbol
  5. Find all genes that participate in the mitotic spindle checkpoint biological process:

    MATCH path = (:BiologicalProcess {name: 'mitotic spindle checkpoint'})-[:PARTICIPATES_GpBP]-(:Gene)
    RETURN path
  6. Find all genes that participate in the mitotic spindle checkpoint and are expressed in the lung:

    MATCH path = (bp:BiologicalProcess)-[:PARTICIPATES_GpBP]-(gene:Gene)-[:EXPRESSES_AeG]-(anatomy:Anatomy)
    WHERE bp.name = 'mitotic spindle checkpoint'
      AND anatomy.name = 'lung'
    RETURN path

For more advanced examples, see our query depot.
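
When the same query is reused with different names, Cypher parameters avoid editing the query text each time. The Python sketch below rewrites query 4 that way. It assumes the server accepts the $name parameter syntax; the constant SHARED_GENES_QUERY and the function shared_genes_request are hypothetical names introduced for illustration.

```python
# Query 4 with Cypher parameters ($source, $target) instead of literal names.
SHARED_GENES_QUERY = '''
MATCH (source:Disease)-[:ASSOCIATES_DaG]-(gene:Gene)-[:ASSOCIATES_DaG]-(target:Disease)
WHERE source.name = $source
  AND target.name = $target
RETURN
  gene.name AS gene_symbol,
  gene.description AS gene_name
ORDER BY gene_symbol
'''


def shared_genes_request(source_disease, target_disease):
    """Return the (query, parameters) pair for two disease names."""
    parameters = {"source": source_disease, "target": target_disease}
    return SHARED_GENES_QUERY, parameters


query, parameters = shared_genes_request("liver cancer", "kidney cancer")
# With an open driver session, this pair could be passed along as:
#   session.run(query, parameters)
```

Keeping the names out of the query text also sidesteps quoting problems when a name contains an apostrophe.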

Miscellany


Style

Execute this command to load the Hetionet style. Once the style is loaded, the node coloring in the browser will match the metagraph from the first slide in this guide. This command only needs to be run once per web browser.

:style https://neo4j.het.io/guides/graphstyle.grass

Hosting

neo4j.het.io is hosted on DigitalOcean (learn more). We’d like to thank DigitalOcean for their sponsorship of the Hetionet Browser.

Querying Hetionet from Python


We allow users to programmatically query Hetionet. Our Neo4j instance supports HTTP(S) and Bolt connections. The code below shows how to query Hetionet from Python using the official neo4j driver and the py2neo community driver.

# We use Pandas DataFrames to store tabular query results
# However, this is an optional step for downstream convenience
import pandas

# Return 5 arbitrary diseases
query = '''
MATCH (disease:Disease)
RETURN
  disease.identifier AS identifier,
  disease.name AS name
LIMIT 5
'''

# Uses the official neo4j-python-driver. See https://github.com/neo4j/neo4j-python-driver
# The neo4j.v1 module was removed in driver 4.0; import from the top-level package
from neo4j import GraphDatabase
driver = GraphDatabase.driver("bolt://neo4j.het.io")
with driver.session() as session:
    result = session.run(query)
    result_df = pandas.DataFrame((x.values() for x in result), columns=result.keys())

# Uses py2neo. See http://py2neo.org/v3/
import py2neo
graph = py2neo.Graph("bolt://neo4j.het.io", bolt=True, secure=True,
    http_port=80, https_port=443, bolt_port=7687)
cursor = graph.run(query)
result_df = pandas.DataFrame.from_records(cursor, columns=cursor.keys())

In addition to Python, Neo4j has driver support for many other languages.

We currently limit queries to 120 seconds. If you notice that the Neo4j server is overloaded, please hold off on automated queries. If you are doing a substantial amount of querying, please run the database locally (see the Hetionet Docker).
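
One way for an automated script to back off when the server is busy is to retry with increasing pauses. The sketch below is a generic pattern and not something the Hetionet server requires: run_with_backoff, max_attempts, and base_delay are hypothetical names and values, and catching the broad Exception class is a placeholder because this guide does not say which errors an overloaded server returns.

```python
import time


def run_with_backoff(run_query, max_attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call run_query(); after a failure wait base_delay, then twice that, and so on."""
    for attempt in range(max_attempts):
        try:
            return run_query()
        except Exception:  # placeholder: narrow this to your driver's error types
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            sleep(base_delay * 2 ** attempt)
```

Here run_query is any zero-argument callable, for example a lambda that wraps a session.run call. With the defaults, the waits between attempts are 2, 4, and 8 seconds.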