ENCODE Element Browser and the 3D Genome Browser – Yanli Wang


Yanli Wang:
All right. Good afternoon everybody. My name is Yanli, and I’m actually in Dr. Yue’s
lab so his compliments are — it means nothing to me [laughs]. So today, I would like to present the ENCODE
Element browser and the 3D Genome browser that our lab has built to help, hopefully,
increase the accessibility of the ENCODE data. So the ENCODE Element browser is the
first thing we are going to cover and it broadly is a suite of four tools and it covers two
types of data: one is gene expression, and the other is cis-regulatory elements, and
this part actually ties back into the dataset that Michael talked about and then the next
browser is the 3D Genome browser which visualizes Hi-C and ChIA-PET data. So what are the goals
for this browser? Why did we make it? So as I said, one of the goals is so that the user
could query for the most relevant ENCODE data, and ENCODE has generated millions of data points
and it’s important to isolate the ones that you need for your own biological experiments,
and the next component is to visualize complex data and this is especially true for Hi-C
and ChIA-PET, and the last part is providing an additional layer of evidence. As was said,
it’s difficult to know what the target gene is, even given the cis-regulatory element
because the cis-regulatory element may not regulate the closest gene that’s relative
to its position, so Hi-C and chromatin ligation experiments are important to find the chromatin
loops that develop between the cis-regulatory element and the gene promoter. So the Hi-C
browser is also geared toward that, and hopefully those are used to guide any biological validation
experiments. Okay, so without further ado, let’s dive
in. So, well, I see a bunch of Macs. Oh my gosh, that’s a lot of Macs. Okay, but let
us get started. So first, we are going to do the ENCODE Element browser, and let’s
first go to the www.encodeproject.org website, and then just follow what this animated GIF
is showing, so click on data and annotations and below annotated genome regions there is
a query tool at Penn State, which takes you to the Penn State website. Pardon me?

Male Speaker:
[inaudible]

Yanli Wang:
I think so. It’s not? Does anyone else have this issue?

[inaudible commentary]

Yanli Wang:
Okay.

Male Speaker:
I think the most important thing is to click data and then click annotation and they
will show you this link.

Yanli Wang:
Okay and click on human, the human tab, to access the human data, and let us enter the
gene IKZF1, the Ikaros DNA binding zinc finger 1 and enter it under option one, like the
GIF is trying to show. So did everybody get the — this page here? Everybody get — Yeah?
Okay. As you can see option one is gene expression, so we are looking at the expression of the
Ikaros family pro — gene here across 80-plus tissues. And here is the gene ID with a lot
of synonyms. Oh, and before I forget — so the first part you — so here we enter the
gene symbol, but you can actually enter the RefSeq ID or UniProt ID, as well, and when
you enter the ID or the symbol, the website should prompt you with the correct spelling,
and if you see no prompts, that means there is something wrong with your spelling,
so be careful with that. Okay. So we see the bar graph in RPKM and then below that is
basically the information that the bar graph is trying
to convey, in a table format, and the IKZF1 gene encodes a protein that is involved in hematopoiesis,
as well as immune system development. So it makes sense that, as you can see,
the cells that have relatively high expression are CD20, GM12878, K562, and so on. And if
you think that the label on the bar graph is too small, you can actually click on it
to get an enlarged image and you can read the label from that. Okay, so the next two
options, option two and three, directly stem from Michael’s presentation, the data set
that he presented. So it queries for the candidate cis-regulatory element regions with DHS or
transcription factor binding site, and it is a fast and easy way to really determine
the cis-regulatory elements as well as their tissue specificity. So without further ado,
go back, I guess click on the back button, and under option two, select chromosome
7, and enter 50,300,000 for the start and 50,305,000 for the
end, and then click submit. Okay. So you see a pretty busy page, but the gist of what you
see is going to be the DHSs and the TF binding sites, and then the identity
of the transcription factor binding, if applicable, and the tissue that the region occurs in,
and as you can see, you see a lot of immune related cells, which is very applicable to
Ikaros. Okay. So I guess people don’t really remember the regions that they want
off the top of their head, so the next tool complements option two. In option three,
you can enter a gene as well as an extended window, and then search for cis-regulatory
elements. So for option three, let’s enter IKZF1 and for extended region, if you don’t
put anything there, it is 20 kb. Let’s put one for today because there are a lot of people
trying to access data at the same time. Okay, so you see this page, which is very similar
to the page that you saw before, except this time it is more honed in toward the gene,
and as you can see, the cells that are here are also immune related, which is very appropriate.
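
As a rough illustration of what options two and three do conceptually, the sketch below filters a list of elements by overlap with a query interval, and builds the interval for a gene by padding it with an extended window. This is not the Element browser's actual code: the element records, the gene coordinates, and the function names are all made up for illustration. Only the chromosome 7 query range and the 20 kb default window come from the talk.

```python
from typing import NamedTuple

class Element(NamedTuple):
    chrom: str
    start: int
    end: int
    kind: str    # e.g. "DHS" or "TFBS"
    tissue: str

def overlaps(element, chrom, start, end):
    """True if the element shares at least one base with chrom:start-end."""
    return element.chrom == chrom and element.start < end and element.end > start

def query_region(elements, chrom, start, end):
    """Option-two style query: all elements overlapping a genomic interval."""
    return [e for e in elements if overlaps(e, chrom, start, end)]

def query_gene(elements, gene, window_kb=20):
    """Option-three style query: pad the gene by window_kb on both sides."""
    chrom, gene_start, gene_end = gene
    pad = window_kb * 1000
    return query_region(elements, chrom, max(0, gene_start - pad), gene_end + pad)

# Hypothetical records, invented for this example only.
ELEMENTS = [
    Element("chr7", 50_301_000, 50_301_400, "DHS", "GM12878"),
    Element("chr7", 50_304_900, 50_305_200, "TFBS", "K562"),
    Element("chr7", 50_320_000, 50_320_300, "DHS", "CD20"),
    Element("chr1", 50_301_000, 50_301_400, "DHS", "HepG2"),
]

# Hypothetical gene coordinates (not the real IKZF1 position).
GENE = ("chr7", 50_310_000, 50_315_000)

region_hits = query_region(ELEMENTS, "chr7", 50_300_000, 50_305_000)
for hit in region_hits:
    print(hit.kind, hit.chrom, hit.start, hit.end, hit.tissue)
```

With a smaller window, fewer elements fall inside the padded interval, which is why entering a small extended region keeps the result page short.
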
Okay, and the next one is option four, and option four is actually, as Michael talked
about, trying to correlate the activity between a proximal DHS and a
distal DHS, and as you can recall, a proximal DHS is a DHS that is near a transcription
start site of a gene, but the distal DHS has some sort of cis-regulatory function, an enhancer
function, and as you can see from this example, if we correlate the activity of DHS A with
the distal DHS, there is very low correlation between the two because DHS A is active
in the tissues in which the distal DHS is not active, but this is not the case with DHS B
because DHS B and the distal DHS actually are active in the same or a similar
set of tissues, so that means they have higher correlation, and when
this occurs, the pair of DHSs is referred to as linked, so DHS linkage, and when they are
linked, that means there is some sort of biological connectivity between the two, and
for more information you can refer to the Thurman et al. paper in Nature 2012. So without
further ado, let’s try this out on the Element browser. So under option four, enter the IKZF1 gene
again and click on submit. Okay. So you should see a page that looks like this. So the
first three columns are the location of the proximal DHS, which is near
this gene in the center column, and the next three columns are the location of the
distal DHS, and the last column represents their correlation, and those DHS pairs are only recorded
when their correlation is above 0.7. So I took one of the higher correlating DHS pairs,
the 0.96, this location here, and I actually ran it through option two, which searches elements
in the given genomic region and I saw that there is a transcription factor binding site,
which is for EBF1 and ELF1. These two are also transcription factors that are involved
in immune development, which is consistent with Ikaros’s
function, so this means there is some sort of biological connectivity between the two,
and indeed, if you knock out any of these genes, the patient is
going to develop ALL. So this sort of illustrates that this suite of tools is meant to be used
in concert, to really find the answer to a biological question, and keep in mind that
correlation doesn’t imply causation. Just because these two transcription factors correlate
with Ikaros doesn’t mean that they necessarily directly regulate, so we need an additional
layer of evidence to see if indeed regulation happens, and this brings me to the 3D Genome
Browser. So the 3D Genome browser’s URL is www.3DGenome.org. Yeah?

Male Speaker:
[inaudible]

Male Speaker:
Can you repeat the question? Yeah.

Yanli Wang:
Oh, so the question was, “Can you download the data somehow?” You could save the page
as HTML and then export it to Excel.

Male Speaker:
Yes. Let me answer the question. I think all the files can be downloaded. All the files
we used, actually, the majority are from the ENCODE portal, the annotation page. They can
be downloaded, and the linkage files we used are from John Stam’s Nature 2012 paper,
and that file can be downloaded from that paper. We will use another updated version
soon.

Yanli Wang:
Well, the question is not to download the whole data, but part of the data, right?

[inaudible commentary]

Yanli Wang:
Yeah, we will add that option in the future. If you don’t know how to use XML
that’s fine. Okay. So is everybody at the three — yeah?

Male Speaker:
I think it would be really nice if we can click to the option two from the result of
the option four, so —

Yanli Wang:
Yeah, that’s — actually, I had an epiphany this morning that I should do that so, yeah,
that will be done soon. So is everyone at the 3DGenome.org website? Okay. So there are
two parts to this website. The first part is the visualization of the Hi-C, the Hi-C
Genome browser, and the second part is the visualization of virtual 4C, and 4C, as you
know, is a one-to-many query of the chromatin interaction data, so you are looking at
the interactions of your locus of interest with the
other loci, and the reason it is called virtual 4C is that it’s derived from the Hi-C data. It’s
actually not experimentally derived, so that is why it is virtual 4C, and then, well,
along with the virtual 4C is the ChIA-PET data, and unlike the Element browser, this
part of the website requires JavaScript and HTML5. So most modern browsers actually include
these two, but if you haven’t updated your browser in a long time, I really encourage
you to do that, not only to access the website, but for security concerns, as well. Okay.
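
To make the "virtual 4C is derived from Hi-C" idea concrete, here is a minimal sketch that reads a one-to-many profile out of a contact matrix: the bait locus picks a row, and the values along that row are its interactions with the other bins. The 40 kb resolution, the toy matrix, and the function names are assumptions for illustration; this is not how the 3D Genome browser is implemented.

```python
import numpy as np

def position_to_bin(position, resolution):
    """Index of the fixed-size bin that contains a genomic position."""
    return position // resolution

def virtual_4c(contacts, bait_bin, flank_bins):
    """Return (bin indices, contact values) for the bait row within +/- flank_bins."""
    n = contacts.shape[0]
    lo = max(0, bait_bin - flank_bins)
    hi = min(n, bait_bin + flank_bins + 1)
    return np.arange(lo, hi), contacts[bait_bin, lo:hi]

# Toy symmetric contact matrix with 6 bins (made-up counts).
rng = np.random.default_rng(0)
upper = np.triu(rng.integers(0, 10, size=(6, 6)))
contacts = upper + upper.T - np.diag(np.diag(upper))

# Assumed 40 kb bins; a bait at position 130,000 falls in bin 3.
bait = position_to_bin(130_000, 40_000)
bins, profile = virtual_4c(contacts, bait, flank_bins=2)
for b, value in zip(bins, profile):
    print(f"bin {b}: {value} contacts with bait bin {bait}")
```

Because the matrix is symmetric, reading the bait's row or its column gives the same profile.
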
So the main features of these two browsers: you can easily browse some
of the high-quality published Hi-C data available, including the ones that were generated for
ENCODE, and you can contextualize the data with a customized UCSC browser session. And
lastly, you can browse your own Hi-C data, and we will show you how to do that. Okay.
Let’s click on the Hi-C interactions tab at the top, and let’s enter the gene SOX2 and
click on show interactions. So as with the Element browser, we have two options, one
is search by gene name and the other one is search by location, if you know the exact
location, and this part, I will explain later, the UCSC browser session, in which you can
upload your own session, not the default one that is loaded here, but your own session
with your customized tracks.

Male Speaker:
So after you click submit, you might have to scroll down a little bit, because the Hi-C image
is not shown on the first page.

Yanli Wang:
Yeah, make sure you scroll down. So, for this session only, I actually
filled in a customized browser session for people to use. Okay. So this is the results
page, so remember to scroll down, and in the center is your Hi-C heat map and you can adjust
the intensity with this bar up here and at the top is a navigation bar in which you can
zoom in or zoom out, move left or move right from the region that you are currently in,
and then below is the UCSC Genome browser and it should be aligned to the Hi-C data.
So everybody got that page? Okay, what am I looking at with that weird triangle? So
normally, the Hi-C data is the heat map visualization of the contact matrix, an n-by-n matrix,
and the size of n is determined by the resolution of the matrix,
and normally the Hi-C matrix is diagonally symmetrical, which means that if you want
to look at the interaction of locus m minus two with locus three, or locus three with locus m minus two,
it’s going to be the same whether you are on this side of the diagonal or this side
of the diagonal. So to really save some time and energy, we cut the upper triangle
out, so what you see is actually the rotated upper triangular part of the matrix. Okay.
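
The symmetry argument above can be sketched in a few lines: build a symmetric matrix, keep only the entries on or above the diagonal, and rotate their coordinates so the diagonal lies along the horizontal axis. The function names and the toy matrix are assumptions for illustration, and the rotation formula is one simple choice rather than the browser's own drawing code.

```python
import numpy as np

def symmetric_from_upper(upper):
    """Mirror an upper-triangular matrix across the diagonal."""
    return upper + upper.T - np.diag(np.diag(upper))

def upper_triangle_values(matrix):
    """Row indices, column indices, and values on or above the diagonal."""
    i, j = np.triu_indices(matrix.shape[0])
    return i, j, matrix[i, j]

def rotated_coords(i, j):
    """Rotate by 45 degrees: x is the midpoint of the two bins,
    y is half their separation, so the diagonal sits at y = 0."""
    return (i + j) / 2.0, (j - i) / 2.0

# Toy 4-by-4 contact matrix (made-up values).
matrix = symmetric_from_upper(np.triu(np.arange(16).reshape(4, 4)))

i, j, values = upper_triangle_values(matrix)
x, y = rotated_coords(i, j)

# n * (n + 1) / 2 entries are enough to describe all n * n cells.
print(len(values), "stored values instead of", matrix.size)
```

For an n-by-n matrix this keeps n(n+1)/2 values, roughly half of the full matrix, without losing any information.
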
So this part, I am going to actually go to. Okay, so does everybody have this, right? Okay. So you could adjust the intensity of
the matrix with this bar here. So if you increase the value here, the
Hi-C matrix values that are above it are going to be cut off and represented as red, and if
you increase this value down here, the values that are less than that are going to be represented
as white, so you can slide these sliders to increase or decrease
the color values for your matrix, or you can directly manipulate the values with the arrows
here and then click refresh, or you can directly just input the values you want as the cutoff.
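
The two cutoffs described above amount to clipping the contact values and mapping them onto a white-to-red scale. The following is a minimal sketch of that idea, assuming a linear scale between the two cutoffs; the function name and the linear mapping are assumptions, since the talk does not say how the browser interpolates colors.

```python
import numpy as np

def to_white_red(values, low, high):
    """Map contact values to RGB in [0, 1]: values at or below `low` are white,
    values at or above `high` are full red, and values in between are blended."""
    if high <= low:
        raise ValueError("high cutoff must be greater than low cutoff")
    scaled = (np.asarray(values, dtype=float) - low) / (high - low)
    t = np.clip(scaled, 0.0, 1.0)
    # Red channel stays at 1; green and blue fade from 1 (white) to 0 (red).
    return np.stack([np.ones_like(t), 1.0 - t, 1.0 - t], axis=-1)

rgb = to_white_red([0, 6, 10, 50], low=2, high=10)
print(rgb)
```

Raising the upper cutoff makes fewer cells saturate to red, and raising the lower cutoff pushes more weak contacts to white, which is the contrast adjustment the sliders perform.
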
Click refresh, and that’s the — that’s actually the contrast that I like, and then
if you scroll down a bit, you can actually — you can notice some really high signals.
You can find out what two loci contribute to that signal just by clicking on it, and then
it will extend a gray bar into the UCSC genome browser, and you can look at it. It looks like
it’s really near the transcription start site of SOX2 and it looks like over here we
got some histone modifications, the K27 acetylation, so there could be a potential enhancer here. So
if I’m not sure, I could just double click on the region and then we would zoom in to
the region of interest. Not much here, huh? And then we could adjust the intensity, as
I said, but it’s — and then remember these navigation bars. You can always click here
to zoom out. Okay. And then, is the UCSC genome browser aligned for everybody or is it off
by a bit? Yes?

[inaudible commentary]

Male Speaker:
I’m not sure if it is aligned because if you scroll the bottom, you know, it is getting
defined positions. So the bottom scroll bar, just to the bottom of it —

Yanli Wang:
Here?

Male Speaker:
Yeah. Yeah.

Yanli Wang:
Oh. So, yeah, this is meant to be scrollable. This is meant to be scrollable. So
the user is supposed to find the optimal scroll and then click on set UCSC scroll, so there
is a default value in which the track should align to the Hi-C matrix. So
if you have a different browser or a different operating system, it’s possible that you
are off by a little bit, and when that happens, you can just manually align it —

Male Speaker:
[unintelligible]

Yanli Wang:
— yourself and then click on set UCSC scroll, and then as I said, we can manipulate
the tracks in this UCSC browser session however we want. So let’s give an example. So let’s
say — I’m going to change the ChromHMM data. Instead of dense I want pack here and
then scroll up and click submit. So you will interact with this window here as you would
a UCSC if you were on a UCSC page, and then we notice that the alignment is sort of screwed,
so we want to scroll all the way up, so that you see zero for the horizontal and vertical
scroll here and you see the upper left page of the — the upper left corner of the page.
Then, you click on align UCSC and then the alignment is done automatically for you, and
then you can scroll down to see the different tracks, and I changed this track to pack,
so that is what you see here. So now with the session here, if you just manipulate whatever
region you want, the track stays constant, so this is a customized session just for the
user. Okay, let’s go back to the slide. So as I said, it’s possible to use your
own data, and to do that, we can convert our Hi-C contact matrix data into a file
format called the BUTLR format, the binary upper triangular matrix file, and this is
a file format that was pioneered by our lab. The goal for this file is that it will act the
same as bigWig or bigBed with UCSC. So it’s a binary indexed file, so we can query the regions
that we want. As long as you put the file in this format on a remote server and
enter the URL, our 3D Genome browser will query that file in the
particular region that you want without having to upload the entire matrix file onto our
server. And the BUTLR file has many advantages: it decreases disk and memory usage because
it is binary, and it is indexed and allows random access, so this increases the portability of the file,
the speed when you manipulate the file in memory, and so on. Okay. So the last part is the virtual 4C and
ChIA-PET data. So if you go up to the tabs, click on virtual 4C, and this time we are
going to do the gene BCR1, the B-cell receptor 1, and then the extended region, the default
is 500 kb, if you leave that blank, but you can enter 500 there, and then click on go
to retrieve the virtual 4C region, and you can see here, you can choose a ChIA-PET data set that is relevant
to the cell type, which is GM12878, and you can actually change the tissue of interest
that you look at, and then also down here is the UCSC browser session that you want
to enter, and as I said, I provided a default one for us today for the demo. So after you
click go, here is the page you should see. The color here may be a bit different, but
— so we have the navigation bar, so you can zoom in and zoom out your region of choice,
and then if you enter a gene or a SNP, it actually takes that point and that becomes
your bait locus, or locus of interest, and grabs the virtual
4C for you. If you enter a gene, it actually takes the TSS as the region of interest and
then it grabs the virtual 4C plot for you, and this is supplemented with the DHS
linkage data that was produced by Dr. John Stam’s lab, and for the DHS linkage, the
different colors represent different sets of proximal and distal DHSs that are highly
correlated, and then down here you see the ChIA-PET data, in which the loci that interact
are connected by an arc and, of course, down there is the UCSC Genome browser, so
if I go to the page here, so that — what is it?

[inaudible commentary]

Yanli Wang:
[laughs] Yeah. We should all go to — so, you could actually — up here, as I said,
you can click on whichever TSS. The BCR1 gene actually has two TSSs. So this
is one. You can click on this tab to get the other one, and then, as you can see, this
is not aligned, so you can click on align and it is automatically aligned for you, and
then if you mouse over any region of interest on the virtual 4C plot, you can see the corresponding
regions in the other tracks. So here we go back to the first region. You can see a peak
here, which interacts strongly with — of course, the TSS is going to interact strongly
with itself, but the next strongest point is here, which looks like there’s some weak
H3K4 monomethylation, and as you can see, there is actually a ChIA-PET arc that goes
from here to a region that is close to the TSS, not quite, but pretty close, and then
for the DHS linkage data, I think there is a little bit of brown there and a little brighter brown
here, which indicates that these three genes do correlate activity-wise. Okay. Whoops.
Okay. So those were our browsers, and I would like to thank everyone who
provided the data for us, Dr. Stam and Dr. Wong and Dr. Hardison [spelled phonetically],
as well, and the entire ENCODE group, and the browsers are still in their infancy,
so we welcome any feedback. Thank you.

[applause]

Male Speaker:
I think we have asked enough questions. Let’s move
on to the next speaker, Dr. Luca Pinello from Dana-Farber. So he is going to give a tutorial
on ChromHMM, which is one of the most popular tools these days, and at the end of this half
hour, you can probably claim you will be able to run it, or you have seen how it is run.
All right —

[end of transcript]
