Monday, April 28, 2008

2. Making scripts and fetching data

Before doing anything with the list of Id's from the previous post, and to make things a lot easier, let's turn the code that we have developed so far into a REBOL script file. Open your favorite plain text editor and enter (or copy and paste) the following:

REBOL [
Title: "Search the NCBI Protein Database"
Date: 24-Apr-2008
File: %esearch-01.r
Author: "Peter C Marks"
Version: 1
]

Explanation:
  • This is called the "header" of a REBOL script file. It must be there. This example shows some of the information fields that can be included in the header. Please consult the REBOL documentation for more information. Minimally, you can get away with just this "REBOL []" but, of course, it's better to have some documentation.
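As a small sketch of how the header can be used, the script below prints one of its own header fields. The file name header-demo.r and the title are made up for illustration, and the sketch assumes the file is run with do, since system/script/header describes the script currently being executed:

```rebol
REBOL [
    Title: "Header demo"
    File: %header-demo.r    ; hypothetical file name, used only for this sketch
]

; system/script/header holds the header of the running script as an object,
; so its fields can be read with a path.
print system/script/header/title
```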
Here's the code we developed in the last post:

eutilities-url: http://www.ncbi.nlm.nih.gov/entrez/eutils/
esearch-arguments: "esearch.fcgi?db=protein&term=inulin"
esearch-url: join eutilities-url esearch-arguments
response: load/markup esearch-url
print response

Copy this and append it to the header text, after the "]". This is the actual script that will be executed. Save this as a file with the name esearch-01.r. Remember the directory/folder where you stored this file; you'll need the path later on.

Start the REBOL command window if you haven't already. We are going to execute this script file. There are two ways of doing this:
  • by positioning ourselves to the directory where the script is located or
  • by specifying the location of the script file.
Here's how to do it the first way, if you're running Windows:

change-dir %/c/languages/rebol

or, if you're running some version of Linux, Mac OS X, or Unix:

change-dir %/home/pcmarks/languages/rebol

Explanation:
  • The current directory is changed to the specified directory. Notice that the argument starts with a %. This indicates that a file/directory name follows.
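If you are not sure that you ended up in the right place, a quick check along these lines may help. This is only a sketch; it assumes the script was saved under the name esearch-01.r as described above:

```rebol
; Show the directory REBOL is currently working in.
print what-dir

; Check that the script file is visible from here before trying to run it.
either exists? %esearch-01.r [
    print "Script found in the current directory"
][
    print "Script not found; check the path given to change-dir"
]
```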
To execute the script, simply type the following:

do %esearch-01.r

Explanation:
  • The do command will attempt to execute the REBOL code in the file. You can also use a URL and other values as arguments.
Alternatively, you could have typed the complete path to the script file:

do %/home/pcmarks/languages/rebol/esearch-01.r
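To illustrate that do accepts values other than file names, here is a short sketch using a string and a block. The word script is introduced only for this example:

```rebol
; A string is loaded as REBOL code and then evaluated.
do "print 1 + 2"

; A block is evaluated directly.
do [print "Hello from a block"]

; A file name can also be held in a word and passed to do.
script: %esearch-01.r
if exists? script [do script]
```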

By the way, at the DOS or shell command prompt (not the REBOL command prompt), you can type the following:

rebol esearch-01.r

and the script will be executed - assuming that your system can find the rebol executable.

I usually keep a text editor open, make changes, save, and execute the script at the REBOL prompt. Also, as with most command lines, you can press the up-arrow to recall previous commands from a history of commands.

In the last post, part of the result from the search was a list of NCBI Id's that were relevant to our search for the term inulin. As a next step, we'd like to select an Id from that list and see what type of information it points to. Here's what the list portion of the response looked like:

...
<IdList>
<Id> 2507051 </Id>
<Id> 72132980 </Id>
<Id> 1110443 </Id>
<Id> 12060499 </Id>
<Id> 9963676 </Id>
<Id> 1906792 </Id>
<Id> 119714336 </Id>
<Id> 169196951 </Id>
<Id> 169175440 </Id>
<Id> 169175430 </Id>
<Id> 169175429 </Id>
<Id> 169090591 </Id>
<Id> 169016425 </Id>
<Id> 169016415 </Id>
<Id> 169016414 </Id>
<Id> 167362208 </Id>
<Id> 167070948 </Id>
<Id> 116668619 </Id>
<Id> 158318775 </Id>
<Id> 119534997 </Id>
</IdList>
...

The NCBI provides another CGI utility called efetch. Given an Id value it will return information about this resource. To use efetch we post a request the same way we did for esearch. We'll try it with the first Id in the list. Create a new text file in your editor, copy the header from the last script, change the values as necessary, and finally enter the following code and save it as efetch-01.r:

eutilities-url: http://www.ncbi.nlm.nih.gov/entrez/eutils/
esearch-arguments: "efetch.fcgi?db=protein&id=2507051"
esearch-url: join eutilities-url esearch-arguments
response: load/markup esearch-url
print response

Explanation:
  • The difference between this script and our first is the second line: Instead of calling the esearch utility at the NCBI, we are calling the efetch utility. We need to tell it from what database to fetch/get information and the Id.
Now execute the script:

do %efetch-01.r

There will be a fairly long response. The beginning of the response should look like this:

Seq-entry ::= seq {
id {
swissprot {
name "INU2_ARTGO" ,
accession "P19870" ,
release "reviewed" ,
version 3 } ,
gi 2507051 } ,
descr {
title "Inulin fructotransferase [DFA-I-forming] (Inulin fructotransferase
[depolymerizing, difructofuranose-1,2':2',1-dianhydride-forming])." ,
sp {
class standard ,
seqref {

gi 1110442 ,
gi 1110443 ,
gi 2127394 } ,
...

Briefly, the response says that this data is a sequence entry (Seq-entry) from the Swiss-Prot database, another large database available over the web. Notice that this response is not an XML document. Instead, it is formatted using an ISO standard called ASN.1. We won't worry about this right now. What is important is that we were able to take an Id value from the list in our original response, give it to another NCBI utility, efetch, and have it return information about a protein associated with inulin. (Notice the title of this sequence entry in the response.)
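The step of picking an Id out of the list and handing it to efetch can also be done in code. What follows is only a sketch: it works on a short, hand-written sample of the esearch response instead of live data, and the words sample, ids, pos and efetch-url are names made up for this example.

```rebol
; A shortened, hand-written sample of the esearch response (not live data).
sample: {<IdList><Id> 2507051 </Id><Id> 72132980 </Id></IdList>}
response: load/markup sample    ; a block of tags and strings

; Collect the string that follows each <Id> tag.
ids: copy []
pos: response
while [pos: find pos <Id>] [
    append ids trim copy second pos
    pos: next pos
]

; Build the efetch URL for the first Id in the list.
eutilities-url: http://www.ncbi.nlm.nih.gov/entrez/eutils/
efetch-url: rejoin [eutilities-url "efetch.fcgi?db=protein&id=" first ids]
print efetch-url
```

With a real response you would replace the sample with the block returned by load/markup esearch-url, and then pass efetch-url to load/markup as in the script above.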

Friday, April 11, 2008

1. Accessing Bioinformatic Data


[You will need to download and install REBOL (it's freeware and runs on many systems). Follow the pointers under Rebol Resources to the right. Download the REBOL/View version; it comes with a GUI component that we will be using later.]

To start things off, we will use REBOL to access the biological databases at the National Center for Biotechnology Information (NCBI). By the end of the post you'll be searching for species that encode the genomic sequences for a particular protein. The NCBI was
“Established in 1988 as a national resource for molecular biology information, NCBI creates public databases, conducts research in computational biology, develops software tools for analyzing genome data, and disseminates biomedical information - all for the better understanding of molecular processes affecting human health and disease.”
The NCBI site is well worth visiting - there is a wealth of information available there.

The NCBI hosts several bioinformatic databases all of which can be publicly accessed. The NCBI has also created a set of online utilities - again publicly available - that can be used to programmatically access these databases. Collectively, the utilities go by the name of eutils. They are located at this CGI gateway URL:

http://www.ncbi.nlm.nih.gov/entrez/eutils

Much of the content in these first few posts is based on a course given by the NCBI. Handouts, slides, etc. are available for download from the course web page. The course teaches you how to use the eutils with Perl. Perl has a long tradition of being used by biologists. I've tried to re-code some of the course's Perl scripts in REBOL.

Our first REBOL statement will create a variable that will hold the value of the eutils gateway URL. This statement will also serve to illustrate some of the unique features of REBOL. Make sure you are using the REBOL console. If you find yourself in REBOL/View (a graphical access to REBOL facilities), click the Console icon on the left. At the console prompt “>>”, type the following and press Enter:

eutilities-url: http://www.ncbi.nlm.nih.gov/entrez/eutils/

Explanation:
  • Placing a colon “:” after the variable name (actually REBOL calls these words) tells REBOL to assign the value that follows (the URL) to that variable. There should be NO space between the variable name and the colon.

  • Notice that the variable name contains a hyphen “-”, a character usually not allowed in the variable names of many programming languages. To distinguish between this use of a hyphen and its use as a sign for subtraction, the hyphen must be surrounded by spaces when it means subtraction, e.g., “5 - 2”. Spaces are important in REBOL.

  • In REBOL, URLs (like the one to the NCBI site) are typed “as is”, that is, they do not need to be surrounded by quotes and treated as a character string. In REBOL, a URL is one of the many specialized data types. A URL value should also specify the protocol, in this case http:. Other protocols can be used as well, e.g., mailto:, ftp:, etc. Because REBOL dynamically assigns data types to variables based on their current value, eutilities-url will have the URL data type.
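The points above can be tried at the console with a few lines like these. This is a small sketch; the words a, b and a-b are made up for illustration:

```rebol
eutilities-url: http://www.ncbi.nlm.nih.gov/entrez/eutils/

; Ask REBOL what kind of value the word now refers to.
print type? eutilities-url
print url? eutilities-url

a: 5
b: 2
print a - b     ; hyphen surrounded by spaces: subtraction
a-b: 10         ; no spaces: a single word named a-b
print a-b
```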

We will be using the eutils database search utility named "esearch". For illustration, we are going to look for those species whose genomes encode proteins associated with inulin, a type of plant sugar. Translated to eutils parameter values, this means telling the NCBI server to search the Protein database for any entries that contain the term "inulin". We’ll use another variable, esearch-arguments, to hold these search values:

esearch-arguments: "esearch.fcgi?db=protein&term=inulin"

Explanation:
  • This is an example of a string literal. In REBOL they are enclosed in double quotation marks or for multi-line strings, curly braces "{}".
In the next statement we attach the arguments to the end of the search URL and assign this value to another variable named esearch-url:

esearch-url: join eutilities-url esearch-arguments

Explanation:
  • “join” is a built-in REBOL command that concatenates two values of a REBOL type called a series. A series is similar to, but more inclusive than, a list. The URL and string data types are both series, which is why we can join the two variables. We'll show other types of series in subsequent posts.
Of course, we could have created esearch-url in one statement.
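As a sketch, the one-statement version could look like this, with the URL and the string written directly as the two arguments of join:

```rebol
; Join the gateway URL and the search arguments in a single statement.
esearch-url: join http://www.ncbi.nlm.nih.gov/entrez/eutils/ "esearch.fcgi?db=protein&term=inulin"

print esearch-url
print type? esearch-url     ; the result keeps the type of the first argument
```

Keeping the two parts in separate variables, as we did above, makes it easier to reuse the gateway URL with other utilities later.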

Now we’re ready to perform the search. This is done by sending the NCBI server an HTTP request, containing the search arguments, and getting back a response. Type in the following:

response: load/markup esearch-url

Explanation:
  • The REBOL load command is used to send a request to the given URL and retrieve the response, in our example from the NCBI server. Our use of the load command is modified by what REBOL calls a "refinement". The refinement, "/markup", is appended to the load command. As a result, the load command will expect the response to be formatted with tags (markup), as in HTML, XML, or WSDL.

  • The result is stored in the variable response.
If things went well, you should see the following written to the console:

connecting to: www.ncbi.nlm.nih.gov
== [ <?xml version ...

If there’s a problem, you’ll see an error message, something like this:

connecting to: www.ncbi.nlm.nih.gov
** User Error: Error. Target url: http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch?db=protien&term=inulin[…
** Near: response: load/markup esearch-url

Errors are often the result of a misspelling. In the above request, the word protein is spelled incorrectly.
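Inside a script you may prefer to catch such an error yourself rather than have the script stop. One way to do that, sketched here with the same variables as above, is to wrap the request in try and test the result with error?. The messages are made up, and whether a given mistake actually raises an error depends on how the server responds:

```rebol
eutilities-url: http://www.ncbi.nlm.nih.gov/entrez/eutils/
esearch-arguments: "esearch.fcgi?db=protein&term=inulin"
esearch-url: join eutilities-url esearch-arguments

; try evaluates the block and returns an error value if something goes wrong.
either error? try [response: load/markup esearch-url] [
    print "The request failed; check the spelling of the URL and its arguments."
][
    print "The request succeeded."
]
```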

Let's see what the response was. Type the following in:

print response

You should see an XML document nicely pretty-printed:

<esearchresult>
<count> 125 </count>
<retmax> 20 </retmax>
<retstart> 0 </retstart>
<idlist>
<id> 2507051 </id>
<id> 72132980 </id>
<id> 1110443 </id>
<id> 12060499 </id>
<id> 9963676 </id>
<id> 1906792 </id>
<id> 169196951 </id>
<id> 169175440 </id>
<id> 169175430 </id>
<id> 169175429 </id>
<id> 169090591 </id>
<id> 169016425 </id>
<id> 169016415 </id>
<id> 169016414 </id>
<id> 167362208 </id>
<id> 167070948 </id>
<id> 116668619 </id>
<id> 158318775 </id>
<id> 119714336 </id>
<id> 119534997 </id>
</idlist>
<translationset>
</translationset>
<translationstack>
<termset>
<term> inulin[All Fields] </term>
<field> All Fields </field>
<count> 125 </count>
<explode> Y </explode>
</termset>
<op> GROUP </op>
</translationstack>
<querytranslation> inulin[All Fields] </querytranslation>
</esearchresult>

The result says that there are 125 entries in the Protein database that match the term inulin. The first 20 results are returned as a list of Id's. Each Id uniquely identifies a source within the Protein database. Note that a species may have more than one entry for a protein. This is because the NCBI gathers information from several other biological databases - each entry represents a different source.
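The two numbers mentioned above can also be read from the response in code. The following is a sketch that uses a shortened, hand-written sample instead of the live response; the words sample, total and returned are names made up for this example:

```rebol
; A shortened, hand-written sample of the response (not live data).
sample: {<esearchresult><count> 125 </count><retmax> 20 </retmax></esearchresult>}
response: load/markup sample    ; a block of tags and strings

; select returns the value that follows the first matching tag.
total: to-integer trim copy select response <count>
returned: to-integer trim copy select response <retmax>

print ["Total matches:" total "returned in this response:" returned]
```

Note that the full response contains a second count element further down; select only looks at the first one.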

XML is one of several response value types that the NCBI utilities can provide. In the next post we'll do something with this data.



Here's all the REBOL code that was used to retrieve and print the response seen above:

eutilities-url: http://www.ncbi.nlm.nih.gov/entrez/eutils/
esearch-arguments: "esearch.fcgi?db=protein&term=inulin"
esearch-url: join eutilities-url esearch-arguments
response: load/markup esearch-url
print response

Tuesday, April 8, 2008

Inaugural Post

This blog will be populated with posts about using the REBOL programming language in the field of bioinformatics. Why REBOL? Why bioinformatics? I just happened to be learning both at the same time. I thought that trying to use REBOL to access biological data might be a good way to learn both. In effect, the following posts will chronicle this learning experience. You will probably also witness my making mistakes in both areas. ;-)

Nowadays, there are literally dozens of publicly accessible bioinformatic databases. Many of these databases have excellent web-based interfaces. But there are times when one needs to access this information programmatically. To that end, a number of packages and libraries have been created for a variety of languages. The following websites are good starting points to learn more about BioPerl, BioPython and BioJava.

REBOL is a remarkable scripting language; I will only touch on part of its capabilities. Please visit the REBOL website for pointers to tutorials and other learning resources. In the next post we will begin to learn how to search through on-line bioinformatic data.