[You will need to download and install REBOL (it's freeware and runs on many systems). Follow the pointers under Rebol Resources to the right. Download the REBOL/View version; it comes with a GUI component that we will be using later.]
To start things off, we will use REBOL to access the biological databases at the National Center for Biotechnology Information (NCBI). By the end of the post you'll be searching for species whose genomes encode a particular protein. The NCBI was
“Established in 1988 as a national resource for molecular biology information, NCBI creates public databases, conducts research in computational biology, develops software tools for analyzing genome data, and disseminates biomedical information - all for the better understanding of molecular processes affecting human health and disease.” The NCBI site is well worth visiting; there is a wealth of information available there.
The NCBI hosts several bioinformatics databases, all of which are publicly accessible. The NCBI has also created a set of online utilities, again publicly available, that can be used to access these databases programmatically. Collectively, the utilities go by the name of eutils. They are located at this CGI gateway URL:
http://www.ncbi.nlm.nih.gov/entrez/eutils
Much of the content in these first few posts is based on a course given by the NCBI. Handouts, slides, etc. are available for download from the course web page. The course teaches you how to use the eutils with Perl. Perl has a long tradition of being used by biologists. I've tried to re-code some of the course's Perl scripts in REBOL.
Our first REBOL statement will create a variable that holds the eutils gateway URL. This statement will also serve to illustrate some of the unique features of REBOL. Make sure you are using the REBOL console. If you find yourself in the REBOL/View desktop (a graphical front end to REBOL's facilities), click the Console icon on the left. At the console prompt “>>”, type the following and press Enter:
eutilities-url: http://www.ncbi.nlm.nih.gov/entrez/eutils/
Explanation:
- Placing a colon “:” after the variable name (actually REBOL calls these words) tells REBOL to assign the value that follows (the URL) to that variable. There should be NO space between the variable name and the colon.
- Notice that the variable name contains a hyphen “-”, a character that most programming languages do not allow in variable names. To use the hyphen as a subtraction sign instead, surround it with spaces, e.g., “5 - 2”. Spaces are important in REBOL.
- In REBOL, URLs (like the one to the NCBI site) are typed “as is”; they do not need to be surrounded by quotes and treated as a character string. A URL is one of REBOL's many specialized data types. To be recognized as a URL, the value must specify the protocol, in this case http:. Other protocols can be used as well, e.g., mailto:, ftp:, etc. Because REBOL dynamically assigns data types to variables based on their current value, eutilities-url will have the URL data type.
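To check these points for yourself, here is a small sketch you can type at the console. It assumes REBOL 2; the word gene-count is made up purely for illustration.

```rebol
; A word followed directly by a colon (a set-word) receives the value that follows.
eutilities-url: http://www.ncbi.nlm.nih.gov/entrez/eutils/

; type? reports the data type REBOL inferred from the literal value.
probe type? eutilities-url

; url? answers true or false for the same question.
probe url? eutilities-url

; gene-count is a hypothetical word: the hyphen is part of its name.
gene-count: 5

; With spaces around it, the hyphen is read as subtraction.
print gene-count - 2
```

The last line subtracts 2 from the value of gene-count. Without the spaces, REBOL would instead look for a word named gene-count-2.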
We will be using the eutils database search utility named "esearch". For illustration, we are going to look for species whose genomes encode proteins associated with inulin, a plant polysaccharide (a sugar polymer rather than a protein itself). Translated to eutils parameter values, this means telling the NCBI server to search the Protein database for any entries that contain the term "inulin". We'll use another variable, esearch-arguments, to hold these search values:
esearch-arguments: "esearch.fcgi?db=protein&term=inulin"
Explanation:
- This is an example of a string literal. In REBOL, strings are enclosed in double quotation marks or, for multi-line strings, curly braces "{}". Type straight double quotes; the curly “smart quotes” some editors substitute will not work.
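The two string forms can be compared with a short sketch. The words short-string and long-string and their contents are invented for this example and are not part of the eutils request.

```rebol
; A one-line string in double quotation marks.
short-string: "db=protein"

; A braced string may span several lines and may contain
; double quotation marks without any escaping.
long-string: {term="inulin"
retmax=20}

; Both literals produce values of the same string data type.
probe string? short-string
probe string? long-string
```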
Next, combine the gateway URL and the search arguments into the complete request URL:
esearch-url: join eutilities-url esearch-arguments
Explanation:
- “join” is a built-in REBOL function that concatenates two values. It works here because both values belong to a family of REBOL data types called series. A series is similar to, but more inclusive than, a list. The URL and string data types are both series, and for this reason we can join the two variables. We'll be showing other types of series in subsequent posts.
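A quick way to see the series idea in action is to ask REBOL about each value before and after joining. This is only a sketch, assuming REBOL 2 and the two variables defined earlier in the post.

```rebol
eutilities-url: http://www.ncbi.nlm.nih.gov/entrez/eutils/
esearch-arguments: "esearch.fcgi?db=protein&term=inulin"

; Both values answer true to series?, which is why join accepts them.
probe series? eutilities-url
probe series? esearch-arguments

; join copies its first argument and appends the second to the copy,
; so the result takes the data type of the first argument.
esearch-url: join eutilities-url esearch-arguments
probe type? esearch-url

; The original URL is left unchanged because join works on a copy.
print eutilities-url
print esearch-url
```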
Now we’re ready to perform the search. This is done by sending the NCBI server an HTTP request, containing the search arguments, and getting back a response. Type in the following:
response: load/markup esearch-url
Explanation:
- The REBOL load command is used to send a request to the given URL and retrieve the response, in our example from the NCBI server. Our use of the load command is modified by what REBOL calls a "refinement". The refinement, "/markup", is appended to the load command. As a result, the load command expects the response to be formatted with tags in a markup language such as HTML, XML, or WSDL, and returns it as a block of tags and strings.
- The result is stored in the variable response.
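To see what load/markup produces without going over the network, you can hand it a string instead of a URL. The fragment below is hand-written for illustration (it only imitates the shape of an esearch reply), and the sketch assumes REBOL 2.

```rebol
; A hand-written fragment, not a real NCBI response.
sample: {<esearchresult><count>125</count><retmax>20</retmax></esearchresult>}

; load/markup splits the text into a block of tag and string values.
parsed: load/markup sample

; Walk the block and label each item by its data type.
foreach item parsed [
    print [either tag? item ["tag:   "] ["string:"] mold item]
]
```

Each tag becomes a value of the tag data type, and the text between tags becomes an ordinary string. The real response has the same structure, only longer.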
If the request succeeds, the console shows something like this:
connecting to: www.ncbi.nlm.nih.gov
== [<?xml version ...
If there’s a problem, you’ll see an error message, something like this:
connecting to: www.ncbi.nlm.nih.gov
** User Error: Error. Target url: http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch?db=protien&term=inulin[…
** Near: response: load/markup esearch-url
Errors are often the result of a misspelling. In the above request, the word protein is spelled incorrectly.
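If you would rather have your code notice a failed request than have the console stop with an error, one possible approach is to wrap the call in try. This is a sketch that assumes REBOL 2 (disarm is not available in REBOL 3); the words result and err are names chosen for this example.

```rebol
eutilities-url: http://www.ncbi.nlm.nih.gov/entrez/eutils/
esearch-arguments: "esearch.fcgi?db=protein&term=inulin"
esearch-url: join eutilities-url esearch-arguments

; try evaluates the block and hands back an error value
; instead of halting if something goes wrong.
either error? result: try [load/markup esearch-url] [
    ; disarm turns the error into an object whose fields can be read.
    err: disarm result
    print ["Request failed:" err/id]
][
    response: result
    print ["Received a block of" length? response "items"]
]
```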
Let's see what the response was. Type the following in:
print response
You should see an XML document nicely pretty-printed:
<esearchresult>
<count> 125 </count>
<retmax> 20 </retmax>
<retstart> 0 </retstart>
<idlist>
<id> 2507051 </id>
<id> 72132980 </id>
<id> 1110443 </id>
<id> 12060499 </id>
<id> 9963676 </id>
<id> 1906792 </id>
<id> 169196951 </id>
<id> 169175440 </id>
<id> 169175430 </id>
<id> 169175429 </id>
<id> 169090591 </id>
<id> 169016425 </id>
<id> 169016415 </id>
<id> 169016414 </id>
<id> 167362208 </id>
<id> 167070948 </id>
<id> 116668619 </id>
<id> 158318775 </id>
<id> 119714336 </id>
<id> 119534997 </id>
</idlist>
<translationset>
</translationset>
<translationstack>
<termset>
<term> inulin[All Fields] </term>
<field> All Fields </field>
<count> 125 </count>
<explode> Y </explode>
</termset>
<op> GROUP </op>
</translationstack>
<querytranslation> inulin[All Fields] </querytranslation>
</esearchresult>
The result says that there are 125 entries in the Protein database that match the term inulin (the count element). The first 20 of them (retmax is 20, starting at retstart 0) are returned as a list of IDs. Each ID uniquely identifies a source within the Protein database. Note that a species may have more than one entry for a protein. This is because the NCBI gathers information from several other biological databases, and each entry represents a different source.
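The retstart and retmax values in the response can also be sent as esearch parameters to ask for a different slice of the results. As a rough sketch, the function below builds the argument string for one page; make-esearch-arguments and its parameter names are hypothetical, not part of REBOL or eutils.

```rebol
; Hypothetical helper: build the esearch argument string for one page of results.
; db and term are strings; start and size are integers.
make-esearch-arguments: func [db term start size] [
    ; rejoin evaluates the block and concatenates the values into one string.
    rejoin ["esearch.fcgi?db=" db "&term=" term "&retstart=" start "&retmax=" size]
]

; Arguments for the second batch of 20 IDs (entries 21 to 40).
print make-esearch-arguments "protein" "inulin" 20 20
```

The resulting string can be joined to eutilities-url in the same way as esearch-arguments.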
XML is one of several response value types that the NCBI utilities can provide. In the next post we'll do something with this data.
Here's all the REBOL code that was used to retrieve and print the response seen above:
eutilities-url: http://www.ncbi.nlm.nih.gov/entrez/eutils/
esearch-arguments: "esearch.fcgi?db=protein&term=inulin"
esearch-url: join eutilities-url esearch-arguments
response: load/markup esearch-url
print response

4 comments:
As a happy REBOL developer, it is nice to see it applied to such a field and then documented. Kudos Peter.
Cheers,
Brian Tiffin
As refugee from Maine ...
http://www.cs.uoregon.edu/~tomc/efetch.r
Well, Tom! It looks like you've done the hard work already!
Thanks very much for sending me that pointer. With your permission, I could put that link in my next post. Of course, the link is already part of the blog.
Now what should I do? Continue on my stumbling journey (which is the intent of the blog) or follow your great code (again, with your permission)?
Peter
continue by all means! yours is by far the more important, as prior to last night I was the only one who had ever looked at mine. mine also has only been used for a fairly narrow task ... fetch these nt sequences so I can blast them. and could benefit from broadening, heck I know I haven't even tested the majority of the parameters.
please feel free to email me. I am a terrible writer and would never blog but am delighted that there is someone out there that can and I will help in any way I can.