Thursday, June 5, 2008

4. Functions and a Graphical User Interface

[This link and this link are REBOL tutorials written by Nick Antonaccio. They are among the best that I have come across. I consult them constantly as I learn REBOL and I highly recommend them.]

So far we have created two scripts to perform the two tasks of searching and then fetching protein information from the NCBI Protein database. We need to tie them together so that the results of the search can be used to fetch data. Rather than continue to develop the two scripts we will turn each into a function and place these two functions into a single script file. Function definition and use is an important structural construct in REBOL as it is in most languages.

REBOL functions look and operate very much like functions in other languages: they accept arguments, compute, and return a value. However, like functions in LISP, Scheme, and similar languages, REBOL functions are "first class citizens". This means that functions can be treated like other data types. REBOL functions and data are both constructed out of blocks. As a result, functions can be manipulated like any other data structure.

Here's how we create a function that doubles a number:

double: func [ "This function doubles its argument" n ] [ 2 * n ]

Explanation:
  1. A function is created by using the word func, followed by a block of arguments (if any), followed by a block containing the code to be executed in the function.
  2. The string in the first position of the first block is an optional comment.
  3. The function is then assigned to the variable double, which becomes its name.
  4. The data type of the variable double is function!. That is, if you enter type? :double at the REBOL console, it will return function!. (The leading colon fetches the function itself; a bare type? double would try to call double without its argument.)
Go ahead and enter the code above. Then type double 21. Finally, enter ? double. The ? is used to get help on anything in the REBOL environment, including your own functions.
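
To make the "first class citizen" idea concrete, here is a small sketch to try at the console. The name apply-twice is my own, made up for illustration, and is not part of REBOL. The last line assumes REBOL 2, where second applied to a function returns its body block.

```rebol
double: func ["This function doubles its argument" n] [2 * n]

print double 21                ; prints 42

; A leading colon fetches the function itself instead of calling it
print type? :double            ; prints function

; apply-twice is a hypothetical helper that receives a function as an argument
apply-twice: func [f [function!] x] [f f x]
print apply-twice :double 5    ; prints 20

; Assumption: REBOL 2, where the body of a function can be read as a block
probe second :double
```

Passing :double around like this is exactly what "functions can be treated like other data types" means in practice.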

The script file that we will create will contain two function definitions and some REBOL code to execute immediately upon "doing" it. In addition, to make the functions more generally useful, we'll remove any user interface interactions; these will become part of the executable code.

Here's what the esearch script becomes after we turn it into a function:
;; The esearch function
esearch: func [
"Search for references to a protein name and return a list of ids or none"
protein-name
] [
; assemble the cgi arguments
esearch-arguments: join "esearch.fcgi?db=protein&term=" protein-name
esearch-url: join eutilities-url esearch-arguments
; load as a tagged (XML, HTML, etc.) document
response: load/markup esearch-url
; Parse the response
ids: []
clear ids
parse response [
thru <idlist>
some [
thru <id> copy idvalue to </id>
(append ids idvalue)]
to </idlist>
to end ]
; Return the id list
ids]
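
One thing to be aware of: words that are set inside a func body are global unless they are declared local. Here is a hedged sketch of a variant that keeps its working words to itself by using /local. The name esearch-local and the local word names are my own choices for illustration; like the original, it assumes that eutilities-url has already been set.

```rebol
; Hypothetical variant of esearch; not a replacement for the version above
esearch-local: func [
    "Search for references to a protein name and return a block of ids"
    protein-name [string!]
    /local arguments url response ids idvalue
] [
    ; assemble the cgi arguments
    arguments: join "esearch.fcgi?db=protein&term=" protein-name
    url: join eutilities-url arguments
    ; load as a tagged (XML, HTML, etc.) document
    response: load/markup url
    ; start with a fresh block on every call
    ids: copy []
    parse response [
        thru <idlist>
        some [thru <id> copy idvalue to </id> (append ids idvalue)]
        to </idlist>
        to end
    ]
    ; return the id block (empty if nothing was found)
    ids
]
```

Using copy [] gives each call its own block, so a result that was returned earlier is not cleared by a later search.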


And, here's what the efetch script becomes:

;; The efetch function definition
efetch: func [
"Fetch the NCBI data associated with this reference Id according to the return type"
id [string!] "A valid NCBI reference Id"
return-type [string!] "For example, fasta"
] [
; Assemble the CGI parameters - note that we use the function argument
; named return-type as the value of the NCBI fetch parameter named rettype
esearch-arguments:
rejoin ["efetch.fcgi?db=protein&retmode=text&rettype="
return-type
"&id="
id]
esearch-url: join eutilities-url esearch-arguments
response: read/lines esearch-url
response]


Finally, the code that will be executed when we "do" the script:

; Begin executable script code
; URL shared by the esearch and efetch functions
eutilities-url: http://www.ncbi.nlm.nih.gov/entrez/eutils/
;; Ask the user for a protein name
protein-name: request-text/title "Enter a protein name"
; search for references to this protein
ids: esearch protein-name
; let the user choose one of the references
id: request-list "Choose an ID" ids
if id = none [print "Halting the script" halt]
; use the id to fetch the reference - The return value is the fasta
; file represented as a block of strings - one per line
fasta: efetch id "fasta"
print fasta

You can create a script file from the two functions and the executable code, or download ncbi.r. (You might have to remove a ".txt" extension, although REBOL doesn't care: do %ncbi.r.txt will still work.)

To close this post, I'll introduce you to REBOL's GUI facilities - they are remarkably powerful and easy to use. By way of introduction, we will replace the executable code in the ncbi.r script (after the "URL shared by the ..." comment) with a window layout - a declarative description of what the GUI looks like and how it will behave. If you want a fuller explanation of the GUI facilities, again, please take a look at Nick Antonaccio's tutorials. You might even want to go there first.

Here's the GUI replacement code and an explanation:

; URL shared by the esearch and efetch functions
eutilities-url: http://www.ncbi.nlm.nih.gov/entrez/eutils/

; The layout (GUI) for this NCBI dialog

ncbi-layout: layout [
across
text "Protein Name: "
protein-name: field
button "Search" [the-list/data: [] show the-list
the-list/data: esearch protein-name/text
the-fasta/text: " " show the-fasta
show the-list]
return
text "Choose an Id: "
the-list: text-list [id-selection: value]
button "Fetch" [the-fasta/text: efetch id-selection "fasta"
show the-fasta]
return
text "The Data: "
the-fasta: tt 400x200
return
button "Exit" [unview]
]


; Begin executable script code
; Display the dialog

view ncbi-layout


Explanation:
  1. As mentioned earlier, the form and function of a GUI layout is described first. Later, the layout will be displayed (viewed). The variable ncbi-layout is set to the value of the description.
  2. The layout is written using what is called a "dialect" in REBOL. A dialect is an extension to the REBOL language. It is usually designed for a special purpose: in our case, for describing a GUI layout. It has its own set of reserved words that have special meaning in the context of the dialect. The GUI dialect is announced with the word layout followed by a block containing the description. Note that a "hot" topic nowadays is DSLs (Domain-Specific Languages). REBOL dialects are DSLs, and you can create your own dialect as needed.
  3. One way to understand the layout is to read it from top to bottom. While reading, special words will be encountered. For example, the first word, across, says that all the following UI elements should be positioned from left to right, across the window. This is followed by a text command that says to place the following text in the window. Next, a variable protein-name is declared and set to a field, which creates an input field in the window. A little further down is the word return. This is not a return from the layout; it is a command to return to the left of the window and begin placing the UI components there.
  4. Buttons are used to start some action. Pushing the "Search" button causes the block that follows to be executed. The action, in this case, is for the layout to use the value typed into the protein-name field as an argument to the esearch function. Recall that esearch returns a list of ids. This list is assigned to the-list/data, the data displayed by the-list. If you look down a few lines, you'll see that the-list is a UI text list, that is, a displayable list that starts out empty. When its data is set to the result of esearch and the list is redrawn with show, the ids appear in the window as a scrollable and selectable list.
  5. To actually display the ui layout, the command view is used. This will begin the dialog and not return until the dialog is exited.
This code is also available for download.
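
If you want to experiment with the layout dialect without touching the network, here is a minimal sketch built from the same pieces: across, text, field, button, return, show and view. The names demo-layout, name-field and greeting are mine, made up for this example, and it needs REBOL/View.

```rebol
REBOL [Title: "Minimal layout sketch"]

demo-layout: layout [
    across
    text "Name: "
    name-field: field
    button "Greet" [
        ; copy the typed text into the greeting face and redraw it
        greeting/text: join "Hello, " name-field/text
        show greeting
    ]
    return
    ; an empty text face with a fixed size, filled in by the button above
    greeting: text 200x24 ""
    return
    button "Exit" [unview]
]

view demo-layout
```

As in the NCBI dialog, the button's block refers to a face (greeting) that is declared further down. That works because the block is only executed when the button is pushed, after the whole layout has been built.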

In the next post, I want to get back to developing more bioinformatic-related code.

Thursday, May 22, 2008

3. Adding a User Interface

The two REBOL scripts that we have built require the user to change the script in order to search for a different protein. This is not too user-friendly. What we want to do first is to prompt the user for the protein name. REBOL has a function named ask that will do this. Type in:

protein-name: ask "Please enter a protein name: "
print protein-name

REBOL also has another function that prompts a user to enter text. It is named request-text. Try this:

protein-name: request-text/title "Please enter a protein name:"
print protein-name

The obvious difference between the two functions is that request-text uses a Graphical User Interface (GUI). We'll continue using a GUI for interactions with a user.

Let's incorporate request-text into the esearch-01.r script:

eutilities-url: http://www.ncbi.nlm.nih.gov/entrez/eutils/
protein-name: request-text/title "Please enter a protein name"
esearch-arguments: join "esearch.fcgi?db=protein&term=" protein-name
esearch-url: join eutilities-url esearch-arguments
response: load/markup esearch-url
print response

Explanation:
  • The second line is new. It prompts the user for a protein name and stores it in the protein-name variable.

  • In the third line, the hardwired protein name is replaced by the protein-name variable.
Run the script (do %esearch-01.r) and enter the protein name "inulin" at the prompt. The result should be the same XML document as before. The document should contain a list of Id's.

We also have to modify the second script, efetch-01.r, to include the Id of the information we were interested in. Instead of having to modify the script, a better user interface would be to create a list of the Id's and have the user pick one of the Id's. As it turns out REBOL has a request-list function. Try this:

request-list "Choose a number, any number: " [2 4 "six" 8 10]

Explanation:
  • request-list takes two arguments: a string that becomes a prompt and a block. Blocks are similar to lists, a type that is often available in other programming languages. Blocks are surrounded by square brackets "[ ]" when displayed or when typed in as literals. Like strings, blocks are a series type.

  • Notice the "six" in the list. The elements in a list need not be all of the same type. That is, you can mix numbers and character strings, for example.
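
Since blocks are a series type, the usual series functions work on them. Here is a short console sketch using the same block; the variable name choices is mine.

```rebol
choices: [2 4 "six" 8 10]

print length? choices          ; prints 5
print first choices            ; prints 2
print pick choices 3           ; prints six
print type? pick choices 3     ; prints string

; blocks can grow, and the elements keep their own types
append choices 12
foreach item choices [print [item type? item]]
```

The foreach loop prints each element next to its type, which makes the mix of integers and a string easy to see.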
Before we can use the request-list function in our script, we will have to convert the Id's that the esearch-01.r script returns into a REBOL block, the expected type of the second argument. Recall that this list of Id's is buried inside the XML document response.

At this point I was going to illustrate how to use REBOL's extensive list of search functions to locate and extract the entire list of Id's. The code would have been straightforward, similar to solutions in other languages, and it would have introduced you to the series type search functions. But I decided that I wanted to show you a unique built-in facility in REBOL called parsing. While perhaps a conceptually more difficult way to extract our list of Id's, I think that once you understand what's going on you'll appreciate how powerful parsing can be.

For a very good tutorial on using the parsing facility, take a look at the website of Nick Antonaccio. He has written about many other areas of REBOL as well.

Parsing, for those who are not familiar with the term, is essentially scanning a character string trying to identify textual units that obey the rules specified in a grammar. There are many tools that will take a set of grammatical rules, say for a language, and generate a program that can parse programs written in that language.

Add the following REBOL code to the esearch-01.r script file. The additional code begins at the line ids: [] and runs through the end of the parse block.

REBOL [
Title: "Search the NCBI Protein Database"
Date: 24-Apr-2008
File: %esearch-01.r
Author: "Peter C Marks"
Version: 1
]
eutilities-url: http://www.ncbi.nlm.nih.gov/entrez/eutils/
protein-name: request-text/title "Please enter a protein name"
esearch-arguments: join "esearch.fcgi?db=protein&term=" protein-name
esearch-url: join eutilities-url esearch-arguments
response: load/markup esearch-url

ids: []
parse response [
thru <idlist>
some [
thru <id> copy idvalue to </id> (append ids idvalue)]
to </idlist>
to end ]

Explanation:
  • The variable ids is initialized to an empty block "[]"

  • parse is a built-in function that can parse a series (string, block, etc.), in our example the block response, according to a set of grammatical rules.

  • Here is an English translation of the parse lines in the script: Search the variable response for and thru the token/tag <IdList>. Next there will be some number of grammatical elements. Each element is identified by first scanning for and thru the <Id> tag. After doing this, copy the following characters into the variable idvalue up to an </Id> tag. After the copy, append the value in idvalue to the ids list. When there are no more Id's remaining (some has matched all that it can), continue scanning to the tag </IdList> and finally scan from there until the end of the response value is reached.
Save the script file. Execute this script from the REBOL command line. If there were no errors, you should see the value "true" printed. Print out the value of the ids variable by typing this at the command line prompt:

print ids

The value of ids should look something like this:

["2507051" "72132980" "1110443" "12060499" "9963676" "1906792" "116668619" "119714
336" "169196951" "169175440" "169175430" "1691...

Parsing has found and extracted the list of Id's and assigned them to the ids variable. We can use the ids variable as an argument to the request-list function.
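
If you would like to experiment with the parse rules without going out to the NCBI each time, you can run them against a small hand-made sample. This is only a sketch: the sample string and the word sample are made up by me, and the rules are the same as in the script.

```rebol
; load/markup also accepts a string, so no network access is needed
sample: load/markup {<IdList><Id>2507051</Id><Id>72132980</Id></IdList>}

ids: copy []
parse sample [
    thru <IdList>
    some [thru <Id> copy idvalue to </Id> (append ids idvalue)]
    to </IdList>
    to end
]

; should display the two Id strings from the sample
probe ids
```

Try adding a third Id to the sample, or removing the closing tag of the list, and watch how the result of parse and the contents of ids change.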

Incidentally, this style of parsing is very similar to what are called parsing expression grammars (PEGs). Basically, a PEG is a program that is directly based on the grammar (syntactic rules) of a language, for example. That is, the grammar becomes a parsing program.

Enter the following line:

request-list "Choose an Id" ids

A popup dialog should appear, listing the Id's.

Click on one of the Id's. The value you clicked will be printed in the REBOL console. What this means is that the value of the function request-list is the value of the selection or, if the Cancel button is pushed, the value none.
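
Because request-list can return none, a script should check the result before using it. Here is a minimal sketch with a hard-coded block of Id's (taken from the list above) so that it can be tried on its own; it needs REBOL/View.

```rebol
id: request-list "Choose an Id" ["2507051" "72132980" "1110443"]

either none? id [
    print "Nothing was selected"
] [
    print ["You chose" id]
]
```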

We now need to take this selected value and use it as an argument to the NCBI efetch utility. I'll cover that in the next post.

Monday, April 28, 2008

2. Making scripts and fetching data

Before doing something with the list of Id's from the previous post and to make things a lot easier, let's turn the code that we have developed so far into a REBOL script file. Open your favorite plain text editor and enter (or copy and paste) the following:

REBOL [
Title: "Search the NCBI Protein Database"
Date: 24-Apr-2008
File: %esearch-01.r
Author: "Peter C Marks"
Version: 1
]

Explanation:
  • This is called the "header" of a REBOL script file. It must be there. This example shows some of the information fields that can be included in the header. Please consult the REBOL documentation for more information. Minimally, you can get away with just this "REBOL []" but, of course, it's better to have some documentation.
Here's the code we developed in the last post:

eutilities-url: http://www.ncbi.nlm.nih.gov/entrez/eutils/
esearch-arguments: "esearch.fcgi?db=protein&term=inulin"
esearch-url: join eutilities-url esearch-arguments
response: load/markup esearch-url
print response

Copy this and append it to the header text, after the "]". This is the actual script that will be executed. Save this as a file with the name esearch-01.r. Remember the directory/folder where you stored this file; you'll need the path later on.

Start the REBOL command window if you haven't already. We are going to execute this script file. There are two ways of doing this:
  • by positioning ourselves to the directory where the script is located or
  • by specifying the location of the script file.
Here's how to do it the first way, if you're running Windows:

change-dir %/c/languages/rebol

or, if you're running some version of Linux, Mac OS X, or Unix:

change-dir %/home/pcmarks/languages/rebol

Explanation:
  • The current directory is changed to the specified directory. Notice that the argument starts with a %. This indicates that a file/directory name follows.
To execute the script, simply type the following:

do %esearch-01.r

Explanation:
  • The do command will attempt to execute the REBOL code in the file. You can also use a URL and other values as arguments.
Alternatively, you could have typed the complete path to the script file:

do %/home/pcmarks/languages/rebol/esearch-01.r
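
If do complains that it cannot find the file, it helps to check where REBOL thinks you are. A small sketch, assuming the script was saved as esearch-01.r:

```rebol
; show the current directory
print what-dir

; only run the script if it is really there
either exists? %esearch-01.r [
    do %esearch-01.r
] [
    print "esearch-01.r is not in this directory"
]
```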

By the way, at the DOS or shell command prompt (not the REBOL command prompt), you can type the following:

rebol esearch-01.r

and the script will be executed - assuming that your system can find the rebol executable.

I usually keep a text editor open, make changes, save and execute the script at the REBOL prompt. Also, as with most command lines, you can press the up-arrow key to recall previous commands from a history of commands.

In the last post, part of the result from the search was a list of NCBI Id's that were relevant to our search for the protein inulin. As a next step, we'd like to select an Id from that list and see what type of information it points to. Here's what the list portion of the response looked like:

...
<IdList>
<Id> 2507051 </Id>
<Id> 72132980 </Id>
<Id> 1110443 </Id>
<Id> 12060499 </Id>
<Id> 9963676 </Id>
<Id> 1906792 </Id>
<Id> 119714336 </Id>
<Id> 169196951 </Id>
<Id> 169175440 </Id>
<Id> 169175430 </Id>
<Id> 169175429 </Id>
<Id> 169090591 </Id>
<Id> 169016425 </Id>
<Id> 169016415 </Id>
<Id> 169016414 </Id>
<Id> 167362208 </Id>
<Id> 167070948 </Id>
<Id> 116668619 </Id>
<Id> 158318775 </Id>
<Id> 119534997 </Id>
</IdList>
...

The NCBI provides another CGI utility called efetch. Given an Id value it will return information about this resource. To use efetch we post a request the same way we did for esearch. We'll try it with the first Id in the list. Create a new text file in your editor, copy the header from the last script, change the values as necessary, and finally enter the following code and save it as efetch-01.r:

eutilities-url: http://www.ncbi.nlm.nih.gov/entrez/eutils/
esearch-arguments: "efetch.fcgi?db=protein&id=2507051"
esearch-url: join eutilities-url esearch-arguments
response: load/markup esearch-url
print response

Explanation:
  • The difference between this script and our first is the second line: Instead of calling the esearch utility at the NCBI, we are calling the efetch utility. We need to tell it from what database to fetch/get information and the Id.
Now execute the script:

do %efetch-01.r

There will be a fairly long response. The beginning of the response should look like this:

Seq-entry ::= seq {
id {
swissprot {
name "INU2_ARTGO" ,
accession "P19870" ,
release "reviewed" ,
version 3 } ,
gi 2507051 } ,
descr {
title "Inulin fructotransferase [DFA-I-forming] (Inulin fructotransferase
[depolymerizing, difructofuranose-1,2':2',1-dianhydride-forming])." ,
sp {
class standard ,
seqref {

gi 1110442 ,
gi 1110443 ,
gi 2127394 } ,
...

Briefly, the response says that this data is a sequence entry from the Swiss-Prot database, another large database available over the web. Notice that this response is not an XML document. Instead, it is formatted using an ISO standard called ASN.1. We won't worry about this right now. What is important is that we were able to take an Id value from the list in our original response and give it to another NCBI utility, efetch, and have it return information about the protein inulin. (Notice the title of this sequence entry in the response.)

Friday, April 11, 2008

1. Accessing Bioinformatic Data


[You will need to download and install REBOL (it's freeware and runs on many systems). Follow the pointers under Rebol Resources to the right. Download the REBOL/View version; it comes with a GUI component that we will be using later.]

To start things off, we will use REBOL to access the biological databases at the National Center for Biotechnology Information (NCBI). By the end of the post you'll be searching for species that encode the genomic sequences for a particular protein. The NCBI was
“Established in 1988 as a national resource for molecular biology information, NCBI creates public databases, conducts research in computational biology, develops software tools for analyzing genome data, and disseminates biomedical information - all for the better understanding of molecular processes affecting human health and disease.”
The NCBI site is well worth visiting - there is a wealth of information available there.

The NCBI hosts several bioinformatic databases all of which can be publicly accessed. The NCBI has also created a set of online utilities - again publicly available - that can be used to programmatically access these databases. Collectively, the utilities go by the name of eutils. They are located at this CGI gateway URL:

http://www.ncbi.nlm.nih.gov/entrez/eutils

Much of the content in these first few posts is based on a course given by the NCBI. Handouts, slides, etc. are available for download from the course web page. The course teaches you how to use the eutils with Perl. Perl has a long tradition of being used by biologists. I've tried to re-code some of the course's Perl scripts in REBOL.

Our first REBOL statement will create a variable that will hold the value of the eutils gateway URL. This statement will also serve to illustrate some of the unique features of REBOL. Make sure you are using the REBOL console. If you find yourself in REBOL/View (a graphical access to REBOL facilities), click the Console icon on the left. At the console prompt “>>”, type the following and press Enter:

eutilities-url: http://www.ncbi.nlm.nih.gov/entrez/eutils/

Explanation:
  • Placing a colon “:” after the variable name (actually REBOL calls these words) tells REBOL to assign the value that follows (the URL) to that variable. There should be NO space between the variable name and the colon.

  • Notice that the variable name contains a hyphen “-”, a character usually not allowed in the variable names of many programming languages. To distinguish between this use of a hyphen and its use as a sign for subtraction, a hyphen used for subtraction must be surrounded by spaces, e.g., “5 - 2”. Spaces are important in REBOL.

  • In REBOL, URLs (like the one to the NCBI site) are typed “as is”; that is, they do not need to be surrounded by quotes and treated as a character string. In REBOL, a URL is one of the many specialized data types. To be recognized as a URL, the value must also specify the protocol, in this case http:. Other protocols can be used as well, e.g., mailto:, ftp:, etc. Because REBOL dynamically assigns data types to variables based on their current value, eutilities-url will have the URL data type.
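
You can ask REBOL what data type it has assigned to a value with type?. The following console sketch shows a few of the specialized types; the e-mail address and the word five-minus-two are made up for the example.

```rebol
eutilities-url: http://www.ncbi.nlm.nih.gov/entrez/eutils/

print type? eutilities-url         ; prints url
print type? "inulin"               ; prints string
print type? %esearch-01.r          ; prints file
print type? someone@example.com    ; prints email
print type? 24-Apr-2008            ; prints date

; a hyphen inside a word is part of the name; with spaces it subtracts
five-minus-two: 5 - 2
print five-minus-two               ; prints 3
```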

We will be using the eutils database search utility named "esearch". For illustration, we are going to look for those species that encode, in their genome, for the protein inulin, a type of plant sugar. Translated to eutils parameter values this means telling the NCBI server to search the Protein database looking for any entries that contain the term "inulin". We'll use another variable, esearch-arguments, to hold these search values:

esearch-arguments: "esearch.fcgi?db=protein&term=inulin"

Explanation:
  • This is an example of a string literal. In REBOL they are enclosed in double quotation marks or for multi-line strings, curly braces "{}".
In the next statement we attach the arguments to the end of the search url and assign this value to another variable named esearch-url:

esearch-url: join eutilities-url esearch-arguments

Explanation:
  • “join” is a built-in REBOL function that concatenates two values. Here both values are of a REBOL type called a series. A series is similar to but more inclusive than a list. The URL and string data types are both series, and for this reason we can join the two variables. We'll be showing other types of series in subsequent posts.
Of course, we could have created esearch-url in one statement.
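
For example, the one-statement version might look like this. The result is still a URL, because join returns a value of the same type as its first argument:

```rebol
esearch-url: join http://www.ncbi.nlm.nih.gov/entrez/eutils/ "esearch.fcgi?db=protein&term=inulin"

print esearch-url
print type? esearch-url    ; prints url
```

Keeping the gateway URL in its own variable is still handy, since the other eutils utilities share it.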

Now we’re ready to perform the search. This is done by sending the NCBI server an HTTP request, containing the search arguments, and getting back a response. Type in the following:

response: load/markup esearch-url

Explanation:
  • The REBOL load command is used to send a request to the given URL and retrieve the response - in our example, from the NCBI server. Our use of the load command is modified by what is called in REBOL, a "refinement". The refinement, "/markup", is appended to the load command. As a result the load command will expect the response to be formatted with tags (markup), using the markup languages HTML, XML, WSDL, for example.

  • The result is stored in the variable response.
If things went well, you should see the following written to the console:

connecting to: www.ncbi.nlm.nih.gov
== [ <?xml version ...
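
The value that load/markup returns is a block that mixes tags and strings. You can see this without going to the network, because load/markup also accepts a string. In this sketch the sample text and the word doc are my own:

```rebol
doc: load/markup {<count>125</count><retmax>20</retmax>}

; print the type of each item followed by the item itself
foreach item doc [print [type? item mold item]]
```

Each line of output shows whether the item is a tag or a string, followed by its value.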

If there’s a problem, you’ll see an error message, something like this:

connecting to: www.ncbi.nlm.nih.gov
** User Error: Error. Target url: http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch?db=protien&term=inulin[…
** Near: response: load/markup esearch-url

Errors are often the result of a misspelling. In the above request, the word protein is spelled incorrectly.
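
Inside a script you may prefer to catch such an error rather than have the script stop. One common REBOL idiom is error? together with try. This is only a sketch: the word result and the messages are mine, and how the NCBI server reacts to a bad request may differ from what is shown above.

```rebol
eutilities-url: http://www.ncbi.nlm.nih.gov/entrez/eutils/
esearch-url: join eutilities-url "esearch.fcgi?db=protein&term=inulin"

; try evaluates the block and hands back an error value instead of halting
either error? result: try [load/markup esearch-url] [
    print "The request failed. Check the URL for typos and your network connection."
] [
    response: result
    print ["Received" length? response "tags and strings"]
]
```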

Let's see what the response was. Type the following in:

print response

You should see an XML document nicely pretty-printed:

<esearchresult>
<count> 125 </count>
<retmax> 20 </retmax>
<retstart> 0 </retstart>
<idlist>
<id> 2507051 </id>
<id> 72132980 </id>
<id> 1110443 </id>
<id> 12060499 </id>
<id> 9963676 </id>
<id> 1906792 </id>
<id> 169196951 </id>
<id> 169175440 </id>
<id> 169175430 </id>
<id> 169175429 </id>
<id> 169090591 </id>
<id> 169016425 </id>
<id> 169016415 </id>
<id> 169016414 </id>
<id> 167362208 </id>
<id> 167070948 </id>
<id> 116668619 </id>
<id> 158318775 </id>
<id> 119714336 </id>
<id> 119534997 </id>
</idlist>
<translationset>
</translationset>
<translationstack>
<termset>
<term> inulin[All Fields] </term>
<field> All Fields </field>
<count> 125 </count>
<explode> Y </explode>
</termset>
<op> GROUP </op>
</translationstack>
<querytranslation> inulin[All Fields] </querytranslation>
</esearchresult>

The result says that there are 125 different entries in the Protein database for the protein inulin. The first 20 results are returned as a list of Id's. Each Id uniquely identifies a source within the Protein database. Note that a species may have more than one entry for a protein. This is because the NCBI gathers information from several other biological databases - each entry represents a different source.

XML is one of several response value types that the NCBI utilities can provide. In the next post we'll do something with this data.



Here's all the REBOL code that was used to retrieve and print the response seen above:

eutilities-url: http://www.ncbi.nlm.nih.gov/entrez/eutils/
esearch-arguments: "esearch.fcgi?db=protein&term=inulin"
esearch-url: join eutilities-url esearch-arguments
response: load/markup esearch-url
print response

Tuesday, April 8, 2008

Inaugural Post

This blog will be populated with posts about using the REBOL programming language in the field of bioinformatics. Why REBOL? Why bioinformatics? I just happened to be learning both at the same time. I thought that trying to use REBOL to access biological data might be a good way to learn both. In effect, the following posts will chronicle this learning experience. You will probably also witness my making mistakes in both areas. ;-)

Nowadays, there are literally dozens of publicly accessible bioinformatic databases. Many of these databases have excellent web-based interfaces. But there are times when one needs to access this information programmatically. To that end, a number of packages and libraries have been created for a variety of languages. The following websites are good starting points to learn more about BioPerl, BioPython and BioJava.

REBOL is a remarkable scripting language; I will only touch on part of its capabilities. Please visit the REBOL website for pointers to tutorials and other learning resources. In the next post we will begin to learn how to search through on-line bioinformatic data.