BioRebol (feed updated 2008-07-17)<br>Exploring the use of the REBOL scripting language<br>
for accessing and manipulating Bioinformatic information.<br />Peter C Marks<br /><br /><hr align="center" size="2"/><br /><strong>4. Functions and a Graphical User Interface</strong><br />Posted 2008-06-05, updated 2008-07-17<br /><br />[This <a href="http://musiclessonz.com/rebol_tutorial.html">link </a>and this <a href="http://musiclessonz.com/rebol.html">link </a>are REBOL tutorials written by Nick Antonaccio. They are among the best that I have come across. I consult them constantly as I learn REBOL and I highly recommend them.]<br /><br />So far we have created two scripts to perform the two tasks of searching and then fetching protein information from the NCBI Protein database. We need to tie them together so that the results of the search can be used to fetch data. Rather than continue to develop the two scripts we will turn each into a function and place these two functions into a single script file. Function definition and use is an important structural construct in REBOL as it is in most languages.<br /><br />REBOL functions look and operate very much like functions in other languages. They accept arguments, compute, and return a value. However, REBOL functions share a characteristic with languages like LISP, Scheme, and others in that they are considered "first class citizens". This means that functions can be treated like other data types. REBOL functions and data are both constructed out of blocks. 
As a result, functions can be manipulated like any other data structure.<br /><br />Here's how we create a function that doubles a number:<br /><div class="rebol-code" style="white-space: pre;">double: func [ "This function doubles its argument" n ] [ 2 * n ]</div><br />Explanation:<br /><ol><li>A function is created by using the word <span style="font-family:courier new;">func</span>, followed by a block of arguments (if any), followed by the code to be executed in the function.</li><li>The string in the first position of the first block is an optional comment.<br /></li><li>The function is then assigned to the variable double, which becomes its name.</li><li>The data type of the variable double is <span style="font-family:courier new;">function!</span> That is, if you enter <span style="font-family:courier new;">type? :double</span> at the REBOL console, it will return the word <span style="font-family:courier new;">function!</span> The leading colon fetches the function itself instead of calling it; without the colon, REBOL would try to run double and complain about a missing argument.<br /></li></ol>Go ahead and enter the code above. Then type <span style="font-family:courier new;">double 21</span>. Finally, enter <span style="font-family:courier new;">? double</span>. The <span style="font-family:courier new;">?</span> is used to get help on anything in the REBOL environment, including your own functions.<br /><br />The script file that we will create will contain two function definitions and some REBOL code to execute immediately upon "doing" it. 
In addition, to make the functions more generally useful, we'll remove any user interface interactions; these will become part of the executable code.<br /><br />Here's what the esearch script becomes after we turn it into a function:<br /><div class="rebol-code" style="white-space: pre;">;; The esearch function<br />esearch: func [<br />"Search for references to a protein name and return a block of ids (empty if none are found)"<br />protein-name<br />] [<br />; assemble the cgi arguments<br />esearch-arguments: join "esearch.fcgi?db=protein&term=" protein-name<br />esearch-url: join eutilities-url esearch-arguments<br />; load as a tagged (XML, HTML, etc.) document<br />response: load/markup esearch-url<br />; Parse the response<br />; copy gives us a fresh, empty block on every call<br />ids: copy []<br />parse response [<br />thru <idlist><br />some [<br />thru <id> copy idvalue to </id><br />(append ids idvalue)]<br />to </idlist><br />to end ]<br />; Return the id list<br />ids]</div><br /><br />And, here's what the efetch script becomes:<br /><br /><div class="rebol-code" style="white-space: pre;">;; The efetch function definition<br />efetch: func [<br />{Fetch the NCBI data associated with this reference Id according to the <br /> return type}<br />id [string!] "A valid NCBI reference Id"<br />return-type [string!] 
"For example, fasta"<br />] [<br />; Assemble the CGI parameters - note that we use the function argument <br />; named return-type as a value for the NCBI fetch argument named rettyp<br />esearch-arguments:<br /> rejoin ["efetch.fcgi?db=protein&retmode=text&rettype="<br /> return-type <br /> "&id=" <br /> id]<br />esearch-url: join eutilities-url esearch-arguments<br />response: read/lines esearch-url<br />response]</div><br /><br />Finally, the code that will be executed when we "do" the script:<br /><br /><div class="rebol-code" style="white-space: pre;">; Begin executable script code<br />; URL shared by the esearch and efetch functions<br />eutilities-url: http://www.ncbi.nlm.nih.gov/entrez/eutils/<br />;; Ask the user for a protein name<br />protein-name: request-text/title "Enter a protein name"<br />; search for references to this protein<br />ids: esearch protein-name<br />; let the user choose one of the references<br />id: request-list "Choose an ID" ids<br />if id = none [print "Halting the script" halt]<br />; use the id to fetch the reference - The return value is the fasta<br />; file represented as a block of strings - one per line<br />fasta: efetch id<br />print fasta<br /></div><br />You can create a script file from the two functions and the executable bit or download <a href="http://docs.google.com/Doc?id=agkfgrg4fbz6_38cq4zf73t">ncbi.r</a> (You might have to remove a ".txt" extension, although REBOL doesn't care: <span style="font-family:courier new;">do %ncbi.r.txt</span> will still work.)<br /><br />To close this post, I'll introduce you to REBOL's GUI facilities - they are remarkably powerful and easy to use. By way of introduction, we will replace the executable code in the ncbi.r script (after the "URL shared by the ..." comment) with a window layout - a declarative description of what the GUI looks like and how it will behave. 
If you want a fuller explanation of the GUI facilities, again, please take a look at Nick Antonaccio's <a href="http://musiclessonz.com/rebol.html">tutorials</a>. You might even want to go there first.<br /><br />Here's the GUI replacement code and an explanation:<br /><br /><div class="rebol-code" style="white-space: pre;">; URL shared by the esearch and efetch functions<br />eutilities-url: http://www.ncbi.nlm.nih.gov/entrez/eutils/<br /><br />; The layout (GUI) for this NCBI dialog<br /><br />ncbi-layout: layout [<br />across<br />text "Protein Name: "<br />protein-name: field<br />button "Search" [the-list/data: [] show the-list<br /> the-list/data: esearch protein-name/text<br /> the-fasta/text: " " show the-fasta<br /> show the-list]<br />return<br />text "Choose an Id: "<br />the-list: text-list [id-selection: value]<br />button "Fetch" [the-fasta/text: efetch id-selection "fasta"<br /> show the-fasta]<br />return<br />text "The Data: "<br />the-fasta: tt 400x200<br />return<br />button "Exit" [unview]<br />]<br /><br /><br />; Begin executable script code<br />; Display the dialog<br /><br />view ncbi-layout</div><br /><br />Explanation:<br /><ol><li>As mentioned earlier, the form and function of a GUI layout is described first. Later, the layout will be displayed (viewed). The variable <span style="font-family:courier new;">ncbi-layout</span> is set to the value of the description.<br /></li><li>The layout is written using what is called a "dialect" in REBOL. A dialect is an extension to the REBOL language. It is usually designed for a special purpose: in our case for describing GUI layout. It has its own set of reserved words that have special meaning in the context of the dialect. The GUI dialect is announced with the word <span style="font-family:courier new;">layout</span> followed by a block containing the description. 
Note that a "hot" topic nowadays is DSLs (<a href="http://en.wikipedia.org/wiki/Domain_Specific_Language">Domain-Specific Languages</a>). REBOL dialects are DSLs. You can create your own dialect as needed.<br /></li><li>One way to understand the layout is to read it from top to bottom. While reading, special words will be encountered. For example, the first word <span style="font-family:courier new;">across</span> says that all the following UI elements should be positioned from left to right, across the window. This is followed by a <span style="font-family:courier new;">text</span> command that says place the following text in the window. Next, a variable <span style="font-family:courier new;">protein-name</span> is declared and its value will be set to the value of a <span style="font-family:courier new;">field</span>. A <span style="font-family:courier new;">field</span> will create an input field in the window. A little further down is the word <span style="font-family:courier new;">return</span>. This is not a return from the layout; it is a command to return to the left of the window and begin placing the UI components there.</li><li>Buttons are used to start some action. Pushing the "Search" button causes the block that follows to be executed. The action, in this case, is for the layout to use the value typed into the <span style="font-family:courier new;">protein-name</span> field as an argument to the <span style="font-family:courier new;">esearch</span> function. Recall that <span style="font-family:courier new;">esearch</span> returns a list of ids. This list is assigned to <span style="font-family:courier new;">the-list/data</span>, that is, to the data shown by the variable <span style="font-family:courier new;">the-list</span>. If you look down a few lines, you'll see <span style="font-family:courier new;">the-list</span> actually contains a UI text list. That is, the initial value is a displayable list. 
When <span style="font-family:courier new;">the-list</span> is set to the result of esearch, the list of ids will appear in the window as a scrollable and selectable list.</li><li>To actually display the UI layout, the command <span style="font-family:courier new;">view</span> is used. This will begin the dialog and not return until the dialog is exited.<br /></li></ol>This <a href="http://docs.google.com/Doc?id=agkfgrg4fbz6_39d9xzcqhr">code</a> is also available for download.<br /><br />In the next post, I want to get back to developing more bioinformatic-related code.<br /><br />Posted by Peter C Marks<br /><br /><hr align="center" size="2"/><br /><strong>3. Adding a User Interface</strong><br />Posted 2008-05-22, updated 2008-05-23<br /><br />The two REBOL scripts that we have built require the user to change the script in order to search for a different protein. This is not too user-friendly. What we want to do first is to prompt the user for the protein name. REBOL has a function named ask that will do this. Type in:<br /><br /><div class="rebol-code"> protein-name: ask "Please enter a protein name: "<br />print protein-name<br /></div><br />REBOL also has another function that prompts a user to enter text. It is named request-text. Its prompt is supplied through the /title refinement. Try this:<br /><br /><div class="rebol-code">protein-name: request-text/title "Please enter a protein name:"<br />print protein-name</div><br />The obvious difference between the two functions is that request-text uses a Graphical User Interface (GUI). 
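<br /><br />One detail to keep in mind with a dialog is that the user can dismiss it. The following is a small sketch, assuming REBOL/View and assuming (as I understand it) that request-text gives back none when the dialog is cancelled; the messages are arbitrary:<br /><br /><div class="rebol-code" style="white-space: pre;">; Sketch: guard against a cancelled dialog (assumes none is returned on cancel)<br />protein-name: request-text/title "Please enter a protein name:"<br />either none? protein-name [<br />    print "No protein name was entered"<br />] [<br />    print ["Searching for:" protein-name]<br />]</div><br />Checking for none first keeps a cancelled dialog from being passed along as if it were a protein name.<br /><br />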
We'll continue using a GUI for interactions with a user.<br /><br />Let's incorporate request-text into the esearch-01.r script (new or changed lines in <span style="font-weight: bold;">bold</span>):<br /><br /><div class="rebol-code"> eutilities-url: http://www.ncbi.nlm.nih.gov/entrez/eutils/<br /><span style="font-weight: bold;">protein-name: request-text/title "Please enter a protein name"</span><br /><span style="font-weight: bold;">esearch-arguments: join "esearch.fcgi?db=protein&term=" protein-name</span><br />esearch-url: join eutilities-url esearch-arguments<br />response: load/markup esearch-url<br />print response </div><br />Explanation:<br /><ul><li>The second line is new. It prompts the user for a protein name and stores it in the protein-name variable.</li><br /><li>In the third line, the hardwired protein name is replaced by the protein-name variable.</li></ul>Run the script ( do %esearch-01.r ) and enter the protein name "inulin" at the prompt. The result should be the same XML document as before. The document should contain a list of Id's.<br /><br />The second script, efetch-01.r, also has to be modified by hand to include the Id of the information we are interested in. Instead of having to modify the script, a better user interface would be to create a list of the Id's and have the user pick one of them. As it turns out, REBOL has a request-list function. Try this:<br /><br /><div class="rebol-code"> request-list "Choose a number, any number: " [2 4 "six" 8 10]</div><br />Explanation:<br /><ul><li>request-list takes two arguments: a string that becomes a prompt and a block. Blocks are similar to lists, a type that is often available in other programming languages. Blocks are surrounded by square brackets "[ ]" when displayed or when typed in as literals. Just like strings, blocks are a series type.</li><br /><li>Notice the "six" in the list. The elements in a list need not be all of the same type. 
That is, you can mix numbers and character strings, for example.</li></ul>Before we can use the request-list function in our script, we will have to convert the id's that the esearch-01.r script returns into a REBOL block, the expected type of the second argument. Recall that this list of id's is buried inside the XML document response.<br /><br />At this point I was going to illustrate how to use REBOL's extensive list of search functions to locate and extract the entire list of Id's. The code would have been straightforward, similar to solutions in other languages, and it would have introduced you to the series type search functions. But I decided that I wanted to show you a distinctive built-in facility in REBOL called parsing. While perhaps a conceptually more difficult way to extract our list of Ids, I think that once you understand what's going on you'll appreciate how powerful parsing can be.<br /><br />For a very good tutorial on using the parsing facility, take a look at the website of <a href="http://www.musiclessonz.com/rebol_tutorial.html"> Nick Antonaccio</a>. He has written about many other areas of REBOL as well.<br /><br />Parsing, for those who are not familiar with the term, is essentially scanning a character string trying to identify textual units that obey the rules specified in a grammar. There are many tools that will take a set of grammatical rules, say for a language, and generate a program that can parse programs written in that language.<br /><br />Add the following REBOL code to the esearch-01.r script file. 
The additional code is shown in <strong>bold</strong>.<br /><br /><div class="rebol-code" style="white-space: pre;">REBOL [<br />Title: "Search the NCBI Protein Database"<br />Date: 24-Apr-2008<br />File: %esearch-01.r<br />Author: "Peter C Marks"<br />Version: 1<br />]<br />eutilities-url: http://www.ncbi.nlm.nih.gov/entrez/eutils/<br />protein-name: request-text/title "Please enter a protein name"<br />esearch-arguments: join "esearch.fcgi?db=protein&term=" protein-name<br />esearch-url: join eutilities-url esearch-arguments<br />response: load/markup esearch-url<br /><strong><br />ids: []<br />parse response [<br /> thru <idlist><br /> some [<br /> thru <id> copy idvalue to </id> (append ids idvalue)]<br /> to </idlist><br /> to end ]<br /></strong><br /></div>Explanation:<br /><ul><li>The variable ids is initialized to an empty block "[]".</li><br /><li>Parse is a built-in function that can parse a series (string, block, etc.) - in our example a block, response - according to a set of grammatical rules.</li><br /><li>Here is an English translation of the parse lines in the script: Search the variable response for and <strong>thru</strong> the token/tag <strong><IdList></strong>. Next there will be <strong>some</strong> number of grammatical elements. Each element is identified by first scanning for and <strong>thru</strong> the <strong><Id></strong> tag. After doing this, <strong> copy</strong> the following characters into the variable <strong>idvalue</strong> up <strong>to </strong> an </Id> tag. After the copy, <strong>append</strong> to the <strong>ids</strong> list the value in <strong> idvalue</strong>. When there are no more (<strong>some</strong>) id's remaining, continue scanning <strong> to</strong> the tag </IdList> and finally scan from here until the <strong>end</strong> of the response value is reached.</li></ul>Save the script file. Execute this script from the REBOL command line. 
If there were no errors, you should see the value "true" printed. Display the value of the ids variable by typing this at the command line prompt (probe shows the block with its brackets and quotes, whereas print would show only the bare values):<br /><br /><div class="rebol-code">probe ids<br /></div><br />The value of ids should look something like this:<br /><br /><div class="rebol-code" style="border: 1px dotted green; padding: 2px; white-space: pre; background-color: white;">["2507051" "72132980" "1110443" "12060499" "9963676" "1906792" "116668619" "119714<br />336" "169196951" "169175440" "169175430" "1691...</div><br />Parsing has found and extracted the list of Id's and assigned them to the ids variable. We can use the ids variable as an argument to the request-list function.<br /><br />Incidentally, this style of parsing is very similar to what is called <a href="http://en.wikipedia.org/wiki/Parsing_expression_grammar">Parsing expression grammars</a> (PEG). Basically, a PEG describes a language through rules that say how to recognize it, so the grammar (syntactic rules) itself becomes the parsing program.<br /><br />Enter the following line:<br /><br /><div class="rebol-code">request-list "Choose an Id" ids<br /></div><br />A popup dialog should appear like the image below.<br /><br /><br /><a href="http://bp3.blogger.com/_rBUNya5qW3A/SDWxCWwv2rI/AAAAAAAAADo/cCiKPkthcoI/s1600-h/3-1.jpg"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://bp3.blogger.com/_rBUNya5qW3A/SDWxCWwv2rI/AAAAAAAAADo/cCiKPkthcoI/s320/3-1.jpg" alt="" id="BLOGGER_PHOTO_ID_5203259598524373682" border="0" /></a><br /><br />Click on one of the Id's. The value you clicked will be printed in the REBOL console. What this means is that the value of the function request-list is the value of the selection or, if the Cancel button is pushed, the value "none".<br /><br />We now need to take this selected value and use it as an argument to the NCBI efetch utility. 
I'll cover that in the next post.<br /><br />Posted by Peter C Marks<br /><br /><hr align="center" size="2"/><br /><strong>2. Making scripts and fetching data</strong><br />Posted 2008-04-28, updated 2008-05-06<br /><br />Before doing something with the list of Id's from the previous post and to make things a lot easier, let's turn the code that we have developed so far into a REBOL script file. Open your favorite plain text editor and enter (or copy and paste) the following:<br /><br /><div class="rebol-code">REBOL [<br /> Title: "Search the NCBI Protein Database"<br /> Date: 24-Apr-2008<br /> File: %esearch-01.r<br /> Author: "Peter C Marks"<br /> Version: 1<br />]<br /></div><br />Explanation:<br /><ul><li>This is called the "header" of a REBOL script file. It must be there. This example shows some of the information fields that can be included in the header. Please consult the REBOL documentation for more information. Minimally, you can get away with just this "REBOL []" but, of course, it's better to have some documentation.<br /></li> </ul>Here's the code we developed in the last post:<br /><br /><div class="rebol-code">eutilities-url: http://www.ncbi.nlm.nih.gov/entrez/eutils/<br />esearch-arguments: "esearch.fcgi?db=protein&term=inulin"<br />esearch-url: join eutilities-url esearch-arguments<br />response: load/markup esearch-url<br />print response</div><br />Copy this and append it to the header text - after the "]". This is the actual script that will be executed. Save this as a file with the name esearch-01.r. Remember the directory/folder in which you stored this file; you'll need the path later on.<br /><br />Start the REBOL command window if you haven't already. We are going to execute this script file. 
There are two ways of doing this:<br /><ul><li>by positioning ourselves in the directory where the script is located or</li><li>by specifying the location of the script file.</li></ul>Here's how to do it the first way, if you're running Windows (REBOL writes the drive letter as the first directory of the path):<br /><br /><div class="rebol-code">change-dir %/c/languages/rebol </div><br />or, if you're running some version of Linux, Mac OS X, or Unix:<br /><br /><div class="rebol-code">change-dir %/home/pcmarks/languages/rebol</div><br />Explanation:<br /><ul><li>The current directory is changed to the specified directory. Notice that the argument starts with a %. This indicates that a file/directory name follows.</li></ul>To execute the script, simply type the following:<br /><br /><div class="rebol-code">do %esearch-01.r</div><br />Explanation:<br /><ul><li>The do command will attempt to execute the REBOL code in the file. You can also use a URL and other values as arguments.</li></ul>Alternatively, you could have typed the complete path to the script file:<br /><br /><div class="rebol-code">do %/home/pcmarks/languages/rebol/esearch-01.r </div><br />By the way, at the DOS or shell command prompt (not the REBOL command prompt), you can type the following:<br /><br />rebol esearch-01.r<br /><br />and the script will be executed - assuming that your system can find the rebol executable.<br /><br />I usually keep a text editor open, make changes, save and execute the script at the REBOL prompt. Also, as with most command lines, you can press the up-arrow to recall previous commands from a history of commands.<br /><br />In the last post, part of the result from the search was a list of NCBI Id's that were relevant to our search for the protein inulin. As a next step, we'd like to select an Id from that list and see what type of information it points to. 
Here's what the list portion of the response looked like:<br /><br /><div class="rebol-code" style="border: 1px dotted green; padding: 2px; white-space: pre; background-color: white;"> ...<br /><IdList><br /><Id> 2507051 </Id><br /><Id> 72132980 </Id><br /><Id> 1110443 </Id><br /><Id> 12060499 </Id><br /><Id> 9963676 </Id><br /><Id> 1906792 </Id><br /><Id> 119714336 </Id><br /><Id> 169196951 </Id><br /><Id> 169175440 </Id><br /><Id> 169175430 </Id><br /><Id> 169175429 </Id><br /><Id> 169090591 </Id><br /><Id> 169016425 </Id><br /><Id> 169016415 </Id><br /><Id> 169016414 </Id><br /><Id> 167362208 </Id><br /><Id> 167070948 </Id><br /><Id> 116668619 </Id><br /><Id> 158318775 </Id><br /><Id> 119534997 </Id><br /></IdList><br /> ...</div><br />The NCBI provides another CGI utility called efetch. Given an Id value it will return information about this resource. To use efetch we post a request the same way we did for esearch. We'll try it with the first Id in the list. Create a new text file in your editor, copy the header from the last script, change the values as necessary, and finally enter the following code and save it as efetch-01.r:<br /><br /><div class="rebol-code">eutilities-url: http://www.ncbi.nlm.nih.gov/entrez/eutils/<br />esearch-arguments: "efetch.fcgi?db=protein&id=2507051"<br />esearch-url: join eutilities-url esearch-arguments<br />response: load/markup esearch-url<br />print response</div><br />Explanation:<br /><ul><li>The difference between this script and our first is the second line: Instead of calling the esearch utility at the NCBI, we are calling the efetch utility. We need to tell it from what database to fetch/get information and the Id.</li></ul>Now execute the script:<br /><br /><div class="rebol-code">do %efetch-01.r</div><br />There will be a fairly long response. 
The beginning of the response should look like this:<br /><br /><div class="rebol-code" style="border: 1px dotted green; padding: 2px; white-space: pre; background-color: white;">Seq-entry ::= seq {<br />id {<br /> swissprot {<br /> name "INU2_ARTGO" ,<br /> accession "P19870" ,<br /> release "reviewed" ,<br /> version 3 } ,<br /> gi 2507051 } ,<br />descr {<br /> title "Inulin fructotransferase [DFA-I-forming] (Inulin fructotransferase<br /> [depolymerizing, difructofuranose-1,2':2',1-dianhydride-forming])." ,<br /> sp {<br /> class standard ,<br /> seqref {<br /><br /> gi 1110442 ,<br /> gi 1110443 ,<br /> gi 2127394 } ,<br /> ...</div><br />Briefly, the response says that this data is a sequence entry (Seq-entry) from the Swiss-Prot database - another large database available over the web. Notice that this response is not an XML document. Instead it is formatted using an ISO standard called <a href="http://www.ncbi.nlm.nih.gov/Sitemap/Summary/asn1.html">ASN.1</a>. We won't worry about this right now. What is important is that we were able to take an Id value from the list in our original response and give it to another NCBI utility, efetch, and have it return information about an inulin-related protein, inulin fructotransferase. (Notice the title of this sequence entry in the response.)<br /><br />Posted by Peter C Marks<br /><br /><hr align="center" size="2"/><br /><strong>1. Accessing Bioinformatic Data</strong><br />Posted 2008-04-11, updated 2008-04-11<br /><br />[You will need to download and install REBOL (it's freeware and runs on many systems). Follow the pointers under Rebol Resources to the right. Download the REBOL/View version; it comes with a GUI component that we will be using later.]<br /><br />To start things off, we will use REBOL to access the biological databases at the National Center for Biotechnology Information (<a href="http://www.ncbi.nlm.nih.gov/">NCBI</a>). 
By the end of the post you'll be searching for species that encode the genomic <a href="http://en.wikipedia.org/wiki/Sequence_%28biology%29">sequences</a> for a particular protein. The NCBI was<br /><blockquote>“Established in 1988 as a national resource for molecular biology information, NCBI creates public databases, conducts research in computational biology, develops software tools for analyzing genome data, and disseminates biomedical information - all for the better understanding of molecular processes affecting human health and disease.” </blockquote>The NCBI site is well worth visiting - there is a wealth of information available there.<br /><br />The NCBI hosts several bioinformatic <a href="http://www.ncbi.nlm.nih.gov/sites/gquery">databases</a> all of which can be publicly accessed. The NCBI has also created a set of online utilities - again publicly available - that can be used to programmatically access these databases. Collectively, the utilities go by the name of eutils. They are located at this CGI gateway URL:<br /><br /><div>http://www.ncbi.nlm.nih.gov/entrez/eutils</div><br />Much of the content in these first few posts is based on a <a href="http://www.ncbi.nlm.nih.gov/Class/PowerTools/eutils/course.html">course</a> given by the NCBI. Handouts, slides, etc. are available for download from the course web page. The course teaches you how to use the eutils with Perl. Perl has a long tradition of being used by biologists. I've tried to re-code some of the course's Perl scripts in REBOL.<br /><br />Our first REBOL statement will create a variable that will hold the value of the eutils gateway URL. This statement will also serve to illustrate some of the unique features of REBOL. Make sure you are using the REBOL console. If you find yourself in REBOL/View (a graphical access to REBOL facilities), click the Console icon on the left. 
At the console prompt “>>”, type the following and press Enter:<br /><br /><div class="rebol-code"> eutilities-url: http://www.ncbi.nlm.nih.gov/entrez/eutils/</div><br />Explanation:<br /><ul><li>Placing a colon “:” after the variable name (actually REBOL calls these words) tells REBOL to assign the value that follows (the URL) to that variable. There should be NO space between the variable name and the colon.<br /><br /></li><li>Notice that the variable name contains a hyphen “-”, a character usually not allowed in the variable names of many programming languages. To distinguish this use of a hyphen from its use as a sign for subtraction, a hyphen that means subtraction must be surrounded by spaces, e.g., “5 - 2”. <span>Spaces are important in REBOL</span>.</li><br /><li>In REBOL, URLs (like the one to the NCBI site) are typed “as is”, that is, they do not need to be surrounded by quotes and treated as a character string. In REBOL, a URL is one of the many specialized data types. To be recognized as a URL, the value must specify the protocol, in this case, http:. Other protocols can be used as well, e.g., mailto:, ftp:, etc. Because REBOL dynamically assigns data types to variables based on their current value, <span class="rebol-text">eutilities-url </span>will have the URL data type.</li></ul><br />We will be using the eutils database search utility named "esearch". For illustration, we are going to look for protein entries that mention <a href="http://en.wikipedia.org/wiki/Inulin">inulin</a>, a type of plant sugar. Translated to eutils parameter values this means telling the NCBI server to search the Protein database looking for any entries that contain the term "inulin". We’ll use another variable, esearch-arguments, to hold these search values:<br /><br /><div class="rebol-code"> esearch-arguments: "esearch.fcgi?db=protein&term=inulin"</div><br />Explanation:<br /><ul><li>This is an example of a string literal. 
In REBOL they are enclosed in double quotation marks or, for multi-line strings, curly braces "{}".<br /></li></ul>In the next statement we attach the arguments to the end of the search URL and assign this value to another variable named esearch-url:<br /><br /><div class="rebol-code"> esearch-url: join eutilities-url esearch-arguments</div><br />Explanation:<br /><ul><li>“join” is a built-in REBOL command that will concatenate two values that are of a REBOL type called a series. A series is similar to but more inclusive than a list. The URL and string data types are both series and for this reason we can join the two variables. We'll be showing other types of series in subsequent posts.</li></ul>Of course, we could have created esearch-url in one statement.<br /><br />Now we’re ready to perform the search. This is done by sending the NCBI server an HTTP request, containing the search arguments, and getting back a response. Type in the following:<br /><br /><div class="rebol-code"> response: load/markup esearch-url </div><br />Explanation:<br /><ul><li>The REBOL load command is used to send a request to the given URL and retrieve the response - in our example, from the NCBI server. Our use of the load command is modified by what is called, in REBOL, a "refinement". The refinement, "/markup", is appended to the load command. 
As a result the load command will expect the response to be formatted with tags (markup), using the markup languages HTML, XML, WSDL, for example.<br /><br /></li><li>The result is stored in the variable response.<br /></li></ul>If things went well, you should see the following written to the console:<br /><br /><div class="rebol-code" style="white-space: pre; background-color: white; border: green dotted 1px; padding: 2px;">connecting to: www.ncbi.nlm.nih.gov = [ <?xml version ...</div><br />If there’s a problem, you’ll see an error message, something like this:<br /><br /><div class="rebol-code" style="white-space: pre; background-color: white; border: green dotted 1px; padding: 2px;">connecting to:<br />www.ncbi.nlm.nih.gov ** User Error: Error. Target url:<br />http://www.ncbi.nlm.nih.gov/entrez/<br />eutils/esearch?db=protien&term=inulin[…<br /><br />** Near: response: load/markup esearch-url</div><br />Errors are often the result of a misspelling. In the above request, the word protein is spelled incorrectly.<br /><br />Let's see what the response was. 
Type the following in:<br /><br /><div class="rebol-code"> print response </div><br />You should see an XML document, nicely formatted:<br /><br /><div class="rebol-code" style="white-space: pre; background-color: white; border: green dotted 1px; padding: 2px;"><esearchresult><br /><count> 125 </count><br /><retmax> 20 </retmax><br /><retstart> 0 </retstart><br /><idlist><br /> <id> 2507051 </id><br /> <id> 72132980 </id><br /> <id> 1110443 </id><br /> <id> 12060499 </id><br /> <id> 9963676 </id><br /> <id> 1906792 </id><br /> <id> 169196951 </id><br /> <id> 169175440 </id><br /> <id> 169175430 </id><br /> <id> 169175429 </id><br /> <id> 169090591 </id><br /> <id> 169016425 </id><br /> <id> 169016415 </id><br /> <id> 169016414 </id><br /> <id> 167362208 </id><br /> <id> 167070948 </id><br /> <id> 116668619 </id><br /> <id> 158318775 </id><br /> <id> 119714336 </id><br /> <id> 119534997 </id><br /></idlist><br /><translationset><br /></translationset><br /><translationstack><br /> <termset><br /> <term> inulin[All Fields] </term><br /> <field> All Fields </field><br /> <count> 125 </count><br /> <explode> Y </explode><br /> </termset><br /> <op> GROUP </op><br /></translationstack><br /><querytranslation> inulin[All Fields] </querytranslation><br /></esearchresult><br /></div><br />The result says that there are 125 entries in the Protein database that match the search term inulin. The first 20 results are returned as a list of IDs. Each ID uniquely identifies a source within the Protein database. Note that a species may have more than one entry for a protein. This is because the NCBI gathers information from several other biological databases - each entry represents a different source.<br /><br />XML is one of several response formats that the NCBI utilities can provide. 
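<br /><br />As a rough sketch of how such a response is organized (not necessarily the approach taken in later posts), the example below relies on the fact that load/markup returns a block of tags and strings. The variable sample and its shortened contents are made up for illustration, so that the example can be tried without contacting the NCBI server:<br /><br /><div class="rebol-code" style="white-space: pre;"> ; Assumption: a shortened, hand-made stand-in for the real NCBI response<br /> sample: load/markup {<Count>125</Count><IdList><Id>2507051</Id><Id>72132980</Id></IdList>}<br /> ; The block holds tag! and string! values; print only the strings<br /> foreach item sample [<br />     if string? item [print item]<br /> ]</div><br />Each string printed is the text found between a pair of tags, such as the count or one of the IDs.<br /><br />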
In the next post we'll do something with this data.<br /><br /><hr align="center" size="2"/><br />Here's all the REBOL code that was used to retrieve and print the response seen above:<br /><br /><div class="rebol-code" style="white-space: pre"> eutilities-url: http://www.ncbi.nlm.nih.gov/entrez/eutils/<br /> esearch-arguments: "esearch.fcgi?db=protein&term=inulin"<br /> esearch-url: join eutilities-url esearch-arguments<br /> response: load/markup esearch-url<br /> print response</div>Peter C Marksnoreply@blogger.com4tag:blogger.com,1999:blog-4164109310423738379.post-1776057328728130442008-04-08T12:34:00.001-04:002008-04-10T10:57:31.126-04:002008-04-10T10:57:31.126-04:00Inaugural Post<blockquote></blockquote>This blog will be populated with posts about using the <a href="http://www.rebol.com/">REBOL</a> programming language in the field of <a href="http://en.wikipedia.org/wiki/Bioinformatics">bioinformatics</a>. Why REBOL? Why bioinformatics? I just happened to be learning both at the same time. I thought that trying to use REBOL to access biological data might be a good way to learn both. In effect, the following posts will chronicle this learning experience. You will probably also witness my making mistakes in both areas. ;-)<br /><br />Nowadays, there are dozens of publicly accessible bioinformatic databases. Many of these databases have excellent web-based interfaces. But there are times when one needs to access this information programmatically. To that end, a number of packages and libraries have been created for a variety of languages. The following websites are good starting points to learn more about <a href="http://www.bioperl.org/wiki/Main_Page">BioPerl</a>, <a href="http://biopython.org/wiki/Main_Page">BioPython</a> and <a href="http://biojava.org/wiki/Main_Page">BioJava</a>.<br /><br />REBOL is a remarkable scripting language; I will only touch on part of its capabilities. 
Please visit the REBOL <a href="http://www.rebol.com/">website</a> for pointers to tutorials and other learning resources. In the next post we will begin to learn how to search through on-line bioinformatic data.Peter C Marksnoreply@blogger.com0