
   

Programmatic access to GEO



Introduction


GEO data can be programmatically accessed using a suite of programs called the Entrez Programming Utilities (E-Utils).

E-Utils are a set of server-side programs that provide a stable interface to the search, retrieval, and linking functions of the Entrez system, using a fixed URL syntax. E-Utils are designed to be called from within a computer program that can process their output. Output is provided in XML format.


GEO data are stored in two separate databases:
  • Entrez GEO DataSets: contains descriptive and accession information for all records (db=gds)
  • Entrez GEO Profiles: contains gene annotation and synoptic/visual data for each expression profile (db=geo)

Three key concepts to keep in mind are:
  1. E-Utils can retrieve only data that are stored within the Entrez system. For GEO databases, only metadata are stored in Entrez. To retrieve full GEO records, complete data tables, or raw data files, a second step is required: constructing an FTP URL (see the FTP directory structure table) and downloading the data.
  2. Each Entrez record is identified using a unique integer ID (UID). UIDs are used for both data input and output. Search history parameters (query_key and WebEnv) can also be used to identify previous search results.
  3. Your initial search can be refined using field qualifiers, which filter results by data type, publication date range, and much more.

A typical workflow might have the following steps:
  • Use the qualifier fields in Entrez GEO DataSets to fine-tune a search.
  • Construct the appropriate eSearch query in your script/program.
  • Run the query and retrieve the results in the form of UIDs or history parameters (query_key and WebEnv), as needed.
  • Run eSummary, eFetch, and/or eLink, depending on your needs, to retrieve the final metadata or accessions.
  • If you need to download full records or supplementary files, use the accession information to construct an FTP URL and download the data.
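
The first three workflow steps might be sketched in Python as follows. This is a minimal sketch rather than an official client: the helper names build_esearch_url and parse_esearch are hypothetical, the base URL is the one used in the examples on this page, and SAMPLE_RESPONSE is a hand-written stand-in (made-up UIDs and WebEnv value) for an eSearch reply, so the snippet makes no network call.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Base URL as it appears in the examples on this page.
EUTILS_BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(term, db="gds", retmax=5000, use_history=True):
    """Build an eSearch URL (hypothetical helper, not part of E-Utils)."""
    params = {"db": db, "term": term, "retmax": retmax}
    if use_history:
        # Ask the History server to keep the result set.
        params["usehistory"] = "y"
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def parse_esearch(xml_text):
    """Pull UIDs and history parameters out of eSearch XML output."""
    root = ET.fromstring(xml_text)
    return {
        "uids": [e.text for e in root.findall("./IdList/Id")],
        "query_key": root.findtext("QueryKey"),
        "webenv": root.findtext("WebEnv"),
    }


# Hand-written stand-in for an eSearch reply; all values are made up.
SAMPLE_RESPONSE = """<eSearchResult>
  <Count>2</Count>
  <QueryKey>1</QueryKey>
  <WebEnv>EXAMPLE_WEBENV</WebEnv>
  <IdList><Id>111</Id><Id>222</Id></IdList>
</eSearchResult>"""

url = build_esearch_url("GSE[ETYP]")
result = parse_esearch(SAMPLE_RESPONSE)
print(url)
print(result)
```

Note that urlencode percent-encodes the square brackets of the field qualifier (GSE%5BETYP%5D), which is the standard URL encoding of the same query. In a real script, the XML passed to parse_esearch would be the body returned by fetching the URL.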

For more information, check out the complete E-Utils documentation.


Examples


For most applications, GEO DataSets is the more useful and sensible place to construct a search. All of the examples that follow demonstrate GEO DataSets searches and retrievals.

In each example, the query_key and WebEnv values shown are placeholders for demonstration purposes only.
The History server stores these parameters for a limited time, so perform the eSearch yourself to generate new query_key and WebEnv values.



Example I: Retrieve a complete list of Series accession numbers and their associated Sample accession numbers.

  • Construct and perform an eSearch in db=gds to retrieve all Series records using:

    http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=gds&term=GSE[ETYP]&retmax=5000&usehistory=y

  • Use the query_key and WebEnv parameters from the eSearch to perform an eSummary:

    http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=gds&query_key=X&WebEnv=ENTER_WEBENV_PARAMETER_HERE

    This retrieves summary documents for all Series records.

  • Within those summary documents, 'GSM_L' lists the Sample accession numbers for each Series.
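
The eSummary step of Example I could be handled along these lines. This is a sketch under stated assumptions: the helper names are hypothetical, the summary XML is assumed to consist of DocSum elements containing an Id and named Item children, and the 'GSM_L' value is assumed to be a semicolon-separated list. Check both assumptions against a real response before relying on them. SAMPLE_SUMMARY is hand-written with made-up values.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Base URL as it appears in the examples on this page.
EUTILS_BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esummary_url(query_key, webenv, db="gds"):
    """Build an eSummary URL from history parameters (hypothetical helper)."""
    params = {"db": db, "query_key": query_key, "WebEnv": webenv}
    return f"{EUTILS_BASE}/esummary.fcgi?{urlencode(params)}"


def samples_by_uid(xml_text, sep=";"):
    """Map each record UID to the entries of its 'GSM_L' item.

    Assumes <DocSum><Id>..</Id><Item Name="GSM_L">..</Item></DocSum>
    and that `sep` separates the entries; both are assumptions.
    """
    root = ET.fromstring(xml_text)
    mapping = {}
    for doc in root.iter("DocSum"):
        raw = ""
        for item in doc.findall("Item"):
            if item.get("Name") == "GSM_L":
                raw = item.text or ""
        mapping[doc.findtext("Id")] = [t.strip() for t in raw.split(sep) if t.strip()]
    return mapping


# Hand-written stand-in for an eSummary reply; all values are made up.
SAMPLE_SUMMARY = """<eSummaryResult>
  <DocSum>
    <Id>111</Id>
    <Item Name="GSM_L" Type="String">GSM1;GSM2</Item>
  </DocSum>
  <DocSum>
    <Id>222</Id>
    <Item Name="GSM_L" Type="String"></Item>
  </DocSum>
</eSummaryResult>"""

print(build_esummary_url("1", "EXAMPLE_WEBENV"))
print(samples_by_uid(SAMPLE_SUMMARY))
```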


Example II: Fetch a document summary text file listing all Saccharomyces cerevisiae experiments released within the last 3 months.


Example III: Retrieve all CEL files corresponding to Affymetrix Platform HG-U133A.

  • When looking for data relating to a specific array, it is usually safest to use that Platform's GEO accession number, rather than its name. The official version of HG-U133A has accession number GPL96, as determined by a manual search.

  • Construct and perform an eSearch query in db=gds for all Series records that have Samples relating to GPL96 and have CEL files, using:

    http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=gds&term=GPL96[ACCN]+AND+gse[ETYP]+AND+cel[suppFile]&retmax=5000&usehistory=y

  • Use the query_key and WebEnv parameters from the eSearch to perform an eSummary:

    http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=gds&query_key=X&WebEnv=ENTER_WEBENV_PARAMETER_HERE

    This returns summary documents for all Series records that contain HG-U133A CEL files.

  • Extract the Series accession numbers from the eSummary document. You can then use this Series accession list to construct URLs to get the raw data files, for example:
    ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE5290/GSE5290_RAW.tar
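
Turning a list of Series accession numbers into download URLs is a simple string substitution. The sketch below follows the URL pattern of the GSE5290 example above; the function name series_raw_url is hypothetical, and the sketch does not check whether a given Series actually has a _RAW.tar file on the FTP site.

```python
import re

# Directory taken from the example URL above.
SERIES_SUPP_BASE = "ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series"


def series_raw_url(accession):
    """Return the raw-data tar URL for a Series accession such as 'GSE5290'."""
    if not re.fullmatch(r"GSE\d+", accession):
        raise ValueError(f"not a Series accession: {accession!r}")
    return f"{SERIES_SUPP_BASE}/{accession}/{accession}_RAW.tar"


# Accessions extracted from the eSummary documents would go here.
for acc in ["GSE5290"]:
    print(series_raw_url(acc))
```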


Example IV: Retrieve all PubMed IDs that correlate with rat experiments in GEO.



More details



E-Util programs


  • eSearch: responds to a text query with the list of unique identifiers (UIDs) matching the query in a given database, along with the term translations of the query.
  • eSummary: responds to a list of UIDs with the corresponding document summaries.
  • eFetch: responds to a list of UIDs with the corresponding data records.
  • ePost: accepts a list of UIDs, stores the set on the History server, and responds with the corresponding query key and Web environment.
  • eLink: responds to a list of UIDs in a given database with either a list of related IDs in the same database or a list of linked IDs in another Entrez database.
  • eInfo: provides the number of records indexed in each field of a given database, the date of the last update of the database, and the available links from the database to other Entrez databases.


FTP directory structure

All GEO data are available for download from the FTP site. The directory structure is organized by format, type, and GEO accession number.
For more information, please see the README.

SOFT format, by DataSet:
Example ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SOFT/GDS/GDS1.soft.gz
SOFT format, by Platform:
Example ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SOFT/by_platform/GPL10/GPL10_family.soft.gz
SOFT format, by Series:
Example ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SOFT/by_series/GSE1/GSE1_family.soft.gz
MINiML format, by Platform:
Example ftp://ftp.ncbi.nih.gov/pub/geo/DATA/MINiML/by_platform/GPL10/GPL10_family.xml.tgz
MINiML format, by Series:
Example ftp://ftp.ncbi.nih.gov/pub/geo/DATA/MINiML/by_series/GSE1/GSE1_family.xml.tgz
SeriesMatrix format:
Example ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SeriesMatrix/GSE1/GSE1_series_matrix.txt.gz
Supplementary files, by Platform:
Example ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/platforms/GPL1073/
Supplementary files, by Series:
Example ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE1000/GSE1000_RAW.tar
Supplementary files, by Sample:
This directory structure is slightly more complicated, to accommodate rapid growth.
The subdirectory name is formed by replacing the last three digits of the accession number with the letters "nnn", e.g.
GSM575      /samples/GSMnnn/GSM575/
GSM1234     /samples/GSM1nnn/GSM1234/
GSM12345    /samples/GSM12nnn/GSM12345/
Example ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/samples/GSM1nnn/GSM1137/GSM1137.CEL.gz
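
The "nnn" rule for Sample directories can be expressed in a few lines of Python. This is a sketch based only on the examples above: the function names sample_dir and sample_url are hypothetical, the file name is passed in by the caller because supplementary file names vary from Sample to Sample, and the behaviour for accessions with fewer than three digits is an assumption (they are placed in GSMnnn).

```python
import re

# Directory taken from the example URL above.
SAMPLES_BASE = "ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary"


def sample_dir(accession):
    """Return the FTP subdirectory for a Sample accession such as 'GSM1234'."""
    match = re.fullmatch(r"GSM(\d+)", accession)
    if match is None:
        raise ValueError(f"not a Sample accession: {accession!r}")
    digits = match.group(1)
    # Drop the last three digits and put "nnn" in their place.
    return f"/samples/GSM{digits[:-3]}nnn/{accession}/"


def sample_url(accession, filename):
    """Return the full URL of one supplementary file of a Sample."""
    return f"{SAMPLES_BASE}{sample_dir(accession)}{filename}"


for acc in ["GSM575", "GSM1234", "GSM12345"]:
    print(acc, sample_dir(acc))
print(sample_url("GSM1137", "GSM1137.CEL.gz"))
```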





