ENTREZ QUERY HELP DOCUMENTATION. This is the ENTREZ-Query server, that uses the daily updatable Entrez service. ***************************************************************************** The QUERY E-mail Retrieval System January 27, 1999 The QUERY e-mail server allows users to retrieve records by a variety of queries from the nucleotide sequence, protein sequence, structure, and PubMed MEDLINE databases at the National Center for Biotechnology Information (NCBI), National Library of Medicine, NIH, in Bethesda, MD. For performing sequence similarity searches, the BLAST e-mail server can be used. To obtain help documentation for that server, send the word HELP in the body of a message to blast@ncbi.nlm.nih.gov If you have access to the World Wide Web, you might prefer to use the Entrez and BLAST search tools accessible from the NCBI WWW page: http://www.ncbi.nlm.nih.gov/ The Web page provides access to a number of other tools as well, and help documentation is available for each one. If you have questions about NCBI services after reading the documentation, send e-mail to info@ncbi.nlm.nih.gov ========================================================================= CONTENTS (for quick intro, just read sections 1 and 2) ========================================================================= 1. Introduction 2. Query Format 3. Brief Search Examples 3.1 Searches by Unique Identifiers (UID) 3.2 Searches by Text Term 3.2.1 Field Specifiers 3.2.2 Boolean Searches 3.2.3 Truncation Searches 3.2.4 Range Searches 4. Detailed Information and Search Tips 4.1 Database Domains 4.2 Searches by Unique Identifiers (UID) 4.2.1 UID/FASTA format 4.2.2 Multiple UIDS 4.2.3 GenBank Accession Number vs. 
Sequence ID 4.2.4 FASTA Format for UIDs 4.2.5 Examples of UID and FASTA searches 4.3 Searches by Text Term 4.3.1 Field Specifiers 4.3.2 Boolean Operators 4.3.3 Truncation 4.3.4 Ranging 4.4 Display Options 4.4.1 Default record formats 4.4.2 Recommended formats 4.4.3 MEDLINE entries 4.4.4 Nucleotide entries 4.4.5 Protein entries 4.4.6 Structure entries 4.4.7 Example display options 4.5 Optional Search Parameters 4.6 Escaping Characters ========================================================================= 1. INTRODUCTION ========================================================================= The QUERY e-mail server uses the Entrez Query Engine to obtain data. Entrez arranges information by domain rather than by source database. The domains currently supported are: 1) Nucleotide Sequences: GenBank, EMBL, DDBJ, dbEST, dbSTS, dbGSS 2) Protein Sequences: GenPept (translated coding regions from DNA), PIR, SWISSPROT, PRF, PDB (sequences from solved structures) 3) 3-D Structures: Molecular Modeling Database (MMDB) database, derived from the Protein Data Bank (PDB) 4) PubMed MEDLINE: bibliographic records from complete MEDLINE, PreMedline, and publisher supplied citations Consistent with the model of integrated information access provided by Web Entrez (http://www.ncbi.nlm.nih.gov), the Query server enables you to retrieve records of interest from a target domain, related records ("neighbors") from the same domain, or associated records ("links") from other domains. ========================================================================= 2. QUERY FORMAT ========================================================================= Queries should be sent by e-mail message to: query@ncbi.nlm.nih.gov Each message should contain a single query. There are two main types of searches that can be done: (1) unique identifier (UID) search, or (2) text term search. 
A basic query contains two lines:

    DB  _____          OR          DB   ______
    UID _____                      TERM ______

FIRST LINE indicates the database domain you want to search, e.g., 'm' for MEDLINE; 'n' for nucleotide; 'p' for protein; 's' for both nucleotide and protein; 't' for 3D structure (section 4.1).

SECOND LINE contains your search term.

For a UID search, the search term should be a Unique Identifier. If you include multiple unique identifiers, separate them by commas (section 4.2).

For a TERM search, the search term can be a title word, author name, journal name, or other type of text word (section 4.3). Terms can be followed by optional field specifiers (section 4.3.1). If no field is specified, all fields will be searched. For searches containing multiple terms, Boolean operators can be used (section 4.3.2). If terms are separated by spaces rather than Boolean operators, and the terms are not recognized as a phrase by the search system, a default Boolean AND will be used. Terms can be truncated (section 4.3.3), and in some cases, ranged (section 4.3.4).

A THIRD LINE, DOPT (display option), can be added to the query to specify the desired record format. If no DOPT line is included, default display options are used. For TERM searches, the default is 'document summary' format, which shows titles only (DOPT d). For UID searches, the default depends on the database being searched. MEDLINE records are shown in 'Citation report' format, which shows the citation, title, abstract, and MeSH indexing terms (DOPT r). Nucleotide and protein records are shown in GenBank/GenPept format (DOPT g) (section 4.4.1).

A number of other display options are available for each data domain, e.g., 'DOPT f' for FASTA format of sequence records; 'DOPT l' for MEDLINE format of bibliographic records, suitable for importing into a bibliographic database management program. The DOPT line can also be used to retrieve related records ("neighbors") from the same domain, or associated records ("links") from other domains.
(sections 4.4.2 - 4.4.7)

ADDITIONAL LINES can be added to the query, if desired, to specify optional search parameters. E.g., DISPMAX specifies the maximum number of documents; HTML retrieves search results in that format; and PATH specifies the e-mail address to which results should be sent (section 4.5).

EXAMPLES
--------------------------------------------------------------------
UID SEARCH                            TERM SEARCH
--------------------------------------------------------------------
DB n                                  DB m
UID U30150,U30151                     TERM angiostatin
DOPT f                                DOPT l
                                      DISPMAX 20
                                      PATH user@home.net

searches the nucleotide database      searches the MEDLINE database
for records with accession numbers    for records that contain the term
U30150 or U30151; displays the        angiostatin; displays MEDLINE
records in FASTA format               format for a maximum of 20 records
                                      (default 200); returns the search
                                      results to the e-mail address
                                      'user@home.net'

=========================================================================
3. BRIEF SEARCH EXAMPLES
=========================================================================

=========================
3.1 SEARCHES BY UNIQUE IDENTIFIERS (UID)    (see section 4.2 for more info)
=========================

DB n
UID U30150

* Search the nucleotide database for an entry whose accession number is U30150. Since no DOPT line is present, the record will be displayed in the default GenBank format.

DB n
UID U50871,D31839,AF030413,X13467
DOPT f

* Search the nucleotide database for records with accession numbers U50871, D31839, AF030413, and X13467, and display them in FASTA format.

DB m
UID 88055872
DOPT r
HTML

* Search the MEDLINE database for the record with MEDLINE UID 88055872 and display it in MEDLINE Report format. Send the results in HTML format for viewing through a WWW browser.
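The query layout described in section 2 can also be assembled by a small program. The Python sketch below is illustrative only: the function name build_query and its parameter names are assumptions made for this example and are not part of the QUERY service. It only builds the text of the message body; sending the message to query@ncbi.nlm.nih.gov is left to your mail program.

```python
def build_query(db, uid=None, term=None, dopt=None,
                dispmax=None, path=None, html=False):
    """Build the body of a QUERY e-mail message (hypothetical helper)."""
    # Single-character database codes listed in this document.
    if db not in ("m", "n", "p", "s", "t"):
        raise ValueError("db must be one of m, n, p, s, t")
    # A message holds a single query: either a UID search or a TERM search.
    if (uid is None) == (term is None):
        raise ValueError("give exactly one of uid or term")

    lines = ["DB " + db]  # DB is the first line of the query
    if uid is not None:
        uids = [uid] if isinstance(uid, str) else list(uid)
        # Multiple identifiers are separated by commas, without spaces.
        lines.append("UID " + ",".join(uids))
    else:
        lines.append("TERM " + term)

    # Optional lines: display option and optional search parameters.
    if dopt:
        lines.append("DOPT " + dopt)
    if dispmax is not None:
        lines.append("DISPMAX " + str(dispmax))
    if path:
        lines.append("PATH " + path)
    if html:
        lines.append("HTML")
    return "\n".join(lines)


if __name__ == "__main__":
    # Reproduces the two queries from the EXAMPLES table in section 2.
    print(build_query("n", uid=["U30150", "U30151"], dopt="f"))
    print()
    print(build_query("m", term="angiostatin", dopt="l",
                      dispmax=20, path="user@home.net"))
```

The helper checks only the rules stated in this document (a known database code, and exactly one of UID or TERM); it does not verify that a display option is valid for the chosen database.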
DB p
UID sp|P11598|
DOPT m

* Search the protein database, using a FASTA formatted UID, to retrieve the entry whose Swiss-Prot accession number is P11598, and display the MEDLINE links for that protein record as document summaries.

DB p
UID sp|P11598|
DOPT ml

* Search the protein database, using a FASTA formatted UID, to retrieve the entry whose Swiss-Prot accession number is P11598, and display the MEDLINE links for that protein record in MEDLINE format (DOPT ml : m=MEDLINE record and l=MEDLINE format).

=========================
3.2 SEARCHES BY TEXT TERM    (see section 4.3 for more info)
=========================

DB p
TERM saccharomyces

* Search the protein database for the term "saccharomyces" in all fields (default). Since no DOPT line is included, Query will display the records in the Entrez Document Summary format, which is the default DOPT format for TERM searches and shows the document titles only. Also, since no DISPMAX line is included, Entrez will display up to 200 records (default).

=========================    Field specifiers must be in square brackets [ ].
3.2.1 FIELD SPECIFIERS       It does not matter if a space separates a
=========================    search term and field specifier.
                             (see section 4.3.1 for complete list of searchable fields)

DB p
TERM saccharomyces [orgn]

* Search the protein database for the term "saccharomyces" in the Organism field, then display the most recent 200 entries (the default DISPMAX value) in Entrez Document Summary format (the default DOPT for TERM searches, which shows only titles of retrieved documents rather than the full records).

DB n
TERM boguski ms [auth]
DISPMAX 15
HTML

* Search the nucleotide database for entries by the author M.S. Boguski, and display the most recent 15 in Entrez Document Summary format (default DOPT). Send the results in HTML format for viewing through a WWW browser.
DB n
TERM promoter [fkey]
DOPT d
DISPMAX 25

* Search the nucleotide database for the term 'promoter' in the Feature Key of sequence records, and display the document summary format for the most recent 25 records.

DB m
TERM ras[word]

* Search the MEDLINE database for records containing the word 'ras' in any one of a number of fields that contain text words, and display the Entrez Document Summary format (default DOPT) for the most recent 200 MEDLINE articles (default number).

DB m
TERM ras[titl]

* Search the MEDLINE database for records containing the word 'ras' in the Title field, and display the Entrez Document Summary format (default DOPT) for the most recent 200 MEDLINE articles (default number).

DB m
TERM Alzheimer disease [mesh]
DISPMAX 10
DOPT pg

* Search the MEDLINE database for records containing the term 'Alzheimer disease' in the Medical Subject Headings (MeSH) field, and display the most recent 10 associated protein records in GenPept format (DOPT pg : p=protein record and g=genpept format).

=========================    Boolean operators (AND, OR, NOT)
3.2.2 BOOLEAN SEARCHES       must be written in upper case.
=========================    (see section 4.3.2 for more info)

DB n
TERM p21[word] AND J Biol Chem[jour]
DOPT n

* Search the nucleotide database for entries containing the text word 'p21' and the journal name "J Biol Chem." Display their nucleotide neighbors (the default number of the most recent 200) as document summaries.

DB n
TERM (Mus musculus[ORGN] AND 1998/05[MDAT]) NOT gbdiv est[PROP]
DOPT g
DISPMAX 150

* Search the nucleotide database for mouse sequence records modified or added to the nucleotide database in May 1998, but not records in the EST division of GenBank, and display the most recent 150 in GenBank format.
DB p TERM tgf[word] OR (transcription[word] AND growth[word] AND factor[word]) DOPT d * Search the protein database for records containing the abbreviation 'tgf' or the fully spelled words 'transcription, growth, and factor' in the text word fields, and display the results in document summary format. (Note: The terms 'transcription, growth, and factor' will not necessarily be adjacent to each other. Rather, the system will retrieve records that contain all three terms.) DB m TERM (sensory[titl] OR sense[titl]) AND receptor[titl] DOPT n * Search the PubMed MEDLINE database for records containing the word 'sensory' or 'sense,' and the word 'receptor' in the Title field, and display a list of the associated nucleotide records. ========================= 3.2.3 TRUNCATION SEARCHES (see section 4.3.3 for more info) ========================= DB n TERM boguski m* [auth] DISPMAX 20 * Search the nucleotide database for entries by the author Boguski, whose initials are M, or M and any other character, and display the most recent 20 in the default Entrez Document Summary format. (If no initials are included in an Author Field search, e.g., boguski [auth], the system will retrieve records by all authors with that last name, regardless of their initials.) DB p TERM laryng* [titl] * Search the protein database for records containing a term that begins with the characters 'laryng' in the title field, and display the most recent 200 as a document summary (default number and format). ========================= For ranges of values in numerical fields 3.2.4 RANGE SEARCHES such as accession number, sequence length, ========================= and date. Note that sequence length must be written with six digits. (see section 4.3.4 for more info) DB n TERM U12345:U12350 [accn] DOPT f * Search the nucleotide database for records with accession numbers ranging from U12345 through U12350 and display them in FASTA format. 
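The six-digit rule for sequence lengths noted in the heading of section 3.2.4 is easy to get wrong by hand. The following Python sketch shows one way to pad both ends of a range; the function name slen_range is an assumption made for this example and is not part of the QUERY service.

```python
def slen_range(min_len, max_len, width=6):
    """Format a sequence-length range for the [SLEN] field (hypothetical helper).

    Both ends are zero-padded to six digits, as this document requires
    for sequence-length searches.
    """
    if not 0 <= min_len <= max_len:
        raise ValueError("need 0 <= min_len <= max_len")
    if max_len >= 10 ** width:
        raise ValueError("length does not fit in %d digits" % width)
    # A colon separates the two ends of a range.
    return f"{min_len:0{width}d}:{max_len:0{width}d} [SLEN]"


if __name__ == "__main__":
    print("TERM " + slen_range(50000, 100000))
    print("TERM " + slen_range(1, 75))
```

The output of slen_range can be placed on the TERM line of a query as-is.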
DB n
TERM 050000:100000 [SLEN]
DOPT d

* Search the nucleotide database for records with a sequence length between 050000 and 100000 bp and display the document summaries for the most recent 200 records (default number).

DB n
TERM 200000:999999[SLEN]
dopt f

* Search the nucleotide database for records with a sequence length of 200000 base pairs or longer, and display them in FASTA format. (Note that GenBank records generally have a maximum length of 350000 bp, although there are a few exceptions, in order to ensure compatibility with various software programs. See Sept. 1995 NCBI News, "GenBank Enters Megabase Era," and Section 1.4.1 of the GenBank 107.0 release notes for more information about the 350 kb limit.)

DB n
Term 000001:000075 [SLEN] NOT gbdiv est [prop]
dopt g
DISPMAX 30

* Search the nucleotide database for records with a sequence length of 75 base pairs or less, but which are NOT in the EST division of GenBank, and display the most recent 30 records in GenBank format. (Note that the minimum length for general GenBank submissions is 50 bp.)

DB n
TERM Caenorhabditis elegans[ORGN] AND 1998/01/05:1998/02/23 [MDAT]
DISPMAX 350

* Search the nucleotide database for C. elegans records modified or added to the database between Jan. 5, 1998 and Feb. 23, 1998, and display a maximum of 350 in the default document summary format.

DB m
TERM alzheimer [titl] AND 1998/03:1998/05 [EDAT]

* Search the PubMed MEDLINE database for records containing the term 'alzheimer' in the title, and ADDED to the database between March 1998 and May 1998, and display up to the most recent 200 in document summary format.

DB m
TERM alzheimer [titl] AND 1998/03:1998/05 [PDAT]

* Search the PubMed MEDLINE database for records containing the term 'alzheimer' in the title, and PUBLISHED between March 1998 and May 1998, and display up to the most recent 200 in document summary format.

=========================================================================
4.
DETAILED INFORMATION AND SEARCH TIPS ========================================================================= ========================================================================= 4.1 DATABASE DOMAINS ========================================================================= DB must be the first line of the query. The order of the other lines does not matter. Database is a single-character representation of the database: 'm' for MEDLINE 'n' for nucleotide 'p' for protein 't' for 3D structure. 's' for sequence if you do not know whether the entry you want is in the nucleotide or protein database; Entrez will look in both databases. Nucleotide Sequences: GenBank, EMBL, DDBJ, dbEST, dbSTS, dbGSS Protein Sequences: GenPept (translated coding regions from DNA), PIR, SWISSPROT, PRF, PDB (sequences from solved structures) 3-D Structures: Molecular Modeling Database (MMDB) database, derived from Brookhaven National Laboratory's Protein Data Bank (PDB) PubMed MEDLINE: bibliographic records from complete MEDLINE, PreMedline, and publisher supplied citations ========================================================================= 4.2 SEARCHES BY UNIQUE IDENTIFIERS (UID) ========================================================================= UID searches can be used to retrieve specific records from the database for which you know the unique identifiers. ========================= 4.2.1 UID/FASTA FORMAT ========================= UID/FASTA can be any of the following: * the MEDLINE UID (Unique Identifier) of the desired entry; * the Sequence ID (GI) of the entry; * the GenBank Accession Number of the entry; or * A FASTA specification (see below) for the entry. ========================= 4.2.2 MULTIPLE UIDS ========================= More than one UID, Accession Number, or FASTA specification can be given; separate them by commas (without spaces). 
If you specify more than one entry, you may wish to use the 'd' display option (see section 4.4) in order to view summary information for each entry before seeing detailed information on any one. To view the full records in the default format, you may omit the DOPT line from the query.

=========================
4.2.3 GENBANK ACCESSION NUMBER VS. SEQUENCE ID (GI NUMBER)
=========================

An accession number applies to the complete record and is usually a combination of one or more letters and numbers, such as a single letter followed by five digits (e.g., U12345), or two letters followed by six digits (e.g., AF123456). Accession numbers do not change, even if information in the record is changed at the author's request.

The gi number refers only to the sequence within the record. There are two kinds of gi numbers: NID (nucleotide sequence ID number) and PID (protein sequence ID number). If the sequence is revised by the authors, the sequence ID number changes but the accession number of the record does not.

E.g., if a GenBank record has one DNA sequence and three amino acid translations, it will have one accession number for the whole record and four gi numbers, one for each sequence. One of those gi numbers will be an NID (nucleotide sequence ID), and three of the gi numbers will be PIDs (protein sequence IDs). If there is a revision to the DNA sequence, it will receive a new NID. If that change did not affect the protein translations, the PIDs will stay the same. However, if, for example, the DNA sequence change affects the second amino acid translation, the second PID will change, too.

In early 1999, sequence identification numbers will be written in a new accession.version format. For additional information about gi numbers and the forthcoming accession.version format, please refer to section 1.4.6 of the release notes for GenBank 107.0, June 15, 1998.
Our WWW home page (http://www.ncbi.nlm.nih.gov) provides access to the current GenBank release notes, under GenBank/Overview. The file is updated with each new release, around the 15th day of February, April, June, August, October, and December. Past release notes are accessible at ftp://ncbi.nlm.nih.gov/genbank/release.notes/ .

=============================
4.2.4 FASTA FORMAT FOR UIDs
=============================

The UID for a sequence (protein or nucleotide) entry can be specified using the FASTA format, if desired. FASTA Formatted UIDs are of the general form:

* database_name | id1 | id2

Where id1 and id2 are identifier fields appropriate to the database. Normally only one of the fields is used, even if both are filled in. If a field is not used, you must still put in the specified number of vertical bar separators (e.g. database_name | | id2).

Supported FASTA specifications include:

name         format to use
-----------------------------------------------------------------------
gi           gi|integer
genbank      gb|accession|locus
embl         emb|accession|locus
ddbj         dbj|accession|locus
pir          pir|accession|name
             (note: only the name field is indexed for PIR)
swissprot    sp|accession|name
patent       pat|country|patent number (string)|seq number (integer)
prf          prf|accession|name
pdb          pdb|entry name (string)|chain id (single character)
gibbsq       bbs|integer
gibbmt       bbm|integer

The parenthesized notes, e.g. (string), should not be included in the specification; they are there merely to indicate what type of data the format expects.

===========================================
4.2.5 EXAMPLES of UID and FASTA searches
===========================================

DB n
UID U30150

* Search the nucleotide database for an entry whose accession number is U30150. Since no DOPT line is present, the record will be displayed in the default GenBank format.
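The separator rule from section 4.2.4 (keep every vertical bar, even when a field is left empty) can be sketched in Python as follows. The names FIELD_COUNTS and fasta_uid are assumptions made for this example; the field counts are read off the table of supported FASTA specifications, and the sketch writes the identifier without spaces around the bars, as in the sp|P11598| example used in this document.

```python
# Number of identifier fields after the database tag, taken from the
# table of supported FASTA specifications in section 4.2.4.
FIELD_COUNTS = {
    "gi": 1, "gb": 2, "emb": 2, "dbj": 2, "pir": 2, "sp": 2,
    "pat": 3, "prf": 2, "pdb": 2, "bbs": 1, "bbm": 1,
}


def fasta_uid(tag, *fields):
    """Build a FASTA formatted UID (hypothetical helper).

    Unused trailing fields are left empty, but their vertical bar
    separators are still written.
    """
    if tag not in FIELD_COUNTS:
        raise ValueError("unknown database tag: " + tag)
    expected = FIELD_COUNTS[tag]
    if len(fields) > expected:
        raise ValueError("too many fields for " + tag)
    # Pad with empty strings so the number of bars is always the same.
    padded = [str(f) for f in fields] + [""] * (expected - len(fields))
    return "|".join([tag] + padded)


if __name__ == "__main__":
    # Swiss-Prot accession with the name field left empty.
    print("UID " + fasta_uid("sp", "P11598"))
```

To fill only the second field, pass an empty string for the first, e.g. fasta_uid("pir", "", name), where name stands for a PIR entry name.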
DB n UID U50871,D31839,AF030413,X13467 DOPT f * Search the nucleotide database for records with accession numbers U50871, D31839, AF030413, and X13467, and display them in FASTA format. DB m UID 88055872 DOPT r HTML * Search the MEDLINE database for the record with MEDLINE UID 88055872 and display it in MEDLINE Report format. Send the results in HTML format for viewing through a WWW browser. DB p UID sp|P11598| DOPT m * Search the protein database, using a FASTA formatted UID, to retrieve the entry whose Swiss-Prot accession number is P11598, and display the MEDLINE links for that protein record as document summaries. DB p UID sp|P11598| DOPT ml * Search the protein database, using a FASTA formatted UID, to retrieve the entry whose Swiss-Prot accession number is P11598, and display the MEDLINE links for that protein record in MEDLINE format (DOPT ml : m=MEDLINE record and l=MEDLINE format). ========================================================================= 4.3 SEARCHES BY TEXT TERM ========================================================================= TERM searches can be used to retrieve records that contain your search term(s) in All Fields or in a specified field. The sections below discuss various options for TERM searches, including the use of Field Specifiers (section 4.3.1), Boolean Operators (section 4.3.2), Truncation (section 4.3.3), and Ranging (4.3.4). ========================================================================= 4.3.1 FIELD SPECIFIERS FOR TERM SEARCH ========================================================================= If you do not include a field specifier after a search term, All Fields will be searched by default. If you would like to limit retrieval to records that contain the search term in a specific field, such as the Title Word or Author Name field, the field specifiers below can be used. Field specifiers can be written in upper or lower case, but must be enclosed in square brackets. 
E.g., [Auth], [AUTH], and [auth] will be treated the same. It does not matter if there is a space between the search term and field specifier. However, there must be a space on both sides of a Boolean operator. E.g., all the searches below will work: boguski ms[auth] boguski ms [auth] boguski ms[auth] AND 1997 [pdat] ======================================= 4.3.1.1 FIELD SPECIFIERS, BY DATABASE ======================================= Entrez Field designation can be: * for MEDLINE: AFFL (or AD), ALL, AUTH (or AU), ECNO (or RN), EDAT, IP, JOUR (or TA), LA, MAJR, MESH (or MH), PG, PS, PDAT (or DP), PTYP (or PT), SH, SUBS (or NM), SUBSET (or SB or FILTER), WORD (or TW), TITL (or TI), VI * for Nucleotide: ACCN, ALL, AUTH (or AU), ECNO (or RN), FKEY, GENE, IP, JOUR, KYWD, MDAT, ORGN, PAGE (or PG), PACC, PROP, PROT, PDAT, SLEN, SUBS, WORD, TITL, VI * for Protein: ACCN, ALL, AUTH (or AU), ECNO (or RN), GENE, IP, JOUR, KYWD, MDAT, ORGN, PAGE (or PG), PACC, PROP, PROT, PDAT, SLEN, SUBS, WORD, TITL, VI * for Structure: ACCN, ALL, AUTH (or AU), ECNO (or RN), IP, JOUR, ORGN, PAGE (or PG), PDAT, SUBS, WORD, VI ======================================= 4.3.1.2 FIELD DESCRIPTIONS ======================================= - Accession Number [ACCN] This field is present in nucleotide sequence, protein sequence, and structure records. An accession number applies to the complete record and for nucleotide and protein sequence records, it is usually a combination of a letter(s) and numbers, such as a single letter followed by five digits (e.g., U12345), or two letters followed by six digits (e.g., AF123456). Accession numbers for structure records are usually a single digit followed by three letters (e.g., 1AMC). Some sequence records might contain more than one number in the ACCESSION field, e.g., if two or more records have been merged or if older accessions have become secondary to a new accession for various reasons. 
The ACCN field will retrieve the record by any one of the numbers present in the ACCESSION field. In contrast, the Primary Accession [PACC] field, described below, retrieves records only by the first number listed in the ACCESSION field. (Section 4.2.3, GENBANK ACCESSION NUMBER VS. SEQUENCE ID, contains additional information about accession numbers versus sequence identification numbers.)

- Affiliation [AFFL or AD]
This field is present in MEDLINE records and contains the institutional affiliation and address of the first author. The field can also be used to search by grant numbers (e.g., LM05545/LM/NLM [ad]). All three pieces of a grant number (actual number, grant acronym, and institute mnemonic) are individually searchable. In the sequence databases, institutional affiliation of authors is often included in the last reference noted in the record. However, that field is not searchable using the [AFFL] or [AD] field specifier. Rather, you can do a search for an institution's name in All Fields or the Text Word field (e.g. Caltech [all], or Caltech [word]). Note, however, that you might get some false hits.

- All Fields [ALL]
Includes all searchable fields.

- Author Name [AUTH or AU]
For MEDLINE records, up to 25 authors are provided (current NLM author indexing policy). Full names are not listed. Rather, the format to search for an author name is: last name, followed by a space and up to the first two initials, without periods (e.g., fauci as). Initials may be omitted when searching. Entrez automatically truncates on an author's name to account for varying initials, e.g., o'brien j [au] will retrieve o'brien ja, o'brien jb, o'brien jc jr, as well as o'brien j. To turn off this automatic truncation, enclose the author's name in double quotes and qualify with [au] in brackets, e.g., "o'brien j" [au] to retrieve just o'brien j.
- EC/RN Number [ECNO or RN] Number assigned by the Enzyme Commission to designate a particular enzyme or by the Chemical Abstracts Service (CAS) for Registry Numbers. - Entrez Date [EDAT] Date a bibliographic citation was added to the PubMed MEDLINE database. Citations are displayed in Entrez Date order which is last in, first out. Dates or date ranges must be entered using the format YYYY/MM/DD [edat], e.g. 1998/04/06 [edat] . The month and day are optional (e.g., 1998[edat] or 1998/03 [edat]). To enter a date range, insert a colon (:) between each date (e.g., 1996:1997 [edat] or 1998/01:1998/04 [edat]). Note: The Entrez Date will remain unchanged and is not updated to reflect the date a Publisher Supplied record is elevated to PREMEDLINE or when a PREMEDLINE record is elevated to MEDLINE. Therefore, use caution when your strategy includes only MeSH terms and a date or date range using the search field tag, [edat], because the addition of MeSH terms to a record will not change the Entrez Date [edat]. - Feature Key [FKEY] Is a keyword denoting a particular DNA feature (e.g., coding region, primer bind, promoter). To see the list of terms in the Feature Key index, connect to WWW Entrez (http://www.ncbi.nlm.nih.gov/Entrez/), select the nucleotide database, Search Field = Feature Key, Search Mode = List terms, and type an 'a' (without the quotes) in the text box to see the top of the index and then scroll down. Features are defined in the Sequin documentation (http://www.ncbi.nlm.nih.gov/Sequin/sequin.hlp.html) and in Section 3.5.12.1 of the GenBank release notes. - Gene Symbol [GENE] The standard name for a given gene. If you cannot find a gene using Gene Symbol, or you would like to broaden your retrieval, try using All Fields or Text Words instead. - Issue [IP] The number of the journal issue in which the article is published. 
- Journal Title [JOUR or TA]
The journal title abbreviation, full journal name or ISSN number (e.g., J Biol Chem, Journal of Biological Chemistry, 0021-9258). The Journal Browser is also available from the sidebar of WWW PubMed (http://www.ncbi.nlm.nih.gov/PubMed) to look up the full name, abbreviation, and ISSN number of a journal.

- Keywords [KYWD]
Allows you to search using index terms associated with the GenBank, EMBL, DDBJ, SWISS-Prot, PIR, PRF, or PDB databases. If you are not familiar with the keywords used in these databases, this field may not be useful to you. Also, the Keyword field of many records is blank, so the Text Word field (see below) is preferable for searching.

- Language [LA]
The language in which the article was published. Note that many non-English articles have English language abstracts. Note: You can also enter just the first three characters of the language, e.g., chi [la] retrieves the same as chinese [la]. The lone exception is jpn [la] for Japanese.

- MeSH Major Topic [MAJR]
A MeSH term that is one of the main topics discussed in the article. See MeSH Terms below.

- MeSH Terms [MH]
MeSH Terms includes all of the terms in the National Library of Medicine (NLM) Medical Subject Headings, a controlled vocabulary of keywords used to index MEDLINE. Each MEDLINE citation is given a group of MeSH terms that relate to the subject of the paper from which it is drawn. Frequently, MeSH terms will have an additional term, called a "subheading", which further defines how the MeSH term relates to the article it is associated with. This subheading is appended to the MeSH term, e.g. "pneumonia diagnosis". Searching on the MeSH term (here, pneumonia) will retrieve all of the articles that use that MeSH term, whether they have subheadings or not. Use the subheading terms if you require more specificity than the MeSH term allows.
Note: MeSH terms searched for using the MeSH Terms or MeSH Major Topic fields are automatically "exploded" by WWW Entrez; that is, all terms which are logical subsets of the term entered are included. For instance, "pneumococcal infections" includes "streptococcus pneumoniae". MeSH terms found using the "All Fields" search are NOT exploded. The WWW PubMed help document (accessible from http://www.ncbi.nlm.nih.gov/PubMed) provides detailed information on the use of MeSH terms, MeSH/subheading combinations, how to turn off the automatic inclusion of the more specific terms using the [field:noexp] syntax (e.g., hypertension [mh:noexp]), etc. WWW PubMed also provides access to a MeSH browser.

- Modification Date [MDAT]
Contains the date that the record was last touched, and in some cases corresponds to the date on which a record was made available in the public database. It is written in the format year/month/day, as for Publication Date, see below.

- Organism [ORGN]
Contains the scientific and common names for the organisms associated with protein and nucleotide sequences. Organism names are "exploded" much like MeSH terms; for instance, searching on "mammalia" will find all entries indexed under any mammal.

- Primary Accession Number [PACC]
This field is available when searching nucleotides and proteins. Some sequence records might contain more than one number in the ACCESSION field of a record. That is true, for example, if two records have been merged*, or if original accessions have become secondary to a newer accession**. The PACC field retrieves the record in which the accession number you have indicated appears as the primary accession (i.e., the first number in the ACCESSION field of the GenBank record). In contrast, the Accession Number [ACCN] field described earlier retrieves records that contain the given accession number in any position within the ACCESSION field of the record.
* Two records might be merged, for example, if an author inadvertently submitted the sequence to two databases, and received an accession number from each one. The ACCN field will retrieve the record with any of the accessions. The PACC field will retrieve the record only with the first accession.

** One example in which original accessions might become secondary to a new one is if an author originally submits a number of records, each containing a separate exon from genomic DNA, and the same author later submits a longer, contiguous piece of genomic DNA that also contains the exons previously reported. In that case, the new record will supersede the earlier records, but will contain the original accession numbers as secondary to the new accession.

- Page Number [PAGE or PG]
Enter only the first page number that the article appears on. The citation will display the full pagination of the article but this field is searchable using only the first page number.

- Personal Name as Subject [PS]
Use this search field tag to limit retrieval to where the name is the subject of the article, e.g., varmus h [ps]. The search rules for Author [au] apply to this field, see Author Name field. To restrict a search to the Personal Name as Subject field, users must include the search field tag, [ps]. Note: This field is not in the Search Fields pull-down menu in the Advanced Search Mode of WWW PubMed because the data are actually indexed as part of the Text Word [tw] search field.

- Property [PROP]
Property is one or more terms that denote the type of sequence a record contains.
Some examples of properties are:
  - GenBank division, such as gbdiv_est [prop];
  - molecule type, such as biomol_genomic [prop];
  - location of gene, such as gene_in_mitochondrion [prop];
  - information about the biological source, such as src_pop_variant [prop];
  - source database, such as srcdb_pdb [prop]
To see the list of terms in the Properties index, connect to WWW Entrez (http://www.ncbi.nlm.nih.gov/Entrez/), select the nucleotide database, Search Field = Properties, Search Mode = List terms, and type an 'a' (without the quotes) in the text box to see the top of the index and then scroll down.

- Publication Date [PDAT or DP]
Publication Date contains the date that the article was published (for PubMed citations) or the date that the record was added to the public GenBank database (for nucleotide sequence records), in the format YYYY/MM/DD, e.g., 1984/10/06 [pdat]. The month and day are optional. A year alone (e.g., 1984 [PDAT]) will retrieve all articles for that year; a year and month (e.g., 1984/03 [pdat]) will retrieve all for that month. To enter a date range, insert a colon (:) between the dates (e.g., 1996:1998 [pdat], or 1998/01/15:1998/02/17 [pdat]).
Note: Journals vary in the way the publication date appears on an issue. Some journals include just the year, whereas others include the year plus month or year plus month plus day. Some journals use the year and season (e.g., Winter 1997). The publication date in the citation is recorded as it appears in the journal. It is recommended that you search only by year (e.g., 1996 [dp] or 1995:1997 [dp]).

- Publication Type [PTYP or PT]
Describes the type of material the article represents (e.g., Review, Clinical Trial, Retracted Publication, Letter).
The full list of publication types is available by using WWW PubMed (http://www.ncbi.nlm.nih.gov/PubMed/medline.html), Search Field = Publication Type, Search Mode = Selection, and type an 'a' (without the quotes) in the text box to see the top of the index and then scroll down. - SeqId is the special string identifier, similar to a FASTA identifier, for a given sequence. To retrieve a record using a FASTA identifier, use a UID type of search (no field specifier necessary). - Subheadings [SH] Subheadings are used with MeSH terms to help describe more completely a particular aspect of a subject. For example, the drug therapy of asthma is displayed as asthma/drug therapy, see MeSH/Subheading Combinations. The Subheading field allows users to "free-float" subheadings, e.g., hypertension [mh] AND toxicity [sh]. Subheadings automatically include the more specific subheading terms under the term in a search. To turn off this automatic feature, use the search syntax [sh:noexp], e.g., therapy [sh:noexp]. In addition, you can enter the MEDLINE two-letter subheading abbreviations rather than spelling out the subheading, e.g., dh [sh] = diet therapy [sh]. 
- Subset of PubMed [SUBSET or SB or FILTER]
The values for this field can be the following (case does not matter):
  MEDLINE
  PreMEDLINE
  Publisher
  AIDS
The PubMed database contains four subsets of bibliographic records:
(1) MEDLINE records, which have undergone MeSH indexing and quality control by NLM staff;
(2) PreMEDLINE records, which provide basic citation information and abstracts before the full records are prepared and added to MEDLINE;
(3) Publisher supplied citations that have been supplied electronically to the NLM by publishers and (a) have not yet been processed for PreMEDLINE, or (b) will never be in MEDLINE because they are not within the scope of the database, but are included in PubMed because they are present in journals selectively indexed for MEDLINE;
(4) AIDS, a subset of PubMed created by running all PubMed citations through a special AIDS search filter that uses the search strategy developed for creating NLM's AIDSLINE database.

- Substance Name [SUBS or NM]
The name of a chemical discussed in the article (MEDLINE Name of Substance field). Synonyms to the Names of Substances will automatically map when qualified with [nm]. This field was implemented in mid-1980. Many chemical names are searchable as MeSH terms before that date.

- Text Words [WORD or TW]
Includes all of the "free text" associated with a record, specifically:
MEDLINE records: title, abstract, MeSH terms, subheadings, chemical substance names, personal name as subject, and MEDLINE Secondary Source (SI) field. (In MEDLINE, the Secondary Source Identifier field (SI) contains the genetic databank label and accession number, e.g., GENBANK/AA001794. In PubMed, these data are searchable using only the accession number, e.g., AA001794 [word].)
Protein records: definition, comment, protein name, and protein description.
Nucleotide records: definition, comment, gene name, and gene description.
- Title Words [TITL or TI]
For PubMed records, includes only those words found in the title of an article. For sequence records, includes only those words found in the definition line of a record.

- Volume [VI]
The number of the journal volume in which an article is published.

- PubMed Identifier (PMID) & MEDLINE Unique Identifier (UI)
To search the PubMed databases for either the PMID or UI, simply type the number without a search field qualifier. You can search for several ID numbers by entering each number in the query box separated by a space (e.g., 95091318 97465762); Entrez will OR the terms together.

===========================================
4.3.1.3 EXAMPLES of TERM and FIELD requests:
===========================================

DB p
TERM saccharomyces [orgn]

* Search the protein database for the term "saccharomyces" in the Organism field, then display the most recent 200 entries (the default DISPMAX value) in Entrez Document Summary format (the default DOPT for TERM searches, which shows only titles of retrieved documents rather than the full records). Note: if no field specifier were used, the term 'saccharomyces' would be searched in all fields of the record and might therefore retrieve sequence records from other organisms in which the term 'saccharomyces' was mentioned in a different context.

DB n
TERM boguski ms [auth]
DISPMAX 15
HTML

* Search the nucleotide database for entries by the author M.S. Boguski, and display the most recent 15 in Entrez Document Summary format (default DOPT). Send the results in HTML format for viewing through a WWW browser.

DB n
TERM promoter [fkey]
DOPT d
DISPMAX 25

* Search the nucleotide database for the term 'promoter' in the Feature Key of sequence records, and display the document summary format for the most recent 25 records.
DB m
TERM ras[word]

* Search the MEDLINE database for records containing the word 'ras' in any one of a number of Text Word containing fields, and display the Entrez Document Summary format (default DOPT) for the most recent 200 MEDLINE articles (default number).

DB m
TERM ras[titl]

* Search the MEDLINE database for records containing the word 'ras' in the Title field, and display the Entrez Document Summary format (default DOPT) for the most recent 200 MEDLINE articles (default number).

DB m
TERM Alzheimer disease [mesh]
DISPMAX 10
DOPT pg

* Search the MEDLINE database for records containing the term 'Alzheimer disease' in the Medical Subject Headings (MeSH) field, and display the most recent 10 associated protein records in GenPept format (DOPT pg : p=protein record and g=genpept format).

=========================================================================
4.3.2 BOOLEAN OPERATORS
=========================================================================

The Query server can understand complex Boolean expressions in place of UIDs or single term-and-field constructions.

=========================================
4.3.2.1 BOOLEAN QUERY FORMAT AND OPERATORS
=========================================

Boolean operators, AND, OR, NOT, must be entered in UPPERCASE.

E.g.: vitamin c OR zinc

If a field specifier is not included, that term will be searched in All Fields by default. Field specifiers described in section 4.3.1 can follow terms if desired, so the query format can be:

term [field] BOOLEAN term [field] BOOLEAN term [field]...

E.g.: vitamin c [titl] OR zinc [titl]

Boolean expressions are normally processed left to right. If you want part of your Boolean expression to be processed out of order, enclose it in parentheses. The terms inside the set of parentheses will be processed as a unit and then incorporated into the overall strategy.
E.g.: common cold AND vitamin c OR zinc
retrieves records that contain both the terms common cold AND vitamin c, or records that contain only the term zinc.

E.g.: common cold AND (vitamin c OR zinc)
retrieves records that contain the term common cold, and either the term vitamin c OR the term zinc.

==================================
4.3.2.2 EXAMPLES OF BOOLEAN QUERIES
==================================

DB n
TERM p21[word] AND J Biol Chem[jour]
DOPT n

* Search the nucleotide database for entries containing the text word 'p21' and the journal name "J Biol Chem." Display their nucleotide neighbors (the default number of the most recent 200) as document summaries.

DB n
TERM (Mus musculus[ORGN] AND 1998/05[MDAT]) NOT gbdiv_est[PROP]
DOPT g
DISPMAX 150

* Search the nucleotide database for mouse sequence records modified or added to the nucleotide database in May 1998, but not records in the EST division of GenBank, and display the most recent 150 in GenBank format.

DB p
TERM tgf[word] OR (transcription[word] AND growth[word] AND factor[word])
DOPT d

* Search the protein database for records containing the abbreviation 'tgf' or the fully spelled words 'transcription, growth, and factor' in the text word fields, and display the results in document summary format. (Note: The terms 'transcription, growth, and factor' will not necessarily be adjacent to each other. Rather, the system will retrieve records that contain all three terms.)

DB m
TERM (sensory[titl] OR sense[titl]) AND receptor[titl]
DOPT n

* Search the PubMed MEDLINE database for records containing the word 'sensory' or 'sense,' and the word 'receptor' in the Title field, and display a list of the associated nucleotide records.

=========================================================================
4.3.3 TRUNCATION
=========================================================================

All of the terms that begin with a given string can be searched on by appending an asterisk (*) to the end of the term.
For instance, 'rhizop*' will retrieve all terms that begin with the characters 'rhizop'.

EXAMPLES:

DB n
TERM boguski m* [auth]
DISPMAX 20

* Search the nucleotide database for entries by the author Boguski, whose initials are M, or M and any other character, and display the most recent 20 in the default Entrez Document Summary format. (If no initials are included in an Author Field search, e.g., boguski [auth], the system will retrieve records by all authors with that last name, regardless of their initials.)

DB p
TERM laryng* [titl]

* Search the protein database for records containing a term that begins with the characters 'laryng' in the title field, and display the most recent 200 as a document summary (default number and format).

=========================================================================
4.3.4 RANGING
=========================================================================

Ranging can be used for numerical fields such as accession number, date, and sequence length. The symbol for ranging is the colon ( : ). The search should be constructed in a manner similar to a Boolean search. When ranging on sequence length, the numbers specified should be six digits; leading zeros are significant. For example, 1000 base pairs should be written as 001000.

EXAMPLES:

DB n
TERM U12345:U12350 [accn]
DOPT f

* Search the nucleotide database for records with accession numbers ranging from U12345 through U12350 and display them in FASTA format.

DB n
TERM 050000:100000 [SLEN]
DOPT d

* Search the nucleotide database for records with a sequence length between 050000 and 100000 bp and display the document summaries for the most recent 200 records (default number).

DB n
TERM 200000:999999[SLEN]
DOPT f

* Search the nucleotide database for records with a sequence length of 200000 base pairs or longer, and display them in FASTA format. (Note that GenBank records generally have a maximum length of 350000 bp, although there are a few exceptions, in order to ensure compatibility with various software programs. See Sept. 1995 NCBI News, "GenBank Enters Megabase Era," and Section 1.4.1 of the GenBank 107.0 release notes for more information about the 350 kb limit.)

DB n
TERM 000001:000075 [SLEN] NOT gbdiv_est [prop]
DOPT g
DISPMAX 30

* Search the nucleotide database for records with a sequence length of 75 base pairs or less, but which are NOT in the EST division of GenBank, and display the most recent 30 records in GenBank format. (Note that the minimum length for general GenBank submissions is 50 bp.)

DB n
TERM Caenorhabditis elegans[ORGN] AND 1998/01/05:1998/02/23 [MDAT]
DISPMAX 350

* Search the nucleotide database for C. elegans records modified or added to the database between Jan. 5, 1998 and Feb. 23, 1998, and display a maximum of 350 in the default document summary format.

DB m
TERM alzheimer [titl] AND 1998/03:1998/05 [EDAT]

* Search the PubMed MEDLINE database for records containing the term 'alzheimer' in the title, and ADDED to the database between March 1998 and May 1998, and display up to the most recent 200 in document summary format.

DB m
TERM alzheimer [titl] AND 1998/03:1998/05 [PDAT]

* Search the PubMed MEDLINE database for records containing the term 'alzheimer' in the title, and PUBLISHED between March 1998 and May 1998, and display up to the most recent 200 in document summary format.

=========================================================================
4.4 DISPLAY OPTIONS
=========================================================================

Display Option (DOPT) is a code used to specify the record format you would like to see in the search results. DOPT can also be used to display 'neighbors' of the retrieved records (related records from the same database), or 'links' of the retrieved records (associated records from the other Entrez databases).
It consists of any of the following:

1) one 'Record format' character
OR
2) one 'Neighbors/Links' character
OR
3) one 'Neighbors/Links' character followed by one 'Record format' character that applies to the type of database whose records you want to display.

For example, if you search the nucleotide database for records containing the term 'leptin receptor' in the title field, then:

DB n
TERM leptin receptor [titl]
DOPT ...

DOPT g will show the retrieved nucleotide records in GenBank format;
DOPT f will show the retrieved nucleotide records in FASTA format;
DOPT p will show the protein records that are linked to the nucleotide records retrieved by your search, and will display those protein records in the default document summary format;
DOPT pf will show the protein records that are linked to the nucleotide records retrieved by your search, and will display those protein records in FASTA format;
DOPT m will show the MEDLINE records that are linked to the nucleotide records retrieved by your search, and will display them in the default document summary format;
DOPT ml will show the MEDLINE records that are linked to the nucleotide records retrieved by your search, and will display them in the MEDLINE format.

The sections below show the record formats available for each database.

====================================================
4.4.1 DEFAULTS:
====================================================

The "dopt" line of the query can be omitted.
In that case, the following defaults will be used:

UID searches for designated records in the MEDLINE domain: 'r'
UID searches for designated records in the nucleotide or protein domains: 'g'
UID searches for neighbors or links: 'd'
TERM searches: 'd'

====================================================
4.4.2 RECOMMENDED FORMATS:
====================================================

If DOPT is included in the search request, we recommend using the record format 'r' for single or few MEDLINE records, and 'g' for single or few protein or nucleotide sequence entries. For multiple documents in any database, or for neighbor and link searches, we recommend using 'd' to first retrieve a title list of records. Then complete records of interest from the list can be retrieved by searching for their UIDs and displaying them in the desired format.

IMPORTANT! Searches using TERMS and fields can return a lot of documents. If you expect that your search will return more than a handful of documents, you will almost always want to use "dopt d" or omit the "dopt" portion, in order to have a manageably small display to read.
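The defaults listed in section 4.4.1 can be summarized as a small lookup. The following Python sketch is illustrative only: the function name and its arguments are hypothetical and are not part of the Query server. It covers only the cases that section 4.4.1 states explicitly.

```python
from typing import Optional


def default_record_format(db: str, search_type: str,
                          link: Optional[str] = None) -> str:
    """Return the record format applied when no format is given in DOPT.

    db          -- database code from the DB line: 'm', 'n' or 'p'
    search_type -- 'UID' or 'TERM'
    link        -- a Neighbors/Links character if one was requested, else None
    (All three parameter names are assumptions made for this sketch.)
    """
    # TERM searches, and any request for neighbors or links, default to the
    # Entrez document summary format.
    if search_type == "TERM" or link is not None:
        return "d"
    if search_type == "UID":
        if db == "m":
            return "r"  # citation report format for MEDLINE records
        if db in ("n", "p"):
            return "g"  # GenBank / GenPept format for sequence records
    # Section 4.4.1 does not list a default for other combinations.
    raise ValueError("no documented default for this combination")
```

For example, `default_record_format("n", "UID")` gives `'g'`, while the same UID search with `link="p"` gives `'d'`.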
====================================================
4.4.3 For MEDLINE entries (DB m), Display Option (DOPT) can be:
====================================================

Record formats:
* 'r' Citation report format (Citation, title, abstract, indexing terms)
* 'b' Abstract format (Citation, title, abstract only)
* 'l' MEDLINE format
* 'a' ASN.1 format
* 'u' PubMed Unique Identifiers
* 'um' MEDLINE Unique Identifiers
* 'q' quantity (number) of entries, only
* 'd' Entrez document summary format (titles)

Neighbors/Links:
* 'm' MEDLINE neighbors
* 'p' protein links
* 'n' nucleotide links
* 't' structure links

====================================================
4.4.4 For nucleotide entries (DB n), the DOPT can be:
====================================================

Record formats:
* 'g' GenBank format
* 'r' Report format
* 'f' FASTA format
* 'a' ASN.1 format
* 'u' GenBank Unique Identifiers (GI's)
* 'q' quantity (number) of entries, only
* 'd' Entrez document summary format (titles)

Neighbors/Links:
* 'm' MEDLINE links
* 'p' protein links
* 'n' nucleotide neighbors
* 't' structure links

====================================================
4.4.5 For protein entries (DB p), the DOPT can be:
====================================================

Record formats:
* 'g' GenPept format
* 'r' Report format
* 'f' FASTA format
* 'a' ASN.1 format
* 'u' GenBank Unique Identifiers (GI's)
* 'q' quantity (number) of entries, only
* 'd' Entrez document summary format (titles)

Neighbors/Links:
* 'm' MEDLINE links
* 'p' protein neighbors
* 'n' nucleotide links
* 't' structure links

====================================================
4.4.6 For structure entries (DB t), the DOPT can be:
====================================================

Record formats:
* 's' Structure summary
* 'u' MMDB Unique Identifiers
* 'q' quantity (number) of entries, only
* 'd' Entrez document summary format (titles)

Neighbors/Links:
* 'm' MEDLINE links
* 'p' protein links
* 'n' nucleotide links
* 't' structure neighbors

====================================================
4.4.7 EXAMPLE DISPLAY OPTIONS
====================================================

DB n
UID U42467

* Display the GenBank format (default) of the nucleotide sequence record with accession number U42467.

DB n
UID U42467
DOPT f

* Display the FASTA format of the nucleotide sequence from U42467.

DB n
UID U42467
DOPT pf

* Display the FASTA format of protein sequence records linked to the nucleotide record U42467.

DB n
UID U42467
DOPT ml

* Display the MEDLINE format of MEDLINE records linked to U42467.

DB m
TERM alzheimer [titl]

* Display the most recent 200 document summaries (the default number and viewing method) for all MEDLINE articles containing the word Alzheimer in their title.

DB m
TERM alzheimer [titl]
DISPMAX 10
DOPT l

* Display the MEDLINE format of the 10 most recently added MEDLINE records that contain the word 'alzheimer' in the title.

DB m
TERM alzheimer [titl]
DOPT pg

* Display the GenPept format of protein records linked to the MEDLINE records which have the word 'alzheimer' in their title.

=========================================================================
4.5 OPTIONAL SEARCH PARAMETERS
=========================================================================

PARAMETER  FUNCTION                                         EXAMPLE

HTML       asks for the output in HTML suitable for         DB n
           loading into a WWW browser.                      TERM leptin [prot]
                                                            DOPT r
                                                            HTML

DISPMAX    indicates the maximum number of articles you     DB n
           wish to have displayed, with the most recent     TERM insulin
           entries being displayed first. If you only       DOPT g
           expect a few articles, you may omit the          DISPMAX 50
           DISPMAX line of the request. Omitting
           DISPMAX causes Entrez to use a default
           maximum of 200. (But note, the maximum number
           of lines in a single Query results message is
           100000.)

PATH       indicates the e-mail address to which you        PATH user@xxx.yyy.zzz
           would like the Query Results message returned.
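As a rough illustration of how the parameters above combine into a message body, the following Python sketch assembles the lines of a single query. The function `build_query` and its argument names are hypothetical; only the keywords DB, TERM, UID, DOPT, DISPMAX, HTML, and PATH come from this document. The order of the optional lines is a choice made for the sketch, since the examples in this document show them in varying order.

```python
from typing import Optional


def build_query(db: str,
                term: Optional[str] = None,
                uid: Optional[str] = None,
                dopt: Optional[str] = None,
                dispmax: Optional[int] = None,
                html: bool = False,
                path: Optional[str] = None) -> str:
    """Return the text of one Query message body, one parameter per line."""
    # A message holds a single query: either a TERM search or a UID search.
    if (term is None) == (uid is None):
        raise ValueError("give exactly one of term or uid")

    lines = [f"DB {db}"]
    lines.append(f"TERM {term}" if term is not None else f"UID {uid}")
    if dopt is not None:
        lines.append(f"DOPT {dopt}")        # omitted -> server default format
    if dispmax is not None:
        lines.append(f"DISPMAX {dispmax}")  # omitted -> default maximum of 200
    if html:
        lines.append("HTML")
    if path is not None:
        lines.append(f"PATH {path}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Reproduces the DISPMAX example from the table above.
    print(build_query("n", term="insulin", dopt="g", dispmax=50))
```

The resulting text would be pasted into the body of a message addressed to the Query server; sending the mail is outside the scope of this sketch.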
========================================================================= 4.6 ESCAPING CHARACTERS ========================================================================= Escaping characters will probably not be necessary for most users. This section applies only to users whose e-mail tools do not properly handle query messages that contain spaces or other special characters. For example, in some e-mail systems, the search phrase "hiv protease" might be broken up at the internal space by the mailer. In that case, you can "escape" the characters in the Query request by replacing spaces with plus signs (+). So you would use "hiv+protease" (quotes not necessary). As another example, some mail tools might not properly handle messages that contain special characters, such as an exclamation mark (!). In that case, you may "escape" the characters in the Query request by replacing any special characters with %DD, where DD is the hexadecimal equivalent of the character replaced. To replace an exclamation mark, for example, you would use "%21" (without the quotes). You need an ASCII table to find hexadecimal equivalents, but such tables are widely available. Please note, however, that most queries should not include special characters such as exclamation marks. Please see the sections above for sample queries. The Query search engine understands this 'escaping' for spaces and special characters, and will translate it properly. ========================================================================= Questions to: info@ncbi.nlm.nih.gov
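
The escaping rules in section 4.6 (plus signs for spaces, %DD for other special characters) correspond closely to what Python's standard `urllib.parse.quote_plus` produces. The sketch below is only one way to apply them. The helper name `escape_query` is hypothetical, and leaving brackets, colons, asterisks, slashes, and parentheses unescaped is an assumption made here so that field specifiers, ranges, truncation, and grouping stay readable.

```python
from urllib.parse import quote_plus


def escape_query(text: str) -> str:
    """Escape a search phrase for mailers that mishandle spaces or symbols."""
    # quote_plus turns each space into '+' and each other unsafe character
    # into %DD (hexadecimal). The `safe` list is an assumption of this
    # sketch: it keeps the query syntax characters untouched.
    return quote_plus(text, safe="[]:*/()")


if __name__ == "__main__":
    print(escape_query("hiv protease"))
    print(escape_query("wow!"))
```

With these inputs the first call yields `hiv+protease` and the second yields `wow%21`, matching the two examples given in section 4.6.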