
PubChem Help

This document provides tips and examples for searching the three PubChem databases by text term or keyword, as well as tips for searching PubChem Compound by chemical properties. The help document for structure search provides tips on using chemical information with the basic and advanced options in PubChem Structure Search. In addition, the PubChem Deposition Gateway help document provides procedures and instructions for users who want to deposit their structure and assay data into the PubChem system. The PubChem Download Facility help document describes how to use the newly implemented PubChem download facility.


 

PubChem Overview

PubChem provides information on the biological activities of small molecules. It is a component of NIH's Molecular Libraries Roadmap Initiative.

PubChem includes substance information, compound structures, and bioactivity data in three primary databases, PCSubstance, PCCompound, and PCBioAssay, respectively.

  • PCSubstance contains more than 19 million records. You can check the count of substance records as of today.

  • PCCompound contains more than 10 million unique structures. You can check the count of compound records as of today.

  • PCBioAssay contains more than 600 bioassays. Each bioassay contains a varying number of data points. You can check the count of bioassay records as of today.

The Substance and Compound databases, where possible, provide links to bioassay descriptions, literature, references, and assay data points. The BioAssay database also includes links back to the Substance and Compound databases. PubChem is integrated with Entrez, NCBI's primary search engine, and also provides compound neighboring, sub/superstructure search, structure similarity search, bioactivity data, and other search features.

PubChem contains substance and bioassay information from a multitude of depositors. You can check the PubChem data source status as of today. 

PubChem Substance Database

The PubChem Substance database contains chemical structures, synonyms, registration IDs, descriptions, related URLs, database cross-reference links to PubMed, protein 3D structures, and biological screening results. If the contents of a chemical sample are known, the description includes links to PubChem Compound.

 Query Examples:

  • Molecule synonym search

    Which substances have "methotrexate" as a part of their molecule name?

        Simply enter methotrexate in the Search textbox on the PubChem homepage or Entrez search page and press the Go button. This returns all substances that have methotrexate as a synonym or as any other indexed keyword.

        Or enter methotrexate[synonym] in the Search textbox and press the Go button. Note: the term in the brackets "[]", such as "[synonym]", is an index field name or alias. For more information about index searches, please see PubChem Indexes and Index Search.

    Which substances have "3'-Azido-3'-deoxythymidine" as their molecule name?

        Enter "3'-Azido-3'-deoxythymidine" (including the quotes) in the Search textbox and press the Go button.
     

  • External ID search

    Which substances have "NSC78" as their DTP/NCI external ID?

        Simply enter "NSC78" in the Search textbox and press the Go button.

        Or enter "78[objectid],dtp[sourcename]" in the Search textbox and press the Go button.

    Which substances have "aids000006" as their external ID in NIAID's anti-HIV chemical database?

        Enter "aids000006" in the Search textbox and press the Go button.

        Or enter "000006[objectid],niaid[sourcename]" in the Search textbox and press the Go button.
     

  • Biology links search

    Which substances have biological activity links?

        1.  Select the Limits tab or go to the Limits page
        2.  In the Specify Required Links section, click the checkbox next to BioAssay and press the Go button.

  • Combined searches

    Which substances contain the element Platinum and have biological activity links?

        1.  Select the Limits tab or go to the Limits page.
        2.  In the Specify Required Links section, click the checkbox next to BioAssay.
        3.  In the Specify Required Elements section, click the checkbox to the left of Pt.
        4.  Press the Go  button in the Search toolbar.
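The text queries in the examples above follow a small number of patterns: a bare keyword, a keyword tagged with an index field name in brackets, and a quoted phrase. The following is a minimal Python sketch of how such query strings could be assembled before pasting them into the Search textbox. The helper names field_term and quoted are hypothetical and are not part of PubChem or Entrez; only the resulting strings come from this document.

```python
def field_term(value: str, field: str) -> str:
    """Attach an index field name or alias to a value, e.g. methotrexate[synonym]."""
    return f"{value}[{field}]"


def quoted(phrase: str) -> str:
    """Wrap a phrase in double quotes, as in the molecule name example."""
    return f'"{phrase}"'


# Query strings taken from the examples above.
synonym_query = field_term("methotrexate", "synonym")
exact_name_query = quoted("3'-Azido-3'-deoxythymidine")
# The external ID example joins two field-tagged terms with a comma.
external_id_query = ",".join(
    [field_term("78", "objectid"), field_term("dtp", "sourcename")]
)

for query in (synonym_query, exact_name_query, external_id_query):
    print(query)
```

The helpers only build text; they do not contact any NCBI service.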

 

PubChem Compound Database

On April 6, 2005, PubChem Compound was updated to a new version that contains unique structures only, and compound identifiers changed as a result. If you are using old-version CIDs as links, please update them. The best way to update links to the PubChem Compound database, based on the new CID identifiers, is simply to rerun the query or search procedure that was used to create these links in the first place. PubChem's query and search tools maintain all features available previously.

The PubChem Compound Database contains validated chemical depiction information that is provided to describe substances in PubChem Substance.

Structures stored within PubChem Compound are pre-clustered and cross-referenced by identity and similarity groups. Additionally, calculated properties and descriptors are available for searching and filtering of chemical structures.

Users can perform a term/keyword search in the same manner as for the Substance database (see above). In addition, the PubChem Compound database provides a chemical property search.

Examples:
  • Molecular weight search

    Which compounds have molecular weight  between 100 and 200?

        Enter 100:200[mw] or 100:200[molecularweight] in the Search textbox and press the Go button.

    Note: The term in the brackets "[]", such as "[mw]", is an index field name or alias. For more information about index searches, please see PubChem Indexes and Index Search.

        Or simply enter 164.2[mw] in the Search textbox and press the Go button to retrieve all compounds with 164.2 as the molecular weight.
     

  • XLogP search

    Which compounds have XLogP between 2.3 and 2.4?

        Enter 2.3:2.4[xlogp] in the Search textbox and press the Go button.
     

  • Heavy atom count search (Heavy atom means all atoms except hydrogen.)

    Which compounds contain 8 heavy atoms?

        Enter 8[heavyatomcount] in the Search textbox and press the Go button. Users can also carry out this search for the Substance database.
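The property searches above share one syntax: either a single value or a low:high range, followed by the index field name in brackets. Here is a short Python sketch that composes these strings. The function names range_term and value_term are hypothetical helpers introduced for illustration; the field names are the ones used in the examples above.

```python
def range_term(low, high, field: str) -> str:
    """Build a range query such as 100:200[mw]."""
    if low > high:
        raise ValueError("the lower bound must not exceed the upper bound")
    return f"{low}:{high}[{field}]"


def value_term(value, field: str) -> str:
    """Build a single-value query such as 8[heavyatomcount]."""
    return f"{value}[{field}]"


# Query strings from the examples above.
print(range_term(100, 200, "mw"))
print(range_term(2.3, 2.4, "xlogp"))
print(value_term(164.2, "mw"))
print(value_term(8, "heavyatomcount"))
```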

The PubChem Compound Limits page provides a quick way to perform complex searches. All of the search examples shown above can also be done on the Limits page. Select the Limits tab or go to the Limits page to begin any of the examples below.

Examples:

  • Chemical property range searches

    Which substances do not violate the "Lipinski Rule of 5"?

        1.  In the Chemical Property Search section:
               a.  For the Molecular Weight (MW) range, type 0 and 500 in the from and to text boxes, respectively.
               b.  For the Hydrogen Bond Donor Count (HBD) range, type 0 and 5 in the from and to text boxes, respectively.
               c.  For the Hydrogen Bond Acceptor Count (HBA) range, type 0 and 10 in the from and to text boxes, respectively.
               d.  For the XLogP range, type -5 and 5 in the from and to text boxes, respectively.

        2.  Push the Go button in the top Search bar
     

  • Simple elemental searches of PubChem Compounds

    Which substances contain Gallium?

        1.  In the Specify Required Elements section, select the checkbox to the left of the Ga atomic symbol
        2.  Push the Go button in the top Search bar

    Which substances contain Carbon, Nitrogen, Oxygen, and Fluorine?

        1.  In the Specify Required Elements  section select the checkboxes to the left of the C, N, O, and F atomic symbols
        2.  Push the Go button in the top Search bar
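The "Lipinski Rule of 5" example above amounts to four range checks applied together. The Python sketch below applies the same ranges to property values you already have in hand. It assumes the bounds are inclusive, which this document does not state, and the CompoundProperties class and the sample values are illustrative only, not PubChem data.

```python
from dataclasses import dataclass


@dataclass
class CompoundProperties:
    molecular_weight: float
    hbond_donors: int
    hbond_acceptors: int
    xlogp: float


# Ranges as typed into the Limits page in the example above
# (inclusive bounds are an assumption).
RULE_OF_FIVE_RANGES = {
    "molecular_weight": (0, 500),
    "hbond_donors": (0, 5),
    "hbond_acceptors": (0, 10),
    "xlogp": (-5, 5),
}


def within_rule_of_five(props: CompoundProperties) -> bool:
    """Return True only if every property falls inside its range."""
    return all(
        low <= getattr(props, name) <= high
        for name, (low, high) in RULE_OF_FIVE_RANGES.items()
    )


# Illustrative values only.
small = CompoundProperties(180.2, 1, 4, 1.2)
large = CompoundProperties(650.0, 1, 4, 1.2)
print(within_rule_of_five(small), within_rule_of_five(large))
```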

 

PubChem BioAssay Database

The PubChem BioAssay Database contains bioactivity screens of chemical substances described in PubChem Substance. It provides searchable descriptions of each bioassay, including descriptions of the conditions and readouts specific to a screening protocol.

Query Help:

  • Searching for PubChem BioAssay datasets

    Select PubChem BioAssay from the pull-down menu. In the Search textbox, enter terms you might expect to find in the description of an assay of interest. The search will consider terms in both the overall description of the assay and in the description of its individual parameters and readouts.

    Examples:

        1.  Searching for yeast cell cycle control finds BioAssay result sets from the NCI Yeast Anticancer Drug Screen.
        2.  Searching for HIV growth inhibition finds the NCI AIDS Antiviral Assay
     

  • Browsing and downloading PubChem BioAssay results
    The PubChem BioAssay result browser helps you to examine descriptions of each assay's parameters and readouts. You may use it to select the parameters and readouts most relevant to the biological activity of interest. An example of how to work with assay data follows.

    Example:

        1.  From the Entrez search page Search bar
            a.  Select PubChem BioAssay from the pull-down menu.
            b.  Type "NCI AIDS Antiviral Assay" (including the quotes) in the textbox.

            You will see a description of the "NCI AIDS Antiviral Assay" within Entrez.

        2.  Click the hypertext link for "AID: 179".

            You will be brought to the "BioAssay Summary" page, where you will see the detailed description of the assay. You can find more help content about the bioassay summary and result browser.

  • Cross BioAssay records search

You may search and compare BioAssay records by results and conditions, and display the readouts for a list of substances and chemical structures found in the BioAssay records.

Example: 

    1.  From the Entrez search page Search bar:
        a. Select PubChem BioAssay from the pull-down menu.
        b. Enter "P388 Leukemia" (including the quotes) in the textbox.

        You will see a list of BioAssay records related to leukemia. Two of the assays, AID 338 and AID 336, are similar: they use slightly different tumor models but the same mouse strain. To compare them:

    2.  Click the hypertext link "AID: 338".
    3.  On the bioassay summary page, click Select.
    4.  In the last filter, "Select Other BioAssays", append ",336" to the textbox so that it reads "338,336", and click Go.

        You will see an updated page showing both BioAssay records.

    5.  On the updated select page, you may specify search criteria for the selected results and conditions, or simply click the Show button. You will be brought to the "Selected BioAssay Results" page, where you can see the combined result table.

  • Using BioAssay graphical tools

In the BioAssay summary, you may view a histogram of the distribution of one assay readout and a scatter plot of the relationship between two assay readouts.

Example 1:

    1.  From the Entrez search page Search bar
        a. Select PubChem BioAssay from the pull-down menu.
        b. Enter prostate in the textbox.

    2.  Click on the hypertext link "AID: 43".
    3.  Click Plot to go to the "Plot BioAssay Results" page.
    4.  You can choose to launch a histogram or a scatter plot by following the instructions on the page. Click to see a histogram.
    5.  To see a scatter plot, select at least two checkboxes, then click the Scatter Plot button.
 

PubChem Summary Display


The PubChem summary results are displayed in three pages: substance, compound, and bioassay summary pages. They provide rich cross links to each PubChem database, other NCBI databases, and depositor's databases. PubChem's default results page is part of the Entrez summary list display system.

PubChem Substance Result Summary:
From the Entrez PubChem Substance database, users can get a substance summary with thumbnails, the corresponding compound ID, depositor source information, etc. You can see an example of a substance result in Entrez.



On this page, users can choose to display brief, summary, ID map, substance neighboring information, synonyms, and other database information from the dropdown list. On the right of the page, users can select from a few pop-up menus (when available) to get structure, BioAssay, and literature links related to this substance. Users can choose either to "display" the search results or to "send" them to "text" or to a "file".

Users can reach more detailed substance information and cross links by clicking the structure image or the ID link. Here is an example of the PubChem Substance Summary page:



This page displays the standardized compound information for the substance. This page also provides a rich set of choices for users to get property data, synonyms, descriptors, comments, cross links, depositor's structure drawing, etc. Power users can even download different data formats, such as ASN.1, XML, and SDF. 
 

PubChem Compound Result Summary:
All compounds have been extracted from deposited substances. Natural-product substances and substances that have no structure have no associated compound records. A substance that is a mixture has a compound record for the mixture and is also associated with one or more component compounds.

From the Entrez PubChem Compound database, users see a compound summary with thumbnails, a few compound properties, etc. Here is an example of a compound result in Entrez.

The page is in the same style as substances. Clicking on thumbnails or CID hyperlink will lead users to the Compound Summary page. Users can find this compound's property data, description, related substance information, neighboring structures, and cross links.

All compounds are structurally unique when compared with each other. One compound may link to many substances.


PubChem BioAssay Result Display:

You can see an example of an Entrez BioAssay search result for the term "HIV". Using the "Display" pull-down menu on this page, users may choose to view lists of summaries, brief summaries, unique identifiers, compounds, substances, free text article links (via PMC), and PubMed citations. As in other Entrez databases, you may add the result to the Entrez Clipboard, view it as HTML text, or save it to a text file using the "Send to" menu. The "Links" pop-up menu on the far right provides shortcuts to the PubChem Substance and PubChem Compound records screened in the BioAssay.

The BioAssay result page may be navigated by using the PubChem BioAssay browser. The BioAssay Summary page shows detailed descriptions of a bioassay, including citation links, experiment protocols, and depositor comments. You may select the descriptors to view for an assay on the BioAssay Result and Condition Summary page. The BioAssay Result Search page provides an interface for specifying search criteria for the assay descriptors. You may also combine the current query with other Entrez PubChem Substance searches on this page. The query result counts are shown on the BioAssay Results Preview page, where you may also choose to display other chemical properties of the search result. The BioAssay Results page tabulates the search results and shows thumbnail images of chemical structures, PubChem Substance IDs, and the other assay and compound descriptors specified in your selections.

BioAssay Summary -- This page shows the bioassay basic information:


Navigate Buttons:

Press the "Show" button at Test Results to retrieve the bioassay results. 
Press the "Select" button to bring you to the Bioassay's select page.
Press the "Plot" button to go to the PubChem BioAssay plot page, where you can plot the results as a histogram or a scatter plot.

Press the Structure Activity Analysis "Show" button to display the heatmap of active compounds vs selected bioassays.  
Press the Structure Clustering "Show" button to display the clustering of the active compounds.

AID : PubChem's bioassay identifier.

BioAssay Version : The bioassay version number is composed of major version number and minor version number. We encourage you to look at the current version result as it is the updated data from the depositors.

Name : The BioAssay name provided by the depositor.

Data Source : Depositor's source name (unique in PubChem)

Deposit Date : Date when data was first deposited.

Modify Date : Date when data was revised.

BioActives : Active compounds/substances tested in the bioassay. Related links for the compound/substance set.

Neighbors, Related BioAssays : Related BioAssays by activity overlap, target similarity, and/or related to the same tested compound/substance set.

Links : Extra linked information to this bioassay.

Compound Information : Compounds tested for this bioassay, including activity information.

Substance Information : Substances tested for this bioassay, including activity information.

PubMed : PubMed citations related to this bioassay.

Protein : NCBI Entrez Protein links to this bioassay if available.

Nucleotide : NCBI Entrez Nucleotide links to this bioassay if available.

Taxonomy : NCBI Entrez Taxonomy links to this bioassay if available.

Structure : MMDB links to this bioassay.

Gene : Gene links to this bioassay.

BioAssay : The bioassays related to this one.

Description : The bioassay's description provided by the depositor.

Protocol : The bioassay's protocol provided by the depositor.

Comment : The bioassay's Comment provided by the depositor.

Result Definition : The bioassay's result definition provided by the depositors.

Three action buttons allow you to show, select, or plot the bioassay result data.

 

BioAssay Select -- This page provides an interface for searching the bioassay results.

Navigate Buttons:

Press the "Show" button to retrieve the bioassay(s) results based on your query criteria.
Press the "Summary" button to bring you to the Bioassay's summary page.
Press the "Clear" button to clear/reset the query form.

BioAssay Display Option : You can choose to display the results in the view of either compound or substance. In compound view, you can choose to group the results by compound activity, or tested date, or to exclude duplicated results.
When multiple bioassays are selected, you can choose to show the compounds/substances tested in all of the bioassays (intersection) or in any of them (union).

AID : PubChem's bioassay identifier.

Name : The BioAssay name provided by the depositor.

Data Source : Depositor's source name (unique in PubChem)

Outcome Filter allows you to select tested compounds/substances based on the activity outcome. The checkbox controls whether the outcome is displayed on the result page. By default, it is checked.

Activity Score Filter allows you to select tested compounds/substances based on the activity rank score. The checkbox controls whether the rank score is displayed on the result page. By default, it is checked.

Updated Date Filter allows you to specify the date range for the assay. By default, all results are returned if no date is entered. The input format is yyyy/mm/dd, where mm and dd are optional.
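If you prepare date ranges outside the form, it can help to check them against the yyyy/mm/dd format first. The following Python sketch is one possible check. It assumes that dd may be given only when mm is also given and that one-digit months and days are acceptable, neither of which this document specifies. The function name parse_filter_date is hypothetical.

```python
import datetime
import re
from typing import Optional, Tuple

# yyyy, optionally followed by /mm, optionally followed by /dd.
_DATE_PATTERN = re.compile(r"^(\d{4})(?:/(\d{1,2})(?:/(\d{1,2}))?)?$")


def parse_filter_date(text: str) -> Tuple[int, Optional[int], Optional[int]]:
    """Split a yyyy[/mm[/dd]] string into (year, month, day); missing parts are None."""
    match = _DATE_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not in yyyy/mm/dd format: {text!r}")
    year = int(match.group(1))
    month = int(match.group(2)) if match.group(2) else None
    day = int(match.group(3)) if match.group(3) else None
    if month is not None and not 1 <= month <= 12:
        raise ValueError(f"month out of range: {text!r}")
    if day is not None:
        # datetime.date raises ValueError for impossible dates such as 2005/02/30.
        datetime.date(year, month, day)
    return year, month, day


print(parse_filter_date("2005"))
print(parse_filter_date("2005/04"))
print(parse_filter_date("2005/04/06"))
```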

Select BioAssay Results provides a detailed search interface. You can search by activity outcome, rank score, and/or test date from the displayed search form. Click the expand icon to expand the bioassay result search form. (A collapse icon then appears; click it to collapse the form.)

All result fields are checked by default. You can select or unselect all of them by clicking the checkbox in the header row. Selected results will be displayed on the result page.

Results of integer or float type can be searched with a lower-bound value and/or an upper-bound value. String-type results can be searched either by selecting one term from the dropdown list or by entering a pattern string. Boolean-type results can be searched by selecting one radio button.

Pattern search : You can use pattern to perform a string search. A PATTERN is a part of a search term.

Result Filter : A few result filters are available to narrow your result search.

Substance Filter. You can provide a SID list using list file, list text, or Entrez history to your search.

Compound Filter. You can provide a CID list using list file, list text, or Entrez history to your search.

Select Other BioAssays provides a way to add or change bioassays. We do not encourage processing multiple bioassays together unless you know that two or more bioassays are related and you want to compare their results. You can choose up to 5 bioassays to process together.

 

BioAssay Show -- The BioAssay Show page displays your searched results.

Navigate Buttons:

Press the "Select" button to bring you back to the Bioassay's select page
Press the "Summary" button to bring you to the Bioassay's summary page.
 

Result Download : You can download the result set and the compound/substance hits from the two pull-down lists.

BioAssay Plot : This page provides an interface for plotting "Scatter Plot" and "Histogram". You can select up to 5 rows. The "Scatter Plot" will show figures for all pairs. The "Histogram" will show figures for all rows. You can also click on each to get the histogram for that row.

Scatter Plot and Histogram : After you click two diagonal points in the figure, you can view the data with four options: "Plot selected data", "Show selected data", "Show selected data, active only", and "Show selected data, inactive only".





PubChem Structure-Activity Analysis:
From the BioAssay pages, users can launch the Structure-Activity Analysis tool. The sample page is shown below. The default limit on compounds and BioAssays is set to 1000 so that the job finishes in around one minute. If more than 1000 compounds or BioAssays are input, a warning message appears and you can change the limit to a number <= 5000. In that case, expect the job to take more than one minute. To launch this tool directly, Click Here.


Select from Heatmap : One way to show a subset of the Heatmap is to click two diagonal points in the Heatmap to select the compounds and BioAssays in that region, as shown in the image below. A menu will pop up with five options. The first option, "Zoom in", displays a new Heatmap with the compounds and BioAssays in the selected region. The second option, "BioActivity Summary, Selected Compounds and BioAssays", shows the selected compounds and BioAssays on the PubChem BioActivity Summary page. The third option, "Selected Compounds in Structure Clustering", shows the selected compounds on the PubChem Structure Clustering page with all structures displayed. The last two options, "Selected Compounds in Entrez" and "Selected BioAssays in Entrez", show the selected compounds or BioAssays in Entrez.


Select from Cluster Trees : As shown in the following two images, if you click on a blue circle or the line above the circle in the Compound Cluster Tree or BioAssay Cluster Tree, a menu will pop up. The options for the Compound Cluster Tree are "Compounds in Structure Clustering", "Compounds in Entrez", "Compounds in BioActivity Summary", "Display Subtree Only", "Remove Subtree & Display the Rest", "Expand Subtree", and "Add Similar Compounds". The options for the BioAssay Cluster Tree are "BioAssays in Entrez", "BioAssays in BioActivity Summary", "Display Subtree Only", "Remove Subtree & Display the Rest", "Add Related BioAssays, by Activity Overlap", and "Add Related BioAssays, by Target Similarity".



Collapse Compound Cluster Tree : The Compound Cluster Tree can be collapsed if you click on the ruler as shown below. The subtrees beyond the collapsed Tanimoto score will be collapsed into a node, which can be expanded. The corresponding rows in the Heatmap are collapsed as well. Collapsed cells are colored with a mixture of green and yellow.

Group Results by and Duplicate Test Option : are two display options. In the Heatmap display, they are combined to show the tested data. For example, if the selection of "Group Results by" is "Compound, Same Connectivity" and the selection of "Duplicate Test Option" is "Most Recent", all current compounds in the Heatmap will be grouped according to "Same Connectivity". Among each group, the compound with the "Most Recent" data will be picked as the group representative.

  • Group Results by : Compounds can be grouped into different levels: "Compound", "Compound, Same Connectivity", "Parent Compound", "Parent Compound, Same Connectivity". Compounds themselves are grouped from "Substances".

  • Duplicate Test Option : Each compound might be tested multiple times in an assay. There are five ways to obtain a representative result: Flag Discrepancies, Exclude Duplicates, Most Recent, Most Active, and Least Active. The default is Flag Discrepancies. If a compound has multiple test results in an assay, these results are considered "Duplicates". If the "Duplicated" results have different activity outcomes, the corresponding cell in the Heatmap is shown as "Discrepant" using a mixture of red and blue.
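Two of the duplicate-handling options described above can be sketched in a few lines of Python. The example covers only "Flag Discrepancies" and "Most Recent", because this document does not say how activity is ranked for "Most Active" and "Least Active" or what "Exclude Duplicates" keeps. The Measurement class, the function names, and the sample dates are hypothetical.

```python
import datetime
from dataclasses import dataclass
from typing import List


@dataclass
class Measurement:
    tested_on: datetime.date
    outcome: str  # for example "active" or "inactive"


def flag_discrepancies(results: List[Measurement]) -> str:
    """Return the shared outcome, or "discrepant" when duplicates disagree."""
    outcomes = {r.outcome for r in results}
    if not outcomes:
        return "untested"
    if len(outcomes) == 1:
        return next(iter(outcomes))
    return "discrepant"


def most_recent(results: List[Measurement]) -> Measurement:
    """Pick the result with the latest test date as the representative."""
    return max(results, key=lambda r: r.tested_on)


# Made-up duplicate tests of one compound in one assay.
duplicates = [
    Measurement(datetime.date(2006, 1, 5), "active"),
    Measurement(datetime.date(2007, 3, 2), "inactive"),
]
print(flag_discrepancies(duplicates))
print(most_recent(duplicates).outcome)
```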

Clusters : This is probably the most important feature in this tool. You can cluster compounds and BioAssays differently to do the Structure-Activity analysis. Four kinds of Compounds/BioAssays relationship can be displayed: Structure/Activity, Activity/Activity, Structure/Target, and Activity/Target.

  • Compounds : can be clustered based on the structures or the activity of these compounds in the selected BioAssays. The Clustering of Compounds is shown on the left of the Heatmap. The scoring function of structure similarity is Tanimoto score, which is calculated from Structure Fingerprint. The scoring function of activity similarity is described below. The clustering algorithm for both compound and BioAssay clusters is Single Linkage.

  • BioAssays : can be clustered based on the activity of the selected compounds in these BioAssays, or on the sequence similarity of the BioAssay targets, which are the proteins interacting with the compounds in the BioAssays. The clustering of BioAssays is shown at the top of the Heatmap. Target similarity is expressed as sequence identity.
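To make the clustering description more concrete, here is a minimal Python sketch of a Tanimoto score on fingerprints and of single-linkage grouping at a chosen similarity cutoff. It assumes fingerprints are represented as sets of "on" bit positions and uses the common definition of the Tanimoto score (shared bits divided by bits set in either fingerprint). This document does not describe PubChem's fingerprint format or implementation, so the code is an illustration, not the tool's actual method. The fingerprints are made up.

```python
from itertools import combinations
from typing import Dict, List, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Shared on-bits divided by on-bits in either fingerprint."""
    union = len(a | b)
    # Returning 0.0 for two empty fingerprints is an assumption.
    return len(a & b) / union if union else 0.0


def single_linkage_groups(
    fingerprints: Dict[str, Set[int]], cutoff: float
) -> List[Set[str]]:
    """Group items linked, directly or through a chain, by similarity >= cutoff.

    Cutting a single-linkage tree at a cutoff gives the connected components
    of the graph whose edges join pairs with similarity >= cutoff.
    """
    parent = {name: name for name in fingerprints}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for x, y in combinations(fingerprints, 2):
        if tanimoto(fingerprints[x], fingerprints[y]) >= cutoff:
            parent[find(x)] = find(y)

    groups: Dict[str, Set[str]] = {}
    for name in fingerprints:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())


# Made-up fingerprints: A~B and B~C score 0.6, A~C only about 0.33.
fps = {"A": {1, 2, 3, 4}, "B": {1, 2, 3, 5}, "C": {1, 2, 5, 6}, "D": {10, 11}}
groups = single_linkage_groups(fps, cutoff=0.5)
print(sorted(sorted(g) for g in groups))
```

In this example A and C end up in the same group even though their direct similarity is below the cutoff, because each is similar enough to B. That chaining behavior is characteristic of single linkage.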

Activity Similarity : is calculated from different data using different scoring functions.

  • Calculated from : Each cell in the Heatmap corresponds to a test of one compound in one BioAssay. The result can be expressed in three kinds of data: Activity, Normalized Linear Score, or Normalized Percentile Score. The Activity includes Active, Inactive, Unspecified/Inconclusive, or Untested, which are represented with "1", "0", "?", and "", respectively. The Normalized Linear Score and Normalized Percentile Score are calculated from the raw scores of all compounds in one BioAssay. The Normalized Linear Score = ([score] - [min]) / ([max] - [min]), where [min] and [max] are the minimum and maximum score of this assay. The Normalized Percentile Score = [rank] / [N], where [rank] is the rank of one compound among all compounds in the assay, [N] is the total number of compounds in the assay.

  • Scoring Functions : are different for Activity Data, and Normalized Linear/Percentile Score Data.

    1. Activity Data : There are two scoring functions for Activity data: "Weighted Similarity" (WS) and "Activity Similarity of Active Compounds" (ASAC). "WS" is preferred since it clusters both active and inactive data, whereas "ASAC" clusters only active data and treats inactive data the same as untested data. "ASAC" = [number of compounds active in both sets A and B] / ([number of compounds active in set A] + [number of compounds active in set B] - [number of compounds active in both sets A and B]). "Activity Similarity of Inactive Compounds" (ASIC) is calculated in the same way from inactive compounds. "WS" = ("ASAC" + 0.1 * "ASIC") / (1 + 0.1).

    2. Normalized Linear or Percentile Score Data : The selected scoring function for score data is "Euclidean Distance". "Euclidean Distance" = 1.0 - SUM of ([diff] * [diff]) / [N], where [N] is the total number of cells in set A (same as that in set B), [diff] = [score of a cell in set A] - [score of the corresponding cell in set B]. If both cells are untested, [diff] = 0. If one cell is tested with a score of "S" and the other cell is untested, [diff] is the higher value of "S" and 1 - "S".
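The normalization and scoring formulas above can be written out as a short Python sketch. Several details are assumptions, because this document does not specify them: rank 1 is taken to be the lowest raw score and tied scores share the highest rank; a ratio with a zero denominator is returned as 0.0; the "Euclidean Distance" formula is read as 1.0 - (SUM / [N]); and an untested cell is represented by None. All function names and sample values are hypothetical.

```python
from typing import List, Optional, Sequence, Set


def normalized_linear(scores: Sequence[float]) -> List[float]:
    """([score] - [min]) / ([max] - [min]) for each raw score in one assay."""
    low, high = min(scores), max(scores)
    if high == low:
        return [0.0 for _ in scores]  # assumption: avoid dividing by zero
    return [(s - low) / (high - low) for s in scores]


def normalized_percentile(scores: Sequence[float]) -> List[float]:
    """[rank] / [N], with rank counted from the lowest score (assumption)."""
    n = len(scores)
    return [sum(1 for other in scores if other <= s) / n for s in scores]


def overlap(a: Set[str], b: Set[str]) -> float:
    """[in both] / ([in A] + [in B] - [in both]), the form used by ASAC and ASIC."""
    both = len(a & b)
    denominator = len(a) + len(b) - both
    return both / denominator if denominator else 0.0


def weighted_similarity(
    active_a: Set[str], active_b: Set[str],
    inactive_a: Set[str], inactive_b: Set[str],
) -> float:
    """WS = (ASAC + 0.1 * ASIC) / (1 + 0.1)."""
    asac = overlap(active_a, active_b)
    asic = overlap(inactive_a, inactive_b)
    return (asac + 0.1 * asic) / (1 + 0.1)


def euclidean_score(
    a: Sequence[Optional[float]], b: Sequence[Optional[float]]
) -> float:
    """1.0 - SUM([diff] * [diff]) / [N]; None marks an untested cell."""
    if len(a) != len(b) or not a:
        raise ValueError("both sets need the same, non-zero number of cells")
    total = 0.0
    for x, y in zip(a, b):
        if x is None and y is None:
            diff = 0.0
        elif x is None or y is None:
            s = y if x is None else x
            diff = max(s, 1 - s)  # the higher of S and 1 - S
        else:
            diff = x - y
        total += diff * diff
    return 1.0 - total / len(a)


# Made-up data for illustration.
raw = [10.0, 20.0, 40.0, 30.0]
linear = normalized_linear(raw)
percentile = normalized_percentile(raw)
ws = weighted_similarity({"c1", "c2", "c3"}, {"c2", "c3", "c4"},
                         {"c5", "c6"}, {"c6", "c7"})
score = euclidean_score([0.2, None, 0.9, None], [0.5, None, None, 0.3])
print(linear, percentile, ws, score)
```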

Method / Color : These view options are used to select "Method Code" and "Color Code".

  • Activity Outcome Method : The default "Show" will show BioAssays with different shapes according to their Activity Outcome Methods and the "Method Code". You can select "Hide" to show all Heatmap cells in the same square shape. There are five kinds of methods: "Confirmatory", "Summary", "Screening", "Other", and "Unassigned".

  • Color Based on : While activity similarity can be "Calculated from" Activity, Normalized Linear Score, and Normalized Percentile Score, the cells in the image can be "Colored Based on" any one of the three kinds of data: Activity, Normalized Linear Score, or Normalized Percentile Score.

Revise Compound Input : There are four ways to modify the data set. The "Current" data set is shown in the Heatmap. The "Active Only" data set is a subset of the Heatmap, and shows only Rows with at least one active compound. The "from Entrez History" option provides a new set of compounds from Entrez history. The "Add Similar Compounds/Substances" option adds compounds/substances similar to the current ones.

Revise BioAssay Input : There are eight ways to modify the data set. The "Current" data set is shown in the Heatmap. The "Active Only" data set is a subset of the Heatmap, and shows only Columns with at least one active compound. The "from Entrez History" option provides a new set of BioAssays from Entrez history. The "Known Targets Only" data set is a subset of the Heatmap, and shows only those BioAssays with known targets. The "Add Related BioAssays, by Activity Overlap" option adds the top 15 related BioAssays (by activity overlap). The "Add Related BioAssays, by Target Similarity" option adds the top 15 related BioAssays (by target similarity). The "Confirmatory Method Only" data set shows only those BioAssays with their Activity Outcome Method as confirmatory. The "Summary/Confirmatory Method Only" data set shows only those BioAssays with their Activity Outcome Method as summary or confirmatory.

Counts : The "Counts" appear only when users launch the Heatmap for the first time. They include the number of "Input" compounds or BioAssays, the number of compounds or BioAssays with "Tested" data (some compounds or BioAssays may have no tested data in the Heatmap), and the number of compounds or BioAssays "Shown" in the Heatmap.


PubChem Structure Clustering:
The compounds are clustered based on the structure similarity, which is represented using the Tanimoto score calculated from the structure fingerprints. Both the simple view with the compound IDs (CIDs) and the view with the compound structures are provided. The limit of compounds is 5000. If more than 5000 compounds are input, a warning message will show up. To launch this tool directly, Click Here.

Collapse Compound Cluster Tree : The Compound Cluster Tree can be collapsed if you click on the ruler as shown below. The subtrees beyond the collapsed Tanimoto score will be collapsed into a node, which can be expanded.


Select from Cluster Trees : As shown in the following image, if you click on a blue circle in the Cluster Tree, a menu will pop up. The options are "Compounds in Entrez", "Compounds in BioActivity Summary", "Display Subtree Only", "Remove Subtree & Display the Rest", and "Expand Subtree".


Group Results by : You can switch between "Compound" and "Substance" views. These compounds are grouped from these substances.

Related BioAssays by Activity Overlap:
This page shows the related BioAssays based on the activity similarity between the queried BioAssay and all other BioAssays. The top 15 related BioAssays are pre-selected.


Activity Similarity : For BioAssays A and B, the activity similarity = [Active in Both A and B] / ([Active in A] + [Active in B] - [Active in Both A and B]).

Active in Both : Links to compounds active in both the queried BioAssay and the BioAssay listed in the current row.

Data : Links to a BioAssay data table showing the activity and score information for each pair of BioAssays.
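
The activity similarity defined above can be sketched in a few lines of Python. This is only an illustration: the function name, the use of sets of compound IDs, and the choice to return 0.0 when neither BioAssay has any active compound are assumptions, not part of PubChem.

```python
def activity_similarity(active_a: set, active_b: set) -> float:
    """Activity similarity of two BioAssays, given their sets of active compound IDs."""
    both = len(active_a & active_b)  # [Active in Both A and B]
    denominator = len(active_a) + len(active_b) - both
    # Assumption: two BioAssays without any active compound get a similarity of 0.0.
    return both / denominator if denominator else 0.0


# Hypothetical compound IDs, for illustration only.
assay_a = {101, 102, 103, 104}
assay_b = {103, 104, 105}
print(activity_similarity(assay_a, assay_b))  # 2 / (4 + 3 - 2) = 0.4
```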


Related BioAssays by Target Similarity:
This page shows BioAssays related to the queried BioAssay, based on the target similarity between the queried BioAssay and all other BioAssays. The top 15 related BioAssays are pre-selected.


Target : the protein(s) which the compounds interact with in the BioAssay.

Target Similarity : the similarity of the protein sequences for a pair of targets in two BioAssays. There are three ways to present the target sequence similarity: BLAST score, E Value, and Sequence Identity. The related BioAssays are sorted according to BLAST score.

Related Target : the target of the BioAssay listed in the current row. If there are multiple targets in one BioAssay, only the target with the highest BLAST score is shown.


PubChem BioActivity Summary:


The PubChem BioActivity Summary page displays an activity summary of the tested compounds/substances in bioassays. You can launch this page from a PubChem Entrez docsum. In pccompound or pcsubstance, click the Display dropdown and choose "PubChem BioActivity Summary" to see the compound/substance activity distribution across all bioassays. In Entrez pcassay, you will see all tested compounds across each bioassay. There are also a few other launch points for the page across the bioassay services.

Total BioAssay Count: Refers to the total bioassays in the bioactivity summary page.

Total Substance Count: Refers to the total substances in the bioactivity summary page.

Total Compound Count: Refers to the total compounds in the bioactivity summary page.

Total tested: Refers to the number of imported compounds/substances that were tested in the current bioassay set.
Active: Refers to the active count of compound/substance tested.
Inactive: Refers to the inactive count of compound/substance tested.

Revise Substance Selection allows you to reset the substance set using a few options.

Select Active resets the substances to all active substances for the current or selected bioassays, based on the current substance set.
Add Active adds all active substances tested in the current or selected bioassays.
Add Similar Substances allows you to retrieve all similar substances (threshold >= 90) based on the current substance set (limited to 200).

Revise BioAssay Selection allows you to reset the bioassay set using the following options.

Select Active resets the bioassay set to all bioassays containing active compounds (based on the current compound set).
Add Active adds any bioassay that has an active compound in the current set.
Selected BioAssays chooses the selected bioassays.

Structure Activity Analysis links to the activity analysis tool based on the selected bioassays or all bioassays (when no selection) across the current compound set.

Structure Clustering is based on the current compound set.

Selected BioAssays to Entrez displays selected or all (when no selection) bioassays in Entrez pcassay.

BioAssay Data Table shows result data table for selected or all (when no selection, maximum up to 20) bioassays with the substance set in the page.

The summary table contains AID, active compound/substance count, inactive compound/substance count, discrepant count (in case of flag discrepancies option), total count, outcome method, and the bioassay name.




BioAssay View File:

A BioAssay view file enables you to save the state of a BioAssay display so that you can view it again at a later date or share it with colleagues. Please note that PubChem data may change over time as depositors add, update, and delete data. As such, saving a view does not absolutely guarantee that exactly the same information will be displayed at a later date. The BioAssay view file is in XML format. The specification for this file can be found at: ftp://ftp.ncbi.nih.gov/pubchem/specifications/pug.xsd.

PubMed MeSH Keyword Summary



The PubMed MeSH Keyword Summary tool is intended to help users narrow down their PubMed searches by finding the most frequent co-occurring keywords. When launched from a substance/compound summary page, the tool takes a MeSH keyword, finds all the PubMed articles associated with that keyword, and generates a keyword summary page consisting of all the keywords found in those articles. The keywords found are ranked by article count.
Additionally, keywords are categorized according to the parts of the MeSH tree they belong to.

The displayed categories are:
Chemicals: All keywords related to "Chemicals and Drugs" except "Chemical Actions and Uses" (keywords from D-tree besides D-27 branch)
Pharmacological Actions of Chemicals: Pharmacological actions annotated on Chemicals and Drugs that are themselves annotated onto PubMed articles.
Directly Annotated Pharmacological Actions: Pharmacological Actions keywords directly annotated onto PubMed articles (keywords from D27.505 branch)
Toxicological Actions: Toxicology-related keywords (D27.888 branch)
Biological Sciences: Keywords related to Biological Sciences (G-branch)
Others: All other terms not included above
PubMed Articles found: the number of all articles associated with keyword(s); linked to the list of all found PubMed articles

Medical Subject keywords found: the number of all keywords found; linked to a list of all MeSH keywords

Search section: the search section is for querying PubMed with a selected set of MeSH terms. The user can select MeSH terms by clicking on their checkboxes and then add the selected terms to the "Search" box by clicking on the "Add Selection" button. The AND/OR/NOT radio buttons are used to logically link queries together. The default is "AND". The "Clear" button clears the query box.

Search PubMed: this button sends the query to PubMed.

Categories: All categories have keywords listed in a descending order of article count. MeSH keywords are linked to the MeSH keyword summary page. Counts are linked to all articles associated with the keyword.

PubChem Cross Links


PubChem provides cross links to other databases when that information is available. You can find those links on either the Entrez PubChem pages or the individual record summary pages.

SID: Link to Entrez PCSubstance.

Substance Version: The dropdown action displays substance information from old version if available.

CID: Link to Entrez PCCompound.

Substances Links: (on the compound summary page)
All: Links to Entrez PCSubstance records with all substances that contain this structure.
Same: Links to Entrez PCSubstance records with those that contain this structure only.
Mixture: Links to Entrez PCSubstance records where this molecule is a component of a mixture.
Component: Links to Entrez PCCompound records with component records only.

BioAssay Information: Links to the Entrez PCAssay or the BioAssay summary page related to this compound/substance.

Related Compounds/Substances Links:
Same: The molecules in this group are exactly the same, including connectivity, isotopes and stereochemistry.
Same, Connectivity: The molecules in this group have the same regular chemical connectivity, ignoring isotopes and stereochemistry.
Same, Stereochemistry: Molecules that have the same connectivity and stereochemistry, ignoring isotopes.
Same, Isotope: Molecules that have the same connectivity and isotopes, ignoring stereochemistry.
Same, Any Tautomers: Those molecules that are tautomers.

Parent Compound Link: Link to the parent compound of the record. A parent is conceptually the "important" part of the molecule when the molecule has more than one covalent component. Specifically, a parent component must have at least one carbon and contain at least 70% of the heavy (non-hydrogen) atoms of all the unique covalent units (ignoring stoichiometry). Note that this is a very empirical definition and is subject to change.
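
As a rough illustration of the parent rule above, the following Python sketch picks the parent from a list of covalent units. The CovalentUnit type, the find_parent name, and the example atom counts are hypothetical, and the sketch reads "ignoring stoichiometry" as counting each distinct unit once; PubChem's actual procedure may differ.

```python
from typing import List, NamedTuple, Optional


class CovalentUnit(NamedTuple):
    name: str
    carbon_count: int
    heavy_atom_count: int  # non-hydrogen atoms


def find_parent(units: List[CovalentUnit]) -> Optional[CovalentUnit]:
    """Return the unit that has a carbon and at least 70% of all heavy atoms, if any."""
    unique_units = set(units)  # ignore stoichiometry: count each distinct unit once
    total_heavy = sum(u.heavy_atom_count for u in unique_units)
    for unit in unique_units:
        # Integer arithmetic avoids floating-point trouble at exactly 70%.
        if unit.carbon_count >= 1 and 10 * unit.heavy_atom_count >= 7 * total_heavy:
            return unit
    return None


# Illustrative example: sodium acetate as two covalent units.
acetate = CovalentUnit("acetate", carbon_count=2, heavy_atom_count=4)
sodium = CovalentUnit("sodium", carbon_count=0, heavy_atom_count=1)
parent = find_parent([acetate, sodium])
print(parent.name if parent else None)  # acetate holds 4 of 5 heavy atoms (80%)
```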

Same Parent Links: Links to records that share the same parent component. This is a sort of merger of the parent and related concepts: these are links to groups that share the same parent at different levels of "sameness" as described above for the Related links.

Similar Compounds Link: (on the compound summary page) Link to Entrez PCCompound records. All compounds shown have a similarity score [Tanimoto] >= 90%. If you want to find compounds with different scores, you can visit the PubChem structure search page.

Similar Substances Link: (on the substance summary page) Link to Entrez PCSubstance records. All substances shown have a similarity score [Tanimoto] >= 90%. If you want to find substances with different scores, you can visit the PubChem structure search page.


Similarity links are pre-computed in PubChem from a dictionary-based fingerprint, at a 90% threshold, using the Tanimoto score equation:
Tanimoto = AB / ( A + B - AB )
Where:
      Tanimoto is the Tanimoto score, a fraction between 0 and 1.
      AB is the count of bits set after bit-wise & of fingerprints A and B
      A is the count of bits set in fingerprint A
      B is the count of bits set in fingerprint B

Each similarity link is equivalent to a chemical structure similarity search of the PubChem Compound database yielding all chemical structures with a Tanimoto score that is 90% or above.
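
The Tanimoto equation above can be sketched in Python as follows. Storing each fingerprint as a Python integer (one bit per key), the short 8-bit example values, and returning 0.0 for two empty fingerprints are assumptions made for illustration; this is not PubChem's implementation.

```python
def tanimoto(fp_a: int, fp_b: int) -> float:
    """Tanimoto score of two fingerprints stored as integers (one bit per key)."""
    ab = bin(fp_a & fp_b).count("1")  # bits set after bit-wise & of A and B
    a = bin(fp_a).count("1")          # bits set in fingerprint A
    b = bin(fp_b).count("1")          # bits set in fingerprint B
    denominator = a + b - ab
    # Assumption: two empty fingerprints are scored 0.0.
    return ab / denominator if denominator else 0.0


# Hypothetical 8-bit fingerprints; real PubChem fingerprints are much longer.
fp_a = 0b10110110  # 5 bits set
fp_b = 0b10010111  # 5 bits set, 4 of them shared with fp_a
score = tanimoto(fp_a, fp_b)
print(f"{score:.3f}")  # 4 / (5 + 5 - 4) = 0.667
print(score >= 0.90)   # False: below the 90% threshold used for similarity links
```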

In addition to the Tanimoto equation above, PubChem uses a "boost" scheme that assigns a similarity score of:

      104% to structures with identical stereo, isotope, and connectivity.
      103% to structures with identical connectivity and either stereo or isotope.
      102% to structures with identical connectivity.
      101% to structures that are tautomers of the query.

The cases of "boosted" scores greater than 101% correspond to cases that originally would have had a score of 100% similarity. However, in the case where tautomers get an artificial score of 101%, their natural score could be much lower, sometimes as low as 60%, especially for small compounds where the tautomeric system is a large part of the structure.
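
The "boost" scheme can be read as a small decision table. The sketch below is one possible reading of the list above; the function name and the boolean flags are hypothetical, and how PubChem actually determines identical connectivity, stereo, isotope, or tautomerism is not shown.

```python
def boosted_score(tanimoto: float, same_connectivity: bool, same_stereo: bool,
                  same_isotope: bool, is_tautomer: bool) -> float:
    """Apply the boost scheme to a Tanimoto score (as a fraction, 1.0 = 100%)."""
    if same_connectivity and same_stereo and same_isotope:
        return 1.04  # identical stereo, isotope, and connectivity
    if same_connectivity and (same_stereo or same_isotope):
        return 1.03  # identical connectivity and either stereo or isotope
    if same_connectivity:
        return 1.02  # identical connectivity only
    if is_tautomer:
        return 1.01  # tautomer of the query, whatever its natural score
    return tanimoto  # no boost applies


# A tautomer with a natural score of 60% is still reported at 101%.
print(boosted_score(0.60, False, False, False, True))  # 1.01
```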

There are 881 substructure-keys (skeys) in each fingerprint. Each bit in the fingerprint represents the presence (or absence) of a particular chemical substructure (e.g., a carboxylic acid) or a particular count of the same. These skeys are similar in nature to the well-known MDL MACCS skeys fingerprints.

Structure Search: Leads you to the PubChem structure search page by transferring this compound's isomeric SMILES string into the search field. For more information about the structure search, visit the structure search help page.

BioActivity: Link to the bioassay data summary page.

Structure Links: Link to the Entrez Structure database with associated MMDB IDs provided by depositors.

PubMed:
Links to PubMed references for this compound/substance. The linkage is mainly based on the PMID(s) associated with the substance/compound.

Nucleotides: Links to the Entrez Nucleotide database with associated nucleotide GIs provided by depositors.

Protein: Links to the Entrez protein database with associated protein GIs provided by depositors.

Source and Source-ID: Links to the depositor's original information page by the depositor's source name and/or source-id (external id), if available.

Medical Subject Annotations (MeSH):
Links to records in NLM's Medical Subject Headings (MeSH) database. Linkage is based on matching names and synonyms supplied with the chemical structure record to those in the MeSH record. The names and synonyms creating a link to each MeSH record are indicated by "MeSH" links in the "Synonyms" section of PubChem Compound and Substance Summary displays.
This section also provides PubMed information through MeSH heading and subheadings. To learn more about MeSH, visit the MeSH site.

NLM Toxicology Link: Links to the SIS/ChemIDPlus record for this compound/substance, which provides additional links to toxicology information sources. More information is available.


PubChem Indexes and Filters in Entrez


The PubChem index search is a very powerful tool within the Entrez system. Users can simply type search term(s) followed by the bracketed index field name. Then click the "Go" button.

Examples:


  Search for DTP/NCI's record with NSC#78:
      On the PubChem homepage or Entrez search page, enter "DTP/NCI[Sourcename], 78[objectid]" in
      the search box, then click the Go button.

  Search for all substances containing gold:
      On the PubChem homepage or Entrez search page enter "Au[el]", and click the Go button.

  Search for all compounds with heavy atom count between 10 and 12:
      On the PubChem homepage or Entrez search page, choose 'PCCompound' database from the
      search dropdown list, enter "10:12[hac]", and click the Go button.

The following fields can be searched within Entrez PubChem databases (with field aliases in square brackets; pick one alias that's easily memorized in case multiple aliases are available). For integer/real number fields, the range search can be done as shown above.
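
The field-qualified terms and range searches shown in the examples above can also be assembled programmatically. The sketch below only builds the query strings and a search URL; it does not contact any server. The helper names are hypothetical, and the use of the NCBI E-utilities esearch endpoint with the database name "pccompound" is an assumption that goes beyond this help page, so check the E-utilities documentation before relying on it.

```python
from urllib.parse import urlencode

# Assumption: the NCBI E-utilities ESearch endpoint (not described on this page).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field_term(value: str, field: str) -> str:
    """Build a term such as Au[el]: a value followed by a bracketed field alias."""
    return f"{value}[{field}]"


def range_term(low, high, field: str) -> str:
    """Build a range term such as 10:12[hac] for integer/real number fields."""
    return f"{low}:{high}[{field}]"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Return a URL-encoded search URL for the given database and query term."""
    return f"{ESEARCH_URL}?{urlencode({'db': db, 'term': term, 'retmax': retmax})}"


gold = field_term("Au", "el")            # all substances containing gold
heavy_atoms = range_term(10, 12, "hac")  # heavy atom count between 10 and 12
print(gold)
print(heavy_atoms)
print(esearch_url("pccompound", heavy_atoms))
```

Pasting the printed terms into the Entrez search box corresponds to the manual steps in the examples above.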

PCCompound:

All [ALL] : All of the following fields are searched. If a string query is presented without a field alias, by default, [ALL] is searched.
Uid [UID] : The integer represents the CID in the PCCompound database. By default, an integer without a field alias is recognized as a UID. Same as [CID].
Filter [Filter] : Limits the records. A number of filters are available to restrict the search to compounds with particular information. The specialized Filters in this database are:

  • has_mesh: records with associated MeSH terms
  • has_pharm: records with associated pharmacological actions
  • has_parent: records that have a parent structure
  • has_no_parent: records that do not have a parent
ActiveAid [AA] : Active BioAssay identifier, integer. This choice is intended for users querying for active compounds for a particular assay. Bioassay ID (AID) should be provided.

ActiveAidCount [AC, ACNT] : Using this filter, users can query for compounds that are active in a certain number of assays.
ActiveAidRatio [AAR] : Ratio between 0 and 1, equal to the number of bioassays in which the compound tested active divided by the number of bioassays in which the compound was tested with any result.
AssaySourceName [ASRC, ASRCNAM, ASRCNAME] : Allows filtering by assay source name. For available data sources look here
AtomChiralCount [ACC, ACCNT] : Total count of chiral atoms in a given compound, integer.
AtomChiralDefCount [ACDC, ACDCNT] : Total count of defined chiral atoms in a given compound, integer.
AtomChiralUndefCount [ACUC, ACUCNT] : Total count of undefined chiral atoms in a given compound, integer.
BioAssayID [BAID, AID] : BioAssay identifier, integer.
BondChiralCount [BCC, BCCNT] : Total count of chiral bonds in a given compound, integer.
BondChiralDefCount [BCDC, BCDCNT] : Total count of defined chiral bonds in a given compound, integer.
BondChiralUndefCount [BCUC, BCUCNT] : Total count of undefined chiral bonds in a given compound, integer.
CompleteSynonym [CSYN, CSYNO] : Compound's synonyms, based on all substances related to this compound.
Complexity [CPLX] : Compound complexity.
CompoundID [CID] : Compound ID. Same as [UID].
CovalentUnitCount [CUC, CUCNT] : Integer.
CreateDate : Date this compound was created in PubChem.
Element [ELMT, EL] : Chemical element in a compound.
ExactMass [EMAS, EXMASS] : The calculated mass of an ion or a molecule containing the most likely isotopic composition for a single random molecule, corresponding to the mass of the most intense ion/molecule peak in a mass spectrum. A real number.
HeavyAtomCount [HAC, HACNT] : Atom count in a compound except hydrogen, integer.
HydrogenBondAcceptorCount [HBAC, HBACNT] : Hydrogen bond acceptors for a compound, integer.
HydrogenBondDonorCount [HBDC, HBDCNT] : Hydrogen bond donors for a compound, integer.
InactiveAid [IA] : Inactive BioAssay identifier, integer. This choice is intended for users querying for inactive compounds for a particular assay. Bioassay ID (AID) should be provided.
InactiveAidCount [IC, ICNT] : Using this filter, users can query for compounds that are inactive in a certain number of assays.
InChI [INCH, INCHI] : IUPAC International Chemical Identifier.


InChI string can be searched through the Entrez PubChem databases. e.g.

To search with the InChI string of aspirin: "InChI=1/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)/f/h11H":

Type or paste "InChI=1/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)/f/h11H"[InChI] into the PubChem Compound, or PubChem Substance, or the Entrez Global search box, then click Go button.

Note:
     The quote marks and the square brackets are required.
     'InChI=' is required.
IsotopeAtomCount [IAC, IACNT] : Isotope atom numbers in a compound.
IUPACName [UPAC, IUPAC] : Standard IUPAC name for compound.
MeSHDescription [MHD]
MeSHTerm [MSHT, MESHT] : Medical Subject Heading term. Note that MeSH entry terms (synonyms for the Medical Subject Heading term) are also indexed.
MeSHTreeNode [MSHN, MESHTN] : Medical Subject Heading tree node (tree structures). Searching for a term corresponding to a higher level node in the MeSH hierarchy will find records matching any MeSH terms beneath that node. For example, "Penicillins[MeSHTreeNode]" will find records linked to MeSH terms "Oxacillin", "Cloxacillin", etc. Note that entry terms (synonyms) for MeSH tree nodes are also indexed.
MolecularWeight [MW, MWT, MOLWT] : Mass of a molecule calculated using the average mass of each element weighted for its natural isotopic abundance. E.g., Carbon has two natural isotopes 12 and 13 with relative abundances of 98.9% and 1.1% to yield an average mass of 12.011 g/mol. A real number.
MonoisotopicMass [MMAS, MIMASS] : Mass of a molecule calculated using the mass of the most abundant isotope of each element. E.g., Carbon has a monoisotopic mass of 12.000 g/mol. A real number.
PharmAction [PHMA, PHARMA] : MeSH pharmacological actions.
RotatableBondCount [RBC, RBCNT] : Count of rotatable bonds
SourceID [SRID, SRCID] : Depositor's external id.
SourceName [SRC, SRCNAM, SRCNAME] : Depositor name officially recorded in PubChem databases. For current data sources look here
SourceCategory [SRCC, SRCCAT, SRCCATG] : Depositor categories. For more information and possible categories look here
SubstanceID [SID] : Substance identifier, integer.
Synonym [SYNO] : Synonyms for substance.
TautomerCount [TC, TCNT, TTMC] : Possible tautomer count for each given structure, no more than 200, integer. 
TotalAidCount [TAC] : Number of assays in which a compound was tested, with any outcome (active, inactive, inconclusive, or unspecified).
TotalFormalCharge [TFC, CHG, CHRG] : Total formal charge.
TPSA [TPSA] : Topological Polar Surface Area.
XLogP [XLGP, LOGP]
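
The note on InChI searches in the field list above (quote marks, square brackets, and the 'InChI=' prefix are all required) can be captured in a small helper. This is a minimal sketch: the function name is hypothetical and it only formats the search term, without validating the InChI string itself.

```python
def inchi_term(inchi: str) -> str:
    """Format an InChI string as a quoted, field-qualified Entrez search term."""
    if not inchi.startswith("InChI="):
        raise ValueError("the 'InChI=' prefix is required")
    # The quote marks and the square brackets are required as well.
    return f'"{inchi}"[InChI]'


# InChI string of aspirin, as given in the example above.
aspirin = "InChI=1/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)/f/h11H"
print(inchi_term(aspirin))
```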


PCSubstance:

All [ALL] : All of the following fields are searched. If a string query is presented without a field alias, by default, [ALL] is searched.
Uid [UID] : The integer represents SID for PCSubstance database. By default, an integer without a field alias is recognized as a UID. Same as [SID].
Filter [Filter] : Limits the records. A number of filters are available to restrict the search to substances with particular information. The specialized Filters in this database are:
  • has_mesh: records with associated MeSH terms
  • has_pharm: records with associated pharmacological actions
  • has_parent: records that have a parent structure
  • has_no_parent: records that do not have a parent
ActiveAid [AA] : Active BioAssay identifier, integer. This choice is intended for users querying for active substances for a particular assay.
ActiveAidCount [AC, ACNT] : Using this filter, users can query for substances that are active in a certain number of assays.
ActiveAidRatio [AAR] : Ratio between 0 and 1, equal to the number of bioassays in which the substance tested active divided by the number of bioassays in which the substance was tested with any result.
                         (# bioassays where tested active) / (# bioassays tested with any result)
AssaySourceName [ASRC, ASRCNAM, ASRCNAME] : Allows filtering by assay source name. For available data sources look here
AtomChiralCount [ACC, ACCNT] : Total count of chiral atoms in a given compound, integer.
AtomChiralDefCount [ACDC, ACDCNT] : Total count of defined chiral atoms in a given compound, integer.
AtomChiralUndefCount [ACUC, ACUCNT] : Total count of undefined chiral atoms in a given compound, integer.
BioAssayID [BAID, AID] : BioAssay identifier, integer.
BondChiralCount [BCC, BCCNT] : Total count of chiral bonds in a given compound, integer.
BondChiralDefCount [BCDC, BCDCNT] : Total count of defined chiral bonds in a given compound, integer.
BondChiralUndefCount [BCUC, BCUCNT] : Total count of undefined chiral bonds in a given compound, integer.
Comment [CMT] : Substance or bioassay comment.
CompleteSynonym [CSYN, CSYNO] : Compound's synonyms, based on all substances related to this compound.
Complexity [CPLX] : Substance structure complexity.
ComponentCID [CCID] : Component compound identifier.
CompoundID [CID] : Compound identifier, integer.
CovalentUnitCount [CUC, CUCNT] : Integer.
DepositDate [DDAT, DEPDAT] : Deposition timestamp for a substance.
Element [ELMT, EL] : Chemical element in a substance/compound.
ExactMass [EMAS, EXMASS] : The calculated mass of an ion or a molecule containing the most likely isotopic composition for a single random molecule, corresponding to the mass of the most intense ion/molecule peak in a mass spectrum. A real number.
HeavyAtomCount [HAC, HACNT] : Atom count in a compound except hydrogen, integer.
HydrogenBondAcceptorCount [HBAC, HBACNT] : Hydrogen bond acceptors for a compound, integer.
HydrogenBondDonorCount [HBDC, HBDCNT] : Hydrogen bond donors for a compound, integer.
InactiveAid [IA] : Inactive BioAssay identifier, integer. This choice is intended for users querying for inactive substances for a particular assay. Bioassay ID (AID) should be provided.
InactiveAidCount [IC, ICNT] : Using this filter, users can query for substances that are inactive in a certain number of assays.
InChI [inchi] : IUPAC International Chemical Identifier.


InChI string can be searched through the Entrez PubChem databases. e.g.

To search with the InChI string of aspirin: "InChI=1/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)/f/h11H":

Type or paste "InChI=1/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)/f/h11H"[InChI] into the PubChem Compound, or PubChem Substance, or the Entrez Global search box, then click Go button.

Note:
     The quote marks and the square brackets are required.
     'InChI=' is required.
IsotopeAtomCount [IAC, IACNT] : Isotope atom numbers in a compound.
IUPACName [UPAC, IUPAC] : Standard IUPAC name for compound.
MeSHDescription [MHD]
MeSHTerm [MSHT, MESHT] : Medical Subject Heading term. Note that MeSH entry terms (synonyms for the Medical Subject Heading term) are also indexed.
MeSHTreeNode [MSHN, MESHTN] : Medical Subject Heading tree node (tree structures). Searching for a term corresponding to a higher level node in the MeSH hierarchy will find records matching any MeSH terms beneath that node. For example, "Penicillins[MeSHTreeNode]" will find records linked to MeSH terms "Oxacillin", "Cloxacillin", etc. Note that entry terms (synonyms) for MeSH tree nodes are also indexed.
ModifiedDate : Date this substance record was last modified.
MolecularWeight [MW, MWT, MOLWT] : Mass of a molecule calculated using the average mass of each element weighted for its natural isotopic abundance. E.g., Carbon has two natural isotopes 12 and 13 with relative abundances of 98.9% and 1.1% to yield an average mass of 12.011 g/mol. A real number.
MonoisotopicMass [MMAS, MIMASS] : Mass of a molecule calculated using the mass of the most abundant isotope of each element. E.g., Carbon has a monoisotopic mass of 12.000 g/mol. A real number.
PharmAction [PHMA, PHARMA] : MeSH pharmacological actions.
RotatableBondCount [RBC, RBCNT] : Count of rotatable bonds
SourceCategory [SRCC, SRCCAT, SRCCATG] : Depositor categories. For more information and possible categories look here
SourceID [SRID, SRCID] : Depositor's external id.
SourceName [SRC, SRCNAM, SRCNAME] : Depositor name officially recorded in PubChem databases. For current data sources look here
SourceReleaseDate [SRD, SRDAT, RLSDAT]
StandardizedCID [SCID] : Standardized compound identifier, integer.
SubstanceID [SID] : Substance ID. Same as [UID].
Synonym [SYNO] : Synonyms for substance.
TautomerCount [TC, TCNT, TTMC] : Possible tautomer count for each given structure, no more than 200, integer. 
TotalAidCount [TAC]
TotalFormalCharge [TFC, CHG, CHRG] : Total formal charge.
TPSA [TPSA] : Topological Polar Surface Area.
XLogP [XLGP, LOGP]
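
The difference between MolecularWeight and MonoisotopicMass described in the field list above can be checked with a little arithmetic for carbon. In the sketch below the abundances are the rounded values quoted in the text (98.9% and 1.1%), while the isotopic masses are standard values added here for illustration; the function names are hypothetical.

```python
# (isotopic mass in g/mol, relative abundance) for the two natural carbon isotopes.
CARBON_ISOTOPES = [
    (12.000000, 0.989),  # carbon-12
    (13.003355, 0.011),  # carbon-13 (mass value is an added assumption)
]


def average_mass(isotopes) -> float:
    """Average mass: each isotopic mass weighted by its natural abundance."""
    return sum(mass * abundance for mass, abundance in isotopes)


def monoisotopic_mass(isotopes) -> float:
    """Monoisotopic mass: the mass of the most abundant isotope."""
    return max(isotopes, key=lambda pair: pair[1])[0]


print(round(average_mass(CARBON_ISOTOPES), 3))  # 12.011
print(monoisotopic_mass(CARBON_ISOTOPES))       # 12.0
```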
 
PCBioAssay:

All [ALL] : All of the following fields are searched. If a string query is presented without a field alias, by default, [ALL] is searched.
Uid [UID] : The integer represents AID for PCAssay database. By default, an integer without a field alias is recognized as a UID.
Filter [Filter] : Limits the records. A number of filters are available, to retrieve records in the same or other databases that the current bioassay records are cross-referenced to.
ActiveCidCount [ACC, ACCNT] : Number of unique chemicals (identified by cid:compound identifiers from PCCompound) that are considered as active in a bioassay.
ActiveSidCount [AC, ACNT] : Number of substances (identified by sid:substance identifier from PCSubstance) that are considered as active in a bioassay.
AssayComment [AssayComment] : Comment for the bioassay.
AssayDescription [ADES, ADESC, ADSC] : Description for the bioassay.
AssayName [ANAM, ANAME] : Name of a bioassay provided by depositor.
AssayProtocol [APRL, APRTL] : Protocol for a bioassay provided by depositor.
AssayComment [ACMT, ACMMNT] : Comment for a bioassay provided by depositor.
CompoundIDActive [CIDA] : CID(compound identifier from PCCompound) of a chemical that is considered as active in a bioassay.
CompoundIDTested [CIDT] : CID(compound identifier from PCCompound) of a chemical that is tested in a bioassay.
InactiveCidCount [IACC, IACCNT] : Number of unique chemicals (identified by cid--compound identifiers from PCCompound) that are considered as inactive in a bioassay.
InactiveSidCount [IAC, IACNT] : Number of substances (identified by sid--substance identifier from PCSubstance) that are considered as inactive in a bioassay.
InconclusiveCidCount [ICC, ICCNT] : Number of unique chemicals (identified by cid--compound identifiers from PCCompound) whose outcome in a bioassay is inconclusive, that is, neither clearly active nor clearly inactive.
InconclusiveSidCount [IC, ICNT] : Number of substances (identified by sid--substance identifier from PCSubstance) whose outcome in a bioassay is inconclusive, that is, neither clearly active nor clearly inactive.
MeSHDescriptionActive [MHDA] : Medical Subject Heading descriptions that are associated with an active chemical structure in a bioassay.
MeSHDescriptionTested [MHDT] : Medical Subject Heading descriptions that are associated with any chemical structure tested in a bioassay.
MeSHTermActive [MHTA] : Medical Subject Heading terms that are only associated with an active chemical structure in a bioassay.
MeSHTermTested [MHTT] : Medical Subject Heading terms that are associated with any chemical structure tested in a bioassay.
ModifiedDate [MDAT, MDATE] : Last date when a bioassay data content is modified. Date format is yyyy/mm/dd. mm and dd are optional.
PharmActionActive [PHAA] : MeSH pharmacological actions that are associated with an active chemical structure in an assay.
PharmActionTested [PHAT] : MeSH pharmacological actions that are associated with any chemical structure tested in an assay.
ReadoutCount [RC, RCNT] : Number of total test result fields(readouts) in a bioassay
ReleaseDate [RDAT, RDATE] : Date when a bioassay data is released to public by PubChem. Date format is yyyy/mm/dd. mm and dd are optional.
SourceCategory [SRCC, SRCCAT, SRCCATG] : Category of bioassay data source
SourceName [SNME, SNAME] : Source name of a bioassay data specified by depositor.
SubstanceIDActive [SIDA] : SID(substance identifier from PCSubstance) of a substance that is considered as active in a bioassay.
SubstanceIDTested [SIDT] : SID(substance identifier from PCSubstance) of a substance that is tested in a bioassay.
SynonymActive : MeSH names and synonyms that are associated only with an active chemical structure in a bioassay.
SynonymTested [SYNT] : MeSH names and synonyms that are associated with any chemical structure tested in a bioassay.
TidDescription [TDES, TIDD, TDESC, TDSC] : Description of a test result field in a bioassay
TidName [TNAM, TIDN, TNAME] : Name of a test result field in a bioassay
TotalCidCount [TCC] : Total number of unique chemicals (identified by cid--compound identifiers from PCCompound) that are tested in a bioassay.
TotalSidCount [TSC] : Total number of substances tested in a bioassay.
XRefAid [XRAD] : NCBI Entrez PubChem Bioassays identifier.
XRefAsurl [XRAL] : URL of the bioassay provided by depositor.
XRefComment [XRCT]
XRefDburl [XRDL] : URL of Depositor's organization.
XRefGene : NCBI Entrez Gene identifier referred by a bioassay.
XRefGi [XRGI] : NCBI Entrez Protein/Nucleotide GI number referred by a bioassay.
XRefMmdb [XRMB] : NCBI Entrez Molecular Modeling Database(MMDB) identifiers(MMDB ID) referred by a bioassay.
XRefNucleotidegi [XRefNCGI, XRNI] : NCBI Entrez Nucleotide identifier referred by a bioassay.
XRefOmim [XRMM] : NCBI Entrez OMIM identifier.
XRefPmid [XRPD] : NCBI Entrez Pubmed identifiers(Pubmed ID) referred by a bioassay.
XRefProteingi [XRefPRGI, XRPI] : NCBI Entrez Protein identifier referred by a bioassay.
XRefSburl [XRSL]
XRefTaxonomy [XRTY] : NCBI Entrez Taxonomy identifier.

 
PubChem Batch Processing Utility


PubChem batch processing utility allows power users to process PubChem data in batch mode. Please follow the procedures provided to perform your job. Note: Please limit your batch to 50,000 records as this is the limit for the PubChem download facility.

Utility Type:
  • Please choose one utility that you want to process.
Hints to generate an ID-Map file:
  • Do any search at the Entrez PubChem Substance database, e.g. searching term "Cu[el]"
  • At the Display dropdown list, choose "ID Map"
  • At the "Send to" dropdown list, choose "File"
  • Save the file for the PubChem Batch Utility use.
Submit Your Job:
  • Once you finish steps 1 and 2, click the Go button to submit your job.

New Job:

  • To start a new job, click this link.

PubChem FAQ



  1. What is PubChem ?

  2. What is PubChem Substance ?

  3. What is PubChem Compound ?

  4. What does the depositor's category tell users and what are the existing depositor categories ?

  5. Why search PubChem Substance and/or Compound ?

  6. What is PubChem BioAssay ?

  7. How does PubChem assign Substance identifiers ? When a substance is revoked by the depositor, can I still see the old record ?

  8. How does PubChem assign Compound identifiers ? Will the structure represented by a CID ever change ?

  9. How do I process a text search with PubChem databases  ?

  10. How do I perform a structure search ?

  11. How do I save my search result ?

  12. Sometimes I see errors in a substance record. Where should I report them ?

  13. What are exact mass and monoisotopic mass for a substance/compound ?


  14. How do I find the InChI version and parameters ?

Q: What is PubChem ?

A: PubChem is a component of NIH's Molecular Libraries Roadmap Initiative. It provides information on the biological activities of small molecules. PubChem is organized as three linked databases within the NCBI's Entrez information retrieval system. These are PubChem Substance, PubChem Compound, and PubChem BioAssay. PubChem also provides a fast chemical similarity search tool.

Q: What is PubChem Substance ?

A: PubChem Substance records contain substance information electronically submitted to PubChem by depositors. This includes any chemical structure information submitted, as well as chemical names, comments, and links to the depositor's web site.

Q: What is PubChem Compound ?

A: PubChem compound records comprise a non-redundant set of standardized and validated chemical structures. A compound record may link to more than one PubChem Substance record, if different depositors supplied the same structure. Chemical names shown in PubChem Compound records are a composite derived from all linked substances, with default ranking of names by weighted frequency of use.

Q: What does the depositor's category tell users and what are the existing depositor categories ?

A: The depositor categories indicate the type of information one may expect to find when following the depositor substance URL, or the type of information provided by the depositor. Possible categories include the following:

Category : Meaning
Biological Properties : Depositor provides information about the biological properties of a substance or compound
Chemical Reactions : Depositor provides information about the reactivity, synthesis, or known reactions of a substance or compound
Imaging Agents : Depositor provides information about the contrast agent or imaging agent used in, for example, MRI
Journal Publishers : Depositor is a journal publisher and has articles published about a substance or compound
Metabolic Pathways : Depositor provides information on the metabolic pathways involving a substance or compound
Molecular Libraries Screening Center Network : Depositor is part of the NIH Molecular Libraries Screening Center Network (MLSCN)
NIH Substance Repository : Depositor is an NIH Molecular Libraries Small Molecule Repository servicing the MLSCN
Physical Properties : Depositor provides information about the experimental physical properties of a substance or compound
Protein 3D Structures : Depositor provides information about the experimental 3-D structure of a substance or compound
Substance Vendors : Depositor is a seller of a substance or compound
Theoretical Properties : Depositor provides information about the theoretical properties of a substance or compound
Toxicology : Depositor provides information about the toxicological properties of a substance or compound

Q: Why search PubChem Substance and/or Compound ?

A: It is useful to search PubChem's Substance database when one is looking for information from a particular depositor exclusively, and/or when one is looking for information on substances such as natural product extracts which may not have associated chemical structure information. These special cases aside, it is generally most useful to search for chemical names or structures in PubChem's Compound database. This provides a concise view, combining information derived from multiple Substance records that specified the same structure. PubChem's structure search service operates on PubChem's Compound database exclusively.

Q: What is PubChem BioAssay ?

A: The PubChem BioAssay Database contains bioactivity screens of chemical substances described in PubChem Substance.  It provides searchable descriptions of each bioassay, including descriptions of the conditions and readouts specific to that screening procedure.

Q: How does PubChem assign Substance identifiers ? When a substance is revoked by the depositor, can I still see the old record ?

A: A PubChem Substance SID is assigned to each unique external registry ID provided by a PubChem data source. A depositor may "revoke" (or otherwise deprecate) a PubChem SID at any time for any reason. However, the link to the "revoked" PubChem SID lives on in perpetuity. A message will state that the depositor deprecated the SID, but the link to the archived information will still be available. In addition, the PubChem CIDs that the old version of a PubChem SID pointed to at the time it was versioned or deprecated will also remain available.

Q: How does PubChem assign Compound identifiers ? Will the structure represented by a CID ever change ?

A: A PubChem Compound CID is assigned to each unique chemical structure. It is possible for different tautomeric forms of the same compound to have different CIDs. The chemical structure represented by a CID is permanent. The URL links to the compound summaries are stable (always live), regardless of whether any substance points to them.

Q: How do I process a text search with PubChem databases ?

A: PubChem's Substance, Compound, and BioAssay databases are fully integrated within NCBI's Entrez data retrieval system. You can process any name, keyword, or ID search through the Entrez system. The PubChem homepage also provides a search box. For a specific database query, see related content in the help document above.
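
For scripted text searches, one option is to build an Entrez query URL programmatically. The sketch below is not taken from this help document: it assumes NCBI's E-utilities ESearch endpoint and the Entrez database names `pcsubstance`, `pccompound`, and `pcassay`, and `esearch_url` is a hypothetical helper. It only constructs the URL and does not send a request.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities ESearch (not named in this help document).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Entrez database names assumed for the three PubChem databases.
PUBCHEM_DBS = {
    "substance": "pcsubstance",
    "compound": "pccompound",
    "bioassay": "pcassay",
}


def esearch_url(database, term, retmax=20):
    """Build (but do not send) an ESearch URL for one PubChem database."""
    params = {"db": PUBCHEM_DBS[database], "term": term, "retmax": retmax}
    return f"{ESEARCH}?{urlencode(params)}"


# The element search used earlier in this document as an example term.
print(esearch_url("substance", "Cu[el]"))
```

The square brackets in the term are percent-encoded by `urlencode`, so the printed URL contains `term=Cu%5Bel%5D`.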

Q: How do I perform a structure search ?

A: You can perform a structure search through the PubChem structure database. PubChem provides two search interfaces, basic structure search and advanced structure search. For more information, visit structure search help.

Q: How do I save my search result ?

A: To save your search from Entrez, you can use either the PubChem download facility or the Entrez generic search-save tool. You can get more help by clicking the two links above.

Q: Sometimes I see errors in a substance record; where should I report them ?

A: PubChem doesn't have curators and never changes or edits substance records. They remain as supplied by our depositors, just as with GenBank records. You can follow the PubChem substance summary page to the original record on the depositor's page and report the error there. Once the depositor corrects the error, PubChem will pick up the correction at the next update. Compound property/descriptor errors can be reported to the NCBI help desk.

Q: What are exact mass and monoisotopic mass for a substance/compound ?

A: The exact mass is the mass of the most likely isotopic composition for a single random molecule, corresponding to the most intense ion/molecule peak in a mass spectrum. The monoisotopic mass is the mass of a molecule calculated using the mass of the most abundant isotope of each element.
For example, carbon has a monoisotopic mass of 12.000 g/mol.
Exact mass and monoisotopic mass are the same for more than 90% of structures but differ when atom counts are such that presence of one or more lower abundance isotopes is most probable.

For example, carbon tetrachloride (CCl4, PubChem CID 5943) has an exact mass of 153.872; here the most probable molecule is made of three 35Cl, one 37Cl, and one 12C. The monoisotopic mass of carbon tetrachloride is 151.875; here all chlorine atoms are assumed to be 35Cl (isotope abundance 75.77%) and the carbon atom is assumed to be 12C (isotope abundance 98.9%). The two masses are identical in most cases. They differ for compounds with four or more Cl atoms, two or more Br atoms, or other elements not dominated by a single isotope, and for very large compounds, such as those with about 100 or more carbon atoms: for C100, the exact mass is 12C * 99 + 13C * 1, while the monoisotopic mass is 12C * 100.
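
The carbon tetrachloride figures can be reproduced with a short calculation. This is a simplified sketch, not PubChem's actual procedure: it assumes each element has only two isotopes that are drawn independently (a binomial model), uses the abundances quoted above, and takes the isotope masses from standard tables rather than from this document. The function names are hypothetical.

```python
from math import comb

# element: (light isotope mass, heavy isotope mass, light isotope abundance)
# Masses are standard table values supplied for illustration; the abundances
# are the ones quoted in the text (12C 98.9%, 35Cl 75.77%).
ISOTOPES = {
    "C": (12.000000, 13.003355, 0.989),
    "Cl": (34.968853, 36.965903, 0.7577),
}


def most_probable_heavy_count(n, p_light):
    """Number of heavy isotopes (0..n) with the highest binomial probability."""
    p_heavy = 1.0 - p_light
    return max(
        range(n + 1),
        key=lambda k: comb(n, k) * p_heavy**k * p_light ** (n - k),
    )


def monoisotopic_mass(formula):
    """Mass with every atom set to the most abundant isotope of its element."""
    return sum(ISOTOPES[element][0] * n for element, n in formula.items())


def exact_mass(formula):
    """Mass of the single most probable isotopic composition."""
    total = 0.0
    for element, n in formula.items():
        light, heavy, p_light = ISOTOPES[element]
        k = most_probable_heavy_count(n, p_light)
        total += (n - k) * light + k * heavy
    return total


ccl4 = {"C": 1, "Cl": 4}
print(round(monoisotopic_mass(ccl4), 3))  # 151.875
print(round(exact_mass(ccl4), 3))         # 153.872
```

Under the same model, `most_probable_heavy_count(100, 0.989)` returns 1, which matches the C100 case described above (one 13C and ninety-nine 12C).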

Q: How do I find the InChI version and parameters ?

A: The InChI version and parameters are detailed in the ASN.1 and XML data for each compound. For example, for aspirin, you can find the information you are seeking by viewing the ASN.1 record:

http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=2244&disopt=DisplayASN1

If you scroll down to the InChI property record, you will find:
{
   urn {
      label "InChI",
      datatype string,
      parameters "options {auxnonr donotaddh w0 fixedh recmet newps}",
      implementation "E_INCHI",
      version "1.0.1",
      software "InChI",
      source "nist.gov",
      release "2007.09.04"
   },
   value sval "InChI=1/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)/f/h11H"
},
The InChI version is (currently) 1.0.1. We are using the parameter options "auxnonr donotaddh w0 fixedh recmet newps".
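
If you save the ASN.1 text, the same two values can be pulled out with a few lines of code. The sketch below is plain pattern matching on the record text shown above, not a real ASN.1 parser, and `quoted_fields` is a hypothetical helper name.

```python
import re

# The InChI property record exactly as shown above.
RECORD = '''
{
   urn {
      label "InChI",
      datatype string,
      parameters "options {auxnonr donotaddh w0 fixedh recmet newps}",
      implementation "E_INCHI",
      version "1.0.1",
      software "InChI",
      source "nist.gov",
      release "2007.09.04"
   },
   value sval "InChI=1/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)/f/h11H"
},
'''


def quoted_fields(text):
    """Collect every `name "value"` pair in the record text into a dict."""
    return dict(re.findall(r'(\w+)\s+"([^"]*)"', text))


fields = quoted_fields(RECORD)
version = fields["version"]
# The options are listed inside braces: options {auxnonr donotaddh ...}
options = re.search(r"\{([^}]*)\}", fields["parameters"]).group(1).split()

print(version)
print(options)
```

For this record, `version` is the string `1.0.1` and `options` is the list of six parameter names.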


PubChem Courses

NCBI provides PubChem training courses:

PubChem Documents

Glossary

AID : PubChem's bioassay (protocol) identifier, a non-zero integer.

CID : PubChem's compound identifier, a non-zero integer for a unique chemical structure.

Complexity : The complexity rating of a compound is a rough estimate of how complicated its structure is, seen from the point of view of both the elements contained and the displayed structural features, including symmetry. Neither stereochemistry nor isotope labeling is used as an auxiliary criterion. The value is computed using the Bertz/Hendrickson/Ihlenfeldt formula. A scaling factor for aromaticity is used so that the complexity of benzene is the same as that of cyclohexane. It is a floating point value, ranging from 0 (simple ions) to several thousand (complex natural products). Generally, larger compounds are more complex than smaller ones, but highly symmetrical compounds, or compounds with few distinct atom types or elements, are downgraded. Complexity is only loosely correlated with synthetic accessibility. The most complex compound in PubChem is CID 6338588 (C124H185N9O207S36), with a complexity rating of about 18425. The average complexity of the structures in the PubChem Compound database is about 551.

Comments : Lists all of the depositor's comments and additional information for this substance.

Component : For a mixture substance/compound, a component is one of its individual molecules.

Compound : The chemical structures represented in substances. The chemical structure presented in a compound is standardized through PubChem's data pipeline. A mixture substance may have several standardized compounds. A compound record is structurally unique in the PubChem Compound database.

Computed Descriptors : Information to describe the compound in different formats, including SMILES, InChI, IUPAC names.

Computed Properties : These data are calculated from the compound, including molecular weight, formula, XLogP, etc.

Depositor's Category : The depositor's category tells users that additional category-specific information is available either on the depositor's substance summary page or on the depositor's web site.

Deprecated Compound : A Compound CID which has no links to any substance. This may occur as PubChem modifies processing. A deprecated compound will not be available within Entrez.

HBA : Number of hydrogen bond acceptors in the structure. Classification follows [J. Chem. Inf. Comput. Sci. 1997, 37, 615-621].

HBD : Number of hydrogen bond donors in the structure. Classification follows [J. Chem. Inf. Comput. Sci. 1997, 37, 615-621].

Heavy Atom : All atoms except hydrogen.

InChI : IUPAC International Chemical Identifier. InChI strings can be searched through the Entrez PubChem databases.

Old Version Substance : Substance versions are considered to be "old" when a more recent update is provided by the depositor.

Molecular Weight : The molecular weight is the sum of the atomic weights of all constituent atoms in a compound, measured in g/mol. In the absence of explicit isotope labeling, averaged natural abundance is assumed (which may not match purchasable material, for example in the case of Li and U compounds). If an atom bears an explicit isotope label, 100% isotopic purity is assumed at that location, even for short-lived radioactive isotopes where this is often physically unrealistic. At present, it is not possible to deposit more detailed isotope composition information into the PubChem database. Pseudo-atoms that are not elements have an atomic weight of 0 g/mol.
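
As a worked illustration of this definition, the sketch below sums atomic weights for aspirin (C9H8O4). It is a minimal example under stated assumptions, not PubChem's implementation: the atomic weights are abridged standard values supplied here, the "13C" key is a hypothetical way of writing an explicit isotope label, and "R" is a hypothetical pseudo-atom symbol.

```python
# Abridged standard atomic weights in g/mol (natural abundance averages).
ATOMIC_WEIGHT = {"H": 1.008, "C": 12.011, "O": 15.999}

# An explicit isotope label uses that isotope's own mass (100% purity assumed).
ISOTOPE_MASS = {"13C": 13.003355}

# Hypothetical pseudo-atom symbol; pseudo-atoms contribute 0 g/mol.
PSEUDO_ATOMS = {"R"}


def molecular_weight(counts):
    """Sum atomic weights for a mapping of atom symbol -> atom count."""
    total = 0.0
    for symbol, n in counts.items():
        if symbol in PSEUDO_ATOMS:
            continue  # weight of 0 g/mol
        if symbol in ISOTOPE_MASS:
            weight = ISOTOPE_MASS[symbol]
        elif symbol in ATOMIC_WEIGHT:
            weight = ATOMIC_WEIGHT[symbol]
        else:
            raise KeyError(f"no atomic weight known for {symbol!r}")
        total += n * weight
    return total


aspirin = {"C": 9, "H": 8, "O": 4}
print(round(molecular_weight(aspirin), 3))  # 180.159
```

Replacing one carbon with an explicitly labeled 13C, as in `{"C": 8, "13C": 1, "H": 8, "O": 4}`, raises the result by the difference between 13.003355 and 12.011.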

Revoked BioAssay : When a depositor removes an assay that the depositor previously deposited into PubChem, the assay is considered revoked. A revoked assay will not be available within Entrez.

Revoked Substance : When a depositor removes a substance from their substance collection, the substance is considered revoked. A revoked substance will not be available within Entrez.

SID : PubChem's substance identifier, a non-zero integer for a deposited substance.

SMILES : Simplified Molecular Input Line Entry System, a line notation (a typographical method using printable characters) for entering and representing molecules.
More related information is available from PubChem's document section in PDF or text format.

SMARTS : A language that allows you to specify substructures using rules that are straightforward extensions of SMILES.

Substance : An individual record collected from a depositor, representing a sample used in a bioassay.

Substance Category : Substance categories (one or more) are assigned to each depositor, based on the nature of that depositor's institution and the type of data they supply.

Suppressed Compound : A Compound CID that links only to an old version substance. A suppressed compound will not be available within Entrez.

Synonyms : All names, trivial names, synonyms, frequently used IDs, and other names collected from depositors. On the compound summary page, the synonyms shown are the distinct synonyms from all corresponding substances.

TPSA : Topological Polar Surface Area. This is an estimate of the polar surface area (in Å²). The implementation follows [J. Med. Chem. 2000, 43, 3714-3717]. It is a simple method: only N and O are considered, 3D coordinates are not used, and there are various precomputed factors for different hybridizations, charges, and participation in aromatic systems.

Version : PubChem substance version number is incremented when an update is provided by the depositor.

Xref : The external references/links to PubChem database records.

XLogP : A computed partition coefficient (or distribution coefficient), a measure of the differential solubility of a compound in two solvents.
Since November 2006, PubChem has used version 2 of the algorithm described in [Perspectives in Drug Discovery and Design. 2000, 19, 47-66] to generate the XLogP value.

